UhClem
Members · 282 posts
Everything posted by UhClem
-
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
Thanks very much for providing that info. Your tests, and results, with a real live specimen, have really caught my interest. Given the extra details you've provided, I am also inclined to retract my memory parity error suggestion.

But, I'm going to need to take a look at badblocks' code. My concern is with those "short reads" that you made mention of. (If we're on the same wavelength here) This would be where the entire read() call's byte count request was not fully satisfied, because the (hard, kernel-logged) read error occurred on a sector within (but not at the start of) that request. I'm wondering if, somehow, badblocks is using "too much" data during the subsequent comparison test. Or, and this part really really concerns me, is the drive violating its prime directive (Never return data whose integrity it can't 100.01% vouch for)??

Well, I looked at the code, in routine do_read(), and it does look fine. I'm puzzled and/or worried.

Just thought of something ... I don't have any WD drives. Is this specific drive of yours a 4KB sector model?

Thanks again. (I've got some testing code that handles a similar situation, but haven't had the opportunity to test its analogous error-handling.) Shall we call this "Lab rat by proxy"? --UhClem -
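The "short read" concern in this post can be sketched in a few lines. This is only an illustration, not badblocks' do_read(): the function name `check_read`, its arguments, and the 512-byte sector size are assumptions made for the example. The point is that a read-back comparison should cover only the sectors the read actually returned, and should report the rest as unverified rather than compare stale buffer contents.

```python
SECTOR = 512  # assumed logical sector size in bytes

def check_read(expected: bytes, got: bytes, sector: int = SECTOR):
    """Compare only the sectors a (possibly short) read actually returned.

    Returns (mismatched, unverified): indexes of returned sectors whose
    contents differ from `expected`, and indexes of sectors that the
    read never delivered and so cannot be judged either way.
    """
    total = len(expected) // sector
    returned = min(len(got) // sector, total)  # count whole sectors only
    mismatched = [
        i for i in range(returned)
        if got[i * sector:(i + 1) * sector] != expected[i * sector:(i + 1) * sector]
    ]
    unverified = list(range(returned, total))
    return mismatched, unverified

# Hypothetical example: an 8-sector request that came back 3 sectors long.
pattern = bytes([0xAA]) * (8 * SECTOR)
print(check_read(pattern, pattern[:3 * SECTOR]))  # ([], [3, 4, 5, 6, 7])
```

If the unverified sectors were instead compared against whatever was left in the buffer, they could show up as false data mismatches, which is the "too much data" scenario raised above.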
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
My take is that there are two separate classes of errors instigated by the single procedure (the badblocks run). First, there are the Read-vs-Write mismatches, which I would attribute mainly to lack of ECC memory (more on this below). Second are the Current_Pending_Sector errors (ie, ramp-up of the count). Those are, by definition, in the jurisdiction of the drive itself, most likely (in this scenario) platter surface related. A third potential source of errors, which you didn't receive, would have been CRC errors (from smart), and would have been attributable to cable/connection problems.

The hard drive's prime directive is to guarantee that the 512 bytes it returns at the SATA connector for a READ request is exactly the same as it received at the SATA connector during the most recent (prior) WRITE request, for a specific LBA (sector#). It uses very sophisticated ECC (much hairier than ECC for RAM) along with encoding trickery [see RLL] (and other stuff I don't understand) to accomplish this. Also, that same ECC will tell it when it can't make the guarantee, and then it returns an error, but only after numerous retries (10-20+). Between the drive's SATA connector and the controller/HBA, CRC is employed to assure integrity (or return an error; which should elicit a retry (by the host); which will usually succeed). Hmmm, does the drive/smart only log the CRC errors that it catches (during a WRITE request)?

Thus, those data-mismatch errors reported by badblocks should have no relationship to the Current_Pending_Sector increases. They can be attributed to glitches that occur between the application's user-space memory and the controller (on the way out) and between the controller and the ultimate comparison-check in the application code (on the way back in). Of those 151 errors reported that you mentioned, how many were mismatch errors vs. hard i/o errors? It appears that this corresponded to an increase in the Current_Pending_Sector from 49 to 316.
One thing we don't know is the precise criteria the drive firmware uses for adding, and possibly removing, Current_Pending_Sector status (separate from escalating to Reallocation). Ie, the exact chain of events and conditions which took place for your drive. Questions I have (but can't expect you to have the data to answer [but, surprise me!]) are:
1) Did Current_Pending_Sector (CPS), after each Write-then-ReadAndTest pass, steadily increase, or did the Write clear the vast majority (by merit of the Write succeeding, or, if not, forcing a Reallocation)? (It doesn't appear that you anticipated the potential use for this; I only realized it after I finished my reply.)
2) How many CPS bumps had a corresponding (Uncorrectable)Error returned to host?
2a) Did badblocks actually report any I/O errors?
2b) Did syslog evidence any I/O errors?
3) Were the LBAs for these CPS bumps highly localized (bunched, or consecutive)?

Regarding your surmisal about an abnormal power-down being a cause ... maybe, and that's why I asked about the bunching of LBA#s for the CPS guys. In general, power instability (within the drive) can apparently produce what now appears to be your situation: sector(s) are written weakly/insufficiently, and a subsequent Read has a problem, causing a CPS bump; but the next following Write of that LBA (with proper power) succeeds, and CPS gets decremented and that LBA comes off the watch-list. I imagine there can also be areas of (permanently) marginal surface, such that failure is inconsistent--sometimes, it succeeds within the firmware's retry count, sometimes not.

It could be enlightening if you did have a record of some of these problematic LBAs. Then, a badblocks test over a very limited range would be quick, regardless of how many passes (though you would want to heed your 2XCacheMB concern). And, rather than multi-pass, better would be multiple single-pass runs with logged smart's interspersed.
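The "logged smart's interspersed" idea could be sketched as below: save the attribute table printed by `smartctl -A` before and after each single-pass run, then compare the raw values. The helper names `raw_values` and `changed` are hypothetical, and the two sample tables are illustrative only: the 49 and 316 figures are the ones quoted in this thread, while the other columns are made-up placeholders in the usual smartctl layout.

```python
def raw_values(smart_text):
    """Map attribute name -> raw value from `smartctl -A`-style table text."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values

def changed(before, after,
            names=("Current_Pending_Sector", "Reallocated_Sector_Ct")):
    """Return {name: (old, new)} for the watched attributes that moved."""
    old, new = raw_values(before), raw_values(after)
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}

# Illustrative logs captured before and after one pass (placeholder columns).
BEFORE = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       49
"""
AFTER = BEFORE.replace("       49", "       316")

print(changed(BEFORE, AFTER))  # {'Current_Pending_Sector': (49, 316)}
```

Logging one such comparison per pass would answer question 1) directly: whether the pending count climbs steadily or drops back after each write.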
Another puzzle: with all the read/write testing this drive has endured, shouldn't the "failed LBAs" you mention (end of last post) be represented in a non-zero Reallocated_Sector_Count? From your diff's, it would seem that is still 0. Personally, I'm looking forward to one of my own drives becoming "interesting" (like yours, or some variation). I've got some ideas, and some low-level code written to help test them. I just need my own lab rat. --UhClem -
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
That was pretty funny!!!! bright light. LOL! Glad you liked it. (I've been a 'net jokester since the first days of Usenet [early 80s], and it's nice to hear the audience laugh [on this medium].) Thanks for checking. (This geriatric hacker should have done it himself, but I couldn't reach my walker.) That's reasonable. That is very interesting! Thanks for relating it. [Can I assume that this was all done on your box with the ECC memory, and that the ECC protection was actually engaged?] And thanks for the follow-on post with all the gory details (kudos on good record-keeping). Surgery can be a messy business. I'm not sure I would consider that patient cured; maybe just temporary remission. To be honest, I suspect congenital platter surface cancer. Did you return it to the general population (your array) or send it to the hospice? Maybe a convalescent home for close observation [then RMA (aka DNR)]. --UhClem -
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
Marginal? Sounds interesting. I'm curious ... how is this marginality indicated? Thanks. --UhClem

1. If there is output in the badblocks log.
2. If the number of pending sectors is > 0 or increases.
3. If reallocated sectors increases significantly.
What I've come to learn is any pending sectors are a time bomb for any recover effort in the future. ...

Thanks for the reply. Maybe I took your initial statement too literally. Which suggested that badblocks was detecting marginal sectors. And that is what caught my attention, and prompted my query. [Is it possible that your #1 above is where badblocks indicates my notion of sector marginality? I've never run it, but a careful read of the man page did not indicate any such reporting.]

In my opinion, a Current Pending Sector is pretty much on its death bed, but no death certificate has been written (Reallocated). It just could survive (and then it will tell all its neighbors (the entire cylinder) about the bright white light it saw). I was expecting badblocks to warn about sectors that were actually marginal--you know, like high cholesterol and diabetes, maybe even early cancer.

"I'll let you be in my dream if I can be in yours." -- Bob Dylan said that.
"Just because a sector isn't bad doesn't mean it's good." -- I said that.
-- UhClem -
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
Marginal? Sounds interesting. I'm curious ... how is this marginality indicated? Thanks. --UhClem -
<SOLVED> Syslog Error on ata19, which is that drive?
UhClem replied to 642's topic in General Support (V5 and Older)
Also note that both those drives are throwing errors at the same (exact) time. That seems unusual; maybe there's a clue for you ... cabling? power? --UhClem -
Alternatively, simple and effective, just the switch(es): Set of 3 remotely-controlled AC (3-prong, 10A) switches, controlled by a single (3-button) remote. $17 Link: http://www.amazon.com/gp/product/B001AS4NQS Also available at geeks.com ($13 [but maybe higher shipping]) Link: http://www.geeks.com/details.asp?invtid=BH9936-3&cpc=SCH I bought 2 sets at Geeks 5+ yrs ago; working fine with projector, upstairs fan, upstairs lamp, etc. Note: the two sets do not interfere w/each other (different codes). Remote range OK -- 25+ feet incl. ceiling and a wall. The (3) remote buttons are toggles, but each switch has discrete ON and OFF buttons (typically not needed, but in some applications very welcome). --UhClem
-
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
Murphy was just jerking you around. ... He'll go f* with someone else now. --UhClem -
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
Good ... Another thing (I hope you see this) ... While your preclear is running, also do a (test) write to the array and have a "tail -f syslog" running. Look for any anomalies/errors. That write speed (~3 MB/s) is so atrocious that I suspect there will be some glaring clues from the kernel. --UhClem -
Question about running multiple pre-clear passes
UhClem replied to Queball's topic in Storage Devices and Controllers
2 questions: 1. Is there an array drive on the other port on the Masscool (the one not handling the preclear)? You clarified this in the "Re: preclear" thread. 2. How is the read speed of your array during this? Both, in general, and, if answer to #1 is yes, the read speed of that drive. --UhClem -
5TB drives due from Seagate in 3 months?
UhClem replied to neilt0's topic in Storage Devices and Controllers
Yes, in a sense. -
5TB drives due from Seagate in 3 months?
UhClem replied to neilt0's topic in Storage Devices and Controllers
You guys are so greedy--all you need is 2.1991 TB support. Beyond that is effortless, and will be the end of this saga. At least till after a lot of us are "eating dirt" (and there's a replacement for the current (S)ATA spec [48-bit LBA]). The present hurdle is that LBA (sector) addresses have been stored (and manipulated) [within drivers, filesystems, BIOS, or whatever] as 32-bit (long unsigned) integers. That is where the 2TB+ difficulty arises. (Some of you already know this, but) I thought I'd provide an explanation for all this agony. [E.g. 2^32 (max # sectors) X 2^9 (512 byte/LBA[sector]) = 2^41 bytes = 2.19902TB] Once everything is cleaned up (and stable) to use 64-bit entities, this headache just fades away (and becomes just another entry in geek-history). PS The 4KB sector vs 512-byte sector thing does not enter into this discussion. Just trust me on that one. --UhClem "Doing base 8 arithmetic is easy! (if you're missing two fingers)" -T.L. -
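The arithmetic in this post can be checked with a short sketch. The function names are hypothetical, 512-byte sectors are assumed as in the post, and the 3 TB example uses a round 3 x 10^12 bytes rather than any real drive's exact sector count. The wraparound function only illustrates what truncating a sector count to 32 bits does; it does not model any particular driver or BIOS.

```python
SECTOR_BYTES = 2 ** 9  # 512 bytes per LBA, as in the post

def max_capacity_bytes(lba_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable with `lba_bits`-wide sector numbers."""
    return (2 ** lba_bits) * sector_bytes

def capacity_seen_with_32bit_lba(true_bytes, sector_bytes=SECTOR_BYTES):
    """Capacity left if the sector count is truncated to 32 bits."""
    sectors = true_bytes // sector_bytes
    return (sectors & 0xFFFFFFFF) * sector_bytes

limit_32 = max_capacity_bytes(32)  # 2**41 bytes
limit_48 = max_capacity_bytes(48)  # 2**57 bytes (the 48-bit LBA ceiling)

print(limit_32, round(limit_32 / 1e12, 5))       # 2199023255552 2.19902
print(capacity_seen_with_32bit_lba(3 * 10**12))  # 800976744448
```

Under these assumptions, a nominal 3 TB drive whose sector count wraps at 32 bits appears to be roughly 0.8 TB, which is the kind of symptom that 64-bit sector arithmetic makes disappear.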
[SOLVED] Problem using Jumbo Frames
UhClem replied to One2go's topic in General Support (V5 and Older)
Good suggestion. Maybe mod that last bit to "PCI is fine for (a well-engineered) gig-e". I got its predecessor (PWLA8390MT) [from Monoprice 5 yrs ago ($22)] and was curious what the difference might be. From this discussion (also 5 yrs ago), it doesn't seem much. It's an informative discussion, but moderately technical. This newer card (dgaschk's one) can also be had from Amazon for $1 more, but FS (saves $1.xx) [and, you don't get any sleaze on you]. -- UhClem -
and, earlier

You are a wise man. [As tempted as you might have been ...] Many of us could take a lesson (self included). As Dirty Harry said, "A man has to know his limitations."
=======
For anyone who likes to tinker, that SS4200 is a neat little server box. Like a mini-Cooper, but made by Mercedes (~size of a briefcase). 1.6GHz Celeron, 512MB ram (upgrade to 2GB), 4 internal SATA2 "bays", 1xIDE, 4xUSB2.0, *plus* 2x eSATA ports [controlled by on-board SiI3132] *and* an Intel 1GbE (no RealTek headache).
=======
[but, back to a much more serious matter (George's bits!) ...] --UhClem
-
Another possible option would be an Intel SS4200-E(HW). It has been discontinued, but there might still be some surplus units, and also used sales. Newegg closed out its supply last November for $130 each (I bought my 2nd & 3rd). But, I wouldn't call this unit "cheap"; it is a solid, top-quality (almost industrial class) Intel engineering/design. I'm a terrible procrastinator, and haven't installed unRAID on it yet, but others have (search the forum). One caveat: it doesn't (easily) accommodate a video card. You would access/manage it via network (telnet, etc). --UhClem
-
If you have entire albums (or entire concerts) as a single file (eg, .FLAC) and you have a corresponding .CUE (cuesheet) file, a recent firmware for Dune will supposedly (I don't own one) provide gapless playback, while still giving access to individual tracks (via the .CUE). Nevertheless, Dune (like almost all media[sic] players) is very antagonistic toward making music listening very user-friendly. --UhClem
-
OK. It looks like this is a real "who-done-it", and we've just ruled out the handyman. Now, let's try to rule out an "inside job", in general. Disconnect the unRAID server's ethernet connection. Is the culprit still active? I'm suspecting not (but this is how us "detectives" do our job). [re-establish the enet connection] If it is an "outside job", let's see what part of town he's from. Individually (one by one), disconnect each client from your network and test for suspicious activity. Report back, and we'll seek an indictment. If, however, it is an inside job, we can call in CSI. [While doing your sleuthing, try to get an exact timing of this activity--both the interval from one start to the next start, and the duration, in seconds (that "one or two minutes" won't hold up in court).] -- UhClem "Nick Danger, third eye"
-
Note, in the snippet from syslog you posted above, that error is being "instigated" by a system call from within smartctl; and the trace leads to a [SAS] controller-specific (sas_ioctl) routine. [It is likely that your error from the actual copy is similar, but I can't figure where smartctl and a file-copy operation would be doing the same ioctl().] Regardless, focusing back on the posted traceback, it would be interesting to know what NTFS-specific goings-on would cause something so "low-level" as smartctl to provoke that fault. That is really something that Tom, or the driver's code maintainer, should investigate. Glad you were able to isolate the issue, and find a (begrudgingly) acceptable workaround. -- UhClem
-
Startech PEXSATA221 and troubles generating parity
UhClem replied to Golfonauta's topic in Storage Devices and Controllers
and, then, in a follow-up, ... Forgive this old geezer's observation, but wouldn't the OP be better to start his new unraid server off with v4.7, and, once that appears stable, give v5.0-b12a a try. Sounds a little creepy: "beta" hardware with beta software. That's like "nested, circular, finger-pointing" waiting to happen. -- UhClem -
Regarding your experience of read underperformance on a data drive while doing intensive low-level I/O (the preclear) on an out-of-array drive: it might sound reasonable to you, but not to me. I do agree that it's not a big deal to relocate (or reschedule) an infrequent "maintenance operation". However, this could be "a sign" that something is either misconfigured, used inefficiently, or has some other "room for improvement". Do you recall what type of SATA port your preclear drive was on? And, what types of ports are your data drives on? (I don't expect you to know which drive that particular movie was on:)) What you describe makes sense if there is head contention, but that's not the case here. -- UhClem "I think we're all bozos on this bus."
-
[SOLVED] Added parity drive but nothing happening
UhClem replied to eldustino's topic in General Support (V5 and Older)
No problemo. Especially compared to the amount of your time, and aggravation, if you had: -- UhClem "I think we're all bozos on this bus." -
4 Bay eSATA Port Multiplier JBOD tower - $85 shipped after promo
UhClem replied to Rajahal's topic in Good Deals!
This is a decent unit, at a decent price. (I got one back in May for $10 less). [The "NC" at end of model# indicates No Card.] I would seriously modify the above to: If your server is full of drives but you have a spare eSATA port, this item might allow you to add an additional 4 data drives. Might be plug-n-play on some systems. Might. Make very sure. Do not rely on vendor specifications (ie, "Supports port multipliers"). Do not assume that a mobo_sata/controller that works with one particular port-multiplier chip [this unit uses a SiI3726] will actually work with another. There are even OS/driver dependencies. Note: I find no fault with Rajahal's comments. My intent is to clarify and emphasize. Do your research. [Also, port-multipliers usually result in reduced throughput. YMWV] -- UhClem -
[SOLVED] Added parity drive but nothing happening
UhClem replied to eldustino's topic in General Support (V5 and Older)
You're using a 3TB drive on a version of the kernel that does not (fully) support it. -- UhClem -
[SOLVED] new user. errors during first parity sync.
UhClem replied to X1pheR's topic in General Support (V5 and Older)
it appeared that it is a segmentation fault. I had the parameters for the command wrong. ... [I came across this post while searching for something different, but, since I'm here ...] I'll bet the Author and/or Maintainer of smartctl would appreciate receiving a "bug report" (this is more like a "buglet" [cf: piglet]), with the exact command-line that will reproduce the crash/segfault. "One small step for man ..." -- UhClem