Everything posted by UhClem

  1. Yes--providing that the 32-bit libraries+compatibility package is installed. Since LimeTech has already mentioned this, I'm sure it won't be omitted from the 64-bit distribution.
  2. I feel it is important to include this preface: unRAID users should note that what we are discussing here, while definitely possible, is pretty unlikely. It is not a reason to distrust unRAID, or even your hardware system. But it should motivate you to be sure that your important data is properly, and rigorously, backed up.

     "I do not share that confidence in the disks themselves, as the reason for the post-preclear-read phase in my preclear_disk script is specifically because users have found disks that randomly returned different data for the same stored data block and with no other errors to indicate anything was amiss. These disks have been found multiple times by various users."

     Yes, there will always be the possibility of defective hardware, and it is prudent to use tools/procedures (like preclear) to help detect such hardware early. I suspect that some of those reports were actually provoked by RAM (parity) faults (somewhere in the test chain), but those reports could be isolated (and "re-categorized") by further focused testing.

     "The remaining cases provide an excellent basis, even when the server uses ECC memory, for making an effort to verify the successful completion of all modifications to one's array, and to verify all reports of errors in one's array before taking corrective [and irreversible] action."

     I agree with this 100%. "But, wait, there's more!" [I don't mean to be alarmist here, but it is important to "confront your demons", right?] Isn't it very likely that only about half of such occurrences have become visible? That same "bit mangling" is just as likely to have occurred on the way TO the disk, resulting in a persistent/consistent [array] parity error. I suspect that this particular scenario is the cause of most of the "apparently" true examples of bit rot--but, again, only a rigorous verification of all array modifications would allow one to distinguish between this apparent bit rot and real, bona fide (my holy grail of) bit rot. That would be where, for a particular sector, the data bits and the ECC bits morphed just so, such that sector contents different from what was originally stored and verified were returned (on a subsequent Read) successfully (i.e., no ECC error), and repeated Reads of that sector (successfully) return that same new contents (thus distinguishing it from one of the bit-mangling scenarios [FROM the drive] like JoeL described). Statistically, I'd say that one ranks right up there with getting struck by lightning--except that you definitely know it when lightning strikes you .

     I would expect that a quickie version of it is performed at each (drive) power-on initialization. But, since it needs to be a quick test, only the flagrant flakes will be detected.

     --UhClem "Welcome to the Future Fair--a fair for all, and no fair to anybody."
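     As a simple example of the kind of verification I mean: after copying something important onto the array, you can force the re-read to come from the platters rather than from RAM. A minimal sketch--the file paths are just placeholders, and dropping the caches is a crude (but effective) way to defeat the buffer cache:

        md5sum /source/bigfile                # checksum of what you sent to the array
        sync                                  # make sure it has actually been written out
        echo 3 > /proc/sys/vm/drop_caches     # discard cached copies, so the next read hits the disk (needs root)
        md5sum /mnt/disk1/bigfile             # re-read from the array disk; the sums should match

     If the two sums ever disagree, you have caught a mangling "in the act", while the original is still at hand.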
  3. Just to back up a little ... I do believe the core unRAID mechanism for parity protection is solid (a modified/relaxed version of Linux's md RAID-4) [even more so now that the possible, but highly unlikely, "race condition" glitch was fixed in the 4.7-5.0 transition]. I also have a very high level of trust in modern disk drives--not that they won't fail, but that they can be relied upon to not return a sector, from a Read, whose contents differ from the sector previously stored, during a Write (for a given LBA). It might return an error condition--i.e., No_Data (which is much!! better than Erroneous_Data [and NO error condition]).

     The real issues arise just before and after (and even during) the RAID parity calculation, and the primary culprit is non-ECC memory. I believe that most unRAID users (as well as most PC users in general) use non-ECC memory. While memory parity errors are not commonplace, they are definitely not rare either. (Getting struck by lightning is rare.) It is these occasional memory glitches (bit-flips) that are the likely cause of a variety of (disk) data integrity problems. They can manifest [on disk] as bad data (but good parity), or good data (but bad parity), or bad data (and bad parity). [Using ECC memory does not eliminate all such risks, but is probably 10x-100x better than using non-ECC.] Regardless, for many users, because of motherboard and/or CPU, ECC is not even an option.

     But that is where software-based efforts can reduce the resulting (disk) data errors, without resorting to all the complexities, and overhead, of something like ZFS. [I forget which thread here, but I briefly sketched out the idea for a "limited/targeted parity check" which was intended to follow a cache-drive=>unRAID-array session.] There are other possible enhancements too. But the goal is to catch any data integrity errors as early as possible; thus, they can be easily resolved, and they are not lying in wait to either cause errors (possibly silently) during a full drive restore, or to cause collateral damage during a correcting parity check, whether explicit (user-initiated) or implicit (event-initiated [i.e., the subject of this thread]).

     Oh yeah, speaking of this thread ... my quibble with doing a corrective parity check following the reiserfsck's of a crash recovery procedure is this: while you definitely need to re-generate parity corresponding to the sectors/stripes modified by reiserfsck, those should be the only sectors/stripes which get correctively parity-checked. (Again, a "targeted parity check".) This way, it finishes in minutes, not hours--and we've basically eliminated any chance of collateral damage. An option should be added to reiserfsck to generate a list of LBAs which were written to effect the repair (and, of course, unRAID would use that). Those lists (one per repaired disk) would comprise the target for the "targeted [corrective] parity check".

     And, yes, RobJ, I do believe that unRAID should include a tool, née "wholiveshere", which would take a list of LBAs (or Stripe#s) and, for each such arg, generate the "usage" of that address on each data disk. The need for it would be much less if error conditions were caught "in the act", but it would still come in handy in extreme cases.
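     To make the "targeted parity check" a bit more concrete, here is a rough sketch of checking one stripe, given a single LBA from such a list. Purely illustrative: the device names are made up, it glosses over the data-disk partition offsets a real tool would have to account for, and xor_sectors stands in for a hypothetical little helper (a dozen lines of C) that XORs equal-sized blocks together:

        LBA=123456789                              # one entry from the repair tool's list of written sectors
        DATA_DISKS="/dev/sdb /dev/sdc /dev/sdd"    # example data disks
        PARITY=/dev/sde                            # example parity disk

        for d in $DATA_DISKS; do
            dd if=$d bs=512 skip=$LBA count=1 2>/dev/null
        done | xor_sectors 512 > /tmp/computed     # hypothetical helper: fold the concatenated sectors together with XOR
        dd if=$PARITY bs=512 skip=$LBA count=1 2>/dev/null | cmp -s - /tmp/computed \
            && echo "stripe $LBA: parity OK" \
            || echo "stripe $LBA: MISMATCH -- re-generate parity for this stripe"

     Run that over each LBA in the repair lists and you have touched only the stripes that could possibly have been disturbed.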
  4. That makes sense--superficially. But it may prove unwise. The above assumes that the array was in perfect shape prior to the "crash" (i.e., if a non-correcting parity check had been performed then, it would have passed w/ 0 errs). But, as many of you know from running your periodic parity check and finding an error or two, that can't be assumed. I don't use unRAID (and, if PCRx's description is accurate, this adds another reason), but I think a correcting parity check is a BAD idea under any condition, most especially one where the user had no say in the matter. Remember, everything is fine--until it isn't. --UhClem
  5. Have you tried plugging the SansDigital into the MicroServer's eSATA port? [surprise!! ] Seems a pity to put a PCIe v1 card (possibly further constrained by a PCI dependency) into a PCIe v2 slot, though I empathize with the attraction for that card's connectivity [2 SATA + 2 eSATA]. Yes, they are. The MicroServer's 6 port SATA subsystem (via the SB820M SouthBridge) is pretty decent. You're still ~100 MB/s shy of saturation [(2x170)+(2x130) = 600]. I measure about a 675-700 MB/s ceiling (on a N40L). Envying your 50% faster CPU ...
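     For anyone who wants to measure their own controller ceiling, one quick-and-dirty way (not necessarily how I did it) is to read from all the drives at once and add up the rates--the device names below are just examples, and it's read-only, so nothing gets written:

        # read ~2GB from the outer zone of each drive simultaneously
        for d in sdb sdc sdd sde; do
            dd if=/dev/$d of=/dev/null bs=1M count=2048 iflag=direct &
        done
        wait
        # each dd prints its own MB/s as it finishes; the sum is your aggregate throughput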
  6. You are always going to see "burst-like" behavior on the network if the sink is slower than the source (which is the case most of the time in unRAID). [Just from my interpretation of the users' descriptions of their "observations" ...] It appears as though the "sink" is [easily] as fast as the "source" until you reach the hairball (otherwise known as the point where your system buffers (write-behind cache) have all been filled), and then your "sink" speed is reduced to the reality of your actual disk subsystem's write performance.

     By timing the upload side, this is being partially masked--because after the upload appears to have finished (and reported its xfer rates accordingly), the destination side (unRAID) still has to complete the output (draining) of the write-behind cache to the disk subsystem. It might be better if users performed the timing on the unRAID side (i.e., as a download) AND were sure to include a sync at the end (to include the buffer-draining):

        time ( transfer WinPC:10GB_file unRAID_box ; sync )

     (substitute your choice for transfer method) and then use the "real" time to calculate the xfer rate.

     [Is that highmem toggle just making more RAM available to system buffers? (Which is still good--"RAM is a terrible thing to waste.")] Speaking of 64-bit kernels ...
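     A concrete version of that timing, assuming the transfer is done with rsync over ssh and that /mnt/disk1 is the destination (both are just placeholders--use whatever tool and path you actually use):

        # run this on the unRAID box; the sync makes sure the buffer-cache drain is counted
        time ( rsync WinPC:/path/to/10GB_file /mnt/disk1/ ; sync )
        # then: true MB/s = file size in MB / the "real" seconds reported by time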
  7. [In the "Unraid 64bit" thread, replying to the question: "Why do you say this, what would have to change?"] Why do >>you<< say this? (IF I was an unRAID user, I'd be worried. )
  8. OK; looks like the ECC_Recovered is not an indicative factor here. (The 2nd drive had ~46 million during its pre-clear; but 142M previously.) Seems like that first drive led a "sheltered life" (idle?) during its first 34000 hours.
  9. I think there is something strange with that drive, but it's not clear just what. Be aware that modern hard drives are not perfect; but that is factored into their design. ECC is used to handle a lot of that imperfection. Hence, it is normal for a drive to report ECC_Recovered values that grow >>at a reasonable pace<<. But your drive appears to be "shouting" Lookie-Here to me. Note from your pre-PreClear SMART report:

          9 Power_On_Hours          0x0032  093  093  000  Old_age  Always  -  34172
        ...
        195 Hardware_ECC_Recovered  0x001a  100  100  000  Old_age  Always  -  9010

     and compare to the post-PreClear SMART:

          9 Power_On_Hours          0x0032  093  093  000  Old_age  Always  -  34206
        ...
        195 Hardware_ECC_Recovered  0x001a  100  100  000  Old_age  Always  -  27170949

     That's 27+ million such recoveries during the, admittedly intensive, 34 hours of PreClear. On its own, that seems excessive. But, relative to ~9k during the previous 34000 hours, it seems obscene. I'd focus my attentions/suspicions in this direction. Do you have the pre- and post- SMARTs for one of the other (non-lethargic) HD753LJ drives to compare those values? --UhClem
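     If anyone wants an easy way to make this kind of before/after comparison, something like the following works (assuming smartmontools is installed; /dev/sdX is whatever your drive is):

        smartctl -A /dev/sdX > smart_before.txt    # snapshot of the attribute table before the test
        # ... run the preclear (or whatever stress test) ...
        smartctl -A /dev/sdX > smart_after.txt     # snapshot afterward
        diff smart_before.txt smart_after.txt      # any attribute whose raw value jumped will stand out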
  10. Seems like a good plan. I'd be real surprised if HP would make you send in the whole N40L. The 250GB drive is a customized version (maybe just the firmware) by Seagate for HP, so it has to be a "standard" part (and warranty only covers "parts"). If they do provoke you, to just "toss it", let me know. I'm real curious to see just what its malady is; I'll pay for shipping it to me. (It almost sounds like it is stuck in PIO mode [no DMA].) It's been fun (miscues aside). --UhClem
  11. Ah-Hah! Thanks for clearing that up for me. I now see where your inference and my assumption diverged. With this new info/understanding, it's almost a certainty that >>your<< 250GB drive is defective/crippled. If you are still within the 1-year warranty, HP will replace it. You would probably want to be able to produce/document the problem without introducing foreign hardware--ie, just with the HP-N40L-250GBdrive, and by using one of the 4 internal bays/slots so you don't have to mention anything about "hacked BIOS" . PM me if you want some assistance with that. I have two of the N40L/250GB. And, I am solid on Unix, and disk drives, and performance tuning/optimization (Gates, and many others, will vouch for me) so try to believe this: if you had/have a properly working VB0250EAVER drive (this one's model#), it will make a dandy cache drive (as long as 250GB is sufficient for you). Even moreso in your set-up, for the reasons you've stated, regarding the DoubleTwin and cooling/clearances. Plus, you will then have that extra/large WD Green available, for data (or a spare), here (N40L) or elsewhere. I just ran a more complete test on one of my 250s, and it has a max sustained xfer rate of 113 MB/s, and a minimum of 77 MB/s (at inner "zone").
  12. But, I'll bet you did not do the test on the EADS while it was connected to sata5 or sata6. The whole point of this exercise was, as a result of your skepticism/concern wrt the BIOS update, to test whether you 1) had successfully done the update, and 2) had correctly established the new settings, so that sata5 & sata6 really did get full SATA/AHCI citizenship (vs. the default bare minimum for an optical drive). Good luck. (I'm done ) PS: that 250GB drive probably makes a better cache drive than a barely "faster" Green, because the 250 has much faster seeks.
  13. Regarding the write performance of your cache drive, and still avoiding how the BIOS labels it, 2 preamble notes:

     1) the 250GB 7200 rpm actually has a slower max transfer rate than your WD Greenies [100 (my measurement) vs 110 (WD spec)]
     2) the 70-90 dropping to 30-40 >>might<< be explained by ineffectiveness of the new BIOS. The initial 70-90 could be the 1Gb/s Enet limit while data is being written into the system's buffer cache, but as the buffer cache is written to the actual drive, the measured/observed speed drops to (a possibly crippled) 30-40.

     Rather than mess around with other data sources, etc., try the following (at the N40L shell prompt):

        dd if=/dev/zero of=Cache/Barfo bs=256k count=2048 oflag=direct

     (where Cache is the mount-point of your cache-drive's root [I don't know what unRAID calls this]). Note the transfer rate and remove Barfo (512MB). That will give you a very good idea of how fast you can put new data on your cache-drive's filesystem. We eliminate the network/data-source component by using a fast, and local, source (/dev/zero); we eliminate the system's buffer cache by using dd's oflag=direct option.
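     If that filesystem write number comes back low, it also helps to get a raw read baseline for the same drive, to separate a port/controller problem from a filesystem one. A minimal sketch--/dev/sdX is whatever device the cache drive shows up as, and it's read-only, so it's harmless:

        # raw sequential read from the start (outer zone) of the drive, bypassing the buffer cache
        dd if=/dev/sdX of=/dev/null bs=256k count=2048 iflag=direct
        # a healthy SATA link should let this 7200rpm drive show ~100 MB/s here;
        # a port running in some crippled mode will show dramatically less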
  14. But, for disaster planning, you need a worst-case mindset. Ie, what about fire? [Yes, basement is better than attic, but ...] I might have the same BIOS, but my old eyes aren't quick enough for that first screenful. It doesn't matter, though--I don't care what they're "called", as long as they behave right. And I do get hot-swap (via trayless rack in optical slot) and ~250MB/s (SSD taped just below). [Ever heard this old saying?-"I don't care what you call me, just don't call me late for dinner." ]
  15. Thanks. FYI, according to WD spec sheets, the 1TB EADS, EARS, and EARX all have the same 110 MB/s max sustained xfer rate. I couldn't find a number for the EACS. As for the N40L (and not performance-related), you might consider backing up all non-replaceable bits (photos, home vids, documents) there also. Since it is so compact and self-contained, it is ideal for a quick one-handed grab-and-go, if catastrophe is imminent (fire, Sandy, N Japan, etc). A foot of steel cable, a 4-inch length of broomstick, and that "security" hole top-right-rear; get it?
  16. I'm curious--is one of your data drives, or the parity drive, some old dog-slow disk? i.e., < 90 MB/sec for a max transfer rate ("4% complete" == outer/fastest zone). Or is unRAID's parity check not able to sustain full drive speed? (Of the slowest drive, of course.) [I don't use unRAID, but I do run an N40L, and I do get full drive speed (120+ MiB/s using 5x 7K2000 2TBs) for the first ~15% of a comparable parity check (w/ different software).]
  17. Good catch!! But damn!--I didn't get around to reading that e-mail till after I ordered (not gonna hassle the Egg for a few shekels, when the fault is mine). Then again, it's really 37.5% off, since it "comes off the top" [$25 - ($25 * .15) - $15 rebate = $6.25 !!]. That's a really Good Deal. NOTE: Mine just arrived. It is a Rocket 620A, which has a Marvell 88SE9120 chip -- vs. the 88SE9128 chip on a 620. Of possible significance is the fact that the 620A board/chip IDs as 1b4b:9120 (vs. 1b4b:9128 on a 620). I don't use unRAID, but if your new board's connected drive(s) don't show up, you might want to try elkay14's enable_ahci script, with a minor edit to add 1b4b9120 to the strings. --UhClem "A penny saved is better than a penny earned." (They tax your earnings! [but cheer up--things may/will get worse.])
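     To confirm which ID your particular card actually presents (before editing anything), a harmless read-only query will do; the grep pattern is just Marvell's PCI vendor ID:

        lspci -nn | grep -i 1b4b
        # a 620A should show [1b4b:9120] in the bracketed vendor:device field; a 620 shows [1b4b:9128]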
  18. Newegg has the Rocket 620 for $10 AR shipped (rebate valid on purchases thru 14Dec2012). [Link] This is a 2-port Sata3, PCIe_v2 -x1 card; uses Marvell 88SE9128 chip. Note: this is NOT a RocketRaid 620 card, but if you're reading this, you [should] know that doesn't matter. In fact, this card (less $$) might be preferable (less BIOS bloat/delay). --UhClem
  19. Are these 2-platter or 3-platter ? [Note: There are both 2- and 3-platter instances of the (2TB) ST2000DM001 drive, with no (?) visible indication to distinguish] I suppose one could do low-level max transfer rate tests to determine. Or, probably, just weigh each drive. Or ?
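     If anyone wants to try the transfer-rate approach, a quick read-only test of the outer (fastest) zone is enough--/dev/sdX is a placeholder for the drive under test:

        # sequential read from the start of the drive, bypassing the buffer cache
        dd if=/dev/sdX of=/dev/null bs=1M count=1024 iflag=direct
        # the 2-platter version (higher areal density) should post a noticeably higher MB/s
        # than the 3-platter one; hdparm -t /dev/sdX gives a similar quick number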
  20. [Hi there--I wish you the best in re-establishing normalcy in your day-to-day.] Correct. But, two things of note:

     (1) In the scenario being explored here, every one of those UNC sector failures actually succeeded on the first re-try; not only did they not go the whole (1+5=6) attempts [and result in an error return], but they all succeeded on the second attempt (first re-try); that could be insightful "forensics" if one wanted to do an in-depth diagnosis.

     (2) Realize that the drive itself makes 10-20+ attempts to read (with valid ECC) the recalcitrant sector before giving the UNC error [to the driver].

     Yes, but you want to be "smarter" about how you use badblocks and SMART reports, so that you don't "destroy" valuable diagnostic information in the process. [I don't want to tell you exactly what to do, because I believe you can figure it out, and you should not be deprived of the challenge, and the satisfaction.] Once you do figure out that one, see if you can envision a simple embellishment to badblocks (about 10 lines of C) that will facilitate even further improvement to this endeavor. [I'll be suggesting this enhancement to badblocks' author.]

     -- UhClem "If you keep getting sick, but each time seem to completely recover, are you really a healthy individual?"
  21. Well, I've been traveling/visiting and ate a lot of turkey. Now I'm back, and I have to eat some crow . I was wrong in my assessment of badblocks' behavior regarding the drive errors that JoeL reported/documented. Please note that I have corrected (and mostly retracted) my criticism of badblocks in my earlier post [link]. Summary: the drive reports an UNCorrectable error to the kernel driver (which records it in the syslog), but the kernel only reports that error to the calling read() (from badblocks) IF the error persists through MAX_RETRIES+1 (6) attempts. Apologies for the "The sky is falling ..." alarm. But I hope some of you will have benefitted from the ensuing discussion. I really should not have been "fooled", since the syslog extract that JoeL posted does tell the whole story. But, because every one of those flaky sectors was reported in the same way, as a one-time UNC error (with implied success upon the first retry), I overlooked the non-presence of retry errors, along with the non-presence of a final, more-detailed error report, for each of the sectors. Sometimes the most important detail is divulged because it is not presented. --UhClem
  22. While the code to handle SMART resides in the drive's firmware, it really is not a part of the functioning/controlling of the drive. The drive performs reallocation when the conditions for doing so are met, and records the salient parameters for access (i.e., Analysis) [and Reporting] by the co-resident SMART code. Summary: the presence, or absence, of SMART functionality has nothing to do with badblocks performing (or not) according to its stated spec.

     As I understand it, the closest that SMART gets to affecting the attributes is the "Offline" effort to (successfully) read recently-added Current_Pending_Sector sectors; if it does satisfactorily read one, that sector's Current_Pending status is "cleared", the count is decremented, and that sector becomes a normal citizen again.

     Funny ... I did have a cursory look at the badblocks code about a year ago. I had written a little program to do (read-only, mostly performance) testing of hard drives, and was looking for some examples of what type of errors to anticipate, and how they were best dealt with. But all of my drives were perfectly healthy. [Weebo might just recall ... I had written here: ] So, I did look at badblocks, and was shocked to see that it essentially ignored the error return [from read()] and strictly relied on data comparisons. Never looked at it again; and certainly would not use it myself. And thank you, JoeL, for providing this "case in point" (now) that does support my quick 5-minute surmisal of about a year ago.

     Reminder: a failed read() returns (-1) [and sets errno]; the resulting contents of the read()-specified buf are totally/completely undefined (as is the value of the system's seek-pointer for that filedesc now).
  23. (You started one day/error early; NBD)

     ============================
     CORRECTION: [30Nov12] badblocks does report (very) bad blocks, and, for that, it can not be faulted. The test scenario upon which I based my [(now) misplaced] criticism, below (w/retractions), involves a drive that was throwing UNCorrectable errors, and corresponding nullifying increases/decreases to SMART's Current_Pending_Sector count, BUT none of those (15+) flaky sectors were sufficiently persistent in their flakiness to cross the AHCI driver's RETRY threshold, and, hence, did not result in any error returns to the calling program's (i.e., badblocks) read() requests. Thus, badblocks had nothing to report. [It is unfortunate that there is no mechanism available for a (privileged) user program to be informed of drive errors (below the RETRY threshold) provoked by its own read()s. I'll mention this to the author of badblocks, who is also heavily involved with kernel/filesystem development.] This highlights the importance of simultaneously monitoring the tested drive's (mis-)behavior by other means.
     =======================================

     Yep, those are [REAL / HARD] errors. They were NOT reported to badblocks (via error return from read() calls), and the user should have been apprised of such events. The most important conclusion to draw is: the badblocks program SUCKS at precisely what it claims to do. [FALSE -- see the correction above]

        badblocks(8) - Linux man page
        NAME: badblocks - search a device for bad blocks

     I'm inclined to revise your thread subtitle to: "badblocks can never be trusted". [ALSO FALSE]

     No. (In this case,) it showed that the formerly pending sectors had actually recuperated, and had not been sent to the morgue. [The morgue population (Reallocated_Sector_Ct) remained at 1992.] [cf: Reply #2]

     At least the program is consistent. Yes, that verify run actually created/generated a new morgue resident. What's left to prove? You did well! [But I might have to report you to the ASPCA. I know it's just a lab rat (that 5K3000), but this is bordering on cruelty. ] --UhClem
  24. Maybe a good idea to corroborate your findings by checking the kernel message log for UNCorrectable errors on the drive in question, for the duration of the badblocks run. Sort of a "proof of the pudding ..." approach, rather than relying on SMART to "prove" your case. I'm inclined to agree with your assessment. But it is better to actually find the dead body than to just be suspicious based on the odor of decomposition.
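     Something along these lines is what I have in mind--assuming a stock syslog location (/var/log/syslog; adjust if yours differs) and that the drive under test is the only one likely to be throwing these messages:

        # in another shell, watch for drive-level errors while badblocks is running:
        tail -f /var/log/syslog | grep -Ei 'UNC|media error|I/O error'
        # or, after the run, check the kernel ring buffer:
        dmesg | grep -Ei 'UNC|media error|I/O error'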
  25. Thanks for your efforts; sorry for your "loss". (Though, as you determined earlier, you can still put any 3 of your (current) non-SSD drives on the (Z68/)9230 without penalty (and, hence, remove one of the 3132s); and you can configure one/both of the Syba's eSATAs for out-of-case expansion, etc.) [Another example that Murphy is a bevious dastard.]