tillkrueger Posted July 28, 2018

Hi everyone... because I'm so low on funds right now, I took a risk when ordering a used 6TB WD Red drive to replace a 3TB WD Red in my array that had been removed for being defective. The new (used) 6TB drive rebuilt from parity successfully, but today, a few days later, when I remotely logged into my unRAID WebUI, the drive showed a red X and the array was in a compromised state. My questions now:

> is a pre-clear the next logical step (I probably should have done that first), or is it too late for that?
> what would be the procedure to do a pre-clear now, after the fact?
> if a pre-clear is *not* what's to be done next, what is?
JorgeB Posted July 28, 2018

Please post your diagnostics: Tools -> Diagnostics
tillkrueger Posted July 28, 2018 (Author)

thanks jb... here ya go. unraid-diagnostics-20180728-0856.zip
JorgeB Posted July 28, 2018

It looks like a disk problem; run an extended SMART test.
binhex Posted July 28, 2018

tillkrueger said: is a pre-clear the next logical step (I probably should have done that first), or is it too late for that?

Hate to say this, but for reference next time you buy a disk: the first thing you need to do is preclear it. This checks for errors (and zeros the drive, which is less important).
tillkrueger Posted July 28, 2018 (Author)

7 hours ago, johnnie.black said: It looks like a disk problem, run an extended SMART test.

how exactly do I do that, and is this the next thing I should do, or should I aim for a pre-clear first?

6 hours ago, binhex said: Hate to say this but for reference for next time you buy a disk the first thing you need to do is preclear it, this checks for errors (and zeros the drive, less important)

yeah, in retrospect I do realise that I should have done that first... my bad... too late now?
trurl Posted July 28, 2018

5 hours ago, tillkrueger said: how exactly do I do that

Click on the disk to get to its page, then go to Self-Test.
tillkrueger Posted July 30, 2018 (Author)

thx trurl. I started it yesterday, around the time you pointed out where to do so, and some 20 hrs later it's still churning away at 30%... it's a 6TB drive, so we're talking 3-4 days then?

since the drive this 6TB replaced was a 3TB that was also marked as defective, would another 6TB I might have to get to replace this one run the risk of not being able to rebuild whatever data was affected by the read error this one (and maybe the 3TB it replaced) shows, or is that data "safe" as part of the parity information? in other words, the errors (bad data) will not be rebuilt as "bad", but as the original data before it went bad, I hope?
pwm Posted July 30, 2018

unRAID rebuilds the content as it should have been. With single parity, you can see the parity logic as a+b+c+d = e, where e is parity (for the first parity disk it isn't + but XOR, often written ⊕). unRAID computes, block for block, what content the broken disk must have contained (before the failure) to satisfy the parity equation. The parity is computed from the expected content of the data disks, i.e. the values the sectors had when last written or when parity was last rebuilt.
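pwm's a+b+c+d = e description can be sketched in a few lines. This is only an illustration of the XOR-parity idea, not unRAID's actual implementation; the block values are made up:

```python
# Single-parity sketch: parity is the XOR of all data blocks, so any one
# missing block can be recovered by XOR-ing parity with the survivors.
data_blocks = [0b1010, 0b0110, 0b1111, 0b0001]  # hypothetical blocks a, b, c, d

parity = 0
for block in data_blocks:
    parity ^= block  # e = a XOR b XOR c XOR d

# Simulate losing disk "c" and rebuilding it from parity + surviving disks.
lost = 2
rebuilt = parity
for i, block in enumerate(data_blocks):
    if i != lost:
        rebuilt ^= block

print(rebuilt == data_blocks[lost])  # True: the lost block is recovered
```

This is also why the rebuilt content is only as good as the parity and the surviving disks: the equation recovers whatever values parity was last computed from.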
tillkrueger Posted July 30, 2018 (Author)

great explanation, pwm... and what a relief!
pwm Posted July 30, 2018

2 minutes ago, tillkrueger said: great explanation, pwm... and what a relief!

This is also why it is dangerous to throw away existing parity and rebuild it if you don't trust all data disks.
tillkrueger Posted July 30, 2018 (Author)

yeah, I get that... I don't think that I have *ever* thrown away or overwritten a faulty drive, as one look into my shelves would show
tillkrueger Posted July 30, 2018 (Author)

whoops... I just navigated away from the page that showed the progress "report" of the extended SMART self-test (which had been showing 30% for the past 24 hrs), and when I went back to that page it said "Completed without error" (the same thing it showed before I started the extended test). I hit the "Download" button and attached what it saved... did I interrupt the actual extended test, or is this really the outcome of the extended test? WDC_WD60EFRX-68MYMN1_WD-WXL1H642H7PL-20180730-1200.txt
pwm Posted July 30, 2018

The last test did end without error:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%             6723  -

Notice that the lifetime value matches the current statistics: the SMART test ended about 9 hours ago, give or take the roundoff to whole hours.

9 Power_On_Hours -O--CK 091 091 000 - 6732

Your original estimate that the SMART test should take days didn't sound correct; most drives manage the test in 6-15 hours depending on capacity and age. Your specific disk reports that it needs about 12 hours (for a healthy drive that isn't receiving read/write requests):

Extended self-test routine recommended polling time: ( 719) minutes.

I wonder if you have problems with either power or vibrations. The drive could read all data correctly during the SMART test, but it reported 256 uncorrectable sectors 200 hours back in time, and had a command not finish 100 hours before the SMART test.
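Both numbers checked above are easy to reconstruct from the quoted SMART report (a minimal sketch; the values come straight from the report):

```python
# Arithmetic behind the two checks above, using values from the SMART report.
polling_minutes = 719            # drive's own estimate for an extended test
print(round(polling_minutes / 60))  # 12 (hours, roughly)

power_on_hours = 6732            # Power_On_Hours attribute at report time
test_end_hours = 6723            # LifeTime(hours) in the self-test log
print(power_on_hours - test_end_hours)  # 9 hours since the test finished
```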
tillkrueger Posted July 30, 2018 (Author)

So what does that mean in terms of the best course of action? Vibrations/power within the server?
pwm Posted July 30, 2018

13 minutes ago, tillkrueger said: So what does that mean in terms of best course of actions? Vibrations/power within the server?

It means you are in uncharted territory. I don't see any obvious reason for the failures followed by the perfect SMART test, so there's no easy way to figure out exactly what to fix. You could try to clear the drive again and see if it works better. Or you could try to switch the PSU.
tillkrueger Posted July 30, 2018 (Author)

Since I am 500 miles away from this system, clearing the drive sounds like the only option I have right now. Can this be done remotely, and if so, are there instructions online for clearing a drive that has been rebuilt as part of the array already?
pwm Posted July 30, 2018

6 minutes ago, tillkrueger said: Since I am 500 miles away from this system, clearing the drive sounds like the only option I have right now.

No, you can't start a clear on the drive if it has already been added to the array and completely or partially rebuilt. Since it's part of the array, any writes to it update the parity state, so a clear would teach unRAID that the disk is expected to be empty.
trurl Posted July 30, 2018

5 minutes ago, tillkrueger said: Can this be done remotely, and if so, are there instructions online for clearing a drive that has been rebuilt as part of the array already?

You can't clear a drive that's in the array. If you remove it from the array you will be unprotected unless you have dual parity. If you already have a spare disk installed, you could replace/rebuild to that and then clear the disk. Clearing a disk is usually recommended for getting pending sectors reallocated, but I just looked at your posted diagnostics and that doesn't seem to be the problem. I don't know what others saw that told them it was a disk problem.
tillkrueger Posted July 30, 2018 (Author)

Makes sense. What if I copied all the data on it to the other drives in the array first? Could I then do a pre-clear like I should have done in the first place and see what happens? there must be some way to recover from this situation, or have I really navigated myself into a checkmate?
tillkrueger Posted July 30, 2018 (Author)

so trurl, would you agree with pwm that the culprit is likely a faulty PSU, and that further attempts at trying to deal with the disk are likely to be futile?
pwm Posted July 30, 2018

8 minutes ago, trurl said: I don't know what others saw that told them it was a disk problem.

What didn't look too good was:

Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 Sense Key : 0x5 [current]
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 ASC=0x21 ASCQ=0x0
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 CDB: opcode=0x8a 8a 08 00 00 00 00 74 75 19 b0 00 00 00 08 00 00
Jul 26 04:41:33 unRAID kernel: print_req_error: critical target error, dev sdk, sector 1953831344
Jul 26 04:41:33 unRAID kernel: md: disk6 write error, sector=1953831280

and in the SMART data:

Error 2 [1] occurred at disk power-on lifetime: 6629 hours (276 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 74 75 19 b0 c0 00  Error: IDNF at LBA = 0x747519b0 = 1953831344
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 74 75 19 b0 41 00  2d+17:37:21.022  WRITE FPDMA QUEUED
  e5 00 00 00 00 00 00 00 00 00 00 40 00  2d+17:37:21.022  CHECK POWER MODE
  ea 00 00 00 00 00 00 00 00 00 00 40 00  2d+17:37:20.942  FLUSH CACHE EXT
  e5 00 00 00 00 00 00 00 00 00 00 00 00  2d+17:37:20.939  CHECK POWER MODE
  40 00 00 00 01 00 00 00 00 00 00 40 00  2d+17:37:11.720  READ VERIFY SECTOR(S)

Error 1 [0] occurred at disk power-on lifetime: 6533 hours (272 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 01 00 00 02 86 6d 4e 00 e0 00  Error: UNC 256 sectors at LBA = 0x2866d4e00 = 10845244928
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 01 00 00 02 86 6d 4e 00 e0 08  2d+03:05:41.795  READ DMA EXT
  25 00 00 04 00 00 02 86 6d 4a 00 e0 08  2d+03:05:41.755  READ DMA EXT
  25 00 00 02 00 00 02 86 6d 47 80 e0 08  2d+03:05:41.733  READ DMA EXT
  25 00 00 00 40 00 00 9a 84 31 80 e0 08  2d+03:05:41.707  READ DMA EXT
  25 00 00 01 00 00 00 9a 84 30 00 e0 08  2d+03:05:41.705  READ DMA EXT

So the drive earlier reported 256 uncorrectable sectors at 6533 hours. I'm not sure if that was before or after the drive was connected to the unRAID machine; all the SMART data tells us is the number of power-on hours, so @tillkrueger must count backwards and figure out whether the UNC errors happened before or after the drive was bought. If before, then it might be possible to ignore this error: maybe a power issue in the original system, or maybe someone tried to move the machine while the drive was busy.

The more recent error (IDNF) at 6629 hours could have been caused by the drive disconnecting, in which case the "IDNF" was because the drive no longer had a connected controller to interact with. That could be a cable problem, or a controller card issue. Not knowing the exact conditions when the two errors happened makes it harder to guess the reason.
tillkrueger Posted July 30, 2018 (Author)

My gut feeling is that the Amazon Marketplace seller who sold me this used drive in "like-new" condition dumped a faulty drive on me. if there is nothing I can do with this drive remotely at this point, I'll have to take it up with Amazon and try to pressure him into returning my money, and get a new drive. this drive had only been in my array for 48-72 hrs before being marked as faulty, and I had replaced cables, drive cages, and controller all within the past 3 years, so I would think that my server hardware is good... not to say that even new components can't be faulty, but my gut feeling is that something is fishy with this used drive.
trurl Posted July 30, 2018

1 hour ago, tillkrueger said: Makes sense. what if I copied all data on it to the other drives in the array first?

If you copy all the data off, you could do a New Config without it and rebuild parity. Then it wouldn't be in the array.
pwm Posted July 30, 2018

51 minutes ago, tillkrueger said: My gut feeling is that the Amazon Marketplace seller who sold me this used drive in "like-new" condition dumped a faulty drive on me.

The SMART data claims that the drive isn't faulty. The older uncorrectable error need not represent any error with the disk: when the disk writes, it always writes blind. It first aligns to the track in read mode. Then it counts sectors, waiting until it's about to spot the correct track. Then it makes a blind realign of the write head over the track and performs the write. The drive does not know if the write went well; it isn't until you later try to read the sectors that the drive finds out whether they can be read.

Drives with a vibration sensor try to abort writes if vibrations are detected. Drives without vibration sensors will just produce garbage writes if there is too much vibration. When the drive aligns, each track is two-digit nanometers wide, so 1000 tracks are about the same width as a human hair, and the resolution used when aligning the head is in one-digit nanometers. 10 nm is about the width of 20 silicon atoms. So lots can go wrong when the drive tries to properly align the head and write the data. There are videos showing how the drives in a server rack stop producing data if a person shouts at the machine; the voice vibrations are enough to make the enterprise disks pause their tasks and wait for the vibrations to end.

The open question is why the drive disconnected. Sense 0x5 means an invalid command, and ASC=0x21, ASCQ=0x0 means block out of range:

Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 Sense Key : 0x5 [current]
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 ASC=0x21 ASCQ=0x0

If we assume the drive has been up continuously, then the timestamp in the SMART data was Mon Jul 30 12:00:49 2018, and the power-on counter then was 6732. The error in the SMART log happened at 6629, so 103 power-on hours earlier: 4 days and 7 hours. So somewhere around Jul 26, 05:00. That agrees with the unRAID log printout of Jul 26 04:41:33. So possibly the command was dropped because of a transfer error, or a software bug. But the error doesn't represent a broken disk, just that the disk couldn't perform the task because the requested task wasn't valid.
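The back-calculation above (which assumes the drive stayed powered on the whole time) can be checked directly with the values from the SMART report and syslog:

```python
from datetime import datetime, timedelta

# Values taken from the SMART report and syslog quoted above.
report_time = datetime(2018, 7, 30, 12, 0, 49)  # Mon Jul 30 12:00:49 2018
power_on_at_report = 6732   # Power_On_Hours attribute at report time
power_on_at_error = 6629    # lifetime hours logged with the IDNF error

hours_back = power_on_at_report - power_on_at_error
error_time = report_time - timedelta(hours=hours_back)

print(hours_back)   # 103 hours = 4 days + 7 hours
print(error_time)   # 2018-07-26 05:00:49, close to the syslog's Jul 26 04:41:33
```

The roughly 20-minute mismatch against the syslog timestamp is expected, since Power_On_Hours only has whole-hour resolution.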
Archived
This topic is now archived and is closed to further replies.