stealth82 Posted November 30, 2015

Hello,

I think this could be my first disk dying of old age, but I wanted a second opinion from the experts here.

Today I was manually copying data with mc from my cache drive to /mnt/disk2 when a notification promptly came in on my iPhone. It was unRAID telling me something was wrong...

Now disk2 is emulated, and I tried to check the SMART results to see what happened. Point is... it says the disk is unavailable and it can be spun up for diagnostics.

I checked the syslog, which I attached, and looked up these 2 errors that I saw:

DRDY ERR
ICRC ABRT

They should be, respectively:

Drive media issue #1: These are almost always associated with bad sectors.
Drive media issue #2: a pretty good indicator of a poor quality SATA cable

Now the last one made me think. Some weeks ago I bought a Supermicro AOC-SASLP-MV8 controller and 2 Mini SAS to 4-SATA SFF-8087 Multi-Lane Forward Breakout Internal Cables. Until a few moments ago I had no issues whatsoever, though.

Is it possible that just one sub-cable out of 4 is bad? Should I be worried about it, or could the cause be the disk's old age? I say old age because it has shouldered 4y, 6m, 9d, 14h of service so far (I read that stat from its sibling; I have 2 disks bought in the same period).

A new 4TB drive is on the way now, and I will have to go through a parity swap procedure when it arrives. Are there any suggestions before getting into that, or should I just give up on the old disk?

tower-diagnostics-20151130-1731.zip
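For anyone wanting to tally these flags rather than eyeball them, here is a minimal Python sketch. It assumes the kernel prints them in the usual libata form, `ataN.NN: status: { ... }` and `ataN.NN: error: { ... }`. The sample lines, the port name `ata9.00`, and the function name `count_ata_flags` are made up for illustration and are not taken from the attached syslog.

```python
import re
from collections import Counter

# Assumed libata format, e.g. "ata9.00: status: { DRDY ERR }"
# or "ata9.00: error: { ICRC ABRT }".
FLAG_LINE = re.compile(r"(ata\d+(?:\.\d+)?): (status|error): \{ ([A-Z ]+?) \}")


def count_ata_flags(lines):
    """Count each flag per (port, kind), where kind is 'status' or 'error'."""
    counts = Counter()
    for line in lines:
        match = FLAG_LINE.search(line)
        if match:
            port, kind, flags = match.groups()
            for flag in flags.split():
                counts[(port, kind, flag)] += 1
    return counts


# Illustrative lines only; not copied from the poster's syslog.
SAMPLE = [
    "Nov 30 17:02:11 Tower kernel: ata9.00: status: { DRDY ERR }",
    "Nov 30 17:02:11 Tower kernel: ata9.00: error: { ICRC ABRT }",
    "Nov 30 17:02:12 Tower kernel: ata9.00: status: { DRDY ERR }",
    "Nov 30 17:02:12 Tower kernel: ata9.00: error: { UNC }",
]

if __name__ == "__main__":
    for (port, kind, flag), n in sorted(count_ata_flags(SAMPLE).items()):
        print(port, kind, flag, n)
```

A port whose error lines are dominated by ICRC (interface CRC) points more toward the cable or connection, while UNC (uncorrectable media error) points more toward the disk surface. Treat the counts as a hint, not a verdict.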
trurl Posted November 30, 2015

For unRAID v6, instead of posting the syslog, you should always go to Tools - Diagnostics and post the complete diagnostics zip.
stealth82 Posted November 30, 2015

Apologies. I attached the right zip file now.
trurl Posted November 30, 2015

SMART for disk2 looks OK. You could just check connections and try to rebuild the drive to itself:

1. Stop array
2. Unassign disk2
3. Start array
4. Stop array
5. Reassign disk2
6. Start array
7. Wait for rebuild
stealth82 Posted December 1, 2015

Are you sure? If it reads "A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.", is that a sign that it looks OK?

WDC_WD20EARS-00MVWB0_WD-WMAZA0747093-20151130-1731.txt

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.13-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /1:0:1:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
Physical block size:  1549687900 bytes
Lowest aligned LBA:   14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
trurl Posted December 1, 2015

Sorry, my bad. I was looking at SMART for disk1. Replace the drive.
stealth82 Posted December 2, 2015

OK, I think the worst case scenario has just occurred.

I wanted to take the disk out, but since I had bought a SATA cage and rewired everything, I wanted to give the disk another try. The disk came back online and reported no errors. I guess a wire really had come loose and it wasn't the disk; I can't think of another explanation.

Anyway, I put it back into the array and unRAID started rebuilding it. A few hours into the rebuild, the parity drive started throwing errors (843 in the errors column):

187  Reported uncorrect      0x0032  017  017  000  Old age  Always   Never  83
197  Current pending sector  0x0012  100  099  000  Old age  Always   Never  128
198  Offline uncorrectable   0x0010  100  099  000  Old age  Offline  Never  128

The disk that is getting rebuilt is toast - the data can't be trusted, I'm toast. Am I right? :'(
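As a rough aid for reading rows like the three above, here is a Python sketch that pulls out the attribute ID and raw value and flags nonzero raw counts on a few commonly watched attributes (5, 187, 197, 198). The choice of IDs, the regular expression, and the function name `flag_rows` are assumptions made for illustration; the row layout follows the rows as pasted in the post, not any official parser.

```python
import re

# ID, name (may contain spaces), flag, value, worst, threshold,
# then type/updated/failed columns, and finally a plain integer raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>.+?)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<rest>.+?)\s+(?P<raw>\d+)\s*$"
)

# Assumption: reallocated, reported uncorrectable, pending, offline uncorrectable.
WATCHED_IDS = {5, 187, 197, 198}


def flag_rows(text):
    """Return (id, name, raw) for watched attributes whose raw value is nonzero."""
    hits = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # rows with non-integer raw values are skipped
        attr_id, raw = int(match["id"]), int(match["raw"])
        if attr_id in WATCHED_IDS and raw > 0:
            hits.append((attr_id, match["name"], raw))
    return hits


# The three rows quoted in the post.
PARITY_ROWS = """\
187 Reported uncorrect 0x0032 017 017 000 Old age Always Never 83
197 Current pending sector 0x0012 100 099 000 Old age Always Never 128
198 Offline uncorrectable 0x0010 100 099 000 Old age Offline Never 128
"""

if __name__ == "__main__":
    for attr_id, name, raw in flag_rows(PARITY_ROWS):
        print(attr_id, name, raw)
```

All three rows are flagged here, which matches the poster's reading that the parity drive, not just the rebuilt disk, is in trouble.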
trurl Posted December 2, 2015

That disk should be replaced. Most likely the parity issues are a connection problem caused by your rewiring, since its SMART looked good in your diagnostics.

Check your connections, remove the bad drive, and reboot. You should be able to see whether the data is being emulated. If so, you will be able to rebuild onto a new disk.
stealth82 Posted December 3, 2015

Unfortunately, I don't think so. The parity drive has always been directly attached to the motherboard with a cable I have no reason to doubt. The connection was and is solid.

That drive, though, had given me that very same error in the past. After that I put it under observation and ran a couple of preclears on it, and it seemed fine (I think somewhat under 100 sectors were reallocated, but the pending sector count stopped growing). I guess the best thing to do would have been to trash it rather than risk it... but I didn't have a disk to spare at the time.

Is there any way I can find out which sectors have affected the rebuilt drive now? What I would like to do, if I can isolate the problem, is replace the parity drive with a new disk, but what you are saying makes me think I could try to rebuild again from the "faulty" parity drive. I really don't know what to do now.
stealth82 Posted December 3, 2015

I attached a new diagnostics file.

I'd really love to know if there's any way to track down whether the rebuild has been affected - I think it has - and on what data, if any, the bad sectors "landed". I say if any because the rebuilt disk is 75% full and the errors started appearing in the last 25% of the rebuilding process, I think. I don't know if this might mean that there were no files there, just empty space to rebuild.

Any insight?

P.S. Why is unRAID considering the rebuilt disk OK, considering it knows there were read errors from the parity?

tower-diagnostics-20151203-1024.zip
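To illustrate why unreadable parity sectors matter for a rebuild, here is a toy Python sketch of single-parity reconstruction, where parity is the byte-wise XOR of the data disks and a missing disk is recomputed from parity plus the surviving disks. The four-byte "disks" and the function name `xor_blocks` are invented for the example, and modelling an unreadable parity sector as zeros is purely an assumption; what unRAID actually writes in that case is not stated in this thread.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy four-byte "disks"; values are arbitrary.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])  # the disk being rebuilt
disk3 = bytes([0x05, 0x06, 0x07, 0x08])

parity = xor_blocks([disk1, disk2, disk3])

# Healthy parity: disk2 comes back exactly.
rebuilt_ok = xor_blocks([parity, disk1, disk3])

# Damaged parity: the last two bytes could not be read (modelled as zeros).
bad_parity = parity[:2] + bytes(2)
rebuilt_bad = xor_blocks([bad_parity, disk1, disk3])

# Offsets where the rebuilt data differs from the original disk2.
mismatched = [i for i, (a, b) in enumerate(zip(rebuilt_bad, disk2)) if a != b]

if __name__ == "__main__":
    print(rebuilt_ok == disk2)
    print(mismatched)
```

The damage stays at the same offsets as the bad parity reads. Whether those offsets hold file data or free space depends on how the filesystem has laid out its blocks, so "the disk is only 75% full" does not by itself say the affected region was empty.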
stealth82 Posted December 3, 2015

I don't know if it's related, but SMART is telling me the parity disk's pending sector count is increasing (184 now). The parity disk is spun down. I wonder how it can know that, considering it's off... I'm scrubbing the rebuilt disk - I don't know if that is related.
trurl Posted December 3, 2015

Looking at these latest diagnostics, disk2 looks good but parity is failing, as you said. Can you read disk2?
stealth82 Posted December 3, 2015

I can, but my fear is that some of the rebuilt data is corrupt, since it was rebuilt from a failing parity drive with unreadable sectors.
trurl Posted December 3, 2015

What filesystem is disk2?
trurl Posted December 3, 2015

I don't know much about testing btrfs disks for corruption or fixing them. You can try searching, but I don't think there is much documented in our forum or wiki. Out in the wild wild web where btrfs is used, there may be some documentation you could google.
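One option that does exist in btrfs itself is a scrub, which reads the filesystem and verifies block checksums. Below is a small Python sketch that builds and runs a read-only, foreground scrub of /mnt/disk2 using the `btrfs scrub start` flags `-B` (wait in the foreground) and `-r` (read-only). The wrapper functions `scrub_command` and `run_scrub` are hypothetical helpers, and running it needs root and the btrfs tools installed.

```python
import subprocess


def scrub_command(mount_point, read_only=True):
    """Build a 'btrfs scrub start' command that waits for completion (-B)."""
    cmd = ["btrfs", "scrub", "start", "-B"]
    if read_only:
        cmd.append("-r")  # report errors without attempting repairs
    cmd.append(mount_point)
    return cmd


def run_scrub(mount_point, read_only=True):
    """Run the scrub and return the CompletedProcess (stdout holds the summary)."""
    return subprocess.run(
        scrub_command(mount_point, read_only),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_scrub("/mnt/disk2")
    print(result.stdout or result.stderr)
```

On a single-device filesystem a scrub can detect data blocks that fail their checksum but generally cannot repair them, so any files it reports would need to be restored from another copy.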
stealth82 Posted December 6, 2015

Well, I don't know how to interpret this, but...

Inspired by this thread, I gave the issue a couple more tries. I ran a parity check without corrections and it finished just a few minutes ago. The reads/writes columns showed 1883 errors at the end, but apart from that, no sync errors?!? How should I interpret the sync error count of 0? I don't know.

Anyway, I'm burying all this. A new parity disk is in the array now and the sync is in progress.

tower-diagnostics-20151206-1534.zip
This topic is now archived and is closed to further replies.