BrianAz Posted May 2, 2014

My system was working just fine yesterday. This morning it ran its monthly parity check. When it was complete I saw this:

    Last checked on Thu May 1 10:12:13 2014 MDT, finding 456 errors.

I upgraded to 5.0.5 recently, though I don't think this is my first parity check. It may, however, be the first one kicked off by the monthly UnMenu package. I have never seen anything but "0 errors", so this is quite concerning.

Then I noticed that while it says my parity is good, my array appears to be in read-only mode. I am currently unable to add any files to unRAID.

Looking through the syslog, two things concern me:

- sdg appears to have some issues. This is my cache drive. Perhaps everything is read-only, though, and the log only mentions sdg because that's the disk it's trying to write to.
- My syslog begins at 4:40am today. I believe this is right in the middle of the parity check, since it finished at 10am.

I would really appreciate any guidance you can provide, as I don't know where to turn. Should I run another parity check? I'm going to let it sit as-is for the moment so I don't make it any worse. The Web UI is not showing any redballed disks or any other issues aside from the parity check errors.

Please let me know what else I can provide that might be helpful.

Thanks in advance,
Brian

Edit: Realized that my syslog had gotten so large that it was rotated. So I grabbed syslog.1, which has everything back to the last reboot on 4/15. Almost every error I see seems to involve my cache drive. Is it possible that it's dying and unRAID is having issues reading from it? (I saw a lot of mover errors where it said it couldn't read the file.) It also seems that the mover was running during the parity check, so if the mover was having issues reading from cache while the parity check was going, maybe that's a bad combination. Attached syslog.1.txt. Thanks!

Edit #2: I'm pretty confident my cache drive is the issue. I am able to write to a share which does not have cache enabled. I just need to know how to handle the parity check situation so I don't lose any data. I imagine if I just go in and disable my cache drive on each of my shares and run a parity check, I should be OK? Thx

Edit #3: I disabled my cache drive and things seem to be working just fine. I guess I should run a preclear on the cache disk and see if it breaks? Am I OK to run a parity check in the meantime?

Attachments: syslog.zip, syslog.1.zip

BrianAz Posted May 2, 2014

Any ideas? Looking through the logs, everything was fine prior to the 1st. There were no mover errors before then. However, last night I confirmed that some files from the cache drive had only been partially written to the array by the mover. So I went and re-copied all the files that the log said the mover had problems with. All of those files play just fine now.

I've used the array without issue now that the cache drive is disabled. Am I risking anything by running another parity check? I'm hoping that without the misbehaving cache drive, it'll complete without issue.

dgaschk Posted May 2, 2014

Does unRAID Main show any values in the Errors column? The cache drive cannot cause parity check errors. Paste a SMART report for disk 9. See Check Disk Filesystems in my sig to fix the cache drive.

BrianAz Posted May 2, 2014

Thanks for taking a look! unRAID Main shows no errors. I've attached the SMART report for disk 9.

For my own edification, I'm curious why you singled out disk 9 (sde)? In my scan of the logs I didn't see anything calling out any disk other than sdg. Never mind - I see them.

Edit: I ran reiserfsck on my cache drive and it came back with the following. Note: my cache drive is not presently enabled in Tower/ShareSettingsMenu. I didn't think it needed to be, though.

    root@Tower:~# reiserfsck --check /dev/sdg1
    reiserfsck 3.6.24

    Will read-only check consistency of the filesystem on /dev/sdg1
    Will put log info to 'stdout'

    Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

    The problem has occurred looks like a hardware problem. If you have
    bad blocks, we advise you to get a new hard drive, because once you
    get one bad block that the disk drive internals cannot hide from
    your sight,the chances of getting more are generally said to become
    much higher (precise statistics are unknown to us), and this disk
    drive is probably not expensive enough for you to you to risk your
    time and data on it. If you don't want to follow that follow that
    advice then if you have just a few bad blocks, try writing to the
    bad blocks and see if the drive remaps the bad blocks (that means
    it takes a block it has in reserve and allocates it for use for
    of that block number). If it cannot remap the block, use badblock
    option (-B) with reiserfs utils to handle this block correctly.

    bread: Cannot read the block (2): (Input/output error).

    Aborted (core dumped)
    root@Tower:~#

SMART report for the cache drive (sdg):

    root@Tower:~# smartctl -a -A /dev/sdg
    smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Vendor:               /1:0:3:0
    Product:
    >> Terminate command early due to bad response to IEC mode page
    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
    root@Tower:~#

My cache drive is now showing as unformatted in the unRAID Main GUI. When I attempt to format it, the format fails and I see the following in my log:

    May 2 12:17:56 Tower emhttp: ckmbr: read: Input/output error
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] Unhandled error code
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg]
    May 2 12:17:56 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] CDB:
    May 2 12:17:56 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 20 00
    May 2 12:17:56 Tower kernel: end_request: I/O error, dev sdg, sector 0
    May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 0
    May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 1
    May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 2
    May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 3
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] Unhandled error code
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg]
    May 2 12:17:56 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
    May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] CDB:
    May 2 12:17:56 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 08 00
    May 2 12:17:56 Tower kernel: end_request: I/O error, dev sdg, sector 0
    May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 0

Seems like my cache drive is dead? Thanks so much for your help.

-Brian

Attachment: smart.txt
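
Editor's note: one way to see which disks the kernel is complaining about, rather than scanning the syslog by eye, is to tally the I/O error lines per device. The sketch below is only an illustration. The two patterns are copied from the kernel messages quoted in the post above; the function name `count_io_errors` and the sample list are made up for this example, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns based on the kernel messages quoted in the post above.
# Other kernels may phrase these differently (assumption).
IO_ERROR_PATTERNS = [
    re.compile(r"end_request: I/O error, dev (\w+), sector \d+"),
    re.compile(r"Buffer I/O error on device (\w+), logical block \d+"),
]


def count_io_errors(lines):
    """Return a Counter mapping device name (e.g. 'sdg') to matching error lines."""
    counts = Counter()
    for line in lines:
        for pattern in IO_ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each log line at most once
    return counts


# A few lines taken from the log excerpt above.
sample = [
    "May 2 12:17:56 Tower emhttp: ckmbr: read: Input/output error",
    "May 2 12:17:56 Tower kernel: end_request: I/O error, dev sdg, sector 0",
    "May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 0",
    "May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 1",
]

for device, errors in count_io_errors(sample).most_common():
    print(device, errors)
```

To run it against a saved log, pass an open file instead of the sample list, for example `count_io_errors(open("syslog.1.txt"))`. A disk that shows up with only a handful of lines (as sde apparently did here) is easy to miss when another disk dominates the log.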

vca Posted May 2, 2014

From the SMART report your drive shows:

    196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always  - 1
    197 Current_Pending_Sector  0x0022 100 100 000 Old_age Always  - 0
    198 Offline_Uncorrectable   0x0008 100 100 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count    0x000a 200 200 000 Old_age Always  - 274

So it has only one bad block, which has already been remapped, and the surface is good. But it has 274 UDMA CRC errors, which usually point to a problem with the SATA or power cabling.

So it's time to check (reseat or perhaps replace) the SATA and power cables. If you have a power splitter in the cable, look at it too.

Regards,
Stephen
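
Editor's note: the reading of the SMART table above can be sketched in a few lines of code. This is a rough illustration only, not a substitute for reading the full report. It assumes the usual ten-column `smartctl -A` attribute layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value); the function names and the wording of the notes are made up here, and the rules simply restate the advice in the post above.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value for lines in smartctl attribute-table format."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return raw


def summarize(raw):
    """Turn raw counts into short notes that follow the advice in the post above."""
    notes = []
    if raw.get("Reallocated_Event_Count", 0) > 0:
        notes.append("sectors already remapped: watch whether the count grows")
    if raw.get("Current_Pending_Sector", 0) > 0 or raw.get("Offline_Uncorrectable", 0) > 0:
        notes.append("unreadable sectors outstanding: suspect the disk surface")
    if raw.get("UDMA_CRC_Error_Count", 0) > 0:
        notes.append("CRC errors on the link: check SATA and power cabling")
    return notes


# The four attribute lines quoted in the post above.
REPORT = """\
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 274
"""

for note in summarize(parse_smart_attributes(REPORT)):
    print(note)
```

Because these raw values are lifetime counters, a single report cannot say when the errors happened. Saving the parsed values and comparing them against a later report shows whether a count is still climbing.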

BrianAz Posted May 2, 2014

(In reply to vca's SMART analysis above.)

Thanks! These errors are cumulative over the life of the drive, right? So if these drives are 4-5 years old and came from multiple previous machines, the errors could have happened anywhere along the way?

I've confirmed all my power and SATA cables are seated properly.

BrianAz Posted May 2, 2014

Kicked off a parity check. See you in 10 hours (hopefully).

BrianAz Posted May 3, 2014

Wanted to close out this request for help, as I think I've got things under control at the moment. To summarize the events:

- Saw the monthly parity check on unRAID Main indicating 456 errors.
- Investigated the logs and saw many drive-related errors, mainly for sdg (cache drive) and some for a data drive, sde.
- Found that I could not write to any cache-enabled shares.
- Disabled the cache drive. Could write to all shares again.
- Saw that the mover happened to run in the middle of the monthly parity check the night before. Complete ignorant speculation, but perhaps the read errors on the cache drive introduced some bad data which the parity check detected?
- Noted which files the mover attempted but failed to move, and wrote them to Tower again from backups.
- Ran SMART tests. Drives are old but no major issues detected.
- Not sure what to do next, I documented the sectors the parity check had identified and kicked off a second parity check.
- Second parity check finished with 380 errors. A comparison of the sectors (first 100) showed they were the exact same sectors as in the first parity check.
- Third parity check completed with 0 errors.
- Array appears to be OK. No errors in syslog.

So that's where it stands right now. My cache drive appears to have some serious issues. I tried reiserfsck on it and it couldn't read the drive. Neither could smartctl. Going to pull it out later and see if it's truly dead.

Thanks for everyone's help. I plan to continue to watch syslog closely for errors and am considering calculating md5 hashes of known good files to keep on hand for review if I encounter more issues.

Edit: Pulled the cache drive and ran WD diagnostics on it. Found a couple of bad sectors and reallocated them. Going to run 3-4 preclears before putting the cache drive back in service.
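
Editor's note: the md5 idea mentioned above could look something like the following. This is a minimal sketch under stated assumptions, not a tool used in the thread: the function names are invented for illustration, the manifest is just an in-memory dictionary of relative path to md5 hex digest, and md5 is used only to detect accidental corruption, not tampering.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Walk `root` and return {path relative to root: md5 hex digest}."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = md5_of_file(full_path)
    return manifest


def changed_files(old, new):
    """Paths present in both manifests whose digests differ."""
    return sorted(path for path in old if path in new and old[path] != new[path])
```

A possible workflow would be to call `build_manifest` on a share while the files are known to be good (for example a hypothetical `/mnt/user/Movies`), save the result with `json.dump`, and rebuild and compare with `changed_files` after any future parity or mover trouble. Files that were edited on purpose will also show up as changed, so this suits mostly static data such as media.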