PsyVision Posted February 13, 2016
Hi All,

A few weeks ago one of my drives started failing on me. I switched it out and replaced it with a new drive. On rebuilding, the drive showed 288 errors. I thought there could be an issue with the SATA cable, so I also replaced that. I then unassigned the drive and re-assigned it to rebuild it from parity. The drive rebuilt but showed 288 errors again. I've repeated this process once more, changing the SATA cable (as some of mine are old and possibly broken), and again got 288 errors.

Below is a sample of my syslog, and I have attached the diagnostics download.

Feb 11 21:11:39 nas kernel: ata7.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x6 frozen
Feb 11 21:11:39 nas kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Feb 11 21:11:39 nas kernel: ata7: hard resetting link
Feb 11 21:11:49 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:11:49 nas kernel: ata7: hard resetting link
Feb 11 21:11:59 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:11:59 nas kernel: ata7: hard resetting link
Feb 11 21:12:10 nas kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 11 21:12:10 nas kernel: ata8.00: failed command: WRITE DMA EXT
Feb 11 21:12:10 nas kernel: ata8: hard resetting link
Feb 11 21:12:16 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:16 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:16 nas kernel: ata8: hard resetting link
Feb 11 21:12:27 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:27 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:27 nas kernel: ata8: hard resetting link
Feb 11 21:12:34 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:12:34 nas kernel: ata7: hard resetting link
Feb 11 21:12:39 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:12:39 nas kernel: ata7: reset failed, giving up
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465144063
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254148
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254151
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 0
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254181
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465328951
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465329367
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 0
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 37607
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 978674575
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x575452c0 ("xfs_buf_iodone_callbacks") error 5 numblks 32
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166111, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166112, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166163, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166164, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334314, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334315, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334344, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334345, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 124078201, lost async page write
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6c5 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334346, lost async page write
Feb 11 21:12:39 nas kernel: XFS (sdh1): Log I/O Error Detected. Shutting down filesystem
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6c8 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6e6 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: XFS (sdh1): xfs_log_force: error -5 returned.
Feb 11 21:12:59 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:59 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:59 nas kernel: ata8: hard resetting link
Feb 11 21:13:00 nas kernel: blk_update_request: I/O error, dev sdi, sector 251287064
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287000
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287008
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287016
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287024
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287032
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287040
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287048
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287056
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287064
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287072
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287080
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287088

I'm not sure how to resolve this. Is it possible that I've corrupted my parity such that when rebuilding the data drive it rebuilds incorrectly?

Attachment: nas-diagnostics-20160213-1021.zip
JorgeB Posted February 13, 2016
Both disk5 and your cache disk dropped offline, so this may be a power or cable problem. Check the cables, power cycle the server, and then post new diagnostics, because the SMART reports for both disks are missing.
itimpi Posted February 13, 2016
Which was the drive that you replaced?

The SMART reports show that the disk with serial WD-WCAU45077221 has 6 pending sectors. These are sectors that are not being read successfully, so they can affect the rebuild of any other drive, as the data reconstructed for those particular sectors may be incorrect.

There are also no SMART reports for WD-WCC4N0PKSD7P-20160213 and WD-WMAV50355779, which suggests these two drives have dropped offline.

The syslog shows that you started getting write errors to the 'sdh' disk. It looks like the drive went offline and things went downhill from there. The most likely causes are a cabling issue or a power-related issue. Another possibility is a disk controller that is not properly seated in its motherboard slot.
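To see at a glance which devices the errors hit, the syslog can be tallied per device. The following is only a rough sketch: the two regular expressions are assumptions modeled on the message formats in the excerpt posted above, and the function name `tally_errors` is made up for illustration. It is not an unRAID tool.

```python
import re
from collections import Counter

# Patterns modeled on the kernel messages in the posted syslog excerpt (assumed formats).
BLK_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector \d+")
MD_ERROR = re.compile(r"md: (disk\d+) write error, sector=\d+")


def tally_errors(lines):
    """Count block-layer I/O errors per device and md write errors per array slot."""
    counts = Counter()
    for line in lines:
        for pattern in (BLK_ERROR, MD_ERROR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# A few lines taken from the excerpt, plus one line that should not be counted.
sample = [
    "Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465144063",
    "Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 0",
    "Feb 11 21:13:00 nas kernel: blk_update_request: I/O error, dev sdi, sector 251287064",
    "Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287000",
    "Feb 11 21:12:39 nas kernel: ata7: hard resetting link",
]

error_counts = tally_errors(sample)
for name, count in sorted(error_counts.items()):
    print(name, count)
```

On a full log, passing the lines of the syslog file from the diagnostics zip instead of `sample` would give the per-device totals.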
JorgeB Posted February 13, 2016
itimpi said: "Another possibility is a disk controller that is not properly seated in its motherboard slot."

Strong possibility. I just noticed that both dropped disks appear to be using the same Marvell controller.
PsyVision Posted February 13, 2016
Yes, disk 5 was replaced. That disk and one other are on the same onboard motherboard controller, which has only those two disks attached. I will check the cables and power cycle later, then post back the results along with the other information you've asked for. Thank you!
PsyVision Posted February 13, 2016
Okay, I powered off the server and re-connected all of the power connectors. I may have had an extra drive on one of the power leads that should have been on the other (3 and 5 rather than 4 and 4). I have powered on and then done nothing else; logs attached.

WD-WMAV50355779 (sdh) is my cache drive.
WD-WCAU45077221 (disk 2, sdc) is an old 1TB drive that could potentially be failing. I would be happy to replace it if need be (of course).

Attachment: nas-diagnostics-20160213-1620.zip
JorgeB Posted February 13, 2016
Disk 2 should be replaced ASAP; SMART for all the others looks good. If disk2 caused the disk5 rebuild errors, you're probably going to have some corrupt files. If you have backups or checksums, you should check your files against them.
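For anyone who does have checksums recorded, checking files against them could look roughly like this. It is a minimal sketch that assumes the checksums are SHA-256 hex digests held in a dict keyed by file path; the helper names `sha256_of` and `find_corrupt` are hypothetical, and the demo uses throwaway temporary files rather than real array data.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt(expected):
    """Return the paths that are missing or whose current hash differs from the recorded one."""
    bad = []
    for path, recorded in expected.items():
        if not os.path.isfile(path) or sha256_of(path) != recorded:
            bad.append(path)
    return bad


# Demo on temporary files: record hashes, damage one file, then verify.
with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.bin")
    damaged = os.path.join(folder, "damaged.bin")
    for path in (good, damaged):
        with open(path, "wb") as handle:
            handle.write(b"original contents")
    recorded = {path: sha256_of(path) for path in (good, damaged)}

    # Simulate silent corruption after a bad rebuild.
    with open(damaged, "wb") as handle:
        handle.write(b"original c0ntents")

    corrupt_names = [os.path.basename(p) for p in find_corrupt(recorded)]
    print(corrupt_names)
```

Any file reported this way would need to be restored from backup, since a checksum can only detect the damage, not repair it.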
PsyVision Posted February 14, 2016
Thank you Johnnie.

What is the best way to do this? If I put a new disk2 in, it tells me it's the wrong one. I see I should mark the slot as empty, shut down, put the new disk in, and then assign it to the slot. However, I am then told that I have either too many or the wrong disks assigned. Unfortunately disk5 is still showing as red-balled (red cross), so I'm not sure it will let me rebuild it.
JorgeB Posted February 14, 2016
So the disk5 rebuild didn’t complete? I understood that it completed with some errors. If it’s showing a red x, it’s still being emulated, and you can’t change another disk before that one is dealt with. The trouble is that because disk2 has pending sectors, a completely successful rebuild may be impossible. Did the old disk5 fail completely, or is it still readable?
JorgeB Posted February 14, 2016
Looking at your diagnostics, disk5 is still disabled; the rebuild failed when disk5 (and cache) dropped offline. Since you have checked all the cables, I would try to rebuild it again.
PsyVision Posted February 14, 2016
I tried a rebuild before posting and it failed after a couple of hours; apparently a full rebuild should take ~9 hours. Cache and disk5 appear to be online, it's just that disk5 gets marked as red.

I have enough disk space to move the data off both drives. Would it be okay to move the data to other drives and then replace 2 and 5 as though they were new drives?
JorgeB Posted February 14, 2016
If you have the space it might be a good solution, because the pending sectors on disk2 can prevent a successful disk5 rebuild.
trurl Posted February 14, 2016
PsyVision said: "... disk5 appear to be online, just that disk5 gets marked as a red..."

The disk appears to be online because it is being emulated, not because unRAID is actually using it. unRAID will not use a disabled drive. Instead it calculates the drive's data by reading all the other drives plus parity.
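The emulation described above can be illustrated with a toy example. This is only a sketch that assumes plain single (XOR) parity and uses made-up 4-byte "sectors"; the disk contents and the helper `xor_blocks` are invented for illustration and are not unRAID code. It also shows why bad data from disk2 matters: a sector that cannot be read correctly (modeled here simply as wrong bytes) makes the emulated disk5 data wrong at the same position.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data disks, each holding one 4-byte "sector" (invented values).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0A, 0x0B, 0x0C, 0x0D])
disk5 = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk5])

# disk5 is disabled: its contents are calculated from every other disk plus parity.
emulated = xor_blocks([disk1, disk2, parity])
print(emulated == disk5)

# If disk2 supplies wrong data for this sector, the emulated disk5 sector is wrong too.
disk2_bad = bytes([0x0A, 0x0B, 0xFF, 0x0D])
emulated_bad = xor_blocks([disk1, disk2_bad, parity])
print(emulated_bad == disk5)
```

The same arithmetic is why every remaining disk has to be fully readable for a rebuild of the disabled disk to come out clean.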