danioj Posted May 29, 2020 (edited)

Hi all,

I'm afraid I need the brains trust on this one. I woke this morning to find that one of my very stable (albeit old) data drives had exhibited read errors overnight while the server was undergoing a parity sync.

For background, I made some hardware changes to my main server yesterday:

- Added a new 4 port SATA expansion card (Skymaster PCIe 4-Ports SATA 6G Card EST11B)
- Added a new (SMART tested and precleared) 8TB Seagate Barracuda Compute
- Added a Silverstone riser cable so I could move the graphics card I have in there (which was taking two PCIe slots) and make room for the new SATA expansion card (SilverStone RC04B PCI-e Riser Cable 400mm SST-RC04B-400)
- Added a new 250GB SSD to run my living room TV LibreELEC VM via UAD

My overall goal is to return to dual parity. I was running 1 parity drive (after releasing my second one some months ago for use as a data drive). I intend to add 2 x new Barracudas as parity and move my single 8TB archive parity disk over to be a data disk. I'm doing it slowly and surely, step by step, which is why I am adding a new parity drive before replacing the current one.

As the SATA expansion card has a Marvell chipset, I had to apply the 'iommu=pt' fix to my syslinux config to let unRAID see the drive. It is worth noting that the new 8TB and SSD drives are on the new SATA card, but the drive which is exhibiting problems is not.

So, as you would expect, after installing the new 8TB drive I added it as a parity disk and began the parity sync. Then I left it alone.

I cannot see anything obvious in the diagnostics. The disk's SMART data looks fine. There were only 88 read errors. They coincided with the start of my daily CA docker backup sequence and only lasted a short time. I mention the backup only because it happened at around the same time and was the only other thing the server was doing.

I guess I could have knocked a cable while I was in there, but I am pretty diligent and I checked the cables before I packed up. Plus, I would usually expect to see a high UDMA CRC error count if there were cabling issues, and there isn't one.

I think I know the protocol: let the parity sync finish, check the cables again and then do a correcting parity check. I'd be grateful if, in the meantime, anyone has any other insights as to what the issue might be.

Diagnostics attached. Thank you in advance.

Daniel

unraid-diagnostics-20200530-0711.zip

Edited May 29, 2020 by danioj

JorgeB Posted May 30, 2020

The error is reported in the syslog as an actual disk problem, and the same in SMART, though these kinds of errors can sometimes be intermittent. You should run an extended SMART test. If it fails, or if you get more similar errors in the near future, you should consider replacing that disk.

danioj (Author) Posted May 30, 2020

Thanks for the review. Read errors again, this time on the parity check. I'm running a long SMART test now. I know disks fail, but the absence of obvious SMART data makes me wonder if I did something when I was in there. Is there anything physical I could have done to cause this?

JorgeB Posted May 30, 2020

7 minutes ago, danioj said:
    absence of obvious SMART data makes

There are some SMART issues:

    ID# ATTRIBUTE_NAME       FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate  POSR-K  200   200   051    -    11

This should be zero on a healthy WD drive. The fact that it isn't is not definite proof the disk is failing, but it's never a good sign, especially if it keeps climbing.

    Error 1 [0] occurred at disk power-on lifetime: 58318 hours (2429 days + 22 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER -- ST COUNT  LBA_48  LH LM LL DV DC
      -- -- -- == -- == == == -- -- -- -- --
      40 -- 51 00 00 00 00 12 78 3b e0 40 00  Error: UNC at LBA = 0x12783be0 = 309869536

This error (UNC @ LBA) usually also means a disk problem: a bad or failing sector. Looking at the power-on hours, you can see the error is recent. Again, it's not 100% conclusive, since I've seen similar errors logged like that when it wasn't a disk problem, but it usually is, and if the SMART test fails that will confirm it.
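
For anyone reading the register dump above and wondering where the LBA comes from, here is a minimal Python sketch of the arithmetic. It assumes the field order shown in the column header of the quoted log (ER, a reserved column, ST, two COUNT bytes, six LBA bytes with the most significant first, then DV and DC). The function names are made up for illustration and are not part of smartctl.

```python
# Register bytes copied from the error log quoted in the post above.
# Assumed layout (from the column header in that log):
# ER, --, ST, COUNT (2 bytes), LBA (6 bytes, most significant first), DV, DC
registers = "40 -- 51 00 00 00 00 12 78 3b e0 40 00".split()


def decode_lba48(fields):
    """Join the six LBA bytes into a single integer sector address."""
    lba_bytes = fields[5:11]  # skip ER, --, ST and the two COUNT bytes
    return int("".join(lba_bytes), 16)


def split_power_on_hours(hours):
    """Split a power-on-hours figure into (whole days, leftover hours)."""
    return divmod(hours, 24)


lba = decode_lba48(registers)
days, rest = split_power_on_hours(58318)

print(hex(lba), lba)  # 0x12783be0 309869536
print(days, rest)     # 2429 22
```

Both results agree with the figures smartctl printed in the log: the UNC error is at sector 309869536, logged at 2429 days + 22 hours of power-on time.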

danioj (Author) Posted May 30, 2020

Doh. Long SMART test failed.

unraid-smart-20200530-1831.zip

JorgeB Posted May 30, 2020

Yep, that confirms it really is a disk issue. Note also that this attribute is climbing, though slowly:

    ID# ATTRIBUTE_NAME       FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate  POSR-K  200   200   051    -    13
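
Comparing the two attribute rows quoted in this thread can be sketched in Python as below. This is only an illustration: it assumes the eight-column layout shown in the posts (ID#, name, flags, value, worst, threshold, fail, raw value), and `parse_attribute` is a hypothetical helper, not a smartctl feature.

```python
def parse_attribute(line):
    """Parse one SMART attribute row in the column layout quoted above."""
    parts = line.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "flags": parts[2],
        "value": int(parts[3]),   # normalized value
        "worst": int(parts[4]),
        "thresh": int(parts[5]),  # failure threshold for the normalized value
        "fail": parts[6],
        "raw": int(parts[7]),     # raw counter, the figure that is climbing
    }


# Rows copied from the first and second SMART reports in this thread.
before = parse_attribute("1 Raw_Read_Error_Rate POSR-K 200 200 051 - 11")
after = parse_attribute("1 Raw_Read_Error_Rate POSR-K 200 200 051 - 13")

delta = after["raw"] - before["raw"]
print(f"{after['name']}: raw value rose by {delta}")
print("normalized value still above threshold:", after["value"] > after["thresh"])
```

The normalized value (200) is still well above the threshold (51), which is why nothing is flagged in the FAIL column. Only the raw count shows the change, from 11 to 13.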

danioj (Author) Posted May 30, 2020 (edited)

This is the most obvious question and I'm sure it will result in the most obvious answer. I ask only because, while errors were reported, Parity Sync succeeded. Should I replace the disk?

Edited May 30, 2020 by danioj

itimpi Posted May 30, 2020

1 minute ago, danioj said:
    This is the most obvious question and I'm sure it will result in the most obvious answer. Should I replace the disk?

Yes. Any disk that fails the extended SMART test should be replaced.

JorgeB Posted May 30, 2020

16 minutes ago, danioj said:
    while errors were reported, Parity Sync succeeded.

That's because Unraid used parity plus the other disks to reconstruct those sectors. Those errors would be a problem if it were a disk rebuild instead.
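
The reconstruction described above can be illustrated with a small Python sketch. It assumes plain XOR single parity and uses made-up 4-byte "sectors" on three hypothetical data disks. It shows the idea only and is not Unraid's actual code.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Hypothetical contents of the same sector on three data disks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0A, 0x0B, 0x0C, 0x0D])  # pretend this sector returns a read error
disk3 = bytes([0xFF, 0x00, 0xFF, 0x00])

# Parity is the XOR of all data disks at that sector.
parity = xor_blocks([disk1, disk2, disk3])

# With disk2 unreadable, XOR-ing parity with the remaining disks recovers it.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == disk2)  # True
```

This works only while exactly one block per stripe is missing. If disk1 were being rebuilt and disk2 hit an unreadable sector at the same position, there would be two unknowns and a single parity block could not resolve them, which is why the same read errors would be a problem during a rebuild.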

danioj (Author) Posted May 30, 2020

JorgeB said:
    Because Unraid used parity plus the other disks to reconstruct those sectors, but those errors would be a problem if it was a disk rebuild instead.

However, I can still reconstruct this disk from that recent parity sync successfully, right? I haven't been able to do a parity check on it since I've had the read errors.

JorgeB Posted May 30, 2020

2 minutes ago, danioj said:
    However, I can reconstruct this disk off of that recent parity sync successfully though right?

Yep.