BlkNTan Posted February 23, 2022

Hi all - first off, here is the overview of my server:

Unraid v: 6.9.2 (trial)
Plugins: CA, FCP, My Servers, Nerd Tools, Tips and Tweaks, Unassigned Devices, Dynamix System Buttons
No dockers or VMs
Hardware: ASUS P8B-E/4L mobo, Xeon CPU E3-1220 V2 @ 3.10GHz, 16 GB ECC RAM, 6 total disks all connected through onboard Intel C204 chipset controller
Array:
Parity: (1) WD60EFZX - 6TB
Data disks: (4) WD30EFRX - 3TB
Cache: (1) Samsung 870 EVO 500GB

Background: I originally set this hardware up about 9 years ago with Win Server 2012 Essentials and it pretty much ran fine until recently. The OS drive (Samsung 840 Pro) seemed to have a disk error and became unbootable. I restored the backup to a new 870 EVO, but then started having a problem with one of the disks in the storage pool (a WD30EFRX). Since I knew I was overdue to move to a new OS (that wasn't 10 years old!), I decided to scrap the whole thing and try Unraid instead of another version of Win Server.

I set the Unraid server up last week with a new WD60EFZX (6TB) as the parity drive and the 870 EVO (basically brand new) as the cache drive. I used 4 WD30EFRX's for the data disks. During the first parity check, one of my WD30EFRX's had read errors. I checked all the cables and replaced it with another (I have 6 total) - not the one that had the errors previously in the Win Server setup. This time around the parity check was all good, so I started up the array and began restoring all my data.

Problem: After restoring all my data, things seemed to be running well. Then I started working on setting up backups to iDrive using their Linux scripts. I got it pretty much figured out and ran two selective backups (as tests) successfully. On the third one, it hung up and then I lost connectivity to the WebGUI. I logged into the console directly but I could not get it to shut down cleanly - I read a lot of pages with explanations on how to do this, but apparently I didn't get all the commands to run successfully before rebooting.

After it rebooted, a parity check immediately started. During that check, Disk 3 started showing a ton of read errors. When it got to something like 48k read errors, Unraid disabled the disk and the rest of the parity check finished with no other errors (all other disks show healthy). I stopped the array, downloaded diagnostics, then shut down the server and checked all cables. I restarted the server and tried to run an extended SMART test on Disk 3. The test had hardly started before it came back with errors and stopped. Unfortunately I did not get diagnostics before the unclean restart, but I attached the diagnostics from after the parity check along with the SMART test report for Disk 3.

Questions:
1) Based on the attached reports, is Disk 3 definitely toast? (I assume it is)
2) Assuming yes to 1, then is this the proper procedure to follow to replace it: https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29
3) Should I assume the other old WD30EFRX's are likely to fail soon also? Essentially, I have lost 3 of 6 so far. I am running extended SMART tests on them currently.
4) My plan is to add a second parity disk (another WD60EFZX 6TB), a second cache disk (another 870 EVO 500GB) and replace Disk 3 with a new WD60EFZX 6TB. But I am wondering if the other 3 WD30EFRX's are ticking time bombs and I should just decommission and replace all of them?

That's all I can think of for now. I really appreciate any help and insights since I am new to Unraid and drinking the manual, guides, etc. like a fire hose currently, but still very green at this point!

bnt-unraid-smart-20220223-0909.zip
bnt-unraid-diagnostics-20220223-0908.zip
JorgeB Posted February 23, 2022

It's logged as a disk issue and the SMART test failed, so yes, the disk needs to be replaced.
JorgeB Posted February 23, 2022

8 minutes ago, BlkNTan said:
Assuming yes to 1, then is this the proper procedure to follow to replace it: https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29

Yes.

8 minutes ago, BlkNTan said:
Should I assume the other old WD30EFRX's are likely to fail soon also? Essentially, I have lost 3 of 6 so far. I am running extended SMART tests on them currently.

Wait for the extended test results, but if they just did a parity check they should be OK for now.
BlkNTan (Author) Posted February 23, 2022

Jorge - thank you so much for the quick response!! Quick follow-up question - is there anything specific in the SMART test report that I should be focusing on, or is it just the fact that it couldn't finish and had errors? In other words, assuming the other 3 disks pass their extended tests, is there anything I should look for in those results that might tell me the drive is OK now, but likely to have trouble soon? Thanks again!
JorgeB Posted February 23, 2022

With WD disks it's good practice to monitor these attributes:

  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    997
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0

A non-zero raw value is never a good sign, especially if it keeps climbing. The output above is from the failed disk; the other disks are still at 0, so they should be good for now.
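The check described in this reply can be sketched in a few lines of Python. This is only an illustration, not part of any Unraid tooling: it assumes the attribute rows arrive as text in the same eight-column layout quoted in the post (ID, name, flags, value, worst, threshold, fail, raw value), and the function names `parse_attributes` and `nonzero_watched` are made up for the example.

```python
# Sketch only: parse SMART attribute rows in the eight-column layout quoted
# in the post and flag watched attributes whose raw value is non-zero.
# The function names and the 8-column assumption are illustrative.

WATCHED_IDS = {1, 200}  # Raw_Read_Error_Rate, Multi_Zone_Error_Rate (from the post)


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for rows that look like attributes."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row has at least 8 fields and starts with a numeric ID.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        raw = fields[7]
        if not raw.isdigit():
            continue  # skip rows whose raw value is not a plain integer
        attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def nonzero_watched(attrs, watched=WATCHED_IDS):
    """Return the watched attributes that are present and have a non-zero raw value."""
    return {i: attrs[i] for i in sorted(watched) if i in attrs and attrs[i][1] != 0}


# The two rows quoted in the post (from the failed disk).
SAMPLE = """\
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    997
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in nonzero_watched(parse_attributes(SAMPLE)).items():
        print(f"attribute {attr_id} ({name}) raw value is {raw}")
```

A single non-zero reading is only a snapshot; the reply's point about a value that keeps climbing means the same check would need to be repeated over time and the results compared.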
itimpi Posted February 23, 2022

Passing the extended tests is a MUST for assuming a disk is healthy. Other signs of potential problems, even if the extended tests complete successfully, are reallocated sectors starting to increase or pending sectors being non-zero.
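The two warning signs named here can be checked by comparing SMART readings taken at different times. The sketch below is hypothetical: it assumes each snapshot is a plain dict of `{attribute_id: raw_value}` collected by some other means, and it uses the standard attribute IDs 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector). The function name `sector_warnings` is invented for the example.

```python
# Sketch only: compare two snapshots of raw SMART values taken at different
# times. The snapshot dicts ({attribute_id: raw_value}) are hypothetical
# inputs; how they are collected is deliberately left out.

REALLOCATED = 5   # Reallocated_Sector_Ct
PENDING = 197     # Current_Pending_Sector


def sector_warnings(previous, current):
    """Return warnings for a growing reallocated count or any pending sectors."""
    warnings = []
    before = previous.get(REALLOCATED, 0)
    now = current.get(REALLOCATED, 0)
    if now > before:
        warnings.append(f"reallocated sectors increased from {before} to {now}")
    pending = current.get(PENDING, 0)
    if pending != 0:
        warnings.append(f"pending sectors is non-zero ({pending})")
    return warnings


if __name__ == "__main__":
    # Made-up readings, purely to show the call shape.
    last_week = {REALLOCATED: 0, PENDING: 0}
    today = {REALLOCATED: 8, PENDING: 2}
    for message in sector_warnings(last_week, today):
        print(message)
```

Treating a missing attribute as 0 is a simplification; a stricter version might report missing attributes separately instead of assuming the disk is clean.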
BlkNTan (Author) Posted February 24, 2022

Thanks both for the replies! I did add attributes 1 and 200 to SMART monitoring based on another post - thanks for confirming. As an update, Disk 4 passed the extended test. Disk 1 is in progress. Reallocated and pending sectors are attribute numbers 5 and 197 - correct? Thanks again for all the help!!
BlkNTan (Author) Posted February 25, 2022

Update - the other 3 disks passed their extended tests. I added a replacement disk and the data rebuild finished with zero errors - yay! Next up is adding a second parity disk and a second cache disk.

In case anyone else reads this and is looking to learn more about SMART attributes, etc., I found these two articles helpful:
https://wiki.unraid.net/Understanding_SMART_Reports
https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/

Thanks again Jorge and itimpi!