Kevek79 Posted November 15, 2020 (edited)

Dear fellow UnRaiders,

this morning, during a nice breakfast with the family, my server sent me a notification that one of my drives was disabled. As this is my first real "red X" event in my time using Unraid, I want to make sure I do not make any mistakes.

The regular monthly parity check started at 04:00 this morning. The notification of the disabled drive arrived at around 08:30. The data is emulated; I did a quick spot check on the network shares, and data that resides on the emulated disk could be read. The array is still running, and the parity check was aborted by the server.

The drive in question is one of my oldest and should have been replaced by now. Fortunately, I have 2 precleared replacement drives (hot spares) available, so I should be in good shape for replacing the disabled drive.

Now I just want to make sure that I have the procedure for drive replacement correct, and I hope someone can look into my diagnostics to give me a hint about what went wrong (besides using old drives) and what the best way forward would be.

Procedure to replace a data drive:
1. Stop the array
2. Unassign drive 3 (Do I need to start the array once with no drive assigned to the slot of disk 3?)
3. Assign one of the hot spares to the disk 3 slot
4. Go to the Main -> Array Operation section
5. Put a check in the "Yes, I'm sure" checkbox (next to the information indicating the drive will be rebuilt)
6. Start the array

Is the procedure for disk replacement correct as described above? Can someone help me find out what caused the disk to be disabled?

Thanks in advance

morpheus-diagnostics-20201115-0945.zip

Edited November 15, 2020 by Kevek79: Diagnostics attached
itimpi Posted November 15, 2020

Your rebuild steps look fine (and there is no need for the extra step you queried; that is only required when trying to rebuild to the same disk).

Looking at the syslog, about 4 hours into the parity check you suddenly got read errors, starting with:

Nov 15 08:29:33 Morpheus kernel: sd 2:0:1:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 15 08:29:33 Morpheus kernel: sd 2:0:1:0: [sdd] tag#0 Sense Key : 0x3 [current]
Nov 15 08:29:33 Morpheus kernel: sd 2:0:1:0: [sdd] tag#0 ASC=0x11 ASCQ=0x0
Nov 15 08:29:33 Morpheus kernel: sd 2:0:1:0: [sdd] tag#0 CDB: opcode=0x28 28 00 db 27 25 b8 00 04 00 00
Nov 15 08:29:33 Morpheus kernel: print_req_error: critical medium error, dev sdd, sector 3676775864
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775800
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775808
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775816
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775824

and after a while you started getting write errors as well. I am not sure, but my guess is that this indicates something failing internally within the drive.
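For anyone who wants to read the excerpt above in more detail, here is a minimal Python sketch that decodes the CDB line and collects the md read errors. In SCSI terms, sense key 0x3 is "medium error" and ASC 0x11 / ASCQ 0x0 is "unrecovered read error"; opcode 0x28 is READ(10). The helper names (`decode_read10`, `md_read_errors`) are my own, and the regular expressions assume the log lines look exactly like the ones quoted in this thread.

```python
import re

# Log lines copied from the syslog excerpt quoted in this thread.
LOG = """\
Nov 15 08:29:33 Morpheus kernel: sd 2:0:1:0: [sdd] tag#0 CDB: opcode=0x28 28 00 db 27 25 b8 00 04 00 00
Nov 15 08:29:33 Morpheus kernel: print_req_error: critical medium error, dev sdd, sector 3676775864
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775800
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775808
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775816
Nov 15 08:29:33 Morpheus kernel: md: disk3 read error, sector=3676775824
"""


def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (starting LBA, number of blocks requested).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


def md_read_errors(log):
    """Return (disk, sector) pairs for every 'md: diskN read error' line."""
    pattern = r"md: (disk\d+) read error, sector=(\d+)"
    return [(disk, int(sector)) for disk, sector in re.findall(pattern, log)]


cdb_match = re.search(r"CDB: opcode=0x28 ((?:[0-9a-f]{2} ?){10})", LOG)
lba, blocks = decode_read10(cdb_match.group(1))
errors = md_read_errors(LOG)

print("READ(10) start LBA:", lba, "blocks:", blocks)
print("md read errors:", errors)
print("device sector minus first md sector:", lba - errors[0][1])
```

The decoded LBA matches the sector in the `print_req_error` line, so the failed command is the read the kernel complained about. The constant difference between the device sector and the first md sector looks like a fixed partition offset, but that is only an inference from these numbers.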
Kevek79 Posted November 15, 2020 (Author)

Thanks @itimpi for confirming the procedure. I did see this portion of the syslog, and I also think that the drive is just dying after its >60,000 h in operation (tough little spinner). I will start rebuilding onto a new disk as soon as I am back home. In the meantime, let's see if anyone else has seen these error messages before and knows where they come from.
Kevek79 Posted November 15, 2020 (Author)

So I am back in front of my server, and there is one more question for the gurus in the forum before I attempt the rebuild of my disabled disk.

As stated above, the server was in the middle of a scheduled parity check when disk 3 got disabled this morning. When I look at my Main page now, under Array Operation it says "read check in progress" and indicates that the read check is paused.

Is this expected behaviour? Do I need to stop the paused read check first before I stop the array? Could the paused parity / read check somehow interfere with the rebuild I am trying to achieve with the procedure above?

I just want to make sure that I do not make my situation worse. Thank you all for your help.
itimpi Posted November 15, 2020

A parity check would not normally be paused unless you either paused it manually or have the Parity Check Tuning plugin installed and have met the criteria you set for pauses to occur.

Stopping the array automatically abandons any array operation that is currently running.

You might want to post your current diagnostics so we can check the current state.
Kevek79 Posted November 15, 2020 (Author)

Current diagnostics attached. Besides stopping all Docker containers and stopping the scheduler from running the mover, nothing has changed since I pulled diagnostics this morning.

Thanks for investigating, @itimpi

morpheus-diagnostics-20201115-2138.zip
Kevek79 Posted November 16, 2020 (Author)

Quote: stopping the array automatically abandons any array operation that is currently running.

Do I understand correctly that stopping the array (not restarting the server) will also stop/cancel the read check that is currently paused for whatever reason? And that after restarting the array the system will not resume this paused check, but will instead start a rebuild of the disabled disk?

Can I go forward with stopping the array and assigning the hot spare to the slot of the disabled disk?

Has anyone had the chance to look at my latest diagnostics so we can narrow down what happened?

Thanks for all your support

Toby
trurl Posted November 16, 2020

6 hours ago, Kevek79 said: Can I go forward with stopping the array and assign the hotspare to the slot of the disabled disk?

Yes.
trurl Posted November 16, 2020

6 hours ago, Kevek79 said: latest diagnostics so we can narrow down what happened?

Looks like a disk problem. SMART attribute 1 is an important indicator on WD Red drives, so you should add it to the attributes monitored on those disks.
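For anyone who would like to check that attribute from a script as well as in the GUI, here is a rough Python sketch that pulls attribute 1 (Raw_Read_Error_Rate) out of a SMART attribute table. It assumes the usual column layout printed by `smartctl -A`. The sample table and its values are made up for illustration and are not taken from the diagnostics in this thread, and the helper name `raw_values` is my own.

```python
# Hypothetical sample in the column layout that `smartctl -A` normally prints.
# The numbers are illustrative only, not from the diagnostics in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
"""


def raw_values(table):
    """Map attribute ID -> (attribute name, raw value string)."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            result[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return result


attrs = raw_values(SAMPLE)
name, raw = attrs[1]
print(f"attribute 1 ({name}) raw value: {raw}")
```

To use it on a real drive you would feed it the text output of `smartctl -A` for that device instead of `SAMPLE`.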
Kevek79 Posted November 16, 2020 (Author)

Thanks to @trurl and @itimpi! The rebuild is now running. Will let you know how it goes.
Kevek79 Posted November 17, 2020 (Author)

After about 15 h of rebuild, the server is back up and running as usual. Thanks for all the help. Thread tagged as solved.