[6.10.2] Broken HDD or other issue?



While a (non-correcting) Parity check was running, I received a notification from my UnRAID server with the message "Disk 1 in error state (disk dsbl)", and on the "Main" page there's a red "X" where the green dot used to be next to the drive name. 

 

The data's all still there thanks to the parity drives - but how do I figure out what happened? The affected disk has apparently been disconnected from the OS: /dev/sdd (which used to be that disk) no longer exists, and I can't view SMART data through the web interface either. 

 

The OS log started with these: 

[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=4s
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 Sense Key : 0x3 [current] 
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 ASC=0x11 ASCQ=0x0 
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 22 f0 00 00 04 00 00 00
[Fri Aug  5 14:49:19 2022] blk_update_request: critical medium error, dev sdd, sector 186918248 op 0x0:(READ) flags 0x0 phys_seg 49 prio class 0
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918184
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918192
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918200

and then a bunch of these: 

 

[Fri Aug  5 14:50:57 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000057bd43b8), outstanding for 94137 ms & timeout 30000 ms
[Fri Aug  5 14:50:57 2022] sd 5:0:0:0: [sdd] tag#1085 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 6d c0 00 00 01 30 00 00
[Fri Aug  5 14:50:57 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:50:57 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000057bd43b8)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000021903280), outstanding for 34003 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1084 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 93 38 00 00 03 88 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000021903280) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000021903280)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000043132d6d), outstanding for 34046 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1082 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 90 a8 00 00 02 90 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000043132d6d) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000043132d6d)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000098a39814), outstanding for 34106 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1081 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 90 50 00 00 00 58 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000098a39814) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000098a39814)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000011352615), outstanding for 39686 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1079 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 85 e8 00 00 00 d8 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000011352615) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000011352615)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000029d7771f), outstanding for 36822 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1077 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 8d 28 00 00 03 28 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000029d7771f) might have completed

 

The full log is in the diagnostics ZIP. 

 

What should be the next step to figure out whether this is a dead hard drive or some other issue that might go away on its own, given that I can't view SMART data?

 

Take the array offline and see what happens? Reboot and potentially lose helpful logs? Order a new HDD, shut down the server, and just replace the disk?

 

The "need help, read first" thread mentions to attach SMART reports, but that doesn't work given that I don't see the drive. Can I do any harm in rebooting the server to hopefully get the drive to show up to get SMART data? The overview on the Dashboard says "SMART healthy" for that drive, but if I click on "healthy" to get more information, that fails because the sdd drive is apparently offline. For now I shut down all VMs and Docker containers, but I'm not sure what I should do next. 

 

I have ordered a replacement drive in case it's needed, but due to the weekend it'll be a couple of days until it arrives.

 

Sorry for the stupid questions; this is my first (potentially) dead hard drive with UnRAID and I'm not exactly sure how to proceed.

tower-diagnostics-20220805-1713.zip


So is there anything special I need to watch out for?

Right now the server is running but the array is stopped. The broken disk has a red "X" and the others have their green bubble in the "Main" menu. 

 

Once I have the new drive, I just un-assign the broken drive, shut down the server, swap in the new drive, go to the "Main" menu and assign the new drive to the slot that had the failed drive in it, then start the array, and UnRAID will automatically rebuild the contents onto the new drive, as described at https://wiki.unraid.net/Replacing_a_Data_Drive ?

 

Or is there a difference in procedure between replacing a working data drive and replacing a failed one?

16 minutes ago, Leseratte10 said:

UnRAID will automatically rebuild the contents onto the new drive, as described at https://wiki.unraid.net/Replacing_a_Data_Drive ?

Yes.

17 minutes ago, Leseratte10 said:

need to watch out for?

That disk should be showing SMART warnings on the Dashboard page and probably has for some time.

 

Do you have Notifications set up to alert you immediately by email or another agent as soon as a problem is detected?


I have notifications enabled, and no, that disk hadn't been showing SMART warnings before; I only got the notification about SMART errors today. The last time I actively checked the dashboard was a couple of days ago and everything was fine.

 

And the notifications are definitely working, as I always see the ones related to Docker image updates. The only strange thing was a temperature warning for that disk today at 2:03am (49°C), then "returned to normal temp" at 3:23am, then "Array has 1 disk with read errors (49)" at 2:49pm, and then the panic alert "Disk 1 in error state (disk dsbl)" at 2:51pm. Other than that, no warnings. 

I should probably crank up the fan speeds a bit during the summer. Seagate says the drive can operate at up to 60°C and it never reached that, so I assumed it was fine.

 

That question was more about the drive replacement process itself, though.

I will post again once I have the new drive here and have replaced the broken one. Thanks for all the helpful answers so far.

16 hours ago, Leseratte10 said:

I should probably crank up the fan speeds a bit during the summer. Seagate says the drive can operate at up to 60°C and it never reached that, so I assumed it was fine.

 

 

Yeah, that's not a "probably", that's a "definitely". You're operating those drives close to their Tmax and they really don't like that. It can operate at up to 60°C, but that's not the optimal range for performance and, most importantly, longevity. I'm not surprised it just keeled over out of the blue rather than failing gracefully.

  • 2 weeks later...

Okay, since my last post I've ordered a bunch of additional fans for the machine and a new 18 TB drive. 

I've successfully precleared the new drive in another unRAID machine and then replaced the broken drive. 

The extra fans are installed and the data rebuild from parity is currently running. Drives are all running at 34°C now.

 

Thanks for all the advice I've received so far.

 

I have two additional questions regarding that, though: 

 

A) During the first ~2 hours of the rebuild it was running very, very slowly, around 20-30 MB/s, and it said it would take about a week to rebuild the drive. After 2-3 hours it shot up to more realistic speeds of around 200 MB/s. Is this "normal", or is there anything else happening at the beginning of a rebuild that causes these slow speeds? I have the machine disconnected from the network and all VMs / Docker containers stopped, so there's nothing else using the disks. 
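
For scale, here's the back-of-the-envelope math behind that week-long estimate, assuming the full 18 TB has to be rebuilt (illustrative arithmetic only):

size_bytes = 18e12                      # 18 TB drive
for rate_mb_s in (25, 200):             # observed slow vs. normal rebuild speed
    hours = size_bytes / (rate_mb_s * 1e6) / 3600
    print(f"{rate_mb_s} MB/s -> ~{hours:.0f} hours (~{hours / 24:.1f} days)")

# 25 MB/s  -> ~200 hours (~8.3 days)
# 200 MB/s -> ~25 hours  (~1.0 days)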

 

B) I've noticed that during the rebuild all the other drives are hit with tons and tons of reads (duh, that's the point of parity: using the parity bits and all the other data bits to recreate the one missing drive). I'm wondering, though: since I have dual parity, why does it not leave one disk spun down? With two parity drives it could theoretically recreate two broken drives at the same time, so why not leave one of the disks alone and reduce the risk of an additional drive failing under the load of the rebuild? Why not leave Parity 2 spun down / not access it at all and just rebuild from Parity 1? Or, even better, leave the data disk that is most full (= has the most content at risk) spun down and use both parity disks instead?
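
For context, here's my mental model of single-parity reconstruction - a simplified XOR sketch, not unRAID's actual md driver code:

def rebuild_block(parity_block: bytes, surviving_blocks: list[bytes]) -> bytes:
    """Recreate the missing disk's block: missing = P ^ d1 ^ d2 ^ ... (XOR parity)."""
    missing = bytearray(parity_block)
    for block in surviving_blocks:
        for i, b in enumerate(block):
            missing[i] ^= b
    return bytes(missing)

# Three data disks, d2 failed: every surviving data disk plus Parity 1 must be read.
d1, d2, d3 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
parity1 = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))   # P = d1 ^ d2 ^ d3
assert rebuild_block(parity1, [d1, d3]) == d2               # d2 recovered

Swapping Parity 2 in for one of the data disks would need the Reed-Solomon (Q) math instead of a plain XOR, which is presumably harder to do on the fly - but that's exactly what I'm curious about.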

 

7 minutes ago, Leseratte10 said:

B) I've noticed that during the rebuild all the other drives are hit with tons and tons of reads (duh, that's the point of parity: using the parity bits and all the other data bits to recreate the one missing drive). I'm wondering, though: since I have dual parity, why does it not leave one disk spun down? With two parity drives it could theoretically recreate two broken drives at the same time, so why not leave one of the disks alone and reduce the risk of an additional drive failing under the load of the rebuild? Why not leave Parity 2 spun down / not access it at all and just rebuild from Parity 1? Or, even better, leave the data disk that is most full (= has the most content at risk) spun down and use both parity disks instead?

Although this might be theoretically possible, my guess would be that it is deemed a bit too difficult to implement reliably. Reading both parity drives could also provide an additional check that they agree on what should be on the rebuilt drive, although whether this is actually done I have no idea.


The data rebuild is now finished, all drives are green / healthy again and all the data is still there. 

The slow rebuild was (most likely) caused by the "Folder Caching" plugin (which has a setting to not run while the mover is running, but apparently no setting to not run during a parity rebuild ...), so that mystery is solved, too.

 

Thanks!

