[6.10.2] Broken HDD or other issue?


While a (non-correcting) Parity check was running, I received a notification from my UnRAID server with the message "Disk 1 in error state (disk dsbl)", and on the "Main" page there's a red "X" where the green dot used to be next to the drive name. 

 

The data is all still there thanks to the parity drives - but how do I figure out what happened? The affected disk has apparently been disconnected from the OS: /dev/sdd (which used to be that disk) no longer exists, and I can't view SMART data through the web interface either.
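
In case it helps with the diagnosis: I assume something like the following could be run from the UnRAID console to double-check whether the kernel still sees the disk at all (just a rough sketch - it assumes the usual Linux tools are available on the console, and /dev/sdd is only the device name the disk used to have on my box):

lsblk -o NAME,SIZE,MODEL,SERIAL          # list every block device the kernel currently knows about
ls -l /dev/disk/by-id/ | grep -i sdd     # see whether any persistent name still points at the old device node (example name)
smartctl -x /dev/sdd                     # try to read SMART directly; this fails if the drive has dropped off the bus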

 

The OS log started with these: 

[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=4s
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 Sense Key : 0x3 [current] 
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 ASC=0x11 ASCQ=0x0 
[Fri Aug  5 14:49:19 2022] sd 5:0:0:0: [sdd] tag#1005 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 22 f0 00 00 04 00 00 00
[Fri Aug  5 14:49:19 2022] blk_update_request: critical medium error, dev sdd, sector 186918248 op 0x0:(READ) flags 0x0 phys_seg 49 prio class 0
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918184
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918192
[Fri Aug  5 14:49:19 2022] md: disk1 read error, sector=186918200

and then a bunch of these: 

 

[Fri Aug  5 14:50:57 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000057bd43b8), outstanding for 94137 ms & timeout 30000 ms
[Fri Aug  5 14:50:57 2022] sd 5:0:0:0: [sdd] tag#1085 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 6d c0 00 00 01 30 00 00
[Fri Aug  5 14:50:57 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:50:57 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000057bd43b8)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000021903280), outstanding for 34003 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1084 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 93 38 00 00 03 88 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000021903280) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000021903280)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000043132d6d), outstanding for 34046 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1082 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 90 a8 00 00 02 90 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000043132d6d) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000043132d6d)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000098a39814), outstanding for 34106 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1081 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 90 50 00 00 00 58 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000098a39814) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000098a39814)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000011352615), outstanding for 39686 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1079 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 85 e8 00 00 00 d8 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000011352615) might have completed
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000011352615)
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: attempting task abort!scmd(0x0000000029d7771f), outstanding for 36822 ms & timeout 30000 ms
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: [sdd] tag#1077 CDB: opcode=0x88 88 00 00 00 00 00 0b 24 8d 28 00 00 03 28 00 00
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: handle(0x000a), sas_address(0x4433221102000000), phy(2)
[Fri Aug  5 14:51:01 2022] scsi target5:0:0: enclosure logical id(0x500605b00c2a3ad0), slot(1) 
[Fri Aug  5 14:51:01 2022] sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000029d7771f) might have completed

 

The full log is in the diagnostics ZIP. 

 

What should the next step be to figure out whether this is a dead hard drive or some other issue that might go away on its own, given that I can't view SMART data?

 

Take the array offline and see what happens? Reboot and potentially lose helpful logs? Order a new HDD, shut down the server, and just replace the disk?

 

The "need help, read first" thread mentions to attach SMART reports, but that doesn't work given that I don't see the drive. Can I do any harm in rebooting the server to hopefully get the drive to show up to get SMART data? The overview on the Dashboard says "SMART healthy" for that drive, but if I click on "healthy" to get more information, that fails because the sdd drive is apparently offline. For now I shut down all VMs and Docker containers, but I'm not sure what I should do next. 

 

I have ordered a replacement drive in case it's needed, but because of the weekend it'll be a couple of days until it arrives.

 

Sorry for the stupid questions; this is my first (potentially) dead hard drive with UnRAID and I'm not exactly sure how to proceed.

tower-diagnostics-20220805-1713.zip


So is there anything special I need to watch out for?

Right now the server is running but the array is stopped. The broken disk has a red "X" and the others have their green bubble in the "Main" menu. 

 

Once I have the new drive, I just un-assign the broken drive, shut down the server, replace the drive, go into the "Main" menu and assign the new drive to the slot that had the failed drive in it, then start the array, and UnRAID will automatically re-build the contents onto the new drive, as described at https://wiki.unraid.net/Replacing_a_Data_Drive ?
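
And while the rebuild runs, I assume progress can be watched either on the "Main" page or from the console with something like this (a rough sketch, assuming UnRAID's md driver exposes its usual status output):

cat /proc/mdstat                 # UnRAID's md driver reports the rebuild/resync position here
mdcmd status | grep -i resync    # the same counters via UnRAID's mdcmd tool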

 

Or is there a difference in procedure with replacing a working data drive vs replacing a failed data drive?

16 minutes ago, Leseratte10 said:

UnRAID will automatically re-build the contents onto the new drive, as described at https://wiki.unraid.net/Replacing_a_Data_Drive ?

yes

17 minutes ago, Leseratte10 said:

need to watch out for?

That disk should be showing SMART warnings on the Dashboard page and probably has for some time.

 

Do you have Notifications set up to alert you immediately by email or another agent as soon as a problem is detected?

I have notifications enabled, and no, that disk wasn't showing SMART warnings before. I got the notification about SMART errors just today. The last time I actively checked the dashboard was a couple of days ago and everything was fine.

 

And the notifications are definitely working, as I always see the ones related to Docker image updates. The only strange thing was a temperature warning for that disk today at 2:03am (49°C), then "returned to normal temp" at 3:23am, then "Array has 1 disk with read errors (49)" at 2:49pm, and then the panic alert "Disk 1 in error state (disk dsbl)" at 2:51pm. Other than that, no warnings.

I should probably crank up the fan speeds a bit during the summer. Seagate says the drive can operate at up to 60°C and it never reached that, so I assumed it was fine.
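
To keep a better eye on this, I'll probably also check the temperatures from the console now and then instead of relying only on dashboard warnings - something along these lines (a rough sketch; the /dev/sd? glob also matches the USB flash drive, which simply won't report useful SMART values):

for d in /dev/sd?; do
    echo "== $d =="
    smartctl -A "$d" | grep -i -e temperature -e airflow    # print the SMART temperature attributes for each drive
done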

 

That question was more about the drive replacement process itself, though.

I will post again once the new drive is here and I've replaced the failed one. Thanks for all the helpful answers so far.

16 hours ago, Leseratte10 said:

I should probably crank up the fan speeds a bit during the summer. Seagate says the drive can operate at up to 60°C and it never reached that, so I assumed it was fine.

 

 

Yeah, that's not a "probably", that's a "definitely". You're operating those drives close to their Tmax and they really don't like that. The drive can operate at up to 60°C, but that's not the optimal range for performance or, most importantly, longevity. I'm not surprised it keeled over out of the blue rather than failing gracefully.
