Disk failure, system not cooperating, unsure on how to proceed

spall · December 7, 2022

Hello,

unRAID version: 6.11.5

Hardware: see signature (Server 1). In addition, there are 3 SSDs in 2 cache pools, 1 unassigned device, and the flash drive.

Diagnostics attached.

TL;DR: Disk disabled, disk showing in both the array and Unassigned Devices, GUI half broken, can't shut down from the GUI, need advice.

Woke up to an error email with: Alert [SPOCK] - Disk 7 in error state (disk dsbl)

No big deal, it happens, so I decided to take a look at SMART and the logs before I swapped in a spare to rebuild. In the GUI the disk is showing with the red 'x'. If I try to run a short self test I get nothing, as if I didn't click the button.

So I decided to disable autostart and reboot the system. Trying to disable autostart does nothing when I hit apply. At this point I look at the dashboard and my CPU activity is at various percentages for each core, but is not changing at all. However, network activity is still dynamically updating. My next step was to stop the array which also does nothing.

Checking my log I see this:


Dec 7 14:12:05 spock emhttpd: error: hotplug_devices, 1730: No such file or directory (2): Error: tagged device WDC_WD40EFRX-68WT0N0_WD-WCC4EKPA0S73 was (sdu) is now (sdac)
Dec 7 14:12:05 spock emhttpd: read SMART /dev/sdac
Dec 7 14:12:05 spock kernel: emhttpd[8596]: segfault at 674 ip 000055a79241e9d4 sp 00007fffc7c76250 error 4 in emhttpd[55a79240c000+21000]
Dec 7 14:12:05 spock kernel: Code: 8e 27 01 00 48 89 45 f8 48 8d 05 72 27 01 00 48 89 45 f0 e9 79 01 00 00 8b 45 ec 89 c7 e8 89 b1 ff ff 48 89 45 d8 48 8b 45 d8 <8b> 80 74 06 00 00 85 c0 0f 94 c0 0f b6 c0 89 45 d4 48 8b 45 e0 48
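For anyone trying to read the segfault line above, here is a minimal Python sketch that pulls the fields apart. It assumes the usual x86-64 Linux layout of the message (hex fault address, instruction pointer, stack pointer, page-fault error code, then the mapped object with its base and size); `decode_segfault` and the dictionary keys are names made up for this illustration. The error-code bits follow the x86 page-fault convention (bit 0 = page present, bit 1 = write, bit 2 = user mode).

```python
import re

# The kernel line from the log above, minus the syslog prefix.
SEGFAULT_LINE = (
    "emhttpd[8596]: segfault at 674 ip 000055a79241e9d4 "
    "sp 00007fffc7c76250 error 4 in emhttpd[55a79240c000+21000]"
)

# Assumed message layout; all numeric fields are treated as hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault message into its numeric parts (illustrative helper)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("line does not look like a segfault message")
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "object": m["obj"],
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped binary.
        "offset_in_binary": ip - base,
        "page_present": bool(err & 0x1),  # 0 means the page was not mapped
        "write": bool(err & 0x2),         # 0 means the access was a read
        "user_mode": bool(err & 0x4),     # 1 means the fault came from user space
    }


info = decode_segfault(SEGFAULT_LINE)
print(info["object"], hex(info["fault_address"]), hex(info["offset_in_binary"]))
```

Read this way, the fault is a user-mode read of unmapped address 0x674 at offset 0x129d4 inside emhttpd. The marked bytes in the Code: line (`<8b> 80 74 06 00 00`) decode as a load from `[rax+0x674]`, which is consistent with a NULL pointer being dereferenced. Since it comes right after the "was (sdu) is now (sdac)" message, it is plausibly related to the renamed device, but that part is a guess.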

Stuff seems to be otherwise fine. I can access files in the array. I can run SMART on other devices. My Docker containers and VM are running (seemingly) normally. At this point I decided to grab diagnostics and check in here instead of doing a hard reboot and making life more difficult. So it is sitting in this state pending suggestions/advice.

Any help greatly appreciated. Thanks!

spock-diagnostics-20221207-1422.zip

Edited December 8, 2022 by spall

dillpickle · December 7, 2022

Would the changed disk id indicate a potential mobo failure?

spall · December 7, 2022

@dillpickle That seems unlikely given what I'm seeing here. I feel like it would be the drive or the backplane first. However, even if the drive or a port on the backplane took a dive, I don't know why the system would be presenting like it is right now.

I've been looking at everything I can think of, and looking at the logs again it appears around 4:15am is when stuff started to go south:


Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdu
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdr
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdaa
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdx
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdf
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdv
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sds
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdn
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdq
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdo
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdl
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdi
Dec 7 04:15:45 spock kernel: sd 5:0:15:0: attempting task abort!scmd(0x000000009c8f133e), outstanding for 30219 ms & timeout 30000 ms
Dec 7 04:15:45 spock kernel: sd 5:0:15:0: [sdu] tag#766 CDB: opcode=0x88 88 00 00 00 00 00 e8 e6 7d 10 00 00 00 18 00 00
Dec 7 04:15:45 spock kernel: scsi target5:0:15: handle(0x0019), sas_address(0x500304800143179c), phy(28)
Dec 7 04:15:45 spock kernel: scsi target5:0:15: enclosure logical id(0x50030480014317bf), slot(16)
Dec 7 04:15:46 spock kernel: sd 5:0:15:0: device_block, handle(0x0019)
Dec 7 04:15:47 spock kernel: sd 5:0:15:0: device_unblock and setting to running, handle(0x0019)
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: task abort: SUCCESS scmd(0x000000009c8f133e)
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419344
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419352
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419360
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419344
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419352
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: [sdu] Synchronizing SCSI cache
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: [sdu] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500304800143179c)
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: removing handle(0x0019), sas_addr(0x500304800143179c)
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(16)
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419360
Dec 7 04:15:50 spock kernel: mpt2sas_cm0: handle(0x19) sas_address(0x500304800143179c) port_type(0x1)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: Direct-Access ATA WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 6
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: SATA: handle(0x0019), sas_addr(0x500304800143179c), phy(28), device_name(0x0000000000000000)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: enclosure logical id (0x50030480014317bf), slot(16)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: Attached scsi generic sg20 type 0
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: Power-on or device reset occurred
Dec 7 04:15:51 spock kernel: end_device-5:0:24: add: handle(0x0019), sas_addr(0x500304800143179c)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] 4096-byte physical blocks
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Write Protect is off
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Mode Sense: 7f 00 10 08
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Write cache: enabled, read cache: enabled, supports DPO and FUA
Dec 7 04:15:51 spock kernel: sdac: sdac1
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Attached SCSI disk
Dec 7 04:15:52 spock unassigned.devices: Disk with ID 'WDC_WD40EFRX-68WT0N0_WD-WCC4EKPA0S73 (sdac)' is not set to auto mount.
Dec 7 04:16:02 spock sSMTP[15149]: Creating SSL connection to host
Dec 7 04:16:02 spock sSMTP[15149]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Dec 7 04:16:04 spock sSMTP[15149]: Sent mail for [email protected] (221 biz142.inmotionhosting.com closing connection) uid=0 username=xxx outbytes=841
Dec 7 04:16:04 spock sSMTP[15235]: Creating SSL connection to host
Dec 7 04:16:04 spock sSMTP[15235]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Dec 7 04:16:06 spock sSMTP[15235]: Sent mail for [email protected] (221 biz142.inmotionhosting.com closing connection) uid=0 username=xxx outbytes=870
Dec 7 04:16:25 spock kernel: md: disk7 read error, sector=3907127336
Dec 7 04:16:25 spock kernel: md: disk7 write error, sector=3907127336
Dec 7 04:16:46 spock kernel: md: disk7 read error, sector=3907518648
Dec 7 04:16:46 spock kernel: md: disk7 write error, sector=3907518648
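As a cross-check on the excerpt above, the aborted command and the first md read errors can be tied together by decoding the CDB. This is a minimal Python sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and assuming the array partition starts at sector 64, which is not stated anywhere in this thread. `decode_read16` and `PARTITION_START` are names invented for the example.

```python
from typing import Tuple

# Assumption: the data partition on the array disk starts at sector 64.
PARTITION_START = 64


def decode_read16(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, sector_count) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # logical block address on the whole disk
    count = int.from_bytes(cdb[10:14], "big")  # number of 512-byte sectors requested
    return lba, count


# CDB bytes copied from the "tag#766 CDB" line in the log.
cdb_from_log = "88 00 00 00 00 00 e8 e6 7d 10 00 00 00 18 00 00"
lba, count = decode_read16(cdb_from_log)
md_sector = lba - PARTITION_START  # sector number relative to the partition

print(lba, count, md_sector)  # 3907419408 24 3907419344
```

Under those assumptions, the timed-out command was a 24-sector read whose first sector lines up with the first "disk7 read error, sector=3907419344" entry, and 24 sectors covers the three 8-sector steps reported (…344, …352, …360).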

It looks like SMART reads were running and, from what I can decipher, a command to drive 7 took longer than the 30-second timeout and was aborted. There were then some read/write errors, after which drive 7 behaved as if it had been unplugged or power cycled (?) and went from being sdu to sdac. So now drive 7 (sdu) is being emulated, and the physical drive is showing as sdac in Unassigned Devices in an unmountable state. The spot where you would normally see the Mount/Unmount button is grayed out and says 'Array'.

Again, I just don't understand why this would prevent me from stopping or shutting down the array (or doing some other things in the UI). I also don't understand what the segfault in the log from my first post is, but it seems like maybe it's the web interface that is 30% hosed. I'm hesitant to shut down from the command line until somebody with more experience gives me their blessing.

JorgeB · December 8, 2022

Disk dropped offline and reconnected, check/replace cables and post new diags after array start.
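The drop and reconnect can be seen in the posted excerpt by following the SAS address: the same address shows up first on SCSI target 5:0:15 (sdu) and then on 5:0:24 (sdac). Below is a rough Python sketch of that correlation, using five lines copied from the log. The regular expressions are assumptions about the mpt2sas/sd message format, and `find_reconnects` is a hypothetical helper, not an Unraid tool.

```python
import re
from collections import defaultdict

# Lines copied verbatim from the syslog excerpt earlier in the thread.
SAMPLE = """\
Dec 7 04:15:45 spock kernel: sd 5:0:15:0: [sdu] tag#766 CDB: opcode=0x88 88 00 00 00 00 00 e8 e6 7d 10 00 00 00 18 00 00
Dec 7 04:15:45 spock kernel: scsi target5:0:15: handle(0x0019), sas_address(0x500304800143179c), phy(28)
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: removing handle(0x0019), sas_addr(0x500304800143179c)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: SATA: handle(0x0019), sas_addr(0x500304800143179c), phy(28), device_name(0x0000000000000000)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Attached SCSI disk
"""

# Assumed message shapes: "sd H:C:T:L:", "scsi H:C:T:L:" or "targetH:C:T:".
SCSI_ADDR = re.compile(r"\b(?:sd |scsi |target)(\d+:\d+:\d+)")
SAS_ADDR = re.compile(r"sas_addr(?:ess)?\((0x[0-9a-f]+)\)")
SD_NAME = re.compile(r"\[(sd[a-z]+)\]")


def find_reconnects(log_text):
    """Map each SAS address seen on more than one SCSI target to (target, sdX) pairs."""
    targets_by_sas = defaultdict(list)
    name_by_target = {}
    for line in log_text.splitlines():
        addr = SCSI_ADDR.search(line)
        if addr is None:
            continue  # line does not name a SCSI target
        target = addr.group(1)
        sas = SAS_ADDR.search(line)
        if sas and target not in targets_by_sas[sas.group(1)]:
            targets_by_sas[sas.group(1)].append(target)
        name = SD_NAME.search(line)
        if name:
            name_by_target[target] = name.group(1)
    return {
        sas: [(t, name_by_target.get(t)) for t in targets]
        for sas, targets in targets_by_sas.items()
        if len(targets) > 1
    }


for sas, seen in find_reconnects(SAMPLE).items():
    print(sas, seen)
```

A SAS address that appears on two different targets within a few seconds, with the same enclosure slot, points to the same physical disk re-enumerating rather than a second disk appearing.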

spall · December 8, 2022

@JorgeB Thanks, can do, but for clarity, the cable in this case is a single SFF8087 mini-SAS cable going from my LSI 9211-8i to a 24 port backplane. I have spares of everything (except the backplane), so I could technically swap out most of the chain. I'm fairly new to expanders. My last backplane was direct attached. Does it make sense that 1 port in the expander would have toggled in this scenario? I guess I'm asking: if it happens again on the same port, do I need to consider that the backplane is being fussy?

But what really confuses me is why the disk dropping offline for a moment would have hosed the whole UI. And why the segfault? Or is that just a happy coincidence? And if so, how do I determine that?

The way my brain works, I know if I swap stuff out and it seems fine, my brain is gonna say "intermittent problem" and not "solved". Heh.

Thanks!

EDIT: btw, is there a way for me to tell unRAID everything is fine, go ahead? Or am I forced into a disk rebuild anyway because the drive was disabled? I'm thinking I could swap the drive anyway instead of rebuilding over the data just for an extra layer of safety.

Edited December 8, 2022 by spall

JorgeB · December 9, 2022

11 hours ago, spall said:

the cable in this case is a single SFF8087 mini-SAS cable going from my LSI 9211-8i to a 24 port backplane.

In that case, I recommend using a different bay to rule that out.

spall · December 12, 2022

I had already put it back in the same bay, slot 7, and everything is fine (so far) after a data rebuild. I needed to minimize downtime, so I replaced everything feasible in the chain: the drive, the HBA, and the mini-SAS cable. I reseated all the molex connectors to the backplane. I'm going to test the removed components in my test server for a bit and see if anything crops up.

I'll mark this as solved for now. I suppose if it happens again _and_ it's slot 7, I'll have to investigate somehow if the backplane or power distribution board is faulty.

Thanks!
