spall Posted December 7, 2022 (edited)

Hello,

unRAID version: 6.11.5
Hardware: see signature (Server 1). In addition, there are 3 SSDs in 2 cache pools, 1 unassigned device, and the flash drive.
Diagnostics attached.

TL;DR: Disk disabled, disk showing in both the array and Unassigned Devices, GUI half broken, can't shut down from the GUI, need advice.

I woke up to an error email with:

Alert [SPOCK] - Disk 7 in error state (disk dsbl)

No big deal, it happens, so I decided to take a look at SMART and the logs before swapping in a spare to rebuild. In the GUI the disk is showing with the red 'x'. If I try to run a short self-test I get nothing, as if I hadn't clicked the button. So I decided to disable autostart and reboot the system, but trying to disable autostart does nothing when I hit Apply. At this point I looked at the dashboard: CPU activity shows various percentages for each core but is not changing at all, while network activity is still updating dynamically. My next step was to stop the array, which also does nothing. Checking my log I see this:

Spoiler
Dec 7 14:12:05 spock emhttpd: error: hotplug_devices, 1730: No such file or directory (2): Error: tagged device WDC_WD40EFRX-68WT0N0_WD-WCC4EKPA0S73 was (sdu) is now (sdac)
Dec 7 14:12:05 spock emhttpd: read SMART /dev/sdac
Dec 7 14:12:05 spock kernel: emhttpd[8596]: segfault at 674 ip 000055a79241e9d4 sp 00007fffc7c76250 error 4 in emhttpd[55a79240c000+21000]
Dec 7 14:12:05 spock kernel: Code: 8e 27 01 00 48 89 45 f8 48 8d 05 72 27 01 00 48 89 45 f0 e9 79 01 00 00 8b 45 ec 89 c7 e8 89 b1 ff ff 48 89 45 d8 48 8b 45 d8 <8b> 80 74 06 00 00 85 c0 0f 94 c0 0f b6 c0 89 45 d4 48 8b 45 e0 48

Everything else seems to be fine. I can access files in the array. I can run SMART on other devices. My Docker containers and VM are running (seemingly) normally. At this point I decided to grab diagnostics and check in here rather than do a hard reboot and make life more difficult. So it is sitting in this state pending suggestions/advice. Any help greatly appreciated.

Thanks!

spock-diagnostics-20221207-1422.zip

Edited December 8, 2022 by spall
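For anyone scanning a saved syslog for the same symptom, the following is a minimal Python sketch that pulls device renames out of the `hotplug_devices` message quoted above. It is not part of unRAID. The regular expression is derived from that single log line, so the message wording in other releases is an assumption, and `find_renames` is a hypothetical helper name.

```python
import re

# Pattern based on the one emhttpd message in this thread; other unRAID
# versions may word it differently (assumption).
RENAME_RE = re.compile(
    r"tagged device (?P<ident>\S+) was \((?P<old>sd[a-z]+)\) is now \((?P<new>sd[a-z]+)\)"
)


def find_renames(lines):
    """Return (identifier, old_name, new_name) for each rename message found."""
    renames = []
    for line in lines:
        match = RENAME_RE.search(line)
        if match:
            renames.append((match["ident"], match["old"], match["new"]))
    return renames


# Sample input: two lines copied from the log excerpt in the post above.
sample = [
    "Dec 7 14:12:05 spock emhttpd: error: hotplug_devices, 1730: No such file "
    "or directory (2): Error: tagged device WDC_WD40EFRX-68WT0N0_WD-WCC4EKPA0S73 "
    "was (sdu) is now (sdac)",
    "Dec 7 14:12:05 spock emhttpd: read SMART /dev/sdac",
]

for ident, old, new in find_renames(sample):
    print(f"{ident}: {old} -> {new}")
```

To run it against a real file, pass an open file object (for example the syslog inside a diagnostics zip) instead of `sample`; the path to that file depends on the system.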
dillpickle Posted December 7, 2022

Would the changed disk ID indicate a potential mobo failure?
spall (Author) Posted December 7, 2022

@dillpickle That seems unlikely given what I'm seeing here. I feel like it would be the drive or the backplane first. However, even if the drive or a port on the backplane took a dive, I don't know why the system would be presenting like it is right now.

I've been looking at everything I can think of, and looking at the logs again it appears that around 4:15am is when stuff started to go south:

Spoiler
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdu
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdr
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdaa
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdx
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdf
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdv
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sds
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdn
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdq
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdo
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdl
Dec 7 04:15:16 spock emhttpd: read SMART /dev/sdi
Dec 7 04:15:45 spock kernel: sd 5:0:15:0: attempting task abort!scmd(0x000000009c8f133e), outstanding for 30219 ms & timeout 30000 ms
Dec 7 04:15:45 spock kernel: sd 5:0:15:0: [sdu] tag#766 CDB: opcode=0x88 88 00 00 00 00 00 e8 e6 7d 10 00 00 00 18 00 00
Dec 7 04:15:45 spock kernel: scsi target5:0:15: handle(0x0019), sas_address(0x500304800143179c), phy(28)
Dec 7 04:15:45 spock kernel: scsi target5:0:15: enclosure logical id(0x50030480014317bf), slot(16)
Dec 7 04:15:46 spock kernel: sd 5:0:15:0: device_block, handle(0x0019)
Dec 7 04:15:47 spock kernel: sd 5:0:15:0: device_unblock and setting to running, handle(0x0019)
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: task abort: SUCCESS scmd(0x000000009c8f133e)
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419344
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419352
Dec 7 04:15:49 spock kernel: md: disk7 read error, sector=3907419360
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419344
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419352
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: [sdu] Synchronizing SCSI cache
Dec 7 04:15:49 spock kernel: sd 5:0:15:0: [sdu] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500304800143179c)
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: removing handle(0x0019), sas_addr(0x500304800143179c)
Dec 7 04:15:49 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(16)
Dec 7 04:15:49 spock kernel: md: disk7 write error, sector=3907419360
Dec 7 04:15:50 spock kernel: mpt2sas_cm0: handle(0x19) sas_address(0x500304800143179c) port_type(0x1)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: Direct-Access ATA WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 6
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: SATA: handle(0x0019), sas_addr(0x500304800143179c), phy(28), device_name(0x0000000000000000)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: enclosure logical id (0x50030480014317bf), slot(16)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Dec 7 04:15:51 spock kernel: scsi 5:0:24:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: Attached scsi generic sg20 type 0
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: Power-on or device reset occurred
Dec 7 04:15:51 spock kernel: end_device-5:0:24: add: handle(0x0019), sas_addr(0x500304800143179c)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] 4096-byte physical blocks
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Write Protect is off
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Mode Sense: 7f 00 10 08
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Write cache: enabled, read cache: enabled, supports DPO and FUA
Dec 7 04:15:51 spock kernel: sdac: sdac1
Dec 7 04:15:51 spock kernel: sd 5:0:24:0: [sdac] Attached SCSI disk
Dec 7 04:15:52 spock unassigned.devices: Disk with ID 'WDC_WD40EFRX-68WT0N0_WD-WCC4EKPA0S73 (sdac)' is not set to auto mount.
Dec 7 04:16:02 spock sSMTP[15149]: Creating SSL connection to host
Dec 7 04:16:02 spock sSMTP[15149]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Dec 7 04:16:04 spock sSMTP[15149]: Sent mail for [email protected] (221 biz142.inmotionhosting.com closing connection) uid=0 username=xxx outbytes=841
Dec 7 04:16:04 spock sSMTP[15235]: Creating SSL connection to host
Dec 7 04:16:04 spock sSMTP[15235]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Dec 7 04:16:06 spock sSMTP[15235]: Sent mail for [email protected] (221 biz142.inmotionhosting.com closing connection) uid=0 username=xxx outbytes=870
Dec 7 04:16:25 spock kernel: md: disk7 read error, sector=3907127336
Dec 7 04:16:25 spock kernel: md: disk7 write error, sector=3907127336
Dec 7 04:16:46 spock kernel: md: disk7 read error, sector=3907518648
Dec 7 04:16:46 spock kernel: md: disk7 write error, sector=3907518648

It looks like SMART reads were running and, from what I can decipher, a command to disk 7 took longer than the 30-second timeout, the task was aborted, there were some read/write errors, and then disk 7 behaved as if it had been unplugged or power cycled (?) and went from being sdu to sdac.

So now disk 7 (sdu) is being emulated, and the physical drive is showing as sdac in Unassigned Devices in an unmountable state: the spot where you would normally see the Mount/Unmount button is grayed out and says 'Array'.

Again, I just don't understand why this would prevent me from stopping or shutting down the array (or doing some other things in the UI). I also don't understand what the segfault in the log from my first post is about, but it seems like maybe it's the web interface that is 30% hosed.

I'm hesitant to shut down from the command line until somebody with more experience gives me their blessing.
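The timed-out command in the log above can be decoded from its CDB bytes. The sketch below assumes the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13); `decode_read16` is a hypothetical helper name, and the two constants are copied from the log lines in the post.

```python
def decode_read16(cdb_hex):
    """Decode a READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of logical blocks requested.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks


# CDB bytes from the "tag#766 CDB: opcode=0x88" line in the log.
CDB = "88 00 00 00 00 00 e8 e6 7d 10 00 00 00 18 00 00"
# First sector from the "md: disk7 read error" lines in the log.
FIRST_MD_ERROR_SECTOR = 3907419344

lba, blocks = decode_read16(CDB)
print("LBA:", lba, "blocks:", blocks)
print("offset from first md error sector:", lba - FIRST_MD_ERROR_SECTOR)
```

With these inputs the command is a 24-block read starting at LBA 3907419408. The three md read errors (sectors 3907419344, 3907419352, 3907419360) are 8 sectors apart, which would cover the same 24 sectors, and the 64-sector difference would fit a data partition that starts at sector 64 on the disk. That last point is an assumption about this disk's layout, not something the log states.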
JorgeB Posted December 8, 2022 (Solution)

The disk dropped offline and reconnected. Check/replace the cables and post new diags after array start.
spall (Author) Posted December 8, 2022 (edited)

@JorgeB Thanks, can do, but for clarity, the cable in this case is a single SFF-8087 mini-SAS cable going from my LSI 9211-8i to a 24-port backplane. I have spares of everything (except the backplane), so I could technically swap out most of the chain.

I'm fairly new to expanders. My last backplane was direct attached. Does it make sense that 1 port in the expander would have toggled in this scenario? I guess I'm asking: if it happens again, on the same port, do I need to consider that the backplane is being fussy?

But what really confuses me is why the disk dropping offline for a moment would have hosed the whole UI. And why the segfault? Or is that just a happy coincidence? And if so, how do I determine that?

The way my brain works, I know if I swap stuff out and it seems fine, my brain is gonna say "intermittent problem" and not "solved". Heh.

Thanks!

EDIT: btw, is there a way for me to tell unRAID everything is fine, go ahead? Or am I forced into a disk rebuild anyway because the drive was disabled? I'm thinking I could swap the drive anyway instead of rebuilding over the data, just for an extra layer of safety.

Edited December 8, 2022 by spall
JorgeB Posted December 9, 2022

11 hours ago, spall said: "the cable in this case is a single SFF8087 mini-SAS cable going from my LSI 9211-8i to a 24 port backplane."

In that case, I recommend using a different bay to rule that out.
spall (Author) Posted December 12, 2022

I had already put it back in the same bay, slot 7, and everything is fine (so far) after a data rebuild. I needed to minimize downtime, so I replaced everything feasible in the chain: the drive, the HBA, and the mini-SAS cable. I also reseated all the Molex connectors to the backplane. I'm going to test the removed components in my test server for a bit and see if anything crops up.

I'll mark this as solved for now. I suppose if it happens again _and_ it's slot 7, I'll have to investigate somehow whether the backplane or power distribution board is faulty.

Thanks!
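One way to check whether future drops keep hitting the same position is to tally device removals per enclosure slot from the syslog. The following is a minimal Python sketch, assuming the mpt2sas/mpt3sas driver keeps logging a "removing handle" line followed by an "enclosure logical id(...), slot(N)" line as in the excerpt earlier in this thread; `count_drops_by_slot` is a hypothetical helper name. Note that the slot number reported by the driver (16 in that excerpt) is the enclosure's own numbering and need not match the bay label or the unRAID disk number.

```python
import re
from collections import Counter

# Both patterns are modelled on the mpt2sas_cm0 lines quoted in this thread;
# other driver versions may format them differently (assumption).
REMOVE_RE = re.compile(r"mpt[23]sas_cm\d+: removing handle\(0x[0-9a-fA-F]+\)")
SLOT_RE = re.compile(
    r"mpt[23]sas_cm\d+: enclosure logical id\s*\(0x[0-9a-fA-F]+\), slot\((\d+)\)"
)


def count_drops_by_slot(lines):
    """Count device removals per enclosure slot.

    A removal is counted when a "removing handle" line is followed by the
    driver's matching "enclosure logical id ..., slot(N)" line.
    """
    drops = Counter()
    pending = False
    for line in lines:
        if REMOVE_RE.search(line):
            pending = True
        elif pending:
            match = SLOT_RE.search(line)
            if match:
                drops[int(match.group(1))] += 1
                pending = False
    return drops


# Sample input: three lines copied from the 04:15:49 part of the log.
sample = [
    "Dec 7 04:15:49 spock kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: "
    "removed: sas_addr(0x500304800143179c)",
    "Dec 7 04:15:49 spock kernel: mpt2sas_cm0: removing handle(0x0019), "
    "sas_addr(0x500304800143179c)",
    "Dec 7 04:15:49 spock kernel: mpt2sas_cm0: enclosure logical id"
    "(0x50030480014317bf), slot(16)",
]

for slot, count in sorted(count_drops_by_slot(sample).items()):
    print(f"slot {slot}: {count} removal(s)")
```

If repeated removals cluster on one slot even after the drive, HBA, and cable have been swapped, that would point toward the bay or backplane rather than the replaced parts.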