April 21

Hi all,

So, briefly, I had people doing work in my basement and someone slid a table over, which hit the latches on two Supermicro sleds, causing the drives to slide forward enough to break their connection.

Unraid 7.2.4, Supermicro 846, BPN-SAS2-846EL1, dual parity.

Last night I noticed two drives with grey balls that would not spin up. The log showed the following:

Syslog
Apr 20 18:04:11 spock kernel: sd 11:0:1:0: device_block, handle(0x000b)
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: device_unblock and setting to running, handle(0x000b)
Apr 20 18:04:13 spock kernel: sd 11:0:3:0: device_block, handle(0x000d)
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] tag#2754 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=1s
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] tag#2754 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] tag#2755 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=0s
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] tag#2755 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] Synchronizing SCSI cache
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Apr 20 18:04:13 spock kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500304800143178e)
Apr 20 18:04:13 spock kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x500304800143178e)
Apr 20 18:04:13 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(2)
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: device_unblock and setting to running, handle(0x000d)
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] tag#2772 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=1s
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] tag#2772 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] tag#2773 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=0s
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] tag#2773 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Apr 20 18:04:15 spock emhttpd: read SMART /dev/sdg
Apr 20 18:04:15 spock emhttpd: read SMART /dev/sdi
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] Synchronizing SCSI cache
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Apr 20 18:04:15 spock kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5003048001431790)
Apr 20 18:04:15 spock kernel: mpt2sas_cm0: removing handle(0x000d), sas_addr(0x5003048001431790)
Apr 20 18:04:15 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(4)
Apr 20 18:34:13 spock emhttpd: spinning down /dev/sdg
Apr 20 18:34:13 spock emhttpd: spinning down /dev/sdi
Apr 20 18:34:13 spock emhttpd: sdspin /dev/sdg down: 2
Apr 20 18:34:13 spock emhttpd: sdspin /dev/sdi down: 2
Apr 21 01:26:39 spock webgui: Successful login user root from 192.168.1.20
Apr 21 01:27:25 spock emhttpd: offline: WDC_WD140EDGZ-11B1PA0_XXXXXXXX (sdg) 512 27344764928
Apr 21 01:29:14 spock emhttpd: offline: WDC_WUH721414ALE604_XXXXXXXX (sdi) 512 27344764928

Once I discovered the physical issue, I shut the server down and reinserted the disks. After coming back online, the disks are disabled and being emulated. I am currently running a long S.M.A.R.T. test on both drives, but I expect them to be fine. There should not have been any write operations between when it happened and my discovery of the physical cause this morning. I immediately disabled Mover when I noticed the situation. However, I cannot be 100% certain.

What is the proper procedure here to get the drives back online into the array? Do I stop and start the array again? Or do I unassign those drives first, start/stop, and then reassign? Will a data rebuild be necessary? I've never encountered this situation before.

Any help is appreciated. Thanks for reading.
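For anyone reading a similar log, the removal events can be matched to physical slots with a short script. This is only a minimal sketch: it assumes the line formats seen in the syslog excerpt above, and the names `removed_drives` and `SAMPLE_LOG` are hypothetical, not part of Unraid or the mpt3sas driver.

```python
import re

# A few lines copied from the syslog in the post above.
SAMPLE_LOG = """\
Apr 20 18:04:11 spock kernel: sd 11:0:1:0: device_block, handle(0x000b)
Apr 20 18:04:13 spock kernel: sd 11:0:3:0: device_block, handle(0x000d)
Apr 20 18:04:13 spock kernel: sd 11:0:1:0: [sdg] Synchronizing SCSI cache
Apr 20 18:04:13 spock kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x500304800143178e)
Apr 20 18:04:13 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(2)
Apr 20 18:04:15 spock kernel: sd 11:0:3:0: [sdi] Synchronizing SCSI cache
Apr 20 18:04:15 spock kernel: mpt2sas_cm0: removing handle(0x000d), sas_addr(0x5003048001431790)
Apr 20 18:04:15 spock kernel: mpt2sas_cm0: enclosure logical id(0x50030480014317bf), slot(4)
"""

# Patterns are assumptions based only on the lines shown in this thread.
BLOCK = re.compile(r"sd (\d+:\d+:\d+:\d+): device_block, handle\((0x[0-9a-f]+)\)")
NAME = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")
REMOVING = re.compile(r"removing handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)")
SLOT = re.compile(r"enclosure logical id\(0x[0-9a-f]+\), slot\((\d+)\)")


def removed_drives(log_text):
    """Return one dict per removed drive: handle, sas_addr, slot, device."""
    handle_to_scsi = {}  # handle -> SCSI address such as "11:0:1:0"
    scsi_to_dev = {}     # SCSI address -> device name such as "sdg"
    events = []
    pending = None       # removal seen, waiting for its slot line
    for line in log_text.splitlines():
        if (m := BLOCK.search(line)):
            handle_to_scsi[m.group(2)] = m.group(1)
        elif (m := NAME.search(line)):
            scsi_to_dev[m.group(1)] = m.group(2)
        elif (m := REMOVING.search(line)):
            pending = {"handle": m.group(1), "sas_addr": m.group(2)}
        elif (m := SLOT.search(line)) and pending is not None:
            pending["slot"] = int(m.group(1))
            scsi = handle_to_scsi.get(pending["handle"])
            pending["device"] = scsi_to_dev.get(scsi)
            events.append(pending)
            pending = None
    return events


for event in removed_drives(SAMPLE_LOG):
    print(event)
```

On the excerpt above this pairs sdg with slot 2 and sdi with slot 4. Whether the reported slot numbers start at 0 or 1 on the chassis labels is not something the log states, so check against the physical bays.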
April 22 Author
@trurl I didn't think it relevant since the cause is known: the drives were severed from their SATA connections. I'll attach both diagnostics, one from before taking the server down to reinsert the drives and the other as it is now. I'm trying to figure out the proper procedure to get the array back online. The long S.M.A.R.T. tests are about 40% through at this point. As I said, I expect the drives will come back fine. Thanks!
spock-diagnostics-20260421-0146.zip
spock-diagnostics-20260421-2248.zip
April 22 Solution
Since both emulated disks are mounting, and assuming the contents look correct, the recommendation would be to rebuild on top.
April 22
https://docs.unraid.net/unraid-os/using-unraid-to/manage-storage/array/replacing-disks-in-array/#re-enabling-a-disabled-disk-rebuilding-onto-itself
April 22
10 hours ago, spall said: "didn't think it relevant"
We were missing this important info.
6 hours ago, JorgeB said: "both emulated disks are mounting"
Didn't want you to rebuild an unmountable filesystem.
April 22 Author
@trurl @JorgeB Great. I appreciate the help.

A few questions since I'm still waiting for the long S.M.A.R.T. tests to finish:

1) Do I unassign/reassign both drives at the same time? Or do I rebuild one and then the other?
2) Since I keep two spares, would it make more sense to swap in the spares for the rebuilds? I'm assuming I could mount the existing drives via Unassigned Devices or in another server and still get to the data. They could become my spares after the rebuild.
3) I am in the unfortunate spot of having drain tile work in my basement starting Friday morning. I don't know that I can pull off a 14TB rebuild in that amount of time. Am I better off shutting the server down until the work finishes? Or can I stop/resume a rebuild?

EDIT: I should add that for question 3, the basement work is going to require me to tarp over and wrap my server rack, so it needs to go offline.

Thanks again, guys.

Edited April 22 by spall
April 22
I didn't examine SMART for any other disks since you have so many. Do any show a SMART warning (👎) on the DASHBOARD page?

Rebuild will start over if you restart, so it's probably better if you just shut down and wait.

You can rebuild both at the same time. Rebuilding to spares does give other options in case of problems.

My home office is undergoing some remodeling currently, so I have moved my server to another room where I already have ethernet. Don't know if that is an option for you.
April 22 Author
@trurl Everything is thumbs up on the dashboard, including the two disabled disks. One of the disabled disks, disk5/sdi, has finished the extended test and is all good except for 1 UDMA CRC error, which I believe was a previous error. Disk13/sdg is 90% done with the test.

Unfortunately, moving is not an option for me. I have a spinal injury and cannot muscle the 4U upstairs even if I remove all the disks. I'm going to have to wrap and tarp the whole rack.

OK. Sounds like I'll probably rebuild to my spares, but I'll shut everything down when this test finishes and wait until the work is done in the basement. I'm not sure how long it will take to rebuild 2x14TB drives. Thank you.
April 22 Author
The second one just finished the extended test. All good on that drive, as well.

I have to wrap the rack in about 30 hours. I have a feeling that would be cutting it really close if I started the rebuild now, unfortunately.

Edited April 22 by spall
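A rough back-of-the-envelope check on that 30-hour window, as a sketch only. It assumes the `512 27344764928` pair in the "offline:" log lines is sector size and sector count, that both disks rebuild concurrently (as noted earlier in the thread), and that the rebuild averages somewhere between 100 and 200 MB/s. Those speeds are illustrative guesses, not measurements from this server, and `rebuild_hours` is a hypothetical helper.

```python
SECTOR_SIZE = 512          # bytes, from the "offline:" log lines (assumed meaning)
SECTORS = 27344764928      # sector count, from the same lines (assumed meaning)


def rebuild_hours(size_bytes, avg_mb_per_s):
    """Hours to write size_bytes at a sustained average of avg_mb_per_s (decimal MB)."""
    seconds = size_bytes / (avg_mb_per_s * 1_000_000)
    return seconds / 3600


size_bytes = SECTOR_SIZE * SECTORS
print(f"Disk size: {size_bytes / 1e12:.2f} TB")

# Assumed average speeds; the real figure depends on the slowest disk involved.
for speed in (100, 150, 200):
    print(f"{speed} MB/s -> {rebuild_hours(size_bytes, speed):.1f} hours")
```

At an assumed 150 MB/s the estimate is about 26 hours, and at 100 MB/s about 39 hours, so a 30-hour window would indeed be tight. Since a rebuild starts over after a restart, waiting was the lower-risk choice.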
May 8 Author
@JorgeB @trurl My server rack was powered down longer than I anticipated. Anyway, the dual rebuild to two new drives worked like a charm. Thanks for the help! Marking as solved.