August 3, 2017
After upgrading my parity drive to 4TB and one of my data drives from 1TB to 4TB, I'm now getting a red X on another drive (screenshot attached). I ran a SMART test on the drive (results attached).
So, 2 questions...
1. Is this ANOTHER drive I need to replace (4 drives swapped in one weekend is getting expensive)?
2. How can I get the drive back into the array and have the data rebuilt? From my digging (and apparently I encountered a similar problem before), I need to create a new config under "Tools." Is this correct? How do I go about doing that while preserving/rebuilding my data?
Thanks.
Attachments: tower-smart-20170802-2338.zip, tower-diagnostics-20170803-0710.zip
August 3, 2017
While the 3TB Seagates do not have a sterling record, I do not see anything wrong with that drive. Are you using hot-swap style drive cages? My guess is that every time you open the server, you are jiggling wires enough to create intermittent connections. Yours are the typical signs. Look at the CSE-M35T-1B cages.
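For anyone wanting to tell a failing disk apart from a flaky connection, one rough approach is to look at a handful of SMART attributes: reallocated, pending, and uncorrectable sectors (IDs 5, 197, 198) tend to point at the disk itself, while UDMA CRC errors (ID 199) usually point at cabling or the backplane. The sketch below is only an illustration, assuming the standard `smartctl -A` table layout and that smartmontools is installed; the function names and the choice of watched attributes are my own, not something from the attached report.

```python
import subprocess

# Attribute IDs commonly checked first. 5, 197 and 198 point at the disk
# surface; 199 (UDMA CRC errors) usually points at cabling or the backplane.
# This selection is an assumption made for illustration.
WATCHED = {5: "media", 197: "media", 198: "media", 199: "connection"}


def parse_attributes(report):
    """Map attribute ID -> (name, raw value) from `smartctl -A` text output."""
    attrs = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # rows whose raw value is not a plain integer are skipped.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[int(parts[0])] = (parts[1], int(parts[9]))
    return attrs


def flag_problems(attrs):
    """Return (kind, name, raw) for each watched attribute with a nonzero raw value."""
    return [
        (WATCHED[attr_id], name, raw)
        for attr_id, (name, raw) in sorted(attrs.items())
        if attr_id in WATCHED and raw != 0
    ]


def read_smart(device):
    """Run smartctl on a device such as /dev/sdb (needs root)."""
    # smartctl uses its exit status as a bitmask, so a nonzero code is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `flag_problems(parse_attributes(read_smart("/dev/sdb")))` (the device name is just an example) returns an empty list when none of the watched counters have moved, which would be consistent with a healthy disk on a bad connection.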
August 3, 2017 Community Expert
Your log is full of these errors:
Aug 2 22:21:13 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 06/08/2012
Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b28 ffffffff813a4a1b ffffc9000bcb3b78 ffffffff81944854
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b68 ffffffff8104d0d9 000000ed0bcb3be0 0000000000000000
Aug 2 22:21:13 Tower kernel: ffffffff81c7b220 ffff880211e05820 0000000000800070 ffff88001eb50000
Aug 2 22:21:13 Tower kernel: Call Trace:
Aug 2 22:21:13 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d0d9>] __warn+0xb8/0xd3
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d13a>] warn_slowpath_fmt+0x46/0x4e
Aug 2 22:21:13 Tower kernel: [<ffffffff81176a20>] sysfs_remove_group+0x4d/0x80
Aug 2 22:21:13 Tower kernel: [<ffffffff81491bb6>] dpm_sysfs_remove+0x4b/0x50
Aug 2 22:21:13 Tower kernel: [<ffffffff81488604>] device_del+0x44/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814bde8a>] sd_remove+0x50/0xb7
Aug 2 22:21:13 Tower kernel: [<ffffffff8148b96f>] __device_release_driver+0x98/0x11c
Aug 2 22:21:13 Tower kernel: [<ffffffff8148ba11>] device_release_driver+0x1e/0x2b
Aug 2 22:21:13 Tower kernel: [<ffffffff8148acb9>] bus_remove_device+0xf6/0x109
Aug 2 22:21:13 Tower kernel: [<ffffffff81488721>] device_del+0x161/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9f62>] __scsi_remove_device+0x5e/0xcb
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9ff0>] scsi_remove_device+0x21/0x2e
Aug 2 22:21:13 Tower kernel: [<ffffffff814ba178>] scsi_remove_target+0x151/0x1ad
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8a8>] sas_rphy_remove+0x26/0x6b [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8fa>] sas_rphy_delete+0xd/0x18 [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa0081191>] sas_destruct_devices+0x63/0x85 [libsas]
Aug 2 22:21:13 Tower kernel: [<ffffffff8105ed53>] process_one_work+0x192/0x295
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f752>] worker_thread+0x27d/0x369
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f4d5>] ? rescuer_thread+0x2b1/0x2b1
Aug 2 22:21:13 Tower kernel: [<ffffffff81063939>] kthread+0xdb/0xe3
Aug 2 22:21:13 Tower kernel: [<ffffffff8106385e>] ? kthread_park+0x52/0x52
Aug 2 22:21:13 Tower kernel: [<ffffffff8167f785>] ret_from_fork+0x25/0x30
Aug 2 22:21:13 Tower kernel: ---[ end trace c8cceb7943eb323b ]---
Aug 2 22:21:13 Tower kernel: ------------[ cut here ]------------
They appear to be related to the SASLP. The failed disk is connected to it; you can try using it in a different slot or, better yet, get rid of it, as there are known issues with these controllers.
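To get a feel for how often these warnings occur and which driver they come from, the syslog can be tallied with a short script. This is only a sketch built around the two line shapes visible in the trace above (the `Workqueue:` line and the `end trace` marker); the function name and the idea of grouping by function and module are assumptions for illustration, not an Unraid tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]
WORKQUEUE_RE = re.compile(r"kernel: Workqueue: (\S+) (\S+)(?: \[(\w+)\])?")
# Matches the marker that closes each kernel warning trace.
END_TRACE_RE = re.compile(r"kernel: ---\[ end trace [0-9a-f]+ \]---")


def summarize_warnings(lines):
    """Return (number of completed traces, Counter keyed by (function, module))."""
    traces = 0
    sources = Counter()
    for line in lines:
        match = WORKQUEUE_RE.search(line)
        if match:
            # Lines without a bracketed module name are counted as "built-in".
            sources[(match.group(2), match.group(3) or "built-in")] += 1
        elif END_TRACE_RE.search(line):
            traces += 1
    return traces, sources
```

Passing an open file works, for example `summarize_warnings(open("/var/log/syslog", errors="replace"))`, assuming the log lives at that path or using the syslog from the diagnostics zip instead. If nearly every entry comes from `libsas`, that would fit the suggestion that the controller is the common factor.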
August 3, 2017 Author
52 minutes ago, bjp999 said:
While the 3TB Seagates do not have a sterling record, I do not see anything wrong with that drive. Are you using hot-swap style drive cages? My guess is that every time you open the server, you are jiggling wires enough to create intermittent connections. Yours are the typical signs. Look at the CSE-M35T-1B cages.
I am using hot-swap bays. The cages you recommend are actually the cages I'm using. They've been very good so far, but I suppose the wires could be loose somehow.
30 minutes ago, johnnie.black said:
Your log is full of these errors:
Aug 2 22:21:13 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 06/08/2012
Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b28 ffffffff813a4a1b ffffc9000bcb3b78 ffffffff81944854
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b68 ffffffff8104d0d9 000000ed0bcb3be0 0000000000000000
Aug 2 22:21:13 Tower kernel: ffffffff81c7b220 ffff880211e05820 0000000000800070 ffff88001eb50000
Aug 2 22:21:13 Tower kernel: Call Trace:
Aug 2 22:21:13 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d0d9>] __warn+0xb8/0xd3
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d13a>] warn_slowpath_fmt+0x46/0x4e
Aug 2 22:21:13 Tower kernel: [<ffffffff81176a20>] sysfs_remove_group+0x4d/0x80
Aug 2 22:21:13 Tower kernel: [<ffffffff81491bb6>] dpm_sysfs_remove+0x4b/0x50
Aug 2 22:21:13 Tower kernel: [<ffffffff81488604>] device_del+0x44/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814bde8a>] sd_remove+0x50/0xb7
Aug 2 22:21:13 Tower kernel: [<ffffffff8148b96f>] __device_release_driver+0x98/0x11c
Aug 2 22:21:13 Tower kernel: [<ffffffff8148ba11>] device_release_driver+0x1e/0x2b
Aug 2 22:21:13 Tower kernel: [<ffffffff8148acb9>] bus_remove_device+0xf6/0x109
Aug 2 22:21:13 Tower kernel: [<ffffffff81488721>] device_del+0x161/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9f62>] __scsi_remove_device+0x5e/0xcb
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9ff0>] scsi_remove_device+0x21/0x2e
Aug 2 22:21:13 Tower kernel: [<ffffffff814ba178>] scsi_remove_target+0x151/0x1ad
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8a8>] sas_rphy_remove+0x26/0x6b [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8fa>] sas_rphy_delete+0xd/0x18 [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa0081191>] sas_destruct_devices+0x63/0x85 [libsas]
Aug 2 22:21:13 Tower kernel: [<ffffffff8105ed53>] process_one_work+0x192/0x295
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f752>] worker_thread+0x27d/0x369
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f4d5>] ? rescuer_thread+0x2b1/0x2b1
Aug 2 22:21:13 Tower kernel: [<ffffffff81063939>] kthread+0xdb/0xe3
Aug 2 22:21:13 Tower kernel: [<ffffffff8106385e>] ? kthread_park+0x52/0x52
Aug 2 22:21:13 Tower kernel: [<ffffffff8167f785>] ret_from_fork+0x25/0x30
Aug 2 22:21:13 Tower kernel: ---[ end trace c8cceb7943eb323b ]---
Aug 2 22:21:13 Tower kernel: ------------[ cut here ]------------
They appear to be related to the SASLP. The failed disk is connected to it; you can try using it in a different slot or, better yet, get rid of it, as there are known issues with these controllers.
I'm not quite sure what you mean by the SASLP; is that the SATA controller? At this point, it is detecting the drive again, and as you said, there don't appear to be any errors with the drive. How can I get it back in the array?
August 3, 2017 Community Expert
You can rebuild to the same disk, but with all those errors it's just a matter of time until that or another disk is disabled again.
August 3, 2017 Author
4 minutes ago, johnnie.black said:
You can rebuild to the same disk, but with all those errors it's just a matter of time until that or another disk is disabled again.
Ok. I'll add it to the list of hardware I need to replace ($500 on hard drives this week alone). For now, though, how do I go about safely getting the disk to be read by the array again?
Edited August 3, 2017 by Netbug
August 3, 2017 Community Expert
4 minutes ago, johnnie.black said:
You can rebuild to the same disk
August 3, 2017 Author
1 minute ago, johnnie.black said:
Sorry, but how? I don't get the option at the bottom that I would normally get to rebuild the drive. It just shows as "Unmountable."
August 3, 2017 Community Expert
https://wiki.lime-technology.com/Troubleshooting#Re-enable_the_drive
The disk wasn't unmountable when you posted the screenshot and diags, but it looks to be empty; if so, you can just format it.
August 3, 2017 Author
5 minutes ago, johnnie.black said:
https://wiki.lime-technology.com/Troubleshooting#Re-enable_the_drive
The disk wasn't unmountable when you posted the screenshot and diags, but it looks to be empty; if so, you can just format it.
That's got it. Sorry, I was 100% sure I saw it as unmountable when I left the house this morning. It's reconstructing now. I really appreciate all the help.
So as of right now, I've got a new motherboard, RAM, and processor on the way, and I will be getting two additional hot-swap bays and a controller. Now I need to add to that a second SATA controller. An expensive week.
Thank you again.
August 3, 2017 Author
Oh, one further question: is there a set of instructions for how to replace the SATA controller? I assume the disks will need to be re-assigned in their proper positions after the controller is replaced.
August 3, 2017 Community Expert
3 minutes ago, Netbug said:
I assume the disks will need to be re-assigned in their proper positions after the controller is replaced.
No, disks are tracked by serial number. As long as the new controller is supported and is not a RAID controller, it's just plug and play. LSI HBAs are currently recommended, e.g., the 9201-8i, 9211-8i, and clones.
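Since disks are tracked by serial number, it can still be reassuring to record which serial sits on which device node before the controller swap and compare afterwards. The sketch below is one possible way to do that, assuming a Linux system where udev populates `/dev/disk/by-id` with names that embed the model and serial; the function names are hypothetical, and the name prefix (`ata-`, `scsi-`) may differ between controllers, so treat a "missing" entry as a prompt to check the serial by eye rather than as proof that a disk is gone.

```python
import os


def map_serials(by_id_dir="/dev/disk/by-id"):
    """Map each whole-disk by-id name (model + serial) to its current device node."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition links (names containing -part) and WWN aliases.
        if "-part" in name or name.startswith("wwn-"):
            continue
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping[name] = os.path.basename(target)
    return mapping


def moved_disks(before, after):
    """Return names whose device node changed, plus names missing afterwards."""
    changed = {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
    missing = sorted(name for name in before if name not in after)
    return changed, missing
```

Run `map_serials()` once before shutting down and once after the new controller is installed, then pass both results to `moved_disks`. Device nodes such as `sdb` are expected to shuffle; that is harmless because the array assignment follows the serial, not the node.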
This topic is now archived and is closed to further replies.