Netbug Posted August 3, 2017

After upgrading my parity drive to 4TB and one of my data drives from 1TB to 4TB, I'm now getting a red X on another drive.

[Screenshot]

I ran a SMART test on the drive (results attached).

So, 2 questions...

1. Is this ANOTHER drive I need to replace (4 drives swapped in one weekend is getting expensive)?
2. How can I get the drive back into the array and have the data rebuilt? From my digging (and apparently I encountered a similar problem before), I need to create a new config under "Tools." Is this correct? How do I go about doing that while preserving/rebuilding my data?

Thanks.

tower-smart-20170802-2338.zip
tower-diagnostics-20170803-0710.zip
SSD Posted August 3, 2017

While the 3T Seagates do not have a sterling record, I do not see anything wrong with that drive. Are you using hot-swap style drive cages? My guess is that every time you open the server, you are jiggling wires enough to create intermittent connections. Yours are the typical signs. Look at the CSE-M35T-1B cages.
JorgeB Posted August 3, 2017

Your log is full of these errors:
Aug 2 22:21:13 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 06/08/2012
Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b28 ffffffff813a4a1b ffffc9000bcb3b78 ffffffff81944854
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b68 ffffffff8104d0d9 000000ed0bcb3be0 0000000000000000
Aug 2 22:21:13 Tower kernel: ffffffff81c7b220 ffff880211e05820 0000000000800070 ffff88001eb50000
Aug 2 22:21:13 Tower kernel: Call Trace:
Aug 2 22:21:13 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d0d9>] __warn+0xb8/0xd3
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d13a>] warn_slowpath_fmt+0x46/0x4e
Aug 2 22:21:13 Tower kernel: [<ffffffff81176a20>] sysfs_remove_group+0x4d/0x80
Aug 2 22:21:13 Tower kernel: [<ffffffff81491bb6>] dpm_sysfs_remove+0x4b/0x50
Aug 2 22:21:13 Tower kernel: [<ffffffff81488604>] device_del+0x44/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814bde8a>] sd_remove+0x50/0xb7
Aug 2 22:21:13 Tower kernel: [<ffffffff8148b96f>] __device_release_driver+0x98/0x11c
Aug 2 22:21:13 Tower kernel: [<ffffffff8148ba11>] device_release_driver+0x1e/0x2b
Aug 2 22:21:13 Tower kernel: [<ffffffff8148acb9>] bus_remove_device+0xf6/0x109
Aug 2 22:21:13 Tower kernel: [<ffffffff81488721>] device_del+0x161/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9f62>] __scsi_remove_device+0x5e/0xcb
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9ff0>] scsi_remove_device+0x21/0x2e
Aug 2 22:21:13 Tower kernel: [<ffffffff814ba178>] scsi_remove_target+0x151/0x1ad
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8a8>] sas_rphy_remove+0x26/0x6b [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8fa>] sas_rphy_delete+0xd/0x18 [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa0081191>] sas_destruct_devices+0x63/0x85 [libsas]
Aug 2 22:21:13 Tower kernel: [<ffffffff8105ed53>] process_one_work+0x192/0x295
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f752>] worker_thread+0x27d/0x369
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f4d5>] ? rescuer_thread+0x2b1/0x2b1
Aug 2 22:21:13 Tower kernel: [<ffffffff81063939>] kthread+0xdb/0xe3
Aug 2 22:21:13 Tower kernel: [<ffffffff8106385e>] ? kthread_park+0x52/0x52
Aug 2 22:21:13 Tower kernel: [<ffffffff8167f785>] ret_from_fork+0x25/0x30
Aug 2 22:21:13 Tower kernel: ---[ end trace c8cceb7943eb323b ]---
Aug 2 22:21:13 Tower kernel: ------------[ cut here ]------------

They appear related to the SASLP, and the failed disk is connected to it. You can try using it in a different slot or, better, get rid of it, as there are known issues with these controllers.
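For anyone who wants to check how often traces like the one above show up in their own syslog, here is a minimal Python sketch. It assumes the log lines look like the excerpt in this post (a `[ cut here ]` marker before each warning, an `end trace` marker after it, and module names in square brackets at the end of a line). The function name `count_traces_by_module` and the sample lines are made up for illustration; this is not an Unraid tool.

```python
import re
from collections import Counter

# Matches a trailing module tag such as "[libsas]" or "[scsi_transport_sas]".
# Frame addresses like "[<ffffffff813a4a1b>]" do not match because they are
# not at the end of the line and contain "<" and ">".
MODULE_TAG = re.compile(r"\[([a-z0-9_]+)\]\s*$")


def count_traces_by_module(log_lines):
    """Count completed kernel warning traces per module tag.

    A trace is counted once for every distinct module tag seen between
    a "[ cut here ]" marker (or the start of the input) and the next
    "---[ end trace" marker.
    """
    counts = Counter()
    modules = set()
    for line in log_lines:
        if "[ cut here ]" in line:
            modules = set()          # a new warning starts here
        elif "---[ end trace" in line:
            counts.update(modules)   # the warning is complete, record it
            modules = set()
        else:
            match = MODULE_TAG.search(line)
            if match:
                modules.add(match.group(1))
    return counts


if __name__ == "__main__":
    # Shortened sample in the same shape as the excerpt above.
    sample = [
        "Aug 2 22:21:13 Tower kernel: ------------[ cut here ]------------",
        "Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]",
        "Aug 2 22:21:13 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e",
        "Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8a8>] sas_rphy_remove+0x26/0x6b [scsi_transport_sas]",
        "Aug 2 22:21:13 Tower kernel: ---[ end trace c8cceb7943eb323b ]---",
    ]
    for module, n in sorted(count_traces_by_module(sample).items()):
        print(module, n)
```

A high count for `libsas` or `scsi_transport_sas` would be consistent with the controller being involved, but it does not prove a hardware fault by itself.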
Netbug (Author) Posted August 3, 2017

52 minutes ago, bjp999 said: While the 3T Seagates do not have a sterling record, I do not see anything wrong with that drive. Are you using hot-swap style drive cages? My guess is that every time you open the server, you are jiggling wires enough to create intermittent connections. Yours are the typical signs. Look at the CSE-M35T-1B cages.

I am using hot-swap bays. The cages you recommend are actually the cages I'm using. They've been very good so far, but I suppose the wires could be loose somehow.

30 minutes ago, johnnie.black said: Your log is full of these errors:
Aug 2 22:21:13 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 06/08/2012
Aug 2 22:21:13 Tower kernel: Workqueue: scsi_wq_1 sas_destruct_devices [libsas]
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b28 ffffffff813a4a1b ffffc9000bcb3b78 ffffffff81944854
Aug 2 22:21:13 Tower kernel: ffffc9000bcb3b68 ffffffff8104d0d9 000000ed0bcb3be0 0000000000000000
Aug 2 22:21:13 Tower kernel: ffffffff81c7b220 ffff880211e05820 0000000000800070 ffff88001eb50000
Aug 2 22:21:13 Tower kernel: Call Trace:
Aug 2 22:21:13 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d0d9>] __warn+0xb8/0xd3
Aug 2 22:21:13 Tower kernel: [<ffffffff8104d13a>] warn_slowpath_fmt+0x46/0x4e
Aug 2 22:21:13 Tower kernel: [<ffffffff81176a20>] sysfs_remove_group+0x4d/0x80
Aug 2 22:21:13 Tower kernel: [<ffffffff81491bb6>] dpm_sysfs_remove+0x4b/0x50
Aug 2 22:21:13 Tower kernel: [<ffffffff81488604>] device_del+0x44/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814bde8a>] sd_remove+0x50/0xb7
Aug 2 22:21:13 Tower kernel: [<ffffffff8148b96f>] __device_release_driver+0x98/0x11c
Aug 2 22:21:13 Tower kernel: [<ffffffff8148ba11>] device_release_driver+0x1e/0x2b
Aug 2 22:21:13 Tower kernel: [<ffffffff8148acb9>] bus_remove_device+0xf6/0x109
Aug 2 22:21:13 Tower kernel: [<ffffffff81488721>] device_del+0x161/0x1f0
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9f62>] __scsi_remove_device+0x5e/0xcb
Aug 2 22:21:13 Tower kernel: [<ffffffff814b9ff0>] scsi_remove_device+0x21/0x2e
Aug 2 22:21:13 Tower kernel: [<ffffffff814ba178>] scsi_remove_target+0x151/0x1ad
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8a8>] sas_rphy_remove+0x26/0x6b [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa006f8fa>] sas_rphy_delete+0xd/0x18 [scsi_transport_sas]
Aug 2 22:21:13 Tower kernel: [<ffffffffa0081191>] sas_destruct_devices+0x63/0x85 [libsas]
Aug 2 22:21:13 Tower kernel: [<ffffffff8105ed53>] process_one_work+0x192/0x295
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f752>] worker_thread+0x27d/0x369
Aug 2 22:21:13 Tower kernel: [<ffffffff8105f4d5>] ? rescuer_thread+0x2b1/0x2b1
Aug 2 22:21:13 Tower kernel: [<ffffffff81063939>] kthread+0xdb/0xe3
Aug 2 22:21:13 Tower kernel: [<ffffffff8106385e>] ? kthread_park+0x52/0x52
Aug 2 22:21:13 Tower kernel: [<ffffffff8167f785>] ret_from_fork+0x25/0x30
Aug 2 22:21:13 Tower kernel: ---[ end trace c8cceb7943eb323b ]---
Aug 2 22:21:13 Tower kernel: ------------[ cut here ]------------
They appear related to the SASLP, and the failed disk is connected to it. You can try using it in a different slot or, better, get rid of it, as there are known issues with these controllers.

I'm not quite sure what you mean by the SASLP; is that the SATA controller? At this point, it is detecting the drive again, and as you said, there don't appear to be any errors with the drive. How can I get it back in the array?
JorgeB Posted August 3, 2017

You can rebuild to the same disk, but with all those errors it's just a matter of time until that or another disk is disabled again.
Netbug (Author) Posted August 3, 2017

4 minutes ago, johnnie.black said: You can rebuild to the same disk, but with all those errors it's just a matter of time until that or another disk is disabled again.

Ok. I'll add it to the list of hardware I need to replace ($500 on hard drives this week alone). For now, though, how do I go about safely getting the disk to be read by the array again?
JorgeB Posted August 3, 2017

4 minutes ago, johnnie.black said: You can rebuild to the same disk
Netbug (Author) Posted August 3, 2017

1 minute ago, johnnie.black said:

Sorry, but how? I don't get the option at the bottom that I would normally get to rebuild the drive. It just shows as "Unmountable."
JorgeB Posted August 3, 2017

https://wiki.lime-technology.com/Troubleshooting#Re-enable_the_drive

The disk wasn't unmountable when you posted the screenshot and diags, but it looks to be empty; if so, you can just format it.
Netbug (Author) Posted August 3, 2017

5 minutes ago, johnnie.black said: https://wiki.lime-technology.com/Troubleshooting#Re-enable_the_drive The disk wasn't unmountable when you posted the screenshot and diags, but it looks to be empty; if so, you can just format it.

That's got it. Sorry, I was 100% sure I saw it as unmountable when I left the house this morning. It's reconstructing now. I really appreciate all the help.

So as of right now, I've got a new motherboard, RAM, and processor on the way, and I will be getting two additional hot-swap bays and a controller. Now I need to add to that a second SATA controller. An expensive week.

Thank you again.
Netbug (Author) Posted August 3, 2017

Oh, one further question: is there a set of instructions for how to replace the SATA controller? I assume the disks will need to be re-assigned in their proper positions after the controller is replaced.
JorgeB Posted August 3, 2017

3 minutes ago, Netbug said: I assume the disks will need to be re-assigned in their proper positions after the controller is replaced.

No, disks are tracked by serial number. As long as the new controller is supported and is not a RAID controller, it's just plug and play. LSI HBAs are currently recommended, e.g., 9201-8i, 9211-8i and clones.
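To illustrate why tracking by serial number makes a controller swap plug and play, here is a small Python sketch. It is only a toy model of the idea, not Unraid's actual code: the function `resolve_slots`, the slot names, the serial numbers, and the device nodes are all hypothetical. The point is that the slot assignment is stored against the serial, so it does not matter which device node a disk ends up on after the hardware changes.

```python
def resolve_slots(assignments, detected):
    """Map each array slot to the device node currently holding its disk.

    assignments: dict of slot name -> disk serial number (what is stored)
    detected:    dict of disk serial number -> device node (what is found at boot)
    Returns a dict of slot name -> device node, or None if the disk is missing.
    """
    return {slot: detected.get(serial) for slot, serial in assignments.items()}


if __name__ == "__main__":
    # Hypothetical stored assignments, keyed by slot.
    assignments = {"parity": "SERIAL-AAA", "disk1": "SERIAL-BBB", "disk2": "SERIAL-CCC"}

    # Device nodes as detected on the old controller.
    old_controller = {"SERIAL-AAA": "/dev/sdb", "SERIAL-BBB": "/dev/sdc", "SERIAL-CCC": "/dev/sdd"}

    # After the swap the same disks are enumerated in a different order.
    new_controller = {"SERIAL-CCC": "/dev/sdb", "SERIAL-AAA": "/dev/sdc", "SERIAL-BBB": "/dev/sdd"}

    print(resolve_slots(assignments, old_controller))
    print(resolve_slots(assignments, new_controller))
```

In this model the device nodes change after the swap, but every slot still resolves to the same physical disk. A RAID controller is different because it can hide or alter the identity that the disks present, which is presumably why the reply above excludes it.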
Archived
This topic is now archived and is closed to further replies.