johngalt Posted April 28, 2023 I was in the process of upgrading a hard drive when I woke up this morning and found 921 errors across the array. I had pulled the old drive, replaced it with the new one, and was preclearing the new one when this occurred. The array also seems to be stuck: I went to stop it and it just says "stopping". I can't shut down or restart at this point. On the dashboard, my processor is also running at 90%, which I've never seen before. Diagnostics attached: tower-diagnostics-20230428-0517.zip
JorgeB Posted April 28, 2023 Type reboot in the CLI. If it doesn't reboot after 5 minutes you will need to force it. Then check/replace the cables for disk10 and post new diags after array start.
johngalt Posted April 28, 2023 I had to force a shutdown. I don't have any extra cables, but I unplugged them all and plugged them back in. Now Disk 9 is missing from the array. I shut down again, checked that drive, and restarted, and it's still not found. New diagnostics attached: tower-diagnostics-20230428-0712.zip
JorgeB Posted April 28, 2023 Swap cables with a different disk, both power and SATA. If the disk is still missing after that, you will need to replace it.
johngalt Posted April 28, 2023 OK, swapped a couple of drives. Disk 9 is now found and Disk 13 is missing. It looks like whatever is attached to that port shows up as missing. Would this be the root cause of the errors across the array this morning, or is this just collateral damage and there's more to it? Edit: Apparently I forgot to plug the SATA cable back in. Plugged it in and all disks are up and running now.
JorgeB Posted April 28, 2023 johngalt said: "Would this be the root cause of why there were errors across the array this morning" Could be.
johngalt Posted April 28, 2023 Updated prior post. Array is up and running. New diagnostics attached. What should my next steps be? My new disk was 90% through step 5 of the preclear when everything crashed. Should I replace that drive with the old one and run a parity check before continuing? I wouldn't normally have had the array up and running with a missing disk, but my wife is nursing and wants to watch Friends, so I made an exception. tower-diagnostics-20230428-0803.zip
JorgeB Posted April 28, 2023 I would just replace it with the new disk.
johngalt Posted April 28, 2023 OK, I just started the data rebuild and almost immediately got more errors. Logs attached: tower-diagnostics-20230428-0927.zip
JorgeB Posted April 28, 2023 The new disk dropped offline so there's no SMART report, but looking at the previous diags the SMART data was incomplete, so that disk is possibly no good. Power cycle the server to get a new SMART report.
johngalt Posted April 28, 2023 Powered down (had to force it again, as the server got hung up stopping the array again) and restarted. SMART report for the new disk attached: tower-smart-20230428-1011.zip
JorgeB Posted April 28, 2023 "Read SMART Data failed: scsi error badly formed scsi parameters" This usually means a bad drive.
johngalt Posted April 28, 2023 OK, I'll pull that and RMA it. In the meantime, is there any harm in putting the old drive back in? It's giving me the message "The replacement disk must be as big or bigger than the original." I assume I need to do a New Config and preserve all. Am I going to mess anything up in doing that? Or should I just wait for the new drive to come in? We're going to be out of town for the next week anyway, so it won't be a big deal to just wait it out. I was thinking I'd like to get the old drive in to confirm the problems do lie with the new drive.
JorgeB Posted April 28, 2023 johngalt said: "I assume I need to do a New Config and preserve all." Correct, then re-sync parity. It should be fine as long as there aren't more disk issues.
johngalt Posted April 28, 2023 OK, and just to make sure: "Parity Disk content will be overwritten" is expected and won't lead to any data loss?
johngalt Posted April 28, 2023 More errors, even with the new drive removed: tower-diagnostics-20230428-1307.zip
Solution JorgeB Posted April 28, 2023 This time it looks like a controller problem. Those SASLP controllers have not been recommended for a long time; do you have a different one you could use? Ideally one from this list. You also need to change two of the onboard SATA ports from IDE to SATA/AHCI, usually ports 5 and 6.
johngalt Posted April 28, 2023 No other controllers on hand, but I can order one. The controller is from the original build about ten years ago. Would this be the best replacement? https://www.newegg.com/lsi-9300-8i-sata-sas/p/N82E16816118217 What's the process for changing the onboard SATA ports?
JorgeB Posted April 28, 2023 That would be a good option. You'd also need new cables, but the replacement itself would be plug and play. johngalt said: "What's the process for changing the onboard SATA ports?" In the board BIOS.
johngalt Posted April 28, 2023 Thanks for all the help. Went ahead and ordered the card and SFF-8643 cables. At this point, are these just false errors due to the controller, or is there a possibility my data was corrupted? Once I get the new card up and going, I just need to rebuild parity and then go through the normal drive replacement for the new 18 TB drive?
JorgeB Posted April 29, 2023 johngalt said: "Once I get the new card up and going, I just need to rebuild parity and then go through the normal drive replacement for the new 18tb drive?" Hopefully yes.
johngalt Posted May 11, 2023 Update: Got the new controller in and installed. Parity is currently rebuilding, but 15 minutes in and no errors, so that's a good sign. Once the parity rebuild completes, how do I go about checking for corrupt data?
JorgeB Posted May 11, 2023 Data should all be fine, but since you are not using btrfs/zfs, and assuming no preexisting checksums exist, there's no easy way to check.
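For readers wondering what "preexisting checksums" would look like in practice, here is a minimal Python sketch of a checksum manifest. It cannot detect corruption that happened before the manifest was built; it only lets a later run show which files have changed since. The function names and the example path are illustrative assumptions, not part of Unraid or any plugin.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files present in both manifests whose digests differ."""
    return sorted(
        name
        for name, digest in old_manifest.items()
        if name in new_manifest and new_manifest[name] != digest
    )


if __name__ == "__main__":
    # Hypothetical share path; substitute the directory you want to track.
    manifest = build_manifest("/mnt/user/media")
    print(f"hashed {len(manifest)} files")
```

Saving the manifest (for example as JSON) after a known-good state and comparing it against a fresh one later flags files whose contents changed; files that were intentionally edited will show up too, so the result needs a manual look.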
johngalt Posted May 11, 2023 Still no errors, but the rebuild has slowed to 2.5 MB/sec. Estimated completion is 80 days. tower-diagnostics-20230511-1339.zip
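As a rough sanity check on those figures: 2.5 MB/sec sustained for 80 days works out to about 17 TB still to process. The sketch below assumes decimal units (1 MB = 10^6 bytes) and a constant rate; the function names are illustrative, and the 150 MB/sec comparison rate is a hypothetical value, not a measurement from this server.

```python
SECONDS_PER_DAY = 86_400


def remaining_bytes(rate_mb_per_s, days):
    """Bytes processed at a constant rate (decimal MB/s) over the given days."""
    return rate_mb_per_s * 1e6 * days * SECONDS_PER_DAY


def eta_days(bytes_left, rate_mb_per_s):
    """Days needed to process bytes_left at a constant rate (decimal MB/s)."""
    return bytes_left / (rate_mb_per_s * 1e6) / SECONDS_PER_DAY


if __name__ == "__main__":
    left = remaining_bytes(2.5, 80)
    print(f"remaining: {left / 1e12:.2f} TB")
    # Hypothetical healthier rate, for comparison only.
    print(f"at 150 MB/s: {eta_days(left, 150):.1f} days")
```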