hernandito Posted September 6, 2023

Hi Guys,

Pretty desperate. Long story short: I have had a lot of drives fail in the past week, three so far. I suspect bad cables. I have ordered new SATA data and SATA power cables and am awaiting their arrival.

My server has 14 array disks and two parity drives, plus one NVMe cache drive and one SSD cache backup drive mounted via Unassigned Devices.

At one point, one of the array drives showed as failed (a 16TB drive bought 8 months ago), and then an old 4TB drive showed as failed. I ordered new ones and replaced the 16TB and the 4TB (with an 8TB). While pre-clearing, I was moving some files around. One of my parity drives showed SMART errors and eventually failed. The error was on the attribute Disk Shift... it had a fairly high count of around 500,000.

After pre-clearing, I assigned the new drives to the two array slots. After rebooting, I also had another drive showing data corruption. I was able to run a filesystem repair on it and brought it back.

Then, while still in Maintenance mode (in Safe Mode), there was a SYNC button for Data Rebuild (I can't recall the exact label). I hit that, and it started rebuilding. It went somewhat fast, about 6 hours to rebuild 16TB, and I suspected this was not right. I let it finish, and when I started the array (in non-safe mode), the two drives indeed showed as unmountable.

I took the failed parity drive to another machine and ran a Short Test on it. It showed no errors, but still the high Disk Shift count. Nothing else seems abnormal. I put the drive back in the server and booted, and it still shows as disabled (missing).

As of right now, I have one disabled parity drive and the two newly assigned drives. I have shut down the server while awaiting the new cables. If I start the array, the two new array drives are already assigned. If I unassign them (set them to no device) and re-add them, the same thing happens when I start the array.
The whole thing started when I upgraded the cache and cache backup drives to larger capacities and did the ZFS pool conversion for both. I then tried to reuse the old NVMe and SSD to create a future ZFS pool. Since my motherboard has only one NVMe slot, I bought and installed an NVMe-to-PCIe adapter card, which I plugged into an available slot on the motherboard. It all seemed to work fine, but I worry that this card messed things up. I have since removed it. In essence, I don't know if this is the culprit for all the issues.

Questions:

- Is there a way to bring the one parity drive back so I can rebuild those two drives?
- How can I make the two new array drives start a data rebuild? I know that you need two parity drives to rebuild two array drives.
- What can I expect if the one bad parity drive cannot be brought back?
- Any thoughts on what could be causing all those failures?
- Any ideas to resolve the potential underlying issue?

For SATA controllers, my Supermicro motherboard has three mini-SAS connectors, each with an SFF-8087 to 4x SATA breakout cable. I also have one of those LSI / Adaptec (can't recall which) cards in a PCIe slot with two SFF-8087 to 4x SATA cables. This card has been working for years.

Waiting for Amazon to deliver, but not hopeful that this will be the right solution. Any advice is greatly appreciated.

I am not sure how / what I need to send one of those diagnostic test results. If needed, please point me in the right direction. I recall some ages ago having to install something on my Windows machine so that unRAID dumped the logs to it.

Thank you!!

H.
Michael_P Posted September 6, 2023

26 minutes ago, hernandito said: I am not sure how / what I need to send one of those diagnostic test results..

Diagnostics from the Tools menu or from the command line.
hernandito Posted September 9, 2023

Thank you Michael.

Small update: I have changed all the cables and the situation is still bad. I am attaching diagnostics.

Once again I am having to run a filesystem repair on the same drive as before. It keeps getting corrupted and becomes unmountable.

When the array is started, it offers to format the two replaced array drives. I don't want to do this until I find out whether there is a way to rebuild them.

Please help... I'm not sure how I can get some of my data back.

Thank you,

H.

tower-diagnostics-20230908-2208.zip

Edited September 9, 2023 by hernandito
JorgeB Posted September 9, 2023

The diags were taken after rebooting, so we cannot see what happened earlier to disks 6 and 10. What you can do now is check the filesystem on both (run it without -n), then post new diags after array start.
hernandito Posted September 9, 2023

Thank you Jorge...

I repaired Disk 10 using xfs_repair on a separate computer (using PartedMagic). It ran without issues. I put it back in my unRAID server on a different SATA port/card and it seems to work OK. I ran xfs_repair in Maintenance mode a couple of times and it looks happy.

I am currently running xfs_repair on both of the new array drives (6 and 15). It's been running for a couple of hours and this is what I see:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
....

Those dots ...... at the end scroll horizontally for a long time. The read count on the drives keeps going up, so I will leave it for a while.

These two were the drives that unRAID rebuilt with one parity drive disabled. It ran too fast (about 6 hours) and I don't think it did a proper job.

I still have one of the parity drives showing as disabled. I will wait for the currently running repairs to finish, then reboot and create new diagnostics.

Thanks again.

H.
JorgeB Posted September 10, 2023

19 hours ago, hernandito said: Those dots ...... at the end scroll horizontally for a long time. The read count on the drives keeps going up, so I will leave it for a while.

Make sure you are running it without -n. If it doesn't find a backup superblock after 30 minutes or so at most, it likely won't find one.
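For readers following along, the difference between the check-only run and the real repair that JorgeB refers to can be sketched as below. This is only an illustration under stated assumptions: `repair_cmd` is a hypothetical helper that prints the command instead of running it, and the device path `/dev/md15p1` is an assumed example (the actual path depends on the Unraid release and the disk slot, and the array would need to be started in Maintenance mode).

```bash
#!/bin/bash
# Hypothetical helper: print (do not execute) the xfs_repair command line.
# Usage: repair_cmd <device> [check]
repair_cmd() {
    local device="$1"
    local mode="${2:-repair}"
    if [ "$mode" = "check" ]; then
        # -n is no-modify mode: problems are reported but nothing is written
        echo "xfs_repair -n $device"
    else
        # Without -n, xfs_repair writes its fixes to the filesystem
        echo "xfs_repair $device"
    fi
}

# Assumed device path for array disk 15 (illustrative only).
repair_cmd /dev/md15p1 check
repair_cmd /dev/md15p1
```

Printing the command first makes it easy to confirm that the device and the presence or absence of `-n` are what was intended before anything touches the disk.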
hernandito Posted September 10, 2023

I ran xfs_repair on the two drives without -n for about 24 hours, and at the end it simply said "exiting now".

I bought another 16TB drive yesterday that I'm pre-clearing. It will take time, after which I will reboot the server and create the new diagnostics.

Thank you.
hernandito Posted September 10, 2023

Started the array, and the drives show as unmountable.

Edited September 10, 2023 by hernandito
trurl Posted September 10, 2023

38 minutes ago, hernandito said: create the new diagnostics.
hernandito Posted September 10, 2023

Diagnostics attached. Thank you.

tower-diagnostics-20230910-1545.zip
trurl Posted September 11, 2023

On 9/6/2023 at 8:22 AM, hernandito said: attribute Disk Shift... it had a fairly high count of around 500,000.

That attribute isn't normally monitored, and it isn't clear how to interpret it, since it may be a multi-part binary encoding. Other SMART attributes for that disk look OK, but it hasn't had an extended self-test run.

Since neither unmountable disk is being emulated, I wonder if getting parity2 back into the array and then trying to emulate the unmountable disks would give different results. Or just emulate one of them with parity1 to see what happens.

Wait a few hours and see if @JorgeB has any other ideas.
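To illustrate what a "multi-part binary encoding" could mean for a raw SMART value, here is a small sketch that splits a raw number into three 16-bit words. The 16-bit packing is purely an assumption chosen for illustration; how this drive's vendor actually encodes Disk Shift is not established in this thread, and `split_raw` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: split a 48-bit raw value into three 16-bit words,
# printed as "high mid low".
# ASSUMPTION: the 16-bit field layout is illustrative, not a documented
# vendor format for this attribute.
split_raw() {
    raw="$1"
    high=$(( ($raw >> 32) & 65535 ))
    mid=$(( ($raw >> 16) & 65535 ))
    low=$(( $raw & 65535 ))
    echo "$high $mid $low"
}

# The count reported in the thread, about 500,000:
split_raw 500000   # prints: 0 7 41248
```

Under that assumed layout, a single large number such as 500000 would really be two smaller fields (7 and 41248), which is why a high raw count on its own is hard to interpret. The extended self-test mentioned above is normally started with `smartctl -t long` against the drive's device node.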
hernandito Posted September 11, 2023

Thank you. I stopped the array to prevent anything being written. I could run the extended SMART test on the bad parity drive.
JorgeB Posted September 11, 2023

Possibly the previous rebuild didn't finish completely. You can see if emulating one of them gives better results.
hernandito Posted September 11, 2023

1 hour ago, JorgeB said: Possibly the previous rebuild didn't finish completely. You can see if emulating one of them gives better results.

Thank you @JorgeB. I powered down the server and disconnected one of the new array drives. Upon reboot, the array is stopped, citing "configuration change".

EDIT: I did the same with the other drive and got the same results.

Edited September 11, 2023 by hernandito
JorgeB Posted September 11, 2023

You could just unassign the disk, then start the array and post new diags.
hernandito Posted September 11, 2023

39 minutes ago, JorgeB said: You could just unassign the disk, then start the array and post new diags.

Hi... I unassigned one of the disks and started the array. Array does not start.

tower-diagnostics-20230911-0713.zip
JorgeB Posted September 11, 2023

20 minutes ago, hernandito said: Array does not start.

You need to check the "I want to do this" box below the start button.
hernandito Posted September 11, 2023

I did that and the drive contents are not emulated.

tower-diagnostics-20230911-1701.zip
JorgeB Posted September 12, 2023

Check filesystem on the emulated disk15 (without -n).
hernandito Posted September 12, 2023

I ran xfs_repair, and it went through a lot of errors. When it finished, I started the array, and there was only a "lost+found" folder. In it, there are many numbered folders. Some were empty, and some had files in them. Since it's media, I can decipher what they are. I am not sure how complete it is.

I still have it showing as "no device" since we unassigned it.

Should I assign the drive again and have it rebuild? Feeling some hope.

Thank you.
itimpi Posted September 12, 2023

5 minutes ago, hernandito said: Since it's media, I can decipher what they are. I am not sure how complete it is.

It is frequently reasonably correct. You can probably gauge how much space is showing as used compared to what it was before. The Linux 'file' command can help you determine the exact media type, but it is a lot of work to recover from the lost+found folder, so if you have backups, restoring from them is normally easier.

5 minutes ago, hernandito said: Should I assign the drive again and have it rebuild? Feeling some hope.

If you still have the 'failed' drive intact, then there is a chance that running xfs_repair against that drive may be more successful than running it against the emulated drive, and it would not hurt to try this. Hopefully at this point the drive in question is showing up under Unassigned Devices? You could also try to mount it from there and see if it can be mounted by UD.

If you go the rebuild route, you will end up with exactly what is showing on the emulated drive (i.e. the big lost+found folder).
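As a rough sketch of the 'file' suggestion above, the helper below walks a directory and prints the detected MIME type next to each path. The function name `identify_files` and the example location `/mnt/disk15/lost+found` are assumptions for illustration; it also assumes the `file` utility is installed and that file names contain no newlines.

```bash
#!/bin/bash
# Hypothetical helper: print "<mime-type><TAB><path>" for every regular
# file below the given directory.
# ASSUMPTION: file names do not contain newline characters.
identify_files() {
    find "$1" -type f | while IFS= read -r path; do
        # --brief omits the file name; --mime-type prints e.g. video/x-matroska
        printf '%s\t%s\n' "$(file --brief --mime-type "$path")" "$path"
    done
}

# Example (assumed path, adjust to the real disk):
#   identify_files /mnt/disk15/lost+found | sort
```

Sorting the output groups the recovered files by type, which may make it easier to see how much of the numbered content is video versus subtitles, artwork, or metadata.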
hernandito Posted September 12, 2023

For Disk 6, when it started failing I managed to copy some data off it to a spare drive. I did the copy on another system (PartedMagic) and kept copying until the spare disk ran out of space. I still have the original, and I can try to keep copying more.

I also have a listing of folders / subfolders on all my drives. I can always rebuild my media collection.
hernandito Posted September 12, 2023

Thank you @itimpi. Please confirm these are the steps I should follow:

1. Assign a new drive as Disk 15 and let unRAID rebuild it (with all the lost+found content).
2. Keep Disk 6 unassigned for now. I think this one started failing with read errors.
3. Replace the disabled parity drive and rebuild parity.
4. Assign a new Disk 6 and format it.
5. Insert the spare drive as a UD, and copy over the Disk 6 data I managed to save.
6. Enable the Docker service.
7. Start working on the lost+found directory and compare it to my saved folder listings (see more below) to start replacing whatever media I can't recover. I could evaluate what media is worth replacing; it may be OK to cull the collection.

I have my system run the script below daily so I can log exactly what is stored on each drive. I did not author the script, and I can't recall who did to give proper credit (but I love that person very much!!). I copied the output file to my Windows machine for safekeeping. This only logged Movies and TV, so I would not have any record of other non-media files stored on that drive. Hopefully nothing much. Lost+found possibly saved something.

    #!/bin/bash
    # Daily catalog of Movies and TV folders on every array disk.
    DATE=$(date +%Y%m%d)
    REPORTS_PATH=/mnt/cache/appdata/reports
    echo "Searching for movies and TV shows..."
    mkdir -p "$REPORTS_PATH"
    # List directories exactly two levels below each top-level share, skipping hidden ones.
    find /mnt/disk*/* -maxdepth 2 -mindepth 2 -type d -not -iname '.*' \( -path "*/Movies/*" -or -path "*/TV2/*" \) -print | sort -k1 > "$REPORTS_PATH/media.catalog_$DATE.csv"
    echo
    echo "Finished."
    echo
    echo "File created at $REPORTS_PATH/media.catalog_$DATE.csv"
    echo

While the lost+found procedure may be tedious, I am patient... and it will give me something to do.

Please advise. Thank you.
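The comparison in the last step could be sketched roughly as follows: given a catalog saved before the failure and a fresh one generated after recovery, list the folders that appear only in the older file. `missing_entries` is a hypothetical helper, and the catalog file names in the usage comment are made-up examples that merely follow the `media.catalog_$DATE.csv` pattern of the script above.

```bash
#!/bin/bash
# Hypothetical helper: print lines present in the old catalog but absent
# from the new one.
# Usage: missing_entries <old_catalog> <new_catalog>
missing_entries() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs both inputs sorted the same way
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    # -2 drops lines unique to the new file, -3 drops lines common to both
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Example (file names are illustrative only):
#   missing_entries media.catalog_20230901.csv media.catalog_20230915.csv
```

Because the catalog stores full paths including the disk number, the comparison is only meaningful if the recovered folders end up back on the same disk slots; otherwise the disk prefix would need to be stripped from both files first.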