Multiple Drive Failures, Including 1 Parity



Hi Guys,

 

Pretty desperate. Long story short.

 

I have had a lot of drives fail in the past week: three so far. I suspect bad cables, so I have ordered new SATA data and SATA power cables and am awaiting their arrival.

 

My server has 14 array disks and two parity drives, plus one NVMe cache drive and one SSD cache backup drive mounted via Unassigned Devices.

 

At one point, one of the array drives showed as failed (a 16TB drive bought 8 months ago), and then an old 4TB drive showed as failed as well. I ordered new ones and replaced the 16TB and the 4TB (with an 8TB). While pre-clearing, I was moving some files around. One of my parity drives then showed SMART errors and eventually failed. The error was on the attribute Disk Shift... it had a fairly high count of around 500,000.

 

After pre-clearing, I assigned the new drives to the two array slots. After rebooting, I also had another drive showing data corruption. I was able to run the file system repair and brought it back. Then, while still in Maintenance mode (in Safe Mode), there was a SYNC button for Data Rebuild (I can't recall the exact label). I hit that, and it started rebuilding. It went somewhat fast... like 6 hours to rebuild 16TB... I suspected this was not right. I let it finish, and when I started the array (in non-safe mode), the two drives indeed showed as unmountable.
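The suspicion about the rebuild speed can be checked with simple arithmetic. The sketch below uses only the two figures from the post (16TB, roughly 6 hours) and assumes decimal units. The implied average is far above what a single spinning hard drive usually sustains, which is consistent with the rebuild not having covered the whole disk.

```shell
#!/bin/sh
# Average throughput implied by rebuilding a disk of a given size in a given time.
size_tb=16      # capacity of the rebuilt disk, from the post
hours=6         # approximate rebuild duration, from the post

# 1 TB = 1,000,000 MB (decimal units, as drive vendors use).
mb_per_s=$(( size_tb * 1000000 / (hours * 3600) ))
echo "Implied average speed: ${mb_per_s} MB/s"
```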

 

I took the failed Parity drive to another machine and ran a Short Test on it. It showed no errors, but still the high Disk Shift count. Nothing else seems abnormal. I took the drive and put it back in the server. I booted, and it still shows as disabled (missing).

 

As of right now, I have one disabled Parity and the two newly assigned drives. I have shut down the server while awaiting the new cables. If I start the array, the two new array drives have been assigned. If I unassign them (set them to no device) and re-add them, the same thing happens when I start the array.

 

The whole thing started when I upgraded the cache and cache backup to larger capacity and did the ZFS pool conversion for both. I then tried to re-use the old NVMe and SSD to create a future ZFS pool. Since my motherboard has only one NVMe slot, I bought and installed an NVMe to PCIe adapter card, which I plugged into an available slot on the motherboard. It all seemed to work fine... I worry that it was this card that messed things up. I have since removed it. In essence, I don't know if this is the culprit for all the issues.

 

Questions:

  • Is there a way to bring the one parity drive back so I can rebuild those two drives?
  • How can I make the two new array drives start a data rebuild?
  • I know that you need two parity drives to rebuild two array drives. What can I expect if the one bad parity drive cannot be brought back?
  • Any thoughts on what could be causing all those failures? Any ideas to resolve the potential underlying issue?
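Why one parity drive cannot rebuild two data drives can be illustrated with a toy example. This is a minimal sketch that assumes the first parity behaves like plain XOR parity; the byte values are made up for illustration, and the second parity uses a different calculation that is not modelled here.

```shell
#!/bin/sh
# Toy single-parity example: parity is the XOR of the data values.
d1=165; d2=60; d3=15        # illustrative byte values, not real disk data
parity=$(( d1 ^ d2 ^ d3 ))

# One disk missing: XOR of parity and the surviving disks returns the lost value.
recovered_d2=$(( parity ^ d1 ^ d3 ))
echo "parity=$parity recovered_d2=$recovered_d2"

# Two disks missing: the same equation only yields d2 XOR d3, not each value.
d2_xor_d3=$(( parity ^ d1 ))
echo "d2 xor d3 = $d2_xor_d3 (many pairs of values fit this)"
```

With a single equation and two unknowns there is no unique answer, which is why two missing data drives need two valid parity drives.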

 

For SATA controllers, my Supermicro MB has three mini-SAS connectors, each with an SFF-8087 to 4x SATA breakout cable. I also have one of those LSI / Adaptec (can't recall) cards in a PCIe slot with 2 SFF-8087 to 4x SATA cables. This card has been working for years.

 

Waiting for Amazon to deliver, but not hopeful that this will be the right solution.

 

Any advice is greatly appreciated. I am not sure how to send one of those diagnostic test results, or what to include... if needed, please point me in the right direction. I recall some ages ago having to install something on my Windows machine so that unRAID dumped the logs into it.

 

Thank you!!

 

H.

 


Thank you Michael.

 

Small update. I have changed all the cables and the situation is still bad.

 

I am attaching diagnostics... Once again I am having to run the file system repair on the same drive I repaired before. It keeps getting corrupted and becomes unmountable.

 

When the Array is mounted, it is offering me to format the two replaced array drives. I don't want to do this until I find out if there is a way to rebuild something.

 

Please help... not sure how I can get some of my data back.

 

Thank you,

 

H.

tower-diagnostics-20230908-2208.zip


Thank you Jorge...

 

I repaired Disk 10 using xfs_repair on a separate computer (using PartedMagic). It ran without issues. I put it back in my unRAID server on a different SATA port/card and it seems to work OK. I ran xfs_repair in Maintenance mode a couple of times and it looks happy.
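A cautious way to approach a repair like this is to do a read-only check first and only then the real repair. The sketch below just prints the two commands for review rather than running them. The helper name `xfs_repair_plan` and the device path `/dev/mdX` are placeholders, since the actual device name depends on the Unraid version and the disk slot. The `-n` flag is xfs_repair's no-modify mode.

```shell
#!/bin/sh
# Print (not run) a check-then-repair sequence for one XFS device.
# The device path is a placeholder; substitute the real one before use.
xfs_repair_plan() {
    device=$1
    # -n = no-modify mode: report problems without writing to the disk.
    echo "xfs_repair -n $device"
    # Actual repair; run only after reviewing the output of the check above.
    echo "xfs_repair $device"
}

xfs_repair_plan "${1:-/dev/mdX}"
```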

 

I am currently running xfs_repair on both the new array drives (6 and 15). It's been running for a couple of hours and this is what I see:

 

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
....

Those dots ...... at the end scroll horizontally for a long time. The read count on the drives keeps going up, so I will leave it for a while.

 

These two are the drives that unRAID rebuilt while one parity drive was disabled. The rebuild ran too fast (about 6 hours) and I don't think it did a proper job. I still have one of the Parity drives showing as disabled.

 

I will wait for something to happen to the currently running repairs, reboot and run a new diagnostic.

 

Thanks again.

 

H.

On 9/6/2023 at 8:22 AM, hernandito said:

attribute Disk Shift... it had a fairly high count of around 500,000. 

That attribute isn't normally monitored, and it isn't clear how to interpret it since it may be a multi-part binary encoding. Other SMART attributes for that disk look OK, but it hasn't had an extended self-test run.
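One way to eyeball whether a raw value might be several small counters packed together is to split it into 16-bit words. SMART raw values are six bytes wide, so three words cover the whole field. This is only a sketch: whether Disk Shift is actually packed this way is an assumption, and 500000 stands in for the approximate figure quoted above. The function name `split_raw` is made up for illustration.

```shell
#!/bin/sh
# Split a SMART raw value into three 16-bit words for inspection.
# That the attribute is packed like this is an assumption, not a known fact.
split_raw() {
    raw=$1
    high=$(( (raw >> 32) & 65535 ))
    mid=$((  (raw >> 16) & 65535 ))
    low=$((  raw         & 65535 ))
    printf '0x%012x -> high=%d mid=%d low=%d\n' "$raw" "$high" "$mid" "$low"
}

split_raw 500000
```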

 

Since neither unmountable disk is being emulated, I wonder if getting parity2 back into the array and then trying to emulate the unmountable disks would have different results. Or just emulate one of them with the parity1 to see what would happen.

 

Wait a few hours and see if @JorgeB has any other ideas.

 

1 hour ago, JorgeB said:

Possibly the previous rebuild didn't finish completely; you can see if emulating one of them provides better results.

 

 

Thank you @JorgeB. I powered down the server; disconnected one of the new array drives, and upon reboot, array is stopped citing "configuration change".

 

EDIT:

I did the same with the other drive and got the same results.

 

 


I ran the xfs_repair… it went through a lot of errors. When it finished, I started the array, and there was only a “lost and found” folder. In it, there are many numbered folders. Some were empty, and some had files in them. Since it’s media, I can decipher what they are. I am not sure how complete it is.

 

I still have it showing as “no device” since we unassigned it. Should I assign the drive again and have it rebuild? Feeling some hope.

 

Thank you.

5 minutes ago, hernandito said:

Since it’s media, I can decipher what they are. I am not sure how complete it is.

It is frequently reasonably correct. You can probably gauge how much space is showing as used compared to what it was before. The Linux 'file' command can help you determine the exact media type, but it is a lot of work to recover from the lost+found folder, so if you have backups, restoring from them is normally easier.
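A minimal sketch of that approach is shown here, assuming the recovered files sit under a lost+found folder on the emulated disk. The path in `LOST_DIR` and the function name `identify_files` are placeholders for illustration, so adjust the path to wherever the folder actually appears.

```shell
#!/bin/sh
# Summarize what xfs_repair left in a lost+found folder.
# LOST_DIR is a hypothetical path; point it at the real lost+found location.
LOST_DIR="${LOST_DIR:-/mnt/disk15/lost+found}"

# Print "<path>: <mime type>" for every regular file under a directory.
identify_files() {
    find "$1" -type f -exec file --mime-type {} +
}

if [ -d "$LOST_DIR" ]; then
    du -sh "$LOST_DIR"          # total recovered size, to compare with earlier usage
    identify_files "$LOST_DIR"
fi
```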

 

5 minutes ago, hernandito said:

Should I assign the drive again and have it rebuild? Feeling some hope.

 

If you still have the 'failed' drive intact then there is a chance that running an xfs_repair against that drive may be more successful than running it against the emulated drive and it would not hurt to try this.  Hopefully at this point the drive in question is showing up under Unassigned Devices?   You could also try and mount it from there and see if it can be mounted by UD.  If you go through the rebuild route you will end up with exactly what is showing on the emulated drive (i.e. the big lost+found folder).

 

 


For Disk 6, I managed to copy some data off it to a spare drive when it started failing. I did the copy on another system (PartedMagic) and kept copying until I ran out of space on the spare disk. I still have the original, and I can try to keep copying more.

 

I also have a listing of the folders / subfolders on all my drives. I can always rebuild my media.


Thank you @itimpi.

 

Please confirm these are the steps I should follow:

  1. Assign a new drive 15, and let unRAID rebuild it (with all the "lost & found").
  2. Keep Disk 6 unassigned for now. I think this one started failing with read errors.
  3. Replace the disabled Parity drive, and rebuild the parity.
  4. Assign a new Disk 6 and format it. Insert the spare drive as a UD, and copy over the Disk 6 data I managed to save.
  5. Enable Docker Service.
  6. Start working on the "lost and found" directory and compare it against my saved folder listings (see more below) to start replacing whatever media I can't recover.
  7. I could evaluate what media is worth replacing - it may be OK to cull the collection.

 

I have my system run the script below daily so I can log exactly what is stored on each drive. I did not author the script, and I can't recall who did to give proper credit (but I love that person very much!!). I copied the file to my Win machine for safe-keeping. This only logged Movies and TV... I would not have any record of other non-media files stored on that drive. Hopefully nothing much. Lost & Found possibly saved something.

 

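#!/bin/bash
# Catalog the movie and TV show folders on every array disk into a dated report.

DATE=$(date +%Y%m%d)
REPORTS_PATH=/mnt/cache/appdata/reports

echo "Searching for movies and TV shows..."

mkdir -p "$REPORTS_PATH"
# List directories exactly two levels below each top-level folder on every disk,
# skip hidden ones, and keep only paths under a Movies or TV2 folder.
find /mnt/disk*/* -maxdepth 2 -mindepth 2 -type d -not -iname '.*' \( -path "*/Movies/*" -or -path "*/TV2/*" \) -print | sort -k1 > "$REPORTS_PATH/media.catalog_$DATE.csv"

echo
echo "Finished."
echo
echo "File created at $REPORTS_PATH/media.catalog_$DATE.csv"

echo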
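Two catalogs produced by the script above could be compared along these lines to list folders that have gone missing. This is a sketch, not part of the original script: the function name `missing_folders` is made up, and it assumes each catalog has one folder path per line. It strips the `/mnt/diskN/` prefix so a folder counts as present no matter which disk it ends up on.

```shell
#!/bin/sh
# Print folders listed in an older catalog that are absent from a newer one.
# Usage: missing_folders OLD_CATALOG NEW_CATALOG
missing_folders() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # Drop the /mnt/diskN/ prefix so the disk a folder lives on does not matter.
    sed 's|^/mnt/disk[0-9]*/||' "$1" | LC_ALL=C sort -u > "$old_sorted"
    sed 's|^/mnt/disk[0-9]*/||' "$2" | LC_ALL=C sort -u > "$new_sorted"
    # comm -23 keeps lines found only in the first (older) list.
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Run only when two catalog files are passed on the command line.
if [ "$#" -eq 2 ]; then
    missing_folders "$1" "$2"
fi
```

Passing a catalog from before the failures as the first argument and a fresh one as the second gives a list of folders to look for in lost+found or to replace.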

 

While the "Lost & Found" procedure may be tedious, I am patient... and it will give me something to do.

 

Please advise.

 

Thank you.

 

 

