Multiple Issues After Power Outage

Dradder1 · November 16, 2021

Hello All,

I apologize in advance for the long message. I'm trying to be concise but at the same time provide the details needed.

The TLDR is I had a power outage at home, and multiple data disks and my sole cache drive have failed or had issues during the various data rebuild processes. Dockers don't work and one of my media shares (TV) is completely empty now.

Detailed explanation below.

I recently had a power outage at home. My Unraid server lost power. I have a DAS configured and it's connected via external SAS.

Both were plugged into an APC unit, but only the DAS was on the backup side; the primary server was on surge only. My big mistake.

I powered up the server and it seemed to come up fine. I then noticed a couple data drives disabled. Not ideal but not too big a deal I thought as I have two parity drives and a few spares.

image.png.645b8433874e976c848a6ea661573c94.png

I shut down the server and replaced disk 8 (14TB for 14TB), and it rebuilt from parity successfully in a little over a day.

image.png.79391b2c8474f1a5cfb25ab631eed39d.png

Once that was done I shut down again and replaced the second drive, disk 12. I upgraded this disk from 8TB to 12TB. I put the new disk in and all of a sudden disk 11 was completely missing. I checked the connections and all looked good. I decided to let the disk 12 replacement finish rebuilding and then revisit disk 11.

image.png.79a1ae858c524495fb0a68cbdc7288bd.png

I then tried to access my Plex instance, which runs as a docker on my cache drive. It did not work. I tried to restart the docker but it failed to load the page. I checked my shares out of curiosity and I only saw my media shares on the data disks but no appdata, domain, system, etc. My single cache drive has a CRC error, but I tested it many times when it first reported that a while back. I again decided to revisit this after the second disk finished rebuilding and disk 11 was dealt with.

image.png.6aef4d7794b84de730c244c4b0b015e0.png

The second disk finished rebuilding and returns to normal operation.

image.png.1c6c3f8dbe3f5aa306a52e872b1762ec.png

I had one more spare and replaced disk 11 and let the data rebuild process start. This was a like for like 12TB for 12TB.

While this starts I get an error message from the Fix Common Problems plugin. It shows that my disk 12, which was recently replaced, cannot be written to, i.e. it is read-only. I will investigate after disk 11 finishes its data rebuild.

Before disk 11 finishes rebuild disk 13 goes to error state. I again decide to wait to investigate after disk 11 finishes data rebuild.

image.png.59d537d738b2dfd87ee6188c411aa871.png

Disk 11 finishes rebuild but with many errors.

image.png.efc500952920ebb879697ff3719214fd.png

I stopped the array and disks 8 (replaced) and 13 (not replaced) are both disabled. These are the drives where all the errors are coming from.

image.png.29c1eb43d99d91aaaf246e923768e3ae.png

I decide to cut my losses for the day and shut down the server and DAS.

I ordered a new PCI SATA port expander and SATA cables. They came in today and I replaced both. The drives that failed seem, for the most part, to have been connected to one PCI card. My cache drive was also connected there.

I power my DAS up first, then the server. Disks 8 and 12 are still disabled but now show as unmountable, not mounted.

I may have jumped the gun and tried a few things to fix the problems after reading the forums. I check the filesystem for disks 8 and 13 via the GUI. First without the -n, and errors are found in both. I try to restart the array but they still are not working. I recheck the filesystem with -n and errors are found, but the disks are still not mountable. Finally I use -vL when checking the filesystem. It seems to partially work, as both disks mount but are still disabled.

I then try to focus on my cache drive and ignore the data array for now. I try steps 1 and 2 from this link and they do not work for me. I go with the nuclear 3rd step, btrfs check --repair. No change.

I have a second SSD and added it to my cache in the hope that it would let me access shares such as appdata, system, and domain. The second SSD added fine but there is still no access to dockers.

Finally, one of my shares set up for Plex shows as being completely empty. I have two set up, one for movies and one for TV shows. Most of the disks that had issues are in the TV shows share but one is used for movies. I am able to view the contents of the movies share for the most part. For the TV shows share, in the Main view in the console I can see data on the disks allocated to that share.

At this point are there any thoughts as to what I can do to repair my Unraid instance? I am totally lost. I don't know where to begin/continue. Ideally I'd like to not lose any media data, but at this point I accept I may lose most if not all. I'd like to save whatever data I can as I never made a full backup.

Thanks in advance.

I almost forgot to add the diagnostics file.

tower-diagnostics-20211115-2049.zip

Edited November 16, 2021 by Dradder1
I forgot to upload the diagnostics zip file.

trurl · November 16, 2021

Why didn't you ask for help before now? Repair won't enable a disk, in fact, repairing a disabled disk only repairs the emulated disk. But repair before rebuild is the usual recommendation, especially if rebuilding to the same disk.

Do you still have the original disks? Probably nothing wrong with them.

Attach diagnostics to your NEXT post in this thread.

Dradder1 · November 16, 2021

I have attached the server diagnostics.

I have the original three drives that were replaced set aside. If logs are needed for those should I connect them to my server and then provide?

Thanks,

tower-diagnostics-20211115-2049.zip

trurl · November 16, 2021

Were the problems on this controller?

0b:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9128 PCIe SATA 6 Gb/s RAID controller with HyperDuo [1b4b:9130] (rev 11)

Marvell NOT recommended.

You have a lost+found share on disk8 created by the repair. Have you looked at it?

Diagnostics seems to think TV Shows share exists on disks 9, 10, 11, but filesystem corruption on disk11 though it is mounted.

Has anything been written to your server since these problems began?

Do you have another copy of anything important and irreplaceable?

Dradder1 · November 16, 2021

I think my old ASUS motherboard has a couple of onboard Marvell SATA ports. I'm not sure about the PCIe card, but its controller may be Marvell as well. I do have an internal SAS card on order and can get any drives off the Marvell controller(s).

I see the lost+found share but no data appears to be there.

Nothing has been written to the system other than the steps I listed in my initial post.

I have a partial copy of the data. If possible I'd like to save some media off the "empty" share. If it isn't I'll deal with it.

Nothing is critical and can be replaced.

Edited November 16, 2021 by Dradder1

trurl · November 16, 2021

2 minutes ago, Dradder1 said:

I do have an internal sas card on order and can get any drives off the Marvell controller(s).

Maybe you should wait until then to try to fix things.

Dradder1 · November 16, 2021

I'll wait for the SAS card to arrive and move the drives there and off of Marvell.

I will provide an update then.

Thanks for your help.

Dradder1 · November 24, 2021

I finally received my internal SAS card and placed it in my primary system. I removed any drives that were plugged into the Marvell controllers.

At this point all drives are plugged in, the array is up, and there is no change.

I've attached diagnostics logs after booting up.

Any help would be appreciated.

tower-diagnostics-20211124-1839.zip

trurl · November 25, 2021

Been a while.

22 minutes ago, Dradder1 said:

no change

None of that will have enabled the disks, they have to be rebuilt.

You have 2 disabled disks, dual parity, all disks mounted including the disabled/emulated disks. SMART for the disabled disks looks OK though neither has had an extended test run. Haven't checked SMART for each of your large number of disks. Do any of your disks have SMART warnings on the Dashboard page?

Safer to rebuild to spares if you have them but should be OK to rebuild to the same disks.

Dradder1 · November 25, 2021

I do have a couple disks with SMART errors on the dashboard. One is a disabled disk while another seems to be "fine".

1509544637_Dashboard-Array.png.5b6039e83bdec81baaaccc484efb3f7e.png

I have the original three disks I replaced back when these issues first started. I also have one new unused disk.

Is it best practice to try and replace one of the disabled drives and see how that goes?

trurl · November 25, 2021

Both of those are a single UDMA CRC ERROR. These are connection problems not disk problems. Click a warning to acknowledge it and it won't warn again until it increases.
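For anyone following along, here is a minimal sketch of reading that counter from the command line rather than the Dashboard. It assumes the usual `smartctl -A` attribute table layout, where attribute 199 is UDMA_CRC_Error_Count and the raw value is the tenth column. The sample line is illustrative only, not taken from the diagnostics in this thread, and `udma_crc_count` is a hypothetical helper name.

```shell
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` style text supplied on stdin.
udma_crc_count() {
  awk '$1 == 199 { print $10 }'
}

# Illustrative attribute line (hypothetical values, not from this thread).
sample='199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1'

printf '%s\n' "$sample" | udma_crc_count
```

On a live system the same helper could be fed with `smartctl -A /dev/sdX | udma_crc_count`, where sdX is a placeholder for the disk in question. A count that stays put after reseating or replacing cables points at the old connection rather than the disk.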

1 hour ago, trurl said:

Safer to rebuild to spares if you have them but should be OK to rebuild to the same disks.

Rebuilding to spares allows you to keep the original disks as they are with their contents until you are satisfied with the rebuild.

Since you have dual parity you can rebuild both at once.

https://wiki.unraid.net/Manual/Storage_Management#What_is_a_.27failed.27_.28disabled.29_drive

Dradder1 · November 25, 2021

I can place in two new drives and let the data rebuild but have a couple questions/concerns before.

The last time I had a parity check was when I first had these issues and was replacing disks. Due to the multiple issues encountered then the parity check found many errors.

1957743331_ParityStatus.png.30b5b26a87feb54e07ceaeba36bdb073.png

I also noticed that Disk 12 is 12TB but shows as 8 in the data column. This was the second disk I replaced and it completed rebuild successfully 10 days ago or so.

Rebuild message on 11/13

213945203_Disk12RebuildCompletionMessage.png.53202e2cfe2244c8c75fb22ff16db6fb.png

Disk 12 status on 11/24

Would either of these cause any issues if I were to put in new drives to rebuild? If not I will proceed to let that process run.

Thanks

trurl · November 26, 2021

Nov 24 18:38:36 Tower emhttpd: shcmd (123): mkdir -p /mnt/disk12
Nov 24 18:38:36 Tower emhttpd: shcmd (124): mount -t xfs -o noatime /dev/md12 /mnt/disk12
Nov 24 18:38:36 Tower kernel: XFS (md12): Mounting V5 Filesystem
Nov 24 18:38:36 Tower kernel: XFS (md12): Ending clean mount
Nov 24 18:38:36 Tower kernel: xfs filesystem being mounted at /mnt/disk12 supports timestamps until 2038 (0x7fffffff)
Nov 24 18:38:36 Tower emhttpd: shcmd (125): xfs_growfs /mnt/disk12
Nov 24 18:38:36 Tower kernel: XFS (md12): Corruption warning: Metadata has LSN (101:6655742) ahead of current LSN (1:95417). Please unmount and run xfs_repair (>= v4.3) to resolve.
Nov 24 18:38:36 Tower kernel: XFS (md12): Metadata CRC error detected at xfs_allocbt_read_verify+0xd/0x3a [xfs], xfs_bnobt block 0x37fffffd0 
Nov 24 18:38:36 Tower kernel: XFS (md12): Unmount and run xfs_repair

check filesystem on disk12
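A minimal sketch of how lines like these could be picked out of a syslog, so you can see which array devices XFS is asking you to repair. The sample text is copied from the excerpt above; `xfs_needs_repair` is a hypothetical helper name, and the `/var/log/syslog` path in the note below is an assumption about where the live log sits.

```shell
# List md devices whose XFS kernel messages ask for xfs_repair.
# Reads syslog text on stdin and prints each device name once.
xfs_needs_repair() {
  grep -E 'XFS \(md[0-9]+\): .*run xfs_repair' \
    | sed -E 's/.*XFS \((md[0-9]+)\).*/\1/' \
    | sort -u
}

# Sample lines copied from the syslog excerpt above.
sample='Nov 24 18:38:36 Tower kernel: XFS (md12): Ending clean mount
Nov 24 18:38:36 Tower kernel: XFS (md12): Corruption warning: Metadata has LSN (101:6655742) ahead of current LSN (1:95417). Please unmount and run xfs_repair (>= v4.3) to resolve.
Nov 24 18:38:36 Tower kernel: XFS (md12): Unmount and run xfs_repair'

printf '%s\n' "$sample" | xfs_needs_repair
```

Against a live log this would be something like `xfs_needs_repair < /var/log/syslog`. It only reports what the kernel already logged; it does not check or change any filesystem.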

Dradder1 · November 26, 2021

I followed the instructions and ultimately ran a repair on disk 12, and it's reporting a normal utilization level now.

306876816_Disk12ReturntoNormalUtilizationLevel.png.a150e66f395645619f0949190c7b0156.png

I checked the Main screen and disk 12 now shows the right size.

At this point do you feel it's safe to replace the two disabled drives then let it rebuild?

Thanks

trurl · November 27, 2021

According to those screenshots, disk12 was rebuilt on the 13th, but then a parity check on the 14th found a large number of sync errors. Was that a correcting parity check?

Dradder1 · November 27, 2021

I think this is because I made another mistake before reaching out to the forums for help.

When I first started having problems disks 8 and 12 were disabled.

I replaced disk 8 and it successfully rebuilt data on the 12th.
I replaced disk 12 on the 12th. When I powered on the system disk 11 was not detected. Instead of troubleshooting this I let disk 12 be rebuilt. It completed successfully on the 13th.
I then replaced disk 11 and this is when the parity errors were generated that show on the 14th.

trurl · November 27, 2021

58 minutes ago, Dradder1 said:

I then replaced disk 11 and this is when the parity errors were generated that show on the 14th

Do you mean you successfully rebuilt disk11 and then you ran a (correcting or non?) parity check?

Dradder1 · November 27, 2021

Disk 11 returned to normal operation but I did not get the message when it completed saying that data rebuild finished with no errors.

This is the message I received.

1531341722_Disk11MessageAfterRebuild.png.47e49cec827c942cdd9ec7b7b53a3a3b.png

This is unlike when I rebuilt disk 12 days earlier which did complete successfully.

1269584267_Disk12MessageAfterRebuild.png.a6f363e5174bce6178bc0901b1206bb4.png

trurl · November 27, 2021

Same number of errors on rebuild as on parity check.

14 hours ago, trurl said:

Was that a correcting parity check?

Dradder1 · November 28, 2021

I replaced the disks a couple days ago and today it finished data rebuild.

There are no errors in the rebuild process.

2072605610_1128-01.png.e7b3f0fe9998feeb01e12be0a5a4f779.png

List of disks

Some of my other shares now appear that were on the cache drive.

No dockers but I think that is okay as I can download them again.

My Movies share seems to be fine. I can navigate the folder contents and see most data there.

My TV shows share shows "too many files" when I click to view its contents.

1138718037_1128-06.png.bd09cd66f0059a79b43c3f46ab318f01.png

If I view one of the disks that have TV shows I can see folders and then data files under there.

484480141_1128-07.png.5b3d4e546b69d4e8dca4b6c8b5266c6c.png

Is there any way that the TV Shows share can be fixed? I searched for the "too many files" string and found a case where some disks had to be repaired. Not sure if that's the path to take.

I'm attaching diagnostics from today.

Thanks

tower-syslog-20211128-1348.zip

Edited November 28, 2021 by Dradder1
Add description to one picture.

trurl · November 28, 2021

That is only syslog, not diagnostics.

All disks mounted so no reason to think repair is needed.

Have you checked your lost+found share?

2 hours ago, Dradder1 said:

My TV shows share shows "too many files" when I click to view its contents.

I've never seen that. Is that in the webUI or from some other computer over the network? Do you have a screenshot?

Dradder1 · November 28, 2021

Here are the diagnostic logs.

tower-diagnostics-20211128-1141.zip

In regards to the TV shows share this is what I see when I connect via a Windows system.

98006614_1128-08.png.2a18ebb3a612884bf4daf873ed1d47d9.png

This is the view when I click TV Shows under the Shares menu.

This is compared to the Movies share which does show the sub-folders I have on it with files underneath.

I have checked the lost+found share and see folders and files there.

634598671_1128-11.png.67b71274979f8ef6de9da3600cfd4752.png

But if I check the disks assigned to the TV Shows share I can see the regular folder structure plus lost+found where applicable.

1013203382_1128-13.png.5dd9909e85f80d8ba78f5f5166e6aa9c.png

Samples of files under the TV Shows folder on this disk.

Dradder1 · December 2, 2021

I found another forum entry that had the "No listing: Too many files" message. That case was just like mine: checking the share shows no files in Unraid or via Windows Explorer. However, if you check the disks' contents, the folders/files are there.

In the case above the impacted end user was asked to check the drives via webGui.

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

He had to run a rebuild-tree and after a couple of days it seemed to have restored the empty share.

2101778555_UnraidForumNoListingTooManyFilesPossibleSolution.png.a342519184c52ff4ef183780f93e2cdd.png

Applying this to my case I found the following entries in my diagnostics logs.

I looked up online how to determine what disk md11 refers to.

I ran the following command and it gave me the serial number which associates to disk 11.

grep diskId.11 /proc/mdstat

Should I use the instructions at https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui on my disk 11? It is part of my TV Shows share so I believe this may be the way to go but wanted to check before.

Thanks!

trurl · December 2, 2021

md11 is always disk11, you don't need the serial number or the sdX device. You must repair the md# device or you will invalidate parity.
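As a rough sketch of what that means for a command-line check: the target is derived from the slot number alone. The `/dev/md<slot>` naming is an assumption based on the `mount -t xfs -o noatime /dev/md12 /mnt/disk12` syslog line earlier in this thread, and `md_check_cmd` is a hypothetical helper that only prints the command; it does not run it.

```shell
# Build the read-only filesystem check command for an array slot.
# Assumption: array devices are named /dev/md<slot>, as in the
# "mount -t xfs -o noatime /dev/md12 /mnt/disk12" syslog line in this thread.
md_check_cmd() {
  slot="$1"
  case "$slot" in
    ''|*[!0-9]*) echo "slot must be a number, e.g. 11" >&2; return 1 ;;
  esac
  # -n = no-modify mode: report problems only, write nothing to the device.
  echo "xfs_repair -n /dev/md${slot}"
}

md_check_cmd 11
```

Going through the md device keeps parity in step with any later repair, which is the point above. Pointing the tool at the underlying sdX device would change the disk without updating parity.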

Dradder1 · December 2, 2021

I ran the xfs_repair -nv command from the webGui and these are the results.

1063427628_Disk11xfsrepair-nvresults.png.4eec00e93d52063b31461b7533d740a3.png

Diagnostics are attached.

tower-diagnostics-20211201-2222.zip

Per this article running xfs_repair without the -n argument was suggested and worked.

Should this be my next step?
