Data rebuild failure following disabled drives after power loss

DaveW42 · September 12, 2023

Hi, I am running Unraid 6.12.4 with two parity disks, 16 disks in the array, an NVME cache drive, and a few unassigned devices drives. Most of the drives are connected to one of two LSI Logic SAS9211-8I 8PORT Int 6GB Sata+SAS Pcie 2.0 cards, with the others connected to SATA ports on the motherboard. I also have an IO CREST Internal 5 Port Non-RAID SATA III 6GB/s M.2 B+M Key Adapter Card connected to one of two NVME slots on my motherboard (Asus ROG STRIX X570-E GAMING) to provide the opportunity to add a few more SATA drives. Before the problems emerged, no drives were connected to that IO Crest device.

I had the case open and had installed an SSD drive (SK Hynix 1TB) to (I believe) a SATA port on the motherboard and (separately) an NVME drive (WD 2TB) to one of the open USB ports on the motherboard using an Inateck USB NVME enclosure, with the intention of using either of these for a new gaming VM. After sealing things up, putting the system back in place, and turning it on, the system lost power briefly. When it came back up Disk 1 and Parity 2 came up as disabled, with the contents of Disk 1 being emulated. I ran an extended Smart Test on the parity disk and there were no errors. I tried adding another HDD to the system (i.e., connecting to the IO Crest via the backplane on my computer case) with the intention of using it to replace Disk 1, but the new disk did not show in unassigned devices. I ran a regular smart test on Disk 1, and there were no errors. Given this I rebooted the array in safe mode, unassigned the disabled drives (Disk 1 and Parity 2), and then started up the array to make sure those drives were removed. I then rebooted in safe mode again, shutdown the array again, added Disk 1 and Parity 2 back in their original position, and started parity sync and the data rebuild process.

Things went badly very quickly at this point, with disk 9 and disk 16 almost immediately coming up as disabled. I briefly saw a message flash about CRC errors involving an unassigned device. At around 26.6% of the Disk 1 data rebuild, I was no longer able to interact with the unraid system. However, I could see that a separate Unraid Win 10 virtual machine was still running without issue. I didn’t touch anything and about 15 minutes later the system became responsive again and I could click on menus etc. The system currently shows the following:

· Parity 2: red x (parity device is disabled)

· Disk 1: Green but lists as “unmountable: unsupported or no file system.”

· Disk 9: red x, lists as “unmountable: unsupported or no file system.”

· Disk 16: Green but lists as “unmountable: unsupported or no file system.”

I don’t believe that Parity 2 has seen much/any real activity as a result of the rebuild. In its disabled state it shows 1 read, 4 writes, and 2 errors.

Attached is the diagnostic file. In terms of next steps, should I power down the system, check all cables and make sure that the LSI cards are properly seated in the motherboard? Help would be greatly, greatly appreciated.

Dave

nas24-diagnostics-20230912-0015.zip

JorgeB · September 12, 2023

There are read errors on at least four disks across both controllers, suggesting a power/connection problem, check/replace cables and/or use a different PSU if available and post new diags after array start.

DaveW42 · September 13, 2023

Thanks, JorgeB. Will do. You've got me thinking that my current PSU might not be enough given the number of devices I am running (I think current PSU is 800w). I will buy a new PSU, install it, and then post when I have the new diags (might be two or three days).

Dave

DaveW42 · September 16, 2023

Ok, I purchased a new 1500W PSU (Corsair HX1500i). This should be more than enough power for the system. My previous PSU was 850W. I also checked the cables and the seating of the LSI cards and these looked ok. Attached is the new diagnostic file.

Thanks!

Dave

nas24-diagnostics-20230915-2332.zip

JorgeB · September 16, 2023

No disk errors so far, but you need to check filesystem on disk1 and emulated disk9, run it without -n

DaveW42 · September 16, 2023

I ran the following in a terminal in maintenance mode:

xfs_repair -v /dev/md9

xfs_repair -v /dev/md1

In both cases I receive error messages saying "No such file or directory" and "fatal error -- couldn't initialize XFS library"

Dave

JorgeB · September 17, 2023

That's for v6.11 and older, the link explains how to use the GUI to run it, but if you prefer the CLI add p1, e.g.:

xfs_repair -v /dev/md9p1

DaveW42 · September 17, 2023

Thanks, JorgeB. Didn't realize I was supposed to use the GUI, and am happy to use it.

Here is the output for Disk 9 with the -v option specified.

Phase 1 - find and verify superblock... - block cache size set to 6101544 entries Phase 2 - using internal log - zero log... zero_log: head block 3794193 tail block 3794187 ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Here is the output for Disk 1 with the -v option specified.

Phase 1 - find and verify superblock... - block cache size set to 6071976 entries Phase 2 - using internal log - zero log... zero_log: head block 116970 tail block 116966 ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Thanks!

Dave

trurl · September 18, 2023

4 hours ago, DaveW42 said:

If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair

Unraid has already determined the filesystem is unmountable, so you have to use -L

DaveW42 · September 18, 2023

Thanks, trurl!

Just to confirm, so I should go back to the GUI and this time use the following options for both Disk 9 and Disk 1?

-vL

Thanks,

Dave

itimpi · September 18, 2023

2 hours ago, DaveW42 said:

Thanks, trurl!

Just to confirm, so I should go back to the GUI and this time use the following options for both Disk 9 and Disk 1?

-vL

Thanks,

Dave

You need the -L, the v is optional.

DaveW42 · September 18, 2023

Thanks!

Below are the results for Disk 9.

Dave

Phase 1 - find and verify superblock...
- block cache size set to 6101544 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 3794193 tail block 3794187
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
sb_fdblocks 125178818, counted 127618391
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 8
- agno = 4
- agno = 5
- agno = 6
- agno = 1
- agno = 7
- agno = 9
- agno = 3
Phase 5 - rebuild AG headers and trees...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:3794198) is ahead of log (1:2).
Format log to cycle 4.

XFS_REPAIR Summary Mon Sep 18 09:49:15 2023

Phase       Start       End       Duration
Phase 1:   09/18 09:44:43   09/18 09:44:43
Phase 2:   09/18 09:44:43   09/18 09:45:14   31 seconds
Phase 3:   09/18 09:45:14   09/18 09:46:41   1 minute, 27 seconds
Phase 4:   09/18 09:46:41   09/18 09:46:41
Phase 5:   09/18 09:46:41   09/18 09:46:42   1 second
Phase 6:   09/18 09:46:42   09/18 09:48:03   1 minute, 21 seconds
Phase 7:   09/18 09:48:03   09/18 09:48:03

Total run time: 3 minutes, 20 seconds
done

JorgeB · September 18, 2023

It should mount now.

DaveW42 · September 18, 2023

Thanks!

Below are the results for Disk 1.

Dave

Phase 1 - find and verify superblock...
- block cache size set to 6071976 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 116970 tail block 116966
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 7
- agno = 12
- agno = 4
- agno = 5
- agno = 6
- agno = 9
- agno = 1
- agno = 10
- agno = 8
- agno = 11
- agno = 3
Phase 5 - rebuild AG headers and trees...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (11:116995) is ahead of log (1:2).
Format log to cycle 14.

XFS_REPAIR Summary Mon Sep 18 10:05:02 2023

Phase       Start       End       Duration
Phase 1:   09/18 10:03:08   09/18 10:03:08
Phase 2:   09/18 10:03:08   09/18 10:03:37   29 seconds
Phase 3:   09/18 10:03:37   09/18 10:03:49   12 seconds
Phase 4:   09/18 10:03:49   09/18 10:03:49
Phase 5:   09/18 10:03:49   09/18 10:03:50   1 second
Phase 6:   09/18 10:03:50   09/18 10:03:59   9 seconds
Phase 7:   09/18 10:03:59   09/18 10:03:59

Total run time: 51 seconds
done

JorgeB · September 18, 2023

Should also mount.

DaveW42 · September 18, 2023

Thanks, JorgeB!

So I should restart the array (i.e., not in maintenance mode)?

Thanks,

Dave

itimpi · September 18, 2023

Just now, DaveW42 said:

Thanks, JorgeB!

So I should restart the array (i.e., not in maintenance mode)?

Thanks,

Dave

Yes. The disks should now mount fine.

DaveW42 · September 18, 2023

Thanks!

DaveW42 · September 18, 2023

Attached is the new diagnostics file.

Thanks!

Dave

nas24-diagnostics-20230918-1036.zip

JorgeB · September 18, 2023

If the emulated disk9 contents look correct you can rebuild on top, and re-sync parity at the same time:

https://docs.unraid.net/unraid-os/manual/storage-management#rebuilding-a-drive-onto-itself

DaveW42 · September 18, 2023

Contents of Disk 1 and Disk 9 look great, thank you!!!

When rebuilding the drive back onto itself, should I UNassign both Disk 9 and Parity 2, or would I just unassign Disk 9?

Thanks again!!!!

Dave

JorgeB · September 18, 2023

You can do both at the same time.

DaveW42 · September 18, 2023

Got it, thank you. I will unassign both Disk 9 and Parity 2.

Thank you!

Dave

DaveW42 · September 18, 2023

Thanks, JorgeB! Data rebuild is commencing as indicated.

As an additional data point in case anyone is curious, despite having so many drives the parity rebuild process is only requiring an additional 30 watts with respect to power consumption (80 plus platinum PSU).

Dave

DaveW42 · September 20, 2023

The rebuild process and parity check has completed, and everything looks great (no data loss!)

Thanks so much for all the help, JorgeB, itimpi, and trurl!!!! It is greatly appreciated.

Dave

Data rebuild failure following disabled drives after power loss

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation