January 8, 2025
I've got an array of 16 disks (14 data disks, 2 parity). Unfortunately, in an attempt to do things the right way (long story short, you can read my other posts from the past few days...), I somehow ended up losing two of those disks (1 parity, 1 data) through some kind of hardware or controller issue (or Seagate drives being Seagate and deciding they want to call it a day and take a nap while running). Every time that happens I have to stop and start the array because it brings down the second parity drive (it literally disconnects).

So my question is: I know the data from the two disks is gone (due to the attempt to re-add them to the array), but can I know what was on the data disk that "failed"? What files are missing in the array? Or will they just be present in /mnt/user/ due to the remaining parity drive, and I should concern myself with getting the data off the system either way?

I will try to take the controller out of the equation (replace it), but I don't have a spare and I'm in a slight time crunch at this point, so I'm hoping I can just get the data I need and then try to rebuild the array later.

EDIT: It's also worth noting I have 5 drives in the array that are COMPLETELY empty -- I'm wondering if I can still remove them from the array somehow with the 1 failed parity and 1 failed data disk (because everything was working just fine prior to adding another JBOD and the additional disks to try to shift things around). This would allow me to go back to using a single JBOD (hoping that will help).
Edited January 8, 2025 by VACInc
January 9, 2025
8 hours ago, VACInc said:
So my question is: I know the data from the two disks is gone (due to the attempt to re-add them to the array), but can I know what was on the data disk that "failed"? What files are missing in the array? Or will they just be present in /mnt/user/ due to the remaining parity drive, and I should concern myself with getting the data off the system either way?

It's not clear from your description, but if the missing disk is being emulated by parity, the data will be there. Please post the current diagnostics so we can see the array status, with the array started if that is possible.
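For anyone following along, here is a minimal sketch of what "emulated by parity" means in the simplest case. It assumes plain XOR parity over three tiny made-up "disks"; the names `xor_blocks`, `disk1`..`disk3` and the byte values are hypothetical, and the second parity drive in a dual-parity array uses a different calculation that is not shown here. It also illustrates why emulation is only as good as the reads from the other disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks" (real disks are just much longer byte strings).
disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3 = b"\xaa\xbb\xcc\xdd"

# XOR parity is computed across all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk2 goes missing: its contents can be recomputed from parity plus the survivors.
emulated_disk2 = xor_blocks([parity, disk1, disk3])
print(emulated_disk2 == disk2)

# If a surviving disk returns bad data (one flipped byte here), the emulated
# contents come out wrong as well.
bad_disk3 = b"\xaa\xbb\xcc\x00"
print(xor_blocks([parity, disk1, bad_disk3]) == disk2)
```

The first comparison prints `True` and the second prints `False`, which is why read errors on the other disks matter so much while a disk is missing.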
January 9, 2025 Author
Hey guys, thanks for the help! Attached is the diag: normandy-diagnostics-20250109-0848.zip
January 9, 2025
There are read errors on disks 4, 6 and 7. This looks more like a power/connection issue: check all cables, especially anything shared by all those disks, then post new diags after array start.
January 9, 2025 Author
Correct, but what is very weird is that they only appear when doing a parity sync -- reading from those drives is perfectly fine otherwise (zero read errors when not doing a sync). This is what led me to believe there's an issue somewhere between the disks and the motherboard (either the controller or the cabling) -- I have a new one coming soon. In the meantime, based on that, I assume my data should all still be present, correct (assuming no additional drive commits harakiri)?
January 9, 2025
On 1/9/2025 at 3:38 PM, VACInc said:
Correct; but what is very weird is they only appear when doing a parity sync -- reading from those drives are perfectly fine otherwise (zero read errors when reading from when not doing a sync).

This could be a power-related issue, as a parity sync is when the system likely has its maximum power draw.
January 9, 2025 Author
That would also be interesting... potentially a power issue because of differentials. Most of the drives are powered via the SA120 (12 drives -- never had an issue), and then 4 are external and just running off a consumer NZXT C1200. They're also using a generic (and sketchy?) SFF-8088 to 4x SATA cable, which is what my brain keeps thinking is potentially problematic. I have a QNAS 4x mini JBOD showing up tomorrow; that'll allow me to switch out both the power and the cable. Hopefully that gets us to the destination.
Edited January 9, 2025 by VACInc
January 9, 2025 Author
Though I'm just realizing now that, if that is part of the issue, it also means the LSI HBA handles problems on the external drives poorly and lets them affect the other connected JBOD (because the parity drives are in the SA120, and the rest of the drives with read errors are not). Not unimaginable, but still surprising; I would think it would be more robust and less prone to stability issues.
January 11, 2025 Author
Okay, I've stabilized things for the time being by using a separate JBOD and controller for the "extra" drives. That said, one drive had to be taken out of the equation, so I'll have 2 fully healthy parity drives and still a missing "disk 9" (out of 14 data disks). The missing drive cannot go back into the array, as it seems to be faulty (scanned on a separate system; it needs to be returned).

I don't need that extra space, so can I create a new config without disk 9 and without any additional empty drives and move on with life? I assume no data is lost at this point, correct?

Thanks again for your help, everyone.
Edited January 11, 2025 by VACInc
January 12, 2025
15 hours ago, VACInc said:
I don't need that extra space, so can I create a new config without disk 9 and without any additional empty drives and move on with life? I assume no data is lost at this point, correct?

Assuming disk9 is also empty/lost, you won't lose any additional data.
January 12, 2025 Author
@JorgeB Disk9 is lost, but there was data on it. I assumed that data still wouldn't actually be lost, though, because there was still a parity drive functioning. Am I correct?
January 12, 2025
2 hours ago, VACInc said:
I assumed that data still wouldn't actually be lost, though, because there was still a parity drive functioning. Am I correct?

Not if you do a new config.
January 12, 2025 Author
@JorgeB Even if I get the second parity drive healthy again? There is no way around needing a drive in "disk 9"?
January 12, 2025
25 minutes ago, VACInc said:
Even if I get the second parity drive healthy again? There is no way around needing a drive in "disk 9"?

A new config will lose any emulated disks. If you can currently access the data on emulated disk9, copy it to another disk (or disks), and you can then do a new config.
January 12, 2025 Author
@JorgeB It's not showing as emulated though, just faulty/not installed. Can emulation be forced?
Edited January 12, 2025 by VACInc
January 12, 2025
If the other disk errors are fixed, you should be able to emulate it. Post current diags after array start.
January 13, 2025 Author
@JorgeB Attached! Again, thanks for your patience with this.
normandy-diagnostics-20250113-1406.zip
January 13, 2025
Emulated disk9 is not mountable. Check filesystem on missing/emulated disk9 from the webUI. Post the output.
January 13, 2025
I can only see the end of the log because of all the log spam. Was this a rebuild or a parity/read check?
Jan 13 13:32:38 Normandy kernel: md: sync done. time=175518sec
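As a quick aside on that log line, the `time=` value is in seconds, which is hard to read at a glance. A small sketch of the conversion, assuming the `md: sync done. time=...sec` format shown above (the `sync_duration` helper name is made up for illustration):

```python
import re
from datetime import timedelta

# The syslog line quoted in the post above.
LOG_LINE = "Jan 13 13:32:38 Normandy kernel: md: sync done. time=175518sec"


def sync_duration(line):
    """Return the sync time from an 'md: sync done' line as a timedelta, or None."""
    match = re.search(r"md: sync done\. time=(\d+)sec", line)
    if match is None:
        return None
    return timedelta(seconds=int(match.group(1)))


duration = sync_duration(LOG_LINE)
print(duration)  # 2 days, 0:45:18
```

So this sync ran for a bit over two days before it reported done.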
January 13, 2025 Author
@JorgeB It was a parity sync restoring parity1.

@trurl
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.
January 13, 2025 Author
@trurl
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
January 14, 2025 Author
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x469f20, xfs_agi block 0x2/0x200
agi has bad CRC for ag 0
clearing needsrepair flag and regenerating metadata
Metadata CRC error detected at 0x47191d, xfs_inobt block 0x18/0x1000
btree block 0/3 is suspect, error -74
Metadata CRC error detected at 0x47191d, xfs_finobt block 0x20/0x1000
btree block 0/4 is suspect, error -74
undiscovered finobt record, ino 128 (0/128)
undiscovered finobt record, ino 169789760 (0/169789760)
undiscovered finobt record, ino 212608960 (0/212608960)
undiscovered finobt record, ino 222122432 (0/222122432)
undiscovered finobt record, ino 236685632 (0/236685632)
undiscovered finobt record, ino 237556416 (0/237556416)
undiscovered finobt record, ino 277155008 (0/277155008)
undiscovered finobt record, ino 291104832 (0/291104832)
undiscovered finobt record, ino 296201280 (0/296201280)
sb_fdblocks 4273345783, counted 4307396050
root inode chunk not found
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
found inodes not in the inode allocation tree
found inodes not in the inode allocation tree
        - process known inodes and perform inode discovery...
- agno = 0 15156cc4c680: Badness in key lookup (length) bp=(bno 0xa1ec940, len 4096 bytes) key=(bno 0xa1ec940, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xa1ec960, len 4096 bytes) key=(bno 0xa1ec960, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xcac27c0, len 4096 bytes) key=(bno 0xcac27c0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xcac27e0, len 4096 bytes) key=(bno 0xcac27e0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xcbae4c0, len 4096 bytes) key=(bno 0xcbae4c0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xcbae4e0, len 4096 bytes) key=(bno 0xcbae4e0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xd3d51c0, len 4096 bytes) key=(bno 0xd3d51c0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xd3d51e0, len 4096 bytes) key=(bno 0xd3d51e0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xe1b8940, len 4096 bytes) key=(bno 0xe1b8940, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xe1b8960, len 4096 bytes) key=(bno 0xe1b8960, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xe28d2c0, len 4096 bytes) key=(bno 0xe28d2c0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0xe28d2e0, len 4096 bytes) key=(bno 0xe28d2e0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x10850cc0, len 4096 bytes) key=(bno 0x10850cc0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x10850ce0, len 4096 bytes) key=(bno 0x10850ce0, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x1159e840, len 4096 bytes) key=(bno 0x1159e840, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x1159e860, len 4096 bytes) key=(bno 0x1159e860, len 16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x11a7ac40, len 4096 bytes) key=(bno 0x11a7ac40, len 
16384 bytes) 15156cc4c680: Badness in key lookup (length) bp=(bno 0x11a7ac60, len 4096 bytes) key=(bno 0x11a7ac60, len 16384 bytes) bad CRC for inode 128 bad CRC for inode 131 bad CRC for inode 132 bad CRC for inode 133 bad CRC for inode 128, will rewrite Bad mtime nsec 3332600597 on inode 128, resetting to zero Bad ctime nsec 1212846252 on inode 128, resetting to zero cleared root inode 128 bad CRC for inode 131, will rewrite Bad atime nsec 3628754445 on inode 131, resetting to zero Bad mtime nsec 3823352658 on inode 131, resetting to zero Bad ctime nsec 4201589378 on inode 131, resetting to zero Bad crtime nsec 4033325188 on inode 131, resetting to zero correcting imap cleared inode 131 bad CRC for inode 132, will rewrite correcting imap cleared inode 132 bad CRC for inode 133, will rewrite Bad atime nsec 1946158217 on inode 133, resetting to zero Bad mtime nsec 1991492530 on inode 133, resetting to zero Bad ctime nsec 1991492530 on inode 133, resetting to zero Bad crtime nsec 1946158217 on inode 133, resetting to zero correcting imap cleared inode 133 correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap 
correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap 
correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap 
correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap imap claims a free inode 237556416 is in use, correcting imap and clearing inode cleared inode 237556416 correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap 
correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap correcting imap - agno = 1 bad CRC for inode 2147483777 bad CRC for inode 2147483777, will rewrite free inode 2147483777 contains errors, corrected - agno = 2 - agno = 3 bad CRC for inode 6442451073 bad CRC for inode 6442451073, will rewrite cleared inode 6442451073 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14 - agno = 15 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... - agno = 0 - agno = 1 - agno = 4 - agno = 6 - agno = 12 - agno = 5 - agno = 17 - agno = 18 - agno = 8 - agno = 22 - agno = 11 - agno = 24 - agno = 25 - agno = 26 - agno = 13 - agno = 15 - agno = 14 - agno = 2 - agno = 16 - agno = 7 - agno = 19 - agno = 20 - agno = 21 - agno = 23 - agno = 10 - agno = 9 - agno = 3 Phase 5 - rebuild AG headers and trees... - reset superblock... Phase 6 - check inode connectivity... reinitializing root directory - resetting contents of realtime bitmap and summary inodes - traversing filesystem ... - traversal finished ... - moving disconnected inodes to lost+found ... disconnected dir inode 131, moving to lost+found Phase 7 - verify and correct link counts... 
resetting inode 139 nlinks from 2 to 3 Maximum metadata LSN (1:1368614) is ahead of log (1:2). Format log to cycle 4. done
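With this much output, it can help to boil the repair log down to the lines that matter for the "what did I lose?" question. Below is a rough sketch, not an official tool: it assumes the output has been saved with one message per line, and the `summarize` helper and `SAMPLE` text are made up for illustration (the sample lines themselves are copied from the output above). Inodes reported as cleared are entries xfs_repair had to zero out, and anything moved to lost+found was kept but lost its original location.

```python
import re

# A few messages copied from the repair output above, one per line.
SAMPLE = """\
bad CRC for inode 128, will rewrite
cleared root inode 128
correcting imap
cleared inode 131
correcting imap
cleared inode 132
imap claims a free inode 237556416 is in use, correcting imap and clearing inode
cleared inode 237556416
disconnected dir inode 131, moving to lost+found
"""

CLEARED = re.compile(r"cleared (?:root )?inode (\d+)")
MOVED = re.compile(r"disconnected (?:dir )?inode (\d+), moving to lost\+found")


def summarize(log_text):
    """Collect cleared inodes, lost+found moves, and a count of imap corrections."""
    cleared, moved, imap_fixes = [], [], 0
    for raw in log_text.splitlines():
        line = raw.strip()
        if (m := CLEARED.fullmatch(line)):
            cleared.append(int(m.group(1)))
        elif (m := MOVED.fullmatch(line)):
            moved.append(int(m.group(1)))
        elif line == "correcting imap":
            imap_fixes += 1
    return {"cleared": cleared, "lost_found": moved, "imap_fixes": imap_fixes}


summary = summarize(SAMPLE)
print(summary)
```

The same function could be pointed at the full saved output instead of `SAMPLE`; after that, the lost+found folder on the disk is the place to look for whatever was recovered without a name.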