Help! I really got myself into a state this time...



13 Data disks, 1 Parity (14TB), 1 cache.  ~120TB total.

 

I've been having issues since moving my system to a new rack-mount case about a month ago, presumably power-supply related, but things had been pretty stable until the other day:

 

Disk activity was running really slowly so I decided to reboot.  When I did, the Parity drive wasn't recognized.  I immediately jumped the gun, assumed it was dead, and ran to Best Buy to replace it with another 14TB WD drive (serial # ending in 94UG) to shuck.  

 

Once installed, it began rebuilding parity as normal.  At my current array size this process typically takes about 24 hours, but in this case it was going VERY slowly, with an estimate of hundreds of days.
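(Quick sanity check on those numbers: a 14TB parity drive at a typical average rebuild speed of roughly 160 MB/s works out to about 87,500 seconds, i.e. around 24 hours. An estimate in the hundreds of days implies average throughput well under 1 MB/s, which points at a struggling drive or link rather than normal load.)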

 

All the data seemed to be there but Plex was very flaky and the system overall was too slow to be very usable.  At this point I also heard one of the drives periodically clicking (in a bad way), but I couldn't tell which one.  It clearly wasn't the old parity drive because it was no longer connected.  

 

I tried running SMART tests and the Disk Speed Test plugin, but the results didn't make it obvious to me that anything was wrong.  Eventually though I did figure out which drive was clicking, it was the one with serial # ending in LGTF if you look at the attached diagnostics.  None of the tests flag it as problematic, but I did have trouble accessing certain files while browsing that drive directly, so something must be wrong with it.  I'm just not sure if that's my only problem.

 

Anyway, at that point I decided to revert to my original configuration: cancel the parity rebuild, put the old parity drive back in place, keep the clicky drive in as well, run whatever diagnostics I can, and if necessary move data off the clicky one.

 

The trouble is that the original parity drive (LWTC) is either unrecognized after most reboots, or sometimes recognized, but if I select it as parity to start the array, it immediately gets de-selected and removed from the drop-down of potential parity drives.  I can swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected, taking eons to build parity, and a clicky drive slowing me down for hundreds of days.

 

Bottom line: it SEEMS like I have two drives that are in some kind of wonky state, so with only one parity drive, I'm kind of screwed?

 

I guess I'm just looking for tips on what else to try or think about.  Thanks in advance. 

tower-diagnostics-20220719-1136.zip tower-syslog-20220719-1649.zip

Link to comment
24 minutes ago, Ramshackleton said:

swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected

I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it.

 

Won't have time to look at diagnostics till later, maybe someone else will.

Link to comment
3 minutes ago, trurl said:

I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it.

 

Won't have time to look at diagnostics till later, maybe someone else will.

 

Agreed, I'm fine with losing whatever was written since.  I just want a stable system back!

Link to comment

 

3 hours ago, Ramshackleton said:

new (94UG) drive for parity

 

9U4G, actually.  SMART for that disk looks fine, but no self-tests have been run.  Disable spindown on that disk and run an extended self-test.
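For reference, the extended test can also be started and checked from the console with smartctl; a minimal sketch, with /dev/sdX standing in for the actual device node:

smartctl -t long /dev/sdX      # start the extended (long) self-test; it runs inside the drive firmware
smartctl -l selftest /dev/sdX  # poll progress and, once finished, read the result log

The drive has to stay spun up for the whole run, hence the spindown advice.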

 

3 hours ago, Ramshackleton said:

original parity drive (LWTC)

SMART attributes look OK, but checking further down:

ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Comprehensive Error Log failed

Read SMART Error Log failed: scsi error medium or hardware error (serious)

ATA_READ_LOG_EXT (addr=0x07:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Self-test Log failed

Read SMART Self-test Log failed: scsi error medium or hardware error (serious)

Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)

Write SCT Data Table failed: scsi error medium or hardware error (serious)
Read SCT Temperature History failed

Write SCT (Get) Error Recovery Control Command failed: scsi error medium or hardware error (serious)
SCT (Get) Error Recovery Control command failed

ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Device Statistics page 0x00 failed

ATA_READ_LOG_EXT (addr=0x0c:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Pending Defects log page 0x00 failed

I've not actually seen that before, but it can't be good. I would forget about trying to use that disk for anything.
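(Those failed log reads come from the extended SMART report. To see the same section for any drive, something like this should reproduce it, again with /dev/sdX as a placeholder:

smartctl -x /dev/sdX   # full report: attributes, error/self-test logs, SCT data, device statistics

On a healthy disk all of those pages read back cleanly.)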

Link to comment

Thanks all - I just did what @trurl suggested and started with no parity.  Also, I've attached the diagnostics from earlier (which were saved on the array, whew), which probably have more useful info.  

 

Unfortunately, the LGTF drive didn't mount when I started the array, it's saying unmountable.  I suppose I should try to run some repairs on it?

tower-diagnostics-20220719-0900.zip

Link to comment
21 hours ago, Ramshackleton said:

which drive was clicking, it was the one with serial # ending in LGTF

Missed that

10 minutes ago, Ramshackleton said:

attached the diagnostics from earlier

According to those, it was disk2, which was mounted with nearly 7TB of contents.

 

11 minutes ago, Ramshackleton said:

LGTF drive didn't mount when I started the array, it's saying unmountable

Post new diagnostics

Link to comment

Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk?

 

Disable spindown on disk2 and run an extended self-test.
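As an aside, spindown matters because a disk dropping into standby can interrupt a running self-test. Besides setting the disk's spin down delay to Never in the webGUI, the standby timer can be disabled from the console; a sketch, with /dev/sdX as a placeholder:

hdparm -S 0 /dev/sdX   # -S 0 disables the drive's standby (spindown) timer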

 

I see a lot of this in syslog before the array has started (and so user shares don't exist yet)

Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"

Any idea what that is about?

Link to comment
22 hours ago, trurl said:

Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk?

 

Disable spindown on disk2 and run an extended self-test.

 

Just completed; it said it completed without error.  Are there logs to look at? I don't see a link on the web page.

 

22 hours ago, trurl said:

I see a lot of this in syslog before the array has started (and so user shares don't exist yet)

Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"

Any idea what that is about?

 

Yes, that's one of my reolink cameras trying to FTP video to the array.  Disregard.

Link to comment
32 minutes ago, trurl said:

Check filesystem on disk2

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 7
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 6
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Thu Jul 21 10:53:02 2022

Phase           Start           End             Duration
Phase 1:        07/21 10:51:57  07/21 10:51:57
Phase 2:        07/21 10:51:57  07/21 10:51:58  1 second
Phase 3:        07/21 10:51:58  07/21 10:52:32  34 seconds
Phase 4:        07/21 10:52:32  07/21 10:52:32
Phase 5:        Skipped
Phase 6:        07/21 10:52:32  07/21 10:53:02  30 seconds
Phase 7:        07/21 10:53:02  07/21 10:53:02

Total run time: 1 minute, 5 seconds

Link to comment
5 minutes ago, ChatNoir said:

Do it again without the -n flag.

So just with -v? I did that and got this:

 

 

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
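For the record, the sequence that error is asking for looks roughly like this. It's a sketch, assuming the array is started in maintenance mode and disk2 is exposed as /dev/md2 (the exact device name can differ between releases, so check yours first):

mkdir -p /mnt/test
mount /dev/md2 /mnt/test && umount /mnt/test   # mounting replays the journal, then unmount
xfs_repair -n /dev/md2                         # re-check read-only after the log replay
xfs_repair -L /dev/md2                         # ONLY if the mount fails: zero the log, then repair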

Link to comment

Sorry for being so thick, guys, and thanks for all the help.  Here's the output of the -L run, plus new logs.

 

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 4
        - agno = 6
        - agno = 7
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1879329) is ahead of log (1:2).
Format log to cycle 4.
done

tower-diagnostics-20220721-1350.zip
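One follow-up worth doing: the -L run reported moving disconnected inodes to lost+found, so once the array is started it's worth checking whether any orphaned files landed there; a sketch assuming disk2's usual mount point:

ls -la /mnt/disk2/lost+found   # files recovered by xfs_repair end up here, if any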

Link to comment
