Help! I really got myself into a state this time...



13 Data disks, 1 Parity (14TB), 1 cache.  ~120TB total.

 

I've been having issues since moving my system to a new rack-mount case about a month ago, presumably power-supply related, but things had been pretty stable until the other day:

 

Disk activity was running really slowly so I decided to reboot.  When I did, the Parity drive wasn't recognized.  I immediately jumped the gun, assumed it was dead, and ran to Best Buy to replace it with another 14TB WD drive (serial # ending in 94UG) to shuck.  

 

Once installed, it began rebuilding parity as normal.  At my current array size this process typically takes about 24 hours, but in this case it was going VERY slowly, with an estimate of hundreds of days.
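(Quick sanity check on those numbers: a 14TB parity drive at a typical average rebuild speed of roughly 160 MB/s works out to about 87,500 seconds, i.e. around 24 hours. An estimate in the hundreds of days implies average throughput well under 1 MB/s, which points at a struggling drive or link rather than normal load.)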

 

All the data seemed to be there but Plex was very flaky and the system overall was too slow to be very usable.  At this point I also heard one of the drives periodically clicking (in a bad way), but I couldn't tell which one.  It clearly wasn't the old parity drive because it was no longer connected.  

 

I tried running SMART tests and the Disk Speed Test plugin, but the results didn't make it obvious to me that anything was wrong.  Eventually though I did figure out which drive was clicking, it was the one with serial # ending in LGTF if you look at the attached diagnostics.  None of the tests flag it as problematic, but I did have trouble accessing certain files while browsing that drive directly, so something must be wrong with it.  I'm just not sure if that's my only problem.

 

Anyway, at that point I decided to revert to my original configuration: cancel the parity rebuild, put the old parity drive back in place, keep the clicky drive in as well, run whatever diagnostics I can, and if necessary move data off the clicky one.

 

The trouble is that the original parity drive (LWTC) is either unrecognized after most reboots, or sometimes recognized, but if I select it as parity to start the array, it immediately gets de-selected and removed from the drop-down of potential parity drives.  I can swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected, taking eons to build parity, and a clicky drive slowing me down for hundreds of days.

 

Bottom line: it SEEMS like I have two drives that are in some kind of wonky state, so with only one parity drive, I'm kind of screwed?

 

I guess I'm just looking for tips on what else to try or think about.  Thanks in advance. 

tower-diagnostics-20220719-1136.zip tower-syslog-20220719-1649.zip

Link to comment
24 minutes ago, Ramshackleton said:

swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected

I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it.

 

Won't have time to look at diagnostics till later, maybe someone else will.

Link to comment
3 minutes ago, trurl said:

I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it.

 

Won't have time to look at diagnostics till later, maybe someone else will.

 

Agreed, I'm fine with losing whatever was written since.  I just want a stable system back!

Link to comment

 

3 hours ago, Ramshackleton said:

new (94UG) drive for parity

 

9U4G, actually.  SMART for that disk looks fine, but no self-tests have been run.  Disable spindown on that disk and run an extended self-test.
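For reference, the extended test can also be started and checked from the console with smartctl; a minimal sketch, with /dev/sdX standing in for the actual device node:

smartctl -t long /dev/sdX      # start the extended (long) self-test; it runs inside the drive firmware
smartctl -l selftest /dev/sdX  # poll progress and, once finished, read the result log

The drive has to stay spun up for the whole run, hence the spindown advice.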

 

3 hours ago, Ramshackleton said:

original parity drive (LWTC)

SMART attributes look OK, but checking further down:

ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Comprehensive Error Log failed

Read SMART Error Log failed: scsi error medium or hardware error (serious)

ATA_READ_LOG_EXT (addr=0x07:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Self-test Log failed

Read SMART Self-test Log failed: scsi error medium or hardware error (serious)

Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)

Write SCT Data Table failed: scsi error medium or hardware error (serious)
Read SCT Temperature History failed

Write SCT (Get) Error Recovery Control Command failed: scsi error medium or hardware error (serious)
SCT (Get) Error Recovery Control command failed

ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Device Statistics page 0x00 failed

ATA_READ_LOG_EXT (addr=0x0c:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Pending Defects log page 0x00 failed

I've not actually seen that before, but it can't be good. I would forget about trying to use that disk for anything.
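(Those failed log reads come from the extended SMART report. To see the same section for any drive, something like this should reproduce it, again with /dev/sdX as a placeholder:

smartctl -x /dev/sdX   # full report: attributes, error/self-test logs, SCT data, device statistics

On a healthy disk all of those pages read back cleanly.)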

Link to comment

Thanks all - I just did what @trurl suggested and started with no parity.  Also, I've attached the diagnostics from earlier (which were saved on the array, whew), which probably have more useful info.  

 

Unfortunately, the LGTF drive didn't mount when I started the array, it's saying unmountable.  I suppose I should try to run some repairs on it?

tower-diagnostics-20220719-0900.zip

Link to comment
21 hours ago, Ramshackleton said:

which drive was clicking, it was the one with serial # ending in LGTF

Missed that

10 minutes ago, Ramshackleton said:

attached the diagnostics from earlier

According to those, it was disk2, which was mounted with nearly 7TB of contents.

 

11 minutes ago, Ramshackleton said:

LGTF drive didn't mount when I started the array, it's saying unmountable

Post new diagnostics

Link to comment

Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk?

 

Disable spindown on disk2 and run an extended self-test.
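As an aside, spindown matters because a disk dropping into standby can interrupt a running self-test. Besides setting the disk's spin down delay to Never in the webGUI, the standby timer can be disabled from the console; a sketch, with /dev/sdX as a placeholder:

hdparm -S 0 /dev/sdX   # -S 0 disables the drive's standby (spindown) timer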

 

I see a lot of this in syslog before the array has started (and so user shares don't exist yet)

Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"

Any idea what that is about?

Link to comment
22 hours ago, trurl said:

Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk?

 

Disable spindown on disk2 and run an extended self-test.

 

Just completed; it said it completed without error.  Are there logs to look at? I don't see a link on the web page.

 

22 hours ago, trurl said:

I see a lot of this in syslog before the array has started (and so user shares don't exist yet)

Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"

Any idea what that is about?

 

Yes, that's one of my reolink cameras trying to FTP video to the array.  Disregard.

Link to comment
32 minutes ago, trurl said:

Check filesystem on disk2

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 7
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 6
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Thu Jul 21 10:53:02 2022

Phase           Start           End             Duration
Phase 1:        07/21 10:51:57  07/21 10:51:57
Phase 2:        07/21 10:51:57  07/21 10:51:58  1 second
Phase 3:        07/21 10:51:58  07/21 10:52:32  34 seconds
Phase 4:        07/21 10:52:32  07/21 10:52:32
Phase 5:        Skipped
Phase 6:        07/21 10:52:32  07/21 10:53:02  30 seconds
Phase 7:        07/21 10:53:02  07/21 10:53:02

Total run time: 1 minute, 5 seconds

Link to comment
5 minutes ago, ChatNoir said:

Do it again without the -n flag.

So just with -v? I did that and got this:

 

 

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
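For the record, the sequence that error is asking for looks roughly like this. It's a sketch, assuming the array is started in maintenance mode and disk2 is exposed as /dev/md2 (the exact device name can differ between releases, so check yours first):

mkdir -p /mnt/test
mount /dev/md2 /mnt/test && umount /mnt/test   # mounting replays the journal, then unmount
xfs_repair -n /dev/md2                         # re-check read-only after the log replay
xfs_repair -L /dev/md2                         # ONLY if the mount fails: zero the log, then repair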

Link to comment

Sorry for being so thick, guys, and thanks for all the help.  Here's the output of the -L run, plus new logs.

 

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 4
        - agno = 6
        - agno = 7
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1879329) is ahead of log (1:2).
Format log to cycle 4.
done

tower-diagnostics-20220721-1350.zip
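One follow-up worth doing: the -L run reported moving disconnected inodes to lost+found, so once the array is started it's worth checking whether any orphaned files landed there; a sketch assuming disk2's usual mount point:

ls -la /mnt/disk2/lost+found   # files recovered by xfs_repair end up here, if any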

Link to comment
