Ramshackleton Posted July 19, 2022

13 Data disks, 1 Parity (14TB), 1 cache. ~120TB total. I've been having issues since moving my system to a new rack-mount case about a month ago, presumably some power-supply-related issues, but things had been pretty stable until the other day: disk activity was running really slowly, so I decided to reboot. When I did, the parity drive wasn't recognized. I immediately jumped the gun, assumed it was dead, and ran to Best Buy to replace it with another 14TB WD drive (serial # ending in 94UG) to shuck. Once installed, it began rebuilding parity as normal. At my current array size this process typically takes about 24 hours, but this time it was going VERY slowly, estimating hundreds of days. All the data seemed to be there, but Plex was very flaky and the system overall was too slow to be usable.

At this point I also heard one of the drives periodically clicking (in a bad way), but I couldn't tell which one. It clearly wasn't the old parity drive, because that was no longer connected. I tried running SMART tests and using the Disk Speed Test plugin, but the results didn't make it obvious to me that anything was wrong. Eventually, though, I did figure out which drive was clicking, it was the one with serial # ending in LGTF if you look at the attached diagnostics. I don't see any tests that flag it as problematic, but I did have trouble accessing certain files while browsing that drive directly, so something must be wrong with it. I'm just not sure whether that's my only problem.

Anyway, at that point I decided to revert to my original configuration: cancel the parity rebuild, put the old parity drive back in place, keep the clicky drive in as well, run whatever diagnostics I can, and if necessary move data off the clicky drive. The trouble is that the original parity drive (LWTC) is either unrecognized on most reboots, or sometimes recognized, but if I choose it as parity to start the array, it immediately de-selects it and removes it from the drop-down of potential parity drives. I can swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected, taking eons to build parity, and a clicky drive slowing me down for hundreds of days.

Bottom line: it SEEMS like I have two drives that are in some kind of wonky state, so with only 1 parity, I'm kind of screwed? I guess I'm just looking for tips on what else to try or think about. Thanks in advance.

tower-diagnostics-20220719-1136.zip tower-syslog-20220719-1649.zip
trurl Posted July 19, 2022

24 minutes ago, Ramshackleton said: swap in the new (94UG) drive for parity and start the array, but I think I'd be back to square one with being unprotected

I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it. Won't have time to look at diagnostics till later, maybe someone else will.
Ramshackleton Posted July 19, 2022

3 minutes ago, trurl said: I doubt your original parity is valid now anyway. Would be slightly out-of-sync if you started the array without it, and much more out-of-sync if anything was written to the array without it. Won't have time to look at diagnostics till later, maybe someone else will.

Agreed, I'm fine with losing whatever was written since. I just want a stable system back!
trurl Posted July 19, 2022

FYI - the syslog you attached is the same syslog already included in the diagnostics, i.e. the current syslog since reboot, so it doesn't tell us anything about what happened before the reboot. The syslog server can allow you to save earlier syslogs.

Can't tell whether your data is OK until you start the array. Unassign parity, start the array, and post new diagnostics.
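(For reference: the built-in Syslog Server under Settings is the cleaner way to keep logs across reboots. As a rough sketch of the same idea from the console, assuming /boot/logs as an illustrative destination on the flash drive, something like this snapshots the live log so it survives the next reboot:)

mkdir -p /boot/logs
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M%S).txt   # copy the current log to flash before rebooting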
trurl Posted July 19, 2022

3 hours ago, Ramshackleton said: new (94UG) drive for parity

9U4G, SMART for that disk looks fine, no self-tests have been run. Disable spindown on that disk and run an extended self-test

3 hours ago, Ramshackleton said: original parity drive (LWTC)

SMART attributes look OK, but checking further down

ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Comprehensive Error Log failed
Read SMART Error Log failed: scsi error medium or hardware error (serious)
ATA_READ_LOG_EXT (addr=0x07:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Self-test Log failed
Read SMART Self-test Log failed: scsi error medium or hardware error (serious)
Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)
Write SCT Data Table failed: scsi error medium or hardware error (serious)
Read SCT Temperature History failed
Write SCT (Get) Error Recovery Control Command failed: scsi error medium or hardware error (serious)
SCT (Get) Error Recovery Control command failed
ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Device Statistics page 0x00 failed
ATA_READ_LOG_EXT (addr=0x0c:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read Pending Defects log page 0x00 failed

Not actually seen that before but can't be good. I would forget about trying to use that disk for anything.
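(For anyone following along, the extended self-test can also be started from the command line; /dev/sdX below is a placeholder for whatever device letter the 9U4G disk currently has, and the same test can be launched from the disk's page in the GUI:)

smartctl -t long /dev/sdX        # start an extended (long) self-test; the disk must stay spun up until it finishes
smartctl -c /dev/sdX             # shows the estimated polling time for the extended test
smartctl -l selftest /dev/sdX    # check progress and the final result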
Ramshackleton Posted July 20, 2022 Thanks all - I just did what @trurl suggested and started with no parity. Also, I've attached the diagnostics from earlier (which were saved on the array, whew), which probably have more useful info. Unfortunately, the LGTF drive didn't mount when I started the array, it's saying unmountable. I suppose I should try to run some repairs on it? tower-diagnostics-20220719-0900.zip
trurl Posted July 20, 2022

21 hours ago, Ramshackleton said: which drive was clicking, it was the one with serial # ending in LGTF

Missed that

10 minutes ago, Ramshackleton said: attached the diagnostics from earlier

According to those, it was disk2, was mounted with nearly 7TB contents.

11 minutes ago, Ramshackleton said: LGTF drive didn't mount when I started the array, it's saying unmountable

Post new diagnostics
Ramshackleton Posted July 20, 2022

11 minutes ago, trurl said: Missed that According to those, it was disk2, was mounted with nearly 7TB contents. Post new diagnostics

Thanks for the quick reply, attached.

tower-diagnostics-20220720-1032.zip
trurl Posted July 20, 2022

Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk? Disable spindown on disk2 and run an extended self-test.

I see a lot of this in syslog before the array has started (and so user shares don't exist yet):

Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"

Any idea what that is about?
Ramshackleton Posted July 21, 2022

22 hours ago, trurl said: Disk2 is indeed unmountable now. SMART attributes for disk2 look fine, but no self-tests have been run. Are you sure that was the clicking disk? Disable spindown on disk2 and run an extended self-test.

Just completed; it said it finished without error. Are there logs to look at? I don't see a link on the web page.

22 hours ago, trurl said: I see a lot of this in syslog before the array has started (and so user shares don't exist yet)
Jul 19 10:56:18 Tower vsftpd[3269]: connect from 192.168.1.11 (192.168.1.11)
Jul 19 10:56:18 Tower vsftpd[3269]: [reolink] OK LOGIN: Client "192.168.1.11"
Jul 19 10:56:19 Tower vsftpd[3271]: [reolink] FAIL MKDIR: Client "192.168.1.11", "/mnt/user"
Any idea what that is about?

Yes, that's one of my reolink cameras trying to FTP video to the array. Disregard.
ChatNoir Posted July 21, 2022

18 minutes ago, Ramshackleton said: Are there logs to look at? I don't see a link on the web page.

no logs but your diagnostics will have the last SMART report
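(A rough sketch of where to look, assuming the diagnostics archive keeps per-disk SMART reports in a smart/ folder as recent Unraid releases do; the filename below is just this thread's latest attachment:)

unzip -l tower-diagnostics-20220721-0912.zip | grep -i smart   # list the per-disk SMART report files inside the zip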
Ramshackleton Posted July 21, 2022 2 minutes ago, ChatNoir said: no logs but your diagnostics will have the last SMART report Ah yes of course, silly me. Attached! Thanks! tower-diagnostics-20220721-0912.zip
trurl Posted July 21, 2022 Check filesystem on disk2
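(The check trurl is asking for is normally run from the GUI: stop the array, start it in Maintenance mode, click disk2, and run Check with the default options. That is roughly equivalent to the following command line, assuming /dev/md2 is disk2's md device as on Unraid 6.x:)

xfs_repair -nv /dev/md2   # -n = no modify (report only), -v = verbose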
Ramshackleton Posted July 21, 2022

32 minutes ago, trurl said: Check filesystem on disk2

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 7
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 6
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Thu Jul 21 10:53:02 2022

Phase       Start           End             Duration
Phase 1:    07/21 10:51:57  07/21 10:51:57
Phase 2:    07/21 10:51:57  07/21 10:51:58  1 second
Phase 3:    07/21 10:51:58  07/21 10:52:32  34 seconds
Phase 4:    07/21 10:52:32  07/21 10:52:32
Phase 5:    Skipped
Phase 6:    07/21 10:52:32  07/21 10:53:02  30 seconds
Phase 7:    07/21 10:53:02  07/21 10:53:02

Total run time: 1 minute, 5 seconds
ChatNoir Posted July 21, 2022

4 minutes ago, Ramshackleton said: No modify flag set

Do it again without the -n flag.
Ramshackleton Posted July 21, 2022

5 minutes ago, ChatNoir said: Do it again without the -n flag.

So just with -v? I did that and got this:

Phase 1 - find and verify superblock...
        - block cache size set to 1473832 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1879314 tail block 1879310
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
trurl Posted July 21, 2022

50 minutes ago, Ramshackleton said: unable to mount the filesystem, then use the -L option

The XFS utility doesn't know that Unraid has already failed to mount the filesystem. You have to use -L.
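(Concretely, that means re-running the repair with -L instead of -n, still in Maintenance mode and still against the md device rather than the raw sdX device, again assuming /dev/md2 for disk2:)

xfs_repair -L /dev/md2   # -L zeroes the metadata log before repairing; any changes still sitting in that log are lost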
Ramshackleton Posted July 21, 2022

Sorry for being so thick, guys, and thanks for all the help. Here's the output of the -L run, and new logs.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121671163, counted 123624614
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 4
        - agno = 6
        - agno = 7
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1879329) is ahead of log (1:2).
Format log to cycle 4.
done

tower-diagnostics-20220721-1350.zip
trurl Posted July 21, 2022 Start the array in normal mode and post new diagnostics
Ramshackleton Posted July 21, 2022 It mounted this time! Diagnostics attached. tower-diagnostics-20220721-1508.zip
trurl Posted July 21, 2022 On mobile now, will look at the diagnostics later. Have you looked at the data? Do you have a lost+found share?
Ramshackleton Posted July 21, 2022 I don't have a lost+found share, no. Data looks ok, but hard to know without a deep dive. I guess the question now is, am I safe to assign my new drive as parity, or will it try to take 255 days again?
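(One quick sanity check before re-assigning parity is to look for repair leftovers from the console, assuming the array is started and the disks are mounted under /mnt:)

ls -ld /mnt/disk*/lost+found 2>/dev/null                     # any hits are files or directories xfs_repair could not reconnect
find /mnt/disk2/lost+found -maxdepth 2 2>/dev/null | head    # peek at what ended up there, if anything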
trurl Posted July 21, 2022 Still a lot of things in syslog about sdm, the former parity drive we determined shouldn't be used. Remove that disk and post new diagnostics
Ramshackleton Posted July 22, 2022

4 hours ago, trurl said: Still a lot of things in syslog about sdm, the former parity drive we determined shouldn't be used. Remove that disk and post new diagnostics

Ok, I removed the LWTC drive (the old parity disk); diagnostics attached.

tower-diagnostics-20220721-2236.zip
trurl Posted July 22, 2022 On mobile now, will take a look later.