heisenfig Posted January 31, 2020

I'm having the same issue as this thread: https://forums.unraid.net/topic/69765-solved-unmountable-no-file-system/

We had a power outage last Friday, but everything looked fine last night after I upgraded to v6.8.2. Tonight one of the drives shows as unmountable. I've posted the diagnostics below. I tried to follow along in that other thread, but I don't understand how to start the array with the disk emulated. Do I pop the drive out before starting the array? But then how do you run that repair command? I'm a bit lost.

The parity says valid. It's scheduled to run on the 1st day of each month, but it also says "Last check incomplete on Wed 29 Jan 2020 08:39:10 PM CST (yesterday), finding 1247 errors. Error code: aborted"

I have 2 parity drives. Can I just unassign that drive, as if it died, then format it, pre-clear it, re-assign it, and let it rebuild? This system has been running for over a year and I've never seen a bad parity check, so I really don't have any reason to think the parity is bad. I guess I could get a new drive tomorrow and do the rebuild to it instead.

nabit-diagnostics-20200130-2053.zip
JorgeB Posted January 31, 2020

Check filesystem on disk11: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

4 hours ago, heisenfig said:
    Then format it, pre-clear it, then re-assign it and let it rebuild?

Rebuilding a disk won't fix filesystem corruption.
heisenfig Posted January 31, 2020 (edited)

I'm pretty sure a format would create a new filesystem. Anyway, I ran the check per that link (output below). It noted a couple of issues. I tried to do the repair with just the -v option, but it said it had logs pending and that I needed to mount the drive so the logs could be replayed. But without a working filesystem, I can't mount it. Catch-22. It said I could use the -L option to wipe the logs, but I'm not sure what that will do to the integrity of the drive. And if there is still a problem with it, I don't want it mounted and have the parity updated incorrectly due to a corrupted drive.

So, I got a new drive today and it's in the process of pre-clearing. Once it's ready, I'll pop out the old drive, re-assign that slot to the new drive, and let it rebuild. Once it's back to normal, I'll repair this drive's filesystem, pre-clear it, and check it for any errors. If all is good with it, I'll use it to replace a smaller drive.

Phase 1 - find and verify superblock...
        - block cache size set to 707216 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3793692 tail block 3793686
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
finobt ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
agi unlinked bucket 45 is 83977453 in ag 5 (inode=10821395693)
sb_ifree 888, counted 875
sb_fdblocks 632626081, counted 635247463
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
imap claims in-use inode 2201742810 is free, correcting imap
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
bad data fork in inode 15032386860
would have cleared inode 15032386860
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 4
        - agno = 7
        - agno = 5
        - agno = 6
        - agno = 2
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" at block 0 offset 1392 in directory inode 15032386822 references free inode 15032386860
	would clear inode number in entry at offset 1392...
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
would have cleared inode 15032386860
entry "RARBG.txt" in shortform directory 12907968861 references free inode 4373342706
would have junked entry "RARBG.txt" in directory inode 12907968861
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory 12907968861 references free inode 4373342707
would have junked entry "RARBG_DO_NOT_MIRROR.exe" in directory inode 12907968861
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory 12907968861 references free inode 4373342705
would have junked entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in directory inode 12907968861
would have corrected i8 count in directory 12907968861 from 6 to 3
entry "Source" in shortform directory 6619914978 references free inode 8589934758
would have junked entry "Source" in directory inode 6619914978
would have corrected i8 count in directory 6619914978 from 2 to 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "Source" in shortform directory inode 6619914978 points to free inode 8589934758
	would junk entry
would fix i8count in inode 6619914978
        - agno = 4
        - agno = 5
        - agno = 6
entry "RARBG.txt" in shortform directory inode 12907968861 points to free inode 4373342706
	would junk entry
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory inode 12907968861 points to free inode 4373342707
	would junk entry
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory inode 12907968861 points to free inode 4373342705
	would junk entry
would fix i8count in inode 12907968861
        - agno = 7
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" in directory inode 15032386822 points to free inode 15032386860, would junk entry
bad hash table for directory inode 15032386822 (no data entry): would rebuild
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 2370139038, would move to lost+found
disconnected inode 2370139483, would move to lost+found
disconnected dir inode 6619914978, would move to lost+found
disconnected inode 10821395693, would move to lost+found
disconnected dir inode 12907968861, would move to lost+found
disconnected dir inode 12907968867, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 4373342699 nlinks from 1 to 3
would have reset inode 6619914978 nlinks from 3 to 2
would have reset inode 10821395693 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Jan 31 15:55:07 2020

Phase           Start           End             Duration
Phase 1:        01/31 15:55:05  01/31 15:55:06  1 second
Phase 2:        01/31 15:55:06  01/31 15:55:06
Phase 3:        01/31 15:55:06  01/31 15:55:06
Phase 4:        01/31 15:55:06  01/31 15:55:06
Phase 5:        Skipped
Phase 6:        01/31 15:55:06  01/31 15:55:07  1 second
Phase 7:        01/31 15:55:07  01/31 15:55:07

Total run time: 2 seconds

Edited January 31, 2020 by heisenfig
JonathanM Posted January 31, 2020

37 minutes ago, heisenfig said:
    I'm pretty sure a format would create a new filesystem.

Yep. A blank filesystem. Without any of your data. Is that what you want? If you format that slot, that format will be written to parity, just like any other write. Parity rebuild encompasses the entire drive, filesystem included. It doesn't rebuild individual files, only the drive as a whole, corruption included. You will need to fix that filesystem to recover your files. Normally the -L option doesn't corrupt anything.
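The point about parity reconstructing the drive "as a whole, corruption included" can be illustrated with a toy single-parity model. This is just an illustrative sketch, not Unraid's actual implementation (Unraid supports dual parity and operates on full-size disks), but the principle is the same: parity only knows about raw bytes, not files.

```python
# Toy single-parity model: parity is the byte-wise XOR of all data disks.
# A "corrupted filesystem" is just different bytes on a disk; a rebuild
# reproduces those bytes exactly, corruption and all.

def xor_bytes(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Three data disks, one of them holding corrupted filesystem metadata.
disk1 = b"good data..."
disk2 = b"CORRUPT fs.."   # pretend this disk's filesystem is damaged
disk3 = b"more data..."
parity = xor_bytes([disk1, disk2, disk3])

# "Disk 2 fails": rebuild it from parity plus the surviving disks.
rebuilt = xor_bytes([parity, disk1, disk3])
assert rebuilt == disk2   # the corruption comes back byte-for-byte
```

Because the rebuild target is identical to the original, replacing the drive and rebuilding from parity would have produced the same unmountable filesystem.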
heisenfig Posted February 1, 2020

Well, I wouldn't format it while it's part of the array, so it wouldn't be written to parity. I'm just thinking that if there is some corruption on it, I'm not sure I would want to trust it even after the rebuild. How do I know some of the files are not corrupted too? It seems like the safer choice is to just go ahead and replace the drive and let the parity rebuild onto it.

Are you saying that once the new drive is rebuilt, it will have the same problem? Like the corrupted filesystem is already part of the parity? How is that possible if the drive doesn't mount to have any information read from it? Would disconnecting the drive and letting it emulate that slot from parity still show it as unmountable? I'm not doubting you if you say that's true. I just thought that since parity is valid, I could have it rebuild the data to a new drive from parity.
heisenfig Posted February 1, 2020

I went ahead and ran the repair with the -L option. Afterwards I restarted the array in normal mode and the drive mounted normally. Is it safe to assume this drive is perfectly healthy again?

Phase 1 - find and verify superblock...
        - block cache size set to 707216 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3793692 tail block 3793686
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
finobt ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
agi unlinked bucket 45 is 83977453 in ag 5 (inode=10821395693)
sb_ifree 888, counted 875
sb_fdblocks 632626081, counted 635247463
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
imap claims in-use inode 2201742810 is free, correcting imap
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
bad data fork in inode 15032386860
cleared inode 15032386860
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 7
        - agno = 3
        - agno = 5
        - agno = 6
        - agno = 4
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" at block 0 offset 1392 in directory inode 15032386822 references free inode 15032386860
	clearing inode number in entry at offset 1392...
entry "RARBG.txt" in shortform directory 12907968861 references free inode 4373342706
junking entry "RARBG.txt" in directory inode 12907968861
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory 12907968861 references free inode 4373342707
junking entry "RARBG_DO_NOT_MIRROR.exe" in directory inode 12907968861
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory 12907968861 references free inode 4373342705
junking entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in directory inode 12907968861
corrected i8 count in directory 12907968861, was 6, now 3
entry "Source" in shortform directory 6619914978 references free inode 8589934758
junking entry "Source" in directory inode 6619914978
corrected i8 count in directory 6619914978, was 2, now 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
bad hash table for directory inode 15032386822 (no data entry): rebuilding
rebuilding directory inode 15032386822
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 2370139038, moving to lost+found
disconnected inode 2370139483, moving to lost+found
disconnected dir inode 6619914978, moving to lost+found
disconnected inode 10821395693, moving to lost+found
disconnected dir inode 12907968861, moving to lost+found
disconnected dir inode 12907968867, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 22525826 nlinks from 2 to 5
resetting inode 4373342699 nlinks from 1 to 3
resetting inode 6619914978 nlinks from 3 to 2
Maximum metadata LSN (1:3793688) is ahead of log (1:2).
Format log to cycle 4.
        XFS_REPAIR Summary    Fri Jan 31 22:32:07 2020

Phase           Start           End             Duration
Phase 1:        01/31 22:25:27  01/31 22:25:27
Phase 2:        01/31 22:25:27  01/31 22:27:18  1 minute, 51 seconds
Phase 3:        01/31 22:27:18  01/31 22:27:19  1 second
Phase 4:        01/31 22:27:19  01/31 22:27:19
Phase 5:        01/31 22:27:19  01/31 22:27:19
Phase 6:        01/31 22:27:19  01/31 22:27:19
Phase 7:        01/31 22:27:19  01/31 22:27:19

Total run time: 1 minute, 52 seconds
done
JorgeB Posted February 1, 2020

3 hours ago, heisenfig said:
    Well, I wouldn't format it while it's part of the array, so it wouldn't be written to parity.

Formatting outside the array is pointless, since the disk would then be rebuilt the same as it was. As mentioned, rebuilding disks can't fix filesystem corruption.

2 hours ago, heisenfig said:
    I went ahead and ran the repair with the -L option. Afterwards I restarted the array in normal mode and the drive mounted normally. Is it safe to assume this drive is perfectly healthy again?

The filesystem is healthy again; the drive always was. There may or may not be some lost files, so check for a lost+found folder.
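When xfs_repair does orphan files (the "moving to lost+found" lines above), they land in a lost+found directory at the root of the disk, named after their inode numbers. A minimal sketch for taking stock of them; the /mnt/disk11 path in the usage comment is an assumption based on this thread's array slot:

```python
import os

def inventory_lost_found(mount_point):
    """List recovered files under <mount_point>/lost+found with their sizes.

    xfs_repair names orphans after their inode numbers, so the contents
    need manual identification (file sizes and content sniffing help).
    """
    root = os.path.join(mount_point, "lost+found")
    results = []
    if not os.path.isdir(root):
        return results  # nothing was orphaned on this disk
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            results.append((path, os.path.getsize(path)))
    return results

# Hypothetical usage on this thread's disk11:
# for path, size in inventory_lost_found("/mnt/disk11"):
#     print(f"{size:>12}  {path}")
```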
heisenfig Posted February 2, 2020 (edited)

Apparently, that didn't work after all. The drive shows as mounted on the dashboard, but nothing can be read from or written to that drive. I am able to read/write all the other drives. I found a folder in /mnt/user that I believe must have been on that drive, and nothing can be read or written in that folder; the same error occurs. The settings.json file is missing (it was probably on that drive?) and I can't create any new files in this folder or read any of the files in it. Any ideas where to go from here?

EDIT: Perhaps it WAS fixed, but became corrupted again. After restarting the array, the drive is again unmountable and the check shows it has errors again. Going to do the repair again.

EDIT2: The repair worked again, some files that seemed missing are back, and I can read/write the drive. So that's now two times that the filesystem has become corrupted on that drive.

Edited February 2, 2020 by heisenfig
JorgeB Posted February 3, 2020

9 hours ago, heisenfig said:
    Apparently, that didn't work after all. The drive shows mounted on the dashboard.

A disk being listed under UD means it's not assigned to the array. You need to either fix the filesystem on the emulated disk and rebuild, or do a new config with the original disk and re-sync parity.
heisenfig Posted February 3, 2020

I'm assuming UD means Unassigned Devices, but the drive has never shown under that. The drive is assigned to slot #13 and listed as Disk 11 in the array. It's also never shown up as an emulated disk. If I remove the drive and try to start the array, it says it will disable the drive and show it as emulated, but I haven't done that yet because I wasn't sure if it's something I could come back from. Is this what I need to do: disable it so it shows as emulated, then do the xfs_repair on the emulated drive?

If I go the other way and create a new config and assign all the drives as they are now, does it do the parity re-sync automatically? Or is that something I have to do manually?
JorgeB Posted February 3, 2020

31 minutes ago, heisenfig said:
    But the drive has never shown under that.

It's possible the disk dropped offline and reconnected with a different letter, or it was unassigned by you. Please post current diags.
heisenfig Posted February 4, 2020 (edited)

This 8TB drive replaced a 4TB drive 3 months ago. It had been running fine until the power outage about a week ago. The drive has never been "Unassigned" since it went into service 3 months ago.

After the last repair of the filesystem, everything seemed normal again. I started a parity check with "write corrections to parity" un-checked. During the first 8 hours, the files on that drive disappeared again, but it still shows as mounted, as far as the Web GUI is concerned. Attached is the drive SMART report from before I stopped the array, along with the diagnostics report from before the array was stopped. The parity check is still running. It shows 12 sync errors and is estimated to finish in about 60 to 90 days (it normally takes less than a day.)

WDC_WD80EMAZ-00WJTA0_JEKVEADZ-20200204-0832.txt
nabit-diagnostics-20200204-0835_beforeStoppingArray.zip

After stopping the parity check and stopping the array, the screenshots show the drive is still marked as assigned to slot 13. Attached is the output of "xfs_repair -nv", run with the array in maintenance mode.

xfs_repair-nv.txt

It didn't show any errors on the drive this time. I restarted the array in normal mode. The drive is still assigned, and all the files are back on disk 11. This is not the same outcome I had yesterday: when I restarted in normal mode yesterday, the drive was still assigned but had a note that it was not mountable due to no filesystem. The only difference is that the "correct parity" checkbox was checked when the parity check ran. Here's a new diagnostics after restarting the array.

nabit-diagnostics-20200204-0900_afterRestartingArray.zip

Hopefully, this is enough information to determine what is going on with this drive.

Edited February 4, 2020 by heisenfig
JorgeB Posted February 4, 2020

There's a connection problem with disk11; these are repeating constantly in the logs:

Feb 3 15:09:06 nabit kernel: ata20.00: status: { DRDY }
Feb 3 15:09:06 nabit kernel: ata20: hard resetting link
Feb 3 15:09:16 nabit kernel: ata20: softreset failed (device not ready)
Feb 3 15:09:16 nabit kernel: ata20: hard resetting link
Feb 3 15:09:17 nabit kernel: ata20: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 3 15:09:18 nabit kernel: ata20.00: configured for UDMA/133
Feb 3 15:09:18 nabit kernel: ata20: EH complete
Feb 3 15:09:31 nabit kernel: ata20.00: exception Emask 0x10 SAct 0xffffffff SErr 0x190002 action 0xe frozen
Feb 3 15:09:31 nabit kernel: ata20.00: irq_stat 0x80400000, PHY RDY changed
Feb 3 15:09:31 nabit kernel: ata20: SError: { RecovComm PHYRdyChg 10B8B Dispar }
Feb 3 15:09:31 nabit kernel: ata20.00: failed command: READ FPDMA QUEUED
Feb 3 15:09:31 nabit kernel: ata20.00: cmd 60/08:00:00:6a:e4/00:00:67:01:00/40 tag 0 ncq dma 4096 in
Feb 3 15:09:31 nabit kernel:          res 40/00:00:68:0a:d6/00:00:29:01:00/40 Emask 0x10 (ATA bus error)
Feb 3 15:09:31 nabit kernel: ata20.00: status: { DRDY }
Feb 3 15:09:31 nabit kernel: ata20.00: failed command: READ FPDMA QUEUED
Feb 3 15:09:31 nabit kernel: ata20.00: cmd 60/40:08:88:8e:e9/00:00:2c:02:00/40 tag 1 ncq dma 32768 in
Feb 3 15:09:31 nabit kernel:          res 40/00:00:68:0a:d6/00:00:29:01:00/40 Emask 0x10 (ATA bus error)
Feb 3 15:09:31 nabit kernel: ata20.00: status: { DRDY }

11 minutes ago, heisenfig said:
    The only difference is that the "correct parity" checkbox was checked when the parity check ran.

That won't make any difference in this case, since it doesn't affect data on disk11. The connection problems likely explain the issues you've been having; replace both cables. The filesystem is mounting correctly now, so nothing to fix there for the moment.

P.S. Unrelated to this, but there are also a few out-of-memory errors in the logs.
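The telltale pattern here (link resets, PHY ready changes, and bus errors, rather than medium errors) can be pulled out of a syslog dump mechanically. A minimal sketch, where the ata20 port number comes from this thread's logs and the /var/log/syslog path in the usage comment is an assumption for any other system:

```python
import re

# Kernel log fragments that point at SATA link/connection trouble
# (cabling, power, backplane) rather than a failing platter.
LINK_TROUBLE = re.compile(
    r"(hard resetting link|softreset failed|PHY RDY changed|ATA bus error)"
)

def count_link_errors(log_lines, port="ata20"):
    """Count connection-trouble events for one ATA port in kernel log lines."""
    hits = {}
    for line in log_lines:
        if port not in line:
            continue
        m = LINK_TROUBLE.search(line)
        if m:
            hits[m.group(1)] = hits.get(m.group(1), 0) + 1
    return hits

# Hypothetical usage against a saved syslog:
# with open("/var/log/syslog") as f:
#     print(count_link_errors(f, port="ata20"))
```

A steadily growing count for one port while its neighbors stay at zero points at that drive's cables or bay, which matches the advice above.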
heisenfig Posted February 4, 2020 (edited)

Hmm. That drive is in an ICYDOCK, so all 5 drives in the dock share 3 power cables. Since it's isolated to just one disk, I'm assuming the power cables are probably okay. When I get home, I'll replace the data cable for that one. If that doesn't work, I'll replace the 3 power cables too.

Yeah, I think the memory thing was unrelated. I checked and all the CPUs were pegged. Rebooting to see if that clears it up. Edit: it did. Thanks!

Edited February 4, 2020 by heisenfig
heisenfig Posted February 5, 2020

Haven't had a chance to change the cable yet, but the drive has stayed up for over 24 hours now. I have a theory though. This is a shucked WD drive that required tape over pin 3 of the power connector. I didn't have Kapton tape, so I used a piece of tape from my label maker. I think it's possible that as the temperature of the drive increases, it allows just enough current to pass to that pin to shut the drive off.

Secondary to that, about the memory issues: there ended up being 6 instances of rclone running in the background trying to back up the array to Google Drive, which kept the drives working overtime, causing them to be warmer than normal.