Chasing down disabled drive / CRC errors



4 minutes ago, trurl said:

That will give us a chance to see what, if anything, needs correcting.

 

Do the contents of the disk look reasonably correct?

 

Yeah, everything looks reasonably correct. At a glance, nothing seems to be missing or broken.
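For anyone doing a similar spot check from the console, a couple of quick commands against the UD mount point give a rough sense of whether the data is intact. The mount path below is just an example; UD mounts live under /mnt/disks/:

# rough total size of everything on the old disk (example UD mount path)
du -sh /mnt/disks/olddisk5
# count the files, to compare against what you expect to be there
find /mnt/disks/olddisk5 -type f | wc -l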

 

File system check completed. I don't see a lost+found folder on the drive, so I'm assuming there weren't any errors? Although, if this is just a check, it isn't moving anything anyway. Here is the output from the command:

root@MediaVault:~# xfs_repair -nv /dev/sdk1
Phase 1 - find and verify superblock...
        - block cache size set to 542368 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 20 tail block 20
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed May  4 15:02:47 2022

Phase           Start           End             Duration
Phase 1:        05/04 15:02:38  05/04 15:02:38
Phase 2:        05/04 15:02:38  05/04 15:02:38
Phase 3:        05/04 15:02:38  05/04 15:02:44  6 seconds
Phase 4:        05/04 15:02:44  05/04 15:02:44
Phase 5:        Skipped
Phase 6:        05/04 15:02:44  05/04 15:02:47  3 seconds
Phase 7:        05/04 15:02:47  05/04 15:02:47

Total run time: 9 seconds

 

6 minutes ago, trurl said:

Probably is the plan.

 

It is possible to get it to not rebuild parity, then fool it into thinking it still needs to rebuild disk5. But maybe that won't be necessary if the contents of the original disk look good enough.

 

I am still a bit concerned about your hardware and its ability to reliably rebuild anything. Do we think the controller firmware update should have fixed this?

 

 

I still haven't seen any more errors, although the array has been stopped or in maintenance mode for most of this time, so I'm not sure it has really had to do anything or communicate heavily with the drives. If it means anything, up until the disks started acting up, all of the files seemed reliable despite the large number of CRC errors.

 


4 minutes ago, Arcaeus said:

if this is just a check it's not moving anything.

correct

 

That looks a lot better. Go ahead and run without -n; probably nothing will end up in lost+found.

 

Then we can go ahead and New Config that disk back into the array. And rebuilding parity will be a good test of the hardware.

 

 

6 minutes ago, trurl said:

correct

 

That looks a lot better. Go ahead and run without -n; probably nothing will end up in lost+found.

 

Then we can go ahead and New Config that disk back into the array. And rebuilding parity will be a good test of the hardware.

 

 

 

Ran the command again without the -n flag, and nothing is in lost+found:

root@MediaVault:~# xfs_repair -v /dev/sdk1
Phase 1 - find and verify superblock...
        - block cache size set to 542368 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 30 tail block 30
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Wed May  4 15:24:30 2022

Phase           Start           End             Duration
Phase 1:        05/04 15:24:20  05/04 15:24:20
Phase 2:        05/04 15:24:20  05/04 15:24:20
Phase 3:        05/04 15:24:20  05/04 15:24:27  7 seconds
Phase 4:        05/04 15:24:27  05/04 15:24:27
Phase 5:        05/04 15:24:27  05/04 15:24:27
Phase 6:        05/04 15:24:27  05/04 15:24:30  3 seconds
Phase 7:        05/04 15:24:30  05/04 15:24:30

Total run time: 10 seconds
done

 

What's next? Unmount and add back into array?

1 minute ago, trurl said:

Just to make sure there is no misunderstanding, post a screenshot of Main - Array Devices.

Sure, no problem. Would rather be safe than sorry. Here is what the Main - Array Devices screen is showing right now. The old disk 5 is currently in UD and mounted.

main array devices final.png


Disk is unmounted and assigned to the disk 5 slot. Array is not started yet. 

 

Earlier you mentioned backing up the contents of disk 5. Is that something we need to do before we start the array?

 

11 minutes ago, trurl said:

Unmount, assign as disk5. When you go to start the array, there should be a checkbox for rebuilding parity. You must rebuild parity.

 

 

Is the checkbox for rebuilding parity the "Parity is already valid" one, which should be left unchecked since we want to rebuild parity? Just wanted to confirm.

 

What about disk 4? Will we handle that afterwards?

 

no rebuild parity option.png


Backup would have been taken while it was still mounted in UD. You can reconsider backups after parity build completes.

 

During rebuild, you should see a lot of writes to the disk being rebuilt (parity), a lot of reads from all other disks in the array, and zeros in the Errors column on Main.

Just now, trurl said:

Backup would have been taken while it was still mounted in UD. You can reconsider backups after parity build completes.

 

During rebuild, you should see a lot of writes to the disk being rebuilt (parity), a lot of reads from all other disks in the array, and zeros in the Errors column on Main.

 

Understood. No CRC errors so far, so it seems like the firmware update did the trick. Will update this thread when the parity rebuild completes.

1 minute ago, Arcaeus said:

No CRC errors

Usually when we refer to CRC errors we are talking about the SMART attribute where the disk firmware records those. The disk firmware detects inconsistency in the data it has received.

 

The Errors column in Main that I was referring to includes all I/O errors; many would not be recorded as CRC errors because the disk never received any data to check.
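For reference, the SMART attribute in question can be read from the console with smartctl; on most SATA drives it shows up as attribute 199, UDMA_CRC_Error_Count. The device name below is just an example:

# print the SMART attribute table for one drive and pick out the CRC counter
smartctl -A /dev/sdk | grep -i crc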

On 5/4/2022 at 4:38 PM, trurl said:

Usually when we refer to CRC errors we are talking about the SMART attribute where the disk firmware records those. The disk firmware detects inconsistency in the data it has received.

 

The Errors column in Main that I was referring to includes all I/O errors; many would not be recorded as CRC errors because the disk never received any data to check.

 

Alright, parity sync has completed. Zero I/O errors and zero CRC errors.

 

On the dashboard, the log is showing 92% full. What log is that referring to, and how do I check it or clear it out?

log full.png

5 hours ago, JorgeB said:

Rebooting will clear that; you might want to save the diags before doing so in case they are needed, and if the log keeps growing you should post new diags.

Yep, looks like that got cleared out and so far it's staying at 1%.
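For anyone hitting the same thing later, the dashboard log gauge refers to the RAM-backed log filesystem; assuming the standard Unraid layout, it can be checked from the console like this:

# how full the log filesystem is (this is what the dashboard gauge reports)
df -h /var/log
# see which files are taking up the space
du -sh /var/log/*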

Alright, the last question I have is about the process to format and mount the two blank 16TB drives currently sitting in UD. The idea is to use them as a local backup of the data in case something happens to the array (like what almost happened here). While these wouldn't have parity, the plan is to keep a complete backup of everything in sync with Rclone or something similar, as well as an offsite cloud backup.
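As a rough sketch of what that local sync could look like once the 16TB drives are mounted in UD, an rclone one-way sync from a share to a UD mount might be something like the following. The share and mount paths are hypothetical, and --dry-run previews the changes without writing anything:

# preview the sync first (hypothetical source share and UD backup mount)
rclone sync /mnt/user/Media /mnt/disks/backup16tb/Media --dry-run -P
# run it for real once the preview looks right
rclone sync /mnt/user/Media /mnt/disks/backup16tb/Media -P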

 

Currently I can see the disks, but the Mount button is greyed out (I'm assuming because they are precleared but not formatted yet). Destructive mode is enabled and the UD Plus plugin is installed. Where do I go to format those?

 

I'm assuming that I should just format them in XFS as I don't plan to remove them from my server, but does it make any sense to format them in NTFS?

 

I saw this link that you had posted a few years back, but it looks broken: https://forums.lime-technology.com/topic/44104-unassigned-devices-managing-disk-drives-outside-of-the-unraid-array/ . Is there an updated link, or what are your thoughts on this idea and the process to complete it?

16TBs unmountable.png

4 minutes ago, JorgeB said:

You need to enable destructive mode in the UD settings to format disks. I would just use XFS; they can always be read in any Linux computer, or by using an Unraid trial key.

Here is what I'm seeing, as I must be missing something:

 

UD settings.png

16TBs unmountable 2.png

1 hour ago, JorgeB said:

Remove the existing partitions by clicking on the red X, then format.

 

So when I do that, it gives me this message saying that it would remove the preclear signature and would have to re-clear the disk. I guess that doesn't matter since I'm not adding the disk to the array?

preclear error.png
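Just as a point of reference, UD does all of this from the GUI once destructive mode is on, but a rough manual equivalent on a generic Linux box would be along these lines. The device name is a placeholder, and these commands wipe the disk, so only run them against the right drive:

# wipe the existing partition table and preclear signature (destroys everything on the disk)
sgdisk --zap-all /dev/sdX
# create a single partition spanning the whole disk
sgdisk -n 1:0:0 /dev/sdX
# format that partition as XFS
mkfs.xfs /dev/sdX1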

  • Solution

Marking this issue as solved.

 

Once the LSI 9207-8i was updated from firmware version 20.00.00.00 to 20.00.07.00 (the latest at this time), the CRC and I/O errors stopped. The remaining posts were about resolving possible data corruption on my drives.

 

Now SABnzbd is showing "OSError: [Errno 5] Input/output error: '/data/usenet_incomplete/...", and I opened a thread in the Binhex-SABnzbd support topic here: https://forums.unraid.net/topic/44118-support-binhex-sabnzbd/?do=findComment&comment=1124052

On 5/6/2022 at 12:46 PM, JorgeB said:

Correct.

 

Hey Jorge, I'm trying to figure out some share errors; some shares aren't showing up now. When I ran 'ls -lah /mnt', it shows question marks for disk 7 despite it showing OK in Main:

 

root@MediaVault:~# ls -lah /mnt
/bin/ls: cannot access '/mnt/disk7': Input/output error
total 16K
drwxr-xr-x 19 root   root  380 May  6 10:54 ./
drwxr-xr-x 21 root   root  480 May 10 09:58 ../
drwxrwxrwx  1 nobody users  80 May  6 11:04 cache/
drwxrwxrwx  9 nobody users 138 May  6 11:04 disk1/
drwxrwxrwx  8 nobody users 133 May  7 16:10 disk10/
drwxrwxrwx  5 nobody users  75 May  7 16:10 disk11/
drwxrwxrwx  6 nobody users  67 May  6 11:04 disk2/
drwxrwxrwx  9 nobody users 148 May  6 11:04 disk3/
drwxrwxrwx  4 nobody users  41 May  6 11:04 disk4/
drwxrwxrwx 10 nobody users 166 May  6 11:04 disk5/
drwxrwxrwx  7 nobody users 106 May  6 11:04 disk6/
d?????????  ? ?      ?       ?            ? disk7/
drwxrwxrwx  6 nobody users  73 May  7 16:10 disk8/
drwxrwxrwx  6 nobody users  67 May  6 11:04 disk9/
drwxrwxrwt  5 nobody users 100 May  6 13:41 disks/
drwxrwxrwt  2 nobody users  40 May  6 10:52 remotes/
drwxrwxrwt  2 nobody users  40 May  6 10:52 rootshare/
drwxrwxrwx  1 nobody users 138 May  6 11:04 user/
drwxrwxrwx  1 nobody users 138 May  6 11:04 user0/

 

I ran the file system check on disk 7 and got this output:

entry ".." at block 0 offset 80 in directory inode 282079658 references non-existent inode 6460593193
entry ".." at block 0 offset 80 in directory inode 282079665 references non-existent inode 4315820253
entry "Season 1" in shortform directory 322311173 references non-existent inode 2234721490
would have junked entry "Season 1" in directory inode 322311173
entry "Season 2" in shortform directory 322311173 references non-existent inode 4579464692
would have junked entry "Season 2" in directory inode 322311173
entry "Season 3" in shortform directory 322311173 references non-existent inode 6453739819
would have junked entry "Season 3" in directory inode 322311173
entry "Season 5" in shortform directory 322311173 references non-existent inode 2234721518
would have junked entry "Season 5" in directory inode 322311173
entry "Season 6" in shortform directory 322311173 references non-existent inode 4760222352
would have junked entry "Season 6" in directory inode 322311173
would have corrected i8 count in directory 322311173 from 3 to 0
entry ".." at block 0 offset 80 in directory inode 322311206 references non-existent inode 6453779144
entry ".." at block 0 offset 80 in directory inode 322311221 references non-existent inode 6460505325
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Tue May 10 10:06:15 2022

Phase		Start		End		Duration
Phase 1:	05/10 10:06:11	05/10 10:06:11
Phase 2:	05/10 10:06:11	05/10 10:06:11
Phase 3:	05/10 10:06:11	05/10 10:06:15	4 seconds
Phase 4:	05/10 10:06:15	05/10 10:06:15
Phase 5:	Skipped
Phase 6:	Skipped
Phase 7:	Skipped

Total run time: 4 seconds

 

Is there any reason not to run the file system repair (without the -n flag) now? After that completes, would I rebuild the drive like we did before, or how does that work?

 

New diags attached if needed.

disk 7 missing info.png

mediavault-diagnostics-20220510-1009.zip

Just now, JorgeB said:

No need to rebuild since the disk is enabled, but you do need to run xfs_repair without -n to fix the current corruption.

 

Attempted to repair the file system, as I was getting local errors on the monitor attached to the bare-metal computer. When trying to run 'xfs_repair -v /dev/md7', I received this error:

 

root@MediaVault:~# xfs_repair -v /dev/md7
Phase 1 - find and verify superblock...
        - block cache size set to 542376 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 159593 tail block 159588
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

Restarted the array in normal mode, and now disk 7 is showing 'Unmountable: not mounted' in Main - Array (screenshot attached).

 

disk 7 unmountable.png
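For anyone who lands here with the same error, the sequence the xfs_repair message describes, applied to an Unraid array disk, is roughly: start the array in maintenance mode, run the repair against the md device so parity stays in sync, and only fall back to -L (which discards the log and can itself lose recent metadata changes) if mounting the filesystem to replay the log is not possible. A sketch, using disk 7 as the example:

# with the array started in maintenance mode, attempt a normal repair
xfs_repair -v /dev/md7
# only if the log cannot be replayed by mounting the filesystem: discard the log and repair
xfs_repair -L -v /dev/md7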

