[SOLVED] Parity Drive read/write errors



Hello. I booted up my NAS today and started the array. After I had finished copying over some files, I noticed notifications that there were errors on my parity drive (see attached image). I have since stopped the array and downloaded the diagnostics file (tower-diagnostics-20190401-1040.zip). I should mention that when booting the NAS, it initially did not go to the "RocketRAID BIOS Setting Utility" screen as it normally would (the attached image is just from Google, to show what I am talking about), and I had to reboot a few times before it came up. Once it did, everything started up normally.

In a previous post I made about drive errors, @johnnie.black told me that the cause could very likely be the RAID card and that I should replace it with another controller he recommended.

 

What I would like to know (as I am not very experienced with Unraid) is the order in which I should proceed. Do I power down the NAS and just wait for a new RAID card, do I run a parity check now, or is there some diagnostic I should run to make sure nothing is actually wrong with the drive? Just to mention, I do not have an extra 10TB HDD on hand to swap in.

 

Would the files I copied over when these errors occurred need to be copied again? Please let me know and I will get it all under way so I can get my NAS back up and running :)

Unraid.png

hqdefault.jpg

Edited by J0my
Link to comment
1 hour ago, johnnie.black said:

Parity disk dropped offline, likely controller related. You can reboot and post new diags so there's a SMART report for parity, but the best bet would likely be to replace that controller with one of the recommended LSI HBAs.

I rebooted and got the diagnostics (tower-diagnostics-20190401-2027.zip). I did not start the array, though, as the parity disk is currently disabled. I am trying to remember how to re-enable it: was it to unassign the drive, start the array, stop the array (at what point do you stop it, though?), re-assign the drive and then restart the array?

Link to comment

Parity looks fine, so it is very likely a controller issue, though it could also be power/cable related.

 

3 minutes ago, J0my said:

it was to unassign the drive, start the array, stop the array (at what point do you stop it though?), re-assign the drive and then restart the array, isn't it?

Correct, you can stop it immediately after start.

Link to comment
11 hours ago, johnnie.black said:

Correct, you can stop it immediately after start.

OK, I have done that and it is currently running a parity sync/rebuild with about 6 hours left. I have also just noticed that one of my drives, Disk 1, now says "Unmountable: No file system" (tower-diagnostics-20190402-0834.zip). The rebuild of the parity drive is reading from that drive, so should I stop the rebuild, reboot to see if the disk comes back, and then start the rebuild again?

Unraid2.jpg

Link to comment
31 minutes ago, johnnie.black said:

Let the sync finish then run a filesystem check on disk1

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:1048685) is ahead of log (1:1048661).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

This is what came up when I ran the filesystem check on disk 1 (tower-diagnostics-20190402-1712.zip). I am not sure what the next step is; the array is still in maintenance mode.

Link to comment
2 hours ago, johnnie.black said:

You need to remove the -n flag (no modify)
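For anyone following along, here is a rough sketch of how the three forms of the command used in this thread differ. It only builds the argument lists and does not run anything. The helper name `xfs_repair_command` and the device path `/dev/md1` are assumptions for illustration; check which device your own system uses for disk 1 before running anything.

```python
from typing import List


def xfs_repair_command(device: str, check_only: bool = True,
                       zero_log: bool = False) -> List[str]:
    """Build an xfs_repair argument list (hypothetical helper, nothing is executed).

    -n : no-modify mode, problems are reported but nothing is written
    -L : zero (destroy) the metadata log before repairing
    """
    # This sketch refuses the combination: a read-only check should never
    # be paired with the destructive log-zeroing option.
    if check_only and zero_log:
        raise ValueError("choose either a check (-n) or a log-zeroing repair (-L)")
    cmd = ["xfs_repair"]
    if check_only:
        cmd.append("-n")
    if zero_log:
        cmd.append("-L")
    cmd.append(device)
    return cmd


# Assumed device path for disk 1; yours may differ.
DEVICE = "/dev/md1"

print(" ".join(xfs_repair_command(DEVICE)))                                   # check only
print(" ".join(xfs_repair_command(DEVICE, check_only=False)))                 # actual repair
print(" ".join(xfs_repair_command(DEVICE, check_only=False, zero_log=True)))  # last resort
```

The default is the read-only check, so a repair that writes to the disk always has to be asked for explicitly.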

Removing the -n flag gets me this:
 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
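The two log messages seen so far can be told apart mechanically. Below is a minimal sketch that maps each message to the next step the output itself suggests. The function name `next_step` is made up for illustration, and the matched phrases are copied from the output pasted in this thread, so other xfs_repair versions may word them differently.

```python
def next_step(output: str) -> str:
    """Suggest a next step from pasted xfs_repair output (illustrative only)."""
    # The messages wrap across lines, so collapse all whitespace first.
    text = " ".join(output.split())
    if "ignored because the -n option was used" in text:
        return "check-only run: rerun without -n to actually repair"
    if "needs to be replayed" in text:
        return "mount the filesystem to replay the log; if it cannot be mounted, use -L"
    return "no log message recognised: read the full output"


sample = """ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair."""

print(next_step(sample))
```

Note that the output's own warning still applies: destroying the log with -L may cause corruption, so it is the fallback for when mounting fails.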

 

Link to comment
4 minutes ago, johnnie.black said:

Use -L

OK, I have done that and now get this:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1048697) is ahead of log (1:2).
Format log to cycle 4.
done

(tower-diagnostics-20190402-2016.zip) The array is still in maintenance mode and, for now, the drive is still showing as "Unmountable: No file system". Is there something I should do next to test whether it has worked?

Link to comment
5 minutes ago, johnnie.black said:

Start the array in normal mode; disks aren't mounted in maintenance mode.

Done, it is all back and present *deep sigh*. Thank you very much mate, you really are a life saver. I am not really knowledgeable in this stuff, so I really appreciate your patience with me.

 

One last question, just for general knowledge: when doing a Parity-Check, should the "Write corrections to parity" box be checked or not, and what is the recommended frequency of the checks?

 

I will definitely be looking to buy a new RAID card, or else just use the 4 SATA ports on my motherboard and go without a cache drive for the time being.

Edited by J0my
Link to comment
2 hours ago, J0my said:

when doing a Parity-Check, should the "Write corrections to parity" box be checked or not

Depends, and to expand on what johnnie said: if it's an automatic check, non-correcting is always best. If there are errors, you need to find out WHY there are errors, then correct the condition that caused them, and only then run a correcting check followed by another non-correcting check to see if the issue has been resolved.

 

One example where a correcting check may be needed is after an unclean shutdown.
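The order of operations described above can be written down as a small decision sketch. This is only an illustration of the advice in this post, not anything Unraid provides; the function name `parity_check_plan` and its two flags are made up for the example.

```python
from typing import List


def parity_check_plan(errors_found: bool, cause_fixed: bool) -> List[str]:
    """Return the steps to take after a non-correcting parity check (illustrative)."""
    if not errors_found:
        # Nothing to correct, so scheduled checks stay read-only.
        return ["keep automatic checks non-correcting"]
    if not cause_fixed:
        # Writing corrections before the cause is known could bake bad data into parity.
        return ["find out why the errors occurred",
                "correct the condition that caused them"]
    return ["run a correcting check",
            "run another non-correcting check to see if the issue is resolved"]


for step in parity_check_plan(errors_found=True, cause_fixed=True):
    print(step)
```

The point of the ordering is that a correcting check only ever runs once the cause is dealt with, and is always followed by a read-only check to confirm.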

Link to comment
21 hours ago, jonathanm said:

Depends, and to expand on what johnnie said: if it's an automatic check, non-correcting is always best. If there are errors, you need to find out WHY there are errors, then correct the condition that caused them, and only then run a correcting check followed by another non-correcting check to see if the issue has been resolved.

 

One example where a correcting check may be needed is after an unclean shutdown.

Thanks for expanding on that point, it is really helpful :)

Link to comment
