Multiple disks with millions of read errors.



Unraid - 6.7.2

 

So I woke up this morning to an array that has a disabled disk (single parity system) and seven disks all with millions of read errors. The array isn't mounted, and is unreachable from the network. In the shares pane only the disk shares are showing up, none of the other shares. I pulled a diagnostic report (attached below), and now wonder what the safe thing is to do. I did start a read test as indicated on the main page, but paused it almost immediately, not knowing if it would screw things up.

 

Last night I rebooted the server, as I was having some trouble with the TV system, as well. All seemed OK, and Plex was able to rescan the TV shows to find some new items that had been added. Watched a couple of episodes, and went to bed. 

 

My inclination is that I need to shut it down and check cabling, reseat controllers, etc., as that seems the most likely cause of millions of read errors all of a sudden, but I don't want to do anything that might compromise the system. I need to figure out whether all the disks are on the same controller, but I'd have to shut it down to get at the disks to see. Poking through the forums this morning, I see that Marvell controllers can be an issue, so I assume it's due to one of those. The trouble appears to start around 5:59 in the log, and there are indications that the controller is the issue; I am woefully deficient at understanding these logs, however. This server has been running faithfully for many years with the current hardware, just FYI.
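For reference, the controller-to-disk mapping can also be read from the shell without opening the case; a minimal sketch, assuming the usual udev /dev/disk/by-path symlinks are present (they normally are on a stock Unraid install):

lspci | grep -iE 'sas|sata|raid'   # list the storage controllers on the PCI bus
ls -l /dev/disk/by-path/           # each sdX symlink shows the PCI address of the controller it hangs off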

 

Also, I am assuming that the system disabled that one particular disk just because it can only disable one with single parity, and it randomly picked that one when the array crapped out.

 

Additionally, the system log tool has only one entry:

Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 134115360 bytes) in /usr/local/emhttp/plugins/dynamix/include/Syslog.php on line 20
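In case it helps, a minimal way to look at the syslog directly from the console when the webGUI viewer dies with that memory error, assuming Unraid's default /var/log/syslog location:

ls -lh /var/log/syslog                 # see how large the log has grown
tail -n 50 /var/log/syslog             # read the most recent entries without loading the whole file
grep -c 'I/O error' /var/log/syslog    # rough count of read-error lines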

Any help on how to proceed would be greatly appreciated. I am not very knowledgeable about the deep inner workings of the Unraid system, or Linux in general, but I am a quick learner. Hopefully some of you forum denizens will be able to help me out and point me in the right direction.

 

Thanks for listening.

tower-diagnostics-20200625-1429.zip


Disk17 is really failing, but with a good controller it would just get disabled instead of bringing all the other disks down with it. You should replace them with LSI controllers when possible.

 

35 minutes ago, ratmice said:

Also, the disabled disk has a note that it is "unmountable: no file system".

Check the filesystem on disk17 and, if all is OK, then replace it.

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
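If it helps, a minimal command-line equivalent of the check described in that wiki page, assuming the array is started in maintenance mode so the emulated /dev/md17 device exists (-n makes it report-only and change nothing):

xfs_repair -n /dev/md17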


Thanks for the prompt reply. As always, Johnny, you are a superb asset to the forums.

 

I am going to replace the controllers ASAP. I currently have a SuperMicro X8SIL-F motherboard, so it looks like I'm limited to x8 PCIe cards (but it seems the SASLP are x4 cards). Any recommendations for direct replacements for the SASLP controllers? It seems like:

LSI 9211-8i P20 IT Mode for ZFS FreeNAS unRAID Dell H310 6Gbps SAS HBA

might be a reasonable replacement; I'm just a bit fuzzy on the bus/lane deal. Are there more stable, proven replacements that have the SFF-8087 SAS connector so I can just plug and play?


So, I attempted to run xfs_repair and got the output below; it seems like the disk is really borked. However, is there a way to attempt mounting a single disk while in maintenance mode, or is the fact that starting the array chokes on this disk enough to justify jumping straight to ignoring the log while repairing? I am too *nix-illiterate to know this.

 

root@Tower:~# xfs_repair -v /dev/md17
Phase 1 - find and verify superblock...
        - block cache size set to 349728 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 449629 tail block 449625
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
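For completeness, a sketch of the mount-and-unmount sequence the error message is asking for, assuming the array is still in maintenance mode; the /mnt/disk17-test mount point is just an example name:

mkdir -p /mnt/disk17-test                 # temporary mount point
mount -t xfs /dev/md17 /mnt/disk17-test   # mounting replays the XFS log if it can
umount /mnt/disk17-test                   # unmount again before repairing
xfs_repair -v /dev/md17                   # then re-run the repair

If the mount itself fails, the error message says the fallback is xfs_repair -L, which throws away the unreplayed log and can lose some recent metadata.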

Also, two new controllers are on the way. If this attempt to repair disk17 is unsuccessful, would the best course of action be to shut down the array, install the new controllers, and then try to rebuild disk17 to a new drive? Or would it be OK to do that now?


So, one last question: now that the repair has completed, with lots of items placed in lost+found (indicating, I think, that there was a lot of corruption), should I bother to try mounting it? Or will that screw things up when I go to rebuild the disk? My thinking here is that if it is actually mountable, then Unraid will think its current state is what it's supposed to be and adjust parity accordingly, thus giving me a screwed-up rebuild. As it is, it appears that the emulated disk works OK.
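In case it's useful, a quick way to gauge what the repair swept into lost+found, assuming disk17 mounts at /mnt/disk17 once the array is started normally:

ls /mnt/disk17/lost+found | wc -l     # how many orphaned items landed there
du -sh /mnt/disk17/lost+found         # and how much data they amount to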

