Several Simultaneous Failures



16 hours ago, trurl said:

Little confused about your screenshots. They both seem to indicate rebuild of disk14 completed, but disk18 is now disabled.

 

The 1st screenshot shows 2 unmountable disks, the rebuilt disk14, and the disabled disk18. The 2nd screenshot shows only 1 unmountable disk, the rebuilt disk14, with the disabled disk18 mounted.

I replaced disk 14 with a brand new healthy disk and followed the process Johnnie Black noted above. The rebuild began and completed. During the rebuild, Disk 18 started showing problems and wound up becoming unmountable. After the rebuild, disk 14 shows as unmountable as well. Since you said BAD GUESS, I realize I should be checking/repairing that disk rather than formatting it (I did not format, thank goodness).

 

So, Disk 18 is a new set of failures, I believe, and Disk 14 has never been remounted since I replaced and rebuilt it.

Link to comment
27 minutes ago, srfnmnk said:

FS type is auto -- thus I don't see xfs repair option...how do I run it on "auto" FS type?

I see you have luks:xfs as the default filesystem, so that is what "auto" means. If it doesn't give you the option to do the repair in the webUI you can probably do it from the command line, but I have no experience with encrypted drives so you should wait for someone else like @johnnie.black 😉
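For reference, on an encrypted array the command-line target is the /dev/mapper/mdX device for that slot (as ends up being used below for disk 14); a minimal sketch, assuming the array is started in maintenance mode:

# run from ssh with the array in maintenance mode; luks-backed array disks are exposed under /dev/mapper
xfs_repair -nv /dev/mapper/md14   # -n: check only, report problems without writing anything
xfs_repair -v /dev/mapper/md14    # drop -n to perform the actual repair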

Link to comment

Done, this was the output.

 


root@pumbaa:/dev/mapper# xfs_repair -v /dev/mapper/md14
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 3067592 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 69490 tail block 69486
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
 

Link to comment
4 minutes ago, srfnmnk said:

  -L           Force log zeroing. Do this as a last resort.
 

This is what you're wanting me to do, right?

Yes.

 

The repair tool is generic for XFS filesystems on Linux, not specific to Unraid. It is just telling you that you should attempt to mount the disk first. But Unraid has already told you the disk is unmountable, so the log can't be replayed because the disk can't be mounted.
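In practical terms, the sequence the error message describes looks roughly like this (a sketch only; the manual mount is just the log-replay attempt and is expected to fail here, and the mount point is illustrative):

# try to mount once so the XFS log can replay (Unraid already reports this disk unmountable)
mount /dev/mapper/md14 /mnt/disk14
# if the mount fails, zero the log and repair as a last resort (recent metadata in the log is lost)
xfs_repair -vL /dev/mapper/md14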

Link to comment

OK, that looks to have gotten it mounted and back into the array. I went to /mnt/disk14 and don't see a lost+found folder.

 

So now the only issue appears to be disk18; I believe this is a net-new issue that occurred during the disk14 rebuild. I have attached the extended self-test results (failure). I'm assuming that since it's also unmountable I could do the same xfs_repair even while the array is started, right (i.e. I don't need to put it in maintenance mode)? But I'm also assuming that since SMART failed I'm likely going to need to replace it as well; now that the array is healthy, though, this can be a normal disk replace with a drive of >= size, right?

 

Last point: I have two drives showing up as "Historical Devices". The TOSHIBA is the original DISK19 that became unmountable, and the other appears to be my current disk 6. I'm guessing this happened when we did the forced new config, so should I just "remove" them with the red X?

 

(two screenshots attached)

Link to comment
5 minutes ago, srfnmnk said:

I could do the same xfs_repair even while the array is started, right (i.e. I don't need to stick it in maintenance mode)?

Yes.

 

5 minutes ago, srfnmnk said:

But, I'm also assuming that since SMART failed I'm likely to need to just go ahead and replace it as well, but, now that the array is healthy this can be a normal disk replace, of >= size, right?

Correct, but you should still run xfs_repair first on the emulated disk.
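As a rough sketch of that order of operations (device path assumed from the earlier repairs; disk 18 is the emulated one here):

# with the array in maintenance mode, check and repair the emulated disk 18 first
xfs_repair -nv /dev/mapper/md18   # dry run: report problems without writing
xfs_repair -v /dev/mapper/md18    # actual repair
# then stop the array, swap in the replacement drive (same size or larger), and let it rebuild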

 

6 minutes ago, srfnmnk said:

so should i just "remove" them with the red X?

That's from the UD plugin, but yes you can remove them.

Link to comment

Hello,

 

I thought this journey was coming to a close but the saga continues...

 

Since the last posts I have received yet another new disk and replaced disk 18.

  • I stopped the array, powered down
  • replaced the drive
  • started the array with the new drive selected as disk 18...the rebuild began and everything completed.
  • The disk was unmountable, so I ran xfs_repair -vL /dev/mapper/md18 as before; it completed and the disk could then be mounted
  • Restarted the array in maintenance mode and ran xfs_repair -nv from the GUI; everything looked good (completed as expected)

I thought that would be the end of it...BUT...that evening a wholly new disk became "disabled" (Disk 8 - SN: PL1331LAHGXLLH). So on I go:

  • I noticed that the emulated files from disk 8 were unavailable. I ssh'd in and verified that I cannot copy any files from disk 8 via the console...I tried both /mnt/disk8/... & /mnt/user/<share>
  • xfs_repair in maintenance mode hit the same superblock issue, so I ran xfs_repair -vL on Disk 8 (emulated), which had just been disabled.
  • Ran an extended SMART test and it passed...
  • now xfs_repair -nv looks normal but the drive is still disabled...
  • Restarting the array shows that the files are being emulated, but the files are not there. I looked (from ssh) in /mnt/disk8 and /mnt/user/<share>... and they're in neither place, so I checked lost+found: there are 2261 dirs with data in them, but they all have numeric IDs for names...(screenshot attached) The data is spread all throughout these folders...
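For reference, a quick way to take stock of what ended up in lost+found (a sketch, assuming it lives at /mnt/disk8/lost+found; entries there are named after inode numbers, not their original paths):

# count the recovered entries
ls /mnt/disk8/lost+found | wc -l
# guess file types so the contents can be sorted back out by hand
find /mnt/disk8/lost+found -type f -exec file {} + | less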

 

What the devil is going on?! Ideas? Before I build a new config or do anything else, I wanted to check in and get your thoughts. I'm worried that if I do another new config and mark disk 8 as good, we might be going in circles now...

 

It seems I did have 2 bad drives -- both have been replaced and rebuilt -- but why do drives keep becoming disabled?

 

How do I get disk 8 enabled again and get the array healthy again?

 

Attaching some screenshots and a new diagnostics package of the current status. One other note: when the array is stopped (and maybe at other times too), I see this error scrolling in the log: "device /dev/sdae problem getting id" (screenshot attached). I don't see any sdae device anywhere in the Unraid GUI, but over ssh it does exist in /dev.
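One way to see what /dev/sdae actually is, since it exists in /dev but not in the GUI (standard Linux tools, offered as a sketch):

# ask udev what is behind /dev/sdae
udevadm info --query=all --name=/dev/sdae | grep -E 'ID_MODEL|ID_SERIAL'
# or read the drive's identity page directly; repeated failures here point at the device or link itself
smartctl -i /dev/sdae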

before_repairs.png

lost_found.png

after_repairs.png

sdae_error.png

pumbaa-diagnostics-20200903-1248.zip

Link to comment
19 minutes ago, srfnmnk said:

I thought that would be the end of it...BUT

That's why I mentioned earlier:

On 8/31/2020 at 10:19 AM, JorgeB said:

You are having multiple disk errors; you need to fix that first or it will make any rebuild difficult. It looks more like a power/connection problem. If you have one, try another PSU and/or another controller, and also check/replace all cables.

 

Link to comment

This is the same PSU that's been in there for 1.5 years. Are you suggesting that it might be intermittently failing or something? I have a UPS connected and am not seeing any issues/fluctuations on input/output power...

 

The server also never turns off -- I just cannot fathom the randomness of the power failures that would be required to make this happen...The failures were across two different power rails too; I confirmed that when you mentioned this earlier.

 

The effort required to replace and rewire the PSU is significant; is there anything else it could be? I would like to rule out absolutely everything else first. :/ There are some ongoing errors, such as the /dev/sdae "problem getting id" and whatnot...
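One cheap check before rewiring the PSU: cabling/connection trouble tends to show up as a rising UDMA CRC error count in SMART, which can be read per drive (the device name below is a placeholder):

# attribute 199 (UDMA_CRC_Error_Count) increments on flaky SATA links, not on bad platters
smartctl -A /dev/sdX | grep -i crc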

Link to comment
