JorgeB Posted August 31, 2020
You are having multiple disk errors; you need to fix that first or it will make any rebuild difficult. It looks more like a power/connection problem. If you have spares, try another PSU and/or another controller, and also check/replace all cables.
geeksheikh Posted August 31, 2020
16 hours ago, trurl said: Little confused about your screenshots. They both seem to indicate rebuild of disk14 completed, but disk18 is now disabled. The 1st screenshot shows 2 unmountable disks, the rebuilt disk14, and the disabled disk18. The 2nd screenshot shows only 1 unmountable disk, the rebuilt disk14, with the disabled disk18 mounted.
I replaced disk 14 with a brand-new healthy disk and followed the process Johnnie Black noted above. The rebuild began and completed. During the rebuild, disk 18 started showing problems and wound up unmountable. After the rebuild, disk 14 shows as unmountable as well. Since you said BAD GUESS, I now realize I should be checking/fixing that disk. (I did not format, thank goodness.) So, disk 18 is a new set of failures, I believe, and disk 14 has never been remounted since I replaced and rebuilt it.
JorgeB Posted August 31, 2020
Run xfs_repair on disk14; you might also try it on the emulated disk18.
geeksheikh Posted August 31, 2020
12 minutes ago, johnnie.black said: Run xfs_repair on disk14
FS type is auto -- thus I don't see an xfs_repair option... how do I run it on the "auto" FS type?
trurl Posted August 31, 2020
27 minutes ago, srfnmnk said: FS type is auto -- thus I don't see xfs repair option...how do I run it on "auto" FS type?
I see you have luks:xfs as the default filesystem, so that is what "auto" means. If it doesn't give you the option to do the repair in the webUI, you can probably do it from the command line, but I have no experience with encrypted drives, so you should wait for someone else like @johnnie.black 😉
geeksheikh Posted August 31, 2020
Right, here are the screenshots for Disk 14 and Disk 18. Notice that Disk 14 doesn't have the Check FS block like Disk 18 does.
JorgeB Posted August 31, 2020
You can manually set the fs to xfs, or run on the command line:

xfs_repair -v /dev/mapper/mdX

Replace X with the correct disk #.
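On encrypted (luks:xfs) arrays the repair still targets the /dev/mapper device. A minimal sketch of how the command is put together (disk 14 is just this thread's example; the script only prints the command, since a real repair should be run deliberately with the array in maintenance mode):

```shell
#!/bin/sh
# Build the xfs_repair invocation for an Unraid array disk.
# On Unraid, /dev/mapper/mdX is the (decrypted, for luks:xfs) device
# for array disk X. DISK=14 is this thread's example disk number.
DISK=14
DEV="/dev/mapper/md${DISK}"
CMD="xfs_repair -v ${DEV}"

# Print rather than execute: repairs should be run deliberately.
echo "$CMD"
```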
geeksheikh Posted August 31, 2020
Done, this was the output.

root@pumbaa:/dev/mapper# xfs_repair -v /dev/mapper/md14
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 3067592 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 69490 tail block 69486
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
trurl Posted August 31, 2020
6 minutes ago, srfnmnk said: use the -L option
geeksheikh Posted August 31, 2020
-L Force log zeroing. Do this as a last resort.
This is what you're wanting me to do, right?
trurl Posted August 31, 2020
4 minutes ago, srfnmnk said: -L Force log zeroing. Do this as a last resort. This is what you're wanting me to do, right?
Yes. The repair tool is generic for XFS filesystems on Linux, not specific to Unraid. It is just telling you that you should attempt to mount the disk first. But Unraid has already told you the disk is unmountable, so the log can't be replayed because the disk can't be mounted.
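trurl's reasoning can be sketched as a small helper that picks the right command: mount first to replay the log, and fall back to -L only when the disk is unmountable. Everything here is illustrative (the mount point included), and the helper only prints the command rather than running it:

```shell
#!/bin/sh
# Choose the repair path xfs_repair itself recommends: if the disk can
# be mounted, mount/unmount replays the log and a plain repair is safe;
# if not (as Unraid already reported here), -L zeroes the log as a last
# resort. This only prints the chosen command.
repair_cmd() {
    dev="$1"; mountable="$2"
    if [ "$mountable" = "yes" ]; then
        echo "mount $dev /mnt/tmp && umount /mnt/tmp && xfs_repair -v $dev"
    else
        echo "xfs_repair -vL $dev"
    fi
}

repair_cmd /dev/mapper/md14 no
```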
geeksheikh Posted August 31, 2020
Attached is the attempted repair output for Disk 14. disk_repair_output.txt
JorgeB Posted August 31, 2020
The disk should mount now. Check that the contents look correct, and also look for a lost+found folder and any data inside it.
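A quick sketch of checking lost+found after a forced repair: xfs_repair names orphaned files and directories after their inode numbers, so counting the entries gives a feel for how much was orphaned. A temporary directory stands in for a real path like /mnt/disk14/lost+found so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Count orphaned entries in lost+found. LF is a stand-in for a real
# mount path; the two touched files play the role of orphaned inodes.
LF=$(mktemp -d)
touch "$LF/1234" "$LF/5678"   # pretend inode-numbered orphans

count=$(find "$LF" -mindepth 1 | wc -l | tr -d ' ')
echo "lost+found entries: $count"
rm -rf "$LF"
```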
geeksheikh Posted August 31, 2020
Ok, that looks to have gotten it mounted and back into the array. I went to /mnt/disk14 and don't see a lost+found folder. So now the only issue appears to be disk18; I believe this is a net-new issue that occurred during the disk14 rebuild. I have attached the extended self-test results (failure). I'm assuming that since it's also unmountable I could do the same xfs_repair even while the array is started, right (i.e. I don't need to put it in maintenance mode)? But I'm also assuming that since SMART failed I'll likely need to just go ahead and replace it as well; now that the array is healthy, this can be a normal disk replacement with a drive of >= size, right? Last point: I have two drives showing up as "Historical Devices". The TOSHIBA is the original DISK19 that became unmountable, and the other appears to be my current disk 6. I'm guessing this happened when we did the force new config stuff, so should I just "remove" them with the red X?
JorgeB Posted August 31, 2020
5 minutes ago, srfnmnk said: I could do the same xfs_repair even while the array is started, right (i.e. I don't need to stick it in maintenance mode)?
Yes.
5 minutes ago, srfnmnk said: But, I'm also assuming that since SMART failed I'm likely to need to just go ahead and replace it as well, but, now that the array is healthy this can be a normal disk replace, of >= size, right?
Correct, but you should still run xfs_repair first on the emulated disk.
6 minutes ago, srfnmnk said: so should i just "remove" them with the red X?
That's from the Unassigned Devices (UD) plugin, but yes, you can remove them.
geeksheikh Posted August 31, 2020
xfs_repair complete on the emulated Disk18. Output was very similar to Disk14's. Should I just leave it emulated until the replacement arrives, or try to mount it? Thanks again for all the help. Whatta trip this has been.
JorgeB Posted August 31, 2020
8 minutes ago, srfnmnk said: or try to mount it?
No problem trying to mount the emulated disk; whatever you see there is what will be on the rebuilt disk later.
geeksheikh Posted August 31, 2020
Ok, and the best/only way to mount it is to restart the array, right?
geeksheikh Posted September 1, 2020
Ok, so now disk 18 is mounted, but it's still disabled and emulated. Hoping to receive the additional disks tomorrow.
geeksheikh Posted September 3, 2020
Hello, I thought this journey was coming to a close but the saga continues... Since the last posts I have received yet another new disk and replaced disk 18.

- I stopped the array, powered down, replaced the drive, and started the array with the new drive selected as disk 18. The rebuild began and everything completed.
- The disk was unmountable, so I ran xfs_repair -vL /dev/mapper/md18 as before; it completed and the disk could then be mounted.
- I restarted the array in maintenance mode and ran xfs_repair -nv from the GUI; everything looked good (completed as expected).

I thought that would be the end of it... BUT... that evening a wholly new disk became "disabled" (Disk 8 - SN: PL1331LAHGXLLH). So on I go:

- I noticed the emulated files from disk 8 were unavailable. I ssh'd in and verified that I could not copy any files on disk 8 from the console; I tried both /mnt/disk8/... and /mnt/user/<share>.
- xfs_repair in maintenance mode hit the same superblock issue, so I ran xfs_repair -vL on the emulated, just-disabled disk 8.
- I ran an extended SMART test and it passed... now xfs_repair -nv looks normal, but the drive is still disabled...
- Restarting the array shows the files are being emulated, but the files are not there. I looked (from ssh) in /mnt/disk8 and /mnt/user/<share>... and they're in neither place... so I checked lost+found and there are 2261 dirs with data in them, but they all have numeric IDs (screenshot attached). The data is spread throughout these folders...

What the devil is going on?! Ideas? Before I build a new config or do anything else I wanted to check in and get your thoughts. I'm worried that if I do another new config and note that disk 8 is good, we might be going in circles now... It seems I did have 2 bad drives -- both have been replaced and rebuilt -- but why do drives keep becoming disabled? How do I get disk 8 enabled again and the array healthy again?
Attaching some screenshots and a new diagnostics package of the current status. One other note: when the array is stopped (maybe other times too) I see this scrolling error in the log: "device /dev/sdae problem getting id" (screenshot attached). I don't see any sdae device anywhere in the Unraid GUI, but over ssh it does exist in /dev. pumbaa-diagnostics-20200903-1248.zip
JorgeB Posted September 3, 2020
19 minutes ago, srfnmnk said: I thought that would be the end of it...BUT
That's why I mentioned earlier:
On 8/31/2020 at 10:19 AM, JorgeB said: You are having multiple disk errors, you need to fix that first or it will make any rebuild difficult, looks more like a power/connection problem, if you have it try another PSU and/or another controller, also check/replace all cables.
geeksheikh Posted September 3, 2020
This is the same PSU that's been in there for 1.5 years. Are you suggesting that it might be intermittently going out or something? I have a UPS connected and am not seeing any issues/fluctuations on input/output power... The server also never turns off -- I just cannot fathom the randomness of the power failures that would be required to make this happen. The failures were across two different power rails, too; I did confirm that when you mentioned this earlier. The effort required to replace and rewire the PSU is significant; is there anything else it could be? I would like to rule out absolutely everything else first. There are some ongoing errors, such as the /dev/sdae "problem getting id" and whatnot...
JorgeB Posted September 3, 2020
I'm not saying it's the PSU; I'm saying there's a hardware problem, and until you fix it you'll continue to have trouble. It could be a cable, the controller, power, etc. It's very difficult to diagnose remotely or without swapping some hardware around.
JorgeB Posted September 3, 2020
One thing you can try now is to update the LSI controller to the latest firmware. Your log shows:

mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00)

The current release is 20.00.07.00.
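As a sketch, the running firmware version can be pulled straight out of the kernel log. The sample line below is the one quoted above; on a live system you would pipe dmesg (or the syslog from the diagnostics zip) through the same sed expression:

```shell
#!/bin/sh
# Extract the LSI HBA firmware version from an mpt2sas kernel-log line.
# The sample line is the one quoted in this thread; replace the
# here-string with `dmesg | grep FWVersion` on a real system.
line="mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00)"
fw=$(printf '%s\n' "$line" | sed -n 's/.*FWVersion(\([0-9.]*\)).*/\1/p')
echo "$fw"
```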