xoC Posted August 15, 2023

Hello, I have 2 disks which have failed. I'm kind of lost about what to do, as all the usual links to the FAQ are broken. I have attached my diagnostics. Even when I unselect both disks (empty) and start the array, it gets stuck and shows "Mounting" on disk 2. It does the same thing after trying to rebuild the array. Thanks in advance.

nastorm-diagnostics-20230815-1658.zip
xoC Posted August 15, 2023

To add another picture: here is what happens when both disks are physically disconnected. I can't even start the array to access the (emulated) contents, as it is stuck on "Mounting disks...".
JorgeB Posted August 15, 2023

Please post the diagnostics. If you cannot get them, grab at least the syslog:

cp /var/log/syslog /boot/syslog.txt
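For context, a sketch of the two ways to capture logs from a terminal on a stock Unraid install (the `diagnostics` command and its output path are assumptions based on current Unraid releases; verify on your version):

```shell
# Generate the full diagnostics zip from the command line;
# it is written to /boot/logs/ on the flash drive
diagnostics

# Fallback if diagnostics cannot run: copy the in-memory syslog to the
# flash drive (mounted at /boot) so it survives a reboot
cp /var/log/syslog /boot/syslog.txt
```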
xoC Posted August 15, 2023

Thanks for your quick answer. I thought the zip I posted in the first post was the diagnostics?

nastorm-diagnostics-20230815-1658.zip
JorgeB Posted August 16, 2023

12 hours ago, xoC said:
I thought the zip I posted in the first post was the diagnostics?

You did, sorry, I missed it. Check the filesystem on disk2 (run it without -n).
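A sketch of the suggested check, assuming the array is started in maintenance mode. The device name for disk 2 is an assumption here (`/dev/md2` on older Unraid releases, `/dev/md2p1` on newer ones); the "Check Filesystem Status" button on the disk's GUI page runs the equivalent:

```shell
# Read-only pass first: -n reports problems without changing anything
xfs_repair -n /dev/md2

# If problems are found, run again without -n to actually repair them
xfs_repair /dev/md2
```

Repairing through the /dev/mdX device (rather than the raw /dev/sdX device) keeps parity updated while the repair writes to the disk.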
xoC Posted August 16, 2023

So here it is for disk1. It had many errors (CRC), and yesterday I did a run with -n and then without -n. Today it says:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is
being ignored because the -n option was used.  Expect spurious
inconsistencies which may be resolved by first mounting the filesystem
to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 121721460, counted 125133027
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Now for disk2: I did the same yesterday. Lots of errors with -n; I tried without -n and it didn't complete, but I don't remember the error. Today's disk2 filesystem check is attached in txt format because it is way too long.

disk 2.txt

Edited August 16, 2023 by xoC
JorgeB Posted August 16, 2023

Try disk2 again without -n and post the output.
xoC Posted August 16, 2023

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which
needs to be replayed.  Mount the filesystem to replay the log, and
unmount it before re-running xfs_repair.  If you are unable to mount
the filesystem, then use the -L option to destroy the log and attempt
a repair.  Note that destroying the log may cause corruption -- please
attempt a mount of the filesystem before doing this.

Should I try with -L?
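The error means the filesystem journal still holds unreplayed transactions. The usual order of operations, sketched here with an assumed device and mount point, is to let a mount replay the log first and use -L only as a last resort, since -L discards the journal and can lose the most recent metadata changes:

```shell
# 1) Try to replay the log with a mount/unmount cycle
mkdir -p /mnt/test
mount /dev/md2 /mnt/test && umount /mnt/test

# 2) Only if the mount fails: destroy the log and repair
#    (-L throws away any unreplayed journal entries)
xfs_repair -L /dev/md2
```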
xoC Posted August 16, 2023

Here it is.

disk2 -L.txt

Edited August 16, 2023 by xoC
JorgeB Posted August 16, 2023 (Solution)

It should mount now. If it does, look for a lost+found folder.
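Checking for recovered files is just a directory listing; after a repair, xfs_repair places orphaned files there, named by inode number (the disk share path below is an assumption based on Unraid's default layout):

```shell
# Files whose directory entries were lost during the repair end up here
ls -la /mnt/disk2/lost+found
```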
xoC Posted August 16, 2023

Thanks. I started the array, and it did mount indeed! It immediately started a rebuild. Do you think one or both disks are failing and should be replaced? Since I'm 2 down, the array is currently unprotected; maybe I should not try to rebuild if a disk is in improper shape.

edit: no lost+found folder on either disk.

Edited August 16, 2023 by xoC
JorgeB Posted August 16, 2023

Two disks getting disabled at the same time is usually not a disk problem, and SMART looks good for both. If the emulated content for both looks correct, you can rebuild on top. It's a good idea to check/replace the cables first to rule that out, and if it happens again, save the diagnostics before rebooting.
xoC Posted August 16, 2023 Author Share Posted August 16, 2023 Ok, noted ! Thanks a lot for your quick help 1 Quote Link to comment
xoC Posted September 1, 2023

Hello again. It's been a nightmare since then: every time the parity sync runs, it finishes correctly, and then one or two disks get disabled immediately afterwards. I shut the server off a week ago because I had no time to investigate. Yesterday, parity 1 and disk 1 were disabled. I changed the SATA cables for parity 1, parity 2, and disk 1 and ran a rebuild. It completed overnight, and from the logs, it disabled disk 1 and disk 2 twenty minutes later. Is it ok to continue in this topic, or would it be better to open a new, unresolved topic?

Attached are the diagnostics; the server has not been powered down since.

nastorm-diagnostics-20230901-0912.zip
itimpi Posted September 1, 2023

I do not see the parity sync completing in the diagnostics. Instead I see you getting continual resets until what look like disk1 and disk2 drop offline, which means there is no SMART information for those drives in the diagnostics that we can check.

You mention the SATA cabling; have you also checked the power cabling? Are you sure your PSU can handle the load when all drives are active? Have you tried running an extended SMART test on disk1 and disk2?
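For reference, an extended SMART test can be started from a terminal like this (the device name is an example; the same test is available from the disk's page in the GUI, and the drive must stay spun up for the whole test to finish):

```shell
# Start a long (extended) self-test; it runs inside the drive firmware
# and does not interrupt normal use
smartctl -t long /dev/sdb

# Check progress and the result log afterwards
smartctl -l selftest /dev/sdb
```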
xoC Posted September 1, 2023

Thanks for your answer. In the notification archives I found an entry saying the parity sync finished, but with lots of errors.

For the power, that's a good question. The server has been running with the same hardware (except a few disk swaps) for at least two years, but the power supply is not young. IIRC it's a 600W unit. The server is connected to a 650W UPS, and it doesn't show much load. Do you think 650W is too low for 8 disks + 2 SSDs, with no GPU or PCIe expansion cards?

I'm away from the server today, and it said extended SMART tests require the spin-down delay to be disabled. Every time I set it to "Never" and come back, it's back to 15 min, so it seems I can't run the extended tests remotely. Do I need to stop and restart the array for that option to stick? And the short tests do absolutely nothing.
JorgeB Posted September 1, 2023

Looks more like a power/connection issue with those two disks. Do they share something in common, like a power splitter?
xoC Posted September 1, 2023

On the power supply connector, from memory, there are 4 or 6 disks daisy-chained. I'm going to check later when I go back to my server. Maybe I can try swapping disks 1/2 with 3/4 in their bays, then rebuild and check whether the reset errors move to disks 3/4?
xoC Posted September 1, 2023 Author Share Posted September 1, 2023 (edited) Is it risky to rebuild with read error problems ? Can it write "wrong" data to my disk 1 & 2 by trying to rebuild but having lot of read errors ? Edited September 1, 2023 by xoC Quote Link to comment
JorgeB Posted September 1, 2023

Any read errors beyond what parity can emulate will make Unraid skip those sectors during the rebuild. For a disk being rebuilt on top of itself, it will keep the old data, which should be correct; for a new spare disk, it will leave whatever was already there.
xoC Posted September 1, 2023

Thanks for the info.
xoC Posted September 5, 2023

So I unplugged my two cache disks and put their power connectors on disk 1 and disk 2. Parity 1, parity 2, disk 1, and disk 2 were all on the same power line, daisy-chained.

I tried to mount:
disk 1: unmountable, no valid superblock, "please use xfs_repair".
disk 2: mountable.

For some unknown reason, I ran xfs_repair again from the GUI and it found no issues on either disk. I then ran xfs_repair -n /dev/sde1 and there were errors; I tried without -n and it said to use -L. I used -L and now the disk is mountable. It seems the GUI repair didn't fix anything even though it said it did, while the command-line repair actually worked.

I'm right now in maintenance mode rebuilding the array; I'll see what the outcome is in a few hours.

Edited September 5, 2023 by xoC
JorgeB Posted September 5, 2023

1 minute ago, xoC said:
I ran xfs_repair -n /dev/sde1

Repairing the raw device like this will not update parity; you should run a correcting parity check.