2 disks have failed, and one of them leaves the array stuck on "mounting" even when it is not present


xoC
Solved by JorgeB


So here it is for disk1. It had many CRC errors, and yesterday I ran the check with -n and then without -n. Today it says:



    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
    ALERT: The filesystem has valuable metadata changes in a log which is being
    ignored because the -n option was used.  Expect spurious inconsistencies
    which may be resolved by first mounting the filesystem to replay the log.
            - scan filesystem freespace and inode maps...
    sb_fdblocks 121721460, counted 125133027
            - found root inode chunk
    Phase 3 - for each AG...
            - scan (but don't clear) agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 3
            - agno = 1
            - agno = 2
    No modify flag set, skipping phase 5
    Phase 6 - check inode connectivity...
            - traversing filesystem ...
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify link counts...
    No modify flag set, skipping filesystem flush and exiting.

 

Now for disk2: I did the same yesterday and got a lot of errors with -n. I then tried without -n and it didn't complete, but I don't remember the error. Today's filesystem check for disk2 is attached as a text file because it is far too long to paste.

disk 2.txt

Edited by xoC
Link to comment
    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
    ERROR: The filesystem has valuable metadata changes in a log which needs to
    be replayed.  Mount the filesystem to replay the log, and unmount it before
    re-running xfs_repair.  If you are unable to mount the filesystem, then use
    the -L option to destroy the log and attempt a repair.
    Note that destroying the log may cause corruption -- please attempt a mount
    of the filesystem before doing this.

 

Should I try with -L?
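The two messages quoted so far suggest an order of operations: replay the log by mounting first, and keep -L as a last resort. A minimal shell sketch of that decision, assuming the xfs_repair output is piped in and that the wording matches the messages quoted in this thread. The function name `next_repair_step` and its three result strings are made up for illustration.

```shell
#!/bin/sh
# Sketch only: read xfs_repair output on stdin and print a suggested next step.
# The matched phrases are copied from the messages quoted in this thread;
# other xfs_repair versions may word them differently (assumption).
next_repair_step() {
    out=$(cat)
    if printf '%s\n' "$out" | grep -qF -- '-L option to destroy the log'; then
        # xfs_repair refused to run: the log has to be replayed first.
        echo 'mount-to-replay-log-first; use -L only if mounting fails'
    elif printf '%s\n' "$out" | grep -qF -- 'ignored because the -n option was used'; then
        # Read-only run on a dirty log: reported inconsistencies may be spurious.
        echo 'dirty-log; mount to replay it, then re-run the check'
    else
        echo 'no-log-warning-found'
    fi
}
```

For example, `xfs_repair -n /dev/sde1 2>&1 | next_repair_step` would classify a read-only check without changing anything on the disk.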

Link to comment

Thanks.

I started the array and it did indeed mount! It immediately started a rebuild.

 

Do you think one or both disks are failing and should be replaced?
Since I'm two disks down, the array is currently unprotected, so maybe I should not try to rebuild if a disk is in bad shape.

 

Edit: no lost+found folder on either disk.

Edited by xoC
Link to comment

Two disks getting disabled at the same time is usually not a disk problem, and SMART looks good for both. If the emulated content for both looks correct, you can rebuild on top. It is possibly a good idea to check/replace the cables before doing so to rule that out, and if it happens again, save the diags before rebooting.

Link to comment
  • 3 weeks later...

Hello again,

 

It's been a nightmare since then. Every time the parity sync runs, it finishes correctly, and then one or two disks get disabled immediately.

I shut the server off one week ago because I had no time to investigate.

Yesterday, parity 1 and disk 1 were disabled. I changed the SATA cables for parity 1, parity 2 and disk 1 and ran a rebuild.

It completed last night and, according to the logs, disk 1 and disk 2 were disabled 20 minutes later.

 

Is it OK to continue in this topic, or should I open a new, unresolved one?

 

Attached are the diagnostics; the server has not been powered down since then.

nastorm-diagnostics-20230901-0912.zip

Link to comment

I do not see the parity sync completing in the diagnostics. Instead I see continual resets until some disks that look like disk1 and disk2 drop offline, which means the diagnostics contain no SMART information for those drives that we can check.

 

You mention the SATA cabling - have you also checked the power cabling? 

Are you sure your PSU can handle the load when all drives are active?

Have you tried running an extended SMART test on disk1 and disk2?
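For the extended test, one common route outside the GUI is smartctl from smartmontools. A minimal sketch that only prints the commands for review rather than running them; the device paths are placeholders supplied by the caller, and the helper name `smart_long_plan` is invented here.

```shell
#!/bin/sh
# Sketch only: print (do not run) smartctl commands for an extended self-test.
# Device paths come from the caller; /dev/sdX-style names are placeholders,
# not the devices from this thread.
smart_long_plan() {
    for dev in "$@"; do
        # -t long starts the extended (long) self-test on the drive.
        echo "smartctl -t long $dev"
        # -l selftest prints the self-test log once the test has finished.
        echo "smartctl -l selftest $dev"
    done
}
```

Calling `smart_long_plan /dev/sdX /dev/sdY` lists two commands per drive, which can then be run by hand one at a time.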

Link to comment

Thanks for your answer.

In the notification archive I found this entry, which says the parity sync finished but with lots of errors.

[Screenshot: parity notification from the archive]

 

For the power, that's a good question. The server has been running with the same hardware (except for a few disk swaps) for at least two years, but the power supply is not new.

IIRC it's a 600W one. My server is connected to a 650W UPS and it doesn't show that much load.

[Screenshot: UPS load]

 

Do you think 650W is too low, though, for 8 disks + 2 SSDs, with no GPU or PCIe expansion cards?
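A rough way to sanity-check that question is to add up the worst-case draw, since spinning drives pull the most while starting up and a load readout taken at idle will not show that peak. A minimal sketch in shell arithmetic; every wattage is a parameter to be filled in from the drive and board datasheets, and the figures in the usage note are placeholders, not measurements from this server. Both function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: estimate peak draw in watts from caller-supplied figures.
# Arguments: HDD count, SSD count, peak watts per HDD, watts per SSD,
# watts for the rest of the system. All wattages are assumptions that
# must come from datasheets; nothing here is measured.
peak_watts() {
    n_hdd=$1; n_ssd=$2; hdd_w=$3; ssd_w=$4; base_w=$5
    echo $(( n_hdd * hdd_w + n_ssd * ssd_w + base_w ))
}

# Share of the PSU rating used by that peak, in percent (integer division).
psu_load_percent() {
    peak=$1; psu_w=$2
    echo $(( peak * 100 / psu_w ))
}
```

With placeholder figures, `peak_watts 8 2 25 5 100` prints 310 and `psu_load_percent 310 600` prints 51. A total under the rating says nothing about a single power cable feeding several drives, which would need its own check.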

I'm away from the server today, and it says the extended SMART test needs the disk spin-down delay turned off. Every time I apply "never" and come back, it's back on 15 min, so it seems I can't run the extended tests remotely. Do I need to shut down and restart the array for that option to change?

And the short tests do absolutely nothing.

 

Link to comment

On the power supply connector, from memory, there are 4 or 6 disks in series.
I'm going to check later when I'm back at my server.
Maybe I can swap the bays of disks 1/2 with those of disks 3/4, then rebuild and check whether the reset errors move to disks 3/4?

Link to comment

So I unplugged my two cache disks and moved their power connectors to disk1 & disk2.

Parity 1, 2 and disk 1, 2 were on the same power line, in series.

 

I tried to mount disk 1: unmountable, no valid superblock, with the advice to use xfs_repair.

Disk 2: mountable.

 

For some unknown reason, when I ran xfs_repair again from the GUI, it found no issues on either disk.
I ran xfs_repair -n /dev/sde1 and there were errors; I tried without -n and it said to use -L.

I ran it with -L and now the disk is mountable. It seems the GUI repair didn't fix anything even though it said it did, and the command-line repair actually worked.
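The sequence described here, from the read-only check up to -L, can be written down as a dry-run plan. A minimal sketch that only echoes the commands so that each step stays a deliberate choice; `repair_plan` is a made-up helper name, and the device argument is whatever device is being checked (this post used /dev/sde1).

```shell
#!/bin/sh
# Sketch only: print the xfs_repair escalation order for one device.
# Nothing is executed. Per the xfs_repair message quoted in this thread,
# destroying the log with -L may cause corruption, so it is listed last.
repair_plan() {
    dev=$1
    echo "xfs_repair -n $dev   # 1. read-only check, changes nothing"
    echo "xfs_repair $dev      # 2. real repair; may ask for a mount or -L"
    echo "xfs_repair -L $dev   # 3. last resort: destroy the log, then repair"
}
```

Running `repair_plan /dev/sde1` prints the three commands in order, with the destructive one at the end.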

 

I'm now in maintenance mode and rebuilding the array; I will see what the outcome is in a few hours.

Edited by xoC
Link to comment
