How screwed am I? io errors everywhere



So I had some heat issues today; some disks hit 60°C before I realised.

 

Anyway, I sorted it, but then noticed some strange behaviour, primarily IO write errors over SMB. I loaded up the logs and found a lot of issues. Took the array down and up again, no go. Rebooted into maintenance mode and found disk14 reporting xfs_check issues, but after leaving it for a while and checking the logs again, they're filled with the below.

 

[screenshot: syslog filled with IO errors]

 

 

So... how bad is it? Looking at the docs, I should run xfs_check -V /dev/sdX, which I tried on disk14, the only one that actually reported an issue in the webGUI's xfs check.

 

 

But that's been running for the past 15 minutes, trying to find secondary superblocks in the filesystem:

[screenshot: xfs_check searching for a secondary superblock]

 

 

 

So, help please :S


xfs_repair in the webGUI, after a 2nd run with -nv, basically said "just start the array up bro, it'll be good". Doubtful as I was, I tried it, and so far... it seems okay. Scrubbing the btrfs cache as well, just in case.

 

edit;

 

ye okay, spoke too soon.

 

[screenshot: errors reappearing after starting the array]

 

Thoughts on the best action? I'll try to unmount again and repair, but I doubt it'll work.

39 minutes ago, Mizerka said:

run xfs_check -V /dev/sdX

You can't repair the sdX device, and you shouldn't repair the sdX1 partition of a disk in the array, or you will invalidate parity. You must run it on the md# device.

10 minutes ago, Mizerka said:

xfs_repair in webgui

If you do it from the webUI then it will do it correctly.
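As a concrete sketch of the above (disk14 and these device names are assumptions from this thread; check your own system before running anything):

```shell
#!/bin/sh
# Hypothetical sketch for disk14; substitute your own disk number.
DISK_NUM=14

# Wrong targets:
#   xfs_repair /dev/sdX    - the filesystem lives on the partition, not the raw device
#   xfs_repair /dev/sdX1   - writes bypass Unraid's parity layer and invalidate parity
# Right target: the md device, which keeps parity updated on every write.

echo "xfs_repair -nv /dev/md${DISK_NUM}"   # dry run first: -n checks without modifying
```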

5 minutes ago, trurl said:

You can't repair the sdX device, and you shouldn't repair the sdX1 partition of a disk in the array, or you will invalidate parity. You must run it on the md# device.

If you do it from the webUI then it will do it correctly.

I see; the doc specifies either can be used. I'll try now with md# instead. To confirm: if I want to run it from the webUI, should I just change -nv to -v?

 

Trying to run it from the webUI now only displays this:

[screenshot: webGUI check output]

 

not allowing any actions

 

using md# also returns device is busy

 

[screenshot: xfs_repair reporting the device is busy]

8 minutes ago, trurl said:

Also, see this section on that same wiki page:

 

https://wiki.unraid.net/Check_Disk_Filesystems#Drive_names_and_symbols

Okay, ye, makes sense, so run it against md# instead. I've gone back into maintenance mode and I'm getting the errors from my edit above; md14 says the drive is busy and the webUI refuses to run anything beyond -n/-nv.

 

I've tried to run the repair, but it never got past saying the magic number failed and trying to find a secondary superblock.

 

Here's the full check output, if it helps:

 

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x43c89d, xfs_bnobt block 0x3a381e28/0x1000
Metadata CRC error detected at 0x43c89d, xfs_bnobt block 0x74703c48/0x1000
btree block 1/1 is suspect, error -74
btree block 2/1 is suspect, error -74
bad magic # 0xdaa0086c in btbno block 1/1
bad magic # 0x2fdfba35 in btbno block 2/1
Metadata CRC error detected at 0x43c89d, xfs_cntbt block 0x3a381e30/0x1000
btree block 1/2 is suspect, error -74
bad magic # 0x419e48e9 in btcnt block 1/2
agf_freeblks 122094523, counted 0 in ag 1
agf_longest 122094523, counted 0 in ag 1
Metadata CRC error detected at 0x43c89d, xfs_cntbt block 0x74703c50/0x1000
btree block 2/2 is suspect, error -74
bad magic # 0xa8692ca5 in btcnt block 2/2
agf_freeblks 121856058, counted 0 in ag 2
agf_longest 121856058, counted 0 in ag 2
Metadata CRC error detected at 0x46ad5d, xfs_inobt block 0x3a381e38/0x1000
btree block 1/3 is suspect, error -74
Metadata CRC error detected at 0x46ad5d, xfs_inobt block 0x74703c58/0x1000
bad magic # 0x639e272e in inobt block 1/3
btree block 2/3 is suspect, error -74
bad magic # 0x796a2ce3 in inobt block 2/3
Metadata CRC error detected at 0x46ad5d, xfs_inobt block 0xaea85a78/0x1000
btree block 3/3 is suspect, error -74
bad magic # 0x15f1f03 in inobt block 3/3
sb_ifree 59, counted 44
sb_fdblocks 2926555418, counted 2681574888
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 4
        - agno = 3
        - agno = 14
        - agno = 22
        - agno = 8
        - agno = 9
        - agno = 5
        - agno = 6
        - agno = 10
        - agno = 12
        - agno = 15
        - agno = 16
        - agno = 13
        - agno = 17
        - agno = 18
        - agno = 2
        - agno = 21
        - agno = 7
        - agno = 19
        - agno = 20
        - agno = 23
        - agno = 11
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
Maximum metadata LSN (904557511:-555599277) is ahead of log (1:6247).
Would format log to cycle 904557514.
No modify flag set, skipping filesystem flush and exiting.

 

34 minutes ago, Mizerka said:

xfs_repair in webgui, after 2nd run of -nv said, "just start the array up bro, it'll be good"

What did it actually say?

 

The -n (nomodify) flag means check but don't repair anything.
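For reference, the flags being discussed in this thread (a summary of common xfs_repair usage, not a quote from the Unraid wiki; see `man xfs_repair`):

```shell
#!/bin/sh
# xfs_repair flag summary (assumed typical usage):
#   -n  no-modify: report problems but change nothing (what the webGUI check runs)
#   -v  verbose output; without -n, xfs_repair actually writes its fixes
#   -L  zero the metadata log: last resort, may lose recently logged metadata
echo "check only:  xfs_repair -nv /dev/md14"
echo "real repair: xfs_repair -v /dev/md14"
```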

1 minute ago, trurl said:

What did it actually say?

 

The -n (nomodify) flag means check but don't repair anything.

Running the webGUI check with a -v flag gives this output:

 

Phase 1 - find and verify superblock...
        - block cache size set to 6097840 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 6247 tail block 6235
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

47 minutes ago, trurl said:

What did it actually say?

 

The -n (nomodify) flag means check but don't repair anything.

After looking around the forums a bit more, I came across a similar post where a mod advised running against /dev/mapper/md# if the drives are encrypted (all of mine are, btw), and then to -L it.

 

Which spits out this output, same as the webUI:

[screenshot: same log-replay error as the webUI]

 

Clearly it wants me to run with -L, but that sounds destructive? It's a 12TB disk, mostly filled; I'd really hate to lose it. At this point, would I be better off removing it, letting parity emulate it, moving the data elsewhere, then reformatting and adding it back to the array?


Okay, so I think I'm good now. I ended up booting back into the full array with md14 mounted, moved all the data off it without issues, then went back into maintenance mode and could now run -v. Once complete, I started the array again and it's been fine for the last 20 minutes or so. Crisis averted for now. If -v hadn't worked, I'd probably have run -L and just reformatted the disk if that corrupted the filesystem.


Running it from the webUI on encrypted drives should still do the correct thing; i.e., 

1 hour ago, Mizerka said:

run against /dev/mapper/md#

-L is usually necessary, since the log can't be used if the disk is unmountable.
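Putting the whole thread together, the recovery order that worked here can be sketched as follows (md14 is this thread's disk number; the function only prints the plan and touches no device):

```shell
#!/bin/sh
# Prints the recovery sequence used in this thread; purely illustrative.
repair_plan() {
    dev="/dev/md$1"
    echo "1. start the array normally so $dev mounts and XFS replays its journal"
    echo "2. stop the array / enter maintenance mode so $dev is unmounted again"
    echo "3. xfs_repair -v $dev"
    echo "4. only if the mount in step 1 fails: xfs_repair -vL $dev (zeroes the log; destructive)"
}
repair_plan 14
```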

