
Server unresponsive


Mortalic

Last night my Unraid server went unresponsive to the web UI, SSH, and even the local console after I plugged in a monitor/keyboard.

After hard powering it off (I know), a parity check started but was crazy slow. I checked the SMART status on all the drives and they all checked out.
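A quick spot check from the console can be done with smartctl along these lines (the device name is only a placeholder and needs to match the actual drive):

# Overall health verdict for one drive (replace /dev/sdb with your device)
smartctl -H /dev/sdb

# Full attribute table: reallocated sectors, pending sectors, CRC errors, etc.
smartctl -A /dev/sdb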

This morning the server is unresponsive again. 

What should I do?

 

EDIT: I also tried tapping the power button, which would normally shut it down cleanly, but that did not work.

 

EDIT2:
I was able to get an actual error after the most recent reboot:

The log mentions "XFS (md1p1): Internal error" and "XFS (md1p1): Corruption detected. Unmount and run xfs_repair."

Is it safe to run xfs_repair?
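A cautious first step, assuming the array is started in maintenance mode so disk1 is not mounted, is a read-only dry run, which reports problems without changing anything on the disk:

# Read-only check against the md device so parity stays in sync
xfs_repair -n /dev/md1p1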

 

The log is verbose, but it keeps repeating this too:
[ 314.906880] I/O error, dev loop2, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
[ 314.906880] I/O error, dev loop2, sector 564320 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 314.906890] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
[ 314.906895] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1, flush 1, corrupt 0, gen 0
[ 314.906900] BTRFS warning (device loop2): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
[ 314.906903] BTRFS: error (device loop2) in write_all_supers:4370: errno=-5 IO failure (errors while submitting device barriers.)
[ 314.906933] I/O error, dev loop2, sector 1088608 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 314.907022] BTRFS info (device loop2: state E): forced readonly
[ 314.907025] BTRFS: error (device loop2: state EA) in btrfs_sync_log:3198: errno=-5 IO failure
[ 314.907025] BTRFS error (device loop2: state E): bdev /dev/loop2 errs: wr 0, rd 2, flush 1, corrupt 0, gen 0
[ 314.908111] I/O error, dev loop2, sector 564320 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 314.908117] BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 0, rd 3, flush 1, corrupt 0, gen 0
[ 314.908138] I/O error, dev loop2, sector 1088608 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 314.908141] BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 0, rd 4, flush 1, corrupt 0, gen 0
[ 314.908636] I/O error, dev loop2, sector 564320 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 314.908644] BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 0, rd 5, flush 1, corrupt 0, gen 0
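Side note: those loop2 errors refer to a loop device rather than an array disk directly; on Unraid the loop devices usually back the docker and libvirt image files. Confirming which file loop2 maps to looks roughly like this:

# List every loop device and the file backing it
losetup -l

# Or just the one in question
losetup /dev/loop2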

 

EDIT3:

I replaced disk1, which seemed to be coming up in the logs a lot, even though it showed no errors and SMART checked out OK. This has allowed me to boot and performance appears normal; however, after the data rebuild there are no shares, Docker can't start, and there are no VMs.

 

But yes, now I can get into settings and pull diagnostics.

 

 

vault-diagnostics-20231112-0706.zip


Man.... I think I've really lost everything that was on disk1.... Now I'm getting XFS errors on boot, even though the array can start.

It's telling me to:

root@vault:~# xfs_repair /dev/md1p1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
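In other words, the message wants a mount/unmount cycle so the XFS journal gets replayed, which normally happens just by starting the array; done by hand it would look roughly like this (the mount point is an arbitrary example):

# Try a normal mount so XFS can replay its journal, then unmount again
mkdir -p /tmp/disk1_test
mount -t xfs /dev/md1p1 /tmp/disk1_test && umount /tmp/disk1_test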

 

Attempting to mount the filesystem does not clear this up, and in fact when I go to /mnt, ls throws three errors:

root@vault:~# cd /mnt
root@vault:/mnt# ls
/bin/ls: cannot access 'disk1': Input/output error
/bin/ls: cannot access 'user': Input/output error
/bin/ls: cannot access 'user0': Input/output error
cache/  disk1/  disk2/  disk3/  disk4/  usb/  user/  user0/
root@vault:/mnt#

 

Is there anything I can do? I don't want to lose everything.  I've started copying what's left on disks 2, 3 and 4...

If I run xfs_repair -L, it seems like I could lose everything....

Please help

  • Solution
4 hours ago, Mortalic said:

If I run xfs_repair -L, it seems like I could lose everything....

This is the normal action at this point, as Unraid has already failed to mount the drive. Normally the -L option causes no data loss, and even when it does, it is usually only the last file being written that has a problem.

 

The online documentation, accessible via the Manual link at the bottom of the Unraid GUI, has a section covering file system repair, and it says you should use the -L option in point 5.
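For anyone following along, the repair is run with the array started in maintenance mode, against the md device rather than the raw disk so that parity is updated as well; roughly:

# Array started in maintenance mode, then:
xfs_repair -L /dev/md1p1    # -L zeroes (destroys) the journal, then repairs

# When it finishes, stop the array and start it normally so the disk mounts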

14 hours ago, itimpi said:

This is the normal action at this point, as Unraid has already failed to mount the drive. Normally the -L option causes no data loss, and even when it does, it is usually only the last file being written that has a problem.

 

The online documentation, accessible via the Manual link at the bottom of the Unraid GUI, has a section covering file system repair, and it says you should use the -L option in point 5.

Thank you. I ran it in maintenance mode and it only took a minute or two. Now my array came back up, mostly problem free. Shares are all back (haven't validated any data yet), as well as my VMs; however, all my Docker containers are missing.

The syslog has this section from when I ran it:

 

Nov 12 18:47:33 vault kernel: XFS (md1p1): Internal error i != 0 at line 2798 of file fs/xfs/libxfs/xfs_bmap.c. Caller xfs_bmap_add_extent_hole_real+0x528/0x654 [xfs]
Nov 12 18:47:33 vault kernel: CPU: 3 PID: 20234 Comm: fallocate Tainted: P O 6.1.49-Unraid #1
Nov 12 18:47:33 vault kernel: Call Trace:
Nov 12 18:47:33 vault kernel: XFS (md1p1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c. Caller xfs_alloc_file_space+0x206/0x246 [xfs]
Nov 12 18:47:33 vault kernel: CPU: 3 PID: 20234 Comm: fallocate Tainted: P O 6.1.49-Unraid #1
Nov 12 18:47:33 vault kernel: Call Trace:
Nov 12 18:47:33 vault root: truncate: cannot open '/mnt/disk1/system/docker/docker.img' for writing: Input/output error
Nov 12 18:47:33 vault root: mount error

 

Looking at that file, it exists, but it's owned by nobody... should I just update chown and chmod to make it owned by root and executable?

-rw-r--r-- 1 nobody users 21474836480 Nov 13 14:11 docker.img
root@vault:/mnt/disk1/system/docker#


You can always delete the docker.img file and let Unraid recreate it and use Apps->Previous Apps to get containers back with their previous settings.
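From the command line that amounts to something like the sketch below, using the image path shown in the earlier log; the same thing can be done entirely from Settings -> Docker in the GUI:

# Disable the Docker service first (Settings -> Docker -> Enable Docker: No),
# then move the old image out of the way rather than deleting it outright
mv /mnt/disk1/system/docker/docker.img /mnt/disk1/system/docker/docker.img.bad

# Re-enable Docker so Unraid recreates a fresh docker.img,
# then use Apps -> Previous Apps to reinstall containers with their old settings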

 

However, that message looks a little concerning, as I would have expected the repair process to fix anything like that. It is possible there really is a problem with the disk. You might want to consider running an extended SMART test on it to check that it can complete without error.
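If the console is more convenient than the GUI, an extended self-test can be started with smartctl; /dev/sdX is a placeholder for the actual device:

# Start the drive's extended (long) offline self-test
smartctl -t long /dev/sdX

# Check on it later: progress and the self-test log are in the full report
smartctl -a /dev/sdX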


Ok, I'll do some reading there.  I'm not familiar with that process, so thank you for suggesting it.

 

This is a different disk from the one that got me into all this trouble. It never actually failed, but disk1 kept throwing these crazy errors, so I replaced it with another disk I had lying around.

I'll run extended SMART tests on all of them.

Also, I kicked off a parity check; is it OK to let that run?

5 hours ago, itimpi said:

You can always delete the docker.img file and let Unraid recreate it and use Apps->Previous Apps to get containers back with their previous settings.

 

However, that message looks a little concerning, as I would have expected the repair process to fix anything like that. It is possible there really is a problem with the disk. You might want to consider running an extended SMART test on it to check that it can complete without error.

I started the extended SMART test on the parity drive (16TB) about an hour ago and it's been at 10% the entire time. Is that normal?

 

EDIT:

Extended SMART test still at 10% after several hours.... starting to get nervous about it.

6 hours ago, Mortalic said:

I started the extended SMART test on the parity drive (16TB) about an hour ago and it's been at 10% the entire time. Is that normal?

The test only increments in 10% steps, so it is quite normal for it to stick at each value for a while. I normally estimate up to 2 hours per 10% increment, but if it is taking longer than that it may not be a good sign. You could check whether anything is showing in the syslog.
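The drive's own progress counter, which also only counts down in steps, can be read directly with smartctl; /dev/sdX is again a placeholder:

# "Self-test execution status" reports the percentage of the test remaining
smartctl -c /dev/sdX | grep -A 2 "Self-test execution status"

# Completed or aborted tests appear in the self-test log
smartctl -l selftest /dev/sdX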

8 hours ago, itimpi said:

The test only increments in 10% steps, so it is quite normal for it to stick at each value for a while. I normally estimate up to 2 hours per 10% increment, but if it is taking longer than that it may not be a good sign. You could check whether anything is showing in the syslog.

Hmmm, it's the next morning and the parity drive still reports 10%, so that's not comforting. The actual parity check is still running, so perhaps that's getting in the way?

Syslog does have some messages from yesterday I didn't notice, but nothing since then.

I can't remember what time I started the extended SMART check, but it would have been a bit after the parity check, which was around this time frame:
 

Nov 13 14:19:04 vault kernel: ata9.00: exception Emask 0x12 SAct 0x200000 SErr 0x280501 action 0x6 frozen
Nov 13 14:19:04 vault kernel: ata9.00: irq_stat 0x08000000, interface fatal error
Nov 13 14:19:05 vault kernel: ata9.00: exception Emask 0x10 SAct 0x20000000 SErr 0x280100 action 0x6 frozen
Nov 13 14:19:05 vault kernel: ata9.00: irq_stat 0x08000000, interface fatal error
Nov 13 14:19:05 vault kernel: ata9.00: exception Emask 0x10 SAct 0x20000 SErr 0x280100 action 0x6 frozen
Nov 13 14:19:05 vault kernel: ata9.00: irq_stat 0x08000000, interface fatal error
Nov 13 14:19:06 vault kernel: ata9.00: exception Emask 0x10 SAct 0x38000 SErr 0x280100 action 0x6 frozen
Nov 13 14:19:06 vault kernel: ata9.00: irq_stat 0x08000000, interface fatal error
Nov 13 14:20:02 vault root: Fix Common Problems: Error: Default docker appdata location is not a cache-only share ** Ignored

 

On 11/14/2023 at 8:56 AM, itimpi said:

I would very much doubt whether you can successfully run a parity check and an extended SMART test at the same time. You really want the SMART test to have exclusive access to the drive while it is running.

This appears to be correct. I let parity finish (0 sync errors, somehow), gave it a reboot, and told all the drives to run extended SMART tests. Overnight the cache drive finished successfully; three of the others are at 80-90%, and the larger ones are all around 50%.

 

Regarding the docker.img recreate process...

Is the process basically:

  1. back up the configurations (see the sketch below)
  2. reinstall the Docker apps
  3. copy the configs back over the top
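For the backup step, a minimal sketch assuming the common Unraid default appdata share and an example destination directory (container configs live in appdata, outside docker.img, so they normally survive recreating the image anyway):

# Copy container configs somewhere safe before touching docker.img
# (source is the usual Unraid appdata share; destination is just an example)
rsync -a /mnt/user/appdata/ /mnt/disk2/backups/appdata-$(date +%Y%m%d)/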

 

EDIT:
After parity success and reboot, there is still one error in the syslog:
 

Nov 14 20:31:57 vault kernel: ata9.00: exception Emask 0x10 SAct 0x4 SErr 0x280100 action 0x6 frozen
Nov 14 20:31:57 vault kernel: ata9.00: irq_stat 0x08000000, interface fatal error

 

EDIT2: I ran dmesg | grep ata9.

Looks like the replacement drive is the culprit, but I've got a new replacement drive showing up tomorrow, so I'll know one way or the other whether that error gets solved.
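For reference, mapping an ataN port from the log back to a physical drive can be done like this; the sysfs symlink for each /dev/sdX names the port it hangs off:

# Every kernel message for that SATA port
dmesg | grep -i 'ata9'

# Which /dev/sdX sits on ata9: the symlink target path includes the port name
ls -l /sys/block/sd* | grep 'ata9'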

 

EDIT3: All extended SMART tests passed, even on the stand-in drive that was showing a CRC SMART error... weird.

The new drive showed up to replace the stand-in drive; no more syslog errors.

Parity run should be complete in a couple days.

I renamed docker.img to docker_old.img and restarted the Docker service.

I redownloaded all my Docker containers and it appears everything picked up where it left off.

 

This was a long road but thank you itimpi for helping me out.

On a side note, I also used GPT-4 to ask some pretty specific questions at times, and it was pretty helpful in laying out ways to troubleshoot certain steps. Even when I dropped giant logs into its prompt, it was good at parsing them to pick out and explain what was happening. I'd suggest that for anyone else running into issues.

