DizRD Posted September 14, 2022

I went to update the Plex app, and it seemed to be stuck there forever. Eventually I reloaded the Unraid Docker page, and it gave me an error, something to the effect of plex.ico being read-only. Then everything went bonkers. I started seeing errors in the system log about my cache drive not being accessible. I exported a diagnostic log at that time (attached).

I went ahead and restarted the server, thinking maybe the Docker container update had put something in a stuck state. When it came back up, 4 of my pool drives said they were unmountable. I searched and found a thread about booting into maintenance mode and running xfs_repair on the drives. It fixed some things, and then I rebooted. Everything seems fine now, but I'm worried. I ran a SMART test on the cache drive, and it reported errors, but I've never had good luck with SMART reports in Unraid (attached).

Anyone want to chime in with health insights or other suggestions? Some of my system shares are set to prefer cache. What happens if the cache really fails? Will the system be inoperable until I replace the cache drive or disable the cache?

deathstar-diagnostics-20220913-2024.zip
deathstar-smart-20220913-2221.zip
DizRD Posted September 14, 2022 (Author)

Here's the diagnostic after the reboot, now that things seem operational. By the way, I haven't moved any cables or anything; everything hardware-wise had been solid until now.

deathstar-diagnostics-20220913-2334.zip
JorgeB Posted September 14, 2022

The cache disk looks OK; the problem appears to start with this:

Sep 13 07:10:40 deathstar kernel: DMAR: ERROR: DMA PTE for vPFN 0x70 already set (to 2e9b16001 not 20c994d002)

This is the same error some users started seeing with the first v6.10 releases, which we then found can cause data corruption. It's the first time I've seen it with v6.9.x, but I wasn't looking for it before. In any case, you should update to v6.10.3 ASAP; that issue can no longer happen there.
DizRD Posted September 14, 2022 (Author)

Interesting! I will try the update. I had updated to 6.10.3 before, but ran into surprise permission issues on shares, and when I rolled back to my previous Unraid version the permission issues went away. I guess I'll have to see whether they are fixed in the newest version.
kizer Posted September 14, 2022

You can install Dynamix File Manager and quickly fix permission issues, depending on what they are.
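For context, a minimal sketch of what "fixing permissions" on a share usually amounts to on Unraid (the built-in New Permissions tool resets files to nobody:users with read/write for user and group; this is not necessarily what Dynamix File Manager does under the hood). The scratch directory and the /mnt/user/appdata path are illustrative assumptions; on a real server the chown step needs root:

```shell
# Sketch of an Unraid-style permissions reset, demonstrated on a scratch
# directory instead of a real share such as /mnt/user/appdata (assumption).
umask 022                      # pin the umask so the demo is deterministic
target=$(mktemp -d)
touch "$target/example.cfg"

chmod -R u+rw,g+rw,o+r "$target"
# On a live server you would also run (as root):
#   chown -R nobody:users "$target"

stat -c '%a' "$target/example.cfg"   # file mode after the reset
```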
DizRD Posted September 15, 2022 (Author)

Thanks. I'd heard of Dynamix but was worried that manually changing permissions to get things working in 6.10.3 might mess with default permission configurations needed for future upgrade paths.

Some of my system shares are set to prefer cache. What happens if the cache really fails? Will the system be inoperable until I replace the cache drive or disable the cache?
trurl Posted September 15, 2022

2 minutes ago, DizRD said:
    Some of my system shares are set to prefer cache. What happens if the cache really fails?

You need to have backups. There are plugins to back up appdata and VMs.
DizRD Posted September 16, 2022 (Author)

Absolutely, backups have been made, but I'm more curious whether the system is fault-tolerant enough to keep operating if the cache drive dies, or whether it halts, since I don't know how long it would take me to get a replacement cache drive.
JorgeB Posted September 16, 2022

The server works without a cache, though you lose any services running there, e.g., any Docker containers or VMs using it.
trurl Posted September 16, 2022

If Docker or the VM Manager is enabled when the array starts and can't find its .img file because the cache is missing, a new one will be created on the array, but it won't have any contents.
DizRD Posted September 18, 2022 (Author)

I will try the update again and report back. If it causes permission issues again, I'll try to figure out how to fix them with the Dynamix File Manager plugin.

For my peace of mind, what does this error mean in the log?

Sep 17 03:48:27 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 17 03:48:27 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdv1 errs: wr 0, rd 0, flush 0, corrupt 2986, gen 0
Sep 17 03:48:27 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 17 03:48:27 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdv1 errs: wr 0, rd 0, flush 0, corrupt 2987, gen 0
Sep 17 03:48:29 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 17 03:48:29 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdv1 errs: wr 0, rd 0, flush 0, corrupt 2988, gen 0
Sep 17 03:48:29 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 17 03:48:29 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdv1 errs: wr 0, rd 0, flush 0, corrupt 2989, gen 0
JorgeB Posted September 18, 2022

15 minutes ago, DizRD said:
    What does this error mean in the log?

It means btrfs is detecting data corruption, likely the result of the DMAR error above.

On 9/14/2022 at 9:21 AM, JorgeB said:
    we then found it can cause data corruption
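A small sketch of how those warnings can be turned into a list of affected files: each "csum failed … ino N" line names the inode of a corrupt file, so pulling out the distinct inode numbers tells you what to go look for. This assumes GNU grep/awk/sort, as on the Unraid console:

```shell
# Extract the distinct inode numbers from btrfs "csum failed" warnings.
# On a live system, feed it the real log: corrupt_inodes < /var/log/syslog
corrupt_inodes() {
    grep -o 'ino [0-9]*' | awk '{print $2}' | sort -un
}

# Sample input matching the lines in this thread:
corrupt_inodes <<'EOF'
Sep 17 03:48:27 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 17 03:48:27 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdv1 errs: wr 0, rd 0, flush 0, corrupt 2986, gen 0
EOF
# prints: 1838869
```

Each inode can then be mapped to a path on the mounted pool with `btrfs inspect-internal inode-resolve <inode> /mnt/cache` (mount point assumed).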
DizRD Posted September 18, 2022 (Author)

Updated to 6.10.3, and I'm still seeing the btrfs errors. What should I check now?

deathstar-diagnostics-20220918-0427.zip
JorgeB Posted September 18, 2022 (Solution)

Once there's corruption, updating won't fix anything, but it should prevent more. Run a scrub; any corrupt files found will be listed in the syslog. Delete them, or replace them from backups.
DizRD Posted September 18, 2022 (Author)

Thanks! Running the scrub now. Now that I've updated to 6.10.3, should I open a separate thread for the permission issue it creates?
JorgeB Posted September 18, 2022

Yes, do open one if you still have issues with that, but see here and here for some info. Usually v6.10 is not the problem; how the containers are configured is.
DizRD Posted September 19, 2022 (Author; edited September 19, 2022 by DizRD)

Hmmm. I ran the scrub, tracked down the 3 files it mentioned, and removed them. I've rerun the scrub since then and found no errors, but I'm still seeing BTRFS errors on the cache drive. Do I need to restart?

deathstar-diagnostics-20220919-0619.zip
JorgeB Posted September 19, 2022

I'm not seeing any errors after the scrub.
DizRD Posted September 20, 2022 (Author)

Weird, because my syslog is still showing recent BTRFS errors:

Sep 19 22:40:42 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:40:42 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5732, gen 0
Sep 19 22:40:42 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 1838869 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:40:42 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5733, gen 0
Sep 19 22:40:58 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 20864475 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:40:58 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5734, gen 0
Sep 19 22:40:58 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 20864475 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:40:58 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5735, gen 0
Sep 19 22:41:38 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 20864461 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:41:38 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5736, gen 0
Sep 19 22:41:38 deathstar kernel: BTRFS warning (device dm-15): csum failed root 5 ino 20864461 off 291581952 csum 0x069f7410 expected csum 0xf7d976f9 mirror 1
Sep 19 22:41:38 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5737, gen 0
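One detail worth knowing when reading lines like these: the "corrupt NNNN" figure is btrfs's cumulative per-device error counter (the same number `btrfs device stats` reports, cleared only by `btrfs device stats -z <dev>`), so it continues from its old value rather than starting at zero. What matters is whether it is still rising. A small sketch, assuming GNU grep/awk, of pulling the latest counter value from syslog-style input:

```shell
# Print the most recent cumulative "corrupt" counter seen in btrfs error
# lines. On a live system: latest_corrupt_count < /var/log/syslog
latest_corrupt_count() {
    grep -o 'corrupt [0-9]*' | awk 'END {print $2}'
}

# Sample input matching the lines in this thread:
latest_corrupt_count <<'EOF'
Sep 19 22:40:42 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5732, gen 0
Sep 19 22:41:38 deathstar kernel: BTRFS error (device dm-15): bdev /dev/mapper/sdw1 errs: wr 0, rd 0, flush 0, corrupt 5737, gen 0
EOF
# prints: 5737
```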
JorgeB Posted September 20, 2022

Run another scrub and post new diagnostics when it's done.
DizRD Posted September 23, 2022 (Author)

Marked as solved. I haven't seen the btrfs errors since the scrub and a reboot. I will open a separate thread for the permission problems.