Random unresponsive issue with unraid V6.12

me217 · March 2

I'll start by saying this is a very random fault, but a very annoying one. It has been hard to diagnose because by the time I get to the machine it is usually completely unresponsive. Tonight, though, I caught it mid-spiral.

Symptoms are as follows.

I use my unraid for Plex and a VM of home assistant, plus a few other minor things.

The system starts to become hard to reach, but it is still reachable, just slow to respond. I could still watch Plex remotely, although the stream did crash on occasion. Home Assistant was reachable remotely but could only control local devices, nothing running through any cloud services.

When I got home the server GUI loaded and I could stop all Docker containers and VMs. Beforehand, the CPU and RAM usage was normal.

I connected a monitor and keyboard, logged in locally and issued a shutdown command, and that's when the machine locked up.

This issue happens randomly, usually when I'm asleep or not home, and by the time I get to it the machine has completely locked up and needs a hard reset to get going again.

I managed to pull diagnostics before shutting it down so hopefully the issue can be resolved once and for all.

Any help would be greatly appreciated, and if I need to supply any more information, let me know.

vault-diagnostics-20240302-2316.zip

Edited March 3 by me217
Version update in title

itimpi · March 2

The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted. It could be worth enabling the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash. The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.

me217 · March 2

Those diagnostics were pulled from the machine before shutdown, so whatever caused it should be in there.

JorgeB · March 3

Btrfs is detecting data corruption and to prevent further issues it forced the pool read-only:

Mar  2 19:42:39 Vault kernel: BTRFS error (device nvme0n1p1): block=11098538557440 write time tree block corruption detected

In this case there's a clearly visible bit flip:

Mar  2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=11098538557440 slot=83, unexpected item end, have 2415931228 expect 12124

Quote

have 2415931228 expect 12124

Hex(12124)= 2F5C

hex(2415931228)=9000 2F5C
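The comparison above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `flipped_bits` is hypothetical, and the log line is the one copied from the diagnostics quoted above. It XORs the "have" and "expect" values to show which bit positions differ; the low 16 bits (2F5C) are identical and only the top hex digit of the 32-bit value changes.

```python
import re

# Kernel log line copied from the post above.
LINE = ("Mar  2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): "
        "corrupt leaf: root=2 block=11098538557440 slot=83, "
        "unexpected item end, have 2415931228 expect 12124")


def flipped_bits(line):
    """Return (have, expect, differing bit positions), or None if no match.

    Hypothetical helper for illustration; it only looks for the
    'have N expect M' pair that appears in this particular message.
    """
    m = re.search(r"have (\d+) expect (\d+)", line)
    if m is None:
        return None
    have, expect = int(m.group(1)), int(m.group(2))
    diff = have ^ expect  # bits set here differ between the two values
    bits = [i for i in range(diff.bit_length()) if (diff >> i) & 1]
    return have, expect, bits


have, expect, bits = flipped_bits(LINE)
print(f"have   = 0x{have:08X}")
print(f"expect = 0x{expect:08X}")
print(f"xor    = 0x{have ^ expect:08X}, differing bit positions: {bits}")
```

A value that matches the expected one except for a couple of high bits is a pattern that points at hardware (typically RAM) rather than the filesystem code, which is why memtest is the first step.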

Start by running memtest

P.S: also change the docker network to ipvlan

me217 · March 3

4 minutes ago, JorgeB said:
Btrfs is detecting data corruption and to prevent further issues it forced the pool read-only:
Mar  2 19:42:39 Vault kernel: BTRFS error (device nvme0n1p1): block=11098538557440 write time tree block corruption detected
In this case there's a clearly visible bit flip:
Mar  2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=11098538557440 slot=83, unexpected item end, have 2415931228 expect 12124
Hex(12124)= 2F5C

hex(2415931228)=9000 2F5C

Start by running memtest

P.S: also change the docker network to ipvlan

Thanks. I'll get a memtest started on it. I actually changed the docker network over last night after rebooting the server.

Will report back about memtest results.

me217 · March 3

Also, is it possible it's a bad SSD? I have had one of them go unreadable before and all the data was lost. It tested fine after a reformat, but I now run the two SSDs mirrored instead of as two separate caches.

JorgeB · March 3

41 minutes ago, me217 said:

Also, is it possible it's a bad SSD?

Possible yes, but IMHO unlikely that's the problem.

me217 · March 3

I found this topic which seems to be a very similar issue. If the system crashes again, I might need to look at reverting to 6.11 to see if that solves it.

As my system has been operational since before 6.12 was released.

Since the last crash I have enabled IOMMU (I think that's what it's called) in the BIOS and fixed the macvlan issue. I will run the memtest as soon as I can shut the server off for an extended period, as it runs the entire house.

JorgeB · March 3

26 minutes ago, me217 said:

As my system has been operational since before 6.12 was released.

It could be a coincidence if the bit flips are rare, but the bit flip posted above cannot be a software issue.

me217 · March 3

Thanks. The issue isn't very common. It can run for a month or two, then out of the blue it will crash. Or it will crash within a week.

Plus, I forgot to mention: the server runs on a 1500 VA UPS. So should rule out dirty power supply.

JorgeB · March 3

10 minutes ago, me217 said:

So should rule out dirty power supply.

Bit flips are almost always bad RAM.
