Random unresponsive issue with Unraid v6.12



I'll start by saying this is a very random fault, but a very annoying one. It has been hard to diagnose because by the time I get to the machine it is usually completely unresponsive. Tonight, though, I caught it mid-spiral.

 

Symptoms are as follows.

I use my Unraid server for Plex and a Home Assistant VM, plus a few other minor things.

The system becomes hard to reach but is still contactable, although it is slow to respond. I could still watch Plex remotely, although the stream did crash on occasion. Home Assistant was reachable remotely but could only control local devices, nothing running through any cloud services.

 

When I got home the server GUI still loaded, and I could stop all Docker containers and VMs. Beforehand, CPU and RAM usage was normal.

 

I connected a monitor and keyboard, logged in locally, and issued a shutdown command. That's when the machine locked up.

 

This issue happens randomly, usually when I'm asleep or not home, and by the time I get to it the server has completely locked up and needs a hard reset to get it going again.

 

I managed to pull diagnostics before shutting it down, so hopefully the issue can be resolved once and for all.

 

Any help would be greatly appreciated, and if I need to supply any more information, let me know.

vault-diagnostics-20240302-2316.zip

Edited by me217
Version update in title
Link to comment

The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted.  It could be worth enabling the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash.  The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.  
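Once a syslog that survives a reboot exists, the lines leading up to a crash can be filtered out of it. The following is only a minimal sketch: the keyword list, the function names and the file path passed to `scan_file` are assumptions for illustration, not anything Unraid provides.

```python
import re

# Assumed keywords: kernel messages that mention an error, a critical
# event or corruption. Adjust the pattern to whatever you are hunting for.
PATTERN = re.compile(r"kernel: .*\b(error|critical|corrupt\w*)\b", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the kernel log lines that match PATTERN, without newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_file(path):
    """Read a saved syslog file (path is supplied by the caller) and filter it."""
    with open(path, errors="replace") as fh:
        return suspicious_lines(fh)
```

Calling `scan_file` with the location of the saved syslog returns only the matching kernel lines, which is usually a shorter list to read through than the whole log.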

Link to comment

Btrfs is detecting data corruption and to prevent further issues it forced the pool read-only:

 

Mar  2 19:42:39 Vault kernel: BTRFS error (device nvme0n1p1): block=11098538557440 write time tree block corruption detected

 

In this case there's a clearly visible bit flip:

 

Mar  2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=11098538557440 slot=83, unexpected item end, have 2415931228 expect 12124

 

Quote

have 2415931228 expect 12124

 

hex(12124) = 2F5C

hex(2415931228) = 9000 2F5C
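The hex comparison above can be reproduced with a short script. This is just a sketch of the arithmetic: it pulls the two numbers out of the log line quoted earlier and XORs them to show which bits differ. The function name `flipped_bits` is made up for illustration.

```python
import re

# Log line copied from the diagnostics quoted in this thread.
LINE = (
    "Mar  2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): "
    "corrupt leaf: root=2 block=11098538557440 slot=83, "
    "unexpected item end, have 2415931228 expect 12124"
)


def flipped_bits(line):
    """Return (have, expect, xor, number of differing bits), or None if no match."""
    m = re.search(r"have (\d+) expect (\d+)", line)
    if m is None:
        return None
    have, expect = int(m.group(1)), int(m.group(2))
    diff = have ^ expect  # bits set here differ between the two values
    return have, expect, diff, bin(diff).count("1")


have, expect, diff, nbits = flipped_bits(LINE)
print(f"expect = 0x{expect:08X}")
print(f"have   = 0x{have:08X}")
print(f"xor    = 0x{diff:08X} ({nbits} differing bit(s))")
```

The XOR comes out as 0x90000000: the low 16 bits (2F5C) are identical and the difference is confined to two high bits, which is the pattern you would expect from corruption in memory rather than from a completely wrong value being written.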

 

Start by running memtest

 

P.S: also change the docker network to ipvlan

 

Link to comment
4 minutes ago, JorgeB said:

Start by running memtest

 

P.S: also change the docker network to ipvlan

 

 

 

Thanks. I'll get a memtest started on it. I actually changed the docker network over last night after rebooting the server. 

 

Will report back about memtest results. 

Link to comment

Also, is it possible it's a bad SSD? I have had one of them go unreadable before and all the data was lost, but it tested fine after a reformat. I now run the two SSDs mirrored instead of as two separate caches.

Link to comment
  • me217 changed the title to Random unresponsive issue with unraid V6.12

I found this topic, which seems to be a very similar issue. If the system crashes again, I might need to look at reverting to 6.11 to see if that solves it, as my system has been operational since before 6.12 was released.

 

Since the last crash I have enabled IOMMU (I think that's what it's called) in the BIOS and fixed the macvlan issue. I will run memtest as soon as I can shut the server off for an extended period of time, as it runs the entire house.

 

 

Link to comment

Thanks. The issue isn't very common. The server can run for a month or two, then crash out of the blue. Or it will crash within a week.

 

Plus, I forgot to mention: the server runs on a 1500 VA UPS, so that should rule out a dirty power supply.

Link to comment
