me217 Posted March 2 (edited)

I'll start by saying this is a very random fault, but a very annoying one. It has been hard to diagnose because the machine is usually completely unresponsive by the time I get to it. Tonight, though, I caught it mid-spiral.

Symptoms are as follows. I use my Unraid server for Plex and a Home Assistant VM, plus a few other minor things. The system becomes hard to reach but is still contactable, although slow to respond. I could still watch Plex remotely, although the stream crashed on occasion. Home Assistant was contactable remotely but only able to control local devices, nothing running through any cloud services.

When I got home the server GUI loaded and I could stop all Docker containers and VMs; beforehand, CPU and RAM usage was normal. I connected a monitor and keyboard, logged in locally, and issued a shutdown command, and that's when the machine locked up.

This issue happens randomly, usually when I'm asleep or not home, and by the time I get to it the machine has completely locked up and needs a hard reset to get it going again. I managed to pull diagnostics before shutting it down, so hopefully the issue can be resolved once and for all. Any help would be greatly appreciated, and if I need to supply any more information, let me know.

vault-diagnostics-20240302-2316.zip

Edited March 3 by me217: version update in title
itimpi Posted March 2

The syslog in the diagnostics is the RAM version, which starts afresh every time the system is booted. It could be worth enabling the syslog server (probably with the Mirror to Flash option set) to get a syslog that survives a reboot, so we can see what leads up to a crash. The Mirror to Flash option is the easiest to set up (and if used, the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.
me217 Posted March 2 (Author)

Those diagnostics were pulled from the machine before shutdown, so whatever caused it should be in there.
JorgeB Posted March 3

Btrfs is detecting data corruption, and to prevent further issues it forced the pool read-only:

Mar 2 19:42:39 Vault kernel: BTRFS error (device nvme0n1p1): block=11098538557440 write time tree block corruption detected

In this case there's a clearly visible bit flip:

Mar 2 19:42:39 Vault kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=11098538557440 slot=83, unexpected item end, have 2415931228 expect 12124

have 2415931228 expect 12124

hex(12124) = 2F5C
hex(2415931228) = 9000 2F5C

Start by running memtest.

P.S.: also change the Docker network to ipvlan.
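For anyone who wants to check the hex comparison above themselves, here is a minimal Python sketch. It is only an illustration of the arithmetic: `flipped_bits` is a hypothetical helper name, not part of any Unraid or btrfs tool, and the two numbers are copied from the "have ... expect ..." part of the log line.

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the two values differ."""
    diff = expected ^ observed  # XOR leaves a 1 wherever the values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


expected = 12124        # "expect" value from the kernel log line
observed = 2415931228   # "have" value from the kernel log line

print(f"expected = {expected:#010x}")            # 0x00002f5c
print(f"observed = {observed:#010x}")            # 0x90002f5c
print(f"xor      = {expected ^ observed:#010x}")  # 0x90000000
print("differing bit positions:", flipped_bits(expected, observed))  # [28, 31]
```

The low 16 bits (2F5C) are identical in both values and only the top hex digit differs, which is why the corruption looks like flipped bits rather than a value written by software.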
me217 Posted March 3 (Author)

4 minutes ago, JorgeB said: [...] Start by running memtest P.S: also change the docker network to ipvlan

Thanks. I'll get a memtest started on it. I actually changed the Docker network over last night after rebooting the server. Will report back with the memtest results.
me217 Posted March 3 (Author)

Also, is it possible it's a bad SSD? I have had one of them go unreadable before. All the data was lost, but the drive tested fine after a reformat. I now run the two SSDs mirrored instead of as two separate caches.
JorgeB Posted March 3

41 minutes ago, me217 said: Also is it possible its a bad ssd?

Possible, yes, but IMHO unlikely that's the problem.
me217 Posted March 3 (Author)

I found this topic, which seems to be a very similar issue. If the system crashes again, I might need to look at reverting to 6.11 to see if that solves it, as my system has been operational since before 6.12 was released. Since the last crash I have enabled IOMMU (I think it's called) in the BIOS and fixed the macvlan issue. I will run the memtest as soon as I can shut the server off for an extended period of time, as it runs the entire house.
JorgeB Posted March 3

26 minutes ago, me217 said: As my system has been operational since before 6.12 was released.

It could be a coincidence if the bit flips are rare, but the bit flip posted above cannot be a software issue.
me217 Posted March 3 (Author)

Thanks. The issue isn't very common. The server can run for a month or two and then crash out of the blue, or it will crash within a week. Also, I forgot to mention that the server runs on a 1500 VA UPS, so that should rule out a dirty power supply.
JorgeB Posted March 3

10 minutes ago, me217 said: So should rule out dirty power supply.

Bit flips are almost always bad RAM.