BTRFS Cache Drives Randomly Go "Read Only"



I had a problem with my Samsung SSD cache drive going read only. I was a bit suspicious, as the drive is only 6 months old, but since its contents weren't super important I just copied them to the hard drives and disabled it for the time being, as we needed to get the NAS up and running so we could get work done. After a few hours, the other SSD, which we use for our VMs, also went read only. The only solution is to restart the NAS, and the last time I restarted it the filesystem on my cache SSD disappeared. I fear that this could happen to my VM SSD as well. Doing a full backup of the NAS (while possible) will be quite difficult, as there is 18TB of content on the NAS and we don't have the best upload speeds, nor do we have 18TB of storage on hand at the moment (we are working on setting up a 3-2-1 backup solution, but the NAS is fairly new). Another strange thing is that when Unraid decides to go "read only" it won't let me read the files either; it spits out the same error as before. Attached are the system logs that could be relevant.
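For anyone who wants to confirm which pools have actually flipped to read-only, here is a minimal Python sketch that scans mount-table text in the `/proc/mounts` format for btrfs filesystems carrying the `ro` option. The function name and the sample devices and mount points are made up for illustration; they are not taken from the attached logs.

```python
def read_only_mounts(mounts_text, fstype="btrfs"):
    """Return mount points of the given filesystem type mounted with 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if fs == fstype and "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample data, not real output from this server.
SAMPLE = (
    "/dev/sdb1 /mnt/cache btrfs ro,noatime 0 0\n"
    "/dev/sdc1 /mnt/vms btrfs rw,noatime 0 0\n"
)
print(read_only_mounts(SAMPLE))
```

On a live system the same function could be fed the contents of `/proc/mounts` instead of the sample string.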

 

I'm also having some issues with the VMs once they're up and running, though I'm not sure whether they're related. Whenever I get a Linux VM up and running, I can't seem to install certain Python packages via pip without them getting corrupted halfway through installation. It happens specifically with larger libraries, like PyTorch. I assume it could be because my issues are filesystem related.
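One rough way to check whether file data is being corrupted somewhere between disk and memory is to hash the same file several times and see whether the digests agree. Below is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical. Matching digests do not prove the hardware is healthy, since repeated reads may be served from the same cached pages, but a mismatch on an unchanged file is a strong hint that something below the filesystem is wrong.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashes_agree(path, rounds=3):
    """Hash the same file several times; True if every digest is identical."""
    return len({sha256_of_file(path) for _ in range(rounds)}) == 1
```

For example, `hashes_agree("some-downloaded-package.whl")` (a placeholder filename) could be run on a large download inside the VM before installing it.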

hypercloud-syslog-20230108-1556.zip hypercloud-diagnostics-20230108-1102.zip


Okay, there's something seriously wrong with my hardware.

[attached image]

I let it get to about 200k errors before I pulled the plug. I'm first going to remove all the RAM and run memtest one stick at a time; hopefully this is as simple as some loose RAM. If that fails, then we'll try reseating the CPU. If that doesn't work, do I have any options other than some form of RMA?

Just now, chand1012 said:

let it get to about 200k errors

more than zero is too many

 

You don't even want to attempt to run any computer unless RAM is working perfectly. Everything goes through RAM, the OS and other executable code, your data, everything. The CPU can't do anything with anything until it is loaded into RAM.


  

Just now, trurl said:

more than zero is too many

I am aware of this. I was just hoping that I'd gotten lucky, that only one of my sticks was bad and the first one tested happened to be the bad one, but evidently not. I assume this is probably related to a badly seated CPU; however, I'm going to try reseating the RAM first (and testing each stick individually), as that requires less effort on my part initially.
 

 

2 minutes ago, trurl said:

Are you overclocking? Don't

The CPU is at stock frequency.

