chand1012 Posted January 8, 2023

I had a problem with my Samsung SSD cache drive going read-only. I was a bit suspicious, as the drive is only 6 months old, but since its contents weren't super important I just copied them to the hard drives and disabled it for the time being, as we needed to get the NAS up and running so we could get work done. After a few hours, the other SSD, which we use for our VMs, also went read-only. The only solution is to restart the NAS, and the last time I restarted the NAS the filesystem on my cache SSD disappeared, and I fear that this could happen to my VM SSD as well. Doing a full backup of the NAS (while possible) will be quite difficult, as there is 18TB of content on the NAS and we don't have the best upload speeds, nor do we have 18TB of storage on hand at the moment (we are working on setting up a 3-2-1 backup solution, but the NAS is fairly new).

Another strange thing is that when Unraid decides to go "read only" it won't let me read the files either; it spits out the same error as before. Attached are the system logs that could be relevant.

I'm also having some issues with the VMs once they're up and running; however, I'm not sure if it could be related. Whenever I get a Linux VM up and running, I can't seem to install certain Python packages via pip without them getting corrupted halfway through installation. Specifically, it happens with larger libraries, like PyTorch; however, I'm not sure if it's related. I assume it could be due to my issues being filesystem related.

Attachments: hypercloud-syslog-20230108-1556.zip, hypercloud-diagnostics-20230108-1102.zip
JorgeB Posted January 9, 2023

Jan 8 07:48:50 HyperCloud kernel: BTRFS error (device nvme1n1p1): block=8617803776 write time tree block corruption detected

This means data corruption was detected during a write, which is usually caused by bad RAM. Start by running memtest.
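To see how widespread messages like the one quoted above are in a syslog, the lines can be tallied per device. The following is only a minimal sketch: the regular expression is an assumption modeled on the single log line in this thread, the function name `count_btrfs_errors` is hypothetical, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Assumed pattern, modeled on the one kernel line quoted in this thread.
# It captures the device name that follows "BTRFS error (device ".
BTRFS_ERROR = re.compile(r"BTRFS error \(device (?P<device>[^):\s]+)[^)]*\)")


def count_btrfs_errors(lines):
    """Return a Counter mapping device name to number of BTRFS error lines."""
    counts = Counter()
    for line in lines:
        match = BTRFS_ERROR.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


# The sample is the log line from the post above, plus one unrelated
# made-up line to show that non-matching lines are ignored.
sample = [
    "Jan 8 07:48:50 HyperCloud kernel: BTRFS error (device nvme1n1p1): "
    "block=8617803776 write time tree block corruption detected",
    "Jan 8 07:48:51 HyperCloud kernel: some unrelated message",
]

for device, total in count_btrfs_errors(sample).items():
    print(f"{device}: {total} BTRFS error line(s)")
```

Because an open text file is iterable line by line, the same function can be given a file object for an extracted syslog. A count like this only shows which devices are reporting errors; it does not replace the memtest run suggested above.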
chand1012 Posted January 9, 2023

Okay, there's something seriously wrong with my hardware. I let it get to about 200k errors before I pulled the plug. I'm going to first remove all the RAM and run memtest one stick at a time; hopefully this is as simple as some loose RAM. If that fails, then we'll try reseating the CPU. If that doesn't work, do I have any options other than some form of RMA?
trurl Posted January 9, 2023

chand1012 said: "let it get to about 200k errors"

More than zero is too many.

You don't even want to attempt to run any computer unless the RAM is working perfectly. Everything goes through RAM: the OS and other executable code, your data, everything. The CPU can't do anything with anything until it is loaded into RAM.
trurl Posted January 9, 2023

Are you overclocking? Don't
chand1012 Posted January 9, 2023

trurl said: "more than zero is too many"

I am aware of this. I was just hoping that I got lucky, that only one of my sticks was bad, and that the first one tested happened to be the bad one, but obviously that's not the case. I assume this is probably related to a badly seated CPU; however, I'm going to try reseating the RAM first (and testing each stick individually), as that requires less effort on my part initially.

trurl said: "Are you overclocking? Don't"

The CPU is at stock frequency.
trurl Posted January 9, 2023

chand1012 said: "stock frequency"

It's not always clear what that means. Did you check the link?