joelkolb Posted August 21, 2021
I have a Windows 10 VM running on my Unraid server. I was running several tasks at the same time from this VM that were reading from and writing to shares on the array. All of a sudden Windows became unresponsive. I couldn't get Task Manager to open or even get the OS to shut down gracefully. Ultimately I force-stopped the VM from Unraid. When I tried to start it again I got a message saying 'Execution error' and 'read only file system' referring to the path to the VM's vdisk file (I wish I had copied the exact wording or grabbed a screenshot). I tried restarting the array, waited several minutes, and then tried to restart the server altogether. The whole time the message at the bottom of the browser window has said "Array Stopping • Unmounting disks...". It's been like this for about 2 hours now. I'm inclined to force shutdown the box but I don't want to break anything either. Does anyone have an idea of what's going on and how to fix/recover from this, or what I should do next? Diagnostic zip attached. Thanks!
kolbnet-nas1-diagnostics-20210821-1730.zip
trurl Posted August 21, 2021
Corruption on cache:
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure
Also:
Aug 21 15:35:51 KolbNET-NAS1 kernel: general protection fault, probably for non-canonical address 0xff682b580004d300: 0000 [#1] SMP NOPTI
Have you done memtest?
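For anyone following along and wanting to check their own box for the same symptom, a minimal sketch (Unraid keeps the live log at /var/log/syslog; the grep pattern and the /mnt/cache mount point are just the values relevant to this thread, adjust as needed):

# look for btrfs write/corruption errors and the forced read-only event
grep -iE 'btrfs.*(error|corruption|forced readonly)' /var/log/syslog

# per-device error counters for the cache pool (non-zero values persist until reset)
btrfs device stats /mnt/cache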
joelkolb Posted August 21, 2021
@trurl no, it never occurred to me to run memtest. But to do that I would have to shut the server down, and it's been stuck trying to unmount the disks for the past 3 hours. Is there any way to get the server to shut down gracefully or should I force it?
trurl Posted August 21, 2021
From the diagnostics it doesn't look like any disk is still mounted, so it's not clear what is holding it up.
joelkolb Posted August 21, 2021
Thanks. I'll force shutdown, run memtest, and follow up with the results.
joelkolb Posted August 23, 2021
@trurl I've completed 4 cycles of Memtest and I'm 50% through the 5th cycle. I've heard that at least 8 cycles are recommended to be sure of anything, but I'm going out on a limb and saying I don't think there is a problem with my RAM. Should I continue running Memtest? If the RAM isn't the problem, what should I do next?
trurl Posted August 23, 2021
Post new diagnostics with the array started.
joelkolb Posted August 23, 2021
@trurl here is the new diagnostics zip.
kolbnet-nas1-diagnostics-20210823-1041.zip
trurl Posted August 23, 2021
Those look good so far, but earlier:
On 8/21/2021 at 6:41 PM, trurl said:
Corruption on cache:
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure
Try this:
JorgeB Posted August 23, 2021
16 minutes ago, trurl said:
write time tree block corruption detected
This is usually the result of bad RAM or some other kernel memory corruption issue.
joelkolb Posted August 23, 2021
@trurl I ran "btrfs dev stats /mnt/cache" and it came back all zeros. I ran a scrub on the cache pool and it completed with 4 uncorrectable errors. Then I ran "btrfs dev stats /mnt/cache" again and it came back with this:
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 2
[/dev/nvme1n1p1].generation_errs 0
@JorgeB trurl suggested it was a RAM issue, but I ran multiple passes of Memtest and they came back clean. What should I try next?
JorgeB Posted August 23, 2021
It's still likely a RAM or other hardware issue. Try with just two DIMMs and test both pairs alone, since that's an easy test to do, and see if it is any better.
joelkolb Posted August 23, 2021
@JorgeB running Memtest on the first two DIMMs now. What about the uncorrectable errors on the cache pool?
JorgeB Posted August 23, 2021
Just now, joelkolb said:
What about the uncorrectable errors on the cache pool?
Run a scrub and look at the syslog for the names of the corrupt files, then delete/restore from backup.
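A minimal command sketch of that scrub-and-locate workflow (assuming the pool is mounted at /mnt/cache as elsewhere in the thread; the exact syslog wording varies between kernel versions, so the grep pattern is only an example):

btrfs scrub start /mnt/cache               # start the scrub in the background
btrfs scrub status /mnt/cache              # progress plus a summary of correctable/uncorrectable errors
grep -i 'checksum error' /var/log/syslog   # uncorrectable files are typically reported here with a (path: ...) suffix

Files listed there can then be deleted or restored from backup, as suggested above.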
joelkolb Posted August 24, 2021
@JorgeB I ran several passes of Memtest on 2 DIMMs and got no errors. I swapped those with the other 2 DIMMs, ran several passes of Memtest on those, and also got no errors. If it's not the RAM, what else should I check?
JorgeB Posted August 25, 2021
Did you try running the server with only 2 DIMMs, as suggested?
joelkolb Posted August 25, 2021
@JorgeB you suggested:
On 8/23/2021 at 11:41 AM, JorgeB said:
It's still likely a RAM or other hardware issue. Try with just two DIMMs and test both pairs alone, since that's an easy test to do, and see if it is any better.
I guess the question is what did you mean by "test"?
16 hours ago, joelkolb said:
@JorgeB I ran several passes of Memtest on 2 DIMMs and got no errors. I swapped those with the other 2 DIMMs, ran several passes of Memtest on those, and also got no errors. If it's not the RAM, what else should I check?
So I did test (with Memtest) just 2 DIMMs and then I tested the other 2 DIMMs, and both pairs tested with no errors.
6 hours ago, JorgeB said:
Did you try running the server with only 2 DIMMs, as suggested?
So when you say "test" and "running the server", are you talking about running Memtest with 2 DIMMs or running Unraid with 2 DIMMs?
JorgeB Posted August 25, 2021
8 minutes ago, joelkolb said:
or running Unraid with 2 DIMMs?
This.
joelkolb Posted August 25, 2021
@JorgeB OK. Once I'm in Unraid with 2 DIMMs, is there something I should do to test, or am I just waiting to see if I have any more problems?
JorgeB Posted August 25, 2021
13 minutes ago, joelkolb said:
is there something I should do to test, or am I just waiting to see if I have any more problems?
Clear the pool errors, then work normally to see if the issues continue; if they do, repeat with just the other two DIMMs.
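A minimal sketch of the "clear the pool errors" step (again assuming /mnt/cache; -z resets the btrfs per-device counters, while the scrub's uncorrectable count only goes away once the corrupt files have been deleted or restored and a fresh scrub comes back clean):

btrfs device stats -z /mnt/cache    # print the current counters, then reset them to zero
btrfs scrub start /mnt/cache        # re-run the scrub afterwards to confirm it now finishes with no errors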
joelkolb Posted August 25, 2021
@JorgeB unfortunately, since I thought you were talking about Memtest, after everything passed I put all 4 DIMMs back in and cleared the pool errors by deleting the corrupt files as you suggested earlier. No errors are detected after a scrub, and "btrfs dev stats /mnt/cache" returns all zeros. Would running with 2 DIMMs still be a valid test at this point, or has the opportunity been lost?
JorgeB Posted August 25, 2021
14 minutes ago, joelkolb said:
Would running with 2 DIMMs still be a valid test at this point, or has the opportunity been lost?
Yes, just:
35 minutes ago, JorgeB said:
work normally to see if the issues continue; if they do, repeat with just the other two DIMMs.