joelkolb Posted August 21, 2021
I have a Windows 10 VM running on my Unraid server. I was running several tasks at the same time from this VM that were reading from and writing to shares on the array. All of a sudden Windows became unresponsive. I couldn't get Task Manager to open or even get the OS to shut down gracefully. Ultimately I force-stopped the VM from Unraid. When I tried to start it again I got a message saying 'Execution error' and 'read only file system' referring to the path to the VM's vdisk file (I wish I had copied the exact wording or grabbed a screenshot). I tried restarting the array, waited several minutes, and then tried to restart the server altogether. The whole time the message at the bottom of the browser window has said "Array Stopping • Unmounting disks...". It's been like this for about 2 hours now. I'm inclined to force shutdown the box but I don't want to break anything either. Does anyone have an idea of what's going on and how to fix/recover from this, or what I should do next? Diagnostic zip attached. Thanks!
kolbnet-nas1-diagnostics-20210821-1730.zip
trurl Posted August 21, 2021
Corruption on cache:
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure
Also:
Aug 21 15:35:51 KolbNET-NAS1 kernel: general protection fault, probably for non-canonical address 0xff682b580004d300: 0000 [#1] SMP NOPTI
Have you done memtest?
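For anyone following along and wanting to check their own box for the same symptom, a minimal sketch (Unraid keeps the live log at /var/log/syslog; the grep pattern and the /mnt/cache mount point are just the values relevant to this thread, adjust as needed):

# look for btrfs write/corruption errors and the forced read-only event
grep -iE 'btrfs.*(error|corruption|forced readonly)' /var/log/syslog

# per-device error counters for the cache pool (non-zero values persist until reset)
btrfs device stats /mnt/cache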
joelkolb Posted August 21, 2021
@trurl no, it never occurred to me to run memtest. But to do that I would have to shut the server down, and it's been stuck trying to unmount the disks for the past 3 hours. Is there any way to get the server to shut down gracefully or should I force it?
trurl Posted August 21, 2021
From the diagnostics it doesn't look like any disk is still mounted, so it's not clear what is holding it up.
joelkolb Posted August 21, 2021
Thanks. I'll force shutdown, run memtest, and follow up with the results.
joelkolb Posted August 23, 2021
@trurl I've completed 4 cycles of Memtest and I'm 50% through the 5th cycle. I've heard that at least 8 cycles are recommended to be sure of anything, but I'm going out on a limb and saying I don't think there is a problem with my RAM. Should I continue running Memtest? If the RAM isn't the problem, what should I do next?
trurl Posted August 23, 2021
Post new diagnostics with the array started.
joelkolb Posted August 23, 2021
@trurl here is the new diagnostics zip.
kolbnet-nas1-diagnostics-20210823-1041.zip
trurl Posted August 23, 2021
Those look good so far, but earlier:
On 8/21/2021 at 6:41 PM, trurl said:
Corruption on cache:
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure
Try this:
JorgeB Posted August 23, 2021
16 minutes ago, trurl said:
write time tree block corruption detected
This is usually the result of bad RAM or some other kernel memory corruption issue.
joelkolb Posted August 23, 2021
@trurl I ran "btrfs dev stats /mnt/cache" and it came back all zeros. I ran a scrub on the cache pool and it completed with 4 uncorrectable errors. Then I ran "btrfs dev stats /mnt/cache" again and it came back with this:
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 2
[/dev/nvme1n1p1].generation_errs 0
@JorgeB trurl suggested it was a RAM issue, but I ran multiple passes of Memtest and they came back clean. What should I try next?
JorgeB Posted August 23, 2021
It's still likely a RAM or other hardware issue. Try with just two DIMMs and test both pairs alone, since that's an easy test to do, and see if it is any better.
joelkolb Posted August 23, 2021
@JorgeB running Memtest on the first two DIMMs now. What about the uncorrectable errors on the cache pool?
JorgeB Posted August 23, 2021
Just now, joelkolb said:
What about the uncorrectable errors on the cache pool?
Run a scrub and look at the syslog for the names of the corrupt files, then delete/restore from backup.
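A minimal command sketch of that scrub-and-locate workflow (assuming the pool is mounted at /mnt/cache as elsewhere in the thread; the exact syslog wording varies between kernel versions, so the grep pattern is only an example):

btrfs scrub start /mnt/cache               # start the scrub in the background
btrfs scrub status /mnt/cache              # progress plus a summary of correctable/uncorrectable errors
grep -i 'checksum error' /var/log/syslog   # uncorrectable files are typically reported here with a (path: ...) suffix

Files listed there can then be deleted or restored from backup, as suggested above.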
joelkolb Posted August 24, 2021
@JorgeB I ran several passes of Memtest on 2 DIMMs and got no errors. I swapped those with the other 2 DIMMs, ran several passes of Memtest on those, and also got no errors. If it's not the RAM, what else should I check?
JorgeB Posted August 25, 2021
Did you try running the server with only 2 DIMMs, as suggested?
joelkolb Posted August 25, 2021
@JorgeB you suggested:
On 8/23/2021 at 11:41 AM, JorgeB said:
It's still likely a RAM or other hardware issue. Try with just two DIMMs and test both pairs alone, since that's an easy test to do, and see if it is any better.
I guess the question is what did you mean by "test"?
16 hours ago, joelkolb said:
@JorgeB I ran several passes of Memtest on 2 DIMMs and got no errors. I swapped those with the other 2 DIMMs, ran several passes of Memtest on those, and also got no errors. If it's not the RAM, what else should I check?
So I did test (with Memtest) just 2 DIMMs and then I tested the other 2 DIMMs, and both pairs tested with no errors.
6 hours ago, JorgeB said:
Did you try running the server with only 2 DIMMs, as suggested?
So when you say "test" and "running the server", are you talking about running Memtest with 2 DIMMs or running Unraid with 2 DIMMs?
JorgeB Posted August 25, 2021
8 minutes ago, joelkolb said:
or running Unraid with 2 DIMMs?
This.
joelkolb Posted August 25, 2021
@JorgeB OK. Once I'm in Unraid with 2 DIMMs, is there something I should do to test, or am I just waiting to see if I have any more problems?
JorgeB Posted August 25, 2021
13 minutes ago, joelkolb said:
is there something I should do to test, or am I just waiting to see if I have any more problems?
Clear the pool errors, then work normally to see if the issues continue; if they do, repeat with just the other two DIMMs.
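A minimal sketch of the "clear the pool errors" step (again assuming /mnt/cache; -z resets the btrfs per-device counters, while the scrub's uncorrectable count only goes away once the corrupt files have been deleted or restored and a fresh scrub comes back clean):

btrfs device stats -z /mnt/cache    # print the current counters, then reset them to zero
btrfs scrub start /mnt/cache        # re-run the scrub afterwards to confirm it now finishes with no errors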
joelkolb Posted August 25, 2021
@JorgeB unfortunately, since I thought you were talking about Memtest, after everything passed I put all 4 DIMMs back in and cleared the pool errors by deleting the corrupt files as you suggested earlier. No errors are detected after a scrub, and "btrfs dev stats /mnt/cache" returns all zeros. Would running with 2 DIMMs still be a valid test at this point, or has the opportunity been lost?
JorgeB Posted August 25, 2021
14 minutes ago, joelkolb said:
Would running with 2 DIMMs still be a valid test at this point, or has the opportunity been lost?
Yes, just:
35 minutes ago, JorgeB said:
work normally to see if the issues continue; if they do, repeat with just the other two DIMMs.