"Array Stopping • Unmounting disks..." for 2 hours and counting after VM lockup


Recommended Posts

I have a Windows 10 VM running on my Unraid server. I was running several tasks at the same time from this VM that were reading from and writing to shares on the array. All of a sudden Windows became unresponsive: I couldn't get Task Manager to open or even get the OS to shut down gracefully. Ultimately I force-stopped the VM from Unraid. When I tried to start it again I got a message saying 'Execution error' and 'read only file system' referring to the path to the VM's vdisk file (I wish I had copied the exact wording or grabbed a screenshot). I tried restarting the array and waited several minutes before trying to restart the server altogether. The whole time the message at the bottom of the browser window has said "Array Stopping • Unmounting disks...". It's been like this for about 2 hours now. I'm inclined to force-shutdown the box, but I don't want to break anything either. Does anyone have an idea of what's going on, how to fix/recover from this, or what I should do next? Diagnostic zip attached. Thanks!

kolbnet-nas1-diagnostics-20210821-1730.zip

Edited by joelkolb
Link to comment

Corruption on cache

Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure
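Errors like these can be pulled out of the syslog with a quick filter. A minimal sketch below; the sample line is copied from the log above so the snippet is self-contained, and the pattern is just an illustration:

```shell
# Sketch: filter kernel BTRFS errors out of a syslog.
# On the server itself you would run something like:
#   grep -iE 'BTRFS.*(error|corrupt|readonly)' /var/log/syslog
# Here one sample line from the log above stands in for the real file.
log='Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly'
matches=$(echo "$log" | grep -icE 'BTRFS.*(error|corrupt|readonly)')
echo "matching lines: $matches"
```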

 

Also

Aug 21 15:35:51 KolbNET-NAS1 kernel: general protection fault, probably for non-canonical address 0xff682b580004d300: 0000 [#1] SMP NOPTI

Have you done memtest?


@trurl I've completed 4 cycles of Memtest and I'm 50% through the 5th. I've heard that at least 8 cycles are recommended to be sure of anything, but I'm going out on a limb and saying I don't think there's a problem with my RAM. Should I continue running Memtest? If the RAM isn't the problem, what should I do next?


Those look good so far, but earlier

On 8/21/2021 at 6:41 PM, trurl said:

Corruption on cache

Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS error (device nvme0n1p1): block=8501867921408 write time tree block corruption detected
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 21 15:15:48 KolbNET-NAS1 kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1942: errno=-5 IO failure

 

 

Try this:

 


@trurl I ran "btrfs dev stats /mnt/cache" and it came back all zeros. I ran a scrub on the cache pool and it completed with 4 uncorrectable errors. Then I ran "btrfs dev stats /mnt/cache" again and it came back with this:

 

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  2
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  2
[/dev/nvme1n1p1].generation_errs  0
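For anyone following along, the per-device counters can be totaled in one line. A sketch using the output above as sample input; on the live system you would pipe `btrfs dev stats /mnt/cache` in instead:

```shell
# Sum the error counters from `btrfs dev stats` output; a nonzero total
# means the pool has recorded errors. Two lines of the output pasted
# above stand in for the live command here.
stats='[/dev/nvme0n1p1].corruption_errs  2
[/dev/nvme1n1p1].corruption_errs  2'
total=$(echo "$stats" | awk '{sum += $NF} END {print sum}')
echo "total errors: $total"
```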

 

@JorgeB trurl suggested it was a RAM issue, but I ran multiple passes of Memtest and they came back clean.

 

What should I try next?


@JorgeB you suggested:

 

On 8/23/2021 at 11:41 AM, JorgeB said:

It's still likely a RAM or other hardware issue, try with just two DIMMs, test both pairs alone, since that's an easy test to do and see if it is any better.

 

I guess the question is what did you mean by "test"?

 

16 hours ago, joelkolb said:

@JorgeB I ran several passes of Memtest on 2 DIMMs and got no errors. I swapped those with the other 2 DIMMs, ran several passes of Memtest on those, and also got no errors. If it's not the RAM, what else should I check?

 

So I did test (with Memtest) just 2 DIMMs and then I tested the other 2 DIMMs and both pairs tested with no errors.

 

6 hours ago, JorgeB said:

Did you try running the server with only 2 DIMMs like suggested?

 

So when you say "test" and "running the server" are you talking about running Memtest with 2 DIMMs or running UNRAID with 2 DIMMs?


@JorgeB unfortunately, since I thought you were talking about Memtest, after everything passed I put all 4 DIMMs back in and cleared the pool errors by deleting the corrupt files as you suggested earlier. No errors are detected after scrub and "btrfs dev stats /mnt/cache" returns all zeros. Would running with 2 DIMMs still be a valid test at this point or has the opportunity been lost?
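Since the corrupt files are gone, a follow-up check could be resetting the counters and scrubbing once more so any new errors stand out. A sketch only, to be run on the server itself, using the same /mnt/cache path as above:

```shell
# Sketch: verify-after-cleanup sequence for the cache pool.
btrfs dev stats -z /mnt/cache      # print counters, then reset them to zero
btrfs scrub start -B /mnt/cache    # -B: run scrub in foreground until done
btrfs dev stats /mnt/cache         # should still report all zeros afterwards
```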

