Disk and Docker errors


SoCold

Recommended Posts

Hi, like the title says I've been experiencing errors lately and I'm not really sure what's going on, so I'm hoping people here can help. Currently I'm running with a data disk disabled and most of my Docker containers not functioning.

 

I've been running my server for a few years, with my only issues being things I could search for and resolve myself. A little while ago I had my first real disk issue: I woke up to a disabled disk. I had been meaning to order another drive already, and since I was busy I just ordered a replacement, popped it in, precleared it, rebuilt, and we were back up with parity. Not too long after that I got more disk errors. I posted on Reddit and someone suggested the cable. That made a lot of sense, so I replaced it and was good for a bit, until now. I've been getting more disk errors and my Docker containers have started failing.

 

I'm thinking something is corrupt, or maybe I have multiple issues, but I don't know what. I'm certainly not an expert, but I'm decent at following directions and can provide more info or test anything as needed. Thanks in advance for the help. Fix Common Problems output and diagnostics below.

 

Fix Common Problems:

disk1 (ST8000VN004) is disabled

parity (ST8000VN0022) has read errors

disk1 (ST8000VN004)

/var/log is getting full (currently 100% used)

Unable to write to cache: Drive mounted read-only or completely full.

Unable to write to disk1: Drive mounted read-only or completely full.

Unable to write to Docker Image: Docker image either full or corrupted.

elliot-diagnostics-20201204-0925.zip
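As an aside, the "/var/log is getting full" warning usually means a single runaway log is spamming the small log filesystem. A generic way to see how full it is and which file is the culprit (standard coreutils commands, nothing Unraid-specific):

```shell
# Show how full the log filesystem is (on Unraid, /var/log is a small tmpfs)
df -h /var/log

# List the five largest files under /var/log to spot the runaway logger
du -ah /var/log 2>/dev/null | sort -rh | head -n 5
```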

Link to comment
5 minutes ago, trurl said:

Looks like all disks are disconnected. Check all connections, all disks, power and SATA, both ends, including splitters. Then post new diagnostics

Thanks for the quick reply! My server is also my router/firewall, so I'll be down for a bit while I do that.

 

FYI, my setup is 2 SSDs for cache and 2 hard drives for data and parity, all connected directly to the motherboard's SATA ports.

 

 

Link to comment

Speaking only from personal experience, if all your devices are connected to the same controller, one bad drive can trigger kernel conditions where Unraid kicks it and every other drive out of the array. Sometimes you need a little trial and error with hardware issues like this: pull a suspect drive, replace cables, run the array for a bit, check diagnostics, and so on. If you've been running this setup for years without issue, it's likely a bad drive or cable.
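To check whether the drives really do share one controller, one generic way (a sysfs sketch, not an Unraid-specific tool) is to map each disk to its PCI path:

```shell
# Map each sd* block device to the PCI controller it hangs off of.
# The controller's PCI address (e.g. 0000:01:00.1) shows up in the path.
for disk in /sys/block/sd*; do
    [ -e "$disk" ] || continue      # skip gracefully if no sd* devices exist
    echo "$(basename "$disk") -> $(readlink -f "$disk/device")"
done
```

If every line resolves to the same PCI address, all the drives share one controller and a single misbehaving device can take the rest down with it.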

 

I'm not an Unraid expert, but I believe the disabled drive will stay that way until you decide if and how you want to re-integrate it. I would try to diagnose the problem drive first.

Edited by akshunj
grammar
  • Like 1
Link to comment

I've already replaced the drive and cable that first had errors and got disabled. These are also IronWolf drives that have spent the majority of their life spun down, in a case with a lot of airflow.

 

I think you're right that the drive will stay disabled; I just didn't want to jump the gun. Obviously the #1 concern is not losing data, and I'd like to do this in the proper order and hopefully determine the root cause.

 

Also, I'm not sure if this is a side effect of what's going on or just a beta issue, but on the Main tab of the UI I can no longer interact with my Docker containers, and my VMs are no longer listed.

 

I ran a Fix Common Problems quick scan again just to check, and the only error I have right now is:

 

disk1 (ST8000VN004) is disabled

Link to comment

The rebuild just finished and both drives are now enabled. I was a little scared when I got the notification of 16437514 errors, but I guess that's just because my data disk was the one disabled and I was rebuilding it from parity.

 

image.png.6fb028f50988c8990b2385dc3f1c2b82.png

 

I'll keep an eye on it, but hopefully one of you can take a peek at my diagnostics and let me know if everything looks good now. As always, thank you all for the help!

elliot-diagnostics-20201206-2204.zip

Edited by SoCold
Link to comment

There were read errors on parity during the rebuild:

Dec  6 21:55:38 Elliot kernel: md: disk0 read error, sector=15496560240
Dec  6 21:55:38 Elliot kernel: md: recovery thread: multiple disk errors, sector=15496555488

 

This means there will be some corruption on the rebuilt disk.

 

The problem was the SATA controller:

 

Dec  6 21:54:07 Elliot kernel: ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xff4d0000 flags=0x0000]

 

This is quite common with Ryzen boards; look for a BIOS update, or disable IOMMU if you don't need it.
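To gauge how often the fault fires, you can count that AMD-Vi signature in the syslog. The sample line below is copied from the diagnostics in this thread; on the server you would point the same grep at /var/log/syslog:

```shell
# Sample kernel line taken from the diagnostics in this thread
line='Dec  6 21:54:07 Elliot kernel: ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xff4d0000 flags=0x0000]'

# -F matches the pattern literally, which matters here because it contains [
# On a live server: grep -cF 'AMD-Vi: Event logged [IO_PAGE_FAULT' /var/log/syslog
echo "$line" | grep -F 'AMD-Vi: Event logged [IO_PAGE_FAULT'
```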

 

 

  • Like 1
Link to comment
5 hours ago, JorgeB said:

This is quite common with Ryzen boards; look for a BIOS update, or disable IOMMU if you don't need it.

Yesterday, before doing the rebuild, I updated the BIOS to the latest and greatest from Asus, dated earlier this month. Would I need to downgrade the BIOS to something older, and is it just guess-and-check at this point?

 

Currently I only run one VM, OPNsense, with a NIC passed through, so my understanding is I wouldn't be able to do that if I disabled IOMMU.
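For reference, NIC passthrough does require IOMMU, so disabling it isn't an option here. The usual generic way to inspect how devices are grouped is a sysfs loop like this (a sketch; it assumes lspci from pciutils, with a fallback if it's missing):

```shell
# List every IOMMU group and the PCI devices in it. A device can only be
# passed through cleanly if its group holds nothing the host still needs.
for group in /sys/kernel/iommu_groups/*/; do
    [ -d "$group" ] || continue                 # nothing to do if IOMMU is off
    echo "IOMMU group $(basename "$group"):"
    for dev in "$group"devices/*; do
        [ -e "$dev" ] || continue
        if command -v lspci > /dev/null; then
            lspci -nns "$(basename "$dev")"     # human-readable description
        else
            basename "$dev"                     # fall back to the PCI address
        fi
    done
done
```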

Link to comment
1 hour ago, JorgeB said:

You can wait for a newer one, a newer kernel on next Unraid release might also help, not much else you can do.

Thanks. I was thinking back to how and when the issues started to creep up, and I remember I was having various problems setting up my OPNsense VM; that's how I ended up on the beta. I'm crossing my fingers for that newer kernel.

7 hours ago, JorgeB said:

There were read errors on parity during the rebuild:


Dec  6 21:55:38 Elliot kernel: md: disk0 read error, sector=15496560240
Dec  6 21:55:38 Elliot kernel: md: recovery thread: multiple disk errors, sector=15496555488

 

This means there will be some corruption on the rebuilt disk.

I guess right now I don't trust the server, and I'm just waiting on 6.9 to go stable. I was thinking about a CPU and maybe motherboard upgrade in 2021 to a newer generation of Ryzen, but with my luck I'd have the same issue.

 

Is there anything in particular I can do, or not do, while I wait, to mitigate issues with corrupt or lost data?
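One low-effort mitigation while waiting is to snapshot checksums of the data you care about, so any later corruption can be pinned to specific files. A generic sketch using coreutils (demonstrated on a throwaway directory; on the server you'd point it at a share under /mnt/user instead):

```shell
# Record a checksum for every file under a tree, then verify later.
dir=$(mktemp -d)                         # stand-in for a real share path
echo "sample data" > "$dir/file.txt"

sums=$(mktemp)                           # keep the list outside the tree
find "$dir" -type f -exec sha256sum {} + > "$sums"

# Re-run any time; a changed or unreadable file makes this exit non-zero
sha256sum --quiet -c "$sums" && echo "all files intact"
```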

Link to comment
On 12/4/2020 at 6:00 PM, SoCold said:

Hi, like the title says I've been experiencing errors lately and I'm not really sure what's going on, so I'm hoping people here can help. Currently I'm running with a data disk disabled and most of my Docker containers not functioning.

 

I've been running my server for a few years, with my only issues being things I could search for and resolve myself. A little while ago I had my first real disk issue: I woke up to a disabled disk. I had been meaning to order another drive already, and since I was busy I just ordered a replacement, popped it in, precleared it, rebuilt, and we were back up with parity. Not too long after that I got more disk errors. I posted on Reddit and someone suggested the cable. That made a lot of sense, so I replaced it and was good for a bit, until now. I've been getting more disk errors and my Docker containers have started failing.

 

I'm thinking something is corrupt, or maybe I have multiple issues, but I don't know what. I'm certainly not an expert, but I'm decent at following directions and can provide more info or test anything as needed. Thanks in advance for the help. Fix Common Problems output and diagnostics below.

 

Fix Common Problems:

disk1 (ST8000VN004) is disabled

parity (ST8000VN0022) has read errors

disk1 (ST8000VN004)

/var/log is getting full (currently 100% used)

Unable to write to cache: Drive mounted read-only or completely full.

Unable to write to disk1: Drive mounted read-only or completely full.

Unable to write to Docker Image: Docker image either full or corrupted.

elliot-diagnostics-20201204-0925.zip 135.84 kB · 3 downloads

You've been running this exact same setup for years with no issues, but when you upgraded to the beta of Unraid you started to experience issues? Is that correct? I'm just trying to piece together the timeline of when your issues began.

Link to comment
On 12/9/2020 at 7:01 AM, akshunj said:

You've been running this exact same setup for years with no issues?  But when you upgraded to the new beta unraid you started to experience issues?  Is that correct?  Just trying to piece together the timeline of when your issues began

Hey, thanks for asking.

 

Honestly, I can't say for certain. I know I ordered a replacement drive in July, so that's when I first started having drive errors. I don't remember exactly when I switched over to the beta, but I do remember I did it because I was having issues getting my OPNsense VM up and running. When a drive first got disabled due to errors I swapped it out and just kept going; then, when I had errors a while later, I replaced the cable. I should have come here and started posting diagnostics back then.

 

I've also gone long stretches where I don't really use the server very much, and it's mostly out of sight, out of mind. As an engineer and problem solver myself, I know that's an incredibly unsatisfying answer.

Link to comment

New update, new kernel, and I'm back with new diagnostics.

 

I just did the update, and in my unprofessional opinion I'm optimistic. My dashboard came back, complete with Docker containers and VMs again, and from poking around for a few minutes things look normal.

 

I took a look at my disk stats right before the update; like I said, I intentionally avoided doing anything that would spin the disks up.

image.png.630428ee17b2c575df393e4077e8a17e.png

 

Attached are my diagnostics from right after the update. Fingers crossed, and thanks again for the help.

elliot-diagnostics-20201210-1838.zip

Link to comment

Bump

 

I've been taking it easy on the server, but so far so good; I haven't noticed anything wrong since updating to RC1.

 

I just ran an extended scan with Fix Common Problems with no errors found, and I've attached another diagnostics file for what it's worth.

 

I'm not trying to spam, but I'm hoping someone can give this a once-over and let me know where I stand or what's next.

elliot-diagnostics-20201212-1745.zip

Link to comment
7 hours ago, JorgeB said:

If so far so good there's not much to see.

Thanks Jorge, I'll turn scheduled parity checks back on and start getting back to business as usual.

 

Should I mark this thread as Solved now?

 

Also, I know it looked like I had some corruption when I rebuilt my disk, so would it make sense to run that again?

Edited by SoCold
Link to comment
