Disk and Docker errors


SoCold

Recommended Posts

Hi, like the title says I've been experiencing errors lately and I'm not really sure what's going on, so I'm hoping people here can help. Currently I'm running with a data disk disabled and most of my Docker containers not functioning.

 

I've been running my server for a few years, with my only issues being things I could search for and resolve myself. A little while ago I had my first real disk issue: I woke up to a disabled disk. I had been meaning to order another drive already, and since I was busy I just ordered a replacement, popped it in, precleared it, rebuilt, and we were back up with parity. Not too long after that I got more disk errors. I posted on Reddit and someone suggested the cable. That made a lot of sense, so I replaced it and was good for a bit, until now. I've been getting more disk errors and my Docker containers have started failing.

 

I'm thinking something is corrupt, or maybe I have multiple issues, but I don't know what. I'm certainly not an expert, but I'm decent at following directions and can provide more info or test anything as needed. Thanks in advance for the help. Fix Common Problems output and diagnostics below.

 

Fix Common Problems:

disk1 (ST8000VN004) is disabled

parity (ST8000VN0022) has read errors

disk1 (ST8000VN004)

/var/log is getting full (currently 100% used)

Unable to write to cache: Drive mounted read-only or completely full.

Unable to write to disk1: Drive mounted read-only or completely full.

Unable to write to Docker Image: Docker image either full or corrupted.

elliot-diagnostics-20201204-0925.zip
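As an aside, the "/var/log is getting full" warning usually means a single runaway log is spamming the small log filesystem. A generic way to see how full it is and which file is the culprit (standard coreutils commands, nothing Unraid-specific):

```shell
# Show how full the log filesystem is (on Unraid, /var/log is a small tmpfs)
df -h /var/log

# List the five largest files under /var/log to spot the runaway logger
du -ah /var/log 2>/dev/null | sort -rh | head -n 5
```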

Link to comment
5 minutes ago, trurl said:

Looks like all disks are disconnected. Check all connections, all disks, power and SATA, both ends, including splitters. Then post new diagnostics

Thanks for the quick reply! My server is also my router/firewall, so I'll be down for a bit while I do that.

 

FYI, my setup is 2 SSDs for cache and 2 hard drives for data and parity, all connected directly to the motherboard's SATA ports.

 

 

Link to comment

Speaking only from personal experience, if all your devices are connected to the same controller, one bad drive can trigger kernel conditions where Unraid kicks it and every other drive out of the array. Sometimes you need a little trial and error with hardware issues like this: pull a suspect drive, replace cables, run the array for a bit, check diagnostics, and so on. If you've been running this setup for years without issue, it's likely a bad drive or cable.
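To check whether the drives really do share one controller, one generic way (a sysfs sketch, not an Unraid-specific tool) is to map each disk to its PCI path:

```shell
# Map each sd* block device to the PCI controller it hangs off of.
# The controller's PCI address (e.g. 0000:01:00.1) shows up in the path.
for disk in /sys/block/sd*; do
    [ -e "$disk" ] || continue      # skip gracefully if no sd* devices exist
    echo "$(basename "$disk") -> $(readlink -f "$disk/device")"
done
```

If every line resolves to the same PCI address, all the drives share one controller and a single misbehaving device can take the rest down with it.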

 

I'm not an Unraid expert, but I believe the disabled drive will stay that way until you decide if and how you want to re-integrate it. I would try to diagnose the problem drive first.

Edited by akshunj
grammar
  • Like 1
Link to comment

I've already replaced the drive and cable that first had errors and got disabled. These are also IronWolf drives that have spent the majority of their life spun down, in a case with a lot of airflow.

 

I think you're right that the drive will stay disabled; I just didn't want to jump the gun. Obviously the #1 concern is not losing data, and I'd like to do this in the proper order and hopefully determine the root cause.

 

Also, I'm not sure if this is a side effect of what's going on or just a beta issue, but on the Main tab of the UI I can no longer interact with my Docker containers, and my VMs are no longer listed.

 

I ran a Fix Common Problems quick scan again just to check, and the only error I have right now is:

 

disk1 (ST8000VN004) is disabled

Link to comment

The rebuild just finished and both drives are now enabled. I was a little scared when I got the notification of 16437514 errors, but I guess that's just because my data disk was the one disabled and I was rebuilding it from parity.

 

image.png.6fb028f50988c8990b2385dc3f1c2b82.png

 

I'll keep an eye on it, but hopefully one of you can take a peek at my diagnostics and let me know if everything looks good now. As always, thank you all for the help!

elliot-diagnostics-20201206-2204.zip

Edited by SoCold
Link to comment

There were read errors on parity during the rebuild:

Dec  6 21:55:38 Elliot kernel: md: disk0 read error, sector=15496560240
Dec  6 21:55:38 Elliot kernel: md: recovery thread: multiple disk errors, sector=15496555488

 

This means there will be some corruption on the rebuilt disk.

 

The problem was the SATA controller:

 

Dec  6 21:54:07 Elliot kernel: ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xff4d0000 flags=0x0000]

 

This is quite common with Ryzen boards; look for a BIOS update, or disable IOMMU if you don't need it.
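To gauge how often the fault fires, you can count that AMD-Vi signature in the syslog. The sample line below is copied from the diagnostics in this thread; on the server you would point the same grep at /var/log/syslog:

```shell
# Sample kernel line taken from the diagnostics in this thread
line='Dec  6 21:54:07 Elliot kernel: ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xff4d0000 flags=0x0000]'

# -F matches the pattern literally, which matters here because it contains [
# On a live server: grep -cF 'AMD-Vi: Event logged [IO_PAGE_FAULT' /var/log/syslog
echo "$line" | grep -F 'AMD-Vi: Event logged [IO_PAGE_FAULT'
```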

 

 

  • Like 1
Link to comment
5 hours ago, JorgeB said:

This is quite common with Ryzen boards; look for a BIOS update, or disable IOMMU if you don't need it.

Yesterday, before doing the rebuild, I updated the BIOS to the latest and greatest from Asus, dated earlier this month. Would I need to downgrade the BIOS to something older, and is it just guess-and-check at this point?

 

Currently I only run one VM, OPNsense, with a NIC passed through, so my understanding is I wouldn't be able to do that if I disabled IOMMU.
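For reference, NIC passthrough does require IOMMU, so disabling it isn't an option here. The usual generic way to inspect how devices are grouped is a sysfs loop like this (a sketch; it assumes lspci from pciutils, with a fallback if it's missing):

```shell
# List every IOMMU group and the PCI devices in it. A device can only be
# passed through cleanly if its group holds nothing the host still needs.
for group in /sys/kernel/iommu_groups/*/; do
    [ -d "$group" ] || continue                 # nothing to do if IOMMU is off
    echo "IOMMU group $(basename "$group"):"
    for dev in "$group"devices/*; do
        [ -e "$dev" ] || continue
        if command -v lspci > /dev/null; then
            lspci -nns "$(basename "$dev")"     # human-readable description
        else
            basename "$dev"                     # fall back to the PCI address
        fi
    done
done
```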

Link to comment
1 hour ago, JorgeB said:

You can wait for a newer one, a newer kernel on next Unraid release might also help, not much else you can do.

Thanks. I was thinking back to how and when the issues started to creep up, and I remember I was having various problems setting up my OPNsense VM; that's how I ended up on the beta. I'm crossing my fingers for that newer kernel.

7 hours ago, JorgeB said:

There were read errors on parity during the rebuild:


Dec  6 21:55:38 Elliot kernel: md: disk0 read error, sector=15496560240
Dec  6 21:55:38 Elliot kernel: md: recovery thread: multiple disk errors, sector=15496555488

 

This means there will be some corruption on the rebuilt disk.

I guess right now I don't trust the server, and I'm just waiting on 6.9 to go stable. I was thinking about a CPU and maybe motherboard upgrade in 2021 to a newer generation of Ryzen, but with my luck I'd have the same issue.

 

Is there anything in particular I can do, or not do, while I wait, to mitigate issues with corrupt or lost data?
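One low-effort mitigation while waiting is to snapshot checksums of the data you care about, so any later corruption can be pinned to specific files. A generic sketch using coreutils (demonstrated on a throwaway directory; on the server you'd point it at a share under /mnt/user instead):

```shell
# Record a checksum for every file under a tree, then verify later.
dir=$(mktemp -d)                         # stand-in for a real share path
echo "sample data" > "$dir/file.txt"

sums=$(mktemp)                           # keep the list outside the tree
find "$dir" -type f -exec sha256sum {} + > "$sums"

# Re-run any time; a changed or unreadable file makes this exit non-zero
sha256sum --quiet -c "$sums" && echo "all files intact"
```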

Link to comment
On 12/4/2020 at 6:00 PM, SoCold said:

Hi, like the title says I've been experiencing errors lately and I'm not really sure what's going on, so I'm hoping people here can help. Currently I'm running with a data disk disabled and most of my Docker containers not functioning.

 

I've been running my server for a few years, with my only issues being things I could search for and resolve myself. A little while ago I had my first real disk issue: I woke up to a disabled disk. I had been meaning to order another drive already, and since I was busy I just ordered a replacement, popped it in, precleared it, rebuilt, and we were back up with parity. Not too long after that I got more disk errors. I posted on Reddit and someone suggested the cable. That made a lot of sense, so I replaced it and was good for a bit, until now. I've been getting more disk errors and my Docker containers have started failing.

 

I'm thinking something is corrupt, or maybe I have multiple issues, but I don't know what. I'm certainly not an expert, but I'm decent at following directions and can provide more info or test anything as needed. Thanks in advance for the help. Fix Common Problems output and diagnostics below.

 

Fix Common Problems:

disk1 (ST8000VN004) is disabled

parity (ST8000VN0022) has read errors

disk1 (ST8000VN004)

/var/log is getting full (currently 100% used)

Unable to write to cache: Drive mounted read-only or completely full.

Unable to write to disk1: Drive mounted read-only or completely full.

Unable to write to Docker Image: Docker image either full or corrupted.

elliot-diagnostics-20201204-0925.zip 135.84 kB · 3 downloads

You've been running this exact same setup for years with no issues, but when you upgraded to the beta of Unraid you started to experience issues? Is that correct? I'm just trying to piece together the timeline of when your issues began.

Link to comment
On 12/9/2020 at 7:01 AM, akshunj said:

You've been running this exact same setup for years with no issues?  But when you upgraded to the new beta unraid you started to experience issues?  Is that correct?  Just trying to piece together the timeline of when your issues began

Hey, thanks for asking.

 

Honestly, I can't say for certain. I know I ordered a replacement drive in July, so that's when I first started having drive errors. I don't remember exactly when I switched over to the beta, but I do remember I did it because I was having issues getting my OPNsense VM up and running. When a drive first got disabled due to errors I swapped it out and just kept going; then, when I had errors a while later, I replaced the cable. I should have come here and started posting diagnostics back then.

 

I've also gone long stretches where I don't really use the server very much, and it's mostly out of sight, out of mind. As an engineer and problem solver myself, I know that's an incredibly unsatisfying answer.

Link to comment

New update, new kernel, and I'm back with new diagnostics.

 

I just did the update, and in my unprofessional opinion I'm optimistic. My dashboard came back, complete with Docker containers and VMs again, and from poking around for a few minutes things look normal.

 

I took a look at my disk stats right before the update; like I said, I intentionally avoided doing anything that would spin the disks up.

image.png.630428ee17b2c575df393e4077e8a17e.png

 

Attached are my diagnostics from right after the update. Fingers crossed, and thanks again for the help.

elliot-diagnostics-20201210-1838.zip

Link to comment

Bump

 

I've been taking it easy on the server, but so far so good; I haven't noticed anything wrong since updating to RC1.

 

I just ran an extended scan with Fix Common Problems with no errors found, and I've attached another diagnostics file for what it's worth.

 

I'm not trying to spam, but I'm hoping someone can give this a once-over and let me know where I stand or what's next.

elliot-diagnostics-20201212-1745.zip

Link to comment
7 hours ago, JorgeB said:

If so far so good there's not much to see.

Thanks Jorge, I'll turn scheduled parity checks back on and start getting back to business as usual.

 

Should I mark this thread as Solved now?

 

Also, I know it looked like I had some corruption when I rebuilt my disk, so would it make sense to run that again?

Edited by SoCold
Link to comment
