Parity Drive Keeps Going Into Error State

padioca · March 16, 2022

Hey there. Been running Unraid for quite a while with far more ups than downs. However, I've run into some bumps in the past week or so. First it started out with problems with a Docker container writing continually to the log, which then filled up my docker.img. I deleted and rebuilt the docker.img, so it seems like that is taken care of. However, when I restarted the server I began to run into issues with a stale configuration. I've had this problem in the past, but have typically been able to fix it by rebooting until the array starts. I used the same method here (not great, I realize) and finally got the array back online. However, once I got the array started I realized my cache drive was no longer online. I then shut down the server, rechecked all the cables, restarted the server and the cache drive was back.

What I'm running into now is that my parity drive went offline last night with an error reading "Parity disk in error state (disk dsbl)". I have shut the server down, rechecked all the cables again, and then brought it back up. Stopped the array, took the parity drive out of the array, started it again, shut it down again, and then put the parity drive back in the array. At this point the parity rebuild process starts but almost immediately fails with errors. I've run SMART on the drive and it comes back fine, but things are clearly not working.

I've attached the diagnostics, happy to provide any other information that could be helpful.

tower-diagnostics-20220316-1424.zip

trurl · March 16, 2022

Disable spindown on parity and run an extended SMART test.

Also, connection problems on cache.

padioca · March 16, 2022

Got it, running the extended SMART now. Can you let me know how you are seeing the connection issue on the cache drive? I looked through the diagnostic files the other day but wasn't sure what I should be looking for or where I should be looking.

trurl · March 17, 2022

Mar 16 11:18:50 Tower kernel: ata2.00: ATA-9: Samsung SSD 850 EVO 250GB, S21NNXAGA17780T, EMT01B6Q, max UDMA/133
Mar 16 11:19:49 Tower emhttpd: import 31 cache device: (sdj) Samsung_SSD_850_EVO_250GB_S21NNXAGA17780T
Mar 16 11:20:08 Tower kernel: ata2.00: exception Emask 0x10 SAct 0xe00 SErr 0x280100 action 0x6 frozen
Mar 16 11:20:08 Tower kernel: ata2.00: irq_stat 0x08000008, interface fatal error
Mar 16 11:20:08 Tower kernel: ata2: SError: { UnrecovData 10B8B BadCRC }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 60/00:48:b8:34:93/04:00:07:00:00/40 tag 9 ncq dma 524288 in
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 60/58:50:b8:38:93/00:00:07:00:00/40 tag 10 ncq dma 45056 in
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 61/08:58:f8:ea:5c/00:00:07:00:00/40 tag 11 ncq dma 4096 out
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2: hard resetting link

from syslog
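For anyone wondering how to find lines like these in their own diagnostics, here is a minimal Python sketch that counts the same symptoms per ATA port. The pattern and the list of symptom strings are taken only from the excerpt above, so other kernels or controllers may word things differently. The function name `count_ata_errors` and the default path `logs/syslog.txt` are assumptions for illustration, so adjust the path to wherever your syslog actually is.

```python
import re
import sys
from collections import Counter

# Matches kernel lines such as "kernel: ata2.00: ..." or "kernel: ata2: ...".
# Continuation lines without an "ataN" prefix (the "res ..." lines) are skipped.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Symptom strings copied from the excerpt above; not an exhaustive list.
SIGNS = ("interface fatal error", "BadCRC", "failed command", "hard resetting link")


def count_ata_errors(lines):
    """Return a dict mapping ATA port (e.g. 'ata2') to a Counter of symptoms."""
    result = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for sign in SIGNS:
            if sign in message:
                result.setdefault(port, Counter())[sign] += 1
    return result


if __name__ == "__main__":
    # Hypothetical default path; pass the real syslog location as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "logs/syslog.txt"
    with open(path, errors="replace") as handle:
        for port, signs in sorted(count_ata_errors(handle).items()):
            print(port, dict(signs))
```

A port that keeps showing BadCRC and link resets usually points at the connection (SATA or power) rather than the disk itself. The earlier "ataN.00: ATA-9: ..." line tells you which disk is on that port.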

padioca · March 17, 2022

Tried running the extended SMART and it got over to 80%. Checked back, thinking it would be done, and it had a message reading "Interrupted (host reset)" and no results from the test. I'm attaching the most recent diagnostics, but it appears as though the SMART reports have stayed the same. I've stopped the array completely and will rerun the extended SMART to see if I can have some results by the morning. Thanks for all your help!

tower-diagnostics-20220316-2152.zip

trurl · March 17, 2022

9 hours ago, padioca said:

"Interrupted (host reset)"

16 hours ago, trurl said:

Disable spindown on parity

Did you?

padioca · March 17, 2022

I thought I did, but I may have done it incorrectly. That being said, I just checked and the extended SMART from overnight finished without error.

trurl · March 17, 2022

Post new diagnostics

padioca · March 17, 2022

Here's the latest and greatest...

tower-diagnostics-20220317-0843.zip

trurl · March 17, 2022

Parity looks OK, but your cache isn't even reporting SMART and lots of errors in syslog for it. Check connections on cache and post new diagnostics.

padioca · March 17, 2022

I just tried to run a SMART short self-test and it immediately stopped, giving me a message reading, "A mandatory SMART command failed:exiting. To continue, add one or more '-T permissive' options.". I have purchased a new motherboard, RAM, CPU, and NVMe drive (intended to replace that cache drive), so if the parity drive looks fine I might just start the process of replacing those parts to see if it helps the issues. If that seems like a bad idea, please let me know.

trurl · March 17, 2022

You don't specifically mention

59 minutes ago, trurl said:

Check connections on cache

Did you? Power and SATA, both ends, including splitters.

padioca · March 17, 2022

Sorry, should have clarified. Yes, I did, both when I was having issues with the cache drive a few days ago as well as yesterday when issues with the parity drive started to happen.

trurl · March 17, 2022

You should disable Docker and VM Manager in Settings. You really should start over with those after you get new cache. You weren't doing them right anyway.

padioca · March 17, 2022

Ok, I can look into that. Can you clarify what I should do differently or point me to some documentation? I don't use the VMs at all, but use Docker for the Arrs and Plex. Seemed to be working well enough until issues with the excessive logging a few weeks ago, and now these issues.

Also, just to be clear, it seems like you are indicating that you wouldn't suggest any further action before replacing the hardware. Is that correct?

Also, thanks again for your help. I'm reasonably good at figuring stuff out, but I'm thankful for people like you that are willing to jump in when I run into a dead end.

trurl · March 17, 2022

Sorry, I was looking at another "tower-diagnostics". I really wish everyone would rename their server from the default "tower".

I can't actually tell where your docker/VM files are since the array isn't started but docker and domain .cfg looks OK and .cfg for appdata, domains, system looks OK too.

So maybe starting over isn't needed. But if you do start without cache those files will be recreated on the array, so best if you disable Docker and VMs as I said until you get cache fixed.

padioca · March 17, 2022

I have most of that stuff on a second cache pool drive, but I will double check to make sure it is moved over before replacing the hardware. Although I just started the array again and the cache is showing as unmountable, so it may be toast.

The server is super old, so I decided to work with stuff that has been produced in the last decade, hence the rebuild. Hopefully this will help clear up some of the issues I'm experiencing with the drives, and who knows, maybe the cache will decide to come back around.

I'll holler back here if I continue to experience issues, thanks again for your help!

padioca · March 21, 2022

Hi there, me again. I changed out the hardware and rebuilt parity on the drive that had been giving me problems, and it appeared to work properly. Went to change out an older drive with a new one that I purchased, and as it was being rebuilt from parity the errors came back on the parity drive and the rebuild failed. I then took the new drive I was trying to install and put that in as a new parity drive. It rebuilt and seems to have worked, but I don't have another drive so I can't test a rebuild process at this point.

So, long story short, I'm back up and running with a new parity drive, but the old drive still returns an extended SMART test with no errors. At this point I'm thinking it might be an issue with a bad SATA cable, so I'm going to track that down and replace it tonight. But wanted to check back and see if there are any other thoughts on what could be going on. Also attaching a new diagnostics report. Thanks!

tower-diagnostics-20220321-0942.zip

trurl · March 21, 2022

2 hours ago, padioca said:

might be an issue with a bad SATA cable

or power cable or splitter or too many disks on one power cable.
