Parity Drive Keeps Going Into Error State


Hey there. I've been running Unraid for quite a while with far more ups than downs. However, I've run into some bumps in the past week or so. It started with a Docker container writing continually to its log, which then filled up my docker.img. I deleted and rebuilt the docker.img, so that seems to be taken care of. However, when I restarted the server I began to run into issues with a stale configuration. I've had this problem in the past, but have typically been able to fix it by rebooting until the array starts. I used the same method here (not great, I realize) and finally got the array back online. However, once the array was started I realized my cache drive was no longer online. I then shut down the server, rechecked all the cables, restarted the server, and the cache drive was back.

What I'm running into now is that my parity drive went offline last night with an error reading "Parity disk in error state (disk dsbl)". I shut the server down, rechecked all the cables again, and brought it back up. I then stopped the array, took the parity drive out of the array, started it again, shut it down again, and put the parity drive back in the array. At this point the parity rebuild process starts but almost immediately fails with errors. I've run SMART on the drive and it comes back fine, but things are clearly not working.

I've attached the diagnostics, happy to provide any other information that could be helpful.

tower-diagnostics-20220316-1424.zip

Link to comment
Mar 16 11:18:50 Tower kernel: ata2.00: ATA-9: Samsung SSD 850 EVO 250GB, S21NNXAGA17780T, EMT01B6Q, max UDMA/133
Mar 16 11:19:49 Tower emhttpd: import 31 cache device: (sdj) Samsung_SSD_850_EVO_250GB_S21NNXAGA17780T
Mar 16 11:20:08 Tower kernel: ata2.00: exception Emask 0x10 SAct 0xe00 SErr 0x280100 action 0x6 frozen
Mar 16 11:20:08 Tower kernel: ata2.00: irq_stat 0x08000008, interface fatal error
Mar 16 11:20:08 Tower kernel: ata2: SError: { UnrecovData 10B8B BadCRC }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 60/00:48:b8:34:93/04:00:07:00:00/40 tag 9 ncq dma 524288 in
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 60/58:50:b8:38:93/00:00:07:00:00/40 tag 10 ncq dma 45056 in
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 16 11:20:08 Tower kernel: ata2.00: cmd 61/08:58:f8:ea:5c/00:00:07:00:00/40 tag 11 ncq dma 4096 out
Mar 16 11:20:08 Tower kernel:         res 40/00:58:f8:ea:5c/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Mar 16 11:20:08 Tower kernel: ata2.00: status: { DRDY }
Mar 16 11:20:08 Tower kernel: ata2: hard resetting link

from syslog
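
For anyone who wants to scan a whole syslog for this pattern rather than eyeball it, here is a minimal Python sketch. It counts the SError flags and hard link resets per ATA port. The function name `summarize_ata_errors` and the two regular expressions are my own, written against the line format in the excerpt above, so treat them as assumptions rather than a general libata parser. A `BadCRC` flag together with "ATA bus error" usually points at the link (cable, connector, port, or power) rather than the disk media, though it does not prove it.

```python
import re
from collections import Counter

# Patterns written against the kernel lines quoted in this thread (assumption:
# other kernels may format these messages slightly differently).
SERROR_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: SError: \{([^}]*)\}")
RESET_RE = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize_ata_errors(lines):
    """Count SError flags and hard link resets per ATA port."""
    flags = Counter()   # keyed by (port, flag)
    resets = Counter()  # keyed by port
    for line in lines:
        match = SERROR_RE.search(line)
        if match:
            port = match.group(1)
            for flag in match.group(2).split():
                flags[(port, flag)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return flags, resets


# Three lines copied from the syslog excerpt above.
sample = [
    "Mar 16 11:20:08 Tower kernel: ata2.00: irq_stat 0x08000008, interface fatal error",
    "Mar 16 11:20:08 Tower kernel: ata2: SError: { UnrecovData 10B8B BadCRC }",
    "Mar 16 11:20:08 Tower kernel: ata2: hard resetting link",
]

flags, resets = summarize_ata_errors(sample)
for (port, flag), count in sorted(flags.items()):
    print(f"{port}: {flag} x{count}")
for port, count in sorted(resets.items()):
    print(f"{port}: hard reset x{count}")
```

To run it against a real log, pass an open file instead of the sample list, for example `summarize_ata_errors(open("syslog.txt"))`, where `syslog.txt` is whatever name the syslog has in your diagnostics zip.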

Link to comment

I tried running the extended SMART test and it got to 80%. When I checked back, thinking it would be done, it showed a message reading "Interrupted (host reset)" and no results from the test. I'm attaching the most recent diagnostics, but it appears as though the SMART reports have stayed the same. I've stopped the array completely and will rerun the extended SMART test to see if I can have some results by the morning. Thanks for all your help!

tower-diagnostics-20220316-2152.zip

Link to comment

I just tried to run a SMART short self-test and it immediately stopped, giving me a message reading, "A mandatory SMART command failed:exiting. To continue, add one or more '-T permissive' options.". I have purchased a new motherboard, RAM, CPU, and NVMe drive (intended to replace that cache drive), so if the parity drive looks fine I might just start the process of replacing those parts to see if it helps the issues. If that seems like a bad idea, please let me know.

Link to comment

Ok, I can look into that. Can you clarify what I should do differently or point me to some documentation? I don't use the VMs at all, but use Docker for the Arrs and Plex. It seemed to be working well enough until the issues with the excessive logging a few weeks ago, and now these issues.

Also, just to be clear, it seems like you are indicating that you wouldn't suggest any further action before replacing the hardware. Is that correct?

Also, thanks again for your help. I'm reasonably good at figuring stuff out, but I'm thankful for people like you that are willing to jump in when I run into a dead end.

Link to comment

Sorry, I was looking at another "tower-diagnostics". I really wish everyone would rename their server from the default "tower".

I can't actually tell where your docker/VM files are since the array isn't started, but the docker and domain .cfg files look OK, and the .cfg files for appdata, domains, and system look OK too.

So maybe starting over isn't needed. But if you do start without cache those files will be recreated on the array, so best if you disable Docker and VMs as I said until you get cache fixed.

Link to comment

I have most of that stuff on a second cache pool drive, but I will double check to make sure it is moved over before replacing the hardware. Although I just started the array again and the cache is showing as unmountable, so it may be toast.

The server is super old, so I decided to work with stuff that has been produced in the last decade, hence the rebuild. Hopefully this will help clear up some of the issues I'm experiencing with the drives, and who knows, maybe the cache will decide to come back around.

I'll holler back here if I continue to experience issues, thanks again for your help!

Link to comment

Hi there, me again. I changed out the hardware and rebuilt parity on the drive that had been giving me problems, and it appeared to work properly. I then went to replace an older drive with a new one that I purchased, and as it was being rebuilt from parity the errors came back on the parity drive and the rebuild failed. I then took the new drive I was trying to install and put it in as a new parity drive. It rebuilt and seems to have worked, but I don't have another drive, so I can't test a rebuild process at this point.

So, long story short, I'm back up and running with a new parity drive, but the old drive still returns an extended SMART test with no errors. At this point I'm thinking it might be an issue with a bad SATA cable, so I'm going to track that down and replace it tonight. But wanted to check back and see if there are any other thoughts on what could be going on. Also attaching a new diagnostics report. Thanks!

tower-diagnostics-20220321-0942.zip
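
One way to sanity-check the cable theory is to watch SMART attribute 199 (commonly named UDMA_CRC_Error_Count, though the name varies by vendor) on the old drive, since a raw value that keeps climbing tends to go with link problems rather than a failing disk. Below is a rough Python sketch that pulls that raw value out of the text printed by `smartctl -A /dev/sdX`. The helper name `udma_crc_errors` is hypothetical, and the sample table is invented for illustration; it is not taken from the attached diagnostics.

```python
def udma_crc_errors(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Assumes the usual ATA attribute table from `smartctl -A`, where each row
    starts with the numeric ID and the raw value is the tenth column.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Invented sample rows (NOT from this server), shaped like `smartctl -A` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       43210
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""

print(udma_crc_errors(SAMPLE))
```

Comparing the value before and after swapping the SATA cable is more telling than any single reading, because the counter never resets on its own.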

Link to comment
