Read errors on data disk (probably fucked up my parity)



Hi everybody,

I think I fucked up...

Short version:

  • Last successful parity check was on August 3rd
  • Changed MB, CPU, RAM on August 5th, no parity check afterwards
  • Unraid crashed during reboot (August 19th)
  • Automated parity check after the crash reported lots of errors, due to read errors on disk 2 (which had had a pending sector for a while, but never acted up before)
  • Replaced cabling and restarted the parity check, then disk 3 (on the Marvell controller card) was marked as failed
  • Removed the Marvell card and rebuilt disk 3 from parity, but had lots of read errors from disk 2 again (August 20); Unraid showed health as "green"
  • After a restart the filesystems of disk 1 and disk 3 were corrupted. Repaired both.
  • Disk 2 doesn't complete an extended SMART test due to read errors
  • Parity check now shows 27729 errors (finished last night, August 22nd)
  • Don't know what to do now, as I have to replace disk 2, but the parity information is probably fucked up as far as I understand...
  • Please help ;)
  • Diagnostic files are attached.
  • Pictures, documents etc. are all backed up offsite, so no problem. Media files are expendable and can be re-ripped from BDs.


Details:
My last successful parity check was on August 3rd. On August 5th I replaced the MB, CPU and RAM of the server with a Supermicro X10 board, a Xeon E3-1275 v3 and 32GB of ECC RAM. Before that, I was running an old ASRock board with an i5 and 8GB of non-ECC RAM. Initially I also wanted to replace my old two-port Marvell SATA card with a Dell H310 card. I didn't have enough time, so I thought I'd do this in the next few days, because the Marvell had worked perfectly so far (I still had to flash the IT firmware onto the H310 card).

Unfortunately, after the replacement it never occurred to me to run a non-correcting parity check on the new build.

The server was running fine until Thursday, August 19th, when during a manual reboot the array would not stop (it took forever) and then Unraid froze completely. I waited a few hours, but got no response, so I did a hard reset. When the server was back up, all Docker containers were gone. Apparently the Docker image had been corrupted and reset. No biggie, I reinstalled everything and all worked fine.

After the system was back up, an automatic parity check was started (with correcting errors) and showed quite a lot of errors (633) after ~2h30, at which point I also saw that disk 2 had read errors. So I stopped the parity check and read up on the forum. Then I understood that automatic parity checks with correction enabled are not such a good idea, because of the risk if you have a failing disk.

In a few threads it was suggested that read errors could be cable problems, so I replaced the cable of this drive and started another parity check (non-correcting). Then disk 3, which was connected to the Marvell card, started to act up and was marked as failed by Unraid.


I removed the Marvell controller and attached disk 3 to the MB controller.

At this point I still had hope that the problems on disk 2 were due to a bad cable, so I rebuilt disk 3 from parity (big mistake, I guess). The rebuild completed with lots of errors (1405) on Friday evening, August 20. Disk 2 still had read errors.

Disk 2's extended SMART test does not complete due to read errors.

The parity check now shows 27729 errors (finished last night, August 22nd).

Don't know what to do now, as I have to replace disk 2, but the parity information is probably fucked up as far as I understand... Parity drive is 12TB, I plan to buy one or two new 16TB drives. Can I replace disk 2 with the 16TB and have only 12TB usable?

Please help ;) Diagnostic files are attached. Pictures, documents etc. are all backed up offsite, so no problem. Media files are expendable and can be re-ripped from BDs.

PS: I hope all is clear and all info is included in the diagnostic files. If not, please tell me.

bigfoot-diagnostics-20210819-2216.zip bigfoot-diagnostics-20210822-1257.zip bigfoot-diagnostics-20210820-1821.zip

Edited by Junsas
Link to comment
6 hours ago, Junsas said:

an automatic parity check was started (with correcting errors)

The automatic parity check after an unclean shutdown is non-correcting.

 

Haven't looked at diagnostics yet, and maybe I missed it somewhere in all that, but from your description it doesn't sound like any of your parity checks was correcting, so they wouldn't have changed parity. Of course, normal writing to the array updates parity.
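To illustrate the difference between the two kinds of check, here is a minimal Python sketch. It assumes single parity computed as a bytewise XOR across the data disks; the function names and the toy byte values are made up for illustration and this is not Unraid's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_disks, parity, correct=False):
    """Count positions where stored parity disagrees with the data disks.

    With correct=False the stored parity is returned unchanged.
    With correct=True the stored parity is replaced by the recomputed value.
    """
    expected = xor_blocks(data_disks)
    errors = sum(1 for e, p in zip(expected, parity) if e != p)
    return errors, (expected if correct else parity)


# Toy array: three data "disks" of four bytes each (illustrative values only).
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
good_parity = xor_blocks(disks)
stale_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]

errors, after_check = parity_check(disks, stale_parity, correct=False)
print(errors, after_check == stale_parity)  # mismatch counted, parity left alone

errors, after_fix = parity_check(disks, stale_parity, correct=True)
print(errors, after_fix == good_parity)  # mismatch counted, parity rewritten
```

Note that a correcting check trusts whatever the data disks return, so if a data disk is returning bad reads, "correcting" makes parity match the bad data.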

Link to comment

OK, I think by "automatic" you must be referring to a scheduled parity check instead of an unclean shutdown parity check.

 

How often do you have parity check scheduled?

 

Why are you still running 6.8.3?

 

Would have been a lot better if you had posted as soon as you began to have problems, instead of saving it all up and writing a book about it. Will take some work to piece together all this information. Probably why nobody replied earlier.

 

Do any disks other than disk2 have SMART warnings on the Dashboard page?

Link to comment

Only looking at latest diagnostics for now, not sure what relevance to your current situation the older diagnostics would have.

 

You should completely avoid using the Marvell controller. You mention it a few times and seems like it might have caused the problems, but not entirely clear if it was completely eliminated when you were trying to fix things.

 

I see the repair of disk3 then disk1 in syslog. All disks are mounted when the diagnostics were taken, and no disk is disabled, though disk2 does fail extended SMART as you mentioned.

 

Looks like a correcting parity check started here:

Aug 20 22:19:04 Bigfoot kernel: md: recovery thread: check P ...

but I don't know why you would characterize that as "automatic". Definitely wasn't related to unclean shutdown, and seems an odd time for a scheduled check.
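If you want to find every such event in a syslog yourself, a small Python sketch like the following could help. The pattern is modelled only on the single line quoted above, so treat the format as an assumption; other Unraid versions may word these messages differently, and `find_recovery_events` is just an illustrative name.

```python
import re

# Pattern modelled on the one syslog line quoted in this thread (assumption:
# other md recovery-thread messages follow the same layout).
RECOVERY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"md: recovery thread:\s+(?P<action>.+)$"
)


def find_recovery_events(lines):
    """Return (timestamp, action) for every md recovery-thread message."""
    events = []
    for line in lines:
        match = RECOVERY_RE.match(line.strip())
        if match:
            events.append((match.group("stamp"), match.group("action")))
    return events


sample = [
    "Aug 20 22:19:04 Bigfoot kernel: md: recovery thread: check P ...",
]
for stamp, action in find_recovery_events(sample):
    print(stamp, "->", action)
```

Feeding it the lines of the syslog from a diagnostics zip should list when each check or rebuild was started, which makes it easier to line those up with the first disk read errors.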

 

Disk2 read problems began a couple of hours later with more disk2 read errors and parity corrections after that.

Link to comment

I guess the question is whether an emulated disk2 would be mountable with parity in its current state.

 

7 hours ago, Junsas said:

Parity drive is 12TB, I plan to buy one or two new 16TB. Can I replace disk 2 with the 16TB and have only 12TB usable?

The way to deal with that is known as "Parity Swap", where parity is copied to a new larger disk and then the former parity disk is used as the replacement for the rebuild of the data disk.
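The reason a plain replacement doesn't work here is that no data disk may be larger than the parity disk. A rough Python sketch of that decision follows; the function name and return strings are hypothetical, sizes are in TB, and the 6TB figure for disk 2 is an assumption for the example.

```python
def replacement_plan(parity_tb, old_data_tb, new_disk_tb):
    """Suggest how a new disk can replace a data disk (sizes in TB).

    Rules assumed here: a data disk may never be larger than the parity
    disk, and a replacement must be at least as large as the disk it replaces.
    """
    if new_disk_tb < old_data_tb:
        return "too small"
    if new_disk_tb <= parity_tb:
        return "direct replace"
    # New disk is larger than parity: it has to become the parity disk,
    # and the old parity disk takes over the data slot.
    return "parity swap"


# 12TB parity, a 16TB new disk, and disk 2 assumed to be a 6TB disk.
print(replacement_plan(parity_tb=12, old_data_tb=6, new_disk_tb=16))
# A 12TB replacement would not need the swap.
print(replacement_plan(parity_tb=12, old_data_tb=6, new_disk_tb=12))
```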

 

That might be an option, since I think the only way to answer the question about emulated disk2 being mountable would be to disable it and see.

 

I am going to stop for now to wait for replies from you @Junsas to see if I have characterized the situation correctly, and also to get other opinions.

 

@JorgeB @itimpi @JonathanM

Link to comment

thanks so much for taking the time @trurl!

Yes, I know I should have just shut down the array and posted here when the first problems occurred. Sometimes you (I) think the situation is manageable and then you're suddenly in over your head! I know it is a mess and I made the situation worse. Thanks for taking the time to help me! I've already learned some serious lessons from this experience.

 

A scheduled parity check runs once a month. I am pretty sure that the unclean shutdown parity check started on August 19th in the late evening, and I stopped it at around 1 or 2 am on August 20th after I saw the many errors during the check and the read errors on disk 2.

 

Why are you still running 6.8.3? I updated to the current version a few months back, but had a problem with VM startup which I couldn't solve. I was planning to install the new hardware anyway and thought I would wait until that was done.

No SMART errors except disk 2 (pending sectors and multi zone errors). Disk 3 has UDMA CRC error count = 79, but that was probably due to the Marvell controller acting up.

 

The Marvell controller was removed from the server after disk 3 failed. During the rebuild of disk 3 and afterwards, the disk was connected to the MB SATA ports. The Marvell card is gone and won't be used anymore. But it never gave me problems before I switched CPU/MB/RAM. I read that there seems to be a connection between virtualization features being active and Marvell controller problems. My old i5 probably didn't support those, and that is why I had no problems with the old build?

 

Disk 1 and 3 had the corrupted filesystem after rebuild of disk 3.

 

I read the parity swap procedure. Main question now is, if the rebuild of disk 3 resulted in usable data, even though I had the read errors on disk 2 during the rebuild? And as you write, if disk 2 can be emulated and mounted.

Link to comment
5 minutes ago, Junsas said:

unclean shutdown parity check started on August 19th in the late evening

Unclean shutdown parity check begins immediately after starting the array after reboot, it won't start at some random time after your server has been running. Also, as mentioned, unclean shutdown parity check is non-correcting. There was no evidence of unclean shutdown in any of your diagnostics.

 

See below for a better understanding on what is meant by unclean shutdown and why it happens.

Link to comment
24 minutes ago, Junsas said:

I read the parity swap procedure. Main question now is, if the rebuild of disk 3 resulted in usable data, even though I had the read errors on disk 2 during the rebuild? And as you write, if disk 2 can be emulated and mounted.

As mentioned, all disks currently mountable according to diagnostics and looks like they have a reasonable amount of data. I don't even see a lost+found share in your diagnostics. You should be able to examine folders/files on all disks.

Filesystem      Size  Used Avail Use% Mounted on
/dev/md1         11T   11T  703G  94% /mnt/disk1
/dev/md2        5.5T  5.2T  336G  94% /mnt/disk2
/dev/md3        2.8T  1.5T  1.3T  54% /mnt/disk3
/dev/md4        3.7T  2.3T  1.4T  63% /mnt/disk4

 

Probably nothing to be done until you get replacement(s). If you can't shutdown and wait, you should at least make sure nothing else writes to disk2.
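On whether the rebuilt disk3 can be trusted everywhere: a rebuild computes each sector from parity plus the same sector on every other disk, so a bad read on disk2 can land in the rebuilt data at that position even when the filesystem itself is fine. A minimal Python sketch of that, assuming single XOR parity and made-up byte values; it does not model what Unraid actually writes when a read fails.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_disks, parity):
    """Reconstruct a missing disk from parity and all surviving data disks."""
    return xor_blocks(surviving_disks + [parity])


# Toy contents (illustrative values only).
disk1 = bytes([10, 20, 30, 40])
disk2 = bytes([1, 2, 3, 4])
disk3 = bytes([7, 7, 7, 7])
parity = xor_blocks([disk1, disk2, disk3])

# With clean reads the rebuild of disk3 is exact.
print(rebuild([disk1, disk2], parity) == disk3)

# disk2 returns wrong data at position 2 (stands in for an unreadable sector).
disk2_bad = disk2[:2] + bytes([0]) + disk2[3:]
rebuilt = rebuild([disk1, disk2_bad], parity)
damaged = [i for i, (a, b) in enumerate(zip(rebuilt, disk3)) if a != b]
print(damaged)  # only the position where disk2 misread is affected
```

So the damage is limited to the positions where disk2 could not be read. Whether that hit file data, filesystem metadata or free space can only be judged by checking the files themselves, for example against the offsite backups.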

Link to comment
13 minutes ago, trurl said:

As mentioned, all disks currently mountable according to diagnostics and looks like they have a reasonable amount of data. I don't even see a lost+found share in your diagnostics. You should be able to examine folders/files on all disks.

Filesystem      Size  Used Avail Use% Mounted on
/dev/md1         11T   11T  703G  94% /mnt/disk1
/dev/md2        5.5T  5.2T  336G  94% /mnt/disk2
/dev/md3        2.8T  1.5T  1.3T  54% /mnt/disk3
/dev/md4        3.7T  2.3T  1.4T  63% /mnt/disk4

 

Probably nothing to be done until you get replacement(s). If you can't shutdown and wait, you should at least make sure nothing else writes to disk2.

 

 

Yes, all disks are mounted. I was/am just not sure if all data on disk 3 is uncorrupted after the rebuild? When the rebuild finished with errors, I just assumed there must be some corrupted files somewhere? Or was the result "just" the corrupted filesystem and I got lucky that this could be repaired without any files landing in lost+found?

 

The array is shut down and will stay shut down until a new drive arrives. What would be the safest way? Parity is currently 12TB, but the sweet spot for a new drive would be 16TB. Can I manage it risk-free with a 16TB drive and a parity swap? Or would it be much safer to buy another 12TB drive and just replace disk 2?

Link to comment

I think parity swap should be OK and shouldn't make things any worse than they are provided you do it correctly. Parity swap won't have any effect on any other disks.

 

Just to make sure we are on the same page:

 

https://wiki.unraid.net/Manual/Storage_Management#Parity_Swap

 

Since you don't have a disabled disk, you will have to disable it yourself, and it tells you how to disable the disk to be replaced.

 

When you start the array after unassigning it, it becomes disabled and emulated, and you should be able to see if the emulated disk mounts. The emulated contents are what will be rebuilt.

 

Parity is just copied to the new disk, original parity is overwritten with rebuild of data disk being replaced, and all other disks are only read for that rebuild.

 

If the emulated disk isn't mountable we can try to repair the filesystem after rebuild.

 

And the original data disk is at least somewhat accessible and can probably be mounted as an Unassigned Device if you need to get anything from it.

 

Link to comment

Always be sure to double check all disk connections, power and SATA, both ends, including splitters, anytime you are mucking about in the case. The main reason someone has a replace/rebuild problem is because they have disturbed connections on other disks.

 

And leave the Marvell out.

Link to comment