parity errors, possibly the same ones recurring from test to test

JustinChase · July 29, 2015

friendly reminder; this is still a problem...

I had the GUI hang yesterday, and had to hard boot the server. Resultant parity check had 5 errors...

diagnostics attached.

media-diagnostics-20150728-2056.zip

RobJ · July 29, 2015

Probably not related, but this unclean shutdown resulted in a corrupted file system on the flash drive, making it read-only. Once you shut down, you'll need to fix that, and it's going to indicate another unclean shutdown on next boot.

JustinChase · July 29, 2015

Yeah, I noticed that after the parity check finished. I rebooted and it started another parity check. I stopped it, then rebooted again.

I tried to make a change to one of my Windows virtual machines, but had to force stop it, so I had to reboot again. It just came back up, and now it's no longer showing my cache drive. I stopped the array, then selected the disk that has been my cache drive, but it won't 'stick'.

I'm rebooting yet again.

arrrggghhh!!!!

rxnelson · December 2, 2016

Any update on this? I had a power outage and now have 4 errors that will not correct. I'm running the latest stable version.

garycase · December 2, 2016

When you run a correcting parity check, do you get the same 4 errors "corrected" every time?

garycase · December 2, 2016

... also, for Justin: It's been over two years since this thread was active. Did you finally figure out a "fix" that worked?

rxnelson · December 2, 2016

garycase wrote: When you run a correcting parity check, do you get the same 4 errors "corrected" every time?

I'm guessing I'll need to pull the log for this? I won't know about every time yet anyway. I don't have logs for the first two.

Edit: Is this a starting point?

Dec 1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec 1 03:40:01 TowerTest root: mover started
Dec 1 03:40:01 TowerTest root: mover finished
Dec 1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608

garycase · December 2, 2016

Yes, the question is do you get the same 4 corrections the next time you run a correcting parity check.
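One way to answer that from the syslog, as a rough sketch rather than an official tool: pull the sector numbers out of the "recovery thread" lines for each check and compare the sets. The regular expression below is based only on the line format quoted in this thread; the function names are made up for illustration, and the log text for the later check is something you would have to supply yourself.

```python
import re

# Matches the md driver lines quoted in this thread, for example:
#   "... kernel: md: recovery thread: Q corrected, sector=288"
EVENT_RE = re.compile(
    r"md: recovery thread: (PQ|P|Q) (corrected|incorrect), sector=(\d+)"
)


def parity_events(log_text):
    """Return the set of (kind, sector) pairs found in a syslog excerpt."""
    return {(m.group(1), int(m.group(3))) for m in EVENT_RE.finditer(log_text)}


def compare_runs(earlier_log, later_log):
    """Split the events of two checks into repeated and one-off groups."""
    earlier, later = parity_events(earlier_log), parity_events(later_log)
    return {
        "repeated": sorted(earlier & later),
        "only_earlier": sorted(earlier - later),
        "only_later": sorted(later - earlier),
    }


# The four corrections posted above (mover lines omitted).
RUN_1 = """\
Dec 1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec 1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
"""

if __name__ == "__main__":
    # Comparing a run with itself is only a smoke test; in practice the
    # second argument would be the syslog text from the next check.
    print(compare_runs(RUN_1, RUN_1)["repeated"])
```

If "repeated" is non-empty after a correcting check, the same sectors are being "corrected" again, which is the situation the thread title describes.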

JorgeB · December 2, 2016

Q is parity2. If all the errors are always Q errors, there may be an issue with that disk.

rxnelson · December 2, 2016

JorgeB wrote: Q is parity2. If all the errors are always Q errors, there may be an issue with that disk.

Sorry for the ignorance but the last error is "PQ". Same thing?

Would it be a good idea to pull out parity 2 and clear it again or something?

Also, should I create a new post since the other is so old? The issue seemed similar to me but in reading the "how to get help" post it says to create new vs. hijacking.

JorgeB · December 2, 2016

rxnelson wrote: Sorry for the ignorance but the last error is "PQ". Same thing?

Didn't notice that one. That sector was out of sync on both parities. Do another check and see if the same errors appear.
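To see at a glance whether a check produced only Q events (parity2) or a mix, the lines can be tallied by kind. This is a small sketch that assumes the log format shown earlier in the thread; `tally_kinds` and `only_q` are hypothetical helper names, not part of any Unraid tooling.

```python
import re
from collections import Counter

# Captures the kind (P, Q or PQ) from lines such as
#   "md: recovery thread: PQ corrected, sector=3907430608"
KIND_RE = re.compile(r"md: recovery thread: (PQ|P|Q) (?:corrected|incorrect),")


def tally_kinds(log_text):
    """Count parity events per kind in a syslog excerpt."""
    return Counter(KIND_RE.findall(log_text))


def only_q(counts):
    """True when there is at least one event and every event is Q-only."""
    return bool(counts) and set(counts) == {"Q"}


SAMPLE = """\
Dec 1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec 1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
"""

if __name__ == "__main__":
    counts = tally_kinds(SAMPLE)
    print(dict(counts), "Q only:", only_q(counts))
```

For the excerpt posted above, the single PQ line means the result is not Q-only, which matches the correction in this reply.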

rxnelson · December 2, 2016

Ok. I started it last night so I'll try to reply back this evening. I think the last one took 18 hours.

rxnelson · December 2, 2016

Zero errors this time? It had 4 errors at least 3 times in a row. I guess I shouldn't be complaining. The only thing I did was scrub the cache drive. I guess I will check parity again and see what happens?

JorgeB · December 2, 2016

Are you sure all were correcting checks?

rxnelson · December 2, 2016

Yes. I've never done it otherwise. I dunno???

John_M · December 3, 2016

I'd be interested in reading a follow-up from JustinChase too. I'm wondering if the problem was a manifestation of the Marvell bug. I notice that opentoe abandoned his Marvell controllers.

The reason I'm interested is because this appeared in the syslog of one of my servers (slightly edited for clarity):

Dec 1 05:00:01 Mandaue kernel: mdcmd (50): check NOCORRECT
Dec 1 05:00:01 Mandaue kernel: md: recovery thread: check P Q ...
Dec 1 05:00:01 Mandaue kernel: md: using 1536k window, over a total of 5860522532 blocks.
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110472
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110480
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110488
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110496
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110504
Dec 1 17:45:31 Mandaue kernel: md: sync done. time=45929sec
Dec 1 17:45:31 Mandaue kernel: md: recovery thread: completion status: 0

A subsequent correcting parity check fixed them and this was confirmed by a third, non-correcting check, which found no errors. So the issue is not exactly the same as the OP's. However, there were five errors and they form a contiguous block, as reported.
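The "contiguous block" observation can be checked mechanically. The sketch below takes the five sector numbers from the log above and looks at the spacing between them; the helper names are invented for illustration. Reading a step of 8 as 4 KiB assumes 512-byte sectors, which the log itself does not state.

```python
def sector_steps(sectors):
    """Return the gaps between consecutive sector numbers, in sorted order."""
    ordered = sorted(sectors)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def is_evenly_spaced(sectors):
    """True when there are at least two sectors and every gap is the same."""
    steps = sector_steps(sectors)
    return len(steps) > 0 and len(set(steps)) == 1


# Sector numbers copied from the "PQ incorrect" lines above.
SECTORS = [3131110472, 3131110480, 3131110488, 3131110496, 3131110504]

# Assumption, not taken from the log: one sector is 512 bytes.
ASSUMED_SECTOR_BYTES = 512

if __name__ == "__main__":
    steps = sector_steps(SECTORS)
    print("steps:", steps, "evenly spaced:", is_evenly_spaced(SECTORS))
    print("bytes per step (assumed):", steps[0] * ASSUMED_SECTOR_BYTES)
```

Every gap is 8 sectors, so the five errors sit back to back in the driver's reporting granularity, consistent with a single damaged run rather than five scattered faults.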

One thing I can say is that I'm using an AOC-SAS2LP-MV8 controller (Marvell), though it has only four disks currently attached (two in the array and two unassigned that happened to be pre-clearing when the scheduled non-correcting parity check ran). The remaining disks, including both parities and both cache disks are controlled by the motherboard. I have neither Dockers nor VMs enabled but I do have AMD-Vi IOMMU enabled. I don't remember specifically enabling it in the BIOS but I don't need it and will turn it off once my pre-clearing has finished. I strongly suspect that will fix the problem.

Diagnostics attached.

mandaue-diagnostics-20161203-0210.zip

JorgeB · December 3, 2016

rxnelson wrote: Yes. I've never done it otherwise. I dunno???

If it happens again post the diagnostics before rebooting.

JustinChase · December 3, 2016

I continued to have the problem for at least a year. I've since replaced a drive or 2, swapped a couple from the SATA card to the motherboard due to parity checks killing the server, and finally removed a drive from the array altogether, reducing the total number of drives in the server. Parity checks still kill dockers and/or VMs, but I no longer see the 5 errors when the check finally finishes. I plan to replace drives with larger ones, continue to remove drives from the array, and stop using the SATA card once I no longer need that many drives. I cannot say which changes resulted in the errors going away, sorry.

saarg · December 3, 2016

Just looked at the specs in your signature. You have 8GB RAM total and two VMs running, each with 4GB RAM? That might be one of the reasons dockers and VMs are killed.

garycase · December 3, 2016

Thanks for the update Justin => too bad it wasn't a nice, neat "Hey, I finally figured out that THIS fixes it"

But good to know you're no longer seeing the same mysterious "errors" every time.

JustinChase · December 3, 2016

Whoops, I need to update my signature. I added another 16GB RAM to the server, so I've got 24GB now.

I agree a "this is what fixed it" would have been great, but no such luck. It seems destined to remain a mystery.

John_M · December 3, 2016

Yes, thanks for the update JustinChase. Assuming your sig is still up to date, I see you're still using the Marvell 88SE9215-based SYBA SI-PEX40064. I see a GeForce GT 720 graphics card mentioned there too. Are you passing it through to a VM? If so there's a chance it could indeed be the Marvell bug. I'm still waiting for my pre-clears to end before I can switch off IOMMU.

JustinChase · December 3, 2016

I believe that SATA card is still correct, and I'm definitely passing the 720 GT to my main HTPC VM. Funny thing is I had those errors before the 720. I've had the 550Ti for years, but only added the 720 about 6 months ago.

John_M · December 4, 2016

It isn't that you're passing through a particular graphics card, but that you have IOMMU enabled in order to pass through any device. Are you using the "iommu=pt" workaround, by any chance? Now that my pre-clears are complete, I simply rebooted and turned off AMD-Vi in the BIOS. I have a non-correcting parity check underway.
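If you're unsure whether "iommu=pt" is in effect, one place to look is the kernel command line of the running system, which Linux exposes at /proc/cmdline. The sketch below just filters that line for IOMMU-related parameters; the helper names are hypothetical, the example command line in the comment is made up, and it assumes the workaround was applied as a kernel boot parameter rather than somewhere else.

```python
from pathlib import Path

# Kernel parameter prefixes that relate to the IOMMU.
IOMMU_PREFIXES = ("iommu=", "amd_iommu=", "intel_iommu=")


def iommu_params(cmdline):
    """Return the IOMMU-related tokens from a kernel command line string."""
    return [tok for tok in cmdline.split() if tok.startswith(IOMMU_PREFIXES)]


def uses_passthrough_mode(cmdline):
    """True when the exact token iommu=pt is present on the command line."""
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    # Hypothetical example of what a command line might look like:
    #   "BOOT_IMAGE=/bzimage initrd=/bzroot iommu=pt"
    proc = Path("/proc/cmdline")
    if proc.exists():
        line = proc.read_text()
        print("IOMMU parameters:", iommu_params(line))
        print("iommu=pt set:", uses_passthrough_mode(line))
```

An empty list only means no IOMMU parameter was passed at boot; whether the IOMMU itself is enabled is still decided by the BIOS setting mentioned above.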

JustinChase · December 4, 2016

I'm not sure what the iommu=pt workaround is. I had to enable some override to allow devices to be split up and shared with different VMs, but I can't remember the actual setting name. It may be the same thing you're asking about, but it's called something different in the GUI.
