parity errors, possibly the same ones recurring from test to test



Probably not related, but this unclean shutdown resulted in a corrupted file system on the flash drive, making it read-only.  Once you shut down, you'll need to fix that, and it's going to indicate another unclean shutdown on next boot.

 

Yeah, I noticed that after the parity check finished.  I rebooted and it started another parity check.  I stopped it, then rebooted again.

 

I tried to make a change to one of my Windows virtual machines but had to force stop it, so I had to reboot again. It just came back up, and now it's no longer showing my cache drive. I stopped the array, then selected the disk that has been my cache drive, but it won't 'stick'.

 

I'm rebooting yet again.

 

arrrggghhh!!!!

Link to comment
  • 1 year later...

When you run a correcting parity check, do you get the same 4 errors "corrected" every time?

 

I'm guessing I'll need to pull the log for this?  I won't know about every time yet anyway. I don't have logs for the first two.

 

Edit.  Is this a starting point?

 

Dec  1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec  1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec  1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec  1 03:40:01 TowerTest root: mover started
Dec  1 03:40:01 TowerTest root: mover finished
Dec  1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
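
To answer the "same errors every time" question, the sector numbers from each check have to be pulled out of the syslog and compared. Below is a minimal Python sketch of that, assuming the syslog lines keep the format shown above. The function names `parse_parity_errors` and `recurring_sectors` are made up for illustration and are not part of any Unraid tool.

```python
import re
from collections import Counter

# Matches the md driver lines quoted in this thread, e.g.
#   "... kernel: md: recovery thread: Q corrected, sector=288"
#   "... kernel: md: recovery thread: PQ incorrect, sector=3131110472"
LINE_RE = re.compile(
    r"md: recovery thread: (?P<flag>PQ|P|Q) "
    r"(?P<action>corrected|incorrect), sector=(?P<sector>\d+)"
)


def parse_parity_errors(lines):
    """Return (flag, action, sector) tuples for every parity error line."""
    errors = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # unrelated lines (mover, etc.) are skipped
            errors.append(
                (match["flag"], match["action"], int(match["sector"]))
            )
    return errors


def recurring_sectors(run_a, run_b):
    """Sectors that were reported in both parity check runs."""
    return sorted({s for _, _, s in run_a} & {s for _, _, s in run_b})


# The excerpt posted above, used as the only available run.
log = """\
Dec  1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec  1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec  1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec  1 03:40:01 TowerTest root: mover started
Dec  1 03:40:01 TowerTest root: mover finished
Dec  1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
""".splitlines()

errors = parse_parity_errors(log)
flag_counts = Counter(flag for flag, _, _ in errors)
print(errors)
print(flag_counts)
```

Once the log of a second check is saved, passing both parsed lists to `recurring_sectors` shows whether any sector is flagged again.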

 

 


Link to comment

Q is parity2. If all the errors are always Q errors, there may be an issue with that disk.

Sorry for my ignorance, but the last error is "PQ". Is that the same thing?

Would it be a good idea to pull out parity 2 and clear it again or something?

 

Also, should I create a new post since the other is so old?  The issue seemed similar to me but in reading the "how to get help" post it says to create new vs. hijacking.

Link to comment

I'd be interested in reading a follow-up from JustinChase too. I'm wondering if the problem was a manifestation of the Marvell bug. I notice that opentoe abandoned his Marvell controllers.

 

The reason I'm interested is because this appeared in the syslog of one of my servers (slightly edited for clarity):

 

Dec  1 05:00:01 Mandaue kernel: mdcmd (50): check NOCORRECT
Dec  1 05:00:01 Mandaue kernel: md: recovery thread: check P Q ...
Dec  1 05:00:01 Mandaue kernel: md: using 1536k window, over a total of 5860522532 blocks.
Dec  1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110472
Dec  1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110480
Dec  1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110488
Dec  1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110496
Dec  1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110504
Dec  1 17:45:31 Mandaue kernel: md: sync done. time=45929sec
Dec  1 17:45:31 Mandaue kernel: md: recovery thread: completion status: 0

 

A subsequent correcting parity check fixed them and this was confirmed by a third, non-correcting check, which found no errors. So the issue is not exactly the same as the OP's. However, there were five errors and they form a contiguous block, as reported.
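
The "contiguous block" observation can be checked mechanically. Here is a rough Python sketch that groups reported sectors into runs. The stride of 8 is taken from the spacing seen in the log above. Treating each report as covering eight 512-byte sectors (4 KiB) is an assumption on my part, not something the log states, and `group_runs` is a hypothetical helper name.

```python
def group_runs(sectors, stride=8):
    """Group sector numbers into runs whose neighbours differ by `stride`."""
    runs = []
    for sector in sorted(sectors):
        if runs and sector - runs[-1][-1] == stride:
            runs[-1].append(sector)  # extends the current run
        else:
            runs.append([sector])    # starts a new run
    return runs


# Sectors from the non-correcting check above.
sectors = [3131110472, 3131110480, 3131110488, 3131110496, 3131110504]
runs = group_runs(sectors)

# Assumption: each report covers 8 sectors of 512 bytes (4 KiB).
span_bytes = len(runs[0]) * 8 * 512

print(len(runs), "run(s)")
print(span_bytes, "bytes in the first run")
```

A single run of five entries is what "five errors forming a contiguous block" looks like. Scattered errors, such as the TowerTest ones earlier in the thread, would come back as several one-element runs.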

 

One thing I can say is that I'm using an AOC-SAS2LP-MV8 controller (Marvell), though it has only four disks currently attached (two in the array and two unassigned that happened to be pre-clearing when the scheduled non-correcting parity check ran). The remaining disks, including both parities and both cache disks, are controlled by the motherboard. I have neither Dockers nor VMs enabled, but I do have AMD-Vi IOMMU enabled. I don't remember specifically enabling it in the BIOS, but I don't need it and will turn it off once my pre-clearing has finished. I strongly suspect that will fix the problem.

 

Diagnostics attached.

mandaue-diagnostics-20161203-0210.zip

Link to comment

I continued to have the problem for at least a year. I've since replaced a drive or two, moved a couple from the SATA card to the motherboard because parity checks were killing the server, and finally removed a drive from the array altogether, reducing the total number of drives in the server. Parity checks still kill dockers and/or VMs, but I no longer see the 5 errors when the check finally finishes. I plan to replace drives with larger ones, continue to remove drives from the array, and stop using the SATA card once I no longer need that many drives. I cannot say which changes resulted in the errors going away, sorry.

 


 

Link to comment


Just looked at the specs in your signature. You have 8 GB of RAM in total and two VMs running, each with 4 GB of RAM? That might be one of the reasons dockers and VMs are killed.
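
As a back-of-the-envelope check, this is the arithmetic behind that concern, written as a tiny Python sketch. The figures come from the signature as described above, and `host_headroom_gb` is just an illustrative name.

```python
def host_headroom_gb(total_gb, vm_allocations_gb):
    """Memory left for the host OS, dockers and cache after VM allocations."""
    return total_gb - sum(vm_allocations_gb)


# 8 GB installed, two VMs at 4 GB each.
headroom = host_headroom_gb(8, [4, 4])
print(headroom)
```

With nothing left over for the host, extra memory pressure during a parity check could plausibly push the system into killing processes, though that is only a guess without logs showing it.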

Link to comment

Yes, thanks for the update, JustinChase. Assuming your sig is still up to date, I see you're still using the Marvell 88SE9215-based SYBA SI-PEX40064. I see a GeForce GT 720 graphics card mentioned there too. Are you passing it through to a VM? If so, there's a chance it could indeed be the Marvell bug. I'm still waiting for my pre-clears to end before I can switch off IOMMU.

Link to comment

It isn't that you're passing through a particular graphics card, but that you have IOMMU enabled in order to pass through any device. Are you using the "iommu=pt" workaround, by any chance? Now that my pre-clears are complete, I simply rebooted and turned off AMD-Vi in the BIOS. I have a non-correcting parity check underway.
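
For anyone unsure whether the workaround is active: `iommu=pt` is a Linux kernel boot parameter, so it shows up on the kernel command line, which can be read from `/proc/cmdline` on a running system. Here is a small Python sketch that looks for it. The example command line string is hypothetical and only there to illustrate the format.

```python
def iommu_setting(cmdline):
    """Return the value of the last `iommu=` parameter, or None if absent."""
    value = None
    for token in cmdline.split():
        # Only the plain "iommu=" parameter is checked here, not
        # vendor-specific ones such as "amd_iommu=" or "intel_iommu=".
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/bzimage initrd=/bzroot iommu=pt"
print(iommu_setting(example) == "pt")
```

On the server itself, `iommu_setting(read_cmdline())` would give the live value.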

 

Link to comment

I'm not sure what the "iommu=pt" workaround is. I had to enable some override to allow devices to be split up and shared with different VMs, but I can't remember the actual setting name. It may be the same thing you're asking about, but it's called something different in the GUI.

 


 

 

Link to comment
