JustinChase Posted July 29, 2015 (Author)
Friendly reminder: this is still a problem. I had the GUI hang yesterday and had to hard boot the server. The resulting parity check had 5 errors. Diagnostics attached: media-diagnostics-20150728-2056.zip
RobJ Posted July 29, 2015
Probably not related, but this unclean shutdown resulted in a corrupted file system on the flash drive, making it read-only. Once you shut down, you'll need to fix that, and it's going to indicate another unclean shutdown on next boot.
JustinChase Posted July 29, 2015 (Author)
Quote (RobJ): Probably not related, but this unclean shutdown resulted in a corrupted file system on the flash drive, making it read-only. Once you shut down, you'll need to fix that, and it's going to indicate another unclean shutdown on next boot.
Yeah, I noticed that after the parity check finished. I rebooted and it started another parity check. I stopped it, then rebooted again. I tried to make a change to one of my Windows virtual machines but had to force stop it, so I had to reboot again. It just came back up, and now it's no longer showing my cache drive. I stopped the array, then selected the disk that has been my cache drive, but it won't 'stick'. I'm rebooting yet again. arrrggghhh!!!!
rxnelson Posted December 2, 2016
Any update on this? I had a power outage and now have 4 errors that will not correct. I'm running the latest stable version.
garycase Posted December 2, 2016
When you run a correcting parity check, do you get the same 4 errors "corrected" every time?
garycase Posted December 2, 2016
... also, for Justin: It's been over a year since this thread was active. Did you finally figure out a "fix" that worked?
rxnelson Posted December 2, 2016
Quote (garycase): When you run a correcting parity check, do you get the same 4 errors "corrected" every time ??
I'm guessing I'll need to pull the log for this? I won't know about every time yet anyway. I don't have logs for the first two.
Edit: Is this a starting point?
Dec 1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec 1 03:40:01 TowerTest root: mover started
Dec 1 03:40:01 TowerTest root: mover finished
Dec 1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
garycase Posted December 2, 2016
Yes. The question is whether you get the same 4 corrections the next time you run a correcting parity check.
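One way to answer that question without eyeballing two logs is to extract the reported sectors from each run and compare them. The following is a minimal sketch, assuming the syslog lines look like the excerpts posted in this thread; the regular expression and the function names `parity_events` and `compare_runs` are illustrative and are not part of any unRAID tool.

```python
import re

# Pattern inferred from the log excerpts in this thread, e.g.
#   "... kernel: md: recovery thread: Q corrected, sector=288"
# It is an assumption, not a documented unRAID log format.
LINE_RE = re.compile(
    r"md: recovery thread: (PQ|P|Q) (?:corrected|incorrect), sector=(\d+)"
)


def parity_events(syslog_text):
    """Return the set of (parity, sector) pairs reported in one check."""
    events = set()
    for line in syslog_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            events.add((match.group(1), int(match.group(2))))
    return events


def compare_runs(first_log, second_log):
    """Compare two parity checks and report which events repeat."""
    first = parity_events(first_log)
    second = parity_events(second_log)
    return {
        "repeated": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }
```

Passing the syslog text of two consecutive correcting checks to `compare_runs` shows whether the same sectors come back. A non-empty `repeated` list would mean the corrections are not sticking.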
JorgeB Posted December 2, 2016
Q is parity2. If all the errors are always Q errors, there may be an issue with that disk.
rxnelson Posted December 2, 2016
Quote (JorgeB): Q is parity2, if all errors are always Q errors there can be issue with that disk.
Sorry for the ignorance, but the last error is "PQ". Same thing? Would it be a good idea to pull out parity 2 and clear it again or something? Also, should I create a new post since the other is so old? The issue seemed similar to me, but the "how to get help" post says to create a new thread rather than hijack one.
JorgeB Posted December 2, 2016
Quote (rxnelson): Sorry for the ignorance but the last error is "PQ". Same thing?
I didn't notice that one. That sector was out of sync on both parities. Do another check and see if the same errors appear.
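To see at a glance whether the errors are confined to one parity disk, the events can be tallied by the parity label in the log. This is a rough sketch using the lines rxnelson posted above. As stated in the thread, Q is parity2 and PQ means both parities were out of sync; reading a bare P as the first parity disk is an assumption here, and the function name `tally_by_parity` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption).
EVENT_RE = re.compile(
    r"md: recovery thread: (PQ|P|Q) (?:corrected|incorrect), sector=\d+"
)


def tally_by_parity(syslog_text):
    """Count recovery-thread events per parity label (P, Q or PQ)."""
    return Counter(m.group(1) for m in EVENT_RE.finditer(syslog_text))


# Log lines copied from the post above.
sample = """\
Dec 1 00:30:24 TowerTest kernel: md: recovery thread: Q corrected, sector=288
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269056
Dec 1 00:30:39 TowerTest kernel: md: recovery thread: Q corrected, sector=4269600
Dec 1 03:40:01 TowerTest root: mover started
Dec 1 03:40:01 TowerTest root: mover finished
Dec 1 04:14:54 TowerTest kernel: md: recovery thread: PQ corrected, sector=3907430608
"""

counts = tally_by_parity(sample)
print(dict(counts))  # {'Q': 3, 'PQ': 1}
```

If every run showed only Q events, that would point toward the parity2 disk, as suggested above. The single PQ event is why another check is needed before drawing that conclusion.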
rxnelson Posted December 2, 2016
Ok. I started it last night so I'll try to reply back this evening. I think the last one took 18 hours.
rxnelson Posted December 2, 2016
Zero errors this time? It had 4 errors at least 3 times in a row. I guess I shouldn't be complaining. The only thing I did was scrub the cache drive. I guess I will check parity again and see what happens?
JorgeB Posted December 2, 2016
Are you sure all were correcting checks?
rxnelson Posted December 2, 2016
Yes. I've never done it otherwise. I dunno???
John_M Posted December 3, 2016
I'd be interested in reading a follow-up from JustinChase too. I'm wondering if the problem was a manifestation of the Marvell bug. I notice that opentoe abandoned his Marvell controllers. The reason I'm interested is that this appeared in the syslog of one of my servers (slightly edited for clarity):
Dec 1 05:00:01 Mandaue kernel: mdcmd (50): check NOCORRECT
Dec 1 05:00:01 Mandaue kernel: md: recovery thread: check P Q ...
Dec 1 05:00:01 Mandaue kernel: md: using 1536k window, over a total of 5860522532 blocks.
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110472
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110480
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110488
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110496
Dec 1 07:43:16 Mandaue kernel: md: recovery thread: PQ incorrect, sector=3131110504
Dec 1 17:45:31 Mandaue kernel: md: sync done. time=45929sec
Dec 1 17:45:31 Mandaue kernel: md: recovery thread: completion status: 0
A subsequent correcting parity check fixed them, and this was confirmed by a third, non-correcting check, which found no errors. So the issue is not exactly the same as the OP's. However, there were five errors and they form a contiguous block, as reported.
One thing I can say is that I'm using an AOC-SAS2LP-MV8 controller (Marvell), though it has only four disks currently attached (two in the array and two unassigned that happened to be pre-clearing when the scheduled non-correcting parity check ran). The remaining disks, including both parities and both cache disks, are controlled by the motherboard. I have neither Dockers nor VMs enabled, but I do have AMD-Vi IOMMU enabled. I don't remember specifically enabling it in the BIOS, but I don't need it and will turn it off once my pre-clearing has finished. I suspect that may fix the problem, but I won't know until I've run another parity check.
Diagnostics attached: mandaue-diagnostics-20161203-0210.zip
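The "contiguous block" observation can be checked mechanically. In the log above, neighbouring error sectors are 8 apart, so the sketch below treats a gap of exactly 8 as adjacent. That stride is taken from this one log and is an assumption, not a documented property of the md driver, and `group_runs` is a hypothetical helper name.

```python
def group_runs(sectors, stride=8):
    """Group sector numbers into runs whose neighbours are `stride` apart.

    Returns a list of (first_sector, last_sector, count) tuples.
    """
    runs = []
    for sector in sorted(set(sectors)):
        if runs and sector - runs[-1][-1] == stride:
            runs[-1].append(sector)  # extends the current run
        else:
            runs.append([sector])    # starts a new run
    return [(run[0], run[-1], len(run)) for run in runs]


# Sector numbers copied from the syslog excerpt above.
reported = [3131110472, 3131110480, 3131110488, 3131110496, 3131110504]
print(group_runs(reported))  # [(3131110472, 3131110504, 5)]
```

A single run of five matches the description in the post. Scattered errors, such as the four sectors rxnelson posted earlier, would come back as separate runs of length one.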
JorgeB Posted December 3, 2016
Quote (rxnelson): Yes. I've never done it otherwise. I dunno???
If it happens again, post the diagnostics before rebooting.
JustinChase Posted December 3, 2016 (Author)
I continued to have the problem for at least a year. I've since replaced a drive or 2, swapped a couple from the SATA card to the motherboard due to parity checks killing the server, and finally removed a drive from the array altogether, reducing the total number of drives in the server. Parity checks still kill dockers and/or VMs, but I no longer see the 5 errors when the check finally finishes. I plan to replace drives with larger ones, continue to remove drives from the array, and stop using the SATA card once I no longer need that many drives. I cannot say which changes resulted in the errors going away, sorry.
saarg Posted December 3, 2016
Quote (JustinChase): I continued to have the problem for at least a year. I've since replaced a drive or 2, swapped a couple from the SATA card to the motherboard due to parity checks killing the server, and finally removed a drive from the array altogether, reducing the total number of drives in the server. Parity checks still kill dockers and/or VMs, but I no longer see the 5 errors when the check finally finishes. I plan to replace drives with larger ones, continue to remove drives from the array, and stop using the SATA card once I no longer need that many drives. I cannot say which changes resulted in the errors going away, sorry.
Just looked at the specs in your signature. You have 8GB RAM total and two VMs running, each with 4GB RAM? That might be one of the reasons dockers and VMs are killed.
garycase Posted December 3, 2016
Thanks for the update, Justin => too bad it wasn't a nice, neat "Hey, I finally figured out that THIS fixes it". But good to know you're no longer seeing the same mysterious "errors" every time.
JustinChase Posted December 3, 2016 (Author)
Whoops, I need to update my signature. I added another 16GB RAM to the server, so I've got 24GB now. I agree a "this is what fixed it" would have been great, but no such luck. It seems destined to remain a mystery.
John_M Posted December 3, 2016
Yes, thanks for the update, JustinChase. Assuming your sig is still up to date, I see you're still using the Marvell 88SE9215-based SYBA SI-PEX40064. I see a GeForce GT 720 graphics card mentioned there too. Are you passing it through to a VM? If so, there's a chance it could indeed be the Marvell bug. I'm still waiting for my pre-clears to end before I can switch off IOMMU.
JustinChase Posted December 3, 2016 (Author)
I believe that SATA card is still correct, and I'm definitely passing the 720 GT to my main HTPC VM. Funny thing is, I had those errors before the 720. I've had the 550Ti for years, but only added the 720 about 6 months ago.
John_M Posted December 4, 2016
It isn't that you're passing through a particular graphics card, but that you have IOMMU enabled in order to pass through any device. Are you using the "iommu=pt" workaround, by any chance? Now that my pre-clears are complete, I simply rebooted and turned off AMD-Vi in the BIOS. I have a non-correcting parity check underway.
JustinChase Posted December 4, 2016 (Author)
I'm not sure what the iommu=pt workaround is. I had to enable some override to allow devices to be split up to share to different VMs, but I can't remember the actual setting name. It may be the same thing you're asking about, but it's called something different in the GUI.