David Bott Posted October 1, 2016

Hi All... Ver: 6.1.9

Each month I have the unRAID server run a check and fix on parity. It seems that each month I have 5 errors and it fixes them. If I run another check I have no errors, so it does seem to fix them. But then the next month, 5 again.

Anyway, here is the part of my syslog that shows the 5 errors during the check. All seem to be on ATA7. (?) Can someone please advise on what I should look at based on what the log may be saying? 5 data drives and a cache drive (and parity of course). One add-on PCIe card. Server diagnostics ZIP attached if needed. Thank you.

Oct 1 00:00:01 Server logger: mover started
Oct 1 00:00:01 Server kernel: mdcmd (41): check
Oct 1 00:00:01 Server kernel: md: recovery thread woken up ...
Oct 1 00:00:01 Server kernel: md: recovery thread checking parity...
Oct 1 00:01:21 Server sSMTP[28530]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Oct 1 00:01:23 Server sSMTP[28530]: Authorization failed (535 Incorrect authentication data)
Oct 1 00:23:22 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 1 00:23:22 Server kernel: ata7.00: failed command: SMART
Oct 1 00:23:22 Server kernel: ata7.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 1 pio 512 in
Oct 1 00:23:22 Server kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:23:22 Server kernel: ata7.00: status: { DRDY }
Oct 1 00:23:22 Server kernel: ata7: hard resetting link
Oct 1 00:23:22 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:23:27 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct 1 00:23:27 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 1 00:23:27 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct 1 00:23:27 Server kernel: ata7: hard resetting link
Oct 1 00:23:27 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:23:27 Server kernel: ata7.00: configured for UDMA/133
Oct 1 00:23:27 Server kernel: ata7: EH complete
Oct 1 00:38:45 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 1 00:38:45 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct 1 00:38:45 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 6 pio 512 in
Oct 1 00:38:45 Server kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:38:45 Server kernel: ata7.00: status: { DRDY }
Oct 1 00:38:45 Server kernel: ata7: hard resetting link
Oct 1 00:38:45 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:38:50 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct 1 00:38:50 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 1 00:38:50 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct 1 00:38:50 Server kernel: ata7: hard resetting link
Oct 1 00:38:50 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:38:50 Server kernel: ata7.00: configured for UDMA/133
Oct 1 00:38:50 Server kernel: ata7: EH complete
Oct 1 00:54:22 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 1 00:54:22 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct 1 00:54:22 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 21 pio 512 in
Oct 1 00:54:22 Server kernel: res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
Oct 1 00:54:22 Server kernel: ata7.00: status: { DRDY }
Oct 1 00:54:22 Server kernel: ata7: hard resetting link
Oct 1 00:54:22 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:54:27 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct 1 00:54:27 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 1 00:54:27 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct 1 00:54:27 Server kernel: ata7: hard resetting link
Oct 1 00:54:27 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 1 00:54:37 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct 1 00:54:37 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 1 00:54:37 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct 1 00:54:37 Server kernel: ata7: limiting SATA link speed to 3.0 Gbps
Oct 1 00:54:37 Server kernel: ata7.00: limiting speed to UDMA/133:PIO3
Oct 1 00:54:37 Server kernel: ata7: hard resetting link
Oct 1 00:54:38 Server kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 1 00:54:38 Server kernel: ata7.00: configured for UDMA/133
Oct 1 00:54:38 Server kernel: ata7: EH complete
Oct 1 03:26:23 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 1 03:26:23 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct 1 03:26:23 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 30 pio 512 in
Oct 1 03:26:23 Server kernel: res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
Oct 1 03:26:23 Server kernel: ata7.00: status: { DRDY }
Oct 1 03:26:23 Server kernel: ata7: hard resetting link
Oct 1 03:26:23 Server kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 1 03:26:23 Server kernel: ata7.00: configured for UDMA/133
Oct 1 03:26:23 Server kernel: ata7: EH complete
Oct 1 03:37:36 Server kernel: md: correcting parity, sector=3519069768
Oct 1 03:37:36 Server kernel: md: correcting parity, sector=3519069776
Oct 1 03:37:36 Server kernel: md: correcting parity, sector=3519069784
Oct 1 03:37:36 Server kernel: md: correcting parity, sector=3519069792
Oct 1 03:37:36 Server kernel: md: correcting parity, sector=3519069800

PCI Devices...

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation H77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)

SCSI Devices...

[0:0:0:0] disk Sony Storage Media 0100 /dev/sda
[2:0:0:0] disk ATA ST4000VN000-1H41 SC43 /dev/sdb
[3:0:0:0] disk ATA ST4000VN000-1H41 SC43 /dev/sdc
[4:0:0:0] disk ATA ST4000VN000-1H41 SC43 /dev/sdd
[5:0:0:0] disk ATA ST4000VN000-1H41 SC43 /dev/sde
[6:0:0:0] disk ATA WDC WD30EZRX-00D 0A80 /dev/sdf
[7:0:0:0] disk ATA KINGSTON SV300S3 BBF0 /dev/sdg
[8:0:0:0] disk ATA ST4000VN000-1H41 SC46 /dev/sdh

server-diagnostics-20161001-1035.zip
Squid Posted October 1, 2016

This problem seems to be resurfacing as of late. In my case, on my 6.1.9 machine, I also continually have 5 corrected errors, *always* on the exact same sectors, but only under a certain circumstance: it's the first parity check after a reboot. Subsequent parity checks without a reboot will not pick up any errors. This implies to me that it's something in the BIOS/HBAs that is changing the values (HPA? - but no additional drives have an HPA partition on them that didn't already previously exist). I just haven't had time to properly diagnose where the issue actually is (but I don't believe that in my case it's unRAID's fault). And since, AFAIK, no files are corrupted by the 5 sectors changing at boot time, it hasn't been a priority for me (all my MD5s check out fine - which implies, though doesn't guarantee, that the problem is on the parity disk itself).
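The MD5 cross-check Squid mentions can be scripted with standard coreutils. A minimal sketch (paths here are placeholders, not from the post): build a manifest once, then re-verify it after each parity check - any file whose contents changed will show up as FAILED.

```shell
# Hedged sketch of an MD5 manifest check like the one described above.
# Paths are hypothetical; assumes md5sum from GNU coreutils.
build_manifest() {
    # Hash every file under the given share into a manifest file.
    find "$1" -type f -exec md5sum {} + > "$2"
}
check_manifest() {
    # Re-read every file in the manifest; prints only mismatches and
    # exits non-zero if any file's content has changed.
    md5sum -c --quiet "$1"
}
```

Running check_manifest after each monthly parity check would show whether the corrected sectors ever land inside a file's data, or only in parity.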
JorgeB Posted October 1, 2016

The ATA7 errors are from your cache SSD, so they are unrelated to the parity sync errors. I would start by switching your parity disk from the Marvell controller to the unused onboard controller, leaving only the cache disk there, and see if the issue goes away.
David Bott Posted October 1, 2016 Author

"The ATA7 errors are from your cache SSD, so they are unrelated to the parity sync errors. I would start by switching your parity disk from the Marvell controller to the unused onboard controller, leaving only the cache disk there, and see if the issue goes away."

I will try that. A question in regard to that: will I need to do anything with unRAID for this switch of controllers for a drive? Does it care, or does it just look at the drive itself for configuration? Thank you.
JorgeB Posted October 1, 2016

You don't have to do anything; disks are tracked by serial number.
David Bott Posted October 2, 2016 Author

"The ATA7 errors are from your cache SSD, so they are unrelated to the parity sync errors. I would start by switching your parity disk from the Marvell controller to the unused onboard controller, leaving only the cache disk there, and see if the issue goes away."

Hi... Just wanted to mention that I made this change, and we will see what happens. BTW, you mentioned the ATA7 errors were on my cache drive. I do not think the drive is failing... but I'm not really sure what the error is saying. Cable issue? Drive? Something else? (Or are these errors of no real concern? Docker is using the cache drive.) Thank you.
ashman70 Posted October 2, 2016

I also have this problem on my 105TB unRAID server.
JorgeB Posted October 2, 2016

"BTW, you mentioned the ATA7 errors were on my cache drive. I do not think the drive is failing... but I'm not really sure what the error is saying. Cable issue? Drive? Something else? (Or are these errors of no real concern? Docker is using the cache drive.)"

Those were timeout errors; the link then recovered, but naturally it's not good. Some Marvell controllers have been having issues with unRAID v6, and yours is one of those. If your parity issues are resolved you should replace it, which should also take care of the SSD errors. In the meantime you could replace the SATA cable to rule it out.
David Bott Posted October 2, 2016 Author

Thanks Johnnie for your help.
coolspot Posted May 6, 2017

I'm also having the same problem - in my logs, I see the exact same 5 sectors:

May 6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069768
May 6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069776
May 6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069784
May 6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069792
May 6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069800

However, I do not have any ATA link errors. Is there a way to tell which drive those sectors belong to? Attached are my diagnostic logs.

nasserv-diagnostics-20170506-1423-2.zip
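As an aside on the question above: the md sector numbers can be read directly from each member disk at a low level, which is one way to watch which drive's data actually changes across reboots. This is an assumption-laden sketch, not an official unRAID procedure: it assumes the array partitions start at sector 64 of the raw device (newer alignments use 2048) - verify the partition start with `fdisk -l` before trusting any offsets.

```shell
# Hypothetical helper: build the dd command that reads one md-reported
# sector from a member disk's raw device. /dev/sdX is a placeholder and
# the 64-sector partition offset is an assumption to be verified.
sector_cmd() {
    local sector=$1 part_start=${2:-64}
    echo "dd if=/dev/sdX bs=512 count=1 skip=$((sector + part_start)) | md5sum"
}
sector_cmd 3519069768
# -> dd if=/dev/sdX bs=512 count=1 skip=3519069832 | md5sum
```

Running the printed command against each array disk before and after a reboot, and comparing the hashes, would show which device's contents move.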
JorgeB Posted May 6, 2017

If you run another check without rebooting, do you still get the same errors?
coolspot Posted May 6, 2017

Just now, johnnie.black said: "If you run another check without rebooting, do you still get the same errors?"

No, it's fine afterwards - this seems to be a problem with the Marvell chipset. I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRAID 2720SGL, which has the same chipset. I'm going to try turning off IOMMU/VT-d and see if this fixes things.
JorgeB Posted May 6, 2017

1 minute ago, coolspot said: "No, it's fine afterwards - this seems to be a problem with the Marvell chipset. I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRAID 2720SGL, which has the same chipset. I'm going to try turning off IOMMU/VT-d and see if this fixes things."

@Squid had the idea that these 5 repeating sectors after a reboot could be related to int13h being enabled; disable it in the SAS2LP BIOS and see if it helps.
JorgeB Posted May 6, 2017

You'll need to disable it and run a check; if there are errors, correct them, then reboot one more time and run another one.
Squid Posted May 6, 2017

5 minutes ago, coolspot said: "No, it's fine afterwards - this seems to be a problem with the Marvell chipset. I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRAID 2720SGL, which has the same chipset. I'm going to try turning off IOMMU/VT-d and see if this fixes things."

I just have no idea if the RocketRAID 2720SGL supports disabling int13h in its BIOS. But if it does, you would need to keep VT-d/IOMMU enabled in the computer's BIOS and just disable int13h in the HBA's BIOS. It's only a theory - not tested at all, but perfectly safe to do. (int13h enabled only allows drives attached to the HBA to be bootable.)
JorgeB Posted May 6, 2017

22 minutes ago, coolspot said: "I have a Highpoint RocketRAID 2720SGL, which has the same chipset."

Sorry - it looks exactly the same as the SAS2LP in the diags and I didn't read carefully enough; it will probably have the same setting in its BIOS though.
S80_UK Posted May 19, 2017

OK - so it's not just me then... (I am relieved, kind of.) I also have the SAS2LP in my main server, and every so often (like after I just upgraded to unRAID version 6.3.4) I see the same 5 errors when I do a parity check. The problem that I have is that the errors are real and are not on the parity drive. I have a number of files which consistently get trampled on when this happens.

How do I know this? I use an rsync-based script to do a binary compare of every file on my server against the backup server. This doesn't take as long as one might think, because I am able to run multiple compares in parallel (one instance per drive), so it takes about 12 hours or so. The backup server does not use the SAS2LP controller and never fails its parity checks. After I find the 5-error failure on the main server, I can restore the corrupted files from the backup server. I then run a correcting parity check on the main server so that parity is correct for the re-written data. I then run another non-correcting parity check without error, and then another binary compare between the servers with no errors found. Everything is then OK and remains OK until the next reboot.

So - is there anything that I can add to the discussion or do to help with diagnosis? Or is my only safe option to junk the SAS2LP and go back to the slower SASLP, which has never shown the problem?
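The per-drive compare described above can be sketched with rsync's checksum dry-run mode. The paths and the parallel loop here are placeholders (the actual script isn't in the thread): -c forces a content comparison rather than the default size/mtime check, and -n guarantees nothing is copied.

```shell
# Hedged sketch of a per-drive binary compare between two trees.
# -r recurse, -c compare by checksum, -n dry run; --out-format='%n'
# prints just the names of files whose contents differ.
compare_tree() {
    rsync -rcn --out-format='%n' "$1/" "$2/"
}

# One instance per data drive in parallel, as the post describes
# (mount points are hypothetical):
#   for n in 1 2 3 4 5; do
#       compare_tree /mnt/disk$n /mnt/backup/disk$n > /tmp/diff-disk$n.txt &
#   done
#   wait
```

Because -c reads and hashes every byte on both sides, wall-clock time is dominated by disk throughput, which is why running one instance per spindle in parallel helps so much.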
JorgeB Posted May 19, 2017

4 minutes ago, S80_UK said: "So - is there anything that I can add to the discussion or do to help with diagnosis?"

Try this:

On 06/05/2017 at 7:37 PM, johnnie.black said: "@Squid had the idea that these 5 repeating sectors after a reboot could be related to int13h being enabled; disable it in the SAS2LP BIOS and see if it helps."
JorgeB Posted May 19, 2017

11 minutes ago, S80_UK said: "The problem that I have is that the errors are real and are not on the parity drive. I have a number of files which consistently get trampled on when this happens."

Forgot to add: if this also happens to the other users with the same issue, it's much more serious than I thought. Is it always the same files or different ones?
S80_UK Posted May 19, 2017

Many thanks for chiming in. So far, it's always the same files. One obvious one is a Blu-ray rip - Cast Away; it's around 38GB, so it's an easy target. When I compared the backup against the main server copy I could see that quite a few sectors were getting overwritten - it may be possible for me to see if it's the same corruption each time, but it will take a while. There are also a couple of Windows 10 automatic backup database files that get screwed up. On the plus side, I can at least see where the errors are because I do have a 100% backup. I shall try disabling the interrupt in the BIOS when the machine has finished the current round of checking (I'm not sure if it's currently enabled or not). I will also need to restore the good files and then recheck everything before I can do a fresh power cycle, so it may be a while before I come back with any more details.
S80_UK Posted May 22, 2017 Share Posted May 22, 2017 (edited) Update: So I restored the corrupted files. I used the reconstruct method for parity to force it to be correct regardless of the previous state (just saved a bit of time). I then ran a parity check without error. i then did a full compare against the backup server - no errors. Then I did another parity check - still no errors. So far, so good. Next I rebooted the server to look at the interrupt option on the SAS2LP controller - it was enabled. So I disabled it. Then I powered off, rebooted, checked that the setting had been retained. It had. Power up - all looked OK. Read and wrote a few files from my main PC. OK. Then last night I ran another parity check. Guess what..? 5 errors. Back again. Interrupt off. Damn! May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565768May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565776May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565784May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565792May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565800 So now I am going to run another file compare against the backup to see what changed. My guess it that it will be the same files as before. Watch this space. Edited May 22, 2017 by S80_UK Quote Link to comment
JorgeB Posted May 22, 2017

S80_UK said: "So now I am going to run another file compare against the backup to see what changed. My guess is that it will be the same files as before. Watch this space."

Please repeat the complete test one more time now that int13h was disabled from the start.
S80_UK Posted May 22, 2017 Share Posted May 22, 2017 (edited) Yes - I may have had some other activity going on. So now I want to try to narrow it down further. So I just restarted from power off (Int still disabled) and then I started the file compare against the backup. My plan was to complete that, then fix any broken files, then check parity (should be OK), then compare files (should still be OK) then shut down completely. Follow that with a power up, check parity (might be broken). I will try to avoid writing to any shares for the duration - shares that get written to by PC Windows backups have been set to read-only for now. Full file compare takes about 14 hours. Parity check takes about 7.5 hours. So a couple more days. I could shorten this by only checking the files that I suspect may get damaged but I'd like to know that nothing else is affected Or I can only do the full binary compare after I have observed the parity errors. Edited May 22, 2017 by S80_UK Quote Link to comment
S80_UK Posted May 23, 2017 Share Posted May 23, 2017 (edited) A further update... With the SAS2LP interrupt off Restarted. Fixed the broken files from the backup with reconstruct On to force parity reconstruction for those sectors regardless of the previous parity data. Switched reconstruct back to Auto (which is my normal setting). Ran a complete parity check (was all OK) Checked files that were previously getting corrupted were still OK by reading off the server and then doing a binary compare. Shut down the server completely. Power up. Captured diagnostics just in case (attached if interested). Ran another complete parity check. (Good news - It was still all OK.) So for the time being I shall do a parity check every few days. I normally leave the server running 24/7, but I shall try an occasional restart just to see if I can force the error to come back. But if this has solved it, then we are left with the conclusion that with the interrupt enabled, there can be a small number of destructive writes which need not be confined to the parity drive but which only show up during a parity check. The prevailing view is normally that if there is the odd parity error then a parity correction is required, but that was not the case here At least I have my backups, and no SAS2LP in that machine, just the slower SASLP version. The attached file diffs.zip contains the report from a command line file compare of the backup version of the affected MKV file and the corrupted version. I doubt that it's that useful, but it shows that most of the overwritten data was replaced with zeroes. One passing thought - would a second parity drive have helped to catch this sooner, or have given more clarity as to where the error lay? Or might it just have made life more complicated? tower-v6-diagnostics-20170523-1012.zip diffs.zip Edited May 23, 2017 by S80_UK Typos 1 Quote Link to comment
JorgeB Posted May 23, 2017

2 minutes ago, S80_UK said: "But if this has solved it, then we are left with the conclusion that with the interrupt enabled, there can be a small number of destructive writes which need not be confined to the parity drive, but which only show up during a parity check. The prevailing view is normally that if there is the odd parity error then a parity correction is all that's required, but that was not the case here. At least I have my backups, and no SAS2LP in that machine, just the slower SASLP version. The attached file diffs.zip contains the report from a command-line file compare. I doubt that it's that useful, but it shows that most of the overwritten data was replaced with zeroes."

Thanks for testing, and keep us updated - but the fact that these writes corrupt files is troubling. Also, if parity is on the SAS2LP, it's probably getting corrupted as well. I never had this problem with mine, although I always had VT-d disabled on the servers they were used in, and I just recently checked and int13h is also disabled on both; I must have disabled it for another reason when I got them.

5 minutes ago, S80_UK said: "One passing thought - would a second parity drive have helped to catch this sooner, or have given more clarity as to where the error lay? Or might it just have made life more complicated?"

Not sure, but probably not; it depends on where each disk is connected.