[RESOLVED - mostly] 3 Parity Errors after running Parity Check - General Support (V5 and Older)

April 7, 2011

I ran a monthly Parity Check (manually invoked) this morning before leaving for work, and came home to this digest (from unMenu):

Parity is Valid:. Last parity check < 1 day ago . Parity updated 3 times to address sync errors.

System Log (last 15 lines)

Apr 6 07:27:30 servername kernel: md: recovery thread checking parity... (unRAID engine)
Apr 6 07:27:30 servername kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Apr 6 07:27:30 servername kernel: md: parity incorrect: 13232 (Errors)
Apr 6 07:27:30 servername kernel: md: parity incorrect: 13240 (Errors)
Apr 6 07:27:30 servername kernel: md: parity incorrect: 35808 (Errors)
Apr 6 16:52:37 servername kernel: md: sync done. time=33907sec rate=57613K/sec (unRAID engine)
Apr 6 16:52:37 servername kernel: md: recovery thread sync completion status: 0 (unRAID engine)
Apr 6 17:47:18 servername emhttp: shcmd (175): cp /var/spool/cron/crontabs/root- /var/spool/cron/crontabs/root (Other emhttp)
Apr 6 17:47:18 servername emhttp: shcmd (176): echo '# Generated mover schedule:' >>/var/spool/cron/crontabs/root (Other emhttp)
Apr 6 17:47:18 servername emhttp: shcmd (177): echo '40 3 * * * /usr/local/sbin/mover 2>&1 | logger' >>/var/spool/cron/crontabs/root (Other emhttp)
Apr 6 17:47:18 servername emhttp: shcmd (178): crontab /var/spool/cron/crontabs/root (Other emhttp)
Apr 6 17:47:18 servername emhttp: shcmd (179): rm /etc/samba/smb-shares.conf >/dev/null 2>&1 (Other emhttp)
Apr 6 17:47:18 servername emhttp: shcmd (180): cp /etc/exports- /etc/exports (Other emhttp)
Apr 6 17:47:20 servername emhttp: shcmd (181): killall -HUP smbd (Minor Issues)
Apr 6 17:47:20 servername emhttp: shcmd (182): /etc/rc.d/rc.nfsd restart | logger (Other emhttp)

I've tried looking over the syslog - not that I know exactly what I'm supposed to be looking for - and can't seem to find any detail as to what the errors were or whether I should be worried. I've attached the pertinent section of the syslog (covering this morning through about now). Could some generous soul please look it over and let me know:

A) Is there anything I should be worried about?

B) What exactly should I be worried about?

C) What do I do to fix the issue (if any)?

As always, any help is greatly appreciated!

BTW, I'm running 4.7.

syslog.txt


April 7, 2011

Please post the entire syslog. zip it if needed.


April 7, 2011


Did you have to hard boot or otherwise have a dirty shutdown since your last parity check?

The parity error blocks are very near the beginning of the disk, which is typical of the parity errors you get after a dirty shutdown.
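As a rough check of "very near the beginning", here is a small Python sketch that pulls the `parity incorrect:` numbers out of the log lines quoted in the first post and expresses each one as an offset into the disk. It assumes the reported numbers are 512-byte sectors and that the `total of 1953514552 blocks` figure counts 1 KiB blocks; both unit assumptions are mine, not something the thread or unRAID documentation confirms. Under either unit the errors sit in the first few tens of megabytes of a 2 TB disk.

```python
import re

# Log lines copied from the unMenu digest above (category tags trimmed).
LOG = """\
Apr 6 07:27:30 servername kernel: md: using 1152k window, over a total of 1953514552 blocks.
Apr 6 07:27:30 servername kernel: md: parity incorrect: 13232
Apr 6 07:27:30 servername kernel: md: parity incorrect: 13240
Apr 6 07:27:30 servername kernel: md: parity incorrect: 35808
"""

SECTOR_BYTES = 512   # assumption: "parity incorrect" values are 512-byte sectors
BLOCK_BYTES = 1024   # assumption: "total of N blocks" counts 1 KiB blocks


def parity_error_positions(log_text):
    """Return the numbers reported on 'parity incorrect' lines."""
    return [int(n) for n in re.findall(r"parity incorrect: (\d+)", log_text)]


def total_bytes(log_text):
    """Return the array size implied by the 'total of N blocks' line."""
    m = re.search(r"total of (\d+) blocks", log_text)
    return int(m.group(1)) * BLOCK_BYTES


if __name__ == "__main__":
    size = total_bytes(LOG)
    for sector in parity_error_positions(LOG):
        offset = sector * SECTOR_BYTES
        print(f"sector {sector}: {offset / 2**20:.1f} MiB into the disk, "
              f"{100 * offset / size:.5f}% of the way through")
```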


April 7, 2011

Author

Please post the entire syslog. zip it if needed.

Attached. I can't seem to get it "clean" with line breaks, even with word wrap turned on. Let me know if there's any section in particular that I could snip and re-align (I can't do it for the whole document; it's too long).

Thanks!

syslog.zip


April 7, 2011

Author

Did you have to hard boot or otherwise have a dirty shutdown since your last parity check?

The parity error blocks are very near the beginning of the disk, which is typical of the parity errors you get after a dirty shutdown.

Not that I recall. I've been very diligent with doing a Stop + Shutdown via the UI. No power outages either.


April 7, 2011

I suggest you run an overnight memory test. Reboot the server and select the memory test option.


April 7, 2011

Author

I suggest you run an overnight memory test. Reboot the server and select the memory test option.

IIRC, this option is available on the boot screen before the default unRAID selection boots, is that correct? I'm going to do this tonight, but as part of my continuing education, what am I seeking to discover with the memtest?

Thanks!


April 7, 2011

Author

Oh, and should I worry at this point of a possible drive failure while this parity error is being diagnosed? (no jinxies, no jinxies, no jinxies...)


April 7, 2011

I suggest you run an overnight memory test. Reboot the server and select the memory test option.

IIRC, this option is available on the boot screen before the default unRAID selection boots, is that correct? I'm going to do this tonight, but as part of my continuing education, what am I seeking to discover with the memtest?

Thanks!

Correct.

Bad memory can cause parity errors, because the computer computes a result in a memory location, a bad memory cell flips a bit, and the answer comes out wrong. Bad memory and dirty shutdowns are the two most frequent causes of parity errors.
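To make the memory explanation concrete, here is a toy Python model of single-parity protection: parity is the byte-wise XOR of the data disks, and a parity check recomputes it and compares. This is a simplification for illustration, not unRAID's md driver code, but it shows how a single bit flipped while parity is being computed leaves stored parity that no longer matches the data, which a later check then reports as a sync error.

```python
from functools import reduce


def compute_parity(disks):
    """XOR corresponding bytes of every data disk to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def parity_check(disks, parity):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = compute_parity(disks)
    return [i for i, (e, p) in enumerate(zip(expected, parity)) if e != p]


# Three tiny "disks" holding 4 bytes each.
disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]

good = compute_parity(disks)
assert parity_check(disks, good) == []

# Simulate bad RAM: one bit flips in the parity result before it is written.
corrupted = bytearray(good)
corrupted[2] ^= 0b0000_0100
print(parity_check(disks, bytes(corrupted)))  # → [2]
```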

Can you look in your /var/log directory on your server and see if there is another syslog in there (maybe "syslog.1" or something). The syslog you posted appears to be a restarted syslog. The other one must have filled up.

I see signs of crashing processes. That may be another sign of bad memory, or could mean something else. The earlier syslog would be helpful.


April 7, 2011

Oh, and should I worry at this point of a possible drive failure while this parity error is being diagnosed? (no jinxies, no jinxies, no jinxies...)

You could run smartctl reports on your drives to see if anything looks suspicious. So far there is no sign of drive problems, but that is always a possibility.

The syslog looks like it is having memory allocation problems: either memory is filling up (I need to see the prior syslog), the memory is bad, or Linux has become unstable. But all of that was a couple of weeks ago; there have been no log entries of note in a while.

If memory is ok, we should look at the smart reports. If they look okay, we should reboot and look at the fresh syslog. If that is okay, we should run another parity check.
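For anyone following along, this hedged Python sketch shows what to look for in a `smartctl -A /dev/sdX` report: it pulls the raw values of `Reallocated_Sector_Ct` and `Spin_Retry_Count` out of the attribute table. The sample text is a made-up excerpt in the usual smartmontools column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), not an actual report from this server. Some vendors format raw values differently, so treat the parser as a starting point.

```python
# Illustrative smartctl -A excerpt; the numbers are made up for this example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       62
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       2
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34
"""

WATCH = ("Reallocated_Sector_Ct", "Spin_Retry_Count")


def smart_raw_values(report, names=WATCH):
    """Map attribute name -> raw value for the named attributes in a smartctl -A table."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            values[fields[1]] = int(fields[9])
    return values


print(smart_raw_values(SAMPLE))  # → {'Reallocated_Sector_Ct': 62, 'Spin_Retry_Count': 2}
```

On the server you would feed it the text output of `smartctl -A` for each drive in turn.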


April 7, 2011

Author

Correct.

Bad memory can cause parity errors, because the computer computes a result in a memory location, a bad memory cell flips a bit, and the answer comes out wrong. Bad memory and dirty shutdowns are the two most frequent causes of parity errors.

Can you look in your /var/log directory on your server and see if there is another syslog in there (maybe "syslog.1" or something). The syslog you posted appears to be a restarted syslog. The other one must have filled up.

I see signs of crashing processes. That may be another sign of bad memory, or could mean something else. The earlier syslog would be helpful.

Attached to this reply as requested. There didn't seem to be any other "syslog" files.

You could run smartctl reports on your drives to see if anything looks suspicious. So far there is no sign of drive problems, but that is always a possibility.

The syslog looks like it is having memory allocation problems: either memory is filling up (I need to see the prior syslog), the memory is bad, or Linux has become unstable. But all of that was a couple of weeks ago; there have been no log entries of note in a while.

If memory is ok, we should look at the smart reports. If they look okay, we should reboot and look at the fresh syslog. If that is okay, we should run another parity check.

Total n00b on this one, how exactly do I run a smartctl report? And should I do this before or after doing the overnight memtest?

Thanks for your patience!

syslog.1.zip


April 7, 2011

Run overnight memory test.


April 7, 2011

Author

Run overnight memory test.

Will do before hitting the hay tonight and will post results when I get back in from work tomorrow. Do I just interrupt the memtest in-progress when I return to it tomorrow then copy/paste whatever is on-screen? Or is there a log file generated somewhere?

In case this helps with the memory diagnosis (items that might cause memory outage?), here's a list of add-on packages in my unRAID setup (server has 4GB of RAM installed):

bwm-ng - Bandwidth Monitor NG (Next Generation), a live bandwidth monitor

Currently Installed. Will be automatically Re-Installed upon Re-Boot.

cxxlibs-6.0.9-i486.tgz library accidentally left out of unRAID 4.4-beta2 through 4.5beta5

Installed, Not Downloaded

lighttpd (pronounced "lighty")

Installed, Not Downloaded

unRAID-Web

Currently Installed. Will be automatically Re-Installed upon Re-Boot.

Thanks again bjp999!


April 7, 2011

Just look at the screen. It will tell you if there were memory errors. You could snap a picture of the screen and post it if you're not sure.


April 8, 2011

Author

Just look at the screen. It will tell you if there were memory errors. You could snap a picture of the screen and post it if you're not sure.

Sorry bjp, just got back from a looong workday. Anyway, the memtest still seems to be running, but no errors are displayed under any column. Picture attached.

Should I keep it going?


April 8, 2011

You can stop the memory test. Your memory is fine.

Reboot into unRaid and capture / post a syslog after booting. Also, run smartctl reports on each of your drives and post them. Follow the instructions in the "troubleshooting" link in my sig below to capture smart reports.


April 9, 2011

Author

You can stop the memory test. Your memory is fine.

Reboot into unRaid and capture / post a syslog after booting. Also, run smartctl reports on each of your drives and post them. Follow the instructions in the "troubleshooting" link in my sig below to capture smart reports.

Rebooted from memtest, captured syslog, then ran smartctl on each drive. Logs/results are attached. Thanks!

sys+smart.zip


April 9, 2011

Analysis ...

I am not an expert syslog reader, but I don't like the looks of this section:

Apr 9 00:29:47 servername kernel: ------------[ cut here ]------------
Apr 9 00:29:47 servername kernel: WARNING: at fs/proc/generic.c:590 proc_register+0x11c/0x14b()
Apr 9 00:29:47 servername kernel: Hardware name: X8SIL
Apr 9 00:29:47 servername kernel: proc_dir_entry 'scsi_tgt/mvst_scst' already registered
Apr 9 00:29:47 servername kernel: Modules linked in: mvsas(+) libsas scst scsi_transport_sas
Apr 9 00:29:47 servername kernel: Pid: 930, comm: modprobe Not tainted 2.6.32.9-unRAID #8
Apr 9 00:29:47 servername kernel: Call Trace:
Apr 9 00:29:47 servername kernel: [<c102449e>] warn_slowpath_common+0x60/0x77
Apr 9 00:29:47 servername kernel: [<c10244e9>] warn_slowpath_fmt+0x24/0x27
Apr 9 00:29:47 servername kernel: [<c109cf0e>] proc_register+0x11c/0x14b
Apr 9 00:29:47 servername kernel: [<c109d0cc>] proc_mkdir_mode+0x2f/0x43
Apr 9 00:29:47 servername kernel: [<c109d0ef>] proc_mkdir+0xf/0x11
Apr 9 00:29:47 servername kernel: [<f83f6a36>] scst_build_proc_target_dir_entries+0x55/0xdc [scst]
Apr 9 00:29:47 servername kernel: [<f83ddca9>] __scst_register_target_template+0x16c/0x3af [scst]
Apr 9 00:29:47 servername kernel: [<f845bd5d>] mvst_init+0x3b/0x5b [mvsas]
Apr 9 00:29:47 servername kernel: [<f8460895>] mvs_pci_init+0xaa5/0xaf7 [mvsas]
Apr 9 00:29:47 servername kernel: [<c10062d9>] ? dma_generic_alloc_coherent+0x0/0xdb
Apr 9 00:29:47 servername kernel: [<c1142050>] local_pci_probe+0xe/0x10
Apr 9 00:29:47 servername kernel: [<c11426ad>] pci_device_probe+0x48/0x66
Apr 9 00:29:47 servername kernel: [<c1194956>] driver_probe_device+0x79/0xed
Apr 9 00:29:47 servername kernel: [<c1194a0d>] __driver_attach+0x43/0x5f
Apr 9 00:29:47 servername kernel: [<c11940a7>] bus_for_each_dev+0x39/0x5a
Apr 9 00:29:47 servername kernel: [<f8468000>] ? mvs_init+0x0/0x45 [mvsas]
Apr 9 00:29:47 servername kernel: [<c119482f>] driver_attach+0x14/0x16
Apr 9 00:29:47 servername kernel: [<c11949ca>] ? __driver_attach+0x0/0x5f
Apr 9 00:29:47 servername kernel: [<c119451c>] bus_add_driver+0x9f/0x1c5
Apr 9 00:29:47 servername kernel: [<f8468000>] ? mvs_init+0x0/0x45 [mvsas]
Apr 9 00:29:47 servername kernel: [<c1194ccf>] driver_register+0x7b/0xd7
Apr 9 00:29:47 servername kernel: [<f8468000>] ? mvs_init+0x0/0x45 [mvsas]
Apr 9 00:29:47 servername kernel: [<c1142882>] __pci_register_driver+0x39/0x8c
Apr 9 00:29:47 servername kernel: [<f8468000>] ? mvs_init+0x0/0x45 [mvsas]
Apr 9 00:29:47 servername kernel: [<f8468030>] mvs_init+0x30/0x45 [mvsas]
Apr 9 00:29:47 servername kernel: [<c1001139>] do_one_initcall+0x4c/0x131
Apr 9 00:29:47 servername kernel: [<c1042e6e>] sys_init_module+0xa7/0x1dd
Apr 9 00:29:47 servername kernel: [<c1002935>] syscall_call+0x7/0xb
Apr 9 00:29:47 servername kernel: ---[ end trace 374000b44d7de86f ]---
Apr 9 00:29:47 servername kernel: [930]: scst: __scst_register_target_template:253:***ERROR***: Target driver mvst_scst already registered
Apr 9 00:29:47 servername kernel: [930]: scst: __scst_register_target_template:293:***ERROR***: Failed to register target template mvst_scst

On the smart reports,

SDB (PVS1QTZL) has 2 reallocated sectors. Not a major problem, but if you keep getting more reallocations it would be an indication of a failing disk.

SDF (9VS1VC6J) has 1 reallocated sector. Same deal.

SDG (9VS1T849) has 2 reallocated sectors. Same.

SDH (9VS1GM4B) has 2 reallocated sectors. Same.

SDL (WD-WCAYY0288947) has 4 reallocated sectors. Same.

SDQ (9VS1KL5W) has 62 reallocated sectors. It also has a spin retry count of 2 (unusual). This is more serious. My guess is that this drive is failing. It may or may not be causing the problem above. You could run a smartctl long test on the drive and see what it says. Afterward, look at the reallocated sector count again and see if it has gone up. My guess is that every time you do a full drive scan, the number is going to keep going up.

Hopefully Joe L. or one of the other Linux experts will look at your syslog and be able to interpret the syslog problem above. If the array has been working fine and suddenly this has started, my best guess is that SDQ is failing and causing the problem. But the error above could be unrelated to SDQ.


April 9, 2011

Author

Thanks for looking those over!

I'm going to run an extended smartctl on SDQ and post back the results.


April 10, 2011

Author

Extended smartctl test completed, log attached. It doesn't look like the reallocated sector or spin retry counts have changed. What should I try next?

smart_xt_SDQ.txt


April 10, 2011

Apr  9 00:29:47 servername kernel: proc_dir_entry 'scsi_tgt/mvst_scst' already registered

This is nothing. You have more than one of this type of card.

I would not worry about three errors. You can try to track them down but will probably not be able to determine which disk is wrong. I would just run another parity check and see what happens.

If you want to try to track the errors down you can run a NOCORRECT parity check and the location of the errors will appear in the log. Then you'll need to repeatedly read those locations on all disks to see if a disk is returning inconsistent results.
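The second half of that suggestion (read the same location on each disk several times and see whether the data changes) could be sketched in Python as below. This is an illustration under assumptions, not an unRAID tool: it takes a device path and a byte offset, so the logged position would have to be converted to bytes (for example by multiplying by 512 if the values are sectors, which I have not confirmed for 4.7), it needs root to open `/dev/sdX`, and it only ever reads. A disk that returns different hashes for the same range would be the suspect.

```python
import hashlib
import tempfile


def read_hashes(path, offset, length=4096, repeats=5):
    """Read the same byte range `repeats` times and return the SHA-1 of each read."""
    hashes = []
    for _ in range(repeats):
        # Reopen each time. Note: the Linux page cache can still answer repeat
        # reads from RAM; a true re-read would need dropped caches or O_DIRECT,
        # which this sketch does not do.
        with open(path, "rb", buffering=0) as dev:
            dev.seek(offset)
            hashes.append(hashlib.sha1(dev.read(length)).hexdigest())
    return hashes


def is_consistent(path, offset, **kwargs):
    """True if every repeated read of the range returned identical data."""
    return len(set(read_hashes(path, offset, **kwargs))) == 1


def demo_on_temp_file():
    """Self-check on an ordinary file instead of a real disk."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"parity" * 1000)
        tmp.flush()
        return is_consistent(tmp.name, 1024, length=512, repeats=3)


# Hypothetical usage on the server (as root), for the first logged location:
# for disk in ("/dev/sdb", "/dev/sdc"):
#     print(disk, is_consistent(disk, 13232 * 512))
```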


April 11, 2011


Not sure what "I would not worry about three errors" means. If the array was not shut down hard, there is no reason you should have parity errors.

I have seen parity errors after hard shutdowns, and have even seen parity errors show up AFTER the parity check that gets started after a hard shutdown. So if your computer has been hard shut down since the last parity check, that could still be the cause.

I am still concerned about the crashes I have seen in the old and new syslogs. I still think there is something serious wrong. I'd recommend sending a note to Joe L. asking him to take a look.

You could try re-running your parity checks. If one completes with no sync errors, I would run another and another. If you can run 3 consecutive parity checks with no errors, I think your data is safe. Make sure to keep copies of the syslogs, and track the locations of any parity sync errors.
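The bookkeeping suggested here (keep each check's results, note where sync errors land, and look for three consecutive clean checks) could be tracked with something like the Python sketch below. The run history is hypothetical; in practice each list would come from the `parity incorrect:` lines in that check's syslog. Flagging positions that appear in more than one check is my own addition: a location that keeps coming back points at something persistent rather than a one-off glitch.

```python
from collections import Counter


def summarize_checks(runs, clean_needed=3):
    """runs: parity checks, oldest first, each a list of error positions.

    Returns (positions seen in more than one check,
             whether the last `clean_needed` checks were all clean).
    """
    counts = Counter(pos for run in runs for pos in set(run))
    recurring = sorted(pos for pos, n in counts.items() if n > 1)
    recent = runs[-clean_needed:]
    clean_streak = len(recent) == clean_needed and all(not run for run in recent)
    return recurring, clean_streak


# Hypothetical history: the April 6 check, then three clean ones.
history = [[13232, 13240, 35808], [], [], []]
print(summarize_checks(history))  # → ([], True)
```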


April 11, 2011

I would not worry about 3 parity errors because the effort it takes to determine which disk they actually lie on is very high and the chance of finding them is very low. The errors could be on any disk, even on three different disks. unRAID reports these errors but gives no indication of where they lie. This is unRAID's major weakness. (Don't get me wrong, unRAID is great if you have a disk failure.) These errors most likely lie in a music or video file, and they will be inconsequential to playback.

If this is a symptom of a more serious problem, such as crashing, then of course the serious problem needs to be diagnosed and remedied. But 3 parity errors in and of themselves are not something to lose sleep over.
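The point that unRAID can't say which disk is wrong follows from how single parity works. In the toy XOR model below (a simplification for illustration, not unRAID's code), a parity check can only observe the XOR of all data disks with the stored parity, so the same bit flipped on disk 1 or on disk 3 produces exactly the same mismatch.

```python
from functools import reduce


def xor_all(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def syndrome(disks, parity):
    """What a parity check can observe: data XOR stored parity (all zero if in sync)."""
    return xor_all(disks + [parity])


disks = [bytearray(b"\x11\x22\x33"), bytearray(b"\x44\x55\x66"), bytearray(b"\x77\x88\x99")]
parity = xor_all(disks)


def flip(disk_index, byte_index, bit):
    """Copy the array and flip one bit on one disk."""
    copy = [bytearray(d) for d in disks]
    copy[disk_index][byte_index] ^= 1 << bit
    return copy


s1 = syndrome(flip(0, 1, 3), parity)
s3 = syndrome(flip(2, 1, 3), parity)
print(s1 == s3, s1.hex())  # → True 000800 (same mismatch, no hint which disk changed)
```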


April 11, 2011

Author


As previously noted, I have not had any hard-shutdowns since I've put the server online. Also, I'm a bit confused by the use of the "crashing" terminology. What, exactly, has crashed and where?

Note that the server has not been unresponsive or otherwise inaccessible. Granted, I'm not much of a techie in these matters, but I understand dgaschk's point that I don't want to make a mountain out of a molehill, especially if there really isn't a mole.

I've sent a PM to Joe L. to please look over the syslog, but haven't gotten a response as yet. In any case, if it's helpful to go over it again, this is the sequence of events that got me here:

1) Server has been running fine, no issues, nothing other than greens

2) Ran manual Parity Check since I do it roughly once a month

3) Came home to completion, saw digest "Parity is Valid:. Last parity check < 1 day ago . Parity updated 3 times to address sync errors" (I IMMEDIATELY assume that Parity is currently valid as it indicates - I gotta take it at face value, right? I also take this to read that, while it is currently valid, it's also telling me that it found - and corrected - 3 errors)

4) Asked for advice on this forum re: the 3 errors

5) Ran memtests and smartctl per bjp999's suggestion

6) As of this writing (after memtest + smartctl results reported), sounds like going any further to try to determine the cause for those 3 errors might be an exercise in diminishing returns

Again, my biggest concern at this point is the mention of crash/es/ing as - presumably - captured by my syslog.

I'm operating my unRAID server on the premise that it's "supposed" to give me the one-drive safety net if and when a failure actually happens. If I (we) had to chase down every potential failure-in-the-making, we'd be too paranoid to ever let things be. Am I naive to place that much "trust" in my unRAID setup? Should I second-guess it when it shows all green? More to the point, given what's been posted in this thread so far, is there anything I currently *really* have to worry about?

Again, I have to operate on the premise that if the main board shows me green, then I'm green, right?

I'll run another Parity Check again when I get home tonight and see what it says in the morning. Thanks for the continuing input and education!


April 11, 2011

Personally, I would run another parity check and then look at the smart reports again and see if any of the reallocated sector counts increase.

If they increase... then in my opinion the drive can't be trusted.
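This rule of thumb (take SMART readings before and after a full-disk read such as a parity check, and distrust any drive whose reallocated count climbs) boils down to a simple comparison. In the Python sketch below, the "before" numbers are the reallocated-sector counts bjp999 listed earlier in the thread; the "after" numbers are hypothetical, chosen only to show a drive getting worse.

```python
def drives_getting_worse(before, after):
    """Return drives whose reallocated-sector count rose between two snapshots."""
    return sorted(
        drive for drive, count in after.items()
        if count > before.get(drive, 0)
    )


# Reallocated_Sector_Ct raw values from the SMART reports discussed in this thread.
before = {"sdb": 2, "sdf": 1, "sdg": 2, "sdh": 2, "sdl": 4, "sdq": 62}
# Hypothetical readings taken after another parity check.
after = {"sdb": 2, "sdf": 1, "sdg": 2, "sdh": 2, "sdl": 4, "sdq": 65}

print(drives_getting_worse(before, after))  # → ['sdq']
```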
