[4.6] Constant but consistent parity errors

January 10, 2011

This is baffling me now. One drive out of an 11+1 array showed 253 errors on a recent parity check (all drives are jumpered -20EARS). Since I've experienced several "harmless" parity errors before, after a review of the syslog didn't find anything obviously fatal, I simply unassigned/reassigned the drive in question to let unRAID rebuild the data drive and expected no errors.

This time was different as I got the exact same 253 errors, though the drive indicator was GREEN. I left the unRAID online then several days later the same drive showed RED again.

Fearing the worst, I pulled the jumpered -20EARS drive and replaced it with another one and left unRAID to rebuild the data.

Got the SAME 253 errors.

Again, after reviewing all the logs, any apparent errors have escaped me (or my old eyes), so I have attached the logs involved in each parity error incident.

Now what do I do? Do I need to disregard my parity drive and rebuild THAT instead of trusting the parity drive?

NOTE: I am allowed to only attach my last SysLog involving the new drive as the other syslogs (I kept all of them for every parity error incident) are too large individually to attach as I'm limited to 192KB. I can e-mail the other syslogs upon request.

syslog_parity_errors_with_NEW_drive.txt

January 11, 2011

This is baffling me now. One drive out of an 11+1 array showed 253 errors on a recent parity check (all drives are jumpered -20EARS). Since I've experienced several "harmless" parity errors before, after a review of the syslog didn't find anything obviously fatal, I simply unassigned/reassigned the drive in question to let unRAID rebuild the data drive and expected no errors.

Exactly 253 errors every time is strange, and I really do not know what to suggest overall. But since you mention the "harmless" errors above, which I find somewhat troubling, and you have your complete hardware configuration in your signature, I will try to give you my thoughts.

I am curious when you started experiencing these "harmless" parity errors: right from the start, or after you expanded your system past 8, 9, 10, 11 or 12 drives?

The thing is, you are using an HTPC case and you must have made some modifications to it to fit 12 HDs inside. Do you have additional cooling, and what is the temperature of the HDs during these parity checks / rebuilds?

Since you are using only WD -20EARS drives, keep in mind that a recent study by a Russian data-recovery lab (available on Tom's Hardware) found that recent high-capacity WD hard drives have very heat-sensitive heads, and the lab does not recommend running them above 40 degrees. By cramming 12 HDs into that case you may have put some of the existing HDs in real hot spots, and this could result in these "harmless" parity errors.

The second issue is your choice of power supply. It is not a single-rail one, and with 12 HDs, especially running at elevated temperature, it may be at its limits, which may lead to higher ripple. The reviews on Newegg are also not very positive, with a lot of early failures. It also matters how you distributed the power load across all these HDs, especially the HDs attached to the Molex connectors (do you use all four of them or just two?).

This is speculation on my part, but if I were testing, I would open the case, place a big fan blowing air over the drives, and use a different and better single-rail power supply to see if that gets rid of these parity errors once and for all.

Good luck.

January 11, 2011

I have to ask,

Have you attempted to replace the SATA cable? Quickest thing to do and run a test.

January 11, 2011

Author

I've experienced exactly TWO "harmless" errors so far; and one was due to losing power completely to the household. So I doubt it has anything to do with the system power supply.

Having said that, once I passed 10 hard drives, I bought a higher capacity Cooler Master M850 just to be safe; just haven't installed it yet.

Could it be that the "253" errors involves the parity drive and that was compromised by insufficient power? Perhaps.

The fact is that the unRAID log has not shown me any obvious errors on any specific drive; the GUI just reports 253 errors every time, and I had a red dot on the same data drive but have since replaced it and still get 253 errors. To me, this can only suggest an issue with the parity drive, as I've already replaced the other possibility.

Regarding the HTPC case, there are 2 racks that hold 4 drives each, then a double-height 5.25" rack above a fixed 3.5" mount. I converted the double 5.25" rack into three 3.5" drive mounts, so in essence, that is the only modification I have made. As far as heat is concerned, there are typically only one or two drives spinning, as this is a dedicated media server for just one entertainment system, so there are no other demands on the server from different media players. In fact, the media server is sitting right next to me as I'm waiting for a remote power switch from a manufacturer that would allow me to power the system from infrared or RF remotes, which is how I know the number of drives that are typically spinning.

January 11, 2011

Author

I have to ask,

Have you attempted to replace the SATA cable? Quickest thing to do and run a test.

I have thought of that. Unfortunately, the cable is part of a mini-SAS to 4-SATA breakout cable, so I would have to replace the entire assembly. Though would I always get the same 253 errors if it was a bad cable? I would think I would get a random set of errors...

January 15, 2011

Author

UPDATE: After several days I have experienced no further issues so far.

To determine if the power supply is sufficient, I did some research on the power consumption of the components:

1) WD20EARS = 7.14W constant activity, 14.01W startup (http://www.storagereview.com/western_digital_caviar_green_2tb_review_wd20ears)

2) X7SPA-HF+AOC-SASLP-MV8 = 34W idle, 36W startup (http://www.silentpcreview.com/forums/viewtopic.php?t=57817)

Thus, the max power consumption can range from 120W to 205W; well within the 480W capacity of the Zalman PSU.
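
The range above follows from multiplying the per-drive figures by the 12 drives in the array and adding the board-plus-controller figure. A minimal sketch of that arithmetic, assuming all 12 drives are in the same state at the same moment (the constant and function names are illustrative only):

```python
# Rough whole-system estimate from the figures quoted in this post.
DRIVES = 12                # 11 data drives + 1 parity drive

DRIVE_ACTIVE_W = 7.14      # WD20EARS, constant activity
DRIVE_STARTUP_W = 14.01    # WD20EARS, spin-up
BOARD_IDLE_W = 34.0        # X7SPA-HF + AOC-SASLP-MV8, idle
BOARD_STARTUP_W = 36.0     # X7SPA-HF + AOC-SASLP-MV8, startup


def total_watts(per_drive_w, board_w, drives=DRIVES):
    """Total draw if every drive pulls per_drive_w at the same time."""
    return drives * per_drive_w + board_w


steady = total_watts(DRIVE_ACTIVE_W, BOARD_IDLE_W)
startup = total_watts(DRIVE_STARTUP_W, BOARD_STARTUP_W)

print(f"all drives active:      {steady:.1f} W")
print(f"all drives spinning up: {startup:.1f} W")
```

This comes to roughly 120 W and 204 W in total. It is a whole-system wattage figure only and does not say how the load is split across the supply's individual rails.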

January 15, 2011

UPDATE: After several days I have experienced no further issues so far.

After doing some research to determine if the power supply is sufficient, I did some research on the power consumption of the components:

1) WD20EARS = 7.14W constant activity, 14.01W startup (http://www.storagereview.com/western_digital_caviar_green_2tb_review_wd20ears)

2) X7SPA-HF+AOC-SASLP-MV8 = 34W idle, 36W startup (http://www.silentpcreview.com/forums/viewtopic.php?t=57817)

Thus, the max power consumption can range from 120W to 205W; well within the 480W capacity of the Zalman PSU.

We like to estimate 2 Amps per "green" drive on spin-up. The SASLP would be about 3 Amps (36 Watts / 12 Volts = 3 Amps).

You have 12 disks. I would estimate 24 Amps for them, plus 3 Amps for the disk controller: 27 Amps * 12 Volts = 324 Watts.

Based on your 205 watt estimate I took a closer look at your power supply. It is a dual-rail supply. One of those rails is dedicated to the CPU power plugs, the other for the SATA and MOLEX connectors, and it is also usually shared by the 24 Pin motherboard connector.

12v1 = 16 Amps, 12v2 = 18 Amps.

According to the review, the rails are split like I said:

This power supply features two +12 V virtual rails distributed like this:

* +12V1 (solid yellow wire): All cables but the ATX12V.

* +12V2 (yellow with black stripe wire): ATX12V connector.

So, you actually have a 16 amp supply, and it is shared with the 24 Pin connector for the motherboard.

If you use your estimate of 205 Watts you are drawing 17 Amps (205 Watts / 12 Volts = 17.08 Amps), not counting what the motherboard, memory, or fans are using. That is over your 12v1 rating.

If you use my estimate based on 2 Amps per drive when initially spinning up, you are attempting to draw 27 Amps from a supply rail rated for 16, and that is again without counting what the motherboard is using for memory, etc.

From my perspective, you are lucky the server is working as well as it is. Get a true high current single 12 Volt Rail supply and stop fooling yourself about the "advertised" capacity of your existing supply. As far as the only 12V rail you can use is concerned, it is a 192 watt supply. (16x12=192) even if the marketing of the power supply says otherwise.

Joe L.
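
The rail arithmetic in the post above can be checked with a short sketch. Like the post, it treats each estimate as if it were drawn entirely from the 16 Amp +12V1 rail, which is a simplification; the names are illustrative and not taken from any PSU documentation.

```python
# Compare the two current estimates against the rail that feeds the drives.
RAIL_12V1_AMPS = 16.0   # rail for all cables except the ATX12V connector
VOLTS = 12.0
DRIVES = 12


def amps_from_watts(watts, volts=VOLTS):
    """Convert a wattage estimate to current at the given voltage."""
    return watts / volts


# 2 Amps per "green" drive at spin-up, plus 36 W for the controller.
spinup_amps = DRIVES * 2.0 + amps_from_watts(36.0)

estimates = {
    "205 W estimate": amps_from_watts(205.0),
    "2 A per drive at spin-up": spinup_amps,
}

for label, amps in estimates.items():
    margin = RAIL_12V1_AMPS - amps
    print(f"{label}: {amps:.2f} A, margin on the 16 A rail: {margin:+.2f} A")

rail_watts = RAIL_12V1_AMPS * VOLTS
print(f"usable 12 V capacity on that rail: {rail_watts:.0f} W")
```

Both estimates come out with a negative margin, which is the point being made: the advertised total wattage matters less than the current available on the one rail the drives share.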

January 15, 2011

Do I point out the obvious?

The first drive red-balled which means unRAID had an issue writing to it.

The drive data would have been synchronized to the parity data when you rebuilt the data onto the first drive and then the new drive. So, after the first disk rebuild you shouldn't have parity errors.

When do you see these errors? During the disk rebuild or during a parity check after the rebuild?

Joe - Could a HPA cause something like this?

Peter

January 15, 2011

Do I point out the obvious?

The first drive red-balled which means unRAID had an issue reading from it.

A red indicator would be if unRAID could not WRITE to the drive. Read errors are just shown on the management console as "errors" associated with a drive.

January 15, 2011

Joe, you are correct. My point is that these errors aren't just a simple parity mismatch.

Peter

January 17, 2011

Author

Do I point out the obvious?

The first drive red-balled which means unRAID had an issue writing to it.

The drive data would have been synchronized to the parity data when you rebuilt the data onto the first drive and then the new drive. So, after the first disk rebuild you shouldn't have parity errors.

When do you see these errors? During the disk rebuild or during a parity check after the rebuild?

Joe - Could a HPA cause something like this?

Peter

The first "red ball" appeared after a "routine" parity check came up with 253 errors. I had the suspect data drive rebuilt and it showed the same number of errors upon completion (253) but no red ball. A day or so afterwards the same data drive "red-balled" on its own unRAID logic (meaning it detected an issue during an unknown write process as I don't recall performing any file transfers at the time). I replaced it with a new, similar spec'd drive and had data rebuilt on it. 253 errors after completion, but no red-ball since.

I appreciate the "schooling" on the PSU, however, again, I would be inclined to believe if this is all power related, I would be getting RANDOM errors across different drives; not the same 253 errors on the same drive channel. By similar logic, this would also rule out a bad SATA cable and bad card.

If it's always 253 errors, I would believe the issue is specific to the parity drive...

January 17, 2011

Always the same errors is why I asked if a HPA could be the cause. Something is causing you to have read errors.

If you rebuilt the data drive, then the data drives and the parity drive should both be in sync with each other. That is how it works, the data drive is rebuilt using the parity drive info so the data drive will be sync'd to the parity. If there is an error on the parity drive then there is a matching error on the rebuilt data drive and those errors effectively cancel each other out. I believe this is what you are questioning...

Peter
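
The "errors cancel each other out" argument above can be illustrated with a toy array. This is only a sketch, assuming simple XOR parity across equal-sized disks; the byte values and helper names are made up for illustration and are not unRAID code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_mismatches(data_disks, parity):
    """Parity check: positions where parity differs from the XOR of the data."""
    expected = xor_blocks(data_disks)
    return sum(1 for e, p in zip(expected, parity) if e != p)


disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

# Start with correct parity, then corrupt one parity byte.
parity = bytearray(xor_blocks(disks))
parity[2] ^= 0xFF
parity = bytes(parity)

before = count_mismatches(disks, parity)

# Rebuild disk 1 from the (bad) parity plus the other data disks.
rebuilt = xor_blocks([parity, disks[0], disks[2]])
after = count_mismatches([disks[0], rebuilt, disks[2]], parity)

print("mismatches before rebuild:", before)
print("mismatches after rebuild: ", after)
print("rebuilt disk matches original:", rebuilt == disks[1])
```

In this toy case the check after the rebuild is clean even though the rebuilt disk now carries the bad byte, so a rebuild followed by a check that still reports errors points at something other than a stale parity value.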

January 19, 2011

Author

UPDATE:

I completed the first new parity check since swapping a new drive and came up with ZERO errors.

Hmm... I guess that's a good thing.

Doesn't explain all the previous errors...

January 26, 2011

Author

UPDATE:

I attempted to run another parity check just to see how things are going and then it started reporting parity errors. Then suddenly, the drive on the problematic channel red-balled.

I'm going to order a new break-out cable and while I'm at it, put in the new PSU, and see where that gets me...

January 27, 2011

Could a HPA cause something like this?

My first thought too, but it is not a Gigabyte board, and there are no HPAs.

NOTE: I am allowed to only attach my last SysLog involving the new drive as the other syslogs (I kept all of them for every parity error incident) are too large individually to attach

The syslog you attached does not show any problems, just that you rebuilt Disk 10. Could you zip up one or more of the syslogs with parity errors? When a syslog is large, we zip them and attach the zip file. Syslogs are very repetitive text and compress very small, especially if there are lots of repetitive error messages!
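
As a rough illustration of how well repetitive log text compresses, here is a small sketch using Python's standard zipfile module. The sample repeats a single line from this thread 20,000 times, which is far more uniform than a real syslog, so a real file will not shrink this much; the helper name is hypothetical.

```python
import io
import zipfile


def zipped_size(text, name="syslog.txt"):
    """Return the size in bytes of a zip archive holding text as one file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, text)
    return len(buf.getvalue())


line = "Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link\n"
text = line * 20000          # synthetic, highly repetitive log

raw = len(text.encode())
packed = zipped_size(text)

print(f"uncompressed: {raw} bytes")
print(f"zipped:       {packed} bytes")
```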

January 27, 2011

Author

I've attached the first part of the syslog where the initial errors occurred, but the second part is 10.5 MB uncompressed, 164 KB compressed, so I will break the second part into two files...

syslog_initial_parity_error-part_1.txt.zip

January 27, 2011

Author

Part 2 of the initial syslog...

syslog_initial_parity_error-part_2.txt.zip

January 27, 2011

Author

Final part 3 of the initial syslog.

syslog_initial_parity_error-part_3.txt.zip

January 27, 2011

Author

The syslog you attached does not show any problems, just that you rebuilt Disk 10. Could you zip up one or more of the syslogs with parity errors? When a syslog is large, we zip them and attach the zip file. Syslogs are very repetitive text and compress very small, especially if there are lots of repetitive error messages!

That's one of the things that is baffling me. I rebuilt the drive and unRAID showed 253 errors, yet that syslog showed no errors that I could discern.

That does not sound right, nor does it give me any peace of mind that unRAID is properly recording errors to the log when the GUI shows errors...

January 27, 2011

Jan 1 20:06:38 UnRAID kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 1 20:06:38 UnRAID kernel: ata7.00: configured for UDMA/133
Jan 1 20:06:38 UnRAID kernel: ata7: EH complete
Jan 1 20:06:39 UnRAID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan 1 20:06:39 UnRAID kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 1 20:06:39 UnRAID kernel: ata7: SError: { UnrecovData Handshk }
Jan 1 20:06:39 UnRAID kernel: ata7.00: failed command: WRITE DMA EXT
Jan 1 20:06:39 UnRAID kernel: ata7.00: cmd 35/00:08:6f:41:54/00:00:dc:00:00/e0 tag 0 dma 4096 out
Jan 1 20:06:39 UnRAID kernel: res 50/00:00:6e:12:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Jan 1 20:06:39 UnRAID kernel: ata7.00: status: { DRDY }
Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link
Jan 1 20:06:39 UnRAID emhttp: shcmd (28): cp /var/spool/cron/crontabs/root- /var/spool/cron/crontabs/root
Jan 1 20:06:39 UnRAID emhttp: shcmd (29): echo '# Generated mover schedule:' >>/var/spool/cron/crontabs/root
Jan 1 20:06:39 UnRAID emhttp: shcmd (30): echo '40 3 * * * /usr/local/sbin/mover 2>&1 | logger' >>/var/spool/cron/crontabs/root
Jan 1 20:06:39 UnRAID emhttp: shcmd (31): crontab /var/spool/cron/crontabs/root
Jan 1 20:06:39 UnRAID kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 1 20:06:39 UnRAID kernel: ata7.00: configured for UDMA/133
Jan 1 20:06:39 UnRAID kernel: ata7: EH complete
Jan 1 20:06:39 UnRAID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan 1 20:06:39 UnRAID kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 1 20:06:39 UnRAID kernel: ata7: SError: { UnrecovData Handshk }
Jan 1 20:06:39 UnRAID kernel: ata7.00: failed command: WRITE DMA EXT
Jan 1 20:06:39 UnRAID kernel: ata7.00: cmd 35/00:08:a7:04:f2/00:00:9c:00:00/e0 tag 0 dma 4096 out
Jan 1 20:06:39 UnRAID kernel: res 50/00:00:76:41:54/00:00:dc:00:00/e0 Emask 0x10 (ATA bus error)
Jan 1 20:06:39 UnRAID kernel: ata7.00: status: { DRDY }
Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link
Jan 1 20:06:39 UnRAID kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 1 20:06:39 UnRAID kernel: ata7.00: configured for UDMA/133
Jan 1 20:06:39 UnRAID kernel: ata7: EH complete
Jan 1 20:06:39 UnRAID kernel: ata7: limiting SATA link speed to 1.5 Gbps
Jan 1 20:06:39 UnRAID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan 1 20:06:39 UnRAID kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 1 20:06:39 UnRAID kernel: ata7: SError: { UnrecovData Handshk }
Jan 1 20:06:39 UnRAID kernel: ata7.00: failed command: WRITE DMA
Jan 1 20:06:39 UnRAID kernel: ata7.00: cmd ca/00:10:cf:00:54/00:00:00:00:00/eb tag 0 dma 8192 out
Jan 1 20:06:39 UnRAID kernel: res 50/00:00:46:00:54/00:00:0b:00:00/eb Emask 0x10 (ATA bus error)
Jan 1 20:06:39 UnRAID kernel: ata7.00: status: { DRDY }
Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link
Jan 1 20:06:40 UnRAID kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 1 20:06:40 UnRAID kernel: ata7.00: configured for UDMA/133
Jan 1 20:06:40 UnRAID kernel: ata7: EH complete
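
Since these events are easy to miss in a multi-megabyte syslog, a small script can tally them per port. The following is a minimal sketch, assuming the kernel message layout seen in the excerpt above; the regular expressions and the summarize helper are illustrative and have only been matched against these sample lines.

```python
import re
from collections import Counter

# A few lines copied from the excerpt above.
SAMPLE = """\
Jan 1 20:06:39 UnRAID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan 1 20:06:39 UnRAID kernel: ata7.00: failed command: WRITE DMA EXT
Jan 1 20:06:39 UnRAID kernel: res 50/00:00:6e:12:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link
Jan 1 20:06:39 UnRAID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan 1 20:06:39 UnRAID kernel: ata7.00: failed command: WRITE DMA
Jan 1 20:06:39 UnRAID kernel: res 50/00:00:46:00:54/00:00:0b:00:00/eb Emask 0x10 (ATA bus error)
Jan 1 20:06:39 UnRAID kernel: ata7: hard resetting link
"""

EXCEPTION = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask")
FAILED = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: failed command: (.+)$")
CAUSE = re.compile(r"kernel:\s+res .*\((.+)\)\s*$")


def summarize(lines):
    """Count exceptions per port, failed commands, and reported causes."""
    exceptions, commands, causes = Counter(), Counter(), Counter()
    for line in lines:
        m = EXCEPTION.search(line)
        if m:
            exceptions[m.group(1)] += 1
            continue
        m = FAILED.search(line)
        if m:
            commands[(m.group(1), m.group(2).strip())] += 1
            continue
        m = CAUSE.search(line)
        if m:
            causes[m.group(1)] += 1
    return exceptions, commands, causes


exceptions, commands, causes = summarize(SAMPLE.splitlines())
print("exceptions per port:", dict(exceptions))
print("failed commands:    ", dict(commands))
print("reported causes:    ", dict(causes))
```

To run it against a real file, pass the open file object instead of the sample, for example summarize(open("syslog.txt")), where the filename is whatever the syslog was saved as.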

January 27, 2011

Author

Yes, that was the initial error.

But after rebuilding I got exactly 253 errors.

So I replaced it with a brand new drive and still got 253 errors via the GUI, but the syslog did not show it anywhere I could find.

January 29, 2011

Author

UPDATE:

I changed PSU, I changed break-out cable, then did a parity rebuild on the problematic drive channel and got 400 errors. All on ATA7 (the same channel).

At this point, I can only surmise that it is a failing/bad AOC-SASLP-MV8 card as I've changed every possible component on the channel but continue to get errors.

February 3, 2011

Author

Is it possible to continue using unRAID with drives missing while the card they were attached to is out for repair?

February 3, 2011

Is it possible to continue to use unRAID when drives are missing due to vacated card shipped for repair?

If you write to your disks you will break parity.

And even mounting a disk does some writing.

What I'd likely do, although it is not strongly recommended, would be the following.

Put your current USB key aside.

Prepare a new USB key (use a second key if you have it, or free license).

Boot using fresh USB stick. Assign your disks to the disk slots. Do not assign parity. If you have the free license you'll only be able to mount 2 at a time.

Start the array. Use your array in strictly read-only mode.

If you need to mount other disks, just stop the array, shuffle the disk assignments, and restart. Did I mention not to assign parity? That's important - make that mistake and you trash a disk in .1 second.

This will allow you to watch movies and access your files.

When you get your controller back, you can move back to your original USB key. Once you get the array to start (hopefully clean), you should run a parity check. It will likely find a few issues in the early sectors, but let it run to completion. It's a good test, plus you may have inadvertently written something to the disk and you'll want parity 100% perfect.

February 18, 2011

Author

UPDATE:

I've waited two weeks to get my board "repaired" (it wasn't replaced), and when it was returned, I commenced with a parity check...

BAM: 77 errors.

Reading the log, I'm now getting errors on disk0 (parity drive), in addition to the same "media error" on ata7 (the original problematic channel).

But after contemplating the error on ata7, is it really a fatal error? The following is the base error, which repeats 24 times:

Feb 17 22:14:26 UnRAID kernel: ata7: EH complete
Feb 17 22:14:28 UnRAID kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 17 22:14:28 UnRAID kernel: ata7.00: irq_stat 0x40000001
Feb 17 22:14:28 UnRAID kernel: ata7.00: failed command: READ DMA EXT
Feb 17 22:14:28 UnRAID kernel: ata7.00: cmd 25/00:80:ef:92:c0/00:02:6a:00:00/e0 tag 0 dma 327680 in
Feb 17 22:14:28 UnRAID kernel:          res 51/40:bf:9f:94:c0/00:00:6a:00:00/e0 Emask 0x9 (media error)
Feb 17 22:14:28 UnRAID kernel: ata7.00: status: { DRDY ERR }
Feb 17 22:14:28 UnRAID kernel: ata7.00: error: { UNC }
Feb 17 22:14:28 UnRAID kernel: ata7.00: configured for UDMA/133

The reason I question this now is that the unRAID MAIN page shows the errors all on the parity drive in the ERRORS column. And this time the log now clearly shows errors on the parity disk:

Feb 17 22:14:28 UnRAID kernel: md: disk0 read error
Feb 17 22:14:28 UnRAID kernel: handle_stripe read error: 1791005792/0, count: 1
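
The cmd and res lines in the ata7 excerpt can be decoded to see which sectors were involved. This is a sketch under stated assumptions: that the fields follow the usual libata order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), that the number in the handle_stripe message is a partition-relative sector, and that the partition starts at sector 63. None of these is confirmed by the log itself, and the function names are illustrative.

```python
def parse_taskfile(fields):
    """Split 'CMD/f:n:l:m:h/F:N:L:M:H/DEV' into integers."""
    first, low, high, device = fields.split("/")
    low = [int(x, 16) for x in low.split(":")]    # feature, nsect, lbal, lbam, lbah
    high = [int(x, 16) for x in high.split(":")]  # the matching high-order (HOB) bytes
    return int(first, 16), low, high, int(device, 16)


def lba48(fields):
    """Assemble the 48-bit LBA from the six LBA bytes, most significant first."""
    _, low, high, _ = parse_taskfile(fields)
    value = 0
    for byte in (high[4], high[3], high[2], low[4], low[3], low[2]):
        value = (value << 8) | byte
    return value


def sector_count(fields):
    """16-bit sector count from hob_nsect and nsect."""
    _, low, high, _ = parse_taskfile(fields)
    return (high[1] << 8) | low[1]


CMD = "25/00:80:ef:92:c0/00:02:6a:00:00/e0"   # from the 'cmd' line
RES = "51/40:bf:9f:94:c0/00:00:6a:00:00/e0"   # from the 'res' line
PARTITION_START = 63                           # assumption, see above

start = lba48(CMD)
count = sector_count(CMD)
failed = lba48(RES)

print("read starts at LBA:", start)
print("sectors requested: ", count, f"({count * 512} bytes)")
print("failing LBA:       ", failed)
print("minus partition start:", failed - PARTITION_START)
```

Under those assumptions the request is 640 sectors (327680 bytes, matching "dma 327680 in") starting at LBA 1791005423, and the failing LBA is 1791005855. Subtracting 63 gives 1791005792, the same number that appears in the handle_stripe line, which would suggest the ata7 media error and the disk0 read error describe the same event.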

What should my next step be? Disregard the parity drive and unassign it from the array, install a new drive, then build parity?

syslog-2011-02-18.txt.zip
