[Resolved] 5 Errors After Every Parity Check



@johnnie.black, could the parity issue be caused by having some drives on the H310 and others on the Mobo controllers? The Cache & Parity Drives are on the Mobo 6Gb/s ports; 2 of the older, slower HDDs are on the Mobo 3Gb/s controller.

 

Update: of the onboard 6Gb/s ports, 2 are Intel and 2 are Marvell.

26 minutes ago, crowdx42 said:

Sounds like your issues may finally be resolved. I am not sure if you posted this, but what motherboard, CPU, memory, etc. are you running? My main setup is all Intel, so I am wondering if there is any connection.

Also, for the SAS card update, did you have to check via command line?

If memory serves, it's done via command line from a bootable USB stick. There should be a utility included inside the zip file to set this up.

 

here are my specs (which can be found in my signature; thought I had it set right so people could see)

 

Tower: unRAID 6.3.3 Pro | Mobo: ASUS P8Z68 DELUXE/GEN3 | CPU: Intel® Core™ i7-2600 CPU @ 3.40GHz | HBA: SUPERMICRO AOC-SAS2LP-MV8 DELL H310 PERC 0HV52W FLASHED: LSI SAS2008 (P10)

PSU: CoolerMaster G750M | RAM: 4GBx4 Non-ECC | Case: Sharkoon T9 Value | Parity: 4TBx2 Seagate NAS | Cache: 500GBx2 Samsung 850 EVO

Network: Dual Gigabit; Bonded | Storage: 25TB | 4TBx3 Seagate NAS, 2TBx1 Seagate, 4TBx1 HGST, 3TBx1 Seagate, 2TBx2 Hitachi | Dockers: BOINC for SETI


Seems like the CPU and, I would assume, the chipset on your system are a little older than what I am running. I am on Haswell (mainly hoping for better power consumption :) ). I am going to drop the 8TB second parity drive into my server and set it to rebuild. As stated earlier, I now have both parity drives on separate controllers (originally they were both on the same one).

I too have some of my drives on the motherboard sata controllers but that has never caused an issue in the past.

Something I am thinking about though, when a parity check is run, it has to be stressful on the whole system and so if there is any point of failure (particularly a failing drive) a parity check may push it over the edge. I am wondering if my initial issue was a failing drive which caused the first parity errors and then it was that same drive I moved to become a parity drive (dumb luck) , thinking the parity drive was failing.

A parity check validates against the drives in the system? If so, if a drive in the array is failing, it could cause parity errors?

10 minutes ago, crowdx42 said:

 

A parity check validates against the drives in the system? If so, if a drive in the array is failing, it could cause parity errors?

 

I believe that's correct. It reads the data drives, recomputes parity, and compares the result with what is stored on the parity drive(s); if they match, all is OK. If there is a mismatch, it writes the corrected parity to the parity drive(s) and updates the error counter and logs. Now I'm wondering if my older drives, which did pass SMART, might be an issue.

 

Last SMART test result: Completed without error
#    Attribute Name        Flag    Value  Worst  Threshold  Type     Updated  Failed  Raw Value
199  UDMA CRC error count  0x000a  200    200    000        Old age  Always   Never   1    <~~~~ Shouldn't this be 0?
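
To make the check-and-correct behavior described above concrete, here is a minimal sketch of a single-parity (XOR) check in Python. It is only an illustration: the three "drives" are made-up byte strings, the function names are hypothetical, and a second parity drive uses a more involved calculation that is not modeled here.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check_parity(data_blocks, stored_parity, correct=True):
    """Recompute parity from the data and compare it with the stored parity.

    Returns (parity_to_keep, mismatch_count). In correcting mode the
    recomputed parity replaces the stored one when they differ.
    """
    expected = xor_parity(data_blocks)
    errors = sum(1 for e, s in zip(expected, stored_parity) if e != s)
    if errors and correct:
        return expected, errors
    return stored_parity, errors


# Three hypothetical data "drives", four bytes each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
good = xor_parity(data)
bad = bytes([good[0] ^ 0x08]) + good[1:]  # stored parity with one flipped bit

fixed, count = check_parity(data, bad)
print(count, fixed == good)
```

Note that the check only sees that data and parity disagree. It cannot tell which side is wrong, so a data drive (or controller) returning bad reads would show up as "parity errors" just the same.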

I know from my system, I got no SMART errors on the failing drive, only some bad sectors. So I popped it into an enclosure on my Windows machine, and the minute it was detected, StableBit Scanner (which is installed on that machine) gave a SMART error indicating that the drive was not passing some of the checks it does. After a full scan, it also detected bad sectors.

StableBit Scanner marks some SMART attributes as "pre-fail"; those were the ones it flagged on my dying drive.

 

In general, older drives make me nervous. I also had a slew of issues with Seagate drives and moved to Western Digital and HGST drives (HGST is now owned by WD). The HGST drives run cooler than the WDs and overall have been great drives.
I have only a single WD Red in my system which actually came from an enclosure which I pulled the drive from. In general I buy external drives, run preclear on them and then pull the drive to use in the array, this works out cheaper than buying internal drives. The new WD enclosures are also a lot easier than the older ones and so the drives can now be put back into their original enclosure in the case of failure. The two 8tb drives I just got were both $210 each, whereas the same internal drive is over $300 each. Quite the savings.


I might need to go that route then when checking SMART. The two 2TB drives are the last of the old batch. I spent a small fortune on hard drives last year when I built this box; I'm considering upgrading the mobo, CPU and memory as soon as the Mrs. gives me back my credit card. lol

 

Building a 32-Thread Xeon Monster PC for Less Than the Price of a Flagship Core i7

http://www.techspot.com/review/1155-affordable-dual-xeon-pc/


I think my current bottleneck is the 1 gigabit connection. I have been looking at upgrading to 10 gigabit, but it all seems to be very expensive. I currently only have a single NIC in my server, and I am wondering whether that is enough when backing it up while also trying to stream a movie. I believe maybe even a 4-port NIC might help.

49 minutes ago, crowdx42 said:

I think my current bottleneck is the 1 gigabit connection. I have been looking at upgrading to 10 gigabit, but it all seems to be very expensive. I currently only have a single NIC in my server, and I am wondering whether that is enough when backing it up while also trying to stream a movie. I believe maybe even a 4-port NIC might help.

Our house wasn't wired for internet. I discovered the phone lines used CAT5e, and I was able to get a few rooms connected that way. I have dual gigabit for a couple of workstations, but that's about all I can do without tearing into walls for the other locations. I have a couple of APs for wireless devices. If I could get 2 or 4Gb/s via bonding or 10Gb/s ($$$) everywhere, that would be awesome. FWIW, Emby allows you to throttle back the streaming bandwidth if need be.


Well, I ran some Cat6 outside and some in the attic so that I got wired throughout our house. The office where my main computers are is next door to where the unRAID server resides. But it does seem that when the main unRAID server is put under load, the GUI becomes unresponsive and very laggy :( , hence I wonder whether adding a 4-port gigabit NIC would help. They are not very expensive, and there is a switch right beside the server, so it would not require long runs of additional cable.

On 4/9/2017 at 11:45 AM, Joseph said:

With the new card installed, the first parity check corrected 5 errors (this was to be expected). However, the second parity check found the same 5 errors. Parity check 3 came back OK. I'm running check 4 and will post what I find. If it goes well, the plan is to reboot afterward and try it again.

Parity check 4 came back clean. Log Posted. Next steps: reboot and try again.

 

Last check completed on Sun 09 Apr 2017 10:30:47 PM CDT (today), finding 0 errors.
Duration: 11 hours, 30 minutes, 19 seconds. Average speed: 96.6 MB/sec
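
As a quick sanity check on those numbers, the reported average speed can be reproduced from the duration. This is a small sketch that assumes the check covers a nominal 4 TB parity drive (4 × 10^12 bytes, per the signature above) and that MB means 10^6 bytes; the function name is made up for illustration.

```python
def average_speed_mb_s(size_bytes, hours, minutes, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes) for a full pass."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return size_bytes / elapsed / 1e6


PARITY_BYTES = 4e12  # assumption: nominal 4 TB parity drive

speed = average_speed_mb_s(PARITY_BYTES, 11, 30, 19)
print(f"{speed:.1f} MB/sec")  # prints "96.6 MB/sec"
```

Under those assumptions the result matches the figure in the report, so the check appears to have run over the full parity size at a normal pace.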

 

On 4/10/2017 at 7:40 AM, crowdx42 said:

It sounds like your issue may be resolved, hopefully the stress of all the parity checks does not break something else :P

NOPE...FAIL!

 

Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069800

 

Same sectors each time. The only difference I see is that the system was rebooted. The check is still going (90% complete), but I'm posting the syslog
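
For anyone who wants to compare runs without eyeballing the syslog, here is a rough Python sketch that pulls the corrected sectors out of log text and reports which ones repeat between two runs. The regex is an assumption based only on the lines quoted above, and the function names are hypothetical.

```python
import re

# Assumed pattern, derived only from the syslog lines quoted in this post.
CORRECTED = re.compile(r"recovery thread: (\w+) corrected, sector=(\d+)")


def corrected_sectors(log_text):
    """Return the sorted sector numbers reported as corrected."""
    return sorted(int(m.group(2)) for m in CORRECTED.finditer(log_text))


def repeated_sectors(run_a, run_b):
    """Return the sectors that were corrected in both runs."""
    return sorted(set(corrected_sectors(run_a)) & set(corrected_sectors(run_b)))


log = """\
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069800
"""

sectors = corrected_sectors(log)
gaps = [b - a for a, b in zip(sectors, sectors[1:])]
print(len(sectors), gaps)  # prints "5 [8, 8, 8, 8]"
```

To compare two checks, pass the syslog text from each run to `repeated_sectors`. The even gaps of 8 suggest the five corrections sit in one contiguous region (4 KiB apart, if the sector numbers are in 512-byte units).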

 


So maybe it is time to take a step back. Do you have any spare data drives? I am thinking if you save your current configuration and then pull all the drives, then add back a single parity and single data drive and see if you can get a stable system with no errors that way. If that works you could start slowly adding back the data drives and from there you could pin down whether it is a dying drive or some other piece of hardware?

One other thing I would consider is pulling any add-in card and just using onboard SATA, just to test. You could simply have been unlucky and gotten a bad SAS card :(

 

I did complete my parity rebuild on my system and I now have both parity drives back up and running.


I hear you... In all likelihood I may have to go that route, but it's gonna be a while before I can carve out some time to try.

 

It's strange that it only seems to happen after a reboot. FWIW, I'm not getting unclean shutdown messages. Once this check completes, I will run another parity check without rebooting, and then reboot and run parity again to verify the pattern.

 

if anyone has any thoughts or ideas, feel free to chime in.


So, you're having a lot of ATA errors on one of your cache SSDs:

 

Apr 10 06:04:01 Tower kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 06:04:01 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:01 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:01 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:01 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:01 Tower kernel: ata12.00: configured for UDMA/133
Apr 10 06:04:01 Tower kernel: ata12: EH complete
Apr 10 06:04:51 Tower kernel: ata12.00: NCQ disabled due to excessive errors
Apr 10 06:04:51 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x100 SErr 0x0 action 0x6 frozen
Apr 10 06:04:51 Tower kernel: ata12.00: failed command: READ FPDMA QUEUED
Apr 10 06:04:51 Tower kernel: ata12.00: cmd 60/20:40:80:dc:35/00:00:36:00:00/40 tag 8 ncq dma 16384 in
Apr 10 06:04:51 Tower kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 10 06:04:51 Tower kernel: ata12.00: status: { DRDY }
Apr 10 06:04:51 Tower kernel: ata12: hard resetting link
Apr 10 06:04:51 Tower kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 06:04:51 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:51 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:51 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:51 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:51 Tower kernel: ata12.00: configured for UDMA/133
Apr 10 06:04:51 Tower kernel: ata12: EH complete

This SSD is connected to a Marvell controller, and these are known to be problematic (the SAS2LP also uses a Marvell chipset). The parity2 disk is connected to the same controller, and since you replaced the SAS2LP all errors are on parity2. Can you connect parity2 to the LSI? You are only using 7 ports there.
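
To see at a glance which ports are throwing errors in a long syslog, something like the following Python sketch could tally them per ATA port. The categories and the regex are assumptions drawn only from the message wording in the excerpt above, not a complete list of libata messages, and `summarize_ata` is a made-up helper name.

```python
import re
from collections import Counter, defaultdict

# Matches "kernel: ata12: ..." and "kernel: ata12.00: ..." style lines.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings taken from the excerpt above; anything else is ignored.
EVENTS = {
    "exception": "exception Emask",
    "failed command": "failed command:",
    "hard reset": "hard resetting link",
    "NCQ disabled": "NCQ disabled due to excessive errors",
}


def summarize_ata(log_text):
    """Count error-related messages per ATA port."""
    summary = defaultdict(Counter)
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for label, needle in EVENTS.items():
            if needle in message:
                summary[port][label] += 1
    return summary


sample = """\
Apr 10 06:04:51 Tower kernel: ata12.00: NCQ disabled due to excessive errors
Apr 10 06:04:51 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x100 SErr 0x0 action 0x6 frozen
Apr 10 06:04:51 Tower kernel: ata12.00: failed command: READ FPDMA QUEUED
Apr 10 06:04:51 Tower kernel: ata12: hard resetting link
Apr 10 06:04:51 Tower kernel: ata12: EH complete
"""

for port, counts in summarize_ata(sample).items():
    print(port, dict(counts))
```

A port that keeps accumulating resets and failed commands is worth tracing back to its controller and cable before blaming the drive itself.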

10 minutes ago, johnnie.black said:

So, you're having a lot of ATA errors on one of your cache SSDs:

 

This SSD is connected to a Marvell controller, and these are known to be problematic (the SAS2LP also uses a Marvell chipset). The parity2 disk is connected to the same controller, and since you replaced the SAS2LP all errors are on parity2. Can you connect parity2 to the LSI? You are only using 7 ports there.

OK, I'll have to do it later when I have more time... thanks for looking into it. FWIW, the mobo has 2 6Gb/s and 4 3Gb/s Intel Z68 ports, plus 2 Marvell 6Gb/s ports.

1 minute ago, crowdx42 said:

So a question on these controllers: is it that they have issues with certain motherboard chipsets, or is it simply an unRAID compatibility issue?

Some people have issues, some don't. I suspect it's a question of the specific implementation, whether or not the design specs were pushed to the ragged edge or not. I think I heard somewhere that heat may also be an issue, the controller chip itself may not tolerate higher temperatures (loads) like it should. Certain configurations may push the issue into failure, where others allow it to operate ok.
