UDMA CRC Error Count on new cache pool



After using this link http://lime-technology.com/wiki/Replace_A_Cache_Drive to replace my cache drives I started getting quite a few CRC error count warnings.  I had never seen these before on any drives, and it is still only affecting the cache drives.  I have followed just about every post I can find on this error about replacing SATA cables and rerouting them.  I am on my third set of cables and they are now nowhere near any power cables.

 

The new SSD cache pool has been running for 14+ hours now and both drives are sitting at around 97 to 100 on the CRC error count.  Originally they were connecting at SATA 1 speeds even though I had verified the cables were SATA 3.  I turned off the onboard eSATA port in the BIOS, booted unRAID, and now have 6 Gbps speeds.  This slowed down the rate of errors but hasn't fully resolved my issue.
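For anyone who wants to track this counter outside the unRAID GUI, here is a minimal Python sketch that pulls SMART attribute 199 (the UDMA CRC error count) out of `smartctl -A` output. It assumes smartmontools is installed and the script runs as root; the sample row is made up to show the usual column layout and is not taken from the diagnostics in this thread.

```python
import re
import subprocess

CRC_ATTRIBUTE_ID = 199  # SMART attribute ID for the UDMA CRC error count


def parse_crc_count(smart_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == str(CRC_ATTRIBUTE_ID):
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def read_crc_count(device):
    """Run `smartctl -A` on a device such as /dev/sdb and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_crc_count(result.stdout)


# Made-up sample row in the usual `smartctl -A` column layout (assumption).
SAMPLE = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       99\n"
)

if __name__ == "__main__":
    print(parse_crc_count(SAMPLE))
```

Matching on the attribute ID rather than the name is deliberate, since vendors label this attribute slightly differently. A drive that does not expose the attribute at all simply yields `None`.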

 

System:

unRAID Version: 6.5.3

Gigabyte GA-990FX-UD3 R5 motherboard

AMD FX8320E CPU

32GB Crucial Ballistix DDR3 RAM

3 x Seagate 3TB 7200rpm drive 

1 x Seagate 2TB 7200rpm drive

2 x Samsung 500GB 860 EVO SSDs

1 x GeForce 210 GPU

2 x NVIDIA 1060 GPU

Seasonic X-850 850 watt PSU

 

My old cache drives were Toshiba OCZ TR150 120GB SSDs.  They don't seem to report the CRC error count.  I put one in an external enclosure and ran CrystalDiskInfo on my desktop and this code is present in the report.  So it's quite possible I've had this error for a long time and didn't know about it.

 

Things I have tried:

- Changed SATA cables 3 times now

- Rerouted SATA and power cables to keep them at least 1/2 inch apart

- Turned off the eSATA port in the motherboard BIOS and for whatever reason this gave me SATA 3 speeds
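The negotiated link speed mentioned in the last item can be read from the kernel log, where the Linux libata driver prints a `SATA link up` line for each port. Below is a rough sketch that extracts the latest speed per port; the sample lines are invented for illustration and the `ataN` numbers are the kernel's own numbering, which may not match the labels printed on the motherboard.

```python
import re

# libata logs lines such as "ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)".
LINK_UP = re.compile(r"\b(ata\d+): SATA link up ([\d.]+) Gbps")


def negotiated_speeds(lines):
    """Return the most recent negotiated speed in Gbps for each ataN port."""
    speeds = {}
    for line in lines:
        match = LINK_UP.search(line)
        if match:
            # Later lines overwrite earlier ones, so renegotiations are reflected.
            speeds[match.group(1)] = float(match.group(2))
    return speeds


# Invented sample log lines (assumption), not from the attached diagnostics.
SAMPLE_LOG = [
    "kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
    "kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
    "kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
]

if __name__ == "__main__":
    for port, gbps in sorted(negotiated_speeds(SAMPLE_LOG).items()):
        print(f"{port}: {gbps} Gbps")
```

To use it on a real system, feed it the lines of the syslog file instead of `SAMPLE_LOG`. A port stuck at 1.5 Gbps with a 6 Gbps drive attached is worth a closer look.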

 

Things I haven't done yet:

- Update BIOS on the motherboard

- Move MongoDB off of the cache drives since it is the database for a Rocket Chat instance and is very chatty with disk I/O

 

Here is my full diagnostic file if this helps.  If I missed any helpful info in my post let me know and I will try to provide it ASAP.  Thanks.

tower-diagnostics-20180717-1432.zip

Edited by sansoo22

I have delayed replying to you as I am not much of a guru in this area.  But your two SSDs seem to be connected to SATA ports 5 and 6.  There are a lot of errors associated with these two ports near the end of your syslog.  I also suspect that those two ports are using the Marvell controller chip.  (Marvell chips have been implicated in a lot of issues over the past few years.)  Plus, you indicate that things got better when you turned off the eSATA port.  Plus, the number of CRC errors is exactly 99 for both disks.  This would normally not be expected if it were a problem with cables.  I would like to suggest that you move those two SSDs to two of the other four SATA ports and let's see what happens.
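A quick way to see which ports the syslog errors cluster on is to count the libata error lines per `ataN` prefix. The sketch below is only a rough tally: the list of error phrases is a hand-picked assumption rather than an exhaustive set, the sample lines are made up, and `ataN` is the kernel's numbering, which does not necessarily equal the SATA port number on the board.

```python
import re
from collections import Counter

# Matches "ata5: ..." as well as "ata5.00: ..." and captures the port name.
ATA_LINE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")
# Hand-picked phrases that libata prints around link errors (assumption).
ERROR_HINTS = ("exception", "failed command", "SError", "hard resetting link")


def count_ata_errors(lines):
    """Count error-looking kernel log lines for each ataN port."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and any(hint in match.group(2) for hint in ERROR_HINTS):
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines in the usual syslog shape, for illustration only.
SAMPLE_SYSLOG = [
    "Jul 17 14:01:02 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen",
    "Jul 17 14:01:02 Tower kernel: ata5: SError: { UnrecovData Handshk }",
    "Jul 17 14:01:02 Tower kernel: ata5: hard resetting link",
    "Jul 17 14:01:03 Tower kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
    "Jul 17 14:05:10 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED",
]

if __name__ == "__main__":
    for port, hits in count_ata_errors(SAMPLE_SYSLOG).most_common():
        print(f"{port}: {hits} error lines")
```

If the counts pile up on only one or two ports, that points at the controller or its ports rather than at every cable in the case.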

 

Note that CRC errors are not fatal.  They simply require that the data be resent and, thus, have a small effect on performance. 


Thanks @Frank1940.  I was wondering if the last 2 ports were the cause of the problem after seeing in the BIOS they can be configured differently than the others.  Since my initial post Cache 1 = 106 errors and Cache 2 = 102 errors.  So something with turning off the eSATA port made a pretty big difference.  You may have just found the problem for me.  

 

I've done some testing with large single file reads/writes and it doesn't seem to be causing any errors.  While I still don't like the idea of having a couple of suspect ports, if they are assigned to my 2 large media disks it won't hurt anything.  Like you said, it's not fatal and only hampers performance.  Emby isn't that fast to begin with so I doubt I would ever notice.

 

Is there anything I need to know about moving disks / ports around?  Will unRAID even notice if I move them around?  Or should I just follow the process I did on my last mobo upgrade: screen shot the disk assignments in the array and ensure all the disks are assigned to the same slots?

 

Thanks again for the help.


unRAID tracks disks by serial numbers so everything should end up correctly assigned to the right slots.  BUT, I would still grab a quick screen shot of the assignments.  It is just good insurance.  
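As a text alternative to the screen shot, the serial-to-device mapping can be saved before any cables are moved. This sketch does not use anything unRAID-specific; it only reads the standard Linux `/dev/disk/by-id` symlinks, whose `ata-` names normally contain the model and serial number. The function name and the filtering rules are my own assumptions.

```python
import os


def list_disk_ids(by_id_dir="/dev/disk/by-id"):
    """Map each whole-disk by-id name (model plus serial) to its device node."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition entries and non-ATA aliases; keep whole-disk ATA names.
        if "-part" in name or not name.startswith("ata-"):
            continue
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping[name] = os.path.basename(target)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for disk_id, device in list_disk_ids().items():
            print(f"{device}\t{disk_id}")
```

Run it once before and once after the move: the device letters (sdb, sdc, ...) may change, but the by-id names should stay the same for each physical disk.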

 

BTW, today LSI-based SATA cards are the recommended choice.  You can pick up an eight-port one on eBay for between $50 and $100 US.  Try to avoid the counterfeit Chinese ones.  Just be sure you go with a reputable supplier and don't base your purchase choice strictly on price.  Here is a link to what you should be looking for, from @johnnie.black:

 

          https://lime-technology.com/forums/topic/69018-sata-controller-replacement-question-and-advice/?tab=comments#comment-630097

 


I had looked at getting one but was never able to find which models were best with unRAID.  So instead I turned my unRAID machine into a mining rig.  I figured if it was on 24/7 I may as well slap in a couple of 1060s and have it make some money for me.  If you are familiar with that market at all, you know that mining on a small scale isn't really profitable anymore.  Looks like it might be time to hit up eBay to sell off the 1060s and invest those proceeds in an LSI card.  Although I do loathe the scammer gauntlet one must traverse in order to sell anything on eBay anymore.

 

Tonight I will be taking unRAID offline to switch the cache drives to different ports and will report back how that works out.  

 

Any advice on keeping the LSI card cool in a normal tower case?  I've read that most of these were designed for forced air rack mounted servers.

1 hour ago, sansoo22 said:

 

Any advice on keeping the LSI card cool in a normal tower case?  I've read that most of these were designed for forced air rack mounted servers.

 

I have never heard of anyone having a heat problem with them in an unRAID server.  Yes, normally they are used in a forced-air-cooled rack, but the packing (and power) density there is far beyond what the typical unRAID server will experience.  If you keep the HDs' temperatures from exceeding the mid-40s C, you should be OK.  Remember that the air flow should come across the HDs first and be forced out of the case behind them.  (You don't want to have any fans blowing in as you might with those video cards installed!)

 

Plus, the LSI card will only draw about 15W in the worst case condition with the nominal power around 8W. 


Finally had time to shut down the server and swap SATA cables around.  Moved the cache disks to ports 3 and 4 on the motherboard.  For a test I copied one of my dev projects over to a share that uses cache.  The project folder with all of its node packages contained 18,792 files.  With that test I saw a +2 error count on Cache 1 and a +1 error count on Cache 2.  As a follow-up I then deleted that directory and saw no increase in error counts.  That should roughly simulate what the Mongo database will hit the disk with in a day when Rocket Chat is active during normal business hours.
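To compare test runs like this one, it helps to record the counters before and after and normalise the increase. The sketch below uses the file count and the +2 / +1 increases from the post; the zero baselines are placeholders, and dividing by files copied is only a crude yardstick (bytes transferred would arguably be a better denominator).

```python
def crc_deltas(before, after):
    """Return how much each drive's CRC error count rose between two readings."""
    return {drive: after[drive] - before[drive] for drive in before}


def errors_per_10k_files(delta, files_copied):
    """Normalise an error increase by the number of files in the test copy."""
    return 10_000 * delta / files_copied


FILES_COPIED = 18_792  # file count reported for the test copy

# Placeholder baselines: only the increases (+2 and +1) come from the post.
before = {"cache1": 0, "cache2": 0}
after = {"cache1": 2, "cache2": 1}

if __name__ == "__main__":
    for drive, delta in crc_deltas(before, after).items():
        rate = errors_per_10k_files(delta, FILES_COPIED)
        print(f"{drive}: +{delta} errors, {rate:.2f} per 10,000 files")
```

Repeating the same copy after each hardware change gives a like-for-like number to compare, instead of relying on the running total.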

 

While I'd still like to see a 0 rise in this error count, I know it's not a data-threatening error.  I guess call it solved for now.  Next step is to find an LSI card to see if it rules out any drive-specific problems.


I believe that only two of the ports use the Marvell controller.  (At least, this is typical, and generally these are the two highest-numbered ones.)

 

One more thing to check, if your SATA cables have metal latches.  First, read this article:

 

         https://support.wdc.com/knowledgebase/answer.aspx?ID=10477

 

While it only pertains to WD hard disks, whatever the reasons were for the change, they might have caused other disk manufacturers to do the same thing.  The way to check is to pull gently on the SATA data connector on the HD.  You should feel a (slight) resistance as you pull on the connector.  If you can feel none, the connector does not have the internal 'bumps' and is just 'floating' inside of the housing.  Of course, if the metal latch is locking onto the shroud, it won't even move!


I think I saw that doc in another post.  I've checked and all my drives have a firm click when inserting the locking cables.  I took one of the Samsung drives out when I was switching ports and checked how firmly the cable is seated.  It won't come out with the locking cables if you pull on it.  I'm sure if you pulled hard enough it would.

 

Anyway, still getting errors, so port switching didn't work.  Hit up eBay and ordered one of these: https://www.ebay.com/itm/162862201664.  It's an HP card flashed to the latest LSI firmware in IT Mode.  Also picked up plenty of breakout cables, both locking and non-locking.

 

I couldn't confirm if the card is PCIe 2.0 compatible but I'm rolling the dice on it.  Next build will be PCIe 3.0 so I'm trying to be a little future proofed.  If it isn't, the LSI 9211-8i in IT Mode is readily available on eBay from reputable sellers.

1 hour ago, sansoo22 said:

I think I saw that doc in another post.  I've checked and all my drives have a firm click when inserting the locking cables.  I took one of the Samsung drives out when I was switching ports and checked how firmly the cable is seated.  It won't come out with the locking cables if you pull on it.  I'm sure if you pulled hard enough it would.

 

 

If you feel the click, the connector is working the way it was designed. 

 

If you overclocked the system to improve  the efficiency of the mining operation, the consensus has been that you should never be overclocking a server because of reliability concerns. 


Sorry to interject. I run an LSI 2008 card and yes, it does get bloody hot! BUT I modded in a small Delta server fan right between the case and the card sucking air out, and then on the other side, where the IO is, a fan sucking air in.

It has reduced ALL temps by 10-25C in most cases. Those fans are running off a 7V power adapter, as I will be adding a fan speed controller when I can so I can control them based on my own settings.

 

 

Little noisy on 7V but nothing that really concerns me for now.

Edited by Stan464

@Frank1940 I haven't overclocked anything in the server except for the video cards in the VM that was running them.  I'm not much of an overclock guy anyway.  I prefer peace of mind in all my systems.  I'm not too sad to see the mining operation go.  It was merely an experiment to see if a server could pay its own utilities.  It was a success until the brutally hot summer we are having in the Midwest, on top of the market for crypto falling apart.

 

@Stan464 thanks for the heads up.  The card and cables just shipped out today.  My case stays really cool when the dual 1060s aren't running.  Even then the AIO liquid cooler does a great job with the CPU.  The Fractal Design XL R2 is a huge case with plenty of space around components.  If the LSI card gets hot I thought about putting in one of these:

 

https://www.amazon.com/GeLid-SL-PCI-02-Slot-120mm-Cooler/dp/B00OXHOQVU/ref=cm_cr_arp_d_product_top?ie=UTF8

 

 


Lovely case! And yes, there should be more room than in mine. These cards can run hot perfectly fine, but it's always best to try to keep electronics at a stable temperature as best you can.

Your idea is a little better than mine; I just dumped in whatever I had in the case at that time.

Drives hit 35-40 on a warmer day. The noise is a little annoying, but I will fix that at a later date.


Card is here but still waiting on cables.  They said they would be here today but USPS is a crapshoot in my city.  Looks like they sent them to the wrong hub, so they are on their way to the correct hub and may show up tomorrow.  So hopefully this weekend I can get this thing installed and see how it does.  Just wanted to say I haven't abandoned the post.  Just (im)patiently waiting for USPS to figure out logistics.


Can't seem to catch a break right now.  Was going to install the new LSI card but then woke up to a new error of "187 Reported Uncorrect" on disk 2.  That popped up during a parity check and I don't have a spare drive at the moment, so I'm waiting on a new drive to show up before I tear the server down.  The count on that error is 2 and there are no bad sectors reported on the drive, but I don't want to run the risk of a drive going bad while I'm swapping around components.

  • 2 weeks later...

Finally got around to getting the HP 220 card installed.  My girlfriend is a teacher and was using my Emby server to binge watch Criminal Minds before starting school again.  I've learned not to upset her in that last week or two before school.  Looks like the "UDMA CRC Error count" issue can finally be put behind me.  I replaced the Seagate drive that had a couple of "187 Reported Uncorrect" errors on it as well.  I thought it had died because there were about 200 write errors to it when trying to spin up a VM, and unRAID disabled it.  So I put the new drive in, let it rebuild, and the same thing happened again.  Disk 2 was disabled by unRAID and I had to remove it, start the array, stop the array, and it's currently rebuilding onto itself.  It's a brand new disk that I ran a full extended SMART test on before installing, with 0 errors.  I even ran another short test from within unRAID when it was disabled and it's all green.

 

So something with that old VM that used to have the 2 1060s assigned to it is so borked it's causing a disk to have a ton of bad writes.  I was quite shocked to see this behavior.

 

Anyway, the original point of this topic is solved and the answer is: don't use onboard SATA.  Hit up eBay and get a SAS card.

  • 1 year later...

A bit late to reply to this but I wonder whether the errors may have been a byproduct of the power supply being too small?
