Read error during parity sync after adding new parity drive



Hi all,

 

I'm afraid I need the brains trust on this one. I woke this morning to find that one of my very stable (albeit old) data drives had thrown read errors overnight while the server was running a parity sync.

 

For background, I made some hardware changes to my main server yesterday:

 

- Added a new 4-port SATA expansion card (Skymaster PCIe 4-Ports SATA 6G Card EST11B)

- Added a new (SMART tested and precleared) 8TB Seagate Barracuda Compute

- Added a SilverStone riser cable (SilverStone RC04B PCI-e Riser Cable 400mm SST-RC04B-400) so I could move the graphics card, which was taking up two PCIe slots, and free a slot for the new SATA expansion card

- Added a new 250GB SSD to run my living room TV LibreELEC VM via UAD

 

My overall goal is to return to dual parity. I have been running one parity drive since I released my second one some months ago to use as a data drive. I intend to add 2 x new Barracudas as parity and demote my single 8TB archive parity disk to a data disk. I'm doing it slow and sure, step by step, which is why I am adding a new parity drive before replacing the current one.

 

As the SATA expansion card has a Marvell chipset, I had to apply the 'iommu=pt' fix to my syslinux config to let unRAID see the drive. It is worth noting that the new 8TB drive and the SSD are on the new SATA card, but the drive that is exhibiting problems is not.

 

So, as you would expect, after installing the new 8TB drive I added it as a parity disk and began the parity sync. Then I left it alone.

 

I cannot see anything obvious in the diagnostics. The disk's SMART data looks fine. There were only 88 read errors. These errors coincided with the start of my daily CA docker backup sequence and only lasted a short time. I mention the backup only because it happened at around the same time and was the only other thing the server was doing.

 

I guess I could have knocked a cable while I was in there, but I am pretty diligent and I checked the cables before I packed up. Plus, I would usually expect to see a high UDMA CRC error count if there were cabling issues, and I don't.

 

I think I know the protocol: let the parity sync finish, check the cables again and then do a correcting parity check. I'd be grateful if, in the meantime, anyone has any other insights as to what the issue might be.

 

Diagnostics attached.  Thank you in advance.

 

Daniel

unraid-diagnostics-20200530-0711.zip

Edited by danioj

The error is reported in the syslog as an actual disk problem, and the same shows in SMART, though these kinds of errors can sometimes be intermittent. You should run an extended SMART test. If it fails, or if you get more similar errors in the near future, you should consider replacing that disk.

Thanks for the review.

Read errors again, this time on the parity check. I’m running a long SMART test now. I know disks fail but the absence of obvious SMART data makes me wonder if I did something when I was in there. Is there anything physical I could have done to cause this!?


7 minutes ago, danioj said:

absence of obvious SMART data makes

There are some SMART issues:

 

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    11

 

This should be zero on a healthy WD drive. A non-zero value is not definite proof that the disk is failing, but it's never a good sign, especially if it keeps climbing.
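If you want to keep an eye on whether that value keeps climbing, here is a minimal Python sketch that pulls the raw value out of an attribute table laid out like the one quoted above. The function names (`parse_attributes`, `raw_count`) and the assumed eight-column layout are mine, for illustration only; they are not part of smartctl or Unraid.

```python
# Sample rows copied from the SMART output quoted above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    11
"""


def parse_attributes(text):
    """Return {attribute_name: fields} for each data row of the table.

    Assumes the column order shown in SAMPLE: ID, name, flags,
    value, worst, thresh, fail, raw value.
    """
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least 8 columns.
        if len(parts) < 8 or not parts[0].isdigit():
            continue
        attrs[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "fail": parts[6],
            # Raw values may carry extra text, so keep the whole remainder.
            "raw": " ".join(parts[7:]),
        }
    return attrs


def raw_count(attrs, name):
    """Return the leading integer of an attribute's raw value."""
    return int(attrs[name]["raw"].split()[0])


attrs = parse_attributes(SAMPLE)
print(raw_count(attrs, "Raw_Read_Error_Rate"))  # prints 11
```

Saving that number after each check and comparing it with the previous one is enough to see whether it is still rising.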


 

Error 1 [0] occurred at disk power-on lifetime: 58318 hours (2429 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 12 78 3b e0 40 00  Error: UNC at LBA = 0x12783be0 = 309869536

This error (UNC @ LBA) usually also means a disk problem, i.e. a bad or failing sector, and the power-on hours show that the error is recent. Again, it's not 100% conclusive, since I've seen similar errors logged when it wasn't a disk problem, but it usually is, and if the SMART test fails that will confirm it.
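For anyone wanting to double-check the numbers in that log entry, here is a small Python sketch that decodes them. Treating register bytes 5 to 10 as the 48-bit LBA (high byte first) is my reading of the column header above, so take it as an assumption; the helper names are made up for this example.

```python
import re

# Values copied from the error log quoted above.
REGISTER_LINE = (
    "40 -- 51 00 00 00 00 12 78 3b e0 40 00  "
    "Error: UNC at LBA = 0x12783be0 = 309869536"
)
POWER_ON_HOURS = 58318


def lba_from_summary(line):
    """Return (hex value as int, decimal value) from the 'LBA = 0x.. = N' text."""
    match = re.search(r"LBA = 0x([0-9a-fA-F]+) = (\d+)", line)
    if match is None:
        raise ValueError("no LBA summary found")
    return int(match.group(1), 16), int(match.group(2))


def lba_from_registers(line):
    """Join register bytes 5..10 (assumed LBA_48 + LH LM LL) into one integer."""
    tokens = line.split()[:13]
    return int("".join(tokens[5:11]), 16)


hex_lba, dec_lba = lba_from_summary(REGISTER_LINE)
days, hours = divmod(POWER_ON_HOURS, 24)

print(hex_lba == dec_lba == lba_from_registers(REGISTER_LINE))  # prints True
print(days, hours)  # prints 2429 22
```

The hex and decimal LBA agree with the register bytes, and 58318 hours does work out to 2429 days + 22 hours, so comparing that figure with the drive's current power-on hours shows how recently the error was logged.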
Because Unraid used parity plus the other disks to reconstruct those sectors. Those errors would be a problem if this were a disk rebuild instead, though.
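To illustrate the idea, here is a toy Python sketch of single-parity reconstruction using XOR. It assumes a plain XOR parity model and made-up two-byte "sectors"; it is not Unraid's actual code, and a second parity disk is calculated differently.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents of the same sector on three data disks.
data_disks = [b"\x12\x78", b"\x3b\xe0", b"\x40\x00"]
parity = xor_blocks(data_disks)

# Pretend the second disk returned a read error for this sector:
# XOR of parity and the surviving disks gives the missing data back.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])  # prints True
```

The same arithmetic shows why the errors matter during a rebuild: if one disk is already missing and a second one cannot be read at that sector, single parity has nothing left to XOR against.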

However, I can still rebuild this disk from that recent parity sync successfully, right? I haven't been able to complete a parity check since the read errors started!?
