
Sync errors in parity check after power outage (edit) - 2nd power outage during parity check more problems


FrozenGamer


I had 248 and 298 sync errors on my two servers.  The power went out; I started a parity check and quit it after about 1 or 2 minutes.  I did this because there was an update from 6.4 to 6.5 for unRAID, so I did the update, rebooted and restarted the parity check.  So it's all done, too late, but I was wondering about these sync errors as I haven't ever seen one.  Also, there were 4 UDMA CRC errors after the power outage: 4 drives, one UDMA CRC error each.  I am wondering about those as well.  I haven't seen them before.

 

Update: the power just browned out and came back during the parity checks. One check was non-correcting and came back with zero errors before the outage happened.  As far as I know this is OK.

The 2nd one was still in a parity check with write corrections enabled, and the power went out about 2/3 of the way through. It had 3 sync errors at the time of the brownout.  Now 2 of 21 drives are showing as faulty upon reboot. One of them is 1 of the 2 parity drives (8 TB) and the other is a 6 TB Seagate.  I had taken diagnostics just prior to the brownout and will attach them, as well as a set taken right after, when I got home.

 

Yes, I am now looking for suggestions on a battery backup to get me through brownouts and shut down cleanly on power outages. Any suggestions?

I am guessing that nothing is wrong with the 2 drives marked faulty (this has usually been the case when it happens after I put in a new drive, rebuild, and then preclear the old drive).  However, something must have been wrong to cause a small number of sync errors on the 2nd try on this failed server?

Given what has just happened, with 2 power outages in 2 days (the first one was random; the 2nd is due to a storm which is ongoing at the moment), I am very hesitant to do any rebuilding while 2 drives are marked bad, which is the most that can fail without data loss.  Thinking of shutting it all off until tomorrow; the wind is supposed to die down in a few hours.  A full rebuild of both 8 TB parity drives took 5 days when I added them the first time.  I assume that rebuilding a parity drive and a 6 TB drive would take almost as long.  A parity check with error corrections is about a 2-day procedure.

Is there any way to find out if the 2 drives that are marked faulty are in fact faulty, or are just showing that after the 1-second brownout? Would it be best to keep those 2 drives out of the machine during a rebuild?  Could their data be used if another drive fails during the rebuild? In the diagnostics, the 2 drives that are marked faulty are the 8 TB ending in 0591E and the 6 TB ending in 06ET1.

tower-diagnostics-20180410-1652 before 2nd outage.zip

What should I do now?

tower-diagnostics-20180410-1523. after 2nd outage.zip

Link to comment

Did you do a correcting parity check? You must correct parity errors. After a correcting parity check, if I have any sync errors I always do another non-correcting parity check to make sure there are no more parity errors. That way I know there isn't some other problem. Your last parity check should always have zero sync errors, or you need to figure out why.

 

The CRC errors might just be a connection issue. Since there weren't many, you can just acknowledge them on the dashboard and you won't get notified of them again unless they increase.

 

 

Link to comment

Thanks, trurl. I acknowledged the UDMA CRC errors. The connections would not have changed from before, so I assume it had to do with the outage?  Yes, I did a correcting parity check; I will follow your advice now and run a non-correcting parity check.  Cheers.  What exactly is a sync error?

Link to comment
1 minute ago, FrozenGamer said:

What exactly is a sync error?

It means the parity data does not agree with the parity calculation. Typically the only thing you can do is correct parity. Parity data is likely to blame after an unclean shutdown since a parity update might not have completed.

 

The only time you might not correct parity is if you do a noncorrecting parity check after rebuilding a data disk. In that case, it might be the rebuild that went wrong and the parity data is actually correct.
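To make the idea of a sync error concrete, here is a minimal sketch that assumes plain XOR parity over a few toy "disks" (short byte strings). The function names and data are hypothetical and this is not unRAID's actual code; it only shows what "the parity data does not agree with the parity calculation" means.

```python
from functools import reduce

def compute_parity(data_disks):
    """XOR the bytes at each position across all data disks."""
    return bytes(
        reduce(lambda a, b: a ^ b, column)
        for column in zip(*data_disks)
    )

def count_sync_errors(data_disks, stored_parity):
    """Count positions where stored parity differs from the calculated parity."""
    expected = compute_parity(data_disks)
    return sum(1 for e, s in zip(expected, stored_parity) if e != s)

# Toy data: three data disks of four bytes each (illustrative values only).
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
good_parity = compute_parity(disks)

# Simulate a parity write that never completed (e.g. an unclean shutdown).
stale_parity = bytearray(good_parity)
stale_parity[2] ^= 0xFF

print(count_sync_errors(disks, good_parity))          # 0
print(count_sync_errors(disks, bytes(stale_parity)))  # 1
```

In this picture, a correcting check would overwrite the stale position with the calculated value, and a follow-up non-correcting check should then report zero.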

Link to comment

One of the two towers is almost done with its non-correcting parity check and has zero errors.  The other one crashed halfway through and had 3 errors.  I was unable to get diagnostics at the machine from the keyboard, but took some screenshots, which I am attaching.

IMG_4787.jpg

IMG_4786.jpg

Link to comment
23 hours ago, FrozenGamer said:

I acknowledged the UDMA CRC errors. The connections would not have changed from before, so I assume it had to do with the outage?

 

Forgot to address this. Tracking these SMART attributes was added by the update. Since they don't reset they could have been from long ago but you only found out about them when you updated unRAID. So only a concern if they increase.
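As a rough illustration of "only a concern if they increase", the sketch below compares the CRC counts recorded when the errors were acknowledged with a later reading. SMART attribute 199 (UDMA_CRC_Error_Count) is cumulative; the disk names, counts, and function name here are made up for the example.

```python
# SMART attribute 199 (UDMA_CRC_Error_Count) is cumulative and never resets,
# so only growth since the last acknowledgement matters.

def crc_increases(acknowledged, current):
    """Return {disk: growth} for disks whose CRC count rose since acknowledgement."""
    increases = {}
    for disk, count in current.items():
        baseline = acknowledged.get(disk, 0)
        if count > baseline:
            increases[disk] = count - baseline
    return increases

# Hypothetical disk names and counts, for illustration only.
acknowledged = {"disk1": 1, "disk2": 1, "disk3": 1, "disk4": 1}
later = {"disk1": 1, "disk2": 1, "disk3": 4, "disk4": 1}

print(crc_increases(acknowledged, later))  # {'disk3': 3}
```

A disk that shows up in the result would be the one whose cable or connection is worth checking first.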

 

Don't know what to suggest for your other problem. The simplest advice is to run memtest. Can't hurt and if it passes we will at least know that isn't the problem.

 

Also, when you updated, did you follow the advice in the update thread?

 

https://lime-technology.com/forums/topic/66327-unraid-os-version-650-stable-release-update-notes/

 

 

 

 

Link to comment

I didn't follow the advice in the update thread.   I will run memtest. I have already started a parity check with corrections after shutting down (unclean) and powering down all devices for a few minutes.  Is it OK to cancel the parity check and run the memtest?  The array is rather large, so the parity check will take around 2 days.

The only plugins not up to date were Community Applications and User Scripts.  I don't have the advanced buttons plugin unless it's part of another plugin that I am not aware of.

Thanks much again for each reply!

Link to comment

Have you considered just leaving everything shut down until you get a UPS?

 

6 minutes ago, FrozenGamer said:

I posted an update with a brownout that just happened (in addition to the first outage).  I don't intend to do anything as far as starting the array with red balled drives until I get some advice.

 

Unclear from this post what your current situation is and you didn't even post a new diagnostic.

Link to comment
9 minutes ago, FrozenGamer said:

I don't intend to do anything as far as starting the array with red balled drives until I get some advice.

 

So does this mean your current situation is red balled drives? Diagnostics are helpful, but there is a lot of stuff in them and it can be a lot of work to figure out what you have without any idea of what to look for. Post a screenshot if you don't want to type.

Link to comment

Syslog is being spammed with PHP warnings from the preclear plugin. You can turn these off in the Tips and Tweaks plugin (which you don't seem to have installed).

 

Syslog is also being spammed with wrong csrf token from preclear. Wrong csrf token is typically caused by having multiple browsers open to the webUI.

 

SMART looks OK for both disabled disks, parity2 and disk5. The usual advice would be to check connections, but considering how many disks you are trying to run with unreliable power I am tempted to blame it on your electric service.

 

Get a UPS, and make sure it has AVR (automatic voltage regulation).

Link to comment

It has been running since December 2014, and maybe once or twice a year it has had a problem. I have not lost data except for an error that was my fault.  I have it attached to 2 Silicon Graphics SE3016 caddies. See link: https://www.ebay.com/itm/SGI-Rackables-16-Bay-SAS-SATA-Expander-SE3016-Disk-Shelf-1-5M-Cable/263556965429?hash=item3d5d382435:g:mlAAAOSwm51arql2

 

Power problems are not the norm here: they happen 2 to 4 times a year on average.  I live in Alaska and we are on our own grid, with lots of storms, etc.  It appears that more problems have been occurring since I upgraded to 8 TB parity drives recently.  Really bad timing on the 2 recent outages.

 

I was going to start with a memtest now.  Once I eliminate that, I am not sure how to proceed to get the array back up into full parity protection.  I am searching online for a UPS as well.

Link to comment
9 hours ago, FrozenGamer said:

I am not sure how to proceed to get the array back up into full parity protection.

 

Since you have dual parity and only 2 disabled it should just be a simple rebuild.

 

Stop, unassign disabled disks, start.

Stop, reassign disabled disks, start.

 

The only reason I didn't suggest it before is I am not confident you can trust your power to complete the rebuild.
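The reasoning behind "dual parity and only 2 disabled" can be written as a simple count. This is only a sketch under the assumption that an array can reconstruct as many disabled devices (data or parity) as it has parity devices; the function name is hypothetical.

```python
def remaining_fault_tolerance(parity_devices, disabled_devices):
    """How many more devices could drop out before data is at risk.

    Assumes the array can reconstruct at most `parity_devices` disabled
    devices, counting disabled parity devices as well as data disks.
    """
    if disabled_devices > parity_devices:
        raise ValueError("more devices disabled than parity can reconstruct")
    return parity_devices - disabled_devices

# Situation in this thread: dual parity, with parity2 and disk5 disabled.
print(remaining_fault_tolerance(2, 2))  # 0
```

A result of 0 means the rebuild is possible but there is no margin left while it runs, which is why stable power matters so much here.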

Link to comment

You'll also need to check filesystem on disk2:

 

Apr 10 06:59:55 Tower kernel: XFS (md2): xfs_dabuf_map: bno 0 dir: inode 8837623527
Apr 10 06:59:55 Tower kernel: XFS (md2): [00] br_startoff 0 br_startblock -2 br_blockcount 1 br_state 0
Apr 10 06:59:55 Tower kernel: XFS (md2): Internal error xfs_da_do_buf(1) at line 2525 of file fs/xfs/libxfs/xfs_da_btree.c.  Caller xfs_da_reada_buf+0x31/0x78 [xfs]
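For anyone wanting to spot lines like these in their own syslog, a small sketch follows. It assumes the line format shown above, counts only lines containing "Internal error", and the function name is made up; md2 corresponds to disk2 as noted in this post.

```python
import re

# Matches kernel XFS messages such as "... kernel: XFS (md2): <message>".
XFS_LINE = re.compile(r"kernel: XFS \((md\d+)\): (.*)")

def xfs_internal_errors(lines):
    """Count XFS 'Internal error' messages per md device."""
    hits = {}
    for line in lines:
        match = XFS_LINE.search(line)
        if match and "Internal error" in match.group(2):
            device = match.group(1)
            hits[device] = hits.get(device, 0) + 1
    return hits

# The three syslog lines quoted in this post.
sample = [
    "Apr 10 06:59:55 Tower kernel: XFS (md2): xfs_dabuf_map: bno 0 dir: inode 8837623527",
    "Apr 10 06:59:55 Tower kernel: XFS (md2): [00] br_startoff 0 br_startblock -2 br_blockcount 1 br_state 0",
    "Apr 10 06:59:55 Tower kernel: XFS (md2): Internal error xfs_da_do_buf(1) at line 2525 of file "
    "fs/xfs/libxfs/xfs_da_btree.c.  Caller xfs_da_reada_buf+0x31/0x78 [xfs]",
]

print(xfs_internal_errors(sample))  # {'md2': 1}
```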

 

Link to comment
1 hour ago, trurl said:

 

Since you have dual parity and only 2 disabled it should just be a simple rebuild.

 

Stop, unassign disabled disks, start.

Stop, reassign disabled disks, start.

 

The only reason I didn't suggest it before is I am not confident you can trust your power to complete the rebuild.

Thanks, trurl. I have installed a UPS (TrippLite G1000UB).  I ran 1 pass of memtest and it passed, then unassigned both disks and assigned spares in their place.  It is a long process: 5 days to do both drives.

Link to comment
1 hour ago, johnnie.black said:

You'll also need to check filesystem on disk2:

 


Apr 10 06:59:55 Tower kernel: XFS (md2): xfs_dabuf_map: bno 0 dir: inode 8837623527
Apr 10 06:59:55 Tower kernel: XFS (md2): [00] br_startoff 0 br_startblock -2 br_blockcount 1 br_state 0
Apr 10 06:59:55 Tower kernel: XFS (md2): Internal error xfs_da_do_buf(1) at line 2525 of file fs/xfs/libxfs/xfs_da_btree.c.  Caller xfs_da_reada_buf+0x31/0x78 [xfs]

 

Thanks, Johnnie. I just clicked the little down arrow to the left of disk 2 to see if it had any more info on the disk; apparently that is the spin down button.  I hope that unRAID is foolproof enough to not have actually spun down during the rebuild.  It still shows as reading, though.

 

I have had the error you are quoting forever, and it repeats over and over. Thanks for pointing me in the right direction!  I won't be able to do the disk check for about 5 days.  When I do it, should I skip the test and go right to the repair, or test first, then repair?

Link to comment

Archived

This topic is now archived and is closed to further replies.
