
[SOLVED] new user. errors during first parity sync.


X1pheR


Hi,

 

New user here. Just as a warning that the obvious may not seem that obvious to me personally :)

I discovered unRAID a little over a week ago, and after some reading on the forums, the wiki and elsewhere I knew it was the right fit for me. I decided to take the plunge and build my first unRAID server with hardware from an old gaming PC I still had. I keep an up-to-date overview of the specs and config on Google Docs, both for myself and for support purposes. This is the link.

 

First off, I started with a simultaneous preclear_disk of 4 new disks (logs attached). It lasted 32 hours without any errors. In the smartctl report I did notice an increasingly high count of "Hardware_ECC_Recovered" on all drives, but after some research this seems normal for these Samsung HD154UI drives (correct me if I'm wrong). During preclearing the disk temps didn't exceed 30 degrees, and the disks ran at 60 MB/s on average, which did seem quite low (again, correct me if I'm wrong), but I would investigate that later. The result was about the same for all 4 drives:

 

Preclear Successful
... Total time 33:08:00
... Pre-Read time 5:12:22 (80 MB/s)
... Zeroing time 8:11:43 (50 MB/s)
... Post-Read time 19:42:51 (21 MB/s)
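As a rough sanity check on those figures, the sketch below recomputes the average throughput of each phase from its duration. It assumes a capacity of 1.5 TB (1.5e12 bytes) per HD154UI, which the post does not state, and decimal megabytes; the helper names are made up for illustration.

```python
# Assumption: each HD154UI holds about 1.5 TB (1.5e12 bytes).
# The post itself does not state the capacity.
ASSUMED_CAPACITY_BYTES = 1.5e12


def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' duration such as '5:12:22' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def average_mb_per_s(hms: str, size_bytes: float = ASSUMED_CAPACITY_BYTES) -> float:
    """Average throughput in decimal MB/s for one full pass over the disk."""
    return size_bytes / to_seconds(hms) / 1e6


# Durations copied from the preclear report above.
phases = {"Pre-Read": "5:12:22", "Zeroing": "8:11:43", "Post-Read": "19:42:51"}
for name, duration in phases.items():
    print(f"{name}: {average_mb_per_s(duration):.1f} MB/s")
```

Under that capacity assumption the results land close to the reported 80, 50 and 21 MB/s, so the report is internally consistent, and the post-read phase alone accounts for roughly 60% of the run time.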

 

Aren't these MB/s values rather low? Anyway, I created the array by adding 2 of the disks as data drives, without a parity drive, and copied everything over from my old server. After that completed I added the parity drive and started the parity sync. Still no errors at this point.

 

Parity sync went fine for a while, starting at 110 MB/s and dropping a little to around 90 MB/s, when suddenly the horror started. The speed fell to only a few hundred KB/s, and sync errors started occurring and counting up rapidly. The syslog (attached) started showing loads of errors. In unMENU's MyMain I checked the syslog entries per disk. The parity disk seems normal. Disk 1 and disk 2, however, both report many errors. Disk 4 is not in the array, since I use the basic version, and is spun down. When I look at the smartctl reports per disk, however, I only see trouble with disk 1 (smartctl reports for all disks attached).

 

This raises several questions which I hope somebody can help me with:

- What should I obviously do?

- Is it as simple as disk1 being bad? The syslog entries for disk2 also show many errors. Or is that because of the array?

- All of the drives are new. Everything went fine, even 32 hours of preclearing, and now suddenly there are errors. Is it the disk, PSU, memory, etc.? Where should I look?

- Is the data on disk1 bad? Should I replace disk1 with the warm spare (disk4) and rebuild, or just copy all the data over again from the old server?

 

Please advise, and thank you in advance.

 

UPDATE 1:

The parity error count is now 8222 and is no longer increasing, though the syslog is still spitting out errors. I also attached a txt file with the parity-sync progress at the 3 times I checked.

 

UPDATE 2:

It eventually finished and now shows parity as valid (is this true?). unMENU says "Parity updated  8222  times to address sync errors". The count didn't increase any further. Can anybody advise me on what I should still do or check?

 

UPDATE 3:

I started smartctl --test=long /dev/sd* on all disks. On disk1 it aborts after 20 seconds or so (I tried several times); see the report below. The other disks are still running the long test.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%        86         1526120678
# 2  Extended offline    Completed: read failure       90%        86         1526120671
# 3  Short offline       Completed: read failure       20%        85         1526120671
# 4  Extended offline    Aborted by host               90%        85         -
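A status of "Completed: read failure" with most of the test still remaining suggests the self-test stopped at the first sector it could not read, which would explain why it ends after seconds rather than hours. The sketch below pulls the failing LBAs out of a log like the one above; the regular expression and function names are illustrative assumptions tuned to this exact layout, not part of smartctl.

```python
import re

# Rows copied from the self-test log in this post.
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure       90%        86         1526120678
# 2  Extended offline    Completed: read failure       90%        86         1526120671
# 3  Short offline       Completed: read failure       20%        85         1526120671
# 4  Extended offline    Aborted by host               90%        85         -
"""

# Columns are separated by runs of two or more spaces; single spaces occur
# inside the description and status columns.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row that matches the expected layout."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def failing_lbas(rows):
    """Distinct LBAs reported by tests that ended with a read failure."""
    return sorted({int(r["lba"]) for r in rows if "read failure" in r["status"]})


rows = parse_selftest_log(SELFTEST_LOG)
print(failing_lbas(rows))
```

Repeated failures at the same or neighbouring LBAs across separate runs point at a genuine media problem on that disk rather than a one-off glitch.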

preclear_reports.zip

syslog-2011-09-05.zip

smartctl_2011-09-05.zip

parity_rebuilding_progress.txt

Link to comment

I would be concerned about your disk1 (S1Y6J90SB25328). Between the preclear_finish report and the smartctl report it has developed a lot of bad blocks (it looks like 209 sectors are waiting for reallocation). It's also very odd that the long SMART test is aborting so fast; these typically take a number of hours to run. Check to make sure you have not configured your unRAID to spin down idle drives, as doing this usually aborts any running SMART tests.

 

The following fields suddenly became non-zero:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       1311
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       1280
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       1280
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       209
201 Soft_Read_Error_Rate    0x000a   099   091   000    Old_age   Always       -       40
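To make the same comparison repeatable between reports, a minimal sketch like the following can parse the attribute lines and print the raw counts worth watching. The watch list and function name are assumptions chosen for illustration; the column positions follow the smartctl attribute layout shown above.

```python
# Attribute lines copied from the smartctl report quoted in this post.
ATTRIBUTES = """\
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       1311
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       1280
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       1280
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       209
201 Soft_Read_Error_Rate    0x000a   099   091   000    Old_age   Always       -       40
"""

# Hypothetical watch list: attributes whose raw value is a plain count of
# uncorrectable or pending sectors.
WATCH = (187, 197)


def parse_attributes(text):
    """Return {id: (name, normalized_value, threshold, raw_value)}."""
    parsed = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that does not look like an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name = int(fields[0]), fields[1]
        value, thresh, raw = int(fields[3]), int(fields[5]), int(fields[9])
        parsed[attr_id] = (name, value, thresh, raw)
    return parsed


attrs = parse_attributes(ATTRIBUTES)
for attr_id in WATCH:
    if attr_id in attrs:
        name, value, thresh, raw = attrs[attr_id]
        if raw > 0:
            print(f"{attr_id} {name}: raw={raw} (normalized {value}, threshold {thresh})")
```

Note that the normalized values (095 to 100) are still well above their thresholds, so the drive would not be flagged as failing on thresholds alone; it is the raw counts that show the problem.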

 

 

As for whether you can trust your parity: run another parity check or two (in the nocorrect mode) and see if it reports any errors. If there are no errors then you can probably trust the parity, at which point you could remove disk1 and put your extra drive in its place in the array; unRAID will then rebuild onto the new disk.

 

Until you get all this working don't delete the original files that you copied from.

 

Regards,

 

Stephen

Link to comment

It appears you need to swap disk1 out for the spare disk and basically start over with disk1. Forget about saving any of the data on disk1 since every bad sector has corrupted whatever was stored there and the parity has been created based on this bad data. Just replace the disk and copy the data over again.

 

This is the disadvantage of copying the data first and building parity afterwards. If you had set up the array with the parity drive from the start, you could have just rebuilt disk1.
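The point about parity being built from bad data can be illustrated with a toy model. The sketch below assumes single parity is a plain byte-wise XOR across the data disks and uses made-up 10-byte "disks"; what the array actually records when a sector is unreadable during a sync is not something this thread establishes.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (toy single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents standing in for two data disks.
good_disk1 = b"movie-data"
disk2 = b"photo-data"

# Case 1: parity is computed while disk1 still returns good data.
parity = xor_parity([good_disk1, disk2])
rebuilt = xor_parity([parity, disk2])  # rebuild disk1 from parity + disk2
print(rebuilt == good_disk1)

# Case 2: disk1 returns damaged bytes while parity is being computed.
corrupt_disk1 = b"movie-\x00\x00ta"
parity_from_bad = xor_parity([corrupt_disk1, disk2])
rebuilt_bad = xor_parity([parity_from_bad, disk2])
print(rebuilt_bad == corrupt_disk1)
```

In the second case the rebuild faithfully reproduces the damaged bytes, which is why recopying from the original source is the safer route here.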

 

Peter

 

Link to comment

Thank you both for replying. I was worried nobody would post anything :)

I'll get started on it right away. Besides disk1, are the preclear_disk speeds for Pre-Read (80 MB/s), Zeroing (50 MB/s) and Post-Read (21 MB/s) normal? I'm asking just to be sure, since they seem low and also very different from each other. Pre-read and post-read in particular do much the same thing, right? Yet they differ by about 60 MB/s.

Link to comment

The post-read speed does seem low, but I think it is lower in that phase because the read is done in a non-sequential order to better exercise the drive. On a 2TB WD green drive one full pass takes about 28 hours, so your 32 hours for four simultaneous preclears does not seem too bad.

 

Regards,

 

Stephen

Link to comment

Ok, thank you. I also remember that when I copied data from my old server (from a RAID5 set, so plenty of read speed) to unRAID disk1 and disk2 over my gigabit network, it only went at around 40 MB/s. That was without a parity drive in the array, and I was copying to the disk shares directly rather than using user shares. Are these speeds to be expected? Without a parity drive in the array you should get about the same speeds as with a cache drive, right?

 

I removed the bad disk1 and replaced it with the warm spare. Then I used initconfig but just to be sure I'm running samsung's ES_Tool utility on all drives now. When they are all done and I've copied the contents for disk1 from my old server again I'll update. Thanks again for all answers.

Link to comment

None of the other disks reported any errors. I also ran a 31-hour memtest without errors. unRAID is now parity-syncing the new array, with no errors so far. I did notice some other errors in the syslog, though. Do I need to worry about any of them?

 

Sep  8 06:53:24 Tower kernel: spurious 8259A interrupt: IRQ7. (Minor Issues)
Sep  8 06:53:24 Tower kernel:  pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d (Minor Issues)
Sep  8 06:53:24 Tower kernel: AMD_IDE: probe of 0000:00:06.0 failed with error -12 (Errors)
Sep  8 08:56:05 Tower emhttp: shcmd (31): killall -HUP smbd (Minor Issues)
Sep  8 08:56:04 Tower kernel: REISERFS warning (device md1): sh-2021 reiserfs_fill_super: can not find reiserfs on md1 (Minor Issues)
Sep  8 08:56:04 Tower logger:        missing codepage or helper program, or other error (Errors)
Sep  8 08:56:04 Tower emhttp: disk1 mount error: 32 (Errors)
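The "(Errors)" and "(Minor Issues)" suffixes appear to be labels added by the syslog viewer rather than by the kernel itself. Assuming that trailing-label format, a small sketch like this groups an excerpt by label so the entries marked as errors can be read together; the regular expression and function name are illustrative only.

```python
import re
from collections import defaultdict

# Excerpt copied from this post, including the viewer's trailing labels.
SYSLOG_EXCERPT = """\
Sep  8 06:53:24 Tower kernel: spurious 8259A interrupt: IRQ7. (Minor Issues)
Sep  8 06:53:24 Tower kernel:  pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d (Minor Issues)
Sep  8 06:53:24 Tower kernel: AMD_IDE: probe of 0000:00:06.0 failed with error -12 (Errors)
Sep  8 08:56:05 Tower emhttp: shcmd (31): killall -HUP smbd (Minor Issues)
Sep  8 08:56:04 Tower kernel: REISERFS warning (device md1): sh-2021 reiserfs_fill_super: can not find reiserfs on md1 (Minor Issues)
Sep  8 08:56:04 Tower logger:        missing codepage or helper program, or other error (Errors)
Sep  8 08:56:04 Tower emhttp: disk1 mount error: 32 (Errors)
"""

# Only a label at the very end of the line counts; parentheses elsewhere in
# the message, such as "(31)" or "(device md1)", are left alone.
TAGGED_LINE = re.compile(r"^(?P<message>.*\S)\s+\((?P<label>Errors|Minor Issues)\)$")


def group_by_label(text):
    """Return {label: [message, ...]} for lines ending in a known label."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = TAGGED_LINE.match(line.strip())
        if match:
            groups[match["label"]].append(match["message"])
    return dict(groups)


groups = group_by_label(SYSLOG_EXCERPT)
for message in groups.get("Errors", []):
    print(message)
```

Grouping does not say which entries matter; it only makes it easier to see that the md1 and disk1 mount messages sit within a second of each other and probably belong to the same event.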

Link to comment

The disk1 mount error is a problem unless you had just started the array with a newly installed and unformatted disk1. It's impossible to know since you didn't include the whole syslog, which would show the sequence of events leading up to the error.

 

I doubt the others are anything to worry about.

 

Peter

Link to comment


Thanks. Well you hit it spot on. It wasn't a new drive but disk4 which was already precleared. I did a poweroff, replaced disk1 with disk4 (which was not part of the array). Some of the errors in my previous post do seem to reoccur occasionally. You sure I should ignore them?

 

I also came across another error. I searched but didn't come across any matching posts, only one about a segfault in a different service. See below; the syslog is attached. I don't believe I was doing anything with unRAID at the time.

Sep 8 21:01:09 Tower kernel: smartctl[18269]: segfault at 1810a6b7 ip 080aa050 sp bfe7c200 error 6 (Errors)

 

syslog-2011-09-08.txt

Link to comment

I believe that means smartctl crashed.

 

Peter

It turned out to be a segmentation fault; I had the parameters for the command wrong.  :-X

 

I'm also adding [solved] to the title. When I changed disks I did an initconfig and parity-synced the array right away, with no errors. I also recopied the contents of disk1 over the network and did a double parity check afterwards; again, no errors. So luckily everything is solved. I still had some doubts because of the other errors that occurred, but none have occurred for a while now and everything seems to be working fine. Thank you very much.

Link to comment
  • 2 weeks later...

[ I came across this post while searching for something different, but, since I'm here ...]

 

I'll bet the Author and/or Maintainer of smartctl would appreciate receiving a "bug report" (this is more like a "buglet" [cf: piglet]), with the exact command-line that will reproduce the crash/segfault.

 

"One small step for man ..."

 

-- UhClem

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
