X1pheR Posted September 5, 2011

Hi,

New user here, so as a warning: what is obvious to you may not be obvious to me.

I discovered unraid a little over a week ago, and after some reading on the forums, the wiki and such I knew unraid was it for me. I decided to take the plunge and build my first unraid server with hardware from an old gaming PC I still had. I keep an up-to-date overview with specs, config and such on Google Docs, for myself but also for support reasons. This is the link.

First off, I started a simultaneous preclear_disk of 4 new disks (logs attached). It lasted 32 hours without any errors. In the smartctl report I did notice a high and increasing count of "Hardware_ECC_Recovered" on all drives, but after some research this seems normal for these Samsung HD154UI drives (correct me if I'm wrong). During preclearing the disk temps didn't exceed 30 degrees, and the disks ran at 60 MB/s on average, which did seem quite low (again, correct me if I'm wrong), but I planned to investigate that later. The result was about the same for all 4 drives:

Preclear Successful
... Total time 33:08:00
... Pre-Read time 5:12:22 (80 MB/s)
... Zeroing time 8:11:43 (50 MB/s)
... Post-Read time 19:42:51 (21 MB/s)

Aren't these MB/s values rather low?

Anyway. I created the array with 2 data drives and no parity drive, and copied everything over from my old server. After that completed I added the parity drive and started the parity sync. Still no errors at this point. The parity sync went fine for a while, starting at 110 MB/s and dropping a little to around 90 MB/s, when suddenly the horror started. The speed fell to a few hundred KB/s, and sync errors started occurring and counting up rapidly. The syslog (attached) started showing loads of errors.

In unmenu MyMain I checked the syslog entries per disk. The parity disk seems normal. Disk 1 and 2, however, both report many errors. Disk 4 is not in the array, since I use the basic version, and is spun down. When I look at the smartctl reports per disk, however, I only see trouble with disk 1 (all disk smartctl reports attached).

This raises several questions which I hope somebody can help me with:
- What should I obviously do?
- Is it as simple as disk1 being bad? The syslog entries for disk2 also show many errors. Or is that because of the array?
- All are new drives. Everything went fine, even preclearing for 32 hours. Now suddenly errors. Is it the disk, PSU, memory, etc.? Where should I look?
- Is the data on disk1 bad, and should I replace disk1 with the warm spare (disk4) and rebuild, or just copy all the data over again from the old server?

Please advise, and thank you in advance.

UPDATE 1: The parity error count is 8222 now and is no longer building up, though the syslog is still spitting out errors. I also attached a txt with the parity building progress at the 3 times I checked.

UPDATE 2: It eventually finished and now displays parity as valid (is this true?). Unmenu says "Parity updated 8222 times to address sync errors". The count didn't increase any more. Can anybody advise me on what to still do and what to check?

UPDATE 3: I started smartctl --test=long /dev/sd* on all disks. On disk1 the test aborts after 20 seconds or so (tried several times); see the report below. The other disks are still running the long test.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure  90%        86               1526120678
# 2  Extended offline    Completed: read failure  90%        86               1526120671
# 3  Short offline       Completed: read failure  20%        85               1526120671
# 4  Extended offline    Aborted by host          90%        85               -

preclear_reports.zip syslog-2011-09-05.zip smartctl_2011-09-05.zip parity_rebuilding_progress.txt
vca Posted September 6, 2011

I would be concerned about your disk1 (S1Y6J90SB25328). Between the preclear_finish report and the smartctl report it has developed a lot of bad blocks (it looks like 209 sectors are waiting for reallocation).

It's also very odd that the long SMART test is aborting so fast; these typically take a number of hours to run. Check that you have not configured your unRAID to spin down idle drives, as a spin-down usually aborts any running SMART tests.

The following fields suddenly became non-zero:

  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       1311
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       1280
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       1280
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       209
201 Soft_Read_Error_Rate    0x000a   099   091   000    Old_age   Always       -       40

As for whether you can trust your parity: run another parity check or two (in nocorrect mode) and see if it reports any errors. If there are no errors then you can probably trust the parity, at which point you could remove disk1 and put your extra drive in its place in the array; unRAID will then rebuild onto the new disk.

Until you get all this working, don't delete the original files that you copied from.

Regards,
Stephen
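The comparison vca describes (which attributes went from zero to non-zero) can also be done mechanically. The following is only a minimal Python sketch: the rows are copied from the post above, the `WATCH` list of attribute IDs is a hypothetical choice made for illustration, and the function names `parse_attributes` and `flag_nonzero` are invented here. It assumes the usual ten-column layout of the smartctl attribute table with the raw value in the last column.

```python
# Rows copied from the post above. On a live system they would come from
# the attribute table that smartctl prints for a drive.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       1311
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       1280
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       1280
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       209
201 Soft_Read_Error_Rate    0x000a   099   091   000    Old_age   Always       -       40
"""

# Hypothetical watch list: attribute IDs whose raw value we want to stay at zero.
WATCH = {1, 5, 13, 187, 197, 198, 201}


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for every attribute row in text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID;
        # header lines and blank lines are skipped.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            raw = int(parts[9])
        except ValueError:
            continue  # raw value in a format this sketch does not handle
        rows[int(parts[0])] = (parts[1], raw)
    return rows


def flag_nonzero(rows, watch=WATCH):
    """Return the watched attributes whose raw value is greater than zero."""
    return {i: v for i, v in rows.items() if i in watch and v[1] > 0}


if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(flag_nonzero(parse_attributes(SAMPLE)).items()):
        print(f"{attr_id:>3} {name}: raw value {raw}")
```

Running the same check on the report saved at the end of the preclear and on the current report, then comparing the two results, would show which counters changed in between.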
lionelhutz Posted September 6, 2011

It appears you need to swap disk1 out for the spare disk and basically start over with disk1. Forget about saving any of the data on disk1, since every bad sector has corrupted whatever was stored there and the parity has been created from this bad data. Just replace the disk and copy the data over again.

This is the disadvantage of copying the data first and building parity afterwards. If you had set up the array with the parity drive from the start, you could have just rebuilt disk1.

Peter
X1pheR Posted September 6, 2011

Thank you both for replying. I was worried nobody would post anything. I'll get started on it right away.

Besides disk1, are the speeds during preclear_disk for Pre-Read (80 MB/s), Zeroing (50 MB/s) and Post-Read (21 MB/s) normal? I'm asking just to be sure, since they seem low and also very different from each other. Pre-read and post-read in particular should be about the same, right? Yet they differ by 60 MB/s.
vca Posted September 6, 2011

The post-read speed does seem low, but I think it is lower in that phase because it does the read in a non-sequential order to better exercise the drive.

On a 2TB WD green drive one full pass takes about 28 hours, so your 32 hours to do four simultaneous preclears does not seem too bad.

Regards,
Stephen
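As a sanity check on the figures under discussion, the reported rates can be reproduced from the phase times alone. This is a small Python sketch under one assumption that is not stated in the thread: that each phase covers the full nominal capacity of the HD154UI, taken here as 1.5 TB (1.5e12 bytes). The helper names are invented for illustration.

```python
# Assumption: nominal HD154UI capacity of 1.5 TB in decimal bytes.
DISK_BYTES = 1.5e12

# Phase durations copied from the preclear report in the first post.
PHASES = {
    "Pre-Read": "5:12:22",
    "Zeroing": "8:11:43",
    "Post-Read": "19:42:51",
}


def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mb_per_s(hms, total_bytes=DISK_BYTES):
    """Average throughput in MB/s (1 MB = 1e6 bytes) for one full pass."""
    return total_bytes / hms_to_seconds(hms) / 1e6


if __name__ == "__main__":
    for name, duration in PHASES.items():
        # Truncate to whole MB/s for comparison with the report.
        print(f"{name}: {int(mb_per_s(duration))} MB/s")
    total = sum(hms_to_seconds(d) for d in PHASES.values())
    print(f"Sum of phases: {total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}")
```

Under that assumption the three values agree with the 80, 50 and 21 MB/s in the report when truncated to whole numbers, which suggests the reported rates are simply capacity divided by elapsed time, that is, averages over a whole pass. The three phases also add up to just under the reported total time of 33:08:00.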
X1pheR Posted September 7, 2011

Ok, thank you.

I also remember that when I copied data from my old server (from a RAID5 set, so plenty of read speed) to unraid disk1 and disk2 over my gigabit network, it only went at around 40 MB/s. That was without a parity drive in the array, and I was copying to the disk shares directly, not using user shares. Are these speeds to be expected? Without a parity drive in the array you should get about the same speeds as with a cache drive, right?

I removed the bad disk1 and replaced it with the warm spare. Then I used initconfig, but just to be sure I'm now running Samsung's ES_Tool utility on all drives. I'll post an update when they are all done and I've copied the contents for disk1 from my old server again.

Thanks again for all the answers.
X1pheR Posted September 8, 2011

None of the other disks reported any errors. I also ran a 31-hour memtest without errors. Unraid is now parity syncing the new array, with no errors so far.

I did notice some other errors in the syslog, though. Do I need to worry about any of them?

Sep 8 06:53:24 Tower kernel: spurious 8259A interrupt: IRQ7. (Minor Issues)
Sep 8 06:53:24 Tower kernel: pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d (Minor Issues)
Sep 8 06:53:24 Tower kernel: AMD_IDE: probe of 0000:00:06.0 failed with error -12 (Errors)
Sep 8 08:56:05 Tower emhttp: shcmd (31): killall -HUP smbd (Minor Issues)
Sep 8 08:56:04 Tower kernel: REISERFS warning (device md1): sh-2021 reiserfs_fill_super: can not find reiserfs on md1 (Minor Issues)
Sep 8 08:56:04 Tower logger: missing codepage or helper program, or other error (Errors)
Sep 8 08:56:04 Tower emhttp: disk1 mount error: 32 (Errors)
lionelhutz Posted September 8, 2011

The disk1 mount error is a problem unless you had just started the array with a newly installed and unformatted disk1. It's impossible to know since you didn't include the whole syslog, which would show the sequence of events leading up to the error.

I doubt the others are anything to worry about.

Peter
X1pheR Posted September 8, 2011

lionelhutz wrote: "The disk1 mount error is a problem unless you had just started the array with a newly installed and unformatted disk1. It's impossible to know since you didn't include the whole syslog, which would show the sequence of events leading up to the error. I doubt the others are anything to worry about. Peter"

Thanks. Well, you hit it spot on. It wasn't a new drive but disk4, which was already precleared. I did a poweroff and replaced disk1 with disk4 (which was not part of the array).

Some of the errors in my previous post do seem to recur occasionally. Are you sure I should ignore them?

I also came across another error. I searched but didn't come across any matching posts, only one about a segfault in a different service. See below; syslog attached. I don't believe I was doing anything with unraid at the time.

Sep 8 21:01:09 Tower kernel: smartctl[18269]: segfault at 1810a6b7 ip 080aa050 sp bfe7c200 error 6 (Errors)

syslog-2011-09-08.txt
lionelhutz Posted September 9, 2011

I believe that means smartctl crashed.

Peter
X1pheR Posted September 9, 2011

lionelhutz wrote: "I believe that means smartctl crashed. Peter"

It turned out to be a segmentation fault. I had the parameters for the command wrong.

I'm also adding [solved] to the title. When I changed disks I did an initconfig and parity synced the array right away, with no errors. I also recopied the contents for disk1 over the network and did a double parity check afterwards. Again, no errors. So luckily everything is solved. I was still doubting a little because of the other errors that occurred, but none have occurred for a while now and everything seems to be working fine.

Thank you very much.
UhClem Posted September 20, 2011

lionelhutz wrote: "I believe that means smartctl crashed. Peter"

X1pheR wrote: "it appeared that it is an segmentation fault. I had the parameters for the command wrong. ..."

[ I came across this post while searching for something different, but, since I'm here ...]

I'll bet the Author and/or Maintainer of smartctl would appreciate receiving a "bug report" (this is more like a "buglet" [cf: piglet]), with the exact command-line that will reproduce the crash/segfault.

"One small step for man ..."

-- UhClem