help with repeat parity errors needed

quincyg · March 29, 2010

I am getting repeated parity errors following a power loss during a write. I have tried replacing the SATA cables and removed what seemed to be a faulty cache drive (space was not freeing after a move). I ran memtest overnight and ran SMART tests on all disks. The SMART tests say every drive passed, but I don't really know what to look for. I will attach the syslog and SMART reports for each drive. Here are the syslog lines that jump out to my untrained eye:

Mar 28 09:27:37 Tower kernel: AMI BIOS detected: BIOS may corrupt low RAM, working around it.

Mar 28 09:27:37 Tower kernel: pci 0000:00:0b.1: EHCI: BIOS handoff failed (BIOS bug?) 01010001

Mar 28 09:27:37 Tower kernel: ACPI Warning: Incorrect checksum in table [OEMB] - 7D, should be 70 (20090903/tbutils-314)

syslog.txt.zip

smart.zip

Joe L. · March 29, 2010

Where are you seeing the parity errors? I don't see any evidence of them in the syslog.

quincyg · March 29, 2010

When I run a parity check it shows sync error:255

I have never had any errors reported in the errors column of the disk status window.

Everything was fine before the power outage. I just disabled ACPI on the server...grasping at straws.

quincyg · March 30, 2010

I ran a restore last night and am checking parity again now, still with errors. Here is a new log, post restore. I notice these lines. Am I dealing with a bad USB thumb drive?

Mar 29 13:24:11 Tower kernel: md: unRAID driver removed

Mar 29 13:24:11 Tower kernel: md: unRAID driver 0.95.4 installed

Mar 29 13:24:11 Tower kernel: read_file: error 2 opening /boot/config/super.dat

Mar 29 13:24:11 Tower kernel: md: could not read superblock from /boot/config/super.dat

I should also mention that after trying to change the timezone last night, the whole system went crazy. The drives always showed red dots and the syslog was just a bunch of error codes. That syslog is attached also, named syslog crazy.

syslog_post_rest.txt.zip

syslog_crazy.txt.zip

quincyg · March 31, 2010

Well, I'm impatient to get this sorted out and haven't heard back from tech support yet, so I moved the entire array to another server: new power supply, RAM, motherboard, etc. I formatted the thumb drive, ran scandisk on it, and reinstalled 4.51. The system ran parity all night and I am still getting errors on the USB drive.

Mar 30 15:41:48 Tower kernel: md: unRAID driver 0.95.4 installed

Mar 30 15:41:48 Tower kernel: read_file: error 2 opening /boot/config/super.dat

Mar 30 15:41:48 Tower kernel: md: could not read superblock from /boot/config/super.dat

Mar 30 15:41:48 Tower kernel: md: initializing superblock

I guess I will have to RMA the thumb drive, get a new key, and try with a different thumb drive. Any help would be appreciated.

As a side note...It is impossible to create the bootable flash in Windows 7. I had to create it on an XP machine.

Joe L. · March 31, 2010

Whenever you press "Restore" the super.dat file is moved to super.old. It is then perfectly normal for the super.dat file to not be found, and it is normal for unRAID to create a new one. This is NOT an "unexpected" error. In the same way, when initially creating an array, on a new flash drive, the super.dat file does not yet exist.

This is only a problem if on a subsequent boot it still says it cannot find the super.dat. (Normally you NEVER press the button labeled "Restore" unless you are removing a drive and not immediately replacing it with a new drive.) You always start the array by pressing the button labeled "Start".
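A quick way to see which of the two files is on the flash drive is sketched below. It assumes super.old sits next to super.dat in /boot/config (the syslog only confirms the super.dat path), and `superblock_status` is a made-up helper name, not an unRAID command.

```shell
#!/bin/sh
# Report which superblock files exist in an unRAID config directory.
# Assumption: super.old lives in the same directory as super.dat.
superblock_status() {
    dir=${1:-/boot/config}
    for f in super.dat super.old; do
        if [ -f "$dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: missing"
        fi
    done
}

superblock_status /boot/config
```

Right after a "Restore" you would expect super.dat to be missing; after starting the array, stopping it cleanly, and rebooting, it should be present.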

We need to see the syslog after you start the array, stop the array cleanly, then reboot, to see if the flash drive still shows as having a missing super.dat file. Also, although unlikely, you might have unplugged the flash drive from the PC without ejecting it safely. If you did, it might have corruption on its file-system. Running chkdsk on it on Windows should tell you if it is OK.

Lastly, on Vista, and probably on Win7, you need to open a command window as administrator, then run syslinux as administrator, then it can make the flash drive bootable.

quincyg · March 31, 2010

Thanks for your help Joe. Yes, I still get the errors after a clean restart. I ran scandisk and ejected the drive by right clicking>eject and then from the taskbar ejected the drive safely. I believe that is a redundant ejection...I normally just eject from the taskbar.

I am planning to start over fresh, so I am not worried about the restore. I unassigned all drives in the array, but the syslog should be valid, correct?

Also it seems to be taking much longer to load the bzroot file on startup. Maybe 3-4 minutes.

The syslogs for the last 2 clean power ups are attached.

syslog3_31.zip

Joe L. · March 31, 2010

quincyg wrote: "Thanks for your help Joe. Yes, I still get the errors after a clean restart. I ran scandisk and ejected the drive by right clicking>eject and then from the taskbar ejected the drive safely. I believe that is a redundant ejection...I normally just eject from the taskbar."

Ok... you would be surprised to learn how many people never "safely" eject anything... I just want to eliminate that as a possible reason the super.dat file was not found.

quincyg wrote: "I am planning to start over fresh, so I am not worried about the restore. I unassigned all drives in the array, but the syslog should be valid, correct?"

As long as you did not reboot, yes.

quincyg wrote: "Also it seems to be taking much longer to load the bzroot file on startup. Maybe 3-4 minutes."

That is a function of your BIOS and the flash drive and if it is initially being treated as USB-1 vs. USB-2 by your BIOS.

quincyg wrote: "The syslogs for the last 2 clean power ups are attached."

I'll look at them.

You are not pressing "restore" each time, are you?

Joe L.

quincyg · March 31, 2010

Thanks. I am trying to run preclear to wipe all of my disks clean, but can't get past an error: "sorry /dev/hda does not exist as a block device"

Same dialog for all of the drives. Any ideas on what is going on there?

Joe L. · March 31, 2010

Well.

If you type

ls -l /dev/hda

what do you see?

If it exists on your hardware, the first character on the line will be a "b" indicating it is a "block" device, like this:

ls -l /dev/hda

brw-rw---- 1 root disk 3, 0 Mar 11 22:20 /dev/hda

If /dev/hda is not a device on YOUR HARDWARE, then you need to provide the appropriate device.

If you type

ls -l /dev/disk/by-id

you will see the disks, their model/serial numbers, and at the end of the lines, the three character device names.
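The same "is it a block device" test can be scripted. This is a minimal sketch using the standard `-b` file test; the helper names are made up, and /dev/hda and /dev/sda are only example names, so substitute whatever /dev/disk/by-id shows on your hardware.

```shell
#!/bin/sh
# True if the path is a block device (what "ls -l" shows as a leading "b").
is_block_device() {
    [ -b "$1" ]
}

# Print one line per path saying whether it is a block device.
check_devices() {
    for dev in "$@"; do
        if is_block_device "$dev"; then
            echo "$dev: block device"
        else
            echo "$dev: not a block device"
        fi
    done
}

# Example device names only; they may not exist on a given machine.
check_devices /dev/hda /dev/sda
```

A name that is reported as "not a block device" here would presumably trigger the same complaint from preclear.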

quincyg · March 31, 2010

Thanks, will try it. How important is it to preclear vs. letting unRAID simply format the drives?

terrastrife · March 31, 2010

The preclear script will stress an HDD much more thoroughly, which is a good idea if the drives are brand new, to weed out DOA drives. Not so much for an existing drive with hundreds of hours already under its belt.

quincyg · April 1, 2010

OK, I really have no idea what is going on. I am using a completely new system, with a new usb drive. I formatted the disks again. I don't see any obvious trouble in the syslog, but I still am seeing parity errors reported on every parity check. 211 errors on the last check. The most recent syslog is attached, which is after a clean boot and includes parity check.

I am running preclear now on the parity drive. I got a notice that a GUID partition was detected and it should be gnu. Will the preclear wipe out this GUID table or do I have to do something to reformat the drive again outside of unRAID?

My current plan is to reformat the usb key and reinstall unraid. Run preclear on all disks and then start things up. I am thinking that some trace data is somehow being left on the parity disk which is continuing the corruption.

syslog4_1postchk.txt.zip

Joe L. · April 1, 2010

I suspect your analysis is incorrect.

If you see random parity errors you have HARDWARE issues. It is not residual data. Re-loading unRAID will not solve the problem, but do what you need to convince yourself.

In the past, this has been bad memory or incorrectly configured memory, bad disk controllers, bad motherboard chipsets, bad power supplies, and even (I think only once) a bad disk.

unRAID is not MS-Windows. Formatting a disk has absolutely nothing to do with parity, as parity works on the raw disk bits, not at the file-system level. Parity couldn't care less if the disk is partitioned, formatted, all ones, all zeros, etc., as long as the data on it stays consistent with what was there when parity was initially calculated.
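To make the "raw bits" point concrete, here is a minimal sketch that assumes simple XOR (even) parity across the data disks, which is the usual single-parity scheme. The byte values and the `xor_bytes` helper are invented for illustration and are not taken from the unRAID driver.

```shell
#!/bin/sh
# XOR all arguments together: single (even) parity for one byte position.
xor_bytes() {
    result=0
    for b in "$@"; do
        result=$(( result ^ $b ))
    done
    echo "$result"
}

# Made-up bytes read from the same offset on three data disks.
d1=0x5A; d2=0x3C; d3=0xF0
parity=$(xor_bytes $d1 $d2 $d3)
printf 'parity byte: 0x%02X\n' "$parity"

# A parity check XORs the data bytes with the parity byte: 0 means in sync.
echo "check, all reads good: $(xor_bytes $d1 $d2 $d3 "$parity")"

# One disk returns a single flipped bit (0x3C read back as 0x3D):
# the result is nonzero, which is what gets counted as a sync error.
echo "check, one bad read: $(xor_bytes $d1 0x3D $d3 "$parity")"
```

Nothing in this calculation knows about partitions or file-systems, so a format cannot leave anything behind that would upset it. Only a byte that reads back differently from what was written can.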

No "trace" data is involved. Can't be. Not unless you are using the trust-parity process and you've changed a data disk since calculating parity, or you are writing directly to the raw devices. If a parity check is followed by a second parity check that still finds errors, you have HARDWARE problems, not software problems. You need to work through the possible hardware involved by a process of elimination. Since it can involve almost anything (except possibly the case fans) it will not be easy.

The preclear script will wipe out any existing partition table. It completely overwrites the MBR (first 512 bytes on the disk where the partition table resides) and creates a single partition from sector 63 to the end of the disk after writing zeros to the entire disk. The only thing it does not do is clear an HPA (Host Protected Area) that your BIOS or a previous OS might have created. That would not cause occasional parity errors; it usually just costs a tiny amount of storage capacity.

quincyg · April 1, 2010

If it is a hardware error, then there is only one component that is the same as the original configuration....the hard drives. All other components are completely different from when the errors began. Is it typical that there is no indication in the syslog of what the hardware failure could be? Did you take a look at the syslog?

Thanks again for your help.

Joe L. · April 1, 2010

If it is a disk drive occasionally returning a bad bit, but thinking it was what it had written to the platter, yes. As far as the disk is concerned, it is working fine, so it reports no error.

See this thread: http://lime-technology.com/forum/index.php?topic=3642

Starting at this post: http://lime-technology.com/forum/index.php?topic=3642.msg38393#msg38393

Resolved at this post: http://lime-technology.com/forum/index.php?topic=3642.msg38655#msg38655

You've clearly been isolating hardware... It sounds like you are now down to the drives themselves. It would only take one bad drive to cause your system to act the way it is acting.

quincyg · April 1, 2010

Thanks, I read through that thread and it makes me think I should try running the array from my Rosewill PCI card instead of from the motherboard and see what happens. If I can figure out how to run an md5 check I will do that as well.
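For reference, a minimal sketch of one such md5 check, assuming `md5sum` is available on the server console: hash the same file several times and compare the results. The `repeat_md5` helper name is made up for this example.

```shell
#!/bin/sh
# Hash the same file several times; on healthy hardware every pass
# should produce the same digest.
repeat_md5() {
    file=$1
    passes=${2:-3}
    first=$(md5sum "$file" | cut -d' ' -f1)
    i=1
    while [ "$i" -lt "$passes" ]; do
        next=$(md5sum "$file" | cut -d' ' -f1)
        if [ "$next" != "$first" ]; then
            echo "MISMATCH on pass $((i + 1)): $first vs $next"
            return 1
        fi
        i=$((i + 1))
    done
    echo "consistent: $first"
}
```

It would be run as something like `repeat_md5 /mnt/disk1/some_large_file 5`, where the path is only a placeholder. Pick a file larger than the installed RAM, otherwise the repeat reads may be served from the cache instead of from the disk and controller being tested.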

purko · April 1, 2010

quincyg wrote: "there is only one component that is the same as the original configuration....the hard drives. All other components are completely different from when the errors began."

In a different case, the user replaced ALL components but the hard drives, and was still getting strange parity check errors:

http://lime-technology.com/forum/index.php?topic=5740.msg55411#msg55411

It turned out, his replacement motherboard was the same make and model as the old one, and that's what was giving him the troubles.

Of course, your situation may be completely different. I just thought it was worth pointing that out.

quincyg · April 2, 2010

I did think of that. They are both ASUS motherboards, but completely different models. I guess I can't rule out that possibility.

purko · April 2, 2010

Any other component that's the same make and model as before?

quincyg · April 2, 2010

nope. and the motherboard is listed as supported.

quincyg · April 2, 2010

Can I use the server while preclear is running on a disk that is not assigned?

Joe L. · April 2, 2010

Absolutely, that is how it is expected to be used.

quincyg · April 2, 2010

So preclear just finished on sda (my parity disk) and I got the following message:

"smart error count differences detected after pre clear....63c63...193 load-cycle count 0x0032 200 200 000 old age 480"

then the 2nd count of "481"

This is a 1 week old drive. Is this an indication of anything serious?

Joe L. · April 2, 2010

It indicates your disk loaded the disk heads onto the spinning platters 480 times prior to the pre-clear process (some time in the past week) and 1 more time for the pre-clearing process.

I'll let you judge, is it serious if the disk heads are loaded onto the platter to read and/or write them? Personally, I think if the disk heads did NOT load it would be a much bigger issue.
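The arithmetic behind that message can be sketched as follows. The two sample lines are reconstructed by hand in the usual `smartctl -A` column layout (raw value in the 10th column) from the numbers quoted above; they are not copied from the attached reports, and `raw_value` is a made-up helper.

```shell
#!/bin/sh
# Print the raw value (10th column of "smartctl -A" style output)
# for the attribute ID given as the first argument.
raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Hand-built sample lines for attribute 193, before and after the preclear.
before='193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 480'
after='193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 481'

b=$(printf '%s\n' "$before" | raw_value 193)
a=$(printf '%s\n' "$after" | raw_value 193)
echo "Load_Cycle_Count rose by $((a - b)) during the preclear"
```

With the numbers from this drive, the difference is a single load cycle, which is why the report flags line 63 as changed ("63c63" is diff notation for a changed line) without it meaning anything is wrong.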

Joe L.
