Did a parity check but it stopped with errors on disk1 - need help - unRAID Server 4.3 [No new topics]

November 5, 2008

Hey guys,

So I do a parity check every 1-2 months just to make sure everything is valid and things are running smoothly.

I tried doing a check today and when I came back to see how far along I was I saw disk1 was taken offline and there are 576 errors.

I ran reiserfsck and it said no corruptions were found.

I ran smartctl and this is what it output:

smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

My log is attached. I had to ZIP it because it was too large to attach to this post.

I'm not sure if the drive is bad, parity is bad, or if there's some other sort of error/bug going on here. smartctl can't recognize the drive but reiserfsck says the disk has no errors. I can't run a parity check because the disk errors out and now disk1 is offline.

Any help would be appreciated. I'm not sure if getting a new disk to replace disk1 with is the best idea (in case something else went wrong, like a bad parity drive).

November 5, 2008

If the failed disk is off-line, then reiserfsck is checking the "virtual" disk contents as re-constructed by the unRAID software. It simply proves that the "virtual" disk, as reconstructed from parity and your other data disks, is not corrupted.

Joe L.

November 5, 2008

You can try to see if a cable has come loose from the drive. (power down while checking it, to prevent further damage as you wiggle things) If it was loose, after powering up, you can get it back on-line by un-assigning it in the devices page, rebooting, then re-assigning it on the devices page. If it is then writable it will be re-constructed from parity and the other data drives.

Unless you find a bad-cable or loose cable connection to the failed disk, your best bet is to replace the failed disk as soon as possible.

When you do replace it, use the "Start" button to start your array. It will start and begin the process of rebuilding the failed disk.

Do NOT use the button labeled "restore" as it does not restore data, but instead removes your super.dat file, immediately invalidates the existing parity data and then begins the process of calculating parity based on the currently assigned and WORKING drives. Any data on the failed drive will be lost if you press the button labeled "restore".

November 5, 2008

Author

Thanks Joe... I'll go pick up a new disk today to replace disk1 with. I can upgrade to a larger disk size correct? Or do I need to replace the 750g drive with another 750g drive and once the drive is rebuilt I can swap it out for a new 1TB drive?

November 5, 2008

Thanks Joe... I'll go pick up a new disk today to replace disk1 with. I can upgrade to a larger disk size correct? Or do I need to replace the 750g drive with another 750g drive and once the drive is rebuilt I can swap it out for a new 1TB drive?

You can do either... If the new drive is bigger than your existing parity drive, then it will not let you start the array. In that case, you simply assign the new disk to parity, the old parity disk to the failed slot in the array and press start. It will first copy parity to the new (larger) parity drive, then begin the process of rebuilding the data on the newly re-assigned data drive that used to be your parity drive.

I'd check for the loose cable first... but other than that, it is time to go out and get a new drive.
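The sizing rule above can be written down as a tiny shell helper, just as a memory aid. This is only a sketch: `plan_replacement` is a made-up name, not an unRAID command, and the sizes are whatever you type in (in GB here).

```shell
# Hypothetical helper, not part of unRAID: prints which slot the new disk
# should go in, following the rule described in the post above.
# $1 = size of the new disk, $2 = size of the current parity disk (same units).
plan_replacement() {
    new_size=$1
    parity_size=$2
    if [ "$new_size" -gt "$parity_size" ]; then
        # A data disk may not be larger than parity, so swap roles.
        echo "new disk -> parity slot; old parity disk -> failed data slot"
    else
        echo "new disk -> failed data slot"
    fi
}
```

For example, `plan_replacement 1000 750` prints the parity-swap line, while `plan_replacement 750 750` prints the plain replacement line.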

November 5, 2008

Author

I checked and I have no loose cables. I don't really want to try swapping out a cable because I have everything cable tied and many of the SATA cables are tucked away so I can't see which cable goes from which drive to which connection on the MOBO without un-doing a lot of the organizing work I did when I built the server.

I'll go pick up a new drive today and get the server running again. I guess if it is a bad cable I'll know when I replace the drive with a new one (and if worst comes to worst I'll just end up with another 1TB of storage, which I need anyways).

Thanks again for the help.

November 5, 2008

Loose SATA cable connector was going to be my first suggestion too, but Joe beat me!

The problem started with a spin up (to start the parity check) that failed on Disk 1 (sdb), with communications type errors (exception Emask, BadCRC, ATA bus error). I can't tell if the communications problems are the *cause* of the issue (loose or bad cable?), or there was an issue with the simultaneous spinup used in unRAID v4.3.3, that has caused a few problems like this before, with others. Once it failed to communicate after multiple retries and resets, it failed the drive, which resulted in all of the errors you saw. In any case, the drive is almost certainly PERFECTLY FINE.
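If you want to look for the same kind of messages yourself, a rough sketch is to count the syslog lines that contain the error strings mentioned above. The function name `count_ata_errors` is made up for illustration, and the log path in the usage note is an assumption about where your syslog lives.

```shell
# Hypothetical helper: count the lines in a log file that mention any of the
# communication-type errors named in the post above.
# $1 = path to the log file to scan.
count_ata_errors() {
    grep -c -E 'exception Emask|BadCRC|ATA bus error' "$1"
}
```

Something like `count_ata_errors /var/log/syslog` would then print the number of matching lines. A count of 0 does not prove the cabling is fine; it only means those particular strings were not logged.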

After powering off and rebooting, check to see if the drive assignment is right, you might need to un-assign and re-assign it, and reboot one more time. If the drives all look right, assigned correctly but not red (not all green either), then it is time to use the Make unRAID Trust the Parity Drive procedure. This is one time that the Restore button can be used, but follow the instructions very carefully. And let the parity check finish, since that was what you were attempting anyway. I would expect a few parity errors near the very beginning.

Edit: to emphasize, only use that procedure if ALL of the drives are correct, not disabled or missing. Feel free to post a screen shot first, before proceeding.

November 5, 2008

I checked and I have no loose cables. I don't really want to try swapping out a cable because I have everything cable tied and many of the SATA cables are tucked away so I can't see which cable goes from which drive to which connection on the MOBO without un-doing a lot of the organizing work I did when I built the server.

By the way, you don't have to mess up your fine tying job, to test cables. Just disconnect both ends of a suspect cable, leave it in the ties, and connect a replacement cable directly between the 2 endpoints. It's only while you are testing.

November 5, 2008

Author

Thanks for the reply Rob... that makes sense about the disk not starting up. Maybe in the future I should spin up all the disks before hitting the parity check button? That way, if a disk fails to start up it won't error out on me, and I can click the spin up button again to hopefully get it going.

I followed the steps you posted and the parity check is currently in progress. Disk1 had to be mounted and there were 5,912 writes to the drive before the parity check started. There were also a few more writes done to the parity drive. I'm assuming this is all normal?

And in regards to my cabling, I know I could connect from the drive to the mobo without undoing all my cable ties... but it's impossible for me to see which hard drive cable connects to which port on the MOBO without undoing a lot of my cable organization.

Thanks again... hopefully parity checks out fine and I'm good to go.

November 5, 2008

Author

Oh... just noticed this...

There are no hard drive errors so far... but there are 48 sync errors. I'm assuming those are the errors you said that may pop up in the beginning?

November 5, 2008

Thanks for the reply Rob... that makes sense about the disk not starting up. Maybe in the future I should spin up all the disks before hitting the parity check button? That way, if a disk fails to start up it won't error out on me, and I can click the spin up button again to hopefully get it going.

If your power supply is being stressed by all the disks spinning up at the same time, it might be worth the effort to make sure the 12 volt rails have equal loads and sufficient capacity for all your drives.

November 5, 2008

Author

My power supply is a PC Power and Cooling 750w with a single 12v rail rated at 60A. I purposely got the higher wattage power supply for the larger 12v rail to deal with startup loads. My server only has 11 disks total right now (including parity), so the power supply shouldn't be the issue.
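As a rough sanity check on that, the worst-case 12v load if every disk spins up at once is just disks times spin-up current. The per-disk figure is an assumption here (it is not from this thread; check the drive datasheets), and `spinup_amps` is a made-up helper name.

```shell
# Hypothetical helper: worst-case 12v current if all disks spin up together.
# $1 = number of disks, $2 = assumed spin-up current per disk in whole amps.
spinup_amps() {
    disks=$1
    amps_each=$2
    echo $(( disks * amps_each ))
}
```

With 11 disks and an assumed 3A each, `spinup_amps 11 3` prints 33, which is well under the 60A the rail is rated for.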

I'm thinking it's just a random error. I've had that disk since I built my unRAID server and it's never given me an issue before.

November 6, 2008

It is good that you did the parity check and discovered this problem. The worst thing that can happen is to have a problem that you don't know about and then have a second problem. That is when your chances for data loss go way up!

Good luck with your recovery!

(You are in the best of hands with Joe L. and RobJ!)

November 6, 2008

You can spin up your drives sequentially, one each second, using something like this script.

# loop through the drives spinning them up
# by reading a single "random" block from each disk
# in turn. We use "dd" commands in the background so
# all the disks will be spun up in parallel
for i in `ls /dev/md*`
do
  # determine the number of blocks on the drive
  blocks=`df -k $i | grep -v Filesystem |
    sed "s/\([^ ]*\) *\([^ ]*\) .*/\2/"`
  # calculate a (random) block number to be read somewhere
  # between block 1 and the max blocks on the drive
  skip_b=$(( 1+(`dd if=/dev/urandom count=1 2>/dev/null |
    cksum | cut -f1 -d" "`)%($blocks) ))
  # read the random block from the disk. We use a random
  # block to try to ensure it is not in the cache memory
  # if the block was in the cache memory, the disk would
  # not spin up.
  dd if=$i of=/dev/null \
    count=1 bs=1k skip=$skip_b >/dev/null 2>&1 &
  sleep 1
  echo "`date` ** reading block $skip_b from $i " >>/var/log/spin_up.log
done
wait
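The random block arithmetic in that script can be tried on its own without touching any disks. This is only a sketch of the same urandom + cksum trick; `random_block` is a name made up for illustration, not something in the script or in unRAID.

```shell
# Hypothetical helper: print a pseudo-random block number between 1 and $1
# (inclusive), using the same /dev/urandom + cksum approach as the script.
# The body runs in a subshell "( ... )" so its variables do not leak out.
random_block() (
    blocks=$1
    # cksum prints "<checksum> <byte count>"; keep only the checksum
    sum=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | cksum | cut -f1 -d" ")
    echo $(( 1 + sum % blocks ))
)
```

Calling `random_block 1000` a few times should print different numbers in the range 1 to 1000.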

November 6, 2008

Author

Is there a way to make that script run automatically when I click on the button from the web GUI page?

Doesn't the current startup disk start one disk every 5 seconds or so?

November 6, 2008

Is there a way to make that script run automatically when I click on the button from the web GUI page?

it is built into unmenu... but no, it cannot be invoked from the stock unRAID web-interface.

Doesn't the current startup disk start one disk every 5 seconds or so?

I don't honestly know the timing.

I think they are all started at once...(or within milliseconds of each other) otherwise, if 5 seconds between, it would take a minute to spin up my array, and the spin-up button is way faster than that.

In the script above, the "sleep 1" is the major delay between spin up of each disk. It could be removed for sequential spin up with minimal delay.

Note that the read of each disk is done in parallel, with the "wait" at the end of the script waiting till all are completed before the script exits. (If a drive needs to spin-up, it might take 10 seconds or more to complete the read of the random data block)
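To see the background-plus-wait pattern by itself, here is a small sketch where each "read" is replaced by a one-second sleep. `staggered_start` and its arguments are made up for illustration; the delay argument plays the role of the "sleep 1" in the script.

```shell
# Hypothetical demo of the pattern: start $1 background jobs, pausing $2
# seconds between launches, then block until every job has finished.
staggered_start() (
    count=$1
    delay=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # stand-in for the backgrounded "dd" read of one disk
        ( sleep 1; echo "job $i done" ) &
        sleep "$delay"
        i=$((i + 1))
    done
    # "wait" returns only after all background jobs have exited
    wait
    echo "all $count jobs finished"
)
```

With a delay of 0 (`staggered_start 3 0`) the jobs all start together and the whole thing takes about one second; with a delay of 1 the starts are spaced out, and in both cases the final line only appears after the last job is done.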
