HELP - Parity Check Failed and drive set as new! - General Support (V5 and Older)

July 8, 2013

UNRAID Ver: 5 rc11 PRO

I'm sure I made some mistakes attempting to fix the problem myself and I'll try to provide as much information as possible hoping someone can help me out.

I run various plugins: SickBeard, CouchPotato, Sabnzbd, Transmission, mysql (for XBMC Shared Library) are the major ones.

Everything had been stable on my system for quite some time, at least a few months. I've rebooted multiple times, and the weekly parity check has completed with zero errors. The NAS was rebooted about 2 1/2 weeks ago when I moved. I also use cron to monitor directories and add torrents to Transmission. Today I started receiving Input/Output error notices via email every minute for hours. Of course I wasn't home, so I couldn't check on the NAS to find out what the issue was. Unfortunately I do not have a syslog for any of these events, only for the most recent reboot.

'sh: /mnt/cache/.scripts/transmission/watchdir.sh: Input/output error' is the message I was receiving.

Finally, when I was able to get home, I commented out the line in the go file that added the cron job until I could figure out what exactly was causing the issue. I stopped the array, rebooted and left to go do something else. I didn't check that the NAS went down and came back up successfully, so I can't confirm that it did. The emails did stop, so I assume the reboot was successful. When I came back to check a few hours later I was unable to load the WebGUI (I waited a few minutes for the page to load). So I tried 'killall emhttp' (I also killed unmenu), which did kill the process, but when I tried to restart it with 'nohup /usr/local/sbin/emhttp &' the result was a Segmentation Fault. So I tried 'powerdown'... This is where everything went downhill...

After I waited a little while, about 10-15 mins, I noticed that the cache drive was still mounted. I realized that SickBeard, CouchPotato, mysql and Transmission were still running - I had been unable to access their web pages earlier to shut them down. So I killed the running processes, which in the past had allowed the NAS to shut down and restart successfully. Not so lucky this time. I waited for another 20 mins, went and did other things, and came back to find the NAS had not yet powered down. Whenever I reboot I run 'ping x.x.x.x -t' so I can tell when the box goes down and comes back. I tried to telnet back into the NAS and was unable to reconnect, but it was still responding to pings. So I left it again for another 10 minutes and patiently waited for the power off to occur. I came back and it was still responding to pings, but I couldn't telnet in or access any shares. So I did what I thought was my last resort - I went to the box and gave it the ol' 1 finger salute, and it shut down. I took out the flash drive, commented out everything not stock, and renamed the plugins directory so I would have a stock UNRAID when it booted back up.

I have a SYSLOG from here on. (attached)

When the system rebooted, it of course started a parity check, which I let run. When I came back to it about 1/2 to 1 hour later (I'm really not sure), the parity check had failed, reporting 758 errors, and the parity disk was red-balled. I stopped the array and now the parity disk is reported as 'New parity disk installed.'

I've run a short smart test. (attached)

A long test is running now. I will post once the test is complete.

I have not done anything else and will patiently wait for some advice on what to do from here...

syslog.zip

smart.txt

July 8, 2013

The log indicates all drives are having trouble communicating and the SMART report suggests power issues. When was the last internal physical change to the server? When was the last drive added? What are the hardware specs including PSU model?

July 8, 2013

Author

Thanks for your help!

The last physical change was in early June, about 2 weeks before I moved. I had 3 pre-cleared HDDs set aside as spares, and I added them to the server but not to the array, to keep them safe during the move. In the go script I use 'hdparm -y' to put those drives in standby, using 'preclear_disk.sh -l' and grep to look up each drive's /dev assignment by its serial #. That way they wouldn't draw any power and would be available if/when needed.

lines from go:

for sparesn in W1E26DXP JP2940HD046Y1C S13PJ90S328879
do
    /boot/./stdbyspare.sh "$sparesn"
done

stdbyspare.sh:

#!/bin/bash
## Get /dev/sdX - works since the drive is OUTSIDE of the array, using the S/N to find the /dev location
# Serial number of the spare drive
sparelog=/boot/spare.txt
sparesn=$1

# Use preclear_disk.sh to list drives outside of the array | keep the line containing the supplied drive S/N | split the line on '=' and return the first field | remove ALL spaces
spare="$(/boot/./preclear_disk.sh -l | grep -- "$sparesn" | cut -d = -f1 | tr -d ' ')"
echo "Drive S/N: $sparesn" >> "$sparelog"
echo "Spare Drive: $spare" >> "$sparelog"
# Skip hdparm if the serial number did not resolve to a device
[ -n "$spare" ] || exit 1
# Log the power state, request standby, then log the state again
/usr/sbin/hdparm -C "$spare" >> "$sparelog"
/usr/sbin/hdparm -y "$spare" >> "$sparelog"
/usr/sbin/hdparm -C "$spare" >> "$sparelog"
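
Since 'hdparm -y' only requests standby once, after the drive has already spun up at power-on, and any later access can wake the drive again, it may be worth re-checking the state from time to time. A minimal sketch, assuming the usual 'drive state is:' line printed by 'hdparm -C'; the function names and the /dev/sdX placeholder are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Sketch: report whether a spare is really in standby, based on `hdparm -C` output.

# drive_state: read `hdparm -C` output on stdin and print only the state
# (for example "standby" or "active/idle").
drive_state() {
    sed -n 's/^[[:space:]]*drive state is:[[:space:]]*//p'
}

# is_standby DEVICE: succeed only if the drive currently reports "standby".
is_standby() {
    [ "$(/usr/sbin/hdparm -C "$1" 2>/dev/null | drive_state)" = "standby" ]
}

# Example use from cron or the go script (device name is a placeholder):
#   is_standby /dev/sdX || echo "spare /dev/sdX is not in standby" >> /boot/spare.txt
```

Keeping the parsing in its own function means it can be tried against a saved copy of the log in /boot/spare.txt without touching the drives.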

H/W:

MB - ECS A780GM-M3

CPU - AMD Athlon X2 250

PSU - OCZ ModXStream Pro 700W

RAM - 4 GB

PCI-E SAS - AOC-SASLP-MV8

PCI SATA - SY-PEX40008

Flash - 4GB

Seagate (2.0A Startup):

3TB - ST3000DM001

2TB - ST2000DM001 x 5 (4 in use - 1 put in standby using hdparm -y during 'go' script)

2TB - ST3200542AS

1TB - ST31000528AS

250GB (cache) - ST3250310AS (2.8A Startup)

Samsung (2.4A Startup) :

1TB - HD103UJ x 2 (1 in use the other put in standby using hdparm -y during 'go' script)

Hitachi (2.0A Startup):

HDS721010CLA332 (put in standby using hdparm -y during 'go' script)

I think you are right that it may be a power issue... I checked the 3 disks that should be in standby, and none of them are. So I'm well over the 25A rail for the PSU (at 31A currently). Without those 3 drives I'm at 19.2A, plus 5A for the mobo, so I thought I'd be good as long as the drives stayed in standby.
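
The arithmetic behind those figures can be reproduced with a short script. This is only a sketch of a worst-case budget: it assumes every drive spins up at the same moment, takes the per-drive startup currents from the list above, and uses the 5A motherboard estimate and the 25A rail figure as quoted in this post rather than from the PSU's spec sheet. The function and variable names are made up for illustration.

```shell
#!/bin/bash
# Sketch: sum per-drive 12V startup current and compare it with the rail limit.

# sum_amps: read "count amps" pairs on stdin, print the total to one decimal.
sum_amps() {
    awk '{ total += $1 * $2 } END { printf "%.1f\n", total }'
}

# over_budget TOTAL LIMIT: succeed if TOTAL is greater than LIMIT.
over_budget() {
    awk -v t="$1" -v l="$2" 'BEGIN { exit !(t > l) }'
}

# Drives in use: 7 Seagate at 2.0A, the cache drive at 2.8A, 1 Samsung at 2.4A.
in_array="7 2.0
1 2.8
1 2.4"
# Spares: 1 Seagate at 2.0A, 1 Samsung at 2.4A, 1 Hitachi at 2.0A.
spares="1 2.0
1 2.4
1 2.0"
board="1 5.0"      # motherboard estimate from the post
rail_limit=25      # 12V rail figure from the post

with_spares=$(printf '%s\n' "$in_array" "$spares" "$board" | sum_amps)
without_spares=$(printf '%s\n' "$in_array" "$board" | sum_amps)

for total in "$with_spares" "$without_spares"; do
    if over_budget "$total" "$rail_limit"; then
        echo "${total}A exceeds the ${rail_limit}A rail"
    else
        echo "${total}A fits within the ${rail_limit}A rail"
    fi
done
```

With the spares spinning the total comes to 30.6A (the "31A" above); without them it is 19.2A for the drives plus 5A, or 24.2A, which leaves very little headroom under 25A.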

Yes, I am looking to purchase a better PSU - SeaSonic G Series SSR-550RM 550W (http://www.newegg.ca/Product/Product.aspx?Item=N82E16817151119). I've just been waiting for a sale/rebate or price drop.

Would it be safe to shut down the server and remove those 3 drives? Is there any way to get the parity drive back without a complete rebuild? Or is the parity completely invalid now?

What would you recommend as the next steps?

July 9, 2013

The 3 extra drives need to be removed. Shutdown or hot-plug.

July 9, 2013

Author

I removed the 3 drives and rebooted the server. Here's a fresh syslog. I have not yet started the array.

Can the parity be recovered or will I have to rebuild it?

syslog-2013-07-09.txt

July 10, 2013

Rebuild and then check parity. Limit the SY-PEX40008 to only 2 drives and performance will be greatly improved.

July 10, 2013

Author

Thanks again for the help!

The parity rebuild failed.

A new syslog is attached. Also a new smart test for the parity drive.

My cache drive seems to have disappeared as well now... the cache drive slot is now set to 'unassigned' and there are no options in the menu. It was an old 250 GB drive... no data of value if the drive is lost. But the fact it's disappeared may help...

smart.txt

syslog-2013-07-09.2.zip

July 10, 2013

The drive is having power issues. Check the connections. The PSU may only support 6-7 drives.

July 10, 2013

Author

Ok, thanks! Maybe time to purchase that new PSU.

In the meantime, I'll shut down, pull the cache drive, check the connections again and try another parity rebuild.

FYI - I also confirmed that I only have 1 drive connected to the SY-PEX40008.

I appreciate the help!

July 10, 2013

Author

The parity did rebuild successfully but during the rebuild disk 5 showed 31 errors. I assume that these would be read errors since nothing was being written to the array during the rebuild. After the parity rebuild was complete, a parity check NOCORRECT completed successfully.

Parity Valid

Last checked on Wed Jul 10 18:12:50 2013 ADT, finding 0 errors.

However, now there are 138 errors showing for disk 5. I know that ANY errors during a rebuild are bad, but I'm just not sure exactly what this means.

Is the disk bad, or does it have corrupt sectors?

Is this caused by the power issues experienced earlier?

I've attached a smart test for the disk in question and a new syslog.

In the log there are quite a few of these lines, which would've occurred during the rebuild:

Jul 10 02:15:42 MR_NAS kernel: md: disk5 read error

Jul 10 02:15:42 MR_NAS kernel: handle_stripe read error: 1974783248/5, count: 1

Jul 10 02:15:42 MR_NAS kernel: md: parity incorrect: 1974783248

Later there are these, which would've occurred during the parity check:

Jul 10 12:01:07 MR_NAS kernel: md: disk5 read error

Jul 10 12:01:07 MR_NAS kernel: handle_stripe read error: 1974783248/5, count: 1
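
To see whether the errors are spread around or keep hitting the same spot, the syslog can be summarized with a couple of small filters. A rough sketch, assuming the log lines look exactly like the ones quoted above; the function names are hypothetical, and the number in the handle_stripe lines is treated simply as a position on the disk:

```shell
#!/bin/bash
# Sketch: summarize md read errors from an unRAID syslog fed on stdin.

# count_md_read_errors: print "diskN count" for every disk with read errors.
count_md_read_errors() {
    awk '/kernel: md: disk[0-9]+ read error/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^disk[0-9]+$/) n[$i]++
         }
         END { for (d in n) print d, n[d] }' | sort
}

# failing_positions: list each distinct position number reported by
# "handle_stripe read error" lines, once each, in numeric order.
failing_positions() {
    sed -n 's/.*handle_stripe read error: \([0-9]*\)\/[0-9]*,.*/\1/p' | sort -un
}

# Example use (path is the usual live syslog location):
#   count_md_read_errors < /var/log/syslog
#   failing_positions < /var/log/syslog
```

In the lines quoted above, the same position (1974783248) shows up in both the rebuild and the parity check, which is the kind of pattern this would make easy to spot across the whole log.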

FYI - I've also ordered a new PSU: Corsair HX650 - http://www.newegg.ca/Product/Product.aspx?Item=N82E16817139012

smart.txt

syslog.txt

July 11, 2013

The SMART report looks fine. Try again with a new PSU.

July 11, 2013

The PSU looks to be a dual-rail unit. Try to find a single-rail PSU.

July 11, 2013

Author

Will do. The new PSU is already ordered and en route. I should receive it tomorrow and will hopefully get it installed tomorrow night.

Got a great deal on a single rail 650W with 54A on 12V rail, CORSAIR HX650 $119.99 CAD + 25USD rebate.

http://www.newegg.ca/Product/Product.aspx?Item=N82E16817139012
