Every time I run parity a new disk gets dropped

April 28, 2016

ok,

Ever since I upgraded to v6 I've had non-stop problems. Every time I run a parity check, one of my drives gets dropped from the array and is flagged as unmountable. So far the affected drives have all been on different controllers, and I have checked the cables.

Dropbox link to the log file: https://dl.dropboxusercontent.com/u/20775210/tower-diagnostics-20160428-0850.zip

April 28, 2016

I know you are running an SAS2LP, but what are the full specs of your system? I had exactly the same problem when I upgraded to unRAID 6.

April 28, 2016

Author

Updated my system specs in my sig.

April 28, 2016

Try disabling virtualization (VT-d) in the BIOS. Someone said SAS2LPs have issues with it, though I've never had any issues with 6 SAS2LP cards in unRAID 6.

Do the drives remain unmountable? It could be that parity checks are corrupting filesystems due to unstable hardware (most likely RAM).

April 28, 2016

Author

I am running SAS-to-SATA cables. I've tried moving drives off the SAS card onto the mobo, and off the mobo onto an old SATA card. The issues are happening everywhere. I was running the v6 beta for months with no issues. The only thing that changed was updating to the v6 release and running a parity check.

I will try to disable virtualization and see what happens.

April 28, 2016

In both of my servers with 3x SAS2LP cards, I had drives dropping randomly during checks until I moved the 3rd SAS2LP card to a slower x4 slot. Not ideal, but it worked, and it worked in both my servers. You could try moving the SAS2LP to a different PCI-E slot if your mobo has one. I didn't know about the VT-d issue at the time, so I'm tempted to try it in the full-speed slot with that disabled. Lemme know if disabling VT-d does anything for you.

If you have spare RAM, it wouldn't hurt to throw a 4GB kit in there for testing purposes, both to rule out faulty RAM and because (IMO) 4GB is ideal for v6.

April 28, 2016

Author

So VT-d made no difference. I was finally able to run a filesystem check on the drive and got the following:

Will read-only check consistency of the filesystem on /dev/md7
Will put log info to 'stdout'
###########
reiserfsck --check started at Thu Apr 28 13:32:20 2016
###########
Replaying journal:
Replaying journal: Done.
Reiserfs journal '/dev/md7' in blocks [18..8211]: 0 transactions replayed

The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block that the disk drive internals cannot hide from
your sight,the chances of getting more are generally said to become
much higher (precise statistics are unknown to us), and this disk
drive is probably not expensive enough for you to you to risk your
time and data on it. If you don't want to follow that follow that
advice then if you have just a few bad blocks, try writing to the
bad blocks and see if the drive remaps the bad blocks (that means
it takes a block it has in reserve and allocates it for use for
of that block number). If it cannot remap the block, use badblock
option (-B) with reiserfs utils to handle this block correctly.

bread: Cannot read the block (12025856): (Input/output error).

Any suggestions other than replacing the drive?
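
For anyone who wants to see where that unreadable block sits, here is a rough sketch of the arithmetic. It assumes 4096-byte filesystem blocks and 512-byte logical sectors, neither of which is stated in the reiserfsck output, and `block_to_offsets` is just a hypothetical helper name. The result is relative to the start of /dev/md7, not the raw disk, since the partition offset is not accounted for.

```python
# Assumptions (not taken from the reiserfsck output above):
# 4096-byte filesystem blocks and 512-byte logical sectors.
BLOCK_SIZE = 4096
SECTOR_SIZE = 512


def block_to_offsets(block, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (byte_offset, first_sector, sector_count) within the block device."""
    byte_offset = block * block_size
    first_sector = byte_offset // sector_size
    sector_count = block_size // sector_size
    return byte_offset, first_sector, sector_count


# Block number reported by "bread: Cannot read the block (12025856)".
byte_offset, first_sector, sector_count = block_to_offsets(12025856)
last_sector = first_sector + sector_count - 1
print(f"byte offset {byte_offset}, sectors {first_sector}..{last_sector}")
```

If the real block or sector size differs, pass it in explicitly instead of relying on the defaults.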

April 29, 2016

Community Expert

That usually indicates bad sectors. Run an extended SMART test on disk7.
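
A minimal sketch of how that could be done from the command line with smartctl, wrapped in Python. The device path `/dev/sdX` is a placeholder (the thread does not say which device disk7 maps to), and `extended_test_commands` and `run` are hypothetical helper names. The snippet only prints the commands; call `run` on each one yourself once the device path is right.

```python
import subprocess


def extended_test_commands(device):
    """Build the smartctl invocations for an extended self-test on one device."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended (long) self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log once it finishes
        ["smartctl", "-a", device],              # full SMART report
    ]


def run(argv):
    """Run one command and return its stdout and stderr as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


# "/dev/sdX" is a placeholder; substitute the device that backs disk7.
for argv in extended_test_commands("/dev/sdX"):
    print(" ".join(argv))
```

The extended test runs inside the drive and can take hours on large disks, so the self-test log is only meaningful after it completes.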

You are also still getting errors on your parity disk. If it's already on a different controller with new cables, it's probably a bad disk.

Apr 28 08:32:26 tower kernel: ata19.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 28 08:32:26 tower kernel: ata19.00: failed command: WRITE DMA EXT
Apr 28 08:32:26 tower kernel: ata19.00: cmd 35/00:08:d8:de:53/00:00:22:00:00/e0 tag 29 dma 4096 out
Apr 28 08:32:26 tower kernel:         res 01/04:00:b7:b4:cf/00:00:26:00:00/e0 Emask 0x2 (HSM violation)
Apr 28 08:32:26 tower kernel: ata19.00: status: { ERR }
Apr 28 08:32:26 tower kernel: ata19.00: error: { ABRT }
Apr 28 08:32:26 tower kernel: ata19: hard resetting link
Apr 28 08:32:26 tower kernel: sas: ata20: end_device-1:7: dev error handler
Apr 28 08:32:27 tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 28 08:32:27 tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 28 08:32:27 tower kernel: ata19.00: both IDENTIFYs aborted, assuming NODEV
Apr 28 08:32:27 tower kernel: ata19.00: revalidation failed (errno=-2)
Apr 28 08:32:28 tower kernel: mvsas 0000:05:00.0: Phy6 : No sig fis
Apr 28 08:32:31 tower kernel: ata19: hard resetting link
Apr 28 08:32:31 tower kernel: sas: sas_form_port: phy6 belongs to port6 already(1)!
Apr 28 08:32:31 tower kernel: ata19.00: configured for UDMA/133
Apr 28 08:32:31 tower kernel: ata19: EH complete
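
To see at a glance which ports are affected, something like the following sketch can tally failed commands and link resets per ATA port from a syslog. The regular expressions are based only on the lines quoted above, so other kernel versions may format these messages differently, and `summarize_ata_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modeled on the syslog lines quoted in this thread.
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
LINK_RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize_ata_errors(lines):
    """Count failed commands per (port, command) and hard link resets per port."""
    failed = Counter()
    resets = Counter()
    for line in lines:
        line = line.rstrip()
        match = FAILED_CMD.search(line)
        if match:
            failed[(match.group(1), match.group(2))] += 1
            continue
        match = LINK_RESET.search(line)
        if match:
            resets[match.group(1)] += 1
    return failed, resets


# A few of the lines from the log excerpt above.
SAMPLE = """\
Apr 28 08:32:26 tower kernel: ata19.00: failed command: WRITE DMA EXT
Apr 28 08:32:26 tower kernel: ata19: hard resetting link
Apr 28 08:32:31 tower kernel: ata19: hard resetting link
Apr 28 08:32:31 tower kernel: ata19: EH complete
""".splitlines()

failed, resets = summarize_ata_errors(SAMPLE)
print(dict(failed))
print(dict(resets))
```

To use it on a real log, pass an open file object instead of `SAMPLE`; the path to the syslog inside the diagnostics zip is not given in this thread.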

April 29, 2016

Author

Hmm,

I find it suspect that as soon as I switched to the v6 release I have three drives go bad at once.

Could the problem with the parity disk be that it has never completed a parity check since the upgrade?

James

April 29, 2016

Is it possible you've been ignoring your server, not doing monthly parity checks, etc - and then when you upgraded to v6 you tried running parity for the first time in a while? This happened to my friend, whom I warned time and time again not to ignore his server. He ended up with 3 failing drives when he ran his first parity check in 6 months. Have you checked SMART and run extended SMART tests on the drives that keep dropping?

If the drives are semi-new and from the same batch, I wouldn't be surprised to see 3 drives fail at the same time if they weren't tested thoroughly with preclearing.

April 29, 2016

Author

Yes, I have run SMART tests on the failed drives. Other than the one posted above, all the SMART tests were fine.

It is "possible" that there were problems before the upgrade, but parity checks are set to run monthly, and I do not recall seeing parity issues prior to the upgrade.

April 29, 2016

Failed SMART test = the drive needs to be replaced ASAP. It's an internal test that doesn't factor in any other hardware/software.

You may have multiple issues: a mixture of failing drives, multiple bad cables, multiple bad controllers, or bad RAM. Run memtest for 2-3 passes and hope that it's faulty RAM, because that'd be the easiest fix. Memtest is included in unRAID and can be selected when starting the server up if you connect a display to the server. You could also try swapping cables on any drive that drops out.

I kinda doubt it's v6 causing the other issues, but there were some very rare cases where people couldn't get really old hardware to work with unRAID 6.

April 29, 2016

Here's my story.

Short version - I started with a Core2Duo E6400 LGA775 setup, running an AMI BIOS and 2GB of RAM. There are differences with your setup - you've got an NVidia chipset. But we both have old, similar hardware. I went through a series of steps to improve my server during the unRAID 6 beta. They included adding RAM, upgrading the processor, adding a SAS2LP, adding disks, upgrading the power supply, and updating to unRAID 6. During the process my highly stable unRAID 5 setup became unstable, with the most problematic symptom being disks getting dropped from the array during parity checks.

Here's my main post http://lime-technology.com/forum/index.php?topic=36295.0. The main symptom was a red ball disk during a parity check (not sync), and a bad smart report following the red ball:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.3-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /0:0:2:0
Product:              
Physical block size:  0 bytes
Lowest aligned LBA:   14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Following a reboot, though, the disk would come back just fine with a good smart report and I could rebuild it (or do a New Config and rebuild Parity).

My theory is that there are driver-level incompatibilities between the drivers for the motherboard chipset and the Marvell driver that cause timeouts. It only happens on very old LGA775 hardware and this isn't unRAID code - it's part of the underlying Linux distro. Symptoms like this happened to a few of us on the board, but it was a relatively small number. For some reason my symptoms were worse and quite reproducible because I had some of my drives on the motherboard SATA ports and some of them on the SAS2LP. The thing I did that made the most difference to stabilize my system was to move all the drives onto the SAS2LP. Upgrading the power supply helped, too.

I realize you don't have the luxury of moving all your drives onto a single SAS2LP since you have more drives than I do. You could mess with which drives are on which controllers, or replace the controllers with LSI-based cards like the IBM M1015 or Dell Perc H310s. But, and you're not going to like this, I'm going to recommend more radical surgery - a new motherboard, CPU, and RAM. You're running on 9-year-old hardware and the new Linux device drivers just don't seem to support it as well. I honestly think this will save you a lot of effort and aggravation, and it can be done for less than $200.

So, I'm sorry to recommend brain surgery for a disk problem but if your problem is like mine it may be the best option. To validate, take the following steps:

Get everything clean
Run a parity check and get a red ball
Pull a smart report for the drives
Reboot and pull new smart reports

If you are miraculously seeing clean smart reports after reboot then it's time to look at motherboards and disk controllers.
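
A rough sketch of that before/after comparison, assuming you saved the smartctl output for the dropped drive to text both before and after the reboot. The marker strings come from the bad report quoted earlier in this post; the function names and the "clean" sample text are hypothetical, and this is a heuristic, not a diagnosis.

```python
# Strings that appeared in the post-red-ball report quoted above.
BAD_MARKERS = (
    "A mandatory SMART command failed",
    "Terminate command early due to bad response",
)


def looks_like_dropped_disk(report):
    """Heuristic: True if a smartctl report resembles the red-ball output above."""
    if any(marker in report for marker in BAD_MARKERS):
        return True
    # An empty "Product:" field also suggests the disk stopped answering.
    for line in report.splitlines():
        if line.startswith("Product:") and not line[len("Product:"):].strip():
            return True
    return False


def controller_suspected(before_reboot, after_reboot):
    """True if the report was garbage before the reboot but looks normal after it."""
    return looks_like_dropped_disk(before_reboot) and not looks_like_dropped_disk(after_reboot)


BAD_REPORT = """\
Vendor:               /0:0:2:0
Product:
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
"""
# Illustrative stand-in for a healthy report, not output from this thread.
CLEAN_REPORT = "SMART overall-health self-assessment test result: PASSED\n"

print(controller_suspected(BAD_REPORT, CLEAN_REPORT))
```

A result of True here only says the pattern matches my experience; a drive that still reports errors after the reboot should be treated as a drive problem first.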

April 29, 2016

Author

"So, I'm sorry to recommend brain surgery for a disk problem"

I've been thinking about updating the server, as I want to start taking advantage of the new VM features. Might be time to go ahead and bite the bullet.

I'll let the SMART tests finish running and see where they land.

May 2, 2016

Author

So I believe I have isolated the problem to drives that are connected to the SAS2LP card. Drives that are connected to the mobo or my other RAID card do not seem to be getting dropped during parity checks. Also, the problematic drives are passing extended SMART tests, with the exception of the parity disk. It gives an "interrupted (host reset)" error every time I run a SMART test. HOWEVER, moving the parity disk to a different controller allows the SMART test to complete with error. I believe that my SAS2LP problems are indeed being caused by the new v6, which is odd as I was using a beta version of 6 for several months with no issues.

May 2, 2016

I just noticed that someone pointed out that the OP has an old NVIDIA chipset. If it happens to be an nForce4, there are serious issues with hard drives and the PCI/PCIe buses running off of it (all acknowledged by NVIDIA).

May 2, 2016

Author

Interesting. I do have an nForce4. Perhaps it's time to just bite the bullet and upgrade the mobo.

May 2, 2016

Search these threads for nForce4 (or look in the wiki under compatible hardware), or do a Google search, and you'll see lots of results.

May 2, 2016

"So I believe I have isolated the problem to drives that are connected to the SAS2LP card"

In my experience the SAS2LP card is involved, but it is not the entire problem. It has more to do with compatibility between your motherboard and the SAS2LP. Many people are using the SAS2LP without any problems. Unfortunately there are a few of us on older hardware who are having problems. That, combined with your potential nForce4 issues, isn't a good place to be. My opinion is that going to a new motherboard will cost you some $ but save you a lot of effort over continuing to try to diagnose this issue.

May 2, 2016

Author

Agreed. Posted a new topic in the mobo forum asking for suggestions on what board to use with my spare i3 LGA1155 system. I have a complete system, but the mobo is mini-ITX, so I just need to find a new full-size mobo with lots of SATA ports.

May 2, 2016

How many total drives do you need to support?

May 5, 2016

Author

Currently have 14 drives, with capacity for up to 20 in the case: 8 running off the mvp8 card, 6 off the mobo.

May 5, 2016

Author

"I'm going to recommend more radical surgery - a new motherboard, CPU, and RAM. You're running on 9 year old hardware and the new Linux device drivers just don't seem to support it as well. I honestly think this will save you a lot of effort and aggravation, and it can be done for less than $200."

What kind of mobo/CPU/RAM combo are you suggesting for less than $200? I can't find anything even remotely close to that kind of pricing.

May 6, 2016

If you want to use your existing LGA1155 Core i3, it looks like there are still H61, Q67, and Z77 motherboards available. The ASRock Z77 Extreme4 looks like it would work for you. The H61 and Q67 motherboards are trickier for several reasons. They have fewer SATA ports, many have Realtek LAN chips (unRAID gets along with some but not all of them), and they have only 1 PCIe x16 slot for a SATA controller (the Intel BOXDQ67SWB3 has an extra x4 slot so it might work).

If you want to go with new hardware on a $200 budget, I'd go with something like:

Intel Pentium G3258 $69

G.SKILL Ripjaws Series 8GB (2 x 4GB) $38

ASRock H97M Pro4 $79

To make sure that the second PCIe slot is at least x8, it looks like it might go a little over your budget, but there are a lot of options. What were you looking at?

May 6, 2016

Author

That's not a bad price for a complete upgrade. I wanted to keep it under $250. The only thing is the dual core. If I am going to go with a complete new system, I think I'd want at least a quad core so I can start doing some virtualization. The quad cores I've found start around $150.
