[SOLVED] disk disabled during parity check

Basic server info:

-Unraid Pro 6.6.1

-The two HBAs I have installed are Dell H200s (SAS2008) flashed to the latest firmware P20.00.07.00 following the instructions from Fireball3.

-6 SATA III ports on the mobo, all in use IIRC.

-M/B: ASUSTeK COMPUTER INC. - M5A97 R2.0 | CPU: AMD FX™-6300 Six-Core @ 3500 | Memory: 16 GB (max. installable capacity 32 GB)

 

 

This morning my backup server kicked off its monthly non-correcting parity check. About an hour and a half into the check I got several PushBullet notifications about disk errors from disk 16, followed by a notification that the parity check had been canceled. I imagine when the OS marked the disk as failed it must have some logic to stop the parity check to prevent killing other disks. Either way, because it canceled the check I was able to capture my diagnostics zip file, which is attached.

 

I haven't made any changes (hardware or software) to my backup server in almost 6 months, and it hasn't gotten a single error on any of its recent parity checks. No case openings, drive swaps, SATA cable moves, nothing. The drive is only a few years old and was pre-cleared at least twice when I first purchased it, before being installed.

 

I shut the server down, pulled the offending drive, and attached it to my technician bench. I am running drive tests in SeaTools for Windows (it is a pre-IronWolf Seagate 4TB NAS drive) and so far the drive has passed the SMART check, short DST, and short generic test. It is currently 1/4 of the way through the long generic test and hasn't shown any signs of a problem, so I am tempted to think this was some sort of fluke with the cabling or controller.

 

I wanted to ask the community, though, before I go putting this drive back into service as if nothing happened. If I do end up putting it back into service I will obviously want to run several more non-correcting parity checks to ensure they consistently return zero errors.

 

EDIT: To clarify, I have not actually verified whether the offending disk was attached to one of the Dell H200s or running off the motherboard SATA controller. I thought I could tell from looking through the diagnostic report, but I may just have to physically verify it when I get home from work. The SMART report for the disk doesn't highlight anything bad (the seek error rate and raw read error rate are garbage stats on Seagates, from what I have read online).
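Those two attributes look scary on Seagates because (as widely reported, though not vendor-documented) the raw field packs an operation count and an error count into a single 48-bit number. A quick sketch of that decoding; the raw value below is made up for illustration and is not from this disk's actual report:

```shell
# Assumed Seagate encoding for Seek_Error_Rate / Raw_Read_Error_Rate:
# upper 16 bits of the 48-bit raw value = actual errors,
# lower 32 bits = total operations performed.
raw=118146285                 # hypothetical raw value from smartctl -A
errors=$(( raw >> 32 ))       # actual error count
ops=$(( raw & 0xFFFFFFFF ))   # seeks/reads performed
echo "errors=$errors ops=$ops"
```

With a raw value under 2^32 the error count decodes to zero, which is why a huge-looking raw number is usually harmless on these drives.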

 

void-diagnostics-20181001-0842.zip

Edited by weirdcrap
clarified I don't know which controller the drive is on

Disk is on one of the LSI controllers, SMART looks fine, but there's this after all the errors:

Oct  1 08:18:05 VOID kernel: sd 12:0:4:0: Power-on or device reset occurred

sd 12:0:4:0 is the disabled disk; it's like it lost power and then came back again. It could still be a disk problem, but I would swap both cables (or backplane position) with another disk and rebuild to the same disk; if the issue reoccurs, it's likely the disk.
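For anyone following along, events like that are easy to pull out of the syslog inside the diagnostics zip. A sketch, using a two-line here-doc as a stand-in for the real syslog (the device id matches this thread; the second sample line and the temp path are illustrative):

```shell
# Sample lines standing in for the syslog from the diagnostics zip
cat > /tmp/sample-syslog.txt <<'EOF'
Oct  1 08:18:05 VOID kernel: sd 12:0:4:0: Power-on or device reset occurred
Oct  1 08:18:07 VOID kernel: md: disk16 read error
EOF
# Pull every reset/error event for the disabled disk (sd 12:0:4:0 = disk16)
grep -E 'sd 12:0:4:0|disk16' /tmp/sample-syslog.txt
```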


All my drives are mounted in 5 bay NORCO SS-500s so swapping drive position or SATA cables would be easiest for testing.

 

What do you mean by rebuild to the same? I'm assuming the data on the disk is ok unless the failure during the parity check does some sort of write to the disk?

 

EDIT: I guess my main concern is first making sure the parity data is 100% correct before I go rebuilding over the disabled device if that is what is recommended.

Edited by weirdcrap
16 hours ago, johnnie.black said:

You basically have 2 or 3 options:

 

-re-enable the disabled disk by rebuilding on top

-rebuild to a new disk

-only if you're sure no data was written to disk16 after it got disabled, do a new config and check parity

 

options 2 and 3 (if possible) are the safest; option 1 is the easiest.

Thanks for the breakdown.

 

I did some thinking last night and there isn't anything on this server that isn't either a mirror of my main server or backed up somewhere else in the cloud.

 

I'll go with option 1 for now since I don't have a spare 4TB disk, and I'm fairly certain there were no writes, but I can't guarantee it...
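One hedged way to check the disabled disk's state before deciding is Unraid's /proc/mdstat, which lists per-disk status and error counters. The field names below are assumptions based on typical diagnostics output and may differ by Unraid version; the here-doc stands in for the real file:

```shell
# Illustrative sample of Unraid's /proc/mdstat entries for disk 16
# (field names are assumed; verify against your own /proc/mdstat)
cat > /tmp/mdstat-sample.txt <<'EOF'
rdevStatus.16=DISK_DSBL
rdevNumErrors.16=64
EOF
# Show everything recorded for slot 16 before choosing between a
# rebuild (options 1/2) and a new config + parity check (option 3)
grep '\.16=' /tmp/mdstat-sample.txt
```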

 

The rebuild is in progress; hopefully everything will go smoothly and that will be the end of it. Thanks for your help, Johnnie.

 

EDIT: Rebuild completed with zero issues, seems like it was just a fluke.

Edited by weirdcrap
  • 1 month later...

So the issue is back. I swapped the disk locations and the rebuild went perfectly fine, no issues. Well, now it is monthly parity check time again and the same physical disk dropped out, so I guess it must be the disk.

 

It just seems really odd that it can handle the entire rebuild process and all of those writes to the disk but within 10 minutes of me starting a non-correcting check it resets and causes the array to drop it.

  • 5 weeks later...

New month, new parity check, and another disk is now exhibiting the exact same symptoms. It's on the same LSI controller as the last one and even shares the same breakout cable as the first disk. This is not the disk I mentioned above that I pulled in from the primary server; that one has had zero issues so far on the same card and breakout cable. So I'm guessing another disk has died.

 

I'm going to try to swap it to a different bay after I rebuild again, for the 4th or 5th time in as many months. If these F-ing Seagate disks are just going to keep dying one after the other, I will never buy another Seagate drive again. I have 10+ year old Samsung SpinPoints still going strong while these NAS-grade drives crap out in 3 years. I never understood the hate people have for Seagate, but I am starting to understand why many people tell me they don't buy Seagate on principle.

 

void-diagnostics-20181201-0837.zip

Edited by weirdcrap

Disk looks fine, good idea to swap it to a different bay.

 

There was also a timeout with the parity disk during the parity check, and that one is on a different controller. It's a Marvell one, though, so timeout errors might be normal; but multiple issues during the check could also point to another cause for the problems, like a power supply issue.

15 minutes ago, johnnie.black said:

Disk looks fine, good idea to swap it to a different bay.

 

There was also a timeout with the parity disk during the parity check, and that one is on a different controller, though it's a Marvell one, so timeout errors might be normal, but multiple issues during the check could also point to another reason for the problems, like power supply issue.


 

All of those disks (parity, the one that failed last time, and this most recent one) are in the top Norco 5-bay enclosure, so they share a common power connection (2x Molex connectors). Wouldn't a power issue from the PSU or Molex connectors affect multiple disks, not just 1 or 2 at random? Especially since a parity check spins up all 20 disks at once, which should be the biggest power draw.
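That intuition can be sanity-checked with a back-of-envelope number. Assuming a typical 3.5" drive pulls roughly 2 A on the 12 V rail during spin-up (an assumed figure; actual values vary per model and are listed in the drive's datasheet), the surge for the array here works out to:

```shell
disks=20        # drives in this array
amps_12v=2      # assumed 12 V spin-up current per drive (check the datasheet)
volts=12
surge=$(( disks * amps_12v * volts ))
echo "worst-case simultaneous spin-up: ~${surge} W on the 12 V rail alone"
```

That's on the order of 480 W on one rail for a few seconds, which is exactly the kind of load where a marginal PSU or Molex connection shows itself only at parity-check time.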

 

Yes, it is a Marvell controller. I never had issues with it before, even when all my other Marvell controllers started acting up and I replaced them with LSIs. I'd be open to swapping it out for something else that can support 2-4 disks and will fit in a PCIe x1 slot (the only open slots I have left). Suggestions?

 

The issue is never in the same bay but so far they have all been in that same top 5 bay enclosure.

Edited by weirdcrap
4 minutes ago, johnnie.black said:

These issues can be difficult to diagnose, try swapping things around to rule out as many as possible, cables, bays, etc

I will after the rebuild is done. I swapped the "bad" disk with one connected directly to the mobo SATA controller.

 

FYI, the parity disk didn't reset when the rebuild started and all the disks spun up, and it is still on the Marvell controller. I don't recall seeing that reset in the other logs either, but I will keep my eye out for it as well.

Edited by weirdcrap
On 10/1/2018 at 6:06 PM, weirdcrap said:

All my drives are mounted in NORCO SS-500s so swapping drive position or SATA cables would be easiest for testing.

 

What do you mean by rebuild to the same? I'm assuming the data on the disk is ok unless the failure during the parity check does some sort of write to the disk?

 

So far the parity check hasn't canceled or thrown any errors for disk 18 since moving it; it's about 40% into the check.

 

The parity disk is throwing more of these now (it only seems to happen during parity checks):

 

Dec  2 06:37:22 VOID kernel: ata7: hard resetting link
Dec  2 06:37:32 VOID kernel: ata7: softreset failed (1st FIS failed)
Dec  2 06:37:32 VOID kernel: ata7: hard resetting link
Dec  2 06:37:38 VOID kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec  2 06:37:38 VOID kernel: ata7.00: configured for UDMA/133
Dec  2 06:37:38 VOID kernel: ata7: EH complete
Dec  2 06:40:44 VOID kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x190002 action 0xe frozen
Dec  2 06:40:44 VOID kernel: ata7.00: irq_stat 0x80400000, PHY RDY changed
Dec  2 06:40:44 VOID kernel: ata7: SError: { RecovComm PHYRdyChg 10B8B Dispar }
Dec  2 06:40:44 VOID kernel: ata7.00: failed command: READ DMA EXT
Dec  2 06:40:44 VOID kernel: ata7.00: cmd 25/00:60:30:62:35/00:02:4d:00:00/e0 tag 4 dma 311296 in
Dec  2 06:40:44 VOID kernel:         res 50/00:00:2f:62:35/00:00:4d:00:00/e0 Emask 0x10 (ATA bus error)
Dec  2 06:40:44 VOID kernel: ata7.00: status: { DRDY }
Dec  2 06:40:44 VOID kernel: ata7: hard resetting link
Dec  2 06:40:54 VOID kernel: ata7: softreset failed (1st FIS failed)
Dec  2 06:40:54 VOID kernel: ata7: hard resetting link
Dec  2 06:40:59 VOID kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec  2 06:40:59 VOID kernel: ata7.00: configured for UDMA/133
Dec  2 06:40:59 VOID kernel: ata7: EH complete
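The SError value in that trace carries the same information the kernel spells out in braces; the bits are defined by the SATA SError register. A small decoder for the 0x190002 value above, using the kernel's libata-style flag names (only a subset of bits is listed, enough to cover this trace):

```shell
serr=0x190002   # SError value from the ata7 trace above
# bit position : libata flag name (subset of the SATA SError register)
for pair in 0:RecovData 1:RecovComm 16:PHYRdyChg 19:10B8B 20:Dispar 21:BadCRC; do
  bit=${pair%%:*}; name=${pair#*:}
  (( (serr >> bit) & 1 )) && printf '%s ' "$name"
done
echo
```

This prints RecovComm, PHYRdyChg, 10B8B, and Dispar, matching the kernel's own { RecovComm PHYRdyChg 10B8B Dispar } line. 10B8B decode and disparity errors are link-layer symptoms, which is consistent with a cable, backplane, or power problem rather than bad media.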

So I went ahead and bought a replacement SATA controller card (ASMedia ASM1061 based) that does not use a Marvell chipset. Hopefully that will resolve these new errors on the parity disk. Only the parity disk and 1 data disk are attached to that controller.

void-diagnostics-20181202-0702.zip

Edited by weirdcrap

Update on this.

 

--Moved the disk; the parity check completed fine, but with a couple of initial link resets on the parity disk that I thought at the time (not anymore) may have been caused by the Marvell controller it was hooked up to.

--During the check, though, I started to get a bunch of call traces from the CPU, but the server always seemed to recover and never became unresponsive. Attached are the two zipped logs: the first runs from when I started the parity check through to when I restarted the server after it was complete; the second is from after the server restart. syslogs.zip

--Let it ride while I waited for my new non-Marvell controller.

 

Fast forward to today: I got and installed my new 2-port SATA card. I immediately started a parity check and I am STILL seeing those initial hard link resets on the parity disk. Always only two, right at the start of the check, and no more.

In addition, I now get these call traces whenever I run a parity check, which was not the case last month. They don't seem to occur at any other time, and they appear at seemingly random intervals throughout the check.

This is my log currently as the parity check is in progress: void-syslog-20181204-1829.zip

 

The call traces seem to me like they indicate the CPU is upset because of how busy it is? I have only 2 Docker containers on this server: DuckDNS and Plex. Plex should be sitting idle as I'm the only user and it only gets used for local streaming when my internet is down.
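To see what the kernel was actually doing when a trace fired, each call trace plus its surrounding context can be pulled out of the syslog. A sketch against a three-line stand-in for the real log (the process name and function in the sample are invented for illustration, not taken from the attached logs):

```shell
# Illustrative sample standing in for the attached void-syslog zip contents
cat > /tmp/trace-sample.log <<'EOF'
Dec  4 18:29:01 VOID kernel: CPU: 3 PID: 1234 Comm: example-proc Tainted: ...
Dec  4 18:29:01 VOID kernel: Call Trace:
Dec  4 18:29:01 VOID kernel:  some_function+0x123/0x456
EOF
# One line of context before and after each trace shows which process
# and which kernel path was involved
grep -B1 -A1 'Call Trace:' /tmp/trace-sample.log
```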

 

Ignore the 5 parity errors; I caused those earlier and will correct them once I get everything else figured out. I accidentally applied the OS update while my parity check was running in the background and Unraid got kind of wonky on me, which I believe is where these errors came from. They appeared after that restart.

 

EDIT: I am on the latest OS version (v6.6.6) now, btw.

Edited by weirdcrap
11 minutes ago, weirdcrap said:

I accidentally applied the OS update while my parity check was running in the background and UnRAID got kind of wonky on me (which I believe is where these errors came from). These errors appeared after that restart.

The OS update doesn't actually load until the following reboot.

17 minutes ago, trurl said:

The OS update doesn't actually load until the following reboot.

I understand that part; I just wasn't sure if it did some writes to the array or anything as part of preparing for the update before the reboot.

 

I had started a parity check before the OS update, like I said, and it didn't pick up any of those errors. It wasn't until after I applied the OS update during my check, rebooted the server, and restarted my check that they appeared.

 

It seems odd to me that the parity errors would be caused by anything else (at least as far as writes to the array go). Data only gets copied to this server once a night around 2AM CST via a scheduled rsync script, and it's all moved to the array first thing the following morning by the mover.

Edited by weirdcrap
