Large Disk Failure Help


tential


24 minutes ago, trurl said:

Go to Settings - Notifications. Turn on Help. Configure email. The default settings I think should be good enough to send you an email when you have disk issues.

Got that set up and got the card/cables ordered!

 

Now I wait until this drive rebuilds, I guess.

Link to comment

Done rebuilding, 0 errors.  Now my write speeds are extremely slow.  I'm guessing this is partly due to running with parity for the first time and partly due to a suboptimal HDD setup through my motherboard?

I'm getting write/transfer speeds of 11 MB/s no matter which PC I use. Is that too slow, or is it to be expected?

Edit: Replugged the cable and got full gigabit speeds after running ethtool.  Not sure why that happened. Weird.

Edit2: Speeds are now at ~42
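A flat 11 MB/s from every client is roughly what a link negotiated at 100 Mb/s would give, which would fit with the cable replug restoring gigabit. The sketch below only does the unit conversion (8 bits per byte, protocol overhead ignored); the function name is made up for illustration and nothing here queries the actual NIC.

```python
def link_ceiling_mb_per_s(link_mbit_per_s: float) -> float:
    """Theoretical ceiling in MB/s for a link speed given in Mb/s.

    Assumes 8 bits per byte and ignores Ethernet/TCP/SMB overhead,
    so real transfers will land somewhat below this figure.
    """
    return link_mbit_per_s / 8.0


for speed in (100, 1000):
    print(f"{speed} Mb/s link -> at most {link_ceiling_mb_per_s(speed):.1f} MB/s")
```

If the ceiling for the negotiated speed is close to what the transfer shows, the network link is the first thing to check before blaming parity or the disks.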

Edited by tential
Current transfer speeds
Link to comment

Ok, well, I'm all set up!

Now I'm just struggling with docker mappings.  It's all set up right... but it's not!  I'll create a new thread for that issue, I guess.

Still waiting on the card delivery of course.  I'm sure that will come with its own host of problems.  This has not been easy at any point!

Edited by tential
Link to comment
  • 3 weeks later...

Ok, so I had everything working.

I ended up just reusing the same 5TB drive as disk 10.  I was too tired to get a new one out.

I've been running for the last 15 days or so, but got the new controller card in, and was close to out of space anyway, so I set that up.

Everything was recognized and running fine.

tower-diagnostics-20180206-1210.zip

I then went to sleep, woke up, and again Disk 10 was not recognized/unavailable.

 

I'm guessing that Disk 10 isn't actually dead.  At this point, it's 100 Gigs away from being full, so I'd like to keep it if it's working and not throw away ~$100 (even though I have lots of extra drives, I am building a second server of course now that I'm out of drive bays.)

 

That's my diagnostic of when I first noticed it was wrong.  I thought I had downloaded a second diagnostic but I guess not.

 

I also now have an error on Drive 9?  Ugh, racking my brain here. I was hoping I could add this card smoothly, and I almost did!!!!

 

My guess is that I just need to unplug and replug everything AGAIN.  My wiring is a mess in that case.  I doubt the drive itself actually failed; more likely the user (me) did something when I installed the new card.

 

tower-diagnostics-20180206-1303.zip (Current diagnostic)

 

Edit: Maybe I should just pick up my Xeon server upgrade now and get this over with before I knock something else loose/screw something else up.

Edited by tential
Link to comment
43 minutes ago, johnnie.black said:

You're having problems with various disks at the same time, disk9 and 10. These are both on the same Marvell controller, so maybe try reseating it. Parity is on the LSI controller and SMART looks fine, so maybe power issues?

Thanks a lot.  Disk 9 JUST started, so I think your diagnosis of them being on the Marvell controller helps, since that is right next to the LSI controller, and those cables are hard to avoid touching while working with the LSI controller.  I recabled again and it looks a lot better in there. I'm starting it up again now.

 

It could be power issues?  I only have a 500 Watt CX(Corsair)?  But at the same time, I've been running this for the last 15 days, I just didn't have sata cables on the 3 new HDDs.  Would that matter at all? A modular PSU upgrade sounds nice.

 

The issues only happened when I installed the LSI controller and started to clear the 3 new drives.  I MUST have knocked something loose on drives 9/10 while installing the LSI controller. That sounds reasonable, right?

 

Should I run a Diagnostic before starting the array?  

What steps should I be taking here? I feel like I did a bunch more harm than good by restarting constantly, as my head hasn't been clear while working on this.

 

I started up but my rebuild speed is incredibly slow at 2 MB/sec.

 

tower-diagnostics-20180206-1548.zip is the new one now.

 

Edit: Speed is up to 120 MB/s yay?  So I guess I've got another 20+ hours ahead of me.  12 hours for this rebuild, another 8 for adding the new drives.  Hopefully everything checks out!

 

What should I do regarding reenabling my first Parity drive again?

 

I've got a pretty good idea exactly which cables were loose now that you've mentioned what was connected to what.

If I'm able to get everything working, this thread seriously should just be called "What happens when you're terrible at cabling power/sata cables..."

Edited by tential
Link to comment
4 hours ago, johnnie.black said:

Still ATA errors on disks 9 and 10. If everything was checked, maybe it's a problem with the controller. A CX500 with 15 disks is pushing it a little; I would use 550/600W, but these ATA errors are almost certainly unrelated to the PSU.

 

Uh oh, that's worrisome. 

I just don't see why Disk 9 would have issues, I haven't even been using it.  Literally only got errors after I touched something.  

Disk 9 still is showing as available in my array/working.

Disk 10 is showing even more errors.  Could I have damaged these 2 drives at this point?

tower-diagnostics-20180207-0445.zip

 

Latest Diagnostic.

Link to comment

These ATA errors are likely controller related and are the reason you were seeing the slow rebuild at the start. It recovered, but these errors are not normal and can result in the disk being dropped.

 

In the last diags there are also similar errors on ATA7, another disk on the same Marvell controller:

 

Feb  6 20:48:07 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  6 20:48:07 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Feb  6 20:48:07 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 18 pio 512 in
Feb  6 20:48:07 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  6 20:48:07 Tower kernel: ata7.00: status: { DRDY }
Feb  6 20:48:07 Tower kernel: ata7: hard resetting link
Feb  6 20:48:07 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  6 20:48:07 Tower kernel: ata7.00: NCQ Send/Recv Log not supported
Feb  6 20:48:07 Tower kernel: ata7.00: NCQ Send/Recv Log not supported
Feb  6 20:48:07 Tower kernel: ata7.00: configured for UDMA/133
Feb  6 20:48:07 Tower kernel: ata7: EH complete

Then more errors on ATA9:

 

Feb  6 20:54:03 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  6 20:54:03 Tower kernel: ata9.00: irq_stat 0x40000001
Feb  6 20:54:03 Tower kernel: ata9.00: failed command: READ DMA EXT
Feb  6 20:54:03 Tower kernel: ata9.00: cmd 25/00:40:98:39:cf/00:05:ed:00:00/e0 tag 10 dma 688128 in
Feb  6 20:54:03 Tower kernel:         res 53/40:00:a8:3e:cf/00:00:ed:00:00/00 Emask 0x8 (media error)
Feb  6 20:54:03 Tower kernel: ata9.00: status: { DRDY SENSE ERR }
Feb  6 20:54:03 Tower kernel: ata9.00: error: { UNC }
Feb  6 20:54:03 Tower kernel: ata9.00: NCQ Send/Recv Log not supported
Feb  6 20:54:03 Tower kernel: ata9.00: NCQ Send/Recv Log not supported
Feb  6 20:54:03 Tower kernel: ata9.00: configured for UDMA/33
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 Sense Key : 0x3 [current]
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 ASC=0x11 ASCQ=0x0
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 ed cf 39 98 00 00 05 40 00 00
Feb  6 20:54:03 Tower kernel: print_req_error: I/O error, dev sdi, sector 3989780888

You need to get rid of that controller. All these errors resulted in this:

 

Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780824
Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780824
Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780832
Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780832
Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780840

 

"md: recovery thread: multiple disk errors" is unRAID speak for "there are errors on more disks than the current redundancy can correct; the rebuild/sync will continue, but there will be some (or a lot of) corruption."
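To get a feel for how much of a rebuild was affected, the md lines can be counted straight from the syslog in the diagnostics zip. This is a rough sketch, assuming the messages look exactly like the ones quoted above; the function name and the regexes are illustrative and not part of unRAID.

```python
import re
from collections import Counter

# Patterns modelled on the md lines quoted above (assumed message format).
READ_ERR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
MULTI_ERR = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")


def summarize_md_errors(lines):
    """Return (read errors per disk, sorted sectors hit by 'multiple disk errors')."""
    per_disk = Counter()
    uncorrectable = set()
    for line in lines:
        m = READ_ERR.search(line)
        if m:
            per_disk[m.group(1)] += 1
            continue
        m = MULTI_ERR.search(line)
        if m:
            uncorrectable.add(int(m.group(1)))
    return per_disk, sorted(uncorrectable)


sample = [
    "Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780824",
    "Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780824",
    "Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780832",
    "Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780832",
    "Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780840",
]

per_disk, uncorrectable = summarize_md_errors(sample)
print(dict(per_disk))
print(uncorrectable)
```

Every sector in the second list is a spot where the rebuilt disk may hold bad data, so a long list is a strong hint to redo the rebuild once the underlying problem is fixed.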

Link to comment

I have 2 of that controller.  

Can I swap that controller, and then move PCI Express slots as well (in case that's the issue)?

I think it's in a PCI Express x1 lane now, but I have x16 GPU slots available still.

 

Also what should I do about the first parity drive?  I'm guessing that was simply a cabling issue for sure, how do I re-enable that? 

 

Edited by tential
Link to comment
36 minutes ago, tential said:

Can I swap that controller, and then move PCI Express slots as well (in case that's the issue)?

You can try. Cancel the current rebuild so it will start over.

 

37 minutes ago, tential said:

Also what should I do about the first parity drive?  I'm guessing that was simply a cabling issue for sure, how do I re-enable that? 

Same as re-enabling a data drive; the difference is that it will be resynced instead of rebuilt.

 

http://lime-technology.com/wiki/Troubleshooting#Re-enable_the_drive

 

 

Link to comment
3 minutes ago, johnnie.black said:

You can try. Cancel the current rebuild so it will start over.

 

Same as re-enabling a data drive; the difference is that it will be resynced instead of rebuilt.

 

http://lime-technology.com/wiki/Troubleshooting#Re-enable_the_drive

 

 

 

Rebuild is already complete.

 

So my first step, then: swap the Marvell controller and move it to a new PCI Express slot.

 

Start Rebuild(/resync?) again, but on Parity drive first so I have both Parity drives working.  Then on Drive 10?

Edited by tential
Link to comment

The disk on ATA9 is not being detected, possibly a cable issue, or maybe power:

 

Feb  7 06:24:00 Tower kernel: ata9: softreset failed (1st FIS failed)
Feb  7 06:24:00 Tower kernel: ata9: softreset failed (1st FIS failed)
Feb  7 06:24:00 Tower kernel: ata9: limiting SATA link speed to 3.0 Gbps
Feb  7 06:24:00 Tower kernel: ata9: softreset failed (device not ready)
Feb  7 06:24:00 Tower kernel: ata9: reset failed, giving up
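When several ports are complaining at once, it can help to tally the failed commands and error reasons per ATA port instead of reading the syslog by eye. The sketch below assumes the kernel line layout seen in the excerpts quoted in this thread, where the "res ..." continuation line carries the Emask reason but no port prefix; the helper name is hypothetical. As a general rule of thumb, a timeout can come from the cable, power, or controller, while a media error with UNC usually means the drive itself could not read a sector.

```python
import re
from collections import defaultdict

# "kernel: ata7.00: <message>" or "kernel: ata9: <message>" (assumed layout)
PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")
# Only matches when a reason in parentheses follows, e.g. "Emask 0x4 (timeout)"
EMASK = re.compile(r"Emask 0x[0-9a-f]+ \(([^)]+)\)")
FAILED_PREFIX = "failed command: "


def tally_ata_errors(lines):
    """Group failed commands and Emask reasons by ATA port."""
    failed = defaultdict(list)
    reasons = defaultdict(list)
    current = None  # last port seen; continuation lines belong to it
    for line in lines:
        m = PORT.search(line)
        if m:
            current = m.group(1)
            msg = m.group(2)
            if msg.startswith(FAILED_PREFIX):
                failed[current].append(msg[len(FAILED_PREFIX):])
        r = EMASK.search(line)
        if r and current is not None:
            reasons[current].append(r.group(1))
    return dict(failed), dict(reasons)


sample = [
    "Feb  6 20:48:07 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Feb  6 20:48:07 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE",
    "Feb  6 20:48:07 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Feb  6 20:54:03 Tower kernel: ata9.00: failed command: READ DMA EXT",
    "Feb  6 20:54:03 Tower kernel:         res 53/40:00:a8:3e:cf/00:00:ed:00:00/00 Emask 0x8 (media error)",
]

failed, reasons = tally_ata_errors(sample)
print(failed)
print(reasons)
```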

 

Link to comment

Ok, recabled, and everything is showing up.  I tried to sync the 8TB drive for the parity, but it's going super slow at 500 kb/sec currently.  Maybe it will speed up?  I left Disk 10 as emulated.  The diagnostic is super slow and still hasn't finished.

 

Says 212 days to finish the parity sync at this rate and the unassigned drives section takes a second to populate when I go to the main page.  Should I cancel and do another diagnostic?
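The estimate is just size divided by speed, so it is easy to sanity-check. A minimal sketch, assuming decimal units, a constant speed, and that "500 kb/sec" means kilobytes per second; the function name is made up for illustration.

```python
TB = 1000 ** 4  # decimal terabyte, as drive vendors count
MB = 1000 ** 2
KB = 1000


def eta_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to process total_bytes at a constant bytes_per_second."""
    return total_bytes / bytes_per_second / 3600.0


# 5TB data disk rebuilding at 120 MB/s (the earlier rebuild in this thread)
print(f"5TB at 120 MB/s: {eta_hours(5 * TB, 120 * MB):.1f} hours")

# 8TB parity disk syncing at 500 KB/s
print(f"8TB at 500 KB/s: {eta_hours(8 * TB, 500 * KB) / 24:.0f} days")
```

That works out to about 11.6 hours for the 5TB rebuild, in line with the 12 hours mentioned earlier, and about 185 days for the 8TB sync. The GUI showing 212 days suggests the real rate was somewhat under 500 KB/s at that moment. Either way, a healthy sync should run tens of MB/s or more, so an estimate measured in months points to a problem rather than a slow disk.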

Link to comment
3 minutes ago, johnnie.black said:

The errors on ATA9/10 persist constantly; you'll need to get a different controller.

I've been using this controller the whole thread, and now 2 different ones are showing errors? 

What about the third drive connected?  This is a 4 port controller with the ATA9/10 ones showing errors but there is a third drive connected right?  

I still have the 2 port controller card, I can move those over to that card specifically.

 

Previously, I believe I had the SSDs wired on the controller card.  I moved those to the mobo, and some from the mobo to the controller card.

 

Just confused as to why I'm having issues with the controller card now after so much uptime.  I'm just confused in general, though, at this point.

 

Link to comment

It's strange. I was going to say try different disk models, but the other disk is the same model with no issues so far. So, checking the SMART reports again, disk9 is failing:

 

197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8

Disk10 dropped offline, so maybe it's failing also. Power cycle the server and get new diags.

 

Link to comment

Yep, it's failing also:

 

197 Current_Pending_Sector  -O--C-   079   079   000    -    7112
198 Offline_Uncorrectable   ----C-   079   079   000    -    7112

Kind of my bad for not checking that earlier, but two disks with issues at the same time on the same controller made me suspect the controller, especially because some Marvell controllers tend to act up in unRAID.
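A check like this is easy to automate against the SMART reports in a diagnostics zip. The sketch below assumes attribute rows in the column layout quoted above (ID, name, flags, value, worst, threshold, fail, raw) and flags the two attributes discussed here whenever the raw count is above zero; the function name and the watched set are illustrative choices, not an unRAID or smartctl feature. Note that the normalized values (100/100 and 079/079) are still above the threshold of 000 on both disks, so only the raw column shows the problem.

```python
# Attributes from this thread that should normally stay at a raw value of 0.
WATCHED = {"Current_Pending_Sector", "Offline_Uncorrectable"}


def flag_failing_attributes(smart_lines):
    """Return {attribute name: raw value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_lines:
        fields = line.split()
        # Expect: ID NAME FLAGS VALUE WORST THRESH FAIL RAW
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[7]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


disk9 = [
    "197 Current_Pending_Sector  -O--C-   100   100   000    -    8",
    "198 Offline_Uncorrectable   ----C-   100   100   000    -    8",
]
disk10 = [
    "197 Current_Pending_Sector  -O--C-   079   079   000    -    7112",
    "198 Offline_Uncorrectable   ----C-   079   079   000    -    7112",
]

print("disk9:", flag_failing_attributes(disk9))
print("disk10:", flag_failing_attributes(disk10))
```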

Link to comment
