Large Disk Failure Help

January 19, 2018

Author

24 minutes ago, trurl said:

Go to Settings - Notifications. Turn on Help. Configure email. The default settings I think should be good enough to send you an email when you have disk issues.

Got that setup and got the card/cables ordered!

Now I wait til this drive rebuilds I guess.

January 20, 2018

Author

Done rebuilding, 0 errors. Now my write speeds are extremely slow. I'm guessing this is partly due to having parity for the first time and partly due to having a suboptimal HDD setup through my mobo?

I'm getting a write/transfer of 11 MB/s no matter the PC. Is that too slow or to be expected?

Edit: Replugged the cable and got full gigabit speeds after running ethtool. Not sure why that happened, weird.

Edit2: Speeds are now at ~42

Edited January 20, 2018 by tential
Current transfer speeds
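
For reference, a steady 11 MB/s is about what a link negotiated at 100 Mb/s can deliver, which would fit with the speed recovering once the cable was reseated. A rough sketch of the arithmetic in Python; the 6% protocol overhead is an assumed figure for illustration, not something measured in this thread:

```python
def max_payload_mb_per_s(link_mbit: float, overhead: float = 0.06) -> float:
    """Approximate usable throughput in MB/s for a given link speed.

    overhead is an assumed fraction lost to Ethernet/IP/TCP framing.
    """
    raw_mb_per_s = link_mbit / 8  # 8 bits per byte
    return raw_mb_per_s * (1 - overhead)

for link in (100, 1000):
    print(f"{link} Mb/s link: about {max_payload_mb_per_s(link):.1f} MB/s")
```

If a transfer sits just under 12 MB/s regardless of the client, the negotiated link speed is worth checking before blaming the disks.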

January 20, 2018

Community Expert

5 hours ago, tential said:

Edit2: Speeds are now at ~42

That's about right for normal write mode. You might get some more with turbo write enabled, but turbo write is also slowed down by controller bottlenecks, same as parity checks or disk rebuilds.

January 21, 2018

Author

Ok, well, I'm all setup!

Now, I'm just struggling with docker mappings. It's all set up right... but it's not! I'll create a new thread for that issue, I guess.

Still waiting on the card delivery of course. I'm sure that will come with its own host of problems. This has not been easy at any point!

Edited January 21, 2018 by tential

February 6, 2018

Author

Ok, so I had everything working.

I ended up just reusing the same 5TB drive as disk 10. I was too tired to get a new one out.

I've been running for the last 15 days or so, but got the new controller card in, and was close to out of space anyway, so set that up.

Everything was recognized and running fine.

tower-diagnostics-20180206-1210.zip

I then went to sleep, woke up, and again Disk 10, not recognized/unavailable.

I'm guessing that Disk 10 isn't actually dead. At this point, it's 100 Gigs away from being full, so I'd like to keep it if it's working and not throw away ~$100 (even though I have lots of extra drives, I am building a second server of course now that I'm out of drive bays.)

That's my diagnostic of when I first noticed it was wrong. I thought I had downloaded a second diagnostic but I guess not.

I also now have an error on Drive 9? Ugh racking my head here, was hoping I could add this card smoothly and I almost did!!!!

My guess is I just need to unplug and replug everything AGAIN. My wiring scheme is all hell in that case. I doubt the drive actually failed on its own; more likely the user (me) did something when installing the new card.

tower-diagnostics-20180206-1303.zip (Current diagnostic)

Edit: Maybe I should just pick up my Xeon server upgrade now and get this over with before I knock something else loose/screw something else up.

Edited February 6, 2018 by tential

February 6, 2018

Author

Reseated everything

Booted up:

tower-diagnostics-20180206-1409.zip

This has to be an issue with me? A different drive now having issues too? Ugh.

I just need to take this all apart and redo the whole thing before I make this worse don't I?

February 6, 2018

Community Expert

You're having problems with various disks at the same time, disk9 and 10. These are both on the same Marvell controller, so maybe try reseating it. Parity is on the LSI controller and SMART looks fine. Maybe power issues?

February 6, 2018

Author

43 minutes ago, johnnie.black said:

You're having problems with various disks at the same time, disk9 and 10. These are both on the same Marvell controller, so maybe try reseating it. Parity is on the LSI controller and SMART looks fine. Maybe power issues?

Thanks a lot. Disk 9 JUST started, so I think your diagnosis of them both being on the Marvell controller helps: that card is right next to the LSI controller, and those cables are hard not to touch while working on the LSI controller. I recabled again and it looks a lot better in there. I'm starting it up again now.

It could be power issues? I only have a 500 Watt CX(Corsair)? But at the same time, I've been running this for the last 15 days, I just didn't have sata cables on the 3 new HDDs. Would that matter at all? A modular PSU upgrade sounds nice.

The issues only happened when I installed the LSI Controller, and started to clear the 3 new drives. I MUST have knocked something loose on drives 9/10 while installing the LSI Controller that sounds reasonable right?

Should I run a Diagnostic before starting the array?

What steps should I be taking here? I feel like I did a bunch more harm than good right now by restarting constantly, as my head hasn't been clear working on this.

I started up but my rebuild speed is incredibly slow at 2 MB/sec.

tower-diagnostics-20180206-1548.zip is the new one now.

Edit: Speed is up to 120 MB/s yay? So I guess I've got another 20+ hours ahead of me. 12 hours for this rebuild, another 8 for adding the new drives. Hopefully everything checks out!

What should I do regarding reenabling my first Parity drive again?

I've got a pretty good idea exactly which cables were loose now that you've mentioned what was connected to what.

If I'm able to get everything working, this thread seriously should just be called "What happens when you're terrible at cabling power/sata cables..."

Edited February 7, 2018 by tential

February 7, 2018

Community Expert

Still ATA errors on disks 9 and 10. If everything was checked, maybe it's a problem with the controller. A CX500 with 15 disks is pushing it a little; I would use 550/600W, but these ATA errors are almost certainly unrelated to the PSU.

February 7, 2018

Author

4 hours ago, johnnie.black said:

Still ATA errors on disks 9 and 10. If everything was checked, maybe it's a problem with the controller. A CX500 with 15 disks is pushing it a little; I would use 550/600W, but these ATA errors are almost certainly unrelated to the PSU.

Uh oh, that's worrisome.

I just don't see why Disk 9 would have issues, I haven't even been using it. Literally only got errors after I touched something.

Disk 9 still is showing as available in my array/working.

Disk 10 is showing even more errors. Could I have damaged these 2 drives at this point?

tower-diagnostics-20180207-0445.zip

Latest Diagnostic.

February 7, 2018

Community Expert

These ATA errors are likely controller related, and they are the reason you were seeing the slow rebuild at the start. It recovered, but these errors are not normal and can result in the disk being dropped.

In the last diags there are also similar errors on ATA7, another disk on the same Marvell controller:

Feb  6 20:48:07 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  6 20:48:07 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Feb  6 20:48:07 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 18 pio 512 in
Feb  6 20:48:07 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  6 20:48:07 Tower kernel: ata7.00: status: { DRDY }
Feb  6 20:48:07 Tower kernel: ata7: hard resetting link
Feb  6 20:48:07 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  6 20:48:07 Tower kernel: ata7.00: NCQ Send/Recv Log not supported
Feb  6 20:48:07 Tower kernel: ata7.00: NCQ Send/Recv Log not supported
Feb  6 20:48:07 Tower kernel: ata7.00: configured for UDMA/133
Feb  6 20:48:07 Tower kernel: ata7: EH complete

Then more errors on ATA9:

Feb  6 20:54:03 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  6 20:54:03 Tower kernel: ata9.00: irq_stat 0x40000001
Feb  6 20:54:03 Tower kernel: ata9.00: failed command: READ DMA EXT
Feb  6 20:54:03 Tower kernel: ata9.00: cmd 25/00:40:98:39:cf/00:05:ed:00:00/e0 tag 10 dma 688128 in
Feb  6 20:54:03 Tower kernel:         res 53/40:00:a8:3e:cf/00:00:ed:00:00/00 Emask 0x8 (media error)
Feb  6 20:54:03 Tower kernel: ata9.00: status: { DRDY SENSE ERR }
Feb  6 20:54:03 Tower kernel: ata9.00: error: { UNC }
Feb  6 20:54:03 Tower kernel: ata9.00: NCQ Send/Recv Log not supported
Feb  6 20:54:03 Tower kernel: ata9.00: NCQ Send/Recv Log not supported
Feb  6 20:54:03 Tower kernel: ata9.00: configured for UDMA/33
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 Sense Key : 0x3 [current]
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 ASC=0x11 ASCQ=0x0
Feb  6 20:54:03 Tower kernel: sd 10:0:0:0: [sdi] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 ed cf 39 98 00 00 05 40 00 00
Feb  6 20:54:03 Tower kernel: print_req_error: I/O error, dev sdi, sector 3989780888

You need to get rid of that controller, all these errors resulted in this:

Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780824
Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780824
Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780832
Feb  6 20:54:03 Tower kernel: md: recovery thread: multiple disk errors, sector=3989780832
Feb  6 20:54:03 Tower kernel: md: disk9 read error, sector=3989780840

md: recovery thread: multiple disk errors is unRAID speak for "there are errors in more disks than current redundancy can correct, the rebuild/sync will continue but there will be some (or a lot) of corruption."
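
As a cross-check on the log above, the CDB bytes in the `sd 10:0:0:0` line can be decoded by hand. The sketch below is only an illustration, written in Python with a hypothetical helper name; it relies on the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13) and assumes 512-byte sectors:

```python
def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (starting LBA, sector count) from a SCSI READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, count

# CDB copied from the kernel log in this post.
lba, count = decode_read16("88 00 00 00 00 00 ed cf 39 98 00 00 05 40 00 00")
print(lba, count, count * 512)  # LBA, sectors, bytes (assuming 512-byte sectors)
```

The decoded LBA matches the `sector 3989780888` in the `print_req_error` line, and the byte count matches the `dma 688128 in` of the failed READ DMA EXT. The first md sector reported (3989780824) is 64 lower, which would be consistent with the data partition starting 64 sectors into the disk; that offset is an inference from the numbers, not something the log states.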

February 7, 2018

Author

I have 2 of that controller.

Can I swap that controller, and then move PCI Express slots as well (in case that's the issue)?

I think it's in a PCI Express x1 lane now, but I have x16 GPU slots available still.

Also what should I do about the first parity drive? I'm guessing that was simply a cabling issue for sure, how do I re-enable that?

Edited February 7, 2018 by tential

February 7, 2018

Community Expert

36 minutes ago, tential said:

Can I swap that controller, and then move PCI Express slots as well (in case that's the issue)?

You can try, cancel current rebuild so it will start over.

37 minutes ago, tential said:

Also what should I do about the first parity drive? I'm guessing that was simply a cabling issue for sure, how do I re-enable that?

Same as re-enabling a data drive; the difference is it will be resynced instead of rebuilt.

http://lime-technology.com/wiki/Troubleshooting#Re-enable_the_drive

February 7, 2018

Author

3 minutes ago, johnnie.black said:

You can try, cancel current rebuild so it will start over.

Same as re-enabling a data drive; the difference is it will be resynced instead of rebuilt.

http://lime-technology.com/wiki/Troubleshooting#Re-enable_the_drive

Rebuild is already complete.

So my first step then: swap the Marvell controller and move it to a new PCI Express slot.

Start Rebuild(/resync?) again, but on Parity drive first so I have both Parity drives working. Then on Drive 10?

Edited February 7, 2018 by tential

February 7, 2018

Community Expert

You can sync parity1 and rebuild disk10 at the same time, though you might just sync parity first after swapping the controller to see if the issues continue.

February 7, 2018

Author

tower-diagnostics-20180207-0625.zip

That's what my current diagnostic says now. Still having issues with that same drive getting it to show up.

February 7, 2018

Community Expert

The disk on ATA9 is not being detected, possibly a cable issue, or maybe power:

Feb  7 06:24:00 Tower kernel: ata9: softreset failed (1st FIS failed)
Feb  7 06:24:00 Tower kernel: ata9: softreset failed (1st FIS failed)
Feb  7 06:24:00 Tower kernel: ata9: limiting SATA link speed to 3.0 Gbps
Feb  7 06:24:00 Tower kernel: ata9: softreset failed (device not ready)
Feb  7 06:24:00 Tower kernel: ata9: reset failed, giving up
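
When a syslog is long, it can help to tally these libata messages per port to see which links are misbehaving. A minimal sketch in Python, assuming the log line format shown in this thread; the keyword list and the function name are illustrative choices, not anything unRAID provides:

```python
import re
from collections import Counter

# Matches the "ataN" or "ataN.00" prefix that libata puts on its kernel messages.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")
# Substrings treated as trouble indicators; an illustrative list, not exhaustive.
KEYWORDS = ("exception", "failed", "hard resetting link", "giving up")

def count_ata_errors(lines):
    """Count suspicious libata messages per ATA port."""
    counts = Counter()
    for line in lines:
        m = ATA_LINE.search(line)
        if m and any(k in m.group(2) for k in KEYWORDS):
            counts[m.group(1)] += 1
    return counts

# Sample lines copied from the logs quoted in this thread.
sample = [
    "Feb  7 06:24:00 Tower kernel: ata9: softreset failed (1st FIS failed)",
    "Feb  7 06:24:00 Tower kernel: ata9: limiting SATA link speed to 3.0 Gbps",
    "Feb  7 06:24:00 Tower kernel: ata9: reset failed, giving up",
    "Feb  6 20:48:07 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE",
]
print(count_ata_errors(sample))
```

Errors that cluster on ports sharing one controller point at the controller, cabling or power; errors that follow a single disk across ports point at the disk.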

February 7, 2018

Author

Ok recabled, everything is showing up. I tried to sync the 8TB drive for the parity, it's going super slow at 500 kb/sec currently. Maybe it will speed up? I left Disk 10 as emulated. Diagnostic is super slow and hasn't finished yet still.

Says 212 days to finish the parity sync at this rate and the unassigned drives section takes a second to populate when I go to the main page. Should I cancel and do another diagnostic?
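
To sanity-check an estimate like that, the arithmetic is just size divided by rate. A small Python sketch, assuming a constant rate and decimal units (1 TB = 10^12 bytes); the function name is made up for the example:

```python
def sync_eta_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to read or write size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 86_400  # seconds per day

TB = 1e12  # decimal terabyte, as used on drive labels
print(f"{sync_eta_days(8 * TB, 500e3):.0f} days at 500 kB/s")
print(f"{sync_eta_days(8 * TB, 120e6) * 24:.1f} hours at 120 MB/s")
```

At 500 kB/s an 8 TB sync works out to roughly 185 days, so a displayed figure of 212 days suggests the instantaneous rate was a little lower still. At a healthy 120 MB/s the same sync would take well under a day, which is why a rate this low points to a fault rather than normal parity overhead.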

February 7, 2018

Community Expert

3 minutes ago, tential said:

Should I cancel and do another diagnostic?

Probably best

February 7, 2018

Author

5 minutes ago, johnnie.black said:

Probably best

tower-diagnostics-20180207-0752.zip

Ok, here is where I'm at now.

February 7, 2018

Community Expert

The errors on ATA9/10 persist, constantly, you'll need to get a different controller.

February 7, 2018

Author

3 minutes ago, johnnie.black said:

The errors on ATA9/10 persist, constantly, you'll need to get a different controller.

I've been using this controller the whole thread, and now 2 different ones are showing errors?

What about the third drive connected? This is a 4 port controller with the ATA9/10 ones showing errors but there is a third drive connected right?

I still have the 2 port controller card, I can move those over to that card specifically.

Previously, I believe I had the SSDs wired to the controller card. I moved those to the mobo, and some drives from the mobo to the controller card.

Just confused as to why I'm having issues with the controller card now after so much uptime. I'm just confused in general though at this point.

February 7, 2018

Community Expert

It's strange. I was going to say try different disk models, but the other disk is the same model with no issues so far. So, checking the SMART reports again, disk9 is failing:

197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8

Disk10 dropped offline, maybe it's failing also, power cycle the server and get new diags.

February 7, 2018

Author

tower-diagnostics-20180207-0840.zip

That's the latest.


February 7, 2018

Community Expert

Yep, it's failing also:

197 Current_Pending_Sector  -O--C-   079   079   000    -    7112
198 Offline_Uncorrectable   ----C-   079   079   000    -    7112

Kind of my bad for not checking that earlier, but two disks with issues at the same time on the same controller made me suspect the controller, especially because some Marvell controllers tend to act up in unRAID.
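
Since these two attributes turned out to be the deciding evidence, here is a minimal Python sketch for flagging them from a SMART report. It assumes the column order of the rows pasted in this thread (ID, name, flags, value, worst, threshold, fail, raw); the helper names are hypothetical:

```python
def parse_smart_row(row: str) -> dict:
    """Split one attribute row into named fields.

    Assumes the column order shown in this thread:
    ID, name, flags, value, worst, threshold, fail, raw value.
    """
    ident, name, flags, value, worst, thresh, fail, raw = row.split()[:8]
    return {"id": int(ident), "name": name, "value": int(value),
            "worst": int(worst), "thresh": int(thresh), "raw": int(raw)}

WATCH = {197, 198}  # Current_Pending_Sector, Offline_Uncorrectable

def worrying(rows):
    """Return watched attributes whose raw count is above zero."""
    parsed = (parse_smart_row(r) for r in rows)
    return [p for p in parsed if p["id"] in WATCH and p["raw"] > 0]

# Rows copied from the disk10 report above.
rows = [
    "197 Current_Pending_Sector  -O--C-   079   079   000    -    7112",
    "198 Offline_Uncorrectable   ----C-   079   079   000    -    7112",
]
for attr in worrying(rows):
    print(attr["name"], attr["raw"])
```

The check looks at the raw count rather than the normalized value because the threshold here is 000, so the normalized value can never drop below it however many sectors go bad.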
