Viper359 Posted December 18, 2019
Is anyone else experiencing issues with write errors followed by a disk being dropped? 6.7 was fine, not a single issue; then I updated to the release client and it started from there. 21 drives. It would report one drive with read errors and kick it. I would run SMART and preclear, not a single issue. Rebuild, not a single issue. Then a few hours later, many drives start having errors. Reboot and everything is good, but it kicked the same disk. Pull it and verify, everything is fine. Preclear again and rebuild, all is fine. Then a few hours after it's done, parity disk #2 starts having these read errors. Now rebuilding, as I know the disk is fine. I am using a DS4246 disk shelf connected to my server. This issue wasn't present in 6.7. Any thoughts or ideas on what's going on?
JorgeB Posted December 18, 2019
3 minutes ago, Viper359 said: Any thoughts or ideas on what's going on?
Difficult to say without any diagnostics; post some if it happens again, or if you haven't rebooted since last time.
trurl Posted December 18, 2019
4 minutes ago, Viper359 said: Preclear again and rebuild
Why would you preclear a disk to rebuild it?
Viper359 Posted December 18, 2019 Author
Because the first try it rebuilt fine, but I wanted to be 100 percent sure the second time. Elimination of possible failure paths, if you will.
trurl Posted December 18, 2019
Preclear, I guess, would be another way to test the disk, but other than that it has nothing at all to do with rebuilding. A clear disk is only needed when adding it to a new slot in an array that already has valid parity: a clear (all-zero) disk has no effect on parity, so parity remains valid when the disk is added. There is no other scenario where a clear disk is required. When rebuilding, the disk will be completely overwritten anyway, so it doesn't matter what was already on it.
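The "no effect on parity" point can be sketched numerically. This is a hedged toy illustration, not Unraid's actual implementation: single parity is a byte-wise XOR across the data disks, and XORing in an all-zero byte changes nothing. The byte values below are made up for the demo.

```shell
#!/bin/sh
# Toy demo (made-up byte values): single parity is a byte-wise XOR across
# data disks, and x XOR 0 = x, so adding a cleared (all-zero) disk leaves
# the parity byte unchanged.
d1=$(( 0xA5 ))          # byte from data disk 1
d2=$(( 0x3C ))          # byte from data disk 2
parity=$(( d1 ^ d2 ))   # parity before the new disk is added
new=$(( 0x00 ))         # corresponding byte on a cleared disk
parity_after=$(( d1 ^ d2 ^ new ))
echo "before=$parity after=$parity_after"
```

Both values come out identical (153 here), which is why adding a precleared disk to a new slot doesn't invalidate parity, while a rebuild simply overwrites the disk regardless.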
Viper359 Posted December 18, 2019 Author
I know; what I was trying to say is I do it to ensure that the disk is fine. Now that several disks are reporting errors that magically stop after a disk is kicked out of the array, or after rebuilding a disk, I am now certain it's not my disks. Rebooting doesn't seem to stop this issue, and no red flags pop up anywhere either.
trurl Posted December 18, 2019
35 minutes ago, Viper359 said: Now that several are reporting errors
Do you have a link?
Viper359 Posted December 18, 2019 Author
Nope. Rebooted and rebuilding the parity drive that got kicked. When it's done, I think I will try to downgrade to 6.7; I didn't have issues then.
Viper359 Posted December 20, 2019 Author
It has done it again. I am attaching my diagnostics. You will notice several disks report a low number of errors, and one drive has been kicked. tower-diagnostics-20191220-1027.zip
JorgeB Posted December 20, 2019
Looks to me like a controller problem:
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: sd 2:0:16:0: Power-on or device reset occurred
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 CDB: opcode=0x88 88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00
Dec 18 05:24:33 Tower kernel: print_req_error: I/O error, dev sdt, sector 11178611664
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624
It starts with disk11 and then also happens on the other disks.
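To see at a glance which disk the errors started on, the md read-error lines can be tallied straight from the syslog. This is a hedged sketch: the embedded sample lines are modeled on the log quoted above; in practice you would point it at /var/log/syslog or the syslog file inside the diagnostics zip instead.

```shell
#!/bin/sh
# Tally "md: diskN read error" lines per disk from a syslog. The sample
# below is embedded for illustration only; set LOG to your real syslog path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:35 Tower kernel: md: disk4 read error, sector=22017
EOF
# Extract the "md: diskN read error" token and count each disk's hits,
# most-affected disk first.
grep -o 'md: disk[0-9]* read error' "$LOG" | sort | uniq -c | sort -rn
rm -f "$LOG"
```

With the sample data, disk11 tops the list, matching JorgeB's observation that the cascade starts there before spreading to other disks on the same controller.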
JorgeB Posted December 20, 2019
Could also be a power-related issue.
Viper359 Posted December 20, 2019 Author (edited)
This is just the thing: this didn't happen on 6.7. Does that make any sense? It's a disk shelf, and everything seems to be working. The only thing I can think of is the interposers, but you would think I would have had this issue on 6.7 during the couple of weeks I was running it? The other thing I can't figure out is why these errors only happen after a rebuild. I rebuilt the parity drive, nothing happened; within an hour of it finishing, bam. Same when disk 11 was causing issues: rebuilt the disk, nothing, an hour or so later, boom.
Edited December 20, 2019 by Viper359
JorgeB Posted December 20, 2019
This is just the thing, this didn't happen on 6.7. Does this make any sense?
Possibly something that changed in the HBA driver. I don't have any experience with PMC HBAs; they are also quite uncommon in the Unraid user base, so it's difficult to say whether it's an isolated issue or not. Can you post some v6.7 diags so I can see if it's using a different driver?
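A quick way to check which kernel driver is bound to the HBA on either version is lspci -k, which prints a "Kernel driver in use" line for each PCI device. Hedged sketch: the sample output below is an assumption modeled on a PMC-Sierra controller, since the pm80xx messages in the log point at that driver; on the live server you would run lspci -k directly rather than parse a canned string.

```shell
#!/bin/sh
# On the server itself you would run:  lspci -k | grep -A 3 -i 'sas'
# Here we parse a sample of that output (assumed for illustration) to pull
# out the driver name the kernel bound to the HBA.
SAMPLE='01:00.0 Serial Attached SCSI controller: PMC-Sierra Inc. Device 8001
	Subsystem: PMC-Sierra Inc. Device 8001
	Kernel driver in use: pm80xx'
printf '%s\n' "$SAMPLE" | sed -n 's/.*Kernel driver in use: //p'
```

Comparing that line (and the module version from modinfo pm80xx) between v6.7 and v6.8 diagnostics would show whether the driver changed between releases.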
trurl Posted December 20, 2019
21 minutes ago, Viper359 said: The other thing I cannot get, why do these errors only happen after a rebuild?
Heat buildup. Excessive vibrations affecting connections.
Viper359 Posted December 20, 2019 Author
4 minutes ago, johnnie.black said: Possibly something that changed in the HBA driver ... can you post some v6.7 diags so I can see if it's using a different driver?
I will have to downgrade and give it a go and see. Is it safe to downgrade now that it's kicked a disk out of my array, then rebuild it and go from there?
JorgeB Posted December 20, 2019
Is it safe to downgrade now that it's kicked a disk out of my array, and then rebuild it, and go from there?
Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top. Don't know if this is the model you're using, but it appears to describe a pretty similar issue: "I've not had good luck with the NetApp PMC8003 HBAs. They work, until you have a lot of stress on the SAS link; then once a drive drops out, it seems to cascade lock up the whole bus under both FreeNAS and CentOS." https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079 When possible we always recommend LSI HBAs for Unraid.
Viper359 Posted December 20, 2019 Author
3 minutes ago, johnnie.black said: Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top. ... When possible we always recommend LSI HBAs for Unraid.
Yes, I believe that is the model I am using. I do have an LSI HBA card and a spare QSFP adapter cable. Maybe I should try this first: do a rebuild and see if it stays stable.
JorgeB Posted December 20, 2019
That's what I would do.
Viper359 Posted December 20, 2019 Author
Alright, later tonight I will power down everything, make the switch, and see what happens. I will also triple-check that the cables are seated properly. Those QSFP cables are thick and don't like to move.
Viper359 Posted December 22, 2019 Author
Had a different parity drive fail that night. Switched out the HBA for an LSI one. Rebuilt the parity drive. Ran 10 hours, no errors. Started rebuilding the other dropped drive. 21 drives running now; we shall see if this was the issue. Just waiting for the rebuild to finish on the last drive.
pXius Posted May 14, 2020 (edited)
On 12/23/2019 at 12:45 AM, Viper359 said: Had a different parity drive fail that night. Switched out the HBA to an LSI one. Rebuilt the parity drive. Ran 10hrs, no errors. Started rebuilding the other dropped drive. 21 drives running now, we shall see if this was the issue.
Did you ever resolve the issue? I'm having the same problems using a NetApp X2065A-R6 HBA and a DS4246. At first I thought it was being caused by a bad interposer, but after reading your thread and having it happen to a second drive, I'm not so sure. I'm considering removing all the interposers and running straight SATA. I've done this in the past, but I get issues with drive detection in unRaid on 8TB+ drives when I reboot or power cycle without interposers.
edit: UPDATE
Ok, I've played around a bit, using "new config" and "parity is valid" to move drives around and add/remove interposers. Interposers change the drive name, hence the "new config". Anyway, I'm starting to think I had bad interposers on the drives that got kicked. Having a bunch of extras, I replaced the interposers on those drives, and so far everything seems fine. In a perfect world I'd just not use interposers at all, because I lose most of the SMART data, including drive temps and the ability to properly spin up and down (temps did work for a while but stopped a few updates back), but without interposers I have a hard time detecting drives that are 8TB and larger. Sometimes I need to replug those drives once or twice before they're detected after a reboot; interposers solve that issue. To save myself all the headache I've ordered an LSI card and the appropriate cable, which will hopefully allow me to run the SATA drives without interposers once again. Looking forward to having SMART data again.
Edited May 15, 2020 by pXius
Viper359 Posted May 19, 2020 Author
Yes, I switched and all my problems went away. Left my interposers in. Since all issues went away, I stopped trying to debug what the actual issue was.