Array Issues on current stable


Viper359


Is anyone else experiencing issues with write errors and then a disk being dropped?

 

6.7 was fine, not a single issue, then I updated to the release client, and it started from there.

 

21 drives. It would report one drive with read errors and kick it. I would run SMART tests and a preclear, not a single issue. Rebuild, not a single issue. Then a few hours later, many drives start having errors.

 

Reboot and everything good, but it kicked the same disk. Pull it and definitely verify, everything is fine. Preclear again and rebuild, all is fine.

 

Then a few hours after it's done, parity disk #2 starts having these read errors. Now rebuilding, as I know the disk is fine.

 

I am using a DS4246 disk shelf connected to my server. This issue wasn't present in 6.7

 

Any thoughts or ideas what's going on?

Link to comment

Preclear, I guess, would be another way to test the disk, but other than that it has nothing at all to do with a rebuild. A clear disk is only needed when adding it to a new slot in an array that already has valid parity. A clear (all zero) disk has no effect on parity, so parity remains valid when adding a clear disk to a new slot.

 

There is no other scenario where a clear disk is required. When rebuilding, the disk will be completely overwritten anyway so it doesn't matter at all what was already on the disk.
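
To illustrate why a clear disk leaves parity valid, and why the old contents of a rebuild target don't matter, here is a minimal Python sketch. It assumes a single XOR parity over equally sized byte strings; the names `xor_parity` and `rebuild` and the sample bytes are made up for illustration, and this is not Unraid's actual md driver code.

```python
from functools import reduce

def xor_parity(disks):
    """Byte-wise XOR across equally sized disk images."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

def rebuild(missing_index, disks, parity):
    """Recompute one disk from parity plus all the other disks.

    Whatever is stored at disks[missing_index] is ignored, which is why
    the previous contents of a rebuild target are irrelevant.
    """
    others = [d for i, d in enumerate(disks) if i != missing_index]
    return xor_parity(others + [parity])

# Two tiny 4-byte "disks" with made-up contents.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
parity = xor_parity([disk1, disk2])

# Adding an all-zero disk in a new slot leaves parity unchanged,
# because x ^ 0 == x for every byte.
cleared = bytes(4)
parity_with_cleared = xor_parity([disk1, disk2, cleared])

# Adding a disk that is not all zeros would make the old parity stale.
dirty = bytes([0xFF, 0x00, 0x00, 0x00])
parity_with_dirty = xor_parity([disk1, disk2, dirty])

# Rebuilding slot 1 reproduces disk2 even if the slot holds junk.
rebuilt = rebuild(1, [disk1, b"junk"], parity)

print(parity_with_cleared == parity)  # cleared disk: parity still valid
print(parity_with_dirty == parity)    # non-zero disk: parity no longer valid
print(rebuilt == disk2)               # rebuild ignores old contents
```

The same reasoning is why a preclear before a rebuild only serves as a disk test: the rebuild writes every sector of the target anyway.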

Link to comment

I know, what I was trying to say is I do it to ensure that the disk was fine.

 

Now that several are reporting errors that magically stop happening after a disk is kicked out of the array, or rebuilding a disk, I am now certain it's not my disks.

 

Rebooting etc. doesn't seem to stop this issue, and no red flags pop up anywhere either.

Link to comment

Looks to me like a controller problem:

 

Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: sd 2:0:16:0: Power-on or device reset occurred
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 CDB: opcode=0x88 88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00
Dec 18 05:24:33 Tower kernel: print_req_error: I/O error, dev sdt, sector 11178611664
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624

It starts with disk11 and then also happens on the other disks.
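
For anyone who wants to check a syslog excerpt like this themselves, here is a rough Python sketch. It tallies the `md:` read errors per array disk and decodes the READ(16) command (opcode 0x88) from the CDB line, where bytes 2-9 are the starting LBA and bytes 10-13 the transfer length. The helper names `tally_md_errors` and `decode_read16` are hypothetical, and the regular expressions are only fitted to the lines posted above.

```python
import re
from collections import Counter

# Lines copied from the syslog excerpt above.
LOG = """\
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 CDB: opcode=0x88 88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00
Dec 18 05:24:33 Tower kernel: print_req_error: I/O error, dev sdt, sector 11178611664
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624
"""

MD_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
CDB_LINE = re.compile(r"CDB: opcode=0x88 ((?:[0-9a-f]{2} ?)+)")

def tally_md_errors(text):
    """Count md read errors per array disk name."""
    return Counter(m.group(1) for m in MD_ERROR.finditer(text))

def decode_read16(cdb_hex):
    """Return (starting LBA, block count) from a READ(16) CDB hex string."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16 or raw[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(raw[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(raw[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks

errors = tally_md_errors(LOG)
lba, blocks = decode_read16(CDB_LINE.search(LOG).group(1))
first_md_sector = int(MD_ERROR.search(LOG).group(2))

print(errors)                 # read errors per disk
print(lba, blocks)            # failed request: start sector and length
print(lba - first_md_sector)  # offset between device and md sector numbers
```

On the posted lines the decoded LBA is 11178611664, matching the `print_req_error` sector, so a single failed 32-sector read accounts for the disk11 errors shown. The md sector numbers are 64 lower than the device sector, which would be consistent with the data partition starting at sector 64 (my assumption, the thread doesn't say).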

 

Link to comment

This is just the thing, this didn't happen on 6.7.

 

Does this make any sense? It's a disk shelf, and everything seems to be working. The only thing I can think of is the interposers, but you would think I would have had this issue on 6.7 for the couple of weeks I was running it?

 

The other thing I cannot get: why do these errors only happen after a rebuild? Like, I rebuilt the parity drive, nothing happened. Within an hour of it finishing, bam. Same when disk 11 was causing issues. Rebuilt the disk, nothing, an hour or so later, boom.

Edited by Viper359
Link to comment
This is just the thing, this didn't happen on 6.7.

 

Does this make any sense?

Possibly something changed in the HBA driver. I don't have any experience with PMC HBAs, and they are also quite uncommon among the Unraid user base, so it's difficult to say if it's an isolated issue or not. Can you post some v6.7 diags so I can see if it's using a different driver?

 

 

Link to comment
4 minutes ago, johnnie.black said:

Possibly something changed in the HBA driver. I don't have any experience with PMC HBAs, and they are also quite uncommon among the Unraid user base, so it's difficult to say if it's an isolated issue or not. Can you post some v6.7 diags so I can see if it's using a different driver?

 

 

I will have to downgrade and give it a go and see. Is it safe to downgrade now that it's kicked a disk out of my array, and then rebuild it, and go from there?

Link to comment
Is it safe to downgrade now that it's kicked a disk out of my array, and then rebuild it, and go from there?

Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top.

 

 

Don't know if this is the model you're using but it appears to describe a pretty similar issue:

I've not had good luck with the NetApp PMC8003 HBAs. They work, until you have a lot of stress on the SAS link, then once a drive drops out, it seems to cascade lock up the whole bus under both FreeNAS and CentOS.

https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079

 

When possible we always recommend LSI HBAs for Unraid.

 

 

Link to comment
3 minutes ago, johnnie.black said:

Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top.

 

 

Don't know if this is the model you're using but it appears to describe a pretty similar issue:

https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079

 

When possible we always recommend LSI HBAs for Unraid.

Yes, I believe that is the model I am using. I do have an LSI HBA card and a spare QSFP to whatever adapter cable. Maybe I should try this first. Do a rebuild, and see if it stays stable.

Link to comment

Had a different parity drive fail that night.

 

Switched out the HBA to an LSI one.

 

Rebuilt the parity drive. Ran 10hrs, no errors

 

 Started Rebuilding the other dropped drive. 

 

21 drives running now, we shall see if this was the issue. Just waiting for rebuild to finish on the last drive.

Link to comment
  • 4 months later...
On 12/23/2019 at 12:45 AM, Viper359 said:

Had a different parity drive fail that night.

 

Switched out the HBA to an LSI one.

 

Rebuilt the parity drive. Ran 10hrs, no errors

 

 Started Rebuilding the other dropped drive. 

 

21 drives running now, we shall see if this was the issue. Just waiting for rebuild to finish on the last drive.

Did you ever resolve the issue?

 

I'm having the same problems using a Netapp X2065A-R6 HBA and a DS4246.

At first I thought it was being caused by a bad interposer, but after reading your thread and having it happen to a second drive, I'm not so sure.

 

I'm considering removing all the interposers and running straight SATA. I've done this in the past but I get issues with drive detection in unRaid on 8TB+ drives when I reboot or power cycle without interposers.

 

edit: UPDATE

 

Ok, I've played around a bit, using "new config" and "parity is valid" to move drives around and add/remove interposers. 

Interposers change the drive name, hence the "New Config".

 

Anyway,

I'm starting to think I had bad interposers on the drives that got kicked. Having a bunch of extras, I replaced the interposers on those drives and so far everything seems fine.

 

In a perfect world I'd just not use interposers at all, because I lose most of the SMART data, including drive temps and the ability to properly spin up and down (temps did work for a while but stopped a few updates back). But without interposers I have a hard time detecting drives that are 8TB and larger. Sometimes I need to replug those drives once or twice before they're detected after a reboot. Interposers solve that issue.

 

To save myself all the headache I've ordered an LSI card and the appropriate cable which will hopefully allow me to run the SATA drives without interposers once again. Looking forward to having SMART data again. 

Edited by pXius
Link to comment
