December 18, 2019
Is anyone else experiencing write errors followed by a disk being dropped? 6.7 was fine, not a single issue; then I updated to the release client, and it started from there. I have 21 drives. It would report one drive with read errors and kick it. I would run SMART and preclear: not a single issue. Rebuild: not a single issue. Then, a few hours later, many drives start having errors. Reboot and everything is good, but it kicked the same disk. I pulled it and verified it thoroughly; everything is fine. Preclear again and rebuild, all is fine. Then, a few hours after it's done, parity disk #2 starts having these read errors. I'm rebuilding now, as I know the disk is fine. I am using a DS4246 disk shelf connected to my server. This issue wasn't present in 6.7. Any thoughts or ideas on what's going on?
December 18, 2019
3 minutes ago, Viper359 said: "Any thoughts or ideas what's going on?"
Difficult to say without any diagnostics. Post some if it happens again, or now if you haven't rebooted since the last time.
December 18, 2019
4 minutes ago, Viper359 said: "Preclear again and rebuild"
Why would you preclear a disk to rebuild it?
December 18, 2019 Author
Because it rebuilt fine on the first try, but I wanted to be 100 percent sure the second time. Elimination of possible paths, if you will.
December 18, 2019
Preclear, I guess, would be another way to test the disk, but other than that it has nothing at all to do with a rebuild. A clear disk is only needed when adding it to a new slot in an array that already has valid parity. A clear (all-zero) disk has no effect on parity, so parity remains valid when a clear disk is added to a new slot. There is no other scenario where a clear disk is required. When rebuilding, the disk will be completely overwritten anyway, so it doesn't matter at all what was already on it.
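The point about a cleared disk can be illustrated with a small sketch. It assumes a toy model of single parity computed as a bytewise XOR across the data disks; this is an illustration only, not Unraid's actual code, and a second parity disk uses a different calculation that is not modeled here. The function name `xor_parity` and the disk contents are hypothetical.

```python
from functools import reduce


def xor_parity(disks):
    """Bytewise XOR across equal-length byte strings (toy single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


# Two made-up data disks, three bytes each.
disks = [bytes([0x12, 0xAB, 0xFF]), bytes([0x34, 0x00, 0x0F])]
parity = xor_parity(disks)

# Adding an all-zero (cleared) disk leaves parity unchanged, because x ^ 0 == x.
cleared = bytes(3)
assert xor_parity(disks + [cleared]) == parity

# Rebuilding: the missing disk is the XOR of parity and the surviving disks,
# so whatever was on the replacement disk beforehand is simply overwritten.
rebuilt_disk0 = xor_parity([parity, disks[1]])
assert rebuilt_disk0 == disks[0]
```

In this model, preclearing a replacement disk before a rebuild changes nothing about the result; it only serves as an extra test of the disk.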
December 18, 2019 Author
I know; what I was trying to say is that I do it to make sure the disk is fine. Now that several are reporting errors that magically stop happening after a disk is kicked out of the array or rebuilt, I am certain it's not my disks. Rebooting etc. doesn't seem to stop this issue, and no red flags pop up anywhere either.
December 18, 2019
35 minutes ago, Viper359 said: "Now that several are reporting errors"
Do you have a link?
December 18, 2019 Author
Nope. I rebooted and am rebuilding the parity drive that got kicked. When it's done, I think I will try to downgrade to 6.7; I didn't have issues then.
December 20, 2019 Author
It has done it again. I am attaching my diagnostics. You will notice several disks report a low number of errors, and one drive has been kicked.
tower-diagnostics-20191220-1027.zip
December 20, 2019
Looks to me like a controller problem:
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: sd 2:0:16:0: Power-on or device reset occurred
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 CDB: opcode=0x88 88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00
Dec 18 05:24:33 Tower kernel: print_req_error: I/O error, dev sdt, sector 11178611664
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624
It starts with disk11 and then also happens on the other disks.
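For anyone reading a syslog like this one, here is a rough sketch of how the lines can be cross-checked. It counts `md` read errors per disk and decodes the READ(16) CDB (opcode 0x88), whose bytes 2-9 hold the starting LBA and bytes 10-13 the transfer length in blocks. The helper names are hypothetical, and the regular expressions only assume the line format quoted in this thread.

```python
import re
from collections import Counter

# Patterns assume the kernel line format quoted in this thread.
MD_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def count_md_read_errors(lines):
    """Return a Counter mapping disk name -> number of md read-error lines."""
    counts = Counter()
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def decode_read16(cdb_hex):
    """Decode a READ(16) CDB given as hex bytes; return (lba, block_count)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


sample = [
    "Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600",
    "Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608",
    "Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616",
    "Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624",
]
print(count_md_read_errors(sample))  # Counter({'disk11': 4})
print(decode_read16("88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00"))
# (11178611664, 32)
```

The decoded LBA matches the sector in the `print_req_error` line. Assuming 512-byte logical sectors, the 32-block read covers four 4 KiB blocks, which is consistent with the four `md` lines spaced 8 sectors apart. The 64-sector gap between the `sd` and `md` sector numbers is presumably the partition start offset.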
December 20, 2019 Author
This is just the thing: this didn't happen on 6.7. Does this make any sense? It's a disk shelf, and everything seems to be working. The only thing I can think of is the interposers, but you would think I would have had this issue on 6.7 for the couple of weeks I was running it.
The other thing I cannot get: why do these errors only happen after a rebuild? I rebuilt the parity drive, and nothing happened. Within an hour of it finishing, bam. Same when disk 11 was causing issues. Rebuilt the disk, nothing; an hour or so later, boom.
Edited December 20, 2019 by Viper359
December 20, 2019
"This is just the thing, this didn't happen on 6.7. Does this make any sense?"
Possibly something changed in the HBA driver. I don't have any experience with PMC HBAs, and they are quite uncommon in the Unraid user base, so it's difficult to say whether it's an isolated issue or not. Can you post some v6.7 diags so I can see if it's using a different driver?
December 20, 2019
21 minutes ago, Viper359 said: "The other thing I cannot get, why do these errors only happen after a rebuild?"
Heat buildup. Excessive vibrations affecting connections.
December 20, 2019 Author
4 minutes ago, johnnie.black said: "Possibly something that changed on the HBA driver, I don't have any experience with PMC HBAs, they are also quite uncommon on the Unraid user base, so difficult to say if it's an isolated issue or not, can you post some v6.7 diags so I can see if it's using a different driver?"
I will have to downgrade, give it a go, and see. Is it safe to downgrade now that it has kicked a disk out of my array, then rebuild the disk and go from there?
December 20, 2019
"Is it safe to downgrade now that its kicked a disk out of my array, and then rebuild it, and go from there?"
Yes, just make sure the emulated disk is mounting and its contents look correct before rebuilding on top. I don't know if this is the model you're using, but this post appears to describe a pretty similar issue:
"I've not had good luck with the NetApp PMC8003 HBAs. They work, until you have a lot of stress on the SAS link, then once a drive drops out, it seems to cascade lock up the whole bus under both FreeNAS and CentOS."
https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079
When possible, we always recommend LSI HBAs for Unraid.
December 20, 2019 Author
3 minutes ago, johnnie.black said: "Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top. Don't know if this is the model you're using but it appears to describe a pretty similar issue: https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079 When possible we always recommend LSI HBAs for Unraid."
Yes, I believe that is the model I am using. I do have an LSI HBA card and a spare QSFP to whatever adapter cable. Maybe I should try this first: do a rebuild and see if it stays stable.
December 20, 2019 Author
Alright, later tonight I will power down everything, make the switch, and see what happens. I will also triple-check that the cables are seated properly. Those QSFP cables are thick and don't like to move.
December 22, 2019 Author
Had a different parity drive fail that night. Switched out the HBA for an LSI one. Rebuilt the parity drive. Ran for 10 hours with no errors. Started rebuilding the other dropped drive. 21 drives are running now; we shall see if this was the issue. Just waiting for the rebuild to finish on the last drive.
May 14, 2020
On 12/23/2019 at 12:45 AM, Viper359 said: "Had a different parity drive fail that night. Switched out the HBA to an LSI one. Rebuilt the parity drive. Ran 10hrs, no errors. Started rebuilding the other dropped drive. 21 drives running now, we shall see if this was the issue. Just waiting for rebuild to finish on the last drive."
Did you ever resolve the issue? I'm having the same problems using a NetApp X2065A-R6 HBA and a DS4246. At first I thought it was being caused by a bad interposer, but after reading your thread and having it happen to a second drive, I'm not so sure. I'm considering removing all the interposers and running straight SATA. I've done this in the past, but without interposers I get issues with drive detection in unRaid on 8TB+ drives when I reboot or power cycle.
Edit, UPDATE: OK, I've played around a bit, using "New Config" and "parity is valid" to move drives around and add or remove interposers. Interposers change the drive name, hence the "New Config". Anyway, I'm starting to think I had bad interposers on the drives that got kicked. Having a bunch of extras, I replaced the interposers on those drives, and so far everything seems fine.
In a perfect world I'd not use interposers at all, because I lose most of the SMART data, including drive temps and the ability to properly spin up and down (temps did work for a while but stopped a few updates back). Without interposers, though, I have a hard time detecting drives that are 8TB and larger. Sometimes I need to replug those drives once or twice before they're detected after a reboot. Interposers solve that issue.
To save myself all the headache, I've ordered an LSI card and the appropriate cable, which will hopefully allow me to run the SATA drives without interposers once again. Looking forward to having SMART data again.
Edited May 15, 2020 by pXius
May 19, 2020 Author
Yes, I switched to the LSI HBA and all my problems went away. I left my interposers in. Since all the issues went away, I stopped trying to debug what the actual cause was.
This topic is now archived and is closed to further replies.