Viper359 Posted December 18, 2019
Is anyone else experiencing issues with write errors followed by a disk being dropped? 6.7 was fine, not a single issue; then I updated to the release client and it started from there. 21 drives. It would report one drive with read errors and kick it. I would run SMART and preclear, not a single issue. Rebuild, not a single issue. Then a few hours later, many drives start having errors. Reboot and everything is good, but it kicked the same disk. Pull it and verify, everything is fine. Preclear again and rebuild, all is fine. Then a few hours after it's done, parity disk #2 starts having these read errors. Now rebuilding, as I know the disk is fine. I am using a DS4246 disk shelf connected to my server. This issue wasn't present in 6.7. Any thoughts or ideas on what's going on?
JorgeB Posted December 18, 2019
3 minutes ago, Viper359 said: Any thoughts or ideas on what's going on?
Difficult to say without any diagnostics; post some if it happens again, or if you haven't rebooted since last time.
trurl Posted December 18, 2019
4 minutes ago, Viper359 said: Preclear again and rebuild
Why would you preclear a disk to rebuild it?
Viper359 Posted December 18, 2019 Author
Because the first try it rebuilt fine, but I wanted to be 100 percent sure the second time. Elimination of possible failure paths, if you will.
trurl Posted December 18, 2019
Preclear, I guess, would be another way to test the disk, but other than that it has nothing at all to do with rebuilding. A clear disk is only needed when adding it to a new slot in an array that already has valid parity: a clear (all-zero) disk has no effect on parity, so parity remains valid when the disk is added. There is no other scenario where a clear disk is required. When rebuilding, the disk will be completely overwritten anyway, so it doesn't matter what was already on it.
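The "no effect on parity" point can be sketched numerically. This is a hedged toy illustration, not Unraid's actual implementation: single parity is a byte-wise XOR across the data disks, and XORing in an all-zero byte changes nothing. The byte values below are made up for the demo.

```shell
#!/bin/sh
# Toy demo (made-up byte values): single parity is a byte-wise XOR across
# data disks, and x XOR 0 = x, so adding a cleared (all-zero) disk leaves
# the parity byte unchanged.
d1=$(( 0xA5 ))          # byte from data disk 1
d2=$(( 0x3C ))          # byte from data disk 2
parity=$(( d1 ^ d2 ))   # parity before the new disk is added
new=$(( 0x00 ))         # corresponding byte on a cleared disk
parity_after=$(( d1 ^ d2 ^ new ))
echo "before=$parity after=$parity_after"
```

Both values come out identical (153 here), which is why adding a precleared disk to a new slot doesn't invalidate parity, while a rebuild simply overwrites the disk regardless.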
Viper359 Posted December 18, 2019 Author
I know; what I was trying to say is I do it to ensure that the disk is fine. Now that several disks are reporting errors that magically stop after a disk is kicked out of the array, or after rebuilding a disk, I am now certain it's not my disks. Rebooting doesn't seem to stop this issue, and no red flags pop up anywhere either.
trurl Posted December 18, 2019
35 minutes ago, Viper359 said: Now that several are reporting errors
Do you have a link?
Viper359 Posted December 18, 2019 Author
Nope. Rebooted and rebuilding the parity drive that got kicked. When it's done, I think I will try to downgrade to 6.7; I didn't have issues then.
Viper359 Posted December 20, 2019 Author
It has done it again. I am attaching my diagnostics. You will notice several disks report a low number of errors, and one drive has been kicked. tower-diagnostics-20191220-1027.zip
JorgeB Posted December 20, 2019
Looks to me like a controller problem:
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:31 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:31 Tower kernel: sd 2:0:16:0: Power-on or device reset occurred
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1880:sas IO status 0x24
Dec 18 05:24:33 Tower kernel: pm80xx mpi_ssp_completion 1888:SAS Address of IO Failure Drive:500605ba0040eb1e
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 18 05:24:33 Tower kernel: sd 2:0:16:0: [sdt] tag#90 CDB: opcode=0x88 88 00 00 00 00 02 9a 4c 13 d0 00 00 00 20 00 00
Dec 18 05:24:33 Tower kernel: print_req_error: I/O error, dev sdt, sector 11178611664
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611624
It starts with disk11 and then also happens on the other disks.
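To see at a glance which disk the errors started on, the md read-error lines can be tallied straight from the syslog. This is a hedged sketch: the embedded sample lines are modeled on the log quoted above; in practice you would point it at /var/log/syslog or the syslog file inside the diagnostics zip instead.

```shell
#!/bin/sh
# Tally "md: diskN read error" lines per disk from a syslog. The sample
# below is embedded for illustration only; set LOG to your real syslog path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611600
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611608
Dec 18 05:24:33 Tower kernel: md: disk11 read error, sector=11178611616
Dec 18 05:24:35 Tower kernel: md: disk4 read error, sector=22017
EOF
# Extract the "md: diskN read error" token and count each disk's hits,
# most-affected disk first.
grep -o 'md: disk[0-9]* read error' "$LOG" | sort | uniq -c | sort -rn
rm -f "$LOG"
```

With the sample data, disk11 tops the list, matching JorgeB's observation that the cascade starts there before spreading to other disks on the same controller.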
JorgeB Posted December 20, 2019
Could also be a power-related issue.
Viper359 Posted December 20, 2019 Author (edited)
This is just the thing: this didn't happen on 6.7. Does that make any sense? It's a disk shelf, and everything seems to be working. The only thing I can think of is the interposers, but you would think I would have had this issue on 6.7 during the couple of weeks I was running it? The other thing I can't figure out is why these errors only happen after a rebuild. I rebuilt the parity drive, nothing happened; within an hour of it finishing, bam. Same when disk 11 was causing issues: rebuilt the disk, nothing, an hour or so later, boom.
Edited December 20, 2019 by Viper359
JorgeB Posted December 20, 2019
This is just the thing, this didn't happen on 6.7. Does this make any sense?
Possibly something that changed in the HBA driver. I don't have any experience with PMC HBAs; they are also quite uncommon in the Unraid user base, so it's difficult to say whether it's an isolated issue or not. Can you post some v6.7 diags so I can see if it's using a different driver?
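A quick way to check which kernel driver is bound to the HBA on either version is lspci -k, which prints a "Kernel driver in use" line for each PCI device. Hedged sketch: the sample output below is an assumption modeled on a PMC-Sierra controller, since the pm80xx messages in the log point at that driver; on the live server you would run lspci -k directly rather than parse a canned string.

```shell
#!/bin/sh
# On the server itself you would run:  lspci -k | grep -A 3 -i 'sas'
# Here we parse a sample of that output (assumed for illustration) to pull
# out the driver name the kernel bound to the HBA.
SAMPLE='01:00.0 Serial Attached SCSI controller: PMC-Sierra Inc. Device 8001
	Subsystem: PMC-Sierra Inc. Device 8001
	Kernel driver in use: pm80xx'
printf '%s\n' "$SAMPLE" | sed -n 's/.*Kernel driver in use: //p'
```

Comparing that line (and the module version from modinfo pm80xx) between v6.7 and v6.8 diagnostics would show whether the driver changed between releases.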
trurl Posted December 20, 2019
21 minutes ago, Viper359 said: The other thing I cannot get, why do these errors only happen after a rebuild?
Heat buildup. Excessive vibrations affecting connections.
Viper359 Posted December 20, 2019 Author
4 minutes ago, johnnie.black said: Possibly something that changed in the HBA driver ... can you post some v6.7 diags so I can see if it's using a different driver?
I will have to downgrade and give it a go and see. Is it safe to downgrade now that it's kicked a disk out of my array, then rebuild it and go from there?
JorgeB Posted December 20, 2019
Is it safe to downgrade now that it's kicked a disk out of my array, and then rebuild it, and go from there?
Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top. Don't know if this is the model you're using, but it appears to describe a pretty similar issue: "I've not had good luck with the NetApp PMC8003 HBAs. They work, until you have a lot of stress on the SAS link; then once a drive drops out, it seems to cascade lock up the whole bus under both FreeNAS and CentOS." https://forums.servethehome.com/index.php?threads/pmc-pm8001-based-hba.24186/#post-225079 When possible we always recommend LSI HBAs for Unraid.
Viper359 Posted December 20, 2019 Author
3 minutes ago, johnnie.black said: Yes, just make sure the emulated disk is mounting and contents look correct before rebuilding on top. ... When possible we always recommend LSI HBAs for Unraid.
Yes, I believe that is the model I am using. I do have an LSI HBA card and a spare QSFP adapter cable. Maybe I should try this first: do a rebuild and see if it stays stable.
JorgeB Posted December 20, 2019
That's what I would do.
Viper359 Posted December 20, 2019 Author
Alright, later tonight I will power down everything, make the switch, and see what happens. I will also triple-check that the cables are seated properly. Those QSFP cables are thick and don't like to move.
Viper359 Posted December 22, 2019 Author
Had a different parity drive fail that night. Switched out the HBA for an LSI one. Rebuilt the parity drive. Ran 10 hours, no errors. Started rebuilding the other dropped drive. 21 drives running now; we shall see if this was the issue. Just waiting for the rebuild to finish on the last drive.
pXius Posted May 14, 2020 (edited)
On 12/23/2019 at 12:45 AM, Viper359 said: Had a different parity drive fail that night. Switched out the HBA to an LSI one. Rebuilt the parity drive. Ran 10hrs, no errors. Started rebuilding the other dropped drive. 21 drives running now, we shall see if this was the issue.
Did you ever resolve the issue? I'm having the same problems using a NetApp X2065A-R6 HBA and a DS4246. At first I thought it was being caused by a bad interposer, but after reading your thread and having it happen to a second drive, I'm not so sure. I'm considering removing all the interposers and running straight SATA. I've done this in the past, but I get issues with drive detection in unRaid on 8TB+ drives when I reboot or power cycle without interposers.
edit: UPDATE
Ok, I've played around a bit, using "new config" and "parity is valid" to move drives around and add/remove interposers. Interposers change the drive name, hence the "new config". Anyway, I'm starting to think I had bad interposers on the drives that got kicked. Having a bunch of extras, I replaced the interposers on those drives, and so far everything seems fine. In a perfect world I'd just not use interposers at all, because I lose most of the SMART data, including drive temps and the ability to properly spin up and down (temps did work for a while but stopped a few updates back), but without interposers I have a hard time detecting drives that are 8TB and larger. Sometimes I need to replug those drives once or twice before they're detected after a reboot; interposers solve that issue. To save myself all the headache I've ordered an LSI card and the appropriate cable, which will hopefully allow me to run the SATA drives without interposers once again. Looking forward to having SMART data again.
Edited May 15, 2020 by pXius
Viper359 Posted May 19, 2020 Author
Yes, I switched and all my problems went away. Left my interposers in. Since all issues went away, I stopped trying to debug what the actual issue was.