ptr727 Posted January 1, 2020

I have two servers running 6.7.2 (corrected from 6.7.0), both connected to a UPS. There was an extended power outage two weeks ago, with a graceful shutdown orchestrated by the UPS. The first scheduled parity check after restart reported 5 errors on each server, at exactly the same sectors:

Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934168
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934176
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934184
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934192
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934200

Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

Both servers use the same model 12TB parity disks; one has 1 parity drive, the other 2 parity drives. It seems highly unlikely that this is actual corruption, and more likely some kind of logical issue? Any ideas?
trurl Posted January 1, 2020

Since it isn't happening to everyone else it must be something specific to your systems. You haven't told us much about them though.
Vr2Io Posted January 1, 2020

Sounds interesting: both servers show errors at the same sector numbers, and the magic number 5. Did the scheduled parity checks show no errors before this? I feel the problem still relates to the UPS shutdown, even though it shows as a graceful shutdown. My monthly parity check in 2020 still came back fine, but my server has never been shut down by the UPS shutdown routine.
ptr727 Posted January 1, 2020

Before the power outage the servers were up for around 240-something days with no parity errors. Note, I said 6.7.0, actually 6.7.2. Both are Supermicro 4U chassis with SM X10SLM+-F motherboards, Xeon E3 processors, and Adaptec Series 8 RAID controllers in HBA passthrough mode. One server has a 12TB parity disk + 3 data disks (a mixture of 4TB and 12TB) and 4 x 1TB SSD cache; the other has 2 x 12TB parity disks + 16 data disks (a mixture of 4TB and 12TB) and 4 x 1TB SSD cache.
JorgeB Posted January 2, 2020

That looks similar to the SAS2LP 5 sync errors after a reboot that happen to some users. Do both servers use the same controller? Try doing a reboot followed by another parity check.
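For reference, a parity check can also be started from the console rather than the GUI. A minimal sketch using Unraid's mdcmd interface; the exact arguments here are an assumption from memory, so verify against current documentation before relying on them:

# Start a correcting parity check (assumed default behaviour)
mdcmd check

# Start a read-only, non-correcting check instead
mdcmd check NOCORRECT

# Cancel a running check
mdcmd nocheck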
ptr727 Posted January 2, 2020

Server-1 uses an 81605ZQ and Server-2 uses a 7805Q controller. The parity check just completed; I'll do one more while the server is up, then reboot, followed by another check. How do I get the logs to persist across reboots? I really need to see what happens at shutdown.
itimpi Posted January 2, 2020

14 minutes ago, ptr727 said: How do I get the logs to persist during reboots, really need to see what happens at shutdown

As long as you are running Unraid 6.7.2 or later you can configure Settings->Syslog to keep a copy that survives a reboot (and is appended to after the reboot).
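Once the syslog mirror or local syslog server is enabled, it is easy to confirm it is actually capturing anything by tailing the file it writes. A rough sketch; the paths below are assumptions based on typical settings (the mirror-to-flash copy normally lands under /boot/logs, and the local syslog server writes into whatever share folder is selected, with a hypothetical share name used here):

# Mirrored copy on the flash drive (path is an assumption; confirm in Settings -> Syslog)
tail -f /boot/logs/syslog

# Local syslog server output, assuming a share named "syslog" was chosen as the folder
tail -f /mnt/user/syslog/syslog-*.log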
ptr727 Posted January 2, 2020

12 minutes ago, itimpi said: As long as you are running Unraid 6.7.2 or later you can configure Settings->Syslog to keep a copy that survives a reboot (and is appended to after the reboot).

If I enable the local syslog server, does Unraid automatically use it, or is there another config? How reliable is using syslog vs. an option to just write to local disk during crash or shutdown troubleshooting?
bonienl Posted January 2, 2020

36 minutes ago, ptr727 said: If I enable the local syslog server, does Unraid automatically use it, or is there another config? How reliable is using syslog vs. an option to just write to local disk during crash or shutdown troubleshooting?

See the built-in help on the syslog page.
ptr727 Posted January 3, 2020

I enabled syslog, did a controlled reboot, started a check, and again got 5 errors:

Jan 3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan 3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan 3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan 3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan 3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

Nothing extraordinary in the syslog; it is attached, and diagnostics are also attached. I looked at the other threads that report similar 5 errors after reboots, blaming the SAS2LP / Supermicro / Marvell driver / hardware as the cause. I find it suspicious that the problem was attributed to a specific driver or piece of hardware when it started happening in Unraid v6, and it happens with my Adaptec hardware as well. I can't help but think it is a more generic issue in Unraid, e.g. handling of SAS backplanes, spindown, caching, taking the array offline, parity calculation, etc. Especially since the parity errors appear at the same reported locations.

syslog
server-2-diagnostics-20200103-2156.zip
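As an aside, the reported sector numbers are easy to translate into a position on the disk, assuming the usual 512-byte kernel sector unit: the five entries are 8 sectors (4 KiB) apart, so they cover five consecutive 4 KiB blocks roughly 1 TB into the parity disk. A quick shell sanity check of that arithmetic:

# Byte offset of the first reported sector (512-byte sectors)
echo $((1962934168 * 512))                     # 1005022294016 bytes, roughly 1.0 TB in

# Spacing between consecutive reported sectors
echo $(( (1962934176 - 1962934168) * 512 ))    # 4096 bytes, one 4 KiB block

# Total span covered by the five reported errors
echo $(( (1962934200 - 1962934168 + 8) * 512 ))  # 20480 bytes = 20 KiB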
JorgeB Posted January 4, 2020

9 hours ago, ptr727 said: I find it suspicious that the problem was attributed to a specific driver / hardware

All the other cases happened only if the user was using a SAS2LP, so IMHO that was correctly blamed; you are the first non-SAS2LP user I remember seeing with this issue. If possible you should try a different controller, like an LSI, to rule it out. Besides that, do both servers have any common hardware? Including disk models.
ptr727 Posted January 4, 2020

The systems do use similar disks (12TB Seagate, 4TB Hitachi, and 1TB Samsung), and similar processors, memory, and motherboards. It could be that the Adaptec driver and the SAS2LP driver have a similar problem, or it could be Unraid; causation vs. correlation. E.g. how long did it take to fix the SQLite bug caused by Unraid, and experienced only by some? How can I find out what files are affected by the parity repair, so that I can determine the impact of corruption and the possibility of restoring from backup? How can I see what driver Unraid is using for the Adaptec controller, so that I can tell whether it is a common driver or an Adaptec-specific driver?
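On the driver question, a generic way to see which kernel driver is bound to a storage controller, using standard Linux commands rather than anything Unraid-specific, is a sketch like this (the Adaptec Series 8 cards normally show up under the aacraid driver):

# List storage controllers and the kernel driver in use for each
lspci -k | grep -A 3 -i -E 'raid|sas|scsi'

# Or check the loaded modules for the usual suspects
lsmod | grep -i -E 'aacraid|mpt3sas|megaraid'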
trurl Posted January 4, 2020

1 hour ago, ptr727 said: It could be that the Adaptec driver and the SAS2LP driver have a similar problem, or it could be Unraid, causation vs. correlation. E.g. how long did it take to fix the SQLite bug caused by Unraid, and experienced only by some.

This issue is far less common than the SQLite problem. Or at least far less reported, and not reported for some time. It's possible some users just let their servers carry on with a few parity errors, not knowing any better.
ptr727 Posted January 4, 2020

I got two LSI SAS9340-8i cards (I need mini SAS HD connectors), ServeRAID M1215s flashed to IT mode (Art of Server on eBay), but I can't get the cards to recognize my Samsung SSD drives. So I'm in a tough spot: the Adaptec HBA gives 5 parity errors per boot, and the recommended catch-all LSI in IT mode does not detect SSD drives.
JorgeB Posted January 5, 2020

9 hours ago, ptr727 said: recommended catchall LSI

The 9340-8i is a MegaRAID controller; we recommend true LSI HBAs, like the 9300-8i. It should work with any SSD, though LSI HBAs can only trim SSDs with "deterministic read zeros after TRIM" support, e.g. the 860 EVO is OK, the 850 EVO won't trim.
ptr727 Posted January 5, 2020

The 9340 flashed to IT mode acts like a more expensive 9300, so unless that is not supported, it should work fine, i.e. the objective is no parity errors on reboot? The problem with SSD drives appears to be EVO-specific: none of my EVO drives are detected by the LSI controller, only the Pro drives are. I am busy swapping EVOs for Pros in the 4 x 1TB cache, one drive at a time. How long should it take to rebuild the BTRFS volume? It has been running 12+ hours, and I can't see any progress indicator.
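On the cache rebuild question, a hedged sketch of how progress on a btrfs pool can usually be checked from the console, depending on whether the swap runs as a device replace or as an add/remove plus balance; the commands are standard btrfs-progs, and the path assumes the pool is mounted at /mnt/cache:

# If the swap is running as a device replace
btrfs replace status /mnt/cache

# If it is running as a balance after adding/removing a device
btrfs balance status /mnt/cache

# General view of devices and space allocation in the pool
btrfs filesystem show /mnt/cache
btrfs filesystem usage /mnt/cache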
JorgeB Posted January 6, 2020

15 hours ago, ptr727 said: The 9340 flashed to IT mode acts like a more expensive 9300, so unless that is not supported, it should work fine, i.e. objective is no parity errors on reboot?

I highly doubt it's a true HBA. What driver is used, mpt3sas or megaraid? If you don't know how to check you can post the diags.

15 hours ago, ptr727 said: How long should it take to rebuild the BTRFS volume,

Not sure what you mean by this, but it should also be visible in the diags.
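One more way to answer the mpt3sas-vs-megaraid question directly, complementing the lspci check sketched earlier: each SCSI host exposes the name of the driver that registered it in sysfs.

# Print the driver name behind each SCSI host (e.g. mpt3sas, megaraid_sas, aacraid)
for h in /sys/class/scsi_host/host*/proc_name; do
    echo "$h: $(cat "$h")"
done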
dgaschk Posted January 8, 2020

On 1/5/2020 at 1:04 AM, johnnie.black said: 9340-8i is a MegaRAID controller, we recommend true LSI HBAs, like the 9300-8i, it should work with any SSD, though LSI HBAs can only trim SSDs with reads zeros after trim support, e.g. 860 EVO is OK, 850 EVO won't trim.

Have you confirmed this? I have a 9300-8i and 860 EVO does not TRIM.

[email protected]:~# fstrim -v /mnt/cache
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

[email protected]:~# hdparm -I /dev/sd[k-m] | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM

[email protected]:~# fstrim -av
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

sd[k-m] are (3x) 860 EVO connected to a 9300-8i. Maybe I should start another thread.
JorgeB Posted January 8, 2020

11 minutes ago, dgaschk said: Have you confirmed this?

Yes, tested myself; make sure you're using the latest firmware.

Edit: Never tested on v6.8, in case something changed with the LSI driver.
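For anyone checking the same thing, the HBA firmware version can usually be read with Broadcom's SAS3 flashing utility (assuming sas3flash is installed, which it is not by default) or pulled from what the mpt3sas driver reported at boot; the exact log wording varies by driver version:

# Broadcom/LSI SAS3 utility: lists adapter, firmware and BIOS versions
sas3flash -list

# Or grep the kernel log for the firmware version the mpt3sas driver reported
dmesg | grep -i mpt3sas | grep -i version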
dgaschk Posted January 8, 2020

johnnie.black is correct. Works with latest firmware. Thanks.
ptr727 Posted January 11, 2020

Switching my systems to the LSI HBAs corrected the behavior; see my blog post for details: https://blog.insanegenius.com/2020/01/10/unraid-repeat-parity-errors-on-reboot/