ptr727

Errors during parity check after clean restart


Posted (edited)

I have two servers running 6.7.2 (corrected), both connected to a UPS. There was an extended power outage two weeks ago, and the UPS orchestrated a graceful shutdown. The first scheduled parity check after the restart reported 5 errors on each server, with exactly the same sector details.

 

Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934168
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934176
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934184
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934192
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934200

Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

 

Both servers use the same model 12TB parity disks, one has 1 parity drive, the other 2 parity drives.

It seems highly unlikely that this is actual corruption, and more likely some kind of logical issue?
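
As a side note on the log lines above, the five reported sectors are evenly spaced. The following is a minimal arithmetic sketch, assuming the md driver reports positions in 512-byte sectors (that unit is an assumption, it is not stated in this thread). The variable names are made up for illustration.

```shell
# Sector numbers copied from the log lines above (identical on both servers).
first=1962934168
second=1962934176
count=5

# ASSUMPTION: sector numbers are in 512-byte units.
sector_size=512

# Gap between consecutive reported sectors, in sectors and in bytes.
stride=$(( second - first ))
stride_bytes=$(( stride * sector_size ))

# Byte offset of the first corrected sector from the start of the device.
first_offset=$(( first * sector_size ))

# Size of the region covered by all five corrections.
span_bytes=$(( count * stride_bytes ))

echo "stride: $stride sectors ($stride_bytes bytes)"
echo "first offset: $first_offset bytes"
echo "span: $span_bytes bytes"
```

Under that assumption the corrections are 8 sectors (4096 bytes) apart and cover one contiguous 20480-byte region, about 1 TB into the device.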

 

Any ideas?

 

Edited by ptr727
6.7.0 -> 6.7.2


Since it isn't happening to everyone else it must be something specific to your systems. You haven't told us much about them though.

Posted (edited)

Sounds interesting: both servers show errors at the same sector numbers, and both show the magic number 5.

 

 

Did the scheduled parity checks show no errors before?

 

I feel the problem is still related to the UPS shutdown, even though it appeared to be a graceful shutdown.

 

My monthly parity check in 2020 was still clean, but my server has never been shut down by the UPS shutdown routine.

 

Edited by Benson


Before the power outage, the servers were up for around 240 days with no parity errors.

 

Note, I said 6.7.0, actually 6.7.2.

Both are Supermicro 4U chassis with SM X10SLM+-F motherboards, Xeon E3 processors, and Adaptec Series 8 RAID controllers in HBA passthrough mode. One has 12TB parity + 3 x mixture of 4TB and 12TB data disks and 4 x 1TB SSD cache; the other has 2 x 12TB parity + 16 x mixture of 4TB and 12TB data disks and 4 x 1TB SSD cache.

 

 

 

 


That looks similar to the 5 sync errors after a reboot that happen to some SAS2LP users. Do both servers use the same controller? Try doing a reboot followed by another parity check.

Posted (edited)

Server-1 uses a 81605ZQ and Server-2 uses a 7805Q controller.

The parity check just completed, I'll do one more while the server is up, then reboot, followed by another check.

 

How do I get the logs to persist across reboots? I really need to see what happens at shutdown.

Edited by ptr727

14 minutes ago, ptr727 said:

How do I get the logs to persist across reboots? I really need to see what happens at shutdown.

As long as you are running Unraid 6.7.2 or later you can configure Settings->Syslog to keep a copy that survives a reboot (and is appended to after the reboot).

12 minutes ago, itimpi said:

As long as you are running Unraid 6.7.2 or later you can configure Settings->Syslog to keep a copy that survives a reboot (and is appended to after the reboot).

If I enable the local syslog server, does Unraid automatically use it, or is there another config?

How reliable is using syslog vs. an option to just write to local disk during crashes or shutdown troubleshooting?

 

 

36 minutes ago, ptr727 said:

If I enable the local syslog server, does Unraid automatically use it, or is there another config?

How reliable is using syslog vs. an option to just write to local disk during crashes or shutdown troubleshooting?

See the built-in help on the syslog page.

Posted (edited)

I enabled syslog, did a controlled reboot, started a check, and again got 5 errors:

Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

 

Nothing extraordinary in syslog, attached, diagnostics also attached.

 

I looked at the other threads that report similar 5 errors after reboots, blaming the SAS2LP / Supermicro / Marvell driver / hardware as the cause.

I find it suspicious that the problem was attributed to a specific driver / hardware, when it started happening in Unraid v6, and it also happens with my Adaptec hardware. I can't help but think it is a more generic issue in Unraid, e.g. handling of SAS backplanes, spindown, caching, taking the array offline, parity calculation, etc.

Especially since it appears the parity errors are at the same reported locations.
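
One way to confirm that two logs report the same locations is to extract the sector numbers from the `md: recovery thread: ... corrected` lines and intersect them. This is a rough sketch only; the function name and the file names in the usage note are hypothetical, and it assumes the log line format shown in this thread.

```shell
# Print sectors that appear in "corrected" lines of BOTH syslog files.
# $1 and $2 are paths to saved syslog files.
common_corrected_sectors() {
    awk '
        /md: recovery thread: .* corrected, sector=/ {
            sub(/.*sector=/, "")      # drop everything up to the number
            sub(/[^0-9].*$/, "")      # drop any trailing non-digit text
            if (FILENAME == ARGV[1]) {
                first[$0] = 1         # remember sectors seen in the first file
            } else if (($0 in first) && !printed[$0]++) {
                print                 # seen in both files, print once
            }
        }
    ' "$1" "$2"
}
```

Usage would look like `common_corrected_sectors server-1-syslog.txt server-2-syslog.txt`, with the two paths replaced by wherever the logs were saved. Repeated checks in the same log (for example before and after a reboot) are collapsed so each shared sector is listed once.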

syslog server-2-diagnostics-20200103-2156.zip

Edited by ptr727

9 hours ago, ptr727 said:

I find it suspicious that the problem was attributed to a specific driver / hardware

All the other cases happened only when the user was using a SAS2LP, so IMHO that was correctly blamed. You are the first non-SAS2LP user I remember seeing with this issue. If possible, you should try a different controller, like an LSI, to rule it out.

 

Besides that, do both servers have any common hardware? Including disk models.


The systems do use similar disks (12TB Seagate, 4TB Hitachi, and 1TB Samsung), as well as similar processors, memory, and motherboards.

 

It could be that the Adaptec driver and the SAS2LP driver have a similar problem, or it could be Unraid, causation vs. correlation. E.g. how long did it take to fix the SQLite bug caused by Unraid, and experienced only by some.

 

How can I find out what files are affected by the parity repair, so that I can determine the impact of corruption, and possibility of restore from backup?

How can I see what driver Unraid is using for the Adaptec controller, so that I can see if it is a common driver or an adaptec specific driver?

 

1 hour ago, ptr727 said:

It could be that the Adaptec driver and the SAS2LP driver have a similar problem, or it could be Unraid, causation vs. correlation. E.g. how long did it take to fix the SQLite bug caused by Unraid, and experienced only by some.

This issue is far less common than the SQLite problem. Or at least far less reported and not reported for some time. It's possible some users just let their servers carry on with a few parity errors not knowing any better.


I got two LSI SAS9340-8i (I need mini SAS HD connectors) ServeRAID M1215 cards in IT mode (Art of Server on eBay), but I can't get the card to recognize my Samsung SSD drives.

So I am in a tough spot: the Adaptec HBA gives 5 parity errors per boot, and the recommended catchall LSI in IT mode does not detect my SSD drives.

9 hours ago, ptr727 said:

recommended catchall LSI

The 9340-8i is a MegaRAID controller. We recommend true LSI HBAs, like the 9300-8i, which should work with any SSD, though LSI HBAs can only trim SSDs that support deterministic read zeros after TRIM, e.g. the 860 EVO is OK, the 850 EVO won't trim.


The 9340 flashed to IT mode acts like a more expensive 9300, so unless that is not supported, it should work fine, i.e. objective is no parity errors on reboot?

The problem with SSD drives appears to be EVO specific: none of my EVO drives are detected by the LSI controller, only the Pro drives are.

 

I am busy swapping EVO's for Pro's in the 4 x 1TB cache, one drive at a time.

How long should it take to rebuild the BTRFS volume? It has been running for 12+ hours, and I can't see any progress indicator.

15 hours ago, ptr727 said:

The 9340 flashed to IT mode acts like a more expensive 9300, so unless that is not supported, it should work fine, i.e. objective is no parity errors on reboot?

I highly doubt it's a true HBA. What driver is used, mpt3sas or megaraid? If you don't know how to check, you can post the diags.
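
For anyone wanting to check the driver themselves, `lspci -k` lists each PCI device along with a `Kernel driver in use:` line. Below is a small sketch that filters that output down to storage controllers; the function name is made up, and the keyword match on the device description (RAID, SAS, SCSI, SATA) is an assumption that may need adjusting for a given card.

```shell
# Read `lspci -k` output on stdin and print each storage controller
# together with the kernel driver bound to it.
storage_drivers() {
    awk '
        /^[0-9a-f]/ {
            # Device header line (starts with the PCI address).
            # ASSUMPTION: storage controllers mention one of these words.
            show = ($0 ~ /RAID|SAS|SCSI|SATA/)
            if (show) dev = $0
            next
        }
        show && /Kernel driver in use:/ {
            sub(/^[ \t]*Kernel driver in use:[ \t]*/, "")
            print dev " -> " $0
        }
    '
}
```

Run it as `lspci -k | storage_drivers`. A driver name of mpt3sas versus a megaraid one on the controller's line would answer the question above.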

 

15 hours ago, ptr727 said:

How long should it take to rebuild the BTRFS volume,

Not sure what you mean by this, but it should also be visible in the diags.

On 1/5/2020 at 1:04 AM, johnnie.black said:

The 9340-8i is a MegaRAID controller. We recommend true LSI HBAs, like the 9300-8i, which should work with any SSD, though LSI HBAs can only trim SSDs that support deterministic read zeros after TRIM, e.g. the 860 EVO is OK, the 850 EVO won't trim.

Have you confirmed this? I have a 9300-8i and 860 EVO does not TRIM.

 

root@Othello:~# fstrim -v /mnt/cache
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
root@Othello:~# hdparm -I /dev/sd[k-m] | grep TRIM
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read ZEROs after TRIM
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read ZEROs after TRIM
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read ZEROs after TRIM
root@Othello:~# fstrim -av
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

sd[k-m] are (3x) 860 EVO connected to a 9300-8i. Maybe I should start another thread.

 

11 minutes ago, dgaschk said:

Have you confirmed this?

Yes, I tested it myself. Make sure you're using the latest firmware.

 

Edit: Never tested on v6.8 in case something changed with the LSI driver.
