metadata errors on cache drive



Issue: when I run SABnzbd and it's unpacking its downloads, I'm more and more often seeing "XFS metadata I/O error in "xlog_iodone"" errors, after which the file system shuts down and Unraid goes offline.
 

I've searched for this topic and the results are typically:

- change your cable

- reformat your cache drive

- delete and redo your dockers.

 

I'm using an ADATA SSD connected to a motherboard SATA port. I've changed cables and ports. I've reformatted the cache drive from btrfs to XFS. I've deleted my dockers and the img file and redid them. Same result.

 

SABnzbd will trigger it; copying lots and lots of large files to the cache does not.

 

What the heck could be causing this? I've attached two of the diagnostics files generated when Unraid crashed.

 

tower-diagnostics-20200509-0030.zip tower-diagnostics-20200509-2005.zip

Edited by rilles
typo
Link to comment
7 hours ago, johnnie.black said:

Main suspect would be the overclocked RAM, respect the max officially supported RAM speed.

Very relevant, as I'm running an AMD 2200G on an ASUS B450M mobo, but I'm only using officially supported RAM speeds. The system is also not locking up: the console still works and I can do a graceful shutdown. SMART health checks are all good.

 

So if I assume it's not a bad drive, then SABnzbd is relevant because it pushes a high CPU load. C-states are probably not related, as I've had no issues at any other time in the 6 months I've been running this system.

 

Could that load then lead to the motherboard's onboard devices choking, and in turn to the file system shutdown? From the error log it appears they stop responding:

 

May  9 00:32:02 Tower kernel: ahci 0000:02:00.1: AHCI controller unavailable!
May  9 00:32:03 Tower kernel: ata6: failed to resume link (SControl FFFFFFFF)
May  9 00:32:03 Tower kernel: ata6: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
May  9 00:32:05 Tower kernel: r8169 0000:08:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
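For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch that flags whether the SATA errors and the NIC error show up within a few seconds of each other. The regexes are built only from the four lines quoted above. The labels, the function names, the 10 second window, and the assumed year (syslog lines carry no year; 2020 is taken from the diagnostics file names) are my own assumptions, not anything Unraid provides.

```python
import re
from datetime import datetime

# Patterns derived from the kernel lines quoted above; labels are hypothetical.
PATTERNS = {
    "ahci_unavailable": re.compile(r"ahci \S+: AHCI controller unavailable"),
    "sata_resume_failed": re.compile(r"ata\d+: failed to resume link"),
    "sata_link_down": re.compile(r"ata\d+: SATA link down"),
    "nic_timeout": re.compile(r"r8169 \S+ \S+: rtl_chipcmd_cond"),
}

# Classic syslog prefix, e.g. "May  9 00:32:02 Tower kernel: <message>"
LINE_RE = re.compile(r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) (\S+) kernel: (.*)$")


def classify(line, year=2020):
    """Return (timestamp, label) for a known error line, else None."""
    match = LINE_RE.match(line)
    if not match:
        return None
    month, day, clock, _host, message = match.groups()
    for label, pattern in PATTERNS.items():
        if pattern.search(message):
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
            return stamp, label
    return None


def storage_and_nic_fail_together(lines, window_seconds=10):
    """True if any SATA/AHCI error and any NIC error fall within the window."""
    events = [event for event in map(classify, lines) if event]
    sata = [ts for ts, label in events if label != "nic_timeout"]
    nic = [ts for ts, label in events if label == "nic_timeout"]
    return any(
        abs((s - n).total_seconds()) <= window_seconds for s in sata for n in nic
    )


SAMPLE = [
    "May  9 00:32:02 Tower kernel: ahci 0000:02:00.1: AHCI controller unavailable!",
    "May  9 00:32:03 Tower kernel: ata6: failed to resume link (SControl FFFFFFFF)",
    "May  9 00:32:03 Tower kernel: ata6: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)",
    "May  9 00:32:05 Tower kernel: r8169 0000:08:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).",
]

if __name__ == "__main__":
    print(storage_and_nic_fail_together(SAMPLE))
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of `SAMPLE`. A hit only shows that the two devices complained at about the same time; it does not prove which one caused the other.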

 

I will start fiddling with settings based on this post:

 

 

Link to comment

The DDR4 RAM was at 3200. I changed it to auto (defaults to 2600), turned off C-states, and set the power supply mode to "typical current".

 

Held my breath, ran SABnzbd, and got the same issue. I've now put in an old WD Green spinning disk as cache. My theory now is that the SSD, when heavily loaded, borks the SATA interface, which borks other stuff on the mobo.

 

Link to comment
15 hours ago, rilles said:

May  9 00:32:02 Tower kernel: ahci 0000:02:00.1: AHCI controller unavailable!

Likely related to this. There are some common issues with the Ryzen SATA controller under heavy load, usually during a parity check or rebuild. There are some reports that updating to v6.9-beta1 helps due to the newer kernel.

Link to comment
10 hours ago, johnnie.black said:

Likely related to this. There are some common issues with the Ryzen SATA controller under heavy load, usually during a parity check or rebuild. There are some reports that updating to v6.9-beta1 helps due to the newer kernel.

I have an LSI 9211 controller. I put the spinning cache disk on this controller and repeated my test.

The SATA didn't die but everything else did.

 

So yes, I believe you are right: under heavy load, the Ryzen motherboard's built-in interfaces (SATA / networking) bork.

 

Though I've never had an issue with a parity check, only with heavy application load. Oddly, my parity check is actually 10 MB/s faster now than before.

 

 

Link to comment
