(Solved) Help Requested: BTRFS Cache Pool Errors


tomjrob


I could use some help with cache pool errors. I apologize for the long post, but I want to provide as much information as I can before asking anyone for help.

So, long story, but here goes.

The array had been running fine for months with the initial configuration of the cache pool.

 

Initial Cache pool prior to any issues:

Cache - 256GB SSD

Cache2 - 256GB SSD

Cache3 - 512GB SSD

 

Goal: Remove both 256GB SSDs and replace them with a new ADATA 512GB SSD, ending up with two 512GB SSDs in a RAID 1 pool.

I used the procedure outlined in the FAQ linked below to reconfigure the cache pool. I made sure to do one step at a time and let the rebalance complete before the next step.

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/

 

Step 1: Replaced 1st 256GB with new 512GB SSD.  Rebalance completed without error.

Step 2: Removed Cache2 SSD. Rebalance completed without error.

Step 3: Unassigned Cache2 slot. All ok so far.

 

Final Cache pool: All steps completed and array working fine.

Cache - New ADATA 512GB SSD

Cache2 - Original 512GB SSD

 

Everything seemed to be great at this point. Array ran fine for about 24 hours.

 

The problem:

At some point, I noticed that the new ADATA SSD was not showing temperature on the main page, as though it was "spun down".

Checked logs and logs were full of BTRFS errors on the new SSD. I assumed the new drive went bad.

 

Steps taken so far to isolate the problem:

Ran a BTRFS scrub on the cache (no errors).

Put the array in maintenance mode and ran a BTRFS filesystem check (no errors).

Stopped the array, replaced the ADATA SSD in the 1st cache slot with a 2TB spinning drive connected to a completely different controller, and brought the array back online.

The rebalance showed many BTRFS errors on the new spinning drive. I began to suspect that the new ADATA SSD might not be the problem.

Stopped the array, removed the ADATA drive, and attached it to a Windows machine. Used Partition Magic to wipe it and ran a complete surface test. No errors!

At this point I removed the spinning drive from the 1st cache slot and brought the array back online with only the Cache2 drive. The array showed "missing drive" in the 1st cache slot as expected, but it came online and did a rebalance. The rebalance completed without any errors with nothing in the 1st cache slot.

 

So I stopped the array and powered down, then put the ADATA SSD back into the array. (Note that the hot swap chassis slot I inserted this drive into is the same one that held the original Cache2 256GB SSD, so it uses the same slot and cable as that SSD, which ran fine for months.)

 

Powered back up, assigned the ADATA SSD to the 1st cache slot again, and started the array.

The rebalance immediately started producing many BTRFS errors.

 

This is where I sit. Rebalance is continuing even though there are many errors.

 

I am attaching diagnostics taken at the time of the initial errors found before any troubleshooting steps were taken, as well as a screen shot of the current syslog showing the errors currently happening while trying to rebalance the cache.

 

I am baffled. I do not believe the ADATA SSD is defective, and I do not believe it is a slot/cable issue, but I cannot understand why I can't introduce a drive into the 1st cache slot and rebalance the data. The cache appears to be fine as long as there is only one drive in the pool.

 

Any help is appreciated.

 

 

 

tower-diagnostics-20191204-0353.zip tower-syslog-20191204-0954.zip


First time Adata SSD dropped offline:

 

Dec  3 12:06:09 TOWER kernel: ata4.00: qc timeout (cmd 0xec)
Dec  3 12:06:09 TOWER kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec  3 12:06:09 TOWER kernel: ata4.00: revalidation failed (errno=-5)
Dec  3 12:06:09 TOWER kernel: ata4.00: disabled

Second time it dropped again:

Dec  4 09:03:48 TOWER kernel: ata4.00: link online but device misclassified
Dec  4 09:04:18 TOWER kernel: ata4.00: qc timeout (cmd 0xec)
Dec  4 09:04:18 TOWER kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec  4 09:04:18 TOWER kernel: ata4.00: revalidation failed (errno=-5)
Dec  4 09:04:18 TOWER kernel: ata4.00: disabled

ADATA devices don't have a very good reputation, but this can also be a connection issue. I suggest you try again with the SSD in a different slot/cable/port. If it happens again, replace it with a new device, preferably from a different brand.
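For anyone who wants to spot this kind of device drop in their own syslog, here is a minimal Python sketch that scans for the libata "disabled" lines quoted above. The function name `find_ata_drops` and the regular expression are illustrative assumptions, not part of Unraid or any tool mentioned in this thread; the sample lines are copied from the log excerpts above.

```python
import re
from collections import defaultdict

# Matches kernel libata lines in the format shown above, e.g.:
#   "Dec  3 12:06:09 TOWER kernel: ata4.00: disabled"
ATA_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<dev>ata\d+\.\d+):\s+(?P<msg>.*)$"
)

def find_ata_drops(lines):
    """Return {device: [timestamps]} for every 'disabled' event found."""
    drops = defaultdict(list)
    for line in lines:
        m = ATA_LINE.match(line.strip())
        if m and m.group("msg") == "disabled":
            drops[m.group("dev")].append(m.group("ts"))
    return dict(drops)

# Sample input: the syslog lines quoted in this thread.
sample = [
    "Dec  3 12:06:09 TOWER kernel: ata4.00: qc timeout (cmd 0xec)",
    "Dec  3 12:06:09 TOWER kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Dec  3 12:06:09 TOWER kernel: ata4.00: revalidation failed (errno=-5)",
    "Dec  3 12:06:09 TOWER kernel: ata4.00: disabled",
    "Dec  4 09:03:48 TOWER kernel: ata4.00: link online but device misclassified",
    "Dec  4 09:04:18 TOWER kernel: ata4.00: qc timeout (cmd 0xec)",
    "Dec  4 09:04:18 TOWER kernel: ata4.00: disabled",
]

drops = find_ata_drops(sample)
for dev, times in drops.items():
    print(f"{dev} dropped {len(times)} time(s): {', '.join(times)}")
```

Repeated drops on the same `ataN` port, as here, point at that port, its cable, or the device attached to it rather than at the filesystem.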


Thank you for the quick response. As you suggested, I moved the ADATA device to a different port (cable). The new connection is also on a different controller: it is now connected to the LSI card instead of a motherboard SATA port.

I reintroduced it into the cache, and the rebalance completed without error. The ADATA has been running fine for a few hours in this configuration. I will monitor it for a while and report back as solved if there are no other errors.

I now suspect that the new ADATA SSD is incompatible with that motherboard port because, as mentioned in the initial post, a 256GB Crucial SSD ran for months on the same port and cable without issue until I replaced it with the ADATA.

I also appreciate the link to the command for checking the cache pool for errors. It showed many write errors prior to moving the ADATA SSD to the LSI card.

Completely clean after the swap & rebalance.
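The exact command is not reproduced in this thread. Assuming it was `btrfs device stats` run against the pool mount point (an assumption on my part), the per-device error counters it prints can be checked with a short Python sketch like the one below. The helper names and the sample numbers are made up for illustration and are not the actual values from this pool.

```python
import re

# One counter per line, in the form printed by `btrfs device stats`, e.g.:
#   [/dev/sdb1].write_io_errs    1523
STAT_LINE = re.compile(
    r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)

def parse_device_stats(text):
    """Parse stats output into {device: {counter_name: count}}."""
    stats = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            per_dev = stats.setdefault(m.group("dev"), {})
            per_dev[m.group("counter")] = int(m.group("value"))
    return stats

def devices_with_errors(stats):
    """Return the sorted list of devices with any non-zero counter."""
    return sorted(
        dev for dev, counters in stats.items()
        if any(count > 0 for count in counters.values())
    )

# Hypothetical sample output (device names and counts are invented).
sample_output = """\
[/dev/sdb1].write_io_errs    1523
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    12
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
"""

stats = parse_device_stats(sample_output)
print(devices_with_errors(stats))
```

A pool that is "completely clean" would show all counters at zero for every device, so `devices_with_errors` would return an empty list.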

 

Thanks again for the help.


(Solved): BTRFS Cache Pool Errors

The ADATA SSD "worked" when connected to the LSI 9211-8i controller, but it did not work and kept dropping offline when connected to the ASRock 970 Extreme4 motherboard SATA controller.

However, I noticed that even though it worked when connected to the LSI card, it was running very slow and also very hot (124 degrees F).

So, I decided to take johnnie.black's advice and exchange it for a Samsung 860 EVO SSD. Much better performance, and it runs much cooler (75 degrees F).
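Since SMART tools usually report drive temperature in Celsius, here is a quick conversion of the two Fahrenheit figures above. This is just the standard formula; the labels are taken from this post.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9

# The two temperatures reported in this post.
readings = [("ADATA on LSI card", 124), ("Samsung 860 EVO", 75)]

for label, deg_f in readings:
    print(f"{label}: {deg_f} F = {fahrenheit_to_celsius(deg_f):.1f} C")
```

That works out to roughly 51 C for the ADATA versus roughly 24 C for the Samsung.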

As always, thanks for the great advice here.

Hope this helps someone else avoid issues with SSDs in a BTRFS pool down the road.
