(SOLVED) FITRIM ioctl failed: Input/output error followed by Cache pool BTRFS missing device(s)


FQs19
Solved by JorgeB

If anyone has time to explain to me what happened on my server, I would really appreciate it. 

 

While at work last night, I received an email from my unRAID server that stated the following:

cron for user root /sbin/fstrim -a -v | logger &> /dev/null

fstrim: /mnt/arraycache: FITRIM ioctl failed: Input/output error
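(For anyone hitting the same thing: a minimal way to sanity-check TRIM support and reproduce the failure outside of cron, assuming the pool is mounted at /mnt/arraycache as in the message above:)

# Non-zero DISC-GRAN/DISC-MAX values mean the device reports discard support
lsblk --discard /dev/nvme0n1

# Trim just the one mount point by hand to see the error directly
fstrim -v /mnt/arraycache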

 

Immediately after that I received another email from my unRAID server that stated:

Unraid Status: Warning [THREADRIPPER19] - Cache pool BTRFS missing device(s)

Event: Unraid Arraycache disk message
Subject: Warning [THREADRIPPER19] - Cache pool BTRFS missing device(s)
Description: Force_MP600_20238284000130053009 (nvme0n1)
Importance: warning

 

I wasn't able to log in remotely to unRAID even though I had My Servers set up to do so, but that's a problem for another day.

When I got home and logged in, I found that the disk in question, nvme0n1, was still shown on the Main tab, and I was able to view it by clicking on the Arraycache link. I've had NVMe disks drop offline before, but those disappeared from the Main tab entirely. Why did unRAID report the disk as missing while it was still visible?

My Plex docker was still working perfectly fine, which I assume is because the other disk in the cache pool never dropped offline.

 

I did some googling and read that I shouldn't be using TRIM on NVMe devices, especially if they are in a btrfs pool. That's the first I'd ever heard of that, so I will turn off TRIM on my server.

 

I haven't had time to turn my unRAID server back on and check for any other problems. I just wanted to post here and get some feedback on why this happened, how I can keep it from happening again, and what steps I should take before rebooting to make sure no data is lost.

 

Attached is my Diagnostic file and a screenshot from after getting the Diagnostic file and before shutting my server down.

 

Thank you in advance for any help given.

 

1061420260_Screenshot2022-09-03050712.thumb.jpg.d9913c6014ab6f535272d0799f14152e.jpg

threadripper19-diagnostics-20220903-0504.zip


After starting my Unraid server, I started the array in maintenance mode, then ran a file system check on both of my cache pools. Neither of them had any errors.
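(For reference, the same check run by hand might look like the below; the device path is an assumption, and btrfs check must run against an unmounted device, so the array has to be stopped or in maintenance mode.)

# Read-only btrfs check on one member of the pool; --readonly guarantees
# nothing is written to the device
btrfs check --readonly /dev/nvme0n1p1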

 

I then stopped the array and started it normally. I ran a scrub with "repair corrupted blocks" checked on my arraycache pool. The scrub repaired a lot of corruption, since that's the pool that had the NVMe drop offline.
I ran a scrub on my plexcache pool and there were no errors.
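(The command-line equivalent, assuming the pools are mounted at /mnt/arraycache and /mnt/plexcache:)

# Start a scrub; run as root on a writable mount, it repairs correctable
# errors from the redundant copy
btrfs scrub start /mnt/arraycache

# Check progress and the error totals when it finishes
btrfs scrub status /mnt/arraycache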
 

I thought everything was good until a couple of hours ago. Every hour, at exactly the 47th minute, I get an email stating:
unraid status: errors on cache pool 

Event: Unraid Status
Subject: ERRORS on cache pool
Description: No description
Importance: warning

 

It doesn't tell me which cache pool has the errors, and I'm not seeing any errors on either one.

I've run another scrub on each cache pool and there are no errors.

I looked in my logs and there's no mention of cache pool errors, other than the system sending out an email.

 

What is going on here?

 

F1489919-1854-4577-830A-34449581B3EA.thumb.jpeg.b4b89c01b38f338470ea5647c4d709bd.jpeg

threadripper19-syslog-20220904-0003.zip


I found the cause of the Errors on Cache Pool.

 

I forgot I made a user script that runs a btrfs check hourly, and once an error has been seen it keeps notifying you every time it runs, which for me is every hour.

I had to force-reset the stats on the cache pool to stop the error. I had just assumed the stats would reset after correcting the errors.
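(Those counters are btrfs's per-device error stats, which persist until explicitly zeroed; a minimal sketch, assuming the pool's mount point:)

# Show the per-device error counters a monitoring script would read
btrfs device stats /mnt/arraycache

# Reset them to zero once the underlying errors have been corrected
btrfs device stats -z /mnt/arraycache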

 

I still haven't figured out why the nvme drive dropped offline in the first place after a scheduled TRIM. 

  • Solution
2 hours ago, FQs19 said:

I still haven't figured out why the nvme drive dropped offline in the first place

 

The below sometimes helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the Main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
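(A quick way to confirm the parameter took effect after the reboot:)

# Should print 0, i.e. the deepest NVMe power-saving states are disabled
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us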

10 hours ago, JorgeB said:

 

The below sometimes helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the Main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.

 

Thanks for the reply JorgeB.

I did have that line in my original unRAID server, but I don't remember why I didn't put it in my fresh install.

I'll try to find out why I left it out, then add it back, turn TRIM back on, and see what happens.

 

Thanks again. 


So one of my nvme drives (nvme0n1) went missing again. 

This happened right after the backup/restore appdata plugin finished running a backup. The last time, it went missing after running TRIM.

 

I'm at work, unable to do anything but grab the diagnostic zip. I'm attaching it to this post. 

 

Could someone please help me figure out why I keep losing an NVMe disk?

 

Do I need to wipe that cache pool and start over? 

Remove that NVMe disk, put it in a Windows machine, test it, then erase it, put it back in unRAID, and try again? (Or test it in place; see the sketch below.)

Just return it to Corsair if it's still under warranty?

Or is there a problem with this motherboard?
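(One in-place alternative to pulling the disk, assuming smartmontools and nvme-cli are available on the console and the drive enumerates as /dev/nvme0:)

# Read the drive's own SMART/health data without moving it to another machine
smartctl -a /dev/nvme0

# Alternative via nvme-cli; media_errors and num_err_log_entries are the
# fields to watch
nvme smart-log /dev/nvme0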

 

I just can't seem to win with unraid. It's been nothing but problem after problem after problem. I can't believe that I'm having this many hardware problems. 

In one year I've replaced the case, the power supply, the motherboard, the flash drive, the memory, the sata cables, the hba cables, and even changed out the nvme drives. 

 

Beyond frustrated at this point.

threadripper19-diagnostics-20220908-0253.zip


It's the other device this time, and since they are different brands, that suggests it's not a device issue.

 

59 minutes ago, FQs19 said:

Do I need to wipe that cache pool and start over? 

That won't help; this is a hardware-related issue. Look for a BIOS update for the board. I'd also usually suggest trying a different M.2/PCIe slot if available, but since both devices dropped, that's also unlikely to help.

 

 

8 hours ago, JorgeB said:

It's the other device this time, and since they are different brands, that suggests it's not a device issue.

 

That won't help; this is a hardware-related issue. Look for a BIOS update for the board. I'd also usually suggest trying a different M.2/PCIe slot if available, but since both devices dropped, that's also unlikely to help.

 

 

Thanks again for your help JorgeB.

 

I'm going to pull both NVMe disks out, make sure all contacts are clean, verify the heatsinks are secured and the thermal pads are in the correct location, then reinstall the disks. I might swap which M.2 slot each one goes back in. Would that affect the cache pool? I'm thinking it will change the name associated with each NVMe disk, e.g. nvme0n1 would become nvme1n1.

 

Also, to get my nvme disk back online, are these the steps to take (these are the steps I did the first time):

-reboot server

-verify all nvme disks are online

-start array in maintenance mode

-run a file system check

-if no errors are found, start array in normal mode, then run a correcting scrub.

 

After those steps, I'll shut my server down and do the steps I laid out earlier. 

I'm also going to change my memory speed to its default of 2666 MHz from the 3000 MHz it is set to now.

 

33 minutes ago, FQs19 said:

Would that affect the cachepool?

Nope.
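(btrfs records pool membership by the filesystem UUID stored in each device's superblock, not by the nvmeXn1 kernel name, so a slot swap changing the names is harmless. It can be confirmed with:)

# Lists each btrfs filesystem by UUID together with its member devices
btrfs filesystem show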

 

33 minutes ago, FQs19 said:

Also, to get my nvme disk back online, are these the steps to take (these are the steps I did the first time):

-reboot server

-verify all nvme disks are online

-start array in maintenance mode

-run a file system check

-if no errors are found, start array in normal mode, then run a correcting scrub.

A file system check shouldn't be needed, but it won't hurt as long as it's run in read-only mode.

