What did I do wrong? (Cache Drive issues)


Back in October I picked up a WD Red SN750 and one of these (PCIe NVMe adapter). Installed that into my server (Intel S1200BTL board with a Xeon E3 1240 CPU and 16 GB of DDR3 ECC RAM). Threw it into the system, configured the cache drive, installed the app data backup plugin, set it up, formatted the cache drive to btrfs and everything was great, until about two weeks ago.

 

I left on vacation, and the day after I left, this new cache drive suddenly went read-only and all my dockers never came back online after a backup. The data on the drive was corrupted (not all of it, just most of it as near as I can tell). The backup kept running, but at that point it was just backing up bad data.

 

I could not do much about it from across the globe, so I just let it be till this weekend.

The drive was visible to the system, and I could browse to it and look at the data on it. Apart from the corrupted files and read-only state, everything looked fine. I decided that the first thing to try was to simply reboot the system, so I did. It started up, and the cache pool and drive were just missing. `lsblk` showed nothing, but `lspci` showed the drive was there.

 

My next thought was that I just got unlucky and the drive was bad. So I yanked it (and the adapter) from the server, and set about just rebuilding all my dockers. Thankfully this was reasonably straightforward and re-doing them was not really much of a loss except for Jellyfin; that one hurts a little to have to start all over again, but it's happily scanning my media folder now.

 

Ok, next step was to check the SN750, get a report going and see if I could RMA the drive through WD. Well, to my surprise the drive tests out just fine, nothing wrong with it. Threw it back into the carrier board and tested it out on another Linux machine I have lying around, still works perfectly.

 

So now I am wondering what the heck happened? Should I have done something different when setting up the cache drive? Is the issue the cheap PCIe to NVMe adapter I bought and if so, is there a better one I can buy?

 

Obviously I am now very gun-shy about putting it back in the server, especially as the backup plugin gives me near zero guarantee that this won't happen again (is there a better way to back up the app data folder?).

 

Any advice would be welcome here.

 

Thanks in advance

 

 

Edited by lutiana
13 hours ago, JorgeB said:

Impossible to say without the diagnostics.

Fair enough, attached are the diagnostics, but they were run today, after I pulled the drive from the system, and the original issue occurred around March 22, so I'm not sure they will tell us anything about what happened. Interestingly, the drive does not even show up in the Historical device section of the Main section in the UI.

 

Mostly I am looking for an idea of what best practice is here, and how that compares to what I did, and/or whether there are any obvious hardware compatibility issues here. Like, is this a common occurrence with PCIe to NVMe adapters? Should I have prepped the drive differently? Was there some sort of extra backup process I should have looked into doing? etc. Basically, how can I add the cache drive back in, but also put something in place to guarantee that this won't happen again, or if it does, a guaranteed way to recover from it?

deepthought-diagnostics-20220410-1541.zip

Edited by lutiana
10 hours ago, lutiana said:

and the original issue occurred around March 22, so not sure they will tell us anything about what happened.

They won't, we'd need the diags from that time.

 

10 hours ago, lutiana said:

Like, is this a common occurrence with PCIe to NVMe adapters?

It's not that uncommon for some NVMe devices to drop offline, with or without an adapter; sometimes disabling power save helps with that.
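For anyone wanting to check the power-save setting mentioned above: on Linux, NVMe autonomous power state transitions (APST) are governed by the `nvme_core.default_ps_max_latency_us` module parameter, and a value of 0 disables APST. Below is a minimal Python sketch that only reads the current value from sysfs. The helper names (`apst_max_latency_us`, `describe_apst`) are made up for illustration, and whether this setting is the one that matters for a given drive or board is an assumption, not something confirmed in this thread.

```python
from pathlib import Path

# Standard sysfs location of the nvme_core module parameter on Linux.
APST_PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def apst_max_latency_us(param_path=APST_PARAM):
    """Return the configured max APST latency in microseconds, or None if unreadable."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        # Parameter file missing (module not loaded) or content not an integer.
        return None


def describe_apst(value):
    """Turn the raw parameter value into a short human-readable status."""
    if value is None:
        return "unknown (nvme_core parameter not readable)"
    if value == 0:
        return "APST disabled"
    return f"APST enabled (max latency {value} us)"


if __name__ == "__main__":
    print(describe_apst(apst_max_latency_us()))
```

Changing the value is normally done with a kernel boot parameter rather than from a script, so this sketch deliberately stops at reporting.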

 

10 hours ago, lutiana said:

Was there some sort of extra backup process I should have looked into doing?

You should always have backups of anything important.
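One possible safeguard against the specific failure described in this thread (backups continuing to run after the pool went read-only) is to check the mount state before each backup. A minimal Python sketch, assuming the pool is mounted at `/mnt/cache` and that the kernel reports the forced read-only state in `/proc/mounts`; the function names are hypothetical and this is not part of any backup plugin. It does not detect file corruption, only a missing or read-only mount.

```python
def mount_options(mount_point, mounts_text):
    """Return the set of mount options for mount_point, or None if it is not mounted."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[1] == mount_point:
            # If a mount point appears more than once, the last entry wins.
            options = set(fields[3].split(","))
    return options


def safe_to_back_up(mount_point, mounts_text):
    """True only if mount_point is present and mounted read-write."""
    opts = mount_options(mount_point, mounts_text)
    return opts is not None and "rw" in opts
```

A backup script could read `/proc/mounts`, call `safe_to_back_up("/mnt/cache", text)`, and skip the run (keeping the last good backup untouched) when it returns `False`.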

6 hours ago, JorgeB said:

You should always have backups of anything important.

 

Thank you for this completely frustrating and useless statement. I did have backups. I followed the prescribed process of using the extension to do backups, but nowhere in my extensive research and the questions I had about this did anyone tell me that this would not be adequate due to *known* issues with NVMe storage and unRAID. Hell, I never wanted to move the Appdata folder to the cache drive in the first place, as I was more interested in resiliency than performance, but for some reason this is just not an option in unRAID.

 

Also, can you point me to the official unRAID documentation that warns that NVMe based storage is not reliable due to the fact that it can "drop offline" and that this is not uncommon? Had I known that, I would not have wasted my time and money adding a cache drive at all.

3 minutes ago, lutiana said:

can you point me to the official unRAID documentation that warns that NVMe based storage is not reliable

I never said that, just that with some hardware it's not that uncommon, but it's generally reliable. It can also happen with SATA devices, though in those cases it's usually power/connection related. NVMe devices have been reliable for me and most users, and I have multiple NVMe devices in multiple servers.

1 hour ago, lutiana said:

Also, can you point me to the official unRAID documentation that warns that NVMe based storage is not reliable due to the fact that it can "drop offline" and that this is not uncommon? Had I known that, I would not have wasted my time and money adding a cache drive at all.

The vast majority of people have no issues using NVME drives with Unraid.   If it drops offline it suggests a problem at the hardware level in your particular configuration. 

6 hours ago, itimpi said:

The vast majority of people have no issues using NVME drives with Unraid.   If it drops offline it suggests a problem at the hardware level in your particular configuration. 

 

Does the diagnostic report I posted indicate what the issue is with my system?

FWIW - I acknowledge that calling NVMe unreliable in unRAID is probably a step too far, but from what I gather in this thread, this is a known issue, and therefore should at least warrant a footnote in the documentation, letting people factor this into their decisions, especially as this can cause data corruption even when the hardware is working just fine.

Edited by lutiana