
NVMe Parity fails repeatedly


Go to solution Solved by JorgeB,


Hi All,

I wanted to get some guidance to a nasty problem I am facing.

 

I have an all-flash Unraid box: 2x 2TB SSDs and 2x NVMe (1x 500GB, 1x 2TB).

I have the bigger NVMe assigned as the Parity Drive.

 

The Parity Drive has now failed for the second time within 4 months. (The 1st failure came after 1-2 months and the vendor replaced the drive; the 2nd failure has now occurred.)

 

Is there something fundamental that I am missing when using NVMe Parity Drives?

 

I assumed that I fried the first NVMe because it was mounted beneath the MoBo - hence I moved it to a CPU-Cooler Exposed position with the second.

Is an NVMe heatsink a must? Are there other things I am missing?

 

Thanks for your help!

Link to comment
Posted (edited)

How much is/was the total write amount on that drive?

 

Parity gets as many writes as all other drives combined, and is always "100% full" as far as the drive is concerned which means it can't make use of most of its lifetime-extending tricks. 

 

Not a recommended setup, SSDs should be in a pool.
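To illustrate why parity sees as many writes as all the data drives combined, here is a minimal toy sketch of single XOR parity. The `ToyArray` class and its counters are made up for illustration and are not how Unraid is implemented; the only thing it assumes is that single parity is the XOR of the data drives, so every data write forces a matching parity update.

```python
from functools import reduce


class ToyArray:
    """Toy single-parity array; each 'drive' is a list of byte values.

    Hypothetical model for illustration only, not Unraid's implementation.
    """

    def __init__(self, n_data, n_blocks):
        self.data = [[0] * n_blocks for _ in range(n_data)]
        self.parity = [0] * n_blocks
        self.data_writes = [0] * n_data  # write count per data drive
        self.parity_writes = 0           # write count on the parity drive

    def write(self, drive, block, value):
        old = self.data[drive][block]
        # Read-modify-write: parity absorbs the XOR difference between
        # the old and the new data, so it is rewritten on every data write.
        self.parity[block] ^= old ^ value
        self.data[drive][block] = value
        self.data_writes[drive] += 1
        self.parity_writes += 1

    def parity_ok(self):
        """Check that parity equals the XOR of all data drives, block by block."""
        return all(
            reduce(lambda a, b: a ^ b, (d[i] for d in self.data)) == self.parity[i]
            for i in range(len(self.parity))
        )


if __name__ == "__main__":
    arr = ToyArray(n_data=3, n_blocks=4)
    arr.write(0, 1, 0xAB)
    arr.write(1, 1, 0x0F)
    arr.write(2, 3, 0x77)
    print(arr.data_writes, arr.parity_writes, arr.parity_ok())
```

However the writes are spread across the data drives, `parity_writes` always equals `sum(data_writes)` in this model, which is the point made above.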

Edited by Kilrah
Link to comment
Posted (edited)

So the idea would be to always have spinning rust as the parity? :(

I wanted to get a low energy box, so spinning rust was out of the picture.

 

Is there anything I could do to avoid spinning rust?

 

The total write amount wasn't too high - but the read amount was.

Edited by ivangoetelek
Link to comment

What model is the device? I've used an NVMe device as parity for some time and never had issues. Of course, if you are writing extreme amounts of data there could be. You should be able to check on SMART how much was written to the other array devices, then add that up to see if it was anywhere close to the parity device's max TBW.
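The add-it-up check described above could be sketched roughly like this. It assumes the NVMe SMART field "Data Units Written", which the NVMe spec reports in units of 1000 x 512 bytes; for SATA SSDs the written-data attribute and its unit vary by vendor, so this sketch just takes TB values you have already worked out. The function names and all numbers below are placeholders, not real figures for any drive in this thread.

```python
# NVMe "Data Units Written" counts units of 1000 * 512 bytes.
BYTES_PER_NVME_DATA_UNIT = 512 * 1000


def nvme_data_units_to_tb(data_units):
    """Convert an NVMe 'Data Units Written' value to decimal terabytes."""
    return data_units * BYTES_PER_NVME_DATA_UNIT / 1e12


def parity_wear_estimate(data_drive_tb_written, parity_tbw_rating):
    """Estimate parity writes as the sum of writes to all data drives.

    Returns (estimated TB written to parity, fraction of the TBW rating).
    Ignores parity syncs/rebuilds, which add a full-drive write each time.
    """
    total = sum(data_drive_tb_written)
    return total, total / parity_tbw_rating


if __name__ == "__main__":
    # Placeholder values: replace with the numbers from your own SMART
    # reports and the TBW rating from your drive's datasheet.
    data_drives_tb = [
        nvme_data_units_to_tb(2_000_000),  # e.g. an NVMe data drive
        3.5,                               # e.g. a SATA SSD, already in TB
        4.0,                               # e.g. another SATA SSD
    ]
    total, fraction = parity_wear_estimate(data_drives_tb, parity_tbw_rating=600.0)
    print(f"Estimated parity writes: {total:.2f} TB ({fraction:.1%} of rated TBW)")
```

If the fraction comes out tiny, wear from normal array writes is an unlikely explanation for the failures.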

Link to comment
10 hours ago, ivangoetelek said:

Is an NVMe heatsink a must? Are there other things I am missing?

What were the temps? Flash memory actually loves heat, but the controllers do not. I could peg tmax on my NVMe in a well-ventilated install with little effort, so I always add a heatsink.

 

For a ~4TB low-power NAS, I'd just forgo parity and rely on backups, unless you really need uptime and can't afford a few hours of downtime.

Link to comment
3 hours ago, Kilrah said:

ZFS pool instead of array, but that'll be a poor solution for your selection of drives and would need everything emptied out first...

I feel that this can't be the solution

2 hours ago, JorgeB said:

What model is the device? I've used an NVMe device as parity for some time and never had issues. Of course, if you are writing extreme amounts of data there could be. You should be able to check on SMART how much was written to the other array devices, then add that up to see if it was anywhere close to the parity device's max TBW.

Verbatim Vi3000 2TB - just some garbage NVMe - not much of anything, but it was cheap

In general I do not write that much data to the array... I feel that I could write a lot more :D

I can't really figure out the SMART Reports - so I have no idea how much data I have actually written...

 

1 hour ago, Michael_P said:

What were the temps? Flash memory actually loves heat, but the controllers do not. I could peg tmax on my NVMe in a well-ventilated install with little effort, so I always add a heatsink.

 

For a ~4TB low-power NAS, I'd just forgo parity and rely on backups, unless you really need uptime and can't afford a few hours of downtime.

Temps were up to 85°C - 90°C while doing the parity check.

No parity also seems like a non option, any downtime is just a huge pain in the behind.

 

 

 

Isn't there anything I can do?

Is it possible that I had 2 consecutive Bad Devices - sounds like a challenge for high school statistics class :D

Link to comment

hmpf - okay, so I'll just try again with the next Vi3000, and if that fails again in some months, I shall revive this topic...

 

Is there anything I can do when it comes to logging / diagnostics, so if the next device fails, I can easily isolate the actual problem? I understood that rebooting after the drive failed didn't help much.

Link to comment
Just now, ivangoetelek said:

understood that rebooting after the drive failed didn't help much.

As you surmised, the syslog in the diagnostics is the RAM version that starts afresh every time the system is booted. You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot, so we can see what leads up to a crash. The mirror to flash option is the easiest to set up (and if used, the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.
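Once you have a syslog that survives a reboot, a small helper like the following could pull out the lines leading up to each mention of the drive. This is only a sketch: the function name, the `nvme` keyword, and the file path in the usage comment are assumptions, and the actual messages in your log will look different.

```python
from collections import deque


def context_before_matches(lines, keyword="nvme", before=3):
    """Return (previous_lines, matching_line) for each line containing keyword.

    The match is case-insensitive; 'before' sets how many preceding lines
    are kept as context for each match.
    """
    recent = deque(maxlen=before)
    matches = []
    for line in lines:
        if keyword.lower() in line.lower():
            matches.append((list(recent), line))
        recent.append(line)
    return matches


# Usage sketch (the path is hypothetical; use wherever your persisted
# syslog actually ends up):
#
# with open("/path/to/persisted/syslog") as f:
#     for context, hit in context_before_matches(f.read().splitlines()):
#         print("\n".join(context + [hit]))
#         print("-" * 40)
```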

 

2 minutes ago, ivangoetelek said:

so I'll just try again with the next Vi3000, and if that fails again in some months, I shall revive this topic...

Could it be that the brand is the issue? Perhaps you should try another one that might be better quality?

Link to comment
28 minutes ago, ivangoetelek said:

Is there anything I can do when it comes to logging / diagnostics, so if the next device fails, I can easily isolate the actual problem? I understood that rebooting after the drive failed didn't help much.

If it fails, save the diagnostics before rebooting. The diagnostics package will include the SMART information (which you can also see by clicking on the drive slot name and going to the Attributes tab).
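If you would rather read the SMART data from the command line, `smartctl` can print it as JSON (for example `smartctl -j -a /dev/nvme0`, where the device name is an assumption for your system). Below is a minimal sketch that pulls a few health fields out of that JSON. The key names are the ones I believe smartctl uses for the NVMe health log, so treat them as an assumption and check them against your own output; `nvme_health_summary` itself is a made-up helper.

```python
import json


def nvme_health_summary(smartctl_json_text):
    """Extract a few NVMe health fields from `smartctl -j -a` output.

    Key names are assumed from smartctl's NVMe JSON output; any missing
    field is returned as None rather than raising.
    """
    report = json.loads(smartctl_json_text)
    log = report.get("nvme_smart_health_information_log", {})
    units = log.get("data_units_written")
    return {
        "temperature_c": log.get("temperature"),
        "percentage_used": log.get("percentage_used"),
        "media_errors": log.get("media_errors"),
        # NVMe data units are 1000 * 512 bytes each.
        "tb_written": None if units is None else units * 512_000 / 1e12,
    }


# Usage sketch: save the output first, then parse it.
#   smartctl -j -a /dev/nvme0 > nvme0.json
#
# with open("nvme0.json") as f:
#     print(nvme_health_summary(f.read()))
```

Saving this alongside the diagnostics before rebooting gives you written data, temperature and media errors in one place for the next RMA.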

Link to comment
1 minute ago, ivangoetelek said:

I have found the section where the drive died - but I can't make much of it...

That shows that you appeared to complete the parity check successfully, but then several hours later suddenly started getting read and write errors on the parity drive. No indication I can see as to why.

Link to comment

So I RMA'd the faulty drive - let's see what they say/do - also I have ordered a beefed up NVMe Cooler - maybe that helps.

 

But seriously - is it possible that the syslog is responsible for a huge amount of writes?

Other than that I cannot imagine much write at all.

 

Is there any recommendation for e.g. appdata, to keep it away from the parity drive?

Link to comment
