ivangoetelek Posted July 1
Hi all, I wanted to get some guidance on a nasty problem I am facing. I have an all-flash Unraid box: 2x 2TB SSDs and 2x NVMe (1x 500GB, 1x 2TB). I have the bigger NVMe assigned as the parity drive. The parity drive has now failed for the second time within 4 months. (The first failure came after 1-2 months and the vendor replaced the drive; the second failure has now occurred.)
Is there something fundamental that I am missing when using NVMe parity drives? I assumed that I fried the first NVMe because it was mounted beneath the motherboard, so I moved the second one to a position exposed to the CPU cooler. Is an NVMe heatsink a must? Are there other things I am missing? Thanks for your help!
Kilrah Posted July 1 (edited)
How much is/was the total write amount on that drive? Parity gets as many writes as all other drives combined, and is always "100% full" as far as the drive is concerned, which means it can't make use of most of its lifetime-extending tricks. Not a recommended setup; SSDs should be in a pool.
ivangoetelek (Author) Posted July 1 (edited)
So the idea would be to always have spinning rust as the parity? I wanted a low-energy box, so spinning rust was out of the picture. Is there anything I could do to avoid spinning rust? The total write amount wasn't too high, but the read amount was high.
ivangoetelek (Author) Posted July 1
There you go. Thanks!
mini-diagnostics-20240701-1544.zip
JorgeB Posted July 1
The diags are from after a reboot. Did you try power cycling the server in case the device dropped offline? If it did, just rebooting is usually not enough to get it back.
ivangoetelek (Author) Posted July 1
Indeed, I power cycled the machine. I have issued an RMA, but I have no idea what to do differently so I won't end up with another bricked NVMe in 2 months... Any idea?
Kilrah Posted July 1 (edited)
A ZFS pool instead of the array, but that would be a poor solution for your selection of drives and would need everything emptied out first...
JorgeB Posted July 1
What model is the device? I've used an NVMe device as parity for some time and never had issues. Of course, if you are writing extreme amounts of data, there could be. You should be able to check in SMART how much was written to the other array devices, then add that up to see if it was anywhere close to the parity device's TBW max.
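For anyone unsure how to do the adding-up suggested above, here is a rough Python sketch of the arithmetic. It relies on the NVMe convention that "Data Units Written" counts units of 1000 x 512 bytes, and it assumes the SATA SSDs expose a written-LBA counter with 512-byte LBAs, which varies by vendor. The function names, the sample counter values and the 1000 TB rating are made up for illustration; the real TBW figure has to come from the drive's datasheet.

```python
from typing import Iterable

# NVMe convention: "Data Units Written" counts units of 1000 x 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000
# Assumption: the SATA SSD counter counts 512-byte LBAs (vendor dependent).
ASSUMED_LBA_BYTES = 512


def nvme_bytes_written(data_units_written: int) -> int:
    """Convert an NVMe 'Data Units Written' counter to bytes."""
    return data_units_written * NVME_DATA_UNIT_BYTES


def sata_bytes_written(total_lbas_written: int) -> int:
    """Convert a SATA written-LBA counter to bytes (512-byte LBAs assumed)."""
    return total_lbas_written * ASSUMED_LBA_BYTES


def parity_wear_fraction(data_drive_bytes: Iterable[int], rated_tbw_tb: float) -> float:
    """Estimate parity wear: parity sees roughly the sum of all data-drive writes."""
    total_tb = sum(data_drive_bytes) / 1e12
    return total_tb / rated_tbw_tb


# Hypothetical counter values for one NVMe data drive and two SATA SSDs.
example_writes = [
    nvme_bytes_written(4_000_000),
    sata_bytes_written(10_000_000_000),
    sata_bytes_written(6_000_000_000),
]
HYPOTHETICAL_TBW_TB = 1000.0  # placeholder rating, not the Vi3000's real figure

fraction = parity_wear_fraction(example_writes, HYPOTHETICAL_TBW_TB)
print(f"Estimated parity writes: {sum(example_writes) / 1e12:.2f} TB "
      f"({fraction:.1%} of the assumed rating)")
```

If the resulting fraction is small, write wear is unlikely to explain a failure after only a couple of months.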
Michael_P Posted July 1
10 hours ago, ivangoetelek said: Is a NVMe Heatsink a must? Are there other things I am missing?
What were the temps? Flash memory actually loves heat, but the controllers do not. I could peg tmax on my NVMe in a well-ventilated install with little effort, so I always add a heatsink. For a ~4TB low-power NAS, I'd just forgo parity and rely on backups instead, unless you really need uptime and can't afford a few hours of downtime.
ivangoetelek (Author) Posted July 1
3 hours ago, Kilrah said: ZFS pool instead of array, but that'll be a poor solution for your selection of drives and need emptying everything out...
I feel that this can't be the solution.
2 hours ago, JorgeB said: What model is the device? I've used an NVMe device as parity for some time and never had issues, of course, if you are writing extreme amounts of data there could be, you should be able to check how much was written to the other array devices on SMART, then add that up to see if it was anywhere close to the parity device TBW max.
Verbatim Vi3000 2TB. Just some garbage NVMe, nothing special, but it was cheap. In general I do not write that much data to the array... I feel that I could write a lot more. I can't really figure out the SMART reports, so I have no idea how much data I have actually written...
1 hour ago, Michael_P said: What were the temps? Flash memory actually loves heat, but their controllers do not. I could peg tmax on my NVMe in a well ventilated install with little effort, so I always add a heatsink. For a ~4TB low power NAS, I'd just forego parity and rely on backups for uptime, unless you really need uptime and cant afford a few hours downtime.
Temps were up to 85°C - 90°C during the parity check. No parity also seems like a non-option; any downtime is just a huge pain in the behind. Isn't there anything I can do? Is it possible that I had 2 consecutive bad devices? Sounds like a challenge for a high school statistics class.
Kilrah Posted July 1
Unlikely, but for that use, given how seriously the drive is mistreated, you'd really want a good-quality TLC drive, not the cheapest of the bunch.
JorgeB Posted July 2
11 hours ago, ivangoetelek said: Verbatim Vi3000 2TB - just some garbage NVMe - not much of anything, but it was cheap
Possibly just bad devices.
ivangoetelek (Author) Posted July 2
Hmpf, okay, so I'll just try again with the next Vi3000, and if that fails again in a few months, I shall revive this topic... Is there anything I can do when it comes to logging / diagnostics so that, if the next device fails, I can easily isolate the actual problem? I understood that rebooting after the drive failed didn't help much.
itimpi Posted July 2
Just now, ivangoetelek said: understood that rebooting after the drive failed didn't help much.
As you surmised, the syslog in the diagnostics is the RAM version that starts afresh every time the system is booted. You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot, so we can see what leads up to a crash. The Mirror to Flash option is the easiest to set up (and if used, the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.
2 minutes ago, ivangoetelek said: so I'll just try again with the next Vi3000, and if that fails again in some months, I shall revive this topic...
Could it be that the brand is the issue? Perhaps you should try another one that might be better quality?
JorgeB Posted July 2
21 minutes ago, ivangoetelek said: try again with the next Vi3000
I would recommend using a better quality device.
Kilrah Posted July 2
28 minutes ago, ivangoetelek said: Is there anything I can do when it comes to logging / diagnostics, so if the next device fails, I can easily isolate the actual problem? I understood that rebooting after the drive failed didn't help much.
If it fails, save the diagnostics before rebooting. The diagnostics package will include the SMART information (which you can also see by clicking on the drive slot name and going to the Attributes tab).
ivangoetelek (Author) Posted July 2
22 minutes ago, itimpi said: You should enable the syslog server
I do have that enabled. Could this be the cause of the parity drive failing? I have found the section where the drive died, but I can't make much of it...
Syslog_nvme_dying.log
itimpi Posted July 2
1 minute ago, ivangoetelek said: I have found the section where the drive died - but I can't make much of it...
That shows that you appeared to complete the parity check successfully, but then several hours later suddenly started getting read and write errors on the parity drive. I can see no indication as to why.
ivangoetelek (Author) Posted July 2
But a parity check is generally read-intensive rather than write-intensive, which to me means that heat was the issue, not the TBW. What do you think?
Kilrah Posted July 2
Could be, since you said it was reaching 90°C. A decent drive would throttle to limit temp, but...
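One way to see whether the drive has been running hot is to look at the thermal counters in its SMART health log. The Python sketch below is only an illustration: it assumes the report comes from smartctl's JSON output and that the keys are named `nvme_smart_health_information_log`, `temperature`, `warning_temp_time` and `critical_comp_time`, which is how smartctl names them to my knowledge but should be checked against your own output. The function name, the 70°C default and the sample values are hypothetical.

```python
import json
from typing import Any, Dict, List


def thermal_findings(report: Dict[str, Any], warn_at_c: int = 70) -> List[str]:
    """List thermal warning signs found in a smartctl-style JSON report."""
    # Key names are assumed to match smartctl's JSON output for NVMe drives.
    log = report.get("nvme_smart_health_information_log", {})
    findings = []
    temp = log.get("temperature")
    if temp is not None and temp >= warn_at_c:
        findings.append(f"current temperature {temp} C is at or above {warn_at_c} C")
    warn_minutes = log.get("warning_temp_time", 0)
    if warn_minutes:
        findings.append(f"{warn_minutes} min spent above the warning temperature")
    crit_minutes = log.get("critical_comp_time", 0)
    if crit_minutes:
        findings.append(f"{crit_minutes} min spent above the critical temperature")
    return findings


# Hypothetical sample report, not taken from the diagnostics in this thread.
sample = json.loads(
    '{"nvme_smart_health_information_log":'
    ' {"temperature": 88, "warning_temp_time": 120, "critical_comp_time": 0}}'
)
for line in thermal_findings(sample):
    print(line)
```

A non-zero warning or critical time would support the heat theory, since it means the controller itself recorded time spent above its own thresholds.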
ivangoetelek (Author) Posted July 2
So I RMA'd the faulty drive; let's see what they say/do. I have also ordered a beefed-up NVMe cooler, maybe that helps. But seriously, is it possible that the syslog is responsible for a huge amount of writes? Other than that I cannot imagine much writing at all. Is there any recommendation when it comes to e.g. appdata, to keep it away from the parity drive?
Michael_P Posted July 2
48 minutes ago, ivangoetelek said: also I have ordered a beefed up NVMe Cooler - maybe that helps
Doesn't need to be fancy, most anything will do.
Michael_P Posted July 2
I will say, don't use the two-piece ones (bottom plate and top heatsink). Just use the top heatsink and leave the bottom as it is.