nukeman Posted February 1, 2022

Recently I received a warning about some reallocated sectors on one of the SSDs in my cache pool. The other day I was doing some heavy copying to the cache and received a similar warning on the other drive in the pool. I've searched the forums, and this seems to be either a "watch it to see if it gets worse" or a "critical, fix it now" problem. I went back through some old diagnostics: two months ago both drives had Reallocated_Sector_Ct=0 in their SMART reports. Now one drive has Reallocated_Sector_Ct=3 and the other has Reallocated_Sector_Ct=7. Both drives are 1TB Samsung 870 EVOs purchased in February 2021.

I'm not excited about the prospect of swapping out the drives, as they contain several critical VMs for my home business as well as Unraid's cache. That said, I'm also not excited about both drives in my cache pool failing at the same time. Assuming these warnings are something I should act on, I started the RMA process with Samsung. I did get an RMA issued, but they won't send out a replacement drive until I send the old one in for evaluation. Before I start down that road, though, I wanted to get some opinions on what my next steps should be. Is this a warning that warrants replacing both drives? If so, how should I go about doing it? I thought Samsung SSDs were generally well regarded; maybe I just got unlucky?

BTW, I'm happy to post the SMART reports, but is there any sensitive information I should remove first? For example, should I remove the serial numbers from the reports prior to posting?
Vr2Io Posted February 1, 2022

16 hours ago, nukeman said: The other day I was doing some heavy copying

How full (% used) are they? Do you move data to the array daily to free up space? If an SSD is almost full and holds mostly static files, you may be using it in a way that greatly decreases its endurance.

16 hours ago, nukeman said: maybe I just got unlucky?

It's not likely for two SSDs to develop problems at the same time unless you've reached the 600TB write endurance. https://www.storagereview.com/review/samsung-870-evo-ssd-review

If you're really concerned about sensitive information, you can mask the serial numbers and then post both SSD SMART reports.

Edited February 1, 2022 by Vr2Io
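If it helps, the serial line in a saved smartctl text report can be masked with a quick sed pass before posting. smartctl prints it on a line starting with "Serial Number:"; the filename below is just a placeholder for your own saved report.

```shell
# Replace everything after "Serial Number:" with [REDACTED], keeping a .bak copy.
# smart_report.txt is a placeholder filename.
sed -i.bak 's/^\(Serial Number:[[:space:]]*\).*/\1[REDACTED]/' smart_report.txt
```

The `-i.bak` keeps an untouched backup alongside the edited file, in case the original is ever needed again.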
nukeman Posted February 1, 2022 Author

54 minutes ago, Vr2Io said: How full (% used)?

The drives have 671GB free, so they're about 33% full. Here are the SMART reports.

2947T SMART Report.txt
2900M SMART Report.txt

Edited February 1, 2022 by nukeman
Vr2Io Posted February 2, 2022

Both SSDs' Wear_Leveling_Count reads 44 and 45 (the initial value is 0). I'm not sure what 44 actually means as a percentage, but going by Total_LBAs_Written they're still under the endurance spec. I found some info suggesting the raw value counts TiB written, so 45 would mean 45TiB written.

Some people have also reported two 870s failing at the same time.

Edited February 2, 2022 by Vr2Io
nukeman Posted February 2, 2022 Author

13 hours ago, Vr2Io said: so 45 means writing 45TiB

Well, I'm going to need some help translating the Wear_Leveling_Count. You're saying I've written 45TB to the drive?!? I don't understand how that could happen. Also, I'm not excited to read those reports of other 870 EVOs failing...
JorgeB Posted February 2, 2022

17 minutes ago, nukeman said: You're saying I've written 45TB to the drive?!?

Around that, yes. You can easily calculate it from this attribute:

241 Total_LBAs_Written -O--CK 099 099 000 - 101015376296

101015376296 x 512 (sector size) = 51,719,872,663,552 bytes, or about 47.04 TiB
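That conversion can be reproduced with plain shell arithmetic; 512 is the logical sector size these drives report, and 1 TiB is 2^40 bytes:

```shell
lbas=101015376296                             # raw value of SMART attribute 241 (Total_LBAs_Written)
bytes=$((lbas * 512))                         # 512-byte logical sectors
tib=$((bytes / (1024 * 1024 * 1024 * 1024)))  # whole TiB (2^40 bytes)
echo "$bytes bytes, ~$tib TiB"                # prints "51719872663552 bytes, ~47 TiB"
```

Integer division drops the fraction, so this shows ~47 TiB rather than the exact 47.04.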
nukeman Posted February 2, 2022 Author

Are there any reports of excessive writing to cache drives or cache pools? My cache hosts 4-5 VMs and the usual Docker containers (Radarr, Sonarr, etc.). Both of these drives were introduced into my system when I initially created the cache pool. Prior to that I was using a single (smaller) SSD for cache and Docker/VM hosting.
JorgeB Posted February 2, 2022

That looks normal to me for one year. Previous Unraid releases had a problem with excessive writes; IIRC one of my cache SSDs was at one point writing about 3TB/day, which is most of the reason it now has close to 1PB of total writes.
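For scale, 3TB/day really does add up to roughly a petabyte over a year:

```shell
tb_per_day=3
tb_per_year=$((tb_per_day * 365))
echo "${tb_per_year} TB per year"   # prints "1095 TB per year", i.e. roughly 1.1 PB
```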
nukeman Posted February 2, 2022 Author

OK, 45TB sounded like a lot, but I guess I am moving large media files around frequently. Currently my downloads folder uses the cache; perhaps it would make sense not to, since I don't really care about fast writes for downloads.

Regardless, I'm going to RMA the drives one at a time, to keep the server up as much as possible. The plan is to:

1. Remove one of the drives from the pool and return it
2. When I get the replacement drive, put it into the pool and let btrfs rebuild it
3. Take the other old drive out of the pool and return it
4. Finally, put the second replacement drive into the pool

Or should I follow this procedure instead? This post says there's trouble rebuilding the pool in 6.9.2, but there's a workaround?
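For reference, the raw btrfs operations behind that plan look roughly like the sketch below. This is an assumption-laden sketch, not Unraid's actual procedure: /mnt/cache is the default pool mount point, /dev/sdX1 and /dev/sdY1 are placeholder device names, and on Unraid you would normally do all of this from the GUI (stop the array, change the pool assignment, start the array) and let it drive btrfs for you.

```shell
# A two-device raid1 pool can't drop below two devices, so convert to the
# single profile first (-f is required when reducing metadata redundancy).
btrfs balance start -f -dconvert=single -mconvert=single /mnt/cache
btrfs device remove /dev/sdX1 /mnt/cache   # migrates data off the device, then removes it

# ...after the replacement drive arrives:
btrfs device add /dev/sdY1 /mnt/cache
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache   # mirror across both again
```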
JorgeB Posted February 2, 2022

Removing a drive from a pool and then adding one later is fine. You just can't do a direct replacement; that's the part that's broken.
nukeman Posted February 16, 2022 Author

I sent one of the cache pool drives back to Samsung. They sent a new(?) refurbished replacement. I added the replacement drive to the pool, and Unraid ran a parity check and found no errors. Is that all I need to do before RMA'ing the other, original, cache drive? I've never done this procedure before and want to make sure I'm good to remove the other faulty drive. Is "no balance found" relevant?
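On the "no balance found" question: that message just means no balance operation is currently running, which is expected once the pool has settled. A few read-only commands (assuming the default /mnt/cache mount point) will show the pool's state without changing anything:

```shell
btrfs filesystem show /mnt/cache   # devices in the pool and space allocated on each
btrfs balance status /mnt/cache    # "No balance found on '/mnt/cache'" = nothing running
btrfs device stats /mnt/cache      # per-device read/write/corruption error counters
```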
JorgeB Posted February 17, 2022

8 hours ago, nukeman said: did a parity check and found no errors.

A parity check is for the array; for the pool you can run a scrub.
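A scrub reads every block in the pool and verifies its checksum against the btrfs metadata, which is the pool-level equivalent of a parity check. Assuming the default /mnt/cache mount point (this can also be started from the pool device's page in the Unraid web UI):

```shell
btrfs scrub start -B /mnt/cache   # -B stays in the foreground and prints a summary
btrfs scrub status /mnt/cache     # progress/result if started without -B
```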