cache HDD uncorrectable scrubbing errors, again


Hi there,

 

Some time ago I had issues with two of my cache drives.

 

 

My common cache, a SATA SSD, was reformatted and works great. The other cache, which I used as a disk for transfers (downloads etc.), was an NVMe drive that seemingly broke; it was not usable anymore after this incident. It showed no SMART errors, but was incredibly slow and basically bricked.

 

I replaced it with a 4TB HDD (btrfs), which worked great until today. I got some I/O errors while renaming files, and suddenly everything acted up. In some cases it was small things, like three parts of a 20-part archive being broken, but in another instance one folder lost all its files, even some that were weeks old. Other files, even newer ones, are totally fine.

 

I followed the old procedure and scrubbed the drive. I instantly got 74 uncorrectable errors, but none after that for the rest of the run (I attached a screenshot; it reads "aborted" but the scrub went through, no idea why). SMART values seem fine, as they did in the past. I'm not sure if I can even find out what exactly the issue is.
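
For anyone following along: one way to narrow down what kind of error a scrub hit is to look at the per-device counters printed by `btrfs device stats`. Below is a minimal Python sketch that parses that output and keeps only the non-zero counters. The mount point `/mnt/transfer` and the sample numbers are made up for illustration; they are not taken from the attached diagnostics.

```python
import re

# Matches lines such as "[/dev/sdc1].corruption_errs  3"
_STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Turn `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = _STAT_LINE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["name"]] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """Keep only the devices and counters that are above zero."""
    return {
        dev: {name: value for name, value in counters.items() if value > 0}
        for dev, counters in stats.items()
        if any(value > 0 for value in counters.values())
    }


# Illustrative sample only. Real output would come from something like
#   subprocess.run(["btrfs", "device", "stats", "/mnt/transfer"],
#                  capture_output=True, text=True).stdout
# where /mnt/transfer is a hypothetical mount point.
SAMPLE = """\
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  3
[/dev/sdc1].generation_errs  0
"""

if __name__ == "__main__":
    print(nonzero_counters(parse_device_stats(SAMPLE)))
```

Roughly speaking, a non-zero `corruption_errs` with zero read and write errors means the drive handed back data without complaining but the checksums did not match, which tends to point away from a plainly failing disk and towards something upstream of it (RAM, cable, controller). That is a rule of thumb, not a diagnosis.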

 

So, what do I do now? I can format the drive no problem, there's no big loss with that.

I don't include it in the array because that slows down downloads and un-zip operations a lot; the array itself runs flawlessly.

 

As mentioned, the old SSD was NVMe, while the new HDD is SATA (attached to an HBA). So I doubt it's a hardware defect common to both.

I ran Memtest as well; it passed without errors.

 

Any ideas what to do, and how to avoid this in the future?

unbraid scrub.png

lunas-diagnostics-20230123-1440.zip

1 hour ago, JorgeB said:

Run memtest. Lots of corruption was detected, which usually means bad RAM. Also, since there is uncorrectable metadata corruption, you will need to back up and re-format that filesystem, ideally after fixing the underlying problem.

As mentioned, I ran memtest and that went absolutely fine without errors for hours :(

I'm using ECC RAM, but not registered.

 

What comprehensive tests could I run to find the error, though? I mean, if the defect can corrupt my system to this extent, it shouldn't be too hard to find, I imagine? I can run Memtest for the whole night if that helps. I have no spare RAM to put into the system, and even if I did, it could take weeks until an error shows up again :/

 

Best,

Rick

Hi there,

 

At least the BIOS let me activate ECC, and unRAID reports it on the dashboard as well :o

 

@JorgeB I get no errors in Memtest with both installed, that's the main issue here :(

 

@philliphartmanjr Huh, is that so? My general cache (for Dockers etc.) is a 500GB SATA SSD, while this cache is a 4TB HDD. Both are btrfs. They are not in the same pool; they are individual caches. Basically I just want it outside my array for the reasons given above (mainly un-zipping, actually). Is that an "unsupported" setup?

 

Best,

Rick

Sounds like my concerns are not valid. Memtest normally won't find bad ECC RAM because ECC corrects the errors itself. I would take out half of your RAM, run the system, and see if the problem is fixed. If not, put in the other half and see if that fixes it. I have found that removing and adding RAM in halves makes the process much faster than testing one stick at a time. https://www.memtest86.com/ecc.htm

This link covers Memtest on ECC memory, if you want to check it out.

The issue is that this is not really a repeatable error. It's been nearly two months since I replaced the drive, and only now have issues come up.

 

If I understand the link properly, I should check whether the "Platform First Error Handling (PFEH)" option is enabled, right? I mean, disabling ECC altogether probably doesn't make much sense, right? I don't have the funds right now to get Memtest Pro for the ECC injection thing they mention :(

 

Best,

Rick

14 minutes ago, itimpi said:

Not quite sure I understand this?   You can download the latest memtest that can test ECC ram from memtest86.com for free.   Do you need the Pro feature?

Maybe I got it wrong. This is taken from @philliphartmanjr's link:

 

"Can I use MemTest86 inject ECC errors?
MemTest86 Pro Edition supports ECC injection if the CPU/memory controller chipset supports error injection and the feature is not locked by BIOS. See the current list of chipsets with ECC injection capability supported by MemTest86."

 

It's possibly a "special" thing it does? I usually use the Memtest that got installed to my unRAID Flash drive

 

Best,

Rick

OK, let me clear up my intention with the link. It was not to show you how to use Memtest to check your ECC. I posted the link because it shows that supported ECC systems have a logging function in the BIOS for reporting errors, and it mentions that the OS (both Windows and Linux) has a logging function too. My hope was that you could find one of those and see if it helps you find a bad stick of RAM.

 

Otherwise, I would pull the sticks out of your system, put them back in, relaunch Unraid, and see if your problem goes away. I do know that ECC memory can cause sporadic problems due to the nature of its self-correcting ability. The more RAM you have, the harder it is to find a bad stick. In my experience ECC sticks don't fail very often, but if you can either find a bad one or verify that they are all working, it will make it easier to move on to the next step. Hopefully your BIOS has a logging function.
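
On the Linux side, the logging mentioned above usually surfaces through the kernel's EDAC subsystem, which exposes corrected and uncorrected error counts per memory controller under `/sys/devices/system/edac/mc`. Here is a minimal Python sketch for reading them. It assumes an EDAC driver for the platform is actually loaded (for example `amd64_edac` on AMD); whether Unraid loads one on this board is not something this thread confirms.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": corrected, "ue": uncorrected}} from EDAC sysfs."""
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or ECC reporting not available
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            path = mc / filename
            if path.is_file():
                entry[key] = int(path.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for name, entry in result.items():
        print(name, entry)
```

A corrected-error count that keeps climbing would be a hint that a stick is marginal even though Memtest passes. An empty result only says that nothing is being reported, not that the RAM is healthy.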

Hm, I'm not sure the BIOS can log anything, but seemingly the article only talks about how Linux and Windows log them, and how Memtest shows them?

 

I have now disabled the function that was mentioned for AMD at the bottom of the page, and I also updated the BIOS to the newest version. I only have two sticks installed, but as far as I understand, the data on my drive is already corrupt, so there's not really a chance to get a new reaction out of unRAID?

 

Would it make sense to deactivate ECC altogether and run Memtest, as this would clearly show errors?

 

I just started Memtest now (ECC enabled, that one function disabled) and will go to bed. Just to see how it turns out after 10h or so.

 

Best,

Rick

I went through your logs. Have you tried disabling your cache drives to see if your system is normal? Whichever one is sdc1 keeps trying to move a file or something and is just sending nonstop errors, particularly with a .rar file for Avengers: Endgame. It could be a bad file, or, if that is your SSD, it could have a bad cell. Also, I don't know if this is necessary, but do you have the NerdPack plugin installed? It has rar and unrar. I am not sure whether Unraid would need that installed to read a .rar file; it is not something I have ever tried.
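
To get a quick overview of which device is producing the errors in a syslog, something like the Python sketch below can count btrfs kernel messages per device and severity. The log lines in it are invented to show the general shape of such messages (the hostname, inode numbers and checksums are placeholders); they are not copied from the diagnostics in this thread.

```python
import re
from collections import Counter

# Matches "BTRFS error (device sdc1)" as well as the newer
# "BTRFS error (device sdc1: state EA)" form.
_BTRFS_LINE = re.compile(
    r"BTRFS (?P<level>error|warning|critical) \(device (?P<dev>[^):\s]+)[^)]*\)"
)


def count_btrfs_messages(lines):
    """Count btrfs kernel messages keyed by (device, level)."""
    counts = Counter()
    for line in lines:
        match = _BTRFS_LINE.search(line)
        if match:
            counts[(match["dev"], match["level"])] += 1
    return counts


# Made-up example lines, for illustration only.
SAMPLE_LOG = """\
Jan 23 14:01:02 tower kernel: BTRFS warning (device sdc1): csum failed root 5 ino 257 off 0 csum 0x11111111 expected csum 0x22222222 mirror 1
Jan 23 14:01:02 tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Jan 23 14:01:05 tower kernel: BTRFS warning (device sdc1): csum failed root 5 ino 257 off 4096 csum 0x33333333 expected csum 0x44444444 mirror 1
Jan 23 14:02:00 tower kernel: some unrelated message
"""

if __name__ == "__main__":
    for (dev, level), n in sorted(count_btrfs_messages(SAMPLE_LOG.splitlines()).items()):
        print(f"{dev:8} {level:8} {n}")
```

If all messages land on a single device, that at least confirms the rest of the pools are quiet, which matches what is described in this thread.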

Good morning @philliphartmanjr

 

First, I attached a photo of the Memtest result. It doesn't look so bad for now :/

 

Drive sdc is exactly the transfer HDD I have trouble with. Actually, this RAR archive is the one where I first noticed the errors :) I didn't try to un-rar it on unRAID though. I wanted to copy it to my PC over SMB, and I tried the copy quite a few times (I thought there was some random failure and it might work on a 2nd or 3rd attempt).
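
Rather than retrying a copy and hoping, it can help to compare checksums of the source file and the copy. A minimal Python sketch, with hypothetical file paths, could look like this:

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(source, copy):
    """Return True when both files have identical SHA-256 digests."""
    return sha256_of(source) == sha256_of(copy)


if __name__ == "__main__":
    import sys

    # Usage (paths are placeholders): python compare.py /mnt/transfer/file.rar /tmp/file.rar
    if len(sys.argv) == 3:
        print("identical" if compare_copies(sys.argv[1], sys.argv[2]) else "DIFFERENT")
```

On btrfs, reading a block whose checksum does not match normally fails with an I/O error instead of returning bad data, so `sha256_of` raising `OSError` on the source is itself a useful signal that the file is damaged on disk.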

I think I uninstalled NerdPack because it was EOL or something? At least I don't have it installed anymore.

 

I just tried to back up some data (as I said, there's nothing inherently important on there). Funnily enough, there's some old data that I can't delete. I mean, I can start the delete, and Windows shows me the usual delete dialogue listing all the files etc., but the folder doesn't vanish and all files inside are still intact. If I try with the Krusader Docker, it doesn't do anything when I hit OK. Very strange.

 

Anyway, I'm still unsure where to go from here. It seems like RAM is fine. I will format the disk once the copy is done and see if some extensive SMART tests show something :(

 

Best,

Rick

0CLxxfNNTNiQduEbaLs5hQ.jpg

3 hours ago, philliphartmanjr said:

I would either unplug the bad drives or disable them then reboot unraid and see if it is stable then

I think I explained it somewhat wrong. unRAID works completely fine.

One drive, or rather its data, has the issue now. The reason I'm anxious is that it happened before (a different physical drive of course, but the same use case/share). When I unplug it, nothing improves; I just can't work properly anymore.

 

Sure, I can replace it. But I'm kind of afraid that there's something off, and that this was not just some funny coincidence :(

Ah, OK. Unless there is something that Unraid needs for .rar files (which is possible), the only possible problems I can think of would be a bad drive, bad RAM, or an issue with the CPU. You are definitely getting logic errors. You have tested your RAM and drive, and they have checked out as far as we can tell. I think we can rule out the motherboard being the issue, because unregistered, unbuffered RAM should be talking straight to the CPU. Unfortunately, since Ryzen's support for ECC memory is a silent support, it is hard to check whether ECC is working properly. Honestly, the only thing I can think of would be to try regular RAM DIMMs and see if the problem still persists. If yes, then it is probably either your drive, with the error not being picked up (which happens sometimes), or the logic in the CPU handling ECC not working correctly. If the problem goes away, then the issue was somewhere in the ECC chain, or a bad RAM stick that was not detected.
