Unmountable: wrong or no file system (cache drive) (probably not drive failure)


Garbonzo


(posted to discord earlier as well, but I have to go out to buy another spinning drive right now, so I also posted here in case I get responses)


hi guys, I just updated/migrated to new hardware, and all seemed well for a day or so, but last night things locked up/crashed. While trying to figure out what was going on, I found I may have missed a bios setting for advanced sleep on sata 0 and sata 1 of the asrock x370 taichi. I disabled it, but now the cache isn't mounting, and I don't know how to do the equivalent of a chkdsk for btrfs, or whether that is even the right thing to do... my only experience with btrfs is my unraid cache pools. Any help appreciated (even if it's just where I "should" put this question)


anyway, the cache was on sata0, so that seemed like the problem... the 7 spinning drives are on an hba, but I'd like to keep the last spot there for another drive for the array... and at this point that ain't exactly the problem anyway.

I am aware of the issues with disk5 specifically, and one or two others probably aren't great either. I just can't replace all of these shucked drives (and am in fact probably buying another one today). When I find the right deal on some larger, more robust drives, I will replace the parity and move those out to the array if they are not suspect. I already had some concerns about disk5 (aside from the crc redundancy errors that show on this and a couple of other drives, all related to a bad sata > mini-sas cable that has since been replaced). Either way, it had been re-precleared not long ago, but got disabled/emulated a few back... so I am gonna try to get it pulled and start rebuilding tonight, but I can't really do anything until I get my cache drive sorted.

There are also a few other errors in my logs, mostly re: things installed/uninstalled and hardware changes. I can probably google them individually and resolve them, but any insights there are also appreciated.

TIA,

-g

things that stuck out to me:

image.thumb.png.a4faab7a280a123050f2a7e1d129dcc3.png

ezra-diagnostics-20230305-1305.zip


Thank you so much.. that got me back in business... I am replacing drive 5 and starting the rebuild this morning, will post diags afterward if anything else creeps up.... I need to learn to read these logs better.  I think I will order an nvme drive today to replace that cache drive... it was something I wanted to do when I got this board... but didn't want to spend the money until I researched what drive to buy right now... also I was intending to replace the 16gb with 32gb of ECC if I could find a good deal on something that this ryzen 1700 would be ok with (and also will work if I upgrade to a newer cpu if my vm's become more demanding) 
The memory I have is non-ecc gskill trident F4-3200C16D and the bios is currently auto xmp profile 1. It is decent memory (if it checks out) and I know lots of people run non-ecc without issues (I guess) so I could just buy 2 more matching sticks, but I kinda hate that idea in theory (especially if the right ecc deal pops up on ebay ;) )
Could the data corruption you mentioned be caused by memory timings/voltages? (researching that in another tab)

 

Anyway, thanks again for your help...  it was most appreciated. 

-G

12 hours ago, JorgeB said:

Constant errors with disk5, check/replace cables and post new diags.

 

As for cache, see if clearing the log helps:

 

btrfs rescue zero-log /dev/sdb1

 

Before this issue btrfs was also detecting data corruption so a good idea to run memtest.
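
For anyone else looking for the "chkdsk for btrfs" asked about earlier in the thread, here is a minimal sketch of the usual non-destructive checks. It is wrapped in Python so that the commands are only printed, not run. The device /dev/sdb1 is taken from the command above; the mount point /mnt/cache and the helper names offline_check and online_checks are assumptions for illustration only.

```python
import shlex


def offline_check(device):
    """argv for a read-only consistency check; run it with the filesystem unmounted."""
    return ["btrfs", "check", "--readonly", device]


def online_checks(mount_point):
    """argv lists for checks that need the filesystem to be mounted."""
    return [
        # Per-device error counters (write, read, flush, corruption, generation).
        ["btrfs", "device", "stats", mount_point],
        # Re-read data and metadata and verify checksums; -B keeps it in the foreground.
        ["btrfs", "scrub", "start", "-B", mount_point],
    ]


# /dev/sdb1 is from the post above; /mnt/cache is an assumed mount point.
for argv in [offline_check("/dev/sdb1")] + online_checks("/mnt/cache"):
    print(shlex.join(argv))
```

Unlike these, btrfs rescue zero-log writes to the device (it clears the log tree), so it is a repair step rather than a check.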

Ok so I spoke too soon. 

It started and the cache drive showed up and things appeared to be working... dockers looked like they started fine, but I didn't try anything as I was walking out the door for a few hours. I just got back and was told Emby wasn't working, so I came in here and things look like they are running, but any docker I click on and restart gives me an "Execution error: Server error" popup. So I looked at the log and see things like:

BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 93, rd 0, flush 0, corrupt 26467, gen 0
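
The counters in a line like the one above can be pulled out with a small parser, which makes it easier to spot which of them is non-zero. This is only a sketch: the function name parse_btrfs_errs is made up for illustration, and the pattern is based solely on the line quoted in this post. As far as I know the fields are write errors, read errors, flush errors, corruption (checksum) errors and generation mismatches.

```python
import re

# Matches the "bdev ... errs:" part of a btrfs kernel log line.
ERRS = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)


def parse_btrfs_errs(line):
    """Return the device and its error counters, or None if the line does not match."""
    match = ERRS.search(line)
    if match is None:
        return None
    device, *counts = match.groups()
    names = ["wr", "rd", "flush", "corrupt", "gen"]
    result = {"device": device}
    result.update({name: int(value) for name, value in zip(names, counts)})
    return result


line = ("BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: "
        "wr 93, rd 0, flush 0, corrupt 26467, gen 0")
print(parse_btrfs_errs(line))
```

In the quoted line the wr and corrupt counters are the ones that are non-zero.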

 

I have since shut down, updated the bios and double-checked my settings, moved the cache drive to my hba controller, zeroed the btrfs log again, and rebooted, and I think I might be ok... I am starting the pre-clear on the replacement for disk5 (forgot it was running before I shut down... it was only like 6hrs in, so no biggie). That said, this build has appeared stable before... but errors like the one above don't seem to be occurring now (most likely since moving the cache off of the mb controller). I really need to decide on an nvme drive to replace the cache with anyway... :D

Nope, that didn't do it either... seemed ok for a bit, then all this started again..  I am attaching a new diagnostic in case anyone wants to take a look and see if anything jumps out. 
image.thumb.png.2829e053ebb33b0d2f989e8f9d0529a0.png

ezra-diagnostics-20230306-1805.zip

As always, TiA!

-g

9 hours ago, Garbonzo said:

The memory I have is non-ecc gskill trident F4-3200C16D and the bios is currently auto xmp profile 1

 

I believe the highest supported memory speed for that processor is 2666. Turning off XMP would be my first suggestion; the board may support slightly higher memory speeds, but the R7-1700 really doesn't. Trying to push memory speeds can cause issues.


Thanks so much for your assistance.
Yeah, since that last post I actually set it to 2133 and am running memtest now. It's only halfway through the first pass, but so far so good on that front.
I hadn't actually enabled xmp, it was just set to auto, but I have now used the timings shown here as JEDEC #1 (except that last tCCD_L setting, which I didn't see in the bios; is there anything else it would be called?). @JorgeB mentioned I was getting data corruption errors, so I needed to try backing them off anyway. If I get no errors after the first pass of memtest I will probably boot to see where I am... hopefully good. thanks again for the help.

IMG20230306182331.thumb.jpg.54e1a74cc1f7cfef82315201951e7d42.jpg


yeah, it's not "mission critical" but I don't know what else to do at the moment.

the things that have changed are just the mb/cpu/ram/psu/case. I followed a spaceinvaderone video on moving to new hardware... it seemed pretty straightforward.

I guess I need to make a new trial USB key tomorrow and see how this hardware does starting from scratch... I dunno... I am a lot tired and defeated at the moment. I need to come at this with fresh eyes in the morning... one or more of the steps in that spaceinvaderone video looked like they may not be needed anymore, but I wasn't sure so I followed it step-by-step...

What is happening at the moment: after a "lock-up" (or whatever tf is happening), the cache isn't mountable on reboot... so I do a btrfs rescue zero-log and then reboot and it mounts ok... dockers spin up and seemingly work... then within like 10 mins I see similar things to the image above... state: A, then EA, then EAL... and I'm borked.

 

again thanks in advance for any troubleshooting ideas. I will get back on this in the morning. 

-g

 


I have changed hardware 6 times (MB/CPU/RAM/case) and not once had any issues. If the system worked correctly prior to a swap and different hardware is causing an issue, I would suspect something bad in the new/different hardware. One of the many beautiful things about Unraid is that if the hardware works, so does the software.


can I trust a backup made with the community apps plugin when the system is acting this way? I will boot it now and try to get through it; I think if I boot with docker/vms off I might be ok.
I have another sata ssd I can swap in, since I still haven't gotten the nvme that will ultimately be for cache. And the ram: is this bad ram or bad settings? I want ecc anyway, I just don't want to buy something off ebay that won't work right either, and I haven't found much good advice on that... but I may not be asking the right questions


well, the backup is running now, so I will have that and the actual ssd, but I think a longer memory test may show some issues with this ram. I only did a single pass yesterday, as I was trying to get it back online... I can't seem to find any unbuffered ECC DDR4 at a reasonable price, and I don't have any bench DDR4 to try out. How did you see the memory corruption in my logs? (edit: probably this, huh? dev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 18421, gen 0) I need to know what I am looking for so I know when I am in the clear... I certainly don't want to start a rebuild of disk5 from parity until I am 100% sure the memory thing is clear. It looks like I am going to have to chance non-ECC if I can't at least get something reasonable... supply/demand seems to be in a bad spot with DDR4 ECC UDIMMs, and I couldn't afford/find a good fit for a server mb that would work for me (update: bought this)... I am now questioning myself more than I like, because I am in a spot where I might have to move all this crap back into the dell again.. ugh.
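
On the "what am I looking for" question, one way to tell whether things are in the clear is to compare two snapshots of the btrfs device stats output and see whether any counter has grown. The sketch below assumes the usual "[device].counter_name value" layout of that output; the sample text is made up (only the 18421 figure comes from the log line above), and parse_device_stats and new_errors are hypothetical helper names.

```python
import re

STAT_LINE = re.compile(r"\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)")


def parse_device_stats(text):
    """Parse `btrfs device stats` style output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["name"]] = int(match["value"])
    return stats


def new_errors(before, after):
    """Return {(device, counter): increase} for every counter that grew."""
    grown = {}
    for dev, counters in after.items():
        for name, value in counters.items():
            old = before.get(dev, {}).get(name, 0)
            if value > old:
                grown[(dev, name)] = value - old
    return grown


# Illustrative snapshots, not real output from this server.
earlier = parse_device_stats("""
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  18421
[/dev/sdg1].generation_errs  0
""")
later = parse_device_stats("""
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  18500
[/dev/sdg1].generation_errs  0
""")
print(new_errors(earlier, later))
```

These counters are cumulative, so an old non-zero value is not by itself a new problem; what matters is whether it keeps climbing after the memory settings were backed off. An empty result from new_errors over a day or two of normal use would be the encouraging sign.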

 

p.s. in the process of restoring my appdata backup to the newly formatted ssd... will update. I am knocking on wood, but that seems to have resolved it... I rebuilt the docker.img manually and everything seems to be working, and so far no errors reported on the drive. The errors didn't exactly show up right away last time either, but I think it was the memory running too fast, and now that I have backed that off I am hoping it's good. If that ecc works (and is actually working as ecc should), it will have been a good spend; I will thoroughly test these and maybe flip them for $30 or so and feel good about it.

So thanks to both of you for your help getting this fixed. I have a couple other things I need to fix, mb temps not showing, and some fan-speed stuff.... But I will use google for that and post anything I can't sort in a new post... I wish I could run memtest while also preclearing this drive (it is still in a usb case, I suppose I could make a trial unraid usb and do the pre-clear on another system while this one is running memtest).

 

Thanks

-g

