Garbonzo Posted March 5, 2023 (edited)

(Posted to Discord earlier as well, but I have to go out to buy another spinning drive right now, so I also posted here in case I get responses.)

Hi guys, I just updated/migrated to new hardware, and all seemed well for a day or so, but last night things locked up/crashed. While trying to figure out what was going on, I found I may have missed a BIOS setting for advanced sleep on SATA 0 and SATA 1 of the ASRock X370 Taichi. I disabled it, but now the cache is not mounting, and I don't know how to do the equivalent of a chkdsk for btrfs, or whether that is even the right thing to do... my only experience with btrfs is my Unraid cache pools. Any help appreciated (even if it's just where I "should" put this question).

The cache was on SATA 0, so that seemed like the problem. The 7 spinning drives are on an HBA, but I'd like to keep the last spot there for another array drive... and at this point, that ain't exactly the problem anyway.

I am aware of the issues with disk5 specifically, and one or two others probably aren't great either. I just can't replace all of these shucked drives (and am in fact probably buying another one today). When I find the right deal on some larger, more robust drives, I will replace the parity and move those out to the array if they are not suspect. I already had some concerns about disk5, aside from the CRC errors that show on this and a couple of other drives, all related to a bad SATA > mini-SAS cable that has since been replaced. Either way, it had been re-precleared not long ago, but got disabled/emulated a few back. So I am going to try to get that drive pulled and start rebuilding tonight, but I can't really do anything until I get my cache drive sorted.

There are also a few other errors in my logs, mostly about things installed/uninstalled and hardware changes. I can probably google each of them individually and resolve them, but any insights there are also appreciated.

TIA, -g

Things that stuck out to me:

ezra-diagnostics-20230305-1305.zip

Edited March 5, 2023 by Garbonzo (added image)
JorgeB Posted March 6, 2023

Constant errors with disk5; check/replace cables and post new diags. As for the cache, see if clearing the log helps:

btrfs rescue zero-log /dev/sdb1

Before this issue btrfs was also detecting data corruption, so it's a good idea to run memtest.
Garbonzo Posted March 6, 2023 (edited)

Thank you so much, that got me back in business. I am replacing drive 5 and starting the rebuild this morning, and will post diags afterward if anything else creeps up. I need to learn to read these logs better.

I think I will order an NVMe drive today to replace that cache drive. It was something I wanted to do when I got this board, but I didn't want to spend the money until I had researched what drive to buy right now. I was also intending to replace the 16GB with 32GB of ECC if I could find a good deal on something this Ryzen 1700 would be OK with (and that will also work if I upgrade to a newer CPU when my VMs become more demanding).

The memory I have is non-ECC G.Skill Trident F4-3200C16D, and the BIOS is currently on auto, XMP profile 1. It is decent memory (if it checks out), and I know lots of people run non-ECC without issues (I guess), so I could just buy 2 more matching sticks, but I kinda hate that idea in theory (especially if the right ECC deal pops up on eBay).

Could the data corruption you mentioned be a memory timings/voltage issue? (Researching that in another tab.)

Anyway, thanks again for your help, it was most appreciated. -G

Edited March 6, 2023 by Garbonzo
Garbonzo Posted March 6, 2023 (edited)

12 hours ago, JorgeB said: Constant errors with disk5, check/replace cables and post new diags. As for cache, see if clearing the log helps: btrfs rescue zero-log /dev/sdb1 Before this issue btrfs was also detecting data corruption so a good idea to run memtest.

OK, so I spoke too soon. The array started, the cache drive showed up, and things appeared to be working. The dockers looked like they started fine, but I didn't try anything as I was walking out the door for a few hours. I just got back and was told Emby wasn't working. Things look like they are running, but any docker I click on and restart gives me an "Execution error: Server error" popup. So I looked at the log and see things like:

BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 93, rd 0, flush 0, corrupt 26467, gen 0

I have since shut down, updated the BIOS and double-checked my settings, moved the cache drive to my HBA controller, zeroed the btrfs log again, and rebooted, and I think I might be OK. I am starting the preclear on the replacement for disk5 (I forgot it was running before I shut down, but it was only about 6 hours in, so no biggie). This build has appeared stable before, though, and errors like the one above don't seem to be occurring now (most likely since moving the cache off the motherboard controller). I really need to decide on an NVMe drive to replace the cache with anyway...

Nope, that didn't do it either. It seemed OK for a bit, then all this started again. I am attaching a new diagnostics file in case anyone wants to take a look and see if anything jumps out.

ezra-diagnostics-20230306-1805.zip

As always, TIA! -g

Edited March 6, 2023 by Garbonzo (updated diags file and added image)
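For anyone reading logs like the one quoted in the post above, here is a minimal Python sketch of pulling the per-device counters out of that kind of line. The function name `parse_btrfs_errs` and the regular expression are illustrative assumptions written against the single line quoted in this thread, not an official parser; other kernel versions may word the message differently.

```python
import re

# Pattern written against the line quoted in this thread (an assumption,
# not a documented log format): "bdev <dev> errs: wr N, rd N, flush N, corrupt N, gen N"
STATS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return the device name and integer counters, or None if the line does not match."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    dev = fields.pop("dev")
    return {"dev": dev, **{name: int(value) for name, value in fields.items()}}


# The log line exactly as posted above.
line = (
    "BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: "
    "wr 93, rd 0, flush 0, corrupt 26467, gen 0"
)
stats = parse_btrfs_errs(line)
print(stats)
```

Here `wr`, `rd` and `flush` count I/O errors, while `corrupt` counts blocks that failed checksum verification, which is the number that points toward bad data rather than a failed read or write.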
SirReal63 Posted March 6, 2023

9 hours ago, Garbonzo said: The memory I have is non-ecc gskill trident F4-3200C16D and the bios is currently auto xmp profile 1

I believe the highest supported memory speed for that processor is 2666. Turning off XMP would be my first suggestion; the board may support slightly higher memory speeds, but the R7-1700 really doesn't. Trying to push memory speeds can cause issues.
Garbonzo Posted March 7, 2023

Thanks so much for your assistance. Yeah, since that last post I actually set it to 2133 and am running memtest now. It's only halfway through the first pass, but so far so good on that front. I hadn't actually enabled XMP, it was just set to auto, but I have now used the timings shown here as JEDEC #1 (except that last tCCD_L setting, which I didn't see in the BIOS; is there anything else it would be called?). But @JorgeB mentioned I was getting data corruption errors, so I needed to try backing them off anyway. If I get no errors after the first pass of memtest, I will probably boot to see where I am... hopefully good. Thanks again for the help.
SirReal63 Posted March 7, 2023

2133 is where I would put it; it should work out the timings itself. That was a good board in its day, but I am not sure I would build a mission-critical system based on it today.
Garbonzo Posted March 7, 2023 (edited)

Yeah, it's not "mission critical", but I don't know what else to do at the moment. The only things that have changed are the MB/CPU/RAM/PSU/case. I followed a spaceinvaderone video on moving to new hardware, and it seemed pretty straightforward. One or more of the steps in that video looked like they might not be needed anymore, but I wasn't sure, so I followed it step by step. I guess I need to make a new trial USB key tomorrow and see how this hardware does starting from scratch... I dunno. I am a lot tired and defeated at the moment, and I need to come at this with fresh eyes in the morning.

What is happening at the moment is that after a "lock-up" (or whatever tf is happening), the cache isn't mountable on reboot. So I do a btrfs rescue zero-log and reboot, and it mounts OK. The dockers spin up and seemingly work, then within about 10 minutes I see similar things to the image above... state: A, then EA, then EAL... and I'm borked.

Again, thanks in advance for any troubleshooting ideas. I will get back on this in the morning. -g

Edited March 7, 2023 by Garbonzo
SirReal63 Posted March 7, 2023

I have changed hardware 6 times (MB/CPU/RAM/case) and not once had any issues. If the system worked correctly prior to a swap and different hardware is causing an issue, I would suspect something bad in the new/different hardware. One of the many beautiful things about Unraid is that if the hardware works, so does the software.
JorgeB Posted March 7, 2023

You should back up and re-format the pool, ideally after correcting the RAM issue.
Garbonzo Posted March 7, 2023

Can I trust a backup made with the Community Apps plugin when the system is acting this way? I will boot it now and try to get through it; I think if I boot with Docker/VMs off I might be OK. I have another SATA SSD I can swap in, since I still haven't gotten the NVMe that will ultimately be the cache.

And the RAM: is this bad RAM or bad settings? I want ECC anyway, I just don't want to buy something off eBay that won't work right either, and I haven't found much good advice on that... but I may not be asking the right questions.
JorgeB Posted March 7, 2023

Difficult to say for sure; if there are RAM issues, basically anything can be affected.
Garbonzo Posted March 7, 2023 (edited)

Well, the backup is running now, so I will have that and the actual SSD, but I think a longer memory test may show some issues with this RAM. I only did a single pass yesterday, as I was trying to get it back online. I can't seem to find any unbuffered ECC DDR4 at a reasonable price, and I don't have any bench DDR4 to try out.

How did you see the memory corruption in my logs? (edit: probably this, huh? dev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 18421, gen 0) I need to know what I am looking for so I know when I am in the clear. I certainly don't want to start a rebuild of disk5 from parity until I am 100% sure the memory thing is clear.

It looks like I am going to have to chance non-ECC if I can't at least get something reasonable, but supply/demand seem to be in a bad spot with DDR4 ECC UDIMMs, and I couldn't afford/find a good fit for a server MB that would work for me (update: bought this). I am now questioning myself more than I like, because I am in a spot where I might have to move all this crap back into the Dell again... ugh.

P.S. I am in the process of restoring my appdata backup to the newly formatted SSD... will update.

I am knocking on wood, but that seems to have resolved it. I rebuilt the docker.img manually, everything seems to be working, and so far no errors are reported on the drive. That didn't exactly happen right away last time either, but I think it was the memory running too fast, and now that I have backed that off I am hoping it is good. If that ECC works (and is actually working as ECC should), it will have been a good spend; I will thoroughly test these sticks and maybe flip them for $30 or so and feel good about it.

So thanks to both of you for your help getting this fixed. I have a couple of other things I need to fix (MB temps not showing, and some fan-speed stuff), but I will use Google for that and post anything I can't sort out in a new post.

I wish I could run memtest while also preclearing this drive. (It is still in a USB case, so I suppose I could make a trial Unraid USB and do the preclear on another system while this one is running memtest.)

Thanks -g

Edited March 8, 2023 by Garbonzo (duh... the error + link; second update)
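On the "what am I looking for" question in the post above: besides the kernel log, `btrfs device stats <mountpoint>` prints the same five counters per device, and they are cumulative until reset with `btrfs device stats -z`. The Python sketch below flags any counter above zero in that output. The helper name `nonzero_counters` is hypothetical, and the sample text is constructed by hand from the numbers quoted in the post (it is not captured output from this system); `/mnt/cache` as the mount point would likewise be an assumption.

```python
import re

# One line of `btrfs device stats` output looks like:
#   [/dev/sdg1].corruption_errs  18421
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_counters(stats_text):
    """Return {(device, counter): value} for every counter above zero."""
    found = {}
    for raw in stats_text.splitlines():
        match = STAT_LINE.match(raw.strip())
        if match and int(match.group("value")) > 0:
            found[(match.group("dev"), match.group("counter"))] = int(match.group("value"))
    return found


# Hand-built sample using the counts from the post; not real captured output.
sample = """\
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  18421
[/dev/sdg1].generation_errs  0
"""
problems = nonzero_counters(sample)
print(problems or "all counters are zero")
```

A reasonable routine, under those assumptions, is to reset the counters after fixing the RAM settings and re-formatting the pool, then check that they stay at zero over the following days before trusting the system with a rebuild.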