
Multiple issues, root cause unknown



Hey guys, I'm new to Unraid and have been loving it so far, but for some reason my server started losing it yesterday. There were no major changes; I had NZB downloads running and checked my panel to find 2 drives offline with read errors. I tried a restart through the Unraid dashboard, but it never came back up and the CLI said “bzmodules checksum error”.

 

After some googling I copied the config file off my flash drive, remade the drive with the Unraid tool, and re-added my config file, but got the same issue. I also tried multiple USB ports, both motherboard and front panel, both USB 2 and 3, with no luck. Next I remade the flash drive from a backup I had from yesterday morning, when the server was fully functional, and it booted with no issues. It prompted me to run a parity check, so I stopped almost all my dockers, started the check, then went to bed. This morning I’ve now got read errors on 1 drive, while the 2 that previously had errors are running fine.

 

I’m running Unraid 6.12.6 on a Ryzen 3700X with 64 GB of RAM. The array is 6x 8 TB Exos drives from Rhinotech connected through an Adaptec HBA, with 2 of the drives as parity.

 

Logs attached. I would be incredibly grateful for any advice!

 

http://cloud.tapatalk.com/s/65afba6ba2ed2/tower-diagnostics-20240123-0754.zip

Link to comment
8 minutes ago, Dr_Rak said:

I copied the config file off my flash drive,

Did you mean this?  You should have a config folder not a file.

 

8 minutes ago, Dr_Rak said:

“bzmodules checksum error”.

This indicates a problem reading some of the archive files off the flash drive.  Normally easiest fix is simply to rewrite all the bz* type files on the flash drive.   If this does not work it could mean that the flash drive is starting to fail.
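
Before rewriting the files, it can help to confirm which bz* file is actually failing. The following is a minimal sketch, not an official Unraid tool. It assumes the flash drive is mounted at /boot and that each bz* file sits next to a `.sha256` sidecar file whose first whitespace-separated token is the expected hex digest; if your flash drive is laid out differently, adjust accordingly. The function name `check_bz_files` is made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bz* images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bz_files(flash_root="/boot"):
    """Return {filename: matches} for every bz* file that has a .sha256 sidecar.

    Assumption: the sidecar's first token is the expected hex digest.
    """
    results = {}
    root = Path(flash_root)
    if not root.is_dir():
        return results  # flash not mounted here; nothing to check
    for sidecar in sorted(root.glob("bz*.sha256")):
        target = sidecar.with_suffix("")  # bzmodules.sha256 -> bzmodules
        tokens = sidecar.read_text().split()
        if not target.is_file() or not tokens:
            continue
        results[target.name] = sha256_of(target) == tokens[0].lower()
    return results


if __name__ == "__main__":
    for name, ok in check_bz_files().items():
        print(f"{name}: {'OK' if ok else 'CHECKSUM MISMATCH'}")
```

If a file mismatches on one read and passes on the next, that points more toward a flaky flash drive or USB port than toward a bad copy of the file.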

Link to comment
1 minute ago, itimpi said:

Did you mean this?  You should have a config folder not a file.

 

Sorry, I did mean config FOLDER.

1 minute ago, itimpi said:

This indicates a problem reading some of the archive files off the flash drive.  Normally easiest fix is simply to rewrite all the bz* type files on the flash drive.   If this does not work it could mean that the flash drive is starting to fail.

That was my concern, but it seems like it only happened when I was using the "latest" config folder, even with fresh bz* files from the unraid flash drive tool. When I copied over a full backup from earlier that day (config + bz* files, etc) it booted first time without issues.

Link to comment
1 minute ago, Dr_Rak said:

 

Sorry, I did mean config FOLDER.

That was my concern, but it seems like it only happened when I was using the "latest" config folder, even with fresh bz* files from the unraid flash drive tool. When I copied over a full backup from earlier that day (config + bz* files, etc) it booted first time without issues.

Messages about the bz* files are independent of the config as they are loaded and checked before the later stages of the boot process when the 'config' folder starts coming into play.

Link to comment
Messages about the bz* files are independent of the config as they are loaded and checked before the later stages of the boot process when the 'config' folder starts coming into play.

Could a failing flash drive trigger read error warnings in multiple different drives in my array?

It seems like trying a different flash drive might be the right move. I see in the guide that the maximum recommended size is 32GB, but I can’t find many reputable drives of that capacity (and my existing flash is 64GB).

I know it’s a waste of space but the only other flash drive I have is a Samsung BAR 256GB flash drive, could I use that as a temporary solution and replace with a smaller capacity (looks like 64GB is as low as they go) of the same model if that solves the issue? Samsung is about the only trustable brand I can find on Amazon, and the BAR has a full metal body which I understand can help with heat dissipation.
Link to comment

Update for anyone who may come across this thread in the future:

 

It occurred to me that my issues started right after I finished some CPU intensive tasks, so I wondered if Ryzen C-states might be the culprit, and that the flash drive corruption was a byproduct of dirty shutdowns (maybe CPU stays in low power mode -> shutdown is forced due to time limit?).


C-state TL;DR: Ryzen has 3 C-states (power management states a core can have under low/no load, aimed at decreasing idle power draw):

  • C0 = active
  • C1 = halted
  • C6 = deep sleep

There have been mixed reports of Linux kernels handling C-states, particularly C6, very poorly. The reports seem somewhat hardware dependent, and many have alleged that various BIOS updates may have resolved the issue, but there is no real consensus. For reference, I am running a 3700X on a Gigabyte B550 Aorus Pro AC using the latest BIOS (F16h at the time of writing, with AMD AGESA V2 1.2.0.B).

 

I changed my BIOS settings as follows:

  • Global C-states: Enabled -> Enabled (left unchanged, to test whether I can keep the C1 "halted" state for idle power savings without compromising stability)
  • Power supply idle control: Low idle current -> Typical idle current (this change disables the C6 "deep sleep" state which seems to be the most problematic with Linux)
  • AMD Cool'n'Quiet: Enabled -> Disabled (undervolts/underclocks the CPU during decreased utilization, but some have indicated that it does not play well with Linux)
  • CPPC: Enabled -> Disabled (CPPC helps the OS prioritize higher-performance cores for task scheduling, but some reports suggest instability with Linux)
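
For anyone who wants to check whether BIOS changes like these actually affect which idle states the cores enter, here is a rough sketch that reads the Linux cpuidle counters from sysfs. It assumes the kernel exposes `/sys/devices/system/cpu/cpu0/cpuidle/state*/` with `name`, `usage`, and `time` files, which is the standard cpuidle interface but may be absent depending on the idle driver. The state names the kernel reports are its own and may not match the C0/C1/C6 labels above. The function name `read_cstate_stats` is just for illustration.

```python
from pathlib import Path


def read_cstate_stats(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return a list of (name, entry_count, total_microseconds) per idle state.

    Returns an empty list if the cpuidle directory is not present.
    """
    stats = []
    cpuidle = Path(cpu_dir) / "cpuidle"
    if not cpuidle.is_dir():
        return stats
    # Sort numerically so state10 does not land before state2.
    states = sorted(cpuidle.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state in states:
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())   # times the state was entered
        time_us = int((state / "time").read_text())  # total residency in microseconds
        stats.append((name, usage, time_us))
    return stats


if __name__ == "__main__":
    for name, usage, time_us in read_cstate_stats():
        print(f"{name}: entered {usage} times, {time_us / 1e6:.1f} s total")
```

Comparing the output before and after a BIOS change, with the server left idle for a similar length of time, should show whether the deepest state is still being entered.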

I immediately noticed that my server became more responsive and boot time was significantly quicker. I will continue to update once I have more stability information to share.

Link to comment
