Unraid becomes unavailable once a week or so, followed by disabled drive



I have been using Unraid for a few years now, and was on 6.8.0 without any issues before, using an older Intel board and an i7-3990.  Recently I replaced that board and CPU with an Asus B450 board and a Ryzen 3 3200G.

 

Since then, once a week or so, sometimes longer, the system becomes unavailable on the network.  It cannot be pinged, and my router shows it as offline (the system is still powered on and there weren't any power outages or brownouts).  If I turn the system off by pressing the power button and letting Unraid do its thing to shut down, then once the system comes back up I always end up with the same data drive (same drive each time) in a disabled status.

schumachertower-diagnostics-20200310-2115.zip

Edited by kevschu
clarify the disabled drive is a data drive.
Link to comment

Thank you for the reply.  Didn't know the 3200G was considered a 2nd gen chip.  

 

Going through the link that was shared, I will work on the following to see if it brings stability and resolves my issue.

 

1. Set up a syslog server, just to make sure the diagnostic data is saved somewhere safe across reboots.

2. Check "Power Supply Idle Control" in the BIOS; I don't believe it is set to the suggested "Typical Current Idle".

3. Swap the memory out, as I am using DDR4 3200 and the CPU only supports up to 2933.  Maybe swap it for 2400 or 2666; 2933 seems niche based on the options I see available on, say, Newegg.

 

Am I missing anything else from that article?  There weren't ever any errors listed on the drive that keeps showing up as Disabled, and I have copied the data off of that drive onto other drives so I am not worried about cloning it.

Link to comment
12 minutes ago, kevschu said:

Swap the memory out as I am using DDR4 3200, and the CPU only supports up to 2933.

Typically speed ratings are a maximum, so your current RAM, if it's on the approved parts list for that motherboard, should be fine running at the speed called for by the CPU combined with the specific layout of your RAM.

 

There should be no need for different sticks, just run them at the approved speed. It's like getting supercar rated tires and putting them on a sedan. The only issue is that you probably overpaid a little because you will never run them at top speed.

Link to comment
11 hours ago, kevschu said:

If I turn the system off by pressing the power button and letting Unraid do its thing to turn off, once the system comes back up, I always end up having the same data drive (same drive each time) in a disabled status.

Did you rebuild the disabled drive? It won't get enabled unless you rebuild it (recommended) or set a New Config and rebuild parity instead.

Link to comment

Hmm, I would need clarification on "reboot on purpose".

 

If I:

1. rebuild the drive

2. start using it again

3. gracefully reboot using the unraid UI

4. system comes back up and the drive is still fine.

 

If:

1. Unraid becomes unavailable

2. power device off using physical power button on system

3. power device on

4. system comes back up and the drive is disabled.

Link to comment

Alright, so, that didn't take long for it to become unresponsive this time.  I had to hard power it down again, but I think adjusting the "shutdown timeout" in disk settings helped because the disk that usually shows up as disabled wasn't disabled when it came back online, this time.  I do have the syslog on the syslog server though.  Is there a way for me to anonymize it, or will running the diagnostic collector grab it from that location and do that for me?

Link to comment
25 minutes ago, kevschu said:

I do have the syslog on the syslog server though.  Is there a way for me to anonymize it, or will running the diagnostic collector grab it from that location and do that for me?

Diagnostics does not grab the syslog from the syslog server folder; the syslog it includes only covers the time since the last reboot. You'd need to upload the file from the syslog folder yourself. If you need to anonymize it (I don't think you do), open the text file, press CTRL+F, search for the term(s) you want to delete or replace, and do so.
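If doing that by hand gets tedious, a small script could do the search-and-replace for you. This is only a rough sketch in Python, not anything Unraid ships: the `anonymize` helper, the replacement strings and the regexes are my own assumptions, and the IPv4 pattern may also catch anything that merely looks like an address (a four-part version number, for example).

```python
import re

# Six colon-separated hex pairs, e.g. aa:bb:cc:dd:ee:ff
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")
# Four dot-separated groups of 1-3 digits, e.g. 192.168.0.3
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize(text, hostname=None, placeholder="Tower"):
    """Mask MAC addresses, IPv4 addresses and, optionally, a hostname."""
    text = MAC.sub("XX:XX:XX:XX:XX:XX", text)
    text = IPV4.sub("x.x.x.x", text)
    if hostname:
        # Case-insensitive so "schumachertower" in file names is caught too.
        text = re.sub(re.escape(hostname), placeholder, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    sample = "Mar 12 16:34:37 SchumacherTower kernel: eth0 192.168.0.3 up"
    print(anonymize(sample, hostname="SchumacherTower"))
```

To clean a whole file, read it into a string, pass it through `anonymize`, and write the result to a new file so the original log stays untouched.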

 

Might as well include diagnostics too....

 

Finally, looking at JB's link above, did you read all of this about Global C-state Control? (It's a link within the first link.)

Edited by Dissones4U
typo and added link
Link to comment

I looked through the thread.  I've ensured that "Global C-State" is disabled, and that the PSU Idle setting is set to typical.

 

I've also added "rcu_nocbs=0-3" to the syslinux configuration. 
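For what it's worth, one way to confirm the parameter was actually picked up after a reboot is to look at the kernel command line in `/proc/cmdline`. A minimal Python sketch, assuming a Linux system; `kernel_param` is a made-up helper name, not an Unraid tool.

```python
from pathlib import Path


def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line.

    Returns "" for a bare flag without a value and None if it is absent.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        value = kernel_param(proc.read_text(), "rcu_nocbs")
        if value is None:
            print("rcu_nocbs is not set on the running kernel")
        else:
            print(f"rcu_nocbs={value}")
```

If it reports the parameter as not set, the syslinux change was probably not saved, or the system has not been rebooted since.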

 

The only change this time will be the rcu_nocbs setting, as the two power settings were already configured during this last lockup.

 

Attached are the syslog and new diags.

 

This portion of the syslog seems important..

 

 

schumachertower-diagnostics-20200312-2016.zip syslog-192.168.0.3.log

Edited by kevschu
possibly relevant information from log
Link to comment

Still SATA controller problems:

Mar 12 16:34:37 SchumacherTower kernel: ahci 0000:02:00.1: AHCI controller unavailable!
Mar 12 16:34:38 SchumacherTower kernel: ata10: failed to resume link (SControl FFFFFFFF)
Mar 12 16:34:38 SchumacherTower kernel: ata10: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 12 16:34:43 SchumacherTower kernel: ata10: hard resetting link
Mar 12 16:34:43 SchumacherTower kernel: ahci 0000:02:00.1: AHCI controller unavailable!
Mar 12 16:34:44 SchumacherTower kernel: ata10: failed to resume link (SControl FFFFFFFF)
Mar 12 16:34:44 SchumacherTower kernel: ata10: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 12 16:34:44 SchumacherTower kernel: ata10: limiting SATA link speed to <unknown>
Mar 12 16:34:49 SchumacherTower kernel: ata10: hard resetting link
Mar 12 16:34:49 SchumacherTower kernel: ahci 0000:02:00.1: AHCI controller unavailable!
Mar 12 16:34:50 SchumacherTower kernel: ata10: failed to resume link (SControl FFFFFFFF)
Mar 12 16:34:50 SchumacherTower kernel: ata10: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 12 16:34:50 SchumacherTower kernel: ata10.00: disabled
...
Mar 12 16:35:05 SchumacherTower kernel: ata9.00: disabled
...
Mar 12 16:35:59 SchumacherTower kernel: ata6.00: disabled
...
Mar 12 16:35:59 SchumacherTower kernel: ata5.00: disabled

 

Multiple disks are dropping offline, and there are also several NIC-related errors. If you can, I would try a different board; that one appears to have issues, either actual hardware problems or compatibility issues with Linux.
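To pull the same picture out of a longer syslog, something like the following could tally the affected ports. It is a rough Python sketch, assuming the log lines look like the excerpt above; `summarize` and the two regexes are illustrative, not part of any Unraid tooling.

```python
import re
from collections import Counter

# e.g. "kernel: ata10.00: disabled" -> captures "ata10"
DISABLED = re.compile(r"kernel: (ata\d+)\.\d+: disabled")
# e.g. "kernel: ahci 0000:02:00.1: AHCI controller unavailable!"
AHCI_DOWN = re.compile(r"kernel: ahci (\S+): AHCI controller unavailable!")


def summarize(lines):
    """Return (ATA ports disabled, in order seen; error count per AHCI controller)."""
    disabled = []
    controllers = Counter()
    for line in lines:
        match = DISABLED.search(line)
        if match:
            if match.group(1) not in disabled:
                disabled.append(match.group(1))
            continue
        match = AHCI_DOWN.search(line)
        if match:
            controllers[match.group(1)] += 1
    return disabled, controllers


if __name__ == "__main__":
    sample = [
        "Mar 12 16:34:49 SchumacherTower kernel: ahci 0000:02:00.1: AHCI controller unavailable!",
        "Mar 12 16:34:50 SchumacherTower kernel: ata10.00: disabled",
        "Mar 12 16:35:05 SchumacherTower kernel: ata9.00: disabled",
    ]
    ports, errors = summarize(sample)
    print("disabled ports:", ports)
    print("controller errors:", dict(errors))
```

Several ports going down behind the same PCI address points at the controller (or the board) rather than at any single disk.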

 

 

 

Link to comment
  • 1 month later...
On 3/17/2020 at 1:20 AM, kevschu said:

Well, I replaced the Asus Prime B450M-A/CSM motherboard with an ASRock X570M Pro and things have been running solid for the last 3 days.

I have the same Asus board with a 2200G, and see the same SATA/network drops, but only under heavy sustained CPU load.

Interesting that a mobo change fixed this - I wonder if it's the chipset or the manufacturer that improved things.

Link to comment
