[Solved] v6.8 - USB Boot Drive Not Recognized on Startup, Works When Moved to Other Port?



So I recently got my first server up and running (for Plex, Sonarr, Radarr), and it's been working phenomenally for the last ~2 weeks, apart from one particular quirk: every so often (three times total now), I'll try to log into the server and it'll be unresponsive (can't ping it or load the login page). The first two times this happened, I rebooted the machine and still had the same problem - when I connected a monitor to the Unraid box, the BIOS simply wasn't detecting the USB drive. So I moved the USB drive to another port, rebooted, and the server came up perfectly fine like nothing had happened. This last time, simply rebooting got it back up; I didn't need to relocate the drive. So my question is: how can I figure out what is going on here? Some thoughts:

 

1) Where can I find the relevant log files to see whether some shutdown event is occurring? (This is kind of a "how do I troubleshoot this type of thing in general" question.)

 

2) Is there a preferred USB drive size, brand, etc.? I'm using an ADATA UV128 16GB USB 3.0 drive, currently plugged into the USB 3.0 port on the front of my case. Previously I had it plugged into a rear I/O-shield USB 3.0 port, and then internally via a female-USB-to-mobo-header cable; both worked initially and then had to be moved once the BIOS stopped recognizing them.

 

Thanks guys, loving Unraid and the community so far!!

Link to comment

The config transfer to a new flash drive is contingent on a paid license; the trial version requires setting things up again, for obvious reasons.

 

It's not obvious from your posts whether you have a paid license yet, so I wanted to point that out.

 

Also, since the flash hardware GUID is linked to the license, when you change flash drives the license must be reissued to activate the new drive and deactivate the old one. This is an automated process through Limetech's servers, with a one-year restriction. If you need to change flash drives sooner than that, you will need to contact them and ask for a manual reissue.

Link to comment

Ahh yeah, I hadn't acquired a license yet - I learned that when I went through the process of making a new boot drive, and just went ahead and bought one.

 

But, of course, even after building the new boot drive on a new USB 2.0 stick and copying JUST the /config files over, this dumb problem returned. The new boot drive booted up great, all the dockers were working perfectly, networking with my other PCs was fine, and I thought I was set. Then I woke up this morning and the machine was unresponsive again.
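For anyone else doing the same swap: once the new drive is made bootable, only the config folder needs to come across. A minimal sketch, assuming both flash drives are mounted on another Linux machine at hypothetical paths:

# Hypothetical mount points for illustration - adjust to wherever the old and
# new flash drives are actually mounted on your system.
cp -a /mnt/old_flash/config/. /mnt/new_flash/config/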

 

When I click the Log button (/tools/syslog), it only has entries from the current boot, so I can't see what might have happened to push the machine into that unresponsive state.
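(Side note for anyone searching later, and hedged since menu names vary by Unraid version: the live syslog sits in RAM and is wiped on every reboot, but Settings -> Syslog Server can mirror it to the flash drive or ship it to a remote syslog target, so there's something left to read after a crash. A rough sketch, assuming the mirror-to-flash option is enabled and writes under /boot/logs:)

# Assumes "Mirror syslog to flash" is enabled under Settings -> Syslog Server;
# the exact filename under /boot/logs may differ by version.
tail -n 100 /boot/logs/syslog                      # last lines before the hang
grep -iE 'error|panic|oops|mce' /boot/logs/syslog  # scan for obvious faults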

 

Maybe I need a BIOS update or something? I haven't checked whether my board needs one yet - it's an ASUS Prime A320I-K. I'll look into that right now...

Link to comment

Okay, here's what I got last night - it happened again on its own, though this time I could still ping the server and get a response. After I logged in one last time at 7:55pm to make sure everything looked good, the only line logged was the mover one below - exit status 1 from the mover cron job. Seems like some error with the mover is causing this? That would explain why it happens regularly overnight (although, strangely, when the problem started it was more like a weekly occurrence). The two lines at 8:42 are the beginning of the startup sequence after I rebooted the unresponsive machine.

 

Jan 10 19:55:00 MrPlex webGUI: Successful login user root from 172.16.0.9
Jan 11 03:30:04 MrPlex crond[1730]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Jan 11 08:42:03 MrPlex emhttpd: Starting services...
Jan 11 08:42:03 MrPlex emhttpd: shcmd (101): /etc/rc.d/rc.samba restart

Link to comment

syslog snippets are seldom sufficient

 

Possibly that mover entry was simply the last entry in the syslog before the crash because nothing else was going on at the time.

 

Do you have any particular reason to suspect that all your previous crashes happened at the scheduled mover time?

 

Have you done memtest?

Link to comment

I've always noticed it first thing in the morning; I don't think it's ever happened during the day. That timing could suggest it's mover-related, but maybe not. I did run the mover manually just now and nothing bad happened, and mover logging didn't show anything suspicious.
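For reference, this is roughly how I ran it by hand and watched for output (the mover path comes straight from the cron line above; mover logging has to be enabled under Settings -> Scheduler, or wherever your version keeps the mover settings, for it to say much):

# Kick off mover manually and follow its output in the syslog while it runs.
/usr/local/sbin/mover &
tail -f /var/log/syslog | grep -i mover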

 

I have not done memtest - looks like I need to do that for at least 24 hours?

 

One other thing I should have mentioned just occurred to me: I've been getting a notice on my Unraid dashboard that my M.2 SSD cache drive is running hotter than Unraid would like. It sits around 55 degC, which it has done since I installed it. I googled it at the time and it didn't seem like a problem, but maybe there's something related to that?
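If it helps, this is roughly how you can double-check what the drive itself reports via SMART (the device name below is just an example - adjust it to your drive):

# Print the drive's SMART/health data, including its reported temperature.
# /dev/nvme0 is an example device name; list yours with: ls /dev/nvme*
smartctl -A /dev/nvme0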

Link to comment

Alright, well, I upped the temperature threshold yesterday so I don't have to worry about that anymore, but I don't feel any closer to solving this - it crashed again last night. The only thing I have to go on is testing my memory, which I'm going to do today, but that's still a bit of a shot in the dark. Is there anything I can look for in my diagnostics zip? Right now, assuming the memtest checks out okay, I think my best strategy is to disable my dockers one at a time until the system stabilizes, so I'll start that and report back. I've also added all my hardware to my signature in case any of my components are known for causing trouble...

 

Another thing I stumbled across while trying to figure this out is the "Fix Common Problems" plugin - so I'm installing that now to see if it helps dig up anything useful.

 

*Update - welp, looks like a hardware error. The FCP plugin found machine check events, so I installed mcelog, and lo and behold I got the following lovely lines:

 

Jan 12 08:33:18 MrPlex kernel: mce: [Hardware Error]: Machine check events logged
Jan 12 08:33:18 MrPlex kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
Jan 12 08:33:18 MrPlex kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff810725a4 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Jan 12 08:33:18 MrPlex kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1578843178 SOCKET 0 APIC 0 microcode 8001138
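(For anyone else chasing this later: the same events can be spotted with stock tools before installing anything - hedged, since the exact wording of the kernel messages varies:)

# Look for machine-check events in the kernel ring buffer and in the syslog.
dmesg | grep -i 'machine check'
grep -i 'mce:' /var/log/syslog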

 

Debating whether to keep letting memtest run or to just turn everything off and re-seat every connector and the RAM and then redo the diagnostics...

Link to comment

And another update - I did some more googling on that error and discovered that there seems to be an issue with 1000-series Ryzen processors and C-states. I exited the memtest (1 hour, no faults found), disabled C-states in my BIOS, and updated the BIOS for good measure (making sure C-states were still disabled after the update). When I rebooted the server, I re-ran Fix Common Problems and the MCEs no longer came up. I won't really know for sure if this fixed the problem until tomorrow morning (as that's when it "strikes".... lol).
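If anyone wants to sanity-check from inside the OS, the kernel exposes the idle states it's actually using through sysfs (standard paths on current kernels, though exactly what shows up depends on the cpuidle driver and your BIOS settings):

# List the idle C-states the kernel currently exposes for CPU 0.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name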

 

Link to comment

Welp, that seems to have solved it - quite the rabbit trail, but hopefully this thread can help out anyone in the future doing a 1000-series Ryzen build. Server couldn't be happier this morning.

 

So if I understand C-states correctly, what was happening was the computer tried to go into a sort of power-saving mode, and that's when it would crash?

Link to comment
2 minutes ago, takethecake said:

So if I understand C-states correctly, what was happening was the computer tried to go into a sort of power-saving mode, and that's when it would crash?

Correct. In more recent BIOS versions there's usually an option to fix that without disabling C-states: look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar).
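Some people also report working around it from the OS side with a kernel boot parameter instead - purely as a hedged example (the BIOS setting above is the cleaner fix), that would mean adding something like idle=nomwait to the append line in syslinux.cfg on the flash drive:

# Excerpt of /boot/syslinux/syslinux.cfg with an example parameter added.
# idle=nomwait is a workaround some Ryzen owners report, not an official fix.
label Unraid OS
  menu default
  kernel /bzimage
  append idle=nomwait initrd=/bzroot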

Link to comment
  • 2 years later...

So far so good! Haven't had any return of the problem I had in this thread; the only thing I've experienced recently is my NVMe drive running hot - I've had one unexplained crash in the last few months and I think that high temp was the reason. 

 

The other thing I might've considered when building this rig is getting a mobo/CPU combo with built-in video support. Even though I'm running a headless server, when booting Unraid for the first time I ran into an issue where, without a PCI video card attached, I couldn't get the server reachable from a browser on my LAN to log in. Even after setting everything up, if I tried to remove the video card the server wouldn't boot. So now I just keep that video card in the server, and it keeps me from using the PCI SATA expansion card I bought to hook up some extra HDDs. Oh well, I can always get bigger disks if I run out of room.

Link to comment
