Unraid 6.6.7 - stability issues on new server - Asus Zenith Extreme Alpha, 2990WX, 8 HDD, 2 SSD cache



Current issue: The server keeps locking up, and I haven't been able to pin down why.

Behavior: It runs fine, then different things start failing: I'll lose connection to one Docker app, then all Dockers, then the Unraid web page. Even before the web page locks up, if I try to download diagnostics it just spins without generating anything. I can still log in via the console (directly, not through the web page), but even then major things like 'reboot' don't work: the BIOS speaker beeps but nothing happens, no SIGTERM messages, it just sits there. It takes a press of the reset button. Most times (but not every time) there are trace dumps on the console that I can see when I log in directly.
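For what it's worth, here's roughly what I try from the console before giving up and hitting reset (assuming the Unraid 'diagnostics' CLI command and basic tools still respond, which they don't always):

    # try to generate the diagnostics zip from the console
    # (writes to /boot/logs if it completes)
    diagnostics

    # dump the most recent kernel messages, including any trace dumps
    dmesg | tail -n 100

    # copy the live syslog to the flash drive so it survives the reset
    cp /var/log/syslog /boot/syslog-before-reset.txt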

 

Other weird behavior: I've noticed discrepancies between the 'top' output and the Dashboard's utilization figures. Sometimes, minutes before a lockup like the one described above, when I notice things misbehaving, the Dashboard will show one CPU pegged at 100%, then another, then 4, then 6, then more until the lockup, yet during this time 'top' shows the system as basically idle.

Other times, when things are working properly (in the screenshot attached, Plex is working fine doing a transcode), 'top' will show well over 100% CPU usage while the Dashboard shows less than 10%.

[Screenshot: serverusage.PNG — 'top' output next to the Dashboard CPU meters]
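One possible piece of this, as I understand it: 'top' reports each process as a percentage of a single core, so a multi-threaded transcode can legitimately show several hundred percent while a 64-thread 2990WX is mostly idle overall. A rough way to compare like with like:

    # inside top, press '1' to expand the summary into per-core rows;
    # the per-process %CPU column is scaled to one core, so >100% is normal
    top

    # if the sysstat package is installed (not stock Unraid, as far as I know),
    # mpstat shows per-core utilization, one sample per second
    mpstat -P ALL 1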

 

Full story: Upgrading my server from:

Intel i7 4970x, ASUS MAXIMUS VI EXTREME, 16GB DDR3 G.Skill, 256GB ADATA (M.2 SATA) cache drive

 

To:

Threadripper 2990WX, Asus Zenith Extreme Alpha, 128GB DDR4-3000 G.Skill, dual 500GB WD Black (NVMe PCIe x4) cache drives

 

Same array drives and same PCIe cards. Not upgrading the graphics cards as of yet.

 

Moving Procedure and what went wrong:

I changed all shares to not use the old cache drive, ran mover, and made sure the cache was clear. Then I brought the system down.
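For reference, the pre-swap check from the console looked roughly like this (mover's path is standard on Unraid 6.x, but verify on your release):

    # switch shares to 'Use cache: No' in the GUI first, then run mover manually
    /usr/local/sbin/mover

    # confirm nothing user-facing is left on the old cache device
    ls -la /mnt/cache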

I put all the old drives, which had been in a separate Sans Digital enclosure, into the new case, swapped the flash drive over, and booted up.

Things were working immediately on boot, so I re-enabled cache usage, now pointing at the two new SSDs, and started up Dockers. Little did I know that things were about to go south.

I started experiencing all sorts of issues: PCIe errors on the main console, filesystem errors, kernel panics. The system would lock up as described above. I'd reboot, and it would fail to even reach the console login before a kernel panic or trace dump froze the system (requiring another reset press). When I did get in, I disabled Docker and VMs and set the array to not auto-start. I tried removing all my extra PCIe devices, leaving only the main graphics card. I was still getting PCIe errors sometimes.
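For anyone chasing the same thing, this is roughly how I've been pulling those errors out of the kernel log for a closer look; the ASPM line below is only an experiment I've seen suggested for Threadripper boards, not something I've confirmed helps:

    # inspect PCIe / AER messages in the kernel log
    dmesg | grep -iE 'pcie|aer|correct'

    # experiment only: add to the append line in /boot/syslinux/syslinux.cfg
    #   pcie_aspm=off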

I even tried copying the entirety of my flash drive to my laptop, rebuilding the flash drive with a fresh install using the Unraid install tool, and then grabbing my key and configs from the backup onto the 'new' drive. I would still see all the same errors, yet sometimes it would boot fine and show my drives, Dockers, and VMs (not started, of course).

Between all this trial and error and multiple lockups and reboots, I was incredibly unlucky: disk 5 (one of my 4TB drives) got corrupted.

I know I did this next part wrong. It's too late, I've moved past it, and I've accepted that I will have to reacquire 3TB of data.

The disk started showing as disabled, so I tried to remove the drive so I could re-add it and rebuild. I stopped the array, removed the drive, restarted the array, stopped it again, re-added the drive, and started once more. This made the rebuild option available. Shortly thereafter the system locked up. After a reset, the drive showed green status but held corrupted data.

I tried to get the array to rebuild the drive, but while it failed to mount, it still showed as green. I tried a parity check; that locked up the OS. I started the array in maintenance mode and tried a filesystem check and subsequent fix, but the system refused, saying I needed to do a parity rebuild, or something like that. When I stopped the array and removed the drive, I couldn't start the array again because the start button was greyed out, so I couldn't re-add the disk to make Unraid think I had replaced it. Somewhere in all this the format option became available, and I used it on the drive, thinking "OK, clean filesystem, then I can rebuild." I was wrong. It formatted, wrote the clean drive to parity, and then the OS locked up again. I got angry and left it alone for a day.
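For reference, the filesystem check I was attempting usually looks roughly like this on Unraid (a sketch, assuming disk 5 is XFS, the Unraid default; not what I actually got to run):

    # with the array started in maintenance mode (the md5 device keeps
    # parity in sync, unlike the raw sdX device)
    xfs_repair -n /dev/md5    # dry run: report problems, change nothing
    xfs_repair /dev/md5       # actual repair, only after reviewing the dry run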

I did some research into the crashing and found, in one of SpaceInvaderOne's older videos on Ryzen, the suggestion to add the RCU "no callbacks" (rcu_nocbs) kernel option to aid stability. I did this and got the system to stay stable long enough to finish a parity check. That only proved the data was gone. So be it. Move along.
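For anyone following along, the change is a kernel parameter on the append line in /boot/syslinux/syslinux.cfg; the 0-63 range below assumes all 64 threads of the 2990WX:

    label Unraid OS
      menu default
      kernel /bzimage
      append rcu_nocbs=0-63 initrd=/bzroot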

 

Now that I have better stability, I was finally able to restart some Dockers. Plex works, and I got a few others up and running. Then another lockup, a reset, and a trace error or kernel panic stopping the OS from completely starting. Another reset, then hours of stability until the next lockup. I have gone as long as 2 days and as short as 30 minutes between lockups. I have yet to try any VMs, only Dockers. I still have only the one graphics card attached, along with all my drives, but nothing else.
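Since diagnostics won't generate once a lockup begins, I'm considering copying the syslog to flash on a schedule so something survives the reset. A sketch; the path and one-minute interval are my own choices, nothing Unraid-specific:

    # as root: create the target once, then add the copy job via 'crontab -e'
    mkdir -p /boot/logs
    * * * * * cp /var/log/syslog /boot/logs/syslog-latest.txt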

 

That's the story of how I got to where I am now.

[Attachment: unraid-diagnostics-20190508-0943.zip]


An example of how, when it locks up, I can still log in via the console and issue commands like 'reboot', but they don't actually complete. The attached picture shows the output of my reboot command, where it says "going down for reboot now" along with all its subsequent output; however, it has been sitting like this for 10+ minutes now.
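If the next one hangs the same way, I may try the kernel's magic SysRq interface to force a sync and reboot instead of reaching for the front-panel reset. This is standard Linux, though whether it still responds depends on how wedged the kernel is:

    echo 1 > /proc/sys/kernel/sysrq     # enable SysRq
    echo s > /proc/sysrq-trigger        # emergency sync
    echo u > /proc/sysrq-trigger        # remount filesystems read-only
    echo b > /proc/sysrq-trigger        # immediate reboot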

 

Time unfortunately for another reset. 

[Attachment: A84F0551-6132-4F70-B2B7-C42FCE61C686.jpeg — console output after the hung 'reboot']


Additional troubleshooting that I've now done:

I have tried rebuilding a new USB stick by transferring my key to a new USB drive, importing all the drives, and recreating my Dockers. I didn't even get far past my first couple of Dockers before this behavior continued.

I thought I would try a memtest to see if something wasn't right. Every time (I made multiple attempts), it would hang momentarily and then reboot within seconds of my choosing the memtest option. This made me think the RAM needed reseating. I did that, and I still cannot get a memtest to run.
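One possibility I'm now wondering about: the Memtest86+ build bundled with Unraid only runs under legacy/CSM boot, so selecting it from a UEFI-booted stick can reboot instantly, which would match what I'm seeing. The boot entry lives in /boot/syslinux/syslinux.cfg:

    label Memtest86+
      kernel /memtest

Booting the stick in legacy mode for one run (or using PassMark's UEFI-native MemTest86 from a second USB) should separate "memtest won't launch" from "the hardware is broken".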


Having read through the entirety of the post and reviewed your diagnostics, I can safely say this is likely a hardware issue. Either CPU, memory, cooling, power, or cabling is likely amiss. We have tested Unraid thoroughly on multiple Threadripper systems (both TR1 and TR2) and have never run into these kinds of stability issues on any release of Unraid OS. I definitely think you either have some defective hardware, insufficient or improper cooling, or something else amiss in your build to have all these problems. At this point, I would completely disassemble the system and inspect each component one by one. Maybe even try installing another OS for testing if you can. If you can't even get into memtest, though, that's a pretty strong sign pointing to a hardware-specific problem.
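If you do try another OS, a quick way to load CPU and memory together from any mainstream live distro is stress-ng (a sketch; the flags assume a recent stress-ng build is available):

    # load all 64 threads plus memory pressure for an hour
    stress-ng --cpu 64 --vm 4 --vm-bytes 75% --timeout 1h --metrics-brief

If that runs clean for hours on another OS but Unraid still crashes in minutes, that would point back at configuration rather than hardware; an instant crash under load would confirm the hardware theory.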

