[6.8.0] Unraid crashing/freezing under sustained network reads

ShinseiTom · December 22, 2019

I recently bought an Unraid key after testing for around 120 days, with one stretch of 90+ days uptime at the end (so the trial didn't end).

As you can see, it was rock-solid stable with no problems, except it was slow as crud since I was using 12 old 500 GB SAN drives. Hot (so hot, 40+ idle!), loud, slow, low capacity, but great for testing. I had it in a 2:2:8 setup (parity:cache:data). Even had a super old PCI RAID card for the extra SATA ports. It did what I wanted, if slowly, so great! I decided to buy a key and new hardware on Black Friday, and it got reconfigured into four 4TB drives (1 parity, 3 data) and two 1TB SSDs for cache. Ditched the RAID card for now.

Everything was (and kinda is?) working great! Speeds, temps, loudness, capacity, all great jumps. I'm now held back by my 2500k and gigabit NIC more than anything else.

And then it started restarting or freezing literally every night in the 3-8am time-frame. Every morning I'd check and it would either be sitting there with the array not started (because I turned off array auto-start), OR it would be completely unresponsive, even to ping, and a physically attached monitor and mouse/keyboard would get no response either.

So first I thought it might be power or heat, as I moved it into a closet on an old power strip when I reconfigured the hardware. Just ruled that out: connected to a UPS it had the same issue, and heat was never an issue that I could see from the stats. Also, nothing else in the house is affected by power flickers.

Then I tried mirroring the syslog to flash, but there's nothing in there but a gap from when it fails (I assume) until I start it back up.
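For anyone who would rather not eyeball a mirrored syslog for that kind of gap, here is a minimal Python sketch that lists long silences between consecutive entries. It assumes the usual "Mon DD HH:MM:SS" prefix on each line; the function names, the 30-minute threshold and the default year are illustrative choices, not anything Unraid provides. An idle server can legitimately be quiet for a while, so the threshold probably needs tuning.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2019):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; return None if absent."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        # Syslog lines carry no year, so one is supplied by the caller (an assumption).
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def find_gaps(lines, min_gap=timedelta(minutes=30), year=2019):
    """Return (last_before, first_after, gap) for every silence of at least min_gap.

    A log that wraps from December into January yields a negative difference
    with a single fixed year, so such a boundary is simply not reported.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for before, after, gap in find_gaps(handle):
                print(f"silent for {gap}: {before} -> {after}")
```

Run it with the path to the mirrored syslog as the only argument. Each reported gap ends at the first line logged after the silence, which is where the next boot (or the next normal activity) begins.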

Then I tried upgrading to the just-released 6.8.0, thinking maybe I had some kind of weird software error. No change.

Finally, thinking it might be basic hardware, I ran a memtest (since everyone gets so hot and bothered by it on these forums) for 24+ hours and got 11 passes with no errors/fails. No SMART errors either.

So then I started doing all kinds of stuff, plugins and whatnot, to see what was going on in that 3-8am timespan. I think I tracked it down to Plex (on a completely different computer), which is scheduled to deep scan files in that timespan. It would have tons of files open on Unraid at the same time (from looking at the Active Streams plugin) as it scanned the copy of my NAS contents that I set up after getting the new hardware. Since it died most often after 6, when mover starts, maybe that has something to do with it too? It failed as early as 3:45am though, so that's definitely not all of it.

TLDR: It appears my separate Plex server kills my Unraid with tons of network reads as it deep-scans files in libraries I pointed to look at the Unraid array share. I'd rather not turn the deep-scan feature off, and this is probably not really Plex's fault. I've done memtest and tried to look in a mirrored syslog file.

What do I need to do to get more technical data to look through as syslog doesn't seem to cut it? Diagnostics?

Motherboard: ASRock Z77 Pro4

CPU: 2500k

RAM: 4GB + 8GB DDR3 (bad setup, but it's a friend's hand-me-down while he uses my old 1800x computer, huge upgrade for him)

No add-in cards

4 WD Red 4TB drives, 2 Samsung 870 evo SSD

Also, will probably be unresponsive or slow for next week, sorry if I don't immediately respond. Been trying to fix this for the last two+ weeks before I go on vacation, but finally ran out of patience/exhausted my knowledge. Creating this thread so that I don't forget anything while on vacation! I'm also going to use it as a sorta "test", as I've disconnected my Plex server from Unraid and I'm going to see if it crashes any while I'm gone.

Frank1940 · December 22, 2019

You will need to capture a Diagnostics file after the problem has occurred. Tools >>> Diagnostics

Unfortunately, this is easier said than done when the system locks up or becomes unresponsive. So I would suggest that you run the server for a while, then capture a Diagnostics file. In the meantime, set up the Syslog Server. If you can force the problem to occur within a day or so, I would log to the boot drive. Here is a link to setting up the Syslog Server:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601

Post up both files with your next post and include any other observations at that point.

Michael_P · December 22, 2019

How old is the power supply?

ShinseiTom · December 30, 2019

Ok, so I ran the system without Plex hitting it for a week while on vacation. And what do you know, no lockups or restarts. So heavy read loads are what's causing it, though I know it may not technically be Plex's fault.

On 12/22/2019 at 2:00 PM, Frank1940 said:

You will need to capture a Diagnostics file after the problem has occurred. Tools >>> Diagnostics

Unfortunately, this is easier said than done when the system locks up or becomes unresponsive. So I would suggest that you run the server for a while, then capture a Diagnostics file. In the meantime, set up the Syslog Server. If you can force the problem to occur within a day or so, I would log to the boot drive. Here is a link to setting up the Syslog Server:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601

Post up both files with your next post and include any other observations at that point.

Syslog was already mirrored to boot days before I posted here, and it appeared to have caught nothing. I remember reading it and there was just a big blank timespan without any kind of information after what looks like normal lines.

I'll attach the current syslog while I try to cause a new failure. I'm sure you already know this, but searching for the "microcode: microcode updated" line will help find the times when the server was restarted (though some of them are normal restarts as I tried stuff).
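That search can also be scripted. The following is a small Python sketch that lists every line containing the boot marker along with its line number, so each restart can be located in the file. The function name is made up for illustration, and the marker string is just the one quoted above; it cannot tell a crash from a deliberate reboot.

```python
def find_restarts(lines, marker="microcode: microcode updated"):
    """Return (line_number, text) for every line that contains the boot marker."""
    restarts = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if marker in text:
            restarts.append((number, text))
    return restarts


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for number, text in find_restarts(handle):
                print(f"line {number}: {text}")
```

The lines just above each reported line number are the place to look for whatever the server logged last before it went down.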

I'll also attach the current diagnostics after it's been running a while.

On 12/22/2019 at 3:01 PM, Michael_P said:

How old is the power supply?

Power supply is brand new bought for the Unraid build in July. It's also incredibly overkill for Unraid at 650W. My Unraid server pulls from 60-160W depending on load.

shinseiunraid-diagnostics-20191230-0020.zip syslog.7z

Frank1940 · December 30, 2019

It looks like the restart (in your Diagnostics file) happened around 02:15 on Dec 22nd and the last previous activity was on Dec 20 at 21:33. It looks like it could well be some external device is triggering it. Some suggestions:

First, don't dismiss out-of-hand the PS. I had a brand new PS behave quite similarly a while back. (It was less than thirty days old!) PS are no longer a dumb brick that just supply power. They actually interact with the MB. Verify that all PS cables and front panel cables are securely plugged in to the MB.

Second, do you have some action set up and scheduled to start about 02:15 on a Sunday morning?

Third, is it possible that some person or animal might be pushing the power/reset buttons on the server case? (Cats seem to be attracted to those case lights.)

Fourth, you have a UPS on this server. What are the parameters to trigger a shutdown? Have you tested the batteries lately? (Some UPSs switch to battery operation very quickly on power-line transients, so other devices might not be affected.)

Plex has been giving some problems with recent releases (data corruption???). Since I don't use Plex, I haven't really tracked them that closely. But it is another possibility. By the way, you are ignoring an update to mariadb and I seem to recall that it has been mentioned as giving problems lately. I would suggest at least checking the change log to see why an update was issued.

Michael_P · December 30, 2019

13 minutes ago, Frank1940 said:

Third, is it possible that some person or animal might be pushing the power/reset buttons on the server case? (Cats seem to be attracted to those case lights.)

Wouldn't explain the freezes, but worth investigating. I had a dodgy power button on my WMC box, it kept randomly (cleanly) shutting down in the middle of the night. Took a few months for me to figure out what was happening.

ShinseiTom · December 31, 2019

9 hours ago, Frank1940 said:

It looks like the restart (in your Diagnostics file) happened around 02:15 on Dec 22nd and the last previous activity was on Dec 20 at 21:33. It looks like it could well be some external device is triggering it. Some suggestions:

First, don't dismiss out-of-hand the PS. I had a brand new PS behave quite similarly a while back. (It was less than thirty days old!) PS are no longer a dumb brick that just supply power. They actually interact with the MB. Verify that all PS cables and front panel cables are securely plugged in to the MB.

Second, do you have some action set up and scheduled to start about 02:15 on a Sunday morning?

Third, is it possible that some person or animal might be pushing the power/reset buttons on the server case? (Cats seem to be attracted to those case lights.)

Fourth, you have a UPS on this server. What are the parameters to trigger a shutdown? Have you tested the batteries lately? (Some UPSs switch to battery operation very quickly on power-line transients, so other devices might not be affected.)

Plex has been giving some problems with recent releases (data corruption???). Since I don't use Plex, I haven't really tracked them that closely. But it is another possibility. By the way, you are ignoring an update to mariadb and I seem to recall that it has been mentioned as giving problems lately. I would suggest at least checking the change log to see why an update was issued.

1. PSU is definitely hooked up alright, I tried replugging all the power (and other) cables once. I've never run into a power supply that only died under heavy HDD reads. I could see one of the HDDs being somehow bad and not reflecting that in SMART, but surely that wouldn't crash the whole system? Unraid also finishes parity checks just fine after each reboot I have to do, and that's stressful on both the CPU and HDDs for long periods of time.

2. Yes, it crashed/froze every time between 3-8, when my Plex server (on another computer) is set to do deep scans of files. And it only started restarting/freezing when I pointed Plex to the new copies of the library on the array and it stopped when I unlinked those libraries. Nothing external writes to Unraid as far as I know, only VM/Docker/Plugins.

SSD Trim and Mover are set to run daily, at 4AM and 6AM respectively though, so I could see them possibly interacting?

3. No, for most of the failures it was in a closet with a closed door. Doesn't get hot either. As far as I can tell no issue with the power button.

4. I have one connected now, but that was in an effort to eliminate possible power issues and only done right before my first post. I don't usually have power issues (hence the previous 90+ days continuous runtime before these issues), but to be sure I hooked it up anyway.

As for my Plex server, I have no issues with it otherwise. Been running fine pointed at the NAS I was trying to replace. It only reads from my unraid box, nothing is being written far as I know when you point a library somewhere. I did notice that when Plex does a deep scan of music files it has a shit-ton of handles/streams open though. The list in the Unraid Active Streams plugin goes crazy, probably 100+ entries at times.
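To try to force the failure on demand instead of waiting for the nightly scan, a client can imitate that access pattern by reading many files from the share at once. This is a rough Python sketch, not a reproduction of what Plex actually does: the function names, the default of 100 workers and the 1 MiB chunk size are all assumptions, and it only reads, never writes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Collect the path of every regular file found under root."""
    paths = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            paths.append(os.path.join(folder, name))
    return paths


def read_file(path, chunk_size=1024 * 1024):
    """Read one file from start to end and return the number of bytes read."""
    total = 0
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except OSError:
        return 0  # unreadable files (or a dropped share) count as zero bytes
    return total


def hammer(root, workers=100):
    """Read every file under root with up to `workers` files open at once."""
    paths = list_files(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_file, paths))


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(f"read {hammer(sys.argv[1])} bytes")
```

Run it from a separate machine with the mounted Unraid share as the argument. If the server goes down during the run, that points at sustained network reads in general and not at anything specific to Plex.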

Not sure about Maria, it's not ignored far as I can tell. Up-to-date too.

tchmnkyz · December 31, 2019

I am having the same exact thing. Mine is a fairly new Supermicro with Dual Xeon in a really decent cooling situation. I did catch a screen cap of a kernel panic:

I did update the Bios and ipmi to latest firmware.

srv-stor-001-diagnostics-20191230-2058.zip

Frank1940 · December 31, 2019

Have a read through this thread. I would read the first few posts and then jump to the last couple of pages (and then work your way back as required).

https://forums.unraid.net/bug-reports/stable-releases/downgraded-back-to-667-due-to-sqlite-corruption-r576/

I admit that I did not follow this problem very closely, but I believe it was determined that where you store certain database files made a big difference in system stability for a lot of folks. The problem may have been fixed in the most recent release, but a corrupt DB when you are doing a 'deep scan' could cause problems...

tchmnkyz · December 31, 2019

The difference here is that my Plex/Sonarr/Radarr/etc. all run on a remote host. Only the video files live on the Unraid node.

ShinseiTom · December 31, 2019

Yep, which is the same for me. Plex is completely separate hardware from my Unraid box, only connected by networking. Plex isn't crashing, the worst it gets is the connection fails when Unraid crashes/freezes.

Well, it was completely separate at least. I spun up the Plex docker I had for messing around and pointed it at all the "local" files (mounted the /user/public/Plex folder on the array to /unraid in the docker, and created 7 libraries pointed at 7+ TB of media) and let it run uninterrupted since my last post. No crashes or anything despite the load, and even my weekly scheduled parity check ran last night at the same time. It's still running right now, I believe it's creating the interval screenshots for JoJo's.

Now, the Unraid box doesn't have nearly the grunt of my external Plex server (stock 2500k vs R610 dual quad-core Xeon), so maybe that has something to do with it, but the only big difference in what Unraid is doing is that the files aren't being read over the network.

Frank1940 · December 31, 2019

@tchmnkyz, @ShinseiTom, Where are your Plex database files at? This was the problem that the link was talking about.

tchmnkyz · January 1, 2020

Only media files are hosted on Unraid. The DB and everything else is hosted on a VM on my Proxmox node, which stores everything on local storage.

tchmnkyz · January 13, 2020

So I was hoping things would get better with 6.8.1, and things actually got worse. It seems to lock up after about 3-4 hours now. I am reverting back, but I was like, come on... it just keeps getting worse....

Decto · January 14, 2020

It may be worth trying to run all drives on the 6 main SATA ports; it looks like the two SSDs are on the ASMedia ports, SATA 7 and SATA 8. ASMedia has caused issues in the past.

You can just move the cables, no reconfiguration needed, as Unraid identifies drives by serial number.

tchmnkyz · January 14, 2020

I will have to check, but this is a Supermicro 2U backplane. I am not sure how much wiggle room I have to move cables like that. IIRC it goes from the backplane to a RAID controller. Again, I have to double check.

tchmnkyz · January 15, 2020

I changed it; waiting to see if it is better now.

tchmnkyz · January 15, 2020

So long story short, it is still an issue. Could this be an issue with a bad USB flash drive? Basically, at random points it seems to just lock up. When I connect to the console over IPMI, the host has become unresponsive. I can't even type ps auxf or anything.

ShinseiTom · January 15, 2020

So, I hadn't responded in a while since it was actually stable for a while (didn't have remote plex pointed at it).

Decided to point the remote Plex at it again, which seemed to work... for a couple days.

Decided to do the update to 6.8.1. Probably should have left well enough alone, crashed again. Last thing in the syslog before the crash is what I assume is a mover entry:

Jan 15 05:49:26 ShinseiUNRAID kernel: mdcmd (71): spindown 1
Jan 15 06:00:02 ShinseiUNRAID crond[1652]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

So, I'm going to leave the Plex connected for now and see if it crashes again. If it does, well I'm kinda feeling let down. This was supposed to be the replacement for my Synology NAS but this instability for Plex over the network kinda kills that use.

tchmnkyz · January 17, 2020

Knock on wood, mine has been stable for a few days now.... I am wondering if it is a bad USB stick or something. I am going to try replacing it this weekend.

ShinseiTom · January 29, 2020

Ok, finally the system crashed but kept an error on-screen on the attached monitor.

I feel like I know what the issue is, though I'm still puzzled by why it would only crash at a certain time at night. I hope this is the same issue and not something new.

So, the error:

"whole bunch of gibberish with hex locations and whatnot"

FAT-fs (sda1): FAT read failed (blocknr 1941)

FAT-fs (sda1): unable to read inode block for updating (i_pos 12544047)
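If messages like these ever make it into a saved log before the box dies, they can be pulled out with a short Python sketch such as the one below. The regular expression is modelled only on the two lines quoted above, and the function name is hypothetical. Not every FAT-fs line is a read failure (the kernel uses the same prefix for notices such as an unclean unmount), so the output still needs a human look.

```python
import re

# Matches kernel messages shaped like "FAT-fs (sda1): FAT read failed (blocknr 1941)".
FAT_MESSAGE = re.compile(r"FAT-fs \((?P<device>[^)]+)\): (?P<message>.+)")


def fat_messages(lines):
    """Return (device, message) for every FAT-fs kernel message found in lines."""
    found = []
    for line in lines:
        match = FAT_MESSAGE.search(line)
        if match:
            found.append((match.group("device"), match.group("message").strip()))
    return found


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for device, message in fat_messages(handle):
                print(f"{device}: {message}")
```

On Unraid the FAT-formatted device is normally the boot flash drive, which is why repeated read failures on it are worth following up.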

JorgeB · January 29, 2020

That's a flash drive problem. Run chkdsk; if there are more issues it could be failing. Also make sure you're using a USB 2.0 port.

ShinseiTom · January 29, 2020

chkdsk didn't return any issues with the drive.

But it was indeed plugged in to a USB3 slot. I'll swap it to USB2 and continue monitoring.

tchmnkyz · January 30, 2020

Mine stopped crashing after I stopped running xteve in docker on Unraid. I moved it over to my Proxmox box as a container and everything seems to be fine now... I am continuing to monitor it.

ShinseiTom · February 6, 2020

Alright, another crash with an error I could catch. It had been stable for 8 days and was in the middle of a network copy of a few large (10+ GB) files.
