[SOLVED] System crash and reboots on array start


Forusim


Hello community, I searched through various threads in this forum, but could not find a working solution.

A few days ago I tried to run a parity check, after which the system immediately crashed and rebooted.

I looked into the logs and noticed the following errors on my flash drive:

Feb  3 21:38:07 Tower kernel: print_req_error: critical medium error, dev sda, sector 30030848
Feb  3 21:38:07 Tower kernel: Buffer I/O error on dev sda, logical block 15015424, async page read
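As a rough sanity check on the two log lines above, the failing sector can be converted to a byte offset on the device. This is only a sketch: it assumes the kernel reports `sector` in 512-byte units (the usual Linux block-layer convention), and `sector_to_bytes` is a hypothetical helper name, not an Unraid tool.

```shell
# Assumption: kernel log sector numbers are in 512-byte units.
sector_to_bytes() {
    echo $(( $1 * 512 ))
}

sector=30030848          # from the print_req_error line
logical_block=15015424   # from the Buffer I/O error line

byte_offset=$(sector_to_bytes "$sector")
echo "byte offset: $byte_offset"                                   # 15375794176
echo "bytes per logical block: $(( byte_offset / logical_block ))" # 1024
```

The ratio of exactly 2 between the two numbers suggests both lines point at the same spot on the flash drive, roughly 14.3 GiB in.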

After some research I figured that my SanDisk Ultra Fit may fail in the near future, so I ordered a replacement (Samsung Fit Plus 32GB).

I followed the steps in https://wiki.unraid.net/UnRAID_6/Changing_The_Flash_Device, which worked fine.

However, on array start the server just crashed and rebooted, so I have now chosen safe mode.

I disabled the VM and Docker services, as recommended in other threads.

I tried to start the array again, but in maintenance mode, which worked.

I stopped the array and tried to start it normally (with mounting) - the server rebooted again.

 

I am quite desperate, because I cannot even see on the screen why it is crashing and rebooting.

I tried to catch the events before the crash by mirroring syslog to flash, but cannot see a clear hint there.
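For anyone searching a mirrored syslog the same way, a minimal sketch of filtering it for storage errors could look like this. `find_io_errors` is a hypothetical helper, and `/boot/logs/syslog` is only an assumed location for the mirrored file; adjust it to wherever your copy ends up. The patterns are taken from the two kernel messages quoted earlier in this post.

```shell
# Hypothetical helper: print kernel lines that mention I/O or medium errors.
find_io_errors() {
    grep -E 'kernel: .*(I/O error|medium error)' "$1"
}

# Assumed path of the mirrored syslog on the flash drive:
# find_io_errors /boot/logs/syslog | tail -n 20
```

Note that a sudden reset caused by a power problem may leave nothing at all in the log, so an empty result does not rule out hardware.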

 

I first suspected a corrupted config from my flash backup and even started a new config (keeping the current assignments).

I also ran a memory test, which completed one pass without errors.

However, the symptoms are the same: starting the array in maintenance mode works, starting it in normal mode crashes the system.

 

Edit:

Today I did a cold boot from my old flash drive, which still gives me "print_req_error: critical medium error, dev sda, sector 30030664".

However, the array starts and the drives are mounted without problems.

So I proceeded to start a parity check - this time no crash, and the usual speed of ~180 MB/s.

Enabled the Docker service - started without issues.

Enabled the VM service - started without issues.

Started a Windows 10 VM (see attachments), to which I had passed through the IGD before all those issues appeared - the system crashed and rebooted.

For the IGD passthrough I followed these steps, which worked without issues for several days before I started the parity check.

Can those additional entries in syslinux.cfg "video=efifb:off,vesafb:off modprobe.blacklist=i2c_i801,i2c_smbus,snd_hda_intel,snd_hda_codec_hdmi,i915,drm,drm_kms_helper,i2c_algo_bit" cause the crashes?

This may all be coincidence, because before that it was crashing in safe mode without any plugins or services.

 

tower-diagnostics-20210206-2350.zip syslog

Windows 10 VM.xml

Edited by Forusim
Link to comment

OK, now it looks like a hardware issue to me, one that is triggered by the parity check...
I booted from the old flash drive, started the array, started a parity check and let it run for several minutes without further action.
Suddenly the system crashed and rebooted.

 

Some information about my hardware:
UPS: Cyberpower EX Series 850 VA / 490 Watt (Purchased 02/2016)
Case: Fractal Design Arc Mini R2 (Purchased 02/2016)
PSU: Cooler Master VS Series Modular 80+ Gold 450 Watt (Purchased 02/2016)
MoBo: Gigabyte H370M D3H GSM - BIOS F12 (Purchased 09/2018)
CPU: Intel i3-8100 (Purchased 09/2018)
RAM: Kingston KVR21E15D8K2/32 - 32GB (16GB 2Rx8 2G x 72-Bit x 2pcs.) PC4-2133 (Purchased 07/2017)
HDD: 6x Western Digital Whitelabel(Red) 8TB / 10TB as array (Purchased over last 4 years)
SSD: 2x Samsung 970 EVO Plus as cache (Purchased 06/2019)

 

Where should I start looking?

 

Edited by Forusim
Link to comment
13 minutes ago, ChatNoir said:

Just to understand the situation a bit better, this is an old server that was running fine until a few days ago ?

Did you make any changes recently ?

Hello ChatNoir, thanks for your reply.

Yes, I started with UNRAID in 02/2016 on a small scale with an ASRock N3150-ITX, 8 GB RAM, 1 Samsung EVO 850, 1 WD Red 8 TB and a SanDisk Ultra Fit flash drive.

I upgraded the mainboard and CPU in 09/2018 and the cache SSDs in 06/2019.

I have had no major issues in all these years and am overall very happy with the UNRAID Pro purchase.

About 2 weeks ago I started passing through the IGD of my i3-8100 to a Windows 10 VM, which is more powerful than my notebook's i3-3110M.

If all works out, I would completely switch my daily desktop to the Windows 10 VM, since all my data is already on Unraid.

 

It all looked very promising until I started a parity check, which is usually scheduled every month.

However, the last one was missed, because the server was asleep on the last day of January.

So I started the parity check from the Main menu and the server just crashed + rebooted.

So here I am with my issue, stated in the initial post.

Link to comment

I wonder if you have a PSU issue. Starting a parity check is quite likely to put extra stress on the PSU, and PSUs frequently degrade with age. Do you have another one you can try?

 

The other possibility that springs to mind is a problem with the CPU cooling causing thermal shutdown to occur.

Link to comment
5 minutes ago, itimpi said:

I wonder if you have a PSU issue. Starting a parity check is quite likely to put extra stress on the PSU, and PSUs frequently degrade with age. Do you have another one you can try?

 

The other possibility that springs to mind is a problem with the CPU cooling causing thermal shutdown to occur.

Unfortunately I do not have a spare PSU to test. Can I simulate a high load on the PSU without running a parity check?

I already removed the UPS and plugged the server directly into the wall socket - still crashing.

Temps are at 40 °C according to the Temp plugin and the fans are running as usual.

 

My current state is that I can mount the array with the old flash drive (UNRAID does not force a parity check here).

The last time I started the parity check, it ran for 15 minutes and checked 154 GB (1.5 %) at 179.0 MB/sec, then the system suddenly crashed and rebooted.

Link to comment
3 minutes ago, ChatNoir said:

Were you doing Plex transcoding at the time of the crashes?

Crashes occur even in safe mode without Docker or a VM running.

The trigger is very likely the parity check - usually it crashes directly on start, but sometimes a few minutes after start.

Link to comment

I did some stress tests on my system:

Mprime ran for half an hour in hard test mode - no issues.

Diskspeed.sh ran without issues for all disks - one disk at a time.

Then I tried to simulate the parity check load with parallel calls to dd:

(dd if=/dev/sdd of=/dev/null bs=1G count=1 iflag=nocache) & (dd if=/dev/sde of=/dev/null bs=1G count=1 iflag=nocache) & (dd if=/dev/sdf of=/dev/null bs=1G count=1 iflag=nocache) & (dd if=/dev/sdg of=/dev/null bs=1G count=1 iflag=nocache)

4 out of 6 disks can be read in parallel; as soon as I add more, the system crashes and reboots.
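The same test can be wrapped in a small function so that the number of disks is easy to vary. This is a minimal sketch, not an official Unraid procedure: `parallel_read`, `MIB_PER_DISK` and `DD_FLAGS` are hypothetical names, `iflag=nocache` is the GNU dd flag from the command above, and the device names in the usage comment are placeholders. It only reads from the devices, but on a system with this fault it can of course trigger the crash.

```shell
# Amount to read from each device, in MiB (1024 MiB = 1 GiB as above).
MIB_PER_DISK=${MIB_PER_DISK:-1024}
# Extra dd flags; the default matches the command above (GNU dd).
DD_FLAGS=${DD_FLAGS-iflag=nocache}

# Start one background dd reader per argument and wait for all of them.
# Returns non-zero if any reader failed.
parallel_read() {
    pids=""
    for dev in "$@"; do
        # $DD_FLAGS is intentionally unquoted so that it may be empty.
        dd if="$dev" of=/dev/null bs=1M count="$MIB_PER_DISK" $DD_FLAGS 2>/dev/null &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example with placeholder device names:
# parallel_read /dev/sdd /dev/sde /dev/sdf /dev/sdg
```

Calling it with four, then five, then six devices reproduces the ramp described above.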

I will check all the cabling of all my disks now.

 

Any recommendations on how to test whether a PSU is failing?

Edited by Forusim
Link to comment
1 hour ago, jonathanm said:

Replace it.

 

You could possibly rearrange how the drives are getting power, as in make sure you are using all the leads possible directly from the PSU, eliminate as many splitters as you can.

My PSU has a single 12 V rail and is semi-modular, with 3 cables of 3 SATA connectors each (3x3).

I have used 2 cables with 3 connectors each (2x3) for my 6 disks for about a year now without issues.

I also tested with 3 cables and 2 disks each (3x2), but the result is the same: if I run the dd read test on more than 4 disks, the system crashes and reboots.

It seems the PSU is dying just before the 5-year warranty expires (purchased 02/10/2016).

I hope that Cooler Master will accept the RMA.

Link to comment
3 minutes ago, Forusim said:

It seems the PSU is dying just before the 5-year warranty expires (purchased 02/10/2016).

I hope that Cooler Master will accept the RMA.

My Cooler Master 650W PSU on a desktop Windows PC died last week 2 months after the 5 year warranty expired.  It was never pushed anywhere near its limit.  I replaced it with a Seasonic PSU with a 10-year warranty.  Time will tell if it is any better.

 

 

Link to comment
  • 3 weeks later...
  • Forusim changed the title to [SOLVED] System crash and reboots on array start
  • 4 months later...
