B1scu1T

Newbie experiencing some problems


Hi Guys!

 

I have been trying to get my unRAID server set up and running for a while now. It's been mostly a success story, but there have been a few stumbling blocks along the way, mostly to do with learning a new OS and the way things are done.

This morning I went to continue transferring files from my old server to the new one and I realised I couldn't access or even ping the server at all. I went to its location and had a look and it was switched on so I figured I would hit the power button and let it do a safe shutdown... it didn't do anything. I plugged in a monitor and keyboard and the terminal was displaying an error message:

 

Error: error removing the device 'missing' - no missing devices found to remove.

 

Because the keyboard and mouse are not normally plugged in, the server didn't seem to recognise the keyboard, so my troubleshooting abilities were limited. At this point I forced a reboot, knowing that meant a parity check. In hindsight I should have looked at the help threads first, but I obviously wasn't thinking it through properly.

 

Once the system had booted back up, the error message displayed again and although I can now access the server, the performance is outrageously slow and the parity check is claiming it will take ~100 days!

I have already run a parity check since I built the system and it took less than a day (17 hours), so something is obviously not right. I grabbed a diagnostics download from this point.

 

The next thing I did was stop the parity check and perform a proper safe reboot.... the message is still there. I grabbed another diagnostics report.

 

Next I thought I would rule out the plugins I have installed, so I stopped the array, shut down and made sure it booted into safe mode.... I got the same error on boot up. I grabbed yet another report.

 

unRAID version:

6.1.9

 

Usual plugins:

Preclear

Community Applications

 

Docker:

Sonarr

Deluge

 

The specifications are:

Corsair CS450M PSU

2 x E5-2428L

Supermicro MBD-X9DBL

Logic-case SC-43400-8HS

24GB Samsung DDR3 ECC Registered RAM

 

Array - Connected into SAS hotswap via reverse 8087 cables to the motherboard headers:

HGST_HDN724040ALE640 (Parity)

SAMSUNG_HD753LJ

2 x SAMSUNG_HD502HJ

2 x SAMSUNG_HD203WI

WDC_WD40EFRX-68WT0N0

 

Cache:

KINGSTON_SV300S37A480G

SanDisk_SDSSDHII480G

 

Boot:

Cruzer_Fit

 

Hopefully you guys can find some information that points me towards the probable cause. Let me know if you need more info!

 

Edit: Minor correction to the wording of the error message

tower-diagnostics-20160421-1158.zip

tower-diagnostics-20160421-1229.zip

tower-diagnostics-20160421-1234.zip

Share this post


Link to post

Well, I rebooted it and ran the parity check again, and this time it went at a regular pace. It would be nice to have some pointers as to what I might have done wrong to cause this, though.

Share this post


Link to post

After a quick look, all disks look fine. There are a couple of machine check errors in the first syslog; these usually point to a hardware issue: RAM, board, power, etc.

Share this post


Link to post

Hmmmm, strange that it would randomly crop up rather than be a persistent problem. Do you have any suggestions on narrowing it down?

Share this post


Link to post

It's just happened again :(

 

Same issue again with the terminal: even with a monitor and keyboard plugged in from boot up, I can't enter anything. It looks like something in the system locks up completely, which means yet another parity check.

 

Perhaps a memtest run through is worthwhile?

 

Edit: I just pulled the top off to check for overheating and the PCH was seriously scalding hot.... which I think would add up with the issues I was experiencing.

 

IMG_20160422_174641.jpg

 

The small silver heatsink at the top right. Guess I need to get my drill out and cram some more fans into the case!

Share this post


Link to post

Well, I put a different heatsink on the PCH to deal with the heat (I had an old Thermalright one with the correct mounts). I just installed Emby, set off the library scan and the system rebooted itself :(

 

That means I have yet another bloody parity check and I'm clearly no closer to solving the issue :(

Share this post


Link to post

Is it possible to set up an automatic backup of the syslogs to a folder, so it's easier for me to see what happens in the lead-up to the failure?

 

I added the power supply model to the system specs in the first post.

Share this post


Link to post


From the local console:

tail -f /var/log/syslog > /boot/syslog.txt

This will continually copy the syslog to the flash drive (nothing will appear on the screen, however).

 

It has to be run from the local keyboard and monitor so that it keeps copying even if the network goes down.
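
One caveat with the command above: by default `tail -f` starts from the last few lines of the file, so earlier entries are not copied. Below is a minimal sketch of an alternative that copies the whole log, assuming bash is available on the console. The function name `snapshot_syslog` and the `/boot/logs` folder are made up for illustration and are not unRAID features.

```shell
#!/bin/bash
# Hypothetical helper (not part of unRAID): copy the whole syslog to the flash drive.
# The default paths are assumptions based on the command used in this thread.
snapshot_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest_dir="${2:-/boot/logs}"
    mkdir -p "$dest_dir" || return 1
    # Overwrite a single file so the flash drive does not fill up with copies.
    cp "$src" "$dest_dir/syslog-latest.txt" || return 1
    # Flush buffered writes so the copy has a chance of surviving a hard lock-up.
    sync
}
```

Run from the local console, something like `while sleep 60; do snapshot_syslog; done &` would refresh the copy every minute. Frequent writes add wear to the flash drive, so this is probably best kept to short troubleshooting sessions.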

Share this post


Link to post


 

I think I managed to capture one when I had the issue again, and it displayed MCE errors. I can't see any /dev/mcelog folder though. I think I'll run a memtest. The system seems to run for quite a while when there isn't much load on it, so perhaps it only panics when it uses a certain stick of RAM or CPU core.

syslog.zip
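
For anyone wanting to pull the machine check lines out of a captured syslog, here is a rough sketch. The function names `find_mce_lines` and `count_mce_lines` are hypothetical, and the search patterns are only my assumption about how the kernel usually words these messages, so adjust them to match what your own log shows.

```shell
#!/bin/bash
# Hypothetical helpers: locate syslog lines that look like machine check reports.
# The patterns are assumptions; kernel wording varies between versions.
find_mce_lines() {
    local log="${1:-/boot/syslog.txt}"
    grep -iE 'machine check|hardware error|mce:' "$log"
}

# Count the matching lines; prints 0 when nothing matches.
count_mce_lines() {
    local log="${1:-/boot/syslog.txt}"
    grep -ciE 'machine check|hardware error|mce:' "$log"
}
```

Calling `find_mce_lines /boot/syslog.txt` prints the matching lines with their timestamps, which makes it easier to see what the system was doing just before each lock-up.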

Share this post


Link to post

Well, I took out all the RAM except two sticks (the minimum for dual CPUs) and ran memtest. Immediately a load of errors popped up under CPU0. I stopped the test and swapped the RAM for some Crucial non-registered sticks out of my other server and, again, errors appeared immediately. I guess it's either a CPU issue or perhaps even the motherboard :(

 

To be honest, I'm pretty stumped now and will have to look at it again another day :(

Share this post


Link to post

I have also started a hardware thread to discuss the issues I have been having with memory errors. Unfortunately no-one has replied to that one, so I'm going the very long, slow way of testing everything I can. With the help of some people from another forum, I'm actually making progress.... but I'll report on that once done. I'm currently on one CPU with two sticks of RAM and no "CATERR" system events in IPMI, so it's looking relatively good at the moment..... anyway.....

 

I keep seeing this warning in the system log:

Apr 29 22:48:10 Tower kernel: BTRFS warning (device sdc1): block group 155780644864 has wrong amount of free space

Apr 29 22:48:10 Tower kernel: BTRFS warning (device sdc1): failed to load free space cache for block group 155780644864, rebuild it now

 

Is there anything to be worried about there?

 

Edit - spoke too soon about the CATERR messages. I forced a rescan of Emby's media folders and it collapsed on me again :(

Share this post


Link to post
