[SOLVED] UNRAID freezing requiring reboot. Could someone please look at my log?



Currently running 5.0b9 and previously ran 5.0b7 since first setting up my UNRAID server. I've been experiencing random freezes requiring a cold reboot to get everything working again, and I've had them on both beta versions of UNRAID. I'm currently performing some backups before starting over with 4.7; however, I am wondering whether the issue is a beta issue or something else (e.g. a hardware issue, a bad drive, a BIOS that needs updating, etc.). My most recent system log is attached, which should reflect UNRAID freezing while copying some movies to a USB drive attached to my iMac.

 

When I first began having these freezes I thought it was just the server losing network. However, I've since connected a display directly to the motherboard and it turns out UNRAID is simply going unresponsive altogether. Moreover, I noticed in the last couple of days that it will freeze up even if the array is stopped. Any help and/or suggestions would be appreciated. Thanks.

 

Hardware Specs:

Biostar A760G M2+

Supermicro SASLP-MV8

AMD Sempron 140

HDs from a variety of manufacturers

syslog.txt

Link to comment

Have you run memtest on the system to make sure the memory is good?

 

You can select memtest when you reboot the machine and run it instead of unRAID.

 

 

Short of that, you need to open a telnet session to the server and run:

tail -f /var/log/syslog

 

Leave it open and running until the server freezes completely. Once that happens, copy and paste what you can here.

Link to comment

Thanks for taking a look at this for me, prostuff1. I wasn't able to copy and paste, but I took a picture of my console screen after I ran that very command last night. It is attached.

 

I ran memtest overnight back when I first set up UNRAID and saw nothing that concerned me. I can certainly run it again.

 

Bryan

L10505501.jpg.d5f0858c8f79fee6ed9a5f959ffc8b93.jpg

Link to comment


Try running the same command in a telnet connection so that more info can be captured.

 

Is there a certain thing that seems to cause this? Does it happen at a certain time?

 

Running another memtest overnight would not be a bad idea.

Link to comment

There is a kernel bug call trace at the end of the syslog and it looks to be related to a SATA expansion card. You may want to pull the Supermicro and run without it for a bit to see if the problem clears up.

 

If you do the telnet syslog tail, then depending on your telnet software you may also want to increase the buffer size so you get even more of the log.

 

Peter

Link to comment

Thanks for the tips. I just went out and bought a couple 2GB sticks of RAM since I wanted more RAM anyway. So, if the previous stick was bad at least I can kill two birds with one stone. That said, as soon as I get home from the office I'll try running without the Supermicro card and update the thread.

 

I am having trouble with the log command prostuff1 suggested earlier. After the server was rebooted today I opened a telnet session from a local Mac and entered the command as provided. Then, I proceeded to start backing up files off the server until it froze up again, which it did about 30 minutes later. However, I noticed the log generated in the terminal window did not update beyond the first 10 lines when I first entered the command. Should I be entering that command from the console instead? And if so, how can I set it up to grab more info than what is on a single screen?

 

While I've been working through this issue I've had the system log automatically saving to the flash drive because otherwise, I'd lose it each time I had to reboot. I attached the most recent again although I don't know if it will provide much different info than the first.
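For anyone wanting to do something similar, here is a rough sketch of one way a periodic copy of the syslog could be kept on the flash drive. It is not necessarily how it was set up on this server: the function name `snapshot_log`, the `/boot/logs` destination, and the 60-second interval are all assumptions for illustration.

```shell
# snapshot_log SRC DEST_DIR
# Copies SRC to DEST_DIR/syslog-latest.txt via a temporary file, so a
# half-written copy never replaces the previous good one.
snapshot_log() {
    mkdir -p "$2" || return 1
    cp "$1" "$2/syslog-latest.txt.tmp" || return 1
    mv "$2/syslog-latest.txt.tmp" "$2/syslog-latest.txt" || return 1
    sync    # push the copy out to the drive before the next freeze
}

# Hypothetical use on the server (both paths and the interval are assumptions):
#   while true; do snapshot_log /var/log/syslog /boot/logs; sleep 60; done
```

Copying to a temporary name first and then renaming means that a freeze in the middle of a copy should leave the last complete snapshot in place.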

 

Thanks again guys,

 

~B

syslog.txt

Link to comment

I'm certainly no expert but this line

 

Jul 21 23:19:41 Tower kernel: Modules linked in: md_mod xor mvsas ahci libahci libsas scsi_transport_sas r8169 atiixp [last unloaded: md_mod]

 

seems to be pointing to the SATA card. I believe mvsas is the Marvell driver for the serial attached SCSI chip used on the SATA card. There are also other AHCI libraries and SAS libraries listed, which in my mind all point to some issue with a SATA controller.

 

Peter

Link to comment


Coincidentally, tail without the -f option displays the last 10 lines. Any chance? If it exits back to a prompt then it isn't seeing the -f.
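To illustrate the difference, here is a minimal sketch that uses a throwaway file in place of /var/log/syslog. The file, its contents, and the helper name `count_tail_lines` are made up for the demo.

```shell
# Throwaway 25-line file standing in for the real syslog (demo data only).
demo_log="$(mktemp)"
i=1
while [ "$i" -le 25 ]; do
    echo "demo line $i" >> "$demo_log"
    i=$((i + 1))
done

# count_tail_lines FILE: how many lines plain `tail` prints for FILE.
count_tail_lines() {
    tail "$1" | wc -l | tr -d ' '
}

# Plain tail prints only the last 10 of the 25 lines, then exits.
count_tail_lines "$demo_log"

# With -f, tail prints those same lines and then keeps following the file
# until you press Ctrl-C, so it is left commented out here:
#   tail -f /var/log/syslog
```

If the terminal shows 10 lines and a fresh prompt, the command ran without -f. If it shows 10 lines and then just sits there, -f was seen and new entries should appear as they are logged.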

 

In any case, the console will often show things that never make it to ptys, either because those are connected across the problem (network) or because the additional buffering and layers make them susceptible to the problem. The downside is the lack of scroll-back, which is where serial attached consoles can be very handy.

 

Peter, yeah, no expert here either. I was looking at:

 

kernel BUG at lib/radix-tree.c:355!
Jul 21 23:19:41 Tower kernel: invalid opcode: 0000 [#1] SMP 
Jul 21 23:19:41 Tower kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:12.2/usb1/1-4/1-4:1.0/host1/target1:0:0/1:0:0:0/block/sda/stat

 

The EIP points to a problem in the radix tree insert, and the last sysfs file is the flash's (sda) stat file. That's just one crash, though. If the crashes move around, I agree on RAM or other hardware as maybes.
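When going through a saved syslog, a quick filter for the usual crash markers can save some scrolling. A minimal sketch follows; the function name `find_kernel_faults` and the pattern list are illustrative choices, not an exhaustive set of what the kernel can print.

```shell
# find_kernel_faults FILE
# Prints, with line numbers, the lines in FILE that commonly mark a kernel
# crash. The patterns below are a starting point only.
find_kernel_faults() {
    grep -n -E 'kernel BUG at|invalid opcode|Call Trace|Oops' "$1"
}

# Hypothetical use against a syslog copy saved before the reboot:
#   find_kernel_faults syslog.txt
```

Running it against logs from several freezes makes it easier to see whether the crash lands in the same place each time or moves around.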

Link to comment


Anything's possible. I was VNC'ing into my Mac at home to run the command and the terminal screen was pretty small. I could have missed it. I've definitely got it in there now though.

 

I'm fully prepared to test all hardware issues this weekend. I've bought or borrowed new RAM, SAS cables, and a PSU to be able to begin ruling items out (Couldn't get another SuperMicro card but will use the onboard SATA ports). My plan was/is to swap only one out at a time, run the same tasks I usually did to cause freezes, and see what works and does not. If and once I get the first crash I'll update the thread with the details and log.

 

That said, while it is still too early to say "Wham Bam Thank You RAM" it's the only thing I have replaced so far and the server has been on for 18 hours straight running a parity check and preclearing 3 disks. I'm crossing my fingers but, this is the longest by far it's stayed on since this issue first came up about a week ago...

 

 

Link to comment

Thank you everyone once again for your help. It's been a very full day since I swapped out the RAM and the server hasn't crashed once. Considering how often it was crashing beforehand, I think it is safe to say it was the RAM. That said, I am not an expert on Power Supplies but I am wondering if my original stick could have gone bad due to a faulty PSU.

 

When installing my new RAM I found that the 2nd slot on the mobo is bad. Both new sticks work fine in slot 1, but if I have both slots occupied or just slot 2, the computer won't POST. Therefore, I'm beginning to wonder if I've got an electrical problem f%&*ing up my parts. Does this seem logical or am I grasping at straws?

Link to comment
  • 7 years later...

I know this is a very old thread, but do not rely on memtest alone. As OP experienced, it was still faulty RAM even though memtest reported it as good.

Used to work for a computer repair shop and we used to say "memtest shows if the RAM is bad, not if it is good".

Basically, if memtest reports RAM being bad, it is bad. But if memtest shows no errors it means it can be good or bad.
