
UNRAID stops working...


TedatTNT


To clarify, these are the issues that I have had THROUGH the last two software updates.

 

  • Every time I use the UNRAID heavily (moving/saving/accessing large files or LOTS of small files), I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI. I can still use a keyboard/mouse/monitor.
  • Every time I perform a parity check (or one is performed automatically), I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI. I can still use a keyboard/mouse/monitor.

 

This has been an issue for more than a year! I have posted and asked for help numerous times, but even though there are some very helpful folks here, we haven't found a solution and I am about a month away from giving up and buying a Drobo.

 

Here are things that we've tried:

  • posting the diagnostic report (attached) - although I can only get it after a reboot, so it isn't real helpful
  • moving the Ethernet cable to the other NIC on my MB -- both NICs use chipsets recommended on the HCL
  • Using a PCI-based, add-on NIC
  • Replacing one drive that reported some SMART issues
  • Running memtest, swapping memory, running again, etc.
  • Replacement of the power supply
  • Replacement/reloading the USB flash drive

 

I'm at my wit's end -- I'm hoping somebody has some insight that will help me regain the stability that I had four/five years ago, but haven't had in over a year.

unraid-diagnostics-20160309-0811.zip

Link to comment

To clarify, these are the issues that I have had THROUGH the last two software updates.

 

  • Every time I use the UNRAID heavily (moving/saving/accessing large files or LOTS of small files), I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI. I can still use a keyboard/mouse/monitor.
  • Every time I perform a parity check (or one is performed automatically), I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI. I can still use a keyboard/mouse/monitor.

 

Can you explain what you mean by the above ? 

 

Does "I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI" mean that you cannot access it anymore? Does it come back after a time, or do you need to reboot? Do your shares still function? Does the move/copy/parity check complete, and what are the results?

 

I do not understand what you mean by "I can still use keyboard/mouse/monitor". Do you mean that your PC still functions when your unRAID server has crashed? Why would you think it might not? These things do not seem related.

 

 

This has been an issue for more than a year! I have posted and asked for help numerous times, but even though there are some very helpful folks here, we haven't found a solution and I am about a month away from giving up and buying a Drobo.

 

Here are things that we've tried:

  • posting the diagnostic report (attached) - although I can only get it after a reboot, so it isn't real helpful

 

The first thing to do is get a syslog of the issue at hand. Since you say you can reproduce the issue (just start a copy or a parity check, right?), this is not difficult, and it is described in the troubleshooting FAQ:

 

If you are running headless (no monitor and possibly no keyboard or graphics card), then you can try directing the output of the command above to your flash drive or a data drive. For example, the following will output the last syslog lines to syslogtail.txt on your flash drive. This should allow you to obtain the very last message that the system was able to log.
tail -f --lines=100 /var/log/syslog >/boot/syslogtail.txt


Post the results in this thread.

  • moving the Ethernet cable to the other NIC on my MB -- both NICs use chipsets recommended on the HCL
  • Using a PCI-based, add-on NIC
  • Replacing one drive that reported some SMART issues
  • Running memtest, swapping memory, running again, etc.
  • Replacement of the power supply
  • Replacement/reloading the USB flash drive

 

 

Were all of the above done with the same result? I see you got advice on changing the power supply, which means power issues were suspected. Starting a big copy and starting a parity check both spin up many, if not all, of your drives. What happens when you do just that and only spin up all your drives? Does that crash your system?

 

If it does not, do you run cache_dirs? If so, turn it off, spin up all your drives, and start browsing around your drives. Does that crash your system?

 

I'm at my wit's end -- I'm hoping somebody has some insight that will help me regain the stability that I had four/five years ago, but haven't had in over a year.

 

It might help if you point us to the threads with all the questions that were already asked and that you answered. That saves asking them again.

Link to comment

Can you tell us about your hardware? When you say you had stability four or five years ago, was that on the same hardware you are using today? What changed over a year ago when you began to have these problems? Did you do a hardware update then?

 

I upgraded the motherboard a couple of years ago from an ASRock MB that was short on PCIe to an ASUS P5W-DH Deluxe board that was retired from being an HTPC. I also added a PCIe card for additional SATA ports. It seemed to work beautifully for weeks. I also updated the software at that point. A couple of months later, I began to notice problems.

Link to comment

There have been various reports of this type of issue being solved on V6 by converting all disks from ReiserFS to XFS. Unfortunately it's an involved process and it will take some time, but if you're willing to do it, see here:

 

http://lime-technology.com/forum/index.php?topic=37490.0

 

Thanks -- I will look into this. My fear, though, is that the system would 'hang' during the process and I'd have a hell of a time getting it completed. Still, I'm very tempted to proceed.

Link to comment

There have been various reports of this type of issue being solved on V6 by converting all disks from ReiserFS to XFS. Unfortunately it's an involved process and it will take some time, but if you're willing to do it, see here:

 

http://lime-technology.com/forum/index.php?topic=37490.0

 

Thanks -- I will look into this. My fear, though, is that the system would 'hang' during the process and I'd have a hell of a time getting it completed. Still, I'm very tempted to proceed.

 

It's not that difficult. The easiest way (though it costs something) is to add a drive to the system and format it as XFS, then move the contents of a ReiserFS drive to the XFS drive, and then repeat with the next drive. But I would try to get more certainty that this is the issue. It -is- a lengthy process, and if it doesn't work that would be a bit of a disappointment.
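The "move the contents of one drive to the XFS drive" step could be sketched roughly as below. This is only an illustration under assumptions: copy_disk is a hypothetical helper name, /mnt/disk1 and /mnt/disk5 are example mount points that do not come from this thread, and the destination is assumed to be a freshly formatted, empty disk. The sketch copies and compares only; it never deletes anything from the source.

```shell
#!/bin/bash
# Sketch: copy everything from one array disk to another, then compare.
# copy_disk is a hypothetical helper; the paths in the usage line below are
# examples only and must be replaced with your own disk mount points.
copy_disk() {
    local src="${1%/}" dst="${2%/}"
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: copy_disk SRC_DIR DST_DIR" >&2
        return 1
    fi
    if command -v rsync >/dev/null 2>&1; then
        # -a keeps permissions, ownership, timestamps and symlinks
        rsync -a "$src/" "$dst/" || return 1
    else
        # Fallback when rsync is not installed
        cp -a "$src/." "$dst/" || return 1
    fi
    # Returns 0 only if both trees are identical; the source is left untouched.
    diff -rq "$src" "$dst"
}

# Example (assumed mount points):
#   copy_disk /mnt/disk1 /mnt/disk5 && echo "copy verified"
```

Only after the comparison succeeds would the source drive be reformatted, following the procedure in the linked thread.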

 

Have you made sure to remove ALL of your plugins so that you are running stock unRAID only? See if you can get it to break then.

 

But first of all: get a syslog of the system failing. That should be easy to do and will probably go a long way toward solving the issue.

Link to comment


 
Does "I lose the ability to communicate with my UNRAID server via Telnet or via the web GUI" mean that you cannot access it anymore? Does it come back after a time, or do you need to reboot? Do your shares still function? Does the move/copy/parity check complete, and what are the results?
 
I do not understand what you mean by "I can still use keyboard/mouse/monitor". Do you mean that your PC still functions when your unRAID server has crashed? Why would you think it might not? These things do not seem related.
 
When this happens, I can no longer access the UNRAID server via any means by which one would access it over a network. No web GUI, no telnet, no shares, nothing. Yes, after a reboot (and starting the array), everything works until the next time that it doesn't. Any reading/writing via the network fails. The parity check completes successfully.
 
Keyboard/mouse -- I have a keyboard/mouse/monitor attached to my UNRAID server. These still work and do not appear to have suffered an issue. I can still type in commands directly and they still work.
 

The first thing to do is get a syslog of the issue at hand. Since you say you can reproduce the issue (just start a copy or a parity check, right?), this is not difficult, and it is described in the troubleshooting FAQ:
 

If you are running headless (no monitor and possibly no keyboard or graphics card), then you can try directing the output of the command above to your flash drive or a data drive. For example, the following will output the last syslog lines to syslogtail.txt on your flash drive. This should allow you to obtain the very last message that the system was able to log.
tail -f --lines=100 /var/log/syslog >/boot/syslogtail.txt


Post the results in this thread.
 
I'm not running headless -- is there a different method that I should employ?
 

Were all of the above done with the same result? I see you got advice on changing the power supply, which means power issues were suspected. Starting a big copy and starting a parity check both spin up many, if not all, of your drives. What happens when you do just that and only spin up all your drives? Does that crash your system?
 
If it does not, do you run cache_dirs? If so, turn it off, spin up all your drives, and start browsing around your drives. Does that crash your system?
 
 
 
I was able to spin up all drives and it worked fine. I could browse files / open files on various drives. I don't know what cache_dirs is.

Link to comment


 

 

Okay. That suggests that your array itself continues to run but your network fails. That is why you were advised to change the network card. Got it.

 

If you have a monitor attached you can still use the procedure to capture the syslog... You could also just do:

 

tail -f /var/log/syslog

 

This will show your syslog building up on your monitor. Then bring your system down by starting a parity check, take a picture of the syslog data on your screen with your camera, and post that.
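Because the console lines can scroll faster than a camera can catch them, one option is to watch the log on screen and keep a copy on the flash drive at the same time. The following is a minimal sketch, not a tested unRAID procedure: follow_syslog and snapshot_syslog are hypothetical helper names, and the default paths are simply the ones used in the FAQ command (/var/log/syslog and /boot/syslogtail.txt).

```shell
#!/bin/bash
# Sketch: two ways to keep the last syslog lines in a file.
# Both function names are hypothetical; the default paths come from the
# FAQ command "tail -f --lines=100 /var/log/syslog >/boot/syslogtail.txt".

# Follow the log: tee prints each new line on the screen and writes the
# same line to dest (an existing dest file is overwritten).
# Runs until interrupted with Ctrl+C.
follow_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest="${2:-/boot/syslogtail.txt}"
    tail -f --lines=100 "$src" | tee "$dest"
}

# One-shot variant: write the last N lines to dest and return immediately.
snapshot_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest="${2:-/boot/syslogtail.txt}"
    local lines="${3:-100}"
    tail -n "$lines" "$src" > "$dest"
}

# Example:
#   follow_syslog                      # watch and record until Ctrl+C
#   snapshot_syslog /var/log/syslog /boot/syslogtail.txt 200
```

Since the console keeps working when the network drops, the one-shot variant can also be run after the failure has occurred, and the resulting text file can be attached instead of a photo.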

 

Link to comment

 

It's not that difficult. The easiest way (though it costs something) is to add a drive to the system and format it as XFS, then move the contents of a ReiserFS drive to the XFS drive, and then repeat with the next drive. But I would try to get more certainty that this is the issue. It -is- a lengthy process, and if it doesn't work that would be a bit of a disappointment.

 

Have you made sure to remove ALL of your plugins so that you are running stock unRAID only? See if you can get it to break then.

 

But first of all: get a syslog of the system failing. That should be easy to do and will probably go a long way toward solving the issue.

 

I have a new 3TB just sitting here and an empty slot - could definitely start with this.

 

The only plugin (other than the included UNRAID Server OS and Dynamix webGUI) is ProFTPd -- I share files with others, occasionally. To test without it, can it be temporarily disabled, or would I have to uninstall it completely and reconfigure it again later?

Link to comment

 

 

If you have a monitor attached you can still use the procedure to capture the syslog... You could also just do:

 

tail -f /var/log/syslog

 

This will show your syslog building up on your monitor. Then bring your system down by starting a parity check, take a picture of the syslog data on your screen with your camera, and post that.

 

Okay - I'm on it!!

Link to comment

 

tail -f /var/log/syslog

 

This will show your syslog building up on your monitor. Then bring your system down by starting a parity check, take a picture of the syslog data on your screen with your camera, and post that.

 

When I went to take a photo, the lines were changing pretty quick, so I missed a bunch of them -- should I do it again -- maybe shoot video?

 

Ted

UnraidScreen---2016-03-09---001.jpg.b7b9de9d524533c1e73ba0768124ddb5.jpg

Link to comment

Keyboard/mouse -- I have a keyboard/mouse/monitor attached to my UNRAID server. These still work and do not appear to have suffered an issue. I can still type in commands directly and they still work.

 

I hadn’t noticed this part before. So your server continues to work, except that you can no longer use the network?

 

In that case, when it happens again, type diagnostics on the console, then attach the zip created on your flash drive.

 

Also, in this case, converting to XFS may not help. I’d still recommend doing it anyway, but only after solving this issue.

 

Link to comment

 

tail -f /var/log/syslog

 

This will show your syslog building up on your monitor. Then bring your system down by starting a parity check, take a picture of the syslog data on your screen with your camera, and post that.

 

When I went to take a photo, the lines were changing pretty quick, so I missed a bunch of them -- should I do it again -- maybe shoot video?

 

Ted

 

As far as I can tell, that at least confirms that something at the network level is failing. New interfaces are being created, which must be because they became unavailable somewhere earlier in the log. See what johnie.black posted below. That will get you a complete diagnostics set. I was not aware that could be done from the console. It should make things clear (at least we will then know what the issue is, and a little bit of googling then usually finds a solution).

Link to comment

 

tail -f /var/log/syslog

 

This will show your syslog building up on your monitor. Then bring your system down by starting a parity check, take a picture of the syslog data on your screen with your camera, and post that.

 

When I went to take a photo, the lines were changing pretty quick, so I missed a bunch of them -- should I do it again -- maybe shoot video?

 

Ted

 

So, very strange...

 

It only took a moment to make it freeze before -- just accessed a few files and hadn't even begun the Parity Check.

 

This time, with a video camera recording the screen, I have transferred dozens of GB of data and been running a Parity Check concurrently and it hasn't failed. I even tried accessing identical files.

 

It seems that the fix is to keep a video camera recording the screen -- that was easier than I thought.

 

 

Link to comment

I originally thought you were getting a complete system lockup where even the console was not working. These problems can be particularly hard to resolve as at the point of failure you can no longer get access to diagnostic information.  However if the console remains working then the problem is elsewhere.

 

If you have a working console session then you should be able to use the diagnostics command to get the full diagnostic ZIP file put into the 'logs' folder on the USB stick. This would include the full syslog, among other information useful in diagnosing problems. As long as that command completes OK, it should mean that the unRAID server is basically functioning, and the problem then probably comes down to working out why you are losing access via the network.
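Alongside the diagnostics ZIP, it may help to record what the network interface looks like at the moment access is lost. The sketch below is an assumption-laden illustration rather than an official unRAID tool: net_snapshot is a hypothetical helper, eth0 is an assumed interface name, and /boot/netstate.txt is a made-up file name on the flash drive. It relies only on the standard Linux commands ip, ethtool, and dmesg; if one of them is missing, the error message is simply recorded in the file.

```shell
#!/bin/bash
# Sketch: save the current state of a network interface to a text file.
# net_snapshot, the eth0 default and /boot/netstate.txt are assumptions
# made for illustration; adjust them to your own system.
net_snapshot() {
    local dest="${1:-/boot/netstate.txt}"
    local iface="${2:-eth0}"
    local cmd
    {
        echo "=== net snapshot $(date) ==="
        for cmd in "ip -s link show $iface" \
                   "ip addr show $iface" \
                   "ethtool $iface" \
                   "dmesg"; do
            echo "--- $cmd ---"
            # Word splitting of $cmd is intentional; keep the last 40 lines
            # of each command's output, including any error messages.
            $cmd 2>&1 | tail -n 40
        done
    } > "$dest"
}

# Example, typed on the console after the network has stopped responding:
#   net_snapshot /boot/netstate.txt eth0
```

Comparing a snapshot taken while everything works with one taken after the failure could show whether the link went down, the address disappeared, or the driver logged errors.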

Link to comment

Are you using a Marvell 88E8053 gigabit Ethernet interface?

 

 

 

Yes, the dual NICs are Marvell Yukon 88E8053s. I have the same issues no matter which one is being used.

 

Is there a HIGHLY recommended gigabit LAN card that I could purchase and test with? I'm asking because the Marvell Yukons are on the hardware compatibility list.

Link to comment

Archived

This topic is now archived and is closed to further replies.
