
unRAID box keeps disappearing from network


jdmlight


I think this is a hardware problem, so I posted in the hardware support section.

 

At this point I'm a little lost: my unRAID box keeps disappearing from the network after being on for 1-2 days.  The only way I can get it working again is a hard reboot, since I can't access anything to do a clean shutdown.  I really don't know where to start diagnosing the problem, as it could be a million things.  I can't tell whether unRAID itself is crashing, because the box is headless (no monitor plugged in).  I'll post what I think is relevant, but unfortunately I don't have any logs from just before a crash, since I have no access to the box once it goes down...

 

Hardware:

Antec 900 case

Gigabyte GA-EP45-UD3P motherboard

2gb 1066mhz OCZ RAM

Intel Core 2 Duo E7300 (might be E7400, I don't remember exactly)

ATI Radeon 4670 (overkill I know, but it's what I had)

Xigmatek S1283 CPU cooler (with 120mm fan)

Corsair 550w PSU

 

Drives:

Western Digital WD15EARS 1.5tb SATA drive (as parity)

Seagate ST3650640A 650gb IDE drive (as disk1)

Seagate ST3160812AS 160gb SATA drive (as disk2)

Sony 2gb flash drive (bought brand new for unRAID)

 

Software:

unRAID Basic 4.5.1

unMenu 1.2 with all plug-ins

FEMUR Firefox monitoring plugin

 

If there's anything else that would make diagnosis easier, just ask and I'll work on finding the relevant information.

Link to comment

Can you "ping" the server?

 

Can you get to the unMENU web-interface?

 

Neither of those depends on the unRAID software itself, but both need network connectivity.  If you can ping the server, odds are the cabling to it is good.

 

If you can't get to either of those, you've probably got a cabling issue, or a router issue.

Time to NOT be headless.
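
The two checks above can also be scripted from another machine on the LAN. This is only a rough sketch: it tests whether a TCP port accepts connections, which is not the same as an ICMP ping, and the hostname and port numbers in the comments (tower, 80 for the stock web interface, 8080 for unMENU) are assumptions that you should replace with your own values.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_server(host: str, ports: dict) -> dict:
    """Map each label in `ports` (label -> port number) to True/False."""
    return {label: tcp_reachable(host, port) for label, port in ports.items()}


# Example call (hostname and ports are assumptions, adjust to your setup):
#   check_server("tower", {"unRAID web": 80, "unMENU": 8080})
```

If every port reports False and a real ping also fails, the problem is below the unRAID software: cabling, the router, or the box itself.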

 

Joe L.

Link to comment

Can you "ping" the server?

 

Can you get to the unMENU web-interface?

No and no.  Pinging the server just gives me "Host not available" errors.

OK, I'll try plugging it in to a different router to see if I can access it.  I'll post back to see if that makes a difference.  I'll also figure out a way to plug my server into a monitor...I don't have an extra monitor, but I'll figure something out.

 

Edit: Update: Hmm.  I tried connecting just my Mac mini and my unRAID box to my Linksys WRT54G (known working) using known-working Ethernet cables.  In the router's web interface, my Mac mini shows up as an active client, but my unRAID box is nowhere to be found.  I did not restart (I had DHCP enabled on my unRAID box, so it should still get a new IP address if it's not frozen).  I also tried plugging in a monitor (again, without restarting) and nothing showed up on the display.  Time to restart and connect a monitor, I guess...

Edit2: Just tried restarting my unRAID box.  It appeared on the network long enough for me to view the syslog (through unMENU), which I have attached.  It looked like my 650gb drive was in the process of mounting when the box became inaccessible again.  Hmm...

syslog.txt

Link to comment

I don't see anything really unusual there other than you don't have mail configured, nor do you have a /boot/custom/etc/rc.d/ folder... but don't worry, most people do not unless they put one there.

 

As you said, it was starting a parity check.

Link to comment

Right...strange new behavior.  So I moved my unRAID box by my monitor and plugged everything in: power, monitor, ethernet, keyboard.  Went to turn it on and it doesn't show me the BIOS screen.  It just spins up the hard drives, waits maybe 20 seconds, then powers itself down and restarts.  This cycle endlessly repeats.

Link to comment

That sounds like a hardware issue to me.    ;)

 

Until you can get it to stay powered up, you'll never keep network connectivity established.

Link to comment

That sounds like a hardware issue to me.     ;)

 

Until you can get it to stay powered up, you'll never keep network connectivity established.

Really? I never knew that... :P

 

It was staying powered up before, but it would just disconnect from the network after 1-2 days.  Now it won't even start...and I don't know where to start diagnosing this issue.  My guess is that it's not heat-related, as this computer ran perfectly for a year and a half as a gaming computer ;).  Would bad RAM cause this?  Graphics card died?  Processor died?  Motherboard died?  How do I test for these things? ???

Link to comment

Start by removing the cards, RAM, etc.  You've got to get it to POST and beep at you that there is no RAM.

If it can't get that far, then the fault has to be in what remains: the MB itself or the power supply.  Once it does POST, add pieces back in one at a time, RAM first.  Make sure you set the RAM voltage, timing, and speed appropriately for your specific RAM.  Only after you can see the BIOS screen can you go further and assign the boot device.

 

Who knows, it might just be a loose connection, or loose board, and moving the server made it worse.

 

Joe L.

Link to comment

Start by removing the cards, ram, etc.  You've got to get it to POST and beep at you that there is no RAM.

Heh.  This made me realize that I never bought a case speaker.  I'll pick up one of those before continuing.

Wow.  I should've bought a case speaker long before now.  It was giving me continuous short beeps, meaning something's wrong with the RAM.  Pulled out the RAM and reseated it, booted right up.  Now we just wait and see if it continues to work...

Link to comment

oookay.  This scares me a little.  My 650gb IDE drive may be bad...  Here is a snippet from the syslog:

Mar 8 23:39:21 JohnsMediaServer kernel: md: disk1 write error

Mar 8 23:39:21 JohnsMediaServer kernel: handle_stripe write error: 45970272/1, count: 1

The above two lines were repeated a large number of times with different blocks listed.

 

Mar 8 23:39:21 JohnsMediaServer kernel: md: recovery thread sync completion status: -4

Mar 8 23:39:21 JohnsMediaServer kernel: md: recovery thread woken up ...

Mar 8 23:39:21 JohnsMediaServer emhttp: disk_temperature: ioctl (smart_enable): Input/output error

Mar 8 23:39:21 JohnsMediaServer emhttp: disk_temperature: ioctl (smart_enable): Input/output error

Mar 8 23:39:21 JohnsMediaServer kernel: md: recovery thread has nothing to resync

Mar 8 23:39:21 JohnsMediaServer emhttp: disk_temperature: ioctl (smart_enable): Input/output error

 

Then the unRAID main web interface has it listed as disabled (little dot is red) with 576 errors. :o
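
When the same pair of lines repeats hundreds of times, a quick tally is easier to read than the raw log. Here is a minimal sketch that counts the `md: diskN write error` lines per disk. The regular expression is based only on the lines quoted above, so other error formats (for example, errors on the parity drive) are not matched, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: md: disk1 write error"
WRITE_ERROR = re.compile(r"kernel: md: (disk\d+) write error")


def count_write_errors(lines):
    """Return a Counter mapping disk name -> number of write-error lines."""
    counts = Counter()
    for line in lines:
        match = WRITE_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample taken from the snippet above.
SAMPLE = [
    "Mar 8 23:39:21 JohnsMediaServer kernel: md: disk1 write error",
    "Mar 8 23:39:21 JohnsMediaServer kernel: handle_stripe write error: 45970272/1, count: 1",
    "Mar 8 23:39:21 JohnsMediaServer emhttp: disk_temperature: ioctl (smart_enable): Input/output error",
]
```

To run it against a saved log, pass an open file, e.g. `count_write_errors(open("syslog.txt"))`.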

Link to comment

Attach a full copy of your syslog to the next post.  (zip it if too big) or use pastebin.

 

Whatever you do, DO NOT press the button labeled "Restore".  It has nothing to do with restoring data.  If you replace the 650Gig drive, make absolutely certain you press "Start" to start the process of rebuilding the contents of the old drive onto the new replacement.

Link to comment

Yep, I read about the Restore button.  I used it once when moving from experimenting with a bunch of old drives (30, 60, and 160gb drives lol) to using my real hard drives (160gb, 650gb, 1.5tb) - it's basically a reformat button, NOT data restoration.

 

This syslog is after a reboot (I shut it down overnight because I was worried about something failing in the middle of the night).  The 650gb drive is still kicked out of the array, but now there are none of the I/O error messages.

 

I also am attaching the SMART report for the drive, I will let the long SMART test run while I am in class.

syslog-2010-03-09.txt

smartreport.txt
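
For anyone following along, the SMART report and the long self-test mentioned here are typically produced with smartctl from smartmontools. Below is a small sketch of a Python wrapper around it. The device path in the comment (/dev/hda) is only a guess for an IDE drive, and the wrapper functions are made up for illustration; only the smartctl options `-a`, `-H` and `-t long` are real.

```python
import subprocess

# smartctl options used here:
#   -a       full SMART report
#   -H       overall health assessment
#   -t long  start the extended (long) self-test in the background
ACTIONS = {
    "report": ["-a"],
    "health": ["-H"],
    "long_test": ["-t", "long"],
}


def smartctl_argv(device: str, action: str) -> list:
    """Build the smartctl command line for one of the ACTIONS above."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ["smartctl", *ACTIONS[action], device]


def run_smartctl(device: str, action: str) -> str:
    """Run smartctl and return its text output.

    check=False because smartctl uses non-zero exit codes as status
    flags even when it ran and produced a report.
    """
    result = subprocess.run(
        smartctl_argv(device, action),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Example (device path is an assumption):
#   run_smartctl("/dev/hda", "long_test")   # start the test
#   run_smartctl("/dev/hda", "report")      # read results once it finishes
```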

Link to comment

Yep, I read about the Restore button.  I used it once when moving from experimenting with a bunch of old drives (30, 60, and 160gb drives lol) to using my real hard drives (160gb, 650gb, 1.5tb) - it's basically a reformat button, NOT data restoration.

Actually, it is more like a "reset array configuration" button, since it does not write to the disks or format them, but instead sets a new drive configuration in the config/super.dat file that tracks the disks assigned to the array.  It is created by the currently assigned (and working) drives.

 

 

This syslog is after a reboot (I shut it down overnight because I was worried about something failing in the middle of the night).  The 650gb drive is still kicked out of the array, but now there are none of the I/O error messages.

 

I also am attaching the SMART report for the drive, I will let the long SMART test run while I am in class.

Too bad you did not capture the syslog before you rebooted.  It would have given you the error that kicked the drive out of the array.
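
One way to avoid losing the log on a hard reboot is to copy it to the flash drive at intervals. The following is a rough sketch, not a tested unRAID add-on. It assumes the live log is at /var/log/syslog and that the flash drive is mounted at /boot; the /boot/logs folder and the function name are made up for illustration, and a Python 3 interpreter would have to be installed on the server.

```python
import shutil
import time
from pathlib import Path


def snapshot_syslog(src="/var/log/syslog", dest_dir="/boot/logs", keep=20):
    """Copy the syslog to dest_dir under a timestamped name.

    Only the newest `keep` snapshots are retained, so the flash drive
    does not fill up. Returns the path of the new snapshot.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"syslog-{stamp}.txt"
    shutil.copy2(src, dest)

    # Timestamped names sort chronologically, so the oldest come first.
    snapshots = sorted(dest_dir.glob("syslog-*.txt"))
    for old in snapshots[:-keep]:
        old.unlink()
    return dest
```

Run from cron every few minutes, this would leave the last snapshot taken before a hang on the flash drive. Keep the interval modest, since every copy is a write to the flash.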

 

Right now, the syslog looks good, and the SMART report looks good too.  To get the disk back in the array you will need to:

1. Stop the array.

2. Un-assign the disk that has failed.

3. Start the array without it assigned.  (This will cause the array to forget the old disk's model and serial number.)

4. Stop the array once more.

5. Re-assign the failed disk.  (It will now think it is a replacement disk, since it forgot the model/serial number in the step above.)

6. Press "Start" to begin the process of reading from parity and all the other drives to reconstruct the contents of the failed drive.  You must reconstruct it because it was taken out of service when a write to it failed, so we know the physical disk does not contain the proper data.  We also know the write to the parity drive succeeded, so parity has the correct data to reconstruct the drive that had failed.

 

Once the disk is re-constructed back onto itself, you will be protected by parity once more.  Hopefully, the reason for the failures was always the loose RAM strip.
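
To see why parity plus the other drives is enough to rebuild a disk, here is a toy illustration. It assumes plain XOR parity over equal-sized blocks, which is how single-parity protection is usually described. It is not unRAID's actual code, and the function names are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block = XOR of the same block from every data disk."""
    return xor_blocks(data_blocks)


def rebuild_missing(parity, surviving_blocks):
    """Recover the one missing data block from parity and the survivors."""
    return xor_blocks([parity, *surviving_blocks])


# Two tiny "disks" standing in for disk1 and disk2.
disk1 = b"\x0f\xf0"
disk2 = b"\x55\xaa"
parity = compute_parity([disk1, disk2])

# Pretend disk1 was kicked out of the array and rebuild it.
rebuilt = rebuild_missing(parity, [disk2])
```

Because XOR is its own inverse, `rebuilt` equals the original `disk1`. This only works while exactly one disk is missing, which is why the array is unprotected until the rebuild finishes.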

 

Joe L.

Link to comment

I did the above, waited a while for it to rebuild, and so far all looks good!  Hopefully it'll still be working in a few days (the loose RAM does make sense based on the problems I was having - plus, I had recently dusted out my server with one of those air dusters and could've easily knocked the RAM loose).  Thanks a bunch!

Link to comment
