
[SOLVED] Uh-oh...Tower has dropped off the LAN twice today...



I woke up this morning and found that the unRAID server had fallen offline at around 6AM.  It wasn't reachable by http or telnet, and the router reported it as unavailable.  I tried a single press of the power button, which should trigger a clean shutdown, but got no response, so I'm calling that a crash.  I finally had to do a hard shutdown by holding the power button.

When I started it up again it booted fine, but reported 1 sync error and started a parity check.  It took about 10 minutes for most of the drives to mount; the one exception, the only IDE drive, took about 20 minutes to show up.  The parity check finished up around 5:45PM and reported no errors.  I started copying a 2 GB file over to the tower and about a third of the way through it dropped offline again and crashed! :o

I checked my patch cables and they seem to be OK.  All the temps looked fine.  The router hasn't dropped anything else, so I'm guessing the problem isn't in the router or switch, but that's not based on very much.

I started up again and it's doing a parity check again, but shows no sync errors and all the drives mounted quickly this time.  I'm including 2 syslogs, one I took this afternoon before the crash and one just now after start up.  They look awfully similar, but I don't know how to interpret them.

 

A few lines that came up orange in UNMENU in both syslogs could be a concern:

Jul 25 09:13:58 unRAID kernel: spurious 8259A interrupt: IRQ7.

 

Jul 25 09:13:58 unRAID kernel: ACPI Warning for \_SB_.PCI0._PRT: Return Package type mismatch at index 2 - found [NULL Object Descriptor], expected Integer/Reference (20090903/nspredef-1012)

 

It also reported replaying xxxx transactions in xxxx seconds, different in each syslog, but orange in unmenu.  An example:

 

Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): replayed 2082 transactions in 261 seconds

Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): Using r5 hash to sort names

Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): replayed 2257 transactions in 283 seconds

Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): Using r5 hash to sort names
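
For what it's worth, "replayed N transactions" lines like these are normally ReiserFS replaying its journal when a disk is mounted after an unclean shutdown, so they are expected after a hard power-off, and a replay that runs for several minutes would also go some way towards explaining slow mounts.  Here is a rough Python sketch for pulling those lines out of a saved syslog; the function name and the idea of running it on a saved copy are my own assumptions, not anything unRAID or unMENU provides:

```python
import re

# Matches kernel lines such as:
#   REISERFS (device md4): replayed 2082 transactions in 261 seconds
REPLAY_RE = re.compile(
    r"REISERFS \(device (?P<dev>\w+)\): "
    r"replayed (?P<count>\d+) transactions in (?P<secs>\d+) seconds"
)


def find_journal_replays(lines):
    """Return (device, transactions, seconds) for each journal-replay line."""
    results = []
    for line in lines:
        match = REPLAY_RE.search(line)
        if match:
            results.append(
                (match.group("dev"), int(match.group("count")), int(match.group("secs")))
            )
    return results


if __name__ == "__main__":
    # Sample lines copied from the syslog excerpt above.
    sample = [
        "Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): replayed 2082 transactions in 261 seconds",
        "Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): Using r5 hash to sort names",
        "Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): replayed 2257 transactions in 283 seconds",
    ]
    for dev, count, secs in find_journal_replays(sample):
        print(f"{dev}: {count} transactions replayed in {secs} s")
```

To run it against a real log, pass it the lines of the saved file, e.g. `find_journal_replays(open("syslog.txt"))`, where `syslog.txt` is whatever name you saved the log under.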

 

And then duplicate objects showed up, but that's happened before and I don't think it's a problem...for example:

 

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk3/./._.TemporaryItems

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk4/./.apdisk

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk4/./._.TemporaryItems

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk5/./.apdisk

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk5/./._.TemporaryItems

 

I'm not sure where to go from here.   Suggestions?  THANKS!

2_syslogs.zip

Link to comment

Sorry, but I can't help. My system started doing the same thing, almost to the day. Sunday it went off the LAN (the green network light went orange). I forced a reboot and had to perform a parity check. It went offline again later the same day, so another reboot and parity check. This AM it wasn't on the LAN again. I have been on 4.6 for the past several months.

Link to comment

Must be Y2K just a little late!  I've had it up and running for about 2 hours and just played back some media and there are no problems.  But the crashes happened when I was writing data.  I'm going to hunt around a little more to see what I can find.  I'll report back.  In the meantime, anyone with suggestions would be much appreciated.

Link to comment

Weirdly enough, I had the same thing happen out of the blue yesterday and just right now.  I'm not at home and just logged into one of my home computers, but my unRaid tower just fell off the map (so I came here to check on the issue).  The array was being written to both times via newsgroup and torrent downloads, but that's a regular occurrence for me.

 

Yesterday, when I got home I checked the monitor hooked up to the unRaid server and there was nothing on the screen.  I had to restart, as it seemed to have crashed (not accessible via any means: telnet, http, smb, nfs).  I also checked the switch by swapping ports with a device that was reachable on the network, but that didn't help.  On reboot, it took about 10-15 minutes to start the array and it was doing a parity check.  All drive lights were green.

 

I'm running 4.7 and haven't changed anything since 4.7 came out.  Plugin-wise, I'm using apcupsd and unmenu.  As I'm not at home and can't restart the server, I'll post the syslog tonight.

Link to comment

Finally got to 4.7 after removing HPA from 2 drives. Will see if it is stable for a few days.

 

The only pattern I can think of is that I had been running SlimServer. I had played music for minutes or hours on each day the server went offline, but it never went offline while I was actually streaming.

Link to comment

So last night when I got home, I verified that the system had again crashed.  Upon reboot, I was greeted with a red dot.  I rejiggled the SATA cable and rebooted again, and it saw the drive.  It's been re-added and the re-build completed.  The SMART data doesn't look bad on the drive:

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   126   126   054    Pre-fail  Offline      -       173
  3 Spin_Up_Time            0x0007   122   122   024    Pre-fail  Always       -       313 (Average 314)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3542
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   129   129   020    Pre-fail  Offline      -       30
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       17709
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       3994
193 Load_Cycle_Count        0x0012   097   097   000    Old_age   Always       -       3994
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Lifetime Min/Max 15/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
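
In case it helps anyone reading a table like this, here is a small Python sketch of the rule of thumb I'd use, not an official smartmontools check: flag any attribute whose normalized VALUE has fallen to or below THRESH, plus any non-zero raw count on the attributes people usually watch for failing sectors or bad cabling (5, 196, 197, 198, 199).  The function names and the choice of watched IDs are my own assumptions.

```python
# Attribute IDs commonly watched for sector trouble (5, 196, 197, 198)
# and cable/link trouble (199). This selection is a rule of thumb only.
WATCH_IDS = {5, 196, 197, 198, 199}


def parse_smart_attributes(text):
    """Parse smartctl-style attribute rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": " ".join(parts[9:]),  # raw value may carry extra text
        })
    return rows


def flag_attributes(rows):
    """Return (name, reason) pairs for attributes that deserve a closer look."""
    flagged = []
    for row in rows:
        raw_first = row["raw"].split()[0]
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            flagged.append((row["name"], "normalized value at or below threshold"))
        elif row["id"] in WATCH_IDS and raw_first.isdigit() and int(raw_first) > 0:
            flagged.append((row["name"], "non-zero raw count"))
    return flagged


if __name__ == "__main__":
    # A few rows copied from the table above.
    sample = """
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""
    for name, reason in flag_attributes(parse_smart_attributes(sample)):
        print(f"{name}: {reason}")
```

A clean result from a check like this only says the drive isn't reporting problems; it doesn't rule out a loose cable or a flaky port.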

 

I don't know how to grab the old syslog that shows when the system crashed over the last two days.  When I view the syslog in unMenu it only shows the last day.  Here is my last syslog:

untitled.txt
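
On keeping a syslog that survives a crash: unRAID keeps its log in RAM, so it is gone after a reboot, and the usual workaround is to keep copying it somewhere persistent while the server is up.  Below is a minimal Python sketch of that idea.  It assumes Python is available wherever you run it, and both paths in the usage comment (`/var/log/syslog` and a file on the flash drive under `/boot`) are assumptions you should adjust for your setup.

```python
import os
import time


def copy_new_lines(src_path, dst_path, offset):
    """Append whatever was written to src_path since `offset` onto dst_path.

    Returns the new offset. If the source is shorter than the offset
    (for example it was recreated after a reboot), copying restarts at 0.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        offset = 0
    if size == offset:
        return offset
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        data = src.read(size - offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # push to disk so a hard crash loses as little as possible
    return offset + len(data)


def follow(src_path, dst_path, interval=5.0):
    """Poll src_path forever, copying new log data to dst_path."""
    offset = 0
    while True:
        offset = copy_new_lines(src_path, dst_path, offset)
        time.sleep(interval)


# Example usage (paths are assumptions, see the note above):
# follow("/var/log/syslog", "/boot/syslog-copy.txt")
```

Bear in mind that writing to a USB flash drive every few seconds adds wear, so I would only leave something like this running while chasing a crash.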

Link to comment

Hard crashes like you're describing could be related to memory or CPU rather than disk. Check if the CPU fan is still functioning properly and not too dusty, check airflow around memory modules as well. When the server's been running for a while, check CPU temps (in UnMenu, open the system info tab, hit the CPU info button and scroll down to the bottom). Also run memtest to rule out faulty memory.

Link to comment


I'll try out memtest tonight and report back.

 

I don't think it's a temperature-related issue, as the server is in my cool basement (around 19-22 degrees Celsius throughout the year), and the fans in the case (Antec P190) are all running at high speed (noise isn't an issue where the server is).  That case has all the fan slots filled and airflow is good throughout the case.  The drive racks all have fans (I replaced all the stock rack fans with thick 40mm Delta high-speed-whiners).  My CPU temperature ranges from 30-35 degrees Celsius (currently at 33).

Link to comment

Well, everything seems to be fine now.  I don't know what happened, but after it crashed the 2nd time, I had a running syslog going thru telnet and it didn't happen again.  I've had the server running now for about 3 days, solid as a rock.  BTW, we just had a power outage here and the UPS worked magnificently!  Gotta love UPS

Link to comment

