[SOLVED] Uh-oh...Tower has dropped off the LAN twice today...

abernardi · July 26, 2011

I woke up this morning and found that the unRAID server had fallen offline at around 6AM. It wasn't reachable by http or telnet, and the router reported it unavailable. I tried to shut down with one click of the power button, which should make it shut down cleanly, but got no response. So that's a crash. I finally had to do a hard shutdown by holding the power button.

When I started it up again it booted fine, but reported 1 sync error and started a parity check. It took about 10 minutes for most of the drives to mount; one, the only IDE drive, took about 20 minutes to show up. The parity check finished around 5:45PM and reported no errors. I started copying a 2 GB file over to the tower and about 1/3 of the way through it dropped offline again and crashed!

I checked my patches, they seem to be OK. All the temps looked fine. The router hasn't dropped anything else, so I'm guessing it's not in the router or switch, but that's not based on very much.

I started up again and it's doing a parity check again, but shows no sync errors and all the drives mounted quickly this time. I'm including 2 syslogs, one I took this afternoon before the crash and one just now after start up. They look awfully similar, but I don't know how to interpret them.

A few lines that came up orange in UNMENU in both syslogs could be a concern:

Jul 25 09:13:58 unRAID kernel: spurious 8259A interrupt: IRQ7.
Jul 25 09:13:58 unRAID kernel: ACPI Warning for \_SB_.PCI0._PRT: Return Package type mismatch at index 2 - found [NULL Object Descriptor], expected Integer/Reference (20090903/nspredef-1012)

It also reported replaying xxxx transactions in xxxx seconds, different in each syslog, but orange in unmenu. An example:

Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): replayed 2082 transactions in 261 seconds
Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): Using r5 hash to sort names
Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): replayed 2257 transactions in 283 seconds
Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): Using r5 hash to sort names
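
ReiserFS normally replays its journal when a filesystem is mounted after an unclean shutdown, so lines like these are probably a symptom of the hard power-off rather than a cause of the crash. To compare the replay counts between two syslogs, here is a rough Python sketch. The function name and the regular expression are my own and only match the line format shown above; other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
# "... kernel: REISERFS (device md4): replayed 2082 transactions in 261 seconds"
REPLAY_RE = re.compile(
    r"REISERFS \(device (?P<device>\w+)\): "
    r"replayed (?P<count>\d+) transactions in (?P<seconds>\d+) seconds"
)


def find_journal_replays(syslog_text):
    """Return (device, transactions, seconds) for every journal replay line."""
    replays = []
    for line in syslog_text.splitlines():
        match = REPLAY_RE.search(line)
        if match:
            replays.append(
                (match["device"], int(match["count"]), int(match["seconds"]))
            )
    return replays


# Sample lines copied from the syslog excerpt in this post.
sample = """\
Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): replayed 2082 transactions in 261 seconds
Jul 25 09:18:23 unRAID kernel: REISERFS (device md4): Using r5 hash to sort names
Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): replayed 2257 transactions in 283 seconds
Jul 25 09:18:45 unRAID kernel: REISERFS (device md3): Using r5 hash to sort names
"""

for device, count, seconds in find_journal_replays(sample):
    print(f"{device}: {count} transactions replayed in {seconds} seconds")
```

Replays of several minutes per disk would also go some way toward explaining why the drives took so long to mount after the first crash.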

And then duplicate objects showed up, but that's happened before and I don't think it's a problem...for example:

Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk3/./._.TemporaryItems
Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk4/./.apdisk
Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk4/./._.TemporaryItems
Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk5/./.apdisk
Jul 25 09:24:31 unRAID shfs: duplicate object: /mnt/disk5/./._.TemporaryItems

I'm not sure where to go from here. Suggestions? THANKS!

2_syslogs.zip

abernardi · July 27, 2011

Anyone?

JDGJr · July 28, 2011

Sorry, but I can't help. My system started doing the same thing, almost to the day. Sunday it went off the LAN (green network light went orange). I forced a reboot and had to perform a parity check. It went offline later the same day, so another reboot and parity check. This AM it wasn't on the LAN again. I have been on 4.6 for the past several months.

abernardi · July 28, 2011

Must be Y2K just a little late! I've had it up and running for about 2 hours and just played back some media and there are no problems. But the crashes happened when I was writing data. I'm going to hunt around a little more to see what I can find. I'll report back. In the meantime, anyone with suggestions would be much appreciated.

brandon · July 28, 2011

Weirdly enough, I had the same thing happen out of the blue yesterday and again just now. I'm not at home and just logged into one of my home computers, but my unRAID tower just fell off the map (so I came here to check on the issue). The array was being written to both times via newsgroup and torrent downloads, but that's a regular occurrence for me.

Yesterday, when I got home I checked the monitor hooked up to the unRAID server and there was nothing on the screen. I had to restart, as it seemed to have crashed (not accessible via any means: telnet, http, smb, nfs). I also checked the switch and swapped ports with a device that was accessible via the network, but that didn't help. On reboot, it took about 10-15 minutes to start the array and it was doing a parity check. All drive lights were green.

I'm running 4.7 and haven't changed anything since 4.7 came out. Plugin-wise, I'm using apcupsd and unmenu. As I'm not at home and can't restart the server, I'll post the syslog tonight.

dgaschk · July 28, 2011

Disable all addons and see if the problem continues. Then add them back one at a time to determine the culprit.

JDGJr · July 29, 2011

Finally got to 4.7 after removing HPA from 2 drives. Will see if it is stable for a few days.

The only pattern I can think of was running SlimServer. I had played music for minutes or hours each day the server went offline, but it didn't go offline while I was streaming music.

brandon · July 29, 2011

So last night when I got home, I verified that the system had again crashed. Upon reboot, I was greeted with a red dot. I rejiggled the SATA cable and rebooted again, and it saw the drive. It's been re-added and the re-build completed. The SMART data doesn't look bad on the drive:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   126   126   054    Pre-fail  Offline      -       173
  3 Spin_Up_Time            0x0007   122   122   024    Pre-fail  Always       -       313 (Average 314)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3542
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   129   129   020    Pre-fail  Offline      -       30
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       17709
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       3994
193 Load_Cycle_Count        0x0012   097   097   000    Old_age   Always       -       3994
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Lifetime Min/Max 15/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
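
A report like this can also be checked by script instead of by eye. The sketch below assumes the ten-column attribute layout shown above. The function names and the choice of attributes to watch (5, 197, 198, 199) are my own picks for illustration, not anything unRAID or smartmontools prescribes.

```python
# Attribute IDs whose raw count should normally stay at zero
# (an assumption for this sketch, not an official list).
WATCH_IDS = {5, 197, 198, 199}


def parse_smart_rows(report):
    """Parse attribute rows from smartctl-style text into dicts."""
    rows = []
    for line in report.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) == 10 and fields[0].isdigit():
            rows.append({
                "id": int(fields[0]),
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            })
    return rows


def smart_warnings(rows):
    """Return human-readable warnings for rows that look suspicious."""
    warnings = []
    for row in rows:
        # A normalized value at or below a non-zero threshold means failure.
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            warnings.append(
                f"{row['name']}: value {row['value']} "
                f"at or below threshold {row['thresh']}"
            )
        first_raw = row["raw"].split()[0]
        if row["id"] in WATCH_IDS and first_raw.isdigit() and int(first_raw) > 0:
            warnings.append(f"{row['name']}: raw count is {first_raw}")
    return warnings


# A few rows copied from the report in this post.
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
"""

for warning in smart_warnings(parse_smart_rows(sample)):
    print(warning)
```

On the rows above it produces no warnings, which matches reading the table by hand: no reallocated or pending sectors and no CRC errors. A clean report does not rule out a loose cable, though.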

I don't know how to grab the old syslog that shows when the system crashed over the last two days. When I view the syslog in unMenu it only shows the last day. Here is my last syslog:

untitled.txt

AlphaBravo · July 29, 2011

Hard crashes like you're describing could be related to memory or CPU rather than disk. Check if the CPU fan is still functioning properly and not too dusty, check airflow around memory modules as well. When the server's been running for a while, check CPU temps (in UnMenu, open the system info tab, hit the CPU info button and scroll down to the bottom). Also run memtest to rule out faulty memory.

brandon · July 29, 2011

I'll try out memtest tonight and report back.

I don't think it's a temperature-related issue, as the server is in my cool basement (around 19-22 degrees Celsius throughout the year), and the fans in the case (Antec P190) are all running at high speed (noise isn't an issue where the server is). That case has all the fan slots filled and airflow is good throughout the case. The drive racks all have fans (I replaced all the stock rack fans with thick 40mm Delta high-speed-whiners). My CPU temperature ranges from 30-35 degrees Celsius (currently at 33).

abernardi · August 1, 2011

Well, everything seems to be fine now. I don't know what happened, but after it crashed the 2nd time, I had a running syslog going through telnet and it didn't happen again. I've had the server running now for about 3 days, solid as a rock. BTW, we just had a power outage here and the UPS worked magnificently! Gotta love a UPS.
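
For anyone who wants a syslog that survives a crash without keeping a telnet window open, here is a minimal Python sketch of the same idea: periodically append whatever is new in the live syslog to a file on persistent storage. It assumes Python is installed on the server (stock unRAID does not include it), and the paths in the usage comment (/var/log/syslog for the live log, /boot for the flash drive) are assumptions to adjust for your setup. The function names are hypothetical.

```python
import os
import time


def copy_new_bytes(src_path, dst_path, offset):
    """Append bytes of src_path past `offset` to dst_path; return the new offset."""
    size = os.path.getsize(src_path)
    if size < offset:
        # Source was truncated or rotated: start over from the beginning.
        offset = 0
    if size == offset:
        return offset
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        chunk = src.read(size - offset)
        dst.write(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # push to disk so a crash loses as little as possible
    return offset + len(chunk)


def follow(src_path, dst_path, interval_seconds=5.0):
    """Keep copying new syslog content until the process is stopped."""
    offset = 0
    while True:
        offset = copy_new_bytes(src_path, dst_path, offset)
        time.sleep(interval_seconds)


# Example call (paths are assumptions, see above):
# follow("/var/log/syslog", "/boot/syslog-copy.txt")
```

Frequent writes will add wear to a USB flash drive, so a longer interval or a different destination may be a better choice for long-term use.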
