Server unresponsive


I have had my server up and running for a week or so now, and have not had any troubles until today. When copying a file with TeraCopy to the server from a Win7 box, the connection was lost. It came back shortly after, and I attempted the copy again. Again, the connection was lost, and this time it did not come back. The console is unresponsive and shows the following:

100_2417.jpg

 

All writing to the server before was done with parity unassigned - I was getting VERY slow write speeds copying to the server with TeraCopy. Parity was assigned during the attempted write. There was a partial file after the first failed copy that was reported as ~3GB (the original file is ~8GB), that I was able to delete before the second attempted copy.  

 

There was a second 5K3000 in the system that had finished preclearing (I assume so: I did not check its completion, but it should have finished 12+ hours before).

 

I have not shut down the server yet; I assume the only way to do so is with the power button. From my reading, that will invoke an automatic parity check upon reboot? Searching the forum shows possible file system corruption? Possible memory errors?

 

What should be my next steps? Let me know if there is any other information I can provide.

 

SYSTEM

UnRaid Version: 4.7

Motherboard/Processor: ASUS E35M1-M PRO AMD Zacate E350 AMD Hudson M1 Micro ATX Motherboard/CPU Combo

RAM: 1x Kingston KVR1333D3N9/2G (2GB)

Hard Drives (assigned): HITACHI Deskstar 5K3000 (Parity); 1x Samsung F4  - firmware updated (Data)

Power Supply: Antec EarthWatts Green EA-380D

USB: Verbatim TUFF-'N'-TINY 4GB Flash Drive (using StarTech USBMBADAPT2 2 Port USB Motherboard Header Adapter)

Case: Antec 300

 

OK, so more than 10 hours of Memtest results in 9 passes and 0 errors.

 

However, the line "Settings: RAM: 320 MHz (DDR640) / CAS : 4-3-3-3 / DDR2 (64 bits)" is interesting. I have never used Memtest before. Rebooting and checking the BIOS shows the RAM correctly reported as 1066MHz DDR3, with voltage and CAS timings on auto (1.5V, which is correct; CAS timings are not displayed). Does that line mean anything significant? Should I manually set the CAS timings instead of leaving them on auto?

 

Where do I go from here? Start the array and let it check parity?

Thank you, attached find my current syslog. I have tail running in a telnet window to capture the syslog in the event it crashes again.

 

I restarted unraid and it is currently checking parity (currently at 43% - 0 errors found so far - ~271 minutes to complete). Once it finishes, I will attempt to recreate the crash.

 

Thank you for your help.

 

 

SYSTEM

UnRaid Version: 4.7

Motherboard/Processor: ASUS E35M1-M PRO AMD Zacate E350 AMD Hudson M1 Micro ATX Motherboard/CPU Combo

RAM: 1x Kingston KVR1333D3N9/2G (2GB)

Hard Drives (assigned): HITACHI Deskstar 5K3000 (Parity); 1x Samsung F4  - firmware updated (Data)

Power Supply: Antec EarthWatts Green EA-380D

USB: Verbatim TUFF-'N'-TINY 4GB Flash Drive (using StarTech USBMBADAPT2 2 Port USB Motherboard Header Adapter)

Case: Antec 300

Cables: using the SATA cables that came with the MB.

 

Hard drives currently report at 26° and 28°.

 

All hardware was bought new, and not used before building the server. Hard drives ran 3 preclear cycles before being assigned.

 

The only add-on I have installed or used is the preclear script. unMenu will be installed in the near future.

syslog.txt

OK, so the parity check completed with the 512 sync errors it had found previously. I attempted to copy the file that had crashed unRAID before. After ~47 seconds, TeraCopy failed, stating "Error: The network path was not found". unRAID became unreachable from the network and the console became unresponsive (see photo below). I attempted to copy the syslog (captured with tail) from PuTTY, but it disappeared; however, I had previously copied all but the last line:

 

Apr 10 18:51:23 Tower kernel: md: sync done. time=27629sec rate=70705K/sec

Apr 10 18:51:23 Tower kernel: md: recovery thread sync completion status: 0

 

The last line was eth0 up (or something to that effect).

PuTTY showed no other lines that were not already in the last posted syslog.
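
As a rough sanity check on the two kernel lines above, the duration and the amount of data covered by the sync can be worked out from the `time=` and `rate=` fields. This is only a sketch: `parse_sync_done` is a hypothetical helper, the regular expression is fitted to this one line rather than to any documented log format, and treating "K" as 1024 bytes is an assumption.

```python
import re

# Log line copied verbatim from the post above.
LINE = "Apr 10 18:51:23 Tower kernel: md: sync done. time=27629sec rate=70705K/sec"


def parse_sync_done(line):
    """Return (seconds, rate_in_K_per_sec) from an md 'sync done' line, or None."""
    match = re.search(r"md: sync done\. time=(\d+)sec rate=(\d+)K/sec", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


seconds, rate_k = parse_sync_done(LINE)

# Duration of the parity check in hours.
hours = seconds / 3600

# Assumption: "K" means 1024 bytes, so time * rate * 1024 gives total bytes.
total_bytes = seconds * rate_k * 1024

print(f"duration: {hours:.2f} h")
print(f"covered:  {total_bytes / 1e12:.2f} TB")
```

If the figure for data covered is close to the capacity of the parity drive, the check most likely ran over the whole drive at a steady rate rather than stalling partway.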

 

Console:

100_2418.jpg

 

Any theories? Should I attempt to write without parity as I did before? Attempt to write with parity and a different file?

 

EDIT: Attached syslog from after latest reboot.

syslog.txt

Does it seem to only happen when copying this file? What about another file? Can you try copying files from another drive? See if it's related to the drive you're copying from... Or related to the HD controller you're copying from.

 

I'd open up the case and reseat all the controllers/connections.
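
One way to check the source side on its own is to read the file end to end on the machine that holds it, without involving the server or the network at all. Below is a minimal sketch, assuming Python is available on that machine; `read_and_hash` is a hypothetical helper, and the demo runs on a small temporary file that stands in for the real one.

```python
import hashlib
import os
import tempfile


def read_and_hash(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks; return (bytes_read, sha256_hex)."""
    digest = hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()


# Demo on a small temporary file; replace demo_path with the real file's path.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data" * 1000)
    demo_path = tmp.name

try:
    first = read_and_hash(demo_path)
    second = read_and_hash(demo_path)
    print("bytes read:", first[0])
    print("reads match:", first == second)
finally:
    os.remove(demo_path)
```

If the read raises an I/O error, or two passes over the same file give different digests, the source drive or its controller becomes a stronger suspect. A clean, repeatable read points back toward the server or the network.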

So far it has only happened with this file. It is located on an external USB hard drive connected to a Windows 7 box. I have transferred successfully from this drive/box before; however, parity was not assigned during previous transfers. unRAID is checking parity now, about 7.5 hours to go. I will then attempt to copy another file from a different drive/computer (probably tomorrow early morning or early afternoon).

 

After the latest crash I reseated the RAM, SATA cables, power cables and USB before booting.

 

Thanks for the response!

If memory passed the memtest then it's OK.

 

Are you sure about that?

 

 

 

OK, so more than 10 hours of Memtest results in 9 passes and 0 errors.

 

However, the line "Settings: RAM: 320 MHz (DDR640) / CAS : 4-3-3-3 / DDR2 (64 bits)" is interesting.

 

If memtest is detecting a 1066MHz stick as 320MHz, I would think that there is something significantly wrong with the system.

 

If memtest tests at 320MHz, but the OS runs the memory at 1066MHz then memtest is very unlikely to find any faults which are being provoked in live running.

Certainly the fact that the RAM passed memtest is a good thing, but I do not think that will guarantee your RAM to be trouble free.  Case in point - I recently received 2 new sticks of RAM.  Ironically it was Kingston ValueRAM very similar to what you are using. I ran 4 passes on memtest with no errors.  I installed them in my server and everything worked fine for a few days, until I started to get random crashes every few days.  Thinking that it might be a compatibility issue with my server MB (this specific RAM was not on the QVL), I swapped the RAM between my HTPC & server. I got a BSOD within 30 minutes in my HTPC.  So while I waited for the RMA I just split my two good sticks that were originally in the HTPC between the two machines. So the moral of the story is: just because your RAM passes memtest, don't think that it is defect free.

 

My suggestion would be to try another stick of RAM in your MB - preferably one that has been running in a stable machine for some time. Really this should be your strategy going forward - swap out components one at a time (RAM, PSU, MB/CPU) until you can pinpoint the cause.

Ok, first off thanks everyone.

 

I took a 4GB stick of RAM out of my desktop and put it into the unRAID box. For s***s and giggles, I decided to see what it showed up as in memtest. The "Settings: RAM:" line showed the same. Mapping correctly showed it as 1333MHz DDR3, the same speed/type as the Kingston stick (the board can only run it at 1066MHz though). Could this be some sort of MB issue? An issue with memtest itself?

 

When I got home from work this afternoon, it had successfully completed its parity check - no errors reported. I then tried to copy a file from the same box but different drive. Same result. unraid is currently running another parity check. I will see if it continues to reboot itself with the new stick.

 

Should I bother to run memtest on this stick? What other hardware tests are recommended (other than swapping out hardware - which I will do after we see how the new RAM runs)?


If it'll make you feel better then go ahead and run memtest but I don't think it's really necessary (I assume it was working fine in your other machine) but others here may disagree.  Do you have a discrete NIC you could throw in there? Perhaps your onboard NIC is buggy. Otherwise you'd have to swap out the MB to check the NIC.

 

Yes, I disagree - strongly.  Just because a memory stick works well in one motherboard doesn't mean it will be compatible with another motherboard.  Why do you think that the mobo manufacturers publish lists of compatible memory?

Have you tried to boot a Linux distro on this server (like Ubuntu on a USB drive or something like that) to see whether it crashes? In theory, if you tried this test, Ubuntu should mount all disks as individual disks, allowing you to set up a makeshift Samba file server and then write a large file to each disk via an individual disk share as a test. You could also get the server to boot into StressLinux, which does a full HW stress test (be careful, as it might do a disk test, resulting in data loss!). It might be a faulty disk controller or another component besides the RAM. I've seen bad RAM cause kernel panic errors and the like, and other failing hardware can cause the same errors.

Hope this helps,

 

Cheers.

If it'll make you feel better then go ahead and run memtest but I don't think it's really necessary (I assume it was working fine in your other machine) but others here may disagree.

 

Yes, I disagree - strongly.  Just because a memory stick works well in one motherboard doesn't mean it will be compatible with another motherboard.  Why do you think that the mobo manufacturers publish lists of compatible memory?

My experience has always been that any incompatibility issues between RAM and the MB result in a failed boot.  However I'm sure that others may have had different experiences. 

 

Yes, I have definitely had situations where a machine would boot, but fail randomly in use - and the cause has been ram incompatibility.

 

Most recent occurrence was when I added some Kingston RAM to existing Geil RAM - matched pairs of each, and of identical timing specification.  The machine would randomly lock up/crash in Ubuntu.  Running memtest showed intermittent random errors - not every pass and not always the same location.  However, either pair would work fine together with another pair of sticks (Silicon Power)!

Well, I was able to copy two files from 2 separate computers (the same Win7 box and a MBP). I then attempted to copy one of the files that had made it crash before, and it crashed again. So can I rule out RAM? The Kingston is on the QVL. The fact that both the Kingston and PNY RAM show up the same under memtest is weird though. Can this mean some sort of MB issue? The BIOS does correctly recognize it though.

 

I have a few 10/100 PCI NICs, so I can try one of those tonight.

 

I have never tried StressLinux. Can it be run with the data drive unattached? Or would that give inaccurate results?

I'm trying to better understand these two statements from your original post.

I have had my server up and running for a week or so now, and have not had any troubles until today.

All writing to the server before was done with parity unassigned - I was getting VERY slow write speeds copying to the server with TeraCopy. Parity was assigned during the attempted write. There was a partial file after the first failed copy that was reported as ~3GB (the original file is ~8GB), that I was able to delete before the second attempted copy.

 

So if I understand correctly you built the server and copied all your data onto it using TeraCopy before calculating parity.  Then after you had copied all your data you assigned a parity drive and established parity.  Now you are trying to write files to your parity protected array and you are getting crashes.  Is that correct?

 

You mentioned very slow write speeds. What were the write speeds that you saw without parity?  They should probably be ~60MB/sec (or more) if you have a gigabit network.  It sounds to me like you've had slow transfer speeds from the beginning which means that your server (or something else) may not have been running properly all along.  How about copying data from the server?  Does it crash when you do that?
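
To put a number on the write speed rather than going by feel, a timed test write is one option. The sketch below is an assumption-laden illustration, not a standard benchmark: `measure_write_speed` and the file name `throughput_test.bin` are hypothetical, 1 MB is taken as 10**6 bytes, and the demo writes to a local temporary directory that stands in for the mapped server share.

```python
import os
import tempfile
import time


def measure_write_speed(target_dir, size_mb=64, chunk_mb=1):
    """Write about size_mb of data into target_dir and return the speed in MB/s."""
    chunk = os.urandom(chunk_mb * 10**6)
    chunks = size_mb // chunk_mb
    path = os.path.join(target_dir, "throughput_test.bin")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push buffered data out before timing stops
    elapsed = time.perf_counter() - start
    os.remove(path)  # clean up the test file
    return chunks * chunk_mb / elapsed


# Demo against a local temporary directory; point target_dir at the
# mounted server share to measure the real transfer instead.
with tempfile.TemporaryDirectory() as demo_dir:
    speed = measure_write_speed(demo_dir, size_mb=8)
    print(f"{speed:.1f} MB/s")
```

For scale, at roughly 60MB/sec an ~8GB file would take a little over two minutes to copy, so a transfer that is still far from done after many minutes is well below that figure.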

 

Did you preclear all of your drives before adding them to the array?  If so, how many preclear cycles did you perform?  Did you preclear multiple drives at the same time?

 

Have you made any changes to your network other than adding your server?  You might consider connecting your server to another port on your switch/router.

 

If I were you, my strategy would be to verify the server is stable on its own first - i.e. disconnected from the network - so that you can narrow down the number of variables you're working with.  I feel like you've established that your RAM is not faulty, so my next suggestion would be to run simultaneous preclears on all the disks you have, or on as many as you can without losing data.  FYI, I'm assuming that the data you wrote to your server is still intact somewhere else.  Running multiple preclears is a good way to stress some of the key components in your server.  I have no experience with StressLinux, but it sounds like it's trying to do the same thing that multiple preclears would do - stress the system to find weak components.

Archived

This topic is now archived and is closed to further replies.
