unraid becoming unresponsive, errors in log

January 27, 2011

I've recently set up an unraid server, and I can't get it to run for more than a day before everything stops responding. I usually see some errors like I've pasted below. What do these mean, and what drive are they referring to?

Thanks!

Jan 27 17:22:03 bbunraid kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (Errors)
Jan 27 17:22:03 bbunraid kernel: ata1.00: failed command: CHECK POWER MODE (Minor Issues)

Jan 27 17:22:03 bbunraid kernel: ata1.00: cmd e5/00:00:00:00:00/00:00:00:00:00/00 tag 0 (Drive related)

Jan 27 17:22:03 bbunraid kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)

Jan 27 17:22:03 bbunraid kernel: ata1.00: status: { DRDY } (Drive related)

Jan 27 17:22:03 bbunraid kernel: ata1: hard resetting link (Minor Issues)

Jan 27 17:22:13 bbunraid kernel: ata1: softreset failed (device not ready) (Minor Issues)

Jan 27 17:22:13 bbunraid kernel: ata1: hard resetting link (Minor Issues)

Jan 27 17:22:23 bbunraid kernel: ata1: softreset failed (device not ready) (Minor Issues)

Jan 27 17:22:23 bbunraid kernel: ata1: hard resetting link (Minor Issues)

Jan 27 17:22:34 bbunraid kernel: ata1: link is slow to respond, please be patient (ready=0) (Drive related)

Jan 27 17:22:58 bbunraid kernel: ata1: softreset failed (device not ready) (Minor Issues)

Jan 27 17:22:58 bbunraid kernel: ata1: limiting SATA link speed to 1.5 Gbps (Drive related)

Jan 27 17:22:58 bbunraid kernel: ata1: hard resetting link (Minor Issues)

January 27, 2011

What are your full system specs? It looks to me like the drive may not be getting enough power.

Post a full syslog and we can tell you what drive is having issues.
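In the meantime, the excerpt already posted can at least be tallied by ATA port. The sketch below is illustrative only: it assumes Python is available on whatever machine the syslog has been saved to, and the list of events to count is an arbitrary choice, not something unRAID defines. The sample lines are copied from the excerpt above. Note that "ata1" names a port, so this narrows things down to a port, not yet to a specific drive.

```python
import re
from collections import Counter, defaultdict

# Matches kernel libata lines such as "... kernel: ata1.00: failed command: ..."
# and "... kernel: ata1: hard resetting link"; the ".00" device suffix is optional.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings worth counting. This list is an illustrative choice, not exhaustive.
EVENTS = (
    "failed command",
    "hard resetting link",
    "softreset failed",
    "limiting SATA link speed",
)


def tally_ata_events(lines):
    """Return {port: Counter(event -> count)} for the given syslog lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a libata line (or a continuation line such as "res ...")
        port, message = match.groups()
        for event in EVENTS:
            if event in message:
                counts[port][event] += 1
    return counts


# Four lines taken verbatim from the excerpt in the first post.
sample = [
    "Jan 27 17:22:03 bbunraid kernel: ata1.00: failed command: CHECK POWER MODE (Minor Issues)",
    "Jan 27 17:22:03 bbunraid kernel: ata1: hard resetting link (Minor Issues)",
    "Jan 27 17:22:13 bbunraid kernel: ata1: softreset failed (device not ready) (Minor Issues)",
    "Jan 27 17:22:13 bbunraid kernel: ata1: hard resetting link (Minor Issues)",
]

for port, counter in tally_ata_events(sample).items():
    print(port, dict(counter))
```

Running the same function over the whole saved syslog would show whether any port other than ata1 ever reports resets.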

January 28, 2011

Author

Thanks for your response.

System Specs:

Motherboard: http://www.gigabyte.com/products/product-page.aspx?pid=2946#ov

PSU: http://www.newegg.com/Product/Product.aspx?Item=N82E16817703030

CPU: E5200

RAM: 3GB

HDD: 1xWD2002FYPS (parity) 4xWD20EARS

Complete log: http://pastebin.com/Ae9qgRLa

January 28, 2011

Even with the syslog, there is really not enough information unfortunately. Something does appear to be wrong with your system, but I don't know what. At the end of the syslog, the drive just suddenly stops responding, even to a CHECK POWER MODE command, which I believe is just a request to the drive as to what power mode it is in - standby, active, or idle, same as the "hdparm -C" command. From what you are saying though, it is not just this one drive, but the whole system seems to hang? Then perhaps the drive is fine, just the first device to hang...

The syslog seems incomplete; there should normally be more lines after that, because it really has not resolved the issue with the drive yet. Were there more lines? You might also start a tail running on the console, just to see the very last messages logged when it next hangs: tail -f --lines=100 /var/log/syslog

There are a couple of things you can check, especially since this is a new system.

* Run the memtest overnight, so we can eliminate memory as a cause

* Check for heat issues, CPU running too hot, northbridge or southbridge chipsets getting too hot, etc

* Check to see that you have the latest BIOS/firmware for this board

Up until the last few lines of the syslog, your system seemed to be fine, no apparent issues noted.

January 28, 2011

Author

That's the entire log. When I say the entire system freezes up, I don't mean completely at the same time; first the web-ui stops responding, then as I attempt to access other parts (that I assume may try to interface with that drive) they stop responding as well. I had enough time to pull the syslog, which says a little I guess.

If I could know which drive is problematic, I would just pull it, but I can't tell from the log which one it is.

Also, as unfortunate as it would be to miss the functionality, would disabling SMART for now possibly help?

January 28, 2011

These are symptoms of filling RAM. It could be a log file growing that fills the RAM disk. The OS starts killing processes when RAM gets too low. That's why the web interface goes down but you can still telnet in. Are you running any add-ons?

January 28, 2011

Author

I'm running unmenu with a few essentials (ssh/htop/iftop), but this was happening when I was running without any as well. How big is the ramdisk?

January 28, 2011

I don't know exactly. It may depend on how much RAM you have. People are running with 512MB of RAM successfully. Your RAM disk should not fill up unless there is a problem. Do what RobJ suggested; run memtest overnight, etc. If that works, you will need to start swapping cables around and see if the problem follows a cable, a port, or a disk. Someone like Joe L may be able to provide more insight.
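If you want a number rather than a guess, something like the following sketch could be run from a telnet or ssh session. It assumes Python is installed on the server (it may not be on a stock install) and that the syslog lives on the RAM-backed root filesystem; the helper names are made up for illustration.

```python
import os
import shutil


def percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # some pseudo-filesystems report a zero size
    return 100.0 * usage.used / usage.total


def size_mb(path):
    """Size of a single file in MB, or None if it cannot be read."""
    try:
        return os.path.getsize(path) / (1024 * 1024)
    except OSError:
        return None


if __name__ == "__main__":
    print(f"root filesystem: {percent_used('/'):.1f}% used")
    log_size = size_mb("/var/log/syslog")
    print("syslog size:", "not found" if log_size is None else f"{log_size:.2f} MB")
```

A root filesystem creeping toward 100% used, or a syslog that keeps growing between checks, would support the full-RAM-disk theory; steady low numbers would argue against it.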

January 28, 2011

Author

Thanks. I have plenty of RAM, so unless the RAM disk has a set limit that it uses, I don't see it even coming within 10% of its available capacity. I'll run a RAM test if I need to, though it's hard to believe that's a cause when it is using less than 100 MB to run in. I wish I could figure out which drive it is indicating (always the same drive); it would definitely ease the road to finding a root cause.

January 28, 2011

If you have not done a memtest that is the place to start. Bad RAM is reported more often than you'd think. Run the test for a very long time, at least overnight.

January 28, 2011

Is this a typo:

RAM: 3GB

in your configuration? From the syslog it looks like you have 4GB.

January 28, 2011

Author

It's a little misleading, because the onboard vid card uses shared memory and takes some of it.

January 28, 2011

> That's the entire log. When I say the entire system freezes up, I don't mean completely at the same time; first the web-ui stops responding, then as I attempt to access other parts (that I assume may try to interface with that drive) they stop responding as well. I had enough time to pull the syslog, which says a little I guess.
>
> If I could know which drive is problematic, I would just pull it, but I can't tell from the log which one it is.
>
> Also, as unfortunate as it would be to miss the functionality, would disabling SMART for now possibly help?

The drive indicated by this syslog is Disk 1, sda, WDC_WD20EARS-00MVWB0, with serial# ending in 048. I do apologize for forgetting to mention that before. And no, I can't think of any way that disabling SMART could possibly help.
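For anyone wanting to make this kind of match themselves: at boot, libata logs one identification line per device in the form "ataN.MM: ATA-x: model, firmware, max mode", so the port named in the errors can be tied to a drive model. The sketch below shows one way to pull that out of a saved syslog. It is not necessarily how the match above was made, the first sample line is an illustrative stand-in rather than a line copied from the posted syslog, and "FWREV" is a placeholder for the firmware field.

```python
import re

# Boot-time identification line: "ataN.MM: ATA-x: <model>, <firmware>, max <mode>"
IDENT = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: ([^,]+), ([^,]+), max")


def models_by_port(lines):
    """Return {port: model string} from boot-time identification lines."""
    found = {}
    for line in lines:
        match = IDENT.search(line)
        if match:
            found[match.group(1)] = match.group(2).strip()
    return found


sample = [
    # Illustrative line only; "FWREV" is a placeholder, not a real firmware value.
    "Jan 27 09:30:17 bbunraid kernel: ata1.00: ATA-8: WDC WD20EARS-00MVWB0, FWREV, max UDMA/133",
    # An error line from the first post, which should be ignored here.
    "Jan 27 17:22:03 bbunraid kernel: ata1: hard resetting link (Minor Issues)",
]

print(models_by_port(sample))
```

With several drives of the same model, as in this build, the model alone will not single out one disk, so the serial number and sdX assignment still have to come from elsewhere in the syslog or from the management page.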

However, while it is remotely possible that Disk 1 is very indirectly causing the system to go down, I still do NOT think that this drive is the problem. There are no drive errors here, simply a drive that cannot respond, even at the SATA link level, and in my experience that usually is NOT the drive's fault. If it really is a problem related to the drive, then the possibilities are:

* SATA cable has slipped off, at either end

* power cable has slipped off, or a power splitter is loose or has become disconnected

* power supply is failing, and has stopped providing power to this drive

* disk controller is failing

* both the drive firmware has crashed AND the drive SATA chip has crashed (very remote chance, extremely unlikely)

It is MUCH more likely that the system is crashing, and the drive is just the first evident symptom. There really should be much more at the end of that syslog, so the logging subsystem may have crashed or been disabled. (That is why I asked for a tail of the syslog on the physical console, to see if anything else is reported.) The very last entries indicate that the system is still trying to contact the drive, but so far is unsuccessful, so there should be a LOT more entries. When a drive cannot be reached, syslogs often grow to a hundred times as big as yours, with exception handling, handle stripe errors, I/O buffer errors, and drive errors and a red ball on the web management page. The only way that that could be a valid end of the syslog is if the drive did respond (and the logging system forgot to inform us), and everything is perfectly fine with the drive, AND with the system, and as you know that is not true.

A drive cannot crash a system. If it ever *appears* to have caused the system to crash, then that means there is a serious problem with the disk controller. In your case, that is the onboard chipset, so that would mean you have to replace the motherboard. That is a possibility, but I think it is a very remote one.

I don't know what the problem is, but memory and heat still look like strong possibilities, loose connections are a possibility, and a bad motherboard or buggy BIOS are slight possibilities. The size of the RAM disk or the amount of RAM installed has no bearing on whether the RAM is bad or incorrectly configured. I still strongly advise an overnight test, using the Memtest on the initial unRAID boot screen.

> These are symptoms of filling RAM. It could be a log file growing that fills the RAM disk. The OS starts killing processes when RAM gets too low. That's why the web interface goes down but you can still telnet in.

I very much agree, it really does look just like an out of memory situation, with subsystems being shut down. Except that the syslog is tiny, he apparently has more than enough RAM, and there is no direct evidence in the syslog that things are being shut down, and he did not mention any such messages on the console. Also, an unresponsive drive should become evident long *after* other things have been killed or crashed, I think.

> Is this a typo:
>
> RAM: 3GB
>
> in your configuration? From the syslog it looks like you have 4GB.

There is something strange about the memory reported here. By the way, I don't believe video shared memory is visible to the kernel; it has already been mapped out during POST. From the syslog:

Jan 27 09:30:17 bbunraid kernel: 4230MB HIGHMEM available.

Jan 27 09:30:17 bbunraid kernel: 889MB LOWMEM available.

...

Jan 27 09:30:17 bbunraid kernel: Memory: 3042816k/5242880k available (2692k kernel code, 100248k reserved, 1359k data, 312k init, 2233288k highmem)

If there is a problem with the way the memory is reporting itself, Memtest should detect that.
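To make the oddity concrete, here is the unit conversion for that "Memory:" line as a small sketch. The regular expression and function name are just for illustration; the input line is the one quoted from the syslog above.

```python
import re

# Captures the available and total figures plus the highmem figure, all in KiB.
MEMORY_LINE = re.compile(r"Memory: (\d+)k/(\d+)k available .*?(\d+)k highmem")


def parse_memory_line(line):
    """Return (available, total, highmem) in GiB from a kernel 'Memory:' line."""
    match = MEMORY_LINE.search(line)
    if match is None:
        raise ValueError("not a kernel Memory: line")
    kib_per_gib = 1024 * 1024
    return tuple(int(value) / kib_per_gib for value in match.groups())


line = (
    "Jan 27 09:30:17 bbunraid kernel: Memory: 3042816k/5242880k available "
    "(2692k kernel code, 100248k reserved, 1359k data, 312k init, "
    "2233288k highmem)"
)

available, total, highmem = parse_memory_line(line)
print(f"available {available:.2f} GiB of a {total:.2f} GiB range, "
      f"{highmem:.2f} GiB of it highmem")
```

So the kernel reports about 2.90 GiB available out of a range of exactly 5 GiB, which also agrees with the HIGHMEM and LOWMEM lines (4230MB + 889MB = 5119MB). As far as I understand it, that second figure is the span of physical addresses rather than the amount of RAM installed, but treat that reading as an assumption.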

January 28, 2011

System heat may not be an issue here, since your system had completed a parity check almost an hour and a half before the system hung, and parity checks and builds drive the system harder than almost anything else. Any system heat issues would most likely have occurred during the parity check. And I assume you checked the individual drive temps? They should normally be in the 20's and 30's, and under load in the 30's and 40's. A temp in the high 50's or higher could possibly crash a drive (but not the system).

A parity check would only use a smaller part of your memory. But transfers of very large files after the parity check was complete could use very large amounts of your memory, and touch areas of the RAM never used before. That makes memory more of a suspect here.

January 29, 2011

Author

I haven't run the mem check yet (I will, but considering the time it will take I'm running through everything else first), but I did move some stuff around (into drive cages and better cooled) and used new cables, etc. Basically, right now it gets to the point where only the console is responsive, so eventually I reboot it. This log is from after a reboot (and subsequent parity check) last night, until this morning when I rebooted it again: http://pastebin.com/tWimEVcZ

January 29, 2011

Author

ran memtest, passed without errors

January 29, 2011

How long did you let memtest run for? If it was less than 3 hours then it doesn't mean much at all.
