Unraid disconnects/becomes unavailable during large transfer

kenshinx34 · September 4, 2013

WARNING: NEW USER

I have been running a configuration for the past year which has been extremely stable (>100 days) without a shutdown, error or anything.

Lately I've been trying to transfer ~300 GB of data to my "Tower", and partway through the transfer the server becomes unavailable. After cancelling the transfer I cannot connect to the network share or ping the tower.

However, it is still on. I've hooked up a monitor to it when this happens to see the command line and it only sometimes displays an image. I haven't made any changes recently or messed with the drive configuration.

Each hard shutdown requires a parity check which takes ~18 hours.

Sometimes I'll also have trouble accessing the web interface, however in this circumstance I can still Telnet into it.

Here's my setup:

unRAID Version: unRAID Server Plus, Version 5.0-beta14

Motherboard: ASUSTeK - M4A88T-M

Processor: AMD Athlon™ II X3 450 - 3.2 GHz

Cache: L1 = 384 kB L2 = 1536 kB L3 = 0 kB

Memory: 6912 MB - DIMM1 = 1333 MHz DIMM2 = 1333 MHz DIMM3 = 1333 MHz

Network: 1000Mb/s - Full Duplex

Simple Features 1.0.5

I guess my question is that I don't know where to start troubleshooting. Any help would be appreciated. I also have logs I can post, but I'm not sure which log files are relevant.

Thanks!

dgaschk · September 5, 2013

Install the current release, 5.0 final. I can't help with an old beta.

kenshinx34 · September 14, 2013

Alright, currently updated to 5.0 final.

Unraid no longer freezes, however it is showing some worrying messages in the syslog. This is pretty much the main message I see repeated throughout my syslog:

Sep 12 23:08:15 Tower kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 12 23:08:15 Tower kernel: hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 12 23:08:15 Tower kernel: hdc: possibly failed opcode: 0x35

"hdc" is my parity drive, should I be nervous that my data is being corrupted? My parity checks come out clean, but if I have a dying parity drive I wonder if that could corrupt everything.

I've attached the full log:

New_Text_Document_2_2.txt
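To gauge how often these errors occur, the messages quoted in this post can be tallied from a saved copy of the syslog. This is only a minimal sketch: the function name `count_badcrc` is made up for illustration, and it assumes the log lines have the same layout as the three lines quoted above (device name in the sixth field).

```shell
# Count "BadCRC" kernel messages per device in a saved syslog file.
# Usage: count_badcrc <file>
count_badcrc() {
    # In the quoted lines the sixth field is the device plus a colon, e.g. "hdc:".
    awk '/kernel:/ && /BadCRC/ {
             dev = $6
             sub(/:$/, "", dev)   # strip the trailing colon
             n[dev]++
         }
         END { for (d in n) print d, n[d] }' "$1" | sort
}
```

Run it as `count_badcrc <path to your saved syslog>`. It prints one line per device with the number of BadCRC messages found.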

Barziya · September 14, 2013

if I have a dying parity drive I wonder if that could corrupt everything.

If you have a dying parity drive, that can't corrupt the data on the other disks, but it can leave you unprotected until you replace the failed disk.

Here are some other things that could be causing your problems:

1. A loose or bad cable can cause the errors seen in your syslog. Check the cables and make sure they are firmly attached. Better yet, swap the cable on the disk in question with another cable and see if you get the same errors on the same disk. Also, get a SMART report on that disk; the disk may indeed be failing.

2. I see that you have SATA disks, but for some reason they show up as IDE disks (hdX instead of sdX). Go into the BIOS settings and check the SATA setting: it is now probably set to "compatible" or something similar. It should be set to AHCI.

3. After you set your BIOS setting to AHCI, if the disks still show up as hdX instead of sdX, then you should blacklist the obsolete IDE driver altogether. (I don't know why that obsolete IDE driver is still compiled into the unRAID kernel: every other Linux distro I know switched entirely to the new libata driver years ago.) To blacklist that driver, add the following to the "append" line of the syslinux.cfg file on your USB flash disk:

piix.blacklist=yes  ide_gd_mod.blacklist=yes  ide-generic.blacklist=yes  i2c_piix4.blacklist=yes

4. Some systems have shown some serious problems with the 32-bit unRAID when run with 4GB of physical RAM or more. I see that you have 6GB RAM installed. You can tell the kernel to boot in 3.99GB of RAM by adding a mem=4095M to the "append" line of the syslinux.cfg file on your USB flash disk. Make that line look something like this:

  append initrd=bzroot mem=4095M

Notice that it is 4095M, and NOT 4096M! This particular setting "cured" the timeout and disconnect problems I was having with my server.
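For items 2 and 3, a quick way to see how the disks are currently named is to list the block devices and label them by prefix. This is a rough sketch only: the function name `classify_disks` and the labels are made up for illustration, it assumes the usual /sys/block layout, and it only matches single-letter names (hda, sdb, and so on). Note that the USB flash disk will normally show up as an sdX device as well.

```shell
# List disks under a sysfs-style block directory and label each one by the
# driver family its name suggests: hdX = legacy IDE driver, sdX = libata/SCSI.
# Usage: classify_disks [dir]   (dir defaults to /sys/block)
classify_disks() {
    dir=${1:-/sys/block}
    for path in "$dir"/hd? "$dir"/sd?; do
        [ -e "$path" ] || continue      # pattern matched nothing, skip it
        name=${path##*/}
        case $name in
            hd*) echo "$name legacy-ide" ;;
            sd*) echo "$name libata-or-scsi" ;;
        esac
    done
}
```

If any array disk is still reported as legacy-ide after switching the BIOS to AHCI, that is the case item 3 is meant for.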

Let us know how it all goes.

kenshinx34 · September 15, 2013

Alright I swapped the cable and no more error. That's the strangest thing in the world. I would have thought that a cable either worked or wouldn't work! I didn't realize that could have an effect like it did.

I'm still getting a hang when I try to copy large files over (it says network no longer available), but now it doesn't lock up the server, I just have to retry the copy. I'll do some testing and see if I can rule out my router, computer and other stuff but for now here's the smart test...

I ran a long smart test on the drive in question and here's the info:

Linux 3.9.6p-unRAID.
root@Tower:~# smartctl  -a  -d  ata  /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2926157
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 14 16:25:21 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (49680) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   153   152   021    Pre-fail  Always       -       9333
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       284
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6182
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       817
194 Temperature_Celsius     0x0022   125   111   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~# smartctl -d ata -tlong /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Sat Sep 14 20:43:29 2013

Use smartctl -X to abort test.
root@Tower:~# smartctl -d ata -a /dev/hdc > smartctl.txt
root@Tower:~# smartctl -d ata -a /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2926157
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Sep 15 06:45:49 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (49680) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   153   152   021    Pre-fail  Always       -       9350
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       285
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6196
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       820
194 Temperature_Celsius     0x0022   119   111   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6190         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
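Since the cable was the suspect, the attribute worth tracking across reports like these is 199 (UDMA_CRC_Error_Count). Here is a small sketch for pulling one raw value out of a saved report such as the `smartctl.txt` written above. The function name `smart_raw` is made up for illustration, and it assumes the ten-column attribute table layout shown in this post.

```shell
# Print the RAW_VALUE (last column) of one SMART attribute from saved
# `smartctl -a` output.
# Usage: smart_raw <attribute-id> <file>
smart_raw() {
    # Attribute rows have ten columns; the NF check skips other tables whose
    # rows also happen to start with a small number.
    awk -v id="$1" '$1 == id && NF >= 10 { print $NF }' "$2"
}
```

For example, `smart_raw 199 smartctl.txt` prints the CRC error count. The raw count does not go back down after a cable is replaced; what matters is whether it keeps climbing between reports. In the two reports above it reads 1012 both times.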

kenshinx34 · September 16, 2013

Sorry, I was still in the process of completing the other items listed when I last posted.

I've completed the following:

#2 (Switching the BIOS to AHCI)

#3 (wasn't needed as #2 changed the hdx to sdx)

#4 I added that line to the 'syslinux.cfg' file to change the amount of RAM, but couldn't see where in the syslog to check that.

I've attached the latest syslog since these changes (to verify the correct amount of RAM), and will be trying a large file transfer today to see how it goes!

I appreciate the help!

syslog_after_fixes.txt

Barziya · September 16, 2013

I've attached the latest syslog since these changes (to verify the correct amount of RAM)

That syslog shows that it booted in 6GB, so we've messed up something in #4. I was probably not very clear there. The syslinux.cfg should look something like this:

label unRAID OS
  menu default
  kernel bzimage
  append initrd=bzroot mem=4095M

(^^^ That's not the whole syslinux.cfg, just the relevant part of it.)
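After a reboot, one way to confirm that the option took effect is to look at the kernel command line and at the total memory the kernel reports. The sketch below assumes the standard /proc/cmdline and /proc/meminfo files; the function names `boot_mem_limit` and `mem_total_mb` are made up for illustration.

```shell
# Print the value of the mem= parameter from a kernel command line file,
# or nothing if the parameter is absent.
# Usage: boot_mem_limit [file]   (file defaults to /proc/cmdline)
boot_mem_limit() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^mem=//p'
}

# Print MemTotal in MB from a meminfo-style file.
# Usage: mem_total_mb [file]   (file defaults to /proc/meminfo)
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}
```

If `boot_mem_limit` prints nothing, the option never reached the kernel. `mem_total_mb` should then come out somewhat below the limit, since the kernel reserves some memory for itself. The kernel also records its command line near the top of the syslog, on a line containing "Kernel command line:".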

---

Note for anybody else reading this thread: Don't use that 'mem=' option if you have less than 4GB of RAM, as that will crash your server.

unevent · September 16, 2013

IIRC, the suspect boards that had issues with 4G+ ram were Supermicro with certain chipsets and it was for slow writes. If this does not affect you then you should remove the memory limit option of the boot. Regardless, the issue should no longer exist in 5.0.

Sounds like you have the drive problem fixed by changing the cable. For the transfer issue, what are you transferring from, and which version of Windows? Are you using SMB or NFS? Are you using any program to do the transfer, such as rsync, TeraCopy, robocopy, etc.? Can you post a full syslog? A quick search in the forum will net instructions on how to obtain one.

Barziya · September 16, 2013

IIRC, the suspect boards that had issues with 4G+ ram were Supermicro with certain chipsets and it was for slow writes. If this does not affect you then you should remove the memory limit option of the boot. Regardless, the issue should no longer exist in 5.0.

That is not quite true. That issue was never really fixed. I can still reproduce it in v5.0 on my server: I watch the network graph in Task Manager on the Windows client, and when the server has more than 4GB of memory, the graph at times drops all the way to zero for long periods, then network traffic resumes for a while, then it drops to zero again, and so on. When it stays down for longer, windoze complains that it's lost the connection, or that the path is no longer available. When I restart the same server with the mem=4095M option, the network transfer graph is smooth and never drops down to zero while the files are copying.

unevent · September 17, 2013

IIRC, the suspect boards that had issues with 4G+ ram were Supermicro with certain chipsets and it was for slow writes. If this does not affect you then you should remove the memory limit option of the boot. Regardless, the issue should no longer exist in 5.0.

That is not quite true. That issue was never really fixed. I can still reproduce it in v5.0 on my server: I watch the network graph in Task Manager on the Windows client, and when the server has more than 4GB of memory, the graph at times drops all the way to zero for long periods, then network traffic resumes for a while, then it drops to zero again, and so on. When it stays down for longer, windoze complains that it's lost the connection, or that the path is no longer available. When I restart the same server with the mem=4095M option, the network transfer graph is smooth and never drops down to zero while the files are copying.

Hard to say, but it sounds more likely a tuning issue with the VM subsystem, not necessarily the original problem that was fixed. IIRC, the original problem was due to a hardware issue or kernel conflict with certain chipsets and was prevalent on certain Supermicro boards. In the case of what you describe, I believe the Linux kernel is caching too much of the incoming data before starting the workers to write out the data. Once writing starts things slow because of the IO waits. The defaults in the VM subsystem are archaic and were set when hardware (memory size) was a fraction of what it is today. Do a web search for Linus Torvalds dirty limits (32bit kernels) and the like for more on the matter.
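To see what the writeback thresholds are on a given server, the two main knobs can be read from /proc/sys/vm. This is a rough sketch under stated assumptions: the function names `show_dirty_limits` and `dirty_ceiling_mb` are made up for illustration, and the ceiling is a back-of-the-envelope figure. The kernel applies the percentage to the memory it considers dirtyable rather than to total RAM, so treat the result as an overestimate.

```shell
# Print the writeback thresholds from a sysctl-style directory.
# Usage: show_dirty_limits [dir]   (dir defaults to /proc/sys/vm)
show_dirty_limits() {
    dir=${1:-/proc/sys/vm}
    for knob in dirty_background_ratio dirty_ratio; do
        [ -r "$dir/$knob" ] && echo "$knob=$(cat "$dir/$knob")"
    done
    return 0
}

# Rough upper bound, in MB, on dirty page cache for a RAM size and a ratio.
# Usage: dirty_ceiling_mb <ram_mb> <ratio_percent>
dirty_ceiling_mb() {
    echo $(( $1 * $2 / 100 ))
}
```

For example, `dirty_ceiling_mb 6912 20` prints 1382: with the 6912 MB in this server and a hypothetical dirty_ratio of 20, well over a gigabyte of incoming data could in principle sit in cache before writers are throttled.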

Barziya · September 17, 2013

... not necessarily the original problem that was fixed.

No, it wasn't fixed. If you review that thread -- was it around RC14 or something -- you'll see that Tom ended up deciding between two possible solutions: #1: Make a 64-bit kernel. (That would have been the real fix.) and #2: Make the new RC with a mem=4095M option in the boot menu. (That is not really a fix, but a workaround.) In the end Tom decided to go with #2. So he posted an RC with that boot option, and immediately people started screaming that their servers were crashing -- people with less than 4GB RAM of course. So what Tom did was he immediately posted yet another RC the next day removing that option from the boot menu, and called it a fix for the crashing servers -- no mention of the original problem that called for the 'mem=' workaround in the first place. And that was the end of it. That was the "fix" you remember.

Do a web search for Linus Torvalds...

Here are two quotes from Linus that are quite relevant to the matter:

http://lkml.org/lkml/2013/4/29/512

http://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/

kenshinx34 · September 18, 2013

Okay I made the changes to my syslinux.cfg on the correct lines this time!

Where do I look to verify that it is only using 4095?

Here's my syslog!

Thanks!

New_Text_Document_3.txt
