kenshinx34 Posted September 4, 2013

WARNING: NEW USER

I have been running a configuration for the past year which has been extremely stable (>100 days) without a shutdown, error or anything. Lately I've been trying to transfer ~300 GB of data to my "Tower", and partway through the transfer the server becomes unavailable. After cancelling the transfer I cannot connect to the network share or ping the tower, even though it is still powered on. I've hooked up a monitor when this happens to see the command line, and it only sometimes displays an image. I haven't made any changes recently or messed with the drive configuration. Each hard shutdown requires a parity check, which takes ~18 hours. Sometimes I also have trouble accessing the web interface, but in that case I can still Telnet into it.

Here's my setup:

unRAID Version: unRAID Server Plus, Version 5.0-beta14
Motherboard: ASUSTeK - M4A88T-M
Processor: AMD Athlon™ II X3 450 - 3.2 GHz
Cache: L1 = 384 kB L2 = 1536 kB L3 = 0 kB
Memory: 6912 MB - DIMM1 = 1333 MHz DIMM2 = 1333 MHz DIMM3 = 1333 MHz
Network: 1000Mb/s - Full Duplex
Simple Features 1.0.5

I guess my problem is that I don't know where to start troubleshooting. Any help would be appreciated. I also have logs I can post, but I'm not sure which log files are relevant. Thanks!
dgaschk Posted September 5, 2013

Install the current release, 5.0 final. I can't help with an old beta.
kenshinx34 Posted September 14, 2013

Alright, I've now updated to 5.0 final. unRAID no longer freezes, but it is showing some worrying messages in the syslog. This is pretty much the main message I see repeated throughout my syslog:

Sep 12 23:08:15 Tower kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 12 23:08:15 Tower kernel: hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 12 23:08:15 Tower kernel: hdc: possibly failed opcode: 0x35

"hdc" is my parity drive. Should I be nervous that my data is being corrupted? My parity checks come out clean, but if I have a dying parity drive I wonder if that could corrupt everything.

I've attached the full log: New_Text_Document_2_2.txt
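To get a feel for how often these messages occur and which device they point at, the syslog can be scanned with a small script. The following is only a minimal sketch: the regular expression is written against the three sample lines quoted above, and the function name `count_crc_errors` is made up for illustration. Other kernel versions or drivers may format these lines differently.

```python
import re
from collections import Counter

# Written against lines of this shape (taken from the post above):
#   Sep 12 23:08:15 Tower kernel: hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
ERROR_LINE = re.compile(
    r"kernel: (?P<dev>hd[a-z]): dma_intr: error=0x[0-9a-fA-F]+ \{(?P<flags>[^}]*)\}"
)


def count_crc_errors(lines):
    """Return a Counter mapping device name -> number of BadCRC error lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        # Only count error lines whose decoded flag list contains BadCRC.
        if match and "BadCRC" in match.group("flags").split():
            counts[match.group("dev")] += 1
    return counts


sample = [
    "Sep 12 23:08:15 Tower kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "Sep 12 23:08:15 Tower kernel: hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "Sep 12 23:08:15 Tower kernel: hdc: possibly failed opcode: 0x35",
]
print(count_crc_errors(sample))
```

On a real log, the same function could be fed an open file object (for example the attached syslog saved locally), since it only iterates over lines.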
Barziya Posted September 14, 2013

kenshinx34 wrote: "if I have a dying parity drive I wonder if that could corrupt everything."

If you have a dying parity drive, that can't corrupt the data on the other disks, but it can leave you unprotected until you replace the failed disk.

Here are some other things that can be causing your problems:

1. A loose or bad cable can cause the errors seen in your syslog. Check the cables and make sure they are firmly attached. Better yet, swap the cable on the disk in question with another cable and see if you get the same errors on the same disk. Also, get a SMART report on that disk; the disk may indeed be failing.

2. I see that you have SATA disks, but for some reason they show up as IDE disks (hdX instead of sdX). Go into the BIOS settings and check the SATA setting: it is probably now set to "compatible" or something similar. It should be set to AHCI.

3. After you set your BIOS setting to AHCI, if the disks still show up as hdX instead of sdX, then you should blacklist the obsolete IDE driver altogether. (I don't know why that obsolete IDE driver is still compiled into the unRAID kernel: every other Linux distro I know switched entirely to the new libata driver years ago.) To blacklist that driver, add the following to the "append" line of the syslinux.cfg file on your USB flash disk:

piix.blacklist=yes ide_gd_mod.blacklist=yes ide-generic.blacklist=yes i2c_piix4.blacklist=yes

4. Some systems have shown serious problems with the 32-bit unRAID when run with 4GB of physical RAM or more. I see that you have 6GB RAM installed. You can tell the kernel to boot in 3.99GB of RAM by adding mem=4095M to the "append" line of the syslinux.cfg file on your USB flash disk. Make that line look something like this:

append initrd=bzroot mem=4095M

Notice that it is 4095M, and NOT 4096M! This particular setting "cured" the timeout and disconnect problems I was having with my server.

Let us know how it all goes.
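Points 2 and 3 both hinge on whether the disks appear as hdX or sdX. A rough way to check is to sort the device names by their prefix. The sketch below assumes the usual Linux naming (hdX for the legacy IDE driver, sdX for libata/SCSI); the function name `classify_disks` and the sample names are hypothetical. Note that a USB flash drive normally also shows up as sdX, so an sdX entry alone does not prove the array disks have switched.

```python
import re

IDE_NAME = re.compile(r"^hd[a-z]$")     # legacy IDE driver naming, e.g. hdc
SCSI_NAME = re.compile(r"^sd[a-z]+$")   # libata/SCSI naming, e.g. sda


def classify_disks(names):
    """Split whole-disk device names into legacy IDE (hdX) and libata (sdX)."""
    result = {"ide": [], "libata": []}
    for name in sorted(names):
        if IDE_NAME.match(name):
            result["ide"].append(name)
        elif SCSI_NAME.match(name):
            result["libata"].append(name)
        # Partitions (sdb1), terminals and everything else are ignored.
    return result


# Hypothetical device list; on the server this could come from os.listdir("/dev").
print(classify_disks(["hdc", "hda", "sda", "sdb1", "tty0"]))
```

If the "ide" list is still non-empty after the BIOS is switched to AHCI, that is the situation step 3 is meant for.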
kenshinx34 Posted September 15, 2013

Alright, I swapped the cable and there are no more errors. That's the strangest thing in the world. I would have thought that a cable either worked or didn't work! I didn't realize it could have an effect like that.

I'm still getting a hang when I try to copy large files over (it says the network is no longer available), but now it doesn't lock up the server; I just have to retry the copy. I'll do some testing to see if I can rule out my router, computer and other stuff, but for now here's the SMART test.

I ran a long SMART test on the drive in question and here's the info:

Linux 3.9.6p-unRAID.
root@Tower:~# smartctl -a -d ata /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2926157
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 14 16:25:21 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (49680) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   153   152   021    Pre-fail  Always       -       9333
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       284
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6182
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       817
194 Temperature_Celsius     0x0022   125   111   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~# smartctl -d ata -tlong /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Sat Sep 14 20:43:29 2013

Use smartctl -X to abort test.
root@Tower:~# smartctl -d ata -a /dev/hdc > smartctl.txt
root@Tower:~# smartctl -d ata -a /dev/hdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2926157
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Sep 15 06:45:49 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (49680) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   153   152   021    Pre-fail  Always       -       9350
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       285
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6196
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       820
194 Temperature_Celsius     0x0022   119   111   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6190         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
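The two reports above are easier to compare by pulling out the raw values of a few attributes and looking at how they changed. The sketch below is one possible way to do that, not part of smartctl itself: the helper names and the choice of watched attributes are assumptions made for illustration, and the sample rows are copied from the two reports in this post. It relies only on the ten-column layout shown in the attribute table.

```python
# Attributes often checked first when a disk or its cabling is suspect
# (this selection is an assumption for the example, not an official list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for rows of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            attrs[fields[1]] = int(fields[9])
    return attrs


def raw_deltas(before, after):
    """Change in raw value for each watched attribute present in both reports."""
    return {name: after[name] - before[name]
            for name in WATCHED if name in before and name in after}


first_report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6182
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
"""
second_report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6196
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1012
"""

print(raw_deltas(parse_attributes(first_report), parse_attributes(second_report)))
```

In these two reports the UDMA_CRC_Error_Count raw value stays at 1012 while Power_On_Hours advances from 6182 to 6196. Since that counter is cumulative, an unchanged value is consistent with the cable swap having stopped new CRC errors, although a longer observation period would be needed to be sure.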
kenshinx34 Posted September 16, 2013

Sorry, I was still in the process of completing the other items listed when I last posted. I've now completed the following:

#2 (switching the BIOS to AHCI)
#3 (wasn't needed, as #2 changed the hdX names to sdX)
#4 I added that line to the 'syslinux.cfg' file to change the amount of RAM, but I couldn't see where in the syslog to check that it took effect.

I've attached the latest syslog since these changes (to verify the correct amount of RAM), and will be trying a large file transfer today to see how it goes! I appreciate the help!

syslog_after_fixes.txt
Barziya Posted September 16, 2013

kenshinx34 wrote: "I've attached the latest syslog since these changes (to verify the correct amount of RAM)"

That syslog shows that it booted in 6GB, so we've messed up something in #4. I was probably not very clear there. The syslinux.cfg should look something like this:

label unRAID OS
  menu default
  kernel bzimage
  append initrd=bzroot mem=4095M

(^^^ That's not the whole syslinux.cfg, just the relevant part of it.)

---

Note for anybody else reading this thread: Don't use that 'mem=' option if you have less than 4GB of RAM, as that will crash your server.
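As the snippet above suggests, the option belongs on the append line of the boot entry that is actually used, not somewhere else in the file. The sketch below shows one way to add it programmatically and only under a given label. It is a rough illustration: the function `add_append_option` is made up for this example, the second stanza in the sample is hypothetical, and the real syslinux.cfg on the flash drive contains more than is shown here. Keep a backup copy before editing the real file.

```python
def add_append_option(cfg_text, option, label="unRAID OS"):
    """Add `option` to the append line of the stanza whose label equals `label`.

    Lines in other stanzas are left untouched, and the option is not added
    a second time if it is already present.
    """
    out = []
    in_stanza = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("label "):
            # A new stanza starts; remember whether it is the one we want.
            in_stanza = stripped[len("label "):].strip() == label
        elif (in_stanza and stripped.startswith("append ")
              and option not in stripped.split()):
            line = line.rstrip() + " " + option
        out.append(line)
    return "\n".join(out) + "\n"


# The first stanza follows the post above (without mem=); the second one is
# hypothetical and only there to show that it is not modified.
sample_cfg = """\
label unRAID OS
  menu default
  kernel bzimage
  append initrd=bzroot
label Some Other Entry
  kernel other
  append initrd=other
"""

patched = add_append_option(sample_cfg, "mem=4095M")
print(patched)
```

Running the function twice gives the same result as running it once, which makes it harder to end up with a duplicated option.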
unevent Posted September 16, 2013

IIRC, the suspect boards that had issues with 4G+ ram were Supermicro with certain chipsets and it was for slow writes. If this does not affect you then you should remove the memory limit option of the boot. Regardless, the issue should no longer exist in 5.0.

Sounds like you have the drive problem fixed by changing the cable. For the transfer issue, what are you transferring from: which version of Windows? Are you using SMB or NFS? Are you using any program to do the transfer, such as rsync, TeraCopy, robocopy, etc.? Can you post a full syslog? A quick search in the forum will net instructions on how to obtain one.
Barziya Posted September 16, 2013

unevent wrote: "IIRC, the suspect boards that had issues with 4G+ ram were Supermicro with certain chipsets and it was for slow writes. If this does not affect you then you should remove the memory limit option of the boot. Regardless, the issue should no longer exist in 5.0."

That is not quite true. That issue was never really fixed. I can still reproduce it in v5.0 on my server: I watch the network graph in Task Manager on the Windows client, and when the server has more than 4GB of memory, the graph at times goes all the way down to zero for long periods of time, then network traffic resumes for a while, then it goes down to zero again, and so on. When it stays down for a longer period, windoze complains that it has lost the connection, or that the path is no longer available. When I restart the same server with the mem=4095M option, the network transfer graph is smooth and never drops down to zero while the files are copying.
unevent Posted September 17, 2013

Barziya wrote: "That is not quite true. That issue was never really fixed. I can still reproduce it in v5.0 on my server: I watch the network graph in Task Manager on the Windows client, and when the server has more than 4GB of memory, the graph at times goes all the way down to zero for long periods of time, then network traffic resumes for a while, then it goes down to zero again, and so on. When it stays down for a longer period, windoze complains that it has lost the connection, or that the path is no longer available. When I restart the same server with the mem=4095M option, the network transfer graph is smooth and never drops down to zero while the files are copying."

Hard to say, but it sounds more likely a tuning issue with the VM (virtual memory) subsystem, not necessarily the original problem that was fixed. IIRC, the original problem was due to a hardware issue or kernel conflict with certain chipsets and was prevalent on certain Supermicro boards. In the case of what you describe, I believe the Linux kernel is caching too much of the incoming data before starting the workers that write the data out. Once writing starts, things slow down because of the IO waits. The defaults in the VM subsystem are archaic and were set when hardware (memory size) was a fraction of what it is today. Do a web search for Linus Torvalds dirty limits (32-bit kernels) and the like for more on the matter.
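The caching argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration under loudly stated assumptions: the kernel derives its dirty-page limits from "dirtyable" memory (which is smaller than total RAM, and on 32-bit kernels may be much smaller), the percentage comes from the `vm.dirty_ratio` sysctl, and the numbers used here (4000 MB dirtyable, a ratio of 20, 40 MB/s sustained disk writes) are made-up example values, not measurements from this server.

```python
def dirty_limit_mb(dirtyable_mb, dirty_ratio_percent):
    """Approximate amount of unwritten (dirty) data allowed to build up, in MB."""
    return dirtyable_mb * dirty_ratio_percent / 100.0


def flush_seconds(dirty_mb, disk_write_mb_per_s):
    """Rough time needed to write that much data out at a steady disk speed."""
    return dirty_mb / disk_write_mb_per_s


# Example values only (assumptions, see the text above).
limit = dirty_limit_mb(dirtyable_mb=4000, dirty_ratio_percent=20)
stall = flush_seconds(limit, disk_write_mb_per_s=40)
print(f"up to {limit:.0f} MB may be buffered, about {stall:.0f} s to flush")
```

With these example numbers, several hundred megabytes could sit in memory before writers are throttled, and flushing them would take on the order of tens of seconds. That is the kind of pause that could show up as a flat network graph on the client. The actual values on a given system can be read from /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio.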
Barziya Posted September 17, 2013

unevent wrote: "... not necessarily the original problem that was fixed."

No, it wasn't fixed. If you review that thread -- was it around RC14 or something -- you'll see that Tom ended up deciding between two possible solutions:

#1: Make a 64-bit kernel. (That would have been the real fix.)
#2: Make the new RC with a mem=4095M option in the boot menu. (That is not really a fix, but a workaround.)

In the end Tom decided to go with #2. So he posted an RC with that boot option, and immediately people started screaming that their servers were crashing -- people with less than 4GB RAM, of course. So what Tom did was immediately post yet another RC the next day, removing that option from the boot menu, and call it a fix for the crashing servers -- no mention of the original problem that called for the 'mem=' workaround in the first place. And that was the end of it. That was the "fix" you remember.

unevent wrote: "Do a web search for Linus Torvalds..."

Here are two quotes from Linus that are quite relevant to the matter:

http://lkml.org/lkml/2013/4/29/512
http://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/
kenshinx34 Posted September 18, 2013

Okay, I made the changes to my syslinux.cfg on the correct lines this time! Where do I look to verify that it is only using 4095 MB? Here's my syslog! Thanks!

New_Text_Document_3.txt
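One way to answer the "where do I look" question without reading the whole syslog is to check two standard Linux files on the running server: /proc/cmdline shows the options the kernel was booted with, and /proc/meminfo reports how much memory it is using. The sketch below parses the contents of both. The helper names and the sample strings are hypothetical (they are not taken from the attached syslog), and MemTotal is normally somewhat lower than the mem= limit because the kernel reserves some memory for itself.

```python
import re


def mem_option(cmdline):
    """Return the value of mem= on a kernel command line, or None if absent."""
    for token in cmdline.split():
        if token.startswith("mem="):
            return token[len("mem="):]
    return None


def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo contents in MB (kB / 1024), or None."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) / 1024 if match else None


# Hypothetical sample contents; on the server use
#   open("/proc/cmdline").read() and open("/proc/meminfo").read()
sample_cmdline = "initrd=bzroot mem=4095M"
sample_meminfo = "MemTotal:        4147200 kB\nMemFree:         3000000 kB\n"

print(mem_option(sample_cmdline))    # the limit the kernel was asked to use
print(mem_total_mb(sample_meminfo))  # what the kernel reports as usable
```

If `mem_option` returns None on the real system, the option did not reach the kernel, which would match the earlier case where the server still booted with all 6GB.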