jellis Posted March 16, 2016
Hi Guys, Would a failed drive, in this case 2 failed drives not assigned to the array, cause crashing? When booting the server I get SMART errors on 2 of my drives that were not assigned. I restarted the server into unRAID twice and each night the server crashes. Today I removed both drives and then ran a memory test, which passed. Is it as simple as the drives causing this failure? Also, no errors show in the log that I checked from the GUI. Thanks for your help!
Model: N/A
M/B: Supermicro - X10SL7-F
CPU: Intel® Xeon® CPU E3-1230 v3 @ 3.30GHz
HVM: Enabled
IOMMU: Enabled
Cache: 256 kB, 1024 kB, 8192 kB
Memory: 16384 MB (max. installable capacity 32 GB)
Network: eth0: 1000Mb/s - Full Duplex; eth1: not connected
Kernel: Linux 4.1.18-unRAID x86_64
OpenSSL: 1.0.1s
Uptime: 0 days, 01:07:54
kizer Posted March 16, 2016
Define daily crash? I'd also highly recommend posting a diagnostic file.
jellis Posted March 16, 2016
Hi Kizer. Sure and thanks! Sometime between the time I go to bed and wake up the system freezes, which is probably a better term than crash. It freezes to the point that I am not able to log in as root or even connect via telnet. One thing I have not been able to tell is what the screen displays when it freezes. I have connected a monitor so I can get a visual if it happens again. I have attached the diagnostics zip. Thanks for looking at this.
Here is what shows in the error and warnings log:
Mar 16 10:33:22 Tower kernel: ACPI: Early table checksum verification disabled
Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20150410/hwxface-580)
Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20150410/hwxface-580)
Mar 16 10:33:22 Tower kernel: floppy0: no floppy controllers found
Mar 16 10:33:26 Tower rpc.statd[1614]: Failed to read /var/lib/nfs/state: Success
Mar 16 10:33:26 Tower sshd[1634]: Server listening on 0.0.0.0 port 22.
Mar 16 10:33:36 Tower kernel: REISERFS (device md1): replayed 740 transactions in 6 seconds
Mar 16 10:33:36 Tower kernel: REISERFS (device md2): replayed 6 transactions in 0 seconds
Mar 16 10:33:37 Tower kernel: REISERFS (device md3): replayed 6 transactions in 0 seconds
Mar 16 10:33:37 Tower kernel: REISERFS (device md4): replayed 6 transactions in 0 seconds
Mar 16 10:33:37 Tower kernel: REISERFS (device md5): replayed 7 transactions in 0 seconds
Mar 16 10:33:39 Tower kernel: REISERFS (device md6): replayed 77 transactions in 1 seconds
Mar 16 10:33:49 Tower avahi-daemon[11181]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Mar 16 10:33:58 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Mar 16 10:33:58 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:33:58 Tower kernel: ata2: hard resetting link
Mar 16 10:34:12 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Mar 16 10:34:12 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Mar 16 10:34:12 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:34:12 Tower kernel: ata2: hard resetting link
Mar 16 10:37:03 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Mar 16 10:37:03 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Mar 16 10:37:03 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:37:03 Tower kernel: ata2: hard resetting link
Mar 16 10:37:26 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Mar 16 10:37:26 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Mar 16 10:37:26 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:37:26 Tower kernel: ata2: hard resetting link
Attachment: tower-diagnostics-20160316-1330.zip
JorgeB Posted March 16, 2016
In my experience, in some cases a bad disk can crash the computer or make it unresponsive just by being connected, even if it is not in use. You should disconnect them and check whether the problem goes away.
JorgeB Posted March 16, 2016
Those ata2 errors are probably from a bad SATA cable. Replace the cable on this disk:
Device Model: ST5000DM000-1FK178
Serial Number: W4J04L88
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 177
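To see at a glance which port keeps resetting, the "hard resetting link" messages in the syslog can be tallied per ATA port. The following is only a minimal Python sketch: the function name `count_link_resets` and the regular expression are illustrative assumptions based on the log lines posted above, not part of unRAID or any of its tools.

```python
import re
from collections import Counter

# Matches kernel log lines such as
# "Mar 16 10:33:58 Tower kernel: ata2: hard resetting link"
# (pattern is an assumption derived from the lines posted in this thread)
RESET_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: hard resetting link")


def count_link_resets(log_text: str) -> Counter:
    """Return how many 'hard resetting link' messages each ATA port logged."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample taken from the log excerpt posted earlier in the thread
sample = """\
Mar 16 10:33:58 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:33:58 Tower kernel: ata2: hard resetting link
Mar 16 10:34:12 Tower kernel: ata2.00: failed command: READ DMA EXT
Mar 16 10:34:12 Tower kernel: ata2: hard resetting link
"""

print(count_link_resets(sample))
```

A port that keeps accumulating resets after the cable has been swapped would point toward the enclosure or the port rather than the cable.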
JorgeB Posted March 16, 2016
Your cache disk also needs attention:
Device Model: ST3000DM001-9YN166
Serial Number: W1F055SX
197 Current_Pending_Sector 0x0012 100 001 000 Old_age Always - 88
198 Offline_Uncorrectable 0x0010 100 001 000 Old_age Offline - 88
If you have a spare, replace it.
jellis Posted March 16, 2016
JorgeB wrote: "Those ata2 errors are probably from a bad sata cable, replace this cable:
Device Model: ST5000DM000-1FK178
Serial Number: W4J04L88
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 177"
So this is interesting. This is my parity drive and it is plugged into an Icy Dock backplane: http://www.newegg.com/Product/Product.aspx?Item=N82E16817994150&nm_mc=TEMC-RMA-Approvel&cm_mmc=TEMC-RMA-Approvel-_-Content-_-text-_-
I will try replacing the SATA cable to see if that helps.
jellis Posted March 16, 2016
JorgeB wrote: "In my experience, and in some cases, a bad disk can crash or make the computer unresponsive, just by being connected, even if not in use. You should disconnect them and check if the problem goes away."
I removed the drives that had SMART errors. Incidentally, both of them were in the Icy Dock backplane.
jellis Posted March 16, 2016
JorgeB wrote: "Your cache disk also needs attention:
Device Model: ST3000DM001-9YN166
Serial Number: W1F055SX
197 Current_Pending_Sector 0x0012 100 001 000 Old_age Always - 88
198 Offline_Uncorrectable 0x0010 100 001 000 Old_age Offline - 88
If you have a spare replace it."
OK, so on this one, I just replaced the drive about a month ago. What exactly is this error telling me? That a sector on the drive is failing to be read and is uncorrectable? Thanks for your help!
JorgeB Posted March 16, 2016
jellis wrote: "I will try replacing the sata cable to see if this helps it."
The most common cause of this error is a bad (or badly connected) cable, but it can also be the enclosure or, much less likely, the SATA port. Keep an eye on the UDMA_CRC value: an increase of 2 or more means there's still a problem.
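The "increase of 2 or more" rule of thumb above can be checked mechanically by comparing two snapshots of the SMART attribute table (for example, two saved copies of `smartctl -A` output taken before and after the cable swap). This is a hedged Python sketch: `raw_value`, `crc_still_climbing` and the row pattern are hypothetical names and assumptions based on the attribute rows quoted in this thread, not an official parser.

```python
import re

# One row of the smartctl attribute table, e.g.
# "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 177"
# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def raw_value(report: str, attr_id: int):
    """Return the raw value of the given attribute ID, or None if absent."""
    for line in report.splitlines():
        match = ATTR_ROW.match(line)
        if match and int(match.group("id")) == attr_id:
            return int(match.group("raw"))
    return None


def crc_still_climbing(before: str, after: str, threshold: int = 2) -> bool:
    """True if attribute 199 grew by `threshold` or more between snapshots.

    The default threshold of 2 follows the rule of thumb given in the thread.
    """
    old = raw_value(before, 199)
    new = raw_value(after, 199)
    if old is None or new is None:
        raise ValueError("attribute 199 (UDMA_CRC_Error_Count) not found")
    return new - old >= threshold


before = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 177"
# Hypothetical later snapshot, invented purely for illustration
after = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 179"
print(crc_still_climbing(before, after))
```

The raw count never resets, so only the change since the last hardware swap matters, not the absolute number.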
JorgeB Posted March 16, 2016
jellis wrote: "ok, so on this one, I just replaced the drive about a month ago. What exactly is this error telling me? A sector within the drive is failing to be read and uncorrectable?"
This usually means there are bad sectors, in this case at least 88. You can check by running an extended SMART test; if it fails with a read error (and I suspect it will), you should replace the disk. You can then try running a few preclear cycles on it and see whether the pending sectors are reallocated and the number goes to 0.
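The two attributes discussed here can be pulled out of a saved SMART report with a few lines of Python. This is only a sketch under stated assumptions: `parse_attributes` and `sector_warnings` are hypothetical helper names, and the parsing relies on the ten-column attribute rows shown in this thread.

```python
# Attributes flagged in the thread as needing attention when non-zero
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(report: str) -> dict:
    """Map attribute name -> raw value for every attribute row in the report."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, have a hex flag in the
        # third column, and carry the raw value in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                # Raw values such as "12373h+08m+04.612s" are not plain integers
                continue
    return attrs


def sector_warnings(report: str) -> dict:
    """Return the watched attributes whose raw value is greater than zero."""
    attrs = parse_attributes(report)
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


# Rows quoted for the cache disk earlier in the thread
report = """\
197 Current_Pending_Sector 0x0012 100 001 000 Old_age Always - 88
198 Offline_Uncorrectable 0x0010 100 001 000 Old_age Offline - 88
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(sector_warnings(report))
```

Running the same check after each preclear cycle would show whether the pending count is actually dropping toward 0.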
jellis Posted March 17, 2016
Ran the SMART test and here are the results:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-9YN166
Serial Number: W1F055SX
LU WWN Device Id: 5 000c50 044d6b4a8
Firmware Version: CC46
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Mar 16 22:30:23 2016 EDT

==> WARNING: A firmware update for this drive is available, see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: ( 575) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 349) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 108 079 006 Pre-fail Always - 15312136
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 095 095 036 Pre-fail Always - 7016
7 Seek_Error_Rate 0x000f 060 055 030 Pre-fail Always - 309311370372
9 Power_On_Hours 0x0032 073 073 000 Old_age Always - 24250
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 915
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 3311
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 4 4 4
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 061 051 045 Old_age Always - 39 (Min/Max 34/43)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 872
193 Load_Cycle_Count 0x0032 015 015 000 Old_age Always - 171208
194 Temperature_Celsius 0x0022 039 049 000 Old_age Always - 39 (0 24 0 0 0)
197 Current_Pending_Sector 0x0012 100 001 000 Old_age Always - 48
198 Offline_Uncorrectable 0x0010 100 001 000 Old_age Offline - 48
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 12373h+08m+04.612s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 45133539560702
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 141632493877148

SMART Error Log Version: 1
ATA Error Count: 4327 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4327 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 e0 ff ff ff 4f 00 23d+12:25:30.513 WRITE FPDMA QUEUED
61 00 e0 ff ff ff 4f 00 23d+12:25:30.311 WRITE FPDMA QUEUED
61 00 e0 ff ff ff 4f 00 23d+12:25:30.223 WRITE FPDMA QUEUED
61 00 e0 ff ff ff 4f 00 23d+12:25:30.067 WRITE FPDMA QUEUED
61 00 e0 ff ff ff 4f 00 23d+12:25:29.770 WRITE FPDMA QUEUED

Error 4326 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED

Error 4325 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 f8 9f 06 Error: UNC at LBA = 0x069ff8a0 = 111147168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 80 71 08 40 00 23d+12:25:13.089 READ FPDMA QUEUED
61 00 e0 f0 92 e0 4d 00 23d+12:25:13.067 WRITE FPDMA QUEUED
61 00 e0 60 49 e0 4c 00 23d+12:25:13.000 WRITE FPDMA QUEUED
61 00 00 60 45 e0 4c 00 23d+12:25:13.000 WRITE FPDMA QUEUED
61 00 e0 38 95 20 4c 00 23d+12:25:12.698 WRITE FPDMA QUEUED

Error 4324 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 a0 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED
61 00 20 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED
61 00 20 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED
61 00 c0 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED
61 00 20 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED

Error 4323 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 00 ff ff ff 4f 00 23d+12:24:56.323 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:24:56.323 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:24:56.323 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:24:56.323 WRITE FPDMA QUEUED
61 00 00 ff ff ff 4f 00 23d+12:24:56.322 WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 24247 60855848

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
John_M Posted March 17, 2016
jellis wrote: "SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 24247 60855848"
As johnnie predicted, it failed. So you need to replace it.
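The self-test log row quoted above carries the verdict in its status text, plus the remaining percentage, the power-on hours at the time of the test, and the first failing LBA. A small Python sketch can pull those fields out. Note the assumptions: `parse_selftest_row` is a hypothetical helper, it parses from the right-hand side because the test description and status both contain spaces, and it only recognises the "read failure" status seen in this thread.

```python
def parse_selftest_row(row: str):
    """Parse one row of the SMART self-test log, or return None if it is not one.

    Expected shape (as posted in the thread):
    "# 1 Extended offline Completed: read failure 90% 24247 60855848"
    """
    text = row.strip()
    if not text.startswith("#"):
        return None
    fields = text[1:].split()
    # Need at least: number, some description/status text, remaining%, hours, LBA
    if len(fields) < 5 or not fields[0].isdigit() or not fields[-3].endswith("%"):
        return None
    description_and_status = " ".join(fields[1:-3])
    lba = fields[-1]
    return {
        "num": int(fields[0]),
        "description_and_status": description_and_status,
        "read_failure": "read failure" in description_and_status.lower(),
        "remaining_percent": int(fields[-3].rstrip("%")),
        "lifetime_hours": int(fields[-2]),
        # A row without an error has no numeric LBA in the last column
        "first_error_lba": int(lba) if lba.isdigit() else None,
    }


row = "# 1 Extended offline Completed: read failure 90% 24247 60855848"
result = parse_selftest_row(row)
print(result["read_failure"], result["first_error_lba"])
```

Here 90% remaining means the test stopped after covering only about a tenth of the disk, at LBA 60855848.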
jellis Posted March 17, 2016
Thanks guys. I will get to work on replacing the SATA cables and drives and then see where I stand. Bummer that I am losing so many drives... they are getting old though.
jellis Posted March 17, 2016
Quick update. The server froze again last night. I have not made any changes to the cache drive yet but did pull the 2 failed drives. I understand that I have some hardware issues, but there must also be something running in the middle of the night that is causing this? Could it be the cron job? I was able to grab this screenshot before I restarted the server this morning. Does this make sense to anyone? I am going to disable all of my dockers and the cache drive for now, hoping to stop the freezing.
JorgeB Posted March 17, 2016
Do you have the mover running at night? Since the cache disk is bad, it could be the reason for the crash.
bardsleyb Posted March 18, 2016
JorgeB wrote: "Do you have the mover running at night? Since the cache disk is bad it could be reason for the crash."
Bingo! I agree with JB... I had this exact same issue several months back. A bad cache drive was being written to in the middle of the night, and every single morning when I woke up my server was toast and unusable. I hope your troubles get solved quickly. unRAID is an awesome powerhouse when it all runs as designed. Good luck!
jellis Posted March 18, 2016
Yep, I sure do! This makes perfect sense. I have changed the mover to run monthly. This should buy me enough time to swap out the drive. Not looking forward to the long process of clearing the new drive. Agreed that unRAID is great. I am very happy with it, especially for my Plex, TV and home automation needs. Thanks again for your help guys!
jellis Posted March 18, 2016
No crash this morning. Thanks again.