[SOLVED] Unraid 6 Daily Crash

jellis · March 16, 2016

Hi Guys,

Would a failed drive, in this case 2 failed drives not assigned to the array, cause crashing? When booting the server I get SMART errors on 2 of my drives that were not assigned. I restarted the server into unRAID twice, and each night the server crashes. Today I removed both drives and then ran a memory test, which passed. Is it as simple as the drives causing this failure? Also, no errors show in the log that I checked from the GUI. Thanks for your help!

Model: N/A

M/B: Supermicro - X10SL7-F

CPU: Intel® Xeon® CPU E3-1230 v3 @ 3.30GHz

HVM: Enabled

IOMMU: Enabled

Cache: 256 kB, 1024 kB, 8192 kB

Memory: 16384 MB (max. installable capacity 32 GB)

Network: eth0: 1000Mb/s - Full Duplex

eth1: not connected

Kernel: Linux 4.1.18-unRAID x86_64

OpenSSL: 1.0.1s

Uptime:0 days, 01:07:54

kizer · March 16, 2016

Define daily crash? I'd also highly recommend posting up a diagnostic file.

jellis · March 16, 2016

Hi Kizer. Sure and thanks! Sometime between when I go to bed and when I wake up, the system freezes (probably a better term than crash), to the point that I am not able to log in as root or even connect via telnet. One thing I have not been able to tell is what the screen displays when it freezes. I have connected a monitor so I can get a visual if it happens again. I have attached the diagnostics zip. Thanks for looking at this.

Here is what shows in the error and warnings log:

Mar 16 10:33:22 Tower kernel: ACPI: Early table checksum verification disabled

Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20150410/hwxface-580)

Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20150410/hwxface-580)

Mar 16 10:33:22 Tower kernel: floppy0: no floppy controllers found

Mar 16 10:33:26 Tower rpc.statd[1614]: Failed to read /var/lib/nfs/state: Success

Mar 16 10:33:26 Tower sshd[1634]: Server listening on 0.0.0.0 port 22.

Mar 16 10:33:36 Tower kernel: REISERFS (device md1): replayed 740 transactions in 6 seconds

Mar 16 10:33:36 Tower kernel: REISERFS (device md2): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md3): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md4): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md5): replayed 7 transactions in 0 seconds

Mar 16 10:33:39 Tower kernel: REISERFS (device md6): replayed 77 transactions in 1 seconds

Mar 16 10:33:49 Tower avahi-daemon[11181]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!

Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:33:58 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:33:58 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:33:58 Tower kernel: ata2: hard resetting link

Mar 16 10:34:12 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:34:12 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:34:12 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:34:12 Tower kernel: ata2: hard resetting link

Mar 16 10:37:03 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:37:03 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:37:03 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:37:03 Tower kernel: ata2: hard resetting link

Mar 16 10:37:26 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:37:26 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:37:26 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:37:26 Tower kernel: ata2: hard resetting link

tower-diagnostics-20160316-1330.zip
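For anyone reading a similar log, it can help to tally which ATA port the exceptions come from before chasing hardware. The following is a minimal Python sketch, not part of the original diagnostics; the function name `count_ata_exceptions` and the regular expression are assumptions made for illustration, matched only against the line format shown above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# "Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 ..."
# The pattern is an assumption based on the excerpt above, not a full syslog grammar.
ATA_EXCEPTION = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask")


def count_ata_exceptions(lines):
    """Return a Counter mapping ATA port name (e.g. 'ata2') to exception count."""
    counts = Counter()
    for line in lines:
        match = ATA_EXCEPTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen",
    "Mar 16 10:33:58 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error",
    "Mar 16 10:33:58 Tower kernel: ata2: hard resetting link",
    "Mar 16 10:34:12 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen",
]
print(count_ata_exceptions(sample))  # Counter({'ata2': 2})
```

If every exception lands on a single port, as it does here, that points at one cable, bay, or disk rather than the controller as a whole.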

JorgeB · March 16, 2016

In my experience, and in some cases, a bad disk can crash or make the computer unresponsive, just by being connected, even if not in use.

You should disconnect them and check if the problem goes away.

JorgeB · March 16, 2016

Those ata2 errors are probably from a bad sata cable, replace this cable:

Device Model:     ST5000DM000-1FK178
Serial Number:    W4J04L88
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177

JorgeB · March 16, 2016

Your cache disk also needs attention:

Device Model:     ST3000DM001-9YN166
Serial Number:    W1F055SX
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88

If you have a spare replace it.

jellis · March 16, 2016

Those ata2 errors are probably from a bad sata cable, replace this cable:

Device Model:     ST5000DM000-1FK178
Serial Number:    W4J04L88
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177

So this is interesting. This is my parity drive and is plugged into a Icy Dock Backplane:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817994150&nm_mc=TEMC-RMA-Approvel&cm_mmc=TEMC-RMA-Approvel-_-Content-_-text-_-

I will try replacing the sata cable to see if this helps it.

jellis · March 16, 2016

In my experience, and in some cases, a bad disk can crash or make the computer unresponsive, just by being connected, even if not in use.

You should disconnect them and check if the problem goes away.

I removed the drives that had SMART errors. Incidentally, both of which were in the Icy Dock Backplane.

jellis · March 16, 2016

Your cache disk also needs attention:

Device Model:     ST3000DM001-9YN166
Serial Number:    W1F055SX
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88

If you have a spare replace it.

OK, so on this one, I just replaced the drive about a month ago. What exactly is this error telling me? That a sector on the drive cannot be read and is uncorrectable? Thanks for your help!

JorgeB · March 16, 2016

I will try replacing the sata cable to see if this helps it.

The most common cause of this error is a bad (or badly connected) cable, but it can also be the enclosure or, much less likely, the SATA port. Keep an eye on the UDMA_CRC value: an increase of 2 or more means there's still a problem.
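The "increase of 2 or more" check can be done by comparing two saved SMART reports, one taken before the cable swap and one some time after. Here is a rough Python sketch under the assumption that both snapshots contain an attribute row in the layout shown earlier in this thread; `smart_raw_value` and `crc_count_grew` are hypothetical helper names, and the "after" reading of 180 is made up for the example.

```python
def smart_raw_value(smartctl_text, attribute_name):
    """Return the integer RAW_VALUE of a named SMART attribute, or None if absent.

    Assumes the usual row layout:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    Only suitable for attributes whose raw value starts with a plain integer.
    """
    for line in smartctl_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute_name:
            return int(fields[9])
    return None


def crc_count_grew(before_text, after_text, min_increase=2):
    """True if UDMA_CRC_Error_Count rose by at least min_increase between snapshots."""
    before = smart_raw_value(before_text, "UDMA_CRC_Error_Count")
    after = smart_raw_value(after_text, "UDMA_CRC_Error_Count")
    if before is None or after is None:
        raise ValueError("UDMA_CRC_Error_Count not found in one of the snapshots")
    return after - before >= min_increase


before = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177"
after_unchanged = before
after_worse = before.replace("177", "180")  # hypothetical later reading

print(crc_count_grew(before, after_unchanged))  # False
print(crc_count_grew(before, after_worse))      # True
```

The raw count never resets, so only the difference between readings matters, not the absolute value of 177.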

JorgeB · March 16, 2016

OK, so on this one, I just replaced the drive about a month ago. What exactly is this error telling me? That a sector on the drive cannot be read and is uncorrectable? Thanks for your help!

This usually means there are bad sectors, in this case at least 88. You can check by running an extended SMART test. If it fails with a read error (and I suspect it will), you should replace it; you can then try running a few preclear cycles and see if the pending sectors are reallocated and the number goes to 0.
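To spot this condition in a saved SMART report without reading every row, something like the following Python sketch could be used. It assumes the attribute row layout quoted above; the watch list of IDs 5, 197, and 198 is an illustrative choice rather than an official set, and `nonzero_sector_counts` is a hypothetical name.

```python
# Attribute IDs watched here (an illustrative choice, not an official list):
# 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector, 198 = Offline_Uncorrectable
WATCHED_IDS = {5, 197, 198}


def nonzero_sector_counts(attribute_rows):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for row in attribute_rows:
        fields = row.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            raw = int(fields[9])
            if raw > 0:
                flagged[fields[1]] = raw
    return flagged


rows = [
    "197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88",
    "198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88",
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177",
]
print(nonzero_sector_counts(rows))
# {'Current_Pending_Sector': 88, 'Offline_Uncorrectable': 88}
```

An empty result after preclearing would match the "number goes to 0" outcome described above.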

jellis · March 17, 2016

OK, so on this one, I just replaced the drive about a month ago. What exactly is this error telling me? That a sector on the drive cannot be read and is uncorrectable? Thanks for your help!

This usually means there are bad sectors, in this case at least 88. You can check by running an extended SMART test. If it fails with a read error (and I suspect it will), you should replace it; you can then try running a few preclear cycles and see if the pending sectors are reallocated and the number goes to 0.

Ran the smart test and here are the results:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Seagate Barracuda 7200.14 (AF)

Device Model: ST3000DM001-9YN166

Serial Number: W1F055SX

LU WWN Device Id: 5 000c50 044d6b4a8

Firmware Version: CC46

User Capacity: 3,000,592,982,016 bytes [3.00 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: 7200 rpm

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS T13/1699-D revision 4

SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Wed Mar 16 22:30:23 2016 EDT

==> WARNING: A firmware update for this drive is available,

see the following Seagate web pages:

http://knowledge.seagate.com/articles/en_US/FAQ/207931en

http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status: ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: ( 575) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

No Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 349) minutes.

Conveyance self-test routine

recommended polling time: ( 2) minutes.

SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   079   006    Pre-fail  Always       -       15312136
  3 Spin_Up_Time            0x0003   094   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       7016
  7 Seek_Error_Rate         0x000f   060   055   030    Pre-fail  Always       -       309311370372
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24250
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       915
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       3311
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4 4 4
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   051   045    Old_age   Always       -       39 (Min/Max 34/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       872
193 Load_Cycle_Count        0x0032   015   015   000    Old_age   Always       -       171208
194 Temperature_Celsius     0x0022   039   049   000    Old_age   Always       -       39 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       48
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       48
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12373h+08m+04.612s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       45133539560702
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       141632493877148

SMART Error Log Version: 1

ATA Error Count: 4327 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4327 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

61 00 e0 ff ff ff 4f 00 23d+12:25:30.513 WRITE FPDMA QUEUED

61 00 e0 ff ff ff 4f 00 23d+12:25:30.311 WRITE FPDMA QUEUED

61 00 e0 ff ff ff 4f 00 23d+12:25:30.223 WRITE FPDMA QUEUED

61 00 e0 ff ff ff 4f 00 23d+12:25:30.067 WRITE FPDMA QUEUED

61 00 e0 ff ff ff 4f 00 23d+12:25:29.770 WRITE FPDMA QUEUED

Error 4326 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

61 00 00 ff ff ff 4f 00 23d+12:25:21.472 WRITE FPDMA QUEUED

Error 4325 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 a0 f8 9f 06 Error: UNC at LBA = 0x069ff8a0 = 111147168

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

60 00 20 80 71 08 40 00 23d+12:25:13.089 READ FPDMA QUEUED

61 00 e0 f0 92 e0 4d 00 23d+12:25:13.067 WRITE FPDMA QUEUED

61 00 e0 60 49 e0 4c 00 23d+12:25:13.000 WRITE FPDMA QUEUED

61 00 00 60 45 e0 4c 00 23d+12:25:13.000 WRITE FPDMA QUEUED

61 00 e0 38 95 20 4c 00 23d+12:25:12.698 WRITE FPDMA QUEUED

Error 4324 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

61 00 a0 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED

61 00 20 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED

61 00 c0 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED

61 00 20 ff ff ff 4f 00 23d+12:25:02.559 WRITE FPDMA QUEUED

Error 4323 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 a0 f8 9f 06 Error: WP at LBA = 0x069ff8a0 = 111147168

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

61 00 00 ff ff ff 4f 00 23d+12:24:56.323 WRITE FPDMA QUEUED

61 00 00 ff ff ff 4f 00 23d+12:24:56.322 WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Extended offline Completed: read failure 90% 24247 60855848

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

John_M · March 17, 2016

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Extended offline Completed: read failure 90% 24247 60855848

As johnnie predicted, it failed. So you need to replace it.
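The self-test log row quoted above can also be checked programmatically. This is a small Python sketch that assumes the row layout shown in this thread and only handles statuses that begin with "Completed"; the regular expression and the name `parse_selftest_row` are illustrative assumptions, not part of smartctl.

```python
import re

# Layout assumed from the quoted row:
# "# 1 Extended offline Completed: read failure 90% 24247 60855848"
# Only statuses that start with "Completed" are recognized in this sketch.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<description>.+?)\s+(?P<status>Completed.*?)"
    r"\s+(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_row(line):
    """Parse one self-test log row into a dict, or return None if it does not match."""
    match = SELFTEST_ROW.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    # Flag the case discussed in this thread: the test hit an unreadable sector.
    row["read_failure"] = "read failure" in row["status"]
    return row


row = parse_selftest_row("# 1 Extended offline Completed: read failure 90% 24247 60855848")
print(row["status"], row["lba"], row["read_failure"])
# Completed: read failure 60855848 True
```

Here the "90%" is the portion of the test still remaining when it stopped, so the first unreadable sector was found early in the scan.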

jellis · March 17, 2016

Thanks guys. I will get to work on replacing the SATA cables and drives and then see where I stand. Bummer I am losing so many drives... they are getting old, though.

jellis · March 17, 2016

Quick update. The server froze again last night. I have not made any changes to the cache drive yet but did pull the 2 failed drives. I understand that I have some hardware issues, but there must also be something running in the middle of the night that is causing this. Could it be a cron job? I was able to grab this screenshot before I restarted the server this morning. Does this make sense to anyone?

I am going to disable all of my dockers and cache drive for now. Hoping to stop the freezing.

JorgeB · March 17, 2016

Do you have the mover running at night?

Since the cache disk is bad it could be the reason for the crash.

bardsleyb · March 18, 2016

Do you have the mover running at night?

Since the cache disk is bad it could be the reason for the crash.

Bingo! I agree with JB... I had this exact same issue several months back. A bad cache drive was being written to in the middle of the night, and every single morning when I woke up my server was toast and unusable. I hope your troubles get solved quickly. Unraid is an awesome powerhouse when it all runs as designed. Good luck!

jellis · March 18, 2016

Yep, I sure do! This makes perfect sense. I have changed the mover to run monthly. This should buy me enough time to swap out the drive. Not looking forward to the long process of clearing the new drive.

Agreed that unRAID is great. I am very happy with it, especially with my plex, tv and home automation needs.

Thanks again for your help guys!

jellis · March 18, 2016

No crash this morning. Thanks again.
