[SOLVED] Unraid 6 Daily Crash



Hi Guys,

 

Would a failed drive, in this case 2 failed drives not assigned to the array, cause crashing?  When booting the server I get SMART errors on 2 of my drives that were not assigned.  I restarted the server into unRAID twice, and each night the server crashes.  Today I removed both drives and then ran a memory test, which passed.  Is it as simple as the drives causing this failure?  Also, no errors show in the log that I checked from the GUI.  Thanks for your help!

 

 

Model: N/A

M/B: Supermicro - X10SL7-F

CPU: Intel® Xeon® CPU E3-1230 v3 @ 3.30GHz

HVM: Enabled

IOMMU: Enabled

Cache: 256 kB, 1024 kB, 8192 kB

Memory: 16384 MB (max. installable capacity 32 GB)

Network: eth0: 1000Mb/s - Full Duplex

eth1: not connected

Kernel: Linux 4.1.18-unRAID x86_64

OpenSSL: 1.0.1s

Uptime: 0 days, 01:07:54

Link to comment

Hi Kizer.  Sure, and thanks!  Sometime between when I go to bed and when I wake up, the system freezes, which is probably a better term than crash.  It gets to the point that I am not able to log in as root or even connect via telnet.  One thing I have not been able to tell is what the screen displays when it freezes, so I have connected a monitor to get a visual if it happens again.  I have attached the diagnostics zip.  Thanks for looking at this.

 

Here is what shows in the error and warnings log:

 

Mar 16 10:33:22 Tower kernel: ACPI: Early table checksum verification disabled

Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20150410/hwxface-580)

Mar 16 10:33:22 Tower kernel: ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20150410/hwxface-580)

Mar 16 10:33:22 Tower kernel: floppy0: no floppy controllers found

Mar 16 10:33:26 Tower rpc.statd[1614]: Failed to read /var/lib/nfs/state: Success

Mar 16 10:33:26 Tower sshd[1634]: Server listening on 0.0.0.0 port 22.

Mar 16 10:33:36 Tower kernel: REISERFS (device md1): replayed 740 transactions in 6 seconds

Mar 16 10:33:36 Tower kernel: REISERFS (device md2): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md3): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md4): replayed 6 transactions in 0 seconds

Mar 16 10:33:37 Tower kernel: REISERFS (device md5): replayed 7 transactions in 0 seconds

Mar 16 10:33:39 Tower kernel: REISERFS (device md6): replayed 77 transactions in 1 seconds

Mar 16 10:33:49 Tower avahi-daemon[11181]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!

Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:33:58 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:33:58 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:33:58 Tower kernel: ata2: hard resetting link

Mar 16 10:34:12 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:34:12 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:34:12 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:34:12 Tower kernel: ata2: hard resetting link

Mar 16 10:37:03 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:37:03 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:37:03 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:37:03 Tower kernel: ata2: hard resetting link

Mar 16 10:37:26 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Mar 16 10:37:26 Tower kernel: ata2.00: irq_stat 0x08000000, interface fatal error

Mar 16 10:37:26 Tower kernel: ata2.00: failed command: READ DMA EXT

Mar 16 10:37:26 Tower kernel: ata2: hard resetting link

 

tower-diagnostics-20160316-1330.zip
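
For anyone skimming a long syslog for the same pattern, here is a minimal Python sketch that counts link resets and failed commands per ATA port. The function name and the regular expression are my own illustration, not an unRAID tool, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel lines such as "ata2: hard resetting link" or
# "ata2.00: failed command: READ DMA EXT" (pattern is an assumption
# based on the log lines quoted in this thread).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.+)$")


def summarize_ata_events(lines):
    """Count hard link resets and failed commands per ATA port."""
    resets = Counter()
    failed = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("hard resetting link"):
            resets[port] += 1
        elif message.startswith("failed command:"):
            failed[port] += 1
    return resets, failed


sample = [
    "Mar 16 10:33:58 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen",
    "Mar 16 10:33:58 Tower kernel: ata2.00: failed command: READ DMA EXT",
    "Mar 16 10:33:58 Tower kernel: ata2: hard resetting link",
    "Mar 16 10:34:12 Tower kernel: ata2.00: failed command: READ DMA EXT",
    "Mar 16 10:34:12 Tower kernel: ata2: hard resetting link",
]
resets, failed = summarize_ata_events(sample)
print(dict(resets), dict(failed))
```

A port that keeps resetting, as ata2 does here, is the one whose cable, backplane slot, or drive is worth checking first.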

Link to comment

Your cache disk also needs attention:

 

Device Model:     ST3000DM001-9YN166
Serial Number:    W1F055SX
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88

 

If you have a spare replace it.

Link to comment

Those ata2 errors are probably from a bad SATA cable. Replace the cable on this drive:

 

Device Model:     ST5000DM000-1FK178
Serial Number:    W4J04L88
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177

 

So this is interesting.  This is my parity drive, and it is plugged into an Icy Dock backplane:

 

http://www.newegg.com/Product/Product.aspx?Item=N82E16817994150&nm_mc=TEMC-RMA-Approvel&cm_mmc=TEMC-RMA-Approvel-_-Content-_-text-_-

 

I will try replacing the sata cable to see if this helps it.

Link to comment

In my experience, a bad disk can in some cases crash the computer or make it unresponsive just by being connected, even if it is not in use.

 

You should disconnect them and check if the problem goes away.

 

I removed the drives that had SMART errors.  Incidentally, both of them were in the Icy Dock backplane.

Link to comment

Your cache disk also needs attention:

 

Device Model:     ST3000DM001-9YN166
Serial Number:    W1F055SX
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88

 

If you have a spare replace it.

 

OK, so on this one, I just replaced the drive about a month ago.  What exactly is this error telling me?  That a sector on the drive cannot be read and is uncorrectable?  Thanks for your help!

Link to comment

I will try replacing the sata cable to see if this helps it.

 

The most common cause of this error is a bad (or badly connected) cable, but it can also be the enclosure or, much less likely, the SATA port.  Keep an eye on the UDMA_CRC_Error_Count value: an increase of 2 or more means there's still a problem.
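
As a rough illustration of that check, the Python sketch below compares the raw UDMA_CRC_Error_Count between two saved copies of the SMART attribute table. The helper names are hypothetical, the "before" row is the one quoted earlier in this thread, and the "after" value of 180 is made up purely for the example.

```python
def raw_value(smart_output, attribute_name):
    """Return the RAW_VALUE of a named attribute from SMART attribute text."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by the name;
        # the raw value is the tenth whitespace-separated column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute_name:
            return int(fields[9])
    raise KeyError(attribute_name)


def crc_still_increasing(before, after, threshold=2):
    """True if the CRC error count grew by at least `threshold`."""
    name = "UDMA_CRC_Error_Count"
    return raw_value(after, name) - raw_value(before, name) >= threshold


# Row quoted in the thread (raw value 177).
before = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       177"
# Hypothetical later reading, invented for illustration only.
after = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       180"

print(crc_still_increasing(before, after))
```

The raw count never goes back down, so only the change between two readings matters, not the absolute number.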

 

Link to comment

OK, so on this one, I just replaced the drive about a month ago.  What exactly is this error telling me?  That a sector on the drive cannot be read and is uncorrectable?  Thanks for your help!

 

This usually means there are bad sectors, in this case at least 88.  You can check by running an extended SMART test; if it fails with a read error (and I suspect it will), you should replace the disk.  You can then try running a few preclear cycles on it to see whether the pending sectors are reallocated and the number goes to 0.
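
To make the "number goes to 0" check concrete, here is a small Python sketch that lists the sector-health attributes with a nonzero raw value. The set of watched attributes and the function name are my own choices for illustration; the sample rows are the ones quoted earlier in this thread.

```python
# Attributes commonly watched for failing sectors (an assumption of this
# sketch, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def nonzero_watched(smart_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smart_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


sample = """\
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       88
"""
found = nonzero_watched(sample)
print(found)
```

An empty result after the preclear cycles would mean the pending sectors have been dealt with; anything left in the dictionary still needs attention.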

Link to comment

Ran the smart test and here are the results:

 

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Model Family:    Seagate Barracuda 7200.14 (AF)

Device Model:    ST3000DM001-9YN166

Serial Number:    W1F055SX

LU WWN Device Id: 5 000c50 044d6b4a8

Firmware Version: CC46

User Capacity:    3,000,592,982,016 bytes [3.00 TB]

Sector Sizes:    512 bytes logical, 4096 bytes physical

Rotation Rate:    7200 rpm

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is:    Wed Mar 16 22:30:23 2016 EDT

 

==> WARNING: A firmware update for this drive is available,

see the following Seagate web pages:

http://knowledge.seagate.com/articles/en_US/FAQ/207931en

http://knowledge.seagate.com/articles/en_US/FAQ/223651en

 

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 349) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   079   006    Pre-fail  Always       -       15312136
  3 Spin_Up_Time            0x0003   094   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       7016
  7 Seek_Error_Rate         0x000f   060   055   030    Pre-fail  Always       -       309311370372
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24250
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       915
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       3311
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4 4 4
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   051   045    Old_age   Always       -       39 (Min/Max 34/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       872
193 Load_Cycle_Count        0x0032   015   015   000    Old_age   Always       -       171208
194 Temperature_Celsius     0x0022   039   049   000    Old_age   Always       -       39 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       48
198 Offline_Uncorrectable   0x0010   100   001   000    Old_age   Offline      -       48
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12373h+08m+04.612s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       45133539560702
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       141632493877148

 

SMART Error Log Version: 1

ATA Error Count: 4327 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 4327 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 a0 f8 9f 06  Error: WP at LBA = 0x069ff8a0 = 111147168

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 00 e0 ff ff ff 4f 00  23d+12:25:30.513  WRITE FPDMA QUEUED

  61 00 e0 ff ff ff 4f 00  23d+12:25:30.311  WRITE FPDMA QUEUED

  61 00 e0 ff ff ff 4f 00  23d+12:25:30.223  WRITE FPDMA QUEUED

  61 00 e0 ff ff ff 4f 00  23d+12:25:30.067  WRITE FPDMA QUEUED

  61 00 e0 ff ff ff 4f 00  23d+12:25:29.770  WRITE FPDMA QUEUED

 

Error 4326 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 a0 f8 9f 06  Error: WP at LBA = 0x069ff8a0 = 111147168

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 00 00 ff ff ff 4f 00  23d+12:25:21.472  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:25:21.472  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:25:21.472  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:25:21.472  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:25:21.472  WRITE FPDMA QUEUED

 

Error 4325 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 a0 f8 9f 06  Error: UNC at LBA = 0x069ff8a0 = 111147168

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  60 00 20 80 71 08 40 00  23d+12:25:13.089  READ FPDMA QUEUED

  61 00 e0 f0 92 e0 4d 00  23d+12:25:13.067  WRITE FPDMA QUEUED

  61 00 e0 60 49 e0 4c 00  23d+12:25:13.000  WRITE FPDMA QUEUED

  61 00 00 60 45 e0 4c 00  23d+12:25:13.000  WRITE FPDMA QUEUED

  61 00 e0 38 95 20 4c 00  23d+12:25:12.698  WRITE FPDMA QUEUED

 

Error 4324 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 a0 f8 9f 06  Error: WP at LBA = 0x069ff8a0 = 111147168

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 00 a0 ff ff ff 4f 00  23d+12:25:02.559  WRITE FPDMA QUEUED

  61 00 20 ff ff ff 4f 00  23d+12:25:02.559  WRITE FPDMA QUEUED

  61 00 20 ff ff ff 4f 00  23d+12:25:02.559  WRITE FPDMA QUEUED

  61 00 c0 ff ff ff 4f 00  23d+12:25:02.559  WRITE FPDMA QUEUED

  61 00 20 ff ff ff 4f 00  23d+12:25:02.559  WRITE FPDMA QUEUED

 

Error 4323 occurred at disk power-on lifetime: 23681 hours (986 days + 17 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 a0 f8 9f 06  Error: WP at LBA = 0x069ff8a0 = 111147168

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 00 00 ff ff ff 4f 00  23d+12:24:56.323  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:24:56.323  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:24:56.323  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:24:56.323  WRITE FPDMA QUEUED

  61 00 00 ff ff ff 4f 00  23d+12:24:56.322  WRITE FPDMA QUEUED

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%    24247        60855848

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
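
As a side note on reading the error log entries above: the LBA printed on each "Error:" line can be recomputed from the register bytes on the same row. The Python sketch below assumes the usual 28-bit layout (low nibble of DH, then CH, CL, SN), and it also converts the self-test's first failing LBA to a byte offset using the 512-byte logical sector size reported in the information section. The function name is hypothetical.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH register bytes into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register row from the error log: SN=a0 CL=f8 CH=9f DH=06
lba = lba28_from_registers(0xA0, 0xF8, 0x9F, 0x06)
print(hex(lba), lba)

# First failing LBA from the extended self-test, as a byte offset.
LOGICAL_SECTOR_BYTES = 512
first_error_lba = 60855848
byte_offset = first_error_lba * LOGICAL_SECTOR_BYTES
print(byte_offset)
```

The recomputed value matches the 111147168 shown in the log, and the same sector shows up in all five logged errors.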

 

 

Link to comment

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%    24247        60855848

 

As johnnie predicted, it failed. So you need to replace it.
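
If you want to pick failed tests out of a saved self-test log without reading it by eye, a minimal Python sketch could look like the following. The regular expression is an assumption based only on the row format shown in this thread, and the function name is hypothetical.

```python
import re

# Row format assumed from the log in this thread:
# "# 1  Extended offline    Completed: read failure      90%    24247        60855848"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text):
    """Return the parsed rows whose status mentions a failure."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and "failure" in match.group("status"):
            failures.append(match.groupdict())
    return failures


log = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      90%    24247        60855848
"""
failures = failed_self_tests(log)
print(failures)
```

Here it reports the one extended test that stopped with a read failure at LBA 60855848, with 90% of the scan still remaining.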

 

Link to comment

Quick update.  The server froze again last night.  I have not made any changes to the cache drive yet but did pull the 2 failed drives.  I understand that I have some hardware issues, but there must also be something running in the middle of the night that triggers this.  Could it be a cron job?  I was able to grab this screenshot before I restarted the server this morning.  Does this make sense to anyone?

 

I am going to disable all of my dockers and cache drive for now.  Hoping to stop the freezing.

screenshots-unraid.jpg.54bf75a1f44db02aafed7de41d25448d.jpg

Link to comment

Do you have the mover running at night?

 

Since the cache disk is bad, it could be the reason for the crash.

Bingo! I agree with JB. I had this exact same issue several months back: a bad cache drive was being written to in the middle of the night, and every single morning when I woke up my server was toast and unusable. I hope your troubles get solved quickly. Unraid is an awesome powerhouse when it all runs as designed. Good luck!

Link to comment

Yep, I sure do!  This makes perfect sense.  I have changed the mover to run monthly.  This should buy me enough time to swap out the drive.  Not looking forward to the long process of clearing the new drive. 

 

Agreed that unRAID is great.  I am very happy with it, especially with my plex, tv and home automation needs. 

 

Thanks again for your help guys!

Link to comment
