Cache Drive Error

njm5785 · April 21, 2015

I had a drive on my Unraid server fail last week. I ordered a new drive and put it in. Unraid rebuilt the new drive and all my files seemed fine. Then I tried to start my Docker image back up and it won't start. I also got notifications about a 187 error on my cache drive. I started digging around the menus and found a screen that seems to highlight the issue (I attached a screenshot). I have no idea what it means and wondered if anyone could help out. Do I need to replace the drive? Is there something I can do to get Docker back up and running?

One more thing that might be related: when I try to stop the array, I have to SSH in and force-unmount the cache drive because it says it is busy. I know this is somewhat random information, but I am pretty new to problems with Unraid. I set my server up almost 2 years ago and hadn't had any issues until last week, when one of my drives failed.

I am running Unraid version 6.0-beta14b.

Thanks for any help someone can offer.

njm5785 · April 21, 2015

I thought maybe I should also attach a SMART log from the drive. Here it is.

Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 13273 hours (553 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 08 78 b9 8f 40   at LBA = 0x008fb978 = 9419128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 78 d8 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 00 90 72 05 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 78 78 b9 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 18 e0 b8 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 28 78 d2 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED

Error -1 occurred at disk power-on lifetime: 13273 hours (553 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 20 98 b7 8f 40   at LBA = 0x008fb798 = 9418648

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 80 b7 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 08 78 b7 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 08 18 70 05 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 40 a0 d7 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED
  60 00 20 80 d7 8f 40 00   6d+06:21:19.744  READ FPDMA QUEUED

Error -2 occurred at disk power-on lifetime: 13204 hours (550 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 08 78 b9 8f 40   at LBA = 0x008fb978 = 9419128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 03 08 a8 70 05 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 78 78 b9 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 18 e0 b8 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 08 a0 70 05 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 10 28 78 d2 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED

Error -3 occurred at disk power-on lifetime: 13204 hours (550 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 20 98 b7 8f 40   at LBA = 0x008fb798 = 9418648

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 80 b7 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 08 78 b7 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 03 08 08 8f 05 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 40 a0 d7 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED
  60 00 20 80 d7 8f 40 00   3d+09:25:19.744  READ FPDMA QUEUED

Error -4 occurred at disk power-on lifetime: 13204 hours (550 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 08 78 b9 8f 40   at LBA = 0x008fb978 = 9419128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 78 d8 8f 40 00   3d+09:01:19.744  READ FPDMA QUEUED
  60 00 20 c8 70 05 40 00   3d+09:01:19.744  READ FPDMA QUEUED
  60 00 78 78 b9 8f 40 00   3d+09:01:19.744  READ FPDMA QUEUED
  60 00 18 e0 b8 8f 40 00   3d+09:01:19.744  READ FPDMA QUEUED
  60 00 18 b0 70 05 40 00   3d+09:01:19.744  READ FPDMA QUEUED
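
As a reading aid for the log above, here is a minimal sketch of how the register bytes relate to the printed values. It assumes the LBA is formed from the CH, CL and SN bytes (most significant first) and ignores the low nibble of DH, which is zero in every entry shown here. The function names are made up for illustration and are not part of smartctl.

```python
def lba_from_registers(sn, cl, ch):
    """Combine the SN, CL and CH register bytes into one LBA.

    Assumption: CH holds bits 16-23, CL bits 8-15 and SN bits 0-7.
    """
    return (ch << 16) | (cl << 8) | sn


def split_hours(hours):
    """Split a power-on hour count into (days, remaining hours)."""
    return divmod(hours, 24)


# Register bytes copied from "Error 0": SN=78 CL=b9 CH=8f
lba = lba_from_registers(0x78, 0xB9, 0x8F)
print(hex(lba), lba)        # 0x8fb978 9419128

# Power-on lifetime from the same entry
print(split_hours(13273))   # (553, 1)
```

The same two LBAs (9419128 and 9418648) repeat across all five entries, so the failed reads keep landing on the same small area of the drive.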

dgaschk · April 21, 2015

Attach a syslog. Zip it if needed.

njm5785 · April 22, 2015

Here is my syslog:

/usr/bin/tail -n 42 -f /var/log/syslog 2>&1
Apr 20 12:43:22 Tower emhttp: shcmd (47): cp /etc/avahi/services/afp.service- /etc/avahi/services/afp.service
Apr 20 12:43:22 Tower avahi-daemon[3444]: Files changed, reloading.
Apr 20 12:43:22 Tower avahi-daemon[3444]: Service group file /services/afp.service changed, reloading.
Apr 20 12:43:23 Tower avahi-daemon[3444]: Service "Tower" (/services/smb.service) successfully established.
Apr 20 12:43:23 Tower avahi-daemon[3444]: Service "Tower-AFP" (/services/afp.service) successfully established.
Apr 20 13:13:04 Tower kernel: mdcmd (35): spindown 0
Apr 20 13:18:22 Tower kernel: mdcmd (36): spindown 1
Apr 20 13:18:22 Tower kernel: mdcmd (37): spindown 2
Apr 20 22:17:43 Tower kernel: ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Apr 20 22:17:43 Tower kernel: ata2: irq_stat 0x00400040, connection status changed
Apr 20 22:17:43 Tower kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
Apr 20 22:17:43 Tower kernel: ata2: hard resetting link
Apr 20 22:17:49 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Apr 20 22:17:53 Tower kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 20 22:17:53 Tower kernel: ata2.00: configured for UDMA/133
Apr 20 22:17:53 Tower kernel: ata2: EH complete
Apr 21 00:20:01 Tower sSMTP[27690]: Creating SSL connection to host
Apr 21 00:20:02 Tower sSMTP[27690]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Apr 21 00:20:04 Tower sSMTP[27690]: Sent mail for unraid@*****.me (221 2.0.0 closing connection f1sm863160pdp.24 - gsmtp) uid=0 username=root outbytes=628
Apr 21 03:40:01 Tower logger: mover started
Apr 21 03:40:01 Tower logger: skipping "applications"
Apr 21 03:40:01 Tower logger: skipping "downloads"
Apr 21 03:40:01 Tower logger: mover finished
Apr 21 04:20:45 Tower kernel: mdcmd (38): spindown 2
Apr 21 04:20:51 Tower kernel: mdcmd (39): spindown 1
Apr 21 07:59:01 Tower sSMTP[23288]: Creating SSL connection to host
Apr 21 07:59:01 Tower sSMTP[23288]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Apr 21 07:59:04 Tower sSMTP[23288]: Sent mail for unraid@*****.me (221 2.0.0 closing connection m8sm2201318pdn.5 - gsmtp) uid=0 username=root outbytes=731
Apr 21 07:59:04 Tower sSMTP[23308]: Creating SSL connection to host
Apr 21 07:59:05 Tower sSMTP[23308]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Apr 21 07:59:07 Tower sSMTP[23308]: Sent mail for unraid@*****.me (221 2.0.0 closing connection vu7sm2156902pbc.39 - gsmtp) uid=0 username=root outbytes=701
Apr 21 13:15:49 Tower sshd[10417]: Accepted password for root from 10.10.0.6 port 56606 ssh2
Apr 21 17:48:36 Tower emhttp: read_line: client closed the connection
Apr 22 00:20:02 Tower sSMTP[12388]: Creating SSL connection to host
Apr 22 00:20:02 Tower sSMTP[12388]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Apr 22 00:20:06 Tower sSMTP[12388]: Sent mail for unraid@*****.me (221 2.0.0 closing connection x1sm3914625pdp.1 - gsmtp) uid=0 username=root outbytes=910
Apr 22 03:40:01 Tower logger: mover started
Apr 22 03:40:01 Tower logger: skipping "applications"
Apr 22 03:40:01 Tower logger: skipping "downloads"
Apr 22 03:40:01 Tower logger: mover finished
Apr 22 07:34:55 Tower emhttp: read_line: client closed the connection
Apr 22 07:59:58 Tower sshd[17175]: Accepted password for root from 10.10.0.6 port 57446 ssh2

trurl · April 22, 2015

If you are on the latest v6 (this is a v6 support forum) it is very easy to give us a complete zipped syslog. Just go to Tools - System Log, and click the Download button.

Complete syslogs are always preferred unless you are able to diagnose them yourself.

njm5785 · April 22, 2015

Thanks for the info. I had no idea how to get the complete syslog. I thought the log link on the right of every page was it.

Here is the complete one, based on your instructions.

syslog.zip

dgaschk · April 24, 2015

Disk 2 appears to need a new SATA cable:

Apr 20 12:42:21 Tower kernel: ata2.00: ATA-9: ST3000DM001-1CH166,             W1F4TBN7, CC27, max UDMA/133
Apr 20 22:17:43 Tower kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }

Cache disk is having read errors:

ata4.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
Apr 20 12:43:03 Tower kernel: ata4.00: irq_stat 0x40000008
Apr 20 12:43:03 Tower kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 20 12:43:03 Tower kernel: ata4.00: cmd 60/20:30:80:b7:8f/00:00:00:00:00/40 tag 6 ncq 16384 in
Apr 20 12:43:03 Tower kernel:         res 41/40:00:98:b7:8f/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Apr 20 12:43:03 Tower kernel: ata4.00: status: { DRDY ERR }
Apr 20 12:43:03 Tower kernel: ata4.00: error: { UNC }

The cache disk file system appears corrupted as a result:

Apr 20 12:43:03 Tower kernel: BTRFS: error (device loop0) in write_all_supers:3442: errno=-5 IO failure (errors while submitting device barriers.)

Run pre-clear on the cache disk and see if it passes.
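
For anyone who wants to pick lines like these out of a full syslog themselves, here is a rough sketch. The three patterns are assumptions based only on the messages quoted in this thread, not a complete list of what the kernel can log, and the category names are made up for illustration.

```python
import re

# Patterns modelled on the lines quoted above (assumed, not exhaustive).
PATTERNS = {
    "link event": re.compile(r"ata\d+: (SError|hard resetting link)"),
    "media error": re.compile(r"media error|error: \{ UNC \}"),
    "btrfs error": re.compile(r"BTRFS: error"),
}


def scan(lines):
    """Group syslog lines by the first-matching categories in PATTERNS."""
    hits = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.rstrip("\n"))
    return hits


# Sample lines copied from this thread.
SAMPLE = [
    "Apr 20 22:17:43 Tower kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }",
    "Apr 20 22:17:43 Tower kernel: ata2: hard resetting link",
    "Apr 20 12:43:03 Tower kernel: ata4.00: error: { UNC }",
    "Apr 20 12:43:03 Tower kernel: BTRFS: error (device loop0) in write_all_supers:3442: errno=-5 IO failure (errors while submitting device barriers.)",
    "Apr 21 03:40:01 Tower logger: mover started",
]

for name, matched in scan(SAMPLE).items():
    print(name, len(matched))
```

To run it against a real log, pass an open file instead of SAMPLE, for example `scan(open("/var/log/syslog"))`.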

trurl · April 24, 2015

Looks like the cache device is probably an SSD. Not sure preclear is a good idea.

dgaschk · April 24, 2015

Pre-clear is bad for an SSD. Run the manufacturer's diagnostic on the drive.

njm5785 · April 24, 2015

Yes, it is an SSD. Should I get a new cable?

I will try to run the manufacturer's diagnostics tonight.
