
Error messages in log -> drive failure imminent?


dlmh


I have 4x Samsung F3 1.5 TB disks in my array with a parity drive (same model) and a Samsung F3 500GB cache drive. Lately, copying files to shares located on disk4, or directly to disk4, shows sudden drops in throughput and sometimes even halts completely. Streaming movies located on this disk with XBMC also sometimes results in playback failure and long buffer times.

 

When I browse through the log I find these entries:

 

Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706392/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706400/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706408/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706416/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706424/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706432/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706440/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706448/4, count: 1
Nov  8 19:17:28 Prometheus kernel: md: disk4 read error
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706456/4, count: 1

 

and sometimes these occur too:

 

Nov  8 19:17:30 Prometheus shfs: duplicate object: /mnt/disk4/.AppleDouble/.Parent
Nov  8 19:17:31 Prometheus kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Nov  8 19:17:31 Prometheus kernel: ata4.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
Nov  8 19:17:31 Prometheus kernel: ata4.00: cmd 25/00:00:df:0f:05/00:04:7c:00:00/e0 tag 0 dma 524288 in
Nov  8 19:17:31 Prometheus kernel:          res 51/40:00:12:13:05/40:00:7c:00:00/e0 Emask 0x9 (media error)
Nov  8 19:17:31 Prometheus kernel: ata4.00: status: { DRDY ERR }
Nov  8 19:17:31 Prometheus kernel: ata4.00: error: { UNC }
Nov  8 19:17:31 Prometheus kernel: ata4: hard resetting link
Nov  8 19:17:31 Prometheus kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov  8 19:17:31 Prometheus kernel: ata4.00: configured for UDMA/133
Nov  8 19:17:31 Prometheus kernel: ata4: EH complete
Nov  8 19:17:34 Prometheus kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

 

Could these be early warning signs of a pending failure of this disk? The other disks don't show this kind of error.

Link to comment

I recently had a drive start giving me the "read error" messages.  It turned out that the drive WAS failing, and a clear sign of that was that the Reallocated sector count kept increasing.  You need to get a SMART report from the drive (you can use unMenu to do this) and then run a SMART long test on the drive (again, you can use unMenu).  Once those are done, post the results so the community can take a look and advise further.

 

You can also check the SATA cable and the connection to see if it is good/bad/etc.  Replace the cable if you have an extra and go from there.

 

My drive was a 1TB Seagate that had some 300+ Reallocated sectors and 30 pending.  So yeah, get those SMART tests done and you will get an idea of what might be the problem.
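To see whether those counters are actually climbing, it helps to diff two SMART reports taken some time apart. Below is a minimal Python sketch, assuming the reports are in the usual `smartctl -A` attribute table layout with the raw value in the tenth column. The sample rows and their numbers are invented for illustration and are not from any drive in this thread; `parse_smart_attributes` and `compare_reports` are hypothetical helper names.

```python
# Attributes worth watching when a drive starts throwing read errors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Invented sample rows in `smartctl -A` layout (not real data from this thread).
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       30
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

SAMPLE_AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       40
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       22
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       8
"""


def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row is skipped because "ID#" is not all digits.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def compare_reports(before, after):
    """Return the change in each watched raw value between two reports."""
    old = parse_smart_attributes(before)
    new = parse_smart_attributes(after)
    return {name: new[name] - old[name] for name in WATCHED if name in old and name in new}


if __name__ == "__main__":
    for name, delta in compare_reports(SAMPLE_BEFORE, SAMPLE_AFTER).items():
        print(f"{name}: {delta:+d}")
```

A Reallocated_Sector_Ct that keeps growing between reports is the pattern I saw on my failing drive. Some vendors put extra text in the raw column, so the `int()` conversion may need adjusting for other attributes.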

Link to comment

Can you also post the complete syslog?  We need to see those errors in context, and especially what the very first errors were.  The errors quoted in the first section are typical after a drive has been disabled or a read error has occurred.  The second section indicates a media error (UNCorrectable), probably a bad sector.  There is not enough info yet to draw any conclusions about whether the drive is failing, but it probably is not.
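For anyone who wants to see which sector the drive is complaining about, the `cmd` and `res` lines can be decoded by hand. Here is a rough Python sketch, assuming the bytes are printed in the order the Linux libata error handler uses (command or status, then feature or error, sector count, and three LBA bytes, then the same five "high order" bytes, then the device byte). That field order is my assumption about the log format, and `decode_taskfile` is a hypothetical helper name.

```python
def decode_taskfile(text):
    """Decode 'aa/bb:cc:dd:ee:ff/gg:hh:ii:jj:kk/ll' from an ata 'cmd' or 'res' line.

    Assumed field order (libata style):
      aa                 command (cmd line) or status (res line)
      bb:cc:dd:ee:ff     feature/error, nsect, lbal, lbam, lbah
      gg:hh:ii:jj:kk     the high-order bytes of the same five fields
      ll                 device
    """
    first, low, high, device = text.strip().split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (int(b, 16) for b in high.split(":"))
    # 48-bit LBA: high-order bytes are the upper three, low-order the lower three.
    lba = (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "first_byte": int(first, 16),         # command opcode or status register
        "feature_or_error": feature,          # error register on a res line
        "count_field": (hob_nsect << 8) | nsect,  # raw sector count field
        "lba": lba,
        "device": int(device, 16),
    }


if __name__ == "__main__":
    # The two taskfiles quoted in the first post.
    cmd = decode_taskfile("25/00:00:df:0f:05/00:04:7c:00:00/e0")
    res = decode_taskfile("51/40:00:12:13:05/40:00:7c:00:00/e0")
    print("request starts at LBA", cmd["lba"], "for", cmd["count_field"], "sectors")
    print("drive reported error at LBA", res["lba"])
    print("UNC bit set:", bool(res["feature_or_error"] & 0x40))
```

Under that assumption, the quoted command is a 1024-sector read (which matches the "dma 524288 in" in the same line, at 512 bytes per sector), and the result line points at a sector 819 sectors into that request.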

Link to comment

Thanks for the replies. I just tried to stream a movie from XBMC and it became completely unresponsive. The same happened when I opened the Web GUI. I ssh-ed to the unRAID machine and entered

reboot

but it wouldn't reboot (although it gave me the message "sending HALT...."). So I had to press and hold the power button to shut down the machine and reboot.

 

After this, the disks show as "Unformatted". I checked the syslog and copied it to pastebin. I'm starting to feel there's definitely something wrong with that disk...

Link to comment

It does look bad, but I would check that SMART report first to confirm.  All of the errors are the same, "media errors" with error code UNC (UNCorrectable), initially in multiple large clusters, then randomly scattered across the drive.  Check the SMART report for increasing Reallocated_Sector_Ct and Current_Pending_Sector, then do the SMART long test, then check a SMART report again and compare those same numbers plus the Offline_Uncorrectable.
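To get a feel for "clusters" versus "scattered" without reading the whole syslog by eye, the `handle_stripe read error` lines can be grouped into runs of neighbouring sectors. This is only a sketch in Python: it assumes the number after the slash is the disk number, and it treats sectors no more than 8 apart as one run because the quoted errors step by 8. `failing_sectors` and `cluster` are hypothetical helper names.

```python
import re

# Matches e.g. "handle_stripe read error: 2080706392/4, count: 1"
STRIPE_ERROR = re.compile(r"handle_stripe read error: (\d+)/(\d+)")

# Three of the lines quoted in the first post.
SAMPLE_LOG = """\
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706392/4, count: 1
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706400/4, count: 1
Nov  8 19:17:28 Prometheus kernel: handle_stripe read error: 2080706408/4, count: 1
"""


def failing_sectors(log_text, disk):
    """Return the sorted, de-duplicated sectors reported for one disk."""
    return sorted(
        {int(m.group(1)) for m in STRIPE_ERROR.finditer(log_text) if int(m.group(2)) == disk}
    )


def cluster(sectors, max_gap=8):
    """Group sorted sectors into (first, last) runs whose neighbours are at most max_gap apart."""
    runs = []
    for sector in sectors:
        if runs and sector - runs[-1][1] <= max_gap:
            runs[-1][1] = sector  # extend the current run
        else:
            runs.append([sector, sector])  # start a new run
    return [tuple(run) for run in runs]


if __name__ == "__main__":
    for first, last in cluster(failing_sectors(SAMPLE_LOG, disk=4)):
        print(f"sectors {first}..{last}")
```

A few long runs suggest localized damage, while many single-sector runs spread over the whole range match the "randomly scattered" picture.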

Link to comment

If you have the room somewhere, I would back up the data just in case something goes wrong. It probably won't, but you never know.

 

I had 2 disk failures within a month of each other. I lost about 3 TB of data. The only stuff I didn't lose was the stuff I had backed up. I had been running unRAID for well over a year with zero problems. My power supply started to go out; I didn't know it, and it slowly fried 4 or 5 drives.

 

Like I said, it's probably OK to just do a rebuild, but if you have the room I would back up.

 

Phil

Link to comment

Archived

This topic is now archived and is closed to further replies.
