[SOLVED] DISK DSBL Error

February 1, 2014

It's been some time since I've had to post any questions (which is good, as it means everything has been rock solid). Unfortunately, we lost power while I was at work, and I was unable to access my server to shut it down properly. I used to have it on an APC UPS, but that died. The server is now on a CyberPower UPS, but I never explored whether it's possible to have it shut down automatically like it did with the APC. I digress....

The issue is that, now the server is powered back up, I have the evil 'red dot' beside my "disk 3". I powered everything down and tore my server apart (12 drives in 4-in-3 cages) to locate the faulty drive (lesson learned: I should have created a layout chart of which drive is where, as it was the darn 12th drive I checked). Once I located the drive, I made sure the power and SATA cables were secure on both it and the board (the SATA cable is part of a break-out cable). This made no difference.

I'm running version 4.7 of UnRaid.

The drive is accessible. It passes the SMART status check.

Statistics for /dev/sdj ST3750330AS_5QK00GT5

smartctl -a -d ata /dev/sdj

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Seagate Barracuda 7200.11 family

Device Model: ST3750330AS

Serial Number: 5QK00GT5

Firmware Version: SD04

User Capacity: 750,156,374,016 bytes

Device is: In smartctl database [for details use: -P show]

ATA Version is: 7

ATA Standard is: Exact ATA specification draft version not indicated

Local Time is: Fri Jan 31 21:41:27 2014 EST

==> WARNING: There are known problems with these drives,

see the following Seagate web pages:

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 642) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 159) minutes.

Conveyance self-test routine

recommended polling time: ( 2) minutes.

SCT capabilities: (0x003b) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       197610088
  3 Spin_Up_Time            0x0003   094   085   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1675
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       4333950815
  9 Power_On_Hours          0x0032   044   044   000    Old_age   Always       -       49849 (Over 5 years, non-stop..aww!)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       6
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       260
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   047   045    Old_age   Always       -       25 (Lifetime Min/Max 21/25)
194 Temperature_Celsius     0x0022   025   053   000    Old_age   Always       -       25 (0 5 0 0)
195 Hardware_ECC_Recovered  0x001a   040   023   000    Old_age   Always       -       197610088
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
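
For anyone skimming the attribute table above, the rows most often looked at for an "is this drive dying?" question can be checked mechanically. This is only a minimal Python sketch: the watch list (IDs 5, 187, 197, 198, 199) and the name `check_smart` are my own illustrative assumptions, not an official smartctl or unRAID rule. The rows are copied from the output in this post. The "value at or below threshold" test follows how smartctl describes a failed attribute.

```python
import re

# Attribute rows copied from the smartctl output in this post.
SMART_ROWS = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 6
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

# Assumption: IDs whose raw counters are commonly expected to stay at zero.
WATCH_RAW = {5, 187, 197, 198, 199}

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def check_smart(text):
    """Return a list of human-readable findings for suspicious rows."""
    findings = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not an attribute row
        attr_id, name = int(m.group(1)), m.group(2)
        value, thresh, raw = int(m.group(3)), int(m.group(5)), int(m.group(6))
        # A threshold of 0 can never be reached, so skip those rows.
        if thresh > 0 and value <= thresh:
            findings.append(f"{name}: value {value} at or below threshold {thresh}")
        if attr_id in WATCH_RAW and raw != 0:
            findings.append(f"{name}: raw value {raw} is non-zero")
    return findings

print(check_smart(SMART_ROWS) or "nothing flagged")  # prints "nothing flagged"
```

With the rows from this drive, nothing is flagged. That matches the PASSED self-assessment, but it does not rule out a cabling problem.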

HDParm Info for /dev/sdj ST3750330AS_5QK00GT5

/dev/sdj:

ATA device, with non-removable media

Model Number: ST3750330AS

Serial Number: 5QK00GT5

Firmware Revision: SD04

Standards:

Supported: 7 6 5 4

Likely used: 8

Configuration:

Logical max current

cylinders 16383 16383

heads 16 16

sectors/track 63 63

--

CHS current addressable sectors: 16514064

LBA user addressable sectors: 268435455

LBA48 user addressable sectors: 1465149168

Logical/Physical Sector size: 512 bytes

device size with M = 1024*1024: 715404 MBytes

device size with M = 1000*1000: 750156 MBytes (750 GB)

cache/buffer size = unknown

Nominal Media Rotation Rate: 7200

Capabilities:

LBA, IORDY(can be disabled)

Queue depth: 32

Standby timer values: spec'd by Standard, no device specific minimum

R/W multiple sector transfer: Max = 16 Current = ?

Recommended acoustic management value: 254, current value: 0

DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6

Cycle time: min=120ns recommended=120ns

PIO: pio0 pio1 pio2 pio3 pio4

Cycle time: no flow control=120ns IORDY flow control=120ns

Commands/features:

Enabled Supported:

* SMART feature set

Security Mode feature set

* Power Management feature set

* Write cache

* Look-ahead

* Host Protected Area feature set

* WRITE_BUFFER command

* READ_BUFFER command

* DOWNLOAD_MICROCODE

SET_MAX security extension

* 48-bit Address feature set

* Device Configuration Overlay feature set

* Mandatory FLUSH_CACHE

* FLUSH_CACHE_EXT

* SMART error logging

* SMART self-test

General Purpose Logging feature set

* 64-bit World wide name

* Write-Read-Verify feature set

* WRITE_UNCORRECTABLE_EXT command

* Gen1 signaling speed (1.5Gb/s)

* Gen2 signaling speed (3.0Gb/s)

* Native Command Queueing (NCQ)

* Phy event counters

* Software settings preservation

* SMART Command Transport (SCT) feature set

* SCT Long Sector Access (AC1)

* SCT Error Recovery Control (AC3)

* SCT Features Control (AC4)

* SCT Data Tables (AC5)

Security:

Master password revision code = 65534

supported

not enabled

not locked

not frozen

not expired: security count

supported: enhanced erase

Logical Unit WWN Device Identifier: 5000c500028898de

NAA : 5

IEEE OUI : 000c50

Unique ID : 0028898de

Checksum: correct

If one of the skilled guys could help with interpreting the info included, I would greatly appreciate it (Joe L., I always remember you, as your help getting set up initially was incredibly helpful). I've got a 2TB drive sitting by (granted, it never got tested, as I have no slots left in my tower), which I do want to replace the drive in question with at some point, but it would be nice to know whether the drive has a problem.

Please advise, guys.

Thanks

P.S. I did have to remove a bunch of characters from the syslog, as they made it huge (I think they came from a file system check I ran, almost like hashes showing its progress). No details were removed, as it was the same repeating character sequence (I left a few lines of it in the log).

syslog-2014-01-31.txt

February 1, 2014

The disk has a bad or loose SATA cable. Also see here: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F

February 1, 2014

Author

I'll give it another disconnect/reconnect, checking both ends. If it isn't one of the breakout cables from my SATA expansion card and it goes to the motherboard directly, I'll replace it. Considering that the only event between the server running fine and this error appearing was a power outage, it is somewhat hard to believe the cable is the problem (the power outage wasn't the result of an earthquake, lol). What I am saying is that the inside of the case was not disturbed until after the problem was discovered.

Going to muck with it now and I'll report back!

February 1, 2014

A simple disconnect/re-seating will NOT put the drive back to OK status, even if it was just a bad connection. You must re-construct the drive, as writes to it failed. It is guaranteed to have incorrect data.

Follow the directions in the wiki.

When you lost power the case probably cooled off. The thermal stress on the cable/connector could have caused a poor connection.
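
To see why re-seating alone cannot clear the red ball, it may help to sketch how single-parity reconstruction works. This assumes parity is a plain XOR across the data disks, which is my understanding of unRAID's single parity drive, and that parity keeps being updated after a disk is disabled. The toy disk contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy 4-byte "disks"; the same idea applies sector by sector on real disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3_on_platter = bytes([0x01, 0x02, 0x03, 0x04])  # what disk 3 physically holds

# Assumption: a write to disk 3 fails, the disk is disabled, and the
# intended new data is still folded into parity.
disk3_intended = bytes([0xFF, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk2, disk3_intended])

# Rebuilding from the other disks plus parity yields the intended data...
rebuilt = xor_blocks([disk1, disk2, parity])
print(rebuilt == disk3_intended)    # True
# ...which no longer matches the stale contents on the physical disk.
print(rebuilt == disk3_on_platter)  # False
```

The second comparison is the point: once a write has failed, the physical disk and parity disagree, so the disk has to be rebuilt rather than simply re-enabled.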

February 1, 2014

Author

A simple disconnect/re-seating will NOT put the drive back to OK status, even if it was just a bad connection. You must re-construct the drive, as writes to it failed. It is guaranteed to have incorrect data.

Follow the directions in the wiki.

When you lost power the case probably cooled off. The thermal stress on the cable/connector could have caused a poor connection.

I replaced the cable nonetheless, no harm, as it was one going right to my motherboard (not one of the breakout cables). After 5 minutes of "WTF", as my machine wouldn't boot up at all (no power), I realized that during all the digging around this time I had pulled the connector that runs from the case power switch to the board. lol Fun Fun Fun! I was looking to replace the 750GB drive with a spare 2TB I have. Would this be as good a time as any, or would you re-construct the drive first and then yank it and reconstruct again on the new drive?

February 1, 2014

I was looking to replace the 750GB drive with a spare 2TB I have. Would this be as good a time as any, or would you re-construct the drive first and then yank it and reconstruct again on the new drive?

You will be running at risk until the drive is rebuilt, so putting in the 2TB and rebuilding on to that would probably be a good move, as long as the 2TB has been tested as good.

February 1, 2014

Author

......as long as the 2TB has been tested as good.

That is where we have an issue. I have filled my server with 12 drives, leaving me no open drive bays to preclear the drive, as I have done with almost all drives up to now (thanks to Joe L.'s assistance in my starting days). I could hook it up to my Windows 7 machine, but I am not familiar with a test process that is as extensive as preclear. I had this idea that I would just hook the drive up to the Windows machine and run it for a few months to ensure nothing odd happened with it, then, if it seemed OK, toss it in the server to replace one of my smaller drives. Far from a sequential/structured test. I have a couple of 2TB data drives in my main PC whose content I could transfer onto the new drive sitting on my desk. Then I could pull the drive and use it, but at best I could say it works and wasn't D.O.A.

EDIT: I rebuilt the original drive. I'm going to run with it until I can figure out a way to preclear a drive (or something similar) while it is hooked to a machine running Windows. Not sure it's possible. If I can't find a decent way to prepare the replacement drive, then plan B is to swap it in anyway, but not write anything new to it for multiple days. Worst case, the drive is faulty and I either hear issues with it or the SMART reports identify something odd. If so, I still have the original drive with all the data intact, so I won't lose any data. (I'm really hoping to avoid plan B.)

February 2, 2014

This indicates a cable problem:

ata11.00: exception Emask 0x50 SAct 0x3 SErr 0x280900 action 0x6 frozen
Jan 31 21:05:14 Tower kernel: ata11.00: irq_stat 0x08000000, interface fatal error
Jan 31 21:05:14 Tower kernel: ata11: SError: { UnrecovData HostInt 10B8B BadCRC }
Jan 31 21:05:14 Tower kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 31 21:05:14 Tower kernel: ata11.00: cmd 60/40:00:60:00:00/00:00:00:00:00/40 tag 0 ncq 32768 in
Jan 31 21:05:14 Tower kernel:          res 40/00:08:a8:88:e0/00:00:e8:00:00/40 Emask 0x50 (ATA bus error)
Jan 31 21:05:14 Tower kernel: ata11.00: status: { DRDY }

Do these messages still appear in the log?
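
For readers wondering where the names inside the braces come from: as far as I understand the Linux libata driver, they are just the set bits of the `SErr` value printed on the first line. Here is a minimal Python sketch under that assumption. The bit table is my reading of libata, and `decode_serror` is a hypothetical helper name. Decoding `0x280900` reproduces the kernel's own list from the log above.

```python
# Assumption: SError bit positions and short names as printed by Linux libata.
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}

def decode_serror(value):
    """Return the names of all known bits set in a SError register value."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]

print(decode_serror(0x280900))
# ['UnrecovData', 'HostInt', '10B8B', 'BadCRC']
```

`10B8B` and `BadCRC` describe corruption on the SATA link itself rather than on the platters, which is why this pattern points at the cable or connector.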

February 3, 2014

Author

I did a quick search of the syslog for "UnrecovData" and found no match.

There is lots of colourful stuff in the log, but I have to admit, there is so much in it that I only worry about it if I notice something isn't working during my regular usage (ya, I know, not the best approach, but understanding everything in the log is for advanced users).

Log attached, since everything is up and working again.

syslog-2014-02-02.txt
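
Searching for "UnrecovData" alone could miss related lines, so here is a small Python sketch that counts link-error lines per ATA port. The marker list and the function name `count_link_errors` are illustrative assumptions, and the sample lines are the ones quoted earlier in this thread.

```python
import re
from collections import Counter

# Assumption: substrings that suggest a SATA link (cable/connector) problem.
MARKERS = ("UnrecovData", "BadCRC", "10B8B", "interface fatal error")
PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")

def count_link_errors(lines):
    """Count syslog lines per ATA port that contain a link-error marker."""
    counts = Counter()
    for line in lines:
        if any(marker in line for marker in MARKERS):
            m = PORT.search(line)
            counts[m.group(1) if m else "unknown"] += 1
    return counts

SAMPLE = [
    "Jan 31 21:05:14 Tower kernel: ata11.00: irq_stat 0x08000000, interface fatal error",
    "Jan 31 21:05:14 Tower kernel: ata11: SError: { UnrecovData HostInt 10B8B BadCRC }",
    "Jan 31 21:05:14 Tower kernel: ata11.00: failed command: READ FPDMA QUEUED",
]

print(dict(count_link_errors(SAMPLE)))  # {'ata11': 2}
```

To run it against a saved log, something like `count_link_errors(open("syslog-2014-02-02.txt", errors="replace"))` should work. An empty result would agree with the manual search.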

February 3, 2014

I have filled my server with 12 drives, leaving me no open drive bays to preclear the drive, as I have done with almost all drives up to now (thanks to Joe L.'s assistance in my starting days).

If you have an open SATA connector and power, just open the case and leave the drive lying on the bottom of the case until it completes the preclear.
