Help: PARITY NOT VALID: DISK_DSBL


I seem to have run into a problem this morning. I was trying to copy over a few DVDs that I ripped, but I got some errors in the process, and as a result the parity drive appears to have been disabled.

 

I stopped the transfer that was in progress ( the only thing moved over was a jpg ) and I removed that one folder and one image from the server.

 

I then stopped the array and wanted to run a parity check, however that isn't one of my options.   I can only select from:

- Start will bring the array on-line (array will be unprotected).

- Restore will initialize the stored array configuration; all drives will appear as New, but data disk contents are not  affected.

Then the usual, Spin Up, Spin Down, Reboot, Shutdown.

 

I have 5 data drives and 1 parity drive, all of which are around 96% full (that leaves me about 250GB of space remaining). I have one of the 8-port SATA controllers to install, and I have 2 more 1.5TB drives in the mail, along with SATA cables and a breakout cable.

 

I'd like to be cautious right now, as I don't want to jeopardize my data. If some of the experts (i.e. those who know 1% more than I) could direct me on what my next step is, I would appreciate it :)   Thanks!

 

 

Here's my most recent syslog entries.

 

Jun 29 05:52:34 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory

Jun 29 05:53:35 Tower last message repeated 2 times

Jun 29 05:53:36 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15688/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15696/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15704/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15712/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15720/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15728/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15736/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15744/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15752/0, count: 1

Jun 29 05:54:34 Tower kernel: md: disk0 read error

Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15760/0, count: 1

Jun 29 05:54:36 Tower kernel: md: disk0 read error

Jun 29 05:54:36 Tower kernel: handle_stripe read error: 65680/0, count: 1

Jun 29 05:54:36 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15688/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: md: recovery thread woken up ...

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15696/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15704/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15712/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15720/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15728/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15736/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15744/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15752/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15760/0, count: 1

Jun 29 05:54:43 Tower kernel: md: disk0 write error

Jun 29 05:54:43 Tower kernel: handle_stripe write error: 65680/0, count: 1

Jun 29 05:54:43 Tower kernel: md: recovery thread has nothing to resync

Jun 29 05:54:45 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory

Jun 29 05:55:45 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory

Jun 29 05:56:47 Tower last message repeated 2 times

Jun 29 05:57:42 Tower last message repeated 2 times

Jun 29 05:57:42 Tower emhttp: disk_temperature: open: No such file or directory

Jun 29 05:57:53 Tower emhttp: Spinning down all drives...

Link to comment

Basically, step 1 is to post a full syslog as an attachment.

 

Other than that,  your parity drive was taken out of service when a "write" to it failed.
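As a rough aid for reading an excerpt like the one posted above, the md errors can be tallied per disk slot and per direction. This is only a sketch: the regular expression is modelled on the lines in this thread, the function name is made up for illustration, and it assumes that disk0 is the parity slot, as the thread title and the errors suggest.

```python
import re
from collections import Counter

# Modelled on lines such as "kernel: md: disk0 write error" from this thread.
MD_ERROR = re.compile(r"kernel: md: disk(\d+) (read|write) error")


def tally_md_errors(lines):
    """Count md read/write errors per (disk slot, direction)."""
    counts = Counter()
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# A few lines copied from the syslog excerpt in the first post.
sample = [
    "Jun 29 05:54:34 Tower kernel: md: disk0 read error",
    "Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15688/0, count: 1",
    "Jun 29 05:54:43 Tower kernel: md: disk0 write error",
    "Jun 29 05:54:43 Tower kernel: md: recovery thread woken up ...",
]

counts = tally_md_errors(sample)
for (slot, kind), n in sorted(counts.items()):
    print(f"disk{slot}: {n} {kind} error(s)")
```

Any "write" count above zero for a slot is consistent with that disk having been taken out of service. It does not say whether the drive, the cable or the power connection is at fault.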

 

Your data disks are probably perfectly fine, but obviously, not protected from a failure by the parity drive.  Therefore, be very careful in working on the array to not damage them or dislodge the cable to them.

 

The write to the parity drive could have failed because

1. the drive failed

2. the power cable to the drive is loose/intermittent, or a connector on a splitter feeding the power cable is loose/intermittent.

3. if plugged into a backplane, the power or SATA connector to the drive could be loose/intermittent.

4. The SATA cable to it could be loose/intermittent/defective.

 

When checking power connections to the drives, be sure to stop the array, power down, AND unplug the server from the wall/turn off the power supply switch.

 

Before doing anything, you can attempt to get a smartctl report or hdparm report from the drive.  If it is dead, it might not respond at all.

 

If the drive has died, your only choice is to replace it.  The replacement must be as large or larger than any of your data drives.

(Normally, it also has to be as large or larger than the parity drive, but since the failed drive is the parity drive, it just has to be as large or larger than the data drives.)  Now is not the time to look for a perfect sale on a drive, now is when you spend a few dollars more so you can replace the drive quickly.
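The size rule for a replacement parity drive can be written as a one-line check. This is a minimal sketch: the function name and the byte counts below are hypothetical and are not taken from this array.

```python
def parity_candidate_ok(candidate_bytes, data_disk_bytes):
    """True if the candidate is at least as large as every data disk."""
    return candidate_bytes >= max(data_disk_bytes)


# Hypothetical sizes in bytes for five data disks (illustration only).
data_disks = [
    1500301910016,
    1500301910016,
    1500300828160,
    1500300828160,
    1000204886016,
]

print(parity_candidate_ok(1500301910016, data_disks))  # large enough
print(parity_candidate_ok(1000204886016, data_disks))  # too small
```

Comparing exact byte counts matters more than comparing the size on the label, since two drives sold as the same capacity can report slightly different sizes.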

 

Then, power down, replace the parity drive, trying hard not to dislodge the cables to the other drives, and power up.  The array should present you with the option to start the array and build a new set of parity data.  That will write the entire parity drive.

 

Once it is done, you'll want to immediately run a parity check.  That will then read the entire parity drive.  Until you do this, you'll not know if the parity data you wrote to the drive is readable.

 

Normally I'd also warn against pressing the button labeled as "restore" as it is actually an "Initialize Configuration and Immediately Invalidate Parity" button.  In your case, parity is already invalid.

 

If you do find a loose cable to the parity drive and you can get a good smartctl report from it after correcting the cabling, you can use the procedure described here in the wiki to put it back in service, but you will need to run a full parity check to fix what it could not originally write (remember, it was taken out of service because a "write" to it failed).

 

Joe L.

Link to comment

Here's some of the info I can add ( or tried to get ).

 

Using UnMenu I tried to run Smart Status Report and got:

Statistics for /dev/sde ST31500341AS_9VS2L1L4

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

Smartctl open device: /dev/sde failed: No such file or directory

 

I then ran HDParm Info:

HDParm Info for /dev/sde ST31500341AS_9VS2L1L4

(nothing appeared)

 

 

So.....I opened up the case and what do I see....one of the SATA cables had the two plastic parts that snap around the cable end partially come apart, and the cable wasn't plugged in all the way ( it wasn't parallel with the drive ).

 

After reseating the cable, I was able to run smartctl and hdparm.   Here is the info:

 

Statistics for /dev/sde ST31500341AS_9VS2L1L4

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF INFORMATION SECTION ===

Device Model:     ST31500341AS

Serial Number:    9VS2L1L4

Firmware Version: CC1H

User Capacity:    1,500,301,910,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Wed Jun 30 19:20:00 2010 GMT+5

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.

Total time to complete Offline data collection: ( 609) seconds.

Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported. General Purpose Logging supported.

Short self-test routine recommended polling time: (   1) minutes.

Extended self-test routine recommended polling time: ( 255) minutes.

Conveyance self-test routine recommended polling time: (   2) minutes.

SCT capabilities:       (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

 1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       124539259

 3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0

 4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       234

 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

 7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       5872207

 9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       6024

10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       9

184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0

188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0

189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       106

190 Airflow_Temperature_Cel 0x0022   067   059   045    Old_age   Always       -       33 (Lifetime Min/Max 31/33)

194 Temperature_Celsius     0x0022   033   041   000    Old_age   Always       -       33 (0 19 0 0)

195 Hardware_ECC_Recovered  0x001a   051   032   000    Old_age   Always       -       124539259

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       115440130982907

241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       385727526

242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1204353787

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
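For anyone who wants to scan a report like this quickly, here is a rough sketch that parses the attribute table and lists any watched attributes with a non-zero raw value. The choice of IDs to watch (5, 197, 198 for sector trouble, 199 for interface CRC errors) is my own shortlist rather than anything unRAID or smartctl prescribes, and the regular expression is only modelled on the smartctl 5.38 layout shown above.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumed shortlist of attributes worth a second look when non-zero.
WATCH = {5, 197, 198, 199}


def parse_attributes(report):
    """Return {attribute id: fields} for every attribute row in the report."""
    rows = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, value, worst, thresh, raw = match.groups()
            rows[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return rows


def nonzero_watched(rows):
    """Sorted IDs from WATCH whose raw value is not zero."""
    return sorted(i for i in WATCH if i in rows and rows[i]["raw"] != 0)


# Rows copied from the report above.
report = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   059   045    Old_age   Always       -       33 (Lifetime Min/Max 31/33)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

rows = parse_attributes(report)
print(nonzero_watched(rows))
```

An empty list here only means none of the watched counters moved. It is not a guarantee that the drive is healthy.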

 

 

HDParm Info for /dev/sde ST31500341AS_9VS2L1L4

 

/dev/sde:

 

ATA device, with non-removable media

Model Number:       ST31500341AS                            

Serial Number:      9VS2L1L4

Firmware Revision:  CC1H    

Transport:          Serial

Standards:

Used: unknown (minor revision code 0x0029)

Supported: 8 7 6 5

Likely used: 8

Configuration:

Logical max current

cylinders 16383 16383

heads 16 16

sectors/track 63 63

--

CHS current addressable sectors:   16514064

LBA    user addressable sectors:  268435455

LBA48  user addressable sectors: 2930277168

device size with M = 1024*1024:     1430799 MBytes

device size with M = 1000*1000:     1500301 MBytes (1500 GB)

Capabilities:

LBA, IORDY(can be disabled)

Queue depth: 32

Standby timer values: spec'd by Standard, no device specific minimum

R/W multiple sector transfer: Max = 16 Current = ?

Recommended acoustic management value: 254, current value: 0

DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6

    Cycle time: min=120ns recommended=120ns

PIO: pio0 pio1 pio2 pio3 pio4

    Cycle time: no flow control=120ns  IORDY flow control=120ns

Commands/features:

Enabled Supported:

  * SMART feature set

    Security Mode feature set

  * Power Management feature set

  * Write cache

  * Look-ahead

  * Host Protected Area feature set

  * WRITE_BUFFER command

  * READ_BUFFER command

  * DOWNLOAD_MICROCODE

    SET_MAX security extension

  * Automatic Acoustic Management feature set

  * 48-bit Address feature set

  * Device Configuration Overlay feature set

  * Mandatory FLUSH_CACHE

  * FLUSH_CACHE_EXT

  * SMART error logging

  * SMART self-test

  * General Purpose Logging feature set

  * WRITE_{DMA|MULTIPLE}_FUA_EXT

  * 64-bit World wide name

    Write-Read-Verify feature set

  * WRITE_UNCORRECTABLE_EXT command

  * {READ,WRITE}_DMA_EXT_GPL commands

  * Segmented DOWNLOAD_MICROCODE

  * SATA-I signaling speed (1.5Gb/s)

  * SATA-II signaling speed (3.0Gb/s)

  * Native Command Queueing (NCQ)

  * Phy event counters

    Device-initiated interface power management

  * Software settings preservation

  * SMART Command Transport (SCT) feature set

  * SCT Long Sector Access (AC1)

  * SCT LBA Segment Access (AC2)

  * SCT Error Recovery Control (AC3)

  * SCT Features Control (AC4)

  * SCT Data Tables (AC5)

    unknown 206[12] (vendor specific)

Security:

Master password revision code = 65534

supported

not enabled

not locked

not frozen

not expired: security count

supported: enhanced erase

256min for SECURITY ERASE UNIT. 256min for ENHANCED SECURITY ERASE UNIT.

Logical Unit WWN Device Identifier: 5000c50015af8095

NAA : 5

IEEE OUI : c50

Unique ID : 015af8095

Checksum: correct
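The size figures in the two reports can be cross-checked with a little arithmetic, which is also a handy way to spot a drive that is reporting fewer sectors than it should (for example because of an HPA). This is a sketch that assumes 512-byte logical sectors. The function names and the "visible" sector count in the last example are hypothetical.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def capacity_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in bytes for a given number of addressable sectors."""
    return sectors * sector_bytes


def sectors_hidden(native_max, visible_max):
    """Sectors not visible to the OS; 0 means nothing is hidden."""
    return native_max - visible_max


lba48_sectors = 2930277168  # "LBA48 user addressable sectors" from hdparm
total = capacity_bytes(lba48_sectors)

print(total)                    # 1500301910016, matches smartctl "User Capacity"
print(total // (1024 * 1024))   # 1430799, hdparm "M = 1024*1024" figure
print(total // (1000 * 1000))   # 1500301, hdparm "M = 1000*1000" figure

# Hypothetical drive of the same model that exposes fewer sectors.
hypothetical_visible = 2930275055
print(sectors_hidden(lba48_sectors, hypothetical_visible))
```

If two drives of the same model give different sector counts, the smaller one is the one to look at before it is used alongside an equally sized parity drive.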

 

 

So.......now I am following the steps to Restore my configuration, as per the link you provided.

 

All drives are "Green" and the Parity Check is in Progress.

 

After that finishes, I then have to work on removing the HPA feature as two of my drives are slightly undersized because of it and I would like to fix it before I add two new drives, of equal size to my parity or before it writes to my parity drive and causes more grief :)

 

As always, thank you Joe for the troubleshooting steps and the links to the wiki to fix things :)

 

 

------------------------------------------------------------------------------------------------

Update:

 

I completed the Parity Check and this is what I got:

(Last checked on 7/1/2010 1:17:31 AM, finding 1068401 errors.)

 

Could this be from when I initially got the write errors? I removed the items that had been partially written to the server ( I didn't know which files had moved over cleanly, so I was just going to re-copy them all ).

 

I am 99% sure the problems were all from the SATA cable not being connected properly ( i.e. I trust the parity drive ).  I'm just trying to figure out now how to get back to where I was before this happened.

 

What would be the recommended next step?

 

Thanks

 

 

 

 

 

Link to comment

The files you were writing to the drive were not written to it once the cable had loosened.  They were written to the "simulated" drive, the drive simulated by parity in combination with all the other data drives.

 

If you followed the "trust" procedure, you basically asked the server to use the actual contents of the failed disk and to update parity based on it.  That would erase/undo anything you had done to the disk since the cable loosened.

 

If you had wanted to keep what you had written to the drive when the cable was disconnected you could have just let the array re-construct the failed drive.

 

This is described in the wiki at the link I had provided:

A situation where this procedure may NOT apply:

 

   * You wish to keep the data that was in the process of being written when the "write" to the disk failed.

 

If a disk is off-line, it is because a "write" to it failed. If the drive went off-line when you were saving the only copy of important files/music/pictures/movies, etc., then the failed drive does not contain any of the data that was written to it. Any files written to the failed drive were recorded in parity, but not on the physical data drive. It is entirely possible to load the entire failed disk with files, but only update parity and not the physical data disk, because it is disabled.

 

If you use this procedure, it will force the disabled disk back online, and when the resulting parity check proceeds, parity will be updated to reflect the data on the physical disk. In other words, using this procedure will bring the disk back without the new files written to it when it went off-line or any written to it since that time. You will have effectively rolled back the clock, as if they were never written to the array.

 

A subsequent parity check will contain many errors, as it is brought in sync with the physical disk contents. Remember, the disk was taken off-line because a write to it failed.

 

If you want to preserve the data that was written to the failed drive, you must un-assign the drive, start the array, stop the array, re-assign the drive, and let unRAID re-construct the contents as created by parity and all the other data drives. They can reconstruct a full copy of what was written to the array based on parity and the other data drives. You will not have parity protection during this period of time, but you will have the latest data written.
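The difference between the two procedures can be shown with a toy model. This sketch assumes single parity computed as a byte-wise XOR across the data disks and uses made-up 4-byte "disks". It illustrates the idea described above and is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data disks holding one 4-byte stripe each (hypothetical contents).
disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([disk1, disk2, disk3])

# disk2 goes off-line. A new write reaches only the "simulated" disk,
# which means parity is updated but the physical disk2 is not.
new_disk2 = b"\xde\xad\xbe\xef"
parity = xor_blocks([disk1, new_disk2, disk3])

# Rebuild: parity plus the other data disks reproduces the new contents.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == new_disk2)

# "Trust": parity is recomputed from the stale physical disk2,
# so the write made while it was off-line is gone.
parity_after_trust = xor_blocks([disk1, disk2, disk3])
recovered_after_trust = xor_blocks([parity_after_trust, disk1, disk3])
print(recovered_after_trust == disk2)
```

In this model the rebuild keeps the newest data, while the "trust" path brings parity back in line with whatever is physically on the disk. That is also why the parity check that follows a "trust" reports many corrections.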

 

Joe L.

Link to comment

Sorry..... comprehension was limited as it was after 3am when I posted ( I tried to stay up until the rebuild finished ) :)

 

Considering that the problem all started with some write errors, then it would be expected to have errors on the rebuild.

 

If I was to run the parity check now, assuming all the drives are in working order, it should pass the parity check ... correct?

Link to comment

A second parity check should be error free.
Link to comment

As you suggested:

 

Parity-Check.

(Last checked on 7/1/2010 1:51:30 PM, finding 0 errors.)

 

Glad that's over! :)

 

Lessons learned:

1.  Get SATA cables that lock

2.  Always have 1 spare SATA cable

3.  Trust Joe

4.  Thank Joe

5.  Remember to read the wiki (all of the related wiki page, not just half)

 

I have two more jobs to work on this week......beware! :)  (Removing the HPA from my motherboard and getting two of my drives to hopefully show equal size to the parity drive AND installing one of the 8-port SATA expansion cards and 2 new 1.5TB drives on it)

 

Fun Fun Fun  :)

Link to comment
