[Solved] Thousands of R/W errors - crazy parity

garion · April 9, 2013

I'll try to explain as best I can; it's not easy (and not only because of the language).

I have been running different 5.0 RCs for several months, and RC11 since its release day, with no problems at all.

Last week I noticed a lot of errors in the unRAID web console. I reviewed my syslog and found thousands and thousands of disk errors on a fairly new Hitachi 2TB parity disk (less than a year in unRAID), plus some parity errors.

Apr  7 21:27:29 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 21:27:29 unraid kernel: ata10.00: irq_stat 0x40000001
Apr  7 21:27:29 unraid kernel: ata10.00: failed command: READ DMA EXT
Apr  7 21:27:29 unraid kernel: ata10.00: cmd 25/00:00:30:34:0b/00:01:e7:00:00/e0 tag 0 dma 131072 in
Apr  7 21:27:29 unraid kernel:          res 51/40:7a:b6:34:0b/00:00:e7:00:00/07 Emask 0x9 (media error)
Apr  7 21:27:29 unraid kernel: ata10.00: status: { DRDY ERR }
Apr  7 21:27:29 unraid kernel: ata10.00: error: { UNC }
Apr  7 21:27:29 unraid kernel: ata10.00: configured for UDMA/133
Apr  7 21:27:29 unraid kernel: ata10: EH complete
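
For anyone reading along, the `res` line in the excerpt above can be decoded by hand. The sketch below is only an illustration. It assumes the usual libata layout for that line (status/error:nsect:lbal:lbam:lbah/hob bytes/device) and the standard ATA register bit meanings; the name `decode_res` is made up for this example. Under those assumptions, status 0x51 and error 0x40 decode to DRDY+ERR and UNC, which agrees with what the kernel printed and with its "(media error)" label.

```python
import re

# Bits of the ATA status and error registers that libata names in its messages.
STATUS_BITS = {0x80: "BUSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Twelve hex bytes separated by '/' or ':' following the word "res".
RES_RE = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

SAMPLE = ("Apr  7 21:27:29 unraid kernel:          "
          "res 51/40:7a:b6:34:0b/00:00:e7:00:00/07 Emask 0x9 (media error)")


def decode_res(line):
    """Decode a libata 'res' line into flag names and the reported LBA.

    Assumed field order: status, error, nsect, lbal, lbam, lbah,
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device.
    Returns None if the line does not contain a 'res' taskfile.
    """
    match = RES_RE.search(line)
    if match is None:
        return None
    fields = [int(byte, 16) for byte in re.split(r"[/:]", match.group(1))]
    (status, error, _nsect, lbal, lbam, lbah,
     _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    return {
        "status": [name for bit, name in STATUS_BITS.items() if status & bit],
        "error": [name for bit, name in ERROR_BITS.items() if error & bit],
        "lba": lba,
    }


result = decode_res(SAMPLE)
print(result["status"], result["error"], hex(result["lba"]))
```

UNC means the drive could not read a sector, so lines like these describe the disk surface, not the cable or the port.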

Things I did, in order:

1.- recheck my cables - data and power. (Parity always plugged into mainboard SATA ports) -> no solution

2.- change SATA cable and reorder cables inside computer -> no solution

3.- go to a stock "go" file

4.- uninstall Simple Features

5.- buy a new parity disk Seagate 7200.14 2TB

6.- buy a new certified sata cable

7.- install new disk and cable in a different SATA port.

8.- preclear new seagate parity drive -> No errors. No allocated sectors.

9.- rebuild parity -> OK

10.- test parity -> no errors

11.- no errors after a couple days in read/write in web-console.

.

Yesterday I found 2 R/W errors in the web console.

I ran a parity check (check only) and found 6 errors, along with 15,258 R/W errors. :-[

Any ideas, please?

---------------

I ran a preclear on the previous Hitachi parity disk and it found 11 sectors for allocation. I ran a second one and all was OK. I'm running a 2-cycle preclear on the same disk now, but it doesn't matter, because I have exactly the same problems with the new Seagate parity disk as with the old Hitachi one. New disk, new cable, different SATA port... same problems.

--------------

I've attached all the info I could. I'm not an IT expert, but if you can guide me and point me toward a solution, I'll follow it.

-------------

My equipment:

APC UPS

Corsair TX-650W PSU

Intel I3 - 4 gigs ram

No graphics card (I3 graphics only)

No overclock

6 mainboard SATA

2 PCIe cards with 2 SATA ports each (parity is not on these cards)

Attachments:

new syslog (seagate parity)

previous syslog (hitachi parity - where errors began)

preclear results

screen capture

syslog-20130409.txt.zip

garion · April 9, 2013

other attachments

syslog-20130406-112349.txt.zip

preclears_reports.zip

garion · April 9, 2013

last attachment

vca · April 9, 2013

2 PCIe cards with 2 SATA ports each (parity is not on these cards)

Are these PCIe cards the SIL3132 cards like this one:

http://www.monoprice.com/products/product.asp?c_id=104&cp_id=10407&cs_id=1040702&p_id=2530&seq=1&format=2

There have been a number of reports of odd problems caused by these. In fact, I just had a run-in with disk read errors caused by one of these in a Windows 7 machine.

Regards,

Stephen

garion · April 9, 2013

I don't think so. I had heard about the SIL issues, so I went with the newer ASM1061. No errors in months, and only data disks are attached there.

Two boards like this:

PCI-e 2 Port SATA III (6Gbps) Card, Chipset ASM1061

Compliant with PCI-express specification V2.0 and backward compatible with PCI-express 1.x

Compliant with serial-ATA AHCI specifications revision 1.3 and serial-ATA revision 3.0

Supports hot-plug and hot-swap

Supports communication speeds of 6.0Gbps, 3.0Gbps, and 1.5Gbps

Supports 2 ports serial-ATA III

Supports native command queue (NCQ)

Supports port multiplier

http://www.ebay.es/itm/271014212269?ssPageName=STRK:MEWAX:IT&_trksid=p3984.m1423.l2649#ht_3403wt_1175

burtjr · April 10, 2013

Looks like you have HPA on ata9. I'm not sure whether that would be causing some of your problems or not, but I am sure someone who has more knowledge about that will chime in. If you do a Google search for HPA in unRAID you will get a bunch of hits that may help you while you wait. Was ata9 in a Gigabyte motherboard at some point in its life?

garion · April 10, 2013

I'm not sure the problem is related to ata9 and HPA, but I have free space on other disks, so I can empty that one and then try to remove the HPA. But why now, after running unRAID for almost a year?

Today I found the parity icon had turned red... it seems there is no device :'( :'( :'( :'(

I really don't know.

I only have the plugins you can see in the attachment.

Should I disable them all and try again for a few days?

--

edit: I removed ALL plugins and moved to RC12a. Rebooting now. I'll check the cables again.

burtjr · April 10, 2013

Yes, I would disable all plugins. My understanding is that the drive causing problems is on a motherboard SATA port, yes?

garion · April 10, 2013

Disabled all plugins and upgraded to RC12a.

2 parity hard drives tested (the second after a perfect preclear).

2 new SATA cables tested (the second one certified for 3 Gbps).

Different motherboard SATA port tested.

Same problems with parity: a few parity errors and thousands of R/W errors in the log.

No errors on the data or cache drives, whether plugged into the motherboard or the PCIe SATA cards. R/W problems only on parity.

dgaschk · April 10, 2013

Attach the new log.

garion · April 10, 2013

Done.

Up-to-date log attached.

A parity sync is in progress with no errors yet (the errors should begin later).

syslog_20130410_23pm.txt

RobJ · April 11, 2013

Things I did, in order:

1.- recheck my cables - data and power. (Parity always plugged into mainboard SATA ports) -> no solution

2.- change SATA cable and reorder cables inside computer -> no solution

3.- go to a stock "go" file

4.- uninstall Simple Features

5.- buy a new parity disk Seagate 7200.14 2TB

6.- buy a new certified sata cable

7.- install new disk and cable in a different SATA port.

8.- preclear new seagate parity drive -> No errors. No allocated sectors.

9.- rebuild parity -> OK

10.- test parity -> no errors

11.- no errors after a couple days in read/write in web-console.

Wow! The stories your syslog could tell, if a syslog could talk!

I'll bet you're a risk-taker. You probably don't bother worrying about static discharge, and you probably don't care if power is off before yanking cables! And I know you don't worry about your data when you want to swap drives, because your syslog told me so! That's the first time I have ever seen someone pull the cables to spinning drives, including ones assigned to the array, and reconnect them in a different way, then start the array to see if it all worked. And it did! For the most part, you got away with it. You do like living dangerously.

In the end though, I suspect the subsequent problems *may* be because of the live reconnections; however, I don't know that for sure. I have not had a chance to examine the other syslogs yet, only the first, but this one has a lot to say. What is below is rather long, my apologies, but it tells the story. Please do read my comments along the way.

Above, your steps 1 through 9 seem correct, but step 10 is wrong: the new parity drive lost contact almost 12 hours after parity was built, and was quickly dropped from the system. That explains the huge number of subsequent errors, which you can completely ignore, since the real issue is why the drive lost communications like that. There is probably nothing wrong with the Seagate drive itself. Here is a commented version of your syslog:

(By the way, the HPA is on the Cache drive (sdj, a 500GB Seagate), not a part of the array, so its HPA is not a real problem here. It's not a good thing, but is not causing parity trouble.)

[syslog starts] (leading dots represent removed stuff, not relevant to discussion)

Apr 2 22:07:08 unraid kernel: Linux version 3.4.26-unRAID (root@Develop) (gcc version 4.4.4 (GCC) ) #2 SMP Mon Jan 28 17:00:22 PST 2013

... (no real issues, array is started and stopped once or twice)

Apr 3 18:44:02 ... (Here is where it all starts)

Apr 3 18:44:02 unraid kernel: r8168: eth0: link down (network disconnected?)

...

Apr 3 18:49:11 unraid kernel: ata8: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:49:11 unraid kernel: ata8: irq_stat 0x00400000, PHY RDY changed

Apr 3 18:49:11 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose)

Apr 3 18:49:11 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 3 18:49:18 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen

Apr 3 18:49:18 unraid kernel: ata10: irq_stat 0x00400040, connection status changed

Apr 3 18:49:18 unraid kernel: ata10: SError: { PHYRdyChg CommWake DevExch } (new Seagate is connected to addon card port)

Apr 3 18:49:18 unraid kernel: ata10: hard resetting link

Apr 3 18:49:27 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 3 18:49:32 unraid kernel: ata10.00: qc timeout (cmd 0xec)

Apr 3 18:49:32 unraid kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4) (but it's a bad connection)

.Apr 3 18:49:55 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:49:55 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr 3 18:49:55 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr 3 18:49:55 unraid kernel: ata10: hard resetting link

Apr 3 18:49:56 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300) (connection lost)

.Apr 3 18:50:07 unraid kernel: ata10.00: disabled (system removes it)

.Apr 3 18:50:07 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr 3 18:50:41 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose again)

Apr 3 18:50:41 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 3 18:50:54 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen

Apr 3 18:50:54 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr 3 18:50:54 unraid kernel: ata10: SError: { CommWake DevExch } (new Seagate is connected again to addon card port)

Apr 3 18:50:54 unraid kernel: ata10: hard resetting link

Apr 3 18:50:55 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 3 18:50:55 unraid kernel: ata10.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr 3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB) (assigned drive symbol sdk)

.Apr 3 18:50:55 unraid kernel: sdk: unknown partition table

Apr 3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

Apr 3 18:51:02 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:51:02 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed (connection lost, still a little loose)

Apr 3 18:51:02 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr 3 18:51:02 unraid kernel: ata10: hard resetting link

Apr 3 18:51:02 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300)

Apr 3 18:51:12 unraid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (finally fully connected, but at slower speed)

.Apr 3 18:51:12 unraid kernel: sdk: detected capacity change from 0 to 2000398934016 (still assigned to be sdk)

...

Apr 3 18:52:39 unraid kernel: r8168: eth0: link up (reconnected network?)

...

Apr 3 19:33:18 unraid kernel: sdk: unknown partition table (Preclear begins of new Seagate)

...

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == invoked as: ./preclear_disk.sh /dev/sdk (Preclear finishes, logs the report)

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == ST2000DM001-1CH164 W2F0QYSN

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == Disk /dev/sdk has been successfully precleared

...

Apr 4 22:55:17 unraid emhttp: Device inventory: (drive list with old sdd Hitachi and new sdk Seagate)

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr 4 22:55:17 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr 4 22:55:17 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdd) 1953514584

Apr 4 22:55:17 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr 4 22:55:17 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr 4 22:55:17 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr 4 22:55:17 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr 4 22:55:17 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdk) 1953514584

...

Apr 4 22:57:37 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Apr 4 22:57:37 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr 4 22:57:37 unraid kernel: ata10: SError: { RecovComm PHYRdyChg } (new Seagate sdk is disconnected!)

Apr 4 22:57:37 unraid kernel: ata10: hard resetting link

Apr 4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)

.Apr 4 22:57:48 unraid kernel: ata10.00: disabled

.Apr 4 22:57:48 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr 4 22:58:42 unraid kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen

Apr 4 22:58:42 unraid kernel: ata3: irq_stat 0x00400040, connection status changed

Apr 4 22:58:42 unraid kernel: ata3: SError: { PHYRdyChg 10B8B DevExch } (Parity drive (sdd Hitachi) is disconnected!)

Apr 4 22:58:42 unraid kernel: ata3: hard resetting link

Apr 4 22:58:42 unraid kernel: ata3: SATA link down (SStatus 0 SControl 300)

.Apr 4 22:58:53 unraid kernel: ata3.00: disabled

.Apr 4 22:58:53 unraid kernel: sd 2:0:0:0: [sdd] Stopping disk

...

Apr 4 22:59:08 unraid kernel: ata1: SError: { HostInt PHYRdyChg 10B8B DevExch } (a cable is bumped completely off)

Apr 4 22:59:08 unraid kernel: ata1: hard resetting link (drive is sda, Disk 5, ST2000DL003)

Apr 4 22:59:09 unraid kernel: ata1: SATA link down (SStatus 0 SControl 300)

.Apr 4 22:59:19 unraid kernel: ata1.00: disabled (system removes the drive!)

.Apr 4 22:59:20 unraid kernel: ata1: SError: { DevExch } (cable is reconnected!)

Apr 4 22:59:20 unraid kernel: ata1: hard resetting link

Apr 4 22:59:21 unraid kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr 4 22:59:21 unraid kernel: ata1.00: ATA-8: ST2000DL003-9VT166, CC3C, max UDMA/133

.Apr 4 22:59:21 unraid kernel: sda: sda1 (and luckily re-assigned the same drive symbol, sda!)

Apr 4 22:59:21 unraid kernel: sd 0:0:0:0: [sda] Attached SCSI disk

...

Apr 4 22:59:14 unraid kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch } (a cable is bumped off)

Apr 4 22:59:14 unraid kernel: ata2: hard resetting link (drive is sdc, Disk 3, ST31500541AS)

.Apr 4 22:59:19 unraid kernel: ata2: reset failed (errno=-32), retrying in 5 secs

.Apr 4 22:59:24 unraid kernel: ata2: limiting SATA link speed to 1.5 Gbps (drive is recovered finally, but slightly slowed access)

.Apr 4 23:08:15 unraid kernel: ata2: SError: { PHYRdyChg 10B8B DevExch }

...

Apr 4 23:08:36 unraid kernel: ata4: SError: { PHYRdyChg 10B8B DevExch } (a cable is bumped loose)

Apr 4 23:08:36 unraid kernel: ata4: hard resetting link (drive is sde, Disk 4, WD15EARS)

...

Apr 4 23:09:29 unraid kernel: ata3: irq_stat 0x00000040, connection status changed

Apr 4 23:09:29 unraid kernel: ata3: SError: { CommWake DevExch } (new Seagate is connected/hotplugged to motherboard port!)

Apr 4 23:09:29 unraid kernel: ata3: hard resetting link

Apr 4 23:09:30 unraid kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr 4 23:09:30 unraid kernel: ata3.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr 4 23:09:30 unraid kernel: scsi 2:0:0:0: Direct-Access ATA ST2000DM001-1CH1 CC24 PQ: 0 ANSI: 5

Apr 4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr 4 23:09:30 unraid kernel: sdd: sdd1 (and reassigned by system as sdd)

Apr 4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] Attached SCSI disk

...

Apr 4 23:14:31 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose)

Apr 4 23:14:31 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 4 23:14:59 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen

Apr 4 23:14:59 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr 4 23:14:59 unraid kernel: ata10: SError: { DevExch } (the Hitachi is connected/hotplugged to ata10 port!)

Apr 4 23:14:59 unraid kernel: ata10: hard resetting link (so the new Seagate and Hitachi swap ports and drive symbols!)

Apr 4 23:15:00 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 4 23:15:00 unraid kernel: ata10.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA800, max UDMA/133

.Apr 4 23:15:00 unraid kernel: scsi 9:0:0:0: Direct-Access ATA Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: Attached scsi generic sg10 type 0

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr 4 23:15:00 unraid kernel: sdk: sdk1 (and reassigned by system as sdk)

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

...

Apr 4 23:21:05 unraid emhttp: Device inventory: (the new drive inventory!)

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr 4 23:21:05 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr 4 23:21:05 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdd) 1953514584

Apr 4 23:21:05 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr 4 23:21:05 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr 4 23:21:05 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr 4 23:21:05 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr 4 23:21:05 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdk) 1953514584

...

Apr 4 23:21:05 unraid kernel: md: import disk0: [8,160] (sdk) Hitachi_HDS5C3020ALA632_ML2220FA040BGE size: 1953514552

...

Apr 4 23:21:25 unraid kernel: md: import disk0: [8,48] (sdd) ST2000DM001-1CH164_W2F0QYSN size: 1953514552

Apr 4 23:21:25 unraid kernel: md: disk0 wrong (parity is re-assigned from sdk Hitachi to sdd Seagate)

...

Apr 4 23:21:59 unraid emhttp: shcmd (248): /usr/local/sbin/set_ncq sdd 1 &> /dev/null (array is started...)

...

Apr 4 23:21:59 unraid emhttp: writing MBR on disk (sdd) with partition 1 offset 64, erased: 0 (new parity disk is partitioned)

...

Apr 4 23:22:02 unraid emhttp_event: disks_mounted (array is fully up)

...

Apr 4 23:23:16 unraid kernel: md: recovery thread syncing parity disk ... (begin build of new parity disk sdd ST2000DM001)

...

Apr 4 23:32:46 unraid kernel: sdk: sdk1 (some operation (Preclear?) is performed on sdk, the Hitachi, no evidence of formatting)

...

Apr 5 03:05:21 unraid kernel: ata10.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen (something is trying to read from Hitachi!)

... (this begins 11.5 hours of I/O errors from sdk (the old Hitachi parity drive), which is NOT formatted, must be Preclear reading)

...

Apr 5 04:16:03 unraid kernel: ... (media error) (the first bad sector, many more bad sector messages to come, not shown below)

Apr 5 04:16:03 unraid kernel: ata10.00: error: { UNC }

...

Apr 5 07:10:07 unraid kernel: ata10: EH complete

Apr 5 07:14:40 unraid kernel: md: sync done. time=28284sec (parity rebuild onto new sdd ST2000DM001 complete)

Apr 5 07:14:40 unraid kernel: md: recovery thread sync completion status: 0

Apr 5 07:17:16 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (Hitachi errors continue...)

...

Apr 5 14:40:54 unraid kernel: ata10: EH complete (last sdk Hitachi error for awhile, Preclear pre-read over?)

... (4 hours of quiet, then ...)

Apr 5 18:44:52 unraid kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (new parity drive sdd suddenly stops communicating)

... (system tries for one minute to reset and recover the drive)

Apr 5 18:45:52 unraid kernel: ata3.00: disabled (drive is quickly disabled by system)

...

Apr 5 18:45:52 unraid kernel: md: disk0 read error (these can be ignored, drive is disabled by system)

...

Apr 5 18:46:02 unraid kernel: md: disk0 write error (last one)

...

Apr 5 19:33:16 unraid login[27047]: ROOT LOGIN on '/dev/pts/1' from '192.168.1.63' (you logged in, attempted SMART analysis)

Apr 5 19:33:44 unraid kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

...

Apr 5 21:02:15 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (sdk Hitachi errors start again)

...

Apr 5 21:29:45 unraid kernel: ata10: EH complete (sdk Hitachi errors stop)

...

Apr 6 11:23:49 unraid kernel: mdcmd (25): stop (system is shut down, don't know what happened to the Preclear, no report)

[end of syslog]
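
The commented syslog above mixes two kinds of messages: link events (a port being reset, going down, coming back up, or a device being disabled) and media errors (UNC). A rough way to pull only the link events out of a long syslog, grouped by port, is sketched below. It is a hypothetical helper, not part of unRAID or the preclear script, and its regular expression only covers the message texts that appear in this thread.

```python
import re
from collections import defaultdict

# Matches "ataN:" or "ataN.00:" followed by one of the link-state messages
# seen in this thread. Media errors such as "error: { UNC }" are skipped.
EVENT_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: "
    r"(SATA link down|SATA link up [\d.]+ Gbps|hard resetting link|disabled)"
)

# A few lines copied from the commented syslog, trailing comments removed.
SAMPLE = [
    "Apr 4 22:57:37 unraid kernel: ata10: hard resetting link",
    "Apr 4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)",
    ".Apr 4 22:57:48 unraid kernel: ata10.00: disabled",
    "Apr 4 22:59:21 unraid kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Apr 5 04:16:03 unraid kernel: ata10.00: error: { UNC }",
]


def link_events(lines):
    """Return {port: [event, ...]} in the order the events appear."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events[match.group(1)].append(match.group(2))
    return dict(events)


for port, seen in link_events(SAMPLE).items():
    print(port, "->", ", ".join(seen))
```

A port that shows "SATA link down" followed by "disabled" lost its drive entirely. Read or write errors logged against that drive afterwards are a consequence of the lost link, not separate faults.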

I'm tired...

Andrewch · April 11, 2013

Wow! Nice work RobJ

PeterB · April 11, 2013

That's the first time I have ever seen someone pull the cables to spinning drives, including ones assigned to the array, and reconnect them in a different way, then start the array to see if it all worked.

Probably using 'hot swap' disk bays?

garion · April 11, 2013

RobJ, I have no words, really. You reviewed a really loooonnnng log. Thanks a lot.

I read it all, but a second reading is necessary. I'm on mobile and only posting to thank you and to try to explain the "cabling".

I'm not using hot swap bays.

The problem with the Hitachi began before any cable was hot-unplugged.

I read in the unRAID forums about people doing drive maintenance tasks without turning off the computer, only unmounting the array -> change drive -> mount array. I suppose I was wrong.

Yesterday I physically unplugged the Hitachi (old parity) after a 2-pass preclear. That drive will not go into the array or be used as a hot spare until the problems disappear. Then I started a parity build with the new Seagate. This afternoon (Spain time) I will report the results and the log.

I unplugged the Hitachi with the computer fully off, checked the other drives' cables and turned it on... hehehe

Again, thanks for your comments. It wasn't a 5 min job.

Francisco

PeterB · April 11, 2013

I read in the unRAID forums about people doing drive maintenance tasks without turning off the computer, only unmounting the array -> change drive -> mount array. I suppose I was wrong.

This works if your hardware supports hot swap.

garion · April 11, 2013

OK, after a second reading I think I've learned it's better to always turn off/on :-[

The parity rebuild finished, I think without errors.

Apr 10 19:47:17 unraid kernel: mdcmd (45): check CORRECT
Apr 10 19:47:17 unraid kernel: md: recovery thread woken up ...

Apr 10 19:47:17 unraid kernel: md: recovery thread syncing parity disk ...

Apr 10 19:47:17 unraid kernel: md: using 1536k window, over a total of 1953514552 blocks.

....

....

Apr 11 03:39:44 unraid kernel: md: sync done. time=28347sec

Apr 11 03:39:44 unraid kernel: md: recovery thread sync completion status: 0
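
As a sanity check on the numbers in this excerpt, the reported `time=28347sec` can be compared against the two timestamps and turned into an average throughput. The sketch below makes two assumptions that the log does not state: the year is 2013 (syslog omits it), and one md "block" is 1024 bytes. The second assumption is at least consistent with the earlier syslog, where the kernel reports a capacity of 2000398934016 bytes for a disk listed with size 1953514584. The function name `sync_summary` is made up for this example.

```python
from datetime import datetime


def sync_summary(start, end, seconds, blocks, year=2013, block_bytes=1024):
    """Cross-check a parity sync using syslog timestamps and md's own figures.

    start/end   -- syslog timestamps such as "Apr 10 19:47:17"
    seconds     -- the value md prints as time=...sec
    blocks      -- the "total of N blocks" value
    year        -- assumed, because syslog lines carry no year
    block_bytes -- assumed size of one md block
    Returns (elapsed seconds from timestamps, hours, average MB/s).
    """
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    elapsed = int((t1 - t0).total_seconds())
    hours = seconds / 3600
    rate_mb_s = blocks * block_bytes / seconds / 1e6
    return elapsed, hours, rate_mb_s


elapsed, hours, rate = sync_summary(
    "Apr 10 19:47:17", "Apr 11 03:39:44", 28347, 1953514552
)
print(f"timestamps: {elapsed} s, md: 28347 s, {hours:.2f} h, {rate:.1f} MB/s")
```

If the timestamp difference and md's own figure agree, the sync ran without a long stall. A completion status of 0 together with a steady rate is what a clean rebuild is expected to look like.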

I've started a parity check, so let's see. The errors have usually shown up a couple of days after a rebuild.

syslog_20130411_18pm.txt

lars · April 12, 2013

Things I did, in order:

1.- recheck my cables - data and power. (Parity always plugged into mainboard SATA ports) -> no solution

2.- change SATA cable and reorder cables inside computer -> no solution

3.- go to a stock "go" file

4.- uninstall Simple Features

5.- buy a new parity disk Seagate 7200.14 2TB

6.- buy a new certified sata cable

7.- install new disk and cable in a different SATA port.

8.- preclear new seagate parity drive -> No errors. No allocated sectors.

9.- rebuild parity -> OK

10.- test parity -> no errors

11.- no errors after a couple days in read/write in web-console.

Wow! The stories your syslog could tell, if a syslog could talk!

I'll bet you're a risk-taker. You probably don't bother worrying about static discharge, you probably don't care if power is off before yanking cables! And I know you don't worry about your data when you want to swap drives, because your syslog told me so! That's the first time I have ever seen someone pull the cables to spinning drives, including ones assigned to the array, and reconnect them in a different way, then start the array to see if it all worked. And it did! For the most part, you got away with it. You do like living dangerously. In the end though, I suspect the subsequent problems *may* be because of the live reconnections, however I don't know that for sure. I have not had a chance to examine the other syslogs yet, only the first, but this one has a lot to say. What is below is rather long, my apologies, but it tells the story. Please do read my comments along the way. Above, your steps 1 through 9 seem correct, but step 10 is wrong, the new parity drive lost contact almost 12 hours after it was built, and was quickly dropped from the system. That explains the huge number of subsequent errors, which you can completely ignore, since the real issue is why the drive lost communications like that. There is probably nothing wrong with the Seagate drive itself. Here is a commented version of your syslog:

(By the way, the HPA is on the Cache drive (sdj, a 500GB Seagate), not a part of the array, so its HPA is not a real problem here. It's not a good thing, but is not causing parity trouble.)

[syslog starts] (leading dots represent removed stuff, not relevant to discussion)

Apr 2 22:07:08 unraid kernel: Linux version 3.4.26-unRAID (root@Develop) (gcc version 4.4.4 (GCC) ) #2 SMP Mon Jan 28 17:00:22 PST 2013

... (no real issues, array is started and stopped once or twice)

Apr 3 18:44:02 ... (Here is where it all starts)

Apr 3 18:44:02 unraid kernel: r8168: eth0: link down (network disconnected?)

...

Apr 3 18:49:11 unraid kernel: ata8: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:49:11 unraid kernel: ata8: irq_stat 0x00400000, PHY RDY changed

Apr 3 18:49:11 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose)

Apr 3 18:49:11 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 3 18:49:18 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen

Apr 3 18:49:18 unraid kernel: ata10: irq_stat 0x00400040, connection status changed

Apr 3 18:49:18 unraid kernel: ata10: SError: { PHYRdyChg CommWake DevExch } (new Seagate is connected to addon card port)

Apr 3 18:49:18 unraid kernel: ata10: hard resetting link

Apr 3 18:49:27 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 3 18:49:32 unraid kernel: ata10.00: qc timeout (cmd 0xec)

Apr 3 18:49:32 unraid kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4) (but it's a bad connection)

.Apr 3 18:49:55 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:49:55 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr 3 18:49:55 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr 3 18:49:55 unraid kernel: ata10: hard resetting link

Apr 3 18:49:56 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300) (connection lost)

.Apr 3 18:50:07 unraid kernel: ata10.00: disabled (system removes it)

.Apr 3 18:50:07 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr 3 18:50:41 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose again)

Apr 3 18:50:41 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 3 18:50:54 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen

Apr 3 18:50:54 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr 3 18:50:54 unraid kernel: ata10: SError: { CommWake DevExch } (new Seagate is connected again to addon card port)

Apr 3 18:50:54 unraid kernel: ata10: hard resetting link

Apr 3 18:50:55 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 3 18:50:55 unraid kernel: ata10.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr 3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB) (assigned drive symbol sdk)

.Apr 3 18:50:55 unraid kernel: sdk: unknown partition table

Apr 3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

Apr 3 18:51:02 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr 3 18:51:02 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed (connection lost, still a little loose)

Apr 3 18:51:02 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr 3 18:51:02 unraid kernel: ata10: hard resetting link

Apr 3 18:51:02 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300)

Apr 3 18:51:12 unraid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (finally fully connected, but at slower speed)

.Apr 3 18:51:12 unraid kernel: sdk: detected capacity change from 0 to 2000398934016 (still assigned to be sdk)

...

Apr 3 18:52:39 unraid kernel: r8168: eth0: link up (reconnected network?)

...

Apr 3 19:33:18 unraid kernel: sdk: unknown partition table (Preclear begins of new Seagate)

...

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == invoked as: ./preclear_disk.sh /dev/sdk (Preclear finishes, logs the report)

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == ST2000DM001-1CH164 W2F0QYSN

Apr 4 17:54:05 unraid preclear_disk-diff[10081]: == Disk /dev/sdk has been successfully precleared

...

Apr 4 22:55:17 unraid emhttp: Device inventory: (drive list with old sdd Hitachi and new sdk Seagate)

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr 4 22:55:17 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr 4 22:55:17 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdd) 1953514584

Apr 4 22:55:17 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr 4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr 4 22:55:17 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr 4 22:55:17 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr 4 22:55:17 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr 4 22:55:17 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdk) 1953514584

...

Apr 4 22:57:37 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Apr 4 22:57:37 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr 4 22:57:37 unraid kernel: ata10: SError: { RecovComm PHYRdyChg } (new Seagate sdk is disconnected!)

Apr 4 22:57:37 unraid kernel: ata10: hard resetting link

Apr 4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)

.Apr 4 22:57:48 unraid kernel: ata10.00: disabled

.Apr 4 22:57:48 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr 4 22:58:42 unraid kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen

Apr 4 22:58:42 unraid kernel: ata3: irq_stat 0x00400040, connection status changed

Apr 4 22:58:42 unraid kernel: ata3: SError: { PHYRdyChg 10B8B DevExch } (Parity drive (sdd Hitachi) is disconnected!)

Apr 4 22:58:42 unraid kernel: ata3: hard resetting link

Apr 4 22:58:42 unraid kernel: ata3: SATA link down (SStatus 0 SControl 300)

.Apr 4 22:58:53 unraid kernel: ata3.00: disabled

.Apr 4 22:58:53 unraid kernel: sd 2:0:0:0: [sdd] Stopping disk

...

Apr 4 22:59:08 unraid kernel: ata1: SError: { HostInt PHYRdyChg 10B8B DevExch } (a cable is bumped completely off)

Apr 4 22:59:08 unraid kernel: ata1: hard resetting link (drive is sda, Disk 5, ST2000DL003)

Apr 4 22:59:09 unraid kernel: ata1: SATA link down (SStatus 0 SControl 300)

Apr 4 22:59:19 unraid kernel: ata1.00: disabled (system removes the drive!)

Apr 4 22:59:20 unraid kernel: ata1: SError: { DevExch } (cable is reconnected!)

Apr 4 22:59:20 unraid kernel: ata1: hard resetting link

Apr 4 22:59:21 unraid kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr 4 22:59:21 unraid kernel: ata1.00: ATA-8: ST2000DL003-9VT166, CC3C, max UDMA/133

Apr 4 22:59:21 unraid kernel: sda: sda1 (and luckily re-assigned the same drive symbol, sda!)

Apr 4 22:59:21 unraid kernel: sd 0:0:0:0: [sda] Attached SCSI disk

...

Apr 4 22:59:14 unraid kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch } (a cable is bumped off)

Apr 4 22:59:14 unraid kernel: ata2: hard resetting link (drive is sdc, Disk 3, ST31500541AS)

Apr 4 22:59:19 unraid kernel: ata2: reset failed (errno=-32), retrying in 5 secs

Apr 4 22:59:24 unraid kernel: ata2: limiting SATA link speed to 1.5 Gbps (drive is recovered finally, but slightly slowed access)

Apr 4 23:08:15 unraid kernel: ata2: SError: { PHYRdyChg 10B8B DevExch }

...

Apr 4 23:08:36 unraid kernel: ata4: SError: { PHYRdyChg 10B8B DevExch } (a cable is bumped loose)

Apr 4 23:08:36 unraid kernel: ata4: hard resetting link (drive is sde, Disk 4, WD15EARS)

...

Apr 4 23:09:29 unraid kernel: ata3: irq_stat 0x00000040, connection status changed

Apr 4 23:09:29 unraid kernel: ata3: SError: { CommWake DevExch } (new Seagate is connected/hotplugged to motherboard port!)

Apr 4 23:09:29 unraid kernel: ata3: hard resetting link

Apr 4 23:09:30 unraid kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr 4 23:09:30 unraid kernel: ata3.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

Apr 4 23:09:30 unraid kernel: scsi 2:0:0:0: Direct-Access ATA ST2000DM001-1CH1 CC24 PQ: 0 ANSI: 5

Apr 4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Apr 4 23:09:30 unraid kernel: sdd: sdd1 (and reassigned by system as sdd)

Apr 4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] Attached SCSI disk

...

Apr 4 23:14:31 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B } (a cable is bumped loose)

Apr 4 23:14:31 unraid kernel: ata8: hard resetting link (drive is sdi, Disk 6, WD10EADS)

...

Apr 4 23:14:59 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen

Apr 4 23:14:59 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr 4 23:14:59 unraid kernel: ata10: SError: { DevExch } (the Hitachi is connected/hotplugged to ata10 port!)

Apr 4 23:14:59 unraid kernel: ata10: hard resetting link (so the new Seagate and Hitachi swap ports and drive symbols!)

Apr 4 23:15:00 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr 4 23:15:00 unraid kernel: ata10.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA800, max UDMA/133

Apr 4 23:15:00 unraid kernel: scsi 9:0:0:0: Direct-Access ATA Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: Attached scsi generic sg10 type 0

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Apr 4 23:15:00 unraid kernel: sdk: sdk1 (and reassigned by system as sdk)

Apr 4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

...

Apr 4 23:21:05 unraid emhttp: Device inventory: (the new drive inventory!)

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr 4 23:21:05 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr 4 23:21:05 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdd) 1953514584

Apr 4 23:21:05 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr 4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr 4 23:21:05 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr 4 23:21:05 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr 4 23:21:05 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr 4 23:21:05 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdk) 1953514584

...

Apr 4 23:21:05 unraid kernel: md: import disk0: [8,160] (sdk) Hitachi_HDS5C3020ALA632_ML2220FA040BGE size: 1953514552

...

Apr 4 23:21:25 unraid kernel: md: import disk0: [8,48] (sdd) ST2000DM001-1CH164_W2F0QYSN size: 1953514552

Apr 4 23:21:25 unraid kernel: md: disk0 wrong (parity is re-assigned from sdk Hitachi to sdd Seagate)

...

Apr 4 23:21:59 unraid emhttp: shcmd (248): /usr/local/sbin/set_ncq sdd 1 &> /dev/null (array is started...)

...

Apr 4 23:21:59 unraid emhttp: writing MBR on disk (sdd) with partition 1 offset 64, erased: 0 (new parity disk is partitioned)

...

Apr 4 23:22:02 unraid emhttp_event: disks_mounted (array is fully up)

...

Apr 4 23:23:16 unraid kernel: md: recovery thread syncing parity disk ... (begin build of new parity disk sdd ST2000DM001)

...

Apr 4 23:32:46 unraid kernel: sdk: sdk1 (some operation (Preclear?) is performed on sdk, the Hitachi, no evidence of formatting)

...

Apr 5 03:05:21 unraid kernel: ata10.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen (something is trying to read from Hitachi!)

... (this begins 11.5 hours of I/O errors from sdk (the old Hitachi parity drive), which is NOT formatted, must be Preclear reading)

...

Apr 5 04:16:03 unraid kernel: ... (media error) (the first bad sector, many more bad sector messages to come, not shown below)

Apr 5 04:16:03 unraid kernel: ata10.00: error: { UNC }

...

Apr 5 07:10:07 unraid kernel: ata10: EH complete

Apr 5 07:14:40 unraid kernel: md: sync done. time=28284sec (parity rebuild onto new sdd ST2000DM001 complete)

Apr 5 07:14:40 unraid kernel: md: recovery thread sync completion status: 0

Apr 5 07:17:16 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (Hitachi errors continue...)

...

Apr 5 14:40:54 unraid kernel: ata10: EH complete (last sdk Hitachi error for awhile, Preclear pre-read over?)

... (4 hours of quiet, then ...)

Apr 5 18:44:52 unraid kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (new parity drive sdd suddenly stops communicating)

... (system tries for one minute to reset and recover the drive)

Apr 5 18:45:52 unraid kernel: ata3.00: disabled (drive is quickly disabled by system)

...

Apr 5 18:45:52 unraid kernel: md: disk0 read error (these can be ignored, drive is disabled by system)

...

Apr 5 18:46:02 unraid kernel: md: disk0 write error (last one)

...

Apr 5 19:33:16 unraid login[27047]: ROOT LOGIN on '/dev/pts/1' from '192.168.1.63' (you logged in, attempted SMART analysis)

Apr 5 19:33:44 unraid kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

...

Apr 5 21:02:15 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (sdk Hitachi errors start again)

...

Apr 5 21:29:45 unraid kernel: ata10: EH complete (sdk Hitachi errors stop)

...

Apr 6 11:23:49 unraid kernel: mdcmd (25): stop (system is shut down, don't know what happened to the Preclear, no report)

[end of syslog]
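For anyone who wants to repeat this kind of walk-through on their own syslog, here is a minimal Python sketch that pulls the link-state events (reset, link down, disabled, link up) out of the kernel lines and groups them by ATA port. The function name `link_events` and the choice of which messages count as events are assumptions made for illustration; the line format is the one shown in the excerpts above.

```python
import re
from collections import defaultdict

# Matches kernel libata lines such as
# "Apr 4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)".
# A stray leading period before the timestamp is tolerated.
ATA_LINE = re.compile(
    r"^\.?(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)

# Message prefixes treated as link-state events (an illustrative selection).
EVENTS = {
    "SATA link down": "link down",
    "SATA link up": "link up",
    "disabled": "disabled",
    "hard resetting link": "reset",
}


def link_events(lines):
    """Return {port: [(timestamp, event), ...]} in log order."""
    timeline = defaultdict(list)
    for line in lines:
        match = ATA_LINE.match(line.strip())
        if not match:
            continue
        for prefix, label in EVENTS.items():
            if match.group("msg").startswith(prefix):
                timeline[match.group("port")].append((match.group("stamp"), label))
                break
    return dict(timeline)


if __name__ == "__main__":
    sample = [
        "Apr 4 22:57:37 unraid kernel: ata10: hard resetting link",
        "Apr 4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)",
        "Apr 4 22:57:48 unraid kernel: ata10.00: disabled",
        "Apr 4 23:15:00 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
    ]
    for port, events in link_events(sample).items():
        for stamp, event in events:
            print(port, stamp, event)
```

A port that shows "link down" followed by "disabled" and a later "link up" is the pattern of a drive being unplugged and plugged back in while the system was running.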

I'm tired...

Wow!!! Amazing that the whole box even turns on these days! Next time I have an old PC I'll have to try something like that; I bet with my luck it goes up in flames halfway through!


garion · April 12, 2013

First parity check after rebuild. No R/W or parity errors.

/usr/bin/tail -f /var/log/syslog

Apr 12 12:30:49 unraid kernel: mdcmd (60): spindown 6

Apr 12 14:43:12 unraid kernel: mdcmd (61): spindown 3

Apr 12 14:43:13 unraid kernel: mdcmd (62): spindown 4

Apr 12 14:43:14 unraid kernel: mdcmd (63): spindown 7

Apr 12 16:02:21 unraid kernel: md: sync done. time=27970sec

Apr 12 16:02:21 unraid kernel: md: recovery thread sync completion status: 0

Apr 12 16:32:24 unraid kernel: mdcmd (64): spindown 0

Apr 12 16:32:24 unraid kernel: mdcmd (65): spindown 1

Apr 12 16:32:25 unraid kernel: mdcmd (66): spindown 2

Apr 12 16:32:26 unraid kernel: mdcmd (67): spindown 5
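As a quick sanity check on the `sync done. time=27970sec` line, the duration can be turned into a wall-clock time and an average speed. This is only a rough sketch: the helper name `parity_sync_summary` is made up here, and it assumes the parity size reported by the md driver earlier in the thread (1953514552) is in 1 KiB blocks.

```python
def parity_sync_summary(seconds, size_kib):
    """Convert an md sync time into h/m/s and an average MB/s figure.

    size_kib is assumed to be the disk size in 1 KiB blocks.
    """
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    # Average throughput in decimal megabytes per second.
    mb_per_s = size_kib * 1024 / seconds / 1e6
    return f"{hours}h{minutes:02d}m{secs:02d}s", round(mb_per_s, 1)


if __name__ == "__main__":
    duration, speed = parity_sync_summary(27970, 1953514552)
    print(f"parity check took {duration}, averaging about {speed} MB/s")
```

A check that finishes in a similar time to the earlier rebuild (28284 sec) suggests the drive was streaming normally for the whole pass.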

I'll turn the sickbeard and nzb plugins back on and see...

RobJ · April 13, 2013

The problem with the Hitachi began before any cable was hot-unplugged.

At first, I wondered why you were swapping out the Hitachi, because until I saw its SMART report, there were no issues at all related to the Hitachi. Its SMART report however clearly indicated you had recently been having a number of problems with bad sectors on the drive, so I could see why you were replacing it, and Preclearing it.

I read in the unRAID forums about people doing drive maintenance without turning off the computer, only unmounting the array -> changing the drive -> mounting the array. I suppose I was wrong.

As Peter said, newer SATA hardware does *claim* to support hot swap, but all of the software has to support it too. Your syslog showed that it can work, but I'm not sure it's something we can completely trust with UnRAID yet. Tom's UnRAID module was not designed for hot swapping drives, although the new behavior of rechecking the drive inventory whenever the Web GUI is refreshed does help in this regard.

I still remember seeing the first syslog where a user had put their UnRAID server to sleep, then awakened it. It looked completely broken, and I thought at first there had been a terrible crash, totally unrecoverable. But as I examined further, I could see most of the 'broken' subsystems slowly recovering, and finally the full system appeared to be almost completely operational. At least, the user seemed to think so, although I had my doubts! The user was not aware of all of the errors! That was quite a few Linux kernels back, so I don't doubt that it's much improved now. The thing is, all of this recovery work is at a low level, completely unseen by upper level applications, such as UnRAID. It all has to work in such a way that everything appears to be restored exactly the same, including all device ID's and symbols, buffers and handles. Some of us veterans will wait and watch, while you youngsters and risk-takers be the guinea pigs, testing hot swap and sleep/wake for us!

garion · April 13, 2013

Glad to be an unofficial risk-taker beta tester.

The problems (one, at least) returned last night, now on ata1. I believe ata1 = sda = new parity.

I started a new parity check... again :-\

garion · April 13, 2013

When the parity check finishes I'll attach a syslog, but the R/W errors have grown to close to 10K again, at 50% of the check.

:'(

I'm lost. Yesterday I ran a parity check with 0 errors. Today the problem is back: 10,000 R/W errors and 5 parity errors. Only 2 files of 350 MB were written to the data drives between the two parity checks, both by sickbeard. After the first file there was no error; it was moved from cache to data on 12-4-2013 at 18:36. After the second file came the error I described in my previous post; both the file and the error are from around 12-4-2013 at 06:30.

If it is a motherboard issue, why only on parity?

If it is the motherboard SATA port, which I changed, why only on parity?

If it is the PSU (Corsair TX650), why only on parity?

If it is the drive... I changed it.

If it is the cable... I changed it.

Should I RMA the new Seagate parity drive? Even though it shows the same errors as the Hitachi, maybe the odds are simply against me.

Aaaaaarrrrrrggggggg

dgaschk · April 13, 2013

Post a SMART report and a new syslog.

garion · April 13, 2013

Post a SMART report and a new syslog.

I returned home to a red ball on parity. There is no final result for parity errors after the check, but at about 60% the system had found 7. Because sda (parity) is disabled, smartctl doesn't work (smartctl -a -A /dev/sda).

I would prefer to leave the array untouched until your replies, but the last few days' experience says that if I reboot, without touching any cable or hardware part, the parity drive will be online again, and after the array starts a parity rebuild will complete without errors. A manual parity check after that also finds no errors. A couple of days later the R/W errors begin, and a manual parity check puts parity offline.

thanks everybody.

syslog_20130414_00am.txt.zip

garion · April 14, 2013

Turned off

swapped the SATA cables between parity and a data drive

turned on

parity rebuild

no errors shown in the syslog (as far as I understand)

smartctl shows 1758 UNC errors.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   078   006    Pre-fail  Always   -           9154368
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           23
  5 Reallocated_Sector_Ct   0x0033   087   087   010    Pre-fail  Always   -           17216
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always   -           300355
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           212
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           7
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always   -           4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           1758
188 Command_Timeout         0x0032   100   099   000    Old_age   Always   -           17180131332
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always   -           3
190 Airflow_Temperature_Cel 0x0022   071   058   045    Old_age   Always   -           29 (Min/Max 25/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           67
194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always   -           29 (0 20 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           193677255245911
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           19542190504
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           27561533468
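To make the sector-related rows easier to spot in a report like this, a small Python sketch can parse the attribute rows and list the ones with a non-zero raw count. The function names and the set of watched IDs (5, 187, 197, 198) are choices made for this illustration, not anything defined by smartctl; the parsing only assumes the ten whitespace-separated columns shown in the header row.

```python
# Attribute IDs to watch for sector trouble (an illustrative selection).
WATCHED_IDS = (5, 187, 197, 198)


def parse_smart_rows(text):
    """Parse 'smartctl -A' style attribute rows into {id: {...}}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        raw = parts[9]
        rows[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            "raw": int(raw) if raw.isdigit() else raw,
        }
    return rows


def flag_sector_problems(rows):
    """Return {attribute_name: raw} for watched attributes with raw > 0."""
    flagged = {}
    for attr_id in WATCHED_IDS:
        row = rows.get(attr_id)
        if row and isinstance(row["raw"], int) and row["raw"] > 0:
            flagged[row["name"]] = row["raw"]
    return flagged


if __name__ == "__main__":
    report = """\
5 Reallocated_Sector_Ct 0x0033 087 087 010 Pre-fail Always - 17216
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 1758
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
"""
    for name, raw in flag_sector_problems(parse_smart_rows(report)).items():
        print(f"{name}: {raw}")
```

On the rows above, this flags Reallocated_Sector_Ct and Reported_Uncorrect, which is a lot of remapped and unreadable sectors for a drive with only 212 power-on hours.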

As I wrote, perhaps the odds are against me and the new drive shows the same errors as the previous one?

smartctl.txt

syslog_20130414_09am.txt
