[Solved] Thousands of R/W errors - crazy parity



I'll try to explain as best I can; it's not easy (and not only because of the language  :))

I have been running different 5.0 RCs for several months, and RC11 since its release day, with no problems at all.

Last week I noticed a lot of errors in the unRAID web console. I reviewed my syslog and found thousands and thousands of disk errors on a fairly new Hitachi 2TB parity disk (less than a year in unRAID), and some parity errors too.

 

 

Apr  7 21:27:29 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 21:27:29 unraid kernel: ata10.00: irq_stat 0x40000001
Apr  7 21:27:29 unraid kernel: ata10.00: failed command: READ DMA EXT
Apr  7 21:27:29 unraid kernel: ata10.00: cmd 25/00:00:30:34:0b/00:01:e7:00:00/e0 tag 0 dma 131072 in
Apr  7 21:27:29 unraid kernel:          res 51/40:7a:b6:34:0b/00:00:e7:00:00/07 Emask 0x9 (media error)
Apr  7 21:27:29 unraid kernel: ata10.00: status: { DRDY ERR }
Apr  7 21:27:29 unraid kernel: ata10.00: error: { UNC }
Apr  7 21:27:29 unraid kernel: ata10.00: configured for UDMA/133
Apr  7 21:27:29 unraid kernel: ata10: EH complete
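
For anyone who wants to tally errors like these from a syslog, here is a minimal Python sketch. It assumes the libata line layout shown in the excerpt above; the `summarize` helper and the `SAMPLE` text are illustrative names, not part of unRAID. The kernel itself labels this result a media error, and `{ UNC }` with `SErr 0x0` usually points at an unreadable sector rather than a dropped link, but treat that as a hint, not a diagnosis.

```python
import re
from collections import Counter

# Sample lines copied from the excerpt above.
SAMPLE = """\
Apr  7 21:27:29 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 21:27:29 unraid kernel: ata10.00: failed command: READ DMA EXT
Apr  7 21:27:29 unraid kernel:          res 51/40:7a:b6:34:0b/00:00:e7:00:00/07 Emask 0x9 (media error)
Apr  7 21:27:29 unraid kernel: ata10.00: error: { UNC }
"""

# Matches "kernel: ata10.00: <message>" and "kernel: ata10: <message>".
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

def summarize(lines):
    """Count failed commands and UNC (uncorrectable read) errors per ATA port."""
    failed = Counter()
    unc = Counter()
    for line in lines:
        m = PORT_RE.search(line)
        if not m:
            continue  # continuation lines such as "res ..." carry no port prefix
        port, msg = m.groups()
        if msg.startswith("failed command:"):
            failed[port] += 1
        elif msg.startswith("error:") and "UNC" in msg:
            unc[port] += 1
    return failed, unc

failed, unc = summarize(SAMPLE.splitlines())
for port in sorted(failed):
    print(port, "failed commands:", failed[port], "UNC errors:", unc[port])
```

To run it on a full log, pass the lines of the syslog file to `summarize` instead of `SAMPLE`.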

 

Things I did, in order:

 

1.- rechecked my cables, data and power (parity is always plugged into the mainboard SATA ports) -> no solution

2.- changed the SATA cable and reordered the cables inside the computer -> no solution

3.- went back to a stock "go" file

4.- uninstalled Simple Features

5.- bought a new parity disk, Seagate 7200.14 2TB

6.- bought a new certified SATA cable

7.- installed the new disk and cable on a different SATA port

8.- precleared the new Seagate parity drive -> no errors, no reallocated sectors

9.- rebuilt parity -> OK

10.- tested parity -> no errors  :)

11.- no read/write errors in the web console after a couple of days  :)

.

.

.

.

Yesterday I found 2 R/W errors in the web console.

I ran a check-only parity check and found 6 errors along with 15,258 R/W errors.  :-[

 

Any ideas, please?

 

---------------

 

I ran a preclear on the previous Hitachi parity disk and it found 11 sectors for reallocation. I ran a second one and all was OK. I'm running a 2-cycle preclear on the same disk now, but it doesn't matter, because I have exactly the same problems with the new Seagate parity disk as with the old Hitachi. New disk, new cable, different SATA port... same problems.

 

--------------

 

I have attached all the info I could. I'm not an IT expert, but if you can guide me and point me toward a solution, I'll follow.

 

-------------

 

My equipment:

 

APC UPS

Corsair TX-650W PSU

Intel i3 - 4 GB RAM

No graphics card (i3 integrated graphics only)

No overclock

6 mainboard SATA ports

2 PCIe cards with 2 SATA ports each (parity is not on these cards)

 

Attachments:

 

new syslog (seagate parity)

previous syslog (hitachi parity - where errors began)

preclear results

screen capture

syslog-20130409.txt.zip

Link to comment

 

2 PCIe cards with 2 SATA ports each (parity is not on these cards)

 

 

Are these PCIe cards the SIL3132 cards like this one:

 

http://www.monoprice.com/products/product.asp?c_id=104&cp_id=10407&cs_id=1040702&p_id=2530&seq=1&format=2

 

There have been a number of reports of odd problems being caused by these.  In fact, I just had a run-in with disk read errors being caused by one of these in a Windows 7 machine.

 

Regards,

 

Stephen

Link to comment

 

I don't think so. I had heard about the SIL issues, so I went with the newer ASM1061. No errors in months, and only data disks are attached there.

 

Two boards like this:

 

PCI-e 2 Port SATA III (6Gbps) Card, Chipset ASM1061

Compliant with PCI-express specification V2.0 and backward compatible with PCI-express 1.x

Compliant with serial-ATA AHCI specifications revision 1.3 and serial-ATA revision 3.0

Supports hot-plug and hot-swap

Supports communication speeds of 6.0Gbps, 3.0Gbps, and 1.5Gbps

Supports 2 ports serial-ATA III

Supports native command queue (NCQ)

Supports port multiplier

 

http://www.ebay.es/itm/271014212269?ssPageName=STRK:MEWAX:IT&_trksid=p3984.m1423.l2649#ht_3403wt_1175

Link to comment

Looks like you have HPA on ata9; I'm not sure whether that would be causing some of your problems. I am sure someone with more knowledge about that will chime in. If you do a Google search for HPA in unRAID you will get a bunch of hits that may help you while you wait. Was ata9 in a Gigabyte motherboard at some point in its life?

Link to comment

I'm not sure the HPA on ata9 is related, but I have free space on other disks to empty that one and then try to remove the HPA. But why now, after running unRAID for almost a year?

Today I found the parity icon had turned red... it seems there is no device :'( :'( :'( :'(

I really don't know.

I only have the plugins you can see in the attachment.

Should I disable them all and try again for a few days?

 

--

 

Edit: I removed ALL plugins and moved to RC12a. Rebooting now. I'll check the cables again.

Captura_de_pantalla_2013-04-10_a_las_18_41_58.png.cf58a945e72f926a11f150a9413764e7.png

Link to comment

Yes, I would disable all plugins. My understanding is that the drive causing problems is on a motherboard SATA port, yes?

 

Disabled all plugins and upgraded to RC12a.

 

2 parity hard drives tested (the second after a perfect preclear)

2 new SATA cables tested (the second 3 Gbps certified)

different motherboard SATA port tested

 

Same problems with parity: a few parity errors and thousands of R/W errors in the log.

 

No errors on the data or cache drives, whether plugged into the motherboard or the PCIe SATA cards. R/W problems only on parity.

 

 

Link to comment

Things I did, in order:

 

1.- rechecked my cables, data and power (parity is always plugged into the mainboard SATA ports) -> no solution

2.- changed the SATA cable and reordered the cables inside the computer -> no solution

3.- went back to a stock "go" file

4.- uninstalled Simple Features

5.- bought a new parity disk, Seagate 7200.14 2TB

6.- bought a new certified SATA cable

7.- installed the new disk and cable on a different SATA port

8.- precleared the new Seagate parity drive -> no errors, no reallocated sectors

9.- rebuilt parity -> OK

10.- tested parity -> no errors  :)

11.- no read/write errors in the web console after a couple of days  :)

 

Wow!  The stories your syslog could tell, if a syslog could talk!    ;)

 

I'll bet you're a risk-taker.  You probably don't bother worrying about static discharge, you probably don't care if power is off before yanking cables!  And I know you don't worry about your data when you want to swap drives, because your syslog told me so!  ;D  That's the first time I have ever seen someone pull the cables to spinning drives, including ones assigned to the array, and reconnect them in a different way, then start the array to see if it all worked.  And it did!  For the most part, you got away with it.  You do like living dangerously.

In the end though, I suspect the subsequent problems *may* be because of the live reconnections, however I don't know that for sure.  I have not had a chance to examine the other syslogs yet, only the first, but this one has a lot to say.  What is below is rather long, my apologies, but it tells the story.  Please do read my comments along the way.

Above, your steps 1 through 9 seem correct, but step 10 is wrong: the new parity drive lost contact almost 12 hours after it was built, and was quickly dropped from the system.  That explains the huge number of subsequent errors, which you can completely ignore, since the real issue is why the drive lost communications like that.  There is probably nothing wrong with the Seagate drive itself.  Here is a commented version of your syslog:

 

(By the way, the HPA is on the Cache drive (sdj, a 500GB Seagate), not a part of the array, so its HPA is not a real problem here.  It's not a good thing, but is not causing parity trouble.)

 

[syslog starts]        (leading dots represent removed stuff, not relevant to discussion)

Apr  2 22:07:08 unraid kernel: Linux version 3.4.26-unRAID (root@Develop) (gcc version 4.4.4 (GCC) ) #2 SMP Mon Jan 28 17:00:22 PST 2013

...                        (no real issues, array is started and stopped once or twice)

Apr  3 18:44:02 ...  (Here is where it all starts)

Apr  3 18:44:02 unraid kernel: r8168: eth0: link down        (network disconnected?)

...

Apr  3 18:49:11 unraid kernel: ata8: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr  3 18:49:11 unraid kernel: ata8: irq_stat 0x00400000, PHY RDY changed

Apr  3 18:49:11 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B }      (a cable is bumped loose)

Apr  3 18:49:11 unraid kernel: ata8: hard resetting link                                          (drive is sdi, Disk 6, WD10EADS)

...

Apr  3 18:49:18 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen

Apr  3 18:49:18 unraid kernel: ata10: irq_stat 0x00400040, connection status changed

Apr  3 18:49:18 unraid kernel: ata10: SError: { PHYRdyChg CommWake DevExch }      (new Seagate is connected to addon card port)

Apr  3 18:49:18 unraid kernel: ata10: hard resetting link

Apr  3 18:49:27 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr  3 18:49:32 unraid kernel: ata10.00: qc timeout (cmd 0xec)

Apr  3 18:49:32 unraid kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)        (but it's a bad connection)

.Apr  3 18:49:55 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Apr  3 18:49:55 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr  3 18:49:55 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr  3 18:49:55 unraid kernel: ata10: hard resetting link

Apr  3 18:49:56 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300)          (connection lost)

.Apr  3 18:50:07 unraid kernel: ata10.00: disabled                                                  (system removes it)

.Apr  3 18:50:07 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr  3 18:50:41 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B }      (a cable is bumped loose again)

Apr  3 18:50:41 unraid kernel: ata8: hard resetting link                                        (drive is sdi, Disk 6, WD10EADS)

...

Apr  3 18:50:54 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen

Apr  3 18:50:54 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr  3 18:50:54 unraid kernel: ata10: SError: { CommWake DevExch }      (new Seagate is connected again to addon card port)

Apr  3 18:50:54 unraid kernel: ata10: hard resetting link

Apr  3 18:50:55 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr  3 18:50:55 unraid kernel: ata10.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr  3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)  (assigned drive symbol sdk)

.Apr  3 18:50:55 unraid kernel:  sdk: unknown partition table

Apr  3 18:50:55 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

Apr  3 18:51:02 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen 

Apr  3 18:51:02 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed                (connection lost, still a little loose)

Apr  3 18:51:02 unraid kernel: ata10: SError: { RecovComm PHYRdyChg 10B8B }

Apr  3 18:51:02 unraid kernel: ata10: hard resetting link

Apr  3 18:51:02 unraid kernel: ata10: SATA link down (SStatus 0 SControl 300)

Apr  3 18:51:12 unraid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)    (finally fully connected, but at slower speed)

.Apr  3 18:51:12 unraid kernel: sdk: detected capacity change from 0 to 2000398934016    (still assigned to be sdk)

...

Apr  3 18:52:39 unraid kernel: r8168: eth0: link up          (reconnected network?)

...

Apr  3 19:33:18 unraid kernel:  sdk: unknown partition table        (Preclear begins of new Seagate)

...

Apr  4 17:54:05 unraid preclear_disk-diff[10081]: == invoked as: ./preclear_disk.sh /dev/sdk  (Preclear finishes, logs the report)

Apr  4 17:54:05 unraid preclear_disk-diff[10081]: ==  ST2000DM001-1CH164    W2F0QYSN

Apr  4 17:54:05 unraid preclear_disk-diff[10081]: == Disk /dev/sdk has been successfully precleared

...

Apr  4 22:55:17 unraid emhttp: Device inventory:      (drive list with old sdd Hitachi and new sdk Seagate)

Apr  4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr  4 22:55:17 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr  4 22:55:17 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdd) 1953514584

Apr  4 22:55:17 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr  4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr  4 22:55:17 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr  4 22:55:17 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr  4 22:55:17 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr  4 22:55:17 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr  4 22:55:17 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdk) 1953514584

...

Apr  4 22:57:37 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Apr  4 22:57:37 unraid kernel: ata10: irq_stat 0x00400000, PHY RDY changed

Apr  4 22:57:37 unraid kernel: ata10: SError: { RecovComm PHYRdyChg }        (new Seagate sdk is disconnected!)

Apr  4 22:57:37 unraid kernel: ata10: hard resetting link

Apr  4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)

.Apr  4 22:57:48 unraid kernel: ata10.00: disabled

.Apr  4 22:57:48 unraid kernel: sd 9:0:0:0: [sdk] Stopping disk

...

Apr  4 22:58:42 unraid kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen

Apr  4 22:58:42 unraid kernel: ata3: irq_stat 0x00400040, connection status changed

Apr  4 22:58:42 unraid kernel: ata3: SError: { PHYRdyChg 10B8B DevExch }        (Parity drive (sdd Hitachi) is disconnected!)

Apr  4 22:58:42 unraid kernel: ata3: hard resetting link

Apr  4 22:58:42 unraid kernel: ata3: SATA link down (SStatus 0 SControl 300)

.Apr  4 22:58:53 unraid kernel: ata3.00: disabled

.Apr  4 22:58:53 unraid kernel: sd 2:0:0:0: [sdd] Stopping disk

...

Apr  4 22:59:08 unraid kernel: ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }      (a cable is bumped completely off)

Apr  4 22:59:08 unraid kernel: ata1: hard resetting link                                              (drive is sda, Disk 5, ST2000DL003)

Apr  4 22:59:09 unraid kernel: ata1: SATA link down (SStatus 0 SControl 300)

.Apr  4 22:59:19 unraid kernel: ata1.00: disabled             (system removes the drive!)

.Apr  4 22:59:20 unraid kernel: ata1: SError: { DevExch }                      (cable is reconnected!)

Apr  4 22:59:20 unraid kernel: ata1: hard resetting link

Apr  4 22:59:21 unraid kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr  4 22:59:21 unraid kernel: ata1.00: ATA-8: ST2000DL003-9VT166, CC3C, max UDMA/133

.Apr  4 22:59:21 unraid kernel:  sda: sda1        (and luckily re-assigned the same drive symbol, sda!)

Apr  4 22:59:21 unraid kernel: sd 0:0:0:0: [sda] Attached SCSI disk

...

Apr  4 22:59:14 unraid kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }      (a cable is bumped off)

Apr  4 22:59:14 unraid kernel: ata2: hard resetting link                                      (drive is sdc, Disk 3, ST31500541AS)

.Apr  4 22:59:19 unraid kernel: ata2: reset failed (errno=-32), retrying in 5 secs

.Apr  4 22:59:24 unraid kernel: ata2: limiting SATA link speed to 1.5 Gbps        (drive is recovered finally, but slightly slowed access)

.Apr  4 23:08:15 unraid kernel: ata2: SError: { PHYRdyChg 10B8B DevExch }

...

Apr  4 23:08:36 unraid kernel: ata4: SError: { PHYRdyChg 10B8B DevExch }          (a cable is bumped loose)

Apr  4 23:08:36 unraid kernel: ata4: hard resetting link                                      (drive is sde, Disk 4, WD15EARS)

...

Apr  4 23:09:29 unraid kernel: ata3: irq_stat 0x00000040, connection status changed

Apr  4 23:09:29 unraid kernel: ata3: SError: { CommWake DevExch }      (new Seagate is connected/hotplugged to motherboard port!)

Apr  4 23:09:29 unraid kernel: ata3: hard resetting link

Apr  4 23:09:30 unraid kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr  4 23:09:30 unraid kernel: ata3.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr  4 23:09:30 unraid kernel: scsi 2:0:0:0: Direct-Access    ATA      ST2000DM001-1CH1 CC24 PQ: 0 ANSI: 5

Apr  4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr  4 23:09:30 unraid kernel:  sdd: sdd1      (and reassigned by system as sdd)

Apr  4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] Attached SCSI disk

...

Apr  4 23:14:31 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B }      (a cable is bumped loose)

Apr  4 23:14:31 unraid kernel: ata8: hard resetting link                                        (drive is sdi, Disk 6, WD10EADS)

...

Apr  4 23:14:59 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen

Apr  4 23:14:59 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr  4 23:14:59 unraid kernel: ata10: SError: { DevExch }        (the Hitachi is connected/hotplugged to ata10 port!)

Apr  4 23:14:59 unraid kernel: ata10: hard resetting link          (so the new Seagate and Hitachi swap ports and drive symbols!)

Apr  4 23:15:00 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr  4 23:15:00 unraid kernel: ata10.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA800, max UDMA/133

.Apr  4 23:15:00 unraid kernel: scsi 9:0:0:0: Direct-Access    ATA      Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: Attached scsi generic sg10 type 0

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr  4 23:15:00 unraid kernel:  sdk: sdk1                  (and reassigned by system as sdk)

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

...

Apr  4 23:21:05 unraid emhttp: Device inventory:      (the new drive inventory!)

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr  4 23:21:05 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr  4 23:21:05 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdd) 1953514584

Apr  4 23:21:05 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr  4 23:21:05 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr  4 23:21:05 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr  4 23:21:05 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr  4 23:21:05 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdk) 1953514584

...

Apr  4 23:21:05 unraid kernel: md: import disk0: [8,160] (sdk) Hitachi_HDS5C3020ALA632_ML2220FA040BGE size: 1953514552

...

Apr  4 23:21:25 unraid kernel: md: import disk0: [8,48] (sdd) ST2000DM001-1CH164_W2F0QYSN size: 1953514552

Apr  4 23:21:25 unraid kernel: md: disk0 wrong        (parity is re-assigned from sdk Hitachi to sdd Seagate)

...

Apr  4 23:21:59 unraid emhttp: shcmd (248): /usr/local/sbin/set_ncq sdd 1 &> /dev/null      (array is started...)

...

Apr  4 23:21:59 unraid emhttp: writing MBR on disk (sdd) with partition 1 offset 64, erased: 0  (new parity disk is partitioned)

...

Apr  4 23:22:02 unraid emhttp_event: disks_mounted      (array is fully up)

...

Apr  4 23:23:16 unraid kernel: md: recovery thread syncing parity disk ...  (begin build of new parity disk sdd ST2000DM001)

...

Apr  4 23:32:46 unraid kernel:  sdk: sdk1        (some operation (Preclear?) is performed on sdk, the Hitachi, no evidence of formatting)

...

Apr  5 03:05:21 unraid kernel: ata10.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen    (something is trying to read from Hitachi!)

...      (this begins 11.5 hours of I/O errors from sdk (the old Hitachi parity drive), which is NOT formatted, must be Preclear reading)

...

Apr  5 04:16:03 unraid kernel:  ...  (media error)  (the first bad sector, many more bad sector messages to come, not shown below)

Apr  5 04:16:03 unraid kernel: ata10.00: error: { UNC }

...

Apr  5 07:10:07 unraid kernel: ata10: EH complete

Apr  5 07:14:40 unraid kernel: md: sync done. time=28284sec        (parity rebuild onto new sdd ST2000DM001 complete)

Apr  5 07:14:40 unraid kernel: md: recovery thread sync completion status: 0

Apr  5 07:17:16 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen    (Hitachi errors continue...)

...

Apr  5 14:40:54 unraid kernel: ata10: EH complete      (last sdk Hitachi error for awhile, Preclear pre-read over?)

...  (4 hours of quiet, then ...)

Apr  5 18:44:52 unraid kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen  (new parity drive sdd suddenly stops communicating)

...  (system tries for one minute to reset and recover the drive)

Apr  5 18:45:52 unraid kernel: ata3.00: disabled      (drive is quickly disabled by system)

...

Apr  5 18:45:52 unraid kernel: md: disk0 read error  (these can be ignored, drive is disabled by system)

...

Apr  5 18:46:02 unraid kernel: md: disk0 write error    (last one)

...

Apr  5 19:33:16 unraid login[27047]: ROOT LOGIN  on '/dev/pts/1' from '192.168.1.63'        (you logged in, attempted SMART analysis)

Apr  5 19:33:44 unraid kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

...

Apr  5 21:02:15 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0  (sdk Hitachi errors start again)

...

Apr  5 21:29:45 unraid kernel: ata10: EH complete  (sdk Hitachi errors stop)

...

Apr  6 11:23:49 unraid kernel: mdcmd (25): stop    (system is shut down, don't know what happened to the Preclear, no report)

[end of syslog]

    I'm tired...
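
The SStatus and SErr values in the annotated log above can be decoded mechanically. The following is a rough Python sketch, assuming libata prints both registers in hexadecimal and covering only the SError flag names that appear in this syslog. The bit positions follow the SATA register layout as commonly documented, and the function names are hypothetical.

```python
# SStatus SPD field (bits 4-7): negotiated link generation.
SPEEDS = {0: "no link", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

# Subset of SError bits: only the flags seen in this syslog.
SERROR_BITS = {
    1: "RecovComm",   # communication recovered
    11: "HostInt",    # host internal error
    16: "PHYRdyChg",  # PHY ready state changed
    18: "CommWake",   # wake signal seen on the link
    19: "10B8B",      # 10b-to-8b decode error
    26: "DevExch",    # device exchanged (plug or unplug)
}

def decode_sstatus(hex_text):
    """Decode an SStatus value such as '133' from a 'SATA link up' line."""
    value = int(hex_text, 16)
    det = value & 0xF          # device detection field
    spd = (value >> 4) & 0xF   # negotiated speed field
    return {"device_present": det == 3, "speed": SPEEDS.get(spd, "unknown")}

def decode_serror(hex_text):
    """Return the known flag names set in an SErr value such as '0x90002'."""
    value = int(hex_text, 16)
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]

print(decode_sstatus("133"))      # link up at full speed on ata10
print(decode_sstatus("113"))      # same port after renegotiating down
print(decode_serror("0x90002"))   # flags behind the "bumped loose" events
print(decode_serror("0x4050000")) # flags behind a hotplug event
```

A port that keeps dropping from 6.0 or 3.0 Gbps down to 1.5 Gbps, or that logs PHYRdyChg repeatedly, is usually worth a cable and connector check before blaming the drive.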
Link to comment

Robj, I have no words, really. You reviewed a really loooonnnng log. Thanks a lot.

 

I read it all, but a second reading is necessary. I'm on mobile and only posting to thank you and to try to explain the "cabling".

 

I'm not using hot-swap bays.

The problem with the Hitachi began before any cable was hot-unplugged.

I had read in the unRAID forums about people doing drive maintenance without turning off the computer, only unmount array -> change drive -> mount array. I suppose I was wrong.

 

Yesterday I physically unplugged the Hitachi (old parity) after a 2-pass preclear. That drive will not be in the array, or used as a hot spare drive, until the problems disappear. Then I started a parity build with the new Seagate. This afternoon (Spain time) I will report the results and the log.

 

I unplugged the Hitachi with the computer fully off, checked the other drives' cables and turned it back on... hehehe

 

Again, thanks for your comments. It wasn't a 5 min job.

 

Francisco

Link to comment

 

OK, after a second reading I think I've learned it's better to always power off and on  :-[

 

The parity rebuild finished, I think without errors.

 

Apr 10 19:47:17 unraid kernel: mdcmd (45): check CORRECT

Apr 10 19:47:17 unraid kernel: md: recovery thread woken up ...

Apr 10 19:47:17 unraid kernel: md: recovery thread syncing parity disk ...

Apr 10 19:47:17 unraid kernel: md: using 1536k window, over a total of 1953514552 blocks.

....

....

Apr 11 03:39:44 unraid kernel: md: sync done. time=28347sec

Apr 11 03:39:44 unraid kernel: md: recovery thread sync completion status: 0
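
As a sanity check on a sync like this one, the average rate can be worked out from the two md lines. The sketch below is illustrative Python, assuming the md "blocks" are 1 KiB each. That assumption fits the numbers earlier in the thread, where the 2000398934016-byte drive is listed as 1953514584 in the device inventory. The `sync_rate_mib_s` name is made up for this example.

```python
import re

# The two relevant lines from the log above.
LOG = """\
Apr 10 19:47:17 unraid kernel: md: using 1536k window, over a total of 1953514552 blocks.
Apr 11 03:39:44 unraid kernel: md: sync done. time=28347sec
"""

def sync_rate_mib_s(log_text, block_bytes=1024):
    """Average sync rate in MiB/s, assuming each md block is block_bytes long."""
    blocks = int(re.search(r"over a total of (\d+) blocks", log_text).group(1))
    seconds = int(re.search(r"sync done\. time=(\d+)sec", log_text).group(1))
    return blocks * block_bytes / seconds / 2**20

rate = sync_rate_mib_s(LOG)
print(f"average sync rate: {rate:.1f} MiB/s")
```

Under that assumption the average comes out at roughly 67 MiB/s over just under 8 hours, close to the 28284-second sync in the earlier log.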

 

I'm starting a parity check now, so let's see. The errors usually came a couple of days after the rebuild.

 

 

syslog_20130411_18pm.txt

Link to comment


Apr  4 23:09:29 unraid kernel: ata3: SError: { CommWake DevExch }      (new Seagate is connected/hotplugged to motherboard port!)

Apr  4 23:09:29 unraid kernel: ata3: hard resetting link

Apr  4 23:09:30 unraid kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Apr  4 23:09:30 unraid kernel: ata3.00: ATA-8: ST2000DM001-1CH164, CC24, max UDMA/133

.Apr  4 23:09:30 unraid kernel: scsi 2:0:0:0: Direct-Access    ATA      ST2000DM001-1CH1 CC24 PQ: 0 ANSI: 5

Apr  4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr  4 23:09:30 unraid kernel:  sdd: sdd1      (and reassigned by system as sdd)

Apr  4 23:09:30 unraid kernel: sd 2:0:0:0: [sdd] Attached SCSI disk

...

Apr  4 23:14:31 unraid kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B }      (a cable is bumped loose)

Apr  4 23:14:31 unraid kernel: ata8: hard resetting link                                        (drive is sdi, Disk 6, WD10EADS)

...

Apr  4 23:14:59 unraid kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen

Apr  4 23:14:59 unraid kernel: ata10: irq_stat 0x00000040, connection status changed

Apr  4 23:14:59 unraid kernel: ata10: SError: { DevExch }        (the Hitachi is connected/hotplugged to ata10 port!)

Apr  4 23:14:59 unraid kernel: ata10: hard resetting link          (so the new Seagate and Hitachi swap ports and drive symbols!)

Apr  4 23:15:00 unraid kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Apr  4 23:15:00 unraid kernel: ata10.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA800, max UDMA/133

.Apr  4 23:15:00 unraid kernel: scsi 9:0:0:0: Direct-Access    ATA      Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: Attached scsi generic sg10 type 0

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

.Apr  4 23:15:00 unraid kernel:  sdk: sdk1                  (and reassigned by system as sdk)

Apr  4 23:15:00 unraid kernel: sd 9:0:0:0: [sdk] Attached SCSI disk

...

Apr  4 23:21:05 unraid emhttp: Device inventory:      (the new drive inventory!)

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD8Q3HK (sda) 1953514584

Apr  4 23:21:05 unraid emhttp: ST31500541AS_9XW01XYQ (sdc) 1465138584

Apr  4 23:21:05 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdd) 1953514584

Apr  4 23:21:05 unraid emhttp: WDC_WD15EARS-00J2GB0_WD-WCAYY0179785 (sde) 1465138584

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD91B32 (sdf) 1953514584

Apr  4 23:21:05 unraid emhttp: ST2000DL003-9VT166_5YD900BG (sdg) 1953514584

Apr  4 23:21:05 unraid emhttp: ST31500541AS_6XW0DRHW (sdh) 1465138584

Apr  4 23:21:05 unraid emhttp: WDC_WD10EADS-00L5B1_WD-WCAU4C404846 (sdi) 976762584

Apr  4 23:21:05 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359

Apr  4 23:21:05 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdk) 1953514584

...

Apr  4 23:21:05 unraid kernel: md: import disk0: [8,160] (sdk) Hitachi_HDS5C3020ALA632_ML2220FA040BGE size: 1953514552

...

Apr  4 23:21:25 unraid kernel: md: import disk0: [8,48] (sdd) ST2000DM001-1CH164_W2F0QYSN size: 1953514552

Apr  4 23:21:25 unraid kernel: md: disk0 wrong        (parity is re-assigned from sdk Hitachi to sdd Seagate)

...

Apr  4 23:21:59 unraid emhttp: shcmd (248): /usr/local/sbin/set_ncq sdd 1 &> /dev/null      (array is started...)

...

Apr  4 23:21:59 unraid emhttp: writing MBR on disk (sdd) with partition 1 offset 64, erased: 0  (new parity disk is partitioned)

...

Apr  4 23:22:02 unraid emhttp_event: disks_mounted      (array is fully up)

...

Apr  4 23:23:16 unraid kernel: md: recovery thread syncing parity disk ...  (begin build of new parity disk sdd ST2000DM001)

...

Apr  4 23:32:46 unraid kernel:  sdk: sdk1        (some operation (Preclear?) is performed on sdk, the Hitachi, no evidence of formatting)

...

Apr  5 03:05:21 unraid kernel: ata10.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen    (something is trying to read from Hitachi!)

...      (this begins 11.5 hours of I/O errors from sdk (the old Hitachi parity drive), which is NOT formatted, must be Preclear reading)

...

Apr  5 04:16:03 unraid kernel:  ...  (media error)  (the first bad sector, many more bad sector messages to come, not shown below)

Apr  5 04:16:03 unraid kernel: ata10.00: error: { UNC }

...

Apr  5 07:10:07 unraid kernel: ata10: EH complete

Apr  5 07:14:40 unraid kernel: md: sync done. time=28284sec        (parity rebuild onto new sdd ST2000DM001 complete)

Apr  5 07:14:40 unraid kernel: md: recovery thread sync completion status: 0

Apr  5 07:17:16 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen    (Hitachi errors continue...)

...

Apr  5 14:40:54 unraid kernel: ata10: EH complete      (last sdk Hitachi error for awhile, Preclear pre-read over?)

...  (4 hours of quiet, then ...)

Apr  5 18:44:52 unraid kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen  (new parity drive sdd suddenly stops communicating)

...  (system tries for one minute to reset and recover the drive)

Apr  5 18:45:52 unraid kernel: ata3.00: disabled      (drive is quickly disabled by system)

...

Apr  5 18:45:52 unraid kernel: md: disk0 read error  (these can be ignored, drive is disabled by system)

...

Apr  5 18:46:02 unraid kernel: md: disk0 write error    (last one)

...

Apr  5 19:33:16 unraid login[27047]: ROOT LOGIN  on '/dev/pts/1' from '192.168.1.63'        (you logged in, attempted SMART analysis)

Apr  5 19:33:44 unraid kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

...

Apr  5 21:02:15 unraid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0  (sdk Hitachi errors start again)

...

Apr  5 21:29:45 unraid kernel: ata10: EH complete  (sdk Hitachi errors stop)

...

Apr  6 11:23:49 unraid kernel: mdcmd (25): stop    (system is shut down, don't know what happened to the Preclear, no report)

[end of syslog]
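
The sdd/sdk swap noted above can also be checked mechanically by comparing the two "Device inventory" listings. This is only a rough sketch in Python: the helper names `parse_inventory` and `moved_drives` are made up for illustration, and the regular expression assumes the emhttp line layout seen in this syslog.

```python
import re

# Matches emhttp inventory lines such as:
#   "Apr  4 22:55:17 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359"
# The layout is assumed from the excerpt above, not from any unRAID spec.
INVENTORY_RE = re.compile(r"emhttp: (\S+) \((sd[a-z]+)\) (\d+)")


def parse_inventory(lines):
    """Return {drive_id: device_name} for one 'Device inventory' listing."""
    inventory = {}
    for line in lines:
        match = INVENTORY_RE.search(line)
        if match:
            drive_id, device, _size = match.groups()
            inventory[drive_id] = device
    return inventory


def moved_drives(before, after):
    """List (drive_id, old_device, new_device) for drives whose letter changed."""
    return sorted(
        (drive_id, before[drive_id], after[drive_id])
        for drive_id in before.keys() & after.keys()
        if before[drive_id] != after[drive_id]
    )


# Three lines from each inventory in the syslog above.
before = parse_inventory([
    "Apr  4 22:55:17 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdd) 1953514584",
    "Apr  4 22:55:17 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359",
    "Apr  4 22:55:17 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdk) 1953514584",
])
after = parse_inventory([
    "Apr  4 23:21:05 unraid emhttp: ST2000DM001-1CH164_W2F0QYSN (sdd) 1953514584",
    "Apr  4 23:21:05 unraid emhttp: ST3500418AS_9VMVWG70 (sdj) 488378359",
    "Apr  4 23:21:05 unraid emhttp: Hitachi_HDS5C3020ALA632_ML2220FA040BGE (sdk) 1953514584",
])

for drive_id, old, new in moved_drives(before, after):
    print(f"{drive_id}: {old} -> {new}")
```

Keying on the model/serial string rather than the sdX letter is the point here: the letters follow the port and detection order, so they can change after a hotplug while the serial stays with the drive.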

    I'm tired...

 

Wow!!! Amazing that the whole box is even turning on these days! Next time I have an old PC I have to try something like that; bet with my luck it goes up in flames halfway through!

 

L


 

First parity check after rebuild. No R/W or parity errors.

 

/usr/bin/tail -f /var/log/syslog

Apr 12 12:30:49 unraid kernel: mdcmd (60): spindown 6

Apr 12 14:43:12 unraid kernel: mdcmd (61): spindown 3

Apr 12 14:43:13 unraid kernel: mdcmd (62): spindown 4

Apr 12 14:43:14 unraid kernel: mdcmd (63): spindown 7

Apr 12 16:02:21 unraid kernel: md: sync done. time=27970sec

Apr 12 16:02:21 unraid kernel: md: recovery thread sync completion status: 0

Apr 12 16:32:24 unraid kernel: mdcmd (64): spindown 0

Apr 12 16:32:24 unraid kernel: mdcmd (65): spindown 1

Apr 12 16:32:25 unraid kernel: mdcmd (66): spindown 2

Apr 12 16:32:26 unraid kernel: mdcmd (67): spindown 5
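
For comparing check times between runs, the `time=...sec` value on the "sync done" line can be converted to hours and minutes. A small Python sketch; the function names are hypothetical, and the pattern assumes the md driver message format shown in this log.

```python
import re

# Assumed format, taken from the syslog lines in this thread:
#   "... kernel: md: sync done. time=27970sec"
SYNC_DONE_RE = re.compile(r"md: sync done\. time=(\d+)sec")


def sync_seconds(line):
    """Return the sync duration in seconds, or None if the line does not match."""
    match = SYNC_DONE_RE.search(line)
    return int(match.group(1)) if match else None


def format_duration(seconds):
    """Format a number of seconds as 'Hh MMm SSs'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


line = "Apr 12 16:02:21 unraid kernel: md: sync done. time=27970sec"
print(format_duration(sync_seconds(line)))  # 7h 46m 10s
```

The parity rebuild of Apr 5 reported time=28284sec, which works out to 7h 51m 24s and matches the gap between its 23:23:16 start and 07:14:40 finish in the syslog.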

 

I'll turn on the Sickbeard and NZB plugins and see...


The problem with the Hitachi began before any cable was hot-unplugged.

 

At first, I wondered why you were swapping out the Hitachi, because until I saw its SMART report, there were no issues at all related to the Hitachi.  Its SMART report however clearly indicated you had recently been having a number of problems with bad sectors on the drive, so I could see why you were replacing it, and Preclearing it. 

 

I read on the unRAID forums about people doing drive maintenance without turning off the computer: just unmounting the array -> changing the drive -> mounting the array. I suppose I was wrong.

 

As Peter said, newer SATA hardware does *claim* to support hot swap, but all of the software has to support it too.  Your syslog showed that it can work, but I'm not sure it's something we can completely trust with UnRAID yet.  Tom's UnRAID module was not designed for hot swapping drives, although the new behavior of rechecking the drive inventory whenever the Web GUI is refreshed does help in this regard.
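
For anyone who wants to check their own syslog for this kind of hot-plug activity, here is a rough Python sketch that groups the kernel's link messages by ata port. The names `link_events` and `dropped_ports` are made up for this example, and the pattern only covers the message wordings that appear in the syslog excerpt in this thread; other kernels or drivers may word them differently.

```python
import re

# Link-state messages as they appear in the excerpt, e.g.
#   "kernel: ata1: SATA link down (SStatus 0 SControl 300)"
#   "kernel: ata1.00: disabled"
LINK_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: "
    r"(SATA link down|SATA link up [\d.]+ Gbps|disabled|hard resetting link)"
)


def link_events(lines):
    """Return {port: [event, ...]} in the order the events were logged."""
    events = {}
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            port, what = match.groups()
            events.setdefault(port, []).append(what)
    return events


def dropped_ports(events):
    """Ports that logged at least one 'SATA link down'."""
    return sorted(port for port, seen in events.items() if "SATA link down" in seen)


# Lines copied from the syslog excerpt (annotations removed).
events = link_events([
    "Apr  4 22:57:38 unraid kernel: ata10: SATA link down (SStatus 0 SControl 310)",
    "Apr  4 22:59:08 unraid kernel: ata1: hard resetting link",
    "Apr  4 22:59:09 unraid kernel: ata1: SATA link down (SStatus 0 SControl 300)",
    "Apr  4 22:59:19 unraid kernel: ata1.00: disabled",
    "Apr  4 22:59:20 unraid kernel: ata1: hard resetting link",
    "Apr  4 22:59:21 unraid kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
])
print(dropped_ports(events))
```

A port that shows "SATA link down" followed by "disabled" while the array is running is the pattern to worry about; a later "SATA link up" only says the kernel recovered the link, not that UnRAID handled the change cleanly.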

 

I still remember seeing the first syslog where a user had put their UnRAID server to sleep, then awakened it.  It looked completely broken, and I thought at first there had been a terrible crash, totally unrecoverable.  But as I examined further, I could see most of the 'broken' subsystems slowly recovering, and finally the full system appeared to be almost completely operational.  At least, the user seemed to think so, although I had my doubts!  The user was not aware of all of the errors!  That was quite a few Linux kernels back, so I don't doubt that it's much improved now.  The thing is, all of this recovery work is at a low level, completely unseen by upper level applications, such as UnRAID.  It all has to work in such a way that everything appears to be restored exactly the same, including all device ID's and symbols, buffers and handles.  Some of us veterans will wait and watch, while you youngsters and risk-takers be the guinea pigs, testing hot swap and sleep/wake for us!


When the parity check finishes I'll attach a syslog, but R/W errors have grown close to 10K again, with the check at 50%.

 

:'(

 

I'm lost. Yesterday I did a parity check with 0 errors. Today the problem is back: 10,000 R/W errors and 5 parity errors. Only 2 files of 350 MB were written to the data drives between the two parity checks, both by Sickbeard. After the first file there was no error; it was moved from cache to a data drive on 12-4-2013 at 18:36. After the second file came the error I described in my previous post; the file and the error are both close to 12-4-2014 at 06:30.

 

If it's a motherboard issue, why only on parity?

If it's the MB SATA port, and I changed it, why only on parity?

If it's the PSU (Corsair TX650), why only on parity?

If it's the drive... I changed it.

If it's the cable... I changed it.

 

Should I RMA the new Seagate parity drive? Even though it shows the same errors as the Hitachi, maybe the odds are simply against me and it is faulty too.

 

Aaaaaarrrrrrggggggg


Post a SMART report and a new syslog.

 

I returned home to a red ball on parity. There is no final parity error count from the check, but at about 60% the system had found 7. Since sda (parity) is disabled, smartctl doesn't work (smartctl -a -A /dev/sda).

 

I prefer to leave the array untouched until your replies, but the last few days' experience says that if I reboot, without touching any cable or hardware part, the parity drive will be online again, and after the array starts a parity rebuild will complete without errors. Then, when I manually run a parity check, again no errors. A couple of days pass before the R/W errors begin, and then a manual parity check puts parity offline.

 

thanks everybody.

 

 

syslog_20130414_00am.txt.zip


 

Turned off

Swapped the SATA cables between parity and a data drive

Turned on

Parity rebuild

No errors shown in syslog (as far as I understand)

 

smartctl shows 1758 UNC errors.

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   078   006    Pre-fail  Always       -       9154368
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   087   087   010    Pre-fail  Always       -       17216
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       300355
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       212
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1758
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       17180131332
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   071   058   045    Old_age   Always       -       29 (Min/Max 25/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always       -       29 (0 20 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       193677255245911
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       19542190504
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       27561533468
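
To pull the worrying numbers out of a report like the one above without reading every row, something like this Python sketch could be used. The function names and the `WATCHED` list are my own choices for illustration; which attributes matter most is a rule of thumb rather than anything smartctl itself defines, and the raw values of some attributes (the Seagate read and seek error rates, for example) are vendor-encoded and not plain counts.

```python
# Attributes commonly watched for media problems (an assumption of this
# sketch, not an official list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Parse 'smartctl -A' style attribute rows into {name: {...}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),  # first token of RAW_VALUE
            }
    return attrs


def nonzero_watched(attrs):
    """Return {name: raw} for watched attributes with a non-zero raw value."""
    return {
        name: attrs[name]["raw"]
        for name in WATCHED
        if name in attrs and attrs[name]["raw"] > 0
    }


# Four rows copied from the report in this post.
report = """\
  5 Reallocated_Sector_Ct   0x0033   087   087   010    Pre-fail  Always       -       17216
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1758
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

flagged = nonzero_watched(parse_smart_attributes(report))
print(flagged)
```

On this report it picks out the 17216 reallocated sectors and the 1758 reported uncorrectable errors, while UDMA_CRC_Error_Count stays at 0.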

 

As I wrote, perhaps the odds are against me and the new drive just happens to show the same errors as the previous one?

 

smartctl.txt

syslog_20130414_09am.txt
