PARITY NOT VALID: DISK_DSBL


Hi folks,

 

I'm pretty new to unRAID but have been very happy with what I've seen so far.

 

That being said, this morning I awoke to a lot of errors in my syslog (attached what I could of the syslog, let me know if you need more).  As far as I can tell it's the parity drive, but I just wanted to make sure.  This system has been up and running fine for about a week.

 

Here's the current system configuration

- GIGABYTE GA-P35-DS3L

- Intel Core 2 duo processor

- 2 GB of RAM (Crucial 1GBx2) DDR2 800

- 3 SATA Drives - IDE1: WDC WD20EARS-00MVWB0, IDE2: ST2000DL003, IDE3: Samsung HD204UI.  All drives are 2TB.

- unRAID 4.7 - free license

 

In unRAID and unmenu I saw the following error on the parity drive: PARITY NOT VALID: DISK_DSBL.  The parity drive is the IDE2 drive (the Seagate).  SMART tools could not be run on it via the unmenu commands (I couldn't remember the CLI commands).  The other two drives are data drives and appeared to have fine SMART reports.  unRAID was reporting that the parity drive had about 1200 errors.  The errors in the syslog start at Jul 24 00:38:20.  I believe from looking earlier in the syslogs that ATA3 is the Seagate drive.  I am assuming disk0 is unRAID nomenclature for the parity drive.  I am assuming that sdb is the parity drive as well, but I don't know of a good way of confirming that.

 

Here's the SATA portion of the syslog that I couldn't post in the attachment:

Jul 16 22:10:34 10 kernel: scsi3 : ata_piix

Jul 16 22:10:34 10 kernel: scsi4 : ata_piix

Jul 16 22:10:34 10 kernel: ata3: SATA max UDMA/133 cmd 0xe700 ctl 0xe800 bmdma 0xeb00 irq 19

Jul 16 22:10:34 10 kernel: ata4: SATA max UDMA/133 cmd 0xe900 ctl 0xea00 bmdma 0xeb08 irq 19

Jul 16 22:10:34 10 kernel: ata1: SATA link down (SStatus 0 SControl 300)

Jul 16 22:10:34 10 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Jul 16 22:10:34 10 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Jul 16 22:10:34 10 kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Jul 16 22:10:34 10 kernel: ata3.00: ATA-8: ST2000DL003-9VT166, CC32, max UDMA/133

Jul 16 22:10:34 10 kernel: ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)

Jul 16 22:10:34 10 kernel: ata2.00: HPA detected: current 3907027055, native 3907029168

Jul 16 22:10:34 10 kernel: ata2.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133

Jul 16 22:10:34 10 kernel: ata2.00: 3907027055 sectors, multi 16: LBA48 NCQ (depth 0/32)

Jul 16 22:10:34 10 kernel: ata4.00: ATA-8: SAMSUNG HD204UI, 1AQ10001, max UDMA/133

Jul 16 22:10:34 10 kernel: ata4.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)

Jul 16 22:10:34 10 kernel: ata4.00: configured for UDMA/133

Jul 16 22:10:34 10 kernel: ata3.00: configured for UDMA/133

Jul 16 22:10:34 10 kernel: ata2.00: configured for UDMA/133

Jul 16 22:10:34 10 kernel: scsi 2:0:0:0: Direct-Access    ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5

Jul 16 22:10:34 10 kernel: scsi 3:0:0:0: Direct-Access    ATA      ST2000DL003-9VT1 CC32 PQ: 0 ANSI: 5

Jul 16 22:10:34 10 kernel: scsi 4:0:0:0: Direct-Access    ATA      SAMSUNG HD204UI  1AQ1 PQ: 0 ANSI: 5

Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] 3907027055 512-byte logical blocks: (2.00 TB/1.81 TiB)

Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Write Protect is off

Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00

Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Write Protect is off

Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00

Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Write Protect is off

Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00

Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
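
One way to confirm which sdX device sits behind which ATA port is to join the kernel lines above on the drive model string. The following is a minimal sketch, assuming the syslog format shown in this excerpt; the function name `map_ata_to_sd` is made up for illustration, and the join is ambiguous if two drives of the same model are installed.

```python
import re

# Patterns written against the log lines in this thread; other kernels may differ.
ATA_RE = re.compile(r"(ata\d+)\.\d+: ATA-\d+: ([^,]+),")
SCSI_RE = re.compile(r"scsi (\d+:\d+:\d+:\d+): Direct-Access\s+ATA\s+(.+?)\s+\S+\s+PQ:")
SD_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def map_ata_to_sd(syslog_text):
    """Return {ata_port: (model, sd_device)} by matching model strings."""
    ata_models = {}   # e.g. 'ata3' -> 'ST2000DL003-9VT166'
    scsi_models = {}  # e.g. '3:0:0:0' -> 'ST2000DL003-9VT1' (truncated by the kernel)
    scsi_devs = {}    # e.g. '3:0:0:0' -> 'sdb'
    for line in syslog_text.splitlines():
        m = ATA_RE.search(line)
        if m:
            ata_models[m.group(1)] = m.group(2).strip()
            continue
        m = SCSI_RE.search(line)
        if m:
            scsi_models[m.group(1)] = m.group(2).strip()
            continue
        m = SD_RE.search(line)
        if m:
            scsi_devs[m.group(1)] = m.group(2)

    result = {}
    for port, model in ata_models.items():
        for addr, short_model in scsi_models.items():
            # The SCSI layer shortens the model string, so compare by prefix.
            if model.startswith(short_model) and addr in scsi_devs:
                result[port] = (model, scsi_devs[addr])
    return result
```

Fed the excerpt above, this pairs ata3 (the ST2000DL003) with sdb. Device letters can change between boots or after moving a drive to another port, so the mapping is only valid for the boot the log came from.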

 

My suspicion is that it's something wrong with the parity drive.  The system is currently shut down.  I do have another 2TB Seagate drive that I can throw in it (I was going to throw the extra in my 4.7 pro server once I get my RMA'd WDs back), but I wanted to post here before I proceeded.

 

I am currently running Seagate's diagnostic tools on the parity drive on another system.  It passed the short test so I am running the long test now. 

 

Any suggestions on how I should proceed based upon this and the syslog would be appreciated.  Please let me know if you need further information.

syslog-errors-start.txt

Link to comment

Disk passed Seagate's long test utility.  Put it in another system, here's the smart output:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE  
1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       211960  
3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0  
4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       23  
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0  
7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       7855072  
9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1018 
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0 
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   063   045    Old_age   Always       -       29 (Lifetime Min/Max 29/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       28
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   037   019   000    Old_age   Always       -       211960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       272618754146798
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1762619400
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4079038857
SMART Error Log Version: 1
No Errors Logged

Link to comment

So I rebuilt parity and the system seemed ok (took about 12 hours to rebuild parity).  I am seeing atrocious read speeds off the array now, however.  Between two machines on a gigabit switch (verified gigabit connectivity the whole way through) I am seeing Windows copy speeds off the disks on the unRAID server of between 1-2MB/second.  There is no cache drive on this system.  I know it's not a fair comparison, but just to make sure it wasn't my destination box, I tried copying the same files off my FreeNAS box and was reading at a 66 MB/second rate, so it's not the destination (Windows 7 in this case).  I'm not sure where to look next, or if this is related to the problems I had earlier with the parity drive, but these are some pretty painful speeds.  Any help would be greatly appreciated.

Link to comment

Neither the parity drive, nor the cache drive are involved at all when reading from the array in most cases.

 

Exception 1. You have a failed data disk, and parity and the remaining data disks are being used to re-construct the contents of the failed disk.

Exception 2.  You are using a cache disk, but the file being read has not yet been moved to the protected array, in which case the cache drive is read.
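
To illustrate the first exception, here is a minimal sketch of how a missing data disk can be reconstructed from parity plus the surviving disks, assuming a single XOR parity drive. The byte strings and the helper name `xor_blocks` are made up for the example and stand in for whole disks.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two tiny "data disks" and the parity computed across them.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
parity = xor_blocks(disk1, disk2)

# If disk1 fails, every read of it needs parity AND all remaining data disks.
rebuilt_disk1 = xor_blocks(parity, disk2)
print(rebuilt_disk1 == disk1)
```

This is why a read from a healthy data disk touches only that one disk, while a read from a failed disk spins up everything else.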

 

If you are getting poor READ rates over the LAN when reading from the server, the LAN is the most suspect unless the disk being read is putting lots of errors in the syslog, in which case it is most suspect.

 

Joe L.

 

Link to comment

Thanks for the responses folks.

 

Here's the output from ethtool -S eth0:

NIC statistics:

    tx_packets: 142228611

    rx_packets: 91714291

    tx_errors: 0

    rx_errors: 0

    rx_missed: 0

    align_errors: 0

    tx_single_collisions: 0

    tx_multi_collisions: 0

    unicast: 91665822

    broadcast: 48461

    multicast: 8

    tx_aborted: 0

    tx_underrun: 0

 

and here's the ifconfig:

root@10:~# ifconfig

eth0      Link encap:Ethernet  HWaddr 00:1d:7d:a2:7b:d7

          inet addr:10.10.8.5  Bcast:10.10.8.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:91714427 errors:0 dropped:0 overruns:0 frame:0

          TX packets:142228686 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:1167558375 (1.0 GiB)  TX bytes:2941094830 (2.7 GiB)

          Interrupt:28 Base address:0xa000

 

lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:4411 errors:0 dropped:0 overruns:0 frame:0

          TX packets:4411 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:510300 (498.3 KiB)  TX bytes:510300 (498.3 KiB)

 

Here's everything in the syslog from after the parity rebuild was started (nothing was logged between the start of the rebuild and the first line below; anything earlier in the syslog is from the boot about 24 hours ago):

 

Jul 24 21:05:42 10 kernel: md: sync done. time=38923sec rate=50189K/sec (unRAID engine)

Jul 24 21:05:42 10 kernel: md: recovery thread sync completion status: 0 (unRAID engine)

Jul 24 21:07:14 10 ntpd[1373]: synchronized to 77.243.184.65, stratum 2

Jul 24 21:24:17 10 ntpd[1373]: time reset -0.150947 s

Jul 24 21:24:46 10 ntpd[1373]: synchronized to 77.243.184.65, stratum 2

Jul 24 22:05:52 10 kernel: mdcmd (12): spindown 2 (Routine)

Jul 24 22:37:00 10 kernel: NTFS driver 2.1.29 [Flags: R/O MODULE]. (System)

Jul 24 23:32:51 10 kernel: mdcmd (13): spindown 2 (Routine)

Jul 25 05:06:52 10 kernel: mdcmd (14): spindown 2 (Routine)

Jul 25 06:13:48 10 in.telnetd[12223]: connect from 10.10.8.152 (10.10.8.152) (Routine)

Jul 25 06:13:53 10 login[12224]: invalid password for `root'  on `pts/1' from `10.10.8.152' (Logins)

Jul 25 06:13:58 10 login[12224]: ROOT LOGIN  on `pts/1' from `10.10.8.152' (Logins)

Jul 25 06:14:01 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:15:58 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:17:36 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:21:33 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:31:23 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:33:53 10 last message repeated 2 times

Jul 25 06:37:11 10 last message repeated 2 times

Jul 25 06:38:29 10 last message repeated 2 times

Jul 25 06:40:45 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:43:10 10 last message repeated 3 times

Jul 25 06:44:50 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:46:10 10 last message repeated 2 times

Jul 25 06:47:22 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:51:54 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:54:21 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:56:11 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 06:57:12 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 07:00:22 10 last message repeated 2 times

Jul 25 07:03:00 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 07:08:18 10 kernel: mdcmd (15): spindown 2 (Routine)

Jul 25 07:09:44 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 07:10:26 10 last message repeated 3 times

Jul 25 07:12:27 10 last message repeated 2 times

Jul 25 07:15:44 10 last message repeated 2 times

Jul 25 07:16:55 10 last message repeated 2 times

Jul 25 07:18:25 10 last message repeated 3 times

Jul 25 07:21:37 10 last message repeated 2 times

Jul 25 07:25:58 10 last message repeated 2 times

Jul 25 07:27:37 10 last message repeated 2 times

Jul 25 07:36:42 10 last message repeated 2 times

Jul 25 07:46:16 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 07:53:53 10 unmenu[1904]: bad method -  5224482 495434746 4005265833 67467740        7        1      64      70        0 25057330 674676705224482-^M

Jul 25 08:02:15 10 kernel: r8169: eth0: link down (Network)

Jul 25 08:02:16 10 ifplugd(eth0)[1335]: Link beat lost. (Network)

Jul 25 08:02:20 10 kernel: r8169: eth0: link up (Network)

Jul 25 08:02:20 10 ifplugd(eth0)[1335]: Link beat detected. (Network)

 

The link up/downs were me re-seating the network cable to make sure everything was ok.  Admittedly, I am getting much better transfer rates off my unRAID box now that I did that (I just transferred a 5.7GB file at about 97MB/sec off the user share onto my Windows 7 box).  So you may very well have had something there.  I'll replace it with a high quality CAT6 cable I'll nab from work today.  However, write speeds onto the unRAID user share are still bad (I copied the file off of one user share on the unRAID box down to my Windows 7 box and then attempted to copy the same file up to another user share on the unRAID box).  I'm getting about 4.30 MB/sec on the write to the unRAID share of the same file I just read at 97MB/sec.  I know last week when I copied files up to the same share I was getting about 20 MB/sec.

 

Ultimately, all I'm really trying to do is copy files off of one user share to another user share.  I'm sure I could do it from a telnet session and just move the files to one of the disk shares, but I'd still expect decent file copy rates (rate-limited by the write capabilities of the unraid box as long as the network is ok) copying through a Windows box from share to share.

 

Thanks so much for the help.

-Scott

 

Link to comment

Here's smart for the new parity drive.

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       226675474
 3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       12
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       1851793
 9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4066
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   073   067   045    Old_age   Always       -       27 (Lifetime Min/Max 27/30)
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   047   021   000    Old_age   Always       -       226675474
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       69728794053074
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4175819361
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2763197974

 

And here's the smart for the parity drive about an hour later.  The Raw_Read_Error_Rate and Seek_Error_Rate are much different now.  Did they roll over?

 

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   107   099   006    Pre-fail  Always       -       14441544
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       1912992
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4067
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   072   067   045    Old_age   Always       -       28 (Lifetime Min/Max 24/30)
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   047   021   000    Old_age   Always       -       14441544
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       62912680954323
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4185852585
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2776453480

 

Link to comment

The SMART values don't look bad. Yes, those raw numbers move around in normal operation, depending on manufacturer/model. In most cases you should ignore the raw values and stick with value/worst/thresh.
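
As a rough illustration of "stick with value/worst/thresh", the sketch below parses a `smartctl -A` style attribute table and flags only attributes whose normalized VALUE has dropped to or below THRESH, ignoring RAW_VALUE entirely. The function names are made up for the example, and the column layout is assumed to match the tables pasted in this thread.

```python
def parse_smart_attributes(text):
    """Parse rows of a smartctl attribute table into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG in column 3.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "type": parts[6],
            })
    return rows


def failing_now(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


def failed_in_past(rows):
    """Names of attributes whose worst recorded value reached the threshold."""
    return [r["name"] for r in rows if r["worst"] <= r["thresh"]]
```

For the Seagate tables above, Raw_Read_Error_Rate has VALUE 107 to 119 against THRESH 006, so it would not be flagged even though its raw number swings by an order of magnitude between readings.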

 

The network stats look normal.
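
For anyone wanting to repeat that check, here is a minimal sketch that scans `ethtool -S` output for non-zero error-type counters. The keyword list is an assumption based on the counter names in the r8169 output posted above; other drivers name their counters differently.

```python
def nonzero_error_counters(ethtool_output,
                           keywords=("error", "missed", "collision",
                                     "aborted", "underrun")):
    """Return {counter_name: value} for error-like counters that are not zero."""
    flagged = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value.isdigit():
            continue  # skips blank lines and the "NIC statistics:" header
        if any(k in name for k in keywords) and int(value) != 0:
            flagged[name] = int(value)
    return flagged
```

An empty result matches "looks normal". Clean counters do not by themselves rule out a cabling problem, so they are best read alongside the syslog.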

 

You might check with Joe about all the "bad method" messages. See where it says around 1904 in whatever version of unMenu you're using? Their number and frequency suggests something significant could be happening.

 

Link to comment

You might check with Joe about all the "bad method" messages. See where it says around 1904 in whatever version of unMenu you're using? Their number and frequency suggests something significant could be happening.

Start here: http://lime-technology.com/forum/index.php?topic=5568.msg87797#msg87797

then          http://lime-technology.com/forum/index.php?topic=5568.msg87918#msg87918

then          http://lime-technology.com/forum/index.php?topic=5568.msg88035#msg88035

then          http://lime-technology.com/forum/index.php?topic=5568.msg88053#msg88053

then          http://lime-technology.com/forum/index.php?topic=5568.msg88136#msg88136  (should look familiar)

then          http://lime-technology.com/forum/index.php?topic=5568.msg88151#msg88151

then here  http://lime-technology.com/forum/index.php?topic=5568.msg88168#msg88168  (do you have a Dell on your LAN)

finally here http://lime-technology.com/forum/index.php?topic=5568.msg88221#msg88221

and here    http://lime-technology.com/forum/index.php?topic=5568.msg89054#msg89054

 

It is unMENU reacting to an invalid "request method" and logging it.  Harmless otherwise.  Something on your LAN is probing all possible ports and addresses.

Link to comment

Ouch. Apologies, Joe.

 

Looking at this again I'd sure like to see a full syslog. Seems a little like watching Avatar through a keyhole.

 

The slow writes could be explained by the bad parity drive. Is it still showing as invalid? While most reads won't touch it, writes of course will.
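
To show why every write involves the parity drive, here is a minimal sketch of a read-modify-write parity update, assuming single XOR parity. The function name and the byte strings are made up for the example.

```python
def updated_parity(old_parity, old_data, new_data):
    """New parity after overwriting old_data with new_data on one disk.

    Only the target disk and the parity disk are read and written;
    the other data disks are not touched.
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
parity = bytes(a ^ b for a, b in zip(disk1, disk2))

new_disk1 = b"\xaa\xbb\xcc\xdd"
parity_after_write = updated_parity(parity, disk1, new_disk1)

# Same result as recomputing parity across all disks from scratch.
print(parity_after_write == bytes(a ^ b for a, b in zip(new_disk1, disk2)))
```

Under this scheme each write costs a read and a write on both the data disk and the parity disk, so a slow or erroring parity drive drags down every write while leaving ordinary reads alone.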

Link to comment

Thanks for the input, folks.  Just to clarify, the parity drive was swapped out for another after my initial post.  The syslog attached to the first post was as much as I could fit as an attachment of the errors logged with the original drive.  I have syslogs from after I put in another parity drive and rebuilt parity, but nothing of interest is showing up in there at all.

 

Here's the syslog since I rebooted the computer this evening.  I moved the parity drive to another SATA port and changed the parity drive's SATA cable.  Write speeds to the array are still pretty abysmal, about 4.5MB/second.  Would disabling the parity and trying a write to a drive prove anything? (I realize I'd endanger the array and have to rebuild parity, but I'm desperate here).

 

Thanks again for the help,

Scott

syslog-2011-07-25.txt

Link to comment

According to that syslog, md doesn't see the parity drive. Did you assign it after swapping drives? How do things look on the unRAID web page?

 

The slow write performance and the Realtek NIC may just be coincidental, but many times the NIC is the culprit.  I still don't know if any of the syslog fragments we've seen reflect activity of copying to or from the array, so I don't know if you've been seeing errors. Problems with NICs generally show up when you start hammering the server at Gb speeds, sometimes with reads but usually with writes. Run tail -f /var/log/syslog on the console and try some of your big-file copies again. Is eth0 going up and down? Any IRQ messages?
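
If the log is long, a small filter can pull out just the link flaps. This is a minimal sketch, assuming the r8169 message format seen earlier in this thread ("kernel: r8169: eth0: link down"); other NIC drivers word these messages differently, and the function name is made up for the example.

```python
import re

# Matches lines such as:
#   Jul 25 08:02:15 10 kernel: r8169: eth0: link down (Network)
LINK_EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \S+ (eth\d+): link (up|down)"
)


def link_events(lines):
    """Return a list of (timestamp, interface, state) tuples for link changes."""
    events = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if m:
            events.append(m.groups())
    return events
```

For example, `link_events(open("/var/log/syslog"))` after a large copy would list any up/down transitions that happened during the transfer.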

 

Since you're just getting started how do you feel about trying 5.0b10? It sounds like Lime Tech has moved to another Realtek driver that may work much better than anything in the 4.x line. Of course, I'm jumping a little ahead assuming the net is the problem.

Link to comment

Ahhh, yes, sorry, that boot wouldn't have shown things prior to me re-assigning the parity drive.  I've attached another boot that should show things more accurately.  I've also put in an Intel 1Gb NIC I had; I'm not sure what driver that'd be using.  I disabled the on-board NIC.  I'm not certain if it's coincidence or if I just didn't show enough patience with some earlier file transfers this evening, but I am getting better transfer rates (about 15.5 MB/second).  I've got /var/log/syslog tailed during a large file copy as I type this, but it's not showing anything new.

 

I definitely didn't see performance this bad with the original parity drive in when I copied my 2TB of data over - at that point I was getting about 20 MB/s, which is more of what one can expect.  I've got two 2TB WD Blacks coming in tomorrow from an RMA, so maybe I'll throw one of those in as parity to see how it goes, though they were going to go into my 4.7 pro beast that I'm building.  I'd also consider moving to a 5.x version... although I'd be tempted to buy a 3TB drive if I did that :P

 

-Scott

syslog-2011-07-25_1.txt

Link to comment

Thanks,

 

Initially I was doing tower->win->tower and that's where I was seeing slow speeds.  To help sort out things, I simplified the process and first did tower->win and then win->tower.  The tower->win (i.e., the read) is flying now.  The win->tower (the write) is still slow.  For some reason I was able to achieve 20MB/sec on that one file I mentioned.  However, on most of my tower->win copies, I am still seeing poor performance at times (generally less than 5MB/sec), but then other times I get very good performance (I'm writing one 5.8GB file to drive2 at 29MB/s, and wrote it to drive1 at about 20MB/s).

 

I did some raw read/write tests to each disk (dd if=/dev/zero of=/dev/disk1/testfile or of=/dev/disk2/testfile).  They're both pretty consistently at about 21MB/s.
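
For a repeatable version of that kind of local write test, the sketch below times a sequential write of zeros and reports MiB/s, in the spirit of the dd run. It assumes Python is available on the machine doing the test; the function name and the default sizes are made up for illustration, and the target path is whatever file on the disk you want to exercise.

```python
import os
import time


def write_throughput_mb_s(path, total_mb=1024, block_kb=1024):
    """Write zeros to `path` and return the sustained rate in MiB/s."""
    block = bytes(block_kb * 1024)            # one block of zero bytes
    count = (total_mb * 1024) // block_kb     # number of blocks to write
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                  # include time to reach the disk
    elapsed = max(time.monotonic() - start, 1e-9)
    written_mib = count * len(block) / (1024 * 1024)
    return written_mib / elapsed
```

Because this runs on the server itself, it takes the network out of the picture: if the local rate is steady while copies from Windows swing between 5 and 29 MB/s, the disks and parity are less likely to be the cause of the variation.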

 

I'm a network engineer by trade, so I feel pretty good about the network, though the tests and overall incompetence may indicate to you otherwise.

 

 

 

Link to comment

Sure,

 

here's the SMART for the parity drive:

smartctl -a -d ata /dev/sda (parity)

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST32000542AS

Serial Number:    5XW1NGC5

Firmware Version: CC34

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Thu Jul 28 11:59:15 2011 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 633) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

No Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x103f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  111  099  006    Pre-fail  Always      -      29813841

  3 Spin_Up_Time            0x0003  100  100  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      26

  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  063  060  030    Pre-fail  Always      -      2303857

  9 Power_On_Hours          0x0032  096  096  000    Old_age  Always      -      4138

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      14

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  098  098  000    Old_age  Always      -      2

190 Airflow_Temperature_Cel 0x0022  077  067  045    Old_age  Always      -      23 (Lifetime Min/Max 22/28)

194 Temperature_Celsius    0x0022  023  040  000    Old_age  Always      -      23 (0 18 0 0)

195 Hardware_ECC_Recovered  0x001a  048  021  000    Old_age  Always      -      29813841

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      240208930933239

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      237065898

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      3203902465

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

SMART for drive1

smartctl -a -d ata /dev/sdb (disk1)

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD20EARS-00MVWB0

Serial Number:    WD-WMAZA1148692

Firmware Version: 51.0AB51

User Capacity:    2,000,397,852,160 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Thu Jul 28 12:00:41 2011 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

was suspended by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (36180) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   167   021    Pre-fail  Always       -       6550
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4461
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5437
194 Temperature_Celsius     0x0022   126   117   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART for drive2

smartctl -a -d ata /dev/sdc (disk2)

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    SAMSUNG HD204UI

Serial Number:    S2H7J1NB501731

Firmware Version: 1AQ10001

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 6

Local Time is:    Thu Jul 28 12:01:35 2011 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (20400) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

SCT capabilities:       (0x003f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   066   025    Pre-fail  Always       -       10287
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1164
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       30304758
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       22 (Lifetime Min/Max 21/32)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       47

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run

SMART Selective self-test log data structure revision number 0

Note: revision number not 1 implies that no selective self-test has ever been run

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Completed [00% left] (0-65535)

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
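
For anyone who would rather not eyeball these tables, here is a rough Python sketch that scans smartctl -a attribute rows for the values people usually look at first. The watch list (IDs 5, 197, 198, 199) and the function names are my own assumptions for illustration, not something smartctl defines, and it assumes each RAW_VALUE starts with a plain integer. The sample rows are copied from the drive2 report above.

```python
WATCH_RAW = {5, 197, 198, 199}  # assumed watch list, not an official one


def parse_rows(text: str) -> list:
    """Parse smartctl attribute rows into dicts; skip header and other lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            # Keep only the leading integer, dropping notes like "(Lifetime ...)".
            "raw": int(parts[9].split()[0]),
        })
    return rows


def check(rows) -> list:
    """Return warnings for failed thresholds or non-zero watched raw counts."""
    warnings = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            warnings.append(f'{r["name"]} value {r["value"]} <= thresh {r["thresh"]}')
        elif r["id"] in WATCH_RAW and r["raw"] > 0:
            warnings.append(f'{r["name"]} raw={r["raw"]}')
    return warnings


sample = """\
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  252  252  010    Pre-fail  Always      -      0
194 Temperature_Celsius    0x0002  064  064  000    Old_age  Always      -      22 (Lifetime Min/Max 21/32)
197 Current_Pending_Sector  0x0032  252  252  000    Old_age  Always      -      0
199 UDMA_CRC_Error_Count    0x0036  200  200  000    Old_age  Always      -      0
"""

rows = parse_rows(sample)
warnings = check(rows)
print(len(rows), warnings)
```

An empty warning list only means none of these particular checks tripped; it does not prove the drive is healthy.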

 

Thanks for the help.


Settings for eth0:

        Supported ports: [ TP ]

        Supported link modes:  10baseT/Half 10baseT/Full

                                100baseT/Half 100baseT/Full

                                1000baseT/Full

        Supports auto-negotiation: Yes

        Advertised link modes:  10baseT/Half 10baseT/Full

                                100baseT/Half 100baseT/Full

                                1000baseT/Full

        Advertised auto-negotiation: Yes

        Speed: 1000Mb/s

        Duplex: Full

        Port: Twisted Pair

        PHYAD: 0

        Transceiver: internal

        Auto-negotiation: on

        Supports Wake-on: umbg

        Wake-on: g

        Current message level: 0x00000007 (7)

        Link detected: yes

root@10:~#

 

Here's that.  I think I mentioned that a couple of days ago I put an Intel gigE NIC into the box because it was mentioned that there have been some issues/concerns with the RealTek NICs.  Didn't see a change.  I swapped ports on my switch with my FreeNAS box.  I was able to write to my FreeNAS box at about 88MB/s on either port (the port this box was in and the port it is in now).  It's a 24-port D-Link managed switch.

 

Thanks!


Stats come with ethtool -S eth0 or ifconfig eth0, ideally after you've done a bunch of transfers so we see the results.
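
If the counter list is long, something like the Python sketch below can pull out the non-zero error-type counters. It assumes the usual "name: value" layout that ethtool -S prints; counter names vary by driver, so the keyword list and the sample text here are made up for illustration and are not output from this box.

```python
def parse_nic_stats(text: str) -> dict:
    """Turn 'name: value' lines into a dict, ignoring lines without a number."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious(stats: dict, keywords=("err", "drop", "crc", "collision", "miss")) -> dict:
    """Keep non-zero counters whose names look error-related (assumed keywords)."""
    return {
        name: count
        for name, count in stats.items()
        if count and any(word in name.lower() for word in keywords)
    }


# Hypothetical sample text; real counter names depend on the NIC driver.
sample = """\
NIC statistics:
     rx_packets: 1204331
     tx_packets: 998112
     rx_errors: 0
     tx_errors: 0
     rx_crc_errors: 12
     rx_missed_errors: 3
     collisions: 0
"""

stats = parse_nic_stats(sample)
flagged = suspicious(stats)
print(flagged)
```

To run it against live data, the text could come from subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True).stdout, taken once before and once after a batch of transfers so the two can be compared.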

 

Since you're seeing similar results with the Intel card as you did with the built-in Realtek, any network problems are more likely in the cabling, switch, etc. Sounds like you know that area, so take this for what it's worth. If you haven't already, I'd try other ports, another switch, etc. Edit: but stick with the Intel for now. If we can get consistent speeds there, then you can drop back to the Realtek and compare.

 

Back to this:

Initially I was doing tower->win->tower and that's where I was seeing slow speeds.  To help sort things out, I simplified the process and first did tower->win and then win->tower.  The tower->win (i.e., the read) is flying now.  The win->tower (the write) is still slow.  For some reason I was able to achieve 20MB/sec on that one file I mentioned.  However, on most of my tower->win copies, I am still seeing poor performance at times (generally less than 5MB/sec), but then other times I get very good performance (I'm writing one 5.8GB file to drive2 at 29MB/s, and wrote it to drive1 at about 20MB/s).

 

I'm having a tough time understanding this. Feels like you wrote this over the course of several tests. The above tells me:

 

1) tower->win is fast

2) win->tower is fast for one file

3) two sentences later, tower->win is slow.

 

I know you're battling the hordes vs. me sitting in an armchair but it might help to stick with a fixed set of measurements and watch for changes. I keep a folder of test files at each end for this: a few large files and folders full of little files. Helps provide perspective when comparing changes or multiple systems.
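
A fixed test set like that is easy to script. Below is a minimal Python sketch, assuming a folder of test files on one side and a destination folder (for example a mounted share) on the other. The function names, file names, and sizes are hypothetical, and the demo copies between temporary directories only so that it runs anywhere.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_copy(src: Path, dst_dir: Path) -> float:
    """Copy one file into dst_dir and return throughput in MB/s (1 MB = 10**6 bytes)."""
    start = time.perf_counter()
    copied = Path(shutil.copy(src, dst_dir))
    elapsed = time.perf_counter() - start
    return copied.stat().st_size / 1e6 / max(elapsed, 1e-9)


def run_suite(test_files, dst_dir: Path) -> dict:
    """Copy every file in the fixed test set and record MB/s per file name."""
    return {f.name: time_copy(f, dst_dir) for f in sorted(test_files)}


# Demo on temporary directories; in practice src would be the local test
# folder and dst a share on the server being measured.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    src_dir, dst_dir = Path(src), Path(dst)
    (src_dir / "large.bin").write_bytes(b"\0" * 4_000_000)  # one "large" file
    for i in range(3):                                       # a few small files
        (src_dir / f"small_{i}.bin").write_bytes(b"\0" * 4_000)
    results = run_suite(src_dir.iterdir(), dst_dir)

for name, mbps in results.items():
    print(f"{name}: {mbps:.1f} MB/s")
```

Numbers from small files or repeated runs can be inflated by OS caching, so real test files should be large enough (or the runs spaced out enough) that the cache does not dominate.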

 

I did some raw read/write tests to each disk (dd if=/dev/zero of=/dev/disk1/testfile or of=/dev/disk2/testfile).  They're both pretty consistently at about 21MB/s.

 

Consistent all around? Reads should be much faster than parity-protected writes, unless the array is degraded. dd to /dev/diskX will be slower because it's writing to both the data disk and the parity drive.
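
To show why a parity-protected write costs more than a read, here is a minimal Python sketch of a single-parity read-modify-write. It assumes plain XOR parity over equal-sized blocks; the function names and byte values are made up for illustration, and this is not unRAID's actual code.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def write_with_parity(old_data: bytes, old_parity: bytes, new_data: bytes):
    """Return (new_data, new_parity, io_ops) for one block update.

    Read-modify-write: read the old data and old parity, then write the new
    data and new parity, so one logical write costs four disk operations.
    """
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    io_ops = {"reads": 2, "writes": 2}
    return new_data, new_parity, io_ops


# Two data disks and one parity disk, one 4-byte block each (made-up values).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks(disk1, disk2)

new_disk2 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk2, parity, io_ops = write_with_parity(disk2, parity, new_disk2)

# Parity must still equal the XOR of all data disks after the update.
parity_ok = parity == xor_blocks(disk1, disk2)
print(parity_ok, io_ops)
```

A plain read touches only the one data disk, which is why reads are expected to be noticeably faster than writes on a healthy array.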


Archived

This topic is now archived and is closed to further replies.
