srogala Posted July 24, 2011
Hi folks, I'm pretty new to unRAID but have been very happy with what I've seen so far. That said, this morning I awoke to a lot of errors in my syslog (I've attached what I could of the syslog; let me know if you need more). As far as I can tell it's the parity drive, but I wanted to make sure. This system had been up and running fine for about a week.
Here's the current system configuration:
- GIGABYTE GA-P35-DS3L
- Intel Core 2 Duo processor
- 2 GB of RAM (Crucial 1GBx2) DDR2 800
- 3 SATA drives, all 2TB: IDE1: WDC WD20EARS-00MVWB0, IDE2: ST2000DL003, IDE3: Samsung HD204UI
- unRAID 4.7, free license
In unRAID and unmenu I saw the following error on the parity drive: PARITY NOT VALID: DISK_DSBL. The parity drive is the IDE2 drive (the Seagate). The SMART tools could not be run on it via the unmenu commands (and I couldn't remember the CLI commands). The other two drives are data drives and appeared to have fine SMART reports. unRAID was reporting that the parity drive had about 1200 errors.
The errors in the syslog start at Jul 24 00:38:20. From looking earlier in the syslog, I believe ata3 is the Seagate drive. I am assuming disk0 is unRAID nomenclature for the parity drive. I am assuming that sdb is the parity drive as well, but I don't know of a good way of confirming that.
Here's the SATA portion of the syslog that I couldn't fit in the attachment:
Jul 16 22:10:34 10 kernel: scsi3 : ata_piix
Jul 16 22:10:34 10 kernel: scsi4 : ata_piix
Jul 16 22:10:34 10 kernel: ata3: SATA max UDMA/133 cmd 0xe700 ctl 0xe800 bmdma 0xeb00 irq 19
Jul 16 22:10:34 10 kernel: ata4: SATA max UDMA/133 cmd 0xe900 ctl 0xea00 bmdma 0xeb08 irq 19
Jul 16 22:10:34 10 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jul 16 22:10:34 10 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 16 22:10:34 10 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 16 22:10:34 10 kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 16 22:10:34 10 kernel: ata3.00: ATA-8: ST2000DL003-9VT166, CC32, max UDMA/133
Jul 16 22:10:34 10 kernel: ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
Jul 16 22:10:34 10 kernel: ata2.00: HPA detected: current 3907027055, native 3907029168
Jul 16 22:10:34 10 kernel: ata2.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
Jul 16 22:10:34 10 kernel: ata2.00: 3907027055 sectors, multi 16: LBA48 NCQ (depth 0/32)
Jul 16 22:10:34 10 kernel: ata4.00: ATA-8: SAMSUNG HD204UI, 1AQ10001, max UDMA/133
Jul 16 22:10:34 10 kernel: ata4.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
Jul 16 22:10:34 10 kernel: ata4.00: configured for UDMA/133
Jul 16 22:10:34 10 kernel: ata3.00: configured for UDMA/133
Jul 16 22:10:34 10 kernel: ata2.00: configured for UDMA/133
Jul 16 22:10:34 10 kernel: scsi 2:0:0:0: Direct-Access ATA WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
Jul 16 22:10:34 10 kernel: scsi 3:0:0:0: Direct-Access ATA ST2000DL003-9VT1 CC32 PQ: 0 ANSI: 5
Jul 16 22:10:34 10 kernel: scsi 4:0:0:0: Direct-Access ATA SAMSUNG HD204UI 1AQ1 PQ: 0 ANSI: 5
Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] 3907027055 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Write Protect is off
Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Write Protect is off
Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Write Protect is off
Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jul 16 22:10:34 10 kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 16 22:10:34 10 kernel: sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 16 22:10:34 10 kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
My suspicion is that something is wrong with the parity drive. The system is currently shut down. I do have another 2TB Seagate drive that I can throw in it (I was going to put the extra in my 4.7 Pro server once I get my RMA'd WDs back), but I wanted to post here before I proceeded. I am currently running Seagate's diagnostic tools on the parity drive on another system. It passed the short test, so I am running the long test now.
Any suggestions on how I should proceed based on this and the syslog would be appreciated. Please let me know if you need further information.
Attachment: syslog-errors-start.txt
srogala Posted July 24, 2011
Disk passed Seagate's long test utility. Put it in another system, here's the smart output:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 104 099 006 Pre-fail Always - 211960
3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 7855072
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1018
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 16
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 063 045 Old_age Always - 29 (Lifetime Min/Max 29/31)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 28
194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 23 0 0)
195 Hardware_ECC_Recovered 0x001a 037 019 000 Old_age Always - 211960
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 272618754146798
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 1762619400
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 4079038857
SMART Error Log Version: 1
No Errors Logged
srogala Posted July 25, 2011
So I rebuilt parity and the system seemed OK (it took about 12 hours to rebuild parity). I am seeing atrocious read speeds off the array now, however. Between two machines on a gigabit switch (I verified gigabit connectivity the whole way through), Windows copy speeds off the disks on the unRAID server are between 1 and 2 MB/second. There is no cache drive on this system.
I know it's not a fair comparison, but to make sure it wasn't my destination box (Windows 7 in this case), I tried copying the same files off my FreeNAS box and read at 66 MB/second, so it's not the destination. I'm not sure where to look next, or whether this is related to the earlier problems with the parity drive, but these are some pretty painful speeds. Any help would be greatly appreciated.
Joe L. Posted July 25, 2011
In most cases, neither the parity drive nor the cache drive is involved at all when reading from the array.
Exception 1. You have a failed data disk, and parity and the remaining data disks are being used to reconstruct the contents of the failed disk.
Exception 2. You are using a cache disk, but the file being read has not yet been moved to the protected array, in which case the cache drive is read.
If you are getting poor READ rates over the LAN when reading from the server, the LAN is the most suspect, unless the disk being read is putting lots of errors in the syslog, in which case the disk is the most suspect.
Joe L.
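Exception 1 above can be illustrated with a small sketch. It assumes single parity computed as the byte-wise XOR of all data disks; the byte values are made up for illustration and the helper name `xor_blocks` is hypothetical, not part of unRAID.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents of the same four-byte stripe on two data disks.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0x9A, 0xBC, 0xDE, 0xF0])

# Parity is the XOR of every data disk.
parity = xor_blocks([disk1, disk2])

# If disk1 fails, its contents are the XOR of parity and the surviving disks,
# which is why a read from a failed disk touches every other drive.
rebuilt_disk1 = xor_blocks([parity, disk2])
print(rebuilt_disk1 == disk1)
```

With all data disks healthy, a read only needs the one disk that holds the file, which is the normal case described above.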
cyrnel Posted July 25, 2011
Another syslog would help, ideally one captured after you reproduce the slow copies. The output of ethtool -S eth0 and/or ifconfig would also be useful.
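When reading the requested counters, the ones that matter are usually the error-type counters that are not zero. Below is a minimal sketch that parses `name: value` lines of the kind `ethtool -S` prints and flags those. The sample text is invented, counter names vary by driver, and the keyword list is only an assumption about what is worth flagging.

```python
def parse_nic_stats(text):
    """Turn 'name: value' lines into a dict, skipping headers and blanks."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious_counters(stats, keywords=("error", "missed", "collision",
                                         "abort", "underrun", "drop")):
    """Return the non-zero counters whose name hints at a problem."""
    return {name: count for name, count in stats.items()
            if count > 0 and any(key in name for key in keywords)}


# Invented sample; real output would come from `ethtool -S eth0`.
SAMPLE = """NIC statistics:
     tx_packets: 1000
     rx_packets: 900
     tx_errors: 0
     rx_errors: 3
     rx_missed: 0
"""

stats = parse_nic_stats(SAMPLE)
print(suspicious_counters(stats))
```

Packet totals are ignored on purpose; only counters that indicate loss or retransmission are reported.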
srogala Posted July 25, 2011
Thanks for the responses, folks. Here's the output from ethtool -S eth0:
NIC statistics:
 tx_packets: 142228611
 rx_packets: 91714291
 tx_errors: 0
 rx_errors: 0
 rx_missed: 0
 align_errors: 0
 tx_single_collisions: 0
 tx_multi_collisions: 0
 unicast: 91665822
 broadcast: 48461
 multicast: 8
 tx_aborted: 0
 tx_underrun: 0
and here's the ifconfig:
root@10:~# ifconfig
eth0 Link encap:Ethernet HWaddr 00:1d:7d:a2:7b:d7
 inet addr:10.10.8.5 Bcast:10.10.8.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:91714427 errors:0 dropped:0 overruns:0 frame:0
 TX packets:142228686 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:1167558375 (1.0 GiB) TX bytes:2941094830 (2.7 GiB)
 Interrupt:28 Base address:0xa000
lo Link encap:Local Loopback
 inet addr:127.0.0.1 Mask:255.0.0.0
 UP LOOPBACK RUNNING MTU:16436 Metric:1
 RX packets:4411 errors:0 dropped:0 overruns:0 frame:0
 TX packets:4411 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
 RX bytes:510300 (498.3 KiB) TX bytes:510300 (498.3 KiB)
Here's everything in the syslog from the end of the parity rebuild onward. Nothing was logged between the start of the rebuild and the first line below; anything earlier in the syslog is from the boot about 24 hours ago:
Jul 24 21:05:42 10 kernel: md: sync done. time=38923sec rate=50189K/sec (unRAID engine)
Jul 24 21:05:42 10 kernel: md: recovery thread sync completion status: 0 (unRAID engine)
Jul 24 21:07:14 10 ntpd[1373]: synchronized to 77.243.184.65, stratum 2
Jul 24 21:24:17 10 ntpd[1373]: time reset -0.150947 s
Jul 24 21:24:46 10 ntpd[1373]: synchronized to 77.243.184.65, stratum 2
Jul 24 22:05:52 10 kernel: mdcmd (12): spindown 2 (Routine)
Jul 24 22:37:00 10 kernel: NTFS driver 2.1.29 [Flags: R/O MODULE]. (System)
Jul 24 23:32:51 10 kernel: mdcmd (13): spindown 2 (Routine)
Jul 25 05:06:52 10 kernel: mdcmd (14): spindown 2 (Routine)
Jul 25 06:13:48 10 in.telnetd[12223]: connect from 10.10.8.152 (10.10.8.152) (Routine)
Jul 25 06:13:53 10 login[12224]: invalid password for `root' on `pts/1' from `10.10.8.152' (Logins)
Jul 25 06:13:58 10 login[12224]: ROOT LOGIN on `pts/1' from `10.10.8.152' (Logins)
Jul 25 06:14:01 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:15:58 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:17:36 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:21:33 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:31:23 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:33:53 10 last message repeated 2 times
Jul 25 06:37:11 10 last message repeated 2 times
Jul 25 06:38:29 10 last message repeated 2 times
Jul 25 06:40:45 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:43:10 10 last message repeated 3 times
Jul 25 06:44:50 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:46:10 10 last message repeated 2 times
Jul 25 06:47:22 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:51:54 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:54:21 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:56:11 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 06:57:12 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 07:00:22 10 last message repeated 2 times
Jul 25 07:03:00 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 07:08:18 10 kernel: mdcmd (15): spindown 2 (Routine)
Jul 25 07:09:44 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 07:10:26 10 last message repeated 3 times
Jul 25 07:12:27 10 last message repeated 2 times
Jul 25 07:15:44 10 last message repeated 2 times
Jul 25 07:16:55 10 last message repeated 2 times
Jul 25 07:18:25 10 last message repeated 3 times
Jul 25 07:21:37 10 last message repeated 2 times
Jul 25 07:25:58 10 last message repeated 2 times
Jul 25 07:27:37 10 last message repeated 2 times
Jul 25 07:36:42 10 last message repeated 2 times
Jul 25 07:46:16 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 07:53:53 10 unmenu[1904]: bad method - 5224482 495434746 4005265833 67467740 7 1 64 70 0 25057330 674676705224482-^M
Jul 25 08:02:15 10 kernel: r8169: eth0: link down (Network)
Jul 25 08:02:16 10 ifplugd(eth0)[1335]: Link beat lost. (Network)
Jul 25 08:02:20 10 kernel: r8169: eth0: link up (Network)
Jul 25 08:02:20 10 ifplugd(eth0)[1335]: Link beat detected. (Network)
The link up/downs were me re-seating the network cable to make sure everything was OK. Admittedly, I am getting much better transfer rates off my unRAID box now that I did that (I just transferred a 5.7GB file at about 97MB/sec off the user share onto my Windows 7 box), so you may very well have had something there. I'll replace it with a high-quality CAT6 cable I'll nab from work today.
However, write speeds onto the unRAID user share are still bad. (I copied the file off one user share on the unRAID box down to my Windows 7 box and then attempted to copy the same file up to another user share on the unRAID box.) I'm getting about 4.30 MB/sec writing to the unRAID share the same file I just read at 97MB/sec. I know that last week, when I copied files up to the same share, I was getting about 20 MB/sec.
Ultimately, all I'm really trying to do is copy files from one user share to another. I'm sure I could do it from a telnet session and just move the files to one of the disk shares, but I'd still expect decent copy rates from share to share through a Windows box (limited by the write capabilities of the unRAID box, as long as the network is OK).
Thanks so much for the help.
-Scott
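As a sanity check on the rebuild, the `md: sync done` line in the syslog excerpt of this post can be turned into hours and compared against the drive size. This is a rough sketch that assumes the `K` in `rate=50189K/sec` means 1024 bytes; the sector count is the one the kernel reported for the 2TB drives earlier in the thread.

```python
import re

SYNC_LINE = ("Jul 24 21:05:42 10 kernel: md: sync done. "
             "time=38923sec rate=50189K/sec")


def parse_sync(line):
    """Extract (seconds, rate in K/sec) from an md 'sync done' line."""
    match = re.search(r"time=(\d+)sec rate=(\d+)K/sec", line)
    if match is None:
        raise ValueError("not an md sync completion line")
    return int(match.group(1)), int(match.group(2))


seconds, rate_k = parse_sync(SYNC_LINE)
hours = seconds / 3600

# Total amount covered by the sync, in KiB (assumption: K = 1024 bytes).
synced_kib = seconds * rate_k

# 3907029168 sectors of 512 bytes, as logged by the kernel for these drives.
drive_kib = 3907029168 * 512 // 1024

print(f"sync took {hours:.1f} hours")
print(f"synced/drive size ratio: {synced_kib / drive_kib:.4f}")
```

The elapsed time works out to roughly 10.8 hours, and the product of time and rate lands within a fraction of a percent of the full drive size, so the rebuild itself ran at a steady rate of about 50 MB/s.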
srogala Posted July 25, 2011 Author Share Posted July 25, 2011 Here's smart for the new parity drive. ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 226675474 3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 12 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 063 060 030 Pre-fail Always - 1851793 9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4066 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 12 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 190 Airflow_Temperature_Cel 0x0022 073 067 045 Old_age Always - 27 (Lifetime Min/Max 27/30) 194 Temperature_Celsius 0x0022 027 040 000 Old_age Always - 27 (0 18 0 0) 195 Hardware_ECC_Recovered 0x001a 047 021 000 Old_age Always - 226675474 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 69728794053074 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 4175819361 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 2763197974 And here's the smart for the parity drive about an hour later. The Raw_Read_Error_Rate and Seek_Error_Rate are much different now. Did they roll over? 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 107 099 006 Pre-fail Always - 14441544 3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 063 060 030 Pre-fail Always - 1912992 9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4067 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 12 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 190 Airflow_Temperature_Cel 0x0022 072 067 045 Old_age Always - 28 (Lifetime Min/Max 24/30) 194 Temperature_Celsius 0x0022 028 040 000 Old_age Always - 28 (0 18 0 0) 195 Hardware_ECC_Recovered 0x001a 047 021 000 Old_age Always - 14441544 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 62912680954323 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 4185852585 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 2776453480 Link to comment
cyrnel Posted July 25, 2011
The SMART values don't look bad. Yes, those raw numbers move around in normal operation, depending on manufacturer and model. In most cases you should ignore the raw values and stick with value/worst/thresh.
The network stats look normal.
You might check with Joe about all the "bad method" messages. See where it says around 1904 in whatever version of unMenu you're using? Their number and frequency suggest something significant could be happening.
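To make the value/worst/thresh advice concrete, here is a small sketch that parses attribute rows in the `smartctl` layout quoted in this thread and looks only at the normalized columns. The rows are copied from the two parity-drive reports above; the function names are hypothetical, and the rule used (an attribute is a concern when VALUE or WORST is at or below a non-zero THRESH) is a simplification of how smartctl reports failures.

```python
def parse_smart_row(row):
    """Split one attribute row into its normalized fields and raw value."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": " ".join(fields[9:]),
    }


def is_at_or_below_threshold(attr):
    """True when the normalized value has reached its failure threshold."""
    return attr["thresh"] > 0 and min(attr["value"], attr["worst"]) <= attr["thresh"]


ROWS = [
    "1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 226675474",
    "1 Raw_Read_Error_Rate 0x000f 107 099 006 Pre-fail Always - 14441544",
    "7 Seek_Error_Rate 0x000f 063 060 030 Pre-fail Always - 1912992",
]

attrs = [parse_smart_row(row) for row in ROWS]
at_risk = [a["name"] for a in attrs if is_at_or_below_threshold(a)]

for a in attrs:
    margin = a["value"] - a["thresh"]
    print(f'{a["name"]}: value={a["value"]} thresh={a["thresh"]} margin={margin}')
print("at risk:", at_risk)
```

The raw Raw_Read_Error_Rate changed a great deal between the two reports, but the normalized value stayed far above its threshold in both, which is the comparison that matters here.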
Joe L. Posted July 25, 2011
Quoting cyrnel: "You might check with Joe about all the "bad method" messages. See where it says around 1904 in whatever version of unMenu you're using? Their number and frequency suggests something significant could be happening."
Start here: http://lime-technology.com/forum/index.php?topic=5568.msg87797#msg87797
then http://lime-technology.com/forum/index.php?topic=5568.msg87918#msg87918
then http://lime-technology.com/forum/index.php?topic=5568.msg88035#msg88035
then http://lime-technology.com/forum/index.php?topic=5568.msg88053#msg88053
then http://lime-technology.com/forum/index.php?topic=5568.msg88136#msg88136 (should look familiar)
then http://lime-technology.com/forum/index.php?topic=5568.msg88151#msg88151
then here http://lime-technology.com/forum/index.php?topic=5568.msg88168#msg88168 (do you have a Dell on your LAN?)
finally here http://lime-technology.com/forum/index.php?topic=5568.msg88221#msg88221
and here http://lime-technology.com/forum/index.php?topic=5568.msg89054#msg89054
It is unMENU reacting to an invalid "request method" and logging it. Harmless otherwise. Something on your LAN is probing all possible ports and addresses.
cyrnel Posted July 25, 2011
Ouch. Apologies, Joe.
Looking at this again I'd sure like to see a full syslog. Seems a little like watching Avatar through a keyhole.
The slow writes could be explained by the bad parity drive. Is it still showing as invalid? While most reads won't touch it, writes of course will.
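Why writes always involve the parity drive can be shown with a short sketch. It assumes XOR parity updated with a read-modify-write cycle (read old data and old parity, then write new data and new parity); the byte values and function names are invented for illustration.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity, old_data, new_data):
    """New parity after one data block changes, without reading other disks."""
    return xor(xor(old_parity, old_data), new_data)


# Made-up stripe on a two-data-disk array.
disk1_old = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
parity_old = xor(disk1_old, disk2)

# Overwrite the block on disk1 and update parity incrementally.
disk1_new = bytes([0x55, 0x66, 0x77, 0x88])
parity_new = updated_parity(parity_old, disk1_old, disk1_new)

# The incremental result matches parity recomputed from scratch.
print(parity_new == xor(disk1_new, disk2))
```

Under this assumption each write costs a read and a write on both the data disk and the parity disk, so a slow or erroring parity drive limits every write to the array while leaving ordinary reads unaffected.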
srogala Posted July 26, 2011
Thanks for the input, folks. Just to clarify, the parity drive was swapped out for another after my initial post. The syslog in the first post was as much as I could fit in an attachment of the errors logged with the original parity drive. I have syslogs from after I put in another parity drive and rebuilt parity, but nothing of interest is showing up in them at all.
Here's the syslog since I rebooted the computer this evening. I moved the parity drive to another SATA port and changed the parity drive's SATA cable. Write speeds to the array are still pretty abysmal, about 4.5MB/second. Would disabling parity and trying a write to a drive prove anything? (I realize I'd endanger the array and have to rebuild parity, but I'm desperate here.)
Thanks again for the help,
Scott
Attachment: syslog-2011-07-25.txt
cyrnel Posted July 26, 2011
According to that syslog, md doesn't see the parity drive. Did you assign it after swapping drives? How do things look on the unRAID web page?
The slow write performance and the Realtek NIC may just be coincidental, but the Realtek NIC is often the culprit. I still don't know whether any of the syslog fragments we've seen reflect copying to or from the array, so I don't know whether you've been seeing errors. Problems with NICs generally show up when you start hammering the server at Gb speeds, sometimes with reads but usually with writes. Run tail -f /var/log/syslog on the console and try some of your big-file copies again. Is eth0 going up and down? Any IRQ messages?
Since you're just getting started, how do you feel about trying 5.0b10? It sounds like Lime Tech has moved to another Realtek driver that may work much better than anything in the 4.x line. Of course, I'm jumping a little ahead in assuming the net is the problem.
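Watching the log by eye during a long copy is easy to get wrong, so the check above can be approximated with a filter. This is only a sketch: the regular expression is a guess at which messages matter (link state changes, ifplugd link beat messages, and anything mentioning IRQs), and the sample lines are taken from the syslog posted earlier in the thread.

```python
import re

# Assumed patterns of interest; adjust for the driver in use.
NETWORK_EVENT = re.compile(
    r"eth\d+.*link (up|down)|Link beat|\birq\b", re.IGNORECASE)


def network_events(lines):
    """Return the syslog lines that look like link flaps or IRQ trouble."""
    return [line for line in lines if NETWORK_EVENT.search(line)]


SAMPLE = [
    "Jul 25 07:08:18 10 kernel: mdcmd (15): spindown 2",
    "Jul 25 08:02:15 10 kernel: r8169: eth0: link down",
    "Jul 25 08:02:16 10 ifplugd(eth0)[1335]: Link beat lost.",
    "Jul 25 07:53:53 10 unmenu[1904]: bad method - 5224482",
]

for line in network_events(SAMPLE):
    print(line)
```

On a live system the same function could be fed lines read from /var/log/syslog. Boot-time lines that merely announce an IRQ assignment would also match, so hits logged during a copy are the ones worth reading.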
srogala Posted July 26, 2011
Ahhh, yes, sorry, that boot wouldn't have shown things prior to me re-assigning the parity drive. I've attached another boot that should show things more accurately.
I've also put in an Intel 1Gb NIC I had; I'm not sure what driver that would be using. I disabled the on-board NIC. I'm not certain whether it's coincidence, or whether I just didn't show enough patience with some earlier file transfers this evening, but I am getting better transfer rates (about 15.5 MB/second).
I've got /var/log/syslog tailed during a large file copy as I type this, but it's not showing anything new.
I definitely didn't see performance this bad with the original parity drive in when I copied my 2TB of data over. At that point I was getting about 20 MB/s, which is more like what one can expect.
I've got two 2TB WD Blacks coming in tomorrow from an RMA. Maybe I'll throw one of those in as parity to see how it goes, though they were going to go into the 4.7 Pro beast that I'm building. I'd also consider moving to a 5.x version, although I'd be tempted to buy a 3TB drive if I did that.
-Scott
Attachment: syslog-2011-07-25_1.txt
cyrnel Posted July 26, 2011
Wow, that system should be much faster. If you were copying tower->win->tower then 20MB/s might make sense.
Did you check your BIOS' SATA AHCI/IDE settings?
Have you tried raw read/write tests on each disk? (dd)
srogala Posted July 27, 2011
Thanks. Initially I was doing tower->win->tower, and that's where I was seeing slow speeds. To help sort things out, I simplified the process and first did tower->win and then win->tower. The tower->win copy (i.e., the read) is flying now. The win->tower copy (the write) is still slow. For some reason I was able to achieve 20MB/sec on that one file I mentioned. However, on most of my win->tower copies I am still seeing poor performance at times (generally less than 5MB/sec), but at other times I get very good performance (I'm writing one 5.8GB file to drive2 at 29MB/s, and wrote it to drive1 at about 20MB/s).
I did some raw write tests to each disk (dd if=/dev/zero of=/mnt/disk1/testfile, or of=/mnt/disk2/testfile). They're both pretty consistently at about 21MB/s.
I'm a network engineer by trade, so I feel pretty good about the network, though the tests and overall incompetence may indicate to you otherwise.
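For comparison with the dd runs, the same kind of sequential write test can be sketched in Python. This is an illustration, not a replacement for dd: the function name, block size, and file size are arbitrary choices, and the demo writes to a temporary directory rather than to an array disk. Pointing it at a path on the disk under test would be the analogous measurement.

```python
import os
import tempfile
import time


def timed_write(path, total_mib=16, block_mib=1):
    """Write zero-filled blocks to path and return the rate in MiB/s."""
    block = bytes(block_mib * 1024 * 1024)
    blocks = total_mib // block_mib
    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(blocks):
            handle.write(block)
        # Flush and fsync so the timing includes the data reaching the disk,
        # not just the page cache.
        handle.flush()
        os.fsync(handle.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (blocks * block_mib) / elapsed


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "testfile")  # placeholder path
    rate = timed_write(target, total_mib=8)
    size = os.path.getsize(target)

print(f"wrote {size} bytes at {rate:.1f} MiB/s")
```

Small test files mostly measure caching, so a file several times larger than the server's RAM gives a figure closer to what a long copy will sustain.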
dgaschk Posted July 28, 2011
Post SMART reports for all of your drives.
srogala Posted July 28, 2011 Author Share Posted July 28, 2011 Sure, here's the SMART for the parity drive: smartctl -a -d ata /dev/sda (parity) smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST32000542AS Serial Number: 5XW1NGC5 Firmware Version: CC34 User Capacity: 2,000,398,934,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Thu Jul 28 11:59:15 2011 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 633) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 29813841 3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 26 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 063 060 030 Pre-fail Always - 2303857 9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4138 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 190 Airflow_Temperature_Cel 0x0022 077 067 045 Old_age Always - 23 (Lifetime Min/Max 22/28) 194 Temperature_Celsius 0x0022 023 040 000 Old_age Always - 23 (0 18 0 0) 195 Hardware_ECC_Recovered 0x001a 048 021 000 Old_age Always - 29813841 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 240208930933239 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 237065898 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3203902465 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. 
[To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. SMART for drive1 smartctl -a -d ata /dev/sdb (disk1) smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: WDC WD20EARS-00MVWB0 Serial Number: WD-WMAZA1148692 Firmware Version: 51.0AB51 User Capacity: 2,000,397,852,160 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Thu Jul 28 12:00:41 2011 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (36180) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. 
General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 169 167 021 Pre-fail Always - 6550 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 57 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 4461 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 27 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 22 193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5437 194 Temperature_Celsius 0x0022 126 117 000 Old_age Always - 24 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. 
[To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Drive2

smartctl -a -d ata /dev/sdc (disk2)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J1NB501731
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Thu Jul 28 12:01:35 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00)
  Offline data collection activity was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)
  The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (20400) seconds.
Offline data collection capabilities: (0x5b)
  SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new command.
  Offline surface scan supported.
  Self-test supported.
  No Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities: (0x0003)
  Saves SMART data before entering power-saving mode.
  Supports SMART auto save timer.
Error logging capability: (0x01)
  Error logging supported.
  General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f)
  SCT Status supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   066   025    Pre-fail  Always       -       10287
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1164
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       30304758
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       22 (Lifetime Min/Max 21/32)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       47

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.
[To run self-tests, use: smartctl -t]

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Thanks for the help.
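For anyone comparing reports like these by eye, here is a minimal Python sketch that pulls raw values out of smartctl-style attribute rows and lists any watched counters that are non-zero. The watch list and the function names (`parse_attributes`, `nonzero_watched`) are my own assumptions for illustration, not something smartctl or unRAID prescribes; the sample rows are copied from the drive1 (WD20EARS) report.

```python
import re

# Counters that are commonly expected to stay at 0 on a healthy drive.
# This watch list is an assumption made for illustration only.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute name -> leading integer of RAW_VALUE for each table row."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and non-table lines simply do not match
            attrs[match.group(2)] = int(match.group(10))
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw != 0}


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4461
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    attributes = parse_attributes(SAMPLE)
    print("parsed:", attributes)
    print("non-zero watched counters:", nonzero_watched(attributes))
```

An empty result only means these particular counters are zero; it is not a substitute for reading the full report or running a self-test.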
dgaschk Posted July 28, 2011

They all look OK. The results of the dd test are good. The array is accepting writes at 21MBps. The bottleneck must be elsewhere. What does "ethtool eth0" show?
srogala Posted July 28, 2011

Settings for eth0:
  Supported ports: [ TP ]
  Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
  Supports auto-negotiation: Yes
  Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
  Advertised auto-negotiation: Yes
  Speed: 1000Mb/s
  Duplex: Full
  Port: Twisted Pair
  PHYAD: 0
  Transceiver: internal
  Auto-negotiation: on
  Supports Wake-on: umbg
  Wake-on: g
  Current message level: 0x00000007 (7)
  Link detected: yes
root@10:~#

Here's that. I think I mentioned that a couple days ago I put an Intel gigE NIC into the box because it was mentioned that there have been some issues/concerns with the RealTek NICs. Didn't see a change.

I swapped ports on my switch with my FreeNAS box. I was able to write to my FreeNAS box at about 88MB/s in either port (the port this box was in and the port it is now). It's a 24 port D-Link managed switch.

Thanks!
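As a rough illustration of what to look for in that ethtool output, the Python sketch below reads the "Key: value" lines and checks that the link negotiated gigabit, full duplex, with link detected. The function names (`parse_ethtool`, `link_is_gigabit_full`) and the pass criteria are assumptions of mine; the sample text is copied from the post above.

```python
import re


def parse_ethtool(text):
    """Collect 'Key: value' pairs from ethtool-style text into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip lines without a colon and header lines with nothing after it.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def link_is_gigabit_full(fields):
    """True when the link reports at least 1000Mb/s, full duplex, link up."""
    match = re.match(r"(\d+)Mb/s", fields.get("Speed", ""))
    speed = int(match.group(1)) if match else 0
    return (
        speed >= 1000
        and fields.get("Duplex") == "Full"
        and fields.get("Link detected") == "yes"
    )


SAMPLE = """\
Settings for eth0:
  Supported ports: [ TP ]
  Speed: 1000Mb/s
  Duplex: Full
  Port: Twisted Pair
  Auto-negotiation: on
  Link detected: yes
"""

if __name__ == "__main__":
    info = parse_ethtool(SAMPLE)
    print("negotiated gigabit full duplex:", link_is_gigabit_full(info))
```

A link that negotiates correctly can still drop or corrupt frames, so this check says nothing about error counters.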
cyrnel Posted July 28, 2011

Stats come with ethtool -S eth0 or ifconfig eth0, ideally after you've done a bunch of transfers so we see the results.

Since you're seeing similar results with the Intel card as you did with the built-in Realtek, any network problems are more likely in the cabling, switch, etc. Sounds like you know that area, so take this for what it's worth. If you haven't already, I'd be trying other ports, another switch, etc.

Edit: but stick with the Intel for now. If we can get consistent speeds there then you can drop to the Realtek and compare.

Back to this:

"Initially I was doing tower->win->tower and that's where I was seeing slow speeds. To help sort out things, I simplified the process and first did tower->win and then win->tower. The tower->win (ie, the read) is flying now. The win->tower (the write) is still slow. For some reason I was able to achieve 20MB/sec on that one file I mentioned. However, on most of my tower->win copies, I am still seeing poor performance at times (generally less than 5MB/sec), but then other times I get very good performance (I'm writing one 5.8GB file to drive2 at 29MB/s, and wrote it to drive1 at about 20MB/s)."

I'm having a tough time understanding this. Feels like you wrote this over the course of several tests. The above tells me:
1) tower->win is fast
2) win->tower is fast for one file
3) two sentences later, tower->win is slow.

I know you're battling the hordes vs. me sitting in an armchair, but it might help to stick with a fixed set of measurements and watch for changes. I keep a folder of test files at each end for this: a few large files and folders full of little files. Helps provide perspective when comparing changes or multiple systems.

"I did some raw read/write tests to each disk (dd if=/dev/zero of=/dev/disk1/testfile or of=/dev/disk2/testfile). They're both pretty consistently at about 21MB/s."

Consistent all around? Reads should be much faster than parity-protected writes, unless the array is degraded. dd to /dev/diskX will be slower because it's writing to both disk2 and the parity drive.
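To illustrate why a parity-protected write costs more than a read, here is a simplified Python model. It assumes single XOR parity updated by read-modify-write (read old data, read old parity, write new data, write new parity); that is my assumption about how the array behaves, not something taken from unRAID's source, and `update_parity` and `full_parity` are hypothetical helper names.

```python
def full_parity(*disks: bytes) -> bytes:
    """XOR the same byte position across every data disk."""
    parity = bytearray(len(disks[0]))
    for disk in disks:
        for i, byte in enumerate(disk):
            parity[i] ^= byte
    return bytes(parity)


def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write: remove the old data from parity, add the new data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


if __name__ == "__main__":
    # Two data disks plus parity, matching the layout in this thread.
    disk1 = bytes([0x0F, 0xF0])
    disk2 = bytes([0x33, 0x55])
    parity = full_parity(disk1, disk2)

    # Overwriting a block on disk2 takes two reads (old disk2 block, old
    # parity) and two writes (new disk2 block, new parity) in this model.
    new_disk2 = bytes([0xAA, 0x01])
    parity = update_parity(disk2, new_disk2, parity)
    disk2 = new_disk2

    # The incremental update must agree with recomputing parity from scratch.
    assert parity == full_parity(disk1, disk2)
    print("parity consistent after read-modify-write")
```

Under this model each logical write turns into four disk operations spread over two drives, while a read touches only the data disk, which is consistent with writes sitting well below read speed.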