fitbrit

Members
  • Posts: 445
Everything posted by fitbrit

  1. I installed mine as a parity drive after my WD Black was found to be DOA. The parity sync is going quite slowly at 7 MBps, but that seems to be unrelated to the drive itself.
  2. Mine (the parity drive) worked for a day or two, and has been clicking like crazy since. I wasn't sure which disk was clicking, but since disabling parity there are no more clicks and everything is MUCH faster (to be expected without any parity drive, let alone a faulty one). Parity syncs/checks would crawl along at as low as 100 kB/s. I'll be RMAing it too, I guess. Sadly, I bought a second one before I used the first one. I think I'm just going to sell that one without opening it.
  3. Thanks very much!

     Feb 9 15:32:14 MediaServer kernel: ata8.01: ATA-8: WDC WD20EADS-00S2B0, 01.00A01, max UDMA/133
     Feb 9 15:32:14 MediaServer kernel: ata8.01: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
     Feb 9 15:32:14 MediaServer kernel: ata8.01: configured for UDMA/100
     Feb 9 15:32:14 MediaServer emhttp: pci-0000:02:00.0-scsi-0:0:0:0 host2 (sdi) WDC_WD20EADS-00S2B0_WD-WCAVY1606662
     Feb 9 15:32:14 MediaServer emhttp: pci-0000:02:00.0-scsi-0:1:0:0 host2 (sdj) WDC_WD20EADS-00S2B0_WD-WCAVY1937974
  4. Thanks, Rob. I think I may have done us all a disservice: the ata8 drive is probably not the cause of my bigger problems. The drive ending in 7974 was only put into the port-multiplying enclosure for that particular reboot; it was being precleared and put into service as disk 12, to replace one of the missing drives. I've learned a lot from your analysis of the syslog. Doing what you suggested in unmenu seems to show that the parity drive itself is the worst of the bunch. It's just a few weeks old, but it's one of the problematic Seagates that need a firmware upgrade. A screenshot is included. I'm also not enjoying the dark colour of some of the other drives.
  5. I have two hot-swap 5-bay enclosures (disks 11-20), but currently I've removed disks 12 and 13. All my drives have their serial numbers written and taped on the sides, so it's not hard to know which disk is where. The problem is that I know each disk's serial number, sd(x) designation and md(y) designation; it's the ata(z) reference that's eluding me. Part one of the syslog is attached (everything up to login); more available if needed. UnRAIDSyslogFeb9-11a.txt
  6. Hopefully this is a simple question, but so far I've not found a way... unless I watch the unRAID boot-up messages, jot them down quickly and hope for the best. One of my disks - ATA8 - is continually showing error messages in the unmenu syslog snapshots, e.g.:

     Feb 9 15:43:17 MediaServer kernel: ata8.01: status: { DRDY DF }
     Feb 9 15:43:17 MediaServer kernel: ata8.01: hard resetting link
     Feb 9 15:43:17 MediaServer kernel: ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
     Feb 9 15:43:17 MediaServer kernel: ata8.01: configured for UDMA/33
     Feb 9 15:43:17 MediaServer kernel: ata8: EH complete

     Also, one of my disks - I don't know which - is clicking a lot during parity syncs and checks, which slow to a crawl while the clicking is happening. While there are clicks, there are numerous errors attributed to disk ATA8. I'd like to know which physical drive that is so I can change the cables, copy the data off the disk, replace it, etc. Is there a simple way to determine which one it is? I'm running 4.6 rc5 and attached is a screenshot from unmenu. Thanks for any help. [EDIT] From what I recall of the boot-up spew, it may be one of the 1 TB WD EACS drives, but even then, there are four of them (disks 14-17).
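One way to answer the ata(z)-to-sd(x) question above without rebooting is to resolve each disk's sysfs path, which contains the kernel's ataN port name. This is a minimal sketch under that assumption; the helper name and the sample path are illustrative, not taken from the thread:

```shell
#!/bin/sh
# ata_of_path: extract the kernel ATA port name (e.g. "ata8") from a
# resolved sysfs block-device path.
ata_of_path() {
    echo "$1" | grep -Eo 'ata[0-9]+' | head -n 1
}

# On a live system, resolve every sd? device and print its ATA port:
#   for d in /sys/block/sd?; do
#       echo "$(basename "$d") -> $(ata_of_path "$(readlink -f "$d")")"
#   done

# Demonstration on a sample path of the shape the kernel produces:
ata_of_path "/sys/devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:0/7:0:0:0/block/sdi"
# prints: ata8
```

Combined with the serial numbers taped to the drive sides, mapping sd(x) to ata(z) this way would pin the noisy ata8 device to a physical bay.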
  7. If the whole thing can show up as one drive, can I put 20 of them into unRAID?
  8. As unRAID moves towards supporting more than 20 data drives, perhaps Tom would be interested in participating in the group buy? That might get us close to the 1000
  9. Will do, but I'm having some issues with my server/parity at the moment. I've had so many WD green drives fail, and recently bought a Seagate and a Samsung F4... only to find they have their own rather significant issues. So I'm trying Hitachi now. I used to hate how hot their 750GB 7.2k drives were reporting in unRAID. However, when I pulled the drives and now run them in an external dock, they run much cooler to the touch than a WD 7.2k drive.
  10. Thanks very much, bjp999. It kind of makes sense: I've had parity problems with several drives now on my newish motherboard - a Supermicro C2SEA, like Tom uses in his servers. I had all-locking cables, but changed the one on my parity drive when my last two parity drives had problems. When I moved those drives to data slots (after successful pre-clears, and after using them for a while in Windows without issue), they behaved fine, so taking your post into account I think it might be the motherboard port that's screwy. Bummer, because this is an RMA'd board, the first one being DOA. I have more internal ports available than I have slots in my main server cage, so I could move all the drives down one port and use one of the unused connections on my Supermicro 4x SAS board to take up the displaced configuration. Here I was thinking I wouldn't need to do drive rearrangement on this scale for a while!
  11. Thanks very much, bcbgboy. Your name sounds familiar; was it RFD? My configuration isn't that unusual for some of the expert users here. The eSATA boxes allow one to expand the array beyond the limits of one's case. If I'd known I was going to accumulate so much storage over time, I would have invested in a Norco 4224. The Seagate data is strange in that there haven't actually been that many power cycles at all. Could this be the head-parking issue with the CC34 firmware? I said it was a new drive, so under 1000 hours seems right to me too. It's such a pain to update the firmware, but I guess I'll have to try it and see if it works out. The reason I chose the Seagate was that I was getting fed up with DOA WD drives, or with them giving up the ghost a few weeks after installation. I've had six or more WD drives go/arrive bad in the past year; in fact, one of the RMA replacements was DOA too. I loved the Samsung drives I have, but now the F4 also seems to have firmware problems. My plan is to just replace the parity drive for now and rebuild parity. However, I wanted to check with some experts that that isn't a bad thing to do at this stage.
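Since the firmware revision decides whether the Seagate update applies, it's worth confirming what the drive actually reports before flashing. A small sketch (it assumes smartctl is installed; the helper name is mine, not from the thread):

```shell
#!/bin/sh
# fw_of: pull the firmware revision out of `smartctl -i` output.
fw_of() {
    grep '^Firmware Version:' | awk '{print $3}'
}

# Live use:
#   smartctl -i /dev/sdX | fw_of
# Demonstration on the line as it appears in the SMART report in this thread:
echo "Firmware Version: CC34" | fw_of
# prints: CC34
```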
  12. Part 2 of syslog... And SMART report on parity drive:

      smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
      Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

      === START OF INFORMATION SECTION ===
      Device Model:     ST32000542AS
      Serial Number:    5XW1FDFJ
      Firmware Version: CC34
      User Capacity:    2,000,398,934,016 bytes
      Device is:        Not in smartctl database [for details use: -P showall]
      ATA Version is:   8
      ATA Standard is:  ATA-8-ACS revision 4
      Local Time is:    Mon Jan 31 12:20:52 2011 EST
      SMART support is: Available - device has SMART capability.
      SMART support is: Enabled

      === START OF READ SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED

      General SMART Values:
      Offline data collection status:  (0x00) Offline data collection activity was never started.
                                              Auto Offline Data Collection: Disabled.
      Self-test execution status:      (   0) The previous self-test routine completed without error
                                              or no self-test has ever been run.
      Total time to complete Offline data collection: ( 623) seconds.
      Offline data collection capabilities: (0x73) SMART execute Offline immediate.
                                              Auto Offline data collection on/off support.
                                              Suspend Offline collection upon new command.
                                              No Offline surface scan supported.
                                              Self-test supported.
                                              Conveyance Self-test supported.
                                              Selective Self-test supported.
      SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                              Supports SMART auto save timer.
      Error logging capability:        (0x01) Error logging supported.
                                              General Purpose Logging supported.
      Short self-test routine recommended polling time:      (   1) minutes.
      Extended self-test routine recommended polling time:   ( 255) minutes.
      Conveyance self-test routine recommended polling time: (   2) minutes.
      SCT capabilities:              (0x103f) SCT Status supported.
                                              SCT Feature Control supported.
                                              SCT Data Table supported.
      SMART Attributes Data Structure revision number: 10
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
        1 Raw_Read_Error_Rate     0x000f  097   082   006    Pre-fail Always      -       43827459
        3 Spin_Up_Time            0x0003  100   100   000    Pre-fail Always      -       0
        4 Start_Stop_Count        0x0032  094   094   020    Old_age  Always      -       6607
        5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always      -       0
        7 Seek_Error_Rate         0x000f  069   060   030    Pre-fail Always      -       10150913
        9 Power_On_Hours          0x0032  099   099   000    Old_age  Always      -       981
       10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always      -       0
       12 Power_Cycle_Count       0x0032  094   094   020    Old_age  Always      -       6512
      183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always      -       0
      184 End-to-End_Error        0x0032  100   100   099    Old_age  Always      -       0
      187 Reported_Uncorrect      0x0032  001   001   000    Old_age  Always      -       8391
      188 Command_Timeout         0x0032  100   100   000    Old_age  Always      -       0
      189 High_Fly_Writes         0x003a  100   100   000    Old_age  Always      -       0
      190 Airflow_Temperature_Cel 0x0022  077   064   045    Old_age  Always      -       23 (Lifetime Min/Max 23/33)
      194 Temperature_Celsius     0x0022  023   040   000    Old_age  Always      -       23 (0 19 0 0)
      195 Hardware_ECC_Recovered  0x001a  049   040   000    Old_age  Always      -       43827459
      197 Current_Pending_Sector  0x0012  100   095   000    Old_age  Always      -       0
      198 Offline_Uncorrectable   0x0010  100   095   000    Old_age  Offline     -       0
      199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0
      240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline     -       210273008879140
      241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline     -       3018893516
      242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline     -       1466551571

      SMART Error Log Version: 1
      ATA Error Count: 8466 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]

      Powered_Up_Time is measured from power on, and printed as
      DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
      SS=sec, and sss=millisec. It "wraps" after 49.710 days.

      Error 8466 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
        When the command that caused the error occurred, the device was active or idle.
        After command completion occurred, registers were:
        ER ST SC SN CL CH DH
        -- -- -- -- -- -- --
        40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809
        Commands leading to the command that caused the error were:
        CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
        -- -- -- -- -- -- -- --  ----------------  --------------------
        25 00 00 af 6b c9 e0 00      00:00:27.714  READ DMA EXT
        27 00 00 00 00 00 e0 00      00:00:27.713  READ NATIVE MAX ADDRESS EXT
        ec 00 00 00 00 00 a0 00      00:00:27.712  IDENTIFY DEVICE
        ef 03 42 00 00 00 a0 00      00:00:27.712  SET FEATURES [set transfer mode]
        27 00 00 00 00 00 e0 00      00:00:27.688  READ NATIVE MAX ADDRESS EXT

      Error 8465 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
        When the command that caused the error occurred, the device was active or idle.
        After command completion occurred, registers were:
        ER ST SC SN CL CH DH
        -- -- -- -- -- -- --
        40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809
        Commands leading to the command that caused the error were:
        CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
        -- -- -- -- -- -- -- --  ----------------  --------------------
        25 00 00 af 6b c9 e0 00      00:00:23.929  READ DMA EXT
        27 00 00 00 00 00 e0 00      00:00:23.929  READ NATIVE MAX ADDRESS EXT
        ec 00 00 00 00 00 a0 00      00:00:23.928  IDENTIFY DEVICE
        ef 03 42 00 00 00 a0 00      00:00:23.927  SET FEATURES [set transfer mode]
        27 00 00 00 00 00 e0 00      00:00:23.903  READ NATIVE MAX ADDRESS EXT

      Error 8464 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
        When the command that caused the error occurred, the device was active or idle.
        After command completion occurred, registers were:
        ER ST SC SN CL CH DH
        -- -- -- -- -- -- --
        40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809
        Commands leading to the command that caused the error were:
        CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
        -- -- -- -- -- -- -- --  ----------------  --------------------
        25 00 00 af 6b c9 e0 00      00:00:20.176  READ DMA EXT
        27 00 00 00 00 00 e0 00      00:00:20.175  READ NATIVE MAX ADDRESS EXT
        ec 00 00 00 00 00 a0 00      00:00:20.174  IDENTIFY DEVICE
        ef 03 42 00 00 00 a0 00      00:00:20.174  SET FEATURES [set transfer mode]
        27 00 00 00 00 00 e0 00      00:00:20.150  READ NATIVE MAX ADDRESS EXT

      Error 8463 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
        When the command that caused the error occurred, the device was active or idle.
        After command completion occurred, registers were:
        ER ST SC SN CL CH DH
        -- -- -- -- -- -- --
        40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809
        Commands leading to the command that caused the error were:
        CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
        -- -- -- -- -- -- -- --  ----------------  --------------------
        25 00 00 af 6b c9 e0 00      00:00:16.412  READ DMA EXT
        27 00 00 00 00 00 e0 00      00:00:16.411  READ NATIVE MAX ADDRESS EXT
        ec 00 00 00 00 00 a0 00      00:00:16.410  IDENTIFY DEVICE
        ef 03 42 00 00 00 a0 00      00:00:16.410  SET FEATURES [set transfer mode]
        27 00 00 00 00 00 e0 00      00:00:16.386  READ NATIVE MAX ADDRESS EXT

      Error 8462 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
        When the command that caused the error occurred, the device was active or idle.
        After command completion occurred, registers were:
        ER ST SC SN CL CH DH
        -- -- -- -- -- -- --
        40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809
        Commands leading to the command that caused the error were:
        CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
        -- -- -- -- -- -- -- --  ----------------  --------------------
        25 00 00 af 6b c9 e0 00      00:00:12.665  READ DMA EXT
        27 00 00 00 00 00 e0 00      00:00:12.664  READ NATIVE MAX ADDRESS EXT
        ec 00 00 00 00 00 a0 00      00:00:12.663  IDENTIFY DEVICE
        ef 03 42 00 00 00 a0 00      00:00:12.663  SET FEATURES [set transfer mode]
        27 00 00 00 00 00 e0 00      00:00:12.637  READ NATIVE MAX ADDRESS EXT

      SMART Self-test log structure revision number 1
      No self-tests have been logged.  [To run self-tests, use: smartctl -t]

      SMART Selective self-test log data structure revision number 1
       SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
          1        0        0  Not_testing
          2        0        0  Not_testing
          3        0        0  Not_testing
          4        0        0  Not_testing
          5        0        0  Not_testing
      Selective self-test flags (0x0):
        After scanning selected spans, do NOT read-scan remainder of disk.
      If Selective self-test is pending on power-up, resume after 0 minute delay.

      Jan302011logPt2.txt
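The SMART report in post 12 is easier to read once the key attributes are filtered out: Reported_Uncorrect at 8391, together with an error log of 8466 UNC errors at the same LBA while UDMA_CRC_Error_Count stays at 0, points at the drive itself rather than cabling. A hedged sketch of such a filter (the attribute IDs are my choice of common red flags, and the function name is illustrative):

```shell
#!/bin/sh
# smart_red_flags: from `smartctl -A` output, print the attributes most
# indicative of a failing disk: 5 (reallocated sectors), 187 (reported
# uncorrectable), 197 (pending sectors) and 198 (offline uncorrectable).
watch_attrs='5|187|197|198'
smart_red_flags() {
    awk -v ids="$watch_attrs" '$1 ~ "^("ids")$" { print $2, $NF }'
}

# Live use (assumes smartctl is installed):
#   smartctl -A /dev/sdi | smart_red_flags

# Demonstration on lines lifted from the report above:
printf '%s\n' \
  "  5 Reallocated_Sector_Ct   0x0033 100 100 036 Pre-fail Always - 0" \
  "187 Reported_Uncorrect      0x0032 001 001 000 Old_age  Always - 8391" \
  "197 Current_Pending_Sector  0x0012 100 095 000 Old_age  Always - 0" \
  | smart_red_flags
# prints:
#   Reallocated_Sector_Ct 0
#   Reported_Uncorrect 8391
#   Current_Pending_Sector 0
```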
  13. I'm running 4.6 rc5. The 20-data-drive server comprises a Centurion 590 with 12 drives: parity at top, then the cache drive, followed by data drives 1-10. The bottom four drives run off a Supermicro 8-port PCIe x4 card. The remaining ten drives are in two Sans Digital 5-bay eSATA enclosures, run by a single Sil3132(?) eSATA PCIe x1 card. Currently drives 12 and 13 have been removed and were going to be replaced by bigger drives. After some recent problems, I mistakenly believed all was well because I successfully rebuilt parity with a new Seagate 2TB LP drive. However, whenever I started a parity check, lots of errors would show up very quickly. Having read about some of the problems with the model of parity drive I use, I decided to run a non-correcting parity check to completion. The results were alarming: tens if not hundreds of thousands of errors reported in the check-progress area, and tens of thousands in the main disk status area. Additionally, I'm still hearing a clicking from the main server case. At first I thought it was my cache drive, but I'm now pretty sure it's not that one. We just had a power outage which lasted longer than my UPS could support while I was out. When I returned and restarted the server, it took some time to mount the drives, and the parity check that started showed lots of errors very quickly again. Now I'm not sure what to do. I don't trust the parity, but I'm not sure whether that's due to a bad parity drive or another drive that's failing. I'm attaching the log from the unmenu readout. Any help appreciated. Jan302011logPt1.txt
  14. Yes, they should work with the firmware update. The forum is looking for someone with one of these drives willing to test unRaid 4.7 with a Samsung F4. Can you direct me to this firmware issue? What firmware is required by the F4 before use in 4.7? [EDIT] Never mind, here's a thread: http://lime-technology.com/forum/index.php?topic=9339.0
  15. No, I think I did screw it up: I used the sdx format to identify the disks to be duplicated, rather than the mdx nomenclature. From what I understand, using mdx and mdy would have kept the parity disk updated during the duplication, but the sdx and sdy identifiers did not. In any case, the parity was very badly screwed up: a parity check yielded millions of errors after just a few percent completion. Now I've got to rebuild parity after removing those drives to make it all good again. I'll get the chance to do it right very soon, as I have a number of 1TB drives to replace with 2TB ones.
  16. Well, it seems I got very lucky - mostly. I removed the two drives today and restarted a parity sync. It's moving along faster than ever before... and it turns out that the clicking drive is actually just my cache drive. Normally, after transferring a file to the cache, I don't delete the original until it's been safely moved to the array, so if the drive crashes during the next move to the array, I'll still have most of the stuff backed up. In the worst-case scenario I'd lose just 300 GB of media, which can be replaced; compared to a minimum of 1 TB if any of my other drives failed, that's a win in my book. Lesson learned, and next time I zero two drives, I'll do it properly!
  17. Thanks, Joe L. I figured that's what had happened, afterwards. I just wasn't familiar with the md nomenclature - I thought it was equivalent to the sdx and hdx forms... and I'm really used to using the sdx form from the preclear script. Oh well... live and learn. I don't know if I can complete a parity check or rebuild at a few kB per second, but I'll see what I can do. At least if I can read the data, I might be able to transfer it somewhere else. I'm importing all the yet-to-be-imported media into my J. River Media Center; that way, if the clicking disk fails, I'll see what's missing and can at least re-rip, re-download or reload everything.
  18. The mirroring finished according to my telnet window, but it didn't show properly in my web interface. I think I screwed up the mirroring: I used the device identifiers for the correct disks - sdi and sdj - but now see that I should have used mdx and mdy, where x and y are the disk numbers in the unRAID interface. Is that correct? In any case, I tried to stop and restart the array. The drives took forever to unmount, and I had to force a shutdown, since some construction work in the house required the power to be shut off. When I restarted, the array took more than half a day to mount all the disks, with lots of clicking coming from the server. A parity check showed ten million sync errors and 268 parity errors after 4% completion, and the check itself was crawling along at as low as 100 kB/s. The two mirrored drives show exactly the same free space now, though. What are my options? I'm thinking I should check the logs to see if I can identify which is the troubled disk. I'm not sure that I can trust parity now, though. If the trouble with the clicking disk only occurs when it's being written to, I may be able to transfer the data off it.
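The sd-vs-md distinction in the post above is the crux: in unRAID, writes through the /dev/mdX devices go through the parity driver, so parity is updated block by block, while writes to the raw /dev/sdX devices bypass it entirely. A sketch of composing the clone command through the md layer (the helper name, slot numbers and block size are illustrative placeholders, and the command is printed rather than executed so it can be reviewed before running):

```shell
#!/bin/sh
# build_clone_cmd: compose a dd invocation that clones one unRAID data disk
# onto another through the md devices, keeping parity valid as it writes.
# ASSUMPTION: slot numbers and block size here are illustrative placeholders.
build_clone_cmd() {
    src_slot=$1
    dst_slot=$2
    echo "dd if=/dev/md${src_slot} of=/dev/md${dst_slot} bs=2048k"
}

# Print the command for review instead of running it:
build_clone_cmd 5 12
# prints: dd if=/dev/md5 of=/dev/md12 bs=2048k
```

Printing instead of executing is deliberate: a transposed slot number in a live dd overwrites the wrong disk, so the composed command should be eyeballed against the web interface before it is pasted into a shell.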
  19. Thank you guys, and bjp especially. This is a great idea and will give us mortal users a lot of confidence in the new releases before they're out of beta, and beyond.
  20. Thx for the reply. "Sounds right" isn't quite good enough for me. I want an "is right" before I venture into this task. I'm going to give myself another year of unRAID before I can confidently make such statements!
  21. I've just started doing this, but have noticed that another one of the drives in the array is clicking loudly. I think I should be okay, though, even if it means having to replace the clicking drive once the drives have mirrored one another. If the clicking drive fails before the mirroring is done, I'll simply remove those two drives, replace the failed drive, rebuild its data and then go on as usual, right?
  22. Someone correct me if I'm wrong, but unRAID is reacting exactly as it should. You've removed one disk (7) and put it into slot 6, so disk 7 should be red-balled - it IS missing - and the slot 6 disk (former number 7) appears to be new to unRAID. Your problem is that you've been a little too cautious. Stay on the devices page and reassign all the disks: disk 7 to slot 6, disk 8 to slot 7, etc. Once you're done, you can run the initconfig procedure. I'm not an expert user, but I've done this before; still, you may want to wait until someone else confirms the above.