msobon Posted October 25, 2012
Hi guys, I need some expert opinion on what's going on. I started a preclear on 4 drives, and 2 of the 4 preclears ended early. Below is a snippet of the syslog, and here is a link to the full syslog: http://www.sobon.ca/Unraid/syslog-2012-10-25.txt
Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ========================================================================1.13 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == invoked as: ./preclear_disk.sh -A /dev/sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == ST2000DL003-9VT166 5YD5VCJP (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Disk /dev/sdl has been successfully precleared (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == with a starting sector of 64 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Ran 1 cycle (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Using :Read block size = Bytes (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Last Cycle's Pre Read Time : 3:41:00 (150 MB/s) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Last Cycle's Zeroing time : 0:00:27 ( MB/s) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Last Cycle's Post Read Time : ( MB/s) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Last Cycle's Total Time : (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Total Elapsed Time 3:41:28 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Disk Start Temperature: 28C (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == Current (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 
============================================================================ (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ** Changed attributes in files: /tmp/smart_start_sdl /tmp/smart_finish_sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ATTRIBUTE NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Raw_Read_Error_Rate = 111 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Spin_Up_Time = 90 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Start_Stop_Count = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Reallocated_Sector_Ct = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Seek_Error_Rate = 68 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Power_On_Hours = 94 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Spin_Retry_Count = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Power_Cycle_Count = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Runtime_Bad_Block = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: End-to-End_Error = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Reported_Uncorrect = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Command_Timeout = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: High_Fly_Writes = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Airflow_Temperature_Cel = 72 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: G-Sense_Error_Rate = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Power-Off_Retract_Count = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Load_Cycle_Count = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Temperature_Celsius = 28 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Hardware_ECC_Recovered = 29 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Current_Pending_Sector = 100 ok (Misc) Oct 25 17:41:46 Tower 
preclear_disk-diff[1903]: Offline_Uncorrectable = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: UDMA_CRC_Error_Count = 200 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Head_Flying_Hours = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Total_LBAs_Written = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Total_LBAs_Read = 100 ok (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: No SMART attributes are FAILING_NOW (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 0 sectors were pending re-allocation before the start of the preclear. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: a change of 0 in the number of sectors pending re-allocation. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 0 sectors had been re-allocated before the start of the preclear. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: a change of 0 in the number of sectors re-allocated. 
(Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART overall-health status = (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ============================================================================ (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ============================================================================ (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == S.M.A.R.T Initial Report for /dev/sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Disk: /dev/sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: === START OF INFORMATION SECTION === (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Device Model: ST2000DL003-9VT166 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Serial Number: 5YD5VCJP (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Firmware Version: CC3C (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: User Capacity: 2,000,398,934,016 bytes (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Device is: Not in smartctl database [for details use: -P showall] (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ATA Version is: 8 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ATA Standard is: ATA-8-ACS revision 4 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Local Time is: Thu Oct 25 14:00:45 2012 EDT (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART support is: Available - device has SMART capability. 
(Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART support is: Enabled (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: === START OF READ SMART DATA SECTION === (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART overall-health self-assessment test result: PASSED (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: General SMART Values: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Offline data collection status: (0x82)^IOffline data collection activity (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^Iwas completed without error. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^IAuto Offline Data Collection: Enabled. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Self-test execution status: ( 0)^IThe previous self-test routine completed (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^Iwithout error or no self-test has ever (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^Ibeen run. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Total time to complete Offline (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: data collection: ^I^I ( 612) seconds. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Offline data collection (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: capabilities: ^I^I^I (0x7b) SMART execute Offline immediate. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^IAuto Offline data collection on/off support. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISuspend Offline collection upon new (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^Icommand. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^IOffline surface scan supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISelf-test supported. 
(Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^IConveyance Self-test supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISelective Self-test supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART capabilities: (0x0003)^ISaves SMART data before entering (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^Ipower-saving mode. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISupports SMART auto save timer. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Error logging capability: (0x01)^IError logging supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^IGeneral Purpose Logging supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Short self-test routine (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: recommended polling time: ^I ( 1) minutes. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Extended self-test routine (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: recommended polling time: ^I ( 255) minutes. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Conveyance self-test routine (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: recommended polling time: ^I ( 2) minutes. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SCT capabilities: ^I (0x30b7)^ISCT Status supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISCT Feature Control supported. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ^I^I^I^I^ISCT Data Table supported. 
(Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART Attributes Data Structure revision number: 10 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Vendor Specific SMART Attributes with Thresholds: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 30108296 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 3 Spin_Up_Time 0x0003 090 085 000 Pre-fail Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 29 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 7 Seek_Error_Rate 0x000f 068 060 030 Pre-fail Always - 7009913 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5279 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 21 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 190 Airflow_Temperature_Cel 0x0022 072 056 045 
Old_age Always - 28 (Min/Max 24/30) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 29 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 194 Temperature_Celsius 0x0022 028 044 000 Old_age Always - 28 (0 20 0 0) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 195 Hardware_ECC_Recovered 0x001a 029 011 000 Old_age Always - 30108296 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 77227806953798 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2724079113 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3759955724 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART Error Log Version: 1 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: No Errors Logged (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART Self-test log structure revision number 1 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: No self-tests have been logged. 
[To run self-tests, use: smartctl -t] (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SMART Selective self-test log data structure revision number 1 (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 1 0 0 Not_testing (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 2 0 0 Not_testing (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 3 0 0 Not_testing (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 4 0 0 Not_testing (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: 5 0 0 Not_testing (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Selective self-test flags (0x0): (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: After scanning selected spans, do NOT read-scan remainder of disk. (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: If Selective self-test is pending on power-up, resume after 0 minute delay. 
(Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ============================================================================ (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc) Oct 25 17:41:46 Tower last message repeated 2 times Oct 25 17:41:46 Tower preclear_disk-diff[1903]: ============================================================================ (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == S.M.A.R.T Final Report for /dev/sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: == (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Disk: /dev/sdl (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net (Misc) Oct 25 17:41:46 Tower preclear_disk-diff[1903]: (Misc)
Background: it's a virtualized unRAID 5.0-rc8a running on ESXi 4.1 with 3 x AOC-SASLP-MV8 cards using PCI passthrough. The system ran stable until one of my drives died, which led me to a full rebuild, as more than one drive seems affected.
dgaschk Posted October 26, 2012
The drive 5YD5VCJP has pending sectors, and preclear should be repeated until the pending-sector count goes to zero.
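One way to check the pending-sector count directly, as a minimal sketch: the function below reads `smartctl -A` style attribute rows on stdin and prints the raw value of Current_Pending_Sector (attribute 197). The function name `pending_sectors` is made up for illustration and is not part of the preclear script; it assumes the 10-column attribute layout shown in the SMART report above.

```shell
# Hypothetical helper (not part of preclear_disk.sh): print the raw value of
# SMART attribute 197 (Current_Pending_Sector) from "smartctl -A" output on stdin.
pending_sectors() {
    # Attribute rows have 10 columns; RAW_VALUE is the tenth.
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage on a live system (the device name is only an example):
#   smartctl -A /dev/sdl | pending_sectors
```

A non-zero result after a preclear cycle would be the cue to run another cycle, per the advice above.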
Joe L. Posted October 26, 2012
The syslog "diff" stuff you posted is very difficult to wade through... I cannot determine from it why you think the preclear failed. To help you, please zip and attach the reports for those two disks from the /boot/preclear_reports folder. For each disk there is a SMART report from before the preclear, a SMART report from after, and a summary report. For me to spend ANY time, I need to see the three reports.
From your syslog, it appears that /dev/sdl has stopped responding to the disk controller. No SMART report is being produced. (Could be a bad power connection, a bad SATA connection, a drive that has failed, a disk controller port that has failed, a bad cable to the disk, or a power supply inadequate to power all disks.)
Joe L.
msobon Posted October 26, 2012
Here is the link to the reports. I think the preclears failed for two reasons: one is that they only ran for approx. 2 hours on 2 TB drives; the other is that they didn't get past the first phase. http://www.sobon.ca/Unraid/preclear_reports.rar
msobon Posted October 26, 2012
To add to this, those drives now appear to have just disappeared, as if someone had yanked them: [screenshot]
Joe L. Posted October 26, 2012
msobon wrote: "To add to this, those drives now appear to have just disappeared, as if someone had yanked them:"
Or, as I already said, they lost power, or the disk controller died, or both disks died. Or both SATA cables pulled loose. It happens. Look up "bathtub curve" on Google. Your drives might be perfect examples, although it is unlikely for two to die at nearly the same time unless they are from the same defective manufacturing batch. It is far more likely for one to die and lock up a disk controller chipset it shares with another disk.
Also, you show 13 or 14 disks. Is your power supply capable? (What exact make/model?) Do the new disks share a power splitter or a backplane?
Joe L.
msobon Posted October 26, 2012
It's got me stumped. Yes, they share the same backplane, but there are 3 other disks on it that are unaffected. The only thing I can think of is the controller card; however, it doesn't make any sense why only the two disks have gone. The power supply should be more than adequate: it's a Supermicro triple-redundant 750 W unit, and it has pulled its weight with a lot more drives previously. I'm going to do a cold boot of the ESXi host and look over the cabling etc. It's one of those days I want to pack up my toys and go home... lol
msobon Posted October 26, 2012
Just an update: I took a look inside the box and everything checked out. The only thing I decided to change was to move the card that the failed drives were on down one slot. Since both slots are x4 it doesn't really matter; however, I remember having a "Disabling IRQ #18" issue back in the day which would grind things to a snail's pace, and I believe it was related to that particular slot. The only workaround I found was to add "noirqdebug" in syslinux.cfg; however, I don't want to explore that today. The system is powered up, the drives that appeared powered off are seen again, and the preclears are chugging away.
msobon Posted October 26, 2012
Update: the disks failed again. I took one out and will see what happens. It's like it blows up that controller, as not even temperature info is shown. It may be the controller itself, but I doubt it. I guess I'm going by process of elimination.
msobon Posted October 26, 2012
So far so good; time will tell. Judging from the screenshot above, last time things went south around the 5 hour 40 minute mark.
msobon Posted October 27, 2012
Poof, gone again... It's not the cables or the slot; what's left is the backplane or the card.
msobon Posted October 28, 2012
Update: I think I figured out that the issue is a single drive that was causing the chaos. It's the one that was going at 4 MB/s in the screenshot. Since I pulled it, so far so good. I can't believe one drive would cause so much headache.
Joe L. Posted October 28, 2012
On the 25th I said: "Your drives might be perfect examples, although unlikely for two to die at nearly the same time unless from the same defective manufacturing batch. Far more likely for one to die, and lock up a disk controller chipset they share with another disk."
Joe L.
msobon Posted October 28, 2012
Thx Joe! I'm not out of the woods yet. Now that I'm back on track, I've started getting errors in the syslog:
Oct 28 14:17:01 Tower kernel: Mem-Info: Oct 28 14:17:01 Tower kernel: DMA per-cpu: Oct 28 14:17:01 Tower kernel: CPU 0: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 1: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 2: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 3: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 4: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 5: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 6: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: CPU 7: hi: 0, btch: 1 usd: 0 Oct 28 14:17:01 Tower kernel: Normal per-cpu: Oct 28 14:17:01 Tower kernel: CPU 0: hi: 186, btch: 31 usd: 178 Oct 28 14:17:01 Tower kernel: CPU 1: hi: 186, btch: 31 usd: 104 Oct 28 14:17:01 Tower kernel: CPU 2: hi: 186, btch: 31 usd: 27 Oct 28 14:17:01 Tower kernel: CPU 3: hi: 186, btch: 31 usd: 22 Oct 28 14:17:01 Tower kernel: CPU 4: hi: 186, btch: 31 usd: 163 Oct 28 14:17:01 Tower kernel: CPU 5: hi: 186, btch: 31 usd: 41 Oct 28 14:17:01 Tower kernel: CPU 6: hi: 186, btch: 31 usd: 166 Oct 28 14:17:01 Tower kernel: CPU 7: hi: 186, btch: 31 usd: 60 Oct 28 14:17:01 Tower kernel: HighMem per-cpu: Oct 28 14:17:01 Tower kernel: CPU 0: hi: 186, btch: 31 usd: 183 Oct 28 14:17:01 Tower kernel: CPU 1: hi: 186, btch: 31 usd: 182 Oct 28 14:17:01 Tower kernel: CPU 2: hi: 186, btch: 31 usd: 178 Oct 28 14:17:01 Tower kernel: CPU 3: hi: 186, btch: 31 usd: 165 Oct 28 14:17:01 Tower kernel: CPU 4: hi: 186, btch: 31 usd: 52 Oct 28 14:17:01 Tower kernel: CPU 5: hi: 186, btch: 31 usd: 171 Oct 28 14:17:01 Tower kernel: CPU 6: hi: 186, btch: 31 usd: 50 Oct 28 14:17:01 Tower kernel: CPU 7: hi: 186, btch: 31 usd: 20 Oct 28 14:17:01 Tower kernel: active_anon:8967 inactive_anon:34 isolated_anon:0 Oct 28 14:17:01 Tower kernel: active_file:41721 inactive_file:636147 
isolated_file:0 Oct 28 14:17:01 Tower kernel: unevictable:44399 dirty:0 writeback:0 unstable:0 Oct 28 14:17:01 Tower kernel: free:5398 slab_reclaimable:10233 slab_unreclaimable:6374 Oct 28 14:17:01 Tower kernel: mapped:2245 shmem:52 pagetables:299 bounce:0 Oct 28 14:17:01 Tower kernel: DMA free:3504kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:680kB inactive_file:5988kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15780kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4260kB slab_unreclaimable:1372kB kernel_stack:80kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Oct 28 14:17:01 Tower kernel: lowmem_reserve[]: 0 869 3032 3032 Oct 28 14:17:01 Tower kernel: Normal free:6356kB min:3736kB low:4668kB high:5604kB active_anon:2080kB inactive_anon:4kB active_file:133404kB inactive_file:581028kB unevictable:120kB isolated(anon):0kB isolated(file):128kB present:890008kB mlocked:0kB dirty:0kB writeback:0kB mapped:64kB shmem:8kB slab_reclaimable:36672kB slab_unreclaimable:24124kB kernel_stack:1152kB pagetables:32kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Oct 28 14:17:01 Tower kernel: lowmem_reserve[]: 0 0 17303 17303 Oct 28 14:17:01 Tower kernel: HighMem free:11732kB min:512kB low:2836kB high:5160kB active_anon:33788kB inactive_anon:132kB active_file:32800kB inactive_file:1957432kB unevictable:177476kB isolated(anon):0kB isolated(file):0kB present:2214820kB mlocked:0kB dirty:0kB writeback:0kB mapped:8916kB shmem:200kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1164kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Oct 28 14:17:01 Tower kernel: lowmem_reserve[]: 0 0 0 0 Oct 28 14:17:01 Tower kernel: DMA: 112*4kB 2*8kB 0*16kB 1*32kB 1*64kB 5*128kB 3*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 3504kB Oct 28 14:17:01 Tower kernel: Normal: 1573*4kB 3*8kB 8*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6604kB Oct 28 14:17:01 Tower kernel: HighMem: 2217*4kB 40*8kB 11*16kB 14*32kB 10*64kB 8*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11732kB Oct 28 14:17:01 Tower kernel: 722321 total pagecache pages Oct 28 14:17:01 Tower kernel: 0 pages in swap cache Oct 28 14:17:01 Tower kernel: Swap cache stats: add 0, delete 0, find 0/0 Oct 28 14:17:01 Tower kernel: Free swap = 0kB Oct 28 14:17:01 Tower kernel: Total swap = 0kB Oct 28 14:17:01 Tower kernel: 786416 pages RAM Oct 28 14:17:01 Tower kernel: 558082 pages HighMem Oct 28 14:17:01 Tower kernel: 7748 pages reserved Oct 28 14:17:01 Tower kernel: 394509 pages shared Oct 28 14:17:01 Tower kernel: 388105 pages non-shared Oct 28 14:17:02 Tower kernel: swapper/0: page allocation failure: order:2, mode:0x4020 Oct 28 14:17:02 Tower kernel: Pid: 0, comm: swapper/0 Not tainted 3.4.11-unRAID #1 (Errors) Oct 28 14:17:02 Tower kernel: Call Trace: (Errors) Oct 28 14:17:02 Tower kernel: [<c1062eb6>] warn_alloc_failed+0xbd/0xcf (Errors) Oct 28 14:17:02 Tower kernel: [<c106372e>] __alloc_pages_nodemask+0x47c/0x4a5 (Errors) Oct 28 14:17:02 Tower kernel: [<c10637ad>] __get_free_pages+0xf/0x21 (Errors) Oct 28 14:17:02 Tower kernel: [<c10821c8>] __kmalloc+0x2c/0xf0 (Errors) Oct 28 14:17:02 Tower kernel: [<c12a0072>] pskb_expand_head+0xc1/0x224 (Errors) Oct 28 14:17:02 Tower kernel: [<c12a054b>] __pskb_pull_tail+0x41/0x21e (Errors) Oct 28 14:17:02 Tower kernel: [<c12a4421>] ? netif_skb_features+0x84/0x8e (Errors) Oct 28 14:17:02 Tower kernel: [<c12a788f>] dev_hard_start_xmit+0x213/0x31b (Errors) Oct 28 14:17:02 Tower kernel: [<c12c34f7>] ? 
ip_finish_output+0x220/0x258 (Errors) Oct 28 14:17:02 Tower kernel: [<c12b6184>] sch_direct_xmit+0x54/0x13f (Errors) Oct 28 14:17:02 Tower kernel: [<c12a7a95>] dev_queue_xmit+0xfe/0x286 (Errors) Oct 28 14:17:02 Tower kernel: [<c12c34f7>] ip_finish_output+0x220/0x258 (Errors) Oct 28 14:17:02 Tower kernel: [<c12c35c2>] ip_output+0x93/0x9a (Errors) Oct 28 14:17:02 Tower kernel: [<c12c2a5a>] ip_local_out+0x1b/0x1e (Errors) Oct 28 14:17:02 Tower kernel: [<c12c2f3e>] ip_queue_xmit+0x2ad/0x2ee (Errors) Oct 28 14:17:02 Tower kernel: [<c129e519>] ? __skb_clone+0x22/0xbf (Errors) Oct 28 14:17:02 Tower kernel: [<c12d27b6>] tcp_transmit_skb+0x4d3/0x508 (Errors) Oct 28 14:17:02 Tower kernel: [<c12d4993>] tcp_write_xmit+0x2f2/0x3ec (Errors) Oct 28 14:17:02 Tower kernel: [<c12d4ad1>] __tcp_push_pending_frames+0x18/0x6f (Errors) Oct 28 14:17:02 Tower kernel: [<c12d1481>] tcp_rcv_established+0xfa/0x575 (Errors) Oct 28 14:17:02 Tower kernel: [<c12d6978>] tcp_v4_do_rcv+0x47/0x13a (Errors) Oct 28 14:17:02 Tower kernel: [<c12d6e4e>] tcp_v4_rcv+0x3e3/0x665 (Errors) Oct 28 14:17:02 Tower kernel: [<c12bf240>] ip_local_deliver_finish+0xba/0x192 (Errors) Oct 28 14:17:02 Tower kernel: [<c12bf379>] ip_local_deliver+0x61/0x66 (Errors) Oct 28 14:17:02 Tower kernel: [<c12bef01>] ip_rcv_finish+0x23d/0x253 (Errors) Oct 28 14:17:02 Tower kernel: [<c12bf150>] ip_rcv+0x239/0x26f (Errors) Oct 28 14:17:02 Tower kernel: [<c12a5189>] __netif_receive_skb+0x223/0x259 (Errors) Oct 28 14:17:02 Tower kernel: [<c12a6767>] netif_receive_skb+0x66/0x6c (Errors) Oct 28 14:17:02 Tower kernel: [<c12a682b>] napi_skb_finish+0x1e/0x34 (Errors) Oct 28 14:17:02 Tower kernel: [<c12a6c98>] napi_gro_receive+0xe7/0xef (Errors) Oct 28 14:17:02 Tower kernel: [<f84d3c39>] e1000_receive_skb+0x46/0x4c [e1000] (Errors) Oct 28 14:17:02 Tower kernel: [<f84d43ae>] e1000_clean_rx_irq+0x2b7/0x357 [e1000] (Errors) Oct 28 14:17:02 Tower kernel: [<f84d7061>] e1000_clean+0x3c/0x198 [e1000] (Errors) Oct 28 14:17:02 Tower kernel: 
[<c12a6d6e>] net_rx_action+0x59/0x12c (Errors) Oct 28 14:17:02 Tower kernel: [<c1027056>] __do_softirq+0x6b/0xe5 (Errors) Oct 28 14:17:02 Tower kernel: [<c1026feb>] ? irq_enter+0x41/0x41 (Errors) Oct 28 14:17:02 Tower kernel: <IRQ> [<c1026e9f>] ? irq_exit+0x32/0x58 Oct 28 14:17:02 Tower kernel: [<c1003506>] ? do_IRQ+0x7c/0x90 (Errors) Oct 28 14:17:02 Tower kernel: [<c13208e9>] ? common_interrupt+0x29/0x30 (Errors) Oct 28 14:17:02 Tower kernel: [<c100820f>] ? default_idle+0x1c/0x2c (Errors) Oct 28 14:17:02 Tower kernel: [<c10083e8>] ? cpu_idle+0x4b/0x65 (Errors) Oct 28 14:17:02 Tower kernel: [<c130f8f0>] ? rest_init+0x58/0x5a (Errors) Oct 28 14:17:02 Tower kernel: [<c14547a3>] ? start_kernel+0x286/0x28b (Errors) Oct 28 14:17:02 Tower kernel: [<c14540a0>] ? i386_start_kernel+0xa0/0xa7 (Errors) Oct 28 14:17:02 Tower kernel: Mem-Info: Oct 28 14:17:02 Tower kernel: DMA per-cpu: Oct 28 14:17:02 Tower kernel: CPU 0: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 1: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 2: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 3: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 4: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 5: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 6: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 7: hi: 0, btch: 1 usd: 0 Oct 28 14:17:02 Tower kernel: Normal per-cpu: Oct 28 14:17:02 Tower kernel: CPU 0: hi: 186, btch: 31 usd: 33 Oct 28 14:17:02 Tower kernel: CPU 1: hi: 186, btch: 31 usd: 36 Oct 28 14:17:02 Tower kernel: CPU 2: hi: 186, btch: 31 usd: 155 Oct 28 14:17:02 Tower kernel: CPU 3: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 4: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 5: hi: 186, btch: 31 usd: 41 Oct 28 14:17:02 Tower kernel: CPU 6: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 7: hi: 186, btch: 31 usd: 26 Oct 28 14:17:02 Tower kernel: HighMem per-cpu: Oct 28 14:17:02 Tower kernel: CPU 0: hi: 186, 
btch: 31 usd: 23 Oct 28 14:17:02 Tower kernel: CPU 1: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 2: hi: 186, btch: 31 usd: 31 Oct 28 14:17:02 Tower kernel: CPU 3: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 4: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 5: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 6: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: CPU 7: hi: 186, btch: 31 usd: 0 Oct 28 14:17:02 Tower kernel: active_anon:9065 inactive_anon:34 isolated_anon:0 Oct 28 14:17:02 Tower kernel: active_file:41721 inactive_file:636525 isolated_file:0 Oct 28 14:17:02 Tower kernel: unevictable:44399 dirty:0 writeback:0 unstable:0 Oct 28 14:17:02 Tower kernel: free:6277 slab_reclaimable:10233 slab_unreclaimable:6374 Oct 28 14:17:02 Tower kernel: mapped:2245 shmem:52 pagetables:372 bounce:0 Oct 28 14:17:02 Tower kernel: DMA free:3504kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:680kB inactive_file:5988kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15780kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4260kB slab_unreclaimable:1372kB kernel_stack:80kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Oct 28 14:17:02 Tower kernel: lowmem_reserve[]: 0 869 3032 3032 Oct 28 14:17:02 Tower kernel: Normal free:5336kB min:3736kB low:4668kB high:5604kB active_anon:2080kB inactive_anon:4kB active_file:133404kB inactive_file:584108kB unevictable:120kB isolated(anon):0kB isolated(file):0kB present:890008kB mlocked:0kB dirty:0kB writeback:0kB mapped:64kB shmem:8kB slab_reclaimable:36672kB slab_unreclaimable:24124kB kernel_stack:1152kB pagetables:32kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Oct 28 14:17:02 Tower kernel: lowmem_reserve[]: 0 0 17303 17303 Oct 28 14:17:02 Tower kernel: HighMem free:16268kB min:512kB low:2836kB high:5160kB active_anon:34180kB inactive_anon:132kB active_file:32800kB inactive_file:1956004kB unevictable:177476kB isolated(anon):0kB isolated(file):0kB present:2214820kB mlocked:0kB dirty:0kB writeback:0kB mapped:8916kB shmem:200kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1456kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Oct 28 14:17:02 Tower kernel: lowmem_reserve[]: 0 0 0 0 Oct 28 14:17:02 Tower kernel: DMA: 112*4kB 2*8kB 0*16kB 1*32kB 1*64kB 5*128kB 3*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 3504kB Oct 28 14:17:02 Tower kernel: Normal: 1298*4kB 2*8kB 8*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5496kB Oct 28 14:17:02 Tower kernel: HighMem: 3346*4kB 108*8kB 11*16kB 14*32kB 10*64kB 8*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16792kB Oct 28 14:17:02 Tower kernel: 722704 total pagecache pages Oct 28 14:17:02 Tower kernel: 0 pages in swap cache Oct 28 14:17:02 Tower kernel: Swap cache stats: add 0, delete 0, find 0/0 Oct 28 14:17:02 Tower kernel: Free swap = 0kB Oct 28 14:17:02 Tower kernel: Total swap = 0kB Oct 28 14:17:02 Tower kernel: 786416 pages RAM Oct 28 14:17:02 Tower kernel: 558082 pages HighMem Oct 28 14:17:02 Tower kernel: 7748 pages reserved Oct 28 14:17:02 Tower kernel: 396288 pages shared Oct 28 14:17:02 Tower kernel: 387973 pages non-shared
Joe L. Posted October 28, 2012
Those look like memory allocation failures: you are running out of "low" memory.
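To see why an order:2 request can fail while some memory is still free, it can help to total the per-order free lists the kernel prints for each zone. The sketch below (the function name `buddy_free_kb` is invented for illustration) takes one such line on stdin, for example the `Normal:` line from the syslog above, and prints two numbers: the total free kB, and the part of it sitting in blocks of at least a given size. With 4 kB pages, an order:2 allocation needs one contiguous 16 kB block.

```shell
# Hypothetical helper: summarize one buddy-allocator line from a kernel
# "page allocation failure" dump, e.g.
#   Normal: 1298*4kB 2*8kB 8*16kB ... = 5496kB
# $1 = minimum block size in kB (16 for an order:2 request with 4 kB pages).
# Prints: "<total free kB> <free kB in blocks of at least $1 kB>"
buddy_free_kb() {
    awk -v min="$1" '{
        total = 0; big = 0
        for (i = 1; i <= NF; i++) {
            # Only fields shaped like COUNT*SIZEkB are free-list entries.
            if ($i ~ /^[0-9]+[*][0-9]+kB$/) {
                split($i, p, "[*]")
                size = p[2] + 0          # "+ 0" drops the "kB" suffix
                kb = p[1] * size
                total += kb
                if (size >= min) big += kb
            }
        }
        print total, big
    }'
}

# Usage (paste a zone line from the syslog):
#   echo 'Normal: 1298*4kB 2*8kB 8*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5496kB' | buddy_free_kb 16
```

For the `Normal:` line logged at 14:17:02 this gives 5496 kB free in total, of which only 288 kB sits in blocks of 16 kB or larger. Most of the free low memory is fragmented into single 4 kB pages, which is consistent with the reading above.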
msobon Posted October 29, 2012
Is there a way to address those issues? It has 3 GB assigned, and cache_dirs is not running.
Joe L. Posted October 29, 2012
msobon wrote: "Is there a way to address those issues? It has 3 GB assigned, and cache_dirs is not running."
Typically, it is "low" memory that runs out. Run fewer processes. Add more memory. Tune kernel parameters. Add a swap file.
Type free -l to see memory status.
If it only occurs when preclearing drives, use the -r -w -b options to the preclear script to limit its memory usage:
preclear_disk.sh -r 65536 -w 65536 -b 2048 /dev/sdX
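To keep an eye on low memory while several preclears run, one option is to read the LowFree figure that 32-bit highmem kernels expose in /proc/meminfo (it is absent on 64-bit kernels). This is a minimal sketch: the function names and the 8192 kB warning threshold are arbitrary choices for illustration, not unRAID recommendations.

```shell
# Hypothetical helper: print LowFree (in kB) from /proc/meminfo-style text on stdin.
low_free_kb() {
    awk '$1 == "LowFree:" { print $2 }'
}

# Hypothetical helper: report whether low memory is below a threshold.
# $1 = threshold in kB; meminfo text is read from stdin.
check_low_mem() {
    _free=$(low_free_kb)
    if [ -z "$_free" ]; then
        echo "LowFree not reported"
    elif [ "$_free" -lt "$1" ]; then
        echo "WARNING: low memory down to ${_free} kB"
    else
        echo "ok: ${_free} kB low memory free"
    fi
}

# Usage on the server, e.g. once a minute while preclears run:
#   while sleep 60; do check_low_mem 8192 < /proc/meminfo; done
```

If the warning only shows up with several preclears in flight, that would support limiting the script's buffer sizes with the options given above.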
msobon Posted October 29, 2012
Thanks Joe. And yes, it only appears to kick in while a number of preclears are running at once.