drawde (Author) Posted April 23, 2015
[quoted post]
It sounds like you don't have stable power initially, but it's eventually settling down -- and then it may on occasion have spikes that cause your reboot issue. Could be the power supply; or could be power regulation on the motherboard. Look at the motherboard VERY carefully (with a flashlight) to see if there are any signs of leaking or bulging capacitors.
[end of quoted post]
no visible bulging or leaking. could it be that my power supply doesn't have enough power at boot? i have, i wanna say, 14 drives in there including cache and parity. about half are green drives but i know i have a few regular 7200rpm drives in there as well.
ken-ji Posted April 23, 2015
That's likely it. Unless you have staggered spin-up, all those drives will do a number on your PSU during startup.
garycase Posted April 23, 2015
A quality 650w unit should be plenty for 14 drives; but the PSU you have is fairly low-end, and is a bit less than that (620w). It certainly sounds like this is likely what's causing the multiple boot attempts ... until all the drives are spun up, there's simply not enough power for the system. ... the occasional reboots may also be associated with drive spin-ups; or it may simply be that you have a rail that's having random stability issues (likely a result of the problems related to initial boots).
I'd think the Corsair HX unit I recommended earlier would be a good upgrade that will likely resolve your issue. If you want a bit more "headroom", go with a slightly higher power version: http://www.newegg.com/Product/Product.aspx?Item=N82E16817139084&cm_re=Corsair_750w-_-17-139-084-_-Product
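To put rough numbers on the spin-up concern raised in the posts above, here is a minimal back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration (about 2 A on the 12 V rail per 3.5" drive during spin-up, about 6 W per drive once spinning, 100 W for the rest of the system); none of these values come from the thread or from any drive's datasheet, and the function names are hypothetical.

```python
# Rough 12 V spin-up budget for a drive array.
# All per-drive and system figures are illustrative assumptions, not measurements.

def spinup_load_watts(n_drives, amps_12v_per_drive=2.0, base_system_watts=100.0):
    """Estimate peak draw when every drive spins up at the same moment."""
    drive_watts = n_drives * amps_12v_per_drive * 12.0
    return drive_watts + base_system_watts


def staggered_load_watts(n_drives, group_size, amps_12v_per_drive=2.0,
                         idle_watts_per_drive=6.0, base_system_watts=100.0):
    """Worst case with staggered spin-up: the last group spins up while the rest idle."""
    spinning_up = min(group_size, n_drives)
    already_up = n_drives - spinning_up
    return (spinning_up * amps_12v_per_drive * 12.0
            + already_up * idle_watts_per_drive
            + base_system_watts)


peak = spinup_load_watts(14)
staggered = staggered_load_watts(14, group_size=4)
print(f"all 14 drives at once: {peak:.0f} W")
print(f"staggered in groups of 4: {staggered:.0f} W")
```

The absolute numbers matter less than the gap between the two cases: with these assumptions most of the simultaneous spin-up load lands on the 12 V rail at once, which is where a low-end unit is most likely to fall short of its label rating.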
drawde (Author) Posted April 23, 2015
thank you everyone. i just stopped the memtest at 8hrs with 0 errors. was gonna let it run for longer but i think we are all leaning towards PSU at this point. i think i will go for the 750w unit just to be safe.
drawde (Author) Posted May 5, 2015
To possibly help with the power issue temporarily, i removed a SATA card that did not have any drives hooked up to it yet (maybe that watt or two would make some difference with the power load). For over a week I did not experience any random reboots and the errors on my cache drive went away, so I thought the problem was gone. Today I got errors on another drive. I've tried multiple iterations of the smartctl command but cannot get a SMART report out of the redballed drive.

root@Tower:~# smartctl -a -A /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: /1:0:3:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
Physical block size: 1903784304 bytes
Lowest aligned LBA: 14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -d ata /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -A -T /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=======> INVALID ARGUMENT TO -T: /dev/sdl
=======> VALID ARGUMENTS ARE: normal, conservative, permissive, verypermissive <=======

Use smartctl -h to get a usage summary

root@Tower:~# smartctl /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

SCSI device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

root@Tower:~# smartctl -a /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: /1:0:3:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
Physical block size: 1903784304 bytes
Lowest aligned LBA: 14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

syslog:
May 5 17:34:28 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
May 5 17:34:28 Tower kernel: sas: trying to find task 0xffff880154b26200
May 5 17:34:28 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff880154b26200
May 5 17:34:28 Tower kernel: sas: sas_scsi_find_task: task 0xffff880154b26200 is aborted
May 5 17:34:28 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff880154b26200 is aborted
May 5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: cmd error handler
May 5 17:34:28 Tower kernel: sas: ata13: end_device-1:0: dev error handler
May 5 17:34:28 Tower kernel: sas: ata15: end_device-1:2: dev error handler
May 5 17:34:28 Tower kernel: sas: ata14: end_device-1:1: dev error handler
May 5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: dev error handler
May 5 17:34:28 Tower kernel: ata16.00: exception Emask 0x0 SAct 0x20000 SErr 0x0 action 0x6 frozen
May 5 17:34:28 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED
May 5 17:34:28 Tower kernel: ata16.00: cmd 60/08:00:80:b5:65/00:00:30:00:00/40 tag 17 ncq 4096 in
May 5 17:34:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 5 17:34:28 Tower kernel: sas: ata17: end_device-1:4: dev error handler
May 5 17:34:28 Tower kernel: ata16.00: status: { DRDY }
May 5 17:34:28 Tower kernel: ata16: hard resetting link
May 5 17:34:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0
May 5 17:34:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
May 5 17:34:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
May 5 17:34:31 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV
May 5 17:34:31 Tower kernel: ata16.00: revalidation failed (errno=-2)
May 5 17:34:31 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis
May 5 17:34:35 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!
May 5 17:34:36 Tower kernel: ata16: hard resetting link
May 5 17:34:41 Tower kernel: ata16.00: qc timeout (cmd 0x27)
May 5 17:34:41 Tower kernel: ata16.00: failed to read native max address (err_mask=0x4)
May 5 17:34:41 Tower kernel: ata16.00: HPA support seems broken, skipping HPA handling
May 5 17:34:41 Tower kernel: ata16.00: revalidation failed (errno=-5)
May 5 17:34:41 Tower kernel: ata16: hard resetting link
May 5 17:34:43 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0
May 5 17:34:43 Tower kernel: sas: sas_ata_task_done: SAS error 8a
May 5 17:34:43 Tower kernel: sas: sas_ata_task_done: SAS error 8a
May 5 17:34:43 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV
May 5 17:34:43 Tower kernel: ata16.00: revalidation failed (errno=-2)
May 5 17:34:43 Tower kernel: ata16.00: disabled
May 5 17:34:43 Tower kernel: ata16.00: device reported invalid CHS sector 0
May 5 17:34:43 Tower kernel: ata16: EH complete
May 5 17:34:43 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
May 5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00
May 5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] CDB:
May 5 17:34:43 Tower kernel: cdb[0]=0x28: 28 00 30 65 b5 80 00 00 08 00
May 5 17:34:43 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968
May 5 17:34:43 Tower kernel: md: disk9 read error, sector=811971904
May 5 17:34:44 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis
May 5 17:34:48 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!
May 5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00
May 5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] CDB:
May 5 17:34:53 Tower kernel: cdb[0]=0x2a: 2a 00 30 65 b5 80 00 00 08 00
May 5 17:34:53 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968
May 5 17:34:53 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968
May 5 17:34:53 Tower kernel: md: disk9 write error, sector=811971904
May 5 17:34:53 Tower kernel: md: recovery thread woken up ...
May 5 17:34:53 Tower kernel: md: recovery thread has nothing to resync

i have the corsair HX750i PSU on order and it should arrive on thursday. also this does not look like a normal drive failure, as i see the sas log entries right before the drive redballed, possibly indicating an issue with my supermicro AOC, which uses SAS breakout cables? this appears to be in line with a PSU issue, so hopefully once i replace the PSU all will be good with the world!

in the meantime, what do you guys recommend i do? or even after it arrives?
- obviously check cabling, replace sata cable as i have spares anyway, get smart report (hopefully)
- attempt to rebuild drive??
- order and start preclearing new drive??
- leave as is until new PSU is in?
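The `cdb[0]=0x28` and `cdb[0]=0x2a` lines in the syslog above are raw SCSI command bytes, and decoding them shows which block the failed read and write were aimed at. The sketch below assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); `decode_rw10` is a hypothetical helper name, not part of any tool mentioned in the thread.

```python
def decode_rw10(cdb_hex):
    """Decode a SCSI READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if len(cdb) != 10 or cdb[0] not in opcodes:
        raise ValueError("not a 10-byte READ/WRITE CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return opcodes[cdb[0]], lba, blocks


# CDB bytes copied from the 17:34:43 and 17:34:53 log entries
print(decode_rw10("28 00 30 65 b5 80 00 00 08 00"))
print(decode_rw10("2a 00 30 65 b5 80 00 00 08 00"))
```

Both commands decode to LBA 811971968 with a length of 8 blocks, which matches the `sector 811971968` in the `blk_update_request` lines and the 4096-byte transfer in the `ncq 4096 in` line. The md layer reports sector 811971904, which is 64 lower; that would be consistent with the array partition starting at sector 64, though that is an inference rather than something the log states.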
dgaschk Posted May 5, 2015
[quoted post]
hey guys, it seems the plot thickens.. i am now on beta15, current uptime about 1d 17h. no reboots yet. last night i had some errors on my cache drive. after some googling it says this can also indicate a power issue. unfortunately i have not yet had a chance to run memtest over night, possibly tonight if i get a chance. anyways, what would you guys do first if you were me if the memtest comes up clean? new PSU? new cache drive? no drives are redballed in the gui.

Apr 21 19:05:45 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
Apr 21 19:05:45 Tower kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Apr 21 19:05:45 Tower kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
Apr 21 19:05:45 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Apr 21 19:05:45 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 12
Apr 21 19:05:45 Tower kernel: res 40/00:5c:5f:1f:ee/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 21 19:05:45 Tower kernel: ata4.00: status: { DRDY }
Apr 21 19:05:45 Tower kernel: ata4: hard resetting link
Apr 21 19:05:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 21 19:05:51 Tower kernel: ata4.00: configured for UDMA/133
Apr 21 19:05:51 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x50
Apr 21 19:05:51 Tower kernel: ata4: EH complete
[end of quoted post]
10B8B is a SATA error. Bad or loose cable, bad or dirty SATA port.
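The `10B8B` flag mentioned above is one bit of the `SErr 0x90a02` value in the quoted log, and the kernel's `SError: { ... }` line is simply that value expanded bit by bit. A minimal sketch of the expansion follows. The five names that appear in the log are confirmed by it; the remaining names and bit positions in the table are given as I recall them from the Linux libata driver and the SATA SError register layout, so treat those as assumptions.

```python
# Bit position -> label for the SATA SError register.
# Labels follow the kernel's log wording; entries not seen in the thread's
# log are recalled from libata and should be treated as assumptions.
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the labels of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


print(decode_serror(0x90a02))
```

For `0x90a02` this yields the same five flags the kernel printed. `PHYRdyChg` and `10B8B` both describe trouble at the link level (the PHY dropping out and a decoding error on the wire), which fits the cable or port explanation given in the reply.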
drawde (Author) Posted May 6, 2015
[quoted post]
10B8B is a SATA error. Bad or loose cable, bad or dirty SATA port.
[end of quoted post]
thank you dgaschk. this cache drive issue has not returned since moving the drive off a sata card to another port and removing that sata card to hopefully help with the power issue temporarily. the new issue, 2 posts back, is with my disk 9: i was unable to get a SMART report, and the disk is currently disabled.

EDIT: OK after a reboot (as suggested in http://lime-technology.com/forum/index.php?topic=36763.0) i am now able to get a SMART report for that drive.

root@Tower:~# smartctl -a -A /dev/sdl
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EZRX-00D8PB0
Serial Number: WD-WMC4N2459216
LU WWN Device Id: 5 0014ee 003c62b39
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue May 5 20:56:44 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (25200) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
  Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 181   177   021    Pre-fail Always  -           5933
  4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1060
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 091   091   000    Old_age  Always  -           7034
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           37
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           18
193 Load_Cycle_Count        0x0032 191   191   000    Old_age  Always  -           27973
194 Temperature_Celsius     0x0022 122   113   000    Old_age  Always  -           28
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           4
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

what do you guys think about this? i don't see any reallocated sectors or pending sectors. as far as i can tell it looks OK. should i let unraid rebuild this drive? should i leave it as-is (simplex kinda) and wait for my PSU to come in?
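The eyeball check described above (no reallocated or pending sectors) can be sketched as a small filter over the attribute table. The choice of attributes to watch is my own assumption about what is commonly checked, not a rule from smartmontools or unRAID, and `nonzero_watched` is a hypothetical helper; the sample rows are copied from the report in this post.

```python
# Attributes whose raw value is normally expected to stay at zero.
# This watch list is an illustrative assumption, not an official rule.
WATCH = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

REPORT = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 0
199 UDMA_CRC_Error_Count    0x0032 200 200 000 Old_age  Always  - 4
"""


def nonzero_watched(report):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    hits = {}
    for line in report.splitlines():
        fields = line.split()
        # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCH:
            raw = int(fields[9])
            if raw:
                hits[fields[1]] = raw
    return hits


print(nonzero_watched(REPORT))
```

On this report the only hit is `UDMA_CRC_Error_Count` with a raw value of 4. CRC errors are counted on the link between drive and controller, so a small nonzero count generally points at cabling or connectors rather than the platters, which lines up with the cable and power suspicions earlier in the thread.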
drawde (Author) Posted May 6, 2015
i moved disk9 to a new slot and am currently letting unraid rebuild that drive. will report back.
drawde (Author) Posted May 8, 2015
disk9 green again after rebuild and new psu in. will monitor for a bit before marking this solved. thank you everybody!!