random reboots 14b

drawde · April 23, 2015

It sounds like you don't have stable power initially, but it's eventually settling down -- and then it may on occasion have spikes that cause your reboot issue.

Could be the power supply; or could be power regulation on the motherboard. Look at the motherboard VERY carefully (with a flashlight) to see if there are any signs of leaking or bulging capacitors.

no visible bulging or leaking. could it be that my power supply doesn't have enough power at boot? i have, i wanna say, 14 drives in there including cache and parity. about half are green drives but i know i have a few regular 7200rpm drives in there as well.

ken-ji · April 23, 2015

That's likely it. Unless you have staggered spin-up, all those drives will do a number on your PSU during startup.

garycase · April 23, 2015

A quality 650w unit should be plenty for 14 drives; but the PSU you have is fairly low-end, and is a bit less than that (620w). It certainly sounds like this is likely what's causing the multiple boot attempts ... until all the drives are spun up, there's simply not enough power for the system.

... the occasional reboots may also be associated with drive spin-ups; or it may simply be that you have a rail that's having random stability issues (likely a result of the problems related to initial boots).
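For a rough feel of why simultaneous spin-up matters, here is a back-of-the-envelope sketch. Every figure in it is an assumption for illustration (about 2 A on the 12 V rail per 3.5" drive during spin-up, about 0.4 A once spinning), not a measurement from this system; the drive datasheets and the PSU label give the real numbers.

```python
# Rough 12 V spin-up budget. All per-drive figures are illustrative
# assumptions, not measurements from the system discussed in this thread.

def spinup_load_watts(drive_count, amps_per_drive=2.0, rail_volts=12.0):
    """Peak 12 V load if every drive spins up at the same moment."""
    return drive_count * amps_per_drive * rail_volts


def staggered_load_watts(drive_count, group_size, amps_per_drive=2.0,
                         idle_amps=0.4, rail_volts=12.0):
    """Worst-case 12 V load when only `group_size` drives spin up at once.

    The worst moment is the last group: it is spinning up while all the
    earlier drives are already running at their (assumed) idle current.
    """
    last_group = min(group_size, drive_count)
    already_spinning = drive_count - last_group
    amps = last_group * amps_per_drive + already_spinning * idle_amps
    return amps * rail_volts


drives = 14  # drive count mentioned in the thread
print(f"all at once:        {spinup_load_watts(drives):.0f} W on the 12 V rail")
print(f"staggered, 2 a time: {staggered_load_watts(drives, 2):.0f} W on the 12 V rail")
```

Under those assumptions, 14 drives starting together pull 336 W from the 12 V rail before the CPU, motherboard, and controller cards are counted, which is why a marginal unit can struggle at boot even if its label wattage looks adequate.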

I'd think the Corsair HX unit I recommended earlier would be a good upgrade that will likely resolve your issue. If you want a bit more "headroom", go with a slightly higher power version:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817139084&cm_re=Corsair_750w-_-17-139-084-_-Product

drawde · April 23, 2015

thank you everyone. i just stopped the memtest at 8hrs with 0 errors. was gonna let it run for longer but i think we are all leaning towards PSU at this point. i think i will go for the 750w unit just to be safe.

drawde · May 5, 2015

To possibly help with the power issue temporarily, I removed a SATA card that did not have any drives hooked up to it yet (maybe saving a watt or two would make some difference to the power load). For over a week I did not experience any random reboots, and the errors on my cache drive went away, so I thought the problem was gone. Today I got errors on another drive.

I've tried multiple variations of the smartctl command but cannot get a SMART report out of the redballed drive.

root@Tower:~# smartctl -a -A /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Vendor: /1:0:3:0

Product:

User Capacity: 600,332,565,813,390,450 bytes [600 PB]

Logical block size: 774843950 bytes

Physical block size: 1903784304 bytes

Lowest aligned LBA: 14896

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

>> Terminate command early due to bad response to IEC mode page

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -d ata /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -A -T /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

=======> INVALID ARGUMENT TO -T: /dev/sdl

=======> VALID ARGUMENTS ARE: normal, conservative, permissive, verypermissive <=======

Use smartctl -h to get a usage summary

root@Tower:~# smartctl /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

SCSI device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

root@Tower:~# smartctl -a /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Vendor: /1:0:3:0

Product:

User Capacity: 600,332,565,813,390,450 bytes [600 PB]

Logical block size: 774843950 bytes

Physical block size: 1903784304 bytes

Lowest aligned LBA: 14896

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

>> Terminate command early due to bad response to IEC mode page

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

syslog:

May 5 17:34:28 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

May 5 17:34:28 Tower kernel: sas: trying to find task 0xffff880154b26200

May 5 17:34:28 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff880154b26200

May 5 17:34:28 Tower kernel: sas: sas_scsi_find_task: task 0xffff880154b26200 is aborted

May 5 17:34:28 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff880154b26200 is aborted

May 5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: cmd error handler

May 5 17:34:28 Tower kernel: sas: ata13: end_device-1:0: dev error handler

May 5 17:34:28 Tower kernel: sas: ata15: end_device-1:2: dev error handler

May 5 17:34:28 Tower kernel: sas: ata14: end_device-1:1: dev error handler

May 5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: dev error handler

May 5 17:34:28 Tower kernel: ata16.00: exception Emask 0x0 SAct 0x20000 SErr 0x0 action 0x6 frozen

May 5 17:34:28 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED

May 5 17:34:28 Tower kernel: ata16.00: cmd 60/08:00:80:b5:65/00:00:30:00:00/40 tag 17 ncq 4096 in

May 5 17:34:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

May 5 17:34:28 Tower kernel: sas: ata17: end_device-1:4: dev error handler

May 5 17:34:28 Tower kernel: ata16.00: status: { DRDY }

May 5 17:34:28 Tower kernel: ata16: hard resetting link

May 5 17:34:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0

May 5 17:34:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May 5 17:34:31 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV

May 5 17:34:31 Tower kernel: ata16.00: revalidation failed (errno=-2)

May 5 17:34:31 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis

May 5 17:34:35 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!

May 5 17:34:36 Tower kernel: ata16: hard resetting link

May 5 17:34:41 Tower kernel: ata16.00: qc timeout (cmd 0x27)

May 5 17:34:41 Tower kernel: ata16.00: failed to read native max address (err_mask=0x4)

May 5 17:34:41 Tower kernel: ata16.00: HPA support seems broken, skipping HPA handling

May 5 17:34:41 Tower kernel: ata16.00: revalidation failed (errno=-5)

May 5 17:34:41 Tower kernel: ata16: hard resetting link

May 5 17:34:43 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0

May 5 17:34:43 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May 5 17:34:43 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV

May 5 17:34:43 Tower kernel: ata16.00: revalidation failed (errno=-2)

May 5 17:34:43 Tower kernel: ata16.00: disabled

May 5 17:34:43 Tower kernel: ata16.00: device reported invalid CHS sector 0

May 5 17:34:43 Tower kernel: ata16: EH complete

May 5 17:34:43 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

May 5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00

May 5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] CDB:

May 5 17:34:43 Tower kernel: cdb[0]=0x28: 28 00 30 65 b5 80 00 00 08 00

May 5 17:34:43 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968

May 5 17:34:43 Tower kernel: md: disk9 read error, sector=811971904

May 5 17:34:44 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis

May 5 17:34:48 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!

May 5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00

May 5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] CDB:

May 5 17:34:53 Tower kernel: cdb[0]=0x2a: 2a 00 30 65 b5 80 00 00 08 00

May 5 17:34:53 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968

May 5 17:34:53 Tower kernel: md: disk9 write error, sector=811971904

May 5 17:34:53 Tower kernel: md: recovery thread woken up ...

May 5 17:34:53 Tower kernel: md: recovery thread has nothing to resync
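The `cdb[0]=0x28` and `cdb[0]=0x2a` lines in the log above can be decoded by hand to see exactly which request failed. A minimal sketch, assuming the standard 10-byte SCSI READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte big-endian block count, control); the function name is made up for illustration:

```python
# Decode a 10-byte SCSI READ(10)/WRITE(10) CDB as printed in the syslog.
# decode_cdb10 is a hypothetical helper, not part of any tool in the thread.

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(hex_bytes):
    """Return (operation, starting LBA, block count) for a 10-byte CDB."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return OPCODES[cdb[0]], lba, blocks


print(decode_cdb10("28 00 30 65 b5 80 00 00 08 00"))
# ('READ(10)', 811971968, 8)
print(decode_cdb10("2a 00 30 65 b5 80 00 00 08 00"))
# ('WRITE(10)', 811971968, 8)
```

The decoded LBA matches the `sector 811971968` in the `blk_update_request` lines, and 8 blocks of 512 bytes is the 4096-byte transfer shown in the `ncq 4096 in` line. The md layer reports 811971904, which is 64 sectors lower; that would be consistent with a partition starting at sector 64, though that is an inference rather than something the log states.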

i have the corsair HX750i PSU on order and it should arrive on thursday.

also this does not look like a normal drive failure, as i see the sas log entries right before the drive redballed, possibly indicating an issue with my supermicro AOC, which uses SAS breakout cables? this appears to be in line with a PSU issue so hopefully once i replace the PSU all will be good with the world!

in the meantime, what do you guys recommend i do? or even after it arrives?

- obviously check cabling, replace sata cable as i have spares anyway, get smart report (hopefully)

- attempt to rebuild drive??

- order and start preclearing new drive??

- leave as is until new PSU is in?

dgaschk · May 5, 2015

hey guys, it seems the plot thickens.. i am now on beta15, current uptime about 1d 17h. no reboots yet.

last night i had some errors on my cache drive. after some googling it says this can also indicate a power issue. unfortunately i have not yet had a chance to run memtest over night, possibly tonight if i get a chance.

anyways, what would you guys do first if you were me if the memtest comes up clean? new PSU? new cache drive? no drives are redballed in the gui.
Apr 21 19:05:45 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
Apr 21 19:05:45 Tower kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Apr 21 19:05:45 Tower kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
Apr 21 19:05:45 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Apr 21 19:05:45 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 12
Apr 21 19:05:45 Tower kernel:         res 40/00:5c:5f:1f:ee/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 21 19:05:45 Tower kernel: ata4.00: status: { DRDY }
Apr 21 19:05:45 Tower kernel: ata4: hard resetting link
Apr 21 19:05:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 21 19:05:51 Tower kernel: ata4.00: configured for UDMA/133
Apr 21 19:05:51 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x50
Apr 21 19:05:51 Tower kernel: ata4: EH complete

10B8B is a SATA error. Bad or loose cable, bad or dirty SATA port.
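The names inside `SError: { ... }` are the kernel's labels for individual bits of the `SErr` value on the exception line. A small sketch for decoding the hex value; the bit-to-name table is my understanding of what the Linux libata driver prints, so treat it as an assumption and check it against the kernel source for your version:

```python
# Map SError register bits to the short names libata prints in the syslog.
# The table reflects the libata naming as I understand it (an assumption).

SERROR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


print(decode_serror(0x90a02))
# ['RecovComm', 'Persist', 'HostInt', 'PHYRdyChg', '10B8B']
```

The result for `0x90a02` matches the names in the quoted log line. 10B8B refers to a 10-bit-to-8-bit decode error on the link, and PHYRdyChg to the link dropping and coming back, both of which point at the physical connection rather than the platters.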

drawde · May 6, 2015

thank you dgaschk. this cache drive issue has not returned since moving off a sata card to another port and removing that sata card to hopefully help with power issue temporarily. the new issue is 2 posts back for my drive 9. unable to get SMART report, and disk is currently disabled.

EDIT: OK after a reboot (as suggested in http://lime-technology.com/forum/index.php?topic=36763.0) i am now able to get a SMART report for that drive.

root@Tower:~# smartctl -a -A /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)

Device Model: WDC WD20EZRX-00D8PB0

Serial Number: WD-WMC4N2459216

LU WWN Device Id: 5 0014ee 003c62b39

Firmware Version: 80.00A80

User Capacity: 2,000,398,934,016 bytes [2.00 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: 5400 rpm

Device is: In smartctl database [for details use: -P show]

ATA Version is: ACS-2 (minor revision not indicated)

SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is: Tue May 5 20:56:44 2015 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (25200) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: ( 5) minutes.

SCT capabilities: (0x7035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       5933
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1060
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7034
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       27973
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

what do you guys think about this? i don't see any reallocated sectors or pending sectors. as far as i can tell it looks OK.. should i let unraid rebuild this drive? should i leave it as-is (simplex kinda) and wait for my PSU to come in?
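For a quick check of a report like the one above, the handful of attributes people usually look at first can be pulled out with a few lines of Python. This is only a sketch: the choice of attributes to watch (5, 197, 198, 199) is a common rule of thumb rather than official guidance, and the function name is made up for illustration.

```python
# Flag non-zero raw values for a few commonly watched SMART attributes.
# WATCHED is a rule-of-thumb selection (an assumption), not an official list.

WATCHED = {
    5: "reallocated sectors",
    197: "sectors pending reallocation",
    198: "offline uncorrectable sectors",
    199: "interface CRC errors (often cabling)",
}


def flag_attributes(report_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 4
"""
print(flag_attributes(sample))
# {'UDMA_CRC_Error_Count': 4}
```

On the rows copied from the report, the only attribute flagged is UDMA_CRC_Error_Count with a raw value of 4. CRC errors are counted on the link between drive and controller, so this fits the cabling and power theory better than a failing disk surface.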

drawde · May 6, 2015

i moved disk9 to a new slot and am currently letting unraid rebuild that drive. will report back.

drawde · May 8, 2015

disk9 green again after rebuild and new psu in. will monitor for a bit before marking this solved. thank you everybody!!
