random reboots 14b



It sounds like you don't have stable power initially, but it's eventually settling down -- and then it may on occasion have spikes that cause your reboot issue.

 

Could be the power supply; or could be power regulation on the motherboard.    Look at the motherboard VERY carefully (with a flashlight) to see if there are any signs of leaking or bulging capacitors.

 

no visible bulging or leaking. could it be that my power supply doesn't have enough power at boot? i have, i wanna say, 14 drives in there including cache and parity. about half are green drives but i know i have a few regular 7200rpm drives in there as well.


A quality 650W unit should be plenty for 14 drives, but the PSU you have is fairly low-end and is rated a bit lower than that (620W).  That is likely what's causing the multiple boot attempts ... until all the drives are spun up, there's simply not enough power for the system.

 

... the occasional reboots may also be associated with drive spin-ups;  or it may simply be that you have a rail that's having random stability issues (likely a result of the problems related to initial boots).

 

I'd think the Corsair HX unit I recommended earlier would be a good upgrade that will likely resolve your issue.  If you want a bit more "headroom", go with a slightly higher power version:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817139084&cm_re=Corsair_750w-_-17-139-084-_-Product
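
The spin-up concern above can be put into rough numbers. This is a back-of-the-envelope sketch, not a measurement: the 2 A per drive on the 12 V rail at spin-up is an assumed figure (check each drive's label or datasheet), and the function names are made up for illustration. The result is worth comparing against the PSU's 12 V rail rating rather than its total wattage.

```python
# Rough peak load on the 12 V rail if every drive spins up at once.
# ASSUMPTION: 2.0 A per 3.5" drive at spin-up; real drives vary.

def spinup_12v_amps(n_drives, amps_per_drive=2.0):
    """Peak 12 V current in amps for n_drives spinning up together."""
    return n_drives * amps_per_drive


def spinup_12v_watts(n_drives, amps_per_drive=2.0):
    """The same load expressed in watts (12 V times amps)."""
    return 12.0 * spinup_12v_amps(n_drives, amps_per_drive)


# 14 drives, as in this build
print(spinup_12v_amps(14), "A on the 12 V rail")
print(spinup_12v_watts(14), "W before motherboard, CPU and fans")
```

With the assumed figure, 14 drives come to 28 A (336 W) on the 12 V rail during spin-up, on top of whatever the rest of the system draws.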

 

 

  • 2 weeks later...

To possibly help with the power issue temporarily, I removed a SATA card that did not have any drives hooked up to it yet (maybe those couple of watts would make some difference to the power load). For over a week I did not experience any random reboots and the errors on my cache drive went away, so I thought the problem was gone. Today I got errors on another drive.

 

I've tried multiple iterations of the smartctl command but cannot get a SMART report out of the redballed drive.

 

root@Tower:~# smartctl -a -A /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Vendor:              /1:0:3:0

Product:

User Capacity:        600,332,565,813,390,450 bytes [600 PB]

Logical block size:  774843950 bytes

Physical block size:  1903784304 bytes

Lowest aligned LBA:  14896

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

>> Terminate command early due to bad response to IEC mode page

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl  -a  -d  ata  /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

Read Device Identity failed: Input/output error

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -A -T /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=======> INVALID ARGUMENT TO -T: /dev/sdl

=======> VALID ARGUMENTS ARE: normal, conservative, permissive, verypermissive <=======

 

Use smartctl -h to get a usage summary

 

root@Tower:~# smartctl /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

SCSI device successfully opened

 

Use 'smartctl -a' (or '-x') to print SMART (and more) information

 

root@Tower:~# smartctl -a /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Vendor:              /1:0:3:0

Product:

User Capacity:        600,332,565,813,390,450 bytes [600 PB]

Logical block size:  774843950 bytes

Physical block size:  1903784304 bytes

Lowest aligned LBA:  14896

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

>> Terminate command early due to bad response to IEC mode page

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
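
The "INVALID ARGUMENT TO -T" message in the third attempt above appears because -T was given without a value, so smartctl took /dev/sdl as the tolerance level. Below is a minimal sketch of assembling the arguments with the value in place. build_smartctl_args is a hypothetical helper name, and the list of tolerance values is copied from the error message above. Note that -a already includes the attribute table, so the extra -A is not needed. Whether permissive mode returns anything useful from a drive that has dropped off the bus is not guaranteed.

```python
# Build a smartctl argument list with a valid -T tolerance value.
# build_smartctl_args is a hypothetical helper, not part of smartmontools.

VALID_TOLERANCE = ("normal", "conservative", "permissive", "verypermissive")


def build_smartctl_args(device, tolerance="permissive", device_type=None):
    """Return the argv list for a full SMART report on `device`."""
    if tolerance not in VALID_TOLERANCE:
        raise ValueError("tolerance must be one of " + ", ".join(VALID_TOLERANCE))
    args = ["smartctl", "-a", "-T", tolerance]
    if device_type is not None:
        # e.g. "ata", as tried in the second attempt above
        args += ["-d", device_type]
    args.append(device)
    return args


# Print the command line instead of running it (running needs root).
print(" ".join(build_smartctl_args("/dev/sdl")))
```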

 

 

syslog:

 

May  5 17:34:28 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

May  5 17:34:28 Tower kernel: sas: trying to find task 0xffff880154b26200

May  5 17:34:28 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff880154b26200

May  5 17:34:28 Tower kernel: sas: sas_scsi_find_task: task 0xffff880154b26200 is aborted

May  5 17:34:28 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff880154b26200 is aborted

May  5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: cmd error handler

May  5 17:34:28 Tower kernel: sas: ata13: end_device-1:0: dev error handler

May  5 17:34:28 Tower kernel: sas: ata15: end_device-1:2: dev error handler

May  5 17:34:28 Tower kernel: sas: ata14: end_device-1:1: dev error handler

May  5 17:34:28 Tower kernel: sas: ata16: end_device-1:3: dev error handler

May  5 17:34:28 Tower kernel: ata16.00: exception Emask 0x0 SAct 0x20000 SErr 0x0 action 0x6 frozen

May  5 17:34:28 Tower kernel: ata16.00: failed command: READ FPDMA QUEUED

May  5 17:34:28 Tower kernel: ata16.00: cmd 60/08:00:80:b5:65/00:00:30:00:00/40 tag 17 ncq 4096 in

May  5 17:34:28 Tower kernel:        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

May  5 17:34:28 Tower kernel: sas: ata17: end_device-1:4: dev error handler

May  5 17:34:28 Tower kernel: ata16.00: status: { DRDY }

May  5 17:34:28 Tower kernel: ata16: hard resetting link

May  5 17:34:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0

May  5 17:34:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May  5 17:34:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May  5 17:34:31 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV

May  5 17:34:31 Tower kernel: ata16.00: revalidation failed (errno=-2)

May  5 17:34:31 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis

May  5 17:34:35 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!

May  5 17:34:36 Tower kernel: ata16: hard resetting link

May  5 17:34:41 Tower kernel: ata16.00: qc timeout (cmd 0x27)

May  5 17:34:41 Tower kernel: ata16.00: failed to read native max address (err_mask=0x4)

May  5 17:34:41 Tower kernel: ata16.00: HPA support seems broken, skipping HPA handling

May  5 17:34:41 Tower kernel: ata16.00: revalidation failed (errno=-5)

May  5 17:34:41 Tower kernel: ata16: hard resetting link

May  5 17:34:43 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1431:mvs_I_T_nexus_reset for device[3]:rc= 0

May  5 17:34:43 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May  5 17:34:43 Tower kernel: sas: sas_ata_task_done: SAS error 8a

May  5 17:34:43 Tower kernel: ata16.00: both IDENTIFYs aborted, assuming NODEV

May  5 17:34:43 Tower kernel: ata16.00: revalidation failed (errno=-2)

May  5 17:34:43 Tower kernel: ata16.00: disabled

May  5 17:34:43 Tower kernel: ata16.00: device reported invalid CHS sector 0

May  5 17:34:43 Tower kernel: ata16: EH complete

May  5 17:34:43 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

May  5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00

May  5 17:34:43 Tower kernel: sd 1:0:3:0: [sdl] CDB:

May  5 17:34:43 Tower kernel: cdb[0]=0x28: 28 00 30 65 b5 80 00 00 08 00

May  5 17:34:43 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968

May  5 17:34:43 Tower kernel: md: disk9 read error, sector=811971904

May  5 17:34:44 Tower kernel: mvsas 0000:02:00.0: Phy6 : No sig fis

May  5 17:34:48 Tower kernel: sas: sas_form_port: phy6 belongs to port3 already(1)!

May  5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] UNKNOWN Result: hostbyte=0x04 driverbyte=0x00

May  5 17:34:53 Tower kernel: sd 1:0:3:0: [sdl] CDB:

May  5 17:34:53 Tower kernel: cdb[0]=0x2a: 2a 00 30 65 b5 80 00 00 08 00

May  5 17:34:53 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968

May  5 17:34:53 Tower kernel: blk_update_request: I/O error, dev sdl, sector 811971968

May  5 17:34:53 Tower kernel: md: disk9 write error, sector=811971904

May  5 17:34:53 Tower kernel: md: recovery thread woken up ...

May  5 17:34:53 Tower kernel: md: recovery thread has nothing to resync
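
As a cross-check on the log above, the CDB bytes can be decoded by hand. This sketch assumes the standard 10-byte SCSI READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); decode_cdb10 is a name made up for illustration.

```python
# Decode a 10-byte SCSI CDB as printed in the syslog "cdb[0]=..." lines.

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(hex_bytes):
    """Return opcode name, starting LBA and block count of a 10-byte CDB."""
    cdb = bytes.fromhex(hex_bytes)  # fromhex skips the spaces between bytes
    if len(cdb) != 10:
        raise ValueError("expected exactly 10 bytes")
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


print(decode_cdb10("28 00 30 65 b5 80 00 00 08 00"))
print(decode_cdb10("2a 00 30 65 b5 80 00 00 08 00"))
```

Both commands target LBA 811971968 for 8 blocks, which matches the "I/O error, dev sdl, sector 811971968" lines and the 4096-byte NCQ read (8 x 512 bytes). The md layer reports sector 811971904, which is 64 lower; that is presumably the partition's start offset on the disk, though the log itself does not say so. In other words, a read timed out and the follow-up write to the same location failed as well, after the device had already been disabled.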

 

i have the corsair HX750i PSU on order and it should arrive on thursday.

 

also this does not look like a normal drive failure, as i see the sas log entries right before the drive redballed, possibly indicating an issue with my supermicro AOC, which uses SAS breakout cables. this appears to be in line with a PSU issue, so hopefully once i replace the PSU all will be good with the world!

 

in the meantime, what do you guys recommend i do? or even after it arrives?

 

- obviously check cabling, replace sata cable as i have spares anyway, get smart report (hopefully)

- attempt to rebuild drive??

- order and start preclearing new drive??

- leave as is until new PSU is in?

 

 


hey guys, it seems the plot thickens.. i am now on beta15, current uptime about 1d 17h. no reboots yet.

 

last night i had some errors on my cache drive. some googling suggests this can also indicate a power issue. unfortunately i have not yet had a chance to run memtest overnight, possibly tonight if i get a chance.

 

anyways, what would you guys do first if you were me and the memtest comes up clean? new PSU? new cache drive? no drives are redballed in the gui.

 

Apr 21 19:05:45 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
Apr 21 19:05:45 Tower kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Apr 21 19:05:45 Tower kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
Apr 21 19:05:45 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Apr 21 19:05:45 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 12
Apr 21 19:05:45 Tower kernel:         res 40/00:5c:5f:1f:ee/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 21 19:05:45 Tower kernel: ata4.00: status: { DRDY }
Apr 21 19:05:45 Tower kernel: ata4: hard resetting link
Apr 21 19:05:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 21 19:05:51 Tower kernel: ata4.00: configured for UDMA/133
Apr 21 19:05:51 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x50
Apr 21 19:05:51 Tower kernel: ata4: EH complete

 

 

10B8B is a SATA error. Bad or loose cable, bad or dirty SATA port.
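
For reference, the flag names in the "SError: { ... }" line can be reproduced from the hex value in the exception line (SErr 0x90a02). The sketch below assumes the SATA SError bit positions and the short names the Linux libata driver prints; treat the table as an assumption to check against the kernel source, and decode_serror as a made-up helper name.

```python
# Map SATA SError register bits to the short names seen in libata logs.
# ASSUMPTION: bit positions and names as used by the Linux libata driver.

SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


print(decode_serror(0x90A02))
```

For 0x90a02 this yields RecovComm, Persist, HostInt, PHYRdyChg and 10B8B, the same set the kernel printed. PHYRdyChg and 10B8B are link-level conditions, which fits the cable or port explanation.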



 

thank you dgaschk. this cache drive issue has not returned since moving off a sata card to another port and removing that sata card to hopefully help with power issue temporarily. the new issue is 2 posts back for my drive 9. unable to get SMART report, and disk is currently disabled.

 

EDIT: OK after a reboot (as suggested in http://lime-technology.com/forum/index.php?topic=36763.0) i am now able to get a SMART report for that drive.

 

root@Tower:~# smartctl -a -A /dev/sdl

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.4-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)

Device Model:    WDC WD20EZRX-00D8PB0

Serial Number:    WD-WMC4N2459216

LU WWN Device Id: 5 0014ee 003c62b39

Firmware Version: 80.00A80

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Sizes:    512 bytes logical, 4096 bytes physical

Rotation Rate:    5400 rpm

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  ACS-2 (minor revision not indicated)

SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Tue May  5 20:56:44 2015 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (25200) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (  2) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

Conveyance self-test routine

recommended polling time:        (  5) minutes.

SCT capabilities:              (0x7035) SCT Status supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       5933
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1060
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7034
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       27973
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

what do you guys think about this? i don't see any reallocated sectors or pending sectors. as far as i can tell it looks OK.. should i let unraid rebuild this drive? should i leave it as-is (simplex kinda) and wait for my PSU to come in?
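
The check described above (no reallocated or pending sectors) can be done mechanically. This is a minimal sketch: the function names and the WATCHED list are choices made for illustration, and it assumes the raw value is a plain integer in the tenth column, which holds for this report but not for every drive. The sample rows are copied from the report above.

```python
# Pull raw values out of smartctl attribute rows and flag non-zero ones
# among a few attributes commonly watched for media or link problems.

WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

SAMPLE_ROWS = """
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  9 Power_On_Hours          0x0032  091  091  000    Old_age  Always      -      7034
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      4
"""


def parse_attributes(report):
    """Map attribute name to raw value for every attribute row in `report`."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns: ID, name, flag, ... raw value
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


print(nonzero_watched(parse_attributes(SAMPLE_ROWS)))
```

On this report the only watched attribute above zero is UDMA_CRC_Error_Count (raw value 4). CRC errors are counted on the link between drive and controller, so that points at cabling or the port rather than the platters, consistent with the SAS errors in the syslog.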
