
Oh.. great. Marvell to LSI disk errors...


pkh106


Unraid 6.4

HP Z800

 

Moved from a Marvell 8ch SAS/SATA adapter to an LSI (Dell H310 flashed to IT mode), hoping for increased performance.

Booted fine after the tape trick (covering pin 5 and pin 6 on the LSI card) to get it working with the motherboard.

Ran a parity check; it fails after about an hour. Disk #3 errors out and gets a RED X.

Reboot, unassign drive, start array, stop array, assign drive, rebuild drive: everything comes up, no issues.

Started another parity check; it fails after about an hour. Disk #3 errors out and gets a RED X (same issue as before the rebuild).

 

Attached the emailed diagnostics log... hope the pro eyeballs out there can help me make heads or tails of the syslog and SMART reports (DISK #3).

 

Thanks all in advance for the review and support!! 

 

tower-diagnostics-20180130-0931.zip

 

 

To help you zero in, parity check starts here:

 

Jan 30 08:33:48 Tower kernel: mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : 0x2 [current] 
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 ASC=0x4 ASCQ=0x0 
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00
Jan 30 08:33:48 Tower kernel: print_req_error: I/O error, dev sdb, sector 290571512
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571448
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571456
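For anyone trying to read the raw values in these lines, here is a rough Python sketch of how the CDB and sense data can be decoded. It assumes the standard SCSI layout for READ(16) (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13) and only maps the sense values that actually appear in this log. The function names and the `PARTITION_START` constant are made up for illustration; the value 64 comes from the "partition start on sector 64" line in the preclear report, and the idea that md sectors are device sectors minus that offset is an inference that happens to match the numbers here.

```python
# Sense values seen in this thread only; the full tables live in the SCSI specs.
SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {(0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE"}

# Assumption: partition starts at sector 64, as reported by the preclear output.
PARTITION_START = 64


def decode_read16(cdb_hex):
    """Return (lba, block_count) for a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, blocks


def describe_sense(key, asc, ascq):
    """Translate sense key / ASC / ASCQ into text where known."""
    key_text = SENSE_KEYS.get(key, "unknown sense key")
    detail = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ")
    return f"{key_text}: {detail}"


lba, blocks = decode_read16("88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00")
print(lba, blocks)                       # 290571512 1024
print(lba - PARTITION_START)             # 290571448, the first md sector logged
print(describe_sense(0x2, 0x04, 0x00))
```

Read this way, the drive answered a 1024-block read with "not ready" rather than reporting a media error, which would fit a drive dropping off the bus and not (only) a bad sector.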

 


Preclear passed.

 

Going to swap the SATA connector ends now and rebuild the drive. Afterwards I'll re-run the parity check to see if the failure comes back.

 

############################################################################################################################
#                                                                                                                          #
#                                        unRAID Server Preclear of disk W3001Z18                                           #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 3 - Zeroing the disk:                                                        [8:25:08 @ 132 MB/s] SUCCESS    #
#   Step 2 of 3 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 3 of 3 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                               Cycle elapsed time: 8:25:13 | Total elapsed time: 8:25:15                                  #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  CYCLE 1  STATUS                                                                  #
#   5-Reallocated_Sector_Ct      0        0        -                                                                       #
#   9-Power_On_Hours             16762    16770    Up 8                                                                    #
#   183-Runtime_Bad_Block        0        0        -                                                                       #
#   184-End-to-End_Error         0        0        -                                                                       #
#   187-Reported_Uncorrect       0        0        -                                                                       #
#   190-Airflow_Temperature_Cel  35       35       -                                                                       #
#   197-Current_Pending_Sector   0        0        -                                                                       #
#   198-Offline_Uncorrectable    0        0        -                                                                       #
#   199-UDMA_CRC_Error_Count     0        0        -                                                                       #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################


--> ATTENTION: Please take a look into the SMART report above for drive health issues.

--> RESULT: Preclear Finished Successfully!.


cat: /tmp/.preclear/sdb/cmp_out: No such file or directory
root@Tower:/usr/local/emhttp#

In the middle of the DISK3 rebuild, DISK0 (parity) started experiencing read errors... the rebuild is still going though.

 

from the disklog:

Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 70 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871736896
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 74 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871737920

 

from the syslog:

Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 90 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 94 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112
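With this many near-identical lines it can help to boil the log down to a per-device summary. Below is a minimal Python sketch that counts `print_req_error` lines per device and reports the lowest and highest failing sector. The regular expression is written against the line format shown above and may need adjusting for other kernel versions; `summarize_io_errors` is a hypothetical helper name, and the sample list simply reuses the lines pasted in this post.

```python
import re
from collections import defaultdict

# Matches lines like: "print_req_error: I/O error, dev sdc, sector 1871736896"
IO_ERROR = re.compile(r"print_req_error: I/O error, dev (\w+), sector (\d+)")


def summarize_io_errors(lines):
    """Map device name -> (error count, lowest sector, highest sector)."""
    sectors = defaultdict(list)
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            sectors[match.group(1)].append(int(match.group(2)))
    return {dev: (len(s), min(s), max(s)) for dev, s in sectors.items()}


sample = [
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871736896",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871737920",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112",
]

for dev, (count, low, high) in summarize_io_errors(sample).items():
    print(f"{dev}: {count} errors, sectors {low}..{high}")
```

To run it against a real log, pass an open file object (for example the syslog extracted from the diagnostics zip) instead of the sample list.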


So... this is the current setup and change log:

 

Purple is SAS->SATA cable A: parity (p1), disk1 (p2), disk2 (p3), disk3 (p4)

Black is SAS->SATA cable B: disk4 (p1), disk5 (p2), disk6 (p3), cache (p4)

 

Status  Device  Identification                             Temp  Reads    Writes   Errors  FS   Size  Used     Free
green   Parity  ST4000DM000-2AE166_WDH0TARW - 4 TB (sdc)   72 F  785,248  38       70,912
green   Disk 1  ST4000DM000-1F2168_Z302AHHL - 4 TB (sde)   73 F  787,236  6        0       xfs  4 TB  3.65 TB  349 GB
green   Disk 2  ST4000DM000-1F2168_Z301QBF3 - 4 TB (sdb)   79 F  789,248  4        0       xfs  4 TB  3.58 TB  418 GB
yellow  Disk 3  ST4000DM000-1F2168_W3001Z18 - 4 TB (sdd)   81 F  27       784,630  0       xfs  4 TB  3.55 TB  446 GB
green   Disk 4  ST4000DM000-1F2168_Z3019HFP - 4 TB (sdg)   99 F  787,306  5        0       xfs  4 TB  3.56 TB  444 GB
green   Disk 5  ST4000DM000-1F2168_Z3018EYG - 4 TB (sdh)   95 F  789,244  5        0       xfs  4 TB  3.59 TB  412 GB
green   Disk 6  ST4000DM000-1F2168_S300KS81 - 4 TB (sdf)   93 F  785,768  5        0       xfs  4 TB  3.86 TB  140 GB

 

Both cables were originally connected to an 8ch Marvell (Supermicro) controller.

Swapped it out for a Dell H310.

 

After I swapped out the controller, I ran a parity check. Disk3 failed and the parity check stopped.

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, rebuilt the drive. No issues.

 

Ran a preclear on disk 3, no issues.

 

Swapped disk 3's cable with disk 2's (same SAS/SATA cable A, p4 swapped with p3; didn't disconnect it from the controller).

 

Assigned the drive to the array, started the array, rebuilt the drive, received errors:

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 68096)

However, drive rebuilt in the end

Subject: Notice [TOWER] - Parity sync / Data rebuild finished (68096 errors)

 

After the rebuild (with the entire array up and running), tried to start another parity check.

Failure on disk 3.

Event: unRAID Disk 3 error
Subject: Alert [TOWER] - Disk 3 in error state (disk dsbl)

 

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive; now rebuilding the drive.

Received errors again (similar to before, though not as many).

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 70912) (was 68096)

 

Rebuilding is continuing.....

 

So... after the rebuild, do I swap disks? Swap controllers? Thoughts? Comments? Appreciate all the help in advance!

 

EDIT: diagnostics attached

 

 

tower-diagnostics-20180131-0644.zip


johnnie black - ack! Saw your reply after I swapped out the controller.

 

Noticed in the SMART reports for DISK3 that there's an increasing number of high-fly writes.....

 

Let me finish the parity check with the old Marvell controller and report back.. stay tuned


Parity check with the old Marvell revealed no errors.

 

Swapped the old controller back out for the H310 and installed a new HDD (replacing disk3).

 

rebuilt without any issues!  looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?  the SAS2LP seems more tolerant?  dunno.

 

I'm going to mark this as solved. Re-running a parity check one more time with the new gear just to make sure, but thank you again for the help!

On 2/2/2018 at 4:08 PM, pkh106 said:

looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?

This has nothing to do with the controller.

 

The drive monitors itself, and it's the drive that aborted the writes because it concluded or suspected that the write head wasn't positioned well enough. A high-fly write isn't an error in itself, but if the frequency of high-fly errors starts to increase, that might be a reason to reevaluate the drive.

 

We don't know exactly how the drive measures flying height: in write mode the write head is expected to be aligned on the target track, which means the read head is not aligned and can't read data from the current track. So it's hard to know what triggers the high-fly detection. But it isn't impossible that vibrations between the disks are causing the high-fly writes.
