Oh.. great. Marvell to LSI disk errors...

pkh106 · January 30, 2018

Unraid 6.4

HP Z800

Moved from a Marvell 8ch SAS/SATA adapter to an LSI (Dell H310 flashed to IT mode), hoping for increased performance.

Booted fine after the tape trick (covering pin 5 and pin 6 on the LSI card) to get it working with the motherboard.

Did a parity check, fails after about an hour. Disk #3 errors out and becomes a RED X.

Reboot, unassign drive, start array, stop array, assign drive, rebuild drive, everything comes up, no issues.

Start another parity check, fails after about an hour. Disk #3 errors out and becomes a RED X. (same issue as before, prior to rebuilding the drive)

Attached the email diagnostics log... hope the pro eyeballs out there can help me make heads/tails out of the syslog and smart reports. (DISK #3)

Thanks all in advance for the review and support!!

tower-diagnostics-20180130-0931.zip

To help you zero in, parity check starts here:

Jan 30 08:33:48 Tower kernel: mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : 0x2 [current]
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00
Jan 30 08:33:48 Tower kernel: print_req_error: I/O error, dev sdb, sector 290571512
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571448
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571456
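For anyone reading the excerpt above: the CDB line can be decoded by hand to confirm which block the failed read was aimed at. Opcode 0x88 is READ(16), and sense key 0x2 with ASC/ASCQ 0x04/0x00 is the standard SCSI "not ready, cause not reportable" response, which by itself does not say whether the disk, cable, or controller is at fault. A minimal sketch of the decode follows; the 64-sector partition offset is an assumption taken from the "partition start on sector 64" line in the preclear output later in this thread.

```python
# Decode the READ(16) CDB copied from the syslog excerpt above.
cdb_hex = "88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00"
cdb = bytes.fromhex(cdb_hex)

opcode = cdb[0]                              # 0x88 = READ(16)
lba = int.from_bytes(cdb[2:10], "big")       # bytes 2-9: logical block address
blocks = int.from_bytes(cdb[10:14], "big")   # bytes 10-13: transfer length in blocks

# Assumption: the md device starts at the partition, 64 sectors into the disk.
PARTITION_START = 64
md_sector = lba - PARTITION_START

print(f"opcode={opcode:#x} lba={lba} blocks={blocks} md_sector={md_sector}")
```

The decoded LBA matches the sector in the print_req_error line, and subtracting the partition offset gives the sector in the first md: disk3 read error line.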

JorgeB · January 30, 2018

Swap both cables/backplane with another disk and try again.

pkh106 · January 30, 2018

thank you. was running a preclear while i waited for you and the pros to chime in. will try that shortly and report back.

anything revealing in the logs/smart reports? i'm not sure what i should be looking for.

JorgeB · January 30, 2018

The mptsas driver is not very helpful when there's an error. SMART looks fine, but you should rule out any cable/backplane issues. If it still fails after that, it's probably the disk, despite the healthy-looking SMART.

pkh106 · January 30, 2018

will completing the preclear flush out any issues with the disk and/or the mptsas driver?

JorgeB · January 30, 2018

It can also error if it's the disk, especially during the read test.

pkh106 · January 31, 2018

preclear passed.

going to swap the SATA connector ends now and rebuild the drive. after that, will re-run the parity check to see if the failure comes back

############################################################################################################################
#                                                                                                                          #
#                                        unRAID Server Preclear of disk W3001Z18                                           #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 3 - Zeroing the disk:                                                        [8:25:08 @ 132 MB/s] SUCCESS    #
#   Step 2 of 3 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 3 of 3 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                               Cycle elapsed time: 8:25:13 | Total elapsed time: 8:25:15                                  #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  CYCLE 1  STATUS                                                                  #
#   5-Reallocated_Sector_Ct      0        0        -                                                                       #
#   9-Power_On_Hours             16762    16770    Up 8                                                                    #
#   183-Runtime_Bad_Block        0        0        -                                                                       #
#   184-End-to-End_Error         0        0        -                                                                       #
#   187-Reported_Uncorrect       0        0        -                                                                       #
#   190-Airflow_Temperature_Cel  35       35       -                                                                       #
#   197-Current_Pending_Sector   0        0        -                                                                       #
#   198-Offline_Uncorrectable    0        0        -                                                                       #
#   199-UDMA_CRC_Error_Count     0        0        -                                                                       #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################


--> ATTENTION: Please take a look into the SMART report above for drive health issues.

--> RESULT: Preclear Finished Successfully!.


cat: /tmp/.preclear/sdb/cmp_out: No such file or directory
root@Tower:/usr/local/emhttp#

pkh106 · January 31, 2018

in the middle of the DISK3 rebuild; DISK0 (parity) started experiencing read errors... rebuilding ongoing though

from the disklog:

Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 70 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871736896
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 74 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871737920

from the syslog:

Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 90 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 94 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112
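With this many repeated lines it can help to boil the syslog down to one row per device. The sketch below is only an illustration: the function name is made up, and the sample lines are copied from the excerpt above. It pulls the device and sector out of each print_req_error line.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt above.
log = """\
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112
"""

IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def summarize_io_errors(text):
    """Return {device: sorted list of failing sectors} from print_req_error lines."""
    sectors = defaultdict(list)
    for line in text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            sectors[match.group(1)].append(int(match.group(2)))
    return {dev: sorted(found) for dev, found in sectors.items()}

summary = summarize_io_errors(log)
for dev, found in summary.items():
    print(f"{dev}: {len(found)} errors, sectors {found[0]}..{found[-1]}")
```

The failing sectors here are 1024 apart, the same as the transfer length in the CDB lines (00 00 04 00), so this looks like a run of back-to-back reads failing within the same second rather than scattered bad spots.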

JorgeB · January 31, 2018

Is parity the disk you swapped cables with, or is it on the same miniSAS cable? If not, there's likely some other issue, like the power supply, etc.

pkh106 · January 31, 2018

So... this is the current setup and change log:

Purple is SAS->SATA cable A; parity (p1), disk1 (p2), disk2 (p3), disk3 (p4)

Black is SAS->SATA cable B, disk4 (p1), disk5 (p2), disk6 (p3), cache(p4)

Device  Identification                            Temp  Reads    Writes   Errors  FS   Size  Used     Free
Parity  ST4000DM000-2AE166_WDH0TARW - 4 TB (sdc)  72 F  785,248  38       70,912
Disk 1  ST4000DM000-1F2168_Z302AHHL - 4 TB (sde)  73 F  787,236  6        0       xfs  4 TB  3.65 TB  349 GB
Disk 2  ST4000DM000-1F2168_Z301QBF3 - 4 TB (sdb)  79 F  789,248  4        0       xfs  4 TB  3.58 TB  418 GB
Disk 3  ST4000DM000-1F2168_W3001Z18 - 4 TB (sdd)  81 F  27       784,630  0       xfs  4 TB  3.55 TB  446 GB
Disk 4  ST4000DM000-1F2168_Z3019HFP - 4 TB (sdg)  99 F  787,306  5        0       xfs  4 TB  3.56 TB  444 GB
Disk 5  ST4000DM000-1F2168_Z3018EYG - 4 TB (sdh)  95 F  789,244  5        0       xfs  4 TB  3.59 TB  412 GB
Disk 6  ST4000DM000-1F2168_S300KS81 - 4 TB (sdf)  93 F  785,768  5        0       xfs  4 TB  3.86 TB  140 GB

Both were originally controlled by an 8ch Marvell (Supermicro) controller.

Swapped it out for a Dell H310.

When i swapped out the controller, ran a parity check, had a failure on disk3, it stopped the parity check. Failure.

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, rebuilt the drive. no issues.

ran a preclear on disk 3, no issues.

Swapped disk 3's cable with disk 2's (same SAS->SATA cable A, p4 swapped with p3; didn't disconnect it from the controller).

Assigned the drive to the array, started the array, rebuilt the drive, received errors:

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 68096)

However, drive rebuilt in the end

Subject: Notice [TOWER] - Parity sync / Data rebuild finished (68096 errors)

Tried to start another parity check after the rebuild (the entire array is now up and running).

Failure on disk 3.

Event: unRAID Disk 3 error
Subject: Alert [TOWER] - Disk 3 in error state (disk dsbl)

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, now rebuilding the drive.

received errors (similar to before, but not as many).

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 70912) (was 68096)

Rebuilding is continuing.....

So... after the rebuild, do i swap disks? swap controllers? thoughts? comments? appreciate all in advance!

EDIT: diagnostics attached

tower-diagnostics-20180131-0644.zip

JorgeB · January 31, 2018

I would get a new cable for the first four disks, since at least for now issues are limited to those.
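The reasoning behind suspecting the cable can be written out as a small lookup. This is a sketch only: the port layout is the one given in the setup post above, and the set of affected devices reflects the errors reported so far in this thread (parity read errors, disk 3 disabled).

```python
# Port layout as described in the setup post above.
cables = {
    "A (purple)": ["parity", "disk1", "disk2", "disk3"],
    "B (black)": ["disk4", "disk5", "disk6", "cache"],
}

# Devices that have reported problems so far in this thread.
affected = {"parity", "disk3"}

def cables_with_errors(cables, affected):
    """Return the names of cables that carry at least one affected device."""
    return sorted(name for name, devs in cables.items() if affected & set(devs))

print(cables_with_errors(cables, affected))
```

Both affected devices sit on cable A and none on cable B, so the breakout cable (or the controller port it plugs into) is a common factor worth ruling out before blaming the drives.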

pkh106 · February 1, 2018

johnnie black - ack! saw your reply after i swapped out the controller.

noticed on the SMART reports for DISK3 there's an increasing number of high-fly writes.....

let me finish the parity check with the old Marvell controller and report back.. stay tuned

pkh106 · February 2, 2018

parity check with the old Marvell revealed no errors.

swapped out the old controller with the h310 and the new hdd (replacing disk3).

rebuilt without any issues! looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell? the SAS2LP seems more tolerant? dunno.

i'm going to mark this as solved. re-running a parity check one more time with the new gear just to make sure, but thank you again for the help!

pwm · February 4, 2018

On 2/2/2018 at 4:08 PM, pkh106 said:

looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?

This has nothing to do with the controller.

It's the drive itself that tries to monitor itself. And it's the drive that has aborted writes because it has somehow concluded/suspected that the write head hasn't been positioned well enough. It isn't an error in itself but if the frequency of high-fly errors starts to increase then it might be a reason to reevaluate.

Since we don't know exactly how the drive measures the flying height - in write mode the write head is expected to be aligned on the target track which means the read head is not aligned and can't read data for the current track - it's hard to know what will cause the high-flying detection. But it isn't impossible that vibrations between the disks are causing the high-fly writes.
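One way to watch whether the high-fly count is actually climbing is to compare the raw value between two SMART reports taken some time apart. The sketch below parses `smartctl -A` style text. The helper names are made up, and the two sample rows use placeholder numbers purely for illustration; they are not taken from the diagnostics in this thread.

```python
def raw_value(smart_text, attr_name):
    """Return the integer RAW_VALUE of one attribute from `smartctl -A` text, or None."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by the attribute name;
        # RAW_VALUE is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attr_name:
            return int(fields[9])
    return None

def high_fly_delta(before_text, after_text):
    """Return how much High_Fly_Writes grew between two reports."""
    return raw_value(after_text, "High_Fly_Writes") - raw_value(before_text, "High_Fly_Writes")

# Placeholder reports: the raw values 12 and 15 are hypothetical.
HEADER = "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
before = HEADER + "189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       12\n"
after = HEADER + "189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       15\n"

print("High_Fly_Writes grew by", high_fly_delta(before, after))
```

A count that sits still is probably nothing to act on; one that keeps growing between checks is the "reason to reevaluate" mentioned above.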
