Oh.. great. Marvell to LSI disk errors...


Unraid 6.4

HP Z800

 

Moved from a Marvell 8ch SAS/SATA adapter to an LSI (a Dell H310 flashed to IT mode), hoping for increased performance.

Booted fine after taping over pin 5 and pin 6 on the LSI so it would work with the motherboard.

Did a parity check, fails after about an hour.  Disk #3 errors out and becomes a RED X.

Reboot, unassign drive, start array, stop array, assign drive, rebuild drive, everything comes up, no issues.

Start another parity check, fails after about an hour.  Disk #3 errors out and becomes a RED X. (same issue as before, prior to rebuilding the drive)

 

Attached the email diagnostics log... hope the pro eyeballs out there can help me make heads/tails out of the syslog and smart reports. (DISK #3)

 

Thanks all in advance for the review and support!! 

 

tower-diagnostics-20180130-0931.zip

 

 

To help you zero in, parity check starts here:

 

Jan 30 08:33:48 Tower kernel: mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : 0x2 [current] 
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 ASC=0x4 ASCQ=0x0 
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00
Jan 30 08:33:48 Tower kernel: print_req_error: I/O error, dev sdb, sector 290571512
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571448
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571456
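For anyone trying to read these lines: sense key 0x2 is NOT READY, and ASC 0x04 / ASCQ 0x00 is "logical unit not ready, cause not reportable", so the drive stopped answering rather than reporting a bad sector. Opcode 0x88 is READ(16), and the CDB bytes carry the starting LBA and the transfer length. Below is a minimal Python sketch that decodes the CDB string quoted above. The function name and the `partition_start` parameter are made up for illustration; the value 64 is taken from the preclear report later in this thread, and the observation that md's sector equals the device sector minus that offset is only what the numbers in this log suggest.

```python
def decode_read16(cdb_hex, partition_start=64):
    """Decode a READ(16) CDB given as space-separated hex bytes.

    partition_start is an assumption for illustration: the preclear report
    in this thread says the partition starts on sector 64.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return {
        "lba": lba,
        "blocks": blocks,
        "md_sector": lba - partition_start,     # sector relative to the partition
    }


info = decode_read16("88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00")
print(info)
```

The decoded LBA is 290571512, the same number as in the print_req_error line, and subtracting 64 gives 290571448, the first sector md reports for disk3.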

 

Edited by pkh106

Swap both cables/backplane with another disk and try again.

  • Author

thank you. was running a preclear for now while i waited for you and the pros to chime in. will try that shortly and report back.

 

anything revealing in the logs/smart reports? i'm not sure what i should be looking for.

 

The mptsas driver is not very helpful when there's an error. SMART looks fine, but you should rule out any cable/backplane issues; if it still fails after that, it's probably the disk, despite the healthy-looking SMART.

  • Author

will completing the preclear flush out any issues with the disk and/or the mptsas driver?

 

 

If the disk is the problem, the preclear can error out too, especially during the read test.

Edited by johnnie.black

  • Author

preclear passed.

 

going to swap the SATA connector ends now and rebuild the drive. after will re-run parity to see if the failure comes back

 

############################################################################################################################
#                                                                                                                          #
#                                        unRAID Server Preclear of disk W3001Z18                                           #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 3 - Zeroing the disk:                                                        [8:25:08 @ 132 MB/s] SUCCESS    #
#   Step 2 of 3 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 3 of 3 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                               Cycle elapsed time: 8:25:13 | Total elapsed time: 8:25:15                                  #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  CYCLE 1  STATUS                                                                  #
#   5-Reallocated_Sector_Ct      0        0        -                                                                       #
#   9-Power_On_Hours             16762    16770    Up 8                                                                    #
#   183-Runtime_Bad_Block        0        0        -                                                                       #
#   184-End-to-End_Error         0        0        -                                                                       #
#   187-Reported_Uncorrect       0        0        -                                                                       #
#   190-Airflow_Temperature_Cel  35       35       -                                                                       #
#   197-Current_Pending_Sector   0        0        -                                                                       #
#   198-Offline_Uncorrectable    0        0        -                                                                       #
#   199-UDMA_CRC_Error_Count     0        0        -                                                                       #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################


--> ATTENTION: Please take a look into the SMART report above for drive health issues.

--> RESULT: Preclear Finished Successfully!.


cat: /tmp/.preclear/sdb/cmp_out: No such file or directory
root@Tower:/usr/local/emhttp#
  • Author

in the middle of the DISK3 rebuild; DISK0 (parity) started experiencing read errors... rebuilding ongoing though

 

from the disklog:

Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 70 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871736896
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 74 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871737920

 

from the syslog:

Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 90 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 94 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112
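To see which devices are throwing errors and how often, the print_req_error lines can be tallied per device. This is a rough sketch only: the function name is invented, the sample lines are copied from the logs quoted in this thread, and the regex is tied to the message format shown here, which differs between kernel versions.

```python
import re
from collections import Counter

# Matches the kernel message format quoted in this thread.
ERR_RE = re.compile(r"print_req_error: I/O error, dev (\w+), sector (\d+)")


def tally_io_errors(lines):
    """Return (error count per device, list of failing sectors per device)."""
    counts = Counter()
    sectors = {}
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            dev, sector = m.group(1), int(m.group(2))
            counts[dev] += 1
            sectors.setdefault(dev, []).append(sector)
    return counts, sectors


sample = [
    "Jan 30 08:33:48 Tower kernel: print_req_error: I/O error, dev sdb, sector 290571512",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064",
    "Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088",
    "Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112",
]
counts, sectors = tally_io_errors(sample)
print(counts)
```

To run it against a real log, pass the open file object for the syslog from the diagnostics zip instead of `sample`. Keep in mind that sdX letters can change between boots, so match them to serial numbers before blaming a disk.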

Edited by pkh106

Is parity the disk you swapped cables with, or is it on the same miniSAS cable? If not, there's likely some other issue, such as the power supply.

  • Author

So... this is the current setup and change log:

 

Purple is SAS->SATA cable A; parity (p1), disk1 (p2), disk2 (p3), disk3 (p4)

Black is SAS->SATA cable B; disk4 (p1), disk5 (p2), disk6 (p3), cache (p4)

 

Status  Device  Identification                            Temp  Reads    Writes   Errors  FS   Size  Used     Free
green   Parity  ST4000DM000-2AE166_WDH0TARW - 4 TB (sdc)  72 F  785,248  38       70,912
green   Disk 1  ST4000DM000-1F2168_Z302AHHL - 4 TB (sde)  73 F  787,236  6        0       xfs  4 TB  3.65 TB  349 GB
green   Disk 2  ST4000DM000-1F2168_Z301QBF3 - 4 TB (sdb)  79 F  789,248  4        0       xfs  4 TB  3.58 TB  418 GB
yellow  Disk 3  ST4000DM000-1F2168_W3001Z18 - 4 TB (sdd)  81 F  27       784,630  0       xfs  4 TB  3.55 TB  446 GB
green   Disk 4  ST4000DM000-1F2168_Z3019HFP - 4 TB (sdg)  99 F  787,306  5        0       xfs  4 TB  3.56 TB  444 GB
green   Disk 5  ST4000DM000-1F2168_Z3018EYG - 4 TB (sdh)  95 F  789,244  5        0       xfs  4 TB  3.59 TB  412 GB
green   Disk 6  ST4000DM000-1F2168_S300KS81 - 4 TB (sdf)  93 F  785,768  5        0       xfs  4 TB  3.86 TB  140 GB

 

Both were originally controlled by an 8ch Marvell (Supermicro) controller.

Swapped it out with a Dell H310.

 

When i swapped out the controller, ran a parity check, had a failure on disk3, it stopped the parity check.  Failure.

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, rebuilt the drive.  no issues. 

 

ran a preclear on disk 3, no issues.

 

Swapped disk 3's cable with disk 2's (same SAS/SATA cable A, p4 swapped with p3; didn't disconnect it from the controller).

 

Assigned the drive to the array, started the array, rebuilt the drive, received errors:

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 68096)

However, drive rebuilt in the end

Subject: Notice [TOWER] - Parity sync / Data rebuild finished (68096 errors)

 

Tried to start another parity check, (the entire array is now up and running) after the rebuild

Failure on disk 3.

Event: unRAID Disk 3 error
Subject: Alert [TOWER] - Disk 3 in error state (disk dsbl)

 

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, now rebuilding the drive.

received errors (similar to before, but not as many).

Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 70912) (was 68096)

 

Rebuilding is continuing.....

 

So... after the rebuild, do i swap disks? swap controllers?  thoughts? comments? appreciate all in advance!

 

EDIT: diagnostics attached

 

 

tower-diagnostics-20180131-0644.zip

Edited by pkh106

I would get a new cable for the first four disks, since at least for now issues are limited to those.
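The reasoning here is just grouping the failing disks by the breakout cable they hang off. A small sketch of that check, using the cable layout posted above; the dictionary and function names are made up for illustration:

```python
# Cable layout as described earlier in the thread (cable A = purple, B = black).
CABLE_MAP = {
    "A": ["parity", "disk1", "disk2", "disk3"],
    "B": ["disk4", "disk5", "disk6", "cache"],
}


def cables_with_errors(cable_map, failing):
    """Map each cable to the failing disks on it, skipping clean cables."""
    hits = {}
    for cable, disks in cable_map.items():
        bad = sorted(set(disks) & set(failing))
        if bad:
            hits[cable] = bad
    return hits


# Disks that have logged errors so far in this thread.
print(cables_with_errors(CABLE_MAP, {"disk3", "parity"}))
```

If every failing disk lands on one cable, that cable (or the controller port it plugs into) is the cheapest thing to swap first. It doesn't rule out a bad disk or power issue.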

  • Author

johnnie black - ack! saw your reply after i swapped out the controller.

 

noticed on the SMART reports for DISK3 there's an increasing number of high-fly writes.....

 

let me finish the parity check with the old Marvell controller and report back.. stay tuned

  • Author

parity check with the old Marvell revealed no errors.

 

swapped out the old controller with the h310 and the new hdd (replacing disk3).

 

rebuilt without any issues!  looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?  the SAS2LP seems more tolerant?  dunno.

 

i'm going to mark this as solved.  re-running a parity check one more time with the new gear just to make sure, but thank you again for the help! 

On 2/2/2018 at 4:08 PM, pkh106 said:

looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?

This has nothing to do with the controller.

 

The drive monitors itself, and it's the drive that aborts a write when it concludes or suspects that the write head isn't positioned well enough. That isn't an error in itself, but if the frequency of high-fly writes starts to increase, it might be a reason to reevaluate the drive.

 

We don't know exactly how the drive measures the flying height. In write mode the write head is expected to be aligned on the target track, which means the read head is not aligned and can't read data from the current track, so it's hard to know what will trigger the high-fly detection. But it isn't impossible that vibrations between the disks are causing the high-fly writes.
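One way to watch whether the count is actually climbing is to compare the raw value between two SMART reports. The sketch below parses the attribute table as printed by `smartctl -A`, where Seagate drives typically show this counter as attribute 189 High_Fly_Writes. The function names are invented, and the two sample lines and their raw values (12 and 19) are made-up numbers for illustration, not taken from the diagnostics in this thread.

```python
def high_fly_raw(smart_attr_text):
    """Return the raw High_Fly_Writes value from smartctl -A style text, or None."""
    for line in smart_attr_text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[1] == "High_Fly_Writes":
            return int(fields[9])
    return None


def high_fly_delta(before_text, after_text):
    """Difference in raw High_Fly_Writes between two reports, or None if missing."""
    before, after = high_fly_raw(before_text), high_fly_raw(after_text)
    if before is None or after is None:
        return None
    return after - before


# Hypothetical sample lines; the raw values are invented for this example.
before = "189 High_Fly_Writes 0x003a 088 088 000 Old_age Always - 12"
after = "189 High_Fly_Writes 0x003a 081 081 000 Old_age Always - 19"
print(high_fly_delta(before, after))
```

A steady count is nothing to worry about; a delta that keeps growing between parity checks is the trend described above.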

Archived

This topic is now archived and is closed to further replies.
