pkh106 Posted January 30, 2018

Unraid 6.4
HP Z800

Moved from a Marvell 8ch SAS/SATA adapter to an LSI (Dell H310 flashed to IT mode), hoping for increased performance. Booted fine after the tape trick on pin 5 and pin 6 on the LSI to work with the motherboard.

Did a parity check; it fails after about an hour. Disk #3 errors out and becomes a RED X.

Reboot, unassign drive, start array, stop array, assign drive, rebuild drive. Everything comes up, no issues.

Start another parity check; it fails after about an hour. Disk #3 errors out and becomes a RED X (same issue as before, prior to rebuilding the drive).

Attached the email diagnostics log... hope the pro eyeballs out there can help me make heads/tails out of the syslog and SMART reports (DISK #3).

Thanks all in advance for the review and support!!

tower-diagnostics-20180130-0931.zip

To help you zero in, parity check starts here:

Jan 30 08:33:48 Tower kernel: mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : 0x2 [current]
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 08:33:48 Tower kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00
Jan 30 08:33:48 Tower kernel: print_req_error: I/O error, dev sdb, sector 290571512
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571448
Jan 30 08:33:48 Tower kernel: md: disk3 read error, sector=290571456
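For anyone reading the log excerpt above and wondering what the raw numbers mean, here is a minimal Python sketch that decodes the CDB bytes from the syslog line. It assumes the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13). The helper names `decode_read16` and `md_to_device` are made up for illustration. The sense strings are the standard meanings of Sense Key 0x2 and ASC/ASCQ 0x04/0x00. The 64-sector offset comes from the preclear report later in this thread ("partition start on sector 64"); treating the md sector numbers as partition-relative is an assumption, although the arithmetic does line up with the log.

```python
# Decode the READ(16) CDB exactly as printed in the syslog line.
# Layout assumed: byte 0 opcode, byte 1 flags, bytes 2-9 LBA,
# bytes 10-13 transfer length, byte 14 group number, byte 15 control.

def decode_read16(cdb_hex: str) -> dict:
    """Return the starting LBA and block count of a READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are accepted
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    return {
        "lba": int.from_bytes(cdb[2:10], "big"),
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }

# Standard meanings of the sense data shown in the log.
SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {(0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE"}

# Assumption: md sector numbers are relative to the partition, which the
# preclear report in this thread says starts on sector 64.
PARTITION_START = 64

def md_to_device(md_sector: int) -> int:
    """Map an md (array) sector to a whole-device sector (hypothetical helper)."""
    return md_sector + PARTITION_START

first = decode_read16("88 00 00 00 00 00 11 51 c4 f8 00 00 04 00 00 00")
print(first)                             # {'lba': 290571512, 'blocks': 1024}
print(SENSE_KEYS[0x2], "/", ASC_ASCQ[(0x04, 0x00)])
print(md_to_device(290571448))           # 290571512
```

The decoded LBA matches the "I/O error, dev sdb, sector 290571512" line, and the first md read error lands on the same device sector once the 64-sector offset is added. "Not ready" sense data means the drive stopped responding to the read, which on its own does not say whether the disk, the cable, or something else is at fault.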
JorgeB Posted January 30, 2018

Swap both cables/backplane with another disk and try again.
pkh106 Posted January 30, 2018 (Author)

thank you. was running a pre-clear for now while i waited till you and the pros chimed in. will try that shortly and report back.

anything revealing in the logs/SMART reports? i'm not sure what i should be looking for.
JorgeB Posted January 30, 2018

The mptsas driver is not very helpful when there's an error. SMART looks fine, but you should rule out any cable/backplane issues; if it still fails after that, it's probably the disk despite the healthy-looking SMART.
pkh106 Posted January 30, 2018 (Author)

will completing the preclear flush out any issues with the disk and/or the mptsas driver?
JorgeB Posted January 30, 2018

It can also error if it's the disk, especially when doing the read test.
pkh106 Posted January 31, 2018 (Author)

preclear passed. going to swap the SATA connector ends now and rebuild the drive. after will re-run parity to see if the failure comes back

unRAID Server Preclear of disk W3001Z18
Cycle 1 of 1, partition start on sector 64.

Step 1 of 3 - Zeroing the disk: [8:25:08 @ 132 MB/s] SUCCESS
Step 2 of 3 - Writing unRAID's Preclear signature: SUCCESS
Step 3 of 3 - Verifying unRAID's Preclear signature: SUCCESS

Cycle elapsed time: 8:25:13 | Total elapsed time: 8:25:15

S.M.A.R.T. Status default

ATTRIBUTE                    INITIAL  CYCLE 1  STATUS
5-Reallocated_Sector_Ct      0        0        -
9-Power_On_Hours             16762    16770    Up 8
183-Runtime_Bad_Block        0        0        -
184-End-to-End_Error         0        0        -
187-Reported_Uncorrect       0        0        -
190-Airflow_Temperature_Cel  35       35       -
197-Current_Pending_Sector   0        0        -
198-Offline_Uncorrectable    0        0        -
199-UDMA_CRC_Error_Count     0        0        -

SMART overall-health self-assessment test result: PASSED

--> ATTENTION: Please take a look into the SMART report above for drive health issues.
--> RESULT: Preclear Finished Successfully!.

cat: /tmp/.preclear/sdb/cmp_out: No such file or directory
root@Tower:/usr/local/emhttp#
pkh106 Posted January 31, 2018 (Author)

in the middle of the DISK3 rebuild; DISK0 (parity) started experiencing read errors... rebuilding ongoing though

from the disklog:

Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 70 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871736896
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 74 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871737920

from the syslog:

Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871744064
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 90 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871745088
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 Sense Key : 0x2 [current]
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 ASC=0x4 ASCQ=0x0
Jan 30 22:10:01 Tower kernel: sd 7:0:1:0: [sdc] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 6f 90 94 40 00 00 04 00 00 00
Jan 30 22:10:01 Tower kernel: print_req_error: I/O error, dev sdc, sector 1871746112
JorgeB Posted January 31, 2018

Is parity the disk you swapped cables with, or is it on the same miniSAS cable? If not, there's likely some other issue, like the power supply, etc.
pkh106 Posted January 31, 2018 (Author)

So... this is the current setup and change log.

Purple is SAS->SATA cable A: parity (p1), disk1 (p2), disk2 (p3), disk3 (p4)
Black is SAS->SATA cable B: disk4 (p1), disk5 (p2), disk6 (p3), cache (p4)

Device  Identification                            Temp  Reads    Writes   Errors  FS   Size  Used     Free
Parity  ST4000DM000-2AE166_WDH0TARW - 4 TB (sdc)  72 F  785,248  38       70,912
Disk 1  ST4000DM000-1F2168_Z302AHHL - 4 TB (sde)  73 F  787,236  6        0       xfs  4 TB  3.65 TB  349 GB
Disk 2  ST4000DM000-1F2168_Z301QBF3 - 4 TB (sdb)  79 F  789,248  4        0       xfs  4 TB  3.58 TB  418 GB
Disk 3  ST4000DM000-1F2168_W3001Z18 - 4 TB (sdd)  81 F  27       784,630  0       xfs  4 TB  3.55 TB  446 GB
Disk 4  ST4000DM000-1F2168_Z3019HFP - 4 TB (sdg)  99 F  787,306  5        0       xfs  4 TB  3.56 TB  444 GB
Disk 5  ST4000DM000-1F2168_Z3018EYG - 4 TB (sdh)  95 F  789,244  5        0       xfs  4 TB  3.59 TB  412 GB
Disk 6  ST4000DM000-1F2168_S300KS81 - 4 TB (sdf)  93 F  785,768  5        0       xfs  4 TB  3.86 TB  140 GB

Both cables were originally controlled by an 8ch Marvell (Supermicro) controller. Swapped it out with a Dell H310.

When i swapped out the controller, ran a parity check, had a failure on disk3, and it stopped the parity check. Failure.

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, rebuilt the drive. no issues.

ran a preclear on disk 3, no issues.

Swapped disk 3 with disk 2's cable (same SAS/SATA cable A, p4 swapped with p3, didn't disconnect it from the controller).

Assigned the drive to the array, started the array, rebuilt the drive, received errors:
Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 68096)

However, the drive rebuilt in the end:
Subject: Notice [TOWER] - Parity sync / Data rebuild finished (68096 errors)

Tried to start another parity check after the rebuild (the entire array is now up and running). Failure on disk 3.
Event: unRAID Disk 3 error
Subject: Alert [TOWER] - Disk 3 in error state (disk dsbl)

Stopped the array, unassigned the drive, started the array, stopped the array, assigned the drive, now rebuilding the drive. received errors (similar to before, however not as many):
Parity disk - ST4000DM000-2AE166_WDH0TARW (sdc) (errors 70912) (was 68096)

Rebuilding is continuing.....

So... after the rebuild, do i swap disks? swap controllers? thoughts? comments? appreciate all in advance!

EDIT: diagnostics attached
tower-diagnostics-20180131-0644.zip
JorgeB Posted January 31, 2018

I would get a new cable for the first four disks, since at least for now issues are limited to those.
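The reasoning behind that suggestion can be sketched in a few lines of Python: group the disks by the breakout cable they hang off and see which cables carry the disks that have logged errors. The cable layout and the two errored devices (parity and disk3) are taken from the posts above; the function name `cables_with_errors` is made up for illustration. Treat it as a way to narrow down suspects, not as proof, since disks on one cable may also share other things such as a power lead.

```python
# Cable layout as described by the original poster.
CABLES = {
    "A (purple)": ["parity", "disk1", "disk2", "disk3"],
    "B (black)": ["disk4", "disk5", "disk6", "cache"],
}

# Devices that have reported read errors so far in this thread.
ERRORED = {"parity", "disk3"}

def cables_with_errors(cables, errored):
    """Return only the cables that carry at least one errored disk."""
    hits = {}
    for cable, disks in cables.items():
        bad = [d for d in disks if d in errored]
        if bad:
            hits[cable] = bad
    return hits

suspects = cables_with_errors(CABLES, ERRORED)
for cable, disks in suspects.items():
    print(f"cable {cable}: {', '.join(disks)}")  # cable A (purple): parity, disk3
```

Both errored devices sit on cable A and none on cable B, which is why the first four disks are the ones singled out.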
pkh106 Posted February 1, 2018 (Author)

johnnie black - ack! saw your reply after i swapped out the controller. noticed on the SMART reports for DISK3 there's an increasing number of high fly writes.....

let me finish the parity check with the old Marvell controller and report back.. stay tuned
pkh106 Posted February 2, 2018 (Author)

parity check with the old Marvell revealed no errors. swapped out the old controller with the h310 and the new hdd (replacing disk3). rebuilt without any issues!

looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell? the SAS2LP seems more tolerant? dunno.

i'm going to mark this as solved. re-running a parity check one more time with the new gear just to make sure, but thank you again for the help!
pwm Posted February 4, 2018

On 2/2/2018 at 4:08 PM, pkh106 said:
"looks like the h310 is more "sensitive" to high fly writes and latency issues than the Marvell?"

This has nothing to do with the controller. It's the drive itself that tries to monitor itself. And it's the drive that has aborted writes because it has somehow concluded/suspected that the write head hasn't been positioned well enough.

It isn't an error in itself, but if the frequency of high-fly errors starts to increase then it might be a reason to reevaluate.

Since we don't know exactly how the drive measures the flying height (in write mode the write head is expected to be aligned on the target track, which means the read head is not aligned and can't read data for the current track), it's hard to know what will cause the high-flying detection. But it isn't impossible that vibrations between the disks are causing the high-fly writes.
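To put a number on "the frequency starts to increase", one option is to compare two saved SMART attribute tables and work out how many high-fly events were added per 1000 power-on hours. The Python below is only a sketch: it assumes the usual ten-column `smartctl -A` attribute layout with the raw value in the last column, that attribute 189 is High_Fly_Writes and attribute 9 is Power_On_Hours, and that both raw values are plain integers. The two sample tables are hypothetical and are not readings from the drive in this thread; the function names are made up for illustration.

```python
def raw_value(smart_table: str, attr_id: int) -> int:
    """Return the RAW_VALUE column for one attribute ID."""
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the ID is first, the raw value last.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    raise KeyError(f"attribute {attr_id} not found")

def per_1000_hours(before: str, after: str, attr_id: int = 189) -> float:
    """New events for attr_id per 1000 power-on hours between two reports."""
    hours = raw_value(after, 9) - raw_value(before, 9)
    if hours <= 0:
        raise ValueError("reports must be taken at different power-on hours")
    events = raw_value(after, attr_id) - raw_value(before, attr_id)
    return 1000 * events / hours

# Hypothetical sample reports (made-up values, for illustration only).
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       1000
189 High_Fly_Writes         0x003a   090   090   000    Old_age   Always       -       10
"""
AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       1200
189 High_Fly_Writes         0x003a   090   090   000    Old_age   Always       -       14
"""

print(per_1000_hours(BEFORE, AFTER))  # 20.0
```

A single rate says little by itself; what matters is whether the figure keeps climbing from one pair of reports to the next.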