Array with CRC errors and dropping disks

sdcp73 · August 19, 2019

Setup:

R720xd w/

6.7.2
PERC H310
Dell branded LSI 9207-8i (04TMJF and 0F4DPW cables)
Several assorted disks

Story:

The system was running stable until it was time for a parity check. Apparently the H310 can't handle running all 12 disks along with a cache pool and other operations. I thought it would be a good idea to replace the H310 with the LSI 9207-8i; since it is supported by Dell, there should be no issue. Wrong. Now during parity I get tons of CRC errors. Also, my only two HGST drives drop to unassigned devices when they spin down.

Attempted remedies:

Updated the LSI firmware and BIOS to the latest P20. Same issues.
Replaced the SAS cables with a used SAS cable that came with the LSI (connectors angled incorrectly, no Dell tag). I do not know if these cables are good. Same issues.

Other thoughts:

I learned from another post that the 310 doesn't report any SMART info, so I don't know if my drives have issues or other historical issues. Also, the cable for the 310 has the SAS connector molded together, so I can't try that cable. I find it odd that my two HGST drives behave differently than other brands. I've also noticed that the HDD lights aren't operating as they did on the 310, not sure if this is anything. My next path is a memtest to rule out bad RAM. I have attached my diagnostics, in case I'm maybe missing something.

donnie-diagnostics-20190819-2255.zip

JorgeB · August 20, 2019

7 hours ago, sdcp73 said:

Apparently the H310 can't handle running all 12 disks along with a cache pool and other operations.

It should, unless there's a problem.

7 hours ago, sdcp73 said:

Now during parity I get tons of CRC errors.

This is a connection problem, most times cables, but it could also be the HBA, especially if it's a fake.
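For anyone following along, here is a minimal sketch of how the CRC counter could be read per drive, assuming the SMART attribute table is available as text in the layout that `smartctl -A` prints. The sample rows and their values are invented for illustration; `crc_error_count` is a hypothetical helper name. Attribute 199 (UDMA_CRC_Error_Count) is the one that usually reflects link-level errors between drive and controller.

```python
# Sample rows in the column layout printed by `smartctl -A`.
# The numbers are made up for illustration, not taken from the diagnostics.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       21345
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       37
"""


def crc_error_count(smart_text):
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smart_text.splitlines():
        fields = line.split()
        # A data row has 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


if __name__ == "__main__":
    print("UDMA CRC errors:", crc_error_count(SAMPLE))
```

Comparing this counter before and after a parity check, per drive, can help show whether the errors follow a particular cable, backplane slot, or controller port.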

sdcp73 · August 21, 2019

Hey Johnnie, thanks for the input. I've tried to add some more detail to accompany your comments.

To support my previous statement about the H310, here is the article I read regarding queue depth. Admittedly, I don't fully understand this.

http://www.yellow-bricks.com/2014/04/17/disk-controller-features-and-queue-depth/

Well, I've tried a few things today and had a few issues pop up.

Unraid disabled a disk, and I'm not sure why. This keeps me from doing a parity check and makes me think something might really be wrong. New diags attached.

After the disk was disabled I changed the following, one at a time, with the same errors. To test it I could no longer do a parity check so I transferred from the emulated disk to another machine.

Enabled the H310 Mini
Rolled back to P19 firmware per Newegg review
Deleted the BIOS per firmware review

Also, I've taken some pictures of the card. From what I can tell it doesn't look like a fake but I can't tell for sure.

What are my best next steps? Thanks!

Edit....Forgot to ask: Can I revert back to my 310 now that I have a disabled disk? Was the 310 just not reporting these problems?

donnie-diagnostics-20190821-0128.zip

Edited August 21, 2019 by sdcp73

JorgeB · August 21, 2019

5 hours ago, sdcp73 said:

To support my previous statement about the H310, here is the article I read regarding queue depth. Admittedly, I don't fully understand this.

http://www.yellow-bricks.com/2014/04/17/disk-controller-features-and-queue-depth/

This is with the original firmware, and it doesn't mean "it can't handle them", just that it could be slower. If you use the LSI firmware, it will behave as an LSI HBA.

Problems with multiple disks:

Aug 20 19:14:18 Donnie kernel: sd 2:0:3:0: attempting task abort! scmd(000000001a1f5085)
Aug 20 19:14:18 Donnie kernel: sd 2:0:3:0: [sde] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Aug 20 19:14:18 Donnie kernel: scsi target2:0:3: handle(0x000d), sas_address(0x500056b36789abe5), phy(5)
Aug 20 19:14:18 Donnie kernel: scsi target2:0:3: enclosure logical id(0x500056b36789abff), slot(10)
Aug 20 19:14:19 Donnie kernel: sd 2:0:3:0: device_block, handle(0x000d)
Aug 20 19:14:21 Donnie kernel: sd 2:0:3:0: device_unblock and setting to running, handle(0x000d)
Aug 20 19:14:21 Donnie kernel: sd 2:0:3:0: [sde] Synchronizing SCSI cache
Aug 20 19:14:22 Donnie kernel: sd 2:0:3:0: task abort: SUCCESS scmd(000000001a1f5085)
Aug 20 19:14:22 Donnie kernel: sd 2:0:3:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Aug 20 19:14:22 Donnie kernel: mpt2sas_cm0: removing handle(0x000d), sas_addr(0x500056b36789abe5)
Aug 20 19:14:22 Donnie kernel: mpt2sas_cm0: enclosure logical id(0x500056b36789abff), slot(10)
Aug 20 19:14:24 Donnie kernel: scsi 2:0:15:0: Direct-Access     ATA      HGST HDN724040AL A5E0 PQ: 0 ANSI: 6
Aug 20 19:14:24 Donnie kernel: scsi 2:0:15:0: SATA: handle(0x000d), sas_addr(0x500056b36789abe5), phy(5), device_name(0x0000000000000000)
Aug 20 19:14:24 Donnie kernel: scsi 2:0:15:0: enclosure logical id (0x500056b36789abff), slot(10)
Aug 20 19:14:24 Donnie kernel: scsi 2:0:15:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Aug 20 19:14:24 Donnie kernel: sd 2:0:15:0: Power-on or device reset occurred
Aug 20 19:14:24 Donnie kernel: sd 2:0:15:0: Attached scsi generic sg4 type 0
Aug 20 19:14:24 Donnie kernel: sd 2:0:15:0: [sdp] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
Aug 20 19:14:24 Donnie kernel: sd 2:0:15:0: [sdp] 4096-byte physical blocks
Aug 20 19:14:25 Donnie kernel: sd 2:0:15:0: [sdp] Write Protect is off
Aug 20 19:14:25 Donnie kernel: sd 2:0:15:0: [sdp] Mode Sense: 7f 00 10 08
Aug 20 19:14:25 Donnie kernel: sd 2:0:15:0: [sdp] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 20 19:14:25 Donnie kernel: sdp: sdp1
Aug 20 19:14:25 Donnie kernel: sd 2:0:15:0: [sdp] Attached SCSI disk
Aug 20 19:14:25 Donnie unassigned.devices: Disk with serial 'HGST_HDN724040ALE640_PK1334PEKDSG2S', mountpoint 'HGST_HDN724040ALE640_PK1334PEKDSG2S' is not set to auto mount and will not be mounted...
Aug 20 19:14:28 Donnie shfs: share cache full
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Aug 20 19:14:29 Donnie kernel: sd 2:0:4:0: attempting task abort! scmd(000000001a1f5085)
Aug 20 19:14:29 Donnie kernel: sd 2:0:4:0: [sdf] tag#0 CDB: opcode=0x85 85 08 2e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
Aug 20 19:14:29 Donnie kernel: scsi target2:0:4: handle(0x000e), sas_address(0x500056b36789abe6), phy(6)
Aug 20 19:14:29 Donnie kernel: scsi target2:0:4: enclosure logical id(0x500056b36789abff), slot(8)
Aug 20 19:14:30 Donnie kernel: sd 2:0:4:0: device_block, handle(0x000e)
Aug 20 19:14:30 Donnie kernel: sd 2:0:4:0: task abort: SUCCESS scmd(000000001a1f5085)
Aug 20 19:14:30 Donnie kernel: sd 2:0:4:0: device_unblock and setting to running, handle(0x000e)
Aug 20 19:14:37 Donnie shfs: share cache full
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Aug 20 19:14:38 Donnie kernel: sd 2:0:6:0: attempting task abort! scmd(000000001a1f5085)
Aug 20 19:14:38 Donnie kernel: sd 2:0:6:0: [sdh] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Aug 20 19:14:38 Donnie kernel: scsi target2:0:6: handle(0x0010), sas_address(0x500056b36789abe8), phy(8)
Aug 20 19:14:38 Donnie kernel: scsi target2:0:6: enclosure logical id(0x500056b36789abff), slot(2)
Aug 20 19:14:39 Donnie kernel: sd 2:0:6:0: task abort: SUCCESS scmd(000000001a1f5085)
Aug 20 19:14:47 Donnie shfs: share cache full
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Aug 20 19:14:49 Donnie kernel: sd 2:0:10:0: attempting task abort! scmd(000000005f22b75c)
Aug 20 19:14:49 Donnie kernel: sd 2:0:10:0: [sdl] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Aug 20 19:14:49 Donnie kernel: scsi target2:0:10: handle(0x0014), sas_address(0x500056b36789abec), phy(12)
Aug 20 19:14:49 Donnie kernel: scsi target2:0:10: enclosure logical id(0x500056b36789abff), slot(1)
Aug 20 19:14:50 Donnie kernel: sd 2:0:10:0: task abort: SUCCESS scmd(000000005f22b75c)
Aug 20 19:14:57 Donnie shfs: share cache full
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Aug 20 19:14:59 Donnie kernel: sd 2:0:13:0: attempting task abort! scmd(0000000016af462f)
Aug 20 19:14:59 Donnie kernel: sd 2:0:13:0: [sdo] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Aug 20 19:14:59 Donnie kernel: scsi target2:0:13: handle(0x0017), sas_address(0x500056b36789abef), phy(15)
Aug 20 19:14:59 Donnie kernel: scsi target2:0:13: enclosure logical id(0x500056b36789abff), slot(3)
Aug 20 19:15:00 Donnie kernel: sd 2:0:13:0: task abort: SUCCESS scmd(0000000016af462f)

This suggests a cable/power problem or an HBA problem. The best way to check if the HBA is a fake is to contact LSI support so they can check the serial number.
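To see how widespread the aborts are, the syslog can be summarized with a short script. This is a rough sketch, not a definitive tool: it assumes the log lines look like the excerpt above, and the helper names (`count_aborts`, `decode_passthrough16`) are made up for illustration. It also decodes the aborted CDB on the assumption that opcode 0x85 is ATA PASS-THROUGH (16), where byte 4 carries the low features byte and byte 14 the ATA command. Read that way, the aborted commands in the excerpt appear to be SMART requests (ATA command 0xB0).

```python
import re
from collections import Counter

# Two abort events copied from the syslog excerpt in this thread.
SAMPLE = [
    "Aug 20 19:14:18 Donnie kernel: sd 2:0:3:0: attempting task abort! scmd(000000001a1f5085)",
    "Aug 20 19:14:18 Donnie kernel: sd 2:0:3:0: [sde] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00",
    "Aug 20 19:14:29 Donnie kernel: sd 2:0:4:0: attempting task abort! scmd(000000001a1f5085)",
    "Aug 20 19:14:29 Donnie kernel: sd 2:0:4:0: [sdf] tag#0 CDB: opcode=0x85 85 08 2e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00",
]

ABORT_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): attempting task abort!")
CDB_RE = re.compile(r"\[(sd[a-z]+)\] tag#\d+ CDB: opcode=0x85 ((?:[0-9a-f]{2} ?)+)")

# SMART sub-commands selected by the features byte when the ATA command is 0xB0.
SMART_FEATURES = {0xD0: "SMART READ DATA", 0xD8: "SMART ENABLE OPERATIONS"}


def count_aborts(lines):
    """Count 'attempting task abort' events per SCSI address (host:channel:id:lun)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def decode_passthrough16(line):
    """Return (device, ata_command, description) for a 16-byte pass-through CDB line."""
    match = CDB_RE.search(line)
    if not match:
        return None
    cdb = bytes.fromhex(match.group(2).strip())
    if len(cdb) != 16:
        return None
    features, command = cdb[4], cdb[14]
    if command == 0xB0:
        description = SMART_FEATURES.get(features, "SMART (other sub-command)")
    else:
        description = "not a SMART command"
    return match.group(1), command, description


if __name__ == "__main__":
    for address, n in count_aborts(SAMPLE).items():
        print(address, "aborts:", n)
    for line in SAMPLE:
        decoded = decode_passthrough16(line)
        if decoded:
            print(decoded)
```

If the same pattern shows up across many slots at once, as it does in the excerpt, that points more toward something shared (cable, backplane, power, or HBA) than toward one bad disk.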

sdcp73 · August 22, 2019

I have not had a chance to verify the card yet, I hope to accomplish that soon. I have found documentation on how to flash the integrated H310, seems to be a fairly recent post. H310 Flash Procedure for anyone else looking. Will report back.

Johnnie, thanks for all the advice.

sdcp73 · August 26, 2019

I've gone back to the H310 Mini with the new firmware installed. I made it through parity with no CRC errors. It took much longer, reporting 70MB/s average... about 20 hours longer. After my two HGSTs spun down it dropped them to unassigned devices with read errors, just like before. I seem to have made no progress, so I started testing the RAM. One stick came out with an error, so I'm working on swapping out the suspect stick.

Found a post that may help my speed issue, enabling write cache. Has anyone else done this? Had issues with it? Is this still relevant?

Drive write speeds really slow

Thanks

donnie-diagnostics-20190826-1821.zip

Edited August 26, 2019 by sdcp73

sdcp73 · September 5, 2019

I have replaced the RAM and tried different combinations of HBA and cables, always with the same result. When either of the HGST drives spins down it goes to unassigned devices. If I manually spin the drive up it comes back as disabled with a read error. I still think it is odd that it is my only two Deskstar drives and they started doing this at the same time. Looking for advice on my next move. I've also attached diags, just in case.

I can replace the drives. Try more cables. Should I try these drives in different bays? Anything else?

Thanks!

donnie-diagnostics-20190905-0200.zip

Vr2Io · September 5, 2019

The drives are getting PHY resets. Please boot Unraid in safe mode and try again.

If there is no problem after that, I would suggest uninstalling the "Disk Location" plugin.

Edited September 5, 2019 by Benson
