[SOLVED] Parity check takes forever


Hi all,

 

Sunday morning at 1:30am, the parity check started. Now, Monday evening, it's still going and it doesn't look like it's going to be done anytime soon:

 

[Attached image: image.png]

 

Disk 1 was reported to have read errors, which it has kinda had for several months, but I just haven't had the time to swap the disk out. I guess that time is approaching fast...

 

[Attached image: unraiddiskarray.png]

 

Usually, the parity check takes somewhere between 11.5 and 13.5 hours, which is fine and expected for the parity size and the disks being spinning rust, but this is just ridiculous.

 

What may cause this? My understanding is that the parity calculation starts with disk 1 and works its way through the disks in the array, but I'm probably mistaken. In any case, it seems that something is afoot and I can't figure out why.

 

System:

  • Dell PowerEdge T430, 144GB RAM, 2xXeon E5-2640 v3 8c16t (16c/32t).
  • 8xHGST Ultrastar 6TB drives in array.
  • 250GB cache disk for docker container data.
  • 512GB NVMe SSD for downloads.
  • 2TB SATA SSD for VMs.
  • Unraid 6.9.2 Pro.

 

There are quite a few docker containers and VMs running. The VMs have a dedicated disk, the docker containers have a dedicated disk for the docker data, but some of them download and write data to the array (via the cache disk) when Mover runs every night. The data is stored temporarily on a dedicated SSD just for downloads until Mover runs.

 

I'm assuming that such write activity will impact the duration of the parity check as things need to be recalculated, but 57-58 days for a parity check seems somewhat egregious.
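For a rough sense of scale, the usual and the projected durations can be turned into average throughput. This is only a back-of-the-envelope sketch: the function name is made up for illustration, the parity disk is assumed to be a nominal 6 TB, and the 57-58 day figure is treated as if it covered a full pass at a constant rate (the estimate shown by Unraid may well cover only the remaining portion).

```python
def average_speed_mb_s(size_bytes: int, duration_hours: float) -> float:
    """Average throughput in MB/s (10^6 bytes) needed to read size_bytes in duration_hours."""
    return size_bytes / (duration_hours * 3600) / 1_000_000


PARITY_SIZE = 6 * 10**12  # assumption: nominal 6 TB parity disk

normal = average_speed_mb_s(PARITY_SIZE, 12.5)          # middle of the usual 11.5-13.5 h range
projected = average_speed_mb_s(PARITY_SIZE, 57.5 * 24)  # the 57-58 day estimate

print(f"normal:    {normal:.1f} MB/s")
print(f"projected: {projected:.2f} MB/s")
print(f"slowdown:  roughly {normal / projected:.0f}x")
```

A slowdown of about two orders of magnitude is far more than a nightly Mover run would plausibly explain, which points towards a disk that is stalling on reads rather than ordinary write contention.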

 

As I'm planning to replace disk 1 soon, I don't want to stop the parity check as I need that parity when the disk contents are to be rebuilt after the disk replacement.

 

Will stopping the docker services help in this situation?

Edited by reverend remiel
Link to comment

So I poked around a bit in the syslog and it seems that disk1 is on its way out:

 

Mar  6 17:24:41 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:27:04 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:30:45 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:33:10 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:39:13 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:43:52 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:47:03 aram kernel: sd 1:0:1:0: [sdc] tag#228 CDB: opcode=0x88 88 00 00 00 00 02 98 52 4b b0 00 00 02 00 00 00
Mar  6 17:47:03 aram kernel: sd 1:0:1:0: [sdc] tag#217 CDB: opcode=0x88 88 00 00 00 00 02 98 52 49 b0 00 00 02 00 00 00
Mar  6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 OCR is requested due to IO timeout!!
Mar  6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 SCSI host state: 5  SCSI host busy: 2  FW outstanding: 0
Mar  6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 scmd: (0x00000000d20e7c46)  retries: 0x0  allowed: 0x5
Mar  6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 CDB: opcode=0x88 88 00 00 00 00 02 98 52 49 b0 00 00 02 00 00 00
Mar  6 17:48:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar  6 17:48:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar  6 17:49:46 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:53:22 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:56:18 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#421 CDB: opcode=0x88 88 00 00 00 00 02 98 53 45 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#420 CDB: opcode=0x88 88 00 00 00 00 02 98 53 4d 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#419 CDB: opcode=0x88 88 00 00 00 00 02 98 53 57 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#418 CDB: opcode=0x88 88 00 00 00 00 02 98 53 53 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#417 CDB: opcode=0x88 88 00 00 00 00 02 98 53 49 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#416 CDB: opcode=0x88 88 00 00 00 00 02 98 53 51 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#414 CDB: opcode=0x88 88 00 00 00 00 02 98 53 43 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#413 CDB: opcode=0x88 88 00 00 00 00 02 98 53 47 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#412 CDB: opcode=0x88 88 00 00 00 00 02 98 53 4b 90 00 00 02 00 00 00
Mar  6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#411 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 OCR is requested due to IO timeout!!
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 SCSI host state: 5  SCSI host busy: 10  FW outstanding: 10
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 scmd: (0x00000000b70a797a)  retries: 0x1  allowed: 0x5
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 Request descriptor details:
Mar  6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 RequestFlags:0xc  MSIxIndex:0x3  SMID:0x19c  LMID:0x0  DevHandle:0x11
Mar  6 17:58:50 aram kernel: IO request frame:
Mar  6 17:58:50 aram kernel: 00000000: 00000011 00000000 00000000 ffc69a20 00600002 00000020 00000000 00040000
Mar  6 17:58:50 aram kernel: 00000020: 00000000 00004010 00000000 00000000 00000000 00000000 00000000 02000000
Mar  6 17:58:50 aram kernel: 00000040: 00000088 53980200 00009055 00000002 00000000 00000000 00000000 00000000
Mar  6 17:58:50 aram kernel: 00000060: 005b0000 00010000 00000000 00000000 00000000 00000000 00004010 00000000
Mar  6 17:58:50 aram kernel: 00000080: a3a80000 00000000 00001000 00000000 a3a81000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000a0: a3a82000 00000000 00001000 00000000 a3a83000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000c0: a3a84000 00000000 00001000 00000000 a3a85000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000e0: a3a86000 00000000 00001000 00000000 ffac4000 00000000 00000390 80000000
Mar  6 17:58:50 aram kernel: Chain frame:
Mar  6 17:58:50 aram kernel: 00000000: a3a87000 00000000 00001000 00000000 a3a88000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000020: a3a89000 00000000 00001000 00000000 a3a8a000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000040: a3a8b000 00000000 00001000 00000000 a3a8c000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000060: a3a8d000 00000000 00001000 00000000 a3a8e000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000080: a3a8f000 00000000 00001000 00000000 a3a90000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000a0: a3a91000 00000000 00001000 00000000 a3a92000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000c0: a3a93000 00000000 00001000 00000000 a3a94000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000000e0: a3a95000 00000000 00001000 00000000 a3a96000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000100: a3a97000 00000000 00001000 00000000 a3a98000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000120: a3a99000 00000000 00001000 00000000 a3a9a000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000140: a3a9b000 00000000 00001000 00000000 a3a9c000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000160: a3a9d000 00000000 00001000 00000000 a3a9e000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000180: a3a9f000 00000000 00001000 00000000 a3aa0000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000001a0: a3aa1000 00000000 00001000 00000000 a3aa2000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000001c0: a3aa3000 00000000 00001000 00000000 a3aa4000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000001e0: a3aa5000 00000000 00001000 00000000 a3aa6000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000200: a3aa7000 00000000 00001000 00000000 a3aa8000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000220: a3aa9000 00000000 00001000 00000000 a3aaa000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000240: a3aab000 00000000 00001000 00000000 a3aac000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000260: a3aad000 00000000 00001000 00000000 a3aae000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000280: a3aaf000 00000000 00001000 00000000 a3ab0000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000002a0: a3ab1000 00000000 00001000 00000000 a3ab2000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000002c0: a3ab3000 00000000 00001000 00000000 a3ab4000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 000002e0: a3ab5000 00000000 00001000 00000000 a3ab6000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000300: a3ab7000 00000000 00001000 00000000 a3ab8000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000320: a3ab9000 00000000 00001000 00000000 a3aba000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000340: a3abb000 00000000 00001000 00000000 a3abc000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000360: a3abd000 00000000 00001000 00000000 a3abe000 00000000 00001000 00000000
Mar  6 17:58:50 aram kernel: 00000380: a3abf000 00000000 00001000 40000000 00000000 00000000 00000000 00000000
Mar  6 17:58:50 aram kernel: 000003a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 17:58:50 aram kernel: 000003c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 17:58:50 aram kernel: 000003e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 17:58:50 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar  6 17:58:50 aram kernel: megaraid_sas 0000:03:00.0: [ 0]waiting for 10 commands to complete for scsi1
Mar  6 17:58:55 aram kernel: megaraid_sas 0000:03:00.0: [ 5]waiting for 9 commands to complete for scsi1
Mar  6 17:59:00 aram kernel: megaraid_sas 0000:03:00.0: [10]waiting for 8 commands to complete for scsi1
Mar  6 17:59:06 aram kernel: megaraid_sas 0000:03:00.0: [15]waiting for 6 commands to complete for scsi1
Mar  6 17:59:11 aram kernel: megaraid_sas 0000:03:00.0: [20]waiting for 6 commands to complete for scsi1
Mar  6 17:59:16 aram kernel: megaraid_sas 0000:03:00.0: [25]waiting for 5 commands to complete for scsi1
Mar  6 17:59:21 aram kernel: megaraid_sas 0000:03:00.0: [30]waiting for 5 commands to complete for scsi1
Mar  6 17:59:26 aram kernel: megaraid_sas 0000:03:00.0: [35]waiting for 4 commands to complete for scsi1
Mar  6 17:59:31 aram kernel: megaraid_sas 0000:03:00.0: [40]waiting for 3 commands to complete for scsi1
Mar  6 17:59:36 aram kernel: megaraid_sas 0000:03:00.0: [45]waiting for 2 commands to complete for scsi1
Mar  6 17:59:41 aram kernel: megaraid_sas 0000:03:00.0: [50]waiting for 2 commands to complete for scsi1
Mar  6 17:59:47 aram kernel: megaraid_sas 0000:03:00.0: [55]waiting for 1 commands to complete for scsi1
Mar  6 17:59:49 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar  6 18:01:56 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 OCR is requested due to IO timeout!!
Mar  6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 SCSI host state: 5  SCSI host busy: 4  FW outstanding: 0
Mar  6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 scmd: (0x00000000b70a797a)  retries: 0x2  allowed: 0x5
Mar  6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar  6 18:02:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar  6 18:02:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar  6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#468 CDB: opcode=0x88 88 00 00 00 00 02 98 53 b7 90 00 00 02 00 00 00
Mar  6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#467 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a5 90 00 00 02 00 00 00
Mar  6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#466 CDB: opcode=0x88 88 00 00 00 00 02 98 53 9f 90 00 00 02 00 00 00
Mar  6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#465 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a3 90 00 00 02 00 00 00
Mar  6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#464 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a1 90 00 00 02 00 00 00
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 OCR is requested due to IO timeout!!
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 SCSI host state: 5  SCSI host busy: 9  FW outstanding: 8
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 scmd: (0x00000000415c5b1d)  retries: 0x1  allowed: 0x5
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a1 90 00 00 02 00 00 00
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 Request descriptor details:
Mar  6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 RequestFlags:0xc  MSIxIndex:0x3  SMID:0x1d1  LMID:0x0  DevHandle:0x11
Mar  6 18:04:26 aram kernel: IO request frame:
Mar  6 18:04:26 aram kernel: 00000000: 00000011 00000000 00000000 ffc6ae00 00600002 00000020 00000000 00040000
Mar  6 18:04:26 aram kernel: 00000020: 00000000 00004010 00000000 00000000 00000000 00000000 00000000 02000000
Mar  6 18:04:26 aram kernel: 00000040: 00000088 53980200 000090a1 00000002 00000000 00000000 00000000 00000000
Mar  6 18:04:26 aram kernel: 00000060: 005b0000 00010000 00000000 00000000 00000000 00000000 00004010 00000000
Mar  6 18:04:26 aram kernel: 00000080: a0ac0000 00000000 00001000 00000000 a0ac1000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000a0: a0ac2000 00000000 00001000 00000000 a0ac3000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000c0: a0ac4000 00000000 00001000 00000000 a0ac5000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000e0: a0ac6000 00000000 00001000 00000000 ffa8f000 00000000 00000390 80000000
Mar  6 18:04:26 aram kernel: Chain frame:
Mar  6 18:04:26 aram kernel: 00000000: a0ac7000 00000000 00001000 00000000 a0ac8000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000020: a0ac9000 00000000 00001000 00000000 a0aca000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000040: a0acb000 00000000 00001000 00000000 a0acc000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000060: a0acd000 00000000 00001000 00000000 a0ace000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000080: a0acf000 00000000 00001000 00000000 a0ad0000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000a0: a0ad1000 00000000 00001000 00000000 a0ad2000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000c0: a0ad3000 00000000 00001000 00000000 a0ad4000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000000e0: a0ad5000 00000000 00001000 00000000 a0ad6000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000100: a0ad7000 00000000 00001000 00000000 a0ad8000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000120: a0ad9000 00000000 00001000 00000000 a0ada000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000140: a0adb000 00000000 00001000 00000000 a0adc000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000160: a0add000 00000000 00001000 00000000 a0ade000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000180: a0adf000 00000000 00001000 00000000 a0ae0000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000001a0: a0ae1000 00000000 00001000 00000000 a0ae2000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000001c0: a0ae3000 00000000 00001000 00000000 a0ae4000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000001e0: a0ae5000 00000000 00001000 00000000 a0ae6000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000200: a0ae7000 00000000 00001000 00000000 a0ae8000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000220: a0ae9000 00000000 00001000 00000000 a0aea000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000240: a0aeb000 00000000 00001000 00000000 a0aec000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000260: a0aed000 00000000 00001000 00000000 a0aee000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000280: a0aef000 00000000 00001000 00000000 a0af0000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000002a0: a0af1000 00000000 00001000 00000000 a0af2000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000002c0: a0af3000 00000000 00001000 00000000 a0af4000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 000002e0: a0af5000 00000000 00001000 00000000 a0af6000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000300: a0af7000 00000000 00001000 00000000 a0af8000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000320: a0af9000 00000000 00001000 00000000 a0afa000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000340: a0afb000 00000000 00001000 00000000 a0afc000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000360: a0afd000 00000000 00001000 00000000 a0afe000 00000000 00001000 00000000
Mar  6 18:04:26 aram kernel: 00000380: a0aff000 00000000 00001000 40000000 00000000 00000000 00000000 00000000
Mar  6 18:04:26 aram kernel: 000003a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 18:04:26 aram kernel: 000003c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 18:04:26 aram kernel: 000003e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar  6 18:04:26 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar  6 18:04:26 aram kernel: megaraid_sas 0000:03:00.0: [ 0]waiting for 8 commands to complete for scsi1
Mar  6 18:04:31 aram kernel: megaraid_sas 0000:03:00.0: [ 5]waiting for 4 commands to complete for scsi1
Mar  6 18:04:36 aram kernel: megaraid_sas 0000:03:00.0: [10]waiting for 1 commands to complete for scsi1
Mar  6 18:04:37 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar  6 18:06:23 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 18:09:29 aram kernel: sd 1:0:1:0: [sdc] tag#408 CDB: opcode=0x88 88 00 00 00 00 02 98 54 43 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#407 CDB: opcode=0x88 88 00 00 00 00 02 98 54 51 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#405 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4f 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#402 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4d 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#401 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4b 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#400 CDB: opcode=0x88 88 00 00 00 00 02 98 54 49 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#399 CDB: opcode=0x88 88 00 00 00 00 02 98 54 47 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#398 CDB: opcode=0x88 88 00 00 00 00 02 98 54 45 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#200 CDB: opcode=0x88 88 00 00 00 00 02 98 54 53 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#208 CDB: opcode=0x88 88 00 00 00 00 02 98 54 55 90 00 00 02 00 00 00
Mar  6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#216 CDB: opcode=0x88 88 00 00 00 00 02 98 54 57 90 00 00 02 00 00 00
Mar  6 18:09:50 aram kernel: sd 1:0:1:0: [sdc] tag#224 CDB: opcode=0x88 88 00 00 00 00 02 98 54 59 90 00 00 02 00 00 00
Mar  6 18:10:01 aram kernel: sd 1:0:1:0: [sdc] tag#196 CDB: opcode=0x88 88 00 00 00 00 02 98 54 5b 90 00 00 02 00 00 00
Mar  6 18:10:12 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar  6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 OCR is requested due to IO timeout!!
Mar  6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 SCSI host state: 5  SCSI host busy: 13  FW outstanding: 0
Mar  6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 scmd: (0x000000005fb8537f)  retries: 0x0  allowed: 0x5
Mar  6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 CDB: opcode=0x88 88 00 00 00 00 02 98 54 5b 90 00 00 02 00 00 00
Mar  6 18:10:23 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar  6 18:10:23 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
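The `CDB: opcode=0x88` lines in the log above are SCSI READ(16) commands, and the bytes that follow encode where on the disk each timed-out read was aimed. The sketch below decodes one of them; the function name is hypothetical, the layout used is the standard READ(16) one (bytes 2-9 are the logical block address, bytes 10-13 the transfer length in blocks), and the block size and capacity constants are taken from the smartctl output posted later in this thread.

```python
from typing import Tuple

LOGICAL_BLOCK_SIZE = 512            # "Logical block size" reported by smartctl
DISK_CAPACITY = 6_001_175_126_016   # "User Capacity" reported by smartctl


def decode_read16(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, block_count) from a READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks


# First CDB from the syslog excerpt above.
lba, blocks = decode_read16("88 00 00 00 00 02 98 52 4b b0 00 00 02 00 00 00")
offset = lba * LOGICAL_BLOCK_SIZE

print(f"LBA {lba:#x}, {blocks} blocks ({blocks * LOGICAL_BLOCK_SIZE // 1024} KiB)")
print(f"about {offset / DISK_CAPACITY:.1%} into the disk")
```

For the first CDB in the log this decodes to a 512-block (256 KiB) read at LBA 0x298524bb0. All the logged CDBs share the same high-order address bytes, which suggests the timeouts cluster in one region of the disk rather than being spread across it.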

 

Well, this is new and exciting!

 

What are the odds that the parity check will complete before the disk gives up the ghost completely?

 

Complete diagnostics attached.

aram-diagnostics-20230306-2047.zip

Link to comment
1 minute ago, itimpi said:

It all depends on whether you have valid parity - if you have then the answer is yes.

 

Looking at the diagnostics it appears that disk1 may have dropped offline.

 

How do I know if the parity is valid, then? Just wait for the parity check to complete?

 

The last parity check was 2 months ago and that passed with no errors, not even read errors on disk 1. Granted, disk 1 has reported read errors on a few occasions and that prompted me to buy a spare drive, but the errors went away and everything seemed fine until now.

 

I also enabled the SAS spindown plugin a while back, which is probably what killed the drive now. Oh, the irony.

Link to comment
20 minutes ago, reverend remiel said:

How do I know if the parity is valid, then? Just wait for the parity check to complete?

The parity check is going to be meaningless if the disk has dropped offline :( Is disk1 still showing as present or has it got a red 'x' against it?

 

I think the only way to know for certain is to see whether Unraid can emulate disk1 correctly when it is not in the system. I would wait for @JorgeB to get online as the best expert on these things to confirm, but I think that is going to have to be the way forward. You definitely want to keep disk1 as intact as possible at this point in case it is needed for any recovery purposes. Do you have a spare drive you could use to replace it?

Link to comment
4 minutes ago, itimpi said:

The parity check is going to be meaningless if the disk has dropped offline :( Is disk1 still showing as present or has it got a red 'x' against it?

 

I think the only way to know for certain is to see whether Unraid can emulate disk1 correctly when it is not in the system. I would wait for @JorgeB to get online as the best expert on these things to confirm, but I think that is going to have to be the way forward. You definitely want to keep disk1 as intact as possible at this point in case it is needed for any recovery purposes. Do you have a spare drive you could use to replace it?

 

The disk is still present in the system and is occasionally responding to commands, so it's not completely gone yet. I can also browse the disk's contents directly via /mnt/disk1, which is good.

 

I have a spare 6TB disk of the same type that I bought probably a year ago "just to make sure", so once I feel confident that the disk's contents are able to be rebuilt, I'm ready to swap them.

Edited by reverend remiel
Link to comment
Just now, reverend remiel said:

I can also browse the disk's contents directly via /mnt/disk1, which is good.

Even without the disk present you would be able to browse its contents if parity is valid, as Unraid will try to 'emulate' it using the combination of the other drives plus parity. That is one of the reasons I asked if the drive had a red 'x' against it: if that is the case, Unraid would be ignoring the physical drive and any contents you see would be from the emulated drive.
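To make the idea of emulation concrete, here is a toy sketch of how a missing disk can be reconstructed from parity plus the remaining disks. It assumes a single parity disk computed as a byte-wise XOR across the data disks; the helper name and the 4-byte "disks" are invented for illustration, and a second parity disk (if present) uses a different calculation that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy 4-byte "disks"; a real array does this sector by sector across whole drives.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
disk3 = bytes([0x00, 0xFF, 0x10, 0x20])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk1 goes missing: rebuild its contents from parity plus the surviving disks.
emulated_disk1 = xor_blocks([parity, disk2, disk3])
print(emulated_disk1 == disk1)
```

The same arithmetic shows why valid parity matters: if the parity block is stale, or any surviving disk returns wrong data, the XOR still produces a result, just not the original contents.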

Link to comment
1 minute ago, itimpi said:

Even without the disk present you would be able to browse its contents if parity is valid, as Unraid will try to 'emulate' it using the combination of the other drives plus parity. That is one of the reasons I asked if the drive had a red 'x' against it: if that is the case, Unraid would be ignoring the physical drive and any contents you see would be from the emulated drive.

Gotcha.

 

This is how it looks at the moment:

 

[Attached image: image.png]

 

Parity status:

 

[Attached image: image.png]

 

It picks up the pace every now and then before dropping back down to sub-30KB/sec speeds. I'm afraid that I have more disks on the way out, so I think I'll buy a set of new 18TB drives to replace these once this situation is sorted.

 

My understanding of the coming events would be:

  1. Unassign disk1
  2. Replace disk1
  3. Let parity rebuild disk1

Then, when the time comes to replace the disks:

  1. Unassign parity disk
  2. Replace parity disk with new 18TB drive
  3. Rebuild parity
  4. Replace array disk1 with larger disk, let rebuild from parity (or attach one new drive as unassigned device, copy everything over, replace the disk and manually copy it back onto the new disk1)
  5. Move data from disk2 to disk1
  6. Replace disk2 with larger disk
  7. Move data from disk3 and disk4 to disks 1 and 2
  8. Replace disks 3 and 4 with larger ones
  9. Move data from disks 5, 6 and 7 to disks 1, 2, 3 and 4
  10. Replace disks 5, 6 and 7
  11. Run unbalance plugin to spread data evenly across drives

I like moving things around myself so that I know what's going on, and that probably interferes with some ways Unraid prefers to do things, and I don't know which way is better - rebuilding from parity sounds like it'll take a very long time and copying the data manually sounds like it'll be faster. Maybe it won't.

 

Link to comment

Okay, SMART data collection is done:

 

root@aram:~# smartctl --all /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.10.28-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060AL5210
Revision:             A907
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2424dba1c
Serial number:        NAHBS4ZY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Mar  7 10:29:31 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=62]

Current Drive Temperature:     48 C
Drive Trip Temperature:        85 C

Manufactured in week 52 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  621
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2578
Elements in grown defect list: 509

Vendor (Seagate Cache) information
  Blocks sent to initiator = 9574154483269632

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  1690116         0   1690116   41747712     206550.071          89
write:         0       47         0        47    2831239      32983.083           0
verify:        0    42893         0     42893   119043714       9119.288          49

Non-medium error count:      399

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   45459                 - [-   -    -]
# 2  Background short  Completed                   -   45459                 - [-   -    -]

Long (extended) Self-test duration: 48442 seconds [807.4 minutes]
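The figures that matter most in the output above are the health status line, the grown defect list, and the last column of the error counter log (total uncorrected errors). A minimal sketch for pulling those out of `smartctl --all` text for a SAS disk follows; the `summarize` function and the dictionary layout are made up for illustration, the sample lines are copied from the output above, and the parsing assumes the column layout shown there.

```python
import re

# Lines copied from the smartctl output above.
SAMPLE = """\
SMART Health Status: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=62]
Elements in grown defect list: 509
read:          0  1690116         0   1690116   41747712     206550.071          89
write:         0       47         0        47    2831239      32983.083           0
verify:        0    42893         0     42893   119043714       9119.288          49
Non-medium error count:      399
"""


def summarize(text: str) -> dict:
    """Extract a few failure indicators from `smartctl --all` output for a SAS disk."""
    summary = {"health": None, "grown_defects": None, "uncorrected": {}}
    for line in text.splitlines():
        if line.startswith("SMART Health Status:"):
            summary["health"] = line.split(":", 1)[1].strip()
        elif line.startswith("Elements in grown defect list:"):
            summary["grown_defects"] = int(line.split(":", 1)[1])
        else:
            match = re.match(r"^(read|write|verify):\s+(.*)$", line)
            if match:
                # The last column of the error counter log is "Total uncorrected errors".
                summary["uncorrected"][match.group(1)] = int(match.group(2).split()[-1])
    return summary


result = summarize(SAMPLE)
print(result["health"])
print("grown defects:", result["grown_defects"])
print("uncorrected:", result["uncorrected"])
```

A healthy SAS disk normally reports `SMART Health Status: OK`, so anything else on that line, together with non-zero uncorrected read or verify errors, is a strong hint that the drive should be replaced.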

 

Link to comment
1 minute ago, JorgeB said:

 

 

 

These are no good, it should be replaced.

That's the plan, I am just waiting to get home from work :)

 

While I wait, is it safe to cancel the parity check? As I understand it, the chances of it completing at this point are slim to none.

Edited by reverend remiel
Link to comment
54 minutes ago, reverend remiel said:

While I wait, is it safe to cancel the parity check? As I understand it, the chances of it completing at this point are slim to none.

Definitely cancel it. You rarely want to be running a parity check if you think you have drives playing up.

Link to comment

Time for an update.

 

The failed disk has been pulled and the new one installed. The disk's contents are being reconstructed as we speak, and the rebuild is estimated to take just shy of 8 hours.

 

So far, so good.

 

The VMs are running as I need them for DNS and adblocking and things of that nature, but I'll leave the docker containers disabled for now and deactivate tonight's Mover run as well, to make sure no new data interferes with the reconstruction.

 

Thanks for all your help, highly appreciated.

 

 

Link to comment
19 minutes ago, itimpi said:

If you have the array running in normal mode then you should be able to view the contents of what is being reconstructed even though it is still in progress. Can you do this and does it look like what you expect?

Yep, it looks pretty much like it did on Saturday, so I think this will be fine.

 

I'll still keep the docker containers stopped until the reconstruction is complete just to ease my paranoia. The fewer things that can go wrong at the same time, the better :)

Link to comment
  • rainformpurple changed the title to [SOLVED] Parity check takes forever
