rainformpurple Posted March 6, 2023 (edited March 8, 2023 by reverend remiel)

Hi all,

Sunday morning at 1:30am, the parity check started. Now, Monday evening, it's still going and it doesn't look like it will be done anytime soon. Disk 1 was reported to have read errors, which it has had intermittently for several months, but I just haven't had the time to swap the disk out. I guess that time is approaching fast...

Usually, the parity check takes somewhere between 11,5 and 13,5 hours, which is fine and expected given the parity size and the disks being spinning rust, but this is just ridiculous. What may cause this? My understanding is that the parity calculation starts with disk 1 and works its way through the disks in the array, but I'm probably mistaken. In any case, something is afoot and I can't figure out what.

System: Dell PowerEdge T430, 144GB RAM, 2x Xeon E5-2640 v3 (8c/16t each, 16c/32t total). 8x HGST Ultrastar 6TB drives in the array. 250GB cache disk for docker container data. 512GB NVMe SSD for downloads. 2TB SATA SSD for VMs. Unraid 6.9.2 Pro.

There are quite a few docker containers and VMs running. The VMs have a dedicated disk, and the docker containers have a dedicated disk for the docker data, but some of them download and write data to the array (via the cache disk) when Mover runs every night. The data is stored temporarily on a dedicated SSD just for downloads until Mover runs. I'm assuming that such write activity will impact the duration of the parity check as things need to be recalculated, but 57-58 days for a parity check seems somewhat egregious.

As I'm planning to replace disk 1 soon, I don't want to stop the parity check, as I need that parity when the disk contents are rebuilt after the disk replacement. Will stopping the docker services help in this situation?
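For scale: a parity check has to read every sector of the parity disk once, so a rough lower bound on its duration is capacity divided by average sequential read speed. A back-of-envelope sketch (the ~130 MB/s figure is my own assumption for a typical 7200 rpm 6TB drive, not a measurement from this system):

```shell
# Rough parity-check duration estimate: capacity / average read speed.
# 130 MB/s is an assumed sustained-read average, not measured on this server.
bytes=6001175126016            # 6 TB drive capacity in bytes
speed=$((130 * 1000 * 1000))   # assumed ~130 MB/s sustained read
secs=$(( bytes / speed ))
printf 'approx %d hours\n' $(( secs / 3600 ))   # prints: approx 12 hours
```

That lines up with the normal 11,5-13,5 hour checks; at the sub-30KB/sec speeds reported later in the thread, the same arithmetic runs to months or years, so the blown-up estimate is no surprise.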
rainformpurple Posted March 6, 2023 (Author)

So I poked around a bit in the syslog, and it seems that disk1 is on its way out:

Mar 6 17:24:41 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:27:04 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:30:45 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:33:10 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:39:13 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:43:52 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:47:03 aram kernel: sd 1:0:1:0: [sdc] tag#228 CDB: opcode=0x88 88 00 00 00 00 02 98 52 4b b0 00 00 02 00 00 00
Mar 6 17:47:03 aram kernel: sd 1:0:1:0: [sdc] tag#217 CDB: opcode=0x88 88 00 00 00 00 02 98 52 49 b0 00 00 02 00 00 00
Mar 6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 OCR is requested due to IO timeout!!
Mar 6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 SCSI host state: 5 SCSI host busy: 2 FW outstanding: 0
Mar 6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 scmd: (0x00000000d20e7c46) retries: 0x0 allowed: 0x5
Mar 6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 CDB: opcode=0x88 88 00 00 00 00 02 98 52 49 b0 00 00 02 00 00 00
Mar 6 17:48:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 6 17:48:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar 6 17:49:46 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:53:22 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:56:18 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#421 CDB: opcode=0x88 88 00 00 00 00 02 98 53 45 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#420 CDB: opcode=0x88 88 00 00 00 00 02 98 53 4d 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#419 CDB: opcode=0x88 88 00 00 00 00 02 98 53 57 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#418 CDB: opcode=0x88 88 00 00 00 00 02 98 53 53 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#417 CDB: opcode=0x88 88 00 00 00 00 02 98 53 49 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#416 CDB: opcode=0x88 88 00 00 00 00 02 98 53 51 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#414 CDB: opcode=0x88 88 00 00 00 00 02 98 53 43 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#413 CDB: opcode=0x88 88 00 00 00 00 02 98 53 47 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#412 CDB: opcode=0x88 88 00 00 00 00 02 98 53 4b 90 00 00 02 00 00 00
Mar 6 17:57:55 aram kernel: sd 1:0:1:0: [sdc] tag#411 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 OCR is requested due to IO timeout!!
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 SCSI host state: 5 SCSI host busy: 10 FW outstanding: 10
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 scmd: (0x00000000b70a797a) retries: 0x1 allowed: 0x5
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 Request descriptor details:
Mar 6 17:58:50 aram kernel: sd 1:0:1:0: [sdc] tag#411 RequestFlags:0xc MSIxIndex:0x3 SMID:0x19c LMID:0x0 DevHandle:0x11
Mar 6 17:58:50 aram kernel: IO request frame:
Mar 6 17:58:50 aram kernel: 00000000: 00000011 00000000 00000000 ffc69a20 00600002 00000020 00000000 00040000
Mar 6 17:58:50 aram kernel: 00000020: 00000000 00004010 00000000 00000000 00000000 00000000 00000000 02000000
Mar 6 17:58:50 aram kernel: 00000040: 00000088 53980200 00009055 00000002 00000000 00000000 00000000 00000000
Mar 6 17:58:50 aram kernel: 00000060: 005b0000 00010000 00000000 00000000 00000000 00000000 00004010 00000000
Mar 6 17:58:50 aram kernel: 00000080: a3a80000 00000000 00001000 00000000 a3a81000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000a0: a3a82000 00000000 00001000 00000000 a3a83000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000c0: a3a84000 00000000 00001000 00000000 a3a85000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000e0: a3a86000 00000000 00001000 00000000 ffac4000 00000000 00000390 80000000
Mar 6 17:58:50 aram kernel: Chain frame:
Mar 6 17:58:50 aram kernel: 00000000: a3a87000 00000000 00001000 00000000 a3a88000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000020: a3a89000 00000000 00001000 00000000 a3a8a000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000040: a3a8b000 00000000 00001000 00000000 a3a8c000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000060: a3a8d000 00000000 00001000 00000000 a3a8e000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000080: a3a8f000 00000000 00001000 00000000 a3a90000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000a0: a3a91000 00000000 00001000 00000000 a3a92000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000c0: a3a93000 00000000 00001000 00000000 a3a94000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000000e0: a3a95000 00000000 00001000 00000000 a3a96000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000100: a3a97000 00000000 00001000 00000000 a3a98000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000120: a3a99000 00000000 00001000 00000000 a3a9a000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000140: a3a9b000 00000000 00001000 00000000 a3a9c000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000160: a3a9d000 00000000 00001000 00000000 a3a9e000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000180: a3a9f000 00000000 00001000 00000000 a3aa0000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000001a0: a3aa1000 00000000 00001000 00000000 a3aa2000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000001c0: a3aa3000 00000000 00001000 00000000 a3aa4000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000001e0: a3aa5000 00000000 00001000 00000000 a3aa6000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000200: a3aa7000 00000000 00001000 00000000 a3aa8000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000220: a3aa9000 00000000 00001000 00000000 a3aaa000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000240: a3aab000 00000000 00001000 00000000 a3aac000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000260: a3aad000 00000000 00001000 00000000 a3aae000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000280: a3aaf000 00000000 00001000 00000000 a3ab0000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000002a0: a3ab1000 00000000 00001000 00000000 a3ab2000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000002c0: a3ab3000 00000000 00001000 00000000 a3ab4000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 000002e0: a3ab5000 00000000 00001000 00000000 a3ab6000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000300: a3ab7000 00000000 00001000 00000000 a3ab8000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000320: a3ab9000 00000000 00001000 00000000 a3aba000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000340: a3abb000 00000000 00001000 00000000 a3abc000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000360: a3abd000 00000000 00001000 00000000 a3abe000 00000000 00001000 00000000
Mar 6 17:58:50 aram kernel: 00000380: a3abf000 00000000 00001000 40000000 00000000 00000000 00000000 00000000
Mar 6 17:58:50 aram kernel: 000003a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 17:58:50 aram kernel: 000003c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 17:58:50 aram kernel: 000003e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 17:58:50 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 6 17:58:50 aram kernel: megaraid_sas 0000:03:00.0: [ 0]waiting for 10 commands to complete for scsi1
Mar 6 17:58:55 aram kernel: megaraid_sas 0000:03:00.0: [ 5]waiting for 9 commands to complete for scsi1
Mar 6 17:59:00 aram kernel: megaraid_sas 0000:03:00.0: [10]waiting for 8 commands to complete for scsi1
Mar 6 17:59:06 aram kernel: megaraid_sas 0000:03:00.0: [15]waiting for 6 commands to complete for scsi1
Mar 6 17:59:11 aram kernel: megaraid_sas 0000:03:00.0: [20]waiting for 6 commands to complete for scsi1
Mar 6 17:59:16 aram kernel: megaraid_sas 0000:03:00.0: [25]waiting for 5 commands to complete for scsi1
Mar 6 17:59:21 aram kernel: megaraid_sas 0000:03:00.0: [30]waiting for 5 commands to complete for scsi1
Mar 6 17:59:26 aram kernel: megaraid_sas 0000:03:00.0: [35]waiting for 4 commands to complete for scsi1
Mar 6 17:59:31 aram kernel: megaraid_sas 0000:03:00.0: [40]waiting for 3 commands to complete for scsi1
Mar 6 17:59:36 aram kernel: megaraid_sas 0000:03:00.0: [45]waiting for 2 commands to complete for scsi1
Mar 6 17:59:41 aram kernel: megaraid_sas 0000:03:00.0: [50]waiting for 2 commands to complete for scsi1
Mar 6 17:59:47 aram kernel: megaraid_sas 0000:03:00.0: [55]waiting for 1 commands to complete for scsi1
Mar 6 17:59:49 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar 6 18:01:56 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 OCR is requested due to IO timeout!!
Mar 6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 SCSI host state: 5 SCSI host busy: 4 FW outstanding: 0
Mar 6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 scmd: (0x00000000b70a797a) retries: 0x2 allowed: 0x5
Mar 6 18:02:29 aram kernel: sd 1:0:1:0: [sdc] tag#431 CDB: opcode=0x88 88 00 00 00 00 02 98 53 55 90 00 00 02 00 00 00
Mar 6 18:02:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 6 18:02:29 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar 6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#468 CDB: opcode=0x88 88 00 00 00 00 02 98 53 b7 90 00 00 02 00 00 00
Mar 6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#467 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a5 90 00 00 02 00 00 00
Mar 6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#466 CDB: opcode=0x88 88 00 00 00 00 02 98 53 9f 90 00 00 02 00 00 00
Mar 6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#465 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a3 90 00 00 02 00 00 00
Mar 6 18:04:09 aram kernel: sd 1:0:1:0: [sdc] tag#464 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a1 90 00 00 02 00 00 00
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 OCR is requested due to IO timeout!!
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 SCSI host state: 5 SCSI host busy: 9 FW outstanding: 8
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 scmd: (0x00000000415c5b1d) retries: 0x1 allowed: 0x5
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 CDB: opcode=0x88 88 00 00 00 00 02 98 53 a1 90 00 00 02 00 00 00
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 Request descriptor details:
Mar 6 18:04:26 aram kernel: sd 1:0:1:0: [sdc] tag#464 RequestFlags:0xc MSIxIndex:0x3 SMID:0x1d1 LMID:0x0 DevHandle:0x11
Mar 6 18:04:26 aram kernel: IO request frame:
Mar 6 18:04:26 aram kernel: 00000000: 00000011 00000000 00000000 ffc6ae00 00600002 00000020 00000000 00040000
Mar 6 18:04:26 aram kernel: 00000020: 00000000 00004010 00000000 00000000 00000000 00000000 00000000 02000000
Mar 6 18:04:26 aram kernel: 00000040: 00000088 53980200 000090a1 00000002 00000000 00000000 00000000 00000000
Mar 6 18:04:26 aram kernel: 00000060: 005b0000 00010000 00000000 00000000 00000000 00000000 00004010 00000000
Mar 6 18:04:26 aram kernel: 00000080: a0ac0000 00000000 00001000 00000000 a0ac1000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000a0: a0ac2000 00000000 00001000 00000000 a0ac3000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000c0: a0ac4000 00000000 00001000 00000000 a0ac5000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000e0: a0ac6000 00000000 00001000 00000000 ffa8f000 00000000 00000390 80000000
Mar 6 18:04:26 aram kernel: Chain frame:
Mar 6 18:04:26 aram kernel: 00000000: a0ac7000 00000000 00001000 00000000 a0ac8000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000020: a0ac9000 00000000 00001000 00000000 a0aca000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000040: a0acb000 00000000 00001000 00000000 a0acc000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000060: a0acd000 00000000 00001000 00000000 a0ace000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000080: a0acf000 00000000 00001000 00000000 a0ad0000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000a0: a0ad1000 00000000 00001000 00000000 a0ad2000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000c0: a0ad3000 00000000 00001000 00000000 a0ad4000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000000e0: a0ad5000 00000000 00001000 00000000 a0ad6000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000100: a0ad7000 00000000 00001000 00000000 a0ad8000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000120: a0ad9000 00000000 00001000 00000000 a0ada000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000140: a0adb000 00000000 00001000 00000000 a0adc000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000160: a0add000 00000000 00001000 00000000 a0ade000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000180: a0adf000 00000000 00001000 00000000 a0ae0000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000001a0: a0ae1000 00000000 00001000 00000000 a0ae2000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000001c0: a0ae3000 00000000 00001000 00000000 a0ae4000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000001e0: a0ae5000 00000000 00001000 00000000 a0ae6000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000200: a0ae7000 00000000 00001000 00000000 a0ae8000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000220: a0ae9000 00000000 00001000 00000000 a0aea000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000240: a0aeb000 00000000 00001000 00000000 a0aec000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000260: a0aed000 00000000 00001000 00000000 a0aee000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000280: a0aef000 00000000 00001000 00000000 a0af0000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000002a0: a0af1000 00000000 00001000 00000000 a0af2000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000002c0: a0af3000 00000000 00001000 00000000 a0af4000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 000002e0: a0af5000 00000000 00001000 00000000 a0af6000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000300: a0af7000 00000000 00001000 00000000 a0af8000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000320: a0af9000 00000000 00001000 00000000 a0afa000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000340: a0afb000 00000000 00001000 00000000 a0afc000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000360: a0afd000 00000000 00001000 00000000 a0afe000 00000000 00001000 00000000
Mar 6 18:04:26 aram kernel: 00000380: a0aff000 00000000 00001000 40000000 00000000 00000000 00000000 00000000
Mar 6 18:04:26 aram kernel: 000003a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 18:04:26 aram kernel: 000003c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 18:04:26 aram kernel: 000003e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 6 18:04:26 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 6 18:04:26 aram kernel: megaraid_sas 0000:03:00.0: [ 0]waiting for 8 commands to complete for scsi1
Mar 6 18:04:31 aram kernel: megaraid_sas 0000:03:00.0: [ 5]waiting for 4 commands to complete for scsi1
Mar 6 18:04:36 aram kernel: megaraid_sas 0000:03:00.0: [10]waiting for 1 commands to complete for scsi1
Mar 6 18:04:37 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar 6 18:06:23 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 18:09:29 aram kernel: sd 1:0:1:0: [sdc] tag#408 CDB: opcode=0x88 88 00 00 00 00 02 98 54 43 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#407 CDB: opcode=0x88 88 00 00 00 00 02 98 54 51 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#405 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4f 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#402 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4d 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#401 CDB: opcode=0x88 88 00 00 00 00 02 98 54 4b 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#400 CDB: opcode=0x88 88 00 00 00 00 02 98 54 49 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#399 CDB: opcode=0x88 88 00 00 00 00 02 98 54 47 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#398 CDB: opcode=0x88 88 00 00 00 00 02 98 54 45 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#200 CDB: opcode=0x88 88 00 00 00 00 02 98 54 53 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#208 CDB: opcode=0x88 88 00 00 00 00 02 98 54 55 90 00 00 02 00 00 00
Mar 6 18:09:49 aram kernel: sd 1:0:1:0: [sdc] tag#216 CDB: opcode=0x88 88 00 00 00 00 02 98 54 57 90 00 00 02 00 00 00
Mar 6 18:09:50 aram kernel: sd 1:0:1:0: [sdc] tag#224 CDB: opcode=0x88 88 00 00 00 00 02 98 54 59 90 00 00 02 00 00 00
Mar 6 18:10:01 aram kernel: sd 1:0:1:0: [sdc] tag#196 CDB: opcode=0x88 88 00 00 00 00 02 98 54 5b 90 00 00 02 00 00 00
Mar 6 18:10:12 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 OCR is requested due to IO timeout!!
Mar 6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 SCSI host state: 5 SCSI host busy: 13 FW outstanding: 0
Mar 6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 scmd: (0x000000005fb8537f) retries: 0x0 allowed: 0x5
Mar 6 18:10:23 aram kernel: sd 1:0:1:0: [sdc] tag#196 CDB: opcode=0x88 88 00 00 00 00 02 98 54 5b 90 00 00 02 00 00 00
Mar 6 18:10:23 aram kernel: megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 6 18:10:23 aram kernel: megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000

Well, this is new and exciting! What are the odds that the parity check will complete before the disk gives up the ghost completely?

Complete diagnostics attached. aram-diagnostics-20230306-2047.zip
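When skimming a syslog dump like the one above, it can be quicker to count the failure signatures per device than to read every line. A sketch (the sample file embeds three lines from the post so the snippet is self-contained; on the server you would point grep at /var/log/syslog instead):

```shell
# Count disk-reset and IO-timeout events in a syslog excerpt.
# Sample lines are embedded here; on the server, grep /var/log/syslog.
cat > /tmp/syslog.sample <<'EOF'
Mar 6 17:24:41 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
Mar 6 17:48:29 aram kernel: sd 1:0:1:0: [sdc] tag#217 OCR is requested due to IO timeout!!
Mar 6 17:49:46 aram kernel: sd 1:0:1:0: Power-on or device reset occurred
EOF
resets=$(grep -c 'Power-on or device reset' /tmp/syslog.sample)
timeouts=$(grep -c 'OCR is requested due to IO timeout' /tmp/syslog.sample)
echo "resets=$resets timeouts=$timeouts"   # prints: resets=2 timeouts=1
```

A steadily climbing reset count against a single sd address, as seen here for 1:0:1:0 (sdc), is the pattern of one drive repeatedly dropping and being re-detected.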
rainformpurple Posted March 7, 2023 (Author)

If I stop the parity check, will I be able to rebuild the failing drive from parity? The check drags on and I don't think it will ever finish, and the performance hit is affecting everything on the server.
itimpi Posted March 7, 2023

reverend remiel said:
> If I stop the parity check, will I be able to rebuild the failing drive from parity?

It all depends on whether you have valid parity - if you do, then the answer is yes. Looking at the diagnostics, it appears that disk1 may have dropped offline.
rainformpurple Posted March 7, 2023 (Author)

itimpi said:
> It all depends on whether you have valid parity - if you do, then the answer is yes. Looking at the diagnostics, it appears that disk1 may have dropped offline.

How do I know if the parity is valid, then? Just wait for the parity check to complete? The last parity check was 2 months ago, and it passed with no errors, not even read errors on disk 1. Granted, disk 1 has reported read errors on a few occasions, which prompted me to buy a spare drive, but the errors went away and everything seemed fine until now.

I also enabled the SAS spindown plugin a while back, which is probably what killed the drive now. Oh, the irony.
itimpi Posted March 7, 2023 Share Posted March 7, 2023 20 minutes ago, reverend remiel said: How do I know if the parity is valid, then? Just wait for the parity check to complete The parity check is going to be meaningless if the disk has dropped offline Is disk1 still showing as present or has it got a red 'x' against it? I think the only way to know for certain is going to be if Unraid can emulate disk1 correctly if it is not in the system. I would wait for @JorgeB to get online as the best expert on these things to confirm but I think that is going to have to be the way forward. You definitely want to keep disk1 as intact as possible at this point in case it is needed for any recovery purposes. Do you have a spare drive you could use to replace it? Quote Link to comment
rainformpurple Posted March 7, 2023 (Author, edited March 7, 2023 by reverend remiel)

itimpi said:
> Is disk1 still showing as present or has it got a red 'x' against it? [...] Do you have a spare drive you could use to replace it?

The disk is still present in the system and is occasionally responding to commands, so it's not completely gone yet. I can also browse the disk's contents directly via /mnt/disk1, which is good.

I have a spare 6TB disk of the same type that I bought probably a year ago "just to make sure", so once I feel confident that the disk's contents can be rebuilt, I'm ready to swap them.
itimpi Posted March 7, 2023 Share Posted March 7, 2023 Just now, reverend remiel said: I can also browse the disk's contents directly via /mnt/disk1, which is good. Even without the disk present you would be able to browse its contents if parity is valid as Unraid will try and 'emulate' it using the combination of the other drives plus parity. That is one of the reasons I asked if the drive had a red 'x' against it as if that is the case then Unraid would be ignoring the physical drive and any contents you see would be from the emulated drive. Quote Link to comment
rainformpurple Posted March 7, 2023 (Author)

itimpi said:
> Even without the disk present you would be able to browse its contents if parity is valid [...]

Gotcha. This is how it looks at the moment:

[screenshots: array view and parity check status]

It picks up the pace every now and then before dropping back down to sub-30KB/sec speeds.

I'm afraid I have more disks on the way out, so I think I'll buy a set of new 18TB drives to replace these once this situation is sorted. My understanding of the coming events would be:

1. Unassign disk1
2. Replace disk1
3. Let parity rebuild disk1

Then, when the time comes to replace the disks:

1. Unassign the parity disk
2. Replace the parity disk with a new 18TB drive
3. Rebuild parity
4. Replace array disk1 with a larger disk and let it rebuild from parity (or attach one new drive as an unassigned device, copy everything over, replace the disk and manually copy everything back onto the new disk1)
5. Move data from disk2 to disk1
6. Replace disk2 with a larger disk
7. Move data from disks 3 and 4 to disks 1 and 2
8. Replace disks 3 and 4 with larger ones
9. Move data from disks 5, 6 and 7 to disks 1, 2, 3 and 4
10. Replace disks 5, 6 and 7
11. Run the unbalance plugin to spread data evenly across drives

I like moving things around myself so that I know what's going on, which probably interferes with some of the ways Unraid prefers to do things, and I don't know which way is better - rebuilding from parity sounds like it'll take a very long time, and copying the data manually sounds like it'll be faster. Maybe it won't.
JorgeB Posted March 7, 2023

Can't see SMART for disk1, but it should be OK to replace it, assuming parity was valid.
rainformpurple Posted March 7, 2023 (Author)

JorgeB said:
> Can't see SMART for disk1 but should be OK to replace it, assuming parity was valid.

Yeah, but that's the thing - how do I know parity was valid?
JorgeB Posted March 7, 2023

Did it complete a previous check without errors? The diags don't show the start of this check, but if it was a non-correcting check, as recommended, parity would remain valid for sure.
rainformpurple Posted March 7, 2023 (Author)

JorgeB said:
> Did it complete a previous check without errors?

Yes, the previous check 2 months ago completed with no errors.
rainformpurple Posted March 7, 2023 (Author)

I'm trying to retrieve SMART data from the drive, but it's taking its sweet time...
rainformpurple Posted March 7, 2023 (Author)

Okay, SMART data collection is done:

root@aram:~# smartctl --all /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.10.28-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060AL5210
Revision:             A907
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2424dba1c
Serial number:        NAHBS4ZY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Mar 7 10:29:31 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=62]

Current Drive Temperature:     48 C
Drive Trip Temperature:        85 C

Manufactured in week 52 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  621
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2578
Elements in grown defect list: 509

Vendor (Seagate Cache) information
  Blocks sent to initiator = 9574154483269632

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  1690116         0   1690116     41747712     206550.071        89
write:         0       47         0        47      2831239      32983.083         0
verify:        0    42893         0     42893    119043714       9119.288        49

Non-medium error count:      399

SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background short  Completed       -     45459              - [-   -    -]
# 2  Background short  Completed       -     45459              - [-   -    -]

Long (extended) Self-test duration: 48442 seconds [807.4 minutes]
JorgeB Posted March 7, 2023

reverend remiel's SMART output said:
> Elements in grown defect list: 509
> uncorrected errors - read: 89, verify: 49

These are no good, it should be replaced.
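For future drives, the health-critical fields called out above can be pulled out of a saved report mechanically. A sketch (the sample report embeds two lines from the output posted earlier so the snippet is self-contained; normally you would save `smartctl --all /dev/sdc` to the file first):

```shell
# Extract the health-critical lines from a saved smartctl report.
# Sample lines embedded; normally: smartctl --all /dev/sdc > /tmp/smart_report.txt
cat > /tmp/smart_report.txt <<'EOF'
SMART Health Status: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=62]
Elements in grown defect list: 509
EOF
grep -E 'Health Status|grown defect' /tmp/smart_report.txt   # prints both lines
```

On a healthy SAS drive the health status reads "OK" and the grown defect list is at or near zero; a triple-digit defect count plus uncorrected read errors is the replace-it signal.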
rainformpurple Posted March 7, 2023 (Author, edited March 7, 2023 by reverend remiel)

JorgeB said:
> These are no good, it should be replaced.

That's the plan, I'm just waiting to get home from work. While I wait, is it safe to cancel the parity check? As I understand it, the chances of it completing at this point are slim to none.
itimpi Posted March 7, 2023 Share Posted March 7, 2023 54 minutes ago, reverend remiel said: While I wait, is it safe to cancel the parity check? As I understand it, the chances of it completing at this point are slim to none. Definitely cancel it. You rarely want to be running a parity check if you think you have drives playing up. Quote Link to comment
rainformpurple Posted March 7, 2023 (Author)

Time for an update. The failed disk has been pulled and the new one installed. The disk's contents are being reconstructed as we speak, and the rebuild is estimated to take just shy of 8 hours. So far, so good.

The VMs are running, as I need them for DNS and adblocking and things of that nature, but I'll leave the docker containers disabled for now and deactivate tonight's Mover run as well, to make sure no new data interferes with the reconstruction.

Thanks for all your help, highly appreciated.
itimpi Posted March 7, 2023 Share Posted March 7, 2023 If you have the array running in normal mode then you should be able to view the contents of what is being reconstructed even though it is still in progress. Can you do this and does it look like what you expect? Quote Link to comment
rainformpurple Posted March 7, 2023 (Author)

itimpi said:
> Can you do this and does it look like what you expect?

Yep, it looks pretty much like it did on Saturday, so I think this will be fine. I'll still keep the docker containers stopped until the reconstruction is complete, just to ease my paranoia. The fewer things that can go wrong at the same time, the better.
rainformpurple Posted March 8, 2023 (Author, edited March 8, 2023 by reverend remiel)

Disk reconstruction has completed with no errors! It took about 9,5 hours, but that's fine. All's well that ends well.

Again, a big thank you to all who responded!