-
Recurring ZFS errors
If anyone is having similar issues or wondering what happened, this had (as expected) nothing to do with the mount point. It was a SAS cable from the HBA to the backplane, which had worked flawlessly for years, that started intermittently failing. Issue resolved with replacement cable.
-
Recurring ZFS errors
Yes, those are ZFS datasets that I have mounted in /mnt/user.
-
Recurring ZFS errors
Attached diagnostics to the post
-
Recurring ZFS errors
This is quite the odd issue I've been having in the past few weeks.

My hardware:
Supermicro CSE-826 2U chassis
X10SRi-F motherboard with Xeon E5-2680 v4 and 96 GB ECC RAM
LSI 9211-8i (crossflashed Fujitsu D2607-8I) in IT mode
5x 16 TB SAS HDD Toshiba MG08SCA16TE

All hard drives are roughly the same age with similar stats:
Accumulated power on time, hours:minutes 17053:01
Manufactured in week 44 of year 2021
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 23
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 284
Elements in grown defect list: 0

I am running Unraid 6.11.5 with the ZFS plugin. I have a single RAIDZ1 pool with the 5 drives.

On 20 Jan I noticed my pool had degraded and zpool status was showing a large number of errors (READ, WRITE, CKSUM) across all hard drives at the same time. I cleared the errors with zpool clear and ran a scrub. The pool appeared healthy and everything seemed fine. I put it down to a controller drop out or power glitch or something, as it affected all hard drives.

The same thing happened again on 22 Jan, 24 Jan, 25 Jan. Each time I cleared the errors and scrubbed the pool. There were no data errors at any time. I suspected an issue with the controller or cable so I shut down the server and reconnected the SAS cables and put the controller in another PCIe slot.

After that, it stopped happening for some time, until the 16 Feb when it happened again, this time with a huge number of errors across all drives. In the meantime, I had ordered a replacement HBA, and on 16 Feb after another occurrence of the errors, I swapped the original LSI 9211-4i with a Fujitsu D2607-8I (crossflashed to LSI 9211-8i).

That seemed to fix things, however a few days ago on 29 Feb it happened again, even with the new controller. As always I cleared the errors and scrubbed, but it keeps coming back more and more often now with larger number of errors each time.
It happened again yesterday and now today. This is what it looks like:

root@Tower:~# zpool status
  pool: vault
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 09:34:44 with 0 errors on Fri Mar 1 13:54:45 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        vault                       DEGRADED     0     0     0
          raidz1-0                  DEGRADED   111 7.44K     0
            scsi-35000039b3868c8c9  DEGRADED    47 2.56K     0  too many errors
            scsi-35000039b3868ca2d  DEGRADED    39 2.82K     0  too many errors
            scsi-35000039b480a5da5  DEGRADED    45 2.10K     0  too many errors
            scsi-35000039b480a5f5d  FAULTED      0    15     0  too many errors
            scsi-35000039b480a7f71  DEGRADED     7 2.88K     0  too many errors

errors: No known data errors

As you can see, one of the drives (random one each time) goes into the faulted state after the first few errors. The other ones remain online and keep accumulating errors for the duration of the event. I assume this is due to ZFS trying to keep the pool alive so it avoids faulting the remaining drives if one drive is already faulted.
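The telling detail in this output is that every disk in the vdev reports errors at once, which hints at something the disks share (HBA, cable, backplane, power) rather than one bad drive. Below is a minimal sketch of checking for that pattern, assuming the device lines follow the NAME STATE READ WRITE CKSUM layout shown above. The function names, the `leaf_prefix` parameter and the 1024-based suffix table are illustrative assumptions, not part of any ZFS tooling.

```python
import re

# Leaf-device lines copied from the zpool status output above.
SAMPLE = """\
scsi-35000039b3868c8c9  DEGRADED  47  2.56K  0  too many errors
scsi-35000039b3868ca2d  DEGRADED  39  2.82K  0  too many errors
scsi-35000039b480a5da5  DEGRADED  45  2.10K  0  too many errors
scsi-35000039b480a5f5d  FAULTED    0     15  0  too many errors
scsi-35000039b480a7f71  DEGRADED   7  2.88K  0  too many errors
"""

# Assumption: zpool abbreviates large counters with 1024-based suffixes,
# so values parsed from an abbreviated counter are approximate.
SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Device name, state, then the READ / WRITE / CKSUM counters.
COUNT = r"([\d.]+[KMGT]?)"
LINE = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+" + COUNT + r"\s+" + COUNT + r"\s+" + COUNT)


def parse_count(text):
    """Turn a counter such as '47' or '2.56K' into an (approximate) integer."""
    if text[-1] in SUFFIX:
        return round(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def parse_devices(status_text, leaf_prefix="scsi-"):
    """Collect counters for leaf devices whose name starts with leaf_prefix."""
    devices = {}
    for line in status_text.splitlines():
        match = LINE.match(line)
        if match and match.group(1).startswith(leaf_prefix):
            name, state, read, write, cksum = match.groups()
            devices[name] = {
                "state": state,
                "read": parse_count(read),
                "write": parse_count(write),
                "cksum": parse_count(cksum),
            }
    return devices


def looks_like_shared_fault(devices):
    """True when more than one device exists and all of them show errors."""
    return len(devices) > 1 and all(
        d["read"] + d["write"] + d["cksum"] > 0 for d in devices.values()
    )


if __name__ == "__main__":
    devices = parse_devices(SAMPLE)
    for name, d in devices.items():
        print(name, d["state"], d["read"], d["write"], d["cksum"])
    print("all devices affected:", looks_like_shared_fault(devices))
```

A positive result only narrows the search to shared components. It cannot tell a cable from a backplane or a power problem.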
Below is part of the kernel logs from the last occurrence:

[Mon Mar 4 00:03:13 2024] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Mar 4 00:03:13 2024] sd 8:0:0:0: [sdb] tag#3195 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=DRIVER_OK cmd_age=0s
[Mon Mar 4 00:03:13 2024] sd 8:0:0:0: [sdb] tag#3195 CDB: opcode=0x8a 8a 00 00 00 00 04 e1 c6 31 18 00 00 00 18 00 00
[Mon Mar 4 00:03:13 2024] I/O error, dev sdb, sector 20967731480 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 0
[Mon Mar 4 00:03:13 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b3868ca2d-part1 error=5 type=2 offset=10735477469184 size=12288 flags=40080c80
[Mon Mar 4 00:04:20 2024] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Mar 4 00:04:20 2024] sd 8:0:0:0: [sdb] tag#3104 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=DRIVER_OK cmd_age=0s
[Mon Mar 4 00:04:20 2024] sd 8:0:0:0: [sdb] tag#3104 CDB: opcode=0x8a 8a 00 00 00 00 04 e9 4e 29 18 00 00 00 08 00 00
[Mon Mar 4 00:04:20 2024] I/O error, dev sdb, sector 21094082840 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[Mon Mar 4 00:04:20 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b3868ca2d-part1 error=5 type=2 offset=10800169365504 size=4096 flags=180880
[Mon Mar 4 00:04:40 2024] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Mar 4 00:04:40 2024] sd 8:0:2:0: [sdd] tag#3091 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=DRIVER_OK cmd_age=0s
[Mon Mar 4 00:04:40 2024] sd 8:0:2:0: [sdd] tag#3091 CDB: opcode=0x8a 8a 00 00 00 00 04 ca a1 95 60 00 00 00 10 00 00
[Mon Mar 4 00:04:40 2024] I/O error, dev sdd, sector 20579456352 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[Mon Mar 4 00:04:40 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b480a5da5-part1 error=5 type=2 offset=10536680603648 size=8192 flags=180880
[Mon Mar 4 00:05:27 2024] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Mar 4 00:05:27 2024] sd 8:0:1:0: [sdc] tag#3084 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=DRIVER_OK cmd_age=0s
[Mon Mar 4 00:05:27 2024] sd 8:0:1:0: [sdc] tag#3084 CDB: opcode=0x8a 8a 00 00 00 00 04 e9 4e 2e 20 00 00 00 38 00 00
[Mon Mar 4 00:05:27 2024] I/O error, dev sdc, sector 21094084128 op 0x1:(WRITE) flags 0x700 phys_seg 7 prio class 0
[Mon Mar 4 00:05:27 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b480a7f71-part1 error=5 type=2 offset=10800170024960 size=28672 flags=40080c80
[Mon Mar 4 00:09:12 2024] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Mar 4 00:09:12 2024] sd 8:0:1:0: [sdc] tag#3186 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=DRIVER_OK cmd_age=0s
[Mon Mar 4 00:09:12 2024] sd 8:0:1:0: [sdc] tag#3186 CDB: opcode=0x8a 8a 00 00 00 00 04 e9 4e 3e c0 00 00 00 30 00 00
[Mon Mar 4 00:09:12 2024] I/O error, dev sdc, sector 21094088384 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 0
[Mon Mar 4 00:09:12 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b480a7f71-part1 error=5 type=2 offset=10800172204032 size=24576 flags=40080c80

Also, see SMART for the hard drives below. They all look very similar in the number of ECC errors. I don't know if that's normal, whether these errors indicate something wrong with the drives themselves or just reflect SAS/communication errors with the controller. I consider it highly unlikely that there's something wrong with all 5 drives, and that those errors would occur exactly the same time on all 5.
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    23533         0         0          0    1540822.007           0
write:         0      773         0         0          0      57629.268           0

Non-medium error count:        0

Potential causes I have considered:
Faulty HBA: ruled out - replaced, problem persists
Overheating HBA: ruled out - temperature measured around 50 Celsius when problem appears
Bad connections: ruled out - reconnected SAS cable on either end, problem persists
Power supply? unlikely - IPMI has not recorded any voltage faults at all
Backplane?

I would appreciate any help and pointers! Running out of ideas to the point I'm considering replacing the entire server!

tower-diagnostics-20240306-0904.zip
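For anyone reading kernel log entries like the ones quoted earlier in this post, the lines can be cross-checked against each other. Below is a rough sketch, assuming the standard 16-byte WRITE(16) command layout (opcode in byte 0, logical block address in bytes 2-9, transfer length in bytes 10-13) and 512-byte logical blocks. The helper names are made up for illustration, and the sample is a subset of the log above.

```python
import re
from collections import Counter

# Lines copied from the kernel log excerpt in this post.
LOG = """\
[Mon Mar 4 00:03:13 2024] sd 8:0:0:0: [sdb] tag#3195 CDB: opcode=0x8a 8a 00 00 00 00 04 e1 c6 31 18 00 00 00 18 00 00
[Mon Mar 4 00:03:13 2024] I/O error, dev sdb, sector 20967731480 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 0
[Mon Mar 4 00:03:13 2024] zio pool=vault vdev=/dev/disk/by-id/scsi-35000039b3868ca2d-part1 error=5 type=2 offset=10735477469184 size=12288 flags=40080c80
[Mon Mar 4 00:04:20 2024] I/O error, dev sdb, sector 21094082840 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[Mon Mar 4 00:04:40 2024] I/O error, dev sdd, sector 20579456352 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[Mon Mar 4 00:05:27 2024] I/O error, dev sdc, sector 21094084128 op 0x1:(WRITE) flags 0x700 phys_seg 7 prio class 0
[Mon Mar 4 00:09:12 2024] I/O error, dev sdc, sector 21094088384 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 0
"""

CDB_RE = re.compile(r"CDB: opcode=0x[0-9a-f]+ ((?:[0-9a-f]{2}\s?)+)")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+),")


def decode_cdb16(cdb_hex):
    """Return (opcode, lba, blocks) from a 16-byte READ(16)/WRITE(16) CDB."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16:
        raise ValueError("expected a 16-byte CDB")
    opcode = raw[0]
    lba = int.from_bytes(raw[2:10], "big")      # bytes 2-9, big-endian
    blocks = int.from_bytes(raw[10:14], "big")  # bytes 10-13, big-endian
    return opcode, lba, blocks


def errors_per_device(log_text):
    """Count block-layer I/O errors per kernel device name."""
    return Counter(IO_ERROR_RE.findall(log_text))


if __name__ == "__main__":
    print(dict(errors_per_device(LOG)))
    opcode, lba, blocks = decode_cdb16(CDB_RE.search(LOG).group(1))
    print(hex(opcode), lba, blocks, blocks * 512)
```

For the first entry the decoded address equals the sector in the I/O error line (20967731480), and 24 blocks of 512 bytes gives the 12288-byte size in the zio line. The sector times 512 is exactly 1 MiB above the zio offset, which would fit a partition starting at sector 2048, though the log itself does not say so. The errors are spread over sdb, sdc and sdd within a few minutes, all failing at the transport level rather than as medium errors.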
drcref