Bushibot Posted September 26, 2023

I seem to be having an issue where my dockers (and maybe the whole system?) are going read only. It starts when some dockers begin failing with different issues. In the syslog I see:

Sep 26 13:19:43 nas-mass kernel: BTRFS error (device sdh1): block=1088175357952 write time tree block corruption detected
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 26 13:19:43 nas-mass kernel: BTRFS info (device sdh1: state E): forced readonly
Sep 26 13:19:43 nas-mass kernel: BTRFS warning (device sdh1: state E): Skipping commit of aborted transaction.
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 26 13:32:21 nas-mass kernel: traps: mariadbd[17707] general protection fault ip:14ed75a28898 sp:14ed7406f500 error:0 in libc.so.6[14ed75a28000+195000]

I have no idea what could be causing this. I also found the system locked up overnight. Attached are the diagnostics from when I first noticed the read-only issue.

nas-mass-diagnostics-20230926-1333.zip
Bushibot Posted September 26, 2023 (Author)

Trying to follow along here: running a repair, I see this scrolling infinitely:

super bytes used 146319687680 mismatches actual used 146319671296
super bytes used 146319687680 mismatches actual used 146319671296

Is that normal?

Edited September 27, 2023 by Bushibot
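For reference, the "super bytes used ... mismatches actual used" message can be confirmed with a read-only check before attempting any repair. This is a generic sketch, not the exact commands used in this thread; the device path is an example, and the pool must be unmounted (array stopped) first:

```shell
# Read-only filesystem check: reports problems but makes no changes.
# /dev/sdh1 is an example device path; unmount the pool / stop the array first.
btrfs check --readonly /dev/sdh1

# --repair rewrites metadata and can make a corrupted filesystem worse;
# treat it as a last resort, and back up the data first:
# btrfs check --repair /dev/sdh1
```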
JorgeB Posted September 27, 2023

11 hours ago, Bushibot said:
"write time tree block corruption detected"

This usually means bad RAM or other kernel memory corruption. Rebooting should bring the filesystem back, but it's a good idea to run memtest.
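A quick way to see whether the corruption is recurring after a reboot is to filter the kernel log for btrfs trouble. A minimal sketch (the sample line is the one from this thread; on a live system you would pipe `dmesg` through the same grep):

```shell
# Filter for btrfs error/corruption/readonly events.
# Live use would be: dmesg | grep -iE 'btrfs.*(error|corruption|readonly)'
sample='BTRFS error (device sdh1): block=1088175357952 write time tree block corruption detected'
echo "$sample" | grep -iE 'btrfs.*(error|corruption|readonly)'
```

If the filter stays quiet after a reboot and a memtest pass, the original corruption was most likely a one-off memory event rather than a failing disk.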
Bushibot Posted September 27, 2023 (Author)

9 hours ago, JorgeB said:
"This usually means bad RAM or other kernel memory corruption. Rebooting should bring the filesystem back, but it's a good idea to run memtest."

Running a memtest now, and I will verify that no overclocking is active, since that seems to be a common cause. If nothing is found, then what's next?
JorgeB Posted September 27, 2023

There is possibly a btrfs issue with the v6.12 kernel, since this has been happening to some users without an apparent reason. Previously, 9 out of 10 times this happened, memtest would find errors. If that is the case here, either go back to v6.11.5 to see if it no longer happens, or try ZFS.
Bushibot Posted September 27, 2023 (Author)

1 hour ago, JorgeB said:
"There is possibly a btrfs issue with the v6.12 kernel... either go back to v6.11.5 to see if it no longer happens, or try ZFS."

Can I go to ZFS without a data wipe? I'm a little light on RAM for that, only 24G, though I can go to 32...

How would I downgrade the kernel? It looks like a rollback is required: https://docs.unraid.net/unraid-os/release-notes/6.12.4/#rolling-back

I'm letting memtest run today, but the server has been down for a day and the Plex kids are starting to gripe :p

Edited September 27, 2023 by Bushibot
Bushibot Posted September 27, 2023 (Author)

Or did you mean just convert the cache to ZFS?
Bushibot Posted September 28, 2023 (Author)

Well, crap. Memtest passed, but now I'm getting BTRFS errors when running the mover, just trying to clear the cache pool that went read only. I really can't tell what's going on here. I ran a check:

UUID: a267484f-8dc7-4df0-b715-36becf7e1795
no stats available
Total to scrub: 227.54GiB
Rate: 0.00B/s
Error summary: no errors found yet

Sep 27 17:42:41 nas-mass emhttpd: shcmd (93): /usr/local/sbin/mover &> /dev/null &
Sep 27 17:43:21 nas-mass flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Sep 27 17:50:53 nas-mass kernel: ata16.00: exception Emask 0x10 SAct 0xe0 SErr 0x0 action 0x6 frozen
Sep 27 17:50:53 nas-mass kernel: ata16.00: irq_stat 0x08000000, interface fatal error
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/80:28:f0:80:7d/02:00:0d:00:00/40 tag 5 ncq dma 327680 in
Sep 27 17:50:53 nas-mass kernel: res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/00:30:70:83:7d/02:00:0d:00:00/40 tag 6 ncq dma 262144 in
Sep 27 17:50:53 nas-mass kernel: res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/20:38:80:ae:18/00:00:00:00:00/40 tag 7 ncq dma 16384 in
Sep 27 17:50:53 nas-mass kernel: res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16: hard resetting link
Sep 27 17:50:58 nas-mass kernel: ata16: link is slow to respond, please be patient (ready=0)
Sep 27 17:51:03 nas-mass kernel: ata16: COMRESET failed (errno=-16)
Sep 27 17:51:03 nas-mass kernel: ata16: hard resetting link
Sep 27 17:51:03 nas-mass kernel: ata16: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: configured for UDMA/133
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 Sense Key : 0x5 [current]
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 ASC=0x21 ASCQ=0x4
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 CDB: opcode=0x28 28 00 0d 7d 80 f0 00 02 80 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226328816 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 Sense Key : 0x5 [current]
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 ASC=0x21 ASCQ=0x4
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 CDB: opcode=0x28 28 00 0d 7d 83 70 00 02 00 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226329456 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Sep 27 17:51:03 nas-mass kernel: ata16: EH complete
Sep 27 17:51:03 nas-mass kernel: ata16.00: Enabling discard_zeroes_data
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209100288 (dev /dev/sdj1 sector 226327480)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209067520 (dev /dev/sdj1 sector 226327416)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209104384 (dev /dev/sdj1 sector 226327488)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209108480 (dev /dev/sdj1 sector 226327496)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208739840 (dev /dev/sdj1 sector 226326776)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208735744 (dev /dev/sdj1 sector 226326768)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208756224 (dev /dev/sdj1 sector 226326808)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209075712 (dev /dev/sdj1 sector 226327432)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208764416 (dev /dev/sdj1 sector 226326824)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209083904 (dev /dev/sdj1 sector 226327448)

Edited September 28, 2023 by Bushibot
Bushibot Posted September 28, 2023 (Author)

Added an updated diagnostics file.

nas-mass-diagnostics-20230927-1808.zip
JorgeB Posted September 28, 2023

Replace the cables for that device and run a scrub.
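For anyone following along, a scrub on a mounted btrfs pool can be started and checked from the console roughly like this (the mount point is an example; on Unraid it can also be started from the pool device's GUI page):

```shell
btrfs scrub start /mnt/cache   # begin a scrub on the mounted pool
btrfs scrub status /mnt/cache  # show progress and a correctable/uncorrectable error summary
btrfs dev stats /mnt/cache     # per-device read/write/corruption error counters
```

The `dev stats` counters are cumulative, so a non-zero value alone doesn't mean the problem is ongoing; reset them with `btrfs dev stats -z` after fixing the cabling and watch whether they climb again.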
Bushibot Posted September 29, 2023 (Author)

I moved it to a different SATA port (with a different cable). The scrub said no errors (which seems inconsistent).
JorgeB Posted September 29, 2023

5 minutes ago, Bushibot said:
"scrub said no error"

It can be normal if all the errors were already corrected during the earlier reads; any errors found on accessed data are corrected automatically (if the pool is redundant).
Bushibot Posted September 29, 2023 (Author)

So I'm still trying to figure out if I have a disk issue. Memtest86 did two complete passes on the upgraded memory (though the old x4G had also tested fine with two full passes). I have since reformatted my cache as a 2T ZFS mirror pool in the hope of avoiding more issues. I still see an error for what is now disk 2 in the pool; this is the same drive that went read only earlier with the BTRFS errors. Scrub just instantly returns that everything is fine and doesn't look like it's doing anything at all, though I'm not sure what the expected behavior is. I'm really trying to avoid even more downtime.

  pool: cache
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:22:39 2023
config:

        NAME           STATE     READ WRITE CKSUM
        cache          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            /dev/sdg1  ONLINE       0     0     0
            /dev/sdk1  ONLINE       0     0     0

errors: No known data errors

root@nas-mass:~# zpool scrub cache
root@nas-mass:~# zpool status
  pool: cache
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:20:58 2023
config:

        NAME        STATE     READ WRITE CKSUM
        cache       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg1    ONLINE       0     0     0
            sdk1    ONLINE       0     0     0

The drive passed a long SMART test without error. The drive reports:

Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk
Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk

The main screen says there are CRC errors (maybe old?):

199  CRC error count  0x003e  099  099  000  Old age  Always  Never  102
Bushibot Posted September 29, 2023 (Author)

I also noticed the Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk) runs quite a bit hotter than the other SSDs, heck even hotter than the spinning disks. It will be up at 115F when the other SSDs are 15 degrees or more cooler. I tried swapping drive positions etc., but that doesn't seem to affect their temperatures, and the cooling path is the same. I'm not sure if the excess heat is a sign of potential issues or it just happens to run hotter.

29-09-2023 10:41  Unraid Cache 2 temperature             Warning [NAS-MASS] - Cache 2 is hot (115 F)   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)   warning
29-09-2023 10:09  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:52  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:47  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:44  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:33  Unraid Cache 2 SMART health [199]      Warning [NAS-MASS] - crc error count is 102   Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)   warning

Edited September 29, 2023 by Bushibot
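Attribute 199 (UDMA CRC error count) is a lifetime counter that never resets, so what matters is whether the raw value keeps climbing, not that it's non-zero. A minimal sketch for pulling the raw value out of smartctl-style attribute output (the sample row is modeled on the one in this thread; on a live system you would pipe `smartctl -A /dev/sdk` in instead):

```shell
# Extract the raw value (last column) of SMART attribute 199.
# Live use: smartctl -A /dev/sdk | awk '$1 == 199 {print $NF}'
sample='199 UDMA_CRC_Error_Count 0x003e 099 099 000 Old_age Always - 102'
crc=$(echo "$sample" | awk '$1 == 199 {print $NF}')
echo "CRC error count: $crc"
```

If the value is stable across a few days, the CRC errors are old; if it increments, the SATA link is still flaky.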
JorgeB Posted September 29, 2023

21 minutes ago, Bushibot said:
"Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk"

These are not disk errors; post the complete diagnostics instead.

22 minutes ago, Bushibot said:
"Scrub just instantly returns everything is fine."

If there's no or very little data, a scrub will be instant.
Bushibot Posted September 29, 2023 (Author)

Ah hah, just reran the scrub:

  pool: cache
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 512K in 00:01:55 with 0 errors on Fri Sep 29 11:01:37 2023
config:

        NAME           STATE     READ WRITE CKSUM
        cache          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            /dev/sdg1  ONLINE       4     0     0
            /dev/sdk1  ONLINE       0     0     0

errors: No known data errors

With a ZFS mirror on Unraid, can I just swap the drives, or do I need to manually use zpool replace?

Edited September 29, 2023 by Bushibot
JorgeB Posted September 29, 2023

You can replace: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=480419

Could also be a cable issue.
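For reference, the manual ZFS route (as opposed to the Unraid GUI procedure in that FAQ link, which is the supported path) looks roughly like this; the device names below are examples, not taken from this system:

```shell
zpool status cache                        # identify the failing mirror member
zpool replace cache /dev/sdk1 /dev/sdX1   # resilver onto the new device (example names)
zpool status cache                        # watch the resilver complete
zpool clear cache                         # reset error counters once the pool is healthy
```

Because the pool is a mirror, it stays online and serving data throughout the resilver; only redundancy is reduced until it finishes.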
Bushibot Posted September 29, 2023 (Author)

2 minutes ago, JorgeB said:
"You can replace... Could also be a cable issue."

I already replaced the cables.
Bushibot Posted September 29, 2023 (Author)

4 hours ago, JorgeB said:
"These are not disk errors; post the complete diagnostics instead."

Fair enough, current diagnostics attached. My current plan is to replace the Samsung drive.

nas-mass-diagnostics-20230929-1531.zip
JorgeB Posted September 30, 2023 (Solution)

Sep 29 11:00:02 nas-mass kernel: ata1: SError: { UnrecovData 10B8B BadCRC }

This really looks like a bad SATA cable. I would try replacing it once more; if issues persist, replace the device, as it could be making a bad connection.
Bushibot Posted September 30, 2023 (Author)

Will try that, certainly can't hurt.
Bushibot Posted October 2, 2023 (Author)

I'm going to replace the drive anyway, maybe go to a 3-drive pool? Anyway, I haven't seen any more issues since the cable swap, so hopefully that was it: either the replacement cable was also bad, or I somehow swapped the wrong one.