Read only

Bushibot · September 26, 2023

I seem be having a issue where my dockers (and maybe the whole system?) are going read only. I see this start when some dockers start to fail with different issues.

I see in the syslog:

Sep 26 13:19:43 nas-mass kernel: BTRFS error (device sdh1): block=1088175357952 write time tree block corruption detected
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 26 13:19:43 nas-mass kernel: BTRFS info (device sdh1: state E): forced readonly
Sep 26 13:19:43 nas-mass kernel: BTRFS warning (device sdh1: state E): Skipping commit of aborted transaction.
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 26 13:32:21 nas-mass kernel: traps: mariadbd[17707] general protection fault ip:14ed75a28898 sp:14ed7406f500 error:0 in libc.so.6[14ed75a28000+195000]

I have no idea what could be causing this. I found the system locked up over night too.

Attached is the Dianostic from when I first noticed the read only issue coming up starting up.

nas-mass-diagnostics-20230926-1333.zip

Bushibot · September 26, 2023

Trying to follow here:

Running repair seeing scroll infinitely:
super bytes used 146319687680 mismatches actual used 146319671296
super bytes used 146319687680 mismatches actual used 146319671296

Is that normal?

Edited September 27, 2023 by Bushibot

JorgeB · September 27, 2023

11 hours ago, Bushibot said:
write time tree block corruption detected

This usually means bad RAM or other kernel memory corruption, rebooting should bring the filesystem back, but good idea to run memtest

Bushibot · September 27, 2023

9 hours ago, JorgeB said:

This usually means bad RAM or other kernel memory corruption, rebooting should bring the filesystem back, but good idea to run memtest

Running a mem test, and will validate that no OC is happening. That seems like a common issue.

If nothing is found then what's next?

JorgeB · September 27, 2023

There's is possibly a btrfs issue with the v6.12 kernel, since it's been happening to some users without an apparent reason, before 9 out of 10 times this happened memtest would find errors, if that is the case either go back to v6.11.5 to see if it no longer happens or try zfs.

Bushibot · September 27, 2023

1 hour ago, JorgeB said:

There's is possibly a btrfs issue with the v6.12 kernel, since it's been happening to some users without an apparent reason, before 9 out of 10 times this happened memtest would find errors, if that is the case either go back to v6.11.5 to see if it no longer happens or try zfs.

Can I go to zfs without a data wipe? I'm a little light on ram for that only 24G, though I can go to 32....

~~How would I down grade the kernel?~~

Looks like rollback required? https://docs.unraid.net/unraid-os/release-notes/6.12.4/#rolling-back

I'm letting mem test run today but server has been down for a day and the plex kids are starting to gripe :p.

Edited September 27, 2023 by Bushibot

Bushibot · September 27, 2023

Or did you mean just convert the cache to zfs?

Bushibot · September 28, 2023

well crap.

Mem passed testing but now getting BTF errors running move just to try and clear the cash that went read only.

I really can't tell what's going on here.

ran check

UUID: a267484f-8dc7-4df0-b715-36becf7e1795 no stats available Total to scrub: 227.54GiB Rate: 0.00B/s Error summary: no errors found

yet

Sep 27 17:42:41 nas-mass emhttpd: shcmd (93): /usr/local/sbin/mover &> /dev/null &
Sep 27 17:43:21 nas-mass flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Sep 27 17:50:53 nas-mass kernel: ata16.00: exception Emask 0x10 SAct 0xe0 SErr 0x0 action 0x6 frozen
Sep 27 17:50:53 nas-mass kernel: ata16.00: irq_stat 0x08000000, interface fatal error
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/80:28:f0:80:7d/02:00:0d:00:00/40 tag 5 ncq dma 327680 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/00:30:70:83:7d/02:00:0d:00:00/40 tag 6 ncq dma 262144 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/20:38:80:ae:18/00:00:00:00:00/40 tag 7 ncq dma 16384 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16: hard resetting link
Sep 27 17:50:58 nas-mass kernel: ata16: link is slow to respond, please be patient (ready=0)
Sep 27 17:51:03 nas-mass kernel: ata16: COMRESET failed (errno=-16)
Sep 27 17:51:03 nas-mass kernel: ata16: hard resetting link
Sep 27 17:51:03 nas-mass kernel: ata16: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: configured for UDMA/133
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 Sense Key : 0x5 [current] 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 ASC=0x21 ASCQ=0x4 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 CDB: opcode=0x28 28 00 0d 7d 80 f0 00 02 80 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226328816 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 Sense Key : 0x5 [current] 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 ASC=0x21 ASCQ=0x4 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 CDB: opcode=0x28 28 00 0d 7d 83 70 00 02 00 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226329456 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Sep 27 17:51:03 nas-mass kernel: ata16: EH complete
Sep 27 17:51:03 nas-mass kernel: ata16.00: Enabling discard_zeroes_data
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209100288 (dev /dev/sdj1 sector 226327480)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209067520 (dev /dev/sdj1 sector 226327416)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209104384 (dev /dev/sdj1 sector 226327488)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209108480 (dev /dev/sdj1 sector 226327496)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208739840 (dev /dev/sdj1 sector 226326776)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208735744 (dev /dev/sdj1 sector 226326768)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208756224 (dev /dev/sdj1 sector 226326808)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209075712 (dev /dev/sdj1 sector 226327432)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208764416 (dev /dev/sdj1 sector 226326824)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209083904 (dev /dev/sdj1 sector 226327448)

Edited September 28, 2023 by Bushibot

Bushibot · September 28, 2023

added updated diag.

nas-mass-diagnostics-20230927-1808.zip

JorgeB · September 28, 2023

Replace cables for that device and run a scrub.

Bushibot · September 29, 2023

I moved to a different SATA port (different cable). scrub said no error (which seems inconsistent).

JorgeB · September 29, 2023

5 minutes ago, Bushibot said:

scrub said no error

It can be normal if all were were previously corrected during the reads, any errors for the access data are automatically corrected (if the pool is redundant)

Bushibot · September 29, 2023

So I'm still trying to figure out if I have disk issue. Mem86 did two complete tests on the upgraded memory pool (thought the old x4G had tested fine with two full passes).

I have since reformatted my cache to a 2T ZFS mirror pool in the hopes of avoiding more issues.

I still see a error for what is now disk 2 in the pool, this is the same drive that went read only prior with BTFS errors.

Scrub just instantly returns everything is fine... doesn't look like it's doing anything at all, not sure what the expected behavior is though.

I'm really trying to avoid even more down time.

pool: cache state: ONLINE scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:22:39 2023 config: NAME STATE READ WRITE CKSUM cache ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 /dev/sdg1 ONLINE 0 0 0 /dev/sdk1 ONLINE 0 0 0 errors: No known data errors

root@nas-mass:~# zpool scrub cache
root@nas-mass:~# zpool status
pool: cache
state: ONLINE
scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:20:58 2023
config:

NAME STATE READ WRITE CKSUM
cache ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdg1 ONLINE 0 0 0
sdk1 ONLINE 0 0 0

Drive passed long smart test without error

Drive reports:

Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk
Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk

The main screen says there are CRC errors (maybe old?):

199CRC error count0x003e099099000Old ageAlwaysNever102

Bushibot · September 29, 2023

I also noticed the

Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)

runs quite a bit hotter than the other drives SSD, heck even then the disks.

It will be up at 115F when the other SSD are 15 degrees or more cooler. I tried swapping drive positions etc but doesn't seem to effect their temp. Cooling path is the same.

I'm not sure if excess heat is a sign of potential issues or it just happens to run hotter.

29-09-2023 10:41Unraid Cache 2 temperatureWarning [NAS-MASS] - Cache 2 is hot (115 F)Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)warning

29-09-2023 10:09Unraid device dev1 SMART health [199]Warning [NAS-MASS] - crc error count is 102Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)warning

29-09-2023 09:52Unraid device dev1 SMART health [199]Warning [NAS-MASS] - crc error count is 102Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)warning

29-09-2023 09:47Unraid device dev1 SMART health [199]Warning [NAS-MASS] - crc error count is 102Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)warning

29-09-2023 09:44Unraid device dev1 SMART health [199]Warning [NAS-MASS] - crc error count is 102Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)warning

29-09-2023 09:33Unraid Cache 2 SMART health [199]Warning [NAS-MASS] - crc error count is 102Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)warning

Edited September 29, 2023 by Bushibot

JorgeB · September 29, 2023

21 minutes ago, Bushibot said:

Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk
Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk

These are not disks errors, post the complete diags instead.

22 minutes ago, Bushibot said:

Scrub just instantly returns everything is fine.

If there's no or very little data, a scrub will be instant.

Bushibot · September 29, 2023

Ah hah just reran scrub.


  pool: cache
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 512K in 00:01:55 with 0 errors on Fri Sep 29 11:01:37 2023
config:

	NAME           STATE     READ WRITE CKSUM
	cache          ONLINE       0     0     0
	  mirror-0     ONLINE       0     0     0
	    /dev/sdg1  ONLINE       4     0     0
	    /dev/sdk1  ONLINE       0     0     0

errors: No known data errors

With ZFS mirror on unraid can I just swap the drives or do I need to manually use the zpool repalce?

Edited September 29, 2023 by Bushibot

JorgeB · September 29, 2023

You can replace:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=480419

Could also be a cable issue.

Bushibot · September 29, 2023

2 minutes ago, JorgeB said:

You can replace:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=480419

Could also be a cable issue.

I already replaced the cables.

Bushibot · September 29, 2023

4 hours ago, JorgeB said:

These are not disks errors, post the complete diags instead.

If there's no or very little data, a scrub will be instant.

Fair enough, current diag.

my current plan is to replace the Samsung drive.

nas-mass-diagnostics-20230929-1531.zip

JorgeB · September 30, 2023

Sep 29 11:00:02 nas-mass kernel: ata1: SError: { UnrecovData 10B8B BadCRC }

This really looks like a bad SATA cable, I would try replacing it once more, if issues persist replace the device, could be making a bad connection.

Bushibot · September 30, 2023

Will try that, certainly can’t hurt

Bushibot · October 2, 2023

I'm gonna replace the drive anyway, maybe 3 drive pool? Anyway, haven't seen any more issues since cable swap so hopefully that was it, double bad or I somehow moved the wrong one.

Read only

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation