Read only


Bushibot

I seem to be having an issue where my Docker containers (and maybe the whole system?) are going read-only. I first see this when some containers start to fail with different issues.

I see in the syslog:

Sep 26 13:19:43 nas-mass kernel: BTRFS error (device sdh1): block=1088175357952 write time tree block corruption detected
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 26 13:19:43 nas-mass kernel: BTRFS info (device sdh1: state E): forced readonly
Sep 26 13:19:43 nas-mass kernel: BTRFS warning (device sdh1: state E): Skipping commit of aborted transaction.
Sep 26 13:19:43 nas-mass kernel: BTRFS: error (device sdh1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 26 13:32:21 nas-mass kernel: traps: mariadbd[17707] general protection fault ip:14ed75a28898 sp:14ed7406f500 error:0 in libc.so.6[14ed75a28000+195000]

 

I have no idea what could be causing this. I also found the system locked up overnight.

Attached are the diagnostics from when I first noticed the read-only issue starting to come up.

 

nas-mass-diagnostics-20230926-1333.zip

1 hour ago, JorgeB said:

There's possibly a btrfs issue with the v6.12 kernel, since it's been happening to some users without an apparent reason. Previously, 9 out of 10 times this happened, memtest would find errors. If that is the case, either go back to v6.11.5 to see if it no longer happens, or try ZFS.

Can I go to ZFS without a data wipe? I'm a little light on RAM for that, only 24 GB, though I can go to 32...

How would I downgrade the kernel?

     Looks like a rollback is required? https://docs.unraid.net/unraid-os/release-notes/6.12.4/#rolling-back

I'm letting memtest run today, but the server has been down for a day and the Plex kids are starting to gripe :P.


Well, crap.

Memory passed testing, but now I'm getting BTRFS errors when running the mover just to try and clear the cache pool that went read-only.
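
For reference, the mover can also be kicked off manually from the console, using the same binary that shows up in the syslog below:

root@nas-mass:~# /usr/local/sbin/mover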

I really can't tell what's going on here.

Ran a scrub:

 

UUID:             a267484f-8dc7-4df0-b715-36becf7e1795
no stats available
Total to scrub:   227.54GiB
Rate:             0.00B/s
Error summary:    no errors found

Yet:

Sep 27 17:42:41 nas-mass emhttpd: shcmd (93): /usr/local/sbin/mover &> /dev/null &
Sep 27 17:43:21 nas-mass flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Sep 27 17:50:53 nas-mass kernel: ata16.00: exception Emask 0x10 SAct 0xe0 SErr 0x0 action 0x6 frozen
Sep 27 17:50:53 nas-mass kernel: ata16.00: irq_stat 0x08000000, interface fatal error
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/80:28:f0:80:7d/02:00:0d:00:00/40 tag 5 ncq dma 327680 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/00:30:70:83:7d/02:00:0d:00:00/40 tag 6 ncq dma 262144 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 27 17:50:53 nas-mass kernel: ata16.00: cmd 60/20:38:80:ae:18/00:00:00:00:00/40 tag 7 ncq dma 16384 in
Sep 27 17:50:53 nas-mass kernel:         res 40/00:38:80:ae:18/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 17:50:53 nas-mass kernel: ata16.00: status: { DRDY }
Sep 27 17:50:53 nas-mass kernel: ata16: hard resetting link
Sep 27 17:50:58 nas-mass kernel: ata16: link is slow to respond, please be patient (ready=0)
Sep 27 17:51:03 nas-mass kernel: ata16: COMRESET failed (errno=-16)
Sep 27 17:51:03 nas-mass kernel: ata16: hard resetting link
Sep 27 17:51:03 nas-mass kernel: ata16: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: supports DRM functions and may not be fully accessible
Sep 27 17:51:03 nas-mass kernel: ata16.00: configured for UDMA/133
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 Sense Key : 0x5 [current] 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 ASC=0x21 ASCQ=0x4 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#5 CDB: opcode=0x28 28 00 0d 7d 80 f0 00 02 80 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226328816 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 Sense Key : 0x5 [current] 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 ASC=0x21 ASCQ=0x4 
Sep 27 17:51:03 nas-mass kernel: sd 18:0:0:0: [sdj] tag#6 CDB: opcode=0x28 28 00 0d 7d 83 70 00 02 00 00
Sep 27 17:51:03 nas-mass kernel: I/O error, dev sdj, sector 226329456 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Sep 27 17:51:03 nas-mass kernel: ata16: EH complete
Sep 27 17:51:03 nas-mass kernel: ata16.00: Enabling discard_zeroes_data
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209100288 (dev /dev/sdj1 sector 226327480)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209067520 (dev /dev/sdj1 sector 226327416)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209104384 (dev /dev/sdj1 sector 226327488)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209108480 (dev /dev/sdj1 sector 226327496)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208739840 (dev /dev/sdj1 sector 226326776)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208735744 (dev /dev/sdj1 sector 226326768)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208756224 (dev /dev/sdj1 sector 226326808)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209075712 (dev /dev/sdj1 sector 226327432)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5208764416 (dev /dev/sdj1 sector 226326824)
Sep 27 17:51:03 nas-mass kernel: BTRFS info (device sdj1): read error corrected: ino 524028 off 5209083904 (dev /dev/sdj1 sector 226327448)
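
I'll also keep an eye on the per-device error counters to see whether they keep climbing; if I have the syntax right, it's something like this (assuming the pool is mounted at /mnt/cache):

root@nas-mass:~# btrfs device stats /mnt/cache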

 


So I'm still trying to figure out if I have a disk issue. Memtest86 did two complete passes on the upgraded memory (though the old x4G set had tested fine with two full passes).

I have since reformatted my cache as a 2 TB ZFS mirror pool in the hopes of avoiding more issues.

I still see an error for what is now disk 2 in the pool; this is the same drive that went read-only before with the BTRFS errors.

A scrub just instantly returns that everything is fine... it doesn't look like it's doing anything at all. I'm not sure what the expected behavior is, though.

I'm really trying to avoid even more down time.

 

  pool: cache
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:22:39 2023
config:

        NAME           STATE     READ WRITE CKSUM
        cache          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            /dev/sdg1  ONLINE       0     0     0
            /dev/sdk1  ONLINE       0     0     0

errors: No known data errors

 

root@nas-mass:~# zpool scrub cache
root@nas-mass:~# zpool status
  pool: cache
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Fri Sep 29 10:20:58 2023
config:

        NAME        STATE     READ WRITE CKSUM
        cache       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg1    ONLINE       0     0     0
            sdk1    ONLINE       0     0     0

 

The drive passed a long SMART test without error.
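
For reference, I believe the console equivalent is roughly this (start the long self-test, then check the self-test log once it finishes):

root@nas-mass:~# smartctl -t long /dev/sdk
root@nas-mass:~# smartctl -l selftest /dev/sdk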

The syslog reports this for the drive:

Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk
Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk

 

The main screen says there are CRC errors (maybe old?):

#    Attribute name   Flag    Value  Worst  Threshold  Type     Updated  Failed  Raw value
199  CRC error count  0x003e  099    099    000        Old age  Always   Never   102
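
To see whether that raw count is old or still climbing, I'm planning to recheck attribute 199 now and then; something like this should work:

root@nas-mass:~# smartctl -A /dev/sdk | grep -i crc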


I also noticed the 

Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)

runs quite a bit hotter than the other SSDs, heck, even hotter than the spinning disks.

It will be up at 115°F when the other SSDs are 15 degrees or more cooler. I tried swapping drive positions, etc., but it doesn't seem to affect their temps. The cooling path is the same.

I'm not sure if the excess heat is a sign of potential issues or if it just happens to run hotter.

29-09-2023 10:41  Unraid Cache 2 temperature             Warning [NAS-MASS] - Cache 2 is hot (115 F)  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)   warning
29-09-2023 10:09  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:52  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:47  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:44  Unraid device dev1 SMART health [199]  Warning [NAS-MASS] - crc error count is 102  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (dev1)  warning
29-09-2023 09:33  Unraid Cache 2 SMART health [199]      Warning [NAS-MASS] - crc error count is 102  Samsung_SSD_870_QVO_2TB_S6R4NJ0R701417P (sdk)   warning
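
In case it helps, I'm spot-checking the SSD temps from the console too; a rough sketch (the exact attribute name varies by drive):

root@nas-mass:~# smartctl -A /dev/sdk | grep -i temperature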

 

21 minutes ago, Bushibot said:

Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk
Sep 29 09:47:51 nas-mass emhttpd: error: get_device_size, 1589: No such device or address (6): open: /dev/sdk

These are not disk errors; post the complete diags instead.

 

22 minutes ago, Bushibot said:

Scrub just instantly returns everything is fine.

If there's no or very little data, a scrub will be instant.


Ah hah, just reran the scrub.


  pool: cache
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 512K in 00:01:55 with 0 errors on Fri Sep 29 11:01:37 2023
config:

	NAME           STATE     READ WRITE CKSUM
	cache          ONLINE       0     0     0
	  mirror-0     ONLINE       0     0     0
	    /dev/sdg1  ONLINE       4     0     0
	    /dev/sdk1  ONLINE       0     0     0

errors: No known data errors

 

With a ZFS mirror on Unraid, can I just swap the drives, or do I need to manually use zpool replace?
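
If it does have to be done by hand, I'm guessing the sequence is roughly this (the new device path is just a placeholder):

root@nas-mass:~# zpool replace cache /dev/sdg1 /dev/sdX1   # resilver onto the new drive
root@nas-mass:~# zpool status cache                        # watch resilver progress
root@nas-mass:~# zpool clear cache                         # clear the old error counters afterwards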

