Docker Dead Post Update

plantsandbinary · April 13, 2022

So I updated to 6.10.0-rc4 , to solve an issue I was having and Docker seems to have spontaneously died.

Logs indicated something wrong with my cache drive. I get this error even after multiple reboots:

Apr 13 11:01:36 Tower kernel: ata5: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0300 irq 36
Apr 13 11:01:36 Tower kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 13 11:01:36 Tower kernel: ata5.00: supports DRM functions and may not be fully accessible
Apr 13 11:01:36 Tower kernel: ata5.00: ATA-9: Samsung SSD 850 EVO 250GB, EMT03B6Q, max UDMA/133
Apr 13 11:01:36 Tower kernel: ata5.00: disabling queued TRIM support
Apr 13 11:01:36 Tower kernel: ata5.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Apr 13 11:01:36 Tower kernel: ata5.00: Features: Trust Dev-Sleep NCQ-sndrcv
Apr 13 11:01:36 Tower kernel: ata5.00: supports DRM functions and may not be fully accessible
Apr 13 11:01:36 Tower kernel: ata5.00: disabling queued TRIM support
Apr 13 11:01:36 Tower kernel: ata5.00: configured for UDMA/133
Apr 13 11:01:36 Tower kernel: sd 5:0:0:0: [sdf] 488397168 512-byte logical blocks: (250 GB/233 GiB)
Apr 13 11:01:36 Tower kernel: sd 5:0:0:0: [sdf] Write Protect is off
Apr 13 11:01:36 Tower kernel: sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Apr 13 11:01:36 Tower kernel: sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 13 11:01:36 Tower kernel: sdf: sdf1
Apr 13 11:01:36 Tower kernel: sd 5:0:0:0: [sdf] Attached SCSI disk
Apr 13 11:01:36 Tower kernel: BTRFS: device fsid 7026cabb-40c4-4fd4-9d10-135feb1312ea devid 1 transid 21806631 /dev/sdf1 scanned by udevd (700)
Apr 13 11:01:47 Tower emhttpd: Samsung_SSD_850_EVO_250GB_S3R0NF0J888231T (sdf) 512 488397168
Apr 13 11:01:47 Tower emhttpd: import 30 cache device: (sdf) Samsung_SSD_850_EVO_250GB_S3R0NF0J888231T
Apr 13 11:01:48 Tower emhttpd: read SMART /dev/sdf
Apr 13 11:01:56 Tower emhttpd: shcmd (42): mount -t btrfs -o noatime,space_cache=v2 /dev/sdf1 /mnt/cache
Apr 13 11:01:56 Tower kernel: BTRFS info (device sdf1): flagging fs with big metadata feature
Apr 13 11:01:56 Tower kernel: BTRFS info (device sdf1): using free space tree
Apr 13 11:01:56 Tower kernel: BTRFS info (device sdf1): has skinny extents
Apr 13 11:01:56 Tower kernel: BTRFS info (device sdf1): bdev /dev/sdf1 errs: wr 0, rd 0, flush 0, corrupt 844, gen 0
Apr 13 11:01:56 Tower kernel: BTRFS info (device sdf1): enabling ssd optimizations
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 7700480 have 2977452051
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 7716864 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 7733248 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 7749632 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 7684096 have 2969063443
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8323072 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8339456 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8339456 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8323072 have 0
Apr 13 11:01:59 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8339456 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8355840 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 8355840 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10764288 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:06:58 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 10780672 have 0
Apr 13 11:07:09 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 41320448 have 18446719884523653056
Apr 13 11:07:17 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 701038592 have 0
Apr 13 11:07:29 Tower kernel: BTRFS error (device sdf1): bad tree block start, want 1017528320 have 4096
Apr 13 11:07:29 Tower kernel: BTRFS: error (device sdf1) in __btrfs_free_extent:3069: errno=-5 IO failure
Apr 13 11:07:29 Tower kernel: BTRFS info (device sdf1): forced readonly
Apr 13 11:07:29 Tower kernel: BTRFS: error (device sdf1) in btrfs_run_delayed_refs:2150: errno=-5 IO failure

I'm now getting the error in fix common problems that the cache drive can't be written to. Obvously due to the above readout of "forced readonly"

I've also deleted my docker.img and then run the docker safe permissions script.

I cannot get docker to start again though no matter what. Diagnostics attached.

What's the steps to follow in this case? I haven't formatted the cache drive because I am quite sure that there's some system files there.

root@Tower:/mnt/cache# ls
CommunityApplicationsAppdataBackup/  appdata/  domains/  isos/  system/
root@Tower:/mnt/cache# cd system
root@Tower:/mnt/cache/system# ls
docker/  libvirt/

This is a pretty big catastrophe for my system. I've lost over 50 docker images and a lot of working services.

tower-diagnostics-20220413-1114.zip

plantsandbinary · April 13, 2022

For some reason also some of my shares are being reported as only partially protected, eg. the cache despite the fact that I have appdata set to "cache only".

JorgeB · April 13, 2022

That's normal since cache is a single device and not redundant.

plantsandbinary · April 14, 2022

How do I fix this? Seems that Unraid doesn't have rights to write to my cache drive and the docker.img can't be created.

Apr 14 10:54:12 Tower emhttpd: shcmd (310): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 40
Apr 14 10:54:12 Tower root: Creating new image file: '/mnt/user/system/docker/docker.img' size: 40G
Apr 14 10:54:12 Tower root: touch: cannot touch '/mnt/user/system/docker/docker.img': Read-only file system
Apr 14 10:54:12 Tower root: failed to create image file
Apr 14 10:54:12 Tower emhttpd: shcmd (310): exit status: 1
Apr 14 10:54:13 Tower avahi-daemon[31433]: Server startup complete. Host name is Tower.local. Local service cookie is 633978188.
Apr 14 10:54:14 Tower avahi-daemon[31433]: Service "Tower" (/services/ssh.service) successfully established.
Apr 14 10:54:14 Tower avahi-daemon[31433]: Service "Tower" (/services/smb.service) successfully established.
Apr 14 10:54:14 Tower avahi-daemon[31433]: Service "Tower" (/services/sftp-ssh.service) successfully established.

JorgeB · April 14, 2022

Cache filesystem is corrupt, you should backup and re-format.

plantsandbinary · April 14, 2022

It looks like the cache drive is just dead?

I backed up the data with CA Backup and Restore. Formatted the drive, booted it in maintenance mode and this is what I get:

[1/7] checking root items
[2/7] checking extents
checksum verify failed on 8355840 wanted 0x00000000 found 0xb6bde3e4
checksum verify failed on 8355840 wanted 0x00000000 found 0xb6bde3e4
bad tree block 8355840, bytenr mismatch, want=8355840, have=0
checksum verify failed on 5783552 wanted 0xdaed5a47 found 0x77236f47
checksum verify failed on 5783552 wanted 0xdaed5a47 found 0x77236f47
Csum didn't match
checksum verify failed on 7684096 wanted 0x13f0f7b0 found 0x578f2147
checksum verify failed on 7684096 wanted 0x13f0f7b0 found 0x578f2147
bad tree block 7684096, bytenr mismatch, want=7684096, have=2969063443
checksum verify failed on 7700480 wanted 0x13f077b1 found 0xcdece909
checksum verify failed on 7700480 wanted 0x13f077b1 found 0xcdece909
bad tree block 7700480, bytenr mismatch, want=7700480, have=2977452051
checksum verify failed on 7716864 wanted 0x00000000 found 0x22b4c582
checksum verify failed on 7716864 wanted 0x00000000 found 0x22b4c582
bad tree block 7716864, bytenr mismatch, want=7716864, have=0
checksum verify failed on 7733248 wanted 0x00000000 found 0xb6bde3e4
checksum verify failed on 7733248 wanted 0x00000000 found 0xb6bde3e4
bad tree block 7733248, bytenr mismatch, want=7733248, have=0

--------------------------- there is a billion lines of similar info here
---------------------- skip to the bottom
...
root 5 inode 75857766 errors 2001, no inode item, link count wrong
	unresolved ref dir 260 index 89 namelen 8 name jellyfin filetype 2 errors 4, no inode ref
root 5 inode 75857768 errors 2001, no inode item, link count wrong
	unresolved ref dir 260 index 90 namelen 8 name Jellyfin filetype 2 errors 4, no inode ref
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/sdf1
UUID: 7026cabb-40c4-4fd4-9d10-135feb1312ea
The following tree block(s) is corrupted in tree 5:
	tree block bytenr: 6094848, level: 1, node key: (5494035, 108, 696893440)
found 30542331904 bytes used, error(s) found
total csum bytes: 2828108
total tree bytes: 109182976
total fs tree bytes: 80052224
total extent tree bytes: 23707648
btree space waste bytes: 21342871
file data blocks allocated: 389063712768
 referenced 17720246272

I've formatted it as xfs, back to btrfs etc. and the check gives these errors every time.

Also for some reason during this my /var/log is giving the following error:

	Either your server has an extremely long uptime, or your syslog could be potentially being spammed with error messages. A reboot of your server will at least temporarily solve this problem, but ideally you should seek assistance in the forums and post your
  More Information

My /var/log partition is 568MB in size to account for the statistics plugin etc. I have no idea why it instantly went from 15% to 100% after this. I guess it wrote the above billion lines of SSD errors to file?

log.png.357e0e4919e055a029be691f24769276.png

plantsandbinary · April 14, 2022

Safe to say it's dead jim?

ouch.png.b28a3bd154e7acf9ad5c4b5a43c0de13.png

Here's the SMART data:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.30-Unraid] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 250GB
Serial Number:    S3R0NF0J888231T
LU WWN Device Id: 5 002538 d422b6b74
Firmware Version: EMT03B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Apr 14 11:30:52 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 133) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   093   093   000    -    30832
 12 Power_Cycle_Count       -O--CK   099   099   000    -    72
177 Wear_Leveling_Count     PO--C-   082   082   000    -    374
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   070   051   000    -    30
195 ECC_Error_Rate          -O-RC-   200   200   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
235 POR_Recovery_Count      -O--C-   099   099   000    -    7
241 Total_LBAs_Written      -O--CK   099   099   000    -    121124616664
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1           SL  VS      16  Device vendor specific log
0xa5           SL  VS      16  Device vendor specific log
0xce           SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    30 Celsius
Power Cycle Min/Max Temperature:     27/41 Celsius
Lifetime    Min/Max Temperature:     19/48 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (106)

Index    Estimated Time   Temperature Celsius
 107    2022-04-13 14:20    31  ************
 ...    ..(  7 skipped).    ..  ************
 115    2022-04-13 15:40    31  ************
 116    2022-04-13 15:50    30  ***********
 ...    ..(  8 skipped).    ..  ***********
 125    2022-04-13 17:20    30  ***********
 126    2022-04-13 17:30    29  **********
 127    2022-04-13 17:40    30  ***********
   0    2022-04-13 17:50    29  **********
 ...    ..( 20 skipped).    ..  **********
  21    2022-04-13 21:20    29  **********
  22    2022-04-13 21:30    28  *********
 ...    ..(  3 skipped).    ..  *********
  26    2022-04-13 22:10    28  *********
  27    2022-04-13 22:20    29  **********
  28    2022-04-13 22:30    29  **********
  29    2022-04-13 22:40    29  **********
  30    2022-04-13 22:50    30  ***********
 ...    ..(  2 skipped).    ..  ***********
  33    2022-04-13 23:20    30  ***********
  34    2022-04-13 23:30    31  ************
 ...    ..( 65 skipped).    ..  ************
 100    2022-04-14 10:30    31  ************
 101    2022-04-14 10:40    30  ***********
 102    2022-04-14 10:50    29  **********
 103    2022-04-14 11:00    29  **********
 104    2022-04-14 11:10    34  ***************
 105    2022-04-14 11:20    30  ***********
 106    2022-04-14 11:30    30  ***********

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           20  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           20  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

Edited April 14, 2022 by plantsandbinary

plantsandbinary · April 14, 2022

That many LBAs = 56.40 TB of data written. It's 3.5 years old also. So I'd say it's finally kicked the bucket.

Samsung states the 850 EVO should last 5 years or to 75TB TBW. So it's more than 2/3rd of the way there.

JorgeB · April 14, 2022

For now I only see a filesystem problem, not a device problem, but if the filesystem is getting frequently corrupt you likely have an underlying hardware issue.

plantsandbinary · April 14, 2022

So how can I fix the filesystem problem?

I just zeroed the entire cache drive. Reformatted it as xfs and now I get this:



    Phase 1 - find and verify superblock...
    bad primary superblock - bad magic number !!!

    attempting to find secondary superblock...

It never found the secondary superblock.

I've no idea how I could have a filesystem problem unless this happened after upgrading to 6.10.0-rc4 because that's the only thing I did recently.

My system is an HP ProLaiant Microserver Gen8. Running the last BIOS they released for it.

My syslog is getting polluted with these errors too. Never seen them before:

Apr 14 11:01:34 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0xaf83a already set (to af83a003 not 101a6b803)
Apr 14 11:01:34 Tower kernel: ------------[ cut here ]------------
Apr 14 11:01:34 Tower kernel: WARNING: CPU: 6 PID: 3903 at drivers/iommu/intel/iommu.c:2387 __domain_mapping+0x2e5/0x390
Apr 14 11:01:34 Tower kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vhost_net tun vhost vhost_iotlb tap xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding tg3 ipmi_ssif x86_pkg_temp_thermal intel_powerclamp i2c_core coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl intel_cstate intel_uncore acpi_ipmi ahci libahci ipmi_si thermal acpi_power_meter button [last unloaded: tg3]
Apr 14 11:01:34 Tower kernel: CPU: 6 PID: 3903 Comm: unraidd1 Tainted: G        W I       5.15.30-Unraid #1
Apr 14 11:01:34 Tower kernel: Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
Apr 14 11:01:34 Tower kernel: RIP: 0010:__domain_mapping+0x2e5/0x390
Apr 14 11:01:34 Tower kernel: Code: 2b 48 8b 4c 24 08 48 89 c2 4c 89 e6 48 c7 c7 8f c7 ef 81 e8 48 96 2c 00 8b 05 05 38 c3 00 85 c0 74 08 ff c8 89 05 f9 37 c3 00 <0f> 0b 8b 74 24 38 b8 34 00 00 00 8d 0c f6 83 e9 09 39 c1 0f 4f c8
Apr 14 11:01:34 Tower kernel: RSP: 0018:ffffc9000377f788 EFLAGS: 00010046
Apr 14 11:01:34 Tower kernel: RAX: 0000000000000000 RBX: ffff8881076371d0 RCX: 0000000000000027
Apr 14 11:01:34 Tower kernel: RDX: 0000000000000000 RSI: ffffc9000377f618 RDI: ffff888436f9c550
Apr 14 11:01:34 Tower kernel: RBP: ffff888107420000 R08: ffff888447f65a58 R09: 0000000000000000
Apr 14 11:01:34 Tower kernel: R10: 2933303862366131 R11: 366131303120746f R12: 00000000000af83a
Apr 14 11:01:34 Tower kernel: R13: ffff8881076371d0 R14: 0000000000001000 R15: 00000000af83a000
Apr 14 11:01:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff888436f80000(0000) knlGS:0000000000000000
Apr 14 11:01:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 11:01:34 Tower kernel: CR2: 00001513000281c0 CR3: 0000000108726006 CR4: 00000000000606e0
Apr 14 11:01:34 Tower kernel: Call Trace:
Apr 14 11:01:34 Tower kernel: <TASK>
Apr 14 11:01:34 Tower kernel: ? mempool_alloc+0x68/0x14f
Apr 14 11:01:34 Tower kernel: intel_iommu_map_pages+0xf3/0x102
Apr 14 11:01:34 Tower kernel: __iommu_map+0x138/0x211
Apr 14 11:01:34 Tower kernel: __iommu_map_sg+0x8c/0x110
Apr 14 11:01:34 Tower kernel: iommu_dma_map_sg+0x245/0x3e3
Apr 14 11:01:34 Tower kernel: __dma_map_sg_attrs+0x63/0x95
Apr 14 11:01:34 Tower kernel: dma_map_sg_attrs+0xa/0x12
Apr 14 11:01:34 Tower kernel: ata_qc_issue+0xec/0x1ab
Apr 14 11:01:34 Tower kernel: __ata_scsi_queuecmd+0x1f2/0x1fd
Apr 14 11:01:34 Tower kernel: ata_scsi_queuecmd+0x41/0x7a
Apr 14 11:01:34 Tower kernel: scsi_queue_rq+0x57d/0x6db
Apr 14 11:01:34 Tower kernel: blk_mq_dispatch_rq_list+0x2a7/0x4da
Apr 14 11:01:34 Tower kernel: __blk_mq_do_dispatch_sched+0x23d/0x281
Apr 14 11:01:34 Tower kernel: ? update_cfs_rq_load_avg+0x138/0x146
Apr 14 11:01:34 Tower kernel: __blk_mq_sched_dispatch_requests+0xd5/0x129
Apr 14 11:01:34 Tower kernel: blk_mq_sched_dispatch_requests+0x2f/0x52
Apr 14 11:01:34 Tower kernel: __blk_mq_run_hw_queue+0x50/0x76
Apr 14 11:01:34 Tower kernel: __blk_mq_delay_run_hw_queue+0x4d/0x108
Apr 14 11:01:34 Tower kernel: blk_mq_sched_insert_requests+0xa2/0xd9
Apr 14 11:01:34 Tower kernel: blk_mq_flush_plug_list+0xfb/0x12c
Apr 14 11:01:34 Tower kernel: blk_mq_submit_bio+0x2e6/0x406
Apr 14 11:01:34 Tower kernel: submit_bio_noacct+0x9d/0x203
Apr 14 11:01:34 Tower kernel: ? raid5_generate_d+0xce/0x105 [md_mod]
Apr 14 11:01:34 Tower kernel: unraidd+0x11a5/0x1237 [md_mod]
Apr 14 11:01:34 Tower kernel: ? md_thread+0x103/0x12a [md_mod]
Apr 14 11:01:34 Tower kernel: ? rmw5_write_data+0x17d/0x17d [md_mod]
Apr 14 11:01:34 Tower kernel: md_thread+0x103/0x12a [md_mod]
Apr 14 11:01:34 Tower kernel: ? init_wait_entry+0x29/0x29
Apr 14 11:01:34 Tower kernel: ? md_seq_show+0x6c8/0x6c8 [md_mod]
Apr 14 11:01:34 Tower kernel: kthread+0xde/0xe3
Apr 14 11:01:34 Tower kernel: ? set_kthread_struct+0x32/0x32
Apr 14 11:01:34 Tower kernel: ret_from_fork+0x22/0x30
Apr 14 11:01:34 Tower kernel: </TASK>
Apr 14 11:01:34 Tower kernel: ---[ end trace 7069ea449eb4fd61 ]---
Apr 14 11:01:34 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0xaf83b already set (to af83b003 not 14c18f803)
Apr 14 11:01:34 Tower kernel: ------------[ cut here ]------------
Apr 14 11:01:34 Tower kernel: WARNING: CPU: 6 PID: 3903 at drivers/iommu/intel/iommu.c:2387 __domain_mapping+0x2e5/0x390
Apr 14 11:01:34 Tower kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vhost_net tun vhost vhost_iotlb tap xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding tg3 ipmi_ssif x86_pkg_temp_thermal intel_powerclamp i2c_core coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl intel_cstate intel_uncore acpi_ipmi ahci libahci ipmi_si thermal acpi_power_meter button [last unloaded: tg3]
Apr 14 11:01:34 Tower kernel: CPU: 6 PID: 3903 Comm: unraidd1 Tainted: G        W I       5.15.30-Unraid #1
Apr 14 11:01:34 Tower kernel: Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
Apr 14 11:01:34 Tower kernel: RIP: 0010:__domain_mapping+0x2e5/0x390
Apr 14 11:01:34 Tower kernel: Code: 2b 48 8b 4c 24 08 48 89 c2 4c 89 e6 48 c7 c7 8f c7 ef 81 e8 48 96 2c 00 8b 05 05 38 c3 00 85 c0 74 08 ff c8 89 05 f9 37 c3 00 <0f> 0b 8b 74 24 38 b8 34 00 00 00 8d 0c f6 83 e9 09 39 c1 0f 4f c8
Apr 14 11:01:34 Tower kernel: RSP: 0018:ffffc9000377f788 EFLAGS: 00010046
Apr 14 11:01:34 Tower kernel: RAX: 0000000000000000 RBX: ffff8881076371d8 RCX: 0000000000000027
Apr 14 11:01:34 Tower kernel: RDX: 0000000000000000 RSI: ffffc9000377f618 RDI: ffff888436f9c550
Apr 14 11:01:34 Tower kernel: RBP: ffff888107420000 R08: ffff888447f65f50 R09: 0000000000000000
Apr 14 11:01:34 Tower kernel: R10: 2933303866383163 R11: 383163343120746f R12: 00000000000af83b
Apr 14 11:01:34 Tower kernel: R13: ffff8881076371d8 R14: 0000000000001000 R15: 00000000af83b000
Apr 14 11:01:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff888436f80000(0000) knlGS:0000000000000000
Apr 14 11:01:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 11:01:34 Tower kernel: CR2: 00001513000281c0 CR3: 0000000108726006 CR4: 00000000000606e0
Apr 14 11:01:34 Tower kernel: Call Trace:
Apr 14 11:01:34 Tower kernel: <TASK>
Apr 14 11:01:34 Tower kernel: ? mempool_alloc+0x68/0x14f
Apr 14 11:01:34 Tower kernel: intel_iommu_map_pages+0xf3/0x102
Apr 14 11:01:34 Tower kernel: __iommu_map+0x138/0x211
Apr 14 11:01:34 Tower kernel: __iommu_map_sg+0x8c/0x110
Apr 14 11:01:34 Tower kernel: iommu_dma_map_sg+0x245/0x3e3
Apr 14 11:01:34 Tower kernel: __dma_map_sg_attrs+0x63/0x95
Apr 14 11:01:34 Tower kernel: dma_map_sg_attrs+0xa/0x12
Apr 14 11:01:34 Tower kernel: ata_qc_issue+0xec/0x1ab
Apr 14 11:01:34 Tower kernel: __ata_scsi_queuecmd+0x1f2/0x1fd
Apr 14 11:01:34 Tower kernel: ata_scsi_queuecmd+0x41/0x7a
Apr 14 11:01:34 Tower kernel: scsi_queue_rq+0x57d/0x6db
Apr 14 11:01:34 Tower kernel: blk_mq_dispatch_rq_list+0x2a7/0x4da
Apr 14 11:01:34 Tower kernel: __blk_mq_do_dispatch_sched+0x23d/0x281
Apr 14 11:01:34 Tower kernel: ? update_cfs_rq_load_avg+0x138/0x146
Apr 14 11:01:34 Tower kernel: __blk_mq_sched_dispatch_requests+0xd5/0x129
Apr 14 11:01:34 Tower kernel: blk_mq_sched_dispatch_requests+0x2f/0x52
Apr 14 11:01:34 Tower kernel: __blk_mq_run_hw_queue+0x50/0x76
Apr 14 11:01:34 Tower kernel: __blk_mq_delay_run_hw_queue+0x4d/0x108
Apr 14 11:01:34 Tower kernel: blk_mq_sched_insert_requests+0xa2/0xd9
Apr 14 11:01:34 Tower kernel: blk_mq_flush_plug_list+0xfb/0x12c
Apr 14 11:01:34 Tower kernel: blk_mq_submit_bio+0x2e6/0x406
Apr 14 11:01:34 Tower kernel: submit_bio_noacct+0x9d/0x203
Apr 14 11:01:34 Tower kernel: ? raid5_generate_d+0xce/0x105 [md_mod]
Apr 14 11:01:34 Tower kernel: unraidd+0x11a5/0x1237 [md_mod]
Apr 14 11:01:34 Tower kernel: ? md_thread+0x103/0x12a [md_mod]
Apr 14 11:01:34 Tower kernel: ? rmw5_write_data+0x17d/0x17d [md_mod]
Apr 14 11:01:34 Tower kernel: md_thread+0x103/0x12a [md_mod]
Apr 14 11:01:34 Tower kernel: ? init_wait_entry+0x29/0x29
Apr 14 11:01:34 Tower kernel: ? md_seq_show+0x6c8/0x6c8 [md_mod]
Apr 14 11:01:34 Tower kernel: kthread+0xde/0xe3
Apr 14 11:01:34 Tower kernel: ? set_kthread_struct+0x32/0x32
Apr 14 11:01:34 Tower kernel: ret_from_fork+0x22/0x30
Apr 14 11:01:34 Tower kernel: </TASK>
Apr 14 11:01:34 Tower kernel: ---[ end trace 7069ea449eb4fd62 ]---
Apr 14 11:01:34 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0xaf83c already set (to af83c003 not 103e3c803)

Edited April 14, 2022 by plantsandbinary

plantsandbinary · April 14, 2022

Sorry for all the posts, frantically trying to fix things.

Seems I need to remove the intel_iommu=on from the Linux kernel boot parameters.

Due to this issue: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04565693

JorgeB · April 14, 2022

25 minutes ago, plantsandbinary said:

My syslog is getting polluted with these errors too. Never seen them before:

You should downgrade back to v6.9, looks like your hardware and the newer Linux kernel don't get along.

plantsandbinary · April 14, 2022

8 minutes ago, JorgeB said:

You should downgrade back to v6.9, looks like your hardware and the newer Linux kernel don't get along.

Roger that.

plantsandbinary · April 14, 2022

Downgraded and also replaced the drive with a spare. Still getting this error "Unmountable"

I haven't changed the filesystem or anything on the single parity and 3 array drives. What am I supposed to do now?

@JorgeB

Edited April 14, 2022 by plantsandbinary

JorgeB · April 14, 2022

Previous issue was with the cache, not the array drives, just downgrading shouldn't cause any issue with those but post new diags so we can't see what the problem is.

plantsandbinary · April 14, 2022

Ran a check first, jeez. What happened?

Last check completed on Thursday, 14 April 2022, 22:26 (today)
Duration: 8 hours, 12 minutes, 13 seconds. Average speed: 135.5 MB/s
Finding 46789 errors

If I've ever had a dirty shutdown or something I've only seen like 4-6 errors max, but near to 47,000???

Here's the full diagnostics. I cannot get these to mount no matter what and nothing should have touched them. I just swapped out the cache drive, that's it.

tower-diagnostics-20220414-2320.zip

plantsandbinary · April 14, 2022

Ok so... I've run xfs_repair -vL /dev/md1, 2, 3, etc.

Rebooted the machine. I'm up and running again and the disks are mounted.

Got myself a lost+found share with wait for it....

image.png.942fc8ce68cf9b86f4f44bc467677bec.png

with no extensions or other metadata information.

So it seems I've lost roughly 80% of all of my data... I guess it's been moved into the lost+found share.

Almost all of the stuff I lost are irreplaceable personal files. Photos, videos, writings etc. I've also lost scans of old important documents.

This is the worst data loss I've ever suffered. I don't understand how this happened. All from updating the OS? I had no Docker issues. No problem with my cache drive. Did not change any docker containers. No VMs running.

How did this happen? Isn't the whole point of the parity drive that I am supposed to be able to recover my data? Unless both the parity AND another drive both fail at the same time.

Please someone, help me understand... Is it possible to rebuild everything from lost+found? I just have a bunch of 3436, 6457472, 75683784 etc. files which I have no idea what they are but some are huge and some are tiny.

@JorgeB

Edited April 14, 2022 by plantsandbinary

plantsandbinary · April 14, 2022

Looks like my docker didn't backup properly either for some reason. So I've lost everything there too...

image.png.a330da84edc1e6db8383fc216e1e87b9.png

Some of this data loss dates back to 2003 when I was just a kid, and I carried this stuff across like 15 computers and probably more than 30 USB drives.

trurl · April 14, 2022

Parity is not a substitute for backups, whether Unraid or some RAID system.

Have you done memtest? Bad RAM could cause all sorts of issues.

itimpi · April 14, 2022

If you want to sort out lost+found I would suggest start by running the Linux ‘file’ command against its content. That will at least give you the content type for each file (and thus it’s probable file extension).

2 hours ago, plantsandbinary said:

This is the worst data loss I've ever suffered. I don't understand how this happened. All from updating the OS? I

Updating the OS does not touch the content of the drives so something else must have happened.

plantsandbinary · April 15, 2022

8 hours ago, trurl said:

Parity is not a substitute for backups, whether Unraid or some RAID system.

Have you done memtest? Bad RAM could cause all sorts of issues.

How could bad RAM destroy the metadata for everything across 4 disks? I'm just trying to make sense of this. I have certified HP ECC memory too. I realise things break at times but this was supposed to be 'the backup'.

@itimpi how do I do that? Sorry my head is swimming.

plantsandbinary · April 15, 2022

Don't the diagnostics show anything? No one has commented on any of them.

JorgeB · April 15, 2022

Diags show the filesystem corruption, but unfortunately not the reason why it happened, might have been caused by all the errors you were getting on the newer kernel, but it's just a guess, assuming it was stable before on v6.9 and continues to run stable now.

trurl · April 15, 2022

5 hours ago, plantsandbinary said:

this was supposed to be 'the backup'

So these are the backup of files you have elsewhere?

plantsandbinary · April 16, 2022

On 4/15/2022 at 3:01 PM, trurl said:

So these are the backup of files you have elsewhere?

No this was it. I've been using UNRAID now for years. My understanding was that I would need 2 drive failures (and one would have to be parity) to lose anything. So short of a lightning strike or super bad luck with 2 simultaneous drive failures. I'd be able to recover the data. In this case I lost all the metadata and now I just have 120k files and 62k folders just sitting in lost+found which I have no idea what to do with.

I'm grateful for the help thus far, but everything shows that my hardware is fine, and I only ever experienced problems after updating. It's my own fault for not having another backup but I was suggested to update to solve a security issue I found during login.

This was the result. So naturally I am feeling super bitter about this. I carried a lot of this stuff through CDs in 2003 onward. This is 18 years worth of stuff smashed and the damage extends through everything I have. How could kernel issues cause so many inode issues?

Is there any way I can restore the metadata? Surely someone else has been in a similar situation, even to a lesser extent. Is there some utility I can use on these xfs drives? Also, is there a more redundant way I can set this up in the future using UNRAID of course excepting having another backup of all of my data (because I will obviously do that). Maybe use zfs or btrfs or something instead?

Edited April 16, 2022 by plantsandbinary

Docker Dead Post Update

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation