Cache BTRFS issues


ku8475

OK, so my adventure begins with me updating my Unraid to the beta. I have an X570 mobo, so I wanted to see my temps. I thought it was a good idea. It worked OK until I installed the IPMI plugin and some server monitoring docker with neat graphs from the comapp that has like a rodent or something as the logo. Anyway, overnight that killed my server. I had to restart it, and then I removed the docker and the plugin. After that it kept becoming unreachable every day or so. So I decided maybe the beta wasn't a good call, and I rolled back to 6.8.3.

When I booted it up, I found that my cache pool of two SSDs was not assigned; it no longer existed at all. It was as if there never was a cache. So I reassigned them to the appropriate slots and started the array. After that it started giving these BTRFS errors and crashing. So far I have removed the cache pool and wiped both drives completely, restored the appdata from backup, and removed and re-added the dockers from the appstore, as I thought that was what was causing the failures.

I am at a loss. Maybe it is a failing SSD, but I find it hard to believe it's passing SMART tests and now just wants to quit. I don't really grasp BTRFS, so I figured I botched that up somehow. I will attach logs that I took before the latest crash. Also, here are the logs from the most recent crash, captured as it happened.

 

Dec 2 20:11:51 Tower avahi-daemon[10415]: Registering new address record for fe80::cc6b:86ff:fe07:72fa on veth0425396.*.
Dec 2 20:45:00 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c0090000 flags=0x0090]
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1a90000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1090000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1290000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1284 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1284 CDB: opcode=0x28 28 00 3a 38 5f 80 00 00 08 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 976772992
Dec 2 20:45:02 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1490000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1285 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1285 CDB: opcode=0x28 28 00 3a 38 60 20 00 00 08 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 976773152
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c0690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1290 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1290 CDB: opcode=0x2a 2a 00 0e 99 83 c0 00 0a 00 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 244941760
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 54, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 55, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 56, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 57, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 58, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 59, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 60, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 61, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 62, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 63, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1890000 flags=0x0090]
Dec 2 20:45:03 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:03 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:03 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:03 Tower rc.diskinfo[9324]: SIGHUP received, forcing refresh of disks info.
Dec 2 20:45:03 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:03 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:04 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1a90000 flags=0x0090]
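A side note on reading the `CDB:` lines in the excerpt above: they appear to be 10-byte SCSI commands, where opcode 0x28 is READ(10) and 0x2a is WRITE(10), bytes 2 to 5 hold the big-endian logical block address, and bytes 7 to 8 hold the transfer length in blocks. Below is a minimal Python sketch that decodes them; the helper name `decode_cdb10` is made up for illustration, and it only handles these two opcodes. The decoded addresses match the sector numbers in the `print_req_error` lines, which suggests the failures are ordinary reads and writes to sdk being aborted rather than anything exotic.

```python
def decode_cdb10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as a hex string.

    Hypothetical helper for illustration; other opcodes are rejected.
    Returns (command name, logical block address, transfer length in blocks).
    """
    raw = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces between bytes
    if len(raw) != 10:
        raise ValueError("expected a 10-byte CDB")
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    opcode = raw[0]
    if opcode not in names:
        raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    lba = int.from_bytes(raw[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(raw[7:9], "big")  # bytes 7..8: transfer length
    return names[opcode], lba, blocks


if __name__ == "__main__":
    # The three CDBs logged for sdk in the excerpt.
    for cdb in (
        "28 00 3a 38 5f 80 00 00 08 00",
        "28 00 3a 38 60 20 00 00 08 00",
        "2a 00 0e 99 83 c0 00 0a 00 00",
    ):
        print(cdb, "->", decode_cdb10(cdb))
```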

 

 

Any help would be swell. Or at least instructions on how to start fresh without losing my parity and all my data.

tower-diagnostics-20201202-1908.zip

6 hours ago, ku8475 said:

AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1690000 flags=0x0090]

This is a board/kernel issue: disable IOMMU if it's not needed, look for a BIOS update, and/or try a different PCIe slot for the HBA.
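To see how the page faults and the drive resets line up, a rough tally of the syslog can help. The following is only a sketch in Python: the function name `summarize_hba_events` and the regular expressions are assumptions based on the line formats in the excerpt above, not an official log format, and the sample is a handful of lines copied from that excerpt.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt in the first post.
SAMPLE = """\
Dec 2 20:45:00 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c0090000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:03 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1690000 flags=0x0090]
"""

# The excerpt shows two spellings of the same event: one prefixed with the
# driver name and full PCI address, one prefixed with "AMD-Vi" and device=.
FAULT_BY_DRIVER = re.compile(
    r"kernel: \S+ (?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"Event logged \[IO_PAGE_FAULT"
)
FAULT_BY_IOMMU = re.compile(
    r"IO_PAGE_FAULT device=([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])"
)
RESET = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def summarize_hba_events(text):
    """Count IO_PAGE_FAULT events per PCI device and resets per SCSI target."""
    faults, resets = Counter(), Counter()
    for line in text.splitlines():
        match = FAULT_BY_DRIVER.search(line) or FAULT_BY_IOMMU.search(line)
        if match:
            faults[match.group(1)] += 1
            continue
        match = RESET.search(line)
        if match:
            resets[match.group(1)] += 1
    return faults, resets


if __name__ == "__main__":
    faults, resets = summarize_hba_events(SAMPLE)
    print("page faults per PCI device:", dict(faults))
    print("resets per SCSI target:", dict(resets))
```

If every fault points at the same PCI address (here 04:00.0, the HBA) and the resets hit the drives behind it, that fits a controller/IOMMU problem better than a single failing SSD.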

 

Also, since btrfs is showing corruption errors, it's a good idea to run memtest.
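The counters in the `BTRFS error ... errs:` lines (wr, rd, flush, corrupt, gen) are the same per-device counters that `btrfs device stats` reports for a mounted pool. As a hedged sketch, the Python below pulls the most recent values per device out of a syslog; the function name `latest_btrfs_counters` and the regular expression are assumptions based on the two sample lines, which are copied from the log in the first post.

```python
import re

# First and last counter lines from the syslog excerpt in the first post.
SAMPLE = """\
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 54, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 63, rd 0, flush 0, corrupt 57, gen 0
"""

COUNTER_RE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)\s]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(text):
    """Return the last logged error counters for each block device."""
    latest = {}
    for line in text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the final value wins.
            latest[match["bdev"]] = {name: int(match[name]) for name in FIELDS}
    return latest


if __name__ == "__main__":
    for bdev, counters in latest_btrfs_counters(SAMPLE).items():
        print(bdev, counters)
```

In the excerpt, the write-error counter for /dev/sdk1 climbs from 54 to 63 while the corruption counter stays at 57, so the corruption was recorded before this burst of failed writes.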

 

 

  • 2 months later...

I understand that this thread was for that particular incident, but I figured I would use it for the issue that I am having. I recently upgraded my processor and power supply and thought that everything went well. I started up the array without thinking to look and make sure all devices were in the list, and then noticed that one of my cache drives was missing. I shut everything down, checked all my cables, started up the system again, and was able to add it with the indication "all info will be wiped from this drive" because it was now looked at as a new drive seeing the array was started without it the previous boot. I started the array with it and started to get an error: Warning [server] - Cache pool BTRFS missing device(s).

I'm not sure what to do, seeing as it looks like everything is running fine with the exception of that error. I have attached some images that have led to my confusion about the issue. I also noticed that my cache pool size is double what it was before the incident. I thought that when the drives were pooled, the pool was only the size of one drive because the other is essentially a backup.

Let me know if there is any other info that would help you help me. I am very much a noob when it comes to the terminal and software, so please give details of what you would like me to do. Thanks.

cache1.PNG

cache2.PNG

11 hours ago, kerpster said:

and was able to add it with the indication "all info will be wiped from this drive" because it was now looked at as a new drive seeing the array was started without it the previous boot.

This will delete all data on that device, like the warning says. There's a way to re-add a device without deleting data, but it's likely too late for that. Please post the diagnostics: Tools -> Diagnostics
