ku8475 Posted December 3, 2020

Ok, so my adventure begins with me updating my Unraid to the beta. I have an X570 mobo, so I wanted to see my temps. I thought it was a good idea. It worked OK until I installed the IPMI plugin and some server monitoring docker with neat graphs from the comapp, the one that has like a rodent or something as the logo. Anyway, overnight that killed my server. I had to restart it, and then I removed the docker and the plugin. After that it kept becoming unreachable every day or so, so I decided maybe the beta wasn't a good call and recovered to 6.8.3. When I booted it up, I found that my cache pool of two SSDs was not assigned and didn't even exist anymore. It was as if there never was a cache. So I reassigned them to the appropriate spots and started the array. After that it started giving these BTRFS errors and crashing.

So far I have removed the cache array and wiped both drives completely, restored the appdata from backup, and removed and re-added the dockers from the app store, as I thought that was what was causing the failures. I am at a loss. Maybe it is a failing SSD, but I find it hard to believe that it's passing SMART tests and now just wants to quit. I don't really grasp BTRFS, so I figured I botched that up somehow. I will attach logs that I took before the latest crash. Also, here are the logs from the last crash, as it crashed.

Dec 2 20:11:51 Tower avahi-daemon[10415]: Registering new address record for fe80::cc6b:86ff:fe07:72fa on veth0425396.*.
Dec 2 20:45:00 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c0090000 flags=0x0090]
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:00 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1a90000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1090000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1290000 flags=0x0090]
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:01 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1284 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1284 CDB: opcode=0x28 28 00 3a 38 5f 80 00 00 08 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 976772992
Dec 2 20:45:02 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1490000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1285 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1285 CDB: opcode=0x28 28 00 3a 38 60 20 00 00 08 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 976773152
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c0690000 flags=0x0090]
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1290 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Dec 2 20:45:02 Tower kernel: sd 14:0:2:0: [sdk] tag#1290 CDB: opcode=0x2a 2a 00 0e 99 83 c0 00 0a 00 00
Dec 2 20:45:02 Tower kernel: print_req_error: I/O error, dev sdk, sector 244941760
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 54, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 55, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 56, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 57, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 58, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 59, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 60, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 61, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 62, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 63, rd 0, flush 0, corrupt 57, gen 0
Dec 2 20:45:02 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:02 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:02 Tower kernel: mpt3sas 0000:04:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c1890000 flags=0x0090]
Dec 2 20:45:03 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:03 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:03 Tower kernel: sd 14:0:2:0: Power-on or device reset occurred
Dec 2 20:45:03 Tower rc.diskinfo[9324]: SIGHUP received, forcing refresh of disks info.
Dec 2 20:45:03 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:03 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1690000 flags=0x0090]
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: mpt2sas_cm1: log_info(0x31120322): originator(PL), code(0x12), sub_code(0x0322)
Dec 2 20:45:04 Tower kernel: sd 14:0:1:0: Power-on or device reset occurred
Dec 2 20:45:04 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1a90000 flags=0x0090]

Any help would be swell, or at least instructions on how to start fresh without losing my parity and all my data.

tower-diagnostics-20201202-1908.zip
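For anyone reading the log above, the `CDB:` lines can be tied to the `print_req_error` sector numbers that follow them. The sketch below is not from the thread; it assumes the standard 10-byte SCSI READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8), and the function name `decode_rw10_cdb` is made up for illustration.

```python
def decode_rw10_cdb(cdb_hex: str) -> dict:
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # bytes.fromhex skips the spaces between bytes
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"unsupported opcode 0x{cdb[0]:02x}")
    return {
        "command": opcodes[cdb[0]],
        "lba": int.from_bytes(cdb[2:6], "big"),     # logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # transfer length in blocks
    }


# CDBs copied from the tag#1284 and tag#1290 lines in the log
read_cmd = decode_rw10_cdb("28 00 3a 38 5f 80 00 00 08 00")
write_cmd = decode_rw10_cdb("2a 00 0e 99 83 c0 00 0a 00 00")
print(read_cmd)   # lba 976772992, the sector in the first I/O error
print(write_cmd)  # lba 244941760, the sector in the third I/O error
```

Decoded this way, the failed commands on sdk are two 8-block reads and one 2560-block write, which lines up with the `wr` counter climbing in the BTRFS lines right after.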
JorgeB Posted December 3, 2020

6 hours ago, ku8475 said:
AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0000 address=0x00000000c1690000 flags=0x0090]

This is a board/kernel issue: disable IOMMU if it's not needed, look for a BIOS update, and/or try a different PCIe slot for the HBA. Also, since btrfs is showing corruption errors, it's a good idea to run memtest.
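The corruption errors mentioned here are the `corrupt` counter in the `BTRFS error ... errs:` lines of the posted log. As a rough sketch (not something posted in the thread), those lines can be pulled out of a syslog with a regular expression so that the last reported counters per device are easy to see. The helper name `latest_btrfs_counters` and the regex are assumptions written against the exact line format shown in the log.

```python
import re

# Matches lines such as:
#   BTRFS error (device sdj1): bdev /dev/sdk1 errs: wr 54, rd 0, flush 0, corrupt 57, gen 0
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(log_text: str) -> dict:
    """Return the most recently logged error counters for each block device."""
    latest = {}
    for match in BTRFS_ERRS.finditer(log_text):
        latest[match["bdev"]] = {name: int(match[name]) for name in FIELDS}
    return latest


# Two lines copied from the log in the first post
sample = (
    "Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 "
    "errs: wr 54, rd 0, flush 0, corrupt 57, gen 0\n"
    "Dec 2 20:45:02 Tower kernel: BTRFS error (device sdj1): bdev /dev/sdk1 "
    "errs: wr 63, rd 0, flush 0, corrupt 57, gen 0\n"
)
counters = latest_btrfs_counters(sample)
print(counters)
```

In the posted log the `wr` count on /dev/sdk1 rises from 54 to 63 within one second while `corrupt` stays at 57, so the corruption was already recorded before this burst of write errors.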
ku8475 Posted December 3, 2020 Author (edited)

OK, I will try to update the board today and run a memtest. The fact that it ran stable for a month before I did the upgrade doesn't rule these out as possible issues, does it?

Edited December 3, 2020 by ku8475
ku8475 Posted December 3, 2020 Author

So it turned out that one of the memory sticks failed MemTest86, so I am running to MicroCenter to return it now. Hopefully that fixes the issues.
kerpster Posted February 25, 2021

I understand that this thread was for that particular incident, but I figured I would use it for the issue that I am having. I recently upgraded my processor and power supply and thought that everything went well. I started up the array without thinking to check that all devices were in the list, and then noticed that one of my cache drives was missing. I shut everything down, checked all my cables, started up the system again, and was able to add it with the indication "all info will be wiped from this drive" because it was now looked at as a new drive seeing the array was started without it the previous boot. I started the array with it and started to get an error:

Warning [server] - Cache pool BTRFS missing device(s).

I'm not sure what to do, since everything looks to be running fine with the exception of that error. I have attached some images that have led to my confusion about the issue. I also noticed that my cache pool size is double what it was before the incident. I thought that when the drives were pooled, the pool was only the size of one drive because the other is essentially a backup. Let me know if there is any other info that I can provide to help you help me. I am very much a noob when it comes to the terminal and software, so please give details of what you would like me to do. Thanks.
kerpster Posted February 25, 2021

I have also run this command, which clearly tells me that the device is missing, but like I said, everything looks to be fine apart from this.
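On the pool-size question, the expectation that a mirrored two-drive pool is "the size of one drive" can be written out as simple arithmetic. This is only a generic sketch and does not diagnose the pool in this thread: it assumes a two-copy btrfs raid1 profile versus an unmirrored single profile, and the function name `usable_capacity` and the 500 GB sizes are hypothetical.

```python
def usable_capacity(device_sizes, profile):
    """Estimate usable space for a pool of devices (same unit in and out)."""
    total = sum(device_sizes)
    if profile == "single":
        # No redundancy: every block is stored once, so all space is usable.
        return total
    if profile == "raid1":
        if len(device_sizes) < 2:
            raise ValueError("raid1 needs at least two devices")
        # Two copies of every block, each copy on a different device.
        return min(total // 2, total - max(device_sizes))
    raise ValueError(f"unknown profile: {profile}")


mirrored = usable_capacity([500, 500], "raid1")     # hypothetical 500 GB drives
unmirrored = usable_capacity([500, 500], "single")
print(mirrored, unmirrored)  # 500 1000
```

So a pool that suddenly reports double the size is at least consistent with its data no longer being treated as mirrored, which is the kind of thing the diagnostics can confirm.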
JorgeB Posted February 26, 2021

11 hours ago, kerpster said:
and was able to add it with the indication "all info will be wiped from this drive" because it was now looked at as a new drive seeing the array was started without it the previous boot.

This will delete all data on that device, like the warning says. There's a way to re-add a device without deleting data, but it's likely too late for that. Please post the diagnostics: Tools -> Diagnostics
kerpster Posted February 26, 2021

Which file out of the zipped folder would you like?
itimpi Posted February 26, 2021

35 minutes ago, kerpster said:
Which file out of the zipped folder would you like?

The whole zip file.
kerpster Posted February 26, 2021

I have been informed that there was a balance issue and that the array just needs to be restarted to clear out the missing drive. Let me know if you see anything else. Thanks.

diagnostics-20210226-0803.zip
JorgeB Posted February 26, 2021

1 hour ago, kerpster said:
Let me know if you see anything else. Thanks.

Nothing else, but one thing I forgot to mention: before doing that, make sure the cache backups are up to date, just in case.
kerpster Posted February 26, 2021

That is good advice. I will do that. Thanks for the help, and I will post an update later tonight on how it goes. Thanks again.
kerpster Posted February 26, 2021

2 hours ago, kerpster said:
That is good advice. I will do that. Thanks for the help, and I will post an update later tonight on how it goes. Thanks again.

That seemed to fix the issue. Thanks so much again for the help.