mbc0

February 28, 2023

Hi, I have just installed an NVIDIA P4000 and am getting these errors (loads of them in the log). I have swapped slots with my HBA card but am still getting them. Can anyone advise, please?

Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: [ 6] BadTLP
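
To see whether the corrected errors all come from one port and one error type, the syslog lines can be tallied. This is only a rough sketch: the regular expression is written against the four lines quoted above, and the function name `count_aer_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches the bracketed error-name line, e.g.
# "pcieport 0000:00:01.1: [ 6] BadTLP"
AER_NAME = re.compile(r"pcieport (\S+): \[\s*\d+\]\s+(\w+)")

def count_aer_errors(lines):
    """Count (port, error name) pairs found in syslog lines."""
    counts = Counter()
    for line in lines:
        match = AER_NAME.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# The lines from the post above, used as sample input.
sample = [
    "Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0",
    "Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000",
    "Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: [ 6] BadTLP",
]

for (port, name), n in count_aer_errors(sample).items():
    print(port, name, n)
```

Feeding it the full syslog (for example, the lines of the file from the diagnostics zip) would show whether the count still follows port 00:01.1 after the slot swap.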

BIOS:American Megatrends Inc. Version F13a. Dated: 11/30/2021

CPU:AMD Ryzen Threadripper 2950X 16-Core @ 3500 MHz

HVM:Enabled

IOMMU:Enabled

Cache:1536 KiB, 8 MB, 32 MB

Memory:64 GiB DDR4 (max. installable capacity 512 GiB)

Network:bond0: fault-tolerance (active-backup), mtu 1500

Kernel:Linux 5.19.17-Unraid x86_64

OpenSSL:1.1.1s

unraid1-diagnostics-20230228-1731.zip

February 27, 2023

On 2/20/2023 at 7:22 PM, KluthR said:

I don't think so. You can see that this rsync always deletes the backup set, so you only have one working backup. Now it should grow to the "keep x backups" setting.

Is it always this calendar app that's failing? Could you look into what's inside it?

Since your Nextcloud container is stopped, there is nothing accessing it. Weird.

Some posts ago I asked a user to add some diag code; could you do the same? I don't have the post link, I'm on my mobile phone.

Thank you so much for your help. The problem was traced to an unreadable file and folder on my cache drive, and I had to format it to get rid of them! Backup is now working as expected! 🙂

February 26, 2023

1 hour ago, Frank1940 said:

Set up the syslog server and upload the file after the next instance of the problem. Instructions below:

I already have this set up. Sorry, I was not clear, but that is what I meant by no useful information in the log at all.

1 hour ago, Frank1940 said:

Be aware that this type of problem is often hardware related and nothing shows up in the syslog. You can also try booting into the safe mode and see if the problem occurs in that mode. May we assume that you also use access to your shares? Have you installed the Dynamix System Temperature plugin? Is the PS new? (Power supplies-- both new and old --are often the cause of these types of problems...)

I am using the existing PSU from my previous Threadripper setup; all I have changed is the motherboard and CPU.

I have the Dynamix System Temperature plugin installed and the CPU is normally around 26C; the highest I have seen is 40C when being hammered.

I can see nothing more I can do with this setup. Maybe it is an intermittent issue with the motherboard? Since installation I have had RAM-related issues, a corrupt NVMe cache drive, and lost all dockers and VMs (all on separate occasions), but when running, everything is perfect for 1-3 days.

I will transplant the Threadripper back in and set up another server using this suspect motherboard/CPU config to see what happens, unless you can think of anything else I can try.

The only reason(s) for doing all this were to save energy costs.

February 26, 2023

Hi,

I have had nothing but problems with this new motherboard/cpu config

M/B:ASRock Z790 Pro RS/D4 Version

BIOS:American Megatrends International, LLC. Version 4.01. Dated: 01/06/2023

CPU:13th Gen Intel® Core™ i5-13600K @ 3465 MHz

When the server locks up, there is still power and the network card is flashing, but I cannot access the server at all, even directly with a monitor plugged in or via SSH etc.

I am now using a 3rd set of 4 DIMMs since the problems began, so I would safely say it's not the RAM.

Temps are never above 40C so confident of not being a temperature related issue.

I have attached DIAGS since powering off/on at 18:12 tonight

I have logs written to a share but there is nothing of any use written in the log and no errors.

The BIOS has been reset to standard, no power limits, no overclock, everything totally standard.

Where can I go from here to find the problem, please? I think I am going to have to swap back to my Threadripper for reliability and put this motherboard/CPU config onto a test bench.

unraid1-diagnostics-20230226-1820.zip

February 22, 2023

Hi, just wondering if anyone can confirm please?

February 21, 2023

Been using shrmn/gsdock for years without issue, I did not realise there was another repo?

February 21, 2023

5 minutes ago, BRiT said:

Run memtest for at least 12 hours.

You likely have wrong settings for memory in your bios or are running the memory overclocked without realizing it.

I ran the 4 sticks I took out in memtest for over 24 hours in another board with no issues, and as all 8 sticks ran without issue in my previous config, I am confident there is no physical issue with the RAM (all Corsair Vengeance).

I have not overclocked, but I did choose a power saving option in the BIOS (saving power was the reason for changing from a Threadripper). I cannot remember which option it was, but I am going to change it back to standard just to rule it out.

I have also enabled power saving in Tips and Tweaks, as this is what brought me under the 150W I was looking for. Do you think that could have any influence on this problem?

image.png.749c5529a78c25d62468e7990b5857bd.png

February 21, 2023

26 minutes ago, JorgeB said:
Feb 21 11:10:53 UNRAID1 kernel: BTRFS error (device nvme2n1p1): block=160911851520 write time tree block corruption detected
This usually means bad RAM or other kernel memory corruption, maybe the RAM issues are not totally solved.

This is worrying! I replaced all 4 sticks of RAM the other day, so the chances of it actually being the RAM are slim to none, I would have thought. It is a brand new motherboard and CPU. My old motherboard had 8 slots for RAM and never missed a beat. I have now had these errors on both sets of 4, so I am assuming it is not the RAM causing the issue. Maybe it is the motherboard (ASRock Z790 Pro RS/D4)? I am on the latest BIOS. Are there any steps I can take to try and get on top of this issue?

February 21, 2023

Unable to stop the server currently 😞

Feb 21 11:22:18 UNRAID1 emhttpd: shcmd (2491271): umount /mnt/cache
Feb 21 11:22:18 UNRAID1 root: umount: /mnt/cache: target is busy.
Feb 21 11:22:18 UNRAID1 emhttpd: shcmd (2491271): exit status: 32
Feb 21 11:22:18 UNRAID1 emhttpd: Retry unmounting disk share(s)...
Feb 21 11:22:23 UNRAID1 emhttpd: Unmounting disks...
Feb 21 11:22:23 UNRAID1 emhttpd: shcmd (2491272): umount /mnt/cache
Feb 21 11:22:23 UNRAID1 root: umount: /mnt/cache: target is busy.
Feb 21 11:22:23 UNRAID1 emhttpd: shcmd (2491272): exit status: 32
Feb 21 11:22:23 UNRAID1 emhttpd: Retry unmounting disk share(s)...
Feb 21 11:22:28 UNRAID1 emhttpd: Unmounting disks...
Feb 21 11:22:28 UNRAID1 emhttpd: shcmd (2491273): umount /mnt/cache
Feb 21 11:22:28 UNRAID1 root: umount: /mnt/cache: target is busy.
Feb 21 11:22:28 UNRAID1 emhttpd: shcmd (2491273): exit status: 32
Feb 21 11:22:28 UNRAID1 emhttpd: Retry unmounting disk share(s)...
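
When `umount` reports "target is busy", something still has a file or working directory open under `/mnt/cache`. The sketch below lists such processes by reading the symlinks under `/proc`. It is an illustration only: the function name `pids_using` is made up, it needs to run as root to see other users' processes, and it will not show kernel-side users of the mount such as a loop-mounted image file.

```python
import os

def pids_using(mount, proc_root="/proc"):
    """Return {pid: set of paths} for processes whose cwd or open
    file descriptors point at or below the given mount point."""
    mount = mount.rstrip("/") or "/"
    prefix = mount if mount == "/" else mount + "/"
    busy = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited or access denied
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == mount or target.startswith(prefix):
                busy.setdefault(int(entry), set()).add(target)
    return busy

if __name__ == "__main__":
    for pid, paths in sorted(pids_using("/mnt/cache").items()):
        print(pid, sorted(paths))
```

Any PID it prints can then be looked up (for example with `ps -p <pid>`) before deciding whether to stop it.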

February 21, 2023

Hi @JorgeB

I have just had my server crash and have attached diags. Is this related to the other crashes you helped me with last week? The server is on 24/7 and this is the first problem since the possible RAM issue/corruption I had last week.

Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

unraid1-diagnostics-20230221-1113.zip

February 20, 2023

Hey @KluthR

All done. My backups were OK and not broken, but I changed my USB location as you advised. I think the problem is physically reading the files mentioned in Nextcloud. Do you think that could be the cause, as it still fails?

backup.log

February 20, 2023

11 minutes ago, KluthR said:
Could you share this file as attachment?
/boot/config/plugins/ca.backup2/BackupOptions.json

Attached, thank you!

BackupOptions.json

February 20, 2023

1 hour ago, KluthR said:

Hmm. What's happening when you run a manual backup? Does it raise the same error?

Hi yes, I have run a couple of manual backups trying to diagnose the issue.

February 20, 2023

24 minutes ago, KluthR said:

Please post the full backup log as code or as an attachment, thanks.

Many thanks!

backup.log

February 20, 2023

Hi,

Can anyone please help me out with what could be causing the input/output error I am getting with CA Backup on my Nextcloud container?

The container is stopped and I do not even have calendar installed on Nextcloud

[20.02.2023 03:42:10] Backing Up: nextcloud
/usr/bin/tar: nextcloud/www/nextcloud/apps/calendar/: Cannot savedir: Input/output error
/usr/bin/tar: Exiting with failure status due to previous errors
[20.02.2023 03:43:21] tar creation/extraction failed!
[20.02.2023 03:43:21] Verifying Backup nextcloud
[20.02.2023 03:44:26] Backing Up: unifi-controller
[20.02.2023 03:44:46] Verifying Backup unifi-controller
[20.02.2023 03:44:51] Backing Up: zigbee2mqtt
[20.02.2023 03:44:51] Verifying Backup zigbee2mqtt
[20.02.2023 03:44:51] done
[20.02.2023 03:44:51] Starting gsdock... (try #1) done!
[20.02.2023 03:44:55] Starting mariadb... (try #1) done!
[20.02.2023 03:44:59] Starting nextcloud... (try #1) done!
[20.02.2023 03:45:04] Starting unifi-controller... (try #1) done!
[20.02.2023 03:45:12] A error occurred somewhere. Not deleting old backup sets of appdata
[20.02.2023 03:45:12] Backup / Restore Completed
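
The `Cannot savedir: Input/output error` from tar means a directory could not be listed at all. One way to find every path affected, not just the first one tar trips over, is to walk the appdata folder and try to read everything. This is a minimal sketch; the function name `find_unreadable` is made up, and the appdata path in the usage line is an assumption about where the container's data lives.

```python
import os
import stat

def find_unreadable(root):
    """Walk root and return (path, reason) for every directory that
    cannot be listed and every regular file that cannot be read."""
    bad = []

    def on_error(err):
        # Called by os.walk when a directory cannot be listed.
        bad.append((err.filename, err.strerror))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue  # skip symlinks, sockets, FIFOs
                with open(path, "rb") as fh:
                    while fh.read(1 << 20):  # read in 1 MiB chunks
                        pass
            except OSError as err:
                bad.append((path, err.strerror))
    return bad

if __name__ == "__main__":
    # Assumed location; adjust to the real appdata share.
    for path, reason in find_unreadable("/mnt/cache/appdata/nextcloud"):
        print(reason, path)
```

Reading every file takes a while on a large appdata folder, but it exercises the same read path that tar uses.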

February 18, 2023

1 hour ago, JorgeB said:

That's good, see here for how to reset stats for cachetwo, then see also there how to better monitor the pools, if more corruption errors appear you likely still have some hardware issue.

Thank you, I will read through this to better monitor the pools, again, huge thanks for your help!

February 18, 2023

ok, so have scrubbed all 3 pools and no errors found on any of them!

February 18, 2023

9 minutes ago, JorgeB said:

Run a correction scrub on the pool to see if any errors are detected, if there are uncorrectable errors post new diags after the scrub.

Hi, which pool do you mean please?

February 18, 2023

4 minutes ago, JorgeB said:

Post new diags after array start.

Attached, many thanks

unraid1-diagnostics-20230218-1125.zip

February 18, 2023

Thank you @JorgeB

1, I have run the memtest which passed but I have another 4 sticks which I have now replaced the existing RAM with.

2, You say there was existing corruption? How do I address that, please?

3, Is this not an NVME problem then? "BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected"

4, I will look into ipvlan now.

Thanks again!

February 18, 2023

I also just saw this in the log when I started to shut down, after I took the diags:

Feb 18 09:48:42 UNRAID1 kernel: BUG: Bad rss-counter state mm:00000000ee5ad6d6 type:MM_ANONPAGES val:1
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 83360 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 188192 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state A) in __btrfs_update_delayed_inode:999: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS info (device loop3: state EA): forced readonly
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in __btrfs_run_delayed_items:1092: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS warning (device loop3: state EA): Skipping commit of aborted transaction.
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in cleanup_transaction:1982: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: docker0: port 4(vethb7c8447) entered disabled state
Feb 18 09:48:45 UNRAID1 kernel: vethd950096: renamed from eth0
Feb 18 09:48:45 UNRAID1 root: Error response from daemon: error while removing network: network br0 id b58d2467fa57ae7061498d882571927884c846ee782347c86f3071b85475f1f0 has active endpoints
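
The `errs: wr N, rd N, flush N, corrupt N, gen N` lines carry running per-device error counters, so the last one seen for each device gives the current totals. The sketch below pulls them out of syslog lines. The pattern is written against the two counter lines quoted above, and the function name `latest_btrfs_counters` is made up for illustration.

```python
import re

# Matches e.g. "BTRFS error (device loop3): bdev /dev/loop3 errs:
#               wr 1, rd 0, flush 0, corrupt 0, gen 0"
ERRS = re.compile(
    r"BTRFS error \(device (\S+?)\): bdev (\S+) errs: "
    r"wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")

def latest_btrfs_counters(lines):
    """Return {bdev: {counter: value}} using the last line per device."""
    latest = {}
    for line in lines:
        match = ERRS.search(line)
        if match:
            values = (int(v) for v in match.groups()[2:])
            latest[match.group(2)] = dict(zip(FIELDS, values))
    return latest

# The two counter lines from the log above.
sample = [
    "Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0",
]

print(latest_btrfs_counters(sample))
```

Here the errors are write errors on `loop3`, a loop device, rather than on the NVMe partition itself.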

February 18, 2023

Hi,

I woke this morning to find most of my dockers stopped. It looks like CA Backup stopped my dockers ready to back up, as the backup folder was created but is empty. I can see all these BTRFS errors on nvme2n1p1, but I cannot work out which physical drive this is. This is the 3rd NVMe-related problem I have had on this new motherboard in as many weeks, so I really need to get to the bottom of it. Any help will be greatly appreciated!

image.png.fe233f5d6541bb270244a9146851f967.png

unraid1-diagnostics-20230218-0934.zip
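
On working out which physical drive `nvme2n1p1` is: on Linux the model and serial number of an NVMe disk are normally exposed under `/sys/block/<disk>/device/`, and the serial can be matched against the drive's label. This is a sketch under that assumption; the function name `nvme_identity` is made up, and the sysfs layout may differ on some kernels.

```python
import os
import re

def nvme_identity(partition, sys_block="/sys/block"):
    """Map a name like 'nvme2n1p1' to its disk, model and serial."""
    # Strip a trailing partition suffix: "nvme2n1p1" -> "nvme2n1".
    disk = re.sub(r"p\d+$", "", partition)
    info = {"disk": disk}
    for field in ("model", "serial"):
        path = os.path.join(sys_block, disk, "device", field)
        try:
            with open(path) as fh:
                info[field] = fh.read().strip()
        except OSError:
            info[field] = None  # attribute missing or unreadable
    return info

if __name__ == "__main__":
    print(nvme_identity("nvme2n1p1"))
```

NVMe device numbering can change between boots, so the serial is the safer thing to go by.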

February 17, 2023

Hi,

I feel I should know the answer to this but can I use the iGPU on my 13600K in a VM and dockers like Plex still have access to it?

February 10, 2023

Thank you for all your help @JorgeB @trurl

Posts posted by mbc0

pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0

[Plugin] CA Appdata Backup / Restore v2.5

Server locks up after 1-3 days how can I investigate this?

Server locks up after 1-3 days how can I investigate this?

Current/New - unRaid HP Proliant Edition - RMRR Error Patching

Job Opportunity - Need someone to help build a docker for GoodSync

Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

[Plugin] CA Appdata Backup / Restore v2.5

[Plugin] CA Appdata Backup / Restore v2.5

[Plugin] CA Appdata Backup / Restore v2.5

[Plugin] CA Appdata Backup / Restore v2.5

[Plugin] CA Appdata Backup / Restore v2.5

CA Backup not able to backup Nextcloud container? input/output error

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

BTRFS error transid verfify failed - 2 callbacks suppressed

iGPU in a VM and dockers possible?

One of my share subfolders is showing empty in windows and "Structure needs cleaning" on cli