
mbc0


Posts posted by mbc0

  1. Hi, I have just installed an NVIDIA P4000 and am getting these errors (loads of them in the log). I have swapped slots with my HBA card but am still getting them. Can anyone advise, please?

     


    Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
    Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
    Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
    Feb 28 17:34:01 UNRAID1 kernel: pcieport 0000:00:01.1:    [ 6] BadTLP                
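    For anyone decoding the `error status/mask` line by hand: the first value is a bitmask of the correctable errors reported and the second is the mask register. A minimal Python sketch, assuming the standard PCIe AER correctable-error bit layout and the short names the Linux kernel prints; the function names are made up for illustration:

```python
import re

# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux kernel prints in its log.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known bits that are set in value."""
    return [name for bit, name in AER_CORRECTABLE_BITS.items() if (value >> bit) & 1]


def decode_aer_line(line):
    """Return (reported, masked) error names for one status/mask line, or None."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status, mask = (int(group, 16) for group in m.groups())
    return decode_bits(status), decode_bits(mask)


line = "pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000"
print(decode_aer_line(line))
```

    For the line pasted above this reports BadTLP as the only error, with NonFatalErr and CorrIntErr masked, which agrees with the `[ 6] BadTLP` line the kernel prints.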
        

     

    BIOS:American Megatrends Inc. Version F13a. Dated: 11/30/2021

    CPU:AMD Ryzen Threadripper 2950X 16-Core @ 3500 MHz

    HVM:Enabled

    IOMMU:Enabled

    Cache:1536 KiB, 8 MB, 32 MB

    Memory:64 GiB DDR4 (max. installable capacity 512 GiB)

    Network:bond0: fault-tolerance (active-backup), mtu 1500

    Kernel:Linux 5.19.17-Unraid x86_64

    OpenSSL:1.1.1s

     

     

    unraid1-diagnostics-20230228-1731.zip

  2. On 2/20/2023 at 7:22 PM, KluthR said:

    I don't think so. You can see that this rsync always deletes the backup set, so you only have one working backup. Now it should grow to the "keep x backups" setting.

     

    Is it always this calendar app that's failing? Could you look at what's inside it?

     

    Since your Nextcloud container is stopped, there is nothing accessing it. Weird.

     

    Some posts ago I asked a user to add some diag code; could you do the same? I don't have the post link, I'm on my mobile phone.

    Thank you so much for your help. The problem was traced to an unreadable file and folder on my cache drive; I had to format the drive to get rid of them. Backup is now working as expected! 🙂

  3. 1 hour ago, Frank1940 said:

    Set up the syslog server and upload the file after the next instance of the problem.  Instructions below:

    I already have this set up. Sorry, I was not clear, but that is what I meant by there being no useful information in the log at all.

     

    1 hour ago, Frank1940 said:

     

       

    Be aware that this type of problem is often hardware related and nothing shows up in the syslog.   You can also try booting into the safe mode and see if the problem occurs in that mode.  May we assume that you also lose access to your shares?  Have you installed the Dynamix System Temperature plugin?  Is the PS new?  (Power supplies, both new and old, are often the cause of these types of problems...)

    I am using the existing PSU from my previous Threadripper setup; all I have changed is the motherboard and CPU.

     

    I have the Dynamix System Temperature plugin installed. The CPU normally sits around 26C; the highest I have seen is 40C when it is being hammered.

     

    I can see no more I can do with this setup. Maybe an intermittent issue with the motherboard? Since installation I have had RAM-related issues, a corrupt NVMe cache drive, and lost all my dockers and VMs (all on separate occasions), but when it is running, everything is perfect for 1-3 days.

     

    I will transplant the Threadripper back in and set up another server using this suspect motherboard/CPU config to see what happens, unless you can think of anything else I can try.

     

    The only reason for doing all this was to save on energy costs.

     

  4. Hi,

     

    I have had nothing but problems with this new motherboard/CPU config:

    M/B:ASRock Z790 Pro RS/D4 Version

    BIOS:American Megatrends International, LLC. Version 4.01. Dated: 01/06/2023

    CPU:13th Gen Intel® Core™ i5-13600K @ 3465 MHz

     

    When the server locks up, there is still power and the network card is flashing, but I cannot access the server at all, not even directly with a monitor plugged in or over SSH.

     

    I am now on a third set of 4 DIMMs since the problems began, so I would safely say it's not the RAM.

    Temps are never above 40C, so I am confident it is not a temperature-related issue.

    I have attached diagnostics taken after power-cycling at 18:12 tonight.

    I have logs written to a share, but there is nothing of any use in the log and no errors.

     

    The BIOS has been reset to standard, no power limits, no overclock, everything totally standard.

     

    Where can I go from here to find the problem, please?  I think I am going to have to swap back to my Threadripper for reliability and put this motherboard/CPU config onto a test bench.

    unraid1-diagnostics-20230226-1820.zip

  5. 5 minutes ago, BRiT said:

    Run memtest for at least 12 hours.

     

    You likely have wrong settings for memory in your bios or are running the memory overclocked without realizing it.

     

    I ran the 4 sticks I took out in memtest for over 24 hours in another board with no issues, and as all 8 sticks ran without issue in my previous config, I am confident there is no physical issue with the RAM (all Corsair Vengeance).

     

    I have not overclocked, but I did choose a power-saving option in the BIOS (power saving being the reason for changing from a Threadripper). I cannot remember which option it was, but I am going to change it back to standard to rule it out as a cause.

     

    I have also enabled power saving in Tips and Tweaks, as this is what brought me under the 150 W I was looking for. Do you think that could have any influence on this problem?

    image.png.749c5529a78c25d62468e7990b5857bd.png

  6. 26 minutes ago, JorgeB said:
    Feb 21 11:10:53 UNRAID1 kernel: BTRFS error (device nvme2n1p1): block=160911851520 write time tree block corruption detected

    This usually means bad RAM or other kernel memory corruption, maybe the RAM issues are not totally solved.

     

    This is worrying! I replaced all 4 sticks of RAM the other day, so the chances of it actually being the RAM are slim to none, I would have thought. It is a brand new motherboard and CPU. My old motherboard had 8 RAM slots and never missed a beat, and I have now had these errors on both sets of 4 sticks, so I am assuming it is not the RAM causing the issue. Maybe the motherboard (ASRock Z790 Pro RS/D4)? I am on the latest BIOS. Are there any steps I can take to try to get on top of this issue?

  7. Unable to stop the server currently 😞

     

    Feb 21 11:22:18 UNRAID1  emhttpd: shcmd (2491271): umount /mnt/cache
    Feb 21 11:22:18 UNRAID1 root: umount: /mnt/cache: target is busy.
    Feb 21 11:22:18 UNRAID1  emhttpd: shcmd (2491271): exit status: 32
    Feb 21 11:22:18 UNRAID1  emhttpd: Retry unmounting disk share(s)...
    Feb 21 11:22:23 UNRAID1  emhttpd: Unmounting disks...
    Feb 21 11:22:23 UNRAID1  emhttpd: shcmd (2491272): umount /mnt/cache
    Feb 21 11:22:23 UNRAID1 root: umount: /mnt/cache: target is busy.
    Feb 21 11:22:23 UNRAID1  emhttpd: shcmd (2491272): exit status: 32
    Feb 21 11:22:23 UNRAID1  emhttpd: Retry unmounting disk share(s)...
    Feb 21 11:22:28 UNRAID1  emhttpd: Unmounting disks...
    Feb 21 11:22:28 UNRAID1  emhttpd: shcmd (2491273): umount /mnt/cache
    Feb 21 11:22:28 UNRAID1 root: umount: /mnt/cache: target is busy.
    Feb 21 11:22:28 UNRAID1  emhttpd: shcmd (2491273): exit status: 32
    Feb 21 11:22:28 UNRAID1  emhttpd: Retry unmounting disk share(s)...
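    `target is busy` means something still has a file open, or its working directory set, under /mnt/cache. A rough sketch of one way to look for the culprit by scanning /proc, assuming it is run as root; `find_holders` is a hypothetical helper and not part of Unraid:

```python
import os


def find_holders(mount_point, proc_root="/proc"):
    """Map PID -> sorted paths under mount_point that the process has open or uses as cwd."""
    base = mount_point.rstrip("/")
    prefix = base + "/"
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries in /proc are processes
        pid_dir = os.path.join(proc_root, entry)
        links = [os.path.join(pid_dir, "cwd")]
        fd_dir = os.path.join(pid_dir, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to inspect it
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == base or target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}


# Example (on the server itself):
#   for pid, paths in find_holders("/mnt/cache").items():
#       print(pid, paths)
```

    This will not show a loop-mounted image stored on the cache, because the kernel holds a loop device's backing file itself, so an empty result does not prove that nothing is using the pool.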

  8. Hi @JorgeB

     

    I have just had my server crash and have attached diags. Is this related to the other crashes you helped me with last week? The server is on 24/7 and this is the first problem since the possible RAM issue/corruption I had last week.

     

    Warning: mkdir(): Input/output error in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 351

    unraid1-diagnostics-20230221-1113.zip

  9. Hi, 

     

    Can anyone please help me out with what could be causing the input/output error I am getting with CA Backup on my Nextcloud container? 

    The container is stopped and I do not even have calendar installed on Nextcloud

     

    [20.02.2023 03:42:10] Backing Up: nextcloud
    /usr/bin/tar: nextcloud/www/nextcloud/apps/calendar/: Cannot savedir: Input/output error
    /usr/bin/tar: Exiting with failure status due to previous errors
    [20.02.2023 03:43:21] tar creation/extraction failed!
    [20.02.2023 03:43:21] Verifying Backup nextcloud
    [20.02.2023 03:44:26] Backing Up: unifi-controller
    [20.02.2023 03:44:46] Verifying Backup unifi-controller
    [20.02.2023 03:44:51] Backing Up: zigbee2mqtt
    [20.02.2023 03:44:51] Verifying Backup zigbee2mqtt
    [20.02.2023 03:44:51] done
    [20.02.2023 03:44:51] Starting gsdock... (try #1) done!
    [20.02.2023 03:44:55] Starting mariadb... (try #1) done!
    [20.02.2023 03:44:59] Starting nextcloud... (try #1) done!
    [20.02.2023 03:45:04] Starting unifi-controller... (try #1) done!
    [20.02.2023 03:45:12] A error occurred somewhere. Not deleting old backup sets of appdata
    [20.02.2023 03:45:12] Backup / Restore Completed
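    `Cannot savedir` is tar failing to list a directory, so the bad entry can be hunted down without running a full backup. A minimal sketch that walks a tree and records every directory that cannot be listed; the function name and the example path are assumptions, so point it at wherever your appdata actually lives:

```python
import errno
import os


def find_unreadable(top):
    """Walk top and return (path, error name) for every directory that cannot be listed."""
    failures = []

    def record(err):
        # os.walk hands over the OSError raised while listing a directory
        name = errno.errorcode.get(err.errno, str(err.errno))
        failures.append((err.filename, name))

    for _dirpath, _dirnames, _filenames in os.walk(top, onerror=record):
        pass  # listing each directory is the whole check
    return failures


# Example (hypothetical appdata location):
#   for path, name in find_unreadable("/mnt/cache/appdata/nextcloud"):
#       print(name, path)
```

    A directory reported with `EIO` here would correspond to the Input/output error tar hit; it only checks directory listings, not the contents of individual files.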

  11. Thank you @JorgeB

     

    1, I have run the memtest, which passed, but I have another 4 sticks, which I have now swapped in for the existing RAM.

     

    2, You say there was existing corruption? How do I address that, please?

     

    3, Is this not an NVME problem then? "BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected"

     

    4, I will look into ipvlan now.

     

    Thanks again!

  12. I also just saw this in the log when I started to shut down after taking the diags:

     

    Feb 18 09:48:42 UNRAID1 kernel: BUG: Bad rss-counter state mm:00000000ee5ad6d6 type:MM_ANONPAGES val:1
    Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 83360 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
    Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 188192 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state A) in __btrfs_update_delayed_inode:999: errno=-5 IO failure
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS info (device loop3: state EA): forced readonly
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in __btrfs_run_delayed_items:1092: errno=-5 IO failure
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS warning (device loop3: state EA): Skipping commit of aborted transaction.
    Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in cleanup_transaction:1982: errno=-5 IO failure
    Feb 18 09:48:45 UNRAID1 kernel: docker0: port 4(vethb7c8447) entered disabled state
    Feb 18 09:48:45 UNRAID1 kernel: vethd950096: renamed from eth0
    Feb 18 09:48:45 UNRAID1 root: Error response from daemon: error while removing network: network br0 id b58d2467fa57ae7061498d882571927884c846ee782347c86f3071b85475f1f0 has active endpoints
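    The `errs:` lines carry running per-device counters (write, read, flush, corruption, generation). A small sketch that pulls the latest counters for each device out of syslog lines; the regex is written against the lines pasted above and is an assumption about the format beyond that:

```python
import re

ERRS_RE = re.compile(
    r"bdev (?P<bdev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counters(lines):
    """Return {device: counters} holding the last counters seen for each device."""
    latest = {}
    for line in lines:
        m = ERRS_RE.search(line)
        if m:
            latest[m["bdev"]] = {key: int(m[key]) for key in COUNTERS}
    return latest


sample = [
    "kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0",
]
print(latest_error_counters(sample))
```

    Fed the two `errs:` lines above, this leaves /dev/loop3 with 2 write errors and zeros elsewhere.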

  13. Hi, 

     

    I woke this morning to find most of my dockers stopped. It looks like CA Backup stopped my dockers ready to back up, as the backup folder was created but it is empty. I can see all these BTRFS errors on nvme2n1p1, but I cannot work out which physical drive this is. This is the 3rd NVMe-related problem I have had on this new motherboard in as many weeks, so I really need to get to the bottom of it. Any help will be greatly appreciated!
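    On the question of which physical drive `nvme2n1p1` is: the kernel name can usually be tied to a model and serial number through sysfs. A hedged sketch, assuming the usual `/sys/block/<disk>/device/model` and `serial` attributes are present on this kernel; `nvme_identity` is a hypothetical helper:

```python
import os
import re

NVME_RE = re.compile(r"^(nvme\d+n\d+)(?:p\d+)?$")


def nvme_identity(name, sys_block="/sys/block"):
    """Return disk name, model and serial for a name such as 'nvme2n1p1'."""
    m = NVME_RE.match(name)
    if m is None:
        raise ValueError(f"not an NVMe block device name: {name!r}")
    disk = m.group(1)  # 'nvme2n1p1' -> 'nvme2n1'
    info = {"disk": disk}
    for attr in ("model", "serial"):
        path = os.path.join(sys_block, disk, "device", attr)
        try:
            with open(path) as fh:
                info[attr] = fh.read().strip()
        except OSError:
            info[attr] = None  # attribute not available on this system
    return info


# Example (on the server itself):
#   print(nvme_identity("nvme2n1p1"))
```

    The serial it returns can then be compared with the label on the drive or with the device identifiers listed on the Main page.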

     

    image.png.fe233f5d6541bb270244a9146851f967.png

     

     

    unraid1-diagnostics-20230218-0934.zip
