March 22, 2025

Hello,

So I have an UNRAID server, and several containers installed on it using Dockge. I recently installed Immich so I can view my photos, but I'm having issues with it. I currently have it configured to read an external library (a shared folder in UNRAID called photos). The problem is that when the image sync starts, after 30-60 minutes, UNRAID restarts. I've never had this problem with any other container before.

The server is:

Intel 14500
64GB DDR5
Asrock z790 tb4 itx PG
Seasonic SPX650

Everything is up to date, and the container I'm using is the following:

#
# WARNING: To install Immich, follow our guide: https://immich.app/docs/install/docker-compose
#
# Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.

name: immich
services:
  immich-server:
    container_name: immich-server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.transcoding.yml
    #   service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      - /mnt/user/photos:/mnt/media/photos:ro
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    labels:
      net.unraid.docker.icon: /mnt/user/system/icons/Immich.png
      net.unraid.docker.managed: dockerman
    env_file:
      - .env
    depends_on:
      - immich-redis
      - immich-database
    restart: always
    healthcheck:
      disable: false
  immich-machine-learning:
    container_name: immich-machine-learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    labels:
      net.unraid.docker.icon: /mnt/user/system/icons/Immich.png
      net.unraid.docker.managed: dockerman
    env_file:
      - .env
    restart: always
    healthcheck:
      disable: false
  immich-redis:
    container_name: immich-redis
    image: docker.io/redis:6.2-alpine@sha256:148bb5411c184abd288d9aaed139c98123eeb8824c5d3fce03cf721db58066d8
    command: redis-server --bind 0.0.0.0 --port 6381
    healthcheck:
      test: redis-cli -p 6381 ping || exit 1
    restart: always
    labels:
      net.unraid.docker.icon: /mnt/user/system/icons/Immich.png
      net.unraid.docker.managed: dockerman
  immich-database:
    container_name: immich-database
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:739cdd626151ff1f796dc95a6591b55a714f341c737e27f045019ceabf8e8c52
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      PGPORT: 5433
      POSTGRES_INITDB_ARGS: --data-checksums
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the .env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    labels:
      net.unraid.docker.icon: /mnt/user/system/icons/Immich.png
      net.unraid.docker.managed: dockerman
    healthcheck:
      test: pg_isready --dbname="$${POSTGRES_DB}" --username="$${POSTGRES_USER}" || exit 1; Chksum="$$(psql --dbname="$${POSTGRES_DB}" --username="$${POSTGRES_USER}" --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      start_interval: 30s
      start_period: 5m
    command: postgres -c shared_preload_libraries=vectors.so -c 'search_path="$$user", public, vectors' -c logging_collector=on -c max_wal_size=2GB -c shared_buffers=512MB -c wal_compression=on
    restart: always
volumes:
  model-cache: null
networks:
  default:
    external: true
    name: npm_network

I have done memtest, CPU stress tests and everything has been satisfactory, without errors or problems. I've enabled persistent syslog logging, but there are absolutely no significant errors.

12:47 this is the last time the problem occurred

When this happened last time, I was standing next to the server and saw it completely shut down and then turn back on. I don't understand why. I hope you can help me, thanks in advance.

syslog-previous
March 22, 2025
Author

13 minutes ago, bmartino1 said: ...

Very good guide, but I don't see how it relates to my problem. Do you have any thoughts?
March 22, 2025
Community Expert

1 minute ago, Nozle said: Very good guide, but I don't see the problem related to mine, do you have any thoughts?

Not really, none that would help regarding software and configuration. What you're describing sounds more like a power supply or thermal overload problem. It is a really strange issue, especially since it's a full system power cycle rather than just a container crash or kernel panic.

Given that you've done memory and CPU stress testing with no issues, and the syslog isn't capturing anything meaningful, here are some things you can try to dig deeper:

Power Supply Check

This sounds suspiciously like a hardware-level failure, particularly the PSU (Power Supply Unit). If Immich kicks off heavy disk or CPU I/O (which it does during image processing and machine learning tasks), it may be drawing just enough power to trigger a shutdown if the PSU is borderline. Try checking:

- Do the fans dip or lights flicker before shutdown?
- Do you have another PSU you can test with?
- Is any surge protector/UPS involved that might be triggering protection?

Enable Temperature Monitoring

Even if the CPU passed stress tests, real-world I/O plus processing (especially with Immich doing ML work like face detection) might spike temps. Tools like sensors (via NerdPack or via a container) can help you log CPU/GPU temps.

Deeper Logging / Diagnostics

*Enable syslog... I will need a diag file; your syslog-previous doesn't help...

Since syslog didn't show much, try enabling IPMI event logging or BMC-level logging if your motherboard supports it (many server-grade boards do). Enable "Local syslog mirror" under UNRAID's Settings > Syslog Server to write logs to USB or the cache drive so logs persist across reboots. Also check /var/log/libvirt/qemu and /var/log/docker.log (if available).

You may also need to apply Docker Compose resource limits:

deploy:
  resources:
    limits:
      cpus: '2.0'
      memory: 4096M

Add this to each container to cap the number of CPUs and the amount of RAM it can use. Also, ensure the Immich container isn't using GPU acceleration if you haven't allocated a GPU properly.

Why I would need the diag... this could be disk related...

Filesystem Access Issues

If Immich is reading from a mounted share or array that's spinning up dozens of disks, and there are any issues in the SATA/RAID controller, it might trip a system restart. Check for:

- Parity errors or SMART warnings in the Array Devices tab.
- Try scanning your photos directory with something like du -sh or a find command and see if that triggers any issues.

Strip Down and Test

Temporarily configure Immich with a local dummy folder (not the full photos share), and see if the crash still occurs. If not, then the issue is tied to that share: maybe a specific file, a bad sector, or permission weirdness. You could even try syncing a smaller subset of photos to test in isolation.

I posted the guide on its own because this seemed more like "can't get compose working at all", and the guide I made will get you off the ground...

Please post a diag file.
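The temperature-monitoring suggestion above can be sketched as a small logger that keeps writing the hottest reading to a file, so the last values before an unexpected power cycle are still there afterwards. This is only a sketch under assumptions: it reads the Linux kernel's `/sys/class/thermal/thermal_zone*/temp` files (reported in millidegrees Celsius) instead of the `sensors` tool, and the function names and the log path in the usage comment are hypothetical.

```shell
# Convert a sysfs reading in millidegrees Celsius (e.g. 45123) to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print one timestamped line with the hottest thermal zone the kernel exposes.
# Prints nothing if no readable thermal zone exists on this machine.
log_max_temp() {
    lmt_max=-1
    for lmt_f in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$lmt_f" ] || continue
        lmt_raw=$(cat "$lmt_f" 2>/dev/null) || continue
        # Skip anything that is not a plain non-negative integer.
        case "$lmt_raw" in ''|*[!0-9]*) continue ;; esac
        if [ "$lmt_raw" -gt "$lmt_max" ]; then
            lmt_max=$lmt_raw
        fi
    done
    if [ "$lmt_max" -ge 0 ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') max_temp_c=$(millideg_to_c "$lmt_max")"
    fi
}

# Usage (assumed log path on persistent storage; adjust to your system):
#   while true; do log_max_temp >> /boot/logs/temps.log; sync; sleep 10; done
```

The `sync` in the usage line is there because a hard power loss can discard data that is still in the page cache; flushing after each line trades some write wear for a log that actually shows the final readings.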
March 22, 2025
Author

1 hour ago, bmartino1 said: ...

Thank you very much for so much information and so many suggestions.

I've run some new tests without success. I removed the UPS in case it had something to do with it, and the problem persists. I connected the array disks directly to the motherboard in case the ASM1106 was the problem, and the problem persists.

I'd like to test the power supply, but I don't know if there's any test that would let me check it 100%; that would be great. I do have another one I could swap in, although that involves bringing down the server, disassembling and changing it... a few hours of work. I'd prefer to run a test before the swap, but it wouldn't be a problem.

Temperature monitoring is enabled and everything looks correct: the CPU is at a maximum of 40 °C and the motherboard at a maximum of about 50 °C.

I don't think the problem has anything to do with the disks, because a parity check runs for about 12-14 hours without any problem, but I could be wrong.

I need to look at the diagnostics section you mentioned to see if it gives any more clues, but I need to figure out how best to set it up so the logs persist.

I don't know how to check if it is using GPU acceleration (I don't have a dedicated GPU, I only have an Intel 14500).

find and du -sh show no errors or failures; everything runs fine. No SMART errors.

If Immich does not read this external share, the server does not restart and there are no problems (it is not overloaded). The problem appears only when it starts loading the share (the external library), after 20-50 minutes.

Thanks for your time!

du -sh /mnt/disk1/photos/
194G /mnt/disk1/photos/

Edited March 22, 2025 by Nozle
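One caveat about the `du -sh` and `find` results: both only walk directory metadata and never read the file contents, whereas an Immich library scan has to open the files themselves. A minimal sketch of a fuller check, assuming a POSIX shell with `find` and `cat`, is to read every file end to end and print any that fail. The function name `read_all_files` is hypothetical.

```shell
# Read every regular file under the given directory end to end.
# Prints one "READ ERROR: <path>" line per file that could not be read.
read_all_files() {
    find "$1" -type f -exec sh -c \
        'cat -- "$1" > /dev/null 2>&1 || echo "READ ERROR: $1"' _ {} \;
}

# Usage (path taken from the post; expect this to take a while on 194G):
#   read_all_files /mnt/disk1/photos/
```

If the server restarts during this read-only pass with Immich stopped, that points away from Immich and toward the storage path or sustained load in general. If it completes cleanly, the share itself is less likely to be the trigger.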
March 22, 2025
Author

4 hours ago, bmartino1 said: ...

After adding:

deploy:
  resources:
    limits:
      cpus: '2.0'
      memory: 4096M

It's been running for over 4 hours. What could be the real problem?
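To confirm that the limits were actually applied to the running containers, `docker inspect` can print the values Docker stored for them. The sketch below assumes the container names from the compose file in the first post; the helper function names are hypothetical. Docker records the CPU limit as `NanoCpus` (billionths of a CPU) and the memory limit in bytes, so the two small helpers compute what `cpus: '2.0'` and `memory: 4096M` should look like.

```shell
# Expected stored values for the limits in the post.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Whole CPUs only; Docker stores the limit in units of 1e-9 CPU.
cpus_to_nanocpus() {
    echo $(( $1 * 1000000000 ))
}

# Print the limits Docker applied to one container (0 means no limit set).
show_limits() {
    docker inspect --format \
        '{{.Name}} nano_cpus={{.HostConfig.NanoCpus}} memory_bytes={{.HostConfig.Memory}}' "$1"
}

# Usage:
#   show_limits immich-server
#   show_limits immich-machine-learning
```

With the limits from the post in effect, the reported values should match `cpus_to_nanocpus 2` and `mib_to_bytes 4096`.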
March 22, 2025
Community Expert

1 hour ago, Nozle said: After adding: deploy: resources: limits: cpus: '2.0' memory: 4096M It's been running for over 4 hours. What could be the real problem?

Resource management. You may be running more than the system can handle, and/or a lot of compute was needed for the first ML run for face detection and it has finally finished that task.

Do you run Plex/Jellyfin? DNS servers? Nextcloud, etc.? You have to account for 2-4 CPU threads and 2-4 GB of RAM for each application. Immich takes 3 just to run its services at idle... It seems like it was maxing out on whatever it could grab to finish its tasks...
March 22, 2025
Author

5 minutes ago, bmartino1 said: ...

So you think it's not a hardware defect? I was thinking of creating a Windows 11 live USB and trying some CPU stress programs and other things. I'd also run memtest again for at least 8-12 hours to make sure everything is okay. I don't have anything to test the PSU with.

My unraid server is:

Intel 14500
64GB DDR5
Asrock z790 tb4 itx PG
Seasonic SPX650
Array: 3 Toshiba N300 8TB HDD
Cache: 2x WD850X 1TB
ASM1106 x 6 SATA M2 updated

There are a few virtual machines, one for Windows 11 and one for HomeAssistant.

There are over 20 Docker containers: Authentik, Docker Controller Bot, Watchtower, Dockge, Grafana, Hommar, Immich, Nodered, npm, Paperless, Unifi, Vaultwarden, Wallos, etc.

I also have a problem when I run PowerTop Autotune: the system crashes after a few hours. I still have to figure out what the culprit is.

I plan to use plex/jellyfin later.
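As a worked example of the rule of thumb quoted earlier in the thread (2-4 CPU threads and 2-4 GB of RAM per application), the arithmetic can be written out for this server. The count of 22 is an assumption for illustration (the "over 20" containers plus the two VMs), and `budget_range` is a hypothetical helper. The rule is a rough planning figure only: containers share CPU time rather than reserving threads, so exceeding it does not by itself cause a reboot.

```shell
# Multiply an application count by a per-app low/high estimate.
# Prints the resulting range as "LOW-HIGH".
budget_range() {
    echo "$(( $1 * $2 ))-$(( $1 * $3 ))"
}

# 22 applications at 2-4 threads each, and at 2-4 GB of RAM each.
echo "threads wanted: $(budget_range 22 2 4)"
echo "RAM wanted (GB): $(budget_range 22 2 4)"
```

Both ranges come out to 44-88, against the 20 threads of an Intel 14500 and the 64 GB of RAM installed, which is why the reply treats the box as overprovisioned if everything is busy at once.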
March 23, 2025
Community Expert

Yep, the instant I saw 20 Dockers and VMs: you are overprovisioning the CPU... and are they all running at once?

Now that I think of it and remember... there could be CPU issues due to Intel firmware.

https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239
https://community.intel.com/t5/Processors/July-2024-Update-on-Instability-Reports-on-Intel-Core-13th-and/m-p/1617113

I would live-boot a WinPE to make sure you're running the latest BIOS and have the latest Intel microcode.

Edited March 23, 2025 by bmartino1
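The microcode revision that is currently loaded can also be read from the Unraid console, without booting another OS. A minimal sketch, assuming an x86 Linux kernel that exposes a `microcode` field in `/proc/cpuinfo`; the function names are hypothetical, and which revision counts as "latest" for a given CPU has to come from Intel or the board vendor.

```shell
# Extract the first microcode revision from cpuinfo-formatted text on stdin.
parse_microcode() {
    awk -F': *' '/^microcode/ { print $2; exit }'
}

# Print the microcode revision of the running system.
current_microcode() {
    parse_microcode < /proc/cpuinfo
}

# Usage:
#   current_microcode
```

This only shows the revision the CPU is running right now; the BIOS version itself still has to be checked in the firmware setup or on the vendor's download page.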
March 23, 2025
Author

9 hours ago, bmartino1 said: ...

So, yesterday and today were days of testing.

I ran 10 passes (11 hours) of memtest without failure.

I booted a Windows 10 live USB drive with y-cruncher, and after a few minutes got a blue screen error: CLOCK_WATCHDOG_TIMEOUT

I cleared CMOS, installed y-cruncher, and the same error occurred.

I tested two power supplies with y-cruncher, and the same error occurred.

The motherboard has had the latest ASRock BIOS for the four months since I built this server.

I just installed a 13500 CPU and will run y-cruncher to see if the problem goes away. If it does, the CPU is defective. Damn, this processor is only 2 months old.

I hope this ordeal ends soon, but everything points to this CPU being defective. I hope Intel doesn't waste time and replaces it quickly, otherwise my Unraid will be down for a long time.

Now I'm also facing another problem: Unraid doesn't work at all. I get the following error and it won't start:

rcu task blocked on level rcu node unraid