berben Posted October 12, 2023 (edited)
Hi, I've run into an issue with my Docker apps: they randomly crash and stop responding. It has happened a couple of times, but a reboot fixes it for a while (a couple of days). Today it happened again, and this time I had a little time to browse the logs. I found a couple of odd errors.

First, the system log shows a new entry like this every couple of seconds:

    Oct 12 18:03:45 NAS kernel: pcieport 0000:00:1c.3: AER: Corrected error received: 0000:02:00.0

I looked it up in the device manager and it points to this device:

    [8086:7abb] 00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11)

Also, /var/log/docker.log is spammed with:

    containerd: creating temp mount location: mkdir /var/lib/docker/containerd/daemon/tmpmounts: input/output error
    time="2023-10-12T07:05:09+02:00" level=warning msg="containerd config version `1` has been deprecated and will be removed in containerd v2.0, please switch to version `2`, see https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#version-header"

My cache disk is a SATA SSD, and I don't have any external devices connected directly to the PCI slots. Everything is fine again after a reboot, except that the "AER" error is still being logged. This PC ran fine for about 6 months and only started to be problematic a couple of days ago. Any help would be appreciated.
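To get a feel for how often the AER message above repeats and which device it names, the syslog can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is modelled on the single kernel line quoted in the post, the function names are made up for illustration, and the log path passed to `count_aer_in_file` (for example `/var/log/syslog`) is an assumption about where the system log lives.

```python
import re
from collections import Counter

# Modelled on the one kernel line quoted in the post, e.g.
# "... pcieport 0000:00:1c.3: AER: Corrected error received: 0000:02:00.0"
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): AER: Corrected error received: "
    r"(?P<source>[0-9a-f:.]+)",
    re.IGNORECASE,
)


def count_aer_corrected(lines):
    """Count corrected-error messages per (reporting port, source device)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[(match.group("port"), match.group("source"))] += 1
    return counts


def count_aer_in_file(path):
    """Hypothetical helper: tally AER lines in a log file at `path`."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as handle:
        return count_aer_corrected(handle)
```

Calling `count_aer_in_file("/var/log/syslog").most_common()` would then list the noisiest port/device pairs first. A steady count that survives reboots matches what the post describes, but the tally by itself says nothing about the cause.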
JorgeB Posted October 13, 2023
Please post the diagnostics.
berben Posted October 13, 2023 (Author)
Attaching the diagnostics zip: nas-diagnostics-20231012-2141.zip
JorgeB Posted October 13, 2023 (Solution)
Btrfs is detecting data corruption on the cache. Start by running memtest.
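The diagnostics zip is not reproduced in the thread, so it is not clear exactly which entries JorgeB read. One common place to look for Btrfs-reported corruption is the per-device error counters printed by `btrfs device stats`. The sketch below parses that output and keeps only the non-zero counters. The mount point `/mnt/cache` is an assumption about where the cache pool is mounted, and the helper names are hypothetical.

```python
import re
import subprocess

# `btrfs device stats` prints lines such as:
#   [/dev/sdb1].corruption_errs  0
STAT_LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Turn `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            device_counters = stats.setdefault(match.group("device"), {})
            device_counters[match.group("counter")] = int(match.group("value"))
    return stats


def nonzero_counters(stats):
    """Keep only devices and counters that have recorded at least one error."""
    return {
        device: {name: value for name, value in counters.items() if value}
        for device, counters in stats.items()
        if any(counters.values())
    }


def read_device_stats(mount_point="/mnt/cache"):
    """Run the command for a mounted pool (the default path is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)
```

A non-zero `corruption_errs` value only says that checksums failed at some point. As the replies that follow point out, bad RAM is a common cause, which is why a memory test is the suggested first step.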
berben Posted October 13, 2023 (Author)
Do you mean a RAM test, the one I can run from the boot menu? Do you think this might be RAM related?
itimpi Posted October 13, 2023
8 minutes ago, berben said:
    Do you mean a RAM test, the one I can run from the boot menu? Do you think this might be RAM related?
Corruption of BTRFS file systems seems to be a common symptom of RAM problems. Note that if you boot in UEFI mode, you need to get a more recent version from memtest86.com that can boot in UEFI mode. It would not do any harm to use that version anyway, even if you boot Unraid in legacy mode.
berben Posted October 13, 2023 (Author, edited)
Thanks for the explanation. That would be odd because the machine is almost brand new, but of course anything could happen. I'm creating a USB stick with the newest memtest and I'll leave it running for a couple of hours. The SSD is not new, though; it's a couple of years old.
berben Posted October 13, 2023 (Author)
Looks like you were right. Memtest started to throw errors after a couple of seconds. Now I'll try to figure out whether this is indeed the RAM (hopefully not the CPU) and RMA it. Thanks a lot for the tips!
itimpi Posted October 13, 2023
1 hour ago, berben said:
    Looks like you were right. Memtest started to throw errors after a couple of seconds. Now I'll try to figure out whether this is indeed the RAM (hopefully not the CPU) and RMA it. Thanks a lot for the tips!
Make sure you are not overclocking the RAM (an XMP profile IS an overclock). You can also try with fewer RAM sticks installed, as sometimes it is the memory controller that cannot handle the number of installed sticks without issues. The CPU may also have a maximum RAM speed it can handle, so check the manual for that.
berben Posted October 13, 2023 (Author)
8 minutes ago, itimpi said:
    Make sure you are not overclocking the RAM (an XMP profile IS an overclock). You can also try with fewer RAM sticks installed, as sometimes it is the memory controller that cannot handle the number of installed sticks without issues. The CPU may also have a maximum RAM speed it can handle, so check the manual for that.
I took out one of the sticks and there seem to be no errors this time. I suspect the stick I took out is faulty, because the system ran fine for a couple of months and nothing changed in the setup during that period. I'm not overclocking explicitly, but I think an XMP profile was selected. After this second test I'll put the second stick back and disable the XMP profile.
berben Posted October 13, 2023 (Author)
OK, I think I can safely say that one of the two RAM sticks is at fault here. I tried swapping RAM slots, and no matter which slot it is in, one of the sticks causes errors. No XMP, no overclock. I'll send it in for RMA and run the PC with only one stick for now. Thanks once again for the tips.
itimpi Posted October 13, 2023
17 minutes ago, berben said:
    OK, I think I can safely say that one of the two RAM sticks is at fault here. I tried swapping RAM slots, and no matter which slot it is in, one of the sticks causes errors. No XMP, no overclock. I'll send it in for RMA and run the PC with only one stick for now. Thanks once again for the tips.
At least it seems to be a reproducible error that follows a particular stick. I wonder if it is worth cleaning the contacts on the suspect stick, just in case?
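The swap test described above comes down to a small piece of bookkeeping: run memtest with one stick at a time in each slot, then see whether the failures follow a stick or a slot. A minimal sketch of that reasoning follows. The stick and slot labels ("A", "B", "slot1", "slot2") and the function name are hypothetical, and the results in the usage note are illustrative, not berben's actual memtest output.

```python
from collections import defaultdict


def find_suspects(runs):
    """Find sticks or slots that failed in every one of several test runs.

    `runs` is an iterable of (stick, slot, had_errors) tuples, one per
    single-stick memtest pass.
    """
    by_stick = defaultdict(list)
    by_slot = defaultdict(list)
    for stick, slot, had_errors in runs:
        by_stick[stick].append(had_errors)
        by_slot[slot].append(had_errors)

    def always_fails(results):
        # Require more than one run so a single bad pass is not conclusive.
        return len(results) > 1 and all(results)

    return {
        "sticks": sorted(k for k, v in by_stick.items() if always_fails(v)),
        "slots": sorted(k for k, v in by_slot.items() if always_fails(v)),
    }
```

For example, if stick "A" passes in both slots and stick "B" fails in both, `find_suspects` reports "B" under `sticks` and nothing under `slots`, which is the pattern the thread converges on. If a slot shows up instead, the motherboard or memory controller is the more likely suspect.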