Abstract7227 Posted November 2, 2022

Hi,

Since setup, I have been experiencing server outages every few days. The outage becomes noticeable because some Docker containers stop responding. The web GUI usually still loads, but if I go to, for example, the Docker containers page, or try to download the diagnostics, the web GUI freezes and the server becomes unreachable (even pinging the server fails). The diagnostics files taken after a reboot are attached, since downloading them is not possible during the issue (so I am not sure how helpful they are).

I already tried:
- disabling some Docker containers I thought caused the problem - no luck

Something that might be important: some of my Docker containers have a folder mapping to /mnt/cache/share instead of /mnt/user/share due to performance issues when going through /mnt/user, although the shares mapped this way are configured as cache-only. I first thought maybe the mover could not deal with this, but according to Unraid's documentation the mover wouldn't even look at those shares, and some outages start at times when the mover is not running.

Can anyone shed some light on what is going wrong here?

trantor-diagnostics-20221102-1712.zip
trantor-diagnostics-20221024-0834.zip
JorgeB Posted November 2, 2022

Start here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
Abstract7227 Posted November 2, 2022

42 minutes ago, JorgeB said:
Start here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

Thanks for the response. I changed the Power Supply Idle Control from auto to enabled - here's hoping that fixes the problem.
JorgeB Posted November 2, 2022

If it doesn't, enable the syslog server and post that after a crash.
Abstract7227 Posted November 5, 2022

This might be an unrelated issue, but recently my Docker containers became unresponsive again. The Docker logs showed some read-only errors, so I guess this is related to problems on my cache pool? Diagnostics taken during the issue are included; a reboot fixed the problem this time. If I am reading this right, the issue is related to the cache pool? SMART does not report any errors on the drives though, and the drives are relatively new. Not sure how to troubleshoot this :(

trantor-diagnostics-20221105-0238.zip
JorgeB Posted November 5, 2022

Nov 2 18:16:53 Trantor kernel: BTRFS info (device dm-4): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1729, gen 0
Nov 2 18:16:53 Trantor kernel: BTRFS info (device dm-4): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 1256, gen 0

Btrfs is finding data corruption on both devices; start by running memtest.
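Those per-device counters can also be read directly from the pool at any time; a minimal sketch, assuming the pool is mounted at /mnt/cache (adjust the path to your pool name):

# show btrfs error counters (write/read/flush/corruption/generation) per device
btrfs device stats /mnt/cache

# after the underlying cause is fixed, the counters can be reset so new errors stand out
# btrfs device stats -z /mnt/cache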
Abstract7227 Posted November 5, 2022

2 hours ago, JorgeB said:
Btrfs is finding data corruption on both devices; start by running memtest.

Thanks for the response. I ran two memtest passes without errors - any other causes I can look at?
JorgeB Posted November 5, 2022

Bad or failing RAM would be the main suspect. Besides the data corruption there's also filesystem corruption, likely caused by the same problem, so it's best to back up and re-format the pool. Then see here for better pool monitoring, so you're notified if corruption continues to occur.
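For the monitoring part, one possible approach is a small script run on a schedule (for example via the User Scripts plugin) that raises an Unraid notification whenever any btrfs error counter is non-zero. This is only a sketch: the pool mount point is assumed to be /mnt/cache, and the notify helper path is an assumption based on current Unraid versions, so check it on your system.

#!/bin/bash
# alert if any btrfs error counter on the cache pool is non-zero
POOL=/mnt/cache
ERRORS=$(btrfs device stats "$POOL" | awk '$2 != 0 {count++} END {print count+0}')
if [ "$ERRORS" -gt 0 ]; then
    # Unraid's bundled notification helper (path assumed, verify for your version)
    /usr/local/emhttp/webGui/scripts/notify -i warning \
        -s "btrfs errors on cache pool" \
        -d "btrfs device stats reports $ERRORS non-zero error counters on $POOL"
fi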
Abstract7227 Posted November 5, 2022

33 minutes ago, JorgeB said:
Bad or failing RAM would be the main suspect ... best to back up and re-format the pool.

So I should definitely replace the RAM then? And if I do, should I still reformat the cache pool after the backup?
JorgeB Posted November 5, 2022

18 minutes ago, Abstract7227 said:
I should still reformat the cache pool after a backup?

Yes, corruption has already happened. Even if you fix the cause now, I cannot say for sure the RAM was the problem, so monitor after formatting; if new corruption is found, there's still a problem.
Abstract7227 Posted November 5, 2022

Noob question, but considering I pointed some Docker containers to /mnt/cache/appdata instead of /mnt/user/appdata: can I still take the "backup" by changing the cache-only shares to "yes" and activating the mover, then reformatting the cache pool, setting the shares to "prefer", running the mover again, and changing it to "yes" after? Or won't that work anymore?
JorgeB Posted November 5, 2022

Yes, you just need to verify that all data was moved; some files might fail due to errors.
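A quick way to verify is to check what, if anything, the mover left behind on the pool once it finishes; a minimal sketch, assuming the pool is mounted at /mnt/cache (narrow the path to the shares you actually moved):

# list any files still left on the pool - these are the candidates that failed to move
find /mnt/cache -type f 2>/dev/null

# or just get a count
find /mnt/cache -type f 2>/dev/null | wc -l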
Abstract7227 Posted November 7, 2022

On 11/5/2022 at 5:08 PM, JorgeB said:
Yes, you just need to verify that all data was moved; some files might fail due to errors.

Some indeed failed. Does that mean those files are lost? I assume the same error would occur when using Krusader to move them.
JorgeB Posted November 7, 2022

It means the files are corrupt, and btrfs prevents them from being copied/moved so that you don't copy corrupt files without knowing.
Abstract7227 Posted November 8, 2022

I moved the uncorrupted files, reformatted the cache pool, moved everything back, and was pleasantly surprised that my Docker containers were very much intact when I reinstalled the images. What I am a bit worried about, however, is that I recently checked the logs and encountered the following entries:

Nov 8 08:54:22 Trantor kernel: nvme nvme0: I/O 64 QID 1 timeout, aborting
Nov 8 08:54:22 Trantor kernel: nvme nvme0: I/O 65 QID 1 timeout, aborting
Nov 8 08:54:22 Trantor kernel: nvme nvme0: I/O 66 QID 1 timeout, aborting
Nov 8 08:54:22 Trantor kernel: nvme nvme0: I/O 67 QID 1 timeout, aborting
Nov 8 08:54:22 Trantor kernel: nvme nvme0: I/O 68 QID 1 timeout, aborting
Nov 8 08:54:52 Trantor kernel: nvme nvme0: I/O 64 QID 1 timeout, reset controller
Nov 8 08:54:52 Trantor kernel: nvme nvme0: I/O 7 QID 0 timeout, reset controller
Nov 8 08:55:54 Trantor kernel: nvme0: Admin Cmd(0x6), I/O Error (sct 0x3 / sc 0x71)
Nov 8 08:55:54 Trantor kernel: nvme nvme0: Abort status: 0x371
Nov 8 08:55:54 Trantor kernel: nvme nvme0: Abort status: 0x371
Nov 8 08:55:54 Trantor kernel: nvme nvme0: Abort status: 0x371
Nov 8 08:55:54 Trantor kernel: nvme nvme0: Abort status: 0x371
Nov 8 08:55:54 Trantor kernel: nvme nvme0: Abort status: 0x371
Nov 8 08:55:54 Trantor kernel: nvme nvme0: 15/0/0 default/read/poll queues
Nov 8 12:08:05 Trantor nginx: 2022/11/08 12:08:05 [crit] 2930#2930: *1016094 pwritev() "/var/lib/nginx/client_body/0000147454" failed (28: No space left on device), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"

Should I be worried about this?
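The nginx "No space left on device" error points at a path that, on a standard Unraid layout, usually lives on the in-memory root filesystem rather than the cache pool, so one thing worth checking when this appears is whether the RAM-backed filesystems have filled up (a sketch, assuming a standard Unraid layout):

# show usage of the root and log filesystems the nginx error path and the syslog live on
df -h / /var/log

# the largest log files are usually the ones being flooded by repeated kernel errors
du -sh /var/log/* 2>/dev/null | sort -h | tail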
JorgeB Posted November 9, 2022

Try this to see if it helps: some NVMe devices have issues with power states on Linux. On the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right) and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.
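After rebooting, the change can be confirmed from a terminal; a minimal sketch, assuming standard Linux/Unraid paths:

# the added parameters should appear on the kernel command line
cat /proc/cmdline

# and nvme_core should report the new maximum power-state latency (0 = deepest power states disabled)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us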
Abstract7227 Posted November 9, 2022

24 minutes ago, JorgeB said:
Try this to see if it helps: some NVMe devices have issues with power states on Linux ...

Ah, OK, thanks. Could this have caused the btrfs corruption I had, given that memtest did not find any RAM issues?
Abstract7227 Posted November 23, 2022

So I got the problem again. Diagnostics taken after the reboot are included, as well as a screenshot of some logs from before the reboot. This is after I replaced the RAM and reformatted my cache pool.

trantor-diagnostics-20221123-1355.zip
JorgeB Posted November 23, 2022

The screenshot doesn't show the complete call trace. If the server crashes and prevents downloading the diagnostics before rebooting, enable the syslog server and post that if it happens again.
Abstract7227 Posted December 8, 2022

I got a crash again today. The logs from the syslog server covering the relevant time period are included.

error trantor.txt
JorgeB Posted December 8, 2022

Problems with the NVMe device:

Dec 7 19:33:00 Trantor kernel: nvme nvme1: I/O 862 QID 2 timeout, aborting
Dec 7 19:33:00 Trantor kernel: nvme nvme1: I/O 0 QID 6 timeout, aborting
Dec 7 19:33:00 Trantor kernel: nvme nvme1: I/O 1 QID 6 timeout, aborting
Dec 7 19:33:00 Trantor kernel: nvme nvme1: I/O 2 QID 6 timeout, aborting
Dec 7 19:33:00 Trantor kernel: nvme nvme1: I/O 3 QID 6 timeout, aborting
Dec 7 19:33:30 Trantor kernel: nvme nvme1: I/O 862 QID 2 timeout, reset controller
Dec 7 19:34:00 Trantor kernel: nvme nvme1: I/O 0 QID 0 timeout, reset controller
Dec 7 19:34:31 Trantor kernel: nvme1n1: I/O Cmd(0x2) @ LBA 96582160, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
Dec 7 19:34:31 Trantor kernel: I/O error, dev nvme1n1, sector 96582160 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
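When a single NVMe device starts timing out like this, its health data is worth checking; a minimal sketch, assuming the affected device really is /dev/nvme1 and that smartmontools and nvme-cli are available on the system:

# SMART/health attributes for the NVMe device (media errors, temperature, percentage used)
smartctl -a /dev/nvme1

# equivalent view via nvme-cli, if installed
nvme smart-log /dev/nvme1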
Abstract7227 Posted December 8, 2022

Only the one device?
JorgeB Posted December 8, 2022

That's what I see in that log; then btrfs crashed.
Abstract7227 Posted December 8, 2022

This might be a dumb question, but why would btrfs crash because of one drive acting funny? From what I understand, my cache pool is configured as RAID1.
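Whether the pool really is using the RAID1 profile for both data and metadata can be confirmed from a terminal; a minimal sketch, assuming the pool is mounted at /mnt/cache:

# shows the allocation profile per block group type, e.g. "Data, RAID1" / "Metadata, RAID1"
btrfs filesystem df /mnt/cache

# more detailed per-device breakdown
btrfs filesystem usage /mnt/cache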