ItsNotNick Posted April 19, 2022
My server has crashed 3 times now, increasing in frequency each time. I thought the issue might be a couple of new drives I added for parity, as they've produced parity errors during the latest check. I removed one of them and started a parity check, only for it to crash in about an hour. I did some due diligence and enabled syslog (although it's not letting me upload the logs here for some reason). So here's a link to it.
Also, I have no idea if it's related or just a random occurrence, but my docker.img tripled in size randomly, so I had to go through that whole process of fixing it. I don't know if that's helpful, I just thought I'd mention it in case it was.
JorgeB Posted April 19, 2022
43 minutes ago, ItsNotNick said: "So here's a link to it."
That's a CA backup.
ItsNotNick Posted April 20, 2022
Oh, I'm stupid. So it didn't log to the folder I set it to. I started another parity check and set it to log to flash, so I should get one soon. My bad.
ItsNotNick Posted April 24, 2022
OK, after dealing with some other things I was finally able to start another parity check with syslog logging, and it crashed about an hour and a half in. Looking at the syslog file, I can see that there is data from after the crash, so I should mention that I restarted the server before remembering about the syslog. Sorry if that makes it harder.
Edited April 24, 2022 by ItsNotNick
JorgeB Posted April 24, 2022
The only thing of note I see is multiple of these:
Apr 24 00:11:12 Andromeda kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 24 00:11:12 Andromeda kernel: caller _nv000651rm+0x1ad/0x200 [nvidia] mapping multiple BARs
There's one just before the crash. I can't say it's related, but it's worth trying to run the server without the Nvidia GPU, or installing it in a different PCIe slot if available.
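For anyone wanting to check their own log for the same warning, here is a minimal Python sketch, assuming the syslog is a plain text file. The function name and the regex are just an illustration built from the two lines quoted above, not an Unraid tool, and a match only shows the warning is present, not that it caused the crash.

```python
import re

# Assumption: these two phrases are enough to identify the warning quoted above.
PATTERN = re.compile(r"resource sanity check|mapping multiple BARs")

def find_bar_warnings(lines):
    """Return (line_number, text) for every syslog line matching PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]

# The two lines quoted in the post above, used as sample input.
sample = [
    "Apr 24 00:11:12 Andromeda kernel: resource sanity check: requesting "
    "[mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 "
    "[mem 0x000c0000-0x000dffff window]",
    "Apr 24 00:11:12 Andromeda kernel: caller _nv000651rm+0x1ad/0x200 "
    "[nvidia] mapping multiple BARs",
]

hits = find_bar_warnings(sample)
print(len(hits))  # → 2

# To scan a real file (the path "syslog" here is hypothetical):
# with open("syslog") as f:
#     hits = find_bar_warnings(f)
```

Comparing the timestamp of the last hit with the last line written before the reboot gives a rough idea of how close to the crash the warning appeared.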
ItsNotNick Posted April 24, 2022
I am planning to swap the motherboard for one with a couple more PCIe slots. I could do that now, but I feel like it would be better not to change too much at once. Also worth mentioning: this only happens when running a parity check. I ran the server all week without a single crash, then started a parity check, only for it to crash within a couple of hours, like I said.
JorgeB Posted April 24, 2022
It's likely hardware related, if it's not the GPU it could be PSU, board, etc., you can also try v6.10.0-rc to rule out any kernel compatibility issue.
ItsNotNick Posted May 16, 2022
On 4/24/2022 at 2:46 AM, JorgeB said: "It's likely hardware related, if it's not the GPU it could be PSU, board, etc., you can also try v6.10.0-rc to rule out any kernel compatibility issue."
So, after some time, I went from connecting every drive directly to the motherboard to using an HBA card. After doing that, my parity check went from 17 errors to 16. My system hasn't crashed at all, which is good (?), although I don't know exactly what I did to fix that. I have 2 questions:
1: Does the decrease in parity checks (assuming it's not a fluke) suggest it's the cables? I did reconnect every single cable at least once.
2: Is 16 parity checks a real issue or am I overreacting and it is something I could just ignore?
itimpi Posted May 16, 2022
46 minutes ago, ItsNotNick said: "Is 16 parity checks a real issue or am I overreacting and it is something I could just ignore?"
Assuming you mean errors reported during a parity check, then anything other than 0 is an issue: if a disk fails and needs to be rebuilt onto a replacement, those 16 sectors will be corrupt after the rebuild. Are the checks that report this correcting or non-correcting?
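To see why a single parity error matters, here is a small Python sketch of the idea. It assumes a simplified single-parity model where parity is the XOR of the data disks; the byte values and the choice of which position is wrong are made up for illustration, and this is not Unraid's actual code.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR the bytes at each position across all blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three tiny "data disks" (made-up contents, 4 bytes each).
disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
parity = bytearray(xor_parity(disks))

# Simulate one parity error: position 2 no longer matches the data.
parity[2] ^= 0xFF

# Disk 0 "fails"; rebuild it from the surviving disks plus parity.
rebuilt = xor_parity([disks[1], disks[2], bytes(parity)])

# Positions where the rebuilt disk differs from the original.
bad = [i for i, (a, b) in enumerate(zip(disks[0], rebuilt)) if a != b]
print(bad)  # → [2]
```

Every position where parity disagrees with the data comes back wrong on the rebuilt disk, which is why the target is 0 errors. In a real array a mismatch does not say whether parity or one of the data disks holds the bad value; this sketch simply assumes it is parity.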