July 28, 2019 Please note: I've been using Unraid for about a month now, and it's my first foray into anything related to Linux. If any information or action is needed from me, it would be extremely helpful if, in addition to being told what to do, I could be told how to do it. I am still learning. Apologies for the long-winded post, but I wanted to be thorough. Three days ago, I started seeing some instability issues within the Unraid webgui. As an example, the first time I noticed an issue was when I manually tried to invoke the mover and clicking the button appeared to do nothing. It didn't say the mover had started and my cache disk's usage didn't drop. I later found that the mover did actually run and the cache was emptied; the webgui just took several hours to detect the change. I also noticed other oddities: system information not populating on the main page, my memory hitting 100% consumption with the log at one point (and in a separate instance I actually got an error within the log that it had run out of memory, with no other logging present), and my CPU frequently showing 100% usage on most or all cores despite my server not doing anything. The first thing I did was give the server a reboot, which didn't seem to really help, but it did invoke a parity check, so I decided to go to sleep and let that run. I woke up the next morning to find the webgui unresponsive and the server inaccessible (it was still powered on, though). I rebooted again, the parity check started over, and I started searching through the logs and noticed a call trace happening every 3-5 minutes pretty consistently. I did a lot of Googling on this and understood very little of it, but it seemed like it could be a hardware issue (my server is composed of older recycled hardware: an FX-6300, MSI MB, 16GB DDR3). I decided to fire up memtest and let it run for a bit. Right as I came to check on it after an hour, I saw my CPU at around 90C and then the system shut off... well, there is a problem.
Turns out my heatsink fan had died. Everything stayed shut down until last night, when I could get a new heatsink on it. After doing that, things looked good: CPU temps were down to 35C, Unraid booted up fine and started a parity check yet again, but everything seemed stable and there were no more errors in the log. I watched it for about 4 hours and then went to sleep thinking I was in the clear. This morning, things got odd again. I woke up, checked the webgui and noticed the same call trace. Oddly, it was the very first one since the server came back online the previous night, and it happened at exactly the moment I accessed the server. No call traces all night, and now I've had them every ~2 minutes for the past half hour. I'm wondering if my CPU was damaged by all this, but I don't know how to decipher these logs. Hopefully someone here can. Thanks very much for any help that is offered. syslog.txt Edited July 28, 2019 by gilschwartzman
July 28, 2019 Community Expert Go to Tools -> Diagnostics and attach the complete diagnostics zip file to your next post.
July 29, 2019 Community Expert The NMI errors during parity check can usually be fixed by lowering the md_sync_thresh tunable (Settings -> Disk Settings).
July 29, 2019 Author 2 hours ago, johnnie.black said: The NMI errors during parity check can usually be fixed by lowering the md_sync_thresh tunable (Settings -> Disk Settings) Any insight into why this field causes errors, or why it didn't start producing them (to my knowledge) until this week? Parity checks have always slowed my server way down, but never caused instability. Interestingly enough, the call traces did stop when my parity check finished; I haven't had one in about 12 hours now. I guess I have only noticed these errors with a parity check running, but my instability issues did start prior to the parity check. The chain of events seems to be instability that I assume was caused by my overheating CPU, followed by numerous lockups and reboots which triggered a parity check, which then triggered new errors as a result of the parity operation. If all the errors really were related to the parity check running, is there any correlation to the initial incident? Or did two completely unrelated incidents just happen to be set off in succession? I also noticed that at some point yesterday afternoon, the error that comes right before the call trace changed from reading "Not Tainted" to "Tainted", and then continued to read Tainted until the errors stopped. I tried to spend some time reading about what this means but really couldn't make sense of it. Is that significant in some way?
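For anyone wanting to line up when the call traces happened against when the parity check ran, and to see where the log switched from "Not tainted" to "Tainted", here is a rough Python sketch. It is not an Unraid tool. The function name `summarize_traces` is made up for illustration, and it assumes the traditional syslog line prefix (`Jul 28 07:12:01 hostname kernel: ...`) and that each trace is introduced by a line containing `Call Trace:`. If the attached syslog.txt looks different, the patterns would need adjusting.

```python
import re

# Assumption: traditional syslog prefix, e.g. "Jul 28 07:12:01 Tower kernel: ..."
STAMP = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def summarize_traces(lines):
    """Return a list of (timestamp, taint_state) for every 'Call Trace:' line.

    taint_state is "not tainted", "tainted", or None, taken from the most
    recent kernel header line seen before the trace.
    """
    events = []
    last_taint = None
    for line in lines:
        lowered = line.lower()
        # Check "not tainted" first, since it also contains "tainted".
        if "not tainted" in lowered:
            last_taint = "not tainted"
        elif "tainted:" in lowered:
            last_taint = "tainted"
        if "Call Trace:" in line:
            stamp = STAMP.match(line)
            events.append((stamp.group(1) if stamp else None, last_taint))
    return events
```

Calling it on the lines of a saved log (for example `summarize_traces(open("syslog.txt"))`) gives one entry per trace, which makes it easy to see whether the traces cluster inside the parity check window and at what time the taint state changed. As a general note rather than a diagnosis of this log: the kernel marks itself tainted after certain events, such as having already logged a warning, so the switch may simply reflect the earlier traces rather than a new fault.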
July 29, 2019 Community Expert 5 minutes ago, gilschwartzman said: Any insight into why this field causes errors or why it didn't start producing them (to my knowledge) until this week? Parity checks have always slowed my server way down, but never caused instability. IIRC it started with v6.7, possibly earlier.