August 16, 2021: To keep this thread clean and on point, I have narrowed it to the single issue it helped resolve: corruption on the cache pool. I recently resolved an MCE event by replacing hardware. Everything now tests fine and I receive no errors; this is likely unrelated, but I wanted to mention it. For the last three mornings I have woken to find my server frozen and completely unreachable. Even the iDRAC console gives no response, and I have to warm-cycle the server to bring it back online. Any insight would be helpful. Attached are the diagnostics pulled after it booted this morning. gsa-diagnostics-20210816-0916.zip (Edited August 20, 2021 by fmp4m. Reason: adjust for problem addressed.)
August 18, 2021, Author: Here is the syslog. I was able to SSH in this time, but the shutdown script failed. It appears some sort of corruption is occurring, so I definitely need your advice. syslog
August 18, 2021, Community Expert: See here for the checksum errors; you should run a scrub and monitor the pool going forward. The main issue, though, appears to be the constant call traces. I can't see what's causing them; it looks hardware related (or your hardware doesn't like that kernel). You can try upgrading to v6.10 to see if the newer kernel helps; if the behavior is the same, it's likely hardware.
August 18, 2021, Author: OK. I set up the script in User Scripts and scheduled it. I have also started a scrub now that the parity check has finished (the reboot triggered it, and it found zero errors). I will report back when the scrub finishes, and then try the beta to see if that sorts out the call traces.
August 19, 2021, Author: OK, the scrub reports two uncorrectable errors. Any help on how to find what they are and how to fix them?
August 19, 2021, Community Expert: (5 hours ago, fmp4m said: "Ok, two uncorrectables") Check the syslog for the name of the file(s), then delete them or restore them from backup.
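As a rough illustration of the "check syslog for the file names" step, here is a minimal Python sketch that pulls file paths out of Btrfs checksum-error messages. It assumes the kernel logs these in the form `BTRFS warning (device sdb1): checksum error at logical ... (path: some/file)`; the function name `corrupted_paths` and the sample log lines are made up for the example and are not taken from the poster's syslog.

```python
import re

# Assumed message shape: "... BTRFS warning (device X): checksum error at logical ...
# ... (path: relative/file/name)". Lines that do not match are ignored.
CHECKSUM_PATH = re.compile(r"BTRFS .*checksum error .*\(path: (.+)\)\s*$")


def corrupted_paths(lines):
    """Return the sorted, de-duplicated file paths named in checksum-error lines."""
    found = set()
    for line in lines:
        match = CHECKSUM_PATH.search(line.rstrip("\n"))
        if match:
            found.add(match.group(1))
    return sorted(found)


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    "Aug 19 03:10:01 tower kernel: BTRFS warning (device sdb1): checksum error at "
    "logical 1234 on dev /dev/sdb1, physical 5678, root 5, inode 257, offset 0, "
    "length 4096, links 1 (path: appdata/example/db.sqlite)",
    "Aug 19 03:10:02 tower kernel: BTRFS info (device sdb1): scrub: started on devid 1",
    "Aug 19 03:11:40 tower kernel: BTRFS warning (device sdb1): checksum error at "
    "logical 9999 on dev /dev/sdb1, physical 8888, root 5, inode 300, offset 4096, "
    "length 4096, links 1 (path: appdata/example/db.sqlite)",
]

if __name__ == "__main__":
    for path in corrupted_paths(SAMPLE_LOG):
        print(path)
```

The reported paths are relative to the pool's root, so they would need the mount point prepended before deleting or restoring anything. On a real system the lines could be read from the syslog file rather than a hard-coded list.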
August 19, 2021, Author: (7 hours ago, JorgeB said: "Check syslog for name of the file(s), delete them or restore from backup.") Excellent, I will do that. I loaded RC1 of 6.10 and am having an issue accessing the WebUI (it never loads), so I will have to sort that out first. That will need another thread.
August 20, 2021, Author: (On 8/19/2021 at 9:32 AM, fmp4m said: "Excellent I will do that. Loaded the RC1 for 6.10 and am having an issue accessing the WebUI so I will have to sort that first. (never loads) will need another thread.") OK, I deleted the files and re-ran the scrub, and it comes back with:
    Scrub started: Thu Aug 19 12:34:41 2021
    Status: finished
    Duration: 6:06:46
    Total to scrub: 34.36TiB
    Rate: 1.60GiB/s
    Error summary: no errors found
BUT the script you linked to returns this:
    [/dev/sdb1].write_io_errs 0
    [/dev/sdb1].read_io_errs 0
    [/dev/sdb1].flush_io_errs 0
    [/dev/sdb1].corruption_errs 1137
    [/dev/sdb1].generation_errs 0
So I think I still have corruption that neither the scrub nor the log is showing. Advice?
August 20, 2021, Community Expert: (15 minutes ago, fmp4m said: "BUT the script you linked to returns this:") You have to reset the errors; the linked FAQ entry explains how.
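To make the counter output quoted above easier to act on, here is a small Python sketch that parses text in the `[device].counter value` shape shown in the post and reports only the non-zero counters. The function names `parse_dev_stats` and `nonzero_counters` are hypothetical; the sample text is the output from the post.

```python
import re

# One counter per line, e.g. "[/dev/sdb1].corruption_errs 1137".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Map (device, counter_name) -> integer value for every recognised line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[(match["dev"], match["name"])] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """Keep only the counters whose value is greater than zero."""
    return {key: value for key, value in stats.items() if value > 0}


# Counter output as posted in this thread.
POSTED_OUTPUT = """\
[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 0
[/dev/sdb1].flush_io_errs 0
[/dev/sdb1].corruption_errs 1137
[/dev/sdb1].generation_errs 0
"""

if __name__ == "__main__":
    for (dev, name), value in nonzero_counters(parse_dev_stats(POSTED_OUTPUT)).items():
        print(f"{dev}: {name} = {value}")
```

These counters are cumulative, so a non-zero value can remain after a clean scrub until the statistics are reset. With btrfs-progs that is normally done with `btrfs device stats -z` followed by the pool's mount point; the mount point itself is not given in this thread.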
August 20, 2021, Author: (2 minutes ago, JorgeB said: "You have to reset the errors, it explains how in the linked FAQ entry.") Makes sense. Sorry, I didn't want to reset until I knew it was the right thing to do, and I missed the "lifetime" note. Thanks again. The call traces apparently relate to the Nvidia modules, so I opened a separate thread for that.