Jump to content

Diagnostics size limiting and other enhancements


RobJ

Recommended Posts

After a discussion over here, I'd like to request an automatic truncation to any log file over 3MB.  At times, we hear of diagnostics zip files that are too big to attach, and the problem always seems to be a log file that has exploded because of megabytes of repeated errors.  After the first hundred of these, there is NOTHING useful in the rest of them.  Since we don't need them at all, and they are causing huge problems, both by making the diagnostics zip too big and by possibly overfilling /var/log, I'd like to see the diagnostics function detect and truncate them, to 3MB.  3MB seems more than enough, more then I want to examine, but far less than the 10MB to 121MB to 133MB files I have seen.  And before truncating, you could capture the final tail of the last 200 lines, which aren't really needed but provide proof of the same repetition still and show whatever happened at the very last, if anyone cares.

 

The logic might be to loop through those files in /var/log that are known log files (syslog*, docker.log, etc), and if their size is over 3MB then do the following 2 lines (modified appropriately) -

 

      tail --lines=200 $file >/tmp/diagnostics/logs/tail.txt

      truncate -s 3M $file

 

This is unlikely to chop the file on a line break, so there will probably be a broken line, 3MB out there, but who cares!  This fixes the "diagnostics too big" problem.

 

In cases where a log file has stuffed /var/log so that no more logging can occur, requesting the diagnostics will have the positive side effect of opening up a lot of room there.  It's not a fix to the real problem that's causing the massive logging, but it could help current operations perform better, since normal logging can continue.

 

I would also like to request another addition to the diagnostics collection, not terribly important but useful once in a while.  Create a file such as folder.txt containing a number of folder listings, to allow us to examine the users flash contents and other folders.  A starting list might be -

- /boot

- /boot/config

- /boot/config/plugins

- /boot/extra

- /boot/syslinux

- /var/log

- /tmp

- and any others that could be useful in troubleshooting

 

I would want it to show all files, their dates and sizes, owners and permissions perhaps, and the volume label for the flash drive (should be in /boot?).  It would also be nice to see the free space for that volume, especially /tmp and /var/log.  If this is put in a simple script, it would be easy to modify in the future, and add other things when requested.

 

And one more request, very minor, and just for convenience - add the unRAID version near the top of the zip file.  I like knowing precisely what version the user is running before I do anything with the diagnostics.  Currently, I have to open the zip, click and open the diagnostics folder, click and open the system folder, click and open the vars file, and shoot down to the bottom, and there it is.  I then take it back out of the zip and rename it, just so I know from the top what version each diagnostics zip has.  It would be nice to create a zero byte file at the very top, name being the unRAID version (e.g. 6.2-beta21.txt).  But I understand I may be the only one interested in this...

Link to comment

 

 

And one more request, very minor, and just for convenience - add the unRAID version near the top of the zip file.  I like knowing precisely what version the user is running before I do anything with the diagnostics.  Currently, I have to open the zip, click and open the diagnostics folder, click and open the system folder, click and open the vars file, and shoot down to the bottom, and there it is.  I then take it back out of the zip and rename it, just so I know from the top what version each diagnostics zip has.  It would be nice to create a zero byte file at the very top, name being the unRAID version (e.g. 6.2-beta21.txt).  But I understand I may be the only one interested in this...

 

+1

 

Would also like the version info, oh, and the size reduction also.

 

 

Link to comment

I have made a development version which includes your requests plus some more additions.

 

A question though: in your referenced topic you talk about a file size of 10MB or 5MB, but here you propose 3MB. What is the preference? I can imagine that truncating too much may lead to loss of vital information, but you have seen many log files so knows much better what size is still acceptable.

 

I follow the rule that source files always stay untouched, the syslog files included in diagnostics will be truncated, but the originals stay as they are. When the log file system is really full then appropriate advice needs to be given.

 

Link to comment

I have made a development version which includes your requests plus some more additions.

Thank you!

 

A question though: in your referenced topic you talk about a file size of 10MB or 5MB, but here you propose 3MB. What is the preference? I can imagine that truncating too much may lead to loss of vital information, but you have seen many log files so knows much better what size is still acceptable.

Yeah I've been waffling.  I started with the observation that the largest that could fit was about 15MB, so 10MB seemed a nice round number that should always be safe.  Then Frank seemed reluctant with the idea, and I realized we don't need it that big, so cut it in half, to 5MB.  Thinking about it more, all the useful info is in the first 200KB to 300KB, so it's hard to imagine needing more than 500KB.  So 2MB seemed a generous number, and then raised it to 3MB in case anyone out there still objected.  At this point, I'm not sure it matters much so long as it is at least 1MB.  I'm not interested and I doubt anyone else is either, in anything beyond say 500KB.

 

I follow the rule that source files always stay untouched, the syslog files included in diagnostics will be truncated, but the originals stay as they are. When the log file system is really full then appropriate advice needs to be given.

I do understand the principle, and accept it, but it did seem like an easy way to give them a little more life, just for that session.  They're still in trouble, HAVE to fix the problem, but might get a little breathing room for the moment.  There's NOTHING useful in what we are chopping off.

Link to comment

Instead of always truncating the log files, should probably make 2 diagnostics files - one with the truncated zip's and the other untruncated.  That way, should there be information required in the non-truncated version it can still be uploaded to dropbox if need be.

 

The problem with truncating is that when its truncated, there will probably be information within syslog.1 and syslog.2, and depending upon how its actually truncated, the information leading up to the repeated errors may get completely lost

 

Link to comment

The truncate command (truncate -s 2M $file) just changes the file size, effectively removing the last megabytes, megabytes we don't care about AT ALL.  If you can find a huge diagnostics, you can see for yourself.  Certainly if someone feels there's something useful out there, I'll change my recommendation, but I have NEVER seen anything useful beyond about 500KB.  The point at which the errors begin is the interesting point.

 

My suggestion was to chop any of the log files that were huge.  Usually only one is, never seen more than one.  If syslog.1 and syslog.2 exist (or syslog.3 etc), then I'm only interested in the first 500KB of syslog.2 (or syslog.3 etc, what ever is highest).  syslog and syslog.1 are just repeated errors, of which I already have many megabytes in syslog.2.  (If you just chop any that are too big, then you don't have to examine them, to figure which one you want.)  I requested a tail of the last 200 syslog lines though, just in case.  Could make it 500 lines if someone preferred.  It's going to be all or mostly more repeated error lines, but could possibly include their last actions to fix the issues.

Link to comment
When the log file system is really full then appropriate advice needs to be given.
I'd like to interject something here. The process is truly broken when logs ABOUT an error cause more errors. It makes the original issue difficult to deal with, and causes more pain that could be avoided. There needs to be a mechanism in place that archives huge log files to a secondary location (boot flash?) and cleans out the RAM FS so recovery efforts aren't hampered. Writing a few MB of zipped logs to the flash when there is an issue isn't going to accelerate wear by that much.
Link to comment
  • 1 month later...

Sorry for posting my request in someone elses thread :-[ , but while my motives are diffrent (privacy/security), it would result in less "useless stuff" in syslog.txt and therefore reducing its size, which is part of this threads intention...

 

Mover entrys are very verbose, a cache folder with many items can spam in syslog.

While the share-names are masked, subfolder/file names are not, but in my opinion should be for the same reason.

While masking would not really reduce the syslog size, maybe just measuring the amount of moved items?

I am fine with "281 items moved" an a per share basis, at least in the default/anonymized diagnostic.

 

Mover-realated issues, could be investigated with the non-anonymized version through pm I think.

 

I could edit this post and create a new thread I you want, because some of your requests are already live it seems.

I had a 17MB Syslog1.txt in todays diagnostics due to a full cache disk and the rather long mover entrys hours later.

That was over a week ago and completly unrelated to the reason I posted my diagnostics.

Link to comment
  • 3 months later...

The truncate command (truncate -s 2M $file) just changes the file size, effectively removing the last megabytes, megabytes we don't care about AT ALL.  If you can find a huge diagnostics, you can see for yourself.  Certainly if someone feels there's something useful out there, I'll change my recommendation, but I have NEVER seen anything useful beyond about 500KB.  The point at which the errors begin is the interesting point.

 

My suggestion was to chop any of the log files that were huge.  Usually only one is, never seen more than one.  If syslog.1 and syslog.2 exist (or syslog.3 etc), then I'm only interested in the first 500KB of syslog.2 (or syslog.3 etc, what ever is highest).  syslog and syslog.1 are just repeated errors, of which I already have many megabytes in syslog.2.  (If you just chop any that are too big, then you don't have to examine them, to figure which one you want.)  I requested a tail of the last 200 syslog lines though, just in case.  Could make it 500 lines if someone preferred.  It's going to be all or mostly more repeated error lines, but could possibly include their last actions to fix the issues.

You wanted an interesting diagnostics.  Here's one:  http://lime-technology.com/forum/index.php?topic=51830.msg497507#msg497507 that I'm not quite sure would hit your criteria on limitations

 

One repeated problem starts at line 2305

Another unrelated problem starts at line 2809

The second problem in turn (with some help from the first) caused a third problem which started at line 9466 and carried on to line 21836

The fourth problem starts at 21837 and carries on to the end line 985013

(Additionally, there's another problem completely unrelated (I believe) that only occurs right at the beginning after the array is started)

 

Would your criteria have caught all of this?  The last 200 lines or so of the log are ambiguous as to whether its problem #1 or #2 cause the last 200 lines logged problem, but taken as a whole problems 1 & 2 are unrelated and problem 4 is related to #2.

 

Buddy's syslog.1 here (syslog for some reason is empty) shows just how logging can get in a perfect storm of by themselves relatively minor issues.

 

And I don't know if you could diagnose #4 properly without seeing #3 and only seeing #2

 

EDIT: I guess that even with truncated logs you would start by restarting the server and grabbing a log right away.  My eyes are blurry from looking at a syslog composed of 1M lines and probably missed some other things in it....

 

 

Link to comment

What about having an abbreviated log file that contains only unique text entries ignoring the timestamp of the issue? That way all the relevant log lines are shown with duplicates being removed.

 

After the log message is separated from the timestamp it should be quit simple to do using uniq.

Link to comment

I think rsync actually crashed as lines 240384-918000 are all the exact same line over and over again.

 

Removing duplicated lines over and over again would solve the problem in this particular case, but I don't know how you go through a syslog looking for issues.  I page down until I see stuff repeated over and over again.  Then I know I've got the issue and go back trying to find the first line.  If there's only one line you're not going to catch it.  (Unless you really love reading syslogs line by line)

Link to comment

If the repeats are replaced with a single line saying "repeats 1500 times" when they are removed, you could easily highlight that in your analysis.

That'll work so long as its not the single indicator of an issue.  Otherwise it'll get buried and you won't notice it unless you actually read the syslog and not skim it looking for issues.

 

And while this may sound callous, the day that I have to start actually reading a syslog looking for issues instead of skimming them looking for obvious issue blocks and then read prior to the block is the day that I stop downloading them.

Link to comment

And while this may sound callous, the day that I have to start actually reading a syslog looking for issues instead of skimming them looking for obvious issue blocks and then read prior to the block is the day that I stop downloading them.

I just have to say Squid is so right about this.  I've gotten completely burned out several times spending too much time analyzing long syslogs, and it makes me quit for awhile coming to the forums.  I know I can be helpful, and I hate not being as helpful, but I've learned I have to be selective, only open threads by impulse, and then decide how I feel about contributing anything at all.  That feels really unfair to other users who also need help, but it's the only way to keep viewing syslogs after awhile.  I really appreciate that more users are pitching in and helping.

 

(It really scares me when I read someone saying "if only unRAID had this 'ease of use' feature, that would attract so many more non-technical users!".  How would we ever support them?  How would we ever keep up?)

 

If the repeats are replaced with a single line saying "repeats 1500 times" when they are removed, you could easily highlight that in your analysis.

The syslog module (rsyslog I think) already has that feature, can't show you an example at the moment, but I've seen that a number of times in syslogs.  The problem we face though is that most issues are not on single lines, they are multi-line, usually a variable number of lines.  You can work out a pattern for most of them, but every problem is different, and has a different number of lines, often a variable number of lines.

 

Take the most well-known, the ATA exception handler series of messages.  The following is an excellent example of 2 different error handlers, mixing their messages (the ATA exception handler ones are highlighted, the un-highlighted are from the other handler, at a different level):

Jun 18 03:41:42 Tardis kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Jun 18 03:41:42 Tardis kernel: ata8.00: irq_stat 0x00060002, device error via D2H FIS

Jun 18 03:41:42 Tardis kernel: ata8.00: failed command: READ DMA EXT

Jun 18 03:41:42 Tardis kernel: ata8.00: cmd 25/00:08:28:8d:24/00:00:21:00:00/e0 tag 0 dma 4096 in

Jun 18 03:41:42 Tardis kernel:          res 51/40:08:28:8d:24/00:00:21:00:00/e0 Emask 0x9 (media error)

Jun 18 03:41:42 Tardis kernel: ata8.00: status: { DRDY ERR }

Jun 18 03:41:42 Tardis kernel: ata8.00: error: { UNC }

Jun 18 03:41:42 Tardis kernel: ata8.00: configured for UDMA/100

Jun 18 03:41:42 Tardis kernel: sd 3:0:0:0: [sdb] Unhandled sense code

Jun 18 03:41:42 Tardis kernel: sd 3:0:0:0: [sdb] 

Jun 18 03:41:42 Tardis kernel: Result: hostbyte=0x00 driverbyte=0x08

Jun 18 03:41:42 Tardis kernel: sd 3:0:0:0: [sdb] 

Jun 18 03:41:42 Tardis kernel: Sense Key : 0x3 [current] [descriptor]

Jun 18 03:41:42 Tardis kernel: Descriptor sense data with sense descriptors (in hex):

Jun 18 03:41:42 Tardis kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00

Jun 18 03:41:42 Tardis kernel:        21 24 8d 28

Jun 18 03:41:42 Tardis kernel: sd 3:0:0:0: [sdb] 

Jun 18 03:41:42 Tardis kernel: ASC=0x11 ASCQ=0x4

Jun 18 03:41:42 Tardis kernel: sd 3:0:0:0: [sdb] CDB:

Jun 18 03:41:42 Tardis kernel: cdb[0]=0x28: 28 00 21 24 8d 28 00 00 08 00

Jun 18 03:41:42 Tardis kernel: end_request: I/O error, dev sdb, sector 556043560

Jun 18 03:41:42 Tardis kernel: Buffer I/O error on device sdb1, logical block 69505437

Jun 18 03:41:42 Tardis kernel: ata8: EH complete

The ATA exception handler series all start with an "exception Emask" line, and end with the ""EH complete".  Both of the error handlers above are completely variable as to how many lines they may take.  But you may see each group of error lines above repeat for possibly thousands of times.  To anyone with experience though, all they have to see is that "(media error)" where it is, and the "UNC" flag below it, and you know it's a bad uncorrectable sector that's failing a read request (from the "READ DMA").  You don't need to see a single repeat of this exception handler sequence.

 

You wanted an interesting diagnostics.  Here's one:  ...

Intriguing ..., but not tonight!  Tomorrow I should have time.  I always have very mixed feelings when faced with a huge syslog, full of trouble.  I love the challenge, but get wore out when the minutes turn into hours.

 

Squid, I've been noticing that you have been getting better and better at syslog analysis, and clearly taking longer at it.  Just don't burn out, we need you!  But you've been doing great work!

 

I do appreciate the ideas expressed, and I'd love to see improvement.  The bash shell has some good tools for text filtering, perhaps using awk.  I'd like to see a tool that made it easy to drop lines matching a given pattern.  Say you've identified an issue like the above error lines, then you select the pattern group that matches all or most of those lines, and gives you back a syslog without *any* of them.  Suddenly millions of lines become hundreds of lines.  Makes looking for the next problem MUCH easier.

Link to comment

* Brit, thanks very much for mentioning uniq!  I'd never heard of it before.  I wasted a couple of hours when I misunderstood the -u option, but once I figured that out, it's a beautiful thing.  I installed the GNU CoreUtils for Windows, opened a command prompt, set the path, and with the command "uniq -s 25 syslog.txt sysnew.txt" got a new syslog file that was 216MB smaller, from 257,914,266 bytes down to 41,145,363.  That uniq command dropped the last almost 750,000 lines of the same repeated rsync error!  I'll be using it again.  I then used split to produce a file with only the first 240,800 lines in it, to try and match what Squid was viewing, make the lines match up (unsuccessfully).

 

* I first noticed that Jeffrey had lines in his go file to double the /var/log space, from 128MB to 256MB, which explained why he had a 256MB syslog!  I would like to say that's a really bad idea, and we should discourage users from doing that.  An analogy:

- Problem: an employee of yours is apparently throwing your stuff in the dumpster, and the dumpster overflows

- Solution A: get a bigger dumpster

- Solution B: get rid of the employee

Fixing the problem is ALWAYS going to be a better solution than getting a bigger dumpster.  If you decide that a bigger dumpster is your solution, then when it fills too, you will do the same and get an even bigger one (increase from 256MB to 1GB), and when it fills, get a bigger one yet (increase to 64GB).  At some point, you can't get a bigger dumpster!

 

I'll go farther and say that a better solution would have been to get a smaller dumpster, decrease /var/log capacity to 8MB (or even 4MB), because you hit the problem that much sooner (you're going to hit it anyway!) and with much more manageable files for the troubleshooter.  It's therefore possible that the problem will be diagnosed *sooner*, with a smaller /var/log and smaller syslog!  I think I may start recommending that /var/log be decreased from 128MB to something like 8MB at most.  If it's going to fill up, it's going to fill up whatever the size!  If a user has Dockers and VM's, and rarely reboots (has long uptimes), then perhaps they should have a larger size.

 

* Squid, I couldn't get the line numbers to match up with yours, at all.  At first I thought you must have line wrap on, which would increase the line count, but that didn't work.  I gave up trying.  Your end line was at 985,013, the actual line count was 985,416 which is pretty close, just 403 shy.  Once I realized that, I added 403 to your 240,384 and was dead on the repeating rsync error start at 240,787.  I added 403 to your 21837 and was roughly close to another repeating rsync error start at 22267.  Most of the btrfs errors ran from around line 5981 to 22239, overlapping initially with the shfs errors that ran from line 3304 to 9901.

 

* I think there are a couple of problems, but one is the critical one, and it caused all or most of the rest of the issues.  The power or SATA connection to the Cache drive disconnected, and very quickly the Cache drive was dropped, for good.  Seconds later the connection was good again, and the kernel reinitialized the drive, BUT it set it up as sdn(!), not it's correct sde!  Shortly after it was dropped again, and it was again regained, but still at sdn.  So for all intents the Cache drive was gone, not accessible at all, and therefore any operations to it, or docker.img, or the BTRFS on loop0 were lost too, and you can completely ignore ALL of those errors!  They are collateral damage, the result of the loss of the Cache drive.  You quoted the disabling of the Cache drive yourself in your first response to Jeffrey, "Sep 12 02:05:57 HunterNAS kernel: ata7.00: disabled".  Yes after that, rsync crashes, docker.img goes bad, BTRFS crashes, etc etc, but nothing really matters after the Cache drive is disabled.  And that happens at line 3218.

 

If anything, I personally feel this just proves my point that once you find the real cause of an issue, essentially always in the first part of the syslog, the rest is just garbage, can only mislead and confuse you.  When I chopped this syslog to about 3300 lines, enough to find the problem, the size went to about 287KB, less than a third of a megabyte.  Truncating this syslog to 1MB would have still provided an awful lot of useless garbage, over 700KB, but 1MB or 2MB is much more manageable than the huge files we've seen.

Link to comment

Thanks Rob for actually reading the syslog.

 

If nothing else, some changes to CA are being made to limit the amount of logging that happens in an error situation (not much I can do about rsync taking it on itself to log to the syslog if its logging location in docker.img suddenly becomes unavailable (if docker isn't running, then it'll actually log into RAM instead), but now I see that in an extreme case like this, CA would actually log ALL the rsync errors into the syslog itself.  And in my case, that would wind up being 200,000 lines of just errors.  I think I'll just log the first 10 errors instead and hope for the best.

 

That, and jeffrey is now in a catch-22 with the backups as if CA is set up correctly it will automatically delete old backup sets after so many days, but since he told CA to not delete them, the drive is now full, so even if he sets it up correctly, the drive is now full which means that the backup will error out again, and on errors CA will not delete old backups.  So I've now got a new script included that will delete all the old dated backups because its a PITA to delete them manually without dropping to a command line due to the permissions involved

 

Sometimes you just need to have a disaster to be able to move ahead  :)

 

Link to comment

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...