Aggregating Error Logs to Send a Warning Email When Too Many of Them - Log4j, Stat4j, SMTPAppender

Our development team wanted to get notified as soon as something goes wrong in our production system, a critical Java web application serving thousands of customers daily. The idea was to let it send us an email when there are too many errors, indicating usually a problem with a database, an external web service, or something really bad with the application itself. In this post I want to present a simple solution we have implemented using a custom Log4J Appender based on Stats4j and an SMTPAppender (which is more difficult to configure and troubleshoot than you might expect) and in the following post I explore how to achieve the same effect with the open-source Hyperic HQ monitoring SW.

The Challenge

We faced the following challenges with the logs:
  • It's unfortunately normal to have certain number of exceptions (customers select search criteria yielding no results, temporary, unimportant outages of external services etc.) and we certainly don't want to be spammed because of that. So the solution must have a configurable threshold and only send an alert when it is exceeded.
  • The failure rate should be computed for a configurable period (long enough not to trigger an alert because of few-minutes outages yet short enough for the team to be informed ASAP when something serious happens).
  • Once an alert is send, no further alerts should be send again for some time (ideally until the original problem is fixed), we don't want to be spammed because of a problem we already know about.

The Solution

We've based our solution on Lara D'Abreo's Stat4J, which provides a custom Log4J appender that uses the logs to compute configurable measures and triggers alerts when they exceed their warning or critical thresholds. It is couple of years old, alpha-quality (regarding its generality and flexibility) open-source library, which is fortunately simple enough to be modified easily for one's needs.

So we have tweaked Stat4J to produce alerts when the number of alerts exceeds thresholds and keep quiet thereafter and combined that with a Log4J SMTPAppender that listens for the alerts and sends them via e-mail to the team.

Stat4J Tweaking

The key components of Stat4J are the Stat4jAppender for Log4J itself, calculators (measures) that aggregate the individual logs (e.g. by counting them or extracting some number form them), statistics that define which logs to consider via regular expressions and how to process them by referencing a calculator, and finally alerts that log a warning when the value of a statistics exceeds its limits. You can learn more in an article that introduces Stat4J.

We have implemented a custom measure calculator, RunningRate (to count the number of failures in the last N minutes) and modified Stat4J as  follows:
  • We've enhanced Alert to support a new attribute, quietperiod, so that once triggered, subsequent alerts will be ignored for that duration (unless the previous alert was just a warning while the new one is a critical one)
  • We've modified the appender to include the log's Throwable together with the log message, which is then passed to the individual statistics calcualtors, so that we could filter more precisely what we want to count
  • Finally we've modified Alert to log alerts as errors instead of warnings so that  the SMTPAppender wouldn't ignore them
Get our modified Stat4j from GitHub (sources or a compiled jar). Disclaimer: It is one day's hack and I'm not proud of the code.

Stat4J Configuration

Take the example stat4j.properties and put it on the classpath. It is already configured with the correct calculator, statistics, and alert. See this part:


...
### JAKUB HOLY - MY CONFIG
calculator.minuteRate.classname=net.sourceforge.stat4j.calculators.RunningRate
# Period is in [ms] 1000 * 60 * 10 = 10 min:
calculator.minuteRate.period=600000

statistic.RunningErrorRate.description=Errors per 10 minutes statistic.RunningErrorRate.calculator=minuteRate # Regular expression to match "<throwable.toString> <- <original log message>" statistic.RunningErrorRate.first.match=.*Exception.*

# Error Rate alert.TooManyErrorsRecently.description=Too many errors in the log alert.TooManyErrorsRecently.statistic=RunningErrorRate alert.TooManyErrorsRecently.warn= >=3 alert.TooManyErrorsRecently.critical= >=10 alert.TooManyErrorsRecently.category=alerts # Ignore following warnings (or criticals, after the first critical) for the given amount of time: # 1000 * 60 * 100 = 100 min alert.TooManyErrorsRecently.quietperiod=6000000


The important config params are
  • calculator.minuteRate.period (in ms) - count errors over this period, reset the count at its end; a reasonable value may be 10 minutes
  • alert.TooManyErrorsRecently.warn and .critical - trigger the alert when so many errors in the period has been encountered; reasonable values depend on your application's normal error rate
  • alert.TooManyErrorsRecently.quietperiod (in ms) - don't send further alerts for this period not to spam in a persistent failure situation; the reasonable value depends on how quickly you usually fix problems, 1 hour would seem OK to me
  • Notice that statistic.RunningErrorRate.first.match is a regular expression defining which logs to count; ".*" would include any log, "your\.package\..*Exception" any exception in the package and so on, you can even specify logs to exclude using a negative lookahead ((?!X))

Log4J Configuration

Now we need to tell Log4J to use the Stat4j appender to count error occurences and to send alerts via email:


log4j.rootCategory=DEBUG, Console, FileAppender, Stat4jAppender
...
### Stat4jAppender & EmailAlertsAppender ###
# Collects statistics about logs and sends alerts when there
# were too many failures in cooperation with the EmailAlertsAppender

## Stat4jAppender log4j.appender.Stat4jAppender=net.sourceforge.stat4j.log4j.Stat4jAppender log4j.appender.Stat4jAppender.Threshold=ERROR # For configuration see stat4j.properties

## EmailAlertsAppender # BEWARE: SMTPAppender ignores its Thresholds and only evers sends ERROR or higher messages log4j.category.alerts=ERROR, EmailAlertsAppender log4j.appender.EmailAlertsAppender=org.apache.log4j.net.SMTPAppender log4j.appender.EmailAlertsAppender.To=dummy@example.com # BEWARE: The address below must have a valid domain or some receivers will reject it (e.g. GMail) log4j.appender.EmailAlertsAppender.From=noreply-stat4j@google.no log4j.appender.EmailAlertsAppender.SMTPHost=172.20.20.70 log4j.appender.EmailAlertsAppender.BufferSize=1 log4j.appender.EmailAlertsAppender.Subject=[Stat4j] Too many exceptions in log log4j.appender.EmailAlertsAppender.layout=org.apache.log4j.PatternLayout log4j.appender.EmailAlertsAppender.layout.ConversionPattern=%d{ISO8601} %-5p %X{clientIdentifier} %c %x - %m%n


Comments
  • #8 Specify the Stat4J appender
  • #9 Only send ERRORs to Stat4J, we are not interested in less serious exceptions
  • #14 "alerts" is the log category used by Stat4jAppender to log alerts (the same you would create via Logger.getLogger("alerts")); as mentioned, SMTPAppender will without respect to the configuration only process ERRORs and higher

Issues with the SMTPAppender

It is quite tricky to get the SMTPAppender working. Some pitfall:
  • SMTPAppender ignores all logs that are not ERROR or higher without respect to how you set its threshold
  • If you specify a non-existing From domain then some recipient's mail servers can just delete the email as spam (e.g. GMail)
  • To send emails, you of course need mail.jar (and for older JVMs also activation.jar), here are instructions for Tomcat
And one $100 tip: to debug it, run your application in the debug mode and set a method breakpoint on javax.mail.Transport#send (you don't need the source code) and when there, set this.session.debug to true to get a very detailed log of the following SMTP communication in the server log.

Sidenote

The fact that this article is based on Log4J doesn't mean I'd personally choose it, it just came with the project. I'd at least consider using the newer and shiny Logback instead :-).

Conclusion

Stat4j + SMTPAppender are a very good base for a rather flexible do-it-yourself alerting system based on logs and e-mail. You can achieve the same thing out-out-the-box with Hyperic HQ plus much much more (provided that you get your admins to open two ports for it), which I will describe in the next blog post.

Links

  • An alternative for preventing the SMTPAppender from spamming in persisten failure situations (aside of its built-in buffer size): log4j-email-throttle
  • EventConsolidatingAppender - announced via mailing list in 2/2011 - "the purpose of this appender is to consolidate multiple events that are received by a single logger within a specified number of seconds into a single event; this single consolidated event is then forwarded to a 'downstream' appender"

Tags: monitoring DevOps


Copyright © 2024 Jakub Holý
Powered by Cryogen
Theme by KingMob