Webapp performance monitoring with Glassbox 2.0: How does it work?
A word of warning: Information on this page originates from my exploration of Glassbox performed in Oct 2008 and may be inaccurate. Ron Bodkin, the mastermind behind Glassbox, was so kind as to review this but still there may be some mistakes or inexact informatiom left. In any case blame me :-)
There're a few open source java webapp monitoring tools: Glassbox (latest release 2008-09-02), JAMon (2007-09-20), InfraRED (2006-05-17), UseMon (2008-10-06). Among these, Glassbox is both still actively developed and mature. It has also nice though only basic user documentation. Unfortunately there is only little docs about its architecture and about customizing it for monitoring a specific webapp. Therefore I've decided to dive into its code to learn more about it to be able to decide whether it's suitable for our needs of monitoring the Pentaho BI Platform web application.
The unique features of Glassbox are that it tries to point out existing performance problems together with their likely causes and advices what to do about them and that it displays aggregated statistics about the high level "operations" in the monitored web app.
Glassbox consists of a webapp UI that displays statistics and of a set of monitors that are injected into the observed webapp and collect data about its performance. Some of its characteristics are:
- Simple installation (1.Deploy webapp; 2. Run postinstall script to copy jars etc.; 3. Restart AS with a command-line option to load glassbox aspects).
- AOP using AspectJ.
- Preferably Java 5+ though it's also possible to use AspectJ with Java 1.4 and maybe 1.3.
- Data not persistent (though this is on the roadmap).
- Layer aware (UI, logic - EJB, resources - DB, I/O, ...).
- Customized monitors for popular frameworks (Spring MVC, ...).
- UI concentrates on providing the right information at the right time with the necessary context. In other words it doesn't display tons of detailed data but a list of top-level operations highlighting the problematic (slow/failing) ones and providing detailes on likely cause(s) of the problems.
- Aspect (wiki): There're many definitions, for us this is a piece of code injected into the monitored application to perform the actual monitoring of a particular method in a particular class.
- Components and resources: Glassbox breaks time spent in an operation down by the type of activity that consumed it distinguishing between 'components' such as database access or remote calls and 'resources' such as running java code, running native code, or thread waiting (usually I/O).
- Layer: A monitored method can belong to one of application's layers of processing such as ui.controller or resource.database.statement.
- Monitored method: Glassbox collect performance statistics only for selected important methods, for instance JDBC calls.
- Operation: An "entry point" of user requests into the application. Usually this is some high-level method such as a Struts Action's execute or JSP's service. Glassbox uses operations as the unit of monitoring service level agreements (SLA) and to analyze where time goes. Anything that's used within an operation is considered a resource (EJB call, DB query...). (Though the notion of components shall be added in the future.)
- In the source code, the classes having Operation in their name are often also used to track monitored methods.
Glassbox doesn't bother to monitor all methods that are invoked as a result of user request because it is not necessary and would incur higher overhead. It only monitors some especially significant methods of the call tree such as a Struts Action invocation or invocation of a data access method (being it simple JDBC or Hibernate). I call them "monitored methods", while in Glassbox' terms they are either operations (the top-level ones) or resource/component request.
The framework is aware of and takes care of tracking the hierarchy of monitored methods invoked during a single user request processing. This basically means that a monitored method may have a parent monitored method and that Glassbox stores both the aggregated duration of the higher level method and that of the nested one. For example, a JDBC executeQuery may be invoked from an EJB call, which is itself invoked from a Struts Action processing. Thus it can tell which processing layer or component/resource caused a particular top-level operation to be slow.
Glassbox is also aware of layers and monitored methods may be marked as belonging to a particular layer, e.g. the JDBC executeQuery would belong to the layer resource.database.statement.
The main parts of Glassbox are:
Pieces of code collecting detailed performance measurements of monitored methods. Usually they're implemented as aspects extending the glassbox framework,
which are injected (woven) into the monitored webapp instance. But sometimes Glassbox uses also e.g. filters, listeners, callbacks, and timer tasks.
Handle communication between the monitors and a client. In a
clustered environment, a separate instance is on each server.
- Client: Collects monitoring data from agents and displays
an aggregated view to a user. The most used one is the Glassbox web application but you can also use e.g. jconsole or implement another client. Usually you have only one client even in a clustered environment.
Monitors are mostly aspects injected into a monitored web application or the related code. Usually a monitor observes one or few methods in a single class. When one of the method is invoked, it creates a Response object storing the details of the invocation (its id, parameters, ...) and, once it finishes, its performance data. If the method has been invoked as a part of processing of another monitored method, its Response will have the Response object of that method as its parent, in other words monitored methods can be (even indirectly) nested and this structure is preserved in the monitoring data structures. A typical example is a servlet calling an EJB doing some JDBC stuff.
Responses are stored in a ThreadLocal stack, which makes it possible to really connect those Responses belonging to the processing of a single request, and once the top-level operation completes it can be considered as finished.
There is one agent per server (i.e. JVM). It collects the detailed monitoring data from monitors and aggregates them by monitored method into "statistics" (Stats) objects. Stats contain summarized data for a monitored method over all of its execution (avg, min, max time...) together with detailed data for a limited number of those executions that were either slow or failed. There are statistics both for top-level monitored methods and for nested ones and they're aware of each other. These stats are stored in a global StatisticsRegistry (namely OperationTracker's registry).
There is also a special background job (ThreadMonitor) that monitors threads to detect CPU-intensive methods. This is done by taking snapshots of the thread stack in regular intervals and - if its execution time exceeds a predefined limit - by drawing a conclusion from these snapshots. (Note: This is an example of a timer task monitor.)
The Glassbox web user interface queries all registered agents for their data (see OperationsHelper.updateListOfOperations). The detailed and hierarchical statistics are turned into an OperationSummary for the top-level operation having also a list of "findings", i.e. textual descriptions of problems with the operation. It contains average statistics and stats when slow and when failing.
To monitor a web application, Glassbox inserts monitoring aspects into its code and related server code (e.g. its javax.servlet.Servlet implementation) during server startup using AspectJ. In its simplest form an aspect is nothing more than an XML file mapping an existing aspect class to a method(s) of a particular class.
The sequence of actions triggered by a user request towards a monitored webapp is:
- A monitored method in the target web application is going to be invoked.
- An aspect associated with that method is invoked. It's connected to the monitoring framework by extending a base monitor aspect class and optionally by calling some of the framework's methods.
- The framework stores data about the monitored method/operation's name, layer and component/resource and its start time. It also checks the thread local stack of responses to find out whether it's invoked as a part of processing some higher level monitored method and if it is the case then it sets it as its parent.
- The monitored method is
invoked and when it finishes, the framework stores its end time. If it
finished with an exception then it's first checked whether the
exception should be indeed regarded as a failure and if yes, the
exception data is stored as well. If the operation was slow or failed, the framework puts it on the list of slowest/recently failed operations and captures its parameters.
- When the original user request is processed completely, the framework aggregates data about its performance. This happens in StatsSummarizer, a ResponseListener. If the request processing was too slow with respect to predefined limits (SLA` usually <1s in 90% of time) then it's stored into the slow operations list and the layer/component that caused the delay is marked.
As said above, performance statistics are aggregated by monitored methods and are collected into a global statistics registry. The registry contains instances of PerfStats or CompositePerfStats, which also implements StatisticsRegistry to hold nested statistics, or a subclass such as OperationPerfStats. A registry is actualy a map of OperationDescriptions to their OperationPerfStats and can return a subset for a particular StatisticsType, for example UI, Database, Remote Call.
Let's see a partial example produced by OperationTracker.registry.dump(new StringBuffer(), 0):
operation(type javax.servlet.Servlet; name org.pentaho.ui.servlet.AdhocWebService):glassbox.track.api.OperationPerfStatsImpl@12c3d18 operation(type javax.servlet.Servlet; name org.pentaho.ui.servlet.AdhocWebService)(# = 0, tm =0,00 ms, #slow = 0, # fail = 0) StatisticsTypeImpl 0 of class glassbox.track.api.UIStatisticsType: StatisticsTypeImpl 1 of class glassbox.track.api.DatabaseStatisticsType: StatisticsTypeImpl 2 of class glassbox.track.api.DatabaseConnectionStatisticsType: StatisticsTypeImpl 3 of class glassbox.track.api.DatabaseStatementStatisticsType: StatisticsTypeImpl 4 of class glassbox.track.api.SimpleStatisticsType: StatisticsTypeImpl 5 of class glassbox.track.api.RemoteCallStatisticsType: StatisticsTypeImpl 6 of class glassbox.track.api.TreeStatisticsTypeImpl: time:Stats (tm=0,00 ms, slow=0) StatisticsTypeImpl 0 of class glassbox.track.api.UIStatisticsType: StatisticsTypeImpl 1 of class glassbox.track.api.DatabaseStatisticsType: StatisticsTypeImpl 2 of class glassbox.track.api.DatabaseConnectionStatisticsType: StatisticsTypeImpl 3 of class glassbox.track.api.DatabaseStatementStatisticsType: StatisticsTypeImpl 4 of class glassbox.track.api.SimpleStatisticsType: StatisticsTypeImpl 5 of class glassbox.track.api.RemoteCallStatisticsType: StatisticsTypeImpl 6 of class glassbox.track.api.TreeStatisticsTypeImpl:
An OperationPerfStats holds e.g. resourceTotalStats and otherComponentStats.
*Stats also hold all other necessary information, for instance about slow/failing cases.
OperationTracker also uses an instance of OperationAnalyzer, which is responsible for preparing data for all the nice output you can see in the UI. This includes summarizing the stats into OperationSummaries and detecting (based on the collected stats) what is the cause of a slow/failing operation and providing this info in the form of OperationAnalysis.
The web UI, aside of handling installation of Glassbox monitoring into a server, maintains connections to all the agents (or to the single local agent in a non-clustered environment) and retrieves all the needed summaries and analysis from them via its OperationHelper.
Currently the web UI only provides statistics about a top-level operation and its list of detected problems (slow SQL, excessive CPU in method XY, ...) with troubleshooting details but you cannot use it to view statistics for its nested monitored methods/components. However you can access those detailed statistics e.g. via the JMX interface.
- You can view results and manage Glassbox using JMX: jconsole
service:jmx:rmi:///jndi/rmi://localhost:7232/GlassboxTroubleshooter . Check $jboss/lib/glassbox/glassbox.properties and glassbox.war/WEB-INF/lib/agent.jar/beans.xml for settings.
- You can disable/enable some monitors at runtime via JMX, for example RemoteCallMonitor or JdbcMonitor. I'm not sure whether there is a way to dis/enable on a more granular level.
Here we will learn about the most important Glassbox classes, what they can do for you, and how they relate to each other.
Classes without an extension are regular java classes while those with .aj are AspectJ classes and need to be compiled by the aspectj compiler.
This API is used by the monitoring aspects to produce the monitoring data that is than further analyzed and presented to the user by Glassbox.
- Collects data about the system's response while processing a request. These are typically nested, i.e., we track times, parameters, etc. for Servlet requests that result in Struts action requests that result in a database query. A response belongs to a particular layer and may have a parent response (when nested). It has also a duration and a status (processing/suceeded/failed/...). Actually it can hold any context, so a monitor can store whatever relevant data is
needed (e.g., this can be useful for a custom metric that your
application wants to track).
- get/setLayer, get/setParent(Response), duration, status (ok/failed/processing..),
- This is a helper class for manipulation requests including their creation while taking care about their proper nesting and setting their start/end times. It uses a thread local stack to keep track of nested requests and System.currentTimeMillis() for timing.
- As noticed elsewhere, its used by monitors to create/finish Responses and produces the appropriate events for that and also manages a list of ResponseListeners.
The monitoring aspects extend this API and it also includes many specific monitoring aspects such as EjbCallMonitor.aj and StrutsRequestMonitor.aj.
- glassbox.monitor.OperationFactory - create OperationDescription(Impl) from JSP path or from a class name - see e.g. MvcFrameworkMonitor.aj
- isEnabled (calls RuntimeControl.aspectOf(this).isEnabled()), setEnabled, setThisThreadEnabled, ...; failureDetectionStrategy (recordException: failureDetectionStrategy.getFailureDescription(throwable)); getLayer();
- accesses & modifies responseFactory.getLastResponse() - e.g. in endNormally (-> response.complete()), endException(); begin(key, layer) -> createResponse
- Ron's note: One reason for having AbstractMonitorClass is to allow using Java-5 annotation-based aspects with the Glassbox framework, either for AspectJ extensions written in that style or for Spring annotation-based aspects.
An addition to the Response API to keep track of requests (monitored methods) etc.
- glassbox.track.api.Request - represent a specific instance of a request to something, i.e. an invocation of a monitored method. They can be compared based on elapsed time.
- glassbox.track.api.FailureDetectionStrategy - shall an exception thrown by a monitored method be regarded as its failure or not?
- glassbox.track.api. Call/Failure/Operation/SQLFailure Description - OperationDescription has a type (e.g. "HttpServlet"), a name (e.g. the servlet's name), context (e.g. the web app's context root) and perhaps a parent OperationDescription if nested.
- glassbox.track.api.SlowRequestDescriptor - describes a request whose processing was too slow; it has among others the attributes StackTraceElement slowestTraceElement and mean/slow/total counts.
- glassbox.track.api.UsageTrackingInfo - attributes eventTime, eventCpuTime, eventUserCpuTime (uses ThreadMXBean)
Used by the framework and UI to analyse the monitoring data and present an aggregated view to the user. These are mostly only value objects while the logic is in the Agent API.
- glassbox.analysis.api.TimeDecomposition - Captures mutually exclusive breakdown of overall time by component/resource. Components: dispatch (in common code above operation), other (other, undefined areas), db access, remote calls; resources: running java code, running native code, waiting (I/O...), thread contention.
- glassbox.analysis.api.OperationAnalysis - TimeDecomposition getComponentDecomposition() (db, cpu, i/o, dispatch...), TimeDecomposition getResourceDecomposition() (by thread use: runnable, blocked, waiting, etc.); getSlowThresholdMillis(); getMeanCpuTime() ...; isFailing(), isSlow();
- glassbox.analysis.api.SummaryStats - aggregated statistics for a monitored method - its accumulatedTime, count (number of hits), mean time
Collect data from monitors, summarize it, analyze problems, and provide the outputs to the Web UI.
- glassbox.client.persistence.jdbc.BackupDaemon - stores agent connections (but not any monitored data) into a database (by default an embedded hsqldb - see the myDataSource below).
- glassbox.monitor.thread.ThreadMonitor15Impl: This monitor periodically grabs thread dumps for all threads that are processing user requests. It runs in a daemon thread collecting the dumps in preset intervals. Creates instances of glassbox.monitor.thread. OperationSample when sampling a monitored thread. When a monitored thread finishes, it results perhaps in a call to ThreadSummarizer.summarize, which updates the assoc. CompositePerfStats.
- glassbox.monitor.thread.ThreadMonitorIntegration.aj: starts a ThreadMonitor after StatsSummarizer.startTopLevelStats|startNestedLaterOperation
- glassbox.summary.StatsSummarizer.aj (implements ResponseListener): ResponseFactory invokes its startedResponse/finishedResponse when appropriate; startedResponse => update ThreadStats including StatisticsRegistry, invoke startedStats which may results in starting a ThreadMonitor
- uses glassbox.track.api.StatisticsRegistry stored in a thread local variable of the type ThreadStats together with first/last operation key (OperationPerfStats)
- glassbox.analysis.OperationAnalyzer: Analyses the collected statisticts to detect problems and their causes. It uses (Default)TimeDecomposition.It also makes OperationSummaries from the stats further used by the UI.
- Note: in agent.jar the Spring config file beans.xml defines a bean operationTracker of the type glassbox.track.OperationTrackerImpl (implements OperationTracker, StatisticsRegistry) - this is used to collect all the stats in the monitored app. It's used by the bean glassboxService (glassbox.agent.control.GlassboxServiceImp).
The Spring configuration file glassbox.war/WEB-INF/applicationContext.xml defines among others the following beans:
- a backupDaemon - regarding its function see glassbox.client.persistence.jdbc.BackupDaemon above. See also glassbox.client.persistence.jdbc.PersistChanges. It's schedule is def. ibidem by scheduledTask with the default period of 10000 and it uses indirectly myDataSource defined there as well.
- Note: To override the data source used to store client configuration including remote connections to open, you can define the System property glassbox.config.ds (this is configured in the applicationContext.xml).
- glassbox.client.helper.OperationHelper - updates statistics to display by calling listOperations on the remote agents.
- glassbox.client.pojo.OperationData - monit. data collected and used by the UI It only adds source and agent identification to its nested OperationSummary, holding the actual statistic (operation count, is failing, is slow, avg time; nested OperationDescription that might have a parent too).
There're two ways of customizing Glassbox for monitoring of a particular webapp:
- Adding monitors
- Glassbox plugins mechanism for advanced customization.
You can apply an existing monitoring aspect to a new method using AspectJ weaving rules described in an aop.xml file thus turning the method into a monitored method or you can even create a new monitoring aspect extending a base Glassbox aspect, putting it - perhaps together with an aop.xml - on the monited webapp's classpath.
Very valuable and inspiring information about this are in the aforementioned presentation Glassbox Architecture & Design.
Quoting the User Guide:
You can simply extend the Glassbox definition of operations by creating a new XML file with these contents:
<aspectj> <aspects> <concrete-aspect name="ServiceProcessingMonitor" extends="glassbox.monitor.ui.TemplateOperationMonitor"> <pointcut name="methodSignatureControllerExecTarget" expression="within(com.myco.service..*)"/> </concrete-aspect> </aspects> </aspectj>
Note: If you only want to monitor another method invoked during processing of something already regarded as an operation you should rather extend the glassbox.monitor.MethodMonitor.
You can then add this file to a META-INF subdirectory of a directory on your classpath or add a jar containing the file at the location META-INF/aop.xml. For Tomcat, you might just create the directory common/classes/META-INF and install your custom aop.xml file there.
Warning: This is only my idea what is necessary to be done and may contain mistakes and false ideas.
To implement a new monitoring aspect you should do at least the following:
- Extend glassbox.monitor.AbstractMonitor
- Define the pointcuts (i.e. to what methods to apply this monitor)
- Redefine the abstract pointcut monitorEnd() to apply to the monitored method(s) so that the parent class detects when it finishes.
- Either redefine tha abstract pointcut monitorBegin(Object identifier) to allow the parent class to automatically register the beginning of the monitored operation or define your own advice (method) that is run when a custom pointcut is encountered; usually this is a before advice. The identifier should be e.g. an OperationDescription. Inside a custom advice:
- (Re)implement some methods such as getLayer().
- Create an OperationDescription for the operation, likely using the inherited operationFactory.
- Call one of the inherited begin(..) methods (see AbstractMonitorClass.aj), passing the OperationDescription as the 1st argument, i.e. as a key. This will return a glassbox.response.Response object.
- Store some context data into the generated response, using e.g. response.set(Response.PARAMETERS, <the monitored method's arguments from AspectJ's thisJoinPoint.getArgs()>).
Note: If you have it override getLayer() to return Response.RESOURCE_SERVICE and the monitored method is slow then Glassbox will report it as a slow remote call.
It's possible to extend Glassbox with application-specific extensions using the API glassbox.config.extension.api. Simply adding monitors doesn't need it: you can just deploy a monitor jar with aspects to the classpath, and an app can simply call on the response API. The PluginRegistry supports deeper extensions, like adding custom operations (see interface glassbox.config.extension.api.OperationPlugin).
Ron explains (2008-10-25): Glassbox lets you customize a variety of facets using plugins.
Operation plugins let you add an operation type that can extend how
Glassbox summarizes and analyzes the operation (to detect service level
violations) and how the UI renders these. Glassbox plugins also let you
define a connection provider so you can write custom code to define
what connections in a cluster/server farm should be opened (allowing
discovery instead of manual configuration). Most recently I've also
added a runtime controller that lets you change behavior at runtime
(e.g., requesting that a request on a specific thread be monitored).
Operation plugins really shine if you need to add in different service levels, or
want a custom display to summarize problems (e.g., Ron used this to
detect an out of date cache for one custom project).
Hopefully some documentation for creating plugins will be created soon.
Currently Glassbox monitors any web application deployed on the app. server including itself and also some common code like JDBC drivers (which may be invoked not only from a web application but also by the server's daemons etc.). If you don't want to monitor all of that, for instance to decrease the overhead and to make the outputs easier to read, there are few things you can do.
Glassbox yet doesn't support any filtering but you can:
- You can avoid instrumentation of specific
applications on the server by deploying a META-INF/aop.xml file in
their classpath that disables weaving into any classes (although that
would still track calls to common classes like JDBC drivers and app
<aspectj> <weaver> <exclude within="*"/> </weaver> </aspectj>
- Instead of load-time weaving (LTW) you can perform offline weaving of Glassbox aspects only into the code of the web applications that interest you and perhaps into some common server code that you also need to be monitored, for instance a JDBC driver. An additional benefit is that you'll get rid of the longer class loading at server startup.
According to the aforementioned presentation of Glassbox design & architecture it can consume roughly 20% more memory than without Glassbox. Ron further explains:
20% is a rough guideline - it varies quite a bit in specific cases. The biggest area where Glassbox adds overhead is indeed from AspectJ load-time weaving, most specifically the memory overhead from handling JSP's - the load-time weaving system uses a little more than 1 megabyte of memory for each loader, and each JSP gets its own loader. The AspectJ project has been working on this area - see Reducing weaver memory usage over time. We need to merge in the updated work on AspectJ to the version of AspectJ we're using with Glassbox, which also reduces memory overhead a lot. You could try the patch in that bug report with AspectJ 1.6.1's aspectjweaver.jar to see if it has better memory performance.
Another approach that might be simpler and yet could help a lot is a hybrid one: if you just precompile your application's JSPs you will see far lower overhead.
Glassbox does work best if it is weaving some server classes, but in many cases you can get the desired visibility if you do offline weaving of your application and a few key libraries, like your JDBC driver and e.g., web services callers. If you want to try that, I'd be glad to help.
There is one other area where Glassbox can consume significant memory: it records statistics based on the structure of components and resources and how calls are nested. For a fairly static application this is normally constrained and quite small, but some applications generate names/queries/etc. dynamically and Glassbox can build increasingly large trees of statistics, which consume memory also. We definitely want to address this area - I'm leaning towards not recording details for quick operations, and only recording information for things that run often and are taking noticeable time. We'd like to know about cases where this happens so we can test better approaches.
I've implemented persistence for the detailed monitoring data of Glassbox. Unfotunately Glassbox generates too many uninteresting entries and for the methods of interest it doesn't provide enough data, you can read more about this in the glassbox forums linked to above. I had unfortunately no time to try to deal with these problems.
If you're anyway interested in the DB persistence, you can try it - download GlassboxDbPersister.zip and read the contained README.txt.
A nice article about Performance Monitoring using Glassbox (03-03-2009) with many screenshots and some code samples.