Monitoring your system: or how I stopped worrying and learned to love the dumpster fire

Lovisa
FreshTracks.io
Published in
11 min readJan 18, 2018

--

Bugs happen. Preferably you find them before your users do. No matter how well tested your code may be, accounting for every use case and app state is almost impossible. Your defect backlog will eventually fill up with user reported bugs, recorded console errors, or, in the worst cases, bugs that cause an outage.

Not all bugs or incidents are created equal. Some might be obvious and easy to fix; you might be familiar with that specific part of the codebase or have fixed a similar problem somewhere else. Other times you might get a defect that is just a browser console error and not consistently reproducible. Usually the root cause of a defect is not immediately obvious. In these instances, your logs are ultimately the best tool you have at your disposal. The more information you have about what happened right before the event the quicker you can resolve the issue.

Instrumenting your application with robust monitoring is an upfront cost you have to pay in order to be able to scale and iterate on your app. Good monitoring can help you identify bottlenecks, where to invest in performance, and get smarter about how your users interact with your app. Getting good information from your system helps you understand it better, which makes the job of maintaining and building on top of it easier. But, understanding the types of monitoring that’s available and what benefits they bring may not be immediately obvious, let alone choosing what might make sense for your system.

Problems that arise with modern software architecture

UI client have never looked or felt better, web software development has advanced to the point that you can create incredibly elaborate user interfaces and seamless user experiences. However, system and application complexity is at an all time high.

At the same time as our frameworks and developer tools are robust, they are also precarious, and a single `“cannot read property id of undefined”` error can break an entire page. Web apps are becoming increasingly involved and there are a lot of factors that might negatively impact their performance. If you don’t have a full picture of what events are actually taking place when a page loads or a user submits a form, you will have a difficult time deciding where to focus your efforts for the highest return on investment.

The advances in web and infrastructure technology also brought with them new challenges when it came to keeping your application snappy and healthy. Because of the proliferation of microservices, it’s now become more difficult to trace down where the root cause of your bug might be. Even though it’s crystal clear what’s going on in any individual service at any point in time, the answer to the question of how far down in the stack an issue originally originated can most often only be solved by tedious debugging.

Infrastructures are becoming more ephemeral. In contrast to on-premises applications where the same app instance might be run for years and years on end on the same server, now app instances might run anywhere from a few minutes to days in short lived containers. Large containerized microservice clusters are managed and run by container orchestration tools which will automatically take unhealthy containers offline and spin up new ones as needed to match the desired state of your system. Abstraction always brings a level of obscurity, and it makes understanding the inner workings of your distributed, automated, and ephemeral system much harder.

Types of Monitoring

Your monitoring system will evolve along with the rest of your system, and ideally it will serve multiple purposes, not just letting you know when your app is about to catch fire. Monitoring is obviously more than just logs. Depending on what kinds of services you have running and what your business needs are, you’ll focus different amount of effort or money on different monitoring types or solutions.

Structured Logs

Log files and stack traces are incredibly useful, but have limitations since it’s just raw text. Discovering a collection of log files with either `cmd+f`or `grep` being your sharpest tool unfortunately only gets you so far. A collection of strings might be enough when those strings are meant for your eyes only and when you are running a single service on a single host. As your system grows, looking through additional log files for each service or component you’re adding is not tenable or scalable.

In the unstructured logs below, there are four log messages and one reference error.

Endpoint profile/:id was hit with id:2 1515771504646
Endpoint profile/:id was hit with id:5 1515771507262
Endpoint profile/:id/images was hit with id:5 1515771512403
ReferenceError: response is not defined
at /Users/lovisa/Code/js-tracing/server.js:29:3
at Layer.handle [as handle_request] (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/layer.js:95:5)
at next (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/route.js:137:13)
at Route.dispatch (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/route.js:112:3)
at Layer.handle [as handle_request] (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/layer.js:95:5)
at /Users/lovisa/Code/js-tracing/node_modules/express/lib/router/index.js:281:22
at param (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/index.js:354:14)
at param (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/index.js:365:14)
at Function.process_params (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/index.js:410:3)
at next (/Users/lovisa/Code/js-tracing/node_modules/express/lib/router/index.js:275:10)
Endpoint profile/:id was hit with id:5 1515771527896

Server Side Logging

Let’s look at an example of what structured logging might look like in an app. Our app is a React client powered by an Express server which renders trending gifs when you click a button called Jiphy. I know, very disruptive. There are many good open source logging libraries, and for this example we’ll be using Bunyan.

npm install --save bunyan

With the following code, a new logger will be instantiated and a message will be logged to the server console.

const bunyan = require('bunyan');
const logger = bunyan.createLogger({ name: 'server' });
logger.info('Instantiating the logger.');
---{"name":"server","hostname":"Lovisas-MBP","pid":31218,"level":30,"msg":"Instantiating the logger.","time":"2018-01-11T04:08:08.586Z","v":0}

The log message is in JSON format, so it’s really easy to parse messages and query against them. At this point, you can write your own ETL service (`Extract`, `Transform`, and `Load`) for processing and moving your logs to a data warehouse. Step one is to extract the information from your service or application, then transform the data to fit the ideal format, and finally load it to a data warehouse. A data warehouse is similar to a database, however, it is optimized to perform analytics on its data, and not on performing transactions.

Of course we’re not satisfied with just knowing when our logger has been instantiated. The code below is an example of emitting log messages when an endpoint is hit. Logging is not only for for errors, and it’s always a chain of successful events which result in an unsuccessful one. If we narrate what’s happening in our application, it’s so much easier to retrospectively understand the events that lead up to an error.

app.get('/profile/:id', function(request, response) {
logger.info({ endpoint: 'test', profile: request.params.id });
response.send('profile' + request.params.id)
});
---{"name":"server","hostname":"Lovisas-MBP","pid":31482,"level":30,"endpoint":"test","profile":"1","msg":"","time":"2018-01-11T04:17:41.550Z","v":0}

We stick the entire error object in a log message to preserve as much information as we can, and it’s also possible to add additional key-value pairs as you see necessary.

app.use(function(error, request, response) {
logger.info(error, { requestUrl: request.req.url });
});
---{"name":"server","hostname":"Lovisas-MBP","pid":31976……..,"msg":"{ requestUrl: '/lol' }","time":"2018-01-11T04:28:57.286Z","v":0}

So far the logger in this example is writing to the console. If we lose the shell that the app ran in, our logs are forever lost. In order to actually use our logs, we add a write stream to a `server.log` file. All log messages recorded by `logger` will be written to a file in addition to being logged in the console.

const logger =  bunyan.createLogger({
name: 'server',
streams: [{
path: './server.log'
}]
});

With logs being written to a file, we have the first part of our ETL service complete: the information is being extracted to disk. To achieve the transform and load portion of ETL, you can write a small program to pipe your logs through and either add additional key-value pairs, or remove key-value pairs from the logs that you won’t need. If you have several similar events, you can group them together based on what makes sense for your system. There are multiple options for loading your logs to a data warehouse, an easy way would be to set up a cron job that runs on a regular basis which takes the contents of your log file and uploads them to Amazon S3 for example.

Client Side Logging

There are of course things worth logging in our client side applications as well. With complex Redux states and React component trees it can be difficult to understand the durations and behaviors of individual component renders. Client side logging tools can capture errors and the context in which they occurred, as well as capturing user interactions. Below is an example of how to capture errors using Rollbar, but there are several other solutions out there.

fetch('/profile')
.then(data, err)
.catch(() => {
Rollbar.error('Error fetching user profile', { extra: 'data' })
})

There are more sophisticated (and some very expensive) hosted tools like splunk or honeycomb.io, but using a simple tool like Bunyan to add structured logging to your application is a great first step, even Charity Majors agrees.

Application Performance Monitoring

Any latency or downtime for an application can have serious business impact. Once you find yourself having a big problem with performance, it can be hard to know where to best focus your time and solving it might mean a major refactor of your app or a restructure to your infrastructure. Application Performance Monitoring (APM) tools lend themselves well to diagnosing and monitoring performance of web applications as they give a very detailed view of how your app is used and how it performs for end users.

APM tools monitor both top-down (from the user’s perspective and down the stack), and bottom-up (from the server and up the stack), however, APM tools primarily try to answer questions about the end user experience. There are two types of end user monitoring: passive and active.

Passive Monitoring

Passive monitoring is recording actual events in your app by capturing network traffic. It serves as a historical record. It’s also called Real User Monitoring (RUM) as it’s capturing behaviors and interactions of real people. This type of monitoring is usually implemented using a network packet sniffer and therefore does not have any impact on the actual application itself. Passive monitoring is useful when trying to fix an outage and understand the root cause first after the incident has occurred.

The historical record that’s created through passive monitoring can help inform your product, sales, and user experience/design teams by studying user behavior in your application. You can trace page views and page clicks on pages to help understand the value of your application, or you can get a better idea of how your users move in your app by grouping similar event chains.

Active monitoring

Active monitoring on the other hand gives you a chance to answer questions about your app without risking the happiness of your users. Instead of only learning from real user interaction with your app, you write scripts to emulate traffic using a tool like selenium. With selenium you can write a script which visits your page in the browser, logs in, and for example puts some items in a cart, and then goes to check out.

Now that we have automated this task, you can easily load test your application in a prod-like environment without putting your production environment in danger. Critical and/or common code paths in your application can be scripted as well and run asynchronously as part of your ci pipeline to enable tracking performance of your feature over time.

A New Relic APM dashboard.

Achieving Micro Service Clairvoyance

We are armed with many sharp tools for extracting information about our system and, for many, this is enough. However, as we’re moving towards building distributed systems — microservices instead of monolith applications — achieving full visibility gets difficult. Individual microservices work together to serve the needs of your users. Storing your logs in a central data warehouse is the first step so that you can view logs across multiple services at the same time.

But, just relying on timestamps in order to understand a sequence of events is not sufficient. If some of your users are not able to log in, you want to know exactly which chains of events pertains to them. We need a way to connect events across our app.

In 2010 Google published a whitepaper on Dapper, which is a protocol for large-scale, distributed system tracing infrastructure. It has a long and fancy name, but the concept itself is pretty straightforward. Every event that is captured in your app gets described in a span. The span is similar to a structured log message, in that we configure what information to add to it, but every span follows a standard format.

Every top level action in your app(for example: a user loading a page or clicking a button requesting to see gifs) gets assigned a `traceID`. When the user loads a page, there are other events occurring because of the user loading a page: we make requests to the server asking for user data and assets, the response gets parsed and put on your application state, the component starts rendering on the page. All these subsequent events keep a reference to their “parent span” by having a key `parentTraceID` with the same value as the parent’s `traceID`. Now we can take a parent span, and find all the traces with the same `parentTraceID` and see all the details around a specific event.

Let’s take a look at an example: a user is submitting a form from a page in your app. The parent span might look something like this:

{ “timestamp”: 123456789, “traceID”: 987, “event”: { “type”: “Submit form” } }

The spans for the following events — request for user data, the data parsing, and the rendering of the page — will all hold a reference to the parent span.

{ “timestamp”: 123456792, “traceId”: 654, “parentTraceId”: 987, “event”: { “type”: “Request”, “endpoint”: “/profile”, “uid”: 1  } }
{ “timestamp”: 123456799, “traceId”: 656, “parentTraceId”: 987, “event”: { “type”: “Response”, “status”: “200” } }
{ “timestamp”: 123456825, “traceId”: 321, “parentTraceId”: 987, “event”: { “type”: “Render profile” } }

Once you are capturing this information in your app and are sending them to a centralized storage, it’s really useful to visualize the traces. Below is an example of what rendering traces might look like from the OpenTracing docs, a vendor neutral open standard recently incubated by the Cloud Native Computing Foundation.

An example of trace visualization from the OpenTracing docs.

Conclusion

There are many types of monitoring, and just a few of them discussed in this blog post. When building out your monitoring infrastructure it’s important to not only think about monitoring as a reactive strategy, but also as a proactive one. Gaining more insight into your system will allow you to not only understand your users better, but also more quickly understand what performance optimizations will give you the most gains.

This blog was originally published on the blog JavaScript January.

FreshTracks.io is a monitoring tool that layers machine learning on top of Kubernetes and Prometheus. To learn more, join our beta today!

--

--