Monitoring a real-world production app that we worked on

Valentin Mohish

Val is a skilled backend engineer who is capable of carrying an entire initiative on his shoulders end-to-end if needed. He is never afraid to roll up his sleeves and dive deep into a problem. Whether we are talking about hunting down a pesky bug, designing an integration with a 3rd party API provider or building your own API, the "I can do it" attitude is something we can all learn from Val 💪

To clearly observe a system's operation and execution, we use a monitoring and observability approach that generates data we can later use for debugging. It also helps us monitor app performance from the users’ perspective.

The approach relies on three major components:

Metrics - used to collect quantitative data that helps to evaluate the app performance before it goes to production.
Tracing - used to follow individual requests through a complex application with a multi-layer architecture. Tracing provides a view of how the app operates from the user’s perspective. It can be used to record various system requests as well as the time the application needs to complete them.
Logging - used to bring together data from the app running in the production environment to a single place where all team members can research and analyze it.
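
To make the tracing idea more concrete, here is a minimal sketch that times units of work and records them as spans under one trace id. The `Tracer` class, the `Span` shape and the injectable clock are hypothetical names invented for this example; they are not the API of any tool mentioned in this article.

```typescript
// Hypothetical span record: one timed unit of work inside a trace.
interface Span {
  name: string;
  traceId: string;
  durationMs: number;
  ok: boolean; // false when the wrapped work threw
}

class Tracer {
  readonly spans: Span[] = [];
  private readonly traceId: string;
  private readonly now: () => number;

  // The clock is injectable so durations can be made deterministic in tests.
  constructor(traceId: string, now: () => number = Date.now) {
    this.traceId = traceId;
    this.now = now;
  }

  // Runs `work`, records how long it took, and rethrows any error.
  span<T>(name: string, work: () => T): T {
    const start = this.now();
    let ok = false;
    try {
      const result = work();
      ok = true;
      return result;
    } finally {
      // Inner spans finish first, so they appear before their parents.
      this.spans.push({
        name,
        traceId: this.traceId,
        durationMs: this.now() - start,
        ok,
      });
    }
  }
}
```

Wrapping a resolver or an outgoing API call in `tracer.span("name", () => ...)` would then produce one record per step, which is roughly the shape of data an APM tool collects for you.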

Context

We need to constantly monitor the health and performance of the product that our customers are using.

Monitoring is a key practice for keeping engineering, ops and product groups aligned on the health and performance of the app.

Two key pillars help us gain visibility into the health of a system:

Monitoring. The system exposes a set of predefined parameters (those we consider important) for viewing and alerting purposes. The metrics we track are:
- Request count (rate), error rate and duration, RED for short - used for monitoring the “request-response” cycle
- Memory, CPU and I/O utilization - following Brendan Gregg's USE method (Utilization, Saturation, Errors)
Observability. The system exposes information that can later be used for debugging purposes. However, we may not know upfront what information we’ll need when debugging an issue.
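
As an illustration of the RED metrics above, the following sketch keeps request durations in memory and derives request count, error rate and a 95th percentile duration. `RedRecorder` and `RedSnapshot` are hypothetical names, and the choice of p95 with the nearest-rank method is an assumption made for the example; in practice an APM tool computes these for you.

```typescript
// Hypothetical in-memory RED recorder; not taken from any library.
interface RedSnapshot {
  requestCount: number;
  errorRate: number;     // fraction of requests that failed, 0..1
  p95DurationMs: number; // 95th percentile latency (nearest-rank)
}

class RedRecorder {
  private durationsMs: number[] = [];
  private errors = 0;

  // Record one finished request.
  record(durationMs: number, isError: boolean): void {
    this.durationsMs.push(durationMs);
    if (isError) this.errors += 1;
  }

  snapshot(): RedSnapshot {
    const count = this.durationsMs.length;
    if (count === 0) {
      return { requestCount: 0, errorRate: 0, p95DurationMs: 0 };
    }
    const sorted = [...this.durationsMs].sort((a, b) => a - b);
    // Nearest-rank percentile: the smallest sample with at least
    // 95% of all samples at or below it.
    const rank = Math.ceil(0.95 * count);
    return {
      requestCount: count,
      errorRate: this.errors / count,
      p95DurationMs: sorted[rank - 1],
    };
  }
}
```

A snapshot taken per time window (for example, every minute) is what a dashboard or an alert rule would typically consume.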

The following techniques will help us perform forensic analysis:

Logs - log important information about activities within a request. It's extremely important to log just enough information; a good test for each entry is the question "Will this help me debug an issue in the future?"
Traces - use a "log correlation id" for actions that happen within a single “trace”. A trace can span multiple systems, such as frontend, backend and microservices.
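
The correlation id idea can be sketched as a small logger that stamps every entry with the id of the current request. Everything here is an assumption for illustration: the `createRequestLogger` and `resolveCorrelationId` helpers, the `LogEntry` shape and the `x-correlation-id` header name are hypothetical and do not come from Splunk or any other tool.

```typescript
// Hypothetical structured log entry.
interface LogEntry {
  timestamp: string;
  level: "info" | "error";
  correlationId: string;
  message: string;
  fields: Record<string, unknown>;
}

type Sink = (line: string) => void;

// Creates a logger bound to one correlation id, so every entry written
// while handling a request can be found with a single search.
function createRequestLogger(correlationId: string, sink: Sink = console.log) {
  const write = (
    level: LogEntry["level"],
    message: string,
    fields: Record<string, unknown> = {},
  ): void => {
    const entry: LogEntry = {
      timestamp: new Date().toISOString(),
      level,
      correlationId,
      message,
      fields,
    };
    sink(JSON.stringify(entry));
  };
  return {
    correlationId,
    info: (message: string, fields?: Record<string, unknown>) => write("info", message, fields),
    error: (message: string, fields?: Record<string, unknown>) => write("error", message, fields),
  };
}

// Reuse the id sent by an upstream system (e.g. the frontend) or start a
// new one. The header name is an assumption, and the fallback id is only
// good enough for a demo.
function resolveCorrelationId(headers: Record<string, string | undefined>): string {
  const incoming = headers["x-correlation-id"];
  if (incoming && incoming.trim() !== "") return incoming;
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}
```

Forwarding the same id on outgoing calls to other services is what lets one search in the logging tool return the whole path of a request.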

What components do we want to monitor?

Our goal is to understand what’s going on with a customer while they’re using the app. Did they encounter an error? Did everything work as expected? Was the app responsive?

To achieve this level of visibility, we need to monitor all the components involved in a specific user action: frontend, backend, any internal service and, most importantly, integration points around 3rd party APIs.

Tools

When it comes to tools, there's a vast list of options to help you solve a specific problem. Here I will cover a few tools we have experience with; there are plenty more in each category.

Rollbar

Used ONLY for error tracking.

It was chosen because of:

Excellent grouping of common errors under a single error (with the ability to drill down into individual occurrences)
Clear stack traces for React / Node.js apps
Ability to link into internal tools (like Splunk or Grafana)
Deployment notifications
Deployment tracking

Raygun

Used ONLY for Real User Monitoring (RUM) of the frontend app
Provides real-time performance information from the device (browser). These metrics are called Core Web Vitals
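
For readers unfamiliar with Core Web Vitals, the sketch below classifies a measured value using the thresholds Google publishes for LCP, INP and CLS. The `rateVital` function and its types are hypothetical names for this example; this is not Raygun's API, and a RUM tool performs this classification for you.

```typescript
type Rating = "good" | "needs-improvement" | "poor";
type VitalName = "LCP" | "INP" | "CLS";

// [upper bound for "good", upper bound for "needs-improvement"].
// LCP and INP are in milliseconds; CLS is a unitless score.
const THRESHOLDS: Record<VitalName, [number, number]> = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
};

// Maps one measured value to its rating bucket.
function rateVital(name: VitalName, value: number): Rating {
  const [good, poor] = THRESHOLDS[name];
  if (value <= good) return "good";
  if (value <= poor) return "needs-improvement";
  return "poor";
}
```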

Splunk

Logging tool we all know about and use
Used for ad hoc debugging and forensic analysis
Ability to configure dashboards to present the most interesting information for your use case

AppSignal

So far the best APM we have found for Node.js apps
Excellent integration with Apollo - our GraphQL + React stack
Ability to see the GraphQL request itself and configure custom spans for traces
Correlation of events to answer “what was happening in our system during an incident”
All the metrics we care about are readily available

FAQ

Q: Is it possible to converge on a single tool in the future, perhaps Splunk?
A: Yes and no. Yes, because technically it’s possible to track MOST of the metrics in Splunk and build corresponding dashboards / queries to give a similar level of visibility. No, because it would require substantial effort, time and SKILL to build these dashboards. So when deciding between Build vs Buy, we lean towards Buy and only build what we can’t buy.

Q: Are we using any form of alerts from these tools?
A: Yes. Rollbar sends deploy notifications to a dedicated Slack channel. It also sends messages about production errors to dedicated Slack channels.