Why Monitoring Is as Mandatory as Development

Think about it: in this fast-paced world we barely get the chance to read our emails properly, and we usually just skim them unless the subject looks important, right? So it becomes imperative for any system we build to offer easy, fast ways to debug and analyze problems.

Why, you ask? Well, to get out of jail quickly and without disturbing what's already running.

Now, what do I mean by that? Let's take the example of a monolithic application with thousands of lines of code (LOC) and several rules in place, already deployed and scaled to the Nth degree. The only thing missing from it is logs.

An example snippet from that codebase:

public void takeMeToTheHeaven(Person person)
{
    try {
         // die you person already
    } catch(TakingToHeavenVehicleException youGotLucky) {
        return;
    }
    return;
}

Exactly, that's how you'll feel. It's nearly impossible to figure out which event got triggered and which did not, and obviously we can't rerun that event. There's nothing concrete to go on here :/

So what helps? Here, just placing a few log statements before and after the function executes, each carrying a unique tracer, would let us debug easily and close the issue quickly. Something like this:

public void takeMeToTheHeaven(Person person)
{
    // Here the person's id serves as the tracer that ties these log lines together.
    log.info("takeMeToTheHeaven initiated for {}", person.getId());
    try {
        log.info("Person's name is {}", person.getName());
        // die you person already
        log.info("Person's age is {}", person.getAge());
    } catch (TakingToHeavenVehicleException youGotLucky) {
        // The exception goes last so the logger prints its stack trace.
        log.error("I missed my vehicle goddamnit, person {}", person.getId(), youGotLucky);
        return;
    }
    log.info("takeMeToTheHeaven successfully done, now cry {}", person.getId());
}
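When the object being processed has no natural id, the "unique tracer" idea can be sketched with a generated one. This is only a minimal sketch, assuming plain java.util.logging instead of whatever logger the snippet above uses; TracedCall, format, newTraceId and run are hypothetical names made up for illustration.

```java
import java.util.UUID;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: tags every log line of one call with the same trace id.
final class TracedCall {
    private static final Logger LOG = Logger.getLogger(TracedCall.class.getName());

    private TracedCall() {}

    // Builds the log line; kept separate so the format is easy to check.
    static String format(String traceId, String message) {
        return "[trace=" + traceId + "] " + message;
    }

    // Generates a fresh random id for one call.
    static String newTraceId() {
        return UUID.randomUUID().toString();
    }

    // Runs the action and logs start, success, or failure under one trace id.
    static <T> T run(String operation, Supplier<T> action) {
        String traceId = newTraceId();
        LOG.info(format(traceId, operation + " initiated"));
        try {
            T result = action.get();
            LOG.info(format(traceId, operation + " completed"));
            return result;
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, format(traceId, operation + " failed"), e);
            throw e; // rethrow so the caller still sees the failure
        }
    }
}
```

A call such as TracedCall.run("takeMeToTheHeaven", () -> doTheWork(person)) would then produce log lines that can be grouped by searching for a single trace id (doTheWork is, again, just a placeholder).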

"But why should I worry? I'll take care of all this during development and E2E performance testing, so this would never happen." Right, the whole point is for this never to occur in the first place. But under heavy load things go south pretty fast, and the cause has to be well documented and observed. Hence, monitoring.

That was just one application, and deducing the cause from there would be much simpler. But nowadays, with containerization and the single-responsibility concept, monolithic architectures are few and far between, having been replaced by microservice-driven architectures.

So let's suppose we are building a cloud-native project made up of a dozen microservices, each running at scale. In this scenario, each instance of each microservice is up 24/7 and serving our clients. As we scale, over time we might miss certain things that will eventually toss our setup into red-alert mode. Little details like high CPU utilization, OOM (out-of-memory) occurrences, unreachable services, or traffic so heavy that requests can't be served: all of these we knew could happen, but we didn't have a watcher for them. So we never realize when they occurred or how.
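To make the idea of a "watcher" a bit more concrete, here is a rough sketch of one that looks at heap usage and CPU load from inside a single JVM instance. It uses only the standard Runtime and OperatingSystemMXBean APIs; the class name ResourceWatcher and both thresholds are assumptions picked for illustration, not recommended values.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical in-process watcher for two of the "little details" above.
final class ResourceWatcher {
    // Assumed thresholds, for illustration only.
    static final double HEAP_ALERT_RATIO = 0.90;
    static final double CPU_ALERT_PER_CORE = 1.0;

    private ResourceWatcher() {}

    // True when used heap is at or above the given share of the maximum heap.
    static boolean heapUnderPressure(long usedBytes, long maxBytes, double alertRatio) {
        if (maxBytes <= 0) {
            return false;
        }
        return (double) usedBytes / maxBytes >= alertRatio;
    }

    // True when the load average per core reaches the limit.
    // A negative load average means the platform does not report one.
    static boolean cpuOverloaded(double loadAverage, int cores, double perCoreLimit) {
        if (loadAverage < 0 || cores <= 0) {
            return false;
        }
        return loadAverage / cores >= perCoreLimit;
    }

    // Reads current values from the JVM and summarizes them in one line.
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long max = rt.maxMemory();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        int cores = os.getAvailableProcessors();
        return "heapUsed=" + used + " heapMax=" + max
            + " load=" + load + " cores=" + cores
            + " heapAlert=" + heapUnderPressure(used, max, HEAP_ALERT_RATIO)
            + " cpuAlert=" + cpuOverloaded(load, cores, CPU_ALERT_PER_CORE);
    }
}
```

In a real setup this kind of check usually lives in a dedicated monitoring agent rather than in application code, but the principle is the same: sample, compare against a threshold, and record the result somewhere a human will see it.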

Here comes monitoring, tada! This component takes over the responsibility of identifying issues and showing us how our system is being used.

Monitoring is the set of tools and techniques used to keep an eye on how the system is doing and to keep it functional.

If we break down the above statement, it tells us that there's some tool that's always watching. Creepy, right? Not really. You did your first task of developing your applications and deploying them at scale; now you also have to make sure they don't succumb to high traffic and continuous usage, right? After all, it's your child, and you don't want to leave it hanging by a thread until it falls apart and the blame game starts! So to keep it functional, we need a set of tools that can, when needed, alert us to when and where a problem arose.
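The "alert us when needed" part can be sketched as a simple rule: fire once more than a set number of errors land inside a time window. This is a toy illustration under stated assumptions, not how any particular monitoring tool implements alerting; ErrorRateAlert and its parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical alert rule: fire when too many errors occur within a time window.
final class ErrorRateAlert {
    private final int maxErrors;
    private final long windowMillis;
    private final Deque<Long> errorTimes = new ArrayDeque<>();

    ErrorRateAlert(int maxErrors, long windowMillis) {
        this.maxErrors = maxErrors;
        this.windowMillis = windowMillis;
    }

    // Records an error at the given timestamp.
    // Returns true when the number of errors in the window exceeds maxErrors.
    synchronized boolean recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
        // Drop errors that have fallen out of the window.
        while (!errorTimes.isEmpty() && nowMillis - errorTimes.peekFirst() > windowMillis) {
            errorTimes.removeFirst();
        }
        return errorTimes.size() > maxErrors;
    }
}
```

With new ErrorRateAlert(2, 1000), the third error inside one second returns true, while a lone error several seconds later returns false because the older ones have aged out of the window.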

Why do we monitor?

So, to sum up why we monitor in three points:

Know when things go wrong.
Be able to debug and gain insight.
Track trends to see how things change over time.

Parameters to look at for monitoring?

We talked about logging as one of the key parameters that help with monitoring. Now let's see the others:

Logs – Great depth, at the cost of breadth
Metrics – Great breadth, at the cost of depth
Tracing events – Essential to debug specific events
Traffic flow – Trend to identify the flow of requests over a time range

As seen above, logging is great when we want to drill down into a specific event, but across long time ranges and many events it is hard to see the bigger picture from logs alone. That's where metric analysis comes in. Here you'd be able to identify when usage was high, moderate, or low, and then scale resources up or down for those applications accordingly.
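As a small sketch of what such a metric could look like, the class below counts requests per time window and buckets each window as low, moderate, or high usage. RequestMetrics, Usage and the threshold parameters are hypothetical names, and the thresholds themselves would have to come from your own traffic patterns.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical request counter that classifies usage per time window.
final class RequestMetrics {
    enum Usage { LOW, MODERATE, HIGH }

    private final LongAdder requests = new LongAdder();

    // Called once per incoming request; cheap under concurrent access.
    void recordRequest() {
        requests.increment();
    }

    // Returns the count for the window that just ended and resets the counter.
    long drainWindow() {
        return requests.sumThenReset();
    }

    // Buckets a window's request count using caller-supplied thresholds.
    static Usage classify(long requestsInWindow, long moderateFrom, long highFrom) {
        if (requestsInWindow >= highFrom) {
            return Usage.HIGH;
        }
        if (requestsInWindow >= moderateFrom) {
            return Usage.MODERATE;
        }
        return Usage.LOW;
    }
}
```

A scheduler would call drainWindow() at a fixed interval (say, once a minute) and feed the result into classify; a run of HIGH windows is a hint to allocate more resources, and a run of LOW windows a hint to give some back.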

Tools available for Monitoring

Splunk
AppDynamics or OpenTelemetry
Prometheus and Grafana

So with these tools in place, and alerting enabled for the custom events you never want to see, your system would be in safe hands. Give it a thought.

PS: This is my first attempt at blogging under a short timespan. Hopefully, I'll get more into it :)