Logging

Published Date: 2023-09-05T00:48:06Z | Last Modified Date: 2023-09-06T18:53:28Z

Logging is a software engineering skill most don't encounter until landing their first job. It's a skill that's seemingly simple, but does require some thought. In this post, we'll go over the basics of logging and how to use it effectively.

Basics

At its core, logging is a way to emit information about a program's execution.

Targets

Targets are where the logs end up.

STDOUT (Terminal/Console)
File (log.txt, output.json)
Database (RDBMS Table)
Remote Service (AWS CloudWatch)

Levels

Levels are the severity of the log. Loggers typically are set to a specific logging level.

DEBUG
INFO
WARNING
ERROR
FATAL

Loggers will emit all logs at the current level or higher. For example, if a logger is set to INFO, it will emit INFO, WARNING, ERROR, and FATAL logs.

Log Formatting

The typical format of a log statement is similar to:

[       TIMESTAMP      ] [LEVEL] [  MESSAGE  ]
2023-05-05T12:00:00.000Z  INFO    Hello World!

The format of the timestamp in the above example is ISO 8601. This is a common format for timestamps because it's clear and easy to parse. The time zone UTC.

Sometimes, the format includes the thread ID. This is useful for multi-threaded programs. Without it, you may not be able to tell which log entries are from which thread.

In http servers, a random UUID is created for each request. This is useful for tracing a requests. Sometimes this is included in the log format.

So the format is usually customizable and can include whatever information you want. But typically its limited to just enough information to track the flow of execution for a process, request, event handler, etc.

Logging Example

The below is an example of a logging in Java.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {

private static final Logger logger = LoggerFactory.getLogger(Main.class);

public static void main(String[] args) {
    logger.info("Hello World!"); // For example, prints [2023-05-05T12:00:00.000Z] [INFO] Hello World!
    }
}

Every language has its own logging frameworks. So perform your due diligence and select one of the more popular ones.

What to log

Now that we have this tool, we need to figure out how to use it effectively. Let's look at the following Rust code. It's a simple program that handles a request to a web server. It then calls into another service, emits a metric, and returns the result to the client. Now, we're just using this as an example, so don't get lost in the details. Just accept that we have these functions and they do what they say they do. (And yes, this example is horribly contrived, but it's just an example.)

fn handle_request(request: Request) -> Response {

    const user_id = request.body.user_id;
    const action = request.body.action;

    let response: Response;

    match action {
        "get_user" => {
            let result = call_service_get_user(user_id);
            match result => {
                Ok(user) => {
                    emit_get_user_metric(user_id);
                    response.body.success = true;
                    response.body.result = user;
                },
                Err(error) => {
                    response.status = 500;
                }
            }
        },
        "delete_user" => {
            let result = call_service_delete_user(user_id);
            match result => {
                Ok() => {
                    emit_delete_user_metric(user_id);
                    response.body.success = true;
                },
                Err(error) => {
                    response.status = 500;
                }
            }
        },
        _ => {}
    }

    return response;
}

Now imagine this, you're on call and you get paged at 3am. You're barely awake and you're trying to figure out what's going on. You look at the alarm ticket description:

Big Importance Service - HTTP ERROR 5XX Responses Above Threshold

Not particularly helpful, but you have a starting point. You look at the logs and see... nothing. There's a couple of error messages, but its not clear what's going on. What user requests are causing the errors? What functions are being called? You have no clue. At this point, just hand in your badge. You did yourself dirty by not writing proper logging. The alarm is escalating up the management chain. To your manager, their manager, and their manager. Your heart is sinking, your stomach is churning. Finally, you get a cold call from your CEO. He's detailing how you're single handedly costing the company millions of dollars. The internal wiki is already being writting about your monumental failure. Your name is forever etched in the company's history as the "that one person that didn't write logs". Word gets around quick. You're struggling to find another job. You move to another state, change your name, hair color, and get plastic surgery. However, in every interview, they can still see the existential dread in your eyes. That's forever.

So yea, let's not be that person. Instead, let's take a look at what we think is important. We probably want to know the entire incoming request, but only if we're debugging. So let's add a DEBUG log. While we're at it, we can also add a log for the outgoing response.

fn handle_request(request: Request) -> Response {

    debug!("Incoming request: {}", request);

    const user_id = request.body.user_id;
    const action = request.body.action;

    let response: Response;

    match action {
        "get_user" => {
            let result = call_service_get_user(user_id);
            match result => {
                Ok(user) => {
                    emit_get_user_metric(user_id);
                    response.body.success = true;
                    response.body.result = user;
                },
                Err(error) => {
                    response.status = 500;
                }
            }
        },
        "delete_user" => {
        let result = call_service_delete_user(user_id);
            match result => {
                Ok() => {
                    emit_delete_user_metric(user_id);
                    response.body.success = true;
                },
                Err(error) => {
                    response.status = 500;
                }
            }
        },
        _ => {}
    }

    debug!("Outgoing response: {}", response);
    return response;
}

This is a good start, but we know what we're concerned about. Errors! So let's add some error logging.

fn handle_request(request: Request) -> Response {

    debug!("Incoming request: {}", request);

    const user_id = request.body.user_id;
    const action = request.body.action;

    let response: Response;

    match action {
        "get_user" => {
            let result = call_service_get_user(user_id);
            match result => {
                Ok(user) => {
                    emit_get_user_metric(user_id);
                    response.body.success = true;
                    response.body.result = user;
                },
                Err(error) => {
                    error!("Error getting user {} : {}", user_id, error);
                    response.status = 500;
                }
            }
        },
        "delete_user" => {
            let result = call_service_delete_user(user_id);
            match result => {
                Ok() => {
                    emit_delete_user_metric(user_id);
                    response.body.success = true;
                },
                Err(error) => {
                    error!("Error deleting user {} : {}", user_id, error);
                    response.status = 500;
                }
            }
        },
        _ => {}
    }

    debug!("Outgoing response: {}", response);
    return response;
}

Now this is looking much better. We can see the incoming request and the outgoing response in debug mode. We can also see any errors that occur. Now, we could log each function call too. But I would advise against this. Instead, I would advise the functions themselves log their own input and output. This way, we can avoid having the callers of the functions log the input and output. So those function would look something like this:

fn call_service_get_user(user_id: String) -> Result<User, Error> {
    debug!("Calling service get user with user_id: {}", user_id);
    const result = ...;
    ...
    debug!("Service get user result: {}", result);
    return result;
    }

    fn emit_get_user_metric(user_id: String) {
    debug!("Emitting get user metric with user_id: {}", user_id);
    ...
    debug!("Emitted get user metric");
}

With this, we can see the entire flow of the request. Now, when you get paged at 3am, you can look at the logs and see exactly what's going on. You've mitigated one way of completely ruining your life.

I'd add one more log to this. I'd add a log just to indicate that the transaction is complete at the INFO level. So we end up with:

fn handle_request(request: Request) -> Response {

    debug!("Incoming request: {}", request);

    const user_id = request.body.user_id;
    const action = request.body.action;

    let response: Response;

    match action {
        "get_user" => {
            let result = call_service_get_user(user_id);
            match result => {
                Ok(user) => {
                    emit_get_user_metric(user_id);
                    response.body.success = true;
                    response.body.result = user;
                },
                Err(error) => {
                    error!("Error getting user {} : {}", user_id, error);
                    response.status = 500;
                }
            }
        },
        "delete_user" => {
            let result = call_service_delete_user(user_id);
            match result => {
                Ok() => {
                    emit_delete_user_metric(user_id);
                    response.body.success = true;
                },
                Err(error) => {
                    error!("Error deleting user {} : {}", user_id, error);
                    response.status = 500;
                }
            }
        },
        _ => {}
    }

    info!("Handled request for user_id: {} with action: {}. Response status: {}", user_id, action, response.status);
    debug!("Outgoing response: {}", response);
    return response;
}

This way, there's always some logs being emitted in the successfully case. These can be post-processed to create metrics.

Considerations

You may have noticed, but we don't just log everything all of the time. We carefully choose the log level based on when we want logs to be printed. So why not just log everything all of the time?

Cost

There's a cost to everything and logging is no different. Logging takes up CPU cycles and memory. This is mostly an issue in resource constrained environments. For example, in an AWS Lambda Function, the amount of vCPU and memory you have is limited. AWS Lambda Functions are billed by the millisecond.

Now think of what need to happen in a logging statement like this:

println!("Incoming request: {}", request);

The request variable needs to be converted into a string. So, we need to build an entire string that's duplicating all of the information in the request and store that in the logs. Now imagine this, what if that request is large? What if it's a 1MB request? But that's just for one log statement. Imagine if we have 1TPS (transactions per second). That's 1MB of logs per second. That's 86.4GB of logs per day. That's 31TB of logs per year. That's a lot of logs. For that reason, we limit printing large objects to only the DEBUG level. Even in services and systems with less TPS, its still worth it.

Clarity

The logs should be clear and concise in successful cases. In error cases, we want to log as much as possible so debugging the error is easier. In general, we only inspect logs when something goes wrong. So we want to make sure that the logs are as useful as possible in those cases.

Sensitive Information

We don't want to log sensitive information. This includes things like passwords, API keys, and personally identifiable information (PII). Logging sensitive information is a security risk and can also violate policies, rules, regulations, and laws about storing sensitive information GDPR.

There's a few techniques to avoid logging sensitive information. Most logging libraries have filters that can be applied to log statements. For example, you can filter out any log statements that contain the word "password". You can also filter out text that matches a regular expression.

Summary

In short, logging is important. Proper logging facilitates debugging and monitoring. There's some considerations, but a general guide for logging is:

In a function, log the inputs and outputs at the DEBUG level.
In a function, log any computed variables at the DEBUG level.
In a function, log any errors at the ERROR level and include as much information as possible.
For a successful transaction, log the transaction at the INFO level with a summary of the transaction.

Also note, the above are not rules but more tips and guidelines. You should do what is best for your particular use case. Don't blindly follow what I, or anyone else, says. Expect for that last statement.

Extra

A small, fun, and challenging exercise is to try and write a logging library. It's a great way to learn about how logging libraries work and how to use them. I'd recommend writing one. Go as far as you can. You'll learn a lot. I would recommend looking at the APIs of other logging libraries as hints to the requirements of a logging library. Refrain from looking at the source code unless you're really stuck.

Potential Ideas

Below are a few ideas on what to implement.

Has DEBUG, INFO, WARNING, ERROR, FATAL log levels
Can log to STDOUT, STDERR, or a file
Can log to multiple places at once
Loads Configuration from file
Customizable logging format
Does not print using the main thread

As always, pick what you're most curious about, keep asking "How does this work? How could I implement it?", try something, learn, iterate and improve, go again, and have fun while doing it!