Logging is a software engineering skill most don't encounter until landing their first job. It's a skill that's seemingly simple, but does require some thought. In this post, we'll go over the basics of logging and how to use it effectively.
At its core, logging is a way to emit information about a program's execution.
Targets are where the logs end up.
Levels are the severity of the log. Loggers typically are set to a specific logging level.
Loggers will emit all logs at the current level or higher. For example, if a logger is set to INFO, it will emit INFO, WARNING, ERROR, and FATAL logs.
The typical format of a log statement is similar to:
[ TIMESTAMP ] [LEVEL] [ MESSAGE ]
2023-05-05T12:00:00.000Z INFO Hello World!
The format of the timestamp in the above example is ISO 8601. This is a common format for timestamps because it's clear and easy to parse. The time zone UTC.
Sometimes, the format includes the thread ID. This is useful for multi-threaded programs. Without it, you may not be able to tell which log entries are from which thread.
In http servers, a random UUID is created for each request. This is useful for tracing a requests. Sometimes this is included in the log format.
So the format is usually customizable and can include whatever information you want. But typically its limited to just enough information to track the flow of execution for a process, request, event handler, etc.
The below is an example of a logging in Java.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
logger.info("Hello World!"); // For example, prints [2023-05-05T12:00:00.000Z] [INFO] Hello World!
}
}
Every language has its own logging frameworks. So perform your due diligence and select one of the more popular ones.
Now that we have this tool, we need to figure out how to use it effectively. Let's look at the following Rust code. It's a simple program that handles a request to a web server. It then calls into another service, emits a metric, and returns the result to the client. Now, we're just using this as an example, so don't get lost in the details. Just accept that we have these functions and they do what they say they do. (And yes, this example is horribly contrived, but it's just an example.)
fn handle_request(request: Request) -> Response {
const user_id = request.body.user_id;
const action = request.body.action;
let response: Response;
match action {
"get_user" => {
let result = call_service_get_user(user_id);
match result => {
Ok(user) => {
emit_get_user_metric(user_id);
response.body.success = true;
response.body.result = user;
},
Err(error) => {
response.status = 500;
}
}
},
"delete_user" => {
let result = call_service_delete_user(user_id);
match result => {
Ok() => {
emit_delete_user_metric(user_id);
response.body.success = true;
},
Err(error) => {
response.status = 500;
}
}
},
_ => {}
}
return response;
}
Now imagine this, you're on call and you get paged at 3am. You're barely awake and you're trying to figure out what's going on. You look at the alarm ticket description:
Big Importance Service - HTTP ERROR 5XX Responses Above Threshold
Not particularly helpful, but you have a starting point. You look at the logs and see... nothing. There's a couple of error messages, but its not clear what's going on. What user requests are causing the errors? What functions are being called? You have no clue. At this point, just hand in your badge. You did yourself dirty by not writing proper logging. The alarm is escalating up the management chain. To your manager, their manager, and their manager. Your heart is sinking, your stomach is churning. Finally, you get a cold call from your CEO. He's detailing how you're single handedly costing the company millions of dollars. The internal wiki is already being writting about your monumental failure. Your name is forever etched in the company's history as the "that one person that didn't write logs". Word gets around quick. You're struggling to find another job. You move to another state, change your name, hair color, and get plastic surgery. However, in every interview, they can still see the existential dread in your eyes. That's forever.
So yea, let's not be that person. Instead, let's take a look at what we think is important. We probably want to know the entire incoming request, but only if we're debugging. So let's add a DEBUG log. While we're at it, we can also add a log for the outgoing response.
fn handle_request(request: Request) -> Response {
debug!("Incoming request: {}", request);
const user_id = request.body.user_id;
const action = request.body.action;
let response: Response;
match action {
"get_user" => {
let result = call_service_get_user(user_id);
match result => {
Ok(user) => {
emit_get_user_metric(user_id);
response.body.success = true;
response.body.result = user;
},
Err(error) => {
response.status = 500;
}
}
},
"delete_user" => {
let result = call_service_delete_user(user_id);
match result => {
Ok() => {
emit_delete_user_metric(user_id);
response.body.success = true;
},
Err(error) => {
response.status = 500;
}
}
},
_ => {}
}
debug!("Outgoing response: {}", response);
return response;
}
This is a good start, but we know what we're concerned about. Errors! So let's add some error logging.
fn handle_request(request: Request) -> Response {
debug!("Incoming request: {}", request);
const user_id = request.body.user_id;
const action = request.body.action;
let response: Response;
match action {
"get_user" => {
let result = call_service_get_user(user_id);
match result => {
Ok(user) => {
emit_get_user_metric(user_id);
response.body.success = true;
response.body.result = user;
},
Err(error) => {
error!("Error getting user {} : {}", user_id, error);
response.status = 500;
}
}
},
"delete_user" => {
let result = call_service_delete_user(user_id);
match result => {
Ok() => {
emit_delete_user_metric(user_id);
response.body.success = true;
},
Err(error) => {
error!("Error deleting user {} : {}", user_id, error);
response.status = 500;
}
}
},
_ => {}
}
debug!("Outgoing response: {}", response);
return response;
}
Now this is looking much better. We can see the incoming request and the outgoing response in debug mode. We can also see any errors that occur. Now, we could log each function call too. But I would advise against this. Instead, I would advise the functions themselves log their own input and output. This way, we can avoid having the callers of the functions log the input and output. So those function would look something like this:
fn call_service_get_user(user_id: String) -> Result<User, Error> {
debug!("Calling service get user with user_id: {}", user_id);
const result = ...;
...
debug!("Service get user result: {}", result);
return result;
}
fn emit_get_user_metric(user_id: String) {
debug!("Emitting get user metric with user_id: {}", user_id);
...
debug!("Emitted get user metric");
}
With this, we can see the entire flow of the request. Now, when you get paged at 3am, you can look at the logs and see exactly what's going on. You've mitigated one way of completely ruining your life.
I'd add one more log to this. I'd add a log just to indicate that the transaction is complete at the INFO level. So we end up with:
fn handle_request(request: Request) -> Response {
debug!("Incoming request: {}", request);
const user_id = request.body.user_id;
const action = request.body.action;
let response: Response;
match action {
"get_user" => {
let result = call_service_get_user(user_id);
match result => {
Ok(user) => {
emit_get_user_metric(user_id);
response.body.success = true;
response.body.result = user;
},
Err(error) => {
error!("Error getting user {} : {}", user_id, error);
response.status = 500;
}
}
},
"delete_user" => {
let result = call_service_delete_user(user_id);
match result => {
Ok() => {
emit_delete_user_metric(user_id);
response.body.success = true;
},
Err(error) => {
error!("Error deleting user {} : {}", user_id, error);
response.status = 500;
}
}
},
_ => {}
}
info!("Handled request for user_id: {} with action: {}. Response status: {}", user_id, action, response.status);
debug!("Outgoing response: {}", response);
return response;
}
This way, there's always some logs being emitted in the successfully case. These can be post-processed to create metrics.
You may have noticed, but we don't just log everything all of the time. We carefully choose the log level based on when we want logs to be printed. So why not just log everything all of the time?
There's a cost to everything and logging is no different. Logging takes up CPU cycles and memory. This is mostly an issue in resource constrained environments. For example, in an AWS Lambda Function, the amount of vCPU and memory you have is limited. AWS Lambda Functions are billed by the millisecond.
Now think of what need to happen in a logging statement like this:
println!("Incoming request: {}", request);
The request
variable needs to be converted into a string. So, we need to build an
entire string that's duplicating all of the information in the request and store that in the logs. Now imagine this, what if that request is large? What if it's a 1MB request? But that's just for one log statement. Imagine if we have 1TPS (transactions per second). That's 1MB of logs per second. That's 86.4GB of logs per day. That's 31TB of logs per year. That's a lot of logs. For that reason, we limit printing large objects to only the DEBUG level. Even in services and systems with less TPS, its still worth it.
The logs should be clear and concise in successful cases. In error cases, we want to log as much as possible so debugging the error is easier. In general, we only inspect logs when something goes wrong. So we want to make sure that the logs are as useful as possible in those cases.
We don't want to log sensitive information. This includes things like passwords, API keys, and personally identifiable information (PII). Logging sensitive information is a security risk and can also violate policies, rules, regulations, and laws about storing sensitive information GDPR.
There's a few techniques to avoid logging sensitive information. Most logging libraries have filters that can be applied to log statements. For example, you can filter out any log statements that contain the word "password". You can also filter out text that matches a regular expression.
In short, logging is important. Proper logging facilitates debugging and monitoring. There's some considerations, but a general guide for logging is:
Also note, the above are not rules but more tips and guidelines. You should do what is best for your particular use case. Don't blindly follow what I, or anyone else, says. Expect for that last statement.
A small, fun, and challenging exercise is to try and write a logging library. It's a great way to learn about how logging libraries work and how to use them. I'd recommend writing one. Go as far as you can. You'll learn a lot. I would recommend looking at the APIs of other logging libraries as hints to the requirements of a logging library. Refrain from looking at the source code unless you're really stuck.
Below are a few ideas on what to implement.
As always, pick what you're most curious about, keep asking "How does this work? How could I implement it?", try something, learn, iterate and improve, go again, and have fun while doing it!