Logging

1. Introduction

Logging is the process of recording events that take place during the execution of a system, an application, or a service. The resulting log can be used to understand the behavior of the system over time and to diagnose problems.

Having good logging is particularly important for systems that run for a long time with little or no user interaction, such as web servers or microservices.

Logs are not used only for diagnosis and monitoring. They can provide insight into other critical aspects of the system, such as security, missing pages, failing requests, performance, business activity, and audience demographics.

According to studies of C/C++ projects (the findings might not apply to managed languages with stack traces):

Log messages reduce median diagnosis time of production-run failures between 1.4 and 3 times (on average 2.2x speedup).

1.1. Purposes of logging

Logging may be used for the purposes of:

  • Debugging & Error Reporting
  • System Administration & Dev Ops
  • QA insights
  • Performance monitoring
  • Business Intelligence
  • Security

1.1.1. Debugging & Error Reporting

The most common kind of logging is sometimes called tracing: logging done by developers for debugging purposes. These logs are usually granular, provide very low-level information, and are added in an ad-hoc manner, which makes them noisy. They are often cryptic and usable by no one but the responsible developer, and sometimes only on the very same day they added them :)

Logging can be very helpful in error triage and resolution because it provides a record of events that occurred leading up to an error. When an error occurs, it is often difficult to determine the root cause immediately, especially if the error is not immediately reproducible. By reviewing the log data, it is often possible to get a better understanding of the sequence of events that led up to the error, which can help identify the root cause and facilitate resolution.

For example, suppose that an application is generating an error message that indicates a problem with a database connection. By reviewing the log data, a developer may be able to see that the application was trying to connect to the database using an incorrect user name or password. This information can be used to fix the problem and resolve the error.

The questions that logging can help answer are:

  • How did this unlikely situation happen?
  • What was the root cause of this error?
  • How do we verify that this error has been fixed in the production system?
  • Which silent errors still need to be fixed?
  • Is there a persistent record that allows postmortem analysis as opposed to ephemeral dialogs popping up and final death messages in assert statements?
  • Are the intermediate states of our software as we expect them to be?
  • Is there any additional context for an error or a bug report?

1.1.2. System Administration & Dev Ops

Event logging is a type of logging that is used for system administration and dev ops purposes. It logs high-level information, such as the failure of a program installation. This type of logging is typically more stable and at least somewhat documented, although it is often "self-documented" in practice.

The questions that event logging can help answer are:

  • How many operations are failing?
  • Which specific operations are failing?
  • Is system performance falling behind?
  • Is the system still operational?
  • Which nodes are experiencing more failures?
  • Which nodes need to be restarted or be allocated more resources?
  • Why did the entire distributed system fail overnight?
  • Why is this service not starting up? Which parameters are set incorrectly?
  • Is the service configured correctly?

1.1.3. QA insights

Logging may also be useful to QA engineers when they report bugs or verify whether an application or a service works correctly.

  • Which operations are failing?
  • Is there any additional evidence that confirms a user report?
  • What pages / services / operations should be improved?
  • What new pending tests need to be added?
  • How to prioritize the bug tickets?
  • Which items have missing or inadequate help information?

1.1.4. Performance monitoring

Performance issues are difficult to analyze, and pinpointing a root cause for repair is hard. Performance degradation may occur at certain times of the day, during a period of high load, during an active DDoS attack, or for what seems like no reason at all. QA can do performance testing on your application, but these tests rarely represent a realistic production environment with thousands of concurrent users.

By logging performance-sensitive parts of the code and potential bottlenecks, you can retroactively find the root causes of performance degradation. Degradation could be caused by a poorly optimized database query, insufficiently allocated memory, or a CPU spike. Monitoring resource usage over time will give you insight into when it is time to upgrade server hardware.

  • Which operations are taking too much time?
  • Which phase or stage of a process is taking more time?
  • Which resources are being unfairly allocated?
  • What is the primary cause of response delays?
  • Which nodes need more resources (CPU / RAM / Disk)?

1.1.5. Business Intelligence

Logging for business intelligence purposes involves tracking and recording the state of business processes, such as web clicks, financial trades, cell phone connections, and audit trails. This type of logging is used to gather and analyze data about business processes in order to make informed decisions and improve the efficiency and effectiveness of the organization.

  • What is the purchase volume over time?
  • How do purchases compare to last month?
  • How are customers affected by app issues?
  • How is the customer experience?
  • Are transactions taking too long?
  • Where are transactions happening?
  • What is the target demographic?

1.1.6. Security

Developers are good at logging exceptions, but they don't usually account for security events. Security events can be any suspicious behaviors such as repeated patterns of failed login attempts, ACL changes or new accounts being created. This information can be used to identify security breaches and to prevent future attacks.

  • Who is accessing the app? When?
  • What activity looks suspicious?
  • Is the application behaving as expected?
  • When did this user log in and log off? Where did they login from?
  • Who executed these commands?
  • Which systems, data, applications, files, and networks were accessed?
  • Were there any failed access attempts?
  • Were there any changes to system configurations?
  • Was there any activity from cybersecurity tools like the firewall or antivirus software?
  • Who is responsible for leaking sensitive data?

1.2. The sad reality

The sad reality of logging is that most existing systems follow neither "the best practices" nor "YOUR logging practices". In fact, there are no universally accepted guidelines for logging. Unlike feature code, which is at least sometimes tested for correctness, it is very unlikely that anyone tests the correctness of their logging code.

You have little or no control over what is and what is not reported by other people's software. Most logging messages are written by developers to help them debug their code.

Developers insert logging statements in an ad-hoc manner, producing logs that are completely cryptic to the outsiders (and to the developers themselves after a passage of time), and mostly useless for other purposes.

The good news is that you have full control over the logging information that YOU write, and there are tools that can process the unholy mess produced by others.

2. How much to log

Our take is that programmers are the creators of software programs, and the latter have nothing to hide from the former. The popular slogan of hiding implementation details applies between pieces of code, not to programmers, for whom, ideally, all information should be readily available when needed. Therefore, a software program is obligated to reveal its nontrivial execution details with appropriate density in an easy-to-follow way.

How much to log is an age-old question for developers. Logging everything that happens in your application can be great because you have plenty of data to work from when you have a problem.

But logging too much can cause performance issues and result in a waste of disk space. Excessive logging may mask the truly important information.

On the other hand, logging too little or in an ad-hoc manner may result in developers missing important logging statements, which later results in difficulties when debugging issues in production systems.

For every application, you will probably find that you're logging too much data in some areas and not enough in others. The problem is that you do not know beforehand what is important, and therefore it is generally better to err on the "too much" side.

3. WHAT to log?

As you first develop your application, start by logging everything at a high level and then add more logging messages as you find yourself asking "What happened here?"

3.1. DO log enough context

One fourth (27%) of the log modifications are to variable logging. Majority (62%) of them are adding new variables that are causally related to the log message. It indicates that developers often need additional runtime information to understand how a failure occurred.

Suppose you have the following snippet in your code,

if (message instanceof TextMessage) {
    // ...
} else {
    log.error("Unknown message type");
}

The moment you see this error actually appearing in the logs, you will want to know the actual message type, message id, etc. Something went wrong, but what? What was the context?

If possible, always log enough context for a complete picture of what happened from a single log event. This is not always possible as the full context may be too large, but when it is possible, it will save you a huge amount of time when things do go wrong.
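For instance, the snippet above could be rewritten to capture the offending message itself. This is a sketch: the `Message` interface and its `getId` accessor are illustrative stand-ins, not a real messaging API.

```java
// Sketch: log the unexpected message with enough context to diagnose it.
// The Message type and its accessor are illustrative, not a real messaging API.
interface Message {
    String getId();
}

final class UnknownMessageLogger {
    // Build the log message separately so it is easy to test and reuse;
    // the actual class name and id answer "what was the context?"
    static String describe(Message message) {
        return String.format("Unknown message type: %s, id=%s",
                message.getClass().getName(), message.getId());
    }
}
```

Passing `describe(message)` to `log.error` turns a dead-end entry into one that identifies the exact message that triggered it.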

You should log:

  • All relevant variable values.
  • Timestamps.
  • User or request identity: user ids, order ids, transaction ids, request ids, IP addresses, user agents.
  • Logical location: host names, node names, service ids.
  • Flow identifiers: correlation ids, span ids.
  • Source code location: file name, function, method, class, line number.
  • Elapsed times.

3.1.1. ✓ Relevant variables

Log any variables that are causally related to the log statement. We will later talk about how to do that in a compositional manner.

3.1.2. ✓ Timestamps

Timestamps are critical to understanding the sequence of events for debugging, analytics, and deriving transactions.

  • Use the most granular timestamp format available.
  • Always use UTC, especially if your system is distributed. Using different time zones such as EST, CST, or GMT in different services will create confusion.
  • If writing semi-structured messages:
    • Place timestamps at beginning of events.
    • Use a human-readable timestamp format.

Example (good):

08/12/2014:09:16:35.842 GMT INFO key1=value1 key2=value2
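With `java.time`, a millisecond-precision UTC timestamp can be produced like this (a sketch; adapt the pattern to whatever your log pipeline expects):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: millisecond-precision UTC timestamps for log events.
final class LogTimestamps {
    // The trailing literal 'Z' marks the timestamps as UTC for human readers.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                         .withZone(ZoneOffset.UTC);

    static String format(Instant instant) {
        return FORMAT.format(instant);
    }

    static String now() {
        return format(Instant.now());
    }
}
```

Prefixing every event with `LogTimestamps.now()` keeps events sortable and unambiguous across services.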

3.1.3. ✓ User or request identity

Log any unique identifiers that pertain to the current request or user, for example: transaction ids, user ids, request ids, service ids, node ids, ip addresses.

3.1.4. ✓ Logical location

When using microservices, it is common to have multiple instances of the same component running at the same time. It is important to have a way to identify a particular instance from the log entry. This can be done by logging a host name, node identifier, process id, and a unique service identifier.

It doesn't really matter how you generate these ids as long as they are unique and allow you to trace back to the exact server/container and application that generated the entry.

3.1.5. ✓ Flow identifiers

To properly track and trace external requests or requests that originate within your system, it is important to create a correlation identifier that can be used throughout the entire process. This identifier should be included with the initial request and carried through all subsequent processing steps. By using this identifier consistently, you can more easily track the progress and status of the request and identify any issues that may arise. This can be especially useful when working with complex systems or processes with many different components, as it allows you to easily connect all of the related elements and understand how they fit together.

To transport the correlation id between services, it should be passed along with each call to other microservices. If your service receives a request with a correlation id, be sure to use it and return it to the web browser at the end of the request.

You can include the correlation id in HTTP requests using the X-Correlation-ID header. It is ideal to have your internal request API add these headers transparently, so that developers do not have to remember to include the header each time.

In addition to the correlation id, you may also want to consider augmenting it with a short span id, which is an id generated by each microservice itself and passed along when it calls out to other microservices.

Consider using base-62 (or base-58) encoded numbers for your ids as they are short and easily recognizable. However, be aware of the birthday paradox when generating random ids.
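A base-62 correlation id might be generated like this (a minimal sketch; using 64 random bits is an assumption about acceptable collision risk, not a universal recommendation):

```java
import java.security.SecureRandom;

// Sketch: short, recognizable base-62 identifiers for correlation/span ids.
// 64 random bits keep collisions unlikely for moderate volumes, but remember
// the birthday paradox: collision risk grows with the square of the id count.
final class Ids {
    private static final char[] ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
    private static final SecureRandom RANDOM = new SecureRandom();

    static String encodeBase62(long value) {
        // Treat the input as unsigned so random longs encode cleanly.
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET[(int) Long.remainderUnsigned(value, 62)]);
            value = Long.divideUnsigned(value, 62);
        } while (value != 0);
        return sb.reverse().toString();
    }

    static String newCorrelationId() {
        return encodeBase62(RANDOM.nextLong());
    }
}
```

The resulting ids are at most 11 characters long, which makes them easy to read aloud, paste into a search box, or carry in an X-Correlation-ID header.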

3.1.6. ✓ Source location

To help understand the context and location of log messages, consider including class, function, and filename information. You may also want to include stack traces to get a detailed view of the sequence of events leading up to a log entry. Additionally, generating globally unique identifiers for each log message can be useful as they will likely remain unchanged even if the code is refactored or moved, allowing you to easily locate the source of the log message.

3.1.7. ✓ Elapsed time

When logging the end of some process consider adding the elapsed time, as this would allow the reader to understand the performance of the system without correlating multiple events.
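A small helper can measure a unit of work and emit the elapsed time in the completion event itself (a sketch; the field names are illustrative):

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Sketch: attach the elapsed time to the completion event, so the reader
// sees performance without correlating separate start/end entries.
final class Timed {
    static String describeCompletion(String operation, long elapsedMillis) {
        return String.format("operation=%s status=finished elapsed_ms=%d",
                operation, elapsedMillis);
    }

    // Wrap a piece of work; the completion line is logged even if it throws.
    static <T> T timed(String operation, Supplier<T> work, Consumer<String> log) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            log.accept(describeCompletion(operation, elapsedMillis));
        }
    }
}
```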

3.2. DO log startup parameters

3.2.1. ✓ Configuration

Heuristic H12: If the verb describes initialization of the software or configuration of the software, then the event must be logged.

Logging configuration is extremely useful for diagnosing startup issues.

3.2.2. ✓ Application version

When software is running at multiple different locations, versions between them may differ. The difference in application versions means that bugs that are fixed on one environment may still be present on another.
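Both of the above can be emitted once as a startup banner, so each log file identifies the build and configuration that produced it (a sketch; the application name and property names are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: a one-line startup banner with version and effective configuration.
// Sorting the properties keeps banners diffable between environments.
final class StartupBanner {
    static String describe(String appName, String version, Map<String, String> config) {
        StringBuilder sb = new StringBuilder();
        sb.append(appName).append(" version=").append(version);
        for (Map.Entry<String, String> e : new TreeMap<>(config).entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Remember to redact secrets (passwords, API keys) before logging configuration this way.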

3.3. DO be proactive with debugging

3.3.1. ✓ Outgoing requests

If you are working with a distributed system, consider logging every piece of data that enters and leaves your application. Integration is a tough job, and diagnosing problems involving multiple services is particularly hard.

Sometimes the amount of information exchanged with external systems makes it unacceptable to log everything. However, during testing, and for short periods on production (for example, when something wrong is happening), we would like to have everything saved in the logs and are ready to pay the performance cost.

This can be achieved by carefully using logging levels:

Collection<Integer> requestIds = //...
if(log.isDebugEnabled())
    log.debug("Processing ids: {}", requestIds);
else
    log.info("Processing ids size: {}", requestIds.size());

3.3.2. ✓ Unacceptable performance

It is very tempting (and common) to make assumptions about how your code performs, but those assumptions can blind you to some very common performance issues. For example, a call to your database "should always be fast" so you might be tempted not to instrument it.

Log anything outside the range in which you expect your system to perform. If you expect some part of your system to perform at a certain level, you should make sure you log when it falls outside that range. For example, if your database "should always be fast," then you can log any DB access that takes more than, say, 100ms. When the inevitable happens and the DB does slow down, you’ll see that this has happened immediately thanks to this log statement.

Note there is still a question of how to do this in a compositional manner. Sometimes only the caller of a function knows how long it should take to run, or how long a specific part of that function should take to run.
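Such a threshold check might look like this (a sketch; the caller supplies the expected range, since only the caller knows it):

```java
// Sketch: warn only when an operation exceeds its expected duration, e.g.
// "the DB should always answer within 100 ms". The threshold belongs to the
// caller's knowledge, so it is passed in rather than hard-coded in the callee.
final class SlowCallGuard {
    static String check(String operation, long elapsedMillis, long thresholdMillis) {
        if (elapsedMillis <= thresholdMillis) {
            return null; // within the expected range: nothing to log
        }
        return String.format("SLOW operation=%s elapsed_ms=%d threshold_ms=%d",
                operation, elapsedMillis, thresholdMillis);
    }
}
```

Passing the threshold in from the caller is one answer to the compositionality question above: the instrumented function stays generic while each call site states its own expectations.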

3.3.3. ✓ Paths without sufficient test coverage

As shown earlier in Table 3, 57 failures are hard to detect via generic exception checks. We refer them as hard-to-check errors. When a production failure occurs, it is usually due to an unusual input or environment triggering some code paths that are not covered during in-house testing. Table 6 shows that 21% of the 57 hard-to-check failure cases execute some branch edges that we surmise have never been executed during testing (otherwise, the bugs on those edges would definitely have been exposed). Therefore, if we log on those branch decisions that have not been covered during testing, i.e., cold paths, it would be useful for diagnosis. Of course, special care needs to be taken if some cold paths show up too frequently during runtime.

3.4. DO log exceptional conditions

Across 250 randomly sampled reported failures, we first identify that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases.

If you encounter a minor problem that can be automatically resolved, log it. A minor hiccup might be a symptom of a bigger problem. Masking the effects of an error by patching it is known in the parlance as a 'kludge'. It is not a compliment.

If you encounter a major but resolvable problem, you may attempt to resolve it, but make sure to log it ASAP.

  • Your resolution may impair system performance.
  • Your resolution or the original problem itself may result in future problems.
  • The underlying problem may be a usability issue that is annoying to the users.
  • It may be a symptom of a bigger problem.
  • Your resolution may not go as planned, and it will be nice to have a log to refer to.

Log any unresolvable errors.

3.4.1. ✓ Log eagerly

Log errors as soon as possible (but don't overdo it, especially if you are doing FP, as it would affect the compositionality of your code, see below). Sometimes systems fail before an error can be processed up the chain of execution. If you have a log, you would know what had happened, even if the condition caused the program to crash on its way back.

3.4.2. ✓ Log eagerly (FP)

If you are doing functional programming, you do not have to worry as much about correct error propagation up the chain of execution in pure & total (!) code, but you should still be careful to log errors eagerly (think IO, not WriterT). Consider the following Scala code:

def foo: F[Int] = for {
  _ <- log("hello")
  x <- funcThatMayFail
} yield x + 1

Will you see the log output? Well, that depends entirely on the way Monad[F] and log work. If F[A] = WriterT[Either[E, ?], List[L], A] ~ Either[E, (List[L], A)], then you will lose your log messages as soon as something throws an error. EitherT[Writer[L, ?], E, A] ~ (List[L], Either[E, A]) will properly propagate log messages.

WriterT[IO, List[L], A] is basically WriterT[Either[Throwable, ?], ...] w.r.t. error handling, so it will lose log messages in a similar fashion. I recommend using final tagless approach in Scala and supplying Logging[IO] algebras that log eagerly.

3.4.3. ✓ Log exceptions with stack traces

Log messages reduce median diagnosis time of production-run failures between 1.4 and 3 times (on average 2.2x speedup).

Different from the original study, the median BRT [bug resolution time] for BWLs [bug reports with logs] is longer than the median BRT for BNLs [bug reports without logs] in server-side projects and SC-based [supporting-component] projects. The BRT for BNLs is statistically different from the BRT for the BWLs in nearly half of the studied projects (10/21). However, the effect sizes for BRT between the BNLs and BWLs are small.

One way to interpret the studies on diagnosis time vs availability of log files is that both log files and stack traces reduce the issue resolution time when taken independently, but logs being reported is negatively correlated with the availability of stack traces (since people report logs only in cases where there is no obvious stack trace), increasing the issue resolution time.

Which is to say: make sure that you print exception stack traces, if they are available, into the log files; that should have a significant effect on your issue resolution times.

3.4.4. ✓ Not a replacement for error handling

Logging exceptions and error conditions is one of the most important roles of logging, but many developers tend to treat logging as a way to handle exceptions. They sometimes return a default value (typically null, 0, or an empty string) and pretend that nothing erroneous has happened.

There may be some exceptions to this at the top-level of an application, but library code should attempt to propagate as much error information outwards as possible.

3.4.5. ✓ No catch-log-rethrow

Sometimes developers first log the exception, then wrap it and throw it back:

// BAD: Logging and rethrowing.
try { /* do some work */ }
catch (Exception ex) {
    logger.log("error occurred", ex);
    throw ex;
}

Something eventually catches the exception again and logs its cause. As a result, the log files contain the same exception several times at several stack levels.

The problem isn't the local catch block, which can be a sign of good error-handling practice, but the log-and-rethrow part. Either handle the exception, or wrap it with a new exception that adds additional context and throw that.

// GOOD: Logging & recovering.
try {
    doSomething();
} catch (Exception e) {
    log.info("Couldn't do something", e);
    doSomethingElse();
}

If you handle the exception, you can easily downgrade the importance of the exception log entry and there is no reason to percolate that exception up the chain. It's already dealt with.

Handling an exception can include informing users that a problem happened, logging the event, or simply ignoring it.

If you intentionally ignore an exception, I recommend providing a comment in the empty catch clause that clearly indicates why. This lets future maintainers know that it was not a mistake or lazy programming. Example:

// GOOD: leaving a comment explaining why ignoring an exception is okay.
try {
    context.DrawLine(x1,y1, x2,y2);
} catch (OutOfMemoryException) {
    // WinForms throws OutOfMemory if the figure you are attempting to
    // draw takes up less than one pixel (true story)
}
// GOOD: Adding additional context and throwing a wrapped exception.
try {
    doSomething(line);
} catch (Exception e) {
    throw new MyApplicationException(filename, line, e);
}

Adding additional context (like the line number and file name in parsing code) is extremely helpful for debugging purposes. Note that re-wrapping an exception in an overly generic ApplicationException just to rebrand it doesn't help you debug. Make sure you add additional information.

// GOOD: Cleaning up.
try {
    doSomething();
} finally {
  // Cleanup resources but let the exception percolate.
}

In this final case, you just allow the exception to leave without touching it. The exception handler at the outermost layer can handle the logging. The finally clause is used to make sure any resources needed by your method are cleaned up, but this is not the place to log that the exception was thrown.

3.5. ✓ CRUD Actions

> Heuristic H1: If the verb [describing an action] involves creating, reading, updating, or deleting resource data in the software system, then the event must be logged.
[...]
If the verb can be accurately rephrased in terms of creating, reading, updating, or deleting resource data in the software system, then the event must be logged.
[...]
The unconstrained natural language used (specifically in use-case based requirements and user guides) may not easily map to the core CRUD actions. For example, "designate" appears in the iTrust use-case based requirements and OCS user guide. In these cases, we attempt to mentally rephrase the action using a core CRUD action. For example, "A patient designates a patient representative" can be mentally rephrased as "create a patient representative in the patient's list of patient representatives".

Always record CRUD actions (create, read, update, delete).

3.6. ✓ Read/View Actions

If the verb implies the system displaying or printing resource data that is capable of being viewed in the user interface or on paper, then the event must be logged.

Logging browser http requests and responses is an extension to logging the data that is communicated between your services. Seeing what a user has been doing makes it easier for a developer to determine what exactly went wrong for the user and is especially useful for front end errors.

3.7. ✓ Actions that Express Permissions

If the verb expresses the granting or revocation of access privileges in the software system, then the event must be logged.
[...]
In "allow doctors to edit immunizations", the edit action is classified as a mandatory log event. However, the term "allow" suggests the use of an access control security mechanism in the software system. In this example, and based on prior research on security events that should be logged, we consider "allow" equivalent to "grant a user privilege" in an access control mechanism in the software.

3.8. ✓ User session events

If the verb involves the creation or termination of a user session, then the event must be logged.

3.9. DO log system metrics

System metrics are a way to monitor the health and performance of a system at any given time, covering things such as CPU usage, RAM usage, and request durations. Mostly, these metrics are used to monitor performance in real time and to scale the system up or down when needed.

4. HOW to log (formatting)?

4.1. DO use structured logging

Developers seldom delete or move logging code, accounting for only 2% of all the log modifications. Almost all (98%) of the log modifications are to the content of the log messages.

Most logging "frameworks" force you to use regular strings in your log statements:

LOG.error("Could not get invocation handler " + invocationHandler +
          " for proxy " + proxy + ", or invocation handler is not closeable.");

This is not machine-friendly, and it is difficult to process such a message in an automated fashion without essentially reverse-engineering its structure after the fact. You should make your log format structured: machine (and ideally human) readable.

Consider using developer-, human-, and machine-readable formats, such as JSON. This will further reduce the number of unnecessary log modifications, thanks to the reduced ambiguity of structured messages.
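For illustration, the string-concatenation example above could be emitted as a JSON event instead. This is a hand-rolled sketch to show the shape of the output; in practice you would use your logging framework's JSON encoder rather than building JSON yourself:

```java
import java.util.Map;

// Sketch: a minimal structured (JSON) log line. Real systems should use a
// proper JSON encoder; this only illustrates the key/value structure.
final class JsonLog {
    static String line(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        append(sb, "level", level);
        sb.append(',');
        append(sb, "message", message);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(',');
            append(sb, e.getKey(), e.getValue());
        }
        return sb.append('}').toString();
    }

    private static void append(StringBuilder sb, String key, String value) {
        sb.append('"').append(escape(key)).append("\":\"").append(escape(value)).append('"');
    }

    // Escape backslashes and quotes so arbitrary values stay valid JSON.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Each field ("proxy", "invocationHandler") becomes an independently queryable key instead of a fragment buried in a sentence.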

4.1.1. ✓ If using a text format

If you are still adamant about writing text messages, consider using key-value pairs for your message format. They don't require complex processing and can be parsed with regular expressions:

Log.debug("orderstatus=error, errorcode=454, user=%d, transactionid=%s", userId, transId)

Using a text format puts certain restrictions on the types you can put into your logs; you should avoid printing complex data types unless you can serialize them compactly.

You should avoid using binary data (e.g. base64 encoded) as it is not human-readable and is not easily searchable. If data must be in binary format, place textual metadata in the event. For example, do not log just the binary data of a JPG file, but do log its image size, creation tool, username, camera, GPS location, etc.

4.1.2. ✓ Use consistent formatting

Even if you use multiple different formats for different services or applications, you should make sure that your logs get reformatted into a consistent format before they get shipped to a centralized storage. This will simplify the process of analyzing and aggregating information from multiple logs.

4.2. DO separate logs with different formats into different files

If a single application has to output events in different formats, separate them into individual files.

If your logs are consumed by multiple different parties, e.g. system administrators and business analysts, it is best to produce a separate log file for each group (this can be done by an automated log aggregator as well). They have different needs, and appropriate log files will have different "expiration times". Support logs might be kept for a week, marketing logs might be rotated every day, audit logs might be kept for years. Keeping all these messages in a single file makes it impossible to archive only audit logs, and will typically make automated parsing much harder.

5. HOW to log (coding)?

5.1. Standard Logging levels

Developers spend significant efforts on adjusting the verbosity level of log messages, accounting for 26% of all the log improvements. Majority (72%) of them reflect the changes in developers' judgement about the criticality of an error event.
[...]
In 28% of verbosity level modifications, developers reconsider the trade-off between multiple verbosities. It might indicate that developers are often confused when estimating the cost (e.g., excessive messages, overhead) and benefit afforded by each verbosity level.

Most logging libraries have a fixed set of severity levels, usually some combination of these:

  • Fatal/Critical – overall application or system failure that should be investigated immediately. This severity should be used very infrequently. Typically, a Fatal error is the last message in the log, since any error on this level will likely force a shutdown of the service or application to prevent data loss.
  • Error – any error which is fatal to the operation, but not the service or application (can't open a required file, missing data, etc.). Definitely a problem that should be investigated, but not immediately. By filtering a log to look at log messages of level Error and above, you get an overview of error frequency and can quickly identify the initiating failure that might have resulted in a cascade of additional errors.
  • Warning – the process might be continued, but take extra caution. Anything that can potentially cause application oddities, but for which there is an automatic resolution procedure, such as switching from a primary to a backup server, retrying an operation, skipping over some non-essential data, etc.
  • Information – general information about the operation of an application or a service. Each action that changes the state of the application significantly (database update, external system request) should be logged with this level.
  • Trace – very detailed information, intended only for development. You might keep trace messages for a short period of time after deployment to a production environment, but treat these log statements as temporary ones that should eventually be turned off or removed. They should generally be discouraged: allowing tracing messages tends to lead to more and more Trace messages being added and none ever removed. In time, this makes log files almost useless because it's too hard to filter signal from noise. Pruning Trace messages also eliminates bugs introduced by side-effects in debug code that should not be included in the release build.

The list above is just a suggestion; you can create your own set of conventions to follow, but it is important to have some. Having the ability to quickly filter logs and extract the information with the proper detail level might be a life-saver.

5.1.1. Severity tags?

Instead of using a set of fixed severity levels, you might want to consider tagging your log messages based on the kind of content they contain. For example, a message summarizing a number of error occurrences in a particular subsystem could have a set of tags error, summary, subsystem1.
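A minimal sketch of this idea in plain Java (the tag names and the in-memory store are hypothetical, not any particular framework's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TaggedLogDemo {
    static class Entry {
        final Set<String> tags;
        final String message;
        Entry(Set<String> tags, String message) { this.tags = tags; this.message = message; }
    }

    static final List<Entry> entries = new ArrayList<>();

    // instead of a single severity, attach any number of descriptive tags
    static void log(String message, String... tags) {
        entries.add(new Entry(Set.of(tags), message));
    }

    public static void main(String[] args) {
        log("17 write failures in the last minute", "error", "summary", "subsystem1");
        log("cache warmed up", "info", "cache");

        // filtering selects on tag combinations rather than one severity axis
        long errorSummaries = entries.stream()
                .filter(e -> e.tags.containsAll(Set.of("error", "summary")))
                .count();
        System.out.println(errorSummaries); // 1
    }
}
```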

5.1.2. Log levels are non-compositional

Unless you are writing code at the very top of your application (code that is allowed to be aware of its overall logical "location" in the application), you cannot meaningfully assign log levels to your messages. Only the caller of a function knows whether a failure in that function is "fatal" to its operation: a function opening a file cannot know whether failing to open that file is fatal or perfectly normal.
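A small sketch of the point, assuming a hypothetical readFile helper backed by a fake filesystem: the helper only reports that the file is missing, and each call site maps the same outcome to a different severity.

```java
import java.util.Map;
import java.util.Optional;

public class CallerDecides {
    // fake filesystem for the sketch: only the cache file exists
    static final Map<String, String> files = Map.of("cache.tmp", "cached data");

    // the helper reports *what* happened; it cannot know how severe that is
    static Optional<String> readFile(String name) {
        return Optional.ofNullable(files.get(name));
    }

    public static void main(String[] args) {
        // the same empty result is fatal at one call site and routine at another
        String configMsg = readFile("app.conf").isEmpty()
                ? "ERROR: required config missing"
                : "config loaded";
        String cacheMsg = readFile("cache.tmp").isEmpty()
                ? "DEBUG: no cache, cold start"
                : "warm start from cache";
        System.out.println(configMsg); // ERROR: required config missing
        System.out.println(cacheMsg);  // warm start from cache
    }
}
```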

I will talk about more compositionality issues later in this post.

5.1.3. DO NOT duplicate error level in the message

LOG.info("DEBUG - Container FINISHED: " + containerId);
//  ^^^^  ^^^^^

Although the verbosity level is set to INFO, the text suggests that this log statement is used for debugging purposes. The fix is to first change the logging level to DEBUG:

LOG.debug("DEBUG - Container FINISHED: " + containerId);
//  ^^^^^  ^^^^^

Now we have duplication of information about severity level, which we can fix as well:

LOG.debug("Container FINISHED: " + containerId);

5.2. MIND complex expressions in log statements in unsafe languages

Every time you write (or review) a log statement, check if there are any potential NullPointerExceptions like:

log.debug("Processing request with id: {}", request.getId());

or unsafe casts like

DataNode.LOG.warn("Added missing block to memory " + (Block)diskBlockInfo);

Note that

try {
  log.trace("Id=" + request.getUser().getId() +
            " accesses " + manager.getPage().getUrl().toString())
} catch(NullPointerException e) {}

is not a good solution either.
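A safer pattern is to keep the expressions inside the log statement trivial: extract the values defensively first, then log them. A hypothetical sketch (Request is a stand-in domain type):

```java
import java.util.Objects;

public class SafeLogArgs {
    // stand-in for a domain object that may be null or partially populated
    static class Request {
        private final Long id;
        Request(Long id) { this.id = id; }
        Long getId() { return id; }
    }

    // guard each step of the navigation instead of trusting a long call chain
    static String describe(Request request) {
        if (request == null) return "<no request>";
        return Objects.toString(request.getId(), "<no id>");
    }

    public static void main(String[] args) {
        System.out.println("Processing request with id: " + describe(null));             // <no request>
        System.out.println("Processing request with id: " + describe(new Request(42L))); // 42
    }
}
```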

5.3. DO NOT put side-effects into your log statements

Be careful with side-effects in your logging statements. If you fetch a "collection" of domain objects from the database using Hibernate and carelessly log them:

log.debug("Returning users: {}", users);

You will be inadvertently calling toString() on this "collection", which might result in a variety of problems such as OOM, the N+1 select problem, thread starvation (logging is synchronous!), lazy-initialization exceptions, etc.

5.4. DO NOT use multiple statements for one logical event

Each logging statement should represent a single execution event.

log.debug("Message processed");
log.debug(message.getJMSMessageID());

should be rewritten as:

log.debug("Message with id '{}' processed", message.getJMSMessageID());

5.5. DO be lazy

In log4j, use the isLogLevelEnabled-style guard methods (isDebugEnabled, isInfoEnabled, etc.), or similar methods in other logging frameworks.

Java before Java 8 had no convenient way to suspend evaluation of an expression, so any expression passed to a logger method is evaluated even if the logger ignores the resulting String. This is wasteful when logging complex data structures.

// GOOD enough.
if (logger.isDebugEnabled()) {
    logger.debug("data structure: " + dumpDataStructure(object));
}

Of course, in languages supporting some form of by-name parameters, it is much preferable to simply use those facilities.
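Since Java 8, java.util.logging accepts a Supplier, which gives exactly this by-name behavior: the message is only built if the level is enabled. A small sketch (the counter just makes the laziness observable):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogDemo {
    static final Logger LOG = Logger.getLogger("lazy-demo");
    static final AtomicInteger calls = new AtomicInteger();

    static String dumpDataStructure() {
        calls.incrementAndGet(); // counts how often we pay the cost
        return "expensive dump";
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE is disabled

        // Eager: the argument is built even though the message is dropped.
        LOG.fine("data structure: " + dumpDataStructure());

        // Lazy: the Supplier is only invoked if FINE is actually enabled.
        LOG.fine(() -> "data structure: " + dumpDataStructure());

        System.out.println("evaluations: " + calls.get()); // evaluations: 1
    }
}
```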

5.6. DO log before an action

A common example is logging a SQL command after execution. In case of an exception, the log entry might never be written, so the audit log will not provide relevant contextual information for the exception. Audit log entries should be written before an action, not after.
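A tiny sketch of the difference (the in-memory audit list and the failing action are hypothetical): because the entry is written before the action, it survives even when the action throws.

```java
import java.util.ArrayList;
import java.util.List;

public class AuditDemo {
    static final List<String> audit = new ArrayList<>();

    static void execute(String sql) {
        audit.add("Executing: " + sql); // write the audit entry BEFORE acting
        if (sql.startsWith("DROP")) {   // simulate a failing statement
            throw new RuntimeException("statement rejected");
        }
        audit.add("Done: " + sql);
    }

    public static void main(String[] args) {
        try {
            execute("DROP TABLE users");
        } catch (RuntimeException e) {
            // the audit log still tells us which statement blew up
        }
        System.out.println(audit); // [Executing: DROP TABLE users]
    }
}
```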

5.7. DO NOT commit meaningless messages

One has to go through the pain of analyzing debug output looking like:

step #1
x
2 - true
step #2
step #3

to deeply appreciate their inherent ugliness.

6. HOW to log (operational)?

6.1. Centralize log collection

If you have to log in to each individual server to read logs, you will spend more time trying to correlate problems than if you just had one location where you could access all the logs.

6.2. Implement a log security and retention policy

Logs may contain sensitive data, like personal data of your clients or internal access keys for APIs. Make sure that sensitive data gets anonymized or encrypted before you ship logs to any third party.

Logs from different sources might require different retention times. Some applications are only relevant for troubleshooting for a few days. Security-related or business transaction logs require longer retention times. Therefore, a retention policy should be flexible, depending on the log origin and intention.

6.3. Log Storage

Planning the capacity for log storage should account for high load peaks. When systems run well, the amount of data produced per day is nearly constant and depends mainly on system utilization and the number of transactions per day. In the case of critical system errors, we typically see accelerated growth in log volume. If the log storage hits its limits, you lose the latest logs, which are exactly the ones essential for fixing system errors. The log storage must work as a cyclic buffer, deleting the oldest data first when a storage limit is reached.

Design your log storage so that it is scalable and reliable – there is nothing worse than having system downtime and lacking the information for troubleshooting, which, in turn, can prolong the downtime.

Log storage should have a separate security policy. Any attacker will try to avoid leaving traces in log files, or to delete them. Therefore, you should ship security logs in real time to secure log storage. If the attacker has access to your infrastructure, sending logs off-site, e.g., to a logging SaaS, will help keep the evidence untampered.

6.4. Review & constantly maintain your logs

Unmaintained log data can lead to longer troubleshooting times, the risk of exposing sensitive data, or higher log storage costs. Review the log output of your applications and adjust it to your needs. Reviews should cover usability, operational, and security aspects.

6.5. Make log level configurable

Some application logs are too verbose, while others don't provide enough information about system activity. Adjustable log levels are key to configuring the verbosity of your logs.
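One lightweight way to make verbosity adjustable without redeploying, sketched with java.util.logging (the property name app.log.level is hypothetical):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ConfigurableLevel {
    public static void main(String[] args) {
        // read the desired verbosity from a system property, defaulting to INFO;
        // e.g. run with -Dapp.log.level=FINE for more detail
        String configured = System.getProperty("app.log.level", "INFO");
        Logger log = Logger.getLogger("app");
        log.setLevel(Level.parse(configured));
        System.out.println("effective level: " + log.getLevel());
    }
}
```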

6.6. Inspect audit logs frequently

Acting on security issues is crucial – so you should always keep an eye on audit logs. Set up security tools such as auditd or OSSEC agents. These tools implement real-time log analysis and generate alert logs pointing to potential security issues. On top of such audit logs, you should define alerts so that you are notified quickly of any suspicious activity.

6.7. Set Alerts

The most critical events are those that impact business revenue. These events are the ones you should monitor closely and send alerts to the right response team.

It shouldn't be your customers calling to tell you that there are bugs in your application. You want to find them before they do, and many customers will just bounce to another vendor if you have too many bugs. Many of your customers won't report application issues at all, so you could be losing sales every hour and never know it.

When you build a shopping cart, you should have a point where you ask a visitor for a way to contact them. Usually, email is a common input field during account creation. With an alert on failed shopping cart activity, you have a way to contact a customer should they receive an error and bail on the shopping experience. Alerts are a great tool to salvage a lost customer due to application bugs.

But you also need alerts to tell you when components of your application are creating obstacles for customer shopping experiences. It could be an alert for performance (similar to the previous example), or your customers could be dropping at a specific point in the sales funnel.

Alerts may also help you pinpoint components of your application that could be causing performance degradation. Your developers would need to set a baseline, but – for instance – you could set an alert for any event that takes longer than 100 milliseconds to process.

Set an alert to inform administrators when CPU usage spikes above 80%. The remedy could be something as simple as upgrading RAM or even your server's CPU, but having these log alerts gives you the ability to analyze the time of day and any usage patterns over the application's lifetime.

7. Other considerations

7.1. Logging vs Compositionality

My main problem with logging levels is that they are non-compositional: only the caller knows whether a failure in a function is a "critical failure" in the context of the surrounding computation.

Consider an embedded-systems example: a device's log is filled with I2C write failure messages. If you're a new developer working on the project for the first time, or a client testing the software, it is reasonable to think that something terrible has happened. Since the log file looks so concerning, you might spend an hour or two investigating these errors before realizing that the I2C write failures are expected and acceptable: the program supports dynamic detection of connected sensors using an I2C write, and if a sensor is not connected, the write fails and we move on to the next slot. Even worse, the same I2C error spew occurs if you run an I2C bus sweep to detect connected devices; each sweep can generate up to 128 I2C error prints, making the log practically worthless. The obvious answer is to disable the log statement, but the underlying I2C layer was provided to us as a pre-compiled library.

This example is hardly a shining beacon of useful error logging, but it is indicative of typical logging experiences.

Furthermore, it is not always clear whether a particular event should be logged or not. Consider a situation where you are opening thousands of files and the vast majority of them do not exist. Your code may look like this (ZIO):

IO.foreach_(files)(fn => doSomething(fn))

Clearly, in this particular case it would be unwise to record every single file that was not found, assuming that this is the expected situation. Instead, only general statistics, such as the number of files found vs. not found, should be logged. But that in turn means that doSomething cannot log anything on its own.

7.2. Probes

A good functional solution to the problem of log non-compositionality is to pass a function into doSomething that will be called whenever a file is not found:

def doSomething[E](path: Path)(notFound: => IO[E, Unit] = IO.unit): IO[E, Unit]

This can be further generalized with a concept of a probe, a collection of callbacks that will be called from the function whenever it goes through particular internal state transitions:

trait DoSomethingProbe[-R, +E] {
    def openingFile(file: Path): ZIO[R, Nothing, Unit]
    def fileNotFound(file: Path): ZIO[R, E, Unit]
    def closingFile(file: Path): ZIO[R, Nothing, Unit]
}
object DoSomethingProbe {
    // A probe that does nothing.
    val nop : DoSomethingProbe[Any, Nothing] =
        new DoSomethingProbe[Any, Nothing] { ... }
}

def doSomething[R, E](path: Path)(probe: DoSomethingProbe[R, E] = DoSomethingProbe.nop): ZIO[R, E, Unit]

With this probing mechanism in place, the caller can decide whether doSomething should log anything, what level it should log failures at, etc.

7.3. Function purity

Adherents of pure functional programming may find themselves in a difficult situation where something needs to be logged inside a pure function for debugging purposes but there is no monadic context available.

You can rewrite your pure functions in a final tagless manner, but only at a significant price in terms of performance and clarity. You can put all of your code into an IO-like monad, but you lose a lot of reasoning tools that come with precise types and colloquial "purity".

One potential solution is to have a way of probing pure code... using impure callbacks:

trait TransposeProbe {
    def beforeTranspose(matrix: Matrix): Unit
    def afterTranspose(oldMatrix: Matrix, newMatrix: Matrix): Unit
}
object TransposeProbe {
    // A probe that does nothing.
    val nop: TransposeProbe = new TransposeProbe { ... }
}

def transpose(m: Matrix)(probe: TransposeProbe = TransposeProbe.nop): Matrix = ...

This is not completely satisfactory, but it does side-step the reasoning issue – the function is not side-effectful as long as the probe isn't. The fact that we sometimes sequence innocuous calls to functions returning Unit does not make reasoning any more difficult, as long as we assume they are pure and total. Those familiar with Haskell may think of it as seq'ing unsafeInterleaveIO'd thunks of ().

Unfortunately this approach does not work well if our logging is inherently asynchronous.

References

  1. nplus1-problem
  2. Characterizing Logging Practices in Open-Source Software
  3. Improving the software logging practices in DevOps
  4. Characterizing logging practices in Java-based open source software projects – a replication study in Apache Software Foundation
  5. Characterizing and Detecting Anti-patterns in the Logging Code
  6. Extracting and studying the Logging-Code-Issue-Introducing changes in Java-based large-scale open source software systems
  7. Studying Software Logging Using Topic Models
  8. LAKE: Using Log Files Recorded during Program Execution
  9. Enabling Forensics by Proposing Heuristics to Identify Mandatory Log Events
  10. Extending traditional request logging: what developers require from logging solutions and anonymization of privacy sensitive information in logs
  11. Be Conservative: Enhancing Failure Diagnosis with Proactive Logging
  12. Try/Catch/Log/Rethrow - Is Anti Pattern?
  13. Studying the characteristics of logging practices in mobile apps: a case study on F-Droid
  14. Studying and Detecting Log-related Issues
  15. Characterizing the Natural Language Descriptions in Software Logging Statements
  16. When to use the different log levels
  17. Logging anti-patterns
  18. Best practices for tracing and debugging microservices
  19. The Value of Correlation IDs
  20. Correlation IDs for microservices architectures
  21. A recipe for adding correlation IDs in Java Microservices
  22. Causality – Relating Distributed Diagnostic Contexts
  23. Tools and techniques for logging microservices
  24. Tips on Logging Microservices
  25. Tracing Through MicroServices
  26. co-log: Composable Contravariant Combinatorial Comonadic Configurable Convenient Logging
  27. Logging Best Practices
  28. Logging Discussion
  29. Application Logging Best Practices
  30. The Pragmatic Logging Handbook
  31. Learn Microservices Logging Best Practices
  32. Structured Logging
  33. Best Practices for Efficient Log Management and Monitoring
  34. Log Alerts & Log Monitoring – Best Practices
  35. 10 Tips for Proper Application Logging
  36. What are some patterns and anti-patterns of application logging?
  37. Optimizing Data Analysis with a Semi-structured Time Series Database
  38. Containers: 5 Docker Logging Best Practices
  39. Monitoring containerized microservices with a centralized logging architecture
  40. structlog - Logging Best Practices
  41. Docker Reference Architecture: Docker Logging Design and Best Practices
  42. Tracing Back Log Data to its Log Statement: From Research to Practice
  43. Log4Perf: Suggesting Logging Locations for Web-based Systems’ Performance Monitoring
  44. An Automated Approach to Estimating Code Coverage Measures via Execution Logs