Dynamic Per-Request Log Level

Looks like it took me three years to write something new for this blog. Kids and other cool stuff got in the way - you know the score! I also should say, that I haven't done anything in Java in the past three years (almost). So to be more accurate, I should rename the blog to jvmsplitter.blogspot.com but whatever - I just drop my Scala blog posts here anyways!

This week I implemented a great idea that a co-worker came up with. Have you ever been in a situation, where you have a running system with a bunch of microservices in production and suddenly something doesn't work as expected? In the world of Battlefield, examples could be that players cannot purchase Battlepacks on PS4 or that matchmaking stopped working. So how do you find the root cause of the problem? Right, you want to look at some logs to get additional information. Only problem is, that your log level in production is usually quite high, for instance WARN or even ERROR - otherwise the amount of logging would just be too much. Wouldn't it be great to alter the log level dynamically on a per-request basis? This would allow you to test in production using TRACE logging – for just your test user. Finagle contexts to the rescue!

Here at DICE we have build all our Scala backend services based on Twitters Finagle framework – which is similar to the Tokio Framework in Rust if you have used that. In a nutshell Finagle is a RPC framework on the JVM with build-in support for various transport protocols, load balancing, service discovery, backoff etc. One semi-hidden feature of Finagle is the broadcast context. Think of the broadcast context as a ThreadLocal that is send along with every request through an RPC graph - from microservice to microservice. Finagle itself uses this internally, for instance to send a unique trace id along with each request. In my implementation, I have used the broadcast context to allow for a per-request log level override. Let's get our hands dirty! The first thing you want to implement is a new Key that Finagle can send over the RPC graph.

Essentially each Key need to implement two methods marshal and unmarshal, so that Finagle knows how to convert the Key from and to a Byte Array. I am not sharing this code here, but if you want to see how to unit test your code, Finagle has an example. No that we have a class for the log level override defined, we need code to set the override into as well as code to read the override from the broadcast context.

In most system architectures you have one system on the outside of your cluster. Here at DICE we call this system the gateway and it is the only service that is accessible from the public internet. All requests arrive at the gateway and it is the root node in the RPC graph. In other words, the gateway calls other microservices, which might call other microservices and so on. The most logical choice to define a log level override would be inside a Finagle Filter. I haven't actually written the Filter yet but it would look similar to this.

You have to be very careful with the Filter as this code is executed for every request entering your system! Now that we have code to set a log level override into the broadcast context, let's actually use it somewhere. To make this a seamless as possible for the developers, it is helpful if all your microservices share the same logging setup. For instance do we use slf4j with logback and the LoggerContext is set up programmatically inside a trait that every microservice is using (btw. our services follow the Twitter Server template).

As you can already guess now, reading from the broadcast context and actually using the override is wrapped inside a logback TurboFilter. Logback consults the filter for every log event and you can use this to decide if something should be logged or not. The following filter reads from the broadcast context and then makes a decision based on a potential override.

Conclusion: you can use Finagles broadcast context to transport a log level override through an RPC graph. You need some service to set the override in the context. It is helpful if this system is on the outside of your architecture and preferably uses HTTP. With HTTP it is easy to write a Finagle Filter and base the override on the HTTP request, i.e. by looking at the HTTP headers. Finagle transports the override magically through your RPC call graph and any microservice can use the override for logging decisions. To make this as simple as possible encapsulate this decision logic in some code that is shared between all your microservices.