Analyzing the EF Core Query Cache
13 Mar 2023As I mentioned a couple of times, we at Swiss Post strive for a Cloud-first approach and embrace various techniques and tools to make that happen. But sometimes, these tools can guide you into a completely wrong direction if not understood properly - as it happened to me last week 😅
Here’s what happened
All of our applications are Docker-ized and run on K8s. To get an easy overview of the applications health, .NET metrics are collected and visualized in Grafana. Here’s what a typical Grafana dashboard looks like for one of our applications:
It contains a lot of .NET EventCounter
s that you can find here and we do have metrics for EF Core as well:
In the lower left corner of the previous screenshot, you see the Compiled Query metric, referring to the EF Core Query Cache Hit Rate which is specified as following (taken from here):
The ratio of query cache hits to misses. The first time a given LINQ query is executed by EF Core (excluding parameters), it must be compiled in what is a relatively heavy process. In a normal application, all queries are reused, and the query cache hit rate should be stable at 100% after an initial warmup period. If this number is less than 100% over time, you may experience degraded perf due to repeated compilations, which could be a result of suboptimal dynamic query generation.
Well, I think it’s safe to say that 21.5% is not very close to 100% 😉 since the docs do not state how to analyze this kind of problem, I asked on Stack Overflow, giving me some very helpful feedback.
Log query compilation
EF Core has a ton of additional log statements which are not enabled by default. One of this is CoreEventId.QueryCompilationStarting
which gets logged when an EF Core query needs to be (re)compiled.
I’ve enabled the log statement by adding the following line when configuring EF Core:
Since it’s an Info
log, appsettings.json
needs to be updated as well:
When deploying this change, the following log statements appeared in our logging platform Splunk:
To obtain a meaningful result, I left the application alone overnight and looked at the logs the next morning. I’d expected certain queries to be recompiled over and over again, causing the bad EF Core Query Cache Hit Rate metric. But to my surprise, all queries were compiled only once, so I had to go one step further.
Custom ICompiledQueryCache
Another hint I received on Stack Overflow was to implement my own query cache and see what’s going on inside. I just copy and pasted the original sources from GitHub and extended it with some log statements:
With these additional logging messages:
Finally, the existing CompiledQueryCache
needs to be replaced with the custom implementation:
Let’s take it apart one by one:
- The custom
LoggingCompiledQueryCache
implements the same behavior as the originalCompiledQueryCache
. LoggingCompiledQueryCache
logs each time if the underlying_memoryCache
is either hit or missed viaQueryCacheHit()
orQueryCacheMiss()
.
With these information at hand, I could calculate my own query cache hit rate in Splunk:
And that was one of these moments when you should leave your notebook, drink a hot cup of coffee and ask yourself why you’ve chosen this job - why is the calculated value 99.9% but Grafana displays 21.5%? 🤯
Analyzing the metric in Grafana
When taking a closer look into the data in Grafana, I noticed a lot of of -1
values:
That felt quite odd to me - where are these values coming from? The answer lies in the EF Core code for calculating this metric:
If the cache hasn’t been hit at all (e. g. no traffic), -1
is reported to avoid a division by zero. But let’s calculate an average made up of five datapoints:
((-1) + (-1) + (-1) + (-1) + 100) / 5 = 19.2
Due to all the -1
values, the average query cache hit rate gets pulled down to 19.2%
- and that perfectly describes my initial problem: the affected app runs only every couple of minutes, the other time it is idle. During idle times, the metric reports -1
, thereby bringing down the average.
Conclusion
There are multiple solutions:
- Exclude the
-1
values in Grafana. - Implement a custom
EventListener
and remove the-1
before writing the metric at all.
We decided for the second way, but I think it’s really up to the use case.
For me it was a super interesting story to see how EF Core works under the hood, how you can reveal internal behavior by adding more logs and even replace internal components like the CompiledQueryCache
due to .NET’s modular approach - and to take a closer look into the reported metric values 😉
Thanks for reading and have a great time!