Why So Slow? Using Profilers to Pinpoint the Reasons of Performance Degradation
When you're implementing a new feature or switching to a different library, your application may start to underperform. We just ran into that exact problem, but with the combination of profilers, we were able to identify and reduce the bottleneck we introduced. In this article, we would like to show you how we pinpointed the bug that caused our performance issues so you know how to approach the problem if you ever run into it.
Boost.Asio is the way!
As you might already know, you can connect to Memgraph using Bolt protocol, which you can think of as the HTTP protocol for querying graphs stored in Memgraph. Partially because of our WebAssembly client and partly because of the increasing popularity of WebSocket, we decided to provide an option for our users to use Bolt over WebSocket to run queries against Memgraph. But first, we had to face the elephant in the room: our handwritten Bolt server. It was completely handwritten from scratch on top of the POSIX API. Supporting WebSocket would mean implementing everything for it, for example, parsing HTTP requests or encoding/decoding WebSocket frames. Ancient developer wisdom advises never to write your own cryptographic library, so the first thing in our minds was: We shouldn't write our own WebSocket library. In Memgraph 2.2 we released a monitoring server via WebSocket, which was written using the very popular Boost.Asio library. It was our experiment with Boost.Asio and WebSocket. As you might guess, based on Memgraph 2.3 and this post, we were happy with the result, so we opted to use Boost.Asio to rewrite our Bolt server.
And that's when our problems started. We were newbies at using Boost.Asio in a production environment, so it took us some time to learn the dos and don'ts about the library and implement the same functionality as we have with our previous server implementation. We made it work! All of our tests were passing, and we were happy... But then we checked the benchmarks: the performance of Memgraph had plummeted. We knew that we had to compare our new code with the old implementation, but we hadn't noticed anything strange. Because the code comparison was a dead end, we decided it was time -- dramatic music intensifies -- to reach for profilers.
I think it's important to mention something here: none of us is really an expert in profiling. All of us know its basic principles and might have used profiling to fix some minor issues, but we're far from regular users. Please read on keeping this in mind.
First, we used Callgrind to get some results. Using Callgrind massively slowed down the execution. In some cases, it was almost 20 times slower than the unprofiled version. Callgrind is based on Valgrind, which uses dynamic binary instrumentation to inspect and control different aspects of applications, causing a significant slowdown. We used KCachegrind to interpret and visualize the reports generated by Callgrind.. Sadly, the results didn't pinpoint what caused the issue, so we had to think even harder. The costs of most functions (estimated CPU cycles spent in a function) were very similar, and the networking-related functions (mainly
send) were called the same number of times. Based on these results, we assumed that one of the functions was taking more time than in the previous implementation, and that's what's causing the slowdown.
Because we couldn't identify the issue using the results we got from Callgrind, we reached for another profiler, gperftools. It's a sampling profiler and therefor it has a smaller impact on the application's performance in exchange for less accurate call statistics. After filtering out the unimportant parts and visualizing the rest with pprof, it was evident that something strange was happening with the
send function. It took only 71 milliseconds with the previous implementation and more than 900 milliseconds with the new implementation of our Bolt server. It was very suspicious, but based on Callgrind, its cost was almost the same as before. We were confused as the two results seemed to conflict with each other.
Which profiler is correct?
To fully understand the results, you must understand how the profilers we used work. Let's start with Callgrind. As we mentioned, it uses dynamic binary instrumentation. You might think it modifies the machine code to notify Valgrind whenever an interesting event happens (function call, exception, etc.). In reality, it does so much more. First, it translates the program to its own internal representation (IR), which is simpler than raw machine code. Then it performs all kinds of transformations on the IR form and finally translates the IR back into machine code. Voilà! Using this technique, Callgrind can provide accurate call statistics (exactly how many times a function is called from which function, etc.), but the instrumented application will be much slower than the original one. As you can imagine, this kind of application instrumentation cannot intrude on the kernel binaries. Thus the results related to the
send function are wildly inaccurate.
On the other hand, gperftools uses a different way of profiling. It periodically generates a
SIGPROF signal, and whenever the signal arises, gperftools saves the actual stack trace of the application. In the time between two samplings, it doesn't affect the program's execution too much. It consumes some memory to store necessary information, but it doesn’t consume CPU at all.
The answer to which profiler is correct is both, but you must know how they work in order to interpret the results correctly.
How to interpret the results
Because of Callgrind results, we knew
send is called the same number of times in both the previous and current implementation of the Bolt server. From the gperftools results, we knew that Memgraph spent much more time in
send. This means that, on average, a single call to send takes a lot more time. We focused our attention on how
send was handled, and we finally noticed what we missed. As we were aware of our response’s content and length, we originally set
TCP_NODELAY on the TCP socket to disable automatic buffering of outgoing packages and used the
MSG_MORE flag appropriately to signal when the response should be buffered or sent. We instructed
send to buffer when there is more data to send, but send the package instantly when the end of the response has been reached.
When we tried to replicate this behavior with our new server, we missed setting the
MSG_MORE flag. However, the
TCP_NODELAY was set properly, which means every call to
send resulted in an actual package sent over the wire, even the smallest one.
The solution is to handle
MSG_MORE flag properly. For that, we had to switch how we interact with the socket through Boost.Asio, namely, we had to use
boost::asio::basic_stream_socket::send instead of
boost::asio::write, because with the former one, we could set
MSG_MORE just as we did with the previous server.
After this fix, we checked the performance results, and finally, in most cases, performance losses weren't noticeable. For queries with really high throughput, we still experienced some performance loss, but we decided that it's a good trade-off for now. We're hoping as we gain more experience with Boost.Asio, we will be able to get closer to our original performance and the capabilities of Boost.Asio will give us huge flexibility in the future regarding our server implementation. For example, it wraps different versions of OpenSSL we have to deal with in a unified interface.
So what have we learned from this?
First of all, we got another piece of evidence that all of our team members are humans because we made a mistake. Or maybe all of us are just super-intelligent androids who can pass the Turing test and purposefully make mistakes. I'll let you decide on that.
More importantly, we learned a lot about networking and profiling:
- Know your packages and their sizes! Correctly instrumenting the POSIX sockets is crucial to reaching high throughput in communication.
- Know your profilers! Combining Callgrind and gperftools provides a good combination of profilers. The former offers accurate call statistics and graphs, while the latter provides close-to-real timing information.
- C, C++, POSIX, and networking can still be challenging and fun!