Measuring Memory Usage in Rust
This post documents a couple of fun tricks we use in rust-analyzer for measuring memory consumption.
In general, there are two broad approaches to profiling the memory usage of a program.
The first approach is based on “heap parsing”.
At a particular point in time, the profiler looks at all the memory currently occupied by the program (the heap).
In its raw form, the memory is just a bag of bytes, Vec<u8>
.
However the profiler, using some help from the language’s runtime, is able to re-interpret these bytes as collections of object (“parse the heap”).
It then traverses the graph of objects and computes how many instances of each object are there and how much memory they occupy.
The profiler also tracks the ownership relations, to ferret out facts like “90% of strings in this program are owned by the Config
struct”.
This is the approach I am familiar with from the JVM ecosystem.
Java’s garbage collector needs to understand the heap to search for unreachable objects, and the same information is used to analyze heap snapshots.
The second approach is based on instrumenting the calls to allocation and deallocation routines.
The profiler captures backtraces when the program calls malloc
and free
and constructs a flamegraph displaying “hot” functions which allocate a lot.
This is how, for example, heaptrack works (see also alloc geiger).
The two approaches are complementary. If the problem is that the application does too many short-lived allocations (instead of re-using the buffers), it would be invisible for the first approach, but very clear in the second one. If the problem is that, in a steady state, the application uses too much memory, the first approach would work better for pointing out which data structures need most attention.
In rust-analyzer, we are generally interested in keeping the overall memory usage small, and can make better use of heap parsing approach. Specifically, most of the rust-analyzer’s data is stored in the incremental computation tables, and we want to know which table is the heaviest.
Unfortunately, Rust does not use garbage collection, so just parsing the heap bytes at runtime is impossible.
The best available alternative is instrumenting data structures for the purposes of measuring memory size.
That is, writing a proc-macro which adds fn total_size(&self) → usize
method to annotated types, and calling that manually from the root of the data.
There is Servo’s malloc_size_of
crate for doing that, but it is not published to crates.io.
Another alternative is running the program under valgrind to gain runtime introspectability. Massif and and DHAT work that way. Running with valgrind is pretty slow, and still doesn’t give Java-level fidelity.
Instead, rust-analyzer mainly relies on a much simpler approach for figuring out which things are heavy. This is the first trick of this article:
Archimedes' Method
It’s relatively easy to find out the total memory allocated at any given point in time.
For glibc, there’s mallinfo function, a similar API exists for jemalloc.
It’s even possible to implement a GlobalAlloc
which tracks this number.
And, if you can measure total memory usage, you can measure memory usage of any specific data structure by:
-
measuring the current memory usage
-
dropping the data structure
-
measuring the current memory usage again
The difference between the two values is the size of the data structure. And this is exactly what rust-analyzer does to find the largest caches: source.
Two small notes about this method:
-
It’s important to ask the allocator about the available memory, and not the operating system. OS can only tell how many pages the program consumes. Only the allocator knows which of those pages are free and which hold allocated objects.
-
When measuring relative sizes, it’s important to note the unaccounted-for amount in the end, such that the total adds up to 100%. It might be the case that the bottleneck lies in the dark matter outside of explicit measurements!
Amdahl’s Estimator
The second trick is related to the Amdahl’s law. When optimizing a specific component, it’s important to note not only how much more efficient it becomes, but also overall contribution of the component to the system. Making an algorithm twice as fast can improve the overall performance only by 5%, if the algorithm is only 10% of the whole task.
In rust-analyzer’s case, the optimization we are considering is adding interning to Name
.
At the moment, a Name
is represented with a small sized optimized string (24 bytes inline + maybe some heap storage):
1
2
3
struct Name {
text: SmolStr,
}
Instead, we can use an interned index (4 bytes):
1
2
3
struct Name {
idx: u32
}
However, just trying out this optimization is not easy, as an interner is a thorny piece of global state. Is it worth it?
If we look at the Name
itself, it’s pretty clear that the optimization is valuable: it reduces memory usage by 6x!
But how important is it in the grand scheme of things?
How to measure the impact of Name
s on overall memory usage?
One approach is to just apply the optimization and measure the improvement after the fact.
But there’s a lazier way: instead of making the Name
smaller and measuring the improvement, we make it bigger and measure the worsening.
Specifically, its easy to change the Name
to this:
1
2
3
4
5
struct Name {
text: SmolStr,
// Copy of `text`
_ballast: SmolStr,
}
Now, if the new Name
increases the overall memory consumption by N
, we can estimate the total size of old Name
s as N
as well, as they are twice as small.
Sometimes, quick and simple hacks works better than the finest instruments :).