6 hours ago by fpgaminer

Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations on the UI thread cause lag, which the example code demonstrates (see the sketch at the end of this comment). That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

* As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical: we do deallocation trickery in the real world.
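
For anyone who wants the gist in code, the whole trick boils down to something like this (a minimal sketch, not the article's exact code; the sizes are made up):

    use std::thread;

    fn main() {
        // Pathological case: one million separate heap allocations.
        let big: Vec<Vec<u8>> = (0..1_000_000).map(|_| vec![0u8; 16]).collect();

        let len = big.len(); // take what we need by value first

        // Move ownership to a background thread; the million frees
        // happen there instead of blocking this (e.g. UI) thread.
        thread::spawn(move || drop(big));

        println!("len = {}", len);
    }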

3 hours ago by ckcheng

>As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical: we do deallocation trickery in the real world.

This reminds me of the exploding ultimate GC technique [1]:

> on-board software for a missile...chief software engineer said "Of course it leaks". ... They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.

[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...

15 minutes ago by all-fakes

Yes, Java also has a garbage collector that does nothing, called the Epsilon GC, intended for short-lived programs and as a baseline for garbage-collector benchmarks.[0]

[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...

5 hours ago by rubber_duck

>As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical: we do deallocation trickery in the real world.

And then someone tries to use your compiler as a service (code analysis, change-triggered compilation) and it's a dead end.

4 hours ago by qzw

Well, then that's not the original use case anymore, and it'll have to be re-engineered. In the meantime it may have been used for years and the perf difference may have saved many developer-years collectively across its user base. Surely you're not suggesting that the compiler developers should be prematurely optimizing for future use cases that they may not even have envisioned.

4 hours ago by rubber_duck

Avoiding leaks is not optimisation, it's a matter of correctness. Not freeing memory is an optimisation based on a very short-sighted assumption, one that is not practical for any new language (modern languages are expected to come with language server support).

3 hours ago by pdimitar

I am suggesting they apply good practices. I'd never have imagined that compilers were actually doing what was described -- sounds awful.

I understand it's tradeoffs and we all have real-world limitations to contend with -- but again, of all the corners that could be cut that's exactly the one I didn't imagine they would.

Nasty.

4 hours ago by ycombobreaker

In a world where processes can fork-and-exec, nothing about "as a service" changes that. The compiler would just be reinvoked as needed. Converting it into a persistent process breaks a lot more than just allocation optimizations.

4 hours ago by rubber_duck

But you want to share state and only update state incrementally on edit to get any reasonable level of performance for stuff like language server code analysis.

4 hours ago by yrro

As distasteful as leaky code is, is it that bad to run it in a separate process? You get a bit more robustness against crashes as well.

4 hours ago by rubber_duck

You want to be able to incrementally update state to get performance out of incremental code analysis (e.g. a language server implementation for an IDE).

4 hours ago by hinkley

One of my "favorite" snags in perf analysis is that periodicity in allocations can misattribute the cost of allocations to the wrong function.

If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be charged to the code that calls me.

A solution to this that I've seen in soft real-time systems is to amortize cleanups across all allocations. Every allocation performs n steps of a cleanup process prior to receiving a block of memory. In which case most of the bad actors have to pay part of the cost of memory overhead.

Might be good for Rust to try something in that general realm, though the cleanup side may be easier to tack on. On free, set a ceiling for operations and queue what is left. That would at least shave the peaks.
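
A minimal sketch of the idea, assuming a hand-rolled deferred-drop queue (all the names are made up; a real version would live inside the allocator, and splitting one huge structure into steps would need allocator support):

    use std::collections::VecDeque;

    // Hypothetical queue of pending drops: trash goes in, and every
    // allocation or free pays for a bounded number of cleanup steps.
    struct DeferredDrops {
        queue: VecDeque<Box<dyn FnOnce() + Send>>,
        steps_per_call: usize,
    }

    impl DeferredDrops {
        fn new(steps_per_call: usize) -> Self {
            DeferredDrops { queue: VecDeque::new(), steps_per_call }
        }

        // Instead of dropping now, queue the drop for later.
        fn defer<T: Send + 'static>(&mut self, value: T) {
            self.queue.push_back(Box::new(move || drop(value)));
        }

        // Called on every allocation/free: performs at most
        // `steps_per_call` queued drops, amortizing the cost.
        fn amortize(&mut self) {
            for _ in 0..self.steps_per_call {
                match self.queue.pop_front() {
                    Some(drop_fn) => drop_fn(),
                    None => break,
                }
            }
        }
    }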

3 hours ago by rini17

Doing unrelated cleanup sounds like flushing the CPU cache on every allocation.

an hour ago by hinkley

This post is already about unrelated cleanup, so I'm not sure what other escape hatch you imagine people taking. Do you have a suggestion?

You can tell that it's unrelated cleanup because if it were related, then the cost of freeing wouldn't be noteworthy. It would be cache hot because you would be visiting it for the second time. In which case we'd be talking about why you are scanning a giant object on an event loop in the first place. That's not what's at issue. What's at issue is that you've been handed this great bomb of uncached data from someone else and now you're stuck doing the janitorial work.

Freeing an object of arbitrary size is effectively an unbounded operation. Cache invalidation has a very high cost, sure, but it's still bounded.

By putting a limit on the amount of work you do, you could stop before purging the entire cache. You could use a smaller limit when doing work on someone else's behalf and control invalidation there, too.

6 hours ago by papaf

This deallocation trick is neat but in C and C++ you could use a memory pool to do this.

In theory, you could also use a memory pool in Rust, but I think the standard library uses malloc without any way of overriding this behaviour.

5 hours ago by vvanders

You can also use the typed-arena crate[0], or roll your own if you feel like cracking open unsafe.

[0] https://crates.io/crates/typed-arena
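
Usage is roughly like this (a sketch; the Node type is just for illustration):

    use typed_arena::Arena;

    struct Node { value: u32 }

    fn main() {
        let arena: Arena<Node> = Arena::new();
        // alloc() takes &self, so many live references can coexist.
        let a = arena.alloc(Node { value: 1 });
        let b = arena.alloc(Node { value: 2 });
        println!("{} {}", a.value, b.value);
        // One bulk deallocation when `arena` drops, not one per node.
    }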

9 minutes ago by all-fakes

Typed-arena is a bit limited, because all the allocated values must be of the same type. bumpalo[0] removes that restriction, but it also doesn't run destructors, which can lead to memory and other resource leaks. Another worry I have with bumpalo is that I don't think it's been reviewed thoroughly for unsoundness.

Still, bumpalo has been used to great effect in dodrio[1], a React-like library for Rust with really good performance.

[0]: https://crates.io/crates/bumpalo

[1]: https://github.com/fitzgen/dodrio
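
For reference, a bumpalo sketch (illustrative only):

    use bumpalo::Bump;

    fn main() {
        let bump = Bump::new();
        // Unlike typed-arena, values of different types share the arena.
        let x = bump.alloc(42u64);
        let s = bump.alloc_str("hello");
        println!("{} {}", x, s);
        // Dropping `bump` frees the backing chunks in one go, but note:
        // destructors of the values inside are not run.
    }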

5 hours ago by orf

You can change the global allocator in any Rust project. You can write your own easily enough, or use an existing one like jemalloc.

31 minutes ago by josephg

Sure; but when you're using arenas and things like that, you usually want different objects allocated into different pools (or with different lifetime properties). Rust only lets you pick one allocator for the entire process, so you can't specify "all the children of this data structure go in arena A, and this other allocation goes into a traditional heap".

It's more awkward, but I much prefer Zig's approach here, where everything that allocates takes an allocator as a parameter. Usually the allocator is specified at compile time, in which case Zig can generate identical code to the Rust compiler. But when you want the flexibility, it's there.

Aside from compilers, this is heavily used in video games, where there are often a lot of objects that get allocated per frame and can be discarded all together. And in that case Rust's lifetime tracking would be a huge asset. The dovecot email server (C) also makes superb use of a mix of memory containers for different tasks. Given how messy email parsing is, dovecot is an absolute pleasure to read.

33 minutes ago by francasso

I don't get this. Usually I either don't care about how stuff is allocated, or when I care I want to be specific about which allocator is used where. In the second situation just setting a global allocator wouldn't cut it most of the time; I'd want support for customization when I instantiate a data structure.

5 hours ago by bauerd

I think with LD_PRELOAD it's always possible to override the allocator?

3 hours ago by wongarsu

In a toy raytracer I once wrote in C++, switching from malloc to custom memory pools for small fixed-size objects was a big performance boost. Making free() a no-op was another big performance boost, both for deallocation and allocation. It turns out sequentially handing out memory from a big chunk is much easier than keeping track of and reusing empty slots, and it keeps sequentially allocated objects in sequential memory locations.
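
The sequential scheme really is that simple. A rough sketch in Rust (index-based to stay in safe code, and simplified: each alloc exclusively borrows the pool):

    // Minimal bump "allocator" over one big buffer: allocation is an
    // index bump, and free is a no-op.
    struct BumpPool {
        buffer: Vec<u8>,
        next: usize,
    }

    impl BumpPool {
        fn with_capacity(size: usize) -> Self {
            BumpPool { buffer: vec![0; size], next: 0 }
        }

        // Hand out the next `size` bytes, or None if exhausted.
        fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
            if self.next + size > self.buffer.len() {
                return None;
            }
            let start = self.next;
            self.next += size;
            Some(&mut self.buffer[start..self.next])
        }

        // There is no free(): the whole pool is reset or dropped at once.
        fn reset(&mut self) {
            self.next = 0;
        }
    }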

2 hours ago by folmar

jemalloc is the answer to the question you did not ask

2 hours ago by jomohke

You can easily use a custom global allocator in Rust:

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;
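
where MyAllocator implements the GlobalAlloc trait. A minimal sketch that just forwards to the system allocator (a real one would add pooling, accounting, etc.):

    use std::alloc::{GlobalAlloc, Layout, System};

    struct MyAllocator;

    unsafe impl GlobalAlloc for MyAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            // Custom logic would go here.
            System.alloc(layout)
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            System.dealloc(ptr, layout)
        }
    }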

6 hours ago by chowells

This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.

There's actually a lot to be said for Rust in that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply, so it doesn't block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with memory management.

6 hours ago by Jasper_

One of my favorite papers by Bacon et al expands on this intuition that garbage collection and reference counting are opposite tradeoffs in many respects, and gives a formal theory for it. My views on gc/rc haven't been the same since.

http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

5 hours ago by pron

That's a great paper, but one important thing to point out is that some production-grade tracing GCs are on the sophisticated end of that paper, while almost all reference counting GCs are on the rather primitive end. Given the same amount of effort, it's easier to get a reasonable result with reference-counting, but there are industrial-strength tracing GCs out there that have had a lot of effort put into them.

3 hours ago by jeffdavis

"Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant."

It takes time proportional to the live set times the number of GC runs that happen while the objects are alive. In other words, the longer the objects live, the more GC runs have to scan them (assuming there is enough activity to trigger the GC), and the worse GC looks.

6 hours ago by arcticbull

> This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

That doesn't seem to make intuitive sense. A GC has the same problem.

A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) is part of the live set or not, and to invoke finalizers. You're beginning your comparison after the mark step, which isn't a fair assessment, since what Rust is doing is akin to both the mark and sweep phases.

The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

The difference between a GC and this kind of memory management is that the GC does the traversal later, at some point, non-deterministically. Rust allows you to decide between deallocating it in place, immediately, or deferring it to a different thread.

6 hours ago by chowells

I said generational/compacting collector. You're talking about a mark and sweep collector.

A generational/compacting collector traverses pointers from the live roots, and copies everything it finds to the start of its memory space, and then declares the rest unused. If there is 1GB of unused memory, it's irrelevant. Only the things that can be reached are even examined.

As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.

6 hours ago by arcticbull

How are finalizers invoked if the structure isn't traversed? Would the traversal just be optimized away if none of the objects have finalizers? Hence my suggestion that arena allocators are the better point of comparison.

5 hours ago by pron

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not

Yes, but in practice tracing in a tracing GC is done concurrently and with the help of GC barriers that don't require synchronization and so are generally cheaper than the common mechanisms for reference-counting GC.

> and to invoke finalizers

As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.

6 hours ago by Reelin

> The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

Isn't that incompatible with RAII though?

4 minutes ago by all-fakes

Yes. Some libraries, like typed-arena (and Rust's built-in Vec), will traverse the structure and call the drop implementations, and others, like bumpalo, will just leak the resources.

6 hours ago by arcticbull

You can handle this in Rust pretty neatly with lifetimes. There's a bunch of crates that do this. [1]

[1] https://crates.io/crates/typed-arena

4 hours ago by mcguire

Finalizers/destructors do not work well in garbage collected languages, for that very reason.

6 hours ago by loufe

I've not worked with any language thus far without automatic garbage collection, so this was definitely a neat read for me. It sounds rather elegant.

22 minutes ago by burpsnard

It's worth popping the hood and getting your fingers dirty. C was written in an era where memory was a scarce and precious resource to be grudgingly used if absolutely necessary

4 hours ago by littlestymaar

The title is slightly wrong: it's not going to make your code faster, it's going to reduce latency on the given thread.

It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost: modern allocators have thread-local memory pools, and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, and you risk running out of memory (especially because your deallocations now occur more slowly than they should).

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.

6 hours ago by cesarb

Just be careful, because moving heavy things to another thread to be dropped can change the semantics of the program. For instance, consider what happens if within that heavy thing you had a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed at some random moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.

And it can even be worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that memory can also be a limited resource).
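
One way to keep the write deterministic and still defer the bulk free might look like this (a sketch; the function name is made up, and note the file descriptor still closes late on the background thread, so this only helps when memory is the scarce resource):

    use std::fs::File;
    use std::io::{BufWriter, Write};
    use std::thread;

    fn finish(mut writer: BufWriter<File>) -> std::io::Result<()> {
        // Flush here, deterministically, so write errors surface now.
        writer.flush()?;
        // Then let a background thread pay for the deallocation.
        thread::spawn(move || drop(writer));
        Ok(())
    }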

6 hours ago by lostmyoldone

I only know a very little Rust, but since it's generally good practice to never defer writing (or other side effects) to an ambiguous future point in time -- with memory allocations as the only plausible exception -- is there any way in Rust to make sure one doesn't accidentally move complex objects with drop side effects into other threads?

Granted the way the type system work you usually know the type of a variable quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.

5 hours ago by masklinn

> I only know a very little Rust, but since it's generally good practice to never defer writing (or other side effects) to an ambiguous future point in time -- with memory allocations as the only plausible exception -- is there any way in Rust to make sure one doesn't accidentally move complex objects with drop side effects into other threads?

If you're the one creating the structure, you could opt it out of Send; that'd make it... not sendable. So it wouldn't be able to cross thread boundaries. For instance Rc is !Send: you simply can not send it across a thread boundary (because it's a non-threadsafe reference-counting handle).

If you don't control the type, then you'd have to wrap it (newtype pattern) or remember to manually mem::drop it. The latter would obviously have no safety whatsoever, the former you might be able to lint for I guess, though even that is limited or complicated (because of type inference the problematic type might never get explicitly mentioned).
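
On stable Rust the opt-out can be done with a marker field, since raw pointers are !Send. A sketch (the wrapper name is made up):

    use std::marker::PhantomData;

    // Hypothetical wrapper: the raw-pointer marker makes the whole
    // struct !Send, so the compiler refuses to let it cross threads.
    struct DropOnThisThread<T> {
        value: T,
        _not_send: PhantomData<*const ()>,
    }

    impl<T> DropOnThisThread<T> {
        fn new(value: T) -> Self {
            DropOnThisThread { value, _not_send: PhantomData }
        }

        fn get(&self) -> &T {
            &self.value
        }
    }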

4 hours ago by the8472

Considering that writing files can also block the process, you probably don't want that in your latency-sensitive parts either, so you'll have to optimize it one way or another anyway.

For the more general problem, you can also dedicate more threads to the task or apply backpressure.

4 hours ago by usefulcat

It seems like the caller should ensure that the buffer is written before giving away ownership. Also, what happens if there is an error writing during finalization/destruction/etc? Seems like you'd want to find out about such errors earlier if at all possible.

7 hours ago by saagarjha

Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived use case, spawning another thread and letting the drop occur there would just be plain worse?

7 hours ago by Reelin

I think the contrived use case is just for illustrative purposes? If I'm understanding correctly, the combination of cleanup code and deallocation can sometimes consume enough time that it's worth dispatching it on another thread. That's hardly specific to Rust though.

As you note that will certainly add some overhead, although that could be minimized by not spawning a fresh thread each time. It could easily reduce latency for a thread the UI is waiting on in many cases.

6 hours ago by tedunangst

It would be helpful to see an example from a real application, too.

5 hours ago by masklinn

A very large Vec<String> (say, a few million non-empty strings) would do, I'd guess; Rust would drop the Vec, which would recursively drop each String.

7 hours ago by epage

I believe this is contrived to prove a point.

And this isn't just a help in these contrived examples. I believe process cleanup (an extreme case of cleaning up objects) is one of the cases where garbage collection performs better, because it doesn't have to unwind the stack, call cleanup functions that are not in the cache, and make a lot of `free` calls to the allocator.

I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.

Now this doesn't mean you should do this in all cases. Profile first, see if you can avoid the large objects, and then look into deferred de-allocations ... if the timing of resource cleanup meets your application's guarantees.

6 hours ago by seventh-chord

Killing a process without freeing all allocations is, as far as I can tell, routine in C. Especially for memory it makes no sense to "free" allocations; the whole memory space is getting scrapped anyway. Of course, once you add RAII the compiler can't reason about which destructors it can skip on program exit, and if programmers are negligent of this you get programs that are slow to close.
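
In Rust the equivalent escape hatch would be std::process::exit, which terminates without running the destructors of live locals (a sketch):

    fn main() {
        let big = vec![vec![0u8; 16]; 1_000_000];
        println!("len = {}", big.len());

        // Skip the million frees entirely: destructors of live locals
        // don't run, and the OS reclaims the address space at once.
        std::process::exit(0);
    }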

5 hours ago by estebank

> Killing a process without freeing all allocations is, as far as I can tell, routine in C.

Many times by accident :)

> if programmers are negligent of this you get programs that are slow to close.

I wouldn't call that negligence, just not fully optimized.

5 hours ago by gpderetta

exit(2) will only call destructors of static objects. quick_exit not even those.

6 hours ago by Reelin

> killing processes rather than having them clean up correctly, relying on the OS

I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.

7 hours ago by pjmlp

Not at all; Herb Sutter has a CppCon talk about this kind of optimisation.

It is also the approach taken by C++/WinRT: COM and UWP components get moved onto a background cleanup thread, to avoid application pauses when complex data structures reach a zero reference count.

7 hours ago by nickm12

I took this to be a contrived example to illustrate the point. I could imagine a process that creates a big data structure (e.g. parse an xml file), pulls some data out, and then drops the data structure. If you want to use that data sooner, you can push the cleanup off your thread.

3 hours ago by pierrebai

I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...

They all have one thing in common: papering over a bad design.

In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated GitHub repo:

https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/

https://github.com/pierrebai/FastTextContainer

Roughly it is this:

    struct TextHolder
    {
        const char* common_buffer;                  // one big allocation holding all the text
        std::vector<const char*> internal_pointers; // views into common_buffer, not owned
    };

This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.

7 hours ago by epage

For those wanting a real world example where this can be useful:

I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to reduce this with partial updates, but that isn't always possible). Moving that cleanup to a thread would make "watch" more responsive.

5 hours ago by elcomet

That's not really the same issue that is mentioned in the article though, is it?

The issue from the article would be solved by just passing a reference to the variable.

In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you ?

6 hours ago by firethief

Why can't it cleanup right after the work?

5 hours ago by pmontra

Or no cleanup at all. A CLI command that runs for a very short time can allocate memory to perform its job, print the result and exit. Then the OS releases all the memory of the process. No idea if Rust can work like this.

4 hours ago by ReactiveJelly

"Watch mode" for static site gens would mean you leave the process running and let it rebuild the site whenever a file changes, probably 10s to 100s of times in a typical run

5 hours ago by estebank

std::mem::forget, which doesn't run destructors:

https://doc.rust-lang.org/std/mem/fn.forget.html
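
E.g., a minimal sketch:

    fn main() {
        let big = vec![vec![0u8; 16]; 1_000_000];

        // Leak it deliberately: no destructors run and no frees happen;
        // the OS reclaims everything when the process exits.
        std::mem::forget(big);
    }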

3 hours ago by jeffdavis

Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to Rust, deallocations, or the number 10000.

The last point is worth discussing, but it's still not really related to the number 10000 and only barely related to Rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages use a more deferred strategy (like many GCs).

It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations: they are all dead (unreachable from the roots), so it just discards them. But that's not the full story either.

It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.

The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.

In Rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there were a slightly better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
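
A sketch of the slices-into-one-buffer idea for strings (all names made up):

    // All string data lives in one allocation; the "strings" are
    // just byte ranges into it.
    struct StringArena {
        buffer: String,
        spans: Vec<(usize, usize)>, // (start, end) byte offsets
    }

    impl StringArena {
        fn new() -> Self {
            StringArena { buffer: String::new(), spans: Vec::new() }
        }

        fn push(&mut self, s: &str) {
            let start = self.buffer.len();
            self.buffer.push_str(s);
            self.spans.push((start, self.buffer.len()));
        }

        fn get(&self, i: usize) -> &str {
            let (start, end) = self.spans[i];
            &self.buffer[start..end]
        }
    }

Dropping a StringArena is two frees, no matter how many strings it holds.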
