2 hours ago by MaxBarraclough

A quick ramble on the article's point about using a hybrid approach:

As the article says, hybrid algorithms can make good sense: use a clever algorithm to break a large problem into many small problems, and a simpler algorithm to solve the small problems. In sorting, that means using something like quicksort to break down large problems, then bubble sort to finish off the resulting small ones. In matrix multiplication, it means using the mind-bending Strassen algorithm (which has better asymptotic complexity than the naive algorithm) to break down multiplications of large matrices, then the naive algorithm on the resulting small matrix products. [0]
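
For concreteness, a hybrid sort in that spirit might look like the sketch below (C++; the cutoff of 16 and the three-way std::partition pivot handling are my assumptions, not values from the article):

    #include <algorithm>
    #include <utility>

    // Cutoff below which we fall back to the simple quadratic sort.
    // 16 is a placeholder; real values are tuned per machine.
    constexpr int kCutoff = 16;

    // Simple algorithm for the small ranges: plain bubble sort on [lo, hi).
    void BubbleSort(int* lo, int* hi) {
      for (int* end = hi; end - lo > 1; --end)
        for (int* p = lo; p + 1 < end; ++p)
          if (p[0] > p[1]) std::swap(p[0], p[1]);
    }

    // Clever algorithm for the large ranges: quicksort partitioning,
    // recursing on the smaller side to bound stack depth.
    void HybridSort(int* lo, int* hi) {
      while (hi - lo > kCutoff) {
        int pivot = lo[(hi - lo) / 2];
        // Three-way partition: [< pivot | == pivot | > pivot]. The middle
        // band always contains the pivot, which guarantees progress.
        int* mid1 = std::partition(lo, hi, [pivot](int x) { return x < pivot; });
        int* mid2 = std::partition(mid1, hi, [pivot](int x) { return x == pivot; });
        if (mid1 - lo < hi - mid2) { HybridSort(lo, mid1); lo = mid2; }
        else { HybridSort(mid2, hi); hi = mid1; }
      }
      BubbleSort(lo, hi);
    }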

The threshold for switching over to the clever algorithm increases year on year as hardware improves: clever algorithms tend to fare worse on branch prediction and cache behaviour.

Apparently a similar thing occurs in large integer multiplication. [1] I'm sure there are plenty of other examples besides.

Related: Of all the known matrix multiplication algorithms, the ones with the best time complexity are never used in practice, as their constant factors spoil things. [2]

[0] https://en.wikipedia.org/wiki/Strassen_algorithm#Implementat...

[1] https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strasse...

[2] https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

36 minutes ago by chubot

Hm interesting, so is there any analysis / theory that will capture branch mispredictions?

e.g. something like big-O, but for a machine model closer to real hardware.

I feel like CPU cycles should already be analyzed separately from memory accesses (although I guess they are usually proportional). But analyzing branch mispredicts seems harder.

How does this hold up on ARM and other architectures (RISC-V, etc.)?

3 hours ago by kazinator

In what application are you going to sort an array of a pure scalar type that is native to the machine?

Usually you're sorting pointers to objects, and there is some indirection required to compare them (perhaps an outright call through a function pointer).

Here is one good application: ordering an array of pointers by increasing address. That doesn't require dereferencing them, and it is useful, e.g. if these are pointers into a heap we wish to compact.
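
A sketch of that case (the comparator reads only the pointer values themselves, so it sorts like a native integer type; casting through uintptr_t gives a well-defined total order in C++):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Order heap pointers by increasing address. No dereference is
    // needed anywhere: the pointer bits are the sort key.
    void SortByAddress(std::vector<void*>& ptrs) {
      std::sort(ptrs.begin(), ptrs.end(), [](void* a, void* b) {
        return reinterpret_cast<std::uintptr_t>(a) <
               reinterpret_cast<std::uintptr_t>(b);
      });
    }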

3 hours ago by haberman

Towards the end of the article, Gerben makes the point that the comparison is not on the critical path of his variant of QuickSort, so even if the comparison involves indirection, it can still proceed in parallel with the main loop of the algorithm:

> Combining these two tricks could lead to merge sort on par with the simple partition loop of QuickSort. However it's just a mitigation. If instead of sorting a simple primitive value we would sort pointers to a struct. The comparison operator would have an extra load which immediately adds to the critical path. QuickSort is immune to this, even rather costly comparisons do not influence the ability to make progress. Of course if the comparison itself suffers from frequent branch misses, then that will limit the ability to overlap different stages of the iteration in the CPU pipeline.

In the benchmark code for the blog, he includes a benchmark that has indirection to illustrate this, though these results were not included in the article itself. On my machine I get:

    $ CC=clang bazel build -c opt :all
    $ bazel-bin/bench_sort --benchmark_filter=BM_IndirectionSort
    -------------------------------------------------------------------------------------
    Benchmark                                              Time           CPU Iterations
    -------------------------------------------------------------------------------------
    BM_IndirectionSort<1, std_sort>                       64 ns         64 ns   10900000
    BM_IndirectionSort<1, std_stable_sort>                89 ns         89 ns    7900000
    BM_IndirectionSort<1, std_heap_sort>                 112 ns        112 ns    6300000
    BM_IndirectionSort<1, exp_gerbens::QuickSort>         26 ns         26 ns   27000000
    BM_IndirectionSort<1, HeapSort>                       96 ns         96 ns    7300000
    BM_IndirectionMergeSort<1>                            45 ns         45 ns   15500000

an hour ago by saagarjha

Andrei Alexandrescu's post, linked at the top, is a great read as well: https://dlang.org/blog/2020/05/14/lomutos-comeback/

2 hours ago by ufo

I'm surprised that they found a legit use case for bubble sort. I always had the impression that bubble sort had very little going for it, since the naive implementation is much slower than insertion sort while also not being as intuitive. https://users.cs.duke.edu/~ola/bubble/bubble.html

Out of curiosity, does anyone know how selection sort would compare to bubble sort and insertion sort in this kind of problem?
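
For concreteness, the two look roughly like this on a raw int range (an untuned sketch):

    #include <utility>

    // Insertion sort: work is data-dependent; each element shifts
    // right only as far as it needs to.
    void InsertionSort(int* lo, int* hi) {
      for (int* i = lo + 1; i < hi; ++i) {
        int v = *i;
        int* j = i;
        while (j > lo && j[-1] > v) { *j = j[-1]; --j; }
        *j = v;
      }
    }

    // Selection sort: a fixed number of comparisons regardless of the
    // input, but only n-1 swaps (writes), unlike bubble sort.
    void SelectionSort(int* lo, int* hi) {
      for (int* i = lo; i < hi; ++i) {
        int* min = i;
        for (int* j = i + 1; j < hi; ++j)
          if (*j < *min) min = j;
        std::swap(*i, *min);
      }
    }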

an hour ago by megameter

I wonder if they tried shell sort? IME it's so close to bubble sort in implementation that you can view it as a trivial optimization.
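
For reference, a textbook shell sort is roughly the following (a sketch using the simple n/2, n/4, ... gap sequence; better sequences exist, and at gap == 1 it degenerates to a plain insertion sort):

    #include <cstddef>

    // Shell sort: insertion sort run at shrinking gaps, so elements
    // can move long distances early and cheaply.
    void ShellSort(int* a, std::size_t n) {
      for (std::size_t gap = n / 2; gap > 0; gap /= 2) {
        for (std::size_t i = gap; i < n; ++i) {
          int v = a[i];
          std::size_t j = i;
          while (j >= gap && a[j - gap] > v) {
            a[j] = a[j - gap];
            j -= gap;
          }
          a[j] = v;
        }
      }
    }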

4 hours ago by rubber_duck

Great article. This isn't my domain, so the results are unexpected, and the fact that a domain expert working on the problem can miss stuff like this reinforces something I see over and over when it comes to optimization: forget your assumptions, intuition, and best practices. Everything is context-sensitive (size of data, specific hardware) and gets obsoleted by hardware evolution. The only way to optimize is to measure to identify bottlenecks after you have working code, then measure again to confirm you actually succeeded. Don't optimize in advance.

3 hours ago by mchaynes

And measure. Always measure. Without data it's all posturing.

3 hours ago by trhway

The opposition usually sounds like this: "Microbenchmarks lie." On one of my past projects, the leading engineers and architects loved that phrase and voiced it like a profound truth, understanding of which was a hallmark of belonging to their circle of the chosen. Any specific measurements showing issues with the supposedly brilliant architecture and implementation choices of the project (with many of these choices following the latest "best practices" fads published by some venerable technologists) were dismissed with that phrase as if by magic. Naturally, it was one of the slowest, not to mention buggiest, projects in my experience, and working in the enterprise space I've seen my share of slow and buggy.

2 hours ago by stephc_int13

Microbenchmarks are not lying, but they can be misleading...

The most frequent problem is selecting the fastest solution via a microbenchmark and thereby creating a lot of unnecessary cache eviction/pollution, sometimes in the code cache, sometimes in L1/L2, or both...

Microbenchmarks are good to explore the optimization space, but the final decision should be made with the full execution context in mind.

an hour ago by oxxoxoxooo

Josh, Gerben!

Have you tried a sorting network instead of the bubble sort?
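
For concreteness, a 4-input network is just a fixed sequence of compare-exchange steps with no data-dependent branches (a sketch; std::min/std::max on ints typically compile to conditional moves):

    #include <algorithm>

    // Branch-free compare-exchange: after the call, a <= b.
    inline void CompareExchange(int& a, int& b) {
      int lo = std::min(a, b);
      int hi = std::max(a, b);
      a = lo;
      b = hi;
    }

    // Optimal 5-comparator sorting network for 4 elements.
    void SortingNetwork4(int v[4]) {
      CompareExchange(v[0], v[2]);
      CompareExchange(v[1], v[3]);
      CompareExchange(v[0], v[1]);
      CompareExchange(v[2], v[3]);
      CompareExchange(v[1], v[2]);
    }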

8 minutes ago by gerbenst

No, and I suspect that there is room for significant improvement here.
