United Kingdom: +44 (0)208 088 8978

High Performance Collections in F#

Working with the right data structure in F# is important in order to achieve optimal performance. Isaac Abraham talks us through some of the key considerations when making a decision as to which one is the right tool for the job.

As the first more detailed post in this series on performance, I want to look at collections and one of the most common slow performing pattern we see with customers time and time again.

Which collection should I use?

First, let's do a quick recap. There are several factors which can affect which data structure best fits - and picking the correct one will generally have an impact on the performance we can expect. Here are some good questions to think about:

  • Does the collection need to maintain a sequential (but not necessarily sorted) order?
  • Do you need any uniqueness or de-duplication in the collection?
  • Are you always looking up by a single, specific field e.g. finding an order by ID, or are you going to be searching for data in many different ways?
  • How often are you modifying the collection?

There are a number of different collections in .NET and F#; I'm going to broadly split them up into two kinds:

  • Sequential collections, in which every item is placed sequentially, one after another. You can usually look up items very quickly but the position (or index) of the item in the collection, but if you want to look up based on some other criteria e.g. based on a field on a record, it'll take longer - and generally the time will scale as the collection grows in size.
  • Lookups, in which every value is accessed by some kind of key. There is no way to get the "first" or "last" item in the collection, because there's no notion of that - think of it more like a bag which contains everything in a hidden order. These collections are very quick at looking up items by the key.

Let's look at some collections that are popular in F# today.

Sequential Collections

  • Array - The built-in .NET (mutable) array collection.
  • ResizeArray - The built-in .NET (mutable) System.Collections.Generic.List<'T> collection.
  • Sequence - A type alias for the System.Collections.Generic.IEnumerable<'T> collection.
  • List - The F#-specific immutable linked-list collection. It offers immutable "copy-and-update" style syntax.

Lookups

  • Dictionary - An implementation of the System.Collections.Generic.IDictionary interface, created by calling the dict function on any sequence of tuples.
  • Read Only Dictionary - As above, except it returns an IReadOnlyDictionary created by calling the readOnlyDict function.
  • Map - The F#-specific immutable Map. Unlike the Dictionary, Map offers the ability to perform copy-and-update style modifications in an immutable fashion.

There are many variants of the above collections, either in the .NET BCL or as downloadable NuGet packages.

There is one final collection, the Set, which is a kind of hybrid of the two - it doesn't allow lookups, nor does it have any ordered sequence. However, it behaves like a sequence, allowing you to filter and find items. It also automatically guarantees that every item in the collection is unique, albeit based on their "comparison" rather than hash code, and allows "set-based" operations - so you can "subtract" one Set from another to get the elements missing from one.

High performing collections, slow performing code

I see lots of people talk about the different between arrays, sequences and lists in terms of performance (which is good to know) - but just as many fall foul of issues like the following:

// Get all customers
let customers = loadCustomers ()

// Get all orders
let orders = loadOrders ()

// For each order, find the parent customer and do something with the pair of them
for order in orders do
    let customer = customers |> Array.find(fun c -> c.CustomerId = order.CustomerId)
    printfn "Working with customer %A and order %A..." customer.CustomerId order.OrderId

Can you see the potential performance problem in this (admittedly arbitrary) example? Hidden away inside the for loop is a call to find

For those of you saying "use tryFind" - let's assume for the purpose of this sample we're not fussed about an customer not existing.

find, behind the scenes, simply iterates over every item in the collection of Customers until it finds the first matching one. Imagine we had 10,000 orders and 10,000 customers. Assuming on average that the customer we're looking for is mid-way through the collection, that's 10,000 * 5,000 = 50,000,000 iterations and comparisons. Fifty Million!

Lookups to the rescue

This is where a lookup comes in handy:

// Get all customers
let customers = loadCustomers ()

// Convert into an in-memory lookup keyed on CustomerId
let customerLookup =
    customers
    |> Array.map(fun c -> c.CustomerId, c)
    |> readOnlyDict

// Get all orders
let orders = loadOrders ()

// For each order, find the parent customer and do stuff with the pair of them
for order in orders do
    let customer = customerLookup.[order.CustomerId]
    printfn "Doing that stuff with customer %A and order %A..." customer.CustomerId order.OrderId

The key parts are that we create an in-memory IReadOnlyDictionary of all customers with CustomerId as the key, by creating a sequence of (key, value) tuples. Then, we can access the lookup using either indexer notation or built-in Map methods such as TryFind.

Benchmarking performance differences

The scenario I wanted to compare was this:

Given an already-constructed in-memory array of "lookup data", how much quicker is it to construct a specific hash-style lookup for repeated lookups - and at what point does it not become worthwhile?

To see how much the difference was between repeatedly doing finds on an array, and creating a dictionary and then using that for the lookups, I used Benchmark .NET - the code is available here. Benchmark .NET allows us to parameterise our benchmarks, so I could experiment with different combinations of Customers and Orders. In my case, I tested having 100, 1k, 10k or 100k customers and 1, 10, 100, 1k or 10k orders.

Firstly, to simulate the above scenario, I benchmarked the Array type, using pre-created collections of Customers and Orders. In other words, the benchmark only tests the ability to rapidly find customers based on a CustomerId.
Next, I compared this with three F# lookup-type collections:

  • An F# Map
  • An IReadOnlyDictionary created by using F#'s readOnlyDict function
  • An F# Set

All of these were benchmarked not only for their lookup performance but also the cost of creating the lookup using the pre-created array. This was to reflect the read-world extra cost that would be required in such a scenario (as per the second code example shown earlier).

Most performant Lookup

I first wanted to identify out of the three lookup collections above, which was the most performant for the scenario identified above. The results were clear:

We can infer from this chart several interesting pieces of information:

  1. The read-only dictionary is much faster than either Map or Set - at times, over 10x quicker.
  2. For this scenario, construction of the lookup from the source collection costs much more than the time to perform the lookups themselves.
  3. The number of lookups appears to have minimal effect on overall time taken - even for larger numbers of customers.
  4. For this scenario, Set and Map behave very similarly.

Dictionary vs Array

Now that we know this, let's see how the performance behaviour differs for these two scenarios:

  1. Given existing arrays of customers and orders, how quickly can an Array look up all the customers linked to those orders?
  2. Given existing arrays of customers and orders, how quickly can we construct a lookup of all customers, and then use that to look up all the customers linked to those orders?

Note: When you look at this graph, observe that it is using a logarithmic scale. I was forced to do this, because in the worst case, the Array was so much slower than the Dictionary that it made the rest of the graph completely useless.

What can we infer from this chart?

  1. We see an interesting pattern repeated for each "band" of customer size:

    • The first two sizes of Orders (i.e. number of lookups) are quicker with the Array.
    • For lookups greater than 10, the Dictionary becomes quicker.

    In other words, as long as you have relatively few lookups to perform - somewhere less than 100 - an array is quicker, because the cost of constructing the dictionary outweighs the benefits of the faster lookups.

  2. The performance of the array lookup scales proportionally to the size of the array. Given what we know about the way that find works, this is unsurprising but it's worth remembering: a collection 10x as large as another one will take 10x as long to find an element.
  3. As the number of customers grows the performance of the array becomes very slow in comparison to the Dictionary. In the worst case (10k orders and 100k customers), it takes the array nearly 200 times longer than the dictionary to perform the same number of lookups - 16.53ms vs 3.2 seconds.

Conclusion

Choosing the right collection data structure is important not only for correctness and ergonomics, but also has performance implications. We've seen here how, if you're performing more than just a few lookups, a Dictionary will outperform an Array - so the next time you're performing a find within a for loop, think again!

As with all artificial benchmarks, take this article with a pinch of salt. The benchmark uses simplistic data (strings) to test out several data structures. However, many factors can impact the performance of these data structures - integers behave differently to strings which behave differently to Records etc. The best benchmarks you should do are ones that work for you and your specific scenarios - there is no one size fits all.

Hope you found this useful - until next time, have (fun _ -> ()).

Isaac