United Kingdom: +44 (0)208 088 8978

Parsing Excel with F#

We're hiring Software Developers

Click here to find out more

This post is part of FSAdvent, to see all the other posts see https://sergeytihon.com/2018/10/22/f-advent-calendar-in-english-2018/

Every month at Compositional IT we have one member of the team present their findings on a new library, methodology or development practice internally as part of a brown bag session. We've had some really interesting sessions on a diverse range of topics which we might not otherwise have encountered. This month though, I got the opportunity to talk to the team about one of my favourite F# libraries, FParsec, and how we can apply it to some of the work we're doing on customer projects.

Spreadsheets are quite possibly one of the most common forms of programming used in businesses around the world with thousands of organisations worldwide relying on Excel to run critical parts of their process and so at Compositional IT, we've had quite a few projects which have relied on the use of Excel to allow us to integrate new solutions with existing systems and processes. Within the .Net ecosystem, there's plenty of libraries available which can write Excel spreadsheets or read Excel spreadsheets but the vast majority of these simply deal with the raw data of the spreadsheet. Since spreadsheets are user editable, it also means that users can happily insert forumulae which might not yet be processed. As part of our work, we've included Excel as part of our process to ensure all formulae are evaluated, but would it be possible to parse an Excel formula in F# and then evaluate it within F#. This is what I set out to look into as part of my brown bag session.

The Excel file format

Excel files (.xlsx) are really a pretty simple file type, it's just a zip file renamed with a fixed folder structure which contains XML files for worksheets and additional metadata about the workbook. Within these XML files, the formulae are stored in a string, exactly as you would write within the Excel UI. There's no additional data structures available, it's just a string. If we want to do anything with this string, we probably need it in a better data type that we can work with, something that gives us an overall structure. If we're to get there though, we first need to convert our string into this type. We can do this in a number of ways including basic string manipulations (like String.IndexOf and String.Split), RegEx and parser combinator libraries. Given that Excel formulae are of moderate complexity, I thought this would be a good fit for a parser combinator library like FParsec.

Installing FParsec

FParsec is distributed as a NuGet package through NuGet, so just use your preferred tool to download it, I'll be working through this in a script so I just need to reference the 2 FParsec DLLs in the package.

#r """packages\FParsec\lib\net40-client\FParsecCS.dll"""
#r """packages\FParsec\lib\net40-client\FParsec.dll"""

We'll also open the FParsec namespace.

open FParsec

And since I'm doing this in a script I'll also create an alias for the parser type.

type MyParser<'t> = Parser<'t, unit>

A parser can also include some user state which is the second type parameter, in this case unit. Since we're not using it at all, F# scripts won't be able to work out what type that is and we'll get a value restriction. By annotating our individual parsers with this type, we'll get rid of the type signature. If you're doing this as part of a compiled application then you don't need the type signatures.

The simplest parser

The simplest Excel formula you can reasonably have is =5 this simply states that we want to return a single value from the formula. For now we'll ignore the equals sign and just write a parser for this. Fortunately, FParsec has a parser for this in the form of the pint32 function. So we can define our simplest parser as follows:

let numberParser : _ MyParser = pint32

And we can now run that parser against a string like such:

run numberParser "5"

And we get the value 5 returned from that.

Parsing a quoted string

Within an Excel formula though we can also have strings. We'll define a string as being a list of characters enclosed between ' single quotes. To do that we're going to need a number of parsers and compose them together. To get the individual characters out of our string we need a string parser, this can be written as follows:

let stringParser : _ MyParser = manyChars (noneOf "'")

This says that we expect a string to have multiple characters, however, we can't have a single quote in our string. This is an unfortunate restriction but it simplifies this parser greatly for now. We also want to say that this parser should be wrapped within single quotes, so we can compose this with another parser, this time the between parser, this takes in a parser for the opening of the parser, a parser for the closing of the parser and a final parser which is the internal content of the parser.

let quotedStringStringParser : _ MyParser = between (pchar '\'') (pchar '\'') (manyChars (noneOf "'"))

Here we have a single quote as the opening parser, a single quote as the closing parser and many characters which aren't single quotes as the content of the parser. Now we can test this by running it against a string as follows:

run quotedStringStringParser "'Hello world'"

And this will then output the string "Hello world" making sure to strip away the enclosing quotes.

Combining parsers

Our next step is to create a parser which can parse either a string or a value, to do this we need a type which wraps the 2 possible types which can be handled with a discriminated union in F#. So we can create a discriminated union of either a string with a string value or a number of an integer value. We then use the |>> operator to transform the value returned by a parser. And finally we need a new parser which can handle either of these 2 cases through the use of the <|> function, which first tries to apply the parser on the left to the input string and if that succeeds it ignores the second parser. Otherwise if it fails then it tries the second parser. If the second parser fails then the whole parser has failed.

This leaves us something like the following:

type Value =
    | Number of int
    | String of string

let numberTypeParser : _ MyParser = pint32 |>> Number
let stringTypeParser : _ MyParser = between (pchar '\'') (pchar '\'') (many1Chars (noneOf "'")) |>> String
let valueParser : _ MyParser = numberTypeParser <|> stringTypeParser

We can then test this by running it against a couple of known input strings and seeing that the result we get out is equivalent.

run valueParser "5"
run valueParser "'Hello world'"

We're now able to parse a fixed value within our Excel formula.

Parsing an argument list

Before diving into the next section, let's see how we can parse out multiple occurrences of our individual parsers. This is to handle the situation whereby we have a list of arguments to a function. We might have the function =SUM(5, 7, 12, 13) etc, in which case we need to extract out all of the values for the different function arguments. We can do this by using the sepBy parser which will apply the value parser using a second parser to parse the separators. Using the valueParser from before we can use the sepBy function to extract multiple instances of it. Since this is a function call list we need our values to be enclosed within brackets and so we can once again use the between parser to extract the values between the and ) characters.

let repeatedValueParser : _ MyParser = sepBy valueParser (pchar ',' >>. spaces)
let repeatedValueWihinBracketsParser : _ MyParser = between (pchar '(') (pchar ')') repeatedValueParser

In this example, we can see the first argument to sepBy is our valueParser we created in the previous step and the second is a parser which can parse the comma separation. We then pass this parser into the between parser to extract out the values within our argument list. The ultimate return value of this parser is a list of either numbers or strings.

Parsing a cell reference

In Excel, as well as a hard coded value we can also refer to either individual cells or a range of cells. For example, we can say =SUM(A1:A500) or =SUM(A1, A2, A3), and so for this we need to be able to parse a cell reference. A cell reference can be either an individual cell reference or a range of cells made up of 2 cell references. We can model this with a discriminated union. We can then use a pair of parsers to parse out the 2 distinct options. However, we need to change the way we handle the 2 parsers.

In the case of a reference parser we first want to try and parse out a range of the form <CELL REF>:<CELL REF>, if this fails then we need it to try and parse out a single cell reference. However, parsers are greedy and will try to consume as much input as possible. For example, a parser for A2) will parse out the A2 but then encounter a ) not the expected :. The parser then has an output state different to the input state but will try and run the cell reference parser on ) which will fail. We can handle this scenario by instead using the attempt combinator. This wraps a parser such that if the wrapped parser fails then it will reset all of the internal state. So we then have a choice of parsers, either attempt the wrapped range parser or fall back to the individual cell reference parser.

type CellReference =
    | CellReference of string * int
    | Range of CellReference * CellReference

let isUpperCaseChar c = System.Char.IsLetter c && System.Char.IsUpper c
let cellReferenceParser : _ MyParser = many1Satisfy isUpperCaseChar .>>. pint32 |>> CellReference
let cellRangeParser : _ MyParser = choice [
    attempt (cellReferenceParser .>> pchar ':' .>>. cellReferenceParser) |>> Range
    cellReferenceParser
]

Now if we test the cellRangeParser with either A1 or A1:A2 then the correct value will be chosen.

Parsing functions

The final part of this Excel function parser is to parse the function name itself. However, a function can have another function call as one of its arguments, creating a recursive parser call. Let's start by setting things up. We first define a type which wraps the different types of arguments we can have, either a Value of the values we created at the first part of this post, a Cell reference which we discussed above or a function call which we'll cover now. We'll also have a list of function names, these are the valid functions which we can have such as SUM or AVERAGE. We'll also make a parser which transforms these functions out into individual parsers for them.

let functionsList = ["SUM"; "AVERAGE";]

type Term =
    | Value of Value
    | FunctionCall of string * Term list
    | CellReferenceEx of CellReference

let functionsParsers : _ MyParser =
    functionsList
    |> List.map (fun func -> pstringCI func |>> fun func -> fun args -> FunctionCall(func, args))
    |> choice

We now need to take out argument list parser we created earlier and use the concepts from that to apply it to our argument parser. However, we need to use a few other concepts from FParsec. Notably, we need to handle the scenario where we have a parser needing to use its parent parser. Typically in F# if we wanted to create a recursive function then we#d use the rec modifier on a function name. However, parsers are all values when we're using FParsec and so we can't use rec on values. Instead FParsec provides the createParserForwardedToRef function. This function allows us to use a parser which we assign a value to later on. And so we can then use the parser output from this function as one of the parsers in our argument list. This leaves us with a parser as below.

let terms, termsRef = createParserForwardedToRef ()
let argsParser : _ MyParser = between (pchar '(') (pchar ')') (sepBy terms (pchar ',' >>. spaces))
let functionCallParser : _ MyParser = functionsParsers .>>. argsParser |>> fun (fn, args) -> fn args

let formulaParser : _ MyParser = choice [
    functionCallParser
    valueParser |>> Value
    cellRangeParser |>> CellReferenceEx
]

termsRef := formulaParser

Here we have a Terms parser which is used for parsing out the inidividual terms, however, we want to define this later. We then have our argsParser which is responsible for parsing out all the terms between brackets. We also then combine this parser with the parser functions to wrap a function value with its arguments. Finally, we create a choice of 3 different parsers, either the function call, a single value or a cell range. Once we have this parser, we need to ensure we set the termsRef ref cell with the value of the parser which should be used.

There's now just 2 steps left to ensure that our Excel function parser is complete. We need to ensure that we have the '=' sign at the start of the function and to ensure that the parser consumes all available input. We've seen many times now how to use the pstring function to use a single character, but we also now introduce the eof parser which valdiates that the input stream has been completely consumed. This leaves us with a parser function like the following:

let excelFormulaParser : _ MyParser = pstring "=" >>. terms .>> eof

We can then test this with a number of valid Excel functions such as the following:

run excelFormulaParser "=SUM(5, A1:A2)"
run excelFormulaParser "=AVERAGE(5, A1, BZ26:BZ27, SUM(A1:A500))"

Conclusion

In this post, we've seen how we can create a really simple Excel formula parser which is able to convert an Excel formula into an F# data type. We've not covered every possible case within an Excel formula, for example, there's no support for column references instead of cell references or hard coded column references, however, they aren't too difficult to add to this parser. This parser was ultimately written in only a couple of hours of work and many of the concepts can be ported across to writing custom formulae within your own application using your own custom syntax. As we now have an abstract representation of a function, we can easily write a function which is able to evaluate it as follows:

let rec eval (cells:Map<_, _>) fn =
    match fn with
    | Value (String s) ->
        ...
    | Value (Number n) ->
        ...
    | FunctionCall(name, args) ->
        ...
    | CellReferenceEx (CellReference ref) ->
        ...