Parsing an Indentation-Sensitive Language with Haskell

I’ve been working on a toy language with Job Vranish. We decided that we wanted to use indentation to represent blocking instead of braces (like Python). We’re implementing our toy in Haskell using the Parsec parser-combinator library. Luckily, there’s a companion library that handles indentation parsing for us! It’s called indents. Unfortunately, it requires that you understand a lot more of Haskell’s type system than Parsec usually requires. I spent some time working through this and wanted to share the steps I used to parse our toy language.

Let’s start by taking a look at a basic parser in Haskell. We’re going to ask our parser to parse a ‘word’ composed of letters or numbers ending with a colon (:). In Parsec, this is a pretty simple parser:

 print err
    Right res -> putStrLn $ "I parsed: '" ++ res ++ "'"

aParser :: Parser String
aParser = do
  s <- many1 alphaNum
  _ <- char ':'
  return s

{-
 - Output:
 - > runhaskell -Wall parsec_example.hs
 - I parsed: 'foo123'
 -}

Let’s expand this parser to parse a named list of items, where the name is a word and the items are words indented past the name, like this:

listName:
  item1
  item2
  item3

In order to properly understand how the indents package works with Parsec, we’ll need to pick apart the Parser type provided by Parsec.

The Parser type is defined as follows:

type Parser = Parsec String ()

This says that a Parser is a type alias for the Parsec type with a stream type of String and a user state type of ‘()’ (unit or empty). For completeness, let’s also take a look at the Parsec type definition:

type Parsec s u = ParsecT s u Identity

This tells us that the Parsec type is just an alias for the ParsecT type with an underlying monad of Identity. Finally, let’s look at the definition of ParsecT:

data ParsecT s u m a

Finally we reach the bottom of the alias stack. ParsecT is a monad transformer parameterized by a stream type, a user state type, an underlying monad type, and the type that results from a successful parse. If we expand these all out, we can see that the Parser type is actually the equivalent of the following:

data ParsecT String () Identity

Unfortunately, this is not the same type we’ll be using for the indentation parser. The indentation parser needs to keep track of some state and uses the State monad to do so. Being a good citizen, the indentation parser doesn’t stuff this state into the user state slot; instead, it uses the fact that Parsec is a monad transformer to stack Paresc on top of the State monad parameterized by the SourcePos type from Parsec. We can see all this by looking at the IndentParser type:

type IndentParser s u a = ParsecT s u (State SourcePos) a

Here we see that the IndentParser type is very similar to the Parsec type except that instead of using Identity as the underlying monad, it uses (State SourcePos).

Given this difference, I’ve found it best to define a convenience type and a convenience function based on the indents library types. The function is used in place of the parse function provided by Parsec and the second is used in place of the Parser type we previously discussed.

The new type, which I’ve called IParse is defined as follows:

type IParser a = ParsecT String () (State SourcePos) a

This type is identical to the original Parser type other than that it uses (State SourcePos) as the underlying monad. This type can be used anywhere the Parser type was used.

The new parser function, which I’ve called iParse, is defined as follows:

iParse :: IParser a -> SourceName -> String -> Either ParseError a
iParse aParser source_name input =
  runIndent source_name $ runParserT aParser () source_name input

It’s basically a drop-in replacement for the original parse function and supports all the Parsec operations, but also allows us to use the indentation parser from the indents package. Now that we’ve defined these, we can go about using the functions from the indents library as if they were any other function from the Text.Parsec package. The following is a literate Haskell implementation of the parser we first discussed.

This is our final result. We can see the re-definition of the parser type, the introduction of a new parse function, and the result we get after running a simple parser across some input text.

Using text.parsec.indent to Parse an Indentation-Sensitive Language with Haskell’s Parsec Library

Tell Us About Your Project

Related Posts

10 Techniques for Debugging an Embedded Application

VS Code for Embedded Development

A Library for Driving NeoPixels with the ESP32 Micro Controller

Keep up with our latest posts.

Tell Us About Your Project