Using text.parsec.indent to Parse an Indentation-Sensitive Language with Haskell’s Parsec Library

I’ve been working on a toy language with Job Vranish. We decided that we wanted to use indentation to represent blocking instead of braces (like Python). We’re implementing our toy in Haskell using the Parsec parser-combinator library. Luckily, there’s a companion library that handles indentation parsing for us! It’s called indents. Unfortunately, it requires that you understand a lot more of Haskell’s type system than Parsec usually requires. I spent some time working through this and wanted to share the steps I used to parse our toy language.

Let’s start by taking a look at a basic parser in Haskell. We’re going to ask our parser to parse a ‘word’ composed of letters or numbers ending with a colon (:). In Parsec, this is a pretty simple parser:

 print err
    Right res -> putStrLn $ "I parsed: '" ++ res ++ "'"

aParser :: Parser String
aParser = do
  s <- many1 alphaNum
  _ <- char ':'
  return s

{-
 - Output:
 - > runhaskell -Wall parsec_example.hs
 - I parsed: 'foo123'
 -}

Let’s expand this parser to parse a named list of items, where the name is a word and the items are words indented past the name, like this:

listName:
  item1
  item2
  item3

In order to properly understand how the indents package works with Parsec, we’ll need to pick apart the Parser type provided by Parsec.

The Parser type is defined as follows:

type Parser = Parsec String ()

This says that a Parser is a type alias for the Parsec type with a stream type of String and a user state type of ‘()’ (unit or empty). For completeness, let’s also take a look at the Parsec type definition:

type Parsec s u = ParsecT s u Identity

This tells us that the Parsec type is just an alias for the ParsecT type with an underlying monad of Identity. Finally, let’s look at the definition of ParsecT:

data ParsecT s u m a

Finally we reach the bottom of the alias stack. ParsecT is a monad transformer parameterized by a stream type, a user state type, an underlying monad type, and the type that results from a successful parse. If we expand these all out, we can see that the Parser type is actually the equivalent of the following:

data ParsecT String () Identity

Unfortunately, this is not the same type we’ll be using for the indentation parser. The indentation parser needs to keep track of some state and uses the State monad to do so. Being a good citizen, the indentation parser doesn’t stuff this state into the user state slot; instead, it uses the fact that Parsec is a monad transformer to stack Paresc on top of the State monad parameterized by the SourcePos type from Parsec. We can see all this by looking at the IndentParser type:

type IndentParser s u a = ParsecT s u (State SourcePos) a

Here we see that the IndentParser type is very similar to the Parsec type except that instead of using Identity as the underlying monad, it uses (State SourcePos).

Given this difference, I’ve found it best to define a convenience type and a convenience function based on the indents library types. The function is used in place of the parse function provided by Parsec and the second is used in place of the Parser type we previously discussed.

The new type, which I’ve called IParse is defined as follows:

type IParser a = ParsecT String () (State SourcePos) a

This type is identical to the original Parser type other than that it uses (State SourcePos) as the underlying monad. This type can be used anywhere the Parser type was used.

The new parser function, which I’ve called iParse, is defined as follows:

iParse :: IParser a -> SourceName -> String -> Either ParseError a
iParse aParser source_name input =
  runIndent source_name $ runParserT aParser () source_name input

It’s basically a drop-in replacement for the original parse function and supports all the Parsec operations, but also allows us to use the indentation parser from the indents package. Now that we’ve defined these, we can go about using the functions from the indents library as if they were any other function from the Text.Parsec package. The following is a literate Haskell implementation of the parser we first discussed.

> module Main where

First, import all the needed modules.

> import Text.Parsec hiding (State)
> import Text.Parsec.Indent
> import Control.Monad.State

Next, define our new Parser type. This replaces the Identity monad
with the (State SourcePos) monad.

> type IParser a = ParsecT String () (State SourcePos) a

Now we define our new parse function. This one accepts an IParser
(which we've just defined) instead of a Parser.

> iParse :: IParser a -> SourceName -> String -> Either ParseError a
> iParse aParser source_name input =
>   runIndent source_name $ runParserT aParser () source_name input

Define our sample input string. Note: the unlines function joins
strings together with newline characters.

> input_text :: String
> input_text = unlines [
>     "listName:",
>     "  item1",
>     "  item2",
>     "  item3"
>   ]

Define main. It parses the input text and prints the parsed value. If
there was an error, it prints the error.

> main :: IO ()
> main = do
>   case iParse aNamedList "indented_example" input_text of
>     Left  err    -> print err
>     Right result -> putStrLn $ "I parsed: " ++ show result

Define a datatype to hold our parsed value.

> data NamedList = NamedList Name [Item]
>   deriving (Show)

Define what we mean by 'Name' and 'Item'. In this case, they are both
strings.

> type Name = String
> type Item = String

Define how we parse a NamedList. A Named list is a Name and a list of
Items contained in the NamedList data structure. Read more about the
withBlock function here:

http://hackage.haskell.org/packages/archive/indents/0.3.2/doc/html/Text-Parsec-Indent.html#v:withBlock

> aNamedList :: IParser NamedList
> aNamedList = do
>   b <- withBlock NamedList aName anItem
>   spaces
>   return b

A name is an alpha-numeric string followed by a ':' and some
whitespace.

> aName :: IParser Name
> aName = do
>   s <- many1 alphaNum
>   _ <- char ':'
>   spaces
>   return s

An item is an alpha-numeric string followed by some whitespace.

> anItem :: IParser Item
> anItem = do
>   i <- many1 alphaNum
>   spaces
>   return i

Output:
 > runhaskell -Wall indented_parsec_example.hs
 I parsed: NamedList "listName" ["item1","item2","item3"]

This is our final result. We can see the re-definition of the parser type, the introduction of a new parse function, and the result we get after running a simple parser across some input text.
 

Conversation
  • Piyush says:

    You may want to try indentparser (disclosure: Written by yours truly). In it I have
    also exposed a Parsec.Token like module to ease tokeniser generation for languages. indentparser is a rewrite of my original IndentParser.

  • Comments are closed.