Parser combinators for postal addresses

A while ago, Tsoding did a stream on parsing JSON with parser combinators. I was curious how hard it would be to use the basic parsers he implemented to create a new parser for U.S. postal addresses. It turns out once you have basic building blocks, creating new parsers is not only easy, but fun! You can find the full code here.

I recommend watching the video first, but if you want a dumbed down introduction to parser combinators, read on.

What’s a postal address?

To keep things simple, we will be working with a simplified version of this Wikipedia BNF example.

<postal-address> ::= <name-part> <street-address> <zip-part>

    <name-part> ::= <first-name> <last-name> <opt-suffix-part> <EOL> |

        <opt-suffix-part> ::= "Sr." | "Jr." | ""

    <street-address> ::= <house-num> <street-name> <apt-num> <EOL>

    <zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

An address consists of three parts: a name part, a street address, and a zip part. Each part can take on multiple forms; for example, a name part can optionally have a suffix at the end. We would like to parse a string representing a postal address into a tree, where each node represents each component.

For this simple example, using parser combinators is probably an overkill, as you can probably just brute-force parse it in Python. However, as the postal address format changes and grows, we’re going to want a more scalable approach that can adapt to these changes. And as you will see by the end, parser combinators provide a clean abstraction that lets you easily build new parsers on top of existing ones. You’ll also realize that functional programming can actually be useful and isn’t just a pretentious cult!

Parsing characters and strings

The core idea of combinator parsing is to 1. define basic parsers as our building blocks and 2. define the operations on those parsers such that they can combine with each other in various ways.

First, let’s translate the BNF from the previous section to code.

-- final address
data PostalAddress =
  PostalAddress NamePart StreetAddress ZipPart deriving (Show, Eq)
-- name part
data NamePart =
  NamePartBasic FirstName LastName |
  NamePartWithSuffix FirstName LastName SuffixPart deriving (Show, Eq)
data FirstName =
  FirstName String deriving (Show, Eq)
data LastName =
  LastName String deriving (Show, Eq)
data SuffixPart =
  SuffixPart String deriving (Show, Eq)
-- address part
data StreetAddress =
  StreetAddress HouseNum StreetName AptNum deriving (Show, Eq)
data HouseNum =
  HouseNum String deriving (Show, Eq)
data StreetName =
  StreetName String deriving (Show, Eq)
-- zip part
data ZipPart =
  ZipPart TownName StateCode ZipCode deriving (Show, Eq)
data TownName =
  TownName String deriving (Show, Eq)
data StateCode =
  StateCode String deriving (Show, Eq)
data ZipCode =
  ZipCode String deriving (Show, Eq)
data AptNum =
  AptNum String deriving (Show, Eq)

We can then define our parser.

data Parser a = Parser {
  runParser :: String -> Maybe (a, String)
}

which is a datatype with a single field runParser. You can see that it takes a string to parse, and returns a Maybe of something of type a that it parsed, along with the remainder of the string that it didn’t parse. The remaining string can be fed as input into other parsers to continue parsing.

A couple notes on this weird syntax. Haskell actually creates a method named runParser that has signature

ghci> :t runParser
runParser :: Parser a -> String -> Maybe (a, String)

So you can supply the parser to it and it’ll return the function field, much like accessing a field name of a struct in other languages. This is why we named it runParser rather than something like parsingFunction, since runParser gives us the function of type String -> Maybe(a, String), which we can run directly by providing the string argument!

To make this more concrete, let’s create our first parser for the Char type. To do this, we’ll first want a function that take any character c and generate a parser for c.

charParserG :: Char -> Parser Char
charParserG c = Parser f
  where
    f (x:xs)
      | x == c = Just (x, xs)
      | otherwise = Nothing
    f [] = Nothing

We can use this to create a parser for the letter 'h' for instance, and use it to parse "hello".

ghci> runParser (charParserG 'h') "hello"
Just ('h',"ello")

As expected, it parsed the letter 'h' and left the rest of the string untouched.

How would we parse a string? We could do something similar to charParserG, but instead of looking at a single character, we try to match an entire string.

substr :: Int -> Int -> String -> String
substr start end s = take (end - start) (drop start s)

stringParserGDumb :: String -> Parser String
stringParserGDumb s = Parser f
  where
    f (x)
      | isPrefixOf s x = Just(x, substr (length s) (length x) x)
      | otherwise = Nothing

This works fine, but there is a lot of boilerplate code. What if I told you we could combine many charParserG’s to create a stringParserG?

After all, a string is simply a list of characters. As a first attempt, we can try to call the map function, but unfortunately the result is of the wrong type.

ghci> :t map charParserG "hello"
map charParserG "hello" :: [Parser Char]

That is, it gives us a list of char parsers, but what we want is a parser of a list of chars. So we need to somehow flip the parser inside out. Luckily, there is a function that does exactly this.

ghci> :info sequenceA
type Traversable :: (* -> *) -> Constraint
class (Functor t, Foldable t) => Traversable t where
  ...
  sequenceA :: Applicative f => t (f a) -> f (t a)
  ...
  	-- Defined in ‘Data.Traversable’

sequenceA however, is defined in the Functor interface which mean we need to implement that for our Parser type before we can use it.

Functor

To implement a functor, we need to define the operator fmap, or <$>, which has the following signature.

fmap :: (a -> b) -> f a -> f b

It takes two arguments: a function mapping a -> b and a value f a, which is something of type a wrapped in some container f. It then unwraps the f a value, runs the function on the a to get a b, then rewraps it, finally returning f b.

This is much easier to understand through an example. Let’s say we have a Maybe of an Int, and we want to double the value inside it and return another Maybe type. Since the Maybe type is a functor, we can simply call the fmap operator.

ghci> (\i -> i * 2) <$> (Just 2)
Just 4

Much better than manually unwrapping the Maybe and doing the repackaging ourselves! This is a trivial example but you can imagine much more complicated versions of f with a lot of hidden state that would be impractical to handle manually.

So for our parser, given a function and parserA, we need to apply the function to the output of parserA and return a new parser parserB.

instance Functor Parser where
  -- fmap :: (a -> b) -> Parser a -> Parser b
  fmap f parserA =
    Parser $ \s ->
      case runParser parserA s of
        Nothing -> Nothing
        Just (a, as) -> Just (f a, as)

As an example, we can easily map the function toUpper to our parser for 'h'.

ghci> runParser (toUpper <$> (charParserG 'h')) "hello"
Just ('H',"ello")

Now that Parser is a functor, implementing stringParserG is a two-line endeavor:

stringParserG :: String -> Parser String
stringParserG = sequenceA . map charParserG

It works!

ghci> runParser (stringParserG "hello") "hello world!!"
Just ("hello"," world!!")

If we go back to our postal address discussion and try to parse something simple like a FirstName, we run into an issue: that is, our stringParserG function can only generate parsers for specific strings. But a first name can be any string given some condition, such as that containing only letters in the alphabet. This means we need another function that can generate a string parser for some condition.

Luckily, there is a function called span defined in GHC.List that does just this! It takes a string and a boolean function and returns a tuple of two elements: everything it could greedily match and the rest of the string.

ghci> span isAlpha "hello world 123"
("hello"," world 123")

We can now create a function that takes in a condition and returns a parser that parses not specific strings, but any string that satisfies that condition.

spanParserG :: (Char -> Bool) -> Parser String
spanParserG f = Parser $ \s ->
  let (token, rest) = span f s
    in Just(token, rest)

Now we can write a parser that parses string literals by supplying it the isAlpha function:

stringLiteralParser :: Parser String
stringLiteralParser = spanParserG isAlpha

And one that parses digits using the isDigit function:

numberParser :: Parser String
numberParser = spanParserG isDigit

Using the fmap operator we defined earlier for our functor, we can now create a parser for FirstName.

firstNameParser :: Parser FirstName
firstNameParser = FirstName <$> stringLiteralParser

Doing the same for LastName:

lastNameParser :: Parser LastName
lastNameParser = LastName <$> stringLiteralParser

We can try to write a parser for SuffixPart but there is another problem. Recall that a suffix can be either “Sr.” or “Jr.” so we’ll need a way to apply the “or” operator on two parsers, such that we return the first match that succeeds. Then we can write a parser for “Sr.” and a parser for “Jr.” and then glue them together.

This can be done with the <|> operator, defined in yet another interface, the Alternative. But to implement Alternative, we must first implement the Applicative interface. So once again, we’re going to take a detour into some PL theory.

Applicative

An applicative has these two operations:

pure :: a -> f a 
(<*>) :: f (a -> b) -> f a -> f b

pure simply wraps some primitive type a inside our container f. The <*> operator looks similar to <$> in its type signature, but notice that in this case, the function is also wrapped inside the container!

We use it like this:

ghci> (Just toUpper) <*> (Just 'h')
Just 'H'

The applicative implementation is similar to what we did for our functor, except we also need to “unwrap” the container wrapping the function. In this context, unwrapping our container means running the parser!

instance Applicative Parser where
  -- pure :: a -> Parser a
  pure a = Parser $ \s -> Just (a, s)

  -- <*> :: Parser (a -> b) -> Parser a -> Parser b
  parserF <*> parserA = Parser $ \s ->
    case runParser parserF s of
      Nothing -> Nothing
      Just (f, fs) ->
        case runParser parserA fs of
          Nothing -> Nothing
          Just (a, as) -> Just (f a, as)

Alternative

Now that our parser is an applicative, we can continue on to implement Alternative. In this case, we need the empty operator and the <|> operator, which we will use as a “or” function combining our parsers.

To implement the <|> operator, we need to parse our string with the first parser, and depending on the results, potentially run it through the second parser.

instance Alternative Parser where
  -- empty :: f a
  empty = Parser (\s -> Nothing)

  -- (<|>) :: f a -> f a -> f a
  parserA <|> parserB = Parser $ \s ->
    case runParser parserA s of
      Just (a, as) -> Just (a, as)
      Nothing -> case runParser parserB s of
        Just (b, bs) -> Just (b, bs)
        Nothing -> Nothing

Now that our Parser is both an Applicative and an Alternative, we can write our suffix parser.

suffixPartParser :: Parser SuffixPart
suffixPartParser = SuffixPart <$> ((stringParserG "Jr.") <|> (stringParserG "Sr."))

Remember that a NamePart can optionally have a suffix. So our final parser has to look something like this:

namePartParser :: Parser NamePart
namePartParser = namePartWithSuffixParser <|> namePartBasicParser

which tries to match the name part with a suffix, but if it fails falls back to a basic name.

How do we implement namePartWithSuffixParser and namePartBasicParser? For these, we now need to run a sequence of parsers on some input string, each parser parsing the remainder of the input string left behind by its preceding parser. To do this, we need the “bind” operator, or >>= defined in the Monad interface.

ghci> :info >>=
type Monad :: (* -> *) -> Constraint
class Applicative m => Monad m where
  (>>=) :: m a -> (a -> m b) -> m b
  ...
  	-- Defined in ‘GHC.Base’
infixl 1 >>=

So before we can finish our parser, we need to do one last task, which is to implement the Monad interface. But once we have this, the rest will be smooth sailing.

Monad

Simply enough, to implement >>=, we just run the first parser on the input, take the Maybe result and run the second parser on the rest of the input.

instance Monad Parser where
  -- >>= :: (Parser a) -> (a -> Parser b) -> (Parser b)
  parser >>= f = Parser $ \s ->
    case runParser parser s of
      Nothing -> Nothing
      Just (a, as) -> runParser (f a) as

Consider the following parser that parses "good".

ghci> runParser (stringParserG "good") "goodbye"
Just ("good","bye")

If we want to do the same thing but additionally thread the remaining string, i.e. "bye", into another parser that consumes "bye", we can!

ghci> runParser ((stringParserG "good") >>= (\s -> stringParserG "bye")) "goodbye"
Just ("bye","")

Implementing our parser

The hard part is done! Now we can implement a parser for basic names and a parser for names with suffixes. To do this, we can make use of the do notation for monads and call the smaller parsers we implemented earlier, i.e. firstNameParser, lastNameParser, and suffixPartParser. Then finally, we can <|> the two parsers together.

namePartBasicParser :: Parser NamePart
namePartBasicParser = do
  fst <- firstNameParser <* charParserG ' '
  snd <- lastNameParser
  pure $ NamePartBasic fst snd

namePartWithSuffixParser :: Parser NamePart
namePartWithSuffixParser = do
  fst <- firstNameParser <* charParserG ' '
  snd <- lastNameParser <* charParserG ' '
  thrd <- suffixPartParser
  pure $ NamePartWithSuffix fst snd thrd

namePartParser :: Parser NamePart
namePartParser = namePartWithSuffixParser <|> namePartBasicParser 

Note the <* operator defined in the Applicative interface. It sequentially applies two parsers, but only returns the result of the first. As an example, if we want to parse "good night" followed by an exclamation point, but only want the string "good night" to be captured, we can do something like:

ghci> runParser ( stringParserG "good night" <* charParserG '!') "good night!"
Just ("good night","")

Next, parsing the address part of the postal address should look familiar.

houseNumParser :: Parser HouseNum
houseNumParser = HouseNum <$> numberParser

streetNameParser :: Parser StreetName
streetNameParser = StreetName <$> (stringLiteralParser <* stringParserG " St")

aptNumParser :: Parser AptNum
aptNumParser = AptNum <$> ((stringParserG "Apt ") *> numberParser)

streetAddressParser :: Parser StreetAddress
streetAddressParser = do
  houseNum <- houseNumParser <* charParserG ' '
  streetName <- streetNameParser <* charParserG ' '
  aptNum <- aptNumParser
  pure $ StreetAddress houseNum streetName aptNum

Then lastly, the zip portion of the address:

townNameParser :: Parser TownName
townNameParser = TownName <$> stringLiteralParser

-- only the West Coast exists
stateCodeParser :: Parser StateCode
stateCodeParser = StateCode <$>
  (stringParserG "WA" <|>
    stringParserG "CA" <|>
    stringParserG "OR" <|>
    stringParserG "AR" <|>
    stringParserG "HI")

zipCodeParser :: Parser ZipCode
zipCodeParser = ZipCode <$> numberParser

zipPartParser = do
  town <- townNameParser <* charParserG ','
  state <- stateCodeParser <* charParserG ' '
  zip <- zipCodeParser
  pure $ ZipPart town state zip

The final step is to create postal address parser by combining the name, address, and zip parsers!

postalAddressParser :: Parser PostalAddress
postalAddressParser = do
  namePart <- namePartParser <* charParserG '\n'
  streetAddress <- streetAddressParser <* charParserG '\n'
  zipPart <- zipPartParser
  pure $ PostalAddress namePart streetAddress zipPart

We can test this on a couple examples:

ghci> -- test case 1
ghci> runParser postalAddressParser "Linus Torvalds Jr.\n1538 John St Apt 550\nSeattle,WA 98702"
Just (PostalAddress (NamePartWithSuffix (FirstName "Linus") (LastName "Torvalds") (SuffixPart "Jr.")) (StreetAddress (HouseNum "1538") (StreetName "John") (AptNum "550")) (ZipPart (TownName "Seattle") (StateCode "WA") (ZipCode "98702")),"")
ghci> -- test case 2
ghci> runParser postalAddressParser "Grace Hopper\n528 Madison St Apt 702\nFremont,CA 78771"
Just (PostalAddress (NamePartBasic (FirstName "Grace") (LastName "Hopper")) (StreetAddress (HouseNum "528") (StreetName "Madison") (AptNum "702")) (ZipPart (TownName "Fremont") (StateCode "CA") (ZipCode "78771")),"")

These are admittedly pretty simple examples but you can easily build on this to create more complicated parsers to deal with your favorite edge cases.

Summary

To recap, parser combinators are powerful abstractions that allow you to combine small, simple parsers in different ways in order to build larger, more complicated parsers. We started with a simple parser that can parse a single character, then built more parsers on top of that to parse strings, numbers, names, street names, zip codes, and finally, entire postal addresses. To do this, we implemented various interfaces for our Parser type, including Functor, Applicative, Alternative, and Monad, each of which gave us some unique operator we used to glue our various parsers together.

For more on parser combinators, I recommend this list of resources:

And if you really must, a more formal paper discussing this:

Monadic Parser Combinators