Vermeille's blog

Experiment: Bringing Haskell's parsers to C++

Hello,

I am learning Haskell, and I discovered Parsec. Since parsing is nothing more than running a stateless automaton, functional languages are a perfect fit. Parser combinator techniques, usually found in this paradigm, are less popular in imperative languages. This whole article is based on this paper.

Since we are imitating Haskell, we want to preserve its purity. No pointers, references or side effects are tolerated in any way.

A parser can be thought of as a combination of smaller parsers. A CSV parser is a combination of a parser of words, a parser of ',', and a parser of "". We want those combinations to be resolved at compile time.

Defining our types

The most fundamental block of this approach is certainly this line:

type Parser a = String -> Maybe (a, String)

How should we read this line? We are defining a kind of template type Parser, with type variable 'a', as follows: a function that takes a String and returns (-> can be read as "returns") a Maybe pair of an a and a String. The first member of this pair is the result of the parser, while the second is the remainder of the string, the "not yet parsed" part. Maybe is defined in Haskell as follows:

data Maybe a = Nothing | Just a

It simply means that Maybe Int can be Just 42 or Nothing.

In more or less idiomatic C++, the Maybe monad already exists under the name

namespace boost {
template <class T>
class optional;
}
// #include <boost/optional.hpp>

std::pair exists in the standard header <utility>.
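
For readers unfamiliar with these two building blocks, here is a minimal illustration of how they behave (the variable names are mine, purely for demonstration):

#include <boost/optional.hpp>
#include <string>
#include <utility>

// A value that may be absent, like Haskell's Maybe
boost::optional<int> found = 42;            // plays the role of Just 42
boost::optional<int> missing = boost::none; // plays the role of Nothing

// A pair plays the role of Haskell's (a, String) tuple
std::pair<char, std::string> parsed = std::make_pair('x', std::string("rest"));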

The whole type is defined with a C++11 feature, the templated using declaration:

template <class T>
using ParserRet = boost::optional<std::pair<T, std::string::const_iterator>>;

// for convenience:
typedef std::string::const_iterator str_iterator;

Exercise 3 of the paper defines the smallest possible parser: it parses a single char, returning Just it, or Nothing if the string is empty. In Haskell, this is really easy to achieve:

-- Haskell
char (c:cs) = Just (c, cs)
char ""     = Nothing
// C++
struct ParserChar
{
    // We use a static method to avoid useless instantiation, but
    // to have function semantics, it might be interesting to define
    // operator() instead
    static ParserRet<char> Parse(str_iterator begin, str_iterator end)
    {
        if (begin == end)
            return boost::none;
        return boost::make_optional(std::make_pair(*begin, begin + 1));
    }

    typedef char parser_type;
};

The necessity of the typedef parser_type will be shown later. I believe some metaprogramming tricks could avoid it; if you find one, tell me in a comment.

We can already see how expressive and concise Haskell is compared to the legendary verbosity of C++.
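
Before moving on, here is a minimal, hypothetical usage sketch of ParserChar on a std::string (this snippet is mine and assumes the definitions above):

#include <iostream>
#include <string>

int main()
{
    const std::string input = "abc";
    auto res = ParserChar::Parse(input.begin(), input.end());
    if (res)
        std::cout << "parsed: " << res->first << std::endl; // prints "parsed: a"
    else
        std::cout << "empty input" << std::endl;
    return 0;
}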

parser ? predicate

The first parser operator implemented in the article is (?). parser ? predicate can be read as « Does the thing parsed by parser match the predicate? ». Depending on the result, we return Just val or Nothing.

In Haskell, the implementation is fairly simple.

(?) :: Parser a -> (a -> Bool) -> Parser a
(m ? p) str =
    case m str of
        Nothing -> Nothing
        Just (a, cs) -> if p a then Just (a, cs) else Nothing

In C++, this is not as straightforward, since we have neither type inference nor structural typing. We must use templates, since we don't know what parser returns nor what predicate takes as an argument.

template <class Parser, class Pred>
struct ParserPred
{
    typedef typename Parser::parser_type parser_type;

    static ParserRet<parser_type>
        Parse(str_iterator begin, str_iterator end)
    {
        auto res = Parser().Parse(begin, end);
        if (res)
        {
            if (Pred()(res->first))
                return res;
            return boost::none;
        }
        return boost::none;
    }
};

We can already see a limitation of our implementation: since we don't want any instantiation, we must pass the predicate through a template parameter, and since template parameters can't be plain functions, we have no choice but to define Pred as a functor, or as a class exposing a given static method (like Test() or Check()).

So, the line

if (Pred()(res->first))

instantiates an anonymous predicate object and immediately calls its operator().

With only one parser and the conditional combinator, we can already create several small parsers.

-- Only parses [0-9]
digit :: Parser Char
digit = char ? isDigit

-- Only parses [a-zA-Z]
letter :: Parser Char
letter = char ? isAlpha

-- Only parses c
lit :: Char -> Parser Char
lit c = char ? (== c)

-- Only parses [ \t\n]
space :: Parser Char
space = char ? isSpace

In C++, we do not have such expressivity, and we must define functors to wrap predicates:

// We may use a macro to have a lighter syntax
struct IsAlpha
{
    bool operator()(char c)
    {
        return isalpha(c);
    }
};

struct IsDigit
{
    bool operator()(char c)
    {
        return isdigit(c);
    }
};

struct IsBlank
{
    bool operator()(char c)
    {
        return isblank(c);
    }
};

template <char C>
struct IsChar
{
    bool operator()(char c)
    {
        return c == C;
    }
};

// Only parses [0-9]
typedef ParserPred<ParserChar, IsDigit> ParserDigit;

// Only parses [a-zA-Z]
typedef ParserPred<ParserChar, IsAlpha> ParserLetter;

// Only parses C
template <char C>
using Lit = ParserPred<ParserChar, IsChar<C>>;

// Only parses [ \t]
typedef ParserPred<ParserChar, IsBlank> ParserBlank;
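
Assuming the definitions above, here is a quick illustrative sketch (mine, not from the paper) of how these derived parsers behave:

const std::string text = "7x";

// ParserDigit accepts '7'...
auto d = ParserDigit::Parse(text.begin(), text.end());
// d->first == '7', d->second points to 'x'

// ...but Lit<'a'> fails on the same input
auto a = Lit<'a'>::Parse(text.begin(), text.end());
// !a is true: the result is boost::none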

ifIfail ! IwillSaveYou

The next operator is (!), which tries the left operand and returns its result if everything went well, or runs the right operand and returns its result if the left one failed.

(!) :: Parser a -> Parser a -> Parser a
(m ! n) str =
    case m str of
        Nothing -> n str
        res -> res
template <class L, class R>
struct Or
{
    typedef typename L::parser_type parser_type;

    static ParserRet<parser_type>
        Parse(str_iterator begin, str_iterator end)
    {
        auto res = L::Parse(begin, end);
        if (res)
            return res;
        res = R::Parse(begin, end);
        return res;
    }
};

Note: we do not explicitly check that L::parser_type and R::parser_type match: the compiler does it for us when the return value of R::Parse is assigned to res.
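
As a small illustrative combination (a hypothetical alias, not from the paper), both branches sharing the same parser_type:

// Parses a single letter or, failing that, a single digit
typedef Or<ParserLetter, ParserDigit> ParserAlphaNum;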

doMeFirst # andIComeAfter

The (#) operator simply sequences two parsers. The first has to succeed, and the remaining string is given as input to the second. A pair of both values is returned.

(#) :: Parser a -> Parser b -> Parser (a, b)
(m # n) str =
    case m str of
        Nothing -> Nothing
        Just (a, cs) ->
            case n cs of
                Nothing -> Nothing
                Just (b, cs') -> Just ((a, b), cs')
template <class L, class R>
struct Then
{
    typedef std::pair<typename L::parser_type, typename R::parser_type>
        parser_type;

    static ParserRet<parser_type>
        Parse(str_iterator begin, str_iterator end)
    {
        auto res = L::Parse(begin, end);
        if (res)
        {
            auto res2 = R::Parse(res->second, end);
            if (!res2)
                return boost::none;
            return boost::make_optional(
                    std::make_pair(
                        std::make_pair(res->first, res2->first),
                        res2->second));
        }
        else
            return boost::none;
    }
};

This one is really simple and closely follows the Haskell code.
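
A quick illustrative combination (my own example): a letter immediately followed by a digit.

// parser_type is std::pair<char, char>
typedef Then<ParserLetter, ParserDigit> ParserLetterDigit;

// On "a1", Parse returns Just(('a', '1'), iterator past '1');
// on "1a", it returns boost::none because the letter must come first.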

parser >-> function

This new operator allows parse-time transformations of tokens, operations on characters, etc. We encounter the same problem as above: functions cannot be template parameters, so we still need to wrap them in functors (a major drawback of our approach) and add a typedef to indicate the return type.

(>->) :: Parser a -> (a -> b) -> Parser b
(m >-> f) str =
    case m str of
        Nothing      -> Nothing
        Just (a, cs) -> Just (f a, cs)
template <class Parser, class Fun>
struct Apply
{
    typedef typename Fun::return_type parser_type;

    static ParserRet<parser_type> Parse(str_iterator begin, str_iterator end)
    {
        auto res = Parser::Parse(begin, end);
        if (!res)
            return boost::none;
        return boost::make_optional(
                std::make_pair(Fun()(res->first), res->second));
    }
};

It's still a verbose imitation of the Haskell code. And now we can define some functions to create more elaborate parsers:

struct Atoi
{
    int operator()(char c)
    {
        return c - '0';
    }
    typedef int return_type;
};

struct ToLower
{
    char operator()(char c)
    {
        return tolower(c);
    }
    typedef char return_type;
};

template <class L, class R>
struct GetFirst
{
    typedef typename L::parser_type return_type;

    return_type operator() (const std::pair<typename L::parser_type,
            typename R::parser_type>& p)
    {
        return p.first;
    }
};

template <class L, class R>
struct GetSecond
{
    typedef typename R::parser_type return_type;

    return_type operator() (const std::pair<typename L::parser_type,
            typename R::parser_type>& p)
    {
        return p.second;
    }
};

// Parsers:

// Parses a digit and returns its value as an int
typedef Apply<ParserDigit, Atoi> ParserDigitV;

// Parses a char and turn it into lowercase, to get an case insensitive parser
typedef Apply<ParserLetter, ToLower> ParserInsensitive;

// ignores result of parser L, Haskell equivalent is (-#)
template <class L, class R>
using SkipL = Apply<Then<L, R>, GetSecond<L, R>>;

// ignores result of parser R, Haskell equivalent is (#-)
template <class L, class R>
using SkipR = Apply<Then<L, R>, GetFirst<L, R>>;

In Haskell:

digitV :: Parser Int
digitV = digit >-> digitToInt

ichar :: Parser Char
ichar = letter >-> toLower

(-#) :: Parser a -> Parser b -> Parser b
(m -# n) str = (m # n) >-> snd $ str

(#-) :: Parser a -> Parser b -> Parser a
(m #- n) str = (m # n) >-> fst $ str
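
As a sanity check of these aliases, here is a hedged sketch (my own example, not from the paper) of a parser that reads a digit between parentheses and keeps only its numeric value:

// '(' digit ')' -> int
// SkipL drops the '(' token, SkipR drops the ')' token.
using ParserParenDigit = SkipR<SkipL<Lit<'('>, ParserDigitV>, Lit<')'>>;

// ParserParenDigit::Parse on "(5)" should return Just(5, iterator at the end);
// on "(x)" it returns boost::none.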

Conclusion

This article is getting quite long, and I think you now get the logic. You can read the Haskell paper or follow your inspiration to write other combinators and combinations until you have a real-world parser (using Apply/>-> to build ASTs, for instance). The number of base combinators is, in the end, relatively small. I will push the code of a complete library that parses and evaluates simple mathematical expressions.
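
To give an idea of the direction, here is a small sketch (my own, assuming the combinators defined above) that parses and evaluates a single-digit addition such as "3+4":

// Sums the two digit values produced by the sequencing below
struct SumPair
{
    typedef int return_type;

    int operator()(const std::pair<int, int>& p)
    {
        return p.first + p.second;
    }
};

// digit '+' digit  ->  int
using ParserAddition =
    Apply<Then<SkipR<ParserDigitV, Lit<'+'>>, ParserDigitV>, SumPair>;

// ParserAddition::Parse on "3+4" should return Just(7, iterator at the end).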

The most important question of this work is not « is this interesting? ». It is. The most important question is « Is this approach relevant in C++? ».

Advantages are easy to understand:

  • parsers and blocks are easy to combine and to build.
  • each block is guaranteed not to fail because it is a composition of blocks which cannot fail (recursive property)
  • parsers are pure: no side effects, no global variables
  • parsers are built at compile-time: there is no runtime cost for assembling them together, and compilers can optimize them very well
  • a backtrack algorithm is integrated for free, allowing your parser to be a kind of LL(k)

The drawbacks are already apparent:

  • verbosity is a huge problem. You always have to define a struct if you want to add a simple predicate; you always have to define a struct and a typedef if you want to add a function to Apply
  • you have to handle types carefully
  • if your parser is ill-constructed, g++ (or clang++) will vomit on you
  • this approach is unfamiliar in C++. Use this kind of parser in a team and I don't believe they'll thank you
  • how do we implement recursion?

Creating your own language Chapter 1 (Lexical Analysis)

Hello,

I intend to write a series of articles explaining how to implement a programming language, be it functional, imperative, object oriented, statically or dynamically typed. We will implement it in pseudo code. For learning purposes, we will implement everything by hand (Bison and Flex do a lot of the work, and we don't want to assume this work is done automagically).

Theory

The first thing to do is lexical analysis (done by a lexer): when you write a simple line like this:

print(3 + 4)

your computer sees this:

0000000 7270 6e69 2874 2033 202b 2934 000a
000000d

So we have to translate this data into a form our program can handle more comfortably. We want these characters to be understood at a higher abstraction layer: distinguishing identifiers, operators, and numbers. Our same line will be translated into:

[ IDENTIFIER("print"), O_PARENTHESIS, NUMBER(3), OP_PLUS, NUMBER(4), C_PARENTHESIS ]
(just assume this is a kind of enumeration), and this is the lexer's only job.

In our example

Usually, we express tokens as regular expressions (don't expect this to always be true: Shell is an exception and cannot express its tokens with regexes). These expressions are simply written in the flex/bison configuration files, and the lexer is generated.

Since we are starting from scratch, it is easier to disambiguate on the first character, so be aware that each token below can be recognized from the character it starts with.

Here are the usual tokens (I will express my grammar using an EBNF syntax):

Identifier ::= [a-zA-Z][a-zA-Z0-9]*
Number  ::= [1-9][0-9]* /* '-' is considered as a unary operator */
Plus    ::= '+'
Minus   ::= '-'
Div     ::= '/'
Mul     ::= '*'
Modulo  ::= '%'
OParen  ::= '('
CParen  ::= ')'
Eq      ::= '='
Neq     ::= '!='
Less    ::= '<'
...

Since we wrote our tokens grammar, we now have to represent our text input (user input or script file) as a stream of characters. Our lexer will read it, character by character, and return the corresponding token.

Lex(Stream s) : Token
    if Head(s) belongs to [a-zA-Z0-9]
        return LexIdentifier(s)
    if Head(s) belongs to [+-=/%()<!]
        return LexOperator(s)
    error("Unexpected character: " + Head(s))

LexIdentifier(Stream s) : Token
    String str = ""
    while Head(s) belongs to [a-zA-Z0-9]
        str += Head(s)
        Pop(s)
    return Token(TOKEN_IDENTIFIER, str)

LexOperator(Stream s) : Token
    String str = ""
    while Head(s) belongs to [+-=/%()<!]
        str += Head(s)
        Pop(s)
    foreach OperatorToken op in operators
       if str == op.str
           return Token(op.type)
    error("Invalid operator token: " + str)

const operators =
[
    { str: "(", type: O_PAREN },
    { str: "+", type: PLUS },
    { str: "&lt;=", type LESSEQUAL },
    ...
]
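
For readers who prefer concrete code, here is a compact C++ sketch of the same algorithm (the names and types are mine, not a reference implementation):

#include <stdexcept>
#include <string>
#include <utility>

enum TokenType { IDENTIFIER, O_PAREN, C_PAREN, PLUS, LESSEQUAL /* ... */ };

struct Token
{
    TokenType type;
    std::string text;
};

// Table mapping operator spellings to token types
const std::pair<std::string, TokenType> operators[] = {
    { "(",  O_PAREN },
    { ")",  C_PAREN },
    { "+",  PLUS },
    { "<=", LESSEQUAL },
    // ...
};

bool IsOperatorChar(char c)
{
    return std::string("+-=/%()<!").find(c) != std::string::npos;
}

bool IsAlnumChar(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

// The iterator plays the role of the Stream: it is advanced as characters are consumed
Token LexIdentifier(std::string::const_iterator& it, std::string::const_iterator end)
{
    std::string str;
    while (it != end && IsAlnumChar(*it))
        str += *it++;
    return Token{ IDENTIFIER, str };
}

Token LexOperator(std::string::const_iterator& it, std::string::const_iterator end)
{
    std::string str;
    while (it != end && IsOperatorChar(*it))
        str += *it++;
    for (const auto& op : operators)
        if (str == op.first)
            return Token{ op.second, str };
    throw std::runtime_error("Invalid operator token: " + str);
}

Token Lex(std::string::const_iterator& it, std::string::const_iterator end)
{
    if (IsAlnumChar(*it))
        return LexIdentifier(it, end);
    if (IsOperatorChar(*it))
        return LexOperator(it, end);
    throw std::runtime_error(std::string("Unexpected character: ") + *it);
}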

Now we can correctly lex our input. In the next article, we will see how to assemble these tokens with a parser and recognize our language.

If you want to create your own language, I encourage you to write down the tokens that will be recognized and write the lexer. Have fun!