Vim-Eqshow

Vim's support for Python scripting makes writing plugins a far more enjoyable experience. In this post, I'm going to briefly discuss the plugin vim-eqshow I wrote a few days ago. I will not cover the Vim API and integration details, but the algorithms used in it. There's nothing groundbreaking or super smart here, but I find it enjoyable anyway.

Motivation

I'm currently learning about machine learning, hence all the maths I'm writing these days (I promise, I'll post soon (I hope) about the deep NLP stuff I'm working on). And it turns out that writing maths in code is absolutely awful, and reading it back is even worse. Just look at this, logistic regression with L2 regularization in Matlab:

J = 1 / m * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + lambda / m * sum(theta2 .^ 2);

Awful, isn't it? My take on this is to be able to switch Vim into a "view mode" where equations are pretty-printed to ease reading, then go back to edit mode to continue writing code. The equational view mode would show the previous equation like this:

        ====                                           ====
    1   \                                          λ   \      2
J = - *  >  (-y ⊗ log(h) - (1 - y) ⊗ log(1 - h)) + - *  >  (θ2 )
    m   /                                          m   /
        ====                                           ====

An abstract data structure

Nothing new under the sun: we need a way to represent our code as data structures, and the usual AST is here to help. A few Python classes are enough for a simple toy case: Binop (binary operation), Term (variable or number), Frac (a fraction, handled separately from other Binops because its rendering is really different), Superscript, Paren (a parenthesized expression), and whatever else you can come up with.

Simple example:

# Start to write our library

class Term(object):
    txt = None
    def __init__(self, txt):
        self.txt = txt

class Binop(object):
    lhs = None
    op = None
    rhs = None
    def __init__(self, lhs, op, rhs):
        self.lhs = lhs
        self.op = op
        self.rhs = rhs

class Frac(object):
    up = None
    down = None
    def __init__(self, up, down):
        self.up = up
        self.down = down

# [...] Really simple stuff

# And a simple example:
# We represent 1 + 2 / 3 + 4 as
expr = Binop(Binop(Term('1'), '+', Frac(Term('2'), Term('3'))), '+', Term('4'))

The canvas

We're dealing with ASCII art (UTF-8, actually, but "UTF-8 art" is not really a thing), so everything is about writing the right characters in the right column and row, one line at a time, with no way to go back. In Python, we will simply fill a list of lists of characters, each sub-list being a line.
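For concreteness, here is a minimal sketch of such a canvas (the names are mine, not necessarily the plugin's): a grid of spaces that the nodes will later write into, joined back into lines for display.

# allocate a 9x3 grid of spaces; nodes will write into canvas[row][col]
width, height = 9, 3
canvas = [[' '] * width for _ in range(height)]
canvas[1][0] = 'J'   # e.g. write a single character at row 1, column 0
print('\n'.join(''.join(line) for line in canvas))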

Sizing the canvas

How big should it be? It obviously depends on the equation. To find out, recursively answer the question "how much room do I need to display myself?"

  • for a Term, there is just one line of length len(self.txt) needed. Example of the bounding box:
    txt
    ┌──┐
    │42│
    └──┘
  • for a Binop, it's a little more complicated. For the width: the operator itself typically takes 3 characters including whitespace (' + '), plus the width of the lhs, plus the width of the rhs. For the height: the max of the heights of the lhs and the rhs (the operator only needs one line). Example:
        lhs       op rhs
    ┌───────────┐
    │     1     ├───┬──┐
    │-----------│_+_│42│
    │1 + exp(-x)├───┴──┘
    └───────────┘
  • For a fraction, the width needed is the max of the widths of the numerator and the denominator, and the height is the sum of both heights plus one for the fraction bar.
       ┌─┐
       │1│    lhs
    ┌──┴─┴──┐
    │-------│ op
    ├───────┤
    │     -x│ rhs
    │1 + e  │
    └───────┘

Simply add them up recursively, and you have the size of the bounding box. It's that simple. Well, it could be, if the only thing we had to do was compute the size. But looking a little ahead, we can easily guess that we will need a separate count of what's above and what's below the baseline, so that we can vertically center our text. We can also cache the results that will be needed when drawing. Each size() thus returns a triple (width, lines above the baseline, lines below the baseline). The code is:


class Term(object):
    # As before
    def size(self):
        self.length = len(self.txt)
        return (self.length, 0, 0)

class Binop(object):
    # As before
    def size(self):
        (ax, a_up, a_down) = self.lhs.size()
        (bx, b_up, b_down) = self.rhs.size()
        self.ax = ax
        self.a_up = a_up
        self.b_up = b_up
        return (ax + len(self.op) + bx, max(a_up, b_up), max(a_down, b_down))

class Frac(object):
    # As before
    def size(self):
        (ax, a_up, a_down) = self.up.size()
        (bx, b_up, b_down) = self.down.size()
        self.ax = ax
        self.a_up = a_up
        self.a_down = a_down
        self.bx = bx
        return (max(ax, bx), a_up + a_down + 1, b_up + b_down + 1)

Nothing scary, just an AST traversal.
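As a quick sanity check, sizing the expression from earlier (using ' + ' as the operator text, 3 characters including the spaces) gives:

expr = Binop(Binop(Term('1'), ' + ', Frac(Term('2'), Term('3'))), ' + ', Term('4'))
print(expr.size())
# (9, 1, 1): 9 columns wide, 1 line above the baseline, 1 line below it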

Drawing

The drawing happens as follows: the root node is given the coordinate (0, 0) as its top-left corner, computes where to write its own characters, and recursively gives a subset of the canvas to its children by shifting the coordinates. See the algorithm performing on (1 / 2) + 3. The size needed is 5 columns by 3 rows; the boxes show the bounding box in which the algorithm is currently drawing.

    Binop(..., ' + ', ...)

    ┌─────┐
    │     │
    │  +  │
    │     │
    └─────┘

    Binop(Frac(..., ...), ..., ...)
    ┌─┐
    │ │
    │-│ +
    │ │
    └─┘

    Binop(Frac(Term('1'), ...), ..., ...)
    ┌─┐
    │1│
    └─┘
     - +


    Binop(Frac(..., Term('2')), ..., ...)

     1
     - +
    ┌─┐
    │2│
    └─┘

    Binop(..., ..., Term('3'))

     1   ┌─┐
     - + │3│
     2   └─┘

I don't show the code for this part. Nothing interesting but horrible to read. Feel free to read it in the source code of vim-eqshow.
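Still, to give an idea of the shape of this traversal, here is a minimal sketch of my own (not the plugin's actual code), reusing the (width, up, down) convention from size(): each node receives the canvas and the (row, column) of its baseline, writes its own characters, and recurses with shifted coordinates.

class Term(object):
    # __init__ and size() as before
    def draw(self, canvas, row, col):
        # a Term just writes its text on its baseline row
        for i, c in enumerate(self.txt):
            canvas[row][col + i] = c

class Binop(object):
    # __init__ and size() as before
    def draw(self, canvas, row, col):
        # lhs, operator and rhs all share the same baseline row
        self.lhs.draw(canvas, row, col)
        for i, c in enumerate(self.op):
            canvas[row][col + self.ax + i] = c
        self.rhs.draw(canvas, row, col + self.ax + len(self.op))

class Frac(object):
    # __init__ and size() as before
    def draw(self, canvas, row, col):
        width = max(self.ax, self.bx)
        # the fraction bar sits on the baseline
        for i in range(width):
            canvas[row][col + i] = '-'
        # numerator ends one row above the bar, centered horizontally
        self.up.draw(canvas, row - 1 - self.a_down, col + (width - self.ax) // 2)
        # denominator starts one row below the bar (the real code would cache b_up too)
        (_, b_up, _) = self.down.size()
        self.down.draw(canvas, row + 1 + b_up, col + (width - self.bx) // 2)

# driver: allocate the canvas from size() and put the baseline 'up' rows down
expr = Binop(Binop(Term('1'), ' + ', Frac(Term('2'), Term('3'))), ' + ', Term('4'))
(width, up, down) = expr.size()
canvas = [[' '] * width for _ in range(up + down + 1)]
expr.draw(canvas, up, 0)
print('\n'.join(''.join(line) for line in canvas))
#     2
# 1 + - + 4
#     3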

Reading the code

Problem solved? Uh, not really: we can only print the stuff, not read the code. Only half of the job is done. It's parsing time!

Lexing

Okay, let's keep this simple: the input we have to parse is really small (one line of code), so we can afford to keep it all in memory. Our pseudo-lexing phase is really dumb:

  1. Read the line of code
  2. Add an end-of-input marker, like '$'
  3. Add spaces around all symbols
  4. Split on whitespaces
  5. Enjoy

symbols = ['+', '-', '/', '*', '(', ')', ',', '<', '>', '^', ';', '.', '=', "'"]

def lex(txt, syms):
    for s in syms:
        txt = txt.replace(s, ' ' + s + ' ')
    txt += ' $'
    return txt.split()

For our use, it's that simple.
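For instance, on the running example it produces exactly the token list the parser will start from:

print(lex('1 + 2 / 3 + 4', symbols))
# ['1', '+', '2', '/', '3', '+', '4', '$']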

Parsing

We have no choice other than writing a grammar. The easiest parser to implement is a simple recursive-descent one, and we will try to stick to that. Our parser is only expected to read valid input, so we drop the burden of detailed error handling and allow ourselves to write the grammar in a more permissive form whenever that simplifies things. The grammar is defined as a list of couples: the first element is the rule, the second is the semantic action to apply if the rule matches.

A rule is composed of three kinds of elements:

  1. A regexp that must be matched against a string
  2. A number i which will ask for recursively matching the i-th rule
  3. A list of rule indexes that will be tried in turn until one matches

And a semantic action is:

  1. A function
  2. Indexes of the tokens that should be passed as arguments.

# This grammar is simplified for the sake of clarity

rules = [
# 0 Eq
    ([1, '=', 0], [Binop, 0, 1, 2]),
# 1 PlusMinus
    ([2, r'\+|-', 1], [Binop, 0, 1, 2]),
# 2 Mul
    ([3, r'\*', 2], [Binop, 0, 1, 2]),
# 3 Div
    ([[4, 5], '/', 4], [Frac, 0, 2]),
# 4 Atomic
    ([r'(\w+)'], [Term, 0]),
# 5 Paren
    ([r'\(', 1, r'\)'], [Paren, 1]),
]

Here, to match rule 0, we first recursively try to match rule 1, then the equals sign, then recursively match rule 0 itself on the rest of the input.

Here again, the algorithm profits from the fact that we already have all the input as a list:

  1. Try to match the current grammar rule
  2. a. if it didn't match, return without modifying the input
     b. if it matched, replace the n tokens involved with the result of the semantic action

import re

def parse(expr, eidx, rules, ridx):
    i = 0
    for r in rules[ridx][0]:
        if type(r) == list:
            for subrule in r:
                if parse(expr, eidx + i, rules, subrule):
                    break
        elif isinstance(r, int):
            parse(expr, eidx + i, rules, r)
        elif not isinstance(expr[eidx + i], str) or \
            re.search(r, expr[eidx + i]) is None:
                return False
        i += 1

    node = rules[ridx][1]
    args = []
    for a in node[1:]:
        args.append(expr[eidx + a])
    expr[eidx : eidx + len(rules[ridx][0])] = [node[0](*args)]
    return True

See the algorithm match 1 + 2 / 3 + 4:

    Start:
    call trace: []
    input: ['1', '+', '2', '/', '3', '+', '4', '$']
    index:  ^
    [...]

    First token matched:
    call trace: [Eq in 1, PlusMinus in 2, Mul in 3, Div in [4], Atomic in "\w+"]
    input: [Term('1'), '+', '2', '/', '3', '+', '4', '$']
    index:  ^
    [...]

    About to match the '+':
    call trace: [Eq in 1, PlusMinus in "+"]
    input: [Term('1'), '+', '2', '/', '3', '+', '4', '$']
    index:             ^
    [...]

    Going back recursively in the rhs, about to match a token:
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in [4], Atomic in "\w+"]
    input: [Term('1'), '+', Term('2'), '/', '3', '+', '4', '$']
    index:                  ^
    [...]

    Back up in the call stack, we match the '/':
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in '/']
    input: [Term('1'), '+', Term('2'), '/', '3', '+', '4', '$']
    index:                             ^
    [...]

    And now we match the rhs of '/'
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in 4, Atomic in "\w+"]
    input: [Term('1'), '+', Term('2'), '/', Term('3'), '+', '4', '$']
    index:                                  ^
    [...]

    Div successfully matched, replace the tokens with the result of the match
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3]
    input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', '4', '$']
    index:                  ^
    [...]

    About to match the '+':
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in '+']
    input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', '4', '$']
    index:                                              ^
    [...]

    Matched, fetching the rhs:
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in 4, Atom]
    input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', Term('4'), '$']
    index:                                                   ^
    [...]

    Going up in the call stack, the deepest PlusMinus matched:
    call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 1]
    input: [Term('1'), '+', Binop(Frac(Term('2'), Term('3')), '+', Term('4')), '$']
    index:                  ^
    [...]

    Going up in the call stack, the first PlusMinus matched too:
    call trace: [Eq in 1, PlusMinus in 1]
    input: [Binop(Term('1'), '+', Binop(Frac(Term('2'), Term('3')), '+', Term('4'))), '$']
    index:  ^
    [...]

The input has successfully matched and the AST has been constructed, ready for pretty printing.
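Putting the pieces together on the running example (a sketch, assuming the lex, symbols, rules, and parse defined above; the plugin's real entry point works a bit differently):

tokens = lex('1 + 2 / 3 + 4', symbols)
parse(tokens, 0, rules, 0)   # the top-level Eq rule fails (there is no '='),
ast = tokens[0]              # but the permissive parsing leaves the AST in place
# ast is Binop(Term('1'), '+', Binop(Frac(Term('2'), Term('3')), '+', Term('4'))),
# ready to be handed to size() and the drawing pass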

This article is nothing groundbreaking, no false hopes or promises. A lot of time has passed since the last one; I just finally found a subject good enough to write an article on. Hope you'll still enjoy it!

Stay tuned for the NLP stuff ;)