# Vermeille's blog

## Writing a french POS tagger (2)

How to choose those confidence levels? That's where we start to apply machine learning.

Okay, we're into maths now, so everything is a number. Let's assign a unique natural integer to every word in our vocabulary, so that $$w_i$$ is the word with the id i, and a unique integer to every POS class, so that $$c_i$$ is the class with the id i.

## Maximum Likelihood Estimate

Since we're doing ML, let's use some vocabulary. Everything we extract from the ith word (noted wi) is called "features", denoted , the POS we're trying to tag are named "classes" and hence will be noted c, and our confidence is a probability. If we forget a little about suffixes, prefixes and capitalization, we have only the word as a source of info. Predicting is just the act of doing

$\underset{c}{\operatorname{argmax}} P(c | w_i)$

which is like asking "which class i'm the most confident about considering this word?". How do we find such a probability? We can use a simple algorithm called Maximum Likelihood Estimate which works like this:

$P(c | w_i) = \frac{\operatorname{Count}(w_i\text{ is a }c)}{\operatorname{Count}(w_i)}$

"How many times, in a big text corpus, is $$w_i$$ of class c relatively to is global appearance?". And since the denominator is the same for all classes (it varies solely on w_i), we can leave it off.

$\underset{c}{\operatorname{argmax}} P(c | w_i) = \underset{c}{\operatorname{argmax}} \operatorname{Count}(w_i\text{ is a }c)$

Super simple, but we can't do anything for unknown words despite having all those fancy morphological features. We need something to incorporate them.

## Maximum Entropy model

### Prediction

Let's say we have what we call a feature vector $$\theta$$, for which $$\theta_i$$ is the $$i$$th feature being 1 if the feature is present, 0 otherwise. When we try to predict the class $$c$$, the ith feature will be more or less discriminative. Let's represent that by weighting it with $$\lambda_{c,i}$$ "how indicative is $$\theta_i$$ for class $$c$$?" where:

• a high lambda means "wow, super discriminative, this is super indicative of this class!"
• a low (negative) lambda means "Wow, super discriminative, this super indicative of NOT this class!"
• and a lambda about 0 means "this is not helping me for class c". Emphasis, $$\lambda_{c,i}\theta_i$$ is NOT a probability, it's just a score. Continuing this way, evaluating the score of class $$c$$ for the whole feature vector is a the dot product $$\lambda_c\cdot\theta$$. Weighting then summing.

Note: Here, $$\lambda$$ is essentially a matrix from which I pick the column $$c$$, which is contrary to everything you'll read in the litterature. I'm doing this because, contrary to what the litterature says, I assume that a feature can be discriminative for every class, whereas most papers wants you to condition the features on the class being predicted, allowing a different set of features per class. With my approach, we have more parameters, some of them will be weighted to 0, but it prevents from early optimization. Later on, and after having analyzed the lambdas, one can purge the unnecessary features and weights in the actual implementation, without having missed a discriminative indication.

Once we have this score, easy stuff, find the class with the best score.

$\underset{c}{\operatorname{argmax}} P(c | \theta, \lambda) = \underset{c}{\operatorname{argmax}} \lambda_c\cdot\theta$

how cool? Wait, how do I choose all those lambdas?

### Learning

Well, that's a different story. To make a long story short, we have a very handy algorithm, called Maximum Entropy. It relies on the fact that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

I said earlier that scores weren't probabilities. I say that ME works on probabilities. We need to turn those score into probabilities. First we needs the class scores to be on the same scale. Fairly easy, just divide the score for the class $$c$$ by the absolute score of all $$c$$. We're mapping all of our values to [-1;1].

$\operatorname{not-really-P}(c | \theta, \lambda) = \frac{\lambda_c\cdot\theta}{\sum_{c'}|\lambda_{c'}\cdot\theta|}$

Still not good, we need [0, 1]. Well, we could just +1, but actually, we don't like computers to work on [0;1], because computation in such a range tends to quickly turn to NaN and absurdely small numbers. And divisions and multiplications are not cheap. That's why we prefer to deal with log-probabilities instead of probabilities. Log is monotonic, has nice additive properties and is cute looking function, despite its undefined log(0) value. It turns out that the best way to make probabilities from those scores is the exp function.

$P(c | \theta,\lambda) = \frac{\exp{\lambda_c\cdot\theta}}{\sum_{c'}\exp{\lambda_{c'}\cdot\theta}}$

It maps to the correct values range, and if we take the log probability, the division turns to a substraction and the exp cancel out. How nice.

And the super good thing is that, by whichever mathemagics I'm not fully getting yet (please someone explain?), Maximum Entropy and Maximum Likelihood are linked, which brings us to an optimization objective of:

$\underset{\lambda}{\operatorname{argmin}} -\log \mathcal{L} = -\sum_x\log\frac{\exp{\lambda_c\cdot\theta^{(x)}}}{\sum_{c'}\exp{\lambda_{c'}\cdot\theta^{(x)}}}$

Where $$\theta^{(x)}$$ is the feature vector of the $$x$$th example in the dataset.

Cool. We have to take the gradient with respect to lambda, which gives us

$\frac{\partial \mathcal{L}}{\partial\lambda_c} = \sum_x (1\{x\text{ is a }c\}\theta^{(x)} - \sum_c P(c | \theta^{(x)}, \lambda)\theta^{(x)})$

with this derivative, we can take a simple iterative approach to update lambda.

$\lambda := \lambda - \alpha\frac{\partial \mathcal{L}}{\partial \lambda}$

This is quite slow, but it works.

And in the end, you have your model.

Wait, what about the context? We used only features from the word, and we can't disambiguate from the very first example "je commande" and "une commande". Well, we'll have to use something a little smarter than a Maximum Entropy Model and use a MEMM, a Maximum Entropy Markov Model.

## Context

French lacks a lot of NLP tools in the FOSS community. And everyone focuses on english AI. Then, french technology sucks. How sad for us. Thing is, I want an AI in my home, for various automation stuff. And I want it to run locally, for privacy reasons (I don't want to share with companies my personnal audio and video at any single second). Then, someone has to start something. Let's say this someone is me, hahaha, sounds so much like a noble duty, hahaha.

The very first step in NLP is to have a POS tagger, that is, something tagging every word in a sentence with its part-of-speech (verb, noun, pronoun, adjective, adverb, etc). Having this allows to have another features that will allow higher level model to generalize more easily (like "oh, this word is a noun. Even if I never saw this word before, I know that this sentence is synctactically correct") and with less data.

One might at first think that this is not even an issue: look at the dictionnary entry for this word. Sadly, language is highly ambiguous. Let's consider "Je commande" (I order) and "une commande" (an order); depending on the context, "commande" is either a noun or a verbal form. Thankfully, french is much more morphologically rich than english, which makes it a less ambiguous language and provides more patterns in the words to guess their POS (like a word ending in "ly" is most likely an adverb).

## Word features

Alright, let's brainstorm for a little, and see what kind of clues we can have to guess a word's POS. in machine learning terminology, that's called "features".

• hopefully, we already know the word, and this word will have a clear frequential dominance of one its many possible POS. For instance, the determiner "la" is ambiguous with "LA" (Los Angeles), but the determiner occuring much more than the city, always tagging it as the determiner will fail in only very few cases. Same applies for the "est" (a form of "être"/"be") and "est", the latin locution.
• Case would also help disambiguate the LA case. So, keeping the shape is a good feature. We'll unlikely to know every city and possible name and every organisation etc, then something starting with a capitalized letter is actually a good indication to tag it as a proper noun.
• If we don't know the word, we can use some morphological features of french. Suffixes like "er", "é", "ées", "ée", "eindre", "oindre", "ir", "issons", "issez", "ent" are very strong features.
• Same applies for prefixes.

Great. But they're another question. More than one, actually. How much can I trust every of these features? What are their relative levels of confidence? I mean, I believe they help. I don't know how if I'm right, and if I am, I don't know how much. A "ement" prefix is alwyas an adverb (tbh, I just don't have any counter example on top of head right now), while "er" is super often discriminative to infinitive verbs, but one counter example could be the adjective "léger" (not heavy), so we're slightly less confident in this one.

How to choose those confidence levels? That's where we start to apply machine learning.

As a computer geek who eventually turned into a software engineer, one of the questions I've been asked the most (if not the most), is

Can you hack someone's service account?

And I'm sure it's super common among SWEs, and I'm sure I'm not the only one raging about that.

No, I fuckin' can't.

To understand why, let me explain what it takes to hack something.

### An internet service

Any internet service (GMail, Facebook, MSN, any blogging service, etc) is a (complex) system built by engineers. Compare it to an engine, in a mechanical sense, or an electronic circuit, if you like, in a black box. This service is cut into many different parts:

• The web view is what you actually see and quite often the only way you have to use the website. On a minimal or poorly designed website, you directly receive the view filled with the data (the content of the page).

• Sometimes, this webview calls another service on the web. For exemple, when you go on Facebook, you want to see the webview. When it's showing, the webview queries another service on the web to know what it must show, this service is known as an API. The API sends data back to the webview in an easy to process but very not pretty way.

• Both of them are generated on a webserver where the company's code is running. You have absolutely no control on that. Actually, that's all of what hacking is about: be able to access this server and do bad stuff.

• Most of the time, there's also a database service where all of user data is stored. This is the most precious piece of such services. Breaking into this is actually even better than the webserver. Engineers know that. That's why it has most often no interaction to the external world. It's directly accessible only from the inside of the company, and very few people have the credentials. For such big companies, most engineers never see actual data, and work only on generated and fictious data for privacy concerns, some being by company's ethics, other enforced by law. So that data can be actually rendered to the user, the webserver queries the database about what's needed, filters the result and renders it to the webview or the API.

### How do we get in?

You have to find a weakness of the system. You have to misuse it. If we continue with the analogies proposed before, the question of hacking is:

Given what we could see and how we could interact with the engine, how can we misuse it so that it does what we want?

If you have a car with an artificial speed limitation, you have to understand how the engine works, why it has this limitation, and how you can remove it. Not that hard'uh? Everybody at least a little known in mechanics can do that! Sure thing SWEs know how software works and can do the same!

Nope. In the case of an internet service, you can't even touch the engine. You can only see what it produces, and use it in the way engineers allowed you. You don't know how it works because most of the time the company keeps it secret. That's trying to bypass the car's speed limitation by only using the car's controls. Now it sounds way harder, doesn't it?

### What is hacking about, in the end?

Hacking is misusing you service, trying to guess what are the mistakes the engineers could have done, so that you can do more than what you should. It's about forgetting to handle server's state properly when doing two opposite actions; not securing some text fields, allowing you to inject code (super old, everybody's aware about this one), not checking properly filetypes when uploading files etc.

So, the hacker has to guess, try, fail, imagine what the engineers have done wrong. Try everything that comes to his mind, and, hopefully, that'll work. And even if you find a weakness, you have no clue that you can exploit it to do what you'd like. Maybe it will just produce garbage results.

### I know a weakness and how to exploit it

Good, you have two choices now:

1. Contact the service, and tell them about it. They'll thank you, you'll save people data, maybe can write an article that will get famous for few weeks, and be given some money by the said company.

2. Keep it for yourself. Obviously, if it gets known, they will eventually fix it, and it will become unusable.

So, exploits are well kept secrets. No, I can't hack your ex-girlfriend's account.

## But I had my account hacked!

Well, odds are it's not Facebook's fault. As long as a service has users, there is a major vulnerability: the user's stupidity. Users are dumb. Beyond imagination. They write their passwords in a non secured file on their computers, they use public data like their birthday, or they use publicly available answer to their secret question (seriously guys, just enter garbage if you're asked. And if you want to "hack" someone you know, just ask them for the answer, they'll never remind this question can leak their accounts), they use the same password on every service, they leave their computers unlocked, they believe phishing, etc etc.

If you've been hacked, chances are that you've been stupid at some point. If you want to hack someone, just exploit his stupidity and send a phishing page.

## McAfee, why do you still exist?

Okay, this post is less formal or interesting than usually, I just want to express the hate and rage I have against McAfee.

Dear McAfee, an anti-virus follows one simple concept: while you're installed, keep the computer safe. If you want the user to pay for it, fine. But an anti-virus is not about BLOATING THE USER'S COMPUTER IF HE DOESN'T PAY . So far, with friends and family computers, I've noticed several way you use to fuck up people's computer:

1. Hook when Windows loads a driver, and insert a huge fucking delay in it. Plug something, and the computer freezes for few seconds. GREAT.
2. Make the computer slow as fuck. Just launch 4 threads doing an infinite loop. Amazing, uh?
3. Fuck Windows up so that users cannot access http websites, only https.

Do you have McAfee on your computer? Just fuckin' remove it. Never ever pay for those crooks.

## Vim-Eqshow

Vim supporting python scripting makes writing plugin a far more enjoyable experience. In this post, I'm going to discuss briefly about the plugin vim-eqshow I wrote few days ago. I will not discuss the vim API and implementation support, but the algorithms used in it. There's nothing groundbreaking or supersmart but I find enjoyable anyway.

### Motivation

I'm currently learning about machine learning, hence the lot of maths I'm writing now (I promise, I'll post soon (I hope) about the deep NLP stuff I'm working on). And it appears that writing maths is absolutely awful, and readability is even worse. Just look at this, logistic regression with L2 regularization in matlab:

J = 1 / m * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + lambda / m * sum(theta2 .^ 2);

Awful, isn't it? My take on this would be to be able to switch vim in a "view mode" where equations are pretty printed to ease reading, then going back in edit mode to continue writing code. And the equational view mode would show the previous equation like this:

        ====                                           ====
1   \                                          λ   \      2
J = - *  >  (-y ⊗ log(h) - (1 - y) ⊗ log(1 - h)) + - *  >  (θ2 )
m   /                                          m   /
====                                           ====

### An abstract datastructure

Nothing new under the sun: we need a way to represent our code as datastructures, and the usual AST is here to help. Few Python classes are necessary for a simple toy case: Binop (binary operation), Term (variable or number), Frac (a fraction, handle separately than other Binops because its rendering is really different), Superscript, Parenthesized, and whatever you can come up with.

Simple example:

# Start to write our library

def Term(object):
txt = None
def __init__(txt):
self.txt = txt

def Binop(object):
lhs = None
op = None
rhs = None
def __init__(lhs, op, rhs):
self.lhs = lhs
self.op = op
self.rhs = rhs

def Frac(object):
up = None
down = None
def __init__(up, down):
self.up = up
self.down = down

# [...] Really simple stuff

# And a simple example:
# We represent 1 + 2 / 3 + 4 as
expr = Binop(Binop(Term('1'), '+', Frac(Term('2'), Term('3'))), '+', Term('4'))

### The canvas

We're dealing with ASCII art (utf-8, actually, but "Utf-8 art" is not even a thing), so everything is about writing the right characters on the right column and row, one line at a time, without possibility to go back. In Python, we will just fill a list of list of chars, each sub list being a line.

#### Sizing the canvas

How big should it be? It obviously depends on the equation. To find out, recursively answer the question "how much room do I need to show up?"

• for a Term, there is just one line of length len(self.txt) needed. Example of the bounding box:
    txt
┌──┐
│42│
└──┘
• for a binop, it's a little more complicated. For the width, the operator itself needs one line and typically 3 characters including whitespaces (' + '), plus the width of the lhs, plus the width of the rhs. For the height, the max of the height of the lhs and rhs. Example:
        lhs       op rhs
┌───────────┐
│     1     ├───┬──┐
│-----------│_+_│42│
│1 + exp(-x)├───┴──┘
└───────────┘
• For a fraction, the width needed is the max of width of the numerator and the denominator, and the height is the sum of both plus one for the line.
       ┌─┐
│1│    lhs
┌──┴─┴──┐
│-------│ op
├───────┤
│     -x│ rhs
│1 + e  │
└───────┘

Simply add them recursively, and you have the size of the bounding box. It's that simple. Well, it could be, if the only thing we had to do was to compute the size. But looking a little forward, we can easily guess that we could need a separate count of what's up, and what's down, so that we can vertically center our text. We can also cache the results that will be needed when drawing. The code is thus:


class Term(object):
# As before
def size(self):
self.length = len(self.txt)
return (self.length, 0, 0)

class Binop(object):
# As before
def size(self):
(ax, a_up, a_down) = self.a.size()
(bx, b_up, b_down) = self.b.size()
self.ax = ax
self.a_up = a_up
self.b_up = b_up
return (ax + len(self.op) + bx, max(a_up, b_up), max(a_down, b_down))

class Frac(object):
#As before
def size(self):
(ax, a_up, a_down) = self.top.size()
(bx, b_up, b_down) = self.bottom.size()
self.ax = ax
self.a_up = a_up
self.a_down = a_down
self.bx = bx
return (max(ax, bx), a_up + a_down + 1, b_up + b_down + 1)

Nothing scary, AST traversal.

#### Drawing

The drawing will happen as follow: the root note is given the coordinate (0, 0) as its top-left corner, will compute where to write its own stuff, and recursively give a subset of the canvas to its children, by changing the coordinate. See the algorithm performing on (1 / 2) + 3. The size needed is 3x5, the box show the bounding box in wich the algorithm is currently drawing

    Binop(..., ' + ', ...)

┌─────┐
│     │
│  +  │
│     │
└─────┘

Binop(Frac(..., ...), ..., ...)
┌─┐
│ │
│-│ +
│ │
└─┘

Binop(Frac(Term('1'), ...), ..., ...)
┌─┐
│1│
└─┘
- +

Binop(Frac(..., Term('2')), ..., ...)

1
- +
┌─┐
│2│
└─┘

Binop(..., ..., Term('3'))

1   ┌─┐
- + │3│
2   └─┘

I don't show the code for this part. Nothing interesting but horrible to read. Feel free to read it in the source code of vim-eqshow.

Problem solved? Uh, not really, we only print the stuff, but not read the code. Only half of the job is done. It's parsing time!

### Lexing

Okay, keep this simple: the input we have to parse is really small (one line of code), we can afford storing it all in memory. Our pseudo-lexing phase is really dump:

1. Read the line of code
2. Add an end-of-input marker, like '$' 3. Add spaces around all symbols 4. Split on whitespaces 5. Enjoy symbols = ['+', '-', '/', '*', '(', ')', ', ', '<', '>', '^', ';', '.', '=', "'"] def lex(txt, syms): for s in syms: txt = txt.replace(s, ' ' + s + ' ') txt += '$'
return txt.split()

For our use, it's that simple.

### Parsing

We have no choice other than writing a grammar. The easiest parser to implement is LL(0), and we will try to stick to it. Our parser is supposed to read valid input, we take away the burden of detailed error handling or allows to write the grammar in a more permissive form if it simplifies writing it. The grammar is defined as a list of couples: the first element is a the rule, the second is the action to apply if successfully matched.

A rule is composed by these three types of elements:

1. A regexp that must be matched against a string
2. A number i which will ask for recursively matching the i-th rule
3. A list of rules index that will be tried in turn until one matches

And a semantic action is:

1. A function
2. Indexes of the token that should be passed as arguments.
# This grammar is simplified for the sake of clarity

rules = [
# 0 Eq
([1, '=', 0], [Binop, 0, 1, 2]),
# 1 PlusMinus
([2, '\+|-', 1], [Binop, 0, 1, 2]),
# 2 Mul
([3, '\*', 2], [Binop, 0, 1, 2]),
# 3 Div
([[4, 5], '/', 4], [Frac, 0, 2]),
# 4 Atomic
(["(\w+)"], [Term, 0]),
# 5 Paren
(['$$', 1, '$$'], [Paren, 1]),
]

Here, to match the rule 0, we first try to match recursively 1, then the equals sign, then recursively try to call on itself on the rest of the input.

Here again, the algorithm profits from the fact that we already have all the input as a list:

1. Try to match the current grammar rule 2.a. if it didn't match, return without modification 2.b. if it matched, replace the n token involved with the result of the semantic action.
def parse(expr, eidx, rules, ridx):
i = 0
#print(expr, ridx)
for r in rules[ridx][0]:
if type(r) == list:
for subrule in r:
if parse(expr, eidx + i, rules, subrule):
break
elif isinstance(r, int):
parse(expr, eidx + i, rules, r)
elif not isinstance(expr[eidx + i], str) or \
re.search(r, expr[eidx + i]) == None:
return False
i += 1

node = rules[ridx][1]
args = []
for a in node[1:]:
args.append(expr[eidx + a])
expr[eidx : eidx + len(rules[ridx][0])] = [node[0](*args)]
return True

See the algorithm match 1 + 2 / 3 + 4:

    Start:
call trace: []
input: ['1', '+', '2', '/', '3', '+', '4', '$'] index: ^ [...] First token matched: call trace: [Eq in 1, PlusMinus in 2, Mul in 3, Div in [4], Atomic in "\w+"] input: [Term('1'), '+', '2', '/', '3', '+', '4', '$']
index:  ^
[...]

call trace: [Eq in 1, PlusMinus in "+"]
input: [Term('1'), '+', '2', '/', '3', '+', '4', '$'] index: ^ [...] Going back recursively in the rhs, about to match a token: call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in [4], Atomic in "\w+"] input: [Term('1'), '+', Term('2'), '/', '3', '+', '4', '$']
index:                  ^
[...]

Back up in the call stack, we match the '/':
call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in "\*"]
input: [Term('1'), '+', Term('2'), '/', '3', '+', '4', '$'] index: ^ [...] And now we match the rhs of '/' call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in 4, Atomic in "\w+"] input: [Term('1'), '+', Term('2'), '/', Term('3'), '+', '4', '$']
index:                                  ^
[...]

Div successfully matched, replace the tokens with the result of the match
call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3]
input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', '4', '$'] index: ^ [...] About to match the '+': call trace: [Eq in 1, PlusMinus in 1, PlusMinus in '+'] input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', '4', '$']
index:                                              ^
[...]

Matched, fetching the rhs:
call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 1, PlusMinus in 2, Mul in 3, Div in 4, Atom]
input: [Term('1'), '+', Frac(Term('2'), Term('3')), '+', Term('4'), '$'] index: ^ [...] Going up in the call stack, the deepest PlusMinus matched: call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 1] input: [Term('1'), '+', Binop(Frac(Term('2'), Term('3')), '+', Term('4')), '$']
index:                  ^
[...]

Going up in the call stack, the first PlusMinus matched too:
call trace: [Eq in 1, PlusMinus in 1, PlusMinus in 1]
input: [Binop(Term('1'), '+', Binop(Frac(Term('2'), Term('3')), '+', Term('4'))), '\$']
index:  ^
[...]

The input successfully matched and the AST have been constructed, ready for pretty printing.

This article is not groundbreaking, no false hope or promise. So much time passed since the last one, I just found a good enough subject to write an article on. Hope you'll still enjoy it!

Stay tuned for the NLP stuff

- page 2 of 6 -