## Writing a french POS tagger (2)

By Vermeille on Tuesday 13 October 2015, 12:00 - Permalink

How to choose those confidence levels? That's where we start to apply machine learning.

Okay, we're into maths now, so everything is a number. Let's assign a unique natural integer to every word in our vocabulary, so that \(w_i\) is the word with the id i, and a unique integer to every POS class, so that \(c_i\) is the class with the id i.

## Maximum Likelihood Estimate

Since we're doing ML, let's use some vocabulary. Everything we extract from the ith word (noted wi) is called "features", denoted , the POS we're trying to tag are named "classes" and hence will be noted c, and our confidence is a probability. If we forget a little about suffixes, prefixes and capitalization, we have only the word as a source of info. Predicting is just the act of doing

\[\underset{c}{\operatorname{argmax}} P(c | w_i)\]

which is like asking "which class i'm the most confident about considering this word?". How do we find such a probability? We can use a simple algorithm called Maximum Likelihood Estimate which works like this:

\[P(c | w_i) = \frac{\operatorname{Count}(w_i\text{ is a }c)}{\operatorname{Count}(w_i)}\]

"How many times, in a big text corpus, is \(w_i\) of class c relatively to is global appearance?". And since the denominator is the same for all classes (it varies solely on w_i), we can leave it off.

\[\underset{c}{\operatorname{argmax}} P(c | w_i) = \underset{c}{\operatorname{argmax}} \operatorname{Count}(w_i\text{ is a }c)\]

Super simple, but we can't do anything for unknown words despite having all those fancy morphological features. We need something to incorporate them.

## Maximum Entropy model

### Prediction

Let's say we have what we call a feature vector \(\theta\), for which \(\theta_i\) is the \(i\)th feature being 1 if the feature is present, 0 otherwise. When we try to predict the class \(c\), the ith feature will be more or less discriminative. Let's represent that by weighting it with \(\lambda_{c,i}\) "how indicative is \(\theta_i\) for class \(c\)?" where:

- a high lambda means "wow, super discriminative, this is super indicative of this class!"
- a low (negative) lambda means "Wow, super discriminative, this super indicative of NOT this class!"
- and a lambda about 0 means "this is not helping me for class c". Emphasis, \(\lambda_{c,i}\theta_i\) is NOT a probability, it's just a score. Continuing this way, evaluating the score of class \(c\) for the whole feature vector is a the dot product \(\lambda_c\cdot\theta\). Weighting then summing.

Note: Here, \(\lambda\) is essentially a matrix from which I pick the column \(c\), which is contrary to everything you'll read in the litterature. I'm doing this because, contrary to what the litterature says, I assume that a feature can be discriminative for every class, whereas most papers wants you to condition the features on the class being predicted, allowing a different set of features per class. With my approach, we have more parameters, some of them will be weighted to 0, but it prevents from early optimization. Later on, and after having analyzed the lambdas, one can purge the unnecessary features and weights in the actual implementation, without having missed a discriminative indication.

Once we have this score, easy stuff, find the class with the best score.

\[\underset{c}{\operatorname{argmax}} P(c | \theta, \lambda) = \underset{c}{\operatorname{argmax}} \lambda_c\cdot\theta\]

how cool? Wait, how do I choose all those lambdas?

### Learning

Well, that's a different story. To make a long story short, we have a very handy algorithm, called Maximum Entropy. It relies on the fact that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

I said earlier that scores weren't probabilities. I say that ME works on probabilities. We need to turn those score into probabilities. First we needs the class scores to be on the same scale. Fairly easy, just divide the score for the class \(c\) by the absolute score of all \(c\). We're mapping all of our values to [-1;1].

\[\operatorname{not-really-P}(c | \theta, \lambda) = \frac{\lambda_c\cdot\theta}{\sum_{c'}|\lambda_{c'}\cdot\theta|}\]

Still not good, we need [0, 1]. Well, we could just +1, but actually, we don't like computers to work on [0;1], because computation in such a range tends to quickly turn to NaN and absurdely small numbers. And divisions and multiplications are not cheap. That's why we prefer to deal with log-probabilities instead of probabilities. Log is monotonic, has nice additive properties and is cute looking function, despite its undefined log(0) value. It turns out that the best way to make probabilities from those scores is the exp function.

\[P(c | \theta,\lambda) = \frac{\exp{\lambda_c\cdot\theta}}{\sum_{c'}\exp{\lambda_{c'}\cdot\theta}}\]

It maps to the correct values range, and if we take the log probability, the division turns to a substraction and the exp cancel out. How nice.

And the super good thing is that, by whichever mathemagics I'm not fully getting yet (please someone explain?), Maximum Entropy and Maximum Likelihood are linked, which brings us to an optimization objective of:

\[\underset{\lambda}{\operatorname{argmin}} -\log \mathcal{L} = -\sum_x\log\frac{\exp{\lambda_c\cdot\theta^{(x)}}}{\sum_{c'}\exp{\lambda_{c'}\cdot\theta^{(x)}}}\]

Where \(\theta^{(x)}\) is the feature vector of the \(x\)th example in the dataset.

Cool. We have to take the gradient with respect to lambda, which gives us

\[\frac{\partial \mathcal{L}}{\partial\lambda_c} = \sum_x (1\{x\text{ is a }c\}\theta^{(x)} - \sum_c P(c | \theta^{(x)}, \lambda)\theta^{(x)})\]

with this derivative, we can take a simple iterative approach to update lambda.

\[\lambda := \lambda - \alpha\frac{\partial \mathcal{L}}{\partial \lambda}\]

This is quite slow, but it works.

And in the end, you have your model.

Wait, what about the context? We used only features from the word, and we can't disambiguate from the very first example "je commande" and "une commande". Well, we'll have to use something a little smarter than a Maximum Entropy Model and use a MEMM, a Maximum Entropy Markov Model.