Writing a French POS tagger (1)
French lacks a lot of NLP tools in the FOSS community, and everyone focuses on English AI. As a result, French language technology sucks. How sad for us. Thing is, I want an AI in my home, for various automation stuff. And I want it to run locally, for privacy reasons (I don't want to share my personal audio and video with companies every single second). So someone has to start something. Let's say this someone is me, hahaha, sounds so much like a noble duty, hahaha.
The very first step in NLP is to have a POS tagger, that is, something that tags every word in a sentence with its part of speech (verb, noun, pronoun, adjective, adverb, etc.). The tags then serve as extra features that let higher-level models generalize more easily (like "oh, this word is a noun. Even if I never saw this word before, I know that this sentence is syntactically correct") and with less data.
One might at first think that this is not even an issue: just look up the dictionary entry for the word. Sadly, language is highly ambiguous. Consider "Je commande" (I order) and "une commande" (an order): depending on the context, "commande" is either a verb form or a noun. Thankfully, French is much more morphologically rich than English, which makes it less ambiguous and provides more patterns within words to guess their POS (the way an English word ending in "ly" is most likely an adverb).
Alright, let's brainstorm a little and see what kind of clues we can use to guess a word's POS. In machine learning terminology, those clues are called "features".
- Hopefully, we already know the word, and one of its possible POS clearly dominates in frequency. For instance, the determiner "la" is ambiguous with "LA" (Los Angeles), but since the determiner occurs much more often than the city, always tagging it as a determiner will fail in only very few cases. The same applies to "est" (a form of "être"/"to be") and "est", the Latin locution.
- Case would also help disambiguate the LA example. So keeping the word's shape is a good feature. We're unlikely to know every city, every personal name, every organisation, etc., so a capitalized first letter is actually a good hint to tag a word as a proper noun.
- If we don't know the word, we can use some morphological features of French. Suffixes like "er", "é", "ées", "ée", "eindre", "oindre", "ir", "issons", "issez", "ent" are very strong features.
- Same applies for prefixes.
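The clues listed above can be sketched as a small feature extractor. This is only a rough sketch under my own assumptions: the function names `word_shape` and `extract_features`, the feature names, and the maximum affix length of 6 (picked so that suffixes like "issons" fit) are illustrative choices, not a fixed design.

```python
def word_shape(word):
    """Collapse a word into a coarse shape: 'X' upper, 'x' lower, 'd' digit."""
    shape = []
    for ch in word:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch  # keep punctuation such as '-' or "'" as-is
        # Collapse runs of the same code, so "Paris" and "Londres" both give "Xx".
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def extract_features(word, max_affix_len=6):
    """Return a dict of features for one word (feature names are hypothetical)."""
    lower = word.lower()
    features = {
        "word": lower,                         # the word itself, if we know it
        "shape": word_shape(word),             # keeps the case information
        "is_capitalized": word[:1].isupper(),  # hint for proper nouns
    }
    # Only keep affixes strictly shorter than the word itself.
    for n in range(1, max_affix_len + 1):
        if len(lower) > n:
            features[f"suffix_{n}"] = lower[-n:]
            features[f"prefix_{n}"] = lower[:n]
    return features
```

For instance, `extract_features("Finissons")` keeps the lowercased word, the shape "Xx", and the suffix "issons", while "LA" and "la" end up with different shapes ("X" and "x").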
Great. But there's another question. More than one, actually. How much can I trust each of these features? What are their relative levels of confidence? I mean, I believe they help. I don't know if I'm right, and if I am, I don't know how much. An "ement" suffix always marks an adverb (tbh, I just don't have any counter-example off the top of my head right now), while "er" very often signals an infinitive verb, but one counter-example is the adjective "léger" (light, as in not heavy), so we're slightly less confident in this one.
How do we choose those confidence levels? That's where we start to apply machine learning.
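To give a first, naive taste of what "choosing confidence levels" could mean, one option is to count how often a suffix goes with each tag in hand-labelled data, and use the relative frequency as a confidence. The tiny word list below is made up for illustration (it's not a real corpus), and the tag names and the `suffix_confidence` helper are assumptions of this sketch, not necessarily the approach a real tagger would end up using.

```python
from collections import Counter

# Toy hand-labelled data, invented for illustration only (not a real corpus).
TOY_DATA = [
    ("manger", "VERB"),
    ("parler", "VERB"),
    ("chanter", "VERB"),
    ("léger", "ADJ"),
    ("rapidement", "ADV"),
    ("lentement", "ADV"),
]


def suffix_confidence(tagged_words, suffix):
    """Relative frequency of each tag among the words ending with `suffix`."""
    counts = Counter(tag for word, tag in tagged_words if word.endswith(suffix))
    total = sum(counts.values())
    if total == 0:
        return {}  # suffix never seen: no evidence either way
    return {tag: n / total for tag, n in counts.items()}


print(suffix_confidence(TOY_DATA, "er"))
print(suffix_confidence(TOY_DATA, "ement"))
```

On this toy data, "er" leans towards verbs but leaves some room for adjectives because of "léger", while "ement" only ever shows up on adverbs. With real data the numbers would of course be different, and six words are far too few to trust anything.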