Face recognition at scale pt.1 (featuring a lot more buzzwords in the text!)


Because I'm such a soulless sheep, I've recently started a PhD and a new job. Deep Learning, computer vision, hype, artificial intelligence, more buzzwords, face recognition, butterflies (I can't help picturing flying butter when I hear that word), neural networks, etc.

So yeah. I've been doing that for exactly 4 months, at the moment of writing (19/6/18). It appears that the company I'm working for needs some face recognition to identify people in videos. Like, actually, a fuckton of people, in a fuckload of videos, with a big shitload of uploads per day (I'm trying to curse proportionally to the numbers I'm describing). That means that whatever I'm doing, I have to do it...

  • quickly: people there are a bit skeptical about DL and I have to prove my worth
  • fast: because whatever algorithms I'm running, I have to process a day's uploads in less than a day, otherwise I'll never catch up
  • accurately: we already have reviewers, they're already busy, and the goal is to help them, not to give them more work with false positives etc.
  • with the least effort possible: because I'm completely alone on that herculean task

Oh, and btw, I have no formal education in ML/DL; I've basically been self-taught for a while, and this is more or less my introductory task in ML. Talk about suicidal tendencies.

Proof of concept and initial system

The Data

The first thing I had to do was download some pictures of each person I wanted to recognize. I made a list of 100 people and downloaded around 4k pictures for each of them. Why that many? I didn't exactly know how many photos are actually needed (and still don't, tbh), they're super easy to find on the internet so why not, the more the better, and I thought I might need to train my own face recognition engine, because the domain is much less constrained than traditional face recognition systems.

Analyzing what the script downloaded revealed an awful lot of noise: photos not showing the face, photos of another person sharing part of the name, photos where there were other faces, etc. And I just didn't want to review 400k pictures by myself, one by one, or I'd still be doing it today.


Cleaning the data

I jumped straight onto dlib, wrapped by the fairly good face_recognition (btw, if you haven't read Adam's post on face recognition, it's awesome), detected the faces, and generated the embeddings: a 128-D vector representing one's facial features, such that vectors of the same person are meant to be close (in Euclidean distance) and far apart for different people.
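To make the "close for the same person, far apart for different people" idea concrete, here's a minimal sketch of comparing embeddings by Euclidean distance. The vectors below are toy stand-ins, not real dlib embeddings; in practice they'd come from face_recognition's encoder.

```python
import math
import random

random.seed(0)

def embed(center, spread=0.05):
    # Fake "embedding": a 128-D point jittered around a per-person center.
    # Real embeddings come from the network; this just mimics their geometry.
    return [center + random.uniform(-spread, spread) for _ in range(128)]

alice_1 = embed(0.1)   # two shots of the same (hypothetical) person
alice_2 = embed(0.1)
bob_1   = embed(0.9)   # a different person

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same person -> small distance, different people -> large distance.
print(euclidean(alice_1, alice_2) < euclidean(alice_1, bob_1))  # True
```

Identification then boils down to thresholding that distance, or looking up the nearest labeled embeddings.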

To actually inspect and clean all of this, I needed some kind of visual tool. I don't know jack about JavaScript, except:

  • It works in a browser, which is good: not only is it remote, but nothing needs to be installed to use it;
  • I hate stupid languages like php and javascript that are so terribly error-prone and inconsistent (I learned yesterday that php doesn't have a proper way to concatenate arrays, that's amazing);
  • React is the only frontend technology that hasn't made me want to shoot myself with a cannon so far. I guess that makes it my favorite (or should I say, least hated) UI technology at that point;
  • Andrej Karpathy wrote the awesome t-SNE.js, which I definitely wanted to use to explore the data I had just downloaded (or that's what I told my thesis supervisor, but for realsies it's because PICTURES MOVE! WOOOO!)

And I went coding and nerding and sleep-depriving myself for something like a month, wading through JS and web nonsense, avoiding depression, and finally built what I call the Face Inspector. Tadah. You can explore all the pictures downloaded under someone's name, filter them by the number of faces in the photo, visualize them under t-SNE, cluster them with K-means (with automatic choice of K), X-means, or Chinese Whispers, and I've been allowed to release some of that stuff, for better or worse.

Face Inspector Showcase

So, selecting pictures by clusters of similar pictures made for a much less unreasonable task, and I had my photos properly tagged in a little over a week.

I quickly built an HTTP API you can send pictures to and get back some JSON describing the locations of the faces that were found, plus the filename and distance of the k most similar-looking pictures in the dataset. As I knew the dataset was really meant to grow (and I was missing C++), I sat down for a few hours and wrote a fast, multithreaded, vectorized, 128-D kNN C++ library that I exposed to Python. I benchmarked various implementations and compilers and took the best one. Takeaway: compilers are amazing. I ended up with a clean implementation that my compiler could easily optimize.
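Functionally, that kNN query is nothing exotic: brute-force Euclidean distance over every stored embedding, keeping the k smallest. A toy pure-Python version (the real library is C++, multithreaded and vectorized; the filenames and vectors below are made up):

```python
import heapq
import math

def knn(query, dataset, k):
    """Brute-force k-nearest-neighbours by Euclidean distance.
    `dataset` is a list of (name, 128-D vector) pairs; returns the
    k closest as (name, distance), nearest first."""
    def dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, vec)))
    return heapq.nsmallest(
        k,
        ((name, dist(vec)) for name, vec in dataset),
        key=lambda nd: nd[1],
    )

# Toy 128-D dataset: three embeddings at distinct "centers".
dataset = [
    ("alice.jpg", [0.1] * 128),
    ("bob.jpg",   [0.5] * 128),
    ("carol.jpg", [0.9] * 128),
]
query = [0.15] * 128
print(knn(query, dataset, k=2))  # alice.jpg first, then bob.jpg
```

The C++ version wins on constant factors (SIMD, threads, cache-friendly layout), not on algorithmic complexity: it's the same linear scan, just one a compiler can chew through very fast.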

Oh, and for the lulz, I quickly built a UI on top of that HTTP endpoint, but didn't want to go through the hassle of using npm, creating a React project, compiling shit, etc., so I wrote some funny React-ish code that is actually pure JS-backtick-string-nested-in-JS-backtick-string-nested-in-JS-backtick-string-as-a-python-string-on-the-server. See part of it for yourself in its full glory. Because who needs JSX, right?

def whoami():
    return """
    .... some JS.....
    let form = document.getElementById('form');
    form.addEventListener('submit', e => {
        e.preventDefault();
        res.innerText = 'loading...';
        fetch('/whoami?photo_url=' + encodeURIComponent(photo_url.value))
            .then(r => r.json())
            .then(data => {
                res.innerHTML = `
                    ${data.map(d =>
                        `<h1>${d.sex}, conf: ${d.confidence}</h1>
                        <div style="display:flex">
                            ${d.proposals.map(p =>
                                `<img src="${p[0]}" width="128" />
                                 (${Math.floor(p[1] * 1000) / 1000})
    .... some more JS...

So I had this dlib + fast kNN pipeline, ready to tackle some videos! Oh wait, not so fast. But we'll see that later.