Hey, how about some more problems?

In order to show off the alpha of the face_recognition-over-dlib face recognition system, I needed to get a server. I ended up renting a p2.xlarge EC2 instance. We'll get our own servers later. No worries, Dory, it'll be done quickly, swiftly, promptly.

Build an nginx-flask-face_recognition CUDA Docker image, push it, done. Write a docker-compose.yml, put Traefik in front as a load balancer / reverse proxy / vhost router, and you're good to go, Bilbo.
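
For the curious, a minimal docker-compose.yml sketch of that kind of setup. The image name and domain are made up, and the labels use modern Traefik 2.x syntax, not necessarily what I ran:

    version: "3"
    services:
      traefik:
        image: traefik:v2.10
        command:
          - --providers.docker=true
          - --entrypoints.web.address=:80
        ports:
          - "80:80"
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock:ro
      facerec:
        # hypothetical image name; GPU runtime flags omitted for brevity
        image: example/nginx-flask-face_recognition:latest
        labels:
          - traefik.http.routers.facerec.rule=Host(`faces.example.com`)
          - traefik.http.services.facerec.loadbalancer.server.port=80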

I demo it, superiors are happy (how couldn't they be, AMARITE?), they try it with random pictures from the internet, and it works seemingly fine (I actually did some statistics beforehand, and my accuracy on my test set was good, so le me is not surprised). So then, superiors be like, let's try that on some real video frames! Me like, sure, that's the goal, RIGHT?

So we try it on a couple of frames and eerrrrhhhh, not that great. The cynical reader would say "You said you were interested in annotating videos, yet you chose photos for your training and testing sets, you stupid modafaka". And he wouldn't be entirely wrong, but I knew what I was doing (let's pretend).

  1. A face in a photo and a face in a frame shouldn't be much different.
  2. I don't exactly care about being exact per image, I just need to be statistically right over the video sequence. No probs if I miss some frames as long as I'm correct in the majority (see the sketch after this list).
  3. Collecting roughly annotated photos was actually much easier than collecting frames.
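
To make point 2 concrete, here's the kind of per-sequence aggregation I have in mind, with made-up per-frame predictions:

    # Hypothetical majority vote over per-frame predictions: single frames
    # may be misclassified, yet the sequence-level answer can still be right.
    from collections import Counter

    frame_predictions = ["alice", "alice", "bob", "alice", "unknown", "alice"]
    identity, votes = Counter(frame_predictions).most_common(1)[0]
    print(identity, votes)  # alice 4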

Why did it perform not-so-greatly?

  1. First, because there are no males in the database yet, but there are some in the videos, and they're totally polluting the results.

  2. And then, because the embeddings dlib produces when the face is not facing the camera straight on are not great. True, dlib ships with a face aligner, but when the face is not turned toward the camera, there's only so much you can do to deform it to make it seemingly face the camera without ruining the picture. And dlib's face aligner totally fails when the face is rotated, say, because the person is lying on the ground. Which, uh, happens a lot in our videos, for some reason.

Males

Super easy to solve. Download the CelebA dataset, embed the faces, and train an SVM to recognize male and female faces right from the embeddings. More than 98% test accuracy, me happy.
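
The recipe, roughly; the celeba_samples list of (path, label) pairs is a placeholder for data parsed from CelebA's list_attr_celeba.txt, not my actual code:

    # Sketch: embed CelebA faces with face_recognition, fit an SVM on the
    # 128-d embeddings. Paths and labels below are hypothetical.
    import face_recognition
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = [], []
    for path, male in celeba_samples:  # e.g. [("img/000001.jpg", 0), ...]
        image = face_recognition.load_image_file(path)
        encodings = face_recognition.face_encodings(image)
        if encodings:  # skip images where no face was found
            X.append(encodings[0])
            y.append(male)

    X_train, X_test, y_train, y_test = train_test_split(
        np.array(X), np.array(y), test_size=0.2)
    clf = SVC(kernel="linear").fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))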

Rotations, etc

The first problem, people not facing the camera, can't be solved without retraining dlib's embedding neural net on a less constrained set of photos (thus including non-facing ones), which would take way too long (in compute time, data gathering and preparation), and it's a wayyy too sensitive piece of the puzzle, so let's try not to touch it.

The second, however, rotated faces, has a simple solution: after dlib detects a face, but before sending it to the embedding neural net, we can detect and fix the rotation ourselves.

Finding the orientation of a face doesn't require a great-quality RGB image, so we'll restrict ourselves to a 32x32 greyscale crop of the face.

So I collected a fair amount of straight-up faces, generated a bunch of rotations to build a dataset, and that was it for the data. Easy peasy, Winny.
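
Something along these lines; the upright_faces/ folder and the angle range are my illustrative assumptions (the range just happens to give 12 buckets of 20°, matching the net below):

    # Sketch of the dataset generation: rotate upright face crops and
    # bucket the angle in 20-degree steps. Folder and range are assumptions.
    import glob
    import numpy as np
    from PIL import Image

    BUCKET_DEG = 20
    ANGLES = range(-120, 120, BUCKET_DEG)  # 12 buckets

    X, y = [], []
    for path in glob.glob("upright_faces/*.jpg"):  # hypothetical folder
        face = Image.open(path).convert("L")
        for bucket, angle in enumerate(ANGLES):
            rotated = face.rotate(angle).resize((32, 32))
            X.append(np.asarray(rotated, dtype=np.float32) / 255.0)
            y.append(bucket)

    X = np.stack(X)[..., None]  # shape (N, 32, 32, 1)
    y = np.array(y)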

With sklearn, I did not have much success learning the rotation with any of its algorithms. I guess the God of Hype didn't want it to work, because he wanted me to use some more Deep Learning Magic. So, I did.

Brace yourselves, rotated faces, TensorFlow is coming. I ended up building a dead-small, 5-layer convolutional neural network that took a few minutes to train and had awesome results. Note that, for simplicity, I did not try to regress to the exact angle, but instead classified it into buckets of 20 degrees each, which is good enough for me, since dlib's face aligner can do its job properly as long as the angle is between -20° and 20°.

Integrating it into dlib's face embedding pipeline really made things better. How much? I don't really know, tbh. It's so uuuurrrrrrhhhh to build a test set. Like all researchers / industrials, I'll make one later, and if I ever publish somethin' 'bout that, I'll obviously pretend I made it right before doing anything.
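
The extra step looks something like this; rotation_model and bucket_to_degrees are my placeholders for the trained CNN and its bucket-to-correction-angle mapping, the rest is the stock face_recognition API:

    # Sketch: detect the face, predict its rotation bucket on a 32x32
    # greyscale crop, counter-rotate the frame, then embed as usual.
    import face_recognition
    import numpy as np
    from PIL import Image

    def straighten(frame, rotation_model, bucket_to_degrees):
        # assumes at least one face was detected in the frame
        top, right, bottom, left = face_recognition.face_locations(frame)[0]
        crop = Image.fromarray(frame[top:bottom, left:right])
        x = np.asarray(crop.convert("L").resize((32, 32)),
                       dtype=np.float32)[None, ..., None] / 255.0
        bucket = int(np.argmax(rotation_model.predict(x)))
        # rotate the whole frame so the face lands within dlib's -20°..20° range
        return np.asarray(Image.fromarray(frame).rotate(bucket_to_degrees[bucket]))

    # embeddings = face_recognition.face_encodings(straighten(frame, cnn, mapping))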

[Images: examples of rotated faces]

The net's architecture is dead simple and fast, and looks like this:

InputLayer 32x32x1
Conv2D 64 filters, 3x3 kernel, same
Relu
MaxPool2D 2x2 valid
Conv2D 32 filters, 3x3 kernel, same
Relu
MaxPool2D 2x2 valid
Conv2D 16 filters, 3x3 kernel, same
Relu
MaxPool2D 2x2 valid
FullyConnected 128
Relu
FullyConnected 12
Softmax
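
In Keras, that listing translates to something like the block below. The Flatten between the last pooling layer and the first dense layer is implied by the listing, and the optimizer/loss are my choices:

    # The small CNN from the listing above, as a Keras Sequential model.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(32, 32, 1)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2, padding="valid"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2, padding="valid"),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2, padding="valid"),
        layers.Flatten(),  # implied between the pooled maps and the dense layers
        layers.Dense(128, activation="relu"),
        layers.Dense(12, activation="softmax"),  # one output per angle bucket
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])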

Some more stuff

Playing more with UTKFace and CelebA, I also built an SVM that classifies an embedding into ethnicities, and another that regresses to an estimate of the age. The second one does not work that great, mainly, I guess, because embeddings are actually supposed to be age-invariant, so that ideally the same person, young and old, shares the same embedding.
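
Same recipe as the gender classifier, so just the gist; embeddings, ethnicities, and ages are placeholders (the labels would come from UTKFace's age_gender_race filename convention):

    # Sketch: SVMs on top of the same 128-d embeddings, a classifier for
    # ethnicity and a regressor for age. Inputs below are placeholders.
    from sklearn.svm import SVC, SVR

    ethnicity_clf = SVC(kernel="linear").fit(embeddings, ethnicities)
    age_reg = SVR().fit(embeddings, ages)
    print(ethnicity_clf.predict(embeddings[:1]), age_reg.predict(embeddings[:1]))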

Conclusion

Is it good enough now? Haha, you bet. Life is a bitch, Machine Learning is a lie. The world is an illusion, governments are a conspiracy, vaccines prevent people from having their Jedi powers. So, yeah. Nope. Not yet. 'Twas bettah, but still not enuff'.