In my previous company (hi to y'all, btw), they had this amazing idea to lock me up in a room with four other 10x madmen to come up with an idea for something new and innovative and all that.

We thought that ads were too invasive and unbearable everywhere in the world, but totally absent from movies. That meant we could cram those with ads too and hope to make money. Obviously, we're tech people, so LET'S AUTOMATE MOST OF THAT in a few days. Proof of concept. I really do love impossible challenges.

Main concept

So, the goal was something like a UI with a split screen where ad makers could select a 3D model of a product on the left pane, and then click on the movie on the right pane to insert it. Then Neural Networks would do the magic, and

  1. Make region proposals to the ad makers for zones where it makes sense to insert ads, like adding a poster on a wall, or adding a Coca-Cola can on a table.
  2. Scale the 3D model according to the perspective (covered in this blog post)
  3. Handle occlusions, that is, if someone walks between the camera and the inserted object, the occluded part has to disappear (covered in this post as a side effect of 2)
  4. Track camera motion (uncovered, and never will be)
  5. Re-render / harmonize the model's luminosity and global light color to match the original image (uncovered, and never will be)
  6. Cast shadows (uncovered, and never will be)

We just had enough time and funding for 1 (barely), 2 and 3. And then I quit the job.

Region proposals

Is it weird if the solution has almost the same name as the problem? I used a Faster R-CNN (which includes a Region Proposal Network) trained on ImageNet to tell me where tables and TV monitors were.



Points 2 and 3 are about depth estimation. Choose a region (i.e., a pixel) where the model will be inserted, estimate its depth, and scale the model according to the perspective; if that depth suddenly decreases, that (theoretically) means something has come between the object and the camera, and we have to hide the occluded pixels of the rendered model. Easy, right? So let's do it.
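The occlusion rule itself is one comparison per pixel. A minimal numpy sketch (hypothetical names, not the original code): a rendered pixel is hidden whenever the scene's estimated depth at that pixel is smaller than the inserted object's depth.

```python
import numpy as np

def composite(frame, render, render_alpha, scene_depth, object_depth):
    """Overlay `render` on `frame`, dropping pixels occluded by the scene.

    frame, render : (H, W, 3) float arrays
    render_alpha  : (H, W) bool mask of pixels the object covers
    scene_depth   : (H, W) per-pixel depth estimate of the frame
    object_depth  : scalar depth at which the object was inserted
    """
    # A pixel of the object is visible only where nothing in the scene
    # sits closer to the camera than the object does.
    visible = render_alpha & (scene_depth >= object_depth)
    out = frame.copy()
    out[visible] = render[visible]
    return out
```

With a sharp depth map, `visible` cuts the render cleanly along the silhouette of whatever walks in front of it — which is exactly why the blurry edges below are a problem.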

The general depth estimation was done using Deeper Depth Prediction with Fully Convolutional Residual Networks. They provide the trained models, and it does a fairly decent job. However, for our needs, this was not good enough. Look at the predictions: the edges are not sharp at all (which makes handling occlusions a no-go), and tbh, they do look good as pictures, but Oh Lord forgive me if you attempt to visualize them in 3D. Edges are not sharp, and flat surfaces are anything but flat. That was probably the state of the art of what Deep Learning could do at the time, but we had to fix it.

Original Image and Depth Estimation


And incompetent, clueless me happened to stumble upon the great Colorization Using Optimization. Take a B&W image, give it a few colored strokes as hints, and let the algorithmic magic (no Deep Learning™! a regular optimization algorithm!) propagate the color. It gets the shades right and does a great job with the edges. And I said to myself, "This looks interesting, how about I use the same technique to propagate depth hints instead of color hints and get sharp edges?" Turns out that 10 times out of 10, when I have a good idea, thankfully someone else had it before and did the hard work of turning a hand-wavy concept into concrete maths and code.

So someone just did that in High resolution depth image recovery algorithm using grayscale image (cough cough, look it up on Sci-Hub, cough cough). Their problem is a bit different from mine: they have a reliable but low-res depth image and they want to scale it up; I have an unreliable but high-res depth image. But still, I thought it could work. Also, they have a bit more going on to make the leap from colorization to depth prediction: it's fair to assume that colors vary along with variations in the greyscale image, but it's much less obvious for depth. So they kind of iterate between fixing the depth image and fixing the greyscale image. Unsurprisingly, their results look much better than mine.

I decided to randomly sample my depth image, take some random depth hints to propagate through the image, and use what's in the paper. It wasn't super easy to implement for someone like me without any prior knowledge of optimization (besides Gradient Descent) or Image Processing (besides, well, convolutions), but by piecing together several of that author's papers, I could reconstruct the knowledge I was missing and build the thing. That said, the paper is fairly easy to read and understand in general, as long as you're not trying to implement something without knowing anything about what you're doing.
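To make the propagation idea concrete, here's a minimal numpy/scipy sketch — my hedged reconstruction, not the paper's exact algorithm: it uses plain Levin-style intensity affinities over 4-neighbours and skips the iterative greyscale/depth refinement. Pixels with a hint are pinned; every other pixel is constrained to be a weighted average of its neighbours, weighted by how similar they look in the greyscale image, and the whole thing is solved as one sparse linear system.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def propagate_depth(gray, hints, sigma=0.1):
    """gray: (H, W) image in [0, 1]; hints: (H, W) array, NaN = unknown."""
    H, W = gray.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)
    known = ~np.isnan(hints)

    rows, cols, vals = [], [], []
    b = np.zeros(n)
    for y in range(H):
        for x in range(W):
            i = idx[y, x]
            if known[y, x]:
                # Hint pixel: pin it to the given depth.
                rows.append(i); cols.append(i); vals.append(1.0)
                b[i] = hints[y, x]
                continue
            # Unknown pixel: depth = affinity-weighted average of neighbours,
            # so depth flows freely inside smooth regions but not across edges.
            nbrs = [(y + dy, x + dx)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < H and 0 <= x + dx < W]
            w = np.array([np.exp(-(gray[y, x] - gray[ny, nx]) ** 2
                                 / (2 * sigma ** 2))
                          for ny, nx in nbrs])
            w /= w.sum()
            rows.append(i); cols.append(i); vals.append(1.0)
            for (ny, nx), wj in zip(nbrs, w):
                rows.append(i); cols.append(idx[ny, nx]); vals.append(-wj)

    A = sp.csc_matrix((vals, (rows, cols)), shape=(n, n))
    return spsolve(A, b).reshape(H, W)
```

Because the affinity collapses across strong greyscale edges, a single hint per region is enough to fill that region with a near-constant depth, which is exactly the sharp-edged behavior the network's output was missing.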

recovered depth


Look it up in detail in the Jupyter Notebook.

Before you see the final result, there's one last thing to fix: frame-by-frame depth estimation wasn't consistent, and the depth image flickered a lot. The easy but disappointing solution I went with was to apply a lowpass filter:

\[\text{depth} := 0.95 \cdot \text{depth} + 0.05 \cdot \text{this\_frame\_depth}\]
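In code, that lowpass is just an exponential moving average over the per-frame depth maps (a sketch with hypothetical names):

```python
import numpy as np

def smooth_depth(depth_frames, alpha=0.95):
    """Blend each per-frame depth map into a running estimate (EMA)."""
    depth = np.asarray(depth_frames[0], dtype=float)
    smoothed = [depth.copy()]
    for frame_depth in depth_frames[1:]:
        # Keep most of the running estimate, nudge it toward the new frame.
        depth = alpha * depth + (1 - alpha) * np.asarray(frame_depth, dtype=float)
        smoothed.append(depth.copy())
    return smoothed
```

It kills the flicker, at the cost of the depth map lagging behind fast scene changes — hence "disappointing".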


Once we get the depth right-ish, we need to properly scale the object according to its location. This can be easily solved using the pinhole camera model and simple triangle maths: first, tell the program how big the object is, then how big the object should look in the picture at some depth (like, click on the table and click again a few pixels above to tell where the top of the object should be). Then you're able to use basic proportionality to scale the object at any depth, as long as the camera stays the same.
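Under the pinhole model, the proportionality boils down to one line (hypothetical helper; `ref_pixels` and `ref_depth` come from the two clicks described above):

```python
def apparent_size(ref_pixels, ref_depth, depth):
    """Pinhole proportionality: apparent size scales as 1 / depth.

    ref_pixels -- how big the object looks (in pixels) at ref_depth,
    both calibrated once from the user's two clicks.
    """
    return ref_pixels * ref_depth / depth
```

So an object that looks 100 pixels tall at 2 m shrinks to 50 pixels at 4 m, and that single ratio is all the program needs to re-render the model anywhere in the shot.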

Final Result

In the end, for something done in 3 days without prior knowledge, it was good enough for me to be happy with it.

insert video here but I can't find it anymore