<p>Vermeille's blog. Enhancing Videos with Deep Product Placement. Vermeille, 2018-06-20.</p>
<h2 id="motivation">Motivation</h2>
<p>In my previous company (hi to yall guys, btw), they had this amazing idea to lock me up in a room with four other <del>10x-ers</del> madmen to come up with an idea of something new and innovative and all that.</p>
<p>We thought that ads were too invasive and unbearable in the world but were totally absent from movies. That means we can crap those up with ads too and hope to make money. Obviously, we're tech people so LET'S AUTOMATE MOST OF THAT in a few days. Proof of concept. I really do love impossible challenges.</p>
<h2 id="main-concept">Main concept</h2>
<p>So, the goal was something like a UI with a split screen where ad makers could select a 3D model of a product on the left pane, and then click on the movie on the right pane to insert it. Then Neural Networks would do the magic, and</p>
<ol style="list-style-type: decimal">
<li>Make region proposals to the ad makers for zones where it makes sense to insert ads, like adding a poster on a wall, or adding a Coca-Cola can on a table.</li>
<li>Scale the 3D model according to the perspective (covered in this blog post)</li>
<li>Handle occlusions, that is, if someone walks between the camera and the inserted object, the occluded part will have to disappear (covered in this post as a side effect of 1)</li>
<li>Track camera motion (uncovered, and never will be)</li>
<li>Re-render / Harmonize to accommodate the model's luminosity and global light color to the original image (uncovered, and never will be)</li>
<li>Cast shadows (uncovered, and never will be)</li>
</ol>
<p>We just had enough time and funding for 1 (barely), 2 and 3. And then I quit the job.</p>
<h2 id="region-proposals">Region proposals</h2>
<p>Is it weird if the solution has almost the same name as the problem? I used a Faster-RCNN (which includes a <em>Region Proposals</em> Network) trained on ImageNet to tell me where tables and TV monitors were.</p>
<p>Done.</p>
<h2 id="depth">Depth</h2>
<p>Points 2 and 3 are about depth estimation. Choose a region (ie, a pixel) where the model will be inserted, estimate its depth, scale the model according to the perspective, and if that depth suddenly decreases, that (theoretically) means something has come between the object and the camera, and we have to hide the occluded pixels of the rendered model. Easy, right? So let's do it.</p>
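<p>To make the occlusion rule concrete, here is a minimal sketch in plain Python. It is not the code from the project: the function name, the <code>tolerance</code> parameter and the tiny depth patch are made up for illustration, and it assumes depth is expressed in the same unit for the scene and the inserted object.</p>

```python
def occlusion_mask(scene_depth, object_depth, tolerance=0.05):
    """Return a 2D list of booleans: True where the inserted object stays visible.

    scene_depth: 2D list of per-pixel depth estimates for the current frame.
    object_depth: depth at which the object was anchored (same unit).
    tolerance: hypothetical slack to absorb noise in the depth estimate.
    """
    return [
        [pixel_depth >= object_depth - tolerance for pixel_depth in row]
        for row in scene_depth
    ]


# A 2x3 depth patch: something at depth 1.0 passes in front of an object at 2.0.
frame = [[2.0, 2.0, 1.0],
         [2.0, 1.0, 1.0]]
mask = occlusion_mask(frame, object_depth=2.0)
```

<p>Pixels where the mask is False are the ones to hide when compositing the rendered model. With blurry depth edges, the mask boundary is blurry too, which is why sharp edges matter so much here.</p>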
<p>The general depth estimation was made using <a href="https://github.com/iro-cp/FCRN-DepthPrediction">Deeper Depth Prediction with Fully Convolutional Residual Networks</a>. They give the trained models, and it's doing a fairly decent job. However, for our needs, this was not good enough. Look at the predictions: the edges are not sharp at all (which makes handling occlusion a no-go), and tbh, that does look good on pictures, but Oh Lord forgive me if you attempt to visualize it in 3D. Edges are not sharp, and flat surfaces are anything but flat. That's probably the state of the art of what Deep Learning could do at that time, but we had to fix it.</p>
<div class="figure">
<img src="http://vermeille.fr/dotclear2/public/start_img.png" alt="Original Image and Depth Estimation" />
<p class="caption">Original Image and Depth Estimation</p>
</div>
<p>And incompetent, clueless me happened to stumble upon the great <a href="https://www.cs.huji.ac.il/~yweiss/Colorization/">Colorization Using Optimization</a>. Take a B&W image, just give it some colored strokes as a hint, and let the algorithmic magic (no Deep Learning™! regular optimization algorithm!) propagate the color and shit. It'll get the shades right and will do a <em>great job with the edges</em>. And le me said to himself "This looks interesting, how about I use the same technique to propagate <em>depth hints</em> instead of <em>color hints</em> and get sharp edges?". Turns out that 10 times out of 10, when I have a good idea, thankfully someone else had it before and did the hard work of transposing a hand wavy concept into concrete maths and code.</p>
<p>So someone just did that in <a href="https://ieeexplore.ieee.org/document/7362701/">High resolution depth image recovery algorithm using grayscale image</a> <em>cough cough, look it up on SciHub, cough cough</em>. Their problem is a bit different than mine: they have a <em>reliable</em> and <em>low res</em> depth image and they want to scale it up. I have an <em>unreliable</em> but <em>high res</em> depth image. But still, I thought that could work. Also, they have a bit more going on to make the leap between colorization and depth prediction: it's a really fair thing to assume that colors will vary according to the variations in the greyscale image, but it's much less obvious with depth. So they kinda iterate between fixing the depth image, then fixing the greyscale image. Unsurprisingly, their results are much better looking than mine.</p>
<p>I decided to randomly sample my depth image and take some random depth hints to propagate through the image. And used what's in the paper. Wasn't super easy to implement for someone like me without any prior knowledge in optimization (besides Gradient Descent) or Image Processing (besides, well, convolutions), but by ensembling several of that guy's papers, I could reconstruct the knowledge I was missing and build the thing. Also, the paper is fairly easy to read and generally understand, if you're not trying to implement something without knowing shit about what you're doing.</p>
<div class="figure">
<img src="http://vermeille.fr/dotclear2/public/depth_res.png" alt="recovered depth" />
<p class="caption">recovered depth</p>
</div>
<p>Look it up in detail in the <a href="https://github.com/Vermeille/depth-recovery/blob/master/Optimize%20Depth.ipynb">Jupyter Notebook</a>.</p>
<p>Before you see the final result, there's one last thing to fix: frame-by-frame depth estimation wasn't consistent, and the depth image did flicker a lot. The easy but disappointing solution I went with was to insert a low-pass filter:</p>
<p><span class="math display">\[\text{depth} := 0.95 * \text{depth} + 0.05 * \text{this_frame_depth}\]</span></p>
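<p>The filter above is an exponential moving average applied per pixel. A minimal sketch in plain Python, with depth maps as nested lists; the function name and the sample frames are illustrative, only the 0.95 / 0.05 weights come from the formula:</p>

```python
def smooth_depth(previous, current, alpha=0.05):
    """Blend the running depth map with the new frame's estimate.

    alpha is the weight of the new frame (0.05 in the formula above);
    the running estimate keeps the remaining 1 - alpha.
    """
    return [
        [(1.0 - alpha) * p + alpha * c for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


# A 1x2 depth map whose per-frame estimates flicker around 2.0.
running = [[2.0, 2.0]]
noisy_frames = [[[2.4, 1.6]], [[1.8, 2.2]], [[2.0, 2.0]]]
for frame in noisy_frames:
    running = smooth_depth(running, frame)
```

<p>The price of such a small <code>alpha</code> is lag: a real depth change (someone walking in front of the object) also takes many frames to show up, which is one reason to call the solution disappointing.</p>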
<h2 id="perspective">Perspective</h2>
<p>Once we get the depth right-ish, we need to properly scale the object according to its location. This can be easily solved using the pin-hole camera model and doing simple triangle maths: first, tell the program how big the object is, then how big the object should look in the picture at some depth (like, click on the table and re-click a few pixels above to tell where the object should be). Then you're able to use basic proportionality to scale the object at any depth, as long as the camera remains the same.</p>
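<p>A small sketch of that proportionality, assuming the usual pinhole relation (size in pixels = focal length × real size / depth). The function names and the numbers (a 12 cm can that should look 60 pixels tall at depth 2.0) are made up for illustration:</p>

```python
def focal_from_reference(real_height, pixel_height, depth):
    """Calibrate from one reference click: pixel_height = focal * real_height / depth."""
    return pixel_height * depth / real_height


def apparent_height(focal, real_height, depth):
    """Height in pixels of the same object placed at another depth."""
    return focal * real_height / depth


# Reference: a 0.12 m can should look 60 px tall at depth 2.0.
focal = focal_from_reference(real_height=0.12, pixel_height=60.0, depth=2.0)
# Twice as far away, it should look about half as tall.
far_height = apparent_height(focal, real_height=0.12, depth=4.0)
```

<p>The calibrated <code>focal</code> is only valid while the camera (and its zoom) stays the same, which matches the caveat above.</p>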
<h2 id="final-result">Final Result</h2>
<p>In the end, for something done in 3 days without prior knowledge, it was good enough for me to feel happy.</p>
<p><em>insert video here but I can't find it anymore</em></p>
<p>Vermeille, 2018-06-19</p>
<h1 id="face-recognition-at-scale-pt.1-featuring-a-lot-more-buzzwords-in-the-text">Face recognition at scale pt.1 (featuring a lot more buzzwords in the text!)</h1>
<h2 id="background">Background</h2>
<p>Because I'm such a soulless sheep, I've recently started a PhD and a new job. Deep Learning, computer vision, hype, artificial intelligence, more buzzwords, face recognition, butterflies (I can't <em>unthink</em> of flying butter when I hear this word), neural networks, etc.</p>
<p>So yeah. I've been doing that for exactly 4 months, at the moment of writing (19/6/18). It appears that the company I'm working for needs some face recognition to identify people in videos. Like, actually, a fuckton of people, in a fuckload of videos, with a big shitload of uploads per day (I'm trying to curse proportionally to the numbers I'm describing). That means that whatever I'm doing, I have to do it...</p>
<ul>
<li><em>quickly</em>: people there are a bit skeptical about DL and I have to prove my worth</li>
<li><em>fast</em>: because whatever algorithms I'm running, I have to treat a day's upload within less than a day, otherwise I'll never catch up</li>
<li><em>accurately</em>: we already have reviewers, they're already busy, and the goal is to help them, not to give them more work because of false positives etc.</li>
<li><em>with the least effort possible</em>: because I'm completely alone on that herculean task</li>
</ul>
<p>Oh, and btw, I have no formal education in ML/DL, I've been basically self taught for a while, and that's more or less my introductory task in ML. Speaking of suicidal tendencies much?</p>
<h2 id="proof-of-concept-and-initial-system">Proof of concept and initial system</h2>
<h3 id="the-data">The Data</h3>
<p>The first thing I had to do was to download some pictures per person I wish to recognize. I made a list of 100 persons, and downloaded around 4k pictures for each of them. That many because I didn't exactly know how many photos are actually <em>needed</em> (and still don't, tbh), they're basically super easy to find on the internet so why not, the more the better, and I thought that I might need to train my own face recognition engine, because the domain is much less constrained than traditional face recognition systems.</p>
<p>Analyzing what the script downloaded showed an awful lot of noise. Photos not showing the face, photos of another person sharing part of the name, photos where there were other faces, etc. And I just didn't want to review 400k pictures by myself, one by one, or I would still be doing that today.</p>
<p>So...</p>
<h3 id="cleaning-the-data">Cleaning the data</h3>
<p>I jumped immediately on <a href="http://dlib.net">dlib</a>, wrapped by the fairly good <a href="https://github.com/ageitgey/face_recognition">face_recognition</a> (btw, if you haven't read Adam's <a href="https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78">post on face recognition</a>, it's awesome), detected the faces, and generated the embeddings (a 128D vector representing one's face features, so that vectors of the same person are meant to be close (with Euclidean distance), and far apart for different persons).</p>
<p>I don't know jack about Javascript except:</p>
<ul>
<li>It works in a browser so that's good because not only is it remote but it doesn't need anything to be installed to be used;</li>
<li>I hate stupid languages like php and javascript that are so terribly error-prone and inconsistent (I learned yesterday that php doesn't have a proper way to concatenate arrays, that's amazing);</li>
<li>React is the only frontend technology that hasn't made me want to shoot myself with a cannon so far. I guess that makes it my favorite (or should I say, <em>least hated</em>) UI technology at that point;</li>
<li>Andrej Karpathy wrote the awesome <a href="https://github.com/karpathy/tsnejs">t-SNE.js</a> that I definitely wanted to use to explore the data I just downloaded (or that's what I said to my thesis supervisor, but for realsies it's because PICTURES MOVE! WOOOO!)</li>
</ul>
<p>And I went coding and nerding and sleep depriving myself for something like a month, moving through JS and web nonsense, avoiding depression, and finally built what I call the Face Inspector. Tadah. You can explore all the pictures downloaded under someone's name, filter them by the number of faces in the photo, visualize them under t-SNE, cluster them with <a href="https://github.com/Vermeille/kmeans">K-means</a> (with automatic K choice), X-means, and <a href="https://github.com/Vermeille/ChineseWhispers">Chinese Whispers</a>, and I've been allowed to release <a href="https://github.com/Vermeille/vec">some</a> of that stuff, for better or worse.</p>
<div class="figure">
<img width="600" src="http://vermeille.fr/dotclear2/public/fi_showcase.png" alt="Face Inspector Showcase" />
<p class="caption">Face Inspector Showcase</p>
</div>
<p>So, selecting pictures by clusters of similar pictures is a much less unreasonable task, and I could have my photos properly tagged in a little over a week.</p>
<p>I quickly built an HTTP API you can send pictures to, and get back some JSON describing locations of the faces that have been found, and the filename+distance of the k most similar-looking pictures in the dataset. As I knew that the dataset is really meant to grow (and I was missing C++), I sat for a few hours writing a fast, multithreaded, vectorized, 128D knn C++ library that I exposed to python. I benchmarked various implementations and compilers and took the best one. Takeaway: <a href="https://www.youtube.com/watch?v=bSkpMdDe4g4">compilers are amazing</a>. I ended up with a clean implementation that my compiler could easily optimize.</p>
<p>Oh, and for the lulz, I quickly built a UI to exploit that HTTP endpoint, but didn't want to go through the hassle of using npm, creating a React project, compiling shit etc, so I wrote some piece of funny code React-ish but actually pure JS-backtick-string-nested-in-JS-backtick-string-nested-in-JS-backtick-string-as-a-python-string-on-the-server. See a part of it for yourself in its full glory. Because, who needs JSX, right?</p>
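<p>For reference, the contract of such a nearest-neighbour lookup fits in a few lines of plain Python. This is a brute-force sketch, not the C++ library mentioned above; the function name, the file names and the 3D toy vectors (real embeddings would be 128D) are made up for illustration:</p>

```python
import heapq
import math


def k_nearest(query, dataset, k=3):
    """Brute-force kNN over (filename, embedding) pairs.

    Returns the k closest entries as (distance, filename), closest first,
    using the Euclidean distance between embeddings.
    """
    distances = ((math.dist(query, emb), name) for name, emb in dataset)
    return heapq.nsmallest(k, distances)


# Toy 3D embeddings standing in for 128D face embeddings.
dataset = [
    ("alice/1.jpg", [0.0, 0.0, 0.0]),
    ("alice/2.jpg", [0.1, 0.0, 0.0]),
    ("bob/1.jpg", [1.0, 1.0, 1.0]),
]
neighbours = k_nearest([0.02, 0.0, 0.0], dataset, k=2)
```

<p>Every query scans the whole dataset, so the cost grows linearly with the number of stored faces. That is exactly the loop a vectorized, multithreaded implementation speeds up.</p>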
<div class="sourceCode"><pre class="sourceCode js"><code class="sourceCode javascript">def <span class="at">whoami</span>()<span class="op">:</span>
<span class="cf">return</span> <span class="st">"""</span>
.... <span class="at">some</span> <span class="va">JS</span>.....
<span class="kw">let</span> form <span class="op">=</span> <span class="va">document</span>.<span class="at">getElementById</span>(<span class="st">'form'</span>)<span class="op">;</span>
<span class="va">form</span>.<span class="at">addEventListener</span>(<span class="st">'submit'</span><span class="op">,</span> e <span class="op">=></span> <span class="op">{</span>
<span class="va">console</span>.<span class="at">log</span>(photo_url)<span class="op">;</span>
<span class="va">console</span>.<span class="at">log</span>(<span class="va">photo_url</span>.<span class="at">value</span>)<span class="op">;</span>
<span class="va">res</span>.<span class="at">innerText</span> <span class="op">=</span> <span class="st">'loading...'</span><span class="op">;</span>
<span class="at">fetch</span>(<span class="st">'/whoami?photo_url='</span> <span class="op">+</span> <span class="at">encodeURIComponent</span>(<span class="va">photo_url</span>.<span class="at">value</span>))
.<span class="at">then</span>(r <span class="op">=></span> <span class="va">r</span>.<span class="at">json</span>())
.<span class="at">then</span>(data <span class="op">=></span> <span class="op">{</span>
<span class="va">res</span>.<span class="at">innerHTML</span> <span class="op">=</span>
<span class="vs">`<img</span>
<span class="vs"> src="</span><span class="sc">${</span><span class="va">photo_url</span>.<span class="at">value</span><span class="sc">}</span><span class="vs">"</span>
<span class="vs"> height="512"</span>
<span class="vs"> style="margin:auto;display:block"/></span>
<span class="vs"> </span><span class="sc">${</span><span class="va">data</span>.<span class="at">map</span>(d <span class="op">=></span>
<span class="vs">`<h1></span><span class="sc">${</span><span class="va">d</span>.<span class="at">sex</span><span class="sc">}</span><span class="vs">, conf: </span><span class="sc">${</span><span class="va">d</span>.<span class="at">confidence</span><span class="sc">}</span><span class="vs"></h1></span>
<span class="vs"> <h2></span><span class="sc">${</span><span class="va">d</span>.<span class="at">age</span><span class="sc">}</span><span class="vs"></h2></span>
<span class="vs"> <div style="display:flex"></span>
<span class="vs"> <div style="display:flex"></span>
<span class="vs"> </span><span class="sc">${</span><span class="va">d</span>.<span class="va">proposals</span>.<span class="at">map</span>(p <span class="op">=></span>
<span class="vs">`<figure></span>
<span class="vs"> <img</span>
<span class="vs"> src="</span><span class="sc">${</span>base_url<span class="sc">}${</span>p[<span class="dv">0</span>]<span class="sc">}</span><span class="vs">"</span>
<span class="vs"> width="128" /></span>
<span class="vs"> <figcaption></span>
<span class="vs"> </span><span class="sc">${</span>p[<span class="dv">0</span>].<span class="at">split</span>(<span class="st">'/'</span>)[<span class="dv">0</span>]<span class="sc">}</span>
<span class="vs"> <br/></span>
<span class="vs"> (</span><span class="sc">${</span><span class="va">Math</span>.<span class="at">floor</span>(p[<span class="dv">1</span>] <span class="op">*</span> <span class="dv">1000</span>) / <span class="dv">1000</span><span class="sc">}</span><span class="vs">)</span>
<span class="vs"> </figcaption></span>
<span class="vs"> </figure>`</span>
.... <span class="at">some</span> <span class="at">more</span> <span class="va">JS</span>...
<span class="st">"""</span></code></pre></div>
<p>So I had this dlib + fast kNN pipeline, ready to tackle some videos! Oh wait, not so fast. But we'll see that later.</p>
<p>Vermeille, 2016-03-06</p>
<h2 id="a-differentiable-graph-for-neural-networks">A differentiable graph for neural networks</h2>
<p>Following the work of Grefenstette et al. (2015) in the paper "Learning to Transduce with Unbounded Memory", in which they introduced a fully differentiable Stack, Queue, and DeQue, I'm trying to create a differentiable graph model. Although the work is not finished yet since it lacks proper experiments, I'm writing this article so that someone may help me with those ideas.</p>
<p>Still, I haven't studied maths in an academic context for the past 5 years, so don't expect perfect notation and terminology; I'm self taught in ML and maths. However, I think the ideas are reasonably introduced, and that the concept is not totally wrong.</p>
<h2 id="basic-graph-model">Basic graph model</h2>
<p>The simple model stores the edges in an adjacency matrix of size <span class="math">\(|V|\times |V|\)</span>, where row <span class="math">\(a\)</span> column <span class="math">\(b\)</span> contains <span class="math">\(P(\text{edge}_{ab})\)</span>, that is: the strength of a link from vertex <span class="math">\(a\)</span> to <span class="math">\(b\)</span>. The vertices are stored in a vector of size <span class="math">\(|V|\)</span> where each cell contains <span class="math">\(P(a)\)</span>, ie, the strength of existence of <span class="math">\(a\)</span>. Let's call the adjacency matrix <span class="math">\(\mathbf{C}\)</span>, where <span class="math">\(C_{ab}\)</span> is the entry at the <span class="math">\(a\)</span>th row and <span class="math">\(b\)</span>th column of <span class="math">\(\mathbf{C}\)</span>, and <span class="math">\(\mathbf{s}\)</span> the strength vector. Both the vector and the matrix contain values only in <span class="math">\([0, 1]\)</span>.</p>
<p>For two vertices <span class="math">\(a\)</span> and <span class="math">\(b\)</span> (I will discuss later how to address them), some operations are:</p>
<h3 id="are-two-vertices-directly-connected">Are two vertices directly connected?</h3>
<p><span class="math">\[\text{connected?(a, b)} = s_a s_b C_{ab}\]</span></p>
<p>Returns the strength of the edge between <span class="math">\(a\)</span> and <span class="math">\(b\)</span>. We can also propose to read <span class="math">\(\mathbf{C}^n\)</span> to access the graph's transitive closure of nth order.</p>
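<p>A toy version of this operation in plain Python, with nested lists standing in for <span class="math">\(\mathbf{s}\)</span> and <span class="math">\(\mathbf{C}\)</span>. The 3-vertex graph and the helper names are made up for illustration; the second helper shows one way to read <span class="math">\(\mathbf{C}^2\)</span>, ie the strength of two-step paths:</p>

```python
def connected(s, C, a, b):
    """Strength of the direct link a -> b, gated by the existence of both vertices."""
    return s[a] * s[b] * C[a][b]


def matmul(A, B):
    """Plain square matrix product, used to read C^n one power at a time."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Toy graph: 0 -> 1 (strength 0.8), 1 -> 2 (strength 1.0); vertex 1 half exists.
s = [1.0, 0.5, 1.0]
C = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

direct = connected(s, C, 0, 1)   # 1.0 * 0.5 * 0.8
two_hop = matmul(C, C)[0][2]     # strength of the path 0 -> 1 -> 2 in C^2
```

<p>Every operation here is a product or a sum, so gradients can flow back to both the vertex strengths and the edge strengths.</p>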
<h3 id="successors-of-a">Successors of <span class="math">\(a\)</span></h3>
<p><span class="math">\[\text{succ}(a) = (\mathbf{s} \circ \mathbf{C}_{a,*}) s_a\]</span></p>
<p>Where <span class="math">\(\mathbf{C}_{a,*}\)</span>, in MATLAB-like notation, denotes the <span class="math">\(a\)</span>th row of <span class="math">\(\mathbf{C}\)</span>, since row <span class="math">\(a\)</span> holds the links going out of <span class="math">\(a\)</span>.</p>
<p>Returns a vector of strengths. The <span class="math">\(i\)</span>th value indicates the strength of <span class="math">\(i\)</span> as a successor of <span class="math">\(a\)</span>.</p>
<p>The predecessor function can be implemented trivially by using columns instead of rows in the matrix.</p>
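<p>A sketch of both functions in plain Python, under the convention of the basic model that <code>C[a][b]</code> is the link from <code>a</code> to <code>b</code>, so the outgoing links of <code>a</code> sit on row <code>a</code> and the incoming ones on column <code>a</code>. If you store the matrix transposed, swap the two functions. The toy graph and the names are made up for illustration:</p>

```python
def successors(s, C, a):
    """i-th value: strength of i as a successor of a (reads row a of C)."""
    return [s[i] * C[a][i] * s[a] for i in range(len(s))]


def predecessors(s, C, a):
    """i-th value: strength of i as a predecessor of a (reads column a of C)."""
    return [s[i] * C[i][a] * s[a] for i in range(len(s))]


# Toy graph: 0 -> 1 (strength 0.8), 1 -> 2 (strength 1.0); vertex 1 half exists.
s = [1.0, 0.5, 1.0]
C = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

after_0 = successors(s, C, 0)     # only vertex 1 follows vertex 0
before_2 = predecessors(s, C, 2)  # only vertex 1 precedes vertex 2
```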
<h3 id="change-the-strength-of-a-vertex">Change the strength of a vertex</h3>
<p>Changing strength of a vertex <span class="math">\(a\)</span> to target strength <span class="math">\(t\)</span> with amount <span class="math">\(p\)</span> is:</p>
<p><span class="math">\[s_a := (1-p)s_a + p t\]</span></p>
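<p>so that nothing changes if <span class="math">\(p = 0\)</span>, and the strength becomes exactly <span class="math">\(t\)</span> if <span class="math">\(p = 1\)</span>. Edge strengths are updated the same way.</p>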
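<p>The update is a linear interpolation between the current strength and the target. A one-function sketch in plain Python (the name <code>blend</code> is made up), usable for a vertex strength or an edge strength alike:</p>

```python
def blend(current, target, amount):
    """Move `current` toward `target` by a fraction `amount` in [0, 1]."""
    return (1.0 - amount) * current + amount * target


untouched = blend(0.2, 1.0, 0.0)   # amount 0: nothing changes
halfway = blend(0.2, 1.0, 0.5)     # amount 0.5: midpoint between 0.2 and 1.0
replaced = blend(0.2, 1.0, 1.0)    # amount 1: the target is written as is
```

<p>Since the inputs stay in [0, 1], the result does too, so the strengths never need clipping after an update.</p>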
<h2 id="adressing">Addressing</h2>
<p>Similarly to what has been proposed for the Neural Turing Machine (Graves et al. 2014), I propose two addressing modes: by location and by content. The manipulation of the associated graph functions works almost the way the Neural Random Access Machine (Kurach et al. 2015) uses its modules.</p>
<h3 id="by-location">By location</h3>
<p>Let's take the example of the <span class="math">\(\text{connected?(a, b)}\)</span> function.</p>
<p>To be able to call this function in a fully differentiable way, we can't make a hard choice of <span class="math">\(a\)</span> and <span class="math">\(b\)</span>. We instead have to make <span class="math">\(a\)</span> and <span class="math">\(b\)</span> <em>distributions over vertices</em>.</p>
<p>Let <span class="math">\(A\)</span> and <span class="math">\(B\)</span> be distributions over vertices. The <span class="math">\(\text{connected?(a, b)}\)</span> can be used in a differentiable way like this:</p>
<p><span class="math">\[\text{out} = \sum_a \sum_b P(A = a)P(B = b)\text{connected?}(a, b)\]</span> <span class="math">\[\text{out} = \sum_a \sum_b P(A = a)P(B = b) s_a s_b C_{a,b}\]</span></p>
<p>Same goes for the successor function:</p>
<p><span class="math">\[\text{out} = \sum_a P(A = a)\text{succ}(a)\]</span></p>
<p>And so on.</p>
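<p>Addressing by location boils down to taking an expectation over the two distributions. A plain Python sketch for the connection query, with made-up names and a toy 3-vertex graph:</p>

```python
def soft_connected(pA, pB, s, C):
    """Expected connection strength when a ~ pA and b ~ pB (distributions over vertices)."""
    n = len(s)
    return sum(pA[a] * pB[b] * s[a] * s[b] * C[a][b]
               for a in range(n) for b in range(n))


# Toy graph: 0 -> 1 (strength 0.8), 1 -> 2 (strength 1.0); vertex 1 half exists.
s = [1.0, 0.5, 1.0]
C = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

# One-hot distributions fall back to the hard query connected?(0, 1).
hard = soft_connected([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], s, C)
# Spread-out distributions mix the answers of several hard queries.
soft = soft_connected([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], s, C)
```

<p>The double sum costs <span class="math">\(|V|^2\)</span> multiplications per query, and the controller has to output one full distribution per argument.</p>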
<p>However, addressing by locations has severe cons:</p>
<ul>
<li><p>You can't grow the number of vertices. It would need the neural net emitting addressing distributions to grow accordingly.</p></li>
<li><p>As you grow the number of available operations or "views" of the graph (such as adding the ability to read the nth order transitive closure to study connectivity), you need to emit more and more distributions over <span class="math">\(V\)</span>.</p></li>
<li><p>You need to emit <span class="math">\(|V|^2\)</span> values of <span class="math">\((p, t)\)</span> at each time step to gain the ability to modify each edge. Which is a lot. Way too much. An RNN may be able to do it sequentially, or might not.</p></li>
</ul>
<p>Hence, the graph must stay at a fixed size, and the dimensionality of the controls is already huge.</p>
<p>Next, I will discuss addressing by content, a way to reduce dimensionality and to let the graph grow.</p>
<h3 id="by-content">By content</h3>
<p>So far, our vertices and edges were unlabeled. I have no idea how an unlabeled graph would be useful, but the neural net controlling it would have to find a way on its own to know which node is what. And it might not achieve that.</p>
<p>Here, I propose to extend our model with embeddings. With <span class="math">\(d\)</span> being the size of embeddings, we need an additional matrix <span class="math">\(\mathbf{E} \in \mathbb{R}^{|V| \times d}\)</span> to embed the vertices.</p>
<p>Addressing the nodes is now done by generating a distribution over the nodes defined as the softmax of the similarity (dot product) of the embedding outputted by the neural net and the actual vertices' embeddings.</p>
<p>For instance, let <span class="math">\(\mathbf{x_a, x_b} \in \mathbb{R}^{d}\)</span> be two embeddings given by the controller. We can have the strength of connection of those embeddings by:</p>
<p><span class="math">\[\text{out} = \sum_a \sum_b \text{softmax}(\mathbf{E} \mathbf{x_a})_a\text{softmax}(\mathbf{E} \mathbf{x_b})_b \text{connected?}(a, b)\]</span> <span class="math">\[\text{out} = \sum_a \sum_b \text{softmax}(\mathbf{E} \mathbf{x_a})_a\text{softmax}(\mathbf{E} \mathbf{x_b})_b s_a s_b C_{a,b}\]</span></p>
<p>Same goes for the successor function:</p>
<p><span class="math">\[\text{out} = \sum_a \text{softmax}(\mathbf{E} \mathbf{x_a})_a \, \text{succ}(a)\]</span></p>
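<p>Addressing by content only changes where the distributions come from: they are the softmax of dot products between a query embedding and the rows of <span class="math">\(\mathbf{E}\)</span>. A plain Python sketch with made-up names and a toy 2-vertex graph with 2D embeddings:</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def address(E, x):
    """Distribution over vertices: softmax of the dot products between x and each row of E."""
    return softmax([sum(e_i * x_i for e_i, x_i in zip(row, x)) for row in E])


def soft_connected_by_content(E, x_a, x_b, s, C):
    """Connection strength between the vertices that x_a and x_b point at."""
    pA, pB = address(E, x_a), address(E, x_b)
    n = len(s)
    return sum(pA[a] * pB[b] * s[a] * s[b] * C[a][b]
               for a in range(n) for b in range(n))


# Two well separated vertex embeddings, one edge 0 -> 1.
E = [[10.0, 0.0],
     [0.0, 10.0]]
s = [1.0, 1.0]
C = [[0.0, 1.0],
     [0.0, 0.0]]

weights = address(E, [1.0, 0.0])
strength = soft_connected_by_content(E, [1.0, 0.0], [0.0, 1.0], s, C)
```

<p>The queries have size <span class="math">\(d\)</span> whatever the number of vertices, which is what lets the graph grow without touching the controller.</p>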
<p>We can extend the model again by adding a tensor <span class="math">\(\mathbf{F} \in \mathbb{R}^{|V| \times |V| \times d}\)</span> to embed the edges, but I'm not sure yet about the use case or the benefits. Maybe one could find it useful to know which (fuzzy) pair of vertices is linked by a given embedded edge <span class="math">\(\mathbf{x}\)</span>, like</p>
<p><span class="math">\[\text{vertex1} = \sum_a \text{softmax}(\sum_b s_b F_{a,b,*} \cdot \mathbf{x})_a s_a E_a\]</span> <span class="math">\[\text{vertex2} = \sum_a \text{softmax}(\sum_b s_b F_{b,a,*} \cdot \mathbf{x})_a s_a E_a\]</span></p>
<p>We can derive other operations in a similar fashion easily.</p>
<p>To grow the graph, let the controller output, at each timestep, a pair of an embedding and a strength. At each timestep, add the node with the said embedding and strength to the graph. For the edges, it's possible to either initialize their strength and embedding to 0, or to initialize the embedding of the edge from the new vertex <span class="math">\(a\)</span> to <span class="math">\(b\)</span> as <span class="math">\[F_{a,b,*} = \frac{E_{a,*} + E_{b,*}}{2}\]</span> and their strength as the cosine similarity of their embeddings, clipped at 0 <span class="math">\[C_{a,b} = \text{max}(0, \frac{\mathbf{E_{a,*} \cdot E_{b,*}}}{\Vert\mathbf{E_{a,*}}\Vert \Vert\mathbf{E_{b,*}}\Vert})\]</span></p>
<p>With this addressing mode, the underlying graph structure given by <span class="math">\(\mathbf{C}\)</span> and <span class="math">\(\mathbf{s}\)</span> is never accessed by the controller, which manipulates only embeddings to allow fuzzy operations. And the controller never needs the number of vertices, which allows using a growing graph without growing the controller, solving the issue we had when addressing by location.</p>
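<p>The second initialization can be sketched in plain Python as follows. The function name is made up, and returning a strength of 0 for a zero-norm embedding is an assumption of this sketch (the formula is undefined in that case):</p>

```python
import math


def init_edge(e_a, e_b):
    """Initial (embedding, strength) for the edge between two embedded vertices.

    Embedding: mean of the two vertex embeddings.
    Strength: cosine similarity of the embeddings, clipped at 0.
    """
    embedding = [(x + y) / 2.0 for x, y in zip(e_a, e_b)]
    dot = sum(x * y for x, y in zip(e_a, e_b))
    norm = math.sqrt(sum(x * x for x in e_a)) * math.sqrt(sum(y * y for y in e_b))
    strength = max(0.0, dot / norm) if norm > 0 else 0.0  # assumption: 0 for null vectors
    return embedding, strength


same = init_edge([1.0, 0.0], [1.0, 0.0])        # identical embeddings
orthogonal = init_edge([1.0, 0.0], [0.0, 1.0])  # unrelated embeddings
opposite = init_edge([1.0, 0.0], [-1.0, 0.0])   # negative similarity, clipped
```

<p>With this choice, a new vertex starts out linked to the vertices it resembles and unlinked from the rest, instead of starting with no edges at all.</p>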
<h2 id="experiments">Experiments</h2>
<p>They are to come as soon as I find an idea to test this concept, and decide on a clear way to architect the inputs / outputs of the controller. I'm thinking about question answering about relationships between things, but we'll see. I don't really know how to design such an experiment yet.</p>
<p>Maybe the neural net won't be able to learn how to use this. Maybe it won't use it the expected way. That's the kind of thing you never really know.</p>
<p>An expectation maximization Yahtzee AI. Vermeille, 2016-03-05.</p>
<h2 id="yahtzee">Yahtzee</h2>
<p>I will describe a simple AI I did for the Yahtzee game. The solution is not optimal because of one small point. There are probably smarter ways to write this program, but as I needed to write this program quickly to play with my friends on New Year's Eve (I had less than 3 days, actually), my priority was to have a solution almost guaranteeing me to win, not a beautiful and optimal one. If you know how to make it better, let me know in the comments.</p>
<p>As I am self taught in probability and statistics, my notations and terminology might not be accurate. You're more than welcome to help me improve this in the comments.</p>
<h3 id="description">Description</h3>
<p>The game of Yahtzee is a mix of poker and dice rolls: you have 5 dice to roll, the ability to reroll any of them twice, and, depending on the combinations you have, score some points. Each combination can be scored only once, and if no combination was made, the player must sacrifice one of them, so that the number of turns is fixed.</p>
<p>The combinations are:</p>
<table>
<thead>
<tr class="header">
<th align="left">Name</th>
<th align="left">Score</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">One</td>
<td align="left">Sum of 1s</td>
<td align="left">Number of 1s obtained</td>
</tr>
<tr class="even">
<td align="left">Two</td>
<td align="left">Sum of 2s</td>
<td align="left">Number of 2s obtained * 2</td>
</tr>
<tr class="odd">
<td align="left">Three</td>
<td align="left">Sum of 3s</td>
<td align="left">Number of 3s obtained * 3</td>
</tr>
<tr class="even">
<td align="left">Four</td>
<td align="left">Sum of 4s</td>
<td align="left">Number of 4s obtained * 4</td>
</tr>
<tr class="odd">
<td align="left">Five</td>
<td align="left">Sum of 5s</td>
<td align="left">Number of 5s obtained * 5</td>
</tr>
<tr class="even">
<td align="left">Six</td>
<td align="left">Sum of 6s</td>
<td align="left">Number of 6s obtained * 6</td>
</tr>
<tr class="odd">
<td align="left">Set</td>
<td align="left">Sum of 3 dice</td>
<td align="left">Three identical dice. Score is the sum of those 3.</td>
</tr>
<tr class="even">
<td align="left">Full House</td>
<td align="left">25</td>
<td align="left">Three identical dice + two identical dice.</td>
</tr>
<tr class="odd">
<td align="left">Quad</td>
<td align="left">Sum of 4 dice</td>
<td align="left">Four identical dice. Score is the sum of those 4.</td>
</tr>
<tr class="even">
<td align="left">Straight</td>
<td align="left">30</td>
<td align="left">Four dice in sequence (1234 / 2345 / 3456)</td>
</tr>
<tr class="odd">
<td align="left">Full straight</td>
<td align="left">40</td>
<td align="left">Five dice in sequence (12345 / 23456)</td>
</tr>
<tr class="even">
<td align="left">Yahtzee</td>
<td align="left">50</td>
<td align="left">Five identical dice</td>
</tr>
<tr class="odd">
<td align="left">Luck</td>
<td align="left">Sum of dice</td>
<td align="left">Any combination. Usually, when nothing else works.</td>
</tr>
</tbody>
</table>
<p>Each player, in turn, does this:</p>
<ol style="list-style-type: decimal">
<li>Roll all dice. The player can select a combination and end his turn, or...</li>
<li>Select some dice to roll again. Then, the player can select a combination and end his turn, or...</li>
<li>Select some dice to roll again. Then, the player MUST select a combination to score or sacrifice.</li>
</ol>
<h2 id="the-ai">The AI</h2>
<h3 id="the-numbers">The numbers</h3>
<p>The game has a fairly low dimensionality. Any of the 5 dice can take values from 1 to 6. Hence, the (naive) number of possible games is <span class="math">\(6^5 = 7776\)</span>. But this is actually an upper bound: the dice are not ordered, and a lot of the combinations are equivalent (11234 is equivalent to 12431, etc). The real number of possible games is given by the formula for unordered combinations with repetition. With <span class="math">\(n = 6\)</span> and <span class="math">\(k = 5\)</span>:</p>
<p><span class="math">\[C'_k(n) = {n+k-1 \choose k}\]</span> <span class="math">\[C'_{ 5 }( 6 ) = C_{{ 5}}(10) = {{ 10} \choose 5} = \frac{ 10! }{ 5!(10-5)!} = 252\]</span></p>
<p>Which is, fortunately, far from intractable, and we can bruteforce all of them.</p>
<p>We will also find it useful later to know how many outcomes are possible for any number of dice.</p>
<table>
<thead>
<tr class="header">
<th align="left"># of dice</th>
<th align="left"># of outcomes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">0</td>
<td align="left">1</td>
</tr>
<tr class="even">
<td align="left">1</td>
<td align="left">6</td>
</tr>
<tr class="odd">
<td align="left">2</td>
<td align="left">21</td>
</tr>
<tr class="even">
<td align="left">3</td>
<td align="left">56</td>
</tr>
<tr class="odd">
<td align="left">4</td>
<td align="left">126</td>
</tr>
<tr class="even">
<td align="left">5</td>
<td align="left">252</td>
</tr>
</tbody>
</table>
<p>The number of possible <em>actions</em> (sets of dice to reroll) is the number of subsets of the dice, i.e. <span class="math">\(2^k=2^5=32\)</span>.</p>
<h3 id="the-program">The program</h3>
<p>The program is fairly simple to use: given a dice roll, it will tell you which dice to reroll (if any), and the associated expected score.</p>
<p>First, we need to precompute the score that each roll gets for every combination. I first enumerate each of the (ordered) possible games, compute their score for each combination, and store them in a table of size <span class="math">\(7776 \times 13\)</span>.</p>
<p>The user is then prompted to write the hand he got. The objective is the following:</p>
<p><span class="math">\[\text{action*} = \underset{\text{action}}{\operatorname{argmax}} \mathbb{E}[\text{best score | action}]\]</span></p>
<p>i.e., find the subset of dice to reroll that leads to the best expected score (taking, for each possible outcome of the reroll, the best-scoring combination), where <span class="math">\(action\)</span> ranges over the 32 possible subsets of dice to reroll, and <span class="math">\(action*\)</span> is the best choice according to an eager strategy.</p>
<p>This expectation can be computed as follows:</p>
<p><span class="math">\[\text{action*} = \underset{\text{action}}{\operatorname{argmax}}
\frac{1}{\text{# of equivalent outcomes | action}} \sum_{\text{possible games} g \text{| action}} \underset{\text{combination} c}{\operatorname{max}}(\text{score for} c | g)\]</span></p>
<p>This is an eager policy that maximizes the score for each <em>turn</em>. As such, this algorithm does not take into account the <em>waste of points</em> you may incur by choosing a combination, which it would need to do in order to maximize your score over the <em>whole game</em>. As I was unable to think of an optimal solution for this (and I would really like to know if there is one), I chose to apply a (quite arbitrary) penalty to each combination's maximum score, as follows:</p>
<p><span class="math">\[\text{penalty(combination, current_score)} = \exp\left(-\frac{\text{best possible score for combination} - \text{current_score}}{100}\right)\]</span></p>
<p>In code terms, this would lead to something like:</p>
<ol style="list-style-type: decimal">
<li>Read input hand <span class="math">\(r\)</span></li>
<li>Initialize <span class="math">\(e\)</span>, the expectation for each possible reroll, to 0</li>
<li>For each possible game <span class="math">\(g_i\)</span>:
<ol style="list-style-type: decimal">
<li><span class="math">\(d = \text{dice to reroll to go from } r \text{ to } g_i\)</span></li>
<li><span class="math">\[e[d] \text{+=} \frac{1}{\text{number of possible outcomes for } d}
\text{maximum score for } g_i\]</span></li>
</ol></li>
<li>return <span class="math">\(\underset{\text{d}}{\operatorname{argmax}} e[d]\)</span></li>
</ol>
<p>And that's it.</p>
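<p>To make the idea concrete, here is a small Python sketch of the eager policy. It is not the code from the repository, and it makes several assumptions: it loops over the 32 reroll subsets and enumerates the outcomes of each one (rather than looping over the 7776 games), it only scores the handful of combinations shown in the table above, and it ignores the penalty. The names <code>best_score</code> and <code>best_reroll</code> are made up for illustration.</p>

```python
from itertools import product


def best_score(hand):
    """Best score over a simplified rule set (assumption: only some combinations)."""
    faces = set(hand)
    scores = [sum(hand)]                       # Luck: sum of all dice
    for value in range(1, 7):
        n = hand.count(value)
        if n >= 4:
            scores.append(4 * value)           # four identical dice
        if n == 5:
            scores.append(50)                  # Yahtzee
    if any(set(range(s, s + 4)) <= faces for s in (1, 2, 3)):
        scores.append(30)                      # straight
    if any(set(range(s, s + 5)) <= faces for s in (1, 2)):
        scores.append(40)                      # full straight
    return max(scores)


def best_reroll(hand):
    """Return (expected score, reroll mask) maximizing the expected best score."""
    best = None
    for mask in product((False, True), repeat=len(hand)):
        kept = [die for die, reroll in zip(hand, mask) if not reroll]
        k = sum(mask)                          # number of dice rerolled
        total = 0
        for outcome in product(range(1, 7), repeat=k):
            total += best_score(kept + list(outcome))
        expectation = total / 6 ** k           # all 6^k outcomes are equally likely
        if best is None or expectation > best[0]:
            best = (expectation, mask)
    return best


expectation, mask = best_reroll([6, 6, 6, 6, 1])
print("reroll mask:", mask, "expected score:", round(expectation, 2))
```

<p>A <code>True</code> in the mask means "reroll this die". The whole search only evaluates <span class="math">\(7^5 = 16807\)</span> hands, so it is instantaneous.</p>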
<h2 id="conclusion">Conclusion</h2>
<p>I won, statistically. Which is good. Bad point: my friends were angry because taking some time to make an often "obvious" choice was not worth it according to them :D. Make sure your friends enjoy maths and / or CS before doing something like this!</p>
<p>The code is available on <a href="https://github.com/Vermeille/Yahtzee">my GitHub page</a>. As I said, don't expect magnificent code.</p>
Writing a French POS tagger (2)
2015-10-13T12:00:00+02:00 2015-10-14T08:25:11+02:00 Vermeille
<p>How to choose those confidence levels? That's where we start to apply machine learning.</p>
<p>Okay, we're into maths now, so everything is a number. Let's assign a unique natural integer to every word in our vocabulary, so that <span class="math">\(w_i\)</span> is the word with the id i, and a unique integer to every POS class, so that <span class="math">\(c_i\)</span> is the class with the id i.</p>
<h2 id="maximum-likelihood-estimate">Maximum Likelihood Estimate</h2>
<p>Since we're doing ML, let's use some vocabulary. Everything we extract from the <span class="math">\(i\)</span>th word (noted <span class="math">\(w_i\)</span>) is called "features", denoted <span class="math">\(\theta\)</span>; the POS we're trying to tag are named "classes" and hence will be noted <span class="math">\(c\)</span>; and our confidence is a probability. If we forget for a moment about suffixes, prefixes and capitalization, the word itself is our only source of info. Predicting is just the act of computing</p>
<p><span class="math">\[\underset{c}{\operatorname{argmax}} P(c | w_i)\]</span></p>
<p>which is like asking "which class am I the most confident about, considering this word?". How do we find such a probability? We can use a simple algorithm called Maximum Likelihood Estimate, which works like this:</p>
<p><span class="math">\[P(c | w_i) = \frac{\operatorname{Count}(w_i\text{ is a }c)}{\operatorname{Count}(w_i)}\]</span></p>
<p>"How many times, in a big text corpus, is <span class="math">\(w_i\)</span> of class <span class="math">\(c\)</span>, relative to its total number of occurrences?". And since the denominator is the same for all classes (it depends solely on <span class="math">\(w_i\)</span>), we can leave it out of the argmax.</p>
<p><span class="math">\[\underset{c}{\operatorname{argmax}} P(c | w_i) = \underset{c}{\operatorname{argmax}} \operatorname{Count}(w_i\text{ is a }c)\]</span></p>
<p>Super simple, but we can't do anything for unknown words despite having all those fancy morphological features. We need something to incorporate them.</p>
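<p>As an illustration, here is a minimal Python sketch of this counting approach. The tiny corpus and the tag names (<code>PRON</code>, <code>VERB</code>, <code>DET</code>, <code>NOUN</code>) are made up for the example, and so are the function names.</p>

```python
from collections import Counter, defaultdict


def train_mle(tagged_corpus):
    """Count how many times each word appears with each class."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts


def predict(counts, word):
    """argmax over classes of Count(word is a class); None for unknown words."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Toy corpus (hypothetical data): "commande" is seen once as a verb, twice as a noun.
corpus = [
    ("je", "PRON"), ("commande", "VERB"),
    ("une", "DET"), ("commande", "NOUN"),
    ("la", "DET"), ("commande", "NOUN"),
]
counts = train_mle(corpus)
print(predict(counts, "commande"))
print(predict(counts, "commandait"))
```

<p>The second call shows the limitation: a word that never appeared in the corpus gets no prediction at all.</p>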
<h2 id="maximum-entropy-model">Maximum Entropy model</h2>
<h3 id="prediction">Prediction</h3>
<p>Let's say we have what we call a feature vector <span class="math">\(\theta\)</span>, for which <span class="math">\(\theta_i\)</span> is the <span class="math">\(i\)</span>th feature being 1 if the feature is present, 0 otherwise. When we try to predict the class <span class="math">\(c\)</span>, the ith feature will be more or less discriminative. Let's represent that by weighting it with <span class="math">\(\lambda_{c,i}\)</span> "how indicative is <span class="math">\(\theta_i\)</span> for class <span class="math">\(c\)</span>?" where:</p>
<ul>
<li>a high lambda means "wow, super discriminative, this is super indicative of this class!"</li>
<li>a low (negative) lambda means "wow, super discriminative, this is super indicative of NOT this class!"</li>
<li>and a lambda about 0 means "this is not helping me for class c".</li>
</ul>
<p>To emphasize: <span class="math">\(\lambda_{c,i}\theta_i\)</span> is NOT a probability, it's just a score. Continuing this way, the score of class <span class="math">\(c\)</span> for the whole feature vector is the dot product <span class="math">\(\lambda_c\cdot\theta\)</span>: weighting, then summing.</p>
<p>Note: Here, <span class="math">\(\lambda\)</span> is essentially a matrix from which I pick the column <span class="math">\(c\)</span>, which is contrary to everything you'll read in the literature. I'm doing this because, contrary to what the literature says, I assume that a feature can be discriminative for every class, whereas most papers want you to condition the features on the class being predicted, allowing a different set of features per class. With my approach, we have more parameters, and some of them will end up weighted to 0, but it avoids premature optimization. Later on, after having analyzed the lambdas, one can purge the unnecessary features and weights from the actual implementation, without having missed a discriminative indication.</p>
<p>Once we have this score, easy stuff, find the class with the best score.</p>
<p><span class="math">\[\underset{c}{\operatorname{argmax}} P(c | \theta, \lambda)
= \underset{c}{\operatorname{argmax}} \lambda_c\cdot\theta\]</span></p>
<p>How cool is that? Wait, how do I choose all those lambdas?</p>
<h3 id="learning">Learning</h3>
<p>Well, that's a different story. To make a long story short, we have a very handy algorithm, called Maximum Entropy. It relies on the principle that the probability distribution which best represents the current state of knowledge is the one with the largest entropy.</p>
<p>I said earlier that scores weren't probabilities, and now I say that ME works on probabilities. We need to turn those scores into probabilities. First, we need the class scores to be on the same scale. Fairly easy: just divide the score for the class <span class="math">\(c\)</span> by the sum of the absolute scores of all classes. We're mapping all of our values to [-1;1].</p>
<p><span class="math">\[\operatorname{not-really-P}(c | \theta, \lambda) = \frac{\lambda_c\cdot\theta}{\sum_{c'}|\lambda_{c'}\cdot\theta|}\]</span></p>
<p>Still not good, we need [0;1]. Well, we could just add 1, but actually, we don't like computers to work on [0;1], because computation in such a range tends to quickly underflow to absurdly small numbers, and then to NaN. And divisions and multiplications are not cheap. That's why we prefer to deal with log-probabilities instead of probabilities. Log is monotonic, has nice additive properties and is a cute-looking function, despite its undefined log(0) value. It turns out that the best way to make probabilities from those scores is the exp function.</p>
<p><span class="math">\[P(c | \theta,\lambda) = \frac{\exp{\lambda_c\cdot\theta}}{\sum_{c'}\exp{\lambda_{c'}\cdot\theta}}\]</span></p>
<p>It maps to the correct range of values, and if we take the log-probability, the division turns into a subtraction and the exp in the numerator cancels out. How nice.</p>
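<p>Here is a minimal Python sketch of that mapping, working in log space first. Subtracting the maximum score before calling <code>exp</code> is an extra trick that is not part of the formula above (it does not change the result, it only avoids overflow); the function names are made up for the example.</p>

```python
import math


def log_softmax(scores):
    """log P(c) = score_c - log(sum over c' of exp(score_c'))."""
    m = max(scores)  # shift by the max so that exp() never overflows
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities."""
    return [math.exp(lp) for lp in log_softmax(scores)]


class_scores = [2.0, -1.0, 0.0]        # one lambda_c . theta per class
probabilities = softmax(class_scores)
print(probabilities, sum(probabilities))
print(softmax([1000.0, 1001.0]))       # large scores are handled too
```

<p>The class with the highest score still gets the highest probability, so the prediction itself is unchanged.</p>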
<p>And the super good thing is that, by whichever mathemagics I'm not fully getting yet (please someone explain?), Maximum Entropy and Maximum Likelihood are linked, which brings us to an optimization objective of:</p>
<p><span class="math">\[\underset{\lambda}{\operatorname{argmin}} -\log \mathcal{L}
= \underset{\lambda}{\operatorname{argmin}} -\sum_x\log\frac{\exp(\lambda_{c^{(x)}}\cdot\theta^{(x)})}{\sum_{c'}\exp(\lambda_{c'}\cdot\theta^{(x)})}\]</span></p>
<p>Where <span class="math">\(\theta^{(x)}\)</span> is the feature vector of the <span class="math">\(x\)</span>th example in the dataset, and <span class="math">\(c^{(x)}\)</span> is its true class.</p>
<p>Cool. We have to take the gradient of this objective with respect to lambda, which gives us, for each class <span class="math">\(c\)</span>,</p>
<p><span class="math">\[\frac{\partial (-\log\mathcal{L})}{\partial\lambda_c} =
\sum_x \left(P(c | \theta^{(x)}, \lambda) - 1\{x\text{ is a }c\}\right)\theta^{(x)}\]</span></p>
<p>With this derivative, we can take a simple iterative approach to update lambda (gradient descent with a small step size <span class="math">\(\alpha\)</span>).</p>
<p><span class="math">\[\lambda_c := \lambda_c - \alpha\frac{\partial (-\log\mathcal{L})}{\partial \lambda_c}\]</span></p>
<p>This is quite slow, but it works.</p>
<p>And in the end, you have your model.</p>
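<p>For the curious, here is a small pure-Python sketch of that training loop. It is an illustration rather than my actual implementation: it uses the gradient of the negative log-likelihood, <span class="math">\((P(c | \theta^{(x)}, \lambda) - 1\{x\text{ is a }c\})\theta^{(x)}\)</span>, and the toy features, the step size and the number of epochs are arbitrary assumptions.</p>

```python
import math


def dot(weights, theta):
    return sum(w * t for w, t in zip(weights, theta))


def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(examples, num_classes, num_features, alpha=0.1, epochs=200):
    """Batch gradient descent on the negative log-likelihood."""
    lam = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(epochs):
        grad = [[0.0] * num_features for _ in range(num_classes)]
        for theta, label in examples:
            probs = softmax([dot(lam[c], theta) for c in range(num_classes)])
            for c in range(num_classes):
                error = probs[c] - (1.0 if c == label else 0.0)
                for i, t in enumerate(theta):
                    grad[c][i] += error * t
        for c in range(num_classes):
            for i in range(num_features):
                lam[c][i] -= alpha * grad[c][i]
    return lam


def predict(lam, theta):
    """Class with the best score lambda_c . theta."""
    scores = [dot(weights, theta) for weights in lam]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical binary features: [is_capitalized, ends_with_"er", bias]
examples = [([1, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 0), ([0, 1, 1], 1)]
lam = train(examples, num_classes=2, num_features=3)
print(predict(lam, [1, 0, 1]), predict(lam, [0, 1, 1]))
```

<p>Here <code>lam[c]</code> plays the role of <span class="math">\(\lambda_c\)</span>. Looking at the learned weights afterwards tells you which features ended up being discriminative for which class.</p>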
<p>Wait, what about the context? We used only features from the word itself, so we can't disambiguate between "je commande" and "une commande" from the very first example. Well, we'll have to use something a little smarter than a Maximum Entropy Model: a MEMM, a Maximum Entropy Markov Model.</p>