A differentiable graph for neural networks

Following the work of Grefenstette et al. (2015) in the paper "Learning to transduce with unbounded memory", which introduced a fully differentiable Stack, Queue, and DeQue, I'm trying to create a differentiable graph model. The work is not finished yet, since it lacks proper experiments, but I'm writing this article in the hope that someone may help me with these ideas.

Also, I haven't studied maths in an academic context for the past 5 years, so don't expect perfect notation and terminology: I'm self-taught in ML and maths. However, I think the ideas are reasonably introduced, and that the concept is not totally wrong.

Basic graph model

The simple model stores the edges in an adjacency matrix of size \(|V|\times |V|\), where row \(a\), column \(b\) contains \(P(\text{edge}_{ab})\), that is, the strength of the link from vertex \(a\) to vertex \(b\). The vertices are stored in a vector of size \(|V|\) where each cell contains \(P(a)\), i.e., the strength of existence of \(a\). Let's call the adjacency matrix \(\mathbf{C}\), where \(C_{ab}\) is the entry at the \(a\)th row and \(b\)th column of \(\mathbf{C}\), and \(\mathbf{s}\) the strength vector. Both the vector and the matrix contain values only in \([0,1]\).
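To make this concrete, here is a minimal sketch of that state in NumPy (the later snippets build on these definitions); the variable names and the zero initialization are my own illustrative choices:

```python
import numpy as np

n_vertices = 4  # |V|, fixed for now

# s[a] = P(a): strength of existence of vertex a, in [0, 1]
s = np.zeros(n_vertices)

# C[a, b] = P(edge_ab): strength of the link from vertex a to b, in [0, 1]
C = np.zeros((n_vertices, n_vertices))
```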

For two vertices \(a\) and \(b\) (I will discuss later how to address them), some operations are:

Are two vertices directly connected?

\[\text{connected?}(a, b) = s_a s_b C_{ab}\]

Returns the strength of the edge between \(a\) and \(b\). We can also propose to read \(\mathbf{C}^n\) to access the \(n\)th-order transitive closure of the graph.
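As a sketch, assuming the `s` and `C` above, this read and its \(\mathbf{C}^n\) variant look like:

```python
def connected(s, C, a, b):
    """Strength of the direct link a -> b, gated by both vertices' existence."""
    return s[a] * s[b] * C[a, b]

def connected_n(s, C, a, b, n):
    """Same read on C^n, i.e. the n-th order transitive closure."""
    return s[a] * s[b] * np.linalg.matrix_power(C, n)[a, b]
```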

Successors of \(a\)

\[\text{succ}(a) = (\mathbf{s} \circ \mathbf{C}_{a,*})\, s_a\]

Where \(\mathbf{C}_{a,*}\), following MATLAB's notation, denotes the \(a\)th row of \(\mathbf{C}\).

Returns a vector of strengths. The \(i\)th value indicates the strength of \(i\) as a successor of \(a\).

The predecessor function can be implemented trivially by using columns instead of rows of the matrix.
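Both directions, sketched with the same conventions (`succ` reads a row, `pred` a column):

```python
def succ(s, C, a):
    """Vector whose i-th entry is the strength of i as a successor of a."""
    return (s * C[a, :]) * s[a]

def pred(s, C, a):
    """Vector whose i-th entry is the strength of i as a predecessor of a."""
    return (s * C[:, a]) * s[a]
```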

Change the strength of a vertex

Changing the strength of a vertex \(a\) towards a target strength \(t\) by an amount \(p\) is:

\[s_a := (1-p)s_a + p t\]

So that nothing changes if \(p = 0\). The same interpolation applies to the edges, updating \(C_{ab}\) instead of \(s_a\).
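A sketch of both updates; `p` and `t` would be emitted by the controller:

```python
def set_vertex_strength(s, a, t, p):
    """Move s[a] towards target t by amount p (p = 0 leaves it unchanged)."""
    s[a] = (1 - p) * s[a] + p * t

def set_edge_strength(C, a, b, t, p):
    """Same interpolation for the edge a -> b."""
    C[a, b] = (1 - p) * C[a, b] + p * t
```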

Addressing

Similarly to what has been proposed for the Neural Turing Machine (Graves et al. 2014), I propose two addressing modes: by location and by content. The manipulation of the associated graph functions almost works the way the Neural Random Access Machine (Kurach et al. 2015) uses its modules.

By location

Let's take the example of the \(\text{connected?}(a, b)\) function.

To be able to call this function in a fully differentiable way, we can't make a hard choice of \(a\) and \(b\). We instead have to make \(a\) and \(b\) distributions over the vertices.

Let \(A\) and \(B\) be distributions over the vertices. Then \(\text{connected?}(a, b)\) can be used in a differentiable way like this:

\[\text{out} = \sum_a \sum_b P(A = a)\,P(B = b)\,\text{connected?}(a, b)\]
\[\text{out} = \sum_a \sum_b P(A = a)\,P(B = b)\, s_a s_b C_{ab}\]

Same goes for the successor function:

\[\text{out} = \sum_a P(A = a)\text{succ}(a)\]

And so on.
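In code, these expected reads reduce to a few matrix products. A sketch, where `pA` and `pB` are the distributions \(P(A)\) and \(P(B)\) as vectors:

```python
def connected_by_location(s, C, pA, pB):
    """sum_a sum_b pA[a] pB[b] s[a] s[b] C[a, b], as a bilinear form."""
    return (pA * s) @ C @ (pB * s)

def succ_by_location(s, C, pA):
    """sum_a pA[a] succ(a): expected successor-strength vector."""
    return s * ((pA * s) @ C)
```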

However, addressing by location has severe drawbacks:

  • You can't grow the number of vertices: the neural net emitting the addressing distributions would need to grow accordingly.

  • As you grow the number of available operations or "views" of the graph (such as adding the ability to read the \(n\)th-order transitive closure to study connectivity), you need to emit more and more distributions over \(V\).

  • You need to emit \(|V|^2\) values of \((p, t)\) at each time step to gain the ability to modify every edge. That is a lot. Way too much. An RNN may be able to do it sequentially, or may not.

Hence, the graph must keep a fixed size, and the dimensionality of the controls is already huge.

I will now discuss a way to reduce the dimensionality and allow the graph to have an unfixed size, using content-based addressing.

By content

So far, our vertices and edges were unlabeled. I have no idea how an unlabeled graph would be useful: the neural net controlling it would have to find a way on its own to keep track of which node is which, and it might not achieve that.

Here, I propose to extend our model with embeddings. With \(d\) being the size of the embeddings, we need an additional matrix \(\mathbf{E} \in \mathbb{R}^{|V| \times d}\) to embed the vertices.

Addressing the nodes is now done by generating a distribution over the nodes, defined as the softmax of the similarity (dot product) between the embedding emitted by the neural net and the actual vertices' embeddings.

For instance, let \(\mathbf{x_a}, \mathbf{x_b} \in \mathbb{R}^{d}\) be two embeddings given by the controller. We can get the strength of the connection between those embeddings by:

\[\text{out} = \sum_a \sum_b \text{softmax}(\mathbf{E} \mathbf{x_a})_a\,\text{softmax}(\mathbf{E} \mathbf{x_b})_b\, \text{connected?}(a, b)\]
\[\text{out} = \sum_a \sum_b \text{softmax}(\mathbf{E} \mathbf{x_a})_a\,\text{softmax}(\mathbf{E} \mathbf{x_b})_b\, s_a s_b C_{ab}\]

Same goes for the successor function:

\[\text{out} = \sum_a \text{softmax}(\mathbf{E} \mathbf{x_a})_a\, \text{succ}(a)\]
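A sketch of this addressing scheme, where `x_a` and `x_b` stand for the controller's emitted embeddings:

```python
def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def address_by_content(E, x):
    """Distribution over vertices from the dot-product similarity E @ x."""
    return softmax(E @ x)

def connected_by_content(s, C, E, x_a, x_b):
    """Content-based connected? read: the bilinear form weighted by softmaxes."""
    pA, pB = address_by_content(E, x_a), address_by_content(E, x_b)
    return (pA * s) @ C @ (pB * s)

def succ_by_content(s, C, E, x_a):
    """Content-based successor read."""
    pA = address_by_content(E, x_a)
    return s * ((pA * s) @ C)
```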

We can extend the model again by adding a tensor \(\mathbf{F} \in \mathbb{R}^{|V| \times |V| \times d}\) to embed the edges, but I'm not sure yet about the use cases or the benefits. Maybe one could find it useful to know which (fuzzy) pair of vertices is linked by a given embedded edge \(\mathbf{x}\), like:

\[\text{vertex1} = \sum_a \text{softmax}\left(\sum_b s_b F_{a,b,*} \cdot \mathbf{x}\right)_a s_a E_{a,*}\]
\[\text{vertex2} = \sum_a \text{softmax}\left(\sum_b s_b F_{b,a,*} \cdot \mathbf{x}\right)_a s_a E_{a,*}\]
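Reusing the `softmax` helper above, the two readouts can be sketched as:

```python
def edge_endpoints(s, E, F, x):
    """Fuzzy source and target embeddings of the edge best matching x."""
    M = F @ x                         # M[a, b] = F[a, b, :] . x
    src = (softmax(M @ s) * s) @ E    # vertex1: sum_a softmax(...)_a s_a E_a
    dst = (softmax(M.T @ s) * s) @ E  # vertex2: same, roles of a and b swapped
    return src, dst
```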

We can easily derive other operations in a similar fashion.

To grow the graph, let the controller output a pair of an embedding and a strength at each timestep, and add a node with that embedding and strength to the graph. For the edges, it's possible either to initialize their strength and embedding to 0, or to initialize the embedding of the edge from a new vertex \(a\) to \(b\) as \[F_{a,b,*} = \frac{E_{a,*} + E_{b,*}}{2}\] and their strength as the cosine similarity of the embeddings, clipped at zero: \[C_{ab} = \max\left(0, \frac{\mathbf{E}_{a,*} \cdot \mathbf{E}_{b,*}}{\Vert\mathbf{E}_{a,*}\Vert\, \Vert\mathbf{E}_{b,*}\Vert}\right)\]
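A sketch of the growth step under the second initialization (clipped cosine strengths); the edge-embedding tensor \(\mathbf{F}\) is omitted for brevity, and the helper name is mine:

```python
def grow_graph(s, C, E, x_new, s_new, eps=1e-8):
    """Append a vertex with embedding x_new and strength s_new.

    New edge strengths are the cosine similarity between embeddings,
    clipped at zero, as proposed above.
    """
    E = np.vstack([E, x_new])
    s = np.append(s, s_new)
    n = len(s)
    # Cosine similarity of every vertex with the new one, clipped at 0
    norms = np.linalg.norm(E, axis=1) * np.linalg.norm(x_new)
    cos = np.clip(E @ x_new / np.maximum(norms, eps), 0.0, None)
    C_new = np.zeros((n, n))
    C_new[:n - 1, :n - 1] = C
    C_new[n - 1, :] = cos      # edges from the new vertex
    C_new[:, n - 1] = cos      # edges to the new vertex
    C_new[n - 1, n - 1] = 0.0  # no self-loop
    return s, C_new, E
```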

With this addressing mode, the underlying graph structure given by \(\mathbf{C}\) and \(\mathbf{s}\) is never accessed directly by the controller, which manipulates only embeddings, allowing fuzzy operations. The controller never needs to know the number of vertices either, so the graph can grow without growing the controller, solving the issue we had when addressing by location.

Experiments

They are to come as soon as I find an idea to test this concept and decide on a clear way to architect the inputs and outputs of the controller. I'm thinking about question answering on relationships between things, but we'll see. I don't really know how to design such an experiment yet.

Maybe the neural net won't be able to learn how to use this. Maybe it won't use it the expected way. That's the kind of thing you never really know.