Forward Propagation: Building a Skip-Gram Net From the Ground Up
Part 1: Skip-gram Feedforward
Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab. Make sure to check out the other posts in the series so far:
- NLP Research Lab Part 1: Distributed Representations
- NLP Research Lab Part 2: Skip-Gram Architecture Overview
Let's continue our treatment of the Skip-gram model by walking through a single example of feeding forward through a Skip-gram neural network: from an input target word, through a projection layer, to an output context vector representing the target word's nearest neighbors. Before we get into our example, though, let's revisit some fundamentals of neural networks.
Neural Network Fundamentals
Neural networks originally got their name from borrowing concepts observed in the functioning of the biological neural pathways in the brain. At a very basic level, the analogy between a node in a neural network and a neuron in a biological brain is a useful way to explain the fundamental concepts.
The biological model of a neural pathway consists of specialized cells called neurons, whose dendrites receive chemical signals at one end. Those signals build up an electric potential until a threshold is reached (called activation), causing more or different chemical signals to be released at the other end. These output chemicals can then act as input to other neurons, causing them to activate as well. In the biological model, a combination of activated neurons can be interpreted in such a way as to cause an action on behalf of the organism.
The computational model of a neural network represents this process mathematically by propagating input data in a particular way through a graph structure containing nodes inside an input layer, hidden layer, and output layer. The input layer represents the input data, analogous to the incoming chemical signals of a neuron. The hidden layer represents the neurons receiving the input signal. The output layer is a simplification of the decision interpretation of the biological model; it represents the decision or action a given activated pathway indicates. In other words, the output layer is a classifier.
The input data, which in the biological system are chemical signals but in the computational model are numerical quantities, are made available to all potential hidden layer nodes through a linear transformation of a weight layer. This provides a way for the computational model to represent that not all nodes receive the input data evenly or with the same affinity.
The hidden layer, representing our neurons, receives this input and then applies a non-linear activation function to simulate the neurons' activation. The simplest activation function is the Heaviside step function, which simply turns a hidden layer node on (1) or off (0) (a neural network that uses this type of activation function is nicknamed a perceptron). More common activation functions in language models are the logistic (sigmoid) function, the hyperbolic tangent, and their variants.
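As a quick illustration of those activation functions (standard definitions in NumPy, shown only for reference; as we'll see later, Skip-gram actually drops this non-linear step entirely):
>>> import numpy as np
>>> def heaviside(x):
...     return np.where(x >= 0, 1.0, 0.0)   # hard on/off threshold (the perceptron's activation)
...
>>> def logistic(x):
...     return 1.0 / (1.0 + np.exp(-x))     # squashes any input into the range (0, 1)
...
>>> print(logistic(0.0))
0.5
>>> print(np.tanh(0.0))                     # the hyperbolic tangent squashes into (-1, 1)
0.0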
The output layer is produced by a linear transformation from the hidden layer through another weight matrix. This provides a way for the computational model to represent that activated nodes have varying effects on the final decision interpretation.
Below is a figure from the Appendix of the paper "word2vec Parameter Learning Explained" by Xin Rong that represents this process. The input layer {x(k)} = {x(1), ..., x(K)} is linearly transformed through a weight matrix {w(ki)}. That result is activated with an activation function to produce the hidden layer {h(i)} = {h(1), ..., h(N)}. The resulting hidden layer {h(i)} is linearly transformed through another weight matrix {w'(ij)} to finally result in the output layer {y(j)} = {y(1), ..., y(M)}.
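In code, a forward pass through this classic architecture is just two matrix multiplications with a non-linear activation in between. Here is a minimal sketch with made-up layer sizes (K=4 inputs, N=3 hidden nodes, M=2 outputs), reusing the logistic function defined above; this is the generic neural net, not yet Skip-gram:
>>> K, N, M = 4, 3, 2                           # sizes of the input, hidden, and output layers
>>> x = np.random.random_sample(K)              # input layer {x(k)}
>>> W = np.random.random_sample((K, N))         # first weight matrix {w(ki)}
>>> W_prime = np.random.random_sample((N, M))   # second weight matrix {w'(ij)}
>>> h = logistic(np.dot(x, W))                  # hidden layer {h(i)}: linear transform, then activation
>>> y = np.dot(h, W_prime)                      # output layer {y(j)}: another linear transform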
The learning part of both a biological brain and a computational neural network has to do with strengthening or weakening certain paths given certain input data. How the biological system achieves this is well beyond my ability to explain, but on the computational side we simulate it by modifying our weight layers in reaction to some objective function (a term often conflated with the related term cost function; for our purposes they are synonymous). Knowing what our input layer looks like and what our output layer should represent, we pick some objective function that allows us to compare our output layer classifier with the expected result for a given input. In other words, we pick something the output layer should represent (like a probability) that we can compare to the subset of reality represented by our training data (like an observed frequency).
By iteratively training on examples and penalizing the weight layers for the error observed based on our interpretation of the output layer, the net eventually learns how to output the values that mean what we want. We use a mathematical technique called the chain rule to determine the effect of different parts of the network (e.g. only the second weight layer, or only the first weight layer) on the total error. We then modify each piece accordingly using an algorithmic technique called gradient descent (or ascent, depending on whether you are maximizing or minimizing your objective function). In practice, we use a less computationally expensive variant called stochastic gradient descent, which updates the weights one training example (or a small random batch of examples) at a time rather than computing the gradient over the entire training set for each update.
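To make that 'modify each piece' idea concrete, a single gradient descent step is just a small nudge against the gradient. The gradient below is a random placeholder standing in for what the chain rule would actually compute:
>>> learning_rate = 0.025                          # how big a step to take on each update
>>> gradient_W = np.random.random_sample(W.shape)  # placeholder for d(objective)/dW from the chain rule
>>> W = W - learning_rate * gradient_W             # step the weights against the gradient to reduce error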
Once you have reached a predefined threshold for your objective function (or got tired of running your network for so long), your model is complete. You would now be able to submit new input data and receive a classification result based on all the examples you trained with in the first place.
Skip-gram Model Architecture
Now that we've reestablished some fundamentals, let's set up the specific architecture of Skip-gram. Recall that the goal of Skip-gram is to learn the probability distribution of words in our vocabulary being within a given distance (context) of an input word. We are effectively choosing our output layer to represent that probability distribution. Once we've trained the network with this goal, instead of using the model to predict the context probability distribution for future words, we'll derive distributed representations of words we saw (word embeddings) from the input weight layer of the final model.
As you may recall from our previous post, our example corpus consisted of the following sentence.
“Duct tape works anywhere. Duct tape is magic and should be worshiped.”
In that post, we preprocessed this sentence by removing punctuation and capitalization, dropping stopwords, and stemming the remaining words. This left us with "duct tape work anywher duct tape magic worship."
Let's start by defining some terminology and the variables you'll see associated with them throughout this post.
A vocabulary (v) is a deduplicated, ordered list of all the distinct words in our corpus. The specific order of the words doesn't matter, as long as it stays the same throughout the entire process.
>>> v = ["duct", "tape", "work", "anywher", "magic", "worship"]
>>> print(len(v))
6
>>> print(v[0])
"duct"
A context (c) is a zone, or window, of words before and after a target word that we want to consider "near" the target word. We can select a context of our choice, which is 2 in the example below.
>>> c = 2
We have another selection to make, this time about the projection layer. We can specify the size of the projection layer in nodes (n). The more nodes, the higher the dimensionality of the projection between layers and the higher the dimensionality of your final word embeddings. We'll look at this in more detail a little later. For now, let's just remember that n is a tunable parameter, and let's set it to 3.
>>> n = 3
Before we actually get to working through the neural network, let's reconsider how we will provide each input word to it.
Word Encodings
What does each of these words look like when it goes into the neural network? To perform transformations on them through each layer, we have to represent the words numerically. Additionally, each word's representation must be unique. There are many vector-based approaches to representing words for a neural network, as discussed in the first post in this series. In the case of Skip-gram, each input word is represented as a one-hot encoded vector. This means that each word is represented by a vector of length len(v), where the index of the target word in v contains the value 1 and all other indices contain the value 0.
To illustrate, let's recall our vocabulary vector.
>>> print(v)
["duct", "tape", "work", "anywher", "magic", "worship"]
We would expect to represent each word in our vocabulary with the following one-hot encoded vectors:
"duct" [1,0,0,0,0,0]
"tape" [0,1,0,0,0,0]
"work" [0,0,1,0,0,0]
"anywher" [0,0,0,1,0,0]
"magic" [0,0,0,0,1,0]
"worship" [0,0,0,0,0,1]
This selection of one-hot encoded vectors, as the representation of words to the Skip-gram net, is actually going to become quite important later. But let's get to some forward propagation now.
Bird's-Eye View of Skip-Gram as a Neural Net
The Skip-gram neural network is a shallow neural network consisting of an input layer, a single linear projection layer, and an output layer. The input layer is a one-hot encoded vector representing the input word, and the output layer is a probability distribution over which words are likely to be seen in the input word's context. The objective function of the net attempts to maximize the probability the output layer assigns to the context words actually observed for a given word in the source corpus.
That is quite dense, and several of these parts are important, but I want to point out a major divergence from the classic neural network model we just described. Classic neural nets have hidden layers transformed with a non-linear activation function, as described in the neural net refresher above - Heaviside step functions, the logistic function, the hyperbolic tangent. One of the major computational tradeoffs Mikolov et al. made to make Skip-gram and CBOW feasible, where earlier neural nets were prohibitively limited by training time, was to completely remove the activation step from the hidden layer. In the paper Efficient Estimation of Word Representations in Vector Space, they state:
"The main observation from [prior work] was that most of the complexity is caused by the nonlinear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently."
Therefore, what a classical neural net would term a 'hidden layer' is generally regarded instead as a projection layer for the purposes of Skip-gram and CBOW. Mikolov et al. distinguish these two models from classic, non-linearly activated neural nets by calling them log-linear models.
To train the net, we scan through our source corpus, submitting each word we encounter once for each of the c*2 output contexts. The input word is projected through a weight layer and then transformed through another weight layer into an output context vector. Each output context vector is of length len(v) and contains at each index a score for that index's word, estimating that word's likelihood of appearing in that context location.
Below is a reproduction of the architecture diagram from Google's introductory Skip-gram paper, Efficient Estimation of Word Representations in Vector Space.
Let's break that down again using the terms in the diagram. The Skip-gram neural net iterates through a series of words one at a time, as input w(t). Each input word w(t) is fed forward through the network c*2
times, once for each output context vector. Each time w(t) is fed through the network, it is linearly transformed through two weight matrices to an output layer that contains nodes representing a context location: where c=2
, those context locations are one of w(t-2) through w(t+2). The output nodes, each the size of the vocabulary, contain scores at each index estimating the likelihood that the corresponding vocabulary word would appear in that context position. For example, with the input word "tape", the score at index 2 of the w(t-2) vector (w(t-2)[2]) would be a score for the likelihood that the word "work" - the word at index 2 of our vocabulary - would appear two words behind the input word "tape." Those raw scores are later turned into actual probabilities using the softmax equation.
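To keep the indexing straight, "work" sits at index 2 of our ordered vocabulary, which is why w(t-2)[2] is the slot that scores it:
>>> print(v.index("work"))
2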
For each given training instance, the net will calculate its error between the probability generated for each word in each context location and the observed reality of the words in the context of the training instance. For example, the net may calculate that "work" has a 70% chance of showing up two words before the word "tape", but we can determine from the source corpus that the probability is really 0%. Through the process of backpropagation, the net will modify the weight matrices to change how it projects the input layer through to the output layer in order to minimize its error: for example, to minimize the error between the calculated 70% and the observed 0%. Then the next word in the corpus will be sent as an input c*2
times, then the next, and so on.
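As a concrete sketch of that training loop (our own illustration, not the actual word2vec code), here are the context positions and observed context words generated for the first occurrence of "tape" in our preprocessed corpus with c=2; positions that fall outside the corpus are simply skipped:
>>> corpus = "duct tape work anywher duct tape magic worship".split()
>>> t = 1                                      # the first occurrence of "tape"
>>> for offset in (-2, -1, 1, 2):              # the c*2 context positions around w(t)
...     if 0 <= t + offset < len(corpus):      # skip positions that fall off the edge of the corpus
...         print(corpus[t], "w(t%+d)" % offset, corpus[t + offset])
...
tape w(t-1) duct
tape w(t+1) work
tape w(t+2) anywher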
Once the net is done training, the first weight layer will contain the vector representations of each word in the vocabulary, in order. You'll see why after we've gone through all of feedforward and backprop, but for now, since we've got the gist of the architecture, let's try plugging in some numbers.
A Real Life Net
Time for some code! Let's get a more detailed, color-coded diagram going:
The figure above is an extended depiction of a Skip-gram network for a training example feeding forward to output context vectors representing a context window c=2
. The input word is a one-hot encoded vector the size of the vocabulary. This is connected to a projection layer defined by our node size parameter to be of size n. The input layer is projected via linear transformation through the input weight matrix p, of len(v) x n dimensions (6 x 3 in our example). With a context size of 2, each training example will feed forward 4 times to 4 output vectors: w(t-2), w(t-1), w(t+1), and w(t+2). The projection layer is connected to the output layers by the output weight matrix p', which has n x len(v) dimensions (3 x 6).
From left to right, each subsequent layer is constructed simply by taking the dot product of the previous layer and its weight matrix.
The weight matrices p and p' can be initialized in many ways, and will eventually be tuned using backpropagation. For now, let's initialize the net simply with random numbers. Let's expand our example with those randomized matrices and calculate the transformations from the input word to the projection layer and from the projection layer to the output layer. Let's consider a forward pass through the net where our input word is the second word in our vocabulary, "tape."
First let's import our packages and set up our first two pieces, the input array for "tape" (which you'll recall is one-hot encoded), and our randomized first weight layer p:
>>> import numpy as np
>>> np.random.seed(42)
>>> input_array_tape=np.array([0,1,0,0,0,0]) #"tape"
>>> input_weight_matrix = np.random.random_sample((6,3))
>>> print(input_weight_matrix)
[[ 0.37454012 0.95071431 0.73199394]
[ 0.59865848 0.15601864 0.15599452]
[ 0.05808361 0.86617615 0.60111501]
[ 0.70807258 0.02058449 0.96990985]
[ 0.83244264 0.21233911 0.18182497]
[ 0.18340451 0.30424224 0.52475643]]
Now, let's calculate from the input layer to the projection layer. You'll see here that the effect of a linear transformation of the one-hot encoded layer through the weight layer is that we simply project the row of the weight matrix matching the index of the one-hot encoded input word through to the next layer. That is why we keep calling this process 'projection' and are shying away from the terms 'hidden' or 'activation' layer (though those terms are still sometimes used to describe this process). You can compare the output vector below to the matrix above and see that clearly - we've simply picked out the second row of p during this projection process.
>>> projection = np.dot(input_array_tape,input_weight_matrix)
>>> print(projection)
[ 0.59865848 0.15601864 0.15599452]
Next, let's randomize the output weight matrix p'.
>>> output_weight_matrix = np.random.random_sample((3,6))
>>> print(output_weight_matrix)
[[ 0.43194502 0.29122914 0.61185289 0.13949386 0.29214465 0.36636184]
[ 0.45606998 0.78517596 0.19967378 0.51423444 0.59241457 0.04645041]
[ 0.60754485 0.17052412 0.06505159 0.94888554 0.96563203 0.80839735]]
We'll use that to calculate from the projection layer to the first output context vector w(t-2).
>>> output_array_for_input_tape_and_orange_output_context = np.dot(projection, output_weight_matrix)
>>> print(output_array_for_input_tape_and_orange_output_context)
# [ 0.42451664 0.32344971 0.40759145 0.31176029 0.41795589 0.35267831]
At this point, we have the first output vector after we've performed calculations against the randomized weight layers p and p'. For the purposes of the diagram I have rounded all the values in our matrices to the tenths place, but let's take a look to re-establish where we are.
Now that we have this context vector, let's dive in again to what the context vectors really are. Each context vector is the length of the vocabulary, and since the vocabulary is an ordered list, each index in the context vector can be traced back to a certain word in the vocabulary. The value at each index in the context vector is an estimate of the likelihood that the vocabulary word that index traces back to appears in that context position.
Let's unpack that some more with the following code snippets. The variable output_array_for_input_tape_and_orange_output_context
is the output vector for that orange colored context w(t-2) we just calculated given the input word "tape". The value at each index of that vector represents each vocabulary word's estimated likelihood to be 2 words behind "tape", per the vocabulary order of our original vocabulary vector v
.
>>> print(output_array_for_input_tape_and_orange_output_context)
# [ 0.42451664 0.32344971 0.40759145 0.31176029 0.41795589 0.35267831]
We know the vocabulary is made up of this ordered list.
>>> print(v)
["duct", "tape", "work", "anywher", "magic", "worship"]
If we zip those together, we can annotate each index of the context vector (which estimates the likelihood of being 2 words behind "tape") with the vocabulary word we're estimating for:
>>> print(list(zip(v, output_array_for_input_tape_and_orange_output_context)))
#[('duct', 0.42451663675598933),
#('tape', 0.32344971050993732),
#('work', 0.40759145057525981),
#('anywher', 0.3117602853605092),
#('magic', 0.41795589389125587),
#('worship', 0.35267831257488347)]
Since this whole net at this point was trained with the input word "tape," we can extrapolate from this output vector for w(t-2) that the word "duct" is estimated to be two words behind our input word "tape" with a likelihood score of 0.42451663675598933, that the word "tape" is estimated to be two words behind our input word "tape" with a likelihood score of 0.32344971050993732, and so on.
Intuitively, we can see that we have very similar scores right now for each word, even though we know from our corpus that each word is not equally likely to appear at position w(t-2), aka two spots before our target word. We will use backpropagation to calculate our error here against the known likelihood of these words in context and adjust the weight matrices to reduce that error. However, there is one more processing step to apply to these output vectors before the net is ready to begin backpropagation: the softmax equation. The softmax will normalize the values in each context vector to represent probabilities, so we can directly compare them during backpropagation to the known frequency distributions in the source corpus. Softmax will be the topic of the second half of this feedforward post series, so stay tuned!
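As a quick preview of that softmax step (the next post covers it in detail), here is a minimal sketch of the standard exponentiate-and-normalize form applied to the raw scores we just produced:
>>> scores = output_array_for_input_tape_and_orange_output_context
>>> probabilities = np.exp(scores) / np.sum(np.exp(scores))   # softmax: exponentiate, then normalize
>>> print(round(float(np.sum(probabilities)), 10))            # the normalized scores now sum to 1
1.0
>>> print(v[int(np.argmax(probabilities))])                   # "duct" currently gets the highest (random) score
duct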