Understanding Embedding Layer in Keras
In deep learning, the embedding layer can seem like an enigma until you get the hang of it. Since it is an essential part of many neural networks, it is important to understand how it works. In this article, I will try to explain what an embedding layer is, why it is needed and how it works, along with some coding examples. So let’s get started.
What is an Embedding Layer
The embedding layer is one of the layers available in Keras. It is mainly used in Natural Language Processing applications such as language modeling, but it can also be used for other tasks that involve neural networks. When dealing with NLP problems, we can use pre-trained word embeddings such as GloVe. Alternatively, we can train our own embeddings using the Keras embedding layer.
Need for Embeddings
Word embeddings can be thought of as an alternative to one-hot encoding, combined with dimensionality reduction.
As we know, when dealing with textual data, we need to convert it into numbers before feeding it into any machine learning model, including neural networks. For simplicity, words can be compared to categorical variables. We use one-hot encoding to convert categorical features into numbers. To do so, we create a dummy feature for each category and populate it with 0’s and 1’s.
Similarly, if we use one-hot encoding on the words in textual data, we get a dummy feature for each word, which means 10,000 features for a vocabulary of 10,000 words. This is not a feasible encoding approach, as it demands a lot of storage space for the word vectors and reduces model efficiency.
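To get a feel for the difference in size, here is a minimal NumPy sketch comparing a one-hot vector for a 10,000-word vocabulary with a dense vector. The dense length of 8 and the word index 42 are my own assumptions, picked only for illustration, and the dense values are random rather than learned.

```python
import numpy as np

vocab_size = 10_000   # vocabulary size used in the text
embedding_dim = 8     # assumed dense vector length, for illustration only

# One-hot: one slot per vocabulary word, with a single 1 at the word's index.
word_index = 42       # arbitrary example index
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_index] = 1.0

# Dense: a short vector of real values (random here, learned in practice).
rng = np.random.default_rng(0)
dense = rng.normal(size=embedding_dim).astype(np.float32)

print(one_hot.shape, int(one_hot.sum()))  # (10000,) 1
print(dense.shape)                        # (8,)
print(one_hot.nbytes, dense.nbytes)       # 40000 32
```

Almost all of the one-hot vector is zeros, which is exactly the waste that a dense representation avoids.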
The embedding layer lets us convert each word into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0’s and 1’s. The fixed length of the word vectors helps us represent words in a better way, with reduced dimensions.
In this way, the embedding layer works like a lookup table. The words are the keys in this table, while the dense word vectors are the values. To understand it better, let’s look at the implementation of the Keras Embedding layer.
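Before turning to Keras itself, the lookup-table idea can be sketched in plain NumPy. This is only an illustration with a random table (not Keras code): picking rows by index gives the same result as multiplying one-hot vectors by the table, which is why an embedding can be seen as a cheaper replacement for one-hot encoding followed by a linear layer.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)

# The "lookup table": one row of real values per word index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

word_ids = np.array([1, 2])

# Lookup view: simply pick the rows for the given word indices.
looked_up = table[word_ids]                 # shape (2, 4)

# One-hot view: build one-hot rows and multiply by the table.
one_hot = np.eye(vocab_size)[word_ids]      # shape (2, 10)
via_matmul = one_hot @ table                # shape (2, 4)

print(looked_up.shape)                      # (2, 4)
print(np.allclose(looked_up, via_matmul))   # True
```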
Implementation in Keras
Let’s start by importing the required libraries.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
import numpy as np
We can create a simple Keras model by just adding an embedding layer.
model = Sequential()
embedding_layer = Embedding(input_dim=10,output_dim=4,input_length=2)
model.add(embedding_layer)
model.compile('adam','mse')
We pass three parameters to the embedding layer:
- input_dim : Size of the vocabulary
- output_dim : Length of the vector for each word
- input_length : Maximum length of a sequence
In the above example, we set the vocabulary size to 10, as we will be encoding the numbers 0 to 9. We want the length of the word vector to be 4, hence output_dim is set to 4. The length of the input sequence to the embedding layer will be 2. Note that newer Keras releases (Keras 3) deprecate input_length; if you see a warning about it, the argument can simply be omitted.
Now, let’s pass a sample input to our model and see the results.
input_data = np.array([[1,2]])
pred = model.predict(input_data)
print(input_data.shape)
print(pred)
The output of the above code will be the following.
(1, 2)
[[[ 0.04502351 0.00151128 0.01764284 -0.0089057 ]
[-0.04007018 0.02874336 0.02772436 0.00842067]]]
As you can see, each word (1 and 2) is represented by a vector of length 4. If we print the weights of the embedding layer, we get the result below.
[array([[-0.04333381, -0.02326865, -0.00812379, 0.02167496],
[ 0.04502351, 0.00151128, 0.01764284, -0.0089057 ],
[-0.04007018, 0.02874336, 0.02772436, 0.00842067],
[ 0.00512743, 0.03695237, -0.02774147, -0.03748262],
[ 0.02066498, -0.01512628, -0.03989452, 0.00809463],
[-0.02207369, 0.02889762, -0.01229819, -0.03157005],
[ 0.02565557, 0.02931032, -0.01611946, -0.00105535],
[ 0.03920721, 0.04009463, -0.04943105, 0.04145898],
[ 0.04208959, -0.00412361, -0.04585704, 0.03489918],
[-0.04016889, 0.03448426, 0.00623332, 0.02844917]],
dtype=float32)]
These weights are basically the vector representations of the words in the vocabulary. As we discussed earlier, this is a lookup table of size 10 x 4, for words 0 to 9. Notice that the rows for words 1 and 2 are exactly the two vectors returned by the model above. The first word (0) is represented by the first row in this table, which is
[-0.04333381, -0.02326865, -0.00812379, 0.02167496]
Note: In this example we have not trained the embedding layer. The weights assigned to the word vectors are initialized randomly.
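The lookup behaviour can also be checked directly. The following is a minimal sketch, assuming TensorFlow 2.x is installed; it calls the layer on its own instead of going through a Sequential model, and leaves out input_length so that it should also run on newer Keras versions. The weights are random, so the actual numbers will differ from the ones printed above.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Same sizes as the example above: 10 words, vectors of length 4.
layer = Embedding(input_dim=10, output_dim=4)
word_ids = np.array([[1, 2]])

# Calling the layer builds it and returns one vector per word index.
output = np.asarray(layer(word_ids))
weights = layer.get_weights()[0]

print(output.shape)    # (1, 2, 4)
print(weights.shape)   # (10, 4)

# The output is nothing more than rows 1 and 2 of the weight matrix.
print(np.allclose(output[0], weights[word_ids[0]]))  # True
```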
This was a nice example to start with. But when working with actual text data, we need to train the embedding layer to get meaningful word embeddings. Let’s see how to do it using restaurant review data.
Restaurant Review Classification
We will perform the following steps while solving this problem.
- Tokenize the sentences into words.
- Encode each word as an integer index using the Keras one_hot function.
- Use padding to ensure all sequences are of the same length.
- Pass the padded sequences as input to the embedding layer.
- Flatten and apply a Dense layer to predict the label.
We start by importing the required libraries.
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten,Embedding,Dense
To keep it simple, we will use a total of 10 reviews. Half of them are positive, represented by 0, and the other half are negative, represented by 1.
# Define 10 restaurant reviews
reviews = [
    'Never coming back!',
    'horrible service',
    'rude waitress',
    'cold food',
    'horrible food!',
    'awesome',
    'awesome services!',
    'rocks',
    'poor work',
    'couldn\'t have done better'
]
# Define labels (1 = negative, 0 = positive)
labels = array([1,1,1,1,1,0,0,0,0,0])
We will take the vocabulary size as 50 and encode the words using the one_hot function from Keras.
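As a rough picture of what this kind of encoding does, here is a small pure-Python sketch of the hashing idea. The helper hash_encode is hypothetical and is not the Keras implementation: it uses a simplified tokenizer and md5 so that the result is reproducible, whereas, as far as I know, Keras one_hot relies on Python's built-in hash by default, which can give different indices from run to run.

```python
import hashlib
import re

def hash_encode(text, n):
    """Hypothetical helper: map each word to an integer in [1, n - 1] by hashing."""
    # Simplified tokenization: lowercase, keep letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]

sample = ['horrible service', 'horrible food!', 'cold food']
encoded = [hash_encode(s, 50) for s in sample]
print(encoded)

# The same word always gets the same index...
print(encoded[0][0] == encoded[1][0])   # True ('horrible')
print(encoded[1][1] == encoded[2][1])   # True ('food')
# ...but nothing stops two different words from sharing one (a hash collision).
```

Index 0 is never produced, which leaves it free to be used for padding later. The Keras call itself is a one-liner: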
Vocab_size = 50
encoded_reviews = [one_hot(d,Vocab_size) for d in reviews]
print(f'encoded reviews: {encoded_reviews}')
We will get the following encoded reviews.
encoded reviews: [[18, 39, 17], [27, 27], [5, 19], [41, 29], [27, 29], [2], [2, 1], [49], [26, 9], [6, 9, 11, 21]]
Here you can see that the length of each encoded review is equal to the number of words in that review. Despite its name, Keras one_hot does not build a one-hot vector; it hashes each word to an integer index between 1 and Vocab_size - 1. Because it is a hash, two different words can end up with the same index, as happens with 'horrible' and 'service' (both 27) in the output above. Now we need to apply padding so that all the encoded reviews are of the same length. Let’s define 4 as the maximum length and pad the encoded vectors with 0’s at the end.
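To make the padding step concrete, here is a small pure-Python sketch of post-padding. The helper pad_post is hypothetical and only illustrates the idea; in particular it cuts over-long sequences at the end, while Keras pad_sequences by default truncates from the start. None of our reviews is longer than 4 words, so that difference does not matter here.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: right-pad (or cut) every sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # keep at most maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the rest with zeros
    return padded

# The encoded reviews printed above.
encoded_reviews = [[18, 39, 17], [27, 27], [5, 19], [41, 29], [27, 29],
                   [2], [2, 1], [49], [26, 9], [6, 9, 11, 21]]

padded = pad_post(encoded_reviews, maxlen=4)
print(padded[0])    # [18, 39, 17, 0]
print(padded[5])    # [2, 0, 0, 0]
print(padded[-1])   # [6, 9, 11, 21]
```

The Keras pad_sequences call below does the same thing in one line and returns a NumPy array.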
max_length = 4
padded_reviews = pad_sequences(encoded_reviews,maxlen=max_length,padding='post')
print(padded_reviews)
The padded and encoded reviews will look like this.
[[18 39 17  0]
 [27 27  0  0]
 [ 5 19  0  0]
 [41 29  0  0]
 [27 29  0  0]
 [ 2  0  0  0]
 [ 2  1  0  0]
 [49  0  0  0]
 [26  9  0  0]
 [ 6  9 11 21]]
After creating the padded integer representation of the reviews, we are ready to pass it as input to the embedding layer. In the following code snippet, we create a simple Keras model. We will fix the length of the embedded vector for each word at 8, and the input length will be the maximum length, which we have already defined as 4.
model = Sequential()
embedding_layer = Embedding(input_dim=Vocab_size,output_dim=8,input_length=max_length)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
print(model.summary())
The model summary will be as follows.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400
_________________________________________________________________
flatten (Flatten)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
Next, we will train the model for 100 epochs.
model.fit(padded_reviews,labels,epochs=100,verbose=0)
Once training is complete, the embedding layer has learned its weights, which are nothing but the vector representations of each word. Let’s check the shape of the weight matrix.
print(embedding_layer.get_weights()[0].shape)
This embedding matrix is essentially a lookup table of 50 rows and 8 columns, as is evident from the output.
(50, 8)
If we check the embedding for the first word, we get the following vector.
[ 0.056933 0.0951985 0.07193055 0.13863552 -0.13165753 0.07380469 0.10305451 -0.10652688]
So this is how we train an embedding layer on our text corpus and get the embedded vectors for each word. These vectors are then used to represent the words in a sentence.
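To make the shapes and parameter counts from the model summary concrete, here is a NumPy-only sketch of what the model computes in a forward pass: lookup, flatten, then a single sigmoid unit. The weights are random stand-ins of my own, not the trained values, so the predicted probabilities are meaningless; only the shapes and the parameter arithmetic are the point.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 50, 8, 4
rng = np.random.default_rng(0)

# Random stand-ins for the learned weights (NOT the trained values).
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))
dense_kernel = rng.normal(scale=0.1, size=(max_length * embedding_dim, 1))
dense_bias = np.zeros(1)

# Two of the padded reviews from above.
padded = np.array([[18, 39, 17, 0],
                   [27, 27, 0, 0]])

embedded = embedding_matrix[padded]          # Embedding: (2, 4, 8)
flat = embedded.reshape(len(padded), -1)     # Flatten:   (2, 32)
logits = flat @ dense_kernel + dense_bias    # Dense:     (2, 1)
probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid activation

# 50 * 8 embedding weights + 32 dense weights + 1 bias
n_params = embedding_matrix.size + dense_kernel.size + dense_bias.size

print(embedded.shape, flat.shape, probs.shape)  # (2, 4, 8) (2, 32) (2, 1)
print(n_params)                                 # 433
```

The 400 parameters of the embedding layer are simply the 50 x 8 lookup table, and the remaining 33 belong to the Dense layer (32 weights plus one bias).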
Conclusion
Embeddings are a great way to deal with NLP problems for two reasons. First, they help with dimensionality reduction compared to one-hot encoding, as we can control the number of features. Second, trained embeddings can capture the context in which words are used, so that similar words end up with similar embeddings.
Please let me know in comments if you find this article useful. I am a data science enthusiast and blogger. You can reach out to me on my LinkedIn profile.
Thanks for reading.
References
- What are Embedding Layers in Keras (11.3) by Jeff Heaton : https://www.youtube.com/watch?v=OuNH5kT-aD0.






