How to Build Your First Chatbot

One day our chatbots will be as good as our 1980s imagination!

In this article, we will be using conversations from Cornell University’s Movie Dialogue Corpus to build a simple chatbot. The code will be written in Python, and we will use TensorFlow to build the bulk of our model.

This article focuses on how to build the sequence-to-sequence model that I made; if you would like to see the full project, take a look at its GitHub page. It’s a bit of work to prepare this dataset for the model, so if you are unsure of how to do this, or would like some suggestions, I recommend looking at the code there.

Before we begin, I just want to make one more note. One of the really neat things about sequence-to-sequence models is the diversity of their applications. Although we will be using one to build a chatbot, these models can also be applied to language translation, text summarization, text generation, etc.

Let’s begin!

import numpy as np
import tensorflow as tf  # TensorFlow 1.x; tf.placeholder and tf.contrib were removed in 2.x

def model_inputs():
    
    input_data = tf.placeholder(tf.int32, 
                                [None, None], 
                                name='input')
    
    targets = tf.placeholder(tf.int32, 
                             [None, None], 
                             name='targets')
    
    lr = tf.placeholder(tf.float32, name='learning_rate')
    
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    return input_data, targets, lr, keep_prob

Side note: I’ve stripped the comments from my code to keep things short. The files on my GitHub have comments, which should help to explain things a bit more.

First step: create placeholders for our model’s inputs. You might have noticed that the learning rate and keep_prob placeholders do not have a shape argument. This is because the default shape is None, which is what we want, so we can leave it out to keep our code concise.

def process_encoding_input(target_data, vocab_to_int, batch_size):
    
    ending = tf.strided_slice(target_data, 
                              [0, 0], 
                              [batch_size, -1], 
                              [1, 1])
    
    dec_input = tf.concat([tf.fill([batch_size, 1], 
                                   vocab_to_int['<GO>']), 
                           ending], 1)

    return dec_input

tf.strided_slice() removes the final word from each sequence in the batch, and tf.concat() then prepends the <GO> token to the start of each sequence. This formatting is necessary for creating the embeddings for our decoding layer.
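To make the transformation concrete, here is a minimal plain-Python sketch of the same idea on a tiny batch. GO_ID and the sample ids are made-up values for illustration, not ids from the real vocabulary, and the helper name is hypothetical.

```python
# Plain-Python sketch of what process_encoding_input() does to a batch of
# target sequences. GO_ID is a hypothetical stand-in for vocab_to_int['<GO>'].
GO_ID = 1

def prepend_go_and_drop_last(target_batch, go_id=GO_ID):
    """Drop the final token of every sequence and put <GO> in front."""
    return [[go_id] + list(seq[:-1]) for seq in target_batch]

targets = [[10, 11, 12, 2],
           [20, 21, 2, 0]]
decoder_inputs = prepend_go_and_drop_last(targets)
print(decoder_inputs)  # [[1, 10, 11, 12], [1, 20, 21, 2]]
```

Each sequence keeps its original length, so the decoder input lines up step by step with the targets it is trained to predict.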

def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, 
                   sequence_length):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    
    drop = tf.contrib.rnn.DropoutWrapper(
               lstm, 
               input_keep_prob = keep_prob)
    
    enc_cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
    
    _, enc_state = tf.nn.bidirectional_dynamic_rnn(
               cell_fw = enc_cell,
               cell_bw = enc_cell,
               sequence_length = sequence_length,
               inputs = rnn_inputs, 
               dtype=tf.float32)

    return enc_state

This will encode our input data.

From what I have read, LSTM cells typically outperform GRU cells for seq2seq tasks such as this one.
Making the encoder bidirectional proved to be much more effective than a forward-only (unidirectional) encoder.
We return only the encoder’s state because it is the input for our decoding layer. Simply put, it is the final state of the encoding cells, which summarizes the input sentence, that interests us.
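As a rough illustration of what "bidirectional" means here, the sketch below runs a toy one-unit recurrent cell over a sequence in both directions and returns the two final states. It is not an LSTM, the weights w_x and w_h are arbitrary values chosen for the example, and both directions share them only to keep the sketch short.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Run a toy one-unit RNN over the inputs and return its final state."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_x * x + w_h * state)
    return state

def bidirectional_final_states(inputs):
    """Final state of a left-to-right pass and of a right-to-left pass."""
    forward_state = run_rnn(inputs)
    backward_state = run_rnn(list(reversed(inputs)))
    return forward_state, backward_state

fw, bw = bidirectional_final_states([1.0, 0.0, -1.0])
print(fw, bw)  # two different summaries of the same sequence
```

The forward state has seen the last element most recently and the backward state has seen the first element most recently, so together they give the decoder two different summaries of the same input.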

def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, 
                         sequence_length, decoding_scope,
                         output_fn, keep_prob, batch_size):
    
    attention_states = tf.zeros([batch_size, 
                                1, 
                                dec_cell.output_size])
    
    att_keys, att_vals, att_score_fn, att_construct_fn = \
        tf.contrib.seq2seq.prepare_attention(
             attention_states,
             attention_option="bahdanau",
             num_units=dec_cell.output_size)
    
    train_decoder_fn = \
       tf.contrib.seq2seq.attention_decoder_fn_train(
             encoder_state[0],
             att_keys,
             att_vals,
             att_score_fn,
             att_construct_fn,
             name = "attn_dec_train")
    
    train_pred, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(
             dec_cell, 
             train_decoder_fn, 
             dec_embed_input, 
             sequence_length, 
             scope=decoding_scope)
    
    train_pred_drop = tf.nn.dropout(train_pred, keep_prob)
    
    return output_fn(train_pred_drop)

Using attention in our decoding layers reduces the loss of our model by about 20% and increases the training time by about 20%. I’d say that it’s a fair trade-off. Some notes to make:

The model performs best when the attention states are set with zeros.
The two attention options are bahdanau and luong. Bahdanau is less computationally expensive, and I achieved better results with it.
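For readers who have not seen it before, here is a small plain-Python sketch of Bahdanau-style (additive) attention: each encoder output is scored against the decoder state with v · tanh(W_dec·s + W_enc·h), the scores go through a softmax, and the context vector is the weighted sum of the encoder outputs. The matrices, vectors and function names are made up for illustration; this is not what tf.contrib.seq2seq does internally line for line.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bahdanau_attention(decoder_state, encoder_outputs, w_dec, w_enc, v):
    """Return (attention weights, context vector) for additive attention."""
    projected_state = matvec(w_dec, decoder_state)
    scores = []
    for enc_out in encoder_outputs:
        projected_enc = matvec(w_enc, enc_out)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_state, projected_enc)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(size)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection matrices
weights, context = bahdanau_attention(
    decoder_state=[1.0, 0.0],
    encoder_outputs=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    w_dec=identity, w_enc=identity, v=[1.0, 1.0])
print(weights, context)
```

The weights always sum to 1, so the context vector is a weighted average of the encoder outputs, recomputed at every decoding step.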

def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings,  
                         start_of_sequence_id, end_of_sequence_id,
                         maximum_length, vocab_size, decoding_scope,        
                         output_fn, keep_prob, batch_size):
    
    attention_states = tf.zeros([batch_size, 
                                1, 
                                dec_cell.output_size])
    
    att_keys, att_vals, att_score_fn, att_construct_fn = \
        tf.contrib.seq2seq.prepare_attention(
            attention_states,
            attention_option="bahdanau",
            num_units=dec_cell.output_size)
    
    infer_decoder_fn = \
        tf.contrib.seq2seq.attention_decoder_fn_inference(
            output_fn, 
            encoder_state[0], 
            att_keys, 
            att_vals, 
            att_score_fn, 
            att_construct_fn, 
            dec_embeddings,
            start_of_sequence_id, 
            end_of_sequence_id, 
            maximum_length, 
            vocab_size, 
            name = "attn_dec_inf")
    
    infer_logits, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(
        dec_cell, 
        infer_decoder_fn, 
        scope=decoding_scope)

    return infer_logits

decoding_layer_infer() is very similar to decoding_layer_train(). The main difference is the extra parameters added to attention_decoder_fn_inference() compared to attention_decoder_fn_train(). These extra parameters are necessary to help the model create accurate responses for your input sentences.

There is also no dropout in this function. This is because we are using it to create our responses during testing (aka making predictions), and we want to be using our full network for that.
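The difference between training and prediction can be sketched in plain Python. The function below imitates the behaviour of tf.nn.dropout (keep each value with probability keep_prob and scale the survivors by 1 / keep_prob); the function name and sample activations are made up for the example.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale survivors by
    1 / keep_prob so the expected sum stays the same."""
    if keep_prob >= 1.0:
        return list(values)  # nothing is dropped: the full network is used
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, keep_prob=0.75, rng=rng)  # training
infer_out = dropout(activations, keep_prob=1.0, rng=rng)   # predicting
print(train_out)
print(infer_out)
```

With keep_prob set to 1.0 the activations pass through untouched, which is why keep_prob is fed as 1.0 when making predictions.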

def decoding_layer(dec_embed_input, dec_embeddings, encoder_state, 
                   vocab_size, sequence_length, rnn_size,
                   num_layers, vocab_to_int, keep_prob, batch_size):
    
    with tf.variable_scope("decoding") as decoding_scope:
    
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        drop = tf.contrib.rnn.DropoutWrapper(
                   lstm, 
                   input_keep_prob = keep_prob)
        dec_cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)

        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        output_fn = lambda x: tf.contrib.layers.fully_connected(
                       x, 
                       vocab_size, 
                       None, 
                       scope=decoding_scope,
                       weights_initializer = weights,
                       biases_initializer = biases)

        train_logits = decoding_layer_train(encoder_state, 
                                            dec_cell, 
                                            dec_embed_input, 
                                            sequence_length, 
                                            decoding_scope, 
                                            output_fn, 
                                            keep_prob, 
                                            batch_size)
        decoding_scope.reuse_variables()
    
        infer_logits = decoding_layer_infer(encoder_state, 
                                            dec_cell, 
                                            dec_embeddings, 
                                            vocab_to_int['<GO>'],
                                            vocab_to_int['<EOS>'], 
                                            sequence_length - 1, 
                                            vocab_size,
                                            decoding_scope, 
                                            output_fn, 
                                            keep_prob, 
                                            batch_size)

    return train_logits, infer_logits

Here we are using the previous two functions, a decoding cell, and a fully connected layer to create our training and inference logits. We are using tf.variable_scope() to reuse the variables from training for making predictions.

I strongly encourage you to initialize your weights and biases. Initializing your weights with a truncated normal distribution and a small standard deviation can really help to improve the performance of your model.
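If "truncated normal" is unfamiliar: values are drawn from a normal distribution, and any draw more than two standard deviations from the mean is thrown away and redrawn. A minimal plain-Python sketch of that idea follows; the function name and seed are arbitrary choices for the example.

```python
import random

def truncated_normal(count, stddev, rng, mean=0.0):
    """Draw `count` normal samples, re-drawing any sample that falls more
    than two standard deviations from the mean."""
    values = []
    while len(values) < count:
        sample = rng.gauss(mean, stddev)
        if abs(sample - mean) <= 2 * stddev:
            values.append(sample)
    return values

rng = random.Random(42)
initial_weights = truncated_normal(1000, stddev=0.1, rng=rng)
print(min(initial_weights), max(initial_weights))
```

With stddev=0.1, every initial weight lands in [-0.2, 0.2], so no unit starts out with an unusually large weight.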

def seq2seq_model(input_data, target_data, keep_prob, batch_size, 
                  sequence_length, answers_vocab_size, 
                  questions_vocab_size, enc_embedding_size, 
                  dec_embedding_size, rnn_size, num_layers, 
                  questions_vocab_to_int):
    
    enc_embed_input = tf.contrib.layers.embed_sequence(
        input_data, 
        answers_vocab_size+1, 
        enc_embedding_size,
        initializer = tf.random_uniform_initializer(-1,1))
    
    enc_state = encoding_layer(enc_embed_input, 
                               rnn_size,
                               num_layers, 
                               keep_prob, 
                               sequence_length)

    dec_input = process_encoding_input(target_data,      
                                       questions_vocab_to_int, 
                                       batch_size)
    dec_embeddings = tf.Variable(  
        tf.random_uniform([questions_vocab_size+1,  
                           dec_embedding_size], 
                          -1, 1))
    
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, 
                                             dec_input)
    
    train_logits, infer_logits = decoding_layer(
        dec_embed_input, 
        dec_embeddings, 
        enc_state, 
        questions_vocab_size, 
        sequence_length, 
        rnn_size, 
        num_layers, 
        questions_vocab_to_int, 
        keep_prob, 
        batch_size)

    return train_logits, infer_logits

This is where we tie everything together and generate the outputs for our model.

Similar to initializing weights and biases, I find it best to initialize my embeddings as well. For embeddings, a random uniform distribution is more appropriate than a truncated normal distribution. If you want, you can read more about embeddings from TensorFlow’s tutorial.
Since we do not have to process our encoding’s inputs, we can use tf.contrib.layers.embed_sequence() to simplify the code a little.
If you want to shorten your code a little, you could return decoding_layer() rather than creating train_logits & infer_logits and returning them. I wrote it this way to be more explicit.
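To show what the embedding step amounts to, here is a plain-Python sketch: a table with one randomly initialized row per word id, and a lookup that swaps each id for its row. The vocabulary size, embedding size and helper names are toy values invented for the example.

```python
import random

def make_embeddings(vocab_size, embedding_size, rng):
    """One row of uniform random values in [-1, 1] per word id."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
            for _ in range(vocab_size)]

def embedding_lookup(embeddings, ids):
    """Replace every word id in a batch with its embedding row."""
    return [[embeddings[i] for i in seq] for seq in ids]

rng = random.Random(0)
embeddings = make_embeddings(vocab_size=6, embedding_size=4, rng=rng)
embedded = embedding_lookup(embeddings, [[1, 3, 5], [2, 2, 0]])
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 4
```

A batch shaped [batch_size, sequence_length] becomes [batch_size, sequence_length, embedding_size], and the same id always maps to the same row.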

epochs = 100
batch_size = 128
rnn_size = 512
num_layers = 2
encoding_embedding_size = 512
decoding_embedding_size = 512
learning_rate = 0.005
learning_rate_decay = 0.9
min_learning_rate = 0.0001
keep_probability = 0.75

Here are the parameters that I used. A larger network could produce better results, but given the number of iterations that I performed, I didn’t want to rack up my bill on FloydHub.

Side note: If you do not have a GPU at home, I highly recommend that you use FloydHub’s services. They offer a very simple and inexpensive (cheaper than Amazon) way to use a GPU.

Learning rate decay is always worth considering. As your model closes in on the optimal weights, it needs to update them in smaller increments, so a shrinking learning rate is beneficial.
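As a rough sketch of how the three learning-rate settings above could interact, the snippet below multiplies the rate by learning_rate_decay and never lets it drop below min_learning_rate. It assumes the decay is applied once per epoch, which is an assumption made for this example; the training loop in the full project may apply it at a different point.

```python
learning_rate = 0.005
learning_rate_decay = 0.9
min_learning_rate = 0.0001
epochs = 100

def decay(rate, factor=learning_rate_decay, floor=min_learning_rate):
    """Shrink the rate by `factor`, but never below `floor`."""
    return max(rate * factor, floor)

# Assumed schedule: one decay step at the end of every epoch.
schedule = [learning_rate]
for _ in range(epochs - 1):
    schedule.append(decay(schedule[-1]))

print(schedule[:3])
print(schedule[-1])
```

The floor matters: without it, a long run would shrink the rate until the updates become too small to have any effect.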

To help you build and improve your model, I highly recommend that you read this research paper. It will provide you with some great insights about how to set your hyperparameters’ values and what the size of your network should be.

tf.reset_default_graph()
sess = tf.InteractiveSession()
      
input_data, targets, lr, keep_prob = model_inputs()
sequence_length = tf.placeholder_with_default(
        max_line_length, 
        None, 
        name='sequence_length')
input_shape = tf.shape(input_data)

train_logits, inference_logits = seq2seq_model(
    tf.reverse(input_data, [-1]), 
    targets, 
    keep_prob, 
    batch_size, 
    sequence_length, 
    len(answers_vocab_to_int), 
    len(questions_vocab_to_int), 
    encoding_embedding_size, 
    decoding_embedding_size, 
    rnn_size, 
    num_layers, 
    questions_vocab_to_int)

with tf.name_scope("optimization"):
    cost = tf.contrib.seq2seq.sequence_loss(
        train_logits,
        targets,
        tf.ones([input_shape[0], sequence_length]))

    # Use the lr placeholder so that the decayed rate fed at each training step takes effect
    optimizer = tf.train.AdamOptimizer(lr)

    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for 
        grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

This sets up the structure of our graph.

I chose to use an interactive session to provide a little more flexibility when building this model, but you can use whatever session type you wish.
Sequence length will be the max line length for each batch. I sorted my inputs by length to reduce the amount of padding when creating the batches. This helped to speed up training.
If you are unfamiliar with seq2seq models, note that the input is often reversed. This helps a model to produce better outputs because the first words of the input are then fed in last, which places them closer to the first words of the output sequence.
Although I have clipped my gradients at ±5, I didn’t notice much of a difference with ±1.
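The last three points can be sketched in plain Python: sorting by length before padding each batch, reversing the inputs, and clipping gradients by value. PAD_ID, the helper names and the sample data are invented for this example; in the real model the questions and answers would need to be sorted together so that each pair stays aligned.

```python
PAD_ID = 0  # hypothetical id for the <PAD> token

def sorted_padded_batches(sequences, batch_size, pad_id=PAD_ID):
    """Sort by length, then pad each batch only to its own longest line."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        batches.append([seq + [pad_id] * (longest - len(seq))
                        for seq in batch])
    return batches

def reverse_inputs(batch):
    """Reverse each sequence along the time axis, like tf.reverse(x, [-1])."""
    return [seq[::-1] for seq in batch]

def clip_by_value(values, low, high):
    """Clamp every value into [low, high], like tf.clip_by_value."""
    return [min(max(v, low), high) for v in values]

questions = [[5, 6, 7, 8, 9], [5], [6, 7], [8, 9, 10, 11]]
batches = sorted_padded_batches(questions, batch_size=2)
print(batches)
# [[[5, 0], [6, 7]], [[8, 9, 10, 11, 0], [5, 6, 7, 8, 9]]]
print(reverse_inputs(batches[0]))  # [[0, 5], [7, 6]]
print(clip_by_value([-12.0, -0.3, 4.9, 7.5], -5.0, 5.0))
# [-5.0, -0.3, 4.9, 5.0]
```

Because similar lengths end up in the same batch, only two padding tokens are needed here; padding everything to the longest line overall would have needed eight.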

I’m going to skip over creating the batches, padding the batches, and training the model since it’s pretty standard stuff. Below you’ll see how to make predictions with this model.

#input_question = 'How are you?'

random = np.random.choice(len(short_questions))
input_question = short_questions[random]

input_question = question_to_seq(input_question, 
                                 questions_vocab_to_int)

input_question = (input_question + 
                  [questions_vocab_to_int["<PAD>"]] * 
                  (max_line_length - len(input_question)))

batch_shell = np.zeros((batch_size, max_line_length))
batch_shell[0] = input_question    
    
answer_logits = sess.run(inference_logits, {input_data: batch_shell, 
                                            keep_prob: 1.0})[0]

pad_q = questions_vocab_to_int["<PAD>"]
pad_a = answers_vocab_to_int["<PAD>"]

print('Question')
print('  Word Ids: {}'.format(
 [i for i in input_question if i != pad_q]))
print('  Input Words: {}'.format(
 [questions_int_to_vocab[i] for i in input_question if i != pad_q]))

print('\nAnswer')
print('Word Ids: {}'.format(
    [i for i in np.argmax(answer_logits, 1) if i != pad_a]))
print('Response Words: {}'.format(
    [answers_int_to_vocab[i] for i in \
     np.argmax(answer_logits, 1) if i != pad_a]))

I provided the option to either input your own question or use one from the data. I didn’t find the model to be noticeably better at answering either type of input.

For the input question to be used by the model, it needs to be formatted like the training data. This is why padding was added and batch_shell was created.
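The sketch below walks through that formatting in plain Python, plus the argmax step used to turn the output logits back into word ids. The toy vocabulary, the fake logits and the function question_to_seq_sketch() are hypothetical stand-ins; the real question_to_seq() lives in the full project and may clean the text differently.

```python
import re

# Hypothetical toy vocabulary; the real one is built from the corpus.
vocab_to_int = {'<PAD>': 0, '<UNK>': 1, 'how': 2, 'are': 3, 'you': 4}

def question_to_seq_sketch(question, vocab_to_int):
    """Lowercase, split into words, and map each word to its id."""
    words = re.findall(r"[a-z']+", question.lower())
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in words]

def pad_to_length(seq, length, pad_id):
    return seq + [pad_id] * (length - len(seq))

def argmax_ids(logits):
    """Pick the highest-scoring word id at every time step."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

max_line_length = 6
seq = question_to_seq_sketch('How are you today?', vocab_to_int)
padded = pad_to_length(seq, max_line_length, vocab_to_int['<PAD>'])
print(padded)  # [2, 3, 4, 1, 0, 0]

# Made-up logits: one row of scores over the vocabulary per time step.
fake_logits = [[0.1, 0.2, 0.9, 0.0, 0.3],
               [0.0, 0.1, 0.2, 0.8, 0.1],
               [0.7, 0.1, 0.1, 0.0, 0.1]]
answer_ids = [i for i in argmax_ids(fake_logits)
              if i != vocab_to_int['<PAD>']]
print(answer_ids)  # [2, 3]
```

Words that are missing from the vocabulary ("today" here) fall back to the <UNK> id, and padding ids are filtered out before the answer is printed.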

To produce the best results, I encourage you to train the model for a few hours with a GPU. This is another reason why I recommend FloydHub’s services, because they charge only US$0.43/hour for using a GPU, and you get 100 hours free when you sign up.

If you test out this model, expand it, or do anything else cool with it, please leave a comment about it below. I’ll be looking for ways to improve this model and how best to apply it to other seq2seq tasks, and it would be great to see what you come up with!

Just a few closing comments:

There are many ways this model can be altered and improved upon; even the data could probably be processed in a better way. One cool thing you could do is try a few different methods, compare the results with TensorBoard, then post a link in the comments! If you have never used TensorBoard before, you can check out my article about using it.
I recommend that you check out this GitHub page. It contains seq2seq projects with good results and from different data sources.
If you’re looking for a good video about seq2seq models, Siraj Raval has one. His example is a bit more basic, but he explains things well, and it could give you some good ideas.

I hope that you enjoyed reading about my model and learned a thing or two. If you have any ideas for improvements, or see something wrong/suboptimal, please let me know!

Thanks for reading!
