In this part, we will implement an RNN for language modelling using Python, NumPy, and Theano.
So, let's get started.
Step 1:
The first step is to import the dependencies: csv to read the input data file (which is in CSV format), itertools to iterate over the data, numpy for arrays, pickle to save trained models so we can skip retraining and reload them quickly later, the word and sentence tokenizers from nltk, and matplotlib for plotting the data at the end.
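A minimal sketch of these imports might look like the following; the plt alias and the commented-out nltk.download call are my assumptions, since nltk's tokenizers need the "punkt" models installed:

```python
import csv        # read the CSV input data file
import itertools  # iterate over the tokenized data
import pickle     # save/load trained models to skip retraining

import numpy as np
import nltk                       # provides word and sentence tokenizers
import matplotlib.pyplot as plt   # plot the data at the end

# The tokenizers below rely on the "punkt" models; download them once if missing.
# nltk.download("punkt")
```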
Once we have all the dependencies imported, let's first define a few things.
Firstly, we initialize our vocabulary size to 8000.
Now, not all words in the given data appear equally often, so we only need to keep the most frequent words and discard the rest. To handle the discarded ones, we define a token called "unknown_token" and initialize it to the string "Unknown_Token". Whenever a word appears very rarely, we replace it with this token.
Let's also define a Start and an End token. When we take the raw text and tokenize it into sentences, we need a way to mark where each sentence begins and ends. This is not only necessary for knowing the sentence boundaries, it also helps us phrase the sentences well when we generate them at the end, as shown in the sketch below.
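Putting these definitions together, a sketch could look like this; the unknown-token string follows the text above, while the names and values of the start and end tokens are assumptions on my part:

```python
# Keep only the 8000 most frequent words in the corpus.
vocabulary_size = 8000

# Words outside the most frequent vocabulary_size words get replaced by this token.
unknown_token = "Unknown_Token"

# Markers prepended/appended to every tokenized sentence so we know where
# sentences begin and end (names below are hypothetical, not from the source).
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"
```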