Each image comes with a caption that briefly describes what is happening in it. For the image completion task, these captions can be useful because they supply additional information that the pixels alone may fail to convey.
The captions can be used either in a supervised setting, where we provide a label for each image, or in an unsupervised setting, where each caption is attached (appended) to its image.
To do this, we have to convert the captions (words and sentences) into a representation the network can work with, so we convert them into word embeddings. After basic NLP preprocessing of the captions, such as tokenizing, stemming, and lemmatizing, we convert each word in the corpus into a word embedding, which is essentially a real-valued vector in a space of much lower dimension than the vocabulary size.
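As a rough illustration of this pipeline, here is a minimal sketch in plain Python. It is not the project's code: the function names, the example captions, the 8-dimensional vectors, and the choice to average word vectors into one caption vector are all assumptions made for illustration. Stemming and lemmatizing are left out to keep the sketch dependency-free, and the embedding table is filled with random placeholders where a trained model would supply real vectors.

```python
import random
import re

def tokenize(caption):
    """Lowercase a caption and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", caption.lower())

def build_vocab(captions):
    """Map every unique token in the corpus to an integer index."""
    vocab = {}
    for caption in captions:
        for token in tokenize(caption):
            vocab.setdefault(token, len(vocab))
    return vocab

def init_embeddings(vocab, dim=8, seed=0):
    """Placeholder table: one random vector per word (assumption).
    A trained model such as Word2Vec would supply these vectors instead."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocab}

def caption_vector(caption, embeddings, dim=8):
    """Average the vectors of the known words in a caption (one possible choice)."""
    vectors = [embeddings[t] for t in tokenize(caption) if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical captions, used only to exercise the functions above.
captions = ["A dog runs on the beach.", "A cat sits on the sofa."]
vocab = build_vocab(captions)
embeddings = init_embeddings(vocab)
vec = caption_vector(captions[0], embeddings)
```

Averaging gives a fixed-length vector regardless of caption length, which is convenient if the caption is to be appended to an image representation; other pooling schemes would work as well.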
Two popular methods for representing words as vectors are Word2Vec and GloVe.
Word2Vec models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2Vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector. In the continuous bag-of-words (CBOW) architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.
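The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below only generates those examples and does not train a network; the function names, the window size, and the sample sentence are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """CBOW examples: (context words, target word).
    The model would predict the target from its surrounding context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (center word, one context word).
    The model would predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized caption.
tokens = ["a", "dog", "runs", "on", "the", "beach"]
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

With a window of 1, the word "dog" yields one CBOW example, `(["a", "runs"], "dog")`, and two skip-gram examples, `("dog", "a")` and `("dog", "runs")`.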
GloVe is a global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. The model leverages statistical information efficiently by training only on the nonzero elements of a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
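To make "training only on the nonzero elements" concrete, the sketch below builds a word-word co-occurrence table as a dictionary, so pairs that never co-occur are simply absent. It is an illustration under stated assumptions, not the GloVe reference implementation: the function names and sample sentences are made up, and the 1/distance weighting of counts is one common choice. The `glove_weight` function follows the weighting function described in the GloVe paper, with the defaults `x_max = 100` and `alpha = 0.75`; the regression step itself is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric word-word co-occurrence counts, stored sparsely.
    Each co-occurrence is weighted by 1/distance (assumption)."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences and cap the weight of frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Hypothetical tokenized captions.
sentences = [["a", "dog", "runs"], ["a", "dog", "sits"]]
counts = cooccurrence_counts(sentences)

# A training loop would iterate over these entries only, never over the zeros.
weighted = {pair: glove_weight(x) for pair, x in counts.items()}
```

Here the vocabulary has 4 words, so the full matrix has 16 cells, but only 10 of them are nonzero and stored; on a real corpus the gap is far larger.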
For this project, I used Word2Vec because of its popularity and ease of use; see the accompanying code in the repo. All the text was preprocessed as discussed above.