Hey folks! In this article I am going to explain image captioning using Keras. For this I will be using TensorFlow and Keras to generate captions associated with an image.


What do you see in the picture below?

Well some of you might say “A white dog in a grassy area”, some may say “White dog with brown spots” and yet some others might say “A dog on grass and some pink flowers”.

All of these captions are certainly relevant for this image, and there may be others as well. But the point I want to make is this: it’s so easy for us, as human beings, to glance at a picture and describe it in appropriate language. Even a five-year-old could do this with the utmost ease.

But, what about a machine? Can a computer think the same? Can we write a computer program that takes an image as input and produces a relevant caption as output?

Yes, we can.

And that is the agenda of today’s article: I’ll guide you through creating your own program to generate captions from images.

I’ll be using Google Colab to train the model.

We’ll start by creating a file named “Image_caption.ipynb” on Colab.

Connecting to G-Drive:

    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=True)
        COLAB = True
        print("Note: using Google CoLab")
        %tensorflow_version 2.x
    except:
        print("Note: not using Google CoLab")
        COLAB = False
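The later snippets refer to a `root_captioning` base path and an `hms_string` timing helper, neither of which is defined above. A minimal version of both (the exact Drive path is an assumption matching the directories we create below; adjust it to wherever you store the data):

```python
import os

# Assumed base path on Google Drive; change to match your own layout.
root_captioning = "/content/drive/My Drive/projects/captions"

def hms_string(sec_elapsed):
    """Format a duration in seconds as h:mm:ss.ss for the timing printouts."""
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m:>02}:{s:>05.2f}"
```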

Importing necessary Libraries:

We’ll start by importing necessary libraries.

import os
import string
import glob
from tensorflow.keras.applications import MobileNet
import tensorflow.keras.applications.mobilenet  
from tensorflow.keras.applications.inception_v3 import InceptionV3
import tensorflow.keras.applications.inception_v3
from tqdm import tqdm
import tensorflow.keras.preprocessing.image
import pickle
from time import time
import numpy as np
from PIL import Image
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Embedding, TimeDistributed, Dense, RepeatVector, Activation, Flatten, Reshape, concatenate, Dropout, BatchNormalization, add
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras import Input, layers, optimizers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

START = "startseq"
STOP = "endseq"


Now, we will download the necessary data and save it inside /projects/captions on G-Drive:

  • glove.6B – the GloVe embeddings
  • Flicker8k_Dataset – the Flickr8k images
  • Flicker8k_Text – the Flickr8k caption files
  • data – create this directory to hold saved models and pickles

Flickr8k Dataset Building/Cleaning

In this step, we’ll pull in the Flickr dataset captions and clean them of extra whitespace, punctuation, and other distractions.

null_punct = str.maketrans('', '', string.punctuation)
lookup = dict()

with open( os.path.join(root_captioning,'Flickr8k_text','Flickr8k.token.txt'), 'r') as fp:
  max_length = 0
  for line in fp.read().split('\n'):
    tok = line.split()
    if len(line) >= 2:
      id = tok[0].split('.')[0]
      desc = tok[1:]
      # Cleanup description
      desc = [word.lower() for word in desc]
      desc = [w.translate(null_punct) for w in desc]
      desc = [word for word in desc if len(word)>1]
      desc = [word for word in desc if word.isalpha()]
      max_length = max(max_length,len(desc))
      if id not in lookup:
        lookup[id] = list()
      lookup[id].append(' '.join(desc))
lex = set()
for key in lookup:
  [lex.update(d.split()) for d in lookup[key]]
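The later snippets rely on `wordtoidx`, `idxtoword`, `vocab_size`, and `train_descriptions`, none of which the article defines explicitly. A minimal sketch of how they might be built (the function names are mine, and keeping every word in `lex` is a simplification; the original course code also filters out rare words):

```python
def build_vocab(lex, start="startseq", stop="endseq"):
    """Map every caption word (plus the start/stop tokens) to an integer index.
    Index 0 is reserved for padding, matching pad_sequences' default."""
    vocab = sorted(lex) + [start, stop]
    wordtoidx = {w: ix for ix, w in enumerate(vocab, start=1)}
    idxtoword = {ix: w for w, ix in wordtoidx.items()}
    return wordtoidx, idxtoword, len(idxtoword) + 1  # +1 for the padding index

def wrap_captions(lookup, image_set, start="startseq", stop="endseq"):
    """Keep only the captions whose image file is in image_set and wrap each
    caption with the start/stop tokens the model trains on."""
    return {k: [f"{start} {d} {stop}" for d in v]
            for k, v in lookup.items() if k + '.jpg' in image_set}
```

With these in place, the later code would call, e.g., `wordtoidx, idxtoword, vocab_size = build_vocab(lex, START, STOP)` and `train_descriptions = wrap_captions(lookup, train_images, START, STOP)`.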

Loading the GloVe embeddings

Now, I’ll load the GloVe embeddings into a dictionary mapping each word to its vector.

glove_dir = root_captioning
embeddings_index = {} 
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), encoding="utf-8")

for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors.')
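Each line of the GloVe file is simply a word followed by its vector components, which is all the loop above is parsing. A toy illustration of the per-line parsing with a hypothetical 3-dimensional line (real glove.6B.200d vectors have 200 components):

```python
import numpy as np

def parse_glove_line(line):
    """Split one GloVe text line into (word, vector)."""
    values = line.split()
    return values[0], np.asarray(values[1:], dtype='float32')

word, vec = parse_glove_line("dog 0.11 -0.42 0.73")
```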

Reading all image names and using the predefined train/test sets

train_images_path = os.path.join(root_captioning,'Flickr8k_text','Flickr_8k.trainImages.txt') 
train_images = set(open(train_images_path, 'r').read().strip().split('\n'))
test_images_path = os.path.join(root_captioning,'Flickr8k_text','Flickr_8k.testImages.txt') 
test_images = set(open(test_images_path, 'r').read().strip().split('\n'))

img = glob.glob(os.path.join(root_captioning,'Flicker8k_Dataset', '*.jpg'))

train_img = []
test_img = []

for i in img:
  f = os.path.split(i)[-1]
  if f in train_images:
    train_img.append(f)
  elif f in test_images:
    test_img.append(f)
Choosing a Computer Vision Neural Network to Transfer

There are two neural networks accessed via transfer learning. In this example, I use GloVe for the text embedding and InceptionV3 to extract features from the images. Both of these transfers serve to extract features from the raw text and the images. Without this prior knowledge transferred in, this example would take considerably more training.

I made it so you can interchange the neural network used for the images. By setting WIDTH, HEIGHT, and OUTPUT_DIM you can swap in a different image network. One characteristic you want from the image network is that it does not produce too many output features (once you strip off the 1,000-class ImageNet classifier, as is common in transfer learning). InceptionV3 has 2,048 features below the classifier, while MobileNet without its top has over 50K. If the additional dimensions truly capture aspects of the images, they are worthwhile; however, 50K features increase both the processing needed and the complexity of the captioning network we are constructing.

  USE_INCEPTION = True

  if USE_INCEPTION:
    encode_model = InceptionV3(weights='imagenet')
    encode_model = Model(encode_model.input, encode_model.layers[-2].output)
    WIDTH = 299
    HEIGHT = 299
    OUTPUT_DIM = 2048
    preprocess_input = tensorflow.keras.applications.inception_v3.preprocess_input
  else:
    encode_model = MobileNet(weights='imagenet',include_top=False)
    WIDTH = 224
    HEIGHT = 224
    OUTPUT_DIM = 50176
    preprocess_input = tensorflow.keras.applications.mobilenet.preprocess_input

Creating the Training/Testing Dataset

We need to encode the images to create the training set. Later we will encode new images to present them for captioning.

def encodeImage(img):
  # Resize all images to a standard size (specified by the image encoding network)
  img = img.resize((WIDTH, HEIGHT), Image.ANTIALIAS)  # use Image.LANCZOS on newer Pillow
  # Convert a PIL image to a numpy array
  x = tensorflow.keras.preprocessing.image.img_to_array(img)
  # Add a batch dimension: the network expects shape (1, HEIGHT, WIDTH, 3)
  x = np.expand_dims(x, axis=0)
  # Perform any preprocessing needed by InceptionV3 or others
  x = preprocess_input(x)
  # Call InceptionV3 (or other) to extract the smaller feature set for the image.
  x = encode_model.predict(x) # Get the encoding vector for the image
  # Shape to correct form to be accepted by LSTM captioning network.
  x = np.reshape(x, OUTPUT_DIM )
  return x
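To see why the final reshape matters: `predict` returns a batch of shape `(1, OUTPUT_DIM)`, which we flatten to a plain `(OUTPUT_DIM,)` vector before storing it. A shape-only sketch with a stubbed encoder (so no downloaded weights are needed; `stub_predict` is a stand-in, not part of Keras):

```python
import numpy as np

OUTPUT_DIM = 2048  # InceptionV3's penultimate layer size

def stub_predict(batch):
    """Stand-in for encode_model.predict: (n, H, W, 3) -> (n, OUTPUT_DIM)."""
    return np.zeros((batch.shape[0], OUTPUT_DIM))

img_array = np.zeros((299, 299, 3))          # one resized image
batch = np.expand_dims(img_array, axis=0)    # (1, 299, 299, 3): add batch axis
features = stub_predict(batch)               # (1, 2048)
features = np.reshape(features, OUTPUT_DIM)  # (2048,): what we store per image
```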

We can now generate the training set. This will involve looping over every JPG that was provided. As this can take a while to perform, we will save it to a pickle file. This saves the considerable time needed to completely reprocess all of the images. The output dimensions are also made part of the file name because the images are processed differently by different transferred neural networks. If you changed from InceptionV3 to MobileNet, the number of output dimensions would change, and a new file would be created.

train_path = os.path.join(root_captioning,"data",f'train{OUTPUT_DIM}.pkl')
if not os.path.exists(train_path):
  start = time()
  encoding_train = {}
  for id in tqdm(train_img):
    image_path = os.path.join(root_captioning,'Flicker8k_Dataset', id)
    img = tensorflow.keras.preprocessing.image.load_img(image_path, target_size=(HEIGHT, WIDTH))
    encoding_train[id] = encodeImage(img)
  with open(train_path, "wb") as fp:
    pickle.dump(encoding_train, fp)
  print(f"\nGenerating training set took: {hms_string(time()-start)}")
else:
  with open(train_path, "rb") as fp:
    encoding_train = pickle.load(fp)

We’ll perform a similar process for test images.

test_path = os.path.join(root_captioning,"data",f'test{OUTPUT_DIM}.pkl')
if not os.path.exists(test_path):
  start = time()
  encoding_test = {}
  for id in tqdm(test_img):
    image_path = os.path.join(root_captioning,'Flicker8k_Dataset', id)
    img = tensorflow.keras.preprocessing.image.load_img(image_path, target_size=(HEIGHT, WIDTH))
    encoding_test[id] = encodeImage(img)
  with open(test_path, "wb") as fp:
    pickle.dump(encoding_test, fp)
  print(f"\nGenerating testing set took: {hms_string(time()-start)}")
else:
  with open(test_path, "rb") as fp:
    encoding_test = pickle.load(fp)

Using a Data Generator

Up to this point, we’ve generated training data ahead of time and fit the neural network to it. That is not always practical: the memory demands can be considerable. If the training data can be generated as the neural network needs it, we can use a Keras generator, which creates new batches on demand. The generator below produces training data for the caption network as it is needed.

def data_generator(descriptions, photos, wordtoidx, max_length, num_photos_per_batch):
  # x1 - Training data for photos
  # x2 - The caption that goes with each photo
  # y - The predicted rest of the caption
  x1, x2, y = [], [], []
  n = 0
  while True:
    for key, desc_list in descriptions.items():
      n += 1
      photo = photos[key+'.jpg']
      # Each photo has 5 descriptions
      for desc in desc_list:
        # Convert each word into a list of sequences.
        seq = [wordtoidx[word] for word in desc.split(' ') if word in wordtoidx]
        # Generate a training case for every possible sequence and outcome
        for i in range(1, len(seq)):
          in_seq, out_seq = seq[:i], seq[i]
          in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
          out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
          x1.append(photo)
          x2.append(in_seq)
          y.append(out_seq)
      if n == num_photos_per_batch:
        yield ([np.array(x1), np.array(x2)], np.array(y))
        x1, x2, y = [], [], []
        n = 0
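The core idea of the generator is the sliding window over each caption: a caption of n tokens yields n-1 training cases, each pairing a growing prefix with the next word as the target. A toy illustration of that pair construction (plain Python, no Keras needed):

```python
def caption_to_pairs(seq):
    """For an indexed caption, produce every (prefix, next_word) training pair."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

# e.g. indices for: startseq dog runs endseq
pairs = caption_to_pairs([1, 7, 4, 2])
# -> [([1], 7), ([1, 7], 4), ([1, 7, 4], 2)]
```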

Building the Neural Network

An embedding matrix is built from Glove. This will be directly copied to the weight matrix of the neural network.

embedding_dim = 200

# Get a 200-dim dense vector for each word in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in wordtoidx.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index remain all zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

inputs1 = Input(shape=(OUTPUT_DIM,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)

# Copy the GloVe matrix into the embedding layer, freeze it, and compile
caption_model.layers[2].set_weights([embedding_matrix])
caption_model.layers[2].trainable = False
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')

Training the Neural Network

Moving forward, we’ll be training the Neural Network model.

EPOCHS = 10  # total passes over the data; tune as needed
number_pics_per_batch = 3
steps = len(train_descriptions)//number_pics_per_batch

model_path = os.path.join(root_captioning,"data",'caption-model.hdf5')

if not os.path.exists(model_path):
  start = time()
  for i in tqdm(range(EPOCHS*2)):
      generator = data_generator(train_descriptions, encoding_train, wordtoidx, max_length, number_pics_per_batch)
      caption_model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

  # Lower the learning rate and increase the batch size for the remaining epochs
  caption_model.optimizer.lr = 1e-4
  number_pics_per_batch = 6
  steps = len(train_descriptions)//number_pics_per_batch

  for i in range(EPOCHS):
      generator = data_generator(train_descriptions, encoding_train, wordtoidx, max_length, number_pics_per_batch)
      caption_model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

  caption_model.save_weights(model_path)
  print(f"\nTraining took: {hms_string(time()-start)}")
else:
  caption_model.load_weights(model_path)

Generating Captions

It is important to understand that a caption is not generated with one single call to the neural network’s predict function. Neural networks output a fixed-length tensor. To get a variable-length output, such as free-form text, it requires multiple calls to the neural network.

The neural network accepts two objects (which are mapped to the input neurons). The first is the photo. The second is an ever-growing caption, which begins with just the start token. The neural network’s output is the prediction of the next word in the caption. Each call predicts one new word, and the word with the highest probability is appended. This continues until the end token is predicted or we reach the maximum caption length.

def generateCaption(photo):
    in_text = START
    for i in range(max_length):
        sequence = [wordtoidx[w] for w in in_text.split() if w in wordtoidx]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = caption_model.predict([photo,sequence], verbose=0)
        yhat = np.argmax(yhat)
        word = idxtoword[yhat]
        in_text += ' ' + word
        if word == STOP:
            break
    # Strip the start/stop tokens from the final caption
    final = in_text.split()
    final = final[1:-1]
    final = ' '.join(final)
    return final

Evaluate Performance on Test Data from Flicker8k

The caption model performs relatively well on images that are similar to what it trained on.

for z in range(10):
  pic = list(encoding_test.keys())[z]
  image = encoding_test[pic].reshape((1,OUTPUT_DIM))
  print(os.path.join(root_captioning,'Flicker8k_Dataset', pic))
  x = plt.imread(os.path.join(root_captioning,'Flicker8k_Dataset', pic))
  plt.imshow(x)
  plt.show()
  print("Caption:", generateCaption(image))

For each test image, the trained model displays the image followed by its generated caption.


Finally, we are able to generate captions from images. Try it out on your own images.

You can access the entire code on GitHub.

-Shruti Sharma

