Image Captioning: Demystifying Deep Learning and How to Get Started
30 Dec 2017
Reading time ~3 minutes
We tend to think of AI (Artificial Intelligence), and more specifically Deep Learning, as something complex and difficult. I do agree that Deep Learning is difficult, but that does not mean there is no intuition behind it.
For me, at least, everything started with Alan Turing and his Turing machine. He formulated a theory of what a finite state machine can be used to achieve, and put forward the idea of the Universal Turing Machine: in short, a machine that can simulate any other such machine. That begs the question: are humans finite machines? Can we simulate human beings with a computer?
Well, I am not sure about the answer, but let's just try it out. Tackling a single problem at a time looks like a good way to start.
My Deep Learning Definition
After all that theoretical background, I should give my own definition of Deep Learning. For me, it is a method of learning to solve a problem the same way a human does: by constantly improving on our last performance. Just as we need many tries to master a problem, Deep Learning requires a lot of data. And we all know that in the present world, data is gold.
Analysis of the Problem's Complexity
The first task of an engineer, whenever a problem comes in from a client, is to analyze the feasibility of a solution. So let's look at the input and output of the problem. After all, what we do in the end is adjust our model toward the best possible method of getting the output from the input, using all the knowledge we have gathered from our study and life experience.
In the case of image captioning, the input is an "Image". It has the following characteristics:
- usually huge in size
- comes in different sizes
- coded in the form of how it looks (pixel values) rather than what it means to us
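To make that last point concrete, here is a toy sketch (plain NumPy, with made-up pixel values) of what "coded as how it looks" means: an image is just an array of brightness numbers, and nothing in those numbers says what the picture is about.

```python
import numpy as np

# A tiny hypothetical 2x2 grayscale "image": just numbers describing
# how bright each pixel looks, not what the picture means to us.
image = np.array([[0, 255],
                  [128, 64]], dtype=np.uint8)

print(image.shape)  # (2, 2) -- real images come in many different sizes
print(image.size)   # 4 values here; a 1920x1080 RGB photo holds millions
```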
The output of image captioning is a "Sentence". It has the following characteristics:
- can be thought of as a sequence of words
- the sequence can be of any length
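A small sketch of that idea: before a model can work with a sentence, we usually map each word to an ID from a vocabulary. The vocabulary below is invented for illustration; real systems build it from the training captions.

```python
# A caption as a variable-length sequence of word IDs.
# This tiny vocabulary is made up purely for illustration.
vocab = {"<start>": 0, "<end>": 1, "a": 2, "dog": 3, "runs": 4, "on": 5, "grass": 6}

caption = "a dog runs on grass"
ids = [vocab["<start>"]] + [vocab[w] for w in caption.split()] + [vocab["<end>"]]
print(ids)       # [0, 2, 3, 4, 5, 6, 1]
print(len(ids))  # 7 -- another caption could be longer or shorter
```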
More analysis of the problem:
- there are multiple right captions for the same image
- it can't be coded as a fixed set of steps to generate the output, because we don't even know how we do it ourselves
- it can realistically be solved only by a machine learning algorithm
Analysis of Image Captioning
Since we are pondering how humans perform some act and trying to simulate it in a computer, it's a good idea to guess how we do it:
- Grab information from the input Image
- Select the most important information
- Present the information in the form of a sentence
Neural Network for Input
Since we have an image as input, we can use the best neural network we currently know of for extracting information from images: the Convolutional Neural Network (CNN). It uses convolution filters to extract information from an image. The CNN could be a tutorial on its own, which is beyond the scope of this one.
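To give a feel for what a convolution filter does, here is a minimal NumPy sketch, not the real network: a hand-written vertical-edge filter slides over a toy image whose right half is bright, and responds strongly along the edge. (Real CNNs learn their filter values from data and stack many such layers.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid' 2-D convolution (really cross-correlation, as in most
    deep-learning libraries): slide the filter over the image, sum products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half.
image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=float)

# A hand-made vertical-edge filter; a trained CNN learns these numbers.
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

print(convolve2d(image, edge_filter))  # strong response where the edge is
```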
Neural Network for Output
Our output requirement is a sequence of words. A simple network that takes one input and gives one output can't do this; we need a Recurrent Neural Network (RNN) that takes two inputs, a context and an input word, and produces two outputs, a new context and an output word.
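One step of that idea can be sketched in a few lines of NumPy. The sizes and weights below are made up (random, untrained); the point is only the shape of the computation: (context, input word) in, (new context, output word) out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 7, 8  # tiny made-up sizes for illustration

# Randomly initialised weights: a real model learns these from data.
W_xh = rng.normal(size=(vocab_size, hidden))  # input word  -> context
W_hh = rng.normal(size=(hidden, hidden))      # old context -> new context
W_hy = rng.normal(size=(hidden, vocab_size))  # new context -> word scores

def rnn_step(context, word_id):
    """One step: (context, input word) -> (new context, output word)."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                           # one-hot encode the input word
    new_context = np.tanh(x @ W_xh + context @ W_hh)
    scores = new_context @ W_hy
    return new_context, int(np.argmax(scores)) # pick the highest-scoring word

context = np.zeros(hidden)
context, next_word = rnn_step(context, word_id=0)
print(context.shape, next_word)  # (8,) and some word id in 0..6
```

Calling `rnn_step` repeatedly, feeding each output word and context back in, is what lets the network emit a sequence of any length.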
Connecting Input and Output
At last, we need to pass the information gained from the image by the CNN over to the RNN. Let's just pass the CNN's output as the initial context vector for the RNN.
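Putting the two pieces together, the wiring can be sketched like this. Everything here is a stand-in, the vocabulary, the random weights, and the fake CNN features are all invented, so the generated caption is gibberish until real training happens; the sketch only shows how the feature vector seeds the RNN's context.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]  # invented
vocab_size, hidden = len(vocab), 8

# Stand-ins for trained weights (random here; a real system learns them).
W_xh = rng.normal(size=(vocab_size, hidden))
W_hh = rng.normal(size=(hidden, hidden))
W_hy = rng.normal(size=(hidden, vocab_size))

def rnn_step(context, word_id):
    x = np.zeros(vocab_size)
    x[word_id] = 1.0
    context = np.tanh(x @ W_xh + context @ W_hh)
    return context, int(np.argmax(context @ W_hy))

def caption_image(image_features, max_len=10):
    """Use the CNN's feature vector as the RNN's first context vector,
    then greedily generate words until <end> or a length cap."""
    context = np.tanh(image_features)  # image features seed the context
    word_id, words = vocab.index("<start>"), []
    for _ in range(max_len):
        context, word_id = rnn_step(context, word_id)
        if vocab[word_id] == "<end>":
            break
        words.append(vocab[word_id])
    return " ".join(words)

# Pretend a CNN turned an image into an 8-number summary.
fake_cnn_features = rng.normal(size=hidden)
print(caption_image(fake_cnn_features))  # nonsense until weights are trained
```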
This is a simplified version of what happens in the image captioning system.
How to get Started
I have always been an open source enthusiast. As a Mozilla Tech Speaker, I want to shout out the work Mozilla is doing in speech and machine learning research. We:
- collect data
- publish free code
- provide an API to use
Trying out and tweaking the code is the best way to get started.