We tend to think of AI (Artificial Intelligence), and more specifically Deep Learning, as complex and difficult. I do agree that Deep Learning is difficult, but that does not mean there is no intuition behind it.

Introduction

For me at least, everything started with Alan Turing and his Turing machine. He formulated a theory of what a simple machine, reading and writing symbols on a tape, can compute. He then put forward the idea of the Universal Turing Machine. In short, it is a Turing machine that can simulate any other Turing machine. That raises the question: are humans such machines? Can we simulate human beings with a computer?

Well, I am not sure about the answer, but let's just try it out. Tackling a single problem at a time looks like a good way to start.

My Deep Learning Definition

After all that theoretical background, I need to come up with my own definition of Deep Learning. For me, it is a method of learning to solve a problem the same way a human does: by constantly improving on the last performance. Just as we need a lot of tries to master a problem, Deep Learning requires a lot of data. And we all know that in the present world data is gold.
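To make "improving on the last performance" concrete, here is a minimal sketch of learning by repeated small corrections. It is an illustration only: the toy data, the single `weight`, and the `learning_rate` value are assumptions made for this example, and real deep learning models adjust millions of weights rather than one.

```python
def train(pairs, steps=50, learning_rate=0.05):
    """Fit y = weight * x by repeatedly nudging weight to reduce the error."""
    weight = 0.0          # start with a bad guess
    history = []          # mean squared error before each update
    for _ in range(steps):
        error = sum((weight * x - y) ** 2 for x, y in pairs) / len(pairs)
        history.append(error)
        # Gradient of the mean squared error with respect to weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in pairs) / len(pairs)
        weight -= learning_rate * gradient   # small step toward a lower error
    return weight, history


# Toy data (hypothetical): the hidden rule is y = 2 * x.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight, history = train(pairs)

print(round(weight, 3))            # 2.0
print(history[0] > history[-1])    # True: the error shrank with practice
```

Each pass through the loop is one "try": measure how wrong the last attempt was, then adjust a little. More data and more tries give the model more chances to correct itself.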

Analysis of Complexity of the Problem

The first task of an engineer who gets a problem from a client is to analyze the feasibility of a solution. Let's look at the input and output of the problem. After all, what we do in the end is adjust our model to find the best possible way of getting the output from the input, using all the knowledge we have gathered from study and experience.

In the case of image captioning, the input is an “Image”. It has the following characteristics:

is usually huge in size
comes in different sizes
is encoded in terms of how it looks (pixel values) rather than what it means to us
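The three characteristics above can be seen in a small sketch. It assumes the common representation of an image as a grid of (R, G, B) pixel values from 0 to 255; the helper names `make_image` and `count_values` are made up for this example.

```python
def make_image(height, width, color):
    """Build a height x width image filled with a single RGB color."""
    return [[color for _ in range(width)] for _ in range(height)]


def count_values(image):
    """Count the raw numbers needed to store the image."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return height * width * channels


small = make_image(2, 3, (255, 0, 0))   # a tiny 2 x 3 red image
photo_values = 1080 * 1920 * 3          # a Full HD photo, computed without allocating it

print(count_values(small))   # 18
print(photo_values)          # 6220800
print(small[0][0])           # (255, 0, 0): this says "red", not "apple"
```

Even a single ordinary photo is millions of numbers, different images have different shapes, and none of those numbers directly says what is in the picture.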

The output of image captioning is a “Sentence”. It has the following characteristics:

can be thought of as a sequence of words
the sequence can be of any length
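Here is a rough sketch of how a sentence turns into a sequence a network can work with. The `<start>` and `<end>` marker words and the two example captions are assumptions for illustration; real systems use far larger vocabularies and more careful tokenization.

```python
def build_vocab(sentences):
    """Give every distinct word a number, reserving ids for two marker words."""
    vocab = {"<start>": 0, "<end>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into a list of word ids wrapped in start/end markers."""
    ids = [vocab[word] for word in sentence.lower().split()]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]


captions = ["a dog runs", "a cat sits on a mat"]   # hypothetical captions
vocab = build_vocab(captions)
encoded = [encode(caption, vocab) for caption in captions]

print(encoded[0])                      # [0, 2, 3, 4, 1]
print([len(ids) for ids in encoded])   # [5, 8]
```

The two captions encode to sequences of different lengths, which is exactly why the output side needs a model that can keep producing words until it decides to stop.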

Analysis of Image Captioning

Since we are just pondering how humans perform a task and trying to simulate that on a computer, it’s a good idea to guess how we do it ourselves:

Grab information from the input image
Select the most important information
Present the information in the form of a sentence

Neural Network for Input

Since we have an image as input, we can use the best neural network we know so far for extracting information from images: the Convolutional Neural Network (CNN). It uses convolution filters to extract information from an image. The Convolutional Neural Network could be a tutorial in itself, which is not the point of this one.
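To give a feel for what a convolution filter does, here is a minimal sketch that slides one small filter over a tiny grayscale image. The image and the `edge_filter` values are hand-picked for illustration; in a real CNN the filter values are learned from data, and there are many filters stacked in layers.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kernel_h):
                for b in range(kernel_w):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# A tiny grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_filter = [[-1, 1]]   # responds where brightness increases from left to right

feature_map = convolve2d(image, edge_filter)
print(feature_map)   # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
```

The output is large only where the dark region meets the bright one, so the filter has turned raw pixel values into a small piece of information: "there is a vertical edge here".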

Neural Network for Output

The output we need is a sequence of words. A simple network that takes one input and gives one output can’t do this; we need a Recurrent Neural Network (RNN), which takes two inputs, a context and an input word, and gives two outputs, an updated context and an output word.
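A single step of such a network can be sketched as below. This is a simplified plain RNN step under stated assumptions: words are one-hot vectors, the weight matrices `w_context`, `w_input`, and `w_output` are hand-picked rather than trained, and the output word is simply the highest-scoring one.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(context, word_vec, w_context, w_input, w_output):
    """Take (context, input word) and return (new context, output word id)."""
    from_context = matvec(w_context, context)
    from_word = matvec(w_input, word_vec)
    # Mix the old context with the new word, squashed into the range (-1, 1).
    new_context = [math.tanh(c + w) for c, w in zip(from_context, from_word)]
    scores = matvec(w_output, new_context)       # one score per vocabulary word
    output_word = scores.index(max(scores))      # pick the best-scoring word
    return new_context, output_word


# Hypothetical sizes: context of length 2, vocabulary of 3 words.
w_context = [[0.5, 0.0], [0.0, 0.5]]
w_input = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
w_output = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

context = [0.0, 0.0]
word_vec = [1.0, 0.0, 0.0]    # one-hot vector for word 0

new_context, output_word = rnn_step(context, word_vec, w_context, w_input, w_output)
print(output_word)   # 0
```

Feeding `new_context` and the chosen word back into `rnn_step` produces the next word, which is how one small step can generate a sequence of any length.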

Connecting Input and Output

Finally, we need to pass the information the CNN has gained from the image to the RNN. Let's just pass the CNN's output as the context vector for the RNN.
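The wiring can be sketched as a short loop: the image features become the starting context, and the RNN step is called repeatedly until it emits an end marker. Everything here is a stand-in for illustration: `toy_step` uses hand-written rules instead of a trained network, and the five-word `VOCAB` and the two-number feature vectors are made up.

```python
VOCAB = ["<start>", "<end>", "a", "dog", "cat"]   # hypothetical tiny vocabulary


def toy_step(context, word_id):
    """Stand-in for a trained RNN step: hand-written rules, not learned."""
    # Pretend feature 0 means "dog-like" and feature 1 means "cat-like".
    animal = 3 if context[0] > context[1] else 4
    if word_id == 0:            # after <start>, say "a"
        return context, 2
    if word_id == 2:            # after "a", name the animal
        return context, animal
    return context, 1           # otherwise finish with <end>


def generate_caption(image_features, step_fn, start_id=0, end_id=1, max_len=10):
    """Use the CNN output as the RNN context and emit words until <end>."""
    context = list(image_features)   # CNN output becomes the RNN context
    word = start_id
    caption = []
    for _ in range(max_len):         # max_len guards against never stopping
        context, word = step_fn(context, word)
        if word == end_id:
            break
        caption.append(word)
    return caption


dog_features = [0.9, 0.1]   # pretend CNN output for a dog photo
cat_features = [0.2, 0.8]   # pretend CNN output for a cat photo

dog_caption = " ".join(VOCAB[i] for i in generate_caption(dog_features, toy_step))
cat_caption = " ".join(VOCAB[i] for i in generate_caption(cat_features, toy_step))
print(dog_caption)   # a dog
print(cat_caption)   # a cat
```

In a real system both the feature extractor and the step function are neural networks trained together on image and caption pairs; only the loop around them stays this simple.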

This is a simplified version of what happens in the image captioning system.

How to Get Started

I have always been an open source enthusiast. As a Mozilla Tech Speaker, I want to give a shout-out to the work Mozilla is doing in speech and machine learning research. We:

collect data
release free code
provide an API to use

Trying out and tweaking the code is the best way to get started.

Download

slide