Humans have been captioning images involuntarily for decades, and in the age of social media nearly every image carries a caption across the various platforms. Describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and NLP. This talk will be beneficial for those who are interested in advanced applications of deep neural networks.
In this talk, I will present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on the COCO dataset show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
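To make the training objective concrete, here is a minimal sketch of the kind of setup described above: a CNN image feature vector conditions an RNN decoder, and maximizing the likelihood of the target caption amounts to minimizing token-level cross-entropy. All sizes and module names below are illustrative assumptions, not the exact architecture presented in the talk.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model may differ.
vocab_size, embed_dim, hidden_dim, feat_dim = 10000, 256, 512, 2048

class CaptionDecoder(nn.Module):
    """RNN decoder conditioned on a CNN image feature vector."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # image feature -> first "token"
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the projected image feature, then teacher-force the caption tokens.
        img_tok = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E)
        tok_emb = self.embed(captions[:, :-1])            # (B, T-1, E)
        inputs = torch.cat([img_tok, tok_emb], dim=1)     # (B, T, E)
        hidden, _ = self.rnn(inputs)
        return self.out(hidden)                           # (B, T, V)

# Maximizing caption likelihood = minimizing cross-entropy over target tokens.
decoder = CaptionDecoder()
criterion = nn.CrossEntropyLoss()
img_feats = torch.randn(4, feat_dim)               # stand-in CNN features for 4 images
captions = torch.randint(0, vocab_size, (4, 12))   # stand-in tokenized captions
logits = decoder(img_feats, captions)
loss = criterion(logits.reshape(-1, vocab_size), captions.reshape(-1))
loss.backward()
```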
Learning a model like this would be incredible. It would be a great way to relate how relevant an image and a caption are to each other. For a batch of images and captions, we can use the model to map them all into this embedding space, compute a distance metric, and, for each image and each caption, find its nearest neighbors.
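As a rough illustration of that retrieval idea, the sketch below assumes the trained model exposes two hypothetical functions that embed images and captions into the same vector space (random vectors stand in for them here); the nearest-neighbor lookup itself is just a cosine-similarity argmax.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-ins for embed_image(images) / embed_caption(captions), a hypothetical model API.
img_vecs = np.random.randn(8, 512)
cap_vecs = np.random.randn(8, 512)

sims = cosine_sim_matrix(img_vecs, cap_vecs)   # (num_images, num_captions)

# For each image, the caption whose embedding is its nearest neighbor ...
best_caption_for_image = sims.argmax(axis=1)
# ... and for each caption, the nearest image.
best_image_for_caption = sims.argmax(axis=0)
print(best_caption_for_image, best_image_for_caption)
```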
I will cover the following:
- What is deep learning, CNN, and RNN?
- Motivation: real-world applications of CNNs and RNNs, state-of-the-art results
- Internal structure of vanilla models
- Description of the dataset
- Deep dive into the image encoder and caption encoder (see the sketch after this list)
- Impact of GPUs (some practical thoughts on hardware and software)
- Explanation of the code
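For the encoder deep dive, a minimal sketch of the two pieces might look like the following. It assumes a torchvision ResNet-18 backbone as the image encoder and a GRU as the caption encoder, which is one reasonable choice rather than the exact configuration presented in the talk; the last lines show the one-line device switch that makes the GPU discussion concrete.

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """CNN backbone (ResNet-18 here) with the classification head removed."""
    def __init__(self, out_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18()                    # untrained backbone, for illustration
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the final fc layer
        self.fc = nn.Linear(512, out_dim)

    def forward(self, images):                 # images: (B, 3, 224, 224)
        feats = self.cnn(images).flatten(1)    # (B, 512)
        return self.fc(feats)                  # (B, out_dim)

class CaptionEncoder(nn.Module):
    """Embeds token ids and summarizes them with a GRU's final hidden state."""
    def __init__(self, vocab_size=10000, embed_dim=256, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, out_dim, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, T)
        _, h = self.gru(self.embed(tokens))    # h: (1, B, out_dim)
        return h.squeeze(0)                    # (B, out_dim)

# GPU impact in practice: the same code runs on CPU or GPU with one device switch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
img_enc, cap_enc = ImageEncoder().to(device), CaptionEncoder().to(device)
images = torch.randn(2, 3, 224, 224, device=device)
tokens = torch.randint(0, 10000, (2, 15), device=device)
print(img_enc(images).shape, cap_enc(tokens).shape)   # torch.Size([2, 512]) twice
```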