International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 08 Issue: 02 | Feb 2021 | www.irjet.net
Image Captioning and Visual Question Answering for the Visually Impaired

Yash Jeswani1, Siddhesh Sawant2, Hrushikesh Makode3, Ankush Rathi4, Prof. Sumitra Jakhete5
sajakhete@pict.edu, yashjeswani2420@gmail.com, makodehrushikesh@gmail.com, rathi.ankush2438@gmail.com, sid123.sawant@gmail.com

1,2,3,4 Student, Dept. of Information Technology, Pune Institute of Computer Technology (PICT), Pune, Maharashtra, India
5 Professor, Dept. of Information Technology, Pune Institute of Computer Technology (PICT), Pune, Maharashtra, India
----------------------------------------------------------------------***-------------------------------------------------------------------
Abstract: For sighted people, it is straightforward to look at an image and answer questions about it using prior knowledge. However, there are also situations, such as a visually impaired user or an intelligent agent, where visual information must be actively elicited from a given image. We want to help blind people overcome their daily visual challenges and break down social accessibility barriers. The purpose of our project is image captioning and VQA. Image captioning is the task of generating a caption for an image: it has to determine the objects in the image, their actions and relationships, and salient attributes that may not be immediately apparent in the picture. Once these are identified, the next step is to generate the most relevant and concise description of the image, one that is grammatically and semantically correct. Our system uses a CNN for object detection and natural language processing techniques for description generation. Beyond the generated description, visually impaired users can ask arbitrary questions, so we also propose the task of free-form and open-ended VQA: given an image and a natural language question about the image, the task is to produce a correct natural language answer.
Keywords - Object Detection, Fully Connected Neural Networks (FCNNs), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), Image Captioning, VQA

I. INTRODUCTION

One of the complex and important tasks for the computer vision community is to combine various tools for high-level scene interpretation, such as image captioning and visual question answering. Such technologies have the potential to assist people who are blind or visually impaired. Given an image, image captioning is the process of generating a textual description of it, while visual question answering aims at answering questions about it. We propose a machine learning application to handle both tasks. For image captioning, an LSTM network is fed a vectorized representation of the image produced by a pretrained CNN and generates a caption. For question answering, the vectorized representations of the image and the textual question are combined to generate the answer. We hope this work will help visually impaired people overcome their daily visual challenges.
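As a rough illustration of this pipeline, the following PyTorch sketch shows a pretrained CNN encoding an image, an LSTM decoding a caption from that encoding, and a simple head fusing image and question vectors to classify over a fixed answer set. The ResNet-50 backbone, all dimensions, and the elementwise-product fusion are our illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionDecoder(nn.Module):
    """LSTM decoder that produces caption logits from a CNN image embedding."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # The projected image feature acts as the first "token" of the sequence.
        word_embeds = self.embed(captions)                         # (B, T, E)
        inputs = torch.cat([image_feats.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)                              # (B, T+1, H)
        return self.fc(hidden)                                     # (B, T+1, V)

class VQAHead(nn.Module):
    """Fuses image and question vectors to classify over candidate answers."""
    def __init__(self, feat_dim=256, num_answers=1000):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_answers)

    def forward(self, image_feats, question_feats):
        fused = image_feats * question_feats        # elementwise-product fusion
        return self.classifier(fused)

# Pretrained CNN encoder: drop the classifier head, keep the pooled features.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1])   # output (B, 2048, 1, 1)
project = nn.Linear(2048, 256)                            # map to decoder embed_dim

images = torch.randn(4, 3, 224, 224)                      # dummy image batch
with torch.no_grad():
    feats = encoder(images).flatten(1)                    # (4, 2048), frozen backbone
image_feats = project(feats)                              # (4, 256)

captions = torch.randint(0, 10000, (4, 15))               # dummy caption token ids
caption_logits = CaptionDecoder(vocab_size=10000)(image_feats, captions)

question_feats = torch.randn(4, 256)                      # e.g. an LSTM encoding of the question
answer_logits = VQAHead()(image_feats, question_feats)    # (4, 1000) answer scores
```

At inference time the decoder would be unrolled greedily or with beam search instead of being fed ground-truth caption tokens; the dummy tensors above only stand in for a trained tokenizer and question encoder.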
II. LITERATURE SURVEY

A literature survey was conducted in order to study and build on knowledge from previous research and surveys. Some papers were classified based on the tools/software used, the algorithms used, and their corresponding datasets (if any), along with the platform on which they were deployed. Papers comparing various existing image captioning and visual question answering methodologies, along with the various datasets, are also described below.

1) Image Captioning: Retrieval-based and template-based image captioning methods were mainly adopted in early work. Owing to the great progress made in the field of deep learning [1], recent work relies on deep neural networks for automatic image captioning. In this section, we review such methods. Even though deep neural networks are now widely adopted for the image captioning task, different methods may be based on different frameworks. Therefore, we classify deep-neural-network-based methods into subcategories on the basis of the main framework they use and discuss each subcategory in turn.
A. Retrieval and template based methods augmented by neural networks: To retrieve description sentences for a query picture, Socher et al. propose to use dependency-tree recursive neural networks to represent phrases and sentences as compositional vectors. They use another deep neural network [2] as a visual model to extract features from images. The obtained multimodal features are mapped into a common space using a max-margin objective function. After training, correct image-sentence pairs have larger inner products in the common space, and vice versa. Finally, sentence retrieval is performed based on the similarities between the representations of images and sentences in the common space. Karpathy et al. propose to embed sentence fragments and image fragments into a common space for ranking sentences for a query image [3]. Representing both image fragments and sentence fragments as feature vectors, the authors design a structured max-margin objective, which incorporates a global ranking term and a fragment alignment term, to map visual and textual data into a common space.
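To make the max-margin objective concrete, below is a minimal sketch of such a ranking loss over a batch of matched image-sentence embeddings in a common space. The margin value, the use of in-batch negatives, and all names here are illustrative assumptions, not details taken from [2] or [3].

```python
import torch
import torch.nn.functional as F

def max_margin_ranking_loss(img_emb, sent_emb, margin=0.2):
    """Push matched image-sentence pairs to score higher (by inner product)
    than mismatched in-batch pairs, with at least `margin` separation."""
    scores = img_emb @ sent_emb.t()                 # (B, B) pairwise inner products
    pos = scores.diag().unsqueeze(1)                # matched pairs on the diagonal
    # Hinge terms: mismatched pairs should score below the matched pair by `margin`.
    cost_sent = F.relu(margin + scores - pos)       # rank sentences for each image
    cost_img = F.relu(margin + scores - pos.t())    # rank images for each sentence
    # Zero out the diagonal so matched pairs contribute no cost.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_sent.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# Toy usage: 8 matched (image, sentence) embedding pairs in a 256-d common space.
img_emb = F.normalize(torch.randn(8, 256), dim=1)
sent_emb = F.normalize(torch.randn(8, 256), dim=1)
loss = max_margin_ranking_loss(img_emb, sent_emb)
```

After training with an objective of this shape, matched pairs end up with larger inner products than mismatched ones, which is exactly the property the retrieval step exploits when ranking candidate sentences for a query image.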