Speech Synthesis Based Image Captioning
Asim Shah, Institute of Computer Sciences and Information Technology (ICS/IT), The University of Agriculture Peshawar, Pakistan.
Arbab Waseem Abbas, Institute of Computer Sciences and Information Technology (ICS/IT), The University of Agriculture Peshawar, Pakistan.
Salman Zeb, Institute of Computer Sciences and Information Technology (ICS/IT), The University of Agriculture Peshawar, Pakistan.
Ejaz Muhammad, Institute of Computer Sciences and Information Technology (ICS/IT), The University of Agriculture Peshawar, Pakistan.
Sayyad Abbas, Institute of Computer Sciences and Information Technology (ICS/IT), The University of Agriculture Peshawar, Pakistan.
Corresponding Author:
Asim Shah (aasimshah325@gmail.com)
Abstract:
Image captioning is the task of generating a natural-language text description of an image. It is a research problem at the intersection of computer vision and natural language processing. In recent years, advances in deep learning models have driven considerable progress in this area. However, most existing image captioning models produce text-only descriptions, which may not be accessible to people with visual impairments. To address this issue, we propose a system with two modules: an image caption generator based on a transformer model, and a module that converts the generated caption to speech. The core transformer model takes an image as input and generates a text description, which the second module then converts into speech. We use a transformer model because transformers have surpassed other models in performance. We evaluate the transformer model on Flickr8k, a dataset available on Kaggle, and compare its performance with other deep learning models using evaluation metrics specific to text generation: BLEU-1, BLEU-2, BLEU-3, and BLEU-4. The main focus of our paper is to develop a system for visually impaired people.
Keywords:
Transformers; Text-to-Speech Synthesis; METEOR; Convolutional Neural Networks and Long Short-Term Memory (CNN-LSTM); Google Text-to-Speech (gTTS)
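
As an illustration of the BLEU-1 to BLEU-4 metrics named in the abstract, the following is a minimal sketch of sentence-level BLEU in plain Python. It is not the evaluation code used in this work: the function names (ngrams, modified_precision, bleu) and the example captions are hypothetical, and no smoothing is applied, so a caption with no matching n-gram at some order scores 0 for that BLEU-n.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip candidate counts so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Unsmoothed sentence-level BLEU with uniform weights over orders 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical captions, for illustration only.
    references = ["a dog runs across the green grass".split(),
                  "a brown dog is running on the grass".split()]
    candidate = "a dog is running on the grass".split()
    for n in range(1, 5):
        print(f"BLEU-{n}: {bleu(candidate, references, n):.3f}")
```

Reported results in the literature are usually corpus-level BLEU, which pools n-gram counts over all test captions before taking the ratio, so scores from this per-sentence sketch are not directly comparable to published numbers.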