Can you imagine a life without eyesight? Can you feel the complete blackness surrounding you? I hope not.
Vision comes so naturally to us that we rarely grasp its power. It's not even much appreciated by those who have the gift of sight. We see everything around us, yet we seldom wonder at the sophisticated sense organ we have been given.
[Image] This is what a blind person sees.
[Image] This is what we see.
Even a verbal description of this picture could be a huge help to a person with visual impairment. That's what happens when we read a book: we build up visual imagery in our minds from the description provided to us. "A beautiful lotus-shaped building at night" could be a useful description of the picture above.
A real-world example:
[Image] Automatically captioned: a group of items on a table
Using image captioning, a field of Artificial Intelligence at the intersection of computer vision and natural language processing, we can generate a textual description of the scene. A Convolutional Neural Network (CNN) serves as the vision component of this task; Google's Inception-v3, for instance, achieves accuracy as high as 93.9% on the ImageNet classification task.
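For a concrete feel of this step, here is a minimal captioning sketch in Python using the Hugging Face transformers library. It is not the Show and Tell model from the post linked below; the checkpoint name is a publicly available stand-in, and the image path is a hypothetical placeholder.

```python
# A minimal image-captioning sketch using Hugging Face transformers.
# Assumptions: transformers and Pillow are installed, and "photo.jpg"
# is a local image of the scene (hypothetical path).
from transformers import pipeline

# A publicly available encoder-decoder captioning checkpoint, used here
# as a stand-in for the Show and Tell model described in the post below.
captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

result = captioner("photo.jpg")
print(result[0]["generated_text"])  # e.g. "a group of items on a table"
```

The same pattern works with any encoder-decoder captioning model: a CNN or vision transformer encodes the image, and a language model decodes that encoding into a sentence.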
[Image] Source: https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
Natural Language Processing
Now we have a description of the scene, but what use is a textual description to a person with visual impairment? We need to deliver this information in a form they can actually use: as sound waves, in the form of spoken language.
We can pass this description to a natural language processing unit that converts it into speech for the user. It is essential that the generated voice does not sound robotic; it should sound natural, ideally in the user's native language, so that listening to it is as good as listening to a real person.
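As a sketch of this text-to-speech step, the snippet below uses pyttsx3, an offline Python TTS library that wraps the operating system's native voices. The caption string is assumed to come from the captioning step above; which voices and languages are available depends entirely on the platform.

```python
# Speak a generated caption aloud with pyttsx3, an offline
# text-to-speech library built on the OS's native speech engines.
import pyttsx3

caption = "a group of items on a table"  # output of the captioning step above

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # slow the speech slightly for clarity

# Voice selection is platform-specific: pyttsx3 exposes whatever voices
# the operating system provides, so matching the user's native language
# means scanning engine.getProperty("voices") for a suitable voice id.

engine.say(caption)
engine.runAndWait()
```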
But a single description of an image cannot capture everything. How can we ensure that the description contains the information the user actually wants? An assistant built on top of this model can be of great help here: the user asks the assistant a question, the computer vision model extracts more information about the attribute being asked about, and the answer goes through the same text-to-speech process, followed by further questions from the user. A sketch of this loop follows the picture below.
[Image] Automatically captioned: a man standing in front of a mountain
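The question-and-answer loop described above could be sketched by pairing a visual question answering (VQA) model with the same speech step. This is only an illustrative sketch under assumptions: the model checkpoint and image path are stand-ins chosen for demonstration, not part of any existing pipeline.

```python
# A sketch of the assistant loop: the user asks follow-up questions
# about the scene, a visual question answering (VQA) model answers,
# and each answer is spoken aloud.
from transformers import pipeline
import pyttsx3

# Publicly available VQA checkpoint, used here for illustration.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")
engine = pyttsx3.init()

image_path = "photo.jpg"  # hypothetical image of the scene

while True:
    question = input("Ask about the scene (or 'quit'): ")
    if question.strip().lower() == "quit":
        break
    # The pipeline returns a ranked list of answers; take the top one.
    answer = vqa(image=image_path, question=question)[0]["answer"]
    engine.say(answer)
    engine.runAndWait()
```

In a real assistant, the typed question would itself arrive as speech and pass through a speech-recognition step first, closing the loop entirely in audio.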
Resources and References:
- https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
- https://azure.microsoft.com/en-in/services/cognitive-services/computer-vision/
- https://arxiv.org/abs/1609.06647
- https://research.googleblog.com/2016/08/improving-inception-and-image.html
- http://blog.neospeech.com/what-is-text-to-speech-and-how-does-it-work/