dc.description.abstract |
Speech is frequently corrupted by surrounding noise, which limits the accuracy of an audio-only speech recognition system. This motivates the search for alternative sources of information from the speaker. Lip movements satisfy two conditions: they carry information about what the speaker said, and they are unaffected by environmental noise. Moreover, extracting visual information for visual speech recognition is necessary when the user cannot produce audible speech at all.
This thesis addresses the task of visual speech recognition, with the primary motivation of making the process of lip reading completely automatic. The first step is therefore to identify the skin region of the person in the video. From the skin region we obtain the face boundary, within which we locate the speaker's eyes. The eye coordinates lead us to the region between the nose and the upper lip, where lip tracking is performed, yielding the lip shape in every frame of the video. When comparing videos, we first extract key frames, which reduces the computation that would otherwise be required to compare all frames of the videos. The sets of key frames of different videos are then compared using a modified form of the Hausdorff distance. |
en_US |
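As an illustration of the final comparison step, below is a minimal Python sketch of one common "modified" Hausdorff distance, the Dubuisson-Jain variant, which replaces the maximum point-to-set distance of the classical Hausdorff distance with the mean. The abstract does not specify which modification the thesis actually uses, and the lip-contour point sets shown are assumed inputs, so this is only a sketch of the general technique.

```python
import numpy as np

def directed_mhd(A, B):
    """Mean distance from each point in A to its nearest point in B.

    A and B are (m, 2) and (n, 2) arrays of 2-D contour points.
    """
    # Pairwise Euclidean distances between the two point sets: shape (m, n).
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # For each point in A, take its nearest neighbour in B, then average.
    return dists.min(axis=1).mean()

def modified_hausdorff(A, B):
    """Modified Hausdorff distance (Dubuisson & Jain, 1994).

    Symmetric: takes the larger of the two directed mean distances.
    """
    return max(directed_mhd(A, B), directed_mhd(B, A))

# Hypothetical usage: compare lip contours extracted from two key frames.
lip_contour_1 = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]])
lip_contour_2 = np.array([[0.1, 0.0], [1.0, 0.6], [2.1, 0.1]])
print(modified_hausdorff(lip_contour_1, lip_contour_2))
```

Averaging instead of taking the maximum makes the distance far less sensitive to a single outlier point on a contour, which is one reason modified Hausdorff distances are often preferred for noisy shape matching such as lip-contour comparison.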