Implementation of a voice control and command system using a Luganda keyword spotting system.
Mugagga, Michael Mulondo
This project addresses these issues in the context of Luganda, the second most widely spoken language in Uganda. Since vocal communication is the most preferred way of communicating, a voice command and control interface on computerized systems can make computers more effective to use. The user utters a Luganda command keyword to issue a command, and the system carries it out. The system was developed entirely in Python, making use of its readily available packages. It is based on a keyword spotting model with a convolutional neural network architecture that can detect a specific keyword within a continuous speech segment. Once a keyword is detected, the command coded to that keyword in the keyword spotting system is executed on the computerized system. With high accuracy and a low false-positive rate, the application is intended to run on computers with limited resources and low power consumption, for example mobile phones.

A four-stage model guided the design process. The planning phase covered reviewing research papers on keyword spotting, choosing the programming language and algorithms, and selecting the specific Luganda keywords to include. The second phase was data collection and cleaning, which involved recording and gathering audio samples for the keywords while accounting for differences in pitch and pronunciation and for both indigenous Luganda speakers and non-Luganda speakers. This phase also included signal processing, in which the audio samples were formatted into uniform wave files with a constant sample rate and bit rate. The third phase was to train and test the system on MFCCs computed from the collected audio samples and to make edits to the application to increase its functionality.
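The MFCC features mentioned above can be computed from the uniform wave files with a short NumPy routine. The following is a minimal sketch, not the project's actual preprocessing code: the sample rate, frame size, hop length, and number of coefficients are assumptions chosen for illustration.

```python
import numpy as np

SR = 16_000   # assumed constant sample rate of the uniform wave files
N_FFT = 512   # assumed frame size (32 ms at 16 kHz)
HOP = 160     # assumed hop length (10 ms)
N_MELS = 26   # assumed number of mel filters
N_MFCC = 13   # assumed number of cepstral coefficients kept

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal):
    # Frame the signal, apply a Hamming window, take the power spectrum
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack(
        [signal[i * HOP:i * HOP + N_FFT] for i in range(n_frames)]
    ) * np.hamming(N_FFT)
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2 / N_FFT
    # Log mel-filterbank energies followed by a DCT-II give the MFCCs
    energies = np.log(power @ mel_filterbank().T + 1e-10)
    n = np.arange(N_MELS)
    dct = np.cos(np.pi * np.outer(np.arange(N_MFCC), (2 * n + 1) / (2 * N_MELS)))
    return energies @ dct.T   # shape: (n_frames, N_MFCC)
```

One such MFCC matrix per recording, stacked across the dataset, would form the input on which the convolutional network is trained.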
The third phase included developing the model, training it, and making further adjustments to both the audio dataset and the model itself. The fourth phase was mainly to develop the GUI, test how the system works on different platforms, map commands to the different keywords in the system, and modify the system to improve its performance. The results collected differed across keywords: keywords with multiple phonemes were harder for the system to detect, and keywords with similar ending phonemes were sometimes interchanged by the system. The more samples a keyword had, the more accurately it was detected, yielding better performance.
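The keyword-to-command mapping of the fourth phase can be sketched as a simple dispatch table. This is an illustrative sketch only: the Luganda keywords shown and the commands bound to them are assumptions, not the keywords or actions the project actually used.

```python
import subprocess

# Hypothetical keyword-to-command table; the actual Luganda keywords in the
# project and the commands wired to them are assumptions for illustration.
KEYWORD_COMMANDS = {
    "ggulawo": ["xdg-open", "."],      # assumed keyword mapped to "open"
    "ggalawo": ["loginctl", "lock-session"],  # assumed keyword for "close"
}

def command_for(keyword: str):
    """Return the command bound to a spotted keyword, or None if unmapped."""
    return KEYWORD_COMMANDS.get(keyword.lower().strip())

def dispatch(keyword: str) -> bool:
    """Execute the command mapped to a detected keyword, if any."""
    cmd = command_for(keyword)
    if cmd is None:
        return False  # spotted word is not bound to any action
    subprocess.Popen(cmd)  # fire and forget; the spotter keeps listening
    return True
```

Keeping the table in one place makes the fourth-phase work of remapping commands to keywords a data change rather than a code change.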