Implementation of a voice command and control system using Luganda keyword spotting.
Abstract
This project addresses the issues indicated above, specifically in the context of the Luganda
language. Luganda is the second most widely spoken language in Uganda. Since vocal
communication is the most preferred way of communicating, a voice command and control
interface on computerized systems can increase the effectiveness of using computers. The user
utters a Luganda command keyword to issue a command to the system, and the system carries
out that command.
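The keyword-to-command mapping described above can be sketched as a simple dispatch table. The keyword labels and actions below are hypothetical placeholders for illustration, not the project's actual Luganda vocabulary or commands:

```python
# Minimal sketch of mapping detected keywords to system commands.
# The labels and actions here are illustrative placeholders; the real
# system codes actual Luganda keywords to OS-level actions.

def open_browser():
    return "browser opened"

def shut_down():
    return "system shutting down"

COMMANDS = {
    "keyword_open": open_browser,    # stands in for a Luganda keyword
    "keyword_shutdown": shut_down,   # stands in for another keyword
}

def execute(detected_keyword):
    """Run the command coded to the detected keyword, if any."""
    action = COMMANDS.get(detected_keyword)
    if action is None:
        return None  # keyword not mapped; ignore the detection
    return action()
```

Keeping the mapping in a table like this makes it straightforward to remap commands to different keywords, as described in the fourth design phase.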
The system was developed entirely in Python, making use of its readily available packages. It
is based on a keyword spotting model with a convolutional neural network architecture that
can detect a specific keyword within a continuous speech segment. Once a keyword is detected,
the command coded to that keyword in the keyword spotting system is executed on the
computerized system. The application is designed to achieve a high level of accuracy and a low
rate of false positives while running on computers with limited resources and low power
consumption, for example mobile phones.
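One common way to spot a keyword in continuous speech while keeping false positives low, consistent with the goal stated above, is to smooth the model's per-frame keyword probabilities and trigger only when the smoothed score crosses a threshold. The sketch below assumes the network has already produced one posterior probability per audio frame; the window size and threshold are illustrative values, not the project's actual settings:

```python
# Hedged sketch: posterior smoothing plus thresholding for keyword
# spotting over a continuous stream. Assumes a model emits one keyword
# probability per frame; window and threshold values are examples only.

def detect_keyword(posteriors, window=3, threshold=0.8):
    """Return the frame index where the smoothed posterior first
    exceeds the threshold, or None if the keyword never fires."""
    for i in range(len(posteriors)):
        lo = max(0, i - window + 1)
        smoothed = sum(posteriors[lo:i + 1]) / (i + 1 - lo)
        if smoothed >= threshold:
            return i
    return None
```

Averaging over a few frames means a single noisy high-probability frame is not enough to trigger a command, which is one way to trade a little detection latency for fewer false positives.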
During the design process, a four-stage model was followed. The first, planning phase covered
reviewing research papers on keyword spotting, choosing a programming language, selecting
the algorithms to consider, and deciding which Luganda keywords to include. The second phase
was data collection and cleaning: audio samples for the keywords were recorded from both
indigenous Luganda speakers and non-Luganda speakers, with variation in pitch and
pronunciation. This phase also included audio signal processing, in which the samples were
converted into uniform wave files with a constant sample rate and bit rate. The third phase was
to train and test the system on MFCCs (Mel-frequency cepstral coefficients) extracted from the
collected audio samples; it included developing the model, training it, and making further
adjustments to both the audio dataset and the model itself. The fourth phase was mainly to
develop the GUI, test how the system behaves on different platforms, map the commands to the
different keywords in the system, and modify the system to improve its performance.
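The preprocessing step of the second phase, converting recordings into uniform wave files with a constant sample rate, can be sketched with Python's standard `wave` module. The target format here (16 kHz, mono, 16-bit) is an assumption for illustration, and the resampler is a naive nearest-sample approach rather than a production-quality filter:

```python
# Hedged sketch of normalizing audio samples to a uniform wave format.
# Assumes mono 16-bit input; the 16 kHz target rate is illustrative.
import struct
import wave

TARGET_RATE = 16000  # assumed target; the project's actual rate may differ

def resample(samples, src_rate, dst_rate):
    """Naive nearest-sample resampling of a sequence of int16 samples."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    return [samples[int(i * src_rate / dst_rate)] for i in range(n_out)]

def normalize_wav(src_path, dst_path, rate=TARGET_RATE):
    """Rewrite a mono 16-bit wav file at a constant sample rate."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        n = src.getnframes()
        samples = struct.unpack("<%dh" % n, src.readframes(n))
        out = resample(samples, src.getframerate(), rate)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(struct.pack("<%dh" % len(out), *out))
```

Running every recording through a step like this ensures the MFCC features extracted in the third phase are computed over frames of identical duration and resolution.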
Results varied across keywords. Keywords with many phonemes were harder for the system to
detect, and keywords with similar ending phonemes were sometimes interchanged by the
system. The more samples a keyword had, the more accurately it was detected, yielding better
overall performance.