dc.description.abstract | In Uganda, native languages, despite being understood by the majority of the population, could entirely disappear from online news spaces as English becomes dominant. This dominance is aided by intelligent models for processing news in English; native languages, by contrast, rely on manual practices for publishing news online, which result in significant time delays.
Several efforts have aimed to build such intelligent models for native languages, with particular interest in Swahili, which is spoken by over 100 million people and is Africa's most widely spoken native language. Native languages nonetheless remain low-resource in terms of data for training language processing models. There is a need for deliberate effort to collect more news data in native languages and, subsequently, to build better-performing models from it.
Our case study was the Swahili language, for which we collected more news text data in addition to that available in the literature. This data was labelled with six (6) news categories. We built five (5) multi-label classification models for Swahili news text. We evaluated these models, and our best model achieved better performance than those in the literature. We were also able to explain this model's performance in the context of bias within the data, which led to the small amount of confusion in classifying the test data. The best model was deployed in a real-time web application.
The Swahili language, being widely spoken, is a fair representation of native languages; this project therefore contributes substantially to the body of work aimed at increasing the usage of native languages in online news media.
We recommend that more data, covering more news categories and local news scenarios, be collected as a step towards enabling transfer learning from existing language models. | en_US