A supervised machine learning model to predict internet use
Abstract
The Internet is a global network of billions of computers and other electronic devices. With the Internet, it is possible to access almost any information, communicate with anyone else in the world, and do much more (1). The Internet is central because it provides the means by which devices are connected. Studies have shown that developing countries with higher levels of connectivity significantly outperform those with limited connectivity (2). In a previous study of mine on factors determining internet use, school-going respondents were found to be more likely to have used the internet than those out of school, because students use the internet for research purposes. People with higher levels of education were also more likely to have used the internet than those with lower levels of education, and male respondents in Uganda were more likely to have used the internet than female respondents. This study aimed to use a supervised machine learning model to combine some of these determinants in order to predict whether someone has used the internet, based on their characteristics with respect to the specified determinants. Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving its accuracy (3). The model was developed using data collected from Uganda, officially the Republic of Uganda. The model predicted the correct class for the majority of the test data, with an accuracy of 84%, which was relatively good considering the level of imbalance in the dataset. The model predicted individuals who have never used the internet precisely, at the 92% level, and predicted those who have used the internet reasonably well, at the 79% level. False positives exist but are not common, especially for those who have never used the internet.
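Because the dataset is imbalanced, overall accuracy and per-class performance can tell different stories. The sketch below is only an illustration of that point: the helper `per_class_report` and the labels are hypothetical and made up for the example, and none of the numbers are the study's results.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return overall accuracy plus precision and recall for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    # pairs[(true, predicted)] counts how often each combination occurs.
    pairs = Counter(zip(y_true, y_pred))
    correct = sum(pairs[(c, c)] for c in labels)
    report = {"accuracy": correct / len(y_true)}
    for c in labels:
        true_pos = pairs[(c, c)]
        predicted = sum(pairs[(t, c)] for t in labels)  # rows predicted as c
        actual = sum(pairs[(c, p)] for p in labels)     # rows that truly are c
        report[c] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return report


# Made-up labels (not the study's data): 8 "never" and 2 "used".
y_true = ["never"] * 8 + ["used"] * 2
# A degenerate model that always predicts the majority class.
y_pred = ["never"] * 10
report = per_class_report(y_true, y_pred)
print(report)
```

In this toy case the always-"never" model reaches 80% accuracy while finding none of the respondents who used the internet, which is why per-class figures are worth reporting alongside accuracy.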
The model was strong at finding those who have ever used the internet, even though some of them had been randomly sampled from the same dataset for the test case. The project involved identifying an initial problem and a relevant public dataset, loading the data into memory and preprocessing it, and training the classifier on the data. Training was repeated until a good set of parameters was found. The model was then saved as a Python pickle file.
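The workflow described above (load, preprocess, train, repeat until the parameters are good, save as a pickle file) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation. The source does not name the classifier, so a small categorical naive Bayes written with the standard library stands in for it. The synthetic respondents, the feature names, the labelling rule in `make_example`, the train/validation/test split, and the candidate smoothing values are all hypothetical.

```python
import math
import os
import pickle
import random
import tempfile
from collections import Counter

EDUCATION = ["none", "primary", "secondary", "tertiary"]


def make_example(rng):
    """Draw one synthetic respondent; the labelling rule is invented for the demo."""
    in_school = rng.choice(["yes", "no"])
    education = rng.choice(EDUCATION)
    sex = rng.choice(["male", "female"])
    p_used = (0.1 + 0.3 * (in_school == "yes")
              + 0.15 * EDUCATION.index(education) + 0.05 * (sex == "male"))
    label = "used" if rng.random() < p_used else "never"
    return (in_school, education, sex), label


def train(rows, labels, alpha):
    """Fit a categorical naive Bayes model; returns a plain dict of counts."""
    counts = {}  # (feature index, class label) -> Counter of feature values
    values = {}  # feature index -> set of values seen in training
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            counts.setdefault((i, label), Counter())[v] += 1
            values.setdefault(i, set()).add(v)
    return {"alpha": alpha, "classes": Counter(labels),
            "counts": counts, "values": values}


def predict(model, row):
    """Return the class with the highest Laplace-smoothed log-posterior."""
    total = sum(model["classes"].values())
    best, best_score = None, -math.inf
    for label, n in model["classes"].items():
        score = math.log(n / total)
        for i, v in enumerate(row):
            seen = model["counts"].get((i, label), Counter())[v]
            k = len(model["values"][i])
            score += math.log((seen + model["alpha"]) / (n + model["alpha"] * k))
        if score > best_score:
            best, best_score = label, score
    return best


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted class matches the true label."""
    return sum(predict(model, r) == y for r, y in zip(rows, labels)) / len(labels)


# 1. Load and preprocess: synthetic rows stand in for the real dataset.
rng = random.Random(0)
data = [make_example(rng) for _ in range(1000)]
rows = [features for features, _ in data]
labels = [label for _, label in data]
train_X, train_y = rows[:600], labels[:600]
val_X, val_y = rows[600:800], labels[600:800]
test_X, test_y = rows[800:], labels[800:]

# 2. Train repeatedly, keeping the smoothing value that scores best on validation data.
candidates = [0.1, 1.0, 10.0]
best_alpha = max(candidates,
                 key=lambda a: accuracy(train(train_X, train_y, a), val_X, val_y))
model = train(train_X, train_y, best_alpha)
test_accuracy = accuracy(model, test_X, test_y)

# 3. Save the fitted model as a pickle file and confirm that it reloads intact.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(best_alpha, test_accuracy, restored == model)
```

Tuning on a validation split and scoring once on held-out test rows keeps the reported test accuracy from being flattered by the parameter search; whether the original project separated the two in this way is not stated in the text.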