dc.description.abstract | With the ability to gather massive amounts of data across many domains, data is now collected at an unprecedented rate, and the analysis of this data, rather than its storage, has become the challenge (Hastie et al., 2009). This data comprises both labeled data, that is, pieces of data tagged with one or more labels identifying properties, characteristics, or classifications of objects, and unlabeled data, which Sydorenko (2020) describes as pieces of data that have not been tagged with such labels. Unlabeled data includes photos, audio, videos, news articles, tweets, and x-rays (in medical settings), among others, and it is accumulating rapidly owing to the increased use of the internet.
In the big data era, the need for fast, robust machine learning techniques is rapidly increasing, yet the exponential growth of today’s data sources has exposed traditional machine learning (ML) techniques as susceptible to poor scalability, loss of robustness, and redundancy (Nadine Hajj, Rizk Yara, & Mariette, 2015). Powerful algorithms capable of extracting hidden structures from large datasets are hence a necessity, especially for the unsupervised learning approach to machine learning.
This research paper aims to show how unsupervised learning techniques such as clustering can be applied to extract existing patterns and relations in the US Census 1990 dataset, which is unlabeled yet structured. For the purposes of this study, the k-modes clustering algorithm, chosen because the data is categorical, and hierarchical clustering alongside principal component analysis (PCA) are demonstrated. | en_US |