What is Supervised, Semi-Supervised and Unsupervised Learning?
Kick-start machine learning with the ideas of supervised, semi-supervised and unsupervised learning. The first question that should come to mind when you are given a dataset is: is it labeled, unlabeled, or partially labeled?
Introduction
Traditionally, a human told a program the rules for doing a job, and the program executed those instructions. Now we feed the machine data; the algorithm learns from it and comes up with a set of rules, producing a program that completes the task by applying them. During my university studies, my machine learning course began by introducing supervised and unsupervised learning. Soon after, I was introduced to ‘semi-supervised learning’. Identifying the type of learning is the first step in tackling a problem.
Supervised Learning
A labeled dataset is a set of data with predictors (input variables) and outputs (response/target variables). Supervised learning is where you have such a dataset and are searching for the function that best maps the predictors to the target. Because the response variables are available, we can compare each prediction with the actual label and adjust the model to reduce mispredictions.
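To make the compare-and-adjust loop concrete, here is a minimal sketch that fits a straight line to a tiny labeled dataset by gradient descent. The data values, the function names and the learning-rate and step settings are all made up for illustration; this is one simple way to search for a mapping function, not the only one.

```python
# Toy labeled dataset: each predictor x comes with a known target y.
# The values are invented for illustration and follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared gap between predictions w*x + b and the true labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(learning_rate=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Compare each prediction with its label to get the errors.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the parameters in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_line()
print(round(w, 3), round(b, 3))
print(mean_squared_error(w, b))
```

The error starts large with w = 0 and b = 0 and shrinks as the parameters are adjusted, which is the "modify to reduce misprediction" idea in its simplest form.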
A supervised learning problem can be split further into classification or regression according to the type of output variable. A numeric output variable, whether discrete (such as a count) or continuous, makes it a regression problem. A categorical output variable makes it a classification problem. The algorithm applied should suit the category of learning problem. Figure 1 gives examples for each category.
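As a rough illustration of how the output type changes the task, the sketch below uses the same nearest-neighbour idea twice: once with a categorical target (classification, by majority vote) and once with a numeric target (regression, by averaging). The house-size data and all function names are hypothetical.

```python
from collections import Counter

def nearest_neighbors(train, query, k):
    """Return the targets of the k training points closest to query (1-D inputs)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]

def knn_classify(train, query, k=3):
    """Classification: the target is categorical, so take a majority vote."""
    votes = Counter(nearest_neighbors(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: the target is numeric, so average the neighbours' values."""
    targets = nearest_neighbors(train, query, k)
    return sum(targets) / len(targets)

# Hypothetical data: house size in square metres is the single predictor.
size_to_category = [(30, "small"), (35, "small"), (40, "small"),
                    (90, "large"), (100, "large"), (110, "large")]
size_to_price = [(30, 100.0), (35, 110.0), (40, 120.0),
                 (90, 300.0), (100, 320.0), (110, 340.0)]

print(knn_classify(size_to_category, 95))  # prints a category label
print(knn_regress(size_to_price, 95))      # prints a number
```

The predictor and the neighbour search are identical in both cases; only the way the neighbours' targets are combined differs.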
The ultimate goal of supervised learning is to build a model that generalizes well to future unseen data. This raises the concern of overfitting, where a model fits the training data closely but performs poorly on new data. We are not going to dive further into that topic; a brief introduction can be found here.
Unsupervised Learning
When we have unlabeled data, we are handling an unsupervised learning problem. Without labels, we cannot directly check whether we are clustering the data points (tuples) correctly. The challenges for unsupervised learning are therefore deciding when to stop learning and evaluating the model built. We are not making predictions here, as the ‘machine’ is not taught how to predict. Instead, we are exploring and reporting the underlying insights and structure of the data. Three typical areas that apply unsupervised learning are clustering, dimensionality reduction and association analysis.
Clustering tries to capture the common characteristics of the tuples in a dataset and group them according to their similarities. Association analysis is conducted to find interesting hidden rules or relationships in the data, such as items that tend to occur together. When a dataset has a large input dimension, we look for a smaller set of input variables that simplifies or reduces the original input dimension while preserving as much information as possible. This is where dimensionality reduction is applied, as a crucial feature engineering step for more time-efficient training.
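The clustering idea can be sketched with k-means (Lloyd's algorithm), one common clustering method among many. The sample points, the function name and the choice to use the first k points as starting centroids are assumptions made to keep the example short and deterministic.

```python
import math

def kmeans(points, k, iterations=100):
    """Group 2-D points into k clusters with Lloyd's algorithm.

    The initial centroids are simply the first k points, an assumption
    made for determinism; practical code usually randomizes this.
    """
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when the grouping no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids

# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are used anywhere: the algorithm only looks at distances between points, and it stops when the cluster assignments stop changing, which is one practical answer to the question of when to stop learning.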
Semi-Supervised Learning
This kind of learning lies between supervised and unsupervised learning. The data is a mixture of labeled and unlabeled examples, with a larger proportion of the latter. Most data available today is of this form, as labeling a large dataset is costly, time-consuming, and requires expertise. One common approach to such data is to apply an unsupervised learning technique followed by supervised learning: the data is divided into clusters, and the missing labels are predicted. It is assumed that data points in the same cluster share the same or a similar label, so a basic way to obtain the labels is by voting (for categorical labels) or averaging (for numeric labels) among the data falling in the same cluster.
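The voting step can be sketched as follows, assuming the clustering has already been done and each point carries a cluster id. The function name, the cluster ids and the ‘cat’/‘dog’ labels are hypothetical; unlabeled points are marked with None.

```python
from collections import Counter, defaultdict

def fill_missing_labels(cluster_ids, labels):
    """Give each unlabeled point (None) the majority label of its cluster.

    cluster_ids[i] is the cluster of point i, assumed to come from an
    earlier clustering step; labels[i] is its known label or None.
    """
    # Count the known labels inside each cluster.
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster][label] += 1
    filled = []
    for cluster, label in zip(cluster_ids, labels):
        if label is None and votes[cluster]:
            label = votes[cluster].most_common(1)[0][0]
        filled.append(label)  # stays None if the cluster has no labeled point
    return filled

cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["cat", None, "cat", "dog", None, "dog", None]
filled = fill_missing_labels(cluster_ids, labels)
print(filled)
```

Points that already have a label keep it, even when it disagrees with the cluster majority. For numeric targets, the vote would be replaced by the mean of the known values in the cluster. The filled-in dataset can then be passed to an ordinary supervised learner.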
Conclusion
Supervised learning works with labeled data and aims to develop predictive capability. Unsupervised learning is a discovery process, diving into unlabeled data to capture hidden information. Semi-supervised learning is a blend of supervised and unsupervised learning.