# Introductory Machine Learning Algorithm

## Traditionally, we used hard code a series of steps or procedures, input data, and program will output results. As machine learning evolves, we apply algorithm to learn from historical data and it tells the program what to do and how to complete the task. Can you see the sequence of intermediate actions are different?

Hard coding instruction may induce some limitations, we might miss some useful information or the thoughts are bounded. For instant, we are trying to find the rules to produce an accurate prediction for future events from the data available. The very first step in the process is asking yourself some questions, what do you going to learn? What are you going to predict? Do you have data available? What kinds of data you have? What types of learning problem is it? These questions are actually interrelated. Answer to the former question gives you clue about the next. Is it a supervised or unsupervised problem? If it is a supervised problem, is it a classification or regression problem? Here is some introduction about supervised and unsupervised learning. Next is the brief introduction about the most basic algorithm in machine learning.

**Decision Tree**

Decision tree is the most classic learning model in machine learning, it is similar to the dive and conquer algorithm in computer science. Typically, throughout the learning process, we are trying to figure out answers to question: What question to ask? In what order should we ask? What is the prediction?

The tree is branched such that information gain is maximized at which split. Maximizing information gain at the sense that we are minimizing node impurity. Two most commonly used impurity measure are entropy and Gini index. The node with only single class samples present is pure.

Algorithm:

- Compute impurity for all features.
- Split the tree by features with minimum impurity.
- Repeat the steps 1–2 with remaining features until arrive at the leaf nodes with prediction for each path.

We can decide to stop branching when the impurity is less than or equal to the minimum impurity predefined. There is not always an attainable pure leaf node. Insisting to build tree with all pure leaf nodes might reduce the generalization of model, as some of the impurity might just an error in the data collection and recording process or simply an outlier. Decision tree is non parametric model applicable for both classification and regression problem. Decision tree is rather intuitive and easy to interpret, and there is not much preprocessing procedure is needed, such as scaling and normalization. Decision tree with numerical features is slightly more complicated and time consuming to compute by hand. You may find application of decision tree here.

**K-Nearest Neighbors**

KNN is a non-parametric supervised machine learning algorithm which is easy to implement but may be computational inefficient and time consuming when we have a large dataset. It is a lazy learner, there is no training phase. It memorizes all data points and rerun the entire process as we execute modelling instruction. KNN works such that we are searching for k nearest points and classify it to the group with majority votes.

Algorithm:

- define k.

2. Compute distance between x and other data points. Note the k nearest points.

3. Assign label y to x by majority vote out of those k neighbors.

k is the hyperparameter of this model. It is chosen by plotting training error and validating error versus k. As k increases, it is making sense that training error will continue decreasing until it attain zero error leading to overfitting problem. At the same time, validation error will decrease until there comes a point it will revert and increase again. That should be the k chosen. We are aware that the turning point is where overfitting problem comes in. KNN is applicable to both classification and regression problem. Nonetheless, it is more common in classification problem. For regression, the process is pretty similar, instead of major voting, it averaging the labels of samples in its neighborhood.

**Perceptron**

Perceptron is a linear boundary separating samples into two groups, or we called it as binary classifier. Let’s say we have features {x1,…, xn}, model assigns weights to each features, {w1, …, wn, bias}. Label is predicted according to the weighted sum of features values. If f(x) ≥ z, we label as “1”else as “0”. Typically, we have separator such as below:

function 0, 1 assign

Algorithm:

- Initiate weights vector randomly.
- Compute weighted sum of features values.
- Assign label. Update weight vector if it is misclassified.
- Repeat steps 2–3 until weight vector converged.

The weight vector is initiated randomly here as the final weight vectors may varied with different initializing weights. When we update {w1, …, wn}, the boundary will be rotated and shifted as we updated bias.

**Naive Bayes**

Recall Bayes Theorem:

Naive Bayes classifier is rather straight forward model implementing Bayes’ theorem, prior probability and posterior probability concepts. Nevertheless, problem arises if there is assumptions violation. It assumes each feature are independent to each other and contribute equally to the response variable. There are typically two scenarios, whether we make some assumption about type of distribution for features. If distribution assumption is specified, conditional probability needs to be replaced with relevant distribution function.

**Neural Network**

This is the starting point where we learn deep learning. Neural network is basically multilayer perceptrons with a number of perceptrons arranged in layers. It is applicable in a wide range of tasks. The idea is inspired by interconnected neurons in human brain. It can make good prediction but the intermediate steps and result have limited interpretability. Training a neural network involves two phases, forward and backward propagation. The former is the process where input is fed into the model and it will travel along the network and output an prediction. The later process is when information gained from mis-predicted output propagated backwards from output layers and weights vectors for each layers are updated accordingly by applying gradient descent method. There are other optimizing algorithms available as you dive deeper into the context about neural network.

**K-Means Clustering**

This is non-parametric unsupervised learning, where we are grouping nearest data points in a n-dimension space into k clusters based on distance. We need to pay attention to the measurement scales as this is distance based model. There is a range of distance metrics to be chosen from, such as Euclidean distance, Manhattan distance and Chebyshev distance and so on. We are assuming samples within each cluster are exhibiting high similarity.

Algorithm:

- Decide on k.
- Randomly selecting k data points as centroids.
- Computing distance of samples from each centroids. Assign the samples to nearest centroid.
- Reassign centroids by getting mean of samples in each cluster.
- Repeat steps 3 and 4 until the centroids converged.

The biggest challenge of K-Means clustering is to select appropriate k. You can select k from Elbow Method or Silhouette Method, there is an demo for K-Means clustering here illustrating how the decision on k. You should be note that, this model is not robust to outliers. Hence, result might be less convincing with presence of outliers. Extension to K-Means model, K-Medoids will be a better option.