Time series analysis is applicable in numerous industry, such as business, economic, financial and even healthcare. Scientists were doing research on this topic since 19th century. As the name suggests it is highly dependent on point of time when it is collected. Have you ever thought of what make it a special kind of dataset and what makes it different from regression problem?

Time series is a collection of ordered data points collected sequentially over a period of time. Generally, it is collected at regular intervals. It is special as data points are not independent, we expect a certain degree of serial correlation. There are no independent variables and time-dependent make it distinct from the regression problem. Time series forecasting is about predicting the future based on historical data by extracting useful statistics and characteristics within data. We will be going through time series description, analysis, and modeling in this piece of writing.

Time series is mainly comprising of three components:

  1. Trend —…

Density-Based Spatial Clustering Application with Noise (DBSCAN), an award-winning clustering algorithm that catches our eyes. Understanding what is DBSCAN and its application in customer segmentation, a critical area in business analytics.

What is clustering?


We always try to focus on a subset rather than broad coverage of potential customers to optimize our business objectives by shrinking our target customers through customers segmentation effort. Recall customer segmentation via centroid-based clustering — K-Means, discussed in previous post, there are some drawbacks associated with model applied that we may wish to minimize or even eliminate with goal of proposing more flexible or robust model as an alternative. In this piece of writing, we will be going through a density-based solution — DBSCAN, that overcome the issues.


Looking at clustering result from K-Means, every data…

Application of Multivariate Analysis and Classification on R. Forecasting profitability and enhancing sustainability through the weights and gender prediction by inspecting its physical size.

Abalone diver, Tasmania. Photo: Stuart Gibson


Abalone is a type of marine snail with high nutrition values and economic values, almost whole abalone, including viscera and shell can be processed and serving as sources of income for fishing industry. High market demand for abalones has led to overexploitation, raising public’s concerns on environmental issues. Governments enforce strict law and regulations on abalones harvesting to ensure sustainability of abalones. Analyzing data to find relationship between multiples variables in data collected to aid innovation ideas in designing equipment as solution to both the profitability and sustainability through instant result provided underwater.

Exploratory Analysis

Dataset adopted here is…

How do we define ‘churned customer’? How to initialize a churn model building process when you have unlabeled or unsupervised data? Defining ‘Churned’ for Unlabeled Dataset might be more challenging than building a supervised predicting model with acceptable accuracy.

Often, when we try to build a machine learning model for churn prediction, we are provided with supervised dataset where ‘churn customer’ are identified and labelled. Why does business implement subscription, membership and contract based business model? Besides information collection including personal details and transaction details, customers management and spending behavioral analysis, they can identify leaving customers with certainty and apply the dataset in future to figure out factors affecting customers’ propensity to terminate services. Unrenewed membership, terminated subscription and contracts are the best churn indicators. Nonetheless, there are cases where data available is below our expectation. …

Traditionally, we used hard code a series of steps or procedures, input data, and program will output results. As machine learning evolves, we apply algorithm to learn from historical data and it tells the program what to do and how to complete the task. Can you see the sequence of intermediate actions are different?

Hard coding instruction may induce some limitations, we might miss some useful information or the thoughts are bounded. For instant, we are trying to find the rules to produce an accurate prediction for future events from the data available. The very first step in the process is asking yourself some questions, what do you going to learn? What are you going to predict? Do you have data available? What kinds of data you have? What types of learning problem is it? These questions are actually interrelated. Answer to the former question gives you clue about the next. Is it a…

Emphasizing customer retention as much as exploring new potential customers ensuring business sustainability and growth. Building a churn predicting model to identify underlying churning factors and customers inclining to leave. Choosing the right evaluation metrics helps to build more practically useful model.

Telecommunication industry has been showing exponential growth in line with rising demand following technology advancement. The competitions among services providers are so fierce that they are executing different strategies to meet the customers’ needs. Effort in retaining existing customers is now as important as searching for new customers.

Exploratory Analysis

The dataset has 7032 instances and 21 columns, comprised of ID information, 3 numerical attributes, 16 categorical attributes and target (‘Churn’) column. There is no missing value.

Have you ever thought of how and where to apply ANOVA? We have to make dozens of decisions every day, how do we judge which is a better one among all options available? Or, they are equally good? ANOVA is the answer that helps us to make a wiser decision.

Z-test or t-test come in place when comparing means of one to two populations. But, problem of error rate or Type I error (alpha) compounding arises in scenario of comparing more than two means. Let’s say we are testing 3 populations at alpha=0.05, applying three t-test resulted true alpha level in computation to be more than 0.05 but less than 0.15. ANOVA, a basic statistics analysis that is applicable to conduct hypothesis testing such that null hypothesis states all populations means are equal at predefined alpha level, eliminating compounding effect.

Rapid advancement of technologies has led to information explosion. We are leaving our ‘footprint’ for all interactions done online. Surging availability of information creates challenges in data storage and management, and thought of searching the values of data. Let’s demonstrate designing process of a simple database for a website.

Relational database is a better option than spreadsheet to work with huge dimensions data. We might be facing replication, redundancy and inconsistency with spreadsheet. A systematic data storage allows more efficient and effective information management and retrieving process as compared to manual operation on spreadsheet. (Imagine a dataset with thousands of columns.) We are working to design a relational database that organize data in tables and is able to link to other tables by applying data modeling technique, ER modeling through a series of steps, conceptual, logical and physical data models. Let’s understand some simple terms for ER modeling:


What are the similar behaviors of your customers? What are the answers to questions in business? Customers segmentation is the solution but how we do it? How can you make use of it for decision making? RFM analysis is applied here with Python, exhibiting its simplicity and use of most basic set of information available with purchasing records.


How much do you spend to attract new customers, as compared to the expenses on retaining the existing? To sustain and expand business, one should realize being able to retain existing customers is as important as exploring new customers. If the rate of customers leaving is greater than rate of new customers entering, our customers database is actually shrinking. To certain extend, we see customers retaining effort outweighs searching for new potential customers.

Not every deal is profitable, not all the customers are financially attractive to the business. It is crucial to ensure resources allocated or deployed are in…

What is bias, variance? How to interpret learning curve? How do we diagnose bias and variance? And, what should we do to deal with it?


In supervised learning, we have target variables provided to be compared with prediction for judging model performance. We assume there is a unknown model, f, that best describe the data, our task is to find the estimate of f. The main sources of learning error in a model is noise, bias, and variance. Noise is irreducible by the learning process .Our goal is always to build a model with good generalization capability beyond training data.

Is oversized good? (from disneyclips.com)


Bias evaluates model learning ability, computing difference between true values and predicted values. Under most circumstance, we try to make some assumption about…



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store