A time series is a collection of data points collected sequentially over a period of time, generally at regular intervals. It is special because the data points are not independent: we expect a certain degree of serial correlation. The absence of independent variables and the dependence on time distinguish it from a regression problem. Time series forecasting is about predicting the future from historical data by extracting useful statistics and characteristics from the data. In this post, we will go through time series description, analysis, and modeling.
A time series mainly comprises three components:
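To make the idea of serial correlation concrete, here is a minimal sketch that computes the sample autocorrelation of a series at a given lag. The function name `autocorrelation` and the toy series are illustrative assumptions, not part of any dataset used in the post.

```python
from typing import Sequence


def autocorrelation(series: Sequence[float], lag: int = 1) -> float:
    """Sample autocorrelation of `series` at the given lag (hypothetical helper)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Total variation around the mean (denominator of the estimator).
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Co-variation between each point and the point `lag` steps later.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


if __name__ == "__main__":
    trend = [float(i) for i in range(1, 11)]    # steadily increasing toy series
    zigzag = [1.0, -1.0] * 5                    # alternating toy series
    print(autocorrelation(trend, lag=1))
    print(autocorrelation(zigzag, lag=1))
```

A value far from zero at lag 1 suggests that neighbouring observations carry information about each other, which is what separates a time series from a sample of independent draws.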
We usually focus on a subset of potential customers rather than broad coverage, narrowing our target customers through customer segmentation to optimize our business objectives. Recall the customer segmentation via centroid-based clustering (K-Means) discussed in the previous post: the model has some drawbacks that we may wish to minimize or even eliminate by proposing a more flexible or robust alternative. In this post, we will go through a density-based solution, DBSCAN, that overcomes these issues.
Looking at the clustering result from K-Means, every data…
Abalone is a type of marine snail with high nutritional and economic value. Almost the whole abalone, including the viscera and shell, can be processed and serve as a source of income for the fishing industry. High market demand for abalone has led to overexploitation, raising public concern about environmental issues. Governments enforce strict laws and regulations on abalone harvesting to ensure the sustainability of abalone populations. Analyzing the relationships between multiple variables in the collected data can aid ideas for designing equipment that provides instant results underwater, as a solution for both profitability and sustainability.
The dataset adopted here is…
Often, when we try to build a machine learning model for churn prediction, we are provided with a supervised dataset in which ‘churned customers’ are identified and labelled. Why do businesses implement subscription-, membership-, and contract-based business models? Besides collecting information such as personal and transaction details, managing customers, and analyzing spending behavior, they can identify leaving customers with certainty and later use the dataset to figure out the factors affecting customers’ propensity to terminate services. Unrenewed memberships and terminated subscriptions and contracts are the best churn indicators. Nonetheless, there are cases where the available data falls short of our expectations. …
Hard-coded instructions can be limiting: we might miss useful information, or our thinking may be bounded. For instance, suppose we are trying to find rules that produce accurate predictions of future events from the available data. The very first step in the process is to ask yourself some questions: What are you going to learn? What are you going to predict? Do you have data available? What kinds of data do you have? What type of learning problem is it? These questions are interrelated: the answer to one gives you a clue about the next. Is it a…
The telecommunication industry has been showing exponential growth in line with rising demand following technological advancement. Competition among service providers is so fierce that they are executing different strategies to meet customers’ needs. Retaining existing customers is now as important as acquiring new ones.
The dataset has 7032 instances and 21 columns, comprising ID information, 3 numerical attributes, 16 categorical attributes, and the target (‘Churn’) column. There are no missing values.
A z-test or t-test is used when comparing the means of one or two populations. However, the Type I error rate (alpha) compounds when more than two means are compared. Say we are testing 3 populations at alpha = 0.05: applying three pairwise t-tests results in a true overall alpha level of more than 0.05 but less than 0.15. ANOVA is a basic statistical analysis for hypothesis testing in which the null hypothesis states that all population means are equal, tested once at the predefined alpha level, which eliminates the compounding effect.
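The compounding of alpha can be illustrated with a short calculation. This is a minimal sketch that assumes the tests are independent, which pairwise t-tests on shared groups are not exactly, so the first number is an approximation; the function names are hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across n_tests,
    ASSUMING the tests are independent (an approximation here)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_upper_bound(alpha: float, n_tests: int) -> float:
    """Upper bound on the family-wise error rate that holds
    regardless of dependence between the tests."""
    return min(1.0, alpha * n_tests)


if __name__ == "__main__":
    alpha = 0.05
    n_pairwise_tests = 3  # 3 populations -> 3 pairwise comparisons
    print(familywise_error_rate(alpha, n_pairwise_tests))
    print(bonferroni_upper_bound(alpha, n_pairwise_tests))
```

Under the independence assumption the overall error rate is 1 - 0.95^3, roughly 0.143, which sits between the per-test level of 0.05 and the upper bound of 0.15.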
A relational database is a better option than a spreadsheet for working with data of very high dimensionality. With a spreadsheet, we may face replication, redundancy, and inconsistency. Systematic data storage allows more efficient and effective information management and retrieval than manual operations on a spreadsheet. (Imagine a dataset with thousands of columns.) We will design a relational database that organizes data in tables that can be linked to one another, by applying a data modeling technique, ER modeling, through a series of steps: conceptual, logical, and physical data models. Let’s go over some simple terms for ER modeling:
How much do you spend to attract new customers compared with what you spend on retaining existing ones? To sustain and expand a business, one should realize that retaining existing customers is as important as acquiring new ones. If customers leave at a greater rate than new customers join, the customer base is actually shrinking. To a certain extent, the customer retention effort outweighs the search for new potential customers.
Not every deal is profitable, and not all customers are financially attractive to the business. It is crucial to ensure resources allocated or deployed are in…
In supervised learning, target variables are provided so that predictions can be compared against them to judge model performance. We assume there is an unknown model, f, that best describes the data; our task is to find an estimate of f. The main sources of learning error in a model are noise, bias, and variance. Noise is irreducible by the learning process. Our goal is always to build a model with good generalization capability beyond the training data.
Bias evaluates a model’s learning ability by computing the difference between true values and predicted values. Under most circumstances, we try to make some assumption about…