Data Mining Homework Solution

Data Mining and Regression

These questions cover a range of data mining and regression sub-topics, including:

Training and test sets
Data reduction
Sampling
Data splitting and resampling
Regression

Training and Test Sets

What are training sets and test sets used for, respectively? If a dataset is split by assigning 75% to one set and 25% to the other, should the training set receive the 75% or the 25%?

Ans: A training set is the known sample used to fit the model so that it can learn its parameters. A test set is used to evaluate the model on out-of-sample examples that were not used in training, in order to estimate its real-world performance. The 75% portion should go to the training set so that the parameters can be estimated reliably.
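As an illustration of the 75/25 split, here is a minimal sketch in plain Python. The helper name `train_test_split`, the fixed seed and the stand-in data are assumptions made for this example, not part of the original solution.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))            # stand-in for 100 observations
train, test = train_test_split(data)
print(len(train), len(test))       # sizes of the two sets
```

Shuffling before cutting matters when the rows are stored in some meaningful order; otherwise the test set would not resemble the training set.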

Data Reduction

Removing predictor(s) is generally known as a data reduction technique. Explain under what conditions we should consider removing predictors.

Ans: Predictors can be removed under certain conditions such as:

a) The predictor adds no logical value to the problem, e.g. a name or serial number.

b) The predictor duplicates information already covered by another predictor.

c) The predictor has many missing values, which may lead to a poor fit.
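Conditions (b) and (c) can be screened for mechanically; condition (a) needs human judgement. The sketch below is one possible way to do that screening. The function names, the thresholds (0.5 for missingness, 0.95 for correlation) and the toy columns are all assumptions chosen for illustration.

```python
import math

def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation over rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0                 # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def predictors_to_drop(columns, max_missing=0.5, max_corr=0.95):
    """Flag predictors that are mostly missing or near-duplicates of a kept one."""
    kept, dropped = [], []
    for name, values in columns.items():
        too_sparse = missing_fraction(values) > max_missing
        redundant = not too_sparse and any(
            abs(pearson(values, columns[other])) > max_corr for other in kept
        )
        if too_sparse or redundant:
            dropped.append(name)
        else:
            kept.append(name)
    return dropped

# Hypothetical toy predictors
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # same information as height_cm
    "income": [None, None, None, 40, None],        # mostly missing
    "age": [23, 51, 37, 29, 44],
}
dropped = predictors_to_drop(columns)
print(dropped)
```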

Sampling

What is the difference(s) between simple random sampling and stratified random sampling?

Ans: Simple random sampling selects k out of n objects at random. Under this scheme, every possible sample of size k has an equal probability of being selected.

In stratified sampling, the population is divided into well-defined groups (strata), simple random sampling is carried out within each stratum, and the draws are combined into the sample. This usually represents the population better, especially when classes are imbalanced.
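The difference between the two schemes can be sketched as follows. The helper names, the 20% sampling fraction and the imbalanced toy population (90 "neg" and 10 "pos" items) are assumptions made for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, rng):
    """Draw k items so that every subset of size k is equally likely."""
    return rng.sample(items, k)

def stratified_sample(items, stratum_of, fraction, rng):
    """Draw the same fraction from each stratum by simple random sampling."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)
population = [("neg", i) for i in range(90)] + [("pos", i) for i in range(10)]

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(population, lambda item: item[0], 0.2, rng)

# The stratified sample always keeps the 90/10 class ratio;
# the simple random sample only does so on average.
print(sum(label == "pos" for label, _ in srs))
print(sum(label == "pos" for label, _ in strat))
```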

Why is model tuning necessary for predictive modelling?

Ans: Hyperparameters control the overall behaviour of a machine learning model and are not learned directly from the training data. The goal of tuning is to find the combination of hyperparameters that minimizes a predefined loss function. Many candidate models can be built for any task, so the hyperparameters must be tuned to obtain the best-performing model for the problem at hand.

Predictive Model Building

Use your words to describe the process of building predictive models considering data splitting and data resampling (referring to the graph below).

Ans: The steps of model building are outlined below:

Step 1: Select/Get Data

Step 2: Data cleaning/Data pre-processing

Step 3: Data splitting: Into training and test sets

Step 4: Split the training set into training and validation sets

Step 5: Model Selection and Develop Models (Training)

Step 6: Parameter tuning and optimization (on the validation set)

Step 7: Testing and model performance evaluation
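The steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the data are synthetic (y = sin(x) plus noise), the model is a one-predictor k-nearest-neighbour regressor, the candidate values of k are arbitrary, and a single validation split stands in for the resampling step.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the y values of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, points, k):
    """Root mean squared error of k-NN predictions on the given points."""
    squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(1)

# Steps 1-2: get data (synthetic and already clean in this sketch)
xs = [rng.uniform(0, 6) for _ in range(200)]
data = [(x, math.sin(x) + rng.gauss(0, 0.2)) for x in xs]

# Step 3: split into training (75%) and test (25%) sets
rng.shuffle(data)
train_full, test = data[:150], data[150:]

# Step 4: hold out part of the training set for validation
train, valid = train_full[:110], train_full[110:]

# Steps 5-6: fit each candidate model and tune k on the validation set
candidates = [1, 3, 5, 9, 15, 25]
scores = {k: rmse(train, valid, k) for k in candidates}
best_k = min(scores, key=scores.get)

# Step 7: evaluate the chosen model once on the untouched test set
test_rmse = rmse(train_full, test, best_k)
print(best_k, round(test_rmse, 3))
```

The test set is used only once, after tuning is finished, so its error remains an honest estimate of out-of-sample performance.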

Linear Regression

List three linear regression models we learned in class. What metrics can be used to compare the linear model predictive performance?

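Ans: The regression models are ordinary least squares regression, kernel regression, k-NN regression and the MARS model. Predictive performance can be compared using metrics such as the root mean squared error (RMSE) and R^2, computed on held-out data.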
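For the metrics part of the question, a minimal sketch of three commonly used measures is given below, assuming RMSE, MAE and R^2 are the ones intended. The function names and the small example vectors are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]      # observed outcomes (made up)
y_pred = [2.5, 5.5, 7.0, 9.0]      # predictions from some model (made up)

print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

Lower RMSE and MAE and higher R^2 indicate better predictions; the comparison is only fair when every model is scored on the same held-out data.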

What are the two tuning parameters associated with the Multivariate Adaptive Regression Splines (MARS) model? How are the optimal values for the tuning parameters determined?

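Ans: The two tuning parameters are degree (the maximum degree of interaction between terms) and prune (the number of terms retained after pruning). Both are chosen by comparing model performance on the validation set across candidate values.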

Define the K-Nearest Neighbours (KNN) regression method and indicate whether the predictors need pre-processing before KNN is performed.

Ans: KNN regression is a non-parametric method that approximates the relationship between the independent variables and a continuous outcome by averaging the outcomes of the observations in the same neighbourhood. The neighbourhood size k can be set by the analyst or chosen by cross-validation to minimise the mean squared error. Pre-processing is needed: because predictions depend on distances between observations, the predictors must be numeric and on comparable scales, so we centre and scale the data.
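The effect of centring and scaling can be seen in a small sketch. Everything here is an assumption for illustration: the helper names, the two hypothetical predictors (age in years, income in dollars), the outcome values and the choice k = 1.

```python
import math

def standardize(rows):
    """Centre and scale each column; also return the means and SDs used."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
    return scaled, means, sds

def knn_regress(X, y, query, k):
    """Predict by averaging the outcomes of the k nearest rows (Euclidean)."""
    by_distance = sorted((math.dist(row, query), target) for row, target in zip(X, y))
    return sum(target for _, target in by_distance[:k]) / k

# Hypothetical data: [age in years, income in dollars]
X = [[20, 1000], [22, 5000], [60, 1100], [62, 5100]]
y = [10.0, 20.0, 30.0, 40.0]
query = [25, 1100]

# Raw distances are dominated by income because its values are much larger
raw_pred = knn_regress(X, y, query, k=1)

# After scaling, both predictors contribute comparably to the distance;
# the query is transformed with the training means and SDs
scaled_X, means, sds = standardize(X)
scaled_query = [(v - m) / s for v, m, s in zip(query, means, sds)]
scaled_pred = knn_regress(scaled_X, y, scaled_query, k=1)

print(raw_pred, scaled_pred)
```

With these made-up numbers the nearest neighbour changes once the predictors are scaled: on the raw data the 60-year-old with a similar income is closest, while after scaling the 20-year-old is.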
