Data Mining Review Questions / XLMiner Labs




Chapter 2 – Overview of the Data Mining Process



  1. Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning (textbook reference – 2.1).

Supervised learning applies data mining techniques with the aim of predicting or classifying a specific outcome for a set of data. Unsupervised learning, by contrast, examines a body of data for purposes other than prediction or classification, for instance, establishing relationships between items or segmenting the data.

  1. Deciding whether or not to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).

Supervised learning, since classification methods are used to distinguish between customers who defaulted on their loans and those who serviced them successfully; these outcomes are known from the database of prior customers.

  1. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns of prior transactions.

Unsupervised learning, because there is no apparent outcome: one cannot clearly tell whether the recommendations were adopted or not.

  1. Identifying a network data packet as dangerous (e.g., virus, hacker attack) based on comparison to other packets whose threat status is known.

This is supervised learning, since classification methods are used to distinguish between categories of data, for example packets that contain a virus and those that do not.

  1. Identifying segments of similar customers.

Unsupervised learning, because there is no known outcome.

  1. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms.

This is supervised learning, since the outcome (bankrupt or not) is known for the similar firms whose financial data are used for comparison.



  1. Estimating the repair time required for an aircraft based on a trouble ticket.

This is supervised learning, since actual historical repair times are very likely to be known.

  1. Automated sorting of mail by zip code scanning.

Supervised learning, because one is likely to know whether the sorting was done correctly.

  1. Printing of customer discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.


This could be either supervised or unsupervised learning. If association rules are used to decide which coupons to print, based on what the customer just bought and what others have bought previously, the task is unsupervised; given the wording of the question, this is the more natural answer. However, if the aim is to estimate the probability that a customer will use the coupon on a subsequent purchase, the problem becomes one of classification and is therefore a supervised learning task.

  1. Describe the difference in roles assumed by the validation partition and the test partition (textbook reference – 2.2).

The validation partition is used to assess the performance of each supervised learning model so that the models can be compared and the best one picked. In some algorithms, for instance classification and regression trees, the validation partition may also be applied in an automated fashion to tune and improve the model. Validation data are therefore actually used in building and improving the model. The test partition is used to assess the final chosen model on data that played no part in building or selecting it.
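The division of roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the 50/30/20 split is arbitrary, and `fit_threshold` is a hypothetical stand-in for whatever learning algorithms are being compared. It does not reflect how XLMiner works internally.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle records and split them into training, validation, and test partitions."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def error_rate(model, rows):
    """Fraction of (x, y) rows that the model misclassifies."""
    return sum(model(x) != y for x, y in rows) / len(rows)

def fit_threshold(train, step):
    """Hypothetical learner: pick the cutoff (a multiple of `step`) with the lowest training error."""
    candidates = [i * step for i in range(round(1 / step) + 1)]
    best = min(candidates, key=lambda c: error_rate(lambda x: int(x > c), train))
    return lambda x: int(x > best)

# Synthetic data (an assumption): y = 1 when x > 0.5, with 10% of labels flipped.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, valid, test = partition(data)

# Training partition: fit each candidate model.
models = {step: fit_threshold(train, step) for step in (0.5, 0.1, 0.01)}

# Validation partition: compare the candidates and pick the best one.
valid_errors = {step: error_rate(model, valid) for step, model in models.items()}
best_step = min(valid_errors, key=valid_errors.get)

# Test partition: assess only the final chosen model.
test_error = error_rate(models[best_step], test)
```

Because the validation partition took part in choosing `best_step`, its error for the chosen model tends to be somewhat optimistic; `test_error` is the estimate that was not involved in any choice.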

  1. Using the concept of overfitting, explain why, when a model is fit to training data, zero error with those data is not necessarily good (textbook reference – 2.5).

Overfitting takes place when the chosen model describes random error or noise rather than the underlying relationship. This occurs when the model is excessively complex, that is, when it has too many parameters relative to the number of observations. Such a model can achieve zero error on the training data because it has fit the noise in those data as well. Since that noise does not recur in new data, the model is unlikely to give useful results on the validation data set.
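A small Python sketch can make this concrete. It assumes synthetic data in which the true rule is "y = 1 when x > 0.5" and 20% of the labels are flipped at random; the nearest-neighbor "memorizer" is just one convenient example of an excessively flexible model, and all function names are illustrative.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic records: true rule is y = 1 when x > 0.5; `noise` is the share of flipped labels."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        rows.append((x, y))
    return rows

def nearest_neighbor_model(train):
    """Memorizes the training data: predicts the label of the closest training x."""
    def predict(x):
        return min(train, key=lambda row: abs(row[0] - x))[1]
    return predict

def simple_model(x):
    """One-cutoff rule that matches the underlying relationship."""
    return int(x > 0.5)

def error_rate(model, rows):
    """Fraction of (x, y) rows that the model misclassifies."""
    return sum(model(x) != y for x, y in rows) / len(rows)

rng = random.Random(42)
train = make_data(200, rng)
valid = make_data(200, rng)

memorizer = nearest_neighbor_model(train)

# Each training record is its own nearest neighbor, so the training error is zero.
train_error_memorizer = error_rate(memorizer, train)

# On fresh data the memorized noise no longer matches.
valid_error_memorizer = error_rate(memorizer, valid)
valid_error_simple = error_rate(simple_model, valid)
```

The memorizer reproduces every training label, including the flipped ones, so its zero training error says nothing about new records. On the validation data it would typically do worse than the simple rule, whose error stays close to the noise level.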

  1. Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than Model B on the training data but slightly less accurate than Model B on the validation data.  Which model are you more likely to consider for final deployment? (textbook reference – 2.10)

One would prefer the model with the lowest error on the validation data. Model B is therefore the better choice for use on new data, since Model A's much higher training accuracy suggests that it is overfitting the training data.


  1. The next 2 questions require the use of XLMiner data mining software and the xls dataset . . .
  2. Use XLMiner’s Convert to Dummies utility to convert the categorical variable Education to binary dummy variables. After the conversion, how many resulting columns exist for the Education variable?  Why is this conversion performed?
  3. Using the newly created dataset (with binary dummy variables), use XLMiner’s Partitioning function to perform Standard Partitioning (accept the default percentages for partitioning). How many records were assigned to the Training Partition?   How many records were assigned to the Validation Partition?  Why was a Test Partition not created?
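For readers without XLMiner at hand, the two operations named in these lab questions can be approximated in Python. The sketch below is illustrative only: the records and the three Education levels are hypothetical (the lab's dataset is not reproduced here), and the 60%/40% training/validation split is an assumption chosen for the example, so it does not supply the answers to the questions above.

```python
import random

def to_dummies(records, column):
    """Replace a categorical column with one 0/1 column per distinct category."""
    levels = sorted({r[column] for r in records})
    converted = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for level in levels:
            row[f"{column}_{level}"] = int(r[column] == level)
        converted.append(row)
    return converted, levels

def standard_partition(records, train_frac=0.6, seed=12345):
    """Randomly assign records to a training and a validation partition (no test partition)."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    return rows[:n_train], rows[n_train:]

# Hypothetical records; the real lab dataset is not used here.
education_values = ["Undergrad", "Graduate", "Professional", "Graduate", "Undergrad"] * 2
records = [{"ID": i, "Education": level} for i, level in enumerate(education_values)]

converted, levels = to_dummies(records, "Education")
training, validation = standard_partition(converted)
```

In this sketch a categorical variable with k distinct levels becomes k binary columns, exactly one of which equals 1 in each record. Many modeling methods then use only k - 1 of them, since the last column is implied by the others.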