Excel at Data Mining – Quick Data Preparation
Today, I’m going to show how use Data Preparation wizards to get your data ready to be mined. You should have already installed the data mining add-in for Excel and configured the add-in to be pointed at an instance of SQL Server to which you have access rights before beginning this tutorial.
We will be using a tutorial spreadsheet that can be found on Codeplex here.
Our goal is to discretize and clean up a dataset containing information on whether or not a person purchased a bike. This data is located on the tab labeled Table Analysis Tools Sample.
Discretize Continuous Data
To start, we go to the DATA MINING tab, find the Data Preparation group, and select the Explore Data button.
From here, we will select the range to be explored and then select Next.
Now, we will select which column that we wish to explore from the drop down. Choose Income then select Next.
Next, we will select how many buckets that we want to view the data. Choose 3 and then select Add New Column. This adds a new column named Income2 that discretizes the income column into three equal sized buckets: 10000 – 63333, 63333 – 116667, and 116667-170000. Notice that the range of each bucket is $53,333.
Cleaning Your Data
Going to the same Data Preparation group that we used in the first section, select Outliers from the Clean Data drop down.
Select the data range and click Next. From here, select the Income column to analyze for outliers. Now, we have a rough distribution of the income values from our dataset. The vertical axis represents the count and the horizontal axis represents the income values. Let’s increase the Resolution to 80. We can now see that there are some values to the far right. These people are not representative of our average customer, and so we do not wish to include them in the analysis. Let’s lower the maximum threshold from 170000 to 149000 to exclude those values and then hit Next.
Now, select the Delete Rows Containing Outliers radio button and hit Next. Then, select Change Data in place and hit Finish. Now, you have completely eliminated the rows with outlier income data from your dataset.
Going back to the Data Preparation group, select Re-label from the Clean Data drop down.
Select the data range and click Next. Now, select the discretized column Income2. Then, we’ll re-label our data types to Low, Average, and High.
Next, select Change Data in Place and select Finish. That will relabel our buckets of income from ranges to Low, Average, and High.
And that’s it, you have officially discretized a column of continuous data and then relabeled it. You also got rid of some people with outlier data that are not in the range of customers that you wish to analyze. For more practice and exploration, I highly recommend using the spreadsheet from Codeplex as it has several different simple sets of data that are ready made to set up different models and mining structures.
Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his Masters Degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and Bell Hellicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and Sharepoint.
You can subscribe to our RSS feed.