StatSlice Business Intelligence and Analytics | Business Intelligence and Analytics Dallas | Business Intelligence Training Dallas

Excel at Data Mining – How to Randomly Sample Your Data

Excel at Data Mining – How to Sample Your Data

Today, I’m going to show you how to randomly sample and oversample data in less than 5 minutes with the Microsoft Excel data mining add-in. The assumption of this post is that you have already installed the data mining add-in for Excel and configured the add-in to be pointed at an instance of SQL Server with Analysis Services to which you have access rights.

For the example, we will be using a tutorial spreadsheet that can be found on Codeplex at:

https://dataminingaddins.codeplex.com/releases/view/87029

You will also need to attach the AdventureworksDW2012 data file to SQL Server which can be downloaded here:

http://msftdbprodsamples.codeplex.com/releases/view/55330

Our goal is to create a random sample and to oversample the bike buyer source data.

 

Read below if you are unable to view the video walkthrough.

In practice, we create random samples in order to analyze a smaller amount of data for our model. Every row has the same chance to be put in the random sample. Oversampling is usually done to attempt to create a balanced (or unbalanced) subset of data by taking a rare subset of the data and expanding it. For example, we would over sample a dataset where the gender was 75% male but we wanted to analyze a dataset where the gender is 50% male.

Step 1: Random Sampling

We go to the DATA MINING tab, the Data Preparation group, and select the Sample Data icon to open the Sample Data wizard. From here, we need to select our source data. We will use ‘Source Data’!’Source Data’ table and then click Next.

select our source data

We will leave the radio button on Random Sampling and click Next.

random sampling

Here, we can select either what percentage of the total rows we want in our random sample or just the number. We will change the radio button to Row Count and the value to 100. One the final screen, we can change the name of the worksheet we are about to create for the selected and one for the unselected data. Uncheck the Create a worksheet for unselected data box. Finally, click Finish.

unselected data

By scrolling to the bottom of our random sample dataset, we can see that we selected 100 rows (plus the blank row and header row at the top).

Step 2: Oversampling Data

We go to the DATA MINING tab, the Data Preparation group, and select the Sample Data icon to open the Sample Data wizard. From here, we need to select our source data. We will use ‘Source Data’!’Source Data’ table and then click Next.

Now, we will change the radio button to Oversample to balance data distributions and click Next.

oversample to balance data distributions

Here, we can pick the column we wish to balance and how we wish to balance it by selecting the target state and the percentage we wish it to be. Select Education from the Input Column  and set Target State  to be Graduate Degree. Click Next, and then set the name of the sheet to Oversampled Data. Then select Finish.

By setting the filter on the Education column to Graduate Degree, we see that those records make up 300 or 30% of the 1000 columns.

filter columns

And that is how you can randomly sample and oversample your data in less than five minutes using the Microsoft Excel Data Mining add-in.

Happy digging!

Billy Decker

Billy Decker_border2Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his Masters Degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and  Bell Hellicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and Sharepoint.

Subscribe

You can subscribe to our RSS feed.