Excel at Data Mining – How to Detect Categories
Today, I’m going to show you how to detect categories in your data in less than 5 minutes with the Microsoft Excel data mining add-in. The assumption of this post is that you have already installed the data mining add-in for Excel and configured the add-in to be pointed at an instance of SQL Server with Analysis Services to which you have access rights.
For the example, we will be using a tutorial spreadsheet that can be found on Codeplex at:
You will also need to attach the AdventureworksDW2012 data file to SQL Server which can be downloaded here:
Our goal is to uncovered hidden categories or clusterings in our data.
Read below if you are unable to view the video walkthrough.
Step 1: Detecting Categories
First, a cell within a defined excel table must be selected. Click on a table cell on the Table Analysis Tools Sample tab. Then, we go to the Table Tools Analyze tab, the Table Analysis Tools group, and select the Detect Categories icon to open the Detect Categories wizard. From here, we need to select columns we wish to consider in our cluster analysis.
We will unselect ID and leave the rest of the options in default and click Run.
Step 2: Reading the Categories Report
So what the analysis is doing is going over the data and looking for clusters or rows that have similar characteristics across all the columns that we selected earlier.
After the tool is finished running, you will move over to a new tab, but let’s go back to our original tab with the data table.
We can see that a new column has been added to the worksheet that categorizes every single row in our dataset. However, the categories are not very descriptive in and of themselves. Click on the worksheet called Categories Report. Instead of us having to filter each category and try to get an idea of each category’s demographics, we have this nice report that summarizes a lot of information about all of our categories.
The first section is a basic histogram of the data and gives us an idea of how the rows are distributes in each category.
The next section is called Category Characteristics, and this one gives us an idea of what defines a particular category. Each column is also ranked according to it’s importance for that category or cluster.
From here, we can see that Category 1 is mostly defined by a very low income, but also tends to have people in Europe. We can change the filter above the category to see another category or multiple categories at once.
The third section is called Category Profiles. This takes a column and then shows the of each category in that column.
What we see here is that column Cars has been selected and each category has a stacked bar breakdown of the Cars column. We can easily see that Category 3 is dominated by people with 0 cars and that Category 4 doesn’t have any people with 0 cars. In the bottom left hand corner, you can filter to whichever column or category you want to look at.
And that is how you can detect categories in your data in less than five minutes using the Microsoft Excel data mining add-in.
Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his Masters Degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and Bell Hellicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and Sharepoint.
You can subscribe to our RSS feed.