StatSlice Business Intelligence and Analytics | Business Intelligence and Analytics Dallas | Business Intelligence Training Dallas

Excel at Data Mining – Creating a Classification Matrix

Today, I’m going to show you how to create and read a Classification Matrix in less than 5 minutes with the Microsoft Excel data mining add-in. The assumption of this post is that you have already installed the data mining add-in for Excel and configured the add-in to be pointed at an instance of SQL Server with Analysis Services to which you have access rights.

For the example, we will be using a tutorial spreadsheet that can be found on Codeplex here. You also need to attach the AdventureworksDW2012 data file to SQL Server. The file can be downloaded here.

Our goal is to create a Classification Matrix based on a mining structure with all of its associated models that we have created previously.

Read below if you are unable to view the video walkthrough.

Step 1: Creating the Profit Chart

We go to the DATA MINING tab, the Accuracy and Validation group, and select the Classification Matrix icon. From here, select the Mining Structure that you wish to analyze. In this case we will be analyzing a Naïve Bayes Model and a Logistic Regression Model that we previously created using Adventureworks2012DW. Click Next. On the Specify Column to Predict screen, make sure that BikeBuyer, Show results as percentage, and Show results as count are selected.

 

classification matrix options

 

Click Next and then Finish. And, that is how easy it is to actually create a Classification Matrix!

Step 2: Reading the Classification Matrix

Before we start looking at the Classification Matrix, let’s take a moment to think about how we can analyze our models using test data that has actual results in it. We’ll take customer A and have our model look at her information. It predicts that she WILL purchase a bike. Next, we can see what she actually did in the dataset. We have two possible outcomes: she either bought a bike and validated our model or she did not buy a bike and our model “missed” on her. The first case we call a true positive, and the second case we call false positive. Thinking about a NO prediction will lead to similar categories of true negative and false negative. An easy way to remember these terms is that first word tells you if you actual and prediction match and the second word tells you the prediction. True negative means that actual matched the prediction, and the prediction was NOT BUYING. False positive means that the actual does not match the prediction, and the prediction was a BUYING. Note that positive and negative depends on the definition of your model. We could have easily tried to predict people who would not buy bikes, and this would reverse all our definitions because NOT BUYING becomes the positive and BUYING becomes the negative.

So, now we realize that we have four categories (true positive, false positives, true negatives, false negatives) that can be summarized as a 2 x 2 matrix. This is exactly what we did when we created the Classification Matrix.

 

Counts of correct/incorrect classification for structure

 

Starting at the top, we see an overview breakdown of how well the models performed against the test data. Keep in mind that Total Correct tells how many the model was right on for both positive and negative results.

Then below that, we see two matrices for the Bayes Model. One of them has percentages and the other has counts. Across the top of our matrix, we have our actuals and along the side we have our predictions. At each intersection of the matrix, we have a number that tells us how many of the test results fell into that bucket. At the intersection of 0 (actual) with 0 on the percentage matrix, we see 61.91%. This tells us that the model was accurately able to identify 61.91% of the people who ended not purchasing a bike (true negatives). Moving down one spot, we see that the remaining 38.09% were false negatives in that we predicted customers would not buy but they ended up buying a bike. The counts matrix shows the actual raw numbers used to calculate the percentages in the percentage matrix.

And that is how you create and read a Classification Matrix in less than five minutes using the Excel Data Mining Add in.

Happy digging!

Billy Decker

Billy Decker_border2Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his Masters Degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and  Bell Hellicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and Sharepoint.

Subscribe

You can subscribe to our RSS feed.

  • Categories