Excel at Data Mining – Your First Lift Chart
Today, I’m going to show you how to create and read a lift chart in less than 5 minutes with the Microsoft Excel data mining add-in. The assumption of this post is that you have already installed the data mining add-in for Excel and configured the add-in to be pointed at an instance of SQL Server with Analysis Services to which you have access rights.
For the example, we will be using a tutorial spreadsheet that can be found on Codeplex here.
Our goal for this post is to create lift chart using a Logistic Regression model and look at its performance.
Read below if you are unable to view the video walkthrough.
Step 1: Creating the Logistic Regression Model
First, we go to the DATA MINING tab, the Data Modeling group, and select the Advanced drop down. From here select Create Mining Structure to open the Mining Structure Wizard. Click Next to get past the title screen on the wizard. From the drop down, select ‘Source Data’!’Source Data’. Click Next three times and then click Finish.
Now, go back to the Advanced drop down and select Add Model to Structure. Click Next to get past the wizard title screen. Select the Table Structure that you just created and click Next. From the drop down, select Logistic Regression Model and click Next. For Purchased Bike, changed the drop down to Input and Predict. Click Next and Finish.
Step 2: Creating a Lift Chart
This section will show us how to use the Excel DM Accuracy Chart Wizard to create a lift chart so you can see the performance of the model you just created. We will also take a look at the basic elements of the chart and what they mean.
Let’s go to the DATA MINING tab, the Accuracy and Validation group, and select Accuracy Chart. Click Next. Select the Logistic Regression Model that you just created and click Next. Leave the values at default, and click Next and then Finish. We should see something like this:
Step 3: Reading Your Lift Chart
To make the chart a little easier to understand, let’s change the names on the axes. For the x-axis, rename it to Total Leads %. Change the y-axis to Sales %.
The way to think about a lift chart is that we have an entire pool of leads that we can contact to sell them bikes. Within that pool of leads are folks that are going to buy a bike if they are contacted. For instance, we might have a lead pool of 10,000 people and from our experience we are able to convert about 12% of them into Sales. That means that we have about 1,200 bicycle sales in that lead pool that we need to contact.
The first line to notice is the blue one which represents a random selection of leads to contact from our pool. The line is a nice straight line with a slope of 1 because we expect that if we randomly contact 10% of the leads, we will hit upon 10% of the sales or randomly contacting 50% of leads will lead to hitting upon 50% of sales in the total pool. Using our numbers, we expect to hit about 600 sales if we randomly contact 5,000 of our leads.
But we aren’t going to randomly contact our leads. We have a model that gives us a probability that particular lead will purchase a bike!
So, let’s assume our model is perfect and that we can contact the exact customers who will purchase a bike. That model would look like the red line. We can see that the Sales % hits 100% when the Total Leads % is around 12%. That means that we are expecting that about 12% of our leads are people who will actually buy a bike.
Now, the important part is not really either the red line or the blue line, but the area in between. This represents where our real model is going to perform because it cannot possibly do better than a perfect model, and we would hope that it is not going to do worse than randomly selecting leads. We can see that our model falls between the worst case scenario (random selection) and the best case scenario (perfect model).
So, let’s see what the green line tells us by tracing its path on the chart. Starting in the lower left corner and moving to the right, we see that when we use the model to contact 10% of our leads then we can snag about 20% of the total Sales population. That is about double the return we would have expected just randomly contacting our leads.
Moving further along the green line, we find that we can get about 50% of our sales by only having to contact 30% of our leads. If we contact about 50% of our total leads using the model, we can get 70% of the sales. I think you get the idea at this point, but notice that we have to work harder as we try to get a higher and higher percentage of the total Sales. We can get 100% of the sales by contacting 95% of the leads, but businesses are run on profit and opportunity rather than just pure revenue. So, we can find the sweet spot of profit along that curve into consideration how much it costs to contact each customer. Also, the time saved from having to contact every single customer could be reinvested into selling more products to your current customer base.
And that is how you create and read a lift chart in less than five minutes.
Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his Masters Degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and Bell Hellicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and Sharepoint.
You can subscribe to our RSS feed.