The Importance of Being Naïve
One of the first simplifying assumptions a scientist makes when studying a system is that nothing in the system interacts with anything else; in other words, each object is independent. This seems like a fairly naïve assumption because experience tells us that everything interacts with… well, everything else. However, a scientist gains a couple of major advantages by making this assumption. One is that her equations for studying the system can be simplified greatly, so a low-cost solution can be obtained quickly. Another is that she now has a baseline that acts as a “sanity check” and can be compared with results from the more complicated interactive model. The example most people know from science is the Ideal Gas Model, which assumes that the particles making up the gas do not interact with one another.
In the world of Data Mining, the Naïve Bayes Model plays a similar role to the Ideal Gas Model because it provides a much faster and simpler way to calculate a model that gives us an idea of the predictive characteristics of a dataset. It also provides an easy way to check the results of the application of a much more complicated algorithm.
Yet, what does independence mean in the context of data? In the world of physics, we can easily imagine what particle interaction means through collisions, repulsions, or attractions, and we just ignore those things when we want to look at an independent model. But what independence means in Data Mining is not as obvious.
Let’s begin by assuming that we have a simple dataset that relates the gender and occupation (the inputs) with the color of the car that a person drives to work (the output). To calculate the true probability that a blue car we see on the road is being driven by a male banker, we would need to understand three relationships: males with blue cars, bankers with blue cars, and males with banking jobs. However, we can use the assumption that there is no relationship between gender and job to simplify our problem. It is possible that gender and occupation can have a very complicated relationship that varies from job to job or is affected by a third hidden input such as location, but we are going to simplify our model by ignoring such correlations.
Given that we only have two inputs, we will not see a huge simplification to our current problem. However, if we had around 100 inputs, and we assume each input can have a relationship with up to 3 other inputs, then the number of possible relationships to analyze jumps to over 4 million. We can see this by calculating the total number of possible groupings of 1, 2, 3, and 4 inputs out of 100 choices, for example by summing COMBIN(100,1) through COMBIN(100,4) in Excel.
If we want to analyze groupings of any size, a similar calculation yields about 1.27 × 10^30 possible relationship groups! Applying the Naïve Bayes Model saves us quite a bit of work when we are at the beginning of a Data Mining project.
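The counts above can be reproduced outside of Excel. The following is a minimal Python sketch, assuming that a "relationship group" simply means any unordered selection of inputs; the function name count_groupings is made up for this illustration, and math.comb plays the role of COMBIN().

```python
from math import comb


def count_groupings(n_inputs, max_group_size):
    """Count the ways to pick groups of 1..max_group_size inputs out of n_inputs."""
    return sum(comb(n_inputs, k) for k in range(1, max_group_size + 1))


# Groups of 1, 2, 3, or 4 inputs out of 100 (each input related to up to 3 others).
limited = count_groupings(100, 4)

# Groups of any size from 1 to 100; this equals 2**100 - 1.
unlimited = count_groupings(100, 100)

print(f"{limited:,}")      # 4,087,975
print(f"{unlimited:.3e}")  # 1.268e+30
```

Almost all of the 4 million figure comes from the groups of four (3,921,225 of them), which is why the count explodes as larger groups are allowed.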
Another advantage of having independent events or inputs is that we can find the probability of both events occurring by simply multiplying their individual probabilities. An example of this would be a coin flip and a die toss where we want to find the probability of flipping heads and rolling a 6. Because a coin toss and a die toss are independent, the equation can be written as:
P(Heads AND Rolling a Six) = P(Heads) × P(Rolling a Six) = 1/2 × 1/6 = 1/12
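One way to convince ourselves of the product rule is to list every outcome and count. The sketch below assumes a fair coin and a fair six-sided die, so all 12 combined outcomes are equally likely; the helper name probability is only for illustration.

```python
from fractions import Fraction
from itertools import product

coin = ["Heads", "Tails"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair; with a fair coin and die, all 12 are equally likely.
outcomes = list(product(coin, die))


def probability(event):
    """Fraction of equally likely outcomes for which event(outcome) is true."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


p_heads = probability(lambda o: o[0] == "Heads")
p_six = probability(lambda o: o[1] == 6)
p_both = probability(lambda o: o[0] == "Heads" and o[1] == 6)

print(p_heads, p_six, p_both)  # 1/2 1/6 1/12

# For independent events, the joint probability equals the product.
assert p_both == p_heads * p_six
```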
Returning to our original problem, we can see how independence will affect it. The probability of someone owning a blue car given that they are a male banker can be reduced to multiplying the probability of a male owning a blue car by the probability of a banker owning a blue car, and then dividing by the overall probability of owning a blue car so that the base rate of blue cars is not counted twice.
P(Blue Car | Male AND Banker) ≅ P(Blue Car | Male) × P(Blue Car | Banker) / P(Blue Car)
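In practice, a Naïve Bayes calculation is usually organized as the base rate of each output value multiplied by one likelihood per input, with the results rescaled so they sum to one. The sketch below shows that bookkeeping on a tiny dataset that is entirely made up for illustration; the rows, the function name naive_bayes_posterior, and the resulting numbers are assumptions, not real data. It uses raw counts with no smoothing, so it is not a production implementation.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (gender, occupation, car color) records, invented for this example.
rows = [
    ("male", "banker", "blue"),
    ("male", "banker", "blue"),
    ("male", "teacher", "red"),
    ("female", "banker", "blue"),
    ("female", "teacher", "red"),
    ("female", "teacher", "blue"),
    ("male", "teacher", "red"),
    ("female", "banker", "red"),
]


def naive_bayes_posterior(rows, gender, occupation):
    """Estimate P(color | gender, occupation), treating the inputs as independent given the color."""
    color_counts = Counter(color for _, _, color in rows)
    scores = {}
    for color, n_color in color_counts.items():
        n_gender = sum(1 for g, _, c in rows if c == color and g == gender)
        n_occupation = sum(1 for _, o, c in rows if c == color and o == occupation)
        prior = Fraction(n_color, len(rows))  # P(color)
        # P(color) * P(gender | color) * P(occupation | color)
        scores[color] = prior * Fraction(n_gender, n_color) * Fraction(n_occupation, n_color)
    # Rescale so the scores sum to one. No smoothing is applied, so this
    # raises ZeroDivisionError if every color scores zero for the query.
    total = sum(scores.values())
    return {color: score / total for color, score in scores.items()}


posterior = naive_bayes_posterior(rows, "male", "banker")
print(posterior["blue"], posterior["red"])  # 3/4 1/4
```

Notice that the function never counts how many male bankers appear in the data. Each input is tallied against the car color on its own, which is exactly the work the independence assumption saves.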
The Naïve Bayes Classification Model uses this approximation to greatly simplify probability calculations. The term “Bayes” refers to Bayes' theorem, which governs conditional probabilities: we are trying to find the probability of owning a blue car given a male banker as our subject. It is a “Classification Model” because we are trying to predict discrete categories for the color of vehicles, such as blue, green, red, etc. The “Naïve” part of the title comes from the assumption that being male and being a banker are not related at all, but the name is misleading because the model provides the quickest and easiest way to gain insight before we dive into the complex relationships hidden within our data.
The power of the Naïve Bayes Classification Model comes from using a simplifying assumption to build an understanding of our dataset that can make the later, more complex analysis more accurate. So remember: be naïve.
Billy Decker is a consultant at StatSlice Systems. He graduated with a dual degree in Physics and Mathematics from the University of Texas at Austin and received his master's degree in Physics from the University of Texas at Arlington. He previously worked for Global Technical Services as a Senior Training Analyst and Bell Helicopter as an Instructional Designer. His technical experience includes, but is not limited to, SQL, SAP, Business Objects, QlikView, and SharePoint.