Naive Bayes Theorem

Learn to code AI




Part 3: Bayes' Theorem


Naive Bayes is an algorithm that makes use of Bayes' Theorem.
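
For reference, this is the theorem in its standard form, written for a hypothesis H and a piece of evidence E (the symbols H and E are just labels chosen here):

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

P(H) is what we believed before seeing the evidence (the prior), P(E | H) is how likely the evidence would be if the hypothesis were true, and P(H | E) is the updated belief (the posterior).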


Bayes' Theorem simply tells us that our ideas, beliefs, and perceptions should change based on new information or evidence, in proportion to how important that piece of information is. Put it this way: would you rather your barber screw up your sideburns or shave your head before a first date? Well, to each his own.

Let's go over an actual example:

Hypothetically, I am pretty damn positive, about 99% sure, that aliens don't exist. But what if tomorrow NASA announced the discovery of new exoplanets that resembled Earth in ways that planets from previous discoveries hadn't? Well, I'd still be convinced that aliens don't exist, but my confidence would drop to 98%. Over the next few days, RFSA (Russia), ESA (Europe), and CNSA (China) announce similar, independent discoveries. I'm still convinced of my initial belief, but I'll follow Bayes' advice and take into consideration that there's a 5 percent chance (generous) that I'm wrong.

Fast forward a year. The Mars rover found tall structures on Mars! Exciting! If I were into conspiracy theories, maybe I'd be skeptical and give a 5% chance that our ancestors from thousands of years ago visited Mars. But again, I'm not a fan of conspiracies, so my base probability of Zeus flying George Washington to Mars was 0.01%, and it only went up to 0.1%. I guess aliens are now more likely to exist than not: I'm about 80% convinced that they exist and about 20% convinced they don't! This makes zero sense intuitively, but it has to do with the fact that Mars is like a little grain of sand in the universe.

A discovery that may prove the existence of life on other planets that are billions of light years away, and a visual confirmation of a physical structure on Mars, should not impact my beliefs in the same way. And on top of that, aside from the absolute strength of the evidence provided, we’ll also have to consider other explanations, and how they may affect other previous hypotheses.
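
To put rough numbers on the story above, here is a minimal sketch of a single Bayesian update. The function name bayes_update and every likelihood value below are assumptions made up for illustration, picked only so the results land near the percentages in the story; none of them are measured quantities.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = p_evidence_if_true * prior
    # Total probability of seeing the evidence, whether or not H is true.
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


# Weak evidence (hypothetical numbers): Earth-like exoplanets are only
# twice as likely to be reported if aliens exist than if they don't.
after_exoplanets = bayes_update(prior=0.01,
                                p_evidence_if_true=0.8,
                                p_evidence_if_false=0.4)

# Strong evidence (hypothetical numbers): structures on Mars are 76 times
# more likely to be found if aliens exist than if they don't.
after_structures = bayes_update(prior=0.05,
                                p_evidence_if_true=0.76,
                                p_evidence_if_false=0.01)

print("After the exoplanet news: {:.1%}".format(after_exoplanets))  # roughly 2%
print("After the Mars structures: {:.1%}".format(after_structures))  # roughly 80%
```

The point of the sketch is the ratio between the two likelihoods: evidence that is almost as likely under either hypothesis barely moves the belief, while evidence that is hard to explain any other way moves it a lot.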

Let’s see here how this simple theorem, paired with a 'naive' assumption, can be used.

Gender Identification

The most widely known application of Naive Bayes is for filtering spam emails, but for now we’ll look at a unique example using NLTK, and then come back to spam mail.

In [1]:
import nltk
from nltk.corpus import names  # used as training data; run nltk.download('names') once if the corpus is missing
from sklearn.model_selection import train_test_split

Feature extraction

There are many, many algorithms to extract features from text, most of which rely on vectorization. These include the Bag of Words model, Term Frequency - Inverse Document Frequency, and Embeddings. Here we'll use the first and last letter of a person's name as our features to analyze.

In [2]:
def gender_features(word):
    # use the first and last letter of the name as features
    return {'first_letter': word[0], 'last_letter': word[-1]}

Let's see the features generated for Alexei:

In [3]:
alexei = gender_features("Alexei")
alexei
Out [3]:
 {'first_letter': 'A', 'last_letter': 'i'}

In [4]:
# our dataset: a list of names and the respective gender labels
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

labeled_names[:3] + labeled_names[-3:]
Out [4]:
[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Zsazsa', 'female'),
 ('Zulema', 'female'),
 ('Zuzana', 'female')]
In [5]:
# conversion of the list of data points to features
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
In [6]:
# featuresets is a single list of (features, label) pairs, so we pass just one iterable;
# this differs from the usual sklearn X, y form, but it is what NLTK expects
train_data, test_data = train_test_split(featuresets)
In [7]:
# Here's how we create a classifier, defined and fitted at the same time 
classifier = nltk.NaiveBayesClassifier.train(train_data)

Just like with the training data, we need to convert each name into features, then call the classify method of our classifier on the result.

In [8]:
Out [8]:

Hm. Perhaps if we try another spelling?

In [9]:
Out [9]:

Our Slavic/Greek friends won't be thrilled. Either our features are bad, or this particular algorithm isn't the right one to use. While that's a discussion in and of itself, put simply, both are to blame. Have you ever wondered why so many non-spam emails end up in the junk folder? We're only at the tip of the iceberg here with regard to this whole Artificial Intelligence thing. We're not that good yet.

Naive Bayes (NB) is usually the starting point and is often combined with other algorithms to produce a higher-accuracy model. But 20 years from now I'm sure we'll be looking back, laughing at how “naïve” we were. That's the excitement of AI and machine learning. Progress is rapid.

Anyway, despite its flaws, the model did perform better than a trivial baseline (see the test set accuracy below). And it made a 'naive' assumption too, so OK, it did its job and we're content.

But what is a naive assumption?

The "naive" part of NB has to do with the assumption that every individual word, or feature, occurs independently of the others once the class (spam or not spam) is known. In the spam-filtering example, it would mean that, within spam emails, the presence of the words "gold" and "sex" tells us nothing about the presence of the word "Rolex", even if it's in that same email. Hey, I mean maybe for some high rollers, but in practice the assumption is rarely exactly true; it just tends to hold well enough for the classifier to be useful.
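
The independence assumption is what keeps the arithmetic simple: the score for a class is just its prior multiplied by one probability per word. Below is a minimal sketch of that idea, assuming a made-up four-email training set and add-one (Laplace) smoothing. It is a multinomial-style variant written from scratch for illustration, not NLTK's implementation, and the names train, log_score and classify are hypothetical helpers.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for this example.
training = [
    ("spam", "cheap gold rolex"),
    ("spam", "gold rolex offer"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting tomorrow"),
]


def train(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for label, text in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def log_score(text, label, label_counts, word_counts, vocabulary):
    """log P(label) + sum of log P(word | label), one term per word."""
    score = math.log(label_counts[label] / sum(label_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocabulary:
            continue  # skip words never seen during training
        # Add-one smoothing keeps unseen (word, label) pairs above zero.
        score += math.log((word_counts[label][word] + 1) /
                          (total_words + len(vocabulary)))
    return score


def classify(text, model):
    """Return the label with the highest log score."""
    label_counts = model[0]
    return max(label_counts, key=lambda label: log_score(text, label, *model))


model = train(training)
print(classify("gold rolex", model))
print(classify("meeting tomorrow", model))
```

Because each word contributes its own independent term, adding a word to the email never requires looking at which other words are present. That is the whole "naive" shortcut.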

In [10]:
# Often it's good to compare with a baseline.
# Since we have more female names, the baseline is a
# classifier that assigns the label 'female' to every name

baseline = sum(1 for x in test_data if x[1] == "female") / len(test_data)
print("Accuracy if we predicted only 'woman': {:.2f}%".format(baseline*100))

# Actual model accuracy

print("Test set accuracy: {:.2f}%".format(nltk.classify.accuracy(classifier, test_data)*100))

Accuracy if we predicted only 'woman': 63.24%
Test set accuracy: 77.09%

Before wrapping up, we'll demonstrate how to extract features with sklearn, so you can create a unique model for yourself. Using sklearn's TfidfVectorizer, we can input a list of sentences, paragraphs or "documents", and generate a sparse matrix (a table with mostly zeros) where each row corresponds to one document and each column to one term. TF-IDF works by weighing how often a term appears in a given document against how many documents it appears in, so terms that show up everywhere count for less.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [12]:
sentences = ["Sample sentence one",
             "Another sample sentence, two",
             "Sample sentence three is here"]
In [13]:
tfidf = TfidfVectorizer()

features = tfidf.fit_transform(sentences)
features # sparse results
Out [13]:
<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [14]:
# but since we have a small matrix, we can view its dense version
features.todense()
Out [14]:

matrix([[0.        , 0.        , 0.        , 0.76749457, 0.45329466,
         0.45329466, 0.        , 0.        ],
        [0.6088451 , 0.        , 0.        , 0.        , 0.35959372,
         0.35959372, 0.        , 0.6088451 ],
        [0.        , 0.52004008, 0.52004008, 0.        , 0.30714405,
         0.30714405, 0.52004008, 0.        ]])
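
The numbers in the dense matrix can be reproduced by hand, which is a good way to see what the vectorizer is doing. The sketch below assumes scikit-learn's default settings as I understand them: lowercasing, tokens of two or more word characters, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2-normalized rows. The helper names tokenize and tfidf_rows are made up for illustration.

```python
import math
import re

sentences = ["Sample sentence one",
             "Another sample sentence, two",
             "Sample sentence three is here"]


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_rows(documents):
    """Return (sorted vocabulary, one L2-normalized TF-IDF row per document)."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(documents)

    # Smoothed inverse document frequency for every term.
    idf = {}
    for term in vocabulary:
        doc_freq = sum(1 for doc in tokenized if term in doc)
        idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

    rows = []
    for doc in tokenized:
        weights = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_rows(sentences)
print(vocabulary)
for row in rows:
    print([round(value, 8) for value in row])
```

Words such as "sample" and "sentence" appear in all three documents, so their idf is the minimum value of 1 and they end up with the smallest non-zero weights in each row. Words that appear in only one document get the largest weights.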

And now, with these feature vectors, you can train your classifiers.