In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.

Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:

P(x1:T|y) =

T Y t=1

P(xi|y). (3)

When estimated from dataset D with pseudo-count prior of α, the model parameters are:

ˆ P(xi|y) =

CountD(xi,y) + α CountD(y) + Nα

, (4)

where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and

4

α = 0.1 in our experiments.

Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classifier predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classifier’s performance.

Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:

True condition Positive Negative

Predicted Condition

Positive True positive False positive Negative False negative True negative

Other important performance measurements are precision, recall, and F-score, defined as:

precision =

true positive true positive + false positive

(5)

recall =

true positive true positive + false negative

(6)

F-score = 2·

precision·recall precision + recall

(7)

(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classifier in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)

(b) How does the change of α effect the classifier performance? Using random split above, evaluate the training and test accuracy and F-score under different selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the first plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)

Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more efficient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.

All papers are written by ENL (US, UK, AUSTRALIA) writers with vast experience in the field. We perform a quality assessment on all orders before submitting them.

Do you have an urgent order?  We have more than enough writers who will ensure that your order is delivered on time. 

We provide plagiarism reports for all our custom written papers. All papers are written from scratch.

24/7 Customer Support

Contact us anytime, any day, via any means if you need any help. You can use the Live Chat, email, or our provided phone number anytime.

We will not disclose the nature of our services or any information you provide to a third party.

Assignment Help Services
Money-Back Guarantee

Get your money back if your paper is not delivered on time or if your instructions are not followed.

We Guarantee the Best Grades
Assignment Help Services