In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.

Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:

P(x1:T|y) =

T Y t=1

P(xi|y). (3)

When estimated from dataset D with pseudo-count prior of α, the model parameters are:

ˆ P(xi|y) =

CountD(xi,y) + α CountD(y) + Nα

, (4)

where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and

α = 0.1 in our experiments.

Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classiﬁer predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classiﬁer’s performance.

Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:

True condition Positive Negative

Predicted Condition

Positive True positive False positive Negative False negative True negative

Other important performance measurements are precision, recall, and F-score, deﬁned as:

precision =

true positive true positive + false positive

(5)

recall =

true positive true positive + false negative

(6)

F-score = 2·

precision·recall precision + recall

(7)

(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classiﬁer in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)

(b) How does the change of α eﬀect the classiﬁer performance? Using random split above, evaluate the training and test accuracy and F-score under diﬀerent selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the ﬁrst plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)

Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more eﬃcient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.

Need Help with Your Assignment? Let us Help you

Place Your Order

We guarantee you the following if you order your paper with us

High Quality Papers

All papers are written by ENL (US, UK, AUSTRALIA) writers with vast experience in the field. We perform a quality assessment on all orders before submitting them.

Timely Delivery

Do you have an urgent order? We have more than enough writers who will ensure that your order is delivered on time.

100% Original

We provide plagiarism reports for all our custom written papers. All papers are written from scratch.

24/7 Customer Support

Contact us anytime, any day, via any means if you need any help. You can use the Live Chat, email, or our provided phone number anytime.

Confidential and Secure

We will not disclose the nature of our services or any information you provide to a third party.

Stressed? We got you covered

Money-Back Guarantee

Get your money back if your paper is not delivered on time or if your instructions are not followed.

We Guarantee the Best Grades

Get a Professional Answer

Computer Science Assignment

We guarantee you the following if you order your paper with us

Our Statistics

Why Choose Us

Formats and Features

FREE Features: Save "$65"

Company

Questions

Trusted and Secure