### Introduction

Text classification is an important task in natural language processing. Recent advances in text classification and related domain-specific tasks, such as sentiment analysis, have been successfully powered by computational linguistics and statistical machine learning. To keep things beginner-friendly, this tutorial introduces a basic machine learning pipeline that works directly on the extracted feature vectors of our dataset.

Our pipeline is tested with the following scientific computing tools:

- Python 2.7.8
- Scikit-learn 0.15 (current stable version)
- NumPy 1.9.1
- SciPy 0.14.0

### Load features and labels

The features are already prepared, and the corresponding labels of these samples are provided in `data-train.dat`. You can load the dataset as follows:

```
In [1]: import os

In [2]: from sklearn.datasets import load_svmlight_file

In [3]: DATA_TRAIN_PATH = 'data-train.dat'

In [4]: N_FEATURES = 1881

In [5]: X, y = load_svmlight_file(DATA_TRAIN_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
```
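If you are curious what `load_svmlight_file` is parsing, each line of an SVMlight/libsvm-format file stores one sample as a label followed by sparse `index:value` pairs; with `zero_based=True`, feature indices start at 0. A minimal sketch with a hypothetical example line (not taken from `data-train.dat`):

```python
# One line of an SVMlight/libsvm-format file: "<label> <index>:<value> ..."
# (a hypothetical example line for illustration).
line = "1 0:3 7:1 42:2"

tokens = line.split()
label = int(tokens[0])
# Parse the "index:value" pairs into a dict of feature index -> count.
features = {int(i): int(v) for i, v in (t.split(':') for t in tokens[1:])}
print(label, features)
```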

We split the data into 80% for the training set and 20% for the validation set:

```
In [6]: from sklearn.cross_validation import train_test_split

In [7]: TEST_SIZE = 0.2

In [8]: RANDOM_STATE = 0

In [9]: X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
```

The following code outputs the data sizes:

```
In [10]: X.shape, y.shape
Out[10]: ((167, 1881), (167,))

In [11]: X_train.shape, y_train.shape
Out[11]: ((133, 1881), (133,))

In [12]: X_val.shape, y_val.shape
Out[12]: ((34, 1881), (34,))
```
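Before training, it can also help to check how the classes are balanced, since a heavy skew affects how you read the scores later. A minimal sketch with NumPy, using hypothetical labels in place of the real `y_train` (assumes a binary 0/1 labeling):

```python
import numpy as np

# Hypothetical labels standing in for y_train; the real labels come from
# load_svmlight_file above.
y_train = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])

counts = np.bincount(y_train)          # number of samples per class
ratios = counts / float(len(y_train))  # class proportions
print(counts, ratios)
```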

### Training and evaluation

Now we train a Multinomial Naive Bayes classifier on the features and labels loaded from the training data:

```
In [13]: from sklearn.naive_bayes import MultinomialNB

In [14]: clf = MultinomialNB(alpha=1.0)

In [15]: clf.fit(X_train, y_train)
Out[15]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

where the smoothing parameter is set to `alpha=1.0`.
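The `alpha` parameter controls additive (Laplace/Lidstone) smoothing of the per-class feature counts: every count gets `alpha` added, so a feature never seen in a class still receives a nonzero probability. A minimal sketch with toy counts (the numbers are hypothetical, not from our dataset):

```python
import numpy as np

# Toy per-class feature counts for 3 features; hypothetical values.
counts = np.array([5.0, 0.0, 3.0])
alpha = 1.0

# Additive smoothing: add alpha to each count and renormalize, so the
# unseen feature (count 0) still gets a nonzero probability.
probs = (counts + alpha) / (counts.sum() + alpha * len(counts))
print(probs)
```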

We then predict class probabilities for the validation samples and evaluate the performance against the validation labels:

```
In [16]: y_pred = clf.predict_proba(X_val)[:, 1]

In [17]: from sklearn.metrics import roc_auc_score

In [18]: roc_auc_score(y_val, y_pred)
Out[18]: 0.74652777777777779
```
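The ROC AUC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A small sketch of this pairwise interpretation with hypothetical labels and scores (`roc_auc_score` computes the same quantity more efficiently):

```python
import numpy as np

# Hypothetical labels and predicted scores for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
# Fraction of (positive, negative) pairs where the positive sample gets
# the higher score; ties count as half.
auc = np.mean([1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg])
print(auc)  # 0.75
```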

The result shows that this Naive Bayes classifier achieves a ROC AUC of about 0.747 on the validation set. Let us build a new model with a different parameter.

```
In [19]: clf_new = MultinomialNB(alpha=0.1)

In [20]: clf_new.fit(X_train, y_train)
Out[20]: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [21]: y_pred_new = clf_new.predict_proba(X_val)[:, 1]

In [22]: roc_auc_score(y_val, y_pred_new)
Out[22]: 0.78819444444444442
```

The new model with `alpha=0.1` scores better than the old one on the validation set.
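Rather than trying values one at a time, you can sweep a small grid of `alpha` values and keep the one with the best validation score. A sketch of this idea on synthetic count data standing in for our real train/validation split (the data and the candidate grid here are made up):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
# Synthetic nonnegative count features standing in for the real split;
# labels alternate so both classes are guaranteed to appear.
X_tr, y_tr = rng.randint(0, 5, (80, 20)), np.tile([0, 1], 40)
X_va, y_va = rng.randint(0, 5, (20, 20)), np.tile([0, 1], 10)

# Fit one model per candidate alpha and keep the best validation AUC.
best_alpha, best_auc = None, -1.0
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:
    clf = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
    if auc > best_auc:
        best_alpha, best_auc = alpha, auc
print(best_alpha, best_auc)
```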

### Submission

Let us submit a model with `alpha=0.1`. We first train the model again using all the given data (`X`, `y`):

```
In [23]: clf_submit = MultinomialNB(alpha=0.1)

In [24]: clf_submit.fit(X, y)
Out[24]: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
```

Next, we load the test samples to make predictions on them:

```
In [25]: DATA_TEST_PATH = 'data-test.dat'

In [26]: X_test, y_test_dummy = load_svmlight_file(DATA_TEST_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
```

The final predictions for the test data are written to `sample-submission-basic.dat`; you can submit this file directly and then check your score and ranking on the leaderboard.

```
In [27]: SUBMIT_PATH = 'sample-submission-basic.dat'

In [28]: y_submit = clf_submit.predict_proba(X_test)[:, 1]

In [29]: import numpy as np

In [30]: np.savetxt(SUBMIT_PATH, y_submit, fmt='%.10f')
```
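Before submitting, it is worth reading the file back to confirm it contains one probability per line, all within [0, 1]. A sketch of such a sanity check, using hypothetical predictions in place of the real `y_submit` and a temporary directory so nothing is overwritten:

```python
import os
import tempfile
import numpy as np

# Hypothetical predicted probabilities standing in for y_submit.
y_submit = np.array([0.12, 0.98, 0.5])

# Write with the same format as above, then read the file back.
path = os.path.join(tempfile.mkdtemp(), 'sample-submission-basic.dat')
np.savetxt(path, y_submit, fmt='%.10f')
loaded = np.loadtxt(path)

print(loaded.shape, (loaded >= 0.0).all() and (loaded <= 1.0).all())
```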

At this point the whole pipeline is complete, and feel free to make a trial submission whenever you obtain a `sample-submission-basic.dat` prediction file for `data-test.dat`.