Tutorial on Text classification problem #2


Text classification is an important task in natural language processing. Recent advances in text classification and in related domain-specific tasks, such as sentiment analysis, are powered by computational linguistics and statistical machine learning. To keep things beginner-friendly, this tutorial introduces a basic machine learning pipeline that works directly on feature vectors already extracted from our dataset.

Our pipeline has been tested with the following tools for scientific computing and data science:

Load features and labels

The features have already been extracted. They are stored, together with the corresponding label of each sample, in data-train.dat. You can load the dataset as follows:

In [1]: import os
In [2]: from sklearn.datasets import load_svmlight_file
In [3]: DATA_TRAIN_PATH = 'data-train.dat'
In [4]: N_FEATURES = 1881
In [5]: X, y = load_svmlight_file(DATA_TRAIN_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
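For readers who have not seen the svmlight format before, the sketch below writes a tiny file by hand and loads it with the same call. The two samples, the file name demo.dat, and the feature count of 5 are made up for illustration and are not part of the tutorial data.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Two made-up samples in svmlight format: "<label> <index>:<value> ...".
# Only non-zero features are listed; indices start at 0 (zero_based=True).
demo_text = "1 0:1 3:2\n0 1:1\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.dat")  # hypothetical file name
    with open(demo_path, "w") as f:
        f.write(demo_text)
    # n_features fixes the number of columns, even if the largest index
    # that appears in the file is smaller.
    X_demo, y_demo = load_svmlight_file(
        demo_path, n_features=5, dtype=int, zero_based=True
    )

dense_demo = X_demo.toarray()  # X_demo is a sparse matrix
print(dense_demo)
print(y_demo)
```

Passing n_features explicitly is what lets the training and test matrices end up with the same number of columns, even when some features never occur in one of the files.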

We split the data into a training set (80%) and a validation set (20%):

In [6]: from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
In [7]: TEST_SIZE = 0.2
In [8]: RANDOM_STATE = 0
In [9]: X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

The following code outputs the data sizes:

In [10]: X.shape, y.shape
Out[10]: ((167, 1881), (167,))
In [11]: X_train.shape, y_train.shape
Out[11]: ((133, 1881), (133,))
In [12]: X_val.shape, y_val.shape
Out[12]: ((34, 1881), (34,))
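The sizes 133 and 34 follow from simple arithmetic. To our understanding, scikit-learn rounds the validation share up to a whole sample and gives the remainder to the training set. A minimal sketch of that calculation (the variable names are only illustrative):

```python
import math

n_samples = 167
test_size = 0.2

# 0.2 * 167 = 33.4, which is rounded up to a whole number of samples.
n_val = math.ceil(test_size * n_samples)
# The training set takes whatever is left.
n_train = n_samples - n_val

print(n_train, n_val)
```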

Training and evaluation

Now we train a Multinomial Naive Bayes classifier on the training features and labels:

In [13]: from sklearn.naive_bayes import MultinomialNB
In [14]: clf = MultinomialNB(alpha=1.0)
In [15]: clf.fit(X_train, y_train)
Out[15]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Here the classifier's smoothing parameter is set to alpha=1.0.
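In Multinomial Naive Bayes, alpha is the additive (Laplace/Lidstone) smoothing parameter: each per-class feature count c_i is replaced by (c_i + alpha) / (sum of counts + alpha * number of features), so a feature never seen in a class still gets a non-zero probability. The sketch below works this out for a made-up count vector; the helper smoothed_probs is hypothetical and is not a scikit-learn function.

```python
def smoothed_probs(counts, alpha):
    """Additively smoothed feature probabilities for one class."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Made-up feature counts for a single class; the second feature never occurs.
demo_counts = [3, 0, 1]

probs_alpha_1 = smoothed_probs(demo_counts, alpha=1.0)   # [4/7, 1/7, 2/7]
probs_alpha_01 = smoothed_probs(demo_counts, alpha=0.1)  # [3.1/4.3, 0.1/4.3, 1.1/4.3]

print(probs_alpha_1)
print(probs_alpha_01)
```

A smaller alpha stays closer to the raw frequencies, while a larger alpha pulls all probabilities toward a uniform distribution.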

We then predict class probabilities for the validation samples and evaluate them against the validation labels using the area under the ROC curve (AUC):

In [16]: y_pred = clf.predict_proba(X_val)[:, 1]
In [17]: from sklearn.metrics import roc_auc_score
In [18]: roc_auc_score(y_val, y_pred)
Out[18]: 0.74652777777777779
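The AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs. The helper pairwise_auc and the toy labels and scores are illustrative only, and the code assumes the positive class is labeled 1.

```python
def pairwise_auc(y_true, scores, positive_label=1):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive_label]
    neg = [s for t, s in zip(y_true, scores) if t != positive_label]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ranking of the scores matters, the AUC is computed from the predicted probabilities rather than from hard class predictions.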

The above result indicates that the current Naive Bayes classifier reaches a validation AUC of about 0.75, clearly above the 0.5 expected from random guessing. Let us train a new model with a different parameter value:

In [19]: clf_new = MultinomialNB(alpha=0.1)
In [20]: clf_new.fit(X_train, y_train)
Out[20]: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
In [21]: y_pred_new = clf_new.predict_proba(X_val)[:, 1]
In [22]: roc_auc_score(y_val, y_pred_new)
Out[22]: 0.78819444444444442

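The new model with alpha=0.1 reaches a higher validation AUC than the old one (about 0.79 versus 0.75), although with only 34 validation samples the difference should be read with some caution.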
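Since a single 34-sample validation split gives a noisy estimate, one common alternative is to compare several alpha values with k-fold cross-validation. This is not part of the original pipeline. The sketch below uses synthetic data as a stand-in (X_demo, y_demo and the candidate alpha values are assumptions for illustration); with the real data you would pass X and y instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the real data (illustration only): binary features
# that are slightly more frequent in the positive class.
rng = np.random.default_rng(0)
n_samples, n_features = 167, 50
y_demo = rng.integers(0, 2, size=n_samples)
rates = np.where(y_demo[:, None] == 1, 0.3, 0.2) * np.ones((1, n_features))
X_demo = rng.binomial(1, rates)

# Mean 5-fold cross-validated AUC for each candidate smoothing value.
cv_auc = {}
for alpha in (0.01, 0.1, 1.0):
    fold_scores = cross_val_score(
        MultinomialNB(alpha=alpha), X_demo, y_demo, cv=5, scoring="roc_auc"
    )
    cv_auc[alpha] = fold_scores.mean()

for alpha, auc in cv_auc.items():
    print(f"alpha={alpha}: mean AUC={auc:.3f}")
```

Averaging over folds uses every sample for validation once, which tends to give a steadier comparison between parameter values than a single small hold-out set.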


Let us submit a model with alpha=0.1. We first train a model again using all the given data (X, y):

In [23]: clf_submit = MultinomialNB(alpha=0.1)
In [24]: clf_submit.fit(X, y)
Out[24]: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

Next, we load the test samples to make a prediction on them:

In [25]: DATA_TEST_PATH = 'data-test.dat'
In [26]: X_test, y_test_dummy = load_svmlight_file(DATA_TEST_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)

Finally, we write the predictions for the test data to sample-submission-basic.dat. You can submit this file directly and then check your score and ranking on the leaderboard.

In [27]: SUBMIT_PATH = 'sample-submission-basic.dat'
In [28]: y_submit = clf_submit.predict_proba(X_test)[:, 1]
In [29]: import numpy as np
In [30]: np.savetxt(SUBMIT_PATH, y_submit, fmt='%.10f')
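Before submitting, it can be worth checking what np.savetxt actually wrote. The sketch below saves a few made-up scores with the same format string and reads them back. The file name demo-submission.dat and the scores are illustrative, and the exact format the leaderboard expects is not specified here beyond what the code above produces.

```python
import os
import tempfile

import numpy as np

# Made-up probabilities standing in for y_submit.
demo_scores = np.array([0.0123456789012, 0.5, 0.9876543210987])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-submission.dat")  # hypothetical name
    # One value per line, fixed-point with 10 decimal places.
    np.savetxt(demo_path, demo_scores, fmt="%.10f")
    with open(demo_path) as f:
        lines = f.read().splitlines()
    reloaded = np.loadtxt(demo_path)

print(lines)
# Sanity checks: one line per sample, all values are valid probabilities.
print(len(lines) == len(demo_scores), bool(((reloaded >= 0) & (reloaded <= 1)).all()))
```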

At this point we have completed the whole pipeline. Feel free to make a trial submission whenever you obtain a sample-submission-basic.dat prediction file for data-test.dat.