Tutorial on Text classification problem #1


Text classification is an important task in natural language processing. Recent advances in text classification and related domain-specific tasks, such as sentiment analysis, are powered by computational linguistics and statistical machine learning. In the Text classification problem #1 competition, we emphasize insight into the raw data and encourage participants to discover richer user-defined features from the data itself. With this in mind, we provide this quick-start tutorial and hope it helps participants save time on feature engineering. To make it friendly to both beginners and advanced data miners, the tutorial is organized in two parts:

  • Part 1 includes the basic machine learning data analytics pipeline, which works directly on extracted feature vectors of our dataset.
  • Part 2 will introduce some advanced topics, including the encoding of words and advanced machine learning algorithms for the pipeline.

Our pipeline is tested on the following tools for scientific computing and data scientists:

Part 1: Basic Pipeline

We start with the basic pipeline to explain how to run machine learning algorithms in the scikit-learn toolkit on the feature matrix we have already prepared. In this basic pipeline, we use a Naive Bayes classifier as the baseline. Naive Bayes classifiers often perform well in general text classification tasks because their attribute-independence assumption is a good fit for the bag-of-words representation of text data.

In Part 2, we will further introduce how to extract features from the raw text data and use more advanced machine learning algorithms in the pipeline. For now, let us focus on the basic one.

Load features and labels

The features are already prepared and the corresponding labels of these samples are provided in data-train.dat. You can load the dataset as follows:

In [1]: import os
In [2]: from sklearn.datasets import load_svmlight_file
In [3]: BOW_DIR = 'bag-of-words'
In [4]: DATA_TRAIN_PATH = os.path.join(BOW_DIR, 'data-train.dat')
In [5]: N_FEATURES = 37040
In [6]: X, y = load_svmlight_file(DATA_TRAIN_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
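
For readers unfamiliar with the format read by load_svmlight_file: each line holds a label followed by sparse index:value pairs. The sketch below parses one such line using only the standard library. The helper name parse_svmlight_line and the sample line are made up for illustration and are not taken from data-train.dat.

```python
def parse_svmlight_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    # Drop an optional trailing comment introduced by '#'.
    line = line.split('#', 1)[0].strip()
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(':')
        features[int(index)] = float(value)
    return label, features


# Hypothetical line: label 1, feature 0 occurs twice, feature 37039 once.
label, features = parse_svmlight_line('1 0:2 37039:1')
print(label, features)
```

With zero_based=True and N_FEATURES = 37040, valid feature indices run from 0 to 37039, which is why the toy line uses those two values.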

We split the data into a training set (80%) and a validation set (20%):

In [7]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18
In [8]: TEST_SIZE = 0.2
In [9]: RANDOM_STATE = 0
In [10]: X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

The following code outputs the data sizes:

In [11]: X.shape, y.shape
Out[11]: ((2400, 37040), (2400,))
In [12]: X_train.shape, y_train.shape
Out[12]: ((1920, 37040), (1920,))
In [13]: X_val.shape, y_val.shape
Out[13]: ((480, 37040), (480,))

Training and evaluation

Now, we will train the Multinomial Naive Bayes classifier based on the loaded features and labels in the training data:

In [14]: from sklearn.naive_bayes import MultinomialNB
In [15]: clf = MultinomialNB(alpha=0.1)
In [16]: clf.fit(X_train, y_train)
Out[16]: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

where the classifier parameter is set as alpha=0.1.
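
The alpha parameter controls additive smoothing: the probability of a word given a class is estimated as (count + alpha) / (total + alpha * n_features), so words never seen in a class still get a small non-zero probability. The sketch below illustrates this with made-up counts; smoothed_prob is a hypothetical helper, not part of scikit-learn.

```python
def smoothed_prob(word_count, total_count, n_features, alpha):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * n_features)


# Made-up counts: a word never seen in a class whose documents
# contain 1000 word occurrences in total.
N_FEATURES = 37040
for alpha in (0.1, 1.0):
    p_unseen = smoothed_prob(0, 1000, N_FEATURES, alpha)
    print(alpha, p_unseen)
```

A larger alpha pulls all estimates towards the uniform distribution over the vocabulary, which is why it is worth trying several values.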

We then predict on the validation samples and evaluate the performance with the ROC AUC score against the validation labels:

In [17]: y_pred = clf.predict_proba(X_val)[:, 1]
In [18]: from sklearn.metrics import roc_auc_score
In [19]: roc_auc_score(y_val, y_pred)
Out[19]: 0.60255682394209131

The above result indicates that the Naive Bayes classifier with alpha=0.1 reaches an AUC of about 0.60 on the validation set, which is above the 0.5 of a random ranking but leaves room for improvement. Let us train a new model with a different parameter.

In [20]: clf_new = MultinomialNB(alpha=1.0)
In [21]: clf_new.fit(X_train, y_train)
Out[21]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [22]: y_pred_new = clf_new.predict_proba(X_val)[:, 1]
In [23]: roc_auc_score(y_val, y_pred_new)
Out[23]: 0.65065187858634599

The new model with alpha=1.0 seems better than the old one.
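
To interpret the two scores above: the ROC AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The quadratic-time sketch below computes it that way on toy data. The labels and scores are made up, and treating label 1 as the positive class is an assumption.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct pair
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly.
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_toy, y_score_toy))  # 0.75
```

Only the ordering of the scores matters for this metric, not their absolute values.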


Let us submit a model with alpha=1.0. We first train a model again using all the given data (X, y):

In [24]: clf_submit = MultinomialNB(alpha=1.0)
In [25]: clf_submit.fit(X, y)
Out[25]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Next, we load the test samples to make a prediction on them:

In [26]: DATA_TEST_PATH = os.path.join(BOW_DIR, 'data-test.dat')
In [27]: X_test, y_test_dummy = load_svmlight_file(DATA_TEST_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)

The final predictions for the test data are written to sample-submission-basic.dat. You can submit this file directly and then check your score and ranking on the leaderboard.

In [28]: SUBMIT_PATH = 'sample-submission-basic.dat'
In [29]: y_submit = clf_submit.predict_proba(X_test)[:, 1]
In [30]: import numpy as np
In [31]: np.savetxt(SUBMIT_PATH, y_submit, fmt='%.10f')
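
Before uploading, it can be worth checking the file that np.savetxt produced. The sketch below assumes the expected layout is one probability per line, which is what the call above writes. check_submission is a hypothetical helper, and the toy file is written to a temporary directory so that the real submission is not touched.

```python
import os
import tempfile


def check_submission(path, n_expected):
    """Return True if the file holds n_expected values, each within [0, 1]."""
    with open(path) as fh:
        values = [float(line) for line in fh if line.strip()]
    return len(values) == n_expected and all(0.0 <= v <= 1.0 for v in values)


# Toy predictions written in the same '%.10f' layout, one value per line.
toy_pred = [0.25, 0.9, 0.5]
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy-submission.dat')
with open(tmp_path, 'w') as fh:
    for p in toy_pred:
        fh.write('%.10f\n' % p)

print(check_submission(tmp_path, n_expected=len(toy_pred)))
```

For the real file, n_expected would be the number of rows in X_test.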

At this point, we have finished the whole pipeline. Feel free to make a trial submission whenever you obtain a sample-submission-basic.dat prediction for data-test.dat.

Hold on! Are you expecting further improvements of the scores using our pipeline? Then do not hesitate to read Part 2, which should help you improve them.

Part 2: Advanced Information

In this part, we describe the encoding rules of the words in our dataset and try some more advanced machine learning classifiers. We assume that the required modules are imported and the variables are set by the following code:

In [1]: import os
In [2]: import numpy as np
In [3]: from sklearn.datasets import load_svmlight_file
In [4]: from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18
In [5]: from sklearn.metrics import roc_auc_score
In [6]: BOW_DIR = 'bag-of-words'
In [7]: DATA_PATH = os.path.join(BOW_DIR, 'data-train.dat')
In [8]: DATA_TEST_PATH = os.path.join(BOW_DIR, 'data-test.dat')
In [9]: N_FEATURES = 37040
In [10]: TEST_SIZE = 0.2
In [11]: RANDOM_STATE = 0
In [12]: X, y = load_svmlight_file(DATA_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
In [13]: X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

More on classifier

Shall we try more advanced classifiers and see whether they improve the score? Yes, let’s try the following: Logistic Regression and Support Vector Machine.

L1-regularized Logistic Regression (LogisticRegression):

In [14]: from sklearn.linear_model import LogisticRegression
In [15]: clf_lr = LogisticRegression(penalty='l1', solver='liblinear')  # recent scikit-learn needs a solver that supports L1
In [16]: clf_lr.fit(X_train, y_train)
Out[16]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, penalty='l1', random_state=None, tol=0.0001)
In [17]: y_pred_lr = clf_lr.predict_proba(X_val)[:, 1]
In [18]: roc_auc_score(y_val, y_pred_lr)
Out[18]: 0.64002717005616505

Linear SVM (SVC):

In [19]: from sklearn.svm import SVC
In [20]: clf_svc = SVC(kernel = 'linear', probability = True)
In [21]: clf_svc.fit(X_train, y_train)
Out[21]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [22]: y_pred_svc = clf_svc.predict_proba(X_val)[:, 1]
In [23]: roc_auc_score(y_val, y_pred_svc)
Out[23]: 0.71369046412133252

More on features

Creating your own bag-of-words representation

This section introduces how to extract the bag-of-words features from the raw data by yourself. Let us first load the data in the text/ directory.

In [24]: ROOT_DIR = 'text'
In [25]: TEXT_DIR = os.path.join(ROOT_DIR, 'train')
In [26]: TEXT_TEST_DIR = os.path.join(ROOT_DIR, 'test')
In [27]: FILES_PATH = os.path.join(ROOT_DIR, 'text-files-train.dat')
In [28]: FILES_TEST_PATH = os.path.join(ROOT_DIR, 'text-files-test.dat')
In [29]: files = np.loadtxt(FILES_PATH, dtype=str)
In [30]: texts = np.array([open(os.path.join(TEXT_DIR, f), 'r').read().strip() for f in files])
In [31]: files_test = np.loadtxt(FILES_TEST_PATH, dtype=str)
In [32]: texts_test = np.array([open(os.path.join(TEXT_TEST_DIR, f), 'r').read().strip() for f in files_test])

We split the samples into a training set (80%) and a validation set (20%), as we did in Part 1.

In [33]: train, val, y_train, y_val = train_test_split(np.arange(texts.shape[0]), y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
In [34]: texts_train = texts[train]
In [35]: texts_val = texts[val]

The vectors in the bag-of-words/ directory are generated from the words that appear in more than four documents. Here, we generate the vectors using the words that appear in more than one document (min_df=2). CountVectorizer computes the word counts.

In [36]: from sklearn.feature_extraction.text import CountVectorizer
In [37]: MIN_DF = 2
In [38]: count_vect = CountVectorizer(lowercase=True, min_df=MIN_DF)
In [39]: X_cnt_train = count_vect.fit_transform(texts_train)
In [40]: X_cnt_val = count_vect.transform(texts_val)
In [41]: X_cnt_test = count_vect.transform(texts_test)
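
To make the effect of min_df concrete, the following standard-library sketch builds a vocabulary from the tokens that occur in at least min_df documents and then counts them per document. It is a simplification: it splits on whitespace instead of using CountVectorizer's token pattern, and the three toy documents are made up.

```python
from collections import Counter


def build_vocabulary(docs, min_df):
    """Keep the tokens that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each document counts a token at most once
        doc_freq.update(set(doc.lower().split()))
    return sorted(tok for tok, n in doc_freq.items() if n >= min_df)


def count_vector(doc, vocabulary):
    """Bag-of-words counts of one document over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[tok] for tok in vocabulary]


toy_docs = ['F86155 h5675 h5675', 'f86155 z92087', 'h5675 O57357']
toy_vocab = build_vocabulary(toy_docs, min_df=2)
toy_counts = [count_vector(d, toy_vocab) for d in toy_docs]
print(toy_vocab)
print(toy_counts)
```

As with fit_transform and transform above, the vocabulary should be built from the training texts only and then reused for the validation and test texts.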

Instead of CountVectorizer, you can use TfidfVectorizer to extract TF-IDF values:

In [42]: from sklearn.feature_extraction.text import TfidfVectorizer
In [43]: tfidf_vect = TfidfVectorizer(lowercase=True, min_df=MIN_DF)
In [44]: X_tfidf_train = tfidf_vect.fit_transform(texts_train)
In [45]: X_tfidf_val = tfidf_vect.transform(texts_val)
In [46]: X_tfidf_test = tfidf_vect.transform(texts_test)
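
As a rough illustration of what a TF-IDF value is, the sketch below weights raw term counts by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 norm. To our understanding this mirrors the default settings of TfidfVectorizer, but treat that as an assumption and check the documentation of your installed version. The toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    # Smoothed IDF: rarer tokens receive a larger weight.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


tfidf_vocab, tfidf_rows = tfidf_vectors(['a1 b2', 'a1 c3'])
print(tfidf_vocab)
print(tfidf_rows[0])
```

In the first toy document, b2 occurs in fewer documents than a1 and therefore ends up with the larger weight.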

Word encoding

An example of the encoded raw text is as follows:

 R51407 p85199 H66460 t38887n422 a26256c99 h86854 W77164 d53539 R65303 h70394 H58787 t84009 a33300T22 s282f864 p9704n920 ......
...... ......
...... ......
...... h5675O62 z92087 l58673 a79424 L72623 m2004B46 l10413n80 X76814 H70707 o75233 e69397 z88334 h5675h46 O57357 Q79026

In the above text file, each space-separated token, e.g. F86155, is an encoded form of a real English word. For example, you can imagine the following:

  • Fly -> F86155
  • Flying -> F86155b43
  • flight -> f86155j152


F   ly      ing
F   86155   b43

Information of this kind is hidden in the leading (capital) character and the numbers that follow it. One fact to keep in mind: the full dictionary of an existing language has around 500,000 entries, but because the words in our dataset follow a specific encoding rule, it contains 1,000,000+ unique tokens. This means that discovering relationships between words that share a leading character or a following number can be important.

We now briefly demonstrate the use of regular expressions in Python, based on a simple re-thinking of the encoded words. The token_pattern parameter of CountVectorizer accepts a regular expression. Suppose we want to extract the number of occurrences of tokens ending in b43:
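
One way to start exploring such relationships is to split each token into its parts. The sketch below assumes, based only on the examples in this section, that a token consists of a leading letter, a number, and an optional suffix made of a letter and a number. The actual encoding rule is not disclosed, so TOKEN_RE and split_token are hypothetical.

```python
import re

# Assumed layout: leading letter, digits, optional (letter + digits) suffix.
TOKEN_RE = re.compile(r'^([A-Za-z])(\d+)([A-Za-z]\d+)?$')


def split_token(token):
    """Split a token into (leading letter, number, suffix or None)."""
    m = TOKEN_RE.match(token)
    if m is None:
        return None  # the token does not follow the assumed layout
    return m.group(1), m.group(2), m.group(3)


for tok in ['F86155', 'F86155b43', 'f86155j152', 'h5675O62']:
    print(tok, split_token(tok))
```

Grouping tokens by the number part, for instance, would put F86155, F86155b43 and f86155j152 in the same group.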

In [47]: TOKEN_PATTERN_B43 = u'(?u)\\b\\w+b43\\b'
In [48]: count_vect_b43 = CountVectorizer(lowercase=True, min_df=MIN_DF, token_pattern=TOKEN_PATTERN_B43)
In [49]: X_cnt_b43_train = count_vect_b43.fit_transform(texts_train).sum(axis=1)
In [50]: X_cnt_b43_val = count_vect_b43.transform(texts_val).sum(axis=1)
In [51]: X_cnt_b43_test = count_vect_b43.transform(texts_test).sum(axis=1)
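
To see which tokens such a pattern picks up, it can be tried directly with the re module on a short string. The sample string below is made up, and it is lower-cased first to mimic lowercase=True.

```python
import re

TOKEN_PATTERN_B43 = r'(?u)\b\w+b43\b'

# Made-up sample; only tokens that end exactly in 'b43' should match.
sample = 'F86155b43 F86155 h5675b43 z92087b431'.lower()
matches = re.findall(TOKEN_PATTERN_B43, sample)
print(matches)  # ['f86155b43', 'h5675b43']
```

The trailing \b is what rejects z92087b431, where b43 is followed by another digit.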

You can combine X_cnt_* and X_cnt_b43_*:

In [52]: import scipy.sparse
In [53]: X_cnt_combine_b43_train = scipy.sparse.hstack([X_cnt_train, X_cnt_b43_train])
In [54]: X_cnt_combine_b43_val = scipy.sparse.hstack([X_cnt_val, X_cnt_b43_val])
In [55]: X_cnt_combine_b43_test = scipy.sparse.hstack([X_cnt_test, X_cnt_b43_test])


We introduce the notion of cross-validation, which splits the dataset into K parts and repeats the following procedure K times: (1) train a model using K-1 parts, (2) evaluate the model using the remaining part. The following code executes the cross-validation and outputs the mean and standard deviation of the AUC scores:

In [56]: from sklearn.model_selection import KFold  # sklearn.cross_validation before scikit-learn 0.18
In [57]: from sklearn.naive_bayes import MultinomialNB
In [58]: kf = KFold(n_splits=5, shuffle=False)
In [59]: aucs = []
In [60]: for train, val in kf.split(X):
             X_train_cv, y_train_cv = X[train], y[train]
             X_val_cv, y_val_cv = X[val], y[val]
             clf_cv = MultinomialNB(alpha=1.0)
             clf_cv.fit(X_train_cv, y_train_cv)
             y_pred_cv = clf_cv.predict_proba(X_val_cv)[:, 1]
             auc = roc_auc_score(y_val_cv, y_pred_cv)
             aucs.append(auc)  # collect the per-fold score for the summary below
In [61]: np.mean(aucs), np.std(aucs)
Out[61]: (0.67273428817310921, 0.030980080080578294)
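
For intuition about how the folds are formed, the sketch below generates contiguous, unshuffled train/validation index lists with the standard library only. kfold_indices is a hypothetical helper written for illustration. It gives the first n_samples % n_folds folds one extra sample, which we believe matches the unshuffled behaviour of KFold.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, val) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


toy_folds = list(kfold_indices(10, 5))
for train_idx, val_idx in toy_folds:
    print(val_idx, train_idx)
```

Every sample is used for validation exactly once, so the mean over the folds is a less noisy estimate than a single 80/20 split.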


This tutorial offers a quick start for participants in the text-classification competition. We look forward to seeing you on the leaderboard. Have fun!