Text classification is an important task in natural language processing. Recent advances in text classification and related domain-specific tasks, such as sentiment analysis, have been successfully powered by computational linguistics and statistical machine learning. In the Text classification problem #1 competition, we emphasize insight into the raw data and encourage participants to discover richer user-defined features from the data itself. With this in mind, we provide this quick-start tutorial, which we hope will save participants time on feature engineering. To make it friendly to both beginners and advanced data miners, the tutorial is organized in two parts:
- Part 1 includes the basic machine learning data analytics pipeline, which works directly on extracted feature vectors of our dataset.
- Part 2 will introduce some advanced topics, including the encoding of words and advanced machine learning algorithms for the pipeline.
Our pipeline is built and tested on the standard Python stack for scientific computing: NumPy, SciPy, and scikit-learn.
Part 1: Basic Pipeline
We start from the basic pipeline to explain how to run machine learning algorithms from the scikit-learn toolkit on the feature matrix we have already prepared. In this basic pipeline, we use a Naive Bayes classifier as the baseline. Naive Bayes performs well on general text classification tasks because its attribute-independence assumption fits the bag-of-words representation of text data well.
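To make the independence assumption concrete, here is a minimal, self-contained sketch of Multinomial Naive Bayes on a toy bag-of-words count matrix (the data is invented purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy bag-of-words matrix: 4 documents over a 3-word vocabulary (raw counts).
X_toy = np.array([[2, 1, 0],
                  [3, 0, 0],
                  [0, 2, 3],
                  [0, 1, 4]])
y_toy = np.array([0, 0, 1, 1])  # class labels for the 4 documents

# Each word count is treated as an independent draw given the class.
clf_toy = MultinomialNB(alpha=1.0)
clf_toy.fit(X_toy, y_toy)
print(clf_toy.predict([[1, 0, 0]]))  # a document dominated by word 0 -> class 0
```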
In Part 2, we will further introduce how to extract features from raw text data and utilize advanced machine learning algorithms in the pipeline. For now, let's focus on the basic one.
Load features and labels
The features are already prepared, and the corresponding labels of these samples are provided in data-train.dat. You can load the dataset as follows:
In : import os
In : from sklearn.datasets import load_svmlight_file
In : BOW_DIR = 'bag-of-words'
In : DATA_TRAIN_PATH = os.path.join(BOW_DIR, 'data-train.dat')
In : N_FEATURES = 37040
In : X, y = load_svmlight_file(DATA_TRAIN_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
We split the data into an 80% training set and a 20% validation set:
In : from sklearn.cross_validation import train_test_split
In : TEST_SIZE = 0.2
In : RANDOM_STATE = 0
In : X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
The following code outputs the data sizes:
In : X.shape, y.shape
Out: ((2400, 37040), (2400,))
In : X_train.shape, y_train.shape
Out: ((1920, 37040), (1920,))
In : X_val.shape, y_val.shape
Out: ((480, 37040), (480,))
Training and evaluation
Now, we will train a Multinomial Naive Bayes classifier on the loaded features and labels of the training data:
In : from sklearn.naive_bayes import MultinomialNB
In : clf = MultinomialNB(alpha=0.1)
In : clf.fit(X_train, y_train)
Out: MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
where the additive-smoothing parameter of the classifier is set to alpha=0.1.
We then make predictions on the validation samples and evaluate the performance:
In : y_pred = clf.predict_proba(X_val)[:, 1]
In : from sklearn.metrics import roc_auc_score
In : roc_auc_score(y_val, y_pred)
Out: 0.60255682394209131
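If you are unfamiliar with ROC AUC, it measures how well the predicted scores rank positives above negatives: 1.0 is a perfect ranking and 0.5 is chance level. A tiny self-contained example with invented labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # any ranking score works, not only probabilities
# 3 of the 4 (positive, negative) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, scores))  # 0.75
```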
The above result indicates that the current Naive Bayes classifier achieves only a modest score on this dataset. Let us build a new model with a different parameter.
In : clf_new = MultinomialNB(alpha=1.0)
In : clf_new.fit(X_train, y_train)
Out: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In : y_pred_new = clf_new.predict_proba(X_val)[:, 1]
In : roc_auc_score(y_val, y_pred_new)
Out: 0.65065187858634599
The new model with alpha=1.0 performs better than the old one on the validation set.
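Rather than comparing two alpha values by hand, you can sweep a grid of candidates and keep the one with the best validation AUC. The sketch below is self-contained, using a synthetic Poisson count matrix as a stand-in for the real bag-of-words features:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_tr = rng.poisson(1.0, size=(200, 50))  # synthetic stand-in for X_train
y_tr = rng.randint(0, 2, size=200)
X_va = rng.poisson(1.0, size=(80, 50))   # synthetic stand-in for X_val
y_va = rng.randint(0, 2, size=80)

best_alpha, best_auc = None, -1.0
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:
    clf = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
    if auc > best_auc:
        best_alpha, best_auc = alpha, auc
print(best_alpha, best_auc)
```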
Let us submit a model with alpha=1.0. We first retrain the model using all the given data:
In : clf_submit = MultinomialNB(alpha=1.0)
In : clf_submit.fit(X, y)
Out: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Next, we load the test samples to make predictions on them:
In : DATA_TEST_PATH = os.path.join(BOW_DIR, 'data-test.dat')
In : X_test, y_test_dummy = load_svmlight_file(DATA_TEST_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
The final predictions for the test data are written to sample-submission-basic.dat; you can directly submit this file and then check your score and ranking on the leaderboard.
In : SUBMIT_PATH = 'sample-submission-basic.dat'
In : y_submit = clf_submit.predict_proba(X_test)[:, 1]
In : import numpy as np
In : np.savetxt(SUBMIT_PATH, y_submit, fmt='%.10f')
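As a sanity check before submitting, you can verify that the written values round-trip correctly and stay within [0, 1]. The snippet below uses toy predictions and an in-memory buffer instead of the real file, purely for illustration:

```python
import io
import numpy as np

y_toy_submit = np.random.RandomState(0).rand(10)  # stand-in for the real y_submit

buf = io.StringIO()
np.savetxt(buf, y_toy_submit, fmt='%.10f')
buf.seek(0)
loaded = np.loadtxt(buf)

assert loaded.shape == y_toy_submit.shape     # one prediction per test sample
assert np.all((loaded >= 0) & (loaded <= 1))  # predict_proba outputs lie in [0, 1]
```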
At this point, we have finished the whole pipeline. Feel free to make a trial submission whenever you obtain a new prediction file.
Hold on! Are you expecting further improvement of your scores with our pipeline? Do not hesitate to read Part 2, which will definitely help you improve.
Part 2: Advanced Information
In this part, we will describe the encoding rules of the words in our dataset and try some advanced machine learning classifiers. We assume that the required modules are imported and the variables are set by the following code:
In : import os
In : import numpy as np
In : from sklearn.datasets import load_svmlight_file
In : from sklearn.cross_validation import train_test_split
In : from sklearn.metrics import roc_auc_score
In : BOW_DIR = 'bag-of-words'
In : DATA_PATH = os.path.join(BOW_DIR, 'data-train.dat')
In : DATA_TEST_PATH = os.path.join(BOW_DIR, 'data-test.dat')
In : N_FEATURES = 37040
In : TEST_SIZE = 0.2
In : RANDOM_STATE = 0
In : X, y = load_svmlight_file(DATA_PATH, n_features=N_FEATURES, dtype=int, zero_based=True)
In : X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
More on classifier
Shall we try more advanced classifiers and expect an improvement? Yes, let's try the following two: Logistic Regression and Support Vector Machine.
L1-regularized Logistic Regression:
In : from sklearn.linear_model import LogisticRegression
In : clf_lr = LogisticRegression(penalty='l1')
In : clf_lr.fit(X_train, y_train)
Out: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l1', random_state=None, tol=0.0001)
In : y_pred_lr = clf_lr.predict_proba(X_val)[:, 1]
In : roc_auc_score(y_val, y_pred_lr)
Out: 0.64002717005616505
Linear SVM:
In : from sklearn.svm import SVC
In : clf_svc = SVC(kernel='linear', probability=True)
In : clf_svc.fit(X_train, y_train)
Out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='linear', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
In : y_pred_svc = clf_svc.predict_proba(X_val)[:, 1]
In : roc_auc_score(y_val, y_pred_svc)
Out: 0.71369046412133252
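Note that probability=True makes SVC fit an extra probability calibration internally, which is slow on larger data. Since ROC AUC only needs a ranking score, decision_function (the signed distance to the separating hyperplane) works just as well; the snippet below demonstrates this on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_tr = rng.rand(100, 5)
y_tr = rng.randint(0, 2, size=100)
X_va = rng.rand(40, 5)
y_va = rng.randint(0, 2, size=40)

clf = SVC(kernel='linear')  # no probability=True needed for ranking metrics
clf.fit(X_tr, y_tr)
scores = clf.decision_function(X_va)  # signed distances serve as ranking scores
print(roc_auc_score(y_va, scores))
```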
More on features
Creating your own bag-of-words representation
This section introduces how to extract bag-of-words features from the raw data by yourself. Let us first load the data in the text/ directory:
In : ROOT_DIR = 'text'
In : TEXT_DIR = os.path.join(ROOT_DIR, 'train')
In : TEXT_TEST_DIR = os.path.join(ROOT_DIR, 'test')
In : FILES_PATH = os.path.join(ROOT_DIR, 'text-files-train.dat')
In : FILES_TEST_PATH = os.path.join(ROOT_DIR, 'text-files-test.dat')
In : files = np.loadtxt(FILES_PATH, dtype=str)
In : texts = np.array([open(os.path.join(TEXT_DIR, f), 'r').read().strip() for f in files])
In : files_test = np.loadtxt(FILES_TEST_PATH, dtype=str)
In : texts_test = np.array([open(os.path.join(TEXT_TEST_DIR, f), 'r').read().strip() for f in files_test])
We split the samples into an 80% training set and a 20% validation set, as we did in Part 1.
In : train, val, y_train, y_val = train_test_split(np.arange(texts.shape[0]), y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
In : texts_train = texts[train]
In : texts_val = texts[val]
The vectors in the bag-of-words/ directory are generated from the words that appear in more than four documents. Here, we try to generate vectors from the words that appear in more than one document.
The CountVectorizer class computes the word counts:
In : from sklearn.feature_extraction.text import CountVectorizer
In : MIN_DF = 2
In : count_vect = CountVectorizer(lowercase=True, min_df=MIN_DF)
In : X_cnt_train = count_vect.fit_transform(texts_train)
In : X_cnt_val = count_vect.transform(texts_val)
In : X_cnt_test = count_vect.transform(texts_test)
In : from sklearn.feature_extraction.text import TfidfVectorizer
In : tfidf_vect = TfidfVectorizer(lowercase=True, min_df=MIN_DF)
In : X_tfidf_train = tfidf_vect.fit_transform(texts_train)
In : X_tfidf_val = tfidf_vect.transform(texts_val)
In : X_tfidf_test = tfidf_vect.transform(texts_test)
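The two vectorizers build the same vocabulary but weight the entries differently: CountVectorizer stores raw counts, while TfidfVectorizer down-weights words that occur in many documents and, by default, L2-normalizes each row. A small self-contained comparison on invented documents:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

X_cnt = CountVectorizer(lowercase=True).fit_transform(docs)
X_tfidf = TfidfVectorizer(lowercase=True).fit_transform(docs)

print(X_cnt.shape == X_tfidf.shape)  # same vocabulary, hence same shape
norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1))).ravel()
print(np.allclose(norms, 1.0))       # tf-idf rows are L2-normalized by default
```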
An example of the encoded raw text follows:
R51407 p85199 H66460 t38887n422 a26256c99 h86854 W77164 d53539 R65303 h70394 H58787 t84009 a33300T22 s282f864 p9704n920 ...... ...... ...... ...... ...... ...... h5675O62 z92087 l58673 a79424 L72623 m2004B46 l10413n80 X76814 H70707 o75233 e69397 z88334 h5675h46 O57357 Q79026
In the text above, each space-separated token, e.g. F86155, is an encoded expression of a real English word. For example, you can imagine the following mapping:
- Fly -> F86155
- Flying -> F86155b43
- flight -> f86155j152
Inside the dataset, such information is hidden in the leading capital letter and the digits that follow. An interesting fact to keep in mind: a full dictionary of an existing language contains around 500,000 words, but because the words in our dataset follow a specific encoding rule, there are more than 1,000,000 unique tokens. This means that discovering potential relationships between words that share a similar leading letter or number is likely important.
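As a starting point for exploiting this structure, you could group tokens by a hypothetical "stem" (the leading letter plus its digits, ignoring case and any trailing suffix). The stem rule below is an assumption for illustration, not the actual encoding:

```python
import re
from collections import defaultdict

# Hypothetical stem rule: a leading letter followed by digits; any trailing
# suffix code (e.g. "b43") is ignored.
stem_pattern = re.compile(r'^([A-Za-z]\d+)')

tokens = ["F86155", "F86155b43", "f86155j152", "H70707", "z92087"]
groups = defaultdict(list)
for tok in tokens:
    m = stem_pattern.match(tok)
    if m:
        groups[m.group(1).lower()].append(tok)

print(dict(groups))  # tokens sharing a stem land in the same group
```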
We now briefly demonstrate the use of regular expressions in Python, based on a simple rethinking of the encoded words.
The token_pattern parameter of CountVectorizer accepts a regular expression. Suppose we want to extract the number of occurrences of tokens ending in b43:
In : TOKEN_PATTERN_B43 = u'(?u)\\b\\w+b43\\b'
In : count_vect_b43 = CountVectorizer(lowercase=True, min_df=MIN_DF, token_pattern=TOKEN_PATTERN_B43)
In : X_cnt_b43_train = count_vect_b43.fit_transform(texts_train).sum(axis=1)
In : X_cnt_b43_val = count_vect_b43.transform(texts_val).sum(axis=1)
In : X_cnt_b43_test = count_vect_b43.transform(texts_test).sum(axis=1)
You can combine this new feature with the original count matrix:
In : import scipy.sparse
In : X_cnt_combine_b43_train = scipy.sparse.hstack([X_cnt_train, X_cnt_b43_train])
In : X_cnt_combine_b43_val = scipy.sparse.hstack([X_cnt_val, X_cnt_b43_val])
In : X_cnt_combine_b43_test = scipy.sparse.hstack([X_cnt_test, X_cnt_b43_test])
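scipy.sparse.hstack stacks matrices column-wise, so the row counts must match and the result gains one column per appended feature. A toy self-contained illustration:

```python
import numpy as np
import scipy.sparse as sp

X_base = sp.csr_matrix(np.arange(12).reshape(3, 4))  # 3 documents, 4 features
extra = sp.csr_matrix(np.array([[1], [0], [2]]))     # one extra feature column

X_comb = sp.hstack([X_base, extra]).tocsr()
print(X_comb.shape)  # (3, 5)
```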
We introduce the notion of cross-validation, which splits the dataset into K parts and repeats the following procedure K times: (1) train a model using K-1 parts, (2) evaluate the model using the remaining part. The following code executes the cross-validation and outputs the mean and standard deviation of the AUC scores:
In : from sklearn.cross_validation import KFold
In : from sklearn.naive_bayes import MultinomialNB
In : kf = KFold(y.shape[0], n_folds=5, shuffle=False)
In : aucs = []
In : for train, val in kf:
         X_train_cv, y_train_cv = X[train], y[train]
         X_val_cv, y_val_cv = X[val], y[val]
         clf_cv = MultinomialNB(alpha=1.0)
         clf_cv.fit(X_train_cv, y_train_cv)
         y_pred_cv = clf_cv.predict_proba(X_val_cv)[:, 1]
         auc = roc_auc_score(y_val_cv, y_pred_cv)
         aucs.append(auc)
In : np.mean(aucs), np.std(aucs)
Out: (0.67273428817310921, 0.030980080080578294)
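Cross-validation pairs naturally with the parameter sweep from earlier: compare the mean AUC across folds for each candidate alpha. The sketch below is self-contained on synthetic data; note that recent scikit-learn versions moved KFold to sklearn.model_selection with a slightly different interface, which is what is used here:

```python
import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation in old versions
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_syn = rng.poisson(1.0, size=(200, 30))  # synthetic stand-in for the real X, y
y_syn = rng.randint(0, 2, size=200)

kf = KFold(n_splits=5, shuffle=False)
mean_aucs = {}
for alpha in [0.1, 1.0]:
    aucs = []
    for train, val in kf.split(X_syn):
        clf = MultinomialNB(alpha=alpha).fit(X_syn[train], y_syn[train])
        aucs.append(roc_auc_score(y_syn[val], clf.predict_proba(X_syn[val])[:, 1]))
    mean_aucs[alpha] = np.mean(aucs)
print(mean_aucs)  # mean cross-validated AUC per candidate alpha
```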