[Sansan Data Analysis Challenge] Business card field labeling tutorial

Introduction

This tutorial introduces an example implementation for the [Sansan Data Analysis Challenge] Business card field labeling competition, using Python and the machine learning classifiers provided by scikit-learn.

The source code provided has been tested with Python 2.7.8 and requires the following libraries: NumPy, pandas, Pillow (PIL), and scikit-learn.

Preparation

Loading libraries

In [1]:
import os
import pandas as pd
from PIL import Image
import numpy as np
import sklearn

Loading data

First, download the data from the competition webpage. Then, in the sansan-001 directory, run the following code to load the training data train.csv as df_train.

In [2]:
df_train = pd.read_csv('train.csv')

Next, confirm the contents of df_train. The first column is filename; left, top, right, and bottom give the coordinates of a designated area within that image; company_name, full_name, ..., url are the ground-truth labels. For example, looking at the first row of df_train, we can tell from mobile = 1 that this area of the image contains a mobile phone number.

In [3]:
df_train.head()
Out[3]:
filename left top right bottom company_name full_name position_name address phone_number fax mobile email url
0 2842.png 491 455 796 485 0 0 0 0 0 0 1 0 0
1 182.png 24 858 311 886 0 0 0 0 0 0 1 0 0
2 95.png 320 498 865 521 0 0 0 0 0 1 1 0 0
3 2491.png 65 39 497 118 1 0 0 0 0 0 0 0 0
4 3301.png 271 83 333 463 0 1 1 0 0 0 0 0 0

Moreover, we can confirm the size of the training dataset with the following code:

In [4]:
df_train.shape
Out[4]:
(25357, 14)

By checking the 0-th row of df_train, we can see the details of the sub-image:

In [5]:
row = df_train.iloc[0, :]
row
Out[5]:
filename         2842.png
left                  491
top                   455
right                 796
bottom                485
company_name            0
full_name               0
position_name           0
address                 0
phone_number            0
fax                     0
mobile                  1
email                   0
url                     0
Name: 0, dtype: object

Open the image specified by row.filename and crop the designated rectangular area given by row.left, row.top, row.right, and row.bottom to obtain img. Displaying img, we can confirm that this area does contain a mobile phone number.

In [6]:
DIR_IMAGES = 'images'
img = Image.open(os.path.join(DIR_IMAGES, row.filename))
img = img.crop((row.left, row.top, row.right, row.bottom))
img
Out[6]:

Load the 'test data' in the same way as the 'training data'. Unlike the training data, the test data does not include the ground-truth labels.

In [7]:
df_test = pd.read_csv('test.csv')
df_test.head()
Out[7]:
filename left top right bottom
0 1942.png 66 359 361 386
1 1128.png 58 373 519 422
2 2719.png 62 289 297 314
3 641.png 58 668 416 747
4 2529.png 42 212 303 244

Our goal is to learn a predictive model from the training data and its ground-truth labels, and then use that model to produce predictions for the test data. As with the training data, we can check that there are 8,918 samples in total in the test dataset.

In [8]:
df_test.shape
Out[8]:
(8918, 5)

In the following, for simplicity, we use only 500 training samples and 100 test samples.

In [9]:
df_train = df_train.sample(500, random_state=0)
df_test = df_test.sample(100, random_state=0)

Generating Feature Vectors

In order to train a predictive model, we first have to convert the images into numerical feature vectors. We use the image from the first row as an example to explain the feature-generation procedure. Let's start by looking at that image again.

In [10]:
img
Out[10]:

For convenience, convert the image into gray scale.

In [11]:
img = img.convert('L')
img
Out[11]:

Since designated areas of differing sizes are hard to handle uniformly, we resize the image to a 100 * 100 square.

In [12]:
IMG_SIZE = 100
img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
img
Out[12]:

Now, convert the image into a numerical matrix.

In [13]:
x = np.asarray(img, dtype=np.float)
x.shape
Out[13]:
(100, 100)

Each entry of this 100 * 100 matrix corresponds to the brightness of one pixel.

In [14]:
x
Out[14]:
array([[ 204.,  203.,  203., ...,  222.,  223.,  223.],
       [ 204.,  203.,  203., ...,  222.,  223.,  223.],
       [ 204.,  203.,  203., ...,  222.,  223.,  223.],
       ...,
       [ 204.,  204.,  205., ...,  223.,  223.,  224.],
       [ 204.,  204.,  205., ...,  223.,  223.,  224.],
       [ 204.,  204.,  205., ...,  223.,  223.,  224.]])

Finally, flatten the 100 * 100 matrix into a 10,000-dimensional vector.

In [15]:
x = x.flatten()
x
Out[15]:
array([ 204.,  203.,  203., ...,  223.,  223.,  224.])
In [16]:
x.shape
Out[16]:
(10000,)

The above feature-generation procedure applies to every image. We now run it over all samples to prepare X_train and X_test before moving on to the classifiers.

In [17]:
X_train = []
for i, row in df_train.iterrows():
    img = Image.open(os.path.join(DIR_IMAGES, row.filename))
    img = img.crop((row.left, row.top, row.right, row.bottom))
    img = img.convert('L')
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)

    x = np.asarray(img, dtype=np.float)
    x = x.flatten()
    X_train.append(x)

X_train = np.array(X_train)

In [18]:
X_test = []
for i, row in df_test.iterrows():
    img = Image.open(os.path.join(DIR_IMAGES, row.filename))
    img = img.crop((row.left, row.top, row.right, row.bottom))
    img = img.convert('L')
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)

    x = np.asarray(img, dtype=np.float)
    x = x.flatten()
    X_test.append(x)

X_test = np.array(X_test)
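
Incidentally, since the two loops above are identical, the feature extraction can be factored into a small helper function. The following is a minimal sketch; the name extract_features is introduced here only for illustration and is not part of the original code.

def extract_features(df):
    # Crop each designated area, convert to grayscale,
    # resize to IMG_SIZE * IMG_SIZE, and flatten into a vector.
    X = []
    for i, row in df.iterrows():
        img = Image.open(os.path.join(DIR_IMAGES, row.filename))
        img = img.crop((row.left, row.top, row.right, row.bottom))
        img = img.convert('L')
        img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
        X.append(np.asarray(img, dtype=np.float).flatten())
    return np.array(X)

# Equivalent to the loops above:
# X_train = extract_features(df_train)
# X_test = extract_features(df_test)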

Obtaining ground-truth labels

Obtain the ground-truth labels of the training data as Y_train.

In [19]:
columns = ['company_name', 'full_name', 'position_name',
           'address', 'phone_number', 'fax',
           'mobile', 'email', 'url']
Y_train = df_train[columns].values

At this point, the preparation for predictive modeling is done. Our aim is to train a predictive model on X_train, Y_train, and to make predictions for X_test.

Training predictive model

Now, let's see how to train a predictive model. Since the performance on the test data is unknown until submission, and it is inconvenient to upload the predictions of every candidate model, we can split the training data and hold part of it out for a 'local' evaluation of predictive performance. Specifically, we use 80% as the development set and 20% as the evaluation set.

In [20]:
from sklearn.model_selection import train_test_split
X_dev, X_val, Y_dev, Y_val = train_test_split(X_train, Y_train, train_size=0.8, random_state=0)

The 500 training samples have been split into a development set of 400 and an evaluation set of 100.

In [21]:
print X_dev.shape, Y_dev.shape
print X_val.shape, Y_val.shape
Out[21]:
(400, 10000) (400, 9)
(100, 10000) (100, 9)

Preprocessing: standardization

Standardization of the dataset is required for the PCA step (described later in this tutorial). We use sklearn.preprocessing.StandardScaler.

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_dev)
Out[22]:
StandardScaler(copy=True, with_mean=True, with_std=True)

Use scaler to standardize the dataset:

In [23]:
X_dev_scaled = scaler.transform(X_dev)

We can see that scaled data has zero mean and unit variance:

In [24]:
X_dev_scaled.mean(axis=0)
Out[24]:
array([ -3.16413562e-17,  -3.16413562e-17,   3.14193116e-16, ...,
        -1.51545443e-16,  -3.47777362e-16,   1.02695630e-17])
In [25]:
X_dev_scaled.var(axis=0)
Out[25]:
array([ 1.,  1.,  1., ...,  1.,  1.,  1.])
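
Under the hood, StandardScaler simply subtracts the per-feature mean and divides by the per-feature standard deviation, stored in the fitted attributes scaler.mean_ and scaler.scale_. A quick sketch to verify the equivalence:

# StandardScaler is (x - mean) / std, computed feature-wise;
# scale_ is the standard deviation, with zero-variance features set to 1.
manual = (X_dev - scaler.mean_) / scaler.scale_
print np.allclose(manual, X_dev_scaled)  # True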

Preprocessing: dimension reduction

In order to extract effective features from the 10,000-dimensional feature vectors, we use a dimension reduction technique. Specifically, we reduce the dimension to 10 using PCA. First, we construct the dimension reduction decomposer on the development set: we initialize it with decomposer = PCA(n_components=10), and fit the dimension reduction module with decomposer.fit(X_dev_scaled).

In [26]:
from sklearn.decomposition import PCA
decomposer = PCA(n_components=10, random_state=0)
decomposer.fit(X_dev_scaled)
Out[26]:
PCA(copy=True, iterated_power='auto', n_components=10, random_state=0,
          svd_solver='auto', tol=0.0, whiten=False)
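
Before transforming the data, it can be informative to check how much of the total variance the 10 components retain, via the fitted decomposer's explained_variance_ratio_ attribute. A short sketch (the exact values depend on the data, so none are shown here):

# Fraction of the total variance captured by each principal component.
print decomposer.explained_variance_ratio_
# Cumulative fraction retained by all 10 components.
print decomposer.explained_variance_ratio_.sum()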

We can now apply the fitted PCA to the development set to obtain X_dev_pca. Specifically, we use decomposer.transform(X_dev_scaled).

In [27]:
X_dev_pca = decomposer.transform(X_dev_scaled)

X_dev_pca indeed consists of 10-dimensional feature vectors.

In [28]:
print X_dev_pca.shape
Out[28]:
(400, 10)

The same scaler and decomposer, fitted on the development set, are applied to the evaluation set as well:

In [29]:
X_val_scaled = scaler.transform(X_val)
X_val_pca = decomposer.transform(X_val_scaled)

Logistic regression

After the above dimension reduction, we train the predictive model. The problem in this competition is a multi-label classification problem, where each image may be annotated with multiple labels. A basic solution to such problems is to treat each label independently and use a binary classifier to predict whether a specific label applies to an image. There are 9 kinds of labels (company_name, full_name, ..., url) in this competition, so we train 9 binary classifiers.

We use logistic regression with L2 regularization, fixing the regularization parameter C to 0.01. Then, by treating the labels separately, we can obtain the classifiers. After initializing the logistic regression with classifier = LogisticRegression(), we can train it with classifier.fit(X_dev_pca, y), where y = Y_dev[:, j] is the ground-truth label vector taken from the j-th column.

In [30]:
from sklearn.linear_model import LogisticRegression

classifiers = []
for j in range(Y_dev.shape[1]):
    y = Y_dev[:, j]
    classifier = LogisticRegression(penalty='l2', C=0.01)
    classifier.fit(X_dev_pca, y)
    classifiers.append(classifier)

With these trained classifiers, we can have a first glimpse of the predictive output on the evaluation set. The call classifier.predict_proba(X_val_pca) returns a two-dimensional array in which the first column is the probability of 'being negative' and the second column is the probability of 'being positive'. Here, we only use the latter, the probability of 'being positive'.

In [31]:
Y_val_pred = np.zeros(Y_val.shape)
for j in range(Y_dev.shape[1]):
    classifier = classifiers[j]
    y = classifier.predict_proba(X_val_pca)[:, 1]
    Y_val_pred[:, j] = y

Y_val_pred now holds, for each of the 100 samples in the evaluation set, the predicted probability of 'being positive' for each of the 9 labels.

In [32]:
Y_val_pred.shape
Out[32]:
(100, 9)

Using the ground-truth labels of the evaluation set, Y_val, we can measure the prediction accuracy of Y_val_pred. Matching the evaluation measure used in the competition, we use the macro-averaged ROC-AUC.

In [33]:
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_val, Y_val_pred, average='macro')
Out[33]:
0.81590230361126037
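
Here, macro averaging is simply the unweighted mean of the 9 per-label ROC-AUC scores. A short sketch verifying this on the same predictions (it assumes each label takes both values somewhere in Y_val, since ROC-AUC is undefined otherwise):

# Macro ROC-AUC = unweighted mean of the per-label AUCs.
aucs = [roc_auc_score(Y_val[:, j], Y_val_pred[:, j])
        for j in range(Y_val.shape[1])]
print np.mean(aucs)  # matches the macro-averaged score above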

So far, we have achieved a not-so-bad predictive performance. Moreover, there is a simpler way to implement such multi-label classification by means of binary classification: sklearn.multiclass.OneVsRestClassifier. The following code learns the same model, producing exactly the same predictive performance.

In [34]:
from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
classifier.fit(X_dev_pca, Y_dev)
Y_val_pred = classifier.predict_proba(X_val_pca)
In [35]:
roc_auc_score(Y_val, Y_val_pred, average='macro')
Out[35]:
0.81590230361126037
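
If desired, the equivalence can also be checked numerically, by comparing the probabilities from the per-label loop (the classifiers list from In [30]) against those from OneVsRestClassifier. A minimal sketch:

# Rebuild the per-label predictions and compare with the OvR output.
Y_val_pred_loop = np.zeros(Y_val.shape)
for j in range(len(classifiers)):
    Y_val_pred_loop[:, j] = classifiers[j].predict_proba(X_val_pca)[:, 1]
print np.allclose(Y_val_pred_loop, Y_val_pred)  # True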

Tuning Hyperparameters

So far, we have fixed the regularization parameter to 0.01. In practice, however, the proper setting varies with the data. We can find a good value by cross validation. In cross validation, we have to repeatedly go through the whole procedure from dimension reduction to classifier training, feeding in different hyperparameters each time. For convenience, let's first define this procedure as a pipeline and see how it simplifies things. Define the operations (i.e. standardization, dimension reduction, and classifier training) as steps, and construct the pipeline with pipeline = Pipeline(steps).

In [36]:
from sklearn.pipeline import Pipeline

steps = [('scaler', StandardScaler()),
         ('decomposer', PCA(10, random_state=0)),
         ('classifier', OneVsRestClassifier(LogisticRegression(penalty='l2')))]
pipeline = Pipeline(steps)

Let's try to pick the best hyperparameter from {0.01, 0.1, 1.0, 10., 100.} by five-fold cross-validation. Conveniently, such hyperparameter selection with cross validation is easy to express using sklearn.model_selection.GridSearchCV. First, define params as the parameter candidates. Here, {0.01, 0.1, 1.0, 10., 100.} are the candidates for the regularization hyperparameter C, which reaches the pipeline through 'classifier' (OneVsRestClassifier) -> 'estimator' (LogisticRegression); this is written as {'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.]}. Second, we feed the parameter candidates together with the pipeline definition into GridSearchCV.

In [37]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

params = {'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.]}
scorer = make_scorer(roc_auc_score, average='macro', needs_proba=True)

predictor = GridSearchCV(pipeline, params, cv=5, scoring=scorer)

The best hyperparameter can be found by calling predictor.fit(X_dev, Y_dev), which internally performs hyperparameter selection through cross validation.

In [38]:
predictor.fit(X_dev, Y_dev)
Out[38]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decomposer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, d...=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'classifier__estimator__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(roc_auc_score, needs_proba=True, average=macro),
       verbose=0)

The following code reveals the selected best hyperparameter.

In [39]:
predictor.best_params_
Out[39]:
{'classifier__estimator__C': 10.0}

We can see that C=10.0 is selected as the best hyperparameter. We can then make predictions on the evaluation set and confirm the performance. Note that, after finding the best hyperparameter, GridSearchCV retrains the model on the whole development set with that hyperparameter, and we make predictions through predictor.predict_proba().

In [40]:
Y_val_pred = predictor.predict_proba(X_val)
roc_auc_score(Y_val, Y_val_pred, average='macro')
Out[40]:
0.82581633343114969

We can see that with C=10.0, the predictive performance beats that of C=0.01.
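
GridSearchCV also records the mean cross-validated score of the selected configuration in predictor.best_score_, which is handy for comparing settings without touching the evaluation set. A one-line sketch (the exact value depends on the folds, so it is omitted here):

print predictor.best_score_  # mean CV macro ROC-AUC of the best hyperparameter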

So far, we have still fixed the reduced dimension to 10; this is another hyperparameter that remains to be tuned. In the following example, we will see how to tune the two hyperparameters jointly. First, we add candidates for n_components, which reaches the pipeline through 'decomposer' (PCA) and is written as {'decomposer__n_components': [10, 20, 50]}, and pass them to GridSearchCV. Then, we can search for the best combination of reduced dimension and regularization parameter through predictor.fit(X_dev, Y_dev).

In [41]:
params = {'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.],
          'decomposer__n_components': [10, 20, 50]}

predictor = GridSearchCV(pipeline, params, cv=5, scoring=scorer)
predictor.fit(X_dev, Y_dev)
Out[41]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decomposer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, d...=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'classifier__estimator__C': [0.01, 0.1, 1.0, 10.0, 100.0], 'decomposer__n_components': [10, 20, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(roc_auc_score, needs_proba=True, average=macro),
       verbose=0)

After the above cross validation, we find that the best hyperparameters are C=1.0, n_components=50.

In [42]:
predictor.best_params_
Out[42]:
{'classifier__estimator__C': 1.0, 'decomposer__n_components': 50}

As expected, this improves the predictive performance on the evaluation set.

In [43]:
Y_val_pred = predictor.predict_proba(X_val)
roc_auc_score(Y_val, Y_val_pred, average='macro')
Out[43]:
0.8764621490578669

Submission

Now, let's apply our 'well-trained' model to the test dataset and submit the result. Actually, the model is not yet fully trained: we have not used all of X_train, Y_train for training, since X_val, Y_val (100 samples in our example) were held out. Now that we have evaluated the model and found the best hyperparameters, let's retrain it on the whole training dataset.

In [44]:
final_predictor = predictor.best_estimator_
final_predictor.fit(X_train, Y_train)
Out[44]:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decomposer', PCA(copy=True, iterated_power='auto', n_components=50, random_state=0,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, d...=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1))])

Then, let's apply the retrained model to the test dataset and record the predictions in the submission file.

In [45]:
Y_test_pred = final_predictor.predict_proba(X_test)
np.savetxt('submission.dat', Y_test_pred, fmt='%.6f')

Note that, since this tutorial uses only 100 test samples, the submission.dat produced here contains only 100 entries; submitting it directly is therefore not valid. Please submit predictions for the whole test dataset, referring to the sample-submission.dat provided with the competition data.
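
For reference, here is a minimal sketch of producing a full submission, reloading the complete test.csv and regenerating the features with the same procedure as above (the _full names are introduced only for illustration):

# Rebuild features for the FULL test set (8,918 samples) and predict.
df_test_full = pd.read_csv('test.csv')
X_test_full = []
for i, row in df_test_full.iterrows():
    img = Image.open(os.path.join(DIR_IMAGES, row.filename))
    img = img.crop((row.left, row.top, row.right, row.bottom))
    img = img.convert('L')
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
    X_test_full.append(np.asarray(img, dtype=np.float).flatten())
X_test_full = np.array(X_test_full)

Y_test_pred_full = final_predictor.predict_proba(X_test_full)
np.savetxt('submission.dat', Y_test_pred_full, fmt='%.6f')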