### Introduction

This tutorial presents an exemplar implementation for the [Sansan Data Analysis Challenge] business card field labeling competition, using Python and the machine learning classifiers provided by scikit-learn.

The source code provided has been tested with Python 2.7.8 and requires the following libraries: pandas, Pillow (PIL), NumPy, and scikit-learn.

Reference (in Japanese):

### Preparation

#### Loading libraries

`In [1]:`

```python
import os
import pandas as pd
from PIL import Image
import numpy as np
import sklearn
```

#### Loading data

First, download the data from the competition webpage. Then, under the `sansan-001` directory, run the following code to load the training data `train.csv` as `df_train`.

`In [2]:`

```python
df_train = pd.read_csv('train.csv')
```

Next, confirm the contents of `df_train`. `filename` is listed in the first column; `left`, `top`, `right`, `bottom` are the coordinates of a designated area of a given image; `company_name`, `full_name`, ..., `url` are the ground-truth labels. For example, looking at the first row of `df_train`, we can tell from `mobile = 1` that the area of the image contains a mobile phone number.

`In [3]:`

```python
df_train.head()
```

`Out[3]:`

Moreover, we can confirm the size of the training dataset with the following code:

`In [4]:`

```python
df_train.shape
```

`Out[4]:`

By checking row 0 of `df_train`, we can see the details of this sub-image:

`In [5]:`

```python
row = df_train.iloc[0, :]
row
```

`Out[5]:`

Opening the image named `row.filename` and extracting the rectangular area designated by `row.left`, `row.top`, `row.right`, `row.bottom`, we obtain `img`. In this `img`, we can confirm that a mobile number does indeed appear.

`In [6]:`

```python
DIR_IMAGES = 'images'
img = Image.open(os.path.join(DIR_IMAGES, row.filename))
img = img.crop((row.left, row.top, row.right, row.bottom))
img
```

`Out[6]:`

Load the 'test data' in the same way as the 'training data'. Unlike the training data, ground-truth labels are not provided for the test data.

`In [7]:`

```python
df_test = pd.read_csv('test.csv')
df_test.head()
```

`Out[7]:`

Our goal is to learn a predictive model from the training data together with its ground-truth labels, and then use the model to make predictions on the test data. As with the training data, we can check that there are 8,918 images in total in the test dataset.

`In [8]:`

```python
df_test.shape
```

`Out[8]:`

In the following, we use only 500 training samples and 100 test samples for simplicity.

`In [9]:`

```python
df_train = df_train.sample(500, random_state=0)
df_test = df_test.sample(100, random_state=0)
```

#### Generating Feature Vectors

In order to learn a predictive model, we first have to convert the images into numerical feature vectors. We use the image in the first row as an example to explain the procedure for generating a feature vector. Let's start by accessing the image in the first row.

`In [10]:`

```python
img
```

`Out[10]:`

For convenience, convert the image to grayscale.

`In [11]:`

```python
img = img.convert('L')
img
```

`Out[11]:`

Since designated areas of different sizes are hard to handle, we resize the image to a 100 × 100 square.

`In [12]:`

```python
IMG_SIZE = 100
img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
img
```

`Out[12]:`

Now, convert the image into a numerical matrix.

`In [13]:`

```python
x = np.asarray(img, dtype=np.float)
x.shape
```

`Out[13]:`

Each entry of this 100 × 100 matrix corresponds to the brightness of a pixel.

`In [14]:`

```python
x
```

`Out[14]:`
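As a quick sanity check (our addition, not part of the original code), we can confirm the grayscale convention: in mode `'L'`, 0 is black and 255 is white, and `np.asarray` returns the pixels as a (height, width) matrix:

```python
import numpy as np
from PIL import Image

# Mode 'L' is 8-bit grayscale: 0 = black, 255 = white.
# np.asarray lays out the pixels as a (height, width) matrix.
white = Image.new('L', (3, 2), color=255)   # width=3, height=2
arr = np.asarray(white, dtype=np.float64)
assert arr.shape == (2, 3)                  # note: (height, width)
assert arr.min() == 255.0 and arr.max() == 255.0
```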

Finally, convert the 100 × 100 matrix into a flattened 10,000-dimensional vector.

`In [15]:`

```python
x = x.flatten()
x
```

`Out[15]:`

`In [16]:`

```python
x.shape
```

`Out[16]:`

The above procedure for generating a feature vector is applicable to all the images. We use it as a preparation step to build `X_train` and `X_test` before diving into the classifier part.

`In [17]:`

```python
X_train = []
for i, row in df_train.iterrows():
    img = Image.open(os.path.join(DIR_IMAGES, row.filename))
    img = img.crop((row.left, row.top, row.right, row.bottom))
    img = img.convert('L')
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
    x = np.asarray(img, dtype=np.float)
    x = x.flatten()
    X_train.append(x)
X_train = np.array(X_train)
```

`In [18]:`

```python
X_test = []
for i, row in df_test.iterrows():
    img = Image.open(os.path.join(DIR_IMAGES, row.filename))
    img = img.crop((row.left, row.top, row.right, row.bottom))
    img = img.convert('L')
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
    x = np.asarray(img, dtype=np.float)
    x = x.flatten()
    X_test.append(x)
X_test = np.array(X_test)
```
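The convert/resize/flatten steps are duplicated across the two loops above; as a sketch, they could be factored into a single helper (the name `image_to_feature` is ours, not from the competition code):

```python
import numpy as np
from PIL import Image

IMG_SIZE = 100

def image_to_feature(img):
    """Turn a (cropped) card region into a flat grayscale feature vector."""
    img = img.convert('L')                                     # grayscale
    img = img.resize((IMG_SIZE, IMG_SIZE), resample=Image.BICUBIC)
    return np.asarray(img, dtype=np.float64).flatten()         # 10,000 dims

# Example on a synthetic image; a real row would first be cropped with
# img.crop((row.left, row.top, row.right, row.bottom)).
vec = image_to_feature(Image.new('RGB', (60, 40)))
assert vec.shape == (IMG_SIZE * IMG_SIZE,)
```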

#### Obtaining ground-truth labels

Store the ground-truth labels of the training data as `Y_train`.

`In [19]:`

```python
columns = ['company_name', 'full_name', 'position_name', 'address',
           'phone_number', 'fax', 'mobile', 'email', 'url']
Y_train = df_train[columns].values
```

At this point, the preparation for predictive modeling is done. Our aim is to train a predictive model on `X_train` and `Y_train`, and to make predictions for `X_test`.

### Training predictive model

Now, let's see how to train a predictive model. Since the predictive performance on the test data is unknown before submission, and it is inconvenient to upload the predictions of every model we try, we split the training data and use part of it for 'local' evaluation of predictive performance. Specifically, we use 80% as a development set and 20% as an evaluation set.

`In [20]:`

```python
from sklearn.model_selection import train_test_split
X_dev, X_val, Y_dev, Y_val = train_test_split(X_train, Y_train, train_size=0.8, random_state=0)
```

The 500 training samples have been split into a development set of 400 and an evaluation set of 100.

`In [21]:`

```python
print X_dev.shape, Y_dev.shape
print X_val.shape, Y_val.shape
```

`Out[21]:`

#### Preprocessing: standardization

Standardization of the dataset is required before applying PCA (described later in this tutorial). We use `sklearn.preprocessing.StandardScaler`.

`In [22]:`

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_dev)
```

`Out[22]:`

Use `scaler` to standardize the dataset:

`In [23]:`

```python
X_dev_scaled = scaler.transform(X_dev)
```

We can see that scaled data has zero mean and unit variance:

`In [24]:`

```python
X_dev_scaled.mean(axis=0)
```

`Out[24]:`

`In [25]:`

```python
X_dev_scaled.var(axis=0)
```

`Out[25]:`

#### Preprocessing: dimension reduction

In order to extract effective features from the 10,000-dimensional feature vectors, we use dimension reduction. Specifically, we reduce the dimension to 10 using PCA. First, we construct the dimension reduction decomposer on the development set: we initialize it with `decomposer = PCA()` and fit it with `decomposer.fit(X_dev_scaled)`.

`In [26]:`

```python
from sklearn.decomposition import PCA
decomposer = PCA(n_components=10, random_state=0)
decomposer.fit(X_dev_scaled)
```

`Out[26]:`

We then apply PCA as a dimension reduction method to the development set and obtain `X_dev_pca`. Specifically, we use `decomposer.transform(X_dev_scaled)`.

`In [27]:`

```python
X_dev_pca = decomposer.transform(X_dev_scaled)
```

`X_dev_pca` indeed contains 10-dimensional feature vectors.

`In [28]:`

```python
print X_dev_pca.shape
```

`Out[28]:`

`In [29]:`

```python
X_val_scaled = scaler.transform(X_val)
X_val_pca = decomposer.transform(X_val_scaled)
```

#### Logistic regression

After the above dimension reduction, we train the predictive model. The problem in this competition is a multi-label classification problem, where each image is annotated with multiple labels. A basic solution to such a problem is to treat each sample-label pair independently and use a binary classifier to predict whether a specific label applies to an image. There are 9 kinds of labels (`company_name`, `full_name`, ..., `url`) in this competition, so we train 9 binary classifiers.

We use logistic regression with L2 regularization and fix the regularization parameter to 0.01. Then, treating each label separately, we obtain the classifiers. After initializing logistic regression with `classifier = LogisticRegression()`, we call `classifier.fit(X_dev_pca, y)` for training, where `y = Y_dev[:, j]` holds the ground-truth labels of the j-th column.

`In [30]:`

```python
from sklearn.linear_model import LogisticRegression

classifiers = []
for j in range(Y_dev.shape[1]):
    y = Y_dev[:, j]
    classifier = LogisticRegression(penalty='l2', C=0.01)
    classifier.fit(X_dev_pca, y)
    classifiers.append(classifier)
```

With these trained classifiers, we can take a first look at the predictive output on the evaluation set. `classifier.predict_proba(X_val_pca)` returns a two-dimensional array, in which the first column is the probability of 'being negative' and the second column is the probability of 'being positive'. Here, we use only the second column, the probability of 'being positive'.

`In [31]:`

```python
Y_val_pred = np.zeros(Y_val.shape)
for j in range(Y_dev.shape[1]):
    classifier = classifiers[j]
    y = classifier.predict_proba(X_val_pca)[:, 1]
    Y_val_pred[:, j] = y
```

`Y_val_pred` now holds, for each of the 100 samples in the evaluation set and each of the 9 labels, the predicted probability of 'being positive'.

`In [32]:`

```python
Y_val_pred.shape
```

`Out[32]:`

Using the ground-truth labels of the evaluation set, `Y_val`, we can measure the prediction accuracy of `Y_val_pred`. As in the official evaluation of the competition, we use the macro-averaged ROC-AUC.

`In [33]:`

```python
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_val, Y_val_pred, average='macro')
```

`Out[33]:`
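To make the metric concrete, here is a small hand-made example (the numbers are ours, not from the competition data): macro averaging computes the ROC-AUC of each label column separately and then takes their unweighted mean.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four samples, two labels; each column is scored independently.
Y_true = np.array([[1, 0],
                   [0, 0],
                   [1, 1],
                   [0, 1]])
Y_score = np.array([[0.9, 0.7],
                    [0.3, 0.1],
                    [0.8, 0.2],
                    [0.1, 0.6]])

per_label = [roc_auc_score(Y_true[:, j], Y_score[:, j]) for j in range(2)]
macro = roc_auc_score(Y_true, Y_score, average='macro')
assert np.isclose(macro, np.mean(per_label))   # macro = mean of per-label AUCs
```

Here label 0 is ranked perfectly (AUC 1.0) while label 1 is not (AUC 0.5), so the macro average is 0.75.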

So far, we have achieved a not-so-bad predictive performance. Moreover, there is a simpler way, `sklearn.multiclass.OneVsRestClassifier`, to implement such multi-label classification by means of binary classification. The following code learns the same model and produces exactly the same predictive performance.

`In [34]:`

```python
from sklearn.multiclass import OneVsRestClassifier
classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
classifier.fit(X_dev_pca, Y_dev)
Y_val_pred = classifier.predict_proba(X_val_pca)
```

`In [35]:`

```python
roc_auc_score(Y_val, Y_val_pred, average='macro')
```

`Out[35]:`

#### Tuning Hyperparameters

So far, we have fixed the regularization parameter to 0.01. In practice, however, the proper regularization parameter varies with the data, and we can find the best value by cross-validation. In cross-validation we repeatedly go through the whole procedure from dimension reduction to classifier training, and thus repeatedly feed hyperparameters into the model. For convenience, let's first define this procedure as a pipeline and see how it simplifies things. Define the operations (i.e. dimension reduction, classifier training, etc.) as `steps`, and construct the pipeline with `pipeline = Pipeline(steps)`.

`In [36]:`

```python
from sklearn.pipeline import Pipeline
steps = [('scaler', StandardScaler()),
         ('decomposer', PCA(10, random_state=0)),
         ('classifier', OneVsRestClassifier(LogisticRegression(penalty='l2')))]
pipeline = Pipeline(steps)
```

Let's pick the best hyperparameter value from {0.01, 0.1, 1.0, 10., 100.} by five-fold cross-validation. Conveniently, this hyperparameter selection with cross-validation can be expressed easily using `sklearn.model_selection.GridSearchCV`. First, define `params` as the parameter candidates. Here, {0.01, 0.1, 1.0, 10., 100.} are the candidates for the regularization hyperparameter `C`, which is fed into the pipeline through `'classifier'` (`OneVsRestClassifier`) -> `estimator` (`LogisticRegression`). This is written as `{'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.]}`. Second, we feed the parameter candidates together with the pipeline definition into `GridSearchCV`.
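A convenient way to look up these double-underscore parameter names (`pipeline.get_params()` is standard scikit-learn, but the snippet itself is our addition) is to list every tunable parameter of the pipeline, including nested ones:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('decomposer', PCA(10, random_state=0)),
                     ('classifier', OneVsRestClassifier(LogisticRegression(penalty='l2')))])

# Nested parameters are addressed as step__param
# (and step__subestimator__param for wrapped estimators).
names = pipeline.get_params().keys()
assert 'classifier__estimator__C' in names
assert 'decomposer__n_components' in names
```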

`In [37]:`

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

params = {'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.]}
scorer = make_scorer(roc_auc_score, average='macro', needs_proba=True)
predictor = GridSearchCV(pipeline, params, cv=5, scoring=scorer)
```

The best hyperparameter can be found with `predictor.fit(X_dev, Y_dev)`, which performs the hyperparameter selection through cross-validation internally.

`In [38]:`

```python
predictor.fit(X_dev, Y_dev)
```

`Out[38]:`

The following code reveals the selected best hyperparameter.

`In [39]:`

```python
predictor.best_params_
```

`Out[39]:`

We can see that `C=10.0` is selected as the best hyperparameter. Meanwhile, we can make predictions on the evaluation set and check the performance. Note that after finding the best hyperparameter, `GridSearchCV` retrains the model on the whole development set with that hyperparameter, and predictions are then made through `predictor.predict_proba()`.

`In [40]:`

```python
Y_val_pred = predictor.predict_proba(X_val)
roc_auc_score(Y_val, Y_val_pred, average='macro')
```

`Out[40]:`

We can see that with `C=10`, the predictive performance beats that of `C=0.01`.

So far, we have still fixed the reduced dimension to 10; this is another hyperparameter that remains to be tuned. In the following example, we jointly tune the two hyperparameters. First, we specify the candidates for `n_components` of `decomposer` (`PCA`) as `{'decomposer__n_components': [10, 20, 50]}` and pass them to `GridSearchCV`. Then we search for the best combination of reduced dimension and regularization parameter through `predictor.fit(X_dev, Y_dev)`.

`In [41]:`

```python
params = {'classifier__estimator__C': [0.01, 0.1, 1.0, 10., 100.],
          'decomposer__n_components': [10, 20, 50]}
predictor = GridSearchCV(pipeline, params, cv=5, scoring=scorer)
predictor.fit(X_dev, Y_dev)
```

`Out[41]:`

After the above cross-validation, the best hyperparameters are found to be `C=0.1` and `n_components=50`.

`In [42]:`

```python
predictor.best_params_
```

`Out[42]:`

As expected, this improves the predictive performance on the evaluation set.

`In [43]:`

```python
Y_val_pred = predictor.predict_proba(X_val)
roc_auc_score(Y_val, Y_val_pred, average='macro')
```

`Out[43]:`

### Submission

Now, let's apply our 'well-trained' model to the test dataset and submit the result. Strictly speaking, the model is not yet fully trained: we have not used all of `X_train`, `Y_train` for training, since a small part of the data, `X_val`, `Y_val` (100 samples in our example), was held out. Since we have already evaluated the model and found the best hyperparameters, let's retrain the model on the whole training dataset.

`In [44]:`

```python
final_predictor = predictor.best_estimator_
final_predictor.fit(X_train, Y_train)
```

`Out[44]:`

Then, let's apply the retrained model to the test dataset and record the predictions in a submission file.

`In [45]:`

```python
Y_test_pred = final_predictor.predict_proba(X_test)
np.savetxt('submission.dat', Y_test_pred, fmt='%.6f')
```

Note that since the number of test samples used in this tutorial is only 100, the `submission.dat` produced here contains only 100 entries, so submitting this `submission.dat` file directly is not valid. Please submit predictions for the whole test dataset, referring to the `sample-submission.dat` provided in the competition dataset.