Tutorial on MLSS Online Market Purchase Prediction

Introduction

Purchase prediction is an important application of machine learning. This tutorial guides you through submitting your first prediction, working directly with our dataset.

This tutorial requires the following tools: Python, with the NumPy and scikit-learn packages installed.

Preprocessing (Feature and Target Extraction)

First, we generate features to be given to a classifier. A basic feature extraction approach is to count one or more types of events represented in the data. In the code below, we therefore calculate the number of purchase actions for each user and each item in a given log file. Suppose the log file log-0331-0406.tsv is located in your current directory.

from csv import DictReader
user = {}  # Maps user_id to the user's purchase count
item = {}  # Maps item_id to the item's purchase count
with open('log-0331-0406.tsv', 'r') as f:
    for i, row in enumerate(DictReader(f, delimiter='\t')):
        if i%10000 == 0: print('Finished {} rows'.format(i))  # Process indicator
        if row['layer'] != 'order': continue
        uid = row['user_id']
        iid = row['item_id']
        # Count user's event
        if uid not in user:
            user[uid] = 1
        else:
            user[uid] += 1
        # Count item's event
        for ordered_iid in row['order_item_ids'].split('/')[1:]:  # Skip the empty string before the leading '/'
            ordered_iid = '/' + ordered_iid
            if ordered_iid not in item:
                item[ordered_iid] = 1
            else:
                item[ordered_iid] += 1
print('Total {} rows'.format(i+1))
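As an aside, the same counting can be written more compactly with collections.Counter from the standard library; the snippet below is a minimal, behavior-equivalent sketch of the loop above.

from collections import Counter
from csv import DictReader

user = Counter()  # user_id -> number of order rows
item = Counter()  # item_id -> number of times the item was ordered
with open('log-0331-0406.tsv', 'r') as f:
    for row in DictReader(f, delimiter='\t'):
        if row['layer'] != 'order': continue
        user[row['user_id']] += 1
        for ordered_iid in row['order_item_ids'].split('/')[1:]:
            item['/' + ordered_iid] += 1

A Counter behaves like a dict whose missing keys default to 0, so the membership tests used in the next step work unchanged.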

Next, we generate a feature vector for each target pair in the test data by simply joining the frequencies counted above. We then save these vectors in the test.tsv file.

of = open('test.tsv', 'w')
with open('target.tsv') as f:
    for i, line in enumerate(f):
        if i%10000 == 0: print('Finished {} rows'.format(i))  # Process indicator
        row = line.rstrip().split('\t')
        uid = row[0]
        iid = row[1]
        feature = []
        # Combine counters
        ## User feature
        if uid in user:
            feature.append(user[uid])
        else:
            feature.append(0)
        ## Item feature
        if iid in item:
            feature.append(item[iid])
        else:
            feature.append(0)
        of.write('\t'.join(map(str, feature)) + '\n')
print('Total {} rows'.format(i+1))
of.close()
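Before moving on, it can help to sanity-check the generated file, for example by printing the first few feature vectors (a quick sketch):

with open('test.tsv') as f:
    for _ in range(3):
        print(next(f).rstrip())  # each line: user count <TAB> item count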

Next, we prepare the data to train a classifier. Since the log data contains no explicit target labels, the first step is to extract them from the log itself. We divide the seven-day log file into two subsets, one for feature extraction and one for target extraction: here, the first five days for features and the last two days for target labels. The features are obtained in the same manner as for the test data. As target pairs, for which we predict whether the user purchased the item during the last two days, we take all possible pairs of users that purchased at least one item and items purchased by at least one user. In the code below, we extract features and targets from the seven-day log file.

user = {}  # Maps user_id to the user's purchase count (first five days)
item = {}  # Maps item_id to the item's purchase count (first five days)
uids = set()  # All users that purchased at least one item
iids = set()  # All items purchased by at least one user
positive_example = set()  # (user_id, item_id) pairs purchased in the last two days
target_date = set(['2015-04-05', '2015-04-06'])
with open('log-0331-0406.tsv', 'r') as f:
    for i, row in enumerate(DictReader(f, delimiter='\t')):
        if i%10000 == 0: print('Finished {} rows'.format(i))  # Process indicator
        if row['layer'] != 'order': continue
        uid = row['user_id']
        uids.add(uid)
        iid = row['item_id']

        # Target extraction
        if row['date'].split()[0] in target_date:
            for ordered_iid in row['order_item_ids'].split('/')[1:]:
                ordered_iid = '/' + ordered_iid
                positive_example.add((uid, ordered_iid))
                iids.add(ordered_iid)
            continue

        # Feature Extraction
        ## Count user's event
        if uid not in user:
            user[uid] = 1
        else:
            user[uid] += 1
        ## Count item's event
        for ordered_iid in row['order_item_ids'].split('/')[1:]:
            ordered_iid = '/' + ordered_iid
            if ordered_iid not in item:
                item[ordered_iid] = 1
            else:
                item[ordered_iid] += 1
            iids.add(ordered_iid)
print('Total {} rows'.format(i+1))
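At this point, a quick sanity check of the extracted sets is useful (the exact numbers depend on your log file):

print('users: {}, items: {}, positives: {}'.format(
    len(uids), len(iids), len(positive_example)))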

Finally, we generate feature vectors that correspond to each of the target pairs. We write the feature vectors and binary target labels (which indicate whether the user purchased the item) together to the train.tsv file.

of = open('train.tsv', 'w')
i = 0
for uid in uids:
    for iid in iids:
        if i%10000 == 0: print('Finished {} rows'.format(i))  # Process indicator
        feature = []
        # Combine counters
        ## User feature
        if uid in user:
            feature.append(user[uid])
        else:
            feature.append(0)
        ## Item feature
        if iid in item:
            feature.append(item[iid])
        else:
            feature.append(0)
        label = 0
        if (uid, iid) in positive_example: label = 1
        of.write('\t'.join(map(str, feature + [label,])) + '\n')
        i += 1
print('Total {} rows'.format(i))  # i already equals the number of rows written
of.close()

Loading Features and Labels

Now that we have feature vectors and target labels, with the training data stored in train.tsv and the test data in test.tsv, we can load the dataset with the code below.

import numpy as np
DATA_TRAIN_PATH = 'train.tsv'
train = np.genfromtxt(DATA_TRAIN_PATH)
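np.genfromtxt splits on any run of whitespace by default, which handles this tab-separated file; if you prefer to make the format explicit, you can pass the delimiter:

train = np.genfromtxt(DATA_TRAIN_PATH, delimiter='\t')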

Next, we split the data into training data (80%) and validation data (20%), as shown in the code below.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
X = train[:, :-1]
y = train[:, -1]
TEST_SIZE = 0.2
RANDOM_STATE = 0
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
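As a side note, purchase labels are usually highly imbalanced (positive examples are rare). If that holds for your extracted labels, a stratified split preserves the label ratio in both subsets; a minimal variant of the call above (optional, the rest of this tutorial uses the plain split):

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)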

The following script outputs the data sizes:

print(X.shape, y.shape)
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
Out:
(647264, 2) (647264,)
(517811, 2) (517811,)
(129453, 2) (129453,)

Training and Evaluation

We train a Logistic Regression classifier based on the features and labels loaded from the training data. The Logistic Regression classifier yields estimated probabilities of purchase. Here, we use sklearn.linear_model.LogisticRegression. The code below creates the classifier.

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=RANDOM_STATE)

Here, the classifier's parameters are left at their default values.
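If you want to check exactly which default values are in effect, every scikit-learn estimator exposes them via get_params():

print(clf.get_params())  # shows C, penalty, fit_intercept, and the other defaults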

We then evaluate the classifier on the validation data. We first fit the model and compute predictions for the validation data, as shown in the code below.

clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_val)[:, 1]

Here, following the competition's requirements, the classifier yields the probability that the label is 1, which indicates a purchase.
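Note that predict_proba returns one column per class, ordered as in clf.classes_; the [:, 1] indexing above relies on class 1 occupying the second column, which you can verify:

print(clf.classes_)  # expected: [0. 1.], so column 1 holds the probability of a purchase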

Next, we measure performance using the area under the receiver operating characteristic curve (AUC), the metric used in the competition. In the code below, we use roc_auc_score to obtain the AUC value.

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_val, y_pred))
Out: 0.922303182959
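For reference, a predictor that assigns random scores achieves an AUC of about 0.5, so this is well above chance; a quick sketch to confirm, reusing RANDOM_STATE from above:

import numpy as np
rng = np.random.RandomState(RANDOM_STATE)
print(roc_auc_score(y_val, rng.rand(len(y_val))))  # expect roughly 0.5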

The AUC of 0.92 obtained above indicates that the current Logistic Regression classifier already achieves a good score on the given dataset. Next, in the code below, we define a new model with a different parameter setting.

clf_new = LogisticRegression(fit_intercept=False, random_state=RANDOM_STATE)
clf_new.fit(X_train, y_train)
y_pred_new = clf_new.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, y_pred_new))
Out: 0.92388724538

The new model with fit_intercept=False appears to perform slightly better than the previous one. This parameter specifies whether a constant (bias term) is added to the decision function.
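To make the role of the bias term concrete, the sketch below reproduces predict_proba from the learned weights (it assumes scipy is available, which it is wherever scikit-learn is installed); with fit_intercept=False, intercept_ is fixed at zero:

import numpy as np
from scipy.special import expit  # the logistic sigmoid

# For binary logistic regression: P(y=1 | x) = sigmoid(w . x + b)
manual = expit(X_val @ clf_new.coef_.T + clf_new.intercept_).ravel()
print(np.allclose(manual, clf_new.predict_proba(X_val)[:, 1]))  # True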

Submission

Finally, we submit a prediction generated by the model with fit_intercept=False, which achieved the better score on the validation set. We train the model on all of the training data (X, y), as shown in the code below.

clf_submit = LogisticRegression(fit_intercept=False, random_state=RANDOM_STATE)
clf_submit.fit(X, y)

Next, we load the test data so that we can generate a prediction for it.

DATA_TEST_PATH = 'test.tsv'
X_test = np.genfromtxt(DATA_TEST_PATH)

As a last step, as shown in the code below, we save the final prediction for the test data to the sample-submission-basic.tsv file.

SUBMIT_PATH = 'sample-submission-basic.tsv'
y_submit = clf_submit.predict_proba(X_test)[:, 1]
np.savetxt(SUBMIT_PATH, y_submit, fmt='%.5f')

Congratulations! You have finished the entire prediction process. Submit your first prediction file sample-submission-basic.tsv, and view your score and rank on the leaderboard. In this tutorial, we used only a small portion of the information in the log data. You can improve your predictions by taking various other features into consideration.
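As one hypothetical example (an illustration, not a tuned recipe), you could augment each user-item pair with log-scaled counts and an interaction term, which often helps a linear model when raw counts span several orders of magnitude:

import numpy as np

def extend_features(X):
    # Assumes X[:, 0] is the user's order count and X[:, 1] the item's order count
    u, it = X[:, 0], X[:, 1]
    return np.column_stack([X, np.log1p(u), np.log1p(it), np.log1p(u * it)])

X_ext = extend_features(X)            # remember to transform the test data too
X_test_ext = extend_features(X_test)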