[MLSS 2015 Predictive Modeling Challenge] Online Market Purchase Prediction: Top Winners' Methods

The first- and second-place winners kindly share their solutions to the [MLSS 2015 Predictive Modeling Challenge] online market purchase prediction task.

1st place winner: Liyuan Xu ("ly9988")

1. Summary

I followed the provided tutorial to build my features. I used user-specific and item-specific features, which can be aggregated directly from the log data. I also used the output of the Latent Dirichlet Allocation (LDA) algorithm as features, in order to take into account each user's preferences and each item's content.

2. Feature Selection / Extraction

I followed the tutorial to build the training data. I split the seven-day log file into two parts: the first five days for features and the last two days for target labels. Please consult the tutorial page for details.
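A minimal sketch of this split, assuming the log carries a date column; the toy frame and the dates stand in for the real seven-day file:

```python
# Five-day / two-day split from the tutorial; the `date` column and the
# toy DataFrame are assumptions, not the actual tutorial schema.
import pandas as pd

log = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=7)})
cutoff = log["date"].min() + pd.Timedelta(days=5)
feature_log = log[log["date"] < cutoff]   # first five days -> features
label_log = log[log["date"] >= cutoff]    # last two days -> target labels
```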

I added the features listed below (a pandas sketch of the aggregation follows the list):

User specific
  • Sex_id
  • Age_group (One-Hot encoded)
  • The number of items the user bought during the training period
  • The number of items the user viewed during the training period
Item specific
  • The total number of times the item was viewed during the training period
  • Aggregate view count with respect to sex_id
  • Aggregate view count with respect to age group
  • The total number of times the item was bought during the training period
  • Aggregate purchase count with respect to sex_id
  • Aggregate purchase count with respect to age group
User-Item specific
  • The number of times the user viewed the item
  • The number of times the user bought the item
  • The number of purchases of the item by users of the same sex as the user
  • The number of views of the item by users of the same sex as the user
  • The number of purchases of the item by users in the same age group as the user
  • The number of views of the item by users in the same age group as the user
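A minimal pandas sketch of these aggregations, assuming the raw log is a DataFrame with hypothetical columns user_id, item_id, sex_id, age_group, and event ("view" or "buy"):

```python
# Count aggregations for the user-, item-, and user-item-specific
# features; all column names are assumptions about the log schema.
import pandas as pd

def count_features(log: pd.DataFrame) -> dict:
    buys = log[log["event"] == "buy"]
    views = log[log["event"] == "view"]
    return {
        # User specific
        "user_buy_count": buys.groupby("user_id").size(),
        "user_view_count": views.groupby("user_id").size(),
        # Item specific (totals plus sex / age-group breakdowns)
        "item_view_count": views.groupby("item_id").size(),
        "item_buy_count": buys.groupby("item_id").size(),
        "item_view_by_sex": views.groupby(["item_id", "sex_id"]).size(),
        "item_buy_by_sex": buys.groupby(["item_id", "sex_id"]).size(),
        "item_view_by_age": views.groupby(["item_id", "age_group"]).size(),
        "item_buy_by_age": buys.groupby(["item_id", "age_group"]).size(),
        # User-item specific
        "pair_view_count": views.groupby(["user_id", "item_id"]).size(),
        "pair_buy_count": buys.groupby(["user_id", "item_id"]).size(),
    }
```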

I also ran LDA, treating each user as a "document" and each item as a "word".

3. Modeling Techniques and Training

I built the LDA training data from a user-item matrix whose entries were 10 × (number of times the user bought the item) + (number of times the user viewed the item). I weighted this matrix with TF-IDF. I ran LDA with my own C++ implementation. I added the user-topic vector, the item-topic vector, and the product of these two vectors as features. Setting the number of topics to 50 seemed to work well. I used GradientBoostingClassifier from scikit-learn for the final prediction.
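A minimal sketch of this pipeline; the winner used a custom C++ LDA, so scikit-learn's LatentDirichletAllocation stands in here, and `buy_counts` / `view_counts` are assumed (n_users × n_items) matrices from the feature window:

```python
# LDA topic features over the weighted user-item matrix. Everything
# here is illustrative: the original implementation was custom C++.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfTransformer

def lda_topic_features(buy_counts, view_counts, n_topics=50, seed=0):
    # Matrix entries: 10 * purchases + views, as described above.
    counts = 10 * buy_counts + view_counts
    weighted = TfidfTransformer().fit_transform(counts)  # TF-IDF weighting
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    user_topics = lda.fit_transform(weighted)    # (n_users, n_topics)
    item_topics = lda.components_.T              # (n_items, n_topics)
    item_topics = item_topics / item_topics.sum(axis=1, keepdims=True)
    return user_topics, item_topics

# For a (user u, item i) pair the extra features are user_topics[u],
# item_topics[i], and their product (read here as the elementwise
# product user_topics[u] * item_topics[i]; an inner product is another
# plausible reading of the write-up).
```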

2nd place winner: Luca Celotti ("Actla")

1. Summary

The procedure used was simple. Four key points: a Gradient Boosting Regressor to build the model; correcting an error in the processing.py provided by the MLSS, which was filling the test.tsv file with zeros; extracting trivial features such as gender and browser used; and extracting one more sophisticated feature, how many times each user checked each item.

2. Feature Selection / Extraction

The MLSS provided a script that extracted 2 features: the number of purchases per user and the number of times each item was purchased.

Besides these two I extracted all the trivial ones and one not-so-trivial one. The 13 trivial ones (one-hot columns, sketched after the list) consisted of:

  • 1 column for gender
  • 2 age columns = one for 20-29 and another for 30-49; if both held a zero, the subject was 50 or older
  • 3 columns for OS = Windows, Mac and Linux
  • 4 columns for browser = Safari, Chrome, Explorer and Firefox
  • 3 columns for languages = Japanese, English and other
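A minimal sketch of these one-hot columns, assuming a users DataFrame with hypothetical os, browser, and language columns already pulled from the log:

```python
# One-hot (dummy) encoding of the categorical user attributes; the toy
# frame and the column names are assumptions, not the contest schema.
import pandas as pd

users = pd.DataFrame({"os": ["Windows", "Mac"],
                      "browser": ["Chrome", "Safari"],
                      "language": ["Japanese", "English"]})
one_hot = pd.get_dummies(users, columns=["os", "browser", "language"])
print(one_hot.columns.tolist())
```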

The not-so-trivial one, as specified before, counts how many times each item was visited by each customer. The approximation behind it: the more Matt looks for quadcopters in his browser, the more likely it is that he wants to buy one.
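A minimal sketch of that feature, assuming a log DataFrame with hypothetical user_id, item_id, and event columns:

```python
# Per (user, item) visit counts; merge onto the candidate pairs with
# how="left" and fillna(0) so never-viewed pairs get a zero.
import pandas as pd

def user_item_views(log: pd.DataFrame) -> pd.DataFrame:
    views = log[log["event"] == "view"]
    return (views.groupby(["user_id", "item_id"])
                 .size()
                 .rename("view_count")
                 .reset_index())
```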

The last two submissions consisted of one model trained on the full 16 features and another trained only on the 2 suggested by the MLSS plus my "sophisticated" feature. The second performed worse in validation but better on the test set.

3. Modeling Techniques and Training

The selection of the model was made by training it on the first 5 days of the data and validating it on the last 2, while for the submission the model was trained on the full 7 days.

The algorithms that performed the best on the validation data were the Random Forest, the Gradient Boosting Classifier, and the Gradient Boosting Regressor, the last one consistently giving a slightly better result.
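A minimal sketch of this validation setup, with random placeholders standing in for the real 16-column feature matrices and AUC used only as an illustrative metric:

```python
# Train on the 5-day features, validate on the 2-day holdout; the
# regressor outputs a continuous purchase score rather than a hard label.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 16)), rng.integers(0, 2, 1000)
X_val, y_val = rng.random((300, 16)), rng.integers(0, 2, 300)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("validation AUC:", roc_auc_score(y_val, model.predict(X_val)))
```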

4. Additional Comments and Observations

Deep neural networks of various shapes were trained without success, likely due in part to the insufficient power of my laptop, which did not allow for intense training sessions. A first phase of clustering was attempted but was not taken to the final step, mainly due to lack of time.

5. Simple Features and Methods

As described before, only three features were enough to give the best result: the number of purchases per user, the number of purchases per item, and the number of times each item was checked by each customer.