Winners' report on Text classification problem #1

The first- and second-place winners have kindly shared their solutions to Text classification problem #1.

1st place winner: n.otani

1 Preprocessing

1.1 Modify encoded words

For each encoded word, I discarded the first alphabetic character that appeared in the middle of the word, along with everything after it (e.g. F86155b43 -> F86155).
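
As a minimal sketch, this rule can be read as: keep the leading character plus the digits that follow, and cut at the first later letter (the regex below is one reading of the example above, not the original code):

    import re

    def truncate_encoded_word(word):
        # Keep the leading character and the digits after it; drop the first
        # mid-word letter and everything that follows (F86155b43 -> F86155).
        m = re.match(r'.\d*', word)
        return m.group(0) if m else word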

1.2 Vectorize

After filtering out words that appear fewer than twice, I converted the documents to TF-IDF vectors. I used Gensim, a Python library, for this step.
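
With Gensim, this step might look like the following sketch (variable names are placeholders; reading "fewer than twice" as document frequency is an assumption):

    from gensim import corpora, models

    # tokenized_docs: one list of tokens per document, after the step in 1.1
    dictionary = corpora.Dictionary(tokenized_docs)
    # Drop words that occur in fewer than 2 documents
    dictionary.filter_extremes(no_below=2, no_above=1.0)

    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    tfidf = models.TfidfModel(bow_corpus)
    tfidf_corpus = tfidf[bow_corpus]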

1.3 LSI Dimension reduction

The vectors produced in 1.2 are too large to load and use within the memory and time limits, so I used LSI (Latent Semantic Indexing) to reduce their dimensionality. Gensim was used here again.
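
A sketch of the LSI step, again with Gensim (300 topics, the value reported as best in section 2.2):

    from gensim import models

    # Project the TF-IDF vectors onto 300 latent topics
    lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
    lsi_corpus = lsi[tfidf_corpus]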

2 Training and Prediction

I used the SVM and Gradient Boosting implementations in scikit-learn, a Python library, to train classifiers. Both algorithms have several hyperparameters, which I tuned with grid search and 5-fold cross validation (scikit-learn covers these as well). The tuning log can be found on my GitHub.
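
In outline, the tuning looks like this, sketched for the SVM (the grid values below are illustrative placeholders; the actual search ranges are in the tuning log):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # X_train, y_train: feature matrix and labels from the preprocessing steps
    param_grid = {"C": [2**k for k in range(-2, 5)],
                  "gamma": [2**k for k in range(-4, 2)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)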

2.1 SVM

I used the data prepared in 1.2 (without LSI) for the SVM. The best parameters are as follows (intermediate score: around 0.996); see the sketch after this list:

  • kernel:rbf
  • C: 4
  • gamma: 0.5
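
With these parameters, the final SVM might be built as follows (probability=True is an assumption, added so that class probabilities are available for the averaging in section 3):

    from sklearn.svm import SVC

    # X_tfidf: the TF-IDF matrix from 1.2, without LSI
    svm = SVC(kernel="rbf", C=4, gamma=0.5, probability=True)
    svm.fit(X_tfidf, y)
    svm_pred = svm.predict_proba(X_tfidf_test)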

For parameter tuning, I referred to "SVM実践ガイド (A Practical Guide to Support Vector Classification)" on the blog 睡眠不足?! (in Japanese).

2.2 Gradient Boosting

I used the data prepared in 1.3 (with LSI) for Gradient Boosting. The best parameters are as follows (intermediate score: around 0.995); see the sketch after this list:

  • n_estimators: 50000
  • learning_rate: 2^(-9.5)
  • max_features: log2
  • max_depth: 7
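
The corresponding Gradient Boosting model, as a sketch (variable names are placeholders; 2^(-9.5) is written out as a Python expression):

    from sklearn.ensemble import GradientBoostingClassifier

    # X_lsi: the 300-topic LSI matrix from 1.3; 2**-9.5 is roughly 0.00138
    gbm = GradientBoostingClassifier(n_estimators=50000,
                                     learning_rate=2**-9.5,
                                     max_features="log2",
                                     max_depth=7)
    gbm.fit(X_lsi, y)
    gbm_pred = gbm.predict_proba(X_lsi_test)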

I tried 300, 500 and 750 for the number of topics used in LSI. 300 topics gave the best score.

3 Postprocessing

I blended the results from the SVM and Gradient Boosting simply by averaging the best-scoring predictions from each. The blended values scored 0.9956 (final score).
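
In code, the blend is just an element-wise average of the two probability arrays, for example:

    import numpy as np

    # svm_pred, gbm_pred: class-probability arrays from the two best models
    blended = np.mean([svm_pred, gbm_pred], axis=0)
    labels = blended.argmax(axis=1)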

Source code

The source code is available on my GitHub.

2nd place winner: Vagif

My approach was as follows: I trimmed all the features to 6 characters, applied tf-idf to the data, and then used a logistic regression with L2 regularization and parameter C=11. I tried other approaches, such as Naive Bayes, trees, adaboost, but logistic regression proved to have the best performance. I also tried to make ensembles of logistic regression and other classifiers however I always got worse performance with such ensembles than without them.