The Zillow Prize competition on Kaggle was structured in two rounds: a qualifying round and a private round for the top 100 qualifying teams. In the qualifying round, we built models to predict the Zestimate's residual error. In the final round, we built home valuation models from scratch. Our team at Prizmi was among the participants in these competitions. We won a gold medal in the first round (ranked 8th among 3,779 teams) and a silver medal in the second round (ranked 12th among 70 teams). While extensive feature engineering and model tuning were crucial for building high-quality models, our medal-winning results were ultimately achieved by combining the predictions of several models. For interested readers, an extensive report on our feature engineering, data processing, and model training is available here. In this article, we introduce the Triad method and show how easy it is to combine the predictions of several models using Learner.
What is model stacking, and how does the Triad method work?
Combining the predictions of several models, or ensemble learning, with the goal of improving on the performance of individual models has been around for many years. Many of the routine algorithms in a data scientist's toolbox, such as Random Forest or Gradient Boosting, are ensemble methods: they build several individual models (trees, in this case) and combine their outputs to arrive at the final predictions (see the article here to read more). In tree-based models, a single decision tree is considered a level 0 model, while Random Forest or Gradient Boosting are classified as level 1 (or ensemble) models. In this article, our goal is to move one step beyond and build a level 2 model by combining/stacking the predictions of level 1 models. All of the routine machine learning (ML) algorithms such as RandomForest, GradientBoosting, LightGBM, XGBoost, Linear Regression, etc. (all models supported in Learner) are considered level 1 models. A combined/stacked model in this case is a combination of those models, which would hopefully outperform every individual model.
The simplest and sometimes the most effective stacking technique is the "mean combine," i.e. averaging the predictions of multiple models. Learner already supports the "mean" method, which can be activated with a single parameter (see here for more details).
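To make the idea concrete, here is a minimal sketch of a mean combine outside of Learner. The data is synthetic, and the two scikit-learn regressors are stand-ins for whatever level 1 models you have trained:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real training set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Train two level 1 models and average their predictions element-wise.
models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_valid) for m in models])
mean_combined = preds.mean(axis=1)  # the "mean combine" of the level 1 models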
To improve our models in the Zillow Prize competition, we first started with simple, traditional stacking techniques. In addition to the "mean" method, a common approach is to use the predictions of the level 1 models as features to train a level 2 model (see the articles here and here to read more). Some practitioners use the training data (sometimes with cross-validation) to build their stacked models; however, using a validation set or a secondary training dataset is recommended to reduce overfitting. It is also customary to train multiple level 2 models and then a level 3 model on top of them. Unfortunately, none of these methods was able to produce a superior stacked model in the Zillow Prize competition. We believe it was difficult to control for overfitting when using traditional ML models to train a level 2 model in this competition. As a result, we developed a new method, the Triad method, to combine the predictions of several models while controlling for overfitting.
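For reference, this is roughly what that traditional approach looks like (again a generic scikit-learn sketch with our own variable names, not Learner's internals): the held-out predictions of the level 1 models become the only features of a level 2 model:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
# A secondary (validation) set keeps the level 2 model away from the data
# that the level 1 models were trained on, reducing overfitting.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

level1 = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
valid_preds = np.column_stack([m.fit(X_train, y_train).predict(X_valid)
                               for m in level1])

# Level 2 model: the level 1 predictions are its only input features.
level2 = LinearRegression().fit(valid_preds, y_valid)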
The Triad method is a novel stacking approach for combining the predictions of several (2, 4, 8, 16, …, 2ⁿ) level 1 models. In short, the method first finds an optimum constant value to multiply the predictions of each individual model. It then creates a neural-network-like structure in which "model pairs" are connected to neurons. The two models connected to a neuron form a "triad," hence the name of the method. The neurons don't have any activation functions, and the weights are optimized using gradient descent against the validation data to obtain the best score. The same structure repeats in each layer until we reach a layer with a single neuron, which holds the final combined predictions. The depth of the network is controlled by the number of initial models. The diagram below shows the structure of such a network:
To optimize the weights, the gradient descent algorithm uses a fixed step size, and the gradients (the change in the score from taking a step) are computed numerically. The type of score and the maximum number of iterations can also be defined. Because we have a very limited number of parameters and a fixed step size, the Triad method is i) extremely fast (it takes only a couple of minutes, depending on the size of the data) and ii) less prone to overfitting.
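The snippet below is our simplified reconstruction of the idea, not Learner's actual implementation. Each neuron blends a pair of prediction vectors with a single convex weight (the per-model constant multipliers described above are omitted for brevity), and each weight is tuned by a fixed-step gradient descent with numerically measured gradients against a validation score:

import numpy as np
from sklearn.metrics import mean_absolute_error

def optimize_weight(p_a, p_b, y_valid, step=0.01, max_iter=200):
    """Tune w so that w*p_a + (1 - w)*p_b minimizes the validation score.

    The "gradient" is the numerically measured change in the score from
    taking one fixed-size step, as described above.
    """
    w = 0.5
    score = mean_absolute_error(y_valid, w * p_a + (1 - w) * p_b)
    for _ in range(max_iter):
        up = mean_absolute_error(y_valid, (w + step) * p_a + (1 - w - step) * p_b)
        down = mean_absolute_error(y_valid, (w - step) * p_a + (1 - w + step) * p_b)
        if up < score:
            w, score = w + step, up
        elif down < score:
            w, score = w - step, down
        else:
            break  # neither direction improves the score
    return w

def triad_combine(preds, y_valid):
    """Collapse 2**n prediction vectors into one, one layer at a time."""
    layer = list(preds)
    while len(layer) > 1:
        next_layer = []
        for p_a, p_b in zip(layer[::2], layer[1::2]):  # the "model pairs"
            w = optimize_weight(p_a, p_b, y_valid)
            next_layer.append(w * p_a + (1 - w) * p_b)  # a neuron, no activation
        layer = next_layer
    return layer[0]  # the final combined predictions

Calling triad_combine([p1, p2, p3, p4], y_valid) on four prediction vectors builds a two-layer network (4 → 2 → 1), matching the structure described above.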
How to use Learner to combine the predictions of multiple models?
The Triad method is supported for the regressor and deep_regressor engines in Learner. Combining the predictions of multiple models using Learner is easy and straightforward: by defining a couple of parameters in your configuration file, you can train, save, and load multiple models and combine their predictions. Let's assume we want to train four models and use the Triad method to combine their predictions. The "model" and "combine" sections of your configuration file would look like this:
"model": {
"num_models": 4,
"train_models": true,
"models_dict": {
"xgb": {
"type": "XGBRegressor",
"params" : {}
},
"rf": {
"type": "RandomForestRegressor",
"params" : {}
},
"lgbm": {
"type": "LGBMRegressor",
"params" : {}
},
"gb": {
"type": "GradientBoostingRegressor",
"params": {}
}
}
},
"combine": {
"triad_params": {"activate": true}
}
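These two sections are a fragment of a larger Learner configuration file (the data and other required sections are omitted here). As a quick sanity check, wrapping the fragment in braces yields valid JSON that you can parse with Python's json module:

import json

# The "model" and "combine" sections from above, wrapped in braces so the
# fragment parses on its own; the rest of the configuration is omitted.
fragment = """
{
    "model": {
        "num_models": 4,
        "train_models": true,
        "models_dict": {
            "xgb": {"type": "XGBRegressor", "params": {}},
            "rf": {"type": "RandomForestRegressor", "params": {}},
            "lgbm": {"type": "LGBMRegressor", "params": {}},
            "gb": {"type": "GradientBoostingRegressor", "params": {}}
        }
    },
    "combine": {
        "triad_params": {"activate": true}
    }
}
"""
config = json.loads(fragment)
# num_models should match the number of entries in models_dict.
assert config["model"]["num_models"] == len(config["model"]["models_dict"])

With four initial models, the resulting Triad network has two layers of neurons (4 → 2 → 1), as described earlier.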
We hope you've found this article useful. Please leave questions/comments below.
Acknowledgement: I would like to thank my teammate in the first round of the Zillow Prize competition, Mohammad Rahimi, for his hard work and dedication.