Recent advances in deep learning have offered tremendous opportunities across many industries. It is generally accepted that deep learning is best suited for solving computer vision and natural language processing (NLP) problems. This is certainly not a misconception: for those problems, deep learning offers capabilities that cannot be achieved with any other technique.
We also know that companies often deal with tabular (structured) data rather than unstructured data (e.g., text). Tabular data typically consists of rows and columns stored in databases, flat files (CSV, TSV, etc.), spreadsheets, and so on. Given the abundance of tabular data and the power of deep learning, one may ask:
Can we use deep learning techniques on tabular data and still achieve high-quality models?
The answer to this question appears to be YES. Companies like Google are using deep learning to build all kinds of models on tabular data. Naturally, the performance of those models is not open to the general public. However, we know that using neural networks, and particularly embeddings of categorical variables, has led to winning solutions in Kaggle competitions. One example is the Rossmann Store Sales competition, where the third-place team used neural networks and embeddings of categorical features (see their post on Kaggle here and their article here). There are several other articles out there indicating that deep learning, and particularly embeddings of categorical variables, may result in high-quality models (see here, here, and here).
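To make the idea of categorical embeddings concrete, here is a minimal PyTorch sketch (not the exact architecture used by the Rossmann team; the layer sizes and the embedding-size heuristic are our own assumptions):

import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, cat_cardinalities, n_cont, n_classes):
        super().__init__()
        # One trainable embedding table per categorical column;
        # the size heuristic below is just one common rule of thumb.
        self.embeds = nn.ModuleList(
            [nn.Embedding(card, min(50, (card + 1) // 2)) for card in cat_cardinalities]
        )
        emb_dim = sum(e.embedding_dim for e in self.embeds)
        self.layers = nn.Sequential(
            nn.Linear(emb_dim + n_cont, 100),
            nn.ReLU(),
            nn.Linear(100, n_classes),
        )

    def forward(self, x_cat, x_cont):
        # Look up each categorical column in its embedding table,
        # then concatenate with the continuous features.
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        return self.layers(torch.cat([x, x_cont], dim=1))

Each categorical column gets its own learned vector representation, which the fully connected layers can then combine with the continuous features.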
Notably, researchers from fast.ai have done tremendous work finding good general architectures and parameters for these models. They have implemented their findings in their library, fastai, and have shown that we can achieve state-of-the-art results using neural networks on tabular data (see their course here).
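As an illustration, training a tabular model with fastai takes only a few lines. This is a minimal sketch based on the library's documented tabular API, with column names borrowed from the Titanic example discussed later in this post:

from fastai.tabular.all import *

# Build dataloaders from a CSV, declaring categorical and continuous columns
dls = TabularDataLoaders.from_csv(
    "train.csv",
    y_names="Survived",
    cat_names=["Sex", "Embarked", "Pclass"],
    cont_names=["Age", "Fare", "SibSp", "Parch"],
    procs=[Categorify, FillMissing, Normalize],
    y_block=CategoryBlock,  # treat the 0/1 target as a classification label
)

# tabular_learner builds a network with embeddings for the categorical columns
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5)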
What are the benefits of deep learning models over gradient boosted trees such as XGBoost, LightGBM, and CatBoost?
In general, setting interpretability aside, the ultimate goal of building machine learning models is making accurate predictions of future events. These future events could be categorical outcomes (e.g., predicting whether a customer will make a purchase within the next month) or continuous outcomes (e.g., predicting how much revenue a customer will generate within the next six months). To solve these problems, we can use deep learning, tree-based models, or other traditional methods. We may need to consider several factors when deciding between deep learning and tree-based models:
- Domain Knowledge: To improve the accuracy of tree-based models, we generally have to design and implement many features in addition to tuning hyperparameters (the parameters of the models); see our previous work on the Zillow competition here. While statistical methods can help generate some predictive features, we often need a good understanding of our domain to engineer powerful features and beat the benchmarks. In contrast, the main focus when building deep learning models stays on the model architecture and parameter tuning. We still need to do some basic feature engineering to obtain good results, but not nearly as much as for tree-based models.
- Hardware: Deep learning models are often trained on GPUs. Depending on the size of the data, the model, and the problem we are trying to solve, using CPUs may not be a viable option. That said, training on GPUs does not mean that building deep learning models will always be computationally more expensive than tree-based models. In my opinion, the use of deep learning in companies like Google is encouraged partially by hardware availability and existing support for GPUs and TPUs.
- Accuracy/Performance: According to the universal approximation theorem, neural networks can represent a wide range of interesting and complex functions given appropriate weights. That means a network with a sufficient number of weights/parameters may fit our training data very well. The model, however, may not perform well on truly unseen data. As such, it is crucial to use ample training data and apply techniques such as regularization, weight decay, and dropout to prevent overfitting (see the short sketch after this list). While tree-based models can also overfit, it might be more straightforward for data scientists with less experience in deep learning to tune the parameters of those models. Additionally, one type of model may outperform the other depending on the nature of the problem.
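As a minimal PyTorch sketch of that last point, dropout is added as a layer in the architecture while weight decay (L2 regularization) is passed to the optimizer; the layer sizes and learning rate below are arbitrary assumptions:

import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training to reduce overfitting
model = nn.Sequential(
    nn.Linear(20, 100),
    nn.ReLU(),
    nn.Dropout(p=0.4),
    nn.Linear(100, 2),
)

# weight_decay applies L2 regularization to the parameters at every update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)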
How can we build production-level deep learning models in our companies?
Companies have several choices when it comes to leveraging deep learning to build predictive models:
- Internal investment: If your company has enough resources to hire data scientists/software engineers and time is not an issue, you can certainly recruit data scientists with experience in building deep learning models. Given enough time and resources, you will be able to build and maintain deep learning models that solve your business problems.
- Acceleration using external software: If your company already has several data scientists but you'd like to quickly leverage deep learning models, you can take advantage of existing software. As the title and the overview of this blog suggest, Learner now allows users to build production-level deep learning models without writing any code. The two new engines, namely DeepClassifier and DeepRegressor, provide a seamless interface for building deep learning models that is very similar to the one for building tree-based models. For example, if we wanted to build tree-based and deep learning classification models on the Titanic data from Kaggle to predict which passengers survived the Titanic shipwreck, we could simply use the following configuration files (note that these configurations are for demonstration purposes):
XGBoost model:
{"engine": "classifier", "data": { "train_params": { "location": "./sample_data/titanic/train.csv" } }, "column": { "use_cols": ["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"], "target_col":"Survived" }, "process": { "dummies_params": {"activate": true, "cols": ["Sex", "Embarked", "Pclass"]} }, "model": { "models_dict": { "xgb": { "type": "XGBClassifier", "params": {} } } } }
Deep Learning model:
{
  "engine": "deep_classifier",
  "data": { "train_params": { "location": "./sample_data/titanic/train.csv" } },
  "column": {
    "use_cols": ["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"],
    "target_col": "Survived"
  },
  "process": {
    "dummies_params": { "activate": true, "cols": ["Sex", "Embarked", "Pclass"] }
  },
  "model": {
    "models_dict": {
      "dc": {
        "type": "DeepClassifier",
        "params": {
          "fully_connected_layers": [
            { "type": "Linear", "out_features": 50 },
            { "type": "ReLU" },
            { "type": "BatchNorm1d" },
            { "type": "Dropout", "p": 0.4 },
            { "type": "Linear", "out_features": 100 },
            { "type": "ReLU" },
            { "type": "BatchNorm1d" },
            { "type": "Dropout", "p": 0.4 },
            { "type": "Linear", "out_features": 2 },
            { "type": "LogSoftmax", "dim": 1 }
          ],
          "epochs": 100,
          "batch_size": 10000
        }
      }
    }
  }
}
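For intuition, the fully_connected_layers specification above maps naturally onto a PyTorch module along these lines; this is a sketch, not Learner's actual implementation, and the input width of 13 is a hypothetical value for the dummy-encoded Titanic columns:

import torch.nn as nn

n_features = 13  # hypothetical width after dummy encoding of Sex, Embarked, Pclass

model = nn.Sequential(
    nn.Linear(n_features, 50), nn.ReLU(), nn.BatchNorm1d(50), nn.Dropout(p=0.4),
    nn.Linear(50, 100), nn.ReLU(), nn.BatchNorm1d(100), nn.Dropout(p=0.4),
    nn.Linear(100, 2), nn.LogSoftmax(dim=1),  # pairs with nn.NLLLoss for the 2 classes
)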
- Outsourcing: If you'd like to avoid large investments or have time limitations but still want to leverage machine learning models to solve your business problems, outsourcing might be the right option for you. You can let a professional team handle the entire process end to end, including data processing, model building, and deployment. Many companies are certainly capable of handling machine learning problems, but we believe our team at Prizmi is uniquely positioned to tackle them. Our team has built Learner and knows everything about it. In particular, the design of new features in Learner is typically guided by our clients' needs. We enable our clients by building high-quality models for them, and we then give them the option of learning Learner and maintaining their own models. Please feel free to contact us here if you'd like to discuss your business case.
We hope you found this post useful. Please leave questions/comments below.