Gradient Boosting
Gradient Boosting is an ensemble learning method that builds a series of weak learners (typically decision trees) sequentially, with each new model correcting the errors of the previous ones. In our ML workflow, we support both Gradient Boosting Regression and Gradient Boosting Classification.
Gradient Boosting works by iteratively improving the model's predictions. Here's a step-by-step explanation of the process:
Initialization:
Start with a simple model, often just predicting the mean of the target variable.
Set the number of iterations (trees) to use.
Iterative Process: For each iteration:
Calculate the residuals (errors) between the current model's predictions and the actual target values.
Fit a new weak learner (usually a decision tree) to predict these residuals.
Determine the step size for the update; in practice, each new learner's contribution is shrunk by a fixed learning rate.
Update the model by adding the new weak learner's predictions multiplied by the learning rate.
Loss Function:
The process aims to minimize a loss function (e.g., mean squared error for regression, log loss for classification).
The negative gradient of this loss function with respect to the current predictions provides the pseudo-residuals that each new weak learner is fitted to.
Regularization:
Various techniques like limiting tree depth, subsampling, and shrinkage (learning rate) help prevent overfitting.
Final Model:
The final model is the initial prediction plus the sum of all weak learners, each scaled by the learning rate.
Prediction:
For a new input, each weak learner makes a prediction, and these are summed to get the final prediction.
This process allows Gradient Boosting to create a strong predictive model by focusing on and correcting the errors of previous iterations.
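The following is a minimal from-scratch sketch of this loop for regression with squared-error loss, where the negative gradient is simply the residual. It is for illustration only and is not the code used in our workflow.

```python
# Minimal gradient boosting for regression with squared-error loss.
# Illustrative sketch only -- not the workflow's implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # Initialization: start by predicting the mean of the target.
    init_pred = float(np.mean(y))
    predictions = np.full(len(y), init_pred)
    trees = []
    for _ in range(n_estimators):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = y - predictions
        # Fit a weak learner (shallow tree) to the residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update the model, shrinking the new tree's contribution.
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    return init_pred, trees

def gradient_boost_predict(X, init_pred, trees, learning_rate=0.1):
    # Prediction: initial value plus the summed, scaled tree predictions.
    pred = np.full(X.shape[0], init_pred)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```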
The Gradient Boosting model is initialized in the initialize_regressor method:
Model Selection:
For continuous targets, we use GradientBoostingRegressor from scikit-learn.
For categorical targets, we use GradientBoostingClassifier from scikit-learn.
Multi-output Support:
For multiple target variables, we use MultiOutputRegressor or MultiOutputClassifier.
Hyperparameter Tuning:
When auto_mode is enabled, we use RandomizedSearchCV for automated hyperparameter tuning.
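As a rough illustration of this selection logic, the sketch below shows how such an initialize_regressor step could look. The method name and the estimators come from this page; the signature and body are assumptions, not the actual implementation (tuning under auto_mode is sketched further below).

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor

def initialize_regressor(target_is_continuous, n_targets):
    # Hypothetical signature -- the real method's arguments may differ.
    if target_is_continuous:
        base = GradientBoostingRegressor()       # continuous targets
        return MultiOutputRegressor(base) if n_targets > 1 else base
    base = GradientBoostingClassifier()          # categorical targets
    return MultiOutputClassifier(base) if n_targets > 1 else base
```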
The main hyperparameters for Gradient Boosting include:
n_estimators: The number of boosting stages to perform.
learning_rate: Shrinks the contribution of each tree, helping to prevent overfitting.
max_depth: The maximum depth of the individual regression estimators.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
subsample: The fraction of samples to be used for fitting the individual base learners.
max_features: The number of features to consider when looking for the best split.
loss: The loss function to be optimized.
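For orientation, here is what these hyperparameters look like when passed to GradientBoostingRegressor; the values are arbitrary examples, not our defaults.

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,        # number of boosting stages
    learning_rate=0.05,      # shrinks each tree's contribution
    max_depth=3,             # depth of each individual tree
    min_samples_split=2,     # min samples to split an internal node
    min_samples_leaf=1,      # min samples required at a leaf node
    subsample=0.8,           # fraction of samples used per tree
    max_features="sqrt",     # features considered at each split
    loss="squared_error",    # loss function to optimize
)
```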
The training process is handled in the fit_regressor method:
The method checks if we're dealing with a multi-output scenario.
It reshapes the target variable y if necessary for consistency.
The Gradient Boosting model is fitted using the fit method.
After training, the model is serialized and stored.
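A hedged sketch of what such a fit_regressor step might look like; the method name comes from this page, while the reshaping logic and the use of pickle for serialization are illustrative assumptions.

```python
import pickle
import numpy as np

def fit_regressor(model, X, y):
    # Hypothetical body -- the workflow's actual implementation may differ.
    y = np.asarray(y)
    # Reshape a single target column to 1-D for consistency.
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    # Fit the Gradient Boosting model (or its multi-output wrapper).
    model.fit(X, y)
    # Serialize the trained model for storage.
    serialized_model = pickle.dumps(model)
    return model, serialized_model
```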
When auto_mode is enabled:
A RandomizedSearchCV object is created with the base estimator (GradientBoostingRegressor or GradientBoostingClassifier).
It performs a randomized search over the specified parameter distributions.
The best parameters found are saved and used for the final model.
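A sketch of how such a randomized search can be set up with scikit-learn; the parameter distributions, n_iter, and scoring below are example choices, not our workflow's configuration.

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Example parameter distributions to sample from.
param_distributions = {
    "n_estimators": randint(50, 500),
    "learning_rate": uniform(0.01, 0.3),
    "max_depth": randint(2, 8),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "subsample": uniform(0.6, 0.4),
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(),
    param_distributions=param_distributions,
    n_iter=50,                       # number of sampled parameter settings
    cv=5,                            # 5-fold cross-validation
    scoring="neg_mean_squared_error",
    random_state=42,
)
# search.fit(X_train, y_train)
# search.best_params_      # best parameters found
# search.best_estimator_   # model refit with those parameters
```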
For multiple target variables:
In regression tasks, MultiOutputRegressor is used to wrap the GradientBoostingRegressor.
In classification tasks, MultiOutputClassifier is used to wrap the GradientBoostingClassifier.
This allows the model to predict multiple target variables simultaneously.
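A short usage example of this wrapping for regression; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X = np.random.rand(200, 5)
y = np.random.rand(200, 3)                     # three target variables

# One boosted ensemble is trained per target column.
multi_model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100))
multi_model.fit(X, y)
predictions = multi_model.predict(X)           # shape (200, 3)
```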
Advantages:
Often provides higher accuracy than random forests
Handles non-linear relationships well
Can capture complex patterns in the data
Provides feature importance rankings
Limitations:
Can be prone to overfitting, especially with high learning rates
Generally slower to train than random forests
Less interpretable than single decision trees
Sensitive to outliers and noisy data
Best Practices:
Start with a small learning rate (e.g., 0.01 or 0.1) and a moderate number of estimators.
Use early stopping or cross-validation to determine the optimal number of estimators.
Balance the learning rate and number of estimators: lower learning rates typically require more estimators.
Experiment with different subsample rates to introduce randomness and prevent overfitting.
For high-dimensional data, consider setting max_features to 'sqrt' or 'log2'.
Monitor training and validation errors to detect and prevent overfitting.
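One way to apply several of these tips with scikit-learn's built-in early stopping; the specific values are illustrative starting points, not recommendations from our workflow.

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound; early stopping decides the actual count
    learning_rate=0.05,       # small learning rate paired with many estimators
    subsample=0.8,            # introduce randomness to reduce overfitting
    max_features="sqrt",      # useful for high-dimensional data
    validation_fraction=0.1,  # hold out 10% of training data internally
    n_iter_no_change=10,      # stop once the validation score stops improving
    tol=1e-4,
)
# model.fit(X_train, y_train)
# model.n_estimators_  # number of boosting stages actually fitted
```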