Gradient Boosting


Gradient Boosting is an ensemble learning method that builds a series of weak learners (typically decision trees) sequentially, with each new model correcting the errors of the previous ones. In our ML workflow, we support both Gradient Boosting Regression and Gradient Boosting Classification.


How it Works

Gradient Boosting works by iteratively improving the model's predictions. Here's a step-by-step explanation of the process:

  1. Initialization:

    • Start with a simple model, often just predicting the mean of the target variable.

    • Set the number of iterations (trees) to use.

  2. Iterative Process: For each iteration:

    • Calculate the residuals (errors) between the current model's predictions and the actual target values.

    • Fit a new weak learner (usually a decision tree) to predict these residuals.

    • Compute the optimal step size for the update (in practice this contribution is then shrunk by the learning rate).

    • Update the model by adding the new weak learner's predictions multiplied by the learning rate.

  3. Loss Function:

    • The process aims to minimize a loss function (e.g., mean squared error for regression, log loss for classification).

    • The gradient of this loss function with respect to the model's predictions guides the boosting process.

  4. Regularization:

    • Various techniques like limiting tree depth, subsampling, and shrinkage (learning rate) help prevent overfitting.

  5. Final Model:

    • The final model is the initial prediction plus the sum of all weak learners, each scaled by the learning rate.

  6. Prediction:

    • For a new input, each weak learner makes a prediction, and these are summed to get the final prediction.

This process allows Gradient Boosting to create a strong predictive model by focusing on and correcting the errors of previous iterations.
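
To make these steps concrete, here is a minimal, self-contained sketch of gradient boosting for regression with squared-error loss. It is purely illustrative and is not the code used in our workflow; in practice the scikit-learn estimators described below handle all of this internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # 1. Initialization: start from a constant prediction (the mean of y).
    f0 = np.mean(y)
    predictions = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        # 2. Residuals are the negative gradient of the squared-error loss.
        residuals = y - predictions
        # Fit a weak learner (a shallow decision tree) to the residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update the model: add the new learner's output, scaled by the learning rate.
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boosting_predict(X, f0, trees, learning_rate=0.1):
    # 6. Prediction: initial constant plus the scaled outputs of all weak learners.
    preds = np.full(X.shape[0], f0)
    for tree in trees:
        preds += learning_rate * tree.predict(X)
    return preds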


Initialization

The Gradient Boosting model is initialized in the initialize_regressor method:

# GradientBoostingClassifier and GradientBoostingRegressor are imported from sklearn.ensemble.
if self.regressor == 'GradientBoosting':
    # Choose the classifier or regressor variant based on the target type.
    base_estimator_class = GradientBoostingClassifier if is_classification else GradientBoostingRegressor
    # Parameter distributions searched by RandomizedSearchCV when auto_mode is enabled.
    param_dist = {
        'n_estimators': [50, 100, 200, 300, 500],
        'learning_rate': [0.01, 0.1, 0.5, 1.0],
        'max_depth': [3, 5, 10, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'subsample': [0.8, 0.9, 1.0],
        'max_features': ['sqrt', 'log2', None],
    }
    if is_classification:
        param_dist['loss'] = ['log_loss', 'exponential']
    else:
        param_dist['loss'] = ['squared_error', 'absolute_error', 'huber', 'quantile']

Key Components

  1. Model Selection:

    • For continuous targets, we use GradientBoostingRegressor from scikit-learn.

    • For categorical targets, we use GradientBoostingClassifier from scikit-learn.

  2. Multi-output Support:

    • For multiple target variables, we use MultiOutputRegressor or MultiOutputClassifier.

  3. Hyperparameter Tuning:

    • When auto_mode is enabled, we use RandomizedSearchCV for automated hyperparameter tuning.


Hyperparameters

The main hyperparameters for Gradient Boosting include:

  • n_estimators: The number of boosting stages to perform.

  • learning_rate: Shrinks the contribution of each tree, helping to prevent overfitting.

  • max_depth: The maximum depth of the individual regression estimators.

  • min_samples_split: The minimum number of samples required to split an internal node.

  • min_samples_leaf: The minimum number of samples required to be at a leaf node.

  • subsample: The fraction of samples to be used for fitting the individual base learners.

  • max_features: The number of features to consider when looking for the best split.

  • loss: The loss function to be optimized.
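
As a concrete illustration, a fairly conservative configuration could be instantiated as follows; the values shown are illustrative defaults rather than tuned recommendations.

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=300,        # number of boosting stages
    learning_rate=0.05,      # shrinks each tree's contribution
    max_depth=3,             # depth of each individual tree
    min_samples_split=5,
    min_samples_leaf=2,
    subsample=0.9,           # < 1.0 enables stochastic gradient boosting
    max_features='sqrt',     # features considered per split
    loss='squared_error',    # regression loss to optimize
)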


Training Process

The training process is handled in the fit_regressor method:

  1. The method checks if we're dealing with a multi-output scenario.

  2. It reshapes the target variable y if necessary for consistency.

  3. The Gradient Boosting model is fitted using the fit method.

After training, the model is serialized and stored.
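
In outline, this fitting step can be pictured as the sketch below. It is an approximation for illustration only: the reshaping logic and the serialization method (pickle here) are assumptions, not the exact implementation.

import pickle
import numpy as np

def fit_regressor_sketch(model, X, y):
    y = np.asarray(y)
    # Single-output targets are flattened to 1-D for consistency;
    # multi-output targets keep their 2-D shape.
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    # Fit the Gradient Boosting model (or its multi-output wrapper).
    model.fit(X, y)
    # Serialize the trained model for storage (pickle used here purely for illustration).
    return pickle.dumps(model)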


Auto Mode

When auto_mode is enabled:

  1. A RandomizedSearchCV object is created with the base estimator (GradientBoostingRegressor or GradientBoostingClassifier).

  2. It performs a randomized search over the specified parameter distributions.

  3. The best parameters found are saved and used for the final model.
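
A sketch of how this search might be set up is shown below, reusing the param_dist dictionary from the Initialization section. The number of iterations, fold count, scoring metric, and the X_train/y_train names are placeholders, not the workflow's exact settings.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(),
    param_distributions=param_dist,     # distributions defined in the Initialization section
    n_iter=20,                          # random parameter combinations to try (placeholder)
    cv=5,                               # cross-validation folds (placeholder)
    scoring='neg_mean_squared_error',   # placeholder scoring metric
    random_state=42,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_     # best parameters are kept and reused for the final model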


Multi-output Scenario

For multiple target variables:

  1. In regression tasks, MultiOutputRegressor is used to wrap the GradientBoostingRegressor.

  2. In classification tasks, MultiOutputClassifier is used to wrap the GradientBoostingClassifier.

  3. This allows the model to predict multiple target variables simultaneously.
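
A minimal sketch of this wrapping, assuming y_train holds one column per target variable:

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.multioutput import MultiOutputRegressor, MultiOutputClassifier

# Regression: one GradientBoostingRegressor is fitted per target column.
multi_reg = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=200, learning_rate=0.1))
multi_reg.fit(X_train, y_train)          # y_train shape: (n_samples, n_targets)
predictions = multi_reg.predict(X_test)  # predictions shape: (n_samples, n_targets)

# Classification: analogous wrapping with MultiOutputClassifier.
multi_clf = MultiOutputClassifier(GradientBoostingClassifier())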


Advantages and Limitations

Advantages:

  • Often provides higher accuracy than random forests

  • Handles non-linear relationships well

  • Can capture complex patterns in the data

  • Provides feature importance rankings

Limitations:

  • Can be prone to overfitting, especially with high learning rates

  • Generally slower to train than random forests

  • Less interpretable than single decision trees

  • Sensitive to outliers and noisy data


Usage Tips

  1. Start with a small learning rate (e.g., 0.01 or 0.1) and a moderate number of estimators.

  2. Use early stopping or cross-validation to determine the optimal number of estimators.

  3. Balance the learning rate and number of estimators: lower learning rates typically require more estimators.

  4. Experiment with different subsample rates to introduce randomness and prevent overfitting.

  5. For high-dimensional data, consider setting max_features to 'sqrt' or 'log2'.

  6. Monitor training and validation errors to detect and prevent overfitting.
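
As an illustration of tip 2, scikit-learn's built-in early stopping can cap the number of estimators automatically once a held-out validation score stops improving; the thresholds below are illustrative, not recommended settings.

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=1000,          # upper bound; early stopping decides the actual number
    learning_rate=0.05,
    validation_fraction=0.1,    # hold out 10% of the training data for early stopping
    n_iter_no_change=10,        # stop if no improvement for 10 consecutive stages
    tol=1e-4,
)
model.fit(X_train, y_train)
print(model.n_estimators_)      # number of boosting stages actually fitted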