Random Forest

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction (regression) or mode prediction (classification) of the individual trees. In our ML workflow, we support both Random Forest Regression and Random Forest Classification.


How It Works

  1. Bootstrap Aggregating (Bagging): Random Forest creates multiple subsets of the original dataset through random sampling with replacement.

  2. Decision Tree Creation: For each subset, a decision tree is constructed. At each node of the tree, a random subset of features is considered for splitting.

  3. Voting/Averaging: For classification tasks, the final prediction is the mode of the predictions from all trees. For regression, it's the average prediction.
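
To make the averaging step concrete, here is a small standalone sketch (plain scikit-learn on toy data, separate from our workflow code) that reproduces a forest's regression output by averaging the predictions of its individual trees:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data and a small forest
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each fitted tree is exposed via forest.estimators_; the forest's
# regression prediction is the mean of the per-tree predictions.
per_tree = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])
print(per_tree.mean(axis=0))   # matches forest.predict(X[:3])
print(forest.predict(X[:3]))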


Initialization

The Random Forest model is initialized in the initialize_regressor method:

from scipy.stats import uniform
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

if self.regressor == 'RandomForest':
    # Pick the estimator class based on the target type
    base_estimator_class = RandomForestClassifier if is_classification else RandomForestRegressor
    # Search space used by RandomizedSearchCV in auto mode
    param_dist = {
        'n_estimators': [50, 100, 200, 300, 500],
        'max_depth': [None, 5, 10, 15, 20, 25],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2', None],
        'bootstrap': [True, False],
        'ccp_alpha': uniform(0, 0.01)  # continuous distribution for pruning strength
    }
    if is_classification:
        param_dist['criterion'] = ['gini', 'entropy']
        param_dist['class_weight'] = ['balanced', 'balanced_subsample', None]
    else:
        param_dist['criterion'] = ['squared_error', 'absolute_error', 'friedman_mse', 'poisson']

Key Components

  1. Model Selection:

    • For continuous targets, we use RandomForestRegressor from scikit-learn.

    • For categorical targets, we use RandomForestClassifier from scikit-learn.

  2. Multi-output Support:

    • For multiple target variables, we use MultiOutputRegressor or MultiOutputClassifier.

  3. Hyperparameter Tuning:

    • When auto_mode is enabled, we use RandomizedSearchCV for automated hyperparameter tuning.


Hyperparameters

The main hyperparameters for Random Forest include:

  • n_estimators: The number of trees in the forest.

  • max_depth: The maximum depth of the trees.

  • min_samples_split: The minimum number of samples required to split an internal node.

  • min_samples_leaf: The minimum number of samples required to be at a leaf node.

  • max_features: The number of features to consider when looking for the best split.

  • bootstrap: Whether bootstrap samples are used when building trees.

  • criterion: The function to measure the quality of a split (differs for classification and regression).

For classification, additional parameters include:

  • class_weight: Weights associated with classes for dealing with imbalanced datasets.
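
As an illustration, a regressor with these hyperparameters set explicitly might be constructed as follows; the values are arbitrary examples, not tuned recommendations:

from sklearn.ensemble import RandomForestRegressor

# Illustrative values only; tune them for your data
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    bootstrap=True,
    criterion='squared_error',
)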


Training Process

The training process is handled in the fit_regressor method:

  1. The method checks if we're dealing with a multi-output scenario.

  2. It reshapes the target variable y if necessary for consistency.

  3. The Random Forest model is fitted using the fit method.

After training, the model is serialized and stored.
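
A minimal sketch of that flow, assuming pickle-based storage (the helper name and signature are illustrative, not the actual fit_regressor implementation):

import pickle
import numpy as np

def fit_and_store(model, X, y, path='model.pkl'):
    y = np.asarray(y)
    # Single-output targets are flattened to 1-D for consistency
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    model.fit(X, y)
    # Serialize the fitted model so it can be reloaded for predictions
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return model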


Auto Mode

When auto_mode is enabled:

  1. A RandomizedSearchCV object is created with the base estimator (RandomForestRegressor or RandomForestClassifier).

  2. It performs a randomized search over the specified parameter distributions.

  3. The best parameters found are saved and used for the final model.
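
A sketch of what that search might look like, reusing the param_dist built in initialize_regressor; the n_iter, cv, and scoring values here are illustrative assumptions, and X_train / y_train are assumed to exist:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(),
    param_distributions=param_dist,  # as built in initialize_regressor
    n_iter=20,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=0,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
print(search.best_params_)  # saved and reused for the final model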


Multi-output Scenario

For multiple target variables:

  1. In regression tasks, MultiOutputRegressor is used to wrap the RandomForestRegressor.

  2. In classification tasks, MultiOutputClassifier is used to wrap the RandomForestClassifier.

  3. This allows the model to predict multiple target variables simultaneously.
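
A standalone sketch of the regression case on toy data (the real workflow wires this up internally):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy features and two target columns
X = np.random.rand(100, 4)
Y = np.random.rand(100, 2)

# MultiOutputRegressor fits one forest per target column
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
model.fit(X, Y)
print(model.predict(X[:3]).shape)  # (3, 2): one prediction per target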


Advantages and Limitations

Advantages:

  • Handles both linear and non-linear relationships

  • Reduces overfitting by averaging multiple decision trees

  • Can handle large datasets with high dimensionality

  • Provides feature importance rankings

Limitations:

  • Less interpretable than single decision trees

  • Can be computationally expensive for very large datasets

  • May overfit on some datasets if parameters are not tuned properly


Usage Tips

  1. Start with a moderate number of trees (e.g., 100) and increase if needed.

  2. Use cross-validation to find the optimal max_depth to prevent overfitting (see the sketch after this list).

  3. Adjust min_samples_split and min_samples_leaf to control the complexity of individual trees.

  4. For high-dimensional data, consider setting max_features to 'sqrt' or 'log2'.

  5. If dealing with imbalanced classes, experiment with different class_weight settings.
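
For tip 2, a sketch of a simple cross-validated depth comparison; the candidate depths and scoring choice are illustrative, and X / y are assumed to be your prepared features and target:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Compare candidate depths by mean cross-validated error
for depth in [None, 5, 10, 20]:
    model = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(depth, scores.mean())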