# Random Forest

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction (regression) or mode prediction (classification) of the individual trees. In our ML workflow, we support both Random Forest Regression and Random Forest Classification.

***

## <mark style="color:blue;">How It Works</mark>

<figure><img src="/files/9N66b4hL3wnnrdSGg0CB" alt=""><figcaption></figcaption></figure>

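1. Bootstrap Aggregating (Bagging): Random Forest draws multiple bootstrap samples from the original dataset by random sampling with replacement. When `bootstrap` is `False`, each tree is trained on the full dataset instead.
2. Decision Tree Creation: A decision tree is built on each sample. At each node of the tree, a random subset of features (controlled by `max_features`) is considered for splitting.
3. Voting/Averaging: For classification tasks, the final prediction is the mode of the predictions from all trees. For regression, it is the average prediction. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.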
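
To make the averaging step concrete, the following minimal sketch fits a small forest on synthetic data and checks that the regression output matches the mean of the individual trees' predictions. The dataset and parameter values are illustrative assumptions, not values taken from our workflow.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative data only).
X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=25, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each fitted tree is stored in `estimators_`; collect their individual predictions.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Average over the tree axis and compare with the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
print(np.allclose(manual_average, forest_prediction))
```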

***

## <mark style="color:blue;">Initialization</mark>

The Random Forest model is initialized in the `initialize_regressor` method:

```python
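# Imports this excerpt relies on (`uniform` is assumed to be scipy.stats.uniform).
from scipy.stats import uniform
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

if self.regressor == 'RandomForest':
    # Pick the estimator class that matches the task type.
    base_estimator_class = RandomForestClassifier if is_classification else RandomForestRegressor
    # Search space used for hyperparameter tuning.
    param_dist = {
        'n_estimators': [50, 100, 200, 300, 500],
        'max_depth': [None, 5, 10, 15, 20, 25],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2', None],
        'bootstrap': [True, False],
        'ccp_alpha': uniform(0, 0.01)  # continuous, uniform on [0, 0.01]
    }
    if is_classification:
        param_dist['criterion'] = ['gini', 'entropy']
        param_dist['class_weight'] = ['balanced', 'balanced_subsample', None]
    else:
        # Note: 'poisson' is only valid for non-negative targets.
        param_dist['criterion'] = ['squared_error', 'absolute_error', 'friedman_mse', 'poisson']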
```
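
To see what a search space like this produces, the sketch below draws a few random candidates with scikit-learn's `ParameterSampler`. The dictionary is a reduced copy of the one above, and the number of draws is an arbitrary choice for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# A reduced copy of the search space above, for illustration.
param_dist = {
    "n_estimators": [50, 100, 200, 300, 500],
    "max_depth": [None, 5, 10, 15, 20, 25],
    "max_features": ["sqrt", "log2", None],
    "ccp_alpha": uniform(0, 0.01),  # continuous: uniform on [0, 0.01]
}

# Draw three random candidates; lists are sampled uniformly,
# distributions are sampled through their `rvs` method.
candidates = list(ParameterSampler(param_dist, n_iter=3, random_state=0))
for params in candidates:
    print(params)
```

Each candidate is a plain dictionary that could be passed to the estimator's constructor as keyword arguments.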

***

## <mark style="color:blue;">Key Components</mark>

1. **Model Selection**:
   * For continuous targets, we use `RandomForestRegressor` from scikit-learn.
   * For categorical targets, we use `RandomForestClassifier` from scikit-learn.
2. **Multi-output Support**:
   * For multiple target variables, we use `MultiOutputRegressor` or `MultiOutputClassifier`.
3. **Hyperparameter Tuning**:
   * When `auto_mode` is enabled, we use `RandomizedSearchCV` for automated hyperparameter tuning.

***

## <mark style="color:blue;">Hyperparameters</mark>

The main hyperparameters for Random Forest include:

* `n_estimators`: The number of trees in the forest.
* `max_depth`: The maximum depth of the trees.
* `min_samples_split`: The minimum number of samples required to split an internal node.
* `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
* `max_features`: The number of features to consider when looking for the best split.
* `bootstrap`: Whether bootstrap samples are used when building trees.
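* `criterion`: The function to measure the quality of a split (differs for classification and regression).
* `ccp_alpha`: The complexity parameter for minimal cost-complexity pruning; larger values prune the trees more aggressively.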

For classification, additional parameters include:

* `class_weight`: Weights associated with classes for dealing with imbalanced datasets.
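
As a small illustration of how one of these hyperparameters constrains the model, the sketch below compares the deepest tree in a forest limited by `max_depth` with the deepest tree in an unrestricted forest. The data and values are assumptions chosen for a quick run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

shallow = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)
unbounded = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0).fit(X, y)

# Depth of the deepest tree in each forest.
shallow_depth = max(tree.get_depth() for tree in shallow.estimators_)
unbounded_depth = max(tree.get_depth() for tree in unbounded.estimators_)
print(shallow_depth, unbounded_depth)
```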

***

## <mark style="color:blue;">Training Process</mark>

The training process is handled in the `fit_regressor` method:

1. The method checks if we're dealing with a multi-output scenario.
2. It reshapes the target variable `y` if necessary for consistency.
3. The Random Forest model is fitted using the `fit` method.

After training, the model is serialized and stored.
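
The steps above could look roughly like the following sketch. The helper name `fit_and_serialize`, the specific reshape rule (flattening a single-column target), and the use of `pickle` are all assumptions made for illustration; the actual `fit_regressor` method may differ.

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def fit_and_serialize(model, X, y):
    """Hypothetical helper: normalize the shape of y, fit, and pickle the model."""
    y = np.asarray(y)
    # A single target stored as a column vector (n, 1) is flattened to (n,).
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    model.fit(X, y)
    return pickle.dumps(model)


X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0)

# Pass the target as a column vector to exercise the reshape branch.
blob = fit_and_serialize(model, X, y.reshape(-1, 1))

# The stored bytes can later be loaded back into a usable model.
restored = pickle.loads(blob)
```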

***

## <mark style="color:blue;">Auto Mode</mark>

When `auto_mode` is enabled:

1. A `RandomizedSearchCV` object is created with the base estimator (`RandomForestRegressor` or `RandomForestClassifier`).
2. It performs a randomized search over the specified parameter distributions.
3. The best parameters found are saved and used for the final model.
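
A minimal sketch of this flow is shown below. The search space is deliberately small so it runs quickly, and `n_iter`, `cv`, and `random_state` are assumed values rather than the settings used in our workflow.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=150, n_features=6, noise=1.0, random_state=0)

# Small search space so the sketch runs quickly.
param_dist = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2, 4],
    "ccp_alpha": uniform(0, 0.01),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=4,  # number of sampled candidates (assumed value)
    cv=3,      # 3-fold cross-validation (assumed value)
    random_state=0,
)
search.fit(X, y)

# Keep the best parameters and rebuild the final model from them.
best_params = search.best_params_
final_model = RandomForestRegressor(random_state=0, **best_params).fit(X, y)
```

With the default `refit=True`, the search object also exposes an already refitted model as `search.best_estimator_`, so rebuilding it by hand is optional.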

***

## <mark style="color:blue;">Multi-output Scenario</mark>

For multiple target variables:

1. In regression tasks, `MultiOutputRegressor` is used to wrap the `RandomForestRegressor`.
2. In classification tasks, `MultiOutputClassifier` is used to wrap the `RandomForestClassifier`.
3. The wrapper fits one forest per target variable, which allows a single model object to predict all targets in one call.
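
The wrapping described above can be sketched as follows for the regression case. The synthetic two-target dataset and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with two target columns (illustrative only).
X, Y = make_regression(n_samples=120, n_features=6, n_targets=2, random_state=0)

# Wrap the base estimator; one forest is fitted per target column.
wrapped = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
wrapped.fit(X, Y)

# Predictions have one column per target.
predictions = wrapped.predict(X[:5])
print(predictions.shape)
```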

***

## <mark style="color:blue;">Advantages and Limitations</mark>

Advantages:

* Handles both linear and non-linear relationships
* Reduces overfitting by averaging multiple decision trees
* Can handle large datasets with high dimensionality
* Provides feature importance rankings

Limitations:

* Less interpretable than single decision trees
* Training and prediction cost grows with the number and depth of trees, so it can be computationally expensive for very large datasets
* May overfit on some datasets if parameters are not tuned properly
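
The feature importance rankings listed under Advantages can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data and assumed parameter values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 8 features, 3 of them informative (illustrative only).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = forest.feature_importances_

# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for index in ranking:
    print(f"feature {index}: {importances[index]:.3f}")
```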

***

## <mark style="color:blue;">Usage Tips</mark>

1. Start with a moderate number of trees (e.g., 100) and increase if needed.
2. Use cross-validation to choose `max_depth`; limiting tree depth helps prevent overfitting.
3. Adjust `min_samples_split` and `min_samples_leaf` to control the complexity of individual trees.
4. For high-dimensional data, consider setting `max_features` to `'sqrt'` or `'log2'`.
5. If dealing with imbalanced classes, experiment with different `class_weight` settings.
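
As one way to apply the cross-validation tip, the sketch below compares a few `max_depth` values by their mean cross-validated accuracy. The candidate depths, fold count, and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

mean_scores = {}
for depth in [3, 5, 10, None]:
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, max_features="sqrt", random_state=0
    )
    # Mean accuracy over 5 cross-validation folds for this depth.
    mean_scores[depth] = cross_val_score(model, X, y, cv=5).mean()

# Depth with the highest mean cross-validated accuracy.
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_depth)
```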

