Random Forest
Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction (regression) or mode prediction (classification) of the individual trees. In our ML workflow, we support both Random Forest Regression and Random Forest Classification.
How It Works

Bootstrap Aggregating (Bagging): Random Forest creates multiple subsets of the original dataset through random sampling with replacement.
Decision Tree Creation: For each subset, a decision tree is constructed. At each node of the tree, a random subset of features is considered for splitting.
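Voting/Averaging: For classification tasks, the final prediction is the mode (majority vote) of the predictions from all trees. Note that scikit-learn's implementation averages the trees' predicted class probabilities and returns the most probable class, which usually agrees with the majority vote. For regression, the final prediction is the average of the trees' predictions.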
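The three steps above can be sketched by hand with scikit-learn's DecisionTreeRegressor. This is only an illustration under assumptions (synthetic data, 25 trees, and the made-up names trees, per_tree and forest_prediction); it is not how RandomForestRegressor is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Step 1 (bagging): draw a bootstrap sample, i.e. n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one tree on that sample; max_features="sqrt" makes every
    # split consider only a random subset of the features.
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3 (regression): average the per-tree predictions.
per_tree = np.stack([t.predict(X) for t in trees])  # shape (25, 200)
forest_prediction = per_tree.mean(axis=0)           # shape (200,)
```

For classification, the last step would instead combine the trees' class predictions or class probabilities.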
Initialization
The Random Forest model is initialized in the initialize_regressor method:
    if self.regressor == 'RandomForest':
        base_estimator_class = RandomForestClassifier if is_classification else RandomForestRegressor
        # Parameter distributions for the randomized search
        param_dist = {
            'n_estimators': [50, 100, 200, 300, 500],
            'max_depth': [None, 5, 10, 15, 20, 25],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'max_features': ['sqrt', 'log2', None],
            'bootstrap': [True, False],
            # uniform is presumably scipy.stats.uniform: samples from [0, 0.01]
            'ccp_alpha': uniform(0, 0.01)
        }
        # The split criterion depends on the task type
        if is_classification:
            param_dist['criterion'] = ['gini', 'entropy']
            param_dist['class_weight'] = ['balanced', 'balanced_subsample', None]
        else:
            param_dist['criterion'] = ['squared_error', 'absolute_error', 'friedman_mse', 'poisson']
Key Components
Model Selection:
For continuous targets, we use RandomForestRegressor from scikit-learn.
For categorical targets, we use RandomForestClassifier from scikit-learn.
Multi-output Support:
For multiple target variables, we use MultiOutputRegressor or MultiOutputClassifier.
Hyperparameter Tuning:
When auto_mode is enabled, we use RandomizedSearchCV for automated hyperparameter tuning.
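The model selection and multi-output wrapping described above could look roughly like the following. The helper make_random_forest and its arguments are hypothetical names chosen for this sketch; the workflow's actual code may be organized differently.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor


def make_random_forest(is_classification, n_targets=1, **params):
    """Hypothetical helper: pick the estimator class and wrap it if needed."""
    base_cls = RandomForestClassifier if is_classification else RandomForestRegressor
    model = base_cls(**params)
    if n_targets > 1:
        # The wrapper fits one independent forest per target column.
        wrapper = MultiOutputClassifier if is_classification else MultiOutputRegressor
        model = wrapper(model)
    return model


single = make_random_forest(is_classification=False, n_estimators=100)
multi = make_random_forest(is_classification=True, n_targets=3)
```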
Hyperparameters
The main hyperparameters for Random Forest include:
n_estimators: The number of trees in the forest.
max_depth: The maximum depth of the trees.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
max_features: The number of features to consider when looking for the best split.
bootstrap: Whether bootstrap samples are used when building trees.
criterion: The function to measure the quality of a split (differs for classification and regression).
For classification, additional parameters include:
class_weight: Weights associated with classes for dealing with imbalanced datasets.
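As a quick illustration, the hyperparameters listed above map directly onto constructor arguments of the scikit-learn estimators. The values below are arbitrary picks from the search space shown earlier, not recommended defaults.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,         # number of trees in the forest
    max_depth=10,             # cap on tree depth (None means no cap)
    min_samples_split=5,      # samples needed to split an internal node
    min_samples_leaf=2,       # samples required in each leaf
    max_features="sqrt",      # features considered at each split
    bootstrap=True,           # train each tree on a bootstrap sample
    criterion="gini",         # split quality measure for classification
    class_weight="balanced",  # classification only: reweight classes
    random_state=0,           # fixed seed so the example is reproducible
)
params = clf.get_params()
```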
Training Process
The training process is handled in the fit_regressor method:
The method checks if we're dealing with a multi-output scenario.
It reshapes the target variable y if necessary for consistency.
The Random Forest model is fitted using the fit method.
After training, the model is serialized and stored.
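A minimal sketch of these steps is shown below. The function fit_and_serialize is a hypothetical stand-in for fit_regressor, and pickle is an assumption: the text does not say which serialization format or storage backend the workflow uses.

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor


def fit_and_serialize(X, y):
    """Hypothetical stand-in for fit_regressor; returns the pickled model."""
    y = np.asarray(y)
    # Multi-output means y has more than one target column.
    multi_output = y.ndim == 2 and y.shape[1] > 1
    if not multi_output:
        # Flatten an (n, 1) column vector to the 1-D shape scikit-learn expects.
        y = y.ravel()
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    if multi_output:
        model = MultiOutputRegressor(model)
    model.fit(X, y)
    return pickle.dumps(model)  # serialization format is an assumption


X, y = make_regression(n_samples=100, n_features=5, random_state=0)
blob = fit_and_serialize(X, y.reshape(-1, 1))
restored = pickle.loads(blob)
```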
Auto Mode
When auto_mode is enabled:
A RandomizedSearchCV object is created with the base estimator (RandomForestRegressor or RandomForestClassifier).
It performs a randomized search over the specified parameter distributions.
The best parameters found are saved and used for the final model.
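The search can be sketched as follows. The search space is trimmed so the example runs quickly, and n_iter, cv and the synthetic dataset are assumed values; the settings used in the actual workflow are not stated here.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Trimmed-down search space so the example runs quickly.
param_dist = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "ccp_alpha": uniform(0, 0.01),  # continuous: sampled from [0, 0.01]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,        # number of sampled parameter settings (assumed value)
    cv=3,            # 3-fold cross-validation (assumed value)
    random_state=0,  # makes the sampled settings reproducible
)
search.fit(X, y)

best_params = search.best_params_     # the best-scoring combination
final_model = search.best_estimator_  # refitted on all data (refit=True by default)
```

Lists in param_dist are sampled uniformly, while distribution objects such as uniform are sampled through their rvs method, so discrete and continuous parameters can be mixed in one search.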
Multi-output Scenario
For multiple target variables:
In regression tasks, MultiOutputRegressor is used to wrap the RandomForestRegressor.
In classification tasks, MultiOutputClassifier is used to wrap the RandomForestClassifier.
This allows the model to predict multiple target variables simultaneously.
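The example below illustrates the classification case on made-up data with two label columns. It shows that the wrapper fits one forest per target and returns one prediction column per target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

X, y1 = make_classification(n_samples=150, n_features=6, random_state=0)
# Second, made-up label column so that Y has two targets.
y2 = (X[:, 0] > 0).astype(int)
Y = np.column_stack([y1, y2])  # shape (150, 2)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, Y)

predictions = model.predict(X)      # shape (150, 2): one column per target
n_forests = len(model.estimators_)  # one fitted forest per target
```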
Advantages and Limitations
Advantages:
Handles both linear and non-linear relationships
Reduces overfitting by averaging multiple decision trees
Can handle large datasets with high dimensionality
Provides feature importance rankings
Limitations:
Less interpretable than single decision trees
Can be computationally expensive for very large datasets
May overfit on some datasets if parameters are not tuned properly
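The feature importance rankings mentioned under Advantages are exposed by scikit-learn through the feature_importances_ attribute of a fitted forest. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; with shuffle=False the 3 informative features are columns 0-2.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_  # impurity-based, normalized to sum to 1
ranking = importances.argsort()[::-1]      # feature indices, most important first
```

These importances are impurity-based and can be biased toward high-cardinality features, so they are best read as a rough ranking rather than an exact measure.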
Usage Tips
Start with a moderate number of trees (e.g., 100) and increase if needed.
Use cross-validation to find the optimal max_depth to prevent overfitting.
Adjust min_samples_split and min_samples_leaf to control the complexity of individual trees.
For high-dimensional data, consider setting max_features to 'sqrt' or 'log2'.
If dealing with imbalanced classes, experiment with different class_weight settings.
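As one way to apply the cross-validation tip, the sketch below compares a few max_depth values with 5-fold cross-validation. The candidate depths, the dataset and the accuracy metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

cv_scores = {}
for depth in [None, 5, 10, 20]:
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=depth, max_features="sqrt", random_state=0
    )
    # Mean accuracy over 5 folds for this depth.
    cv_scores[depth] = cross_val_score(clf, X, y, cv=5).mean()

# Depth with the highest mean cross-validated accuracy.
best_depth = max(cv_scores, key=cv_scores.get)
```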