Machine Learning
This gitbook provides a comprehensive overview of the machine learning workflow implemented in Inverse Watch. The workflow is designed to be flexible, supporting various types of regression and both single and multi-output scenarios.
Because our platform is forked from Redash, its powerful query and data visualization capabilities make this workflow a highly versatile tool for feeding machine learning models: users can seamlessly integrate data from various sources, making it easier to prepare and supply data to the models. This integration improves the overall efficiency and effectiveness of the machine learning process and provides a robust environment for data-driven decision-making.
The ML workflow consists of the following main components (illustrative sketches of the key steps follow the list):
Data Preparation:
Query execution to fetch raw data
Data cleaning and structuring
Identification of feature types (numeric, categorical, timestamp)
Encoding of categorical variables
Scaling of numeric features
Extraction of time-based features from timestamps
Dimensionality reduction using autoencoders
Feature Engineering:
Automatic detection and transformation of feature types
Application of cyclical encoding for time-based features
Use of autoencoders for dimensionality reduction
Model Initialization:
Selection of appropriate regressor based on configuration
Initialization of model with default or user-specified parameters
Model Training:
Splitting data into training and validation sets
Training the model on the training data
Hyperparameter tuning if auto_mode is enabled
Model Evaluation and Tuning:
Evaluation of model performance on validation data
Selection of best hyperparameters (if in auto_mode)
Saving of best model and parameters
Prediction:
Loading of trained model
Preprocessing of new data
Generation of predictions
Decoding of predictions into human-readable format
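To make the data preparation and feature engineering steps above concrete, here is a minimal sketch using pandas and scikit-learn. It is not the actual Inverse Watch code: the `prepare_features` helper and the hour-of-day cyclical encoding are illustrative assumptions about how the detected feature types might be transformed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(df):
    """Hypothetical sketch of the data-preparation stage: detect feature
    types, scale numerics, one-hot encode categoricals, and expand
    timestamps into cyclical (sin/cos) features."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns
    timestamp_cols = df.select_dtypes(include="datetime").columns

    parts = []

    # Scale numeric features to zero mean / unit variance.
    if len(numeric_cols):
        scaled = StandardScaler().fit_transform(df[numeric_cols])
        parts.append(pd.DataFrame(scaled, columns=numeric_cols, index=df.index))

    # One-hot encode categorical features.
    if len(categorical_cols):
        parts.append(pd.get_dummies(df[categorical_cols], dtype=float))

    # Cyclical encoding for time-based features (hour of day shown here).
    for col in timestamp_cols:
        hour = df[col].dt.hour
        parts.append(pd.DataFrame({
            f"{col}_hour_sin": np.sin(2 * np.pi * hour / 24),
            f"{col}_hour_cos": np.cos(2 * np.pi * hour / 24),
        }, index=df.index))

    return pd.concat(parts, axis=1)
```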
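Dimensionality reduction with an autoencoder could look roughly like this Keras sketch; the layer sizes and `encoding_dim` are illustrative assumptions rather than the values used by the platform.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, encoding_dim=8):
    """Illustrative autoencoder: compress the prepared feature matrix
    into a lower-dimensional representation used as model input."""
    inputs = keras.Input(shape=(n_features,))
    encoded = layers.Dense(32, activation="relu")(inputs)
    encoded = layers.Dense(encoding_dim, activation="relu")(encoded)
    decoded = layers.Dense(32, activation="relu")(encoded)
    decoded = layers.Dense(n_features, activation="linear")(decoded)

    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Usage sketch: train on the prepared feature matrix X, then keep only
# the encoder to produce the reduced features.
# autoencoder, encoder = build_autoencoder(X.shape[1])
# autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)
# X_reduced = encoder.predict(X)
```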
The system supports multiple types of regressors, each with its own initialization and training process:
Linear/Logistic Regression: Offers simplicity and interpretability for both continuous and categorical targets. Key hyperparameters include `fit_intercept` for linear regression and `C` for logistic regression.
Random Forest: Utilizes ensemble learning to improve prediction accuracy and reduce overfitting. Key hyperparameters include `n_estimators`, `max_depth`, and `max_features`.
AdaBoost: Combines multiple weak learners to create a strong regressor, with support for both regression and classification tasks. Key hyperparameters include `n_estimators` and `learning_rate`.
Gradient Boosting: Sequentially builds models to correct errors from previous models, suitable for capturing complex patterns. Key hyperparameters include `n_estimators`, `learning_rate`, and `max_depth`.
Neural Networks: Employs deep learning techniques for handling complex, non-linear relationships in data. Key hyperparameters include `epochs`, `batch_size`, and `learning_rate`.
For more detailed information on each regressor type and its specific workflow, please refer to the individual regressor documentation pages.
The implementation is built around three core components:
The main class that orchestrates the entire ML workflow. It handles data preparation and feature engineering, and manages the training and prediction processes.
A custom estimator that supports multi-output scenarios and hyperparameter tuning for traditional machine learning models.
A custom class for neural network models that supports hyperparameter tuning and handles both single and multi-output scenarios.
The system supports an "auto mode" for each regressor type, enabling automated hyperparameter tuning to optimize model performance.
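Auto mode can be pictured as a grid (or randomized) search over each regressor's key hyperparameters, evaluated with cross-validation and a held-out validation set. The search space below is an illustrative assumption for the Random Forest case, not the exact grid used by Inverse Watch.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_random_forest(X, y):
    """Illustrative auto-mode tuning: hold out a validation set, search a
    small hyperparameter grid, and return the best estimator and params."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)

    param_grid = {
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 5, 10],
        "max_features": ["sqrt", 1.0],
    }
    search = GridSearchCV(RandomForestRegressor(random_state=42),
                          param_grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)

    # Report held-out performance of the tuned model (R^2 on validation data).
    val_score = search.best_estimator_.score(X_val, y_val)
    return search.best_estimator_, search.best_params_, val_score
```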
The workflow is designed to handle both single-output and multi-output scenarios, using appropriate wrappers or custom implementations.
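For estimators that do not natively support several targets, scikit-learn's `MultiOutputRegressor` is one common wrapper; whether the workflow uses this wrapper or a custom implementation depends on the regressor type, so treat this as a sketch.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# GradientBoostingRegressor predicts a single target, so wrap it to fit
# one internal model per output column; y has shape (n_samples, n_outputs).
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100))
# model.fit(X, y)
# predictions = model.predict(X_new)   # shape (n_samples, n_outputs)
```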
Trained models are serialized and stored in the database, facilitating easy retrieval and deployment.
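Storage and retrieval can be sketched with pickle and a simple database table. The `ml_models` table, SQLite connection, and helper names below are hypothetical; the idea is only that the serialized model is written to the database after training and loaded back before preprocessing new data and generating predictions.

```python
import pickle
import sqlite3

# Hypothetical storage table; the real schema lives in the Inverse Watch database.
conn = sqlite3.connect("models.db")
conn.execute("CREATE TABLE IF NOT EXISTS ml_models (name TEXT PRIMARY KEY, blob BLOB)")

def save_model(name, model):
    """Pickle the trained estimator and store it as a binary blob."""
    conn.execute("INSERT OR REPLACE INTO ml_models VALUES (?, ?)",
                 (name, pickle.dumps(model)))
    conn.commit()

def load_model(name):
    """Retrieve and deserialize a stored model for prediction."""
    row = conn.execute("SELECT blob FROM ml_models WHERE name = ?", (name,)).fetchone()
    return pickle.loads(row[0])

# Prediction flow: load the model, preprocess new rows with the same
# pipeline used at training time, then predict and decode the output.
# model = load_model("my_model")
# predictions = model.predict(prepare_features(new_df))
```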