Machine Learning
This gitbook provides a comprehensive overview of the machine learning workflow implemented in Inverse Watch. The workflow is designed to be flexible, supporting various types of regression and both single and multi-output scenarios.
Because our platform is forked from Redash, its powerful query and data visualization capabilities make this workflow a highly versatile tool for feeding machine learning models: users can seamlessly integrate data from various sources, making it easier to prepare and supply data to the models. This integration improves the overall efficiency and effectiveness of the machine learning process and provides a robust environment for data-driven decision-making.
The ML workflow consists of the following main components (illustrative sketches of the key steps follow the list):
Data Preparation:
Query execution to fetch raw data
Data cleaning and structuring
Identification of feature types (numeric, categorical, timestamp)
Encoding of categorical variables
Scaling of numeric features
Extraction of time-based features from timestamps
Dimensionality reduction using autoencoders
Feature Engineering:
Automatic detection and transformation of feature types
Application of cyclical encoding for time-based features
Use of autoencoders for dimensionality reduction
Model Initialization:
Selection of appropriate regressor based on configuration
Initialization of model with default or user-specified parameters
Model Training:
Splitting data into training and validation sets
Training the model on the training data
Hyperparameter tuning if auto_mode is enabled
Model Evaluation and Tuning:
Evaluation of model performance on validation data
Selection of best hyperparameters (if in auto_mode)
Saving of best model and parameters
Prediction:
Loading of trained model
Preprocessing of new data
Generation of predictions
Decoding of predictions into human-readable format
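To make the data preparation and feature engineering steps above concrete, here is a minimal sketch using pandas and scikit-learn. It is not the actual Inverse Watch code: the `prepare_features` helper and the hour-of-day cyclical encoding are illustrative assumptions about how the detected feature types might be transformed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(df):
    """Hypothetical sketch of the data-preparation stage: detect feature
    types, scale numerics, one-hot encode categoricals, and expand
    timestamps into cyclical (sin/cos) features."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns
    timestamp_cols = df.select_dtypes(include="datetime").columns

    parts = []

    # Scale numeric features to zero mean / unit variance.
    if len(numeric_cols):
        scaled = StandardScaler().fit_transform(df[numeric_cols])
        parts.append(pd.DataFrame(scaled, columns=numeric_cols, index=df.index))

    # One-hot encode categorical features.
    if len(categorical_cols):
        parts.append(pd.get_dummies(df[categorical_cols], dtype=float))

    # Cyclical encoding for time-based features (hour of day shown here).
    for col in timestamp_cols:
        hour = df[col].dt.hour
        parts.append(pd.DataFrame({
            f"{col}_hour_sin": np.sin(2 * np.pi * hour / 24),
            f"{col}_hour_cos": np.cos(2 * np.pi * hour / 24),
        }, index=df.index))

    return pd.concat(parts, axis=1)
```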
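Dimensionality reduction with an autoencoder could look roughly like this Keras sketch; the layer sizes and `encoding_dim` are illustrative assumptions rather than the values used by the platform.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, encoding_dim=8):
    """Illustrative autoencoder: compress the prepared feature matrix
    into a lower-dimensional representation used as model input."""
    inputs = keras.Input(shape=(n_features,))
    encoded = layers.Dense(32, activation="relu")(inputs)
    encoded = layers.Dense(encoding_dim, activation="relu")(encoded)
    decoded = layers.Dense(32, activation="relu")(encoded)
    decoded = layers.Dense(n_features, activation="linear")(decoded)

    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Usage sketch: train on the prepared feature matrix X, then keep only
# the encoder to produce the reduced features.
# autoencoder, encoder = build_autoencoder(X.shape[1])
# autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)
# X_reduced = encoder.predict(X)
```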
The system supports multiple types of regressors, each with its own initialization and training process:
Linear/Logistic Regression: Offers simplicity and interpretability for both continuous and categorical targets. Key hyperparameters include `fit_intercept` for linear regression and `C` for logistic regression.
Random Forest: Utilizes ensemble learning to improve prediction accuracy and reduce overfitting. Key hyperparameters include `n_estimators`, `max_depth`, and `max_features`.
AdaBoost: Combines multiple weak learners to create a strong regressor, with support for both regression and classification tasks. Key hyperparameters include `n_estimators` and `learning_rate`.
Gradient Boosting: Sequentially builds models to correct errors from previous models, suitable for capturing complex patterns. Key hyperparameters include `n_estimators`, `learning_rate`, and `max_depth`.
Neural Networks: Employs deep learning techniques for handling complex, non-linear relationships in data. Key hyperparameters include `epochs`, `batch_size`, and `learning_rate`.
For more detailed information on each regressor type and its specific workflow, please refer to the individual regressor documentation pages.
The implementation is built around three core components:
The main class that orchestrates the entire ML workflow. It handles data preparation and feature engineering, and manages the training and prediction processes.
A custom estimator that supports multi-output scenarios and hyperparameter tuning for traditional machine learning models.
A custom class for neural network models that supports hyperparameter tuning and handles both single and multi-output scenarios.
The system supports an "auto mode" for each regressor type, enabling automated hyperparameter tuning to optimize model performance.
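Auto mode can be pictured as a grid (or randomized) search over each regressor's key hyperparameters, evaluated with cross-validation and a held-out validation set. The search space below is an illustrative assumption for the Random Forest case, not the exact grid used by Inverse Watch.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_random_forest(X, y):
    """Illustrative auto-mode tuning: hold out a validation set, search a
    small hyperparameter grid, and return the best estimator and params."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)

    param_grid = {
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 5, 10],
        "max_features": ["sqrt", 1.0],
    }
    search = GridSearchCV(RandomForestRegressor(random_state=42),
                          param_grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)

    # Report held-out performance of the tuned model (R^2 on validation data).
    val_score = search.best_estimator_.score(X_val, y_val)
    return search.best_estimator_, search.best_params_, val_score
```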
The workflow is designed to handle both single-output and multi-output scenarios, using appropriate wrappers or custom implementations.
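For estimators that do not natively support several targets, scikit-learn's `MultiOutputRegressor` is one common wrapper; whether the workflow uses this wrapper or a custom implementation depends on the regressor type, so treat this as a sketch.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# GradientBoostingRegressor predicts a single target, so wrap it to fit
# one internal model per output column; y has shape (n_samples, n_outputs).
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100))
# model.fit(X, y)
# predictions = model.predict(X_new)   # shape (n_samples, n_outputs)
```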
Trained models are serialized and stored in the database, facilitating easy retrieval and deployment.
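Storage and retrieval can be sketched with pickle and a simple database table. The `ml_models` table, SQLite connection, and helper names below are hypothetical; the idea is only that the serialized model is written to the database after training and loaded back before preprocessing new data and generating predictions.

```python
import pickle
import sqlite3

# Hypothetical storage table; the real schema lives in the Inverse Watch database.
conn = sqlite3.connect("models.db")
conn.execute("CREATE TABLE IF NOT EXISTS ml_models (name TEXT PRIMARY KEY, blob BLOB)")

def save_model(name, model):
    """Pickle the trained estimator and store it as a binary blob."""
    conn.execute("INSERT OR REPLACE INTO ml_models VALUES (?, ?)",
                 (name, pickle.dumps(model)))
    conn.commit()

def load_model(name):
    """Retrieve and deserialize a stored model for prediction."""
    row = conn.execute("SELECT blob FROM ml_models WHERE name = ?", (name,)).fetchone()
    return pickle.loads(row[0])

# Prediction flow: load the model, preprocess new rows with the same
# pipeline used at training time, then predict and decode the output.
# model = load_model("my_model")
# predictions = model.predict(prepare_features(new_df))
```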