Data Preparation and Target Extraction
Overview
Our ML workflow includes dedicated data preparation and target extraction steps. These steps transform raw data into a format suitable for machine learning models and handle several types of target variables.
Workflow Components
Data Preparation
The data preparation process is handled by the feature_engineering method. This method applies several transformations to the input data:
Data Type Identification: The method identifies the type of each feature:
Unix timestamps
Ethereum addresses
Categorical features
Numeric features
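The type check listed above could be approximated as follows. This is a minimal sketch rather than the actual feature_engineering logic: the function name infer_feature_type, the timestamp bounds, and the address pattern are all assumptions made for illustration.

```python
import re

# Assumed pattern: "0x" followed by exactly 40 hexadecimal characters.
ETH_ADDRESS_RE = re.compile(r"^0x[0-9a-fA-F]{40}$")

# Assumed plausibility window for Unix timestamps in seconds
# (2001-09-09 to 2286-11-20); the real method may use other bounds.
TS_MIN, TS_MAX = 1_000_000_000, 9_999_999_999


def infer_feature_type(values):
    """Classify a column of values as one of the four feature types."""
    values = [v for v in values if v is not None]
    if not values:
        return "categorical"
    if all(isinstance(v, str) and ETH_ADDRESS_RE.match(v) for v in values):
        return "ethereum_address"
    # bool is a subclass of int, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and all(
        float(v).is_integer() and TS_MIN <= v <= TS_MAX for v in values
    ):
        return "unix_timestamp"
    if numeric:
        return "numeric"
    return "categorical"
```

A range check like this can misclassify large integer columns as timestamps, so a real implementation would likely combine it with column names or explicit configuration.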
Timestamp Feature Extraction: For Unix timestamp features, the method applies cyclical encoding for month, day, and hour, in addition to its other timestamp-derived features.
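Cyclical encoding maps a periodic value onto the unit circle so that, for example, hour 23 ends up close to hour 0. The sketch below illustrates the idea. The function names, the UTC interpretation, and the fixed 31-day cycle for the day of month are assumptions, not details taken from feature_engineering.

```python
import math
from datetime import datetime, timezone


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_timestamp(ts):
    """Return cyclical month, day, and hour features for a Unix timestamp.

    The timestamp is interpreted in UTC (an assumption of this sketch).
    """
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    month_sin, month_cos = cyclical_encode(dt.month - 1, 12)
    # Assumes a fixed 31-day cycle regardless of the actual month length.
    day_sin, day_cos = cyclical_encode(dt.day - 1, 31)
    hour_sin, hour_cos = cyclical_encode(dt.hour, 24)
    return {
        "month_sin": month_sin, "month_cos": month_cos,
        "day_sin": day_sin, "day_cos": day_cos,
        "hour_sin": hour_sin, "hour_cos": hour_cos,
    }
```

Using both the sine and the cosine keeps each position on the cycle unique. A single sine value would map two different hours to the same number.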
Categorical Feature Encoding: Categorical features (including Ethereum addresses) are encoded numerically.
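This page does not state which encoder is used for categorical features. As one possible illustration, the sketch below assumes integer label encoding with a reserved code for unseen categories. The LabelMapper class is hypothetical and is not part of the workflow's API.

```python
class LabelMapper:
    """Hypothetical integer label encoder that tolerates unseen categories."""

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Sort the distinct values so the mapping is deterministic.
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values, unknown=-1):
        # Categories not seen during fit receive the `unknown` code.
        return [self.mapping.get(v, unknown) for v in values]


# Ethereum addresses can be lowercased first so that checksummed and
# non-checksummed spellings of the same address share one code.
addresses = ["0xAbC", "0xabc", "0xdef"]
mapper = LabelMapper().fit([a.lower() for a in addresses])
codes = mapper.transform([a.lower() for a in addresses])
```

Keeping the fitted mapping around is what allows the same codes to be reused at prediction time.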
Numeric Feature Scaling: Numeric features are scaled.
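The scaling method is not specified on this page. The sketch below assumes standardization (zero mean, unit variance) purely as an example. StandardScalerSketch is a hypothetical name, and the real method may use a different scaler.

```python
import statistics


class StandardScalerSketch:
    """Hypothetical standardizer: subtract the mean, divide by the std."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        return [v * self.std + self.mean for v in values]
```

The fitted mean and standard deviation are stored on the object so that new data can be transformed with the statistics learned during training.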
Dimensionality Reduction: An autoencoder is used to reduce the dimensionality of the feature space.
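To make the autoencoder idea concrete, here is a deliberately small sketch: one tanh hidden layer trained with plain gradient descent in NumPy. The architecture, hyperparameters, and the train_autoencoder name are assumptions. The actual workflow most likely relies on a deep learning framework with its own configuration.

```python
import numpy as np


def train_autoencoder(X, n_components=2, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent.

    Returns an `encode` function and the per-epoch reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_features, n_components))
    b1 = np.zeros(n_components)
    W2 = rng.normal(scale=0.1, size=(n_components, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)   # encoder: compress to n_components
        X_hat = H @ W2 + b2        # decoder: reconstruct the input
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        d_out = 2.0 * err / err.size
        d_hidden = (d_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)

    def encode(data):
        """Project data into the learned low-dimensional space."""
        return np.tanh(data @ W1 + b1)

    return encode, losses
```

Only the encoder half is needed after training. Its output replaces the original feature matrix as the model input.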
Sample Weight Calculation: Sample weights are calculated to give more importance to recent observations.
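The weighting formula is not given here. One common way to favor recent observations is exponential decay with a half-life, sketched below. The recency_weights function and the 30-day default are assumptions chosen for illustration.

```python
import math


def recency_weights(timestamps, half_life_days=30.0):
    """Weight each sample by its age relative to the newest observation.

    A sample that is one half-life older than the newest sample receives
    half the weight. Weights are normalized to average 1.
    """
    newest = max(timestamps)
    half_life_seconds = half_life_days * 86_400
    raw = [
        math.exp(-math.log(2) * (newest - t) / half_life_seconds)
        for t in timestamps
    ]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

Normalizing to a mean of 1 keeps the overall loss scale comparable to unweighted training.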
Target Extraction
The target extraction process is handled by the extract_targets method. This method prepares the target variables for training:
Target Type Identification: The method identifies each target as either numeric or categorical.
Ethereum Address Handling: Ethereum addresses are treated as categorical variables.
Numeric Target Scaling: Numeric targets are scaled.
Categorical Target Encoding: Categorical targets are encoded.
Multi-target Handling: The method can handle multiple targets, combining them into a single 2D numpy array.
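The target steps above can be sketched together as follows. This is an illustrative approximation, not the real extract_targets method: the function name, the dict-of-lists input, standardization for numeric targets, and integer codes for categorical targets are all assumptions.

```python
import numpy as np


def extract_targets_sketch(columns):
    """Combine several target columns into one 2D array.

    `columns` maps a target name to a list of values. Returns the stacked
    array plus the per-target information needed to invert the transform.
    """
    arrays, info = [], {}
    for name, values in columns.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            col = np.asarray(values, dtype=float)
            mean = col.mean()
            std = col.std() or 1.0  # guard against constant targets
            arrays.append((col - mean) / std)
            info[name] = {"type": "numeric", "mean": float(mean), "std": float(std)}
        else:
            # Categorical targets, including Ethereum addresses.
            classes = sorted(set(values))
            index = {c: i for i, c in enumerate(classes)}
            arrays.append(np.asarray([index[v] for v in values], dtype=float))
            info[name] = {"type": "categorical", "classes": classes}
    # One column per target, one row per sample.
    return np.column_stack(arrays), info
```

For example, passing {"price": [1.0, 2.0, 3.0], "owner": ["0xb", "0xa", "0xb"]} yields an array of shape (3, 2), with the scaled price in the first column and the owner codes in the second.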
Usage
Both methods are called during the model training process.
Advantages
Handles various data types automatically
Applies appropriate transformations for each data type
Supports multi-target scenarios
Preserves the fitted feature and target transformations so that the same preprocessing can be applied consistently at prediction time
Considerations
The autoencoder for dimensionality reduction might not be necessary for all datasets
The current implementation doesn't include advanced time series features like lagged variables or rolling statistics