Data Preparation and Target Extraction

Overview

Our ML workflow includes data preparation and target extraction steps. They transform raw data into a format suitable for machine learning models and handle both numeric and categorical target variables.

Workflow Components

Data Preparation

The data preparation process is handled by the feature_engineering method. This method applies several transformations to the input data:

  1. Data Type Identification: The method identifies the type of each feature:

    • Unix timestamps

    • Ethereum addresses

    • Categorical features

    • Numeric features

  2. Timestamp Feature Extraction: For Unix timestamp features:

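    # The .dt accessor needs a datetime column; assumes the Unix values are in seconds
    ts = pd.to_datetime(df[feature], unit='s')
    df[f'{feature}_year'] = ts.dt.year
    df[f'{feature}_month'] = ts.dt.month
    # ... (more time-based features)

    The method also applies cyclical (sine/cosine) encoding to month, day, and hour.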

  3. Categorical Feature Encoding: For categorical features (including Ethereum addresses):

    # scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoded = encoder.fit_transform(df[[feature]])

  4. Numeric Feature Scaling: For numeric features:

    scaler = StandardScaler()
    df[feature] = scaler.fit_transform(df[[feature]])

  5. Dimensionality Reduction: An autoencoder is used to reduce the dimensionality of the feature space:

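    # Full model: maps the input to its reconstruction
    autoencoder = Model(input_layer, decoded)
    # Encoder half: maps the input to the compressed representation (encoded2)
    encoder = Model(input_layer, encoded2)

  6. Sample Weight Calculation: Sample weights are calculated to give more importance to recent observations.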
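
Steps 2 and 6 are described above without code, so here is a minimal sketch of both. It is an illustration, not the method's actual implementation: the helper names `cyclical_encode` and `recency_weights`, the exponential-decay form, and the `half_life` parameter are all assumptions.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def recency_weights(timestamps, half_life):
    """Hypothetical weighting scheme: the newest row gets weight 1.0 and
    a row `half_life` seconds older gets weight 0.5."""
    ts = np.asarray(timestamps, dtype=float)
    age = ts.max() - ts
    return 0.5 ** (age / half_life)

hours = np.array([0, 6, 12, 23])
hour_sin, hour_cos = cyclical_encode(hours, period=24)

# Distance between hour 23 and hour 0 in the encoded space; it is small,
# whereas the raw integers 23 and 0 are far apart
gap = np.hypot(hour_sin[3] - hour_sin[0], hour_cos[3] - hour_cos[0])

# Three observations one day apart, with a one-day half-life
weights = recency_weights(
    [1_700_000_000, 1_700_086_400, 1_700_172_800], half_life=86_400
)
```

Weights like these are typically passed to a model's fitting call through a `sample_weight` argument, where the model supports one.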

Target Extraction

The target extraction process is handled by the extract_targets method. This method prepares the target variables for training:

  1. Target Type Identification: The method identifies each target as either numeric or categorical.

  2. Ethereum Address Handling: Ethereum addresses are treated as categorical variables.

  3. Numeric Target Scaling: For numeric targets:

    scaler = StandardScaler()
    y_scaled = scaler.fit_transform(y_values)

  4. Categorical Target Encoding: For categorical targets:

    # scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    one_hot_values = encoder.fit_transform(np.array(string_values).reshape(-1, 1))

  5. Multi-target Handling: The method can handle multiple targets, combining them into a single 2D NumPy array.
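
The steps above can be sketched end to end as follows. This is a simplified stand-in for `extract_targets`, not its actual code: the function name `extract_targets_sketch`, the dict-of-lists input format, and the type check are assumptions, and the column names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def extract_targets_sketch(columns):
    """columns: dict of target name -> list of raw values (hypothetical format)."""
    blocks, target_types, target_encoders = [], {}, {}
    for name, values in columns.items():
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric target: standardize to zero mean and unit variance
            enc = StandardScaler()
            block = enc.fit_transform(
                np.asarray(values, dtype=float).reshape(-1, 1)
            )
            target_types[name] = 'numeric'
        else:
            # Categorical target (including address strings): one-hot encode.
            # The default sparse output is densified with .toarray(), which
            # works regardless of the scikit-learn version.
            enc = OneHotEncoder(handle_unknown='ignore')
            block = enc.fit_transform(
                np.asarray(values, dtype=str).reshape(-1, 1)
            ).toarray()
            target_types[name] = 'categorical'
        target_encoders[name] = enc
        blocks.append(block)
    # Stack the per-target blocks side by side into one 2D array
    return np.hstack(blocks), target_types, target_encoders

y, target_types, target_encoders = extract_targets_sketch({
    'amount': [1.0, 2.0, 3.0, 4.0],
    'recipient': ['0xabc', '0xdef', '0xabc', '0x123'],
})
```

Here `y` has one column for the scaled numeric target plus one column per distinct category. Keeping the fitted encoders makes it possible to map predictions back to the original scale or labels with `inverse_transform`.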

Usage

Both methods are called during the model training process:

X, feature_types, feature_encoders, n_features = self.feature_engineering(data, features, mode='train')
y, target_types, target_encoders = self.extract_targets(data, targets)

Advantages

  • Handles various data types automatically

  • Applies appropriate transformations for each data type

  • Supports multi-target scenarios

  • Preserves information about feature and target transformations for consistent prediction
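
To illustrate the last point, the sketch below fits a scaler in training mode and reuses it unchanged at prediction time. It assumes a simplified helper (`scale_numeric`), a mode name `'predict'`, and a feature name `gas_used`, none of which come from the actual implementation; only `mode='train'` appears in the usage above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_numeric(values, name, feature_encoders, mode='train'):
    """Fit a scaler in 'train' mode; reuse the stored one in any other mode."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    if mode == 'train':
        feature_encoders[name] = StandardScaler().fit(x)
    # Outside training, the statistics learned from the training data are
    # applied as-is, so new data is never used to refit the scaler
    return feature_encoders[name].transform(x)

feature_encoders = {}
train_scaled = scale_numeric([10.0, 20.0, 30.0], 'gas_used', feature_encoders, mode='train')
new_scaled = scale_numeric([20.0], 'gas_used', feature_encoders, mode='predict')
```

Because the training mean is 20, the new value 20 maps to 0 after scaling. Refitting on the single new value instead would give a different, inconsistent result.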

Considerations

  • The autoencoder for dimensionality reduction might not be necessary for all datasets

  • The current implementation doesn't include advanced time series features like lagged variables or rolling statistics
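
For reference, this is roughly what such features look like in pandas. The sketch is not part of the current workflow; the column name `value` and the window size are illustrative, and the rows are assumed to be sorted by time.

```python
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0, 4.0, 8.0, 16.0]})

# Lagged variable: the previous row's value (NaN for the first row)
df['value_lag1'] = df['value'].shift(1)

# Rolling statistic: mean of the three preceding rows. Shifting first
# keeps the current row out of its own window.
df['value_roll3_mean'] = df['value'].shift(1).rolling(window=3).mean()
```

The first rows of each new column are NaN because there is not enough history to fill the window, so they need to be dropped or imputed before training.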
