Data Engineering

The data & feature engineering process is implemented in the feature_engineering method. The target extraction process is handled by the extract_targets method, which prepares the target variables for training.

Together, these methods handle various data types, encode categorical variables, scale numeric variables, and apply dimensionality reduction using an autoencoder.



Data Type Identification

The methods identify the data type of each feature and target:

  • Unix timestamps

  • Ethereum addresses

  • Categorical features

  • Numeric features

Type                   Feature   Target
Unix timestamps           ✓         ✓
Ethereum addresses        ✓         ✓
Categorical features      ✓         ✓
Numeric features          ✓         ✓

Timestamp Feature Extraction

For Unix timestamps, the method extracts several time-based features:

  • Year, month, day, hour, day of week

  • Cyclical encoding for month, day, and hour

Example of cyclical encoding:

import numpy as np  # assumes df[feature] is already a datetime column

# Encode the month on a circle so December and January end up adjacent
df[f'{feature}_month_sin'] = np.sin(2 * np.pi * df[feature].dt.month / 12)
df[f'{feature}_month_cos'] = np.cos(2 * np.pi * df[feature].dt.month / 12)
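
Putting the pieces together, here is a minimal sketch of the whole extraction, assuming the raw column holds Unix timestamps in seconds; the helper name and the day/31 divisor are illustrative, not necessarily the repository's exact choices:

import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    # Convert Unix timestamps (seconds) into a pandas datetime series
    dt = pd.to_datetime(df[feature], unit='s')
    # Plain calendar components
    df[f'{feature}_year'] = dt.dt.year
    df[f'{feature}_month'] = dt.dt.month
    df[f'{feature}_day'] = dt.dt.day
    df[f'{feature}_hour'] = dt.dt.hour
    df[f'{feature}_dayofweek'] = dt.dt.dayofweek
    # Cyclical encodings keep wrap-around points (Dec/Jan, 23h/0h) close together
    df[f'{feature}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
    df[f'{feature}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
    df[f'{feature}_day_sin'] = np.sin(2 * np.pi * dt.dt.day / 31)
    df[f'{feature}_day_cos'] = np.cos(2 * np.pi * dt.dt.day / 31)
    df[f'{feature}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
    df[f'{feature}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
    return df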

Categorical Encoding

Categorical features and targets, including Ethereum addresses, are encoded with One-Hot Encoding, which transforms each categorical variable into binary columns, one per category:

from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the former sparse argument in scikit-learn >= 1.2
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[[feature]])
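
At prediction time, the fitted encoder can be reused so new data maps onto the same columns. A minimal sketch (new_df is an illustrative prediction-time frame, not a name from the source):

# Reuse the fitted encoder on unseen rows; output columns stay aligned with training
encoded_new = encoder.transform(new_df[[feature]])
# With handle_unknown='ignore', unseen categories (e.g. a brand-new address) encode as all zeros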

Numeric Feature Scaling

Numeric features and targets are scaled to ensure consistent model input using StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[feature] = scaler.fit_transform(df[[feature]])
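
StandardScaler standardizes each column to zero mean and unit variance, z = (x - mean) / std. A quick self-contained check of that formula:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
z = StandardScaler().fit_transform(x)
# Matches the textbook formula (population standard deviation, ddof=0)
assert np.allclose(z, (x - x.mean()) / x.std())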

Autoencoder for Dimensionality Reduction

An autoencoder is used to reduce the dimensionality of the feature space:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

input_layer = Input(shape=(input_dim,))
# Encoder: compress the input in two steps down to encoding_dim
encoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(input_layer)
encoded2 = Dense(encoding_dim, activation='relu', activity_regularizer=l2(1e-5))(encoded1)
# Decoder: expand the latent representation back to the original width
decoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(encoded2)
decoded = Dense(input_dim, activation='sigmoid', activity_regularizer=l2(1e-5))(decoded1)

autoencoder = Model(input_layer, decoded)  # trained end to end on reconstruction
encoder = Model(input_layer, encoded2)     # reused alone to produce reduced features

An autoencoder is an artificial neural network used for unsupervised learning, primarily for dimensionality reduction and feature learning. It learns an efficient representation of the input by training the network to ignore noise and irrelevant variation while preserving important features. The architecture consists of two main parts:

  • Encoder: The encoder compresses the input data into a lower-dimensional space (latent representation), reducing its dimensionality while retaining critical information.

  • Decoder: The decoder reconstructs the input data from the compressed representation, attempting to generate an output as similar as possible to the original input.

The autoencoder is trained by minimizing the difference between the input and the reconstructed output, often using a loss function like Mean Squared Error (MSE).

How an Autoencoder Works

  1. Input Layer: The raw features from the dataset are fed into the input layer.

  2. Encoding: The encoder, typically composed of fully connected layers, compresses the input data into a smaller representation by learning important features and discarding redundant information. For example, a layer might shrink the number of input features by half.

  3. Latent Space: This compressed representation, also called the latent space, captures the most critical features needed for reconstructing the input.

  4. Decoding: The decoder attempts to expand the latent space representation back to the original input feature size, aiming to recreate the input data as closely as possible.

  5. Training: The network is trained to minimize the reconstruction loss (difference between the original input and the reconstructed output), gradually improving the quality of the compression.

By using an autoencoder, we reduce the dimensionality of the feature space, which helps in retaining only the most relevant features and discarding noise, making downstream tasks like prediction more efficient.
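
A minimal training-and-usage sketch for the model defined above; the optimizer, epoch count, and batch size are illustrative assumptions, while the MSE loss matches the description above:

# X is the engineered feature matrix, shape (n_samples, input_dim)
autoencoder.compile(optimizer='adam', loss='mse')  # MSE reconstruction loss
autoencoder.fit(X, X, epochs=50, batch_size=32, shuffle=True, verbose=0)

# After training, only the encoder half is needed for dimensionality reduction
X_reduced = encoder.predict(X)  # shape: (n_samples, encoding_dim)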


Multi-target Handling

The extract_targets method can handle multiple targets, combining them into a single 2D NumPy array.
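
A sketch of that combination step, assuming each processed target is either a 1D numeric array or a 2D one-hot matrix (the example values are illustrative):

import numpy as np

y1 = np.array([0.5, 1.2, 0.9])            # numeric target, shape (3,)
y2 = np.array([[1, 0], [0, 1], [1, 0]])   # one-hot target, shape (3, 2)

# Promote 1D targets to columns, then stack side by side into one 2D array
y = np.hstack([t.reshape(len(t), -1) for t in (y1, y2)])
# y.shape == (3, 3)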


Usage

Both methods are called during the model training process:

X, feature_types, feature_encoders, n_features = self.feature_engineering(data, features, mode='train')
y, target_types, target_encoders = self.extract_targets(data, targets)
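
For inference, the same transformations must be replayed with the stored encoders. Assuming the method exposes a corresponding prediction mode (the exact flag value is an assumption, not confirmed by the source), the call might look like:

# Hypothetical prediction-time call: reuses encoders/scalers fitted during training
X_new, _, _, _ = self.feature_engineering(new_data, features, mode='predict')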

Advantages

  • Handles both training and prediction modes.

  • Automatically identifies data types and applies the appropriate transformation to each.

  • Stores the fitted encoders, scalers, and related state so feature and target transformations stay consistent at prediction time.

  • Calculates sample weights that give more importance to recent observations (see the sketch after this list).

  • Supports multi-target scenarios.
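
The weighting scheme itself is not shown in the source; a common choice is geometric recency decay, sketched here purely as an assumption:

import numpy as np

def recency_weights(n_samples: int, decay: float = 0.99) -> np.ndarray:
    # Hypothetical scheme: the newest observation gets weight 1,
    # and each older observation is discounted by a factor of `decay`
    weights = decay ** np.arange(n_samples - 1, -1, -1)
    return weights / weights.sum()  # normalize so weights sum to 1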
