# Data Engineering

The data and feature engineering process is implemented in the `feature_engineering` method. Target extraction is handled by the `extract_targets` method, which prepares the target variables for training.

Together, these two methods handle various data types, encode categorical variables, scale numeric variables, and apply dimensionality reduction using an autoencoder.

***

### <mark style="color:blue;">**Data Type Identification**</mark>

The methods identify the data type of each feature and target:

* Unix timestamps
* Ethereum addresses
* Categorical features
* Numeric features

<table data-full-width="false"><thead><tr><th width="263">Type</th><th width="264" align="center">Feature</th><th align="center">Target</th></tr></thead><tbody><tr><td>Unix timestamps</td><td align="center">✅</td><td align="center">❌</td></tr><tr><td>Ethereum addresses</td><td align="center">✅</td><td align="center">✅</td></tr><tr><td>Categorical features</td><td align="center">✅</td><td align="center">✅</td></tr><tr><td>Numeric features</td><td align="center">✅</td><td align="center">✅</td></tr></tbody></table>

### <mark style="color:blue;">**Timestamp Feature Extraction**</mark>

For Unix timestamps, the method extracts several time-based features:

* Year, month, day, hour, day of week
* Cyclical encoding for month, day, and hour

Example of cyclical encoding:

```python
df[f'{feature}_month_sin'] = np.sin(2 * np.pi * df[feature].dt.month / 12)
df[f'{feature}_month_cos'] = np.cos(2 * np.pi * df[feature].dt.month / 12)
```
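Putting these together, a minimal sketch of the full timestamp expansion (the `ts` column name and sample values are hypothetical, not taken from the project's data):

```python
import numpy as np
import pandas as pd

# Hypothetical Unix-timestamp column.
df = pd.DataFrame({'ts': [1700000000, 1712345678]})
dt = pd.to_datetime(df['ts'], unit='s')

# Calendar-based features.
df['ts_year'] = dt.dt.year
df['ts_month'] = dt.dt.month
df['ts_day'] = dt.dt.day
df['ts_hour'] = dt.dt.hour
df['ts_dayofweek'] = dt.dt.dayofweek

# Cyclical encoding keeps December (12) numerically adjacent to January (1).
df['ts_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
df['ts_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
```

The sine/cosine pair maps each month onto a point on the unit circle, so the model sees the wrap-around between year boundaries instead of an artificial jump from 12 to 1.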

***

### <mark style="color:blue;">**Categorical Encoding**</mark>

Categorical features and targets, including Ethereum addresses, are transformed with One-Hot Encoding, which creates a binary column for each category:

```python
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output replaces sparse in scikit-learn >= 1.2
encoded = encoder.fit_transform(df[[feature]])
```

***

### <mark style="color:blue;">**Numeric Feature Scaling**</mark>

Numeric features and targets are scaled to zero mean and unit variance with `StandardScaler`, ensuring consistent model input:

```python
scaler = StandardScaler()
df[feature] = scaler.fit_transform(df[[feature]])
```
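A sketch of why the fitted scaler is stored: at prediction time the same training-set statistics must be reused rather than refit (the `volume` column and values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'volume': [10.0, 20.0, 30.0]})
new = pd.DataFrame({'volume': [25.0]})

# Fit on training data only; keep the scaler for later.
scaler = StandardScaler()
train['volume'] = scaler.fit_transform(train[['volume']])

# At prediction time, transform with the stored scaler -- do NOT refit,
# or new data would be scaled on a different mean/variance.
new['volume'] = scaler.transform(new[['volume']])
```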

***

### <mark style="color:blue;">**Autoencoder for Dimensionality Reduction**</mark>

An autoencoder is used to reduce the dimensionality of the feature space:

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

input_layer = Input(shape=(input_dim,))
encoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(input_layer)
encoded2 = Dense(encoding_dim, activation='relu', activity_regularizer=l2(1e-5))(encoded1)
decoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(encoded2)
decoded = Dense(input_dim, activation='sigmoid', activity_regularizer=l2(1e-5))(decoded1)

autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded2)
```

An autoencoder is an artificial neural network used for unsupervised learning, primarily for dimensionality reduction and feature learning. It learns an efficient representation of the input data by training the network to ignore noise and irrelevant variation while preserving important features. The architecture consists of two main parts:

* **Encoder**: The encoder compresses the input data into a lower-dimensional space (latent representation), reducing its dimensionality while retaining critical information.
* **Decoder**: The decoder reconstructs the input data from the compressed representation, attempting to generate an output as similar as possible to the original input.

The autoencoder is trained by minimizing the difference between the input and the reconstructed output, often using a loss function like Mean Squared Error (MSE).
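A training sketch for the architecture above, assuming TensorFlow/Keras; `input_dim`, `encoding_dim`, the epoch count, and the batch size are illustrative values, not the project's actual settings:

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

input_dim, encoding_dim = 16, 4  # hypothetical dimensions

input_layer = Input(shape=(input_dim,))
encoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(input_layer)
encoded2 = Dense(encoding_dim, activation='relu', activity_regularizer=l2(1e-5))(encoded1)
decoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(encoded2)
decoded = Dense(input_dim, activation='sigmoid', activity_regularizer=l2(1e-5))(decoded1)

autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded2)

# Train to reconstruct the input (X -> X), minimizing MSE.
autoencoder.compile(optimizer='adam', loss='mse')
X = np.random.rand(64, input_dim).astype('float32')
autoencoder.fit(X, X, epochs=2, batch_size=16, verbose=0)

# Keep only the encoder half for dimensionality reduction.
X_reduced = encoder.predict(X, verbose=0)  # shape: (64, encoding_dim)
```

Because the `encoder` model shares layers with the trained `autoencoder`, calling `encoder.predict` after training yields the learned latent representation directly.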

#### <mark style="color:blue;">How an Autoencoder Works</mark>

<figure><img src="https://4269815422-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FGc9mnST31tkU3h2NNgSL%2Fuploads%2F7Bnt0eywlYqhg9eVTf14%2F1_nqzWupxC60iAH2dYrFT78Q.png?alt=media&#x26;token=2c550d8f-9e70-4902-bf94-d605dae52e9f" alt=""><figcaption><p>Autoencoder </p></figcaption></figure>

1. **Input Layer**: The raw features from the dataset are fed into the input layer.
2. **Encoding**: The encoder, typically composed of fully connected layers, compresses the input data into a smaller representation by learning important features and discarding redundant information. For example, a layer might shrink the number of input features by half.
3. **Latent Space**: This compressed representation, also called the latent space, captures the most critical features needed for reconstructing the input.
4. **Decoding**: The decoder attempts to expand the latent space representation back to the original input feature size, aiming to recreate the input data as closely as possible.
5. **Training**: The network is trained to minimize the reconstruction loss (difference between the original input and the reconstructed output), gradually improving the quality of the compression.

By using an autoencoder, we reduce the dimensionality of the feature space, which helps in retaining only the most relevant features and discarding noise, making downstream tasks like prediction more efficient.

***

### <mark style="color:blue;">**Multi-target Handling**</mark>

The method can handle multiple targets, combining them into a single 2D NumPy array.
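A minimal sketch of how per-target arrays might be combined (the target names, shapes, and values are hypothetical):

```python
import numpy as np

# Hypothetical processed targets: one scaled numeric, one one-hot encoded.
price = np.array([1.2, 3.4, 5.6])           # shape (3,)
label = np.array([[1, 0], [0, 1], [1, 0]])  # shape (3, 2)

# Stack column-wise into a single 2D array with one row per sample.
y = np.column_stack([price, label])
print(y.shape)  # (3, 3)
```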

***

### <mark style="color:blue;">Usage</mark>

Both methods are called during the model training process:

```python
X, feature_types, feature_encoders, n_features = self.feature_engineering(data, features, mode='train')
y, target_types, target_encoders = self.extract_targets(data, targets)
```

***

{% hint style="info" %}

### <mark style="color:blue;">Advantages</mark>

* Handles both training and prediction modes
* Stores encoders, scalers, and other fitted state so features and targets are transformed consistently at prediction time
* Calculates sample weights that give more importance to recent observations
* Handles various data types automatically, applying the appropriate transformation for each
* Supports multi-target scenarios
{% endhint %}
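As a sketch of the recency weighting mentioned above (the exponential decay scheme and the factor are assumptions for illustration, not necessarily the method's actual formula):

```python
import numpy as np

n = 5        # samples ordered oldest -> newest
decay = 0.9  # hypothetical decay factor per step back in time

# Exponentially larger weights for more recent observations.
weights = decay ** np.arange(n - 1, -1, -1)
weights /= weights.sum()  # normalize so the weights sum to 1
```

The resulting weights increase monotonically toward the newest sample and can be passed as `sample_weight` to most scikit-learn and Keras fit routines.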
