# Data Engineering

The data and feature engineering process is implemented in the `feature_engineering` method. Target extraction is handled by the `extract_targets` method, which prepares the target variables for training.

Together, these two methods handle various data types, encode categorical variables, scale numeric variables, and apply dimensionality reduction using an autoencoder.

***

### <mark style="color:blue;">**Data Type Identification**</mark>

The methods identify the data type of each feature and target:

* Unix timestamps
* Ethereum addresses
* Categorical features
* Numeric features

<table data-full-width="false"><thead><tr><th width="263">Type</th><th width="264" align="center">Feature</th><th align="center">Target</th></tr></thead><tbody><tr><td>Unix timestamps</td><td align="center">✅</td><td align="center">❌</td></tr><tr><td>Ethereum addresses</td><td align="center">✅</td><td align="center">✅</td></tr><tr><td>Categorical features</td><td align="center">✅</td><td align="center">✅</td></tr><tr><td>Numeric features</td><td align="center">✅</td><td align="center">✅</td></tr></tbody></table>

### <mark style="color:blue;">**Timestamp Feature Extraction**</mark>

For Unix timestamps, the method extracts several time-based features:

* Year, month, day, hour, day of week
* Cyclical encoding for month, day, and hour

Example of cyclical encoding:

```python
df[f'{feature}_month_sin'] = np.sin(2 * np.pi * df[feature].dt.month / 12)
df[f'{feature}_month_cos'] = np.cos(2 * np.pi * df[feature].dt.month / 12)
```
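Putting these together, a minimal sketch of the full timestamp expansion (the `ts` column name and sample values are hypothetical, not taken from the project's data):

```python
import numpy as np
import pandas as pd

# Hypothetical Unix-timestamp column.
df = pd.DataFrame({'ts': [1700000000, 1712345678]})
dt = pd.to_datetime(df['ts'], unit='s')

# Calendar-based features.
df['ts_year'] = dt.dt.year
df['ts_month'] = dt.dt.month
df['ts_day'] = dt.dt.day
df['ts_hour'] = dt.dt.hour
df['ts_dayofweek'] = dt.dt.dayofweek

# Cyclical encoding keeps December (12) numerically adjacent to January (1).
df['ts_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
df['ts_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
```

The sine/cosine pair maps each month onto a point on the unit circle, so the model sees the wrap-around between year boundaries instead of an artificial jump from 12 to 1.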

***

### <mark style="color:blue;">**Categorical Encoding**</mark>

Categorical features and targets, including Ethereum addresses, are transformed with One-Hot Encoding, which creates a binary column for each category:

```python
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output replaces sparse in scikit-learn >= 1.2
encoded = encoder.fit_transform(df[[feature]])
```

***

### <mark style="color:blue;">**Numeric Feature Scaling**</mark>

Numeric features and targets are scaled to zero mean and unit variance with `StandardScaler`, ensuring consistent model input:

```python
scaler = StandardScaler()
df[feature] = scaler.fit_transform(df[[feature]])
```
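A sketch of why the fitted scaler is stored: at prediction time the same training-set statistics must be reused rather than refit (the `volume` column and values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'volume': [10.0, 20.0, 30.0]})
new = pd.DataFrame({'volume': [25.0]})

# Fit on training data only; keep the scaler for later.
scaler = StandardScaler()
train['volume'] = scaler.fit_transform(train[['volume']])

# At prediction time, transform with the stored scaler -- do NOT refit,
# or new data would be scaled on a different mean/variance.
new['volume'] = scaler.transform(new[['volume']])
```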

***

### <mark style="color:blue;">**Autoencoder for Dimensionality Reduction**</mark>

An autoencoder is used to reduce the dimensionality of the feature space:

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

input_layer = Input(shape=(input_dim,))
encoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(input_layer)
encoded2 = Dense(encoding_dim, activation='relu', activity_regularizer=l2(1e-5))(encoded1)
decoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(encoded2)
decoded = Dense(input_dim, activation='sigmoid', activity_regularizer=l2(1e-5))(decoded1)

autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded2)
```

An autoencoder is an artificial neural network used for unsupervised learning, primarily for dimensionality reduction and feature learning. It learns an efficient representation of the input data by training the network to ignore noise and irrelevant variation while preserving important features. The architecture consists of two main parts:

* **Encoder**: The encoder compresses the input data into a lower-dimensional space (latent representation), reducing its dimensionality while retaining critical information.
* **Decoder**: The decoder reconstructs the input data from the compressed representation, attempting to generate an output as similar as possible to the original input.

The autoencoder is trained by minimizing the difference between the input and the reconstructed output, often using a loss function like Mean Squared Error (MSE).
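A training sketch for the architecture above, assuming TensorFlow/Keras; `input_dim`, `encoding_dim`, the epoch count, and the batch size are illustrative values, not the project's actual settings:

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

input_dim, encoding_dim = 16, 4  # hypothetical dimensions

input_layer = Input(shape=(input_dim,))
encoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(input_layer)
encoded2 = Dense(encoding_dim, activation='relu', activity_regularizer=l2(1e-5))(encoded1)
decoded1 = Dense(input_dim // 2, activation='relu', activity_regularizer=l2(1e-5))(encoded2)
decoded = Dense(input_dim, activation='sigmoid', activity_regularizer=l2(1e-5))(decoded1)

autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded2)

# Train to reconstruct the input (X -> X), minimizing MSE.
autoencoder.compile(optimizer='adam', loss='mse')
X = np.random.rand(64, input_dim).astype('float32')
autoencoder.fit(X, X, epochs=2, batch_size=16, verbose=0)

# Keep only the encoder half for dimensionality reduction.
X_reduced = encoder.predict(X, verbose=0)  # shape: (64, encoding_dim)
```

Because the `encoder` model shares layers with the trained `autoencoder`, calling `encoder.predict` after training yields the learned latent representation directly.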

#### <mark style="color:blue;">How an Autoencoder Works</mark>

<figure><img src="https://4269815422-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FGc9mnST31tkU3h2NNgSL%2Fuploads%2F7Bnt0eywlYqhg9eVTf14%2F1_nqzWupxC60iAH2dYrFT78Q.png?alt=media&#x26;token=2c550d8f-9e70-4902-bf94-d605dae52e9f" alt=""><figcaption><p>Autoencoder </p></figcaption></figure>

1. **Input Layer**: The raw features from the dataset are fed into the input layer.
2. **Encoding**: The encoder, typically composed of fully connected layers, compresses the input data into a smaller representation by learning important features and discarding redundant information. For example, a layer might shrink the number of input features by half.
3. **Latent Space**: This compressed representation, also called the latent space, captures the most critical features needed for reconstructing the input.
4. **Decoding**: The decoder attempts to expand the latent space representation back to the original input feature size, aiming to recreate the input data as closely as possible.
5. **Training**: The network is trained to minimize the reconstruction loss (difference between the original input and the reconstructed output), gradually improving the quality of the compression.

By using an autoencoder, we reduce the dimensionality of the feature space, which helps in retaining only the most relevant features and discarding noise, making downstream tasks like prediction more efficient.

***

### <mark style="color:blue;">**Multi-target Handling**</mark>

The method can handle multiple targets, combining them into a single 2D NumPy array.
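A minimal sketch of how per-target arrays might be combined (the target names, shapes, and values are hypothetical):

```python
import numpy as np

# Hypothetical processed targets: one scaled numeric, one one-hot encoded.
price = np.array([1.2, 3.4, 5.6])           # shape (3,)
label = np.array([[1, 0], [0, 1], [1, 0]])  # shape (3, 2)

# Stack column-wise into a single 2D array with one row per sample.
y = np.column_stack([price, label])
print(y.shape)  # (3, 3)
```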

***

### <mark style="color:blue;">Usage</mark>

Both methods are called during the model training process:

```python
X, feature_types, feature_encoders, n_features = self.feature_engineering(data, features, mode='train')
y, target_types, target_encoders = self.extract_targets(data, targets)
```

***

{% hint style="info" %}

### <mark style="color:blue;">Advantages</mark>

* Handles both training and prediction modes
* Stores encoders, scalers, and other fitted state so features and targets are transformed consistently at prediction time
* Calculates sample weights that give more importance to recent observations
* Handles various data types automatically, applying the appropriate transformation for each
* Supports multi-target scenarios
{% endhint %}
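As a sketch of the recency weighting mentioned above (the exponential decay scheme and the factor are assumptions for illustration, not necessarily the method's actual formula):

```python
import numpy as np

n = 5        # samples ordered oldest -> newest
decay = 0.9  # hypothetical decay factor per step back in time

# Exponentially larger weights for more recent observations.
weights = decay ** np.arange(n - 1, -1, -1)
weights /= weights.sum()  # normalize so the weights sum to 1
```

The resulting weights increase monotonically toward the newest sample and can be passed as `sample_weight` to most scikit-learn and Keras fit routines.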
