Feature Engineering

Feature engineering is a crucial step in preparing data for machine learning models. This process involves transforming raw financial data into meaningful features that can improve model performance.

Let's break down the feature_engineering method:

1. Initialization and Data Preparation

def feature_engineering(self, data, features, mode='train'):
    df = pd.DataFrame(data)

This method takes in raw data, a list of features, and a mode ('train' or 'predict'). It starts by converting the data into a pandas DataFrame for easier manipulation.

2. Helper Functions

def is_eth_address(s):
    # An Ethereum address is '0x' followed by 40 hex characters (42 chars total)
    return isinstance(s, str) and s.startswith('0x') and len(s) == 42

def is_unix_timestamp(s, column_name):
    # 4102444800 is 2100-01-01 UTC; only columns named like timestamps qualify.
    # Catch TypeError too, so non-numeric types (e.g. None) return False.
    try:
        return 0 < int(s) < 4102444800 and 'timestamp' in column_name.lower()
    except (ValueError, TypeError):
        return False

These helper functions identify Ethereum addresses and Unix timestamps, which are common in DeFi data.
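A quick sanity check of the helpers (restated here so the snippet runs standalone; this version also catches TypeError so non-string inputs return False, and the address is a dummy value, not a real contract):

```python
def is_eth_address(s):
    # An Ethereum address is '0x' followed by 40 hex characters (42 chars total)
    return isinstance(s, str) and s.startswith('0x') and len(s) == 42

def is_unix_timestamp(s, column_name):
    # 4102444800 is 2100-01-01 UTC; only columns named like timestamps qualify
    try:
        return 0 < int(s) < 4102444800 and 'timestamp' in column_name.lower()
    except (ValueError, TypeError):
        return False

print(is_eth_address('0x' + 'ab' * 20))                    # True: dummy 42-char address
print(is_unix_timestamp('1700000000', 'block_timestamp'))  # True
print(is_unix_timestamp('1700000000', 'gas_used'))         # False: name lacks 'timestamp'
```

The column-name check matters: plenty of on-chain integers (gas, block numbers, token amounts) fall inside the plausible-timestamp range, so the value test alone would misclassify them.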

3. Feature Type Inference and Encoding

The method iterates through each feature, inferring its type and applying appropriate encoding:

Unix Timestamp Features

if self.feature_types[feature] == 'unix_timestamp':
    df[feature] = pd.to_datetime(df[feature], unit='s')
    
    # Extract time-based features
    df[f'{feature}_year'] = df[feature].dt.year
    df[f'{feature}_month'] = df[feature].dt.month
    # ... (more time-based features)
    
    # Cyclical encoding
    df[f'{feature}_month_sin'] = np.sin(2 * np.pi * df[feature].dt.month / 12)
    df[f'{feature}_month_cos'] = np.cos(2 * np.pi * df[feature].dt.month / 12)
    # ... (more cyclical encodings)

For timestamp features, it extracts various time-based components and applies cyclical encoding to capture periodic patterns.
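The point of the sin/cos pair: December (month 12) and January (month 1) are 11 apart numerically but adjacent in time. Cyclical encoding puts them next to each other on the unit circle, which a distance check makes concrete (the two timestamps below are illustrative):

```python
import numpy as np
import pandas as pd

# Two illustrative timestamps: 2023-12-01 and 2024-01-01 (UTC)
ts = pd.to_datetime(pd.Series([1701388800, 1704067200]), unit='s')
month = ts.dt.month  # [12, 1]

month_sin = np.sin(2 * np.pi * month / 12)
month_cos = np.cos(2 * np.pi * month / 12)

# In raw month numbers Dec and Jan are 11 apart; on the circle they are neighbors
dist = np.hypot(month_sin.iloc[0] - month_sin.iloc[1],
                month_cos.iloc[0] - month_cos.iloc[1])
print(round(float(dist), 3))  # 0.518, versus |12 - 1| = 11
```

The same trick applies to hour-of-day (divide by 24) and day-of-week (divide by 7).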

Categorical Features

elif self.feature_types[feature] == 'categorical':
    # scikit-learn >= 1.2 renamed `sparse` to `sparse_output`
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoded = encoder.fit_transform(df[[feature]])
    # ... (one-hot encoding logic)

Categorical features, including Ethereum addresses, are one-hot encoded to convert them into a format suitable for machine learning models.

Numeric Features

else:
    df[feature] = pd.to_numeric(df[feature], errors='coerce')
    scaler = StandardScaler()
    df[feature] = scaler.fit_transform(df[[feature]])

Numeric features are first coerced with pd.to_numeric (unparseable values become NaN) and then standardized by StandardScaler to zero mean and unit variance. Note that the snippet fits a new scaler on every call; for consistent transforms at predict time, the scaler fitted during training should be persisted and reused (see Persistence below).
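A minimal standalone sketch of the coerce-then-scale step, using a hypothetical tvl_usd column with one unparseable value. StandardScaler disregards NaNs when fitting and preserves them in the output, so the bad row survives as NaN rather than corrupting the statistics:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one value that won't parse
df = pd.DataFrame({'tvl_usd': ['1000000', '2500000', 'n/a', '4000000']})

# errors='coerce' turns unparseable strings into NaN instead of raising
df['tvl_usd'] = pd.to_numeric(df['tvl_usd'], errors='coerce')

# Fit on the valid rows; .ravel() flattens the (n, 1) output back to a column
scaler = StandardScaler()
df['tvl_usd'] = scaler.fit_transform(df[['tvl_usd']]).ravel()

print(df['tvl_usd'].isna().sum())  # 1: the 'n/a' row stays NaN
```

Downstream steps then need a NaN policy (imputation or row dropping) before the autoencoder, since neural layers cannot consume NaN inputs.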

4. Dimensionality Reduction with Autoencoder

autoencoder_input = df[encoded_features].values.astype(np.float32)

if mode == 'train':
    # ... (autoencoder model definition; autoencoder_input_tf is
    # autoencoder_input converted to a TensorFlow tensor)
    autoencoder.fit(autoencoder_input_tf, autoencoder_input_tf, epochs=50, batch_size=32, shuffle=True, verbose=0)
else:
    # Restore the encoder persisted during training (pickle bytes stored
    # as a latin1-decoded string in the options blob)
    encoder = pickle.loads(self.options["encoder_model_blob"].encode('latin1'))

encoded_features = encoder.predict(autoencoder_input)

An autoencoder is used for dimensionality reduction, especially useful when dealing with high-dimensional DeFi data. It compresses the features into a lower-dimensional space while preserving important information.
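The method itself trains a neural autoencoder (the Keras-style fit call above), which isn't reproduced here. As a dependency-free illustration of the compression idea: a linear autoencoder trained with squared-error loss learns the same subspace as PCA, so a truncated SVD gives the encode/decode pair in closed form. Everything below (the 200x8 toy matrix, the 3-dimensional bottleneck) is made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the engineered feature matrix: 200 samples, 8 features
# lying near a 3-dimensional subspace plus a little noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.01 * rng.normal(size=(200, 8))
Xc = X - X.mean(axis=0)  # center, as a linear autoencoder's bias would

# Truncated SVD = the optimal linear encoder/decoder pair for squared error
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
encoded = Xc @ Vt[:3].T            # encode: 8 -> 3
reconstructed = encoded @ Vt[:3]   # decode: 3 -> 8

rel_err = np.linalg.norm(reconstructed - Xc) / np.linalg.norm(Xc)
print(encoded.shape)  # (200, 3)
```

A nonlinear autoencoder generalizes this: the same compress-then-reconstruct objective, but with nonlinear encode/decode maps that can capture curved structure in the feature space.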

5. Final Feature Preparation

final_features = encoded_feature_names
X = df[final_features].values

The final feature set consists of the encoded features from the autoencoder.

Key Considerations

  1. Ethereum Addresses: The method specifically handles Ethereum addresses, treating them as categorical features. This is crucial for analyzing on-chain data.

  2. Timestamp Handling: Detailed extraction of time-based features allows the model to capture temporal patterns in DeFi markets, such as day-of-week effects or seasonal trends.

  3. Scalability: The use of an autoencoder for dimensionality reduction helps in handling the high-dimensional nature of DeFi data, which can include numerous tokens, pools, and market indicators.

  4. Adaptability: The method can handle both training and prediction modes, ensuring consistent feature engineering across model training and deployment.

  5. Persistence: Encoders, scalers, and the autoencoder model are saved, allowing for consistent transformation of new data during prediction.
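The round trip implied by the encoder_model_blob lookup above can be sketched as follows: pickle a fitted transformer to bytes, store the bytes as text via latin1 (which maps all 256 byte values one-to-one to code points, so the trip is lossless), and reverse it at predict time. A StandardScaler stands in here for the saved encoder:

```python
import pickle
from sklearn.preprocessing import StandardScaler

# Fit any transformer during training...
scaler = StandardScaler().fit([[1.0], [2.0], [3.0]])

# ...serialize to bytes, then to str so it fits a JSON/text options field;
# latin1 maps byte values 0-255 one-to-one to code points, so nothing is lost
blob = pickle.dumps(scaler).decode('latin1')

# At predict time, reverse the trip and reuse the fitted parameters
restored = pickle.loads(blob.encode('latin1'))
print(restored.transform([[2.0]])[0][0])  # 0.0: saved mean/scale are applied
```

The usual caveat applies: unpickling executes arbitrary code, so blobs should only ever be loaded from a trusted store.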

By applying this comprehensive feature engineering process, we transform raw financial and blockchain data into a format that maximizes the effectiveness of machine learning models for tasks such as price prediction, risk assessment, and anomaly detection in DeFi operations.
