Data Preprocessing

Data Extraction

In this example, the data is fetched from the CoinGecko API and stored in a table named query_10. This table contains raw JSON arrays of historical prices, market caps, and total volumes fetched from Coingecko.

WITH RECURSIVE
    array_data_nn AS (
        SELECT prices AS prices_array,
               market_caps AS market_caps_array,
               total_volumes AS total_volumes_array
        FROM query_10
    ),
  • array_data_nn: This step extracts JSON arrays for prices, market caps, and volumes from the query_10 table for further processing.

  • A recursive CTE is then used to unpack the JSON arrays into individual rows, where each row corresponds to a specific time point with its associated price, market cap, and volume.

Feature Engineering

While some automatic feature engineering is applied in the backend (particularly for categorical features and timestamp dimensions - see Feature Engineering), it is necessary to apply custom preprocessing for time-series data, particularly by generating lagged values and rolling sums.

Lagged values and rolling sums are computed to capture trends and changes over time, which are crucial for time series analysis and prediction.

Example for generating lagged values and rolling sums:

lag_cte AS (
    SELECT strftime('%Y-%m-%d %H:%M', time / 1000, 'unixepoch') AS time,
               price,
               market_cap,
               volume,
               LAG(price) OVER (ORDER BY time) AS previous_price,
               LAG(price, 2) OVER (ORDER BY time) AS previous_previous_price,
               LAG(volume) OVER (ORDER BY time) AS previous_volume,
               LAG(volume, 2) OVER (ORDER BY time) AS previous_previous_volume,
               LAG(market_cap) OVER (ORDER BY time) AS previous_market_cap,
               LAG(market_cap, 2) OVER (ORDER BY time) AS previous_previous_market_cap,
               SUM(volume) OVER (ORDER BY time ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS previous_7_days_volume,
               SUM(volume) OVER (ORDER BY time ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS previous_30_days_volume
               
    FROM result_cte
)
  • Lag Functions: Calculate previous values of prices, market caps and volumes. These lag features help capture the previous states of the market.

  • Rolling Sums: Compute the sum of volumes over the past 7 and 30 days. These features help capture the trend in trading volume.

Final Feature Selection

The final feature set includes price changes, momentum indicators, and moving averages, which are all essential for predictive models in financial time series.

SELECT 
    --Timestamp will be feature-engineered by the system automatically
    strftime('%s', time) AS timestamp,
    time,
    
    --Price Features
    previous_price - previous_previous_price AS previous_price_change_abs,
    (previous_price - previous_previous_price) / previous_previous_price AS previous_price_change_rel,
    ln(previous_price / previous_previous_price) AS log_previous_price_change,
    
    -- Volume features
    previous_volume AS previous_day_volume,
    previous_7_days_volume AS previous_7_days_volume,
    previous_30_days_volume AS previous_30_days_volume,
    
    -- Market cap features
    previous_market_cap,
    previous_market_cap - previous_previous_market_cap AS previous_market_cap_change_abs,
    (previous_market_cap - previous_previous_market_cap) / previous_previous_market_cap AS previous_market_cap_change_rel,
    ln(COALESCE(previous_market_cap / previous_previous_market_cap, 1)) AS log_previous_market_cap_change,
    
    -- New Daily Features
    (price - LAG(price, 7) OVER (ORDER BY time)) / LAG(price, 7) OVER (ORDER BY time) AS daily_price_momentum_7_days,
    (volume - LAG(volume, 7) OVER (ORDER BY time)) / LAG(volume, 7) OVER (ORDER BY time) AS daily_volume_momentum_7_days,
    AVG(price) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS daily_moving_average_5_days,
    AVG(volume) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS daily_volume_moving_average_5_days,
    
    -- Targets
    price,
    ln(price / previous_price) AS log_price_change
FROM lag_cte
WHERE previous_price_change_rel IS NOT NULL
ORDER BY 1 ASC;
  • Price Changes: Absolute, relative, and logarithmic changes in price and market cap.

  • Momentum Indicators: Price and volume momentum over 7 days.

  • Moving Averages: Compute the 5-day moving averages for both price and volume, which are standard features in financial models.

  • Targets: Include both the current price and the logarithmic price change (log_price_change) as targets for prediction.

Logarithmic Price Changes

Logarithmic price changes (log_price_change) are often preferred in financial modeling over prices because they provide a better statistical representation. Logarithms stabilize variance and make the data more normally distributed, which is beneficial for many statistical models.

They also allow for the interpretation of changes in percentage terms, which is intuitive for financial analysis.

Last updated