Price Prediction

This use case shows how to leverage the Inverse Watch ML infrastructure for price prediction based on simple CoinGecko historical data (price, volume, and market capitalization).

Data Preprocessing

Data Extraction

In this example, the data is fetched from the CoinGecko API and stored in a table named query_10. This table contains JSON arrays of historical prices, market caps, and total volumes; each array is a list of [timestamp_ms, value] pairs as returned by CoinGecko.

WITH RECURSIVE
    array_data_nn AS (
        SELECT prices AS prices_array,
               market_caps AS market_caps_array,
               total_volumes AS total_volumes_array
        FROM query_10
    ),
  • array_data_nn: Extracts JSON arrays from the query_10 table for further processing.

A recursive CTE is then used to unpack the JSON arrays into individual rows, each representing a specific time point with its corresponding price, market cap, and volume.

  • result_cte: Iteratively extracts each element from the JSON arrays, creating a row for each time point.
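
The result_cte itself is not reproduced in full here. Below is a minimal sketch of what it might look like, assuming SQLite's JSON1 functions and CoinGecko's [[timestamp_ms, value], ...] array layout (the array columns are carried along only to drive the recursion):

result_cte AS (
    -- Anchor: extract the first element of each array
    SELECT 0 AS idx,
           json_extract(prices_array, '$[0][0]') AS time,
           json_extract(prices_array, '$[0][1]') AS price,
           json_extract(market_caps_array, '$[0][1]') AS market_cap,
           json_extract(total_volumes_array, '$[0][1]') AS volume,
           prices_array, market_caps_array, total_volumes_array
    FROM array_data_nn
    UNION ALL
    -- Recursive step: advance the index until the arrays are exhausted
    SELECT idx + 1,
           json_extract(prices_array, '$[' || (idx + 1) || '][0]'),
           json_extract(prices_array, '$[' || (idx + 1) || '][1]'),
           json_extract(market_caps_array, '$[' || (idx + 1) || '][1]'),
           json_extract(total_volumes_array, '$[' || (idx + 1) || '][1]'),
           prices_array, market_caps_array, total_volumes_array
    FROM result_cte
    WHERE idx + 1 < json_array_length(prices_array)
),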

Feature Engineering

Lagged values and rolling sums are computed to capture trends and changes over time, which are crucial for time series analysis and prediction.

lag_cte AS (
    SELECT strftime('%Y-%m-%d %H:%M', time / 1000, 'unixepoch') AS time,
           price,
           market_cap,
           volume,
           LAG(price) OVER (ORDER BY time) AS previous_price,
           LAG(price, 2) OVER (ORDER BY time) AS previous_previous_price,
           LAG(volume) OVER (ORDER BY time) AS previous_volume,
           LAG(volume, 2) OVER (ORDER BY time) AS previous_previous_volume,
           LAG(market_cap) OVER (ORDER BY time) AS previous_market_cap,
           LAG(market_cap, 2) OVER (ORDER BY time) AS previous_previous_market_cap,
           SUM(volume) OVER (ORDER BY time ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS previous_7_days_volume,
           SUM(volume) OVER (ORDER BY time ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS previous_30_days_volume
    FROM result_cte
)
  • Lag Functions: Calculate previous values of prices, market caps and volumes.

  • Rolling Sums: Compute the sum of volumes over the past 7 and 30 days.

Final Feature Selection

The final set of features is selected and computed, including price changes, momentum indicators, and moving averages.

SELECT 
    -- Timestamp will be feature-engineered by the system automatically
    strftime('%s', time) AS timestamp,
    time,
    
    -- Price features
    previous_price - previous_previous_price AS previous_price_change_abs,
    (previous_price - previous_previous_price) / previous_previous_price AS previous_price_change_rel,
    ln(previous_price / previous_previous_price) AS log_previous_price_change,
    
    -- Volume features
    previous_volume AS previous_day_volume,
    previous_7_days_volume AS previous_7_days_volume,
    previous_30_days_volume AS previous_30_days_volume,
    
    -- Market cap features
    previous_market_cap,
    previous_market_cap - previous_previous_market_cap AS previous_market_cap_change_abs,
    (previous_market_cap - previous_previous_market_cap) / previous_previous_market_cap AS previous_market_cap_change_rel,
    ln(COALESCE(previous_market_cap / previous_previous_market_cap, 1)) AS log_previous_market_cap_change,
    
    -- New Daily Features
    (price - LAG(price, 7) OVER (ORDER BY time)) / LAG(price, 7) OVER (ORDER BY time) AS daily_price_momentum_7_days,
    (volume - LAG(volume, 7) OVER (ORDER BY time)) / LAG(volume, 7) OVER (ORDER BY time) AS daily_volume_momentum_7_days,
    AVG(price) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS daily_moving_average_5_days,
    AVG(volume) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS daily_volume_moving_average_5_days,
    
    -- Targets
    price,
    ln(price / previous_price) AS log_price_change
FROM lag_cte
WHERE previous_price_change_rel IS NOT NULL
ORDER BY 1 ASC;
  • Price Changes: Absolute, relative, and logarithmic changes in price and market cap.

  • Momentum Indicators: Price and volume momentum over 7 days.

  • Moving Averages: 5-day moving averages for price and volume.

  • Targets: Current price and logarithmic price change that will be used for prediction.

Logarithmic Price Changes

Logarithmic price changes (log_price_change) are often preferred in financial modeling because they have better statistical properties: logarithms stabilize variance and make the data closer to normally distributed, which benefits many statistical models. They also allow changes to be read in percentage terms, which is intuitive for financial analysis.
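
For reference (a standard definition, not specific to Inverse Watch), the log return used as the target is:

$$
r_t = \ln\!\left(\frac{P_t}{P_{t-1}}\right) \approx \frac{P_t - P_{t-1}}{P_{t-1}} \quad \text{for small changes}, \qquad \ln\frac{P_T}{P_0} = \sum_{t=1}^{T} r_t
$$

so small log changes read approximately as percentage changes and sum cleanly over consecutive periods.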

Model Creation

After preparing your input data, follow these steps to create a new price prediction model using the Inverse Watch UI:

  1. Select Data Source:

    • In the "Query" field, search for and select "inv_coingecko_price_data". This query contains our engineered features and target variables.

  2. Define Columns:

    • You'll see a list of all columns from your query. For each column, specify whether it's a feature or a target:

      • Mark the input columns as features, including "timestamp" and "time". The system will automatically derive additional features from these time-related columns.

      • Mark "log_price_change" as your target variable.

      • Leave "price" unmarked if you're not using it as a direct feature or target.

  3. Provide Model Description:

    • In the "Description" field, enter a meaningful description for your model. For example: "Cryptocurrency price prediction model using historical price, volume, and market cap data."

  4. Model Version:

    • The version is set automatically by the system. You don't need to manually adjust this.

  5. Choose Regressor:

    • From the "Regressor" dropdown, select one of the following options:

      • Linear Regression

      • Random Forest

      • Gradient Boosting

      • AdaBoost

      • LSTM Neural Network

  6. Configure Regressor Settings:

    • Auto Mode: Leave this unchecked to set the parameters manually. For ease of use, you can check the box and the system will run hyperparameter tuning for you automatically.

    • Number of Trees: Set the number of trees in the forest (e.g., 100).

    • Max Depth: Set the maximum depth of the trees (e.g., 10).

    • Min Samples Split: Minimum number of samples required to split an internal node (e.g., 2).

    • Min Samples Leaf: Minimum number of samples required to be at a leaf node (e.g., 1).

    • Criterion (Regression): Choose the function to measure the quality of a split (e.g., "mse" for mean squared error).

  7. Set Train/Test Split:

    • Adjust the slider to set the split between training and test data. A common split is 70-80% train and 20-30% test.

  8. Set Random State:

    • Enter a number (e.g., 42) to ensure reproducibility of results.

  9. Configure Training Options:

    • Retrain when: Select "When query is refreshed" if you want the model to retrain automatically when new data is available.

    • When trained, send notification: Choose "Always send notifications" to stay informed about training progress.

    • Train Template: Use the default template unless you have a custom one.

  10. Configure Prediction Options:

    • Predict when: Select "When query is refreshed" to generate new predictions when data is updated.

    • When predicted, send notification: Choose "Always send notifications" to be notified of new predictions.

    • Predict Template: Use the default template unless you have a custom one.

  11. Create Model:

    • Once you've configured all settings, click on the "Create Model" button at the bottom of the page to create the model configuration.

  12. Start Training:

    • After creating the model, you need to manually start the training process:

      • Go to the model options.

      • Click "Start Training" to begin the training process.

    • Alternatively, if you've set "Retrain when" to "When query is refreshed", the model will train automatically the next time the query refreshes. For the time being, resources are limited, so it's better to avoid this option.

  13. Monitor Training:

    • Once training starts, you can monitor the progress and view results when training completes.

Evaluate Metrics

After training your model, it's crucial to evaluate its performance using various metrics. You can access these metrics on the main model page in the Inverse Watch UI or receive them via Discord if you've set it up as a notification destination.

Accessing Metrics

  1. Main Model Page: Navigate to your model's page in the Inverse Watch UI to view detailed metrics.

  2. Discord Notifications: If configured, you'll receive metric updates in your designated Discord channel or webhook.

Understanding the Metrics

Overall Metrics

  • Mean Absolute Error (MAE): 38.5595

    • Interpretation: On average, predictions deviate by about 38.56 units from the actual values.

  • Mean Squared Error (MSE): 8299.2618

    • Interpretation: Gives higher weight to larger errors. Useful for comparing models.

  • R2 Score: 0.4505

    • Interpretation: The model explains about 45.05% of the variance in the target variable.

  • Is Overfitted: False

    • Interpretation: The model doesn't show signs of overfitting to the training data.

  • Overfitting Score: 0.0914

    • Interpretation: Low score indicates minimal overfitting.
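
For reference, the standard definitions behind these metrics (with y_i the actual values, ŷ_i the predictions, and ȳ the mean of the actuals) are:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$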

Price Metrics

  • Mean Absolute Error: 77.0788

  • Mean Squared Error: 16598.5195

  • R2 Score: 0.6802

  • Train Performance: 0.7246

  • Val Performance: 0.6802

Log Price Change Metrics

  • Mean Absolute Error: 0.0402

  • Mean Squared Error: 0.0041

  • R2 Score: 0.2208

  • Train Performance: 0.2588

  • Val Performance: 0.2208

Interpreting the Results

  1. Model Performance: The overall R2 score of 0.4505 suggests that the model has moderate predictive power. There's room for improvement.

  2. Overfitting: The model doesn't show signs of overfitting (Is Overfitted: False), which is positive. The low overfitting score (0.0914) supports this conclusion.

  3. Price vs. Log Price Change: The model performs better at predicting the actual price (R2 = 0.6802) than the log price change (R2 = 0.2208). This suggests that predicting price changes is more challenging than predicting absolute prices.

  4. Prediction Accuracy: The mean absolute error for price (77.0788) indicates the average deviation of predictions from actual prices. Consider if this level of accuracy is acceptable for your use case.

  5. Train vs. Validation Performance: For both price and log price change, the training performance is slightly higher than the validation performance, which is expected. The small difference suggests good generalization.

Metrics History

The "Prediction Metrics History" and "Training Metrics History" graphs allow you to track the model's performance over time:

  1. Prediction Metrics: Shows how the mean absolute error for price predictions has changed. Recent predictions seem more stable and accurate compared to earlier fluctuations.

  2. Training Metrics: Displays the mean absolute error for log price change during training. The relatively stable line suggests consistent performance across training iterations. Additionally, the tooltip shows the exact parameters that were used for the selected best candidate after training.

Accessing Results

While prediction results are available on the predictions page, it's often more convenient to access them programmatically. Inverse Watch provides an efficient way to retrieve prediction results using the Query Results query runner.

Accessing Predictions

You can access prediction results using SQL-like queries in the Query Results data source. There are two main ways to retrieve predictions:

  1. By Prediction ID:

    SELECT * FROM prediction_{prediction_id}

    Replace {prediction_id} with the specific ID of the prediction you want to retrieve.

  2. Latest Prediction for a Model:

    SELECT * FROM prediction_m_{model_id}

    Replace {model_id} with the ID of your model. This query will return the most recent prediction for that model.

Example Usage

Let's say your model ID is 11 (as shown in the earlier Discord notification). To get the latest prediction results, you would use:

SELECT * FROM prediction_m_11

Similarly, if your generated prediction ID was 148, you would use:

SELECT * FROM prediction_148
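
As a quick illustration, the hedged sketch below lines up the latest predictions for model 11 against the actual prices; it assumes the prediction table exposes the timestamp, price, pred_price, and pred_log_price_change columns used by the analysis query later in this guide:

SELECT DATETIME(timestamp, 'unixepoch') AS time,
       price AS actual_price,
       pred_price,
       pred_log_price_change
FROM prediction_m_11
ORDER BY timestamp ASC;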

Analyzing the Results

One of the key advantages of using the Inverse Watch platform (forked from Redash) is its robust data visualization capabilities. This allows us to seamlessly transition from raw prediction data to insightful charts and derived metrics.

After training and evaluating our models, we use a SQL query to analyze their performance and implement a basic trading strategy. This query compares predictions from multiple models: Linear Regression, Random Forest, Neural Network, Gradient Boosting, and AdaBoost. Let's break down the query and understand its components:

1. Data Preparation

For each model (Linear, Random Forest, Neural Network, Gradient Boosting, and AdaBoost), we create a CTE (Common Table Expression) to prepare the prediction data:

WITH prediction_data_linear AS (
    SELECT 
        timestamp,
        price AS actual_price,
        pred_price as pred_price_linear,
        LEAD(price) OVER (ORDER BY timestamp) AS next_actual_price,
        pred_log_price_change AS pred_log_price_change_linear,
        LAG(price) OVER (ORDER BY timestamp) AS previous_price
    FROM prediction_{{linear}}
),
  • This CTE fetches data from the prediction table for each model.

  • It includes actual prices, predicted prices, and predicted log price changes.

  • LEAD and LAG functions are used to get the next and previous prices.

2. Price Prediction and Trade Signal

For each model, we then create another CTE to calculate predicted prices and generate trade signals:

predicted_prices_linear AS (
    SELECT 
        timestamp,
        actual_price,
        pred_price_linear,
        next_actual_price,
        previous_price * exp(pred_log_price_change_linear) AS pred_price_log_linear,
        CASE 
            WHEN previous_price * exp(pred_log_price_change_linear) > actual_price THEN 1
            ELSE 0
        END AS trade_linear
    FROM prediction_data_linear
    WHERE previous_price IS NOT NULL
),
  • This CTE calculates the predicted price using the log price change.

  • It generates a trade signal (1 for buy, 0 for no action) based on whether the predicted price is higher than the actual price.

3. Combining Results

The full_results CTE joins the results from all models:

full_results AS (
    SELECT 
        DATETIME(lin.timestamp, 'unixepoch') as timestamp,
        -- Fields for each model...
    FROM predicted_prices_linear lin
    LEFT JOIN predicted_prices_rf rf ON lin.timestamp = rf.timestamp
    LEFT JOIN predicted_prices_nn nn ON lin.timestamp = nn.timestamp
    LEFT JOIN predicted_prices_gb gb ON lin.timestamp = gb.timestamp
    LEFT JOIN predicted_prices_ada ada ON lin.timestamp = ada.timestamp
),

4. PnL Calculation

The pnl_calculation CTE calculates the profit and loss for each trade:

pnl_calculation AS (
    SELECT 
        timestamp,
        CASE 
            WHEN LAG(trade_linear) OVER (ORDER BY timestamp) = 1 THEN (next_actual_price - actual_price) / actual_price * 10000
            ELSE 0
        END AS pnl_linear,
        -- Similar calculations for other models...
    FROM full_results
    WHERE row_num <= total_rows  * {{test_size}}
),
  • PnL is calculated only when a trade signal was generated in the previous period.

  • The calculation assumes a fixed position size of 10,000 units.

5. Balance Calculation

The balance_calculation CTE keeps a running total of the account balance:

balance_calculation AS (
    SELECT 
        timestamp,
        pnl_linear,
        10000 + SUM(pnl_linear) OVER (ORDER BY timestamp) AS balance_linear,
        -- Similar calculations for other models...
    FROM pnl_calculation
),
  • Starting balance is assumed to be 10,000 units.

  • Running total is calculated using a cumulative sum of PnL.

6. Enhanced Metrics

The enhanced_metrics CTE calculates additional performance metrics:

enhanced_metrics AS (
    SELECT 
        timestamp,
        (balance_linear - 10000) / 10000 * 100 AS cumulative_return_linear,
        10000 - MIN(balance_linear) OVER (ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS drawdown_linear,
        -- Similar calculations for other models...
    FROM balance_calculation
)
  • Calculates the cumulative return (percentage gain or loss relative to the starting 10,000 balance) and the drawdown (how far the balance has dipped below the starting 10,000 at its lowest point so far) for each model.

7. Final Results

The final SELECT statement combines all the calculated fields and orders the results by timestamp:

SELECT *
FROM full_results fr
JOIN balance_calculation bc ON fr.timestamp = bc.timestamp
JOIN enhanced_metrics em ON fr.timestamp = em.timestamp
WHERE fr.row_num <= fr.total_rows * {{test_size}}
ORDER BY fr.timestamp ASC;

Visualizing the Results

The Inverse Watch platform allows us to create various charts to visualize our prediction results:

  1. Raw Data Table: This table shows detailed metrics for each model, including actual prices, predicted prices, trade signals, PnL, balance, cumulative returns, and drawdowns. This granular data is crucial for in-depth analysis of each model's performance over time.

  2. Raw Price Prediction Chart: This chart compares the actual price with predictions from various models over time. Key observations:

    • Most models follow the general trend of the actual price.

    • There are significant spikes in predictions, especially for the Random Forest model, which shows extreme overestimations at certain points.

    • The Linear model appears to have some negative price predictions towards the end, which is not realistic for asset prices.

    • Neural Network and Gradient Boosting models seem to have more stable predictions compared to others.

  3. Predicted vs Real Price Chart: This chart offers a clearer view of how well the log price change predictions align with actual prices. Observations:

    • All models seem to track the actual price movements quite closely.

    • There's less extreme behavior compared to the raw price predictions, suggesting that predicting price changes might be more reliable than absolute prices.

    • The models appear to capture major trends and even some of the smaller fluctuations in the price.

Interpreting the Results

Based on these visualizations and the data from our SQL query, we can draw several conclusions:

  1. Prediction Accuracy:

    • Log price change predictions seem more stable and accurate compared to raw price predictions.

    • The Random Forest model, while potentially powerful, may need some hyperparameter tuning to reduce overfitting and extreme predictions.

    • The Linear model's negative price predictions indicate it might not be suitable for this type of data without modifications.

    • Neural Network and Gradient Boosting models appear to offer a good balance between following trends and avoiding extreme predictions.

  2. Trade Signals: Analyze the trade_* columns to see how often each model generates buy signals. The frequency and accuracy of these signals can be visualized using a heatmap (a simple aggregation sketch follows this list).

  3. PnL and Overall Performance: The pnl_* and balance_* columns show the profit or loss for each trade and the running total of the account balance, indicating the overall performance of each model in a simulated trading scenario.

  4. Cumulative Return: The cumulative_return_* columns provide insights into the overall profitability of each model's predictions.

  5. Drawdown: The drawdown_* columns show the largest dip below the starting balance for each model.
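
As a minimal sketch of such a signal check, assuming the final SELECT of the query above is replaced by an aggregation over the full_results CTE (the hypothetical hit_rate_linear counts a buy signal as correct when the next actual price is higher):

SELECT
    SUM(trade_linear) AS signals_linear,
    -- Share of buy signals followed by a price increase (non-trade rows are NULL and ignored by AVG)
    AVG(CASE WHEN trade_linear = 1
             THEN (next_actual_price > actual_price) END) AS hit_rate_linear
    -- Similar aggregates for the other models...
FROM full_results;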

Remember, while these visualizations provide valuable insights, they should be combined with rigorous statistical analysis and backtesting before making any trading decisions. This simplified strategy does not account for factors like transaction costs, slippage, or market impact, and performance on this historical data does not necessarily predict future performance.

By leveraging the Inverse Watch platform's visualization capabilities, we can gain deeper insights into our models' performance, easily communicate results to stakeholders, and make data-driven decisions to improve our trading strategies.
