Price Prediction
This use case shows how to leverage Inverse Watch ML infrastructure to achieve price prediction based on simple Coingecko historical data (price, volume and market capitalization).
Data Preprocessing
Data Extraction
In this example, the data is fetched from the CoinGecko API and stored in a table named query_10
. This table contains JSON arrays of historical prices, market caps, and total volumes fetched from Coingecko.
array_data_nn
: Extracts JSON arrays from thequery_10
table for further processing.
A recursive CTE is then used to unpack the JSON arrays into individual rows, each representing a specific time point with its corresponding price, market cap, and volume.
result_cte
: Iteratively extracts each element from the JSON arrays, creating a row for each time point.
Feature Engineering
Lagged values and rolling sums are computed to capture trends and changes over time, which are crucial for time series analysis and prediction.
Lag Functions: Calculate previous values of prices, market caps and volumes.
Rolling Sums: Compute the sum of volumes over the past 7 and 30 days.
Final Feature Selection
The final set of features is selected and computed, including price changes, momentum indicators, and moving averages.
Price Changes: Absolute, relative, and logarithmic changes in price and market cap.
Momentum Indicators: Price and volume momentum over 7 days.
Moving Averages: 5-day moving averages for price and volume.
Targets: Current price and logarithmic price change that will be used for prediction.
Logarithmic Price Changes
Logarithmic price changes (log_price_change
) are often preferred in financial modeling because they provide a better statistical representation. Logarithms stabilize variance and make the data more normally distributed, which is beneficial for many statistical models. They also allow for the interpretation of changes in percentage terms, which is intuitive for financial analysis.
Model Creation
After preparing your input data, follow these steps to create a new price prediction model using the Inverse Watch UI:
Select Data Source:
In the "Query" field, search for and select "inv_coingecko_price_data". This query contains our engineered features and target variables.
Define Columns:
You'll see a list of all columns from your query. For each column, specify whether it's a feature or a target:
Mark all columns as features, including "timestamp" and "time". The system will automatically derive additional features from these time-related columns.
Mark "log_price_change" as your target variable.
Leave "price" unmarked if you're not using it as a direct feature or target.
Provide Model Description:
In the "Description" field, enter a meaningful description for your model. For example: "Cryptocurrency price prediction model using historical price, volume, and market cap data."
Model Version:
The version is set automatically by the system. You don't need to manually adjust this.
Choose Regressor:
From the "Regressor" dropdown, select one of the following options:
Linear Regression
Random Forest
Gradient Boosting
AdaBoost
LSTM Neural Network
Configure Regressor Settings:
Auto Mode: Leave this unchecked to manually set parameters. For ease of use, you can check the box and the system will automatically run the hyper parameters tuning for you.
Number of Trees: Set the number of trees in the forest (e.g., 100).
Max Depth: Set the maximum depth of the trees (e.g., 10).
Min Samples Split: Minimum number of samples required to split an internal node (e.g., 2).
Min Samples Leaf: Minimum number of samples required to be at a leaf node (e.g., 1).
Criterion (Regression): Choose the function to measure the quality of a split (e.g., "mse" for mean squared error).
Set Train/Test Split:
Adjust the slider to set the split between training and test data. A common split is 80-70% train, 20-30% test.
Set Random State:
Enter a number (e.g., 42) to ensure reproducibility of results.
Configure Training Options:
Retrain when: Select "When query is refreshed" if you want the model to retrain automatically when new data is available.
When trained, send notification: Choose "Always send notifications" to stay informed about training progress.
Train Template: Use the default template unless you have a custom one.
Configure Prediction Options:
Predict when: Select "When query is refreshed" to generate new predictions when data is updated.
When predicted, send notification: Choose "Always send notifications" to be notified of new predictions.
Predict Template: Use the default template unless you have a custom one.
Create Model:
Once you've configured all settings, click on the "Create Model" button at the bottom of the page to create the model configuration.
Start Training:
After creating the model, you need to manually start the training process:
Go to the model options.
Click "Start Training" to begin the training process.
Alternatively, if you've set "Retrain when" to "When query is refreshed", the model will train automatically the next time the query refreshes. For the time being, ressource being limited it's better to avoid thisoption.
Monitor Training:
Once training starts, you can monitor the progress and view results when training completes.
Evaluate Metrics
After training your model, it's crucial to evaluate its performance using various metrics. You can access these metrics on the main model page in the Inverse Watch UI or receive them via Discord if you've set it up as a notification destination.
Accessing Metrics
Main Model Page: Navigate to your model's page in the Inverse Watch UI to view detailed metrics.
Discord Notifications: If configured, you'll receive metric updates in your designated Discord channel or web hook.
Understanding the Metrics
Overall Metrics
Mean Absolute Error (MAE): 38.5595
Interpretation: On average, predictions deviate by about 38.56 units from the actual values.
Mean Squared Error (MSE): 8299.2618
Interpretation: Gives higher weight to larger errors. Useful for comparing models.
R2 Score: 0.4505
Interpretation: The model explains about 45.05% of the variance in the target variable.
Is Overfitted: False
Interpretation: The model doesn't show signs of overfitting to the training data.
Overfitting Score: 0.0914
Interpretation: Low score indicates minimal overfitting.
Price Metrics
Mean Absolute Error: 77.0788
Mean Squared Error: 16598.5195
R2 Score: 0.6802
Train Performance: 0.7246
Val Performance: 0.6802
Log Price Change Metrics
Mean Absolute Error: 0.0402
Mean Squared Error: 0.0041
R2 Score: 0.2208
Train Performance: 0.2588
Val Performance: 0.2208
Interpreting the Results
Model Performance: The overall R2 score of 0.4505 suggests that the model has moderate predictive power. There's room for improvement.
Overfitting: The model doesn't show signs of overfitting (Is Overfitted: False), which is positive. The low overfitting score (0.0914) supports this conclusion.
Price vs. Log Price Change: The model performs better at predicting the actual price (R2 = 0.6802) than the log price change (R2 = 0.2208). This suggests that predicting price changes is more challenging than predicting absolute prices.
Prediction Accuracy: The mean absolute error for price (77.0788) indicates the average deviation of predictions from actual prices. Consider if this level of accuracy is acceptable for your use case.
Train vs. Validation Performance: For both price and log price change, the training performance is slightly higher than the validation performance, which is expected. The small difference suggests good generalization.
Metrics History
The "Prediction Metrics History" and "Training Metrics History" graphs allow you to track the model's performance over time:
Prediction Metrics: Shows how the mean absolute error for price predictions has changed. Recent predictions seem more stable and accurate compared to earlier fluctuations.
Training Metrics: Displays the mean absolute error for log price change during training. The relatively stable line suggests consistent performance across training iterations. Additionnally the Tooltip will show you what were the exact parameters used for after the training for the selected best candidate.
Accessing Results
While prediction results are available on the predictions page, it's often more convenient to access them programmatically. Inverse Watch provides an efficient way to retrieve prediction results using the query results queryrunner.
Accessing Predictions
You can access prediction results using SQL-like queries in the Query Results data source. There are two main ways to retrieve predictions:
By Prediction ID:
Replace
{prediction_id}
with the specific ID of the prediction you want to retrieve.Latest Prediction for a Model:
Replace
{model_id}
with the ID of your model. This query will return the most recent prediction for that model.
Example Usage
Let's say your model ID is 11 (as shown in the earlier Discord notification). To get the latest prediction results, you would use:
Correspondingly if your generated prediction id was 148 you would use :
Analyze Results
After training and evaluating our models, we use a SQL query to analyze their performance and implement a basic trading strategy. This query compares predictions from multiple models: Linear Regression, Random Forest, Neural Network, Gradient Boosting, and AdaBoost. Let's break down the query and understand its components:
Data Preparation
For each model (Linear, Random Forest, Neural Network, Gradient Boosting, and AdaBoost), we create a CTE (Common Table Expression) to prepare the prediction data:
This CTE fetches data from the prediction table for each model.
It includes actual prices, predicted prices, and predicted log price changes.
LEAD
andLAG
functions are used to get the next and previous prices.
2. Price Prediction and Trade Signal
For each model, we then create another CTE to calculate predicted prices and generate trade signals:
This CTE calculates the predicted price using the log price change.
It generates a trade signal (1 for buy, 0 for no action) based on whether the predicted price is higher than the actual price.
3. Combining Results
The full_results
CTE joins the results from all models:
4. PnL Calculation
The pnl_calculation
CTE calculates the profit and loss for each trade:
PnL is calculated only when a trade signal was generated in the previous period.
The calculation assumes a fixed position size of 10,000 units.
5. Balance Calculation
The balance_calculation
CTE keeps a running total of the account balance:
Starting balance is assumed to be 10,000 units.
Running total is calculated using a cumulative sum of PnL.
6. Final Results
The final SELECT statement combines all the calculated fields and orders the results by timestamp:
Interpreting the Results
This query allows us to compare the performance of different models:
Prediction Accuracy: Compare
pred_price_*
andpred_price_log_*
withactual_price
andnext_actual_price
.Trade Signals: Analyze the
trade_*
columns to see how often each model generates buy signals.PnL: The
pnl_*
columns show the profit or loss for each trade.Overall Performance: The
balance_*
columns provide a running total of the account balance, indicating overall performance of each model.
By analyzing these results, you can determine which model performs best in terms of prediction accuracy and potential profitability. Remember that this is a simplified trading strategy and does not account for factors like transaction costs, slippage, or market impact.
Analyzing the Results
One of the key advantages of using the Inverse Watch platform (forked from Redash) is its robust data visualization capabilities. This allows us to seamlessly transition from raw prediction data to insightful charts and derived metrics.
After training and evaluating our models, we use a SQL query to analyze their performance and implement a basic trading strategy. This query compares predictions from multiple models: Linear Regression, Random Forest, Neural Network, Gradient Boosting, and AdaBoost. Let's break down the query and understand its components:
1. Data Preparation
For each model (Linear, Random Forest, Neural Network, Gradient Boosting, and AdaBoost), we create a CTE (Common Table Expression) to prepare the prediction data:
This CTE fetches data from the prediction table for each model.
It includes actual prices, predicted prices, and predicted log price changes.
LEAD
andLAG
functions are used to get the next and previous prices.
2. Price Prediction and Trade Signal
For each model, we then create another CTE to calculate predicted prices and generate trade signals:
This CTE calculates the predicted price using the log price change.
It generates a trade signal (1 for buy, 0 for no action) based on whether the predicted price is higher than the actual price.
3. Combining Results
The full_results
CTE joins the results from all models:
4. PnL Calculation
The pnl_calculation
CTE calculates the profit and loss for each trade:
PnL is calculated only when a trade signal was generated in the previous period.
The calculation assumes a fixed position size of 10,000 units.
5. Balance Calculation
The balance_calculation
CTE keeps a running total of the account balance:
Starting balance is assumed to be 10,000 units.
Running total is calculated using a cumulative sum of PnL.
6. Enhanced Metrics
The enhanced_metrics
CTE calculates additional performance metrics:
Calculates cumulative return and drawdown for each model.
7. Final Results
The final SELECT statement combines all the calculated fields and orders the results by timestamp:
Visualizing the Results
The Inverse Watch platform allows us to create various charts to visualize our prediction results:
Raw Data Table: This table shows detailed metrics for each model, including actual prices, predicted prices, trade signals, PnL, balance, cumulative returns, and drawdowns. This granular data is crucial for in-depth analysis of each model's performance over time.
Raw Price Prediction Chart: This chart compares the actual price with predictions from various models over time. Key observations:
Most models follow the general trend of the actual price.
There are significant spikes in predictions, especially for the Random Forest model, which shows extreme overestimations at certain points.
The Linear model appears to have some negative price predictions towards the end, which is not realistic for asset prices.
Neural Network and Gradient Boosting models seem to have more stable predictions compared to others.
Predicted vs Real Price Chart: This chart offers a clearer view of how well the log price change predictions align with actual prices. Observations:
All models seem to track the actual price movements quite closely.
There's less extreme behavior compared to the raw price predictions, suggesting that predicting price changes might be more reliable than absolute prices.
The models appear to capture major trends and even some of the smaller fluctuations in the price.
Interpreting the Results
Based on these visualizations and the data from our SQL query, we can draw several conclusions:
Prediction Accuracy:
Log price change predictions seem more stable and accurate compared to raw price predictions.
The Random Forest model, while potentially powerful, may need some hyperparameter tuning to reduce overfitting and extreme predictions.
The Linear model's negative price predictions indicate it might not be suitable for this type of data without modifications.
Neural Network and Gradient Boosting models appear to offer a good balance between following trends and avoiding extreme predictions.
Trade Signals: Analyze the
trade_*
columns to see how often each model generates buy signals. The frequency and accuracy of these signals can be visualized using a heatmap.PnL /Cumulative Returns and Overall Performance: The
pnl_*
andbalance_*
columns show the profit or loss for each trade and the running total of the account balance. These metrics indicate the overall performance of each model in a simulated trading scenario.Cumulative Return : The
cumulative_return_*
columns provide insights into the overall profitability and risk of each model's predictions.Drawdown: The
drawdown_*
columns provide insights into the maximum loss for each model.
Remember, while these visualizations provide valuable insights, they should be combined with rigorous statistical analysis and back testing before making any trading decisions. The performance in this historical data may not necessarily predict future performance.
By leveraging the Inverse Watch platform's visualization capabilities, we can gain deeper insights into our models' performance, easily communicate results to stakeholders, and make data-driven decisions to improve our trading strategies.
Last updated