Introduction
The field of time-series analysis plays a crucial role in understanding and predicting complex data patterns over time. In this article, we will explore how to perform multiple variable time-series regression and causality analysis in Python using the powerful libraries Dash and Plotly. We’ll cover the necessary steps to upload data, conduct Regression analysis, perform various tests, and visualize the results.
The code you provided below is a Python script that utilizes the Dash framework to create a web application for multiple variable time-series regression and causality analysis. The application allows users to upload a CSV file containing time-series data, perform regression analysis on the data, and conduct various statistical tests.
Uploading Data
To get started, we need to upload the data we want to analyze. The app allows us to upload a CSV file containing the necessary information. Simply drag and drop the file or click the “Select Files” link to choose the file from your computer. To view an example CSV file click here.
Regression Analysis
Once the data is uploaded, we can proceed with the regression analysis. The app will read the CSV file and prepare the data for analysis. It assumes that the first row contains the column headers and the remaining rows represent the values.
We begin by extracting the necessary columns from the dataset, including the date/year, X variable, and Y variables. We convert the date column to the datetime type and add a constant column to the X variables to account for the intercept term.
Next, we perform linear regression for each Y column using the Ordinary Least Squares (OLS) method. This allows us to estimate the relationships between the X variable and each Y variable. We obtain the regression results, including the coefficients, p-values, and other relevant statistics.
Regression Summary
The regression summary provides a comprehensive overview of the regression results. It includes the coefficient estimates, standard errors, t-values, and p-values for each predictor variable. The summary tables are organized by Y variable, allowing for easy comparison and interpretation.
Tests and Analysis
To gain further insights into the regression results, we conduct various tests and analyses. Here are the key tests we perform:
Test 1: Linear Relationship between Slope and Intercept
This test examines the linear relationship between the slope and intercept of the regression line for each Y variable. We calculate the slope and intercept values and analyze their significance.
Test 2: Pearson Correlation Coefficient
We assess the correlation between the X variable and the residuals (errors) of the regression model for each Y variable. The Pearson correlation coefficient and its p-value provide insights into the relationship between the independent variable and the model’s residuals.
Test 3: Mean of Residuals
We calculate the mean of the residuals for each Y variable to determine if they are close to zero. A non-zero mean may indicate the presence of systematic errors in the regression model.
Test 4: Jarque-Bera Test for Residuals
The Jarque-Bera test assesses the normality of the residuals by examining skewness and kurtosis. We analyze the p-value associated with the test to determine if the residuals follow a normal distribution.
Test 5: Ljung-Box Test for Residuals
The Ljung-Box test checks the autocorrelation of the residuals at lag 1. By examining the p-value of the test, we can determine if the residuals are correlated or exhibit randomness.
Test 6: Normality Test for Residuals
We perform a normality test on the residuals using a statistical test such as the Shapiro-Wilk or Kolmogorov-Smirnov test. This helps us assess if the residuals are normally distributed.
Test 7: Stationarity Test for Residuals
To determine if the residuals are stationary, we conduct the Augmented Dickey-Fuller (ADF) test. This test helps us evaluate if the residuals exhibit a constant mean and variance over time. The p-value from the ADF test provides insights into the stationarity of the residuals.
Causality Analysis
In addition to regression analysis, we can also explore causality between variables. Causality analysis helps us understand the directional relationship between variables and determine if changes in one variable have an impact on another. Here are some causality analysis techniques we use:
Granger Causality Test
The Granger causality test examines if one variable can predict another variable by utilizing the concept of time lag. We apply this test to assess if the X variable has a causal relationship with each Y variable. The results provide the Granger causality statistics and associated p-values.
Pairwise Causality Heatmap
To visualize the causality relationships between variables, we create a pairwise causality heatmap. This heatmap displays the strength and directionality of the causal links between the X variable and each Y variable. It helps identify the variables that have a significant influence on others.
Visualizing Results
To enhance the understanding of the analysis, we employ interactive visualizations using the Dash and Plotly libraries. These visualizations allow for dynamic exploration of the data and results. We can create line charts to display the time series of variables, scatter plots to show the relationships between variables, and heatmaps to visualize causality.
Python code – You can download the whole project here on GitHub
import dash
from dash import dcc
from dash import html
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats import diagnostic
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
import plotly.express as px
import dash_table
import numpy as np
import scipy.stats as stats
import base64
import io
from tabulate import tabulate
# Initialize the Dash app
app = dash.Dash(__name__)
# Define the layout of the app
app.layout = html.Div([
html.H1("Multiple Variable Time-Series Regression and Causality Analysis in Python using Dash and Plotly"),
dcc.Upload(
id="upload-data",
children=html.Div([
"Drag and Drop or ",
html.A("Select Files")
]),
style={
"width": "50%",
"height": "60px",
"lineHeight": "60px",
"borderWidth": "1px",
"borderStyle": "dashed",
"borderRadius": "5px",
"textAlign": "center",
"margin": "10px"
},
multiple=False
),
html.Div(id="output-div"),
dcc.Graph(id="result-plot"),
html.Div(id="tests-summary-tables"),
html.Div(id="regression-summary-table")
])
# Define the callback function for file upload, regression, and tests
@app.callback(
[dash.dependencies.Output("output-div", "children"),
dash.dependencies.Output("result-plot", "figure"),
dash.dependencies.Output("tests-summary-tables", "children"),
dash.dependencies.Output("regression-summary-table", "children")],
[dash.dependencies.Input("upload-data", "contents"),
dash.dependencies.Input("upload-data", "filename")]
)
def perform_regression(contents, filename):
# Check if a file has been uploaded
if contents is not None:
# Read the uploaded file as a DataFrame
content_type, content_string = contents.split(",")
decoded_content = base64.b64decode(content_string)
try:
if "csv" in filename:
# Assuming the first row as headers and rest as Y values
data = pd.read_csv(io.StringIO(decoded_content.decode("utf-8")))
else:
return "Invalid file format. Please upload a CSV file.", {}, "", ""
except Exception as e:
return f"Error occurred while reading the file: {str(e)}", {}, "", ""
# Get the column names from the DataFrame
column_names = data.columns.tolist()
# Extract date/year, X, and Y column names
date_column = column_names[0]
x_column = column_names[1]
y_columns = column_names[2:]
# Convert the date/year column to datetime type
data[date_column] = pd.to_datetime(data[date_column])
# Add a constant column to the X variables for the intercept term
X = sm.tools.add_constant(data[x_column])
# Perform linear regression for each Y column
results = []
for y_column in y_columns:
model = sm.OLS(data[y_column], X)
result = model.fit()
results.append(result)
# Create the regression summary table
summary_tables = [result.summary2().tables[1] for result in results]
summary_df = pd.concat(summary_tables)
# Create the scatter plot of the data points and the regression lines
fig = px.scatter(data_frame=data, x=date_column, y=y_columns, title="Multiple Variable Linear Regression")
for result, y_column in zip(results, y_columns):
fig.add_scatter(x=data[date_column], y=result.fittedvalues, mode="lines", name=f"Regression Line ({y_column})")
# Perform tests
tests_output = []
for result, y_column in zip(results, y_columns):
# Test 1: Linear relationship between slope and intercept
slope, intercept = result.params[1], result.params[0]
tests_output.append(f"Test 1: Linear relationship between slope and intercept ({y_column})")
tests_output.append(f"Slope: {slope:.4f}")
tests_output.append(f"Intercept: {intercept:.4f}")
# Test 2: Independent variable is not random
x = data[x_column]
residuals = result.resid
pearson_r, pearson_p = stats.pearsonr(x, residuals)
tests_output.append("")
tests_output.append(f"Test 2: Pearson correlation coefficient between X and residuals ({y_column})")
tests_output.append(f"Pearson R: {pearson_r:.4f}")
tests_output.append(f"P-value: {pearson_p:.4f}")
tests_output.append("")
# Test 3: Residuals (errors) are zero
mean_residual = np.mean(residuals)
tests_output.append(f"Test 3: Mean of residuals ({y_column})")
tests_output.append(f"Mean: {mean_residual:.4f}")
# Test 4: Residuals (errors) are constant
_, jb_p = stats.jarque_bera(residuals)
tests_output.append(f"Test 4: Jarque-Bera test for residuals ({y_column})")
tests_output.append(f"P-value: {jb_p:.4f}")
# Test 5: Residuals (errors) are not correlated
_, ljung_box_p = diagnostic.acorr_ljungbox(residuals, lags=[1])
tests_output.append(f"Test 5: Ljung-Box test for residuals ({y_column})")
if isinstance(ljung_box_p[0], str):
tests_output.append(f"P-value: {ljung_box_p[0]}")
else:
tests_output.append(f"P-value: {float(ljung_box_p[0]):.4f}")
# Test 6: Residuals (errors) follow a normal distribution
_, normality_p = stats.normaltest(residuals)
tests_output.append(f"Test 6: Normality test for residuals ({y_column})")
tests_output.append(f"P-value: {normality_p:.4f}")
# Test 7: Stationary test
adf_result = adfuller(residuals)
tests_output.append(f"Test 7: Stationary test for residuals ({y_column})")
tests_output.append(f"ADF Statistic: {adf_result[0]:.4f}")
tests_output.append(f"P-value: {adf_result[1]:.4f}")
tests_output.append("")
# Check if residuals are stationary
if adf_result[1] > 0.05:
tests_output.append("Residuals are not stationary. Proceeding with differencing...")
# Differencing until residuals become stationary
stationary_res = residuals.diff().dropna()
differencing_steps = 1
while True:
adf_result = adfuller(stationary_res)
tests_output.append(f"Differencing Step {differencing_steps}")
tests_output.append(f"ADF Statistic: {adf_result[0]:.4f}")
tests_output.append(f"P-value: {adf_result[1]:.4f}")
tests_output.append("")
if adf_result[1] <= 0.05:
tests_output.append(f"Residuals became stationary after {differencing_steps} differencing steps.")
residuals = stationary_res
break
stationary_res = stationary_res.diff().dropna()
differencing_steps += 1
else:
tests_output.append("Residuals are already stationary.")
tests_output.append("")
# Test 8: Toda-Yamamoto Granger causality test
maxlag = int(np.ceil(12 * np.power(len(data) / 100.0, 1 / 4)))
granger_result = grangercausalitytests(data[[y_column, x_column]], maxlag=maxlag, verbose=False)
tests_output.append(f"Test 8: Toda-Yamamoto Granger causality test ({y_column} -> {x_column})")
tests_output.append("")
for lag in range(1, maxlag + 1):
tests_output.append(f"Lag {lag}")
tests_output.append(f"F-value: {granger_result[lag][0]['ssr_ftest'][0]:.4f}")
tests_output.append(f"P-value: {granger_result[lag][0]['ssr_ftest'][1]:.4f}")
tests_output.append("")
# Create separate tables for each test result
test_tables = []
for i in range(0, len(tests_output), 2):
if i + 1 < len(tests_output): # Check if index is within range
test_name = tests_output[i]
test_result = tests_output[i + 1]
test_table = pd.DataFrame({"Test": [test_name], "Output": [test_result]})
test_table = dash_table.DataTable(
columns=[{"name": col, "id": col} for col in test_table.columns],
data=test_table.to_dict("records"),
style_cell={"textAlign": "left"},
style_header={"fontWeight": "bold"},
)
test_tables.append(test_table)
test_tables.append(html.Hr())
# Create the regression summary table main
regression_summaries = []
for result, y_column in zip(results, y_columns):
summary_table_2 = result.summary().tables[0]
summary_df_2 = pd.DataFrame(summary_table_2.data[1:], columns=summary_table_2.data[0])
summary_df_2.columns = [f"{col} ({y_column})" for col in summary_df_2.columns]
regression_summaries.append(html.Div(f"Regression Summary ({y_column})"))
regression_summaries.append(dash_table.DataTable(
columns=[{"name": col, "id": col} for col in summary_df_2.columns],
data=summary_df_2.to_dict("records"),
style_cell={"textAlign": "left"},
style_header={"fontWeight": "bold"},
))
return [
dash_table.DataTable(
columns=[{"name": col, "id": col} for col in summary_df.columns],
data=summary_df.to_dict("records"),
style_cell={"textAlign": "left"},
style_header={"fontWeight": "bold"},
),
fig,
html.Div(regression_summaries),
html.Div(test_tables)
]
# If no file has been uploaded yet, return empty values
return "", {}, "", ""
if __name__ == "__main__":
app.run_server(debug=True)
Conclusion
In this article, we have explored how to perform multiple variable time-series regression and causality analysis using Python libraries such as Dash and Plotly. We have covered the steps of uploading data, conducting regression analysis, performing various tests, and visualizing the results. By leveraging these techniques, we can gain valuable insights into complex time-dependent data patterns and uncover causal relationships between variables.
Time-series regression and causality analysis find applications in various domains, including finance, economics, social sciences, and environmental studies. By understanding the relationships and causality between variables, we can make informed decisions, predict future trends, and optimize processes.
With the power of Python and the flexibility of Dash and Plotly, you can dive deeper into analyzing your own time-series data and uncover meaningful insights. Happy exploring and analyzing!
Please note that the content provided is for informational purposes only and should not be considered financial, legal, or professional advice.
You must log in to post a comment.