Stylized Facts of Corn Futures

Corn futures are probably the most liquid agricultural commodity futures worldwide, and are used by producers, consumers and speculators alike. This blog post looks for the typical stylized facts that are often found in financial time-series.

Futures – Contract Details

Corn futures are traded on the Chicago Board of Trade (CBOT), now part of the CME Group. The CBOT lists corn futures with five expiry months: CH (March), CK (May), CN (July), CU (September), and CZ (December). Buying one futures contract obliges the buyer to take delivery of 5,000 bushels of corn at the expiry of the contract at one of the many designated elevators in the US Midwest.
Usually the front-month contract is the most liquidly traded, and the December contract CZ often serves as the main corn price index, being the first contract in which the bulk of the new-crop corn can be delivered. The full contract specifications of CBOT corn futures are available on the CME website.

Price Levels

Before diving into stylized facts of corn prices we’ll visualize the raw price series.

Daily close of the December Corn contracts since 1988. The price series starts from the day of listing on the exchange until the first notice day.

Each color represents one December futures contract. A few things jump out immediately: corn traded consistently between 200 cts/bu and 400 cts/bu from 1988 until 2006. From 2007 to 2014 prices exploded, trading anywhere between 350 and 800 cts/bu. From 2015 onward prices settled lower again, but still above pre-2006 levels. The large price increases and high volatility between 2007 and 2014 are usually attributed to a combination of adverse weather conditions, ethanol fuel mandates, higher oil prices, the financial crisis, and generally higher investor interest in commodities.

However, there is a caveat: comparing historical with current price levels can be misleading due to inflation over the same period. The chart below inflates historical prices to July 2020 equivalents using the US CPI (CPIAUCSL):

The picture looks quite different with inflation-adjusted prices. While nominal prices were stable between 1988 and 2007, real corn prices continuously lost value over the same period, reaching their lows between 2003 and 2005. Current price levels are above pre-crisis levels.

Throughout the rest of the blog-post we’ll use the nearby December contract as a proxy for the corn price, and roll to the next year’s contract the day before the first notice day. All the analysis involving returns excludes data from the first notice days to avoid spurious roll returns.
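As an illustration of this rolling logic, here is a minimal pandas sketch. The function name and input format are my own stand-ins, not the code used for the charts: it assumes one close-price Series per December contract plus that contract's first notice date.

```python
import pandas as pd

def build_nearby_series(contracts, first_notice):
    """Stitch per-contract close Series into one nearby series.

    contracts: dict year -> pd.Series of daily closes (DatetimeIndex)
    first_notice: dict year -> first notice date of that contract
    Rolls to the next December contract the day before first notice.
    """
    pieces, prev_cutoff = [], pd.Timestamp.min
    for year in sorted(contracts):
        closes = contracts[year]
        cutoff = pd.Timestamp(first_notice[year]) - pd.Timedelta(days=1)
        # keep each contract from the previous roll date up to its own roll date
        pieces.append(closes.loc[(closes.index > prev_cutoff) & (closes.index <= cutoff)])
        prev_cutoff = cutoff
    return pd.concat(pieces)
```

When computing returns from such a stitched series, the roll days themselves should be dropped, as described above, to avoid spurious jumps between contracts.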

Stylized Fact: Non-normality of Returns

It is well known that most financial returns are not normally distributed – does the same hold true for Corn?

Yes. Nominal (non-inflation-adjusted) daily log-returns of the nearby December corn future follow a heavy-tailed distribution. The fat tails are clearly identifiable from the positive excess kurtosis and the shape of the QQ-plot.

The log-returns of the nearby December corn contract follow a fat-tailed distribution as indicated by the QQ-plot on the right. Sample period: 1988 – 2020.
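The check itself takes only a few lines. The sketch below (function name and test data are illustrative) computes daily log-returns, their excess kurtosis, and a Jarque-Bera normality test; scipy.stats.probplot would produce the points for the QQ-plot.

```python
import numpy as np
from scipy import stats

def tail_diagnostics(prices):
    """Daily log-returns plus two quick non-normality indicators."""
    prices = np.asarray(prices, dtype=float)
    log_ret = np.diff(np.log(prices))
    excess_kurt = stats.kurtosis(log_ret)        # 0 for a normal distribution
    _, jb_pvalue = stats.jarque_bera(log_ret)    # small p-value -> reject normality
    return log_ret, excess_kurt, jb_pvalue
```

A clearly positive excess kurtosis together with a tiny Jarque-Bera p-value is the numerical counterpart of the curved QQ-plot shown above.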

Stylized Fact: Volatility Clustering

Another often-cited observation in financial time-series is volatility clustering, which describes the fact that “large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes”, as stated by Mandelbrot [3]. Corn is no exception, exhibiting both the typical slowly decaying auto-correlation structure of the squared log-returns and the absence of auto-correlation in the (linear) log-returns.

PACF of squared daily log-returns of the nearby December corn futures contract, 1988 – 2020.

Volatility clustering is also visually observable by looking at the time-series of the daily log-returns:

Daily log-returns since 1988. Volatility clusters are clearly identifiable (two of them highlighted in red).

Stylized Fact: Seasonality of return distribution

Annual seasonality is not typically a feature of financial time-series, but it is often present in agricultural commodity prices. Corn is a crop with a yearly growing and harvesting cycle, which clearly impacts the return distribution. The chart below visualizes the variance of the daily log-returns and shows a yearly seasonality. This also partially explains the strong volatility clustering, even though seasonality is generally not required to produce that effect.

Average 20-(business) day variance of the log-returns of the nearby December Corn contract from 1988 to 2020.
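The statistic behind the chart above can be approximated with a short pandas sketch, assuming log_ret is a Series of daily log-returns indexed by date (the helper name is illustrative):

```python
import numpy as np
import pandas as pd

def seasonal_variance(log_ret, window=20):
    """Rolling 20-day variance of log-returns, averaged by calendar month."""
    rolling_var = log_ret.rolling(window).var()
    # average the rolling variance over all years, per calendar month
    return rolling_var.groupby(rolling_var.index.month).mean()
```

For corn this grouping produces the summer peak discussed below; for a series without seasonality the twelve monthly averages would be roughly equal.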

Checking the return distribution by calendar month further confirms the seasonal pattern: the return distribution during the (US) summer months is much wider than during the winter months. The fundamental reason is the growing cycle of US corn, which is planted between April and June and usually harvested from September to November. While the overall supply uncertainty is largest before any corn is planted, the daily volatility peaks during the summer months, when weather is critical for corn development and determines the final production size [2].

Daily log-return distribution during different calendar months for the nearby Corn December contract. Sample period: 1988 – 2020.
Yearly seasonality of the squared log-returns of the nearby December corn contract. The magnitude of the daily return peaks during summer months, when Corn in the United States is in the critical stages of its development.

Summary

Corn futures exhibit the typical stylized facts of financial time-series, such as volatility clustering and non-normal, fat-tailed returns. In addition to these traditional stylized facts, corn also exhibits a strong seasonality of return magnitudes (or volatility) due to the inherent annual growing cycle of the corn plant.

Forecasting Swiss Electricity Demand using Prophet

In this blog post, we’ll use Facebook’s Prophet library to analyze and forecast Switzerland’s electricity demand.

Since Prophet was released to the public in 2017, I’ve wanted to test it on a dataset where it can show its full power, including multiple seasonalities at different time-scales, trend, and business-day logic. I recently stumbled upon freely available electricity data from the Swiss national grid company SwissGrid and decided to have a quick go with Prophet.

Getting the data

The data is freely available on SwissGrid’s website here, and contains various data points regarding electricity production, consumption, imports and exports since 2009. In this example we’ll look at the “Total energy consumed by end users in the Swiss controlblock” on 15-minute intervals. The data set is split across several Excel files, one per calendar year, so we’ll first use Python’s requests library to download all files, extract the relevant time-series, and write the combined data to a single csv file. The full code is available on my gitlab here.

import requests
import pandas as pd

BASE_URL = 'https://www.swissgrid.ch/dam/dataimport/energy-statistic/'
AVAILABLE_YEARS = range(2009, 2020)
SHEET_NAME = "Zeitreihen0h15"
COLUMN_NAME = "Summe endverbrauchte Energie Regelblock Schweiz\nTotal energy consumed by end users in the Swiss controlblock"

# Download all available files
for year in AVAILABLE_YEARS:
    filename = f'EnergieUebersichtCH-{year}.xls'
    url = f'{BASE_URL}{filename}'
    web_file = requests.get(url)
    with open(filename, 'wb') as local_file:
        local_file.write(web_file.content)

# Extract the relevant column from each Excel sheet and write all years to a single csv file
series = []
for year in AVAILABLE_YEARS:
    print(f'Processing {year}...')
    filename = f'EnergieUebersichtCH-{year}.xls'
    df_single_year = pd.read_excel(filename, sheet_name=SHEET_NAME, header=0)
    df_single_year = df_single_year[COLUMN_NAME][1:]  # skip the first row after the column names
    series.append(df_single_year)
series = pd.concat(series)
series = series.rename('demand')
series.to_csv('demand.csv', index=True, header=True)

Plotting the data

Once we have the data in a nice format, we can start our exploratory analysis. Using pandas’ describe() method, we see that our series consists of roughly 380’000 rows, where each row corresponds to the energy consumption during one 15-minute interval.

The raw data on 15-minute intervals.

The data shows some yearly regularity, but it’s hard to tell from the plot above – it’s just too much data! Let’s re-sample the series to a daily frequency and re-plot:

df_daily = df.resample('D', closed='right').sum()
Daily re-sampled data. The yearly seasonality is visible more easily.

There is clearly a seasonal variation: Energy consumption peaks during winter and is lower during summer. The overall trend seems to be relatively stable. Let’s see what Prophet can do with this data.

Going forward we’ll use the daily series and neglect the 15-minute interval data – working with 380’000 rows was too much for my home computer. The daily series contains only about 4’000 data points.

Forecasting using Prophet

Prophet requires the data to be in a dataframe consisting of one column for the timestamps called ds, and one column for the data called y. Prophet promises to make time-series analysis easy – let’s see what it can do on our daily-sampled time-series:

m = Prophet()
m.fit(df_daily)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)

Four lines of code to fit the model and forecast the next year! By default, Prophet uses an additive model, meaning that it models the time series as the sum of various components:

y(t) = trend(t) + seasonality(t) + error(t)

Let’s first look at the time-series components that Prophet detected and visualize them. Prophet makes this effortless:

m.plot_components(forecast)
The components of our time-series model as detected by Prophet.

The first subplot shows the (piecewise linear) trend that Prophet detected. The second and third charts show the impact of the seasonalities: Prophet fit both a yearly seasonality – with lower consumption in summer than in winter – and a weekly seasonality, where Saturday and Sunday see lower demand than business days.

Let’s see what our forecast looks like. Prophet provides an easy way to visualize it:

m.plot(forecast)
Visualization of the fitted model and the real data. The black dots are the actual data, the dark blue line is the forecast and the light blue area represents the 80% prediction interval around the best estimate.

While the results look quite satisfactory visually, it’s always better to compute error metrics to properly quantify the accuracy. Prophet has support for time-series cross-validation: here we’ll start with an initial training period of 5 years and make a one-year forecast every 90 days thereafter.

from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(m, initial=f'{5*365} days', period='90 days', horizon='365 days')
df_metrics = performance_metrics(df_cv)

The chart below shows the mean absolute percentage error of our model for forecasts up to one year ahead:

The mean absolute percentage error (MAPE) for a one-year forecast is consistently below 5% and does not grow with the forecast horizon. However, there are quite a few outliers where the forecast error reaches up to 30%.
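If you prefer the raw numbers behind such a chart, a horizon-bucketed MAPE can be computed directly from the cross-validation frame. Prophet’s df_cv contains the columns ds, cutoff, y and yhat; the helper below is my own sketch, not part of Prophet:

```python
import pandas as pd

def mape_by_horizon(df_cv):
    """MAPE bucketed by forecast horizon in days.

    Expects a Prophet-style cross-validation frame with the columns
    ds (forecast date), cutoff (end of training data), y and yhat.
    """
    df = df_cv.copy()
    df['horizon'] = (df['ds'] - df['cutoff']).dt.days
    df['ape'] = (df['y'] - df['yhat']).abs() / df['y'].abs()
    return df.groupby('horizon')['ape'].mean()
```

The resulting Series is indexed by days-ahead, so a flat profile over the index confirms that the error does not grow with the forecast horizon.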

While the overall performance of the model is very good and the mean prediction error is below 5%, there are a few outliers where our forecast is off by more than 20%. Can we do better without too much effort? Yes! Given the strong dependence of energy consumption on weekends versus weekdays, it’s reasonable to assume that at least some of the days where the model performed poorly were public holidays, which should be treated similarly to weekends but weren’t. Luckily, it’s easy to let Prophet incorporate the relevant holidays into the model:

m.add_country_holidays(country_name='Switzerland')  # note: must be called before m.fit()

The new component plot now includes the impact of holidays on the energy consumption:

Component plot of the new model, including the impact of holidays.

The second subplot shows the impact of holidays on our model: as expected, they lead to lower energy consumption. Incorporating holidays should have made our model more accurate; we can check this by re-running the cross-validation and computing the prediction errors:

Mean absolute percentage error (MAPE) of the improved model that includes holidays.

It worked! The frequency of errors above 20% has decreased significantly. The remaining seasonality in the MAPE suggests there is another effect that our model does not capture, but for now we’ll leave it at this and show the prediction of our improved model:

Forecast of the improved model including Swiss holidays. The black dots are the actual data, the dark blue line is the forecast and the light blue area represents the 80% prediction interval around the best estimate.

Conclusion

We’ve successfully used Prophet to forecast a complex time-series and improved it by adding public holidays to the model. Prophet is very user-friendly, default parameters seemed reasonable (at least in this specific example), and it makes time-series analysis accessible to most developers. I’ll definitely include it in my tool set going forward.

Forest Fires and Percolation

The recent bushfire activity in Australia reminded me of an interesting problem in statistical physics, and a model that somewhat resembles the spread of a forest fire.

Percolation

Percolation describes the process of fluid movement through a porous medium. Besides their direct application in fluid dynamics and material science, percolation models have been used to study various phenomena across a diverse set of applications, ranging from epidemiology and financial markets to forest fires.

Percolation models are relatively easy to define and construct, but show interesting behaviour linked to phase changes, like fractal dimensions, scaling behaviour and power laws.

A simple forest fire model

Let’s assume we have a square lattice of side length n. Every site of the lattice is either occupied by a tree with probability p, or empty with probability (1-p). Assume the trees in the top row of the lattice are set on fire, and that the fire spreads to directly neighbouring trees over time. What’s the probability that the fire reaches the bottom of the lattice, given enough time? (The academic term for this model is “site percolation on the square lattice”.)

Intuitively, the probability depends on the number of trees and how they are clustered across the lattice. If there were no trees, the probability would be 0; if every site were occupied by a tree, the probability would be 1. How does this probability behave for p between these two extremes?

Setup of the site percolation (forest fire) problem for n=100 and p=0.5, where p is the probability of a site being occupied with a tree. The trees in the top row are set on fire – what’s the probability that the fire will reach the bottom of the lattice? Trees are shown in green, empty land is white, and burning trees are red.

To solve this problem, we’ll use a simulation approach: simulate the system many times for values of p between 0 and 1, and estimate the probability that the fire reaches the bottom row for each p from sample statistics. For the simulation we’ll set the lattice size n to 400 (the bigger the better).

An algorithm to solve this model closely resembles the spreading dynamics of a forest fire:

  1. Initialize the lattice, planting a tree on each site with probability p.
  2. Set the top row of trees on fire.
  3. In each time step, set all trees adjacent to burning trees on fire, and mark the previously burning trees as burnt.
  4. Continue until no more trees are burning or the fire reaches the bottom row.

The lattice is traversed in a manner very similar to breadth-first search in graph theory.
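The steps above can be sketched in pure Python as follows. This is a minimal illustration of the breadth-first spreading, not the (Numba-accelerated) implementation from the repository; the function names are mine:

```python
import numpy as np
from collections import deque

def fire_reaches_bottom(n, p, rng):
    """One realization of site percolation on an n x n lattice.

    Each site carries a tree with probability p; the top row is ignited
    and the fire spreads breadth-first to the 4 nearest neighbours.
    Returns True if the fire reaches the bottom row.
    """
    tree = rng.random((n, n)) < p
    burnt = np.zeros((n, n), dtype=bool)
    burning = deque()
    for j in range(n):          # ignite the top row
        if tree[0, j]:
            burnt[0, j] = True
            burning.append((0, j))
    while burning:
        i, j = burning.popleft()
        if i == n - 1:          # fire spanned the lattice
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and tree[ni, nj] and not burnt[ni, nj]:
                burnt[ni, nj] = True
                burning.append((ni, nj))
    return False

def spanning_probability(n, p, trials, seed=0):
    """Fraction of simulations in which the fire spans the lattice."""
    rng = np.random.default_rng(seed)
    return sum(fire_reaches_bottom(n, p, rng) for _ in range(trials)) / trials
```

Sweeping spanning_probability over a grid of p values reproduces the sharp transition discussed in the results below.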

Animation of the algorithm for n=100 and p=0.5. Each timestep, all trees (green) that are neighbours of burning trees (red) are set on fire, and burning trees are set to burnt trees (black). In this case, the fire does not spread to the bottom end of the lattice.

Before looking at the solution, take a guess how the probability of the fire spreading to the bottom row depends on the value of p. Are you expecting a linear increase of the probability with increasing p?

Result

The probability that the fire reaches the bottom of the lattice depends highly non-linearly on p (the probability that a site is occupied by a tree). The chart above shows the results for a 400×400 lattice. Notice how the confidence intervals are extremely small everywhere except around p=0.6.
Zoomed in to the region between p=0.55 and p=0.62.

The probability of the fire reaching the bottom end of the lattice is zero below p=0.5, jumps to one above p=0.6, and the sample variances are tiny. Highly suspicious! In fact, if we could simulate the system on an infinitely large lattice, we would see the probability of the fire reaching the bottom row jump from 0 to 1 at approximately p=0.59. This discontinuity is smeared out in our results due to the finite lattice size used in the simulation. It turns out that 0.59 is a so-called critical point of the system, p_c, and all kinds of interesting things happen at that point:

  • The probability of a spanning cluster (a connected cluster connecting the top and bottom, i.e. the probability that the fire reaches the bottom) jumps from 0 to 1.
  • The shortest path from top to bottom reaches its maximal length.
  • The spanning cluster has a fractal dimension.
  • The probability of a site of the lattice being part of the spanning cluster can be described by a power law when p is close to p_c.
  • The cluster size distribution (number of independent clusters in the lattice with a given size) follows an inverse power law.

Effectively, the system is undergoing a phase transition at p_c! We can visualize the forest fire dynamics at the critical probability p_c and slightly above p_c, and see that the dynamics look very different:

Forest fire algorithm at p=p_c for N=200. There remain large patches of unburnt trees, and the fire spreads very slowly.
The model at p = 0.61, just slightly above the critical probability. The dynamics look very different from the animation above: a much larger share of the trees burns, and the fire spreads faster.
The length of the shortest path as a function of the parameter p. It reaches its maximum around p_c, and then decays towards the lattice length of 400.

I’m always amazed how such simple systems can lead to such complex dynamics.

Implementation side note

The code is available on GitLab here. My previous implementation was in C++, but this time I wanted a pure, readable Python implementation of the algorithm. The drawback of this approach is speed – the code repeatedly loops over all cells using standard for-loops, which is notoriously inefficient in Python. To speed it up, I used Numba’s JIT compilation, which made the code more than 100 times faster. It would be interesting to compare a Cython version, a vectorized (numpy) version, the Numba version, and raw C++ code… in a future blog post.

Trending time-series: US Corn Yield

The United States is the biggest corn producer in the world, producing over 350 million metric tons per year in recent years. US corn yields, i.e. the amount of corn harvested per acre, increased considerably throughout the 20th century, playing the major part in these stunning production figures. The main drivers of the yield increase are the increased use of targeted fertilizers and pesticides, better seeds, and better farming technology.

Source: USDA. US corn yields are historically measured in bushels per acre [bu/ac]. One bushel of corn is equivalent to roughly 25 kg.

This blog post will analyze the time-series of yearly US corn yields, and build various models to predict next year’s yield based purely on historical yield data. This post will be based on simple regression techniques, and a future blog post will analyze whether a classical time-series approach has any benefits over simple regression. All code used in this blog post is available on my GitLab account here.

US corn yields

US corn yields are probably the most studied, estimated and followed numbers in the agricultural market. Market participants use them, together with planted and harvested area, in their US corn production estimates, since US corn supply is one major factor impacting corn prices. It is therefore useful to have a good estimate of US corn yields before the actual harvest results come in. In this series of blog posts we try to estimate US corn yields as of now (December 2019), long before the 2020 corn is planted (spring 2020), let alone harvested, using only historical yield data and some domain knowledge. Besides providing an early estimate of corn yields, the estimated trend can also serve as a basic building block for more advanced yield models, which usually require detrended yields as an input.

Historical US corn yields are provided by the US Department of Agriculture and freely available on their website (https://quickstats.nass.usda.gov/).

Linear Regression

A linear regression of yield against time using ordinary least squares is a simple first approach, provided we ignore data before 1960. The linear trend explains most of the yield variation we observe, but it will of course not detect a change in trend like the one that occurred around 1940. Nevertheless, it serves as a first method to predict next year’s yield.

Linear fit of yield vs year since 1960.
Residual distribution of the linear model. Normalizing the residuals by the actual yield to account for the increasing trend leads to the same shape of the distribution.
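The fit behind the charts above is a two-liner with numpy; the helper below is an illustrative sketch assuming arrays of years and yields:

```python
import numpy as np

def fit_trend(years, yields):
    """OLS linear trend of yield vs year; returns slope, intercept, residuals."""
    slope, intercept = np.polyfit(years, yields, deg=1)
    residuals = yields - (slope * years + intercept)
    return slope, intercept, residuals
```

The residuals array is what the histogram above is built from, and dividing it by the actual yields gives the normalized version mentioned in the caption.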

Outliers

Looking at the usual regression diagnostics, we see that the residuals are not normally distributed, but follow a left-skewed distribution. The time-series contains a few outliers, where yields were much lower than what we would have expected based on the trend alone.

To explain this asymmetry, it is important to understand what the main driver for the yearly variation besides trend in yields is: the weather. Corn needs water, sun and suitable temperatures to grow, and a lack or too much of any of these will negatively impact the yield. In the United States corn is planted in spring and harvested in late autumn, and is especially susceptible to adverse weather conditions during spring and summer.

Knowing this, it intuitively makes sense that the risk of an extremely bad yield is higher than the risk of an extremely good yield: Perfect conditions can only do so much to make corn grow better and increase yield, whereas a lack of water or heat can easily damage the yield’s potential. It is therefore not surprising that the years with the biggest deviations from trend yield in the last 40 years were 2012, 1993, 1988, and 1983: 2012, 1988 and 1983 were years with severe droughts and heat throughout the midwest, while in 1993 the midwest was plagued by a flood.

What does this mean for our linear trend model?
We should not ignore the outliers – extremely adverse weather is possible and should not be neglected. However, outliers shouldn’t drive our trend estimate: they are by construction not part of the general trend, but occur due to rare but severe weather events, and would have to be modeled separately with a different dynamic.

Dealing with outliers

There are multiple ways of dealing with outliers, and we’ll use two of them here:

  • Robust Estimators: Estimators that are robust against outliers. Here we’ll use the Theil-Sen estimator available in scipy.
  • Removing outliers: While it is generally not advisable to remove outliers, we can justify it here by arguing that years with severe drought/flood follow a completely different dynamic and distribution. This is equivalent to modelling the yield conditional on not being in a severe adverse-weather year. We classify a data point as an outlier if its residual from the linear regression fit is larger than 20% of the actual yield, a threshold inspired by the relative residual histogram.

This leaves us with three models:
– Model 1: Standard linear regression (OLS)
– Model 2: Theil-Sen
– Model 3: OLS with outliers removed

Comparison of regression lines of standard ordinary least squares (OLS), the robust Theil-Sen estimator, and an OLS performed after removing outliers (points where the deviation from the standard OLS trendline is > 20%). The Theil-Sen estimator estimates a slightly steeper slope, while omitting the outliers increases the intercept. Overall, the differences are relatively small.
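Both estimators are readily available: numpy.polyfit for OLS and scipy.stats.theilslopes for Theil-Sen. A sketch of the comparison (helper name and test data are illustrative):

```python
import numpy as np
from scipy import stats

def compare_trends(years, yields):
    """Trend slope from OLS vs the outlier-robust Theil-Sen estimator."""
    ols_slope, _ = np.polyfit(years, yields, deg=1)
    # Theil-Sen: median of all pairwise slopes, insensitive to a few outliers
    ts_slope, ts_intercept, _, _ = stats.theilslopes(yields, years)
    return ols_slope, ts_slope
```

A single drought-style outlier visibly pulls the OLS slope while leaving the Theil-Sen slope essentially unchanged, which is exactly the behaviour the chart above illustrates.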

Model performance, model selection and predicting 2020

To select the best of the above three models, we’ll look at their historical one-year-ahead prediction performance, making sure to fit them using only backwards-looking data. We’ll use an expanding window approach, increasing the data available for fitting each year. We start reporting the metrics from 1981 in order to have at least 20 data points to estimate our models.

RMSE and MAE for the three models on the full 1960-2019 dataset using an expanding window. The reported values for year X are the MAE/RMSE over the period 1981-(X-1). The analysis starts in 1981 so that at least 20 data points are available for fitting.
The error estimates shrink with increasing sample size, reaching a low of around 9 bu/ac for the MAE once the sample size exceeds 50 (in 2010).
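The expanding-window evaluation can be sketched as follows; fit_fn stands for any of the three models, and the helper is my own illustration rather than the repository code:

```python
import numpy as np

def expanding_window_errors(years, yields, fit_fn, start_year=1981):
    """One-year-ahead prediction errors with an expanding training window.

    fit_fn(x, y) must return (slope, intercept); for each target year
    the model is fit only on data strictly before that year.
    """
    errors = []
    for target in range(start_year, int(years[-1]) + 1):
        train = years < target
        slope, intercept = fit_fn(years[train], yields[train])
        prediction = slope * target + intercept
        actual = yields[years == target][0]
        errors.append(actual - prediction)
    errors = np.asarray(errors)
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    return mae, rmse
```

Passing an OLS, Theil-Sen, or outlier-filtered fit function produces the per-model MAE/RMSE series compared below.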

Looking at the results above, we see that ordinary least squares (OLS) consistently has the lowest root mean square error (RMSE). This is expected, since OLS is by construction the estimator that minimizes the RMSE. The estimator with the lowest mean absolute error (MAE) is Theil-Sen. Again, this is somewhat expected: Theil-Sen is robust against outliers, and the MAE is less sensitive to outliers than the RMSE. The OLS model with outliers removed usually performs worst, since it was explicitly built to estimate the trend excluding outlier years but is evaluated on the full dataset.

We can re-run the same analysis, but exclude the years classified as outliers by model 3 when computing the error metrics. We then see that the Theil-Sen estimator is the best at estimating the outlier-free trend, beating the simple approach of removing outliers.

RMSE and MAE for the three models, computed only on years not classified as outliers by model 3, on the 1960-2019 dataset using an expanding window. The reported values for year X are the MAE/RMSE over the period 1981-(X-1). In this specific example, data points deemed outliers in a given year remain outliers in all later years; in general this need not be true. The outliers are 1974, 1983, 1988, 1993, and 2012.
It is interesting to note that the Theil-Sen estimator always outperforms the remove-outlier model.

Now where does that leave us in selecting the best model for predicting the 2020 trend yield? I would argue that the Theil-Sen estimator gives the best estimate of next year’s trend yield, from which we can refine our estimate as the weather develops throughout the growing season.
If we have to estimate next year’s yield for risk-management purposes (such as insurance or corn futures trading), a trend yield estimate is not enough: we have to add a downward bias to the trend yield to account for the risk of extreme weather conditions. One way to do so is to start the 2020 estimate with the OLS model, and switch to the Theil-Sen model once the risk of a major drought/flood has become negligible.

  • OLS trend-yield prediction for 2020: 173 bu/ac
  • Theil-Sen trend-yield prediction for 2020: 175 bu/ac

Drawbacks of the linear regression models

The models above are all based on linear regression – which might not be the best idea when modelling a time-series. The main drawbacks in the US corn yield example are:

  • We will not be able to detect a change in trend. A shorter backward-looking regression window is a partial remedy, but is more susceptible to outliers.
  • We do not properly model the dynamics of a time-series and have neglected any potential auto-correlations.
  • We cannot explain the full time series, but only the current trend existing since the 1960s.

Throughout this post it was assumed that the 2019 corn yields are known, even though they are still subject to change by the USDA.

