Many stock market analysts frequently make the claim that you should “sell in May and go away”. In this post, we will try to find out if there’s any truth to the idea behind this claim. Namely, that holding stocks over the summer period from May to October is historically detrimental to annual returns from a statistical perspective.

## Loading (and installing) packages

```
if(!require("pacman")) install.packages("pacman")
pacman::p_load(quantmod,ggplot2,plyr,lme4,lmerTest,effects,jtools,tidyverse)
```

## Customizing a ggplot theme

```
my_theme <- function() {
theme_apa(legend.pos = "none") +
theme(panel.background = element_blank(),
plot.background = element_rect(fill = "azure"),
panel.border = element_blank(), # facet border
strip.background = element_blank(), # facet title background
plot.margin = unit(c(.5, .5, .5, .5), "cm"))
}
```

## Loading data

The selection of data here is crucial and will obviously affect our analysis. In this case, we will use the adjusted close price of the S&P 500 from 1950 to 2018. This is easy to do thanks to the ‘quantmod’ package.

`getSymbols(c("^GSPC"),from = "1950-01-01",to = "2018-01-01",src = "yahoo",class = ts)`

`## [1] "GSPC"`

`dim(GSPC)`

`## [1] 17110 6`

`head(GSPC)`

```
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 1950-01-03 16.66 16.66 16.66 16.66 1260000
## 1950-01-04 16.85 16.85 16.85 16.85 1890000
## 1950-01-05 16.93 16.93 16.93 16.93 2550000
## 1950-01-06 16.98 16.98 16.98 16.98 2010000
## 1950-01-09 17.08 17.08 17.08 17.08 2520000
## 1950-01-10 17.03 17.03 17.03 17.03 2160000
## GSPC.Adjusted
## 1950-01-03 16.66
## 1950-01-04 16.85
## 1950-01-05 16.93
## 1950-01-06 16.98
## 1950-01-09 17.08
## 1950-01-10 17.03
```

```
ggplot(GSPC,aes(y = GSPC.Close, x = Index)) +
geom_line() +
my_theme()
```

## Average stock market return by month

First, let’s take a look at monthly returns since 1950. We’ll also add a line indicating the average return.

```
returnsByMonth <- periodReturn(GSPC, period = 'monthly', subset = '1950::')
returnsByMonth %>%
ggplot(aes(y = monthly.returns, x = Index)) +
geom_line() +
geom_hline(yintercept = mean(returnsByMonth), colour = "gold", size = 1, alpha = .5) +
my_theme()
```

This is not very helpful to us. Instead we’ll group our data by month and compare the average return of each from the twelve months. We create a data frame and include a factor variable with month names and summarize our data using ddply().

```
df <- as.data.frame(returnsByMonth)
#note that month.name and month.abb are constants built into base R
df$Month <- factor(rep(month.abb, length(returnsByMonth) / 12), levels = month.abb)
head(df)
```

```
## monthly.returns Month
## 1950-01-31 0.023409304 Jan
## 1950-02-28 0.009970675 Feb
## 1950-03-31 0.004065157 Mar
## 1950-04-28 0.038750605 Apr
## 1950-05-31 0.045657129 May
## 1950-06-30 -0.058040465 Jun
```

```
meanReturns <- ddply(df, "Month", summarize,
Mean = mean(monthly.returns),
SE = sd(monthly.returns) / sqrt(length(monthly.returns)))
meanReturns
```

```
## Month Mean SE
## 1 Jan 0.0096073603 0.005830234
## 2 Feb 0.0009844702 0.004263172
## 3 Mar 0.0123033210 0.004061560
## 4 Apr 0.0144877261 0.004491402
## 5 May 0.0024491246 0.004417714
## 6 Jun -0.0003782533 0.004159915
## 7 Jul 0.0104621815 0.004822198
## 8 Aug -0.0009131059 0.005537527
## 9 Sep -0.0047557068 0.005322655
## 10 Oct 0.0090777622 0.006681424
## 11 Nov 0.0156188470 0.005124048
## 12 Dec 0.0161459193 0.003725805
```

We can then plot our data (with error bars indicating standard errors of the means).

```
meanReturns %>%
ggplot(aes(x = Month, y = Mean)) +
geom_bar(stat = "identity",fill = "royalblue4") +
geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = .2,
position = position_dodge(.9)) +
ggtitle("S&P 500 monthly average returns from 1950 to 2017") +
my_theme()
```

Let’s instead visualize monthly returns with a boxplot. The dashed line is intended to make it a little easier to compare the medians to 0 on the y axis.

```
df %>%
ggplot(aes(x = Month, y = monthly.returns)) +
geom_jitter(alpha = .5, width = 0.25) +
geom_boxplot(fill = "royalblue4", alpha = .8) +
geom_hline(yintercept = 0, colour = "gold", linetype = "dashed", alpha = .5, size = 1) +
ggtitle("S&P 500 monthly average returns from 1950 to 2017") +
my_theme()
```

We can also perform a simple regression analysis to test whether average returns for individual months deviate significantly from 0.

```
fit <- lm(monthly.returns ~ 0 + Month, data = df)
summary(fit)
```

```
##
## Call:
## lm(formula = monthly.returns ~ 0 + Month, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.226708 -0.022898 0.001164 0.025511 0.153969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## MonthJan 0.0096074 0.0049387 1.945 0.05208 .
## MonthFeb 0.0009845 0.0049387 0.199 0.84205
## MonthMar 0.0123033 0.0049387 2.491 0.01293 *
## MonthApr 0.0144877 0.0049387 2.934 0.00345 **
## MonthMay 0.0024491 0.0049387 0.496 0.62009
## MonthJun -0.0003783 0.0049387 -0.077 0.93897
## MonthJul 0.0104622 0.0049387 2.118 0.03445 *
## MonthAug -0.0009131 0.0049387 -0.185 0.85336
## MonthSep -0.0047557 0.0049387 -0.963 0.33586
## MonthOct 0.0090778 0.0049387 1.838 0.06642 .
## MonthNov 0.0156188 0.0049387 3.163 0.00162 **
## MonthDec 0.0161459 0.0049387 3.269 0.00112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04073 on 804 degrees of freedom
## Multiple R-squared: 0.05679, Adjusted R-squared: 0.04271
## F-statistic: 4.034 on 12 and 804 DF, p-value: 4.144e-06
```

For the sake of illustration, we can then use the effects package to easily create a plot of the estimates, which of course are identical to the means plotted in the bar chart above.

`plot(allEffects(fit))`

Let’s make the plot a little prettier ourselves.

```
eff_month <- data.frame(allEffects(fit)$Month, order = seq(1:12))
eff_month %>%
ggplot(aes(x = reorder(Month, order), y = fit)) +
geom_point() +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
my_theme() +
labs(x = "month", y = "estimate")
```

We see that some of the monthly estimates do in fact deviate from 0, but only in the form of positive returns. We also observe that the strongest positive months appear to be in the winter periods (November-April). Furthermore, much in agreement with the “Sell in May” investment approach, the negative estimates all fall within summer periods.

Lets’s now look further into this by comparing average seasonal returns for summer and winter periods.

```
df <- df %>%
mutate(season = factor(ifelse(df$Month %in% month.abb[5:10],"Summer", "Winter")))
meanReturnsSeason = ddply(df, "season", summarize,
Mean = mean(monthly.returns),
sem = sd(monthly.returns) / sqrt(length(monthly.returns)))
meanReturnsSeason %>%
ggplot(aes(x=season, y=Mean)) +
geom_bar(stat="identity",fill="royalblue4") +
geom_errorbar(aes(ymin=Mean-sem, ymax=Mean+sem), width=.2,
position=position_dodge(.9)) +
ggtitle("S&P 500 average seasonal returns from 1950 to 2017") +
my_theme()
```

We see a clear difference between summer and winter return estimates. However, it is not obvious that the “Sell in May” strategy is better than holding stocks throughout the year.

As a final step in our analysis, we create two regression models: One in which we compare the season estimates to 0, and another that compares the two return estimates directly.

```
mod2 <- lm(monthly.returns~0+season,data=df)
summary(mod2)
```

```
##
## Call:
## lm(formula = monthly.returns ~ 0 + season, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.220287 -0.022604 0.000933 0.026255 0.160390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## seasonSummer 0.002657 0.002021 1.315 0.189
## seasonWinter 0.011525 0.002021 5.702 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04083 on 814 degrees of freedom
## Multiple R-squared: 0.04037, Adjusted R-squared: 0.03801
## F-statistic: 17.12 on 2 and 814 DF, p-value: 5.209e-08
```

The most interesting observation to be made here is perhaps that the estimated return for summer periods is not significantly different from 0. Let’s set winter as the reference level in the final model.

```
df$season <- relevel(df$season,ref = "Winter")
mod3 <- lm(monthly.returns~season,data=df)
summary(mod3)
```

```
##
## Call:
## lm(formula = monthly.returns ~ season, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.220287 -0.022604 0.000933 0.026255 0.160390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.011525 0.002021 5.702 1.66e-08 ***
## seasonSummer -0.008868 0.002858 -3.102 0.00199 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04083 on 814 degrees of freedom
## Multiple R-squared: 0.01169, Adjusted R-squared: 0.01047
## F-statistic: 9.625 on 1 and 814 DF, p-value: 0.001986
```

Not surprisingly, we see a clear significant difference between returns for the two seasons.
In sum, we see that the S&P 500 does in fact tend to rise between May and November, though significantly less than between November and April. In the future, I might take a look at other indices, compare data from different decades and redo this analysis with extreme outliers removed.

It is perhaps not self-evident that it is generally advisable to sell off stocks in May and wait until November before re-entering the stock market. Even so, it might still be a good way for risk-aversive investors to eliminate part of the risk involved in investing in the stock market.

I find financial markets fascinating, confusing and sometimes overwhelmingly complicated. There will always be a lot of talk, bold claims and noise generated by experts and analysts, who may not always be as insightful as they would like to think. Thankfully, data and statistics can help us in determining who’s right and who’s wrong amidst the constant influx of information.