Some harmless data-mining: Testing individual words in EDGAR filings

Everyone knows about the perils of data-mining and multiple testing, so don’t take this post too seriously. I recently built an inverted index of all 11 million regulatory filings disseminated online by the SEC. This means that for each string of three or more letters, I have a list of all documents that contain it. I did this to facilitate full-text search. But now that I have it, I decided to run each word through a simple backtest. The idea is to test whether the presence of a single word in a filing can predict future stock returns.
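For readers who haven’t built one, here is a minimal sketch of an inverted index in Python. It is not the code behind this post. The tokenizer (lowercased runs of three or more ASCII letters), the function name `build_inverted_index`, and the toy filings are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a "string of three or more letters" is a lowercased run of
# ASCII letters. The real tokenizer may differ.
TOKEN_RE = re.compile(r"[a-z]{3,}")

def build_inverted_index(filings):
    """Map each token to the sorted list of filing ids that contain it."""
    index = defaultdict(set)
    for filing_id, text in filings.items():
        # set() so a word repeated within one filing is recorded once
        for token in set(TOKEN_RE.findall(text.lower())):
            index[token].add(filing_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Toy filings, invented for this example
filings = {
    "f1": "The company sells shampoos and toothpastes.",
    "f2": "Overdrawn accounts; the company sells tortilla chips.",
}
index = build_inverted_index(filings)
print(index["company"])   # filings containing "company"
print(index["tortilla"])  # filings containing "tortilla"
```

At 11 million filings the index would not fit in a plain dictionary like this, so a real version would need on-disk storage, but the lookup idea is the same.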

Don’t look at the results yet! The benefit of using an interpretable model (in this case a single word) is that it allows a person to use domain expertise to vet the model and conduct a sanity check. You already know data-mining is dangerous and that most of the results are going to be garbage. Your best bet is to discard the nonsensical words before you’ve seen the out-of-sample results, which may bias your judgement. Look at the list of words at the end of this post and decide which you think could have predictive power and which direction you would bet. Then click on the words you chose to see the backtest results. If you need more context, search for filings that contain the word using the SEC’s EDGAR website. A better method of finding strategies would be to think of the very best words and phrases without the help of any data-mining. Then the entire backtest (and not just the two held-out years) would be considered out-of-sample validation. But sometimes it’s nice to have suggestions provided by the machine. I apologize if this post offends David Bailey and Marcos López de Prado.

Backtest Details

The backtests are run using daily data from CRSP. The data is adjusted for splits and dividends, is free of survivorship bias, and includes the best bid and offer at the close of each day. I have the timestamp for each filing, and I also know what time the market closed each day. I buy (or short) the corresponding stock at the first closing price possible. For example, a filing that comes in after hours on Friday afternoon will be traded at the close on Monday. I evaluate going both long and short, and I try three different hold times: 10, 90, and 252 days.
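The entry rule can be sketched as a lookup into a sorted list of market close times. This is only an illustration: the function name `first_tradable_close` is made up, timestamps are assumed to share one time zone, and a filing stamped exactly at the close is assumed to miss that close.

```python
from bisect import bisect_right
from datetime import datetime

def first_tradable_close(filing_time, market_closes):
    """Return the first market close strictly after the filing timestamp.

    market_closes must be sorted in ascending order. Returns None when the
    filing arrives after the last known close.
    """
    i = bisect_right(market_closes, filing_time)
    if i == len(market_closes):
        return None
    return market_closes[i]

# 2013-03-01 was a Friday and 2013-03-04 the following Monday
market_closes = [
    datetime(2013, 3, 1, 16, 0),
    datetime(2013, 3, 4, 16, 0),
]

after_hours = datetime(2013, 3, 1, 17, 30)  # Friday, after the close
intraday = datetime(2013, 3, 1, 10, 0)      # Friday, before the close

print(first_tradable_close(after_hours, market_closes))  # Monday's close
print(first_tradable_close(intraday, market_closes))     # Friday's close
```

Using actual close times rather than a fixed 4:00 pm handles early-close days without special cases.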

I charge half the bid-ask spread when a position is opened, and half when it is closed. I also charge $0.005 per share to trade, since that is what my broker charges. I assume the size traded is small enough to have no market impact. I have not accounted for borrowing costs, since I don’t have the data historically.
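To make the cost model concrete, here is a small worked sketch for a single share. The function name `net_return` is hypothetical, and measuring prices at the quote midpoint and dividing by the entry midpoint are assumptions. Borrowing costs are left out, as described above.

```python
COMMISSION_PER_SHARE = 0.005  # dollars, charged at open and again at close

def net_return(entry_bid, entry_ask, exit_bid, exit_ask, side):
    """Net return on one share after spread and commission.

    side is +1 for a long position and -1 for a short position.
    Assumption: prices are measured at the quote midpoint, and half the
    quoted spread is paid on each of the two trades.
    """
    entry_mid = (entry_bid + entry_ask) / 2
    exit_mid = (exit_bid + exit_ask) / 2
    gross = side * (exit_mid - entry_mid)
    spread_cost = (entry_ask - entry_bid) / 2 + (exit_ask - exit_bid) / 2
    commission = 2 * COMMISSION_PER_SHARE
    return (gross - spread_cost - commission) / entry_mid

# Stock moves from a 10.00 mid to a 10.50 mid with a 2-cent spread:
# long:  (0.50 - 0.02 - 0.01) / 10.00
# short: (-0.50 - 0.02 - 0.01) / 10.00
long_ret = net_return(9.99, 10.01, 10.49, 10.51, side=1)
short_ret = net_return(9.99, 10.01, 10.49, 10.51, side=-1)
print(round(long_ret, 4), round(short_ret, 4))
```

Note that the costs hurt both directions, so the long and short returns are not mirror images of each other.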

The portfolio is equal-weighted. I hedge it with SPY, using a hedge ratio of 1 for simplicity. I don’t currently charge transaction costs for changes to the hedge position, but I’ll add them at some point.
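One way to read the hedge: on each day, take the average return of the open positions and subtract the SPY return times the hedge ratio. The sketch below assumes exactly that, with a hypothetical function name and made-up numbers. It ignores hedge transaction costs, as noted above.

```python
def hedged_return(position_returns, spy_return, side=1, hedge_ratio=1.0):
    """One-day return of an equal-weighted portfolio hedged with SPY.

    side is +1 for a long portfolio (hedged by shorting SPY) and -1 for a
    short portfolio (hedged by buying SPY). Returns 0.0 with no positions.
    """
    if not position_returns:
        return 0.0
    equal_weighted = sum(position_returns) / len(position_returns)
    return side * (equal_weighted - hedge_ratio * spy_return)

# Three open positions on a day when SPY gained 0.5%
day_returns = [0.02, -0.01, 0.01]
print(hedged_return(day_returns, spy_return=0.005))           # long book
print(hedged_return(day_returns, spy_return=0.005, side=-1))  # short book
```

A hedge ratio of 1 treats every stock as if it had a beta of 1, which is the simplification mentioned above.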

I only trade stocks that, as of the day before, have a market cap of at least $100m and a 10-day average dollar volume of at least $750k. I also only trade stocks when CRSP says the bid-ask spread at the close is less than 50 bps.
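These filters amount to a simple predicate on each stock. In the sketch below the `StockSnapshot` fields are invented names, not CRSP column names, and the spread is assumed to be measured in basis points of the quote midpoint.

```python
from dataclasses import dataclass

MIN_MARKET_CAP = 100e6      # $100m
MIN_DOLLAR_VOLUME = 750e3   # $750k, 10-day average
MAX_SPREAD_BPS = 50.0

@dataclass
class StockSnapshot:
    market_cap: float             # previous day's market cap, in dollars
    avg_dollar_volume_10d: float  # previous day's 10-day average, in dollars
    bid: float                    # best bid at the close
    ask: float                    # best offer at the close

def is_tradable(s):
    """Apply the size, volume, and spread filters to one stock."""
    mid = (s.bid + s.ask) / 2
    if mid <= 0:
        return False
    spread_bps = (s.ask - s.bid) / mid * 1e4
    return (s.market_cap >= MIN_MARKET_CAP
            and s.avg_dollar_volume_10d >= MIN_DOLLAR_VOLUME
            and spread_bps < MAX_SPREAD_BPS)

liquid = StockSnapshot(500e6, 2e6, 9.99, 10.01)    # 20 bps spread
wide = StockSnapshot(500e6, 2e6, 9.90, 10.10)      # 200 bps spread
tiny = StockSnapshot(50e6, 2e6, 9.99, 10.01)       # below the size floor
print(is_tradable(liquid), is_tradable(wide), is_tradable(tiny))
```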

I only backtested words appearing in more than 100 and fewer than 100,000 filings. This eliminates both rare words and common stop-words like “and,” “but,” and “with.” Some garbage words are mixed in, unfortunately. I removed uuencoded data from filings, but it turns out I also needed to remove gzipped data embedded in some filings.
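Given an inverted index, the word filter is a one-liner on the length of each posting list. This sketch assumes the index is a dictionary from word to a list of filing ids. The function name and the toy index with tiny thresholds are only for illustration.

```python
def select_words(index, min_docs=100, max_docs=100_000):
    """Keep words found in more than min_docs and fewer than max_docs filings."""
    return sorted(
        word for word, filing_ids in index.items()
        if min_docs < len(filing_ids) < max_docs
    )

# Toy index with small counts, so toy thresholds are used below
toy_index = {
    "and": list(range(10)),      # too common
    "tortilla": list(range(3)),  # kept
    "xqz": list(range(1)),       # too rare
}
print(select_words(toy_index, min_docs=1, max_docs=5))
```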


It took three computers about a week to complete all the backtests. I set some filters to only keep the best results, and 60 words passed my filters. I ranked the backtests with a custom scoring function that fits my preferences reasonably well:
score = 0.4 * monthly Sharpe ratio + 0.4 * daily Martin ratio + 0.2 * daily Sharpe ratio for the last 3 years
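For anyone unfamiliar with the Martin ratio, it is return divided by the Ulcer Index, which is the root-mean-square percentage drawdown. Below is a rough sketch of how the score could be computed from return series. The annualization factors, the zero risk-free rate, and the arithmetic annualized return are assumptions, and the exact definitions behind the numbers in this post may differ.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(returns, periods_per_year):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)

def ulcer_index(returns):
    """Root-mean-square percentage drawdown of the compounded equity curve."""
    equity, peak, squares = 1.0, 1.0, []
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        squares.append((100 * (equity / peak - 1)) ** 2)
    return math.sqrt(mean(squares))

def martin_ratio(returns, periods_per_year):
    """Annualized return in percent divided by the Ulcer Index."""
    ui = ulcer_index(returns)
    if ui == 0:
        return 0.0
    return 100 * mean(returns) * periods_per_year / ui

def score(monthly, daily, daily_last_3y):
    """Weighted blend of the three ratios, using the weights given above."""
    return (0.4 * sharpe_ratio(monthly, 12)
            + 0.4 * martin_ratio(daily, 252)
            + 0.2 * sharpe_ratio(daily_last_3y, 252))

# Made-up return series, only to show the call
monthly = [0.02, -0.01, 0.03, 0.01]
daily = [0.001, -0.002, 0.003, 0.0, 0.001]
print(score(monthly, daily, daily[-3:]))
```

The Martin ratio penalizes long, deep drawdowns more than the Sharpe ratio does, which is why it is useful alongside it.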

The charts don’t include the scale on the y-axis. That’s because the leverage varies from one backtest to the next. The tricky thing about event-based backtests is deciding whether to scale down the position sizes when lots of events start occurring. In this case, I kept the position size constant, which causes the portfolio size to vary drastically over time and across backtests. In real trading, one would probably scale down old positions to make room for new ones to keep the portfolio size somewhat more constant. Or, perhaps less-profitable events would be ignored if most of the capital was already deployed. Portfolio optimization might help make such decisions. It could be argued, though, that more filings means more opportunity, which would justify the increased leverage.

Multiple testing is a big problem with this exercise. For that reason, the backtests use data only from 2004 through 2013. Years 2014 and 2015 were held out, and were not part of the data-mining exercise. They provide an idea of how bad the over-fitting was. But two years isn’t sufficient to judge a strategy, so I show an additional backtest that includes more unseen events. It considers only filings excluded from the primary backtest because the stock was insufficiently liquid. This backtest does not consider transaction costs, since it isn’t intended to be traded. However, it does still have a small minimum liquidity constraint. Its only purpose is to see if the effect can be seen in additional held-out data. In these charts, red plot lines indicate held-out data.
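The split can be pictured as a label attached to each filing event. The sketch below is a simplification under stated assumptions: the function name and labels are invented, the illiquid set is assumed to span all years, and the small liquidity floor on the illiquid backtest is not modeled.

```python
from datetime import date

IN_SAMPLE_START = date(2004, 1, 1)
IN_SAMPLE_END = date(2013, 12, 31)
HOLDOUT_END = date(2015, 12, 31)

def classify_event(filing_date, is_liquid):
    """Label a filing event by the backtest it would belong to."""
    if not is_liquid:
        # Excluded from the primary backtest, so never seen during mining
        return "held-out (illiquid)"
    if IN_SAMPLE_START <= filing_date <= IN_SAMPLE_END:
        return "in-sample"
    if IN_SAMPLE_END < filing_date <= HOLDOUT_END:
        return "held-out (2014-2015)"
    return "unused"

print(classify_event(date(2010, 6, 1), is_liquid=True))
print(classify_event(date(2014, 6, 1), is_liquid=True))
print(classify_event(date(2010, 6, 1), is_liquid=False))
```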

Here are the words in descending order of score. Click on them to see the backtests:

supermedia   noncircumvention   samba   genehmigte   mitglied   nonconverted   suleman   schneiderman   bcbca   fasst   racer   cytk   shampoos   yaeger   anzeige   beshar   quimicos   propellant   tortilla   oronite   grandfathers   weeklies   favorability   gliche   overdrawn   colavita   godiva   bef   fractionate   legalization   gtm   fleetguard   sharpridge   pests   emmens   playable   tricare   nikolaos   awp   pilipino   multidistrict   spectrasite   canfield   shopped   sphc   nrsros   toothpastes   varietal   amerada   indifferent   thatcher   methylamines   tni   barranquilla   coombs   bpt   apcd   abo   nestl   duffield