The NilssonHedge databases rely overwhelmingly on data that hedge funds, distributors, and other data sources disclose into the public domain. We obtain our data using a variety of techniques, including web data extraction, APIs, digitizers, and PDF files. We then condense the data into fewer entities to make it easier to search. This is how we do it, and also where it may go wrong:
Hedge fund return data used to be highly confidential and jealously guarded. Today, many hedge funds have listed vehicles whose prices are freely available. We make liberal use of such disclosures to acquire price data for those vehicles, where and when it is feasible. Occasionally we also use other techniques to extract data from files and charts.
With over ten thousand unique inputs, we try to reduce the number of entities by aggregating data over share classes using several clustering approaches. Among the measures we use are correlation, tracking error, volatility, behavior, and descriptive data. We also collect data from multiple sources to avoid input errors. 99.9% of what we do is fully automated, but we are also dependent on correct data being available from our sources.
To classify managers, we use a supervised learning strategy, seeking to describe the strategies based on certain keywords and characteristics. This classification algorithm will gradually learn and may change classification as we learn more about the characteristics of a strategy.
As our goal is to create the longest possible track record for any individual strategy, we routinely ignore minor fee differences, currency-hedged share classes, or slight variations between strategies. The track records in our database may therefore occasionally deviate from the manager's official returns by tens of basis points.
We average the return estimates we have, creating a representative data point, and we round our returns to no more than four decimal places. Better roughly right than precisely wrong. Our database is thus not suitable for tracking your particular investment, but rather the generic returns of the strategy you are invested in.
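The averaging and rounding step described above could be sketched as follows. This is a minimal illustration, not our production code: the function name `representative_return`, the sample figures, and the convention of storing returns as decimal fractions are all assumptions made for the example.

```python
from statistics import mean


def representative_return(estimates, decimals=4):
    """Average the available estimates for one period and round the result.

    `estimates` holds the returns reported for the same strategy and period
    by different sources or share classes (assumed to be decimal fractions,
    so 0.0123 means 1.23%).
    """
    if not estimates:
        raise ValueError("need at least one return estimate")
    return round(mean(estimates), decimals)


# Hypothetical estimates for one month from three share classes.
estimates = [0.01234, 0.01198, 0.01251]
print(representative_return(estimates))
```

Because the inputs differ slightly (fees, currency hedging), the rounded average is a generic figure for the strategy rather than the return of any one share class.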
As an example of our quality controls, we flag performance streams as potentially similar when their correlation is higher than 0.95 (that is, almost, but not quite, the same return stream) and their volatility is in the same range. These returns are then manually mapped to the same entity if they are indeed the same strategy. This works better for funds that are truly long-short strategies. It works less well for asset allocation or crypto (beta) strategies, so for those we use other techniques.
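A rough sketch of such a similarity flag is shown below. The 0.95 correlation cutoff comes from the text; the volatility test (`max_vol_ratio=1.25`), the function names, and the sample series are assumptions chosen for illustration, since "the same range" is not defined more precisely here.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equally long return series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def potentially_similar(x, y, min_corr=0.95, max_vol_ratio=1.25):
    """Flag two streams whose correlation exceeds min_corr and whose
    volatilities differ by at most max_vol_ratio (an assumed threshold)."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("series must have equal length of at least 3")
    vol_x, vol_y = stdev(x), stdev(y)
    if vol_x == 0 or vol_y == 0:
        return False  # a flat series cannot be compared meaningfully
    ratio = max(vol_x, vol_y) / min(vol_x, vol_y)
    return pearson(x, y) > min_corr and ratio <= max_vol_ratio


# Hypothetical monthly returns: a base class, a class with a slightly
# higher fee, and a 2x leveraged version of the same strategy.
base = [0.012, -0.004, 0.021, 0.003, -0.015, 0.008]
higher_fee = [r - 0.0005 for r in base]
leveraged = [2 * r for r in base]

print(potentially_similar(base, higher_fee))
print(potentially_similar(base, leveraged))
```

The leveraged series shows why the volatility condition matters: it is perfectly correlated with the base series but is not the same return stream. A flag is only a candidate; the mapping itself remains a manual decision.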
To extract data we use a “web crawler”, a piece of code that crawls specific websites and extracts performance-related data. The crawler is our main workhorse, capturing data as soon as it is reported by fund managers. We also use several OCR (optical character recognition) packages to extract data from other formats, and plot digitizers to take information from charts.
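To give a flavor of the extraction step, here is a minimal sketch that pulls period/return pairs out of an HTML table using only the Python standard library. The page layout, the `SAMPLE` markup, and the names `ReturnTableParser` and `extract_returns` are hypothetical; real fund pages vary widely and each needs its own handling.

```python
from html.parser import HTMLParser


class ReturnTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows use <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False


def extract_returns(html):
    """Map period -> return (decimal fraction) for two-column rows."""
    parser = ReturnTableParser()
    parser.feed(html)
    parser.close()
    returns = {}
    for row in parser.rows:
        if len(row) != 2:
            continue  # skip rows that do not match the assumed layout
        period, value = row[0].strip(), row[1].strip()
        returns[period] = float(value.rstrip("%")) / 100
    return returns


# Hypothetical snippet of a fund page.
SAMPLE = """
<table>
  <tr><th>Month</th><th>Return</th></tr>
  <tr><td>2023-01</td><td>1.23%</td></tr>
  <tr><td>2023-02</td><td>-0.45%</td></tr>
</table>
"""

print(extract_returns(SAMPLE))
```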
Given the sheer amount of individual data points we collect (over 350,000), mistakes can and will happen. Here are a few of the potential error sources:
- Dividends - cash flow payments are occasionally not captured, leading to unexpected declines in performance. This is usually self-correcting and will be adjusted once we know the dividend. We strive to use share classes that are reinvesting.
- Mapping - return streams are mapped to a manager and a strategy. Occasionally these mappings break, and we may incorrectly report returns for a strategy that should actually be reported under a different name.
- Input errors - for the small fraction of the data that is entered manually, we may misplace a comma or incorrectly transcribe numbers. While we try to use multiple sources for our data, such errors are hard to detect and prevent if we only have one source for the data.
- Bugs in our code - occasionally we are unable to extract data in the right way, which will result in either bad data or changes to the historical data streams that we have. To catch this, we run regression checks to verify that historically reported returns remain consistent. Whenever we encounter such a problem, we retool our web crawlers.
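The regression check mentioned in the last item can be illustrated with a short sketch: compare the returns already stored for a strategy with a fresh extraction and report every historical period that changed or disappeared. The function name `find_history_changes`, the tolerance, and the sample data are assumptions for the example.

```python
def find_history_changes(stored, fresh, tolerance=1e-4):
    """Return {period: (old, new)} for every stored period whose value
    differs from the fresh extraction by more than `tolerance`, or is
    missing from it (new is then None). New periods are not flagged."""
    changes = {}
    for period, old in stored.items():
        new = fresh.get(period)
        if new is None or abs(new - old) > tolerance:
            changes[period] = (old, new)
    return changes


# Hypothetical data: February changed between two crawler runs,
# and April is a newly reported month.
stored = {"2023-01": 0.0123, "2023-02": -0.0045, "2023-03": 0.0081}
fresh = {"2023-01": 0.0123, "2023-02": -0.0054,
         "2023-03": 0.0081, "2023-04": 0.0030}

print(find_history_changes(stored, fresh))
```

A non-empty result does not say which version is right. It only signals that the history moved, which may point to a restatement by the manager, a missed dividend, or a bug in the extraction.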
Our hedge fund database is unique, not only in relying on public disclosure of data but also in that our processes are almost fully automated. We run our firm lean, but deploy several quality controls to maintain high-quality data. We hope you find our database useful for your research, portfolio construction, and manager selection.
And if you find any errors, do not hesitate to get in touch with us.