Ian Kerins
8 Mins
February 1, 2019

The rise of web data in hedge fund decision making & the importance of data quality

Over the past few years, there has been an explosion in the use of alternative data sources in investment decision making in hedge funds, investment banks, and private equity firms.

These new data sources, collectively known as “alternative data”, have the potential to give firms a crucial informational edge in the market, enabling them to generate alpha.

Although investors now use countless alternative data sources (satellite, weather, employment, trade data, and so on), by far the leading source is web data in its various forms: web-scraped data, search trends, and website traffic.

Web data is unique in that a vast ocean of rich, up-to-date signaling data lies within arm’s reach on the web. However, it is locked away in unstructured formats that are not easily accessible.

In this article, we’re going to discuss the most popular form of alternative data, web-scraped data, and share with you the most important factor that firms need to take into account when building a robust alternative financial data feed for their investment decision-making processes: data quality.

Finance’s unique data requirements

When it comes to using data in multi-million dollar (or billion-dollar) investment decisions, the ability to validate your investment hypothesis via benchmarking and backtesting is crucial.

What this means for web scraped data is that web data doesn’t start to become truly valuable until you have a complete historical dataset.

The key here is the word “complete”.

As we will discuss in the Data Quality section below, data completeness and quality play a huge role in the value and usefulness of any data source. Without a complete historical data set, it is nearly impossible for firms to validate their investment thesis prior to committing to a multi-million (or billion) dollar investment decision.

Their investment thesis must be rigorously stress tested to evaluate the soundness of its underlying assumptions and the predicted risk and return from the investment, and then benchmarked against other investment theses competing for the same pool of investment money.

The most effective way of evaluating how an investment thesis would have fared in past situations is to stress test it with historical data, which makes complete historical data extremely important.

There are two approaches taken to obtain the historical data firms need:

#1 Purchasing historical datasets

One approach is to purchase off-the-shelf datasets from alternative data vendors. The completeness and value of these datasets can be easily validated with some internal analysis; however, they suffer greatly from commoditization.

As these datasets are openly for sale, everyone can get access to the same data sources, which significantly reduces the informational edge one firm can get over another from the resulting data. The ability to generate alpha with the data will depend largely on the competencies of the internal data analysis and investment teams, and on the other proprietary data they can combine with these off-the-shelf datasets.

#2 Create your own

The other, increasingly popular option is for firms to create their own alternative finance web data feeds and build their own historical datasets. This approach has its pros and cons as well.

The huge advantage to firms creating their own web data feeds is that it gives them access to unique data that their competitors won’t have. Having their own internal data extraction capabilities greatly expands the number and completeness of the investment theses their team can develop, enabling them to develop investment theses that give them a unique edge over the market. However, the primary downside to building an internal data feed is that it is typically an investment for the future. Firms likely won’t be able to use the extracted data straight away (although, depending on the data type, they might), as they first need to build a backlog of historical data.

As we’ve seen, there is a huge need for web data in investment decision making. However, as we’ve noted, its value is highly dependent on the quality of the underlying data.

Data quality

By far the most important element of a successful web scraping project in alternative data in finance is data quality.

Without high-quality, complete data, web data is oftentimes useless for investment decision making. It is simply too unreliable and risky to base investment decisions on incomplete or low-quality data.

This poses a huge challenge to any hedge fund data acquisition team, as the accuracy and coverage requirements they face often far exceed the requirements of your typical web scraping project.

The reason for this heightened need for data quality is that any breaks or corruptions in the data oftentimes corrupt the whole dataset, making it unusable for investment decision making.

If there is a break in the data feed, interpolating between the available data points might introduce errors that would corrupt the output of any analysis of the data, potentially leading to a misguided investment decision.

As a result, unless you can be confident in the accuracy of the interpolation, any break in the data feed can severely disrupt the usability of the data.
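To make the point about breaks concrete, here is a minimal sketch of how a team might find the gaps in a daily feed before deciding whether interpolation is acceptable. It assumes one observation per calendar day; the function names find_gaps and gaps_too_long and the max_gap_days threshold are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def find_gaps(observed_days, start, end):
    """Return (first_missing, last_missing) runs between start and end, inclusive."""
    observed = set(observed_days)
    gaps = []
    run_start = None
    day = start
    while day <= end:
        if day not in observed:
            # Open a new run of missing days if one is not already open.
            if run_start is None:
                run_start = day
        elif run_start is not None:
            # An observed day closes the current run of missing days.
            gaps.append((run_start, day - timedelta(days=1)))
            run_start = None
        day += timedelta(days=1)
    if run_start is not None:
        gaps.append((run_start, end))
    return gaps


def gaps_too_long(gaps, max_gap_days):
    """Keep only the gaps longer than max_gap_days (an assumed tolerance)."""
    return [(a, b) for a, b in gaps if (b - a).days + 1 > max_gap_days]


if __name__ == "__main__":
    days = [date(2019, 1, d) for d in (1, 2, 3, 6, 7, 10)]
    gaps = find_gaps(days, date(2019, 1, 1), date(2019, 1, 10))
    for first, last in gaps_too_long(gaps, max_gap_days=1):
        print(f"gap from {first} to {last}")
```

What counts as an acceptable gap depends on the data type and the analysis it feeds, so the threshold is something each team would need to set for itself.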

It is because of this need for high-quality, reliable data that alternative finance web scraping teams need to double down on the fundamentals of building a robust web scraping infrastructure: crawler/extractor design, proxy management, and data quality assurance.


Robust crawler & extractor design

Crawler and extractor design plays a crucial role in the quality and reliability of an alternative data feed for finance. As the names suggest, the crawler and extractor are the parts of the web scraping system that locate and extract the target data from the website.

As a result, any inaccuracies here are extremely hard (sometimes impossible) to correct in post-processing. If the extracted raw data is incomplete, incorrect, or corrupted, then without other independent data sources to supplement, interpolate, and validate it, the underlying raw data can be rendered unusable. This makes crawler and extractor design the #1 focus when building a web data extraction infrastructure for alternative finance data.

It is outside the scope of this article to detail how to develop robust crawlers and extractors. However, we will discuss some high-level points to keep in mind when designing them.


With the importance of the resulting data to investment decision making, nothing beats having experienced crawl engineers when designing and building crawlers and extractors.

Each website has its own quirks and challenges, from sloppy structures to JavaScript, to anti-bot countermeasures and difficulty navigating to the target data. Having experienced engineers enables your team to predict the challenges your crawlers and extractors are going to face well before the problems manifest themselves. That lets you develop a robust data feed from day one and start building historical datasets, instead of spending weeks (or months) troubleshooting and refining a data feed that yields unreliable data.

Built for scale, configurability & edge cases

How the web crawlers and extractors are configured is also very important. We’ve touched on this in some of our other articles, but to build on those points: when building your web scraping infrastructure, you need to separate your data discovery and data extraction spiders. It’s inevitable with website changes and anti-bot challenges that your crawlers will at some point stop yielding perfect data quality. As a result, your crawlers need to be highly configurable, able to detect and cope with foreseen edge cases, and structured in a way that enables them to be stopped and resumed mid-crawl without any data loss.
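One way to picture the stop-and-resume requirement is a checkpoint that records which discovered URLs are still pending and which have already been extracted. The sketch below is an illustration only, not a description of any particular framework: the CrawlCheckpoint class, its method names, and the JSON file layout are all assumptions made for the example.

```python
import json
import os
import tempfile


class CrawlCheckpoint:
    """Persist pending and finished URLs so a crawl can resume after a stop."""

    def __init__(self, path):
        self.path = path
        self.pending = []
        self.done = set()
        # Reload earlier progress if a checkpoint file already exists.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                state = json.load(f)
            self.pending = state["pending"]
            self.done = set(state["done"])

    def add(self, url):
        """Queue a URL found by the discovery step, skipping known ones."""
        if url not in self.done and url not in self.pending:
            self.pending.append(url)
            self._save()

    def next_url(self):
        """Return the next URL for the extraction step, or None when finished."""
        return self.pending[0] if self.pending else None

    def mark_done(self, url):
        """Record a URL as extracted only once its data has been stored."""
        self.pending.remove(url)
        self.done.add(url)
        self._save()

    def _save(self):
        # Write to a temporary file, then swap it in, so that a crash
        # mid-write does not leave a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"pending": self.pending, "done": sorted(self.done)}, f)
        os.replace(tmp, self.path)
```

Because a URL only moves to the done set after extraction succeeds, a crawl that is stopped mid-run picks up at the first unfinished URL when a new CrawlCheckpoint is created from the same file. Keeping discovery (add) and extraction (next_url, mark_done) as separate steps mirrors the separation of discovery and extraction spiders described above.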


Reliable proxy infrastructure

The most important factor in ensuring the reliability of your data feed is ensuring you can reliably access the data you need no matter the scale. As a result, a robust proxy management solution is an absolute must.

Nothing influences the reliability of requesting the underlying web pages more than your proxy management system. If your requests are constantly getting blocked, there is a very high risk of gaps in your data feed.

It is very common for web scraping teams to run into severe banning issues as they move spiders from staging to full production. At scale, blocked requests can quickly become a troubleshooting nightmare and a huge burden on your team.

You need to use a robust and intelligent proxy management layer that is able to rotate IPs, select geographically specific IPs, throttle requests, identify bans and captchas, automate retries, and manage sessions, user agents, and blacklisting logic to prevent your proxies from getting blocked and disrupting your data feed.
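A heavily simplified sketch of three of those responsibilities (IP rotation, ban detection, and automated retries with backoff) might look like the following. Everything here is an assumption for illustration: the fetch callable stands in for whatever HTTP client you use and is expected to return a (status, body) pair, and the ban heuristics (status codes 403 and 429, or the word "captcha" in the body) are deliberately crude. A production proxy layer has to handle far more than this.

```python
import itertools
import time

# Assumed heuristic: treat these status codes as signs of a ban.
BAN_STATUS_CODES = {403, 429}


def looks_banned(status, body):
    """Crude ban check: a blocking status code or a captcha page."""
    return status in BAN_STATUS_CODES or "captcha" in body.lower()


def fetch_with_rotation(url, proxies, fetch, max_attempts=5,
                        base_delay=1.0, sleep=time.sleep):
    """Try the URL through successive proxies until a response looks clean.

    fetch is a caller-supplied function fetch(url, proxy) -> (status, body).
    """
    pool = itertools.cycle(proxies)
    for attempt in range(max_attempts):
        proxy = next(pool)
        status, body = fetch(url, proxy)
        if not looks_banned(status, body):
            return proxy, status, body
        # Back off exponentially before retrying through the next proxy.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all {max_attempts} attempts for {url} looked banned")
```

Passing fetch and sleep in as parameters keeps the retry logic independent of any HTTP library and makes it easy to exercise with a fake fetch function before pointing it at real sites.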

You have two options here: you can either use high-quality proxies and develop this proxy management infrastructure in-house, or use a purpose-built proxy management solution like Zyte Smart Proxy Manager.

Managing proxies is not a core competency or a high-ROI task for hedge funds, so our recommendation is to use a robust and well-maintained off-the-shelf proxy management solution like Zyte Smart Proxy Manager, which lets you focus on using the data in your investment decision-making processes.

Data quality assurance

Lastly, your firm's web scraping infrastructure must include a highly capable and robust data quality assurance layer that can detect data quality issues in real-time so they can be fixed immediately to minimize the likelihood that there will be any breaks in the data feed.

Obviously, a completely manual QA process simply won’t cut it here, as it would never be able to guarantee the required quality levels at scale.

You need to implement a hybrid automated/manual QA process that is able to monitor your crawlers in real-time, detect data accuracy and coverage issues, correct minor issues, and flag major issues for manual inspection by your QA team.
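As a rough illustration of the automated half of such a process, the sketch below audits one batch of scraped records for coverage (how many records arrived versus how many were expected) and for fill rate on required fields, and collects flags for a human to review. It is not the four-layer process mentioned in this article; the audit_batch function, its parameters, and the 95% default threshold are assumptions made for the example.

```python
def audit_batch(records, required_fields, expected_count, min_coverage=0.95):
    """Check record coverage and per-field fill rates for one scraped batch."""
    report = {"coverage": 0.0, "field_fill": {}, "flags": []}

    # Coverage: share of the expected records that were actually scraped.
    if expected_count:
        report["coverage"] = len(records) / expected_count
    if report["coverage"] < min_coverage:
        report["flags"].append("low record coverage")

    # Fill rate: share of records with a non-empty value for each field.
    for field in required_fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        rate = filled / len(records) if records else 0.0
        report["field_fill"][field] = rate
        if rate < min_coverage:
            report["flags"].append(f"low fill rate for field '{field}'")

    return report


if __name__ == "__main__":
    batch = [
        {"sku": "a", "price": 1.0},
        {"sku": "b", "price": None},
        {"sku": "c", "price": 3.0},
        {"sku": "d", "price": 4.0},
    ]
    report = audit_batch(batch, ["sku", "price"], expected_count=4, min_coverage=0.9)
    for flag in report["flags"]:
        print("needs manual review:", flag)
```

Running a check like this on every batch as it arrives is what allows problems to be caught while the crawl can still be repeated, rather than being discovered later as a hole in the historical dataset.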

At Zyte, we use a four-layer QA process to ensure our alternative finance clients have confidence in the quality of their data. It is because of this four-layer QA process that we can confidently give our clients written data quality and coverage guarantees in their service level agreements (SLAs).

If you’d like an insider look at the four layers of our QA process and how you can build your own, then be sure to check out our Data Quality Assurance guide.

Your web data needs

As you have seen, there are a lot of challenges associated with extracting high-quality alternative finance data from the web. However, with the right experience, tools, and resources, you can build a highly robust web scraping infrastructure to fuel your investment decision-making process with high-quality web data and gain an informational edge over the market.

If you are interested in extracting web data for your investment decision-making processes but are wrestling with the decision of whether to build up a dedicated web scraping team in-house or outsource to a dedicated web scraping firm, be sure to check out our guide, Enterprise Web Scraping: The Build In-House or Outsource Decision.

At Zyte, we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.