What is a data business?
Every company uses data, but not every company is a data business. A company is a data business if, and only if, data is its core product. Data is central to the activity of the company; without the data, there is no company.
Google, Bloomberg, Yelp, and ZoomInfo are all data businesses. They acquire their data in different ways, and they generate revenue from that data in different ways. But for all these companies, data is the fundamental unit of value creation.
ONE: Data is the core truth
THE FIRST fundamental truth of data business models is this: it’s all about the data.
Successful data businesses are all built around a unique or proprietary data asset. There are a few ways to build such an asset:
Brute force: You throw resources at the task of primary data collection. Examples: Google crawling every website in the world; Planet launching 100s of micro-satellites; ZoomInfo cold-calling company switchboards to verify contact info. This is the most common and, despite the expense, often the best way to create a new data asset.
Aggregate and harmonize: You take data that others have collected and published (often for free), and you aggregate, link, and harmonize the data. Example: Reuters standardizing printed financial statements in the 1970s.
License and transform: You license commoditized data and transform it into a value-added form. Example: Scale.AI taking raw images and labelling them at, well, scale.
Affiliate collection: You farm out data acquisition to partners with the right incentives. Example: Advertisers install the Facebook pixel to collect data on customers, which they send to Facebook to optimize their Facebook ads.
Core business output: You create the data as part of your core business process. Example: Every transaction on the New York Stock Exchange generates data (price, volume, orders), which NYSE monetizes.
Payment in kind: You offer a free service or tool, in exchange for data or data tracking. Example: Foursquare’s free SDKs for mobile app developers, which enable Foursquare to track mobile user location.
Inbound network effects: You create a compounding advantage in getting data sources to come to you. Example: Google search is made continually better by site owners submitting data to Google (via SEO and other channels), leading to more Google searches and even stronger incentives for site owners.
Give to get: Partners send you individual pieces of data in order to access the corpus as a whole. Example: Businesses send their counter-party data to Dun & Bradstreet, in order to access D&B’s B2B credit database — which is based on aggregating all these counter-party reports.
(Data consortia are related to give-to-get, but with a peer-to-peer topology instead of hub-and-spoke.)
Data exhaust: You collect or generate data as a by-product of your core business. Example: comparison shopping apps, email managers and personal finance tools all have visibility into consumer transactions; some of them use this to build data products.
Data creation: You generate synthetic data, for applications where ‘real’ data is unnecessary, undesirable, or unachievable. Example: Tonic creates fake data that companies can test their systems on, before deploying to production.
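The data-creation approach can be illustrated with a toy sketch: generating fake customer records for testing a pipeline before it touches production data. Everything here is hypothetical (the schema, field names, and value ranges are invented for illustration), and it uses only Python's standard library.

```python
import random
import string

def synthetic_customers(n, seed=0):
    """Generate n fake customer records.

    The schema (id, name, age, balance) is a made-up example;
    no record corresponds to a real person. A fixed seed makes
    the output reproducible across test runs.
    """
    rng = random.Random(seed)
    records = []
    for i in range(n):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        records.append({
            "id": i,
            "name": name,
            "age": rng.randint(18, 90),
            "balance": round(rng.uniform(0, 10_000), 2),
        })
    return records

customers = synthetic_customers(100)
print(len(customers))  # 100
```

Real synthetic-data products go much further, of course: they preserve the statistical shape and referential integrity of the source data while guaranteeing that no real record leaks through.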
TWO: Control unique data to capture unique value
THE SECOND fundamental truth of data business models is this: whoever controls the data, captures the value. Intermediaries get squeezed.
A common failure mode is to build a business on top of somebody else’s data. If you depend on a single upstream source for your data inputs, they can simply raise prices until they capture all of the economics of your product. That’s a losing proposition.
So you should try to build your own primary data asset, or work with multiple upstream providers such that you’re not at the mercy of any single one.
You should also try to add proprietary value of your own, lest either your suppliers or your customers encroach and disintermediate you. A sufficiently large transformation of your source data is tantamount to creating a new data product of your own.
These tactics interact. Sometimes the very act of merging multiple datasets adds substantial value. Joining data correctly is hard! Other non-glamorous ways to add value include quality control, labelling and mapping, deduping, provenancing, and imposing data hygiene.
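A tiny sketch of why joining is hard: the same entity shows up under different spellings, so naive exact joins silently miss matches. This example fuzzy-matches company names after crude normalization, using only Python's standard library (the suffix list and threshold are illustrative assumptions; real entity resolution needs far more).

```python
import difflib

def normalize(name):
    # Lowercase and strip common corporate suffixes -- a crude
    # harmonization step before matching.
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " corp.", " corp", " llc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip(" .,")

def likely_same(a, b, threshold=0.85):
    """Heuristic: do two raw names refer to the same company?"""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(likely_same("Acme Corp.", "ACME Inc"))         # True -- an exact join would miss this
print(likely_same("Acme Corp.", "Apex Industries"))  # False
```

Even this toy version has failure modes (think "Delta Air Lines" vs "Delta Dental"), which is exactly the point: doing this well across millions of records is genuine, defensible work.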
Some companies, discovering that they can neither control their data assets nor add intermediary value, pivot to picks-and-shovels instead. Tools to support data businesses — everything from monitoring to pipelines to governance — can be lucrative in their own right.
The gold-rush metaphor may be over-used, but it’s still valid. Prospecting is a lottery; picks-and-shovels has the best risk-reward; jewellers make a decent living; and a handful of gold-mine owners become fabulously rich.
THREE: Data businesses have slow beginnings
THE THIRD fundamental truth of data businesses is this: they start slow.
You’ll notice that none of the above data acquisition methods are ‘easy’. They need upfront investment or a certain amount of scale to work. Absent either of those, building a data asset is a process of slow bootstrapping.
Adding to the problem is the fact that almost all data products have a ‘minimum viable corpus’ — a size below which the data simply isn't useful. This parallels the concept of a minimum viable product in software, but an MVC is usually much harder to build than an MVP.
The analogy with software doesn’t end there. Almost every part of the software business stack has an equivalent in the data business stack. Where software firms invest in devops, QA and product, data firms have to invest in data ops, data QA, and data product. These tend to be just as complex, with the extra hurdle that third-party providers are rare, hence you often have to build this infra in-house. All of this is expensive.
As a result, delivering one's first data product requires significant time and resources.
And this is a good thing! Remember the classic wisdom: my capex is your barrier to entry. The effort required to go from zero-to-one in data businesses is one reason they are so formidably defensible. It's also why ‘brute force’ remains one of the most popular strategies used by players in this game. A data product that can be built easily is a data product that can be replicated easily.
But even after you build your data asset, you’re not home free. For reasons we’ll get into later, most data products require category creation. You have to educate your ecosystem, evangelize your product, nurture your customers over time. Early sales cycles are long, and win rates are low. But it gets better — a lot better.
Aside: Nothing ventured, nothing gained
One would think that a business model that requires substantial upfront investment but pays off in buckets later on would be a perfect fit for venture financing. That may have been true in earlier eras, but not today. Tech investing in recent years has indexed heavily on growth rates; as a result, data businesses — with their slow early growth — often find it difficult to raise venture capital.
(It’s also the case that compelling opportunities in data have historically been rarer than opportunities in software, even if they're more lucrative. VCs are familiar with outlier math, but their lack of reps evaluating data businesses tells against them.)
FOUR: Growth accelerates over time
THE FOURTH fundamental truth of data businesses is this: they accelerate.
Everything starts slower on the data side. Building a valuable data asset takes time. Building the supporting infrastructure to actually deliver that data takes time. Sales cycles take time.
The classic mistake people make is to see this and jump to the conclusion that early-stage data businesses don't work and will never work.
But that's a category error, and it’s due to a fundamental difference in dynamics. Software business economics tend to degrade; data business economics tend to improve.
Why so? Here’s how it works:
The marginal cost of acquiring data begins to decline. You begin to see economies of scale on the infrastructure side.
Data sales get easier as your corpus is no longer minimal. Sales cycles shorten, sometimes dramatically.
An expanding corpus also expands your audience: for example, there are many more buyers for data covering 50 US states or 10,000 public stocks than for data covering 10 states or 200 stocks.
You can slice and dice your data for more effective targeting and price discrimination — shortening your sales cycle even more.
As your data becomes widely used, it goes from optional to essential. Customers use it because other customers are using it. The dream of every data asset owner is to become an industry standard. (This doesn’t happen with most other business models).
You can charge more for data. This is partly a corpus-size effect, and partly a table-stakes/must-have effect. Data maturity opens up new axes for pricing — per record, per API call, per data update — in addition to the usual SaaS axes of per use case and per seat.
With more customers, you can amortize your fixed costs of data acquisition and delivery across a wider base — and they’re almost all fixed costs.
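The amortization effect is easy to see with toy numbers (all figures below are hypothetical):

```python
# Hypothetical unit economics for a data business: acquisition and
# delivery are mostly fixed costs, so cost per customer collapses
# as the customer base grows.
fixed_costs = 2_000_000           # annual data acquisition + infra
marginal_cost_per_customer = 500  # support, API traffic, etc.

def cost_per_customer(n):
    return fixed_costs / n + marginal_cost_per_customer

for n in (10, 100, 1000):
    print(n, round(cost_per_customer(n)))
# 10   -> 200500
# 100  -> 20500
# 1000 -> 2500
```

At 10 customers the economics look hopeless; at 1,000 they look wonderful. Same product, same costs, different scale.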
You can charge recurring revenue; after all, nobody wants to work with obsolete data. (This is harder in the early days, not because of lack of buyer appetite, but because your update cadence probably isn’t good enough.)
The combination of recurring revenue, avenues for upsell, and must-have status means your NRR and LTV are terrific.
You can unlock new channels of data acquisition, most notably customer contribution loops. Models like give-to-get, payment in kind, and affiliate partnerships are now accessible to you. (They weren’t previously, because you were too small to be sufficiently attractive.)
You can also create data quality loops: get customers to not just contribute, but also verify their own as well as third-party data for you, either explicitly or through various behavioural and software hooks.
You can use your data to power your customer acquisition, most notably with a data content loop. Data content loops can be simultaneously cheaper, more scalable and more defensible than most other go-to-market channels, as shown by Expedia, GlassDoor and Zillow.
You can build services on top of your data, creating a data learning loop. As your data improves, these services improve in parallel, growing and contributing even more data to your platform.
Now, many of these effects taper off eventually. Marginal data costs go back up once you start hitting the long tail; price flattens out once the marginal data point no longer adds insight; most real-world entities have finite (even if large) cardinality. These curves are sigmoid, not unbounded.
But you can get a very long way before that happens. At-scale data businesses are huge, and many of them are still growing fast.
The holy grail is when all these scale and network effects combine such that you can be both the lowest-cost acquirer and the highest-paying buyer of your data inputs — while still offering your data outputs to customers at the lowest price in the market. If you get this far, you’re unstoppable.