How to Hack Your Way into a Proprietary Dataset

As software continues to eat the world, every company must become a tech company. The good news: declining cloud compute and hosting costs and open-source machine learning frameworks like TensorFlow mean it has never been cheaper and easier to build intelligent software. The bad news: it has never been cheaper and easier to build intelligent software, so this is no longer a competitive differentiator; it’s table stakes.

As we enter the “Great Commoditization” era of software, how can a CEO de-commoditize and build a long-term competitive moat around her business?

I believe data will be the gold that separates the winners from the also-rans in this next generation of machine learning-driven software.

But not all datasets are created equal. Deep datasets focused on solving specific problems are better than large, broad datasets. Dynamic, constantly refreshing datasets are vastly superior to static datasets, typically regardless of their size. And, ultimately, the datasets must be proprietary; the harder they are to access or replicate, the wider the long-term moat. Proprietary data is the fuel that can turn empty, commoditized workflow software into rich, defensible recommendation engines.

To build these engines, the datasets must also have a “closed loop”: a set of inputs paired with the outputs they drive. Generally speaking, machine learning works by correlating actions with outcomes and constantly recalibrating to improve accuracy. If all you have is inputs, like which ads users click to get to your website, but no outputs, like what they bought, you aren’t going to be able to train the AI to do anything.
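A minimal sketch of what closing the loop can look like in practice, using the ad-click example above. The record types, field names, and the simple per-ad conversion rate are all illustrative assumptions, not a description of any particular company's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdClick:
    """Hypothetical input record: which ad brought a user to the site."""
    user_id: str
    ad_id: str


@dataclass(frozen=True)
class Purchase:
    """Hypothetical output record: what a user ended up buying."""
    user_id: str
    amount: float


def close_the_loop(clicks, purchases):
    """Pair each click (input) with whether that user purchased (output)."""
    buyers = {p.user_id for p in purchases}
    return [(c.ad_id, c.user_id in buyers) for c in clicks]


def conversion_rate_by_ad(pairs):
    """Summarize the input/output pairs as a purchase rate per ad."""
    totals = {}
    for ad_id, converted in pairs:
        clicked, bought = totals.get(ad_id, (0, 0))
        totals[ad_id] = (clicked + 1, bought + int(converted))
    return {ad: bought / clicked for ad, (clicked, bought) in totals.items()}


clicks = [AdClick("u1", "ad_a"), AdClick("u2", "ad_a"), AdClick("u3", "ad_b")]
purchases = [Purchase("u1", 10.0)]

pairs = close_the_loop(clicks, purchases)
rates = conversion_rate_by_ad(pairs)
```

With only `clicks`, there is nothing to learn from; it is the join against `purchases` that turns raw inputs into labeled examples a model could train on.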

So how can you build a deep, dynamic, proprietary, closed-loop dataset, especially if you’re starting from Ground Zero?

From Zero to Defensible

It starts with rethinking the relationship between your product and your data. Data collection can’t be a series of one-off exercises used to inform company strategy; it must be built into the core product itself. In other words,

Every time a user engages with your product, you must collect data from the interaction that systematically makes future usage of the product even better, across all users in the network.

Bradford Cross, CEO of Merlon Intelligence, describes how to build this data flywheel into your products: “[ensure] you are capturing totally unique data over time from how your product is used, and that data capture is designed precisely to serve the needs of your models, which are designed to serve the needs of the product functionality, which is designed to meet the needs of the customer. This data value chain ensures that the customer’s motivation is aligned with your motivation to compound the value of your proprietary dataset.”

Building out an effective data flywheel can be the key to achieving a state of competitive nirvana I call Compounding. Once your business enters this phase, every new customer you add makes the dataset and thus the product better, which attracts more customers, which makes the dataset better, and so on. For this model to work most effectively, you have to rethink how data is used across your customer base; data from every user in your network must be used to improve the product for every other user in the network, regardless of which customer they may work for. This requires significant architecture (technical, legal, and data security) to work, but is key to maximizing the value of the system.

Google and Amazon have built the most formidable businesses the world has yet known by leveraging this model. We believe there is an even more exciting opportunity to use this model to build companies which don’t harvest users to sell to them more effectively but instead help them complete their tasks more effectively. We call these businesses Coaching Networks, and they use AI not to automate workers away but to augment them in real-time while they are performing their jobs.

Textio is a good example of a Coaching Networks startup that’s built a data flywheel, initially focused on the recruiting space. Textio Hire optimizes job posts to help recruiters hire the right people faster. As recruiters write job posts in Textio, the software highlights words and phrases and suggests tweaks that would improve the likelihood of attracting a targeted candidate profile. Once the job is posted, Textio tracks the candidates who apply and automatically updates its model and thus the suggestions it makes to every recruiter in the network. The product improves with every user, which improves the outcomes for every customer, which leads to a compounding dataset. After three years in market, Textio has amassed thousands of users, who have collectively built the world’s largest database of job posts and outcomes: 370 million strong as of this writing.

But no company starts with this user-driven flywheel. They start with hacks.

Types of Data Hacks

Data hacks can be placed along a spectrum from aggregation hacks to creation hacks. The former start with existing datasets pooled together in some interesting way. These can be relatively straightforward to hack together and, as a result, are the most common types of startup data hacks. On the other end of the spectrum are creation hacks, which, as the name suggests, involve the generation of data that hasn’t existed before (at least in a structured manner). These tend to be harder and are thus a rarer starting point.

Another dimension along which these hacks can be understood is how proprietary the resulting data is. How hard is it for others to replicate a meaningful portion of the data, and in what time frame? To be truly proprietary, the dataset must be exclusive to the company that owns it.

While the ultimate goal is to achieve a defensible, compounding dataset, using less proprietary hacks to get you started on the journey can be effective if you move quickly.

The hack or hacks you start with will depend on the assets you start with. Established companies may have unstructured datasets they can work to structure. They may also have large staffs they can enlist as data hackers.

Startups, with fewer assets by definition, often have to be more creative. Indeed, the best founders I work with are extremely creative when it comes to designing data hacks.

Let’s explore some of the most commonly used data hacks:

Scraping (a non-proprietary aggregation hack)

Perhaps the most common startup data hack, this consists of collecting publicly available but scattered data. This can take the form of scraping websites, online databases, or even offline databases. CoreLogic is perhaps the best example of what can be built on the back of offline scraping; they collect public records data from government offices across the country and sell the packaged data to real estate players for large sums.

  • Can be easy/low cost to start (e.g., build a web crawler)
  • Can allow for aggregation of large volumes relatively quickly
  • Can allow for fast iteration
  • Can be easy to replicate
  • Can present legal issues (check scraping rules beforehand)
  • Can be hard to acquire “output” data necessary to close the loop

Partnering (a proprietary aggregation hack)

Another common strategy is to explore partnerships between established entities, such as industry incumbents or governments, that already have large, unstructured datasets and startups that have the talent and focus necessary to structure and make use of them. In exchange for access to this data, startups often offer their partners revenue shares, partial IP ownership, in-kind services, or even good old cash. Tractable, which provides an AI that improves car accident repair processes, is a good example of a startup that has hacked its way into success by partnering with industry incumbents.

  • Can provide valuable closed-loop, input/output data pairings for startups
  • Can provide a competitive barrier to entry for startups if partnership involves exclusivity
  • Can provide incumbent an opportunity to leverage an unused asset and move towards building their own data flywheel
  • Data is often unstructured and/or requires significant cleansing
  • Can result in a serious tax on the startup, both financially and legally, if not well-structured. The best partnership deals often involve the startup providing in-kind (free) services to the incumbent in exchange for data

Crowdsourcing (a non-proprietary creation hack)

Crowdsourcing is a popular, low-tech way to seed a dataset. It can take a variety of forms, from leaders asking their teams to collect data (e.g., take photos, create surveys, label data, etc.) to outsourcing these tasks to workers on services like Mechanical Turk.

  • Can tailor the datasets created to specific needs
  • Can be the easiest and cheapest hack
  • Depending on the tactic, can be hard to scale
  • Can be easy to replicate, so it is important to move on quickly to other hacks

Workflow First (a proprietary creation hack)

A popular “two-step” data hack is to start by building workflow services, driving usage via the workflow, and then looking for ways to make use of the data captured. Salesforce, perhaps the quintessential cloud workflow provider, is looking to move in this direction with its Einstein offerings.


  • Can monetize early with the traditional benefits of a SaaS model
  • Can build workflow to capture closed-loop, input/output data pairings


  • Hard to build a “two-step” business from a product, customer and talent perspective. Most don’t make it to step two.
  • Can be challenging legally, as contracts need to clearly allow for data sharing across customers from the beginning. Need to have a bulletproof data privacy and security strategy and team in place.

Companies often layer on a variety of hacks on their journey to the flywheel. Textio started by scraping public job boards. To create a synthetic closed input/output loop, they assumed that the time a job remained posted on the board was inversely correlated with the quality of the job post (better posts got filled faster). This allowed them to build a rough algorithm that was good enough to approach potential partners. They worked with a few large employers and provided them free job post optimization in exchange for historical job post and hiring data. This injection of large closed-loop datasets allowed them to improve the product to the point that they could start selling it on its own to paying customers. As more customers used the product, it continued to improve, and the flywheel began to spin.
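The synthetic loop in that first step can be sketched in a few lines. This is only an illustration of the general idea of a proxy label, not Textio's actual method: the `days_posted` field, the median cutoff, and the binary labels are all assumptions made for the example.

```python
import statistics


def proxy_labels(posts, threshold_days=None):
    """Attach a stand-in outcome to each scraped job post.

    Assumption: a post that came down faster than the cutoff was a
    better post. By default the cutoff is the median time on the board.
    """
    if threshold_days is None:
        threshold_days = statistics.median(p["days_posted"] for p in posts)
    return [
        (p["text"], 1 if p["days_posted"] < threshold_days else 0)
        for p in posts
    ]


# Hypothetical scraped posts; real data would come from public job boards.
scraped_posts = [
    {"text": "Join a collaborative team building search tools", "days_posted": 10},
    {"text": "Engineer needed", "days_posted": 30},
    {"text": "Rockstar ninja wanted, must work weekends", "days_posted": 50},
]

labeled = proxy_labels(scraped_posts)
```

Each resulting pair is an input (the post text) matched with an inferred output, which is enough to train a rough first model. The proxy is noisy, since a post can come down for reasons unrelated to its wording, which is why real outcome data from partners is such a meaningful upgrade.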

As we enter the “Great Commoditization” era, CEOs of everything from freshly minted startups to established incumbents will need to answer two critical new questions: Which data hack or hacks are you pursuing and how will they lead you to a flywheel?

A similar version of this article first appeared on Forbes
