Data Asset Companies — Build or Buy Your Data Platform and Data Infrastructure
There’s a new category of companies whose business is selling datasets and/or the analytics and processing built on top of them. I’d like to call them “Data Asset” companies. For example:
- monitoring/metrics/log vertical like DataDog, Dynatrace, Chronosphere, Splunk, SumoLogic
- gaming vertical like Game Analytics
- location vertical like SafeGraph, MapBox, Esri
- crypto vertical like Nansen and Dune Analytics
They collect data (e.g. metrics, logs, points of interest, blockchain data) as an asset, and then sell you access to it (dashboards, APIs, etc.).
Here are a few critical aspects to take into consideration. (They may apply to all kinds of companies, but I’m being specific to data companies here.)
1. Cost Scalability
For a data asset company, processing and analyzing data is a must-have base cost. To scale successfully, you have to keep data infra cost growing linearly while the business and its profits grow exponentially.
The first tradeoff is always the cost of people vs. time/speed. E.g. you don’t want to build your own infra if you are only a 10-person team. Since it’s a well-known tradeoff and topic, I’ll skip it here.
What’s not so obvious is the tech cost. Even though you necessarily pass some cost on to customers, core cost cannot grow at the same rate as the core business; otherwise, it may not be a good business. E.g. in tech consulting, the number of clients you can serve is linear in the number of consultants you hire, and that is not a good business.
How do you scale cost and value disproportionately at a data asset company? It’s case by case. You should first understand how vendors charge you. Vendors can charge by:
- constants, e.g. $/license. This is the SaaS model, usually applicable to low-data-volume areas like metadata and security management. In general it should be OK to just absorb these costs.
- usage, e.g. $/sec. This is the most common pricing model in infra. E.g. a compute service like AWS EC2 charges by $/core*sec (or per instance type), and a storage service like AWS S3 or GCP Cloud Storage charges by $/MB or $/GB stored. This is likely the cost you pass on to users.
- hidden! This is the worst. Unlike the two above, you don’t understand how it’s calculated and you have no control over it. E.g. GCP BigQuery can charge by the amount of data scanned. It’s bad in every single way: there is little you can do about it (e.g. barely any optimization is possible), and the vendor is not motivated to optimize and reduce the amount of data scanned. If your core business depends heavily on BigQuery, give it a second thought.
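To make the three pricing models concrete, here is a minimal Python sketch. Every price, field name, and function name below is a made-up assumption for illustration, not any vendor’s real rate card. It shows how each bill reacts when data volume grows 10x while the number of licensed seats stays the same.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    seats: int           # licensed users
    core_seconds: float  # compute consumed
    gb_stored: float     # data at rest
    tb_scanned: float    # data read by queries


def license_cost(w: Workload, price_per_seat: float = 50.0) -> float:
    """Constant model: the bill depends on seats, not on data volume."""
    return w.seats * price_per_seat


def usage_cost(w: Workload,
               price_per_core_second: float = 0.00002,
               price_per_gb: float = 0.02) -> float:
    """Usage model: the bill follows compute and storage you can meter."""
    return w.core_seconds * price_per_core_second + w.gb_stored * price_per_gb


def scan_cost(w: Workload, price_per_tb_scanned: float = 5.0) -> float:
    """Scan model: the bill follows how much data the engine decides to read."""
    return w.tb_scanned * price_per_tb_scanned


# Hypothetical workloads: same team size, 10x the data.
small = Workload(seats=10, core_seconds=1_000_000, gb_stored=1_000, tb_scanned=20)
big = Workload(seats=10, core_seconds=10_000_000, gb_stored=10_000, tb_scanned=200)

for label, w in (("small", small), ("big", big)):
    print(label,
          "license:", round(license_cost(w), 2),
          "usage:", round(usage_cost(w), 2),
          "scan:", round(scan_cost(w), 2))
```

In this toy setup the license bill stays flat while the usage and scan bills follow data volume, which is why the latter two deserve the closer look.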
Once you understand the cost model of your dependencies, you can analyze your business and do a simple projection to see if your business can outgrow the cost. E.g. it doesn’t make sense for Datadog to use a 3rd-party time series database for metrics; they have to build and own this core piece themselves, for cost and other reasons.
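A projection like this can be a few lines of Python. The sketch below assumes revenue compounds monthly while cost grows by a fixed amount each month; the function names and all numbers are hypothetical, so plug in your own.

```python
from typing import List, Optional, Tuple


def project(months: int, revenue0: float, revenue_growth: float,
            cost0: float, cost_step: float) -> List[Tuple[int, float, float]]:
    """Return (month, revenue, cost) rows.

    Revenue compounds at `revenue_growth` per month (exponential),
    cost rises by `cost_step` per month (linear).
    """
    rows = []
    for m in range(months + 1):
        revenue = revenue0 * (1 + revenue_growth) ** m
        cost = cost0 + cost_step * m
        rows.append((m, revenue, cost))
    return rows


def breakeven_month(rows: List[Tuple[int, float, float]]) -> Optional[int]:
    """First month in which revenue covers cost, or None if it never does."""
    for m, revenue, cost in rows:
        if revenue >= cost:
            return m
    return None


# Hypothetical business: 10% monthly revenue growth vs. a steadily rising infra bill.
growing = project(24, revenue0=10_000, revenue_growth=0.10,
                  cost0=20_000, cost_step=1_000)
# Same cost curve, but revenue is flat: the business never outgrows the cost.
flat = project(24, revenue0=10_000, revenue_growth=0.0,
               cost0=20_000, cost_step=1_000)

print("growing business breaks even in month:", breakeven_month(growing))
print("flat business breaks even in month:", breakeven_month(flat))
```

If the break-even month keeps moving out as you plug in realistic vendor prices, that is a signal to revisit the dependency.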
2. Control your own destiny: Vendor Lock-ins
Vendor lock-in first comes with pricing and technical barriers. How much negotiating power do you have over vendors? If a vendor says they are going to raise the price due to inflation, are you able to counter that ask and move off quickly if necessary? I won’t expand here as there’s plenty of content online about this.
What I want to highlight are the non-monetary, non-technical areas, which people usually don’t assess correctly if they’ve never been there. E.g.
- How soon do vendors deliver on feature and bug-fix requests?
- How well do they provide support? e.g. docs, SLAs, response time, on-call
- How aligned is your roadmap with their actual roadmap? How much can you influence their roadmap and management chain?
Note that these are also closely related to your contract $ amount and thus your company size, as well as the vendor’s size and stage. E.g. a big customer may be treated better by a small vendor, or by a large vendor if you yourself are a renowned brand. But to be frank, if crude: from time to time I’ve seen even the big companies I worked at get ignored by vendors as big as AWS and as small as startups. Think slow responses, slow execution, and features and bug fixes that were never delivered, which resulted in us building internal solutions to replace them. Small companies and users fare even worse. So be prepared, and make sure you can tolerate all these hurdles if you decide to use a vendor.
3. System Composability and Extensibility for Future Growth
Vendors usually put in barriers to defend themselves from competitors, but your data strategy only works when it’s composable.
Be aware of what you can and cannot do down the road with a vendor. E.g. vendor A may never connect to vendor B, and vendor C may never allow you to replace or plug in components X, Y, Z inside their systems. Such things can be show stoppers: if the query engine you purchased cannot read data from one of your data stores, that’s very bad. So avoid short-sighted engineering decisions, and factor these constraints into your early decisions.
4. Team Expertise and Hiring
If your early team only has data engineering experience, use a vendor solution, and design it to keep retreat routes open.
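One way to keep a retreat route open is to hide the vendor behind a thin interface that your own code owns. The sketch below is only an illustration: `MetricStore`, `VendorBackedStore`, `InHouseStore`, and `migrate` are hypothetical names, and the vendor class keeps data in memory as a stand-in instead of calling any real SDK.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class MetricStore(Protocol):
    """The only storage interface the rest of the codebase depends on."""

    def write(self, name: str, value: float) -> None: ...

    def read(self, name: str) -> List[float]: ...


class VendorBackedStore:
    """Stand-in for a vendor adapter; a real one would call the vendor SDK here."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[float]] = {}

    def write(self, name: str, value: float) -> None:
        self._rows.setdefault(name, []).append(value)

    def read(self, name: str) -> List[float]:
        return list(self._rows.get(name, []))


class InHouseStore:
    """Stand-in for a future in-house backend exposing the same interface."""

    def __init__(self) -> None:
        self._points: List[Tuple[str, float]] = []

    def write(self, name: str, value: float) -> None:
        self._points.append((name, value))

    def read(self, name: str) -> List[float]:
        return [v for n, v in self._points if n == name]


def migrate(src: MetricStore, dst: MetricStore, names: Iterable[str]) -> None:
    """Copy the named series from one backend to another through the interface."""
    for name in names:
        for value in src.read(name):
            dst.write(name, value)


# Start on the vendor, then move off without touching calling code.
vendor = VendorBackedStore()
vendor.write("cpu", 0.5)
vendor.write("cpu", 0.7)

in_house = InHouseStore()
migrate(vendor, in_house, ["cpu"])
print(in_house.read("cpu"))
```

Because callers only ever see `MetricStore`, swapping the backend is a migration job rather than a rewrite. The catch is that the interface has to stay small enough that a second backend can realistically implement it.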
A big misunderstanding about using data infra vendors is that people think they don’t need a platform team anymore, since they buy the vendor’s services and imagine the vendor will take care of everything. That’s a truly mistaken assumption. This post is a good example of why you still need a platform team even with a vendor solution: https://www.safegraph.com/blog/scaling-data-as-a-service-daas-with-platform-engineering
As for building your own platform: even if you want to, how skilled is your team? Building a platform or infra is not something someone can learn overnight; you’d better have someone who has done it before lead the effort.
What’s more, what does your hiring plan look like? Start hiring talent in advance. If you only start hiring when the need is already immediate, that’s too late, as hiring a qualified lead and a starting team can take more than a year.
5. Hidden Cost, Even of Free Software
Be aware of hidden costs when trying to leverage open source software (OSS). Projects may appear to be open and free, but some may be a trap. E.g. Delta Lake only open sources the very basic functions, and Databricks keeps all the advanced features private in its commercial version. So if you decide to use Delta Lake mainly because it’s free, think twice.
6. Taking Proper Tradeoffs According to Company Maturity, Timing, Resources
Ultimately, decision makers should make proper tradeoffs based on the situation, e.g. the maturity of the business and company, business demand, etc.
Understand the tradeoffs you are making and decide accordingly. E.g. if you are a small startup still looking for product-market fit, your backend infra is certainly not the top priority. But you should thoroughly understand the pros, cons, and tradeoffs you are taking on at that moment (e.g. velocity, cost, flexibility, performance, scalability), and you must have a long-term plan for what the ultimately right way is and how to gradually shift in that direction.