Assaf Pinhasi


A simple framework for evaluating ML infra and tooling product ideas


Over the years, I have received quite a few pitches from people starting companies in the AI infra/tooling space.

Some of them reach out to evaluate the viability of their product idea, e.g. to hear:

  • How many of the companies I’ve worked with have faced a given problem
  • How acute the problem is for different teams/segments
  • Pros and cons of other players in the field etc.

I am not a product guy, but I’ve been in this space for some time.

Here are some of my thoughts on how to evaluate ML infra and tooling product ideas.

I think there are two interesting dimensions to look at:

  • Day 1 vs. Day 2 problems
  • Engineering-focused vs. ML-focused problems

Each combination of these dimensions yields a different type of product with its own set of advantages and challenges.
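Combining the two dimensions gives four quadrants. As a toy illustration only (the one-line quadrant summaries are my own condensation of the sections that follow, not the author's wording), the framework can be sketched as a small lookup:

```python
# Toy sketch of the 2x2 framework: (project stage, problem focus) -> main trade-off.
# The summaries are condensed from the sections below; names are illustrative.
FRAMEWORK = {
    ("day 1", "ml"): "broad but crowded market; users (data scientists) hold little budget",
    ("day 1", "engineering"): "broad market and easier executive buy-in, but incumbents are preferred",
    ("day 2", "ml"): "urgent customers, but fragmented problems and risk of in-house builds",
    ("day 2", "engineering"): "urgent cost/maintenance pain, but a smaller, fragmented market",
}

def evaluate(stage: str, focus: str) -> str:
    """Return the main trade-off for a (stage, focus) product idea."""
    return FRAMEWORK[(stage.lower(), focus.lower())]

print(evaluate("Day 2", "ML"))
```

The rest of the article walks through each dimension and the trade-offs behind these one-liners.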

Day 1 vs. Day 2 problems

Day 1

Day 1 problems usually revolve around the basic mechanics needed to use machine learning.

Here I include everything from the first time you train a model until the early impact it has on customers.

  • “How do I train my model without needing infrastructure?”
  • “How do I put it in production?”
  • “How do I manage model features?”
  • “How do I monitor my model decisions?”

Advantage — broader addressable market

Let’s face it, the majority of the ML market seems to perpetually stay in day 1. This is probably because:

  • ML adoption is only growing so more companies are just getting started.
  • Also, sadly, many projects fail and never make it to day 2.

Disadvantage — saturated market with little differentiation

Almost every problem in this space already has tens of players.

Most of the products in this space are not very differentiated from each other, especially products which solve only a single aspect of the ML lifecycle.

Day 2

Day 2 problems are often focused on maintenance, continuous or last mile improvements, or economics — from model quality to infrastructure cost reductions.

Areas such as:

  • “How do I find tough edge cases for my model?”
  • “How do I figure out why my model made certain decisions?”
  • “How do I check if my training was optimal?”
  • “How can I continuously improve my training datasets?”
  • “How can I reduce my GPU cloud bill?”

Advantage + disadvantage — smaller market

Not as crowded, but smaller.

Enough said.

Advantage — potential for customers with a high sense of urgency

If a team is facing a critical day 2 problem, they will have a higher sense of urgency to address it: the business is already impacted by the issue, and they are often under pressure to deliver.

Disadvantage 1 — problem / market fragmentation

Generally, ML products which reached day 2 can be quite different from each other, since they were already optimized for the specific constraints of the environment and business.

In addition, most companies invest in day 2 problems only if they are on the critical path (vs. nice to have).

This is generally true for systems which are already in production, but is especially true for ML — where the last 10% are 50x harder than the first 90%.

So in day 2 there are fewer companies, facing a wider variety of problems which may be domain-specific and felt with higher intensity, while “nice to have” problems get ignored.

Disadvantage 2 — high value problems can justify in-house solutions

If a problem is extremely valuable to solve, some customers will prefer to build a custom, in-house solution to ensure a 100% fit.

This is especially common in very high-value domains like autonomous driving, robotics, or other safety-critical systems.

It can also happen in very profitable ad-tech, risk, or other domains where small marginal improvements can mean very high returns.

ML vs. Engineering problems

Delivering a valuable ML product requires solving problems which are somewhere on the spectrum between ML and engineering.

Many things are influenced by where the problem you are addressing sits on this spectrum.

One interesting area related to where your product falls on the spectrum is the user personas and sales cycle you can expect to meet.

ML-heavy problems

Examples of such problems:

  • “How to efficiently and accurately label my data?”
  • “How do I easily track and compare ML experiments?”
  • “How do I check whether my model is robust?”

When solving such problems, the main audience is usually a team of data science professionals and their immediate management.

Advantage — focus area for the main project staff

Almost no decision can pass in an ML project without going through the data science team.

These problems fall squarely in their area of expertise, bug them daily, and they are passionate about solving them well.

Disadvantage — customer empowerment

Data scientists usually do not hold much purchasing decision-making power.

They may also lack the technical skills to set up trials that require new infrastructure, or the permission to move large amounts of data around, especially into the vendor’s infra.

So here you can expect all the usual challenges of selling a dev tool (monetization etc.), plus some extra, since data scientists rarely control infra.

As a result, such products need to be a pip install, a self-onboarding pure SaaS with no data movement, or a combination thereof.

Engineering-heavy problems

In contrast, there are problems which are closer to the engineering side of things, like:

  • “How to manage GPU machines?”
  • “How to deploy a model serving endpoint?”
  • “How do I monitor ML endpoints?”

Advantage — easier sell to decision makers

Often the decision makers who control purchasing and data movement come from a more engineering-oriented background.

These problems can resonate with them even if they have very little understanding of the ML-specific challenges, making it easier to get their buy-in.

Disadvantage 1 — companies often prefer to go with vendors they know

Many R&D executives would rather search for solutions to problems they roughly know, from vendors they are already working with.

Hence the cloud vendors, who generally have mediocre products in this space, are raking in the money.

Disadvantage 2 — multi-disciplinary decisions

Virtually every ML infra decision also needs to pass through the data science team, at least in terms of usability and productivity.

This leads to multi-disciplinary discussions between R&D leaders and Data scientists.

Like any multi-disciplinary discussion, it’s somewhat harder to navigate, especially for teams which are still in “day 1”.

Summary

Building successful ML infra and tools as products is certainly a challenge.

However, I firmly believe that there are still many interesting and valuable problems throughout this spectrum which are waiting for a solution.

I hope you found this analysis useful, and feel free to reach out if you have any product ideas you are exploring.
