30 July 2025
Machine learning is driving change in every industry, but the reality of ML engineering is that the ecosystems supporting it are immature. Widely used frameworks are not tuned to use the latest hardware, observability tools do not proactively alert and recommend solutions, environments must be set up and torn down frequently, and performance tuning and diagnostics are often not well understood.
Many AI engineers find themselves using low-cost, decentralized Tier 3 cloud service providers (CSPs) like Vast.ai, RunPod, or Lambda Labs. These providers offer cheaper and readier access to the latest hardware, but they come with limited setup and configuration support and varying levels of reliability, and they require (sometimes voluminous) data ingress and egress to function at all.
This article explores key pain points in the current ML engineering ecosystem: the state of observability tools, the scale of GPU underutilization, and the question of why infrastructure configuration is still largely manual. Let’s dive in.
Tier 3 cloud providers appeal to MLEs because they offer cheaper GPU access, better bang-for-buck performance, and greater flexibility than hyperscalers like AWS, GCP, or Azure.
However, with that cost efficiency comes complexity and chaos.
Unlike with AWS SageMaker or GCP Vertex AI, most Tier 3 CSPs hand you a raw virtual machine. Every time a GPU is rented, engineers must:
This creates massive inefficiencies and leads to a lack of reproducibility.
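One way to make the reproducibility problem visible is to fingerprint each freshly rented machine and compare it with the last one that worked. The sketch below is illustrative only and uses just the Python standard library. The function names (`environment_snapshot`, `fingerprint`, `diff_packages`) are hypothetical, and it deliberately ignores GPU driver and CUDA versions, which a real check would probably also need to capture.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def fingerprint(snapshot):
    """Hash a snapshot so two machines can be compared with one string."""
    canonical = json.dumps(snapshot, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_packages(old, new):
    """Return {name: (old_version, new_version)} for packages that differ."""
    names = set(old["packages"]) | set(new["packages"])
    return {
        n: (old["packages"].get(n), new["packages"].get(n))
        for n in sorted(names)
        if old["packages"].get(n) != new["packages"].get(n)
    }
```

Saving the snapshot as JSON next to each training run, and calling `diff_packages` when the fingerprints disagree, gives a quick answer to "what changed between this rental and the last one?". Note that the `platform` string includes the kernel version, so fingerprints will often differ across hosts even when the Python packages match.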
Developed by Databricks, MLflow offers:
Benefits:
Liabilities:
A commercial platform with:
Benefits:
Liabilities:
W&B reported in 2023 that average GPU utilization was 35–45%, with one-third of customers using only 15% or less.
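To check where your own machines fall relative to figures like these, you can sample utilization directly from `nvidia-smi`. The following is a minimal sketch, not a monitoring solution: the helper names are hypothetical, the 45% threshold is an arbitrary assumption borrowed from the upper end of the range quoted above, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields, which this code does not handle).

```python
import subprocess

# Query one CSV row per GPU: index, compute utilization (%), memory used/total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_pct": 100.0 * float(mem_used) / float(mem_total),
        })
    return stats


def underutilized(stats, threshold_pct=45.0):
    """Return indices of GPUs whose compute utilization is below the threshold."""
    return [s["index"] for s in stats if s["util_pct"] < threshold_pct]


def sample_gpus():
    """Run nvidia-smi once; requires an NVIDIA driver on the host."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)
```

A single sample is only a snapshot. Calling `sample_gpus()` on a timer during a training run and averaging `util_pct` over the whole job gives a number that is more comparable to an average-utilization figure.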
Common causes:
Helpful tools:
Tier 3 providers are heterogeneous. This results in:
We still lack:
The MLE landscape is fragmented not from lack of talent, but from missing tools. Engineers face:
Opportunities exist to:
The pain points are clear, and they are also the blueprints for the next generation of ML infrastructure tools.