The Fractured Landscape of Machine Learning Engineering: Pain Points, Tools, and Opportunities

30 July 2025

Machine learning is driving change in every industry, but the reality of ML engineering is that the ecosystems supporting it are immature. Widely used frameworks are not tuned to use the latest hardware, observability tools do not proactively alert and recommend solutions, environments must be set up and torn down frequently, and performance tuning and diagnostics are often not well understood.

Many AI engineers find themselves using low-cost, decentralized Tier 3 cloud service providers (CSPs) like Vast.ai, RunPod, or Lambda Labs. These providers offer cheaper, readier access to the latest hardware, but they come with limited setup and configuration support, varying levels of reliability, and (sometimes voluminous) data ingress/egress requirements just to get a workload running.

This article explores key pain points in the current ML engineering ecosystem, analyzes the state of observability tools, examines the scale of GPU underutilization, and asks why infrastructure configuration is still largely manual. Let’s dive in.

1. The Pain Points of Machine Learning Engineering on Tier 3 Cloud Providers

Tier 3 cloud providers appeal to ML engineers (MLEs) because they offer cheaper GPU access, better bang-for-buck performance, and greater flexibility than hyperscalers like AWS, GCP, or Azure.

However, with that cost efficiency comes complexity and chaos.

Unlike with AWS SageMaker or GCP Vertex AI, most Tier 3 CSPs hand you a raw virtual machine. Every time a GPU is rented, engineers must:

  • Install system dependencies (e.g., CUDA, cuDNN, Python)
  • Set up Docker or Conda environments and compose Kubernetes/Terraform configs
  • Set up training data access (e.g., via S3 or rsync)
  • Install observability/monitoring SDKs and agents
  • Recreate environment variables, volume mounts, and SSH keys

This creates massive inefficiencies and leads to a lack of reproducibility.
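
To make the repetition concrete, here is a minimal sketch of the kind of bootstrap script teams end up re-running on every rented VM. Everything in it (the package list, the S3 bucket, the paths) is a placeholder rather than a recommended setup, and it assumes the NVIDIA driver and the AWS CLI are already installed:

```python
# Illustrative per-VM bootstrap; all values are placeholders.
import subprocess

def sh(cmd: str) -> None:
    """Run a shell command and fail loudly so broken setups surface early."""
    subprocess.run(cmd, shell=True, check=True)

def bootstrap() -> None:
    sh("nvidia-smi")                                  # sanity-check driver and GPU visibility
    sh("pip install torch torchvision mlflow wandb")  # project dependencies
    sh("aws s3 sync s3://my-training-data /data")     # pull training data (placeholder bucket)
    sh("mkdir -p /checkpoints")                       # local checkpoint directory

if __name__ == "__main__":
    bootstrap()
```

Multiply that by every rental, every provider, and every driver mismatch, and the reproducibility problem becomes obvious.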

2. Observability Tools: MLflow vs Weights & Biases

MLflow

Developed by Databricks, MLflow offers:

  • Lightweight experiment tracking
  • Model versioning and packaging
  • Deployment tools

Benefits:

  • Self-hosted and free to use
  • Python/ML framework integrations
  • Supports various backends

Liabilities:

  • Passive reporting: it records what you log, but never alerts or suggests fixes
  • Limited UI interactivity and real-time visualizations
  • No built-in GPU/system metrics tracking
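
For orientation, this is roughly what MLflow's tracking workflow looks like; the tracking URI, experiment name, and logged values below are placeholders for illustration:

```python
# Minimal MLflow tracking sketch (pip install mlflow); values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # points at a self-hosted tracking server
mlflow.set_experiment("resnet-baseline")

with mlflow.start_run():
    mlflow.log_param("lr", 3e-4)
    mlflow.log_param("batch_size", 128)
    for epoch in range(3):
        # Metrics are recorded passively; nothing alerts you if training stalls.
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
```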

Weights & Biases (W&B)

A commercial platform with:

  • Real-time monitoring
  • Artifact logging
  • Sweep management
  • GPU utilization dashboards

Benefits:

  • Interactive UI
  • Detailed GPU/system metrics
  • Excellent integration with major ML frameworks

Liabilities:

  • No intelligent alerts
  • Vendor lock-in
  • Internet dependency and potentially high cost
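
The equivalent W&B workflow looks similar on the surface, but the run streams to W&B's servers, which is where both the richer dashboards and the lock-in come from. The project name and config values here are placeholders:

```python
# Minimal Weights & Biases sketch (pip install wandb, plus a logged-in API key).
import wandb

run = wandb.init(project="resnet-baseline", config={"lr": 3e-4, "batch_size": 128})
for epoch in range(3):
    # W&B also samples GPU/system metrics in the background for its dashboards.
    wandb.log({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)})
run.finish()
```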

3. GPU Utilization Is Still Terrible — and We Know It

W&B reported in 2023 that average GPU utilization was 35–45%, with one-third of customers using only 15% or less.

Common causes:

  • Data pipeline inefficiencies
  • Checkpointing stalls
  • Blocking calls in training loops
  • No mixed precision or JIT
  • Imbalanced workloads in multi-GPU training
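
Two of these causes, missing mixed precision and blocking host-to-device copies, are often only a few lines to fix. Here is a hedged PyTorch sketch; the model, optimizer, and DataLoader are assumed to be defined elsewhere:

```python
# PyTorch training-loop sketch addressing two causes above: no mixed precision
# and blocking copies. Model, optimizer, and loader are assumed to exist.
import torch

def train_epoch(model, optimizer, loader, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    for x, y in loader:  # the DataLoader should use pin_memory=True for async copies
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()  # scaled loss avoids fp16 gradient underflow
        scaler.step(optimizer)
        scaler.update()
```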

Helpful tools:

  • NVIDIA Nsight / DLProf
  • W&B dashboards
  • PyTorch Profiler / TensorBoard
  • Smart dataloaders
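
As one concrete starting point, the PyTorch Profiler can confirm whether the GPU is actually idle between steps. The toy model and step counts below are purely illustrative, and the resulting trace can be opened in TensorBoard via the torch-tb-profiler plugin:

```python
# Minimal PyTorch Profiler sketch; the model and data are toy placeholders.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),
) as prof:
    for _ in range(6):
        x = torch.randn(256, 512, device="cuda")
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advance the wait/warmup/active schedule each iteration
```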

4. Why Do We Still Rewrite Kubernetes and Terraform for Every Environment?

Tier 3 providers are heterogeneous: base images, driver versions, networking, and storage defaults differ from one vendor, and often from one host, to the next. This results in:

  • Terraform networking rule mismatches
  • K8s manifest incompatibilities
  • Manual rewrites of node selectors, tolerations, storage classes, etc.

We still lack:

  • Scripts for auto-fingerprinting environment characteristics
  • Auto-generation of configs for kubeadm, Helm, Terraform
  • Plug-and-play environment builders
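
None of this is exotic. As a sketch of what auto-fingerprinting could look like (nothing below is an existing tool, and the fields are illustrative), a short script can gather the facts that node selectors, tolerations, and instance sizing usually depend on, then emit them for a templating step:

```python
# Hypothetical environment-fingerprinting sketch; uses only nvidia-smi and the stdlib.
import json
import os
import platform
import shutil
import subprocess

def fingerprint() -> dict:
    """Collect host facts that K8s/Terraform templates typically need."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "gpus": [],
    }
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            name, mem, driver = (field.strip() for field in line.split(","))
            info["gpus"].append({"name": name, "memory": mem, "driver": driver})
    return info

if __name__ == "__main__":
    # The JSON output could feed Helm values or Terraform variables.
    print(json.dumps(fingerprint(), indent=2))
```

Feeding that output into Helm or Terraform is the unglamorous glue that is still largely missing from Tier 3 workflows.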

5. Conclusion: An Opportunity in the Gaps

The MLE landscape is fragmented not from a lack of talent, but from missing tools. Engineers face:

  • Manual, time-consuming VM setups
  • Underutilized GPUs
  • Varied observability based on budget
  • Rewritten infrastructure config per deployment

Opportunities exist to:

  • Auto-generate K8s/Terraform from environment detection
  • Build local-first, declarative, versioned env configs
  • Offer intelligent observability (e.g., dataloader warnings)
  • Make ephemeral ML training as easy as deploying to Heroku

The pain points are clear — and they are also the blueprints for the next generation of ML infrastructure tools.
