The Fractured Landscape of Machine Learning Engineering: Pain Points, Tools, and Opportunities

30 July 2025

Machine learning is driving change in every industry, but the reality of ML engineering is that the ecosystems supporting it are immature. Widely used frameworks are not tuned to use the latest hardware, observability tools do not proactively alert and recommend solutions, environments must be set up and torn down frequently, and performance tuning and diagnostics are often not well understood.

Many AI engineers find themselves using low-cost, decentralized Tier 3 cloud service providers (CSPs) like Vast.ai, RunPod, or Lambda Labs. These providers offer cheaper, readier access to the latest hardware, but they come with limited setup and configuration support, varying levels of reliability, and (sometimes voluminous) data ingress/egress requirements just to get a workload running.

This article explores key pain points in the current ML engineering ecosystem, analyzes the state of observability tools, examines the scale of GPU underutilization, and asks why infrastructure configuration is still largely manual. Let’s dive in.

1. The Pain Points of Machine Learning Engineering on Tier 3 Cloud Providers

Tier 3 cloud providers appeal to ML engineers (MLEs) because they offer cheaper GPU access, better bang-for-buck performance, and greater flexibility than hyperscalers like AWS, GCP, or Azure.

However, with that cost efficiency comes complexity and chaos.

Unlike with AWS SageMaker or GCP Vertex AI, most Tier 3 CSPs hand you a raw virtual machine. Every time a GPU is rented, engineers must:

  • Install system dependencies (e.g., CUDA, cuDNN, Python)
  • Set up Docker or Conda environments and compose Kubernetes/Terraform configs
  • Set up training data access (e.g., via S3 or rsync)
  • Install observability/monitoring SDKs and agents
  • Recreate environment variables, volume mounts, and SSH keys

This creates massive inefficiencies and leads to a lack of reproducibility.
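
To make the repetition concrete, here is a minimal sketch of the kind of bootstrap script teams end up re-running on every rented VM. Everything in it (the package list, the S3 bucket, the paths) is a placeholder rather than a recommended setup, and it assumes the NVIDIA driver and the AWS CLI are already installed:

```python
# Illustrative per-VM bootstrap; all values are placeholders.
import subprocess

def sh(cmd: str) -> None:
    """Run a shell command and fail loudly so broken setups surface early."""
    subprocess.run(cmd, shell=True, check=True)

def bootstrap() -> None:
    sh("nvidia-smi")                                  # sanity-check driver and GPU visibility
    sh("pip install torch torchvision mlflow wandb")  # project dependencies
    sh("aws s3 sync s3://my-training-data /data")     # pull training data (placeholder bucket)
    sh("mkdir -p /checkpoints")                       # local checkpoint directory

if __name__ == "__main__":
    bootstrap()
```

Multiply that by every rental, every provider, and every driver mismatch, and the reproducibility problem becomes obvious.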

2. Observability Tools: MLflow vs Weights & Biases

MLflow

Developed by Databricks, MLflow offers:

  • Lightweight experiment tracking
  • Model versioning and packaging
  • Deployment tools

Benefits:

  • Self-hosted and free to use
  • Python/ML framework integrations
  • Supports various backends

Liabilities:

  • Passive reporting: it records what you log, but never alerts or suggests fixes
  • Limited UI interactivity and real-time visualizations
  • No built-in GPU/system metrics tracking
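
For orientation, this is roughly what MLflow's tracking workflow looks like; the tracking URI, experiment name, and logged values below are placeholders for illustration:

```python
# Minimal MLflow tracking sketch (pip install mlflow); values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # points at a self-hosted tracking server
mlflow.set_experiment("resnet-baseline")

with mlflow.start_run():
    mlflow.log_param("lr", 3e-4)
    mlflow.log_param("batch_size", 128)
    for epoch in range(3):
        # Metrics are recorded passively; nothing alerts you if training stalls.
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
```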

Weights & Biases (W&B)

A commercial platform with:

  • Real-time monitoring
  • Artifact logging
  • Sweep management
  • GPU utilization dashboards

Benefits:

  • Interactive UI
  • Detailed GPU/system metrics
  • Excellent integration with major ML frameworks

Liabilities:

  • No intelligent alerts
  • Vendor lock-in
  • Internet dependency and potentially high cost
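
The equivalent W&B workflow looks similar on the surface, but the run streams to W&B's servers, which is where both the richer dashboards and the lock-in come from. The project name and config values here are placeholders:

```python
# Minimal Weights & Biases sketch (pip install wandb, plus a logged-in API key).
import wandb

run = wandb.init(project="resnet-baseline", config={"lr": 3e-4, "batch_size": 128})
for epoch in range(3):
    # W&B also samples GPU/system metrics in the background for its dashboards.
    wandb.log({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)})
run.finish()
```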

3. GPU Utilization Is Still Terrible — and We Know It

W&B reported in 2023 that average GPU utilization was 35–45%, with one-third of customers using only 15% or less.

Common causes:

  • Data pipeline inefficiencies
  • Checkpointing stalls
  • Blocking calls in training loops
  • No mixed precision or JIT
  • Imbalanced workloads in multi-GPU training
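
Two of these causes, missing mixed precision and blocking host-to-device copies, are often only a few lines to fix. Here is a hedged PyTorch sketch; the model, optimizer, and DataLoader are assumed to be defined elsewhere:

```python
# PyTorch training-loop sketch addressing two causes above: no mixed precision
# and blocking copies. Model, optimizer, and loader are assumed to exist.
import torch

def train_epoch(model, optimizer, loader, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    for x, y in loader:  # the DataLoader should use pin_memory=True for async copies
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()  # scaled loss avoids fp16 gradient underflow
        scaler.step(optimizer)
        scaler.update()
```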

Helpful tools:

  • NVIDIA Nsight / DLProf
  • W&B dashboards
  • PyTorch Profiler / TensorBoard
  • Smart dataloaders
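
As one concrete starting point, the PyTorch Profiler can confirm whether the GPU is actually idle between steps. The toy model and step counts below are purely illustrative, and the resulting trace can be opened in TensorBoard via the torch-tb-profiler plugin:

```python
# Minimal PyTorch Profiler sketch; the model and data are toy placeholders.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),
) as prof:
    for _ in range(6):
        x = torch.randn(256, 512, device="cuda")
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advance the wait/warmup/active schedule each iteration
```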

4. Why Do We Still Rewrite Kubernetes and Terraform for Every Environment?

Tier 3 providers are heterogeneous: base images, driver versions, networking, and storage defaults differ from one vendor, and often from one host, to the next. This results in:

  • Terraform networking rule mismatches
  • K8s manifest incompatibilities
  • Manual rewrites of node selectors, tolerations, storage classes, etc.

We still lack:

  • Scripts for auto-fingerprinting environment characteristics
  • Auto-generation of configs for kubeadm, Helm, Terraform
  • Plug-and-play environment builders
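
None of this is exotic. As a sketch of what auto-fingerprinting could look like (nothing below is an existing tool, and the fields are illustrative), a short script can gather the facts that node selectors, tolerations, and instance sizing usually depend on, then emit them for a templating step:

```python
# Hypothetical environment-fingerprinting sketch; uses only nvidia-smi and the stdlib.
import json
import os
import platform
import shutil
import subprocess

def fingerprint() -> dict:
    """Collect host facts that K8s/Terraform templates typically need."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "gpus": [],
    }
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            name, mem, driver = (field.strip() for field in line.split(","))
            info["gpus"].append({"name": name, "memory": mem, "driver": driver})
    return info

if __name__ == "__main__":
    # The JSON output could feed Helm values or Terraform variables.
    print(json.dumps(fingerprint(), indent=2))
```

Feeding that output into Helm or Terraform is the unglamorous glue that is still largely missing from Tier 3 workflows.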

5. Conclusion: An Opportunity in the Gaps

The MLE landscape is fragmented not from a lack of talent, but from missing tools. Engineers face:

  • Manual, time-consuming VM setups
  • Underutilized GPUs
  • Varied observability based on budget
  • Rewritten infrastructure config per deployment

Opportunities exist to:

  • Auto-generate K8s/Terraform from environment detection
  • Build local-first, declarative, versioned env configs
  • Offer intelligent observability (e.g., dataloader warnings)
  • Make ephemeral ML training as easy as deploying to Heroku

The pain points are clear — and they are also the blueprints for the next generation of ML infrastructure tools.
