Diagnose ML training failures in seconds.
Your training job just died. Rank 42 timed out, your entire cluster went dark, and you don't know why. Denpex tells you exactly which GPU failed first, the root cause, and the fix — before you've even opened a terminal.
11.3s
avg diagnosis time
99.7%
accuracy on 25 failure types
$847
avg GPU cost recovered
Works with your stack
Trusted by teams training at scale
500M+
Events/day
99.7%
Accuracy on 25 failure types
11.3s
Avg time to root cause
3.2 hrs
Saved per incident avg
“We ran 32-node DDP jobs that kept dying at step 12k-15k. Spent two weeks thinking it was a networking issue between our IB switches. Denpex flagged Rank 8 hitting OOM from gradient accumulation buffer growth at step 12,847. One line in deepspeed config, hasn't happened since. Still blows my mind it caught that from our NCCL timeout logs.”
“Our FSDP fine-tunes were failing like clockwork every Thursday. Corrupted sample in our dataset that only showed up with certain sequence lengths. Without Denpex we'd have blamed the hardware vendor for another month. It pointed directly to the dataloader. One PyTorch Dataset fix, done.”
“Honestly the biggest win is not the speed. It's having something that gives ML engineers and infra the same answer. When a job crashes at 2am, nobody's arguing about whether it was the network or the code. Denpex says Rank 47 hit a CUDA OOM. Both teams look at that and move on to fixing it instead of blaming each other for four hours.”
The debugging hell you know too well
These are the exact failures that cost teams millions in wasted GPU hours every year.
NCCL timeout is never actually NCCL's fault
When 64 ranks hit an NCCL timeout, the framework blames the communication layer. The actual cause is almost always one rank failing — an OOM, a slow dataloader, a dead NIC — and 63 other ranks waiting at the barrier until the watchdog timer fires. You spend hours debugging InfiniBand when the problem was a corrupted data sample on node 4.
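The cascade described above can be untangled by ordering per-rank events in time and treating timeouts as symptoms. The following is a minimal sketch, not Denpex's implementation. It assumes a hypothetical log line format of `<ISO timestamp> [rank N] <message>`; real launcher output varies, and the keyword lists are illustrative.

```python
import re
from datetime import datetime

# Hypothetical line format: "2024-05-01T02:13:07 [rank 8] CUDA out of memory"
LINE_RE = re.compile(r"^(\S+) \[rank (\d+)\] (.*)$")
# Timeouts are treated as symptoms of another rank's failure, not as causes.
SYMPTOM_RE = re.compile(r"timeout|watchdog", re.IGNORECASE)
ERROR_RE = re.compile(r"error|out of memory|exception|timeout|watchdog", re.IGNORECASE)


def first_failing_rank(lines):
    """Return (rank, message) of the earliest non-timeout error, or None."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or not ERROR_RE.search(m.group(3)):
            continue
        ts = datetime.fromisoformat(m.group(1))
        events.append((ts, int(m.group(2)), m.group(3)))
    events.sort()
    # Prefer the earliest error that is not itself a timeout symptom.
    for _, rank, msg in events:
        if not SYMPTOM_RE.search(msg):
            return rank, msg
    # Fall back to the earliest timeout if nothing else was logged.
    return (events[0][1], events[0][2]) if events else None
```

This only works if node clocks are reasonably synchronized; with skewed clocks the ordering across ranks is unreliable.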
GPU shows 30% free but crashes with OOM
CUDA memory allocators fragment over long training runs. Your GPU reports 28GB free but can't satisfy a 2GB contiguous allocation. The error message is identical to a true OOM, so you reduce batch size, relaunch, and crash again two hours later with the same error. The fix is a single environment variable.
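A rough way to tell fragmentation from a true OOM is to compare what the allocator has reserved with what live tensors actually use. The sketch below is kept torch-free so it can run anywhere; in PyTorch the two inputs would come from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`. The 20% threshold is an arbitrary assumption, and `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is one documented PyTorch allocator setting that often helps; the text above does not say which variable it has in mind.

```python
import os


def fragmentation_ratio(reserved_bytes, allocated_bytes):
    """Fraction of allocator-reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes


def looks_fragmented(reserved_bytes, allocated_bytes, request_bytes, threshold=0.2):
    """Heuristic: enough cached-but-unused memory exists in total,
    yet a request of request_bytes still failed."""
    unused = reserved_bytes - allocated_bytes
    ratio = fragmentation_ratio(reserved_bytes, allocated_bytes)
    return unused >= request_bytes and ratio >= threshold


def enable_expandable_segments(env=os.environ):
    """Set the allocator option. This must happen before the first CUDA
    allocation, ideally before importing torch."""
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env
```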
Model trains for 3 days and the weights are garbage
A degraded GPU matrix-multiply unit produces slightly incorrect results — no crash, no error message, just wrong math. Because AllReduce broadcasts gradients across all ranks, one corrupted gradient poisons the entire cluster. Your loss curve stalls or diverges. You find out days later when you try to evaluate the model.
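One common way to catch this kind of silent corruption is to have every rank run the same deterministic test computation and compare the results. The sketch below covers only the comparison step: given one checksum per rank, it flags the ranks that disagree with the majority. How the checksums are produced and gathered (for example, hashing a fixed-seed matmul result and collecting it with an all-gather) is assumed and not shown.

```python
from collections import Counter


def suspect_ranks(checksums):
    """Given {rank: checksum} from an identical deterministic test computation,
    return the sorted ranks whose checksum differs from the majority value."""
    if not checksums:
        return []
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return sorted(rank for rank, value in checksums.items() if value != majority)
```

Bitwise comparison like this assumes identical hardware, drivers, and kernels on every rank; on mixed clusters a numeric tolerance would likely be needed instead.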
Cluster OOMs on restart because dead jobs still hold VRAM
A distributed job crashes and the framework fails to clean up child processes across all nodes. These phantom processes silently hold GPU memory. Your restart immediately crashes with OOM. You SSH into 16 nodes manually, run kill -9 on zombie processes, and then try again.
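The manual SSH-and-kill routine above can be partly scripted. As a hedged sketch, the helper below parses the output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` and reports processes that hold GPU memory but do not belong to a job you still track. The `live_job_pids` set is a hypothetical input; how you obtain it depends on your scheduler.

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse the CSV output of
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits
    into a list of (pid, used_mib) tuples. Malformed rows are skipped."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue
        pid, used = (field.strip() for field in row)
        if pid.isdigit() and used.isdigit():
            rows.append((int(pid), int(used)))
    return rows


def find_zombies(csv_text, live_job_pids):
    """Return processes holding GPU memory that are not part of a tracked job."""
    return [
        (pid, mib)
        for pid, mib in parse_compute_apps(csv_text)
        if pid not in live_job_pids
    ]
```

The sketch deliberately stops at reporting. Killing the returned PIDs is left to the operator, since a wrong `live_job_pids` set would take down healthy jobs.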
Setting TORCH_DISTRIBUTED_DEBUG makes the bug disappear
You enable verbose distributed debugging on your hanging DDP job. The logging I/O alters execution timing. Your bug vanishes. You disable the flag. It crashes again. You have learned nothing. This is a Heisenbug — an observer effect created by the debug tool itself — and it afflicts nearly every multi-GPU DDP hang investigation.
Slow disk I/O on one node causes NCCL timeout on all nodes
A dataloader worker on node 7 reads a corrupted or unusually large training sample. That GPU takes 400ms longer on its forward pass. The other 999 GPUs finish and wait at the AllReduce barrier. The watchdog timer fires. NCCL timeout. Your monitoring shows all nodes healthy. The error points to the network. The real culprit is a slow NVMe read on one machine.
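A straggler like this shows up clearly if each rank records its own step time before the collective. The following sketch assumes you already have a `{rank: step_time_ms}` mapping for one step (how it is collected is not shown) and flags ranks that exceed the median by more than a tolerance. The 100 ms default is an illustrative assumption, not a recommended value.

```python
from statistics import median


def find_stragglers(step_ms_by_rank, tolerance_ms=100.0):
    """Return {rank: excess_ms} for ranks slower than the median step time
    by more than tolerance_ms."""
    if not step_ms_by_rank:
        return {}
    typical = median(step_ms_by_rank.values())
    return {
        rank: ms - typical
        for rank, ms in step_ms_by_rank.items()
        if ms - typical > tolerance_ms
    }
```

Using the median rather than the mean keeps a single very slow rank from shifting the baseline it is compared against.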
Triton compilation failures produce 800-line C++ stack traces
torch.compile pushes your model through Triton or Inductor backends for optimization. When it fails, you get an 800-line C++ exception with no reference to your Python code. Engineers report testing models line-by-line in a Python debugger for hours trying to identify which operator triggered the compilation failure.
DeepSpeed ZeRO-3 silently saves partial weights
ZeRO-3 and FSDP shard model weights across GPUs for memory efficiency. Saving a checkpoint requires coordinating all shards. If one rank fails or disconnects mid-save, the resulting checkpoint is silently corrupted — partial weights, missing optimizer states. You discover this 18 hours later when you try to resume.
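A simple guard against partial checkpoints is to write a manifest only after every shard has been written, then verify it before resuming. This is a generic sketch under stated assumptions: the `MANIFEST.json` name and the shard file names are hypothetical, and neither DeepSpeed nor FSDP writes such a file by default.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large shards do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(ckpt_dir, expected_shards):
    """Call only after every rank has finished writing its shard.
    Raises FileNotFoundError if a shard is missing, so no manifest is written."""
    ckpt_dir = Path(ckpt_dir)
    manifest = {name: sha256_of(ckpt_dir / name) for name in expected_shards}
    (ckpt_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))


def verify_checkpoint(ckpt_dir):
    """Return a list of problems; an empty list means the checkpoint looks complete."""
    ckpt_dir = Path(ckpt_dir)
    manifest_path = ckpt_dir / "MANIFEST.json"
    if not manifest_path.exists():
        return ["manifest missing: save did not complete"]
    problems = []
    for name, expected in json.loads(manifest_path.read_text()).items():
        shard = ckpt_dir / name
        if not shard.exists():
            problems.append(f"missing shard: {name}")
        elif sha256_of(shard) != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

Because the manifest is written last, its absence is itself the signal that a save was interrupted.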
Infra team says cluster is healthy. ML team says model is failing.
Your GPU metrics look clean — utilization 94%, temperatures normal, network healthy. But your loss curve is diverging and ranks are hanging. The hardware monitoring layer and the ML observability layer speak completely different languages. Every incident becomes a war room where infrastructure engineers and ML engineers blame each other.
The run that worked yesterday fails today and you don't know what changed
You didn't change your model. You didn't change your data. But your loss is diverging and last week's checkpoint won't reproduce. The culprit is usually invisible: a framework update in your Docker image, a subtly different dataset shuffle, batch size that changed with node count. You spend hours diffing configs manually. Often you never find it.
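Manual config diffing gets much easier if every run records a flat snapshot of its environment at launch. The sketch below compares two such snapshots; the keys and values in the usage example are made up for illustration, and what goes into a snapshot is up to you.

```python
def diff_runs(previous, current):
    """Compare two flat {key: value} run snapshots.
    Returns {key: (old, new)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(previous) | set(current)):
        old = previous.get(key, "<absent>")
        new = current.get(key, "<absent>")
        if old != new:
            changes[key] = (old, new)
    return changes
```

For example, comparing `{"torch": "2.3.0", "nodes": 8, "global_batch": 512}` with `{"torch": "2.3.1", "nodes": 16, "global_batch": 1024}` surfaces all three changes at once, including the batch size that moved with the node count.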
Simple pricing. No surprises.
Start free. No credit card required.
Free
For individual researchers and small experiments.
- ✓ 3 diagnoses per month
- ✓ Manual log paste (web UI)
- ✓ 15-type failure classification
- ✓ Prescriptive fix output
- ✓ 7-day history
- ✓ 1 seat
Team
For ML teams running regular training jobs.
- Everything in Free, plus:
- ✓ Unlimited diagnoses
- ✓ Up to 64 GPUs monitored
- ✓ Slack + email alerts
- ✓ iMessage/SMS notifications (Twilio)
- ✓ Multi-rank cascade analysis
- ✓ Cross-run comparison (last 5 runs)
- ✓ Team knowledge base (shared fixes)
- ✓ 5 seats
- ✓ 90-day history
Scale
For scale-ups and serious training infrastructure.
- Everything in Team, plus:
- ✓ Up to 512 GPUs monitored
- ✓ On-premise agent (logs never leave your cluster)
- ✓ Silent data corruption (SDC) detection
- ✓ Straggler and gray failure detection
- ✓ Zombie process detection + auto-kill
- ✓ Checkpoint weight delta analysis (per-layer instability trace)
- ✓ Cross-run comparison (unlimited run history)
- ✓ Version compatibility database (PyTorch × CUDA × cuDNN)
- ✓ Checkpoint integrity validation
- ✓ Unlimited seats
- ✓ Priority support (4-hr SLA)
- ✓ 1-year history
Data Center
For GPU cloud providers and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓ Unlimited GPUs
- ✓ White-label and OEM options
- ✓ Multi-tenant deployment
- ✓ Dedicated Customer Success Manager
- ✓ 99.9% uptime SLA with credits
- ✓ GDPR, HIPAA, SOC 2 Type II compliance
- ✓ Log PII/PHI masking (configurable)
- ✓ Custom knowledge base ingestion
- ✓ Integration with SLURM, Ray, Kubernetes schedulers
- ✓ Predictive failure scoring
- ✓ Auto-remediation engine
- ✓ Custom contracts, invoicing, and procurement
Frequently asked questions
What failure types do you detect?
CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, straggler detection, zombie processes, weight delta anomalies, and more. 15 common failure types with prescriptive fixes.
How accurate is the diagnosis?
For common failure patterns, Denpex uses regex-based matching for high accuracy. Novel failures get LLM-inferred diagnosis with a confidence score so you know the reliability.
Do you store our logs?
Logs are processed and deleted after diagnosis. We don't store raw training data. Diagnosis metadata (failure types, frequency) helps improve our pattern database.
What frameworks do you support?
PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.
How fast is the diagnosis?
Most diagnoses complete in under 12 seconds. Paste your logs, get the root cause and fix recommendation immediately.
Paste your logs. Get a diagnosis in seconds.
Start diagnosing failures in seconds
Free for your first 3 diagnoses. No credit card required.
The fix arrives on your phone before you open a terminal
One message. One root cause. No noise.
When your 1,000-GPU cluster fails, Denpex doesn't send 1,000 alerts. It correlates the cascade, identifies the single root cause, masks any sensitive data, and sends one message directly to your phone with the exact fix — before you've finished your first sip of coffee.
Works with iMessage, SMS, Slack, PagerDuty, and webhook. One message per incident, always.
How Denpex diagnoses failures in under 12 seconds
Paste your logs. Get a diagnosis. Fix the problem.
Paste your logs
Copy the error output from your training job. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl. Paste it into the diagnosis box.
Get instant diagnosis
Denpex pattern-matches your logs against known failure types. For common issues like CUDA OOM, NCCL timeout, gradient explosion, and checkpoint corruption, you get an instant match with the root cause and fix.
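Pattern matching of this kind can be pictured as a table of regular expressions applied to the pasted log. The sketch below is illustrative only: the patterns are a small hand-picked set based on common PyTorch and NCCL error strings, not Denpex's actual pattern database, and the failure-type names are hypothetical.

```python
import re

# Illustrative patterns only; a production database would be far larger.
PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("nccl_timeout", re.compile(
        r"NCCL.*(timeout|timed out)|Watchdog caught collective operation timeout",
        re.IGNORECASE,
    )),
    ("device_assert", re.compile(r"device-side assert triggered", re.IGNORECASE)),
    ("import_error", re.compile(r"ModuleNotFoundError|ImportError")),
    ("gradient_explosion", re.compile(r"\bloss\b.*\b(nan|inf)\b", re.IGNORECASE)),
]


def classify(log_text):
    """Return the names of all known failure types found in the log,
    in PATTERNS order. An empty list means no known pattern matched."""
    return [name for name, pattern in PATTERNS if pattern.search(log_text)]
```

An empty result is the point at which a fallback path, such as the AI analysis described on this page, would take over.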
15 failure types covered
CUDA OOM, memory fragmentation, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, silent hangs, and more. Each with prescriptive fixes.
Unknown failures get AI analysis
If your failure doesn't match a known pattern, Denpex uses AI to analyze and suggest what happened. Always get a next step, even for novel errors.
This error pattern doesn't match known failures. AI analysis suggests checking memory allocator configuration and batch size settings.
One line. No config. Works on your next failure.
import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"],  # optional
)

# The rest of your training code is unchanged
trainer.train()
What makes Denpex different
Diagnose from paste-only logs
No agent or integration required. Paste your error output, get a diagnosis. Works with any framework: PyTorch, DeepSpeed, Megatron, Axolotl, or whatever you're using.
Prescriptive fixes, not just error codes
Denpex doesn't just tell you what went wrong; it tells you how to fix it. Every diagnosis comes with a specific env var to change, config to update, or checkpoint to resume from.
15 common failure types covered
CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, straggler detection, zombie processes, weight delta anomalies, and more — all with exact pattern matching and known fixes.
Instant diagnosis — no waiting
Pattern matching runs in seconds. No AI hallucination risk for known failures. Novel errors get AI analysis with confidence scores so you know how reliable the suggestion is.
Real cluster failures, real cost
Training infrastructure is expensive. Failures are expensive. Denpex helps you diagnose faster.
466 failures
in 54 days
Meta Llama 3.1-405B on 16,384 H100 GPUs. Real training runs, real failures.
57 vs 25 days
actual vs ideal runtime
Meta OPT-175B training. Failures add 32 days of delay.
89.9%
of failures need 3+ hours
Huawei Cloud 2025. Most failures require manual investigation.
$847
avg GPU cost recovered
Per incident on a 64-GPU cluster. Diagnose faster, waste less.
See it in action
Compatible with your stack
Training Frameworks
- PyTorch DDP
- PyTorch FSDP
- DeepSpeed ZeRO-1/2/3
- Megatron-LM
- Axolotl
- LlamaFactory
- Unsloth
- NeMo
Compute Platforms
- AWS SageMaker
- GCP Vertex AI
- Azure ML
- Lambda Labs
- CoreWeave
- SLURM clusters
- Kubernetes + Ray
- On-premise
Built for every team running distributed training
Stop losing training runs to failures your senior engineers used to debug in 4 hours
Research labs burn GPU budget at a rate that demands zero tolerance for manual debugging. When a 70B fine-tune dies at step 45,000 — 18 hours in — you can't afford to spend another 4 hours finding out why. Denpex tells you the root cause in seconds, with the exact fix, so you resume from checkpoint immediately instead of rerunning from scratch.
Your training logs contain your IP. We treat them that way.
PII/PHI Masking
Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
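Client-side masking of this kind usually comes down to a set of regular-expression substitutions applied before anything is sent. The sketch below is a minimal illustration, not the Denpex masking engine: it covers only email addresses and US SSN-shaped numbers, and the patterns are assumptions. Names and medical record patterns need more than simple regexes and are not handled here.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_log(text):
    """Replace email addresses and US SSN-shaped numbers with [MASKED]."""
    text = EMAIL_RE.sub("[MASKED]", text)
    return SSN_RE.sub("[MASKED]", text)
```

Running the masking before transmission, rather than on the server, is what keeps the matched values inside your environment.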
On-Premise Option
Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.
Encryption
All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on the Data Center plan.
Compliance
Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on the Data Center plan.
What's coming next
We're building the infrastructure layer for production ML training.
Pre-flight cluster health scan
Run a health check before launching an expensive job. Detect stale zombie processes, GPU memory fragmentation, version incompatibilities, and network partition issues before you waste a single training step.
Predictive failure scoring
Machine learning on your cluster's telemetry to predict GPU degradation, thermal throttling onset, and memory leak trajectories hours before they cause a failure.
Auto-remediation engine
For confirmed fix types, Denpex can automatically apply the fix: set the environment variable, kill zombie processes, adjust checkpointing frequency, and trigger a checkpoint resume — without human intervention.
Enhanced checkpoint integrity validator
Before relying on a checkpoint saved 18 hours ago, validate that it is loadable, complete, and consistent. Prevent the worst scenario: discovering your only resume point is corrupted after a failure.