Infrastructure SRE - HPC
Sarvam AI
About Sarvam
Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.
About the Role
Sarvam runs a large, multi-vendor GPU fleet that serves two demanding workloads on the same physical infrastructure: training jobs that span hundreds of GPUs and must run uninterrupted for weeks, and inference services that must hold a flat p99 under production load. Keeping both healthy at once is a hard, specialized reliability problem, and it is the problem this team exists to solve.
This is not a Kubernetes administration role. We assume Kubernetes fluency as a baseline. The difficulty lies above and below it - in parallel filesystems under heavy checkpoint load, in RDMA fabrics that degrade quietly, in NCCL hangs whose root cause may be the network or the kernel, in driver and firmware drift across heterogeneous hardware, and in distributed training failures that masquerade as infrastructure faults.
We are hiring a team of specialists rather than a set of identical generalists. This posting covers five areas of focus. We expect candidates to bring genuine depth in one and working fluency across the others, because on a shared fleet a storage problem often first appears as a training hang, and the engineer on call must route an incident correctly before anyone can resolve it.
When you apply, please indicate the area of focus that best matches your experience. Strong generalists are welcome; we will place you where your depth is most useful.
What You’ll Do
- Operate the GPU fleet end to end across training and serving - provisioning, observability, capacity, and fleet health.
- Hold a meaningful on-call rotation, write runbooks that hold up under pressure, and drive postmortems that produce durable fixes.
- Build the internal tooling the team relies on, rather than operating off-the-shelf systems alone.
- Partner with ML and platform teams to keep large runs alive and serving latency predictable.
What We're Looking For
- 5+ years in infrastructure or site reliability engineering, including 2+ years operating GPU clusters at scale.*
- Demonstrated on-call ownership of infrastructure that mattered, with a track record of postmortems that led to real change.
- Proficiency in Python or Go, used to build and maintain internal tooling.
- Working fluency across all five areas of focus below - enough to recognize, triage, and route a problem outside your specialty, even if the fix belongs to a teammate.
- For the Storage and Fabric areas of focus, we will weigh deep domain expertise against the GPU-cluster requirement; exceptional specialists with less direct GPU-fleet time are encouraged to apply.
Bonus Points
- Slurm and Kubernetes hybrid environments.
- On-premise GPU deployment, including coordination with datacenter operations on power, cooling, and InfiniBand cabling.
- Experience with Indian NCPs, DGX SuperPOD, Lambda, CoreWeave, NeevCloud, etc.
- Multi-tenant GPU isolation (MIG, MPS, time-slicing) in production.
Why Sarvam?
Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.
- Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar.
- High ownership and high impact, from day one.
- Everything we do is AI-first, from the way we build and ship to the way we think about problems.
- You can work on problems that could change how an entire country learns, works, and communicates.
If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.
Posted 26 Jun 2026.