Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Remote Worldwide
Hiring now

Job Description:

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.

Requirements:

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:

  • Medical, dental, and vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions

