Associate Staff Engineer, DevOps

Full-time

Service Region: South Asia

Company Description

👋🏼 We're Nagarro

We are a Digital Product Engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale — across all devices and digital mediums, and our people exist everywhere in the world (17500+ experts across 39 countries, to be exact). Our work culture is dynamic and non-hierarchical. We are looking for great new colleagues. That is where you come in!

Job Description

Requirements:

Experience: 5+ years
Strong experience in DevOps or Site Reliability Engineering (SRE) roles.
Strong knowledge of Docker, Kubernetes, Terraform, and CI/CD pipelines.
Hands-on experience with AWS, Azure, or other cloud platforms.
Familiarity with GPU infrastructure and ML workloads is a plus.
Good understanding of monitoring and logging systems (Prometheus, Grafana).
Ability to collaborate with ML teams for optimized inference and deployment.
Strong troubleshooting and problem-solving skills in high-scale environments.
Knowledge of infrastructure security best practices, cost optimization, and performance tuning.
Exposure to vector databases and AI/ML deployment pipelines is highly desirable.

Responsibilities:

Maintain and manage Kubernetes clusters, AWS/Azure environments, and GPU infrastructure for high-performance workloads.
Design and implement CI/CD pipelines for seamless deployments and faster release cycles.
Set up and maintain monitoring and logging systems using Prometheus and Grafana to ensure system health and reliability.
Support vector database scaling and model deployment for AI/ML workloads.
Collaborate with ML engineering teams to optimize inference performance and resource utilization.
Ensure high availability, security, and scalability of infrastructure across multiple environments.
Automate infrastructure provisioning and configuration using Terraform and other IaC tools.
Troubleshoot production issues and implement proactive measures to prevent downtime.
Continuously improve deployment processes and infrastructure reliability through automation and best practices.
Participate in architecture reviews, capacity planning, and disaster recovery planning.
Drive cost optimization initiatives for cloud resources and GPU utilization.
Stay up to date with emerging technologies in cloud-native computing, AI infrastructure, and DevOps automation.

Qualifications

Bachelor’s or master’s degree in Computer Science, Information Technology, or a related field.
