Associate Staff Engineer, DevOps
- Full-time
- Service Region: South Asia
Company Description
👋🏼 We're Nagarro
We are a Digital Product Engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale — across all devices and digital mediums, and our people exist everywhere in the world (17500+ experts across 39 countries, to be exact). Our work culture is dynamic and non-hierarchical. We are looking for great new colleagues. That is where you come in!
Job Description
Requirements:
- Experience: 5+ years
- Strong experience in DevOps or Site Reliability Engineering (SRE) roles.
- Strong knowledge of Docker, Kubernetes, Terraform, and CI/CD pipelines.
- Hands-on experience with AWS, Azure, or other cloud platforms.
- Familiarity with GPU infrastructure and ML workloads is a plus.
- Good understanding of monitoring and logging systems (Prometheus, Grafana).
- Ability to collaborate with ML teams for optimized inference and deployment.
- Strong troubleshooting and problem-solving skills in high-scale environments.
- Knowledge of infrastructure security best practices, cost optimization, and performance tuning.
- Exposure to vector databases and AI/ML deployment pipelines is highly desirable.
Responsibilities:
- Maintain and manage Kubernetes clusters, AWS/Azure environments, and GPU infrastructure for high-performance workloads.
- Design and implement CI/CD pipelines for seamless deployments and faster release cycles.
- Set up and maintain monitoring and logging systems using Prometheus and Grafana to ensure system health and reliability.
- Support vector database scaling and model deployment for AI/ML workloads.
- Collaborate with ML engineering teams to optimize inference performance and resource utilization.
- Ensure high availability, security, and scalability of infrastructure across multiple environments.
- Automate infrastructure provisioning and configuration using Terraform and other IaC tools.
- Troubleshoot production issues and implement proactive measures to prevent downtime.
- Continuously improve deployment processes and infrastructure reliability through automation and best practices.
- Participate in architecture reviews, capacity planning, and disaster recovery strategies.
- Drive cost optimization initiatives for cloud resources and GPU utilization.
- Stay up to date with emerging technologies in cloud-native computing, AI infrastructure, and DevOps automation.
Qualifications
Bachelor’s or master’s degree in Computer Science, Information Technology, or a related field.