Senior Cloud Operations Engineer

  • Full-time

Company Description

The Linux Foundation is a driving force in fostering open-source collaboration and supporting communities across a range of projects, including PyTorch. We're dedicated to enhancing and expanding our infrastructure to meet the growing demands of PyTorch and related AI projects. We are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.

Job Description

As a Senior Cloud Operations Engineer for the PyTorch project, your primary responsibility will be to enhance and maintain the infrastructure that supports PyTorch's development and research efforts. You will work on automating infrastructure management, implementing cloud-native tools, and optimizing cloud solutions. Additionally, you will play a crucial role in managing infrastructure from the PyTorch community and contributing to an on-call rotation.

Responsibilities

  • Design and modernize a globally distributed, fault-tolerant infrastructure specifically tailored for the PyTorch project.
  • Work with Terraform, AWS, Kubernetes, CloudFormation, and other cloud-native tools to optimize infrastructure.
  • Use Ternary to apply FinOps best practices, focusing on PyTorch requirements.
  • Optimize Datadog monitoring and reporting to enhance the performance of the PyTorch infrastructure.
  • Implement CI/CD pipelines using GitHub Actions to facilitate rapid and reliable software delivery for PyTorch.
  • Develop and maintain configuration management and infrastructure as code (IaC) practices for the PyTorch project.
  • Ensure that security and compliance standards are met within the PyTorch infrastructure.
  • Implement disaster recovery and backup solutions to safeguard PyTorch project data.
  • Collaborate closely with PyTorch software development teams to enable continuous integration and continuous delivery (CI/CD).
  • Participate in an on-call rotation to ensure the availability of the PyTorch project's infrastructure.
  • Continuously drive improvement initiatives, optimize processes, and identify opportunities for enhancing the PyTorch infrastructure.

Qualifications

  • Bachelor’s degree in Computer Science or related field, or equivalent work experience.
  • Minimum of 10 years of relevant technical experience, preferably in high-availability environments.
  • Exceptional collaboration skills in a team-oriented environment.
  • Proficiency in administering large multi-tenant production environments.
  • Strong experience with Terraform, AWS, CloudFormation, and related technologies as they pertain to PyTorch.
  • Solid background in shell scripting.
  • Excellent English communication skills, both written and verbal.
  • Prior experience in a highly visible role.
  • Ability to work effectively with widely distributed teams.
  • Familiarity with open source projects and their development processes.

Additional Information

The Linux Foundation is a largely all-remote workforce that hires top-notch talent. We are as passionate about providing a flexible and supportive work culture as we are about Open Source Software. Collaboration is in our DNA, and we pride ourselves on being able to work closely together while not being tied to an office.

The Linux Foundation is an Equal Opportunity Employer.