Senior Site Reliability Engineer (Remote)

  • Full-time

Company Description

nClouds is a credentialed, award-winning provider of DevOps and cloud professional services, products, and solutions, specializing in modern infrastructures on AWS. We work as an extension of our clients and love tackling their stickiest challenges, so that they can deliver innovation faster and create awesome customer experiences.

Job Description

The SRE team is responsible for availability, reliability, performance, monitoring, change management, and emergency response for infrastructure or applications, and for reducing manual work by implementing SRE principles and practices. The SRE team works directly with Dev/DevOps, Operations, Product, and other teams to deploy new features and changes and to maintain infrastructure, operations, CI/CD, and IaC, achieving the availability and reliability needed to protect SLOs and SLAs. We use a variety of DevOps automation tools such as Ansible, Docker, Kubernetes, Terraform, and Jenkins, along with cloud vendor-specific tools such as ECS, CloudFormation, EKS, OpsWorks, and Elastic Beanstalk. The SRE is capable of implementing observability, SLOs, SLIs, SLAs, and disaster recovery and backup plans in cloud environments, mainly AWS. DBA experience is a must.

Key Responsibilities:

  • Ensure the availability and reliability of distributed systems. 

  • Help the L1 team resolve clients’ infrastructure/system issues, escalations, alerts, tickets, and queries.

  • Work as a bridge between DevOps and other teams in order to build and maintain resilient systems.

  • Conduct, coordinate, and oversee post-incident Root Cause Analyses (RCAs) and reviews.

  • Build and maintain documentation for all assigned clients/projects. 

  • Leverage DevOps and Agile methodologies and standards in day-to-day work.

  • Adopt and propose automation of repetitive tasks to reduce/eliminate toil.

  • Implement and troubleshoot using observability tools like Datadog, New Relic, Splunk, CloudWatch, etc. 

  • Adopt SRE practices and ensure that the team follows them.

  • Maintain AWS-managed resources, CI/CD, and IaC.

  • Plan and implement disaster recovery and backup plans for AWS cloud platforms.

  • Proactively work on efficiency and capacity planning.  

  • Proactively spot problems, areas for improvement, and performance bottlenecks.

  • Liaise and work closely with the L1 on-call support, DevOps, and Operations teams.

  • Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues.

 

Qualifications

Required Qualifications:

  • Bachelor’s degree in computer science (preferred) or an equivalent management, technical, or scientific discipline.

  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.

  • A clear understanding of SRE principles and practices and Agile and DevOps methodologies.

  • Experience applying the AWS Well-Architected Framework to implement scalable and reliable infrastructure.

  • Great team player with the flexibility to work in 24/7 rotating shifts.

  • Excellent written/verbal communication and leadership skills.

What to Expect:

First Week

  1. Start the onboarding process that incorporates you into the SRE team.
  2. Set up all your access and security policies.
  3. Learn about nClouds’ practices, values, and solutions.
  4. Meet the Lead and get familiar with nClouds’ SRE offering and the current team structure.
  5. Meet the team and get familiar with the team’s schedule.
  6. Complete the onboarding process.

First Month

  1. Complete all assigned training.
  2. Get assigned to projects and have the required access arranged.
  3. Attend knowledge transfer sessions with the Team Lead and other team members.
  4. Start joining customer calls.

First 3 Months

  1. Become fully integrated with the L1 support team and help them resolve clients’ infrastructure and application issues, escalations, tickets, and queries.
  2. Assist with and oversee the creation and maintenance of runbooks, post-incident Root Cause Analyses (RCAs), and process documentation.
  3. Build a close liaison with clients’ Product and Operations teams.
  4. Develop a clear understanding of clients’ requirements, implement SLIs in line with clients’ SLOs, and ensure that they conform to clients’ SLAs.
  5. Coordinate with the support team to implement comprehensive monitoring of clients’ applications and infrastructure, ensuring strict monitoring of SLIs.
  6. Actively participate in the development and implementation of CI/CD, disaster recovery and backup plans, and other relevant processes to ensure that clients’ Service Level Objectives (SLOs) are achieved.

First Six Months

  1. Take ownership of the SRE team’s practices and procedures and actively participate in their improvement.
  2. Based on customer feedback, provide recommendations to improve nClouds’ service offerings.
  3. In conjunction with the L1 team, propose and implement automation of repetitive tasks to reduce/eliminate toil.
  4. Collaborate closely with the team in implementing, tracking, and achieving OKRs.
  5. Validate your skills by earning relevant certifications.
  6. Actively participate in nClouds Friday Demos and regularly contribute to initiatives like the nCode library.

Additional Information

Why nClouds? We are a diverse team of skilled and experienced professionals collaborating to solve client challenges. We share an ethos that defines nClouds, our mission, our attitude towards clients, and what it means to be a member of the team:

  • Partnerships based on shared goals.

  • Challenge the status quo.

  • Innovation culture that delivers client value.

  • Employees are the foundation of our success.

nClouds is committed to building an outstanding team. That means identifying, hiring, retaining, nurturing, and rewarding great talent that contributes to our mission and the success of our clients.

nClouds is committed to a diverse and inclusive workplace. nClouds is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status.

Position Type/Expected Hours of Work

This is a full-time, remote (work-from-home) position.

Days and hours of work are Monday through Friday during normal business hours, and at other times as needed to complete job duties.

Other Duties

Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Duties, responsibilities and activities may change at any time with or without notice.