Senior SRE Engineer & Coach

  • Full-time

Company Description

nClouds is a certified, award-winning provider of AWS and DevOps consulting and implementation services, and an AWS Premier Consulting Partner. We are an integrated team of skilled engineers, architects, developers, project managers, and sales & marketing professionals who are passionate about software excellence, innovation, and client success. We work with organizations of all sizes, in all industries, including some of the coolest startups and growth companies in Silicon Valley.

Job Description

Are you a battle-tested SRE with the war stories to prove it? Do you have a passion for making systems and teams work with greater resilience and robustness? Are you a master of practices like blameless postmortems, error budget management, chaos engineering, and continuous improvement? Do you have a point of view on how to do SRE “right” and want to share it?

 

We are nClouds, a fully remote AWS solution provider and one of the fastest-growing cloud management solution providers in the USA, and we are looking for you. One of our core values is “Celebrate Experimentation & Resourcefulness,” and with this value we strive to bring the best of SRE practices and skill sets to clients of all sizes. We are looking for a world-class SRE leader who can help us realize this vision and build out the SRE practices and SRE culture for our high-performance support and engineering teams.
 

What We Offer You:

  • A progressive and flexible work environment designed to inspire and enable our team to be the world’s best
  • The freedom to experiment and learn while you design how SRE teams should work
  • The opportunity to solve complex IT reliability problems with a variety of clients 
  • Access to top industry experts in modern IT architecture, DevOps, and cutting-edge technologies. 
  • The flexibility of a fully remote work environment. We have been fully remote since before COVID, building a team of top experts across the globe collaborating on a daily basis. 
  • Highly competitive compensation
  • A wellness coach who can support your eudaimonia and work-life balance
  • Other standard benefits, including vacation and gym reimbursement
  • Access to our internal training programs

 

What You Will Do:

  • Design and evolve the SRE processes for our nClouds SRE practice
  • Coach and mentor the nClouds team on SRE best practices
  • Refine and perfect Incident Management processes and protocols to accelerate response and reduce team stress
  • Devise processes that increase innovation time for engineers
  • Establish a scalable and robust approach for SLO and error budget management across multiple clients
  • Work directly with client-facing teams to improve the reliability of their systems 
  • Foster a blameless culture
  • Develop your SRE-expert status through continuous learning and contributions to our thought leadership

Qualifications

Why you will succeed:

  • You have over 10 years of experience in IT engineering and support
  • You have over 3 years of direct experience with SRE practices such as Incident Management, SLO/Error Budget Management, Blameless Post Mortems, Toil Reduction, code/system reviews, or Chaos Engineering
  • You are passionate about reliability concepts and are well versed in technology and application architecture patterns that increase resilience and robustness.
  • You are a clear communicator with excellent command of written and spoken English.
  • You are an expert with AWS Architecture and Development
  • You are an expert at DevOps, with mastery of using and managing tools for source code management, CI/CD pipelines, artifact management, and more.
  • You are an expert at Infrastructure as Code technologies such as Terraform and Pulumi
  • You are highly proficient at Observability and APM tools such as Datadog, AppDynamics, Splunk, and New Relic
  • You are highly proficient in time-series analysis with tools like ELK and Prometheus.
  • You are highly proficient with Container and Container Orchestration technologies
  • You have the ability to program (structured and OO) with one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  • You are very organized, with impeccable time-keeping and task organization
  • You have a strong understanding of Cloud Architectures and software lifecycle concepts. 
  • You are highly capable of working with a globally distributed team: you have excellent team- and relationship-building abilities, with both internal and external parties
  • You enjoy collaborating with others
  • You are positive, creative, and curious
     

What to Expect:

First Week

  • Start with the onboarding process incorporating you into the SRE Team.
  • Set up all your access and security policies.
  • Learn about nClouds practices, values and solutions
  • Meet the Lead and get familiar with the nClouds SRE offering and current team structure
  • Meet the team and get familiar with the team’s schedule
  • Complete onboarding process.

First Month

  • Complete all assigned trainings.
  • Get assigned to projects and receive the required access
  • Attend Knowledge Transfer Sessions with the Team Lead and other team members
  • Start joining customer calls.

First 3 Months

  • Become fully integrated with the L-1 Support Team and help them resolve clients’ infrastructure and application issues, escalations, tickets, and queries
  • Assist with and oversee the creation and maintenance of Runbooks, post-incident Root Cause Analyses (RCAs), and process documentation.
  • Build a close liaison with clients’ Product and Operations Teams.
  • Develop a clear understanding of clients’ requirements, implement SLIs in line with clients’ SLOs, and ensure that they conform with clients’ SLAs.
  • Coordinate with the support team in implementing comprehensive monitoring of clients’ applications and infrastructure, ensuring strict monitoring of SLIs.
  • Actively participate in the development and implementation of CI/CD, Disaster Recovery, and Backup plans and other relevant processes to ensure achievement of clients’ Service Level Objectives (SLOs)

First Six Months

  • Take ownership of the SRE team’s practices and procedures and actively participate in their improvement.
  • Based on customer feedback, provide recommendations to improve nClouds service offerings.
  • In conjunction with the L-1 team, propose and implement automation of repetitive tasks to reduce/eliminate toil.
  • Collaborate closely with the team in implementing, tracking, and achieving OKRs
  • Validate your skills by earning relevant certifications.
  • Actively participate in nClouds Friday Demos and regularly contribute to initiatives like the nCode library.

Additional Information

Please apply only if you have relevant experience.

We require a Sr. SRE and Coach who can work remotely and in 24/7 rotating shifts.