Senior Software Engineer, Site Reliability Engineering (SRE)

  • San Francisco, CA, USA
  • Employees can work remotely
  • Full-time

Job Description

Who We Are

 

Twitter’s revenue organization operates services at massive scale. We are looking for a Site Reliability Engineer (SRE) to join the Embedded Revenue SRE Team and support the MoPub Team.

 

MoPub is the world’s largest mobile application advertising exchange and complete ad serving platform. From individual developers to the largest names in mobile apps and games, our customers span the globe and generate tens of billions of ad requests a day.

 

If you are passionate about working with and contributing to such world-class systems, we invite you to join Revenue SRE and embed with the MoPub Team!

 

Who You Are

 

  • You are passionate about operational excellence and thrive in an environment where you are able to provide an extremely high degree of customer support through rigorous customer focus and continuous engagement.
     

  • You are a tenacious problem solver and will own issues completely from end-to-end until fully resolved.
     

  • You understand the unique value and opportunity of working in very close proximity to the customer, and place a high value on aligning your work closely with the customer through shared deliverables, accountability, goals, and incentives.

 

  • You are an expert consumer or “power user” of tools, processes, and systems that you use to get your work done.
     

  • You have excellent communication skills (written and verbal) and great attention to detail, and can present technical concepts in an authoritative and clear manner to customers and partners.
     

  • You place an extremely high value on your relationships with your customers and have excellent human relations and organizational skills.
     

  • You are excellent at managing your time and are comfortable making the decisions within your scope of ownership that are necessary to achieve the deliverables you have committed to, or escalating to your manager when needed.

 

What You’ll Do
 

  • Actively participate in and contribute to code reviews and technical design documents, with an eye toward identifying performance and reliability bottlenecks.
     

  • Perform capacity planning and analysis, and infrastructure change management (including tuning, reshaping, resizing, and migrating infrastructure), for services and their immediate downstreams. You will have a comprehensive view of system interactions in the local ecosystem, providing valuable feedback and insight to broader capacity planning and event preparation efforts.
     

  • Join SWE service owners on large in-progress engineering projects, including migrating to the latest Twitter technologies, adopting related best practices, and constantly innovating.

 

  • Collaborate with SWE service owners to “productionize” new services and features and to improve the production landscape for existing services, providing SRE expertise and implementing best practices in the areas of CI/CD, dashboard integrity, and the ongoing identification and evaluation of the right set of alerts, SLOs, and error budgets for each service.

 

How You’ll Work

 

  • Partner as a teammate with Software Engineering (SWE) counterparts and take an active role as a co-owner of resilient production services and data pipelines.

 

  • Partner with others in SRE and SWE to leverage tools, processes, and techniques to improve the reliability of services by enabling operations to scale sublinearly and reducing business risk in areas that include: infrastructure & configuration management, deploys, capacity modeling & planning, and incident handling, mitigation, root cause analysis, and postmortems.

 

  • Identify common patterns in the challenges of operating services in production, and partner with others in SRE and SWE to design and implement reusable solutions and/or other cross-functional work that drives down the complexity, difficulty, costs, and risks of operating the business.

 

Technical Skills

 

  • 3+ years of experience managing, diagnosing, and debugging large-scale distributed systems in production, including: dynamic web servers; relational and non-relational databases; cache; pubsub; containers; and mechanisms associated with resiliency, graceful degradation, disaster recovery, and backpressure.
     

  • Demonstrable knowledge of TCP/IP, HTTP, and web application security; experience supporting multi-tier web application architectures; and expert-level understanding of Linux servers, specifically RHEL/CentOS.

 

  • Experience using containerization and orchestration software such as Mesos, Kubernetes, or Docker.

 

  • Strong knowledge of, and ability to talk confidently about, tools, methodologies, and analysis techniques in the areas of infrastructure and configuration management; version control (Git); CI/CD automation (Jenkins); systems and application monitoring; root cause analysis; and capacity planning (redline testing, load testing) in a distributed systems environment.

 

  • Nice to have, but not strictly required: Ansible, Elasticsearch, and AWS (IAM, VPC management, Lambda, CloudWatch).

 

  • Experience developing infrastructure, configuration, and deployment scripting and automation for large-scale, high-complexity services in a microservice environment.

 

  • Experience working with large data sets that informs how you build robust data pipelines and architectures, as well as experience tuning Java applications.

 

  • Practical, proven knowledge of at least one higher-level language (Python, Go, Ruby, or similar).

 

  • B.S. in Computer Science or equivalent experience.

Additional Information

All your information will be kept confidential according to EEO guidelines.
