Site Reliability Engineer - Revenue Engineering

  • Full-time

Company Description

Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.

Job Description

Who We Are

Twitter’s Revenue organization operates services at massive scale. We are looking for a Site Reliability Engineer (SRE) to join the Embedded Revenue SRE Team.

The Revenue SRE team has many missions. We provide a healthy user experience while serving highly relevant and personalized ads.  We also help advertisers translate their marketing strategy to an audience on Twitter while protecting the safety of their brand.

To achieve this, we use state-of-the-art open-source and proprietary technologies, and build and operate some of the world’s largest and most complex distributed systems. We embed deeply with development teams, sharing oncall with a focus on upleveling services and increasing automation of all kinds.

If you are passionate about working with and contributing to such world-class systems, we invite you to join Revenue SRE.

(More info: Resilient Ad Serving at Twitter Scale)

 

How You’ll Work

  • Work closely with Software Engineering (SWE) counterparts and take an active role as a co-owner of production services to ensure services are built, maintained, and operated in a reliable and scalable way. You will be part of the successful delivery of new features and services, as well as the day-to-day successful operation of existing services.

  • You will have deep involvement with application services throughout the Software Development Lifecycle, serving as the local SRE domain expert and point of contact. Through this involvement you will gain a deep understanding of the technology stack, and be empowered to meaningfully contribute to design documents, code reviews, and other technical discussions.

  • Collaborate with the SWE service owners to drive operational health improvements, root cause analysis, postmortem discussions and their associated remediations that serve to improve reliability and sublinearly scale operations.

  • Partner with others in SRE to leverage tools, processes, and techniques to sublinearly scale operations and reduce business risk, in areas that include: infrastructure & configuration management, deploys, capacity modeling & planning, and incident mitigation.

  • Identify common patterns in the challenges of operating services in production, and partner with others in SRE to design and implement reusable solutions and/or other cross-functional work that drives down the complexity, difficulty, costs, and risks of operating the business.

 

What You’ll Do

  • Actively participate in and contribute to code reviews and technical design documents, with an eye toward identifying performance and reliability bottlenecks.

  • Proactively work with SWE counterparts to identify and mitigate production issues; validate, document, and exercise failover/disaster recovery plans, graceful degradation mechanisms, policies, and best practices.

  • Perform capacity planning and analysis, and infrastructure change management (including tuning, reshaping, resizing, and migrating infrastructure), for services and their immediate downstreams. You will have a comprehensive view of system interactions in the local ecosystem, providing valuable feedback and insight to broader capacity planning and event preparation efforts.

  • Join SWE service owners on large in-progress engineering projects, including migrating to the latest Twitter technologies and adopting related best practices.

  • Collaborate with SWE service owners to productionize new services and features, as well as improve the production landscape for existing services, providing SRE expertise and implementing best practices in CI/CD, dashboard integrity, and the ongoing identification and evaluation of the right set of alerts, SLOs, and error budgets for each service.

  • Attend team meetings, standups, and oncall handoffs.

  • Participate in team oncall rotation.

 

Who You Are

  • You are passionate about operational excellence and thrive in an environment where you are able to provide an extremely high degree of customer support through rigorous customer focus and continuous engagement.

  • You are a tenacious problem solver and will own issues end to end until they are fully resolved.

  • You understand the unique value and opportunity of working at very close proximity to the customer, and place a high value on having the work you do align very strongly with the customer in the form of shared deliverables, accountability, goals, and incentives.

  • You are an expert consumer or “power user” of the tools, processes, and systems that you use to get your work done.

  • You have excellent communication skills (written and verbal) and great attention to detail, and can present technical concepts in an authoritative and clear manner to customers and partners.

  • You place extremely high value on your relationships with your customers, and have excellent human relations and organizational skills.

  • You are excellent at managing your time and are comfortable making decisions in your scope of ownership that are necessary to achieve the deliverables you have committed to, or escalating to your manager when needed.

 

Technical Skills

  • 3+ years of experience managing, diagnosing, and debugging large-scale distributed systems in production, including: dynamic web servers; relational and non-relational databases; cache; pubsub; containers; and mechanisms associated with resiliency, graceful degradation, disaster recovery, and backpressure.

  • Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.

  • Strong knowledge of, and ability to talk confidently about, tools, methodologies, and analysis techniques in the areas of: infrastructure and configuration management and version control (Git); CI/CD automation (Jenkins); systems and application monitoring; root cause analysis; and capacity planning (redline testing, load testing) in a distributed systems environment.

  • Experience developing infrastructure, configuration, and deployment scripting and automation for large-scale, high-complexity services in a microservice environment.

  • Experience working with large data sets that informs your approach to building robust data pipelines and architectures, and to tuning Java applications.

  • Some experience with Lucene-based search systems and scatter-gather query patterns is desirable but not mandatory.

  • Passion for data technologies (Hadoop, HBase, Spark, Kafka, Flume, Hive, Solr, YARN, Presto, Jupyter Notebooks, etc.).

  • Practical, proven knowledge of at least one higher-level language (Python, Go, Ruby, or similar).

  • Experience using containerization software such as Mesos, Kubernetes, Docker, LXC, etc.

  • Expert-level understanding of Linux servers, specifically RHEL/CentOS.

  • B.S. in Computer Science or equivalent experience.

Additional Information

All your information will be kept confidential according to EEO guidelines.
