Site Reliability Engineer - Revenue Engineering
- Seattle, WA, USA
Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.
Twitter’s revenue organization operates services at massive scale. We are looking for a Site Reliability Engineer (SRE) to join the Embedded Revenue SRE Team and support the Revenue Callback Team (RCP).
You'll be the embedded SRE in a development team that is responsible for a critical real-time data processing pipeline. The ads we serve result in a substantial stream of user interactions (impressions, clicks) flowing back. Revenue products consume these interactions in a variety of forms, and the RCP team owns a data processing pipeline and related applications that prepare the data for different product use cases.
If you have a philosophy of building the tools that solve a broad set of problems, this is the team for you. If you have a relentless customer focus that will help us to accelerate developer and analytic productivity, this is the team for you.
If you are passionate about working with and contributing to such world-class systems, we invite you to join Revenue SRE and embed with the Revenue Callback Team!
How You’ll Work
Work closely with Software Engineering (SWE) counterparts and take an active role as a co-owner of production services to ensure services are built, maintained, and operated in a reliable and scalable way. You will be part of the successful delivery of new features and services, as well as the day-to-day successful operation of existing services.
You will have deep involvement with application services throughout the Software Development Lifecycle, serving as the local SRE domain expert and point of contact. Through this involvement you will gain a deep understanding of the technology stack, and be empowered to meaningfully contribute to design documents, code reviews, and other technical discussions.
Collaborate with the SWE service owners to drive operational health improvements, root cause analysis, postmortem discussions and their associated remediations that serve to improve reliability and sublinearly scale operations.
Partner with others in SRE to leverage tools, processes, and techniques to sublinearly scale operations and reduce business risk, in areas that include: infrastructure & configuration management, deploys, capacity modeling & planning, and incident mitigation.
Identify common patterns in the challenges of operating services in production, and partner with others in SRE to design and implement reusable solutions and/or other cross-functional work that drives down the complexity, difficulty, costs, and risks of operating the business.
What You’ll Do
Actively participate in and contribute to code reviews and technical design documents, with an eye toward identifying performance and reliability bottlenecks.
Proactively work with SWE counterparts to identify and mitigate production issues; validate, document, and exercise failover/disaster recovery plans, graceful degradation mechanisms, policies, and best practices.
Perform capacity planning and analysis, and infrastructure change management (including tuning, reshaping, resizing, and migrating infrastructure), for services and their immediate downstreams. You will have a comprehensive view of system interactions in the local ecosystem, providing valuable feedback and insight to broader capacity planning and event preparation efforts.
Join SWE service owners on large in-progress engineering projects, including migrating to the latest Twitter technologies and adopting related best practices.
Collaborate with SWE service owners to productionize new services and features, as well as improve the production landscape for existing services, providing SRE expertise and implementing best practices in the areas of CI/CD, dashboard integrity improvements, and identifying and evaluating the right set of alerts, SLOs, and error budgets to use for services on an ongoing basis.
Attend team meetings, standups, and oncall handoffs.
Participate in the team oncall rotation, which is composed of the development team and you.
Who You Are
You are passionate about operational excellence and thrive in an environment where you are able to provide an extremely high degree of customer support through rigorous customer focus and continuous engagement.
You are a tenacious problem solver and will own issues completely from end-to-end, until fully resolved.
You understand the unique value and opportunity of working at very close proximity to the customer, and place a high value on having the work you do align very strongly with the customer in the form of shared deliverables, accountability, goals, and incentives.
You are an expert consumer or “power user” of the tools, processes, and systems that you use to get your work done.
You have excellent communication skills (written and verbal) and great attention to detail, and can present technical concepts in an authoritative and clear manner to customers and partners.
You place extremely high value on your relationships with your customers, and have excellent human relations and organizational skills.
You are excellent at managing your time and are comfortable making decisions in your scope of ownership that are necessary to achieve the deliverables you have committed to, or escalating to your manager when needed.
3+ years of experience managing, diagnosing, and debugging large-scale distributed systems in production, including: dynamic web servers; relational and non-relational databases; cache; pubsub; containers; and mechanisms associated with resiliency, graceful degradation, disaster recovery, and backpressure.
Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
Strong knowledge of, and ability to talk confidently about, tools, methodologies, and analysis techniques in the areas of: infrastructure and configuration management and version control (Git); CI/CD automation (Jenkins); systems and application monitoring; root cause analysis; and capacity planning (redline testing, load testing) in a distributed systems environment.
Experience developing infrastructure, configuration, and deployment scripting and automation for large scale / high complexity services.
Experience building resilient data pipelines with technologies such as Kafka, and with caching technologies such as Couchbase.
Experience using containerization software such as Mesos, Kubernetes, Docker, or LXC.
Comfortable working with on-prem and cloud-based infrastructure (AWS, GCP, Azure).
Practical, proven knowledge of at least one higher-level language (Python, Go, Ruby, or similar).
Expert level understanding of Linux servers, specifically RHEL/CentOS.
B.S. in Computer Science or equivalent experience.
All your information will be kept confidential according to EEO guidelines.