Senior Reliability Engineer

Full-time

Company Description

Twitter serves the public conversation by encouraging people all over the world to connect, learn, debate and solve problems together. We believe the conversation can change the world, and that’s why Tweeps (that’s what we call Twitter employees) come to work every day.

Job Description

This role focuses on leading incidents to resolution and building the tools and automation to make sure that they don't happen again.

Why it matters

Our platform serves hundreds of internal customers running the thousands of services that make up our product. You'll work directly with internal customers triaging and resolving incidents on many kinds of projects and technologies that keep Twitter performing reliably.

What you’ll be doing

You will be responsible for effectively triaging and solving problems in a complex environment operating at a massive scale. As an incident manager, you will resolve critical system issues on a continuous basis, including notification, coordination, and dispatch of individuals from various functional groups. You will take individual ownership of issues and pursue resolution tenaciously. You will communicate effectively and disseminate information to other teams and executive management. You will develop tools to visualize, detect, and resolve issues using Python, JavaScript, and other languages. We are looking for someone with a variety of deep systems experience, superb communication skills, attention to both the details and the big picture, and a real passion for Twitter.

Who we are

The Twitter Command Center is responsible for leading incidents to resolution at Twitter. We build tools and automation to make sure that they don't happen again. Our job is to increase the reliability of the Twitter service by providing continuous site monitoring, oversight and management of key control processes, and effective communication around reliability-related events, and by building the tools to automate this work.

Qualifications

You should have a B.S. in Computer Science or equivalent experience.
8+ years of incident management experience on a large-scale platform.
5+ years of experience with distributed systems at scale in a Linux/Unix environment as an administrator or developer.
Experience developing tooling in Python, JavaScript, or another language.