Senior Reliability Engineer

Full-time

Company Description

Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.

Job Description

This role focuses on leading incidents to resolution and building the tools and automation to make sure that they don't happen again.

Why it matters:

Our platform serves hundreds of internal customers running the thousands of services that make up our product. You'll work directly with internal customers triaging and resolving incidents on many kinds of projects and technologies that keep Twitter performing reliably.

What you’ll be doing:

You will be responsible for effectively triaging and solving problems in a complex environment operating at massive scale. As an incident manager, you will resolve critical system issues on a continuous basis, including notification, coordination and dispatch of individuals from various functional groups. You will take individual ownership of issues and pursue resolution tenaciously. You will communicate effectively and share information with other teams and executive management. You will develop tools to visualize, detect and resolve issues using Python, JavaScript, or other languages. We are looking for someone with a variety of deep systems experience, superb communication skills, attention to both the details and the big picture, and a real passion for Twitter.

Who we are:

Twitter serves the public conversation by encouraging people all over the world to connect, learn, debate and solve problems together. We believe conversation can change the world, and that’s why Tweeps (that’s what we call Twitter employees) come to work every day.

The Twitter Command Center is responsible for leading incidents to resolution at Twitter. We build tools and automation to make sure that they don't happen again. Our job is to increase the reliability of the Twitter service by providing continuous site monitoring, oversight/management of key control processes, and effective communication around reliability-related events, and by building the tools to automate this work.

Qualifications

You should have a B.S. in Computer Science or equivalent experience.
3-5+ years of incident management experience on a large-scale platform.
5+ years of experience in distributed systems at scale in a Linux/Unix environment as an administrator or developer.
Experience developing tooling in Python, JavaScript or another language.

Additional Information

We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, ethnicity, color, ancestry, national origin, religion, sex, sexual orientation, gender identity, age, disability, veteran status, genetic information, marital status or any other legally protected status.