Staff Site Reliability Engineer, Storage Infrastructure

  • Full-time

Company Description

Twitter developed and continually improves a large-scale storage platform. SREs ensure availability of the environment, with a watchful eye on security, capacity and performance. This group writes software to improve service reliability and manage platform growth. Our tools and services reduce operational overhead and maximize performance.

Job Description

As a Site Reliability Engineer (SRE) in Twitter’s Storage Infrastructure team, you will work to improve the reliability and performance of the next generation of distributed systems and containerized deployments. This team ensures the availability of in-memory data services (including Redis and Memcached) that cache content from foundation storage platforms. You will partner with product engineering teams to design, build, operate, and automate distributed storage services at the heart of Twitter’s infrastructure used by millions of people.

We are looking for software engineers who are passionate about reliability, performance, and efficiency, and who have experience building tools, services, and automation to manage and improve production services.

This team has some exciting challenges approaching. Services need to adopt IPv6, transition into Kubernetes, and reimagine elasticity. Opportunities exist for team members to influence how Twitter leverages future caching infrastructure. You will work directly with most Twitter engineering teams to improve their interactions with caching services.

Responsibilities:

- Build tooling to improve operations automation. This includes automatic failure detection and remediation, application deployment, OS/kernel deployment, capacity planning, and fleet management.

- Diagnose and troubleshoot complex distributed systems handling millions of queries per second and petabytes of data, and develop solutions that have a significant impact at our massive scale.

- Collaborate with software engineers to sustain and optimize service availability, reliability, and performance.

- Collaborate with diverse hardware, software, and networking teams throughout the company to design next-generation distributed storage platforms.

- Troubleshoot issues across the entire stack - hardware, software, application and network.

- Produce results for large-scale projects and lead active collaboration across multiple teams.

- Scope work for multiple engineers, often across multiple teams.

- Sustain data privacy and service security compliance.

- Participate in a 24x7 on-call rotation.

Qualifications

- 5+ years of managing services in a distributed, internet-scale *nix environment.

- Ability to program scalable and reliable services in at least one programming language (Python, Go, Java, C) and to set standards for code quality.

- Demonstrable knowledge of Linux operating system internals and TCP/IP networking; containerization a plus.

- Familiarity with systems management tools (Puppet, Chef, Ansible, etc.).

- Ability to prioritize tasks and work independently.

- Track record of practical problem solving, excellent communication, and documentation skills.

- BS degree in Computer Science or Engineering, or equivalent experience.

Additional Information

A few other things we value:

- Challenge: We solve some of the industry’s hardest problems. Come to be challenged, learn, and thrive as an engineer.

- Diversity: We value diverse backgrounds, ideas, and experiences, all of which contribute to team and organization improvement.

- Work-Life Balance: We honor team members’ work-life balance.

All your information will be kept confidential according to EEO guidelines.
