Site Reliability Engineer
- Office: London
Great journeys start with Trainline
We are champions of rail, inspired to build a greener, more sustainable future of travel. Our purpose is our momentum. It makes us feel good because we know we’re doing good. As we lead the way to a greener future, we do it together. We’re all about connections - with each other, with our customers and with the world. Just as our platform brings the world together, it’s our ambition that connects us. We motivate each other to go beyond our limits, to experiment, to fail and to always grow.
With over 110 million visits every month to our platform and £4.3 billion in net ticket sales, we're always innovating and making moves towards our final destination: a world where travel is as simple, seamless, and affordable as it should be.
And we couldn't do any of it without our incredible people driving us forward. Today, we're a FTSE 250 company that's proudly home to more than 1000 Trainliners from over 60 nationalities across offices in London, Paris, Barcelona, Milan, Edinburgh, Berlin, Madrid and Brussels. It's this diversity that energises us and makes us stronger, helping us to achieve amazing things.
With our sights firmly set on further European growth, there is no better time to jump on board this high-speed train and be part of our continued success.
Introducing Reliability & Operations Engineering 👋
Trainline is a fast-growing company that loves utilising new technology to build world-class products for our customers. We run a diverse platform hosted on AWS which, coupled with our own tooling, allows us to embrace CI/CD, DevOps practices, SRE disciplines and cloud native services to their full potential.
ReliabilityOps are at the forefront of platform observability, maintaining availability, latency, performance, efficiency and capacity, and handling CI/CD delivery co-ordination, critical incident response, and cloud infrastructure automation and provisioning.
We are looking for a Site Reliability Engineer to join the team, contribute to owning observability and build tooling that supports operational engineering. We are looking for a strong technical team player who has experience implementing SRE practices within a team and advocating SRE principles.
As an SRE @ Trainline, you'll be working on...🚄
- Critical incident response in production: from the initial event, through participating in rapid response and driving service restoration, to identifying follow-up measures.
- Building and implementing tooling to improve observability and the identification and resolution of incidents, with a strong emphasis on reducing MTTD.
- Supporting product engineering teams to ensure applications are operationally launch ready and that CI/CD activities are carried out in a safe manner with reliability in mind.
- Reducing MTTR by working with product engineering teams to understand issues, and by surfacing and presenting the right data to influence change.
- Contributing to incident retrospectives with deep technical knowledge to explain what may have occurred at the HTTP, TCP and DNS layers of the stack.
- Promoting and expanding SRE concepts to the engineering community in both a consultancy and hands-on fashion, being a champion of observability engineering and reliability principles.
- Improving platform reliability: identifying metrics to base decisions on, surfacing them where we don’t yet record them, and identifying continuous improvements across our pillars of observability.
- Treating data presentation as a socio-technological problem: we need the most pertinent information presented quickly in a human-consumable way to affect the resolution of a real-time incident.
- Delivering on key roadmap deliverables and ensuring that initiatives contribute to the improvement of the SRE team, the reliability of the platform and the achievement of business OKRs.
- Participating in the SRE on-call rota, assuming the role of incident commander to ensure our platform is supported 24/7 for our customers.
We'd love to hear from you if you have...🔍
- Experience of SRE concepts such as SLI, SLO and error budgets.
- Hands-on experience with observability tooling such as New Relic, Elastic Cloud (ELK Stack), Influx and Grafana, with a good understanding of APM and MELT (metrics, events, logs, traces).
- Strong understanding of HTTP/TCP (status codes, nuances of headers, cookies, connection/request life cycle).
- Understanding of load balancing and reverse proxy concepts, upstream config concepts, upstream health checks, worker & data flow concepts.
- Application architecture concepts (threading, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff, throttling).
- Experience building, maintaining and evolving time series data, including retention, cardinality, deviation, moving averages and other functions.
- Experience working with cloud providers, preferably AWS.
- Experience with build, deployment & configuration management tooling such as TeamCity, GitHub Actions, and Terraform.
- Experience troubleshooting Linux operating systems.
- Experience of scripting in at least one language, preferably Python.
But why should you join?
We pay special attention to learning and development and organise quarterly company learning days as well as offering a learning budget that can be put towards resources of your choice. We will cover the costs of your professional subscriptions and give you access to our very own learning platform.
At Trainline, we care about the wellness of our employees. We host puppy therapy sessions and in-office yoga, and run Mental Health First Aider training courses, as well as offering an Employee Assistance Program as one of our many company benefits.
We regularly throw fun social events such as pub quizzes, karaoke nights and our large-scale Summer and Winter Festivals every year. Additionally, we love hosting meetups in our amazing event spaces and having the opportunity to support internal and external community groups.
We also hold companywide hackathons and our annual Trainline Tech Summit, which provides Trainliners with an opportunity to stand up and share their story, learnings, or new skills with their colleagues in a safe environment.
Our flexi-first approach
We believe in the importance of a healthy work-life balance and the value of a flexible workforce. Our flexi-first approach outlines our commitment to a hybrid way of working and our expectations of Trainliners. A key part of what makes Trainline special is our people and the value we get from the buzz and energy of our workplaces, and that’s why we’re proud to offer the best of both worlds. In practice, this means in-office attendance at least 40% of the time over a 12-week period for all Trainliners. These in-office days are typically team-led to help us connect, collaborate and create together.
Our values
- Think Big - We're building the future of rail
- Own It - We care about every customer, partner and journey
- Do Good - We make a positive impact
- Travel Together - We're one team
Interested in finding out more about what it's like to work at Trainline? Why not check out what our employees say about us on Glassdoor? You can also find out more information by following us on LinkedIn or our 'Life at Trainline' Instagram account.
We value open expression at Trainline. We believe it’s the diversity of experience, backgrounds and perspectives of our employees that makes us who we are. We encourage everybody to play a part in changing the way people travel across the world.