Senior Platform Engineer - SRE (m/f/d)
We are Flink - your online supermarket revolutionising the way you do your grocery shopping. With a wide selection of over 2,400 high-quality products, we aim to deliver to your door in minutes. We put our customers first and ensure all products delivered are fresh and nutritious. Additionally, we can customise our national assortment to be able to offer you unique local products in every city. Our delivery hubs are located in densely populated inner-city locations and we strive to be sustainable by delivering on electric bikes and using packaging that can be recycled.
Founded by experienced e-commerce professionals and seasoned entrepreneurs with exceptional track records, and backed by some of the most renowned global investors, we are growing rapidly and have a great hunger to continuously challenge ourselves. We pride ourselves on being an inclusive and equal opportunities employer with a diverse and multicultural team.
If you want to be part of this exciting journey… read on!
In your role as a Senior Platform Engineer - SRE and Observability in the Site Reliability Engineering team, you will work on company-wide strategic initiatives to make all services at Flink observable. Our SRE team is built on the framework described at https://sre.google/. You will be responsible for implementing automations and services that will be instrumental in establishing SRE practices at Flink, with a focus on delivering out-of-the-box observability and incident management processes in a compliant way. You will be in constant communication with cross-functional product engineering teams, acting as a domain expert in SRE practices such as system design, software platforms and frameworks, and capacity planning.
- Take a leadership role in the SRE team to build and maintain Flink’s SRE suite of tools and services and keep them up to date with market standards
- Research and develop SRE standards, conventions, and automations and drive their adoption within the product teams at Flink
- Closely collaborate with product engineering teams to guide system design improvements with a clear focus on observability, availability, scalability, and latency
- Lead incident response efforts and foster a blameless post-mortem culture that improves overall service resilience and reduces mean time to recovery (MTTR)
- At least 5 years of experience with building and running large-scale distributed cloud-native services
- You have an understanding of Unix/Linux operating systems and TCP/IP network fundamentals
- You have experience using cloud provider platforms (preferably GCP) and deploying and running distributed services on Kubernetes. You know your way around Terraform and Helm
- You are comfortable with SRE practices and are familiar with common observability concepts (traces, metrics, and logs), standards (OpenTelemetry, OpenTracing, OpenMetrics), and tooling (e.g. Datadog, Prometheus)
- You have a deep understanding of system design, data structures, and algorithms
- Nice to have: You have worked with a range of backend technologies such as Golang, Java, and Python
- You are self-motivated and have fluent English communication skills
- A cool discount on your personal Flink orders and the chance to be the first to test out new products!
- Unlimited access to an e-learning and development platform, MyAcademy, including online German courses
- Attractive company pension options
- Online discounts with Corporate Benefits and Future Bens
- A unique opportunity to be an early bird and have an impact on our strategy & growth
- A steep learning curve and the opportunity to work in an energised and dynamic team in a fast-paced environment
- For Berlin: A newly renovated, spacious, and dog-friendly HQ in the heart of Mitte, with lots of delicious lunch spots within short walking distance
It is our commitment that every applicant will be evaluated according to their skills regardless of age, gender identity, ethnicity, sexual orientation, disability status, or religion.