🍃 🕵️ What is the SRE challenge on the Digital Platform?
The Platform team's main purpose is to provide the foundational services for developing and operating software systems, delivered as a set of standardized tools and services that accelerate the development of new features, improve efficiency, and ensure consistency across the organization. We are a diverse team spread across the globe, embracing challenges and handling changes adeptly. Dealing with different cultures is part of our daily routine.
As an SRE, you will be responsible for ensuring the reliability, availability, and scalability of systems and services in your local area, as well as participating in global discussions on standardizing practices, tools, and solutions. You will work closely with the global SRE team to design, implement, and operate software in a way that meets or exceeds our Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
The SRE's mission here is to ensure the security, reliability, availability, and performance of the platform, products, and essential services. This includes implementing security practices, monitoring performance, troubleshooting, adopting engineering best practices, dealing with a diverse technology ecosystem, getting excited about learning and testing modern technologies, and working and evolving with people from around the world. It means making an impact on digital solutions in agriculture globally and using technology as a tool to solve real problems for producers worldwide.
Are you up for the challenge?
🌿🦾 Let's translate it into activities:
The activities of an SRE (Site Reliability Engineer) professional are focused on ensuring the availability, reliability, stability, and scalability of software systems and services. The SRE role combines knowledge of software development and infrastructure and cloud operations to improve operational efficiency and reduce the impact of failures on systems. Here are some key activities performed by an SRE professional:
- Ensure that systems and services in your local area are reliable, available, and scalable, meeting or exceeding our SLOs and SLAs;
- Work with the global SRE team to design, implement, and operate software reliably, scalably, and economically, contributing to the standardization of practices, tools, and solutions;
- Participate in global discussions on standardizing practices, tools, and solutions, providing feedback on how to improve them;
- Monitor systems and services to identify and resolve issues before they affect users;
- Work on creating and maintaining tools and automations to facilitate operations and improve system reliability. This may involve coding scripts, developing internal tools, and contributing to open-source projects;
- Establish and maintain monitoring systems to collect metrics, logs, and traces from production systems, using monitoring tools to identify performance issues, anomalies, and trends that may affect service reliability;
- Play a crucial role in incident management with rapid responses to outages, performance issues, or other operational failures, investigating root causes and implementing corrective measures to prevent recurrences. Work on incident response plans and continuous improvement of recovery processes;
- Analyze system performance and make capacity predictions to ensure resources are available to support expected demand. Collaborate with development teams to design and implement scalability strategies, such as automatic resource adjustment and adoption of resilient architectures;
- Define and apply Site Reliability Engineering (SRE) practices to improve system reliability and availability. This may include implementing stress tests, fault reduction techniques, controlled updates, canary releases, and change management practices;
- Act as a bridge between development and operations teams. Collaborate with software engineers to improve system reliability from the design phase and provide technical support to resolve operational issues;
- Develop and implement incident response plans to minimize the impact of disruptions.
📎📢 What do you need to excel in this role?
- Experience in incident management, including identification, diagnosis, and resolution of incidents;
- Understanding of Service Level Agreements (SLAs) and Service Level Objectives (SLOs);
- Familiarity with monitoring and alerting tools, such as Datadog, Prometheus, Grafana, Nagios, and Splunk;
- Knowledge of best practices and global standards for SRE;
- Experience in designing and implementing highly available, scalable, and fault-tolerant systems;
- Experience with automation tools like Ansible, Chef, Puppet, Terraform, or similar;
- Experience with containerization and container orchestration platforms like Docker and Kubernetes;
- Experience with continuous integration and continuous delivery tools, such as Jenkins, GitLab CI/CD, GitHub Actions, or CircleCI, for automating the software development and deployment pipeline;
- Proficiency in popular programming languages such as Python, Go, Java, or TypeScript for developing the tools, scripts, and automations needed for the SRE role;
- Experience with tools like Elasticsearch, Logstash, Kibana (ELK Stack), or Jaeger for collecting, analyzing, and visualizing logs and traces for troubleshooting and performance analysis;
- Experience working in a global team and participating in discussions on standardizing practices, tools, and solutions.
🍀👀 Apart from the f#d@ ecosystem, what do you get?
- A lot of attention to your physical and mental health, with health and dental insurance, psychological and nutritional support and Gympass for you and your dependents;
- Support for the incredible projects you have outside of work, with a corporate discount for personal travel, support for legal and financial issues (EAP), childcare assistance and extended maternity/paternity leave;
- A boost to your development through platforms for personal and technical growth;
- Market benefits, which we love, such as life insurance, food/meal vouchers and transport vouchers;
- Oh, and of course, your schedule is flexible regardless of the type of work, and if you choose to work remotely, we have home office assistance ($$).