Sr. Site Reliability Engineer - SRE

  • Time Type: Full Time
  • Department: Engineering
  • Location: Spain - Barcelona

Company Description

Redzone is the #1 Connected Workforce Solution for manufacturers big and small. We work to improve efficiency in plants, provide coaching for best practices, and enable front-line workers to improve the quality of their work and their work life by giving them the tools, processes, and collaboration capabilities to keep their manufacturing lines running smoothly and efficiently.

At Redzone we focus on the customer experience, listening to the customer, and providing solutions that create great outcomes. We are a combination of great leadership, years of manufacturing experience, and an incredible technology team that all work together to create great products. 

This role is fully remote.

Job Description

We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical services. This is an opportunity to shape our SRE practices, drive automation, reduce operational toil, and significantly impact our product's operational excellence.
 

What You'll Do

  • Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experiences.
  • Serve as a subject matter expert for observability, including monitoring, alerting, logging, tracing, dashboards, and synthetic testing.
  • Develop robust, maintainable software and self-service tooling to automate operational tasks and improve reliability.
  • Identify and eliminate operational toil through automation, process improvements, and systematic problem solving.
  • Lead incident response, participate in on-call rotations, and drive blameless post-mortems.
  • Define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  • Leverage infrastructure as code, GitOps practices, and CI/CD automation using Terraform, Flux, and GitHub Actions.
  • Provide reliability expertise during system design reviews and influence architectural decisions.
  • Document processes, build runbooks, and mentor engineers across the organisation.
  • Leverage AI responsibly to accelerate investigations, improve documentation, reduce toil, and build intelligent operational workflows while maintaining appropriate human oversight, security, and governance.

Qualifications

What You'll Bring 

Core SRE Capabilities 

  • Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role.
  • Ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains.
  • Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis.
  • Experience defining and using SLIs, SLOs, and error budgets to guide reliability decisions.
  • Excellent written and verbal communication skills.

Technical Domains 

Experience across several of the following areas: 

  • Kubernetes platforms, including Amazon EKS, and service mesh technologies such as Istio. 
  • Cloud infrastructure and services within AWS. 
  • Identity and access management systems, including Auth0 and AWS IAM.
  • Networking fundamentals, including DNS, load balancing, routing, TLS, and connectivity troubleshooting. 
  • GitOps workflows and infrastructure automation using tools such as Flux and Terraform. 
  • Observability platforms and practices, including metrics, logs, traces, alerting, dashboards, and synthetic monitoring.
  • CI/CD systems and engineering workflows. 
  • Application logging and distributed system debugging.

Engineering Mindset

A strong SRE: 

  • Prioritizes service stability and customer impact during incidents. 
  • Slows down under pressure, gathers facts, and communicates clearly. 
  • Reduces operational complexity through automation and simplification. 
  • Identifies and eliminates toil through self-service tooling and process improvement. 
  • Demonstrates strong scripting and automation instincts. 
  • Brings a systems-thinking approach to problem-solving. 
  • Balances short-term remediation with long-term reliability improvements. 

Software Engineering for Reliability 

  • Demonstrated ability to build and maintain automation, tooling, and self-service capabilities using one or more programming or scripting languages such as Python, Go, or Bash. 
  • Focuses on applying software engineering practices to improve reliability, reduce toil, and enhance developer productivity.

Behavioral Expectations

  • Calm and effective during high-severity incidents.
  • Skilled at managing complex situations involving multiple teams and competing priorities. 
  • Able to lead blameless post-mortems and drive meaningful follow-up actions. 
  • Passionate about continuous improvement and fostering a culture of shared ownership. 


AI-Native Operations

  • Uses AI assistants effectively to accelerate troubleshooting, root cause analysis, and operational decision-making.
  • Validates AI-generated recommendations and understands when human judgement is required.
  • Applies AI to automate repetitive tasks, improve runbooks, and create self-service capabilities.
  • Identifies opportunities to build AI-assisted workflows for incident response, observability, and platform operations.
  • Continuously evaluates emerging AI capabilities to improve reliability, developer experience, and operational efficiency.
  • Uses AI to summarise incidents, analyse logs and metrics, accelerate scripting, support code reviews, and generate operational documentation.

Bonus Points (Nice to Have): 

  • Experience defining and working with SLOs, SLIs, and Error Budgets. 
  • Familiarity with other observability tools or concepts beyond Datadog. 
  • Experience with feature flagging platforms like LaunchDarkly.

Additional Information

QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. In order to survive and thrive, manufacturers must be able to innovate and change business models at unprecedented rates of speed. QAD calls these companies Adaptive Manufacturing Enterprises. QAD solutions help customers in the automotive, life sciences, packaging, consumer products, food and beverage, high tech and industrial manufacturing industries rapidly adapt to change and innovate for competitive advantage. 

QAD is committed to ensuring that every employee feels they work in an environment that values their contributions, respects their unique perspectives and provides opportunities for growth regardless of background. QAD’s DEI program is driving higher levels of diversity, equity and inclusion so that employees can bring their whole self to work. 

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class.

About QAD and QAD Redzone:

QAD Redzone helps to enable QAD’s vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant and accelerate time to productivity for new employees. This gives manufacturers the agility to increase production beyond what was previously possible without investing in production equipment or new plants, and it reduces the amount and impact of employee attrition. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.
