Senior Site Reliability Engineer

  • Full-time

Company Description

At KMS Technology Mexico, we are passionate about building innovative software solutions that drive impact. As part of an international tech company, we offer a collaborative and inclusive environment where your ideas matter and your growth is our priority.

Job Description

We are looking for a Senior SRE to join our core engineering team in building the next generation of AI-powered property intelligence for the insurance industry. In this role, you will be the guardian of the platform’s availability, latency, and performance.

You will work at the heart of a high-demand ecosystem, ensuring that our Node.js microservices and AI/ML pipelines running on Google Cloud Platform (GCP) are resilient, scalable, and secure. This is a "Software Engineering approach to Operations" role, where automation is the default and manual intervention is a last resort.

Key Responsibilities

Infrastructure & Platform Engineering

  • Cloud Architecture: Design and manage scalable, multi-regional infrastructure on GCP, leveraging GKE (Kubernetes), Cloud Run, and Pub/Sub.

  • Infrastructure as Code (IaC): Maintain and evolve our infrastructure codebase using Terraform or Pulumi, ensuring environment parity across Staging and Production.

  • Node.js Optimization: Partner with Fullstack teams to tune Node.js application performance, managing memory limits, event loop bottlenecks, and asynchronous execution in a containerized environment.

Observability & Reliability

  • SLO/SLI Definition: Define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure the health of our property intelligence engine.

  • Advanced Monitoring: Build comprehensive dashboards and alerting systems using Google Cloud Operations Suite (Stackdriver), Prometheus, or Grafana.

  • Incident Management: Lead Root Cause Analysis (RCA) for production incidents and run blameless post-mortems to prevent recurrence.

AI & Data Operations

  • MLOps Integration: Support the scaling of AI models by optimizing GPU/TPU utilization and data ingestion pipelines within GCP.

  • Security & Compliance: Ensure the platform meets the rigorous data privacy standards of the insurance industry, including SOC2 and GDPR compliance.

Qualifications

Technical Requirements:

  • 5+ years in an SRE, DevOps, or System Architecture role.

  • GCP Expertise: Deep experience with Google Cloud Platform, specifically GKE, IAM, Cloud SQL, and VPC networking.

  • Coding Proficiency: Strong experience with Node.js (backend services) and scripting in Python or Go for automation.

  • Orchestration: Expert-level knowledge of Kubernetes (GKE), including Helm charts and service meshes (Istio/Anthos).

  • CI/CD: Experience building high-frequency deployment pipelines with GitHub Actions, GitLab CI, or Google Cloud Build.

Professional Competencies:

  • The "SRE Mindset": A passion for automation and a visceral dislike of repetitive manual tasks ("Toil").

  • Strategic Communication: Ability to translate complex infrastructure risks into business impact for Stakeholders and Delivery Directors.

  • AI-First Workflow: Proactive use of AI tools for log anomaly detection, predictive scaling, and automated troubleshooting.

Additional Information

Location: Guadalajara, Jalisco, Mexico (Hybrid) 

Benefits and Perks

Perks you enjoy at KMS Mexico

  • Benefits required by Mexican law.
  • 15 days of PTO in year zero; from the first year onwards, it is 3 days per year.
  • 5 days of bereavement leave for the death of an immediate family member (negotiable).
  • Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children).
  • Annual performance bonus (≈10% of annualized salary).
  • Annual salary adjustment.
  • Employee Referral Bonus.
  • Paid Certifications / Courses.
  • Coursera License.
  • 5% Savings Fund.
  • 5% Grocery Vouchers.