Senior Site Reliability Engineer
- Full-time
Company Description
At KMS Technology Mexico, we are passionate about building innovative software solutions that drive impact. As part of an international tech company, we offer a collaborative and inclusive environment where your ideas matter and your growth is our priority.
Job Description
We are looking for a Senior SRE to join our core engineering team in building the next generation of AI-powered property intelligence for the insurance industry. In this role, you will be the guardian of the platform’s availability, latency, and performance.
You will work at the heart of a high-demand ecosystem, ensuring that our Node.js microservices and AI/ML pipelines running on Google Cloud Platform (GCP) are resilient, scalable, and secure. This is a "Software Engineering approach to Operations" role, where automation is the default and manual intervention is a last resort.
Key Responsibilities
Infrastructure & Platform Engineering
Cloud Architecture: Design and manage scalable, multi-regional infrastructure on GCP, leveraging GKE (Kubernetes), Cloud Run, and Pub/Sub.
Infrastructure as Code (IaC): Maintain and evolve our infrastructure codebase using Terraform or Pulumi, ensuring environment parity across Staging and Production.
Node.js Optimization: Partner with full-stack teams to tune Node.js application performance, addressing memory limits, event loop bottlenecks, and asynchronous execution in a containerized environment.
Observability & Reliability
SLO/SLI Definition: Define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure the health of our property intelligence engine.
Advanced Monitoring: Build comprehensive dashboards and alerting systems using Google Cloud Operations Suite (Stackdriver), Prometheus, or Grafana.
Incident Management: Lead Root Cause Analysis (RCA) for production incidents and implement "Blameless Post-mortems" to prevent recurrence.
AI & Data Operations
MLOps Integration: Support the scaling of AI models by optimizing GPU/TPU utilization and data ingestion pipelines within GCP.
Security & Compliance: Ensure the platform meets the rigorous data privacy standards of the insurance industry, including SOC 2 and GDPR compliance.
Qualifications
Technical Requirements:
5+ years in an SRE, DevOps, or System Architecture role.
GCP Expertise: Deep experience with Google Cloud Platform, specifically GKE, IAM, Cloud SQL, and VPC networking.
Coding Proficiency: Strong experience with Node.js (backend services) and scripting in Python or Go for automation.
Orchestration: Expert-level knowledge of Kubernetes (GKE), including Helm charts and service meshes (Istio/Anthos Service Mesh).
CI/CD: Experience building high-frequency deployment pipelines with GitHub Actions, GitLab CI, or Google Cloud Build.
Professional Competencies:
The "SRE Mindset": A passion for automation and a visceral dislike of repetitive manual tasks ("Toil").
Strategic Communication: Ability to translate complex infrastructure risks into business impact for Stakeholders and Delivery Directors.
AI-First Workflow: Proactive use of AI tools for log anomaly detection, predictive scaling, and automated troubleshooting.
Additional Information
Location: Guadalajara, Jalisco, Mexico (Hybrid)
Benefits and Perks
Perks you enjoy at KMS Mexico
- Benefits mandated by Mexican law.
- 15 days of PTO in year zero; from the first year onward, 3 days per year.
- 5 days of bereavement leave for the death of an immediate family member (negotiable).
- Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children).
- Annual performance bonus (≈10% of annualized salary).
- Annual salary adjustment.
- Employee Referral Bonus.
- Paid Certifications / Courses.
- Coursera License.
- 5% Savings Fund.
- 5% Grocery Vouchers.