Senior MLOps Engineer
- Full-time
- Department: ML
- Compensation: USD 80,000 - USD 200,000 per year
Company Description
Axiado is an AI-enhanced security processor company redefining the control and management of every digital system. The company was founded in 2017, and currently has 150+ employees. At Axiado, developing great technology takes more than talent: it takes amazing people who understand collaboration, respect each other, and go the extra mile to achieve exceptional results. It takes people who have the passion and desire to disrupt the status quo, deliver innovation, and change the world. If you have this type of passion, we invite you to apply for this job.
Job Description
We are looking for a Senior MLOps Engineer to own and build the end-to-end machine learning lifecycle, with a special focus on secure, reliable deployment to edge devices.
You are a systems thinker and a hands-on engineer. You will be responsible for everything from the initial data pipeline to the final on-device model verification. You will design our data-labeling feedback loops, build the CI/CD pipelines that convert and deploy models, and implement the monitoring systems that tell us how those models are actually performing in the wild—both in terms of speed and quality.
This role is a unique blend of data engineering, DevOps, ML security, and performance optimization. You will be the engineer who ensures our models are not only fast but also trusted, secure, and continuously improving.
Key Responsibilities
1. Data & Labeling Lifecycle Management:
Architect and implement scalable data processing pipelines for ingesting, validating, and versioning massive datasets (e.g., using DVC, Pachyderm, or custom S3/Artifactory solutions).
Design and build the infrastructure for our Human-in-the-Loop (HITL) and AI-in-the-Loop (Active Learning) data labeling systems. This includes creating the feedback loops that identify high-value data for re-labeling.
Conduct deep data analysis to identify data drift, dataset bias, and feature drift, ensuring the statistical integrity of our training and validation sets (see the illustrative sketch after this list).
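To illustrate the kind of drift check involved, here is a minimal Python sketch that flags feature drift with a two-sample Kolmogorov-Smirnov test; the threshold, data, and function names are placeholders for illustration rather than our actual pipeline.

```python
# Minimal feature-drift check: compare a production feature sample against
# the training baseline with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed alerting threshold, tuned per feature in practice

def detect_feature_drift(train_feature: np.ndarray, prod_feature: np.ndarray) -> bool:
    """Return True if the production distribution has drifted from training."""
    _statistic, p_value = ks_2samp(train_feature, prod_feature)
    return p_value < DRIFT_P_VALUE

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
    shifted = rng.normal(0.4, 1.0, 10_000)    # simulated drifted field data
    print("drift detected:", detect_feature_drift(baseline, shifted))
```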
2. On-Device Model Monitoring:
Design and deploy lightweight, on-device telemetry agents to monitor inference quality and concept drift, not just operational metrics.
Implement statistical monitoring on model outputs (e.g., confidence distributions, output ranges) and create automated alerting systems to flag model degradation (see the sketch after this list).
Build the backend dashboards (e.g., in Grafana or custom tooling) to aggregate and visualize on-device performance and quality metrics from a fleet of edge devices.
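As a sketch of what lightweight output monitoring can look like, the example below tracks a rolling mean of inference confidences against a baseline; the class name, window size, and tolerance are assumptions for illustration only.

```python
# Rolling monitor of softmax confidences reported by on-device inference.
import collections
import statistics

class ConfidenceMonitor:
    """Alerts when the rolling mean confidence drifts from a validation baseline."""

    def __init__(self, baseline_mean: float, window: int = 500, tolerance: float = 0.10):
        self.baseline_mean = baseline_mean      # mean confidence observed at validation time
        self.tolerance = tolerance              # allowed deviation before alerting
        self.recent = collections.deque(maxlen=window)

    def record(self, confidence: float) -> None:
        self.recent.append(confidence)

    def degraded(self) -> bool:
        """True once the rolling mean confidence drifts past the tolerance."""
        if len(self.recent) < self.recent.maxlen:
            return False                        # not enough samples yet
        return abs(statistics.fmean(self.recent) - self.baseline_mean) > self.tolerance

# Example: feed confidences from the inference loop and check the monitor.
monitor = ConfidenceMonitor(baseline_mean=0.87, window=3)
for conf in [0.62, 0.58, 0.60]:                 # simulated post-deployment outputs
    monitor.record(conf)
if monitor.degraded():
    print("ALERT: confidence distribution shifted from baseline")
```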
3. Model Conversion & Deployment (CI/CD for ML):
Build and maintain a robust CI/CD pipeline (e.g., GitLab CI, Jenkins, GitHub Actions) that automates model training, conversion, quantization (PTQ/QAT), and packaging.
Manage the model conversion process, translating models from PyTorch/TensorFlow into optimized formats (e.g., ONNX, TFLite) for our AI inference engine (see the sketch after this list).
Orchestrate model deployment to edge devices, managing model versioning and enabling reliable Over-the-Air (OTA) updates.
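A minimal sketch of one conversion stage, assuming a PyTorch model exported to ONNX followed by post-training dynamic quantization (PTQ) via ONNX Runtime; the model, file names, and opset version are placeholders, not the actual pipeline.

```python
# Export a PyTorch model to ONNX, then apply post-training dynamic quantization
# so the artifact shipped to edge devices stores INT8 weights.
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

model = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
model.eval()

# Export the FP32 model to ONNX.
dummy_input = torch.randn(1, 64)
torch.onnx.export(model, dummy_input, "model_fp32.onnx",
                  input_names=["input"], output_names=["logits"], opset_version=17)

# Post-training dynamic quantization of the exported graph.
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)
```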
4. On-Device Model Security & Verification:
Implement a robust model verification framework using cryptographic signatures to ensure model authenticity (i.e., that the model running on-device is the one we deployed); see the sketch after this list.
Design and apply security protocols (e.g., secure boot, model encryption) to prevent model injection attacks and unauthorized model tampering on the device.
Collaborate with firmware and hardware security teams to ensure our MLOps pipeline adheres to a hardware root of trust.
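The following sketch illustrates signature-based model verification with an Ed25519 key pair from the cryptography library; key management is deliberately simplified here and does not reflect our actual root-of-trust integration.

```python
# Sign a model artifact at build time and verify the signature on-device
# before loading, so only the model we deployed is ever run.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Stand-in for the serialized model artifact produced by the CI/CD pipeline.
model_bytes = b"...serialized ONNX/TFLite model..."

# Build side: sign the artifact with the release key.
private_key = ed25519.Ed25519PrivateKey.generate()
signature = private_key.sign(model_bytes)

# Device side: verify before handing the model to the inference engine.
# In practice the public key is anchored in the hardware root of trust,
# not generated next to the artifact as done here for brevity.
public_key = private_key.public_key()
try:
    public_key.verify(signature, model_bytes)
    print("model verified, safe to load")
except InvalidSignature:
    print("verification failed, refusing to load")
```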
5. Performance Optimization:
Analyze and optimize ML model performance for our specific AI inference engine.
Apply graph-level optimizations (e.g., operator fusion, pruning) and op-level optimizations (e.g., rewriting custom ops, leveraging hardware-specific data types) to maximize throughput and minimize latency (see the sketch below).
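As an illustrative sketch of these two levers, the example below enables ONNX Runtime's built-in graph-level passes (which include operator fusion) and applies magnitude-based weight pruning in PyTorch; the model and pruning amount are placeholders.

```python
# Two optimization levers: runtime graph passes and weight pruning.
import onnxruntime as ort
import torch
import torch.nn.utils.prune as prune

# Graph-level: let ONNX Runtime apply its full set of graph passes
# (constant folding, operator fusion, etc.) when the session is created.
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# session = ort.InferenceSession("model_int8.onnx", session_options)

# Weight-level: prune 30% of the smallest-magnitude weights in a layer,
# then make the mask permanent so the sparsity survives export.
layer = torch.nn.Linear(64, 32)
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")
```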
Qualifications
5+ years of experience in MLOps, DevOps, or Software Engineering with a focus on ML systems.
Proven experience building and managing the full MLOps lifecycle, from data ingestion to production monitoring.
Strong programming skills in Python and deep experience with ML frameworks (e.g., PyTorch, TensorFlow).
Demonstrable experience with model conversion and optimization for edge devices (e.g., using ONNX, TFLite, TensorRT, or Apache TVM).
Strong understanding of data engineering principles and experience with data-labeling strategies (HITL/Active Learning).
Excellent understanding of CI/CD principles and tools (e.g., Git, Docker, GitLab CI).
Preferred Qualifications (The "Plus" Factors)
Hands-on experience with Kubernetes (K8s) for MLOps orchestration (e.g., Kubeflow, Argo Workflows).
Familiarity with GPU scheduling and virtualization platforms such as Run:AI.
Proficiency in managing MLOps infrastructure on at least one major cloud platform (AWS, GCP, Azure).
Experience with embedded systems security, cryptographic signing, or hardware security modules (HSMs).
Experience in C++ for deploying high-performance inference code.
Additional Information
Axiado is committed to attracting, developing, and retaining the highest caliber talent in a diverse and multifaceted environment. We are headquartered in the heart of Silicon Valley, with access to the world's leading research, technology and talent.
We are building an exceptional team to secure every node on the internet. For us, solving real-world problems takes precedence over purely theoretical ones. As a result, we prefer individuals with persistence, intelligence, and high curiosity over pedigree alone. Working hard and smart, continuous learning, and mutual support are all part of who we are.
Axiado is an Equal Opportunity Employer. Axiado does not discriminate on the basis of race, religion, color, sex, gender identity, sexual orientation, age, non-disqualifying physical or mental disability, national origin, veteran status or any other basis covered by appropriate law. All employment is decided on the basis of qualifications, merit, and business need.