Sr Software Engineer - Cortex Platform Infrastructure

  • Full-time

Company Description

Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.

Job Description

Who We Are

The ML Infrastructure team’s mission is to provide Twitter’s ML Engineers with the orchestration tools and compute capacity to reliably run state-of-the-art ML experiments. We provide this capability through managed Kubeflow clusters built on top of the Google Cloud Platform. We partner with our sister teams within Cortex Platform to provide an end-to-end ML experimentation platform experience.

Cortex Platform 

Cortex Platform empowers internal teams to efficiently leverage ML by providing a platform and by unifying, educating, and advancing the state of the art in ML technologies within Twitter. We win when our customers win by helping our users stay informed and share and discuss what matters; in short, by serving the public conversation. We’re building an AI-first company and every major initiative is increasingly dependent on the successful application of machine learning. Cortex is at the nexus of this evolution.

We are building one of the strongest machine learning platforms in the world by marrying the latest ML industry practices with engineering excellence and the need to perform at Twitter scale. Our customers are all the ML engineers at Twitter. Our goal is to provide a unified tooling ecosystem that lets these engineers focus on what they are good at, building ML models with novel approaches, and that abstracts away the complexities of bringing these models into a production environment.

We care deeply about:

  • Engineering excellence such as good design abstractions, API stability, unit testing, leading best practices for other engineers to follow, and solid documentation.

  • Staying abreast of, and compatible with, a quickly shifting technology landscape for ML platform components and related open source solutions.

  • Creating the best ML Platform environment for Twitter that provides an exceptional developer experience for our engineering customers.

  • Encouraging engineering creativity and innovative solutions.

Our current projects include:

  • Establishing Kubeflow as a managed offering at Twitter

  • Enabling and sustaining GCP Infra/Platform components for broader use in Cortex Platform, e.g. AI Platform, Dataflow, and Dataproc

  • Improving operations of essential ML Platform services:

    • Hosted notebooks

    • ML Training Service

    • Continuous model training and deployment 

What You'll Do

If this sounds like a team you want to be part of, great! We are looking for engineers who are passionate about writing code, have a desire to learn new technologies, love working in collaborative teams, and are committed to serving their customers.

Your responsibilities include:

  • Working with and ideally contributing to OSS Kubeflow and other projects to improve Cortex Platform User Experience, Velocity & Stability.

  • Informing and accelerating GCP Infrastructure adoption best practices (sustaining and improving User Onboarding, IAM, Image Management, Twitter Systems Integrations, Security, etc.)

  • Absorbing existing SRE/Operational support scopes (GPU Cluster Management, GKE upgrades, RPM/Python Dependency Management, Capacity Planning, etc.)

  • Partnering with and supporting existing Cortex Platform teams with operational guidance and expertise on various project initiatives

  • Creating tools and automation for Operational support and management for DS/ML use cases

  • Supporting various users and developers with operational issues (e.g. “I’m having trouble scheduling GPU jobs with Persistent Volumes”)

What will success look like in your first year?

Onboard to our team and dig in by contributing to existing projects, under the mentorship of existing team members.

Build customer empathy by meeting with our customers to understand their use cases and perspectives.

Participate in team agile practices and give and receive technical feedback.

Bring your prior experience to bear on our challenges.

Take the lead on 1-3 projects, incorporating peer, customer, and leadership feedback along the way.

Join our on-call rotation.

Help us build and maintain our collaborative, kind, inclusive, and productive team culture.

Qualifications

Who You Are

Experience working with Kubernetes and Kubeflow; a huge plus if you are an OSS contributor to the project.

4+ years of experience handling services in a large-scale distributed systems environment, preferably services on GCP (e.g. BigQuery).

Strong knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack.

Strong knowledge of systems programming (bash and shell tools) and practical, proven knowledge of at least one higher-level language (e.g. Python, Go).

Comfortable working with on-prem and cloud-based infrastructure (GCP) in terms of deployment, support, monitoring, administration, and troubleshooting.

Track record of practical problem solving, plus excellent communication and documentation skills.

Proven understanding of systems and application design, including the operational trade-offs of various designs.

Able to work well with, and influence, a wide range of personalities at all levels.

Adaptable and able to focus on the simplest, most efficient, and most reliable solutions.

Solid understanding of algorithms, distributed systems design, and the software development lifecycle.

  • Maintaining version updates of Kubeflow and its components, such as the Kubeflow Pipelines API, Notebooks, MLMD, and the Training Operator.

  • Partner with Twitter’s Platform and Data Platform orgs to influence direction and to improve and expand integration opportunities

  • Partner with Compute teams to improve, enhance, and integrate with the company’s GCP Adoption & Management strategy

Additional Information

All your information will be kept confidential according to EEO guidelines. We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, color, ethnicity, ancestry, national origin, religion, sex, gender, gender identity, gender expression, sexual orientation, age, disability, veteran status, genetic information, marital status or any legally protected status.

We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Notice (Colorado Equal Pay for Equal Work Act)

The expected salary range for this role to be performed in Colorado is USD$191,000.00 - USD$267,000.00. Starting pay for the successful applicant will depend on a variety of job-related factors, which may include education, training, experience, location, business needs, or market demands. This range may be modified in the future.

This job is also eligible for participation in Twitter’s Performance Bonus Plan and Equity Incentive Plan subject to the terms of the applicable plans and policies.

Twitter offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance; a 401(k) program with employer match; and generous time off for vacation, sick time, and parental leave. Twitter's benefits prioritize employee wellness and progressive support for our diverse workforce.
