Senior Lead Engineer - Generative AI Infrastructure (Remote-Eligible)
- Full-time
- Company: Capital One
Company Description
Jobs for Humanity is partnering with Capital One to build an inclusive and just employment ecosystem. To that end, we prioritize individuals from the following communities: Refugee, Neurodivergent, Single Parent, Blind or Low Vision, Deaf or Hard of Hearing, Black, Hispanic, Asian, Military Veterans, the Elderly, LGBTQ, and Justice-Impacted individuals. This position is open to candidates who reside in and have the legal right to work in the country where the job is located.
Company Name: Capital One
Job Description
Senior Lead Engineer - Generative AI Infrastructure (Remote-Eligible)
At Capital One, our mission is to create trustworthy, reliable, and human-in-the-loop AI systems that change banking for the better. We have been at the forefront of using machine learning to provide real-time, intelligent, and automated customer experiences. From alerting customers about unusual charges to answering their questions instantaneously, our AI applications bring simplicity and humanity to banking. With our investments in public cloud infrastructure and machine learning platforms, we are uniquely positioned to harness the power of AI. We are committed to building world-class applied science and engineering teams to deliver breakthrough product experiences and scalable, high-performance AI infrastructure. Join us at Capital One and help us reimagine how we serve our customers and businesses.
We are currently seeking an experienced Sr. Lead Engineer, Generative AI Infrastructure, to help us lay the foundations of our AI capabilities. In this role, you will work on initiatives such as building large-scale distributed training clusters, deploying LLMs on GPU instances for real-time applications and decisioning systems, and supporting cutting-edge AI research and development in our public cloud infrastructure. You will collaborate closely with our cloud and container infrastructure teams as well as our world-class AI researchers to design and implement key capabilities. Some examples of projects you will work on include:
- Deploying a thousand-node training cluster in our public cloud that optimizes the storage and networking stack and takes advantage of multiple parallelism strategies.
- Designing and building fault-tolerant infrastructure that supports long-running large-scale training tasks reliably, even in the event of individual node failures, using containers and checkpointing libraries.
- Creating run-time infrastructure for serving large ML models, such as LLMs and foundation models (FMs), in our public cloud.
- Building infrastructure for deploying search indexes and embeddings in vector databases that work seamlessly with the rest of our capabilities.
This position at Capital One is open to remote employees.
Basic Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical field.
- At least 8 years of experience designing and building data-intensive solutions using distributed computing.
- At least 4 years of experience with high-performance computing (HPC), vector embeddings, or semantic search technologies.
- At least 4 years of experience programming with languages such as Python, Go, Scala, or Java.
- At least 3 years of experience building, scaling, and optimizing training and inference systems for deep neural networks.
Preferred Qualifications:
- Master's or Doctoral degree in Computer Science, Computer Engineering, Electrical Engineering, Mathematics, or a related field.
- Background in machine learning with experience in large-scale training and deployment of deep neural nets and/or transformer architectures.
- Experience with machine learning frameworks such as TensorFlow, PyTorch, Lightning, or MosaicML.
- Ability to work in a fast-paced environment with ambiguity and competing priorities and deadlines. Experience at tech and product-driven companies/startups is preferred.
- Ability to collaborate with researchers and engineers to iteratively improve product experiences while building foundational capabilities.
- Familiarity with deploying large neural network models in demanding production environments.
- Experience building GPU clusters in the public cloud with tightly coupled storage and networking.
Location-based Salary Ranges:
- New York City (Hybrid On-Site): $230,100 - $262,700 for Sr. Lead Machine Learning Engineer
- San Francisco, California (Hybrid On-Site): $243,800 - $278,200 for Sr. Lead Machine Learning Engineer
- Remote (Regardless of Location): $195,000 - $222,600 for Sr. Lead Machine Learning Engineer
Please note that salaries for part-time roles will be prorated based on the agreed-upon number of hours worked.
At Capital One, we offer a comprehensive and competitive range of health, financial, and other benefits that support your overall well-being. Learn more about our benefits on the Capital One Careers website. Eligibility for benefits may vary based on employment status and level.
We value diversity and inclusion in the workplace and are an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, race, color, age, national origin, religion, disability, genetic information, marital status, sexual orientation, gender identity, gender reassignment, citizenship, immigration status, protected veteran status, or any other protected characteristic under applicable federal, state, or local law. We promote a drug-free workplace and will consider qualified applicants with criminal histories in a manner consistent with legal requirements.
If you require an accommodation during the application process, please contact Capital One Recruiting at 1-800-304-9102 or by email at [email protected]. All information provided will be kept confidential and used only to provide necessary reasonable accommodations.
For technical support or questions about our recruiting process, please email [email protected].
Note: This job ad is for Capital One in the United States. If you are interested in opportunities in other locations, please apply accordingly.