Site Reliability Engineer / SRE - (Remote)
- Todd Ln, Austin, TX, USA
At Aera, we deliver the cognitive technology that enables the Self-Driving Enterprise™: A Cognitive Operating System™ that connects you with your business and autonomously orchestrates your operations. Aera's Cognitive OS leverages the best of artificial intelligence, machine learning, natural language processing, big data, and enterprise domain expertise to deliver Cognitive Automation at scale for some of the world's largest companies.
Once an Aera application is built by our developers, it is handed over to the Site Reliability Engineering/DevOps team. This is the team that supports the operations of our applications and services. They manage all environments, including production, sandbox, sales, and implementation. This team pushes new code to our existing customers, monitors the health, performance, and reliability of the Aera stack, and in general "keeps the lights on" with 24/7 coverage.
In this role, you will use your background as an operations generalist to work closely with our development teams, from the early stages of design all the way through identifying and resolving production issues that relate to infrastructure. You will help protect Aera assets and customer data, and you will serve as an escalation point that others can consult and trust.
This remote role can be located anywhere within the United States, but Austin, Texas, is the preferred location.
We are interested in considering every qualified candidate who is eligible to work in the United States. However, at the present time, we are not able to sponsor visas.
Design, build, release, and maintain a fully automated, Infrastructure as Code ecosystem that ensures 4+ nines availability of our SaaS platform
Continuously innovate your way out of existing and yet-to-be-discovered problems, with an eye on “what’s next” as we anticipate and remain ahead of customer expectations
Obsess about, measure, and optimize system performance, continuously pushing your capabilities beyond current boundaries as our platform scales and customer base grows
Learn what a “healthy” platform ecosystem looks like, and build “observability” into the platform that helps prevent outages from impacting service availability
Seek out and build relationships across teams that positively impact our culture of collaboration and innovation, with an understanding of how your work contributes to the bottom line of the business
What you’ll be doing
- Participating in infrastructure design, platform management, and capacity planning discussions to ensure we are scaling to meet business needs
- Writing code that automates activities that have historically been executed manually
- Gathering and analyzing metrics from our platform using observability methods to assist in performance tuning, debugging, and root cause analysis
- Collaborating with development teams to improve our platform services through innovative new designs, rigorous testing and release methods
- Ensuring we are meeting our Service Level Objectives (SLOs) by reviewing our Service Level Indicators (SLIs) and reporting deviations along with remediation and mitigation plans and schedules
- Helping restore service availability, followed by debugging and root cause analysis, for issues that occur in our production environments
- Helping provide 24/7/365 coverage in a “Follow-the-Sun” model for on-call support
- A Bachelor’s degree in Computer Science or another related technical and/or scientific discipline
- A strong background in advanced mathematics is a plus
- Ability to write code using multiple automation tools such as Terraform and Ansible
- Working knowledge of cloud-based technologies, providers, and tools such as Kubernetes, service meshes, AWS, Azure, and GCP
- Experience with large-scale distributed systems that incorporate modern databases (Cassandra, SQL) and big data platforms (Exasol)
- Experience using various real-time and historical monitoring tools such as ELK, DataDog, Prometheus, and Nagios to troubleshoot issues in our platform
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks, as well as an unwavering commitment to identifying root causes of infrastructure issues and resolving them
- 3+ years working as an SRE maintaining complex, distributed systems in real time
At Aera, we're on a mission to solve the biggest, most intractable challenges in the world of enterprise software. We envision the rise of the Self-Driving Enterprise: a more autonomously functioning business with a central operating system that connects and orchestrates business operations. Our Cognitive Operating System is increasingly used by the world's largest companies to fundamentally transform their organizations and how work is done.
If you share our passion for building the next generation of enterprise software, and deploying it for the most sophisticated customers in the world, you’ve met your match. Headquartered in Mountain View, California, we're growing fast, with teams in Mountain View and San Francisco (California), Bucharest and Cluj-Napoca (Romania), Paris (France), Munich (Germany), London (UK), Pune and Bangalore (India), Sydney (Australia) and Singapore. So join us, and let’s build the future of work together!
Aera Technology is an equal opportunity employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status. Pursuant to the San Francisco Fair Chance Ordinance, Aera Technology will consider for employment qualified applicants with arrest and conviction records.
Aera Technology respects the privacy of your data. Please take the time to read our Candidate Privacy Notice.