Staff Software Engineer - (Linux & GPU Environment) - Substrate Team

  • Full-time
  • Recruitment type: Permanent

Job Description

Join the team redefining how the world experiences design.

Hey, g'day, mabuhay, kia ora, 你好, hallo, vítejte!

Thanks for stopping by. We know job hunting can be a little time-consuming and you're probably keen to find out what's on offer, so we'll get straight to the point.

Where and how you can work

Collingwood is home to our Melbourne campus - a vibrant, creative hub for connection and impactful work. While Sydney is home to our HQ, Melbourne brings its own unique vibe, with local artwork, lush greenery, and thoughtfully designed spaces to help you collaborate, focus, and feel part of a welcoming community.

This role is based in Melbourne, and we’re looking for someone who calls it home. Our hybrid way of working gives you the flexibility to work remotely, and to come together on campus for meaningful in-person collaboration and connection when it matters most.

What you'd be doing in this role

As Canva scales, change continues to be part of our DNA. But we like to think that's all part of the fun. So this will give you a flavour of the type of things you'll be working on when you start, but it will likely evolve.

The Runtime Platform Sub Group within Core Infrastructure Group keeps Canva's Linux fleet, GPU clusters, and compute layer running. Every backend service, training job, inference request, and machine that comes online depends on this work. It covers OS config for security, compliance and performance, hypervisor setup, provisioning, orchestration, and the tuning that keeps the platform fast and stable.

Canva has already moved much of its workload onto Kubernetes. The next phase is about making the compute layer stronger and more efficient. That includes GPU capacity for AI training and inference, the Linux systems underneath those clusters that make a diverse fleet of bare metal neoclouds consistent and operable at scale, the hardware labs in colocation facilities, and the path toward running Canva-owned GPUs at data centre scale. It also includes performance work at the machine level, where one change in the right place can save time and cost across the company.

This role sits in a small field of engineers who do this work well at scale. Canva is still building the tools, observability, and debugging systems that mature infrastructure teams take years to develop. The person in this seat will shape how Linux at Canva evolves over the next few years.

At the moment, this role is focused on:

  • Linux orchestration at scale: Extend our CAPI interface to more providers, tune how Linux orchestration works across cloud providers and on-prem, and shape how the platform scales as Canva's compute footprint grows.

  • GPU clusters: Support GPU clusters that run AI training and inference workloads at scale, including kernel tuning, scheduler decisions, and I/O paths that keep the hardware moving.

  • Bare metal infrastructure: Build hardware configuration and hypervisor patterns that turn the lab into production infrastructure and support Canva-owned GPUs at data centre scale.

  • Linux internals: Work on kernel-level changes, eBPF integration logic, and performance tuning at the box level.

  • Technical direction: Set the patterns the wider infrastructure group follows, write design docs that guide decisions, and help shape how Linux is used at Canva.

  • Hands-on engineering: Stay close to the code, ship to production, and make changes that improve how backend services and AI workloads run.

What success looks like

  • Canva's Linux platform supports GPU workloads across multiple cloud providers and the new bare metal infrastructure as a single system. The setup is consistent, the orchestration works cleanly, and the teams using it do not need to manage the underlying details.

You’re probably a match

We'd like to hear from you if you meet some of these requirements. You do not need to meet all of them.

Experience

  • Linux infrastructure: Built or improved Linux infrastructure at production scale and debugged issues from the kernel through to the application layer.

  • Large-scale infrastructure: Operated infrastructure at scale and handled problems where the path forward was not obvious.

  • GPU or specialised compute: Built GPU clusters, AI training infrastructure, or other large compute systems where capacity, cost, and performance all mattered.

  • Hardware and bare metal: Worked with on-prem infrastructure, hypervisor configuration, or the physical layer of a compute platform.

  • Ambiguous technical problems: Taken ownership of unclear problems, defined the options, and tested the right path forward.

  • AI tools in engineering work: Used AI tools in real work and understand what changes when AI workloads become part of infrastructure.

Technical knowledge

  • Linux kernel: Production knowledge of how Linux works.

  • C and systems languages: Production experience, especially for kernel-adjacent work.

  • GPU infrastructure: Cluster design, scheduling, and integration for AI workloads.

  • Hypervisor and virtualisation: Experience configuring KVM and knowledge of how virtualisation works under the hood.

  • Network fundamentals: Routing and the network layer underneath the OS, with the ability to work closely with network engineering peers.

  • Computer science fundamentals: Data structures, complexity, and other core engineering foundations.

Nice to have

  • Open-source work: Linux kernel, or related systems-level open-source work.

  • eBPF: Writing and integrating eBPF programs for observability, performance, or networking.

  • Performance tuning: Improved application performance through low-level tuning across network, OS, or runtime.

  • On-prem migration: Helped move compute from cloud-only to hybrid or on-prem.

  • AI training infrastructure: Worked on the infrastructure side of large-scale model training or inference.

About the Group and Team

Infra owns the platforms Canva engineers use to work: dev environments, build systems, runtime, network, and hardware.

Runtime Platform owns the core infrastructure behind Canva's global operations, including compute, networking, and traffic management. It provides the platform that teams use to run workloads across the environments Canva supports. As a Staff Engineer, you’ll report to the Senior Engineering Manager, but you'll be deployed to specific domains that span the teams in Runtime Platform.

The Substrate team sits inside Runtime Platform and owns cloud provider integration, cluster lifecycle, machine images, and the network fabric across the hyperscalers and neoclouds Canva uses. The Service Plane team builds the higher-level interfaces that backend engineers work with: mesh, scheduling, governance, and extending clusters with add-ons.

These teams maintain the infrastructure layer that other teams build on.

What's in it for you?

Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity and fun woven throughout life at Canva, too. We also offer a range of benefits to set you up for every success in and outside of work.

Here's a taste of what's on offer:

  • Equity packages - we want our success to be yours too

  • Inclusive parental leave policy that supports all parents & carers

  • An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more

  • Flexible leave options that empower you to be a force for good, take time to recharge, and support you personally

Check out lifeatcanva.com for more info.

Other stuff to know

We make hiring decisions based on your experience, skills and passion, as well as how you can enhance Canva and our culture. When you apply, please tell us the pronouns you use and any reasonable adjustments you may need during the interview process.

We celebrate all types of skills and backgrounds at Canva, so even if you don’t feel like your skills quite match what’s listed above, we still want to hear from you!

Please note that interviews are conducted virtually.
