Site Reliability Engineering (SRE)

  • Full-time
  • Project: Bifrost
  • Department: Code
  • Location: CA, Montreal (Remote/Hybrid)

Company Description

People Can Fly is one of the leading independent AAA games development studios with an international team of hundreds of talented individuals working from offices located in Poland, UK, US, and Canada, and from all over the world thanks to our remote work programs.

Founded in 2002, we made our mark on the shooter genre with titles such as Painkiller, Bulletstorm, Gears of War: Judgment, and Outriders. We are one of the most experienced Unreal Engine studios in the industry, and we are extending the engine with a set of in-house solutions called PCF Framework.

Our creative teams are currently working on several exciting AAA projects with the top publishers in the industry: Project Gemini with Square Enix and Project Dagger with Take-Two (2K), in addition to a new IP to be self-published and two other games in a concept phase. One of our IPs is also being adapted for VR technology.

With over 20 years of experience, PCF sets out to explore new horizons. We aim to combine our expertise with the creativity of the best and most forward-thinking talent in the industry to work together on a new generation of action games for the global gaming community.

If you decide to accompany us on this journey, you’ll have a chance to perfect your craft and expand your knowledge, working alongside leaders in the industry to bring a brand-new, unique experience to players worldwide.

If you feel able to deliver like nobody else, want to take ownership of your projects, and are ready to leave your mark on the games you work on, apply now!

Job Description

  • Build and deploy the cloud-native infrastructure of the online services platform.
  • Build the tools, and foster the culture, for reliability across all our services.
  • Plan for, and exercise recovery from, disasters.
  • Build and deploy the platform to cloud service providers in an automated, reproducible way. Provision additional instances for development, testing, load testing, certification and (if needed) external publishers.
  • Harden the platform; advise the programmers on maximizing the reliability, scalability and uptime of their services.
  • Deploy the tools required to ensure that maintenance, updates and recoveries of the services are quick, seamless, traceable, reproducible, and simple to revert if needed.
  • Establish disaster recovery protocols. Put them to the test.
  • Write and deploy monitoring dashboards and alerting systems to ascertain the state of online services and their dependencies in real time. Assist programmers in instrumenting their services so that they're monitored effectively.
  • Build dashboards to monitor the cost of our online systems in real time. Advise programmers on minimizing operational costs.
  • Communicate with third-party providers and/or publishers in case of outages on their end.
  • Establish protocols for 24/7 on-call support of our live games.

Qualifications

  • Typically: 2+ years of experience in a Site Reliability Engineering (SRE) or DevOps position.
  • Videogame-specific experience is useful but not mandatory.
  • Other relevant domains to look into: content distribution, ad-tech, news, mobile gaming, finance.
  • FAANG (or adjacent) experience highly sought after.
  • Strong knowledge of one or two of: Amazon Web Services, Microsoft Azure, Google Cloud Platform.
  • Experience building, deploying and operating Kubernetes clusters in cloud-native environments (EKS on Amazon, AKS on Azure, GKE on Google).
  • Knowledge of infrastructure-as-code tooling (e.g. HashiCorp Terraform) and its integration into CI/CD pipelines (e.g. Atlantis).
  • Experience deploying software on Kubernetes clusters using Docker, Helm and Argo CD (GitOps-style operations).
  • Experience with monitoring and tracing stacks: Prometheus, InfluxDB, Loki, Grafana, OpenTelemetry.
  • Deep understanding of scalability, security and maintainability considerations.
  • Ability to work efficiently under tight deadlines.
  • Knowledge of any project management and bug tracking software.
  • Strong verbal and written communication skills in English.
  • Open-minded team player attitude.
  • Strong work ethic and self-motivation.
  • Passionate about playing and making video games.

Additional Information

What we offer:

U.S.

  • Group health insurance premiums 100% paid by PCF (Medical, Dental, Vision, Group Life, and Supplemental Life), with coverage starting on day 1 of employment.
  • 401(k) with 100% match, up to 3% of employee salary, vested immediately.
  • Paid week off during Winter Holidays.
  • 20 paid vacation days and 5 paid sick days.
  • Free virtual health and mental wellbeing sessions included in the plan for members and their dependents.
  • A competitive salary and performance-based annual bonuses.
  • Personal development opportunities and ability to work in a global environment.
  • Work in a creative team with people full of passion for what they do.
  • Long-term disability, short-term disability, travel insurance, and other benefits provided.

Canada

  • Benefit package 100% paid by PCF. Insurance company reimburses 100% of claims (Up to $500 per service a year, as well as individual family coverage).
  • Full Dental coverage, including major dental and orthodontics.
  • 4% RRSP matching before tax deductions, 100% vested on day 1.
  • Paid week off during Winter Holidays.
  • 20 paid vacation days and 5 paid sick days.
  • Free virtual health and mental wellbeing sessions included in the plan for members and their dependents.
  • A competitive salary and performance-based annual bonuses.
  • Personal development opportunities and ability to work in a global environment.
  • Work in a creative team with people full of passion for what they do.