KubeCon + CloudNativeCon Europe 2023
In-person + Virtual
18-21 April

Please note: Times are displayed in Central European Summer Time (UTC+2). The schedule is subject to change.
Friday, April 21 • 16:00 - 16:35
Enabling HPC and ML Workloads with the Latest Kubernetes Job Features - Michał Woźniak, Google & Vanessa Sochat, Lawrence Livermore National Laboratory

In this talk, we present the new features in the Kubernetes Job API and show how they help meet the challenges of running distributed Batch/AI/HPC workloads at scale, based on real-world experience from DeepMind and the Flux Operator from Lawrence Livermore National Laboratory. We showcase the Indexed Jobs feature through its production use. First, we demonstrate how it simplifies running parallel workloads that require pod-to-pod communication, including distributed machine learning examples based on its use by DeepMind. Next, we demonstrate the orchestration of HPC workloads using the Flux Operator. Here, we create a "Mini Cluster" within Kubernetes built on top of an Indexed Job, providing a rich ecosystem for orchestrating batch workloads, related user interfaces, and APIs. We also discuss the challenge of handling pod failures for long-running workloads. We show how Pod Failure Policy can be used to continue job execution despite numerous pod disruptions (caused by events such as node maintenance or preemption), while reducing costs by avoiding unnecessary pod retries when failures stem from software bugs.
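
As a rough illustration of the two Job API features named in the abstract, the sketch below builds a `batch/v1` Job manifest that combines Indexed completion mode with a pod failure policy. It is not taken from the talk: the helper `build_indexed_job`, the container name `worker`, the image, the exit code 42, and the `backoffLimit` value are all hypothetical choices made for this example, and whether `podFailurePolicy` is available depends on the cluster version.

```python
import json


def build_indexed_job(name: str, image: str, workers: int) -> dict:
    """Build a batch/v1 Job manifest (hypothetical helper, illustration only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Indexed mode: each pod gets a stable completion index in
            # 0..completions-1, exposed as the JOB_COMPLETION_INDEX env var.
            "completionMode": "Indexed",
            "completions": workers,
            "parallelism": workers,
            # Assumed retry budget for failures that are counted.
            "backoffLimit": 3,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Assumption: exit code 42 marks a non-retriable
                        # software bug, so the whole Job fails right away.
                        "action": "FailJob",
                        "onExitCodes": {
                            "containerName": "worker",
                            "operator": "In",
                            "values": [42],
                        },
                    },
                    {
                        # Disruptions such as node drain or preemption do not
                        # count against backoffLimit.
                        "action": "Ignore",
                        "onPodConditions": [{"type": "DisruptionTarget"}],
                    },
                ]
            },
            "template": {
                "spec": {
                    # A pod failure policy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "command": [
                                "sh",
                                "-c",
                                "echo worker index $JOB_COMPLETION_INDEX",
                            ],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl accepts in place of YAML.
    print(json.dumps(build_indexed_job("demo-indexed-job", "busybox", 4), indent=2))
```

The printed JSON could be piped to `kubectl apply -f -` on a cluster that supports these fields. Workloads that need pod-to-pod communication would typically also need a headless Service and a pod subdomain, which this sketch leaves out.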

Michał Woźniak

Software Engineer, Google
Michał is a software engineer with a background in computer science, a PhD in computational biology, and 5+ years of professional experience. In his current role, he focuses on enhancing support for batch workloads in the Kubernetes ecosystem. Outside of work he enjoys playing...

Vanessa Sochat

Computer Scientist, Lawrence Livermore National Laboratory
Vanessa is a Computer Scientist at Lawrence Livermore National Laboratory and has been a software engineer for over a decade. She received her PhD in Biomedical Informatics from Stanford University, and has done extensive work on container technologies, developer tools, and fostering open...

Friday April 21, 2023 16:00 - 16:35 CEST
Hall 7 | Room D