Loading…
In-person + Virtual
18-21 April
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2023 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Central European Summer Time (UTC +2). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Back To Schedule
Thursday, April 20 • 15:25 - 16:00
The Day We Delete(d) Production - Ricardo Rocha & Spyridon Trigazis, CERN

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Feedback form is now closed.


The Kubernetes infrastructure at CERN runs a variety of workloads, from scientific computing to critical services for campus and our physics accelerator complex. It’s important to offer the features and capabilities our users require, but even more the required high levels of service. In this session we present in detail a recent incident where a rogue maintenance tool deleted a third of our production capacity in minutes, how this resulted in no downtime with only service degradation and how we were able to recover in a short time. We describe our architecture to achieve high service availability, the options we took to reduce blast radius, the concept of “clusters as cattle” and how extensive use of gitops saved the day. We will also describe some lessons learned in the process, the detected cyclic dependencies when recovering from a major outage, and the corner cases where more care is needed for stateful workloads and multi-cluster scheduling. We will demo this on stage showing how real CERN services recover from what would not so long ago be events with a very serious impact. And how the effort from the last years has paid off, with our users responding calmly and positively while going through a major incident.

Speakers
avatar for Ricardo Rocha

Ricardo Rocha

Computing Engineer, CERN
Ricardo is a Computing Engineer at CERN IT focusing on containerized deployments, networking and more recently machine learning platforms. He has led for several years the internal effort to transition services and workloads to use cloud native technologies, as well as dissemination... Read More →
ST

Spyros Trigazis

Computing Engineer, CERN
Spyros Trigazis is a computing engineer and a member of the CERN Cloud infrastructure team which provides computing resources to the High Energy Physics community. He has been contributing to open source projects like Fedora, Kubernetes and OpenStack.


Thursday April 20, 2023 15:25 - 16:00 CEST
Hall 7, Room D | Ground Floor | Europe Complex