John Wilkes - Building the Warehouse Scale Computer

November 5, 2018 at 3:00pm - 4:00pm
JCL, Rm. 390

Speaker: John Wilkes, Principal Software Engineer, Technical Infrastructure, Google

John Wilkes has been at Google since 2008, where he is working on automation for building warehouse scale computers. Before this, he worked on cluster management for Google's compute infrastructure (Borg, Omega, Kubernetes). He is interested in far too many aspects of distributed systems, but a recurring theme has been technologies that allow systems to manage themselves. He received a PhD in computer science from the University of Cambridge, joined HP Labs in 1982, and was elected an HP Fellow and an ACM Fellow in 2002 for his work on storage system design. Along the way, he’s been program committee chair for SOSP, FAST, EuroSys and HotCloud, and has served on the steering committees for EuroSys, FAST, SoCC and HotCloud. He’s listed as an inventor on 50+ US patents, and has an adjunct faculty appointment at Carnegie Mellon University. In his spare time he continues, stubbornly, trying to learn how to blow glass.

Abstract: Building the Warehouse Scale Computer

Imagine some product team inside Google wants 100,000 CPU cores + RAM + flash + accelerators + disk in a couple of months. We need to decide where to put them, and when; whether to deploy new machines, or re-purpose/reconfigure old ones; ensure we have enough power, cooling, networking, physical racks, data centers and (over a longer time frame) wind power; cope with variances in delivery times from supply logistics hiccups; do multi-year cost-optimal placement+decisions in the face of literally thousands of different machine configurations; keep track of parts; schedule repairs, upgrades, and installations; and generally make all this happen behind the scenes at minimum cost. And then after breakfast, we get to dynamically allocate resources (on the small-minutes timescale) to the product groups that need them most urgently, accurately reflecting the cost (opex/capex) of all the machines and infrastructure we just deployed, and monitoring and controlling the datacenter power and cooling systems to achieve minimum overheads - even as we replace all of these on the fly. This talk will highlight some of the exciting problems we're working on inside Google to ensure we can supply the needs of an organization that is experiencing (literally) exponential growth in computing capacity.

Sponsor: CERES Unstoppable Speaker Series

Host: Andrew Chien

Type: talk