
NANOG Meeting Presentation Abstract

Monitoring, managing and troubleshooting large scale networks.
Meeting: NANOG64
Date / Time: 2015-06-01 1:30pm - 2:00pm
This item is webcast
Room: Grand Ballroom
Speakers:

Peter Hoose, Facebook, Inc.

Peter has spent the last fifteen years attempting to automate himself out of a job, thus far unsuccessfully. Rest assured, he’ll keep at it until the job is done. In his current role as Network Infrastructure Manager at Facebook, his teams combine the power of automation with solid network engineering to keep Facebook’s global backbone and datacenter networks running fast, reliably, and efficiently. Today, they are building the infrastructure to deploy Wedge, 6-pack and FBOSS, Facebook’s open switching platform and network operating system. Prior to this, Peter worked as a Network Engineer at Facebook, and before that as Senior Network Engineer and Architect at NTT America, where he built custom dedicated hosting solutions for customers large and small.
Abstract: Monitoring, managing and troubleshooting large scale networks. Almost four years ago I came to NANOG and mostly complained about the state of network monitoring, par for the course for me. A lot has changed since then: we've solved many of the problems I raised. Perhaps more importantly, we've fundamentally changed how we manage, monitor and troubleshoot our network. We plan to share what we learned, what went well, and best of all, what went oh so terribly wrong.

Our driving philosophy behind this effort is that by taking an engineering approach to operations, you can greatly reduce the time to discover, mitigate and resolve issues on your network. We analyzed our faults, our pain points and the work that consumed most of our time. This allowed us to prioritize what we tackled first. We were surprised by what caused the most outages, and by how much impact minor network issues can have when they fall in the right place. As a result, the majority of the faults that occur in our network today are automatically detected and mitigated, all without human intervention.

We'll dive into some of the most interesting issues we've experienced in our network, and how we narrowed them down before and after our new tooling and monitoring was deployed. We'll walk through specific examples of remediations and how the systems function.

I'm lazy. I don't want to spend my time fixing known issues; I want to work on new problems. I want a challenge. This was the driving force behind our approach. If this sounds like you, then this talk is for you.

----

One of the keys to this effort was a system called FBAR, which interacts with our devices to perform the tasks needed to resolve issues. We'll explain in detail how this works, as well as some of our earlier remediations.

As a companion to this talk, David Swafford will be preparing a separate tutorial session to show you how to build your own system much like FBAR to help detect, isolate and remedy issues automatically.
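
To make the idea of automated remediation concrete, here is a minimal Python sketch of an alarm-to-handler dispatcher. It is not FBAR and does not describe how FBAR works internally. The `Alarm` type, the alarm kinds, the device names and the handler functions are all hypothetical, and the handlers only return a description instead of touching a device.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    device: str   # hypothetical device name
    kind: str     # hypothetical alarm kind, e.g. "link_flap"
    detail: str   # the interface or process the alarm refers to


# A remediation takes an alarm and returns a short note on what it did.
Remediation = Callable[[Alarm], str]


def drain_link(alarm: Alarm) -> str:
    # Placeholder: a real handler would talk to the device here.
    return f"drained {alarm.detail} on {alarm.device}"


def restart_agent(alarm: Alarm) -> str:
    # Placeholder: a real handler would restart the named process.
    return f"restarted {alarm.detail} on {alarm.device}"


# Known issues map to a handler; anything else goes to a person.
HANDLERS: Dict[str, Remediation] = {
    "link_flap": drain_link,
    "agent_down": restart_agent,
}


def process(alarms: List[Alarm]) -> List[str]:
    """Run the matching remediation for each alarm; escalate unknown kinds."""
    log = []
    for alarm in alarms:
        handler = HANDLERS.get(alarm.kind)
        if handler is None:
            log.append(f"escalated {alarm.kind} on {alarm.device} to a human")
        else:
            log.append(handler(alarm))
    return log
```

The design choice illustrated is that known issues are handled by code, while anything without a registered handler is escalated, so people spend their time on new problems.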
Files: Monitoring, managing and troubleshooting large scale networks. (YouTube)
Monitoring, managing and troubleshooting large scale networks. (slides, PDF)
Sponsors: None.


NANOG64 Abstracts

  • Conference Opening
    Speakers:
    Tony Tauber, Comcast; Daniel Golding, Google; Aaron Klink, Netflix;
  • Research and Education Track
    Speakers:
    Michael Sinatra, ESnet; Julie Percival, University of Texas at Dallas; Michael Smitasin, Lawrence Berkeley National Laboratory; Murat Yuksel, University of Nevada, Reno;
  • Security Track
    Speakers:
    Krassimir Tzvetanov, A10 Networks, Inc.; Merike Kaeo, DoubleShot Security;
  • Peering Track
    Speakers:
    Greg Hankins, Alcatel-Lucent; Daniel Kopp, DE-CIX; Brian Rogan, Google; Raul Sejas, Telefonica; Tom Paseka, CloudFlare; Aaron Hughes, 6connect; Elisa Jasinska, BigWave;

 
