See also the related Monday afternoon Research Forum talk.
The hardest and most complex network problems often involve end-to-end performance and collaboration across domains. Examples include inexplicable intermittent faults, runaway processes, multiple entangled faults, and interactions among network elements that "shouldn't be happening." The longer these problems linger unresolved, the more network reliability and robustness may degrade. Troubleshooters therefore need tools that solve needle-in-a-haystack problems efficiently and effectively. Unfortunately, such tools are rare. Software designers know too little about what this support should be and what it should look like.
This BOF focuses on real-life cases and needs from people who know first-hand which software support they need for complex troubleshooting but currently lack. Our research and design team has proposed a project for national funding that aims to develop open source support for complex network troubleshooting. This session will give participants from different organizations an opportunity to recognize the challenges they share in dealing with complex problems, and a direct say in the design of the proposed tool.
Before the session, participants will be asked to answer a few open-ended questions about their most "nightmarish" problem and the factors that made it hard to resolve. During the session, we'll select and share several of these cases and use them to identify and discuss needs that current troubleshooting tools don't address. The moderator will then present a sample high-level design for supporting this type of troubleshooting and ask participants how to improve it so that it better meets the needs discussed earlier.
If you are interested in this session, please email <bmirel@si.umich.edu> with a description of your problem in the following format:
Name:
Title:
Company:
Area of specialization:
1. What was the hardest network problem that you remember working on (e.g., one so hard it kept you up nights thinking about it)?
2. What parts of the network did it involve?
3. Which strategies and moves for detection, diagnosis, and root cause analysis worked, which didn't, and why?

About the Presenter