North American Network Operators Group
Statistical Games Providers Play (RE: availability and resiliency)
Whenever providers start throwing numbers around, you've lost the battle. I always suggest people talk to their insurance agents, not their technical people. Insurance agents are very good at understanding risk, and how much to spend mitigating that risk. Sometimes it is cheaper to buy a new computer every few years than to try to build the perfect protection around it.

The used car dealer is almost never the best source of information about car insurance. Don't expect any better from a salesperson at a provider. You are going to end up with an expensive undercoat and fabric protection package.

As far as I know, the system with the highest publicly stated reliability and availability is Fedwire. Fedwire exceeds everything I've seen at NORAD, NASA, or any service provider (carrier, internet, web hosting, etc). Fedwire has five-way redundancy of some systems. It also has the full faith and credit of the US Treasury backing up its service guarantee. Even so, a software error in 1985 resulted in a $23 billion (with a "B") accounting imbalance.

If I take Fedwire as the upper limit, I need to ask: what about providers whose claims exceed what Fedwire delivers? They aren't lying, but you need to understand the numbers. And if any provider does think their system exceeds Fedwire, I would love a tour of your facility.

Due to their history as a regulated monopoly, telephone companies have developed interesting ways to calculate reliability. For example, some telephone companies ignore events which exceed the design parameters of the network. Or in other words, they don't include Mother's Day in their calculations. Some telephone companies also don't include disruptions due to Acts of God or Force Majeure in their reported numbers. I chuckle whenever I hear someone say "carrier-grade."

Availability statistics are much like flood and storm statistics. A once-in-100-year flood does not mean it will flood only once in any 100-year period. You can have back-to-back floods.
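To put rough numbers on the flood analogy, here is a minimal sketch. It assumes a "100-year flood" means an independent 1% chance in each year, which is my simplification for illustration and not how any particular agency or provider defines it. The function name is made up.

```python
from math import comb

def prob_at_least(k, years=100, p=0.01):
    """Chance of k or more events in `years` independent years,
    each with probability p (binomial model, assumed for illustration)."""
    # Subtract the probability of seeing fewer than k events.
    fewer = sum(comb(years, i) * p**i * (1 - p)**(years - i) for i in range(k))
    return 1.0 - fewer

if __name__ == "__main__":
    print(f"no flood in 100 years:         {1.0 - prob_at_least(1):.1%}")
    print(f"at least one flood:            {prob_at_least(1):.1%}")
    print(f"two or more floods:            {prob_at_least(2):.1%}")
```

Under that assumption, roughly 37% of 100-year periods see no flood at all, about 63% see at least one, and about 26% see two or more. "Once in 100 years" describes a rate, not a schedule.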
And you can have back-to-back computer failures. Nor does the statistic limit the length of an outage. You could have a 43-minute failure in Year 1, and no failures in Years 2-5. Or an 86-minute failure in Year 1 and no failures in Years 2-10. Or even an 86-minute failure in Year 1, and an 86-minute failure in Year 2, and no failures in Years 3-20. Remember in statistics when you calculated the series to infinity? If you are still around in Year Infinity, then you can discuss X 9's of availability.

Asking a provider how many 9's of reliability they provide, or the MTBF of their systems, is really a red herring. What you really want to know, and what you should ask, is:

When a failure does occur (and it will):
    how will you respond?
    how will you keep me informed?
    what do I need to do?
and, after you understand those answers:
    how often would I expect this?

No matter how many 9's you have, there is always a .1, .01, .001, .0001, etc. chance. Murphy is exceedingly good at his job.

OK, if you are still reading, and you still want to build a system as reliable as Fedwire, let's talk. Fedwire has shown it can be done; however, expect to pay as much as Fedwire. On the other hand, if you are willing to settle for just a little less, the price drops dramatically. It's a lot cheaper to build a system to meet a certain level of design risk, and buy insurance to cover the excess. It may double the price to add another "9" of reliability, but only 10% to cover the risk with insurance.

I am not a lawyer, banker, insurance agent, doctor, or Indian chief. You should always consult a licensed professional for advice.
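For anyone who wants to check the outage arithmetic in the three scenarios above, here is a minimal sketch. It assumes a 365-day year and treats availability as simply one minus downtime over elapsed time. The function names are made up for illustration, and no provider necessarily computes its numbers this way.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored (assumption)

def availability(outage_minutes, years):
    """Fraction of time the service was up over the whole period."""
    return 1.0 - outage_minutes / (years * MINUTES_PER_YEAR)

def downtime_budget(nines):
    """Minutes of downtime per year allowed at a given number of 9's,
    e.g. nines=3 means 99.9%."""
    return MINUTES_PER_YEAR * 10 ** (-nines)

if __name__ == "__main__":
    scenarios = [
        ("one 43-minute failure in 5 years", 43, 5),
        ("one 86-minute failure in 10 years", 86, 10),
        ("two 86-minute failures in 20 years", 172, 20),
    ]
    for label, minutes, years in scenarios:
        print(f"{label}: {availability(minutes, years):.4%}")
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget(n):.1f} minutes/year")
```

All three scenarios come out to the same long-run figure, about 99.998%. Measured over Year 1 alone, though, the 43-minute outage is about 99.992% and the 86-minute outage about 99.984%. The same history yields different numbers depending on the window you average over, which is why the number by itself tells you so little.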