North American Network Operators Group
Re: impossible circuit
This is just a WAG but what the hell.
Jon Lewis wrote:
I've got this private line DS3. It connects cisco 7206 routers in Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
Are you sure that they are not crossing some channels in the middle and accidentally handing them to a different customer? You mention above that various portions of the DS3 ride different transport circuits in the middle. That always creates the potential for someone to not put it back together correctly on either end. I've seen DLCs get crossed before. I could easily see a transport provider crossing portions of a circuit, especially if they break it into pieces in the middle and have to put it back together on the ends.
I think it makes sense too. Somebody's getting traffic off a T1 that isn't destined for them. Their router sees it, says WTF and sends an ICMP dest unreachable via their default route through Sprint. Same thing goes for a traceroute; the router simply follows its default route to reply to your packets with the expiring TTL. Taking a path through a different provider would be expected since that router doesn't have a connected route to the source of the traceroute (since it's not the far end of your T1 that you're expecting). The site getting your crossed T1 could be using the T1 as a PtP to a branch office and has Internet through a different circuit that hasn't been hosed.
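To make the default-route argument concrete, here is a rough Python sketch of a longest-prefix-match lookup on the router at the wrong site. The routing table, prefixes (RFC 5737 documentation ranges) and next-hop labels are all made up for illustration; nothing here comes from the actual routers involved.

```python
import ipaddress

# Hypothetical routing table for the site that received the crossed T1.
# Prefixes are documentation ranges, not addresses from this thread.
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "connected: LAN"),
    (ipaddress.ip_network("198.51.100.0/30"), "connected: PtP T1 to branch"),
    (ipaddress.ip_network("0.0.0.0/0"), "default: Internet upstream"),
]

def lookup(dst):
    """Return the next hop for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The default route always matches, so matches is never empty.
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

# A reply to the real far end of the PtP link stays on the T1.
print(lookup("198.51.100.2"))
# A reply to an unknown source (the traceroute origin) goes out the default route.
print(lookup("203.0.113.7"))
```

The second lookup is the case in question: the ICMP reply has no more specific route back to the traceroute source, so it leaves via whatever upstream that site happens to use.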
I would be curious to hear if Sprint is having any problems with a circuit connected to sl-bb20-dc-6-0-0.sprintlink.net, what the router is and if any directly connected customers are having T1 problems. If nothing else Sprint should be able to track down the source of the traceroute return packets and contact the customer. The T1 could be part of a bundle at their site and they may not even realize that the bundle dropped a path.
Last Tuesday, at about 2:30PM, "something bad happened." We saw a serious jump in traffic to Ocala, and in particular we noticed one customer's connection (a group of load sharing T1s) was just totally full. We quickly assumed it was a DDoS aimed at that customer, but looking at the traffic, we couldn't pinpoint anything that wasn't expected flows.
Are you sure that the traffic being received by each of the T1s is theirs? Do you have any way of getting flows or packets off of individual T1s and not the bundle as a whole?
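If you can get per-T1 samples, something like the following Python sketch would show whether one member of the bundle is carrying traffic that doesn't belong to the customer. The customer prefix, link names and sample data are hypothetical; how you actually export per-link destinations depends on your gear.

```python
import ipaddress
from collections import Counter

# Hypothetical customer address space (documentation range).
CUSTOMER_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]

def foreign_counts(samples):
    """Count packets per link whose destination is outside the customer's prefixes.

    samples: iterable of (link_name, dst_ip) pairs captured per T1.
    """
    counts = Counter()
    for link, dst in samples:
        addr = ipaddress.ip_address(dst)
        if not any(addr in net for net in CUSTOMER_PREFIXES):
            counts[link] += 1
    return counts

# Made-up samples: T1-3 is receiving packets for someone else's addresses.
samples = [
    ("T1-1", "192.0.2.10"),
    ("T1-2", "192.0.2.11"),
    ("T1-3", "203.0.113.50"),
    ("T1-3", "203.0.113.51"),
]
print(foreign_counts(samples))
```

A link that stands out in this tally would be the first candidate for a crossed DS1.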
Tracing through you to your upstream...
7 andc-br-3-f2-0.atlantic.net (18.104.22.168) 47.951 ms 56.096 ms 56.154 ms
Circuit gets crossed onto the wrong customer. Wrong site receives a packet with an expiring TTL and goes to send a reply. Destination IP isn't on a connected route so the site sends the reply via its default route on Sprint.
10 sl-bb20-dc-6-0-0.sprintlink.net (22.214.171.124) 80.774 ms 81.030 ms 81.821 ms
Reply traverses Sprint to L3 and on to you.
12 te-10-1-0.edge2.Washington4.level3.net (126.96.36.199) 46.548 ms 53.200 ms 45.736 ms
I can't explain the continuous loop or the dupes. I'm not sure if my theory fits those symptoms or not.
Our circuit provider's support people have basically just maintained that this behavior isn't possible and so there's nothing they can do about it. i.e. that the problem has to be something other than the circuit.
Can you have them put the circuit into maintenance and have them test it end to end? They can't deny it when their TDR says that there's a problem.
I got tired of talking to their brick wall, so I contacted Sprint and was able to confirm with them that the traffic in question really was inexplicably appearing on their network...and not terribly close geographically to the Orlando/Ocala areas.
Which supports my theory of a crossed circuit. Crossing a DS1 onto the wrong DS3 or OCx could easily make it pop up anywhere. Somewhere there is another site that's having T1 problems.
So, I have a circuit that's bleeding duplicate packets onto an unrelated IP network, a circuit provider who's got their head in the sand and keeps telling me "this can't happen, we can't help you", and customers who were getting tired of receiving all their packets in triplicate (or more) saturating their connections and confusing their applications. After a while, I had to give up on finding the problem and focus on just making it stop. After trying a couple of things, the solution I found was to change the encapsulation we use at each end of the DS3. I haven't gotten confirmation of this from Sprint, but I assume they're now seeing massive input errors on the one or more circuits where our packets were/are appearing. The important thing (for me) is that this makes the packets invalid to Sprint's routers and so it keeps them from forwarding the packets to us. Cisco TAC finally got back to us the day after I "fixed" the circuit...but since it was obviously not a problem with our cisco gear, I haven't pursued it with them.
Right. By changing the encap you've basically killed the circuit. With that T1 effectively down on your end you won't be sending any packets down the problem path and aren't able to see that problem anymore with your traceroutes. However your customer with the bundle of T1s is down a circuit.
It makes sense in my mind that it's simply a crossed circuit in the middle. Your transport provider for whatever reason pulled out a DS1 and sent it down a different path. They accidentally crossed DS1s in the middle and are handing your DS1 to a Sprint customer and their DS1 to your customer. That's my theory at least.