Latency on site-to-site VPN

Hi guys,

Maybe you can help. We have a client with a Meraki MX on-site which has a VPN that connects to our DC infra. It's a site-to-site VPN. The MX in the DC supports multiple peers.

Since the beginning of August the latency has gone from an average of 10ms to 30ms, and it has affected the client's ability to use certain applications hosted on our servers. The affected connection is from the office to our servers in the DC.

When we run traceroutes to the public IP of the internet interface or the public IP of the server environment, they show the packet leaving our network, and when it hits Virgin Media (the ISP) the latency jumps to around 50ms. People here are saying it's a routing issue and a Virgin issue, but Virgin are saying 30ms is within the accepted SLA for this site. What approach should I take? Do you think it's a Virgin problem or something else, i.e. networking kit?

It's just evident that the latency (ms) increases across the Virgin route, and it started happening in August. Any help appreciated.

Can explain more if needed.

Virgin responses:

We have been able to capture random bursts of utilization on the access metnet. As this utilization is snapped every 3 seconds, that could suggest the bursts are actually much higher than what I've been able to capture. This tells us that you are over-utilizing the circuit… you will need to carry out further checks on your own equipment. - Problem is utilization is fine, spoke to firewall vendor also can’t see issue

Latest:

Hi Travis, our 3rd line team have advised they cannot see any issues on our network. The traceroutes provided display a core latency at 33ms avg as an issue. This is well within the VM SLA for latency for this circuit and is not an issue. All pings to customer IP complete without packet loss or latency. As your router is unmanaged I would advise again to look further into your own equipment, as we can't see any issues through our network.

EDIT: Traceroute found in comment below.

ISSUE RESOLVED: Virgin will never admit it, but latency returned to normal and traceroute reflects that. I guess because it falls within their SLA they didn’t care to admit to anything. Still pursuing private line now.

Travis

There is no way a provider is going to guarantee sub 30ms latency between sites over a general-purpose internet connection. If going from 10 to 30ms is a problem for whatever applications you’re using, your applications probably need to be tuned. Otherwise you should switch to some kind of private line service with lower latency guarantees (if available).

Problem is utilization is fine, spoke to firewall vendor also can’t see issue

Your firewall vendor is very likely not monitoring at the same resolution your ISP has mentioned (every 3 seconds).

people here are saying it's a routing issue

People where? Do these people know what they’re talking about?

I would be looking into my own equipment and utilization with those responses. Not to say to always trust a provider but I can generally tell when they are likely correct by their responses. So you need to verify you aren’t over utilizing the circuit with microbursts. Also 30ms really shouldn’t be a big issue unless you have a really latency sensitive application.

What is the circuit size and what is your average utilization?

Do you have QoS set up?

What is the latency from end to end, as in server to client? It could be an issue on the LAN or with the server itself.

Any multicast traffic?

Site-to-site VPNs over the Internet come with no guarantee that they will even work, let alone any assurance of latency or jitter. That’s why it’s cheap. If you want a rock solid solution with an SLA, get some sort of private line between the sites and then you have every right to complain to your telco. This will be expensive.

PS 10ms to 30ms causing application issues? That’s some bad application. If it must be that close and there are no other choices, citrix it locally and remote to the citrix.

If you have applications that need consistent, low latency, then you need a private line, ideally with 1 provider.

You can try enabling QoS on your Meraki to see if you can gather some data. You might also set up something like PRTG to see if you can capture the bursts of traffic.

you need to monitor utilization and find who is using this service per ip source/destination.
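If you can get flow records out of some exporter or collector, the per-source/destination breakdown is a simple aggregation. A minimal Python sketch, assuming records have already been reduced to (source, destination, bytes) tuples; the record format, the function name and the addresses (documentation ranges) are all made up for illustration:

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per (source, destination) pair and return the n largest."""
    totals = Counter()
    for src, dst, byte_count in flows:
        totals[(src, dst)] += byte_count
    return totals.most_common(n)


# Hypothetical flow records: (source IP, destination IP, bytes).
flows = [
    ("192.0.2.10", "198.51.100.5", 1_200_000),
    ("192.0.2.11", "198.51.100.5", 300_000),
    ("192.0.2.10", "198.51.100.5", 900_000),
    ("192.0.2.12", "198.51.100.9", 50_000),
]

for (src, dst), total in top_talkers(flows):
    print(f"{src} -> {dst}: {total} bytes")
```

Run over short time slices rather than a whole day, this shows which pair is responsible when a spike happens.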

does the MX support netflow?

also, do traceroutes end to end and make sure the internet peering is not now f—ked up.

asking a carrier to troubleshoot a local circuit will not help you. you need to know if the end to end public path between you and the vpn destination is using a realistic path or if peering is taking it out of the way.

33ms in the USA is from NYC to Miami or from Chicago to DC on the plain jane public peering/private peering internet between 2 carriers. the 10ms per 1000 miles rule only applies in perfect conditions. in our world latency doubles when chicago to DC goes through NY or Texas, etc.

so post end to end traceroutes if you can in both directions. thanks
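When reading those traceroutes, keep in mind that a single slow hop in the middle often just means that router is slow to answer probes. What matters is an increase that every later hop also shows. A rough Python sketch of that check; the hop values are invented, and the function name is mine:

```python
def largest_persistent_jump(hop_rtts_ms):
    """Return (hop_number, increase_ms) for the biggest RTT increase that later hops keep.

    hop_rtts_ms holds the average RTT per hop, with None for hops that did not answer.
    """
    answered = [(i + 1, rtt) for i, rtt in enumerate(hop_rtts_ms) if rtt is not None]
    best = None
    for idx in range(1, len(answered)):
        hop, _ = answered[idx]
        prev_rtt = answered[idx - 1][1]
        # Lowest RTT seen from this hop onward: a jump only counts if it never goes away.
        floor_after = min(rtt for _, rtt in answered[idx:])
        increase = floor_after - prev_rtt
        if best is None or increase > best[1]:
            best = (hop, increase)
    return best


# Hypothetical trace: hop 4 is silent, hop 5 answers slowly, later hops settle near 32 ms.
trace = [1.0, 2.1, 9.8, None, 48.0, 31.5, 32.0, 33.1]
hop, increase = largest_persistent_jump(trace)
print(f"Largest lasting increase: about {increase:.1f} ms at hop {hop}")
```

In this made-up trace the 48ms reading at hop 5 is mostly the router's own response time; the increase that actually reaches the destination is about 22ms.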

It's just evident that the latency (ms) increases across the Virgin route, and it started happening in August. Any help appreciated.

As you mention Virgin, there's probably a few things to take into account.

Is this an actual leased line type connection, or is it their business broadband package? If it's the latter, they are not symmetric connections, they’re 1gig down and 50meg up. With the limited amount of upload bandwidth that could definitely be getting spiked.

Also they share infrastructure with home broadband, so from the beginning of August you likely have a whole extra bunch of people working from home using up shared infrastructure.

Since the beginning of August the latency has gone from an average of 10ms to 30ms, and it has affected the client's ability to use certain applications hosted on our servers.

What changed?
Make the carrier(s) answer that question.

Problem is utilization is fine, spoke to firewall vendor also can’t see issue

Are ANY of your router or firewall interfaces experiencing dropped packets on egress?
If they are, then you have a capacity issue in your environment.

Travis, our 3rd line team have advised they cannot see any issues on our network. The traceroutes provided display a core latency at 33ms avg as an issue

What is the source city and what is the destination city?

I can go from New York to Los Angeles in ~35ms (one-way – not RTT).

That’s 4,500km driving a car.
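As a rough sanity check on distance versus latency, here is a small Python sketch. It assumes light travels through fibre at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and that the fibre follows the stated distance exactly, which real paths never do, so treat the results as lower bounds:

```python
FIBRE_KM_PER_S = 200_000  # assumed propagation speed in fibre, about 2/3 of c


def one_way_delay_ms(distance_km):
    """Best-case one-way propagation delay, ignoring queuing, equipment and detours."""
    return distance_km / FIBRE_KM_PER_S * 1000


def round_trip_ms(distance_km):
    """Best-case round-trip propagation delay over the same path in both directions."""
    return 2 * one_way_delay_ms(distance_km)


print(f"4,500 km one way: {one_way_delay_ms(4500):.1f} ms")
print(f"1,000 miles (1,609 km) round trip: {round_trip_ms(1609):.1f} ms")
```

The first figure comes out at about 22.5ms, so ~35ms one way over real routes is believable. By the same arithmetic, an extra 20ms of round-trip time would correspond to roughly 2,000km of additional fibre in each direction if it were all propagation delay, which is why it is worth asking for the source and destination cities.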

If you’re using a vpn for site to site, and have specific low latency application requirements, there is almost no way to guarantee the sla over the public internet.

Things to watch for:
Site to site vpn link saturation - how much traffic can your hardware handle? It’s usually in the data sheet for that appliance. If you go over that limit you’ll see the cpu spiking and traffic dropping.

A while back one of the desktop client teams pushed an update that pegged our VPN appliance at 1Gbps of IPsec throughput, and shortly afterwards it started dropping traffic. The appliance was rated for 500Mbps of IPsec throughput.

Secondly, do any other sites experience this problem?

thanks for all your input guys, as many have said, nothing the ISP can do

we are exploring a private line for this client

EDIT: Issue resolved it seems, Virgin will never admit it, but latency returned to normal and traceroute reflects that. I guess because it falls within their SLA they didn’t care to admit to anything.

Perhaps my constant pestering paid off.

Learnt some good stuff from you all so thanks for your time.

How sure are you that the extra 20ms, however annoying, is actually the root cause of your problem?

Perhaps you could explain roughly what the app is?

Have to concur with all of these points.

Hi there,

Can you explain microbursts, would that be a sudden large spike in traffic?

They have a 1Gbps line which is nowhere near being maxed: it averaged about 30Mbps over the last month. When I monitored the live traffic feed I did see spikes of up to 800Mbps. Could this be considered a microburst? And if so, how on earth would I troubleshoot that?

I do think this application is latency sensitive. End to End latency is averaging 50ms. This reading is consistent across multiple servers we host for them.

No QoS and not sure about multicast traffic.

Many thanks.

Hi there - is there anything I should hide when posting the traceroute?

erm speed test - 779.7 down and 138.4 up

leased line type connection, or is it their business broadband package

just did another test 900 / 876.6

so leased

Well because users only started complaining around the beginning of August, from when the latency jumped from an average of 10ms to 30ms - what looks more like 50ms when doing pings.

Can you explain microbursts, would that be a sudden large spike in traffic?

Your Meraki dashboard only polls your routers every couple of minutes for utilization data.
When polled, the router will say “since the last time you asked, my average utilization has been X%”

25% average utilization over a 3 or 5 minute polling period could represent all kinds of levels of actual utilization.
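To make the averaging effect concrete, here is a minimal Python sketch. The traffic pattern (a 10-second burst at 800Mbps inside an otherwise quiet 5-minute window) is invented for illustration, and the function names are mine, not anything the Meraki dashboard exposes:

```python
def average_mbps(samples_mbps):
    """Mean of per-second throughput samples, in Mbps."""
    return sum(samples_mbps) / len(samples_mbps)


def windowed_averages(samples_mbps, window_s):
    """Average throughput over consecutive windows of window_s seconds."""
    return [
        average_mbps(samples_mbps[i:i + window_s])
        for i in range(0, len(samples_mbps), window_s)
    ]


# Hypothetical 5-minute period sampled once per second:
# 290 quiet seconds at 10 Mbps plus a 10-second burst at 800 Mbps.
traffic = [10] * 145 + [800] * 10 + [10] * 145

five_minute_view = windowed_averages(traffic, 300)   # what a slow poller reports
three_second_view = windowed_averages(traffic, 3)    # a much finer resolution

print(f"5-minute average: {five_minute_view[0]:.1f} Mbps")
print(f"Peak 3-second average: {max(three_second_view):.1f} Mbps")
```

The slow poller reports roughly 36Mbps for the whole period while the 3-second view peaks at 800Mbps, which is how "utilization is fine" and "you are bursting" can both be true.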

I find it more interesting to focus on dropped packets.

You have a 1Gbps internet circuit.
It can carry 1Gbps of traffic.

What happens when 6 systems each want to send 200Mbps of traffic to various destinations on the other end of that circuit?

That’s 1.2Gbps of traffic trying to fit into a 1.0Gbps pipe.

What happens is that the router will allocate packet buffer memory to hold traffic in a queue waiting to be transmitted.

A typical WAN router might have 100MB or so of buffer capacity (without tuning or QoS).

So, if those client systems just needed to spike to that level of transmit for just a second or two, all of that data will probably fit into buffers just fine.

But if they have a lot of data to transmit?

After a few seconds, the router will run out of buffer memory and the next packet that wants to enter the queue (waiting list) will have no available buffer capacity to be dropped into, and the packet must be dropped.
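Using the figures in this example (1.2Gbps offered into a 1.0Gbps circuit with roughly 100MB of buffer, all illustrative numbers rather than measurements of any particular router), a quick Python sketch of how long a sustained overload takes to fill the buffer:

```python
def seconds_until_buffer_full(offered_bps, link_bps, buffer_bytes):
    """Time for a sustained overload to fill an initially empty buffer.

    Returns None when the offered load fits in the link, since the queue never grows.
    """
    excess_bps = offered_bps - link_bps
    if excess_bps <= 0:
        return None
    return buffer_bytes * 8 / excess_bps


fill_time = seconds_until_buffer_full(
    offered_bps=1.2e9,     # six senders at 200 Mbps each
    link_bps=1.0e9,        # 1 Gbps circuit
    buffer_bytes=100e6,    # the ~100MB buffer assumed in the example
)
print(f"Buffer full after about {fill_time:.1f} s")
```

With these numbers the buffer lasts about 4 seconds. Smaller buffers fill proportionally faster: with 1MB of buffer the same overload would fill it in about 40ms.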

Now what happens?

The dropped-packet interface counter is incremented, and the sender and receiver of the affected TCP flow will experience packet loss: the lost segment has to be retransmitted AND the sender's congestion window will be cut in half, slowing that traffic flow down for a few seconds.

This is the congestion control mechanism baked into TCP.
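A toy Python model of that behaviour, assuming a simple Reno-style additive-increase/multiplicative-decrease scheme (real stacks differ in detail): the window grows by one segment per round trip and is halved in any round trip where a drop is detected. The loss pattern is hypothetical.

```python
def aimd_window(rounds, loss_rounds, start_segments=10):
    """Congestion window (in segments) per round trip under simple AIMD.

    loss_rounds is the set of round-trip indexes in which a packet is dropped.
    """
    window = start_segments
    history = []
    for rtt_index in range(rounds):
        if rtt_index in loss_rounds:
            window = max(1, window // 2)   # multiplicative decrease on loss
        else:
            window += 1                    # additive increase otherwise
        history.append(window)
    return history


# Hypothetical flow: 12 round trips with drops in rounds 5 and 9.
print(aimd_window(12, {5, 9}))
# prints [11, 12, 13, 14, 15, 7, 8, 9, 10, 5, 6, 7]
```

Because the window only grows once per round trip, recovering from a drop takes about three times as long in wall-clock terms at 30ms RTT as it does at 10ms.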

Eventually, the period of congestion will come to an end once everything has sent their data and the packet buffers will drain empty and everything is smooth and happy again.

If your routers or firewalls are not dropping packets then they had sufficient bandwidth OR buffer capacity to properly deliver all of the traffic your systems asked them to deliver.

If you are dropping packets all over the place, then you need to add capacity.
If you can’t add capacity, then increasing buffer sizes is something that can be considered, but deeper buffers mean more transmission latency.
If you need to minimize latency for specific applications then QoS is a method to prioritize some applications over other applications.
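The buffer-versus-latency trade-off is easy to put numbers on. A short Python sketch, assuming a single first-in-first-out queue draining at line rate; the queue depths are hypothetical:

```python
def queue_delay_ms(queued_bytes, link_bps):
    """Time a newly arrived packet waits behind queued_bytes on a link draining at link_bps."""
    return queued_bytes * 8 / link_bps * 1000


for queued in (1e6, 10e6, 100e6):   # hypothetical queue depths in bytes
    delay = queue_delay_ms(queued, 1e9)
    print(f"{queued / 1e6:>5.0f} MB queued -> {delay:.0f} ms of added delay")
```

On a 1Gbps link, 1MB in the queue adds 8ms and a full 100MB buffer adds 800ms, so about 2.5MB of standing queue would be enough to account for an extra 20ms.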

QoS is complicated, but Meraki simplifies a lot of it.

Yeah, microbursts are large bursts of traffic that happen over a short period of time. Bandwidth monitoring apps usually can’t record them for review because they happen so quickly. I don’t know your topology so wouldn’t know where to begin, but essentially you can start by checking your edge device interface for drops and errors, then work your way back and check the other interfaces involved for the same. That would be more for checking whether your device's resources are being over-utilized (buffers, bandwidth, etc…)

We had an issue once where a provider was saying we were overutilizing an aggregated circuit interface on a Cisco ASR-1000 but it didn’t show on our interface bandwidth. My solution was to set up a QoS policy with a threshold of the circuit size and a violate action of allowing traffic to pass so as not to drop traffic. So if the traffic burst above the circuit size I would see it in the policy counters. This is a bit advanced and risky and may not apply to your edge device.
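The idea behind that policy can be modelled in a few lines. This is a Python sketch of a token-bucket meter whose violate action is "transmit and count", not Cisco configuration; the class name, rate and burst size are assumptions chosen for illustration, and timestamps are expected to be non-decreasing:

```python
class BurstCounter:
    """Token-bucket meter that counts, but never drops, traffic above a rate."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_time = 0.0
        self.conform_bytes = 0
        self.violate_bytes = 0

    def observe(self, timestamp_s, packet_bytes):
        # Refill tokens for the time elapsed since the previous packet.
        elapsed = timestamp_s - self.last_time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = timestamp_s
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            self.conform_bytes += packet_bytes
        else:
            # Violate action is "transmit": the packet still passes, it is only counted.
            self.violate_bytes += packet_bytes


# Hypothetical burst: 200 packets of 1500 bytes, one microsecond apart (about 12 Gbps).
meter = BurstCounter(rate_bps=1e9, burst_bytes=150_000)
for i in range(200):
    meter.observe(i * 1e-6, 1500)

print(f"conforming bytes: {meter.conform_bytes}, violating bytes: {meter.violate_bytes}")
```

A non-zero violate counter is the evidence of bursting above the circuit size, even when the averaged interface graph looks flat.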

Microbursts are hard to capture. So before getting into all that it’s best to determine if your pattern of issues suit that possibility.

Is the application issue continuous? Happen randomly? How long does it last if randomly? Does it recover quickly? How much of your overall edge egress traffic is for this application?

If this is priority traffic then QoS is a good avenue to pursue.