Inter-Domain Traffic Engineering

Inter-Domain
Traffic Engineering
Principles, Applications and
Case Studies
Who We Are

Josh Wepman




Applications Engineer/Snake Oil Salesman
Ixia NetOps
jaw@ixiacom.com
Joe Abley



Toolmaker/Engineer/Token Canadian
MFN PAIX
jabley@mfnx.net
What We Are Talking About


Inter-domain Measurement, Analysis and
Control
Improving Connectivity



With whom?
Where?
At what speed?
What we are NOT talking about





MPLS
DiffServ
RSVP
CR-LDP
All sorts of other words with lots of
capital letters that have become
associated with “traffic engineering…”
Goals For The Afternoon

Methods and concepts for "improving" inter-domain connectivity
 Depending on who YOU are, "improve" will have
different meanings

Finding ways to reduce impact of failure in peer or
transit networks
 a.k.a. "increasing reliability"

WARNING: Some operational complexity may arise!
 Put on your peril-sensitive glasses...
Presentation Outline








Inter-Domain TE Goals Definition
Inter-domain TE Measurement
Applying Data to Address Your Goals
Eliciting Control and the Feedback-Loop
Conceptual Examples
Who is Doing This Stuff?
Real Live Network Examples
No Questions? Good!
Inter-Domain TE Goals Definition

Iteration-1 – Conceptual
Define Goals, Measure, Analyze, Refine Goals, Action

What is it you need to accomplish?
Examples of Goals

Need to offload my "NSFnet" peering links
outbound (congestion management)

Need to expand my inter-domain peering links
cluefully (growth)

Need to find some people to provide my
services to (sales)
 That's right, I said it…sell stuff!!!
Adjusting Your Assumptions

Be prepared to adjust your assumptions based
on measured data!

What you planned to do, and what you end up
doing may change substantially.

Do not fear - this is real network data!

Clue should increase as valid network data
becomes available and is consulted
Data Needs…



What data sets are required?

Flow-export data

BGP routing data

Active measurement data

SNMP
Some public tools available (cflowd, zebra,
ping, scotty, etc)
Some commercial products available…
Inter-domain TE Measurement
Also Known As:
Getting good, problem/goal specific data!
Assumed Network Model

Hierarchical Network Model

Ingress/Egress Network services are
separated from Transit Services

Works in other network models (as we
will show), but this is what we are
focusing on...
Hierarchical Network Model
[Figure: hierarchical network model. Labels: Core Network Services (Core1, Core2), Peer1, Peer2, LocalASN, RemoteASN (AS2, AS3, AS4, AS9)]
Types of Data to Measure

Routing Data
 Focus here is BGP

Traffic Data
 Flow-export V5 is the focus here

Active Measurement Performance Data
 Ping/Traceroute/One-way delay/Jitter
Routing Data

Routers generally do this well

Core competency by design (Routers route...)

Different data sets are available for measurement




IBGP (Good if you are looking at the whole system, looking outbound
or using a flat network model)
Route-Reflection (Often needed for inbound analysis, can create
some complexity in flat network models)
EBGP (Good for seeing your neighbor's view of you)
Choose the right one to measure based on your needs/goals
Routing Data – In/Outbound
[Figure: hierarchical network model with a Collector. Labels: IBGP vs. Route-Reflection, Core Network Services (Core1, Core2), Peer1, Peer2, LocalASN, RemoteASN (AS2, AS3, AS4, AS9), Data, Routes]
Routing Data – In/Outbound

When your goal is outbound characterization,
and your measurement point is the exit point
for traffic, IBGP is your guy/girl/other.
 Routes are always external, and thus always
 propagated (sans election and policy of course)
 “Protocols hate being anthropomorphized”

When your goal is inbound characterization,
and your measurement point is the entry
point for traffic, Route-Reflection must be
used.
 Only way to get internal routes “cleanly”
Route Data – Full Mesh
(tangent)

Value of full mesh monitoring…




Historical route tracking
Policy benchmarking
Tracking MED-selection issues
Identifying disasters the FIRST time cluefully



Don’t just wait for it to happen again!
PLEASE! For everyone’s sake!
Slightly off topic, but pretty darn important!
Route Data – Full Mesh (pic)
[Figure: a Collector in a full mesh with several Core1 and Core2 routers]
Traffic Accounting Data

Also Known As:




Flow-export
NetFlow
Cflow
A MAJOR pain in the AS!
The Quick Skinny on Flow




Packet and Byte counters per unique
set of traffic attributes
Measured from strategic routers per
input interface
Which interfaces depends on your
defined goals/needs...
Has come a long way in the last few years

In some respects… 
Flow Data Inbound - Easy
[Figure: hierarchical network model with a Collector, inbound flow data. Labels: Core Network Services (Core1, Core2), Peer1, Peer2, LocalASN, RemoteASN (AS2, AS3, AS4, AS9), Data, Routes]
Flow Data Outbound - Easy
[Figure: hierarchical network model with a Collector, outbound flow data. Labels: Core Network Services (Core1, Core2), Peer1, Peer2, LocalASN, RemoteASN (AS2, AS3, AS4, AS9), Data, Routes]
Flow Data Outbound - Harder
[Figure: non-hierarchical topology. Labels: five Core routers; external networks AS2, AS3, AS4, AS6]
Flow Data Outbound - Harder



Since flow-export data is inbound only,
all potential feeder links in a non-hierarchical,
mixed-services device must be accounted for
in order to catch all outbound traffic
Issue: How do you know what data
coming in core link4 is bound for the
local external link? Route Reflection is
bad here! Can double-count!
Problem exacerbated by complex policy
18 Words or less on flow data

Micro-management of networks based on
flows == BAD

Macro-management of networks based on
flows == GOOD
Operational Challenges (1)

Keep this in mind!

Gilb’s Law:

“Anything can be measured in a way that is
superior to not measuring it at all.”
Operational Challenges (2)




ACLs vs. data-export in the great beast!
Sampled NetFlow on the GSR is usually
distributed to the LCs
ACL > SNF > PIRC > IP Coloring >
BGP Policy accounting > FR Traffic
policing which is not FR traffic shaping
Apparently this changes in 12.0(18)S
Operational Challenges (3)


Some releases of JUNOS have bugs
where only flow data from the highest-numbered ifIndex gets exported
Check for PR20159
Operational Challenges (4)

On high-speed interfaces, the best you can
realistically do is sample at some ratio < 1:1


If you need to count bytes, this will introduce
errors
If you need to compare samples, make sure
the samples are normalized


This does NOT mean multiply by interval!
Lack of current research on statistical
validity of flow data based on samples


Last research circa 1993
Research predates substantial HTTP traffic
Operational Challenges (5)



The Gilb-Wepman Construct:
“The total P.I.T.A. factor experienced through
the process of network measurement is far
less than the total P.I.T.A. factor experienced
through planning and engineering a network
without network measurements.”
P.I.T.A. = Pain In The Ass

those without customers may be unfamiliar with
this term
Performance Data

Active measurement

Round-trip vs. one-way

mrtg and link utilization

Important, but not part of our examples



Short on time sadly…
Helps in goal selection and re-selection
Bottom line – is it better or worse?
Applying Data to your Goals

What to do with all this data?

Traffic Accounting Data applied to Routing
data?

Traffic Load per <something>


attribute or route
The focus here is on traffic stats (byte and packet
rates) per AS-PATH
AS-PATH / Traffic-data tables

Traffic load per AS-PATH creates a tree of
traffic relationships




(101) X-bits/sec
(101,1234) Y-bits/sec
(101,1234,9995) Z-bits/sec
101 -> 1234 -> 9995



X+Y+Z -> Y+Z -> Z
Addresses the middle-mile ASes instead of just the
traditional first or last ASN.
Allows "TO" (source/sink) and "THROUGH"
(transit) values instead of just "TO" values.
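
A minimal Python sketch of the TO/THROUGH idea, assuming traffic has already been totalled per AS-PATH. The function name and the bit rates are hypothetical; X, Y and Z are stood in for by 300, 200 and 100 bits/sec.

```python
from collections import defaultdict

def to_and_through(path_rates):
    """Split per-AS-PATH rates into TO (final AS) and THROUGH (transit AS) totals."""
    to = defaultdict(float)
    through = defaultdict(float)
    for path, rate in path_rates.items():
        if not path:
            continue
        last = path[-1]
        to[last] += rate
        # Use a set so a prepended (repeated) AS is counted once per path,
        # and never count the final AS as its own transit.
        for asn in set(path[:-1]) - {last}:
            through[asn] += rate
    return dict(to), dict(through)

# Hypothetical rates in bits/sec for the three paths on the slide.
rates = {
    (101,): 300.0,             # X
    (101, 1234): 200.0,        # Y
    (101, 1234, 9995): 100.0,  # Z
}
to, through = to_and_through(rates)
total_101 = to[101] + through[101]  # X + Y + Z
```

With these numbers AS 101 carries X+Y+Z in total, AS 1234 carries Y+Z and AS 9995 carries Z, matching the tree above.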
Data Aggregation - Time

Aggregate data over timeframes
(macro-level view)


Long term averages
Short term benchmarks


Of course, short term means “~long term”.
Micro-management of networks based on flows

BAD!
Data Aggregation - Interfaces

Aggregate across the set of interfaces
that represent your problem statement

What interfaces am I interested in?




Can be interface specific (one)
Can be router specific (many)
Can be domain wide (all)
Can be N of M interfaces (some)

Pretty common…
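
A rough Python sketch of aggregating over both time and a chosen set of interfaces. The record layout (router, ifindex, ts, bytes) is an assumption made for illustration, not a real flow-export format, and the numbers are made up.

```python
def average_bps(records, interfaces, start, end):
    """Average bits/sec over [start, end) for flows seen on the chosen interfaces."""
    if end <= start:
        raise ValueError("end must be later than start")
    total_bytes = sum(
        r["bytes"]
        for r in records
        if (r["router"], r["ifindex"]) in interfaces and start <= r["ts"] < end
    )
    return total_bytes * 8 / (end - start)

# Hypothetical flow records; ts is seconds since the start of the measurement.
records = [
    {"router": "peer1", "ifindex": 3, "ts": 10, "bytes": 1000},
    {"router": "peer2", "ifindex": 7, "ts": 20, "bytes": 2000},
    {"router": "core1", "ifindex": 1, "ts": 30, "bytes": 5000},   # not in our set
    {"router": "peer1", "ifindex": 3, "ts": 120, "bytes": 3000},  # outside the window
]
# "N of M interfaces": just the two peering interfaces we care about.
peering = {("peer1", 3), ("peer2", 7)}
rate = average_bps(records, peering, start=0, end=100)
```

Changing the interface set or the window changes the question being asked, which is the point: pick both from the problem statement.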
What to do with all this?

What does one do once they have all
this data?
Eliciting Control and The
Feedback Loop



Sit down, Josh
Begone with your Snake Oil
It’s time to beat on some routers
Assumptions about your
Routing Architecture




Routes to external networks are in BGP
Your IGP tells you how to find the NEXT_HOP
addresses in BGP
We select exit points for traffic based on BGP
path selection, not some other weird thing
If your routing policy differs significantly from
this, you have more problems than
measurement can solve
Fixing Outbound Traffic

Mark policy on BGP routes at the place
where you learn them


General policy -- prefer peering links over
expensive transit links, prefer private
peering links over public peering links
Specific policy -- temporarily avoid NAP X
for traffic to AS Y, prefer AS C to reach
remote network D
Tweakable Knobs




LOCAL_PREF
MED
AS_PATH
Check your vendor’s BGP path selection
tiebreaker list, and choose a set of knobs
that gives you the kind of control your
policy dictates
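
A deliberately simplified Python sketch of how these three knobs can interact. It assumes the common ordering (highest LOCAL_PREF, then shortest AS_PATH, then lowest MED) and ignores every other tiebreaker; real implementations normally compare MED only between routes from the same neighbor AS, so treat this as an illustration and not as any vendor's algorithm. The Route class, exit labels and AS numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    exit_point: str   # hypothetical label for where the route was learned
    local_pref: int
    as_path: tuple
    med: int = 0

def preference_key(route):
    # Lower tuple sorts first: highest LOCAL_PREF, shortest AS_PATH, lowest MED.
    return (-route.local_pref, len(route.as_path), route.med)

def best_exit(routes):
    """Pick the preferred route under the simplified ordering above."""
    return min(routes, key=preference_key)

# General policy from the previous slide, expressed as LOCAL_PREF values:
# private peering over public peering over transit.
candidates = [
    Route("transit", local_pref=100, as_path=(64500, 64510)),
    Route("public-peer", local_pref=200, as_path=(64501, 64502, 64510)),
    Route("private-peer", local_pref=300, as_path=(64503, 64504, 64510)),
]
best = best_exit(candidates)
```

Note that the transit route has the shortest AS_PATH but still loses, because LOCAL_PREF is compared first in this ordering.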
Control of Outbound Traffic




Danger, Will Robinson!
Helpdesk phone may ring
Small change, pause, check, log, pause,
breathe, repeat
Exit selection is a reasonably precise
science
Fixing Inbound Traffic



Controlling inbound traffic flow is all
about trying to influence the BGP path
selection decisions that happen in
networks you don’t control
Some of those networks you pay money
to. Money is sometimes an appropriate
weapon
It’s nice to buy people drinks at NANOG
Tweakable Knobs

Provider-specific knobs
 whois -h whois.ra.net as1755

CIDR abuse
 Cheap trick
 Longest prefix wins

AS_PATH stuffing

AS_PATH pollution
 Another cheap trick
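
To see why "longest prefix wins" makes CIDR abuse work, here is a small Python sketch using the standard ipaddress module. The prefixes are documentation addresses and the link names are hypothetical.

```python
import ipaddress

def longest_match(destination, announcements):
    """Return the label of the most specific announcement covering destination."""
    dst = ipaddress.ip_address(destination)
    matches = []
    for prefix, via in announcements:
        net = ipaddress.ip_network(prefix)
        if dst in net:
            matches.append((net.prefixlen, via))
    if not matches:
        return None
    # The announcement with the longest prefix length wins.
    return max(matches)[1]

# Hypothetical: the covering /24 goes out one link, a more specific /25 out another.
announcements = [
    ("198.51.100.0/24", "link-A"),
    ("198.51.100.0/25", "link-B"),
]
low_half = longest_match("198.51.100.10", announcements)
high_half = longest_match("198.51.100.200", announcements)
```

Traffic for the lower half of the /24 follows the /25 wherever it was heard, regardless of the other attributes on the /24.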
Responsible Citizenship

Some tweakable knobs have an
unwelcome impact on the networks of
others



Have you met my friend, MED?
Your relationship with your target
networks is symbiotic
It is inappropriate to make demands of
someone else’s routing policy, but asking
nicely is OK
Conceptual Examples (1)

Who are the top consumers of my
network resources?



Top sources of traffic
Top sinks of traffic
Asymmetry
Conceptual Examples (2)

Traffic Aggregation Points and Peering
Optimisation


Appropriate network expansion
Offloading the expensive peer


Mitigating settlement fees and traffic ratios
Mitigating congestion


Do it without MED selection issues
Maximize route availability (N>1 copies, not 1 or 0)
Conceptual Examples (3)

Theft-over-IP (how to know when peers
are stealing from you)



Peers dumping traffic at you for routes you
didn’t send them
Rather rude
Catch them in the act
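
One way to catch them, sketched in Python under some assumptions: you have the list of prefixes you advertised to the peer, and flow records for traffic arriving from that peer. The record layout and the addresses (documentation prefixes) are hypothetical.

```python
import ipaddress

def unadvertised_flows(flows, advertised_prefixes):
    """Return flows whose destination is outside everything advertised to the peer."""
    nets = [ipaddress.ip_network(p) for p in advertised_prefixes]
    suspicious = []
    for flow in flows:
        dst = ipaddress.ip_address(flow["dst"])
        if not any(dst in net for net in nets):
            suspicious.append(flow)
    return suspicious

# Hypothetical: we only sent this peer two prefixes.
advertised = ["192.0.2.0/24", "198.51.100.0/24"]
# Hypothetical flows seen inbound on the peering interface.
flows = [
    {"dst": "192.0.2.10", "bytes": 1500},   # covered by an advertised route
    {"dst": "203.0.113.5", "bytes": 9000},  # no route sent for this destination
]
caught = unadvertised_flows(flows, advertised)
```

A hit here is only a lead: check that the advertised list really matches what the peer was sent before picking up the phone.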
Who is doing this stuff?

Yahoo! - Jeffrey Papen (TUNDRA Tool)


Peering Analysis, Capacity Planning, Performance
Analysis
Features:

Custom macros for AS analysis:





Source and Destination AS bandwidth details
Transit AS (hop counts) bandwidth summary data
Bandwidth forecasting; peering merit analysis
Billing formulas for cost/benefit budget analysis
Also:



Analyze internal usage for Charge Back Billing
POP-to-POP Network Performance Analysis (latency / loss)
DOS attack detection
Destination vs. Transit Traffic – UUNet
(Yahoo – TUNDRA Output)
Who is doing this stuff?



MFN
Lots of people, we think
Not enough people, we think
Real Live Network Examples 1




We peer with a particular large regional
ISP in several places. Due to various
familiar reasons, the demands on the
peering circuits approach supply
Who are the top talkers and top listeners
that we reach via this peer?
Maybe we can peer with them directly
Not just sinks, but traffic aggregation
points (middle mile)
Network Facts




Topology is not pure core/edge in some
locations, so we might expect some
complexities
All peering routers happen to be
GSR12000s
Peering circuits are all OC12
Backbone links are mostly OC48
Data Collection

Relative traffic volumes






Low NetFlow sample ratio is OK
Turning on “ip route-cache flow sampled”
seems like it can cause traffic belches
Turn off all inbound ACLs on peering
interfaces
Turn off all outbound ACLs on peering routers
Drink from the Hose
Take off every /var
Analysis of Data



Relative byte count through and to
networks reached through the peer in
question
Ranked list of peering candidates
Absolute numbers don’t really matter; we
have a list of people we should be
talking to, in order of how useful they
would be to peer with
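
A minimal Python sketch of producing such a ranked list from per-AS "through" and "to" rates. The function name, AS numbers and rates are hypothetical; only the relative shares matter.

```python
def rank_candidates(rates, exclude=()):
    """rates maps asn -> (bps_through, bps_to); returns [(asn, share)] by descending share."""
    totals = {
        asn: thru + to
        for asn, (thru, to) in rates.items()
        if asn not in exclude
    }
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (asn, total / grand_total if grand_total else 0.0)
        for asn, total in ranked
    ]

# Hypothetical sampled rates for networks reached via the peer in question.
rates = {
    64496: (1000.0, 64000.0),   # mostly a sink
    64497: (2400.0, 14600.0),
    64498: (7300.0, 0.0),       # pure middle mile: all "through"
    64499: (500.0, 500.0),      # the peer itself, excluded below
}
candidates = rank_candidates(rates, exclude={64499})
```

Because the output is a share of sampled traffic, a low sample ratio does not change the ordering much, which is why absolute numbers are not needed for this question.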
SeeASP Output
Facets:
 Time Interval:    12/4/01 11:03:59.55 - 12/6/01 13:40:10.02 EST
 Router Ipv4 Addr: 63.136.120.65
 Router AS:        3549
 Router Name:      Diamond Joe

AS     ppsThru  bpsThru  ppsTo  bpsTo    ppsTotal  bpsTotal
-----  -------  -------  -----  -------  --------  --------
3561   1.05     933.57   74.34  64.633K  75.39     65.567K
701    4.63     2.401K   35.21  14.653K  39.84     17.054K
209    0.82     7.324K   0      1.36     0.82      7.325K
3967   0.6      297.19   11.3   5.694K   11.91     5.991K
6461   0        3.1      11.51  4.790K   11.51     4.793K
8112   0        0        0.57   4.699K   0.57      4.699K
19262  0.57     4.699K   0      0        0.57      4.699K
7018   8.44     4.244K   0      1.26     8.44      4.246K
1      8.6      3.576K   0      1.56     8.6       3.578K
87     0        0        8.16   3.396K   8.16      3.396K
286    0.24     2.621K   0      0.05     0.24      2.621K
2603   0.24     2.620K   0      0.24     0.24      2.620K
1653   0        0.05     0.24   2.619K   0.24      2.620K
10764  0        0        5.36   2.230K   5.36      2.230K
703    0        1.25     4.36   1.815K   4.37      1.816K
7660   0        0        3.23   1.344K   3.23      1.344K
3549   0        0        2.75   1.306K   2.75      1.306K
14265  0        0        1.05   934      1.05      934
Real Live Network Examples 2




AS R wants to peer
That’s fine, we’ll public peer with
anybody. We’re easy.
AS R wants to private peer right away,
since they say we send them 140M of
traffic already
Can we confirm those numbers before
we dedicate a port to them?
Network Facts




We currently reach AS R through AS T
We peer with AS T in six places
One of the peering routers is a 7500,
which doesn’t do SNF
One of the peering routers is a router
which is also being used to collect data
to answer the previous question
More Network Facts



Topology is not edge/core everywhere
We want numbers out of this, so we
need to manage the SNF ratios
K1dd13s keep attacking the routers



Ops folk attack K1dd13s with ACLs
The ACL attacks the SNF
The SNF dies!
Analysis



We only have traffic samples, but we
want absolute numbers
We have interface byte and packet
counters
We can take AS R traffic as a proportion
of all AS T traffic, and divide up the
mrtg/duck data in proportion
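
The proportional split can be sketched in a few lines of Python. This assumes the samples are unbiased and taken at comparable ratios on every AS T peering link; the function name and all numbers are hypothetical.

```python
def estimate_absolute_bps(sampled_bytes_by_as, target_as, interface_bps):
    """Apportion a measured interface rate by the target AS's share of sampled bytes."""
    total = sum(sampled_bytes_by_as.values())
    if total == 0:
        raise ValueError("no sampled traffic to apportion")
    share = sampled_bytes_by_as.get(target_as, 0) / total
    return share * interface_bps

# Hypothetical sampled byte counts on the links to AS T, keyed by destination AS.
samples = {"R": 2_500, "everyone else": 7_500}
# Hypothetical outbound rate toward AS T from the interface counters, in bits/sec.
measured_bps = 400e6
estimate = estimate_absolute_bps(samples, "R", measured_bps)
```

Here AS R is a quarter of the sampled bytes, so it is credited with a quarter of the counter-derived rate. The result can then be compared against the figure AS R quoted.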
Summary

What did we talk about?
 Answering specific, ad-hoc questions by attacking
 them with numbers
 Inter-Domain Traffic Engineering is an iterative
 process (lather, rinse, repeat)

What didn’t we talk about?
 Experience exporting from Juniper (and other non-Cisco) routers
 Construction of a full-time, general-purpose
 measurement infrastructure
 What if my vendor does not support flow-export and
 traffic accounting?
Questions?

No? Good.