
DDoS flood protection


Denial of Service (DoS) attacks represent a significant threat to the ongoing operations of many businesses. When most revenue is derived from on-line operations, a DDoS attack can put a company out of business. There are many flavors of DDoS attack, but the objective is always the same: to saturate a resource, such as a router, switch, firewall or web server, with multiple simultaneous and bogus requests from many different sources. These attacks generate large volumes of traffic - 100Gbit/s attacks are now common - making mitigation a challenge.

The 3 minute video demonstrates Flood Protect - a DDoS mitigation solution that leverages industry standard sFlow instrumentation in commodity data center switches to provide real-time detection and mitigation of DDoS attacks. Flood Protect is an application running on InMon's Switch Fabric Accelerator SDN controller. Other applications provide visibility and accelerate fabric performance, applying controls to reduce latency and increase throughput.
An early version of Flood Protect won the 2014 SDN Idol competition in a joint demonstration with Brocade.
Visit sFlow.com to learn more, evaluate pre-release versions of these products, or discuss requirements.

Fabric visibility with Cumulus Linux

A leaf and spine fabric is challenging to monitor. The fabric spreads traffic across all the switches and links in order to maximize bandwidth. Unlike traditional hierarchical network designs, where a small number of links can be monitored to provide visibility, a leaf and spine network has no special links or switches where running CLI commands or attaching a probe would provide visibility. Even if it were possible to attach probes, the effective bandwidth of a leaf and spine network can be as high as a Petabit/second, well beyond the capabilities of current generation monitoring tools.

The 2 minute video provides an overview of some of the performance challenges with leaf and spine fabrics and demonstrates Fabric View - a monitoring solution that leverages industry standard sFlow instrumentation in commodity data center switches to provide real-time visibility into fabric performance.

Fabric View is free to try: register at http://www.myinmon.com/ and request an evaluation. The software requires an accurate network topology in order to characterize performance, and this article describes how to obtain the topology from a Cumulus Networks fabric.

Complex Topology and Wiring Validation in Data Centers describes how Cumulus Networks' prescriptive topology manager (PTM) provides a simple method of verifying and enforcing correct wiring topologies. The following ptm.py script converts the topology from PTM's dot notation to the JSON representation used by Fabric View:
#!/usr/bin/env python

import sys, re, fileinput, requests, json

url = sys.argv[1]
top = {'links':{}}

# strip matching quotes from switch and port names
def dequote(s):
    if (s[0] == s[-1]) and s.startswith(("'", '"')):
        return s[1:-1]
    return s

l = 1
for line in fileinput.input(sys.argv[2:]):
    # match links of the form switch1:port1 -- switch2:port2
    link = re.search('([\S]+):(\S+)\s*(--|->)\s*(\S+):([^\s;,]+)',line)
    if link:
        s1 = dequote(link.group(1))
        p1 = dequote(link.group(2))
        s2 = dequote(link.group(4))
        p2 = dequote(link.group(5))
        linkname = 'L%d' % (l)
        l += 1
        top['links'][linkname] = {'node1':s1,'port1':p1,'node2':s2,'port2':p2}

# post the topology to the Fabric View server
requests.put(url,data=json.dumps(top),headers={'content-type':'application/json'})
The following example demonstrates how to use the script, converting the file topology.dot to JSON and posting the result to the Fabric View server running on host fabricview:
./ptm.py http://fabricview:8008/script/fabric-view.js/topology/json topology.dot
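For reference, PTM topology files describe each link as a pair of switch:port endpoints in graphviz dot syntax. A minimal topology.dot (the switch and port names here are hypothetical) might look like:
graph G {
    "spine1":"swp1" -- "leaf1":"swp49";
    "spine1":"swp2" -- "leaf2":"swp49";
    "spine2":"swp1" -- "leaf1":"swp50";
    "spine2":"swp2" -- "leaf2":"swp50";
}
Running the script against this file would PUT a JSON document of the form {"links":{"L1":{"node1":"spine1","port1":"swp1","node2":"leaf1","port2":"swp49"},...}} to the Fabric View server.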
Cumulus Networks, sFlow and data center automation describes how to enable sFlow on Cumulus Linux. Configure all the switches in the leaf and spine fabric to send sFlow to the Fabric View server and you should immediately start to see data through the web interface, http://fabricview:8008/. The video provides a quick walkthrough of the software features.
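For reference, sFlow on Cumulus Linux is exported by the open source Host sFlow agent. If DNS Service Discovery isn't used, a minimal /etc/hsflowd.conf pointing at the Fabric View server might look like the following sketch (the collector address and rates are examples, not recommendations):
sflow {
  DNSSD = off
  polling = 20
  sampling = 10000
  collector {
    ip = 10.0.0.50
    udpport = 6343
  }
}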

REST API for Cumulus Linux ACLs

RESTful control of Cumulus Linux ACLs included a proof of concept script that demonstrated how to remotely control iptables entries in Cumulus Linux.  Cumulus Linux in turn converts the standard Linux iptables rules into the hardware ACLs implemented by merchant silicon switch ASICs to deliver line rate filtering.

Previous blog posts demonstrated how remote control of Cumulus Linux ACLs can be used for DDoS mitigation and Large "Elephant" flow marking.

A more advanced version of the script is now available on GitHub:

https://github.com/pphaal/acl_server/

The new script adds the following features:
  1. It now runs as a daemon.
  2. Exceptions generated by cl-acltool are caught and handled.
  3. Rules are compiled asynchronously, reducing the response time of REST calls.
  4. Updates are batched, supporting hundreds of operations per second.
The script doesn't provide any security, which may be acceptable if access to the REST API is limited to the management port, but is generally unacceptable for production deployments.

Fortunately, Cumulus Linux is an open Linux distribution that allows additional software components to be installed. Rather than adding authentication and encryption to the script itself, it is possible to install additional software and leverage the capabilities of a mature web server such as Apache. The operational steps needed to secure access to Apache are well understood, and the large Apache community ensures that security issues are quickly identified and addressed.

This article will demonstrate how Apache can be used to proxy REST operations for the acl_server script, allowing mature and familiar Apache features to be applied to secure access to the ACL service.

Download the acl_server script from GitHub and change the following line to limit access to requests made by other processes on the switch:
server = HTTPServer(('',8080), ACLRequestHandler)
Limiting access to localhost, 127.0.0.1:
server = HTTPServer(('127.0.0.1',8080), ACLRequestHandler)
Next, install the script on the switch:
root# mv acl_server /etc/init.d/
root# chmod 755 /etc/init.d/acl_server
root# service acl_server start
Now install Apache:
root#  echo 'deb http://ftp.us.debian.org/debian  wheezy main contrib' \
>>/etc/apt/sources.list.d/deb.list
root# apt-get update
root# apt-get install apache2
Next, enable the Apache proxy modules:
root# a2enmod proxy proxy_http
Create an Apache configuration file /etc/apache2/conf.d/acl_server with the following contents:
<IfModule mod_proxy.c>
ProxyRequests off
ProxyVia off
ProxyPass /acl/ http://127.0.0.1:8080/acl/
ProxyPassReverse /acl/ http://127.0.0.1:8080/acl/
</IfModule>
Make any additional changes to the Apache configuration to encrypt and authenticate requests.
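For example, a minimal sketch of HTTP basic authentication for the proxied path (the AuthUserFile location is illustrative):
<Location /acl/>
AuthType Basic
AuthName "ACL Service"
AuthUserFile /etc/apache2/htpasswd
Require valid-user
</Location>
Create users with the htpasswd utility and consider enabling mod_ssl so that credentials aren't sent in the clear.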

Finally, restart Apache:
root# service apache2 restart
The above steps are easily automated using tools like Puppet or Ansible that are available for Cumulus Linux.

The following examples demonstrate the REST API.

Create an ACL

curl -H "Content-Type:application/json" -X PUT --data '["[iptables]","-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP"]' http://10.0.0.233/acl/ddos1
ACLs are sent as a JSON encoded array of strings. Each string will be written as a line in a file stored under /etc/cumulus/acl/policy.d/ - See Cumulus Linux: Netfilter - ACLs. For example, the rule above will be written to the file 50rest-ddos1.rules with the following content:
[iptables]
-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP
This iptables rule blocks all traffic from UDP port 53 (DNS) to host 10.10.100.10. This is the type of rule that might be inserted to block a DNS amplification attack.

Retrieve an ACL

curl http://10.0.0.233/acl/ddos1
Returns the result:
{"result": ["[iptables]", "-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP"]}

List ACLs

curl http://10.0.0.233/acl/
Returns the result:
{"result": ["ddos1"]}

Delete an ACL

curl -X DELETE http://10.0.0.233/acl/ddos1

Delete all ACLs

curl -X DELETE http://10.0.0.233/acl/
Note: this doesn't delete all the ACLs on the switch, just the ones created using the REST API. Default ACLs and manually created ACLs are not accessible through the REST API.

The acl_server script batches and compiles changes after the HTTP requests complete. Batching has the benefit of increasing throughput and reducing request latency, but it makes compilation errors harder to track since they are reported later. The acl_server script captures the output and status when running cl-acltool and attaches a lastError object to the results of subsequent requests to notify the client of problems.
{"lastError": {"returncode": 255, "lines": [...]}, "result":[...]}
The REST API is intended to be used by automation systems, so syntax problems with the ACLs they generate should be rare and likely indicate a software bug. A controller using this API should check responses for the presence of lastError, log the lastError values so that the problem can be debugged, and finally delete all the rules created through the REST API to restore the system to its default state.
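As a minimal sketch, a Python 2 client might implement these checks as follows (the switch address and ACL name are the examples used above):
#!/usr/bin/env python
import requests, json

base = 'http://10.0.0.233/acl/'
acl = ['[iptables]',
       '-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP']

# push the rule - compilation happens asynchronously after the call returns
requests.put(base + 'ddos1', data=json.dumps(acl),
             headers={'content-type':'application/json'})

# a later request reports problems from the batched compilation
result = requests.get(base + 'ddos1').json()
if 'lastError' in result:
    print 'ACL compilation failed: %s' % json.dumps(result['lastError'])
    # restore the switch to its default state
    requests.delete(base)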

While this REST API could be used as a convenient way to manually push an ACL to a switch, the API is intended to be part of automation solutions that combine real-time traffic analytics with automated control. Cumulus Linux includes standard sFlow measurement support, delivering real-time network wide visibility to drive solutions that include: DDoS mitigation, enforcing black lists, marking large flows, ECMP load balancing, packet brokers etc.

Hybrid OpenFlow ECMP testbed


SDN fabric controller for commodity data center switches describes how the real-time visibility and hybrid control capabilities of commodity data center switches can be used to automatically adapt the network to changing traffic patterns and optimize performance. The article identifies hybrid OpenFlow as a critical component of the solution, allowing SDN to be combined with proven distributed routing protocols (e.g. BGP, ISIS, OSPF, etc) to deliver scaleable, production ready solutions that fully leverage the capabilities of commodity hardware.

This article will take the example of large flow marking that has been demonstrated using physical switches and show how Mininet can be used to emulate hybrid control of data center networks and deliver realistic results.
The article Elephant Detection in Virtual Switches & Mitigation in Hardware describes a demonstration by VMware and Cumulus Networks that shows how real-time detection and marking of large "Elephant" flows can dramatically improve application response time for small latency sensitive "Mouse" flows without impacting the throughput of the Elephants - see Marking large flows for additional background.
Performance optimizing hybrid OpenFlow controller demonstrated how hybrid OpenFlow can be used to mark Elephant flows on a top of rack switch. However, building test networks with physical switches to test the controller with realistic topologies is expensive and time consuming.

Mininet offers an attractive alternative, providing a lightweight network emulator that can be run in a virtual machine on a laptop and realistically simulate network topologies. In this example, Mininet will be used to emulate the four switch leaf and spine network shown in the diagram at the top of this article.

The sFlow-RT SDN controller includes a leafandspine.py script that configures Mininet to emulate ECMP leaf and spine fabrics with hybrid OpenFlow capable switches. To run the emulation, copy the leafandspine.py script from the sFlow-RT extras directory to your Mininet system and run the following command to create the leaf and spine network:
sudo ./leafandspine.py --collector=10.0.0.162 --controller=10.0.0.162 --topofile=/var/www/html/topology.json
There are a few points to note about the emulation:
  1. While physical networks might have link speeds ranging from 1Gbit/s to 100Gbit/s, the emulation scales link speeds down to 10Mbit/s so that they can be emulated in software.
  2. The sFlow sampling rate is scaled proportionally - see Large flow detection.
  3. A pair of OpenFlow 1.3 tables is used to emulate normal ECMP forwarding and hybrid OpenFlow overrides.
  4. Linux Traffic Control (tc) commands are used to emulate hardware priority queueing based on Differentiated Services Code Point (DSCP) class, mapping DSCP class 8 to a lower priority or "less than best effort" queue.
  5. The script posts the topology as a JSON file under the default Apache document root so that it can be retrieved remotely by an SDN controller.
  6. In this example the sFlow-RT controller is running on host 10.0.0.162 - change the address to match your setup.
The following script runs the ping command to test response time and plots the results as a simple text-based bar chart:
#!/bin/bash
SCALE=1
SCALE=${2-$SCALE}
ping $1 | awk -v SCALE=$SCALE 'BEGIN {FS="[= ]"; } NF==11 { n = $10 * SCALE; bar = ""; while(n >= 1) { bar = bar "*"; n-- } print bar "" $10 " ms" }'
Open an xterm on host h1 and run the command:
./pingtest 10.0.1.1 10
Next type the following command at the Mininet prompt to generate a large flow:
iperf h2 h3
The following screen capture shows the result of the iperf test:
The reported throughput of around 10Mbit/s shows that the traffic is saturating the emulated 10Mbit/s links.

The following screen capture shows the ping results during the iperf test.
The ping test clearly shows the impact that the Elephant flow is having on response time. In addition, the increased response times of around 3ms are consistent with the values in the VMware / Cumulus Networks charts shown earlier.

The following sFlow-RT mark.js script implements an SDN controller that marks Elephant flows:
include('extras/leafandspine-hybrid.js');

// get topology from Mininet
setTopology(JSON.parse(http('http://10.0.0.30/topology.json')));

// Define large flow as greater than 1Mbits/sec for 1 second or longer
var bytes_per_second = 1000000/8, duration_seconds = 1;

// define TCP flow cache
setFlow('tcp',
  {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
   filter:'direction=ingress',
   value:'bytes', t:duration_seconds}
);

// set threshold identifying Elephant flows
setThreshold('elephant',
  {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:4});

// set OpenFlow marking rule when Elephant is detected
var idx = 0;
setEventHandler(function(evt) {
  // only mark at the edge - ignore flows detected on inter-switch links
  if(topologyInterfaceToLink(evt.agent,evt.dataSource)) return;
  var port = ofInterfaceToPort(evt.agent,evt.dataSource);
  if(port) {
    var dpid = port.dpid;
    var id = "mark" + idx++;
    var k = evt.flowKey.split(',');
    var rule = {
      match:{in_port: port.port, dl_type:2048, ip_proto:6,
             nw_src:k[0], nw_dst:k[1], tcp_src:k[2], tcp_dst:k[3]},
      actions:["set_ip_dscp=8","output=normal"], priority:1000, idleTimeout:5
    };
    logInfo(JSON.stringify(rule,null,1));
    setOfRule(dpid,id,rule);
  }
},['elephant']);
About the script:
  1. The included leafandspine-hybrid.js script emulates hybrid OpenFlow by rewriting the NORMAL OpenFlow action to jump to the table that contains the ECMP forwarding rules.
  2. The script assumes that the Mininet emulation is running on host 10.0.0.30. Modify the address in the setTopology() function for your setup.
  3. The setFlow() function instructs the controller to build a flow cache that tracks TCP connections.
  4. The setThreshold() function defines Elephant flows as TCP connections that exceed 10% of the link's bandwidth (in this case 1Mbit/second) for 1 second or more.
  5. The setEventHandler() function processes each Elephant flow notification and applies an OpenFlow marking rule to the ingress port on the edge switch where the traffic enters the fabric, as illustrated below.
  6. The OpenFlow rules have an idleTimeout of 5 seconds, ensuring that they are automatically deleted by the switch when the flow ends.
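For illustration, a marking rule logged by the controller might look like the following (the addresses and port numbers are hypothetical):
{
 "match": {
  "in_port": 1,
  "dl_type": 2048,
  "ip_proto": 6,
  "nw_src": "10.0.0.1",
  "nw_dst": "10.0.1.1",
  "tcp_src": "47916",
  "tcp_dst": "5001"
 },
 "actions": ["set_ip_dscp=8", "output=normal"],
 "priority": 1000,
 "idleTimeout": 5
}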
Modify the sFlow-RT start.sh script to include the following settings:
-Dopenflow.controller.start=yes 
-Dopenflow.controller.flushRules=no
-Dscript.file=mark.js
Start sFlow-RT:
./start.sh
Repeat the iperf test.
The iperf results show that throughput of large flows is unaffected by the controller.
The screen capture shows the controller actions. The controller installs an OpenFlow rule as soon as the large flow is detected, setting the ip_dscp value to 8 and outputting the packets to the normal ECMP forwarding pipeline. The marked packets are treated as lower priority than the ping packets. Since the ping packets aren't stuck behind the deep queues caused by the iperf tests, the reported response times should be unaffected by the large flow.
The ping test confirms that with the controller running, response times are unaffected by Elephant flows - an approximately 10 times improvement in response time that is consistent with the results for a physical switch in the VMware / Cumulus charts shown earlier.

More broadly, hybrid OpenFlow provides an effective way to deliver SDN solutions in production, using OpenFlow to enhance the performance of existing networks. In addition to large flow marking, other cases described on this blog include: DDoS mitigation,  enforcing black lists, ECMP load balancing, and packet brokers.

Increasingly, vendors recognize the critical importance of hybrid OpenFlow in delivering practical SDN solutions - HP proposes hybrid OpenFlow discussion at Open Daylight design forum. The article Super NORMAL offers some suggestions for enhancing hybrid OpenFlow to address additional use cases, reduce operational complexity and increase reliability in production settings.

Finally, the sFlow measurement standard is critical to unlocking the full potential of hybrid OpenFlow. Support for sFlow is built into commodity switch hardware, providing cost effective visibility into traffic on production networks. The comprehensive real-time traffic analytics delivered by sFlow allows an SDN controller to effectively target actions, managing the limited hardware resources on the switches, to enhance network performance and security.

OpenFlow integration

Northbound APIs for traffic engineering describes how sFlow and OpenFlow provide complementary monitoring and control capabilities that can be combined to create software defined networking (SDN) solutions that automatically adapt the network to changing traffic and address high value use cases such as: DDoS mitigation, enforcing black lists, ECMP load balancing, and packet brokers.

The article describes the challenge of mapping between the different methods used by sFlow and OpenFlow to identify switch ports:
  • Agent IP address ⟷ OpenFlow switch ID
  • SNMP ifIndex ⟷ OpenFlow port ID
The recently published sFlow OpenFlow Structures extension addresses the challenge by providing a way for switches to export the mapping as an sFlow structure.
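Paraphrasing the extension (the published specification at sFlow.org is authoritative for the XDR and structure numbers), each switch exports a counter structure of roughly this form alongside the standard interface counters:
/* OpenFlow port */
/* opaque = counter_data; see the specification for the format number */
struct of_port {
    unsigned hyper datapath_id;
    unsigned int port_no;
}
Since the structure accompanies the standard sFlow counters for the same interface, a collector can build the agent/ifIndex to datapath/port mapping automatically.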

Open vSwitch recently implemented the extension, unifying visibility and control of the virtual network edge. In addition, most physical switches that support OpenFlow also support sFlow. Ask vendors about their plans to implement the sFlow OpenFlow Structures extension since it is a key enabler for SDN control applications.

Open vSwitch performance monitoring

Credit: Accelerating Open vSwitch to “Ludicrous Speed”
Accelerating Open vSwitch to "Ludicrous Speed" describes the architecture of Open vSwitch. When a packet arrives, the OVS Kernel Module checks its cache to see if there is an entry that matches the packet. If there is a match then the packet is forwarded within the kernel. Otherwise, the packet is sent to the user space ovs-vswitchd process to determine the forwarding decision based on the set of OpenFlow rules that have been installed or, if no rules are found, by passing the packet to an OpenFlow controller. Once a forwarding decision has been made, the packet and the forwarding actions are passed back to the OVS Kernel Module which caches the decision and forwards the packet. Subsequent packets in the flow will then be matched by the cache and forwarded within the kernel.

The recent Open vSwitch 2014 Fall Conference included the talk, Managing Open vSwitch across a large heterogeneous fleet by Chad Norgan, describing Rackspace's experience with running a large scale OpenStack deployment using Open vSwitch for network virtualization. The talk describes the key metrics that Rackspace collects to monitor the performance of the large pools of Open vSwitch instances.

This article discusses the metrics presented in the Rackspace talk and describes how the embedded sFlow agent in Open vSwitch was extended to efficiently export the metrics.
The first chart trends the number of entries in each of the OVS Kernel Module caches across all the virtual switches in the OpenStack deployment.
The next chart trends the cache hit / miss rates for the OVS Kernel Module. Processing packets using cached entries in the kernel is much faster than sending the packet to user space and requires far fewer CPU cycles and so maintaining a high cache hit rate is critical to handling the large volume of traffic in a cloud data center.
The third chart from the Rackspace presentation tracks the CPU consumed by ovs-vswitchd as it handles cache misses. Excessive CPU utilization can result in poor network performance and dropped packets. Reducing the CPU cycles consumed by networking frees up resources that can be used to host additional virtual machines and generates additional revenue.

Currently, monitoring Open vSwitch cache performance involves polling each switch using the ovs-dpctl command and collecting the results. Polling is complex to configure and maintain, and operational complexity is reduced if Open vSwitch is able to push the metrics - see Push vs Pull.
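For example, polling typically means running a command like ovs-dpctl show on every host and scraping the output (the values here are illustrative):
$ ovs-dpctl show
system@ovs-system:
        lookups: hit:1346 missed:72 lost:0
        flows: 9
        port 0: ovs-system (internal)
        port 1: eth0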

The following sFlow structure was defined to allow Open vSwitch to export cache statistics along with the other sFlow metrics that are pushed by the sFlow agent:
/* Open vSwitch data path statistics */
/* see datapath/datapath.h */
/* opaque = counter_data; enterprise = 0; format = 2207 */
struct ovs_dp_stats {
    unsigned int hits;
    unsigned int misses;
    unsigned int lost;
    unsigned int mask_hits;
    unsigned int flows;
    unsigned int masks;
}
The sFlow agent was also extended to export CPU and memory statistics for the ovs-vswitchd process by populating the app_resources structure - see sFlow Application Structures.
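For reference, the app_resources structure includes fields along these lines (paraphrased - see the sFlow Application Structures specification for the authoritative XDR and format numbers):
/* Application resources */
struct app_resources {
    unsigned int user_time;     /* CPU time, user space (ms) */
    unsigned int system_time;   /* CPU time, kernel (ms) */
    unsigned hyper mem_used;    /* bytes */
    unsigned hyper mem_max;
    unsigned int fd_open;       /* open file descriptors */
    unsigned int fd_max;
    unsigned int conn_open;     /* open network connections */
    unsigned int conn_max;
}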

These extensions are the latest in a set of recent enhancements to the Open vSwitch sFlow implementation. The Open vSwitch project first added sFlow support five years ago, and these recent enhancements build on the detailed visibility into network traffic provided by the core Open vSwitch sFlow implementation and the complementary visibility into hosts, hypervisors, virtual machines and containers provided by the Host sFlow project.
Visibility and the software defined data center
Broad support for the sFlow standard across the cloud data center stack provides simple, efficient, low cost, scaleable, and comprehensive visibility. The standard metrics can be consumed by a broad range of open source and commercial tools, including: sflowtool, sFlow-Trend, sFlow-RT, Ganglia, Graphite, InfluxDB and Grafana.

Fabric visibility with Arista EOS

A leaf and spine fabric is challenging to monitor. The fabric spreads traffic across all the switches and links in order to maximize bandwidth. Unlike traditional hierarchical network designs, where a small number of links can be monitored to provide visibility, a leaf and spine network has no special links or switches where running CLI commands or attaching a probe would provide visibility. Even if it were possible to attach probes, the effective bandwidth of a leaf and spine network can be as high as a Petabit/second, well beyond the capabilities of current generation monitoring tools.

The 2 minute video provides an overview of some of the performance challenges with leaf and spine fabrics and demonstrates Fabric View - a monitoring solution that leverages industry standard sFlow instrumentation in commodity data center switches to provide real-time visibility into fabric performance.

Fabric View is free to try: register at http://www.myinmon.com/ and request an evaluation. The software requires an accurate network topology in order to characterize performance, and this article describes how to obtain the topology from a fabric of Arista Networks switches.

Arista EOS™ includes the eAPI JSON-RPC service for programmatic monitoring and control. The article Arista eAPI 101 introduces eAPI and describes how to enable the service in EOS. Enable eAPI on all the switches in the fabric.

Configure all the switches in the leaf and spine fabric to send sFlow to the Fabric View server. The following script demonstrates how sFlow can be configured programmatically using an eAPI script:
#!/usr/bin/env python

import socket
from jsonrpclib import Server

switch_list = ['switch1.example.com','switch2.example.com']
username = "admin"
password = "password"

sflow_collector = "192.168.56.1"
sflow_port = "6343"
sflow_polling = "20"
sflow_sampling = "10000"

for switch_name in switch_list:
    # use the switch's own address as the sFlow agent source address
    switch_ip = socket.gethostbyname(switch_name)
    switch = Server("https://%s:%s@%s/command-api" %
                    (username, password, switch_name))
    response = switch.runCmds(1,
        ["enable",
         "configure",
         "sflow source %s" % switch_ip,
         "sflow destination %s %s" % (sflow_collector, sflow_port),
         "sflow polling-interval %s" % sflow_polling,
         "sflow sample output interface",
         "sflow sample dangerous %s" % sflow_sampling,
         "sflow run"])
Next use the following eAPI script to discover the topology:
#!/usr/bin/python
'''
Copyright (c) 2015, Arista Networks, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of Arista Networks nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL ARISTA NETWORKS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'''

# v0.5 - initial version of the script to discover network topology using
# Arista eAPI and generate output in json format recognized by sFlow-RT.

from jsonrpclib import Server
import json
from pprint import pprint

# define switches in your topology, eAPI transport protocol (http or https),
# eAPI username and password
switch_list = ['switch1.example.com','switch2.example.com']
eapi_transport = 'https'
eapi_username = 'admin'
eapi_password = 'password'

debug = False

# internal variables used by the script
allports = {}
allswitches = {}
allneighbors = []
alllinks = {}

# method to populate allswitches and allports - called only from processNeighbor()
def addPort(switchname, switchIP, portname, ifindex):
    id = switchname + '>' + portname
    prt = allports.setdefault(id, { "portname": portname, "linked": False })
    if ifindex is not None:
        prt["ifindex"] = ifindex
    sw = allswitches.setdefault(switchname, { "name": switchname, "agent": switchIP, "ports": {} })
    if switchIP is not None:
        sw["agent"] = switchIP
    sw["ports"][portname] = prt

# method to collect neighbor records - called with each LLDP neighbor
# entry as they are discovered
def processNeighbor(localname,localip,localport,localifindex,remotename,remoteport):
    addPort(localname, localip, localport, localifindex)
    addPort(remotename, None, remoteport, None)
    allneighbors.append({ "localname": localname, "localport": localport,
                          "remotename": remotename, "remoteport": remoteport })

# method to remove agents that we did not discover properly, or
# that we did not intend to include in the topology. (If we
# assigned an agent field to the switch then we assume it should stay.)
def pruneAgents():
    for nm,sw in allswitches.items():
        if sw['agent'] == '0.0.0.0' or not sw['agent']:
            del allswitches[nm]

# method to test for a new link - called only from findLinks()
def testLink(nbor,linkno):
    swname1 = nbor["localname"]
    swname2 = nbor["remotename"]
    # one of the switches might have been pruned out
    if swname1 not in allswitches or swname2 not in allswitches:
        return False
    sw1 = allswitches[swname1]
    sw2 = allswitches[swname2]
    pname1 = nbor["localport"]
    pname2 = nbor["remoteport"]
    port1 = sw1["ports"][pname1]
    port2 = sw2["ports"][pname2]
    if not port1["linked"] and not port2["linked"]:
        # add new link
        linkid = "link" + str(linkno)
        port1["linked"] = True
        port2["linked"] = True
        alllinks[linkid] = {
            "node1": nbor["localname"],
            "port1": nbor["localport"],
            "node2": nbor["remotename"],
            "port2": nbor["remoteport"]
        }
        return True
    return False

# method to find unique links - call at the end once all the LLDP records have
# been processed from all the switches
def findLinks():
    linkcount = 0
    for nbor in allneighbors:
        if testLink(nbor, linkcount+1):
            linkcount += 1

# method to dump topology in json format recognized by sFlow-RT
def dumpTopology():
    topology = { "nodes": allswitches, "links": alllinks }
    print(json.dumps(topology, indent=4))

# method to get LLDP neighbors of each switch - calls processNeighbor() for each LLDP neighbor found
def getLldpNeighbors(switch_name):
    try:
        switch = Server('%s://%s:%s@%s/command-api' % (eapi_transport, eapi_username, eapi_password, switch_name))

        # Get LLDP neighbors
        commands = ["enable", "show lldp neighbors"]
        response = switch.runCmds(1, commands, 'json')
        neighbors = response[1]['lldpNeighbors']

        # Get local hostname
        commands = ["enable", "show hostname"]
        response = switch.runCmds(1, commands, 'json')
        hostname = response[1]['hostname']

        # Get SNMP ifIndexes
        commands = ["enable", "show snmp mib ifmib ifindex"]
        response = switch.runCmds(1, commands, 'json')
        interfaceIndexes = response[1]['ifIndex']

        # Get sFlow agent source address
        commands = ["enable", "show sflow"]
        response = switch.runCmds(1, commands, 'json')
        sflowAddress = response[1]['ipv4Sources'][0]['ipv4Address']

        # Create 2D array lldp_neighbors where each row has the entries:
        # <remote name>, <local port>, <remote port>, <local ifindex>
        lldp_neighbors = []
        for neighbor in neighbors:
            lldp_neighbors.append([neighbor['neighborDevice'].split('.')[0],
                                   neighbor['port'], neighbor['neighborPort'],
                                   interfaceIndexes[neighbor['port']]])

        if debug:
            pprint(lldp_neighbors)

        # collect switches, ports and neighbor-relationships
        for row in lldp_neighbors:
            processNeighbor(hostname,
                            sflowAddress,
                            row[1], # localport
                            row[3], # localifindex
                            row[0], # remotename
                            row[2]) # remoteport

        # Print list of LLDP neighbors in human friendly format:
        # neighbor <n>, <remote name>, connected to local <local port> with remote <remote port>
        if debug:
            print "Switch %s has the following %d neighbors:" % (hostname, len(neighbors))
            for i, neighbor in enumerate(lldp_neighbors):
                print "#%d neighbor, %s, connected to local %s with remote %s" % (i+1, neighbor[0], neighbor[1], neighbor[2])

    except:
        print 'Exception while connecting to %s' % switch_name
        return []


for switch in switch_list:
    getLldpNeighbors(switch)

pruneAgents()
findLinks()
dumpTopology()
The script outputs a JSON representation of the topology, for example:
{
    "nodes": {
        "leaf332": {
            "name": "leaf332",
            "agent": "10.10.130.142",
            "ports": {
                "Management1": {
                    "portname": "Management1",
                    "ifindex": 999001,
                    "linked": false
                },
                "Ethernet50/1": {
                    "portname": "Ethernet50/1",
                    "ifindex": 50001,
                    "linked": true
                },
                "Ethernet36": {
                    "portname": "Ethernet36",
                    "ifindex": 36,
                    "linked": true
                },
                "Ethernet51/1": {
                    "portname": "Ethernet51/1",
                    "ifindex": 51001,
                    "linked": true
                },
                "Ethernet52/1": {
                    "portname": "Ethernet52/1",
                    "ifindex": 52001,
                    "linked": true
                },
                "Ethernet49/1": {
                    "portname": "Ethernet49/1",
                    "ifindex": 49001,
                    "linked": true
                },
                "Ethernet12": {
                    "portname": "Ethernet12",
                    "ifindex": 12,
                    "linked": false
                },
                "Ethernet35": {
                    "portname": "Ethernet35",
                    "ifindex": 35,
                    "linked": true
                }
            }
        },
        "leaf259": {
            "name": "leaf259",
            "agent": "10.10.129.220",
            "ports": {
                "Management1": {
                    "portname": "Management1",
                    "ifindex": 999001,
                    "linked": false
                },
                "Ethernet5/1": {
                    "portname": "Ethernet5/1",
                    "ifindex": 5001,
                    "linked": true
                },
                "Ethernet29": {
                    "portname": "Ethernet29",
                    "ifindex": 29,
                    "linked": true
                },
                "Ethernet32": {
                    "portname": "Ethernet32",
                    "ifindex": 32,
                    "linked": true
                },
                "Ethernet6/1": {
                    "portname": "Ethernet6/1",
                    "ifindex": 6001,
                    "linked": true
                },
                "Ethernet31": {
                    "portname": "Ethernet31",
                    "ifindex": 31,
                    "linked": true
                },
                "Ethernet30": {
                    "portname": "Ethernet30",
                    "ifindex": 30,
                    "linked": true
                },
                "Ethernet15/1": {
                    "portname": "Ethernet15/1",
                    "ifindex": 15001,
                    "linked": false
                }
            }
        },
        "leaf331": {
            "name": "leaf331",
            "agent": "10.10.130.141",
            "ports": {
                "Management1": {
                    "portname": "Management1",
                    "ifindex": 999001,
                    "linked": false
                },
                "Ethernet50/1": {
                    "portname": "Ethernet50/1",
                    "ifindex": 50001,
                    "linked": true
                },
                "Ethernet36": {
                    "portname": "Ethernet36",
                    "ifindex": 36,
                    "linked": true
                },
                "Ethernet1": {
                    "portname": "Ethernet1",
                    "ifindex": 1,
                    "linked": false
                },
                "Ethernet51/1": {
                    "portname": "Ethernet51/1",
                    "ifindex": 51001,
                    "linked": true
                },
                "Ethernet52/1": {
                    "portname": "Ethernet52/1",
                    "ifindex": 52001,
                    "linked": true
                },
                "Ethernet49/1": {
                    "portname": "Ethernet49/1",
                    "ifindex": 49001,
                    "linked": true
                },
                "Ethernet11": {
                    "portname": "Ethernet11",
                    "ifindex": 11,
                    "linked": false
                },
                "Ethernet35": {
                    "portname": "Ethernet35",
                    "ifindex": 35,
                    "linked": true
                }
            }
        },
        "leaf260": {
            "name": "leaf260",
            "agent": "10.10.129.221",
            "ports": {
                "Management1": {
                    "portname": "Management1",
                    "ifindex": 999001,
                    "linked": false
                },
                "Ethernet11/1": {
                    "portname": "Ethernet11/1",
                    "ifindex": 11001,
                    "linked": false
                },
                "Ethernet5/1": {
                    "portname": "Ethernet5/1",
                    "ifindex": 5001,
                    "linked": true
                },
                "Ethernet29": {
                    "portname": "Ethernet29",
                    "ifindex": 29,
                    "linked": true
                },
                "Ethernet32": {
                    "portname": "Ethernet32",
                    "ifindex": 32,
                    "linked": true
                },
                "Ethernet6/1": {
                    "portname": "Ethernet6/1",
                    "ifindex": 6001,
                    "linked": true
                },
                "Ethernet31": {
                    "portname": "Ethernet31",
                    "ifindex": 31,
                    "linked": true
                },
                "Ethernet30": {
                    "portname": "Ethernet30",
                    "ifindex": 30,
                    "linked": true
                }
            }
        },
        "core210": {
            "name": "core210",
            "agent": "10.10.129.185",
            "ports": {
                "Ethernet3/3/1": {
                    "portname": "Ethernet3/3/1",
                    "ifindex": 3037,
                    "linked": false
                },
                "Ethernet3/6/1": {
                    "portname": "Ethernet3/6/1",
                    "ifindex": 3073,
                    "linked": true
                },
                "Ethernet3/5/1": {
                    "portname": "Ethernet3/5/1",
                    "ifindex": 3061,
                    "linked": true
                },
                "Ethernet3/2/1": {
                    "portname": "Ethernet3/2/1",
                    "ifindex": 3025,
                    "linked": false
                },
                "Ethernet3/8/1": {
                    "portname": "Ethernet3/8/1",
                    "ifindex": 3097,
                    "linked": true
                },
                "Ethernet3/1/1": {
                    "portname": "Ethernet3/1/1",
                    "ifindex": 3013,
                    "linked": false
                },
                "Management1/1": {
                    "portname": "Management1/1",
                    "ifindex": 999011,
                    "linked": false
                },
                "Ethernet3/34/1": {
                    "portname": "Ethernet3/34/1",
                    "ifindex": 3409,
                    "linked": false
                },
                "Ethernet3/31/1": {
                    "portname": "Ethernet3/31/1",
                    "ifindex": 3373,
                    "linked": false
                },
                "Ethernet3/7/1": {
                    "portname": "Ethernet3/7/1",
                    "ifindex": 3085,
                    "linked": true
                }
            }
        },
        "core212": {
            "name": "core212",
            "agent": "10.10.129.64",
            "ports": {
                "Ethernet3/3/1": {
                    "portname": "Ethernet3/3/1",
                    "ifindex": 3037,
                    "linked": false
                },
                "Ethernet3/12/1": {
                    "portname": "Ethernet3/12/1",
                    "ifindex": 3145,
                    "linked": false
                },
                "Ethernet3/2/1": {
                    "portname": "Ethernet3/2/1",
                    "ifindex": 3025,
                    "linked": false
                },
                "Ethernet3/13/1": {
                    "portname": "Ethernet3/13/1",
                    "ifindex": 3157,
                    "linked": false
                },
                "Ethernet3/31/1": {
                    "portname": "Ethernet3/31/1",
                    "ifindex": 3373,
                    "linked": false
                },
                "Ethernet3/32/1": {
                    "portname": "Ethernet3/32/1",
                    "ifindex": 3385,
                    "linked": false
                },
                "Ethernet3/18/1": {
                    "portname": "Ethernet3/18/1",
                    "ifindex": 3217,
                    "linked": true
                },
                "Ethernet3/28/1": {
                    "portname": "Ethernet3/28/1",
                    "ifindex": 3337,
                    "linked": true
                },
                "Ethernet3/33/1": {
                    "portname": "Ethernet3/33/1",
                    "ifindex": 3397,
                    "linked": false
                },
                "Ethernet3/5/1": {
                    "portname": "Ethernet3/5/1",
                    "ifindex": 3061,
                    "linked": true
                },
                "Ethernet3/8/1": {
                    "portname": "Ethernet3/8/1",
                    "ifindex": 3097,
                    "linked": true
                },
                "Ethernet3/34/1": {
                    "portname": "Ethernet3/34/1",
                    "ifindex": 3409,
                    "linked": false
                },
                "Ethernet3/36/1": {
                    "portname": "Ethernet3/36/1",
                    "ifindex": 3433,
                    "linked": false
                },
                "Ethernet3/35/1": {
                    "portname": "Ethernet3/35/1",
                    "ifindex": 3421,
                    "linked": false
                },
                "Ethernet3/15/1": {
                    "portname": "Ethernet3/15/1",
                    "ifindex": 3181,
                    "linked": true
                },
                "Ethernet3/7/1": {
                    "portname": "Ethernet3/7/1",
                    "ifindex": 3085,
                    "linked": true
                },
                "Ethernet3/16/1": {
                    "portname": "Ethernet3/16/1",
                    "ifindex": 3193,
                    "linked": true
                },
                "Ethernet3/17/1": {
                    "portname": "Ethernet3/17/1",
                    "ifindex": 3205,
                    "linked": true
                },
                "Management1/1": {
                    "portname": "Management1/1",
                    "ifindex": 999011,
                    "linked": false
                },
                "Ethernet3/26/1": {
                    "portname": "Ethernet3/26/1",
                    "ifindex": 3313,
                    "linked": true
                },
                "Ethernet3/25/1": {
                    "portname": "Ethernet3/25/1",
                    "ifindex": 3301,
                    "linked": true
                },
                "Ethernet3/21/1": {
                    "portname": "Ethernet3/21/1",
                    "ifindex": 3253,
                    "linked": false
                },
                "Ethernet3/11/1": {
                    "portname": "Ethernet3/11/1",
                    "ifindex": 3133,
                    "linked": false
                },
                "Ethernet3/6/1": {
                    "portname": "Ethernet3/6/1",
                    "ifindex": 3073,
                    "linked": true
                },
                "Ethernet3/27/1": {
                    "portname": "Ethernet3/27/1",
                    "ifindex": 3325,
                    "linked": true
                },
                "Ethernet3/1/1": {
                    "portname": "Ethernet3/1/1",
                    "ifindex": 3013,
                    "linked": false
                },
                "Ethernet3/23/1": {
                    "portname": "Ethernet3/23/1",
                    "ifindex": 3277,
                    "linked": false
                },
                "Ethernet3/22/1": {
                    "portname": "Ethernet3/22/1",
                    "ifindex": 3265,
                    "linked": false
                }
            }
        }
    },
    "links": {
        "link5": {
            "node1": "leaf260",
            "node2": "core212",
            "port2": "Ethernet3/15/1",
            "port1": "Ethernet31"
        },
        "link4": {
            "node1": "leaf260",
            "node2": "core212",
            "port2": "Ethernet3/5/1",
            "port1": "Ethernet30"
        },
        "link7": {
            "node1": "leaf259",
            "node2": "core210",
            "port2": "Ethernet3/6/1",
            "port1": "Ethernet29"
        },
        "link6": {
            "node1": "leaf260",
            "node2": "core212",
            "port2": "Ethernet3/25/1",
            "port1": "Ethernet32"
        },
        "link1": {
            "node1": "leaf260",
            "node2": "leaf259",
            "port2": "Ethernet5/1",
            "port1": "Ethernet5/1"
        },
        "link3": {
            "node1": "leaf260",
            "node2": "core210",
            "port2": "Ethernet3/5/1",
            "port1": "Ethernet29"
        },
        "link2": {
            "node1": "leaf260",
            "node2": "leaf259",
            "port2": "Ethernet6/1",
            "port1": "Ethernet6/1"
        },
        "link9": {
            "node1": "leaf259",
            "node2": "core212",
            "port2": "Ethernet3/16/1",
            "port1": "Ethernet31"
        },
        "link8": {
            "node1": "leaf259",
            "node2": "core212",
            "port2": "Ethernet3/6/1",
            "port1": "Ethernet30"
        },
        "link15": {
            "node1": "leaf331",
            "node2": "core212",
            "port2": "Ethernet3/17/1",
            "port1": "Ethernet51/1"
        },
        "link14": {
            "node1": "leaf331",
            "node2": "core212",
            "port2": "Ethernet3/7/1",
            "port1": "Ethernet50/1"
        },
        "link17": {
            "node1": "leaf332",
            "node2": "core210",
            "port2": "Ethernet3/8/1",
            "port1": "Ethernet49/1"
        },
        "link16": {
            "node1": "leaf331",
            "node2": "core212",
            "port2": "Ethernet3/27/1",
            "port1": "Ethernet52/1"
        },
        "link11": {
            "node1": "leaf331",
            "node2": "leaf332",
            "port2": "Ethernet35",
            "port1": "Ethernet35"
        },
        "link10": {
            "node1": "leaf259",
            "node2": "core212",
            "port2": "Ethernet3/26/1",
            "port1": "Ethernet32"
        },
        "link13": {
            "node1": "leaf331",
            "node2": "core210",
            "port2": "Ethernet3/7/1",
            "port1": "Ethernet49/1"
        },
        "link12": {
            "node1": "leaf331",
            "node2": "leaf332",
            "port2": "Ethernet36",
            "port1": "Ethernet36"
        },
        "link20": {
            "node1": "leaf332",
            "node2": "core212",
            "port2": "Ethernet3/28/1",
            "port1": "Ethernet52/1"
        },
        "link19": {
            "node1": "leaf332",
            "node2": "core212",
            "port2": "Ethernet3/18/1",
            "port1": "Ethernet51/1"
        },
        "link18": {
            "node1": "leaf332",
            "node2": "core212",
            "port2": "Ethernet3/8/1",
            "port1": "Ethernet50/1"
        }
    }
}

Access the Fabric View web interface at http://fabricview:8008/ and navigate to the Settings tab:
Upload the JSON topology file by clicking on the disk icon in the Topology section. Alternatively, the topology can be installed programmatically using the Fabric View REST API documented at the bottom of the Settings page.
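For example, assuming the same REST endpoint used in the companion Cumulus Linux article, the topology file generated above could be pushed with curl:
curl -X PUT -H "Content-Type:application/json" --data @topology.json \
http://fabricview:8008/script/fabric-view.js/topology/json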

As soon as the topology is installed, traffic data should start appearing in Fabric View. The video provides a quick walkthrough of the software features.

Cloud analytics

Librato is an example of a cloud based analytics service (now part of SolarWinds). Librato provides an easy to use REST API for pushing metrics into their cloud service. The web portal makes it simple to combine and trend data and build and share dashboards.

This article describes a proof of concept demonstrating how Librato's cloud service can be used to cost effectively monitor large scale cloud infrastructure by leveraging standard sFlow instrumentation. Librato offers a free 30 day trial, making it easy to evaluate solutions based on this demonstration.
The diagram shows the measurement pipeline. Standard sFlow measurements from hosts, hypervisors, virtual machines, containers, load balancers, web servers and network switches stream to the sFlow-RT real-time analytics engine. Metrics are pushed from sFlow-RT to Librato using the REST API.

Over 40 vendors implement the sFlow standard and compatible products are listed on sFlow.org. The open source Host sFlow agent exports standard sFlow metrics from hosts. For additional background, the Velocity conference talk provides an introduction to sFlow and case study from a large social networking site.


Librato's service is priced based on the number of data points that they need to store. For example, a Host sFlow agent reports approximately 50 measurements per node. Collecting all the measurements from a cluster of 100 servers would generate 5000 metrics and cost $1,000 per month if metrics are stored at 15 second intervals.
There are important scaleability and cost advantages to placing the sFlow-RT analytics engine in front of the metrics collection service. For example, in large scale cloud environments the metrics for each member of a dynamic pool aren't necessarily worth trending since virtual machines are frequently added and removed. Instead, sFlow-RT tracks all the members of the pool, calculates summary statistics for the pool, and logs the summary statistics. This pre-processing can significantly reduce storage requirements, reducing costs and increasing query performance. The sFlow-RT analytics software also calculates traffic flow metrics, identifies hot/missed Memcache keys and top URLs, exports events via syslog to Splunk, Logstash etc., and provides access to detailed metrics through its REST API.
The following steps were involved in setting up the proof of concept.

First register for free trial at Librato.com.

Find or build a server with Java 1.7+ and install sFlow-RT:
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
Edit the init.js script and add the following lines (modifying the user and token from your Librato account):
var url = "https://metrics-api.librato.com/v1/metrics";
var user = "first.last@mycompany.com";
var token = "55add91c806fb5f634ad1a334789a32e8d10a597815e6865aa84f0749324450e";

setIntervalHandler(function() {
  var metrics = ['min:load_one','q1:load_one','med:load_one',
                 'q3:load_one','max:load_one'];
  var vals = metric('ALL',metrics,{os_name:['linux']});
  var gauges = {};
  for each (var val in vals) {
    gauges[val.metricName] = {
      "value": val.metricValue,
      "source": "Linux_Pool"
    };
  }
  var body = {"gauges":gauges};
  http(url,'post', 'application/json', JSON.stringify(body), user, token);
} , 15);
Now start sFlow-RT:
./start.sh
Cluster performance metrics describes the summary metrics that sFlow-RT can calculate. In this case, the load average minimum, maximum, and quartiles for the cluster are being calculated and pushed to Librato every 15 seconds.

Install Host sFlow agents on the physical or virtual machines in your cluster and direct them to send metrics to the sFlow-RT host. The installation steps can be easily automated using orchestration tools like Puppet, Chef, Ansible, etc.

Physical and virtual switches in the cluster can be configured to send sFlow to sFlow-RT in order to add traffic metrics to the mix, exporting metrics that characterize traffic between service tiers etc. However, in public cloud environments, traffic flow information is typically not available. The articles Amazon Elastic Compute Cloud (EC2) and Rackspace cloudservers describe how Host sFlow agents can be configured to monitor traffic between virtual machines in the cloud.
Metrics should start appearing in Librato as soon as the Host sFlow agents are started.

In this example, sFlow-RT is exporting 5 metrics to summarize the cluster performance, reducing the total monthly cost of monitoring the cluster from $1,000 to $1. Of course there are likely to be more metrics that you will want to track, but the ability to selectively log high value metrics provides a way to control costs and maximize benefits.

Broadcom ASIC table utilization metrics, DevOps, and SDN

Figure 1: Two-Level Folded CLOS Network Topology Example
Figure 1 from the Broadcom white paper, Engineered Elephant Flows for Boosting Application Performance in Large-Scale CLOS Networks, shows a data center leaf and spine topology. Leaf and spine networks are seeing rapid adoption since they provide the scaleability needed to cost effectively deliver the low latency, high bandwidth interconnect for cloud, big data, and high performance computing workloads.

Broadcom Trident ASICs are popular in white box, brite-box and branded data center switches from a wide range of vendors, including: Accton, Agema, Alcatel-Lucent, Arista, Cisco, Dell, Edge-Core, Extreme, Hewlett-Packard, IBM, Juniper, Penguin Computing, and Quanta.
Figure 2: OF-DPA Programming Pipeline for ECMP
Figure 2 shows the packet processing pipeline of a Broadcom ASIC. The pipeline consists of a number of linked hardware tables providing bridging, routing, access control list (ACL), and ECMP forwarding group functions. Operations teams need to be able to proactively monitor table utilizations in order to avoid performance problems associated with table exhaustion.

Broadcom's recently released sFlow specification, sFlow Broadcom Switch ASIC Table Utilization Structures, leverages the industry standard sFlow protocol to offer scaleable, multi-vendor, network wide visibility into the utilization of these hardware tables.

Support for the new extension has just been added to the open source Host sFlow agent, which runs on Cumulus Linux, a Debian based Linux distribution that supports open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. Hewlett-Packard recently announced that they will soon be selling a new line of open network switches built by Accton Technologies and supporting Cumulus Linux.
The speed with which this new feature can be delivered on hardware from the wide range of vendors supporting Cumulus Linux is a powerful illustration of the benefits of open networking. While support for the Broadcom ASIC table extension has been checked into the Host sFlow trunk, it hasn't yet made it into the Cumulus Networks binary repositories. However, Cumulus Linux is an open platform, so users are free to download the sources, compile, and install the latest software version directly from SourceForge.
The following output from the open source sflowtool command line utility shows the raw table measurements (in addition to the extensive set of measurements already exported via sFlow on Cumulus Linux):
bcm_asic_host_entries 4
bcm_host_entries_max 8192
bcm_ipv4_entries 0
bcm_ipv4_entries_max 0
bcm_ipv6_entries 0
bcm_ipv6_entries_max 0
bcm_ipv4_ipv6_entries 9
bcm_ipv4_ipv6_entries_max 16284
bcm_long_ipv6_entries 3
bcm_long_ipv6_entries_max 256
bcm_total_routes 10
bcm_total_routes_max 32768
bcm_ecmp_nexthops 0
bcm_ecmp_nexthops_max 2016
bcm_mac_entries 3
bcm_mac_entries_max 32768
bcm_ipv4_neighbors 4
bcm_ipv6_neighbors 0
bcm_ipv4_routes 0
bcm_ipv6_routes 0
bcm_acl_ingress_entries 842
bcm_acl_ingress_entries_max 4096
bcm_acl_ingress_counters 68
bcm_acl_ingress_counters_max 4096
bcm_acl_ingress_meters 18
bcm_acl_ingress_meters_max 8192
bcm_acl_ingress_slices 3
bcm_acl_ingress_slices_max 8
bcm_acl_egress_entries 36
bcm_acl_egress_entries_max 512
bcm_acl_egress_counters 36
bcm_acl_egress_counters_max 1024
bcm_acl_egress_meters 18
bcm_acl_egress_meters_max 512
bcm_acl_egress_slices 2
bcm_acl_egress_slices_max 2
The sflowtool output is useful for troubleshooting and is easy to parse with scripts.
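For example, the following minimal Python 2 sketch pairs each bcm_<table> counter with its bcm_<table>_max limit as the lines stream in and prints percentage utilization (tables reporting a zero maximum are skipped, and the few counters that don't follow the regular naming, e.g. bcm_asic_host_entries, would need special cases):
#!/usr/bin/env python
import subprocess

vals = {}
p = subprocess.Popen(['sflowtool'], stdout=subprocess.PIPE)
for line in iter(p.stdout.readline, ''):
    parts = line.split()
    if len(parts) != 2 or not parts[0].startswith('bcm_'): continue
    key = parts[0]
    try: vals[key] = float(parts[1])
    except ValueError: continue
    if key.endswith('_max') and vals[key] > 0 and key[:-4] in vals:
        print '%s %.1f%%' % (key[:-4], 100 * vals[key[:-4]] / vals[key])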

DevOps


The diagram shows how the sFlow-RT analytics engine is used to deliver metrics and events to cloud based and on-site DevOps tools, see: Cloud analytics, InfluxDB and Grafana, Metric export to Graphite, and Exporting events using syslog.

For example, the following sFlow-RT application simplifies monitoring of the leaf and spine network by combining measurements from all the switches, identifying the switch with the maximum utilization of each table, pushing the summaries to an operations dashboard every 15 seconds, and sending syslog events immediately when any table exceeds 80% utilization:
var network_wide_metrics = [
  'max:bcm_host_utilization',
  'max:bcm_mac_utilization',
  'max:bcm_ipv4_ipv6_utilization',
  'max:bcm_total_routes_utilization',
  'max:bcm_ecmp_nexthops_utilization',
  'max:bcm_acl_ingress_utilization',
  'max:bcm_acl_ingress_meters_utilization',
  'max:bcm_acl_ingress_counters_utilization',
  'max:bcm_acl_egress_utilization',
  'max:bcm_acl_egress_meters_utilization',
  'max:bcm_acl_egress_counters_utilization'
];

var max_utilization = 80;

setIntervalHandler(function() {
  var vals = metric('ALL',network_wide_metrics);
  var graphite_metrics = {};
  for each (var val in vals) {
    if(!val.hasOwnProperty('metricValue')) continue;

    // generate syslog events for over utilized tables
    if(val.metricValue >= max_utilization) {
      var event = {
        "asic_table":val.metricName,
        "utilization":val.metricValue,
        "switchIP":val.agent
      };
      try {
        syslog(
          '10.0.0.1', // syslog collector: splunk>, logstash, etc.
          514,        // syslog port
          16,         // facility = local0
          5,          // severity = notice
          event
        );
      } catch(e) { logWarning("syslog() failed " + e); }
    }

    // add metric to graphite set
    graphite_metrics["network.podA."+val.metricName] = val.metricValue;
  }

  // send metrics to graphite
  try {
    graphite(
      '10.0.0.151', // graphite server
      2003,         // graphite carbon UDP port
      graphite_metrics
    );
  } catch(e) { logWarning("graphite() failed " + e); }
},15);
The following screen capture shows the graphs starting to appear in Graphite:

Real-time traffic analytics


The table utilization metrics are only a part of the visibility that sFlow provides into the performance of a leaf and spine network.

A leaf and spine fabric is challenging to monitor. The fabric spreads traffic across all the switches and links in order to maximize bandwidth. Unlike traditional hierarchical network designs, where a small number of links can be monitored to provide visibility, a leaf and spine network has no special links or switches where running CLI commands or attaching a probe would provide visibility. Even if it were possible to attach probes, the effective bandwidth of a leaf and spine network can be as high as a Petabit/second, well beyond the capabilities of current generation monitoring tools.
Scaleable traffic measurement is possible because Broadcom ASICs implement hardware support for sFlow monitoring, providing cost effective, line rate visibility that is built into the switches and scales to all port speeds (1G, 10G, 25G, 40G, 50G, 100G, ...) and the high port counts found in large leaf and spine networks.
The 2 minute video provides an overview of some of the performance challenges with leaf and spine fabrics and demonstrates Fabric View - a monitoring solution that leverages industry standard sFlow instrumentation in commodity data center switches to provide real-time visibility into fabric performance. Fabric visibility with Cumulus Linux describes how to set up Fabric View to monitor a Cumulus Linux leaf and spine network.

SDN

Real-time network analytics are a fundamental driver for a number of important SDN use cases, allowing the SDN controller to rapidly detect changes in traffic and respond by applying active controls. SDN fabric controller for commodity data center switches describes how control of the ACL table is the key feature needed to build scaleable SDN solutions.




REST API for Cumulus Linux ACLs describes open source software to allow an SDN controller to centrally manage the ACL tables on a large scale network of switches running Cumulus Linux.
The ability to install software on the switches is transformative, giving third party developers and network operators transparent access to the full capabilities of the switch and allowing them to build solutions that efficiently handle automation challenges.
A number of SDN use cases that leverage the real-time visibility and control capabilities of the switch ASIC have been demonstrated on Cumulus Linux.
Visit the sFlow.com web site to learn more about SDN control of leaf and spine networks.

Finally, the SDN use cases make extensive use of the ACL table and so this brings us full circle to the importance of the Broadcom sFlow extension providing visibility into the utilization of table resources.

Topology discovery with Cumulus Linux

Demo: Implementing the OpenStack Design Guide in the Cumulus Workbench is a great demonstration of the power of zero touch provisioning and automation. When the switches and servers boot they automatically pick up their operating systems and configurations for the complex network shown in the diagram.
REST API for Cumulus Linux ACLs describes a REST server for remotely controlling ACLs on Cumulus Linux. This article will discuss recently added topology discovery methods that allow an SDN controller to learn topology and apply targeted controls (e.g. Large "Elephant" flow marking, Large flow steering, DDoS mitigation, etc.).

Prescriptive Topology Manager

Complex Topology and Wiring Validation in Data Centers describes how Cumulus Networks' prescriptive topology manager (PTM) provides a simple method of verifying and enforcing correct wiring topologies.

The following REST call converts the topology from PTM's dot notation and returns a JSON representation:
cumulus@wbench:~$ curl http://leaf1:8080/ptm
Returns the result:
{
 "links": {
  "L1": {
   "node1": "leaf1",
   "node2": "spine1",
   "port1": "swp1s0",
   "port2": "swp49"
  },
  ...
 }
}

LLDP

Prescriptive Topology Manager is preferred since it ensures that the discovered topology is correct. However, PTM builds on the basic Link Layer Discovery Protocol (LLDP), which provides an alternative method of topology discovery.

The following REST call returns the hostname:
cumulus@wbench:~$ curl http://leaf1:8080/hostname
Returns result:
"leaf1"
The following REST call returns LLDP neighbor information:
cumulus@wbench:~$ curl http://leaf1:8080/lldp/neighbors

Returns result:
{
 "lldp": [
  {
   "interface": [
    {
     "name": "eth0",
     "via": "LLDP",
     "chassis": [
      {
       "id": [
        {
         "type": "mac",
         "value": "6c:64:1a:00:2e:7f"
        }
       ],
       "name": [
        {
         "value": "colo-tor-3"
        }
       ]
      }
     ],
     "port": [
      {
       "id": [
        {
         "type": "ifname",
         "value": "swp10"
        }
       ],
       "descr": [
        {
         "value": "swp10"
        }
       ]
      }
     ]
    },
    ...
  }
 ]
}
The following REST call returns LLDP configuration information:
cumulus@wbench:~$ curl http://leaf1:8080/lldp/configuration
Returns result:
{
 "configuration": [
  {
   "config": [
    {
     "tx-delay": [
      {
       "value": "30"
      }
     ],
     ...
    }
   ]
  }
 ]
}

Topology discovery with LLDP

The script lldp.py extracts LLDP data from all the switches in the network and compiles a topology:
#!/usr/bin/env python

import sys, re, fileinput, json, requests

switch_list = ['leaf1','leaf2','spine1','spine2']

l = 0
linkdb = {}
links = {}
for switch_name in switch_list:
    # verify that lldp configuration exports hostname,ifname information
    r = requests.get("http://%s:8080/lldp/configuration" % (switch_name))
    if r.status_code != 200: continue
    config = r.json()
    lldp_hostname = config['configuration'][0]['config'][0]['hostname'][0]['value']
    if lldp_hostname != '(none)': continue
    lldp_porttype = config['configuration'][0]['config'][0]['lldp_portid-type'][0]['value']
    if lldp_porttype != 'ifname': continue
    # local hostname
    r = requests.get("http://%s:8080/hostname" % (switch_name))
    if r.status_code != 200: continue
    host = r.json()
    # get neighbors
    r = requests.get("http://%s:8080/lldp/neighbors" % (switch_name))
    if r.status_code != 200: continue
    neighbors = r.json()
    interfaces = neighbors['lldp'][0]['interface']
    for i in interfaces:
        # local port name
        port = i['name']
        # neighboring hostname
        nhost = i['chassis'][0]['name'][0]['value']
        # neighboring port name
        nport = i['port'][0]['descr'][0]['value']
        if not host or not port or not nhost or not nport: continue
        if host < nhost:
            link = {'node1':host,'port1':port,'node2':nhost,'port2':nport}
        else:
            link = {'node1':nhost,'port1':nport,'node2':host,'port2':port}
        keystr = "%s %s -- %s %s" % (link['node1'],link['port1'],link['node2'],link['port2'])
        if keystr in linkdb:
            # check consistency
            prev = linkdb[keystr]
            if (link['node1'] != prev['node1']
                or link['port1'] != prev['port1']
                or link['node2'] != prev['node2']
                or link['port2'] != prev['port2']): raise Exception('Mismatched LLDP', keystr)
        else:
            linkdb[keystr] = link
            linkname = 'L%d' % (l)
            links[linkname] = link
            l += 1

top = {'links':links}
print json.dumps(top,sort_keys=True, indent=1)
Running the script returns the result:
cumulus@wbench:~$ ./lldp.py 
{
 "links": {
  "L0": {
   "node1": "colo-tor-3",
   "node2": "leaf1",
   "port1": "swp10",
   "port2": "eth0"
  },
  ...
 }
}
The lldp.py script and the latest version of acl_server can be found on Github, https://github.com/pphaal/acl_server/

Demonstration

Fabric visibility with Cumulus Linux demonstrates the visibility into network performance provided by Cumulus Linux support for the sFlow standard (see Cumulus Networks, sFlow and data center automation). The screen shot shows 10Gbit/s Elephant flows traversing the network shown at the top of this article. The flows between server1 and server2 were generated using iperf tests running in a continuous loop.

The acl_server and sFlow agents are installed on the leaf1, leaf2, spine1, and spine2 switches. By default, the sFlow agents automatically pick up their settings using DNS Service Discovery (DNS-SD). Adding the following entry in the wbench DNS server zone file, /etc/bind/zones/lab.local.zone, enables sFlow on the switches and directs measurements to the wbench host:
_sflow._udp     30      SRV     0 0 6343 wbench
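The SRV record can be checked from the workbench before rebooting the switches. The following sketch assumes the dnspython package is installed and uses the lab.local domain shown above:
#!/usr/bin/env python
# verify that the _sflow._udp SRV record resolves as intended
import dns.resolver  # pip install dnspython; dnspython >= 2 renames query() to resolve()

for rr in dns.resolver.query('_sflow._udp.lab.local', 'SRV'):
    print("%s %d" % (rr.target, rr.port))  # expect: wbench 6343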
Note: For more information on running sFlow in the Cumulus workbench, see Demo: Monitoring Traffic on Cumulus Switches with sFlow. Also note that this workbench setup demonstrates the visibility into Link Aggregation (LAG) provided by sFlow (see Link aggregation).

Fabric View is installed on wbench and is configured with the network topology obtained from acl_server. The web interface is accessed through the workbench reverse proxy, but access is also possible using a VPN (see Setting up OpenVPN on the Cumulus Workbench).
This workbench example automatically provisions an OpenStack cluster on the two servers along with the network to connect them. In much the same way OpenStack provides access to virtual resources, Cumulus' Remote Lab leverages the automation capabilities of open hardware to provide multi-tenant access to physical servers and networks.
Finally, Cumulus Linux runs on open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. In addition, Hewlett-Packard recently announced that they will soon be selling a new line of open network switches built by Accton Technologies and supporting Cumulus Linux. This article demonstrates the flexibility that open networking offers to developers and network administrators. If you are curious, it's very easy to give Cumulus Linux a try.

ECMP visibility with Cumulus Linux

Demo: Implementing the Big Data Design Guide in the Cumulus Workbench is a great demonstration of the power of zero touch provisioning and automation. When the switches and servers boot they automatically pick up their operating systems and configurations for the complex Equal Cost Multi-Path (ECMP) routed network shown in the diagram.

Topology discovery with Cumulus Linux looked at an alternative Multi-Chassis Link Aggregation (MLAG) configuration and showed how to extract the configuration and monitor traffic on the network using sFlow and Fabric View.

The paper Hedera: Dynamic Flow Scheduling for Data Center Networks describes the impact of colliding flows on effective ECMP cross sectional bandwidth. The paper gives an example demonstrating that effective cross sectional bandwidth can be reduced by between 20% and 60%, depending on the number of simultaneous flows per host.

This article uses the workbench to demonstrate the effect of large "Elephant" flow collisions on network throughput. The following script running on each of the servers uses the iperf tool to generate pairs of overlapping Elephant flows:
cumulus@server1:~$ while true; do iperf -c 10.4.2.2 -t 20; sleep 20; done
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[ 3] local 10.4.1.2 port 57234 connected with 10.4.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-20.0 sec 21.9 GBytes 9.41 Gbits/sec
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[ 3] local 10.4.1.2 port 57240 connected with 10.4.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-20.0 sec 10.1 GBytes 4.34 Gbits/sec

------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[ 3] local 10.4.1.2 port 57241 connected with 10.4.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-20.0 sec 21.9 GBytes 9.41 Gbits/sec
------------------------------------------------------------
The first iperf test achieves a TCP throughput of 9.41 Gbits/sec (the maximum achievable on the 10Gbit/s network in the workbench). However, the second test only achieves a throughput of 4.34 Gbits/sec. How can this result be explained?
The Top Flows table above confirms that two simultaneous elephant flows are being tracked by Fabric View.
The Traffic charts update every second and give a fine grained view of the traffic flows over time. The charts clearly show how the iperf flows vary in throughput, with the low throughput runs achieving approximately 50% of the network capacity (consistent with the 20% to 60% reduction reported in the Hedera paper).
The Performance charts show what is happening. Packets take two hops as they are routed from leaf1 to leaf2 (via spine1 or spine2). Each iperf connection is able to fully utilize the two links to achieve line rate throughput. Comparing the Total Traffic and Busy Spine Links charts shows that the peak total throughput of approximately 20Gbits/sec corresponds to intervals when 4 spine links are busy. The throughput is halved during intervals when the routes overlap and share 1 or 2 links (shown in gold as Collisions on the Busy Spine Links chart).
Readers might be surprised by the frequency of collisions given the number of links in the network. Packets take two hops to go from leaf1 to leaf2 - routed via spine1 or spine2. In addition, the links between switches are paired, so there are 8 possible two hop paths from leaf1 to leaf2. The explanation involves looking at the conditional probability that the second flow will overlap with the first. Suppose the first flow is routed to spine1 via port swp1s0 and that spine1 routes the flow to leaf2 via port swp51. If the second flow is routed via any of the 4 paths through spine2, there is no collision. However, if it is routed via spine1, there is only 1 path that avoids a collision (leaf1 port swp1s1 to spine1 port swp52). This means that there is a 5/8 chance of avoiding a collision, or a 3/8 (37.5%) chance that the two flows will collide. The probability of flow collisions remains surprisingly high even on very large networks with many spine switches and paths (see Birthday Paradox).
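The 3/8 figure is easy to verify numerically. The following short Python script (a sketch, not part of the demonstration) randomly places two flows on the 8 possible two hop paths and counts how often they share a link:
#!/usr/bin/env python
# Monte Carlo check of the flow collision probability
import random

def random_path():
    # a path is a choice of spine plus one of the two parallel
    # links on each hop (2 x 2 x 2 = 8 possible paths)
    return (random.choice(['spine1','spine2']),
            random.randint(0,1),   # leaf1 -> spine link
            random.randint(0,1))   # spine -> leaf2 link

def collides(a, b):
    # flows collide if they use the same spine and share at least
    # one of the paired links
    return a[0] == b[0] and (a[1] == b[1] or a[2] == b[2])

trials = 100000
hits = sum(collides(random_path(), random_path()) for _ in range(trials))
print("collision probability %.3f (expected 0.375)" % (hits / float(trials)))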
Also note the Discards trend in the Congestion and Errors section. Comparing the rate of discards with Collisions in the Busy Spine Links chart shows that discards don't occur unless there are Elephant flow collisions on the busy links.
The Discard trend lags the Collision trend because discards are reported using sFlow counters while the Collision metric is based on packet samples - see Measurement delay, counters vs. packet samples.
This example demonstrates the visibility into leaf and spine fabric performance achievable using standard sFlow instrumentation built into commodity switch hardware. If you have a leaf and spine network, request a free evaluation of Fabric View to better understand your network's performance.
This small four switch leaf and spine network is composed of 12 x 10 Gbits/sec links which would require 24 x 10 Gbits/sec taps with associated probes and collector to fully monitor using traditional tools used to monitor legacy data center networks. The cost and complexity of tapping leaf and spine topologies is prohibitive. However, leaf and spine switches typically include hardware support for the sFlow measurement standard, embedding line rate visibility into every switch port for network wide coverage at no extra cost. In this example, the Fabric View analytics software is running on a commodity physical or virtual server consuming 1% CPU and 200 MBytes RAM.
Real-time analytics for leaf and spine networks is a core enabling technology for software defined networking (SDN) control mechanisms that can automatically adapt the network to rapidly changing flow patterns and dramatically improve performance.
For example, REST API for Cumulus Linux ACLs describes how an SDN controller can remotely control switches. Use cases discussed on this blog include: Elephant flow marking, Elephant flow steering, and DDoS mitigation.

Finally, Cumulus Linux runs on open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. In addition, Hewlett-Packard recently announced that they will soon be selling a new line of open network switches built by Accton Technologies and supporting Cumulus Linux. The increasing availability of low cost open networking hardware running Linux creates a platform for open source and commercial software developers to quickly build and deploy innovative solutions.

OpenNetworking.tv interview


The OpenNetworking.tv interview includes a wide ranging discussion of current trends in software defined networking (SDN), including: merchant silicon, analytics, probes, scalability, Open vSwitch, network virtualization, VxLAN, network function virtualization (NFV), Open Compute Project, white box / bare metal switches, leaf and spine topologies, large "Elephant" flow marking and steering, Cumulus Linux, Big Switch, orchestration, Puppet and Chef.

The interview and full transcript are available on SDxCentral: sFlow Creator Peter Phaal On Taming The Wilds Of SDN & Virtual Networking

Related articles on this blog explore many of these topics in more detail.

Big Tap sFlow: Enabling Pervasive Flow-level Visibility


Today's Big Switch Networks webinar, Big Tap sFlow: Enabling Pervasive Flow-level Visibility, describes how Big Switch uses software defined networking (SDN) to control commodity switches and deliver network visibility. The webinar presents a live demonstration showing how real-time sFlow analytics is used to automatically drive SDN actions to provide a "smarter way to find a needle in a haystack."

The video presentation covers the following topics:

  • 0:00 Introduction to Big Tap
  • 7:00 sFlow generation and use cases
  • 12:30 Demonstration of real-time tap triggering based on sFlow

The webinar describes how the network wide monitoring provided by industry standard sFlow instrumentation complements the Big Tap SDN controller's ability to capture and direct selected packet streams to visibility tools.

The above slide from the webinar draws an analogy for the role that sFlow plays in targeting the capture network to that of a finderscope, the small, wide-angle telescope used to provide an overview of the sky and guide the telescope to its target. Support for the sFlow measurement standard is built into commodity switch hardware and is enabled on all ports in the capture network to provide a wide angle view of all traffic in the data center. Once suspicious activity is detected, targeted captures can be automatically triggered using Big Tap's REST API.
Blacklists are an important way in which the Internet community protects itself by identifying bad actors. Incorporating blacklists in traffic monitoring can be a useful way to find hosts on a network that have been compromised. If a host interacts with addresses known to be part of a botnet for example, then it raises the concern that the host has been compromised and is itself a member of the botnet.

Blacklists can be very large; the largest exceed a million addresses. Switches don't have the resources to match traffic against such large lists. However, sFlow shifts analysis from the switches to external software, which can easily handle the task of matching traffic against large lists. The live demonstration uses InMon's sFlow-RT real-time analytics software to match sFlow data against a large blacklist. When a match is detected the Big Tap controller is programmed via a REST API call to capture all the packets from the suspected hosts and stream them to Wireshark for further investigation.
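As a simple illustration of why this task belongs in software, matching sampled addresses against even a very large list is just a set lookup. The sketch below uses hypothetical field names for a decoded sFlow record; it is not the demonstration code:
#!/usr/bin/env python
# match decoded sFlow samples against a large blacklist
blacklist = set()
with open('blacklist.txt') as f:   # one address per line
    for line in f:
        blacklist.add(line.strip())

def check_flow(flow):
    # flow is a hypothetical decoded sFlow sample
    for addr in (flow['src'], flow['dst']):
        if addr in blacklist:
            print("suspicious host %s" % addr)

check_flow({'src':'192.0.2.1','dst':'203.0.113.5'})  # example record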

Analytics and SDN

Recent presentations from AT&T and Google describe SDN/NFV architectures that incorporate measurement based feedback in order to improve performance and reliability.

The first slide is from a presentation by AT&T's Margaret Chiosi, SDN+NFV Next Steps in the Journey, NFV World Congress 2015. The future architecture envisions generic (white box) hardware providing a stream of analytics which are compared to policies and used to drive actions to assure service levels.


The second slide is from the presentation by Google's Bikash Koley at the Silicon Valley Software Defined Networking Group Meetup. In this architecture, "network state changes observed by analyzing comprehensive time-series data stream." Telemetry is used to verify that the network is behaving as intended, identifying policy violations so that the management and control planes can apply corrective actions. Again, the software defined network is built from commodity white box switches.

Support for standard sFlow measurements is almost universally available in commodity switch hardware. sFlow agents embedded within network devices continuously stream measurements to the SDN controller, supplying the analytics component with the comprehensive, scalable, real-time visibility needed for effective control.

SDN fabric controller for commodity data center switches describes the measurement and control capabilities available in commodity switch hardware. In addition, there are a number of use cases described on this blog that demonstrate the benefits of incorporating traffic analytics in SDN solutions, including Elephant flow marking, Elephant flow steering, and DDoS mitigation.
While the incorporation of telemetry / analytics in SDN architectures is recent, the sFlow measurement standard is a proven technology that has been incorporated in switch ASICs for over a decade. Incorporating sFlow in SDN solution stacks leverages the capabilities of commodity switches to provide immediate visibility into operational networks without the complexity and cost of adding probes or being locked in to vendor specific hardware.

Leaf and spine traffic engineering using segment routing and SDN


The short 3 minute video is a live demonstration showing how software defined networking (SDN) can be used to orchestrate the measurement and control capabilities of commodity data center switches to automatically load balance traffic on a 4 leaf, 4 spine, 10 Gigabit leaf and spine network.
The diagram shows the physical layout of the demonstration rack. The four logical racks with their servers and leaf switches are combined in a single physical rack, along with the spine switches and SDN controllers. All the links in the data plane are 10G and sFlow has been enabled on every switch and link with the following settings: a packet sampling rate of 1-in-8192 and a counter polling interval of 20 seconds. The switches have been configured to send the sFlow data to sFlow-RT analytics software running on Controller 1.

The switches are also configured to enable OpenFlow 1.3 and connect to multiple controllers in the redundant ONOS SDN controller cluster running on Controller 1 and Controller 2.
The charts from The Nature of Datacenter Traffic: Measurements & Analysis show data center traffic measurements published by Microsoft. Most traffic flows are short duration. However, combined they consume less bandwidth than a much smaller number of large flows with durations ranging from 10 seconds to 100 seconds. The large number of small flows are often referred to as "Mice" and the small number of large flows as "Elephants."

This demonstration focuses on the Elephant flows since they consume most of the bandwidth. The iperf load generator is used to generate two streams of back to back 10Gbyte transfers that should take around 8 seconds to complete over the 10Gbit/s leaf and spine network.
while true; do iperf -B 10.200.3.32 -c 10.200.3.42 -n 10000M; done
while true; do iperf -B 10.200.3.33 -c 10.200.3.43 -n 10000M; done
These two independent streams of connections from switch 103 to 104 drive the demo.
The HTML 5 dashboard queries sFlow-RT's REST API to extract and display real-time flow information.

The dashboard shows a topological view of the leaf and spine network in the top left corner. Highlighted "busy" links have a utilization of over 70% (i.e. 7Gbit/s). The topology shows flows taking independent paths from 103 to 104 (via spines 105 and 106). The links are highlighted in blue to indicate that the utilization on each link is driven by a single large flow. The chart immediately under the topology trends the number of busy links. The most recent point, to the far right of the chart, has a value of 4 and is colored blue, recording that 4 blue links are shown in the topology.

The bottom chart trends the total traffic entering the network broken out by flow. The current throughput is just under 20Gbit/s and is comprised of two roughly equal flows.

The ONOS controller configures the switches to forward packets using Equal Cost Multi-Path (ECMP) routing. There are four equal cost (hop count) paths from leaf switch 103 to leaf switch 104 (via spine switches 105, 106, 107 and 108). The switch hardware selects between paths based on a hash function calculated over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports), e.g.
index = hash(packet fields) % group.size
selected_physical_port = group[index]
Hash based load balancing works well for large numbers of Mice flows, but is less suitable for the Elephant flows. The hash function may assign multiple Elephant flows to the same path resulting in congestion and poor network performance.
This screen shot shows the effect of a collision between flows. Both flows have been assigned the same path via spine switch 105. The analytics software has determined that there are multiple large flows on the pair of busy links and indicates this by coloring the highlighted links yellow. The most recent point, to the far right of the upper trend chart, has a value of 2 and is colored yellow, recording that 2 yellow links are shown in the topology.

Notice that the bottom chart shows that the total throughput has dropped to 10Gbit/s and that each of the flows is limited to 5Gbit/s - halving the throughput and doubling the time taken to complete the data transfer.

The dashboard demonstrates that the sFlow-RT analytics engine has all the information needed to characterize the problem - identifying busy links and the large flows. What is needed is a way to take action to direct one of the flows on a different path across the network.

This is where the segment routing functionality of the ONOS SDN controller comes into its own. The controller implements Segment Routing in Networking (SPRING) as the method of ECMP forwarding and provides a simple REST API for specifying paths across the network and assigning traffic to those paths.

In this example, the traffic is colliding because both flows are following a path running through spine switch 105. Paths from leaf 103 to 104 via spines 106, 107 or 108 have available bandwidth.

The following REST operation instructs the segment routing module to build a path from 103 via 106 to 104:
curl -H "Content-Type: application/json" -X POST http://localhost:8181/onos/segmentrouting/tunnel -d '{"tunnel_id":"t1", "label_path":[103,106,104]}'
Once the tunnel has been defined, the following REST operation assigns one of the colliding flows to the new path:
curl -H "Content-Type: application/json" -X POST http://localhost:8181/onos/segmentrouting/policy -d '{"policy_id":"p1", "priority":1000, "src_ip":"10.200.3.33/32", "dst_ip":"10.200.4.43/32", "proto_type":"TCP", "src_tp_port":53163, "dst_tp_port":5001, "policy_type":"TUNNEL_FLOW", "tunnel_id":"t1"}'
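Wrapping the two REST operations shown above in functions makes them callable from a controller script. The following minimal Python sketch reuses the endpoints and match fields from the example calls:
#!/usr/bin/env python
# helpers for the ONOS segment routing REST calls shown above
import requests, json

ONOS = 'http://localhost:8181/onos/segmentrouting'
HDRS = {'Content-Type': 'application/json'}

def create_tunnel(tunnel_id, label_path):
    requests.post(ONOS + '/tunnel', headers=HDRS,
                  data=json.dumps({'tunnel_id':tunnel_id,'label_path':label_path}))

def assign_flow(policy_id, tunnel_id, match):
    policy = {'policy_id':policy_id,'priority':1000,
              'policy_type':'TUNNEL_FLOW','tunnel_id':tunnel_id}
    policy.update(match)  # src_ip, dst_ip, proto_type, ports...
    requests.post(ONOS + '/policy', headers=HDRS, data=json.dumps(policy))

# steer one of the colliding flows via spine 106
create_tunnel('t1', [103,106,104])
assign_flow('p1', 't1', {'src_ip':'10.200.3.33/32','dst_ip':'10.200.4.43/32',
                         'proto_type':'TCP','src_tp_port':53163,'dst_tp_port':5001})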
However, manually implementing these controls isn't feasible since there is a constant stream of flows that would require policy changes every few seconds.
The final screen shot shows the result of enabling the Flow Accelerator application on sFlow-RT. Flow Accelerator watches for collisions and automatically applies and removes segment routing policies as required to separate Elephant flows, in this case the table on the top right of the dashboard shows that a single policy has been installed sending one of the flows via spine 107.

The controller has been running for about half the interval shown in the two trend charts (approximately two and a half minutes). To the left you can see frequent long collisions and consequent dips in throughput. To the right you can see that more of the links are kept busy and flows experience consistent throughput.
Traffic analytics are a critical component of this demonstration. Why does this demonstration use sFlow? Could NetFlow/JFlow/IPFIX/OpenFlow etc. be used instead? The above diagram illustrates the basic architectural difference between sFlow and other common flow monitoring technologies. For this use case the key difference is that with sFlow real-time data from the entire network is available in a central location (the sFlow-RT analytics software), allowing the traffic engineering application to make timely load balancing decisions based on complete information. Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX presents experimental data demonstrating the difference in responsiveness between sFlow and the other flow monitoring technologies. OK, but what about using hardware packet counters periodically pushed via sFlow, or polled using SNMP or OpenFlow? Here again, measurement delay limits the usefulness of the counter information for SDN applications, see Measurement delay, counters vs. packet samples. Fortunately, the requirement for sFlow is not limiting since support for standard sFlow measurement is built into most vendor and white box hardware - see Drivers for growth.

Finally, the technologies presented in this demonstration have broad applicability beyond the leaf and spine use case. Elephant flows dominate data center, campus, wide area, and wireless networks (see SDN and large flows). In addition, segment routing is applicable to wide area networks as was demonstrated by an early version of the ONOS controller (Prototype & Demo Videos). The demonstration illustrates that the integration of real-time sFlow analytics in SDN solutions enables fundamentally new use cases that drive SDN to a new level - optimizing networks rather than simply provisioning them.

Optimizing software defined data center

The recent Fortune magazine article, Software-defined data center market to hit $77.18 billion by 2020, starts with the quote "Data centers are no longer just about all the hardware gear you can stitch together for better operations. There’s a lot of software involved to squeeze more performance out of your hardware, and all that software is expected to contribute to a burgeoning new market dubbed the software-defined data center."

The recent ONS2015 Keynote from Google's Amin Vahdat describes how Google builds large scale software defined data centers. The presentation is well worth watching in its entirety since Google has a long history of advancing distributed computing with technologies that have later become mainstream.
There are a number of points in the presentation that relate the role of networking to the performance of cloud applications. Amin states, "Networking is at this inflection point and what computing means is going to be largely determined by our ability to build great networks over the coming years. In this world data center networking in particular is a key differentiator."

This slide shows the large pools of storage and compute connected by the data center network that are used to deliver data center services. Amin states that the dominant costs are compute and storage and that the network can be relatively inexpensive.
In Overall Data Center Costs James Hamilton breaks down the monthly costs of running a data center and puts the cost of network equipment at 8% of the overall cost.
However, Amin goes on to explain why networking has a disproportionate role in the overall value delivered by the data center.
The key to an efficient data center is balance. If a resource is scarce, then other resources are left idle and this increases costs and limits the overall value of the data center. Amin goes on to state, "Typically the resource that is most scarce is the network."
The need to build large scale high-performance networks has driven Google to build networks with the following properties:
  • Leaf and Spine (Clos) topology
  • Merchant silicon based switches (white box / brite box / bare metal)
  • Centralized control (SDN)
The components and topology of the network are shown in the following slide.
Here again Google is leading the overall network market transition to inexpensive leaf and spine networks built using commodity hardware.

Google is not alone in leading this trend. Facebook has generated significant support for the Open Compute Project (OCP), which publishes open source designs for data center equipment, including merchant silicon based leaf and spine switches. A key OCP project is the Open Network Install Environment (ONIE), which allows third party software to be installed on the network equipment. ONIE separates hardware from software and has spawned a number of innovative networking software companies, including: Cumulus Networks, Big Switch Networks, Pica8, and Pluribus Networks. Open network hardware and the related ecosystem of software are entering the mainstream as leading vendors such as Dell and HP deliver open networking hardware, software and support to enterprise customers.
The ONS2015 keynote from AT&T's John Donovan, describes the economic drivers for AT&T's transition to open networking and compute architectures.
John discusses the rapid move from legacy TDM (Time Division Multiplexing) technologies to commodity Ethernet, explaining that "video now makes up the majority of traffic on our network." This is a fundamental shift for AT&T and John states that "We plan to virtualize and control more than 75% of our network using cloud infrastructure and a software defined architecture."

John mentions the CORD (Central Office Re-architected as a Datacenter) project which proposes an architecture very similar to Google's, consisting of a leaf and spine network built using open merchant silicon based hardware connecting commodity servers and storage. A prototype of the CORD leaf and spine network was shown as part of the ONS2015 Solutions Showcase.
ONS2015 Solutions Showcase: Open-source spine-leaf Fabric
Leaf and spine traffic engineering using segment routing and SDN describes a live demonstration presented in ONS2015 Solutions Showcase. The demonstration shows how centralized analytics and control can be used to optimize the performance of commodity leaf and spine networks handling the large "Elephant" flows that typically comprise most traffic on the network (for example, video streams - see SDN and large flows for a general discussion).

Getting back to the Fortune article, it is clear that the move to open commodity network, server and storage hardware shifts value from hardware to the software solutions that optimize performance. The network in particular is a critical resource that constrains overall performance and network optimization solutions can provide disproportionate benefits by eliminating bottlenecks that constrain compute and storage and limit the value delivered by the data center.

WAN optimization using real-time traffic analytics

TATA Consultancy Services white paper, Actionable Intelligence in the SDN Ecosystem: Optimizing Network Traffic through FRSA, demonstrates how real-time traffic analytics and SDN can be combined to perform real-time traffic engineering of large flows across a WAN infrastructure.
The architecture being demonstrated is shown in the diagram (this diagram has been corrected - the diagram in the white paper incorrectly states that sFlow-RT analytics software uses a REST API to poll the nodes in the topology. In fact, the nodes stream telemetry using the widely supported, industry standard, sFlow protocol, providing real-time visibility and scalability that would be difficult to achieve using polling - see Push vs Pull).

The load balancing application receives real-time notifications of large flows from the sFlow-RT analytics software and programs the SDN Controller (in this case OpenDaylight) to push forwarding rules to the switches to direct the large flows across a specific path. Flow Aware Real-time SDN Analytics (FRSA) provides an overview of the basic ideas behind large flow traffic engineering that inspired this use case.

While OpenDaylight is used in this example, an interesting alternative for this use case would be the ONOS SDN controller running the Segment Routing application. ONOS is specifically designed with carriers in mind and segment routing is a natural fit for the traffic engineering task described in this white paper.
Leaf and spine traffic engineering using segment routing describes a demonstration combining real-time analytics and SDN control in a data center context. The demonstration was part of the recent 2015 Open Networking Summit (ONS) conference Showcase and presented in the talk, CORD: FABRIC An Open-Source Leaf-Spine L3 Clos Fabric, by Saurav Das.

SDN router using merchant silicon top of rack switch

The talk from David Barroso describes how Spotify optimizes hardware routing on a commodity switch by using sFlow analytics to identify the routes carrying the most traffic.  The full Internet routing table contains nearly 600,000 entries, too many for commodity switch hardware to handle. However, not all entries are active all the time. The Spotify solution uses traffic analytics to track the 30,000 most active routes (representing 6% of the full routing table) and push them into hardware. Based on Spotify's experience, offloading the active 30,000 routes to the switch provides hardware routing for 99% of their traffic.

David is interviewed by Ivan Pepelnjak in SDN ROUTER @ SPOTIFY ON SOFTWARE GONE WILD. The SDN Internet Router (SIR) source code and documentation are available on GitHub.
The diagram from David's talk shows the overall architecture of the solution. Initially the Internet Router (commodity switch hardware) uses a default route to direct outbound traffic to a Transit Provider (capable of handling all the outbound traffic). The BGP Controller learns routes via BGP and observes traffic using the standard sFlow measurement technology embedded with most commodity switch silicon.
After a period (1 hour) the BGP Controller identifies the most active 30,000 prefixes and configures the Internet Router to install these routes in the hardware so that traffic takes the best routes to each peer. Each subsequent period provides new measurements and the controller adjusts the active set of routes accordingly.
The internals of the BGP Controller are shown in this diagram. BGP and sFlow data are received by the pmacct traffic accounting software, which then writes out files containing traffic by prefix. The bgpc.py script calculates the TopN prefixes and installs them in the Internet Router. 
In this example, the Bird routing daemon is running on the Internet Router and the TopN prefixes are written into a filter file that restricts the prefixes that can be installed in the hardware.
The SIR router demonstrates that building an SDN controller that leverages standard measurement and control capabilities of commodity hardware has the potential to disrupt the router market by replacing expensive custom routers with inexpensive commodity switches based on merchant silicon. However, the relatively slow feedback loop (updating measurements every hour) limits SIR to access routers with relatively stable traffic patterns.

The rest of this article discusses how a fast feedback loop can be built combining real-time sFlow analytics with a BGP control plane. A fast feedback loop significantly reduces the number of hardware cache misses and increases the scalability of the solution, allowing a broader range of use cases to be addressed.

This diagram differs from the SIR router, re-casting the role of the hardware Switch as an accelerator that handles forwarding for a subset of prefixes in order to reduce the traffic forwarded by a Router implementing the full Internet routing table. Applications for this approach include taking an existing router and boosting its throughput (e.g. boosting a 1Gigabit router to 100Gigabit), or, more disruptively, replacing an expensive hardware router with a commodity Linux server.
Route caching is not a new idea. The paper, Revisiting Route Caching: The World Should Be Flat, reviews the history of route caching and discusses its application to contemporary workloads and requirements.
The throughput increase is determined by the cache hit rate that can be achieved with the limited number of routing entries supported by the switch hardware. For example, if the hardware achieves a 90% cache hit rate, then only 10% of the traffic is handled by the router and the throughput is boosted by a factor of 10.
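Expressed as a formula, the boost is 1 / (1 - hit rate). The following snippet (a trivial illustration, not part of the original setup) tabulates the effect:
#!/usr/bin/env python
# throughput boost as a function of hardware cache hit rate: the
# software router only has to carry the cache misses
for hit_rate in (0.5, 0.9, 0.99):
    print("hit rate %.0f%% -> boost x%.0f" % (hit_rate*100, 1/(1-hit_rate)))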

A fast control loop is critical to increasing the cache hit rate, rapidly detecting traffic to new destination prefixes and installing hardware forwarding entries that minimize traffic through the router.

The sFlow-RT analytics software already provides real-time (sub-second) traffic analytics and recently added experimental BGP support allows sFlow-RT to act as a route reflector client, learning the full set of prefixes so that it can track traffic rates by prefix.

The following steps are required to try out the software.

First download sFlow-RT.
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
Next configure sFlow-RT to listen for BGP connections. In this case, add the following entries to the start.sh file to enable BGP, listening on port 1179 rather than the well known BGP port 179 so that sFlow-RT does not need to run with root privileges:
-Dbgp.start=yes -Dbgp.port=1179
Edit the init.js file and use the bgpAddNeighbor function to peer with the Router (10.0.0.254), where NNNN is the local autonomous system (AS) number, and the bgpAddSource function to combine sFlow data from the Switch (10.0.0.253) with the routing table, tracking bytes/second using a 10 second moving average:
bgpAddNeighbor('10.0.0.254',NNNN);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
Configure the Switch to send sFlow to sFlow-RT (see Switch configurations).

Configure the Router as a route reflector, connecting to sFlow-RT (10.0.0.252) and exporting the full routing table. For example, using Quagga as the routing daemon:
router bgp NNNN
bgp router-id 10.0.0.254
neighbor 10.0.0.252 remote-as NNNN
neighbor 10.0.0.252 port 1179
neighbor 10.0.0.252 route-reflector-client
Start sFlow-RT:
./start.sh
The following cURL command accesses the sFlow-RT REST API to query the TopN prefixes:
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json?direction=destination&maxPrefixes=5&minValue=1"
{
 "as": NNNN,
 "direction": "destination",
 "id": "N.N.N.N",
 "learnedPrefixesAdded": 568313,
 "learnedPrefixesRemoved": 9,
 "nPrefixes": 567963,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1436830843625,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "NNNN",
   "localpref": 888,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "0.0.0.0/0",
   "value": 680740.5504781345
  },
  {
   "aspath": "NNNN-NNNN",
   "localpref": 100,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/14",
   "value": 58996.251739893225
  },
  {
   "aspath": "NNNN-NNNNN",
   "localpref": 130,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/13",
   "value": 7966.802831354894
  },
  {
   "localpref": 100,
   "med": 2,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/18",
   "value": 3059.8853014045844
  },
  {
   "aspath": "NNNN",
   "localpref": 1010,
   "med": 0,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/24",
   "value": 1635.0250535959976
  }
 ],
 "valuePercentCoverage": 99.67670497397555,
 "valueTopPrefixes": 752398.5154043833,
 "valueTotal": 754838.871931838
}

In addition to returning the top prefixes, the query returns information about the amount of traffic covered by these prefixes. In this case, the valuePercentCoverage of 99.67 indicates that 99.67 percent of the traffic is covered by the top 5 prefixes.
Try running this query on your own network to find out how many prefixes are required to cover 90%, 95%, and 99% of the traffic. If you have results you can share, please post them as comments to this article.
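The query above makes this exploration easy to script. The following sketch (reusing the address and agent from the example query) steps maxPrefixes up and reports the coverage at each step:
#!/usr/bin/env python
# report traffic coverage as a function of the number of prefixes
import requests

url = 'http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json'
for n in (10, 100, 1000, 10000, 30000):
    r = requests.get(url, params={'direction':'destination',
                                  'maxPrefixes':n, 'minValue':1})
    print("%6d prefixes -> %.2f%% of traffic" % (n, r.json()['valuePercentCoverage']))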
Obtaining the TopN prefixes is only part of the SDN routing application. An efficient method of installing the TopN prefixes in the switch hardware is also required. The SIR router uses a configuration file, but this approach doesn't work well for rapidly modifying large tables. In addition, configuration files vary between routers, limiting the portability of the controller.

In addition to listening for routes using BGP, sFlow-RT can also act as a BGP speaker. The following init.js script implements a basic hardware route cache:

bgpAddNeighbor('10.0.0.254',65000);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
bgpAddNeighbor('10.0.0.253',65000);

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  let top = bgpTopPrefixes('10.0.0.254',100,1,'destination');
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  let tgt = bgpTopPrefixes('10.0.0.253',0);
  if(!tgt || 'established' != tgt.state) return;

  for(let i = 0; i < top.topPrefixes.length; i++) {
    let entry = top.topPrefixes[i];
    if(bgpAddRoute('10.0.0.253',entry)) {
      installed[entry.prefix] = now;
    }
  }
  for(let prefix in installed) {
    let time = installed[prefix];
    if(time === now) continue;
    if(bgpRemoveRoute('10.0.0.253',prefix)) {
      delete installed[prefix];
    }
  }
}, 5);
Some notes on the script:
  1. setIntervalHandler registers a function that is called every 5 seconds
  2. The interval handler queries for the top 100 destination prefixes
  3. Active prefixes are pushed to the switch using bgpAddRoute
  4. Inactive prefixes are withdrawn using bgpRemoveRoute
  5. The bgpAddRoute/bgpRemoveRoute functions are BGP session state aware and will only forward changes

The initial BGP functionality is fairly limited (no IPv6, no communities, ...) and experimental; please report any bugs here, or on the sFlow-RT group.

Try out the software and provide feedback. This example is only one use case for combining sFlow and BGP in an SDN controller. Other use cases include inbound / outbound traffic engineering, DDoS mitigation, multi-path load balancing, etc. Finally, the combination of commodity hardware with the mature, widely deployed BGP and sFlow protocols is a pragmatic approach to SDN that allows solutions to be developed rapidly and deployed widely in production environments.

White box Internet router PoC

SDN router using merchant silicon top of rack switch describes how the performance of a software Internet router could be accelerated using the hardware routing capabilities of a commodity switch. This article describes a proof of concept demonstration using Linux virtual machines and a bare metal switch running Cumulus Linux.
The diagram shows the demo setup, providing inter-domain routing between Peer 1 and Peer 2. The Peers are directly connected to the Hardware Switch and ingress packets are routed by the default (0.0.0.0/0) route to the Software Router. The Software Router learns the full set of routes from the Peers using BGP and forwards the packet to the correct next hop router. The packet is then switched to the selected peer router via bridge br_xen.

The following traceroute run on Peer 1 shows the set of router hops from 192.168.250.1 to 192.168.251.1
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
1 192.168.152.2 (192.168.152.2) 3.090 ms 3.014 ms 2.927 ms
2 192.168.150.3 (192.168.150.3) 3.377 ms 3.161 ms 3.099 ms
3 192.168.251.1 (192.168.251.1) 6.440 ms 6.228 ms 3.217 ms
Ensuring that packets are first forwarded by the default route on the Hardware Switch is the key to accelerating forwarding decisions for the Software Router. Any specific routes added to the Hardware Switch will override the default route, and matching packets will bypass the Software Router and be forwarded directly by the Hardware Switch.

In this test bench, routing is performed using Quagga instances running on Peer 1, Peer 2, Hardware Switch and Software Router.

Peer 1

router bgp 150
bgp router-id 192.168.150.1
network 192.168.250.0/24
neighbor 192.168.152.3 remote-as 152
neighbor 192.168.152.3 ebgp-multihop 2

Peer 2

router bgp 151
bgp router-id 192.168.151.1
network 192.168.251.0/24
neighbor 192.168.152.3 remote-as 152
neighbor 192.168.152.3 ebgp-multihop 2

Software Router

interface lo
ip address 192.168.152.3/32

router bgp 152
bgp router-id 192.168.152.3
neighbor 192.168.150.1 remote-as 150
neighbor 192.168.150.1 update-source 192.168.152.3
neighbor 192.168.150.1 passive
neighbor 192.168.151.1 remote-as 151
neighbor 192.168.151.1 update-source 192.168.152.3
neighbor 192.168.151.1 passive
neighbor 10.0.0.162 remote-as 152
neighbor 10.0.0.162 port 1179
neighbor 10.0.0.162 timers connect 30
neighbor 10.0.0.162 route-reflector-client

Hardware Switch

router bgp 65000
bgp router-id 0.0.0.1
neighbor 10.0.0.162 remote-as 65000
neighbor 10.0.0.162 port 1179
neighbor 10.0.0.162 timers connect 30
In addition, the following lines in /etc/network/interfaces configure the bridge:
auto br_xen
iface br_xen
    bridge-ports swp1 swp2 swp3
    address 192.168.150.2/24
    address 192.168.151.2/24
    address 192.168.152.2/24
Cumulus Networks, sFlow and data center automation describes how to configure sFlow monitoring on Cumulus Linux switches. The switch is configured to send sFlow to 10.0.0.162 (the host running the SDN controller).

SDN Routing Application


SDN router using merchant silicon top of rack switch describes how to install sFlow-RT and provides an application for pushing active routes to an accelerator. The application has been modified for this setup and is running on host 10.0.0.162:
bgpAddNeighbor('10.0.0.152',152,false);
bgpAddNeighbor('10.0.0.233',65000,true);
bgpAddSource('10.0.0.233','10.0.0.152',10);

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  let top = bgpTopPrefixes('10.0.0.152',20000,1);
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  let tgt = bgpTopPrefixes('10.0.0.233',0);
  if(!tgt || 'established' != tgt.state) return;

  for(let i = 0; i < top.topPrefixes.length; i++) {
    let entry = top.topPrefixes[i];
    if(bgpAddRoute('10.0.0.233',entry)) {
      installed[entry.prefix] = now;
    }
  }
  for(let prefix in installed) {
    let time = installed[prefix];
    if(time === now) continue;
    if(bgpRemoveRoute('10.0.0.233',prefix)) {
      delete installed[prefix];
    }
  }
}, 1);
Start the application:
$ ./start.sh 
2015-07-21T19:36:52-0700 INFO: Listening, BGP port 1179
2015-07-21T19:36:52-0700 INFO: Listening, sFlow port 6343
2015-07-21T19:36:53-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2015-07-21T19:36:53-0700 INFO: Starting com.sflow.rt.rest.SFlowApplication application
2015-07-21T19:36:53-0700 INFO: Listening, http://localhost:8008
2015-07-21T19:36:53-0700 INFO: bgp.js started
2015-07-21T19:36:57-0700 INFO: BGP open /10.0.0.152:50010
2015-07-21T19:37:23-0700 INFO: BGP open /10.0.0.233:55097
Next, examine the routing table on the Hardware Switch:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.233
192.168.150.0/24 dev br_xen proto kernel scope link src 192.168.150.2
192.168.151.0/24 dev br_xen proto kernel scope link src 192.168.151.2
192.168.152.0/24 dev br_xen proto kernel scope link src 192.168.152.2
This is the default set of routes configured to pass traffic to and from the Software Router.

To generate traffic using iperf, run the following command on Peer 2:
iperf -s -B 192.168.251.1
And generate traffic with the following command on Peer 1:
iperf -c 192.168.251.1 -B 192.168.250.1
Now check the routing table on the Hardware Switch again:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.233
192.168.150.0/24 dev br_xen proto kernel scope link src 192.168.150.2
192.168.151.0/24 dev br_xen proto kernel scope link src 192.168.151.2
192.168.152.0/24 dev br_xen proto kernel scope link src 192.168.152.2
192.168.250.0/24 via 192.168.150.1 dev br_xen proto zebra metric 20
192.168.251.0/24 via 192.168.151.1 dev br_xen proto zebra metric 20
Note the two hardware routes that have been added by the SDN controller. The route override can be verified by repeating the traceroute test:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
1 192.168.150.2 (192.168.150.2) 3.260 ms 3.151 ms 3.014 ms
2 192.168.251.1 (192.168.251.1) 4.418 ms 4.351 ms 4.260 ms
Comparing with the original traceroute, notice that packets bypass the Software Router interface (192.168.150.3) and are forwarded entirely in hardware.

The traffic analytics driving the forwarding decisions can be viewed through the sFlow-RT REST API:
$ curl http://10.0.0.162:8008/bgp/topprefixes/10.0.0.152/json
{
 "as": 152,
 "direction": "destination",
 "id": "192.168.152.3",
 "learnedPrefixesAdded": 2,
 "learnedPrefixesRemoved": 0,
 "nPrefixes": 2,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1437535255553,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "150",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.150.1",
   "origin": "IGP",
   "prefix": "192.168.250.0/24",
   "value": 1.4462334178258518E7
  },
  {
   "aspath": "151",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.151.1",
   "origin": "IGP",
   "prefix": "192.168.251.0/24",
   "value": 391390.33359066787
  }
 ],
 "valuePercentCoverage": 100,
 "valueTopPrefixes": 1.4853724511849185E7,
 "valueTotal": 1.4853724511849185E7
}
The SDN application automatically removes routes from the hardware once they become idle, or to make room for more active routes if the hardware routing table exceeds the set limit of 20,000 routes, or if they are withdrawn. This switch has a maximum capacity of 32,768 routes and standard sFlow analytics can be used to monitor hardware table utilizations - Broadcom ASIC table utilization metrics, DevOps, and SDN.
The test setup was put together quickly to prove the concept using the limited hardware at hand; a production deployment would be improved by using smaller CIDRs and VLANs to separate peer traffic.
This proof of concept demonstrates that it is possible to use SDN analytics and control to combine standard sFlow and BGP capabilities of commodity hardware and deliver Terabit routing capacity with just a few thousand dollars of hardware.

CORD: Open-source spine-leaf Fabric


Live demonstration of SDN leaf and spine traffic engineering recorded at the Open Networking Summit. The open source ONOS controller implements segment routing using OpenFlow 1.3 to control a four switch leaf and spine network of commodity switches. For more detail on the use of real-time sFlow analytics from commodity switches in this demonstration, see Leaf and spine traffic engineering using segment routing and SDN.