Chapter 5: Statistical Flow Analysis
Tran Song Dat Phuc
SeoulTech, 11.2015

Network flow record analysis is analogous to the analysis of traffic patterns on real-life roads: it gathers information about activity on a computer network. Every packet that traverses a network can be logged and recorded. A flow record captures basic information about a flow, including source and destination IP address, port, protocol, date, time, and the amount of data transmitted. Network flow records were originally generated, captured, and analyzed to improve network performance, but it has become clear that the data is valuable for other reasons as well.

Forensic analysts investigating a crime may find that flow records reveal a detailed portrait of network activity. Statistical analysis of flow data serves various purposes:
• Identifying compromised hosts: a compromised host may send out more traffic than normal, transmit or receive traffic on unusual ports, communicate with malicious systems …
• Confirming or disproving data leakage: determine whether an attacker exported sensitive information across the network perimeter by analyzing the volume of exported data and deciding whether a leak may have occurred.
• Individual profiling: flow data provides information about an individual’s activity (working hours, lunch and break times, sources of entertainment, inappropriate activity …) and shows which users communicate and exchange… This is helpful in cases that involve HR issues.

Statistical flow analysis is important in the modern world because of the sheer volume of traffic that organizations produce, manage, and inspect. Collecting and recording the full contents of all packets is possible, but it requires a large amount of storage space, has a high impact on performance, and is often unnecessary. Statistical flow analysis helps investigators identify specific targets for content analysis and further investigation. There is a trade-off: every bit of data that is not collected is lost forever, but too much useless data can make analysis difficult or impossible.
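To make the idea of a flow record concrete, the following is a minimal Python sketch, not taken from any particular tool. The FlowRecord class, its field names, the INTERNAL_NETS ranges, and the sample addresses are all assumptions chosen for illustration. It totals the bytes each internal host sent across the perimeter, which is the kind of volume figure used when confirming or disproving data leakage.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow record holding the fields named in the text."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    start_time: float  # seconds since some reference point
    bytes_sent: int    # amount of data transmitted in the flow


# Assumed internal address space; a real network would differ.
INTERNAL_NETS = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]


def is_internal(addr: str) -> bool:
    """Return True if addr falls inside one of the assumed internal ranges."""
    ip = ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def outbound_bytes_by_host(flows):
    """Sum bytes sent from internal hosts to external destinations."""
    totals = defaultdict(int)
    for flow in flows:
        if is_internal(flow.src_ip) and not is_internal(flow.dst_ip):
            totals[flow.src_ip] += flow.bytes_sent
    return dict(totals)


flows = [
    FlowRecord("10.0.0.5", "203.0.113.9", 50123, 443, "tcp", 0.0, 1_500),
    FlowRecord("10.0.0.5", "203.0.113.9", 50124, 443, "tcp", 60.0, 900_000),
    FlowRecord("10.0.0.7", "10.0.0.5", 40000, 22, "tcp", 10.0, 4_000),
    FlowRecord("198.51.100.2", "10.0.0.7", 443, 51000, "tcp", 20.0, 20_000),
]

print(outbound_bytes_by_host(flows))  # {'10.0.0.5': 901500}
```

Only the two flows that leave the assumed internal ranges are counted; internal-to-internal and inbound flows are ignored. A large total for a single host is a starting point for further investigation, not proof of a leak.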
5.1 Process Overview
Flow Record Processing System
• Flow record: a subset of information about a flow. It includes source and destination IP address, port, protocol, date, time, and the amount of data transmitted in each flow.
• Sensor: a device that monitors the flows of traffic on a segment and extracts pieces of information into a flow record.
• Collector: a server that listens on the network for flow record data exported from a sensor and stores it to a hard drive. An organization may have multiple collectors to capture flow record data from different network segments, or to provide redundancy.
Once the flow record data has been exported and stored, it can be aggregated and analyzed using a wide variety of tools.

5.2 Sensors
The architecture of the local network environment and the types of equipment available for use as sensors influence whether flow record data can be produced. They also affect the output format of the flow export data and the subsequent analysis techniques.

Sensor Types
Sensors can be deployed as standalone appliances or as software processes running on network equipment that also serves other purposes. Many common types of network equipment are capable of producing flow record data. Setting up a standalone appliance may be preferable, because specific types of network equipment support only limited output formats for flow record export.

Network Equipment
• Some switches (Cisco Catalyst) support flow record generation and export.
• Current Cisco routers and firewalls can produce flow data and export it to a collector using the Cisco NetFlow format.
• SonicWALL supports the IPFIX flow record export protocol in addition to NetFlow.
• Some vendors base the exported flow record data on packet sampling. This is not appropriate for network forensics because it does not provide complete information.

Standalone Appliances
Network administrators or forensic investigators may choose to deploy a standalone appliance as a sensor for generating and exporting flow record data.
A software-based sensor can be deployed anywhere on the network. Organizations can set up port monitoring or a network tap to send traffic to the standalone sensor, which in turn processes the traffic, generates statistics, and sends flow record data to the collector. Deploying a standalone sensor can be preferable because the sensor does not need to function as a router or perform other duties, and there is no need to modify the existing network infrastructure.

Sensor Software
Most modern, enterprise-quality routers and switches can be configured to act as sensors and export flow record data across the network to collectors (Cisco, Juniper, SonicWALL, …). Alternatively, standalone sensors can be set up using free, open-source tools such as Argus, softflowd, yaf, …

Argus
Argus, which stands for “Audit Record Generation and Utilization System”, is a mature, libpcap-based network flow sensing, collection, and analysis toolkit. It is open-source software, copyrighted by QoSient, LLC and released under the GNU General Public License, and it has been tested on Mac OS X, Linux, FreeBSD, OpenBSD, NetBSD, and Solaris.
Argus is distributed as two packages: the Argus server, which reads packets from a network interface or packet capture file; and the Argus client tools, which are used to collect, distribute, process, and analyze Argus data. The Argus server functions as a sensor and can output flow export data in Argus’ compressed format, as files or over the network via UDP.
As a libpcap-based tool, Argus supports BPF filtering, which allows the user to fine-tune filtering and maximize performance. It also supports multiple output streams, which can be directed to different collector processes or hosts.

softflowd
softflowd is an open-source flow monitoring tool developed by Damien Miller. It is designed to passively monitor traffic and export flow record data in NetFlow format. It can be installed on Linux and OpenBSD operating systems and is based on libpcap.
The hardware requirements are minimal: you need a network card capable of capturing your target volume of traffic and a second network card for exporting flows from the sensor to a collector. Select disk space based on the volume of flow data you would like to retain.

yaf
yaf, which stands for “Yet Another Flowmeter”, is open-source flow sensor software. Released in 2006 under the GNU GPL, yaf is based on libpcap and can read packets from live interfaces or packet captures. It exports flow data in the IPFIX format over the SCTP, TCP, or UDP transport-layer protocols.
One of the most powerful features of yaf is that it supports BPF filters for filtering incoming traffic. This allows investigators and network engineers to dramatically reduce the processing capacity required and the volume of flow export data produced. yaf natively supports encrypted flow export using TLS.

Sensor Placement
Factors to consider when placing sensors include:
• Duplication: minimize duplication of the flows collected. Avoid placing sensors where the same traffic passes multiple sensors, and choose locations where targeted data can be obtained.
• Time synchronization: this is crucial. If the time is not accurate, it is difficult to correlate exported flows with data from other devices or to gather evidence from other sources.
• Perimeter versus internal traffic: visibility inside the network is valuable. Local flow data helps identify compromised workstations seeking to propagate to new targets, both internal and external.
• Resources: every investigator has limited resources. Decide what is most important, review the local network map, and choose choke points that maximize collection capacity within the available budget, time …
• Capacity: network devices have limited capacity to monitor and process traffic. If necessary, employ more filtering, or partition traffic by VLAN across multiple ports, sensors, interfaces …

Modifying the Environment
There are a few general options for gathering additional flow record data by modifying the environment:
• Leverage existing equipment: existing network equipment, including switches, routers, firewalls, and NIDS/NIPS equipment, is often capable of exporting flow record data after minor configuration changes. Check that the device has sufficient capacity and that its flow record format is compatible with the collection system.
• Upgrade network equipment: when existing equipment does not support flow record export or cannot handle the increased load, deploy replacement switches or other network devices. This can be straightforward, or it may require reconfiguration.
• Deploy additional sensors: when devices do not support flow record export but do support port mirroring, send the mirrored packets to a standalone sensor. Alternatively, deploy a network tap and send its data to a standalone sensor for flow record collection.

5.3 Flow Record Export Protocols
NetFlow: first developed in 1996 by Cisco Systems, Inc. for caching and exporting traffic flow information, with the purpose of improving performance. Many devices can produce NetFlow data, from routers to switches to standalone probes. When a flow is completed or no longer active, the sensor marks it as “expired” and exports the flow record data in a “NetFlow Export” packet, which is transmitted over the network to one or more collectors. The latest version, NetFlow v9, supports flexible template-based specification of flow record fields. This allows users to customize the data cached and exported by a sensor, giving greater flexibility for analysis as well as performance improvements.
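The expire-then-export behavior described above can be sketched in Python as follows. This is a simplified illustration rather than Cisco's implementation: the FlowCache and CachedFlow names, the 15-second inactivity timeout, and the fin argument standing in for "the flow is completed" are all assumptions, and real sensors typically apply additional rules such as a maximum lifetime for long-running flows.

```python
from dataclasses import dataclass


@dataclass
class CachedFlow:
    """Running totals for one flow while it sits in the sensor's cache."""
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Toy sensor cache: aggregates packets into flows and expires them."""

    def __init__(self, inactive_timeout=15.0):
        self.inactive_timeout = inactive_timeout  # assumed value, in seconds
        self.active = {}     # flow key -> CachedFlow
        self.exported = []   # stands in for export packets sent to a collector

    def observe(self, key, timestamp, size, fin=False):
        """Account for one packet belonging to the flow identified by key."""
        self.expire(timestamp)
        flow = self.active.get(key)
        if flow is None:
            flow = CachedFlow(first_seen=timestamp, last_seen=timestamp)
            self.active[key] = flow
        flow.last_seen = timestamp
        flow.packets += 1
        flow.octets += size
        if fin:  # the flow is completed, so export it immediately
            self._export(key)

    def expire(self, now):
        """Export every flow that has been idle longer than the timeout."""
        idle = [k for k, f in self.active.items()
                if now - f.last_seen > self.inactive_timeout]
        for key in idle:
            self._export(key)

    def _export(self, key):
        self.exported.append((key, self.active.pop(key)))


cache = FlowCache(inactive_timeout=15.0)
web = ("10.0.0.5", "203.0.113.9", 50123, 443, "tcp")
dns = ("10.0.0.5", "10.0.0.1", 53000, 53, "udp")

cache.observe(dns, 0.0, 80)
cache.observe(web, 1.0, 60)
cache.observe(web, 2.0, 1500, fin=True)  # completed flow is exported
cache.expire(30.0)                       # idle DNS flow is exported

for key, flow in cache.exported:
    print(key, flow.packets, flow.octets)
```

The flow key here is the usual tuple of source address, destination address, source port, destination port, and protocol. The web flow is exported as soon as it completes, and the DNS flow is exported only once it has been idle for longer than the timeout.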
IPFIX: a successor to Cisco’s NetFlow, based on NetFlow v9. It handles bidirectional flow reporting, reduces data redundancy when reporting on flows with similar attributes, and provides better interoperability as an open standard. The flow record data is extensible through data templates. The sensor sends templates to the collector that define the data to be exported, then uses these templates to construct the flow data export packets, which it sends to the collector as flows expire.

sFlow: developed by InMon Corporation, sFlow is a standard for providing network visibility based on packet sampling. It is supported by many network device manufacturers (not Cisco). It conducts statistical packet sampling and does not support recording and processing information about every packet. It scales well to large networks with high throughput. Packets that are not sampled are not recorded and cannot be analyzed.

Flow Export: Transport-Layer Protocols
Traditionally, flow export data was transmitted over the network via UDP, a connectionless, unreliable transport-layer protocol (dropped data is never recovered). NetFlow v9 and IPFIX support reliable transport-layer protocols such as the Stream Control Transmission Protocol (SCTP) for flow export data transmission. SCTP was designed from inception to allow for volume transfer, including multiple streams over a single session, and it provides reliability for multiple voluminous streams.

5.4 Collection & Aggregation
Collection Placement & Architecture
Factors to consider when placing collectors include:
• Congestion: choose locations where transit times between sensor and collector are unlikely to be affected by either routine or incident-based congestion (for example, congestion caused by a worm).
• Security: to protect confidentiality, place collectors on segments where the paths between collector and sensor are access-controlled and well protected. It may be necessary to encrypt the flow export data with protocols such as IPsec or TLS.
• Reliability: use reliable transport-layer protocols, such as TCP or SCTP.
• Capacity: decide how many collectors to set up. Each collector must keep up with the bandwidth of its sensors, so consider network card capacity, processing power, RAM, storage space…
• Strategy for analysis: consider the architecture of the analysis environment.

Collection Systems
SiLK: the System for Internet-Level Knowledge is a suite of command-line tools for the collection and storage of flow data. It also provides a powerful toolset for the filtering and analysis of flow records. Produced by the Network Situational Awareness (NetSA) group at CERT, SiLK can process NetFlow v5 and v9 and IPFIX data, and it can produce statistical data in various output formats configurable to a fine level of granularity. It includes two collector-specific tools:
• flowcap: listens for flow records on a network socket, temporarily stores flow data to disk or RAM, and forwards a compressed stream to a client program.
• rwflowpack: collects flow record data and exports it to the compressed SiLK Flow format.

flow-tools: collects NetFlow traffic exported by sensors via UDP. It only accepts input on UDP ports and does not support TCP or SCTP.

nfdump/NfSen: released under the open-source BSD license. It reads flow export data from a UDP network socket or from pcap files, and it allows extensive customization of the data files stored on disk.

Argus: the client tools suite includes a collector. All the client programs accept input in Argus and NetFlow v1-8 formats from multiple network or filesystem sources, and they can send Argus output to up to 128 client programs via the network or the filesystem.

5.5 Analysis
Statistics: the science which has to do with the collection, classification, and analysis of facts of a numerical nature regarding any topic.

Flow Record Analysis Techniques
The precise analysis techniques vary on a case-by-case basis, depending on investigative goals and resources.
• Goals and resources: assess the available time, staff, equipment, and tools.
Strategize and identify the analysis techniques that best fit the investigative goals and available resources.
• Starting indicators: these include the IP address of a system, a time frame, known ports, and specific flow records.

Analysis Techniques
• Filtering: narrow down a large pool of evidence to a subset or groups of subsets. With flow record data, only a small percentage is selected for detailed analysis and presentation.
• Baselining: network administrators build a profile of normal network activity. Forensic investigators can then compare flow record traffic to the baseline profile to identify anomalies.
• “Dirty values”: suspicious values to search for when looking for relevant evidence. These include IP addresses, ports, or protocols.
• Activity pattern matching: every activity leaves fingerprints on the network. Patterns can indicate suspicious activities (such as high volume), and some match the behavior of viruses or worms.

Filtering
Filtering is fundamental to flow record analysis. As a forensic investigator, your job is to remove extraneous details and identify events, hosts, ports, and activities of interest. Alternatively, you might filter for flows that match particular patterns of activity indicative of worm behavior, data exfiltration, port scanning, or other suspicious behavior, depending on your investigative goals. Most analysis techniques used for forensic analysis of flow records are based on filtering.

Baselining
Network traffic analysts can build, maintain, and reference a baseline of traffic and identify trends and patterns of activity that are considered “normal” for the environment. By aggregating a large volume of flow information, it’s possible to build a large, detailed set of benchmarks of what should be considered normal over any span of time and across business cycles.
• Network baselines: By looking at general trends over time for monitored segments, the traffic seen can be understood even if specific source and destination IP addresses must be abstracted or generalized.
• Host baselines: Likewise, when a particular host becomes of interest, investigators can build or refer to a historical baseline of that host’s activities in order to identify or investigate anomalous behavior.

Dirty Values
Network forensic analysts can compile a list of “dirty values” and search flow record data to pick out relevant entries. In statistical flow analysis, “dirty values” aren’t usually words; they are more likely to be suspicious IP addresses, ports, dates, and times. As you conduct your analysis, you will often find it helpful to maintain an updated list of the suspicious values that you collect as you move forward.

Activity Pattern Matching
Network flow record data, when aggregated, represents complex behavior that is often predictable and can be analyzed mathematically. We have previously compared network flows to traffic on physical roads. Like physical-world phenomena, network flows contain patterns that can be empirically measured, mathematically described, and logically analyzed.
The biggest challenge for forensic investigators and anyone else who analyzes flow record data is the absence of large, publicly available data sets for research and comparison. However, within an organization, it is entirely possible to empirically analyze day-to-day traffic and build statistical models of normal behavior.

Elements
• IP address: the source and destination IP addresses are great clues that reveal a lot about the cause and purpose of a flow. Consider whether the addresses are on an internal network or Internet-exposed, their countries of origin, what companies they are registered to, and other factors.
• Ports: much of the time, port numbers correspond with assigned or well-known ports linked to specific applications or services. Port numbers can also indicate whether a system is port scanning or being scanned, and they can help you identify malicious activity.
• Protocol and flags: layer 3 and 4 protocols are often tracked in flow record data.
These can indicate whether connections were completed and help you tell the difference between connection attempts that were denied by firewalls, successful port scans, successful data transfers, and more. They can also help you make educated guesses as to the content and purpose of a flow.
• Directionality: the directionality of flows is crucial. It can indicate whether proprietary data has been leaked or a malicious program was downloaded. Taken in aggregate, the directionality of data transfers can allow you to tell the difference between web-surfing activity and web-serving activity.
• Volume of data transferred: this can help indicate the type of activity and whether or not higher-layer data transfer attempts were successful (many small TCP packets may be indicative of port scanning, whereas larger packets can indicate file exportation). The distribution of the data transferred over time also matters: a large volume of data transferred in a very short period of time usually has a different cause than the same amount of data transferred over a very long period of time.

Simple Patterns
• “Many-to-one” IP addresses: if many IP addresses send large volumes of traffic to one destination IP address, the cause may be a DoS attack, a syslog server, a “drop box” data repository, or an email server.
• “One-to-many” IP addresses: if one IP address sends large volumes of traffic to many destination IP addresses, the cause may be a web server, an email server, a spam bot, or network port scanning.
• “Many-to-many” IP addresses: many IP addresses sending distributed traffic to many destinations can be indicative of peer-to-peer filesharing traffic or a widespread virus infection.
• “One-to-one” IP addresses: if one IP address sends large volumes of traffic to one destination IP address, the cause may be a targeted attack or routine server communications.

Complex Patterns
Fingerprinting is the process of matching complex flow record patterns to specific activities.
When fingerprinting traffic, examine multiple elements and their context, and develop a hypothesis about the cause of the behavior. A TCP SYN port scan might have the following characteristics:
• One source IP address
• One or more destination IP addresses
• Destination port numbers increasing incrementally
• Volume of packets surpassing a specific value within a period of time
• TCP protocol
• Outbound protocol flags set to “SYN”

Flow Record Analysis Tools
SiLK: the SiLK suite includes powerful flow export data analysis capabilities.
• rwfilter: designed to extract flows of interest from a particular data repository, filter them by time and category, and then “partition” them further by protocol attributes. It supports a rich syntax that is generally as functional as BPF, though different.
• rwstats, rwcount, rwcut, rwuniq, et al.: SiLK includes an arsenal of basic flow export data manipulation utilities. rwstats produces statistical aggregations based on the protocol fields specified. rwcount counts packets and bytes. rwcut selects the fields that rwuniq can help you sort on.
• rwidsquery: can be fed either a Snort rule file or an alert file. It figures out which flows from its input would match the rule or alert, and it writes an rwfilter invocation to produce the flows that match.
• rwpmatch: essentially a libpcap-based program that reads in SiLK-formatted flow metadata and an input packet source, and saves out just the packets that match the flow metadata.
• Advanced SiLK: in addition to the chainable command-line suite, a Python interpreter, “PySiLK,” is available that exposes the SiLK functionality through a Python API. The NetSA team has also provided a “Tooltips” wiki so that the user community can share experience.
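The kind of question these filtering tools are used to answer, such as the TCP SYN port scan fingerprint listed earlier in this section, can be sketched in plain Python. This is not SiLK or PySiLK code: the dictionary keys (src, dst, dport, proto, flags, start), the function name find_syn_scanners, and the thresholds min_targets and window are hypothetical choices for illustration. The sketch checks the source, protocol, flag, and volume-over-time characteristics, but it does not test whether destination ports increase incrementally.

```python
from collections import defaultdict


def find_syn_scanners(flows, min_targets=5, window=10.0):
    """Return sources that sent SYN-only TCP flows to at least
    min_targets distinct (destination, port) pairs within window seconds."""
    probes = defaultdict(list)
    for f in flows:
        # Keep only TCP flows whose outbound flags are exactly "SYN".
        if f["proto"] == "tcp" and f["flags"] == "S":
            probes[f["src"]].append((f["start"], f["dst"], f["dport"]))

    scanners = set()
    for src, events in probes.items():
        events.sort()  # order probes by start time
        left = 0
        for right in range(len(events)):
            # Shrink the window until it spans at most `window` seconds.
            while events[right][0] - events[left][0] > window:
                left += 1
            targets = {(dst, port) for _, dst, port in events[left:right + 1]}
            if len(targets) >= min_targets:
                scanners.add(src)
                break
    return sorted(scanners)


# Eight SYN-only probes against consecutive ports, 0.1 seconds apart.
flows = [
    {"src": "198.51.100.7", "dst": "10.0.0.5", "dport": 20 + i,
     "proto": "tcp", "flags": "S", "start": i * 0.1}
    for i in range(8)
]
# One ordinary, completed web connection from another host.
flows.append({"src": "10.0.0.9", "dst": "203.0.113.9", "dport": 443,
              "proto": "tcp", "flags": "SAF", "start": 1.0})

print(find_syn_scanners(flows))
```

With the sample data, only the host that probed eight ports in under a second is reported. The thresholds would need tuning against a baseline of normal traffic before being used on real flow records.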
flow-tools: the flow-tools suite includes a variety of flow export data collection, storage, processing, and sending tools, including a few that are particularly useful for forensic analysis.
• flow-report: creates ASCII-readable text reports based on stored flow data. Report contents can be customized by the user through the configuration file and then sent as input to graphing or analysis programs.
• flow-nfilter: allows users to filter flow export data based on “primitives,” which are specific to flow-tools.
• flow-dscan: a particularly useful utility, designed to identify suspicious traffic based on flow export data. It includes features for identifying port scanning, host scanning, and denial-of-service attacks.

Argus Client Tools: the Argus suite includes a variety of specialized utilities with powerful analysis capabilities.
• ra: Argus’ basic tool for reading, filtering, and printing Argus data. It allows the user to specify fields for printing, select specific records for processing, match regular expressions, and more.
• racluster: clusters flow export data based on user-specified criteria. This is very helpful for printing summaries of flow record data.
• rasort: sorts flow data based on user-specified criteria, such as source or destination IP address, time, TTL, flow duration, and more.
• ragrep: provides powerful regular expression and pattern matching, based on the GNU “grep” utility.
• rahisto: generates frequency distribution tables for user-selected metrics such as flow record duration, source and destination port number, bytes transferred, packet count, bits per second, and more.
• ragraph: creates visual plots based on user-specified fields of interest, such as bytes, packet counts, average duration, IP address, ports, and more. ragraph includes a variety of options that allow users to customize the graph appearance.
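The clustering and sorting that racluster and rasort are described as providing can be approximated in a few lines of plain Python. This sketch does not use Argus or its file format; the dictionary keys (src, dst, bytes), the helper names cluster and top_by_bytes, and the sample flows are all hypothetical.

```python
from collections import defaultdict


def cluster(flows, fields):
    """Merge flows that share the same values for `fields`,
    counting the flows and summing their bytes."""
    groups = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for f in flows:
        key = tuple(f[name] for name in fields)
        groups[key]["flows"] += 1
        groups[key]["bytes"] += f["bytes"]
    return dict(groups)


def top_by_bytes(groups, n=3):
    """Return the n clusters with the most bytes, largest first."""
    ranked = sorted(groups.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return ranked[:n]


flows = [
    {"src": "10.0.0.11", "dst": "10.0.0.20", "bytes": 300},
    {"src": "10.0.0.12", "dst": "10.0.0.20", "bytes": 300},
    {"src": "10.0.0.13", "dst": "10.0.0.20", "bytes": 300},
    {"src": "10.0.0.11", "dst": "203.0.113.9", "bytes": 5000},
]

by_dst = cluster(flows, ["dst"])
for key, totals in top_by_bytes(by_dst):
    print(key, totals)
```

Clustering on the destination alone summarizes how much traffic each destination received, while clustering on ["src", "dst"] would summarize each pair of hosts. Choosing the key fields is the "user-specified criteria" step.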
FlowTraq: FlowTraq is a commercial flow record analysis tool developed by ProQueSys. It supports a very wide variety of input formats, including NetFlow v9, IPFIX, JFlow, and many others. It can also sniff traffic directly and generate flow records. Once flow records are collected, FlowTraq allows users to filter, search, sort, and produce reports based on them. In addition, you can specify patterns to generate alerts. FlowTraq supports a variety of operating systems (Windows, Linux, Mac, and Solaris), and it is designed and marketed for forensics and incident response (among other purposes).

nfdump/NfSen: the “nfdump” utility (part of the nfdump suite) is designed to read flow record data, analyze it, and produce customized output. It offers users powerful analysis features, including:
• Aggregate flow records by specific fields.
• Limit by time range.
• Generate statistics about IP addresses, interfaces, ports, and much more.
• Anonymize IP addresses.
• Customize output format.
• BPF-style filters.
“NfSen” (“Netflow Sensor”) provides a graphical, web-based interface for the nfdump suite. It is an open-source tool written in Perl and PHP, designed to run on Linux and POSIX-based operating systems.

EtherApe: EtherApe is an open-source, libpcap-based graphical tool that visually displays network activity in real time. It reads packet data directly from a network interface or pcap file. EtherApe does not take flow records as input. It is included here because it provides a nice high-level visualization of traffic patterns and therefore may be of interest to the reader.

5.6 Conclusion
Statistical flow record analysis is becoming increasingly important for forensic analysis. Although flow records were originally generated for the purposes of monitoring and improving network performance, they are also excellent sources of network-based forensic evidence.
A variety of sensor, collector, aggregation, and analysis tools is available, ranging from proprietary to free and open-source. One of the biggest challenges forensic investigators face is ensuring that the formats used by the sensors and collectors are compatible with the analysis tools chosen for the investigation. Statistical flow record analysis is a powerful field of study that is expected to grow over the coming decades.

References
Sherri Davidoff and Jonathan Ham, “Network Forensics: Tracking Hackers through Cyberspace”, Prentice Hall, 2012 (cloth, 576 pages, published 06/13/2012). ISBN-10: 0132564718, ISBN-13: 9780132564717.