Detecting Skype Flows Hidden in Web Traffic
Presenter: Kuei-Yu Hsu
Advisor: Dr. Kai-Wei Ke
2013/4/29

Outline
• Introduction
• Proposed Methodology
• Experimental Datasets
• Experimental Results
• Conclusions

Introduction
• What is VoIP?
• Delude restrictive firewalls
• Skype Proprietary Protocol
• About Detection

What is VoIP?
• VoIP (Voice over Internet Protocol) refers to a way to carry phone calls over an IP data network, whether on the Internet or on your own internal network.
• VoIP calls are usually much cheaper than traditional long-distance telephone calls to PSTN users, or even free if a call is placed directly from one VoIP end user to another.

Delude restrictive firewalls
• Restrictive firewalls are commonly adopted by network managers to better secure the internal network and to optimize the use of network resources.
• Such firewalls are unlikely to block Web traffic, because it is usually perceived as a fundamental service essential for Internet access.
• Applications can therefore deliver non-HTTP traffic over TCP ports 80 (HTTP) or 443 (HTTPS), fooling restrictive firewalls to gain network access.

Skype Proprietary Protocol
• Skype can delude a network firewall by using Web ports to establish communication with other Skype peers.
• Skype adopts this strategy as a fallback mechanism when other strategies fail to get through a restrictive firewall.
• Skype traffic disguised as Web traffic in this way is quite difficult for network operators to detect.

About Detection
Detection of Skype flows in Web traffic:
1. HTTP Workload Model
2. Goodness-of-fit tests
   1) Chi-square test
   2) Kolmogorov-Smirnov test
3. P2P VoIP characteristics
Detection Process:
1. Training Datasets
2. Evaluation Datasets

Proposed Methodology
• HTTP Workload Model
• Goodness-of-fit tests
   1) Chi-square test
   2) Kolmogorov-Smirnov test
• Skype characteristics

Proposed Methodology
1. Define an HTTP workload model and capture real Web data to build empirical distributions of some relevant parameters.
2. Capture Web traffic with VoIP calls hidden in it, calculate the same parameters for each flow, and use metrics taken from two goodness-of-fit tests to decide whether the computed parameters are compatible with the empirical distributions derived in the previous step, classifying each flow as legitimate Web traffic or not.

HTTP Workload Model
Define a model to evaluate “normal” Web behavior. The model has the following parameters:
1. Web request size
2. Web response size
3. Interarrival time between requests
4. Number of requests per page
5. Page retrieval time

Goodness-of-fit tests
1. Chi-square test
• First investigated by Karl Pearson in 1900.
$$\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$
• $O_i$: an observed frequency
• $E_i$: an expected (theoretical) frequency, asserted by the null hypothesis
• $K$: the number of classes

Goodness-of-fit tests
2. Kolmogorov-Smirnov test
• Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.
$$D = \max_{x} \left| F_0(x) - S_N(x) \right| \qquad (2)$$
• $F_0(x)$: the empirical distribution function derived from the training part
• $S_N(x)$: the cumulative step function of a sample of $N$ observations

Skype characteristics
• It does not use SIP or any other known signaling protocol for VoIP calls, and all its traffic is end-to-end encrypted.
• It automatically detects network characteristics and chooses the best available option to communicate with other Skype peers.
• It uses Web ports only as a fallback mechanism, when UDP is not available.

Experimental Datasets
1. Training Datasets - model part
2. Evaluation Datasets - detection part

Training Datasets - model part
A training dataset is used to characterize “normal” Web traffic behavior.
1. tcpdump: capture full HTTP packet traces, generating dump files.
2. tcpflow: read these dump files and calculate the parameters present in the Web workload model.

Training Datasets
• We read HTTP headers to clearly identify each Web request or Web response, and we also compute the inactivity time between Web messages.
• ISP: Internet service provider
• ACD: academic institution

Evaluation Datasets - detection part
1. tcpdump: Web packet traces were captured again, but this time only the TCP/IP headers were captured.
2. Another program: the calculations and the division of flows into Web pages are done without examining TCP payload (HTTP header) information.

Evaluation Datasets
• Web message size: every MTU-sized packet is considered part of the same Web message, provided there is not too much inactive time between packets.
• We used the number of requests per page as a filter to remove smaller flows.
• The other three parameters (Web request size, Web response size, and interarrival time between requests) are each represented by a list of values, which are used in Equations (1) and (2) to generate a χ² or a Kolmogorov-Smirnov D score.

Evaluation Datasets
• We thus have three values that can be compared with thresholds to decide whether a set of related request-response messages is likely to be Skype or not.
• VoIP calls of different durations were produced in a controlled way by a small network of computers running the Skype program behind port-restrictive firewalls.

Experimental Results
• Sensitivity and specificity
• ROC curves
• Detecting Skype flows
• Evaluating real-time detection

Sensitivity and specificity
• Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function.
• The test outcome can be positive or negative:
   True positive = correctly identified
   False positive = incorrectly identified
   True negative = correctly rejected
   False negative = incorrectly rejected

ROC curves
• A ROC (Receiver Operating Characteristic) curve is a graphical plot of the sensitivity against (1−specificity) of a binary classifier.
• Sensitivity is the same as the true positive rate, and (1−specificity) is equal to the false positive rate.
• The classifier has a discrimination threshold that is varied to produce the different points on the curve.

Detecting Skype flows
• Fig. 5. χ² detection.
   • 90% of 80 Skype flows correctly identified (i.e. true positive rate) with less than 2% of 17,294 non-Skype flows incorrectly identified (i.e. false positive rate).
   • A 100% detection rate with around 5% false positives.
• Fig. 6. Kolmogorov-Smirnov D detection.
   • A true positive rate of 70% with a false positive rate of around 2%.
   • An 80% detection rate with 5% false positives.
• The χ² ROC curve is always closer to the top-left corner than the K-S curve.

Evaluating real-time detection
• A network administrator may want to identify the Skype calls that are currently using the network, not the calls made some minutes or hours ago.
• Here the data is captured and analyzed over short, limited time intervals.
• The χ² detection using the newly generated trace (the set of all 10 s capture files) had a true positive rate of up to 85%, with fewer false positives than the χ² detection using the ISP-3 trace.

Conclusions
• It is rather common to find non-HTTP traffic using Web ports to delude firewalls and other network elements.
• We evaluated a Skype detection system based on statistical tests that efficiently detects Skype flows hidden among Web traffic, without searching for particular Skype patterns or signatures and without inspecting payload information.
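As a rough illustration of the statistical tests mentioned above, the following Python sketch computes a χ² score and a Kolmogorov-Smirnov D score for one flow parameter (for example, Web request size) against a training sample. It is not the authors' tool: the function names, the `bin_edges` classes, the handling of empty classes, and the sample values are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def chi_square_score(sample, training, bin_edges):
    """Pearson chi-square statistic of `sample` against the training data.

    Values fall into K = len(bin_edges) + 1 classes (bin_edges must be sorted).
    Expected frequencies E_i are the training proportions scaled to the
    sample size; observed frequencies O_i are counted from the sample.
    """
    k = len(bin_edges) + 1

    def class_counts(values):
        counts = [0] * k
        for v in values:
            counts[bisect_right(bin_edges, v)] += 1
        return counts

    observed = class_counts(sample)
    trained = class_counts(training)
    score = 0.0
    for o_i, t_i in zip(observed, trained):
        e_i = len(sample) * t_i / len(training)
        if e_i > 0:  # assumption: classes never seen in training are skipped
            score += (o_i - e_i) ** 2 / e_i
    return score


def ks_distance(sample, training):
    """Kolmogorov-Smirnov distance D = max |F0(x) - S_N(x)|.

    F0 is the empirical distribution function of the training data and
    S_N the cumulative step function of the sample. Both are step
    functions, so the maximum is reached at one of the observed values.
    """
    sample_sorted = sorted(sample)
    training_sorted = sorted(training)
    d = 0.0
    for x in set(sample_sorted) | set(training_sorted):
        f0 = bisect_right(training_sorted, x) / len(training_sorted)
        s_n = bisect_right(sample_sorted, x) / len(sample_sorted)
        d = max(d, abs(f0 - s_n))
    return d


# Made-up request sizes in bytes, for illustration only.
training_sizes = [300, 350, 400, 420, 500, 520, 610, 700, 800, 950]
web_like_flow = [310, 410, 480, 530, 650, 900]
voip_like_flow = [120, 130, 125, 128, 122, 131]
size_classes = [400, 600, 800]  # hypothetical class boundaries

web_chi2 = chi_square_score(web_like_flow, training_sizes, size_classes)
voip_chi2 = chi_square_score(voip_like_flow, training_sizes, size_classes)
web_d = ks_distance(web_like_flow, training_sizes)
voip_d = ks_distance(voip_like_flow, training_sizes)
```

In this toy data the flow of small, near-constant messages scores much higher on both metrics than the Web-like flow, which is the kind of gap a threshold can then separate. How the per-parameter scores are combined, and which thresholds are used, is left open here.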
Conclusions
• We manually produced Skype traffic to build our Web evaluation dataset and to verify that the proposed parameters can identify Skype flows hidden among HTTP traffic.
• Using simple metrics taken from two goodness-of-fit tests, the χ² value and the Kolmogorov-Smirnov distance, we show that Skype flows can be clearly detected, but our results suggest that the χ² metric is a much better choice.

Conclusions
• Considering the experimental results for the chi-square detection, our methodology gives network management enough flexibility to adopt different approaches to the possible detection of Skype flows in Web traffic.
• As future work, we intend to:
   • further analyze real-time detection by investigating the minimum time interval needed;
   • build and evaluate an optimized version of our tool to perform real-time monitoring on network links.

References
E. P. Freire, A. Ziviani, and R. M. Salles, "Detecting Skype Flows in Web Traffic," Proc. of the IEEE/IFIP Network Operations and Management Symposium (NOMS 2008), April 2008, pp. 89-96.
E. P. Freire, A. Ziviani, and R. M. Salles, "Detecting VoIP Calls Hidden in Web Traffic," IEEE Transactions on Network and Service Management, vol. 5, pp. 210-214, December 2008.

Thanks for listening