Department of Accounting & Law
State University of New York at Albany

Acc 522. Statistical Methods for Business Decisions
Fall, 2001
J Gangolly

Group Project Information

Objectives; Some Basics; The Definitions; The Dataset; Requirements and Deliverables

Objectives:

The objective of this group project assignment is to provide you with an opportunity to statistically analyse data, in a domain of interest to us as auditors, using the techniques covered during the semester. The data is from The Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with The Fifth International Conference on Knowledge Discovery and Data Mining during 1999. However, our task in the project is fairly basic. (For those interested in the auditing of information systems in networked environments and in information security, please find the full details of the competition at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.)

Those with sustained interest in information security can get more information from the paper "Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project" by Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K. Chan (http://www.cs.columbia.edu/~wfan/papers/costdisex.ps.gz).

The dataset comes from the domain of intrusion detection, whose aim is to protect a computer network from unauthorized users, including hackers, terrorists, and perhaps insiders. This handout provides all the information you’ll need in order to successfully do your group project.

Some Basics (from the KDD site):

A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:

  DOS: denial-of-service, e.g., syn flood;
  R2L: unauthorized access from a remote machine, e.g., guessing password;
  U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
  probing: surveillance and other probing, e.g., port scanning.

Definitions: (You need to be concerned about these only if you have an enduring interest in auditing of information systems.)

back DoS: Denial of service attack against the Apache web server in which a client requests a URL containing many backslashes. As the server tries to process these requests it will slow down and be unable to process other requests. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

buffer_overflow (u2r): The most common kind of DoS attack is simply to send more traffic to a network address than the programmers who planned its data buffers anticipated someone might send. The attacker may be aware that the target system has a weakness that can be exploited or the attacker may simply try the attack in case it might work. A few of the better-known attacks based on the buffer characteristics of a program or system include:

  Sending e-mail messages that have attachments with 256-character file names to Netscape and Microsoft mail programs
  Sending oversized Internet Control Message Protocol (ICMP) packets (this is also known as the Packet Internet or InterNetwork Groper (PING) of death)
  Sending to a user of the Pine e-mail program a message with a "From" address larger than 256 characters

(http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)

ftp_write (r2l): The anonymous FTP root directory (~ftp) and its subdirectories should not be owned by the ftp account or be in the same group as the ftp account. This is a common configuration problem.
If any of these directories are owned by ftp or are in the same group as the ftp account and are not write protected, an intruder will be able to add files (such as a .rhosts file) and eventually gain access to the system. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

guess_passwd (r2l): Ideally, the password should be something that nobody could guess. In practice, most people choose a password that is easy to remember, such as their name or their initials. This is one reason it is relatively easy to break into most computer systems.

imap (r2l): The IMAP server must be run with root privileges so it can access mail folders and undertake some file manipulation on behalf of the user logging in. After login, these privileges are discarded. However, a vulnerability exists in the way the login transaction is handled, and this can be exploited to gain privileged access on the server. By sending carefully crafted text to a system running a vulnerable version of the IMAP server, remote users can cause a buffer overflow and execute arbitrary instructions with root privileges. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

ipsweep probe: Scanning a network to find valid IP addresses. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

land DoS:
  1. An attacker can send a specifically formatted packet that can cause a remote server to crash, causing a DoS (Denial of Service). (http://www.eeye.com/html/Products/Retina/rths/DoS/38.html)
  2. Some implementations of TCP/IP are vulnerable to packets that are crafted in a particular way (a SYN packet in which the source address and port are the same as the destination, i.e., spoofed). Land is a widely available attack tool that exploits this vulnerability.
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

loadmodule (u2r): (SunOS 4.1.x) The loadmodule program is used by the xnews window system server to load two dynamically loadable kernel drivers into the currently running system and to create special devices in the /dev directory to use those modules. Because of the way the loadmodule program sanitizes its environment, unauthorized users can gain root access on the local machine. A script is publicly available and has been used to exploit this vulnerability. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

multihop (r2l): Multi-day scenario in which a user first breaks into one machine, and then uses the compromised machine as a stepping stone for different attacks on other machines. Uses several different exploit methods to gain access. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

neptune DoS: For each half-open connection made to a machine, the tcpd server adds a record to a data structure describing all pending connections. This data structure is of finite size, and it can be made to overflow by intentionally creating too many partially open connections. The half-open connections data structure on the victim server system will eventually fill; then the system will be unable to accept any new incoming connections until the table is emptied out. Normally there is a timeout associated with a pending connection, so the half-open connections will eventually expire and the victim server system will recover. However, the attacking system can simply continue sending IP-spoofed packets requesting new connections faster than the victim system can expire the pending connections. In some cases, the system may exhaust memory, crash, or be rendered otherwise inoperative. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

nmap probe: NMAP does three things.
First, it will ping a number of hosts to determine if they are alive or not. Second, it will portscan hosts to determine what services are listening. Third, it will attempt to determine the OS of hosts. (http://www.insecure.org/nmap/lamont-nmap-guide.txt)

perl (u2r): On systems that support saved set-user-ID and set-group-ID, suidperl does not properly relinquish its root privileges when changing its effective user and group IDs. On a system that has the suidperl or sperl program installed and that supports saved set-user-ID and saved set-group-ID, anyone with access to an account on the system can gain root access. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

phf (r2l):
  1. The most well-known CGI bug is the 'phf' library shipped with NCSA httpd. The 'phf' library is supposed to allow server-parsed HTML, but can be exploited to give back any file. Other well-known CGI scripts that an intruder might attempt to exploit are: TextCounter, GuestBook, EWS, info2www, Count.cgi, handler, webdist.cgi, php.cgi, files.pl, nph-test-cgi, nph-publish, AnyForm, FormMail. If you see somebody trying to access one or all of these CGI scripts (and you don't use them), then it is a clear indication of an intrusion attempt (assuming you don't have a version installed that you actually want to use). (http://www.isaserver.org/pages/intrusion%20detection%20faq.htm)
  2. Any CGI program which relies on the CGI function escape_shell_cmd() to prevent exploitation of shell-based library calls may be vulnerable to attack. In particular, this includes the "phf" program which is distributed with the example code. The phf program allows remote users to run arbitrary commands on the server. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

pod DoS: Some systems will react in an unpredictable fashion when receiving oversized IP packets. Possible reactions include crashing, freezing, and rebooting.
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

portsweep probe: Surveillance sweep through many ports to determine which services are supported on a single host. Portsweeps can be made partially stealthy by not finishing the 3-way handshake that opens a port (i.e., FIN scanning). (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

rootkit (u2r): A rootkit is a collection of tools (programs) that a hacker uses to mask intrusion and obtain administrator-level access to a computer or computer network. The intruder installs a rootkit on a computer after first obtaining user-level access, either by exploiting a known vulnerability or cracking a password. The rootkit then collects userids and passwords to other machines on the network, thus giving the hacker root or privileged access. (http://whatis.techtarget.com/definition/0,289893,sid9_gci547279,00.html)

satan probe: Network probing tool which looks for well known vulnerabilities. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

smurf DoS: In this attack, the perpetrator sends an IP ping (or "echo my message back to me") request to a receiving site. The ping packet specifies that it be broadcast to a number of hosts within the receiving site's local network. The packet also indicates that the request is from another site, the target site that is to receive the denial of service. (Sending a packet with someone else's return address in it is called spoofing the return address.) The result will be lots of ping replies flooding back to the innocent, spoofed host. If the flood is great enough, the spoofed host will no longer be able to receive or distinguish real traffic. (http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)

spy (r2l): SPY is a LAN protocol analyzer running on UNIX platforms. It has a built-in interface to capture LAN traffic via a network interface.
This capture facility supports Ethernet, FDDI, SLIP/CSLIP, PPP and PLIP. SPY also provides a so-called User Capture Interface (UCI), through which your own programs can feed SPY with their packets. Of course, captured data can be stored to files in binary format for later analysis. The capture facility provides prefilters on the MAC and IP layer (this does not mean that SPY only supports IP networks). i386 version. (http://www.antioffline.com/TID/sniffers/)

teardrop DoS: This type of denial of service attack exploits the way that the Internet Protocol (IP) requires a packet that is too large for the next router to handle be divided into fragments. The fragment packet identifies an offset to the beginning of the first packet that enables the entire packet to be reassembled by the receiving system. In the teardrop attack, the attacker's IP puts a confusing offset value in the second or later fragment. If the receiving operating system does not have a plan for this situation, it can cause the system to crash. (http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)

warezclient (r2l): A multisession scenario in which the Warezmaster puts a file on an anonymous ftp site with a world-writeable directory (such as an "incoming" directory) and Warezclients then retrieve the file. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

warezmaster (r2l): A multisession scenario in which the Warezmaster puts a file on an anonymous ftp site with a world-writeable directory (such as an "incoming" directory) and Warezclients then retrieve the file. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)

The Dataset:

I have placed the dataset for the project at /db4/teach/corrected.txt on cayley.bus. It contains the first 1110 lines of the file corrected.gz in the KDD dataset. I have chopped down the original dataset in order to meet the limitations of disk space allotted to you in the unix cluster.
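
For the curious, a subset like the one described above could be produced along the following lines. This is only an illustrative sketch in Python (the project analysis itself must still be done in S-Plus). The function name first_lines and the file paths passed to it are hypothetical, and the sketch assumes that corrected.gz is a gzip-compressed text file with one connection record per line.

```python
import gzip
from itertools import islice


def first_lines(gz_path, out_path, n=1110):
    """Copy the first n lines of a gzip-compressed text file to out_path.

    Returns the number of lines actually written, which is smaller than n
    when the source file is shorter than n lines.
    """
    written = 0
    # "rt" opens the compressed file in text mode, so we iterate over lines.
    with gzip.open(gz_path, "rt") as src, open(out_path, "w") as dst:
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written
```

A call such as first_lines("corrected.gz", "corrected.txt") would then leave a 1110-line text file, provided the source has at least that many records.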
The original KDD dataset contains training as well as test datasets. We will use just a small part of the test dataset. In the dataset, the attributes and their values are:

  1. (back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster).
  2. duration: continuous.
  3. protocol_type: symbolic.
  4. service: symbolic.
  5. flag: symbolic.
  6. src_bytes: continuous.
  7. dst_bytes: continuous.
  8. land: symbolic.
  9. wrong_fragment: continuous.
  10. urgent: continuous.
  11. hot: continuous.
  12. num_failed_logins: continuous.
  13. logged_in: symbolic.
  14. num_compromised: continuous.
  15. root_shell: continuous.
  16. su_attempted: continuous.
  17. num_root: continuous.
  18. num_file_creations: continuous.
  19. num_shells: continuous.
  20. num_access_files: continuous.
  21. num_outbound_cmds: continuous.
  22. is_host_login: symbolic.
  23. is_guest_login: symbolic.
  24. count: continuous.
  25. srv_count: continuous.
  26. serror_rate: continuous.
  27. srv_serror_rate: continuous.
  28. rerror_rate: continuous.
  29. srv_rerror_rate: continuous.
  30. same_srv_rate: continuous.
  31. diff_srv_rate: continuous.
  32. srv_diff_host_rate: continuous.
  33. dst_host_count: continuous.
  34. dst_host_srv_count: continuous.
  35. dst_host_same_srv_rate: continuous.
  36. dst_host_diff_srv_rate: continuous.
  37. dst_host_same_src_port_rate: continuous.
  38. dst_host_srv_diff_host_rate: continuous.
  39. dst_host_serror_rate: continuous.
  40. dst_host_srv_serror_rate: continuous.
  41. dst_host_rerror_rate: continuous.
  42. dst_host_srv_rerror_rate: continuous.

I have provided the definitions for each item within attribute 1 in the box above for your information. Feel free to browse the web to get more details, or see me (or e-mail me).

Requirements & Deliverables:

Analyse the dataset in corrected.txt using any of the methods we studied during the semester in order to understand the intrusion behavior.
The analysis must be done using S-Plus. The group project report must include any programs you may write to analyse the text, the results in S-Plus graphics, and your observations based on the analysis. The report must be a narrative. There is no limit on the length of the report, but 10-20 pages should be sufficient.

The groups must make an oral presentation of their reports during the class on December 11, 2001. You may use the PC in the classroom to make PowerPoint presentations. The written report is due at the end of the class on December 11, 2001.

Jagdish S. Gangolly (November 6, 2001)
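
Illustration: a first pass over the data might simply count how many connections fall into each of the four attack categories (plus normal). The sketch below does this in Python purely to show the idea; it is not a substitute for the S-Plus analysis required above. Several things in it are assumptions rather than facts from this handout: that the fields in corrected.txt are comma-separated, that the position of the attack label is passed in by the caller as label_index, and that a label may carry a trailing period. The names CATEGORY and category_counts are hypothetical. The label-to-category table follows the tags given in the Definitions section.

```python
import csv
from collections import Counter

# Attack label -> category, following the tags in the Definitions section.
CATEGORY = {
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "buffer_overflow": "u2r", "loadmodule": "u2r",
    "perl": "u2r", "rootkit": "u2r",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l",
    "multihop": "r2l", "phf": "r2l", "spy": "r2l",
    "warezclient": "r2l", "warezmaster": "r2l",
    "ipsweep": "probe", "nmap": "probe",
    "portsweep": "probe", "satan": "probe",
    "normal": "normal",
}


def category_counts(path, label_index):
    """Count connection records per category in a comma-separated file.

    label_index is the position of the attack label within each record
    (an assumption supplied by the caller; -1 means the last field).
    Labels missing from CATEGORY are counted under "unknown".
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            # Assumption: a label may end with a period, e.g. "smurf."
            label = row[label_index].strip().rstrip(".")
            counts[CATEGORY.get(label, "unknown")] += 1
    return counts
```

Calling category_counts("corrected.txt", -1) would treat the last field of each record as the label; use 0 instead if the label turns out to be the first field. A large "unknown" count is a sign that the label position or the separator assumption is wrong.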