Group Project Information

Department of Accounting & Law
State University of New York at Albany
Acc 522. Statistical Methods for Business Decisions
Fall, 2001
J Gangolly

Objectives
Some Basics
The Definitions
The Dataset
Requirements and Deliverables
Objectives:
The objective of this group project assignment is to provide you with an opportunity to statistically analyse data, in a
domain of interest to us as auditors, using the techniques covered during the semester. The data is from the Third
International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with the Fifth International
Conference on Knowledge Discovery and Data Mining in 1999. However, our task in the project is fairly basic.
(For those interested in the auditing of information systems in networked environments and in information security,
the full details of the competition are at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.) Those with a
sustained interest in information security can get more information from the paper "Cost-based Modeling and Evaluation for Data
Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project" by Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas
Prodromidis, and Philip K. Chan (http://www.cs.columbia.edu/~wfan/papers/costdisex.ps.gz).
The dataset comes from the domain of intrusion detection, which aims to protect a computer network from unauthorized users, including
hackers, terrorists, and perhaps insiders.
This handout provides all the information you’ll need to complete your group project successfully.
Some Basics (from the KDD site):
A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a
source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an
attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:

DOS: denial-of-service, e.g. syn flood;

R2L: unauthorized access from a remote machine, e.g. guessing password;

U2R: unauthorized access to local superuser (root) privileges, e.g., various ``buffer overflow'' attacks;

probing: surveillance and other probing, e.g., port scanning.
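The grouping of individual attack names into these four categories, as tagged in the Definitions section of this handout, can be written down as a lookup table. The following is a minimal Python sketch for illustration only (the project analysis itself must be done in S-Plus). The names ATTACK_CATEGORY and categorize are hypothetical, and stripping a trailing period from the label is an assumption about how labels may appear in the file.

```python
# Attack name -> category, transcribed from the tags in the Definitions section.
ATTACK_CATEGORY = {
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "multihop": "r2l",
    "phf": "r2l", "spy": "r2l", "warezclient": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
}


def categorize(label):
    """Return 'normal', one of the four attack categories, or 'unknown'."""
    # Assumption: labels may carry stray whitespace or a trailing period.
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(name, "unknown")
```

For example, categorize("smurf") gives "dos", and any label that is not in the table comes back as "unknown" rather than raising an error.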
Definitions:
(You need to be concerned about these only if you have an enduring interest in auditing of information
systems.)
back DoS: Denial of service attack against the Apache web server where a client requests a URL containing many backslashes. As the
server tries to process these requests it will slow down and be unable to process other requests.
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
buffer_overflow (u2r): The most common kind of DoS attack is simply to send more traffic to a network address than the
programmers who planned its data buffers anticipated someone might send. The attacker may be aware that the target system has a
weakness that can be exploited or the attacker may simply try the attack in case it might work. A few of the better-known attacks
based on the buffer characteristics of a program or system include:

Sending e-mail messages that have attachments with 256-character file names to Netscape and Microsoft mail programs

Sending oversized Internet Control Message Protocol (ICMP) packets (this is also known as the Packet Internet or InterNetwork Groper (PING) of death)

Sending to a user of the Pine e-mail program a message with a "From" address larger than 256 characters
(http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)
ftp_write (r2l): The anonymous FTP root directory (~ftp) and its subdirectories should not be owned by the ftp account or be in the
same group as the ftp account. This is a common configuration problem. If any of these directories are owned by ftp or are in the
same group as the ftp account and are not write protected, an intruder will be able to add files (such as a .rhosts file) and eventually
gain access to the system. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
guess_passwd (r2l): Ideally, the password should be something that nobody could guess. In practice, most people choose a
password that is easy to remember, such as their name or their initials. This is one reason it is relatively easy to break into most
computer systems.
imap (r2l): The Imap server must be run with root privileges so it can access mail folders and undertake some file manipulation on
behalf of the user logging in. After login, these privileges are discarded. However, a vulnerability exists in the way the login transaction
is handled, and this can be exploited to gain privileged access on the server. By preparing carefully crafted text to a system running a
vulnerable version of the Imap server, remote users can cause a buffer overflow and execute arbitrary instructions with root
privileges. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
ipsweep probe: Scanning a network to find valid IP addresses. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
land DoS:
1. An attacker can send a specifically formatted packet that can cause a remote server to crash, causing a DoS (Denial of
Service). (http://www.eeye.com/html/Products/Retina/rths/DoS/38.html)
2. Some implementations of TCP/IP are vulnerable to packets that are crafted in a particular way (a SYN packet in which the
source address and port are the same as the destination--i.e., spoofed). Land is a widely available attack tool that exploits
this vulnerability. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
loadmodule (u2r): (SunOS 4.1.x) The loadmodule program is used by the xnews window system server to load two dynamically
loadable kernel drivers into the currently running system and to create special devices in the /dev directory to use those modules.
Because of the way the loadmodule program sanitizes its environment, unauthorized users can gain root access on the local machine.
A script is publicly available and has been used to exploit this vulnerability. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
multihop (r2l): Multi-day scenario in which a user first breaks into one machine, and then uses the compromised machine as a
stepping stone for different attacks on other machines. Uses several different exploit methods to gain
access. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
neptune DoS: For each half-open connection made to a machine the tcpd server adds a record to a data structure describing all
pending connections. This data structure is of finite size, and it can be made to overflow by intentionally creating too many
partially-open connections. The half-open connections data structure on the victim server system will eventually fill; then the system will be
unable to accept any new incoming connections until the table is emptied out. Normally there is a timeout associated with a pending
connection, so the half-open connections will eventually expire and the victim server system will recover. However, the attacking
system can simply continue sending IP-spoofed packets requesting new connections faster than the victim system can expire the
pending connections. In some cases, the system may exhaust memory, crash, or be rendered otherwise
inoperative. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
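The mechanism described in the neptune entry (a finite table of pending connections, a timeout per entry, and requests arriving faster than entries expire) can be illustrated with a toy simulation. This is a simplified Python sketch for illustration only, not a model of any real TCP implementation. The function name and all parameters are hypothetical, and it assumes that none of the incoming requests ever completes the handshake.

```python
from collections import deque


def simulate_backlog(capacity, timeout, arrivals_per_tick, ticks):
    """Count connection requests refused by a finite half-open table."""
    pending = deque()  # expiry tick of each half-open entry, oldest first
    refused = 0
    for now in range(ticks):
        # Expire entries whose timeout has elapsed.
        while pending and pending[0] <= now:
            pending.popleft()
        # New requests arrive; assume none ever completes the handshake.
        for _ in range(arrivals_per_tick):
            if len(pending) < capacity:
                pending.append(now + timeout)
            else:
                refused += 1
    return refused
```

With a table of 10 entries and a timeout of 5 ticks, one request per tick is never refused, whereas three requests per tick fill the table within a few ticks and later requests are turned away.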
nmap probe: NMAP does three things. First, it will ping a number of hosts to determine if they are alive or not. Second, it will
portscan hosts to determine what services are listening. Third, it will attempt to determine the OS of hosts.
(http://www.insecure.org/nmap/lamont-nmap-guide.txt)
perl (u2r): On systems that support saved set-user-ID and set-group-ID, suidperl does not properly relinquish its root privileges
when changing its effective user and group IDs. On a system that has the suidperl or sperl program installed and that supports saved
set-user-ID and saved set-group-ID, anyone with access to an account on the system can gain root
access. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
phf (r2l):
1. The most well-known CGI bug is the 'phf' library shipped with NCSA httpd. The 'phf' library is supposed to allow
server-parsed HTML, but can be exploited to give back any file. Other well-known CGI scripts that an intruder might attempt to
exploit are: TextCounter, GuestBook, EWS, info2www, Count.cgi, handler, webdist.cgi, php.cgi, files.pl, nph-test-cgi,
nph-publish, AnyForm, FormMail. If you see somebody trying to access one or all of these CGI scripts (and you don't use
them), then it is a clear indication of an intrusion attempt (assuming you don't have a version installed that you actually want
to use). (http://www.isaserver.org/pages/intrusion%20detection%20faq.htm)
2. Any CGI program which relies on the CGI function escape_shell_cmd() to prevent exploitation of shell-based library calls
may be vulnerable to attack. In particular, this includes the "phf" program which is distributed with the example code. The
phf program allows remote users to run arbitrary commands on the server. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
pod DoS: Some systems will react in an unpredictable fashion when receiving oversized IP packets. Possible reactions include
crashing, freezing, and rebooting. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
portsweep probe: Surveillance sweep through many ports to determine which services are supported on a single host. Portsweeps
can be made partially stealthy by not finishing the 3-way handshake that opens a port (i.e., FIN scanning).
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
rootkit (u2r): A rootkit is a collection of tools (programs) that a hacker uses to mask intrusion and obtain administrator-level access
to a computer or computer network. The intruder installs a rootkit on a computer after first obtaining user-level access, either by
exploiting a known vulnerability or cracking a password. The rootkit then collects userids and passwords to other machines on the
network, thus giving the hacker root or privileged access. (http://whatis.techtarget.com/definition/0,289893,sid9_gci547279,00.html)
satan probe: Network probing tool which looks for well known vulnerabilities. (http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
smurf DoS: In this attack, the perpetrator sends an IP ping (or "echo my message back to me") request to a receiving site. The ping
packet specifies that it be broadcast to a number of hosts within the receiving site's local network. The packet also indicates that the
request is from another site, the target site that is to receive the denial of service. (Sending a packet with someone else's return address
in it is called spoofing the return address.) The result will be lots of ping replies flooding back to the innocent, spoofed host. If the
flood is great enough, the spoofed host will no longer be able to receive or distinguish real traffic.
(http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)
spy (r2l): SPY is a LAN protocol analyzer running on UNIX platforms. It has a built-in interface to capture LAN traffic via a network
interface. This capture facility supports Ethernet, FDDI, SLIP/CSLIP, PPP and PLIP. SPY also provides a so-called User Capture
Interface (UCI), through which users' own programs can feed SPY with their packets. Captured data can be stored to files in binary
format for later analysis. The capture facility provides prefilters on the MAC and IP layers (this does not mean that SPY only supports
IP networks). i386 version. (http://www.antioffline.com/TID/sniffers/)
teardrop DoS: This type of denial of service attack exploits the way that the Internet Protocol (IP) requires a packet that
is too large for the next router to handle be divided into fragments. The fragment packet identifies an offset to the
beginning of the first packet that enables the entire packet to be reassembled by the receiving system. In the teardrop
attack, the attacker's IP puts a confusing offset value in the second or later fragment. If the receiving operating system
does not have a plan for this situation, it can cause the system to crash.
(http://searchsecurity.techtarget.com/sDefinition/0,,sid14_gci213591,00.html)
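The role of the fragment offset in the teardrop entry can be made concrete with a small consistency check: once all fragments of a packet have arrived, each one should start exactly where the previous one ended. The following Python sketch is for illustration only. The function name is hypothetical, offsets and lengths are taken to be in bytes (a simplification of the real IP header encoding), and the check assumes the full set of fragments is available.

```python
def inconsistent_fragments(fragments):
    """Return True if (offset, length) fragments overlap or leave a gap.

    Assumes every fragment of the packet is present and that offsets
    and lengths are expressed in bytes.
    """
    expected = 0
    for offset, length in sorted(fragments):
        if offset != expected:
            return True  # starts before or after the previous fragment ended
        expected = offset + length
    return False
```

A well-formed pair such as (0, 100) followed by (100, 50) passes the check, whereas a second fragment claiming offset 60 overlaps the first and is flagged.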
warezclient (r2l): A multisession scenario in which the Warezmaster puts a file on an anonymous ftp site with a world-writeable
directory (such as an "incoming" directory) and Warezclients then retrieve the file.
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
warezmaster (r2l): A multisession scenario in which the Warezmaster puts a file on an anonymous ftp site with a world-writeable
directory (such as an "incoming" directory) and Warezclients then retrieve the file.
(http://www.cs.columbia.edu/~sal/JAM/PROJECT/EVAL-RESULTS-REVISED/B/attacks.html)
The Dataset:
I have placed the dataset for the project at /db4/teach/corrected.txt on cayley.bus. It contains the first 1110 lines of
the file corrected.gz in the KDD dataset. I have chopped down the original dataset in order to meet the limitations of
disk space allotted to you in the Unix cluster.
The original KDD dataset contains training as well as test datasets. We will use just a small part of the test data set.
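For those curious how such an extract can be produced, taking the first lines of a gzip-compressed text file needs only a few lines of code. This is a Python sketch for illustration only and not necessarily how corrected.txt was actually made. The function name is hypothetical, and the file paths are whatever you pass in.

```python
import gzip
from itertools import islice


def head_of_gzip(src_path, dst_path, n_lines=1110):
    """Copy the first n_lines lines of a gzip-compressed text file."""
    written = 0
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        # islice stops after n_lines, so the rest is never decompressed.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

The return value is the number of lines actually written, which is smaller than n_lines if the source file is shorter.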
In the dataset, the attribute/value(s) are:
1. (back,buffer_overflow,ftp_write,guess_passwd,
imap,ipsweep,land,loadmodule,multihop,neptune,
nmap,normal,perl,phf,pod,portsweep,rootkit,
satan,smurf,spy,teardrop,warezclient,warezmaster).
2. duration: continuous.
3. protocol_type: symbolic.
4. service: symbolic.
5. flag: symbolic.
6. src_bytes: continuous.
7. dst_bytes: continuous.
8. land: symbolic.
9. wrong_fragment: continuous.
10. urgent: continuous.
11. hot: continuous.
12. num_failed_logins: continuous.
13. logged_in: symbolic.
14. num_compromised: continuous.
15. root_shell: continuous.
16. su_attempted: continuous.
17. num_root: continuous.
18. num_file_creations: continuous.
19. num_shells: continuous.
20. num_access_files: continuous.
21. num_outbound_cmds: continuous.
22. is_host_login: symbolic.
23. is_guest_login: symbolic.
24. count: continuous.
25. srv_count: continuous.
26. serror_rate: continuous.
27. srv_serror_rate: continuous.
28. rerror_rate: continuous.
29. srv_rerror_rate: continuous.
30. same_srv_rate: continuous.
31. diff_srv_rate: continuous.
32. srv_diff_host_rate: continuous.
33. dst_host_count: continuous.
34. dst_host_srv_count: continuous.
35. dst_host_same_srv_rate: continuous.
36. dst_host_diff_srv_rate: continuous.
37. dst_host_same_src_port_rate: continuous.
38. dst_host_srv_diff_host_rate: continuous.
39. dst_host_serror_rate: continuous.
40. dst_host_srv_serror_rate: continuous.
41. dst_host_rerror_rate: continuous.
42. dst_host_srv_rerror_rate: continuous.
I have provided the definitions for each value of attribute 1 in the Definitions section above for your information. Feel free to
browse the web to get more details, or see me (or e-mail me).
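The attribute list above can be turned into a small record parser. This is a Python sketch for illustration only (the project analysis itself must be done in S-Plus). It assumes comma-separated lines with the columns in the order listed above, and "label" is a made-up name for attribute 1, which the list leaves unnamed. In the original KDD files the label comes last, so check the assumed order against the first line of corrected.txt before relying on it.

```python
# Attribute names in the order listed in the handout. "label" is a
# hypothetical name for attribute 1, which the handout leaves unnamed.
FIELDS = [
    "label", "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
    "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "su_attempted", "num_root", "num_file_creations", "num_shells",
    "num_access_files", "num_outbound_cmds", "is_host_login",
    "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
    "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate",
]

# Attributes marked "symbolic" in the list (plus the label) stay as strings.
SYMBOLIC = {"label", "protocol_type", "service", "flag", "land",
            "logged_in", "is_host_login", "is_guest_login"}


def parse_record(line, delimiter=","):
    """Split one line into a dict keyed by attribute name."""
    values = line.strip().split(delimiter)
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values, got %d" % (len(FIELDS), len(values)))
    # Assumption: every "continuous" attribute parses as a number.
    return {name: (value if name in SYMBOLIC else float(value))
            for name, value in zip(FIELDS, values)}
```

Keeping symbolic attributes as strings and converting only the continuous ones mirrors the symbolic/continuous distinction in the list.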
Requirements & Deliverables:
Analyse the dataset in corrected.txt using any of the methods we studied during the semester in order to understand the intrusion
behavior. The analysis must be done using S-Plus. The group project report must include any programs you may write to analyse the
text, the results in S-Plus graphics, and your observations based on the analysis. The report must be a narrative. There is no limit on
the length of the report, but 10-20 pages should be sufficient.
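As one example of the kind of summary such an analysis might start from, the mean of a continuous attribute can be computed separately for each connection label. The sketch below is in Python purely to illustrate the computation; the submitted analysis must still be done in S-Plus. The function name is hypothetical, and it assumes each record is a dictionary with a "label" key (a made-up name for attribute 1) plus the attribute names from the dataset list.

```python
from collections import defaultdict
from statistics import mean


def mean_by_label(records, attribute):
    """Return {label: mean of the given continuous attribute}."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: each record is a dict with a "label" key.
        groups[rec["label"]].append(float(rec[attribute]))
    return {label: mean(values) for label, values in groups.items()}
```

Comparing, say, the mean src_bytes of normal connections with that of each attack type is one simple way to begin looking for differences in intrusion behavior.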
The groups must make an oral presentation of their reports during the class on December 11, 2001. You may use the PC in the
classroom to make PowerPoint presentations.
The written report is due at the end of the class on December 11, 2001.
Jagdish S. Gangolly (November 6, 2001)