An Efficient Architecture for Distributed Data Mining
Abstract:
Classification is a central topic in data mining and AI. Many methods have been
proposed by researchers in machine learning, business research, and expert systems.
This paper first introduces some important existing classification techniques,
such as C4.5, SLIQ, and SPRINT. Second, serial and parallel approaches are
described. Finally, distributed parallel classification techniques are discussed.
The object of this paper is to improve the efficiency and accuracy of existing
classification techniques by examining the advantages and disadvantages of some of
them. The construction of a classifier is the main focus of this paper; other details,
such as decision tree pruning, are not covered.
Many approaches have been introduced to improve existing classification
algorithms. Among them, C4.5 is recognized as one of the most popular classification
methods. However, it has a strong restriction: the training dataset must fit into
main memory. Dealing with large training datasets is the main consideration in
improving efficiency and accuracy. A distributed system environment is the ultimate
solution to the huge-training-dataset problem. How to partition the training dataset
among distributed hosts, and how to deal with the communication overhead among
shared-nothing processors, are also keys to improving the efficiency of parallel
classification. We try to find an ideal algorithm, one that takes advantage of all
existing parallel algorithms, in a distributed environment, to reach the highest
efficiency and accuracy in building a decision tree. A new parallel distributed
system with agent techniques seems to be a solution, and the implementation of such a
system will be shown in a future study.
1) Requirements
• Efficiency
As we deal with very large real-world databases, the efficiency and scalability of such
algorithms become major concerns of the data mining community. Many real-world mining
applications deal with very large training sets, i.e., millions of samples. However,
existing data mining algorithms have a strong limitation: the training samples must reside
in main memory. There is a critical tradeoff between efficiency and accuracy. Lifting this
limitation on decision tree construction by relying on virtual memory directly leads to
inefficiency, because it requires frequent swapping of the training samples between main
memory and cache. Early strategies for inducing trees from large databases include
discretizing continuous attributes and sampling data at each node. These, however, still
assume that the training set can fit in memory. An alternative method first partitions the
data into subsets that individually can fit into memory, and then builds a decision tree
from each subset. The final output classifier combines the classifiers obtained from the
subsets. Although this method allows for the classification of large data sets, its
classification accuracy is not as high as that of the single classifier that would have
been built using all of the data at once.
• Accuracy
• Influencing power
• Distribution
• Evolving
The enhancement of mining techniques is a continuous process with the volume of data
growing every day. The measure of a good mining algorithm is its ability to perform
‘effectively’ in mining useful knowledge and ‘efficiently’ in computational terms. This is
a constantly evolving process (Cooley, 1997).
2) Background: Decision Trees
A decision tree, one of the classification algorithms, has been improved by a series of
studies: decision tree generation [Hunt et al.], CART [Breiman et al., 1984], and C4.5
[Quinlan, 1993]. A decision tree is a tree in which each internal node denotes a test on
an attribute, each branch represents an outcome of the test, and each leaf node represents
a class distribution. The existing decision tree algorithms, such as ID3 and C4.5, have
been well established for relatively small data sets. A decision tree is constructed by a
recursive divide-and-conquer algorithm. Most decision tree classifiers perform
classification in two phases: tree building and tree pruning. Compared with the time taken
in the tree building phase, the time spent on tree pruning is far smaller, about one
percent of the time spent building the tree. For the rest of this paper, the tree building
phase is the main topic of discussion. A summary of the tree building schema follows:
MakeTree(training data T)
    Partition(T);

Partition(data S)
    if (all points in S are in the same class) then return;
    evaluate splits for each attribute A;
    use the best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
(We assume a binary decision tree is being built.)
Let F be the dataset of interest. If all instances of F belong to the same class, F
becomes a leaf. Otherwise, all attributes in F are examined, a test attribute A is
selected, and F is split based on this test attribute: let A have mutually exclusive
outcomes A1, A2, A3, ..., An, and let Fi be the subset of F containing those instances
with outcome Ai, 1 <= i <= n. The decision tree for F then has A as its root, with a
subtree for each outcome Ai of A. If Fi is empty, the subtree corresponding to outcome Ai
is a leaf that nominates the majority class in F; otherwise, the subtree for Ai is
obtained by applying the same procedure to the subset Fi of F. A leaf identifies the most
frequent class among its instances.
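The recursion above can be made concrete with a short sketch. The following Python
fragment is only an illustration of the divide-and-conquer schema for a binary tree; the
helper best_split, the dictionary-based node representation, and the toy splitting
criterion are our own assumptions, not part of any of the cited systems. A real
classifier would evaluate the Gini index or information gain (introduced below) inside
best_split.

from collections import Counter

def best_split(rows, labels):
    # Placeholder criterion: split the first attribute at its mean value.
    # A real classifier would evaluate the Gini index or information gain here.
    col = [r[0] for r in rows]
    return 0, sum(col) / len(col)

def build_tree(rows, labels, depth=0, max_depth=5):
    # Stop when all points are in the same class (or the tree is deep enough).
    if len(set(labels)) <= 1 or depth >= max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, value = best_split(rows, labels)            # evaluate splits for each attribute
    left = [i for i, r in enumerate(rows) if r[attr] <= value]
    right = [i for i, r in enumerate(rows) if r[attr] > value]
    if not left or not right:                         # empty partition -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"attr": attr, "value": value,
            "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
            "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth)}

print(build_tree([[1.0], [2.0], [8.0], [9.0]], ["B", "B", "G", "G"]))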
There are several computational models for determining the partition criterion.
Information gain, gain ratio, the Gini index, the chi-square contingency statistic, and
the G-statistic are statistical formulas for choosing a testing attribute. Information
gain [Quinlan, 1986] and the Gini index [Breiman et al., 1984], the two most popular, are
introduced here, since some of the techniques described in this paper use one or the
other. Breiman et al. [1984] determine the impurity of a set of instances from its class
distribution as follows:
Gini(S) = 1 - Σj pj^2                                      -- (1)
Gini-split(S) = (N1/N) * Gini(S1) + (N2/N) * Gini(S2)      -- (2)
In the formulas, S stands for a data set of N examples drawn from a number of classes, pj
stands for the relative frequency of class j in S, and Gini-split(S) stands for the index
of the divided data when S is divided into two subsets S1 and S2 containing N1 and N2
examples, respectively.
The advantage of the Gini index is that its calculation requires only the distribution of
the class values in each of the partitions, so we can simply scan each node's attribute
lists and evaluate split points based on each attribute to find the best testing attribute
for a node. The Gini index of a set of instances assumes its minimum value of zero when
all instances belong to a single class.
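As a concrete illustration of formulas (1) and (2), the short Python sketch below computes
Gini(S) and Gini-split(S) from class labels; the function names and the toy labels are
ours, chosen only for this example.

from collections import Counter

def gini(labels):
    # Gini(S) = 1 - sum_j pj^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # Gini-split(S) = (N1/N) * Gini(S1) + (N2/N) * Gini(S2)
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# A pure partition gives Gini = 0; an even two-class mix gives 0.5.
print(gini(["a", "a", "a"]), gini(["a", "b", "a", "b"]))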
The information gain measure [Quinlan, 1993] is the measure most commonly used for
choosing the testing attribute. The attribute with the highest information gain (or
greatest entropy reduction) is chosen as the testing attribute, as follows:
Info(S) = - Σ(j=1..k) (freq(Cj, S)/|S|) * log2(freq(Cj, S)/|S|)  bits     -- (1)
Info_X(T) = Σ(i=1..n) (|Ti|/|T|) * Info(Ti)                               -- (2)
Gain(X) = Info(T) - Info_X(T)                                             -- (3)
In formula (1), Info(S) sums over the k classes Cj in proportion to their frequencies in
S; this quantity is also known as the entropy of the set S. Now consider a similar
measurement after T has been partitioned in accordance with the n outcomes of a test X. In
formula (2), Info_X(T) measures the expected amount of information needed to identify the
class of a case in T after the partition; it is found as the weighted sum of Info(Ti) over
the subsets Ti.
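A minimal Python rendering of formulas (1)-(3), assuming labels are plain strings and a
split is given as a list of label subsets (the function names and data layout are ours):

import math
from collections import Counter

def info(labels):
    # Info(S) = -sum_j freq(Cj,S)/|S| * log2(freq(Cj,S)/|S|)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(parent_labels, subsets):
    # Gain(X) = Info(T) - sum_i |Ti|/|T| * Info(Ti)
    n = len(parent_labels)
    info_x = sum((len(s) / n) * info(s) for s in subsets)
    return info(parent_labels) - info_x

print(gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]))  # a perfect split gains 1 bit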
Many researchers have pointed out the strengths of C4.5 [Sprint, 19XX], as follows.
First, the construction of decision tree classifiers is relatively fast compared to other
classification methods, such as neural networks, which require extremely long training
times even for small datasets. A decision tree is an intuitive method for understanding
how a decision framework is built according to the data distribution of the available
attributes, and it can be converted into classification rules or even into SQL queries for
accessing databases. More interestingly, the accuracy of tree classifiers is still
comparable with that of other classification methods. On the other hand, many have pointed
out the restriction that the training data must reside in main memory. In data mining
applications, very large training sets are common. Hence, this restriction limits the
scalability of such algorithms. Decision tree construction can become inefficient due to
swapping of the training samples in and out of main memory and cache
[classification by decision tree induction 293].
Real-world datasets can be very large, containing up to trillions of records. Furthermore,
large datasets are desirable for improving the accuracy of the classification model.
Developing efficient classification methods for very large datasets is therefore an
essential and challenging problem. In recent years, many techniques have been proposed to
handle large datasets that cannot fit into main memory:
• Selecting an appropriate training subset: A technique called windowing was introduced
to try to remove the main-memory restriction. A subset of the training cases, called a
window, is selected randomly, and a decision tree is developed from it. The tree produced
from this subset is then used to classify the training cases outside the window; the
result is that some of them are misclassified. A selection of these exceptions is then
added to the initial window, and a second tree is built to classify the remaining cases.
This cycle is repeated until a tree built from the current window correctly classifies all
the training cases outside the window. The final window can be thought of as a screened
set of training cases that contains all the "interesting" ones, together with sufficient
"ordinary" cases to guide the tree building. However, there are some problems with
windowing. As Quinlan states, the cycle is repeated until a tree built from the current
window correctly classifies all the training cases outside the window. How long will the
process of constructing such a decision tree last? Certainly it will take much longer than
building a single decision tree from the whole training set. Also, the training cases used
in C4.5-like experiments were mostly free of noise; for most real-world classification
domains, the process is even slower. In addition, it is possible that the cycle will never
end, and will run out of main memory, because of noisy data. A sketch of this loop is
given after this list.
• Data partitioning/distribution: Partition the data set across different hosts. Some
approaches try to retrieve a small set of useful data from a large noisy dataset, build
small decision trees from small subsets of the whole training set, and combine them to
build a final decision tree. Chan and Stolfo [Chan, XX] have studied the method of
partitioning the input data and then building a classifier for each partition. The outputs
of the multiple classifiers are then combined to get the final classification. Their
results show that classification using multiple classifiers never achieves the accuracy of
a single classifier that can classify all of the data [XX]. Quinlan et al. [Quinlan, XX]
also introduced tree partitioning approaches: first, growing several alternative trees and
selecting the tree with the lowest predicted error rate; second, growing several trees,
generating production rules from all of them, and then constructing a single production
rule classifier from all the available rules. However, both approaches described here take
much longer to produce a final decision tree.
• Efficient algorithms and data structures: Improve computational performance using
efficient data structures and algorithms. XX pointed out that the bottleneck of C4.5 is
its inefficient computation. The program is slowed down by the use of a linear search for
the threshold (for continuous attributes) over the whole training set. This choice is
forced, since the cases in the training set may not be ordered with respect to the
attribute selected for the test. First, a binary search for thresholds can speed up the
linear search used in C4.5. Second, a counting sort method can be adopted instead of the
Quicksort of C4.5.
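The following Python fragment sketches the windowing cycle described in the first item
above. It is illustrative only: the window size, the exception-sampling rule, the stopping
bound, and the stand-in fit function (a majority-class model instead of a full C4.5-style
tree) are simplifying assumptions of ours.

import random
from collections import Counter

def fit(window):
    # Stand-in for tree building: remember the majority label of the window.
    return Counter(lbl for _, lbl in window).most_common(1)[0][0]

def windowing(training, window_size=20, max_rounds=50):
    window = random.sample(training, min(window_size, len(training)))
    for _ in range(max_rounds):
        model = fit(window)
        misclassified = [case for case in training
                         if case not in window and case[1] != model]
        if not misclassified:               # the tree classifies everything outside the window
            return model
        window += random.sample(misclassified, min(5, len(misclassified)))
    return fit(window)                      # give up after max_rounds (noise may prevent convergence)

# toy data: (feature, label) pairs
data = [(i, "pos" if i % 3 else "neg") for i in range(100)]
print(windowing(data))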
SLIQ is a decision tree classifier that can handle both numeric and categorical
attributes. It uses a novel pre-sorting technique in the tree-growth phase. This sorting
procedure is integrated with a breadth-first tree-growing strategy to enable
classification of disk-resident datasets. The dataset is partitioned vertically, i.e.,
into attribute lists. Each attribute value has a RID (record identifier) associated with
it; the RID can be thought of as a pointer connecting an entry in each attribute list with
the corresponding entry in the class list. See figure 4.
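In lieu of the figure, a minimal Python sketch of this vertical layout (the variable names
and toy records are ours): each attribute list holds (value, rid) pairs pre-sorted on the
value, and a separate class list maps each rid to its class label and current leaf.

# Training records: (age, salary, class)
records = [(30, 65, "G"), (23, 15, "B"), (40, 75, "G"), (55, 40, "B")]

# Class list: rid -> [class label, current leaf]; kept memory-resident in SLIQ.
class_list = {rid: [cls, "root"] for rid, (_, _, cls) in enumerate(records)}

# One attribute list per attribute, each entry (value, rid), pre-sorted by value.
attribute_lists = {
    "age":    sorted((rec[0], rid) for rid, rec in enumerate(records)),
    "salary": sorted((rec[1], rid) for rid, rec in enumerate(records)),
}

# The rid ties the two structures together: for any attribute value we can
# recover its class and the leaf it currently belongs to.
value, rid = attribute_lists["age"][0]
print(value, class_list[rid])        # -> 23 ['B', 'root']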
The algorithm for evaluating splits in SLIQ:
EvaluateSplits()
    for each attribute A do
        traverse the attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list, and
                hence the corresponding class and leaf node
            update the class histogram in the leaf
        if A is a categorical attribute then
            for each leaf of the tree do
                find the subset of A with the best split
The algorithm for updating the class list in SLIQ:
After finding a testing attribute and its splitting points, the next step is to create
child nodes for each of the leaf nodes and update the class list:
UpdateLabels()
    for each attribute A used in a split do
        traverse the attribute list of A
        for each value v in the attribute list do
            let e be the class-list entry corresponding to v (found via its rid)
            find the new class c to which v belongs by applying
                the splitting test at the node referenced from e
            update the class label of e to c
            update the node referenced in e to the child corresponding to the class c
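A hedged Python rendering of this update for a single numeric split, using the same
(value, rid) attribute-list and class-list shapes as the sketch above; the split value and
child names are made up for the example.

def update_labels(attribute_list, class_list, split_value, left_child, right_child):
    # One pass over the split attribute's list: move every rid's class-list
    # entry to the child node it falls into under the test value <= split_value.
    for value, rid in attribute_list:
        class_list[rid][1] = left_child if value <= split_value else right_child

class_list = {0: ["G", "root"], 1: ["B", "root"], 2: ["G", "root"]}
age_list = [(23, 1), (30, 0), (40, 2)]            # (value, rid), pre-sorted
update_labels(age_list, class_list, 35, "L1", "L2")
print(class_list)   # rids 0 and 1 now reference L1, rid 2 references L2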
The strength of this approach is that finding the best attribute and updating the leaf
references in the class list are done in one scan of the attribute list corresponding to a
specific node of the tree. Before constructing the tree, pre-sorting is applied to every
continuous attribute list, and each record's association is maintained through a record
id. This eliminates the need to sort the data at each node of the decision tree. Instead,
the training data are sorted just once for each continuous attribute, at the beginning of
the tree-growth phase. Second, the attribute lists do not have to be in main memory while
constructing the decision tree, so the training dataset can be much larger than the
memory-resident datasets required by C4.5-like techniques. The weakness of SLIQ: because
of the frequent updates and random access to the class list, SLIQ assumes that there is
enough memory to keep the class list memory-resident. The size of the class list grows
proportionally with the number of tuples in the training set. When the class list cannot
fit into memory, the performance of SLIQ decreases.
To overcome the limitation of SLIQ, which requires the class list to reside in main
memory, SPRINT (Scalable Parallel Classifier for Data Mining) was introduced. It is a
decision-tree-based classification algorithm that tries to remove all of the memory
restrictions, and it is claimed to be fast and scalable. The goal of SPRINT was not to
outperform SLIQ on datasets where a class list can fit in memory. Instead, the purpose of
SPRINT is to develop an accurate classifier for datasets that are simply too large for any
other algorithm, and to be able to develop such a classifier efficiently.
In SPRINT, all the information needed to find split points on a particular attribute
associated with a node is stored in histograms. For a continuous attribute, two
histograms, Cabove and Cbelow, are associated with each decision-tree node that is under
consideration for splitting. Cbelow maintains the class distribution for the attribute
records that have already been processed, and Cabove maintains it for those that have not.
For a categorical attribute, in contrast, only one histogram is needed, containing the
class distribution for each value of the given attribute. A hash table is also needed so
that the non-testing attributes can be split according to the split point found in the
testing attribute. The hash table can become very large as the number of nodes under
consideration for splitting increases.
SPRINT focuses on the two major steps of the tree-growth phase that have critical
performance implications:
1. How to find split points that define node tests. (Continuous attributes are the main
concern here.) The Gini formula is used to find the testing attribute. For a continuous
attribute, the candidate split points are the mid-points between every two consecutive
attribute values in the subset. Cbelow is initialized to zero, and Cabove is initialized
with the class distribution of the subset of records for a specific node. While scanning
the subset of records and updating the histograms, if a winning split point is found it is
saved, and Cabove and Cbelow are deallocated. You may wonder how Cabove is initialized;
the answer is simple: each time an attribute is split (physically), the class distribution
of each resulting subset is saved for the corresponding child node, and it is used to
initialize that child's Cabove.
For a categorical attribute, the processing is much simpler. All that is needed is a
single scan over the records, updating the histogram. At the end of the scan, the Gini
value is calculated and compared with the other candidates, and the histogram is
deallocated. Once again, after an attribute is split, the initial state of the new
histogram for each child must be saved for future calculations. A sketch combining this
step with the partitioning step below is given after the list.
2. Having chosen a split point, how to partition the data.
Once we have found a testing attribute along with its split point, we start performing the
split. The testing attribute list is scanned one more time and physically partitioned, and
the partition information is inserted into a hash table that indicates how the other
attribute lists should be partitioned. The hash table provides a mapping between record
identifiers and the node to which each record belongs after the split. This mapping is
then probed to split the other attribute lists in a consistent manner: each non-testing
attribute list is partitioned based on the information (the RID associated with a child
node) in the hash table.
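The following Python sketch illustrates both steps under simplifying assumptions of ours
(binary split, two classes, in-memory lists, and our own function names): Cbelow/Cabove
histograms are updated while scanning a sorted continuous attribute list to find the best
Gini split, and a rid-to-child hash table is then built and probed to partition a
non-testing attribute list consistently.

from collections import Counter

def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def find_split(attr_list, classes):
    # attr_list: (value, rid) pairs sorted by value; classes: rid -> class label.
    c_above = Counter(classes[rid] for _, rid in attr_list)   # records not yet processed
    c_below = Counter()                                       # records already processed
    n, best = len(attr_list), (float("inf"), None)
    for i, (value, rid) in enumerate(attr_list[:-1]):
        c_below[classes[rid]] += 1
        c_above[classes[rid]] -= 1
        split = ((i + 1) / n) * gini(c_below) + ((n - i - 1) / n) * gini(c_above)
        midpoint = (value + attr_list[i + 1][0]) / 2          # candidate split point
        best = min(best, (split, midpoint))
    return best[1]

def partition(test_list, other_list, split_value):
    # Physically split the testing attribute list; record each rid's child in a hash table.
    probe = {rid: (0 if value <= split_value else 1) for value, rid in test_list}
    children = ([], [])
    for value, rid in other_list:                             # probe to split the other list
        children[probe[rid]].append((value, rid))             # order within each child is preserved
    return probe, children

classes = {0: "G", 1: "B", 2: "G", 3: "B"}
age    = [(23, 1), (30, 0), (40, 2), (55, 3)]
salary = [(15, 1), (40, 3), (65, 0), (75, 2)]
cut = find_split(age, classes)
print(cut, partition(age, salary, cut))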
The strengths of SPRINT are identified as follows:
1. SPRINT uses the pre-sorting approach for every continuous attribute.
2. SPRINT uses an attribute-list data structure that holds the class and RID information.
When a node is split, the attribute lists are partitioned and distributed among the
resulting child nodes accordingly. When a list is partitioned, the order of the records in
the list is maintained; hence, partitioning lists does not require resorting. SPRINT was
also designed to be easily parallelized, further contributing to its scalability.
The weaknesses of SPRINT are:
1. SPRINT requires building class histograms for each new leaf, which are used to
initialize the Cabove histograms when evaluating continuous split points in the next pass.
Due to frequent updating, the Cabove histograms have to be in main memory; as the number
of nodes under consideration for splitting increases, the Cabove histograms can become
very large, and it is possible that the total space required to store them exceeds main
memory.
2. SPRINT requires the use of a hash table proportional in size to the training set. This
may become expensive as the training set size grows.

Parallelizing Classification (performing classification in a shared-nothing, multiprocessor distributed system environment)
There are two primary approaches to parallelizing SLIQ:
1. The class list is replicated in the memory of every processor.
Performing the splits requires updating the class list for each training example. Since
every processor must maintain a consistent copy of the entire class list, every class-list
update must be communicated to and applied by every processor. Thus, the time for this
part of tree growth will increase with the size of the training set, even if the amount of
data at each node remains fixed. Moreover, the size of the training set is limited by the
memory size of a single processor, since each processor has a full copy of the class list.
2. Each processor's memory holds only a portion of the entire class list.
After partitioning the class list in parallel, each of the N processors contains only 1/N
of the class list. Note that the class label corresponding to an attribute value could
reside on a different processor, which causes high communication cost while evaluating
continuous split points. As each attribute list is scanned, we need to look up the
corresponding class label and tree pointer for each attribute value. This implies that
each processor will require communication for (N-1)/N of its data. Also, each processor
will have to service lookup requests from other processors in the middle of scanning its
own attribute lists.
Parallelizing SPRINT:
SPRINT achieves uniform data placement and workload balancing by distributing the
attribute lists evenly over the N processors of a shared-nothing machine. This allows each
processor to work on only 1/N of the total data. The parallel partitioning works as
follows:
Each processor holds a separate contiguous section of a "global" attribute list. For
categorical attributes, since the count matrix built by each processor is based on "local"
information only, these matrices must be exchanged to obtain the "global" counts needed to
calculate the Gini value. For continuous attributes, each processor's Cbelow and Cabove
histograms must additionally be initialized to reflect the fact that there are sections of
the attribute list on other processors. As in the serial version, the statistics are
gathered when the attribute lists for new leaves are created. After collecting statistics,
the information is exchanged between all the processors and stored with each leaf, where
it is later used to initialize that leaf's Cabove and Cbelow class histograms.
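A toy Python sketch of the count-matrix exchange for a categorical attribute. The
"exchange" is simulated within one process; in a real shared-nothing setting each local
matrix would be sent over the network. Function names and data are our own illustrations.

from collections import Counter

def local_count_matrix(local_records):
    # local_records: (attribute value, class label) pairs held by one processor.
    matrix = {}
    for value, cls in local_records:
        matrix.setdefault(value, Counter())[cls] += 1
    return matrix

def merge(matrices):
    # Simulated exchange: sum the per-processor matrices into the global counts
    # needed to compute the Gini value of each candidate categorical split.
    merged = {}
    for m in matrices:
        for value, hist in m.items():
            merged.setdefault(value, Counter()).update(hist)
    return merged

p0 = local_count_matrix([("red", "G"), ("blue", "B")])
p1 = local_count_matrix([("red", "B"), ("red", "G")])
print(merge([p0, p1]))   # {'red': Counter({'G': 2, 'B': 1}), 'blue': Counter({'B': 1})}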
Weakness: Note that a processor can have attribute records belonging to any leaf. Before
building the probe structure, a processor needs to collect rids from all the other
processors, which is quite expensive. Also, computing the set of records associated with a
specific leaf requires the processors holding parts of that dataset to pass the
information around in order to update the histograms; this, too, is very expensive.
3) Our approach
3.1) Distributed computing using multiple agents
• Data distribution (distributed database), partitioning attributes:
Typical data classification is a two-step process. In the first step, a model is built
describing a predetermined set of data classes or concepts. The model is constructed by
analyzing database tuples described by attributes. Each tuple is assumed to belong to a
predefined class, as determined by one of the attributes. In the context of
classification, data tuples are also referred to as samples, examples, or objects. The
data tuples analyzed to build the model collectively form the training dataset. The
individual tuples making up the training set are referred to as training samples and are
randomly selected from the sample population. By repeatedly splitting the data into
smaller and smaller partitions, decision tree induction is prone to the problems of
fragmentation, repetition, and replication. In fragmentation, the number of samples at a
given branch becomes so small as to be statistically insignificant. One solution to this
problem is to allow the grouping of categorical attribute values: a tree node may test
whether the value of an attribute belongs to a given set of values, such as
Ai ∈ {a1, a2, ..., an}. Another alternative is to create binary decision trees, where each
branch holds a Boolean test on an attribute.
A pipelined approach may improve the efficiency of SPRINT: instead of saving extra Cabove
and Cbelow class histograms, processor 0 can process its section first and pass the state
of its histograms to the next processor, then continue to work on the next subset of
attribute records. After processor 1 receives the histogram state from processor 0, it
does the same work that processor 0 has done and passes its state on to the next
processor. This cycle continues until the last processor finishes its job.
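A minimal single-process simulation of this pipeline idea (our own illustrative code, not
an implementation of SPRINT): each "processor" scans its local section of a sorted
attribute list, updates the running Cbelow/Cabove state, and hands the state to the next
processor.

from collections import Counter

def scan_section(section, classes, c_below, c_above):
    # One processor's pass: move each record from Cabove to Cbelow,
    # returning the histogram state to be handed to the next processor.
    for _, rid in section:
        c_below[classes[rid]] += 1
        c_above[classes[rid]] -= 1
    return c_below, c_above

classes  = {0: "G", 1: "B", 2: "G", 3: "B"}
sections = [[(23, 1), (30, 0)],            # section held by processor 0
            [(40, 2), (55, 3)]]            # section held by processor 1
c_below  = Counter()
c_above  = Counter(classes.values())       # global class distribution of the list
for p, section in enumerate(sections):     # pass the state down the pipeline
    c_below, c_above = scan_section(section, classes, c_below, c_above)
    print(f"after processor {p}: below={dict(c_below)} above={dict(c_above)}")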
Introducing the new distributed classification system:
Training data may be distributed across a set of computers for several reasons. For
example, several data sets concerning customers might be owned by different insurance
companies that have competitive reasons for keeping the data private, while an
organization would still be interested in models of the aggregate data. Also, retrieving
information from the data set in parallel will improve the efficiency of constructing the
decision tree.
Our system uses the basic decision-tree algorithm and adopts the pre-sorting, RID list,
and histogram techniques from SPRINT. However, we eliminate SPRINT's processor
communication overhead. In addition, the training data is partitioned vertically among
shared-nothing processors, as in parallel SLIQ; however, the class list, which SLIQ
requires to reside in main memory, is eliminated. Agent techniques are used in our system
because they give each processor the self-motivation to perform more efficiently.
Data structures:
Histograms: as in SPRINT, each node under consideration for splitting has initial
histograms, one histogram for categorical attributes and two for continuous attributes,
since all the information needed to calculate the information gain is in the histograms.
Message queue: used to assign work to each agent. Every time the master agent has a new
assignment, it enqueues a message into the message queue. Every time a slave agent
finishes a task on a remote host, it goes to the message queue, dequeues a new assignment
(if the queue is not empty), and starts the new mission. Note that the dequeue procedure
is mutually exclusive among agents; that is, while one agent is dequeuing, no other agent
can dequeue. Since jobs are assigned to agents dynamically, it does not matter which agent
gets a job first or last, so starvation is not a concern here. A sketch of this queue is
given below, after the data structure descriptions.
Binary (decision) tree: after collecting the information the slave agents send back from
the remote hosts, the local master agent makes a decision, constructs a node in the
decision tree, and makes further assignments if necessary. An encoding technique
[lowerbound, upperbound] is used in constructing the binary tree, since it makes it
possible to access a leaf in log(n) time.
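A small Python sketch of such an assignment queue with mutually exclusive dequeue, using a
lock. The Assignment fields follow the package description later in this section; the
class names and everything else are our own illustrative scaffolding.

import threading, queue
from dataclasses import dataclass

@dataclass
class Assignment:
    host: str            # remote host address
    attribute: str       # name of the attribute to learn
    node_id: tuple       # [lowerbound, upperbound] encoding of the tree node

class AssignmentQueue:
    def __init__(self):
        self._q = queue.Queue()
        self._lock = threading.Lock()          # only one agent may dequeue at a time

    def enqueue(self, assignment):
        self._q.put(assignment)

    def dequeue(self):
        with self._lock:                       # exclusive dequeue among slave agents
            return None if self._q.empty() else self._q.get()

q = AssignmentQueue()
q.enqueue(Assignment("host0", "age", (0, 7)))
print(q.dequeue())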
Data placement and workload balancing:
Unlike SPRINT, we partition the data vertically. Each shared-nothing processor works
independently and in parallel. For example, given a training dataset with N attribute
lists in an M-processor distributed environment, the N attributes are divided evenly among
the M processors. The scheme for each attribute list stored on each processor is:
(attribute list + rid + class list).
Even though one processor may physically hold more than one attribute list, logically we
can treat a single processor as executing on only one attribute list. Since we adopt the
histogram technique from SPRINT, the dataset does not have to reside in main memory; we
only keep the necessary histograms in main memory.
Finding split points:
Finding split points in parallel in our system is very simple. After a slave agent is
created, it goes to the assignment queue to find a new assignment, in which a subset of
the training set and its location are specified. The agent then travels to that location
(say, processor 0), calculates the information gain based on the histogram, and brings
back the best information gain for that dataset. The master agent then collects the
information from each agent regarding a leaf and makes the final decision on the testing
attribute and splitting point.
Performing the splits:
Having determined the winning testing attribute along with its split points, the master
agent inserts a new node into the decision tree and enqueues new assignments into the
assignment queue. Performing the split is a logical split rather than a physical one. A
new initial histogram for each new subset, if there is one, is saved for each new child
leaf for future use.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: The training samples, samples, and the set of candidate attributes, attribute-list
(distributed over remote hosts).
Output: A decision tree.
(1) create a node M (labeled with the number of attributes under consideration, T);
(2) create a manager agent and N slave agents;
(3) enqueue T assignments (one for each attribute under consideration) into the
    Assignment Queue;
(4) for each agent at the local host (simultaneously):
    4.1 dequeue an assignment package from the Assignment Queue;
    4.2 based on the information stored in the assignment package, dynamically travel to
        the remote host to learn the specified attribute;
    4.3 send an information package containing the information gain and the possible
        splitting subsets back to the manager agent at the local host;
    4.4 travel back to the local host to find another assignment in the Assignment Queue;
(5) while the manager agent collects information from the slave agents: once all T
    information gains associated with node M have been collected, the manager makes a
    decision:
    if the samples are all of the same class C then
        return M as a leaf node labeled with the class C;
    if attribute-list is empty then
        return M as a leaf node labeled with the most common class in samples;
        // majority voting
    5.1 select the test attribute, the attribute among attribute-list with the highest
        information gain, and label node M with it;
    5.2 partition the test attribute into two subsets B1 and B2 (based on the information
        package from a slave agent);
    5.3 grow two branches from node M for the conditions test-attribute = B1 and
        test-attribute = B2, respectively;
    5.4 for each branch node:
        let si be the set of samples in samples for which test-attribute = Bi;
        // a partition
        if si is empty then
            attach a leaf labeled with the most common class in samples;
        else the manager agent enqueues (T-1) assignment packages associated with this
            particular node into the Assignment Queue and recurses back to step (4).
What is stored in an assignment package:
remote host address
name of the attribute
total number of records of the attribute
node id [lowerbound, upperbound]
What is stored in the information package that a slave agent sends back to the local host:
remote host address
name of the attribute
information gain
left subset
right subset
number of records in the left subset
number of records in the right subset
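These two packages can be written down as simple record types. The following Python
dataclasses are a direct, illustrative transcription of the field lists above; the class
and field names are ours.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AssignmentPackage:
    remote_host: str                 # remote host address
    attribute: str                   # name of the attribute
    total_records: int               # total records of the attribute
    node_id: Tuple[int, int]         # [lowerbound, upperbound] encoding of the tree node

@dataclass
class InformationPackage:
    remote_host: str
    attribute: str
    information_gain: float
    left_subset: List[int]           # record ids falling in the left subset
    right_subset: List[int]          # record ids falling in the right subset
    left_count: int
    right_count: int

job = AssignmentPackage("host0", "age", 4, (0, 7))
print(job)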
How a slave agent performs on a remote host:
1. Read the database description in attributeName.dat to learn about the attribute on
this host:
1.1 whether it is continuous or discrete;
1.2 the outcomes of this attribute;
1.3 if it is a discrete attribute, the slave agent also needs to know the possible
discrete values of the attribute.
2. Initialize a histogram based on the current status of the attribute.
3. Calculate the information gain or Gini value while scanning attributeName.mdb:
    if it is a discrete attribute, keep updating the histogram until the last record has
    been processed;
    if it is a continuous attribute, update the histogram while scanning; the n-1
    candidate information gains are continuously updated, and eventually the winning
    information gain is stored in the return information package.
4. Store the subset record ids in the return information package.
5. Once there is enough information in the return information package, the slave agent
sends the package back to the manager agent and travels back to the local host to seek
another assignment.
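Steps 2-5 for a discrete attribute can be sketched as follows. This is a toy, single-file
stand-in of ours: the .dat/.mdb files are replaced by an in-memory list of (value, class)
pairs, and the result is returned as a plain dict shaped like the information package
described above.

import math
from collections import Counter

def entropy(hist):
    n = sum(hist.values())
    return -sum((c / n) * math.log2(c / n) for c in hist.values() if c)

def learn_discrete_attribute(host, attribute, records):
    # records: (attribute value, class label) pairs scanned from the attribute's data file.
    histogram, subsets = {}, {}                       # step 2: one histogram per attribute value
    for rid, (value, cls) in enumerate(records):      # step 3: single scan, keep updating
        histogram.setdefault(value, Counter())[cls] += 1
        subsets.setdefault(value, []).append(rid)     # step 4: remember rids per outcome
    total = len(records)
    class_hist = Counter(cls for _, cls in records)
    split_info = sum((sum(h.values()) / total) * entropy(h) for h in histogram.values())
    return {                                          # step 5: the return information package
        "remote_host": host,
        "attribute": attribute,
        "information_gain": entropy(class_hist) - split_info,
        "subsets": subsets,
    }

data = [("red", "G"), ("red", "G"), ("blue", "B"), ("blue", "G")]
print(learn_discrete_attribute("host0", "colour", data))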

Computation distribution employing mobile agents
3.2) Middleware (XML and metadata)
1. Naming (thesaurus)
   1.1 alias (synonym)
2. XML: data integration and exchange
3. Reusable pattern repository (multi-dimensional pattern specification, similarity and
difference measures, pattern finder and matching schema, decision rules) → (1) to avoid
repeating work that generates overlapping information, (2) to increase efficiency and
accuracy by enriching the pattern base, (3) to support user participation and reference to
domain knowledge (ontology), (4) to obtain sampling data from an available domain library
3.3)
1.
2.
3.
3.4)
1.
2.
3.
4.
3.5)
1.
2.
3.
4.
5.
2. attribute selection
3. attribute schema, constraints
4. information gain computation
5. decision tree building strategy
6. classification rules
7. categorization
8. pruning schema
Distributed data mining algorithms
SLIQ
SPRINT
Pipeline (enhancement of SPRINT)
Efficient data representation and integrated data mining algorithms
CBR: integration of Classification and association,
FP-tree: development of efficient data structure
Cooperate 2 with 1 (encoding)
Dynamic update; efficient, reusable, dynamic update
Meta-data and Ontology
Class hierarchy
Integration between KB and data mining
Using Ontology for efficient mining
Storing the mining results to ontology
Use the ontology for query
References:
R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery
on the World Wide Web. 1997.