
2k16 MC 023 BDA LAB FILE

Big Data Practical File
CO-414
Submitted by: Ashish Gupta (2K16/MC/023)
Submitted to: Ms. Shweta Meena
INDEX
S No.  Experiment                                                                  Date    Sign
1.     Explain the architecture of Hadoop and perform setting up and installing
       Hadoop in its 2 operating system modes: 1. Pseudo-distributed,
       2. Fully distributed.
2.     Implement the following file management tasks in the Hadoop file system
       (HDFS): 1. Adding files and directories, 2. Retrieving files,
       3. Deleting files.
3.     Run a basic Word Count MapReduce program to understand the MapReduce
       paradigm: count words in a given file, view the output file, and
       calculate the execution time.
4.     Implement Bloom filters for filtering on stream data in C++/Java.
5.     Study and implement 5 basic functions and 5 basic commands in
       R programming.
6.     Implement a clustering program using R programming.
7.     Find the tf-idf matrix for recommendation systems using R.
8.     Find similar documents with cosine similarity in R.
AIM: Perform setting up and installing Hadoop in its two operating modes: pseudo-distributed and fully distributed.
About HADOOP
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is designed
to scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly-available service
on top of a cluster of computers, each of which may be prone to failures.
Setting up HADOOP
Pre-requisites:
1. Java
2. SSH
Before any other steps, we need to set the Java environment variable. This can be done on Windows from the system variables window, or on Linux by adding the following line to the environment variables file (e.g. ~/.bashrc):
export JAVA_HOME=/usr/java/latest
Download and extract the HADOOP binaries.
wget http://apache.claz.org/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
tar xzf hadoop-3.1.2.tar.gz
mv hadoop-3.1.2/* hadoop/
Pseudo-distributed mode
1. Add the following variables to the system variable file
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
2. Configure HADOOP files
a. Change to the Hadoop configuration directory: $HADOOP_HOME/etc/hadoop
b. Add the following to the hadoop-env.sh file
export JAVA_HOME=/usr/local/jdk1.7.0_71
c. Edit the following config files
core-site.xml
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
hdfs-site.xml
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/<user_name>/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/<user_name>/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
yarn-site.xml
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
mapred-site.xml
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
d. Verifying the installation
i. Formatting the NameNode
ii. Verifying the HDFS file system
iii. Starting YARN
iv. Accessing the Hadoop web interface in the browser and verifying everything (a command sketch follows below).
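The verification steps above are usually carried out with the following commands; this is a minimal sketch, assuming the environment variables set earlier and the default Hadoop 3.x web-UI ports (the exact commands and ports are not spelled out in the original):

hdfs namenode -format        # i. format the NameNode (only before the first start)
start-dfs.sh                 # ii. start HDFS; check the daemons with: jps
start-yarn.sh                # iii. start YARN
# iv. verify in the browser:
#     NameNode UI:         http://localhost:9870
#     ResourceManager UI:  http://localhost:8088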
Fully distributed mode
1 Configure the system and create host files on each node
  a. On each node, edit the /etc/hosts file and add the IP addresses of the servers, e.g.

     192.0.2.1    node-master
     192.0.2.2    node1
     192.0.2.3    node2

2 Distribute the authentication key-pairs to the users
  a. Log in to the node-master and generate SSH keys.
  b. Copy the keys to the other nodes.
3 Download and extract the Hadoop binaries.
4 Set the environment variables (same as pseudo-distributed).
5 Edit the core-site.xml file to set the NameNode location

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://node-master:9000</value>
   </property>
</configuration>

6 Set the HDFS paths in hdfs-site.xml

<configuration>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/hadoop/data/nameNode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/hadoop/data/dataNode</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>

7 Set the Job scheduler (same as pseudo-distributed).
8 Configure YARN in yarn-site.xml

<configuration>
   <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
   </property>
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node-master</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
9 Duplicate the config files to each node.
10 Format the HDFS (same as pseudo-distributed).
11 Start the HDFS (same as pseudo-distributed).
12 Run YARN (same as pseudo-distributed).
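Steps 2 and 9-12 above are command-line operations. A minimal sketch of the usual commands, run from node-master, is given below; the hadoop user name and configuration paths are assumptions for illustration:

ssh-keygen -t rsa -b 4096                       # step 2a: generate a key pair on node-master
ssh-copy-id hadoop@node1                        # step 2b: copy the public key to every worker
ssh-copy-id hadoop@node2
scp $HADOOP_HOME/etc/hadoop/*.xml node1:$HADOOP_HOME/etc/hadoop/   # step 9: duplicate configs
scp $HADOOP_HOME/etc/hadoop/*.xml node2:$HADOOP_HOME/etc/hadoop/
hdfs namenode -format                           # step 10: format HDFS (on node-master only)
start-dfs.sh                                    # step 11: start the NameNode and DataNodes
start-yarn.sh                                   # step 12: start the ResourceManager and NodeManagers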
Findings and Learnings:
1. We have installed HADOOP in both pseudo-distributed and fully-distributed modes.
AIM: Implement file management operations on HDFS: 1) Adding files and directories, 2) Retrieving files, 3) Deleting files.
About HDFS
Apache HDFS or Hadoop Distributed File System is a block-structured file system where each
file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of
one or several machines. The Apache Hadoop HDFS architecture follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the file system namespace and controls access to files by clients. The HDFS architecture is built in such a way that the user data never resides on the NameNode; the data resides on the DataNodes only.
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, a non-expensive system which is not of high quality or high availability. The DataNode is a block server that stores the data in the local file system (ext3 or ext4).
File Management in HDFS
1. Creating Directories
## Usage:
hadoop fs -mkdir <paths>
## Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
2. Adding files to HDFS
## Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
## Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
3. Retrieving files
## Usage:
hadoop fs -get <hdfs_src> <localdst>
## Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
4. Removing files
## Usage:
hadoop fs -rm <arg>
## Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
5. Removing directories
## Usage:
hadoop fs -rmr <arg>
## Example:
hadoop fs -rmr /user/saurzcode/
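Two further commands, not part of the aim but useful for verifying the tasks above (added here for illustration), are -ls and -cat:

## List the contents of a directory:
hadoop fs -ls /user/saurzcode/dir3/
## Print a file stored in HDFS to the terminal:
hadoop fs -cat /user/saurzcode/dir3/Samplefile.txt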
Findings and Learnings:
1. We studied file management and handling on the Hadoop DFS.
AIM: Run a basic Word Count MapReduce program to understand the MapReduce paradigm: count words in a given file, view the output file, and calculate the execution time.
About MapReduce
MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
Below are the steps for MapReduce data flow:

• Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify his own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.

• Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. The output of a mapper is not stored on HDFS, as it is temporary data and writing it to HDFS would create unnecessarily many copies.

• Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node on which the reduce phase will run, hence called the reducer node). The shuffling/copying is a physical movement of data over the network.

• Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.

• Step 5: Reduce is the second phase of processing, where the user can specify his own custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.
Code
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class wordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts received for each word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Output: (screenshots of the input text data and the word-count output file omitted)
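The aim also asks for the execution time, which the listing above does not show. A minimal sketch of how the job is typically compiled, run and timed is given below; the file and directory names are illustrative assumptions:

mkdir wordcount_classes
javac -classpath $(hadoop classpath) -d wordcount_classes wordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop fs -put input.txt /user/hadoop/input/                       # stage the input text data
time hadoop jar wordcount.jar wordCount /user/hadoop/input /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-r-00000                    # view the output file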
Findings and Learnings:
1. We learned how map-reduce works.
2. We learned how to code using Hadoop in Java.
3. We learned how input and output in Hadoop work.
AIM: Write a program for bloom-filtering in Hadoop.
Introduction & Theory
Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard
Bloom in 1970, that is used to test whether an element is a member of a set. False positive
matches are possible, but false negatives are not – in other words, a query returns either
"possibly in set" or "definitely not in set". Elements can be added to the set, but not removed
(though this can be addressed with a "counting" filter); the more elements that are added to the
set, the larger the probability of false positives
Unlike a standard hash table using open addressing for collision resolution, a Bloom filter of a
fixed size can represent a set with an arbitrarily large number of elements; adding an element
never fails due to the data structure "filling up". However, the false positive rate increases
steadily as elements are added until all bits in the filter are set to 1, at which point all queries
yield a positive result. With open addressing hashing, false positives are never produced, but
performance steadily deteriorates until it approaches linear search.
Union and intersection of Bloom filters with the same size and set of hash functions can be
implemented with bitwise OR and AND operations respectively. The union operation on
Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom
filter created from scratch using the union of the two sets. The intersect operation satisfies a
weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets. (Interesting properties; source: Wikipedia.)
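The constructors in the code below size the filter from these quantities. For reference, the standard relations between the number of bits $m$, the expected number of elements $n$, the number of hash functions $k$, and the false positive probability $p$ (general Bloom filter facts, not stated explicitly in the original) are:

$$m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n}\ln 2, \qquad p \approx \left(1 - e^{-kn/m}\right)^{k}$$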
The MapReduce data flow used here is the same five-step flow (map, local write, shuffle, merge/sort, reduce) described in the Word Count experiment above.
Code
Bloom Filter Class
package utils;

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class FilterBloom<E> implements Serializable {
    private BitSet bitset;
    private int bitSetSize;
    private double bitsPerElement;
    private int expectedNumberOfFilterElements; // expected (maximum) number of elements to be added
    private int numberOfAddedElements;          // number of elements actually added to the Bloom filter
    private int k;                              // number of hash functions

    static final Charset charset = Charset.forName("UTF-8"); // encoding used for storing hash values as strings

    static final String hashName = "MD5"; // MD5 gives good enough accuracy in most circumstances. Change to SHA1 if needed.
    static final MessageDigest digestFunction;
    static { // The digest method is reused between instances
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

    /**
     * Constructs an empty Bloom filter. The total length of the Bloom filter will be c*n.
     *
     * @param c is the number of bits used per element.
     * @param n is the expected number of elements the filter will contain.
     * @param k is the number of hash functions used.
     */
    public FilterBloom(double c, int n, int k) {
        this.expectedNumberOfFilterElements = n;
        this.k = k;
        this.bitsPerElement = c;
        this.bitSetSize = (int) Math.ceil(c * n);
        numberOfAddedElements = 0;
        this.bitset = new BitSet(bitSetSize);
    }

    /**
     * Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated
     * from the total size of the Bloom filter and the number of expected elements.
     *
     * @param bitSetSize defines how many bits should be used in total for the filter.
     * @param expectedNumberOElements defines the maximum number of elements the filter is expected to contain.
     */
    public FilterBloom(int bitSetSize, int expectedNumberOElements) {
        this(bitSetSize / (double) expectedNumberOElements,
             expectedNumberOElements,
             (int) Math.round((bitSetSize / (double) expectedNumberOElements) * Math.log(2.0)));
    }

    /**
     * Constructs an empty Bloom filter with a given false positive probability. The number of
     * bits per element and the number of hash functions is estimated to match the false
     * positive probability.
     *
     * @param falsePositiveProbability is the desired false positive probability.
     * @param expectedNumberOfElements is the expected number of elements in the Bloom filter.
     */
    public FilterBloom(double falsePositiveProbability, int expectedNumberOfElements) {
        this(Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) / Math.log(2), // c = k / ln(2)
             expectedNumberOfElements,
             (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-log_2(false prob.))
    }

    /**
     * Construct a new Bloom filter based on existing Bloom filter data.
     *
     * @param bitSetSize defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements defines the maximum number of elements the filter is expected to contain.
     * @param actualNumberOfFilterElements specifies how many elements have been inserted into the <code>filterData</code> BitSet.
     * @param filterData a BitSet representing an existing Bloom filter.
     */
    public FilterBloom(int bitSetSize, int expectedNumberOfFilterElements,
                       int actualNumberOfFilterElements, BitSet filterData) {
        this(bitSetSize, expectedNumberOfFilterElements);
        this.bitset = filterData;
        this.numberOfAddedElements = actualNumberOfFilterElements;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data.
     * @param charset specifies the encoding of the input data.
     * @return digest as long.
     */
    public static int createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data. The encoding is expected to be UTF-8.
     * @return digest as long.
     */
    public static int createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data specifies input data.
     * @return digest as long.
     */
    public static int createHash(byte[] data) {
        return createHashes(data, 1)[0];
    }

    /**
     * Generates digests based on the contents of an array of bytes and splits the result into
     * 4-byte int's and stores them in an array. The digest function is called until the required
     * number of int's are produced. For each call to digest, a salt is prepended to the data.
     * The salt is increased by 1 for each call.
     *
     * @param data specifies input data.
     * @param hashes number of hashes/int's to produce.
     * @return array of int-sized hashes
     */
    public static int[] createHashes(byte[] data, int hashes) {
        int[] result = new int[hashes];
        int k = 0;
        byte salt = 0;
        while (k < hashes) {
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(salt);
                salt++;
                digest = digestFunction.digest(data);
            }
            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
                int h = 0;
                for (int j = (i * 4); j < (i * 4) + 4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final FilterBloom<E> other = (FilterBloom<E>) obj;
        if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on the number of expected
     * filter elements and the size of the Bloom filter. The value returned is the expected rate
     * of false positives, assuming the number of inserted elements equals the number of expected
     * elements. If the number of elements in the Bloom filter is less than the expected value,
     * the true probability of false positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculate the probability of a false positive given the specified number of inserted elements.
     *
     * @param numberOfElements number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements / (double) bitSetSize)), k);
    }

    /**
     * Get the current probability of a false positive. The probability is calculated from the
     * size of the Bloom filter and the current number of elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }

    /**
     * Returns the value chosen for K. K is the optimal number of hash functions based on the
     * size of the Bloom filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        numberOfAddedElements = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's toString() method is
     * used as input to the hash functions.
     *
     * @param element is an element to register in the Bloom filter.
     */
    public void add(E element) {
        add(element.toString().getBytes(charset));
    }

    /**
     * Adds an array of bytes to the Bloom filter.
     *
     * @param bytes array of bytes to add to the Bloom filter.
     */
    public void add(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes)
            bitset.set(Math.abs(hash % bitSetSize), true);
        numberOfAddedElements++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * @param c Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this being correct.
     *
     * @param element element to check.
     * @return true if the element could have been inserted into the Bloom filter.
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(charset));
    }

    /**
     * Returns true if the array of bytes could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this being correct.
     *
     * @param bytes array of bytes to check.
     * @return true if the array could have been inserted into the Bloom filter.
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes) {
            if (!bitset.get(Math.abs(hash % bitSetSize))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted into the Bloom
     * filter. Use getFalsePositiveProbability() to calculate the probability of this being correct.
     * @param c elements to check.
     * @return true if all the elements in c could have been inserted into the Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * @param bit the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * @param bit is the bit to set.
     * @param value If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve the number of
     * inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it was constructed or
     * after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter. This value is
     * the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

    /**
     * Get expected number of bits per element when the Bloom filter is full. This value is set
     * by the constructor when the Bloom filter is created. See also getBitsPerElement().
     *
     * @return expected number of bits per element.
     */
    public double getExpectedBitsPerElement() {
        return this.bitsPerElement;
    }

    /**
     * Get actual number of bits per element based on the number of elements that have currently
     * been inserted and the length of the Bloom filter. See also getExpectedBitsPerElement().
     *
     * @return number of bits per element.
     */
    public double getBitsPerElement() {
        return this.bitSetSize / (double) numberOfAddedElements;
    }
}
Mapper
package bfcode;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import utils.*;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    FilterBloom<String> filter;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Build the Bloom filter once per mapper and pre-load the phrases to keep
        double falsePositiveProbability = 0.1;
        int expectedNumberOfElements = 100;
        filter = new FilterBloom<String>(falsePositiveProbability, expectedNumberOfElements);
        filter.add("bad service");
        filter.add("iron man");
        filter.add("marvel");
        filter.add("end game");
    }

    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        // Emit the whole record if any comma-separated token possibly matches the filter
        String[] tokens = value.toString().split(",");
        for (String token : tokens) {
            System.out.println(token);
            if (filter.contains(token)) {
                context.write(value, NullWritable.get());
            }
        }
    }
}
Main
package bfcode;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class BloomFilter {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (args.length != 2) {
            System.err.println("Usage: FilterJob <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(bfcode.FilterMapper.class);
        job.setJobName("Customer Complaint Filter");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(bfcode.FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.waitForCompletion(true);
    }
}
Output
Findings and Learnings:
1. We learned how map-reduce works.
2. We learned how to code using Hadoop in Java.
3. We have learned the working of Bloom filters.
4. We have successfully implemented Bloom filters.
AIM: Study and implement five basic functions in R
Language
Introduction and Theory
R is a programming language and software environment for statistical analysis,
graphics representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand, and is currently developed by
the R Development Core Team. The core of R is an interpreted computer language
which allows branching and looping as well as modular programming using functions.
R allows integration with the procedures written in the C, C++, .Net, Python or
FORTRAN languages for efficiency.
The following are the important features of R:
• R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly at the computer or for printing on paper.
Basic functions in R:
1) R Input and Output
For input, use scan() or readline() for inputting a line from the keyboard in the form of a string, and for output use print().
2) R Date and Time
The function Sys.time() returns the current date and time. The 'Date' class in base R stores dates as the number of days since the start of 1970.
3) R Packages
A package is a set of R functions and data-sets, and the library is a folder on your system/computer which stores the files for those package(s). To load a package, library(package_name) is used. To view all packages, library() is used. To install a package, install.packages(package_name) is used.
4) seq()
seq(x,y) returns a sequence of numbers from x to y.
5) mean()
mean(x:y) calculates the mean of all the numbers ranging from x to y.
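A minimal R session exercising the five functions above might look as follows (the values in the comments are illustrative, not taken from the original output):

name <- readline("Enter your name: ")    # read a line from the keyboard as a string
print(paste("Hello,", name))             # print output to the console
Sys.time()                               # current date and time
library(stats)                           # load a package; library() with no argument lists packages
# install.packages("ggplot2")            # install a package (needs internet access)
seq(1, 10)                               # the sequence 1 2 3 ... 10
mean(1:10)                               # mean of the numbers 1 to 10, i.e. 5.5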
Output:
Findings and Learnings:
Successfully implemented five basic functions in R language.
1) R is a programming language and environment commonly used in statistical
computing, data analytics and scientific research.
2) It is one of the most popular languages used by statisticians, data analysts, researchers
and marketers to retrieve, clean, analyse, visualize and present data.
3) Due to its expressive syntax and easy-to-use interface, it has grown in popularity in
recent years.
AIM: Find TF-IDF matrix for recommendation systems in
R.
Introduction and Theory
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates
how relevant a word is to a document in a collection of documents. This is done by
multiplying two metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents.
TF-IDF for a word in a document is calculated by multiplying two different metrics:
1) Term Frequency (TF): Term frequency is the number of times a word appears in a document. The frequency can be adjusted by the length of a document, i.e. normalized term frequency: the number of times a word appears in a document divided by the total number of words in that document.
2) Inverse Document Frequency (IDF): The inverse document frequency of the word
across a set of documents means how common or rare a word is in the entire
document set. It is calculated by taking the total number of documents, dividing it by
the number of documents that contain a word, and calculating the logarithm. So, if the
word is very common and appears in many documents, this number will approach 0.
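As a small worked example (numbers chosen for illustration, not from the experiment's data): if a word occurs 3 times in a 100-word document, its normalized term frequency is tf = 3/100 = 0.03; if it appears in 10 out of 1000 documents, idf = ln(1000/10) ≈ 4.61, so its tf-idf weight is approximately 0.03 × 4.61 ≈ 0.14.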
Code:
# doc1 ... doc5 are assumed to be character strings defined beforehand
docs <- c(doc1, doc2, doc3, doc4, doc5)
class(docs)

library(tm)
doc_corpus <- VCorpus(VectorSource(docs))
doc_corpus
doc_corpus <- tm_map(doc_corpus, tolower)
doc_corpus <- tm_map(doc_corpus, removeWords, stopwords("english"))
doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus <- tm_map(doc_corpus, PlainTextDocument)

# install.packages("SnowballC")
library(SnowballC)
doc_corpus <- tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content

doc_dtm <- TermDocumentMatrix(doc_corpus)
doc_dtm
doc_dtm$dimnames$Docs <- as.character(1:5)
inspect(doc_dtm)
findFreqTerms(doc_dtm, lowfreq = 2)

library(rvest)
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)

jobCorpus <- VCorpus(VectorSource(job[, 10]))
jobCorpus <- tm_map(jobCorpus, tolower)
for (j in seq(jobCorpus)) {
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}
jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, PlainTextDocument)
jobCorpus <- tm_map(jobCorpus, stemDocument)

dtm <- DocumentTermMatrix(jobCorpus)
dtm
dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting = weightTfIdf))
dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10])
Output:
Findings and Learnings:
1) TF-IDF weight is a statistical measure used to evaluate how important a word is to a
document in a collection of documents.
2) The importance increases proportionally to the number of times a word appears in the
document but inversely proportional to the frequency of the word in the set.
3) Variations of the TF-IDF weighting scheme are often used by search engines in
scoring and ranking a document’s relevance given a query.
AIM: Write a program to implement clustering in R
programming.
Introduction and Theory
Clustering
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the
same group than those in other groups. In simple words, the aim is to segregate groups
with similar traits and assign them into clusters.
Broadly speaking, clustering can be divided into two subgroups:
• Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, each customer is put into exactly one out of, say, 10 groups.
• Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point belonging to those clusters is assigned. For example, each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.

There are four ways we can do clustering, of which centroid-based methods are most commonly used.
• Connectivity models: As the name suggests, these models are based on the notion that data points closer in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
• Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The K-means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
• Distribution models: These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (for example, Normal or Gaussian). These models often suffer from overfitting. A popular example of these models is the expectation-maximization algorithm, which uses multivariate normal distributions.
• Density models: These models search the data space for areas of varied density of data points. They isolate regions of different density and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
Applications of Clustering
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
k-means clustering
It is a method of vector quantization, originally from signal processing, that is popular
for cluster analysis in data mining. k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean,
serving as a prototype of the cluster. This results in a partitioning of the data space into
Voronoi cells.
The problem is computationally difficult (NP-hard); however, efficient heuristic
algorithms converge quickly to a local optimum. These are usually similar to the
expectation-maximization algorithm for mixtures of Gaussian distributions via an
iterative refinement approach employed by both k-means and Gaussian mixture
modeling. They both use cluster centers to model the data; however, k-means clustering
tends to find clusters of comparable spatial extent, while the expectation-maximization
mechanism allows clusters to have different shapes.
The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular
machine learning technique for classification that is often confused with k-means due
to the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained
by k-means classifies new data into the existing clusters. This is known as nearest
centroid classifier or Rocchio algorithm.
To process the learning data, the K-means algorithm in data mining starts with a first
group of randomly selected centroids, which are used as the beginning points for every
cluster, and then performs iterative (repetitive) calculations to optimize the positions of
the centroids.
It halts creating and optimizing clusters when either:
 The centroids have stabilized — there is no change in their values because the
clustering has been successful.
 The defined number of iterations has been achieved.
Initialize k means with random values

For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
        Update mean

The k-means pseudo-code
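For comparison with the hand-written implementation in Clustering.R below, the same clustering can also be obtained with R's built-in kmeans() from the stats package; a short sketch, assuming the data frame DF constructed in the code below:

fit <- kmeans(DF[, 1:2], centers = 5, nstart = 25)   # 5 clusters, 25 random restarts
table(fit$cluster, DF$cluster)                       # compare fitted clusters with the true labels
fit$centers                                          # the estimated centroids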
Code
Clustering.R
require(ggplot2)
require(MASS)

set.seed(42)
# Five Gaussian clusters of 150 points each
set1 = MASS::mvrnorm(n = 150, c(-10, 10), matrix(c(1.5, 1, 1, 1.5), 2))
set2 = MASS::mvrnorm(n = 150, c(-5, 10), matrix(c(1, 2, 2, 6), 2))
set3 = MASS::mvrnorm(n = 150, c(-1, 1), matrix(c(4, 0, 0, 4), 2))
set4 = MASS::mvrnorm(n = 150, c(10, -10), matrix(c(4, 0, 0, 4), 2))
set5 = MASS::mvrnorm(n = 150, c(3, -3), matrix(c(4, 0, 0, 4), 2))
DF = data.frame(rbind(set1, set2, set3, set4, set5),
                cluster = as.factor(c(rep(1:5, each = 150))))
ggplot(DF, aes(x = X1, y = X2, color = cluster)) + geom_point()

kmeans = function(data, K = 4, stop_crit = 10e-3)
{
  # Initialisation of clusters
  centroids = data[sample.int(nrow(data), K), ]
  current_stop_crit = 1000
  cluster = rep(0, nrow(data))
  converged = F
  it = 1
  while (current_stop_crit >= stop_crit & converged == F)
  {
    it = it + 1
    if (current_stop_crit <= stop_crit)
    {
      converged = T
    }
    old_centroids = centroids
    ## Assigning each point to a centroid
    for (i in 1:nrow(data))
    {
      min_dist = 10e10
      for (centroid in 1:nrow(centroids))
      {
        distance_to_centroid = sum((centroids[centroid, ] - data[i, ])^2)
        if (distance_to_centroid <= min_dist)
        {
          cluster[i] = centroid
          min_dist = distance_to_centroid
        }
      }
    }
    ## Updating each centroid as the mean of its assigned points
    for (i in 1:nrow(centroids))
    {
      centroids[i, ] = apply(data[cluster == i, ], 2, mean)
    }
    current_stop_crit = mean((old_centroids - centroids)^2)
  }
  return(list(data = data.frame(data, cluster), centroids = centroids))
}

res <- kmeans(DF[1:2], K = 5)
res$centroids$cluster = 1:5
res$data$isCentroid = F
res$centroids$isCentroid = T
data_plot = rbind(res$centroids, res$data)
ggplot(data_plot, aes(x = X1, y = X2, color = as.factor(cluster),
                      size = isCentroid, alpha = isCentroid)) + geom_point()
Results & Outputs: (plots omitted: the sample data coloured by true cluster, and the final cluster assignment with centroids)
Findings and Learnings:
1. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
2. R provides easy-to-use tools for performing cluster analysis.
3. We have successfully implemented clustering in R.
AIM: Write a program to find similar documents with
cosine similarity in R.
Introduction and Theory
Cosine Similarity
Cosine similarity is a metric used to measure how similar the documents are irrespective
of their size. Mathematically, it measures the cosine of the angle between two vectors
projected in a multi-dimensional space. The cosine similarity is advantageous because
even if the two similar documents are far apart by the Euclidean distance (due to the
size of the document), chances are they may still be oriented closer together. The
smaller the angle, the higher the cosine similarity.
The cosine for two non-zero vectors can be calculated from the Euclidean dot product:

$$A \cdot B = \|A\| \, \|B\| \cos(\theta)$$

The cosine similarity is then calculated as:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}$$
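As a quick numerical illustration of the formula in R (vectors chosen arbitrarily, not taken from the tweet data used below):

a <- c(1, 2, 0, 3)
b <- c(2, 1, 1, 2)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # A.B / (||A|| ||B||), approximately 0.845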
Code
Cosine.R
library(twitteR)
library(tidytext)
library(dplyr)
library(tm)
library(SnowballC)

workingDir <- "~/Documents/R programs"
setwd(workingDir)

url <- "http://www.rdatamining.com/data/rdmTweets-201306.RData"
download.file(url, destfile = "rdmTweets-201306.RData")
load(file = "rdmTweets-201306.RData")
tweets <- twListToDF(tweets)

tweets <- tweets %>%
  mutate(text = gsub("(http|https).+$|\\n|&amp|[[:punct:]]", "", text),
         rowIndex = as.numeric(row.names(.))) %>%
  select(text, retweetCount, rowIndex)

docList <- as.list(tweets$text)
N.docs <- length(docList)

QrySearch <- function(queryTerm) {

  # Record starting time to measure your search engine performance
  start.time <- Sys.time()

  # Store docs in Corpus class which is a fundamental data structure in text mining
  my.docs <- VectorSource(c(docList, queryTerm))

  # Transform/standardize docs to get ready for analysis
  my.corpus <- VCorpus(my.docs) %>%
    tm_map(stemDocument) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stripWhitespace)

  # Store docs into a term document matrix where rows=terms and cols=docs
  # Normalize term counts by applying TF-IDF weightings
  term.doc.matrix.stm <- TermDocumentMatrix(my.corpus,
                                            control = list(
                                              weighting = function(x) weightSMART(x, spec = "ltc"),
                                              wordLengths = c(1, Inf)))

  # Transform term document matrix into a dataframe
  term.doc.matrix <- tidy(term.doc.matrix.stm) %>%
    group_by(document) %>%
    mutate(vtrLen = sqrt(sum(count^2))) %>%
    mutate(count = count / vtrLen) %>%
    ungroup() %>%
    select(term:count)

  docMatrix <- term.doc.matrix %>%
    mutate(document = as.numeric(document)) %>%
    filter(document < N.docs + 1)

  qryMatrix <- term.doc.matrix %>%
    mutate(document = as.numeric(document)) %>%
    filter(document >= N.docs + 1)

  # Calculate top ten results by cosine similarity
  searchRes <- docMatrix %>%
    inner_join(qryMatrix, by = c("term" = "term"),
               suffix = c(".doc", ".query")) %>%
    mutate(termScore = round(count.doc * count.query, 4)) %>%
    group_by(document.query, document.doc) %>%
    summarise(Score = sum(termScore)) %>%
    filter(row_number(desc(Score)) <= 10) %>%
    arrange(desc(Score)) %>%
    left_join(tweets, by = c("document.doc" = "rowIndex")) %>%
    ungroup() %>%
    rename(Result = text) %>%
    select(Result, Score, retweetCount) %>%
    data.frame()

  # Record when it stops and take the difference
  end.time <- Sys.time()
  time.taken <- round(end.time - start.time, 4)
  print(paste("Used", time.taken, "seconds"))

  return(searchRes)
}

QrySearch("data science")
Results & Outputs: (screenshot of the top ten matching tweets for the query "data science" omitted)
Findings and Learnings:
1. Cosine similarity is one of the most commonly used similarity metrics in text processing.
2. R provides easy to use tools for performing text analysis.
3. We have successfully implemented document retrieval using cosine similarity.
AIM: Write a program to find TF-IDF for any dataset and
also plot resultant term frequency matrix.
Introduction and Theory
TF-IDF
A central question in text mining and natural language processing is how to quantify
what a document is about. Can we do this by looking at the words that make up the
document? One measure of how important a word may be is its term frequency (tf),
how frequently a word occurs in a document. There are words in a document, however,
that occur many times but may not be important; in English, these are probably words
like "the", "is", "of", and so forth. We might take the approach of adding words like
these to a list of stop words and removing them before analysis, but it is possible that
some of these words might be more important in some documents than others. A list of
stop words is not a sophisticated approach to adjusting term frequency for commonly
used words.
Another approach is to look at a term’s inverse document frequency (idf), which
decreases the weight for commonly used words and increases the weight for words that
are not used very much in a collection of documents. This can be combined with term
frequency to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it
is used. It is intended to measure how important a word is to a document in a collection
(or corpus) of documents. It is a rule-of-thumb or heuristic quantity; while it has proved
useful in text mining, search engines, etc., its theoretical foundations are considered
less than firm by information theory experts. The inverse document frequency for any
given term is defined as

$$idf(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$$

Then finally the resulting TF-IDF matrix is calculated as:

$$tfidf(\text{term}, \text{document}, \text{Dataset}) = tf(\text{term}, \text{document}) \times idf(\text{term}, \text{Dataset})$$

The resulting matrix is not normalized; this is done using the L2 normalization:

$$\hat{v} = \frac{v}{\|v\|}$$
Code
TF-IDF.R
library(tm)
library(proxy)
library(dplyr)
library(ggplot2)
library(corrplot)

print("The dataset: each sentence is one document")
doc <- c("The sky is blue.",
         "The sun is bright today.",
         "The sun in the sky is bright.",
         "We can see the shining sun, the bright sun.",
         "The moon is full, the sky full of stars.",
         "The sky was dark, the stars plentiful and bright.",
         "The sun is but a morning star.")

corpus <- Corpus(VectorSource(doc))
controlList <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)

print("computing the term-frequency matrix: ")
(tf <- as.matrix(TermDocumentMatrix(corpus, control = controlList)))
corrplot(tf, method = "number", is.corr = FALSE, cl.pos = "n")

print("computing the idf, and then converting it into a diagonal matrix (used later)")
(idf <- log(ncol(tf) / (1 + rowSums(tf != 0))))
(idf <- diag(idf))

print("calculating the final tf-idf matrix")
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)
(tf_idf <- tf_idf / sqrt(rowSums(tf_idf^2)))
corrplot(tf_idf, method = "number")
Results & Outputs: (corrplot visualizations of the term-frequency matrix and the normalized tf-idf matrix omitted)
Findings and Learnings:
1. TF-IDF is one of the most popular methods in text processing.
2. R provides easy to use tools for performing text analysis.
3. We have successfully implemented TF-IDF in R.