Big Data Practical File
CO-414

Submitted by: Ashish Gupta (2K16/MC/023)
Submitted to: Ms. Shweta Meena

INDEX
(S No. / Experiment / Date / Sign)

1. Explain the architecture of Hadoop and perform setting up and installing Hadoop in its 2 operating system modes: 1. Pseudo-distributed. 2. Fully distributed.
2. Implement the following file management tasks in the Hadoop system (HDFS): 1. Adding files and directories. 2. Retrieving files. 3. Deleting files.
3. Run a basic Word Count MapReduce program to understand the MapReduce paradigm: to count words in a given file, to view the output file, and to calculate the execution time.
4. Implement Bloom filters for filtering on stream data in C++/Java.
5. Study and implement 5 basic functions and 5 basic commands in R programming.
6. Implement a clustering program using R programming.
7. Find the tf-idf matrix for recommendation systems using R.
8. Find similar documents with cosine similarity in R.

AIM: Perform setting up and installing HADOOP in its two operating modes: pseudo-distributed and fully distributed.

About HADOOP
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Setting up HADOOP

Pre-requisites:
1. Java
2. SSH

Before any other steps, we need to set the Java environment variable. This can be done in Windows from the system variables window, or on Linux by adding the following to the environment variables file:

export JAVA_HOME=/usr/java/latest

Download and extract the HADOOP binaries:

wget http://apache.claz.org/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
tar xzf hadoop-3.1.2.tar.gz
mv hadoop-3.1.2/* hadoop/

Pseudo-distributed mode

1. Add the following variables to the system variable file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

2. Configure the HADOOP files

a. Change to the Hadoop configuration directory, $HADOOP_HOME/etc/hadoop.

b. Add the following to the hadoop-env.sh file:

export JAVA_HOME=/usr/local/jdk1.7.0_71

c. Edit the following config files.

core-site.xml

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/<user_name>/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/<user_name>/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

yarn-site.xml

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

d. Verify the installation:
i. Format the NameNode.
ii. Verify the HDFS file system.
iii. Start YARN.
iv. Access the Hadoop browser interface and verify that everything is running.
The commands for these four steps are sketched below.
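A minimal command sketch for the four verification steps, assuming Hadoop 3.x defaults and that $HADOOP_HOME/bin and $HADOOP_HOME/sbin are on the PATH (the web UI ports are the usual 3.x defaults and may differ on other releases):

hdfs namenode -format      # i. format the NameNode (run once)
start-dfs.sh               # ii. start NameNode, SecondaryNameNode and DataNode
jps                        # list the running Java daemons to confirm HDFS is up
start-yarn.sh              # iii. start ResourceManager and NodeManager
# iv. verify through the web interfaces:
#    NameNode UI:        http://localhost:9870
#    ResourceManager UI: http://localhost:8088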
Fully distributed mode

1. Configure the system and create host files on each node.
a. On each node, edit the /etc/hosts file and add the IP addresses and host names of the servers, e.g.:

192.0.2.1 node-master
192.0.2.2 node1
192.0.2.3 node2

2. Distribute the authentication key pairs to the users.
a. Log in to node-master and generate SSH keys.
b. Copy the keys to the other nodes.

3. Download and extract the HADOOP binaries.

4. Set the environment variables (same as pseudo-distributed).

5. Edit the core-site.xml file to set the NameNode location:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://node-master:9000</value>
   </property>
</configuration>

6. Set the HDFS paths in hdfs-site.xml:

<configuration>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/hadoop/data/nameNode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/hadoop/data/dataNode</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>

7. Set the job scheduler (same as pseudo-distributed).

8. Configure YARN in yarn-site.xml:

<configuration>
   <property>
      <name>yarn.acl.enable</name>
      <value>0</value>
   </property>
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node-master</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

9. Duplicate the config files to each node.
10. Format the HDFS (same as pseudo-distributed).
11. Start the HDFS (same as pseudo-distributed).
12. Run YARN (same as pseudo-distributed).
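Steps 2 and 9-12 are usually carried out with standard SSH and Hadoop commands run on node-master; a minimal sketch (the user name "hadoop" and the host names above are assumptions):

ssh-keygen -t rsa -b 4096                                        # 2a. generate an SSH key pair on node-master
ssh-copy-id hadoop@node1                                         # 2b. copy the public key to each node
ssh-copy-id hadoop@node2
scp $HADOOP_HOME/etc/hadoop/* node1:$HADOOP_HOME/etc/hadoop/     # 9. duplicate the config files
scp $HADOOP_HOME/etc/hadoop/* node2:$HADOOP_HOME/etc/hadoop/
hdfs namenode -format                                            # 10. format the HDFS
start-dfs.sh                                                     # 11. start HDFS on all nodes
start-yarn.sh                                                    # 12. run YARN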
Findings and Learnings:
1. We have installed HADOOP in both pseudo-distributed and fully-distributed modes.

AIM: Implement file management operations on HDFS: 1) Adding files and directories, 2) Retrieving files, 3) Deleting files.

About HDFS
Apache HDFS, or Hadoop Distributed File System, is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines. The Apache Hadoop HDFS architecture follows a master/slave design, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.

The NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the file system namespace and controls access to files by clients. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on the DataNodes only.

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system which is not required to be of high quality or high availability. A DataNode is a block server that stores the data in the local file system (ext3 or ext4).

File Management in HDFS

1. Creating directories

## Usage:
hadoop fs -mkdir <paths>
## Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. Adding files to HDFS

## Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
## Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

3. Retrieving files

## Usage:
hadoop fs -get <hdfs_src> <localdst>
## Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. Removing files

## Usage:
hadoop fs -rm <arg>
## Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt

5. Removing directories

## Usage:
hadoop fs -rmr <arg>
## Example:
hadoop fs -rmr /user/saurzcode/

Findings and Learnings:
1. We studied file management and handling on the Hadoop DFS.

AIM: Run a basic Word Count MapReduce program to understand the MapReduce paradigm: count the words in a given file, view the output file, and calculate the execution time.

About MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

Below are the steps of the MapReduce data flow:

Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify his own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.

Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. The output of the mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessarily many copies.

Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node on which the reduce phase will run, hence called the reducer node). The shuffling/copying is a physical movement of data over the network.

Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.

Step 5: Reduce is the second phase of processing, where the user can specify his own custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.
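To count the words, view the output file, and calculate the execution time, the job in the code listing below is typically compiled, packaged, and run along the following lines (a sketch; the jar name, input file, and HDFS paths are assumptions, and a JDK providing tools.jar is assumed for the compile step):

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main wordCount.java                            # compile against the Hadoop classpath
jar cf wc.jar wordCount*.class                                            # package the classes
hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -put input.txt /user/hadoop/input                               # copy the input text to HDFS
time hadoop jar wc.jar wordCount /user/hadoop/input /user/hadoop/output   # run the job; time reports the execution time
hadoop fs -cat /user/hadoop/output/part-r-00000                           # view the output file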
Code

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class wordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Output
(Screenshots of the input text data and of the word-count output.)

Findings and Learnings:
1. We learned how map-reduce works.
2. We learned how to code using hadoop in Java.
3. We learned how input and output in hadoop works.

AIM: Write a program for Bloom filtering in Hadoop.

Introduction & Theory

Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

Unlike a standard hash table using open addressing for collision resolution, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure "filling up". However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result. With open-addressing hashing, false positives are never produced, but performance steadily deteriorates until it approaches linear search.
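As a worked example of these trade-offs, using the sizing rules that appear in the FilterBloom constructors below and the same parameters that the mapper later uses (a target false-positive probability p = 0.1 and n = 100 expected elements): the number of hash functions is k = ceil(-log2 p) = ceil(3.32) = 4, the filter uses about k / ln 2 ≈ 5.77 bits per element, so the bit set holds roughly 578 bits, and the expected false-positive probability (1 - e^(-kn/m))^k comes out at about 0.06, below the 0.1 target.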
Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersection operation satisfies a weaker property: the false-positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false-positive probability in the Bloom filter created from scratch using the intersection of the two sets. (Interesting properties; source: Wikipedia.)

Below are the steps of the MapReduce data flow:

Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify his own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.

Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. The output of the mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessarily many copies.

Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node on which the reduce phase will run, hence called the reducer node). The shuffling/copying is a physical movement of data over the network.

Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.

Step 5: Reduce is the second phase of processing, where the user can specify his own custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.

Code

Bloom Filter Class

package utils;

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class FilterBloom<E> implements Serializable {

    private BitSet bitset;
    private int bitSetSize;
    private double bitsPerElement;
    private int expectedNumberOfFilterElements; // expected (maximum) number of elements to be added
    private int numberOfAddedElements;          // number of elements actually added to the Bloom filter
    private int k;                              // number of hash functions

    static final Charset charset = Charset.forName("UTF-8"); // encoding used for storing hash values as strings
    static final String hashName = "MD5"; // MD5 gives good enough accuracy in most circumstances; change to SHA1 if needed
    static final MessageDigest digestFunction;
    static { // The digest method is reused between instances
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

    /**
     * Constructs an empty Bloom filter. The total length of the Bloom filter will be c*n.
     *
     * @param c is the number of bits used per element.
     * @param n is the expected number of elements the filter will contain.
     * @param k is the number of hash functions used.
     */
    public FilterBloom(double c, int n, int k) {
        this.expectedNumberOfFilterElements = n;
        this.k = k;
        this.bitsPerElement = c;
        this.bitSetSize = (int) Math.ceil(c * n);
        numberOfAddedElements = 0;
        this.bitset = new BitSet(bitSetSize);
    }

    /**
     * Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated
     * from the total size of the Bloom filter and the number of expected elements.
     *
     * @param bitSetSize defines how many bits should be used in total for the filter.
     * @param expectedNumberOElements defines the maximum number of elements the filter is expected to contain.
     */
    public FilterBloom(int bitSetSize, int expectedNumberOElements) {
        this(bitSetSize / (double) expectedNumberOElements,
             expectedNumberOElements,
             (int) Math.round((bitSetSize / (double) expectedNumberOElements) * Math.log(2.0)));
    }

    /**
     * Constructs an empty Bloom filter with a given false positive probability. The number of bits
     * per element and the number of hash functions is estimated to match the false positive probability.
     *
     * @param falsePositiveProbability is the desired false positive probability.
     * @param expectedNumberOfElements is the expected number of elements in the Bloom filter.
     */
    public FilterBloom(double falsePositiveProbability, int expectedNumberOfElements) {
        this(Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) / Math.log(2), // c = k / ln(2)
             expectedNumberOfElements,
             (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))));       // k = ceil(-log_2(false prob.))
    }

    /**
     * Constructs a new Bloom filter based on existing Bloom filter data.
     *
     * @param bitSetSize defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements defines the maximum number of elements the filter is expected to contain.
     * @param actualNumberOfFilterElements specifies how many elements have been inserted into the filterData BitSet.
     * @param filterData a BitSet representing an existing Bloom filter.
     */
    public FilterBloom(int bitSetSize, int expectedNumberOfFilterElements,
                       int actualNumberOfFilterElements, BitSet filterData) {
        this(bitSetSize, expectedNumberOfFilterElements);
        this.bitset = filterData;
        this.numberOfAddedElements = actualNumberOfFilterElements;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data.
     * @param charset specifies the encoding of the input data.
     * @return digest as int.
     */
    public static int createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String. The encoding is expected to be UTF-8.
     *
     * @param val specifies the input data.
     * @return digest as int.
     */
    public static int createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data specifies input data.
     * @return digest as int.
     */
    public static int createHash(byte[] data) {
        return createHashes(data, 1)[0];
    }

    /**
     * Generates digests based on the contents of an array of bytes and splits the result into
     * 4-byte ints, storing them in an array. The digest function is called until the required
     * number of ints has been produced. For each call to digest, a salt is prepended to the data;
     * the salt is increased by 1 for each call.
     *
     * @param data specifies input data.
     * @param hashes number of hashes/ints to produce.
     * @return array of int-sized hashes.
     */
    public static int[] createHashes(byte[] data, int hashes) {
        int[] result = new int[hashes];
        int k = 0;
        byte salt = 0;
        while (k < hashes) {
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(salt);
                salt++;
                digest = digestFunction.digest(data);
            }
            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
                int h = 0;
                for (int j = (i * 4); j < (i * 4) + 4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj is the object to compare to.
     * @return true if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final FilterBloom<E> other = (FilterBloom<E>) obj;
        if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     *
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on the number of expected
     * filter elements and the size of the Bloom filter. The value returned is the expected rate
     * of false positives, assuming the number of inserted elements equals the number of expected
     * elements. If the number of elements in the Bloom filter is less than the expected value,
     * the true probability of false positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculates the probability of a false positive given the specified number of inserted elements.
     *
     * @param numberOfElements number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements / (double) bitSetSize)), k);
    }

    /**
     * Gets the current probability of a false positive. The probability is calculated from the
     * size of the Bloom filter and the current number of elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }

    /**
     * Returns the value chosen for k, the optimal number of hash functions based on the size of
     * the Bloom filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        numberOfAddedElements = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's toString() method is used
     * as input to the hash functions.
     *
     * @param element is an element to register in the Bloom filter.
     */
    public void add(E element) {
        add(element.toString().getBytes(charset));
    }

    /**
     * Adds an array of bytes to the Bloom filter.
     *
     * @param bytes array of bytes to add to the Bloom filter.
     */
    public void add(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes)
            bitset.set(Math.abs(hash % bitSetSize), true);
        numberOfAddedElements++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     *
     * @param c Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this being correct.
     *
     * @param element element to check.
     * @return true if the element could have been inserted into the Bloom filter.
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(charset));
    }

    /**
     * Returns true if the array of bytes could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this being correct.
     *
     * @param bytes array of bytes to check.
     * @return true if the array could have been inserted into the Bloom filter.
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes) {
            if (!bitset.get(Math.abs(hash % bitSetSize))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted into the Bloom
     * filter. Use getFalsePositiveProbability() to calculate the probability of this being correct.
     *
     * @param c elements to check.
     * @return true if all the elements in c could have been inserted into the Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Reads a single bit from the Bloom filter.
     *
     * @param bit the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Sets a single bit in the Bloom filter.
     *
     * @param bit is the bit to set.
     * @param value if true, the bit is set; if false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Returns the bit set used to store the Bloom filter.
     *
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve the number of
     * inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it was constructed or after
     * clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter. This value is the
     * same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

    /**
     * Gets the expected number of bits per element when the Bloom filter is full. This value is
     * set by the constructor when the Bloom filter is created. See also getBitsPerElement().
     *
     * @return expected number of bits per element.
     */
    public double getExpectedBitsPerElement() {
        return this.bitsPerElement;
    }

    /**
     * Gets the actual number of bits per element based on the number of elements that have
     * currently been inserted and the length of the Bloom filter. See also getExpectedBitsPerElement().
     *
     * @return number of bits per element.
     */
    public double getBitsPerElement() {
        return this.bitSetSize / (double) numberOfAddedElements;
    }
}

Mapper

package bfcode;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import utils.*;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    FilterBloom<String> filter;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Build the Bloom filter once per mapper and register the phrases to filter on
        double falsePositiveProbability = 0.1;
        int expectedNumberOfElements = 100;
        filter = new FilterBloom<String>(falsePositiveProbability, expectedNumberOfElements);
        filter.add("bad service");
        filter.add("iron man");
        filter.add("marvel");
        filter.add("end game");
    }

    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        for (String token : tokens) {
            System.out.println(token);
            if (filter.contains(token)) {
                context.write(value, NullWritable.get());
            }
        }
    }
}

Main

package bfcode;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BloomFilter {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (args.length != 2) {
            System.err.println("Usage: FilterJob <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(bfcode.FilterMapper.class);
        job.setJobName("Customer Complaint Filter");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(bfcode.FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.waitForCompletion(true);
    }
}
Output
(Screenshot of the filtered records produced by the job.)

Findings and Learnings:
1. We learned how map-reduce works.
2. We learned how to code using hadoop in Java.
3. We have learned the working of Bloom filters.
4. We have successfully implemented Bloom filters.

AIM: Study and implement five basic functions in the R language.

Introduction and Theory
R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

The following are the important features of R:
R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the computer or on printed paper.

Basic functions in R:

1) R input and output
For input, use scan() or readline() to read a line from the keyboard in the form of a string; for output, use print().

2) R date and time
The function Sys.time() is used to return the current date and time. The 'Date' class in base R stores dates as the number of days since the start of 1970.

3) R packages
A package is a set of R functions and data sets, and the library is a folder on your system/computer which stores the files for those package(s). To load a package, library(package_name) is used. To view all packages, library() is used. To install a package, install.packages(package_name) is used.

4) seq()
seq(x, y) returns a sequence of numbers from x to y.

5) mean()
mean(x:y) calculates the mean of all the numbers ranging from x to y.

A short demonstration of these five functions is given below.
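A minimal sketch demonstrating the five functions described above (the prompt text and the example package name are arbitrary choices):

# 1) Input and output
name <- readline(prompt = "Enter your name: ")   # read a line from the keyboard as a string
print(paste("Hello,", name))                     # print the result

# 2) Date and time
Sys.time()                                       # current date and time

# 3) Packages
# install.packages("ggplot2")                    # install a package (requires an internet connection)
library(ggplot2)                                 # load an installed package
library()                                        # view all available packages

# 4) Sequences
seq(2, 10)                                       # 2 3 4 5 6 7 8 9 10

# 5) Mean
mean(1:10)                                       # 5.5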
Output:
(Console output of the functions above.)

Findings and Learnings:
Successfully implemented five basic functions in the R language.
1) R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
2) It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyse, visualize and present data.
3) Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.

AIM: Find the TF-IDF matrix for recommendation systems in R.

Introduction and Theory
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF for a word in a document is calculated by multiplying two different metrics:

1) Term Frequency (TF): the number of times a word appears in a document. There are ways to adjust the frequency, e.g. by the length of the document (normalized term frequency): the number of times a word appears in a document divided by the total number of words in that document.

2) Inverse Document Frequency (IDF): how common or rare a word is across the entire document set. It is calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm. So, if the word is very common and appears in many documents, this number will approach 0.

A small numeric example is sketched below, followed by the R implementation.
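A toy computation of the two metrics for a single term (the counts are made up for illustration and are not taken from the dataset used in the code):

tf  <- 3 / 100        # the term appears 3 times in a 100-word document (normalized term frequency)
idf <- log(10 / 2)    # 10 documents in the collection, 2 of them contain the term
tf * idf              # TF-IDF weight, roughly 0.048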
Code:

# doc1 ... doc5 are character strings (one document each) assumed to be defined earlier
docs <- c(doc1, doc2, doc3, doc4, doc5)
class(docs)

library(tm)
doc_corpus <- VCorpus(VectorSource(docs))
doc_corpus
doc_corpus <- tm_map(doc_corpus, tolower)
doc_corpus <- tm_map(doc_corpus, removeWords, stopwords("english"))
doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus <- tm_map(doc_corpus, PlainTextDocument)

# install.packages("SnowballC")
library(SnowballC)
doc_corpus <- tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content

doc_dtm <- TermDocumentMatrix(doc_corpus)
doc_dtm
doc_dtm$dimnames$Docs <- as.character(1:5)
inspect(doc_dtm)
findFreqTerms(doc_dtm, lowfreq = 2)

library(rvest)
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)

jobCorpus <- VCorpus(VectorSource(job[, 10]))
jobCorpus <- tm_map(jobCorpus, tolower)
for (j in seq(jobCorpus)) {
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}
jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, PlainTextDocument)
jobCorpus <- tm_map(jobCorpus, stemDocument)

dtm <- DocumentTermMatrix(jobCorpus)
dtm
dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting = weightTfIdf))
dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10])

Output:
(Inspection of the term-document matrix and of the TF-IDF weighted matrix.)

Findings and Learnings:
1) The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection of documents.
2) The importance increases proportionally to the number of times a word appears in the document, but is inversely proportional to the frequency of the word in the set.
3) Variations of the TF-IDF weighting scheme are often used by search engines in scoring and ranking a document's relevance given a query.

AIM: Write a program to implement clustering in R programming.

Introduction and Theory

Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to other data points in that group than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters. Broadly speaking, clustering can be divided into two subgroups:

Hard clustering: each data point either belongs to a cluster completely or not. For example, each customer of a retail store is put into exactly one group out of, say, 10 groups.

Soft clustering: instead of putting each data point into a separate cluster, a probability or likelihood of that data point being in each of the clusters is assigned. For example, each customer is assigned a probability of being in any of the 10 clusters of the retail store.

There are four broad ways we can do clustering, of which centroid-based methods are the most commonly used.

Connectivity models: As the name suggests, these models are based on the notion that data points closer in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregating them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.

Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The k-means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.

Distribution models: These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (for example, normal or Gaussian). These models often suffer from overfitting. A popular example of these models is the expectation-maximization algorithm, which uses multivariate normal distributions.

Density models: These models search the data space for areas of varied density of data points. They isolate different density regions and assign the data points within a region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.

Applications of Clustering
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection

k-means clustering
It is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, via the iterative refinement approach employed by both k-means and Gaussian mixture modeling. Both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means due to the name.
Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm.

To process the learning data, the k-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either: the centroids have stabilized (there is no change in their values because the clustering has been successful), or the defined number of iterations has been achieved.

The k-means pseudo-code:

Initialize k means with random values
For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign the item to that mean
        Update the mean

Code

Clustering.R

require(ggplot2)
require(MASS)

set.seed(42)
set1 <- MASS::mvrnorm(n = 150, c(-10, 10), matrix(c(1.5, 1, 1, 1.5), 2))
set2 <- MASS::mvrnorm(n = 150, c(-5, 10),  matrix(c(1, 2, 2, 6), 2))
set3 <- MASS::mvrnorm(n = 150, c(-1, 1),   matrix(c(4, 0, 0, 4), 2))
set4 <- MASS::mvrnorm(n = 150, c(10, -10), matrix(c(4, 0, 0, 4), 2))
set5 <- MASS::mvrnorm(n = 150, c(3, -3),   matrix(c(4, 0, 0, 4), 2))
DF <- data.frame(rbind(set1, set2, set3, set4, set5),
                 cluster = as.factor(c(rep(1:5, each = 150))))
ggplot(DF, aes(x = X1, y = X2, color = cluster)) + geom_point()

kmeans <- function(data, K = 4, stop_crit = 10e-3)
{
  # Initialisation of clusters
  centroids <- data[sample.int(nrow(data), K), ]
  current_stop_crit <- 1000
  cluster <- rep(0, nrow(data))
  converged <- FALSE
  it <- 1
  while (current_stop_crit >= stop_crit & converged == FALSE)
  {
    it <- it + 1
    if (current_stop_crit <= stop_crit)
    {
      converged <- TRUE
    }
    old_centroids <- centroids
    ## Assigning each point to the closest centroid
    for (i in 1:nrow(data))
    {
      min_dist <- 10e10
      for (centroid in 1:nrow(centroids))
      {
        distance_to_centroid <- sum((centroids[centroid, ] - data[i, ])^2)
        if (distance_to_centroid <= min_dist)
        {
          cluster[i] <- centroid
          min_dist <- distance_to_centroid
        }
      }
    }
    ## Updating each centroid as the mean of its assigned points
    for (i in 1:nrow(centroids))
    {
      centroids[i, ] <- apply(data[cluster == i, ], 2, mean)
    }
    current_stop_crit <- mean((old_centroids - centroids)^2)
  }
  return(list(data = data.frame(data, cluster), centroids = centroids))
}

res <- kmeans(DF[1:2], K = 5)
res$centroids$cluster <- 1:5
res$data$isCentroid <- FALSE
res$centroids$isCentroid <- TRUE
data_plot <- rbind(res$centroids, res$data)
ggplot(data_plot, aes(x = X1, y = X2, color = as.factor(cluster),
                      size = isCentroid, alpha = isCentroid)) + geom_point()

Results & Outputs
(Scatter plots of the sample data and of the final cluster assignment.)

Findings and Learnings:
1. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
2. R provides easy-to-use tools for performing cluster analysis.
3. We have successfully implemented clustering in R.

AIM: Write a program to find similar documents with cosine similarity in R.

Introduction and Theory

Cosine Similarity
Cosine similarity is a metric used to measure how similar two documents are, irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by the Euclidean distance (due to the size of the documents), chances are they may still be oriented closer together. The smaller the angle, the higher the cosine similarity.

The cosine of two non-zero vectors can be derived from the Euclidean dot product:

A · B = ||A|| ||B|| cos(θ)

The cosine similarity is then calculated as:

similarity = cos(θ) = (A · B) / (||A|| ||B||) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) √(Σᵢ Bᵢ²) ),  with i = 1, ..., n
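A minimal sketch of this formula on two toy term-count vectors (the numbers are arbitrary and only illustrate the computation):

A <- c(1, 2, 0, 3)
B <- c(2, 1, 1, 0)
sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))   # cosine similarity, approximately 0.44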
Code

Cosine.R

library(twitteR)
library(tidytext)
library(dplyr)
library(tm)
library(SnowballC)

workingDir <- "~/Documents/R programs"
setwd(workingDir)

url <- "http://www.rdatamining.com/data/rdmTweets-201306.RData"
download.file(url, destfile = "rdmTweets-201306.RData")
load(file = "rdmTweets-201306.RData")
tweets <- twListToDF(tweets)

tweets <- tweets %>%
  mutate(text = gsub("(http|https).+$|\\n|&amp|[[:punct:]]", "", text),
         rowIndex = as.numeric(row.names(.))) %>%
  select(text, retweetCount, rowIndex)

docList <- as.list(tweets$text)
N.docs <- length(docList)

QrySearch <- function(queryTerm) {

  # Record starting time to measure the search engine performance
  start.time <- Sys.time()

  # Store docs in a Corpus class, which is a fundamental data structure in text mining
  my.docs <- VectorSource(c(docList, queryTerm))

  # Transform/standardize docs to get them ready for analysis
  my.corpus <- VCorpus(my.docs) %>%
    tm_map(stemDocument) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stripWhitespace)

  # Store docs in a term-document matrix where rows = terms and cols = docs
  # Normalize term counts by applying TF-IDF weightings
  term.doc.matrix.stm <- TermDocumentMatrix(my.corpus,
                                            control = list(
                                              weighting = function(x) weightSMART(x, spec = "ltc"),
                                              wordLengths = c(1, Inf)))

  # Transform the term-document matrix into a dataframe
  term.doc.matrix <- tidy(term.doc.matrix.stm) %>%
    group_by(document) %>%
    mutate(vtrLen = sqrt(sum(count^2))) %>%
    mutate(count = count / vtrLen) %>%
    ungroup() %>%
    select(term:count)

  docMatrix <- term.doc.matrix %>%
    mutate(document = as.numeric(document)) %>%
    filter(document < N.docs + 1)

  qryMatrix <- term.doc.matrix %>%
    mutate(document = as.numeric(document)) %>%
    filter(document >= N.docs + 1)

  # Calculate the top ten results by cosine similarity
  searchRes <- docMatrix %>%
    inner_join(qryMatrix, by = c("term" = "term"), suffix = c(".doc", ".query")) %>%
    mutate(termScore = round(count.doc * count.query, 4)) %>%
    group_by(document.query, document.doc) %>%
    summarise(Score = sum(termScore)) %>%
    filter(row_number(desc(Score)) <= 10) %>%
    arrange(desc(Score)) %>%
    left_join(tweets, by = c("document.doc" = "rowIndex")) %>%
    ungroup() %>%
    rename(Result = text) %>%
    select(Result, Score, retweetCount) %>%
    data.frame()

  # Record when it stops and take the difference
  end.time <- Sys.time()
  time.taken <- round(end.time - start.time, 4)
  print(paste("Used", time.taken, "seconds"))

  return(searchRes)
}

QrySearch("data science")

Results & Outputs
(Output of QrySearch("data science"): the top ten matching tweets with their scores.)

Findings and Learnings:
1. Cosine similarity is one of the most commonly used similarity metrics in text processing.
2. R provides easy-to-use tools for performing text analysis.
3. We have successfully implemented document retrieval using cosine similarity.

AIM: Write a program to find TF-IDF for any dataset and also plot the resultant term-frequency matrix.
Introduction and Theory

TF-IDF
A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document? One measure of how important a word may be is its term frequency (tf): how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important; in English, these are probably words like "the", "is", "of", and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term's inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term's tf-idf, the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts.

The inverse document frequency for any given term is defined as:

idf(term) = ln( n_documents / n_documents containing term )

The resulting tf-idf matrix is then calculated as:

tf-idf(term, document, Dataset) = tf(term, document) × idf(term, Dataset)

The resulting matrix is not normalized; this is done using the L2 normalization:

v' = v / ||v||

Code

TF-IDF.R

library(tm)
library(proxy)
library(dplyr)
library(ggplot2)
library(corrplot)

print("The dataset: each sentence is one document")
doc <- c("The sky is blue.",
         "The sun is bright today.",
         "The sun in the sky is bright.",
         "We can see the shining sun, the bright sun.",
         "The moon is full, the sky full of stars.",
         "The sky was dark, the stars plentiful and bright.",
         "The sun is but a morning star.")

corpus <- Corpus(VectorSource(doc))
controlList <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)

print("computing the term-frequency matrix:")
(tf <- as.matrix(TermDocumentMatrix(corpus, control = controlList)))
corrplot(tf, method = "number", is.corr = FALSE, cl.pos = "n")

print("computing the idf, and then converting it into a diagonal matrix (used later)")
(idf <- log(ncol(tf) / (1 + rowSums(tf != 0))))
(idf <- diag(idf))

print("calculating the final tf-idf matrix")
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)
(tf_idf <- tf_idf / sqrt(rowSums(tf_idf^2)))
corrplot(tf_idf, method = "number")

Results & Outputs
(Plots of the term-frequency matrix and of the normalized tf-idf matrix.)

Findings and Learnings:
1. TF-IDF is one of the most popular methods in text processing.
2. R provides easy-to-use tools for performing text analysis.
3. We have successfully implemented TF-IDF in R.