Associative Data Schemes for Cloud Computing

Amir Basirat, PhD Candidate
Amir.Basirat@monash.edu
Supervisor: Dr Asad Khan
Clayton School of IT, Monash University
STINT Workshop, Lulea, Sweden - May 2012

Contents
1 Cloud Computing
2 Hadoop MapReduce
3 Pattern Recognition and Distributed Approach
4 Graph Neuron for Scalable Pattern Recognition
5 HGN and DHGN
6 Research Objective
7 Web-based GN
8 EdgeHGN
9 Simulation Showcase

What is Cloud Computing?
The vision of Cloud Computing encompasses a general shift of computer processing, storage, and software delivery away from the desktop and local servers, across the network, and into the next generation of data centers hosted by large infrastructure companies.

Big Data!
An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

Challenge?
Our existing capability to generate data seems to outstrip our capability to analyze it.

Data Management in Cloud
There are some underlying issues that need to be addressed properly by any data management scheme deployed for clouds (Abadi, 2009), including:
• the capability to parallelise the data workload
• security concerns arising from storing data at an untrusted host
• data replication functionality
Thus the question of how to process immense data sets effectively is becoming increasingly urgent.

Hadoop
In a nutshell, Hadoop provides:
“A reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce” (Hadoop, 2011)

MapReduce
The MapReduce programming model requires solutions to be expressed with two functions, Map and Reduce (Hadoop, 2011):
• A map function takes a key/value pair, performs its computation, and emits a set of intermediate key/value pairs as output.
• A reduce function merges all intermediate values associated with the same intermediate key, executes some computation on them, and emits the final output.

Word Count in MapReduce
Pseudocode for the word count algorithm in MapReduce:

class MAPPER
    method MAP(docid a, doc d)
        for all term t in doc d do
            EMIT(term t, count 1)

class REDUCER
    method REDUCE(term t, counts [c1, c2, …])
        sum = 0
        for all count c in counts [c1, c2, …] do
            sum = sum + c
        EMIT(term t, count sum)

Challenges and Hurdles in MapReduce
• The map function operates on the assumption that all related data is distributed vertically, i.e. that records are uniformly distributed across the network. However, parts of related records may be stored at different physical locations.
• Intermediate records need to be sorted before they are input to the reduce function.
• The solution must be expressed in terms of Map and Reduce functions working on key/value pairs, which in some cases, such as multi-stage processes, may not be possible or natural.
• Moreover, dependency on HDFS for data storage and retrieval can create single points of failure for the MapReduce infrastructure, especially at master nodes.
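
As an illustration, the word count pseudocode shown earlier can be sketched in plain Python, run in a single process rather than on Hadoop. The function names (map_word_count, shuffle, reduce_word_count, word_count) are hypothetical and not part of any Hadoop API, and lowercasing plus whitespace tokenisation are assumptions made for the example. The shuffle step stands in for the sort-and-group work that the framework performs between Map and Reduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_id: str, doc: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (term, 1) for every term in the document."""
    # Assumption: terms are lowercased, whitespace-separated tokens.
    for term in doc.lower().split():
        yield term, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key and sort by key.

    This imitates the sorting and grouping that happens between
    the Map and Reduce phases.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_word_count(term: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum all counts emitted for one term."""
    return term, sum(counts)


def word_count(docs: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a set of documents."""
    intermediate = [
        pair
        for doc_id, doc in docs.items()
        for pair in map_word_count(doc_id, doc)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_word_count(term, counts) for term, counts in grouped.items())


if __name__ == "__main__":
    print(word_count({"d1": "the cat sat", "d2": "the dog"}))
```

Note that the key here is a plain string; the sort in shuffle only works because keys are comparable, which is one reason complex data representations are awkward to use as MapReduce keys.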

Existing data management schemes do not work well when data is partitioned dynamically among numerous available nodes.
Approaches towards scalable data management in the cloud, which offer greater portability, manageability and compatibility of applications and data, are yet to be fully realised.

Solution?
• To develop a distributed data access scheme that enables data storage and retrieval by association
• Treat data records as patterns
As a result, data storage and retrieval are performed using a distributed pattern recognition approach that is implemented through the integration of loosely-coupled computational networks, followed by a divide-and-distribute approach that allows these networks to be distributed dynamically within the cloud.

Associative Model of Data
This associative model treats data records as patterns, and hence it does not matter how the data is represented. The associative model uses a single, common structure for all types of data.

Distributed Pattern Recognition
The distributed computing approach offers seemingly unlimited scalability as the number of patterns grows, thanks to the rapid advent of network computing technology that enables processing to be performed within the body of a network rather than relying on exhaustive single-CPU utilization.
Existing approaches still lag behind, owing to the highly complex recognition algorithms they implement.
The neural network approach offers a promising tool for large-scale pattern recognition.
However, there are also several issues related to its implementation. These include:
• convergence problems
• complex iterative learning procedures
• low scalability with regard to the training data required for optimum recognition

An eight-node GN in the process of storing patterns P1 (RED), P2 (BLUE), P3 (BLACK), and P4 (GREEN) (Khan, 2002).

Hierarchical Graph Neuron (HGN)
HGN compositions for two-dimensional (7x5) and three-dimensional (7x5x3) pattern sizes.

Distributed Hierarchical Graph Neuron (DHGN)
DHGN distributed pattern recognition architecture (Muhamad Amin and Khan, 2009).

Research Objectives
• Redesigning the data management architecture from a scalable associative computing perspective to create database-like functionality that can dynamically scale up or down over the available infrastructure without interruption or degradation
• Investigating a distributed data access scheme that enables data storage and retrieval by association, with data records treated as patterns
• Processing the database and handling the dynamic load using a distributed pattern recognition approach
• Developing an intelligent MapReduce framework that allows complex data representations to be used as keys for Map operations
• Reducing cloud storage fragmentation by implementing a divide-and-distribute approach
• Enhancing the existing cloud data management models for scalability
• Validating the results and finding the asymptotic limits of the technique through a rigorously designed computer simulation environment

Progress to Date
• Proposing a Web-based GN for Real-time Image Recognition

Web-based GN
(a) Total number of positive and negative matches.
(b) Distortion rates for each line of the image (each constructed HGN).
Image distortion rates vs. rotation degrees.

Edge Detecting Hierarchical Graph Neuron (EdgeHGN)
7-by-7 bit binary character A and its 7 equally-sized DHGN subnets.
Reducing the number of neurons by applying a drop-fall technique.

Drop Fall Scheme
• Drop-fall is often used for dividing touching pairs of digits into isolated characters. The drop-fall algorithm simulates the path produced by a drop of water falling from above the character and sliding downwards along the contour under the action of gravity.
• When the drop gets stuck in a groove, it melts the character's stroke and then continues to fall.
  The dividing path produced by the drop-fall algorithm depends on three aspects: a start point, movement rules, and direction.
• There are four possible directions, which generally produce four different paths for dividing touching digits. A path can start on the left or right side and can evolve downwards or upwards. One of the four is likely to produce the right result.
• Therefore, a set of drop-fall algorithms consists of four methods that try to segment a block by simulating a drop-falling process: the descending-left, descending-right, ascending-left, and ascending-right algorithms.

EdgeHGN Performance

Disclaimer
I am not proposing any computer vision scheme for image processing here. I am not suggesting in any way that my scheme is capable of competing against the many image processing and face recognition algorithms treated in the literature. I am doing pattern matching, and I could use any form of data representation for the purpose of my research. Images are complex matrices of values, but people can relate to images very well, and that is why I found them an easy way to illustrate the effectiveness and strength of my proposed model.

Binary Image Recognition
Fifty different individuals in the face image dataset obtained from the Face Recognition Data.

Sobel Operator
In simple terms, the Sobel operator calculates the gradient of the image intensity at each point, giving the direction of the largest possible increase in intensity and the rate of change in that direction. The result therefore shows how "abruptly" or "smoothly" the image changes at that point, and hence how likely it is that that part of the image represents an edge, as well as how that edge is likely to be oriented.
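
As a rough illustration of the Sobel description, the sketch below computes the gradient magnitude and direction for every pixel of a small greyscale image using the standard 3x3 Sobel kernels. It is not the code behind the simulation: the function name sobel, the list-of-lists image format, and the border handling (edge pixels are replicated) are all assumptions made for the example.

```python
import math
from typing import List, Tuple

Image = List[List[float]]

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) change.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(image: Image) -> Tuple[Image, Image]:
    """Return (magnitude, direction) of the intensity gradient per pixel.

    Assumption: the image is a non-empty rectangular list of rows, and
    pixels outside the image take the value of the nearest border pixel.
    """
    rows, cols = len(image), len(image[0])

    def pixel(r: int, c: int) -> float:
        # Clamp coordinates so that border pixels are replicated.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return image[r][c]

    magnitude = [[0.0] * cols for _ in range(rows)]
    direction = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = gy = 0.0
            # Slide the kernels over the 3x3 neighbourhood (cross-correlation).
            for dr in range(3):
                for dc in range(3):
                    value = pixel(r + dr - 1, c + dc - 1)
                    gx += SOBEL_X[dr][dc] * value
                    gy += SOBEL_Y[dr][dc] * value
            magnitude[r][c] = math.hypot(gx, gy)   # rate of change
            direction[r][c] = math.atan2(gy, gx)   # gradient angle in radians
    return magnitude, direction


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
    mag, ang = sobel(step)
    print(mag[2])
```

Thresholding the magnitude is one simple way to turn such a result into a binary edge map; the threshold value would be a further assumption.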
Edge map after applying Global Binary Signature and Sobel's edge detection.

References
Abadi, D. J. (2009). Data Management in the Cloud: Limitations and Opportunities, Bulletin of the Technical Committee on Data Engineering, pp. 3-12.
Khan, A. I. and Muhamad Amin, A. (2007). One shot associative memory method for distorted pattern recognition, AI 2007: Advances in Artificial Intelligence, Springer, Berlin/Heidelberg, pp. 705-709.
Muhamad Amin, A. and Khan, A. I. (2009). Collaborative-comparison learning for complex event detection using distributed hierarchical graph neuron (DHGN) approach in wireless sensor network, AI 2009: Advances in Artificial Intelligence, Springer, Berlin/Heidelberg, pp. 111-120.
Nasution, B. B. and Khan, A. I. (2008). A hierarchical graph neuron scheme for real-time pattern recognition, IEEE Transactions on Neural Networks 19(2): 212-229.
Shiers, J. (2009). Grid today, clouds on the horizon, Computer Physics Communications, pp. 559-563.
Welsh, M., Malan, D., Duncan, B., Fulford-Jones, T. and Moulton, S. (2004). Wireless sensor networks for emergency medical care, GE global conference, Harvard University and Boston University School of Medicine, Boston, MA.

Acknowledgement
I would like to thank everyone who helped me to make this possible. First and foremost, immense gratitude goes to my thesis supervisor, Dr Asad Khan, for his support and kind contributions.

Thank You.