INTEGRATING NETWORK STORAGE INTO INFORMATION RETRIEVAL APPLICATIONS
Svetlana Y. Mironova
The University of Tennessee, Knoxville
Spring 2003

Topics of Discussion
  Motivation
  General Text Parser (GTP)
  Network Storage Stack
  GTP with Network Storage
  Implementation Challenges
  Performance
  Future Work

Motivation
  The amount of text-based information stored on our computers and on the Web is growing rapidly.
  Researchers and scientists need storage to run simulations and store outputs.
  Data mining and information retrieval professionals need a tool capable of creating an index from a document collection, storing it on the network, and sharing it with others.

General Text Parser (GTP)
  Two modules: GTP and GTPQUERY
  Text/document parsing and indexing
  Constructs sparse matrix data structures
  Creates a vector-space model in which documents and queries are vectors in a low-dimensional subspace
  Term-by-document matrix defines relationships between documents and distinct terms
  Underlying model is Latent Semantic Indexing (LSI)

Versions of GTP
  C++ (original)
  Parallel C++ using MPI (for SVD computation)
  Java (GUI recently developed)
  C++ version runs on Solaris (Unix) and Linux
  Parallel version runs only on Solaris
  Java version runs on Solaris, Linux, and Mac OS X

GTP Process
  Filter documents (optional)
  Create database of keys, IDs and weights
  Perform matrix decomposition (SVD) on the term-by-document matrix
  Clean up
  Write out summary

Query Process
  Filter queries (optional)
  Parse first query
  Generate query vector
  Scale query vector by singular values (optional)
  Perform cosine matching
  Write results to file for this query
  Repeat for more queries

Network Storage Stack
  Framework for storing and transferring data over a network
  Modeled after the Internet Protocol (IP) stack
  Designed to add storage resources to the Internet in a sharable and scalable manner

Network Storage Stack (layers, top to bottom)
  Applications (GTP, etc.)
  Logistical File System
  Logistical Tools
  L-Bone and exNode
  IBP
  Local Access
  Physical Layer

IBP
  Internet Backplane Protocol
  Foundation of the Network Storage Stack
  Shares resources across networks
  Uses local storage to create a global storage service
  Echoes advantages of IP: abstraction of datagram delivery, scalability, simple fault detection (discard faulty datagrams)
  Temporary and “unreliable”

IBP Client Calls
  Allocate
  Store
  Load
  Copy
  Mcopy
  Manage

exNode
  IBP capabilities are hard to manage; the exNode automates this
  exNodes are pointers to IBP allocations
  Allows network files with stronger properties (fault tolerance, longer duration, etc.) to be built from unreliable IBP allocations
  Two major components: metadata and mappings

L-Bone
  The Logistical Backbone
  Resource discovery service
  Maintains a list of public depots and metadata about them
  Uses Network Weather Service (NWS) to monitor throughput between depots
  http://loci.cs.utk.edu/lbone

LoRS
  Logistical Runtime System
  Automates finding IBP depots via the L-Bone, and creating and managing IBP capabilities and exNodes
  C API and command-line tool set

LoRS Functions
  Upload
  Download
  Augment
  Trim
  Refresh
  List

GTP with Network Storage
  Creating an index is a dynamic process
  Large document collection => large output files => lots of storage space required
  Need to share produced results with others (across the globe)
  If not satisfied with the result, the stored files expire automatically
  If happy with the collection, can either keep it on IBP longer or store it locally (burn to CD, etc.)

GTP and Upload
  GTP parses the collection
  GTP creates output files (keys and output)
  Files are uploaded to remote network storage (IBP)
  Upload accepts optional information from the user
  This information helps optimize performance
  Capabilities are returned in the form of XML files (.xnd extension)

GTP and Upload (contd)
  Location (Null, hostname, zip, state, city, country, airport)
  Duration in days
  Fragments
  Copies

Download and GTPQUERY
  The files keys and output are downloaded using information from the .xnd files
  Download is multithreaded
  Adaptive algorithm: takes into account throughput to the client
  “Faster” depots provide more blocks of data

Download + GTPQUERY
  [Figure: representation of the binary file output for the 5K collection; figure labels: 5K, 100]

Implementation Challenges
  GTP is written in Java, while the LoRS tools are written in C
  Calls go through a server (first xnd_server, then lors_server)
  Adapting to changes: both GTP and the LoRS tools are constantly evolving
  Threading to optimize performance
  User friendliness

Performance
  All results were obtained using the Java version of GTP
  Three subcollections of FBIS (Foreign Broadcast Information Service) were used to produce benchmarks
  Server located in Tennessee
  Upload/download to/from Tennessee (TN), California (CA), France (FR)

Run Specifications
  By default, GTP uses 100 SVD factors, i.e., all term and document vectors are of length 100
  The weighting scheme used was log entropy
  For the query, only the first 15 singular triplets were used
  Three queries were used on each collection:
    Yugoslavia Croatia Bosnia-Herzegovina
    Russia embassy FIS
    Nissan Motor

FBIS 5K
  [Two bar charts, seconds by location (FR, CA, TN). Left: GTP + Upload, bars split into GTP and Upload. Right: Download + GTPQUERY, bars split into GTPQUERY and Download. GTP takes 284 s and GTPQUERY 6.4 s at every location.]

  Name      Size     Docs    Terms   output  keys
  FBIS 5K   17.8 MB  5,000   22,558  11 MB   2.78 MB

FBIS 10K
  [Two bar charts, seconds by location (FR, CA, TN). Left: GTP + Upload, bars split into GTP and Upload. Right: Download + GTPQUERY, bars split into GTPQUERY and Download. GTP takes 548 s and GTPQUERY 21 s at every location.]

  Name      Size     Docs    Terms   output  keys
  FBIS 10K  32 MB    10,000  31,667  18 MB   3.5 MB

FBIS 20K
  [Two bar charts, seconds by location (FR, CA, TN). Left: GTP + Upload, bars split into GTP and Upload. Right: Download + GTPQUERY, bars split into GTPQUERY and Download. GTP takes 1320 s and GTPQUERY 23 s at every location.]

  Name      Size     Docs    Terms   output  keys
  FBIS 20K  63 MB    20,000  46,488  28 MB   5.8 MB

Performance Analysis: GTP + Upload
  GTP time grows roughly in proportion to the collection size (284 s, 548 s and 1320 s for 5K, 10K and 20K documents)
  Additional overhead for upload is not significant compared to the total time
  Upload time depends on multiple factors: location, network bandwidth, time of day, size of file, number of copies requested, and status of depots at the time of the upload

Performance Analysis: Download + GTPQUERY
  All “heavy-duty” preprocessing of the collection was done by GTP
  The query process simply projects the query into the term-by-document vector space
  Dimension of the vector space and number of factors used affect query time
  Number of queries requested affects query time
  Download takes up the greater portion of the total time
  Download is affected by location of fragments and network conditions

Future Work
  Optimize Java performance
  Incorporate fully with GUI
  Incorporate network storage into the other (C++, parallel) versions of GTP
  Stream data directly while it is generated? Avoid local file generation
  User friendliness
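
Supplementary sketch: Query Process
  The Query Process steps listed earlier (generate query vector, optionally scale it by singular values, perform cosine matching) can be illustrated with a small Java sketch. This is not GTPQUERY code: the class QueryMatchSketch and all of its method names are hypothetical, the vectors are toy values, and scaling is assumed to mean component-wise multiplication by the singular values, which the slides do not specify.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of GTP or GTPQUERY.
final class QueryMatchSketch {

    // Optional step: scale each query component by the matching singular value.
    // Assumption: scaling means multiplication; the slides do not say which form GTP uses.
    static double[] scaleBySingularValues(double[] query, double[] singularValues) {
        double[] scaled = new double[query.length];
        for (int i = 0; i < query.length; i++) {
            scaled[i] = query[i] * singularValues[i];
        }
        return scaled;
    }

    // Cosine of the angle between two vectors of equal length; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns document indices ordered from best to worst cosine match.
    static Integer[] rank(double[] query, double[][] docs) {
        final double[] scores = new double[docs.length];
        Integer[] order = new Integer[docs.length];
        for (int d = 0; d < docs.length; d++) {
            scores[d] = cosine(query, docs[d]);
            order[d] = d;
        }
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        return order;
    }

    public static void main(String[] args) {
        // Toy 2-factor document vectors; a real run would use up to 100 factors.
        double[][] docs = { {0.0, 1.0}, {1.0, 1.0}, {2.0, 0.0} };
        double[] query = scaleBySingularValues(new double[] {1.0, 0.0},
                                               new double[] {3.0, 0.5});
        System.out.println(Arrays.toString(rank(query, docs)));
    }
}
```

  In this toy run the third document points in the same direction as the query, so it ranks first. Because cosine matching ignores vector length, the cost per query depends mainly on the number of factors and the number of documents, in line with the performance analysis of Download + GTPQUERY.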