The Energy Data Collection Project

The Vision: Ask the Government...
We’re thinking of moving to Denver...
• What are the schools like there?
• How many people had breast cancer in the area over the past 30 years?
• Is there an orchestra? An art gallery? How far are the nightclubs?
• How have property values in the area changed over the past decade?
(Sources: Census, Labor Stats)

The Vision: Ask the Government...
We’re thinking of moving to Cambridge…
• How much does gas cost there?
• Which state has the highest oil production?
• How long has the nuclear plant been in service?
• Are alternative energy sources any cheaper to use?
(Sources: Census, Labor Stats)

The problem and the solution
• Problem: FedStats has thousands of databases in over seventy Government agencies:
– data is duplicated and near-duplicated,
– even Government officials and specialists cannot find it
• Solution: Create a system to provide easy standardized access:
– need multi-database access engine,
– need powerful user interface,
– need terminology standardization mechanism.

The purpose of DGRC
To Make Digital Government Happen
• Advance information systems research
• Bring the benefits of cutting edge IS research to government systems
• Help educate government and the community
• Learn needs from government partners to drive next stage system development
• Build pilot systems as part of new infrastructure

Research challenges
• Scale to incorporate many databases
… build data models automatically
• Process large and disparate data efficiently
… develop fast processing techniques
… create aggregation and substitution operators
• Integrate data models across sources and agencies
… take a large ontology and link the models into it automatically
• Incorporate additional information that is available from text
… use language processing tools to extract it
• Display complex information from distributed sources
… develop and evaluate new presentation techniques

System Architecture
Construction phase:
• Deploy DBs
• Extend ontol.
User phase:
• Compose query
• Present results
Access phase:
• Create DB query
• Retrieve data
Components:
• Integrated ontology: global terminology, source descriptions, integration axioms
• Databases: DB analysis, text analysis, query substitution, rapid analysis tools
• User Interface: ontology browser, query constructor
• Query processor: reformulation, cost optimization
• Sources: R, S, T (Text, Tables, Data)

Columbia’s Team Approach
• User Interface
– Year One: Hatzivassiloglou, Sandhaus
– Year Two: Feiner, Temiyabutr
• Database Aggregation
– Year One: Gravano, Singla
– Year Two: Ross, Zaman
• Automatic Inter-Agency Ontologies
– Years One and Two: Klavans, Whitman

System interface – Year One Progress
Vasileios Hatzivassiloglou, Jay Sandhaus
Components:
1. Query formation
2. Ontology/glossary browsing for concept navigation
3. Answer display, interaction history
GUI incorporates key technologies for facilitating user access to diverse databases:
– Context-sensitive menu-based input mechanism
– Visualization and navigation of results and the ontology
– Lightweight client runs on multiple platforms without downloads
– Java/Swing implementation allows client-side processing

Information Aggregation – Yr. 1 Progress
Luis Gravano, Anurag Singla
Problem: Data is not in exactly the form the user needs (monthly, not annually; actual values, not averaged)
Solution: Attempt to provide unified view of data of various granularities:
– time period
– geographical region
– product
– …
Example over BLS data:
– View: monthly data available for all geographical regions
– Query: monthly prices for LA in 1979
– Answer: yearly price for LA in 1979

Aggregation challenges
• Different coverage along these dimensions across data sets
• Users see a simple, unified view of the data; if a query cannot be answered, we answer the closest query that we have data for
• Answers are always exact
• Key challenges:
– defining query proximity (default vs. user-specific)
– communicating ‘query relaxation’ to users
– defining and navigating the space of ‘answerable’ queries efficiently

Extracting and Structuring Information from Definitions – Yr 1
Judith Klavans, Brian Whitman
Problems:
– Proliferation of terms in domain
– Agencies define terms differently
– Many refer to the same or related entity
– Lengthy and dense term definitions often contain important information which is buried

Glossary analysis framework
• Gather glossaries, thesauri, definitions from govt agencies
• Create framework into which text will be analyzed
• Extract ontological information applying language sensitive analysis tools
• Structure and deliver to ISI for access and display
• Based on past projects:
– analysis of definitions in machine-readable dictionaries
• Original
– domain specific glossaries

DGRC-EDC Plans for Year Two
• User Interface
– Incorporate new presentation approaches
– Link ontology access mechanisms to query input
– Incorporate other DG research (Marchionini)
• Database
– Integrate existing aggregation prototype
– Main memory for fast performance
• Lexical Knowledge Bases
– Incorporate into SENSUS
– Add web crawler to extend coverage
– Develop mechanisms to merge definitions

End of Part I: DGRC – EDC
• Reviewed goals of DGRC Energy Data Collection Project
• Showed first year progress
• Gave early second year results
• Presented Columbia’s team approach
• Set out future goals
But what is next?
Next Steps for DGRC Growth
• Ambitious two-pronged plan
– Additional Funding For DGRC – TRADE (NSF)
– Independent Foundation Funding (leverage NSF Investment)

One Facet: From DGRC-EDC to DGRC-TRADE
• Builds on past successes
• Brings in a new domain – trade data
• Adds three new enhancements
– User Needs and Evaluation
• Electronic Data Service at Columbia
• Users and Experts to test usefulness and usability
– Database – incorporate cross data set aggregation
– Ontology – add multilingual capability

[Diagram, DGRC-EDC: Heterogeneous Data Sources (EPA, Census, Labor, EIA Data); Information Access; Definition Ontology; Integration; User Interface]

[Diagram, DGRC-TRADE: Heterogeneous Data Sources (EPA, Labor, Census, Trade, EIA Data); Information Access; User Interface; Multilingual Access; Task-based Evaluation; Main Memory Query Processing; Definition Ontology; User Evaluation; Integration]

Columbia’s Electronic Data Service
• Established to serve social science researchers
• Operational unit of the Libraries
• Excellent relationship with faculty, staff and students
• Capable of supporting many levels of development and testing
• Evaluation effort led by Walter Bourne

Partners – DGRC Trade
• Evaluation experts from the US and Canada
– Cognitive evaluation
– User needs evaluation
– User interface evaluation
• Social scientists
– ISERP and CIESIN at Columbia
– Public Health
– Policy research

Facet Two: Building the DGRC
• Seek substantial Foundation support
• Pursue a large vision
• Involvement of high level Columbia and ISI administration
• Gather an advisory board to develop a sustainable plan

What do we need from the NSF?
1. Information
– Ways to interact with portals
• E.g. firstgov.com
• Private companies delivering (free) government data
2. Contacts
– Leverage peer-review process of NSF to establish key contacts

To Sum
• DGRC – Energy Data Collection (EDC)
– Progress from Year One
– Plans and early results from Year Two
• Larger Plans for Growing DGRC
– Trade Proposal – NSF
– Plans for other funding

Today’s Plan: Focus on DGRC-EDC
Major research challenges:
• Building and structuring the ontology
• Automated data aggregation
• Presentation of complex information
Major practical challenges:
• Getting more data into the system
• Understanding users’ needs

Thank you! Any questions?

Information Integration: Heterogeneity in Aggregation
Luis Gravano, Assistant Professor, Columbia U.
(joint work with Anurag Singla and Vasilis Vassalos)

Information Integration
Data Sets/Sources: Tables with statistical data, potentially produced by different organizations
Goal: To Provide Single-Stop Access to Multiple Distributed Autonomous Data Sets

My Research Background
• Databases
• Distributed search and retrieval over text sources

Metasearchers: Single-Stop Access to Heterogeneous Text Sources
[Diagram: User sends Query to Meta Searcher; Meta Searcher contacts Source 1, Source 2, ..., Source n; Unified Results return to User]

Main Metasearcher Tasks
• Selects good text sources for query (source discovery)
• Evaluates query at these sources (query translation)
• Combines query results from sources (result merging)

Some of my Previous Work on Metasearchers
• GlOSS: a scalable source discovery system that selects relevant text sources
• STARTS: a protocol that facilitates metasearching (Participants included Infoseek, Microsoft, Hewlett-Packard, Fulcrum, Verity, and Netscape.)
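
The three metasearcher tasks listed above (source discovery, query translation, result merging) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not GlOSS or STARTS: the corpora in `SOURCES`, the document-frequency summaries, and the overlap score are all hypothetical choices made only for illustration.

```python
from collections import Counter

# Hypothetical toy corpora standing in for autonomous text sources.
SOURCES = {
    "energy": ["gasoline prices rise", "nuclear plant service record"],
    "health": ["cancer incidence by county", "hospital prices"],
    "arts": ["orchestra season schedule"],
}

def summarize(docs):
    """Word -> number of documents containing it (a crude source summary)."""
    return Counter(w for d in docs for w in set(d.split()))

SUMMARIES = {name: summarize(docs) for name, docs in SOURCES.items()}

def select_sources(query, k=2):
    """Source discovery: rank sources by summed document frequency of query words."""
    scores = {n: sum(s[w] for w in query.split()) for n, s in SUMMARIES.items()}
    ranked = sorted((n for n in scores if scores[n] > 0),
                    key=lambda n: (-scores[n], n))
    return ranked[:k]

def evaluate(source, query):
    """Query translation and evaluation: here every source accepts a bag of words."""
    words = set(query.split())
    return [(len(words & set(d.split())), source, d)
            for d in SOURCES[source] if words & set(d.split())]

def metasearch(query):
    """Result merging: order hits from all selected sources by overlap score."""
    hits = [h for s in select_sources(query) for h in evaluate(s, query)]
    return sorted(hits, key=lambda h: (-h[0], h[1], h[2]))
```

Calling `metasearch("gasoline prices")` consults only the sources whose summaries mention the query words and returns one merged, ranked list. A real metasearcher would translate the query into each source's own syntax and reconcile incompatible scores, which this sketch deliberately skips.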
Challenges for Information Integration
• “Semantic” Heterogeneity of Data Sets
• “Syntactic” Heterogeneity of Data Sets
• Varying Granularity of Data Sets
• Varying Data Coverage
• Number of Available Data Sets
(Labels added to this list on successive builds of the slide: ISI’s SIMS; Future Work; Last Year Focus)

Mediators: Single-Stop Access to Heterogeneous Statistical Sources
[Diagram: User sends Query to Mediator; Mediator contacts MainMemory DBMS, ..., Traditional DBMS; Unified Results return to User]

Varying Data Coverage and Granularity
• Time period
• Geographical region
• Products
Average Price of Gasoline from BLS

Varying Data Coverage (I)
Region: US Average
– Product: Leaded Regular Gasoline
• Time Period: Oct 1973 to Mar 1991
• Source: BLS Series APU000074712
– Product: Leaded Premium Gasoline
• Time Period: Oct 1973 to Dec 1983
• Source: BLS Series APU000074713

Varying Data Coverage (II)
Product: Leaded Regular Gasoline
– Region: San Diego, CA
• Time Period: Jan 1978 to Dec 1986
• Source: BLS Series APUA42474712
– Region: Boston, Massachusetts
• Time Period: Jan 1978 to Jan 1989
• Source: BLS Series APUA10374712

Varying Data Coverage (III)
• Geographical coverage varies for different data fields (even for same gasoline type)
• Not all data fields available for all gasoline types (e.g., Consumer Price Index available for Unleaded Regular but not for Leaded Premium)

Varying Data Granularity
Granularity “hierarchies” for:
– Time period
– Geographical region
– Products

Granularity Hierarchy for Time Period
[Diagram nodes: Year, Quarter, Week, Month, Day]

Granularity Hierarchy for Geographical Region
[Diagram nodes: World, Country, Region (spanning cities or states), State, City]

Granularity Hierarchy for Products
[Diagram: Gasoline branches into Leaded Gasoline and Unleaded Gasoline; Leaded Gasoline branches into Leaded Gasoline (Premium), Leaded Gasoline (Midgrade), Leaded Gasoline (Regular)]

Some BLS Data Sets for our Demo (Gasoline Unleaded Regular, Average Price)
• US; Monthly; 10/1973 to 3/1991
Source: APU000074712
• San Diego; Monthly; 1/1978 to 12/1986
Source: APUA42474712
• Los Angeles; Monthly; 1/1986 to 4/1991
Source: APUA42174712
• Los Angeles; Yearly; 1978 to 1985
Source: APUA42174712 (aggregated)

What Do We Show Users as Data Sets Available for Querying?
Possibility 1: All the details!
• US; Monthly; 10/1973 to 3/1991
• San Diego; Monthly; 1/1978 to 12/1986
• Los Angeles; Monthly; 1/1986 to 4/1991
• Los Angeles; Yearly; 1978 to 1985
Advantages: Users can exploit all data sets
Disadvantages: …if they don’t get overwhelmed first.

What Do We Show Users as Data Sets Available for Querying?
Possibility 2: “Least common denominator” of data sets
E.g., “only yearly data available”
Advantages: Users get a unified view of the data.
Disadvantages: Almost nothing is left!

What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view

Our Approach
• Users have a simple, unified view of the data.
• If a query cannot be answered, we answer the closest query that we have data for.
• Answers are always exact.
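
The "answer the closest query" idea can be sketched in Python as follows. This is a minimal illustration, not the demo's implementation: the `DataSet` class, the reduction of coverage to whole calendar years, and the proximity rule (same region and year, coarser granularity allowed) are assumptions made for the example. The catalog entries mirror the four BLS data sets listed for the demo.

```python
from dataclasses import dataclass

GRANULARITY_RANK = {"monthly": 0, "yearly": 1}  # finer granularities rank lower

@dataclass(frozen=True)
class DataSet:
    region: str
    granularity: str
    first_year: int  # first calendar year with any data (simplification)
    last_year: int   # last calendar year with any data (simplification)

# The four demo data sets, reduced to year-level coverage.
CATALOG = [
    DataSet("US", "monthly", 1973, 1991),
    DataSet("San Diego", "monthly", 1978, 1986),
    DataSet("Los Angeles", "monthly", 1986, 1991),
    DataSet("Los Angeles", "yearly", 1978, 1985),
]

def closest_answerable(region, granularity, year):
    """Return (data set, relaxed?) for the nearest answerable query, or None.

    Assumed proximity rule: keep region and year fixed, and accept the
    finest available granularity that is not finer than the one requested.
    """
    wanted = GRANULARITY_RANK[granularity]
    candidates = [d for d in CATALOG
                  if d.region == region
                  and d.first_year <= year <= d.last_year
                  and GRANULARITY_RANK[d.granularity] >= wanted]
    if not candidates:
        return None
    best = min(candidates, key=lambda d: GRANULARITY_RANK[d.granularity])
    return best, best.granularity != granularity
```

Under these assumptions, `closest_answerable("Los Angeles", "monthly", 1979)` falls back to the yearly Los Angeles data set and flags the result as relaxed, so the interface can tell the user that a different query was answered. A user-specific notion of proximity would replace the fixed rule in the function.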
Example over BLS Sources
• View: monthly data available for all geographical regions
• Query: monthly prices for LA in 1979
• Answer: yearly price for LA in 1979

What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view
Advantages: Users get a unified view of the data; most data sets exploited.
Disadvantages: Sometimes user queries cannot be answered.

Key Challenges
• Defining query proximity (“default” vs. user-specific)
• Communicating “query relaxation” to users
• Defining and navigating the space of “answerable queries” efficiently

Proof-of-Concept Demo
• Four BLS sources
• Simple integrated view
• Results for “closest” query when original answer cannot be computed
http://db-pc01.cs.columbia.edu/digigov/Main.html

Some Open Issues
• Definition of “right view”
• Interaction with user interface
• Addition of aggregation into ISI’s SIMS system

Aggregation in Main Memory
Kenneth A. Ross, Kazi A. Zaman
Columbia University, New York

Research Experience
• Complex query processing
• Data Warehousing
• Main memory databases

[Diagram: User sends Query to Mediator; Mediator contacts MainMemory DBMS, ..., Traditional DBMS; Unified Results return to User]

Outline
• Introduction to Datacubes
• Frameworks for querying cubes
• The Main Memory based framework
• Experimental Results
• Conclusions and Plan

The CUBE BY Operator
Input relation:
State   Year   Grade     Sales
CA      1997   Regular   90
NY      1997   Premium   70
CA      1998   Premium   65
NY      1998   Premium   95
CUBE BY (sum Sales) produces, among others:
State   Year   Grade     Sales
CA      1997   Regular   90
CA      1997   ALL       90
ALL     1997   Regular   90
CA      ALL    Regular   90
ALL     1997   ALL       160
ALL     ALL    Regular   90
CA      ALL    ALL       155
ALL     ALL    ALL       320
Additional records …….
Large increase in total size, especially with many dimensions

Lattice Representation
[Diagram levels: State; Year; Grade / State, Year; State, Grade; Year, Grade / State, Year, Grade]

Modeling Queries
Slice Queries ask for a single aggregate record
SELECT State, year, sum(sales)
FROM BLS-12345
GROUP BY State, year
HAVING State = “NY” AND year = “1998”

Existing Frameworks
Choose subset of cube to materialize based on workload.
Materialize on disk.
Appropriate record recovered or computed for incoming slice query.
Drawbacks:
Ignores clustering of relation on disk.
Smallest unit of materialization is too big.
[Lattice diagram: State; Year; Grade; State, Year; State, Grade; Year, Grade; State, Year, Grade]

Our approach
The full cube is often larger than available memory, but ...
The finest granularity aggregate may fit.
Any record can be computed without having to go to disk.
How should the finest granularity be organized?
[Lattice diagram: State; Year; Grade; State, Year; State, Grade; Year, Grade; State, Year, Grade]

Framework
[Diagram: Query q goes to the Level-1 Store (selected coarse records in hash table) and the Level-2 Store (finest granularity cuboid; records in linked lists; slot directory)]

The Level-1 Store
Records are <Key, Value> pairs stored in a hash table.
Records can contain ALL’s.
Key: a1, b2, c2, …
Value: 55, 34, 12, ...
Given query Q, form composite key and check level-1 store (constant time).
If not found, use level-2 store.

The Level-2 Store
Finest granularity cuboid; records in linked lists.
Slot directory is organized as a multidimensional array: level2[sz1][sz2][sz3][sz4]
Each slot points to a linked list of elements.
Records placed according to set of mapping functions H.

Using the Level-2 store
Query Q without ALL’s
a3 maps to Slot 4; b4 maps to Slot 3; c2 maps to Slot 7; d5 maps to Slot 1
Access list denoted by level2[4][3][7][1]; aggregate those matching (a3,b4,c2,d5).
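
The lookup described on these slides (check the level-1 hash table first, then go to the slot directory and aggregate matching records) can be sketched in Python. This is an illustration under assumptions, not the authors' implementation: the class name `TwoLevelStore`, the byte-sum mapping function in `_h`, and the use of a dictionary in place of a multidimensional array are all hypothetical. ALL is treated as a wildcard that visits every slot of its dimension.

```python
import itertools

ALL = "ALL"

class TwoLevelStore:
    """Toy two-level store: level-1 hash table plus level-2 slot directory."""

    def __init__(self, base_records, sizes, level1=None):
        self.sizes = sizes                 # slots per dimension (sz1, sz2, ...)
        self.level1 = dict(level1 or {})   # selected coarse records, keyed by tuple
        self.level2 = {}                   # slot coordinates -> list of records
        for key, value in base_records:
            self.level2.setdefault(self._slot(key), []).append((key, value))

    def _h(self, dim, value):
        # Stand-in for the mapping function H of one dimension (an assumption).
        return sum(str(value).encode()) % self.sizes[dim]

    def _slot(self, key):
        return tuple(self._h(d, v) for d, v in enumerate(key))

    def query(self, key):
        """Return the summed value for a slice query; ALL acts as a wildcard."""
        if key in self.level1:             # constant-time level-1 lookup
            return self.level1[key]
        # One slot for a concrete value, every slot of the dimension for ALL.
        ranges = [range(self.sizes[d]) if v == ALL else [self._h(d, v)]
                  for d, v in enumerate(key)]
        total = 0
        for slot in itertools.product(*ranges):
            for rec_key, value in self.level2.get(slot, []):
                if all(q == ALL or q == r for q, r in zip(key, rec_key)):
                    total += value
        return total

# The four base records from the CUBE BY example.
SALES = [(("CA", 1997, "Regular"), 90), (("NY", 1997, "Premium"), 70),
         (("CA", 1998, "Premium"), 65), (("NY", 1998, "Premium"), 95)]
store = TwoLevelStore(SALES, sizes=(2, 2, 2), level1={(ALL, ALL, ALL): 320})
```

With this toy store, `store.query(("CA", ALL, ALL))` scans the slots reachable for state CA and sums the two matching base records, while `store.query((ALL, ALL, ALL))` is answered directly from level 1. How many coarse records to place in level 1 and how large to make the slot directory are the design questions the remaining slides address.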
Using the Level-2 store
Query Q with ALL’s
a3 maps to Slot 4; ALL maps to a list of slots; c2 maps to Slot 7; ALL maps to a list of slots
Access lists matching level2[4][*][7][*]; aggregate those matching (a3,*,c2,*).

Experimental Results
[Figure: Query Processing Time vs Additional Memory Used (real dataset, 10^6 records, 8 dimensions). x-axis: additional memory used in MB (0 to 80); y-axis: average time per query in milliseconds (0 to 15); series: Query Cost.]
Scanning all records takes 194 ms.

Importance of Work
• Aggregation is fundamental to analysis.
• Make analysis interactive.
• Make a variety of aggregate granularities available, where possible.

Contributions and Plan
• A Main Memory based framework for answering datacube queries efficiently.
• Query Performance in the 2-4 ms range which is more efficient than going to disk.
• Goal: Integrate within Columbia/ISI system to facilitate interactive analysis.

Experimental Results
[Figure: Distribution of tuples in Level-1 store. x-axis: additional memory used in MB (0 to 80); y-axis: number of tuples in level-1 store (0 to 1,200,000); series: 4 ALLS, 5 ALLS, 6 ALLS, 7 ALLS, 8 ALLS.]

Workload Based Distribution
Each possible query record is assigned a probability. (Nonzero probability for some records not in cube.)
Uniform Cuboid:
Each cuboid has equal probability
Each record in cuboid has same probability
Count Based:
Each cuboid has equal probability
Probability proportional to count of record

Existing Cost Models
Linear Cost Model:
Cost proportional to number of records read.
If cuboid is not materialized, use smallest materialized ancestor.
Drawbacks:
Ignores clustering of relation on disk
Smallest unit of materialization is a cuboid

Design Decisions
• Mapping functions H
• level2[sz1][sz2][sz3][sz4]: choice of sz values.
• Size of level-1 store vs level-2
• Choice of level-1 records

Choice of Mapping Functions
Mapping functions aim for uniform distribution.
We know the single attribute distribution in advance.
Exact problem is intractable, use heuristics.
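
One possible heuristic for choosing a mapping function is greedy load balancing: place attribute values in decreasing order of frequency, each into the currently lightest slot. The Python sketch below is an assumption for illustration; the slides do not say which heuristics were actually used, and `assign_slots` is a hypothetical name.

```python
import heapq

def assign_slots(frequencies, num_slots):
    """Greedy balancing: heaviest value first, always into the lightest slot."""
    heap = [(0, slot) for slot in range(num_slots)]  # (current load, slot id)
    heapq.heapify(heap)
    mapping = {}
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    for value, freq in ordered:
        load, slot = heapq.heappop(heap)   # lightest slot so far
        mapping[value] = slot
        heapq.heappush(heap, (load + freq, slot))
    return mapping
```

For frequencies a1 = 5, a2 = 10, a3 = 5 and two slots, this sketch puts a2 alone in one slot and a1 and a3 together in the other, so both slots carry a load of 10.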
Example
Attribute frequencies: a1 = 5, a2 = 10, a3 = 5
If the range size is 2, a2 maps to one slot, a1 and a3 to the other.

Level-2 choice of Range Sizes
Level 2 slot array: level2[sz1][sz2][sz3][sz4]
Given slot array size $T$: $T = sz_1 \cdot sz_2 \cdot sz_3 \cdot sz_4$
If all cuboids are equiprobable, pick uniform range sizes: $sz_1 = sz_2 = sz_3 = sz_4 = T^{1/4}$
In general, nonlinear integer programming problem.

Optimizing Slot Array Size
If T is too big:
Too many empty slots are checked.
Less space available for level-1 store.
If T is too small:
Too many records examined for each query.

Cost of using Slot Array
$T$: Number of slots
$n$: Number of finest granularity records
$s$: Cost of slot access
$l$: Cost of list element access
$d$: Dimensionality of dataset
A query with $k$ ALL’s accesses $T^{k/d}$ slots, for a cost of $T^{k/d}\,(s + nl/T)$

Average Cost of level-2 store
$p_i$: probability of record $i$ being queried
$k_i$: number of ALL’s in record $i$
$A$ = average cost of lookup for all non-materialized records:
$A = \sum_{i} p_i\, T^{k_i/d}\,(s + nl/T)$, where $i$ ranges over all records not in the level-1 store

Benefit of a level-1 record
$p_i$: probability of record $i$ being queried
$k_i$: number of ALL’s in record $i$
Expected benefit of materializing record: $B = p_i\, T^{k_i/d}\,(s + nl/T)$
Exponential in $k_i$, linear in $p_i$

The Tradeoff Equation
Given unit extra space, do we increase the slot array size or that of the level-1 store?
Pick option which provides greater average reduction in query time.
Level-1 benefit: $B$
Level-2 benefit: $-A'$, where $A' = dA/dT$
Level-1 storage cost: $q$
Level-2 slot size: $z$
Benefit per unit space:
Level-1: $B/q$
Level-2: $-A'/z$
Allocate memory to obtain larger benefit.
Repeat for next unit of memory (parameters have changed)

Experimental Setup
8 dimensional dataset of Cloud coverage data
64 bits for CUBE BY attributes
32 bits for aggregate
1015367 base records (12 Mbytes)
Size of total datacube = 102745662 records (1.2 Gbytes)
Algorithms implemented in C
300 MHz Sun Ultra-2 running Solaris 5.6
Results shown for count based distribution

Experimental Results
[Figure: Size of Slot Array. x-axis: additional memory used in MB (0 to 100); y-axis: size of slot array in MB (0 to 3); series: Slot Array Size.]

Experimental Results
[Figure: Size of Level-1 Store. x-axis: additional memory used in MB (0 to 100); y-axis: size of level-1 store in MB (0 to 120); series: Size of Store.]

Experimental Results
[Figure: Levelwise Breakup of tuples in level-1 store. x-axis: additional memory used in MB (0 to 100); y-axis: percentages of tuples per level in level-1 store (0 to 100); series: 4 ALLS, 5 ALLS, 6 ALLS, 7 ALLS, 8 ALLS.]

Experimental Results
[Figure: Update Costs. x-axis: additional memory used in MB (0 to 100); y-axis: average update cost in milliseconds (0 to 3); series: Cuboid Info, Independent.]

See paper for…
• More details on updates.
• Hierarchies in attributes.
• Range queries.
• More experiments.

User Interfaces for DGRC-EDC
Steven Feiner, Surabhan Temiyabutr
Department of Computer Science, Columbia University, New York, NY 10027
Supported by NSF Grant EIA-9876739

Approach
• Redesign current UI
– Heuristic analysis and informal experiments
• Formal experiments and feedback

Redesign: First Steps

Redesign: Next Steps
• Potential problem areas
– Query
• Alleviate “peep-hole” confusion of walking menu
– Results
• May interface with Marchionini et al. table browser
– Ontology
• Explore graph presentation strategies: layout, distortion viewing (e.g., fisheye), hierarchy, filtering

Redesign: Next Steps
• Potential problem areas (cont.)
– History
• Support reuse and modification of previous queries
– Metainformation
• Determine utility and presentation approaches
– Integration
• Maintain consistency/linkage across displays
– Substitution
• Leverage ontology

Experiments
• Design/perform/analyze formal user experiments at BLS et al.
• Feed back experimental results to UI design

Extracting Information from Domain Specific Glossaries
Judith L. Klavans, Brian Whitman

System Architecture
(diagram repeated from Part I)

Extracting and Structuring Metadata from Text
Judith Klavans, Dir of CRIA, Columbia
Brian Whitman, GRA, Columbia
Problems:
– Proliferation of terms in domain
– Agencies define terms differently
– Many refer to the same or related entity
– Lengthy and dense term definitions often contain important information which is buried

Gasoline: Sample Definitions
• Gasoline: A volatile mixture of flammable liquid hydrocarbons derived chiefly from crude petroleum and used principally as a fuel for internal-combustion engines and as a solvent, an illuminant, and a thinner. (The American Heritage® Dictionary of the English Language, Third Edition)
• Gasoline: See regular gasoline. (Energy Information Administration, Gasoline Glossary 2000)

Regular Gasoline: Gasoline having an antiknock index, i.e., octane rating, greater than or equal to 85 and less than 88. Note: Octane requirements may vary by altitude. See Gasoline Grades.
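
A definition such as the Regular Gasoline entry carries structure (a genus word, numeric bounds, a cross-reference) that simple pattern matching can already expose. The Python sketch below is an illustration only and is not the LKB tool: the pattern list, the attribute names, and the `analyze` function are hypothetical, and real glossary entries would need linguistic analysis well beyond regular expressions.

```python
import re

# Hypothetical predefined attribute patterns; each captures one number.
PATTERNS = [
    ("greater-than-or-equal", r"greater than or equal to (\d+(?:\.\d+)?)"),
    ("less-than", r"less than (\d+(?:\.\d+)?)"),
    ("greater-than", r"(?:greater|more) than (\d+(?:\.\d+)?)"),
]

def analyze(entry):
    """Split 'Term: definition' and pull out a few structured attributes."""
    term, _, body = entry.partition(":")
    body = body.strip()
    quantifiers = {}
    for name, pattern in PATTERNS:
        match = re.search(pattern, body)
        if match:
            quantifiers[name] = match.group(1)
    return {
        "term": term.strip(),
        "genus-word": body.split()[0].lower(),   # first word of the definition
        "quantifiers": quantifiers,
        "xref": re.findall(r"See ([A-Z][A-Za-z ]+)\.", body),
    }

ENTRY = ("Regular Gasoline: Gasoline having an antiknock index, i.e., octane "
         "rating, greater than or equal to 85 and less than 88. Note: Octane "
         "requirements may vary by altitude. See Gasoline Grades.")
result = analyze(ENTRY)
```

For this entry the sketch recovers the term, the genus word "gasoline", the two numeric bounds, and the cross-reference to "Gasoline Grades". Taking the first word as the genus is a shortcut that happens to work here; it would fail on definitions that open with an article or modifier.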
Data sources: EIA Edited Gasoline Glossary; EIA Online Energy Glossary
[Layered diagram, built up over several slides. Layers: The Core; Large ontology (SENSUS); Concepts from glossaries (by LKB); Domain-specific ontologies (SIMS models); Data sources. Links: Linguistic mapping; Logical mapping.]

Glossary entries shown:
Regular Gasoline: Gasoline having an antiknock index, i.e., octane rating, greater than or equal to 85 and less than 88. Note: Octane requirements may vary by altitude. See Gasoline Grades.
Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines.

Large ontology (SENSUS) entry shown:
gasoline { petrol [N], gas [N], gasolene [N] }

Concept from glossaries (by LKB) shown:
(Regular Gasoline
(source …)
(xref "Gasoline Grades")
(full-def …)
(core-def …)
(genus-phrase "gasoline")
(head-word "gasoline")
(properties (contains "an antiknock index"))
(quantifiers (less-than "88"))
(note …))

Glossary analysis framework
• Gather glossaries, thesauri, definitions from govt agencies
• Create framework into which text will be analyzed
• Extract ontological information applying language sensitive analysis tools
• Structure and deliver to ISI for access and display
• Based on past projects:
– analysis of definitions in machine-readable dictionaries
• Original
– domain specific glossaries

Columbia’s Lexical Knowledge Base (LKB) Tool
Combines statistical and linguistic methods:
• identifies topics with high accuracy
• provides complete coverage
• useful for any subject area
• produced over 6,000 concepts in current domain

Extraction of Information from Domain Specific Glossaries
• Year One
– Built a definition analyzer using a combination of novel and known techniques
– Analyzed and structured over 6000 entries from 4 sites
– Tested on medical glossaries also
• Year Two
– Build a crawler to identify glossaries across sites
– Analyze additional data
– Evaluate nodes by social science experts
– Link our output to ISI’s ontology
– Develop first-stage merging representation structure (SENSUS)

Thank you! Any questions?

Lexical Knowledge Base Generation from Glossaries
CARDGIS / Digital Government Project
Brian Whitman, Columbia University

The Problem
• Unstructured glossaries provide needed information
• Create a common ontology across many datasets

An Example from EIA
• Input
Motor Gasoline Blending Components: Naphthas (e.g., straight-run gasoline, alkylate, reformate, benzene, toluene, xylene) used for blending or compounding into finished motor gasoline. These components include reformulated gasoline blendstock for oxygenate blending (RBOB) but exclude oxygenates (alcohols, ethers), butane, and pentanes plus. Note: Oxygenates are reported as individual components and are included in the total for other hydrocarbons, hydrogens, and oxygenates.
• Output
(Motor Gasoline Blending Components
(isa "Naphthas")
(used-for "blending")
(!contains "oxygenates"))

In the Way
• Lack of standards
• Complex input
• Automatic extraction
• Acronyms

The ALKB System
[Diagram components: ALKB, SLKB, Acrocat]

What’s a LKB?
• “Highly structured isomorph of a published dictionary” - Klavans, Boguraev, Byrd (1990)
• Definition → Structured tree

LKB Example
[Tree diagram for “gasoline”. American Heritage Senses: Sense 1…, Sense 2; Pronunciation: gas’e-len’; Used For...: fuel for internal combustion engines, propellant; Cross-ref; Columbia Encyclopedia Senses]

Step-by-step Process - Demo One
• Parses
• POS Tagging
• NP Identification
• Bigram frequency
• Two attribute types
– predefined
– automatic

Demo One

Predefined Semantic Attributes
• Developed after an analysis of the source material
• Examples:
– contains, includes, excludes
– less than, greater than, more than
– used for

Automatic Semantic Attributes
• Also uses the probability material to determine additional attributes
Head Term: Motor Gasoline (Finished)
Cross-reference
Genus Term: A complex mixture of relatively volatile hydrocarbons
Head Genus Word: mixture
Properties for use in: spark-ignition engines
Excludes-Includes: includes conventional gasoline
Acronym: ASTM [list]
Data on: all types, aviation gasoline, gasohol, gasoline
finished motor: gasoline, blending components
In data: aviation gasoline, gasohol

Acrocat - Acronym Cataloguer
• Glossaries full of acronyms
• Acrocat ‘dereferences’
• Guesses difficult acronyms
– RBOB = reformulated gasoline blendstock for oxygenate blending

What Acrocat Does
• Salience measures:
– Distance from named reference
– Capital letter match
– Length
• Crawls all pages within a domain

Demo Two

Future Work
• ALKB:
– Integration of data into ISI ontology
– Research on the semantic attribute set
– Tests over non-glossary data
• Acrocat:
– Evaluation
– Building with known dictionaries

EDS
October 30, 2000
Walter Bourne, Assistant Director, Academic Information Systems (AcIS)
walter@columbia.edu

EDS History & Operation
• BASR (Bureau of Applied Social Research), 1940s
• DARTS (Data Archive) established in the 1970s as part of the Center for the Social Sciences.
• EDS replaced DARTS in 1992.
• EDS is a joint operation of the Libraries and AcIS (Academic Information Systems).
• 4 full-time staff; 4 graduate assistants.
• 10 librarians in Social Science Division provide extended subject expertise.
EDS Services
• Data Library
• Data Finding
• Data Access
• Data Consulting
• Data Acquisition
• Statistical Programming Assistance
• Instructional Assistance

Who Uses EDS
[Figure: Visits/Contacts at EDS By Discipline. Bar chart, y-axis 0 to 350; categories: Economics, Poli Sci, Internat'l, Social Work, Health, Sociology, Teachers, Urban, Other.]
• EDS serves a wide variety of disciplines.
• 1,089 contacts were recorded in past year.
• Economics and Political Science are the most frequent users.
• Others: Undergrads, Journalism, Statistics

Data Library
• 1,285 studies are maintained online.
• 60 GB of data, including many gzipped files.
• A variety of sample and access programs for SPSS and SAS.
• A library of codebooks and manuals.
• Extensive local how-to documentation, http://www.columbia.edu/acis/eds

Data Access: The EDS DataGate
• Full-text search over abstracts and titles of studies.
• Abstracts are from ICPSR (Inter-University Consortium for Political and Social Research).
• System reflects current status of data by nightly update from the files on disk.
• A combination of an SQL DB, indexer, and CGI scripts.

The EDS DataGate
• The principal finding tool.
• Originally developed in 1994.
• Uses OpenText’s PAT indexing engine.
• Ingres is current DBMS.
• cron scripts update DB and Web nightly.
• http://www.columbia.edu/acis/eds/dgate.html

EDS and DGRC 1. Evaluation
• EDS provides access to a pool of data users (seekers?) of varying sophistication.
• EDS and Libraries staff have decades of experience doing and supporting social science research.
• An ideal combination for DGRC evaluation.

EDS and DGRC 2. Futures
• DGRC vision promises relief to EDS users and staff from the tough job of finding, preparing and analyzing data.
• Users will be able to concentrate on their analysis.
• Staff will be able to work on improving DGRC tools, incorporating their expertise.
The DGRC First-Year User Interface
Vasilis Hatzivassiloglou, Jay Sandhaus

Goals
• Provide a means for uniform access to heterogeneous, distributed statistical databases.
• Communicate in real time with the ontology and information mediator (SIMS) over the internet.

Tasks for First Year
• Select appropriate platform that combines ease of access, interactivity, and capability of advanced graphical displays.
• Develop communication APIs between the ontology and the UI and between the information manager and the UI.
• Design and implement prototype interfaces.

Interface platform
• A tradeoff between computational capabilities and accessibility to the casual user
• We experimented with several prototypes:
– Stand-alone application
– JAVA AWT applet
– JAVA Swing applet

Two layers
• Application layer
– Communicate with information manager using HTTP
– Combine information from multiple relational tables (joins, identification of common information)
• Presentation layer
– What the user sees

User Interface Components
• Query specification
• Error handling
• Ontology browsing
• Information display
• Representation of integrated information

User Interface Visual Structure
• Four main panels:
– Query specification + error handling
– Result display
– Ontology navigation and display
– Help and documentation
• User can switch between panels at any time.
Query Specification
• Three modes considered
– Restricted natural language input
– Direct SQL entry (for expert users)
– Guided selection of terms from context-sensitive menus (e.g., product, time period, location)
• In all cases, SQL query is formulated and sent to the information manager (SIMS)

Error Handling
• Syntactic checks (SQL and NL)
• Terms in the ontology (highlight unknown terms, allow browsing of the ontology for replacement)
• User can consult the ontology from any panel with a simple right-click on any term

Ontology browsing
• Graphical display of the ontology as a tree/directed graph
• Navigation capabilities for parent, children, and other related nodes
• Display of information associated with each node (e.g., definitions, source)

Result display
• Display of integrated result as a table
• Ability to refer back to source documents
• Display of extracted footnotes in separate area
• Display of relevant ontology terms

Data Granularity
• An issue that cuts across the interface and data integration
• Open questions:
– Do we impose a unified view?
– Do we allow the user to ever give a query with no answer available?

Other interface components
• What to show from the system’s integration operations?
• Possibilities:
– Data granularity level selected
– Sources relevant to the query
– Ontology terms used in answering the query

Current interface status
• Prototypes as stand-alone application and JAVA AWT and SWING applets
• Input: Menu navigation, SQL query, simple one-term query
• Output: Table returned over the internet by SIMS, along with associated displays of footnotes, sources, etc.
• Ontology browsing and display of properties
• SQL query construction and data exchanges with SIMS shown in log window
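
The guided query mode, in which menu selections are turned into an SQL query, can be sketched in Python. Everything specific here is an assumption for illustration: the table `gas_prices`, its columns, and the placeholder price are hypothetical, and an in-memory SQLite database stands in for SIMS, which the prototype reached over the internet.

```python
import sqlite3

ALLOWED_COLUMNS = {"product", "region", "year"}  # menu dimensions (hypothetical schema)

def build_query(selections):
    """Turn {column: value} menu picks into a parameterized SELECT."""
    unknown = set(selections) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown menu dimension(s): {sorted(unknown)}")
    columns = sorted(selections)
    sql = "SELECT product, region, year, price FROM gas_prices"
    if columns:
        sql += " WHERE " + " AND ".join(f"{c} = ?" for c in columns)
    return sql, [selections[c] for c in columns]

# Stand-in database with placeholder rows (the price values are not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gas_prices "
             "(product TEXT, region TEXT, year INTEGER, price REAL)")
conn.executemany("INSERT INTO gas_prices VALUES (?, ?, ?, ?)", [
    ("Leaded Regular", "Los Angeles", 1979, 1.0),
    ("Leaded Regular", "San Diego", 1979, 2.0),
])

sql, params = build_query({"region": "Los Angeles", "year": 1979})
rows = conn.execute(sql, params).fetchall()
```

Restricting column names to a fixed set and passing values as parameters keeps menu input from being spliced into the SQL text. The generated statement could also be written to a log window, as the prototype did with its SIMS exchanges.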