Configurable Meta-search in the Job Domain

Tabbasum Naz
Vienna University of Technology, Institute for Information Systems, Favoritenstrasse 9-11, A-1040, Vienna, Austria
Email: naz@dbai.tuwien.ac.at

Jürgen Dorn
Vienna University of Technology, Institute of Software Technology and Interactive Systems, Favoritenstrasse 9-11, A-1040, Vienna, Austria
Email: juergen.dorn@dbai.tuwien.ac.at

Alexandra Poulovassilis
London Knowledge Lab, Birkbeck, University of London, 23-29 Emerald Street, London WC1N 3QS, United Kingdom
Email: ap@dcs.bbk.ac.uk

Abstract: To aid job seekers searching for job vacancies, we have developed a new configurable meta-search engine for the human resources domain. In this paper we describe the three main components of our meta-search engine – the query interface generator, query dispatcher and information extractor – which collectively support meta-search engine creation and usage. One of the important challenges in accessing heterogeneous and distributed data via a meta-search engine is schema/data matching and integration. We describe an approach to schema and data integration for meta-search engines that helps to resolve the semantic heterogeneities between different source search engines. Our approach is a hybrid one, in that we use multiple matching criteria and multiple matchers. A domain-specific ontology serves as a global ontology and allows us to resolve semantic heterogeneities by deriving mappings between the source search engine interfaces and the ontology. The mappings are used to generate an integrated meta-search query interface, to support query processing within the meta-search engine, and to resolve semantic conflicts arising during result extraction from the source search engines. Experiments conducted in the job search domain show that our hybrid approach increases the correctness of matching during the automatic integration of source search interfaces. Our system aims to support job meta-search providers in the rapid development of meta-search engines. Our use of a domain ontology and multiple matchers helps in the semantic understanding of job descriptions and provides a job seeker with integrated access to jobs from a variety of Websites.

Keywords: Meta-search engine, schema matching and integration, information extraction and integration, ontology based schema integration, job search.

Biographical notes: Tabbasum Naz studied Computer Science and received her Doctorate from Vienna University of Technology, Austria in 2009. She also worked as a visiting researcher at the London Knowledge Lab, United Kingdom in 2008. Her Ph.D. dissertation is in the area of configurable meta-search in the human resource domain.

Juergen Dorn studied Computer Science and Economics at Technische Universität Berlin. He received his Ph.D. from Technische Universität Berlin in 1989 for a thesis on knowledge-based reactive robot planning. From 1989 to 1996, he was head of a group at the Christian Doppler Laboratory for Expert Systems in Vienna that developed several scheduling expert systems for the Austrian steel industry. He is now Professor of Business Information Systems at Vienna University of Technology.

Alexandra Poulovassilis studied Mathematics at Cambridge University. She received her Ph.D. from Birkbeck, University of London in 1990 for research into functional database languages.
She held posts at University College London and King’s College London before returning to Birkbeck as Reader in 1999 and Professor of Computer Science from 2001. Since 2003 she has been Co-Director of the London Knowledge Lab, a multi-disciplinary research institution which aims to explore the ways in which digital technologies and new media are shaping the future of knowledge and learning.

1 Introduction

Unemployment is a serious problem not only for developing nations but also for the developed world. At the time of writing, unemployment is rising in many sectors. According to (Unemployment Rate, 2008), unemployment rates in 2008 were 9.10% in Germany, 7.50% in Pakistan, 4.30% in Austria, 4.60% in the United States and 5.40% in the United Kingdom. One of the contributory factors to unemployment is problems in individuals’ searches for jobs appropriate for them, and in the distribution of information about job vacancies.

The Web has drastically increased the availability of information. However, the volume and heterogeneity of the information that is available via Websites makes it difficult for a user to visit every Website that may be relevant to their information needs. Traditional search engines are based on keyword or phrase search, without taking into account the semantics of the word or phrase, and hence may not provide the desired results to the user. Other traditional search tools suffer from low recall and precision. To overcome these problems, meta-search engines aim to offer topic-specific search using multiple heterogeneous search engines.

If we compare general purpose search engines with domain specific meta-search engines, we observe that domain specific queries cannot be handled effectively by general purpose search engines. General purpose search engines may produce many results for a submitted keyword or phrase, but many of these results will be irrelevant to the user’s needs. Not all of the results will be usable, and the user may have to navigate a large number of results to find the domain specific ones. This is because general purpose search engines are normally developed to meet the needs of users’ general queries, not of domain specific queries.

Meta-search engines provide access to information from multiple search engines simultaneously. They increase coverage of the Web by combining the coverage of several search engines. They also make the user’s task much quicker and easier by allowing users to submit just one search query, rather than several, and by automatically retrieving and ranking results from multiple search engines. They have the ability to search the “deep” Web too, thus improving precision and recall (Meng et al. 2001).

Our focus in this paper is to address the problems of searching for job vacancies from multiple job search engines by the construction of a job meta-search engine. The main technical challenges involved in this approach are: (a) schema and data matching and integration in order to resolve semantic differences between different search engines; (b) automatic integration of different job search interfaces to develop a single meta-search engine query interface; and (c) translation of results from different job search engines into a common format for presentation to the user.
Regarding (a), much prior research has focused on developing techniques for schema matching and mapping: schema matching identifies correspondences between elements from different schemas, while schema mapping defines these correspondences, i.e. provides view definitions that link the two schemas (Rahm et al. 2001). Schema matching and mapping may generally be undertaken manually, semi-automatically or automatically. However, the volume and heterogeneity of Web data, in particular, mandate the development of automatic schema matching and mapping techniques. Different types of heterogeneity may arise when schemas are matched, e.g. syntactic, semantic and structural; different types of semantic conflicts may arise, e.g. confounding, scaling, naming (Wache et al. 2001); and there may be different mapping cardinalities between elements from different schemas: 1:1, 1:n, n:1 or n:m (Embley et al. 2004).

In our setting, we are concerned with automatic schema and data integration of information arising from different Web portals. Consider, for example, the three job search interfaces shown in Figures 1, 2 and 3. There is semantic heterogeneity both at the schema level and the data level. At the schema level, we see that three different concepts, “career type”, “categories” and “select a category”, are used to represent the same category of information. At the data level, we see that canjobs.com uses “Administrative support”, careerbuilder.com uses “Admin – Clerical”, and jobs.net uses “Admin & Clerical” to represent the same item of information. For business-related jobs, careerbuilder.com uses “Business Development” and “Business Opportunity” while jobs.net uses “Business Development” and “General Business”.

The ontology integration, data integration and information integration research communities are addressing similar types of problems in matching and integrating heterogeneous schemas and ontologies. One of the foundations of our research is that the techniques developed by these communities are of relevance to schema/data matching in meta-search engines, and that a combination of approaches is required, cf. (Madhavan et al. 2001) and (Noy, 2004).

We have developed an operational prototype meta-search engine for the jobs domain. Our focus in this paper is on the three main components of our job meta-search engine: the query interface generator, query dispatcher and information extractor. The query interface generator focuses on schema matching and integration techniques aiming to resolve semantic conflicts between different search engines. Semantic heterogeneity both at the schema and the data level needs to be resolved. Appropriate integrated terms, both at the schema and the data level, must be selected for the meta-search interface. The mappings between the source and integrated schemas need to be used for query processing by the query dispatcher. The mappings also need to be used by the information extractor in order to resolve semantic conflicts arising during result extraction from multiple source search engines. We have developed a configurable approach to meta-search engine construction that aims to meet these requirements.

The contributions of this paper are the use of a domain ontology and multiple schema/data matching techniques in order to resolve semantic heterogeneities between the search interfaces of source web search engines, and also between the results pages that they return.
Our techniques are general and can be used in the development of meta-search engines in any domain provided there is an appropriate ontology for that domain. Our techniques contribute to more comprehensive and more concise meta-search query interface generation, more accurate query processing, and more comprehensive and concise presentation of search results.

The rest of the paper is organized as follows. Section 2 reviews related work in meta-search engines, schema and ontology matching/mapping, Web data integration systems and wrapper generation. Section 3 presents our overall design for job meta-search engine construction, the main components of a meta-search engine, and the development of an ontology for the jobs domain. Section 4 describes a case study in job search and some experimental results from this case study. Section 5 discusses our contributions and gives directions for future work.

2 Related Work

There has been much work on meta-search engines and we discuss here an indicative subset. The meta-search engine presented in He et al. (2004) consists of the “WISE-iExtractor” for interface extraction and the “WISE-Integrator” for automatic integration of schema, attribute values, format and layout. He et al. (2004) uses traditional dictionaries along with multiple matching techniques to find semantic similarity between schema elements and values. The Lixto Suite (http://www.lixto.com) provides a platform to access Web sites and extract information from them. Lixto’s meta-search uses the Lixto visual wrapper for the extraction of relevant information and then integrates the different Web sites into a meta-search application. Ondrej (2006) introduces a special-purpose meta-search engine named “Snorri” which extends the Lixto meta-search by eliminating limitations such as synchronous provision of results. The MetaQuerier project explores and integrates databases that are not visible to traditional crawlers. It consists of a meta-explorer and a meta-integrator and uses a statistical/probabilistic approach for schema matching during the integration process (Chang et al. 2005).

There has also been much work on schema/data matching and mapping. Rahm et al. (2001) and Shvaiko et al. (2005) present reviews and classifications of the main schema matching approaches. Cupid (Madhavan et al. 2001) uses multiple matching approaches, and also a thesaurus to find acronyms, short forms and synonyms of words. COMA++ (Aumueller et al. 2005) supports multiple matchers and also uses a taxonomy that acts as an intermediate ontology for schema or ontology matching. Hakimpour et al. (2002) merge different ontologies into a single global schema, using similarity relations. Embley et al. (2004) use a combination of structural similarity between two schemas and a domain-specific ontology to discover mappings. String distances can also be utilized for schema matching, e.g. for matching entity names (Cohen et al. 2003). Techniques developed for schema matching can also be employed for ontology merging: Linková (2007) distinguishes between using a single ontology describing all the sources, and using multiple ontologies – one for each data source – which are then merged to form a single shared ontology. There has also been much research into ontology matching and mapping, e.g. Wache et al. (2001), Noy (2004) and Linková (2007).

In the information extraction area there has been research in wrapper induction techniques. For example, Zhao et al.
(2005) utilize visual content features and the tag structure of HTML result pages for the automatic generation of a wrapper for any given search engine. Baumgartner et al. (2001) describe the Lixto visual wrapper generator, which provides a visual interactive interface supporting semi-automatic wrapper generation. The relevant information is extracted from HTML documents and translated into XML, which can then be queried and processed further.

Compared to WISE and MetaQuerier, our approach to schema matching also uses a domain ontology to resolve semantic conflicts. Compared to Lixto, we aim to provide an automatic and simple construction process for information extraction. Compared to vertical search engines (such as www.kayak.com and www.skyscanner.net), which provide hard-wired solutions, we are aiming for a configurable and extensible approach. Dorn et al. (2006) described a domain-specific scenario of job portal integration but did not describe a full meta-search engine. Dorn et al. (2008) discussed design patterns appropriate for the construction of meta-search engines. This paper extends our previous work by giving details of the main components of our job meta-search engine, describing the techniques we use for schema/data matching and integration and for result extraction, and presenting an evaluation of our techniques. Finally, we distinguish our work from Web-scale architectures such as PAYGO (Madhavan et al. 2007), in that we are aiming to develop techniques to support the construction of domain-specific meta-search engines, rather than Web-scale search of the deep Web. Our aim is to combine the respective benefits of vertical search engines, meta-search engines and semantic search engines within a domain-specific context, in which there is a well-understood domain ontology.

3 Meta-search Engine Design

In this paper, we are concerned with techniques to support two key aspects of job meta-search engines: i) meta-search engine creation by meta-search engine providers and ii) meta-search engine usage by job seekers. Our approach to meta-search is configurable in the sense that there are two different architectures for these two different processes. Figures 4 and 5 show the components of our meta-search architecture that support these two processes. In Section 3.1 we focus in more detail on several key components of this architecture. In Section 3.2 we also briefly discuss the development of an ontology for the jobs domain. Here, we first give an overview of how the various components support processes i) and ii).

The meta-search engine creation process (see Figure 4) is as follows. First, the job meta-search provider submits its preferences via the Preference Collector (our approach to meta-search is therefore also configurable in the sense that we can do real-time tuning of the meta-search and handle preferences of the meta-search provider). Currently, preferences may specify the geographical areas or job categories for which meta-search is required. The Job Search Engine Selector is then activated and job search engines that meet the preferences of the meta-search provider are selected from an already known set of URLs of candidate job search engines. Next, the Interface Extractor derives attributes from those job search engines’ interfaces. The process of interface extraction has two phases: attribute extraction and attribute analysis. The XML Schema Generator then creates an XML schema corresponding to each search interface.
We assume that a jobs ontology is available for the job meta-search engine (see Section 3.2 below). Several matchers (see Section 3.1.3.1 below) are used by the Job Meta-search Query Interface Generator in order to create mappings between the source XML schemas and the ontology, and hence indirect mappings between the different XML schemas. The Query Interface Generator also generates a single query interface for the meta-search.

The meta-search engine usage process (see Figure 5) is as follows. A job seeker can access and use the job meta-search interface generated by the job meta-search creation process. Queries submitted to the query interface are re-written by the Query Dispatcher in order to target the individual source search engines, using the mapping rules. The query dispatcher submits the re-written queries to the individual search engines. The result pages from the various search engines are passed to the Information Extractor (IE) component. Automatic wrapper generation techniques are used for the implementation of this component. In particular, the IE consists of Record Collector and Result Collector sub-components. The Record Collector is responsible for automatic identification of the record section from each result page, i.e. a list or table containing job records. It also identifies the required URL and title of each identified record. The Result Collector visits the identified URL and is responsible for extracting the job description and fields, e.g. job salary, job start date, job requirements, from the result page. Since different job search engines use different concepts and data structures for their result pages, the Result Collector again utilizes the domain ontology and a variety of matching techniques in order to conform the different concepts and data structures of result descriptions and result attributes, and to convert them to a single common format for presentation to the job seeker. The conformed results are merged by the Result Merger component. Duplicate results are removed by the Duplicate Result Eliminator and the remaining results are stored in a database for further use. Finally, the results are ranked by the Result Ranker according to the preferences of the job seeker and displayed to the job seeker.

3.1 Meta-search Engine Components

The main components of the meta-search engine are described in the following subsections.

3.1.1 Interface Extractor

The Interface Extractor component derives attributes and metadata from the search engines’ interfaces. The process of interface extraction has two phases: attribute extraction and attribute analysis. During attribute extraction, individual labels and form control elements (e.g. input fields, checkboxes, and radio buttons) are extracted. Text between elements is extracted to determine labels for control elements. <BR>, <P> and </TR> tags are also extracted to determine the physical location of elements and labels. Extra scripting and styling information, e.g. font sizes and styles, is ignored. Logically, elements and their associated labels together form different attributes. Attributes can have one or more labels and elements. To provide a physical layout of a search interface, an interface expression (IEXP) is constructed. The IEXP is used to group the labels and elements that semantically correspond to the same attribute, and to find an appropriate attribute label for each group. For grouping labels and elements, the LEX (layout-expression-based extraction) technique described in (He et al. 2005) is used.
LEX finds an appropriate attribute label for each element, either in the same row or above the current row. Our interface extractor uses heuristic measures, i.e. a colon at the end of the text, the nearest neighbouring label of an element, the distance between element and text, the vertical alignment of element and text, and finally the number of labels with an ending colon, to identify an appropriate label for elements (He et al. 2005) (Naz, 2006).

During attribute analysis, we identify the relationships between the extracted attributes: a set of attributes may be inferred as forming a group, or an attribute may be inferred to be ‘part-of’ another attribute. We undertake this identification by analyzing the HTML control elements in the search interface; the order of labels and control elements; and keywords or string patterns appearing in the attribute labels. Next, we similarly derive metadata about these attributes, e.g. how many values can be associated with an attribute, default value, value type etc. Currently, string, Boolean, integer, date, time and currency types are supported. The range of values that an attribute can take may be finite (e.g. selected from an enumerated list), infinite (entered as free text by the user), or comprise a range of lower and upper values. To illustrate, Table 1 shows the attribute names and other metadata collected during the interface extraction and attribute analysis phases for the job search interface shown in Figure 3.

3.1.2 XML Schema Generator

The metadata created by the interface extractor is used by the XML schema generator to define the building blocks of an XML schema document describing the source search interface. All simple and complex elements identified by the interface extractor are represented in this XML schema. Simple elements contain only text while complex elements can contain other elements and may also contain attributes. Text boxes and text areas in the source search interface are represented as simple elements. A group of radio buttons in the source interface is also a simple element having a default value, restriction and enumeration list. The text or label associated with a radio button is taken as the value for that radio button. Multiple checkboxes with domain type “group” are treated as a complex element with attributes “fixed” and “minOccurs”. If an HTML select list does not contain the attribute “multiple” then the select list is a single-select list, otherwise it is a multiple-select list. A single-select list is treated as a simple element having a default value, restriction and enumeration list, in the same way as radio buttons. A multiple-select list is treated as a complex element.

Table 1 Meta-information for jobs.net

Attribute Name       Relationship Type   Domain Type   Default Value          Value Type   Unit
enter_keywords       None                Infinite      Nil                    String       Nil
enter_a_city         None                Infinite      Nil                    String       Nil
select_a_state       None                Finite        -all united states-    String       Nil
select_a_category    None                Finite        -all job categories-   String       Nil
jobs_posted_within   None                Finite        last 30 days           String       Nil
employment_type      Group               Finite        Nil                    String       Nil

3.1.3 Query Interface Generator

A key requirement in the creation of a job meta-search engine is to provide automatic techniques for schema/data matching and integration. We use multiple matchers, and the mappings generated are stored in XML format for subsequent use by the query interface generator and query dispatcher components.
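To make the rules of the XML Schema Generator (Section 3.1.2) concrete before turning to the matchers, the following is a minimal sketch, in Python, of how the Table 1 metadata for a single-select list might be rendered as a simple element with a default value, restriction and enumeration list. The function name, the XSD layout and the example value list are illustrative assumptions, not our actual implementation:

def single_select_to_xsd(name, default, values):
    """Render a single-select list (or a radio-button group) as a simple
    element with a default value, a restriction and an enumeration list."""
    enumerations = "\n".join(
        '        <xs:enumeration value="%s"/>' % v for v in values)
    return ('<xs:element name="%s" default="%s">\n'
            '  <xs:simpleType>\n'
            '    <xs:restriction base="xs:string">\n'
            '%s\n'
            '    </xs:restriction>\n'
            '  </xs:simpleType>\n'
            '</xs:element>') % (name, default, enumerations)

# the "jobs_posted_within" attribute of Table 1; the enumerated values
# other than the default are assumed for illustration
print(single_select_to_xsd("jobs_posted_within", "last 30 days",
                           ["last 7 days", "last 30 days", "last 60 days"]))

A multiple-select list or a checkbox group would instead be emitted as a complex element, as described above.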
We adopt a single-ontology approach and utilize the domain ontology to find matchings between attributes of different search engine interfaces; a synonym matcher is also used during this process. After schema/data matching and integration, a query form for the meta-search engine is generated, using XForms (Rainer et al. 2004). Since our meta-search engine generation and creation processes use multiple matching criteria and multiple techniques/matchers, we term this a ‘hybrid’ approach. We use a combination of element-level techniques, structure-level techniques and ontology-based techniques to find similarity between schema elements, and between data values (see Shvaiko et al. (2005) for a general review of the main techniques used in schema matching). The matching techniques that we use are described briefly in Section 3.1.3.1. Section 3.1.3.2 then discusses our schema and data integration algorithm.

3.1.3.1 Matching Techniques

The element-level techniques we use include a string-based matcher, language-based matcher, data value cardinality matcher, ID matcher, default value matcher and alignment reuse matcher.

Our string-based matcher uses a stemming algorithm and different string distance functions to find the similarity between strings. In particular, the Porter stemming algorithm removes the prefix and suffix of a string and handles singulars and plurals of concepts, after which the similarity between the strings is determined (http://tartarus.org/~martin/PorterStemmer). The following are examples resolved with the Porter stemming algorithm:

Keywords → keyword, Provinces → province, States → state, Posted Date → Post Date, Job Types → Job Type, Starting date → Start Date

We utilize three different string distance algorithms: Levenshtein distance, Cosine similarity and Euclidean distance (http://www.dcs.shef.ac.uk/~sam/stringmetrics.html). If the sum of their similarity scores exceeds a threshold value, we consider this a positive match. The following are examples of strings matched using these string distance functions:

business operations → business, intern → internship, engineering software → software engineering, contractor → contract

Our language-based matcher is based on natural language processing techniques, including tokenization and elimination. The following are examples of strings transformed using tokenization and elimination:

“Enter a Keyword” → keyword, “Career type(s)” → career types, “Select a State:” → state, “Fulltime” → full time

Our data value cardinality matcher uses the cardinality of attributes to find a match. For example, an attribute “Job Type” containing 7 data values may match either an attribute “Type of Hour” containing 8 data values or an attribute “Job Category” containing 44 data values. In this situation, the numbers of data values can be compared, from which it can be inferred that attribute “Job Type” is more similar to “Type of Hour”.

If element name matching fails, then our ID matcher may help to find a match (the name of an input element from the HTML job search page is stored in our XML schema as an attribute ID, hence the name ‘ID matcher’). Some examples of IDs from job search engines for the element “keyword” are: qskwd, keywords, jobsearch, keywordtitle, kwd. Suppose a search engine contains an element with name “Type of Skills” and ID=“kwd”, and suppose that the element name fails to match with any element in the ontology. In this situation, the ID matcher will be utilized and it will compare “kwd” to elements of the ontology, e.g.
keyword, type of hour, job category etc. With the help of the string distance functions above, the ID matcher will find a similarity between “kwd” and “keyword”.

Sometimes, search engine interfaces provide default values for attributes, so that if the user does not select any value, the default value is used. If a default value is available, our default value matcher can be helpful in increasing the matching results. For example, suppose there is ambiguity between the “Job Type” attribute of one schema and the “Type of Hour” or “Job Category” concepts of the domain ontology. The default value matcher can find that the default value “intern” of the “Job Type” attribute matches the data value “internship” of the “Type of Hour” concept.

As already noted, our schema/data matching process is based on a domain ontology. This ontology is incrementally extended with synonyms and hypernyms of attributes from previously matched schemas. As soon as a new matching is found, we store it in the domain ontology. When matching fragments of schemas, we employ an alignment reuse matcher to reuse these previously stored match results: if there already exists a matching for an attribute, then there is no need to attempt to match the attribute again.

Structure-level matchers (Rahm et al. 2001) consider a combination of elements that appear near to each other within a schema in order to identify a match. Two elements are considered similar if the elements in their local vicinity are similar. In particular, bottom-up structure-level matchers compare all elements in the sub-trees of two elements before the two elements themselves are compared, i.e. data values are considered first. Top-down matchers compare parent elements first and, if these show some similarity, their children are then compared. This is the cheaper approach, and it is the one we utilize, although it may miss some matches that a bottom-up matcher would detect. For example, suppose there is a choice in matching an attribute “Job Type” of a schema with either attribute “Type of Hour” or attribute “Job Category” of the ontology. Our top-down matcher will match the children of “Job Type” with the children of “Type of Hour” (e.g. full time, part time, contract etc.) and with the children of “Job Category” (e.g. computer science, business, engineering etc.). It will select whichever of these two attributes has the set of children with the closest combined match to the children of “Job Type”.

Finally, with respect to ontology-based techniques, we use a single-ontology approach, and the domain ontology acts as a global ontology. After completion of the schema integration process, the meta-search query interface is generated, containing concepts from this domain ontology. We recall that an XML schema is generated for every search engine to be included in the meta-search. A synonym matcher is used to find similarities between such a source schema S and the global ontology OG, using the synonyms associated with concepts in OG. For example, in the job domain, synonyms for “job category” might be “industry”, “occupation”, “career type”, “function”. We note that a domain-specific ontology is likely to perform much better than traditional dictionaries or thesauri in finding semantic similarity between source terms.

3.1.3.2 Schema/Data Integration Algorithm

The XML schemas of the source search engines are given as input to our schema and data integration algorithm, and an integrated XML schema for the meta-search engine is generated as output.
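Step b) of the algorithm below relies on the string-based matcher of Section 3.1.3.1, which sums the similarity scores of three string distance functions and tests the sum against a threshold. The following is a minimal sketch of this idea; the normalisation of each score to [0,1], the character-bigram representation used for the cosine and Euclidean measures, and the threshold of 1.8 are all illustrative assumptions rather than our system's actual settings:

from collections import Counter
import math

def levenshtein_sim(a, b):
    """Levenshtein edit distance, normalised to a similarity in [0,1]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)

def bigrams(s):
    """Character-bigram vector of a string (an assumed representation)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_sim(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = (math.sqrt(sum(v * v for v in va.values())) *
            math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def euclidean_sim(a, b):
    va, vb = bigrams(a), bigrams(b)
    dist = math.sqrt(sum((va[k] - vb[k]) ** 2 for k in set(va) | set(vb)))
    return 1.0 / (1.0 + dist)

def string_match(a, b, threshold=1.8):   # the threshold is an assumed value
    """Positive match if the sum of the three similarity scores exceeds
    the threshold, as in Section 3.1.3.1."""
    return levenshtein_sim(a, b) + cosine_sim(a, b) + euclidean_sim(a, b) >= threshold

print(string_match("contractor", "contract"))   # True: the contractor -> contract example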
All the mappings that are discovered are stored within this integrated schema. Our schema/data integration algorithm works as follows. First, the set of attributes (i.e. schema elements) from every source XML schema is extracted, and the schema matching and integration process starts. For every attribute, the algorithm attempts to find an equivalent attribute in OG by applying multiple matchers in the following order: a) searching for the attribute within OG, possibly also using the synonym matcher; b) using the string-based matcher or language-based matcher; c) using the data value cardinality matcher or top-down matcher; and d) using the ID or default value matcher. If an equivalent attribute is detected at any step, the matching process stops and the discovered mapping is stored in the integrated XML schema.

Our rationale for applying the various matching techniques in the sequence a)-d) above is as follows. The domain ontology is examined first, together with the synonym matcher, as the ontology will be a source of high-quality, broad-coverage information about the domain. If a match fails to be found for an attribute, we then use the techniques in b) because they are cheaper than the techniques in c) (in terms of execution time), as observed from our experiments with several search engine interfaces. Finally, regarding d), we apply the ID and default value matchers last because they are of low precision: in many cases the ID is not meaningful (Web developers may use arbitrary IDs for HTML control elements) or a default value is not specified.

When the matching process for all the attributes is completed, the data matching and integration process starts. The children of each XML schema attribute are matched against OG. Children attributes are only matched against attributes in OG if some similarity has already been found to exist between the parent attribute in the XML schema and the parent attribute in OG. The same matchers as in a) and b) above are applied in sequence and the mappings discovered are stored in the integrated XML schema. Our algorithm can generate 1:1 mappings at the schema level, and 1:1, 1:n, n:1 and m:n mappings at the data level. The integrated XML schema generated, incorporating the mappings discovered, is then used for generating the integrated meta-search interface and for subsequent processing of queries.

3.1.4 Query Dispatcher

The query dispatcher is designed to meet the search requirements of a job seeker. A job seeker can pose their search query via the meta-search interface produced by the query interface generator. The query is rewritten by the meta-search engine to target every source search engine that was incorporated into the generation of the meta-search engine, using the integrated XML schema and the mappings stored within it. The query dispatcher submits the rewritten queries to the source job search portals. It then collects the HTML result pages, containing lists of jobs, from the job search portals.

3.1.5 Information Extractor

Different search engines use different concepts and data structures for the results in their result pages. So our Information Extractor (IE) component, too, utilizes the domain ontology and multiple matchers in order to conform the different concepts and data structures of result descriptions and result attributes arising from different search engines, again by generating appropriate mappings, and to convert these into a single common format for presentation to the user.
Thus, our hybrid matching approach is used in the extraction of search results too. The IE component consists of the Record Collector and Result Collector sub-components, which are described next.

3.1.5.1 Record Collector

The record collector identifies the job record section from job result pages and extracts a list or table of jobs with their URLs and titles. A job record consists of at least a job title and a URL. Result pages returned by the job search portals are analyzed, and pages containing no job results are identified and omitted. Result pages may consist of multiple forms with advertisements, extra details and a job record section. Result pages that contain a list of jobs need to be analyzed further, with the help of a wrapper, for the identification of the job record section, ignoring irrelevant information such as advertisements. Advertisements may sometimes be helpful to users, but this will not always be the case; in order to read only the job-related data from the HTML page, the advertisements are ignored.

There are different methods to generate wrappers for identifying the job record section from search engine result pages. The wrapper generation process can be based on domain-specific keywords, dynamic section identification, or pattern formats. Our automatic wrapper generation process is based on pattern formats, similarly to (Zhao et al. 2005) but with some modifications. In particular, we do not check pattern/block similarity on the basis of type distance, shape distance and position distance, but instead we find similarity between patterns by using the Levenshtein distance algorithm and setting a threshold value for this algorithm. With this technique, regularity in the visual content is used to extract the job record section from the result page. In any result page, job records are similar to each other, e.g. a hyperlink with title, a brief description, location, date posted and a visual line. Also, job records are normally placed in the centre of the result page and occupy a large portion of it.

For the identification of the job record section, the first step is pattern construction, to derive a physical layout of a search record by considering the visual content features (content line, link, text, link-head, record separator) from the HTML results page. For example, a pattern “TLTTT” would be constructed for the job record shown in Figure 6, where T represents text and L represents a link.

The second step is identification of the candidate patterns. A candidate pattern is a pattern that may possibly be a job record pattern. Blank sections are removed and line numbers are also stored with the patterns, to mark the start and end of a job record. The following heuristics are used to identify the set of candidate patterns from the patterns of an HTML results page:

Heuristic 1: If the pattern length is greater than 2 then we consider the pattern as a candidate pattern, otherwise we ignore it (because an HTML job record consists of at least one link, title and text/bullet).

Heuristic 2: If the pattern contains at least one link and text, i.e. “LT” or “TL”, then it is considered as a candidate pattern.

Heuristic 3: If the current and next patterns are exactly the same then these may be candidate patterns.

Heuristic 4: If the current and next patterns are not the same, then we apply the Levenshtein distance algorithm to them.
If their Levenshtein distance is less than or equal to 3 (our threshold value) then they are considered to be similar patterns and are candidate patterns.

In the third step, a weight is assigned to the candidate patterns on the basis of their frequency, i.e. the greater the number of patterns of the same type, the higher the weighting assigned to a pattern. The candidate pattern with the highest weighting is selected as the target job record pattern.

The fourth step is the identification of the target-start-boundary marker and target-end-boundary marker. The line number of the first target pattern is considered as a candidate-start-boundary marker and the line number of the last target pattern is considered as a candidate-end-boundary marker. These candidate boundary markers are further refined to determine the actual target-start-boundary marker and target-end-boundary marker. In particular, the nearest <table> or <ul> tag above the candidate-start-boundary marker is considered as the target-start-boundary marker, and the closing </table> or </ul> tag after the candidate-end-boundary marker is considered as the target-end-boundary marker. The target record section falls between the target-start-boundary marker and the target-end-boundary marker.

The final step of record section identification is URL and title extraction. A target job record section may contain multiple URLs, e.g. a URL for “job description”, a URL for “company Web page” or a URL for “apply for job”. We only need to extract the URL that links to the job description Web page. To extract such URLs, the target record field that contains links to the job description Web page is identified, and the URLs are extracted from the identified record field and stored.

3.1.5.2 Result Collector

The result collector visits all the stored URLs, downloads the individual result pages, identifies the job description in each one, and extracts the set of attributes for each job. A job description may consist of the type of work, salary, start date, end date, details, location, company etc. Different job search engines return job descriptions in different formats within their job result pages. Job descriptions may have different numbers of attributes and different attribute names. For example, the job result pages from techjobscafe.com have “Employment Term” to represent the “Type of Hour” attribute and “Salary” to represent the “Salary” attribute, while those from 6figurejobs.com have “Job Type” and “Compensation”, respectively. We note that our Result Collector component differs from the wrapper generation of (Zhao et al. 2005) in that we utilize the domain ontology and multiple matchers to conform the different concepts and data structures of job descriptions, as follows.

3.1.5.3 Information Extraction Algorithm

For every job attribute extracted by the Result Collector, the algorithm attempts to find an equivalent attribute in the domain ontology OG by applying matching in the following order: a) searching for the attribute within OG; b) using the string-based matcher or language-based matcher to compare the attribute with concepts from OG; c) using the synonym matcher within OG; and d) using the string-based matcher or language-based matcher on the attribute and the synonym matcher within OG. As soon as a match for the attribute is identified, the value for that attribute is also extracted.
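To illustrate, here is a minimal sketch of this matching cascade. The ontology is represented simply as a dictionary from concept names to synonym lists, and Python's standard difflib ratio stands in for the combined string-based matcher of Section 3.1.3.1; all names, the stop-word list and the 0.7 threshold are illustrative assumptions, not our actual implementation:

import difflib

def similar(a, b, threshold=0.7):
    """Stand-in for the string-based matcher: a single similarity ratio
    compared against an assumed threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def normalise(label):
    """Stand-in for the language-based matcher: tokenisation and
    elimination of stop words."""
    stop = {"a", "an", "the", "of", "enter", "select"}
    return " ".join(t for t in label.lower().split() if t not in stop)

def match_attribute(label, ontology):
    term = normalise(label)
    # a) search for the attribute directly within OG
    if term in ontology:
        return term
    # b) string-based/language-based matcher against OG concept names
    for concept in ontology:
        if similar(term, concept):
            return concept
    # c) synonym matcher within OG
    for concept, synonyms in ontology.items():
        if term in synonyms:
            return concept
    # d) string-based matcher against the synonyms within OG
    for concept, synonyms in ontology.items():
        if any(similar(term, s) for s in synonyms):
            return concept
    return None   # no equivalent concept found; the attribute is skipped

OG = {"type of hour": ["employment type", "job type"],
      "salary": ["compensation", "base pay"]}
print(match_attribute("Employment Term", OG))   # -> "type of hour"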
We note that, compared to the earlier schema/data integration process, fewer of the matchers are used during the result identification and collection process (the data value cardinality, ID, default value and top-down matchers are not applicable). When the matching process for all the attributes is completed, all the identified attributes and their values are stored in a common structured format and are then passed to the Result Merger component for further processing.

3.1.6 Result Merger and Duplicate Result Eliminator

The Result Merger merges the results from the multiple search engines. It is possible that different job search engines may return the same job. The Duplicate Result Eliminator detects and removes duplicate jobs by identifying identical URLs or parts of URLs. The remaining job results are stored in a MySQL database.

3.1.7 Result Ranker

The salary information identified and extracted from the source job search engines may contain different currencies and different range formats: salary values may be in different currencies, a salary value may be given but the currency not mentioned, the salary value may be given on a yearly, weekly, monthly or hourly basis, the salary value may be given in a range format (minimum to maximum), or a salary value may be expressed with a suffix “k” to represent 1,000 or “million” to represent 1,000,000, etc. The Result Ranker component is responsible for converting such salary information into a single format and then ranking jobs according to salary size.

First, a regular expression is used to extract the digits from the salary string. The currency is then identified by matching against known currency names, currency symbols and currency abbreviations. If the currency is not identified by this matching process then it is obtained by detecting the IP address of the Web site with the help of “GeoLite Country” (http://www.maxmind.com/app/ip-locate). Next, the salary period, e.g. yearly, weekly, monthly, hourly, is determined from the job description, and salary ranges are also identified; some examples from job pages are “30,000 – 40,000 €”, “20k to 25k”, “Upto 40k USD”, “Rs 25000 per month” etc. All the salaries are converted into a single periodicity. Regular expressions are also used in the identification of salary ranges. If a salary is expressed with “k” or “million”, then it is converted to an integer format accordingly. After converting all the salaries into a single format, jobs are ranked according to salary size. Some job sites show job records with an average salary while others show a minimum and maximum salary. In the latter situation, the average of the minimum and maximum salary is used.

3.1.8 Collection of Preferences

There are two types of preferences in our system: the preferences of the meta-search provider and the preferences of a job seeker. The meta-search provider may wish to create a meta-search engine for a particular geographical area and/or job category, e.g. offering jobs in Austria only or offering IT-related jobs only. The meta-search provider may also want to set a currency or salary range preference for presenting job salary information, and the meta-search engine will convert salary results accordingly. Our system also provides a facility for the job seeker to set their salary range and/or currency preferences regarding the return of job results.
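As an illustration of the Result Ranker's salary normalisation (Section 3.1.7), the following is a minimal sketch. The regular expression, the period conversion factors (e.g. 40 hours x 52 weeks for hourly rates) and the treatment of ranges are assumptions chosen for illustration; currency detection via names, symbols, abbreviations and GeoLite Country is omitted here:

import re

# assumed factors for converting each salary period to a yearly figure
PERIOD_FACTOR = {"year": 1, "month": 12, "week": 52, "hour": 40 * 52}

def normalise_salary(text):
    """Return an estimated yearly salary from a free-text salary string,
    or None if no figure can be extracted."""
    # extract digits, optionally suffixed with "k" or "million"
    figures = []
    for number, suffix in re.findall(r"([\d][\d,.]*)\s*(k|million)?",
                                     text, flags=re.I):
        value = float(number.replace(",", ""))
        if suffix.lower() == "k":
            value *= 1000
        elif suffix.lower() == "million":
            value *= 1000000
        figures.append(value)
    if not figures:
        return None
    # a range (minimum to maximum) is replaced by its average
    value = sum(figures[:2]) / len(figures[:2])
    # convert to a single (yearly) periodicity
    for period, factor in PERIOD_FACTOR.items():
        if period in text.lower():
            return value * factor
    return value   # assume a yearly figure if no period is mentioned

print(normalise_salary("20k to 25k"))           # 22500.0
print(normalise_salary("Rs 25000 per month"))   # 300000.0

After all salaries have been normalised in this way, the jobs can simply be sorted by the resulting value.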
3.2 Ontology Development

As discussed earlier, for our jobs meta-search engine we required a domain ontology containing job-related attributes and their synonyms, in order to support the schema/data integration process and the generation of a unified query interface. We developed this as a sub-ontology of a broader Human Resources domain ontology (Dorn et al. 2007). We collected job attributes from different job search engines, we identified their corresponding attributes from HR-XML (www.hr-xml.org), and we used this information to create a first version of an “occupations” sub-ontology. We then integrated the computing and business-related occupations from the Standard Occupation Classification (SOC) and International Co-operation Europe Ltd. standards into one format and added this job category information to our ontology.

Our ontology also includes data values for attributes in the job domain. For example, the data values for the attribute “Type_of_Hour” are {Contract, Full_Time, Internship, Part_Time, Permanent, Student, Temporary, Voluntary}. Our ontology also contains subclass information. For example, the attribute “Occupation” has multiple subclasses, and its “Computer_Science” subclass has data values {Software_Engineer, Administrator, Multimedia_Designer, System_Specialist etc.}.

4 Case Study and Evaluation

In Sections 4.1 and 4.2 we discuss query interface generation and query processing in a case study involving searching for jobs from several source job search engines. The URLs of the source search engines are given as input to our system, and the GUI of the meta-search engine is automatically generated. All schema and data mappings are generated using the techniques described in Section 3.1.3. Our HR domain ontology described in Section 3.2 is used to support this process. In Section 4.3 we present an evaluation of our schema/data matching techniques in the full case study.

4.1 Query Interface Generation for the Job Meta-search Engine

Each source job search engine has a different interface and different job search criteria. For simplicity, we describe here just a fragment of our case study, and consider just two simple schemas from the full set of job search engines used in the case study (we list the full set in Section 4.3). We also consider only a subset of the attributes and data values from these schemas.

S1 is the schema for the search engine http://www.jobs.net, and contains attributes “Enter Keyword(s)”, “Enter a City”, “Select a State”, “Select a Category” and “Employment Type”. The “Select a Category” attribute has data values {Business Development, General Business, Information Technology, Science, Telecommunications, Design}. The “Employment Type” attribute has data values {Full-Time, Part-Time, Contractor, Intern}.

S2 is the schema for the search engine https://www.mymatrixjobs.com and contains attributes “Keywords”, “City or Zip”, “States”, and “Job Type”, which has data values {Contract or Permanent, Contract, Permanent}.

Our job domain ontology OG contains a class “Job attributes” with sub-classes “Competency”, “City”, “State”, “Job Category”, “Type of Hour” etc. The class “Job Category” has multiple synonyms, and has data values {Computer science, Business, Engineering, Telecommunication, Web Design etc.}, along with synonyms for each one of these.
The class “Type of Hour” has synonyms “Employment Type” and “Job Type”, and data values {Contract, Full-time, Internship, Part-time, Permanent, Student, Temporary, Voluntary}, along with their individual synonyms.

When the schema/data matching process starts, S1 is first matched with OG. By applying a combination of matchers as described in Section 3.1.3, the schema-level mappings for S1 shown in Table 2 are generated. Since we use a top-down structural matching approach, when the schema-level concepts are successfully matched the data-level matching starts; Table 2 also shows the data-level mappings generated. Next, S2 is matched with OG, and the schema- and data-level mappings generated for it are likewise shown in Table 2.

Table 2 Schema- and data-level mappings for S1 and S2

S1: Schema-Level Mappings
S1.Enter Keyword(s) → OG.Competency
S1.Enter a City → OG.City
S1.Select a State → OG.State
S1.Select a Category → OG.Job Category
S1.Employment Type → OG.Type of Hour

S1: Data-Level Mappings
S1.Business Development → OG.Business
S1.General Business → OG.Business
S1.Information Technology → OG.Computer Science
S1.Science → OG.Computer Science
S1.Telecommunications → OG.Telecommunication
S1.Design → OG.Web Design
S1.Full-Time → OG.Full-time
S1.Part-Time → OG.Part-Time
S1.Contractor → OG.Contract
S1.Intern → OG.Internship

S2: Schema-Level Mappings
S2.Keywords → OG.Competency
S2.City or Zip → OG.City
S2.States → OG.State
S2.Job Type → OG.Type of Hour

S2: Data-Level Mappings
S2.Contract or Permanent → OG.Contract
S2.Contract or Permanent → OG.Permanent
S2.Contract → OG.Contract
S2.Permanent → OG.Permanent

From these mappings, schema attributes and data values, an integrated XML schema SMSE for the meta-search query interface is then generated (as shown in Listing 1 below). SMSE consists of attributes “Competency”, “City”, “State”, “Job Category” and “Type of Hour”. Attribute “Job Category” has data values {Business, Computer Science, Telecommunication, Web Design} and attribute “Type of Hour” has data values {Full-time, Part-Time, Contract, Internship, Permanent}. Finally, a GUI is generated from SMSE for the job meta-search engine, as illustrated in Figure 7.

Listing 1 Integrated XML schema for jobs.net and mymatrixjobs.com

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<MetaSearchEngine>
  <JSE1> <!-- Job Search Engine 1 -->
    <URL>http://www.jobs.net/</URL>
    <Method>GET</Method>
    <Competency>qskwd</Competency> <!-- For Enter Keyword(s) -->
    <State>qssts</State> <!-- For Select a State -->
    <City>qscty</City> <!-- For Enter a City -->
    <Job_Category id="qsjbt"> <!-- For Select a Category -->
      <computer_science>information technology, science</computer_science>
      <telecommunication>telecommunications</telecommunication>
      <web_design>design</web_design>
      <business>business development, general business</business>
      ....
    </Job_Category>
    <Type_Of_Hour id="qsetd"> <!-- For Employment Type -->
      <Full-Time>full-time</Full-Time>
      <Internship>intern</Internship>
      <Contract>contractor</Contract>
      <Part-Time>part-time</Part-Time>
      <Permanent/>
    </Type_Of_Hour>
  </JSE1>
  <JSE2> <!-- Job Search Engine 2 -->
    <URL>https://www.mymatrixjobs.com/candidate/Login.action</URL>
    <Method>POST</Method>
    <Competency>keywordtitle</Competency> <!-- For Keywords -->
    <State>state</State> <!-- For States -->
    <City>location</City> <!-- For City or Zip -->
    <Type_Of_Hour id="jobtype"> <!-- For Job Type -->
      <Contract>contract, contract or permanent</Contract>
      <Permanent>permanent, contract or permanent</Permanent>
    </Type_Of_Hour>
  </JSE2>
</MetaSearchEngine>

4.2 Query Processing by the Job Meta-search Engine

A job seeker can now pose a query via the integrated meta-search interface GUI. This query is rewritten by the meta-search engine to target every search engine involved in the meta-search engine generation process. For example, suppose a job seeker poses a query QMSE requesting all “contract” jobs with competency “java” in the “computer science” field:

QMSE: Jobs (Competency=Java, Job Category=Computer science, Type of Hour=Contract)

The query QMSE is transformed to target each individual search engine, using the schema- and data-level mappings shown in Table 2. So we have queries Q11 and Q12 below targeted at http://www.jobs.net and queries Q21 and Q22 targeted at https://www.mymatrixjobs.com:

Q11: Jobs (Enter Keyword(s)=java, Select a Category=Information technology, Employment Type=Contractor)
Q12: Jobs (Enter Keyword(s)=java, Select a Category=Science, Employment Type=Contractor)
Q21: Jobs (Keywords=Java, Job Type=Contract)
Q22: Jobs (Keywords=Java, Job Type=Contract or Permanent)

Q11, Q12, Q21 and Q22 are then submitted to the two search engines. The results are extracted by the Information Extractor component of our meta-search engine architecture. The results are then merged and duplicates are removed. The results are ranked (according to the preferences of the information seeker) and returned to the seeker, as described in Section 3.1 earlier.

The upper part of Figure 8 shows a results page returned to the Information Extractor when the query [Competency=“Java” & Job_Category=“computer_science” & Type_of_Hour=“Contract”] is sent to www.jobs.net from our meta-search engine. Figure 8 also shows how attributes in this result page are identified and converted into a structured format, using the techniques described in Section 3.1.5. We note that the following equivalences are derived between the attributes in the results page and those of the ontology: Posted → Post Date, Base Pay → Salary, Industry → Job Category, Company → Company.

4.3 Evaluation

We have evaluated our schema/data matching techniques for meta-search query interface generation using the following job search engines:

• http://www.careerbuilder.com
• http://www.learn4good.com/jobs/
• https://www.mymatrixjobs.com/candidate/Login.action
• http://www.jobs.net/
• http://jobsearch.monster.com/
• http://www.canjobs.com/index.cfm
• http://www.brightspyre.com/opening/index.php?
• http://www.top-consultant.com/UK/career/appointments.asp

Figure 9 shows the contributions of element-level, structure-level and ontology-based techniques in the matching process for each job search engine.
We see that for the careerbuilder search engine, for example, our hybrid approach identifies a total of 6 job-related attributes, with element-level techniques identifying 3 attributes, structure-level techniques 1 attribute and ontology-based techniques 2 attributes. The results for the other search engines are presented similarly, and we can see the benefits of adopting our hybrid approach to schema/data matching.

Combining the above results, we calculate an overall contribution to the identification of job-related attributes within all the search engine interfaces of 60.60% for element-level techniques, 15.15% for structure-level techniques, and 18.18% for ontology-based techniques. When we combine all the techniques, our hybrid approach achieves an overall correctness of 60.60% + 15.15% + 18.18% = 93.93%, where we define correctness as:

correctness = (number of attributes correctly identified over the set of search engine interfaces) / (total number of attributes in the set of search engine interfaces)

The precision achieved in this experiment was 100% (all the attributes identified were correct) and the recall was 93.93%. This experiment took 1 minute and 44 seconds for the job meta-search query interface generation process for the eight job search engines above, on a machine with a 1.60 GHz processor and 512 MB RAM, running Microsoft Windows XP.

We note that, for this particular experiment, if the ordering of the groups a)-d) described in Section 3.1.3.2 is altered, then the same overall set of matchings would be discovered. However, this may not be the case in general, i.e. different orderings of application of a)-d) may yield different sets of matchings.

For the evaluation of the Information Extractor component, we evaluated the Record Collector and the Result Collector separately. For the evaluation of the record collector, we focused on the identification of the total number of record sections from the full HTML results pages and the URLs of jobs. Experiments on 21 job search engines showed that the record collector is 90.5% correct in record section identification, 95.3% correct in job URL identification and 63.2% correct in job title identification. For the evaluation of the result collector, we focused on the attributes identified by our hybrid schema/data matching approach. Experiments on the same 21 job search engines showed that our result collector is 77% correct in mapping job attributes from the results pages to the ontology.

5 Conclusions and Future Work

The volume and heterogeneity of information available online via Websites or databases makes it difficult for a user to find relevant information. The primary tools for searching for information on the Web are search engines, subject directories and social network search engines. Traditional search tools do not provide comprehensive coverage of the Web and suffer from low recall and precision because they do not take into account the semantics of search words or phrases. To overcome these problems, we have proposed a new configurable meta-search engine that uses a hybrid approach to resolving semantic conflicts between different source search engines. We use a domain ontology and multiple schema/data matching techniques in order to resolve semantic heterogeneities between the search interfaces of the source search engines, and also between the results pages that they return.
5 Conclusions and Future Work

The volume and heterogeneity of the information available online via Websites or databases makes it difficult for a user to find relevant information. The primary tools for searching for information on the Web are search engines, subject directories and social network search engines. Traditional search tools do not provide comprehensive coverage of the Web and suffer from low recall and precision because they do not take into account the semantics of search words or phrases. To overcome these problems, we have proposed a new configurable meta-search engine that uses a hybrid approach to resolving semantic conflicts between different source search engines. We use a domain ontology and multiple schema/data matching techniques in order to resolve semantic heterogeneities between the search interfaces of the source search engines, and also between the results pages that they return.

Our techniques have been validated by the development of a prototype for job meta-search, as discussed in this paper. However, our techniques are general and can be used in the development of meta-search engines in any domain, provided there is an appropriate ontology describing that domain. Using a domain ontology is advantageous because it is a rich source of high-quality, broad-coverage information about a particular domain. Our work can also be viewed as providing a generic approach for semi-automatically creating a "vertical" search engine for a given domain, one that combines multiple domain-specific search engines.

In this paper, our main focus has been on the schema/data matching and integration aspects of meta-search engine generation and usage. We have introduced a hybrid approach that leverages the cumulative work of the large research community in this area. Our experiments in the job domain show that the combined use of element-level, structure-level and ontology-based techniques increases the correctness of matching during the automatic integration of the source search engine interfaces. Our techniques and results contribute towards generating more comprehensive and more concise meta-search query interfaces, more accurate meta-search query processing, and more comprehensive and concise presentation of search results to users.

For future work, we will report on the query processing performance of meta-search engines generated using our techniques. We have used multiple matching techniques in order to increase the amount of information extracted from Web pages; even so, for some job pages our meta-search engine may fail to identify job-related attributes and data. We plan to investigate introducing further matchers and techniques, and also to capture and use preferences about units for numeric data types. Also, rather than requiring the URLs of candidate source search engines to be made known to our system, we plan in future to identify and select search engines on the fly from the Web. A further open question is the comparison of different matching systems in the context of meta-search engines, e.g. which factors should be considered in the comparison and how the evaluation should be undertaken; we plan to develop benchmarks for the comparison and evaluation of matching systems in this context. Finally, our meta-search engine currently handles only the English language: if a Web page being processed is in a language other than English, our approach will fail. Future work will therefore also include the handling of multiple languages.

References

Aumueller, D., Do, H.H., Massmann, S. and Rahm, E. (2005), 'Schema and Ontology Matching with COMA++', Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Maryland, USA, pp. 906-908.
Baumgartner, R., Flesca, S. and Gottlob, G. (2001), 'Visual Web Information Extraction with Lixto', Proceedings of the 27th VLDB Conference, Rome, Italy, pp. 119-128.
Chang, K.C., He, B. and Zhang, Z. (2005), 'Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web', Proceedings of the Second Conference on Innovative Data Systems Research, Asilomar, California, pp. 44-55.
Cohen, W.W., Ravikumar, P. and Fienberg, S.E. (2003), 'A Comparison of String Distance Metrics for Name-Matching Tasks', Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, Acapulco, pp. 73-78.
Dorn, J.
and Naz, T. (2006), 'Meta-search in Human Resource Development', International Journal of Social Science, Vol. 1, No. 2, pp. 105-110.
Dorn, J., Naz, T. and Pichlmair, M. (2007), 'Ontology Development for Human Resource Management', Proceedings of the 4th International Conference on Knowledge Management, Vienna, Austria, pp. 109-120.
Dorn, J. and Naz, T. (2008), 'Structuring Meta-search Research by Design Patterns', Proceedings of the International Computer Science and Technology Conference, California, USA, pp. 1-12.
Embley, W.D., Xu, L. and Ding, Y. (2004), 'Automatic Direct and Indirect Schema Mapping: Experiences and Lessons Learned', ACM SIGMOD Record, pp. 14-19.
Hakimpour, F. and Geppert, A. (2002), 'Global Schema Generation Using Formal Ontologies', Proceedings of the 21st International Conference on Conceptual Modeling, pp. 307-321.
He, H., Meng, W., Yu, C. and Wu, Z. (2004), 'Automatic Integration of Web Search Interfaces with WISE-Integrator', VLDB Journal, Vol. 13, No. 3, pp. 256-273.
He, H., Meng, W., Yu, C. and Wu, Z. (2005), 'Constructing Interface Schemas for Search Interfaces of Web Databases', Proceedings of the 6th International Conference on Web Information Systems Engineering (WISE05), New York City, pp. 29-42.
Linková, Z. (2007), 'Schema Matching in the Semantic Web Environment', PhD Conference, Matfyzpress, pp. 36-42.
Linková, Z. (2007), 'Ontology Based Schema Integration', Proceedings of SOFSEM, Prague, pp. 71-80.
Madhavan, J., Bernstein, P.A. and Rahm, E. (2001), 'Generic Schema Matching with Cupid', Proceedings of the 27th VLDB Conference, Rome, Italy, pp. 49-58.
Madhavan, J., Jeffery, R.S., Cohen, S., Dong, X.L., Ko, D., Yu, C. and Halevy, A. (2007), 'Web-scale Data Integration: You Can Only Afford to Pay As You Go', Proceedings of the Third Biennial Conference on Innovative Data Systems Research, California, USA, pp. 342-350.
Meng, W., Wu, Z., Yu, C. and Li, Z. (2001), 'A Highly Scalable and Effective Method for Metasearch', ACM Transactions on Information Systems (TOIS), Vol. 19, pp. 310-335.
Naz, T. (2006), 'An XML Schema Generator for HTML Search Interfaces', Technical Report, DBAI, Faculty of Informatics, TU Wien, Austria.
Noy, N.F. (2004), 'Semantic Integration: A Survey of Ontology-Based Approaches', ACM SIGMOD Record, Special Section on Semantic Integration, Vol. 33, No. 4, pp. 65-70.
Ondrej, J. (2006), 'A Scalable Special-Purpose Meta-Search Engine', PhD Thesis, Institute for Information Systems, Faculty of Informatics, Vienna University of Technology, Vienna, Austria.
Rahm, E. and Bernstein, P.A. (2001), 'A Survey of Approaches to Automatic Schema Matching', VLDB Journal, Vol. 10, No. 4, pp. 334-350.
Rainer, A., Dorn, J. and Hrastnik, P. (2004), 'Strategies for Virtual Enterprises using XForms and the Semantic Web', Proceedings of the International Workshop on Semantic Web Technologies in Electronic Business, Berlin, pp. 166-172.
Shvaiko, P. and Euzenat, J. (2005), 'A Survey of Schema-based Matching Approaches', Technical Report DIT-04-087, Informatica e Telecomunicazioni, University of Trento.
Unemployment Rate (2008), 'Unemployment Rate (%), 2008 Country Ranks', http://www.photius.com/rankings/economy/unemployment_rate_2008_1.html
Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H. and Hübner, S. (2001), 'Ontology-Based Integration of Information - A Survey of Existing Approaches', IJCAI-01 Workshop, pp. 108-117.
Zhao, H., Meng, W., Wu, Z., Raghavan, V. and Yu, C.
(2005), 'Fully Automatic Wrapper Generation for Search Engines', Proceedings of the 14th International World Wide Web Conference, Japan, pp. 66-75.