Configurable Meta-search in the Job Domain
Tabbasum Naz
Vienna University of Technology
Institute for Information Systems,
Favoritenstrasse 9-11, A-1040, Vienna, Austria
Email: naz@dbai.tuwien.ac.at
Jürgen Dorn
Vienna University of Technology
Institute of Software Technology and Interactive Systems,
Favoritenstrasse 9-11, A-1040, Vienna, Austria
Email: juergen.dorn@dbai.tuwien.ac.at
Alexandra Poulovassilis
London Knowledge Lab,
Birkbeck, University of London,
23-29 Emerald Street, London WC1N 3QS, United Kingdom
Email: ap@dcs.bbk.ac.uk
Abstract: To aid job seekers searching for job vacancies, we have developed a
new configurable meta-search engine for the human resources domain. In this
paper we describe the three main components of our meta-search engine – the
query interface generator, query dispatcher and information extractor – which
collectively support meta-search engine creation and usage. One of the
important challenges in accessing heterogeneous and distributed data via a
meta-search engine is schema/data matching and integration. We describe an
approach to schema and data integration for meta-search engines that helps to
resolve the semantic heterogeneities between different source search engines.
Our approach is a hybrid one, in that we use multiple matching criteria and
multiple matchers. A domain-specific ontology serves as a global ontology and
allows us to resolve semantic heterogeneities by deriving mappings between
the source search engine interfaces and the ontology. The mappings are used to
generate an integrated meta-search query interface, to support query processing
within the meta-search engine, and to resolve semantic conflicts arising during
result extraction from the source search engines. Experiments conducted in the
job search domain show that our hybrid approach increases the correctness of
matching during the automatic integration of source search interfaces. Our
system aims to support job meta-search providers in the rapid development of
meta-search engines. Our use of a domain ontology and multiple matchers
helps in the semantic understanding of job descriptions and provides a job
seeker with integrated access to jobs from a variety of Websites.
Keywords: Meta-search engine, schema matching and integration, information
extraction and integration, ontology based schema integration, job search.
Biographical notes: Tabbasum Naz studied Computer Science and received
her Doctorate from Vienna University of Technology, Austria in 2009. She has
also worked as visiting researcher at London Knowledge Lab, United Kingdom
in 2008. Her Ph.D. dissertation is in the area of configurable meta-search in the
human resource domain.
Juergen Dorn studied Computer Science and Economics at Technische
Universität Berlin. He received his Ph.D. from Technische Universität Berlin in
1989 for a thesis on knowledge-based reactive robot planning. From 1989 to
1996, he was head of a group at Christian Doppler Laboratory for Expert
Systems in Vienna that has developed several scheduling expert systems for the
Austrian steel industry. He is now Professor of Business Information Systems
at Vienna University of Technology.
Alexandra Poulovassilis studied Mathematics at Cambridge University. She
received her Ph.D. from Birkbeck, University of London in 1990 for research
into functional database languages. She held posts at University College
London and King’s College London before returning to Birkbeck as Reader in
1999 and Professor of Computer Science from 2001. Since 2003 she has been
Co-Director of the London Knowledge Lab, a multi-disciplinary research
institution which aims to explore the ways in which digital technologies and
new media are shaping the future of knowledge and learning.
1 Introduction
Unemployment is not only a serious problem for developing nations but also for the
developed world. At the time of writing, unemployment is rising in many sectors.
According to (Unemployment Rate, 2008) unemployment rates in 2008 were 9.10% in
Germany, 7.50% in Pakistan, 4.30% in Austria, 4.60% in the United States and 5.40% in
the United Kingdom. One of the contributory factors to unemployment is the difficulty
individuals face in finding jobs appropriate to them, and in the distribution of information
about job vacancies.
The Web has drastically increased the availability of information. However, the
volume and heterogeneity of the information that is available via Websites makes it
difficult for a user to visit every Website that may be relevant to their information needs.
Traditional search engines are based on keyword or phrase search, without taking into
account the semantics of the word or phrase, and hence may not provide the desired
results to the user. Other traditional search tools suffer from low recall and precision. To
overcome these problems, meta-search engines aim to offer topic-specific search using
multiple heterogeneous search engines. Comparing general purpose search engines
with domain specific meta-search engines, we observe that domain specific queries
cannot be handled effectively by general purpose search engines. General purpose search
engines may produce a lot of results from a submitted keyword or phrase, but many of
these results will be irrelevant to the user’s needs. Not all of the results will be useable
and the user may have to navigate a large number of results to find the domain specific
results. This is because general purpose search engines are normally developed to meet
the needs of users’ general queries, not of domain specific queries.
Meta-search engines provide access to information from multiple search engines
simultaneously. They increase coverage of the Web by combining the coverage of several
search engines. They also make the user’s task much quicker and easier by allowing users
to submit just one search query, rather than several, and by automatically retrieving and
ranking results from multiple search engines. They also have the ability to search the
“deep” Web, thus improving precision and recall (Meng et al. 2001).
Our focus in this paper is to address the problems of searching for job vacancies from
multiple job search engines by the construction of a job meta-search engine. The main
technical challenges involved in this approach are: (a) schema and data matching and
integration in order to resolve semantic differences between different search engines; (b)
automatic integration of different job search interfaces to develop a single meta-search
engine query interface; and (c) translation of results from different job search engines
into a common format for presentation to the user.
Regarding (a), much prior research has focused on developing techniques for schema
matching and mapping: schema matching identifies correspondences between elements
from different schemas, while schema mapping defines these correspondences i.e.
provides view definitions that link the two schemas (Rahm et al. 2001). Schema
matching and mapping may generally be undertaken manually, semi-automatically or
automatically. However, the volumes and heterogeneity of Web data, in particular,
mandate the development of automatic schema matching and mapping techniques.
Different types of heterogeneity may arise when schemas are matched e.g. syntactic,
semantic and structural; different types of semantic conflicts may arise e.g. confounding,
scaling, naming (Wache et al. 2001); and there may be different mapping cardinalities
between elements from different schemas, 1:1, 1:n, n:1 or n:m (Embley et al. 2004).
In our setting, we are concerned with automatic schema and data integration of
information arising from different Web portals. Consider, for example, the three job
search interfaces shown in Figure 1, 2 and 3. There is semantic heterogeneity both at the
schema level and the data level. At the schema level, we see that three different concepts,
“career type”, “categories” and “select a category”, are used to represent the same
category of information. At the data level, we see that canjobs.com uses “Administrative
support”, careerbuilder.com uses “Admin – Clerical”, and jobs.net uses “Admin &
Clerical” to represent the same item of information. For business-related jobs,
careerbuilder.com uses “Business Development” and “Business Opportunity” while
jobs.net uses “Business Development” and “General Business”.
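The data-level conflicts above can be illustrated with a minimal sketch. The mapping below uses the category labels quoted in the text; the concept name and function names are our own illustrative choices, not the system's actual data structures.

```python
# Illustrative sketch: resolving data-level heterogeneity by mapping each
# portal's category label to a single shared ontology concept.
ONTOLOGY_CONCEPT = "Administrative/Clerical"   # assumed concept name

SOURCE_TERMS = {
    "canjobs.com": "Administrative support",
    "careerbuilder.com": "Admin - Clerical",
    "jobs.net": "Admin & Clerical",
}

def to_ontology_concept(portal, term, mapping=SOURCE_TERMS):
    """Return the shared concept if this portal uses `term` for it."""
    if mapping.get(portal) == term:
        return ONTOLOGY_CONCEPT
    return None

print(to_ontology_concept("jobs.net", "Admin & Clerical"))
# -> Administrative/Clerical
```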
The ontology integration, data integration and information integration research
communities are addressing similar types of problems in matching and integrating
heterogeneous schemas and ontologies. One of the foundations of our research is that the
techniques developed by these communities are of relevance to schema/data matching in
meta-search engines, and that a combination of approaches is required c.f. (Madhavan et
al. 2001) and (Noy, 2004).
We have developed an operational prototype meta-search engine for the jobs domain.
Our focus in this paper is on the three main components of our job meta-search engine:
the query interface generator, query dispatcher and information extractor. The query
interface generator focuses on schema matching and integration techniques aiming to
resolve semantic conflicts between different search engines. Semantic heterogeneity both
at the schema and the data level needs to be resolved. Appropriate integrated terms both
at the schema and the data level must be selected for the meta-search interface. The
mappings between the source and integrated schemas need to be used for query
processing by query dispatcher. The mappings also need to be used by the information
extractor in order to resolve semantic conflicts arising during result extraction from
multiple source search engines. We have developed a configurable approach to meta-search engine construction that aims to meet these requirements.
The contributions of this paper are the use of a domain ontology and multiple
schema/data matching techniques in order to resolve semantic heterogeneities between
the search interfaces of source web search engines, and also between the results pages
that they return. Our techniques are general and can be used in the development of meta-search engines in any domain provided there is an appropriate ontology for that domain.
Our techniques contribute to more comprehensive and more concise meta-search query
interface generation, more accurate query processing, and more comprehensive and
concise presentation of search results. The rest of the paper is organized as follows.
Section 2 reviews related work in meta-search engines, schema and ontology
matching/mapping, Web data integration systems and wrapper generation. Section 3
presents our overall design for a job meta-search engine construction, the main
components of a meta-search engine, and the development of an ontology for the jobs
domain. Section 4 describes a case study in job search and some experimental results
from this case study. Section 5 discusses our contributions, and gives directions for future
work.
2 Related Work
There has been much work on meta-search engines and we discuss here an indicative
subset. The meta-search engine presented in He et al. (2004) consists of the “WISE:
iExtractor” for interface extraction and “WISE-Integrator” for automatic integration of
schema, attribute values, format and layout. He et al. (2004) uses traditional dictionaries
along with multiple matching techniques, to find semantic similarity between schema
elements and values. The Lixto Suite (http://www.lixto.com) provides a platform to
access Web sites and extract information from them. Lixto’s meta-search uses the Lixto
visual wrapper for the extraction of relevant information and then integrates the different
Web sites into a meta-search application. Ondrej (2006) introduces a special-purpose
meta-search named “Snorri” which extends the Lixto meta-search by eliminating
limitations such as synchronous provision of results. The MetaQuerier project explores
and integrates databases that are not visible to traditional crawlers. It consists of a meta-explorer and a meta-integrator and uses a statistical/probabilistic approach for schema
matching during the integration process (Chang et al. 2005).
There also has been much work on schema/data matching and mapping. Rahm et al.
(2001) and Shvaiko et al. (2005) present reviews and classifications of the main schema
matching approaches. Cupid (Madhavan et al. 2001) uses multiple matching approaches,
and also a thesaurus to find acronyms, short forms and synonyms of words. COMA++
(Aumueller et al. 2005) supports multiple matchers and also uses a taxonomy that acts as
an intermediate ontology for schema or ontology matching. (Hakimpour et al. 2002)
merges different ontologies into a single global schema, using similarity relations.
(Embley et al. 2004) uses a combination of structural similarity between two schemas
and also a domain-specific ontology to discover mappings. String distances can also be
utilized for schema matching e.g. for matching entity names (Cohen et al. 2003).
Techniques developed for schema matching can also be employed for ontology merging:
Linková (2007) distinguishes between using a single ontology describing all the sources,
and using multiple ontologies – one for each data source – which are then merged to form
a single shared ontology. There has also been much research into ontology matching and
mapping e.g. Wache et al. (2001), Noy (2004) and Linková (2007).
In the information extraction area there has been research in wrapper induction
techniques. For example, Zhao et al. (2005) utilizes visual content features and the tag
structure of HTML result pages for the automatic wrapper generation of any given search
engine. Baumgartner et al. (2001) describes the Lixto visual wrapper generator, which
provides a visual interactive interface supporting semi-automatic wrapper generation. The
relevant information is extracted from HTML documents and translated into XML, which
can then be queried and processed further.
Compared to WISE and MetaQuerier, our approach to schema matching also uses a
domain ontology to resolve semantic conflicts. Compared to Lixto, we aim to provide an
automatic and simple construction process for information extraction. Compared to
vertical search engines (such as www.kayak.com and www.skyscanner.net), which
provide hard-wired solutions, we are aiming for a configurable and extensible approach.
Dorn et al. (2006) described a domain-specific scenario of job portal integration but
did not describe a full meta-search engine. Dorn et al. (2008) discussed design patterns
appropriate for the construction of meta-search engines. This paper extends previous
work by giving details of the main components of our job meta-search engine, describing the
techniques we use for schema/data matching and integration and for result extraction, and
presenting an evaluation of our techniques.
Finally, we distinguish our work from Web-scale architectures such as PAYGO
(Madhavan et al. 2007), in that we are aiming to develop techniques to support the
construction of domain-specific meta-search engines, rather than Web-scale search of the
deep Web. Our aim is to combine the respective benefits of vertical search engines,
meta-search engines and semantic search engines within a domain-specific context, in
which there is a well-understood domain ontology.
3 Meta-search Engine Design
In this paper, we are concerned with techniques to support two key aspects of job
meta-search engines: i) meta-search engine creation by meta-search engine providers and
ii) meta-search engine usage by job seekers. Our approach to meta-search is configurable
in the sense that there are two different architectures for these two different processes.
Figures 4 and 5 show the components of our meta-search architecture that support these
two processes. In Section 3.1 we focus in more detail on several key components of this
architecture. In Section 3.2 we also briefly discuss the development of an ontology for the
jobs domain. Here, we first give an overview of how the various components support
processes i) and ii).
The meta-search engine creation process (see Figure 4) is as follows. First, the job
meta-search provider submits its preferences via the Preference Collector (our approach
to meta-search is therefore also configurable in the sense that we can do real-time tuning
of the meta-search and handle preferences of the meta-search provider). Currently,
preferences may be for which geographical areas or job categories meta-search is
required. The Job Search Engine Selector is then activated and job search engines that
meet the preferences of the meta-search provider are selected from an already known set
of URLs of candidate job search engines. Next, the Interface Extractor derives attributes
from those job search engines' interfaces. The process of interface extraction has two
phases: attribute extraction and attribute analysis. The XML Schema Generator then
creates an XML schema corresponding to each search interface. We assume that a jobs
ontology is available for the job meta-search engine (see 3.2 below). Several matchers
(see 3.1.4 below) are used by the Job Meta-search Query Interface Generator in order to
create mappings between the source XML schemas and the ontology, and hence indirect
mappings between the different XML schemas. The Query Interface Generator also
generates a single query interface for the meta-search.
The meta-search engine usage process (see Figure 5) is as follows. A job seeker can
access and use the job meta-search interface generated by the job meta-search creation
process. Queries submitted to the query interface are re-written by the Query Dispatcher
in order to target the individual source search engines, using the mapping rules. The
query dispatcher submits the re-written queries to the individual search engines. The
result pages from various search engines are passed to the Information Extractor (IE)
component. Automatic wrapper generation techniques are used for the implementation of
this component. In particular, the IE consists of Record Collector and Result Collector
sub-components. The Record Collector is responsible for automatic identification of the
record section from each result page i.e. a list or table containing job records. It also
identifies the required URL and title of each identified record. The Result Collector visits
the identified URL and is responsible for extracting the job description and fields, e.g. job
salary, job start date, job requirements, from the result page. Since different job search
engines use different concepts and data structures for their result pages, the Result
Collector again utilizes the domain ontology and a variety of matching techniques in
order to conform the different concepts and data structures of result descriptions and
result attributes, and to convert them to a single common format for presentation to the
job seeker. The conformed results are merged by the Result Merger component.
Duplicate results are removed by the Duplicate Result Eliminator and stored in a
database for further use. Finally, the results are ranked by the Result Ranker according to
the preferences of the job seeker and displayed to the job seeker.
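The rewriting step performed by the Query Dispatcher can be sketched as follows. This is a simplification under assumed names: the attribute identifiers, the two-level mapping tables and the `rewrite` function are illustrative, not the system's actual API.

```python
# Sketch of mapping-driven query rewriting for each source engine.
# Schema-level mappings: integrated attribute -> source attribute
SCHEMA_MAP = {
    "careerbuilder.com": {"job_category": "categories", "keyword": "keywords"},
    "jobs.net": {"job_category": "select_a_category", "keyword": "enter_keywords"},
}
# Data-level mappings: integrated value -> source value
DATA_MAP = {
    "careerbuilder.com": {"Administrative/Clerical": "Admin - Clerical"},
    "jobs.net": {"Administrative/Clerical": "Admin & Clerical"},
}

def rewrite(query, source):
    """Translate {integrated_attr: value} into the source engine's terms."""
    out = {}
    for attr, value in query.items():
        src_attr = SCHEMA_MAP[source].get(attr, attr)
        out[src_attr] = DATA_MAP[source].get(value, value)
    return out

q = {"job_category": "Administrative/Clerical", "keyword": "accounting"}
print(rewrite(q, "jobs.net"))
# -> {'select_a_category': 'Admin & Clerical', 'enter_keywords': 'accounting'}
```

The same tables can be read in reverse by the Information Extractor to conform result values back to the integrated vocabulary.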
3.1 Meta-search Engine Components
The following are the main components of the meta-search engine.
3.1.1 Interface Extractor
The Interface Extractor component derives attributes and metadata from the search
engines’ interfaces. The process of interface extraction has two phases: attribute
extraction and attribute analysis.
During attribute extraction individual labels and form control elements (e.g. input
fields, checkboxes, and radio buttons) are extracted. Text between elements is extracted
to determine labels for control elements. <BR>, <P> and </TR> tags are also extracted to
determine the physical location of elements and labels. Extra scripting and styling
information, e.g. font sizes and styles, is ignored. Logically, elements and their associated
labels together form different attributes. Attributes can have one or more labels and
elements.
To provide a physical layout of a search interface, an interface expression (IEXP) is
constructed. The IEXP is used to group the labels and elements that semantically
correspond to the same attribute, and to find an appropriate attribute label for each group.
For grouping labels and elements, the LEX (layout-expression-based extraction)
technique described in (He et al. 2005) is used. LEX finds an appropriate attribute label
for each element, either in the same row or above the current row. Our interface extractor
uses heuristic measures, i.e. a colon at the end of the text, the nearest neighbouring label
of an element, the distance between element and text, the vertical alignment of element
and text, and finally the number of labels ending with a colon, to identify an appropriate
label for elements (He et al. 2005) (Naz, 2006).
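These label-selection heuristics can be sketched as a simple scoring function. This is our own simplification of the LEX-style heuristics, not the actual implementation; the weights and function names are illustrative assumptions.

```python
# Illustrative scoring of candidate labels for a form element: prefer
# labels ending in a colon and labels physically closer to the element.
def score_label(label_text, distance):
    """Higher score = more likely to be the element's label."""
    score = 0.0
    if label_text.strip().endswith(":"):
        score += 2.0               # colon at the end of the text
    score += 1.0 / (1 + distance)  # nearer neighbours score higher
    return score

def pick_label(candidates):
    """candidates: list of (label_text, distance_to_element) pairs."""
    return max(candidates, key=lambda c: score_label(*c))[0]

cands = [("Search jobs", 3), ("Enter a city:", 1)]
print(pick_label(cands))  # -> Enter a city:
```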
During attribute analysis, we identify the relationships between the extracted
attributes: a set of attributes may be inferred as forming a group, or an attribute may be
inferred to be ‘part-of’ another attribute. We undertake this identification by analyzing
the HTML control elements in the search interface; the order of labels and control
elements; and keywords or string patterns appearing in the attribute labels. Next, we
similarly derive metadata about these attributes e.g. how many values can be associated
with an attribute, default value, value type etc. Currently, string, Boolean, integer, date,
time and currency types are supported. The range of values that an attribute can take may
be finite (e.g. selected from an enumerated list), infinite (entered as free text by the user),
or comprise a range of lower and upper values.
To illustrate, Table 1 shows the attribute names and other metadata collected during
the interface extraction and attribute analysis phase for the job search interface shown in
Figure 3.
3.1.2 XML Schema Generator
The metadata created by the interface extractor is used by the XML schema generator
to define the building blocks of an XML schema document describing the source search
interface. All simple and complex elements identified by the interface extractor are
represented in this XML schema. Simple elements contain only text while complex
elements can contain other elements and may also contain attributes.
Text boxes and text areas in the source search interface are represented as simple
elements. A group of radio buttons in the source interface is also a simple element having
a default value, restriction and enumeration list. Text or a label associated with a radio
button is taken as a value for that radio button. Multiple checkboxes with domain type
“group” are treated as a complex element with attributes “fixed” and “minOccurs”. If an
HTML select list does not contain the attribute “multiple” then it is a single-select list;
otherwise it is a multiple-select list. A single-select list is treated as a simple element
having a default value, restriction and enumeration list in the same way as radio buttons.
A multiple-select list is treated as a complex element.
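The mapping from a single-select HTML list to a schema element can be sketched as follows. This is a rough illustration under assumed names (the generator's real output format is not shown in the text); it emits a simple XML Schema element with a default value and an enumeration restriction.

```python
# Sketch: turn a single-select HTML list into an XML Schema element
# carrying a default value and an enumeration restriction.
def select_to_xsd(name, options, default):
    enums = "\n".join(
        f'        <xs:enumeration value="{o}"/>' for o in options)
    return (
        f'<xs:element name="{name}" default="{default}">\n'
        f'  <xs:simpleType>\n'
        f'    <xs:restriction base="xs:string">\n'
        f'{enums}\n'
        f'    </xs:restriction>\n'
        f'  </xs:simpleType>\n'
        f'</xs:element>'
    )

print(select_to_xsd("jobs_posted_within",
                    ["last 7 days", "last 30 days"], "last 30 days"))
```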
Table 1 Meta Information for jobs.net

Attribute Name      Relationship Type   Domain Type   Default Value          Value Type   Unit
enter_keywords      None                Infinite      Nill                   String       Nill
enter_a_city        None                Infinite      Nill                   String       Nill
select_a_state      None                Finite        -all united states-    String       Nill
select_a_category   None                Finite        -all job categories-   String       Nill
jobs_posted_within  None                Finite        last 30 days           String       Nill
employment_type     Group               Finite        Nill                   String       Nill
3.1.3 Query Interface Generator
A key requirement in the creation of a job meta-search engine is to provide automatic
techniques for schema/data matching and integration. We use multiple matchers, and the
mappings generated are stored in XML format for subsequent use by the query interface
generator and query dispatcher component. We adopt a single-ontology approach and
utilize the domain ontology to find matchings between attributes of different search
engine interfaces; a synonym matcher is also used during this process. After schema/data
matching and integration, a query form for the meta-search engine is generated, using
XForms (Rainer et al. 2004).
Since our meta-search engine generation and creation processes use multiple
matching criteria and multiple techniques/matchers, we term this a ‘hybrid’ approach. We
use a combination of element-level techniques, structure-level techniques and ontology-based techniques to find similarity between schema elements, and between data values
(see Shvaiko et al. (2005) for a general review of the main techniques used in schema
matching). The matching techniques that we use are described briefly in Section 3.1.3.1.
Section 3.1.3.2 then discusses our schema and data integration algorithm.
3.1.3.1 Matching Techniques
The element-level techniques we use include a string-based matcher, language-based
matcher, data values cardinality matcher, ID matcher, default value matcher and
alignment reuse matcher.
Our string-based matcher uses a stemming algorithm and different string distance
functions to find a similarity between strings. In particular, the Porter stemming algorithm
removes the prefix and suffix of a string, handles singular and plural forms of concepts, and
then finds the similarity between strings (http://tartarus.org/~martin/PorterStemmer). The
following are examples resolved with the Porter stemming algorithm:
Keywordskeyword, Provincesprovince, Statesstate, Posted DatePost Date, Job
TypesJob Type, Starting dateStart Date
We utilize three different string distance algorithms: Levenshtein distance, Cosine
similarity and Euclidean distance (http://www.dcs.shef.ac.uk/~sam/stringmetrics.html). If
the sum of their similarity scores exceeds a threshold value, we consider this as a positive
match. The following are examples of strings matched using these string distance
functions:
business operationsbusiness, interninternship, engineering software software
engineering, contractorcontract
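A self-contained sketch of such a string-based matcher is given below. The actual system sums Levenshtein, Cosine and Euclidean scores; here we implement Levenshtein edit distance and a token-set cosine similarity, and the normalisations and the threshold value are our own illustrative choices.

```python
# Sketch of a string-based matcher combining two similarity scores.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lev_sim(a, b):
    """Edit distance normalised to a 0..1 similarity."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def cosine_tokens(a, b):
    """Cosine similarity over word sets (handles reordered tokens)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / (len(ta) ** 0.5 * len(tb) ** 0.5)

def is_match(a, b, threshold=0.75):
    """Positive match if the summed scores exceed the threshold."""
    return lev_sim(a, b) + cosine_tokens(a, b) >= threshold

print(is_match("engineering software", "software engineering"))  # -> True
print(is_match("contractor", "contract"))                        # -> True
```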
Our language-based matcher is based on natural language processing techniques,
including tokenization and elimination. The following are examples of strings
transformed using tokenization and elimination:
“Enter a Keyword” → keyword, “Career type(s)” → career types, “Select a State:” → state, “Full-time” → full time
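The tokenisation-and-elimination step can be sketched as follows; the stopword list is our own guess at the kind of filler words removed, not the paper's actual list.

```python
# Illustrative language-based normalisation of attribute labels.
STOPWORDS = {"enter", "select", "a", "an", "the", "of", "your"}

def normalise(label):
    """Lower-case, strip punctuation, tokenize, drop filler words."""
    tokens = "".join(c if c.isalnum() or c.isspace() else " "
                     for c in label.lower()).split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return " ".join(kept)

print(normalise("Enter a Keyword"))  # -> keyword
print(normalise("Select a State:"))  # -> state
```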
Our data value cardinality matcher uses the cardinality of attributes to find a match.
For example, suppose an attribute “Job Type” containing 7 data values may match either
an attribute “Type of Hour” containing 8 data values or an attribute “Job Category”
containing 44 data values. In this situation, the numbers of data values can be compared,
from which it can be inferred that the attribute “Job Type” is more similar to “Type of Hour”.
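This disambiguation step reduces to a minimal sketch; the function name and the tie-breaking behaviour are illustrative assumptions.

```python
# Sketch: when names are ambiguous, prefer the ontology concept whose
# number of data values is closest to the candidate attribute's.
def closest_by_cardinality(n_values, candidates):
    """candidates: {concept_name: number_of_data_values}."""
    return min(candidates, key=lambda c: abs(candidates[c] - n_values))

print(closest_by_cardinality(7, {"Type of Hour": 8, "Job Category": 44}))
# -> Type of Hour
```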
If element name matching fails, then our ID matcher may help to find a match (the
name of an input element from the HTML job search page is stored in our XML schema
as an attribute ID, hence the name ‘ID matcher’). Some examples of IDs from job search
engines for the element “keyword” are
qskwd, keywords, jobsearch, keywordtitle, kwd
Suppose a search engine contains an element with name “Type of Skills” and ID=“kwd”.
Suppose that the element name fails to match with any element in the ontology. In this
situation, the ID matcher will be utilized and it will compare “kwd” to elements of the
ontology e.g. keyword, type of hour, job category etc. With the help of the string distance
functions above, the ID matcher will find a similarity between “kwd” and “keyword”.
Sometimes, search engine interfaces provide default values with attributes, so that if
the user does not select any value, the default value is used. If a default value is available,
our default value matcher can be helpful in increasing the matching results. For example,
suppose there is ambiguity between the “Job Type” attribute of one schema and the
“Type of Hour” or “Job Category” concepts of the domain ontology. The default value
matcher can find that the default value “intern” of the “Job Type” attribute is matched to
data value “internship” of the “Type of Hour” concept.
As already noted, our schema/data matching process is based on a domain ontology.
This ontology is incrementally extended with synonyms and hypernyms of attributes
from previously matched schemas. As soon as a new matching is found, we store it in the
domain ontology. When matching fragments of schemas, we employ an alignment reuse
matcher to reuse these previously stored match results: if there already exists a matching
for an attribute, then there is no need to attempt to match the attribute again.
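Alignment reuse amounts to a lookup-before-matching pattern, sketched below. The dict stands in for the extended domain ontology, and `expensive_match` is a placeholder for the full matcher cascade; both names are our own.

```python
# Sketch of alignment reuse: previously discovered matches are stored
# and consulted before any expensive matcher runs.
stored_matches = {}          # attribute -> ontology concept
expensive_calls = 0

def expensive_match(attribute):
    """Stand-in for the full matcher cascade."""
    global expensive_calls
    expensive_calls += 1
    return "keyword" if "keyword" in attribute else None

def match_with_reuse(attribute):
    if attribute in stored_matches:          # alignment reuse
        return stored_matches[attribute]
    concept = expensive_match(attribute)
    if concept is not None:
        stored_matches[attribute] = concept  # remember for next time
    return concept

match_with_reuse("enter_keywords")
match_with_reuse("enter_keywords")           # served from the store
print(expensive_calls)  # -> 1
```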
Structure-level matchers (Rahm et al. 2001) consider a combination of elements
that appear near to each other within a schema in order to identify a match. Two elements
are considered similar if the elements in their local vicinity are similar. In particular,
bottom-up structure-level matchers compare all elements in their sub-trees before two
elements are compared i.e. data values are considered first. Top-down matchers compare
first parent elements and, if they show some similarity, their children are then compared.
This is a cheaper approach, and we utilize this, although it may miss some matches that a
bottom-up matcher would detect.
For example, suppose there is a choice in matching an attribute “Job Type” of a
schema with either attribute “Type of Hour” or attribute “Job Category” of the ontology.
Our top-down matcher will match the children of “Job Type” with the children of “Type
of Hour” (e.g. full time, part time, contract etc.) and with children of “Job Category” (e.g.
computer science, business, engineering etc.). It will select whichever of these two
attributes has the set of children having the closest combined match to the children of
“Job Type”.
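The top-down comparison of children can be sketched as follows; the overlap measure (Jaccard over value sets) is our own illustrative choice, not necessarily the measure used in the system.

```python
# Illustrative top-down matcher: pick the candidate concept whose
# children (data values) overlap most with the attribute's children.
def child_overlap(children_a, children_b):
    a, b = set(children_a), set(children_b)
    return len(a & b) / max(len(a | b), 1)

def best_parent(children, candidates):
    """candidates: {concept: list_of_children}."""
    return max(candidates,
               key=lambda c: child_overlap(children, candidates[c]))

job_type = ["full time", "part time", "contract"]
candidates = {
    "Type of Hour": ["full time", "part time", "contract", "intern"],
    "Job Category": ["computer science", "business", "engineering"],
}
print(best_parent(job_type, candidates))  # -> Type of Hour
```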
Finally, with respect to ontology-based techniques, we use a single ontology
approach, and the domain ontology acts as a global ontology. After completion of the
schema integration process, the meta-search query interface is generated that contains
concepts from this domain ontology. We recall that an XML schema is generated for
every search engine to be included in the meta-search. A synonym matcher is used to find
similarities between such a source schema S and the global ontology OG, using synonyms
associated with concepts in OG. For example, in the job domain, synonyms for “job
category” might be “industry”, “occupation”, “career type”, “function”. It is noted that a
domain-specific ontology is likely to perform much better than traditional dictionaries or
thesauri in finding semantic similarity between source terms.
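The synonym matcher against the global ontology OG can be sketched as a simple lookup; the synonym set below contains only the examples quoted in the text, not the full ontology.

```python
# Sketch of the synonym matcher: a source term matches an ontology
# concept if it equals the concept name or one of its stored synonyms.
SYNONYMS = {
    "job category": {"industry", "occupation", "career type", "function"},
}

def match_by_synonym(source_term, ontology=SYNONYMS):
    term = source_term.lower()
    for concept, syns in ontology.items():
        if term == concept or term in syns:
            return concept
    return None

print(match_by_synonym("Career Type"))  # -> job category
```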
3.1.3.2 Schema/Data Integration Algorithm
The XML schemas of the source search engines are given as input to our schema and
data integration algorithm, and an integrated XML schema for the meta-search engine is
generated as an output. All the mappings that are discovered are stored within this
schema. Our schema/data integration algorithm works as follows:
First, the set of attributes (i.e. schema elements) from every source XML schema is
extracted, and the schema matching and integration process starts. For every attribute,
the algorithm attempts to find an equivalent attribute in OG by applying multiple matchers
in the following order: a) searching for the attribute within OG, possibly using also the
synonym matcher, b) using the string-based matcher or language-based matcher, c) using
the data-value cardinality matcher or top-down matcher, and d) using the ID or default
value matcher. If an equivalent attribute is detected at any step, the matching process
stops and the discovered mapping is stored in the integrated XML schema.
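The first-match-wins ordering can be sketched as follows; the individual matchers are stubbed out here, since the point is only the cascade itself:

```python
def cascade(attribute, matchers):
    """Try matchers in a fixed order and return (matcher_name, concept)
    from the first one that finds an equivalent ontology concept, or
    None if all of them fail. (Sketch; the stubs below are placeholders
    for the matchers described in the text.)"""
    for name, matcher in matchers:
        concept = matcher(attribute)
        if concept is not None:
            return name, concept
    return None

matchers = [
    ("ontology/synonym",     lambda a: "Competency" if a == "Keywords" else None),
    ("string/language",      lambda a: None),
    ("cardinality/top-down", lambda a: None),
    ("id/default-value",     lambda a: None),
]
print(cascade("Keywords", matchers))  # → ('ontology/synonym', 'Competency')
```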
Our rationale for applying the various matching techniques in the sequence a)-d)
above is as follows. The domain ontology is examined first, together with the use of a
synonym matcher, as the ontology will be a source of high-quality, broad-coverage
information about the domain. If a match fails to be found for an attribute, we then use
the techniques in b) because they are cheaper than the techniques in c) (in terms of
execution time), as observed from our experiments with several search engine interfaces.
Finally, regarding d), we apply the ID and default value matchers last because they are of
low precision: in many cases the ID is not meaningful (Web developers may use arbitrary
IDs for HTML control elements) or a default value is not specified.
When the matching process for all the attributes is completed, the data matching and
integration process starts. The children of each XML schema attribute are matched
against OG. Children attributes are only matched against attributes in OG if there has
already been found to exist some similarity between the parent attribute in the XML
schema and the parent attribute in OG. The same matchers as in a) and b) above are
applied in sequence and the mappings discovered are stored in the integrated XML
schema.
Our algorithm can generate 1:1 mappings at the schema level, and 1:1, 1:n, n:1 and
m:n mappings at the data level. The integrated XML schema generated, incorporating the
mappings discovered, is then used for generating the integrated meta-search interface and
for subsequent processing of queries.
3.1.4 Query Dispatcher
The query dispatcher is designed to meet the search requirements of a job seeker. A
job seeker can pose their search query to the meta-search interface produced by the query
interface generator. The query is rewritten by the meta-search engine to target every
source search engine that was incorporated into the generation of the meta-search engine,
using the integrated XML schema and the mappings stored within it. The query
dispatcher submits the rewritten queries to the source job search portals. It then collects
the HTML result pages, containing lists of jobs, from the job search portals.
3.1.5 Information Extractor
Different search engines use different concepts and data structures for the results in their result pages. Our Information Extractor (IE) component therefore also utilizes the domain ontology and multiple matchers, again generating appropriate mappings, in order to conform the different concepts and data structures of the result descriptions and result attributes arising from different search engines, and to convert these into a single common format for presentation to the user. Thus, our hybrid matching approach is used in the
extraction of search results too. The IE component consists of the Record collector and
Result collector components, which are described next.
3.1.5.1 Record Collector
The record collector identifies the job record section from job result pages and
extracts a list or table of jobs with their URLs and titles. A job record consists of at least a
job title and a URL. Result pages returned by the job search portals are analyzed, and
pages containing no job results are identified and omitted. Result pages may consist of
multiple forms with advertisements, extra details and a job record section. Result pages
that contain a list of jobs need to be further analyzed with the help of a wrapper for the identification of the job record section, ignoring irrelevant information such as advertisements. Advertisements may occasionally be helpful to users, but this will not always be the case; in order to read only the job-related data from the HTML page, the advertisements are ignored.
There are different methods to generate wrappers for identifying the job record
section from search engine result pages. The wrapper generation process can be based on
domain-specific keywords, dynamic section identification, or pattern formats. Our
automatic wrapper generation process is based on pattern formats, similarly to (Zhao et
al. 2005) but with some modifications. In particular, we do not check pattern/block
similarity on the basis of type distance, shape distance and position distance but instead
we find similarity between patterns by using the Levenshtein distance algorithm and by
setting a threshold value for this algorithm. With this technique, regularity in visual
content is used to extract the job record section from the result page. In any result page,
job records are similar to each other e.g. a hyperlink with title, a brief description,
location, date posted and a visual line. Also, job records are normally placed in the centre
of the result page and occupy a large portion of the result page.
For the identification of the job record section, the first step is pattern construction to
derive a physical layout of a search record by considering the visual content features
(content line, link, text, link-head, record separator) from the HTML results page. For
example, a pattern “TLTTT” would be constructed for the job record shown in Figure 6
where T represents text and L represents a link.
The second step is identification of the candidate patterns. A candidate pattern is a
pattern that may possibly be a job record pattern. Blank sections are removed and line
numbers are also stored with the patterns to mark the start and end of a job record.
Various heuristics as listed below are used to identify the set of candidate patterns from
the patterns of an HTML results page.
Heuristic 1: If the pattern length is greater than 2 then we consider the pattern as a
candidate pattern, otherwise we ignore it (because an HTML job record consists of at
least one link, title and text/bullet).
Heuristic 2: If the pattern contains at least one link and text “LT” or “TL” then it is
considered as a candidate pattern.
Heuristic 3: If the current and next patterns are exactly the same then these may be
candidate patterns.
Heuristic 4: If the current and next patterns are not the same, then we apply the
Levenshtein distance algorithm to them. If their Levenshtein distance is less than or equal
to 3 (our threshold value) then they are considered as being similar patterns and are
candidate patterns.
In the third step, a weight is assigned to the candidate patterns on the basis of their
frequency i.e. the greater the number of patterns of the same type, the higher the
weighting assigned to a pattern. The candidate pattern with the highest weighting is
selected as the target job record pattern.
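Steps two and three above can be sketched as follows, assuming patterns have already been constructed as strings of T/L symbols (the helper names are hypothetical):

```python
from collections import Counter

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_candidate(pattern, next_pattern, threshold=3):
    # Heuristics 1-4: length > 2, a link adjacent to text, and
    # equality with (or edit distance <= threshold to) the next pattern.
    if len(pattern) <= 2:
        return False
    if "LT" not in pattern and "TL" not in pattern:
        return False
    return pattern == next_pattern or levenshtein(pattern, next_pattern) <= threshold

def target_pattern(patterns, threshold=3):
    candidates = [p for p, nxt in zip(patterns, patterns[1:])
                  if is_candidate(p, nxt, threshold)]
    # Step 3: weight candidates by frequency and pick the heaviest.
    return Counter(candidates).most_common(1)[0][0] if candidates else None

page_patterns = ["T", "TLTTT", "TLTT", "TLTTT", "TLTTT", "LL"]
print(target_pattern(page_patterns))  # → TLTTT
```

The repeated "TLTTT" pattern dominates the frequency count and is selected as the target job record pattern, mirroring the weighting step described above.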
The fourth step is the identification of the target-start-boundary marker and target-end-boundary marker. The line number of the first target pattern is considered as a candidate-start-boundary marker and the line number of the last target pattern is considered as a candidate-end-boundary marker. These candidate boundary markers are further refined to determine the actual target-start-boundary marker and target-end-boundary marker. In particular, the nearest <table> or <ul> tag above the candidate-start-boundary marker is considered as the target-start-boundary marker and the closing </table> or </ul> tag after the candidate-end-boundary marker is considered as the target-end-boundary marker. The target record section falls between the target-start-boundary marker and the target-end-boundary marker.
The final step of record section identification is URL and title extraction. A target job record section may contain multiple URLs, e.g. a URL for "job description", a URL for "company Web page" or a URL for "apply for job". We only need to extract the URL that links to the job description Web page. To do so, the target record field that contains links to the job description Web page is identified, and the URLs are extracted from this field and stored.
3.1.5.2 Result Collector
The result collector visits all the stored URLs, downloads individual result pages,
identifies the job descriptions in each one, and extracts the set of attributes for each job.
A job description may consist of the type of work, salary, start date, end date, details,
location, company etc. Different job search engines return job descriptions in different
formats within their job result pages. Job descriptions may have different numbers of
attributes and different attribute names. For example, the job result pages from
techjobscafe.com have “Employment Term” to represent the “Type of Hours” attribute
and “Salary” to represent the “Salary” attribute, while those from 6figurejobs.com have
“Job Type” and “Compensation”, respectively.
We note that our Result Collector component is different from the wrapper generation
of (Zhao et al. 2005) in that we utilize the domain ontology and multiple matchers to
conform the different concepts and data structures of job descriptions, as follows.
3.1.5.3 Information Extraction Algorithm
For every job attribute extracted by the Result Collector, the algorithm attempts to
find an equivalent attribute in the domain ontology OG by applying matching in the
following order: a) searching for the attribute within OG, b) using the string-based
matcher or language-based matcher to compare the attribute with concepts from OG, c)
using the synonym matcher within OG, and d) using the string-based matcher or
language-based matcher on the attribute and the synonym matcher within OG. As soon as
a match for the attribute is identified, the value for that attribute is also extracted. We
note that compared to the earlier schema/data integration process, fewer of the matchers
are used during the result identification and collection process (the data value cardinality,
ID, default value and top-down matchers are not applicable).
When the matching process for all the attributes is completed, all the identified
attributes and their values are stored in a common structured format and are then passed
to the Result Merger component for further processing.
3.1.6 Result Merger and Duplicate Result Eliminator
The Result Merger merges the results from the multiple search engines. It is possible
that different job search engines may return the same job. The Duplicate Result
Eliminator detects and removes duplicate jobs by the identification of the same URLs or
parts of URLs. The remaining job results are stored in a MySQL database.
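A minimal sketch of URL-based duplicate detection, assuming jobs are represented as dictionaries with a `url` field (an illustrative reading of comparing "the same URLs or parts of URLs"; the field names are hypothetical):

```python
from urllib.parse import urlsplit

def dedupe_jobs(jobs):
    """Drop jobs whose URLs point at the same page, ignoring the
    scheme, query string, fragment and a trailing slash."""
    seen, unique = set(), []
    for job in jobs:
        parts = urlsplit(job["url"])
        key = (parts.netloc.lower(), parts.path.rstrip("/"))
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique

jobs = [
    {"title": "Java Developer", "url": "http://jobs.example.com/view/123?src=a"},
    {"title": "Java Developer", "url": "https://jobs.example.com/view/123/"},
    {"title": "QA Engineer",    "url": "http://jobs.example.com/view/456"},
]
print(len(dedupe_jobs(jobs)))  # → 2
```

The first two entries collapse to one because they differ only in scheme, query string and trailing slash.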
3.1.7 Result Ranker
The salary information identified and extracted from the source job search engines arrives in many formats: salary values may be in different currencies; a value may be given with the currency not mentioned; the value may be quoted on a yearly, weekly, monthly or hourly basis; it may be given as a range (minimum to maximum); and it may be expressed with a suffix "k" representing 1,000 or "million" representing 1,000,000.
The Result Ranker component is responsible for converting such salary information
into a single format and then ranking jobs according to salary size. Firstly, a regular
expression is used to extract digits from the salary string. The currency is then identified
by matching against known currency names, currency symbols and currency
abbreviations. If the currency is not identified by this matching process then it is obtained
by detecting the IP of the Web site with the help of “GeoLite Country”
(http://www.maxmind.com/app/ip-locate).
Next, the salary period, e.g. yearly, weekly, monthly, hourly, is determined from the
job description, and salary ranges are also identified; some examples from job pages are
“30,000 – 40,000 €”, “20k to 25k”, “Upto 40k USD”, “Rs 25000 per month” etc. All the
salaries are converted into a single periodicity. Regular expressions are also used in the
identification of salary ranges. If salary is expressed with “k” or “million”, then it is
converted to an integer format accordingly. After converting all the salaries into a single
format, jobs are ranked according to salary size. Some job sites show job records with an
average salary while others show a minimum and maximum salary. In the latter situation,
the average of minimum and maximum salary is used.
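The normalisation steps above might be sketched as follows; the hours-per-year factor and the exact regular expression are assumptions for illustration, and currency handling is omitted:

```python
import re

def normalize_salary(text, hours_per_year=2080):
    """Sketch of salary normalisation: expand 'k'/'million' suffixes,
    average ranges, and convert hourly/weekly/monthly figures to a
    yearly amount. Factors and regex are illustrative assumptions."""
    def expand(num, suffix):
        scale = {"k": 1_000, "million": 1_000_000}.get(suffix.lower(), 1)
        return float(num.replace(",", "")) * scale

    amounts = [expand(n, s) for n, s in
               re.findall(r"([\d][\d,.]*)\s*(k|million)?", text, re.I)]
    if not amounts:
        return None
    # Ranges like "20k to 25k": average the minimum and maximum.
    value = (min(amounts) + max(amounts)) / 2
    lowered = text.lower()
    if "hour" in lowered:
        value *= hours_per_year
    elif "month" in lowered:
        value *= 12
    elif "week" in lowered:
        value *= 52
    return value

print(normalize_salary("20k to 25k"))          # → 22500.0
print(normalize_salary("Rs 25000 per month"))  # → 300000.0
```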
3.1.8 Collection of Preferences
There are two types of preferences in our system: the preferences of the meta-search
provider and the preferences of a job seeker. The meta-search provider may wish to
create a meta-search engine for a particular geographical area and/or job category e.g.
offering jobs in Austria only or offering IT-related jobs only. The meta-search provider
may also want to set a currency or salary range preference for presenting job salary
information and the meta-search engine will convert salary results accordingly. Our
system also provides a facility for the job seeker to set their salary range and/or currency
preferences regarding the return of job results.
3.2 Ontology Development
As discussed earlier, for our jobs meta-search engine we required a domain ontology
containing job-related attributes and their synonyms in order to support the schema/data
integration process and the generation of a unified query interface. We developed this as
a sub-ontology of a broader Human Resources domain ontology (Dorn et al, 2007). We
collected job attributes from different job search engines, we identified their
corresponding attributes from HR-XML (www.hr-xml.org), and we used this information
to create a first version of an “occupations” sub-ontology. We then integrated the
computing and business-related occupations from the Standard Occupation Classification
(SOC) and International Co-operation Europe Ltd. standards into one format and added
this job category information to our ontology. Our ontology also includes data values for
attributes in the job domain. For example, the data values for the attribute
“Type_of_Hour” are {Contract, Full_Time, Internship, Part_Time, Permanent, Student,
Temporary, Voluntary}. Our ontology also contains subclass information. For example,
the attribute “Occupation” has multiple subclasses, and the “Computer_Science” subclass has data values {Software_Engineer, Administrator, Multimedia_Designer, System_Specialist etc.}.
4 Case Study and Evaluation
In Sections 4.1 and 4.2 we discuss query interface generation and query processing in
a case study involving searching for jobs from several source job search engines. The
URLs of the source search engines are given as input to our system, and the GUI of the
meta-search engine is automatically generated. All schema and data mappings are
generated using the techniques described in Section 3.1.3. Our HR domain ontology
described in Section 3.2 is used to support this process. In Section 4.3 we present an
evaluation of our schema/data matching techniques in the full case study.
4.1 Query Interface Generation for the Job Meta-search Engine
Each source job search engine has a different interface and job search criteria. For
simplicity, we describe here just a fragment of our case study, and consider just two
simple schemas from the full set of job search engines used in the case study (we list the
full set in Section 4.3). We also consider only a subset of the attributes and data values
from these schemas.
S1 is the schema for search engine http://www.jobs.net, and contains attributes “Enter
Keywords(s)”, “Enter a City”, “Select a State”, “Select a Category”, “Employment
Type”. The “Select a Category” attribute has data values {Business Development,
General Business, Information Technology, Science, Telecommunications, Design}. The
“Employment Type” attribute has data values {Full-Time, Part-Time, Contractor, Intern}.
S2 is the schema for search engine https://www.mymatrixjobs.com and contains
attributes “Keywords”, “City or Zip”, “States”, and “Job Type” which has data values
{Contract or Permanent, Contract, Permanent}.
Our job domain ontology OG contains a class “Job attributes” with sub-classes “Competency”, “City”, “State”, “Job Category”, “Type of Hour” etc. The class “Job Category” has multiple synonyms, and has data values {Computer science, Business, Engineering, Telecommunication, Web Design etc.}, along with synonyms for each one of these. The class “Type of Hour” has synonyms “Employment Type” and “Job Type”, and data values {Contract, Full-time, Internship, Part-time, Permanent, Student, Temporary, Voluntary}, along with their individual synonyms.
When the schema/data matching process starts, first S1 is matched with OG. By
applying a combination of matchers as described in Section 3.1.3, the schema-level
mappings shown in Column 1 of Table 2 are generated. Since we use a top-down
structural matching approach, when the schema-level concepts are successfully matched
then data-level matching starts. Column 2 of Table 2 shows the data-level mappings
generated. Next, S2 is matched with OG and the schema- and data-level mappings
generated are shown in Columns 1 and 2 of Table 2.
Table 2 Schema & Data Level Mappings for S1 and S2

S1 Schema Level Mappings:
S1.Enter Keyword(s) → OG.Competency
S1.Enter a City → OG.City
S1.Select a State → OG.State
S1.Select a Category → OG.Job Category
S1.Employment Type → OG.Type of Hour

S1 Data Level Mappings:
S1.Business Development → OG.Business
S1.General Business → OG.Business
S1.Information Technology → OG.Computer Science
S1.Science → OG.Computer Science
S1.Telecommunications → OG.Telecommunication
S1.Design → OG.Web Design
S1.Full-Time → OG.Full-time
S1.Part-Time → OG.Part-Time
S1.Contractor → OG.Contract
S1.Intern → OG.Internship

S2 Schema Level Mappings:
S2.Keywords → OG.Competency
S2.City or Zip → OG.City
S2.States → OG.State
S2.Job Type → OG.Type of Hour

S2 Data Level Mappings:
S2.Contract or Permanent → OG.Contract
S2.Contract or Permanent → OG.Permanent
S2.Contract → OG.Contract
S2.Permanent → OG.Permanent
From these mappings, schema attributes and data values, an integrated XML schema SMSE for the meta-search query interface is then generated (as shown in Listing 1 below). SMSE consists of attributes “Competency”, “City”, “State”, “Job Category” and “Type of Hour”. Attribute “Job Category” has data values {Business, Computer Science, Telecommunication, Web Design} and attribute “Type of Hour” has data values {Full-time, Part-Time, Contract, Internship, Permanent}. Finally, a GUI is generated from SMSE for the job meta-search engine, as illustrated in Figure 7.
Listing 1 Integrated XML Schema for Jobs.net and Mymatrixjobs.com
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<MetaSearchEngine>
<JSE1> <!-- Job Search Engine 1 -->
<URL>http://www.jobs.net/</URL>
<Method>GET</Method>
<Competency>qskwd</Competency> <!-- For Enter Keyword(s) -->
<State>qssts</State> <!-- For Enter a State -->
<City>qscty</City> <!-- For Enter a City -->
<Job_Category id="qsjbt"> <!-- For Select a Category -->
<computer_science>information technology, science
<telecommunication>telecommunications</telecommunication>
<web_design>design</web_design>
</computer_science>
<business>business development, general business</business>
....
</Job_Category>
<Type_Of_Hour id="qsetd"> <!-- For Employment Type -->
<Full-Time>full-time</Full-Time>
<Internship>intern</Internship>
<Contract>contractor</Contract>
<Part-Time>part-time</Part-Time>
<Permanent/>
</Type_Of_Hour>
</JSE1>
<JSE2> <!-- Job Search Engine 2 -->
<URL>https://www.mymatrixjobs.com/candidate/Login.action</URL>
<Method>POST</Method>
<Competency>keywordtitle</Competency> <!-- For Keywords -->
<State>state</State> <!-- For States -->
<City>location</City> <!-- For City or Zip -->
<Type_Of_Hour id="jobtype"> <!-- For Job Type -->
<Contract>contract, contract or permanent</Contract>
<Permanent>permanent, contract or permanent</Permanent>
</Type_Of_Hour>
</JSE2></MetaSearchEngine>
4.2
Query Processing by the Job Meta-search Engine
A job seeker can now pose a query from the integrated meta-search interface GUI.
This query is rewritten by the meta-search engine to target every search engine involved
in the meta-search engine generation process. For example, suppose a job seeker poses a
query QMSE requesting all “contract” jobs with competency “java” in the “computer
science” field:
QMSE: Jobs (Competency=Java, Job Category=Computer science, Type of Hour=Contract)
The query QMSE is transformed to target each individual search engine, using the
schema- and data-level mappings shown in Table 2. So we have queries Q11 and Q12
below targeted at http://www.jobs.net and queries Q21 and Q22 targeted at
https://www.mymatrixjobs.com:
Q11: Jobs (Enter Keyword(s)=java, Select a Category=Information Technology, Employment Type=Contractor)
Q12: Jobs (Enter Keyword(s)=java, Select a Category=Science, Employment Type=Contractor)
Q21: Jobs (Keywords=Java, Job Type=Contract)
Q22: Jobs (Keywords=Java, Job Type=Contract or Permanent)
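The rewriting of QMSE into Q11 and Q12 using the Table 2 mappings can be sketched as follows; the mapping structure shown is illustrative, not the engine's internal representation:

```python
from itertools import product

# Reverse mappings for one engine: (ontology attribute, ontology value)
# -> (source attribute, equivalent source values). A None value marks a
# free-text field whose value is copied through. Hypothetical structure.
MAPPINGS_S1 = {
    ("Job Category", "Computer science"): ("Select a Category",
                                           ["Information Technology", "Science"]),
    ("Type of Hour", "Contract"): ("Employment Type", ["Contractor"]),
    ("Competency", None): ("Enter Keyword(s)", None),
}

def rewrite(query, mappings):
    """Translate a meta-search query into one rewritten query per
    combination of equivalent source values (1:n data mappings)."""
    fixed, expansions = {}, []
    for attr, value in query.items():
        src_attr, src_values = mappings.get((attr, value),
                                            mappings.get((attr, None)))
        if src_values is None:
            fixed[src_attr] = value          # free-text field: copy the value
        else:
            expansions.append([(src_attr, v) for v in src_values])
    return [dict(fixed, **dict(combo)) for combo in product(*expansions)]

q_mse = {"Competency": "Java", "Job Category": "Computer science",
         "Type of Hour": "Contract"}
for q in rewrite(q_mse, MAPPINGS_S1):
    print(q)
```

Because "Computer science" maps to two source values, two rewritten queries result, corresponding to Q11 and Q12.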
Q11, Q12, Q21, Q22 are then submitted to the two search engines. The results are
extracted by the Information Extractor component of our meta-search engine
architecture. The results are then merged and duplicates are removed. The results are
ranked (according to the preferences of the information seeker) and returned to the
seeker, as described in Section 3.1 earlier.
The upper part of Figure 8 shows a results page returned to the Information Extractor
when query [Competency=“Java”&Job_Category=“computer_science”&Type_of_Hour=
“Contract”] is sent to www.jobs.net from our meta-search engine. Figure 8 shows how
attributes in this result page are identified and converted into a structured format, using
the techniques described in Section 3.1.5. We note that the following equivalences are
derived between the attributes in the results page and those of the ontology:
PostedPost Date, Base PaySalary, IndustryJob Category, CompanyCompany
4.3 Evaluation
We have evaluated our schema/data matching techniques for the meta-search query
interface generation using the following job search engines:
• http://www.careerbuilder.com
• http://www.learn4good.com/jobs/
• https://www.mymatrixjobs.com/candidate/Login.action
• http://www.jobs.net/
• http://jobsearch.monster.com/
• http://www.canjobs.com/index.cfm
• http://www.brightspyre.com/opening/index.php?
• http://www.top-consultant.com/UK/career/appointments.asp
Figure 9 shows the contributions of element-level, structure-level and ontology-based
techniques in the matching process for each job search engine. We see that for the
careerbuilder search engine, for example, our hybrid approach identifies a total of 6 job-related attributes, with element-level techniques identifying 3 attributes, structure-level techniques 1 attribute and ontology-based techniques 2 attributes. The results for the other search engines are shown similarly, and we can see the benefits of adopting our hybrid approach to schema/data matching.
Combining the above results, we calculate an overall contribution to the identification of job-related attributes within all the search engine interfaces of 60.60% for element-level techniques, 15.15% for structure-level techniques, and 18.18% for ontology-based techniques. When we combine all the techniques, our hybrid approach achieves an overall correctness of 60.60% + 15.15% + 18.18% = 93.93%, where we define correctness as:

correctness = (number of attributes correctly identified over the set of search engine interfaces) / (total number of attributes in the set of search engine interfaces)
The precision achieved in this experiment was 100% (all the attributes identified
were correct) and the recall was 93.93%. This experiment took 1 minute and 44 seconds
for the job meta-search query interface generation process, for the eight job search
engines above, on a machine with a 1.60 GHz processor and 512 MB of RAM, running Microsoft
Windows XP.
We note that, for this particular experiment, if the ordering of the groups a)-d)
described in Section 3.1.3.2 is altered, then the same overall set of matchings would be
discovered. However, this may not be the case in general i.e. different orderings of
application of a)-d) may yield different sets of matchings.
For the evaluation of the Information Extractor component, we evaluated the Record
Collector and the Result Collector separately. For the evaluation of the record collector,
we focused on the identification of the total number of record sections from the full
HTML results pages and the URLs of jobs. Experiments on 21 job search engines
showed that the record collector is 90.5% correct in record section identification, 95.3%
correct in job URL identification and 63.2% correct in job titles identification. For the
evaluation of the result collector, we focused on the attributes identified by our hybrid
schema/data matching approach. Experiments on the same 21 job search engines showed
that our result collector is 77% correct in mapping job attributes from the results pages to
the ontology.
5 Conclusions and Future Work
The volume and heterogeneity of information available online via Web sites or databases make it difficult for a user to find relevant information. The primary tools for searching for information on the Web are search engines, subject directories and social network search engines. Traditional search tools do not provide comprehensive coverage
of the Web and suffer from low recall and precision because they do not take into account
the semantics of search words or phrases.
To overcome these problems, we have proposed a new configurable meta-search
engine that uses a hybrid approach to resolving semantic conflicts between different
source search engines. We use a domain ontology and multiple schema/data matching
techniques in order to resolve semantic heterogeneities between the search interfaces of
the source search engines, and also between the results pages that they return.
Our techniques have been verified in developing a prototype for job meta-search, as
discussed in this paper. However, our techniques are general and can be used in the
development of meta-search engines in any domain provided there is an appropriate
ontology describing that domain. Using a domain ontology is advantageous because it is
a rich source of high quality, broad coverage information in a particular domain. Our
work can also be viewed as providing a generic approach for semi-automatically creating
a “vertical” search engine for a given domain, which combines multiple domain-specific
search engines.
In this paper, our main focus has been on the schema/data matching and integration
aspects of meta-search engine generation and usage. We have introduced a hybrid
approach that leverages the cumulative work of the large research community in this area.
Our experiments in the job domain show that the combined use of element-level,
structure-level and ontology-based techniques increases the correctness of matching
during the automatic integration of the source search engine interfaces.
Our techniques and results provide a contribution in the area of generating more comprehensive and more concise meta-search query interfaces, more accurate meta-search query processing, and more comprehensive and concise presentation of search results to users.
For future work, we will report on the query processing performance of meta-search engines that are generated using our techniques. We have used multiple matching techniques in order to increase the amount of information extracted from Web pages, but even then, for some job pages it is possible that our meta-search engine will fail to identify job-related attributes and data. We plan to investigate introducing further matchers and techniques, and also to capture and use preferences about units for numeric data types.
Also, rather than requiring the URLs of candidate source search engines to be made
known to our system, in the future our plan is to identify and choose search engines on
the fly from the Web. Another area of concern is the comparison of different matching
systems in the context of meta-search engines, and e.g. what factors should be considered
in the comparison and how should the evaluation be undertaken. We also plan to develop
benchmarks for the comparison and evaluation of matching systems in the context of
meta-search engines. Finally, our meta-search engine currently handles only the English language: if a Web page being processed is in a language other than English then our approach will fail. So future work will also include the handling of multiple languages.
References
Aumueller, D., Do, H.H., Massmann, S. and Rahm, E. (2005) ‘Schema and Ontology Matching
with COMA++’, Proceedings of the 2005 ACM SIGMOD International Conference on
Management of Data, Maryland, USA, pp. 906-908.
Baumgartner, R., Flesca, S. and Gottlob, G. (2001), ‘Visual Web Information Extraction with
Lixto’, Proceedings of the 27th VLDB Conference, Rome, Italy, pp. 119-128.
Chang, K.C., He, B. and Zhang, Z. (2005), ‘Toward Large Scale Integration: Building a
MetaQuerier over Databases on the Web’, Proceedings of the Second Conference on
Innovative Data Systems Research, Asilomar, California, pp. 44-55.
Cohen, W.W., Ravikumar, P. and Fienberg, S.E. (2003), ‘A Comparison of String Distance Metrics
for Name-Matching Tasks’, Proceedings of IJCAI-03 workshop on information integration
on the Web, Acapulco, pp. 73–78.
Dorn, J. and Naz, T. (2006), ‘Meta-search in Human Resource Development’, International
Journal of Social Science, Vol. 1, No. 2, pp. 105-110.
Dorn, J., Naz, T. and Pichlmair, M. (2007), ‘Ontology Development for Human Resource
Management’, 4th International Conference on Knowledge Management, Vienna, Austria,
pp. 109-120.
Dorn, J. and Naz, T. (2008), ‘Structuring Meta-search Research by Design Patterns’, Proceedings
of International Computer Science and Technology Conference, California, USA, pp. 1-12.
Embley, W.D., Xu, L. and Ding, Y. (2004), ‘Automatic Direct and Indirect Schema Mapping:
Experiences and Lessons Learned’, ACM SIGMOD Record, pp. 14-19.
Hakimpour, F. and Geppert, A. (2002), ‘Global Schema Generation Using Formal Ontologies’,
Proceedings of the 21st International Conference on Conceptual Modeling, pp. 307-321.
He, H., Meng, W., Yu, C. and Wu, Z. (2004), ‘Automatic Integration of Web Search Interfaces
with WISE-Integrator’, VLDB Journal, Vol. 13, No. 3, pp. 256-273.
He, H., Meng, W., Yu, C. and Wu, Z. (2005), ‘Constructing Interface Schemas for Search
Interfaces of Web Databases’, 6th International Conference on Web Information Systems
Engineering (WISE05), New York City, pp. 29-42.
Linková, Z. (2007), ‘Schema Matching In the Semantic Web Environment’, PhD Conference,
Matfyzpress, pp. 36-42.
Linková, Z. (2007), ‘Ontology Based Schema Integration’, Proceedings of SOFSEM, Prague, pp.
71-80.
Madhavan, J., Bernstein, P.A. and Rahm, E. (2001), ‘Generic Schema Matching with Cupid’,
Proceedings of the 27th VLDB Conference, Rome, Italy, pp. 49-58.
Madhavan, J., Jeffery, R.S., Cohen, S., Dong, X.L., Ko, D., Yu, C. and Halevy, A. (2007), ‘Web-scale Data Integration: You can only afford to Pay As You Go’, Third Biennial Conference on Innovative Data Systems Research, California, USA, pp. 342-350.
Meng, W., Wu, Z., Yu, C. and Li, Z. (2001), ‘A Highly Scalable and Effective Method for Metasearch’, ACM Transactions on Information Systems (TOIS), Vol. 19, pp. 310-335.
Naz, T. (2006), ‘An XML Schema Generator for HTML Search Interfaces’, Technical Report,
DBAI, Faculty of Informatics, TU Wien, Austria.
Noy, N.F. (2004), ‘Semantic Integration: A Survey of Ontology-Based Approaches’, Special
section on Semantic Integration, Column ACM SIGMOD Record, Vol. 33, issue 4, pp. 65-70.
Ondrej, J. (2006), ‘A Scalable Special-Purpose Meta-Search Engine’, PhD Thesis, Institute for
Information Systems, Faculty for Informatics, Vienna University of Technology, Vienna,
Austria
Rahm, E. and Bernstein, P.A. (2001), ‘A Survey of approaches to Automatic Schema Matching’,
VLDB Journal, Vol. 10, No. 4, ISSN: 1066-8888, pp. 334-350.
Rainer, A., Dorn, J. and Hrastnik, P. (2004), ‘Strategies for Virtual Enterprises using XForms and
the Semantic Web’, Proceedings of International Workshop on Semantic Web Technologies
in Electronic Business, Berlin, pp. 166-172.
Shvaiko, P. and Euzenat, J. (2005), ‘A Survey of Schema-based Matching Approaches’, Technical
Report, DIT-04-087, Informatica e Telecomunicazioni, University of Trento.
Unemployment Rate (2008), ‘Unemployment Rate (%), 2008 Country Ranks’, http://www.photius.com/rankings/economy/unemployment_rate_2008_1.html
Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H. and Hübner, S.
(2001), ‘Ontology-Based Integration of Information - A Survey of Existing Approaches’,
IJCAI-01 Workshop, pp. 108-117.
Zhao, H., Meng, W., Wu, Z., Raghavan, V. and Yu, C. (2005), ‘Fully Automatic Wrapper Generation for Search Engines’, Proceedings of the 14th International World Wide Web Conference, Japan, pp. 66-75.