Semantic Web Architecture and Applications

Semantic Web architecture and applications are the next generation in information architecture.
This paper reviews four generations of information management applications (Keywords,
Statistical, Natural Language, Semantic Web) and defines their key features and limitations.
Organizations will migrate to Semantic Web architecture and applications in the next 1-3 years,
repeating the 1981-1984 user-led migration from centralized mainframe-terminal systems with rigid
applications to distributed Intel/Microsoft PC architecture and flexible VisiCalc applications.
First Generation - Keywords
Keyword technologies were originally used in IBM’s free text retrieval system in the late 1960s.
These tools are based on a simple scan of a text document to find a keyword or the root stem of a
keyword. This approach can find keywords in a document, and can list and rank documents
containing those keywords. But these tools cannot extract the meaning of the word or root
stem, and cannot understand the meaning of the sentence.
Advanced Search
Most keyword systems now include some form of Boolean logic (“AND”, “OR” functions) to narrow
searches. This is often called “advanced search”. But using Boolean logic to exclude documents
from a search is not “advanced”. It is an arbitrary means of shrinking the
source database to reduce the number of documents retrieved. This “advanced search”
significantly increases false negatives by missing many relevant source documents.
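The effect of a Boolean "AND" filter can be sketched in a few lines. This is a minimal illustration, not any vendor's search engine; the three documents and the `boolean_and` helper are invented for the example.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_and(docs, terms):
    """Return ids of documents that contain every one of the given terms."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokens(text)]

# Hypothetical three-document collection.
docs = {
    "d1": "The large engine failed during the test.",
    "d2": "A big engine failure was reported.",
    "d3": "Engine maintenance schedule for the fleet.",
}

broad = boolean_and(docs, ["engine"])          # every document mentions "engine"
narrow = boolean_and(docs, ["engine", "big"])  # "d1" is dropped: it says "large"
```

Adding the second term shortens the list, but it also discards "d1", which describes the same kind of event in different words. That is the false negative the text describes.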
Applications:
Keyword tools are appropriate for locating words in a document or listing the documents which contain
specific defined keywords and root stems. They are not capable of understanding similar words,
the meanings of words, their relationships, or their context.
Problems:
The most common problems with keyword tools are: a) false negatives (no matches found because
the word or stem is not exactly identical: “big” and “large”); b) false positives (too many unrelated
matches found because a root stem matches many unrelated words: “process” and “processor”); and c)
scale factors (keyword search tools produce very long random lists of documents if the source
database is large, and the relevance rankings are highly misleading).
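The first two problems can be reproduced with a toy root-stem matcher. This is a sketch under the assumption that "stem matching" means simple prefix matching; the `stem_matches` helper and the sample sentences are hypothetical.

```python
import re

def stem_matches(stem, text):
    """Return the words in a text that begin with the given root stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w.startswith(stem)]

# False positive: the stem "process" also matches the unrelated word "processor".
false_positive = stem_matches("process", "The processor runs the billing process.")

# False negative: a search for "big" finds nothing in a sentence that says "large".
false_negative = stem_matches("big", "A large order arrived.")
```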
Examples:
The most common examples of keyword tools are web site “Search” tools and the Microsoft “Find”
function (Ctrl+F) in Microsoft Office applications.
Second Generation - Statistical Forecasting
Statistical forecasting first finds keywords, and then calculates the frequency of these keywords and the
distance between them. Statistical forecasting tools now include many techniques for predictive forecasting, most
often using inference theory. The frequency and distribution of words has some general value in
understanding content. But these tools cannot understand the meaning of words or sentences, or provide
context. They are still limited by keyword constraints, and can only infer simplistic meaning
from the frequency and distribution of words.
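The two measurements named above, keyword frequency and keyword distance, can be sketched as follows. This is an illustrative simplification: the `keyword_stats` function is hypothetical, and "distance" is assumed to mean the smallest gap, in words, between an occurrence of one keyword and an occurrence of the other.

```python
import re
from collections import Counter

def keyword_stats(text, a, b):
    """Return (count of a, count of b, smallest word gap between a and b)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    # default=None covers the case where one of the keywords never occurs.
    min_gap = min((abs(i - j) for i in pos_a for j in pos_b), default=None)
    return counts[a], counts[b], min_gap

stats = keyword_stats("The dog chased the man. The man fed the dog.", "dog", "man")
```

The numbers say how often and how closely the two words occur, but nothing about who chased or fed whom.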
Applications:
Statistical forecasting tools are appropriate for performing simple document searches where the
desired output is a list of documents containing specific words, which end users must then read,
classify and summarize manually. These tools are not capable of understanding the
meaning, context or relationships of documents.
Semantic Web Technology: Executive Summary
Problems:
The most common problems with statistical forecasting tools are: a) keyword limitations of false
positives and false negatives; b) misunderstanding the meaning of words and sentences (“man
bites dog” is the same as “dog bites man”); c) lack of context (“Duke” could be Duke of Windsor or
Duke of Earl or John Wayne); d) scale factors (a single statistical relevance ranking creates huge
“Google” lists of many irrelevant documents: “you have 100,000 hits”).
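Problem b) follows directly from counting words without regard to order. A minimal sketch, assuming a plain bag-of-words representation (the `bag_of_words` helper is illustrative):

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word in a sentence, ignoring word order."""
    return Counter(sentence.lower().split())

# Both sentences contain the same words the same number of times,
# so a purely frequency-based view cannot tell them apart.
same = bag_of_words("man bites dog") == bag_of_words("dog bites man")
```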
Examples:
The most common statistical forecasting tool is “Google”; many other tools use inference
theory and similar analysis and predictive algorithms.
Third Generation - Natural Language Processing
Natural language processors focus on the structure of language. They recognize that certain
words in each sentence (nouns and verbs) play a different role (subject-verb-object) than others
(adjectives, adverbs, articles). This understanding of grammar increases the understanding of
keywords and their relationships (“man bites dog” is different from “dog bites man”). But these tools
cannot extract the meaning of the words or their logical relationships beyond their basic
grammar. And they cannot perform any information summary, analysis or integration functions.
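The gain from grammatical roles can be shown with a deliberately tiny example. This sketch assumes three-word sentences in strict subject-verb-object order; the `Clause` type and `parse_svo` function are hypothetical, and a real parser is far more involved.

```python
from typing import NamedTuple

class Clause(NamedTuple):
    subject: str
    verb: str
    obj: str

def parse_svo(sentence):
    """Assign subject, verb and object roles by position in a three-word sentence."""
    words = sentence.lower().split()
    if len(words) != 3:
        raise ValueError("this toy parser only handles three-word sentences")
    return Clause(*words)

a = parse_svo("man bites dog")
b = parse_svo("dog bites man")
different = a != b  # the roles differ even though the words are identical
```

The roles distinguish the two sentences, but the structure still carries no knowledge of what a "man" or a "dog" is.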
Applications:
Natural language tools are appropriate for linguistic research and word-for-word translation
applications where the desired output is a linguistic definition or a translation. These are not
capable of understanding the meaning or context of sentences in documents, or integrating
information within a database.
Problems:
The most common problems with linguistic tools are: a) keyword limitations of false positives and
false negatives; b) misunderstanding the context (does “I like java” mean an island in Indonesia, a
computer programming language or coffee?). Without understanding the broader context, a
linguistic tool only has a dictionary definition of “Java” and does not know which “Java” is relevant
or what other data relates to a specific “Java” concept.
Examples:
The most common natural language tools are translator programs, which use dictionary lookup
tables and language-specific grammar rules to convert text from a source language to a target language.
Fourth Generation - Semantic Web Architecture and Applications
Semantic Web architecture and applications are a dramatic departure from earlier generations of databases and
applications. Semantic processing includes the earlier statistical and natural language
techniques, and enhances them with semantic processing tools. First, Semantic Web architecture
is the automated conversion and storage of unstructured text sources in a Semantic Web database.
Second, Semantic Web applications automatically extract and process the concepts and context in
the database with a range of highly flexible tools.
a. Architecture; not only Application
First, the Semantic Web is a complete database architecture, not only an application program.
Semantic Web architecture is a two-step process. First, a Semantic Web database is
created from unstructured text documents. Then Semantic Web applications run on the
Semantic Web database, not on the original source documents.
The Semantic Web architecture is created by first converting text files to XML and then analyzing
these with a semantic processor. This process understands the meaning of the words and
grammar of the sentence, and also the semantic relationships of the context. These meanings and
relationships are then stored in a Semantic Web database. The Semantic Web database is similar to the
schematic logic of an electronic device or the DNA of a living organism. It contains all of the logical
content and context of the original source, and it links each word and concept back to the original
document.
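One way to picture such a database is as a store of concept relationships, each tagged with the document it came from. The sketch below is an assumption about the general shape only, not the architecture the paper describes: the `ConceptStore` class, the relation names and the file names are all invented, and the hard part (extracting the relationships from text) is left out.

```python
class ConceptStore:
    """Toy store of (subject, relation, object, source document) records."""

    def __init__(self):
        self.by_concept = {}  # concept -> list of records that mention it

    def add(self, subject, relation, obj, source_doc):
        record = (subject, relation, obj, source_doc)
        # Index the record under both concepts so either one can find it.
        for concept in {subject, obj}:
            self.by_concept.setdefault(concept, []).append(record)

    def related(self, concept):
        """Concepts directly linked to the given concept."""
        linked = set()
        for subject, _, obj, _ in self.by_concept.get(concept, []):
            linked.add(obj if subject == concept else subject)
        return linked

    def sources(self, concept):
        """Original documents in which the concept appears."""
        return {record[3] for record in self.by_concept.get(concept, [])}

store = ConceptStore()
store.add("Java", "is_a", "island", "atlas.txt")
store.add("Java", "located_in", "Indonesia", "atlas.txt")
store.add("Jakarta", "located_on", "Java", "travel.txt")
```

Here `store.related("Java")` gives the linked concepts and `store.sources("Java")` gives the documents they were drawn from, which mirrors the link from each concept back to its original document.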
Semantic Web applications directly access the logical relationships in the Semantic Web database.
Semantic Web applications can efficiently and accurately search, retrieve, summarize, analyze and
report discrete concepts or entire documents from huge databases.
A search for “Java” links directly to the three Semantic Web logical clusters for “Java” (an island in
Indonesia, a computer programming language, and coffee). The processor can then ask the user
which “Java” is meant, and expand the search to all other concepts and documents related to that
specific “Java”.
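The "Java" lookup can be sketched as a two-step query: list the known senses, then expand the one the user picks. The cluster contents and the `senses` and `expand` helpers below are invented for illustration.

```python
# Hypothetical concept clusters: term -> sense -> related concepts.
clusters = {
    "java": {
        "island": {"Indonesia", "Jakarta", "volcano"},
        "programming language": {"compiler", "class", "JVM"},
        "coffee": {"espresso", "roast", "bean"},
    }
}

def senses(term):
    """List the senses to offer the user for an ambiguous term."""
    return sorted(clusters.get(term.lower(), {}))

def expand(term, sense):
    """Return the concepts related to the sense the user selected."""
    return clusters[term.lower()][sense]

choices = senses("Java")             # step 1: ask the user which "Java"
related = expand("Java", "coffee")   # step 2: expand the chosen sense
```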
b. Structured and Unstructured Data
Second, Semantic Web architecture and applications handle both structured and unstructured
data. Structured data is stored in relational databases with static classification systems, and also in
discrete documents. These databases and documents can be processed and converted to
Semantic Web databases, and then processed together with unstructured data.
Much of the data we read, produce and share is now unstructured: emails, reports, presentations,
media content, web pages. These documents are stored in many different formats: text, email
files, Microsoft word processor, spreadsheet and presentation files, Lotus Notes, Adobe PDF, and
HTML. It is difficult, expensive, slow and inaccurate to attempt to classify and store these in a
structured database. All of these sources can be automatically converted to a common Semantic
Web database, and integrated into one common information source.
c. Dynamic and Automatic; not Static and Manual
Third, Semantic Web database architecture is dynamic and automated. Each new document which
is analyzed, extracted and stored in the Semantic Web database expands the logical relationships in all
earlier documents. These expanding logical relationships increase the understanding of content
and context in each document, and in the entire database. The Semantic Web conversion process is
automated. No human action is required for maintaining a taxonomy, metadata tagging or
classification. The semantic database is constantly updated and becomes more accurate over time.
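The claim that a new document enriches earlier ones can be illustrated with a concept graph that grows as documents arrive. This is a sketch under a strong simplifying assumption, namely that concepts appearing in the same document are linked; the `add_document` and `reachable` functions are hypothetical.

```python
from itertools import combinations

def add_document(graph, concepts):
    """Link every pair of concepts that appear in the same document."""
    for a, b in combinations(sorted(set(concepts)), 2):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

def reachable(graph, start):
    """Return every concept connected to start, directly or through other concepts."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

graph = {}
add_document(graph, ["java", "coffee"])
before = reachable(graph, "java")

# The second document never mentions "java", yet it extends what "java" connects to.
add_document(graph, ["coffee", "espresso"])
after = reachable(graph, "java")
```

No one re-tagged the first document; its concept simply gained a new connection when the second document was added.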
Semantic Web architecture is different from relational database systems. Relational databases are
manual and static because they depend on a manual process for maintaining a taxonomy,
metadata tagging and document classification in static file structures. Documents are manually
captured, read, tagged, classified and stored in a relational database only once, and not updated.
More important, the increase in new documents and information in a relational database does not
make the database more “intelligent” about the concepts, relationships or documents.
d. From Machine Readable to Machine Understandable
Fourth, Semantic Web architecture and applications support both human and machine intelligence
systems. Humans can use Semantic Web applications on a manual basis to improve the
efficiency of search, summary, analysis and reporting tasks. Machines can also use Semantic Web
applications to perform tasks that humans cannot do because of the cost, speed, accuracy,
complexity and scale of the tasks.
e. Synthetic vs Artificial Intelligence
Semantic Web technology is NOT “Artificial Intelligence”. AI was a mythical marketing goal to
create “thinking” machines. The Semantic Web supports a much more limited and realistic goal:
“Synthetic Intelligence”. The concepts and relationships stored in the Semantic Web
database are “synthesized”, or brought together and integrated, to automatically create a new
summary, analysis, report, email or alert, or to launch another machine application. The goal of
Synthetic Intelligence information systems is to bring together all information sources and user
knowledge, and to synthesize these in global networks.
Future of Information Management: Network Spreadsheets for Ideas
The future of information management will be based on Semantic Web architecture and
applications. The most important issue is which technologies and firms take the immediate
leadership to drive the migration, and therefore guide the information architecture of the future.
1) Tidal Wave of Information Shifts Power
End users and corporations will drive the rapid expansion of Semantic Web architecture and
applications to survive the tidal wave of data, and improve costs, speed and performance. IT
management will either resist or accelerate this trend. Information power will shift from the database
managers back to the departments and end users, as it did with the PC and spreadsheet in 1981-1984.
2) Migration to XML and RDF Standards
Application programs will follow Microsoft’s migration to XML standards for document
authoring and exchange. XML and RDF standards will become the dominant approach for
capturing, understanding, storing and exchanging external document descriptions and
document content. Unstructured Text Documents become Synthetic Expert Networks.
3) Universal Internet Web Portals
Information access will migrate to web portals within organizations and with the general
population; and web portals based on Semantic Web applications will become the central user
application. Operating systems and legacy applications will become transparent under
Semantic Web portals with highly flexible applications: Network spreadsheets for ideas.
4) Parallel Legacy Database Integration
Legacy databases will be extracted into parallel Semantic Web architecture databases to
provide access to fragmented sources. Parallel architecture dramatically reduces the costs,
risks, and schedules compared with the ERP “tear down and rebuild” approach. Transparent Grid Architecture.
5) Global and Language Expansion
Information sources, users and entities will expand globally and support many languages.
Because Semantic Web architectures and applications “learn and think” in the original
language, the production and exchange of multi-language information between language
domains will increase dramatically. Interactive Japanese language sources on China in English.
6) Network Access and Distribution
Networks will get better, faster, cheaper, wireless and distributed. Semantic Web architecture
and applications will expand to link global data sources from mainframe servers, desktop
workstations and laptops to handheld PDAs and cell phones. Voice driven expert systems.
7) Network Transactions and Capacity
Human transactions will grow slowly, and machine transactions will grow exponentially. The
migration from human to machine intelligence transactions will rapidly take over the private and
public networks. This rapid growth in capacity demand will force a major increase in network hardware
investment and stimulate new value added network services. Japan NTT DoCoMo mobile network.