What will be covered
• What is information
– How much is there?
• Properties of text
– Document models
• Information retrieval (IR) systems and methods
– Query structures
– Evaluation and Relevance
– Role of the user
– Vector models
– Inverted index
What is Information?
• What will we retrieve with information retrieval?
• There are several ways to define “information”
– Subjective: People develop models of their
environment. Information created by people
makes those models more accurate.
– Thing/artifact: Information is what’s
captured in a book, web page, or other
resource.
• More information is digital & is increasing
Information - Wikipedia
• Information as a concept has a diversity of meanings, from everyday
usage to technical settings. Generally speaking, the concept of
information is closely related to notions of constraint, communication,
control, data, form, instruction, knowledge, meaning, mental stimulus,
pattern, perception, and representation.
• Many people speak about the Information Age as the advent of the
Knowledge Age or knowledge society, the information society, the
Information revolution, and information technologies, and even though
informatics, information science and computer science are often in the
spotlight, the word "information" is often used without careful
consideration of the various meanings it has acquired.
How much information is there in
the world
Informetrics - the measurement of
information
• Stored
– What can we store
– What do we intend to store.
– What is stored.
• How do we use it
– Decision making
– Knowledge discovery
Aspects of the Information & Data Age
• Much information/data will/can be made and stored digitally
• Information/data can be automatically processed, mined, and accessed
• Why? Moore’s Law
Information Age & Data Age
• We have entered the information & data age
– What is the information age?
• When do we leave it and where do we go next?
– David Weinberger’s Too Big to Know
– What information was
Digitization of Everything: the Zettabytes are coming
• Soon most everything will be recorded and indexed
• Much will remain local
• Most bytes will never be seen by humans.
• Search, data summarization, trend detection, information and knowledge
extraction and discovery are key technologies
• So will be infrastructure to manage this.
Digital Information Created, Captured, Replicated Worldwide
[Figure: chart in exabytes (0 to 1,800) for 2006 to 2011, showing 10-fold growth in 5 years. Contributing sources: DVD; RFID; digital TV; MP3 players; digital cameras; camera phones, VoIP; medical imaging, laptops; data center applications, games; satellite images, GPS, ATMs, scanners; sensors, digital radio, DLP theaters, telematics; peer-to-peer, email, instant messaging, videoconferencing; CAD/CAM, toys, industrial machines, security systems, appliances. Source: IDC, 2008]
How much information is there?
• Soon most everything will be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale of byte prefixes, largest to smallest: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Landmarks marked along the scale, largest to smallest: everything recorded; all books (multimedia); all books (words); a movie; a photo; a book.]
Small prefixes (negative power of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
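The prefix ladder on this slide can be turned into a small helper for expressing byte counts. This is a minimal sketch, assuming decimal prefixes (each step is a factor of 1000); the name `format_bytes` is hypothetical and not from the slides.

```python
# Decimal prefixes from the slide, smallest to largest.
# Assumption: each step is a factor of 1000 (not 1024).
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]


def format_bytes(n_bytes: float) -> str:
    """Express a byte count using the largest prefix that keeps the value >= 1."""
    value = float(n_bytes)
    index = 0
    # Step up one prefix at a time until the value drops below 1000
    # or the largest known prefix is reached.
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {PREFIXES[index]}bytes"


print(format_bytes(5e18))    # 5 Exabytes
print(format_bytes(136e12))  # 136 Terabytes
```

Using decimal steps keeps the helper in line with the Kilo-to-Yotta scale above; a binary variant would divide by 1024 instead.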
Information Facts
Print, film, magnetic, and optical storage media produced about 5 exabytes of new
information in 2002. Ninety-two percent of the new information was stored on
magnetic media, mostly in hard disks.
• How big is five exabytes? If digitized with full formatting, the seventeen million
books in the Library of Congress contain about 136 terabytes of information;
five exabytes of information is equivalent in size to the information contained in
37,000 new libraries the size of the Library of Congress book collections.
• Hard disks store most new information. Ninety-two percent of new information
is stored on magnetic media, primarily hard disks. Film represents 7% of the
total, paper 0.01%, and optical media 0.002%.
• The United States produces about 40% of the world's new stored information,
including 33% of the world's new printed information, 30% of the world's new
film titles, 40% of the world's information stored on optical media, and about
50% of the information stored on magnetic media.
• How much new information per person? According to the Population Reference
Bureau, the world population is 6.3 billion, thus almost 800 MB of recorded
information is produced per person each year. It would take about 30 feet of
books to store the equivalent of 800 MB of information on paper.
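The two headline figures above (37,000 libraries and almost 800 MB per person) can be checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (1 exabyte = 10^18 bytes); the variable names are illustrative only.

```python
# Assumption: decimal units, as is usual for storage estimates.
EXABYTE = 10**18
TERABYTE = 10**12
MEGABYTE = 10**6

new_info_2002 = 5 * EXABYTE            # new stored information in 2002 (from the text)
library_of_congress = 136 * TERABYTE   # digitized book collection (from the text)
world_population = 6.3e9               # Population Reference Bureau figure (from the text)

# How many Library of Congress book collections fit in five exabytes?
libraries = new_info_2002 / library_of_congress

# How much new information is that per person, in megabytes?
per_person_mb = new_info_2002 / world_population / MEGABYTE

print(f"Library of Congress equivalents: {libraries:,.0f}")   # 36,765
print(f"New information per person: {per_person_mb:.0f} MB")  # 794 MB
```

Both results round to the figures quoted on the slide.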
Moore's Law
• Formulated by Dr. Gordon Moore in 1965.
• Predicts an exponential increase in
component density over time, with a
commonly quoted doubling time of 18 months.
• Applicable to microprocessors, DRAMs,
DSPs and other microelectronics.
• Monotonic increase in density observed
since the 1960s.
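A doubling time translates directly into a yearly growth rate. The sketch below assumes smooth exponential growth with a fixed 18-month doubling time; the function names are hypothetical.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


def annual_growth_percent(doubling_months: float = 18.0) -> float:
    """Equivalent yearly growth rate, in percent."""
    return (growth_factor(12.0, doubling_months) - 1.0) * 100.0


print(f"per year: {annual_growth_percent():.1f}%")  # 58.7%
print(f"10 years: x{growth_factor(120):.0f}")       # x102
```

Under this assumption, an 18-month doubling time is the same as roughly 59% growth per year, or about a hundredfold increase per decade.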
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50x24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer &
accounting software
(tubes not transistors)
1.6 meters
10 years later
30 MB
Now - Terabytes on your desk
Terabyte external drive for <$100 - 10 cents a gigabyte.
In 5 years, 1 cent/gigabyte, $10 for a terabyte?
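The question above implies a specific yearly rate of price decline. A minimal sketch, assuming the price falls by a constant fraction each year; `annual_change` is a hypothetical helper name.

```python
def annual_change(start: float, end: float, years: float) -> float:
    """Constant yearly fractional change that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


# From 10 cents to 1 cent per gigabyte in 5 years (figures from the slide).
rate = annual_change(start=0.10, end=0.01, years=5)
print(f"{rate:.1%}")  # -36.9%
```

A tenfold price drop in five years therefore needs prices to fall by about 37% every year.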
Moore’s Law - Density
Storage capacity beating Moore’s law
• Improvements:
– Capacity 60%/y
– Bandwidth 40%/y
– Access time 16%/y
• 1000 $/TB today
• 100 $/TB in 2007
• Most (80%) data is personal (not enterprise). This will likely remain true.
[Figure: Disk TB Shipped per Year, 1988 to 2000, log scale from 1E+3 to 1E+7 TB (1E+6 TB = 1 exabyte). Disk TB growth: 112.30%/year since 1993; Moore’s law: 58.70%/year; price decline: 50.70%/year since 1993. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
Digital Immortality
Bell, Gray, CACM, ‘01
Requirements for storing various media for a single
person’s lifetime at modest fidelity
What is Digital Immortality?
• Preservation and interaction of digitized
experiences for individuals and/or groups
– Preservation and access
– Active interaction with archives through
queries and/or an avatar (agents)
– Avatar interactions for group experiences
• Issues:
– Archiving
– Indexing
– Veracity
– Access
All the world’s libraries on
your iPod! SmartPhone
NY Times Magazine
And you thought finding that
song was hard.
• Storage is practically free
• Much is mobile
• Access is crucial
• Moore’s law keeps on trucking
Why Put Everything in Cyberspace?
• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
• Immediate OR Time Delayed
• Point-to-Point OR Broadcast
• Locate, Process, Analyze, Summarize
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual
stores all his books, records, and
communications, and which is mechanized so
that it may be consulted with exceeding speed
and flexibility”
“yet if the user inserted 5000 pages of material a
day it would take him hundreds of years to fill
the repository, so that he can be profligate and
enter material freely”
Trying to fill a terabyte in a year
Item                          Items/TB   Items/day
300 KB JPEG                   3M         9,800
1 MB Doc                      1M         2,900
1 hour 256 kb/s MP3 audio     9K         26
1 hour 1.5 Mb/s MPEG video    290        0.8
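The items-per-day column follows from dividing a terabyte by the item size and then by the number of days in a year. The sketch below assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes) and a 365-day year. Those unit choices are an assumption, not stated on the slide, and the function names are hypothetical.

```python
TERABYTE = 2**40  # assumption: binary terabyte


def items_per_terabyte(item_bytes: float) -> float:
    """Number of items of the given size that fit in one terabyte."""
    return TERABYTE / item_bytes


def items_per_day(item_bytes: float, days: int = 365) -> float:
    """Items that must be stored each day to fill one terabyte in `days` days."""
    return items_per_terabyte(item_bytes) / days


def stream_bytes(bits_per_second: float, seconds: int = 3600) -> float:
    """Size in bytes of a constant-bit-rate stream (default: one hour)."""
    return bits_per_second * seconds / 8


jpeg = 300 * 2**10                # 300 KB JPEG
doc = 2**20                       # 1 MB document
mp3_hour = stream_bytes(256_000)  # 1 hour of 256 kb/s audio

for name, size in [("JPEG", jpeg), ("Doc", doc), ("MP3 hour", mp3_hour)]:
    print(f"{name}: {items_per_terabyte(size):,.0f} per TB, "
          f"{items_per_day(size):,.1f} per day")
```

Under these assumptions the JPEG, document, and MP3 rows come out close to the rounded values in the table.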
Progress of Science
• Thousand years ago:
science was empirical
describing natural phenomena
• Last few hundred years:
theoretical branch
using models, generalizations
$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
• Last few decades:
a computational branch
simulating complex phenomena
• Today: (big data/information)
data and information exploration (eScience)
unify theory, experiment, and simulation - information driven
– Data captured by sensors, instruments
or generated by simulator
– Processed/searched by software
– Information/Knowledge stored in computer
– Scientist analyzes database / files
using data management and statistics
– Network Science
– Cyberinfrastructure
People and Information
• People process information based on their
experience and context.
• Human information processing is affected
by emotions and needs.
• Your data may be my information
• Search engine relevance is the same: what is relevant depends on the user
What is knowledge?
• Data - Facts, observations, or perceptions.
• Information - Subset of data, only including those
data that possess context, relevance, and purpose.
• Knowledge - A more simplistic view considers
knowledge as being at the highest level in a hierarchy
with data (at the lowest level) and information (at the
middle level).
• Data refers to bare facts devoid of context.
– A telephone number.
• Information is data in context.
– A phone book.
• Knowledge is information that facilitates action.
– Recognizing that a phone number belongs to a good client,
who needs to be called once per week to get his orders.
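The phone-number example can be mirrored in a few lines of code. This is only an illustrative sketch: the number, the client name, and the names `ClientRule` and `next_action` are all made up for the example.

```python
from dataclasses import dataclass

# Data: a bare fact with no context.
phone_number = "555-0100"  # made-up number

# Information: data in context (a tiny phone book).
phone_book = {"555-0100": "Acme Corp"}  # hypothetical entry


# Knowledge: information that facilitates action.
@dataclass
class ClientRule:
    name: str
    is_good_client: bool
    call_interval_days: int


rules = {"Acme Corp": ClientRule("Acme Corp", True, 7)}


def next_action(number: str, days_since_last_call: int) -> str:
    """Decide what to do about a phone number, using the book and the rules."""
    owner = phone_book.get(number)
    if owner is None:
        return "unknown number: no action"
    rule = rules.get(owner)
    if rule and rule.is_good_client and days_since_last_call >= rule.call_interval_days:
        return f"call {owner} to collect this week's orders"
    return f"no call to {owner} needed yet"


print(next_action(phone_number, days_since_last_call=7))
```

The string alone supports no decision; the lookup adds context; only the rule turns that context into an action.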
From Facts to Wisdom
(Haeckel & Nolan, 1993)
one example of the hierarchy
[Figure: hierarchy with levels, top to bottom: Wisdom, Knowledge, Intelligence, Information, Facts. Axis labels: Volume, Completeness, Objectivity; Value, Structure, Subjectivity. Caption: Less is More.]
What is knowledge?
• Knowledge - A more complex view considers
knowledge as intrinsically different from
information. Instead of considering knowledge as
a richer or more detailed set of facts, we define
knowledge in an area as justified beliefs about
relationships among concepts relevant to that
particular area.
Great Predictions
• "Computers in the future may weigh no more than 1.5 tons.” Popular
Mechanics, forecasting the relentless march of science, 1949
• "I think there is a world market for maybe five computers.” Thomas Watson,
chairman of IBM, 1943
• "Heavier-than-air flying machines are impossible.” Lord Kelvin, president,
Royal Society, 1895.
• "Man will never reach the moon regardless of all future scientific
advances." Dr. Lee De Forest, inventor of the vacuum tube and father of
television.
• "Everything that can be invented has been invented.” Charles H. Duell,
Commissioner, U.S. Office of Patents, 1899.
• “Nobody would ever need more than 640 kilobytes of memory on their
personal computer,” 1981, Bill Gates.
– Other predictions of Bill Gates?
Great Predictions
RIGHT!
• Artificial Intelligence:
– speech recognition
– Some reasoning; computer beats man in chess
– Privacy and security problems
– Computers can be a pain in the butt
WRONG!
• Missed Moore’s law and ubiquity of computers
Predicting the future
– “The future ain’t what it used to be” Yogi Berra
• Can we really predict the future?
• Who predicted the implications of the web and
search engines?
• Social networking?
• Can we understand power laws and their
implications?
– We have no examples of exponential growth in our
evolution except plagues.
• Can we understand the pervasiveness of
computers?
Information Science and Data
Generation Trends
• What do large amounts of information
provide?
– New opportunities for search!
– New discoveries
• Business opportunities?
• Research opportunities?
• Problems?
• Wisdom search engine?
Thanks to:
• Jim Gray, Microsoft
• L. Floridi, Hertfordshire
• Robert Allen, Drexel
• Wikipedia