OCLC Research Update: ALA Annual 2015

ALA Annual 2015
OCLC Research Update
Merrilee Proffitt, Senior Program Officer
Bruce Washburn, Consulting Software Engineer
Diane Vizine-Goetz, Senior Research Scientist
Roy Tennant, MC and Senior Program Officer
• Merrilee Proffitt: Wikipedia and Libraries
• Bruce Washburn: On the Linked Data Learning Curve
• Diane Vizine-Goetz: FAST (Faceted Application of Subject Terminology)
OCLC Research in Brief
Roy Tennant
Senior Program Officer
OCLC Research
• Explores challenges facing libraries and archives in a
rapidly changing information technology environment
• Three primary modes of activity:
– Community research & development
– Advanced development (data mining, prototyping)
– Member/Partner engagement
• OCLC Research Library Partnership
• Work is openly available, e.g.,
– Reports
– Experimental services
THEMES
New Report: Stewardship of the Evolving
Scholarly Record
• The scholarly record is evolving, so …
• … stewardship models for the scholarly record are changing too
• “Conscious coordination” is key to securing the future of the scholarly record
oc.lc/esr-stewardship
Library Linked Data in the Cloud
• Just published
• Offers insights gained from
OCLC’s innovative work
with linked data
• Main sections:
– Library Standards and the Semantic Web
– Modeling Library Authority Files
– Modeling and Discovering Creative Works
– Entity Identification Through Text Mining
– The Library Linked Data Cloud
• Technical but approachable
– Anyone with a modest background in
metadata can read & understand it
Wikipedia & Libraries
Increasing Library Visibility
Merrilee Proffitt
Senior Program Officer, OCLC Research
“Discovery happens elsewhere…”
Lorcan Dempsey, OCLC Research
Why Wikipedia?
35 million articles
286 languages
2 billion edits (11 million / month)
8000 views per second
500 million monthly visitors
5th most popular website
2000x larger than Britannica
Why Wikipedia?
• Starting point for research
– Learning black market and GWR
Google > Wikipedia > References
• Ideologically aligned with library mission
– Access to knowledge – for free
• Shared appreciation of quality sources
• Shared appreciation of authority control
Wikipedia + Libraries
How to engage?
• Learn to edit Wikipedia
• Attend or host an editing event
• Host a Wikipedian in Residence
• Host a Wikipedia Visiting Scholar
• Consider the unique value that libraries and librarians can bring to Wikipedia
• For more information / inspiration
– Wiki Loves Libraries
– GLAM Wiki
– Wikipedia Library
– WikiEdu (Wiki Education Foundation)
OCLC Research Update, June 29, 2015
On the Linked Data Learning Curve
Current work in OCLC Research
Bruce Washburn
Consulting Software Engineer
The Knowledge Graph
A 2012 Google blog post describes the Knowledge Graph, which supports
searching for the things, people, and places that Google knows about
and suggests relevant related things.
The Graph powers the Google Knowledge Panel in search results.
The Google Knowledge Vault
A series of recent Google Research papers describe the
use of probabilistic models and machine learning to assess
the truth of statements made by multiple sources.
• Li, X., Dong, X. L., Lyons, K., Meng, W., Srivastava, D. (2013). Truth Finding on the Deep Web: Is the Problem Solved?
• Dong, X. L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W. (2013). From Data Fusion to Knowledge Fusion.
• Dong, X. L., Murphy, K., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., ... & Zhang, W. (2014). Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion.
• Dong, X. L., Gabrilovich, E., Murphy, K., Dang, V., Horn, W., … & Zhang, W. (2015). Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources.
A “Knowledge Vault” for Libraries?
• OCLC research scientists and software
engineers are prototyping a similar model for
bibliographic and authority data sources,
• in combination with user-contributed content
and Linked Data from other providers,
• to evaluate a “knowledge vault” for
statements about entities and their
relationships, including people, groups,
places, events, concepts, and works.
Library data sources
• WorldCat – thousands of libraries, museums and
archives contribute to the aggregation. OCLC adds
FRBR clustering, algorithmically-deduced
connections of strings to Linked Data identifiers, and
new entities.
• VIAF – 30 or more authority systems contribute, and
OCLC merges and links records into new VIAF
clusters.
• FAST – OCLC transforms Library of Congress
subject headings into a new controlled vocabulary,
friendly to faceted navigation.
Knowledge Vault data flow
[Diagram: data sources (Enhanced WorldCat, VIAF, FAST), each with an extractor; extraction yields knowledge triples; fusers (collective fusion) yield scored triples; scored triples go to the Knowledge Vault]
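The flow in this diagram (extract triples from several sources, fuse them, keep a score per triple) can be illustrated with a minimal sketch. It is only a toy under stated assumptions: the triples are hypothetical, and the scoring is simple source voting, not the probabilistic models from the Google papers or whatever OCLC's prototype actually uses.

```python
from collections import defaultdict

# Hypothetical extractor output as (subject, predicate, object, source).
# The values are illustrative only, not real WorldCat, VIAF, or FAST data.
extracted = [
    ("person:1", "birthYear", "1920", "worldcat"),
    ("person:1", "birthYear", "1920", "viaf"),
    ("person:1", "birthYear", "1921", "worldcat"),
    ("person:1", "birthYear", "1920", "fast"),
]


def fuse(triples):
    """Score each distinct triple by the share of sources that assert it."""
    sources_for = defaultdict(set)
    all_sources = set()
    for subject, predicate, obj, source in triples:
        sources_for[(subject, predicate, obj)].add(source)
        all_sources.add(source)
    return {
        triple: len(sources) / len(all_sources)
        for triple, sources in sources_for.items()
    }


scored = fuse(extracted)
# Print the scored triples, strongest agreement first.
for triple, score in sorted(scored.items(), key=lambda item: -item[1]):
    print(triple, round(score, 2))
```

A real fuser would also weigh how reliable each source and extractor is; voting is used here only because it keeps the example short.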
The EntityJS Research Project
Get some real-life RDF experience, test entity refinement
and editing, and push triples back to the knowledge vault.
Testing with a subset of Knowledge
Just the “ArchiveGrid” WorldCat MARC records
[Diagram labels: WorldCat, VIAF, FAST, ArchiveGrid, DBPedia, Wikidata; Extractors; Knowledge Triples; Fusers (Collective Fusion); Scored Triples; Knowledge Vault; Vault Services; Application Triples; EntityJS]
Search across entities
Show related entities
Get more relationships from Wikidata
EntityJS users can identify matching entities
As the EntityJS research project continues …
We will explore other ways to edit and refine entity data and
experiment with using the knowledge vault to support data visualizations.
Keeping up with EntityJS
Track initial progress reports on the OCLC Research blog
at hangingtogether.org.
We’ll provide project details on the OCLC Research
website soon.
Questions or comments?
Bruce Washburn (washburb@oclc.org) or
Jeff Mixter (mixterj@oclc.org)
at OCLC Research
FAST (Faceted Application of
Subject Terminology)
Diane Vizine-Goetz
Senior Research Scientist
Basics
• Enumerative, faceted subject heading vocabulary
• Derived primarily from Library of Congress Subject Headings
(LCSH), LC NACO file, and LC Genre/Form terms
• Retains the vocabulary and reference structure of the source
files
• Eight categories of terms
– Persons
– Organizations
– Events
– Titles of works
– Chronological/Time periods
– Topics
– Places
– Form/Genre terms
Why FAST?
• Developed to meet a need for a general subject
vocabulary that is easy to learn, apply, and control
• Modern design
– All headings established in authority file
• Persistent identifiers for all headings
• Obsolete headings deprecated not deleted
– Authority file structure facilitates application and
automated maintenance of headings
– Faceted-navigation friendly
Responsible parties
• Began as a collaboration of OCLC Research and the
Library of Congress
• OCLC Research & advisory groups
– WorldCat quality management team
– FAST users (e.g., Cornell University, Australian Policy Online)
– ALCTS Faceted Subject Access Interest Group
Facet Counts – 5 June 2015
Facet                           Count
Persons                       692,734
Organizations                 360,571
Events                         12,417
Titles of Works                63,074
Chronological/Time periods        676
Topics                        406,873
Places                        176,774
Form/Genre                      2,507
Total                       1,715,626
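As a quick arithmetic check, the eight facet counts on this slide do add up to the stated total. The short sketch below recomputes the sum and prints each facet's share; the dictionary simply restates the slide's figures.

```python
# Facet counts as given on the slide (5 June 2015).
facet_counts = {
    "Persons": 692_734,
    "Organizations": 360_571,
    "Events": 12_417,
    "Titles of Works": 63_074,
    "Chronological/Time periods": 676,
    "Topics": 406_873,
    "Places": 176_774,
    "Form/Genre": 2_507,
}

total = sum(facet_counts.values())
print(f"Total: {total:,}")

# Show each facet's share of the vocabulary, largest first.
for facet, count in sorted(facet_counts.items(), key=lambda item: -item[1]):
    print(f"{facet:<28}{count:>10,}{count / total:>8.1%}")
```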
FAST in MARC Bibliographic
Records
Headings before conversion
600 10 $a Lacks, Henrietta, $d 1920-1951 $x Health.
650 #0 $a Cancer $x Patients $z Virginia $v Biography.
650 #0 $a African American women $x History.
650 #0 $a Human experimentation in medicine $z United States $x History.
650 #0 $a HeLa cells.
650 #0 $a Cancer $x Research.
650 #0 $a Cell culture.
650 #0 $a Medical ethics.
FAST headings after conversion
600 17 $a Lacks, Henrietta, $d 1920-1951 $2 fast $0 (OCoLC)fst01914767
650 #7 $a African American women $2 fast $0 (OCoLC)fst00799438
650 #7 $a Cancer $x Patients $2 fast $0 (OCoLC)fst00845411
650 #7 $a Cancer $x Research $2 fast $0 (OCoLC)fst00845497
650 #7 $a Cell culture $2 fast $0 (OCoLC)fst00850172
650 #7 $a Health $2 fast $0 (OCoLC)fst00952743
650 #7 $a HeLa cells $2 fast $0 (OCoLC)fst00952578
650 #7 $a Human experimentation in medicine $2 fast $0 (OCoLC)fst00963042
650 #7 $a Medical ethics $2 fast $0 (OCoLC)fst01014081
651 #7 $a United States $2 fast $0 (OCoLC)fst01204155
651 #7 $a Virginia $2 fast $0 (OCoLC)fst01204597
655 #7 $a Biography $2 fast $0 (OCoLC)fst01423686
655 #7 $a History $2 fast $0 (OCoLC)fst01411628
http://experimental.worldcat.org/fast/963042/
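The URL above appears to correspond to the $0 value (OCoLC)fst00963042 on the "Human experimentation in medicine" heading. A minimal sketch of that mapping follows; the rule it encodes (drop the "fst" prefix and leading zeros) is an assumption generalized from this single example, and the function name is hypothetical.

```python
import re


def fast_uri(control_number: str) -> str:
    """Turn a $0 value such as '(OCoLC)fst00963042' into a FAST URL.

    Assumption: the URL uses the numeric part without leading zeros,
    as in the one example shown on the slide.
    """
    match = re.fullmatch(r"\(OCoLC\)fst0*(\d+)", control_number.strip())
    if match is None:
        raise ValueError(f"not a FAST control number: {control_number!r}")
    return f"http://experimental.worldcat.org/fast/{match.group(1)}/"


print(fast_uri("(OCoLC)fst00963042"))
```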
FAST in MARC Bibliographic
Records
Facet        FAST headings after conversion
Person       600 17 $a Lacks, Henrietta, $d 1920-1951
Topic        650 #7 $a African American women
             650 #7 $a Cancer $x Patients
             650 #7 $a Cancer $x Research
             650 #7 $a Cell culture
             650 #7 $a Health
             650 #7 $a HeLa cells
             650 #7 $a Human experimentation in medicine
             650 #7 $a Medical ethics
Place        651 #7 $a United States
             651 #7 $a Virginia
Form/Genre   655 #7 $a Biography
             655 #7 $a History
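Because each converted heading sits in a MARC field whose tag identifies its facet, grouping headings for faceted navigation can be sketched in a few lines. The example below is illustrative only: the tag-to-facet table covers just the four tags on this slide, and the parsing is a simplification of real MARC handling.

```python
from collections import defaultdict

# Tag-to-facet mapping for the four tags that appear on the slide.
FACET_BY_TAG = {"600": "Person", "650": "Topic", "651": "Place", "655": "Form/Genre"}

fields = [
    "600 17 $a Lacks, Henrietta, $d 1920-1951",
    "650 #7 $a African American women",
    "650 #7 $a Cancer $x Patients",
    "650 #7 $a Cancer $x Research",
    "650 #7 $a Cell culture",
    "650 #7 $a Health",
    "650 #7 $a HeLa cells",
    "650 #7 $a Human experimentation in medicine",
    "650 #7 $a Medical ethics",
    "651 #7 $a United States",
    "651 #7 $a Virginia",
    "655 #7 $a Biography",
    "655 #7 $a History",
]


def parse(field):
    """Return (tag, display text); subdivisions ($v $x $y $z) join with '--'."""
    tag = field[:3]
    text = ""
    for part in field.split("$")[1:]:
        code, value = part[0], part[1:].strip()
        if not text:
            text = value
        elif code in "vxyz":
            text += "--" + value
        else:
            text += " " + value
    return tag, text


by_facet = defaultdict(list)
for field in fields:
    tag, text = parse(field)
    by_facet[FACET_BY_TAG[tag]].append(text)

for facet, headings in by_facet.items():
    print(f"{facet}: {'; '.join(headings)}")
```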
FAST and Authority Files
FAST heading                                          Facet  WC usage  LCSH
Cancer--Patients                                      topic    13,564  Cancer ‡x Patients [150]
Cancer--Patients--Attitudes                           topic       155
Cancer--Patients--Biography [obsolete]                topic         -  Cancer ‡x Patients ‡v Biography [150]
Cancer--Patients--Care                                topic       552
Cancer--Patients--Conduct of life                     topic        24
Cancer--Patients--Counseling of                       topic        85
Cancer--Patients--Dental care                         topic        24
Cancer--Patients--Economic conditions                 topic        40  Cancer ‡x Patients ‡x Economic conditions [150]
Cancer--Patients--Education                           topic        45
Cancer--Patients--Employment                          topic        41
Cancer--Patients--Family relationships                topic     1,274  Cancer ‡x Patients ‡x Family relationships [150]
-                                                     -             -  Cancer ‡x Patients ‡v Fiction [150]
Cancer--Patients--Home care                           topic       252  Cancer ‡x Patients ‡x Home care [150]
Cancer--Patients--Home care--Planning                 topic         4
Cancer--Patients--Hospital care                       topic       132  Cancer ‡x Patients ‡x Hospital care [150]
Cancer--Patients--Hospital care--Planning             topic         5
Cancer--Patients--Legal status, laws, etc.            topic        24
Cancer--Patients--Long-term care                      topic        56  Cancer ‡x Patients ‡x Long-term care [150]
Cancer--Patients--Long-term care--History [obsolete]  topic         -
Cancer--Patients--Medical care                        topic       105
Cancer--Patients--Mental health                       topic       110
Cancer--Patients--Mental health services              topic         5
Cancer--Patients--Nutrition                           topic        76
Cancer--Patients--Pastoral counseling of              topic        39
Cancer--Patients--Psychological aspects               topic       129
Cancer--Patients--Psychology                          topic       394
Cancer--Patients--Rehabilitation                      topic       799  Cancer ‡x Patients ‡x Rehabilitation [150]
Cancer--Patients--Rehabilitation--Societies, etc.     topic         5
Cancer--Patients--Religious life                      topic       255  Cancer ‡x Patients ‡x Religious life [150]
Cancer--Patients--Research                            topic        19
Cancer--Patients--Services for                        topic       608
Cancer--Patients--Sexual behavior                     topic        49
Cancer--Patients--Social conditions                   topic       105  Cancer ‡x Patients ‡x Social conditions [150]
Cancer--Patients--Social networks                     topic        35
Cancer--Patients--Treatment                           topic        55
-                                                     -             -  Cancer ‡x Patients ‡z United States ‡v Biography
Tools for Application and Use
assignFAST
– service that automates the manual selection of FAST Subjects
based on autosuggest technology
searchFAST
– search interface to the FAST authority file that simplifies the process
of heading selection
FASTConverter
– web application that converts LCSH headings to FAST headings; it
helps users become familiar with FAST and see the differences
between LCSH and FAST
FAST Linked Data API
– Linked Data descriptions expressed using SKOS (Simple Knowledge
Organization System) and Schema.org
WorldShare Record Manager
– uses assignFAST API in a feature to apply FAST headings
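To give a feel for what autosuggest-based heading selection involves, here is a minimal local sketch. It does not call the assignFAST service, and the slides do not describe how assignFAST matches or ranks suggestions; ranking by WorldCat usage is an assumption made for illustration. The headings and counts are a subset of those shown on the authority-file comparison slide.

```python
# A few FAST headings with their WorldCat usage counts from the slides.
usage = {
    "Cancer--Patients": 13564,
    "Cancer--Patients--Economic conditions": 40,
    "Cancer--Patients--Family relationships": 1274,
    "Cancer--Patients--Home care": 252,
    "Cancer--Patients--Hospital care": 132,
    "Cancer--Patients--Long-term care": 56,
    "Cancer--Patients--Rehabilitation": 799,
    "Cancer--Patients--Religious life": 255,
    "Cancer--Patients--Social conditions": 105,
}


def suggest(prefix, limit=5):
    """Return headings starting with prefix, most-used first (hypothetical ranking)."""
    wanted = prefix.casefold()
    matches = [h for h in usage if h.casefold().startswith(wanted)]
    return sorted(matches, key=lambda h: (-usage[h], h))[:limit]


print(suggest("cancer--patients--h"))
print(suggest("Cancer", limit=3))
```

Matching is case-insensitive so that a cataloger typing in lower case still sees the established form of the heading.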
FAST Datasets
• Available under Open Data Commons Attribution
License (ODC-By)
• Bulk downloads updated quarterly
– MARC Authority Format in XML
– MARC Authority Format in ISO MARC
– RDF/XML
• Change files published between updates
– MARC Authority Format in ISO MARC
Links to other Files
Authority
Count
Library of Congress Subject
Headings*
LC NACO File
1,213,647
VIAF
1,213,232
DbPedia/Wikipedia
299,172
160,837
Geonames
Total geographic coordinates
85,422
120,561
*One-to-One only, not including references to partial or pattern
headings
Where FAST is used
• Bodleian Libraries, University of Oxford (U.K.)
• British Library (U.K., testing FAST)
• Chronicling Illinois & The Papers of Abraham Lincoln
projects (U.S.A.)
• Cornell University Libraries (U.S.A.)
• Databib.org (U.S.A.)
• National Library of New Zealand (New Zealand)
• OCLC (U.S.A.)
• RMIT Publishing (Australia)
• University of North Dakota (U.S.A.)
FAST at OCLC
• WorldCat - January 2015
– 76 million records enhanced with FAST
• WorldCat Entities > WorldCat Works
– Experimental WorldCat Linked Data (includes DDC, FAST, VIAF
and LCSH URIs)
• Experimental applications (OCLC Research)
– Classify
– WorldCat Identities
– mapFAST
– …
FAST in Classify
What’s new?
• FAST geographic headings in VIAF
• Synchronizing FAST forms with LCGFT
• FAST Changes page
– http://fast.oclc.org/fastChanges/
What’s next?
• Implementation of Machine-generated Metadata
Provenance field (MARC 883) in 6xx headings in
WorldCat (expected August 2015)
– Preserve user-added FAST headings
– Facilitate updating of machine-generated headings
• Under consideration/development
– User-defined subsets
– Support for local authority files
– More links to Wikipedia
FAST Team
• OCLC Research
– Rick Bennett, Eric Childress, Kerre Kammerer,
Diane Vizine-Goetz
• WorldCat Quality Management
– Robert Bremer, Linda Gabel
Links
• Project page
– http://www.oclc.org/research/themes/datascience/fast.html
• Tools
– http://fast.oclc.org/searchfast/
– http://experimental.worldcat.org/fast/fastconverter/
– http://experimental.worldcat.org/fast/assignfast/ (+ API)
– http://experimental.worldcat.org/fast/ (+ API)
• Datasets
– http://www.oclc.org/research/themes/datascience/fast/download.html
We Welcome Your Engagement
https://twitter.com/OCLC/lists/oclc-research
http://www.oclc.org/research
https://www.facebook.com/OCLCResearch
• News & events
• Reports & presentations
http://www.slideshare.net/oclcr
http://hangingtogether.org/
http://youtube.com/oclcresearch