Open Data and Open Code for S&T Assessment

Dr. Katy Börner
Cyberinfrastructure for Network Science Center, Director
Information Visualization Laboratory, Director
School of Library and Information Science
Indiana University, Bloomington, IN
katy@indiana.edu
With special thanks to Kevin W. Boyack, Micah Linnemeier,
Russell J. Duhon, Patrick Phillips, Joseph Biberstine, Chintan Tank,
Nianli Ma, Angela M. Zoss, Hanning Guo, Mark A. Price,
Scott Weingart
Northwestern Institute on Complex Systems (NICO) Annual Conference
Northwestern University, IL
September 3, 2009
Overview
Science of Science Studies
Science of Science Cyberinfrastructure (http://sci.slis.indiana.edu):
• Scholarly Database (SDB) (http://sdb.slis.indiana.edu) that provides free access to 23 million scholarly records
• Sci2 Tool which reads SDB data and supports the identification of activity bursts, the extraction and display of co-author/inventor/investigator networks, and topical analysis, among others.
Mapping Science Exhibit
Computational Scientometrics:
Studying Science by Scientific Means
• Börner, Katy, Chen, Chaomei, and Boyack, Kevin. (2003). Visualizing Knowledge Domains. In Blaise Cronin (Ed.), Annual Review of Information Science & Technology, Medford, NJ: Information Today, Inc./American Society for Information Science and Technology, Volume 37, Chapter 5, pp. 179-255. http://ivl.slis.indiana.edu/km/pub/2003-borner-arist.pdf
• Shiffrin, Richard M. and Börner, Katy (Eds.) (2004). Mapping Knowledge Domains. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl_1). http://www.pnas.org/content/vol101/suppl_1/
• Börner, Katy, Sanyal, Soma and Vespignani, Alessandro (2007). Network Science. In Blaise Cronin (Ed.), Annual Review of Information Science & Technology, Information Today, Inc./American Society for Information Science and Technology, Medford, NJ, Volume 41, Chapter 12, pp. 537-607. http://ivl.slis.indiana.edu/km/pub/2007-borner-arist.pdf
• Börner, Katy, Ma, Nianli, Duhon, Russell Jackson & Zoss, Angela. (2009). Science & Technology Assessment Using Open Data and Open Code. IEEE Intelligent Systems. Vol. 24(4), 78-81, IEEE Computer Systems.
• Places & Spaces: Mapping Science exhibit, see also http://scimaps.org.
Computational Scientometrics Opportunities
Advantages for Funding Agencies
• Supports monitoring of (long-term) money flow and research developments, evaluation of funding strategies for different programs, decisions on project durations, funding patterns.
• Staff resources can be used for scientific program development, to identify areas for future development, and the stimulation of new research areas.
Advantages for Researchers
• Easy access to research results, relevant funding programs and their success rates, potential collaborators, competitors, related projects/publications (research push).
• More time for research and teaching.
Advantages for Industry
• Fast and easy access to major results, experts, etc.
• Can influence the direction of research by entering information on needed technologies (industry-pull).
Advantages for Publishers
• Unique interface to their data.
• Publicly funded development of databases and their interlinkage.
For Society
• Dramatically improved access to scientific knowledge and expertise.
Process of Computational Scientometrics
Börner, Katy, Chen, Chaomei, and Boyack, Kevin. (2003) Visualizing Knowledge Domains. In Blaise Cronin (Ed.), Annual Review of Information Science & Technology, Volume 37, Medford, NJ: Information Today, Inc./American Society for Information Science and Technology, chapter 5, pp. 179-255.
Latest ‘Base Map’ of Science
Kevin W. Boyack, Katy Börner, & Richard Klavans (2007). Mapping the Structure and Evolution of Chemistry Research. 11th International Conference on Scientometrics and Informetrics. pp. 112-123.
• Uses combined SCI/SSCI from 2002
  - 1.07M papers, 24.5M references, 7,300 journals
  - Bibliographic coupling of papers, aggregated to journals
• Initial ordination and clustering of journals gave 671 clusters
• Coupling counts were reaggregated at the journal cluster level to calculate the
  - (x,y) positions for each journal cluster
  - by association, (x,y) positions for each journal
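The coupling step can be illustrated with a minimal sketch. It counts shared references between papers (bibliographic coupling) and sums those counts per pair of journals. The toy records, the function names, and the choice to ignore coupling within a single journal are assumptions made for illustration; they are not taken from the cited study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: paper -> (journal, set of cited reference IDs).
papers = {
    "p1": ("J_A", {"r1", "r2", "r3"}),
    "p2": ("J_B", {"r2", "r3", "r4"}),
    "p3": ("J_B", {"r3", "r5"}),
    "p4": ("J_C", {"r6"}),
}

def paper_coupling(papers):
    """Count shared references for every pair of papers (bibliographic coupling)."""
    counts = {}
    for a, b in combinations(sorted(papers), 2):
        shared = len(papers[a][1] & papers[b][1])
        if shared:
            counts[(a, b)] = shared
    return counts

def journal_coupling(papers):
    """Sum paper-level coupling counts over each unordered pair of distinct journals."""
    totals = Counter()
    for (a, b), shared in paper_coupling(papers).items():
        ja, jb = papers[a][0], papers[b][0]
        if ja != jb:  # assumption: coupling inside one journal is ignored
            totals[tuple(sorted((ja, jb)))] += shared
    return dict(totals)

print(paper_coupling(papers))
print(journal_coupling(papers))
```

The journal-level counts would then feed whatever ordination and clustering method is used; that step is not sketched here.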
[Figure: base map of science. Region labels: Math, Law, Computer Tech, Policy, Statistics, Economics, CompSci, Vision, Education, Phys-Chem, Chemistry, Physics, Psychology, Brain, Environment, Psychiatry, MRI, Biology, BioMaterials, BioChem, Microbiology, Plant, Cancer, Animal, Disease & Treatments, Virology, Infectious Diseases]
Science map applications: Identifying core competency
Kevin W. Boyack, Katy Börner, & Richard Klavans (2007).
Funding patterns of the US Department of Energy (DOE)
[Figure: DOE funding patterns shown on the base map of science. Region labels: Math, Law, Computer Tech, Policy, Statistics, Economics, CompSci, Vision, Education, Phys-Chem, Chemistry, Physics, Psychology, Brain, Environment, Psychiatry, GeoScience, MRI, Biology, GI, Bio-Materials, BioChem, Microbiology, Plant, Cancer, Animal, Virology, Infectious Diseases]
Science map applications: Identifying core competency
Kevin W. Boyack, Katy Börner, & Richard Klavans (2007).
Funding Patterns of the National Science Foundation (NSF)
[Figure: NSF funding patterns shown on the base map of science. Region labels: Math, Law, Computer Tech, Policy, Statistics, Economics, CompSci, Vision, Education, Phys-Chem, Chemistry, Physics, Psychology, Brain, Environment, GeoScience, Psychiatry, MRI, Biology, GI, Bio-Materials, BioChem, Microbiology, Plant, Cancer, Animal, Virology, Infectious Diseases]
Science map applications: Identifying core competency
Kevin W. Boyack, Katy Börner, & Richard Klavans (2007).
Funding Patterns of the National Institutes of Health (NIH)
[Figure: NIH funding patterns shown on the base map of science. Region labels: Math, Law, Computer Tech, Policy, Statistics, Economics, CompSci, Vision, Education, Phys-Chem, Chemistry, Physics, Psychology, Brain, Environment, Psychiatry, GeoScience, MRI, Biology, GI, Bio-Materials, BioChem, Microbiology, Plant, Cancer, Animal, Virology, Infectious Diseases]
Science map applications: Identifying core competency
Kevin W. Boyack, Katy Börner, & Richard Klavans (2007).
Funding Patterns of the National Institutes of Health (NIH)
[Figure: NIH funding patterns shown on the base map of science.]
Data:
SCI/SSCI 2002: proprietary
DOE: FOIR
NIH: http://projectreporter.nih.gov
NSF: http://www.nsf.gov/awardsearch
SciMap to DOE/NIH/NSF linkage data not available.
Algorithms/Tools:
DrL available
Mapping Indiana’s Intellectual Space
Identify
• Pockets of innovation
• Pathways from ideas to products
• Interplay of industry and academia
Data:
Proprietary
Algorithms/Tools:
Custom DB queries and code, not available.
Mapping the Evolution of Co-Authorship Networks
Ke, Visvanath & Börner (2004). Won 1st prize at the IEEE InfoVis Contest.
Data:
Available as mdb from
http://iv.slis.indiana.edu/ref/iv04contest
Algorithms/Tools:
Complete workflow with pointers to code is at
http://iv.slis.indiana.edu/ref/iv04contest
Studying the Emerging Global Brain: Analyzing and Visualizing the Impact of Co-Authorship Teams
Börner, Dall’Asta, Ke & Vespignani (2005) Complexity, 10(4):58-67.
Research question:
• Is science driven by prolific single experts or by high-impact co-authorship teams?
Contributions:
• New approach to allocate citational credit.
• Novel weighted graph representation.
• Visualization of the growth of weighted co-author network.
• Centrality measures to identify author impact.
• Global statistical analysis of paper production and citations in correlation with co-authorship team size over time.
• Local, author-centered entropy measure.
Data:
Available as mdb from
http://iv.slis.indiana.edu/ref/iv04contest
Algorithms/Tools:
Custom DB queries and code, not available.
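To make the idea of allocating citational credit on a weighted co-author graph concrete, here is a minimal sketch. It assumes a simple equal-split rule: each multi-author paper contributes (1 + citations) in total, divided evenly over its co-author pairs. This rule, the toy records, and the function name `impact_weights` are illustrative assumptions and should not be read as the exact formula of the cited paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (list of authors, citation count).
papers = [
    (["Ann", "Bob"], 3),
    (["Ann", "Bob", "Cy"], 5),
    (["Dee"], 10),
]

def impact_weights(papers):
    """Split each multi-author paper's credit (1 + citations) equally over its co-author pairs."""
    weights = defaultdict(float)
    for authors, citations in papers:
        unique = sorted(set(authors))
        if len(unique) < 2:
            continue  # single-author papers add no co-author edge
        pairs = list(combinations(unique, 2))
        share = (1 + citations) / len(pairs)
        for pair in pairs:
            weights[pair] += share
    return dict(weights)

print(impact_weights(papers))
```

Under this rule the total edge weight added by a paper equals its credit, so large teams do not gain weight merely by having more pairs.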
113 Years of Physical Review
http://scimaps.org/dev/map_detail.php?map_id=171
Bruce W. Herr II and Russell Duhon (Data Mining & Visualization), Elisha F. Hardy (Graphic Design), Shashikant
Penumarthy (Data Preparation) and Katy Börner (Concept)
Data:
Available via Scholarly Database if APS permits access
http://sdb.slis.indiana.edu/
(Bob Kelly, Director Journal Information Systems, The American Physical Society, 631-591-4064)
Algorithms/Tools:
Custom DB queries and code, not available.
Spatio-Temporal Information Production and Consumption of Major U.S. Research Institutions
Börner, Katy, Penumarthy, Shashikant, Meiss, Mark and Ke, Weimao. (2006) Mapping the Diffusion of Scholarly Knowledge Among Major U.S. Research Institutions. Scientometrics. 68(3), pp. 415-426.
Research questions:
1. Does space still matter in the Internet age?
2. Does one still have to study and work at major research institutions in order to have access to high quality data and expertise and to produce high quality research?
3. Does the Internet lead to more global citation patterns, i.e., more citation links between papers produced at geographically distant research institutions?
Contributions:
• Answer to Qs 1 + 2 is YES.
• Answer to Q 3 is NO.
• Novel approach to analyzing the dual role of institutions as information producers and consumers and to study and visualize the diffusion of information among them.
Data:
Available via Scholarly Database if you attended Sackler Colloquium on “Mapping Knowledge Domains” in May 2003
http://sdb.slis.indiana.edu/
Algorithms/Tools:
Custom DB queries and code, not available.
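Question 3 depends on the geographic distance between citing and cited institutions. The sketch below shows one way such a measure could be computed: a great-circle (haversine) distance, averaged over citation links and weighted by citation counts. The coordinates, link counts, institution names, and the weighted-mean summary are hypothetical; the study's own distance measure is not given on this slide.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

# Hypothetical institution coordinates (approximate, for illustration only).
coords = {"Inst_A": (39.17, -86.52), "Inst_B": (42.45, -76.48), "Inst_C": (42.28, -83.74)}
# Hypothetical citation links: (citing institution, cited institution, number of citations).
links = [("Inst_A", "Inst_B", 4), ("Inst_A", "Inst_C", 6), ("Inst_B", "Inst_B", 10)]

def mean_citation_distance(links, coords):
    """Citation-weighted mean distance in km between citing and cited institutions."""
    total = sum(n for _, _, n in links)
    weighted = sum(n * haversine_km(*coords[s], *coords[t]) for s, t, n in links)
    return weighted / total

print(round(mean_citation_distance(links, coords), 1))
```

Comparing this mean across publication years would indicate whether citation links are becoming more or less local over time.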
Mapping Topic Bursts
Co-word space of the top 50 highly frequent and bursty words used in the top 10% most highly cited PNAS publications in 1982-2001.
Mane & Börner. (2004) PNAS, 101(Suppl. 1): 5287-5290.
Data:
Available via Scholarly Database if you attended Sackler Colloquium on “Mapping Knowledge Domains” in May 2003
http://sdb.slis.indiana.edu/
Algorithms/Tools:
Custom DB queries and code, not available.
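A co-word space rests on counts of how often selected words appear together in the same publication. The sketch below selects the most frequent words and counts their pairwise co-occurrence. It is a minimal illustration on hypothetical keyword sets; it does not perform burst detection, and the function names are assumptions rather than part of any tool named in this talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each publication reduced to its set of keywords.
docs = [
    {"protein", "gene", "cell"},
    {"gene", "cell"},
    {"protein", "receptor"},
]

def top_words(docs, k):
    """Return the k words that occur in the most documents (ties broken alphabetically)."""
    freq = Counter(w for d in docs for w in d)
    return [w for w, _ in sorted(freq.items(), key=lambda x: (-x[1], x[0]))[:k]]

def coword_counts(docs, vocab):
    """Count in how many documents each pair of vocabulary words co-occurs."""
    vocab = set(vocab)
    counts = Counter()
    for d in docs:
        for a, b in combinations(sorted(d & vocab), 2):
            counts[(a, b)] += 1
    return dict(counts)

vocab = top_words(docs, 3)
print(vocab)
print(coword_counts(docs, vocab))
```

The resulting pair counts can be treated as edge weights of a word network and laid out with any graph layout algorithm.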
Mapping Transdisciplinary Tobacco Use Research Centers Publications
Compare R01 investigator based funding with TTURC Center awards in terms of number of publications and evolving co-author networks.
Zoss & Börner, forthcoming.
Data:
NIH awards linked to resulting publications
We hope it will become available to more researchers.
Algorithms/Tools:
Custom DB queries and NWB Tool
Reference Mapper
Duhon & Börner, forthcoming.
Data:
References from NSF proposals that have been funded.
Proprietary.
Algorithms/Tools:
RefMapper is part of Sci2 Tool, shared if NSF permits.
http://sci.slis.indiana.edu
Scholarly Database
http://sdb.slis.indiana.edu
Nianli Ma
“From Data Silos to Wind Chimes”
• Create public databases that any scholar can use. Share the burden of data cleaning and federation.
• Interlink creators, data, software/tools, publications, patents, funding, etc.
La Rowe, Gavin, Ambre, Sumeet, Burgoon, John, Ke, Weimao and Börner, Katy. (2007) The Scholarly Database and Its Utility for Scientometrics Research. In Proceedings of the 11th International Conference on Scientometrics and Informetrics, Madrid, Spain, June 25-27, 2007, pp. 457-462. http://ella.slis.indiana.edu/~katy/paper/07-issi-sdb.pdf
Scholarly Database: Web Interface
Anybody can register for free to search about 23 million records and download results as data dumps.
Currently the system has over 130 registered users from academia, industry, and government from over 60 institutions and four continents.
Since March 2009:
Users can download networks:
- Co-author
- Co-investigator
- Co-inventor
- Patent citation
and tables for burst analysis in NWB.
SDB Demo
http://sdb.slis.indiana.edu
Scholarly Database: # Records, Years Covered
Datasets available via the Scholarly Database (* internally)
Dataset   # Records    Years Covered                        Updated   Restricted Access
Medline   17,764,826   1898-2008                            Yes       -
PhysRev   398,005      1893-2006                            -         Yes
PNAS      16,167       1997-2002                            -         Yes
JCR       59,078       1974, 1979, 1984, 1989, 1994-2004    -         Yes
USPTO     3,875,694    1976-2008                            Yes*      -
NSF       174,835      1985-2002                            Yes*      -
NIH       1,043,804    1961-2002                            Yes*      -
Total     23,167,642   1893-2006                            4         -
Aim for comprehensive time, geospatial, and topic coverage.
Temporal and Geospatial Coverage
Comparison with major publication data commonly used in scientometric studies
[Figure panels: NIH Grants, Medline Publications, NSF Grants, US Patents]
Sci2 Tool
http://sci.slis.indiana.edu
“Open Code for S&T Assessment”
Branded OSGi/CIShell based tool with NWB plugins and many new plugins.
[Figure panels: Geo Maps, Sci Maps, GUESS Network Vis, Hierarchical Circular Visualization, Horizontal Time Graphs]
Börner, Katy, Huang, Weixia (Bonnie), Linnemeier, Micah, Duhon, Russell Jackson, Phillips, Patrick, Ma, Nianli, Zoss, Angela, Guo, Hanning & Price, Mark. (2009). Rete-Netzwerk-Red: Analyzing and Visualizing Scholarly Networks Using the Scholarly Database and the Network Workbench Tool. Proceedings of ISSI 2009: 12th International Conference on Scientometrics and Informetrics, Rio de Janeiro, Brazil, July 14-17. Vol. 2, pp. 619-630.
Sci2 Tool
Geo Maps
Circular Hierarchy
Serving Non-CS Algorithm Developers & Users
Users
Developers
CIShell Wizards
CIShell
IVC Interface
NWB Interface
Sci2 Tool: Supported Data Formats
Personal Bibliographies
• Bibtex (.bib)
• Endnote Export Format (.enw)
Data Providers
• Web of Science by Thomson Scientific/Reuters (.isi)
• Scopus by Elsevier (.scopus)
• Google Scholar (access via Publish or Perish, save as CSV, Bibtex, EndNote)
• Awards Search by National Science Foundation (.nsf)
Scholarly Database (all text files are saved as .csv)
• Medline publications by National Library of Medicine
• NIH funding awards by the National Institutes of Health (NIH)
• NSF funding awards by the National Science Foundation (NSF)
• U.S. patents by the United States Patent and Trademark Office (USPTO)
• Medline papers – NIH Funding
Network Formats
• NWB (.nwb)
• Pajek (.net)
• GraphML (.xml or .graphml)
• XGMML (.xml)
Burst Analysis Format
• Burst (.burst)
Other Formats
• CSV (.csv)
• Edgelist (.edge)
• Pajek (.mat)
• TreeML (.xml)
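As an illustration of one of the network formats listed above, the following sketch serializes a small weighted co-author network in the basic Pajek .net layout: a `*Vertices` section with 1-based ids and quoted labels, followed by an `*Edges` section. The edge list and the helper name `to_pajek` are hypothetical, and the sketch assumes labels contain no double quotes. It is independent of the Sci2 Tool's own readers and writers.

```python
def to_pajek(edges):
    """Serialize weighted undirected edges as Pajek .net text (1-based vertex ids)."""
    labels = sorted({name for a, b, _ in edges for name in (a, b)})
    index = {name: i for i, name in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[name]} "{name}"' for name in labels]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {w}" for a, b, w in edges]
    return "\n".join(lines) + "\n"

# Hypothetical co-author edges: (author, author, number of joint papers).
edges = [("Ann", "Bob", 2), ("Bob", "Cy", 1)]
print(to_pajek(edges), end="")
```

Writing the returned string to a file with a .net extension gives a file that tools reading the basic Pajek format should accept.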
NWB=Sci2 Tool: Algorithms (July 1st, 2008)
See https://nwb.slis.indiana.edu/community and handout
NWB=Sci2 Tool: Output Formats
NWB tool can be used for data conversion. Supported output formats comprise:
• CSV (.csv)
• NWB (.nwb)
• Pajek (.net)
• Pajek (.mat)
• GraphML (.xml or .graphml)
• XGMML (.xml)
• GUESS: supports export of images into common image file formats.
• Horizontal Bar Graphs: saves out raster and ps files.
Exemplary Analyses and Visualizations
Individual Level
A. Loading ISI files of major network science researchers, extracting, analyzing and visualizing paper-citation networks and co-author networks.
B. Loading NSF datasets with currently active NSF funding for 3 researchers at Indiana U
Institution Level
C. Indiana U, Cornell U, and Michigan U, extracting, and comparing Co-PI networks.
Scientific Field Level
D. Extracting co-author networks, patent-citation networks, and detecting bursts in SDB data.
Will be presented in hands-on Workshop on Thursday Sept 3, 2009, 1-5pm, together with guidance on how to design workflows using 100+ algorithms and how to dissect and design effective visualizations.
Bonus: Create your custom tool.
Sci2 Tool Demo
http://sci.slis.indiana.edu
http://www.nsf.gov/awardsearch
Outlook
CIShell/OSGi is at the core of different CIs and a total of 169 unique plugins are used in the
- Information Visualization (http://iv.slis.indiana.edu),
- Network Science (NWB Tool) (http://nwb.slis.indiana.edu),
- Scientometrics and Science Policy (Sci2 Tool) (http://sci.slis.indiana.edu), and
- Epidemics (http://epic.slis.indiana.edu) research communities.
Most interestingly, a number of other projects recently adopted OSGi and one adopted CIShell:
Cytoscape (http://www.cytoscape.org) led by Trey Ideker, UCSD, is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data (Shannon et al., 2002).
Taverna Workbench (http://taverna.sourceforge.net) led by Carol Goble, University of Manchester, UK, is a free software tool for designing and executing workflows (Hull et al., 2006). Taverna allows users to integrate many different software tools, including over 30,000 web services.
MAEviz (https://wiki.ncsa.uiuc.edu/display/MAE/Home) managed by Shawn Hampton, NCSA, is an open-source, extensible software platform which supports seismic risk assessment based on the Mid-America Earthquake (MAE) Center research.
TEXTrend (http://www.textrend.org) led by George Kampis, Eötvös University, Hungary, develops a framework for the easy and flexible integration, configuration, and extension of plugin-based components in support of natural language processing (NLP), classification/mining, and graph algorithms for the analysis of business and governmental text corpora with an inherently temporal component.
As the functionality of OSGi-based software frameworks improves and the number and diversity of dataset and algorithm plugins increase, the capabilities of custom tools or macroscopes will expand.
Mapping Science Exhibit – 10 Iterations in 10 years
http://scimaps.org
The Power of Maps (2005)
The Power of Reference Systems (2006)
The Power of Forecasts (2007)
Science Maps for Economic Decision Makers (2008)
Science Maps for Science Policy Makers (2009)
Science Maps for Scholars (2010)
Science Maps as Visual Interfaces to Digital Libraries (2011)
Science Maps for Kids (2012)
Science Forecasts (2013)
How to Lie with Science Maps (2014)
Exhibit has been shown in 72 venues on four continents. Currently at
- NSF, 10th Floor, 4201 Wilson Boulevard, Arlington, VA
- Wallenberg Hall, Stanford University, CA
- Center of Advanced European Studies and Research, Bonn, Germany
- Science Train, Germany.
Debut of 5th Iteration of Mapping Science Exhibit at MEDIA X was on May 18, 2009 at Wallenberg Hall, Stanford University, http://mediax.stanford.edu, http://scaleindependentthought.typepad.com/photos/scimaps
The Power of Maps
Four Early Maps of Our World
VERSUS
Six Early Maps of Science
(1st Iteration of Places & Spaces Exhibit - 2005)
The Power of Reference Systems
Four Existing Reference Systems
VERSUS
Six Potential Reference Systems of Science
(2nd Iteration of Places & Spaces Exhibit - 2006)
The Power of Forecasts
Four Existing Forecasts
VERSUS
Six Potential Science ‘Weather’ Forecasts
(3rd Iteration of Places & Spaces Exhibit - 2007)
113 Years of Physical Review - Bruce W. Herr II, Russell Duhon, Katy Börner, Elisha Hardy, Shashikant Penumarthy - 2007
Maps of Science: Forecasting Large Trends in Science - Richard Klavans, Kevin Boyack - 2007
Science Maps for
Economic Decision Making
Four Existing Maps
VERSUS
Six Science Maps
(4th Iteration of Places & Spaces Exhibit - 2008)
Science Maps for
Science Policy Making
Four Existing Maps
VERSUS
Six Science Maps
(5th Iteration of Places & Spaces Exhibit - 2009)
A Clickstream Map of Science – Bollen, Johan, Herbert Van de Sompel, Aric Hagberg, Luis M.A. Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva - 2008
Council for Chemical Research - Chemical R&D Powers the U.S. Innovation Engine. Washington, DC. Courtesy of the Council for Chemical Research - 2009
Additional Elements of the Exhibit
Illuminated Diagram Display
Hands-on Science Maps for Kids
Worldprocessor Globes
Illuminated Diagram Display
W. Bradford Paley, Kevin W. Boyack, Richard Klavans, and Katy Börner (2007)
Mapping, Illuminating, and Interacting with Science. SIGGRAPH 2007.
Questions:
• Who is doing research on what topic and where?
• What is the ‘footprint’ of interdisciplinary research fields?
• What impact have scientists?
Large-scale, high resolution prints illuminated via projector or screen.
Interactive touch panel.
Contributions:
• Interactive, high resolution interface to access and make sense of data about scholarly activity.
Science Maps in “Expedition Zukunft” science train visiting 62 cities in 7 months, 12 coaches, 300 m long.
Opening was on April 23rd, 2009 by German Chancellor Merkel, http://www.expedition-zukunft.de
This is the only mockup in this slide show.
Everything else is available today.
All papers, maps, cyberinfrastructures, talks, press are linked from http://cns.slis.indiana.edu