CERIF/euroCRIS and Elsevier
Where do they meet?

Prague, November 9, 2010
M’hamed el Aisati, Head of Product Technology, S&T Elsevier

Outline
• What’s Elsevier (S&T) from a data and technology perspective?
• Data types
• Data processing
• Data Technology adopted and deployed at Elsevier
• From Elsevier data models to CERIF
• Is there a role for a publisher?
• What role
• More opportunities

Elsevier S&T = Scientific Data + Technology + much more
• > 43 M abstracts (A&I)
• > 10 M full-text articles
• > 55 K main organization profiles
• > 20 M author profiles
• > 60 M patents
• > 1 M awarded grants
• > 10 years of ScienceDirect and > 5 years of Scopus usage/analytics data
• > 10 K books
• > 500 M quality scientific web pages

Scopus coverage
A rich and extended coverage, including:
• Abstracts and citations from 5,000 publishers (ELS 15%)
• 3.6 million conference papers (10% of Scopus records)
• “Articles in Press” from more than 3,000 titles
• 23 million patents
• 1,200 Open Access journals
• 80% of all Scopus records have an abstract
• Abstracts going back to 1823 (Scopus includes all historical material of ELS, Springer, ACS, AIP, Nature, Science, etc.)
• Nearly 2,700 Arts & Humanities titles
• 430 million integrated scientific websites via Scirus.com
Nearly 18,000 titles, including:
• 16,500 peer-reviewed titles
• 600 trade journals
• 350 book series
• Extensive conference proceedings
• 40 languages are covered
Scopus info on www.info.scopus.com

Scientific Data + Technology provides extra value

“Your companion for a scientific life”
• Department head
• Librarian
• Researcher
• Funding agent
• Manager/Admin
• Dean/Provost

Performance Evaluation: Baden-Württemberg

Australian Research Council – ERA 2010
More info on: http://www.arc.gov.au/era/default.htm
Assessment of research quality within Australia's higher education institutions using a combination of indicators and expert review by committees comprising experienced, internationally recognized experts.
ERA uses leading researchers to evaluate research in eight discipline clusters. ERA will detail areas within institutions and disciplines that are internationally competitive, and will point to emerging areas where there are opportunities for development and further investment.
• Early January 2010 – Aug/Sep 2010
• First trial (PCE) in 2009
• Scopus selected as source information provider and partner

Australian Research Council – ERA 2010
3 main components:
• EID tagging
• Dedicated web service (API)
• Reports:
  » Citation Benchmark report (cpp)
  » Centile threshold report
  » Ranked journal ‘Indicative World Distribution’ Benchmark Report

ARC – Scopus – Universities interaction
• EID tagging process
• Dedicated Web Service

Database technologies at Elsevier (1)
XML native database for the large bulk of data, e.g. full-text articles, abstracting and indexing records
• No ETL process involved
• “Search Interface” as top layer for retrieving data – XQueries instead of SQL queries
• No (upfront) data modelling is required
• Leverages and retains the original XML structure
• Multiple DTDs and schemas supported concurrently. A DTD or schema is not a prerequisite for data loading
• With XQuery whole web applications can be built, i.e. no integration with an additional web programming language (e.g. php, javascript, etc.)
Though:
• An expensive technology
• Straightforward loading and querying of huge amounts of data might be challenging
• Requires specific skills

Database technologies at Elsevier (2)
RDBMS databases for lightweight information, e.g. article and journal metadata
• Known and established technology (e.g. SQL)
• Typically the heavy lifting is done at the ETL stage in order to boost query performance
• Plenty of open source choice and thus free (e.g. MySQL), low threshold for adoption
• Ideal for small amounts of information
• ETL process can be lengthy
• XML structure is ‘lost’ once data is loaded. A separate DTD or schema is required for exporting data
• SQL is typically a back-end technology. Front-end (web) application programming requires a different language (e.g. php, jsp, asp)
• Data modelling is required. Updating the data model usually requires data re-loading

Elsevier logical fit

Some data models at Elsevier
• Authors are disambiguated and profiled. Unique and persistent identifier
• Affiliations are disambiguated and profiled. Unique and persistent identifier
• Backward and forward citations captured through reference linking
• Funding data aggregated to affiliations

Simple relational data model example
Covers publications, journals, classifications (disciplines), authors, affiliation, journal metrics, citations, etc.
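To make the relational model on this slide more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not Elsevier's actual schema: every table name, column name and sample row below is a hypothetical chosen for illustration. It only shows how publications, journals, authors, affiliations and citations could be linked by identifiers, and how a citation count per author falls out of a join.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions,
# not the data model used at Elsevier.
SCHEMA = """
CREATE TABLE journal     (journal_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE affiliation (affiliation_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE author      (author_id INTEGER PRIMARY KEY, indexed_name TEXT NOT NULL);
CREATE TABLE publication (
    publication_id INTEGER PRIMARY KEY,
    journal_id     INTEGER NOT NULL REFERENCES journal(journal_id),
    title          TEXT NOT NULL,
    pub_year       INTEGER NOT NULL
);
-- One row per (publication, author), with the affiliation used on that paper.
CREATE TABLE authorship (
    publication_id INTEGER NOT NULL REFERENCES publication(publication_id),
    author_id      INTEGER NOT NULL REFERENCES author(author_id),
    affiliation_id INTEGER REFERENCES affiliation(affiliation_id),
    seq            INTEGER NOT NULL,
    PRIMARY KEY (publication_id, author_id)
);
-- Reference linking: one row per citing -> cited pair.
CREATE TABLE citation (
    citing_id INTEGER NOT NULL REFERENCES publication(publication_id),
    cited_id  INTEGER NOT NULL REFERENCES publication(publication_id),
    PRIMARY KEY (citing_id, cited_id)
);
"""


def build_demo_db():
    """Create an in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO journal VALUES (1, 'Demo Journal')")
    conn.execute("INSERT INTO affiliation VALUES (10, 'Demo University')")
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(100, "Doe J."), (101, "Roe R.")],
    )
    conn.executemany(
        "INSERT INTO publication VALUES (?, 1, ?, ?)",
        [(1000, "Paper A", 2008), (1001, "Paper B", 2009), (1002, "Paper C", 2010)],
    )
    conn.executemany(
        "INSERT INTO authorship VALUES (?, ?, 10, ?)",
        [(1000, 100, 1), (1001, 100, 1), (1001, 101, 2), (1002, 101, 1)],
    )
    conn.executemany(
        "INSERT INTO citation VALUES (?, ?)",
        [(1001, 1000), (1002, 1000), (1002, 1001)],
    )
    return conn


def citations_per_author(conn):
    """Return (indexed_name, cited_by) pairs, summed over each author's papers."""
    return conn.execute(
        """
        SELECT a.indexed_name, COUNT(c.citing_id) AS cited_by
        FROM author a
        JOIN authorship s ON s.author_id = a.author_id
        LEFT JOIN citation c ON c.cited_id = s.publication_id
        GROUP BY a.author_id
        ORDER BY a.author_id
        """
    ).fetchall()
```

Calling `citations_per_author(build_demo_db())` on the made-up rows returns `[('Doe J.', 3), ('Roe R.', 1)]`. The LEFT JOIN keeps papers that have not been cited yet, so an author with only uncited papers would still appear with a count of 0.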

Affiliation Profile XML snippet
Callout: unique and persistent affiliation ID (the affiliation-id attribute).

    <xocs:doc content-type="Profile" dbname="scopusbase"
        xsi:schemaLocation="http://www.elsevier.com/xml/xocs/dtd xocs-ip502.xsd"
        xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <xocs:meta>
        <xocs:eid>10-s2.0-101718729</xocs:eid>
        <xocs:timestamp>2009-01-09T13:04:06.067735-05:00</xocs:timestamp>
      </xocs:meta>
      <xocs:institution-profile>
        <institution-profile affiliation-id="101718729">
          <status>update</status>
          <date-created year="2008" month="02" day="03"/>
          <date-revised year="2008" month="05" day="14" timestamp="2008-05-14T00:05:34.000034+01:00"/>
          <date-revised year="2008" month="06" day="30" timestamp="2008-06-30T02:09:24.000024+01:00"/>
          <date-revised year="2009" month="01" day="01" timestamp="2009-01-01T13:57:41.000041+00:00"/>
          <date-revised year="2009" month="01" day="09" timestamp="2009-01-09T17:44:11.000011+00:00"/>
          <preferred-name>Balearic Islands Government</preferred-name>
          <sort-name>Balearic Islands Government</sort-name>
          <name-variant>Balearic Islands Government</name-variant>
          <name-variant>Govern Balear</name-variant>
          <name-variant>Govern de les Illes Balears</name-variant>
          <address country="es">
            <address-part>C/. Foners 10</address-part>
            <city>Palma</city>
            <postal-code>07006</postal-code>
          </address>
          ……

Author Profile XML snippet
Callouts: unique and persistent author ID (the id attribute); reference to affiliation (the affiliation-id attributes).

    <author-profile id="7401581436" type="author" suppress="false">
      …..
      <preferred-name>
        <initials>A.W.</initials>
        <indexed-name>MacDonald A.</indexed-name>
        <surname>MacDonald</surname>
        <given-name>Alistair W.</given-name>
      </preferred-name>
      <name-variant>
        <initials>A.W.</initials>
        <indexed-name>Macdonald A.</indexed-name>
        <surname>MacDonald</surname>
        <given-name>A. W.</given-name>
      </name-variant>
      …
      <classificationgroup>
        <classifications type="ASJC">
          <classification frequency="7">1306</classification>
          <classification frequency="1">1315</classification>
          <publication-range start="1989" end="2009"/>
      …
      <journal-history type="author">
        <journal type="j">
          <sourcetitle>Clinical Cancer Research</sourcetitle>
      …..
      <affiliation-current>
        <affiliation affiliation-id="106499546" parent="60019718"/>
      </affiliation-current>
      <affiliation-history>
        <affiliation affiliation-id="104228751" parent="60024340"/>
      </affiliation-history>
    </author-profile>

Publication XML snippet
Callouts: unique and persistent publication ID (the itemid element); reference to author (the auid attribute); reference to affiliation (the afid attribute).

    <bibrecord>
      <item-info>
        <copyright type="Elsevier">Copyright 2008 Elsevier B.V.,All rights reserved.</copyright>
        <itemidlist><itemid idtype="SCP">34147094726</itemid>
        <history><date-created year="2007" month="04" day="18"/></history>
        <dbcollection>SNCABS</dbcollection>
        <dbcollection>Scopusbase</dbcollection>
      </item-info>
      <head>
        <citation-info>
          <citation-type code="ar"/>
          <citation-language xml:lang="en"/>
        …..
        <author seq="3" auid="7003372933">
          <ce:initials>P.</ce:initials>
          <ce:indexed-name>Barret P.</ce:indexed-name>
          <ce:surname>Barret</ce:surname>
          <ce:given-name>Pierre</ce:given-name>
          <preferred-name>
            <ce:initials>P.</ce:initials>
            <ce:indexed-name>Barret P.</ce:indexed-name>
            <ce:surname>Barret</ce:surname>
            <ce:given-name>Pierre</ce:given-name>
          </preferred-name>
          <ce:e-address type="email">XXX@YYYY.inra.fr</ce:e-address>
        </author>
        <affiliation country="fr" afid="60001542">
          <organization>Plateforme de Transgénèse du Blé</organization>
          <organization>UMR ASP 1095 INRA</organization>
          <organization>Université Blaise Pascal</organization>
          <city-group>63100 Clermont-Ferrand</city-group>
        </affiliation>
      <references count="27">
        …..
      </references>
    </bibrecord>

Scopus Custom Data
Example of XML data

    <author-group>
      <author seq="1" auid="7005613516">
        <ce:initials>A.</ce:initials>
        <ce:indexed-name>Rothschild A.</ce:indexed-name>
        <ce:surname>Rothschild</ce:surname>
        <ce:given-name>Avner</ce:given-name>
        <preferred-name>
          <ce:initials>A.</ce:initials>
          <ce:indexed-name>Rothschild A.</ce:indexed-name>
          <ce:surname>Rothschild</ce:surname>
          <ce:given-name>Avner</ce:given-name>
        </preferred-name>
        <ce:e-address type="email">avner@mit.edu</ce:e-address>
      </author>
      <author seq="2" auid="8625399100">
        <ce:initials>S.J.</ce:initials>
        …

Custom Data is:
• A big bucket of highly structured XML items
• Extracted directly from Scopus
• Accompanied by the articles’ cited-by counts
• Supported by extensive documentation and test data upon request
• Delivered via FTP or shipped on portable (USB) drives
• Scopus contains ~42 million items
• In principle all articles can be ordered
• Custom Data can be grouped using the following criteria:
  • On ASJC code (All Science Journal Classification Code) (see next slide)
  • Per country
  • List of countries
  • Per year
  • Range of years
• Further refining possible in close cooperation with the Product Team
• Certain fields can be taken out if preferred:
  • Abstracts
  • References
  • Etc.

A wide variety of Web Services and APIs
• SOAP and REST: simple and accessible to low-level development
• Different service levels supported
• Access to different content types:
  • Hub
  • ScienceDirect articles
  • Scopus abstracts, author profiles, affiliation profiles
• Both search and retrieval
• XML and other formats supported

A significant part of research information is held by publishers
• Publishers have lots of info about publications and researchers
• Publishers have been dealing with research info for many years
• Early adopters of XML and database technologies
• At the forefront of changes taking place in the research area
• More and more publishers, certainly Elsevier, are working closely with institutions on topics related to research information management and performance evaluation

euroCRIS and CERIF as seen by Elsevier
• CERIF as a standardized format is a great initiative
• Elsevier is happy to partner with euroCRIS to improve, maintain and update the ‘standard’
• Elsevier on the other hand is “agnostic” to CRIS implementations
• What is the future of data models moving forward with evolving technologies? Do you need one today?
• Do you care about how systems are implemented and set up?
• Shouldn’t the focus be on the interface/exchange layer?
• With web services according to a standard (CERIF), back-end systems are less relevant

Opportunities for euroCRIS and Elsevier
• Work collaboratively on further standardization of CERIF
• Ensure completeness of research information exchanged through CERIF
• Adopt CERIF as one of the exporting formats straight into local systems (CRIS or non-CRIS)
• Elsevier and euroCRIS to help accelerate research community management and the population of local systems and repositories
• Expand CERIF to include metric-based report information for performance evaluation
• Exchange technology and knowledge for potential CRIS implementation recommendations
• Accelerate integration of Elsevier and other vendors’ products and their data with local systems (e.g. HR, etc.)

Thanks
For questions and/or follow up:
M’hamed el Aisati
m.aisati@elsevier.com