Lab I – READ Product Description
Andrew Moss
CS411
Janet Brunelle
March 18, 2013
Version 2

Table of Contents
1 Introduction
2 READ Product Description
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 Identification of Case Study
4 READ Product Prototype Description
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
Glossary
References

List of Figures
Figure 1 - Major Functional Component Diagram

List of Tables
Table 1 - Side-by-side Comparison of Real World Product and Prototype

1 Introduction

Publications are the primary method of distributing the results of research. There are approximately 4,600 universities (NCES, 2011) that “account for more than half of the basic research conducted in the United States” (McRobbie, 2012). Unfortunately, many of these institutions lack an efficient online resource for organizing and displaying both the publications resulting from their research and information about the grants that helped finance it. Such a system would provide research universities and the departments therein, as well as the students and professors performing the research, with increased recognition and awareness of their work.

One example of a university in need of an improved publication system is Old Dominion University (ODU) in Norfolk, Virginia. Its Computer Science Department (ODUCS), in particular, would benefit a great deal from a well-maintained online system for publications and grants, as it lacks one entirely. The department’s professors are burdened with manually updating their own web pages to publicize their recent publications. In the past, a single web page for the entire department was maintained by an individual member of its Systems Group. However, this page was last updated in 2008, likely a result of the slow, tedious, and manual nature of the process.

The team behind READ, a Repository for Electronic Aggregation of Documents, intends to alleviate the lack of quality online resources for displaying publications and grants. The READ system will use a scraper to provide researchers with a method of organizing their publications and grants in a format that allows for easy searching, sorting, filtering, and browsing.
Additionally, content authors will be able to verify that the listed publications are actually their own work in the event that READ mistakenly shows something written by another researcher with the same name.

A prototype READ system will be developed for ODUCS as a proof of concept that demonstrates the product's most basic capabilities. A prototype is necessary due to time constraints placed on development. This prototype will provide public and private user interfaces to publication and grant databases, user controls for publication verification, and, most importantly, a scraper that will gather links to publications automatically at set intervals to minimize manual effort.

2 READ Product Description

READ is an automated system that uses a database to store links to articles, the publications themselves, and information about the associated grants. It will allow anyone with Internet access to browse the lists of publications and filter them by author, date, keywords, and publication type. It will minimize manual effort on the part of authors by automatically finding their publications, making it easier to manage the work they have already done. This will allow faculty to spend more time actually conducting the research that attracts new students and funding alike. Lastly, it will maximize the amount of publication and grant information available to the public.

2.1 Key Product Features and Capabilities

The most essential element of the READ solution is the Schaefer Scraper. This is an algorithm developed by Andrew Schaefer, a graduate student at ODU, with the help of several other ODUCS graduate students. The Schaefer Scraper combs external websites looking for publications written by a specific author. It then extracts relevant information such as the title, attributed authors, date, and the type of publication.
This information is then inserted into a database along with a link to the page where the publication can be accessed. After the database has been updated, READ will send a notification e-mail to the associated author so that newly found documents can be verified with a single mouse click. Based on the responses to these verification e-mails, READ will learn when a publication is likely to have been written by a different author of the same name, and will decide on its own whether the publication truly belongs to the associated author.

The Schaefer Scraper will also extract information about grants from external websites. In addition to filtering the displayed publications as discussed in Section 2, READ will allow viewers to filter the displayed grants by amount, grant status, funding agency, and principal investigator. Each faculty member will have a publicly available profile that will display his or her name, title, organization, and homepage. The profile will also contain graphical representations of the number of publications created and the amount of grant funding received, along with a filterable list of the author’s publications.

2.2 Major Components (Hardware/Software)

As can be seen in Figure 1, the foundation of the READ solution is a single server, which can be physical or virtual. It will be home to three main software components: a web interface, a publication and grant link database, and Schaefer's Scraper. Implementing this solution also requires web server and SQL database server software, both of which can be kept on the same host if desired.

[Figure 1 – Major Functional Component Diagram]

The web interface contains both public and private sections. The latter will be accessible only to document authors and administrative staff.
Access to this section will be strictly protected by requiring user authentication before it can be viewed. Figure 1 shows Dr. Michele Weigle, the READ team’s mentor, as an example author using the READ system.

The next component of the READ solution is the publication link database. The database's primary function will be to provide links to externally located publications and grant information. It will also contain files uploaded directly by the authors.

The last and, undoubtedly, most important element of READ is Schaefer's Scraper. This is an automated tool that will regularly comb a specific list of external web sites for new publications submitted by a known list of authors. Both of these lists are provided as input in XML files. The algorithm consists of nested loops: for each author, and for each external web site, the Scraper will search for publications by the author, parse the results, and export them to the READ link database.

3 Identification of Case Study

The initial customer for READ is Old Dominion University's Computer Science Department, primarily because the solution was requested by the team’s mentor, Dr. Michele Weigle, a prominent professor in the department. According to its website, the department has a total of 37 faculty members, 11 currently enrolled Ph.D. students, and 111 currently enrolled Master's students. That is potentially 159 authors who could benefit from a system that would make it easy for others to find their research.

After successful testing at ODUCS, READ could then be used by other departments at Old Dominion University. Even further into the future, this solution could potentially be used by other universities, governments, or non-profit research institutions. Numerous organizations could benefit from making their publications easier to manage.
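The nested-loop algorithm that Section 2.2 describes for Schaefer's Scraper can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the XML element names (`authors`, `author`, `sites`, `site`), the helper names (`load_list`, `search_site`, `run_scraper`), and the placeholder record are all hypothetical, and the search step is a stand-in that performs no network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical input documents; the text only states that both lists are XML.
AUTHORS_XML = """<authors>
  <author>Michele Weigle</author>
  <author>Andrew Schaefer</author>
</authors>"""

SITES_XML = """<sites>
  <site>http://scholar.google.com</site>
  <site>http://academic.research.microsoft.com/</site>
</sites>"""


def load_list(xml_text, tag):
    """Return the text of every <tag> element in the XML document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]


def search_site(site, author):
    """Stand-in for the real search-and-parse step (assumption: no network).

    Returns a list of publication records found for `author` on `site`.
    """
    return [{"author": author, "site": site, "title": "placeholder title"}]


def run_scraper(authors, sites, search=search_site):
    """For each author, for each site: search, parse, and collect results."""
    found = []
    for author in authors:
        for site in sites:
            for record in search(site, author):
                found.append(record)  # a real system would export to the database
    return found


authors = load_list(AUTHORS_XML, "author")
sites = load_list(SITES_XML, "site")
records = run_scraper(authors, sites)
print(len(records), "records collected")
```

Passing the search function as a parameter keeps the loop structure separate from any one site's query format, so a site-specific parser could be swapped in without touching the loops.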
4 READ Product Prototype Description

As mentioned in Section 1, due to time constraints, it will be necessary to develop a prototype to display the most basic functionality of the READ solution. The READ prototype will use data from real ODUCS authors, and the database will be populated with their publications by Schaefer’s Scraper. It will offer nearly the same functionality as the Real World Product (RWP). For the same reason, the prototype will not feature graphical representation of data about publications and grants, nor will it implement a learning algorithm to automatically decide whether a publication likely belongs to a specific author. This is shown in Table 1.

Table 1 – Side-by-side Comparison of Real World Product and Prototype

| Features | Real World Product | Prototype |
| Browsing Capabilities | Ability to browse all grants and publications. | Ability to browse all grants and publications. |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page. | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts. | Linked to Old Dominion University Computer Science accounts. |
| Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, and publications and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, and publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

4.1 Prototype Architecture (Hardware/Software)

The prototype will use the same major functional components as the RWP, as seen in Figure 1. The server will be a virtual machine (VM) running Debian Linux. The web interface will be served via open source web server software running on the VM. The database will be stored and served from the same virtual machine using open source database software.

Schaefer’s Scraper is existing software, written in PHP, with which READ will interface. The scraper has a list of external sites that it searches for publications. This list is hard-coded in the scraper itself. The results of the search are output in HTML. A method of exporting the scraper’s output to a database-friendly format will need to be developed. Finally, the user interfaces will need to be coded as described in Section 2.2.
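The export step that Section 4.1 calls for, turning the scraper's HTML output into database rows, might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the shape of the HTML (a list of anchor tags), the `publication_links` table and its columns, the `example.org` URLs, and the use of Python's built-in SQLite in place of whichever open source database the prototype adopts.

```python
import sqlite3
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def export_to_db(html, author, conn):
    """Parse scraper HTML and store each link; returns the number parsed."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    # Hypothetical schema: one row per publication link, unverified by default.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication_links ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "verified INTEGER DEFAULT 0)"
    )
    # Parameterized statement; INSERT OR IGNORE skips links already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO publication_links (url, title, author) "
        "VALUES (?, ?, ?)",
        [(url, title, author) for url, title in parser.links],
    )
    conn.commit()
    return len(parser.links)


# Hypothetical scraper output used only to exercise the sketch.
SAMPLE_HTML = """<ul>
  <li><a href="http://example.org/paper1.pdf">First Example Paper</a></li>
  <li><a href="http://example.org/paper2.pdf">Second Example Paper</a></li>
</ul>"""

conn = sqlite3.connect(":memory:")
export_to_db(SAMPLE_HTML, "Example Author", conn)
export_to_db(SAMPLE_HTML, "Example Author", conn)  # repeat run adds no rows
rows = conn.execute(
    "SELECT url, title, verified FROM publication_links ORDER BY url"
).fetchall()
print(rows)
```

Keying the table on the URL means a scraper that runs at set intervals can re-export the same results without creating duplicates, and the `verified` flag leaves room for the author confirmation step described in Section 2.1.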
4.2 Prototype Features and Capabilities

The primary feature of the prototype is the automation provided by Schaefer’s Scraper. Manual effort is the most significant hindrance to maintaining up-to-date lists of grants and publications. Automation also removes the considerable capacity for human error.

The prototype will allow anyone on the Internet to browse all grants and publications. While browsing, the viewer will be able to apply a variety of filters to the information displayed, as described in Section 2. The viewer will also have access to thumbnail images associated with publications and grants. Authors will be able to log in to a private interface using their ODUCS Unix/Linux credentials. Through this private interface they will be able to manage the publications and grants associated with their accounts.

4.3 Prototype Development Challenges

The biggest obstacle in the development of the prototype will be the fact that Schaefer’s Scraper, as provided to the team, is completely non-functional. It is poorly documented PHP that should be rewritten in Python. The scraper will also need to be integrated with the publication and grant links database. Lastly, the limited amount of time available for the development of the prototype will also be a formidable obstacle.

Glossary

Administrator/Administrative User: A user with increased privileges for editing database content.

Author: A person who is able to add and edit publications and grants in the system under their name.

BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.

Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development, which sometimes includes theory about software engineering methods.
Client application: In a client/server architecture, the module that takes input and creates queries to be processed by a server, and receives the results from the server.

Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” application and a “server” application that interact.

CSS: A style sheet language used to specify the presentation of HTML pages.

Data Mining: The act of going through a source of input to find specific information.

Database Schema: A description of the structure of a database.

Funding Agency: The source of funds for research grants. These organizations usually have a limited amount of money to award to principal investigators who submit an accepted application for research funds.

Git: A software system for controlling and organizing software versioning.

GoogleScholar (http://scholar.google.com): Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: articles, theses, books, abstracts and court opinions, from academic publishers, professional societies, online repositories, universities and other web sites. Google Scholar helps you find relevant work across the world of scholarly research.

Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., through which a user interacts with a software application via a mouse and keyboard. Used to differentiate from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.

Internet scraper / web scraper: (Wikipedia) Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.
JQuery Sparklines: A development library for the visualization of data.

MicrosoftAcademic (http://academic.research.microsoft.com/): Microsoft Academic Search is a free service developed by Microsoft Research to help scholars, scientists, students, and practitioners quickly and easily find academic content, researchers, institutions, and activities. Microsoft Academic Search indexes not only millions of academic publications, it also displays the key relationships between and among subjects, content, and authors, highlighting the critical links that help define scientific research. Microsoft Academic Search makes it easy for you to direct your search experience in interesting and heretofore hidden directions with its suite of unique features and visualizations.

MySQL: An open source relational database management system that is queried using SQL.

ODU: Old Dominion University.

Parse: A technical term usually used to describe the processing of a statement written in a programming language. May be used generally to describe the processing of any statement for specific meaning.

Perl: A widely used programming language on the server side of web applications.

PHP: A widely used programming language on the server side of web applications.

Principal Investigator (PI): The primary researcher to whom a research grant is awarded, responsible for documenting the work and publishing research results.

Publication or Academic Publication: A document created by a faculty member to share research. Publications usually appear in academic journals, technical reports, and records of conference proceedings.

Query: A request sent to the database to either change its contents or retrieve results.

READ: Repository for Electronic Aggregation of Documents.

RSS: A system for subscribing to and distributing news.

Scraper: An automated application designed to scan a source of input such as a document or a website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.

Software Compatibility: A description of whether different software packages, or versions of software, can communicate/interact.

SQL: A widely used programming language used to query databases.

SQL injection: Performing unauthorized queries on a database for malicious purposes.

User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.

Version Control: A method for organizing and recording different versions of documents that have been created over time.

Viewer: In the scope of this project, an outside person who wishes to query the information contained in the READ database.

Virtual Private Server (VPS): A software version of a hardware server. Used to create independent servers (....) on a single piece of hardware.

Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide server-side computation for browser-based client applications. A web server is a constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.

XML: Extensible Markup Language.

References

McRobbie, M. A. (2012, December 19). The Multibillion-Dollar Threat to Research Universities. From The Chronicle of Higher Education: http://chronicle.com/article/The-Multibillion-Dollar-Threat/136363/

National Center for Education Statistics. (2011). Degree-granting institutions and branches, by controls and level of institution and state or jurisdiction, 2010-11. From the Digest of Education Statistics: http://nces.ed.gov/programs/digest/d11/tables/dt11_280.asp