Running Head: Lab 1 – READ Description

Lab I – READ Product Specification
Jim Lawrence Calderon, Team Black
CS411W
Janet Brunelle
February 18, 2013
Version 3.0

Table of Contents
1 INTRODUCTION
2 PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
GLOSSARY
REFERENCES

Table of Figures
Figure 1: Major Functional Component Diagram

Table of Tables
Table 1: RWP vs. Prototype features

1 INTRODUCTION

Grants are up to 20% of a public research university’s revenue (Delta Cost Project).
Universities are awarded these grants on the strength of ongoing research that is documented in technical reports and publications. An online database of these publications is one way to attract potential students who are interested in the ongoing research at the university. However, keeping such a system updated manually is very tedious.

Old Dominion University’s (ODU) Computer Science Department is a good case study of this problem. Its current static webpage, which has not been updated since 2008, consists of a list of publications maintained by hand by an appointed staff member. To have a publication listed, an author must email the information to this person and wait for the page to be updated. ODU is not the only institution with this problem; other organizations may face similar issues.

The Repository for Electronic Aggregation of Documents (READ) is a system created by the CS411 Black Group in response to the ODU Computer Science Department’s lack of an online publication database. Because the problem is not unique to ODU, READ is not built strictly for ODU and may be integrated into the systems of any interested party. Its Schaefer Scraper component is designed to automatically collect publications by CS faculty members from multiple scholastic websites. Viewers can then organize these publications with a number of filters as they browse the material. While viewers have the ability to browse, authors and administrators will additionally be able to update information on existing publications or manually add publications when necessary. With these features, READ aims to relieve researchers of the responsibility of manually managing their publications beyond the initial upload to a single site.

2 PRODUCT DESCRIPTION

READ is an online database that will house grants, articles, statistics pertaining to them, and links to research. Viewers will be able to browse these items using a number of filters, such as the author’s name, the publication date, and various keywords. The goal is to minimize the need for authors to manage their publications themselves through the features that READ will employ. READ will also strive to advertise ongoing research more efficiently and to show available grants to whoever requires them. The goal is for the authors to work less in order to READ more.

2.1 Key Product Features and Capabilities

A major part of READ is software named the Schaefer Scraper. It searches a list of specific academic websites, defined by an administrator, for publications that match a registered author’s credentials. The pertinent information is then uploaded to the database. At this point the author receives an email notification so that they can authorize the publication and edit any mistaken information. The system will also learn when not to notify them, based on patterns in why they denied past publications. This allows all of an author’s publications to be gathered into a single location automatically, with little work on their part.

The system will allow viewers to browse the database using filters for grants, multiple types of article publications, and their authors. Viewing a user profile will show personal information, a graphical representation of the number of publications the author has created and the funding they have received, and a list of associated publications. This statistical information will show viewers a specific author’s area of expertise and level of activity.

2.2 Major Components (Hardware/Software)

READ consists of three major functional software components: the web interface, the database, and the scraper (Figure 1).
The web interface contains pages that are viewable only by administrators or registered authors and pages that are accessible to the public. Private content will be accessible only through an authorized login and will mainly deal with updating publication or grant data. The public section consists of the pages that any viewer will be able to interact with, such as the article publications, grant proposals, student list, and faculty list. The viewer interacts with the web interface by selecting filters that are used to query the database. The query results are then used to build the pages the viewer browses.

Figure 1: Major Functional Component Diagram

The second major component is a MySQL database that will contain six tables: Authors, Paper, Grants, Owns, Tags, and CO_PI. These will be populated either automatically with information obtained by the Schaefer Scraper or manually by registered authors. The database’s primary function is to store links to external publications and grant proposals, along with all the information associated with that material. Information accuracy will be verified through an email confirmation sent to the authors, who will be able to edit any incorrect data.

The third major component is the Schaefer Scraper, an automated tool that goes through predefined websites looking for new publications. It locates an author’s profile on a specified website using a unique string identifier and extracts the BibTeX information. That information is then parsed for the required data, which is stored in the database, where it can be queried and viewed through the web interface.

3 IDENTIFICATION OF CASE STUDY

This product is being built for the Old Dominion University Computer Science Department in order to replace an antiquated and unused publication page. The previous system was a list of publications sorted chronologically from the latest upload and manually updated by an assigned staff member.
It lacked any sort of filtering or listing of received grants, and it was woefully out of date. Due to the tedious work required, the page was abandoned as of 2008. As a result, publications by the faculty are not readily available in a centralized location for interested parties.

4 PRODUCT PROTOTYPE DESCRIPTION

The prototype for READ will be essentially identical to the Real World Product (RWP). It will be modeled using actual publications from the ODU Computer Science Department and hosted on a virtual machine running the Debian operating system. The prototype is required in order to demonstrate the essential features of the RWP. As illustrated in Table 1, it will not include the Sparkline graph integration in the user profiles, the learning algorithm, or the faculty list. The graphs will be visual representations of the publications and grants that an author has been associated with within the last few years. The learning algorithm will allow the system to learn to distinguish possible publications of an author based on their patterns of approving or denying past entries.

4.1 Prototype Architecture (Hardware/Software)

READ’s prototype will be very similar to the RWP, with a few features withheld. The web interface will be written in PHP and hosted on an ODU web server, with Google Chrome as the browser of choice. All of the planned sections, including the publications, grants, and faculty list, will be fully functional, while the profile pages will have limited functionality compared to the RWP. A desktop or laptop with Internet access is required to access the interface.

Feature: Browsing Capabilities
  Real World Product: Ability to browse all grants and publications.
  Prototype: Ability to browse all grants and publications.

Feature: Publication Filtering Capabilities
  Real World Product: Filtered by title, publisher, authors, publication date, date added, and keywords.
  Prototype: Filtered by title, publisher, authors, publication date, date added, and keywords.

Feature: Grant Filtering Capabilities
  Real World Product: Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
  Prototype: Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.

Feature: Add, edit, and delete publications and grants
  Real World Product: Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
  Prototype: Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.

Feature: Faculty page
  Real World Product: Lists faculty and provides a link to each person’s profile page.
  Prototype: Not included.

Feature: Login interface
  Real World Product: Linked to Old Dominion University Computer Science accounts.
  Prototype: Linked to Old Dominion University Computer Science accounts.

Feature: Profile Page
  Real World Product: Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs.
  Prototype: Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included.

Feature: Scraper
  Real World Product: Will update the system with new publications and grants and alert authors when one is added to the system under their name.
  Prototype: Will update the system with publications only and alert authors when one is added to the system under their name.

Feature: Administrative Privileges
  Real World Product: Administrators are able to edit, add, or remove anything in the system.
  Prototype: Administrators are able to edit, add, or remove anything in the system.

Table 1: RWP vs. Prototype features

The Schaefer Scraper software, which will be responsible for obtaining publications, is already coded in PHP as well and only needs to be integrated with the web interface once that is ready. Websites that will be used include Google Scholar, Scopus, and Microsoft Academic. Actual data will be scraped from these websites using the software, parsed for pertinent information, and stored in a MySQL database, which will be integrated with the web interface.

4.2 Prototype Features and Capabilities

The prototype has three major functions: obtaining publication information from multiple scholastic sites using the Schaefer Scraper, storing the data obtained this way in the database, and displaying the information for viewers to browse in the web interface. The interface will be divided into the publications page, the grants page, user profile pages, and administrative function pages. Each has filters tailored to the page. By default, both the publications page and the grants page will display their material ordered by latest upload.

The profile page will allow the author to edit their personal information, grants, publications, and citation information. Publications and grants may be added either by entering the information into the fields or by supplying a BibTeX bibliography file, from which the information can be extracted in a manner similar to the Schaefer Scraper.

4.3 Prototype Development Challenges

The predominant development challenge for the READ prototype is understanding the Schaefer Scraper and integrating it with the web interface. As it currently stands, the software is a black box, and the group has little knowledge of how it actually works. The scraped data will have to be translated before it can be used in database queries, so the manner in which the Scraper exports the data from the websites is especially important to understand.
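
To illustrate the kind of translation involved, the following Python sketch parses a single BibTeX entry into a flat record and maps it to a row that could be inserted into a publications table. It is a minimal sketch under stated assumptions, not the Schaefer Scraper’s actual code, whose internals are not described in this document. The sample entry, the function names parse_bibtex_entry and to_paper_row, and the row keys title, authors, year, and venue are all hypothetical, and only simple, non-nested field values are handled.

```python
import re

# Hypothetical sample entry for illustration; not taken from a real scraped page.
SAMPLE = """@article{doe2012example,
  author = {Doe, Jane and Roe, Richard},
  title = {An Example Title},
  journal = "Journal of Examples",
  year = 2012
}"""

# Matches the entry header, e.g. "@article{doe2012example,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches "name = {value}", "name = "value"", or "name = bareword".
# Assumption: values contain no nested braces.
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\w+))')


def parse_bibtex_entry(text):
    """Return a dict of lower-cased field names to values, or None if no entry is found."""
    head = ENTRY_RE.search(text)
    if head is None:
        return None
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    # Scan for fields starting just after the entry header.
    for match in FIELD_RE.finditer(text, head.end()):
        name = match.group(1).lower()
        # Exactly one of the three value groups is set for each match.
        value = next(g for g in match.groups()[1:] if g is not None)
        record[name] = " ".join(value.split())  # collapse internal whitespace
    return record


def to_paper_row(record):
    """Map a parsed record to a hypothetical row layout for a publications table."""
    authors = [a.strip() for a in record.get("author", "").split(" and ") if a.strip()]
    year = record.get("year", "")
    return {
        "title": record.get("title", ""),
        "authors": authors,
        "year": int(year) if year.isdigit() else None,
        "venue": record.get("journal") or record.get("booktitle") or "",
    }


if __name__ == "__main__":
    print(to_paper_row(parse_bibtex_entry(SAMPLE)))
```

Real scraped entries may contain nested braces, string concatenation, or missing fields, so a production parser would need to handle more cases than this sketch does. Keeping the parsing step separate from the row-mapping step means the mapping can be adjusted to the actual database schema without touching the parser.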

Another issue is the format of the information obtained through the Schaefer Scraper. Scraped information does not necessarily follow a single format, so parsing it for the appropriate data to insert into the database may be troublesome.

GLOSSARY

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their name and edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Client application: The module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc. that can be interacted with via a mouse and keyboard, through which a user interacts with a software application. Used to differentiate from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.
Joomla!: A content management system.
jQuery Sparklines: A development library for the visualization of data.
MySQL: An open-source relational database management system that uses SQL.
ODU: Old Dominion University.
Parse: To analyze a statement or other text and break it into its component parts for processing.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher on whom a research grant is bestowed, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal, a technical report, or a record of conference proceedings.
Query: A statement sent to the database either to change the database or to get back results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input, such as a document or a website, for pertinent information.
Server application: The module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can communicate or interact.
SQL: A widely used language for querying and managing relational databases.
SQL injection: An attack that performs unauthorized queries on a database for malicious purposes, typically by inserting SQL into user-supplied input.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Viewer: In the scope of our project, an outside person who wishes to query the information contained in the READ database.
Virtual Private Server (VPS): A software version of a hardware server, used to create an independent program that manages access to a centralized resource or service in a network on a single piece of hardware.
Webserver: A constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.

REFERENCES

"Delta Cost Project Data." The Delta Project on Postsecondary Education Costs, Productivity, and Accountability. The Delta Project, n.d. Web. 9 Feb. 2013.