Running Head: Lab 2 – READ Prototype Product Specification

CS 411W Lab II
Prototype Product Specification for READ
Prepared by: Jim Lawrence Calderon, Black Team
Date: 04/08/13

Table of Contents
1 INTRODUCTION
  1.1 Purpose
  1.2 Scope
  1.3 Definitions, Acronyms, and Abbreviations
  1.4 References
  1.5 Overview
2 General Description
  2.1 Prototype Architecture Description
  2.2 Prototype Functional Description
  2.3 External Interfaces
    2.3.1 Hardware Interfaces
    2.3.2 Software Interfaces
    2.3.3 User Interfaces
    2.3.4 Communication Protocols and Interfaces

Table of Figures
Figure 1: Major Functional Component Diagram
Figure 2: READ Prototype MFCD
Figure 3: Scraper Flowchart
Figure 4: READ Site Map

Table of Tables
Table 1: RWP vs. Prototype features
Table 2: Functionality for Publications, Grants, and User Profile pages

1 INTRODUCTION

Grants account for up to 20% of a public research university’s revenue (Delta Cost Project). Universities are awarded these grants on the strength of ongoing research, which is documented in technical reports and publications. An online database of these publications is one way to attract potential students who are interested in the research under way at the university. However, keeping such a system updated manually is very tedious.

Old Dominion University’s Computer Science department is a good case study of this problem. The current static webpage, which has not been updated since 2008, consists of a list of publications manually updated by an appointed staff member.
To have a publication listed, an author must email the information to this staff member and wait for the page to be updated. ODU is not the only institution with this problem; other organizations may experience similar issues.

The Repository for Electronic Aggregation of Documents (READ) is a system created by the CS411 Black Group in response to the ODU Computer Science Department’s lack of an online publication database. Because the underlying problem is not unique to ODU, READ is not built strictly for ODU and may be integrated into the systems of any interested party. The Schaefer Scraper is designed to automatically collect publications owned by CS faculty members from multiple scholastic websites. Viewers can then organize these publications with a number of filters as they browse the material. While viewers are able to browse, authors and administrators will additionally be able to update information on existing publications or manually add publications when necessary. With these features, READ aims to relieve each researcher of the responsibility of manually managing their publications beyond the initial upload to a single site.

1.1 Purpose

READ is an online database that will house grants, articles, statistics pertaining to them, and links to research. Viewers will be able to browse through them using a number of filters, such as author name, publication date, and keywords. The goal is to minimize the need for authors to manage their publications themselves. READ will also strive to advertise ongoing research more efficiently and to show available grants to whoever requires them. The goal is for the authors to work less in order to READ more.

A major part of READ is software named the Schaefer Scraper.
It searches a list of academic websites, defined by an administrator, for publications that match a registered author’s credentials. The pertinent information is then uploaded to the database. At this point the author receives an email notification so that they can approve the publication or correct any mistaken information. Over time, the system will learn when not to notify an author, based on patterns in the publications they have previously denied. This gathers all of an author’s publications into a single location automatically, with little work on their part.

The system will allow viewers to browse the database using filters for grants, multiple types of article publications, and their authors. A user profile will show personal information, a graphical representation of the number of publications the author has produced and the funding they have received, and a list of associated publications. This statistical information will show viewers an author’s area of expertise and level of activity.

READ consists of three major software functional components: the web interface, the database, and the scraper (Figure 1). The web interface contains pages that are accessible to the public and pages that are viewable only by administrators or registered authors. Private content requires an authorized login and mainly deals with updating publication or grant data. The public section consists of the pages that any viewer can interact with, such as the article publications, grant proposals, graduate student list, and faculty list. The viewer interacts with the web interface by selecting filters, which are used to query the database. The query results are then used to build the pages the viewer browses.

Figure 1: Major Functional Component Diagram

The second major component is a MySQL database that will contain six tables: Authors, Paper, Grants, Owns, Tags, and CO_PI.
These tables will be populated either automatically, with information obtained by the Schaefer Scraper, or manually by registered authors. The database’s primary function is to store links to external publications, grant proposals, and the information associated with them. Information accuracy will be verified by an email confirmation sent to the authors, who will be able to edit any incorrect data.

The third major component is the Schaefer Scraper, an automated tool that checks predefined websites for new publications. It locates an author’s profile on a specified website using a unique string identifier and extracts the BibTeX information. That information is then parsed for the required data, which is stored in the database, where it can be queried and viewed through the web interface.

1.2 Scope

The prototype for READ will be essentially identical to the Real World Product (RWP). It will be modeled using actual publications from the ODU Computer Science department and hosted on a virtual machine running the Debian operating system. Its main objective is to demonstrate the essential features of the RWP. As illustrated in Table 1, it will not include the Sparkline graph integration in the user profiles, the learning algorithm, or the faculty list. The graphs will be visual representations of the publications and grants that an author has been associated with within the last few years. The learning algorithm will allow the system to learn to distinguish an author’s likely publications based on their patterns of approving or denying past entries.

1.3 Definitions, Acronyms, and Abbreviations

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add and edit publications and grants in the system under their name.
BibTeX: A plain-text file format for bibliographic reference information.
It will be used to automatically fill in key information when uploading or editing publications and grants.
Client application: The module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
GIT: A software system for controlling and organizing software versions.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., through which a user interacts with a software application using a mouse and keyboard. The term distinguishes it from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.
Joomla!: A content management system.
jQuery Sparklines: A development library for the visualization of data.
MySQL: An open-source database management system that uses SQL.
ODU: Old Dominion University.
Parse: To process a statement or document in order to extract its structure and data.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher on whom a research grant is bestowed, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal, a technical report, or a record of conference proceedings.
Query: A statement sent to the database either to change its contents or to retrieve results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input, such as a document or a website, for pertinent information.
Server application: The module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or different versions of software, can communicate or interact.
SQL: A widely used language for querying databases.
SQL injection: An attack in which malicious input causes unauthorized queries to be performed on a database.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Viewer: In the scope of this project, an outside person who wishes to query the information contained in the READ database.
Virtual Private Server (VPS): A software version of a hardware server, used to run independent servers that manage access to a centralized resource or service in a network on a single piece of hardware.
Webserver: A constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.

1.4 References

"Delta Cost Project Data." The Delta Project on Postsecondary Education Costs, Productivity, and Accountability. The Delta Project, n.d. Web. 9 Feb. 2013.

Calderon, Jim. (2013). Lab 1 – READ Product Description.

1.5 Overview

This product specification provides the hardware and software configuration, external interfaces, capabilities, and features of the READ prototype.
The remaining sections of this document provide a detailed description of the software and external interface architecture of the READ prototype; its key features; the parameters that will be used to control, manage, or establish each feature; and the performance characteristics of each feature in terms of outputs, displays, and user interaction.

2 General Description

The prototype of READ will focus on properly gathering and displaying publications owned by the ODU Computer Science faculty, obtained from MicrosoftAcademic. Secondary priorities include user profiles and administrative privileges. Due to time constraints, certain features of the RWP will be omitted.

2.1 Prototype Architecture Description

READ’s prototype will be very similar to the RWP, with a few features withheld. The web interface will be created in PHP and hosted on an ODU web server, with Google Chrome as the browser of choice. The publications and grants sections will be fully functional, the profile pages will have limited functionality compared to the RWP, and the faculty list will be omitted (Table 1). A desktop or laptop with Internet access is required to access the interface.

| Feature | Real World Product | Prototype |
|---|---|---|
| Browsing Capabilities | Ability to browse all grants and publications. | Ability to browse all grants and publications. |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page. | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts. | Linked to Old Dominion University Computer Science accounts. |
| Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, publications, and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, publications, and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert authors when one is added to the system under their name. | Will update the system with publications only and alert authors when one is added to the system under their name. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

Table 1: RWP vs. Prototype features

The Schaefer Scraper software, which will be responsible for obtaining publications, is already coded in PHP as well and only needs to be integrated with the web interface once the interface is ready. Only MicrosoftAcademic will be used for the prototype. Actual data will be scraped from this website using the software. The data will then be parsed for pertinent information and stored in a MySQL database, which will be integrated with the web interface.
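The parse-and-store step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Schaefer Scraper itself: Python is used only for illustration, the sample entry and the column names (cite_key, title, authors, year) are hypothetical, the regular expression handles only simple one-line fields without nested braces, and the standard-library sqlite3 module stands in for MySQL so that the sketch runs without a database server. Only the table name Paper comes from this document.

```python
import re
import sqlite3

# Hypothetical sample entry; real entries would come from the scraped website.
SAMPLE_BIBTEX = """@article{doe2012example,
  title = {An Example Paper},
  author = {Doe, Jane and Roe, Richard},
  journal = {Journal of Examples},
  year = {2012}
}"""

# Matches simple "name = {value}" or name = "value" fields, one per line.
FIELD_RE = re.compile(r'(\w+)\s*=\s*[{"](.*?)[}"]\s*,?\s*$', re.MULTILINE)


def parse_bibtex_entry(text):
    """Parse one simple BibTeX entry into a dict (no nested braces)."""
    header = re.match(r'\s*@(\w+)\s*\{\s*([^,\s]+)\s*,', text)
    if header is None:
        raise ValueError("not a BibTeX entry")
    fields = {name.lower(): value.strip() for name, value in FIELD_RE.findall(text)}
    fields["entry_type"] = header.group(1).lower()
    fields["cite_key"] = header.group(2)
    return fields


def store_paper(conn, fields):
    """Insert the parsed fields into a Paper table (columns are assumptions)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Paper ("
        "cite_key TEXT PRIMARY KEY, title TEXT, authors TEXT, year INTEGER)"
    )
    year = int(fields["year"]) if fields.get("year") else None
    # Placeholders keep scraped text out of the SQL string itself.
    conn.execute(
        "INSERT INTO Paper (cite_key, title, authors, year) VALUES (?, ?, ?, ?)",
        (fields["cite_key"], fields.get("title"), fields.get("author"), year),
    )
    conn.commit()
```

Calling parse_bibtex_entry(SAMPLE_BIBTEX) and passing the result to store_paper with a connection from sqlite3.connect(":memory:") exercises both steps. Passing values as query parameters rather than concatenating them into the statement is one way to guard against the SQL injection risk defined in Section 1.3.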
The prototype has three major functions: obtain publication information from MicrosoftAcademic using the Schaefer Scraper, store the data obtained from this procedure in the database, and display the information for viewers to browse in the web interface. The web interface will be divided into publications, grants, user profile, and administrative function pages. Each page has filters tailored to it. By default, both the publications page and the grants page will display their entries ordered by latest upload. The profile page will allow the author to edit their personal, grant, publication, or citation information. Publications and grants may be added either by entering the information into the fields or by supplying a BibTeX bibliography file, from which the information is extracted in a manner similar to the Schaefer Scraper.

2.2 Prototype Functional Description

Figure 2 illustrates the major functional components of the prototype. READ user accounts will be linked to the ODU Computer Science department accounts, so further registration is unnecessary. On their initial use of the system, each user will be asked to provide the unique string identifier for their profile page on MicrosoftAcademic. The Schaefer Scraper will then scrape the website monthly for a BibTeX file for each user and parse it for publication information. Each publication will be compared with the publications already in the database and discarded if it is found to be a duplicate. Otherwise, it is uploaded into the database, and an email is sent asking the user to verify that the publication is theirs and that the gathered information is correct. Appropriate changes are then made based on the user’s response.
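The duplicate check in the flow above could look roughly like the sketch below. This document does not specify how duplicates are detected, so comparing normalized titles is an assumption made for illustration, and the function names and the dictionary layout of an entry are hypothetical. Entries returned as new would be the ones uploaded and sent to the author for email verification.

```python
import re


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so that
    near-identical titles compare equal (an assumed matching rule)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return " ".join(cleaned.split())


def split_new_and_duplicates(scraped, existing_titles):
    """Separate scraped entries into new ones and duplicates.

    scraped: list of dicts, each with a "title" key (hypothetical layout).
    existing_titles: titles already stored in the database.
    """
    seen = {normalize_title(t) for t in existing_titles}
    new_entries, duplicates = [], []
    for entry in scraped:
        key = normalize_title(entry["title"])
        if key in seen:
            duplicates.append(entry)   # discarded, per the flow above
        else:
            seen.add(key)              # also catches repeats within one scrape
            new_entries.append(entry)  # uploaded, then verified by email
    return new_entries, duplicates
```

A stricter rule, for example also comparing the publication year or author list, would reduce false matches between distinct papers that share a title.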
Figure 2: READ Prototype MFCD

Figure 3: Scraper Flowchart

The publications and grants are then browsed by viewers, who interact with the READ website through the publications, grants, and profile pages. Each page will have its own filters to narrow down the results. Table 2 lists the options by which the viewer may filter the entries on each page and the information displayed for each entry, including the information shown on a user profile page.

| | Publications | Grants | Profile Page |
|---|---|---|---|
| Filterable by | Keywords; Start-End Date; Full Text Availability; Items per page; Authors | Principal or Co-principal Investigator; Organization; Funding Agency; Award Range; Start-End Date; Active Status; Items per page | Start-End Date; Keywords; Full Text Availability; Items per page |
| Information shown | Title; Authors; Reference Information; Organization Affiliation; Publisher; Publication Date; Abstract | Title; Principal and Co-principal Investigators; Active State; Start-End Date; Funding Agency, Agency Directorate, Agency Division; Award Amount; Award Number | Name; Job Title; Organization; Homepage Link; Grants/Publications associated with the faculty member |

Table 2: Functionality for Publications, Grants, and User Profile pages

2.3 External Interfaces

The external interfaces required are limited to standard computer hardware and software. The READ website, where the information is browsed, is the only custom interface.

2.3.1 Hardware Interfaces

No special hardware is created or needed for the prototype. Personal laptops and ODU class computers will be used for testing the READ website and for monitoring and controlling the database. All components will interact over the ODU network.
2.3.2 Software Interfaces

The system will be created and tested on a virtual machine running the Debian operating system. The MySQL database will also be maintained and controlled through this machine. The READ website will be developed in PHP with the assistance of Joomla!. The Schaefer Scraper will be written in Python, and a parser will be developed to format the information obtained from the scrape.

2.3.3 User Interfaces

This system will utilize two user interfaces, both of which are accessible through any browser on a device with an Internet connection. The first interface is an email service, such as Gmail or Yahoo! Mail, and is required only by users. The service is used to verify the validity of uploaded publications. The second interface is the READ website. The site will allow viewers to browse the publications, grants, and users in the database. Figure 4 illustrates the READ site map.

Figure 4: READ Site Map (nodes: READ Homepage, Publication, Grant, Administration, User Profile, Add Publications, Add Grants, Edit Publications, Edit Grants)

2.3.4 Communication Protocols and Interfaces

READ will use two communication protocols: Hypertext Transfer Protocol Secure (HTTPS) and Transmission Control Protocol/Internet Protocol (TCP/IP).