Lab 2 – Prototype Product Specification for READ

CS 411W Lab II
Version 1
Prototype Product Specification For READ
Prepared by: Marcus Zehr – Black Team
Date: 04/8/2013

Table of Contents
1. Introduction
   1.1 Purpose
   1.2 Scope
   1.3 Definitions, Acronyms, and Abbreviations
   1.4 References
   1.5 Overview
2. General Description
   2.1 Prototype Architecture Description
   2.2 Prototype Functional Description
   2.3 External Interfaces
      2.3.1 Hardware Interfaces
      2.3.2 Software Interfaces
      2.3.3 User Interfaces
      2.3.4 Communication Protocols / Interfaces

List of Figures
Figure 1. READ’s MFCD
Figure 2. READ’s Web Layout

List of Tables
Table 1. READ Prototype Vs. RWP

1. Introduction

There are more than 4,700 research institutions in the United States (Digest of Education Statistics). These institutions can display their research through abundant publications and upload them to the Internet. However, many of the organizations associated with these institutions lack an efficient method or procedure for uploading and maintaining documents such as these publications. This problem is important to fix so that institutions are appropriately recognized for the work they complete. Solving it also lets research institutions advertise the specific areas of research being performed at any given time.
The current process by which many organizations present their publications online is manual, slow, and tedious, which leaves many areas to improve before an institution’s publications are properly displayed and advertised. One main reason for this is that the responsibility to update the system may rest upon a sole individual or administrator. This is the issue that will be addressed through the development of the READ web application for research institutions.

The Repository for Electronic Aggregation of Documents, READ, aims to automate the currently manual process of submitting an organization’s publications and grant information. It will also keep this information better organized and allow publications and grants to be searched using filters and displayed in an easy to read format. In addition, READ will give users the ability to verify that their grant and publication information is correct. Ultimately, READ will ease the burden of keeping up with numerous publications so that researchers can spend more time working and less time managing files.

1.1 Purpose

READ is an online database with a range of capabilities. It stores information about grants, publications, and links to outside publications. It provides a way for users to browse and search through publications using a number of filters such as author, publish date, keywords, and type of document. It allows a user to advertise their research, both past and current, as well as information about any grants which apply to them. It minimizes the need for a user to manually organize their publications by utilizing the Schaefer Scraper to automatically find publications on the Internet. READ, however, will not provide access to copyrighted material or gather research material for anyone outside of Old Dominion University.
The initial consumer for READ is Old Dominion University’s Computer Science Department (ODUCS). Dr. Michele Weigle, a professor at ODU with a Ph.D. in Computer Science, requested a solution for this particular issue and is acting as group mentor for this project. According to its website, the ODUCS Department has 37 faculty members, 11 currently enrolled Ph.D. students, and 11 Master’s students. All of these individuals could take advantage of a system that makes it easier to discover relevant and up to date research, but they are only a fraction of the people who could find READ useful in the future. Once testing is complete at ODUCS, READ may be utilized by other departments within Old Dominion University. READ could then potentially be used at other universities, government institutions, research institutions, and libraries to help with their organizational and research needs.

1.2 Scope

READ will use the Schaefer Scraper to search for publications matching a registered user’s credentials and to extract pertinent information found in the publications, such as the title, author or authors, publication date, and type of publication. This information is then inserted into the database along with a link to that specific publication, and a notification email is sent to the user asking them to authorize the newly added publication information. Based on the actions of the user and patterns of denied publications, READ will learn when not to notify a user to authorize certain publications when they are found.

With READ in place, any user will be able to browse the database using an assortment of filters. Publications may be searched by publish date, author or keywords, and whether or not the full text is available. Grants may be searched by total amount, grant status, funding agency, or investigators.
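The publication filters described above can be pictured with a minimal sketch. It is written in Python purely for illustration (the prototype itself is specified to use PHP and MySQL), and the `Publication` record, its field names, the `filter_publications` function, and the sample data are all hypothetical and are not part of READ’s actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Publication:
    # Hypothetical record; field names are illustrative, not READ's schema.
    title: str
    authors: List[str]
    published: date
    keywords: List[str] = field(default_factory=list)
    full_text_available: bool = False


def filter_publications(
    pubs: Iterable[Publication],
    author: Optional[str] = None,
    keyword: Optional[str] = None,
    published_from: Optional[date] = None,
    published_to: Optional[date] = None,
    full_text_only: bool = False,
) -> List[Publication]:
    """Return the publications that satisfy every filter that was supplied."""
    results = []
    for pub in pubs:
        # Author and keyword matches are exact but case-insensitive.
        if author is not None and author.lower() not in (a.lower() for a in pub.authors):
            continue
        if keyword is not None and keyword.lower() not in (k.lower() for k in pub.keywords):
            continue
        # Date bounds are inclusive on both ends.
        if published_from is not None and pub.published < published_from:
            continue
        if published_to is not None and pub.published > published_to:
            continue
        if full_text_only and not pub.full_text_available:
            continue
        results.append(pub)
    return results


# Placeholder data, invented for this sketch only.
SAMPLE = [
    Publication("Paper A", ["A. Author"], date(2011, 5, 1), ["networks"], True),
    Publication("Paper B", ["B. Writer", "A. Author"], date(2012, 9, 15), ["archiving"]),
    Publication("Paper C", ["C. Scholar"], date(2013, 1, 20), ["networks"]),
]

if __name__ == "__main__":
    for pub in filter_publications(SAMPLE, keyword="networks", full_text_only=True):
        print(pub.title)
```

Filters that are left unset are simply skipped, so the same function covers both plain browsing (no filters) and narrow searches. A grant search by amount, status, funding agency, or investigator could follow the same pattern with a different record type.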
Each user will have a profile which will display information including the user’s name, job title, personal photo, email address, affiliated organization, and homepage. The user’s profile page will also include graphical representations of the number of publications they have authored and when they were published, as well as any funding received through the participating publications. In addition to the graphical representations, the profile page for each user will include a list of that specific user’s publications.

The READ prototype will be able to store and display numerous publications, grants, and types of publications. The publications stored in the READ database will be searchable, as will any grant information pertaining to those publications. The prototype will also be able to inform its users of any publications found by the Schaefer Scraper so that they can confirm the authenticity of the publications in the database. The table below shows the differences between the final version of READ and its prototype.

| Features | Real World Project | Prototype |
|---|---|---|
| Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts |
| Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

Table 1 – READ Prototype Vs. RWP

1.3 Definitions, Acronyms, and Abbreviations

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their name and edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development that sometimes includes theory about software engineering methods.
Client application: The module that takes input and creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Google Scholar: A search engine primarily used to find academic literature.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., operated with a mouse and keyboard, through which a user interacts with a software application.
Internet scraper: A program which takes unstructured data on the web and puts it into structured data that can be stored and analyzed in a central local database or spreadsheet.
JQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
Microsoft Academic: A free service developed by Microsoft Research to help scholars, scientists, students, and practitioners quickly and easily find academic content, researchers, institutions, and activities.
MySQL: An open source database software.
Parse: To process a statement for specific meanings.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher that a research grant is bestowed upon, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal, technical report, or record of conference proceedings.
Query: A command sent to the database to either change the database or get back results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A dialect of XML for subscribing to and distributing news.
RWP: Real World Project.
Scraper: An automated application designed to scan a source of input such as a document or a website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can communicate/interact.
SQL: A widely used programming language used to query databases.
SQL injection: Performing unauthorized queries on a database for malicious purposes.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Viewer: An outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Virtual Private Server (VPS): A software version of a hardware server used to create independent servers on a single piece of hardware.
Web server: A constantly “on” resource, made up of a group of applications, whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.

1.4 References

Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov 2012. <http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.

1.5 Overview

This product specification provides the hardware and software configuration, external interfaces, capabilities, and features of the READ prototype.
The remainder of this document includes a detailed description of the hardware, software, and internal design of the READ prototype; the key features of the prototype; and the parameters that will be used to control, manage, or establish those features.

2. General Description

READ combines simple hardware and software components that have been integrated so that users can perform their tasks with ease. Figure 1 below illustrates the major functional components of READ. The READ solution consists of a single server which will house three main software components: a web interface, a publication link database, and the Schaefer Scraper.

2.1 Prototype Architecture Description

Figure 1 – READ’s MFCD

The web interface will have both public and private areas. The private areas will require a user to log on to their account and will give the user access to their own profile page and, where applicable, administrative abilities. Inside the web interface the user may access the search filters for publications and grants as well as other users’ public profiles.

The publication link database’s main function is to house and provide links to external publications and grant information. This database will also contain files which have been uploaded by the users, including publications, grants, and other files which may be related to either.

The last internal component is the Schaefer Scraper, an automated tool that will search specific external web sites for new publications by the authors listed in an XML input file.
The scraper does this by looking for publications by the included authors, collecting and parsing the results, and then exporting them into the READ link database for further use.

2.2 Prototype Functional Description

The READ prototype will organize publications and the grant information associated with them for Old Dominion University. The prototype will be modeled using the publications of Old Dominion University’s Computer Science Department and hosted on a virtual machine running a Linux based operating system. It will include the features of the real world project except those marked as not included in Table 1, and it will help maintain publications and grants for the university.

The READ prototype will be able to search through the database and filter the results based on user queries, implement RSS feeds, allow users to log on, edit, and upload data, and give functional control of the web application to administrators. By utilizing the Schaefer Scraper, it will also find publications and insert them into the READ database.

The users of this system will be able to log on, access their own publication and grant information, and edit this information. They will also be able to upload personal information and files to the READ database. These functions of the prototype will allow easy access to numerous publications, grants, and tech reports written by Old Dominion University faculty and students. All of this will be presented in a well-organized and easy to navigate user interface consisting of a filter and search page, displays for publications and grants, a profile display system, and an RSS feed.

2.3 External Interfaces

This section covers the software and devices used within the READ prototype. READ requires particular hardware and software to operate and provides a graphical user interface.
2.3.1 Hardware Interfaces

READ will require a computer with an active connection to the Internet in order to perform tasks. From this computer, the user will reach the web interface by opening a web browser and following the link to the correct URL from the ODU Computer Science Department home page.

2.3.2 Software Interfaces

The READ prototype will allow a user to log onto the system via a web-based interface. This interface will then let the user search for or add publications in the database, which is publicly viewable. The Schaefer Scraper will also run on a pre-defined schedule in order to populate the database with information, both initially and on a regular basis as defined by the administrators of the READ program.

READ will be programmed in PHP and will use MySQL, because this language and open source database software are better suited to the needs of this project than the alternatives and will allow a large and detailed database structure to be accessed with ease on the Internet. The prototype also incorporates an off-the-shelf piece of software named the Schaefer Scraper. It was built using PHP and returns search results in HTML format. For the purpose of the prototype, the Schaefer Scraper will scrape information from Google Scholar, Microsoft Academic, Arnetminer, Scopus, and Google Citation.

2.3.3 User Interfaces

Figure 2 – READ’s Web Layout

The user interface for the READ prototype is designed for use on a desktop computer but can also be used on mobile devices with access to the Internet. Figure 2 displays the flow of the website’s GUI design and navigation links.

2.3.4 Communication Protocols / Interfaces

READ will utilize Transmission Control Protocol/Internet Protocol (TCP/IP). This ensures that data found by the Schaefer Scraper is delivered reliably to the link database on the server.
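
As a closing illustration, the scraper flow described in Sections 2.1 and 2.3.2 (read a list of authors from an XML file, look up their publications, and export new results into the link database) might look roughly like the sketch below. It is an assumption-laden outline and not the Schaefer Scraper itself: it is written in Python rather than PHP, it uses an in-memory SQLite database as a stand-in for MySQL, and the XML layout, the table columns, and the `fake_search` function that replaces the real web lookups are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input format; the real scraper's XML layout is not specified here.
AUTHORS_XML = """
<authors>
  <author>Example Author One</author>
  <author>Example Author Two</author>
</authors>
"""


def load_authors(xml_text):
    """Read author names from the XML input."""
    root = ET.fromstring(xml_text.strip())
    return [el.text.strip() for el in root.findall("author") if el.text and el.text.strip()]


def fake_search(author):
    """Stand-in for the real external searches; returns placeholder records only."""
    slug = author.lower().replace(" ", "-")
    return [{
        "title": f"Placeholder paper by {author}",
        "author": author,
        "url": f"https://example.org/papers/{slug}",
    }]


def open_link_database():
    """Create an in-memory link database (SQLite here, standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE publication (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               author TEXT NOT NULL,
               url TEXT NOT NULL UNIQUE,
               confirmed INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def store_results(conn, results):
    """Insert records whose link is not stored yet; return the newly added ones."""
    added = []
    for rec in results:
        # Parameterized queries keep scraped text from being run as SQL.
        exists = conn.execute(
            "SELECT 1 FROM publication WHERE url = ?", (rec["url"],)
        ).fetchone()
        if exists:
            continue
        conn.execute(
            "INSERT INTO publication (title, author, url) VALUES (?, ?, ?)",
            (rec["title"], rec["author"], rec["url"]),
        )
        added.append(rec)
    conn.commit()
    return added


if __name__ == "__main__":
    conn = open_link_database()
    found = [rec for name in load_authors(AUTHORS_XML) for rec in fake_search(name)]
    for rec in store_results(conn, found):
        # In READ, each new record would trigger a notification email to the author.
        print("new publication to confirm:", rec["title"])
```

Only records returned by `store_results` are new to the database, so they are the ones a user would be asked to confirm. Running the same scrape twice adds nothing the second time, which matches the idea of a scraper that runs repeatedly on a schedule.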