Lab 1 – READ Product Description
Marcus Zehr
CS411
Janet Brunelle
March 18, 2013

Table of Contents
1. INTRODUCTION
2. PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
2.3 Target Market/Customer Base
3. PRODUCT PROTOTYPE DESCRIPTION
3.1 Prototype Functional Goals and Objectives
3.2 Prototype Architecture (Hardware/Software)
3.3 Prototype Features and Capabilities
3.4 Prototype Development Challenges
GLOSSARY
REFERENCES

List of Figures
Figure 1. READ’s MFCD

List of Tables
Table 1. Comparison of Features and Capabilities between READ prototype and RWP

1. Introduction

There are more than 4,700 research institutions in the United States (Digest of Education Statistics). These institutions showcase their research through publications, many of which are uploaded to the Internet. However, many of the organizations associated with these institutions lack an efficient method or procedure for uploading and maintaining such documents. This matters because institutions should be appropriately recognized for the work they complete, and because well-maintained publication listings let them advertise the specific areas of research being performed at any given time. The current process by which many organizations present their publications online is manual, slow, and tedious, which leaves considerable room for improvement in how an institution’s publications are displayed and advertised.
One main reason for this is that the responsibility for updating the system may rest upon a single individual or administrator. This issue will be addressed through the development of the READ web application for research institutions. The Repository for Electronic Aggregation of Documents (READ) aims to automate the currently manual process of submitting an organization’s publications and grant information. It will also keep this information better organized and allow publications and grants to be searched using filters and displayed in an easy-to-read format. In addition, READ will let users verify that their grant and publication information is correct. Ultimately, READ will ease the burden of keeping up with numerous publications, allowing researchers to spend more time working and less time managing files.

2. Product Description

READ is an online database with a range of capabilities, including the ability to store information about grants and publications along with links to externally hosted publications. It provides a way for users to browse and search through publications using a number of filters such as author, publish date, keywords, and type of document. It allows users to advertise their research, both past and current, as well as information about any grants which apply to them. Lastly, it minimizes the need for users to manually organize their publications by using the Schaefer Scraper to automatically find publications on the Internet.

2.1 Key Product Features and Capabilities

The Schaefer Scraper will be used by READ to search for publications matching a registered user’s credentials and to extract pertinent information found in the publications, such as the title, author or authors, publication date, and type of publication.
This information is then inserted into the database along with a link to that specific publication, and a notification email is sent to the user to authorize the newly uploaded publication information. Based on the actions of the user and patterns of denied publications, READ will learn when not to notify a user to authorize certain publications when they are found.

With READ in place, any user will be able to browse the database using an assortment of filters. Publications may be searched by publish date, multiple authors, keywords, and whether or not the full text is available. Grants may be searched by total amount, status, funding agency, or investigators.

Each user will have a profile which will display information including the user’s name, job title, personal photo, email address, affiliated organization, and homepage. The user’s profile page will also include graphical representations of the number of publications they have authored and when they were published, as well as any funding received through the participating publications. In addition to the graphical representations, the profile page for each user will include a list of that specific user’s publications.

2.2 Major Components (Hardware/Software)

READ will combine simple hardware and software components, integrated so that the system is easy for users to operate. Figure 1 below illustrates the major functional components of READ. This solution consists of a single server which will house three main software components: a web interface, a publication link database, and the Schaefer Scraper.

Figure 1 – READ’s MFCD

The web interface itself will have both public and private areas available to users. The private areas will require a user to log on to their account and will allow the user to perform various tasks.
The user will then be allowed access to their own profile page and administrative abilities. Inside the web interface, the user may access the search filters for publications and grants as well as other users’ public profiles.

The publication link database’s main function is to house and provide links to any external publications and grant information for the users. This database will also contain files which have been uploaded by the users, including publications, grants, and other files related to either.

The last internal component is the Schaefer Scraper, an automated tool that will search specific external web sites for new publications by the authors listed in an XML input file. The scraper will do this by looking for publications by the included authors, collecting and parsing the results, and then exporting them into the READ link database for further use.

2.3 Target Market/Customer Base

The initial consumer for READ is Old Dominion University’s Computer Science Department (ODUCS). Dr. Michele Weigle, a professor at ODU with a Ph.D. in Computer Science, requested a solution for this particular issue and is acting as group mentor for this project. According to its website, the ODUCS Department has 37 faculty members, 11 currently enrolled Ph.D. students, and 11 Master’s students. These are all individuals who could take advantage of a system that makes it easier to discover relevant and up-to-date research, but they do not begin to cover the number of people who may find READ to be an indispensable resource in the future.

Once testing is complete at ODUCS, READ may be used by other departments within Old Dominion University. READ could then potentially be used at other universities to help with their organizational and research needs, as well as at government institutions, research institutions, and libraries.
Overall, this system will become a useful tool for the many people looking for more information about a school’s focus of study.

3. READ Product Prototype Description

The READ prototype will be vital for organizing publications and their associated grant information for Old Dominion University. The prototype will be modeled using the publications of Old Dominion University’s Computer Science Department and hosted on a virtual machine running a Linux-based operating system. It will include most of the features of the real world project and will be instrumental in maintaining publications and grants for the university.

3.1 Prototype Functional Goals and Objectives

The READ prototype will be able to search through the database and filter the results based on user queries; implement RSS feeds; allow users to log on, edit, and upload data; and give functional control of the web application to administrators. By using the Schaefer Scraper, it will also find publications and insert them into the READ database. The users of this system will be able to log on, access their own publication and grant information, and edit this information. They will also be able to upload personal information and files to the READ database.

These functions of the prototype will allow easy access to numerous publications, grants, and tech reports written by Old Dominion University faculty and students. All of this will be presented in a well-organized and easy-to-navigate user interface consisting of a filter and search page, displays for publications and grants, a profile display system, and an RSS feed.

3.2 Prototype Architecture (Hardware/Software)

The READ prototype will allow a user to log on to the system via a web-based interface. This interface will then let the user search for publications in, or add publications to, the publicly viewable database.
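The search-and-filter behavior described here can be sketched as follows. This is only an illustration under stated assumptions: READ itself is to be built with PHP and MySQL, whereas the sketch uses Python with the standard-library sqlite3 module as a stand-in, and the `publications` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def search_publications(conn, author=None, year_from=None, keyword=None):
    """Return (title, authors, year) rows matching every filter supplied."""
    clauses, params = [], []
    if author is not None:
        clauses.append("authors LIKE ?")
        params.append(f"%{author}%")
    if year_from is not None:
        clauses.append("year >= ?")
        params.append(year_from)
    if keyword is not None:
        clauses.append("keywords LIKE ?")
        params.append(f"%{keyword}%")

    sql = "SELECT title, authors, year FROM publications"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY year DESC, title"
    # User input is passed only as bound parameters, never spliced into the SQL.
    return conn.execute(sql, params).fetchall()


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publications "
        "(title TEXT, authors TEXT, year INTEGER, keywords TEXT)"
    )
    conn.executemany(
        "INSERT INTO publications VALUES (?, ?, ?, ?)",
        [
            ("Sample Paper A", "A. Author; B. Writer", 2011, "networks, simulation"),
            ("Sample Paper B", "B. Writer", 2012, "web archiving"),
            ("Sample Paper C", "C. Scholar", 2009, "networks"),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for row in search_publications(demo, keyword="networks", year_from=2010):
        print(row)
```

Each filter is optional, so the same function serves plain browsing (no filters) and narrow searches. Binding values as parameters rather than concatenating them into the query string is also the usual defense against the SQL injection risk listed in the glossary.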
The Schaefer Scraper will also run on a pre-defined schedule in order to populate the database with information, both initially and on a regular basis, as defined by the administrators of the READ program.

READ will be built with PHP and MySQL, as these technologies are well suited to the needs of this project and will allow a large and detailed database structure to be accessed with ease on the Internet. The prototype also incorporates an off-the-shelf piece of software named the Schaefer Scraper. It was built using PHP and returns search results in HTML format. For the purposes of the prototype, the Schaefer Scraper will scrape information from Google Scholar, Microsoft Academic, Arnetminer, Scopus, and Google Citation.

3.3 Prototype Features and Capabilities

The READ prototype will be able to store and display numerous publications, grants, and types of publications. Grant information pertaining to these publications will be searchable in the READ database, as will the publications themselves. The prototype will also be able to inform its users of any publications found by the Schaefer Scraper so that they can confirm the authenticity of the publications in the database.

Features | Real World Project | Prototype
Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications
Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords.
Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
Faculty page | Lists faculty and provides a link to each person’s profile page | Not included.
Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts
Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included.
Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name.
Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included.
Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system.

Table 1 – Comparison of Features and Capabilities between READ prototype and RWP

3.4 Prototype Development Challenges

Possible challenges for the READ prototype include understanding the architecture of the Schaefer Scraper’s code and learning how to integrate it properly into the READ solution. This piece of code is crucial to the innovative features that READ supplies. Secondly, the prototype may not meet all necessary specifications within the required timeframe, which would result in a usable but unfinished end product for the Old Dominion University Computer Science Department.
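The feature comparison in Section 3.3 lists automatic form filling from a BibTeX document. A rough sketch of how that step might work is given below. It is written in Python purely for illustration (the prototype itself is planned in PHP), the sample entry is invented, and the function names `parse_bibtex_entry` and `to_form_fields` are hypothetical. The parser handles only a single flat entry and does not support nested braces or string concatenation.

```python
import re

# Matches the entry header, e.g. "@article{doe2013sample,"
ENTRY_PATTERN = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches name = {value}, name = "value", or name = 2013
FIELD_PATTERN = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')

# Invented sample entry used only to exercise the sketch.
SAMPLE = """@article{doe2013sample,
  author = {Doe, Jane and Roe, Richard},
  title = {A Sample Title},
  journal = "Journal of Examples",
  year = 2013
}"""


def parse_bibtex_entry(text):
    """Parse one flat BibTeX entry into a dict of lower-cased field names."""
    header = ENTRY_PATTERN.search(text)
    if header is None:
        raise ValueError("no BibTeX entry found")
    fields = {"type": header.group(1).lower(), "key": header.group(2)}
    body = text[header.end():]
    for name, braced, quoted, bare in FIELD_PATTERN.findall(body):
        fields[name.lower()] = (braced or quoted or bare).strip()
    return fields


def to_form_fields(entry):
    """Map parsed BibTeX fields onto hypothetical upload-form fields."""
    authors = [a.strip() for a in entry.get("author", "").split(" and ") if a.strip()]
    return {
        "title": entry.get("title", ""),
        "authors": authors,
        "year": entry.get("year", ""),
        "type": entry["type"],
    }


if __name__ == "__main__":
    print(to_form_fields(parse_bibtex_entry(SAMPLE)))
```

A production implementation would more likely rely on an established BibTeX parsing library than on regular expressions; the sketch is only meant to show which fields of an entry correspond to which form inputs.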
GLOSSARY

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their name and edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development that sometimes includes theory about software engineering methods.
Client application: The module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Google Scholar: A search engine primarily used to find academic literature.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., operated with a mouse and keyboard, through which a user interacts with a software application.
Internet scraper: A program which takes unstructured data on the web and converts it into structured data that can be stored and analyzed in a central local database or spreadsheet.
JQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
Microsoft Academic: A free service developed by Microsoft Research to help scholars, scientists, students, and practitioners quickly and easily find academic content, researchers, institutions, and activities.
MySQL: An open source database software.
Parse: To process a statement for specific meanings.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher to whom a research grant is awarded, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal, technical report, or record of conference proceedings.
Query: A command sent to the database to either change the database or get back results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A dialect of XML for subscribing to and distributing news.
RWP: Real World Project.
Scraper: An automated application designed to scan a source of input such as a document or a website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can communicate/interact.
SQL: A widely used programming language used to query databases.
SQL injection: Performing unauthorized queries on a database for malicious purposes.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Viewer: An outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Virtual Private Server (VPS): A software version of a hardware server used to create independent servers on a single piece of hardware.
Web server: A constantly “on” resource, made up of a group of applications, whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.

REFERENCES

Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov 2012. <http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.