Running head: LAB 1 – READ DESCRIPTION

Lab 1 – READ Product Description
Andrew Sprague
CS411
Janet Brunelle
February 15, 2013
Version #4

Table of Contents
1 INTRODUCTION
2 READ PRODUCT DESCRIPTION
2.1 KEY PRODUCT FEATURES AND CAPABILITIES
2.2 MAJOR COMPONENTS
3 IDENTIFICATION OF CASE STUDY
4 READ PRODUCT PROTOTYPE DESCRIPTION
4.1 MAJOR FUNCTIONAL COMPONENTS
4.2 PROTOTYPE FEATURES AND CAPABILITIES
4.3 PROTOTYPE CHALLENGES
GLOSSARY
REFERENCES

List of Tables
Table 1. Prototype Features

List of Figures
Figure 1. Major Functional Components
Figure 2. Prototype Major Functional Components

1 INTRODUCTION

According to the Digest of Education Statistics, there are 4,706 research institutions in the United States (Digest of Education Statistics). The primary way these institutions attract both clients and new talent is by disseminating information about what they and their employees have accomplished, usually through papers published by employees. Universities, one of the largest groups of research institutions, make twenty percent of their annual income from federal contracts and grants (freeby50.com). Yet these universities do not have a good online tool for sharing their work with prospective students or faculty. The systems that many institutions currently use to share publications are slow and tedious, so much of the work that universities and faculty accomplish goes without proper recognition. Students who wish to attend a university whose professors specialize in a particular research area can have trouble distinguishing between universities.

The Repository for the Electronic Aggregation of Documents (READ) system is an online program consisting of a database and a web scraper, designed to automate the process of gathering and sharing faculty publications. READ will allow faculty to organize all of their publications and make any required corrections before sharing them with the public. The public may then access and browse the publications using READ.

The case study for READ will be the Computer Science Department of Old Dominion University (ODU), because the system that ODU currently uses is particularly deficient. The department's current method for storing papers is no longer used.
It is not currently updated, has never had any method of searching its content, and its update process failed to encourage faculty to participate.

2 READ PRODUCT DESCRIPTION

READ is an online system that will collect and store information on the publications and grants obtained by authors. READ will retrieve publication information from online sources, including a link to where the actual publication is stored. Authors will also be able to add their publications to READ manually. People outside the system will be able to access READ to see the stored publication information. They will be able to narrow publications and grants through a number of filters, such as author, publication date, and keywords, and to sort the filtered results by relevance.

2.1 Key Product Features and Capabilities

READ will include the Schaefer Scraper, an online publication data scraper. The Schaefer Scraper will search online publication databases for the names of all of the authors registered in READ. It will then take the BibTeX information it finds on the publications and put it into the database. After the Schaefer Scraper gathers the information, READ will email the author to whom the publication appears to belong, alerting them that a publication has been found. The author will log on to the system and either approve the information or correct it, and READ will save any changes. The completed version will also include a learning algorithm that keeps inaccurate results from recurring in the database. For instance, if an author has indicated that a paper the Schaefer Scraper found was not theirs, the learning algorithm will prevent that result from being gathered again.
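The source does not specify how the learning algorithm will work. As a minimal sketch of the behavior described above, the following Python example simply remembers which scraped titles an author has rejected and filters them out of later scraper runs. The class RejectionMemory, the function normalize_title, and the integer author IDs are hypothetical names introduced for illustration, and a plain lookup set stands in for whatever learning technique the finished product would use.

```python
from dataclasses import dataclass, field


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse whitespace so near-identical strings match."""
    return " ".join(title.lower().split())


@dataclass
class RejectionMemory:
    """Remembers (author, title) pairs that an author has marked as not theirs.

    Assumption: a simple lookup set, not a real learning algorithm.
    """
    rejected: set = field(default_factory=set)

    def reject(self, author_id: int, title: str) -> None:
        """Record that this author rejected the publication with this title."""
        self.rejected.add((author_id, normalize_title(title)))

    def filter_new(self, author_id: int, scraped_titles: list) -> list:
        """Return only the scraped titles the author has not already rejected."""
        return [
            title
            for title in scraped_titles
            if (author_id, normalize_title(title)) not in self.rejected
        ]
```

For example, after memory.reject(7, "Deep  Sea Mining"), a later call to memory.filter_new(7, ["deep sea mining", "Web Archiving at Scale"]) keeps only "Web Archiving at Scale", so the author would not be notified about the rejected paper a second time.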
In addition to sending email to authors, READ will implement an RSS feed for confirming publications. The RSS feed will serve two purposes: letting outside users see new publications as they are added, and informing authors who subscribe to it of publications found by the scraper. Authors will be able to choose between notification by email and notification by RSS.

Authors will also be able to change the information presented on their profile. They will be able to upload a picture, a link to their homepage, contact information, and personal information. Authors will also be able to upload thumbnails for any of their publications.

Viewers are users who wish to look up publication and grant information in READ but are not contributing authors. Viewers will have limited access to the READ system: they will be able to view its content and to sort and filter the publications returned by their queries. Each author will have a profile page, accessible to viewers, featuring a graph that displays the author's publication statistics.

2.2 Major Components (Hardware/Software)

The READ system will have three major components, as shown in Figure 1: a web-based interface, a database that holds all of the information about publications, grants, and authors, and the Schaefer Scraper, which gathers the relevant publication information.

No user, whether author or viewer, will access the system directly; all users will instead go through a web-based graphical user interface (GUI). The interface will be separated into two sections.
The public section will be used by people who wish to view information about the publications, and the private section will be used by authors and administrators who need to change parts of the content. The only pages accessible in the public section will be the search and filter pages for grants and publications and the profile views of the authors. The publicly viewable profile page will let the viewer see any information that the author has added, along with the author's jQuery Sparklines statistics, but will not allow any editing without entering the site through a login page.

Once logged in to the website, authors and administrators will be able to edit grant and publication information. System administrators will be able to add and remove profiles as authors join or leave the institution. Administrators will also be able to manually remove any publications or grants that should not be in the database for legal reasons.

Finally, the web-based interface will implement an RSS feed that directs important system updates to the user's RSS reader. This will allow viewers to see whether any new publications have been found. There will also be an RSS feed for each author, so that a link to any new publication can be sent to them through RSS instead of only by email.

The major parts of the web-based interface will be the grants search, publication search, login, profile, and authors pages. The publication search page and grants search page will allow searching and filtering of publications and grants, respectively. The two pages were separated at the request of the customer. The login page will allow authors and administrators to log in to the system with a username and password in order to edit publication, grant, and profile information. The profile page is where authors will edit their profiles and viewers will be able to see statistics on the authors.
Finally, the authors' page will list the authors in the system so that their profiles are accessible to the public.

The database is the part of the system that will hold all of the information on publications and grants. It will use MySQL and will be accessed from the web-based interface. The only time anyone should access the database directly is during installation. Part of the database will also be dedicated to information on the authors and the principal investigators (PIs) of the grants. The database will not necessarily hold the actual content of the papers, only the information about them. Its purpose is not to create a new place to access the research, but to provide a place where viewers can see what kind of research the authors at a particular institution have done.

Inside the database will be tables that store data on the authors, the publications, and the grants. Queries to the tables will be constructed in the web-based interface, using PHP to generate the SQL. Figure 1 shows the schema of the database.

The Schaefer Scraper will periodically gather information on the authors' publications from the Web. It will access specific academic sites that host publications and find information pertinent to the authors at the institution using READ. It will then store all of the information in the database and email the authors so that they can correct, confirm, or remove it. In the finished product, the Schaefer Scraper will be paired with a learning algorithm to prevent authors from receiving repeated messages about publications that are not theirs.

The Schaefer Scraper will attempt to gather the information in BibTeX format for ease of conversion. BibTeX is a plain-text file format with a regular entry-and-field structure, which allows it to be easily parsed and converted into data for a MySQL database.
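To illustrate why BibTeX converts easily into database rows, the following Python sketch parses one simple BibTeX entry into a dictionary and stores it in a table. It is an illustration under stated assumptions, not the Schaefer Scraper's actual code: the function names, the sample entry, and the table layout are hypothetical, the parser handles only flat fields without nested braces, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import re
import sqlite3

# Matches "@type{key, ...fields... }" for a single entry.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*)\}\s*$", re.DOTALL)
# Matches "name = {value}" or "name = "value"" (no nested braces).
FIELD_RE = re.compile(r"(\w+)\s*=\s*(?:\{([^{}]*)\}|\"([^\"]*)\")")

# Hypothetical sample entry used only for illustration.
SAMPLE = """
@article{doe2012,
  author = {Doe, Jane},
  title = {An Example Paper},
  journal = {Example Journal},
  year = {2012}
}
"""


def parse_bibtex_entry(text: str) -> dict:
    """Turn one simple BibTeX entry into a flat dictionary of fields."""
    match = ENTRY_RE.search(text.strip())
    if match is None:
        raise ValueError("not a BibTeX entry")
    entry_type, cite_key, body = match.groups()
    record = {"entry_type": entry_type.lower(), "cite_key": cite_key}
    for name, braced, quoted in FIELD_RE.findall(body):
        record[name.lower()] = braced if braced else quoted
    return record


def store_publication(conn: sqlite3.Connection, record: dict) -> None:
    """Insert the parsed record into a hypothetical publications table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publications ("
        "cite_key TEXT PRIMARY KEY, entry_type TEXT, "
        "author TEXT, title TEXT, year INTEGER)"
    )
    year = int(record["year"]) if record.get("year") else None
    conn.execute(
        "INSERT OR REPLACE INTO publications VALUES (?, ?, ?, ?, ?)",
        (record["cite_key"], record["entry_type"],
         record.get("author"), record.get("title"), year),
    )
    conn.commit()
```

Calling parse_bibtex_entry(SAMPLE) and passing the result to store_publication with a connection from sqlite3.connect(":memory:") produces one row whose title is "An Example Paper" and whose year is 2012.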
Another advantage of BibTeX is that many of the sites from which the Schaefer Scraper will gather publication data already allow publication data to be exported in BibTeX format.

The Schaefer Scraper will run at regular intervals to gather this data for the system. In the finished product, the automatic running of the Schaefer Scraper will be managed by the customer, who will be able to turn it off or alter the frequency of its runs to suit their needs.

Figure 1. Major Functional Component Diagram

3 IDENTIFICATION OF CASE STUDY

The case study for READ will be the ODU Computer Science (CS) department. Currently, the department does not have a working system for distributing publication information about its faculty. The purpose of the case study will be to remedy this situation.

The department did once have a system for documenting faculty publications, but it fell into disuse. In that system, authors submitted their publications directly to an administrator, who then manually entered each paper's information. Because the process was entirely manual, many people were either reluctant or too busy to submit information about their publications. Eventually, authors stopped submitting their papers and the system was abandoned; it has not been updated since 2008. This history shows that a new system must minimize the effort required of its users or it will fail.

4 READ PRODUCT PROTOTYPE DESCRIPTION

The READ prototype will be implemented in the Computer Science Department of Old Dominion University. A prototype is needed because the scope of this project is larger than the timeframe allotted to create it.
Some of the functionality of the READ system must therefore be left out of the prototype.

All of the specified user types (viewer, author, and administrator) will be implemented in the READ prototype, and the functions of each user type will remain unchanged. After the prototype has been implemented, an administrator will be chosen from the faculty or the systems group.

4.1 Prototype Architecture (Hardware/Software)

The prototype will consist of three main components, the same as those of the finished product, as shown in Figure 2: a basic user interface, an implementation of the database, and the Schaefer Scraper, which will mine websites for publication data.

The prototype will be deployed on the Computer Science Department's servers. The database will be implemented using MySQL. The web-based interface will be implemented using Joomla!. Using this content management system will make author login easier to implement, because one of the team members working on READ has already implemented a login method for Computer Science faculty in another Joomla! project. All queries to the database will be made through PHP scripts that interface between MySQL and the web-based interface. An interface between the Schaefer Scraper and the author information in the database will be written in Python.

Figure 2. Prototype Major Functional Component Diagram (MFCD)

4.2 Features and Capabilities

The prototype will include many of the features planned for the final product, as defined in Table 1. It will allow viewers to search and filter the database through the website, and it will allow minimal user-profile control. An RSS feed and email system will be implemented so that people can stay informed of what the database contains. Access control will be a priority, to prevent unauthorized users from updating authors' papers.
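The search-and-filter capability described in this section can be sketched as a function that turns a viewer's chosen filters into a parameterized SQL query. The prototype is to build its queries in PHP; Python is used here only for illustration, and the filter names, table name, and column names are hypothetical. The "?" placeholder is the style used by Python's sqlite3 module, and a MySQL driver may use a different marker. Keeping user input out of the SQL text and restricting filters to a fixed whitelist is one common way to guard against the SQL injection defined in the glossary.

```python
# Hypothetical whitelist: filter name -> SQL condition with a placeholder.
FILTER_CLAUSES = {
    "title": "title LIKE ?",
    "author": "authors LIKE ?",
    "keyword": "keywords LIKE ?",
    "year": "year = ?",
}


def build_publication_query(filters: dict) -> tuple:
    """Build a parameterized SELECT statement from viewer-chosen filters.

    Returns the SQL text and the list of parameter values separately, so
    that user input is never concatenated into the SQL string.
    """
    clauses, params = [], []
    for name, value in filters.items():
        if name not in FILTER_CLAUSES:
            raise ValueError(f"unsupported filter: {name}")
        clause = FILTER_CLAUSES[name]
        clauses.append(clause)
        # Substring match for text filters, exact match otherwise.
        params.append(f"%{value}%" if "LIKE" in clause else value)
    sql = "SELECT title, authors, year FROM publications"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY year DESC", params
```

With the filters {"author": "Doe", "year": 2012}, the function returns the query text "SELECT title, authors, year FROM publications WHERE authors LIKE ? AND year = ? ORDER BY year DESC" together with the parameter list ["%Doe%", 2012]. An unknown filter name raises ValueError instead of reaching the database.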
The Schaefer Scraper will automate much of the process of updating the publication lists. In the prototype, the Schaefer Scraper will search online for publications by one fourth of the authors each week.

The prototype will not implement every feature of the finished product. It will not include the learning algorithm that keeps an incorrect paper from being resubmitted, nor any visual representations of the data, such as graphs and jQuery Sparklines, to display author statistics.

Feature: Browsing capabilities
Real World Project: Ability to browse all grants and publications.
Prototype: Ability to browse all grants and publications.

Feature: Publication filtering capabilities
Real World Project: Filtered by title, publisher, authors, publication date, date added, and keywords.
Prototype: Filtered by title, publisher, authors, publication date, date added, and keywords.

Feature: Grant filtering capabilities
Real World Project: Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
Prototype: Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.

Feature: Add, edit, and delete publications and grants
Real World Project: Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
Prototype: Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.

Feature: Faculty page
Real World Project: Lists faculty and provides a link to each person's profile page.
Prototype: Not included.

Feature: Login interface
Real World Project: Linked to Old Dominion University Computer Science accounts.
Prototype: Linked to Old Dominion University Computer Science accounts.

Feature: Profile page
Real World Project: Displays the author's profile picture, job title, email address, personal webpage link, and publications and grants. Displays graphs.
Prototype: Displays the author's profile picture, job title, email address, personal webpage link, and publications and grants. Graphs not included.

Feature: Scraper
Real World Project: Will update the system with new publications and grants and alert users when one is added to the system under their name.
Prototype: Will update the system with publications only and alert users when one is added to the system under their name.

Feature: Prediction algorithm
Real World Project: Predicts if the consumer has enough space to use the READ system.
Prototype: Not included.

Feature: Administrative privileges
Real World Project: Administrators are able to edit, add, or remove anything in the system.
Prototype: Administrators are able to edit, add, or remove anything in the system.

Table 1. Key Prototype Features

4.3 Prototype Development Challenges

The primary challenge in developing the prototype is the implementation of the Schaefer Scraper. Much of the code is uncommented and difficult to understand, and the Black Group does not yet completely understand it. One way to mitigate this challenge is to contact Andrew Schaefer and ask him to explain the architecture of his code.

Another challenge in developing the READ prototype will be the variety of publication types. Some publications may be technical papers, while others may be academic publications. In BibTeX format, some of the information may be arranged in a slightly different way depending on what kind of document the publication is.
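As an example of how this variation might be handled, the sketch below maps a few standard BibTeX entry types to the field that names where each work appeared, and reduces every parsed entry to one common shape. The entry types and field names (journal, booktitle, institution, school) are standard BibTeX; the function normalize_publication, the output keys, and the idea of a single "venue" column are assumptions made for illustration, not part of the READ design.

```python
# Standard BibTeX entry types and the field that holds the "venue" for each.
VENUE_FIELD_BY_TYPE = {
    "article": "journal",
    "inproceedings": "booktitle",
    "techreport": "institution",
    "phdthesis": "school",
    "mastersthesis": "school",
}


def normalize_publication(record: dict) -> dict:
    """Reduce a parsed BibTeX entry (field name -> value) to a common shape.

    Unknown entry types are kept, but their venue is left empty.
    """
    entry_type = record.get("entry_type", "misc").lower()
    venue_field = VENUE_FIELD_BY_TYPE.get(entry_type)
    return {
        "type": entry_type,
        "title": record.get("title", ""),
        "year": record.get("year", ""),
        "venue": record.get(venue_field, "") if venue_field else "",
    }
```

A technical report whose parsed fields include institution = "ODU" and a journal article whose fields include journal = "Example Journal" would both come out with the same four keys, so that a single publications table could hold either kind of document.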
GLOSSARY

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their name and to edit them.
BibTeX: A plain-text file format for bibliographic reference information.
Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development, which sometimes includes theory about software engineering methods.
Client application: In a client/server architecture, the module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a "client" application and a "server" application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a limited amount of money to award to principal investigators who submit an accepted application for research funds.
GIT: A software system for controlling and organizing software versioning.
GoogleScholar (http://scholar.google.com): A website that stores academic publications.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., operated with a mouse and keyboard, through which a user interacts with a software application.
Internet scraper: A program that is designed to sort through data that is stored online.
Joomla!: A content management system for designing web interfaces.
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
MicrosoftAcademic (http://academic.research.microsoft.com/): A website that stores information on academic publications.
MySQL: An open-source relational database management system that uses SQL.
Parse: A technical term usually used to describe the processing of a statement written in a programming language.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher on whom a research grant is bestowed, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share research. Publications usually appear in academic journals, technical reports, and records of conference proceedings.
Query: A request sent to the database to either change the database or get back results.
READ: Repository for the Electronic Aggregation of Documents.
RSS: A specification for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input, such as a document or a website, for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can communicate or interact.
SQL: A widely used language for querying and manipulating databases.
SQL injection: Performing unauthorized queries on a database for malicious purposes.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Viewer: An outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Virtual Private Server (VPS): A software version of a hardware server, used to create independent servers on a single piece of hardware.
Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide server-side computation for browser-based client applications.
XML: Extensible Markup Language.

REFERENCES

Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov 2012. <http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.
"Where Do Universities Get Their Money From?" Free By 50. N.p., 13 2011. Web. 19 Nov 2012. <http://www.freeby50.com/2011/11/where-do-universities-get-their-money.html>.