Running head: LAB1 – READ DESCRIPTION

Lab 1 – READ Product Description
Jacob Phillmon
CS411
Janet Brunelle
February 18, 2013

1 Introduction
2 READ Product Description
2.1 Key Product Features and Capabilities
2.2 Major Components
3 Identification of Case Study
READ Product Prototype Description
3.1 Prototype Architecture
3.2 Prototype Features and Capabilities
3.3 Prototype Development Challenges
Glossary
References

1 Introduction

In the United States there are over 4,700 research institutions (Digest of Education Statistics). These institutions publicize their research through publications and upload them to the Internet in order to share them with the online community.
Uploading these documents can consume a large amount of time because in most organizations it is a manual process. The workload can be so extensive that some groups, like Old Dominion University’s Computer Science Department, are unable to keep their systems up to date; the latest publication uploaded dates back to 2008. Outdated systems of this nature are a poor representation of a group’s findings; the system should display the information in a way that advertises the organization to the general public. The organization’s research projects are funded by numerous grants awarded by external funding agencies. Like publications, grants must also be stored in the system, and they are just as tedious to upload as the publications they are associated with.

READ (Repository for Electronic Aggregation of Documents) is a system developed by Old Dominion University’s Computer Science department. It is designed specifically for the needs of that department, but it can be integrated into any online system that displays an organization’s publications and grants. It is designed to automate the process of adding and organizing publications and grants into a filterable format. It will also give users the option to filter what they are looking for, allowing them to narrow down topics and locate those of relevant interest.

The prototype will provide basic functionality, including a user interface, a fully constructed database, user functions such as editing or adding publications or grants, and the automation of publication and grant submissions to the system. It will not include features that are mainly aesthetic, such as graphs that illustrate the number of publications a person has created over the past few years or the amount of grant money a person has earned.

2 READ Product Description

READ is an online system that stores publications and grants that are associated with members of Old Dominion University’s Computer Science department. The system is designed to minimize the amount of time needed to update the system with the most recent publications produced by the department’s faculty. Only extra information, such as a single thumbnail image, will require manual input from the author. The system will still allow authors to upload publications directly without using the system’s automated features.

2.1 Key Product Features and Capabilities

The system interface will allow anyone to browse through documents stored within the system. It will also allow the user to filter the displayed documents into much smaller, more manageable result sets. Filter values are compared against information stored with each document, such as the title, the authors associated with the publication, the publication date, or specific keywords manually set by the author. If the filter values match the document information, the document will be displayed in the filtered results. Filter parameters will vary depending on the type of document currently displayed.

A personal profile will also be included within the system interface. Profile information, such as a profile picture and additional descriptive information about the author, will be displayed here; profile information can be altered by the profile’s owner at any time through an editing interface. Additionally, any publications in which the profile’s owner is listed as an author, or grants in which the profile’s owner is listed as a principal or co-principal investigator, will be displayed here.
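
The filter matching described earlier in this section can be illustrated with a minimal sketch. It assumes a simple in-memory record; the class names, field names, and the rule that every set criterion must match are hypothetical choices for illustration, not the READ implementation (which is planned in PHP and MySQL).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Publication:
    # Hypothetical record holding the fields the text says filters compare against.
    title: str
    authors: List[str]
    published: date
    keywords: List[str] = field(default_factory=list)


@dataclass
class PublicationFilter:
    """Each criterion is optional; None means 'do not filter on this field'."""
    title_contains: Optional[str] = None
    author: Optional[str] = None
    published_from: Optional[date] = None
    published_to: Optional[date] = None
    keyword: Optional[str] = None

    def matches(self, pub: Publication) -> bool:
        # Assumption: a document is shown only if it satisfies every criterion that is set.
        if self.title_contains is not None and self.title_contains.lower() not in pub.title.lower():
            return False
        if self.author is not None and self.author.lower() not in [a.lower() for a in pub.authors]:
            return False
        if self.published_from is not None and pub.published < self.published_from:
            return False
        if self.published_to is not None and pub.published > self.published_to:
            return False
        if self.keyword is not None and self.keyword.lower() not in [k.lower() for k in pub.keywords]:
            return False
        return True


def filter_publications(pubs: List[Publication], flt: PublicationFilter) -> List[Publication]:
    """Return the documents that would appear in the filtered results."""
    return [p for p in pubs if flt.matches(p)]
```

A grant filter could follow the same pattern with its own criteria (funding agency, investigator, start and end dates, active state), which is one way to read the statement that filter parameters vary by document type.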

If an author is logged into the system and is on their own profile page, they can edit the publications and grants on the page because they are associated with them. While editing publications or grants, an author can submit a BibTeX document in order to fill in various fields instantly. Authors will also be able to upload files to be associated with their publications if they wish; the files can be downloaded by viewers of the READ system. Graphs detailing the number of publications and the amount of grant money earned each year will also be displayed. Each author’s profile page can be viewed by any user, but only the author, when logged in and viewing their own profile page, or a system administrator will have rights to edit anything within it.

The system will use an external module called the Schaefer Scraper to search predefined sites that contain publications created by authors. It will automatically extract publication and grant data from each site searched, including the title of the publication, the authors associated with it, and a link to the page on which the publication is displayed. The information will then be stored in the READ database. Whenever a new publication or grant has been added to the database under an author’s name, an alert will be emailed to that author. If authors believe the publication was added under their name in error, they can remove it from the system, either by clicking a link displayed within the alert or through their own profile interface. Otherwise, they can add extra information to the publication and correct any mistakes through an editing interface. Over time the Scraper will learn to avoid alerting specific users based on the publication removal patterns of authors.
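
The BibTeX auto-fill mentioned earlier in this section could work roughly as follows. This is a minimal sketch under stated assumptions: it handles one entry with flat `{...}`, `"..."`, or bare numeric values and does not handle nested braces or string macros; the function names and the form-field names are hypothetical.

```python
import re

# Entry header such as "@article{smith2012,".
ENTRY_RE = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')
# A field such as title = {..}, journal = "..", or year = 2012 (no nested braces).
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')


def parse_bibtex_entry(text: str) -> dict:
    """Parse a single BibTeX entry into a flat dictionary of lowercase field names."""
    header = ENTRY_RE.search(text)
    if header is None:
        raise ValueError("no BibTeX entry found")
    fields = {"entry_type": header.group(1).lower(), "cite_key": header.group(2)}
    for name, braced, quoted, bare in FIELD_RE.findall(text[header.end():]):
        value = braced or quoted or bare
        # Collapse line breaks and repeated spaces inside a value.
        fields[name.lower()] = " ".join(value.split())
    return fields


def to_form_fields(entry: dict) -> dict:
    """Map parsed BibTeX fields onto hypothetical edit-form fields."""
    authors = [a.strip() for a in entry.get("author", "").split(" and ") if a.strip()]
    return {
        "title": entry.get("title", ""),
        "authors": authors,
        "year": entry.get("year", ""),
        "publisher": entry.get("publisher", ""),
    }
```

Fields that the BibTeX document does not supply stay empty, so the author can still complete them by hand in the editing interface.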
If for some reason a publication is manually added to the system and the Scraper later finds a copy of it on an external site, a duplicate copy will not be added to the system.

2.2 Major Components

Figure 1 – Major Functional Component Diagram

Figure 1 illustrates the components that will be used within the READ system. The system will be stored on a server owned by Old Dominion University. Major software components within the system include a graphical user interface, a database, and a Scraper.

The system interface will be split into two sections: a public section and a private section. The public section will allow anyone to browse and filter publications or grants stored within the system and to view author profile pages. The private section allows authors to edit or remove publications that they own. The private section will require a login interface that verifies that the user is a registered author.

A database will be used to house all publication and grant data as well as the authors that are registered within the system. The user interface will communicate with the database in order to display the publications and grants stored within it and to save changes that authors submit to their profile information, publications, or grants.

The Schaefer Scraper will search specific sites on the Internet and extract publications that are associated with authors stored within the database. It will run on a schedule set by system administrators and will update the database with the most recent publications automatically.

A module called the Prediction Algorithm will be provided on the READ main webpage to determine whether a company has enough storage space for READ to meet its needs.
The Prediction Algorithm will require the average amount of storage consumed by an author, the average number of uploaded files per author, and the average size of an upload.

3 Identification of Case Study

The READ system is designed specifically for Old Dominion University’s Computer Science Department. The department is composed of a group of faculty members, most of whom produce numerous publications detailing their research every year. In an attempt to organize the faculty’s publications into a single viewable location, the department had a system in place where publications were submitted by the faculty and later added to the system manually by the system’s administrator. Updating the system cost so much time that most of the faculty stopped submitting publications altogether. This can be seen in the system’s display itself, as the last submitted publication dates back to 2008 (Recent Publications). The page also lacks any filter capabilities; all publications are displayed with the most recently published at the top and older ones toward the bottom. The display page is no longer linked to the department’s homepage because it is out of date and no longer in use.

The READ system is designed to encourage use of the new system by automating the process of updating it and by adding browsing capabilities. Eventually the system may be expanded to other departments at Old Dominion University as well as to other organizations that require a system to organize their publications and grants.

READ Product Prototype Description

The READ prototype is designed to integrate the use of the Schaefer Scraper into a working database system and display environment.
It will be used to demonstrate the functionality of the system to the Old Dominion University Computer Science department; this demonstration will allow the department to decide on any changes it may want made to the system before it is fully developed. The prototype will use actual publications and grants created by Old Dominion University’s Computer Science faculty in order to demonstrate the effectiveness of the Schaefer Scraper. Additional user interface functionality will also be implemented in order to demonstrate the system’s usage.

3.1 Prototype Architecture

Figure 2 – Prototype Major Functional Component Diagram

The major hardware and software component structure of the READ prototype is illustrated in Figure 2. The READ system is stored on a Debian virtual machine. Access to the system will require a computer with the ability to browse the Internet. The main software components built within the system are the database, the Web-based interface, and the Schaefer Scraper.

The database shall be created using MySQL, a database system the READ team has extensive experience working with. All publication and grant data stored in the database will be based on actual publications and grants owned by Old Dominion University’s Computer Science faculty and graduate students. The Web-based interface shall be written using PHP and standard HTML, with AJAX used to create a type-ahead publication and grant filter and query system. The Schaefer Scraper is a prebuilt module provided by Andrew Schaefer. It will provide all of the Scraper functionality needed except for the ability to add grants automatically into the READ system.
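
To make the database and type-ahead components of this architecture more concrete, the sketch below stores publications and answers a prefix query. It is an illustration only: it uses Python’s built-in `sqlite3` module as a stand-in for MySQL and PHP, and every table, column, and function name is a hypothetical assumption rather than the READ schema.

```python
import sqlite3
from typing import List, Optional

# Hypothetical schema: publications, authors, and a link table between them.
SCHEMA = """
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE publication (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    source_url TEXT
);
CREATE TABLE publication_author (
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    author_id      INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (publication_id, author_id)
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_publication(conn: sqlite3.Connection, title: str,
                    author_names: List[str], source_url: Optional[str] = None) -> int:
    """Insert a publication and link it to its authors, creating authors as needed."""
    pub_id = conn.execute(
        "INSERT INTO publication (title, source_url) VALUES (?, ?)",
        (title, source_url),
    ).lastrowid
    for name in author_names:
        row = conn.execute("SELECT id FROM author WHERE name = ?", (name,)).fetchone()
        if row is None:
            author_id = conn.execute("INSERT INTO author (name) VALUES (?)", (name,)).lastrowid
        else:
            author_id = row[0]
        conn.execute("INSERT INTO publication_author VALUES (?, ?)", (pub_id, author_id))
    return pub_id


def type_ahead(conn: sqlite3.Connection, prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` titles that start with what the user has typed so far."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT title FROM publication WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?",
        (escaped + "%", limit),
    ).fetchall()
    return [r[0] for r in rows]
```

In a web setting, the browser would send the partial text with each keystroke and render the returned titles as suggestions. The queries pass user input as bound parameters rather than building SQL strings, which is the usual defense against the SQL injection risk listed in the glossary.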

3.2 Prototype Features and Capabilities

Feature | Real World Project | Prototype
Browsing Capabilities | Ability to browse all grants and publications. | Ability to browse all grants and publications.
Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords.
Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
Faculty page | Lists faculty and provides a link to each person’s profile page. | Not included.
Login interface | Linked to Old Dominion University Computer Science accounts. | Linked to Old Dominion University Computer Science accounts.
Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included.
Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name.
Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included.
Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system.

Table 1 – Features and Capabilities list

Table 1 details the differences between the real world project and the READ prototype. The prototype includes most of the capabilities and features of the real world project except for a few that are primarily aesthetic. First, the profile page will not display graphs detailing information about the author’s contributions. The Prediction Algorithm will not be included in the prototype, as it would only be used as a guideline for other groups that may wish to use the READ system. The faculty page will also not be included, as the Computer Science department already has one on its main page. The department may choose to incorporate links to the profile pages from its own faculty page in the future.

3.3 Prototype Development Challenges

There are a number of challenges and risks that may appear during the development of the READ system. First, the Schaefer Scraper may need to be modified in order to be compatible with the READ system. In addition, the format of the data extracted from various websites might not meet the format specifications of the database the team develops. This problem is probably unavoidable, so the group will begin studying the Schaefer Scraper’s code early in development.

Second, the prototype may not meet all of the user requirements and specifications. A time limit has been placed on the production of the prototype, so it might not be finished by the due date. It might also go unfinished because the team lacks knowledge needed to develop the system. To avoid this, the task of coding the prototype will be split among group members, and any technical skills needed to develop the prototype will be researched ahead of time.

There is also the possibility that the interface may be incompatible with certain browsers. The requirements for a page to be displayed in a browser such as Google Chrome differ from those for Firefox or Internet Explorer. Google Chrome will be the primary target browser for the prototype, but later the interface will be expanded to be fully compatible with most browsers and possibly even portable devices such as smartphones.

GLOSSARY

Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who publishes in an academic journal or other academic venue.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development that sometimes includes theory about software engineering methods.
Client application: In a client/server architecture, the module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a limited amount of money to award to principal investigators who submit an accepted application for research funds.
Git: A software system for controlling and organizing software versioning.
GoogleScholar (http://scholar.google.com): Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: articles, theses, books, abstracts and court opinions, from academic publishers, professional societies, online repositories, universities and other web sites. Google Scholar helps you find relevant work across the world of scholarly research.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., through which a user interacts with a software application via a mouse and keyboard. Used to differentiate from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.
Internet scraper / web scraper: Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet (Wikipedia).
JQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
MicrosoftAcademic (http://academic.research.microsoft.com/): Microsoft Academic Search is a free service developed by Microsoft Research to help scholars, scientists, students, and practitioners quickly and easily find academic content, researchers, institutions, and activities. Microsoft Academic Search indexes not only millions of academic publications, it also displays the key relationships between and among subjects, content, and authors, highlighting the critical links that help define scientific research.
Microsoft Academic Search makes it easy for you to direct your search experience in interesting and heretofore hidden directions with its suite of unique features and visualizations.
MySQL: An open-source relational database management system that is queried using SQL.
Parse: A technical term usually used to describe the processing of a statement written in a programming language. May be used generally to describe the processing of any statement for specific meaning.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher to whom a research grant is awarded, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share research. Publications usually appear in academic journals, technical reports, and records of conference proceedings.
Query: A request sent to the database to either change its contents or retrieve results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input, such as a document or a website, for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software products, or versions of the same software, can communicate and interact.
SQL: A widely used language for querying databases.
SQL injection: Performing unauthorized queries on a database for malicious purposes.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Viewer: In the scope of this project, an outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Virtual Private Server (VPS): A software version of a hardware server. Used to create independent servers (....) on a single piece of hardware.
Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide server-side computation for browser-based client applications. A web server is a constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.

REFERENCES

Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov. 2012. <http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.
"Recent Publications." Department of Computer Science. N.p., n.d. Web. 13 Feb. 2013. <http://www.cs.odu.edu/recent_publications.shtml>.