
Lab 1 – READ Product Description
Marcus Zehr
CS411
Janet Brunelle
March 18, 2013
Table of Contents
1. INTRODUCTION
2. PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
2.3 Target Market/Customer Base
3. PRODUCT PROTOTYPE DESCRIPTION
3.1 Prototype Functional Goals and Objectives
3.2 Prototype Architecture (Hardware/Software)
3.3 Prototype Features and Capabilities
3.4 Prototype Development Challenges
GLOSSARY
REFERENCES
List of Figures
Figure 1. READ’s MFCD
List of Tables
Table 1. Comparison of Features and Capabilities between READ prototype and RWP
1. Introduction
There are more than 4,700 research institutions in the United States (Digest of Education
Statistics). These institutions can showcase their research by publishing it and uploading the
publications to the Internet. However, many of the organizations associated with these institutions
lack an efficient procedure for uploading and maintaining such documents. This problem is
important to fix so that institutions receive appropriate recognition for the work they complete.
Solving it would also let research institutions advertise the specific areas of research under way
at any given time.
The current process many organizations use to present their publications online is non-automated,
slow, and tedious, which leaves many areas to improve in order to properly display and advertise
an institution’s publications. One main reason is that the responsibility for updating the system
may rest upon a sole individual or administrator. This issue will be addressed through the
development of the READ web application for research institutions.
The Repository for Electronic Aggregation of Documents (READ) aims to automate the
currently manual process of submitting an organization’s publications and grant information. It
will also keep this information better organized and allow publications and grants to be
searched using filters and displayed in an easy-to-read format. In addition to this,
READ will provide the ability for users to verify that their grant and publication information is
correct. Ultimately, READ will ease the burden of keeping up with numerous publications to
allow researchers to spend more time working and less time managing files.
2. Product Description
READ is an online database with a number of capabilities, including the ability to store
information about publications and grants along with links to externally hosted publications. It
provides a way for users to browse and search through publications using a number of filters such
as author, publish date, keywords, and type of document. It allows a user to advertise their
research, both past and current, as well as information about any grants which apply to them.
Lastly, it minimizes the need for a user to manually organize their publications by utilizing the
Schaefer Scraper to automatically find publications on the Internet.
2.1 Key Product Features and Capabilities
The Schaefer Scraper will be used by READ to search for publications matching a registered
user’s credentials and to extract pertinent information from them, such as the title, author or
authors, publication date, and type of publication. This information is then inserted into the
database along with a link to that specific publication, and a notification email is sent to the
user asking them to authorize the newly added publication information. Based on the user’s
actions and patterns of denied publications, READ will learn when not to notify a user about
certain publications that are found.
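The source does not specify how READ learns from denied publications. As a minimal sketch of one possible approach, the Python below counts denials per user and per source and stops notifying once an assumed threshold is reached. Python is used only for brevity (READ itself is planned in PHP), and the class name, the (user, source) key, and the threshold of 3 are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DenialTracker:
    """Counts denied publications per (user, source) pair (hypothetical design)."""
    threshold: int = 3  # assumed cutoff; the source does not give one
    denials: Counter = field(default_factory=Counter)

    def record_denial(self, user_id, source):
        # Called whenever a user rejects a publication found at `source`.
        self.denials[(user_id, source)] += 1

    def should_notify(self, user_id, source):
        # Keep notifying until the user has denied this source `threshold` times.
        return self.denials[(user_id, source)] < self.threshold
```

A real implementation could key the counter on other patterns, such as a co-author name or a venue, instead of the source site.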
With READ in place, any user will be able to browse the database using an assortment of
filters for publications. Publications may be searched by publish date, multiple authors,
keywords, and whether or not the full text is available. Grants may be searched by total amount,
status, funding agency, or investigators.
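The publication filters described above could map onto a parameterized database query. The sketch below is illustrative only: it uses Python with an in-memory SQLite database rather than the PHP and MySQL planned for READ, and the table name, column names, and sample rows are assumptions, not READ's actual schema. Storing authors in one text column is a simplification for the example.

```python
import sqlite3


def build_publication_query(authors=None, published_after=None,
                            keyword=None, full_text_only=False):
    """Build a parameterized SELECT from whichever filters are set."""
    clauses, params = [], []
    for author in authors or []:
        # One LIKE clause per author, so every listed author must appear.
        clauses.append("authors LIKE ?")
        params.append(f"%{author}%")
    if published_after:
        clauses.append("publish_date >= ?")
        params.append(published_after)
    if keyword:
        clauses.append("keywords LIKE ?")
        params.append(f"%{keyword}%")
    if full_text_only:
        clauses.append("full_text_url IS NOT NULL")
    sql = "SELECT title FROM publications"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY publish_date DESC"
    return sql, params


def demo_search():
    """Run one filtered search against a hypothetical two-row table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publications (title TEXT, authors TEXT, "
        "publish_date TEXT, keywords TEXT, full_text_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO publications VALUES (?, ?, ?, ?, ?)",
        [
            ("Paper A", "A. Author; B. Writer", "2012-05-01", "networks",
             "http://example.org/a.pdf"),
            ("Paper B", "B. Writer", "2010-01-15", "databases", None),
        ],
    )
    sql, params = build_publication_query(
        authors=["B. Writer"], published_after="2011-01-01"
    )
    return [row[0] for row in conn.execute(sql, params)]
```

Passing user input as bound parameters, instead of concatenating it into the SQL string, also guards against the SQL injection attack defined in the glossary.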
Each user will have a profile which will display information including the user’s name, job
title, personal photo, email address, affiliated organization, and homepage. The user’s profile
page will also include graphical representations of the number of publications they have authored
and when they were published, as well as any funding associated with those publications. In
addition to the graphical representations, the profile page for each user will include a list of
that specific user’s publications.
2.2 Major Components (Hardware/Software)
READ will combine simple hardware and software components, integrated so that users can carry
out their tasks with ease. Figure 1 below illustrates the major functional components of READ.
This solution consists of a single server which will house three main software components: a web
interface, a publication link database, and the Schaefer Scraper.
Figure 1 – READ’s MFCD
The web interface will have both public and private areas. The private areas will require a user
to log on to their account and will give the user access to their own profile page and
administrative abilities. Inside the web interface, the user may also access the search filters
for publications and grants as well as other users’ public profiles.
The publication link database’s main function is to house and provide links to any external
publications and grant information to the users. This database will also contain files which have
been uploaded by the users including publications, grants, and other files which may be related
to either.
The last internal component is the Schaefer Scraper, an automated tool that will search
specific external web sites for new publications by a list of authors provided as input
in an XML file. The scraper will look for publications by the listed authors, collect and parse
the results, and then export them into the READ link database for further use.
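The three scraper steps described above (read the authors from XML, collect results, export them to the link database) can be sketched as follows. This is not the Schaefer Scraper's code: it is a minimal Python illustration in which the XML layout, the `links` table, and the `fake_search` stand-in for the real web scraping are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input format; the source only says authors arrive in an XML file.
AUTHORS_XML = """<authors>
  <author>Jane Example</author>
  <author>John Sample</author>
</authors>"""


def read_authors(xml_text):
    """Step 1: read the author names from the XML input."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("author") if el.text]


def fake_search(author):
    """Step 2 stand-in: a real scraper would query external sites here."""
    slug = author.lower().replace(" ", "-")
    return [{"title": f"A study by {author}", "author": author,
             "url": f"http://example.org/{slug}"}]


def run_scraper(xml_text, conn, search=fake_search):
    """Step 3: export results to the link database, skipping known URLs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(title TEXT, author TEXT, url TEXT UNIQUE)"
    )
    before = conn.total_changes
    for author in read_authors(xml_text):
        for hit in search(author):
            conn.execute(
                "INSERT OR IGNORE INTO links VALUES (?, ?, ?)",
                (hit["title"], hit["author"], hit["url"]),
            )
    conn.commit()
    # Number of rows actually inserted during this run.
    return conn.total_changes - before
```

Because the URL column is unique, running the scraper again on a schedule adds only publications that are not already stored.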
2.3 Target Market/Customer Base
The initial consumer for READ is Old Dominion University’s Computer Science Department
(ODUCS). Dr. Michele Weigle, a professor at ODU with a Ph.D. in Computer Science, requested a
solution for this particular issue and is acting as group mentor for this project. According to
its website, the ODUCS Department has 37 faculty members, 11 currently enrolled Ph.D. students,
and 11 Master’s students. All of these individuals could take advantage of a system that makes it
easier to discover relevant and up-to-date research, but they do not begin to cover the number of
people who would find READ to be an indispensable resource in the future.
Once testing is complete at ODUCS, READ may be adopted by other departments within Old
Dominion University. READ could then potentially be used by other universities, government
institutions, research institutions, and libraries to help with their organizational and research
needs. Overall, this system will become a useful tool for the many people looking for more
information about a school’s focus of study.
3. READ Product Prototype Description
The READ prototype will organize publications and the grant information associated with them
for Old Dominion University. It will be modeled using the publications of Old Dominion
University’s Computer Science Department and hosted on a virtual machine running a Linux-based
operating system. It will include most of the features of the real world project, with the
differences listed in Section 3.3, and will be instrumental in maintaining publications and grants
for the university.
3.1 Prototype Functional Goals and Objectives
The READ prototype will be able to search the database and filter the results based on user
queries, implement RSS feeds, allow users to log on, edit, and upload data, and give
administrators functional control of the web application. By utilizing the Schaefer Scraper, it
will also find publications and insert them into the READ database. Users of the system will be
able to log on, access and edit their own publication and grant information, and upload personal
information and files to the READ database. These functions will allow easy access to numerous
publications, grants, and tech reports written by Old Dominion University faculty and students.
All of this will be presented in a well-organized, easy-to-navigate user interface consisting of a
filter and search page, displays for publications and grants, a profile display system, and an RSS
feed.
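As an illustration of the RSS feed mentioned above, the sketch below builds a small RSS 2.0 document listing recently added publications. It is written in Python for brevity (the prototype itself is planned in PHP), and the function name, feed description, and the shape of the `publications` input are assumptions.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title, feed_link, publications):
    """Return an RSS 2.0 document with one <item> per publication."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Recently added publications"
    for pub in publications:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = pub["title"]
        ET.SubElement(item, "link").text = pub["url"]
    return ET.tostring(rss, encoding="unicode")
```

A feed reader subscribed to such a document would show each new publication as it is added to the database.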
3.2 Prototype Architecture (Hardware/Software)
The READ prototype will allow a user to log on to the system via a web-based interface. This
interface will then let the user search for or add publications to the database, which is
publicly viewable. The Schaefer Scraper will also run on a predefined schedule in order to
populate the database with information, both initially and on a regular basis, as defined by the
administrators of the READ program.
READ will be programmed using PHP with a MySQL database, as these technologies are better
suited to the needs of this project than the alternatives and will allow a large and detailed
database structure to be accessed with ease on the Internet. The prototype also incorporates an
off-the-shelf piece of software named the Schaefer Scraper, which was built using PHP and returns
search results in HTML format. For the purpose of the prototype, the Schaefer Scraper will be
scraping information from Google Scholar, Microsoft Academic, Arnetminer, Scopus, and Google
Citation.
3.3 Prototype Features and Capabilities
The READ prototype will be able to store and display numerous publications and grants across
several publication types. Both the publications and the grant information pertaining to them
will be searchable in the READ database. The prototype will also notify users of any publications
found by the Schaefer Scraper so that they can confirm the authenticity of the publications in the
database.
| Features | Real World Project | Prototype |
| --- | --- | --- |
| Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts |
| Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

Table 1 – Comparison of Features and Capabilities between READ prototype and RWP
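The table notes that form fields can be filled in automatically from a BibTeX document. The source does not say how READ parses BibTeX, so the following is only a minimal Python sketch of the idea: it handles a single entry with simple, non-nested field values, and the function name and the sample entry are hypothetical.

```python
import re

# Matches the entry header, e.g. "@article{doe2012,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches name = {value}, name = "value", or name = 2012 (no nested braces).
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')

SAMPLE_ENTRY = """@article{doe2012,
  author = {Jane Doe and John Roe},
  title = "An Example Title",
  year = 2012
}"""


def parse_bibtex_entry(text):
    """Return a dict of fields that could pre-fill an upload form."""
    header = ENTRY_RE.search(text)
    if header is None:
        raise ValueError("no BibTeX entry found")
    fields = {"type": header.group(1).lower(), "key": header.group(2)}
    for name, braced, quoted, number in FIELD_RE.findall(text[header.end():]):
        # Exactly one of the three value groups is non-empty per match.
        fields[name.lower()] = (braced or quoted or number).strip()
    return fields
```

A production system would more likely rely on an established BibTeX parsing library, since real entries may contain nested braces, string macros, and multiple entries per file.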
3.4 Prototype Development Challenges
Possible challenges for the READ prototype include understanding the architecture of the
Schaefer Scraper’s code and learning how to integrate it into the READ solution properly. This
piece of code is crucial to the innovative features that READ provides. Secondly, the prototype
may not meet all necessary specifications within the required timeframe, which would result in a
usable but unfinished end product for the Old Dominion University Computer Science department.
GLOSSARY
Administrator/Administrative User: a user with increased privileges for editing database content
Author: A person that is able to add and edit publications and grants to the system under their
name.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to
automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and
algorithm development that sometimes includes theory about software engineering
methods.
Client application: The module that takes input and creates queries to be processed by a server,
and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a
“client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Google Scholar: A search engine primarily used to find academic literature.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc.
that can be interacted with via a mouse and keyboard, through which a user interacts with
a software application.
Internet scraper: A program which takes unstructured data on the web and puts it into structured
data that can be stored and analyzed in a central local database or spreadsheet.
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
Microsoft Academic: A free service developed by Microsoft Research to help scholars, scientists,
students, and practitioners quickly and easily find academic content, researchers,
institutions, and activities.
MySQL: An open source relational database management system.
Parse: To process a statement for specific meanings.
Perl: A widely used programming language on the server side of web applications.
PHP: A widely used programming language on the server side of web applications.
Principal Investigator (PI): The primary researcher that a research grant is bestowed upon,
responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal, a technical
report, or a record of conference proceedings.
Query: A command sent to the database to either change the database or get back results.
READ: Repository for Electronic Aggregation of Documents
RSS: A dialect of XML for subscribing to and distributing news.
RWP: Real World Project.
Scraper: An automated application designed to scan a source of input such as a document or a
website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from
a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can
communicate/interact.
SQL: A widely used language for querying and managing relational databases.
SQL injection: An attack that inserts malicious SQL into an application’s input so that
unauthorized queries are performed on its database.
User Authentication: The process of verifying the access credentials of a user of an automated
system, usually accomplished by requesting a username and password combination.
Viewer: An outside person who wishes to query the information contained in the READ
database.
Version Control: A method for organizing and recording different versions of documents that
have been created over time.
Virtual Private Server (VPS): A software version of a hardware server used to create
independent servers on a single piece of hardware.
Web server: A constantly running application, or group of applications, whose sole or main job
is to respond to HTTP requests from browsers.
XML: Extensible markup language.
REFERENCES
Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov 2012.
<http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.