Mini Project Documentation

Web Wise Document System (WWDS)

Gayatri Vidya Parishad College Of Engineering
Visakhapatnam

Certificate

This report on “WEB WISE DOCUMENT SYSTEM (WWDS)” is a bonafide record of the mini-project work submitted by

S.S.V. Kaushik (Reg No: 06131A0579)
P. Santosh Varma (Reg No: 06131A0563)
Bh.S. Ramaraju (Reg No: 06131A0577)

in their sixth semester of Bachelor of Technology in Computer Science and Engineering during the academic year 2009-2010.

Guide          Observer          Head of the Department

Candidate’s Declaration

We hereby declare that the work presented in this project titled “Web Wise Document System (WWDS)”, submitted towards completion of the mini-project in the sixth semester of B. Tech (CSE) at the Gayatri Vidya Parishad College Of Engineering (GVPCOE), Visakhapatnam, is an authentic record of our original work pursued under the guidance of Prof. David Wayne Clay and Prof. Krishna Subba Rao, GVPCOE, Visakhapatnam. We have not submitted the matter embodied in this project for the award of any other degree.

S.S.V. Kaushik
P. Santosh Varma
Bh.S. Ramaraju

Place: Visakhapatnam
Date: 21-12-2009

ACKNOWLEDGMENT

First and foremost, we would like to express our sincere gratitude to our mini project guides, Prof. David Wayne Clay and Prof. Krishna Subba Rao. We were privileged to experience a sustained, enthusiastic and involved interest from their side. This fueled our enthusiasm even further and encouraged us to boldly step into what was a totally dark and unexplored expanse before us.

ABSTRACT

Introduction

With the Internet growing day by day in both number of users and amount of content, the traditional stand-alone approach is nearing its end. There is a need for an approach that integrates the World Wide Web with stand-alone systems. Web Wise Files are one such approach.
Motivation

The idea of Web Wise Files comes from the very basis of a distributed operating system, in which changes made to files on one terminal should be reflected on every terminal that is part of that system. That is, the whole system is a virtual system whose parts reside at different geographical locations.

Problem Statement

“A user regularly browses the Internet to keep up with changes in the Stock Market, the Weather of that day, the Score of an ongoing Cricket match, the latest technological advances and so on. In addition to these, he/she may listen to the latest audio or watch videos, or may wish to use the services available on the Internet.”

In order to do all this, the user needs to spend a lot of time surfing different sections of the World Wide Web for the corresponding information. An approach that does all these things quickly, in a more systematic and customizable way, relieving users of the strain, would be a boon to all users. We use Web Wise Files to solve this problem.

Approach

Web Wise Files are files whose contents change along with the World Wide Web, yet they physically exist on the very terminal the user works on. Web Wise Files are user-defined files made up of data (from the WWW) of the user's choice, in the format specified by the user.

TERMINOLOGY USED

Web Wise Document (WWD)
The actual document displaying all the content blocks in the user-defined format.

Web Wise Document Definition (WWDD)
It contains the information about all the content blocks involved.

Web Wise Document System (WWDS)
The system which contains all the web wise documents and their definitions.

Web Wise Document Template (WWDT)
This refers to the XML file containing the WWDD.

INTRODUCTION TO WWDS

A Web Wise Document System will allow a user to create, edit and view local documents with embedded web content.
A web wise document will consist of a layout definition section and a set of content block definitions. The layout definition section indicates the placement of content blocks in the viewed document. These definitions will be expressed using XML.

Web Wise Document System Layout

There are 2 layout options which a document can have:

1. Column layout, which defines a linear sequence of content blocks
2. Row wise layout, which defines a rectangular array of content blocks

Content Block

Content block definitions indicate the method of locating the content and any details needed to access the content. These definitions will be expressed using XML. Each content block definition will contain:

1. content block title
2. content block type
3. content block parameters, which depend on the block type

Content Block Types

Content block types include the following:

1. Local text
2. Remote text using ftp
3. Web page content
4. Blog post
5. Web service
6. Twitter post
7. RSS Feed

FEATURES of a WWDS

The features provided by the WWDS include the following:

- Creating/editing/modifying the contents of the WWDT specific to a user.
- Dynamically retrieving a section of local or remote text files.
- Dynamically retrieving a section or part of a web site.
- Dynamically retrieving recent posts/articles in a blog.
- Dynamically retrieving content from RSS Feeds.
- Providing GUIs to access Web Services.
- Retrieving data from a social networking site (e.g. Twitter).
- Showing the retrieved data in the user-desired format/layout.

User's Role in a WWDS

The user of a web wise document system should be able to do the following:

1) Create/edit a WWDD.
2) Customize the layout, type and number of content blocks to be displayed in the WWD.
3) View the WWD updated to that instant.

System's Role in a WWDS

The system, on the other hand, should be able to do the following:

- Store the WWDD created/edited by the user in some understandable format such as XML.
- Understand the WWDD and accordingly create the WWD by retrieving the appropriate content blocks and presenting them in the user-desired layout.

MODULES INVOLVED

The system will contain mainly two modules:

- Creator
- Viewer

Creator

The Creator will have the following properties:

- The Creator will be responsible for generating the XML file for the WWD (Web Wise Document).
- The Creator provides a visual interface for the user to customize layout and content definitions for each WWD.
- The Creator will express the layout and content definitions using an XML file.
- In addition, the Creator may provide options for customizing the appearance of the content sections (such as color).

Viewer

The Viewer will have at least the following properties:

- The Viewer will be responsible for retrieving and showing the content to the user in the appropriate layout using the XML file (the one generated by the Creator).
- The Viewer retrieves data dynamically when the user opens the WWD.
- The Viewer provides options for storing the WWD.

Content Block Title

Each content block will be assigned an id (normally generated sequentially) by the Creator. The content block title normally represents one of the following:

1. File name (for local or remote text).
2. URL (for web content, blogs etc.).
3. Name of the service (for a web service).

Content Block Parameters

Content block parameters may include one or more of the following to uniquely identify the block in its source:

1. Page ID or Section ID.
2. Id of the HTML tag in case of web content.
3. The absolute position of the content section in the entire page or document.
4. The relative position of the content section with respect to another fixed section in the same page or document.
5. The IP address of the source (and optional login credentials) in case of remote text.
6. The absolute path of the file in case of local or remote text.
7. Heading of the content section.
8. Date and time constraints in case of blog posts etc.
9. Login credentials for restricted web content.
10. Any other parameters not mentioned above.

The user can define a Custom Layout by defining the absolute position and orientation of each content block.

REQUIREMENTS / SPECIFICATIONS

The requirements and specifications for this application include both software (SRS) and hardware requirements.

SRS (Software Requirement Specification)

We mainly need an operating system and Microsoft's Visual Studio with the .NET framework installed on it to develop our application, as we implemented it in the VB .NET language.

Operating Systems
Windows 2000, Windows XP, Windows Vista, Windows Server 2003 or later

Microsoft's VB .NET Package
Microsoft Visual Studio or later (Visual Basic 6.0 or later)

Hardware Requirements

1. RAM: 256 MB
2. Processor: Pentium II class processor
3. Video: 800x600, 256 colors

System Requirements

Functional Requirements

These are the requirements given to the system during the Requirements Phase of software development.

- The user should be able to choose from a variety of content sources available on the Internet.
- The user should be able to view all the content blocks simultaneously.
- The user should be able to modify the content blocks and their definitions at will.
- The system should minimize the overall time spent by the user in surfing and browsing the content blocks individually.
- The user should be able to view the document in the desired layout.
- The system should dynamically obtain the contents of the document from the corresponding sources.

Non Functional Requirements

- The interface should be a GUI.
- The user should be able to use the GUI with minimum guidance. That is, the interface should be self-explanatory.
- The GUI must provide an easy way to create, edit and delete contents of the document.
- The system should consume minimum hardware resources.
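
The document definition described in the preceding sections (a layout plus a list of content blocks, each with a sequential id, a title, a type and type-specific parameters) can be modelled as a small data structure. The following is a minimal sketch in Python, not the project's VB .NET code. The class names, the file name `notes.txt` and the parameter names `path`, `start_line` and `line_count` are assumptions made for illustration; only the layout names, the block type names and the `BlogURL` parameter come from this report.

```python
from dataclasses import dataclass, field

# Layout and block type names as listed in this report, compared case-insensitively.
LAYOUTS = ("column", "row wise")
BLOCK_TYPES = ("local text", "remote text using ftp", "web page content",
               "blog post", "web service", "twitter post", "rss feed")


@dataclass
class ContentBlock:
    block_id: int        # assigned sequentially, as the Creator does
    title: str           # file name, URL or service name
    block_type: str      # one of BLOCK_TYPES
    params: dict = field(default_factory=dict)  # type-specific parameters

    def __post_init__(self):
        if self.block_type.lower() not in BLOCK_TYPES:
            raise ValueError(f"unknown content block type: {self.block_type}")


@dataclass
class DocumentDefinition:
    layout: str
    blocks: list = field(default_factory=list)

    def __post_init__(self):
        if self.layout.lower() not in LAYOUTS:
            raise ValueError(f"unknown layout: {self.layout}")

    def add_block(self, title, block_type, **params):
        """Create a block with the next sequential id and append it."""
        block = ContentBlock(len(self.blocks) + 1, title, block_type, params)
        self.blocks.append(block)
        return block


# Hypothetical usage: one blog block and one local text block.
wwdd = DocumentDefinition("Column")
wwdd.add_block("My Blog", "Blog Post", BlogURL="http://www.gizmodo.com/")
wwdd.add_block("notes.txt", "Local text", path="notes.txt", start_line=1, line_count=10)
```

Keeping the parameters in a plain dictionary mirrors the requirement that parameters depend on the block type, at the cost of not checking per-type parameter names.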
DESIGN

Core Features of .NET

Though the application can be developed in either Java or .NET, we preferred .NET over Java due to the following advantages of .NET:

1. Comprehensive interoperability with existing code
2. Integration among .NET programming languages
3. A common run time engine shared by all .NET-aware languages
4. A comprehensive base class library
5. No more COM (Component Object Model) plumbing
6. A truly simplified deployment model

Though there are many options in .NET such as ASP and VB, we chose VB because our project is a desktop-based web application. ASP .NET is mostly preferred for pure web applications. As our project involves both desktop and web functionality, VB serves better.

Challenges Faced

During the initial stages of the design of the project, we faced many challenges (hurdles) in designing the application. They are:

- Do we need to select a few predefined websites?
- Do we need to allow the user to select only from a set of predefined sections corresponding to each of the predefined websites?
- The general content is also too variable to solve. We observed the sites of BBC and CNN and found it very difficult to find the required content from the mere mention of a heading and a URL, unless we predefine them separately beforehand.
- We have to provide only well-defined web content at that point: for example, a web site where only one heading block appears and it is contained in a specific table cell. When the user selects web content, the system offers a list of only known and parsable content.
- The general content problem is too variable to solve. This is a regular expression solution where the target is not always regular.
- At the end of the Editor design, we had a doubt regarding the Document Viewer (Content Retriever). How exactly do we identify web content? How exactly do we uniquely identify a content block in its source page?
- For instance, say we have www.example.com (some site) and we need to fetch a content block with "Heading" as its heading. The problem is that there might be several blocks in the source page with the same "Heading". So,
  1. How exactly do we identify the correct content block?
  2. Even if we identify it, how exactly do we determine its boundaries? That is, what belongs to the content block and what doesn't?

To solve the above problems, we thought of including the HTML tag name and id, or even the absolute position of the content block rectangle in the web page. But this created more problems:

1. How many users will know the actual coding details of the content block?
2. There can be many ways to uniquely identify the content block.
3. Id, name and absolute position are only one such way.

So, trying to provide such options is not feasible. But without knowing these details we cannot fetch the required content block in a deterministic manner. So we faced a dilemma between user friendliness and code complexity.

UML Diagrams

UML diagrams are basic modeling diagrams used to determine the architecture of the software product being developed. We are concerned with three major UML diagrams:

1. Use Case
2. Class
3. Sequence

Use Case Diagrams

[Use case diagram, Scenario 1: Web Wise Document System with actors User and XML File; use cases Local and Remote Text, Web Page Content, Web Services, Blog Post, Twitter]

Scenario 1

The above scenario shows the interaction of the system with two actors:

i) User
ii) XML file

The user interacts with the system to select any of the content types mentioned. For each content type selected by the user, the corresponding changes are written to the XML file.

The following scenario shows another type of interaction of the user with the system. Here the user can perform a variety of functions to create/edit/view a WWDT. The system uses the WWDT to retrieve the required content blocks dynamically from the Internet.
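
The interaction in Scenario 1, where each content type the user selects is written into the XML file, can be sketched as follows. This is a minimal Python sketch using the standard `xml.etree.ElementTree` module, not the project's VB .NET code. The function names and the `FeedURL` element are hypothetical; the `<wwd>`, `<layout>`, `<content>`, `<params>`, `<title>`, `<type>` and `<BlogURL>` tags follow the template format described in this report.

```python
import xml.etree.ElementTree as ET


def new_template(layout):
    """Create an empty template with the given layout name."""
    root = ET.Element("wwd")
    ET.SubElement(root, "layout").text = layout
    return root


def add_content_block(root, title, block_type, params):
    """Append one <content> element holding params, title and type."""
    content = ET.SubElement(root, "content")
    params_el = ET.SubElement(content, "params")
    for name, value in params.items():
        ET.SubElement(params_el, name).text = value
    ET.SubElement(content, "title").text = title
    ET.SubElement(content, "type").text = block_type
    return content


def remove_content_block(root, title):
    """Delete every content block with the given title; return the count removed."""
    doomed = [c for c in root.findall("content") if c.findtext("title") == title]
    for content in doomed:
        root.remove(content)
    return len(doomed)


# The user adds two blocks, then deletes the second one again.
template = new_template("Column")
add_content_block(template, "My Blog", "Blog Post",
                  {"BlogURL": "http://www.gizmodo.com/"})
add_content_block(template, "Scores", "RSS Feed",
                  {"FeedURL": "http://www.example.com/rss"})  # hypothetical feed
remove_content_block(template, "Scores")
print(ET.tostring(template, encoding="unicode"))
```

Editing a block can be treated as a removal followed by an addition, which keeps the sketch short but does not preserve the block's position in the document.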
[Use case diagram, Scenario 2: Web Wise Document System with actors User and System; use cases Create a WWDT, Edit/Modify an existing WWDT, View a WWD, Retrieve contents from the Internet]

Class Diagram

The above class diagram shows the interaction of a total of 9 classes involved in WWDS. The classes are categorized into two packages based on their functionality:

i) Windows Application1 - Creator Module
ii) Windows Application2 - Retriever/Viewer Module

Windows Application1

This package comprises 8 classes that together implement the requirements/features of the creator/editor module. The 8 classes involved are:

i) Form1 - This class is responsible for layout selection and for content block creation, modification and deletion in the WWDT.
ii) Form2 - This class contains functions for defining/modifying the title and type of the selected content block in the WWDT.
iii) Form3 - This class contains functions that define, store and edit the properties of Local/Remote Text. That is, the information required to retrieve a Local/Remote Text.
iv) Form4 - This class contains functions that define, store and edit the properties of Web Page Content. That is, the information required to retrieve a portion of a web page.
v) Form5 - This class contains functions that define, store and edit the properties of a Web Service. That is, the information required to select a web service and provide a GUI to it.
vi) Form6 - This class contains functions that define, store and edit the properties of an RSS Feed. That is, the information required to retrieve information from an RSS Feed.
vii) Form7 - This class contains functions that define, store and edit the properties of a Twitter Post. That is, the information required to retrieve a recent Twitter post.
viii) Form8 - This class contains functions that define, store and edit the properties of a Blog Post. That is, the information required to retrieve a recent post from a blog.

Windows Application2

This package comprises a single class that implements the functions of a viewer/retriever.
i) Form1 - This class contains functions that help the system to read and interpret the WWDT and retrieve the contents of the WWD accordingly. This class also contains functions that organize the contents into the user-defined layout.

Sequence Diagram

The above sequence diagram shows a typical sequence of steps followed by a user to create or edit a WWDT and later view it using the system.

Steps 3 to 14 are asynchronous in the sense that each of them can occur at any time and any number of times, as required by a particular user. For instance, a user may want to access 3 web services but only a single Twitter post. In such a case the other steps are not necessary, and these 4 content blocks can be defined by the user in any order. That is, the order is not important for retrieving information. The order does matter, however, if the user cares about the order in which the content blocks are finally displayed, because the viewer/retriever displays the content blocks in the same order in which the user defined them.

Platform:

We used Microsoft's VB Professional Edition 2008 with the .NET 3.5 Framework, as it was the latest version, is well suited for web applications and has more options than the previous versions regarding styles and functionality.

Steps taken to accept input for a WWD:

1. We designed the editor as a 3-level form.
2. The user fills in the form to create, edit and delete the definitions of the content blocks that appear in the WWD.
3. Data exchange between the forms is done using an XML document for data storage and retrieval.
4. By the end of creation/editing of the WWD, the entire definition is stored in this XML file (it acts as a template for this WWD).

So, input is taken and is stored in an XML file.

Steps taken to retrieve the output for a given input:

1. We used the XML file to retrieve data from the source (web, local text etc.).
2. We used SOAP-like technology to access the web services using inbuilt .NET libraries.
3. To retrieve data from the web, we used HTML and XML parsing, taking the nesting of HTML elements into account to solve the problem of parsing HTML code. Through this we also arrived at a few useful ways to find content embedded in nested HTML.

Web Wise Document Template Format:

XML is the language used to represent the WWDT. It typically represents the following information:

i) information regarding the layout of the document.
ii) information regarding each content block to be retrieved.

WWDT XML Structure:

The <wwd> tag represents the root of the document. It essentially comprises:

i) a <layout> tag
ii) a number of <content> tags

<layout> tag:

The <layout> tag represents the layout of the given document. It encloses the name of the layout used. For example,

    <wwd>
      …
      <layout>Column</layout>
      …
    </wwd>

indicates that the layout of the document is a Column Layout. That is, all the content blocks are shown column by column. Similarly, Row Wise is used to represent the Row Layout.

<content> tag:

A content tag typically contains the following tags:

i) a <title> tag that contains the title of the block
ii) a <type> tag that contains the type of the content to be retrieved
iii) a <params> tag that contains the parameters specific to the content block

For example,

    <wwd>
      …
      <content>
        <params>
          <BlogURL>http://www.gizmodo.com/</BlogURL>
        </params>
        <title>My Blog</title>
        <type>Blog Post</type>
      </content>
      …
    </wwd>

indicates that one of the content blocks to be retrieved is a Blog Post named My Blog, with the URL http://www.gizmodo.com/.

Content Types and Parameters

There are 6 types of content included in the project. Their properties are:

1. Local and remote text: In order to retrieve a portion of the text from a file, we require the following:
   - Absolute path of the text file.
   - Starting line number in the text file.
   - Number of lines to be displayed.

2. Web page content: As the general web content problem is too variable and difficult to solve, we chose three predefined categories. For each category we chose two websites exhibiting good web design standards. In each web site we preselected the portions of the site to be retrieved. The categories and the corresponding websites chosen are:

   Categories:
   - Headlines
   - Weather
   - Sports

   Websites (some of them):
   - URL of BBC News
   - URL of NDTV news etc.

   We used the class and id attributes of the HTML tags to uniquely identify well-structured blocks in a web page.

3. Blog post: In order to retrieve a recent blog article, we need the following information about the blog:
   - URL of the blog

   Using this information, we retrieved the RSS Feeds corresponding to the various articles in the particular blog. The first feed item obtained corresponds to the most recent article posted in the blog.

4. Twitter post: In order to retrieve the status of a Twitter user, we require the following information:
   - The Twitter ID of the user whose status has to be displayed.

   Here too we used the Twitter ID to obtain the RSS Feeds corresponding to the status updates of that user.

5. Web Services: A large number of services are available on the Internet today. We chose a few important and frequently used web services for demonstration and provided easily accessible GUIs to them. Some of them are:
   - Stock Quote (gives the current stock details for a company)
   - Currency converter (converts money from one currency to another)
   - Global Weather (gives the current weather information for a given location)
   - Send SMS World (sends free SMS to any cell phone in India)

6. RSS Feeds: The typical information required to get data from an RSS Feed is:
   - URL of the RSS Feed

In the case of web services, the URL supplied is that of the WSDL (Web Services Description Language) document of the specific service.
This WSDL contains information regarding the various functions defined, the parameters to be passed and the output expected for each function. Using the inbuilt functionality for recognizing the functions provided by a web service, given its WSDL, we implemented GUIs for the functions that we felt were most necessary.

HTML & XML Parsing

There is no universal parser that can parse any HTML page and retrieve the required information from it. Although a few HTML parsers such as MSHTML are available, they provide only a partial solution. This is because HTML rules are not strictly enforced, so page authors use a variety of non-standard methods while designing a web page. Often these methods involve tags that are highly unstructured and syntactically incomplete. Part of this non-standard nature of web sites can be attributed to modern browsers, which accept and render pages containing a number of syntactical errors without any complaint. Thus, reliable extraction of content from arbitrary HTML is a problem that can be solved only through standardization. Hence, our HTML parsing is done using our own page-specific parsing routines with the help of the MSHTML parser. This type of page-specific parsing is very limited in its approach and is highly susceptible to errors the moment the web site designers decide to change the structure of the page.

In contrast, the standard structure of an XML document makes it easy to write an XML parser. There are a number of XML parsers available over the Internet, and one could also write one's own given enough time. Typically there are two types of XML parsers:

i) a SAX parser
ii) a DOM parser

We chose to use a DOM parser because of the ease and efficiency with which one can create, edit, access and delete any node and its corresponding information. We used the Microsoft XML DOM parser provided by the .NET package as it suits our purpose very well.

Microsoft's Visual Studio

The interface is simple and easy to use.
A layman can understand the usage of .NET by observing the control toolbox in the menu.

[Figure: Solution Explorer in Visual Studio 2008]

[Figure: Selection of Windows Forms application in Visual Studio 2008]

Sample Code

For instance, a small part of the code used for retrieving the content specified in the XML file is:

    ' Imports required at the top of the form's source file
    Imports System.IO
    Imports System.Xml
    Imports System.Windows.Forms

    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs)
        ' Start of Sub procedure
        TextBox1.Text = e.Url.ToString()

        ' Load the XML file used as input
        Dim xdoc As XmlDocument = New XmlDocument()
        xdoc.Load("C:\Documents and Settings\SantosH\My Documents\Google Talk Received Files\Modified\WindowsApplication1\WindowsApplication1\bin\Debug\sample.xml")

        ' Find the content block whose web page has just finished loading
        Dim node As XmlNode = xdoc.SelectSingleNode("wwd/content/type[@webpage='" + e.Url.ToString() + "']")
        Dim id As Integer = CType(node.Attributes.ItemOf("id").Value, Integer)
        Dim url As String = node.Attributes.ItemOf("webpage").Value

        ' Stop the browser control and detach this handler so it runs once per block
        disp(id).Stop()
        RemoveHandler disp(id).DocumentCompleted, AddressOf Me.WebBrowser1_DocumentCompleted
        ' TextBox1.Text += id.ToString()

        Dim doc As HtmlDocument = disp(id).Document
        Dim ele As HtmlElement = Nothing
        Dim ele1 As HtmlElement = Nothing

        ' Page-specific selectors for the predefined web sites
        If (url = "http://news.bbc.co.uk/sport/") Then
            ele = doc.GetElementById("tickerHolder").NextSibling.NextSibling
        ElseIf (url = "http://www.skysports.com/") Then
            ele = doc.GetElementById("advert_8").Parent.NextSibling.NextSibling
        ElseIf (url = "http://www.ndtv.com/news/index.php") Then
            ele = doc.GetElementById("box315")
            ele1 = doc.GetElementById("divtopstlatest")
        ElseIf (url = "http://www.bbc.co.uk/") Then
            ele = doc.GetElementById("a")
        ElseIf (url = "http://www.espnstar.com/") Then
            ele = doc.GetElementById("portlet_878")
            ' Selectors used for each site:
            '   doc.GetElementById("tickerHolder").NextSibling.NextSibling  for http://news.bbc.co.uk/sport/
            '   doc.GetElementById("advert_8").Parent.NextSibling.NextSibling  for http://www.skysports.com/
            '   doc.GetElementById("box315") and doc.GetElementById("divtopstlatest")  for http://www.ndtv.com/news/index.php
            '   doc.GetElementById("a")  for http://www.bbc.co.uk/
            '   doc.GetElementById("portlet_878")  for http://www.espnstar.com/
        End If

        ' Nothing to show if the URL is not one of the predefined sites
        If ele Is Nothing Then Return

        TextBox1.Text += ele.InnerHtml

        ' Write the extracted block to a local HTML file; the <base> tag keeps relative links working
        Dim filew As StreamWriter = New StreamWriter(".\" + id.ToString() + ".html")
        If (url = "http://www.ndtv.com/news/index.php") Then
            filew.Write("<html>" + "<head><base href='" + e.Url.ToString() + "' /></head>" + "<body>" + ele.InnerHtml + "<br />" + ele1.InnerHtml + "</body></html>")
        Else
            filew.Write("<html>" + "<head><base href='" + e.Url.ToString() + "' /></head>" + "<body>" + ele.InnerHtml + "</body></html>")
        End If
        filew.Close()

        ' Show the saved file in the block's browser control
        ' (a "/" is added after "Debug" so the path points at the file written above)
        disp(id).ScriptErrorsSuppressed = True
        disp(id).Navigate("file:///C:/Documents%20and%20Settings/SantosH/My%20Documents/Google%20Talk%20Received%20Files/Modified/WindowsApplication2/WindowsApplication2/bin/Debug/" + id.ToString() + ".html")
        disp(id).Show()
        ' End of Sub procedure
    End Sub

Screen shots

- Basic Editor window form without any input provided
- Editor window form to select the layout using the drop-down menu
- Selecting column layout and clicking the “New” button
- Content Block form appearing on clicking “New”
- Giving a title and specifying the type of content
- On selecting a type, the “Properties” button will be highlighted
- On clicking the “Properties” button in the content block form: selection of file name and lines to be displayed
- Clicking OK in the properties form
- Clicking OK in the content block form creates it
- On selecting a content block, the “edit” and “delete” buttons will be highlighted
- Clicking NEW for a new content block and repeating the same procedure
- Properties window form for web page content
- Selecting any one of the predefined web sites for the category
- After creation of the 2 content blocks
- Clicking NEW for a new content block on web services
- Different types of web services
- After creating 3 content blocks
- Same procedure repeated for Twitter Posts
- Same procedure repeated for RSS Feeds
- Entering the URL for RSS Feeds
- Click OK after adding all the content blocks
- Editing a content block
- Deleting a content block
- The XML file: the inputs are parsed and the system stores them

Output Forms

- O/P for local text, Headlines and RSS feeds
- O/P for weather (web service), Twitter and blog posts

Testing

Instead of the traditional late testing, testing was performed from the initial stages of the software development life cycle. It was performed at different levels at different stages of the development.

In the initial stages of coding, Unit Testing was performed. That is, each form designed was tested for robustness, consistency and scalability. Each bug was corrected then and there, increasing the reliability of the code.

In the later stages of development, Integration Testing was performed to identify the new bugs that creep in when units are integrated. These bugs were identified and rectified. The majority of the paths in the Control Flow Graph were followed to identify bugs, and almost C1+C2 coverage was reached. Each bug identified led to modifications in the code of the individual units involved, and integration testing was performed again to identify new bugs that may have crept in due to the modifications. This testing was performed rigorously and repeatedly to eliminate all major bugs.

A good amount of System Testing was also done to identify the most frequent bugs, and the corresponding corrections were made. The task of testing is eased to some extent for features like web services, which use their own exception handling mechanisms on the servers where they are implemented. Boundary testing was also performed to check the correctness of the code, and VB.NET's inbuilt Exception Handling is used to catch any unidentified bugs.
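
As an illustration of the boundary testing mentioned above, the checks below exercise the line selection used for local text blocks (a starting line number and a number of lines to display). This is a minimal Python sketch, not the project's VB .NET test code. The function `select_lines`, its 1-based numbering and its decision to truncate at the end of the file are assumptions made for the example.

```python
def select_lines(lines, start, count):
    """Return `count` lines beginning at 1-based line number `start`.

    Requests running past the end of the file are truncated rather than rejected.
    """
    if start < 1:
        raise ValueError("start must be at least 1")
    if count < 0:
        raise ValueError("count must not be negative")
    return lines[start - 1:start - 1 + count]


def run_boundary_checks():
    text = ["line 1", "line 2", "line 3"]
    assert select_lines(text, 1, 1) == ["line 1"]   # first line
    assert select_lines(text, 3, 1) == ["line 3"]   # last line
    assert select_lines(text, 3, 5) == ["line 3"]   # count runs past the end
    assert select_lines(text, 4, 1) == []           # start just past the end
    assert select_lines(text, 1, 0) == []           # zero lines requested
    for bad_start in (0, -1):                       # just below the valid range
        try:
            select_lines(text, bad_start, 1)
        except ValueError:
            continue
        raise AssertionError("expected ValueError")
    return True


run_boundary_checks()
```

Each check sits on or next to an edge of the valid input range, which is where off-by-one errors in line handling tend to show up.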
Future Scope

In future there can be many extensions to this application. Some of them are:

- Improving the performance of the system by including a cache-like mechanism.
- Providing options for multiple WWDs to be created, saved and retrieved in the file system.
- Including an automated scheduler to retrieve the updated content regularly.
- Developing a generalized HTML parser to retrieve content sections from the Web.
- Providing access to more Web Services.
- Extending this application to handle dynamically changing web site structures.
- Including all web sites that can be accessed.
- Improving the look and feel and including a variety of visual layouts.

Results

With this approach, the user can access a Web Wise File like any other file on disk, except for the extra delay it takes to update itself. This approach would be faster and easier than manually surfing the Internet.

Conclusions

The concept of Web Wise Files can have a profound effect on cloud computing and other areas where it can be extended to devices other than computers. As mentioned in the introduction, this concept has the capability of restoring the stand-alone feel people had in earlier days when there was no Internet. The results mentioned above are applicable to users of almost all domains.