International Journal of Engineering Trends and Technology (IJETT) – Volume 22 Number 2 - April 2015, ISSN: 2231-5381

PAGE CONTENT RANK: AN APPROACH TO WEB CONTENT MINING

Urvashi (1), Mr. Rajesh Singh (2)
(1) M.Tech Student, Department of CSE, B.S. Anangpuria Institute of Technology and Management, Alampur, India
(2) Assistant Professor, Department of CSE, B.S. Anangpuria Institute of Technology and Management, Alampur, India

Abstract: The World Wide Web is a system of interlinked hypertext documents that are accessed via the Internet. Information on the Web continues to expand in size and complexity, which makes retrieving the required web page efficiently and effectively a challenge. Web structure mining plays an effective role in finding and extracting the relevant information.

I. INTRODUCTION:
In this paper a new algorithm is proposed, based on Page Content Rank (PCR), which exploits both structure and content. In the proposed work, a new approach is introduced to rank the relevant pages based on their content and keywords. Methods of web data mining can be divided into several categories according to the kind of mined information and the goals that the particular categories set: Web structure mining (WSM), Web usage mining (WUM), and Web content mining (WCM). The objective of this paper is to propose a new WCM method of page relevance ranking based on exploration of the page content.

II. WEB MINING:
The general process of web mining is extracting valuable knowledge from the Web, or analyzing data from different perspectives and summarizing it into useful information that can be used to do important tasks. Two different approaches exist: one is process-based and the other is data-based; the data-based definition is more widely accepted today. Web mining is of three types: Web content mining, Web structure mining, and Web usage mining.

(Fig.: Taxonomy of web mining — Web Content Mining works on text and multimedia documents, Web Usage Mining on Web log records, and Web Structure Mining on the hyperlink structure.)

1. Web content mining targets knowledge discovery in which the main objects are the traditional collections of text documents and, more recently, also collections of multimedia documents such as images, videos and audio, which are embedded in or linked to Web pages.

2. Web structure mining focuses on the hyperlink structure of the Web. The different objects are linked in some way, and simply applying traditional processes while assuming that the events are independent can lead to wrong conclusions. Appropriate handling of the links, however, can reveal potential correlations and thus improve the predictive accuracy of the learned models.

3. Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from Web log records to discover the user access patterns of Web pages.

Some terms regarding mining (process flow: Resource Discovery → Information Pre-processing → Generalization → Pattern Analysis):
Resource Discovery, whose task is retrieving web documents, is the process of retrieving the web resources. Information Pre-processing is the transformation process applied to the result of resource discovery. Generalization uncovers general patterns at individual web sites and across multiple sites; in this step, machine learning and traditional data mining techniques are typically used. Pattern Analysis is the validation of the mined patterns.
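The four stages above can be read as a simple pipeline. The following is a minimal Python sketch of that flow; the function names, placeholder bodies and sample URLs are illustrative assumptions, not part of the paper.

    # Minimal sketch of the four-stage web mining process described above.
    def resource_discovery(seed_urls):
        # Retrieve web documents (e.g., by crawling the seed URLs).
        return [{"url": u, "html": "<html>...</html>"} for u in seed_urls]

    def preprocess(documents):
        # Transform the raw retrieval results into an analyzable form.
        return [{"url": d["url"], "text": d["html"].lower()} for d in documents]

    def generalize(records):
        # Uncover general patterns; here just a toy token-frequency count
        # standing in for machine learning / data mining techniques.
        patterns = {}
        for r in records:
            for token in r["text"].split():
                patterns[token] = patterns.get(token, 0) + 1
        return patterns

    def pattern_analysis(patterns, min_support=2):
        # Validate the mined patterns, e.g., with a minimum-support filter.
        return {p: c for p, c in patterns.items() if c >= min_support}

    docs = resource_discovery(["http://example.com/a", "http://example.com/b"])
    print(pattern_analysis(generalize(preprocess(docs))))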
The subtasks of Web Usage Mining are illustrated below.

(Fig.: subtasks of web usage mining — raw logs and site files are pre-processed into user session files; mining algorithms turn these into rules, patterns and statistics, which pattern analysis refines into modified rules, patterns and statistics.)

An access log file in web usage mining contains information about user visits in the Common Log Format. In this format, each user request for a URL corresponds to a record in the access log file, and each record is a tuple containing 7 attributes. Session information is a 2-tuple containing the IP address of a user and the sequential list of web pages visited in that session:

    Si = (IPi, PAGESi),  PAGESi = {(URLi)1, (URLi)2, ..., (URLi)k}

After applying data preprocessing and data reduction, session information is obtained from the web log data (a minimal sketch of this session construction is given after the algorithm comparison below). Frequent patterns can then be searched for with several well-known algorithms:

1. Breadth-First Search (BFS): In breadth-first search, the lattice of equivalence classes is generated by recursive application, exploring the whole lattice in a bottom-up manner. All child patterns of length n are generated before moving on to parent patterns of length n+1. Only the id-lists of the current patterns of length n need to be kept in memory.

2. Depth-First Search (DFS): In depth-first search, all patterns reachable along a single path extending a child pattern are explored before the algorithm returns to the remaining shorter patterns of the corresponding class.

3. GSP Algorithm: GSP makes multiple passes over the session set. Given the set of frequent patterns of length n-1, the candidate set for the next generation is generated from the input set according to the support thresholds.

4. SPADE Algorithm: In the SPADE algorithm, session id–timestamp lists of atoms are created first. These lists are then sorted with respect to the support of each atom, and any atom whose support falls below the input threshold is eliminated. Next, frequent patterns are generated from single atoms with the union operation ∨, following a prefix-based approach. Finally, all frequent patterns of length n > 2 are discovered independently within their length-1 prefix classes.

In our experiments GSP gave the worst results, because it does not use the pattern lattice structure and has to perform a session scan at each step. DFS is better than BFS because it eliminates infrequent patterns at each level and keeps fewer patterns in memory. SPADE is the best of the three, because it works on prefix-based equivalence classes, which form a much smaller search space.
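As a concrete illustration of the session structure Si = (IPi, PAGESi) described above, here is a minimal Python sketch that groups Common Log Format records into per-IP page sequences. The regular expression, helper name and sample lines are illustrative assumptions; real preprocessing would also split sessions on time-outs and filter non-page requests.

    import re

    # Common Log Format: host ident authuser [date] "request" status bytes
    CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)')

    def sessions_from_log(lines):
        # Build Si = (IPi, PAGESi): a sequential list of visited URLs per IP.
        sessions = {}
        for line in lines:
            m = CLF.match(line)
            if not m:
                continue  # skip malformed records
            ip, _ts, method, url, status, _size = m.groups()
            if method == "GET" and status.startswith("2"):
                sessions.setdefault(ip, []).append(url)
        return list(sessions.items())  # [(IPi, [URL1, URL2, ...]), ...]

    log = [
        '10.0.0.1 - - [10/Apr/2015:10:00:01 +0000] "GET /index.html HTTP/1.0" 200 2326',
        '10.0.0.1 - - [10/Apr/2015:10:00:09 +0000] "GET /products.html HTTP/1.0" 200 512',
        '10.0.0.2 - - [10/Apr/2015:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 2326',
    ]
    print(sessions_from_log(log))
    # [('10.0.0.1', ['/index.html', '/products.html']), ('10.0.0.2', ['/index.html'])]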
Three main page ranking and document clustering techniques are as follows:

1. PageRank Algorithm: PageRank was developed at Stanford University by Larry Page (co-founder of the Google search engine) and Sergey Brin. Google uses this algorithm to order its search results in such a way that important documents move up in the results of a search while less important pages move down the list. The algorithm states that if a page has some important incoming links, then its outgoing links to other pages also become important; it therefore takes backlinks into account and propagates the ranking through links. When a query is given, Google combines precomputed PageRank scores with text-matching scores to obtain an overall ranking for each web page returned in response to the query.

A simplified formula of PageRank is defined as

    PR(u) = c · Σ_{v ∈ B(u)} PR(v) / N(v)                (1)

or

    PR(u) = (1 - d) + d · Σ_{v ∈ B(u)} PR(v) / N(v)      (2)

where d is a damping factor, (1 - d) is the page rank distribution from non-directly linked pages, B(u) is the set of pages that link to u, and N(v) is the number of outgoing links of page v.

(Fig.: example hyperlink graph of three pages — A links to B; B links to A and C; C links to A and B.)

The PageRanks for pages A, B and C can be calculated using (2) as shown below:

    PR(A) = (1 - d) + d(PR(B)/2 + PR(C)/2)
    PR(B) = (1 - d) + d(PR(A)/1 + PR(C)/2)
    PR(C) = (1 - d) + d(PR(B)/2)

Solving the above equations with d = 0.5 (say), the page ranks of pages A, B and C become:

    PR(A) = 1.0, PR(B) = 1.2, PR(C) = 0.8

2. Iterative Method of PageRank: For a small set of pages it is easy to solve the equation system directly and determine the page rank solution by inspection. In the iterative calculation, each page is assigned a starting page rank value of 1, as shown in the table below, and repeated iterations make the page ranks converge (a runnable sketch of this iteration is given at the end of this subsection).

    Iteration   PR(A)   PR(B)   PR(C)
    0           1       1       1
    1           1.00    1.25    0.81
    2           1.02    1.21    0.80
    3           1.00    1.20    0.80
    4           1.00    1.20    0.80
    ...         ...     ...     ...

3. Weighted PageRank Algorithm: WPR assumes that the more popular web pages are, the more linkages other web pages tend to have to them, or the more they are linked to. This algorithm assigns larger rank values to more important pages instead of dividing the rank value of a page evenly among its outgoing linked pages. Each outlink page gets a value proportional to its popularity or importance, and this popularity is measured by its numbers of incoming and outgoing links. The popularity is assigned in terms of weight values on the incoming and outgoing links, denoted Win(v,u) and Wout(v,u) respectively.

Win(v,u) is the weight of link (v,u), calculated from the number of incoming links of page u and the number of incoming links of all reference (outgoing linked) pages of page v:

    Win(v,u) = Iu / Σ_{p ∈ R(v)} Ip

where Iu and Ip represent the number of inlinks of page u and page p respectively, and R(v) denotes the reference page list of page v.

Wout(v,u) is the weight of link (v,u), calculated from the number of outlinks of page u and the number of outlinks of all reference pages of page v:

    Wout(v,u) = Ou / Σ_{p ∈ R(v)} Op

where Ou and Op represent the number of outlinks of page u and page p respectively.

The original PageRank formula is modified as:

    WPR(u) = (1 - d) + d · Σ_{v ∈ B(u)} WPR(v) · Win(v,u) · Wout(v,u)
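The following is a minimal Python sketch of the simplified PageRank iteration of equation (2) on the three-page example above; the dictionary-based graph encoding and function name are illustrative assumptions, not from the paper. With d = 0.5 it converges to the values derived above, PR(A) = 1.0, PR(B) = 1.2, PR(C) = 0.8.

    # Iterative PageRank for the example graph: A -> B; B -> A, C; C -> A, B.
    outlinks = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}

    def pagerank(outlinks, d=0.5, iterations=50):
        ranks = {page: 1.0 for page in outlinks}   # every page starts at 1
        for _ in range(iterations):
            new_ranks = {}
            for u in outlinks:
                # Sum PR(v)/N(v) over the backlink set B(u).
                backlink_sum = sum(ranks[v] / len(outlinks[v])
                                   for v in outlinks if u in outlinks[v])
                new_ranks[u] = (1 - d) + d * backlink_sum
            ranks = new_ranks
        return ranks

    print(pagerank(outlinks))   # ~{'A': 1.0, 'B': 1.2, 'C': 0.8}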
Clustering: Clustering divides a set of objects into groups such that the objects in the same group are similar to each other. In the context of web document clustering, the objects are documents, grouped together on some measure such as similarity of content or of hyperlink structure. Most search engines return a large and unmanageable list of documents containing the query keywords; finding the required documents in such a list is usually difficult, often impossible. As a solution, a search engine can group the set of returned documents with the aim of producing semantically meaningful clusters rather than a flat list of ranked documents. Web clustering may be based on content alone, on both content and links, or on links alone.

Proposed Architecture for CLUSTERING AND RANKING:

(Fig.: proposed architecture — the User submits a query to the WWW through the Query Interface and Query Processor; a Web Crawler feeds the Indexer, which builds the Index; a Rank Calculator, Cluster Generator and Similarity Calculator then produce the ranked and clustered results.)

Rank improvement: This module takes as input a user query from the query processor together with the matched documents, and applies an improvement to the rank score of the returned pages. The module operates online at query time and applies the improvement to the current documents.

Step 1: Given an input user query q and the matched documents D collected from the query processor, the cluster Ck to which the query q belongs is found.
Step 2: The sequential pattern of the concerned cluster is retrieved from the local repository maintained by the sequential pattern generator.
Step 3: The level weight is calculated for every page X present in the sequential pattern.
Step 4: The rank is calculated for every page X present in the sequential pattern. The improved rank is calculated as the sum of the previous rank and the assigned weight value.

Algorithm: Rank_improve(Q, n)
Given: a set of n queries and their corresponding clicked URLs, stored in an array Q[qi, URL1, ..., URLm], 1 ≤ i ≤ n
Output: a set C = {C1, C2, ..., Ck} of k query clusters

    k = 0;                                    // start of algorithm
    for (each query P in Q)
        set Clusterid(P) = NULL;
    for (each P ∈ Q with Clusterid(P) = NULL) {
        i = n;  page = Q(n);
        Clusterid(P) = ck;
        Weight(X)        = ln(lenpar(X)) · level(X)
        Page_rank(X)     = (1 - d) + d · Σ_{v ∈ B(X)} PR(v) / Nv
        New_Page_rank(X) = Page_rank(X) + Weight(X)
        while ((i > 1) and (Q[i/2] < New_Page_rank(X))) do {
            Q[i] = Q[i/2];                    // sift the entry up, heap style
            i = i/2;
        }
        Q[i] = New_Page_rank(X);
        k = k + 1;
    }
    return true;
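A small Python sketch of the rank-improvement update and the heap-style insertion used in the algorithm above. The weight formula follows the pseudocode (Weight(X) = ln(lenpar(X)) · level(X)); the sample pages, their lenpar/level values and all function names are illustrative assumptions.

    import math

    def improved_rank(page_rank, lenpar, level):
        # Step 4: improved rank = previous rank + assigned level weight.
        return page_rank + math.log(lenpar) * level

    def heap_insert(heap, rank):
        # 1-based max-heap sift-up, mirroring the Q[i/2] parent walk
        # in the pseudocode (heap[0] is an unused sentinel).
        heap.append(rank)
        i = len(heap) - 1
        while i > 1 and heap[i // 2] < rank:
            heap[i] = heap[i // 2]   # move the smaller parent down
            i //= 2
        heap[i] = rank

    # Illustrative candidates: (page, precomputed PageRank, lenpar, level).
    pages = [("p1", 1.0, 7, 2), ("p2", 1.2, 3, 1), ("p3", 0.8, 20, 3)]
    heap = [None]                    # sentinel for 1-based indexing
    for name, pr, lenpar, level in pages:
        heap_insert(heap, improved_rank(pr, lenpar, level))
    print(heap[1])                   # the highest improved rank sits at the root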
III. CONCLUSION:
This paper describes Page Content Rank, its algorithms, and experience with their use in Web mining. It was found on a number of examples that the method behaves better than the popular PageRank algorithm. We would therefore like to state the hypothesis that PCR identifies pages which are more significant with respect to their content and which explain a given topic better than the PageRank algorithm. However, more experiments have to be performed as future work in order to validate this hypothesis.

IV. FUTURE WORK:
There are several possibilities for the future development of PCR. Certainly, the method should be tested on data samples of more representative sizes. A weak point of the PCR implementation is the time complexity of obtaining the starting set of pages Rq,n. Another possible improvement of PCR is continuous adaptation of the system to user reactions, so that WPCR can mature into a standardized technique for content-based page ranking.