International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 4 (2016) pp 2552-2556
© Research India Publications. http://www.ripublication.com
Analytical Implementation of Web Structure Mining Using Data Analysis in
Educational Domain
Dr. S. P. Victor
Professor CS, St. Xaviers College, Tirunelveli, Tamil Nadu, India.
Mr. M. Xavier Rex
Research Scholar, M. S. University, Tirunelveli, Tamil Nadu, India.
Abstract
Optimal web data mining analysis of web page structure is a key
factor in the educational domain, where it provides a systematic way
to apply novel techniques to real-time data at different levels of
implication. Our experimental setup first retrieves the web structure,
treating web pages as nodes and hyperlinks as edges, in order to
identify a web page as a popular page or as a similar page. This paper
presents a detailed study of web structure retrieval schemas and the
varying effect of periodic web pages in the educational domain, carried
out with the expected optimal output strategies. We implement our
experimental web structure restoration techniques with a real-time
object representation for educational domains, such as a college web
page required for an open data analysis system. We also apply
algorithmic procedures to implement the proposed research technique in
several sample domains with the maximum level of improvement. In the
near future we will implement web usage techniques for efficient data
analysis.
Keywords: Web Mining, Hyperlink, Web Structure Mining,
Pattern, Classification
Introduction
When comparing web mining with traditional data mining,
there are three main differences to consider [1]:
Scale
In traditional data mining, processing 1 million records from a
database would be a large job. In web mining, even 10 million
pages wouldn’t be a big number.
Access
When doing data mining of corporate information, the data is
private and often requires access rights to read. For web
mining, the data is public and rarely requires access rights.
But web mining has additional constraints, due to the implicit
agreement with webmasters regarding automated (non-user)
access to this data. This implicit agreement is that a
webmaster allows crawlers access to useful data on the
website, and in return the crawler (a) promises not to overload
the site, and (b) has the potential to drive more traffic to the
website once the search index is published. With web mining,
there often is no such index, which means the crawler has to
be extra careful and polite during the crawling process, to avoid
causing any problems for the webmaster.
Structure
A traditional data mining task gets information from a
database, which provides some level of explicit structure. A
typical web mining task is processing unstructured or
semi-structured data from web pages. Even when the underlying
information for web pages comes from a database, this often
is obscured by HTML markup.
A strategic analysis department can mine its client archives with
data mining software to determine which offers to send to which
clients for maximum conversion rates. For example, suppose a company
is thinking about launching cotton shirts as a new product [2]. From
its client database, it can determine how many clients placed orders
for cotton shirts over the last year and how much revenue those orders
brought to the company. With this analysis in hand, the company can
decide which offers to send both to the clients who ordered cotton
shirts and to those who did not. This ensures that the organization
heads in the right direction in its marketing and does not go through
a trial-and-error phase, learning hard facts by spending money
needlessly [3]. These analytical facts also shed light on what
percentage of customers may move from the company to a competitor.
Data mining also enables companies to keep a record of fraudulent
payments, which can then be researched and studied [4]. This
information can help develop more advanced protective methods to
prevent such events from happening. Buying trends revealed through web
data mining can also help in forecasting inventories [5]. This is a
direct analysis, which enables the organization to fill its stocks
appropriately for each month, depending on the predictions derived
from the analysis of buying trends [6].
Data mining technology is going through a huge evolution, and new and
better techniques for gathering whatever information is required are
made available all the time. Web data mining technology is opening
avenues for gathering data, but it is also raising many concerns
related to data security. A great deal of personal information is
available on the internet, and web data mining has helped to keep the
need to secure that information at the forefront [7].
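The symbols in the Page Rank equation (1) are:
● PR: Page Rank
● p_i: page i
● d: damping factor
● N: number of pages
● L(p_j): number of out-links of page p_j
● M(p_i): set of pages with in-links to page p_i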
Proposed Methodology
The proposed methodology is based on the structure of a typical
web graph, which consists of web pages as nodes and hyperlinks as
edges connecting related pages. Web structure mining is the
process of discovering structure information from the web.
It can be further divided into two kinds based on the kind
of structure information used: hyperlinks and document structure.
The implementation of web structure mining basically follows
this procedure:
1. Extracting the Page Rank, manually or automatically.
2. Extracting the hyperlinks in a web page.
3. Internet domain classification.
4. Major domain influence computation.
5. Identifying the URL characteristics.
Hyperlinks:
A hyperlink is a structural unit that connects a location in a
web page to a different location, either within the same web
page or on a different web page. A hyperlink that connects to
a different part of the same page is called an intra-document
hyperlink, and a hyperlink that connects two different pages is
called an inter-document hyperlink. There has been a
significant body of work on hyperlink analysis; [8] provides an
up-to-date survey.
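The distinction between intra-document and inter-document hyperlinks can be illustrated with a minimal Java sketch. The class name HyperlinkKind and its classify method are hypothetical names introduced here for illustration, and the sketch assumes that two locations belong to the same page when their URLs differ only in the fragment part.

```java
import java.net.URI;

/** Classifies a hyperlink relative to the page that contains it (illustrative sketch). */
final class HyperlinkKind {

    /**
     * Returns "intra-document" when href resolves to the same page as pageUrl
     * (only the fragment differs), otherwise "inter-document".
     */
    static String classify(String pageUrl, String href) {
        URI page = URI.create(pageUrl);
        // Resolve relative references such as "#top" or "contactus.aspx".
        URI target = page.resolve(href);
        return stripFragment(page).equals(stripFragment(target))
                ? "intra-document"
                : "inter-document";
    }

    /** Drops the "#fragment" part so that two locations on one page compare equal. */
    private static String stripFragment(URI uri) {
        String s = uri.toString();
        int hash = s.indexOf('#');
        return hash >= 0 ? s.substring(0, hash) : s;
    }

    public static void main(String[] args) {
        String page = "http://www.msuniv.ac.in/Default.aspx";
        System.out.println(classify(page, "#carousel-generic"));      // same page
        System.out.println(classify(page, "contactus.aspx"));         // other page, same site
        System.out.println(classify(page, "http://www.ugc.ac.in/"));  // other site
    }
}
```

The comparison is purely textual, so two different URLs that serve the same page would still be reported as inter-document.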
Document Structure:
In addition, the content within a web page can also be
organized in a tree-structured format, based on the various
HTML and XML tags within the page. Mining efforts here
have focused on automatically extracting document object
model (DOM) structures out of documents [9].
The web content extraction can be implemented using the
following Java programs.
1. The pseudo code of the algorithm for calculating the rank of web
pages is presented below.
In the pseudo code, e is the vector with all elements equal to 1,
ε is the accuracy threshold and ‖·‖₁ is the norm of a vector,
calculated by summing up its elements.
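As a rough illustration of such an iterative rank computation, the following Java sketch repeats the Page Rank update until the summed absolute change between two iterations falls below the accuracy threshold. It is not the authors' pseudo code: the class name PageRankSketch, the adjacency-array input and the even redistribution of rank from pages without out-links are all assumptions made for this example.

```java
import java.util.Arrays;

/** Iterative Page Rank sketch; stops when the L1 norm of the change is below epsilon. */
final class PageRankSketch {

    /**
     * outLinks[j] lists the pages that page j links to.
     * d is the damping factor, epsilon the accuracy threshold.
     */
    static double[] compute(int[][] outLinks, double d, double epsilon) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // same starting score for every page

        double delta;
        do {
            double[] next = new double[n];
            double dangling = 0.0;                // rank held by pages without out-links
            for (int j = 0; j < n; j++) {
                if (outLinks[j].length == 0) {
                    dangling += rank[j];
                    continue;
                }
                double share = rank[j] / outLinks[j].length;   // PR(pj) / L(pj)
                for (int i : outLinks[j]) {
                    next[i] += share;
                }
            }
            delta = 0.0;
            for (int i = 0; i < n; i++) {
                // Assumption: dangling rank is spread evenly over all pages.
                next[i] = (1 - d) / n + d * (next[i] + dangling / n);
                delta += Math.abs(next[i] - rank[i]);          // L1 norm of the change
            }
            rank = next;
        } while (delta >= epsilon);
        return rank;
    }

    public static void main(String[] args) {
        // Pages A=0, B=1, C=2, D=3: B -> {A, C}, C -> {A}, D -> {A, B, C}.
        int[][] outLinks = { {}, {0, 2}, {0}, {0, 1, 2} };
        System.out.println(Arrays.toString(compute(outLinks, 0.85, 1e-8)));
    }
}
```

With this handling of pages without out-links the ranks always sum to 1, which makes the result easy to check. The damping factor 0.85 and threshold 1e-8 in main are illustrative values only.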
Figure 1: Proposed methodology for web data mining in
online sales domain
Implementation
Google Page Rank:
Websites link to interesting websites, and in doing so they “vote”
for them. The more websites vote for a website, the more interesting
it is considered. The votes of the recommending websites are also
taken into account. Every website has a starting score, and the
scores are calculated incrementally [10].
If a page has few links, a specific one will be chosen with high
probability.
If a page has many links, a specific one will be chosen with low
probability.
● Many in-links: Authority
● Many out-links: Hub
The Page Rank can be calculated as follows,
$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \qquad (1)$$
2. Extracting the links in a webpage
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class ExtractLinks {
    public static void main(String[] args) {
        try {
            // Page to download (search-result URL as given in the paper).
            String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";
            URL siteURL = new URL(sUrl_yahoo);
            URLConnection siteConn = siteURL.openConnection();
            StringBuffer wPage = new StringBuffer(30 * 1024);
            // Read the page line by line into one buffer; the reader is closed automatically.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(siteConn.getInputStream()))) {
                String nextLine;
                while ((nextLine = in.readLine()) != null) {
                    wPage.append(nextLine);
                }
            }
            String webPage = wPage.toString();
            System.out.println(webPage);
        } catch (Exception e) {
            System.out.println("Error" + e);
        }
    }
}
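The listing above downloads the page text; the hyperlinks still have to be pulled out of it. A minimal sketch of that step, assuming a regular expression over anchor tags is accurate enough for the pages at hand, could look as follows. The class name HrefExtractor and the pattern are illustrative assumptions, not part of the original program, and a dedicated HTML parser would be more robust for malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts href values from anchor tags in an HTML string (regex-based sketch). */
final class HrefExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));   // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.ugc.ac.in/\">UGC</a>"
                + "<A HREF='contactus.aspx'>Contact</A>";
        System.out.println(extract(html));
    }
}
```

Unquoted href values are not matched by this pattern, which is one reason it should be read as a sketch rather than a complete extractor.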
3. Computation for Page Rank Algorithm
The Page Rank algorithm was proposed by Larry Page and Sergey Brin
and is patented by Stanford University. It is the ranking algorithm
used by Google’s search engine to compute the Page Rank of a web
page. We illustrate the algorithm by assuming a small universe of
four web pages: A, B, C and D. Links from a page to itself, and
multiple outbound links from one page to the same target page, are
ignored. Page Rank is initialized to the same value for all web
pages. In the original form of Page Rank, the sum of Page Rank over
all pages equals the total number of pages on the web at that time;
here we instead assume a probability distribution between 0 and 1
over all the web pages. The Page Rank that a given page transfers to
the targets of its outbound links in the next iteration is divided
equally among all the outbound links of that page. Suppose page B
has links to pages C and A, page C has a link to page A, and page D
has links to all three pages. We assume an initial value of 0.25 for
each web page.
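For the four-page universe just described, one transfer step can be checked with a short Java sketch. The class name FourPageIteration and the array encoding of the pages are assumptions for illustration. No damping factor is applied, so this shows only the link-transfer part of equation (1).

```java
import java.util.Arrays;

/** One undamped Page Rank transfer step on the four-page example (illustrative sketch). */
final class FourPageIteration {

    /** Each page splits its current rank equally among its outbound links. */
    static double[] step(int[][] outLinks, double[] rank) {
        double[] next = new double[rank.length];
        for (int j = 0; j < outLinks.length; j++) {
            for (int i : outLinks[j]) {
                next[i] += rank[j] / outLinks[j].length;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Index 0..3 = pages A..D. B -> {A, C}, C -> {A}, D -> {A, B, C}; A has no out-links.
        int[][] outLinks = { {}, {0, 2}, {0}, {0, 1, 2} };
        double[] rank = {0.25, 0.25, 0.25, 0.25};
        // First entry is PR(A) = 0.25/2 + 0.25/1 + 0.25/3, about 0.458.
        System.out.println(Arrays.toString(step(outLinks, rank)));
    }
}
```

Page D receives nothing in this step because no page links to it, and the rank held by A is not passed on because A has no outbound links.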
Extracting the substring of the domain name from a URL using
Java:
public class ExtractLinks {
    public static void main(String[] args) {
        // Host name to classify. A sample value is assumed here because
        // the original listing leaves str undefined.
        String str = "www.researchgate.net";
        String last3;
        // If needed, use a longer required length, e.g. 6 for "gov.in".
        if (str == null || str.length() < 3) {
            last3 = str;
        } else {
            last3 = str.substring(str.length() - 3);
        }
        System.out.println(last3);
    }
}
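The substring idea can be extended to whole URLs and to counting links per domain. The sketch below is one possible way to do this, not the authors' program: the class name DomainCounter, the list of two-label suffixes and the "other" bucket are assumptions introduced for illustration. Hosts given as numeric IP addresses are not treated specially.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Counts hyperlinks per domain suffix (illustrative sketch). */
final class DomainCounter {

    // Assumption: the two-label country suffixes of interest are listed explicitly.
    private static final List<String> TWO_LABEL =
            List.of(".ac.in", ".edu.in", ".gov.in", ".co.in");

    /** Returns the suffix of the host name, e.g. ".ac.in" or ".com"; "other" if there is no host. */
    static String suffixOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "other";
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String s : TWO_LABEL) {
            if (host.endsWith(s)) {
                return s;
            }
        }
        int dot = host.lastIndexOf('.');
        return dot >= 0 ? host.substring(dot) : "other";
    }

    /** Builds a suffix -> count table in first-seen order. */
    static Map<String, Integer> count(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            counts.merge(suffixOf(url), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://www.msuniv.ac.in/Default.aspx",
                "http://www.ugc.ac.in/",
                "http://www.researchgate.net/",
                "http://orcid.org/",
                "http://www.tn.gov.in/");
        System.out.println(count(urls));
    }
}
```

Taking the suffix from the parsed host rather than from the end of the whole URL avoids classifying a link by its file extension, such as ".aspx" or ".pdf".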
To compute the Page Rank of A:
If the only links in the system were from pages B, C and D to
A, each outgoing link would transfer 0.25 to web page A in the
next iteration:
PR(A) = PR(B) + PR(C) + PR(D)
With the outbound links described above, the equation becomes
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
Thus, upon the next iteration, page B would transfer half of its
existing value, or 0.125, to page A, because page B has 2 outbound
links, to page A and to page C, and the other half, or 0.125, to
page C. Page C would transfer all of its existing value, which is
0.25, to the only page it links to, web page A. Since D has three
outbound links, web page D would transfer one third of its existing
value, or 0.083 (0.25/3 = 0.083), to web page A. At the completion
of this iteration, page A will have a Page Rank of 0.458.
PR(A) = 0.25/2 + 0.25/1 + 0.25/3 = 0.458
Results and Discussion
The actual web structure content analysis for a data warehouse,
after retrieval of the web content from the corresponding
trusted web resource, is as follows for the URL of MS
University, Tirunelveli.
The Page Rank obtained from the Google ranking structure is 0.5,
through http://checkpagerank.net/index.php
Figure 2: MS University Domain Page rank
Internet domains classification:
Domain Name   Context
.com          Originally stood for "commercial" to indicate a site used for
              commercial purposes, but it has since become the most
              well-known top-level Internet domain, and is now used for any
              kind of site.
.int          Used by "International" sites, usually NATO sites.
.edu          Used for educational institutions like universities.
.gov          Used for US Government sites.
.mil          Used for US Military sites.
.net          Originally intended for sites related to the Internet itself,
              but now used for a wide variety of sites.
.org          Originally intended for non-commercial "organizations", but
              now used for a wide variety of sites. Was managed by the
              Internet Society for a while.
Executing the Java code for extracting the links connected to the
specified URL of MS University yields the following links.
MS University website hyperlinks extraction:
http://www.msuniv.ac.in/Default.aspx
http://www.msuniv.ac.in/default.aspx
http://www.msuniv.ac.in/contactus.aspx
https://admin.google.com/msuniv.ac.in/
http://www.i-radiolive.com/#/aod/manovani/
http://www.nvsp.in/
http://www.msuniv.ac.in/ourstory.aspx
http://www.msuniv.ac.in/mission-vision.aspx
http://www.msuniv.ac.in/universitymap.aspx
http://www.msuniv.ac.in/university-act.aspx
http://www.ijtra.com/
http://www.researchpublish.com/
http://www.ugc.ac.in/
http://www.inflibnet.ac.in/econ/eresource.php
http://www.aicte-india.org/
http://www.tn.gov.in/
http://www.msuniv.ac.in/statutes.aspx
http://www.msuniv.ac.in/howtoreach.aspx
http://www.msuniv.ac.in/#
http://www.msuniv.ac.in/chancellor.aspx
http://www.msuniv.ac.in/pro-chancellor.aspx
http://www.msuniv.ac.in/vice-chancellor.aspx
http://www.msuniv.ac.in/planing-board.aspx
http://www.msuniv.ac.in/syndicate.aspx
http://www.msuniv.ac.in/senate.aspx
http://www.msuniv.ac.in/scaa.aspx
http://www.msuniv.ac.in/registrar.aspx
http://www.msuniv.ac.in/deans.aspx
http://www.msuniv.ac.in/fo.aspx
http://www.msuniv.ac.in/pio.aspx
http://www.msuniv.ac.in/deputyregistrar.aspx
http://www.msuniv.ac.in/assistantregistrar
http://www.msuniv.ac.in/department.aspx
http://www.msuniv.ac.in/fee-structure.aspx
http://www.msuniv.ac.in/constituent-colleges.aspx
http://www.msuniv.ac.in/affilated-colleges.aspx
http://www.msuniv.ac.in/community-colleges.aspx
http://www.msuniv.ac.in/PG-Extension.aspx
http://www.msuniv.ac.in/Research%20Projects.pdf
http://www.msuniv.ac.in/coe-contact.aspx
http://www.msuniv.ac.in/revisedfee.aspx
http://www.msuniv.ac.in/results.aspx
http://www.msuniv.ac.in/coedownload.aspx
http://www.msuniv.ac.in/Research.aspx
http://www.msuniv.ac.in/Research-contact.aspx
http://www.msuniv.ac.in/Evaluation-Status.aspx
http://www.msuniv.ac.in/Research-Guide.aspx
http://www.msuniv.ac.in/Research-Circulars.aspx
http://www.msuniv.ac.in/Research-Downloads.aspx
http://www.msuniv.ac.in/Research-Centres.aspx
http://www.msuniv.ac.in/ddce-Director.aspx
http://www.msuniv.ac.in/ddce-department.aspx
http://www.msuniv.ac.in/ddce/FeeStructure.pdf
http://www.msuniv.ac.in/application-form.aspx
http://www.msuniv.ac.in/prospectus.aspx
http://www.msuniv.ac.in/ddce-studycentres.aspx
http://www.msuniv.ac.in/MSULibrary/index.html
http://www.msuniv.ac.in/Library-Faculty.aspx
http://www.msuniv.ac.in/iqacnew.aspx
http://www.msuniv.ac.in/nssnew.aspx
http://www.msuniv.ac.in/eqqopp.aspx
http://www.msuniv.ac.in/ywd.aspx
http://www.msuniv.ac.in/Grievances
http://www.msuniv.ac.in/ddceonsite.aspx
http://www.msuniv.ac.in/#carousel-generic
http://14.139.186.252:8080/jspui/
http://www.msuniv.ac.in/downloads.aspx
http://www.msuniv.ac.in/timetable.aspx
http://www.msuniv.ac.in/upgallery.aspx
http://www.msuniv.ac.in/registrar@msuniv.ac.in
http://www.academicroom.com/
http://orcid.org/
https://scholar.google.co.in/schhp?hl=en
http://www.researchgate.net/
http://www.researcherid.com/
http://ieeexplore.ieee.org/Xplore/home.jsp
Classifying the domains by taking the last substring of each link
with the Java code yields the following table for the MS University
website hyperlinks.
Table 1: Domain link analysis, MS University
Domain         Hyperlink count
.com           6
.int           0
.edu/.ac.in    63
.gov           3
.mil           0
.net           1
.org           3
This domain identification shows that the crawled web structure is
that of an education-oriented university site, based on the
predominance of its .ac.in links.
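The weight of the educational domain in the MS University counts can be expressed as a share of all classified links. The following Java sketch shows the arithmetic on the seven counts of Table 1 (6, 0, 63, 3, 0, 1, 3). The class name DomainShare and its share method are hypothetical helpers introduced for this example.

```java
import java.util.Locale;

/** Computes the share of one domain among all classified hyperlinks (illustrative sketch). */
final class DomainShare {

    /** Returns count divided by the sum of allCounts, or 0 when there are no links. */
    static double share(int count, int[] allCounts) {
        int total = 0;
        for (int c : allCounts) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) count / total;
    }

    public static void main(String[] args) {
        // Counts for .com, .int, .edu/.ac.in, .gov, .mil, .net, .org from Table 1.
        int[] msu = {6, 0, 63, 3, 0, 1, 3};
        // 63 of 76 links fall in the .edu/.ac.in row.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", 100 * share(63, msu)));
    }
}
```

A share is easier to compare across sites of different size than the raw link count.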
The actual web structure content analysis for a data warehouse,
after retrieval of the web content from the corresponding
trusted web resource, is as follows for the URL of St. Xaviers
College, Tirunelveli.
The Page Rank obtained from the Google ranking structure is 0.4,
through http://checkpagerank.net/index.php
Figure 3: St. Xaviers College Domain Page rank
Executing the Java code for extracting the links connected to the
specified URL of St. Xaviers College yields the following links.
St. Xaviers College Tirunelveli website hyperlinks extraction:
http://stxavierstn.edu.in/#
http://stxavierstn.edu.in/index.php
http://stxavierstn.edu.in/about_xavier.php
http://stxavierstn.edu.in/about_jesuit.php
http://stxavierstn.edu.in/about_college.php
http://stxavierstn.edu.in/departments.php
http://stxavierstn.edu.in/s_one_courses.php
http://stxavierstn.edu.in/s_two_courses.php
http://stxavierstn.edu.in/abt_iqac.php
http://stxavierstn.edu.in/examinations.php
http://stxavierstn.edu.in/404page.php
http://stxavierstn.edu.in/@
http://www.xaho.com/
http://stxavierstn.edu.in/alumni.php
http://stxavierstn.edu.in/Mathematics.zip
http://stxavierstn.edu.in/001E.pdf
http://stxavierstn.edu.in/ZOO2015.pdf
http://stxavierstn.edu.in/XCOM.pdf
http://stxavierstn.edu.in/PLACEMENTCELL.pdf
http://stxavierstn.edu.in/aqar2014.pdf
http://stxavierstn.edu.in/Shift1.pdf
http://stxavierstn.edu.in/Shift2.pdf
http://117.239.105.123:8080/r2015o/estart.asp
http://stxavierstn.edu.in/SXCCIA/
http://stxavierstn.edu.in/Webmail/
http://stxavierstn.edu.in/RBlog/
http://www.aicte-india.org/
Classifying the domains by taking the last substring of each link
with the Java code yields Table 2 for the St. Xaviers College
website hyperlinks.
Table 2: Domain link analysis, St. Xaviers College Tirunelveli
Domain         Hyperlink count
.com           1
.int           0
.edu/.ac.in    26
.gov           1
.mil           0
.net           0
.org           1
The hyperlink comparison chart is as follows.
Figure 4: Domain hyperlink comparison analysis
Conclusion
Web mining is a powerful technique used to extract information from
the web, including the past behavior of users, and Web Structure
Mining plays an important role in this approach. Various algorithms
are used in Web Structure Mining to rank the relevant pages; these
treat all links equally when distributing the rank score. In this
paper, we approach the research area of Web mining, focusing on the
category of Web structure mining for identifying the specified URL
structure content analysis for its domain goal attainment. In our
sample experiment we identified that the university web portal places
more emphasis on educational links than the individual college
website does. Since this is a huge area, and there is a lot of work
to do, we hope this paper will be a useful starting point for
identifying opportunities for further research.
Our proposed methodology makes this an easy process through the
novel view of periodic web data level storage and retrieval
combinations. By further focusing on their mutual proportion
along with variational effects, we achieved a data analysis
process with 99 % efficiency. In the near future this research will
extend its range towards web usage analysis.
References
[1] Baraglia, R. and Silvestri, F. (2007) "Dynamic
personalization of web sites without user intervention",
Communications of the ACM, 50(2), pp. 63-67.
[2] Cooley, R., Mobasher, B. and Srivastava, J. (1997)
"Web Mining: Information and Pattern Discovery on
the World Wide Web", In Proceedings of the 9th
IEEE International Conference on Tools with
Artificial Intelligence.
[3] Cooley, R., Mobasher, B. and Srivastava, J. (1999) "Data
Preparation for Mining World Wide Web Browsing
Patterns", Journal of Knowledge and Information
Systems, Vol. 1, Issue 1, pp. 5-32.
[4] Costa, R. P. and Seco, N. (2008) "Hyponymy Extraction and
Web Search Behavior Analysis Based On Query
Reformulation", 11th Ibero-American Conference on
Artificial Intelligence, October 2008.
[5] Kohavi, R., Mason, L. and Zheng, Z. (2004)
"Lessons and Challenges from Mining Retail E-commerce
Data", Machine Learning, Vol. 57, pp. 83-113.
[6] Clark, L., Ting, I-H., Kimble, C., Wright, P. and
Kudenko, D. (2006) "Combining ethnographic and
clickstream data to identify user Web browsing
strategies", Journal of Information Research,
Vol. 11, No. 2, January 2006.
[7] Eirinaki, M. and Vazirgiannis, M. (2003) "Web Mining
for Web Personalization", ACM Transactions on
Internet Technology, Vol. 3, No. 1, February 2003.
[8] Mobasher, B., Cooley, R. and Srivastava, J. (2000)
"Automatic Personalization Based on Web Usage
Mining", Communications of the ACM, Vol. 43, No. 8,
pp. 142-151.
[9] Mobasher, B., Dai, H., Luo, T. and Nakagawa, M.
(2001) "Effective Personalization Based on
Association Rule Discovery from Web Usage Data", In
Proceedings of WIDM 2001, Atlanta, GA, USA, pp. 9-15.
[10] Nasraoui, O. and Petenes, C. (2003) "Combining Web Usage
Mining and Fuzzy Inference for Website
Personalization", In Proceedings of WebKDD 2003 - KDD
Workshop on Web Mining as a Premise to Effective
and Intelligent Web Applications, Washington DC,
August 2003, p. 37.