Improved Methodology for Mining Datasets

1T.Sudharani, 2P.Sasikiran
1Assistant Professor, 2M.Tech Scholar
1,2Dept of CSE, Aditya Engineering College, Aditya Nagar, Surampalem, Andhra Pradesh
Abstract: In the preparation of datasets or databases, we
use joins and aggregations of columns. The traditional
methods that transform the retrieved rows in this way are
called horizontal aggregations. These aggregations support
all the SQL aggregate functions, but they do not apply
aggregation methods to file or image data. For this problem
we introduce a method that retrieves columns by searching
for a given pattern through the SQL query; it also reduces
the complexity of nested queries.
I. INTRODUCTION
The database is a central part of real-time applications,
and large amounts of data must be stored in it. The problem
lies in retrieval, because users typically want only
summarized results. Most databases use the Structured Query
Language (SQL) to issue retrieval commands. Although SQL
solved the ad hoc needs of users, the need for data access
by computer programs did not go away. In fact, most
database access still was (and is) programmatic, in the form
of regularly scheduled reports and statistical analyses, data
entry programs such as those used for order entry, and data
manipulation programs, such as those used to reconcile
accounts and generate work orders. The first technique for
sending SQL statements to the DBMS is embedded SQL.
Because SQL does not use variables and control-of-flow
statements, it is often used as a database sublanguage that
can be added to a program written in a conventional
programming language, such as C or COBOL. This is a
central idea of embedded SQL: placing SQL statements in
a program written in a host programming language. Briefly,
the following techniques are used to embed SQL
statements in a host language (a brief sketch follows the
list):
• Embedded SQL statements are processed by a
special SQL pre-compiler. All SQL statements
begin with an introducer and end with a
terminator, both of which flag the SQL statement
for the pre-compiler. The introducer and
terminator vary with the host language. For
example, the introducer is "EXEC SQL" in C and
"&SQL (" in MUMPS, and the terminator is a
semicolon (;) in C and a right parenthesis in
MUMPS.
• Variables from the application program, called
host variables, can be used in embedded SQL
statements wherever constants are allowed. These
can be used on input to tailor an SQL statement to
a particular situation and on output to receive the
results of a query.
• Queries that return a single row of data are
handled with a singleton SELECT statement; this
statement specifies both the query and the host
variables in which to return data.
• Queries that return multiple rows of data are
handled with cursors. A cursor keeps track of the
current row within a result set. The DECLARE
CURSOR statement defines the query, the OPEN
statement begins the query processing, the
FETCH statement retrieves successive rows of
data, and the CLOSE statement ends query
processing.
• While a cursor is open, positioned update and
positioned delete statements can be used to update
or delete the row currently selected by the cursor.
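A minimal sketch of these statements in a C host program, assuming the employee sample table (empId, Phone) shown in Section III; the host variable names are hypothetical:

EXEC SQL BEGIN DECLARE SECTION;    /* host variable declarations */
    char h_empId[10];
    char h_phone[10];
EXEC SQL END DECLARE SECTION;

/* singleton SELECT: the query and the receiving host variable
   appear in one statement */
EXEC SQL SELECT Phone INTO :h_phone
         FROM employee WHERE empId = 'E1';

/* a multi-row query is handled with a cursor */
EXEC SQL DECLARE emp_cur CURSOR FOR
         SELECT empId, Phone FROM employee;
EXEC SQL OPEN emp_cur;
EXEC SQL FETCH emp_cur INTO :h_empId, :h_phone;  /* repeat until no rows remain */
EXEC SQL CLOSE emp_cur;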
Although static SQL works well in many situations, there
is a class of applications in which the data access cannot be
determined in advance. For example, suppose a spreadsheet
allows a user to enter a query, which the spreadsheet then
sends to the DBMS to retrieve data. The contents of this
query obviously cannot be known to the programmer when
the spreadsheet program is written. To solve this problem,
the spreadsheet uses a form of embedded SQL called
dynamic SQL. Unlike static SQL statements, which are
hard-coded in the program, dynamic SQL statements can
be built at run time and placed in a string host variable.
They are then sent to the DBMS for processing. Because
the DBMS must generate an access plan at run time for
dynamic SQL statements, dynamic SQL is generally slower
than static SQL. When a program containing dynamic SQL
statements is compiled, the dynamic SQL statements are
not stripped from the program, as in static SQL. Instead,
they are replaced by a function call that passes the
statement to the DBMS; static SQL statements in the same
program are treated normally.
The simplest way to execute a dynamic SQL statement is
with an EXECUTE IMMEDIATE statement. This
statement passes the SQL statement to the DBMS for
compilation and execution.
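A minimal sketch in the same embedded style; WidgetOrder is the hypothetical table name used in the aggregate examples later in this article, and stmt_buf is a string host variable holding statement text built at run time:

EXEC SQL BEGIN DECLARE SECTION;
    char stmt_buf[200];
EXEC SQL END DECLARE SECTION;

/* build the statement text at run time ... */
strcpy(stmt_buf, "DELETE FROM WidgetOrder WHERE Quantity = 0");
/* ... and hand it to the DBMS for compilation and execution */
EXEC SQL EXECUTE IMMEDIATE :stmt_buf;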
In the creation of datasets, aggregations play an important
role. Aggregate functions perform a calculation on a set of
values and return a single value. Except for COUNT,
aggregate functions ignore null values. Aggregate functions
are frequently used with the GROUP BY clause of the
SELECT statement.
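For instance, against the Widget Order table used in the examples below (assumed to be named WidgetOrder, with a hypothetical OrderMonth column alongside Quantity):

-- COUNT(*) counts every row; SUM totals the Quantity values
SELECT OrderMonth,
       COUNT(*)      AS order_rows,
       SUM(Quantity) AS widgets_sold
FROM WidgetOrder
GROUP BY OrderMonth;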
All aggregate functions are deterministic. This means
aggregate functions return the same value any time that
they are called by using a specific set of input values. For
more information about function determinism, see
Deterministic and Nondeterministic Functions.
The OVER clause may follow all aggregate functions
except GROUPING and GROUPING_ID.
Aggregate functions can be used as expressions only in the
following:
• The select list of a SELECT statement (either a
sub-query or an outer query).
• A HAVING clause (both uses are sketched below).
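Two short sketches against the same assumed WidgetOrder table illustrate both places, plus the OVER clause mentioned above (the OrderId column is hypothetical):

-- an aggregate in the select list and in a HAVING clause
SELECT OrderMonth, SUM(Quantity) AS widgets_sold
FROM WidgetOrder
GROUP BY OrderMonth
HAVING SUM(Quantity) > 100;

-- an aggregate with an OVER clause: one windowed value per row
SELECT OrderId, Quantity,
       SUM(Quantity) OVER (PARTITION BY OrderMonth) AS month_total
FROM WidgetOrder;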
By their very nature, our databases contain a lot of
data. In previous features, we've explored methods of
extracting the specific data we're looking for using the
Structured Query Language (SQL). Those methods worked
great when we were seeking the proverbial needle in the
haystack. We were able to answer obscure questions like
"What are the last names of all customers who have
purchased Siberian wool during the slow months of July
and August?" Oftentimes, we're also interested in
summarizing our data to determine trends or produce top-level reports.
For example, the purchasing manager may not be
interested in a listing of all widget sales, but may simply
want to know the number of widgets sold this month.
Fortunately, SQL provides aggregate functions to assist
with the summarization of large volumes of data. In this
three-segment article, we'll look at functions that allow us
to add and average data, count records meeting specific
criteria and find the largest and smallest values in a table.
All of our queries will use the Widget Order table, whose
columns include Quantity and UnitPrice. Please note that this table is not
normalized and I've combined several data entities into one
table for the purpose of simplifying this scenario. A good
relational design would likely have Products, Orders, and
Customers tables at a minimum.
The SUM function is used within a SELECT statement and,
predictably, returns the summation of a series of values. If
the widget project manager wanted to know the total number
of widgets sold to date, we could use the following query
(the Widget Order table is assumed here to be named
WidgetOrder):

SELECT SUM(Quantity) FROM WidgetOrder;
The AVG (average) function works in a similar manner to
provide the mathematical average of a series of values.
Let's try a slightly more complicated task this time. We'd
like to find out the average dollar amount of all orders
placed on the North American continent. Note that we'll
have to multiply the Quantity column by the UnitPrice
column to compute the dollar amount of each order. Here's
what our query will look like (a Continent column on
WidgetOrder is assumed):

SELECT AVG(Quantity * UnitPrice) AS avg_order_amount
FROM WidgetOrder
WHERE Continent = 'North America';
For fast retrieval of data from the database, vertical
aggregations are applied. They give correct results, but
they take more time to produce. So researchers introduced
horizontal aggregations.
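As a sketch of the difference, using the assumed WidgetOrder table from above with its hypothetical OrderMonth column:

-- vertical aggregation: one output row per OrderMonth group
SELECT OrderMonth, SUM(Quantity) AS widgets_sold
FROM WidgetOrder
GROUP BY OrderMonth;
-- a horizontal aggregation would return the same totals on a
-- single row, with one column per OrderMonth value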
II. RELATED WORK
Generally, data mining (sometimes called data or
knowledge discovery) is the process of analyzing data from
different perspectives and summarizing it into useful
information - information that can be used to increase
revenue, cut costs, or both. Data mining software is one of
a number of analytical tools for analyzing data. It allows
users to analyze data from many different dimensions or
angles, categorize it, and summarize the relationships
identified. Technically, data mining is the process of
finding correlations or patterns among dozens of fields in
large relational databases. The name draws an analogy between
searching for valuable business information in a large database
— for example, finding linked products in gigabytes of
store scanner data — and mining a mountain for a vein of
valuable ore. Both processes require either sifting through
an immense amount of material, or intelligently probing it
to find exactly where the value resides. Given databases of
sufficient size and quality, data mining technology can
generate new business opportunities by providing these
capabilities:
Automated prediction of trends and behaviours. Data
mining automates the process of finding predictive
information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered
directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses
data on past promotional mailings to identify the targets
most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting
bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to
given events.
Automated discovery of previously unknown patterns.
Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to
identify seemingly unrelated products that are often
purchased together. Other pattern discovery problems
include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry
keying errors.
The most commonly used techniques in data mining are:
Artificial neural networks: Non-linear predictive models
that learn through training and resemble biological neural
networks in structure.
Decision trees: Tree-shaped structures that represent sets
of decisions. These decisions generate rules for the
classification of a dataset. Specific decision tree methods
include Classification and Regression Trees (CART) and
Chi Square Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use
processes such as genetic combination, mutation, and
natural selection in a design based on the concepts of
evolution.
Nearest neighbour method: A technique that classifies
each record in a dataset based on a combination of the
classes of the k record(s) most similar to it in a historical
dataset (where k ≥ 1). Sometimes called the k-nearest
neighbour technique.
Rule induction: The extraction of useful if-then rules
from data based on statistical significance.
Horizontal aggregations are functions that return
aggregated data in a horizontal layout, transforming the
vertical tables produced by standard SQL aggregations. They
are likewise used within the SELECT statement. Three
methods implement these aggregations: the SPJ method, the
CASE method, and the PIVOT method.
SPJ Method: It creates a vertical aggregation for each
result column and then joins all the resultant tables. SPJ
stands for select-project-join; projection means retrieving
the required columns.
CASE Method: The CASE statement returns values based on
Boolean expressions, so each result column is computed from
a conjunction of comparison values.
PIVOT Method: The built-in PIVOT operator returns the
transposed table directly, together with the grouping
condition. Sketches of the CASE and PIVOT methods follow.
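These are hedged illustrations against the assumed WidgetOrder table; the CustomerId column and the month values 'Jan' and 'Feb' are hypothetical:

-- CASE method: one row per customer, one column per month value
SELECT CustomerId,
       SUM(CASE WHEN OrderMonth = 'Jan' THEN Quantity END) AS Jan_qty,
       SUM(CASE WHEN OrderMonth = 'Feb' THEN Quantity END) AS Feb_qty
FROM WidgetOrder
GROUP BY CustomerId;

-- PIVOT method: the operator transposes the table directly
SELECT CustomerId, [Jan] AS Jan_qty, [Feb] AS Feb_qty
FROM (SELECT CustomerId, OrderMonth, Quantity FROM WidgetOrder) AS src
PIVOT (SUM(Quantity) FOR OrderMonth IN ([Jan], [Feb])) AS pvt;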
However, these methods fail when aggregations must be
applied to large files or binary data. We present a
solution for this limitation; it is explained in the
section below.
III. PROPOSED SYSTEM
In our work we present a solution for applying aggregations
to binary data. The proposed method performs logical
operations on the binary data stored in database tables.
The binary bits are stored in a varbinary or BLOB column in
the database. We use functions such as PATINDEX to perform
the pattern search. PATINDEX performs comparisons based on
the collation of its input; to perform a comparison in a
specified collation, COLLATE can be used to apply an
explicit collation to the input.
A collation specifies the bit patterns that represent each
character in a data set. Collations also determine the rules
that sort and compare data. SQL Server supports storing
objects that have different collations in a single database.
For non-Unicode columns, the collation setting specifies
the code page for the data and which characters can be
represented. Data that is moved between non-Unicode
columns must be converted from the source code page to
the destination code page.
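Before the example, a minimal sketch of an assumed layout for a table holding such binary content (the Document table with its DocumentNode and DocSummary columns mirrors the example below; the Content column is hypothetical):

-- assumed layout: summary text plus raw file/image bytes
CREATE TABLE Document (
    DocumentNode varbinary(16) PRIMARY KEY,
    DocSummary   nvarchar(max),
    Content      varbinary(max)   -- file or image data
);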
The following example shows the format of this method; it
finds the position at which the pattern "ensure" starts in
a specific row of the DocSummary column in the Document
table.
USE SampleDoc;
GO
SELECT PATINDEX('%ensure%',DocSummary)
FROM Production.Document
WHERE DocumentNode = 0x7B40;
GO
Wildcard characters can also be used in the pattern search,
as shown below:
The following example uses % and _ wildcards to find the
position at which the pattern 'en', followed by any one
character and 'ure' starts in the specified string (index starts
at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door
is locked');
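This query returns 8, the position at which "ensure" begins in the string (its "s" matches the _ wildcard).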
Sample employee table used in the next query:

empId   emp   Address    Phone
E1      sam   visakha    998877
E2      ram   visaklak   223344
E3      tom   visak      435634
This can also be applied to image data. A SQL query
combining the pattern search with the CASE method over the
employee table is shown below:

SELECT emp, empId,
       PATINDEX('%visak%', Address),
       Phone,
       CASE WHEN emp = Memp THEN emp ELSE NULL END
FROM employee
GROUP BY emp, empId, Address, Phone, Memp;
In the above query, the pattern is matched against the
Address field, the CASE condition is evaluated, and the
result is grouped by the emp column.
An explicit collation can likewise be applied when sorting
or comparing character data:

SELECT name FROM customer ORDER BY name
COLLATE Latin1_General_CS_AI;
IV. CONCLUSION
In our work we introduced a method for performing
aggregation on text and image data. Traditional horizontal
aggregations support all the usual aggregation methods, but
not on text data. Using the proposed method, information
can be searched and retrieved within the text, and it also
has low processing time.
BIOGRAPHIES
Mr. P. Sasikiran is a student of Aditya
Engineering College, Surampalem.
Presently he is pursuing his M.Tech
[Computer Science] from this college, and
he received his B.Tech from Sri Prakash
College of Engineering, affiliated to JNT
University, Hyderabad, in the year 2006.
His areas of interest include Database
Management Systems, Data Mining, and all
current trends and techniques in Computer Science.
Miss T. Sudha Rani received her M.Tech
(CSE) from JNT University, Ananthapuram,
and is working as Sr. Assistant Professor
in Aditya Engineering College. She has
over 7 years of experience in various
engineering colleges. To her credit there
are 7 publications in national and
international conferences/journals. Her
areas of interest include Data Warehouse
and Data Mining, information security,
flavors of Unix operating systems,
object-oriented programming languages,
and other advances in computer applications.