International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 1 – Dec 2013

Improved Methodology for Mining Datasets

1T. Sudha Rani, 2P. Sasikiran
1Assistant Professor, 2M.Tech Scholar
1,2Dept. of CSE, Aditya Engineering College, Aditya Nagar, Surampalem, Andhra Pradesh

Abstract: In preparing datasets from a database, we use joins and aggregations over columns. The traditional methods that transform the retrieved rows in this way are called horizontal aggregations. These aggregations support all the standard SQL aggregate functions, but they cannot be applied to file or image data. To address this problem, we introduce a method that retrieves columns by searching for the patterns given in the SQL query. The method also reduces the complexity of nested queries.

I. INTRODUCTION

The database is a central component of real-time applications, and large amounts of data must be stored in it. The difficulty lies in retrieval, because the user wants only summarized results. Most databases use the Structured Query Language (SQL) for retrieval commands. Although SQL solved the ad hoc needs of users, the need for data access by computer programs did not go away. In fact, most database access still was (and is) programmatic, in the form of regularly scheduled reports and statistical analyses, data entry programs such as those used for order entry, and data manipulation programs, such as those used to reconcile accounts and generate work orders.

The first technique for sending SQL statements to the DBMS is embedded SQL. Because SQL does not use variables and control-of-flow statements, it is often used as a database sublanguage that can be added to a program written in a conventional programming language, such as C or COBOL. This is the central idea of embedded SQL: placing SQL statements in a program written in a host programming language. Briefly, the following techniques are used to embed SQL statements in a host language.

Embedded SQL statements are processed by a special SQL pre-compiler. All SQL statements begin with an introducer and end with a terminator, both of which flag the SQL statement for the pre-compiler. The introducer and terminator vary with the host language. For example, the introducer is "EXEC SQL" in C and "&SQL(" in MUMPS, and the terminator is a semicolon (;) in C and a right parenthesis in MUMPS.

Variables from the application program, called host variables, can be used in embedded SQL statements wherever constants are allowed. These can be used on input to tailor an SQL statement to a particular situation and on output to receive the results of a query.

Queries that return a single row of data are handled with a singleton SELECT statement; this statement specifies both the query and the host variables in which to return data. Queries that return multiple rows of data are handled with cursors. A cursor keeps track of the current row within a result set. The DECLARE CURSOR statement defines the query, the OPEN statement begins query processing, the FETCH statement retrieves successive rows of data, and the CLOSE statement ends query processing. While a cursor is open, positioned update and positioned delete statements can be used to update or delete the row currently selected by the cursor.
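To make the cursor technique concrete, the following is a minimal embedded SQL sketch, assuming a hypothetical Employee table and host variables (dept_no, emp_name, emp_sal) declared in the host program; the names are illustrative only:

EXEC SQL DECLARE emp_cursor CURSOR FOR
    SELECT name, salary FROM Employee WHERE dept = :dept_no;
EXEC SQL OPEN emp_cursor;
/* FETCH is repeated in a loop until the DBMS reports that no rows remain */
EXEC SQL FETCH emp_cursor INTO :emp_name, :emp_sal;
EXEC SQL CLOSE emp_cursor;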
Although static SQL works well in many situations, there is a class of applications in which the data access cannot be determined in advance. For example, suppose a spreadsheet allows a user to enter a query, which the spreadsheet then sends to the DBMS to retrieve data. The contents of this query obviously cannot be known to the programmer when the spreadsheet program is written.

To solve this problem, the spreadsheet uses a form of embedded SQL called dynamic SQL. Unlike static SQL statements, which are hard-coded in the program, dynamic SQL statements can be built at run time and placed in a string host variable. They are then sent to the DBMS for processing. Because the DBMS must generate an access plan at run time for dynamic SQL statements, dynamic SQL is generally slower than static SQL. When a program containing dynamic SQL statements is compiled, the dynamic SQL statements are not stripped from the program, as in static SQL. Instead, they are replaced by a function call that passes the statement to the DBMS; static SQL statements in the same program are treated normally. The simplest way to execute a dynamic SQL statement is with an EXECUTE IMMEDIATE statement. This statement passes the SQL statement to the DBMS for compilation and execution.
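A minimal sketch of this technique, assuming the statement text has been assembled at run time into a string host variable named stmt (the variable name and the statement itself are illustrative):

/* stmt might contain, for example: DELETE FROM Orders WHERE status = 'CANCELLED' */
EXEC SQL EXECUTE IMMEDIATE :stmt;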
In the creation of datasets, aggregations play an important role. Aggregate functions perform a calculation on a set of values and return a single value. Except for COUNT, aggregate functions ignore null values. Aggregate functions are frequently used with the GROUP BY clause of the SELECT statement. All aggregate functions are deterministic: they return the same value any time they are called with a specific set of input values. For more information about function determinism, see Deterministic and Nondeterministic Functions. The OVER clause may follow all aggregate functions except GROUPING and GROUPING_ID. Aggregate functions can be used as expressions only in the select list of a SELECT statement (either a sub-query or an outer query) and in a HAVING clause.

By their very nature, our databases contain a lot of data. Earlier we explored methods of extracting specific data using SQL. Those methods work well when we are seeking the proverbial needle in the haystack; we can answer obscure questions like "What are the last names of all customers who have purchased Siberian wool during the slow months of July and August?" Often, however, we are also interested in summarizing our data to determine trends or produce top-level reports. For example, the purchasing manager may not be interested in a listing of all widget sales, but may simply want to know the number of widgets sold this month. Fortunately, SQL provides aggregate functions to assist with the summarization of large volumes of data: functions that add and average data, count records meeting specific criteria, and find the largest and smallest values in a table. Please note that the simple tables used for illustration here are not normalized; several data entities are combined into one table to simplify the scenario. A good relational design would likely have Products, Orders, and Customers tables at a minimum.

The SUM function is used within a SELECT statement and, predictably, returns the summation of a series of values. For example, to compute the total salary paid in each month from a simple EmployeeSalary table:

SELECT SUM(sal)
FROM EmployeeSalary
GROUP BY month;

The AVG (average) function works in a similar manner, providing the mathematical average of a series of values. For example, to find the average salary of a particular employee:

SELECT AVG(sal) AS averageSal
FROM EmployeeSalary
WHERE EmployeeId = 'EMP1';

A slightly more complicated task is to aggregate a computed expression, such as finding the average dollar amount of all orders placed on the North American continent, where the Quantity column must be multiplied by the UnitPrice column to obtain the dollar amount of each order. A sketch of such a query is given below.
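This query might look like the following, a sketch assuming a hypothetical WidgetOrders table with Quantity, UnitPrice, and Continent columns (the table and column names are illustrative):

SELECT AVG(Quantity * UnitPrice) AS avgOrderAmount
FROM WidgetOrders
WHERE Continent = 'North America';

Multiplying the two columns inside AVG computes the dollar amount of each order before the average is taken.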
For fast retrieval of data from the database, aggregations are traditionally applied vertically. Vertical aggregation gives good results but takes more time to retrieve, so researchers introduced horizontal aggregations.

II. RELATED WORK

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The name reflects the similarity between searching for valuable information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, and quickly. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data-entry keying errors.

The most commonly used techniques in data mining are:

Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.

Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).

Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

Nearest neighbour method: a technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique.

Rule induction: the extraction of useful if-then rules from data based on statistical significance.

Horizontal aggregations are functions that return the data in the form of horizontal (pivoted) tables rather than vertical ones. They, too, are used within the SELECT statement. Three evaluation methods are commonly distinguished:

SPJ method: creates a vertical aggregation for each column and then joins all the resultant tables. This is also called the select-project-join method, where the projection retrieves the required information.

CASE method: uses CASE statements, which return values based on Boolean expressions; the aggregation returns a number based on the conjunction of the grouping values. A small sketch of this method appears below.

Pivot method: uses the PIVOT operator, which returns the transposed table according to the GROUP BY condition.
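To make the CASE method concrete, the following is a minimal sketch assuming a hypothetical sales table with emp, month, and amount columns (all names are illustrative); each CASE expression projects one month into its own output column:

SELECT emp,
       SUM(CASE WHEN month = 'Jan' THEN amount ELSE 0 END) AS janTotal,
       SUM(CASE WHEN month = 'Feb' THEN amount ELSE 0 END) AS febTotal,
       SUM(CASE WHEN month = 'Mar' THEN amount ELSE 0 END) AS marTotal
FROM sales
GROUP BY emp;

Each row of the result holds one employee with the monthly totals laid out horizontally, which is the tabular layout that data mining tools expect.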
However, in all of these methods the aggregations cannot be applied to large files or binary data. We present a solution for this limitation; it is explained in the section below.

III. PROPOSED SYSTEM

In our work we present a solution for applying aggregations to binary data. The proposed methods perform logical operations on the binary data stored in database tables. The binary bits are stored in a varbinary or BLOB column in the database. We use functions such as PATINDEX to perform the pattern search. PATINDEX performs comparisons based on the collation of the input; to perform a comparison in a specified collation, you can use COLLATE to apply an explicit collation to the input.

A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects that have different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page. An explicit collation is shown below, applied in an ORDER BY clause:

SELECT name
FROM customer
ORDER BY name COLLATE Latin1_General_CS_AI;

The following example shows the format of the PATINDEX method; it finds the position at which the pattern "ensure" starts in a specific row of the DocSummary column of the Document table:

USE SampleDoc;
GO
SELECT PATINDEX('%ensure%', DocSummary)
FROM Production.Document
WHERE DocumentNode = 0x7B40;
GO

Wildcard characters can also be used in the pattern search. The following example uses the % and _ wildcards to find the position at which the pattern 'en', followed by any one character and then 'ure', starts in the specified string (the index starts at 1):

SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');

Consider the following sample employee table:

empId   emp   Address    Phone
E1      sam   visakha    998877
E2      ram   visaklak   223344
E3      tom   visak      435634

The same approach also applies to image data. The SQL query combining the pattern search with the CASE method over this table is shown below:

SELECT emp, empId,
       MAX(CASE WHEN PATINDEX('%visak%', Address) > 0
                THEN Phone END) AS phone
FROM employee
GROUP BY emp, empId;

In the above query, the pattern is compared against the Address field, the CASE condition is evaluated, and the results are then grouped by the emp column.
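Because PATINDEX operates on character input, one way to sketch the proposed pattern search over a binary column in T-SQL is to cast the binary value to a character type first. This is a minimal sketch, assuming a hypothetical Documents table whose varbinary(max) column Content holds plain text (the table and column names are illustrative):

SELECT DocId,
       PATINDEX('%ensure%', CAST(Content AS varchar(max))) AS pos
FROM Documents
-- keep only the rows where the pattern actually occurs
WHERE PATINDEX('%ensure%', CAST(Content AS varchar(max))) > 0;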
IV. CONCLUSION

In our work we introduced a method for performing aggregation on text and image data. Traditional horizontal aggregations support all the standard aggregation methods, but they cannot be applied to such data. Using the proposed method, we can search for and retrieve information within the text, and the approach also has low processing time.

BIOGRAPHIES

Mr. P. Sasikiran is a student of Aditya Engineering College, Surampalem. Presently he is pursuing his M.Tech (Computer Science) at this college, and he received his B.Tech from Sri Prakash College of Engineering, affiliated to JNT University, Hyderabad, in the year 2006. His areas of interest include Database Management Systems, Data Mining, and all current trends and techniques in Computer Science.

Miss T. Sudha Rani received her M.Tech (CSE) from JNT University, Ananthapuram, and is working as a Senior Assistant Professor at Aditya Engineering College. She has over 7 years of experience in various engineering colleges. To her credit there are 7 publications in national and international conferences and journals. Her areas of interest include Data Warehousing and Data Mining, information security, flavors of Unix operating systems, object-oriented programming languages, and other advances in computer applications.