International Journal of Engineering Trends and Technology (IJETT) - Volume 4, Issue 9, Sep 2013

Misusability Weight Measure Using Ensemble Approaches

Sridevi Sakhamuri (1), V. Ramachandran (2), Dr. Ch. Rupa (3)
1 PG Student (M.Tech CSE), Vasireddy Venkatadri Institute of Technology, Guntur, AP, India
2 Research Scholar, Acharya Nagarjuna University, Nambur, Guntur, AP, India
3 Professor & H.O.D, Vasireddy Venkatadri Institute of Technology, Guntur, AP, India

ABSTRACT

Assigning a misusability weight to a given dataset is strongly related to the way the data is presented (e.g., tabular data, structured or free text) and is domain specific. Therefore, no single misusability weight measure can fit all types of data in every domain, but such a measure gives a fair idea of how to handle sensitive data. Previous approaches, such as M-score models that consider the number of entities, the anonymity level, the number of properties, and the values of properties to estimate the misusability of a data record, are reasonably effective at detecting record sensitivity. The quality of the data, the quantity of the data, and the distinguishing attributes are the vital factors that influence the M-score. Combined with record ranking and knowledge models, prior approaches relied on a single domain expert to identify sensitive information. For better performance and accuracy, we propose to exploit the effect of combining knowledge from several experts (i.e., an ensemble of knowledge models). We also plan to extend the computation of the sensitivity level of sensitive attributes so that it is obtained objectively, using machine learning techniques such as an SVM classifier alongside expert scoring models. This approach fits the sensitive parameter values to the customer value based on customer activity, which is far more efficient than face-value specification with human involvement. A practical implementation of the proposed system validates our claim.

INTRODUCTION

Data today represent an important asset for companies and organizations. In any organization the data may be worth millions, so it must be handled with care: access to the data must be carefully controlled with respect to both internal and external users, within and outside the organization. Security issues are becoming increasingly crucial, especially with regard to protecting data from fabrication, modification, and interruption. Ensuring the security and privacy of data assets is a crucial and very difficult problem in our modern networked world, and privacy preservation has become a major issue in many data mining applications. Overall, data security plays a central role in the larger context of information systems security. The development of Database Management Systems (DBMS) with high-assurance security is a central research issue; it requires a revision of the architectures and techniques adopted by traditional DBMSs. Relational database management systems (RDBMSs) are the fundamental means of data organization, storage, and access in most organizations, services, and applications. When a data set is released to other parties for data mining, some privacy-preserving technique is often required to reduce the possibility of identifying sensitive information about individuals. Most statistical solutions are concerned mainly with maintaining the statistical invariants of the data.
The ubiquity of RDBMSs has led to a prevalence of security threats against these systems. An intruder from the outside, for example, may gain unauthorized access to data by sending carefully crafted queries to the back-end database of a Web application. The data mining community has studied how to build strong privacy-preserving models and how to design efficient, optimal, and scalable heuristic solutions. An insider attack against an RDBMS, by contrast, is much more difficult to detect, and potentially much more dangerous. The constraints and dependencies that exist in the data model are used to generate a dependency matrix, which in turn is used to construct the knowledge graph of the insider.

The work reported here focuses on the k-anonymity property. The related notion for k-anonymity is the quasi-identifier: a set of attributes that may serve as an identifier in the data set. Let Q be the quasi-identifier. An equivalence class of a table with respect to Q is the collection of all tuples in the table containing identical values for Q. The size of an equivalence class indicates the strength of identification protection for the individuals in that class: the more tuples an equivalence class contains, the harder it is to re-identify any of them individually. A data set D is k-anonymous with respect to Q if the size of every equivalence class with respect to Q is k or more. Consider the example raw medical data set in Table 1, with attributes Job, Birth, Postcode, and Illness.

JOB    BIRTH   POSTCODE   ILLNESS
Cat1   1975    4350       HIV
Cat1   1955    4350       HIV
Cat1   1955    5432       Flu
Cat2   1955    5432       Fever
Cat2   1975    4350       Flu
Cat2   1975    4350       Fever

TABLE 1: RAW MEDICAL DATA SET

Two records here can be re-identified as unique because they share the same Job, Postcode, and Illness and differ only in Birth. The table is generalized to the 2-anonymous Table 2.

JOB    BIRTH   POSTCODE   ILLNESS
Cat1   *       4350       HIV
Cat1   *       4350       HIV
Cat1   1955    5432       Flu
Cat2   1955    5432       Fever
Cat2   1975    4350       Flu
Cat2   1975    4350       Fever

TABLE 2: A 2-ANONYMOUS DATA SET

In our survey we noticed that k-anonymization has two main models:
1. Global recoding
2. Local recoding

In global recoding, all values of an attribute come from the same domain level in the hierarchy. One advantage of global recoding is that the anonymous view has uniform domains, but it may lose more information. For our example, the global recoding of the raw medical data set shown in Table 3 suffers from over-generalization.

JOB    BIRTH   POSTCODE   ILLNESS
*      *       4350       HIV
*      *       4350       HIV
*      *       5432       Flu
*      *       5432       Fever
*      *       4350       Flu
*      *       4350       Fever

TABLE 3: A (0.5, 2)-ANONYMOUS TABLE BY FULL-DOMAIN GENERALIZATION

In local recoding, values may be generalized to different levels in the domain. Table 2 is a 2-anonymous table by local recoding; in fact, local recoding is the more general model, and global recoding is a special case of it. We represent unknown values by (*), known as suppression, which is one special case of generalization. Observe that although Table 2 satisfies the 2-anonymity property, it does not protect the two patients' sensitive information, namely their HIV infection. The sketch below shows how the basic equivalence-class and k-anonymity checks behind these tables can be expressed.
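To make the definitions concrete, here is a minimal Python sketch, written by us rather than taken from the paper, of the equivalence-class and k-anonymity checks applied to the data of Table 1; the function names are our own.

```python
from collections import defaultdict

# Table 1 from the text: (Job, Birth, Postcode) is the quasi-identifier,
# Illness is the sensitive attribute.
RECORDS = [
    ("Cat1", "1975", "4350", "HIV"),
    ("Cat1", "1955", "4350", "HIV"),
    ("Cat1", "1955", "5432", "Flu"),
    ("Cat2", "1955", "5432", "Fever"),
    ("Cat2", "1975", "4350", "Flu"),
    ("Cat2", "1975", "4350", "Fever"),
]

def equivalence_classes(records, qi):
    """Group tuples by their quasi-identifier projection."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[i] for i in qi)].append(rec)
    return classes

def is_k_anonymous(records, qi, k):
    """D is k-anonymous w.r.t. Q if every equivalence class has size >= k."""
    return all(len(grp) >= k for grp in equivalence_classes(records, qi).values())

QI = (0, 1, 2)  # Job, Birth, Postcode
print(is_k_anonymous(RECORDS, QI, 2))  # False: the raw table is not 2-anonymous
```

Running the same check on the generalized tuples of Table 2 returns True, which is exactly the gap discussed next: 2-anonymity holds even though the sensitive association is exposed.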
We cannot distinguish the individuals behind the first two tuples. This is clearly an undesirable outcome: any other individual whose generalized identifying attributes are the same as the mayor's can also be inferred to have HIV. An appropriate solution is Table 4.

JOB    BIRTH   POSTCODE   ILLNESS
*      1975    4350       HIV
*      *       4350       HIV
Cat1   1955    5432       Flu
Cat2   1955    5432       Fever
*      *       4350       Flu
*      1975    4350       Fever

TABLE 4: AN ALTERNATIVE 2-ANONYMOUS DATA SET

We see from the above that protecting the relationship to sensitive attribute values is as important as identification protection. Privacy preservation therefore has two primary goals:
A. To protect individual identification.
B. To protect sensitive relationships.

The aim of our scheme is to build safely disclosed data sets. We propose an (α, k)-anonymity model, where α is a fraction and k is an integer: in addition to requiring every equivalence class to contain at least k tuples, the relative frequency of each sensitive value within any equivalence class must be no more than α. It is natural to extend the well-known k-anonymity algorithm Incognito to our (α, k)-anonymity problem, but that algorithm is not scalable in the size of the quasi-identifier and, being global-recoding based, may distort the data considerably; we therefore also propose an efficient local-recoding based method. Earlier work on hiding association rules in transactional data is entirely different from our proposed system: there, the rules to be hidden must be known beforehand and only one rule can be hidden at a time, whereas our scheme blocks all rules from quasi-identifiers to a sensitive class. Our work also differs from template-based privacy preservation in classification problems, which hides strong associations between some attributes and sensitive classes; our model combines k-anonymity with association hiding. Through suppression we aim to minimize the distortion effect of global recoding, and the main aim of our scheme is to bound the distortion of data modifications without any attachment to a particular data mining method such as classification.

The (c, l)-diversity model [6] was proposed to solve the above homogeneity attack problem. That attack assumes the attacker has background knowledge to rule out some possible values of a sensitive attribute for the targeted victim. Here l is the level of diversity: the larger l is, the more distinct sensitive values each group must contain. The intuitive idea of the parameters c and l is to ensure that the most frequent sensitive value in a group is not too frequent after the next p most frequent sensitive values are removed, where p is related to the parameter l.

We propose to address k-anonymity together with protection of the values of sensitive attributes: a simple and effective model that protects both identifications and sensitive associations in a disclosed data set. (α, k)-anonymity extends the k-anonymity model by limiting the confidence of the implications from the quasi-identifier to a sensitive value (attribute) to at most α, so that sensitive information cannot be inferred through strong implications. We extend Incognito, a well-known global-recoding algorithm for the k-anonymity problem, to solve the (α, k)-anonymity problem. The sketch below checks both conditions of the model directly.
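A minimal Python sketch of the (α, k) check, again our own illustration rather than the paper's code: k-anonymity on the equivalence-class sizes plus α-deassociation on the frequency of the sensitive value. The demo uses Table 3, which the text describes as (0.5, 2)-anonymous.

```python
from collections import defaultdict

def satisfies_alpha_k(records, qi, sens_idx, sensitive_value, alpha, k):
    """Check k-anonymity plus alpha-deassociation for one sensitive value."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[i] for i in qi)].append(rec)
    for group in classes.values():
        if len(group) < k:
            return False  # k-anonymity violated
        hits = sum(rec[sens_idx] == sensitive_value for rec in group)
        if hits / len(group) > alpha:
            return False  # alpha-deassociation violated
    return True

# Table 3 from the text (full-domain generalization).
TABLE3 = [
    ("*", "*", "4350", "HIV"),
    ("*", "*", "4350", "HIV"),
    ("*", "*", "5432", "Flu"),
    ("*", "*", "5432", "Fever"),
    ("*", "*", "4350", "Flu"),
    ("*", "*", "4350", "Fever"),
]
print(satisfies_alpha_k(TABLE3, (0, 1, 2), 3, "HIV", alpha=0.5, k=2))  # True
```

The (*, *, 4350) class holds four tuples of which two are HIV (frequency 0.5), and the (*, *, 5432) class holds two tuples with no HIV, so both conditions hold with α = 0.5 and k = 2.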
The proposed local-recoding algorithm is scalable and generates less distortion. The k-anonymity model requires that every value set for the quasi-identifier attribute set has a frequency of either zero or at least k. Consider a large collection of patient records with different medical conditions: some diseases, such as HIV, are sensitive, but many, such as cold and fever, are common.

DEFINITION ((α, k)-ANONYMIZATION). A view of a table is an (α, k)-anonymization of the table if the view modifies the table such that it satisfies both the k-anonymity and the α-deassociation properties with respect to the quasi-identifier.

Table 4 is a (0.5, 2)-anonymous view of the raw medical data set, since the size of every equivalence class with respect to the quasi-identifier is 2 and each equivalence class contains at most half of the tuples associated with HIV.

DEFINITION (LOCAL RECODING). Given a data set D of tuples, a function c that converts each tuple t in D to c(t) is a local recoding for D.

Local recoding distorts the tuples in the data set, so we define a measurement for the amount of distortion generated by a recoding, which we call the recoding cost. If the recoding uses suppression to unknown (*) values, the total cost is the total number of suppressions, i.e., the number of *'s in the resulting data set. We can then speak of optimal (α, k)-anonymity. The decision problem is defined as follows:

(α, k)-ANONYMIZATION: Given a data set D with a quasi-identifier Q and a sensitive value s, is there a local recoding for D by a function c such that, after recoding, (α, k)-anonymity is satisfied and the cost of the recoding is at most C?

GLOBAL RECODING

Incognito is an optimal global-recoding algorithm for the k-anonymity problem; we extend this existing algorithm to the (α, k)-anonymity model. Incognito has also been used for the l-diversity problem, and it exploits a monotonicity property when searching the solution space, so the search can be cut off as soon as a stopping condition is satisfied. The extended algorithm is similar to the original, except that instead of testing the k-anonymity property alone it tests the k-anonymity and α-deassociation (or l-diversity) properties together.

LOCAL RECODING

The extended Incognito algorithm is an exhaustive global-recoding algorithm; its solutions may distort the data set excessively, and it may not scale. We therefore propose a scalable local-recoding algorithm, a top-down approach. For ease of illustration we first tackle the problem for a quasi-identifier of size 1; the method is then extended to quasi-identifiers of size greater than 1. The idea of the algorithm is first to generalize all tuples completely, so that initially all tuples fall into one equivalence class, and then to specialize the tuples in iterations, maintaining (α, k)-anonymity during the specialization. The process continues until no further specialization can occur.

Quasi-identifier of size more than 1: When the quasi-identifier has more than one attribute, we extend the top-down algorithm as follows. In the first step all attributes are fully generalized. In each iteration we find the "best" attribute for specialization and perform the specialization on that attribute. A deliberately simplified sketch of the size-1 case follows; the two paradigms in the next section then make "best" precise.
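The following is our own one-pass simplification of the top-down idea for a size-1 quasi-identifier, not the paper's full algorithm: every tuple starts generalized to *, and a tuple is specialized back to its original value only if its group satisfies the (α, k) conditions.

```python
from collections import defaultdict

def topdown_single_attribute(values, sens, alpha, k):
    """Simplified one-pass top-down local recoding, size-1 quasi-identifier.

    The full algorithm iterates, re-checks the leftover generalized group,
    and redistributes tuples; this sketch stops after a single pass.
    """
    groups = defaultdict(list)
    for v, s in zip(values, sens):
        groups[v].append(s)

    def group_ok(sens_list):
        # (alpha, k): enough tuples, and no sensitive value too frequent.
        if len(sens_list) < k:
            return False
        most_common = max(sens_list.count(x) for x in set(sens_list))
        return most_common / len(sens_list) <= alpha

    return [(v if group_ok(groups[v]) else "*", s)
            for v, s in zip(values, sens)]

values = ["4350", "4350", "4350", "4350", "5432", "5432"]
sens   = ["HIV", "HIV", "Flu", "Fever", "Flu", "Fever"]
print(topdown_single_attribute(values, sens, alpha=0.5, k=2))
```

In this example both postcode groups already satisfy (0.5, 2), so all tuples stay specialized; a group violating either condition would remain at *.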
For the analysis, consider a group P to be specialized on a single attribute. For each attribute in the quasi-identifier, our approach "tries" to specialize P, and among those trial specializations we select the "best" attribute for the final specialization.

Paradigm 1 (Greatest Number of Tuples Specialized): When P is specialized we obtain a final distribution of the tuples, and some may remain in P. The "best" specialization is the one that yields the greatest number of tuples specialized, since this corresponds to the least distortion.

Paradigm 2 (Smallest Number of Branches Specialized): When there is a tie under the first criterion, we consider the number of branches specialized. The "best" specialization is the one that yields the smallest number of branches. The rationale is that a smaller number of branches indicates a more generalized domain, which is a better choice than a less generalized one.

EXISTING SYSTEM

Mitigating leakage or misuse of data stored in databases (i.e., tabular data) by an insider with legitimate privileges to access the data is a challenging task. Current methods are generally based on user behavioral profiles that define normal user behavior and issue an alert whenever a user's behavior deviates significantly from the normal profile. Security-related data measures such as k-anonymity, l-diversity, and (α, k)-anonymity are designed mainly for privacy preservation and are not relevant when the user has free access to the data. The most common approach to representing user behavioral profiles is to analyze the SQL statements submitted by an application server to the database (as a result of user requests) and to extract various features from these SQL statements. Another approach analyzes the data actually exposed to the user, i.e., the result sets. However, these methods ignore the differing sensitivity levels of the data to which an insider is exposed, and data leaks or misuse in an organization can inflict damage that depends heavily on those sensitivity factors. A better system is therefore required to address these issues, with mitigating data leakage or misuse still the primary focus.

The misusability weight assigns a sensitivity score to a dataset, thereby estimating the level of harm that might be inflicted upon the organization when the data is leaked. The misusability weight has four optional usages: applying anomaly detection by learning the normal behavior of an insider in terms of the sensitivity of the data actually exposed; improving the handling of leakage incidents identified by other misuse detection systems, by enabling the security officer to focus on incidents involving more sensitive data; implementing dynamic misusability-based access control that regulates user access to sensitive data stored in relational databases; and reducing the misusability of the data.

Assigning a misusability weight to a given dataset is strongly related to the way the data is presented (e.g., tabular data, structured or free text) and is domain specific. Hence, no single misusability measure fits all types of data in every domain, but it gives a fair idea of how to proceed; a toy illustration of combining the ingredients follows.
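The closed-form M-score is not reproduced in this paper, so the sketch below is purely illustrative: it combines the three factors detailed in the next section (quality, quantity, and the distinguishing factor) in a way we chose for demonstration. The weighting scheme, the division by the distinguishing factor, and the sub-linear quantity exponent are all our assumptions, not the authors' formula.

```python
def record_score(value_sensitivities, distinguishing_factor):
    # Quality: expert-assigned importance of the exposed values.
    # Distinguishing factor: effort needed to single out the entity; lower
    # effort means higher risk, hence the division (illustrative choice).
    return max(value_sensitivities) / distinguishing_factor

def table_m_score(records, quantity_exponent=0.5):
    # records: list of (value_sensitivities, distinguishing_factor) pairs.
    # Quantity: more exposed records means more potential harm; the
    # sub-linear exponent is an arbitrary illustrative choice.
    worst = max(record_score(s, d) for s, d in records)
    return worst * len(records) ** quantity_exponent

demo = [([0.9, 0.2], 3.0), ([0.4, 0.4], 1.0)]  # made-up sensitivity scores
print(table_m_score(demo))                      # 0.4 * sqrt(2), about 0.566
```

The point of the toy is only that the score rises with value sensitivity and with the number of exposed records, and falls as entities become harder to distinguish.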
As noted in the abstract, the M-score model considers the number of entities, the anonymity level, the number of properties, and the values of those properties. The M-score incorporates three main factors:

Quality of data - the importance of the information.
Quantity of data - how much information is exposed.
The distinguishing factor - given the quasi-identifiers, the amount of effort required to discover the specific entities that the table refers to.

Records Ranking

In this approach the domain expert is asked to assign a sensitivity score to individual records; the expert thereby expresses the sensitivity level of different combinations of sensitive values. There are two challenges in applying the records-ranking method:
a. Choosing a record set that makes it possible to derive a general model and that is as small and compact as possible (since it is not possible to rank all records in the database).
b. Choosing an algorithm (knowledge model) for deriving the scoring model.

Tackling the second challenge is more complicated, because many different methods, each with its pros and cons, can be chosen for building a knowledge model from a list of ranked records. Among the methods for deriving the scoring function, we examined the Records Ranking [LR], Pairwise Comparison [AHP], and Records Ranking [CART] models, validated by their Kendall tau (accuracy) scores. Records Ranking [LR] handles unknown values well, under the assumption that the relationships between the attributes and the dependent variable are linear. Pairwise Comparison [AHP] uses analytic hierarchy process (AHP) tree structures and hence produces optimized results faster. Records Ranking [CART] makes no linearity assumption about the relationships between the attributes and the dependent variable and hence takes much more time. Records Ranking [LR] and Pairwise Comparison [AHP] significantly outperformed Records Ranking [CART] in expert scoring and were therefore the chosen knowledge models. Knowledge acquired from one expert is sufficient to calculate the M-score for the entire domain. The main goal of this experiment was to find out whether the M-score fulfills its goal of measuring misusability weight. An implementation of the above approach validates the current system's efficiency in identifying potential data misuse.

PROPOSED SYSTEM

Prior approaches used one domain expert to identify sensitive information. For better performance and accuracy, we propose to exploit the effect of combining knowledge from several experts (i.e., an ensemble of knowledge models). We also plan to extend the computation of the sensitivity level of sensitive attributes so that it is obtained objectively, using machine learning techniques such as an SVM classifier alongside expert scoring models. This approach fits the sensitive parameter values to the customer value based on customer activity, which is far more efficient than face-value specification with human involvement; a sketch of such an ensemble appears below.
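A minimal sketch of the proposed ensemble idea, assuming scikit-learn is available: several knowledge models (an SVM among them, with a tree model standing in for CART and a linear model for LR) are trained on expert-ranked records and their predictions averaged. Since sensitivity scores are continuous we use regressors here, and the training data is synthetic, standing in for real expert rankings; none of this is the paper's actual implementation.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=1)
X = rng.random((200, 4))                 # encoded record attributes (synthetic)
y = X @ np.array([0.5, 0.2, 0.2, 0.1])   # stand-in for expert sensitivity scores

# One model per "expert"; in practice each would be fit on a different
# expert's ranked record set rather than on the same data.
experts = [SVR(kernel="rbf"), DecisionTreeRegressor(max_depth=4), LinearRegression()]
for model in experts:
    model.fit(X, y)

def ensemble_sensitivity(record):
    """Average the experts' scores; weighted voting is an obvious variant."""
    record = np.asarray(record).reshape(1, -1)
    return float(np.mean([m.predict(record)[0] for m in experts]))

print(ensemble_sensitivity([0.8, 0.1, 0.3, 0.5]))
```

Averaging is the simplest combination rule; weighting each expert model by its Kendall tau accuracy, as measured in the records-ranking experiments, would be a natural refinement.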
As discussed in the introduction, insider attacks against an RDBMS are much harder to detect than external intrusions, and potentially much more dangerous. Our scheme therefore combines the misusability-weight approach with the (α, k)-anonymity model introduced above: every equivalence class must contain at least k tuples, and the relative frequency of each sensitive value within any equivalence class must be no more than α. The extended Incognito algorithm provides a global-recoding solution, and the top-down algorithm provides a scalable local-recoding alternative that first generalizes all tuples completely and then iteratively specializes on the "best" attribute.

EXPERIMENTAL STUDY

The algorithms were implemented in C/C++, and a Pentium IV 2.2 GHz PC with 1 GB of RAM was used to conduct our experiments. As the data set we used the Adult Database from the UC Irvine Machine Learning Repository [5]. Table 5 describes the nine attributes chosen. We set α = 0.5 and k = 2, and we choose the first eight attributes as the quasi-identifier and the last attribute as the sensitive attribute. We assess the proposed algorithms in terms of the two measurements introduced in the previous sections: execution time and distortion ratio.

    Attribute        Distinct Values   Generalization            Height
1   Age              74                5-, 10-, 20-year ranges   4
2   Work Class       7                 Taxonomy tree             3
3   Education        16                Taxonomy tree             4
4   Marital Status   7                 Taxonomy tree             3
5   Occupation       14                Taxonomy tree             2
6   Race             5                 Taxonomy tree             2
7   Gender           2                 Suppression               1
8   Native Country   41                Taxonomy tree             3
9   Salary Class     2                 Suppression               1

TABLE 5: DESCRIPTION OF THE ADULT DATA SET

We denote the proposed algorithms TopDown and eIncognito: eIncognito is the extended Incognito algorithm, and TopDown is the local-recoding based top-down approach. A sketch of such an evaluation harness follows.
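The C/C++ implementation is not reproduced in the paper, so the Python harness below only illustrates how execution time and distortion ratio can be measured while the quasi-identifier size varies; the anonymizer is a stub standing in for TopDown or eIncognito, and the data is randomly generated.

```python
import random
import time

def distortion_ratio(table):
    """Fraction of suppressed ('*') cells, the cost measure defined earlier."""
    total = sum(len(row) for row in table)
    return sum(row.count("*") for row in table) / total

def random_table(n_rows, qi_size):
    return [[random.choice("ABCD") for _ in range(qi_size)]
            for _ in range(n_rows)]

def anonymize_stub(table):
    # Placeholder for TopDown or eIncognito: it simply suppresses the first
    # column so the harness runs end to end.
    return [["*"] + row[1:] for row in table]

for qi_size in (2, 4, 6, 8):
    table = random_table(5000, qi_size)
    start = time.perf_counter()
    result = anonymize_stub(table)
    elapsed = time.perf_counter() - start
    print(f"QI size {qi_size}: {elapsed:.4f}s, "
          f"distortion {distortion_ratio(result):.3f}")
```

Substituting a real (α, k) anonymizer for the stub and sweeping α and k as well would reproduce the shape of the experiments reported next.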
Fig. 1 (panels (a)-(d), figure omitted): Execution time and distortion ratio versus quasi-identifier size and α (k = 2).

The figure shows the execution time and the distortion ratio plotted against the quasi-identifier size and α when k = 2. In (a), as α varies, the algorithms behave differently. In (b), the execution time increases with the quasi-identifier size, because the complexity of the algorithms grows with it. In (c), distortion and α are inversely proportional: as α increases, the distortion ratio decreases. In (d), it is easy to see why the distortion ratio increases with the quasi-identifier size. Intuitively, a greater α imposes a weaker α-deassociation requirement, so fewer generalization operations are needed and the distortion ratio stays lower. When the quasi-identifier contains more attributes, there is more chance that the quasi-identifier values of two tuples differ, so more generalization is needed; likewise, the distortion ratio is high when k is large, because it is then less likely that enough tuples share equal quasi-identifier values. On average, the TopDown algorithm achieves a distortion ratio about three times smaller than the eIncognito algorithm.

GENERAL (α, k)-ANONYMITY MODEL

We extend the simple (α, k)-model to multiple sensitive values, for the case where there are two or more sensitive values that are rare in the data set. The simple (α, k)-anonymity model is applicable after combining the sensitive classes into one sensitive class, since the inference confidence for each individual sensitive value, controlled by α, is smaller than or equal to the confidence for the combined value. It is also possible to use an (α, k)-anonymity model to protect a sensitive attribute that contains many values, none of which dominates the attribute. For example, if an equivalence class contains three sensitive values in even proportions, the confidence of inferring any one of them is about 33 percent. A data set is α-deassociated with respect to a sensitive attribute if it is α-deassociated with respect to every value of that attribute. The proposed algorithms extend to this general (α, k)-anonymity model: the monotonicity property assumed by global recoding also holds for general (α, k)-anonymity, and the top-down local-recoding algorithm is easily extended to the general model by modifying the condition used when testing the candidates.

CONCLUSION

The k-anonymity model protects identification information but cannot protect sensitive data. To protect both identification information and sensitive data, we propose (α, k)-anonymity. We prove that achieving optimal (α, k)-anonymity by local recoding is NP-hard. To transform a data set so that it satisfies the (α, k)-anonymity property, we present an optimal global-recoding method and an efficient local-recoding based algorithm.
Our evaluation shows that, on average, the local-recoding based algorithm runs about four times faster and produces about three times less distortion of the data set than the global-recoding algorithm.

REFERENCES

[1] Raymond Chi-Wing Wong, Jiuyong Li, Ada Wai-Chee Fu, and Ke Wang. (α, k)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing. In KDD '06, August 20-23, 2006.
[2] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: efficient full-domain k-anonymity. In SIGMOD Conference, pages 49-60, 2005.
[3] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In ICDE, pages 205-216, 2005.
[4] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):571-588, 2002.
[5] E. K. C. Blake and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/MLRepository.html, 1998.
[6] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: privacy beyond k-anonymity. In ICDE, 2006.

AUTHORS PROFILE

Sridevi S. completed her P.G. degree, MCA, from Osmania University in 2003. She previously worked as an Assistant Professor at JMJ College for Women and at K. Chandrakala PG College, Tenali. She is presently an M.Tech student at VVIT College, Nambur.

V. Ramachandran is a research scholar at Acharya Nagarjuna University, Nambur. He received his B.Tech degree in Computer Science & Systems Engineering from Andhra University and his M.Tech in Computer Science Engineering from JNTU, Kakinada. His interests include image processing, medical retrieval, human vision, and pattern recognition, and he has completed several projects in image processing.

Dr. Ch. Rupa is the Head of the Department of Computer Science and Engineering at VVIT, Nambur, and has 12 years of teaching experience. She obtained her B.Tech (CSIT) degree in Computer Science & Information Technology from JNTU, Hyderabad in 2002 and her M.Tech (IT) in Information Technology from Andhra University in 2005. She was awarded a Ph.D. (CSE) in Computer Science Engineering by Andhra University.