A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes
Bilal Hawashin, Farshad Fotouhi (Department of Computer Science, Wayne State University)
Traian Marius Truta (Northern Kentucky University)

Outline
What is Similarity Join
Long String Values
Our Contribution
Privacy Preserving Protocol for Long String Values
Experiments and Results
Conclusions/Future Work
Contact Information

Motivation
Is natural join always suitable? Consider the two relations below: they describe the same people, but the values are spelled slightly differently, so a natural join on Name or Address would miss both matching pairs.

  Name          Address           Major           ...
  John Smith    4115 Main St.     Biology
  Mary Jones    2619 Ford Rd.     Chemical Eng.

  Name          Address           Monthly Sal.    ...
  Smith, John   4115 Main Street  1645
  Mary Jons     2619 Ford Rd.     2100

Similarity Join
Similarity join joins a pair of records if they have SIMILAR values in the join attribute. Formally, similarity join groups pairs of records whose similarity is greater than a threshold T. It has been studied widely in the literature and is also referred to as record linkage, entity matching, duplicate detection, citation resolution, and so on.

Our Previous Contribution: Long String Values (ICDM MMIS10)
The term long string refers to the data type representing any string value of unlimited length. The term long attribute refers to any attribute of the long string data type. Most tables contain at least one attribute with long string values; examples are Paper Abstract, Product Description, Movie Summary, and User Comment. Most of the previous work studied similarity join on short fields. In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance similarity join performance.

Example
[Figure: two paper relations, one holding papers P1-P3 and one holding papers P10 and P11, each with Title, Keywords, Authors, and Abstract attributes.]

Our Paper (Motivation)
Some sources may not allow sharing their whole data in the similarity join process; the solution is privacy preserving similarity join. Using long attributes as join attributes can increase similarity join accuracy. To the best of our knowledge, all current privacy preserving SJ algorithms use short attributes, and most of them ignore the semantic similarities among the values.

Problem Formulation
Our goal is to find a privacy preserving similarity join algorithm for the case where the join attribute is a long attribute, while considering the semantic similarities among such long values.

Our Work Plan
Phase 1: Compare multiple similarity methods for long attributes when similarity thresholds are used.
Phase 2: Use the best method as part of the privacy preserving SJ protocol.

Phase 1: Finding the Best SJ Method for Long Strings with a Threshold
Candidate methods: Diffusion Maps, Latent Semantic Indexing, Locality Preserving Projection.

Performance Measurements
F1 measurement: the harmonic mean of recall R and precision P, F1 = 2PR / (P + R), where precision is the fraction of the suggested (retrieved) pairs that are true matches and recall is the fraction of the true matches that are retrieved. A worked evaluation sketch is given after the next slide.

Performance Measurements (Cont.)
Preprocessing time is the time needed to read the dataset and generate the matrices that are later used as input to the semantic operation. Operation time is the time needed to apply the semantic method. Matching time is the time required by the third party C to compute the cosine similarity between the records provided by A and B in the reduced space and compare the similarities with the predefined similarity threshold.
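The following is a minimal Python sketch of the kind of threshold-based similarity join evaluation described above, assuming scikit-learn for the TF.IDF weighting and cosine similarity. The sample data, the variable names (summaries_a, true_matches), and the threshold value are illustrative and not taken from the paper.

    # Sketch: evaluate a threshold-based similarity join of long string values
    # with precision, recall, and F1. Illustrative data and names only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    summaries_a = ["a scientist studies marine biology on a remote island",
                   "two friends open a small restaurant in the city"]
    summaries_b = ["marine biology is studied by a scientist on an island",
                   "a thriller about a bank heist gone wrong"]
    true_matches = {(0, 0)}   # ground-truth matching pairs: (index in A, index in B)
    T = 0.3                   # similarity threshold

    # TF.IDF representation of the long string values (the join attribute)
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(summaries_a + summaries_b)
    tfidf_a, tfidf_b = tfidf[:len(summaries_a)], tfidf[len(summaries_a):]

    # Suggest every pair whose cosine similarity exceeds the threshold T
    sims = cosine_similarity(tfidf_a, tfidf_b)
    suggested = {(i, j) for i in range(sims.shape[0])
                        for j in range(sims.shape[1]) if sims[i, j] > T}

    # Precision: fraction of suggested pairs that are true matches.
    # Recall: fraction of true matches that were suggested.
    tp = len(suggested & true_matches)
    precision = tp / len(suggested) if suggested else 0.0
    recall = tp / len(true_matches) if true_matches else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")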
Datasets
IMDB Internet Movies Dataset: Movie Summary field.
Amazon Dataset: Product Title and Product Description fields.

Phase 1 Results
[Figure: finding the best dimensionality reduction method using Movie Summary from the IMDB dataset (left) and Product Description from the Amazon dataset (right).]

Phase 2 Results
Preprocessing time (1000 movie summaries):

  Step                                     Time
  Read dataset                             12 sec.
  TF.IDF weighting                         0.5 sec.
  Reduce dimensionality using MeanTF.IDF   1 sec.
  Find shared features                     Negligible

Phase 2 Results
[Figure: operation time for the best performing methods from Phase 1.]
Matching time is negligible.

Our Protocol
Both sources A and B agree on the threshold value T, which is used later to decide similar pairs. An end-to-end sketch of the following steps is given in Python after the Phase 2 results below.

Our Protocol
[Figure: Source A holds papers P1, P2, P3, ... and Source B holds paper Px, ...; each record has Title, Authors, and Abstract attributes.]

Step: Find the Term-LSV frequency matrix for each source.

  Ma           LSV1  LSV2  LSV3
  Image        4     0     0
  Classify     5     0     0
  Similarity   0     6     5
  Join         0     6     4

Step: Find the term-document weighted matrix using TF.IDF weighting.

  WeightedMa   LSV1  LSV2  LSV3
  Image        0.9   0     0
  Classify     0.7   0     0
  Similarity   0     0.85  0.9
  Join         0     0.7   0.85

TF.IDF Weighting
The TF.IDF weighting of a term w in a long string value x is given as TF.IDF(w, x) = tf_{w,x} * idf_w, where tf_{w,x} is the frequency of the term w in the long string value x, idf_w = log(N / n_w), N is the number of long string values in the relation, and n_w is the number of long string values in the relation that contain the term w.

MeanTF.IDF Feature Selection
MeanTF.IDF is an unsupervised feature selection method: every feature (term) is assigned a value according to its importance. The value of a term feature w is given as MeanTF.IDF(w) = (1/N) * sum_{x=1..N} TF.IDF(w, x), where TF.IDF(w, x) is the weighting of feature w in long string value x and N is the total number of long string values.

Step: Apply MeanTF.IDF on WeightedMa and store the important features in Imp_Fea. Add random features to Imp_Fea to get Rand_Imp_Fea. Rand_Imp_Fea (from A) and Rand_Imp_Feb (from B) are sent to C. C finds their intersection and returns the shared important features SF to both A and B.

Step: Reduce the WeightedM dimensions in both sources using SF.

  SFa          LSV1  LSV2  LSV3
  Image        0.9   0     0
  Similarity   0     0.85  0.9
  ...

Step: Add random vectors to SF.

  Rand_Weighted_a  LSV1  LSV2  LSV3  Random cols
  Image            0.9   0     0     0.6
  Similarity       0     0.85  0.9   0.2
  ...

Step: Find Wa (the kernel), where Wa(i, j) = 1 - Cos_Sim(LSVi, LSVj).

  Wa      LSV1  LSV2  LSV3  ...
  LSV1    0     0.2   0.3
  LSV2    0.2   0     0.87
  LSV3    0.3   0.87  0
  ...

Wa is D x D, where D is the total number of columns in Rand_Weighted_a.

Step: Use diffusion maps to find Red_Rand_Weighted_a.
[Red_Rand_Weighted_a, Sa, Va, Aa] = Diffusion_Map(Wa, 10, 1, red_dim), where red_dim < D.

  Red_Rand_Weighted_a                            Col1  Col2  ...  Col_red_dim
  Diffusion map representation of row 1 of Wa    0.4   0.1
  Diffusion map representation of row 2 of Wa    0.8   0.6
  Diffusion map representation of row 3 of Wa    0.75  0.5
  ...

Step: C finds the pairwise similarity between Red_Rand_Weighted_a and Red_Rand_Weighted_b.

  Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
  1                    1                    0.77
  1                    2                    0.3
  2                    1                    0.9
  ...                  ...                  ...

Step: If Cos_Sim > T, insert the tuple in Matched.

  Matched
  Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
  1                    1                    0.77
  2                    1                    0.9
  2                    7                    0.85
  ...                  ...                  ...

Step: Matched is returned to both A and B. A and B remove the random vectors from Matched and share their matrices.

Our Protocol (Part 1)
[Figure: protocol diagram, part 1.]

Our Protocol (Part 2)
[Figure: protocol diagram, part 2.]

Phase 2 Results
[Figure: effect of adding random columns on the accuracy.]

Phase 2 Results
[Figure: effect of adding random columns on the number of suggested matches.]
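Below is a minimal Python sketch of the protocol's core computation at one source (TF.IDF weighting, MeanTF.IDF feature selection, decoy records, the cosine-distance kernel, and a diffusion map embedding), followed by the third party's threshold matching. It assumes numpy and scikit-learn; the helper names (top_mean_tfidf_terms, diffusion_map), the parameters eps and t, and the toy data are all illustrative. Scikit-learn's TF.IDF variant stands in for the paper's weighting, the diffusion map is a generic textbook construction rather than the Diffusion_Map routine used in the paper, and the feature-intersection exchange with C is abbreviated.

    # Sketch of the protocol core at one source, plus the third party's matching.
    # Illustrative only: the shared features SF are taken to be A's own important
    # features, and source B's reduced matrix is stubbed with a copy of A's.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def top_mean_tfidf_terms(weighted, terms, k):
        """MeanTF.IDF feature selection: rank terms by their average TF.IDF
        weight over all long string values and keep the top k."""
        scores = np.asarray(weighted.mean(axis=0)).ravel()   # MeanTF.IDF(w)
        return [str(terms[i]) for i in np.argsort(scores)[::-1][:k]]

    def diffusion_map(dist, eps=1.0, t=1, red_dim=2):
        """Generic diffusion map embedding of a pairwise distance matrix."""
        K = np.exp(-dist ** 2 / eps)                  # distances -> affinities
        P = K / K.sum(axis=1, keepdims=True)          # row-stochastic Markov matrix
        vals, vecs = np.linalg.eig(P)
        order = np.argsort(-vals.real)
        vals, vecs = vals.real[order], vecs.real[:, order]
        # Skip the trivial constant eigenvector; scale by eigenvalues^t.
        return vecs[:, 1:red_dim + 1] * (vals[1:red_dim + 1] ** t)

    # --- At source A (source B runs the same steps on its own data) ---
    long_values_a = ["image classification with learned visual features",
                     "semantic similarity join over long string attributes",
                     "a privacy preserving protocol for similarity join"]
    vec_a = TfidfVectorizer()
    weighted_ma = vec_a.fit_transform(long_values_a)  # TF.IDF-weighted LSV x term matrix
    imp_fea = top_mean_tfidf_terms(weighted_ma, vec_a.get_feature_names_out(), k=10)
    sf = sorted(imp_fea)                              # stand-in for the shared features SF
    cols = [vec_a.vocabulary_[w] for w in sf]
    sf_a = weighted_ma[:, cols].toarray()
    # Decoys: the paper appends random columns to the term x LSV matrix; with
    # records as rows here, that corresponds to appending random rows.
    rng = np.random.default_rng(0)
    rand_weighted_a = np.vstack([sf_a, rng.random((3, sf_a.shape[1]))])
    w_a = 1.0 - cosine_similarity(rand_weighted_a)    # kernel of pairwise cosine distances
    red_a = diffusion_map(w_a, eps=1.0, t=1, red_dim=2)  # what A actually sends to C

    # --- At the third party C ---
    red_b = red_a.copy()                              # stand-in for B's reduced matrix
    T = 0.9
    sims = cosine_similarity(red_a, red_b)
    matched = [(i, j, round(sims[i, j], 2)) for i in range(sims.shape[0])
               for j in range(sims.shape[1]) if sims[i, j] > T]
    print(matched)  # returned to A and B, which then drop the decoy rows

Only the reduced, decoy-padded representations ever leave A and B, which is what allows C to compute the matches without seeing the original long string values.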
Conclusions
An efficient, secure semantic SJ protocol for long string attributes is proposed.
Among the compared methods, diffusion maps is the best method for semantically joining long string attributes when threshold values are used.
Mapping into the diffusion maps space and adding random records can hide the original data without affecting the accuracy.

Future Work
Potential directions for future work:
Compare diffusion maps with more candidate semantic methods for joining long string attributes.
Study the performance of the protocol on very large databases.

Thank You
Dr. Farshad Fotouhi.
Dr. Traian Marius Truta.