Developer Identification Methods for Integrated Data from Various Sources
Gregorio Robles
Jesus M. Gonzalez-Barahona
Presented by Brian Chan
CISC 864

Table of Contents
  Background Information
  Problems Addressed
  Motivation
  Data Gathered
  Conclusion
  Personal Thoughts
  Questions and Comments

Background Information
  Data mining for a project usually draws on a single source of data
  Results can be applied to libre software
  Sources are looked at separately:
    Mailing lists
    Bug repositories

Background Information
  Libre software follows the Pareto law for commits: for each major artifact, 20% of developers contribute 80% of the activity in it

Problems Addressed
  Are the people who commit so much in one artifact the same people who are active in the other artifacts?
  People use different identities in each artifact
  Current mining techniques focus on one artifact, so they cannot tell who is who

Motivation
  To gain insight into the social network and structure of libre software projects
  To find all the identities that correspond to one person
  Focus more on data analysis than on the extraction process

Data Gathered
  Actor has access to artifacts
  Alternate rules for each artifact
  Figure 1.0

Data Gathered
  Actor can post on more than one mailing list:
    bylchan@ca.ibm.com
    briancha@ca.ibm.com
  Source files can appear with many identities:
    Brian Chan
    Brian
    bchan
  Interaction with the versioning repository occurs through an account on the server machine
  Bug tracking systems require an email address, e.g. Bugzilla

Data Gathered
  Primary: required information
  Secondary: not required for the transaction, e.g. name in email
  Figure 2.0

Data Gathered (cont’d)
  Automated process extracts data into the data repository
  Figure 3.0

Data Gathered
  Sources table: lists where id information was originally extracted, e.g. file1.C, bugreport230
  Identification table: identity, id key to Sources table

Data Gathered
  Persons
    Identifications
    Gender, Nationality, Hash
    Pseudo identity: bchan
    Match number with another identity
  Matches
    Tells which two identities belong to the same actor
  Table 1.0
    1   bchan        bylchan@ca.ibm.com   Deduction    80%
    1   Brian Chan   bylchan@ca.ibm.com   Same Email   90%

Data Gathered
  Matching during the automated data gathering process:
    Inference
    Automatic heuristics
    Human verification

Data Gathered
  Rule 1: Primary identities may have part of the real name in them
    Example User <username@example.com>
  Rule 2: Identities can be built from another one
    nsurname@example.com, name.surname@example.com, name_surname@example.com
  Rule 3: Some projects or repositories have the foresight to keep lists of information that can be used for matching

Data Gathered
  The matching algorithms still produce errors, but in a statistical gathering process the error can be ignored if it is small enough
  Cleaning and verification are still used

Data Gathered
  Privacy issues:
    Use hash value (1st firewall) to reference information
      Cannot reference Identifications directly
    Person ID (2nd firewall)
      Given in such a way that the real identity cannot be inferred without direct access to the Identifications table
      Given to a unique person so hackers cannot find a specific id

Conclusions
  Actors in libre software may use many different identities for development
  Paper deals with the design of how to account for all the different people and who is actually doing what
  Discussed how privacy can be dealt with

Personal Thoughts
  Good points:
    Effective solution
    Good examination of all the different identities in business
    Unique interpretation of data mining

Personal Thoughts
  Points for improvement:
    No actual ‘data’ to view results
    References GNOME but never actually gives statistical information from it
    Some interpretation is left to the reader

Questions and Comments
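
Illustration: Rules 1 and 2 as code
  The sketch below is one possible reading of the matching rules from the Data Gathered slides, not the authors' implementation. It splits an identity of the form "Example User <username@example.com>" into name and email (Rule 1), builds the usernames a real name could produce (Rule 2), and reports pairs of identities that look like the same actor. All function names and the evidence labels are assumptions made for this illustration, and the deduction step ignores the email domain, so its output would still need the human verification the slides call for.

```python
import re
from itertools import combinations

# Hypothetical helper names; none of them come from the paper.
_NAME_ADDR = re.compile(r"^\s*(.*?)\s*<\s*([^<>\s]+)\s*>\s*$")


def split_identity(raw):
    """Rule 1: split 'Example User <username@example.com>' into (name, email)."""
    m = _NAME_ADDR.match(raw)
    if m:
        return m.group(1), m.group(2).lower()
    raw = raw.strip()
    if "@" in raw:
        return "", raw.lower()   # bare email address, no name part
    return raw, ""               # bare name or account, no email part


def candidate_usernames(real_name):
    """Rule 2: usernames that could be built from a 'Name Surname' string."""
    parts = real_name.lower().split()
    if len(parts) < 2:
        return set()
    first, last = parts[0], parts[-1]
    return {first[0] + last, f"{first}.{last}", f"{first}_{last}"}


def find_matches(raw_identities):
    """Return (identity_a, identity_b, evidence) for likely same-actor pairs."""
    parsed = [(raw, *split_identity(raw)) for raw in raw_identities]
    matches = []
    for (raw_a, name_a, mail_a), (raw_b, name_b, mail_b) in combinations(parsed, 2):
        if mail_a and mail_a == mail_b:
            matches.append((raw_a, raw_b, "same email"))
            continue
        # Try the name of one identity against the email username of the other.
        for name, mail in ((name_a, mail_b), (name_b, mail_a)):
            local = mail.split("@")[0]
            if local and local in candidate_usernames(name):
                matches.append((raw_a, raw_b, "deduction"))
                break
    return matches


if __name__ == "__main__":
    identities = [
        "Name Surname <name.surname@example.com>",
        "nsurname@example.com",
        "name.surname@example.com",
    ]
    for a, b, evidence in find_matches(identities):
        print(f"{a!r} ~ {b!r} ({evidence})")
```

  Each reported pair corresponds to one row of a Matches-style table; a confidence value such as those in Table 1.0 would have to be assigned separately, since the slides do not say how those percentages are computed.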