Learning Reflection 9 Summary
Tyrell Garza
CSE 163
3-16-23

This week's lessons focused on privacy in data analysis. We learned about differential privacy, which involves adding random noise to data to protect the privacy of individuals while still allowing meaningful analysis. We also learned about the Laplace distribution and how it can be used for "jittering" data to achieve differential privacy. Finally, we discussed randomized response, a mechanism for ensuring differential privacy in the absence of a trusted aggregator, which involves randomizing individual data before sending it to the aggregator.

Group Fairness Conceptual Inventory

Statistical Parity vs. Equal Opportunity vs. Predictive Equality
Group fairness: fairness in algorithms that aims to avoid discrimination against subgroups based on protected characteristics such as race, sex, ability, religion, or political identity. It ensures that the model does not treat individuals differently based on their subgroup membership. Statistical parity, equal opportunity, and predictive equality are definitions of group fairness that check for equity in subgroup decisions, false-negative rates, and false-positive rates, respectively: statistical parity aims for equity in the rate of positive predictions for each subgroup, equal opportunity aims for equity in the false-negative rates for each subgroup, and predictive equality aims for equity in the false-positive rates for each subgroup.
EX.) A college admissions decision model that does not discriminate against a minority subgroup, such as squares in a population of circles and squares, and ensures that both subgroups have equal opportunities for admission.

True Positive vs. True Negative vs. False Positive vs. False Negative (concrete example, described in these terms)
• A true positive is an instance where the model predicts a positive result for a data point that is actually positive.
• A true negative is an instance where the model predicts a negative result for a data point that is actually negative.
• A false positive is an instance where the model predicts a positive result for a data point that is actually negative.
• A false negative is an instance where the model predicts a negative result for a data point that is actually positive.
For example, in a medical diagnosis scenario: a true positive is a patient diagnosed as having a disease who actually has it; a true negative is a patient diagnosed as not having the disease who actually does not have it; a false positive is a patient diagnosed as having the disease who actually does not have it; and a false negative is a patient diagnosed as not having the disease who actually does have it.

Calculating False Positive Rates and False Negative Rates
• False positive rate (FPR) is calculated by dividing the number of false positive predictions by the total number of negative instances in the dataset.
• False negative rate (FNR) is calculated by dividing the number of false negative predictions by the total number of positive instances in the dataset.
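To make these definitions concrete, here is a minimal sketch (my own illustration, not from the course materials) that computes the quantity each fairness definition compares, using made-up admissions data for two hypothetical subgroups:

```python
# Sketch: per-subgroup fairness metrics. 1 = positive decision/label, 0 = negative.
# All names and values below are invented for illustration.

def rates(y_true, y_pred):
    """Return (positive prediction rate, FPR, FNR) for paired truth/prediction lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positive_rate = (tp + fp) / len(y_pred)  # share of positive predictions
    fpr = fp / (fp + tn)  # false positives / all actual negatives
    fnr = fn / (fn + tp)  # false negatives / all actual positives
    return positive_rate, fpr, fnr

# Hypothetical admissions labels and decisions for two subgroups
groups = {
    "circles": ([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]),
    "squares": ([1, 0, 0, 1, 1], [0, 0, 1, 1, 1]),
}

for name, (y_true, y_pred) in groups.items():
    pos, fpr, fnr = rates(y_true, y_pred)
    print(f"{name}: positive rate={pos:.2f}, FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Comparing the outputs across subgroups maps directly onto the three definitions: matching positive rates would satisfy statistical parity, matching FNRs would satisfy equal opportunity, and matching FPRs would satisfy predictive equality.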
WYSIWYG vs. WAE Worldview
• The WYSIWYG (What You See Is What You Get) worldview assumes that observed data is a good measure of the construct space, making individual fairness easy to achieve but group fairness difficult.
• The WAE (We're All Equal) worldview assumes structural bias in the process of making proxy measurements, making group fairness the ideal to strive for and individual fairness potentially discriminatory.

Pareto Frontier
• The Pareto frontier is the set of optimal trade-offs between two or more conflicting objectives: the curve connecting all points that represent the best possible solutions to a problem.
• It helps decision-makers choose the best trade-off between multiple objectives.

Jittering
Jittering is the act of adding random noise to published statistics to ensure differential privacy. The Laplace distribution is used to select the amount of random noise to add, and the amount of jittering directly determines the level of differential privacy.

K-anonymity
k-anonymity is a privacy property that ensures at least some level of privacy for an individual in a dataset by making sure that any combination of "insensitive" attributes appearing in the dataset matches at least k individuals in the dataset. This can be achieved either by removing insensitive attributes or by "fuzzing" the data to make it less precise in identifying individuals.

Differential Privacy
Differential privacy is a privacy guarantee that ensures the results of a study are not too dependent on any one individual's data. It measures how much privacy is lost in an analysis and allows fine-tuning of how similar the results of an analysis are between two parallel universes, one with an individual's data and one without, by controlling the value of epsilon (ε). Smaller values of ε force the two universes to look more similar and therefore give a stronger privacy guarantee.

Randomized Response
Randomized response is a mechanism for ensuring differential privacy in the absence of a trusted aggregator by jittering individual data before sending it to the aggregator. It involves flipping a coin to randomly choose whether to report the truth or a random answer.

Uncertainties
• How to choose appropriate values for parameters like epsilon?
• How to implement these methods in practice?
• How to evaluate the trade-off between privacy and data utility?
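As a small, non-authoritative sketch of how the jittering and randomized response mechanisms described above might look in code (the function names, parameters, and values are my own illustration, assuming the standard Laplace mechanism and a fair-coin version of randomized response):

```python
import random

import numpy as np


def laplace_jitter(true_value, sensitivity, epsilon):
    """Add Laplace noise scaled to sensitivity / epsilon before publishing.

    Smaller epsilon means more noise and a stronger privacy guarantee.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)


def randomized_response(truth):
    """Report a yes/no answer using a fair-coin mechanism.

    First flip: heads means answer honestly; tails means flip again and
    report that second coin instead, so any single answer is deniable.
    """
    if random.random() < 0.5:  # first coin: answer honestly
        return truth
    return random.random() < 0.5  # second coin: report a random answer


# Jitter a published count (sensitivity 1: one person changes the count by 1).
print(laplace_jitter(true_value=120, sensitivity=1, epsilon=0.5))

# If everyone's true answer were "yes", about 75% of reports come back "yes"
# (the 50% truthful reports plus half of the 50% random ones), so the
# aggregator can still estimate the true rate without trusting any one answer.
responses = [randomized_response(True) for _ in range(1000)]
print(sum(responses) / len(responses))
```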