[Intro to CP 2023/2024] Group 296721

EUROPEAN SOCCER DATA ANALYSIS
Using Python for Data Insights
Federico Romano Gargarella, Kayihura Herta Keza

OVERVIEW
01 Strategy: Approach to the problem and solution
02 File handling: Understand which modules and datasets to use
03 Query 1, 2 & 3: Objectives, Results
04 Output: Saving data

STRATEGY
DATASET ANALYSIS: Examined the structure of the databases by reading the CSV files.
UNDERSTANDING QUERIES: Defined functions to solve each query.
HANDLING LARGE DATASETS: Efficient memory use through line-by-line reading and selective filtering.
MAINTAINING DATA INTEGRITY: Implemented validation checks and error handling.
DATA SERIALIZATION: Applied 'pickle' to save data, for efficient storage of processed results.

FILES HANDLING
    import csv

    with open('Player.csv', 'r', newline='') as file:  # Open file
        reader = csv.DictReader(file)                   # Read file
        for row in reader:                              # Iterate to extract necessary information
            ...                                         # per-row extraction goes here

PURPOSE: Parse and extract specific data fields efficiently, which is crucial for running our analysis queries.

QUERY 1
Write a Python script that identifies the player whose overall rating improved the most between two consecutive timestamps.
Data analysis: Extract player data and ratings over time.
Step 1: Calculate the improvement percentage for each player.
Step 2: Identify the player with the highest improvement percentage.
Result: Display the name of the most improved player, the date range, and the improvement percentage.

QUERY 1 INSIGHT

QUERY 2
Find the match with the highest number of fouls for each league.
Data analysis: Load league and match data.
Step 1: Find the match with the highest number of fouls in each league.
Step 2: Map league IDs to league names.
Result: Prepare output and save results.

QUERY 2 INSIGHT

QUERY 3
Determine the season winners for each season in the Bundesliga.
Data analysis: Load Bundesliga match data.
Step 1: Calculate points for each team per season.
Step 2: Determine the team with the highest points for each season.
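The Query 3 steps above (points per team per season, then the top team) can be sketched roughly as follows. This is a minimal illustration, not the group's actual script: the function name `season_winners` is hypothetical, the column names (`league_id`, `season`, `home_team_api_id`, `away_team_api_id`, `home_team_goal`, `away_team_goal`) are assumptions about the match table, and the usual 3/1/0 scoring for win/draw/loss is assumed because the slides do not state the rule used.

```python
from collections import defaultdict


def season_winners(rows, league_id):
    """Return {season: (team_id, points)} for one league.

    `rows` is any iterable of dicts, for example a csv.DictReader over the
    match file, so matches are processed one at a time.
    All column names used here are assumptions, not confirmed by the slides.
    """
    # season -> team -> accumulated points
    points = defaultdict(lambda: defaultdict(int))

    for row in rows:
        # Selective filtering: skip matches from other leagues.
        if row["league_id"] != league_id:
            continue

        home = row["home_team_api_id"]
        away = row["away_team_api_id"]
        home_goals = int(row["home_team_goal"])
        away_goals = int(row["away_team_goal"])
        table = points[row["season"]]

        # Assumed scoring: 3 points for a win, 1 for a draw, 0 for a loss.
        if home_goals > away_goals:
            table[home] += 3
            table[away] += 0  # make sure the losing team appears in the table
        elif home_goals < away_goals:
            table[away] += 3
            table[home] += 0
        else:
            table[home] += 1
            table[away] += 1

    # For each season, pick the team with the most points.
    return {
        season: max(table.items(), key=lambda item: item[1])
        for season, table in points.items()
    }
```

Note that this sketch does not break ties: if two teams finish level on points, `max` keeps whichever was seen first, whereas a real league table would fall back on goal difference.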
Result: List the Bundesliga season winners for each season.

QUERY 3 INSIGHT

OUTPUT
PICKLE MODULE: Used Python's pickle module to serialize the results of the different queries, preserving their Python data structures.
EFFICIENT DATA SERIALIZATION: Used 'pickle.dump' to serialize and store the query results.
BINARY MODE: Opened files in write-binary ('wb') mode, which pickle requires because it writes bytes; this keeps the data structure and format intact.
DATA MANAGEMENT: Serializing the data into .pkl files makes the results easy to reload and share.
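The output step described above can be sketched as a small save/load round trip. This is only an illustration under stated assumptions: the helper names `save_results` and `load_results`, the file name `results.pkl`, and the placeholder dictionary are all hypothetical and do not come from the project.

```python
import os
import pickle
import tempfile


def save_results(results, path):
    """Serialize `results` to `path` in write-binary ('wb') mode."""
    with open(path, "wb") as file:
        pickle.dump(results, file)


def load_results(path):
    """Read a pickled object back in read-binary ('rb') mode."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round trip with made-up placeholder data (not real query output).
example = {"query_3": {"season_a": "team_x", "season_b": "team_y"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "results.pkl")  # hypothetical file name
    save_results(example, pkl_path)
    restored = load_results(pkl_path)
```

After the round trip, `restored` is a new dictionary equal to `example`, which is what makes .pkl files convenient for passing results between scripts. Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.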