UNIT-1 (Overview of Python & Data Structure)
PREPARED BY: Rachita Mohanty, Assistant Professor, Computer Engineering Department, SAFFRONY INSTITUTE OF TECHNOLOGY

Topics to be Covered
• Core Competencies of a Data Scientist
• Linking Data Science, Big Data, and AI
• Role of Programming
• Data Science Pipeline
• Python's Role in Data Science
• Shifting Profile of Data Scientists
• Working with Python
• Loading Data, Training a Model, Viewing Results
• Using the Python Ecosystem for Data Science

Core Competencies of a Data Scientist:
A data scientist needs a combination of skills from various domains:
• Statistical Analysis: Proficiency in statistical methods to analyze and interpret data.
• Programming: Strong coding skills, often in languages like Python or R.
• Machine Learning: Understanding of machine learning algorithms and techniques.
• Data Manipulation: Ability to clean, pre-process, and transform data.
• Domain Knowledge: Familiarity with the specific industry or field the data pertains to.
• Data Visualization: Skill in creating meaningful visualizations to communicate insights.
• Communication: Ability to explain complex findings to both technical and non-technical stakeholders.

Linking Data Science, Big Data, and AI:
• Data Science: The overall field that involves extracting insights and knowledge from data.
• Big Data: Handling and analyzing large volumes of data that traditional methods cannot manage.
• Artificial Intelligence (AI): Enabling machines to simulate human intelligence, often using data and algorithms.

Differences between Big Data and Data Science:
• Data Science is an area of study; Big Data is a technique to collect, maintain, and process huge volumes of information.
• Data Science is about the collection, processing, analysis, and utilization of data in various operations, and is more conceptual; Big Data is about extracting vital and valuable information from huge amounts of data.
• Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics; Big Data is a technique for tracking and discovering trends in complex data sets.
• The goal of Data Science is to build data-dominant products for a venture; the goal of Big Data is to make data more vital and usable, i.e., by extracting only the important information from huge data within existing traditional aspects.
• Tools mainly used in Data Science include SAS, R, Python, etc.; tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
• Data Science is a superset of Big Data, since it covers data scraping, cleaning, visualization, statistics, and many more techniques; Big Data is a subset of Data Science, as its mining activities form one stage of the Data Science pipeline.
• Data Science is mainly used for scientific purposes; Big Data is mainly used for business purposes and customer satisfaction.
• Data Science broadly focuses on the science of the data; Big Data is more involved with the processes of handling voluminous data.

Role of Programming:
Programming is crucial for a data scientist because it enables:
• Data manipulation and cleaning.
• Implementing machine learning algorithms.
• Building and deploying models.
• Automation of tasks.
• Creating data visualizations.

Data Science Pipeline:
• Preparing the Data: Cleaning, transforming, and pre-processing raw data.
• Exploratory Data Analysis (EDA): Understanding data characteristics through visualization and summary statistics.
• Learning from Data: Applying machine learning algorithms to train models.
• Visualizing and Obtaining Insights: Creating visualizations to interpret and communicate findings.
• Data Products: Developing tools, dashboards, or applications that provide insights to end-users.
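To make these stages concrete, here is a minimal illustrative sketch of the pipeline in Python. It uses scikit-learn's bundled Iris dataset so it runs without any external files; the dataset, the LogisticRegression model, and the plotted columns are choices made for this illustration, not a prescribed recipe.

# A minimal, illustrative data science pipeline sketch
# (assumes pandas, scikit-learn, and matplotlib are installed).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Preparing the data: load it into a DataFrame and drop incomplete rows
iris = load_iris(as_frame=True)
df = iris.frame.dropna()

# 2. Exploratory data analysis: summary statistics
print(df.describe())

# 3. Learning from data: split the data and train a simple classifier
X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Visualizing and obtaining insights: report accuracy, plot two features
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))
plt.scatter(df['sepal length (cm)'], df['petal length (cm)'], c=df['target'])
plt.xlabel('Sepal length (cm)')
plt.ylabel('Petal length (cm)')
plt.title('Iris: sepal length vs petal length')
plt.show()

Each numbered step maps onto one stage of the pipeline above; in a real project the preparation and EDA steps usually dominate the effort.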
Python's Role in Data Science:
Python is widely used in data science due to its simplicity, versatility, and extensive libraries such as NumPy, pandas, scikit-learn, and Matplotlib. It provides tools for data manipulation, analysis, visualization, and machine learning.

Shifting Profile of Data Scientists:
As the field evolves, data scientists are expected to know more advanced techniques, such as deep learning, natural language processing, and AI ethics. They also need to collaborate with domain experts and communicate results effectively.

Working with Python:
Python's simplicity and libraries make it suitable for data science tasks. You can quickly learn Python using online tutorials, courses, and resources such as the official Python documentation.

Loading Data, Training a Model, Viewing Results:
• Loading Data: Use libraries like pandas to read and manipulate data from various sources.
• Training a Model: Employ libraries like scikit-learn to train machine learning models on the data.
• Viewing Results: Visualize model performance and insights using Matplotlib or libraries tailored to specific types of visualizations.
(The pipeline sketch shown earlier walks through all three of these steps.)

Using the Python Ecosystem for Data Science:
Python provides libraries for each specific data science task. The following libraries will be used in this subject:
• Performing fundamental scientific computing using NumPy
• Performing data analysis using pandas
• Plotting the data using matplotlib
• Accessing scientific tools using SciPy
• Implementing machine learning using Scikit-learn
• Going for deep learning with Keras and TensorFlow
• Creating graphs with NetworkX
• Parsing HTML documents using Beautiful Soup

Key Features of NumPy:
• Multidimensional Arrays: NumPy introduces the ndarray, a multi-dimensional array object. These arrays can be 1-dimensional (vectors), 2-dimensional (matrices), or even higher-dimensional.
• Efficient Numerical Operations: NumPy arrays are more memory-efficient and faster for numerical computations than regular Python lists, thanks to NumPy's underlying implementation in C and its optimization for numerical tasks.
• Broadcasting: NumPy allows element-wise operations on arrays of different shapes and dimensions through broadcasting. This simplifies operations and avoids explicit loops (see the sketch after this list).
• Mathematical Functions: NumPy provides a wide range of mathematical functions for basic arithmetic, linear algebra, trigonometry, statistics, and more.
• Array Indexing and Slicing: NumPy supports advanced indexing and slicing operations on arrays, making it easy to access and manipulate specific parts of the data.
• Universal Functions (ufuncs): These are functions that operate element-wise on arrays and support broadcasting. Examples include addition, subtraction, exponentiation, etc.
• Integration with other Libraries: NumPy integrates well with other libraries used in data science, such as pandas (for data manipulation) and Matplotlib (for visualization).
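As promised above, a short illustrative sketch of broadcasting (the array values are invented for the example): a 1-D array is combined with each row of a 2-D array without writing an explicit loop.

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

# Broadcasting stretches 'row' across both rows of 'matrix'
result = matrix + row
print(result)
# [[11 22 33]
#  [14 25 36]]

NumPy compares the shapes from the right: (2, 3) and (3,) are compatible, so the smaller array is reused for each row instead of being copied in a Python-level loop.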
Mathematical Functions:
import numpy as np

# Create NumPy arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise operations
result = a + b

Array Indexing and Slicing using NumPy:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
element = arr[1, 2]  # Accesses the element at row 1, column 2 (value: 6)

Further examples of basic NumPy operations:
import numpy as np

# Creating an array
arr = np.array([1, 2, 3, 4, 5])

# Performing operations
result = arr * 2
print(result)  # [ 2  4  6  8 10]

# Calculating mean and standard deviation
mean_value = np.mean(arr)
std_dev = np.std(arr)
print("Mean:", mean_value)
print("Standard Deviation:", std_dev)

import numpy as np

# Creating a 1-dimensional array
arr1 = np.array([1, 2, 3, 4, 5])

# Creating a 2-dimensional array (matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Basic arithmetic operations
result = arr1 + 10
print(result)  # [11 12 13 14 15]

# Matrix multiplication
mat_product = np.dot(arr2, np.array([2, 2, 2]))
print(mat_product)  # [12 30]

# Broadcasting
broadcasted = arr2 * 2
print(broadcasted)
# [[ 2  4  6]
#  [ 8 10 12]]

# Statistical operations
mean_value = np.mean(arr1)
print(mean_value)  # 3.0

# Slicing
subset = arr1[1:4]
print(subset)  # [2 3 4]

Integration with other Libraries using NumPy in Python:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sine Wave')
plt.show()

Universal Functions (ufuncs) in NumPy:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Basic arithmetic operations, element-wise
add_result = np.add(a, b)
subtract_result = np.subtract(a, b)
multiply_result = np.multiply(a, b)
divide_result = np.divide(a, b)

Key Features of Pandas:
• DataFrame: The central data structure in pandas is the DataFrame, a two-dimensional table-like structure with rows and columns. It allows you to store and manipulate data in a tabular format, similar to a spreadsheet or SQL table.
• Series: A Series is a one-dimensional labeled array, similar to a column in a DataFrame. It can hold data of any type and is useful for working with single columns or as index/column labels.
• Data Manipulation: Pandas provides a wide range of functions for data manipulation, including filtering, sorting, grouping, reshaping, merging, and joining datasets.
• Handling Missing Data: Pandas offers tools to handle missing or null values, allowing you to fill, replace, or drop missing data points (see the sketch after the example below).
• Data I/O: Pandas supports reading and writing data in various file formats, such as CSV, Excel, SQL databases, and more.
• Indexing and Selection: You can select, slice, and filter data using various indexing techniques, such as label-based indexing, positional indexing, and Boolean indexing.
• Data Aggregation: Pandas simplifies the process of summarizing and aggregating data using functions like groupby().
• Time Series: Pandas provides functionality for working with time series data, including date and time parsing, resampling, and time-based operations.
• Data Visualization: While not a primary visualization library, pandas integrates well with visualization libraries like Matplotlib and Seaborn to create informative plots and charts.

Example:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28]}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Filtering data
young_people = df[df['Age'] < 30]
print(young_people)

# Grouping and aggregation
age_group = df.groupby('Age').count()
print(age_group)

# Reading and writing data
df.to_csv('people.csv', index=False)
new_df = pd.read_csv('people.csv')

# Displaying summary statistics
summary_stats = df.describe()
print(summary_stats)
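As noted in the features list, pandas has dedicated tools for missing data. The following is a small illustrative sketch (the values are invented for the example) showing isna(), fillna(), and dropna() on a Series and a DataFrame:

import numpy as np
import pandas as pd

# A Series with a missing value (np.nan marks missing data)
s = pd.Series([1.0, np.nan, 3.0], name='score')
print(s.fillna(0))   # replace missing values with 0
print(s.dropna())    # or drop them entirely

# The same tools work column-wise on a DataFrame
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.isna())             # locate missing entries
print(df.fillna(df.mean()))  # fill each column's gaps with its mean

Whether to fill or drop missing values depends on the dataset; filling with a column statistic, as here, is only one common choice.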
matplotlib:
The matplotlib library gives a MATLAB-like interface for creating data presentations of the analysis. The library was initially limited to 2-D output, but it still provides the means to express analyses graphically. Without this library, we could not create output that people outside the data science community can easily understand.

Example: the sine-wave plot shown earlier, under "Integration with other Libraries using NumPy in Python", is a typical use of matplotlib.

SciPy:
The SciPy stack contains a host of other libraries that we can also download separately. These libraries provide support for mathematics, science, and engineering. When we obtain SciPy, we get a set of libraries designed to work together to create applications of various sorts. These libraries include:
• NumPy
• Pandas
• matplotlib
• Jupyter
• SymPy
etc.

Scikit-learn:
The Scikit-learn library is one of many scikit libraries that build on the capabilities provided by NumPy and SciPy to allow Python developers to perform domain-specific tasks. The Scikit-learn library focuses on data mining and data analysis, and it provides access to the following sorts of functionality:
• Classification
• Regression
• Clustering
• Dimensionality reduction
• Model selection
• Pre-processing
Scikit-learn is the most important library we are going to learn in this subject (the pipeline sketch at the start of this unit already uses it to train and evaluate a classifier).

Keras and TensorFlow:
Keras is an application programming interface (API) used to train deep learning models. An API often specifies a model for doing something, but it doesn't provide an implementation. TensorFlow is an implementation (backend) for Keras; other backends for Keras have included Microsoft's Cognitive Toolkit (CNTK) and Theano.

NetworkX:
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (for example, a GPS setup that discovers routes through city streets). NetworkX also provides the means to output the resulting analysis in a form that humans understand. The main advantage of using NetworkX is that nodes can be anything (including images) and edges can hold arbitrary data.

Beautiful Soup:
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

Important Questions:
• Justify why Python is the most suitable language for Data Science.
• Explain the core competencies of a data scientist.
• Explain the steps of the Data Science Pipeline.
• Explain the different programming styles (programming paradigms) in Python.
• Explain the factors affecting speed of execution.
• Linking Data Science, Big Data, and AI.
• Write down the key features of NumPy and Pandas.
• Write down the difference between Data Science and Big Data.
• List out the different types of libraries used in Python.

Thank You