SUMMER TRAINING AT KIIT CAREER ADVISORY AND AUGMENTATION SCHOOL (CAAS) (ONLINE MODE)

Duration of Training: 22/05/2023 to 15/07/2023
Type of Industry/Company: AI/ML (Tech: IT-Software/Data Science)

Prepared By
Name: BISHWAYAN BHATTACHARYYA
Semester: 6th
Roll No.: 2004283
Branch: ETC
Email Id: 2004283@kiit.ac.in
Contact No.: 8961864788

Under the Supervision of
Name: Dr Sasmita Pahadsingh
Designation: Assistant Professor
Department: School of Electronics Engineering, KIIT

School of Electronics Engineering
KIIT (Deemed to be University)
Bhubaneswar – 751024
NOVEMBER 2023

Certificate From the Industry/Company

SUMMARY/COMPENDIUM

The "AI and ML Using Python" course is a comprehensive training program that equips us with the essential knowledge and practical skills required to understand, implement, and apply Data Science (DS) and Machine Learning (ML) concepts and techniques using the Python programming language. With a focus on both theoretical foundations and hands-on applications, the course covers a wide array of topics: Python fundamentals, data types and data structures, strings and lists; machine learning concepts such as supervised and unsupervised learning, linear and logistic regression, and machine learning models; and important data science topics such as descriptive statistics, data distributions, and Python libraries like Pandas and NumPy. We gain proficiency in Python, problem-solving abilities for various AI and ML tasks, and the expertise to apply these techniques to real-world scenarios. By the course's end, attendees are well prepared to engage in AI and ML projects, further their careers, and contribute to innovation across diverse industries, regardless of their previous programming experience. Through this training, we get a comprehensive introduction to the exciting world of AI and ML.

CONTENTS
1. Introduction
2. Technical aspects covered
3. Work done/learnt at the training institute (may include tabular data/photos or experimental results)
4. Conclusion/Remarks

INTRODUCTION

CAAS is a department that strives to mould the careers of KIIT students, from Engineering to Management to Law. It takes care of training them for placements in areas such as aptitude and soft skills, and also trains them for higher studies such as MBA and GRE preparation. The department is also in the process of training students for areas like Banking, CSAT, and PSU test preparation. The team comprises the best trainers in the city, with more than 10 years of teaching experience in their areas of expertise, and strives to ensure that students are trained to take on the challenges in life and overcome adversity with confidence. CAAS, the Career Advisory and Augmentation Services, is a department that takes care of a student's holistic development, starting with placement training (including aptitude training, honing reasoning skills, and tuning up soft skills), with a team of dedicated faculty who are among the best in the industry in experience and student feedback. I took a course offered by Team CAAS during the summer internship period after the 6th semester. My course was 'AI & ML Using Python', in which I attended live lectures conducted by the teachers and practised the concepts taught by completing the given assignments.

TECHNICAL ASPECTS

The course is divided into multiple modules, covering a wide range of topics related to AI and ML using Python.
The technical aspects include:

(i) Introduction to the Python programming language: The "Python Fundamentals" module covers the foundational elements of Python programming essential for data science applications. Beginning with Python basics tailored for data science, the module covers data types, variables, and operators, providing a solid groundwork for manipulating and analyzing data. We learnt if-else statements and loops, which are crucial in data manipulation and analysis. The module also delves into functions and libraries tailored for data analysis, which help to efficiently handle, process, and visualize data. Mastering these Python fundamentals lays the groundwork for the more advanced concepts of artificial intelligence and machine learning.

(ii) Introduction to data science: In this module, we learnt descriptive statistics, including measures of central tendency, mathematical analysis of continuous data, and normal and uniform distributions of data, which offer a comprehensive understanding of a dataset's characteristics. We also learnt about Numerical Python (NumPy), which provides efficient data structures, mathematical functions, and easy integration with other libraries, and helps in data manipulation, analysis, and scientific computing within the Python ecosystem. Lastly, we learnt about Pandas, another library, used for data cleaning, exploration, and analysis in data science. Pandas also provides efficient functions for data transformation, aggregation, and exploratory data analysis.

(iii) Machine learning fundamentals: The "Machine Learning Using Python" module provides a comprehensive understanding of machine learning concepts. It covers foundational topics such as supervised and unsupervised learning, neural networks, the steps to build a machine learning model, and the concepts of linear and logistic regression. We also learnt about data transformation and data pipelines, custom transformers, and classification metrics, which provide a solid foundation for understanding machine learning concepts and algorithms.

WORK DONE/LEARNT

1. INTRODUCTION TO PYTHON

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse.

1.1 FUNDAMENTALS OF PYTHON
Data types (int, float, string)
Conditionals, loops, and functions
Object-oriented programming and using external libraries

1.2 DATA STRUCTURES OF PYTHON

A data structure is a specialized format for organizing, processing, retrieving, and storing data. There are several basic and advanced types of data structures, all designed to arrange data to suit a specific purpose. The different types of data structures include (a combined runnable sketch follows this list):

i. Lists: A list is a mutable, ordered collection that can contain elements of different data types. Example: my_list = [1, 2, 3, "apple", "banana", True]

ii. Tuples: A tuple is an immutable, ordered collection. Once created, the elements cannot be changed. Example: my_tuple = (1, 2, 3, "apple", "banana", True)

iii. Sets: A set is an unordered collection of unique elements. Sets are useful for mathematical operations like union and intersection. Example: my_set = {1, 2, 3, "apple", "banana"}

iv. Dictionaries: A dictionary is an unordered collection of key-value pairs. It provides a fast way to access values based on their keys. Example: my_dict = {"name": "John", "age": 25, "city": "New York"}

v. Strings: Strings are sequences of characters. They are immutable, meaning their values cannot be changed after creation. Example: my_string = "Hello, Python!"

vi. Arrays: Arrays in Python are provided by the array module and are similar to lists, but with a specific data type for the elements. Example: my_array = array('i', [1, 2, 3])

vii. Queues: Queues are implemented in Python using the queue module. They follow the First In, First Out (FIFO) principle. Example: q = Queue(); q.put(1)

viii. Stacks: Stacks are implemented in Python using lists. They follow the Last In, First Out (LIFO) principle. Example: my_stack = []; my_stack.append(1); my_stack.pop()

ix. Linked Lists: Linked lists can be built with the collections module (collections.deque is the closest built-in), or by hand with node objects where each node contains data and a reference to the next node. Example: my_linked = deque([1, 2, 3])

x. Doubly Linked Lists: Doubly linked lists have nodes with references to both the next and previous nodes; collections.deque behaves this way, supporting fast appends and pops at both ends. Example: my_dll = deque(); my_dll.append(1); my_dll.appendleft(0)
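The examples for arrays, queues, stacks, and linked lists above can be consolidated into one minimal, runnable sketch using only standard-library modules (the variable names are illustrative):

    from array import array          # typed arrays
    from queue import Queue          # FIFO queue
    from collections import deque    # doubly linked list under the hood

    # Array: all elements share one declared type ('i' means signed int)
    my_array = array('i', [1, 2, 3])

    # Queue: First In, First Out
    q = Queue()
    q.put("first")
    q.put("second")
    print(q.get())        # -> "first"

    # Stack: a plain list used Last In, First Out
    stack = []
    stack.append("bottom")
    stack.append("top")
    print(stack.pop())    # -> "top"

    # deque: fast appends/pops at both ends (doubly linked internally)
    d = deque([2, 3])
    d.appendleft(1)       # add at the front
    d.append(4)           # add at the back
    print(d)              # -> deque([1, 2, 3, 4])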
1.3 DATA TYPES OF PYTHON

Python supports various data types that allow you to store and manipulate different kinds of data. Here is an overview of the fundamental data types in Python:

Numeric Types:
int: Integer type, representing whole numbers, e.g., 5, -10, 1000.
float: Float type, representing floating-point numbers, e.g., 3.14, -0.5, 2e3.

Boolean Type:
bool: Boolean type, representing the truth values True and False. It is often used in conditional statements and logical operations.

Sequence Types:
str: String type, representing a sequence of characters. Strings are enclosed in single (' '), double (" "), or triple (''' ''' or """ """) quotes.
list: List type, representing an ordered, mutable sequence of elements. Lists are created using square brackets, e.g., [1, 2, 3].
tuple: Tuple type, representing an ordered, immutable sequence of elements. Tuples are created using parentheses, e.g., (1, 2, 3).

Set Types:
set: Set type, representing an unordered collection of unique elements. Sets are created using curly braces, e.g., {1, 2, 3}.

Mapping Type:
dict: Dictionary type, representing an unordered collection of key-value pairs. Dictionaries are created using curly braces with key-value pairs, e.g., {'key': 'value'}.

None Type:
None represents the absence of a value or a null value. It is often used to indicate that a variable holds nothing or that a function returns nothing.

Binary Sequence Types:
bytes: Bytes type, representing a sequence of bytes. It is immutable and often used for binary data.
bytearray: Bytearray type, similar to bytes but mutable.

Text Type:
str: Although mentioned earlier, it is worth noting that strings in Python (str) are Unicode text.

These data types provide the building blocks for creating complex data structures and performing various operations in Python. Understanding them is crucial for effective programming and data manipulation in Python.
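The binary types and None were described above without inline examples; a brief sketch (the values are illustrative):

    # bytes: immutable sequence of bytes
    raw = b"hello"

    # bytearray: mutable counterpart of bytes
    buf = bytearray(raw)
    buf[0] = ord("H")     # in-place modification is allowed
    print(buf.decode())   # -> "Hello"

    # None: the absence of a value
    result = None
    if result is None:
        print("no value yet")

    # type() reports the data type of any object
    print(type(raw), type(buf), type(result))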
1.4 IDENTIFIERS IN PYTHON

In Python, identifiers are essential components of the language: they are the names given to variables, functions, classes, modules, and other entities. The different uses of identifiers are:

Variables: Identifiers are used to name variables. They can start with a letter (a-z, A-Z) or an underscore (_), followed by letters, digits (0-9), or underscores. Example: my_variable, _count, temperature_2

Functions: Identifiers are used to name functions, following the same rules as variables. Example: calculate_sum(), print_message(), get_data()

Classes: Identifiers are used to name classes, following the same rules as variables. Example: Car, Person, DataProcessor

Modules: Identifiers are used to name modules. Module names should follow the same rules as variables. Example: math, random, my_module

1.5 OPERATORS IN PYTHON

i. Arithmetic Operators: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ** (exponentiation)
ii. Comparison Operators: == (equal to), != (not equal to), < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to)
iii. Logical Operators: and (logical AND), or (logical OR), not (logical NOT)
iv. Assignment Operators: = (assignment), += (addition assignment), -= (subtraction assignment), *= (multiplication assignment), /= (division assignment)
v. Membership Operators: in (true if a value is found in the specified sequence), not in (true if a value is not found in the specified sequence)
vi. Identity Operators: is (true if both variables refer to the same object), is not (true if both variables do not refer to the same object)
vii. Bitwise Operators: & (bitwise AND), | (bitwise OR), ^ (bitwise XOR), ~ (bitwise NOT), << (left shift), >> (right shift)

The identifiers and operators above collectively form the foundation for expressing logic and performing operations in Python.

1.6 CONDITIONAL STATEMENTS IN PYTHON

Conditional statements in Python allow you to control the flow of your program based on certain conditions. The primary conditional statements in Python are (illustrated, together with loops, in the sketch after 1.7):

I. if Statement: The if statement is used to execute a block of code only if a specified condition is true.
II. if-else Statement: The if-else statement allows you to execute one block of code if the condition is true and another block if the condition is false.
III. if-elif-else Statement: The if-elif-else statement is used when there are multiple conditions to check. It allows you to specify multiple blocks of code, and the first true condition encountered will be executed.
IV. Nested if Statements: You can nest if statements within other if statements to create more complex conditions.
V. Ternary (Conditional) Operator: The ternary operator provides a concise way to write simple if-else statements in a single line.

Conditional statements are fundamental for implementing decision-making logic in Python programs. They help the code respond dynamically to different scenarios, making it more flexible and powerful.

1.7 LOOPS IN PYTHON

Loops in Python allow you to repeatedly execute a block of code. The two main types of loops in Python are for loops and while loops.

I. for Loop: The for loop is used for iterating over a sequence (such as a list, tuple, string, or range) or other iterable objects.
II. while Loop: The while loop continues to execute a block of code as long as a specified condition is true.
III. break Statement: The break statement is used to exit a loop prematurely, regardless of whether the loop condition is true.
IV. continue Statement: The continue statement is used to skip the rest of the code inside a loop for the current iteration and move on to the next iteration.
V. else Clause with Loops: Both for and while loops in Python can have an else clause. The code in the else block is executed when the loop condition becomes False or when the loop has iterated over all items.
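A minimal sketch combining the conditional statements of 1.6 with the loop constructs of 1.7 (the numbers and thresholds are arbitrary):

    marks = 72

    # if / elif / else
    if marks >= 90:
        grade = "A"
    elif marks >= 60:
        grade = "B"
    else:
        grade = "C"

    # ternary (conditional) operator in a single line
    status = "pass" if marks >= 40 else "fail"
    print(grade, status)              # -> B pass

    # for loop with continue and break
    for n in range(10):
        if n % 2 == 0:
            continue                  # skip even numbers
        if n > 7:
            break                     # stop once n exceeds 7
        print(n)                      # -> 1, 3, 5, 7

    # while loop with an else clause (runs when the condition becomes False)
    count = 3
    while count > 0:
        count -= 1
    else:
        print("loop finished")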
1.8 FUNCTIONS IN PYTHON

Functions in Python are blocks of reusable code designed to perform a specific task. They help in organizing code, promoting code reuse, and enhancing readability.

Function definition: The def keyword is used, followed by the function name, a parameter list in parentheses, and a colon. The function body is indented.

Function call: A function is invoked by a function call. To execute a function, it is called by its name, and the required arguments, if any, are provided.

1.9 PARAMETERS AND ARGUMENTS

Parameters: These are the variables defined in the function signature. They act as placeholders for values that will be passed during the function call.
Arguments: These are the actual values passed to the function during the function call. (A short sketch follows below.)
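A small sketch of a function definition and calls, where base and exponent are the parameters and the passed values are the arguments (the function name is illustrative):

    # 'base' and 'exponent' are parameters; 'exponent' has a default value
    def power(base, exponent=2):
        """Return base raised to exponent."""
        return base ** exponent

    # the values supplied in the calls are arguments
    print(power(3))                    # -> 9  (uses the default exponent)
    print(power(3, 4))                 # -> 81 (positional arguments)
    print(power(base=2, exponent=5))   # -> 32 (keyword arguments)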
1.10 LIBRARIES IN PYTHON

Python has a rich ecosystem of libraries and frameworks that extend its functionality for various purposes, ranging from data science and machine learning to web development and GUI programming. Some key libraries are:

Scikit-learn, for basic ML algorithms such as clustering, linear and logistic regression, classification, and others.
Pandas, for high-level data structures and analysis. It allows merging and filtering of data, as well as gathering it from external sources such as Excel.
Keras, for deep learning. It allows fast calculation and prototyping, as it can use the GPU in addition to the CPU of the computer.
TensorFlow, for working with deep learning by setting up, training, and utilizing artificial neural networks with massive datasets.
Matplotlib, for creating 2D plots, histograms, charts, and other forms of visualization.
NLTK, for computational linguistics and natural language recognition and processing.
Scikit-image, for image processing.
PyBrain, for neural networks and unsupervised and reinforcement learning.
Caffe, for deep learning; it allows switching between the CPU and the GPU and can process 60+ million images a day on a single NVIDIA K40 GPU.
StatsModels, for statistical algorithms and data exploration.

2. INTRODUCTION TO DATA SCIENCE

2.1 INTRODUCTION TO DATA SCIENCE AND STATISTICS

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines techniques from statistics, mathematics, computer science, and domain-specific knowledge to analyze and interpret complex datasets. The main goal of data science is to uncover hidden patterns, trends, and valuable information that can aid decision-making in various domains. Data scientists often employ statistical models, machine learning algorithms, and data visualization tools to derive meaningful insights from data.

Statistics is a branch of mathematics that involves collecting, analyzing, interpreting, presenting, and organizing data. In the context of data science, statistical methods play a crucial role in drawing inferences, making predictions, and quantifying uncertainty. Key concepts in statistics include:

Descriptive Statistics: Summarizing and describing the main features of a dataset, such as mean, median, mode, and standard deviation.
Inferential Statistics: Drawing conclusions and making predictions about a population based on a sample of data.

2.2 CENTRAL TENDENCY AND SPREAD OF DATA

Central tendency measures indicate the typical or central value of a set of data. The three main measures of central tendency are:

I. Mean: Formula: Mean = (x1 + x2 + ... + xn) / n. The mean is the average of all values in a dataset. It is sensitive to extreme values.
II. Median: The median is the middle value in a sorted dataset. If there is an even number of observations, the median is the average of the two middle values. It is less affected by extreme values than the mean.
III. Mode: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all.

SPREAD OF DATA: Spread (dispersion) measures describe the extent to which data values deviate from the central tendency. The main measures of spread are:

I. Range: Formula: Range = Maximum Value − Minimum Value. The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of variability.
II. Variance: Formula: Variance = Σ(xi − Mean)² / n. Variance measures how far each data point in the set is from the mean. A higher variance indicates greater variability.
III. Standard Deviation: Formula: Standard Deviation = √Variance. The standard deviation is the square root of the variance. It provides a measure of the average distance between each data point and the mean.
IV. Interquartile Range (IQR): Formula: IQR = Q3 − Q1. The IQR is the range of values between the first quartile (Q1) and the third quartile (Q3). It is less sensitive to extreme values than the range.

2.3 DATA DISTRIBUTION

In data science, understanding data distribution is a fundamental concept because it helps data analysts and scientists gain insight into the properties of a dataset and make informed decisions. Data distribution refers to the way data values are spread across different values or ranges. There are several types of data distributions, each with its own characteristics, but the most common distribution curves used in data science are:

I. Normal Distribution (Gaussian Distribution): Also known as the bell curve. Characterized by a symmetrical, unimodal shape. The mean, median, and mode are all equal and located at the center. Many natural phenomena follow this distribution.
II. Uniform Distribution: All values in the dataset have the same probability of occurring. It forms a rectangular shape when visualized.

2.4 NUMERICAL PYTHON (NUMPY) IN DATA SCIENCE

NumPy (Numerical Python) is a fundamental library in data science and scientific computing in Python. It provides support for arrays and matrices, as well as a variety of mathematical functions to operate on them. NumPy is particularly well suited for numerical and data-manipulation tasks. Here, we focus on 2D arrays and common NumPy operations.

Creating a 2D array (matrix) in NumPy: To create a 2D array, we use the numpy.array function and provide a nested list of values.

Common NumPy operations with 2D arrays (a combined sketch follows this list):
I. Array dimensions: To get the dimensions (shape) of a 2D array, the .shape attribute is used.
II. Accessing elements: Specific elements or slices of a 2D array can be accessed using indexing.
III. Arithmetic operations: NumPy provides element-wise operations such as addition, subtraction, multiplication, and division.
IV. Transposition: To transpose an array, we use the .T attribute.
V. Matrix multiplication: To perform matrix multiplication, we use the @ operator or the numpy.dot function.

We can also perform statistical calculations with methods like mean(), max(), and min().
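The operations above, gathered into one runnable sketch (the matrix values are arbitrary):

    import numpy as np

    # creating a 2D array (matrix) from a nested list
    A = np.array([[1, 2], [3, 4]])

    print(A.shape)       # (2, 2)  -> array dimensions
    print(A[0, 1])       # 2       -> element at row 0, column 1
    print(A + A)         # element-wise addition (similarly -, *, /)
    print(A.T)           # transpose
    print(A @ A)         # matrix multiplication, same as np.dot(A, A)
    print(A.mean(), A.max(), A.min())   # basic statistical calculations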
2.5 EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) is a crucial step in the data science process. It involves the initial examination and visualization of a dataset to understand its main characteristics, identify patterns, uncover potential outliers, and generate hypotheses for further analysis. EDA helps data scientists and analysts gain insights and make informed decisions before diving into more advanced modeling and statistical techniques. The key steps and techniques used in exploratory data analysis are (a short Pandas sketch follows this list):

I. Data Collection: Gather the data from various sources, such as databases, files, APIs, or web scraping.
II. Data Cleaning: Clean the data by handling missing values, correcting data types, and addressing inconsistencies. This step ensures the data is suitable for analysis.
III. Descriptive Statistics: Calculate and examine basic statistics such as mean, median, mode, standard deviation, and range to understand the central tendency and spread of the data.
IV. Data Visualization: Create visualizations to explore the data's characteristics. Common plots and charts include histograms, box plots, scatter plots, bar plots, and line charts.
V. Correlation Analysis: Explore the relationships between variables by calculating correlation coefficients. This helps identify which variables are related and to what extent.
VI. Outlier Detection: Identify potential outliers or extreme values that may have a significant impact on analysis. Box plots and scatter plots are often used to spot outliers.
VII. Data Distribution: Determine the type of data distribution (e.g., normal, uniform, skewed) to understand the underlying structure of the data.
VIII. Feature Engineering: Create new features or transform existing ones to improve the performance of machine learning models.
IX. Dimensionality Reduction: Reduce the dimensionality of the data using techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
X. Data Grouping and Aggregation: Group data by categories or time intervals and calculate aggregate statistics to uncover patterns and trends.
XI. Data Cross-Tabulation: Create cross-tabulations or contingency tables to examine relationships between categorical variables.
XII. Time Series Analysis: If the data has a time component, analyze the time series using techniques like seasonality decomposition and autocorrelation.
XIII. Hypothesis Generation: Formulate hypotheses about the data based on observed patterns and relationships. These hypotheses can guide further analysis.
XIV. Interactive Dashboards: Create interactive dashboards or data exploration tools using libraries like Plotly or Tableau to facilitate exploration and reporting.
XV. Documentation: Keep detailed records of the exploratory analysis, including findings, visualizations, and initial insights. This documentation is valuable for later stages of the data science process.
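A minimal Pandas sketch of a few of these steps; the file name "data.csv" and the column name "category" are hypothetical placeholders:

    import pandas as pd

    # data collection: load a (hypothetical) CSV file into a DataFrame
    df = pd.read_csv("data.csv")

    # data cleaning: drop duplicate rows, fill missing numeric values with the mean
    df = df.drop_duplicates()
    df = df.fillna(df.mean(numeric_only=True))

    # descriptive statistics and correlation analysis
    print(df.describe())                  # per-column summary statistics
    print(df.corr(numeric_only=True))     # correlation coefficients

    # grouping and aggregation (assumes a categorical column named "category")
    print(df.groupby("category").mean(numeric_only=True))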
3. MACHINE LEARNING FUNDAMENTALS

3.1 INTRODUCTION TO MACHINE LEARNING

Machine Learning is the science of getting computers to learn without being explicitly programmed. It is closely related to computational statistics, which focuses on making predictions using computers. Machine Learning focuses on the development of computer programs that can access data and use it to learn for themselves. The process of learning begins with observations or data, such as examples providing inputs and outputs, in order to look for patterns in the data and make better decisions in the future based on the examples provided. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

3.2 TYPES OF MACHINE LEARNING

Types of machine learning algorithms differ in their approach, the type of data they input and output, and the type of task they are intended to solve. Broadly, machine learning can be categorized into three categories:

I. Supervised Learning: Supervised learning is a type of learning in which we are given a dataset and already know what the correct output should look like, with the idea that there is a relationship between the input and the output. It is the task of learning a function that maps an input to an output based on example input-output pairs, deducing the function from labeled training data consisting of a set of training examples.

II. Unsupervised Learning: Unsupervised learning allows us to approach problems with little or no idea what the results should look like. We can derive structure by clustering the data based on relationships among the variables. It is a type of self-organized learning that helps in finding previously unknown patterns in a dataset without pre-existing labels.

III. Reinforcement Learning: Reinforcement learning is a learning method that interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are its most relevant characteristics. This method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize performance.

3.3 STEPS TO BUILD A MACHINE LEARNING MODEL

• Importing the data: For importing data, the Pandas library is widely used.
• Cleaning or preprocessing the data: Machine learning algorithms do not work well on raw data. Before we can feed such data to an ML algorithm, we must preprocess it by applying transformations, converting the raw data into a clean dataset. Typical preprocessing steps include removing duplicate values, removing NULL values, and replacing zeros with the mean, all of which improve the results.
• Splitting the data into training and test sets: Before feeding data to an ML algorithm, we split it into a training set and a test set, so that the algorithm can train itself on the training set and be evaluated on the test set. Usually this split is 80-20, i.e., 80% training set and 20% test set.
• Choosing a model: The next step is to choose the model or algorithm that will learn from the given dataset and make predictions accordingly. (A minimal end-to-end sketch of these steps appears after the linear regression overview below.)

3.4 MACHINE LEARNING MODELS

There are many types of machine learning algorithms suited to different use cases. A machine learning algorithm works in two stages: under supervised learning, we split a dataset into training data (usually around 80%) and test data (around 20%). The following are common machine learning algorithms in Python:

I. Linear Regression: Linear regression is one of the supervised machine learning algorithms in Python; it observes continuous features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we call it simple linear regression or multiple linear regression. This is one of the most popular Python ML algorithms, and often under-appreciated. It assigns optimal weights to variables to create a line ax + b that predicts the output. We often use linear regression to estimate real values, such as a number of calls or the cost of houses, based on continuous variables. The regression line is the best-fitting line Y = a*X + b denoting the relationship between the independent and dependent variables.
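A minimal end-to-end sketch of the model-building steps from 3.3 applied to linear regression with scikit-learn, using synthetic data (all values are illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # synthetic data: y is roughly 3x + 2 plus noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

    # 80-20 split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # choose and train a model
    model = LinearRegression()
    model.fit(X_train, y_train)              # learn the line y = a*x + b

    print(model.coef_, model.intercept_)     # a and b, close to 3 and 2
    print(model.score(X_test, y_test))       # R^2 on the held-out test set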
II. Logistic Regression: Logistic regression is a supervised classification algorithm in Python that finds its use in estimating discrete values like 0/1, yes/no, and true/false, based on a given set of independent variables. We use a logistic function to predict the probability of an event, which gives an output between 0 and 1. Although it says 'regression', this is actually a classification algorithm. Logistic regression fits data to a logit function and is also called logit regression.

III. Decision Tree: A decision tree falls under supervised machine learning algorithms in Python and is used for both classification and regression, although mostly for classification. The model takes an instance, traverses the tree, and compares important features with determined conditional statements; whether it descends to the left or the right child branch depends on the result. Usually, more important features are closer to the root. A decision tree can work on both categorical and continuous dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the target variable takes a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable takes continuous values (typically real numbers) are called regression trees.

IV. Support Vector Machine (SVM): SVM is a supervised classification algorithm and one of the most important machine learning algorithms in Python. It plots a line that divides the different categories of your data. In this ML algorithm, we calculate the vector to optimize the line, ensuring that the closest point in each group lies farthest from the others. While you will almost always find this to be a linear vector, it can be other than that. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When data are unlabeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these formed groups.

V. kNN Algorithm: This is a Python machine learning algorithm for classification and regression, though mostly for classification. It is a supervised learning algorithm that compares cases using a distance function, usually Euclidean. It classifies new cases using a majority vote of their k nearest neighbors: the class it assigns to a case is the one most common among that case's k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel-density "balloon" estimator with a uniform kernel.

VI. K-Means Algorithm: k-means is an unsupervised algorithm that solves the clustering problem by classifying data into a number of clusters. The data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters. k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. k-means clustering is rather easy to apply even to large datasets, particularly when using heuristics such as Lloyd's algorithm, and it is often used as a preprocessing step for other algorithms, for example to find a starting configuration. The problem itself is computationally difficult (NP-hard). k-means originates from signal processing and still finds use in that domain. In cluster analysis, the k-means algorithm partitions the input dataset into k partitions (clusters), and k-means clustering has been used as a feature learning (or dictionary learning) step in both (semi-)supervised and unsupervised learning.
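A short clustering sketch with scikit-learn's KMeans on synthetic 2D points (the data and the cluster count are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # two obvious groups of 2D points
    points = np.array([[1, 1], [1.5, 2], [1, 0],
                       [8, 8], [8.5, 9], [9, 8]])

    # partition the points into k = 2 clusters
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(points)

    print(labels)                    # e.g. [0 0 0 1 1 1] -> cluster of each point
    print(kmeans.cluster_centers_)   # the two cluster means (prototypes)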
CONCLUSION

In summary, this report has presented a comprehensive overview of the pivotal roles played by Python, Machine Learning (ML), and Data Science in our data-centric world. Python's versatility and rich library ecosystem underpin data manipulation and analysis, making it an essential tool for both data scientists and developers. Machine Learning, as a subfield of AI, empowers computers to learn from data, driving transformative applications across industries. Data Science bridges domain knowledge with statistics and computer science, enabling data-driven decision-making through techniques like exploratory data analysis and statistical inference. Together, these domains are propelling us into an era of data-driven insights, automation, and innovation, redefining our approach to complex problem-solving and decision support.