SUMMER TRAINING
AT
KIIT CAREER ADVISORY AND AUGMENTATION SCHOOL (CAAS)
(ONLINE MODE)
Duration of Training: 22/05/2023 to 15/07/2023
Type of Industry/Company: AI/ML (Tech: IT Software / Data Science)
Prepared By
Name: Bishwayan Bhattacharyya
Semester: 6th
Roll No.: 2004283
Branch: ETC
Email Id: 2004283@kiit.ac.in
Contact No.: 8961864788
Under the Supervision of
Name: Dr. Sasmita Pahadsingh
Designation: Assistant Professor
Department: School of Electronics Engineering, KIIT
School of Electronics Engineering
KIIT (Deemed to be University)
Bhubaneswar – 751024
NOVEMBER 2023
Certificate From the Industry/Company
SUMMARY/COMPENDIUM
The "AI and ML Using Python" course is a comprehensive training program
that equips us with the essential knowledge and practical skills required to
understand, implement, and apply Data Science(DS) and Machine Learning
(ML) concepts and techniques using the Python programming language. With a
focus on both theoretical foundations and hands-on applications, the course
covers a wide array of topics, including Python fundamentals, data types in
python, data structures, strings and lists in python, machine learning concepts
such as supervised and unsupervised learning, linear and logistic regression, and
machine learning models, and important data science topics like descriptive
stats, data distribution, python libraries like pandas and numpy. We can gain
proficiency in Python, problem-solving abilities for various AI and ML tasks,
and the expertise to apply these techniques to real-world scenarios. By the
course's end, attendees will be well-prepared to engage in AI and ML projects,
further their careers, and contribute to innovation across diverse industries,
regardless of their previous programming experience. We get a comprehensive
introduction to the exciting world of AI and ML through this training.
CONTENTS
1. Introduction
2. Technical Aspects Covered
3. Work Done/Learnt at the Training Institute (may include tabular data/photos or any experimental result)
4. Conclusion/Remarks
INTRODUCTION
CAAS is a department that strives to mould the careers of KIIT students, from Engineering to Management to Law. It takes care of training them for placements in areas such as aptitude and soft skills, and also trains them for higher studies (MBA, GRE, etc.). The team is also in the process of training students for areas like Banking, CSAT, and PSU test preparation. It comprises the best trainers in the city, with more than 10 years of teaching experience in their areas of expertise, and strives to ensure that students are trained to take on the challenges in life and overcome adversities with confidence.
CAAS, the Career Advisory and Augmentation Services, is a department that takes care of a student's holistic development, starting from placement training (including aptitude training, honing reasoning skills, and tuning up soft skills), with a team of dedicated faculties who are the best in the industry in experience and student feedback.
I took a course offered by Team CAAS in the summer internship period after the 6th semester. My course was 'AI & ML Using Python', where I attended live lectures conducted by the teachers and practiced the concepts taught by completing the given assignments.
TECHNICAL ASPECTS
The course is divided into multiple modules, covering a wide range of topics related to AI and ML using Python. The technical aspects include:
(i) Introduction to the Python programming language: The "Python Fundamentals" module of the course covers the foundational elements of Python programming essential for data science applications. Beginning with Python basics tailored for data science, the module covers data types, variables, and operators, providing a solid groundwork for manipulating and analyzing data. We learnt if-else statements and loops, which are crucial in data manipulation and analysis. The module also delves into functions and libraries tailored for data analysis, which help to efficiently handle, process, and visualize data. Mastering these Python fundamentals lays the groundwork for more advanced concepts of artificial intelligence and machine learning.
(ii) Introduction to data science: In this module, we learnt descriptive statistics, including measures of central tendency, mathematical analysis of continuous data, and the normal and uniform distributions of data, which offer a comprehensive understanding of a dataset's characteristics. We also learnt about Numerical Python (NumPy), which provides efficient data structures, mathematical functions, and easy integration with other libraries, and helps in data manipulation, analysis, and scientific computing within the Python ecosystem. Lastly, we learnt about Pandas, another library, used for data cleaning, exploration, and analysis in data science. Pandas also provides efficient functions for data transformation, aggregation, and exploratory data analysis.
(iii) Machine learning fundamentals: The "Machine Learning Using Python" training program provides a comprehensive understanding of machine learning concepts. This module covers foundational topics such as supervised and unsupervised learning, neural networks, the steps to build a machine learning model, and the concepts of linear and logistic regression. We also learnt about data transformation and data pipelines, custom transformers, and classification metrics, which provide a solid foundation for understanding machine learning concepts and algorithms.
WORK DONE/LEARNT
1. INTRODUCTION TO PYTHON
Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Python's simple, easy-to-learn syntax emphasizes
readability and therefore reduces the cost of program maintenance. Python
supports modules and packages, which encourages program modularity and
code reuse.
1.1 FUNDAMENTALS OF PYTHON
 Data types (int, float, string)
 Conditionals, loops, and functions
 Object-oriented programming and using external libraries
1.2 DATA STRUCTURES OF PYTHON
A data structure is a specialized format for organizing, processing, retrieving
and storing data. There are several basic and advanced types of data structures,
all designed to arrange data to suit a specific purpose. The different types of
data structures include:
i. Lists:
 A list is a mutable, ordered collection that can contain elements of different
data types.
 Example: my_list = [1, 2, 3, "apple", "banana", True]
ii. Tuples:
 A tuple is an immutable, ordered collection. Once created, the elements
cannot be changed.
 Example: my_tuple = (1, 2, 3, "apple", "banana", True)
iii. Sets:
 A set is an unordered collection of unique elements. Sets are useful for
mathematical operations like union and intersection.
 Example: my_set = {1, 2, 3, "apple", "banana"}
iv. Dictionaries:
 A dictionary is a collection of key-value pairs (insertion-ordered since Python 3.7). It provides a fast way to access values based on their keys.
 Example: my_dict = {"name": "John", "age": 25, "city": "New York"}
v. Strings:
 Strings are sequences of characters. They are immutable, meaning their
values cannot be changed after creation.
 Example: my_string = "Hello, Python!"
vi. Arrays:
 Arrays in Python are provided by the array module and are similar to lists
but with a specific data type for the elements.
 Example:
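(a minimal sketch using the standard array module; the values are illustrative)
import array
my_array = array.array('i', [1, 2, 3, 4])   # 'i' = signed int type code; all elements share this type
my_array.append(5)
print(my_array[0])   # prints 1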
vii. Queues:
 Queues are implemented in Python using the queue module. They follow the
First In, First Out (FIFO) principle.
 Example:
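(a minimal sketch using the queue module's FIFO Queue class)
import queue
my_queue = queue.Queue()
my_queue.put("apple")    # enqueue
my_queue.put("banana")
print(my_queue.get())    # dequeue; prints "apple" (First In, First Out)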
viii. Stacks:
 Stacks are implemented in Python using lists. They follow the Last In, First
Out (LIFO) principle.
 Example:
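(a minimal sketch using a plain list as a stack)
my_stack = []
my_stack.append("apple")    # push
my_stack.append("banana")
print(my_stack.pop())       # pop; prints "banana" (Last In, First Out)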
ix. Linked Lists:
Linked lists are typically implemented with a small custom Node class (the collections module's deque offers similar behaviour). They consist of nodes where each node contains data and a reference to the next node.
 Example:
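(a minimal sketch with a hypothetical Node class, as Python has no built-in linked list)
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None        # reference to the next node

head = Node(1)
head.next = Node(2)
print(head.data, head.next.data)   # prints 1 2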
x. Doubly Linked Lists:
 Doubly linked lists have nodes with references to both the next and previous
nodes.
 Example:
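(a minimal sketch extending the Node idea with a previous-node reference)
class DNode:
    def __init__(self, data):
        self.data = data
        self.prev = None        # reference to the previous node
        self.next = None        # reference to the next node

first, second = DNode(1), DNode(2)
first.next, second.prev = second, first   # link both directions
print(second.prev.data)                   # prints 1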
1.3 DATA TYPES OF PYTHON
Python supports various data types that allow you to store and manipulate different
types of data. Here's an overview of some fundamental data types in Python:
 Numeric Types:
 int: Integer type represents whole numbers, e.g., 5, -10, 1000.
 float: Float type represents floating-point numbers, e.g., 3.14, -0.5, 2e3.
 Boolean Type:
 bool: Boolean type represents the truth values True or False. It is often used
in conditional statements and logical operations.
 Sequence Types:
 str: String type represents a sequence of characters. Strings are enclosed in
single (' '), double (" "), or triple (''' ''' or """ """) quotes.
 list: List type represents an ordered, mutable sequence of elements. Lists are
created using square brackets, e.g., [1, 2, 3].
 tuple: Tuple type represents an ordered, immutable sequence of elements.
Tuples are created using parentheses, e.g., (1, 2, 3).
 Set Types:
 set: Set type represents an unordered collection of unique elements. Sets are
created using curly braces, e.g., {1, 2, 3}.
 Mapping Type:
 dict: Dictionary type represents a collection of key-value pairs (insertion-ordered since Python 3.7). Dictionaries are created using curly braces with key-value pairs, e.g., {'key': 'value'}.
 None Type: The None type represents the absence of a value or a null
value. It is often used to indicate that a variable or function returns nothing.
 Sequence Types (Binary):
 bytes: Bytes type represents a sequence of bytes. It is immutable and often used for binary data.
 bytearray: Bytearray type is similar to bytes but mutable.
 Text Type:
 str: Although mentioned earlier, it's worth noting that strings in Python (str)
are Unicode text.
 The above data types provide the building blocks for creating complex data
structures and performing various operations in Python. Understanding
these types is crucial for effective programming and data manipulation in
Python.
1.4 IDENTIFIERS IN PYTHON
In Python, identifiers are essential components of the language that enable you
to name variables, functions, and other entities, as well as perform various
operations on data. The different Identifiers are:
 Variables: Identifiers are used to name variables. They can start with a letter
(a-z, A-Z) or an underscore (_), followed by letters, digits (0-9), or
underscores. Example: my_variable, _count, temperature_2
 Functions: Identifiers are used to name functions, following similar rules as
variables. Example: calculate_sum(), print_message(), get_data()
 Classes: Identifiers are used to name classes, following the same rules as
variables. Example: Car, Person, DataProcessor
 Modules: Identifiers are used to name modules. Module names should
follow the same rules as variables. Example: math, random, my_module
1.5 OPERATORS IN PYTHON
i. Arithmetic Operators: + (addition), - (subtraction), * (multiplication), /
(division), % (modulo), ** (exponentiation)
ii. Comparison Operators: == (equal to), != (not equal to), < (less than), >
(greater than), <= (less than or equal to), >= (greater than or equal to)
iii. Logical Operators: and (logical AND), or (logical OR), not (logical NOT)
iv. Assignment Operators: = (assignment), += (addition assignment), -=
(subtraction assignment), *= (multiplication assignment), /= (division
assignment)
v. Membership Operators: in (true if a value is found in the specified
sequence), not in (true if a value is not found in the specified sequence)
vi. Identity Operators: is (true if both variables refer to the same object), is not
(true if both variables do not refer to the same object)
vii. Bitwise Operators: & (bitwise AND), | (bitwise OR), ^ (bitwise XOR), ~
(bitwise NOT), << (left shift), >> (right shift)
The above identifiers and operators collectively form the foundation for
expressing logic and performing operations in Python.
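As a minimal sketch, the following lines demonstrate several of these operator families (the values are illustrative):
x, y = 10, 3
print(x % y, x ** y)       # arithmetic: 1 1000
print(x > y and x != y)    # comparison + logical: True
print(3 in [1, 2, 3])      # membership: True
print(x is y)              # identity: False
print(x & y, x << 1)       # bitwise: 2 20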
1.6 CONDITIONAL STATEMENTS IN PYTHON
Conditional statements in Python allow you to control the flow of your program
based on certain conditions. The primary conditional statements in Python are:
I. if Statement: The if statement is used to execute a block of code only if a
specified condition is true.
II. if-else Statement: The if-else statement allows you to execute one block of
code if the condition is true and another block if the condition is false.
III. if-elif-else Statement: The if-elif-else statement is used when there are
multiple conditions to check. It allows you to specify multiple blocks of code,
and the first true condition encountered will be executed.
IV. Nested if Statements: You can nest if statements within other if statements
to create more complex conditions.
V. Ternary (Conditional) Operator: The ternary operator provides a concise
way to write simple if-else statements in a single line.
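The sketch below combines an if-elif-else chain with the ternary operator (the variable names and thresholds are illustrative):
marks = 72
if marks >= 90:
    grade = "A"
elif marks >= 60:
    grade = "B"              # the first true condition encountered is executed
else:
    grade = "C"
result = "Pass" if marks >= 40 else "Fail"   # ternary operator
print(grade, result)         # prints B Pass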
Conditional statements are fundamental for implementing decision-making
logic in Python programs. They help to make the code respond dynamically to
different scenarios, making it more flexible and powerful.
1.7 LOOPS IN PYTHON
Loops in Python allow us to repeatedly execute a block of code. The two main
types of loops in Python are for loops and while loops.
I. for Loop: The for loop is used for iterating over a sequence (such as a list,
tuple, string, or range) or other iterable objects.
II. while Loop: The while loop continues to execute a block of code as long as
a specified condition is true.
III. break Statement: The break statement is used to exit a loop prematurely,
regardless of whether the loop condition is true.
IV. continue Statement: The continue statement is used to skip the rest of the
code inside a loop for the current iteration and move to the next iteration.
V. else Clause with Loops: Both for and while loops in Python can have an
else clause. The code in the else block is executed when the loop condition
becomes False or when the loop has iterated over all items.
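A minimal sketch showing a for loop with continue and an else clause, and a while loop with break (the values are illustrative):
for fruit in ["apple", "banana", "cherry"]:
    if fruit == "banana":
        continue                 # skip the rest of this iteration
    print(fruit)
else:
    print("loop finished")       # runs because the loop was never broken

n = 0
while True:
    n += 1
    if n == 3:
        break                    # exit the loop prematurely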
1.8 FUNCTIONS IN PYTHON
Functions in Python are blocks of reusable code designed to perform a specific
task. They help in organizing code, promoting code reuse, and enhancing
readability. Here's an overview of functions in Python:
 Function definition: The def keyword is used, followed by the function name, parentheses enclosing any parameters, and a colon. The function body is indented.
 Function call: To execute a function, it is called by its name, and the required arguments, if any, are provided.
1.9 PARAMETERS AND ARGUMENTS
 Parameters: These are the variables defined in the function signature. They
act as placeholders for values that will be passed during the function call.
 Arguments: These are the actual values passed to the function during the
function call.
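A minimal sketch; greet, name, and greeting are hypothetical names used for illustration:
def greet(name, greeting="Hello"):       # name and greeting are parameters
    return greeting + ", " + name + "!"

print(greet("Python"))                   # "Python" is an argument; prints Hello, Python!
print(greet("Python", "Hi"))             # prints Hi, Python!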
1.10 LIBRARIES IN PYTHON
Python has a rich ecosystem of libraries and frameworks that extend its
functionality for various purposes, ranging from data science and machine
learning to web development and GUI programming. Here are some key
libraries in Python:
 Scikit-learn for handling basic ML algorithms like clustering, linear and logistic regression, classification, and others.
 Pandas for high-level data structures and analysis. It allows merging and filtering of data, as well as gathering it from other external sources like Excel, for instance.
 Keras for deep learning. It allows fast calculations and prototyping, as it uses the GPU in addition to the CPU of the computer.
 TensorFlow for working with deep learning by setting up, training, and utilizing artificial neural networks with massive datasets.
 Matplotlib for creating 2D plots, histograms, charts, and other forms of
visualization.
 NLTK for working with computational linguistics, natural language
recognition, and processing.
 Scikit-image for image processing.
 PyBrain for neural networks, unsupervised and reinforcement learning.
 Caffe for deep learning that allows switching between the CPU and the GPU and processing 60+ million images a day using a single NVIDIA K40 GPU.
 StatsModels for statistical algorithms and data exploration.
2. INTRODUCTION TO DATA SCIENCE
2.1 INTRODUCTION TO DATA SCIENCE AND STATISTICS
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and
unstructured data. It combines techniques from statistics, mathematics,
computer science, and domain-specific knowledge to analyze and interpret
complex datasets. The main goal of data science is to uncover hidden patterns,
trends, and valuable information that can aid decision-making in various
domains. Data scientists often employ statistical models, machine learning
algorithms, and data visualization tools to derive meaningful insights from data.
Statistics is a branch of mathematics that involves collecting, analyzing,
interpreting, presenting, and organizing data. In the context of data science,
statistical methods play a crucial role in drawing inferences, making predictions,
and quantifying uncertainty. Key concepts in statistics include:
 Descriptive Statistics: Summarizing and describing the main features of a
dataset, such as mean, median, mode, and standard deviation.
 Inferential Statistics: Drawing conclusions and making predictions about a
population based on a sample of data.
2.2 CENTRAL TENDENCY AND SPREAD OF DATA
Central tendency measures indicate the typical or central value of a set of data.
The three main measures of central tendency are:
I. Mean:
 Formula: x̄ = (Σ xᵢ) / n, the sum of all values divided by the number of values.
 The mean is the average of all values in a dataset. It is sensitive to extreme values.
II. Median:
 The median is the middle value in a sorted dataset. If there is an even
number of observations, the median is the average of the two middle values.
 It is less affected by extreme values compared to the mean.
III. Mode:
 The mode is the value that appears most frequently in a dataset.
 A dataset may have one mode, more than one mode, or no mode at all.
SPREAD OF DATA:
Spread measures describe the extent to which data values deviate from the central tendency. The main measures of spread are:
I. Range:
 Formula: Range = Maximum Value − Minimum Value
 The range is the difference between the maximum and minimum values in a
dataset. It provides a simple measure of variability.
II. Variance:
 Formula: σ² = Σ(xᵢ − μ)² / N, where μ is the mean and N is the number of data points.
 Variance measures how far each data point in the set is from the mean. A higher variance indicates greater variability.
III. Standard Deviation:
 Formula: σ = √(Σ(xᵢ − μ)² / N)
 The standard deviation is the square root of the variance. It provides a
measure of the average distance between each data point and the mean.
IV. Interquartile Range (IQR): The IQR is the range of values between the first
quartile (Q1) and the third quartile (Q3). It is less sensitive to extreme values
than the range.
2.3 DATA DISTRIBUTION
In data science, understanding data distribution is a fundamental concept
because it helps data analysts and scientists gain insights into the properties of a
dataset and make informed decisions. Data distribution refers to the way data
values are spread or distributed across different values or ranges. There are
several types of data distributions, each with its own characteristics, but the
most common distribution curves used in data science are:
I. Normal Distribution (Gaussian Distribution):
 Also known as the bell curve.
 Characterized by a symmetrical and unimodal shape.
 Mean, median, and mode are all equal and located at the center.
 Many natural phenomena follow this distribution.
II. Uniform Distribution:
 All values in the dataset have the same probability of occurring.
 It forms a rectangular shape when visualized.
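A minimal sketch generating samples from both distributions with NumPy (the sizes and parameters are illustrative):
import numpy as np
rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0, scale=1, size=1000)    # bell-shaped
uniform_data = rng.uniform(low=0, high=1, size=1000)   # rectangular
print(normal_data.mean(), normal_data.std())           # close to 0 and 1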
2.4 NUMERICAL PYTHON (NUMPY) IN DATA SCIENCE
NumPy (Numerical Python) is a fundamental library in data science and
scientific computing in Python. It provides support for arrays and matrices, as
well as a variety of mathematical functions to operate on these arrays. NumPy is
particularly well-suited for numerical and data manipulation tasks. Here, we'll
focus on 2D arrays and common NumPy operations.
 Creating a 2D Array (Matrix) in NumPy:
To create a 2D array, we use the numpy.array function and provide a nested
list of values.
 Common NumPy operations with 2D arrays:
I. Array dimensions: To get the dimensions (shape) of a 2D array, the .shape
attribute is used.
II. Accessing elements: Specific elements or slices of a 2D array can be
accessed using indexing.
III. Arithmetic operations: NumPy provides element-wise operations like addition, subtraction, multiplication, and division.
IV. Transposition: To perform transposition, we use the .T attribute.
V. Matrix multiplication: To perform matrix multiplication, we use the @ operator or the numpy.dot function.
We can also perform statistical calculations like mean(), max(), min(), etc., as shown in the sketch below.
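A minimal sketch of these 2D-array operations (the matrices are illustrative):
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a.shape)                  # (2, 2): array dimensions
print(a[0, 1])                  # 2: accessing an element
print(a + b)                    # element-wise addition
print(a.T)                      # transposition
print(a @ b)                    # matrix multiplication (same as np.dot(a, b))
print(a.mean(), a.max(), a.min())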
2.5 EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is a crucial step in the data science process. It
involves the initial examination and visualization of a dataset to understand its
main characteristics, identify patterns, uncover potential outliers, and generate
hypotheses for further analysis. EDA helps data scientists and analysts gain
insights and make informed decisions before diving into more advanced
modeling and statistical techniques. Here are key steps and techniques used in
exploratory data analysis:
I. Data Collection: Gather the data from various sources, such as databases,
files, APIs, or web scraping.
II. Data Cleaning: Clean the data by handling missing values, correcting data
types, and addressing inconsistencies. This step ensures the data is suitable
for analysis.
III. Descriptive Statistics: Calculate and examine basic statistics such as mean,
median, mode, standard deviation, and range to understand the central
tendency and spread of the data.
IV. Data Visualization: Create visualizations to explore the data's
characteristics. Common plots and charts include histograms, box plots,
scatter plots, bar plots, and line charts.
V. Correlation Analysis: Explore the relationships between variables by
calculating correlation coefficients. This helps identify which variables are
related and to what extent.
VI. Outlier Detection: Identify potential outliers or extreme values that may
have a significant impact on analysis. Box plots and scatter plots are often
used to spot outliers.
VII. Data Distribution: Determine the type of data distribution (e.g., normal,
uniform, skewed) to understand the underlying structure of the data.
VIII. Feature Engineering: Create new features or transform existing ones to
improve the performance of machine learning models.
IX. Dimensionality Reduction: Reduce the dimensionality of the data using techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
X. Data Grouping and Aggregation: Group data by categories or time
intervals and calculate aggregate statistics to uncover patterns and trends.
XI. Data Cross-Tabulation: Create cross-tabulations or contingency tables to
examine relationships between categorical variables.
XII. Time Series Analysis: If the data has a time component, analyze time
series data using techniques like seasonality decomposition and
autocorrelation.
XIII. Hypothesis Generation: Formulate hypotheses about the data based on
observed patterns and relationships. These hypotheses can guide further
analysis.
XIV. Interactive Dashboards: Create interactive dashboards or data
exploration tools using libraries like Plotly or Tableau to facilitate
exploration and reporting.
XV. Documentation: Keep detailed records of the exploratory analysis,
including findings, visualizations, and initial insights. This documentation is
valuable for later stages of the data science process.
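A minimal sketch of the first EDA steps with pandas; the file name sales.csv is hypothetical:
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())                       # first few rows
print(df.describe())                   # descriptive statistics
print(df.isnull().sum())               # missing values per column
print(df.corr(numeric_only=True))      # correlation analysis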
3. MACHINE LEARNING FUNDAMENTALS
3.1 INTRODUCTION TO MACHINE LEARNING
Machine Learning is the science of getting computers to learn without being explicitly programmed. It is closely related to computational statistics, which focuses on making predictions using computers. Machine Learning focuses on the development of computer programs that can access data and use it to learn for themselves. The process of learning begins with observations or data, such as examples providing inputs and outputs, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust actions accordingly.
3.2 TYPES OF MACHINE LEARNING
The types of machine learning algorithms differ in their approach, the type of
data they input and output, and the type of task that they are intended to solve.
Broadly, Machine Learning can be categorized into three categories.
I. Supervised Learning: Supervised Learning is a type of learning in which we are given a data set and we already know what the correct output should look like, with the idea that there is a relationship between the input and the output. Basically, it is the task of learning a function that maps an input to an output based on example input-output pairs. It deduces a function from labeled training data consisting of a set of training examples.
II. Unsupervised Learning: Unsupervised Learning is a type of learning that allows us to approach problems with little or no idea of what our results should look like. We can derive structure by clustering the data based on relationships among the variables in the data. Basically, it is a type of self-organized learning that helps in finding previously unknown patterns in a data set without pre-existing labels.
III. Reinforcement Learning: Reinforcement learning is a learning method that
interacts with its environment by producing actions and discovers errors or
rewards. Trial and error search and delayed reward are the most relevant
characteristics of reinforcement learning. This method allows machines and
software agents to automatically determine the ideal behavior within a specific
context in order to maximize their performance.
3.3 STEPS TO BUILD A MACHINE LEARNING MODEL
• Importing the data: For importing data, the pandas library is widely used.
• Cleaning or preprocessing the data: Machine Learning algorithms don't work well with raw data. Before we can feed such data to an ML algorithm, we must preprocess it by applying some transformations. With data preprocessing, we convert raw data into a clean data set. In preprocessing we perform steps like removing duplicate values, removing NULL values, and replacing 0 with the mean, because this improves our results.
• Splitting the data into a training set and a test set: Before feeding data to an ML algorithm, we first split it into two sets, a training set and a test set, so that the algorithm can train itself on the training set and test its model on the test set. Usually this split is 80-20, i.e., 80% training set and 20% test set.
• Choosing a model: The next step is to choose the model or algorithm that will evaluate the given dataset and make predictions accordingly.
(Flowchart to build an ML model: import data → preprocess → split into training and test sets → choose a model.)
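A minimal sketch of these steps with pandas and scikit-learn; the file name data.csv and the column names are hypothetical:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv").dropna()              # import and clean the data
X, y = df[["feature"]], df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)          # 80-20 split
model = LinearRegression().fit(X_train, y_train)   # choose and train a model
print(model.score(X_test, y_test))                 # evaluate on the test set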
3.4 MACHINE LEARNING MODELS
There are many types of Machine Learning algorithms specific to different use cases. As we work with datasets, a machine learning algorithm works in two stages: we usually split the data around 80%-20% between the training and testing stages. Under supervised learning, we split a dataset into training data and test data in Python ML. The following are the algorithms of Python Machine Learning:
I. Linear Regression:
Linear regression is one of the supervised Machine learning algorithms in
Python that observes continuous features and predicts an outcome. Depending
on whether it runs on a single variable or on many features, we can call it
simple linear regression or multiple linear regression. This is one of the most
popular Python ML algorithms and often under-appreciated. It assigns optimal
weights to variables to create a line ax+b to predict the output. We often use
linear regression to estimate real values, like the number of calls or the cost of houses, based on continuous variables. The regression line is the best line that fits Y = a*X + b to denote the relationship between the independent and dependent variables.
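A minimal sketch with scikit-learn, fitting points that lie on y = 2x + 1 (illustrative data):
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])        # independent variable
y = np.array([3, 5, 7, 9])                # dependent variable: y = 2x + 1
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # approximately [2.] and 1.0
print(model.predict([[5]]))               # approximately [11.]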
II. Logistic Regression:
Logistic regression is a supervised classification algorithm, a unique Machine Learning algorithm in Python that finds its use in estimating discrete values like 0/1, yes/no, and true/false, based on a given set of independent variables. We use a logistic function to predict the probability of an event, and this gives us an output between 0 and 1. Although it says 'regression', this is actually a classification algorithm. Logistic regression fits data to a logit function and is also called logit regression.
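A minimal sketch with scikit-learn (the toy data is illustrative):
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]                    # discrete 0/1 labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2], [11]]))           # [0 1]
print(clf.predict_proba([[11]]))          # probabilities between 0 and 1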
III. Decision Tree:
A decision tree falls under supervised Machine Learning algorithms in Python and comes of use for both classification and regression, although mostly for classification. This model takes an instance, traverses the tree, and compares important features with a determined conditional statement. Whether it descends to the left child branch or the right depends on the result. Usually, more important features are closer to the root. Decision Tree, a Machine Learning algorithm in Python, can work on both categorical and continuous dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
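A minimal sketch with scikit-learn on its built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict(X[:1]))                 # class label of the first sample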
IV. Support Vector Machine (SVM):
SVM is a supervised classification algorithm, one of the most important Machine Learning algorithms in Python, that plots a line dividing the different categories of your data. In this ML algorithm, we calculate the vector to optimize the line; this is to ensure that the closest point in each group lies farthest from the others. While you will almost always find this to be a linear vector, it can be other than that. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When data are unlabeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then map new data to these formed groups.
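A minimal sketch with scikit-learn's SVC (the toy points are illustrative; kernel="rbf" would give a non-linear classifier via the kernel trick):
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]
clf = SVC(kernel="linear").fit(X, y)      # linear dividing line
print(clf.predict([[1, 2], [9, 8]]))      # [0 1]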
V. kNN Algorithm:
This is a Python Machine Learning algorithm for classification and regression, although mostly for classification. It is a supervised learning algorithm that considers the closest training examples and uses a distance function, usually Euclidean, to compare distances. It then analyzes the results and assigns each new point to the group it lies closest to. It classifies new cases using a majority vote of k of its neighbors: the class it assigns to a case is the one most common among its k nearest neighbors. For this, it uses a distance function. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel.
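A minimal sketch with scikit-learn (k = 3; the toy data is illustrative):
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [10.5]]))       # [0 1], by majority vote of 3 neighbors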
VI. K-Means Algorithm:
k-Means is an unsupervised algorithm that solves the problem of clustering. It classifies data using a number of clusters: the data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters. k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. k-means clustering is rather easy to apply even to large data sets, particularly when using heuristics such as Lloyd's algorithm, although the problem itself is computationally difficult (NP-hard). It is often used as a preprocessing step for other algorithms, for example to find a starting configuration. k-means originates from signal processing and still finds use in this domain. In cluster analysis, the k-means algorithm can be used to partition the input data set into k partitions (clusters). k-means clustering has also been used as a feature learning (or dictionary learning) step, in either (semi-)supervised learning or unsupervised learning.
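A minimal sketch with scikit-learn (two illustrative clusters):
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [10, 10], [10.5, 11]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                         # cluster assignment of each point
print(km.cluster_centers_)                # the two cluster means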
CONCLUSION
In summary, this report has presented a comprehensive overview of the pivotal roles played by Python, Machine Learning (ML), and Data Science in our data-centric world. Python's versatility and rich library ecosystem underpin data
manipulation and analysis, making it an essential tool for both data scientists
and developers. Machine Learning, as a subfield of AI, empowers computers to
learn from data, driving transformative applications across industries. Data
Science bridges domain knowledge with statistics and computer science,
enabling data-driven decision-making through techniques like exploratory data
analysis and statistical inference. Together, these domains are propelling us into
an era of data-driven insights, automation, and innovation, redefining our
approach to complex problem-solving and decision support.