
Introduction to Machine Learning

Industries across healthcare, finance, and technology are seeking
professionals who can apply machine learning to solve real-world
problems and make data-driven decisions. This has led to a high demand
for machine learning skills in the job market. In this document, you will gain a solid foundational understanding of machine learning, from theory to practical applications.
What is Machine Learning?
Machine learning is a subfield of artificial intelligence that focuses on the design of systems that learn from experience (which, for a machine, means data) and use it to make decisions and predictions. Machine
learning enables computers to act and make data-driven decisions rather
than being explicitly programmed to carry out a certain task. These
programs are designed to learn and improve over time when exposed to
new data.
AI, Machine Learning, and Deep Learning
Artificial intelligence is a broader concept of machines being able to carry
out tasks in a smart way. Machine learning is a subset or a current
application of AI. It is based on the idea that machines should be able to
access data and learn from it. Deep learning is a subset of machine
learning where similar machine learning algorithms are used to train deep
neural networks to achieve better accuracy in cases where the former was
not performing well.
Types of Machine Learning
Machine learning can be categorized into three types:

Supervised Learning: This is where you have input variables x and
an output variable y, and you use an algorithm to learn the
mapping function from the input to the output (y = f(x)). The goal
is to approximate the mapping function so well that whenever you have new input data x, you can predict the output variable y for that data.

Unsupervised Learning: This is where you only have input data
(X) and no corresponding output variables. The goal for
unsupervised learning is to model the underlying structure or
distribution in the data to learn more about the data.

Reinforcement Learning: This is where an agent learns to behave
in an environment by performing certain actions and observing
the rewards or punishments it gets for those actions.
Each type of machine learning is used in various domains such as banking,
healthcare, and retail.
Supervised learning is a category of machine learning in which the training data set is composed of labeled examples: for instance, pictures labeled as containing a duck, which lets the model learn to recognize ducks in images. The resulting predictive model can then be deployed to the production environment, where it can classify new pictures. Popular supervised learning algorithms include linear regression, random forest, and support vector machines. Common use cases include speech recognition in mobile phones, weather apps, biometric attendance, creditworthiness prediction in banking, patient readmission rate prediction in healthcare, and product analysis in retail.
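As a tiny illustration of this labeled-data workflow, one of the algorithms named above, random forest, can be trained on labeled examples and then asked to classify a new one. The features, labels, and data below are made up for the example.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical labeled examples: [weight_kg, wingspan_cm] -> species label.
X = [[1.1, 80], [1.3, 90], [0.4, 25], [0.5, 30]]
y = ['duck', 'duck', 'sparrow', 'sparrow']

# Fit on the labeled data, then predict the label of a new, unseen example.
model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict([[1.2, 85]]))  # expected to print ['duck']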
Unsupervised learning, on the other hand, does not have any expected
output associated with the training data set. Instead, the algorithm detects
patterns based on the characteristics of the input data, allowing it to
group similar data instances together. Popular unsupervised learning
algorithms include k-means, the Apriori algorithm, and hierarchical clustering.
Common use cases include customer segmentation in banking, MRI data
categorization in healthcare, and product recommendation in retail.
Finally, reinforcement learning allows software agents and machines to automatically determine the ideal behavior within a specific context so as to maximize their performance. The learning agent interacts with the environment and leverages both exploration and exploitation mechanisms to improve its knowledge of the environment and select the next action. Popular
reinforcement learning algorithms include Q-Learning and SARSA.
Common use cases include self-driving cars, gaming AI, and robotics.
Reinforcement Learning
In reinforcement learning, an agent learns from its environment to take
actions that maximize a reward. The agent observes the environment, selects an action using a policy, and receives a reward or penalty; it then updates its policy to improve its decision-making, as sketched in the code after the examples below. Reinforcement learning is used in various industries, such as banking, healthcare, and retail.

In banking, it is used to create a next best offer model for a call
center.

In healthcare, it is used to allocate medical resources for different
types of ER cases.

In retail, it can be used to reduce excess stock with dynamic
pricing.
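To make the observe-act-update loop concrete, here is a minimal Q-learning sketch. The tiny one-dimensional world, the reward scheme, and the hyperparameters (alpha, gamma, epsilon) are all illustrative assumptions, not part of the examples above.

import random

n_states, n_actions = 5, 2             # states 0..4; actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    # Move left or right; reward 1 for reaching the last state, else 0.
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    return next_state, (1.0 if next_state == n_states - 1 else 0.0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Exploration vs. exploitation: act randomly with probability epsilon.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update: nudge Q(s, a) toward reward + gamma * max Q(s', .).
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state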
Data Science
Data science is all about uncovering insights from data and making
smarter business decisions. It covers a wide spectrum of domains,
including artificial intelligence, machine learning, and deep learning.
Artificial intelligence is a subset of data science that lets machines simulate
human-like behavior. Machine learning is a subfield of artificial intelligence
that provides machines the ability to learn and improve from experiences
without being explicitly programmed. Deep learning is a part of machine
learning that uses computational models and algorithms inspired by the structure and function of the human brain.
Recommendation Engines
A recommendation engine filters down a list of choices for each user
based on their browsing history, ratings, profile details, transactional
details, and more. It provides every user with a unique view of the ecommerce website based on their profile and allows them to select relevant products.
Data science and machine learning are used to build a recommendation
engine. The data science lifecycle starts with defining the business
requirements and gathering data from different sources. The next phase is
data processing or cleaning, which involves preparing the data for analysis.
The final phase is model building, where machine learning algorithms are
used to build a recommendation model.
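As a rough sketch of the filtering idea (not a production design), the following recommends an unrated product to a user based on the ratings of the most similar other user. The ratings matrix and the choice of cosine similarity are assumptions made for the example.

import numpy as np

# Rows = users, columns = products; 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 0, 1, 5],
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def recommend(user, k=1):
    # Find the most similar other user, then suggest products this user
    # has not rated, ordered by how the similar user rated them.
    sims = [(cosine(ratings[user], ratings[other]), other)
            for other in range(len(ratings)) if other != user]
    _, nearest = max(sims)
    unseen = [j for j in range(ratings.shape[1]) if ratings[user, j] == 0]
    return sorted(unseen, key=lambda j: ratings[nearest, j], reverse=True)[:k]

print(recommend(0))  # product indices suggested for user 0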
Data Science Life Cycle
The data science life cycle consists of several stages that are crucial in the
process of extracting insights from data. The stages are data gathering,
data cleaning, data exploration, data modeling, and deployment and
optimization.
Data Cleaning
Data cleaning is considered one of the most time-consuming tasks in data
science. It involves removing irrelevant or inconsistent data and identifying
and fixing inconsistencies. This stage is crucial because it ensures that the
data is in the desired format for analysis.
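As a small illustration of what this stage can look like in pandas, the sketch below drops duplicate records, treats impossible values as missing, and fills missing values with the median. The column names and data are hypothetical.

import pandas as pd

# Hypothetical raw data with a duplicated row, missing ages, and an impossible age.
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4],
    'age': [34, None, None, 29, -1],
    'city': ['Pune', 'Delhi', 'Delhi', 'Mumbai', 'Pune'],
})

df = df.drop_duplicates()                         # remove repeated records
df.loc[df['age'] <= 0, 'age'] = None              # treat impossible values as missing
df['age'] = df['age'].fillna(df['age'].median())  # fill missing ages with the median
print(df)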
Data Exploration
Data exploration involves understanding patterns in the data and
retrieving useful insights. In a recommendation engine, this stage involves
studying the shopping behavior of each customer to suggest relevant
items to them.
Data Modeling
Data modeling incorporates machine learning, which is a method used by
data science to retrieve useful information. The stages in machine learning
are importing data, data cleaning, creating a model, model training, model
testing, and improving the efficiency of the model.
Machine Learning Engineers
Machine learning engineers develop machines and systems that can learn and apply knowledge without explicit direction, creating programs that enable machines to act without being specifically programmed to perform those tasks.
Skills Needed for Machine Learning Engineers
The skills needed for machine learning engineers include programming
skills, computer science fundamentals, knowledge of statistics and
calculus, and expertise in machine learning algorithms and tools.
Skills Needed to Become an ML Engineer
To become a successful machine learning (ML) engineer, it is important to
have a strong foundation in programming languages such as Python and
Java. Probability and statistics are also crucial skills as they are at the heart
of many ML algorithms. Understanding data modeling and evaluation is
essential to estimate the underlying structure of a given dataset and find
useful patterns. Standard implementations of ML algorithms are widely available through libraries and packages, but it is still necessary to choose a suitable model and learning procedure and to understand the basics of machine learning. Lastly, understanding how software engineering and system design work is important, as a machine learning engineer's typical output is software, often a small component that fits into a larger ecosystem of products and services.
Roles and Responsibilities
The main role of an ML engineer is to create AI products by building
efficient applications. Responsibilities include studying prototypes,
designing and building ML systems, selecting the right dataset and data
representation methods, running ML tests and experiments, training
systems for top-notch accuracy, and retraining them as necessary. Careful
system design may be required to avoid bottlenecks and let algorithms
scale well with increasing data volumes. Software engineering best
practices are also crucial for productivity, collaboration, quality, and
maintainability.
Salary and Trends
According to the 2019 Indeed report, the average salary for an ML
engineer in India is ₹6,89,460, whereas the average salary for an ML engineer in the United States is $112,000. The demand
for ML engineers is increasing exponentially as the world's challenges
require complex systems to solve them.
Companies Hiring ML Engineers
Big players such as Apple, Uber, Facebook, and Salesforce are constantly
hiring ML engineers and paying high salaries, and the opportunities for ML engineers continue to grow rapidly.
The Future of Machine Learning
Machine learning has limitless applicability and is impacting various fields
such as education, finance, and healthcare. Machine learning techniques
are already being applied to critical areas within healthcare, such as care
variation reduction efforts and medical scan analysis. As these applications multiply, the demand for engineers who can build such systems will only keep increasing.
Real-Life Applications of Machine Learning
Classification, anomaly detection, and clustering algorithms are some of
the techniques used in machine learning. Classification is used to predict
categories such as gender or spam. Anomaly detection is used to identify
unusual data points or outliers. Clustering is used to group data. Machine
learning techniques have many applications in solving real-life problems
such as intrusion detection, system health monitoring, fraud detection, and
fault detection.
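For instance, here is a minimal anomaly-detection sketch using scikit-learn's IsolationForest; the data is made up so that two points are obvious outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # typical points
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])          # obvious anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(random_state=42).fit(X)
labels = clf.predict(X)           # +1 = normal, -1 = anomaly
print(np.where(labels == -1)[0])  # indices flagged as outliers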
Clustering and Regression in Machine Learning
Clustering is the process of dividing data points into groups based on their
similarities, while regression is the process of predicting continuous values
based on the relationship between the features of the data. In this tutorial,
we will explore both clustering and regression using Python and
Anaconda.
Clustering
Imagine you run a rental store and want to understand customer
preferences to scale up your business. Clustering can help you group
customers into different clusters based on their purchasing habits so that
you can use a separate strategy for each cluster.
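A minimal sketch of that idea with scikit-learn's KMeans follows; the two features (visits per month, average spend) and the customer data are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [visits per month, average spend].
customers = np.array([
    [2, 15], [3, 20], [2, 18],     # occasional, low spend
    [10, 60], [12, 55], [11, 65],  # frequent, high spend
    [6, 35], [7, 30],              # in between
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # center of each cluster

Each cluster can then be targeted with its own strategy, as described above.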
Regression
Regression is used in a vast number of applications, including stock price
prediction. It allows you to make predictions from data by learning the
relationship between the features of your data and some observed
continuous valued response.
Choosing the Right Algorithm
Now that we understand the basics of clustering and regression, how do
we choose the right algorithm to use? In this tutorial, we will use the iris
flower dataset to create six different machine learning models and pick the
best one with the most reliable accuracy.
The Iris Flower Dataset
The iris flower dataset consists of 150 observations of iris flowers, with
four columns of flower measurements in centimeters and a fifth column
indicating the species of the flower. There are three species of iris in the
dataset: setosa, versicolor, and virginica. This dataset is ideal for beginners
as it is straightforward to understand, small, and numeric.
Getting Started with Python and Anaconda
To get started, we will use Anaconda with Python 3 installed. We
will be using Jupyter Notebook, a web-based interactive computing
notebook environment, to write and execute our Python code.
Loading the Dataset
We can load the dataset directly from the UCI Machine Learning
Repository using pandas. We will specify the name of each column when
loading the data to help us explore it later.
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)
Once loaded, we can explore the dataset using pandas:
print(dataset.shape)
This will print the number of rows and columns in the dataset.
Exploring the Data Set
To start, let's look at the shape of our data set. The shape attribute gives us the total number of rows and columns in the data set. We can also print out the first few instances of the data set using the head method.
print(dataset.shape)
print(dataset.head(30))
We can also get a summary of each attribute in the data set using the describe method. This gives us the count, mean, standard deviation, minimum and maximum values, and the quartiles.
print(dataset.describe())
To see the number of instances that belong to each class, we can use the groupby method together with the size method.
print(dataset.groupby('class').size())
Next, we can create some visualizations of the data set. For example, we
can create a box and whisker plot for each attribute using the plot
method.
import matplotlib.pyplot as plt

dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()
We can also create a histogram of each input variable using the hist
method.
dataset.hist()
plt.show()
To see the interaction between the different variables, we can create a scatter matrix using the scatter_matrix function from pandas.plotting.

from pandas.plotting import scatter_matrix

scatter_matrix(dataset)
plt.show()
Creating a Validation Data Set
Now that we have explored the data set, let's create a validation data set.
We will split our data set into a training data set and a validation data set.
The first 80% of the data will be used to train our model, and the
remaining 20% will be used as the validation data set.
from sklearn.model_selection import train_test_split

array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
Introduction
In this article, we will be discussing the process of building a model using
machine learning algorithms and evaluating its accuracy using the
validation set. We will be using Python for this purpose. Specifically, we
will be discussing the process of defining arrays, splitting the dataset,
selecting a model, and building and evaluating the model. We will also be
discussing the concept of regression in machine learning and its various
types. We will be focusing on simple linear regression in this article.
Building a Model
First, we define an array consisting of all the values from the dataset. Next,
we define a variable x which consists of all the columns from the array
from 0 to 4, and a variable y which consists of the array starting from the
fourth column, which is the class column. We define our validation size
and seed and split our training dataset into x_train, x_test, y_train, and
y_test. We then use 10-fold cross-validation to estimate the accuracy of
the model and evaluate it using various algorithms like logistic regression,
linear discriminant analysis, k nearest neighbor, regression trees, and
support vector machines. From the output, it seems that linear
discriminant analysis was the most accurate model that we tested.
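The passage above describes the comparison without showing code; a sketch consistent with it might look like the following, reusing X_train, Y_train, and seed from the earlier split and assuming naive Bayes as the sixth model.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = [
    ('LR', LogisticRegression(max_iter=200)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]
for name, model in models:
    # 10-fold cross-validation on the training split only.
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))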
Evaluating the Model
We want to get an idea of the accuracy of the model on our validation set
or the testing data set. This will give us an independent final check on the
accuracy of the best model. We can run the LDA model directly on the
validation set and summarize the result as a final score, a confusion matrix,
and a classification report.
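A sketch of that final check, reusing the validation split created earlier:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, Y_train)
predictions = lda.predict(X_validation)

print(accuracy_score(Y_validation, predictions))    # final accuracy score
print(confusion_matrix(Y_validation, predictions))  # per-class errors
print(classification_report(Y_validation, predictions))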
Regression in Machine Learning
Regression is the construction of a model to predict a dependent attribute from a set of independent attribute variables. In regression problems, the output variable is a real or continuous value, such as salary, weight, or area. Regression is used in applications like housing and investing to model the relationship between a dependent variable and independent variables. Regression techniques include simple linear regression, polynomial regression, support vector regression, decision tree regression, random forest regression, and logistic regression.
Simple Linear Regression
Simple linear regression is a regression technique in which the independent variable has a linear relationship with the dependent variable. The model is a straight best-fit line, and the main goal of simple linear regression is to take the given data points and find the line that fits them best.
Linear regression is a statistical technique used for modeling the
relationship between a dependent variable and one or more independent
variables. It is based on the concept of a best-fit line that predicts the
value of the dependent variable based on the value of the independent
variable. One real-life analogy for linear regression is car resale value, where parameters such as the car's age, mileage, and other factors together determine the price of the car.
Terminologies in Linear Regression
There are a few terms you should be thoroughly familiar with before starting with linear regression:

Cost function: A cost function measures the error between the actual values and the predicted values; minimizing it yields the best possible values for the intercept and slope of the best-fit line.

Gradient descent: A method of iteratively updating the intercept and slope values to reduce the mean squared error, as in the sketch after this list.
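Here is the sketch referenced above: a minimal gradient-descent fit of an intercept and slope on toy data. The data, learning rate, and iteration count are chosen purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly y = 2x

b0, b1 = 0.0, 0.0  # intercept and slope
alpha = 0.01       # learning rate
for _ in range(5000):
    error = (b0 + b1 * x) - y
    # Partial derivatives of the mean squared error with respect to b0 and b1.
    b0 -= alpha * 2 * error.mean()
    b1 -= alpha * 2 * (error * x).mean()

print('intercept: %.3f, slope: %.3f' % (b0, b1))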
Advantages and Disadvantages of Linear Regression
Advantages:

Performs exceptionally well for linearly separable data

Easy to implement, interpret and efficient to train

Handles overfitting using regularization, cross-validation, and
dimensionality reduction techniques

Can extrapolate beyond the range of the training dataset
Disadvantages:

Takes an assumption of linearity between dependent and
independent variables

Prone to noise and overfitting

Quite sensitive to outliers

Prone to multicollinearity
Use Cases of Linear Regression
Linear regression can be used for:

Sales forecasting

Risk analysis for disease predictions

Housing applications to predict prices and other factors

Finance applications to predict stock prices and investment
evaluation
Implementing Linear Regression in Python
Steps to implement linear regression in Python:
1. Load the data
2. Explore the data
3. Slice the data according to requirements
4. Split the data into training and test sets
5. Generate the model from scikit-learn and train it using the fit method, then make predictions with the predict method
6. Evaluate the accuracy using mean squared error and related accuracy metrics
Example:
Implementing linear regression using a simple data set in scikit-learn:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The mean squared error
print("Mean squared error: %.2f"
      % metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred))