INTERNSHIP REPORT
A report submitted in partial fulfillment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING (AI)
By
ANSHIKA SHARMA
Regd. No.: CSJMA
Under Supervision of
Dr.
Associate Professor
(Duration: , 2023 to 2023)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY INSTITUTE OF ENGINEERING AND TECHNOLOGY
CHHATRAPATI SHAHU JI MAHARAJ UNIVERSITY, KANPUR
KANPUR, UTTAR PRADESH
2021-2025
PASTE MOOC COURSE CERTIFICATE/INTERNSHIP
CERTIFICATE/APPROVAL LETTER HERE
Acknowledgement
I would like to express my sincere gratitude to …………………… for giving
me the opportunity to do a Summer Internship Programme at the Indian
Institute of Technology, Kanpur. This opportunity I had with IIT Kanpur
was a great experience and allowed me to obtain a better idea of how
research is conducted.
I express my deepest gratitude to ………………………. for providing me this
opportunity to work under his able guidance and supervision. The
discussions I had with him during these two months had greatly
encouraged me to do my work with utmost dedication and further
motivated me to enhance my skills.
It gives me a great sense of pleasure to acknowledge ……… and ……… for
helping, guiding, and encouraging me throughout this journey. I want to
express my heartfelt gratitude and special thanks to all the project
engineers for their wonderful guidance, suggestions and for creating a
conducive environment during this internship.
Student Name
(Roll Number CSJMA)
ABSTRACT
This internship report presents an overview of the valuable experience gained during an internship in the field of
Machine Learning within the Electrical Industry at the prestigious Indian Institute of Technology Kanpur (IIT
Kanpur). The internship was conducted under the guidance of Prof. Ankush Sharma, a renowned expert in the
field of Electrical Engineering. This report encapsulates crucial aspects of the organization, the programs and
opportunities it offers, the methodologies employed during the internship, and key components of the internship
report. Additionally, it highlights the substantial benefits that the Company/Institution, IIT Kanpur, can derive
from the insights and contributions presented in this report.
1. Organization Information:
The Electrical Engineering Department at IIT Kanpur is a hub of cutting-edge research and innovation in the
field of electrical and electronic systems. The department boasts state-of-the-art facilities, a dynamic faculty, and
a collaborative research environment. It is renowned for its contributions to both academia and industry, making
it an ideal setting for a machine learning internship.
2. Programs and Opportunities:
IIT Kanpur offers a diverse range of programs and opportunities for students and researchers alike. The
internship program allows students to engage in hands-on projects, work closely with faculty mentors, and
collaborate on real-world problems. This report explores the various programs and opportunities available at IIT
Kanpur that facilitate experiential learning in machine learning.
3. Methodologies:
Throughout the internship, various methodologies were employed to tackle real-world challenges in the electrical
industry. This report delves into the machine learning techniques, data analysis, and problem-solving strategies
applied during the internship. It also discusses the tools and technologies utilized to implement these
methodologies.
4. Key Parts of the Report:
The report outlines the key components, findings, and outcomes of the internship project, providing an in-depth
analysis of the research conducted. It covers areas such as data collection, preprocessing, model development,
and evaluation. Moreover, it discusses the practical implications of the research and its relevance to the electrical
industry.
5. Benefits of the Company/Institution:
This internship report not only serves as a testament to the knowledge and skills gained during the internship but
also offers significant benefits to IIT Kanpur. The insights, recommendations, and solutions presented in this
report can contribute to the ongoing research efforts of the Electrical Engineering Department. Additionally, it
highlights the institution's commitment to fostering talent and promoting industry-academia collaboration.
In conclusion, this internship report at IIT Kanpur provides a comprehensive overview of the machine learning
internship experience in the electrical industry. It showcases the organization's commitment to nurturing talent
and fostering research in cutting-edge technologies. The report's findings and recommendations have the
potential to enhance the institution's contributions to the field of electrical engineering and machine learning,
ultimately benefiting both the institution and the broader industry.
TABLE OF CONTENTS
1) CERTIFICATE
2) ACKNOWLEDGEMENT
3) LIST OF FIGURES
4) INTRODUCTION
5) HISTORY OF MACHINE LEARNING
6) TYPES OF MACHINE LEARNING: SUPERVISED LEARNING
   UNSUPERVISED LEARNING
   REINFORCEMENT LEARNING
   SEMI-SUPERVISED LEARNING
   BATCH LEARNING
   ONLINE LEARNING
   INSTANCE-BASED LEARNING
   MODEL-BASED LEARNING
7) MATHS FOR MACHINE LEARNING
8) PYTHON AND ITS FEATURES FOR ML
9) DATA PREPROCESSING, ANALYSIS AND VISUALIZATION
   FEATURE ENGINEERING
10) EXPLORATORY DATA ANALYSIS (EDA)
11) MACHINE LEARNING ALGORITHMS
   LINEAR REGRESSION
   LOGISTIC REGRESSION
   DECISION TREES
   K-NEAREST NEIGHBOUR
   SUPPORT VECTOR MACHINES
   RANDOM FOREST
   REGULARIZATION: RIDGE, LASSO, ELASTIC NET
   ENSEMBLE LEARNING: VOTING, BAGGING, BOOSTING
   NAIVE BAYES ALGORITHM
12) CHECKING ACCURACY OF THE MODEL
13) PROBLEMS AND ISSUES IN SUPERVISED LEARNING
14) ADVANTAGES OF ML
15) APPLICATIONS OF ML
16) PROJECT REPORTS
17) ML IN ELECTRICAL ENGINEERING
18) FUTURE SCOPE OF ML
19) CHALLENGES FOR USING ML
LIST OF FIGURES
1) HISTOGRAM
2) BOXPLOT
3) SCATTERPLOT
4) PIE CHART
5) LINEPLOT
6) HEATMAP
7) PAIRPLOT
8) LINEAR REGRESSION
9) LOGISTIC REGRESSION
10) DECISION TREE
11) K-NEAREST NEIGHBOURS
12) SUPPORT VECTOR MACHINES
13) RANDOM FOREST
14) RIDGE REGRESSION
15) BAGGING
16) BOOSTING
17) STACKING
18) VOTING
19) NAIVE BAYES CLASSIFIER
Introduction
Machine learning is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn
from and make predictions or decisions based on data. It is a rapidly evolving field with applications in various industries and domains. The core
idea behind machine learning is to enable computers to learn patterns and relationships in data without explicitly programming them. Instead of
being explicitly programmed to perform specific tasks, machine learning algorithms are designed to learn from examples and experience, and to
automatically improve their performance over time. In machine learning, a model is trained on a dataset that consists of input data and
corresponding output or target values. The model learns to identify patterns and extract meaningful features from the data, allowing it to make
predictions or take actions when presented with new, unseen data. The training process involves adjusting the model's parameters and internal
representations to minimize the difference between its predictions and the true target values.
History of machine learning
Machine learning has a rich history that traces back to the mid-20th century. Its origins can be found in the work of pioneers such as Arthur
Samuel, who coined the term "machine learning" in 1959 and developed programs that could play checkers and improve their performance over
time through self-learning. In the 1960s and 1970s, researchers focused on developing algorithms for pattern recognition and decision-making,
leading to the creation of the perceptron by Frank Rosenblatt. However, enthusiasm waned due to limitations in computing power and data
availability. In the 1980s and 1990s, machine learning experienced a resurgence with the development of more advanced algorithms and the
availability of larger datasets. This period witnessed the rise of techniques like support vector machines and neural networks. The early 2000s
saw the emergence of ensemble methods, boosting model performance further. The recent decade has been characterized by the tremendous
growth of deep learning, powered by advancements in hardware and the availability of massive amounts of data. Machine learning has now
become a crucial component of various applications and industries, revolutionizing fields such as healthcare, finance, transportation, and more.
Types of machine learning
1. Supervised Learning: Supervised learning involves training models using labeled data, where the desired outputs are known. The model
learns to map input features to corresponding target labels or outputs. It aims to generalize from the provided labeled examples to make
predictions on new, unseen data. Supervised learning algorithms include linear regression, logistic regression, decision trees, random forests,
support vector machines (SVM), and neural networks. This type of learning is commonly used in tasks such as classification (e.g., spam detection,
image recognition) and regression (e.g., predicting house prices, stock market analysis).
2. Unsupervised Learning: Unsupervised learning involves training models on unlabeled data, where the target labels or outputs are not
provided. The objective is to discover hidden patterns, structures, or relationships within the data. Unsupervised learning algorithms include
clustering algorithms, such as K-means clustering and hierarchical clustering, and dimensionality reduction techniques, such as principal component analysis (PCA) and
t-distributed stochastic neighbor embedding (t-SNE). Unsupervised learning is commonly used for tasks such as customer segmentation, anomaly
detection, and data exploration.
3. Reinforcement Learning: Reinforcement learning involves training agents to interact with an environment and learn optimal behaviors
through a system of rewards and punishments. The agent learns to take actions in a given state to maximize cumulative rewards over time. It
learns through trial and error, adjusting its strategies based on the feedback received from the environment. Reinforcement learning algorithms
often use concepts like Markov decision processes (MDP) and Q-learning. This type of learning is applicable to tasks such as game playing (e.g.,
AlphaGo), robotics, and autonomous driving.
4. Semi-Supervised Learning: Semi-supervised learning lies somewhere between supervised and unsupervised learning because it uses both
labeled and unlabeled data for training - typically a small amount of labeled data and a large amount of unlabeled data. Systems using this
method can significantly improve learning accuracy. Semi-supervised learning is usually chosen when labeling the data requires skilled and
costly resources, whereas acquiring unlabeled data generally does not require additional resources.
5. Batch Machine Learning: Batch machine learning refers to the traditional approach where the model is trained on a fixed dataset, known as
a batch or training set. The model is trained using the entire dataset, and the learning process occurs offline before deploying the model. The
trained model is then used to make predictions on new, unseen data. Batch learning is suitable when the entire dataset is available and can be
processed at once. It is commonly used in scenarios where the data doesn't change frequently or in batch processing systems.
6. Online Machine Learning: Online machine learning, also known as incremental learning or streaming learning, is a learning approach
where the model learns from a continuous stream of data in real-time. Instead of training on a fixed dataset, the model is updated incrementally
as new data arrives. Online learning allows the model to adapt and update its knowledge with each new observation. It is useful in scenarios
where data arrives continuously, and the model needs to be updated dynamically. Online learning is employed in applications such as fraud
detection, recommender systems, and adaptive control systems.
7. Instance-Based Machine Learning: Instance-based machine learning, also referred to as memory-based learning or lazy learning, involves
storing and utilizing specific training instances as the basis for making predictions. The model memorizes the training instances and uses them
directly during prediction without explicit training. The model compares new instances to the stored instances and makes predictions based on
similarity or distance metrics. Instance-based learning is flexible and can handle complex and non-linear relationships in the data. K-nearest
neighbors (KNN) is a popular instance-based learning algorithm.
8. Model-Based Machine Learning: Model-based machine learning refers to building a mathematical model that represents the underlying
structure of the data. The model is trained on a dataset, learning patterns and relationships in the data, and is then used for making predictions
or generating new data. Model-based learning involves explicitly defining the model architecture and optimizing its parameters using training
data. Examples of model-based machine learning algorithms include linear regression, decision trees, support vector machines (SVM), and
neural networks. Model-based learning allows for generalization and can handle complex relationships and high-dimensional data.
MATHS FOR MACHINE LEARNING
Mathematics plays a fundamental role in machine learning. Understanding key mathematical concepts allows you to grasp the underlying
principles and algorithms used in various machine learning techniques. Here are some essential mathematical topics for machine learning:
1. Linear Algebra: Linear algebra provides the foundation for many machine learning algorithms. Concepts such as vectors, matrices, matrix
operations (addition, multiplication), eigenvectors, eigenvalues, and matrix decompositions (e.g., singular value decomposition) are crucial for
understanding algorithms like linear regression, principal component analysis (PCA), and support vector machines (SVM).
2. Calculus: Calculus is used to optimize machine learning models and algorithms. Key concepts include derivatives, partial derivatives, gradients,
and optimization techniques such as gradient descent. Calculus helps to minimize or maximize objective functions, such as in linear regression or
neural networks.
3. Probability and Statistics: Probability theory and statistics are fundamental in machine learning. Concepts like probability distributions (e.g.,
Gaussian distribution), conditional probability, Bayes' theorem, hypothesis testing, and statistical measures (mean, variance, correlation) are
essential for understanding probabilistic models, model evaluation, and inference.
4. Multivariate Calculus: Multivariate calculus extends calculus to functions of multiple variables. Concepts such as gradients, partial derivatives,
and Hessians become crucial when optimizing functions with multiple variables, which occurs in advanced optimization techniques and deep
learning.
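As a small illustration of how calculus drives model fitting, the sketch below (not taken from the internship work) runs plain gradient descent on a one-variable linear regression with synthetic data; the learning rate and iteration count are arbitrary, illustrative choices.

```python
# Minimal sketch: gradient descent for y = w*x + b on synthetic data,
# using the derivatives of the mean squared error loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)      # true slope 3, intercept 2, plus noise

w, b, lr = 0.0, 0.0, 0.01                      # illustrative learning rate
for _ in range(2000):
    error = w * x + b - y
    dw = 2 * np.mean(error * x)                # dL/dw of the MSE loss
    db = 2 * np.mean(error)                    # dL/db
    w -= lr * dw
    b -= lr * db

print(f"learned w = {w:.2f}, b = {b:.2f}")     # should end up close to 3 and 2
```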
PYTHON AND ITS FEATURES FOR ML
Python is one of the most popular programming languages for machine learning due to its simplicity, versatility, and robust ecosystem of
libraries and frameworks. Here are some of the key features of Python that make it well-suited for machine learning:
1. Readability and Simplicity: Python has a clean and readable syntax that makes it easy to understand and write code. Its simplicity allows
developers to focus more on the logic of their machine learning algorithms rather than dealing with complex syntax.
2. Large and Active Community: Python has a vast community of developers, data scientists, and machine learning practitioners. This active
community contributes to the development and maintenance of numerous libraries and frameworks specific to machine learning, providing
extensive documentation, tutorials, and support.
3. Rich Ecosystem of Libraries: Python offers a wide range of powerful libraries and frameworks for machine learning, such as:
- NumPy: A fundamental library for numerical computations, providing support for large, multi-dimensional arrays and mathematical functions.
It forms the foundation for many other libraries.
- Pandas: A library for data manipulation and analysis, offering data structures and tools for handling structured data. It allows easy data
preprocessing, cleaning, and transformation.
- scikit-learn: A comprehensive library for machine learning, providing implementations of various algorithms for classification, regression,
clustering, dimensionality reduction, and more. It also offers tools for model evaluation and selection.
- TensorFlow and PyTorch: Deep learning frameworks that enable building and training neural networks for complex tasks. They provide high-level abstractions and support for GPU acceleration.
- Keras: A user-friendly, high-level deep learning library that runs on top of TensorFlow or other backend frameworks. It simplifies the process of
building and training neural networks.
- OpenCV: A library for computer vision and image processing, useful for tasks like image classification, object detection, and image
manipulation.
4. Integration with Other Languages: Python supports seamless integration with other languages like C/C++ and Java. This allows developers to
incorporate existing libraries or optimize computationally intensive parts of their machine learning code.
5. Rapid Prototyping and Experimentation: Python's interactive shell and Jupyter notebooks provide a convenient environment for rapid
prototyping and experimentation. This enables iterative development, making it easy to test different algorithms, tweak parameters, and
visualize results on the go.
6. Scalability and Deployment: Python's ability to integrate with frameworks like Apache Spark and tools like Flask or Django allows scaling
machine learning models to large datasets and deploying them in production environments.
7. Continuous Development and Innovation: Python's open-source nature and active community ensure that new libraries, tools, and
techniques for machine learning are continually being developed and shared. This promotes a culture of innovation and facilitates staying up to
date with the latest advancements in the field.
Overall, Python's simplicity, powerful libraries, active community, and extensive ecosystem make it an excellent choice for machine learning
projects, enabling developers to efficiently implement and deploy sophisticated machine learning algorithms.
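The short, hypothetical snippet below shows how these libraries typically fit together in a small workflow: NumPy generates data, Pandas holds it, and scikit-learn fits and scores a model. The data and column names are invented purely for illustration.

```python
# Illustrative only: a tiny NumPy + Pandas + scikit-learn workflow on made-up data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"hours": np.arange(1, 21, dtype=float)})
df["score"] = 4.5 * df["hours"] + rng.normal(0, 1, len(df))   # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    df[["hours"]], df["score"], test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the held-out split:", model.score(X_test, y_test))
```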
Data preprocessing, analysis and visualization
Feature engineering is a crucial step in machine learning that involves creating or selecting relevant features from raw data to improve the
performance of a model. It involves transforming the raw data into a representation that captures meaningful patterns, relationships, and
characteristics that are useful for the learning algorithm. Here are some key aspects of feature engineering:
1. Feature Extraction: Feature extraction involves deriving new features from the existing raw data. This can be done through techniques such as:
- Dimensionality Reduction: Reducing the number of features while preserving relevant information using methods like Principal Component
Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Transformations: Applying mathematical transformations such as logarithmic, exponential, or square root transformations to make the data
more suitable for the model.
- Aggregation: Creating aggregated features by summarizing or combining existing features. For example, calculating statistics like mean,
median, standard deviation, or creating interaction features.
2. Feature Encoding: Encoding categorical variables into numerical representations is crucial as many machine learning algorithms require
numerical inputs. Common techniques include:
- One-Hot Encoding: Creating binary columns for each category and indicating the presence or absence of each category.
- Label Encoding: Assigning a unique numerical value to each category, converting them into integer representations.
- Target Encoding: Replacing categorical variables with the average value of the target variable for each category.
- Ordinal Encoding: Assigning numerical values to categories based on their order or rank.
3. Handling Missing Data: Missing data is a common challenge in real-world datasets. Strategies for handling missing values include:
- Imputation: Filling in missing values with estimates such as mean, median, mode, or using advanced imputation techniques like K-nearest
neighbors or regression imputation.
- Creating Indicator Variables: Creating a binary indicator variable to represent whether a value is missing or not.
4. Feature Scaling: Scaling numerical features can help prevent variables with large magnitudes from dominating the learning process.
Common scaling techniques include:
- Standardization: Transforming the data to have zero mean and unit variance, typically using techniques like Z-score normalization.
- Min-Max Scaling: Scaling the data to a specific range, often between 0 and 1.
5. Handling Outliers: Outliers can have a significant impact on the performance of a model. Strategies for dealing with outliers include:
- Winsorization: Capping or truncating extreme values to a predefined threshold.
- Removing Outliers: Removing data points that are considered outliers based on statistical measures or domain knowledge.
6. Feature Selection: Selecting the most relevant subset of features can improve model performance and reduce overfitting. Common
techniques include:
- Univariate Selection: Selecting features based on statistical tests such as chi-square test, ANOVA, or correlation.
- Feature Importance: Using techniques like information gain, Gini importance, or permutation importance to evaluate the importance of
features.
- Recursive Feature Elimination: Iteratively removing less important features based on the performance of the model.
Feature engineering requires domain knowledge, intuition, and experimentation to create meaningful representations of data that can enhance the
performance of machine learning models. It plays a critical role in extracting valuable insights and patterns from data, ultimately improving the
accuracy and generalization capabilities of the models.
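The hedged sketch below strings several of the steps above (median/mode imputation, one-hot encoding, standardization) into a single scikit-learn pipeline; the tiny DataFrame and its column names are invented purely for illustration.

```python
# Illustrative preprocessing pipeline: imputation + encoding + scaling on made-up data.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [30000, 52000, np.nan, 61000],
    "city": ["Kanpur", "Delhi", "Kanpur", np.nan],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric, ["age", "income"]),
                                ("cat", categorical, ["city"])])

X = preprocess.fit_transform(df)
print(X.shape)   # 4 rows x (2 scaled numeric + one-hot city columns)
```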
EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) often involves visualizing data to gain insights and understand patterns. Graphical representations can provide a
visual understanding of the data distribution, relationships, and potential outliers. Here are some common visualization techniques used in EDA
for machine learning:
1. Histograms: Histograms show the frequency distribution of numerical variables. They help visualize the shape, central tendency, and spread of
the data. Histograms can reveal insights about data skewness, peaks, and gaps.
FIG 1) HISTOGRAM
2. Box Plots: Box plots, also known as box-and-whisker plots, display the distribution of numerical variables and provide information about
median, quartiles, and potential outliers. They help identify data variability and compare distributions across different groups or categories.
FIG2) BOXPLOT
3. Scatter Plots: Scatter plots show the relationship between two numerical variables. They help identify correlations, clusters, or patterns within
the data. Scatter plots can be enhanced with colors or sizes to represent additional variables.
FIG3) SCATTERPLOT
4. Bar Charts: Bar charts are used for categorical variables, displaying the frequency or count of each category. They help compare categories
and identify dominant or rare categories within the data.
5. Pie Charts: Pie charts represent proportions or percentages of different categories within a dataset. They are useful for visualizing the
composition of categorical variables and understanding relative contributions.
FIG4) PIE CHART
6. Line Plots: Line plots show the trend or pattern of a numerical variable over time or a continuous axis. They help identify seasonality, trends,
or changes in the data over a specific period.
FIG 5) LINEPLOT
7. Heatmaps: Heatmaps use color gradients to represent the intensity or density of numerical data across a two-dimensional grid. They are often
used to display correlation matrices, showing the strength and direction of relationships between variables.
FIG6) HEATMAP
8. Pair Plots: Pair plots, also known as scatterplot matrices, visualize relationships between multiple numerical variables in a single grid of scatter
plots. They help identify potential correlations or patterns among variables.
FIG7) PAIRPLOT
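A brief sketch of how a few of the plots listed above are usually produced with Matplotlib/Seaborn; it uses Seaborn's built-in "tips" dataset as stand-in data, since the internship datasets are not reproduced here.

```python
# Illustrative EDA plots on seaborn's bundled "tips" dataset (stand-in data).
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("tips")

sns.histplot(df["total_bill"])                          # histogram
plt.show()
sns.boxplot(x="day", y="total_bill", data=df)           # box plot
plt.show()
sns.scatterplot(x="total_bill", y="tip", data=df)       # scatter plot
plt.show()
sns.heatmap(df[["total_bill", "tip", "size"]].corr(), annot=True)   # correlation heatmap
plt.show()
sns.pairplot(df[["total_bill", "tip", "size"]])         # pair plot
plt.show()
```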
Machine learning algorithms
Machine learning algorithms are computational methods that learn patterns and relationships from data to make
predictions or decisions without being explicitly programmed. There are various machine learning algorithms, each with its
own characteristics, strengths, and applications. Here are some commonly used machine learning algorithms:
1) LINEAR REGRESSION
Linear regression is one of the supervised machine learning algorithms in Python that observes continuous features and predicts a result.
Depending on whether it runs on a single variable or on many features, it can be called simple linear regression or multiple linear regression.
This is one of the most popular algorithms in Python ML and is often underestimated. It assigns optimal weights to variables to create a line ax+b
that can be used to predict the output. We often use linear regression to estimate real-world values, such as the number of calls or the cost of
houses, based on continuous variables. The regression line is the line of best fit, Y = a*X + b, representing the relationship between the
independent and dependent variables.
FIG8)LINEAR REGRESSION
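A minimal scikit-learn example of the idea above, fitted on a few made-up points (illustrative only):

```python
# Simple linear regression with scikit-learn on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]], dtype=float)    # single feature
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])                # roughly y = 2x

model = LinearRegression().fit(X, y)
print("slope a =", model.coef_[0], "intercept b =", model.intercept_)
print("prediction for x = 6:", model.predict([[6.0]])[0])
```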
2) LOGISTIC REGRESSION
Logistic regression is a supervised classification machine learning algorithm in Python used for estimating discrete values
such as 0/1, yes/no, and true/false. This is done based on a given set of independent variables. We use a logistic function to predict the
probability of an event, giving us an output between 0 and 1. Although it is called 'regression', it is actually a classification algorithm.
Logistic regression fits data to a logit function and is also called logit regression.
FIG9) LOGISTIC REGRESSION
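A short, hedged example of binary classification with LogisticRegression, using scikit-learn's bundled breast-cancer dataset as stand-in data:

```python
# Logistic regression for binary classification on a bundled scikit-learn dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("probability estimates for two samples:", clf.predict_proba(X_test[:2]))
print("test accuracy:", clf.score(X_test, y_test))
```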
3) DECISION TREE
A decision tree is one of the supervised machine learning algorithms in Python and can be used for both classification and regression,
but mainly for classification. This model takes an instance, traverses the tree, and compares important features against particular
conditional statements. Whether it descends to the left or the right child branch depends on the result. Usually, the more important
features are closer to the root. Decision trees can work with both categorical and
continuous dependent variables. Here, we divide a population into two or more homogeneous groups. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, the leaves represent class
labels, and the branches represent conjunctions of features leading to those class labels. Decision trees in which the target variable
can take continuous values (usually real numbers) are called regression trees.
FIG10) DECISION TREE
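A brief, illustrative sketch of a classification tree on the iris dataset; max_depth=3 is an arbitrary choice to keep the printed tree small:

```python
# Decision tree classifier on the iris dataset, with a text dump of the learned splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=data.feature_names))   # more important splits appear near the root
```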
4) KNN ALGORITHM
K-nearest neighbours (KNN) is a Python machine learning algorithm for classification and regression, though it is mainly used for classification. It is a
supervised learning algorithm that uses a distance function, commonly the Euclidean distance, to compare how far apart instances are, and assigns
each new point to the group whose members lie closest to it. It classifies new cases by a majority vote of their k nearest neighbours: the case is
assigned to the class that is most frequent among those k neighbours. k-NN is a type of instance-based learning, or lazy learning, in which the
function is approximated only locally, and all computations are deferred until classification. k-NN is a special case of a kernel density
"balloon" estimator with variable bandwidth and a uniform kernel.
FIG11) KNN-K-NEAREST NEIGHBOURS
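A minimal sketch with scikit-learn's KNeighborsClassifier; k = 5 and the Euclidean metric are its defaults:

```python
# k-nearest neighbours classification on the iris dataset (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
print("predicted class of the first test sample:", knn.predict(X_test[:1])[0])
```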
5) SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised classification algorithm and one of the most important machine learning algorithms in Python. It draws a line that separates
the different categories of your data; in this ML algorithm, we compute the support vectors that optimize the line, ensuring that the
closest point in each group is as far from the separating boundary as possible. The separator is almost always linear, but it need not be. An SVM model is a
representation of the examples as points in space, mapped so that the examples in each category are separated by as large a gap as
possible. SVMs can perform not only linear classification but also nonlinear classification by using the so-called kernel trick,
implicitly mapping their inputs to high-dimensional feature spaces. If the data is unlabelled, supervised learning is not possible, and
an unsupervised learning approach is required, which attempts to naturally group the data and then assign new data to these
formed groups.
FIG 12) SUPPORT VECTOR MACHINES
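A short sketch comparing a linear kernel with the RBF kernel trick; features are standardized first because SVMs are sensitive to feature scale (dataset and settings are illustrative):

```python
# Linear vs RBF-kernel SVM on a bundled dataset, with feature scaling.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
    print(kernel, "kernel test accuracy:", clf.score(X_test, y_test))
```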
6) RANDOM FOREST
A Random Forest is an ensemble of decision trees. To classify a new object based on its attributes, each tree returns a classification,
i.e. votes for a class, and the classification with the most votes across the forest wins. Random Forests, or Random Decision
Forests, are an ensemble learning method for classification, regression, and other tasks in which a large number of decision trees are
created at training time, and the class that is the mode of the classes (classification) or the mean prediction (regression) of the
individual trees is output.
FIG 13) RANDOM FOREST
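A brief, illustrative RandomForestClassifier example; 100 trees is scikit-learn's default, and the importance printout shows which features the forest relied on:

```python
# Random forest on the iris dataset, with per-feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```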
7) RIDGE REGRESSION
Ridge regression adds a regularization term to the ordinary least squares (OLS) loss function. The regularization term is a
penalty that shrinks the coefficients towards zero.
Ridge regression is useful when dealing with multicollinearity (high correlation among predictors). It helps to reduce the
impact of correlated predictors and stabilize the model. The strength of regularization in Ridge regression is controlled by
the hyperparameter alpha. Higher values of alpha result in more regularization and smaller coefficients. Ridge regression
does not perform variable selection, as it keeps all predictors in the model.
FIG 14) RIDGE REGRESSION
8) LASSO REGRESSION
Lasso regression also adds a regularization term to the OLS loss function but uses the L1 norm of the coefficients as the penalty
term. Lasso regression encourages sparse solutions by setting some coefficients to exactly zero, effectively performing feature
selection. It is particularly useful when dealing with high-dimensional data, as it can automatically select the most relevant features.
The strength of regularization in Lasso regression is controlled by the hyperparameter alpha. Higher values of alpha result in more
regularization and more coefficients being set to zero.
9) ELASTIC NET REGRESSION
Elastic Net combines both Ridge and Lasso regularization by adding a combination of L1 and L2 penalties to the loss function. The
Elastic Net regularization term has two hyperparameters: alpha controls the overall strength of regularization, and the mix ratio
(l1_ratio) controls the balance between L1 and L2 penalties. Elastic Net is useful when dealing with high-dimensional data and
multicollinearity. It provides a balance between feature selection (Lasso) and handling correlated predictors (Ridge).
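A hedged comparison of the three regularized regressions above on synthetic data; alpha = 1.0 and l1_ratio = 0.5 are illustrative, untuned values. Note how Lasso and Elastic Net zero out coefficients (feature selection) while Ridge only shrinks them:

```python
# Ridge vs Lasso vs ElasticNet on synthetic regression data with 20 features,
# only 5 of which are informative.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    zeroed = int((model.coef_ == 0).sum())
    print(f"{type(model).__name__}: coefficients set exactly to zero = {zeroed}")
```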
10) ENSEMBLE LEARNING
Ensemble learning is a machine learning technique that combines multiple individual models to make more accurate predictions or decisions. It
aims to improve overall performance by leveraging the wisdom of the crowd. Ensemble methods often outperform individual models and
are widely used in various machine learning tasks. Here are some common ensemble learning techniques:
A. Bagging (Bootstrap Aggregating): Bagging involves training multiple models independently on different subsets of the
training data. Each model learns from a randomly sampled subset (with replacement) of the original training data. The final
prediction is typically obtained by averaging or voting among the predictions of individual models. Random Forest is a popular
example of a bagging ensemble algorithm.
FIG 15) BAGGING
B. Boosting: Boosting is an iterative ensemble technique where each subsequent model focuses on correcting the mistakes made
by the previous models. Models are trained sequentially, and each model assigns higher weights to the misclassified instances.
AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM) are well-known boosting algorithms.
FIG16) BOOSTING
C. Stacking: Stacking involves training multiple models and combining their predictions through a meta-model. The meta-model is
trained on the predictions made by the individual models as additional input features. Stacking can capture more complex
relationships and dependencies between the models. It often leads to improved performance but requires more computational
resources.
FIG 17) STACKING
D. Voting: Voting combines the predictions of multiple models to make a final prediction. There are different types of voting
methods, including majority voting (classification), weighted voting (assigning weights to each model's prediction), and soft voting
(using probabilities instead of discrete predictions). Voting ensembles can be effective in reducing bias and variance and improving
generalization.
FIG 18) VOTING
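A compact sketch of bagging, boosting, and soft voting with scikit-learn; the base learners and hyperparameters are illustrative defaults, not the report's configuration:

```python
# Bagging, boosting and a soft-voting ensemble on a bundled dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(random_state=0)
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
voting = VotingClassifier(
    estimators=[("lr", logreg), ("bag", bagging), ("gb", boosting)],
    voting="soft")                      # soft voting averages predicted probabilities

for name, clf in [("bagging", bagging), ("boosting", boosting), ("voting", voting)]:
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```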
11) NAIVE BAYES ALGORITHM
Naive Bayes is a classification method based on Bayes' theorem. It assumes independence between predictors: a Naive Bayes
classifier assumes that a feature in one class is unrelated to any other feature. Consider a fruit: it is an apple if it is round, red, and has a
diameter of about 2.5 inches. A Naive Bayes classifier treats each of these features as contributing independently to the probability that the fruit
is an apple, even if the features actually depend on each other. It is easy to build a Naive Bayes model for very large data sets.
Not only is this model very simple, it often performs better than many sophisticated classification methods. Naive Bayes classifiers are
highly scalable and require a number of parameters that is linear in the number of variables (features/predictors) in a learning
problem. Maximum likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than the
expensive iterative approximation used for many other types of classifiers.
FIG 19) NAÏVE BAYES CLASSIFIER
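A minimal GaussianNB example on stand-in data, showing the per-class probabilities the classifier derives from Bayes' theorem:

```python
# Gaussian Naive Bayes on the iris dataset (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("class probabilities for the first test sample:", nb.predict_proba(X_test[:1]))
```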
12) LSTM-SAE
LSTM stands for Long Short-Term Memory, which is a type of recurrent neural network (RNN) architecture
commonly used in sequence modeling tasks. It is designed to overcome the limitations of traditional RNNs by addressing the
vanishing gradient problem and capturing long-term dependencies in sequential data. SAE stands for Stacked Autoencoder, which is
a type of unsupervised learning algorithm used for feature learning and dimensionality reduction. It consists of multiple layers of
autoencoders, where each layer learns increasingly abstract representations of the input data.
Combining LSTM and SAE involves using the strengths of both architectures for a particular task. The LSTM can capture temporal
dependencies and process sequential data, while the SAE can learn useful representations of the input data, which can be fed into
the LSTM for further processing.
One possible approach to combining LSTM and SAE is to use the SAE as a pre-training step for the LSTM. The SAE can be trained on
the input data to learn meaningful representations, and the learned weights can be used to initialize the LSTM. This initialization can
help the LSTM converge faster and potentially improve its performance on the task at hand.
Another approach is to use the SAE as an encoder for the LSTM. The SAE can encode the input data into a lower-dimensional
representation, which can then be fed into the LSTM for further processing. This can help reduce the dimensionality of the input and
potentially improve the LSTM's ability to capture relevant patterns in the data.
Overall, combining LSTM and SAE can be a powerful technique for tasks that involve sequential data and require meaningful feature
representations. It allows for capturing long-term dependencies and learning abstract representations, leading to improved
performance in various applications such as natural language processing, speech recognition, time series analysis, and more.
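The Keras sketch below is only a rough, hypothetical illustration of the pre-training idea described above: a small autoencoder is trained on individual feature vectors, its encoder compresses every time step, and an LSTM is then trained on the encoded sequences. The data is random and all layer sizes are arbitrary.

```python
# Hypothetical SAE-then-LSTM sketch on random data (shapes and sizes are illustrative).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_samples, timesteps, n_features, latent_dim = 200, 24, 12, 4
X = np.random.rand(n_samples, timesteps, n_features).astype("float32")
y = X.mean(axis=(1, 2))                                    # toy regression target

# 1) Autoencoder trained on individual time-step vectors.
encoder = keras.Sequential([keras.Input(shape=(n_features,)),
                            layers.Dense(8, activation="relu"),
                            layers.Dense(latent_dim, activation="relu")])
decoder = keras.Sequential([keras.Input(shape=(latent_dim,)),
                            layers.Dense(8, activation="relu"),
                            layers.Dense(n_features)])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
flat = X.reshape(-1, n_features)
autoencoder.fit(flat, flat, epochs=5, verbose=0)

# 2) LSTM trained on the encoded sequences.
X_encoded = encoder.predict(flat, verbose=0).reshape(n_samples, timesteps, latent_dim)
lstm = keras.Sequential([keras.Input(shape=(timesteps, latent_dim)),
                         layers.LSTM(16),
                         layers.Dense(1)])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X_encoded, y, epochs=5, verbose=0)
print(lstm.predict(X_encoded[:2], verbose=0))
```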
CHECKING ACCURACY OF THE MODEL
The `sklearn.metrics` module in scikit-learn provides various metrics for evaluating the performance of machine learning models.
These metrics help assess the accuracy, precision, recall, F1-score, and other aspects of classification, regression, and clustering
models. Here are some commonly used metrics available in `sklearn.metrics`:
1. Classification Metrics:
- Accuracy: `accuracy_score`
- Precision: `precision_score`
- Recall: `recall_score`
- F1-score: `f1_score`
- Confusion Matrix: `confusion_matrix`
- Classification Report: `classification_report`
- ROC Curve: `roc_curve`, `roc_auc_score`
2. Regression Metrics:
- Mean Squared Error (MSE): `mean_squared_error`
- Root Mean Squared Error (RMSE): `mean_squared_error` with `squared=False`
- Mean Absolute Error (MAE): `mean_absolute_error`
- R-squared (coefficient of determination): `r2_score`
- Explained Variance Score: `explained_variance_score`
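A short illustration of a few of the metrics listed above, applied to made-up true/predicted values:

```python
# Computing classification and regression metrics with sklearn.metrics on toy values.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)

y_true_cls = [0, 1, 1, 0, 1, 1]
y_pred_cls = [0, 1, 0, 0, 1, 1]
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("F1 score:", f1_score(y_true_cls, y_pred_cls))
print("confusion matrix:\n", confusion_matrix(y_true_cls, y_pred_cls))

y_true_reg = [3.0, 5.5, 7.2, 9.1]
y_pred_reg = [2.8, 5.9, 7.0, 9.4]
mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MSE:", mse, "RMSE:", mse ** 0.5)
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))
```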
Problems and Issues in Supervised learning:
Supervised learning is a powerful approach in machine learning, but it also comes with its own set of problems and challenges. Here are some
common problems and issues encountered in supervised learning:
1. Insufficient Data: Supervised learning algorithms require enough labeled data for training. Insufficient data can lead to overfitting, where the
model performs well on the training data but fails to generalize to new, unseen data.
2. Imbalanced Data: Imbalanced class distributions can occur when one class dominates the dataset, leading to biased models. The model may
have poor performance on the minority class due to limited examples. Techniques like resampling, data augmentation, or using different
evaluation metrics can help address this issue.
3. Noisy or Inconsistent Labels: In real-world datasets, label noise or inconsistent labeling can be present. Incorrect or inconsistent labels can
negatively impact the model's training and performance. Careful data cleaning and validation are essential to address this issue.
4. Overfitting: Overfitting occurs when the model becomes too complex and starts to memorize the training data instead of learning general
patterns. Overfitting leads to poor generalization on unseen data. Techniques like regularization, cross-validation, and early stopping can
mitigate overfitting.
5. Underfitting: Underfitting happens when the model is too simple and fails to capture the underlying patterns in the data. It results in high bias
and poor performance. Increasing model complexity, adding more features, or using a different model may help address underfitting.
6. Feature Engineering: The success of supervised learning heavily relies on feature engineering, which involves selecting, extracting, and
transforming relevant features from the raw data. Choosing informative features that capture the underlying patterns can be challenging and
time-consuming.
7. Curse of Dimensionality: As the number of features or dimensions increases, the data becomes increasingly sparse in the feature space. This
can lead to difficulties in learning meaningful patterns and increased computational complexity. Dimensionality reduction techniques can help
alleviate this problem.
8. Generalization to Unseen Data: Supervised learning models are trained on a specific dataset and may not generalize well to unseen data from
a different distribution. Careful evaluation and testing on representative data are crucial to ensure the model's generalization capability.
9. Interpretability: Some supervised learning models, such as deep neural networks, can be complex and lack interpretability. Understanding
and interpreting the learned models may be challenging, especially in domains where interpretability is crucial, such as healthcare or finance.
ADVANTAGES OF ML
1. Handling Large and Complex Data: ML algorithms can efficiently process and analyze large volumes of data that may be challenging for
humans to handle manually. They can identify patterns, extract insights, and make predictions from complex, high-dimensional data.
2. Automation and Efficiency: ML automates repetitive tasks and complex processes, reducing the need for manual intervention. It can save
time and resources by streamlining workflows, improving efficiency, and enabling faster decision-making.
3. Improved Accuracy and Decision Making: ML models can learn from historical data and make predictions or decisions with a high degree of
accuracy. They can identify patterns, detect anomalies,
and provide insights that may not be apparent through traditional methods. ML-based
predictions and recommendations can support informed decision-making.
4. Adaptability and Scalability: ML models can adapt and learn from new data, improving their performance over time. They can handle
evolving and dynamic environments, making them suitable for applications where data distributions or patterns change. ML algorithms can also
scale well to large datasets and work in distributed computing environments.
5. Handling Complex and Non-linear Relationships: ML techniques can capture complex and non-linear relationships between variables. They
can discover hidden patterns and dependencies that may not be easily discernible through traditional statistical methods.
6. Personalization and Recommendation: ML enables personalized experiences and recommendations by analyzing user preferences, behavior,
and historical data. It powers recommendation systems that suggest relevant products, content, or services to users, enhancing user satisfaction
and engagement.
7. Real-time Insights and Predictions: ML models can provide real-time insights and predictions, allowing businesses to make timely decisions.
This is particularly valuable in applications like fraud detection, predictive maintenance, stock market analysis, and dynamic pricing.
8. Handling Unstructured Data: ML algorithms can process unstructured data such as text, images, audio, and video. They can extract features,
classify, and make predictions from unstructured data sources, opening possibilities in fields like natural language processing, computer vision,
and multimedia analysis.
9. Discovering New Patterns and Knowledge: ML can uncover new patterns, correlations, and knowledge from data that humans may not have
identified. It can reveal insights and generate hypotheses that can drive further research and innovation in various domains.
10. Continuous Learning and Improvement: ML models can continuously learn from new data, adapt to changes, and improve their
performance over time. They can incorporate feedback and update their predictions or behaviors based on new information.
APPLICATIONS OF ML
1. Image and Object Recognition: ML is used for tasks like image classification, object detection, facial recognition, and image segmentation.
Applications include autonomous vehicles, surveillance systems, medical imaging, and content-based image retrieval.
2. Natural Language Processing (NLP): ML techniques are used for tasks like sentiment analysis, text classification, machine translation, chatbots,
speech recognition, and language generation. NLP finds applications in virtual assistants, customer support, content analysis, and language
understanding.
3. Recommender Systems: ML-based recommender systems are used to suggest personalized recommendations to users based on their
preferences and behavior. They are widely used in e-commerce, entertainment platforms, music streaming services, and content
recommendation.
4. Fraud Detection: ML models can detect fraudulent activities by analyzing patterns and anomalies in data. This is useful in finance, credit card
fraud detection, insurance claims, cybersecurity, and anti-money laundering systems.
5. Healthcare and Medicine: ML is used for medical imaging analysis, disease diagnosis, drug discovery, personalized medicine, genomics, and
predicting patient outcomes. It helps in early detection, treatment planning, and decision support systems.
6. Financial Services: ML is applied in credit scoring, risk assessment, fraud detection, algorithmic trading, portfolio management, and customer
segmentation. It assists in making data-driven decisions, reducing risks, and optimizing financial processes.
7. Autonomous Systems: ML algorithms power autonomous systems such as self-driving cars, drones, robotics, and industrial automation. These
systems learn from sensor data, make decisions, and adapt to their environments.
8. Predictive Maintenance: ML is used to predict failures or maintenance needs in industrial machinery, equipment, and infrastructure. It helps
optimize maintenance schedules, reduce downtime, and save costs.
9. Energy Management: ML techniques are employed in energy demand forecasting, load balancing, smart grid optimization, and energy
consumption analysis. ML helps optimize energy usage, improve efficiency, and support renewable energy integration.
These are just a few examples of how machine learning is applied across different industries. ML continues to find applications in many other
fields, including manufacturing, logistics, agriculture, entertainment, environmental monitoring, and more. Its versatility and ability to uncover
patterns in complex data make it a powerful tool for solving a wide range of problems.
PROJECT REPORT 1
Overview: The Load Dataset of IIT Kanpur.
Dataset Description: This is the load dataset of IIT Kanpur. This data has anonymised information such as R Phase
Voltage, R Phase Current, Y Phase Voltage, Y Phase Current, B Phase Voltage, B Phase Current, Active Import, Active Export, Active
Net, Apparent Import, Apparent Export, Net Apparent.
Objective: The objective is to forecast how much power this particular meter receives from the different
phases at a particular time, or how much power is actually used at a particular time of the day, considering the important
features and dropping the less important ones.
PROJECT CODE: file:///C:/Users/SR-19/Downloads/LSD%20(2).html
EDA FILE: file:///C:/Users/SR-19/Downloads/report%20(3).html
RESULT: After performing the above steps (data wrangling, data visualization, correlation analysis, and training) and applying
different algorithms to the data, the best-suited model, with the highest score, was RandomForestRegressor,
with a 98.6% accuracy score and the lowest mean squared error of 0.22. Hence, we have successfully trained our model.
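The full notebook is linked above and not reproduced here. The sketch below is only a generic, hypothetical outline of the kind of pipeline the result describes (split the data, fit a RandomForestRegressor, report R^2 and MSE), run on synthetic data instead of the IIT Kanpur load dataset:

```python
# Generic regression pipeline sketch on synthetic data (not the internship dataset).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=12, noise=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2 score:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```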
PROJECT REPORT 2
Overview: Analog Data of the ESE Building, IIT Kanpur
Dataset Description:
This is the analog data of IIT Kanpur for 2 consumers. This dataset has
information such as 'Consumer Name', 'Reading DateTime', 'Temperature(C)', 'VPV1(V)', 'VPV2(V)', 'Reserved
1', 'IPV1(A)', 'IPV2(A)', 'Reserved 2', 'IAC(A)', 'CUR AC S PH', 'CUR AC T PH', 'VAC(V)', 'VOL AC S PH',
'VOL AC T PH', 'PPV1(W)', 'PPV2(W)', 'PDC(W)', 'FREQ(Hz)', 'PAC(W)', 'Power AC S PH', 'Power AC T PH',
'Reserved 3', 'Reserved 4', 'E-TODAY(kWh)', 'Energy Total H', 'Energy Total L', 'E-Total(MWh)', 'Operation Hour
Total H', 'Operation Hour Total L', 'Operation Mode'.
Objective: The objective is to forecast E-TODAY(kWh), i.e. the total energy
consumed in a day by a particular user, considering the important features and dropping the less
important ones.
PROJECT CODE: A) CONSUMER 1 => file:///C:/Users/SR19/Downloads/1%20jul%202018%20-%2030%20jun%202019%20(1).html
EDA FILE: file:///C:/Users/SR-19/Downloads/EDA%20df1%20(3).html
RESULT: After performing the above steps (data wrangling, data visualization, correlation analysis, and training) and
applying different algorithms to the data, the best-suited model, with the highest score, was
RandomForestRegressor, with an 85.26% accuracy score and the lowest mean squared error of 3.75. Hence, we have
successfully trained our model.
PROJECT CODE: B) CONSUMER 2 => file:///C:/Users/SR-19/Downloads/july%2018-june%2019%20(1)%20(1).html
EDA FILE: file:///C:/Users/SR-19/Downloads/EDA%20df2%20(1).html
RESULT: After performing the above steps (data wrangling, data visualization, correlation analysis, and training) and
applying different algorithms to the data, the best-suited model, with the highest score, was
RandomForestRegressor, with an 85% accuracy score and the lowest mean squared error of 3.16. Hence, we have
successfully trained our model.
ML IN ELECTRICAL ENGINEERING
1. Power Systems and Grid Management: ML techniques are employed for power system load forecasting, demand response,
energy scheduling, and optimization. ML models can analyze historical data and predict electricity demand, helping utilities plan and
manage power generation and distribution more efficiently.
2. Fault Detection and Diagnostics: ML algorithms are used for fault detection, diagnosis, and condition monitoring in
electrical systems. They can analyze sensor data to identify abnormal behavior, detect faults, and predict equipment failures. ML-based techniques aid in early fault detection, reducing downtime, and improving system reliability.
3. Smart Grids and Energy Management: ML plays a crucial role in smart grid technologies, where it helps in load forecasting,
energy consumption prediction, energy theft detection, and grid optimization. ML models can analyze real-time data from smart
meters, sensors, and devices to optimize energy distribution, demand response, and grid stability.
4. Power Quality Analysis: ML techniques are used to analyze power quality data, including voltage sag/swell detection,
harmonics analysis, and transient analysis. ML models can identify power quality issues, classify, and analyze waveform distortions,
and provide insights for improving the quality and reliability of electrical power.
5. Renewable Energy Integration: ML is employed in renewable energy systems for optimizing energy generation, forecasting
renewable energy output, and grid integration. ML models can analyze weather data, historical generation patterns, and system
parameters to improve the management and integration of renewable energy sources.
6. Power Electronics and Control Systems: ML techniques are used in power electronics for fault detection, control
optimization, and modeling complex power electronic systems. ML models can learn the behavior of power electronic devices,
enabling improved control strategies and fault-tolerant operation.
7. Energy Efficiency and Conservation: ML techniques aid in optimizing energy consumption, improving energy efficiency, and
reducing energy waste in electrical systems. ML models can analyze energy usage patterns, identify energy-saving opportunities, and
provide recommendations for energy conservation measures.
8. Power System Stability and Control: ML models can analyze power system data to predict stability issues, evaluate control
strategies, and optimize power system operation. ML techniques can improve load-frequency control, voltage control, and stability
analysis in power systems.
9. Predictive Maintenance: ML algorithms are used for predictive maintenance of electrical equipment and infrastructure. They
can analyze sensor data, perform condition monitoring, and predict maintenance needs to prevent unexpected failures and optimize
maintenance schedules.
FUTURE SCOPE OF ML
1. Deep Learning and Neural Networks: Deep learning, a subfield of ML, focuses on training deep neural networks with multiple
layers. Further advancements in deep learning can lead to improved performance in areas such as computer vision, natural language
processing, speech recognition, and reinforcement learning.
2. Explainable AI: There is a growing need for ML models to provide explanations and justifications for their predictions and
decisions. Developing interpretable and explainable ML models will be crucial for applications in sensitive domains like healthcare,
finance, and law, where transparency and accountability are essential.
3. Reinforcement Learning: Reinforcement learning (RL) involves training agents to make decisions and take actions in an
environment to maximize rewards. RL holds promise for autonomous systems, robotics, game playing, and optimization problems.
Future advancements in RL algorithms and techniques can lead to breakthroughs in complex decision-making tasks.
4. Unsupervised Learning: Unsupervised learning, where the model learns from unlabeled data, has the potential to uncover hidden
patterns, anomalies, and new insights from large datasets. Advancements in unsupervised learning techniques can enhance data
exploration, clustering, and anomaly detection capabilities.
5. Transfer Learning: Transfer learning enables ML models to leverage knowledge gained from one task or domain to improve
performance on another related task or domain with limited data. Further research and developments in transfer learning can
enhance model generalization, reduce the need for large, labeled datasets, and enable efficient adaptation to new domains.
6. Federated Learning: Federated learning enables training ML models across multiple decentralized devices or edge devices
without transferring raw data to a central server. It addresses privacy concerns and can enable collaborative learning in distributed
environments. The future holds potential for advancements in federated learning algorithms and protocols.
7. Human-Machine Collaboration: The future of ML involves enabling effective collaboration between humans and intelligent
machines. This includes developing models that can understand and learn from human preferences, intents, and feedback, enabling
seamless interaction and decision-making in complex tasks.
8. Ethical and Fair ML: As ML plays an increasingly significant role in decision-making processes, there is a growing need for ensuring
fairness, transparency, and accountability. Research and efforts will focus on developing ethical guidelines, regulations, and
frameworks to address biases, discrimination, and ethical considerations in ML systems.
9. Edge Computing and ML: With the proliferation of IoT devices and the need for real-time decision-making, ML models will be
increasingly deployed at the edge of the network. Advancements in edge computing and ML integration will enable efficient
processing and inference on resource-constrained devices, reducing latency and enhancing privacy.
10. Domain-Specific Applications: ML will continue to have a significant impact in domain-specific applications such as healthcare,
finance, transportation, agriculture, manufacturing, and cybersecurity. Tailoring ML models and algorithms to address the unique
challenges and requirements of these domains will drive advancements in those fields.
These are just a few future directions and potential areas of growth for ML. As technology advances and research progresses, we can
expect ML to continue transforming industries, driving innovation, and addressing complex challenges across various domains.
Challenges for using Machine Learning
1. Data Acquisition – Machine Learning requires massive data sets to train on, and these should be inclusive/unbiased and of good quality.
There can also be times when we must wait for new data to be generated.
2. Time and Resources – ML needs enough time to let the algorithms learn and develop enough to fulfil their purpose with a considerable
amount of accuracy and relevancy. It also needs massive resources to function, which can mean additional requirements of computing power.
3. Interpretation of Results – Another major challenge is the ability to accurately interpret the results generated by the
algorithms. We must also carefully choose the algorithms for our purpose.
4. High Error Susceptibility – Machine Learning is autonomous but highly susceptible to errors. Suppose you train an
algorithm with data sets small enough to not be inclusive. You end up with biased predictions coming from a biased
training set; this leads, for example, to irrelevant advertisements being displayed to customers. In the case of ML, such blunders can set
off a chain of errors that can go undetected for long periods of time. And when they do get noticed, it takes quite some
time to recognize the source of the issue, and even longer to correct it.