Uploaded by khushivanshkar95

Unit-1 (1)

UNIT – 1
Data Visualization using Python
Packages for data visualization
Matplotlib: Visualization with Python
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Matplotlib makes easy things easy and hard things possible.
Matplotlib was created by John D. Hunter.
Matplotlib is open source and we can use it freely.
Features
1. Versatile: Matplotlib offers a wide range of plot types, including line plots, scatter plots, bar plots,
histograms, and more.
2. Easy to Use: It provides a simple interface for creating plots with minimal setup and code.
3. Customizable: Matplotlib allows users to customize every aspect of their plots, such as colors, labels,
titles, and styles.
4. Integration: It seamlessly integrates with NumPy and Pandas for data manipulation and analysis.
5. Exporting: Users can save plots in various file formats like PNG, PDF, SVG, etc., for sharing and
publication.
6. Interactive: Matplotlib supports interactive plotting features, enabling zooming, panning, and other
interactions.
7. Animations: Users can create animated visualizations to explore temporal trends or dynamic phenomena.
8. Community Support: Matplotlib has a large community of users and developers, providing extensive
documentation, tutorials, and support resources.
Import Matplotlib
Use the following code to import Matplotlib:
import matplotlib.pyplot as plt
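As a minimal sketch of the features listed above (customization and exporting in particular), the following draws one line plot and saves it as a PNG. The data values, the labels, and the file name squares.png are made up for illustration, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt

# Illustrative data (made up for this sketch)
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
# Customize color, marker, and legend label
ax.plot(x, y, color="green", marker="o", label="y = x squared")
ax.set_title("Squares")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Export the figure; the file name is arbitrary
fig.savefig("squares.png")
```

Changing the extension in savefig (for example to .pdf or .svg) is usually enough to export in another format.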
Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates
closely with pandas data structures.
Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and
arrays containing whole datasets and internally perform the necessary semantic mapping and statistical
aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the
different elements of your plots mean, rather than on the details of how to draw them.
Seaborn offers a number of features, including:
1. Assistance with the visualization of regression models
2. A choice of color schemes to enhance the visual appeal of your plots
3. Boxplots and violin plots for visualizing categorical variables (variables with categories, like
gender)
Importing seaborn
import seaborn as sns
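The following is a small sketch of two of the Seaborn features mentioned above: a boxplot of a categorical variable and a regression plot. The DataFrame and its column names (group, score, hours) are hypothetical and exist only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset (hypothetical column names)
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [4.1, 5.0, 4.6, 6.2, 7.1, 6.5],
    "hours": [1.0, 2.0, 1.5, 3.0, 4.0, 3.5],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Boxplot: distribution of a numeric variable for each category
sns.boxplot(data=df, x="group", y="score", ax=ax1)

# Regression plot: scatter points plus a fitted straight line
sns.regplot(data=df, x="hours", y="score", ax=ax2)

fig.tight_layout()
```

Both functions take the DataFrame and column names directly, which is the dataset-oriented style described above.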
Differences between Matplotlib and Seaborn
1. Matplotlib is well suited to basic plots, while more complex statistical plots
are easier to produce with Seaborn.
2.Compared to seaborn, Matplotlib has a less steep learning curve.
3.Compared to Matplotlib, Seaborn offers more appealing default color palettes.
However, if you like, matplotlib allows you to build your own color palettes.
4. Both libraries support interactive charting from within IPython: Seaborn draws
its plots with Matplotlib, so it inherits Matplotlib's interactive features.
5. Both libraries can draw boxplots and violin plots, but Seaborn offers higher-level
routines that plot categorical variables directly from a DataFrame.
6. Seaborn has built-in functions for visualizing regression models (for example, regplot and lmplot); with Matplotlib, you must compute the fit yourself and then plot it.
7. Compared to seaborn, Matplotlib provides a larger collection of functions.
Seaborn, however, is expanding more quickly than Matplotlib.
8. Matplotlib is distributed under a license based on the Python Software Foundation (PSF)
license, while Seaborn is licensed under the BSD 3-Clause license.
9. Seaborn is less popular than Matplotlib.
10. Matplotlib's documentation is inferior to Seaborn's.
Graphviz
Graphviz is a powerful tool for creating and visualizing graphs and networks, and it can be used from Python through the graphviz package. Graphviz is an
open-source graph visualization software package that allows users to create, manipulate, and visualize
graphs and networks. It provides tools for generating diagrams of graphs and networks from textual
descriptions in a simple language called DOT (Graph Description Language). Graphviz is widely used in
various fields, including computer science, data analysis, network analysis, bioinformatics, and software
engineering.
Features
1.Graph Visualization Tool: Graphviz is a software package for visualizing graphs and networks.
2.DOT Language: It uses the DOT language to describe the structure and attributes of graphs.
3.Automatic Layout: Graphviz automatically arranges nodes and edges for optimal visualization.
4.Various Output Formats: It supports multiple output formats like PNG, PDF, and SVG.
5.Customization: Users can customize node and edge attributes such as color, shape, and size.
6.Programming Interfaces: Graphviz provides interfaces for various programming languages.
7.Community Support: It has an active community and comprehensive documentation.
8.Used Across Disciplines: Graphviz is widely used in fields like computer science, data analysis, and
bioinformatics.
Import the Library
import graphviz
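As a rough sketch of how the Python graphviz package builds a DOT description, the example below creates a small directed graph and prints its DOT source. The node names and labels are invented for illustration. Rendering the graph to PNG, PDF, or SVG additionally requires the Graphviz software itself to be installed, so this sketch stops at the DOT text.

```python
import graphviz  # Python interface to Graphviz

# Build a small directed graph (node names and labels are made up)
dot = graphviz.Digraph(name="pipeline")
dot.node("A", "Load data")
dot.node("B", "Clean data")
dot.node("C", "Plot")
dot.edge("A", "B")
dot.edge("B", "C")

# The DOT description that Graphviz would lay out automatically
print(dot.source)
```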
Heat map
A heatmap is a graphical representation of data where the values of a matrix are represented as colors. It's a way to
visualize data in a 2D format, where each cell in the matrix is assigned a color based on its value. Heatmaps are
commonly used in various fields for:
1.Data Analysis: Heatmaps are used to visualize large datasets, allowing analysts to identify patterns,
trends, and correlations within the data quickly.
2.Correlation Analysis: In statistics, heatmaps are used to visualize correlation matrices, where the
strength and direction of relationships between variables are represented by colors.
3.Spatial Data Analysis: In geography and cartography, heatmaps are used to represent the intensity or
density of data points across a geographic area. For example, heatmaps can show the distribution of
population density, crime rates, or disease outbreaks on a map.
4.Gene Expression Analysis: In bioinformatics, heatmaps are used to visualize gene expression data,
where each row represents a gene, and each column represents a sample. Heatmaps help researchers
identify patterns of gene expression across different experimental conditions or samples.
5.Web Analytics: In web development and digital marketing, heatmaps are used to visualize user interactions on
websites or applications. Heatmaps can show which areas of a webpage receive the most clicks, mouse movements,
or attention from users.
6.Financial Analysis: In finance, heatmaps are used to visualize stock market data, where each cell represents the
performance of a particular stock or asset over time. Heatmaps help investors identify trends and make informed
decisions about investment portfolios.
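The correlation use case above can be sketched as follows, assuming synthetic data: three made-up variables are generated with NumPy, their correlation matrix is computed, and Matplotlib's imshow draws each cell as a color. The variable names a, b, c and the coolwarm colormap are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: b depends strongly on a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.5, size=200)
c = rng.normal(size=200)
labels = ["a", "b", "c"]

# 3 x 3 correlation matrix (one column per variable)
corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

fig, ax = plt.subplots()
# Each matrix cell becomes one colored square
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(im, ax=ax, label="correlation")
```

Fixing vmin and vmax at -1 and 1 keeps the color scale symmetric, so positive and negative correlations of equal strength get equally intense colors.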
Mosaic plots
Mosaic plots are used in data visualization to visualize the relationship between two or more categorical variables. They
are particularly useful when you want to examine the association or dependency between categorical variables and
understand how their levels are distributed across each other.
A mosaic plot is a graphical method for visualizing two or more qualitative (categorical) variables. It is
the graphical counterpart of a crosstab, or two-way table, and is used to summarize the relationship among
several categorical variables. The mosaic plot is based on conditional probabilities.
1.Comparing Subgroup Proportions: Mosaic plots are useful for comparing the proportions of subgroups
within different categories. For example, you might use a mosaic plot to compare the distribution of car types
(e.g., sedan, SUV, truck) across different regions (e.g., North, South, East, West).
2.Identifying Patterns and Trends: Mosaic plots can help identify patterns or trends in categorical data. By
visualizing the relationship between two categorical variables, you can observe how the levels of one variable
vary across the levels of another variable.
3.Assessing Association or Dependency: Mosaic plots are often used to assess the association or
dependency between two categorical variables. By examining the size and direction of the tiles in the mosaic
plot, you can determine whether there is a significant relationship between the variables.
4.Exploring Contingency Tables: Mosaic plots are commonly used in conjunction with contingency tables to
visualize the relationship between rows and columns. They provide a visual representation of the marginal and
conditional distributions of the variables in the contingency table.
5.Communicating Insights: Mosaic plots are effective for communicating insights about the
relationship between categorical variables to a non-technical audience. They provide a clear and
intuitive visualization of complex categorical data.
How to plot a mosaic plot in Python
To plot the mosaic plot, we will use a Credit Risk dataset. You can download it from here
First, we will import all the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
from itertools import product
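Since the Credit Risk dataset is not included here, the sketch below uses a tiny made-up table instead. The column names Gender and Loan_Status and all values are hypothetical. It builds a contingency table with pandas, converts it to a dictionary of counts, and passes that dictionary to the mosaic function from statsmodels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Made-up data standing in for the real dataset (hypothetical columns)
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female",
               "Male", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N", "N", "Y"],
})

# Contingency table (crosstab) of the two categorical variables
table = pd.crosstab(df["Gender"], df["Loan_Status"])
print(table)

# mosaic accepts a dict that maps category tuples to counts
counts = {(g, s): int(table.loc[g, s])
          for g in table.index for s in table.columns}

# Tile areas are proportional to the counts
fig, rects = mosaic(counts, title="Loan status by gender")
```

The returned rects dictionary maps each category combination to the position and size of its tile, which can help when checking the plot against the crosstab.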
Common Types of Data Plots
Bar chart
A bar chart is the most common data visualization for displaying the numerical values of categorical data and comparing
categories with one another. The categories are represented by rectangular bars of the same width and with
heights (for vertical bar charts) or lengths (for horizontal bar charts) proportional to the numerical values that they
correspond to.
Line plot
A line plot is a type of data chart that shows a progression of a variable from left to right along the x-axis through data
points connected by straight line segments. Most typically, the change of a variable is plotted over time. Indeed, line
plots are often used for visualizing time series, as discussed in the tutorial on Matplotlib time series line plots.
Scatter plot
A scatter plot is a data visualization type that displays the relationships between two variables plotted as data points on
the coordinate plane. This type of data plot is used to check whether the two variables are correlated, how
strong this correlation is, and if there are distinct clusters in the data.
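A minimal sketch of this check, assuming synthetic data: two made-up variables with a linear relationship are plotted as a scatter plot, and the Pearson correlation coefficient is computed with NumPy to quantify how strong the relationship is.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y grows with x, plus random noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 * x + rng.normal(scale=2.0, size=100)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
ax.scatter(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Pearson r = {r:.2f}")
```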
Histogram
A histogram is a type of data plot that represents the frequency distribution of the values of a numerical variable.
Under the hood, it splits the data into value range groups called bins, counts the number of points related to each bin,
and displays each bin as a vertical bar, with the height proportional to the count value for that bin. A histogram can be
considered a specific type of bar chart, except that its adjacent bars are attached without gaps, given the continuous
nature of bins.
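The binning step described above can be sketched with NumPy before drawing anything. The values below are made up, and the choice of four bins over the range 1 to 9 is an assumption for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up values of a numerical variable
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range 1..9 into 4 equal-width bins and count the points in each
counts, edges = np.histogram(values, bins=4, range=(1, 9))
print(edges)   # bin boundaries
print(counts)  # number of points per bin

# Draw the same bins as adjacent bars without gaps
fig, ax = plt.subplots()
ax.hist(values, bins=edges, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("count")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.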
Box plot
A box plot is a data plot type that shows a set of five descriptive statistics of the data: the minimum and maximum
values (excluding the outliers), the median, and the first and third quartiles. Optionally, it can also show the mean
value. A box plot is the right choice if you're interested only in these statistics, without digging into the real
underlying data distribution.
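The statistics behind a box plot can be computed by hand, as in this sketch with made-up data. It assumes the common convention, which is also Matplotlib's default, that whiskers extend at most 1.5 times the interquartile range beyond the quartiles and that points outside that range are drawn as outliers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data with one unusually large value
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# First quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles count as outliers
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
outliers = [v for v in data if v < low_limit or v > high_limit]
inside = [v for v in data if low_limit <= v <= high_limit]

print("min (excluding outliers):", min(inside))
print("Q1, median, Q3:", q1, median, q3)
print("max (excluding outliers):", max(inside))
print("outliers:", outliers)

# The same statistics drawn as a box plot, with the mean shown as well
fig, ax = plt.subplots()
bp = ax.boxplot(data, showmeans=True)
```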
Pie chart
A pie chart is a type of data visualization represented by a circle divided into sectors, where each sector corresponds to
a certain category of the categorical data, and the angle of each sector reflects the proportion of that category as a part
of the whole. Unlike bar charts, pie charts are supposed to depict the categories that constitute the whole, e.g.,
passengers of a ship.
Different marker shapes
Marker    Description
'o'       Circle
'*'       Star
'.'       Point
','       Pixel
'x'       X
'X'       X (filled)
'+'       Plus
'P'       Plus (filled)
's'       Square
'D'       Diamond
'd'       Diamond (thin)
'p'       Pentagon
Defined linestyles
Style                Symbol
'solid' (default)    '-'
'dotted'             ':'
'dashed'             '--'
'dashdot'            '-.'
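The markers and linestyles listed above are passed to Matplotlib's plot function through the marker and linestyle arguments. The sketch below pairs four of the markers with the four linestyles; the data points are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt

# A few markers and all four named linestyles from the lists above
markers = {"o": "Circle", "*": "Star", "s": "Square", "D": "Diamond"}
linestyles = ["solid", "dotted", "dashed", "dashdot"]

x = [0, 1, 2, 3]
fig, ax = plt.subplots()
for offset, ((marker, name), style) in enumerate(zip(markers.items(), linestyles)):
    # Shift each line upward so the lines do not overlap
    y = [v + offset for v in x]
    ax.plot(x, y, marker=marker, linestyle=style, label=f"{name}, {style}")
ax.legend()
```

The short symbols work as well, so linestyle=":" is equivalent to linestyle="dotted".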
Predictive modeling using visualization
Predictive modeling using visualization involves leveraging various visualization techniques to explore and
analyze data, identify patterns, and build predictive models. Here are some visualization methods commonly
used in predictive modeling:
1.Heat Maps:
1. Heat maps are graphical representations of data where values are depicted using colors. They are
often used to visualize the relationships between two or more variables in a dataset.
2. In predictive modeling, heat maps can be used to visualize correlations between features or
variables. Strong correlations (positive or negative) are indicated by intense colors, while weak
correlations are represented by lighter shades.
3. Heat maps can help identify important features for predictive modeling and assess multicollinearity
among variables.
4. Which colors indicate high or low values depends on the chosen colormap. In many common color
schemes, darker or more intense colors indicate higher values, while lighter colors indicate lower
values. Heat maps are often used to visualize data density, distribution, or variations across
different categories or dimensions.
Cool Colors (e.g., Blue, Green): In temperature-style colormaps, these colors usually represent lower
values or cooler temperatures in the data.
Warm Colors (e.g., Yellow, Orange, Red): In temperature-style colormaps, these colors typically represent
higher values or warmer temperatures in the data.
2.Mosaic Plots:
1. Mosaic plots are graphical displays that visualize the relationship between two or more categorical
variables in a dataset. They are particularly useful for visualizing the association between
categorical variables.
2. In predictive modeling, mosaic plots can help explore the interaction between predictor variables
and the target variable. They provide insights into how the distribution of the target variable varies
across different categories of predictor variables.
3. Mosaic plots can assist in feature selection and understanding the predictive power of categorical
variables.
4. They are particularly useful for displaying complex relationships or distributions within multiple
categories simultaneously.
3. Trees (Decision Trees, Random Forests, etc.):
1. Decision trees and ensemble methods like random forests are popular machine learning
algorithms used for predictive modeling. They create a tree-like structure to represent the
decision-making process based on input features.
2. Visualization of decision trees can aid in understanding the decision rules learned by the model
and interpreting its predictions. Tree visualization techniques include tree diagrams, dendrograms,
and graph representations.
3. Decision boundaries and splits visualized in decision trees provide insights into how the model
segments the feature space and makes predictions.
4. Clustering:
1. Clustering algorithms such as k-means, hierarchical clustering, and DBSCAN group similar
data points together based on their features or attributes.
2. Visualization techniques like scatter plots, dendrograms, and cluster heat maps are used to
visualize clusters and assess the quality of clustering.
3. Clustering visualization helps identify patterns and structures in the data, which can be useful
for feature engineering, anomaly detection, and segmentation tasks in predictive modeling.
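As one possible illustration of the tree visualization point above, the sketch below fits a shallow decision tree with scikit-learn and prints its learned decision rules as a text diagram. The use of scikit-learn, the bundled Iris dataset, and a maximum depth of 2 are all assumptions made for this example and are not taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled example dataset (an assumption; any labeled data would do)
iris = load_iris()

# Keep the tree shallow so the printed diagram stays readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text rendering of the splits (decision rules) learned by the model
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the output is one split on a feature threshold, which shows how the model segments the feature space.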