UNIT – 1
Data Visualization using Python

Packages for data visualization

Matplotlib: Visualization with Python
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible. Matplotlib was created by John D. Hunter. Matplotlib is open source and we can use it freely.

Features
1. Versatile: Matplotlib offers a wide range of plot types, including line plots, scatter plots, bar plots, histograms, and more.
2. Easy to Use: It provides a simple interface for creating plots with minimal setup and code.
3. Customizable: Matplotlib allows users to customize every aspect of their plots, such as colors, labels, titles, and styles.
4. Integration: It seamlessly integrates with NumPy and Pandas for data manipulation and analysis.
5. Exporting: Users can save plots in various file formats like PNG, PDF, SVG, etc., for sharing and publication.
6. Interactive: Matplotlib supports interactive plotting features, enabling zooming, panning, and other interactions.
7. Animations: Users can create animated visualizations to explore temporal trends or dynamic phenomena.
8. Community Support: Matplotlib has a large community of users and developers, providing extensive documentation, tutorials, and support resources.

Import Matplotlib
Use the given code to import Matplotlib:
    import matplotlib.pyplot as plt

Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.
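The dataset-oriented style described above can be sketched as follows. This is a minimal illustration, not a recommended workflow: the DataFrame, its column names ("day", "sales") and the title are made up for the example, and the Agg backend is selected only so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one row per observation
df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "sales": [10.0, 14.0, 20.0, 22.0, 24.0],
})

fig, ax = plt.subplots(figsize=(4, 3))

# Seaborn receives the whole DataFrame plus column names; it groups the rows
# by "day" and plots the mean of "sales" for each group (its default estimator)
sns.barplot(data=df, x="day", y="sales", ax=ax)

# The result is an ordinary Matplotlib Axes, so Matplotlib calls customize it
ax.set_title("Mean sales per day")
ax.set_ylabel("Mean sales")

# Heights of the drawn bars, i.e. the per-day means computed by Seaborn
bar_heights = sorted(patch.get_height() for patch in ax.patches)

# Export through Matplotlib; an in-memory buffer stands in for a file name
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

Note that the aggregation (the mean per day) is never written out by hand; with plain Matplotlib the means would have to be computed first and then passed to a bar-plotting call.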
Seaborn offers a number of features, including:
1. Assistance with the visualization of regression models.
2. A choice of color schemes to enhance the visual appeal of your plots.
3. Boxplots and violin plots, which are useful for plotting categorical variables (variables with categories, like gender).

Importing seaborn
    import seaborn as sns

Differences between Matplotlib and Seaborn
1. Matplotlib is well suited to basic plots, while more complex statistical plots are easier to produce with Seaborn.
2. Matplotlib's basic interface has a gentler learning curve than Seaborn's.
3. Compared to Matplotlib, Seaborn offers more appealing default color palettes. However, if you like, Matplotlib allows you to build your own color palettes.
4. Seaborn draws its plots with Matplotlib, so interactive features such as zooming and panning come from the Matplotlib backend in both cases; Seaborn adds no interactive layer of its own.
5. Both libraries can draw boxplots and violin plots, but Seaborn can group them by a categorical variable directly from a DataFrame, whereas Matplotlib expects the data to be split into groups first.
6. Seaborn has built-in functions for visualizing regression models (for example regplot and lmplot). With Matplotlib, the fit must be computed separately and then plotted.
7. Compared to Seaborn, Matplotlib provides a larger collection of functions, although Seaborn continues to add new functionality.
8. Both libraries are open source under permissive licenses: Matplotlib uses its own license, which is based on the Python Software Foundation (PSF) license, and Seaborn uses the BSD 3-Clause license.
9. Matplotlib is the more widely used of the two.
10. Seaborn's documentation is often considered easier to follow than Matplotlib's, which is larger because it covers far more functionality.

Graphviz
Graphviz is a powerful tool for creating and visualizing graphs and networks, and it can be used from Python. Graphviz is an open-source graph visualization software package that allows users to create, manipulate, and visualize graphs and networks. It provides tools for generating diagrams of graphs and networks from textual descriptions in a simple graph description language called DOT. Graphviz is widely used in various fields, including computer science, data analysis, network analysis, bioinformatics, and software engineering.
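To give an idea of what a DOT description looks like, the sketch below builds one as plain text for a small directed graph. The helper name to_dot and the example edges are hypothetical and exist only for illustration; no Graphviz installation is needed to run it, because it only produces the text.

```python
def to_dot(edges, name="G"):
    """Return a DOT description of a directed graph.

    `edges` is a list of (source, target) pairs; node names are quoted so
    that they may contain spaces.
    """
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: the steps of a small analysis pipeline
pipeline = [("load data", "clean data"), ("clean data", "plot")]
dot_text = to_dot(pipeline)
print(dot_text)
```

The printed text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng pipeline.dot -o pipeline.png`, or passed to `graphviz.Source(dot_text)` from the Python package, provided Graphviz itself is installed.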
Features
1. Graph Visualization Tool: Graphviz is a software package for visualizing graphs and networks.
2. DOT Language: It uses the DOT language to describe the structure and attributes of graphs.
3. Automatic Layout: Graphviz arranges nodes and edges automatically using its layout engines, so positions do not have to be specified by hand.
4. Various Output Formats: It supports multiple output formats like PNG, PDF, and SVG.
5. Customization: Users can customize node and edge attributes such as color, shape, and size.
6. Programming Interfaces: Graphviz provides interfaces for various programming languages, including Python.
7. Community Support: It has an active community and comprehensive documentation.
8. Used Across Disciplines: Graphviz is widely used in fields like computer science, data analysis, and bioinformatics.

Import the Library
    import graphviz

Heat map
A heatmap is a graphical representation of data where the values of a matrix are represented as colors. It's a way to visualize data in a 2D format, where each cell in the matrix is assigned a color based on its value.

Heatmaps are commonly used in various fields for:
1. Data Analysis: Heatmaps are used to visualize large datasets, allowing analysts to identify patterns, trends, and correlations within the data quickly.
2. Correlation Analysis: In statistics, heatmaps are used to visualize correlation matrices, where the strength and direction of relationships between variables are represented by colors.
3. Spatial Data Analysis: In geography and cartography, heatmaps are used to represent the intensity or density of data points across a geographic area. For example, heatmaps can show the distribution of population density, crime rates, or disease outbreaks on a map.
4. Gene Expression Analysis: In bioinformatics, heatmaps are used to visualize gene expression data, where each row represents a gene, and each column represents a sample. Heatmaps help researchers identify patterns of gene expression across different experimental conditions or samples.
5. Web Analytics: In web development and digital marketing, heatmaps are used to visualize user interactions on websites or applications. Heatmaps can show which areas of a webpage receive the most clicks, mouse movements, or attention from users.
6. Financial Analysis: In finance, heatmaps are used to visualize stock market data, where each cell represents the performance of a particular stock or asset over a given period. Heatmaps help investors identify trends and make informed decisions about investment portfolios.

Mosaic plots
Mosaic plots are used in data visualization to show the relationship between two or more categorical variables. They are particularly useful when you want to examine the association or dependency between categorical variables and understand how their levels are distributed across each other.

A mosaic plot is a graphical method for visualizing data from two or more qualitative (categorical) variables. It is the graphical counterpart of a cross-tabulation (crosstab, or two-way table) and is used to summarize the relationship among several categorical variables. The mosaic plot is based on conditional probabilities: the area of each tile is proportional to the number of observations in that combination of categories.

Typical uses of mosaic plots:
1. Comparing Subgroup Proportions: Mosaic plots are useful for comparing the proportions of subgroups within different categories. For example, you might use a mosaic plot to compare the distribution of car types (e.g., sedan, SUV, truck) across different regions (e.g., North, South, East, West).
2. Identifying Patterns and Trends: Mosaic plots can help identify patterns or trends in categorical data. By visualizing the relationship between two categorical variables, you can observe how the levels of one variable vary across the levels of another variable.
3. Assessing Association or Dependency: Mosaic plots are often used to assess the association or dependency between two categorical variables. By comparing the relative sizes of the tiles, you can judge whether the variables appear to be related. Confirming that a relationship is statistically significant requires a formal test, such as the chi-square test.
4. Exploring Contingency Tables: Mosaic plots are commonly used in conjunction with contingency tables to visualize the relationship between rows and columns. They provide a visual representation of the marginal and conditional distributions of the variables in the contingency table.
5. Communicating Insights: Mosaic plots are effective for communicating insights about the relationship between categorical variables to a non-technical audience. They provide a clear and intuitive visualization of complex categorical data.

How to plot a Mosaic Plot in Python
To plot a mosaic plot, we will use a Credit Risk dataset. First, we import all the necessary libraries:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic
    from itertools import product

Common Types of Data Plots

Bar chart
A bar chart is the most common data visualization for displaying the numerical values of categorical data so that the categories can be compared with one another. The categories are represented by rectangular bars of the same width and with heights (for vertical bar charts) or lengths (for horizontal bar charts) proportional to the numerical values that they correspond to.

Line plot
A line plot is a type of data chart that shows the progression of a variable from left to right along the x-axis through data points connected by straight line segments. Most typically, the change of a variable is plotted over time, which is why line plots are often used for visualizing time series.

Scatter plot
A scatter plot is a data visualization type that displays the relationship between two variables plotted as data points on the coordinate plane. This type of data plot is used to check whether the two variables are correlated, how strong the correlation is, and whether there are distinct clusters in the data.
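The three chart types described so far (bar chart, line plot, scatter plot) can be drawn side by side with Matplotlib. The sketch below uses made-up data throughout: the categories, the monthly temperature curve and the random points are assumptions chosen only to give each chart something to show.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for each chart type
categories = ["A", "B", "C"]
values = [5, 9, 3]
months = np.arange(1, 13)
temperature = 10 + 8 * np.sin((months - 4) * np.pi / 6)  # smooth yearly cycle
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # y depends on x plus noise

fig, (ax_bar, ax_line, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))

# Bar chart: one bar per category, height proportional to its value
ax_bar.bar(categories, values)
ax_bar.set_title("Bar chart")

# Line plot: a variable changing over time, points joined by line segments
ax_line.plot(months, temperature)
ax_line.set_xlabel("Month")
ax_line.set_title("Line plot")

# Scatter plot: one point per (x, y) pair to reveal the relationship
ax_scatter.scatter(x, y, s=10)
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
ax_scatter.set_title(f"Scatter plot (r = {r:.2f})")

fig.tight_layout()
```

The correlation coefficient shown in the scatter plot title puts a number on what the point cloud shows visually: the closer the points lie to a straight line, the closer the coefficient is to 1 or -1.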
Histogram
A histogram is a type of data plot that represents the frequency distribution of the values of a numerical variable. Under the hood, it splits the data into value ranges called bins, counts the number of points that fall into each bin, and displays each bin as a vertical bar, with the height proportional to the count for that bin. A histogram can be considered a specific type of bar chart, except that its adjacent bars are drawn without gaps, given the continuous nature of the bins.

Box plot
A box plot is a data plot type that shows a set of five descriptive statistics of the data: the minimum and maximum values (excluding the outliers), the median, and the first and third quartiles. Optionally, it can also show the mean value. A box plot is the right choice if you're interested only in these statistics, without digging into the real underlying data distribution.

Pie chart
A pie chart is a type of data visualization represented by a circle divided into sectors, where each sector corresponds to a certain category of the categorical data, and the angle of each sector reflects the proportion of that category as a part of the whole. Unlike bar charts, pie charts are supposed to depict the categories that constitute the whole, e.g., passengers of a ship.

Different marker shapes
Marker   Description
'o'      Circle
'*'      Star
'.'      Point
','      Pixel
'x'      X
'X'      X (filled)
'+'      Plus
'P'      Plus (filled)
's'      Square
'D'      Diamond
'd'      Diamond (thin)
'p'      Pentagon

Defined linestyles
Style                Symbol
'solid' (default)    '-'
'dotted'             ':'
'dashed'             '--'
'dashdot'            '-.'

Predictive modeling using visualization
Predictive modeling using visualization involves leveraging various visualization techniques to explore and analyze data, identify patterns, and build predictive models. Here are some visualization methods commonly used in predictive modeling:

1. Heat Maps:
   1. Heat maps are graphical representations of data where values are depicted using colors.
      They are often used to visualize the relationships between two or more variables in a dataset.
   2. In predictive modeling, heat maps can be used to visualize correlations between features or variables. Strong correlations (positive or negative) are indicated by intense colors, while weak correlations are represented by lighter shades.
   3. Heat maps can help identify important features for predictive modeling and assess multicollinearity among variables.
   4. How colors map to values depends on the chosen color scale. In many sequential scales, darker or more intense colors indicate higher values, while lighter colors indicate lower values. In cool-to-warm scales, cool colors (e.g., blue, green) usually represent lower values or cooler temperatures, while warm colors (e.g., yellow, orange, red) typically represent higher values or warmer temperatures. Heat maps are often used to visualize data density, distribution, or variations across different categories or dimensions.

2. Mosaic Plots:
   1. Mosaic plots are graphical displays that visualize the relationship between two or more categorical variables in a dataset. They are particularly useful for visualizing the association between categorical variables.
   2. In predictive modeling, mosaic plots can help explore the interaction between predictor variables and the target variable. They provide insights into how the distribution of the target variable varies across different categories of predictor variables.
   3. Mosaic plots can assist in feature selection and in understanding the predictive power of categorical variables.
   4. They are particularly useful for displaying complex relationships or distributions within multiple categories simultaneously.

3. Trees (Decision Trees, Random Forests, etc.):
   1. Decision trees and ensemble methods like random forests are popular machine learning algorithms used for predictive modeling. They create a tree-like structure to represent the decision-making process based on input features.
   2. Visualization of decision trees can aid in understanding the decision rules learned by the model and interpreting its predictions. Tree visualization techniques include tree diagrams, dendrograms, and graph representations.
   3. Decision boundaries and splits visualized in decision trees provide insights into how the model segments the feature space and makes predictions.

4. Clustering:
   1. Clustering algorithms such as k-means, hierarchical clustering, and DBSCAN group similar data points together based on their features or attributes.
   2. Visualization techniques like scatter plots, dendrograms, and cluster heat maps are used to visualize clusters and assess the quality of clustering.
   3. Clustering visualization helps identify patterns and structures in the data, which can be useful for feature engineering, anomaly detection, and segmentation tasks in predictive modeling.
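Two of the techniques listed above, a correlation heat map and a scatter plot of clusters, can be sketched with NumPy and Matplotlib alone. Everything in the example is an assumption made for illustration: the data are synthetic, the feature names are invented, and the kmeans function is a small hand-written version of the standard k-means iteration with a simple deterministic starting rule, not a library implementation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: two well-separated groups of 50 points with 2 features
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# A third feature, built to be negatively related to the first one
f3 = -X[:, 0] + rng.normal(scale=0.5, size=len(X))
features = np.column_stack([X, f3])
names = ["f1", "f2", "f3"]  # invented feature names

# Correlation matrix of the features (columns are variables)
corr = np.corrcoef(features, rowvar=False)


def kmeans(points, k, n_iter=20):
    """Minimal k-means: returns (labels, centers)."""
    # Deterministic start (an assumption of this sketch): the first point,
    # then repeatedly the point farthest from the centers chosen so far
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest center
        distances = np.linalg.norm(
            points[:, None, :] - centers[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers


labels, centers = kmeans(X, k=2)

fig, (ax_heat, ax_clusters) = plt.subplots(1, 2, figsize=(9, 4))

# Heat map with a cool-to-warm scale: blue = negative, red = positive
im = ax_heat.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(names)))
ax_heat.set_xticklabels(names)
ax_heat.set_yticks(range(len(names)))
ax_heat.set_yticklabels(names)
ax_heat.set_title("Correlation heat map")
fig.colorbar(im, ax=ax_heat)

# Scatter plot colored by cluster label, with the cluster centers marked
ax_clusters.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=12)
ax_clusters.scatter(centers[:, 0], centers[:, 1], marker="X", s=120, c="red")
ax_clusters.set_title("k-means clusters")

fig.tight_layout()
```

In the heat map, the f1/f2 cell is strongly warm and the f1/f3 cell strongly cool, which is the kind of pattern that flags redundant (multicollinear) features. In the scatter plot, well-separated colors suggest that the clustering has found real structure in the data.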