Handbook of Computer Programming with Python

This handbook provides a hands-on experience based on the underlying topics, and assists students
and faculty members in developing their algorithmic thought process and programs for given computational problems. It can also be used by professionals who possess the necessary theoretical and
computational thinking background but are presently making their transition to Python.
Key Features:
• Discusses concepts such as basic programming principles, OOP principles, database programming, GUI programming, application development, data analytics and visualization,
statistical analysis, virtual reality, data structures and algorithms, machine learning, and
deep learning.
• Provides the code and the output for all the concepts discussed.
• Includes a case study at the end of each chapter.
This handbook will benefit students of computer science, information systems, and information
technology, or anyone who is involved in computer programming (entry-to-intermediate level), data
analytics, HCI-GUI, and related disciplines.
Edited by
Dimitrios Xanthidis
Christos Manolas
Ourania K. Xanthidou
Han-I Wang
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2023 selection and editorial matter, Dimitrios Xanthidis, Christos Manolas, Ourania K. Xanthidou, Han-I Wang;
individual chapters, the contributors
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
ISBN: 978-0-367-68777-9 (hbk)
ISBN: 978-0-367-68778-6 (pbk)
ISBN: 978-1-003-13901-0 (ebk)
DOI: 10.1201/9781003139010
Typeset in Times
by codeMantra
Access the Support Material: https://www.routledge.com/9780367687779
Contents
Editors
Contributors
Chapter 1 Introduction
Dimitrios Xanthidis, Christos Manolas, Ourania K. Xanthidou, and Han-I Wang
Chapter 2 Introduction to Programming with Python
Ameur Bensefia, Muath Alrammal, and Ourania K. Xanthidou
Chapter 3 Object-Oriented Programming in Python
Ghazala Bilquise, Thaeer Kobbaey, and Ourania K. Xanthidou
Chapter 4 Graphical User Interface Programming with Python
Ourania K. Xanthidou, Dimitrios Xanthidis, and Sujni Paul
Chapter 5 Application Development with Python
Dimitrios Xanthidis, Christos Manolas, and Hanêne Ben-Abdallah
Chapter 6 Data Structures and Algorithms with Python
Thaeer Kobbaey, Dimitrios Xanthidis, and Ghazala Bilquise
Chapter 7 Database Programming with Python
Dimitrios Xanthidis, Christos Manolas, and Tareq Alhousary
Chapter 8 Data Analytics and Data Visualization with Python
Dimitrios Xanthidis, Han-I Wang, and Christos Manolas
Chapter 9 Statistical Analysis with Python
Han-I Wang, Christos Manolas, and Dimitrios Xanthidis
Chapter 10 Machine Learning with Python
Muath Alrammal, Dimitrios Xanthidis, and Munir Naveed
Chapter 11 Introduction to Neural Networks and Deep Learning
Dimitrios Xanthidis, Muhammad Fahim, and Han-I Wang
Chapter 12 Virtual Reality Application Development with Python
Christos Manolas, Ourania K. Xanthidou, and Dimitrios Xanthidis
Appendix: Case Studies Solutions
Index
Editors
Dimitrios Xanthidis holds a PhD in Information Systems from University College London. For the
past 25 years, he has been teaching computer science subjects with a focus on programming and
software development, and data structures and databases in various tertiary education institutions.
Currently, he is working at the Higher Colleges of Technology in Dubai, U.A.E. Dimitrios’ research
interests and work revolve around the topics of data science, machine learning/deep learning,
virtual/augmented reality, and emerging technologies.
Christos Manolas holds a PhD in Stereoscopic 3D Media (University of York, UK), and degrees
and qualifications in Postproduction (MA), Music Technology (MSc), Music Performance, Software
Development, and Media Production. Christos’ career includes work as a software developer, musician, audio producer, and educator for over 20 years. His research interests include multimodal
(audiovisual) perception, spatial audio, interactive and immersive media (VR/AR/XR), and generally the impact and role of digital technologies on media production.
Ourania K. Xanthidou is a PhD researcher at Brunel University, London. She holds an MSc in
Computer Science from the University of Malaya, Kuala Lumpur, Malaysia. She has more than
15 years of involvement with the IT industry in the form of supporting IT departments of SMEs
and more than 5 years of teaching experience in tertiary education. Ourania’s research interests are
in the areas of eHealth, smart health, databases, web application development, and object-oriented
programming with a focus on application development for VR/AR/XR.
Han-I Wang holds a PhD in Health Economics from the University of York, UK. Han-I has been
working as a research fellow for over 10 years, starting at the Epidemiology & Cancer Statistics
Group (ECSG) before joining the Mental Health and Addiction Research Group (MHARG) at the
University of York, UK. Her area of expertise spans cost analysis, health outcome research,
and decision modeling using complex patient-level data, and her main research interests are related
to the exploration of different decision-modeling techniques and their application to predict
healthcare expenditure, patients’ quality of life, and life expectancy.
Contributors
Tareq Alhousary
Business Information Systems
University of Salford
Manchester, United Kingdom
and
Department of Management Information
Systems
Dhofar University, College of Commerce and
Business Administration
Salalah, Oman
Muath Alrammal
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
and
LACL (Laboratoire d’Algorithmique,
Complexité et Logique)
University Paris-Est (UPEC)
Créteil, France
Hanêne Ben-Abdallah
Computer and Information Science
University of Pennsylvania
Philadelphia, PA
Ameur Bensefia
Department of Genie Informatique
University of Rouen Normandy
Laboratoire d’Informatique de Traitement de
l’Information et des Systèmes (LITIS)
Rouen, France
and
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Ghazala Bilquise
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Muhammad Fahim
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Thaeer Kobbaey
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Christos Manolas
Department of Theatre, Film, Television and
Interactive Media
The University of York
York, United Kingdom
and
Department of Media Works
Ravensbourne University London
London, United Kingdom
Munir Naveed
Department of Computer Science
University of Huddersfield
Huddersfield, United Kingdom
and
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Sujni Paul
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Han-I Wang
Department of Health Sciences
The University of York
York, United Kingdom
Dimitrios Xanthidis
School of Library, Archives, and Information
Sciences
University College London
London, United Kingdom
and
Department of Computer and Information
Sciences
Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
Ourania K. Xanthidou
Department of Computer Science
Brunel University of London
Uxbridge, United Kingdom
1
Introduction
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Christos Manolas
The University of York
Ravensbourne University London
Ourania K. Xanthidou
Brunel University of London
Han-I Wang
The University of York
CONTENTS
1.1 Introduction
1.2 Audience
1.3 Getting Started with Jupyter Notebook
1.4 Creating Standalone, Executable Files
1.5 Structure of this Book
References
1.1 INTRODUCTION
Undoubtedly, at the time of writing, Python is among the most popular computer programming
languages. Alongside other common languages like C# and Java, it is often grouped with the broader
family of C/C++-based languages, and its reference implementation (CPython) is written in C, which
is why many of its packages and modules build on C/C++ libraries. Python first appeared in 1991,
before both Java and C#, and it is now widely adopted as the platform of choice by academic and
corporate institutions and organizations on a global scale.
Like other C/C++-based languages, Python supports the structured programming paradigm and the
associated programming principles of sequence, selection, and repetition, as well as the concepts of
functions and arrays (as lists). A thorough presentation of such concepts is both beyond the scope
of this book and possibly unnecessary, as this was the subject of the seminal works of computer
science giants like Knuth, Stroustrup, and Aho (Aho et al., 1983; Knuth, 1997; Stroustrup,
2013). Readers interested in an in-depth understanding of these concepts on a theoretical basis are
encouraged to refer to such works that form the backbone of modern programming. As an
Object-Oriented Programming (OOP) platform, Python provides all the facilities and tools to support
the OOP paradigm. Unlike its counterparts (i.e., C++, C#, and Java), Python does not provide a
streamlined, centralized IDE to support GUI programming, but it does offer a significant number of
related modules that cover most, if not all, of the various GUI requirements one may encounter. It
includes a number of modules that allow for the implementation of database programming, web
development, and mobile development projects, as well as platforms, modules, and methods that can
be used for machine and deep learning applications and even virtual and augmented reality project
development. Nevertheless, one of the main reasons that made Python such a popular option among
computer science professionals and academics is the wealth of modules and packages it offers for
data science tasks, including a large variety of libraries and tools specifically designed for data
analytics, data visualization, and statistical analysis tasks.
DOI: 10.1201/9781003139010-1
Arguably, there is an abundance of online resources and tutorials and printed books that address
most of the aforementioned topics in great detail. On the technical side, such resources may seem
too complicated for someone who is currently studying the subject or approaches it without prior
programming knowledge and experience. In other cases, resources may be structured more like
reference books that may focus on particular topics without covering the introductory parts of
computing with Python that some readers may find useful. This book aims to fill this gap
by exploring how Python can be used to address various computational tasks of introductory to
intermediate difficulty level, while also providing a basic theoretical introduction to the underlying
concepts.
1.2 AUDIENCE
This book focuses on students of computer science, information systems, and information technology, or anyone who is involved in computer programming, data analytics, HCI-GUI, and related
disciplines, at an entry-to-intermediate level. This book aims to provide a hands-on experience
based on the underlying topics, and assist students and faculty members in developing their algorithmic thought process and programs for given computational problems. It can also be used by
professionals who possess the necessary theoretical and computational thinking background but are
presently making their transition to Python.
Considering the above, this book includes a wealth of examples and the associated Python
code and output, presented in a context that also discusses the underlying concepts and their
applications. It also provides key concepts in the form of quick access observations, so that the
reader can skim through the various topics. Observations can be used as a reference and navigation tool, or as reminders for points for discussion and in-class presentation in the case of using
this book as a teaching resource. Chapters are also accompanied by related exercises and case
studies that can be used in this context, and their solutions are provided in the Appendix at the
end of this book.
1.3 GETTING STARTED WITH JUPYTER NOTEBOOK
Ample information and support are available through online community channels and the
official documentation and guides in terms of installing and running Python programming
environments. Nevertheless, this section provides a brief and straightforward guide on how to use
Anaconda Navigator and Jupyter Notebook in order to interpret and execute Python code, as
the majority of examples in this book have been implemented and tested using this particular
configuration.
Once Anaconda Navigator is launched, a number of different editors and environments are
presented in the home page (Figure 1.1).
Launching the Jupyter Notebook (i.e., clicking the Launch button) initiates a web interface based
on the file directory of the local machine (Figure 1.1). To create a new Python program, the user
can select New from the top right corner and the Python 3 notebook menu option (Figure 1.2). This
action will launch a new Python file under Jupyter with a default name. This can be changed by
clicking on the file name.
FIGURE 1.1 Anaconda IDE homepage.
FIGURE 1.2 Create a new Python file in Jupyter Notebook.
The Jupyter editor is organized in cells. The user can add each line of code to a separate cell or add
multiple lines to the same cell (Figure 1.3). The Run button in the main toolbar is used to execute
the code in the selected cell. If the code is free from errors, the interpreter moves to the next
cell; otherwise, an error message is displayed immediately after the cell where the error occurred
(Figure 1.4).
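The cell-by-cell behavior described above can be imitated in plain Python. The following is only a rough sketch and not how Jupyter itself is implemented: the function name run_cells and the sample cell contents are hypothetical, and the list of strings merely stands in for notebook cells. It shows that cells share their variables and that the error is reported against the cell in which it occurred.

```python
# Each string stands in for the contents of one notebook cell (hypothetical example).
cells = [
    "radius = 2",
    "area = 3.14159 * radius ** 2",
    "print(undefined_name)",  # this cell raises a NameError
    "print(area)",            # not reached in this sketch
]


def run_cells(cells):
    """Run the cells in order in one shared namespace and stop at the first error."""
    namespace = {}
    for number, source in enumerate(cells, start=1):
        try:
            exec(source, namespace)
        except Exception as error:
            # Report the error against the cell in which it occurred.
            return namespace, f"Cell {number}: {type(error).__name__}: {error}"
    return namespace, None


namespace, message = run_cells(cells)
print(message)
```

Variables defined in earlier cells (here, radius and area) remain available to later cells, which is why the order in which cells are run matters.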
FIGURE 1.3 Jupyter’s editor.
FIGURE 1.4 Run a Python program on Jupyter.
1.4 CREATING STANDALONE, EXECUTABLE FILES
With the exception of Chapter 12: Virtual Reality Application Development with Python, which
discusses applications that demand specific and highly specialized development platforms, the Python
scripts and examples presented in this book were implemented and tested natively in the Anaconda
Jupyter environment. In this context, developing and testing software solutions is rather
straightforward and intuitive. However, when it comes to the actual deployment of applications in
more realistic scenarios, things become slightly more complex. This is mainly due to the fact
that the Python code one develops is usually dependent on a number of external libraries, packages,
and files of various formats. These are automatically provided in the background when working within
the Anaconda environment, but this is not necessarily the case when scripts are exported as
standalone files. The required libraries and resources may be located in numerous different places
within the file structures of the computer and/or network systems used during development.
In the context of application deployment, references to such external files and objects are
generally referred to as application dependencies. Dependencies form a crucial and essential part of
the developed application, and the underlying files must be provided alongside the final deliverable
program (e.g., a standalone, executable application), as their absence will prevent the program from
running correctly on machines lacking the necessary libraries and file structures. Fortunately, the
latter are automatically selected and packaged by special routines and processes during the
deployment phase of the development cycle. This way, once the final deployment package is created,
one can run the application on other computers, irrespective of whether these include the necessary
files and libraries or not.
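To make the notion of dependencies more concrete, the sketch below lists the top-level modules that a script imports, using only the standard ast module. It is a simplified illustration, not the analysis that packaging tools actually perform: the function name imported_modules and the sample script are hypothetical, and data files or dynamically loaded libraries are not detected.

```python
import ast


def imported_modules(source):
    """Return the sorted top-level module names imported by the given source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) point inside the project and are skipped.
            names.add(node.module.split(".")[0])
    return sorted(names)


# Hypothetical script contents, used only for illustration.
example_source = """
import os
import numpy as np
from matplotlib import pyplot
from . import helpers
"""

modules = imported_modules(example_source)
print(modules)  # ['matplotlib', 'numpy', 'os']
```

Each name in the resulting list is either part of the standard library or an external package that has to be available on the target machine.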
Many SDKs and programming environments provide built-in routines (i.e., wizards) for the generation
of the deployment packages and standalone executable files. In the case of Anaconda Jupyter,
although there is no automated, built-in wizard for such tasks, one can resort to a number of external
helper applications. A detailed, step-by-step tutorial of this process is beyond the scope of this book.
However, some basic, introductory examples are provided below, in order to assist readers with
minimal or no previous experience with command line environments in familiarizing themselves with
such tasks.
At the time of writing, two of the most widely used third-party applications for generating
standalone executable files from Python scripts are PyInstaller, which supports Windows, Mac OS,
and Linux (PyInstaller Development Team, 2019), and Py2app, which targets Mac OS only (Oussoren &
Ippolito, 2010). Both applications can handle dependencies and linking, and the decision on which
one should be used comes down to the operating system at hand and personal preference. In broad
terms, the steps one needs to follow when creating standalone executable files are summarized below:
• Step 1: Irrespective of which program and procedure one chooses to generate the standalone
application, the original script(s) must first be exported from Anaconda Jupyter
as one or more Python .py file(s). These will be the file(s) used as input to the deployment
application.
• Step 2: Another essential task is to ensure that the deployment application is installed on the
system. This can be achieved in a number of ways that are detailed in the numerous associated
online guides and tutorials (Apple Inc, 2021; Cortesi, 2021; Microsoft, 2021a, 2021b;
Oussoren & Ippolito, 2010; PyInstaller Development Team, 2019). For the purposes of this
example, one possibility is to install PyInstaller from a Command Prompt/PowerShell
window (Microsoft, 2021a, 2021b) using the following command:
• pip install pyinstaller
• Step 3a (Windows): Once PyInstaller is installed, and given that the associated files and
the command line environment are set up appropriately, the generation of the standalone
file could be as simple as the following command:
• pyinstaller yourprogram.py
Alternatively, the user can refer to the PyInstaller official documentation, in order to execute more specific and complex commands with appropriate parameters and flags, as necessary. For instance, using the same command with the --onefile flag would force the
generated executable file to be packaged in a single file rather than in a folder structure
containing multiple files:
• pyinstaller --onefile yourprogram.py
• Step 3b (Mac OS): The same basic idea also applies when using Py2app (Oussoren &
Ippolito, 2010), although the procedure and commands may be slightly different. For
instance, when used on a Mac OS system, Py2app generates application bundles instead of
an executable file. As an example, users of Mac OS systems can use the Terminal window
(Apple Inc, 2021) to first install Py2app:
• pip install -U py2app
Py2app can then be used to create a setup file:
• py2applet --make-setup yourprogram.py
Finally, the setup file can be used to generate the standalone application bundle:
• python setup.py py2app
In both cases, the standalone application is usually placed in a directory structure determined
by the settings and parameters used.
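Step 1 above (exporting the notebook as a .py file) is normally done through the Jupyter interface or its conversion tools. Purely to illustrate what such an export involves, the following sketch extracts the code cells from a notebook, which is stored as JSON text. The function name notebook_to_script and the tiny hand-written notebook are hypothetical, and the sketch ignores details such as notebook-specific commands that a real export would need to handle.

```python
import json


def notebook_to_script(notebook_text):
    """Join the code cells of a notebook (given as JSON text) into one script."""
    notebook = json.loads(notebook_text)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip text (markdown) cells
        source = cell.get("source", "")
        # The cell source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Minimal hand-written notebook, used only for illustration.
example_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 1\n", "y = 2"]},
        {"cell_type": "code", "source": "print(x + y)"},
    ]
})

script = notebook_to_script(example_notebook)
print(script)
```

The resulting text could then be saved as yourprogram.py and used as input to the deployment application.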
In order to be able to successfully execute the example commands provided here, the reader may
have to execute a number of other necessary commands and setup tasks and navigate to the correct
directories using the command line environment. Detailed information on how to use both PyInstaller
and Py2app can be found in the official documentation pages (Cortesi, 2021; Oussoren & Ippolito,
2010) and in the large variety of associated online resources. It must be noted that the third-party
applications mentioned here are just two of the tools one may choose to use for creating standalone
executable files based on Python scripts, and they are not the only way of dealing with such tasks.
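For readers who prefer to stay within Python rather than type commands by hand, the PyInstaller commands shown earlier can also be assembled and launched from a script. The sketch below is one possible approach under stated assumptions: the helper name pyinstaller_command and the project path are hypothetical, and the final call is left commented out because it requires PyInstaller to be installed.

```python
import sys


def pyinstaller_command(script, onefile=False):
    """Build the argument list for a PyInstaller run without executing it."""
    # Running PyInstaller as a module uses the same interpreter that pip installed it into.
    command = [sys.executable, "-m", "PyInstaller"]
    if onefile:
        command.append("--onefile")
    command.append(script)
    return command


command = pyinstaller_command("yourprogram.py", onefile=True)
print(" ".join(command[1:]))  # -m PyInstaller --onefile yourprogram.py

# To actually run it from the folder that contains the script (hypothetical path):
# import subprocess
# subprocess.run(command, cwd="path/to/project", check=True)
```

Passing the working directory explicitly takes the place of navigating to the correct directory in the command line environment.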
The development and deployment processes vary depending on the characteristics of the developed application, the chosen development platform, and the targeted operating system(s). As most
chapters of this book utilize the Anaconda Jupyter environment, most of the examples and programming scripts can be developed and tested within the development platform (or even other platforms)
without the need to generate standalone executable files. However, the information provided here
can be used as a general guide for the deployment procedure and the necessary conversions, should
the reader choose to create standalone versions of the various examples.
1.5 STRUCTURE OF THIS BOOK
This book is divided into three main parts, based on the knowledge field, character, and objective
of the presented topics.
The first part (Chapters 2–5) covers classic computer programming topics like introduction to
programming, Object-Oriented Programming, Graphical User Interface (GUI) programming, and
application development. It is meant to assist readers with little or no prior programming experience to start learning computer programming using Python and the Anaconda Jupyter platform.
The related concepts, techniques, and algorithms are discussed and explained with examples of the
necessary code and the expected output.
The second part (Chapters 6–9) covers concepts related to data structures and organization, the
algorithms used to manipulate these structures, database programming (SQL), data analysis and
visualization, and the basics of statistical analysis. These concepts cover most of the topics, algorithms, and applications that make up what is collectively referred to as data science. The structure
of this part of this book provides a potential entry point for readers with no prior knowledge in data
science, as well as a reference point for those who would like to focus on the implementation of
specific data science tasks using Python.
The third part (Chapters 10–12) covers machine and deep learning concepts, while also providing a brief introduction to using Python in contexts not traditionally linked with the language like
virtual reality (VR) application development. This part introduces concepts that are potentially
more advanced from a contextual perspective, but not necessarily more challenging when it comes
to their implementation using Python. For instance, while a deeper understanding of the principles
and algorithms behind machine and deep learning may be out of scope for many of the readers of
this book, the development of applications using the various related modules and methods provided
by Python may be something that is of interest. Similarly, while video game and VR/AR application
development is certainly a topic that falls outside the scope of a Python textbook in the strict sense,
a basic understanding of how such applications could be developed using the Python language may
provide a useful insight to the most adventurous of the readers.
All the scripts and case studies presented in this book, as well as the related data and files necessary for their execution, are included as supplementary material in Appendix A.
REFERENCES
Aho, A.V., Hopcroft, J.E., & Ullman, J.D. (1983). Data Structures and Algorithms. USA: Addison-Wesley.
Apple Inc. (2021). Terminal User Guide. Support.Apple.Com. https://support.apple.com/en-gb/guide/terminal/welcome/mac/.
Cortesi, D. (2021). PyInstaller Documentation. PyInstaller 4.5. https://pyinstaller.readthedocs.io/_/downloads/en/stable/pdf/.
Knuth, D.E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Microsoft. (2021a). Installing Windows PowerShell. https://docs.microsoft.com/en-us/powershell/scripting/windows-powershell/install/installing-windows-powershell?view=powershell-7.1.
Microsoft. (2021b). Windows Command Line. https://www.microsoft.com/en-gb/p/windows-command-line/9nblggh4xtkq?activetab=pivot:overviewtab.
Oussoren, R., & Ippolito, B. (2010). py2app – Create Standalone Mac OS X Applications with Python. https://py2app.readthedocs.io/en/latest/.
PyInstaller Development Team. (2019). PyInstaller Quickstart. https://www.pyinstaller.org/.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.
2
Introduction to Programming with Python
Ameur Bensefia
University of Rouen Normandy
Higher Colleges of Technology
Muath Alrammal
Higher Colleges of Technology
University Paris-Est (UPEC)
Ourania K. Xanthidou
Brunel University of London
CONTENTS
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
Introduction............................................................................................................................. 10
Algorithm vs. Program............................................................................................................ 11
2.2.1 Algorithm.................................................................................................................... 11
2.2.2 Program....................................................................................................................... 12
Lexical Structure..................................................................................................................... 12
2.3.1 Case Sensitivity and Whitespace................................................................................. 13
2.3.2 Comments.................................................................................................................... 13
2.3.3 Keywords..................................................................................................................... 13
Punctuations and Variables..................................................................................................... 14
2.4.1 Punctuations................................................................................................................ 14
2.4.2 Variables...................................................................................................................... 14
Data Types............................................................................................................................... 15
2.5.1 Primitive Data Types
2.5.2 Non-Primitive Data Types
2.5.3 Examples of Variables and Data Types Using Python Code
2.6 Statements, Expressions, and Operators
2.6.1 Statements and Expressions
2.6.2 Operators
2.6.2.1 Arithmetic Operators
2.6.2.2 Comparison Operators
2.6.2.3 Logical Operators
2.6.2.4 Assignment Operators
2.6.2.5 Bitwise Operators
2.6.2.6 Operators Precedence
2.7 Sequence: Input and Output Statements
2.8 Selection Structure
2.8.1 The if Structure
2.8.2 The if…else Structure
2.8.3 The if…elif…else Structure
2.8.4 Switch Case Structures
DOI: 10.1201/9781003139010-2
2.8.5 Conditional Expressions
2.8.6 Nested if Statements
2.9 Iteration Statements
2.9.1 The while Loop
2.9.2 The for Loop
2.9.3 The Nested for Loop
2.9.4 The break and continue Statement
2.9.5 Using Loops with the Turtle Library
2.10 Functions
2.10.1 Function Definition
2.10.2 No Arguments, No Return
2.10.3 With Arguments, No Return
2.10.4 No Arguments, With Return
2.10.5 With Arguments, With Return
2.10.6 Function Parameter Passing
2.10.6.1 Call/Pass by Value
2.10.6.2 Call/Pass by Reference
2.11 Case Study
2.12 Exercises
2.12.1 Sequence and Selection
2.12.2 Iterations – while Loops
2.12.3 Iterations – for Loops
2.12.4 Methods
References
2.1 INTRODUCTION
It is hard to find a programming language that does not follow the norms of what a computer program should look like, as the underlying structures have been established for over 50 years. These
norms, widely known as the basic programming principles, are broadly accepted by the academic,
scientific and professional communities, something also reflected in the approaches of legendary
figures in the field (Dijkstra et al., 1976; Knuth, 1997; Stroustrup, 2013).
The three basic programming principles refer to the concepts of sequence, selection, and repetition or iteration. Sequence is the concept of executing instructions of computer programs from top
to bottom, in a sequential form. Selection refers to the concept of deciding among different paths of
execution that can be followed based on the evaluation of certain conditions. Repetition is the idea
of repeating a particular block of instructions as long as a condition is evaluated to True (i.e., nonzero). The concept of computer programming in its most basic form can be defined as the integration
of these programming principles with variables that store and manipulate data through programs
and methods or functions that facilitate the fundamental idea of divide and conquer.
The aim of this chapter is not to propose any innovative ideas of how to change the above logic
and structures. Nevertheless, although it is unlikely that these concepts can be changed or redefined
in a major way, they can be fine-tuned and put into the context of new and developing programming
languages. From this perspective, this chapter can be viewed as an effort to present how these fundamental principles of computer programming are applied to Python, one of the most popular and
intuitive modern programming languages, in a comprehensive and structured way. To accomplish
this, a number of related basic concepts are presented and discussed in detail in the various sections
of this chapter:
1. Algorithms and Programs, Lexical Structures.
2. Variables & Data Types, Primitive and Non-primitive.
3. Statements, Expressions, Operators & Punctuations.
4. Sequence: Input, Basic Operations, and Output Statements.
5. Selection Structures: if, if…else, if…elif…else, Conditional Expressions.
6. Iteration structures: for Loops, while Loops, Nested Loops.
7. Functions.
It should be noted that this chapter introduces the Turtle library, which is used to demonstrate some
of the uses of iteration structures.
2.2 ALGORITHM VS. PROGRAM
The demand for developing a program always originates from a problem that must be addressed by
means of computer-based automation. However, an intermediate essential step exists between the
problem and the actual program, namely the algorithm.
2.2.1 Algorithm
The term algorithm derives from the name of the ninth-century mathematician Mohamed Ibn Musa Al-Khwarizmi. It originally denoted a set of ordered and finite mathematical operations designed
to solve a specific problem. Nowadays, this term is being adopted in various fields and disciplines,
most notably in Computer Science and Engineering, in which it is defined as a set of ordered operations executed by a machine (computer).
The first step in program development is where a problem is defined. At this point, a solution is formulated as a clear and unambiguous set of steps. This solution is the algorithm. The steps described in the algorithm are later translated into a program using a specific programming language (Figure 2.1).
Observation 2.1 – Algorithm: A set of ordered operations that can be executed by a machine (computer system).
The benefit of starting off with the formulation of an algorithm rather than directly implementing the actual program is that it allows the programmer to focus on how to solve the problem logically, free from any constraints or considerations related to the specifics of any given programming
language. Indeed, algorithms are written in a format incorporating natural human language called
pseudo-code, and follow particular formal rules. Ultimately, such approaches ensure a certain level
of clarity and detail that reduces or eliminates ambiguity without having to deal with the technicalities of the implementation.
The examples below provide two cases of algorithms demonstrating the clarity and simplicity
that should characterize the solution to the problem at hand before it comes to translating this solution into an actual program. Both algorithms are in the form of pseudo-code and, thus, independent
of any particular programming language used for the implementation of the solutions:
FIGURE 2.1 Phases of program development.
Algorithm 1: Calculate the Area of a Rectangle
Start
Read the length of the rectangle
Read the width of the rectangle
Assign width*length to Area
Display Area
End
Algorithm 2: Draw a Square of 50 Pixels Length
Start
Draw a line of 50 pixels length
Turn the pen right by 90 degrees
Draw a line of 50 pixels length
Turn the pen right by 90 degrees
Draw a line of 50 pixels length
Turn the pen right by 90 degrees
Draw a line of 50 pixels length
Turn the pen right by 90 degrees
End
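As a rough illustration of how such pseudo-code maps onto a program, Algorithm 2 can be sketched in Python without any graphics library. The pen model below (a position and a heading stored as plain integers) and the name draw_square are assumptions made for this sketch and are not part of the chapter. The loop simply repeats the "draw a line, then turn right" pair of steps four times.

```python
# Hypothetical pen model: the four headings in clockwise order,
# starting with the pen facing east (y grows upward).
HEADINGS = [(1, 0), (0, -1), (-1, 0), (0, 1)]


def draw_square(side):
    """Return the corner points visited while tracing a square."""
    x, y = 0, 0
    heading = 0
    corners = [(x, y)]
    for _ in range(4):
        dx, dy = HEADINGS[heading]
        # Draw a line of `side` pixels in the current direction
        x, y = x + dx * side, y + dy * side
        corners.append((x, y))
        # Turn the pen right by 90 degrees
        heading = (heading + 1) % 4
    return corners


print(draw_square(50))
# [(0, 0), (50, 0), (50, -50), (0, -50), (0, 0)]
```

The pen ends where it started, which is a quick way to check that the four steps really close the square.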
2.2.2 Program
Observation 2.2 – Input, Processing, Output: The basic structure of all programs, irrespective of the programming language used. Input represents any statement written to collect data from an external source. Output represents any statement that sends the outcome of the processing to a display unit, file, or another program.
Once the algorithm is formed, the next step is to write the program in a specific programming language. Each programming language has its own rules and conventions. However, they all have a common core structure consisting of inputs, processing, and outputs. They are all implemented using some form of code, the format and structure of which could vary depending on the scope and purpose of each given language and program:
1. Input: Statements dedicated to collecting data
from external input sources (e.g., input from the
user through the keyboard and mouse), opening and reading files, or accepting input from
other programs. In most instances, input is managed at the beginning of the program execution, but this may vary between different languages and programs.
2. Processing: Processing lies at the core of the program and represents statements responsible for the manipulation of the information received at input. The length of this section
can vary greatly, from a few simple statements to thousands of lines of code organized in
numerous files and packages.
3. Output: Output statements are used in order for the outcome of the processing to be communicated outside the program. This can take many forms and includes, but is not limited
to, sending visual information to a display unit, exporting to a file, or exporting to another
program. In most cases, this is the last step of the sequence in a program.
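The three parts can be sketched with Algorithm 1 (the area of a rectangle). The function names read_dimensions, rectangle_area, and report are hypothetical and chosen only for this illustration. To keep the sketch runnable without a keyboard, the input step reads from a list of strings. In an interactive script, calls to input() would take its place.

```python
def read_dimensions(lines):
    """Input: read the length and the width from a sequence of text lines."""
    values = iter(lines)
    length = float(next(values))
    width = float(next(values))
    return length, width


def rectangle_area(length, width):
    """Processing: compute the area from the two dimensions."""
    return width * length


def report(area):
    """Output: build the message that is sent to the display."""
    return 'Area is {}'.format(area)


# The list stands in for two values typed by the user
length, width = read_dimensions(['4', '2.5'])
message = report(rectangle_area(length, width))
print(message)  # Area is 10.0
```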
2.3 LEXICAL STRUCTURE
Lexical structure refers to the basic conventions and restrictions in terms of the format and syntax
of the text used in the programming environment, in this case Python. This is an important aspect
of any programming language, as incorrect format or syntax may lead to syntax errors and code
that is difficult to read and debug.
2.3.1 Case Sensitivity and Whitespace
Python is a case-sensitive programming language, which means that it distinguishes between keywords and variables written in capital and lower-case letters. Thus, if and IF are considered to be
different words, with the first being recognized as a Python keyword and the second processed as a
variable (see: Variables 2.4.2).
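A short sketch of what case sensitivity means in practice is given below. The variable names are arbitrary and were picked only for this example.

```python
# Three different variables: the names differ only in letter case
salary = 1000
Salary = 2500
SALARY = 4000
print(salary, Salary, SALARY)  # 1000 2500 4000

# IF is an ordinary variable name; only the lower-case if is a keyword
IF = 'just a variable'
if IF:
    print(IF)
```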
2.3.2 Comments
A program is a set of instructions written in a specific language that can be translated and processed by a computer. In real life scenarios, programs can become quite sizable, with hundreds or even thousands of lines of code required. This can make it quite difficult for the programmer to remember the meaning, functionality, and purpose of each line of code. As such, good programming practice involves the use of comments in the program itself. Comments function as useful and intuitive reminders and descriptions to the programmer or anyone who may have direct access to the source code of the program. The comment is expressed in a natural human language and is ignored by the interpreter during runtime.
Observation 2.3 – Comments: Natural language statements ignored by the interpreter, used to explain the purpose of the different parts of the code. Start a single line comment with #, or start and end a multiple line comment with """. Note that Python is case-sensitive.
Python allows the use of two main types of comments:
• Single Line Comment: Starts with the # symbol and continues until the end of the current
line:
# This statement displays the sentence Hello World
print("Hello World")
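• Multiple Lines Comment: Starts with the """ symbols and ends when the same symbol
combination occurs again. Strictly speaking, this is a string that is not assigned to any variable, so it has no effect on the program: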
""" The statement below displays
the sentence Hello World """
print("Hello World")
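The difference between the two forms can be seen inside a function. The sketch below uses a hypothetical function named greet. A triple-quoted string placed as the first statement of a function is kept by Python as its documentation string, whereas a # comment is discarded.

```python
def greet():
    """Return the greeting text."""
    # This comment is discarded when the code is read by the interpreter
    return 'Hello World'


print(greet())        # Hello World
print(greet.__doc__)  # Return the greeting text.
```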
2.3.3 Keywords
Python reserves a number of keywords that are used
by the interpreter to trigger specific actions when the
code is executed. As these keywords are reserved, the
programmer is not allowed to use them as variable,
function, method, or class names. A list of these keywords is provided in Table 2.1.
Observation 2.4 – Keywords: Reserved
words that cannot be used as names
for variables, functions, methods, or
classes.
TABLE 2.1
Python Keywords
and       as        assert    break     class     continue
def       del       elif      else      except    False
finally   for       from      global    if        import
in        is        lambda    None      nonlocal  not
or        pass      raise     return    True      try
while     with      yield
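The list of keywords depends on the Python version in use, and recent versions reserve a few more words than Table 2.1 shows (for example, async and await). A minimal sketch using the standard keyword module to inspect the list is given below. The helper is_valid_statement is a hypothetical name introduced only for this example.

```python
import keyword

# Number of keywords reserved by the interpreter running this script
print(len(keyword.kwlist))

print(keyword.iskeyword('while'))  # True
print(keyword.iskeyword('print'))  # False: print is a built-in function


def is_valid_statement(source):
    """Return True if the given source text is syntactically valid."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False


print(is_valid_statement('total = 1'))  # True
print(is_valid_statement('class = 1'))  # False: class is a keyword
```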
2.4 PUNCTUATIONS AND VARIABLES
Punctuations and variables are special types of symbols and text that dictate specific functionality. As such, when these symbols or text are encountered, the interpreter performs specific, pre-determined tasks instead of treating them as common text.
2.4.1 Punctuations
Python programs may contain punctuation characters that are combined with other symbols to
denote specific functionality. These characters are divided into two main categories: separators and
operators (Table 2.2).
2.4.2 Variables
A variable describes a memory location used by a program to store data. Indeed, from a hardware standpoint, it is expressed as a binary or hexadecimal number that represents the memory location and another number that represents the actual data stored in it.
Observation 2.5 – Variable: Designated memory location used by the program to store values.
Since working directly with hexadecimal numbers is arguably impractical and counter-productive from a programming perspective, a variable is expressed as a combination of an identifier that
replaces the actual memory location, a data type identifying the kind of data that can be stored in
it, and a value that represents the actual data stored. Each programming language has its own rules
when it comes to naming variables. In Python, a variable name has to conform to the following
rules:
• It must start with a letter of the Latin alphabet ('a', 'b', …, 'z', 'A', 'B', …, 'Z') or with the special character "_".
• It may contain numbers, but it cannot start with one.
• It may contain the special character "_".
• It cannot contain any other character.
• It cannot be a Python keyword.
In line with the above, examples of allowed variable names include the following:
Salary, Name, Child1, Email_address, firstName, _ID
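Similarly, examples of invalid variable names include the following:
class, 1Child, Email#address
Names of built-in functions, such as print, are not keywords and are technically allowed as variable names, but using them hides the built-in function and should be avoided.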
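The naming rules listed above can be checked mechanically. The following is a minimal sketch, assuming the rules exactly as stated in this section. The names NAME_PATTERN and is_valid_name are hypothetical. Note that a built-in name such as print passes this check because it is not a keyword, even though using it as a variable name is best avoided.

```python
import keyword
import re

# A letter or underscore first, then letters, digits, or underscores
NAME_PATTERN = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def is_valid_name(name):
    """Check a candidate variable name against the rules of this section."""
    follows_pattern = NAME_PATTERN.fullmatch(name) is not None
    return follows_pattern and not keyword.iskeyword(name)


for candidate in ['Salary', 'Child1', '_ID', '1Child', 'Email#address', 'class']:
    print(candidate, is_valid_name(candidate))
```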
TABLE 2.2
Separators and Operators in Python
Separators: ( ) { } [ ] : " ,
Operators: + - * / // % ** & | < > <= >= == != <> = += -= *= /= //= %= **= &= |= ^= >>= <<=
Note: the <> operator is accepted only by Python 2; Python 3 uses != instead.
2.5 DATA TYPES
As stated previously, the purpose of a variable is to hold a value of a specified type. This value can be a number (e.g., decimal, real, octal, hexadecimal), text (i.e., a string of characters), a single character, or a Boolean value (i.e., one out of two possible values: True or False). More complex structures that consist of any of the aforementioned types may be also used. In general, Python supports two main categories of data types in this context: primitive and non-primitive (Figure 2.2).
Observation 2.6 – Data Types: The type of the value stored in a variable could be primitive (i.e., integer, string, float, Boolean) or non-primitive (i.e., a collection of primitive data types).
2.5.1 Primitive Data Types
There are four primitive data types that are used when the variable is to hold pure, simple values
of data:
• String or Text: In Python, a string is a value of type str. It can hold any set of characters, including letters, numbers, or other symbols, enclosed in single or double quotation marks:
  • "This is a text."
  • "Do you accept the proposal (Yes/No)?."
• Numeric: Since there are different types of numbers, Python provides types and notations suitable for different numerical formats and representations:
  • int represents integer numbers (e.g., +24509129)
  • float represents real numbers (e.g., -123.0968)
  • complex represents complex numbers (e.g., +45-33.6j)
  • the 0o prefix writes an integer in octal notation (e.g., 0o7652001)
  • the 0x prefix writes an integer in hexadecimal notation (e.g., 0x34EF1C3)
• Boolean: A Boolean variable is used to represent only two possible values: True or False.
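A brief sketch of the numeric notations follows. The values are arbitrary examples and are not taken from the chapter. It illustrates that the octal and hexadecimal prefixes only change how an integer is written, not its type.

```python
# Octal and hexadecimal literals are ordinary integers
print(type(0o17), 0o17)  # <class 'int'> 15
print(type(0x1F), 0x1F)  # <class 'int'> 31

# Converting back to the prefixed text form
print(oct(15), hex(31))  # 0o17 0x1f

# Converting text in a given base to an integer
print(int('17', 8), int('1F', 16))  # 15 31

# A complex number has a real and an imaginary part
z = 45 - 33.6j
print(z.real, z.imag)  # 45.0 -33.6
```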
FIGURE 2.2 Python’s data types. (See Jaiswal, 2017.)
2.5.2 Non-Primitive Data Types
Non-primitive data types are complex types consisting of two or more other data types. Such structures are convenient when one needs to manipulate collections of values of different types. A list of
non-primitive variables is provided below:
• Sequence: This type is suitable to use when different values have to be stored and grouped
together. It can be further divided into the following categories:
  • List: This category represents a collection of any primitive data types where the elements of the list can be accessed through an index and can be modified (mutable).
  • Tuple: This category represents a collection of any primitive data types where the elements of the tuple can be accessed through an index but cannot be modified (immutable).
  • Set: This category represents a collection of distinct, unique objects. It is useful when creating collections that hold strictly unique values in the dataset, and is especially relevant when this dataset is large. The data is unordered and mutable, so its elements cannot be accessed through an index.
  • Range: This category represents a series of numbers starting at 0 by default and ending just before a specified number.
Examples:
["car", "bike", "truck"]     # This is a list of strings
[200, 6423, -709, 1205]      # This is a list of integers
("car", "bike", "truck")     # This is a tuple of strings
(20.1, +23, -1.9, 12.5)      # This is a tuple of numbers
{'O', 'E', 'K', 'C', 'I'}    # This is a set of unique strings
range(5)                     # This will generate the numbers 0 1 2 3 4
range(3)                     # This will generate the numbers 0 1 2
• Dictionary or Mapping: In cases where it is necessary to associate a pair of data (commonly known as key and value), dictionary or mapping types can be used. These types are
labeled as dict. The declaration begins with curly brackets, followed by the set of pairs
separated by commas. Each pair is represented with the key and the value separated by a
colon. To access a value, its key should be provided between square brackets after the name of the variable:
{"name": "Steve", "age":20} # This is a mapping variable
More information on this topic can be found in Chapter 6.
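The properties described above (indexing, mutability, uniqueness, and access by key) can be tried out with a few lines of code. The following is only a sketch with made-up variable names.

```python
# A list is mutable: an element can be replaced through its index
vehicles = ['car', 'bike', 'truck']
vehicles[1] = 'bus'
print(vehicles)  # ['car', 'bus', 'truck']

# A tuple is immutable: the same operation raises a TypeError
point = (20.1, 23, -1.9)
try:
    point[0] = 0
except TypeError as error:
    print('Tuples cannot be modified:', error)

# A set keeps only the unique elements
letters = set('cookie')
print(len(letters))     # 5
print(sorted(letters))  # ['c', 'e', 'i', 'k', 'o']

# A dictionary value is accessed (and changed) through its key
person = {'name': 'Steve', 'age': 20}
print(person['name'])  # Steve
person['age'] = 21
print(person)  # {'name': 'Steve', 'age': 21}
```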
2.5.3 Examples of Variables and Data Types Using Python Code
This section includes a number of practical examples that demonstrate typical uses and structures
of variables and data types in Python.
The first example is related to the string/text data type, one of the fundamental and most commonly used data types in computer programming. In this rather simple example, the reader can find a number of coding conventions and commands relating to this data type. For instance, the string values that are being passed to the firstName variable are enclosed in single quotes. This is also the case when a string is used directly as an argument of the print() function, used to display the information of its arguments on screen. It must be also noted that good programming practice dictates that variables start with lower-case letters (e.g., firstName instead of FirstName).
This example also highlights that, in addition to simple arguments like strings in quotation marks, functions like print() may accept arguments of different types or formats, such as other variables, or the results of calls to functions and methods (e.g., .format(firstName)). The format() method of a string takes one or more values of any type as arguments and inserts them in place of the brackets {} of the string it is called on (e.g., 'firstName is {}'.format(firstName)). Note the use of the type() function that returns the data type of the value stored in the provided variable (i.e., firstName).
In the Jupyter Notebook editor, if the output is text, it is provided immediately after the current
code cell when the program is executed.
Last but not least, the reader should note that comments are included before every distinct piece
of code that performs a particular task. While this is not a strict coding requirement, it is an important aspect of good programming practice.
# Declare a variable named firstName and assign its value to Steve
firstName = 'Steve'

# Print the value of variable firstName
print('firstName is {}'.format(firstName))

# Print the data type of variable firstName
print(type(firstName))
Output 2.5.3.a:
firstName is Steve
<class 'str'>
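The format() method is not limited to a single value. The sketch below, which uses made-up values, shows several placeholders, named placeholders with a format specification, and the equivalent f-string notation available in recent Python versions.

```python
firstName = 'Steve'
age = 20
salary = 20000.0

# Several placeholders are filled in the order of the arguments
line1 = '{} is {} years old'.format(firstName, age)
print(line1)  # Steve is 20 years old

# Named placeholders; :.2f displays a float with two decimal digits
line2 = '{name} earns {pay:.2f}'.format(name=firstName, pay=salary)
print(line2)  # Steve earns 20000.00

# An f-string places the variables directly inside the text
line3 = f'{firstName} is {age} years old'
print(line3)  # Steve is 20 years old
```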
Variables of the integer data type are non-decimal numbers (e.g., numberOfStudents = 20):
# Declare a variable named numberOfStudents and assign its value to 20
numberOfStudents = 20

# Print the value of variable numberOfStudents
print('Number of students is {}'.format(numberOfStudents))

# Print the data type of variable numberOfStudents
print(type(numberOfStudents))
Output 2.5.3.b:
Number of students is 20
<class 'int'>
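Variables of the float data type are floating-point numbers, i.e., numbers with a fractional part. Note that when a float is written as a literal, the decimal point must be included even if the fractional part is zero (e.g., 20000.0); without it, the value is stored as an integer: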
# Declare a variable named salary and assign its value to 20000.0
salary = 20000.0

# Print the value of variable salary
print('Salary is {}'.format(salary))

# Print the data type of variable salary
print(type(salary))
Output 2.5.3.c:
Salary is 20000.0
<class 'float'>
Variables of the complex data type hold numbers with a real and an imaginary part, written in the form x+yj, where j marks the imaginary part (e.g., complexNumber = +45-33.6j):
# Declare variable complexNumber; assign its value to +45-33.6j
complexNumber = +45-33.6J

# Print the value of variable complexNumber
print('complexNumber is {}'.format(complexNumber))

# Print the data type of variable complexNumber
print(type(complexNumber))
Output 2.5.3.d:
complexNumber is (45-33.6j)
<class 'complex'>
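Integer values can also be written in octal notation by starting the literal with 0o (e.g., octalNumber = 0o7652001). Such a value is stored as a regular integer, which is why it is displayed in decimal form in the output. In this particular
example, the reader should also note the use of comments stretching across multiple lines. As mentioned, comments of this type start and end with three double quotation marks ("""):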
# Declare a variable named octalNumber and assign its value to 0o7652001
octalNumber = 0o7652001

# Print the value of variable octalNumber
print('octalNumber is {}'.format(octalNumber))

"""Print the data type of variable octalNumber: notice that an octal
literal creates a regular integer; this is why class 'int' appears in the result"""
print(type(octalNumber))
Output 2.5.3.e:
octalNumber is 2053121
<class 'int'>
Boolean variables can only take two different values: True or False. In the following code, variable married is True, but the only other possible value this variable could take would be False:
# Declare a variable named married and assign its value to True
married = True

# Print the value of variable married
print('married is {}'.format(married))

# Print the data type of variable married
print(type(married))
Output 2.5.3.f:
married is True
<class 'bool'>
Mapping variables are always enclosed in curly brackets (e.g., mappingVariable = {'name':
'Steve', 'age': 20}):
# Declare a variable named mappingVariable and assign its
# value to {'name':'Steve', 'age':20}
mappingVariable = {'name':'Steve', 'age':20}

# Print the value of variable mappingVariable
print('mappingVariable is {}'.format(mappingVariable))

# Print the data type of variable mappingVariable
print(type(mappingVariable))
Output 2.5.3.g:
mappingVariable is {'name': 'Steve', 'age': 20}
<class 'dict'>
List variables are enclosed in square brackets (e.g., listVariable = [200, 6423, -709, 1205]):
# Declare a variable named listVariable and assign
# its value to [200, 6423, -709, 1205]
listVariable = [200, 6423, -709, 1205]

# Print the value of variable listVariable
print('listVariable is {}'.format(listVariable))

# Print the data type of variable listVariable
print(type(listVariable))
Output 2.5.3.h:
listVariable is [200, 6423, -709, 1205]
<class 'list'>
Tuple variables are enclosed in parentheses (e.g., tupleVariable = ('car', 'bike', 'truck')):
# Declare a variable named tupleVariable and assign
# its value to ('car', 'bike', 'truck')
tupleVariable = ('car', 'bike', 'truck')

# Print the value of variable tupleVariable
print('tupleVariable is {}'.format(tupleVariable))

# Print the data type of variable tupleVariable
print(type(tupleVariable))
Output 2.5.3.i:
tupleVariable is ('car', 'bike', 'truck')
<class 'tuple'>
Range variables hold integers ranging from 0 up to a specified number (e.g., rangeVariable = range(5)). Note that the specified number is not inclusive, so rangeVariable in this
example will hold values 0, 1, 2, 3, and 4:
# Declare a variable named rangeVariable and assign its value to a
# range of integers from 0 to 4 (i.e., 0 1 2 3 4)
rangeVariable = range(5)

# Print the value of variable rangeVariable
print('rangeVariable is {}'.format(rangeVariable))

# Print the data type of variable rangeVariable
print(type(rangeVariable))
Output 2.5.3.j:
rangeVariable is range(0, 5)
<class 'range'>
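Printing a range shows only its start and end, not the numbers it represents. A small sketch of how the individual values can be inspected is given below. The last line uses the optional start and step arguments of range(), which are not discussed in this section.

```python
rangeVariable = range(5)

# Converting the range to a list reveals the values it generates
print(list(rangeVariable))  # [0, 1, 2, 3, 4]
print(len(rangeVariable))   # 5

# The end value itself is not part of the range
print(4 in rangeVariable, 5 in rangeVariable)  # True False

# Start at 2, stop before 10, move in steps of 3
print(list(range(2, 10, 3)))  # [2, 5, 8]
```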
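Set variables hold sets of unique values of primitive data types. In the following code, command
set('cookie') allocates unique values 'i', 'c', 'o', 'e', 'k' to variable setVariable. Because sets are unordered, the order in which the elements are displayed may differ from one run to another: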
# Declare a variable named setVariable and assign its value to
# the set of unique letters in the word 'cookie'
setVariable = set('cookie')

# Print the value of variable setVariable
print('setVariable is {}'.format(setVariable))

# Print the data type of variable setVariable
print(type(setVariable))
Output 2.5.3.k:
setVariable is {'i', 'e', 'c', 'k', 'o'}
<class 'set'>
2.6 STATEMENTS, EXPRESSIONS, AND OPERATORS
Statements and expressions refer to specific syntactical structures that provide instructions to the interpreter in order to execute specific tasks. They can be simple structures executing a simple task, like printing a message on screen, or more complicated ones that perform a number of tasks and generate multiple threads of information and results. Operators refer to special symbols that perform particular, pre-determined tasks, and can be used as building blocks for building logical statements and expressions. This section introduces basic concepts related to these fundamental programming elements.
Observation 2.7 – Statement: A line of code that can be executed by the Python interpreter.
2.6.1 Statements and Expressions
Observation 2.8 – Expression: Any combination of values, variables, operators, and/or calls to functions that result in an unambiguous value.
A statement is a unit/line of code (i.e., an instruction) that the Python interpreter can execute. So far, two kinds of statements have been presented in this chapter, assignment and print:
# Assignment statement produces no output
name = 'Steve'

# Print function
print('Name is:', name)
Output 2.6.1:
Name is: Steve
A script usually contains a sequence of statements. When there is more than one statement, the
results appear one at a time, as each statement is executed.
An expression is a combination of values, variables, operators, and calls to functions resulting in
a clear and unambiguous value upon execution.
2.6.2 Operators
Operators are tokens/symbols that represent computations, such as addition, multiplication and division. The values an operator acts upon are called operands.
Observation 2.9 – Operators/Operands: Operators are symbols representing computations like additions, multiplications, divisions. Operands are the values that the operators act upon.
Let us consider the simple statement x = 3*2. The reader should note the following:
• x is a variable.
• 3 and 2 are the operands.
• * is the multiplication operator.
• 3*2 is considered an expression since it results in a specific value.
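The distinction can be tried out in a couple of lines. In this sketch (the variable names are arbitrary), each line is a statement, while the right-hand sides of the assignments are expressions that are evaluated to a single value first.

```python
# Statement: the expression 3 * 2 is evaluated, then assigned to x
x = 3 * 2
print(x)  # 6

# An expression may combine variables, operators, and function calls
name = 'Steve'
total = x + len(name) * 2  # 6 + 5 * 2
print(total)  # 16
```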
TABLE 2.3
Python Arithmetic Operators
Operator | Example | Name | Description
+ (unary) | +a | Unary positive | a (the value is left unchanged).
+ (binary) | a + b | Addition | Sum of a and b. The + operator adds two numbers. It can be also used to concatenate strings, in which case both operands must be strings; adding a string and a number raises an error.
- (unary) | -a | Unary negation | It converts a positive value to its negative equivalent and vice versa.
- (binary) | a - b | Subtraction | b subtracted from a.
* | a * b | Multiplication | Product of a and b.
/ | a / b | Division | The division of a by b. The result is always of type float.
% | a % b | Modulo | The remainder when a is divided by b.
// | a // b | Floor division (also called integer division) | The division of a by b, rounded down to the nearest integer.
** | a ** b | Exponentiation | a raised to the power of b.
Python supports many operators for combining data into
expressions. These can be divided into arithmetic, comparison, logical, assignment, and bitwise:
Observation 2.10 – Efficient Script
Writing: Include expressions that display results inside the print function
to avoid multiple instructions. Use a
single statement to declare and assign
values to multiple variables.
2.6.2.1 Arithmetic Operators
These operators can be used with integers and floating-point numbers, and some of them (e.g., + and *) can also be applied to strings. Table 2.3 lists the arithmetic operators supported by Python, and the example that follows presents a script that applies a number of these operators. It is worth noting that the arithmetic expressions are not separate statements in the script. Instead, they appear as arguments in the print() function. Both options are correct, although it is advisable to follow a syntax similar to the script in order to write shorter and more compact scripts.
a = 5
b = 4

# Addition expression
print('a+b=', a + b)

# Subtraction expression
print('a-b=', a - b)

# Multiplication expression
print('a*b=', a * b)

# Division expression
print('a/b=', a / b)

# Exponent expression
print('a raised to the power of b =', a ** b)

# Unary negation expression
print('a negated is =', -a)

# Modulus expression
print('The remainder of the integer division between a and b is:', a % b)

# Floor division
print('Floor division of a and b is:', a // b)
Output 2.6.2.a:
a+b= 9
a-b= 1
a*b= 20
a/b= 1.25
a raised to the power of b = 625
a negated is = -5
The remainder of the integer division between a and b is: 1
Floor division of a and b is: 1
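The three division-related operators are easy to confuse, especially with negative operands. The sketch below uses arbitrary values to contrast them. It also shows the + and * operators applied to strings, including the error raised when a string and a number are added.

```python
a, b = 7, 2

print(a / b)         # 3.5  (true division, always a float)
print(a // b)        # 3    (floor division)
print(a % b)         # 1    (remainder)
print(divmod(a, b))  # (3, 1): quotient and remainder together

# Floor division rounds down, so negative results move away from zero
print(-a // b)  # -4
print(-a % b)   # 1

# + concatenates two strings and * repeats a string
print('ab' + 'cd')  # abcd
print('ab' * 3)     # ababab

# Mixing a string and a number with + is an error
try:
    'ab' + 1
except TypeError as error:
    print('Cannot add a string and a number:', error)
```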
2.6.2.2 Comparison Operators
These operators compare values for equality or inequality, (i.e., the relation between the two operands, be it numbers, characters, or strings). They yield a Boolean value as a result. The comparison
operators are typically used with some type of conditional statement (see: 2.8 Selection Structures)
or within an iteration structure (see: 2.9 Iteration Structures), determining the branching or looping
directions to follow. Table 2.4 lists the comparison operators supported by Python, and the code that
follows provides some relevant example cases using a Python script.
TABLE 2.4
Python Comparison Operators
Operator | Example | Name | Description
== | a == b | Equal to | True if the value of a is equal to that of b; False otherwise
!= | a != b | Not equal to | True if a is not equal to b; False otherwise
< | a < b | Less than | True if a is less than b; False otherwise
<= | a <= b | Less than or equal to | True if a is less than or equal to b; False otherwise
> | a > b | Greater than | True if a is greater than b; False otherwise
>= | a >= b | Greater than or equal to | True if a is greater than or equal to b; False otherwise
An interesting point about this particular script is that the variables are all declared and assigned
with values in one statement separated by commas. The script also demonstrates the use of a mix of
strings and arithmetic expressions as arguments of the print function, separated by commas:
a, b, c, d, e = 5, 4, 5, 'Dubai', 'Abu Dhabi'

# Test for equality and print directly the result of the expression
print(a == b, 'and', a == c)

# Test for inequality and print directly the result of the expression
print(a != b, 'and', a != c)

# Test for 'less than' and for 'less than or equal to' and
# print directly the result of the expression
print(a < b, 'and', a <= b)

# Test for 'greater than' and for 'greater than or equal to' and
# print directly the result of the expression
print(a > b, 'and', a >= b)

# Test for equality and 'greater than' between strings
print(d == e, 'and', d > e)
Output 2.6.2.b:
False and True
True and False
False and False
True and True
False and True
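The last line of the output may be surprising at first. As a brief sketch, assuming the same values for d and e, the following lines illustrate that strings are compared character by character using their Unicode code points. The chained comparison at the end is an additional Python feature not used in the script above:

```python
d, e = 'Dubai', 'Abu Dhabi'

# Strings are compared character by character using Unicode code points
print(ord('D'), ord('A'))   # 68 65
print(d > e)                # True, because 'D' (68) > 'A' (65)

# Uppercase letters have lower code points than lowercase ones
print('Zebra' < 'apple')    # True

# Comparisons can be chained: equivalent to (1 < a) and (a < 10)
a = 5
print(1 < a < 10)           # True
```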
2.6.2.3 Logical Operators
As mentioned, comparison operators compare their operands and produce a Boolean output. This
type of output is commonly used in branching and looping statements. Logical (Boolean) operators are used
to combine multiple comparison expressions into a more complex, single expression. Their operands
are normally Boolean values, although Python accepts any value and interprets it as True or False. Table 2.5 lists the logical operators supported
by Python and the following script demonstrates some of their indicative applications:
# Apply the 'not' logical operator
x = 5
print(not (x < 10))
print(not (x < 3))

# Apply the 'or' logical operator
x, y = 5, 7
print((x > 3) or (y < 6))
print((x < 3) or (y < 6))

# Apply the 'and' logical operator
x, y = 5, 7
print((x > 3) and (y > 6))
print((x < 3) and (y > 6))

# Combine the 'not', 'and', and 'or' operators
x, y = 5, 7
print(not (x < 3) and (y > 6))
print((x < 3) or (y > 6) and (x < 10))
Output 2.6.2.c:
False
True
True
False
True
False
True
True
TABLE 2.5
Python Logical Operators
Operator   Example   Description
not        not a     True if a is False; False if a is True
or         a or b    True if either a or b is True; False otherwise
and        a and b   True if both a and b are True; False otherwise
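Two properties of the logical operators are easy to overlook: and binds more tightly than or, and both and and or stop evaluating as soon as the result is decided (short-circuit evaluation). The following is a minimal sketch of both; the names implicit, explicit, divisor, safe, and name are illustrative assumptions:

```python
x, y = 5, 7

# 'and' binds tighter than 'or': parsed as (x < 3) or ((y > 6) and (x < 10))
implicit = (x < 3) or (y > 6) and (x < 10)
explicit = (x < 3) or ((y > 6) and (x < 10))
print(implicit == explicit)     # True

# Short-circuit evaluation: the right operand is skipped when the left
# operand already decides the result, so no division by zero occurs
divisor = 0
safe = (divisor != 0) and (10 / divisor > 1)
print(safe)                     # False

# With non-Boolean operands, 'or' returns one of the operands themselves
name = '' or 'anonymous'
print(name)                     # anonymous
```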
TABLE 2.6
Python Assignment Operators
Operator   Example          Description
=          c = a + b        Assigns the result of the expression on the right side of the assignment operator to the variable on the left side.
+=, -=     c += a, c -= b   Equivalent to c = c + a or c = c - b
*=, /=     c *= a, c /= b   Equivalent to c = c * a or c = c / b
//=        c //= a          Equivalent to c = c // a
%=         c %= a           Equivalent to c = c % a
**=        c **= a          Equivalent to c = c ** a
2.6.2.4 Assignment Operators
These quite significant operators allow the manipulation of variables by saving or updating their
values. Table 2.6 and the code that follows summarize the use of the different assignment operators
in Python:
# Assign the result of the expression on the right side of
# the assignment operator to the variable on the left side
a, b = 12, 10
c = a + b
print('The value of c is:', c)

# Use +=, -=, *=, /= in assignments
a, c = 2, 12
c += a
print('The value of c is:', c)

a, c = 2, 12
c -= a
print('The value of c is:', c)

a, c = 2, 12
c *= a
print('The value of c is:', c)

a, c = 2, 12
c /= a
print('The value of c is:', c)

# Use the %= and **= in assignments
a, c = 4, 10
c %= a
print('The value of c is:', c)

a, c = 4, 10
c **= a
print('The value of c is:', c)
Output 2.6.2.d:
The value of c is: 22
The value of c is: 14
The value of c is: 10
The value of c is: 24
The value of c is: 6.0
The value of c is: 2
The value of c is: 10000
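The script above does not exercise the //= operator listed in Table 2.6. As a short illustrative sketch (the values and the variable text are assumptions chosen for the example), it behaves like the other augmented assignments, and += and *= also work on strings:

```python
c = 17
c //= 5         # equivalent to c = c // 5
print(c)        # 3

c = 2
c **= 3         # c is now 8
c -= 1          # c is now 7
print(c)        # 7

# With strings, += concatenates and *= repeats
text = 'ab'
text += 'cd'
text *= 2
print(text)     # abcdabcd
```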
2.6.2.5 Bitwise Operators
These are considered to be low-level operators. They treat integer operands as sequences of binary digits
and operate on them bit by bit. Table 2.7 details the bitwise operators supported by Python and
the example that follows demonstrates their application within a script. The reader should note
that when a value is written in the binary system, it must be preceded by the prefix 0b
(e.g., 0b1100). Likewise, when a value must be displayed in
binary form, a format specification such as {:04b} can be used, which zero-pads the binary value to at least four digits.
TABLE 2.7
Python Bitwise Operators
Operator   Example          Name                         Description
&, |       a & b, a | b     Bitwise AND, OR              Each bit position in the result is the logical AND (or OR) of the bits in the corresponding position of the operands; for AND, 1 if both are 1, otherwise 0; for OR, 1 if either is 1, otherwise 0.
~          ~a               Bitwise negation             Each bit position in the result is the logical negation of the bit in the corresponding position of the operand; 1 if 0, 0 if 1.
^          a ^ b            Bitwise XOR (exclusive OR)   Each bit position in the result is the logical XOR of the bits in the corresponding position of the operands; 1 if the bits in the operands are different, 0 if they are the same.
>>, <<     a >> n, a << n   Shift right or left n places Each bit is shifted right or left by n places.
# Bitwise 'and'
a, b = 0b1100, 0b1010
print('0b{:04b}'.format(a & b))

# Bitwise 'and'
a, b, c = 12, 10, 0 # 12 = 0b1100, 10 = 0b1010
c = a & b # 8 = 0b1000
print('Value of c is', c)

# Bitwise 'or'
a, b = 0b1100, 0b1010
print('0b{:04b}'.format(a | b))

# Bitwise 'or'
a, b, c = 12, 10, 0 # 12 = 0b1100, 10 = 0b1010
c = a | b # 14 = 0b1110
print('Value of c is', c)

# Bitwise negation
a = 0b1100
b = ~a
print('0b{:04b}'.format(b))

# Bitwise negation
a = 12 # 12 = 0b1100
b = ~a # -13 = -0b1101
print('Value of b is', b)

# Bitwise XOR (exclusive OR)
a, b = 0b1100, 0b1010
print('0b{:04b}'.format(a ^ b))

# Bitwise XOR (exclusive OR)
a, b = 12, 10 # 12 = 0b1100, 10 = 0b1010
c = a ^ b # 6 = 0b0110
print('Value of c is', c)

# Shift right 'n' places
a = 0b1100
print('0b{:04b}'.format(a >> 2))

# Shift right 'n' places
a = 12 # 12 = 0b1100
c = a >> 2 # 3 = 0b0011
print('Value of c is', c)

# Shift left 'n' places
a = 0b1100
print('0b{:04b}'.format(a << 2))
Output 2.6.2.e:
0b1000
Value of c is 8
0b1110
Value of c is 14
0b-1101
Value of b is -13
0b0110
Value of c is 6
0b0011
Value of c is 3
0b110000
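The negative result of the bitwise negation follows from the fact that Python integers are signed and not limited to a fixed number of bits. As a minimal sketch, assuming a 4-bit view of the value is what the reader wants, a mask can be applied to keep only the lowest four bits. The built-in bin() function is shown as an alternative to the format specification:

```python
a = 0b1100      # 12

# For integers, ~a is always equal to -a - 1
print(~a == -a - 1)                     # True

# Masking with 0b1111 keeps the lowest four bits,
# giving the 4-bit complement of 1100
print('0b{:04b}'.format(~a & 0b1111))   # 0b0011

# Shifting left by n multiplies by 2**n; shifting right floor-divides by 2**n
print(a << 2 == a * 2 ** 2)             # True
print(a >> 2 == a // 2 ** 2)            # True

# bin() is a built-in alternative to the format specification
print(bin(a))                           # 0b1100
```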
2.6.2.6 Operators Precedence
Python, like other programming languages, uses the
standard algebraic procedure to evaluate expressions.
All operators are assigned a precedence:
• Operators with the highest precedence are applied first.
• Next, their results are used as operands for the operators with the next highest precedence.
• Operators with equal precedence are applied from left to right. The exception is exponentiation (**), which is applied from right to left.
• This pattern continues until the full expression is calculated.
Observation 2.11 – Order of Precedence: The order of precedence of operator execution determines
the result of complex expressions. Inconsistencies can lead to incorrect scripts.
Table 2.8 lists the operator precedence for Python, from lowest to highest. The code following this
provides some examples of their application. It is essential for the reader to keep in mind the order of
precedence of the various operators, since failure to do so will most certainly lead to inconsistencies
in the way the complex expressions are calculated by the system:
TABLE 2.8
Python Precedence Operators
Precedence   Operator                            Description
Lowest       or                                  Boolean OR
             and                                 Boolean AND
             not                                 Boolean NOT
             ==, !=, <, <=, >, >=, is, is not    Comparisons, identity
             |                                   Bitwise OR
             ^                                   Bitwise XOR
             &                                   Bitwise AND
             <<, >>                              Bit shifts
             +, -                                Addition, subtraction
             *, /, //, %                         Multiplication, division, floor division, modulo
             +x, -x, ~x                          Unary positive, unary negation, bitwise negation
Highest      **                                  Exponentiation
# The order of execution is exponentiation first,
# then multiplication: 2 ** 2 = 4, then 5 * 4 = 20
a = 5 * 2 ** 2
print('The value of a is:', a)

# The order of execution is multiplication first,
# then addition: 2 * 3 = 6, then 2 + 6 = 8
a = 2 + 2 * 3
print('The value of a is:', a)

# Parentheses have the highest precedence,
# then everything else: (2 + 2) = 4, then 4 * 3 = 12
a = (2 + 2) * 3
print('The value of a is:', a)

# Addition and subtraction have the same precedence,
# hence, they are evaluated from left to right.
# This is also the case between arithmetic operators
# with equal precedence: 2 + 2 = 4, then 4 - 3 = 1
a = 2 + 2 - 3
print('The value of a is:', a)
Output 2.6.2.f:
The value of a is: 20
The value of a is: 8
The value of a is: 12
The value of a is: 1
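A few further cases from Table 2.8 tend to surprise newcomers. The expressions below are illustrative additions rather than part of the original script. They show that exponentiation binds more tightly than unary negation, that it groups from right to left, and that comparisons are evaluated before not:

```python
# Exponentiation binds tighter than unary minus: parsed as -(2 ** 2)
print(-2 ** 2)          # -4
print((-2) ** 2)        # 4

# Exponentiation groups right to left: parsed as 2 ** (3 ** 2)
print(2 ** 3 ** 2)      # 512

# Comparisons bind tighter than 'not': parsed as not ((1 + 1) == 3)
print(not 1 + 1 == 3)   # True
```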
2.7 SEQUENCE: INPUT AND OUTPUT STATEMENTS
Similarly to most other contemporary programming languages, Python is organized around
functions, i.e., reusable programming routines that can be attached to an object of a class or used as
standalone pieces of code that perform specific tasks. Python has a quite extensive array of predefined functions, which are either built into the core of the language itself or provided as part of
the various classes used by it.
Observation 2.12 – Input/Output: Use the print() function to display output on screen. Output is passed
to the function as an argument. Use the input() function to receive input from the keyboard. Ensure that
the value returned by input() is assigned to a variable; otherwise, it is discarded and cannot be used later in the script.
An example of a Python function that has already appeared in several of the exercises presented in this
chapter is the print() function. As the name suggests, this is a function used to display output on screen.
To invoke it one simply has to call it with an argument (e.g., print(<argument>)).
Another frequently used Python function is input(), used to get input from the keyboard. This function
prompts the user to provide input in the form of text. The function stops the program execution until the text input
has been provided and resumes only when the user presses the designated key (i.e., Enter or Return). The following example demonstrates the use of
both print() and input() in a single Python script:
# Call the 'input' function to accept the user's input from
# the keyboard and assign the provided data to a variable
fullName = input('Insert your full name\n')

# Print the contents of the variable fullName on screen
print('The name you entered is', fullName)
Output 2.7.a & 2.7.b:
Insert your full name
Insert your full name
Rania
The name you entered is Rania
It is important to point out the following in regard to this particular script:
• Any value received as input should be assigned to a suitable variable. If the value returned by input() is not assigned, Python discards it and the script cannot use it later.
• Escape character \n should be used to force the display of the next output of the program
to the next line.
• The input() function treats all input streams as text regardless of whether numeric values are provided. If an input stream is meant to be treated as a numerical value, further
processing is required.
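The last point can be sketched as follows. To keep the example runnable without keyboard interaction, the variables ageText and priceText are hypothetical stand-ins for text that input() would return. The try/except construct used at the end is an assumption of this sketch and is not introduced in this section:

```python
# Stand-ins for text that input() would return
ageText = '42'
priceText = '2.5'

# Without conversion, + concatenates the two strings
print(ageText + ageText)        # 4242

# int() and float() convert numeric text into numbers
age = int(ageText)
price = float(priceText)
print(age + 1)                  # 43
print(price * 2)                # 5.0

# Text that is not a valid number raises ValueError
try:
    int('forty-two')
    converted = True
except ValueError:
    converted = False
    print('Not a valid integer')
```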
2.8 SELECTION STRUCTURE
One of the three principles of computer programming is deciding which block of
statements to execute next, based on the result of the evaluation of a certain condition. Such a condition,
together with the statements to execute based on it, is referred to as a selection. There are three main types of
selection statements: if, if…else, and if…elif…else.
2.8.1 The if Structure
The if structure is used to determine whether a certain statement or block of statements will be executed
or not, based on a simple or complex condition. If the condition is True (or non-zero), then the block of statements
is executed, otherwise it is not executed and the program flow continues from the next statement outside
the if structure. This means that the evaluation of the condition must yield a Boolean or arithmetic
(i.e., zero/non-zero) value.
Observation 2.13 – Condition: A True/False or zero/non-zero value expression used to determine the flow
of program execution.
The syntax of the basic if statement is provided below:
if (condition):
    Block of statements to execute if condition is True
Statements to execute outside the if statement
Similarly, Figure 2.3 illustrates a simple if statement in the form of a flowchart.
Most high-level programming languages, such as C++ or Java, use braces {} to mark a block
of statements. Since Python does not have any designated markers for this purpose, it uses
indentation to identify these blocks. Under this scheme, the block starts at the first indented line and
ends just before the first line that returns to the previous indentation level. Consider the following script:
# Simple 'if' statement
a = int(input('Enter the first integer to continue: '))
b = int(input('Enter the second integer to continue: '))
if (a > b):
    print("The first integer is larger than the second")
FIGURE 2.3 Flowchart of the if statement.
Output 2.8.1:
Enter the first integer to continue: 5
Enter the second integer to continue: 3
The first integer is larger than the second
In this example, the user is prompted to enter two integer values assigned to two different, corresponding variables. Next, the variables are compared based on their
values. This is done with a simple if statement that,
when True, displays a message on screen. Both the
input() and print() functions are used in the script.
The reader should note that, since the input() function treats every input as text, it is necessary to convert
this value into a suitable primitive type for the required
calculations or processing to take place. This is the idea
behind casting. In this particular example, the input
value is cast into an integer using the int() function.
Also, the reader should note that it is possible to use one
function call inside another, in this case the input()
function call inside the int() cast call.
Observation 2.14 – if Statement:
Used to determine whether a statement or block of statements will be
executed or not, based on a simple or
complex condition.
Observation 2.15 – Indentation:
Use indentation to mark a block of
statements.
Observation 2.16 – Casting: Convert
input values to appropriate primitive
data type, as required for calculations
or processing.
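The zero/non-zero interpretation of a condition mentioned in Observation 2.13 can be checked with the built-in bool() function. The following is a small sketch; the variables remaining and message are illustrative assumptions:

```python
# bool() shows how a value is interpreted when used as a condition
print(bool(0), bool(-3), bool(0.0))     # False True False
print(bool(''), bool('0'))              # False True

# A non-zero integer can be used directly as the condition
remaining = 3
message = 'Nothing left'
if remaining:
    message = 'Items left: ' + str(remaining)
print(message)                          # Items left: 3
```

Note that the string '0' counts as True because it is not empty, which is one more reason to cast input text to a number before testing it.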
2.8.2 The if…else Structure
It is possible to write the if statement in a way that it
executes one block of statements when the condition is
True and another when it is not. This is the concept
behind the if…else statement:
if (condition):
    Block of statements to execute if condition is True
else:
    Block of statements to execute if condition is False
Figure 2.4 illustrates an if…else structure as a flowchart and the following code provides an example of
its application. This particular script prompts the user
to enter two integers (note that input is treated as text
by default), converts the input to actual integers, compares the two values, and displays one of the two outputs, depending on the result of the comparison. Note that the else block also runs when the two values are equal, since the condition a > b is False in that case. In this
example, only one of the two statements is executed, since the
condition of the if statement is either True or
False. However, the programmer can add multiple statements within each block, and it is also
possible to nest another if statement inside a
block. Such cases are discussed in later sections of this
chapter.
FIGURE 2.4 Flowchart of the if…else statement.
Observation 2.17 – Selection:
• Use the if statement for the
execution of one block of
statements if the condition is
True.
• Use the if…else statement
for the execution of either of
two possible blocks of statements depending on a particular condition.
• Use the if…elif…else
statement for the execution
of multiple possible blocks of
statements depending on a
number of conditions.
• Use dictionary/mapping structures in place of the switch
structure of C++, Java, etc.
• Use conditional expression in
place of the conditional operator used in C++, Java, etc.
• Use nested if structures in
more complex cases.
# The 'if…else…' statement
a = int(input('Enter the first integer to continue: '))
b = int(input('Enter the second integer to continue: '))
if (a > b):
    print('First integer holds a value greater than the second')
else:
    print('Second integer holds a value greater than the first')
Output 2.8.2:
Enter the first integer to continue: 13
Enter the second integer to continue: 20
Second integer holds a value greater than the first
2.8.3 The if…elif…else Structure
Python allows a single if structure to choose among more than two blocks of statements. The conditions
are tested in order, and the block associated with the first condition that is True is
executed. The remaining blocks are ignored and the program execution continues at the first
line after the if structure. If none of the conditions is True, then the else block is executed.
The syntax of the if…elif…else structure is provided below, and its flowchart can be found in
Figure 2.5:
if (condition1):
    Block to execute if condition1 is True
elif (condition2):
    Block to execute if condition2 is True
…
else:
    Block to execute if none of the conditions are True
FIGURE 2.5 Flowchart of the if…elif…else statement.
The following script demonstrates the application of an if…elif…else structure. The script
prompts the user to enter an integer between 0 and 100. Depending on the input value, a particular
block of code is executed based on the conditions of the various if…elif…else structures:
# The 'if…elif…else…' statement
a = int(input('Enter a grade between 0 and 100: '))

if (a < 60):
    print('I am sorry but you failed the course.\n'\
        'Please try harder next semester')
elif (a < 70):
    print('Task completed! You passed the course')
elif (a < 80):
    print('Well done! You did well in the course')
elif (a < 90):
    print('Very good job. Keep up the good work')
elif (a <= 100):
    print('Excellent performance. Congratulations.')
else:
    print('I am sorry but an integer between 0 and 100 was expected')
Output 2.8.3:
Enter a grade between 0 and 100: 92
Excellent performance. Congratulations.
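Because the branches are tested in order and only the first True condition fires, the order of the tests matters. The following variation is a sketch under the assumption that out-of-range grades, including negative ones, should be rejected before any classification takes place. The variable result and its short labels are hypothetical, and a fixed value stands in for the keyboard input:

```python
a = 100     # stand-in for int(input(...)); 100 is a valid grade

# Reject invalid input first, so later branches only see 0..100
if (a < 0 or a > 100):
    result = 'Out of range'
elif (a < 60):
    result = 'Failed'
elif (a < 70):
    result = 'Passed'
elif (a < 80):
    result = 'Did well'
elif (a < 90):
    result = 'Very good'
else:
    result = 'Excellent'

print(result)   # Excellent
```

With the range check placed first, the final else can safely cover every grade from 90 to 100 inclusive.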
2.8.4 Switch Case Structures
A switch case structure is used as an alternative to long if structures that compare a variable
against several values. Unlike other programming languages, Python has traditionally lacked a dedicated
switch case statement (the match statement introduced in Python 3.10 covers similar ground). In its absence, programmers may use an if…
elif…else structure, as described in the previous section. Alternatively, dictionary/mapping can
be used as shown in the script below:
# Dictionary mapping used to check against a range of options
numberToTextSwitcher = {
    1: 'One',
    2: 'Two',
    3: 'Three'
}

number = input('Insert 1, 2, or 3: ')
intNumber = int(number)
print('The string value of', intNumber, \
    'is', numberToTextSwitcher.get(intNumber))
Output 2.8.4:
Insert 1, 2, or 3: 3
The string value of 3 is Three
The reader should note some interesting points in relation to this script:
• The dictionary/mapping variable type, in this example numberToTextSwitcher, can
be used to substitute the functionality of the missing switch statement.
• When a statement is long and difficult to include in a single line, the programmer can use the \
symbol to inform the Python interpreter that the statement continues in the next line.
• The get() method of the dictionary/mapping variable is called with the key (i.e., the first
part of the pair) to get access to the value (i.e., the second part of the pair).
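One detail worth illustrating is what happens when the key is not in the dictionary. As a brief sketch, get() returns None for a missing key, and an optional second argument supplies a fallback value, which plays a role similar to the default case of a switch statement. The fallback text 'Unknown' is an assumption chosen for the example:

```python
numberToTextSwitcher = {1: 'One', 2: 'Two', 3: 'Three'}

# get() returns None when the key is missing
print(numberToTextSwitcher.get(4))              # None

# A second argument supplies a default, similar to a 'default' case
print(numberToTextSwitcher.get(4, 'Unknown'))   # Unknown
print(numberToTextSwitcher.get(2, 'Unknown'))   # Two
```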
2.8.5 Conditional Expressions
Another construct that can be used in Python, in place of the conditional operator (?:) of C++
or Java, is what is often called the conditional expression. The syntax is the following:
Expression 1 if condition else Expression 2
In this case, the condition is evaluated first. If it is True, the
first expression is evaluated; otherwise, the second expression is evaluated. The following code provides an example of the application of the conditional expression:
# Use of 'conditional expression' instead of the 'if…else' statement
a = int(input('Enter the first integer (a): '))
b = int(input('Enter the second integer (b): '))

print('a is greater than b') if (a > b) else print('b is greater than a')
Output 2.8.5:
Enter the first integer (a): 3
Enter the second integer (b): 6
b is greater than a
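Since a conditional expression produces a value, it is often used on the right side of an assignment rather than wrapped around two print() calls. The sketch below uses fixed stand-in values instead of keyboard input; the names larger and relation are illustrative assumptions. The chained form also covers the case where the two values are equal:

```python
a, b = 3, 6     # stand-ins for the values read with input()

# The conditional expression yields a value that can be assigned
larger = a if (a > b) else b
print('The larger value is', larger)    # The larger value is 6

# Conditional expressions can be chained to cover the case a == b
relation = 'greater than' if (a > b) else 'equal to' if (a == b) else 'less than'
print('a is', relation, 'b')            # a is less than b
```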
2.8.6 Nested if Statements
As already implied, it is possible to have an if structure nested inside another. In fact, such a practice could go to as much depth as the programmer wishes, although it is not advisable to go deeper
than three levels since it will be difficult to conceptually control the resulting structure. A possible
syntax for the nested if structure is presented below:
if (condition 1):
    if (condition 2):
        Block 1 executes
    else:
        Block 2 executes
else:
    Block 3 to execute if condition 1 is False
Block 1 will be executed if condition 2 is True. Condition 1 is not considered at this point, as it is
already known to be True. Note that if this was not the case, the program flow would never reach the nested
if(<condition 2>) statement. Also, the first else statement is an alternative to the
if(<condition 2>) part of the structure and not to the if(<condition 1>) part. The latter
is taken care of by the second else statement. The code that follows is an example of a nested if,
based on a simple variation of a previously used script:
# A script with a basic nested 'if' structure
inputGrade = int(input('Enter your grade between 0 and 100: '))

if (inputGrade >= 80):
    if (inputGrade >= 90):
        print('Excellent performance')
    else:
        print('Very good. Keep up the good work')
else:
    if (inputGrade >= 60):
        print('You did well')
    else:
        print('Sorry, you failed the course')
Output 2.8.6:
Enter your grade between 0 and 100: 50
Sorry, you failed the course
2.9 ITERATION STATEMENTS
Application developers and programmers always look
to optimize their programs using appropriate, efficient
statements and minimizing the lines of code in order
to create an easy to maintain program. A common way
to reduce the lines of code is the concept of iteration.
Indeed, iteration, alongside sequence (i.e., sequential
execution of statements) and selection (see previous sections),
constitutes what is known in computer programming as the three basic principles of programming. The iteration concept applies to cases where a
block of statements has to be repeated several times. There are three possible iteration alternatives
offered in Python: the while loop, the for loop, and the nested loops.
Observation 2.18 – Loop: A block of statements that is executed repeatedly
while a certain condition is True. There are three possible forms of
loops: while loops, for loops, and nested loops.
2.9.1 The while Loop
The while loop is suitable for cases where the number of iterations is unknown and depends on
certain conditions. These conditions need to be specified explicitly, similarly to the various forms
of selection statements. The block of statements inside
the loop is repeated as long as the specified conditions
are satisfied. Once the conditions become False the
Python interpreter exits the loop and proceeds with the
rest of the program. The block of statements within the
loop structure needs to be indented.
Observation 2.19 – while Loop:
Repeatedly executes a block of statements while a certain condition is
True. If the condition is False at the start,
the block is never executed. If the
condition never changes to False,
the block is executed indefinitely,
causing an infinite loop.
The syntax of the basic while loop and its flowchart (Figure 2.6) are provided below:
# while loop with one condition
while (condition):
    Block of statements
    …
# while loop with two conditions;
# op can be any logical operator
while (condition) op (condition2):
    Block of statements
    …
If the condition is not met before the loop begins, the block of statements will not be executed at all. It is also possible that the variables tested in the condition are never updated inside the while loop,
in which case the block will be executed indefinitely, resulting in an undesirable infinite loop. In
order to avoid the latter, it is essential for these variables to be updated inside the while loop.
FIGURE 2.6 Flowchart of the while loop.
The following script provides a basic example of the while loop. The program starts by prompting the user to decide whether the message should be displayed or not. This is done by entering
either ‘Y’/‘y’ or ‘N’/‘n’. Any other input is treated in the same way as ‘N’/‘n’. In this arrangement, the flow
goes into the block that belongs to the while loop only when the user enters ‘Y’ or ‘y’. Note that
the same prompt for input is given to the user inside the loop. This is because it is necessary to
update this value in order to re-evaluate the while condition. As mentioned, if this value is not
modified inside the loop (i.e., if the statement showMessage = input('Do you want to
show the message again (Y/N)? ') is missing) the program execution would lead into an
infinite loop. The program will continue to run as long as the user enters ‘Y’ or ‘y’:
# Use of 'while' loop to show the message 'Hello world'
# as long as the user enters 'Y' or 'y'
showMessage = input('Do you want to show the message again (Y/N)? ')

while (showMessage == 'Y' or showMessage == 'y'):
    print('Hello world')
    showMessage = input('Do you want to show the message again (Y/N)? ')
Output 2.9.1.a:
Do you want to show the message again (Y/N)? Y
Hello world
Do you want to show the message again (Y/N)? Y
Hello world
Do you want to show the message again (Y/N)? N
Another example of a while loop can be seen in the script below, which introduces the use of the
end = '' argument in the print() function. This argument replaces the newline character that print() normally
appends, so the next output continues on the same line instead of proceeding to the next one:
# Use the 'while' loop to display all integers
# between two values provided by the user

numberToShow = int(input('Enter the starting integer: '))
endInteger = int(input('Enter the ending integer: '))

while (numberToShow <= endInteger):
    print(numberToShow, ' ', end = '')
    numberToShow += 1
Output 2.9.1.b:
Enter the starting integer: 5
Enter the ending integer: 10
5 6 7 8 9 10
The next script is a classic example of adding together all integers between two values, which are entered
by the user at runtime. The reader should note how the loop control variable (i.e., currentInteger) is being modified inside the block of statements. Also, it should be noted how the two
print() functions are used and connected through the end = '' argument, in order to display the
results in a single line:
# Use the 'while' loop to add all integers between two values
# provided by the user

currentInteger = int(input('Enter the starting integer:'))
endingInteger = int(input('Enter the ending integer:'))
sumOfValues = 0
while (currentInteger <= endingInteger):
    print('currentInteger value is', currentInteger, end = '')
    sumOfValues += currentInteger
    currentInteger += 1
    print(' and sumOfValues currently is', sumOfValues)
Output 2.9.1.c:
Enter the starting integer:1
Enter the ending integer:5
currentInteger value is 1 and sumOfValues currently is 1
currentInteger value is 2 and sumOfValues currently is 3
currentInteger value is 3 and sumOfValues currently is 6
currentInteger value is 4 and sumOfValues currently is 10
currentInteger value is 5 and sumOfValues currently is 15
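A loop like this is easy to get wrong by one iteration, so it can be useful to cross-check its result. The following sketch repeats the accumulation with fixed stand-in values (start and end are assumptions replacing the keyboard input) and compares it with the built-in sum() function and with the closed-form formula for an arithmetic series:

```python
start, end = 1, 5

# Accumulate with a while loop, as in the script above
current, sumOfValues = start, 0
while (current <= end):
    sumOfValues += current
    current += 1

# Cross-check with the built-in sum() and with the closed-form formula
print(sumOfValues)                              # 15
print(sum(range(start, end + 1)))               # 15
print((start + end) * (end - start + 1) // 2)   # 15
```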
In addition to the above, it is also possible to have an if structure of any type nested inside the
while loop. The following code provides an example of a script that repeatedly accepts integers
from the keyboard and, once the user enters 0, displays how many of them were even and how many were odd.
What is noteworthy in this script is the use of an if…else structure inside the while loop:
""" Use of the 'while' loop to count the number of even and
odd numbers from an input stream provided by the user.
Stop the loop and display the results when the user enters 0 """

# Declare the counters for even and odd numbers
countEven, countOdd = 0, 0

# Declare a variable to temporarily store current input value
userInput = int(input('Enter an integer, '
                      'or 0 to display the results and exit: '))

# The 'while' loop that repeatedly executes the main block of code
while (userInput != 0):
    if (userInput % 2 == 0):
        countEven += 1
    else:
        countOdd += 1
    # Repeatedly accept new input from the user until 0 is entered
    userInput = int(input('Enter an integer, or 0 to display '
                          'the results and exit: '))

# Display the results of the program
print('You entered', countEven, 'even and', countOdd, 'odd numbers')
Output 2.9.1.d:
Enter an integer, or 0 to display the results and exit: 2
Enter an integer, or 0 to display the results and exit: 3
Enter an integer, or 0 to display the results and exit: 4
Enter an integer, or 0 to display the results and exit: 5
Enter an integer, or 0 to display the results and exit: 6
Enter an integer, or 0 to display the results and exit: 0
You entered 3 even and 2 odd numbers
Programmers can also use a logically modified version of the while loop in place of the do…until
(or repeat…until) loop, another classic programming language loop structure that is not directly
available in Python. When using the while loop to replace the do…until functionality, the programmer should make sure that the while condition is True during the first iteration, and that its
value is repeatedly updated at the end of the block of statements inside the loop.
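One common way to obtain this behavior is sketched below. It assumes the break statement, which has not been introduced in this section, and uses a hypothetical list named entries as a stand-in for successive keyboard entries. The condition is fixed to True so that the body always runs at least once, and the exit test sits at the end of the block:

```python
# Stand-in for successive keyboard entries
entries = ['7', '3', '0']
position = 0
total = 0

while True:
    # The body always runs at least once, as in do...until
    value = int(entries[position])
    position += 1
    total += value
    # Exit test placed at the end of the block
    if (value == 0):
        break

print('Total:', total)  # Total: 10
```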
2.9.2 The for Loop
The for loop structure allows for the execution of a block
of statements for a predefined number of iterations. The Observation 2.20 – for Loop:
loop controls the number of iterations using a counter Repeatedly executes a block of state(i.e., a variable declared locally in the loop), within a spe- ments for a predefined number of
cific range defined by two numbers: start and end. The times. The end of the loop must be
range can be also specified by just one end number, in defined, the start can be omitted,
which case the start will be considered to be 0 by default. and the step can be specified in the
Additionally, it is possible to include an incremental or header.
decremental step inside the for header. Each repeated
statement is placed within the block of statements, inside the for loop. The syntax for each of the
three types of the for loop is provided below, while Figure 2.7 showcases the associated flowchart:
# Number of iterations is end-start
for counter in range (start, end):
    Block of statements

# Number of iterations is end and starts from 0
for counter in range (end):
    Block of statements

""" Number of iterations is (end-start)/step; counter increases/
decreases by step """
for counter in range (start, end, step):
    Block of statements
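The three forms can be checked quickly by turning each range into a list. The following lines are a small illustrative sketch (the concrete numbers are arbitrary choices, not taken from the handbook):

```python
# Materialize each range as a list to inspect the counter values
print(list(range(2, 6)))         # [2, 3, 4, 5]  -> end - start = 4 iterations
print(list(range(4)))            # [0, 1, 2, 3]  -> starts from 0
print(list(range(1, 10, 3)))     # [1, 4, 7]     -> counter increases by 3
print(list(range(10, 0, -4)))    # [10, 6, 2]    -> counter decreases by 4

# Count the iterations of a stepped loop explicitly
iterations = 0
for counter in range(1, 10, 3):
    iterations += 1
print(iterations)                # 3
```

In every case the end value itself is never produced by range().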
The next script displays the names stored in a tuple. The block of
statements inside the for loop is executed four times with the i index starting at 0 and increasing
up to 3 (inclusive):
# Declare a variable as a 'tuple' of immutable string elements
myFriends = ('John', 'Ali', 'Steven', 'Catherine')

# Use a 'for' loop to read the elements in the 'tuple', first to last
for i in range (0, 4):
    print('Happy New Year:', myFriends[i])
print('Done.')
Introduction to Programming with Python
Output 2.9.2.a:
Happy New Year: John
Happy New Year: Ali
Happy New Year: Steven
Happy New Year: Catherine
Done.

FIGURE 2.7 Flowchart of the for loop.
A similar example is provided in the following script, where instead of a tuple variable a list is used.
The user is prompted to enter four names into the empty list, which are subsequently displayed on
screen:
# Declare a 'list' variable that will accept names provided by the user
nameList = []

# Declare a 'dictionary' mapping numbers 1–4
# to text values 'first', 'second', 'third', 'fourth', respectively
numberToText = {
    1: 'first',
    2: 'second',
    3: 'third',
    4: 'fourth'
}

# Use 'for' loop to accept 4 names; store them in the list
for i in range (0, 4):
    message = ('Enter the ' + str(numberToText.get(i + 1)) +
               ' name to insert in the dictionary: ')
    newName = input(message)
    nameList.insert(i, newName)

# Use a 'for' loop to display the newly created name list
for i in range (4):
    print(nameList[i])
print('Done.')
Output 2.9.2.b:
Enter the first name to insert in the dictionary: Hellen
Enter the second name to insert in the dictionary: Steven
Enter the third name to insert in the dictionary: Ahmed
Enter the fourth name to insert in the dictionary: Catherine
Hellen
Steven
Ahmed
Catherine
Done.
The reader should note the following:
• A list is declared using square brackets instead of the parentheses used for tuples. By leaving the square brackets empty, an empty list is created.
• Use a dictionary mapping to convert numeric values into the corresponding text (e.g.,
numberToText).
• Use the str() function to convert a numeric value into a string.
• Use the concatenation operator (+) to combine strings.
• Use the insert() function to populate the list. The first argument is the index of the new
element and the second is the actual value.
• If the start number is omitted in the for loop header, zero is assumed as a default value.
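As a complementary sketch, the same list can be populated with append() and read back without index arithmetic. This is a variation for illustration, not the handbook's script: the sampleNames list is an assumption that stands in for the input() calls, and enumerate() is a built-in function that has not been introduced in the text above:

```python
numberToText = {1: 'first', 2: 'second', 3: 'third', 4: 'fourth'}
sampleNames = ['Hellen', 'Steven', 'Ahmed', 'Catherine']   # stands in for input()

nameList = []
for newName in sampleNames:
    nameList.append(newName)      # append() adds to the end of the list

# enumerate() yields a running position alongside each element
for position, name in enumerate(nameList, start = 1):
    print('The', numberToText[position], 'name is', name)
```

When new elements always go to the end of the list, append() avoids having to track the index that insert() requires.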
2.9.3 The Nested for Loop
As with if statements, it is possible to embed a for
loop (i.e., inner loop) into another (i.e., outer loop) to
create a nested for loop. This is particularly convenient
when dealing with non-primitive data types of two or
more dimensions, or with more complex problems. The
syntax is provided below, and the associated flowchart is
presented in Figure 2.8:
Observation 2.21 – Nested Loops: Use nested loops of any type to address complex situations like mathematical problems, drawing shapes, searching or sorting, or dealing with multi-dimensional non-primitive data types.
FIGURE 2.8 Flowchart of the nested for loop.
for counter1 in range (start1, end1):
    Block of statements 1
    ...
    for counter2 in range (start2, end2):
        Block of statements 2
        ...
        for counter3 in range (start3, end3):
            Block of statements 3
            ...
Nested loops are commonly used for the implementation of programs that deal with various types
of non-primitive data types, such as lists, tuples, or sets. The following script provides an example
of a nested for loop structure, in which a two-dimensional list variable (i.e., languages) is displayed on screen. This particular variable stores six different elements (i.e., names of programming
languages) arranged in two rows of three elements each. The reader should
note how the counters of the nested loops are used as indices for the displayed items of the list:
# Define a two-dimensional list with 3 programming languages
# as its elements (per dimension)
languages=[['Python','Java','C++'],['PhP','HTML','Java Script']]

# A nested 'for' loop prints the 2 different dimensions of the list
for i in range(2):
    print(i, 'Set of programming languages:')
    for j in range(3):
        print('Happy new year:', languages[i][j])
print('All languages displayed')
Output 2.9.3.a:
0 Set of programming languages:
Happy new year: Python
Happy new year: Java
Happy new year: C++
1 Set of programming languages:
Happy new year: PhP
Happy new year: HTML
Happy new year: Java Script
All languages displayed
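The same two-dimensional list can also be traversed without hard-coding its size, by looping over the rows and their elements directly. The sketch below is an illustrative alternative; the displayed list is an assumption added only to make the result easy to inspect:

```python
languages = [['Python', 'Java', 'C++'], ['PhP', 'HTML', 'Java Script']]

displayed = []
# The outer loop walks over the rows, the inner loop over each row's elements
for rowNumber, row in enumerate(languages):
    print(rowNumber, 'Set of programming languages:')
    for language in row:
        print(language)
        displayed.append(language)

print(len(displayed), 'languages displayed')
```

This form keeps working if rows are added or if the rows have different lengths, whereas range(2) and range(3) would need to be updated by hand.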
Another common use of nested loops relates to the implementation of various sorting or searching
algorithms (see: Chapter 6). The following script provides another example of a nested for loop
structure that implements a classic sorting algorithm referred to as the Bubble Sort. This script does
the following:
• It declares two lists, one to accept the original list of integers and the other to store the
sorted list.
• It runs a for loop that accepts a number of integers as input from the user and transfers
them to the first list.
• It runs a second for loop that reads from the original list and transfers its elements into the second
one (sorted list).
• It runs a nested for loop that utilizes the Bubble Sort algorithm.
• Finally, it runs two more for loops: one that displays the original list of integers and one
that displays the sorted one.
It should be noted that the code presented in this script is not an example of the most efficient or
complete sorting algorithm, but a more simplistic implementation of it, as the main purpose was to
help the reader gain a better understanding of the use of nested loops:
originalList, sortedList = [], []

# The first 'for' loop accepts a number
# of integers and populates the 'originalList'
sizeOfList = int(input('Total number of integers in the list? '))
for i in range (sizeOfList):
    tempValue = int(input('Add an integer to the list: '))
    originalList.insert(i, tempValue)

# The second 'for' loop copies the 'originalList' into the
# 'sortedList' in preparation for sorting the latter
for i in range (sizeOfList):
    sortedList.insert(i, originalList[i])

# Use a nested 'for' loop to sort the 'sortedList' using the
# Bubble Sort algorithm: the inner loop compares the neighbors at
# indices j and j + 1 and swaps them when they are out of order
for i in range (sizeOfList - 1):
    for j in range (sizeOfList - 1 - i):
        if (sortedList[j] > sortedList[j + 1]):
            temp = sortedList[j]
            sortedList[j] = sortedList[j + 1]
            sortedList[j + 1] = temp

# Use two 'for' loops to successively display the two lists
print('The original list is: ', end = '')
for i in range (sizeOfList):
    print(originalList[i], '', end = '')

print('\nThe sorted list is: ', end = '')
for i in range (sizeOfList):
    print(sortedList[i], '', end = '')
Output 2.9.3.b:
Total number of integers in the list? 3
Add an integer to the list: 2
Add an integer to the list: 1
Add an integer to the list: 4
The original list is: 2 1 4
The sorted list is: 1 2 4
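The text describes this script as a simplified implementation. One common refinement, sketched below under the assumption that the sort is wrapped in a function (the name bubbleSort and the swapped flag are illustrative, not from the handbook), is to stop as soon as a full pass makes no swaps. It relies on the break statement, which is covered in the next section:

```python
def bubbleSort(values):
    sortedValues = list(values)       # work on a copy; the input is untouched
    size = len(sortedValues)
    for i in range(size - 1):
        swapped = False
        # After pass i, the last i elements are already in their final place
        for j in range(size - 1 - i):
            if sortedValues[j] > sortedValues[j + 1]:
                temp = sortedValues[j]
                sortedValues[j] = sortedValues[j + 1]
                sortedValues[j + 1] = temp
                swapped = True
        if not swapped:               # no swaps: the list is already sorted
            break
    return sortedValues

print(bubbleSort([2, 1, 4]))    # [1, 2, 4]
print(bubbleSort([3, 2, 1]))    # [1, 2, 3]
```

Returning a new list also removes the need for a separate copying loop in the calling code.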
2.9.4 The break and continue Statements
Observation 2.22 – break and continue: Use the break statement combined with a selection statement in a loop, to permanently interrupt loop execution. Use the continue statement combined with a selection statement in a loop to skip the current iteration.

Another common use of nested loops is related to the implementation of algorithms for the solution of mathematical problems. The following script presents an implementation of a program calculating the prime numbers. In this particular case, the user is prompted to enter the last integer of the range in which the program should look for prime numbers. Next, a for loop nested inside a while loop determines whether each integer in that range is a prime number or not.
The script introduces the break statement, which forces the interpreter to skip all the remaining statements and iterations, and exit the current loop. As shown in the script, break is generally combined with a selection statement:
# Use a 'for' loop nested inside a 'while' loop to find prime numbers.
# Variable 'endInteger' stores the last integer of the sequence
endInteger = int(input('Enter the last integer '
                       'of the sequence of prime numbers: '))

# Print the first prime number (2); note that 1 is not a prime number.
# This is subsequently followed by the rest of the sequence on the same line
print('2 ', end = '')

# The 'counter' variable holds the number being evaluated; 'flag' stays
# True for as long as no divisor of 'counter' has been found
counter, flag = 3, True

# 'while' loop controls the counter variable used for evaluation
while (counter <= endInteger):
    # 'for': check current 'counter' value against the integers
    # from 2 up to itself to determine if it is a prime number
    for i in range (2, counter):
        if ((counter % i) == 0):
            flag = False
            break
    if (flag):
        print(counter, '', end = '')
    flag = True
    counter += 1
Output 2.9.4.a:
Enter the last integer of the sequence of prime numbers: 100
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
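The same nested-loop idea can be wrapped in a function that returns the prime numbers as a list. The sketch below is an illustrative variant rather than the handbook's script: the name primesUpTo is an assumption, it treats 2 as the first prime (1 is conventionally not counted as a prime number), and it stops testing divisors once i * i exceeds the candidate, since any larger divisor would be paired with a smaller one that has already been tried:

```python
def primesUpTo(endInteger):
    primes = []
    for candidate in range(2, endInteger + 1):
        isPrime = True
        i = 2
        while i * i <= candidate:
            if candidate % i == 0:
                isPrime = False
                break             # a divisor was found; stop testing
            i += 1
        if isPrime:
            primes.append(candidate)
    return primes

print(primesUpTo(30))    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Here break only leaves the inner while loop; the outer for loop carries on with the next candidate.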
The following example provides a more direct demonstration of how the break statement is used.
The code instructs the interpreter to read from a non-primitive data type (a tuple), but breaks just after
reading its first element:
# Declare variable 'myFriends' and populate with a tuple of names
myFriends = ('Ahmed', 'John', 'Emma', 'Hind')

# Use a 'for' loop to read the elements of the tuple
for i in range (4):
    # Use an 'if' statement to stop reading the tuple once
    # the second element (i.e., index 1) is reached
    if (i == 1):
        break
    print('Happy new year:', myFriends[i])

print('Done')
Output 2.9.4.b:
Happy new year: Ahmed
Done
Another statement that is commonly used in loops, and particularly in nested loops, is the continue
statement. It is used when there is a need to skip one or more particular iterations, and continue with
the rest of the program. It is worth noting that this statement is frequently combined with selection statements. The main difference between the continue and the break statements is that
the former stops the active iteration without completely interrupting the loop. The following script
demonstrates the use of the continue statement:
# Declare variable 'myFriends' and populate with a tuple of names
myFriends = ('Ahmed', 'John', 'Emma', 'Rania')

# Use a 'for' loop to read the elements of the tuple
for i in range (4):
    # Use an 'if' statement to skip the second element
    # (i.e., the element with index 1)
    if (i == 1):
        continue
    print('Happy new year:', myFriends[i])

print('Done.')
Output 2.9.4.c:
Happy new year: Ahmed
Happy new year: Emma
Happy new year: Rania
Done.
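The two statements can also appear in the same loop. The following sketch is an illustrative example with made-up data (the readings list and the rule that 0 acts as an exit value are assumptions): negative values are skipped with continue, while the exit value ends the loop with break:

```python
readings = [5, -2, 8, -1, 0, 7]    # 0 acts as the exit value
total = 0
for value in readings:
    if value == 0:
        break        # leave the loop entirely; 7 is never reached
    if value < 0:
        continue     # skip this iteration only and move to the next value
    total += value

print('Sum of positive readings before the exit value:', total)    # 13
```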
2.9.5 Using Loops with the Turtle Library
In addition to a multitude of other uses, loops are also convenient when using code for drawing
shapes. Among the most important programming tools for such tasks is the Turtle library. The following script provides an example of how to draw a basic shape of four squares (with sides of 100 pixels). The reader should note the use of the forward(length) function of the t object (an alias for the turtle
module), which draws a straight line of 100 pixels. Next, the script uses the left(degrees) function
on the t object to turn the drawing pen 90 degrees left and repeat the 100-pixel drawing. At the end
of the script it is necessary to call the mainloop() function on the t object, which starts the event loop and keeps the drawing window open until the user closes it. The output of this example shows the four squares drawn as a
result of the for loop:
# Import the 'turtle' library
import turtle as t

# Use a 'for' loop to draw 4 squares with sides of 100 pixels
for i in range (4):
    t.forward(100)
    t.left(90)
    t.forward(100)
    t.left(90)
    t.forward(100)
    t.left(90)
    t.forward(100)

# Use the mainloop() function of the 'turtle' class
t.mainloop()
Output 2.9.5.a:
Nested loops can also be used with Turtle to draw more complex shapes. The following script demonstrates this by building on the previous example and repeating the drawing process
three times with the use of a nested loop. After each repetition, the pen is rotated
by 30 degrees to the left, so that the next group of squares is drawn at an angle:
# Import the 'turtle' library
import turtle as t

# Nested 'for' to draw a complex of squares with sides of 100 pixels
for i in range (3):
    for j in range (4):
        t.forward(100)
        t.left(90)
        t.forward(100)
        t.left(90)
        t.forward(100)
        t.left(90)
        t.forward(100)
    t.left(30)

# Use the mainloop() function of the 'turtle' class
t.mainloop()
Output 2.9.5.b:
The Turtle library comes with a rich set of functions that support a large variety of drawing
tasks. Table 2.9 provides a sample based on this set, including some of the most important of its
functions.
TABLE 2.9
Methods Available in the Turtle Class
Method or Command | Required Parameters | Description
forward | Length in pixels | Moves the Turtle pen forward by the specified amount
backward | Length in pixels | Moves the Turtle pen backward by the specified amount
right | Angle in degrees | Turns the Turtle pen a number of degrees clockwise
left | Angle in degrees | Turns the Turtle pen a number of degrees counter-clockwise
penup | None | Picks up the Turtle pen
pendown | None | Puts down the Turtle pen to start drawing
pensize | Thickness of pen | Sets the thickness of the Turtle pen
color, pencolor | Color name | Changes the color of the Turtle pen
fillcolor | Color name | Changes the fill color for the drawing
begin_fill, end_fill | None | Mark the start and the end of the shape that is filled using the color set by the fillcolor() method
setposition | x, y coordinates | Sets the current position (equivalent to goto)
goto | x, y coordinates | Moves the Turtle pen to coordinate position x, y
shape | Shape name | Can accept values such as 'arrow', 'classic', 'turtle', or 'circle'
speed | Speed value (0–10) | Dictates the speed of the Turtle pen: 1 is the slowest, 10 is fast, and 0 is the fastest (no animation)
circle | Radius, arc, steps | Draws a circle counter-clockwise with a pre-set radius. If arc is used, it will draw an arc from 0 up to a given number in degrees. If steps is used, it will draw the shape in pieces resembling a polygon
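The repeated forward/left calls used in the two Turtle scripts of this section can themselves be folded into a loop. The sketch below is an illustrative restructuring, not the handbook's script: the function names drawSquare and drawRosette and the pen parameter are assumptions, and each square uses four turns instead of three, so the resulting picture differs from Outputs 2.9.5.a and 2.9.5.b. The functions only rely on the forward and left methods listed in Table 2.9:

```python
def drawSquare(pen, side):
    # Four sides and four 90-degree turns bring the pen back
    # to its starting position and heading
    for _ in range(4):
        pen.forward(side)
        pen.left(90)

def drawRosette(pen, squares, side):
    # Rotate by an equal share of 360 degrees after every square
    for _ in range(squares):
        drawSquare(pen, side)
        pen.left(360 / squares)
```

To draw on screen, the turtle module itself can be passed as the pen, for example by running import turtle as t, then drawRosette(t, 12, 100), followed by t.mainloop().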
2.10 FUNCTIONS
Observation 2.23 – Function: A defined structure of statements that can be called repeatedly. It has a unique name, and may take arguments and/or return values to the caller.

A function is a block of statements that performs a specific task. It allows the programmer to reuse parts of their code, promoting the concept of modularity. The main idea behind this approach is to divide a large block of code into smaller, and thus more manageable, sub-blocks. There are two types of functions in Python:
• Built-in: The programmer can use these functions
in the program without defining them. Several functions of this type were used in the previous sections (e.g., print() and input()).
• User-defined: Python allows programmers to create their own functions. The following
section focuses on this particular function type.
2.10.1 Function Definition
The main rules for defining functions in Python are the following:
• The function block begins with the keyword def, followed by the function name and parentheses. Note that, as Python is case-sensitive, the programmer must use def instead of Def.
• Similar to variable names, function names can include letters or numbers, but no spaces or special characters (other than the underscore), and cannot begin with a number.
• Optional input parameters, called arguments, should be placed within the parentheses. It is also possible to assign default values to the parameters inside the parentheses.
• The block of statements within a function starts with a colon and is indented.
• A function that returns data must include the keyword return in its block of code.

Observation 2.24 – Four Types of Functions:
1. No arguments, no return value.
2. With arguments, no return value.
3. No arguments, with return value.
4. With arguments, with return value.
The syntax for a function declaration is as follows:
def functionName (var1, var2, … etc.):
    Statements
Depending on the presence or absence of arguments, and on the presence of input and/or return
values, functions can be classified under four possible types. These types are presented in detail in
the following section.
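Before looking at the four types, the definition rules can be seen together in one short sketch. The function name greet and its parameters are hypothetical, chosen only for illustration; the default value for greeting and the keyword-style call are standard Python features that go slightly beyond the syntax shown above:

```python
# 'def', a valid name, parameters in parentheses, a colon,
# an indented block, and a 'return' statement
def greet(name, greeting = 'Hello'):
    return greeting + ', ' + name + '!'

print(greet('Alex'))                             # Hello, Alex!
print(greet('Alex', 'Good morning'))             # Good morning, Alex!
print(greet(greeting = 'Hi', name = 'Rania'))    # Hi, Rania!
```

When the second argument is omitted, the default value is used; naming the arguments in the call allows them to be given in any order.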
2.10.2 No Arguments, No Return
This is a type in which the function does not accept variables as arguments, and does not return any
data. This is demonstrated in the following script that merely prints a predefined string on screen.
The reader should note that there are no arguments inside the parentheses and no return statement
inside the block of statements. The structure simply invokes the print() function displaying the
desired message. Invoking such a function inside the main program is a rather simple and straightforward task:
# Define function that neither accepts arguments nor returns values
def printSomething():
    print('Hello world')

# Call the function from the main program
printSomething()
Output 2.10.2:
Hello world
2.10.3 With Arguments, No Return
Another type of function is one in which the function accepts variables as arguments, but does not
return any data. In the following script, the function is invoked by writing its name while also
including a number of values in the parentheses. These values are passed to the main body of the
function, and can be treated as normal variables:
# Define a function that accepts arguments but does not return values
def printMyName(fName, lName):
    print('Your name is:', fName, lName)

# Prompt user to input their name
firstName = input('Enter your first name: ')
lastName = input('Enter your last name: ')

# Call the function from the main program
printMyName(firstName, lastName)
Output 2.10.3:
Enter your first name: Alex
Enter your last name: Fora
Your name is: Alex Fora
2.10.4 No Arguments, With Return
The third type involves a function that does not accept arguments, but returns data. It is important
to remember that since this type of function returns a value to the calling code, this value should be
assigned to a variable (or used directly in an expression) if it is to be used or processed later:
# Define a function that does not accept arguments but returns values
def returnFloatNumber():
    inputFloat = float(input('Enter a real number '
                             'to return to the main program: '))
    return inputFloat

# Call the function from the main program to display the input
x = returnFloatNumber()
print('You entered:', x)
Output 2.10.4:
Enter a real number to return to the main program: 5.7
You entered: 5.7
2.10.5 With Arguments, With Return
The fourth type involves a function that both accepts arguments and returns values back to the calling code. The following script demonstrates this. In this case, the call of the function must include
a list of arguments and assign the return value to a specific variable for later use:
# Function accepts arguments & returns values to the caller
def calculateSum(number1, number2):
    print('Calculate the sum of the two numbers.')
    return(number1 + number2)

# Accept two real numbers from the user
num1 = float(input('Enter the first number: '))
num2 = float(input('Enter the second number: '))

# Call the function to calculate the sum for the two numbers
addNumbers = calculateSum(num1, num2)

# Print the sum for the numbers
print('The sum for the two numbers is:', addNumbers)
Output 2.10.5:
Enter the first number: 3
Enter the second number: 5
Calculate the sum of the two numbers.
The sum for the two numbers is: 8.0
2.10.6 Function Parameter Passing
There are two different ways to pass parameters to functions. Determining which of the two should
be chosen depends on whether the value of the original variables should be changed within the
function or not. These two ways for passing parameter values to a function are commonly referred
to as call/pass by value and call/pass by reference.
2.10.6.1 Call/Pass by Value
Observation 2.25 – Passing Values to Arguments:
1. By Value: Argument is a copy of the original variable, which remains unchanged.
2. By Reference: Changes apply directly to the original variable, thus, changing its value.

In this case, the value of the argument (parameter) is processed as if it were a copy of the original variable. Hence, the original variable in the caller’s scope will be unchanged when program control returns to the caller. In Python, if immutable parameters (e.g., integers and strings) are passed to a function, the call behaves as a call/pass by value. The example below illustrates such a case by introducing the id() function. It accepts an object as a parameter (i.e., id(object)) and returns the identity of this particular object. The return value of id() is an integer, which is unique and constant for this object during its lifetime. As shown in the example, the id of variable x before calling the checkParameterID function is 140715021772880 (the actual number varies between runs and machines). It should be noted that the id of x is not changed within the function as long as the value of x is not updated. However, once the value is updated to 20, the name x inside the function refers to a different object, and its id changes to 140715021773200. The most important thing to note is that the id of x does not change after calling the function, and that both its original id (140715021772880) and its original value (10) are maintained. That means that the change of the value of x was applied only within the scope of the function, and not to the original variable within the caller’s scope:
# Define function 'checkParameterID' that accepts a parameter (by value)
def checkParameterID(x):
    print('The value of x inside checkParameterID',
          'before value change is', x, '\nand its id is', id(x))

    # Change the value of parameter 'x' within the scope of the function
    x = 20
    print('The value of x inside checkParameterID',
          'after value change is', x, '\nand its id is', id(x))

# Declare variable 'x' in the main program and assign initial value
x = 10

print('The value of x before calling the function',
      'checkParameterID is', x, '\nand its id is', id(x))

# Call function 'checkParameterID'
checkParameterID(x)

# Display info about 'x' in the main program after function call
print('The value of x after calling the function checkParameterID',
      'is', x, '\nand its id is', id(x))
Output 2.10.6.a:
The value of x before calling the function checkParameterID is 10
and its id is 140715021772880
The value of x inside checkParameterID before value change is 10
and its id is 140715021772880
The value of x inside checkParameterID after value change is 20
and its id is 140715021773200
The value of x after calling the function checkParameterID is 10
and its id is 140715021772880
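The identities in the output can also be compared programmatically instead of by eye. The sketch below is an illustrative companion to the script above (the function name rebind and its return values are assumptions). It suggests a slightly more precise reading of the result: the function initially receives the very same object as the caller, and it is the assignment inside the function that makes the local name refer to a new object:

```python
def rebind(value):
    idOnEntry = id(value)
    value = value + 10      # creates a new integer and rebinds the local name only
    return idOnEntry, id(value), value

x = 10
idOnEntry, idAfterChange, newValue = rebind(x)

print(idOnEntry == id(x))        # True: the same object was received
print(idAfterChange == id(x))    # False: the local name now refers to another object
print(x, newValue)               # 10 20
```

Because integers are immutable, the caller's variable cannot be affected by such an assignment, which is why the behavior matches call/pass by value.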
2.10.6.2 Call/Pass by Reference
In this case, the function gets a reference to the argument (i.e., the original variable) rather than a
copy of it. The value of the original variable in the caller’s scope will be modified if a change occurs
within the function. In Python, if mutable parameters (e.g., a list) are passed to a function, the call/
pass is by reference. As shown below, updateList appends a value of 5 to the list named y. The
fact that the value of the original mutable variable x changes demonstrates the functionality of argument call/pass by reference:
# Define function 'updateList' that changes values within the list
def updateList(y):
    # append() modifies the list in place and returns None,
    # so its result must not be assigned back to 'y'
    y.append(5)
    return y

# Declare list 'x' with 4 elements and assign values
x = [1, 2, 3, 4]
print('The content of x before calling the function updateList is:', x)

# Call function 'updateList'
print('Call the function updateList')
updateList(x)
print('The content of x after calling the function updateList is:', x)
Output 2.10.6.b:
The content of x before calling the function updateList is: [1, 2, 3, 4]
Call the function updateList
The content of x after calling the function updateList is: [1, 2, 3, 4, 5]
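When the caller's list must stay unchanged, the function can work on a copy instead. The following sketch contrasts the two behaviors; the function names appendInPlace and appendToCopy are hypothetical and used only for illustration:

```python
def appendInPlace(items):
    items.append(5)         # modifies the caller's list

def appendToCopy(items):
    copied = list(items)    # shallow copy; the caller's list is untouched
    copied.append(5)
    return copied

x = [1, 2, 3, 4]
y = appendToCopy(x)
print(x, y)                 # [1, 2, 3, 4] [1, 2, 3, 4, 5]

appendInPlace(x)
print(x)                    # [1, 2, 3, 4, 5]
```

Note that list() makes a shallow copy: it is sufficient for a list of integers, but nested lists inside it would still be shared with the original.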
2.11 CASE STUDY
Write a Python application that displays the following menu and runs the associated functions
based on the user’s input:
• Body mass index calculator.
• Check customer credit.
• Check a five-digit integer for palindrome.
• Convert an integer to the binary system.
• Initialize a list of integers and sort it.
• Exit.
Specifics on the components of the application:
• Body Mass Index Calculator: Read the user’s weight in kilos and height in meters, and
calculate and display the user’s body mass index. The formula is: BMI = (weightKilos)/
(heightMeters × heightMeters). If the BMI value is less than 18.5, display the message
“Underweight: less than 18.5”. If it is between 18.5 and 24.9, display the message “Normal:
between 18.5 and 24.9”. If it is between 25 and 29.9, display the message “Overweight:
between 25 and 29.9”. Finally, if it is 30 or more, display the message “Obese: 30 or
greater”.
• Check Department-Store Customer Balance: Determine if a department-store customer
has exceeded the credit limit on a charge account. For each customer, the following facts
are to be entered by the user:
• Account number.
• Balance at the beginning of the month.
• Total of all items charged by the customer this month.
• Total of all credits applied to the customer’s account this month.
• Allowed credit limit.
The program should accept input for each of the above from the user as integers, calculate the new
balance (= beginning balance + charges − credits), display the new balance, and determine
if the new balance exceeds the customer’s credit limit. For customers whose credit limit is
exceeded, the program should display the message “Credit limit exceeded”.
• Check a Five-Digit Integer for Palindrome: A palindrome is a number or a text phrase that reads the same backward as forward (e.g.,
12321, 55555). Write an application that reads a five-digit integer and determines whether
or not it is a palindrome. If the number is not five digits long, display an error message
indicating the issue to the user. When the user dismisses the error dialog, allow them to
enter a new value.
• Convert Decimal to Binary: Accept an integer between 0 and 99 and print its binary
equivalent. Use the modulus and division operations, as necessary.
• List Manipulation and Bubble Sort: Write a script that does the following:
a. Initialize a list of integers of a maximum size, where the maximum value is entered by
the user.
b. Prompt the user to select between automatic or manual entry of integers to the list.
c. Fill the list with values either automatically or manually, depending on the user’s
selection.
d. Sort the list using Bubble Sort.
e. Display the list if it has fewer than 100 elements.
The above should be implemented using a single Python script. Avoid adding statements
in the main body of the script unless necessary. Try to use functions to run the various
tasks of the application. Have the application/menu run continuously until the user enters
the value associated with exiting.
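One possible way to organize such a menu is sketched below. It is only a structural outline under several assumptions, not a solution to the case study: the function names, the option numbers, and the placeholder return strings are all hypothetical, only two of the five tasks are stubbed, and a list of choices stands in for repeated input() calls so that the sketch is self-contained:

```python
# Placeholder functions; each would hold the code of one menu task
def bodyMassIndex():
    return 'BMI calculator selected'

def convertToBinary():
    return 'Binary converter selected'

# Map each menu option to the function that handles it
menuActions = {'1': bodyMassIndex, '4': convertToBinary}

def runMenu(choices):
    log = []
    for choice in choices:
        if choice == '6':
            log.append('Exit')
            break                   # the exit option ends the menu loop
        action = menuActions.get(choice)
        if action is None:
            log.append('Invalid option: ' + choice)
            continue                # ask again without running any task
        log.append(action())
    return log

print(runMenu(['1', '9', '4', '6']))
```

In an interactive version, the for loop over choices would typically become a while loop that reads each choice with input() until the exit option is entered.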
2.12 EXERCISES
2.12.1 Sequence and Selection
1. Write a script that displays numbers 1–4 on the same line and in one output, separated by
one space.
2. Write a script that accepts three integers and calculates and displays their sum, average,
product, lowest, and highest.
3. Write a script that accepts five integers and prints how many of them are odd and even.
(Hint: An even number leaves a remainder of zero when divided by 2. Use the modulus
operator.)
4. Write a script that accepts five numbers and calculates and prints the number of negatives,
positives, and zeros.
5. Write a script that accepts two integers and determines and prints whether the first is a
multiple of the second.
6. Write a script that accepts one number consisting of five digits, separates the number into
the individual digits, and prints each digit separated by three spaces from each other. (Hint:
use both division and modulus operations to break down the number.)
7. Write a script that accepts the radius of a circle as an integer and prints the circle’s diameter, circumference, and area. (Hint: Use the constant value 3.14159 for π. Calculate the
diameter as radius*2, the circumference as 2π*radius, and the area as π*radius^2.)
8. Write a script that accepts the first and the last name from the user as two separate inputs,
concatenates them separated by one space character, and displays the result.
9. Write a script that accepts a character and displays it in the ASCII format. (Hint: use the
ord() function.)
10. Write a script that accepts an ASCII value between 50 and 255 and displays its character.
(Hint: use the chr() function.)
2.12.2 Iterations – while Loops
1. Drivers are concerned with the accumulated mileage of their automobiles. One particular
driver has been monitoring trips by recording miles driven and petrol gallons used. Write
a script that uses a while statement to accept the miles and petrol gallons used for each
trip. The script should calculate and display the miles per gallon obtained for each trip, and
the combined, total miles per gallon obtained up to date.
2. Write a script that accepts integers within the range of 1–30. For each number entry, the
script should print a line containing adjacent asterisks of the same number (e.g., for number
7 it should display: “7: *******”). The script should run until the user enters a predefined
exit value.
3. A company pays its employees partially based on commissions. The employees receive
$200 per week, plus 9% of their gross sales for the week. Write a script that accepts the
items sold for a week by a single employee and calculates and displays their earnings.
There is no limit to the number of items that can be sold by an employee.
4. Write a script that uses a while statement to determine and print the largest number entered
by the user. The user is allowed to enter numbers until a predefined exit value is entered.
5. Write a script that uses a while statement and the tab escape sequence (\t) to print the tabular form of: a number, its multiple by 2, its multiple by 10, the square, and its cube number.
6. An Armstrong number is equal to the sum of its digits, each raised to the power of the total number of
digits. Therefore, for a three-digit Armstrong number, the sum of the cubes of its
digits should equal the number itself (e.g., 153 = 1 ^ 3 + 5 ^ 3 + 3 ^ 3 = 1 + 125 + 27 = 153).
Based on the above, write a script that displays all three-digit Armstrong numbers between
130 and 140, as well as their breakdown.
7. The factorial of a non-negative integer is written as n! and is defined as n! =
n*(n−1)*(n−2)*…*1 for values of n greater than or equal to 1, and as n! = 1 for n = 0. Write
a script that accepts a non-negative integer and computes and prints its factorial.
8. Write a script that converts Celsius temperatures to Fahrenheit. The program should print
a table displaying all the Celsius temperatures and their Fahrenheit equivalents. (Hint: the
formula for the conversion is: F = (9/5) × C + 32.)
9. A company wants to send data over the Internet and has requested a script that will encrypt
this data. The desired encryption function is the following: each digit should be replaced
by a value calculated by adding 7 to it and getting the remainder after dividing the new
value by 10. Next, the first digit should be swapped with the third and the second with the
fourth. The program should print the resulting encrypted integer.
10. Write a script that reads an encrypted four-digit integer, decrypts it by reversing the encryption scheme of the previous exercise, and prints the result.
2.12.3 Iterations – for Loops
1. Write a script that uses a for statement to display the following patterns:
(a)
*
**
***
****
*****
******
*******
********
(b)
**********
*********
********
*******
******
*****
****
***
(c)
**********
*********
********
*******
******
*****
****
***
(d)
*
**
***
****
*****
******
*******
********
2. Write a script that prompts the user to enter a number of integer values and calculates their
average. Use a for statement to receive and add up the sequence of integers, based on
user input.
3. A mail order house sells five different products with the following codes and retail prices:
001 = $2.98, 002 = $4.50, 003 = $9.98, 004 = $4.49, and 005 = $6.87. Write a script that
accepts the following two values from the user: product number and quantity sold. This
process must be repeated as long as the user enters a valid code. The script should use a
mapping technique to determine the retail price for each product. Finally, the script should
calculate and display the total value of all products sold.
2.12.4 Methods
1. Write a script that uses methods to do the following: (a) continuously accept integers into
a two-dimensional list of integers until the user enters an exit value (e.g., 0), (b) find and
display the min value for each row and/or column of the list and of the whole list, (c) find
and display the max value for each row and/or column of the list and of the whole list,
and (d) find and display the average value for each row and/or column of the list and of the
whole list.
2. Write a script that uses methods to continuously accept the following details for a series of
books: ISBN number, title, author, publication date, and publication company. The details
of each book must be stored in five lists associated with the book information categories.
The script should accept books until the user enters an ISBN number of 0. Before exiting,
the script must print the details of the books.
3. Write a script that uses different methods to print a box, an oval, an arrow, and a diamond
on screen. Use the Turtle library for this purpose.
4. Using the Olympic Games logo as a reference, write a Python script that uses the Turtle
library and appropriate methods to draw the logo rings, matching the color order and position.
5. Using only the Turtle library methods fillcolor(), begin_fill(), end_fill(),
color(), penup(), pendown(), and goto(), write a Python script that uses
various methods to draw Figure Exercise 5.
6. Write a Python script that uses appropriate methods and the Turtle library to draw a regular polygon of N sides. The script should use a method to prompt the user to enter the
number of sides (N). (Hint: a regular polygon of N sides is the combination of N equilateral
triangles.) The figure drawn should look like Figure Exercise 6.
Figure Exercise 5. Figure Exercise 6.
REFERENCES
Dijkstra, E. W. (1976). A Discipline of Programming. Englewood Cliffs, NJ: Prentice-Hall.
Jaiswal, S. (2017). Python Data Structures Tutorial. DataCamp. https://www.datacamp.com/community/tutorials/data-structures-python.
Knuth, D. E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.
3 Object-Oriented Programming in Python
Ghazala Bilquise and Thaeer Kobbaey
Higher Colleges of Technology
Ourania K. Xanthidou
Brunel University London
CONTENTS
3.1 Introduction
3.2 Classes and Objects in Python
3.2.1 Instantiating Objects
3.2.2 Object Data (Attributes)
3.2.2.1 Instance Attributes
3.2.2.2 Class Attributes
3.2.3 Object Behavior (Methods)
3.2.3.1 Instance Methods
3.2.3.2 Constructor Methods
3.2.3.3 Destructor Method
3.3 Encapsulation
3.3.1 Access Modifiers in Python
3.3.2 Getters and Setters
3.3.3 Validating Inputs before Setting
3.3.4 Creating Read-Only Attributes
3.3.5 The property() Method
3.3.6 The @property Decorator
3.4 Inheritance
3.4.1 Inheritance in Python
3.4.1.1 Customizing the Sub Class
3.4.2 Method Overriding
3.4.2.1 Overriding the Constructor Method
3.4.3 Multiple Inheritance
3.5 Polymorphism – Method Overloading
3.5.1 Method Overloading through Optional Parameters in Python
3.6 Overloading Operators
3.6.1 Overloading Built-In Methods
3.7 Abstract Classes and Interfaces in Python
3.7.1 Interfaces
3.8 Modules and Packages in Python
3.8.1 The import Statement
3.8.2 The from…import Statement
3.8.3 Packages
3.8.4 Using Modules to Store Abstract Classes
3.9 Exception Handling
3.9.1 Handling Exceptions in Python
3.9.1.1 Handling Specific Exceptions
3.9.2 Raising Exceptions
3.9.3 User-Defined Exceptions in Python
3.10 Case Study
3.11 Exercises
DOI: 10.1201/9781003139010-3
3.1 INTRODUCTION
The Object-Oriented Programming (OOP) paradigm is a powerful approach to problem
solving that is built on programming components called classes, and on the
objects created from these classes. This approach aims to create an environment that reflects
structures from the real world. Within the OOP paradigm, variables, and the associated
data and methods (see: Chapter 2), are logically grouped into reusable objects belonging to a parent
class. This enables a modular approach to programming. Among the most significant benefits of
developing software using this paradigm are that the resulting code is easier to implement, interpret, and maintain.
OOP is developed around two fundamental pillars of programming, and four basic principles of
how these could be used efficiently. The two pillars are the class and its objects. The four principles
are the concepts of encapsulation, abstraction, inheritance, and polymorphism. Although it is true
that various other programming techniques and approaches are also applied within the OOP paradigm, they all share the above core components and concepts.
A real-life analogy that demonstrates the class and object relationship is that of a recipe for a
cake. The recipe provides information about the ingredients and the method of baking it. Using
the recipe, several cakes may be baked. In this context, the recipe represents the class, and each
cake that is baked using the recipe represents an object. Similarly, in software development, if it is
required to store the data of numerous employees, a class that describes the general specifications
of an employee is created. This class defines what types of data are required for employees (class
properties) and what actions can be performed on the data (class methods). New employees are then
created using the class. What is important to note is that the class itself does not hold the data of any
particular employee. It is simply a template from which employee objects of the same kind are created, alongside any related
actions that can be performed on their data. The relation between these two fundamental elements
(i.e., class and objects) is illustrated in Figure 3.1.
FIGURE 3.1 Using class Employee to generate the objects Employee1 and Employee2.
In OOP terminology, the process of creating an object based on a specific class is known as
instantiation. During instantiation, the created object acquires the properties described in the class.
For example, an object named car1 may have properties like make, model, and color, while
book1 may have ISBN, title, price and publication_year. Similarly, the methods of
the object are the actions or tasks it can perform. Using the same object examples, a car may
perform actions like startEngine(), stopEngine() and moveCar(), and a book updatePrice() and calculateDiscount().
In terms of communicating complex OOP structures and ideas, programmers use the Unified
Modelling Language (UML), a tool that allows them to draw standardized diagrams that visualize
the structure of programs independently of the programming language used for the implementation. The basic building block of UML is the class diagram, a graphical representation of a class as
a rectangle with three sections, namely the class name, the class attributes, and the class methods.
The basic structure of a class diagram is illustrated in Figure 3.2, and a related example is provided
in Figure 3.3.
The top section of the class diagram contains the class name, which should adhere to the following naming conventions:
• It must be a noun.
• It must be written in singular form.
• It must start with an upper-case letter (upper camel case should be used for multiple words
in the class name).
FIGURE 3.2 Syntax of a class diagram.
FIGURE 3.3 A simple class with its attributes and methods.
The middle section of the class diagram consists of the class attributes. These should be written
using lower-case letters, with compound words separated by an underscore. Optionally, the data type
of each attribute can be specified after its name, separated by a colon.
Observation 3.1 – Camel Case: The practice of writing compound names without spaces, starting each
word with a capital letter.
The last section of the class diagram contains the operations or methods of the class. Method
names should be verbs and follow the lower camel case naming convention (i.e., the first word is
in lower case and the first letters of all subsequent words are in upper case). Similar to attributes,
the input and output parameters of the method can be specified. The input parameters are written
within the parentheses following the method name. The output parameters are specified at the end
of the method, separated by a colon.
Finally, access modifiers, represented with a plus or minus symbol, are used to specify the scope
of access of an attribute or method. The plus symbol indicates that the attribute or method is public
and can be accessed by any object of any class outside the current one, whereas the minus symbol
indicates that the method or attribute is private and can only be accessed from within the current
class or its objects.
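As a rough illustration of how the three sections of a class diagram map onto code, the sketch below translates a hypothetical diagram for a Book class into Python. The attribute and method names are assumptions chosen for illustration and are not taken from Figure 3.3. The leading double underscore is used here as the closest Python approximation of the minus (private) modifier; access modifiers in Python are discussed in Section 3.3.1.

```python
# Hypothetical class diagram (names are illustrative only):
#   Book
#   + title: str
#   - price: float
#   + updatePrice(newPrice: float): None
#   + calculateDiscount(percent: float): float

class Book:
    def __init__(self, title: str, price: float):
        self.title = title      # '+' public attribute
        self.__price = price    # '-' private attribute (name-mangled by Python)

    # Method names are verbs written in lower camel case
    def updatePrice(self, newPrice: float) -> None:
        self.__price = newPrice

    def calculateDiscount(self, percent: float) -> float:
        return self.__price * percent / 100


book1 = Book("A Discipline of Programming", 40.0)
book1.updatePrice(50.0)
discount = book1.calculateDiscount(10)
print(book1.title, discount)
```

The input parameters of the diagram appear inside the parentheses of each method, while the output parameter that follows the colon in the diagram corresponds to the return annotation after the arrow.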
This chapter covers basic concepts related to the usage of classes and objects, and the four main
principles of OOP, namely:
• Encapsulation: The process of wrapping the attributes and methods of the objects of a
class in one unit, and managing the access to these attributes and methods.
• Abstraction: The technique used to hide the implementation details of a class, by providing a more abstract view. This allows for the development of a simpler interface, by focusing on what the object does rather than how it does it.
• Inheritance: The mechanism used for the creation of a parent-child relationship between
classes, where the child (or sub) class acquires the attributes and the methods of the parent (or super) class, thus, eliminating redundant code and facilitating reusability and
maintainability.
• Polymorphism: A feature of OOP languages that enables methods to perform different
tasks based on the context of the variables used. This is achieved through designated processes like method overriding and overloading.
3.2 CLASSES AND OBJECTS IN PYTHON
Contextualizing the concepts of classes and objects and
their relationship is frequently easier through the use of
working examples. Consider the common case of developing a simple application that must store employees’
data. Every employee is likely to have an employee ID, a
first name, a last name, a basic salary, and allowances.
The first step toward the implementation of such an
application in OOP would be to define a class that holds
the appropriate, general specification for all employees.
This will be used as a blueprint to create a record for
each employee in the application.
In Python, a class is created simply by using the
class keyword followed by the name of the class. The
name must follow the same naming rules that also apply
to variables. However, for clarity purposes, it is recommended that the name of the class is capitalized using
the CapWords notation (i.e., the first letter of each word
in the class name should be capitalized).
Observation 3.2 – pass: The pass
statement does nothing when executed.
It is necessary when defining an empty class, since the body of
every class must contain at least one
statement.
Observation 3.3 – class keyword:
Create a class simply by using the
class keyword followed by the
name of the class. The class name
must adhere to the naming conventions of Python for variables and
should have the first letter in capital.
The example below creates an empty class with no attributes or methods, and thus no
functionality:
# Define a class with no functionality
class Employee:
    pass
3.2.1 Instantiating Objects
Observation 3.4 – Creating/Instantiating Objects: An object is created by using the name of the class
it belongs to, followed by parentheses.
To instantiate an object means to create a new object using a class as a template. An object is
instantiated by calling the class, i.e., writing the class name followed by parentheses, and is usually
assigned to a variable. In the script example provided below, emp1 and emp2 are instances of the
Employee class. Note that, in the output of the script, each object reserves a
different memory location, as the attributes of the two employees will be stored separately:
# Define the class
class Employee:
    pass

# Create two instances/objects based on the class
emp1 = Employee()
emp2 = Employee()

# Print the memory address of instances 'emp1' and 'emp2'
print(emp1)
print(emp2)
Output 3.2.1:
<__main__.Employee object at 0x0000026242C487F0>
<__main__.Employee object at 0x0000026242C483D0>
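To confirm that two variables refer to separate objects of the same class without reading memory addresses by eye, the built-in functions isinstance() and id() and the is operator can be used. The following is a minimal sketch; the variable alias is introduced here only for illustration, and the actual addresses returned by id() vary between runs.

```python
class Employee:
    pass

emp1 = Employee()
emp2 = Employee()
alias = emp1  # a second name for the same object, not a new object

print(isinstance(emp1, Employee))  # both objects are instances of Employee
print(emp1 is emp2)                # two separate objects
print(alias is emp1)               # two names, one object
print(id(emp1) == id(emp2))        # different identities
```

Assigning an existing object to another variable does not instantiate anything; only calling the class creates a new object.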
3.2.2 Object Data (Attributes)
Object data, also known as attributes, are stored in
variables. There are two types of attributes in a class,
namely instance and class attributes.
Observation 3.5 – Object Data
(Attributes): Data that is associated
with each instantiated object and is
unique to that object. Use the dot
notation syntax to access it (e.g.,
obj.attribute = value).
3.2.2.1 Instance Attributes
An instance attribute contains data associated with
each instantiated object, and is therefore unique to that
object. Instance attributes are created using the dot
notation syntax (obj.attribute = value) and are only accessible by the object associated
with them. In the example below, class Employee is used to instantiate objects emp1 and emp2.
These objects will store the first and last names, the basic salary, and the allowance of two different
employees.
The reader should note the use of the dot notation to assign values to the instance/object attributes, and how the print() function is used to show the first and last names of the two Employee
instances/objects:
# Define the class
class Employee:
    pass

# Create two instances/objects based on the class
emp1 = Employee()
emp2 = Employee()

# Provide attributes and assign values to the instances
emp1.firstName = "Maria"
emp1.lastName = "Rena"
emp1.basicSalary = 12000
emp1.allowance = 5000

emp2.firstName = "Alex"
emp2.lastName = "Flora"
emp2.basicSalary = 15000
emp2.allowance = 5000

# Print the objects and their attributes
print(emp1.firstName, emp1.lastName)
print(emp2.firstName, emp2.lastName)
Output 3.2.2.1:
Maria Rena
Alex Flora
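Because instance attributes are attached to individual objects, an attribute assigned to one object does not exist on another object of the same class. The sketch below illustrates this with the built-in functions hasattr() and getattr(); the fallback value of 0 is an arbitrary choice made for the example.

```python
class Employee:
    pass

emp1 = Employee()
emp2 = Employee()
emp1.allowance = 5000  # only emp1 receives this attribute

print(hasattr(emp1, "allowance"))
print(hasattr(emp2, "allowance"))

# Reading a missing instance attribute raises an AttributeError
try:
    print(emp2.allowance)
except AttributeError as error:
    print("AttributeError:", error)

# getattr() accepts a fallback value for missing attributes
print(getattr(emp2, "allowance", 0))
```

This is one reason why instance attributes are commonly initialized in a constructor (Section 3.2.3.2), so that every object of the class starts with the same set of attributes.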
3.2.2.2 Class Attributes
While instance attributes are specific to each individual
object, class attributes belong to the class itself, and are
thus shared among all instances of the class. In the following example, the class attribute bonusPercent is
defined within the scope of the Employee class. Unlike
instance attributes firstName and lastName, which
take unique values for each of the two employees
(i.e., emp1 and emp2), class attribute bonusPercent
is common to both employees:
Observation 3.6 – Class Attribute:
Data that belongs to the class and has
its values shared among each object
instantiated through the class. Define
it the same way as a simple variable.
Observation 3.7: It is recommended
to use lower-case letters when naming attributes. If an attribute name has
more than one word, use lower case
for the first word and capital first letters
for the rest, all combined in one word.
class Employee:
    # Define the class attribute
    bonusPercent = 0.2

# Define and create the 'emp1' instance
emp1 = Employee()
emp1.firstName = "Maria"
emp1.lastName = "Rena"

# Define and create the 'emp2' instance
emp2 = Employee()
emp2.firstName = "Alex"
emp2.lastName = "Flora"

# Print class attribute
print(Employee.bonusPercent)

# Each instance is associated with the same class attribute value
print(emp1.firstName, emp1.lastName, emp1.bonusPercent)
print(emp2.firstName, emp2.lastName, emp2.bonusPercent)

# Accessing the class attribute by using the class name
Employee.bonusPercent = 0.3
print(Employee.bonusPercent)

# Accessing the class attribute by using the instance name
print(emp1.bonusPercent)
print(emp2.bonusPercent)

# Accessing the dictionary of the class and its objects
print(emp1.__dict__)
print(emp2.__dict__)
print(Employee.__dict__)
Output 3.2.2.2:
0.2
Maria Rena 0.2
Alex Flora 0.2
0.3
0.3
0.3
{'firstName': 'Maria', 'lastName': 'Rena'}
{'firstName': 'Alex', 'lastName': 'Flora'}
{'__module__': '__main__', 'bonusPercent': 0.3, '__dict__': <attribute '__dict__' of 'Employee' objects>, '__weakref__': <attribute '__weakref__' of 'Employee' objects>, '__doc__': None}
In terms of declaration and value assignments, a class attribute is treated as any other regular variable within the class, in contrast to instance attributes where the dot notation is used. It is accessed
by using the name of the class to which it belongs followed by the attribute name:
<className>.<attribute_name> = value
When a class attribute is associated with an instantiated object name, Python firstly checks if that
attribute is available in that particular object, and if not, whether it is available in the associated
class or any super class the object inherits from (see Section: 3.4.1 Inheritance in Python).
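The lookup order described above has a practical consequence: assigning a value through an instance name does not change the class attribute, but creates an instance attribute of the same name that hides it for that object only. The following minimal sketch, which reuses the Employee class and the bonusPercent attribute, illustrates this behavior.

```python
class Employee:
    bonusPercent = 0.2

emp1 = Employee()
emp2 = Employee()

# Assigning through an instance creates an instance attribute on emp1 only
emp1.bonusPercent = 0.5

print(emp1.bonusPercent)      # found on the object itself
print(emp2.bonusPercent)      # falls back to the class attribute
print(Employee.bonusPercent)  # the class attribute is unchanged

# Removing the instance attribute exposes the class attribute again
del emp1.bonusPercent
print(emp1.bonusPercent)
```

To change the shared value for all objects, the assignment should be made through the class name, as in Employee.bonusPercent = 0.3.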
There is a simple way to determine whether an attribute belongs to an object or to the class used to
instantiate it. Instances of user-defined classes, as well as the classes themselves, have a special attribute
called __dict__ (i.e., dictionary), which holds the attributes that belong to that particular object. Using the
previous example, the __dict__ of emp1 and emp2 does not include the bonusPercent class attribute,
whereas the __dict__ of the Employee class does.
Observation 3.8: Access the __dict__ attribute of an object to find the attributes that belong to that
particular object.
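The check described above can be wrapped in a small helper. The function below, named owner_of purely for illustration, is a hypothetical sketch rather than part of Python; it relies on the built-in vars() function, which returns the __dict__ of the object passed to it. It inspects only the object and its immediate class, not any super classes.

```python
class Employee:
    bonusPercent = 0.2

def owner_of(obj, name):
    """Report where the attribute 'name' is stored (hypothetical helper)."""
    if name in vars(obj):
        return "instance"
    if name in vars(type(obj)):
        return "class"
    return "not found here"

emp1 = Employee()
emp1.firstName = "Maria"

print(owner_of(emp1, "firstName"))
print(owner_of(emp1, "bonusPercent"))
print(owner_of(emp1, "salary"))
```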
3.2.3 Object Behavior (Methods)
A method is a structured block of code that is associated with an object. It is defined in a class and
contains code that performs specific tasks using data from either the class itself or the
objects instantiated from the class. Methods must have a distinct name, and may or may not take
parameters or return values. Every instance method in a class must include an essential first parameter, usually
named self, that references the current object instance. It is important to note that self is not
a reserved word. Any variable name may be used to reference the object, as long as it follows the
Python variable naming rules.
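The point that self is a naming convention rather than a reserved word can be demonstrated with a short sketch. The parameter name this and the method fullName are arbitrary choices made for this illustration; in practice, using self is strongly recommended for readability.

```python
class Employee:
    # 'this' is used instead of 'self' purely to show the name is a convention
    def fullName(this):
        return this.firstName + " " + this.lastName

emp1 = Employee()
emp1.firstName = "Maria"
emp1.lastName = "Rena"

print(emp1.fullName())          # the object is passed automatically
print(Employee.fullName(emp1))  # equivalent call, object passed explicitly
```

Both calls return the same text, which shows that the first parameter simply receives the object on which the method is invoked.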
3.2.3.1 Instance Methods
Observation 3.9 – Instance Method: Defined as any other method but includes the self parameter as one of
its arguments.
An instance method, just like an instance attribute, is specific to a particular object rather than the class
used to instantiate it. It is, thus, invoked for each separate object, and uses the data of the object that
invoked it. Instance methods are defined within a class and include the mandatory self parameter. However,
passing the self parameter to the method is not required when calling the method.
In the following Python example, instance method printDetails(self) is defined in the
Employee class and called twice to print each of the two employees’ data (i.e., firstName,
lastName, and salary). It does not accept any arguments other than self, and it displays the required information utilizing the attributes of the particular object it is associated with. Instance method
calculateBonus(self, bonusPercent) collects data from the attribute of the associated
object, calculates the bonus for the employee, and returns the result, which is then displayed using print(). The reader should note that
an instance method is defined like any other method, with self as its first parameter, and is called using the dot notation on the object it is associated with:
# Define the class
class Employee:

    # Define the 'printDetails' method
    def printDetails(self):
        print("Employee Name", self.firstName, self.lastName,
              "earns", self.salary)

    # Define the 'calculateBonus' method
    def calculateBonus(self, bonusPercent):
        return self.salary * bonusPercent

# Create the two objects and print their attributes
emp1 = Employee()
emp1.firstName = "Maria"
emp1.lastName = "Rena"
emp1.salary = 15000
emp1.printDetails()
print("Bonus amount is", emp1.calculateBonus(0.2))

emp2 = Employee()
emp2.firstName = "Alex"
emp2.lastName = "Flora"
emp2.salary = 18000
emp2.printDetails()
print("Bonus amount is", emp2.calculateBonus(0.2))
Output 3.2.3.1.a:
Employee Name Maria Rena earns 15000
Bonus amount is 3000.0
Employee Name Alex Flora earns 18000
Bonus amount is 3600.0
From a structural and logical viewpoint, class and instance methods can be used strategically to
further improve the efficiency and clarity of the code. For instance, the class used in the previous
examples can be further improved by introducing the following change. Since bonusPercent is
the same for both employees, its value can be stored in a class attribute and be shared among all
the instances of the class. In this case, calling the instance method is simplified, as it is no longer
necessary to pass any parameters as method arguments. Instead, instance or class attributes can be
accessed directly, as shown in the example below:
# Define the class
class Employee:

    # Define a class attribute common for all objects
    bonusPercent = 0.2

    # Define an instance method that takes no arguments
    def calculateBonus(self):
        return self.salary * Employee.bonusPercent

# Create two objects and an instance attribute
emp1 = Employee()
emp1.salary = 15000
emp2 = Employee()
emp2.salary = 18000

# Print using the instance method and the class attribute
print("Bonus amount is", emp1.calculateBonus(),
      "calculated at", Employee.bonusPercent)
print("Bonus amount is", emp2.calculateBonus(),
      "calculated at", Employee.bonusPercent)

# Change the value of the class attribute
Employee.bonusPercent = 0.3

# Print again using the instance method and the changed class attribute
print("Bonus amount is", emp1.calculateBonus(),
      "calculated at", Employee.bonusPercent)
print("Bonus amount is", emp2.calculateBonus(),
      "calculated at", Employee.bonusPercent)
Output 3.2.3.1.b:
Bonus amount is 3000.0 calculated at 0.2
Bonus amount is 3600.0 calculated at 0.2
Bonus amount is 4500.0 calculated at 0.3
Bonus amount is 5400.0 calculated at 0.3
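The discussion above refers to class methods, although the listing only shows an instance method. As a hedged sketch of what a class method could look like in this example, the code below uses the built-in @classmethod decorator. The method name setBonusPercent is an assumption introduced for illustration and does not appear in the original listing.

```python
class Employee:
    # Class attribute common for all objects
    bonusPercent = 0.2

    # Instance method: works with the data of one object
    def calculateBonus(self):
        return self.salary * Employee.bonusPercent

    # Class method: receives the class (cls) instead of an object
    @classmethod
    def setBonusPercent(cls, newPercent):
        cls.bonusPercent = newPercent

emp1 = Employee()
emp1.salary = 15000

before = emp1.calculateBonus()
Employee.setBonusPercent(0.3)  # called on the class, affects all objects
after = emp1.calculateBonus()
print(before, after)
```

A class method is called through the class name and operates on data shared by all instances, whereas an instance method is called through an object and operates on that object's own data.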
3.2.3.2 Constructor Methods
Observation 3.10 – Constructor Method: Defined either automatically or by using the __init__() method.
It is invoked automatically when a new instance of a class is created. It can be used to initialize the
data of the new object or to perform any other task necessary. It can take arguments with or without
default values. It does not return any value.
A constructor is a special method used to initialize the data of an object. In Python, constructors are
implemented using the __init__() method. This method is automatically invoked whenever a new instance
of the class is created. If it is not explicitly defined, Python uses a default constructor with no
implementation details. It is important to note that a constructor does not return any value.
The programmer can optionally define constructors other than the default one. A user-defined constructor is
created by defining the __init__() method within the class. Like all methods in a class, it takes a self
argument that references the current object. The syntax of the __init__() method is the following:
def __init__(self[, arguments]):
User-defined constructors can be one out of three different types, depending on whether they take
arguments or not. The first is the simple constructor, which takes no arguments. The following
Python script presents such a case, where the constructor takes no arguments and prints a default
text message. Notice that every time a new object is instantiated the message is displayed:
# Define the class
class Employee:

    # Default constructor takes no arguments, prints message
    def __init__(self):
        print("Object created")

# Every time a new object is created the constructor is called and
# the message is displayed
emp1 = Employee()
emp3 = Employee()
Output 3.2.3.2.a:
Object created
Object created
The default constructor may be also used to initialize instance attributes with default values. In the
following example, when a new Employee object is created, instance attributes salary and
allowances are set to a default value of 0:
# Define the class
class Employee:

    """ Define the default constructor that takes no arguments
    but initializes the values of the instance attributes """
    def __init__(self):
        self.salary = 0
        self.allowances = 0

""" Every time a new object is created the constructor is called
and the instance attributes are set to 0 """
emp1 = Employee()
emp1.salary = 15000

""" Print the instance attributes of the objects. The default
allowances value is printed """
print(emp1.salary, emp1.allowances)

# Change the value of the allowances attribute
emp1.allowances = 3000

# Print the instance attribute of the object after the value
# of allowances is changed
print(emp1.salary, emp1.allowances)
Output 3.2.3.2.b:
15000 0
15000 3000
The second constructor type accepts parameters as arguments. It is used when initialization of the
attributes of the new object involves the assignment of specific values rather than the default ones.
To highlight this, in the following example, a list of the arguments used to initialize the attributes of
the object is provided after the mandatory self parameter:
# Define the class
class Employee:

    # Define the constructor with four arguments
    def __init__(self, first, last, salary, allowances):
        # Initialize instance attributes: use values of arguments
        self.firstName = first
        self.lastName = last
        self.salary = salary
        self.allowances = allowances

# Create a new object with specific instance attribute values
emp1 = Employee("Maria", "Rena", 15000, 3000)

# Print the object's attributes
print(emp1.firstName, emp1.lastName, emp1.salary, emp1.allowances)
Output 3.2.3.2.c:
Maria Rena 15000 3000
Python does not support method overloading: if a class defines several methods with the same name, the
last definition replaces the earlier ones. Consequently, a class cannot have multiple constructors. Additionally, if a user-defined constructor that requires arguments is provided, it is no
longer possible to create a new object without passing those arguments.
This limitation can be overcome by means of the third constructor type, which is used to accept
arguments with default values. This allows the programmer to initialize the associated object with
or without values. This constructor type is illustrated in the following example. When emp1 is
instantiated, the constructor is invoked without any parameter values. In contrast, in the case of
emp2, it is invoked with predefined parameter values, which are assigned to the respective instance
attributes. Once both objects are instantiated, the instance attributes of both emp1 and emp2 are
accessed and printed using regular dot notation:
# Define the class
class Employee:
    """ Define a constructor that takes four arguments with
    default empty values (None) if no values are passed """
    def __init__(self, first = None, last = None, salary = None,
                 allowances = None):
        if first is not None and last is not None \
                and salary is not None and allowances is not None:
            self.firstName = first
            self.lastName = last
            self.salary = salary
            self.allowances = allowances
            print("Object initialized with supplied values")
        else:
            self.salary = 0
            self.allowances = 0
            print("Object initialized with default values")
# Create a new object invoking the constructor with no parameters
emp1 = Employee()
emp1.firstName = "Alex"
emp1.lastName = "Flora"
print(emp1.firstName, emp1.lastName, emp1.salary, emp1.allowances)
# Create a new object invoking the constructor with parameters
emp2 = Employee("Maria", "Rena", 15000, 5000)
print(emp2.firstName, emp2.lastName, emp2.salary, emp2.allowances)
# Change and reprint the value of instance attribute of 'emp2'
emp2.salary = 20000
print(emp2.firstName, emp2.lastName, emp2.salary, emp2.allowances)
Output 3.2.3.2.d:
Object initialized with default values
Alex Flora 0 0
Object initialized with supplied values
Maria Rena 15000 5000
Maria Rena 20000 5000
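As a complementary sketch, the default values can also be written directly in the constructor signature, which removes the need for the if/else test. The empty-string and zero defaults below are assumptions chosen for illustration, as is the third object (emp3), which mixes positional and keyword arguments:

```python
# Define the class with default values placed directly in the signature
class Employee:
    def __init__(self, first="", last="", salary=0, allowances=0):
        self.firstName = first
        self.lastName = last
        self.salary = salary
        self.allowances = allowances

emp1 = Employee()                                # all defaults
emp2 = Employee("Maria", "Rena", 15000, 5000)    # all values supplied
emp3 = Employee("Alex", "Flora", salary=12000)   # allowances keeps its default
print(emp1.salary, emp1.allowances)
print(emp2.firstName, emp2.salary)
print(emp3.lastName, emp3.salary, emp3.allowances)
```

With this form, every object always has all four attributes, whichever arguments are supplied.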
3.2.3.3 Destructor Method
Observation 3.11 – Destructor Method: Defined by using the __del__() method. It is invoked automatically when an instance/object is about to be destroyed because it is not needed anymore. The method takes no arguments other than self, and returns no values.
Destructors are special methods invoked at the end of the lifecycle of objects, when they are about to be destroyed. In Python, destructors are implemented using the __del__() method, and are invoked when all references to an object have been deleted. The following Python script provides an example of two objects (i.e., emp1 and emp2) first being created and then destroyed:
# Define the class
class Employee:
    # Define the default constructor that only prints a message
    def __init__(self):
        print("Employee created")
    # Destructor is invoked when the object is destroyed; prints a message
    def __del__(self):
        print("Employee deleted")
# Constructor automatically invoked to create 'emp1' and 'emp2'
emp1 = Employee()
emp2 = Employee()
# Destroy objects 'emp1' and 'emp2'. Destructor method is called
del emp1
del emp2
Output 3.2.3.3:
Employee created
Employee created
Employee deleted
Employee deleted
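The statement that the destructor runs only once all references are gone can be illustrated with a minimal sketch. It assumes the standard CPython interpreter, where an object is destroyed as soon as its last reference disappears; the events list and the alias name are hypothetical helpers used only to record what happens:

```python
events = []  # hypothetical log, used instead of print for easy inspection

class Employee:
    def __del__(self):
        events.append("deleted")

emp1 = Employee()
alias = emp1                 # a second reference to the same object
del emp1                     # removes one name; the object is still alive
after_first_del = list(events)
del alias                    # last reference removed; __del__ is invoked
after_second_del = list(events)
print(after_first_del, after_second_del)
```

In other words, del removes a name rather than the object itself; the destructor is triggered only when no reference remains.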
3.3 ENCAPSULATION
Encapsulation is one of the pillars of Object-Oriented Programming. It is based on the idea of wrapping up the attributes and methods in a class and controlling how they are accessed from the objects/instances created from it. Instead of allowing unrestricted direct access, access modifiers are used to dictate and control how the instance attributes can be accessed.
Observation 3.12 – Encapsulation: Wrapping up the attributes and methods in a class and controlling access to them in the objects/instances created from it.
3.3.1 Access Modifiers in Python
Observation 3.13 – Access Modifiers: Access modifiers control how the instance attributes can be accessed. Access modifiers can be public with no special notation needed, private denoted by double underscore (__), or protected denoted by single underscore (_).
As mentioned, objects store data in attributes. Appropriate protective measures ensure that this data is accessed and modified in a controlled way. In general, OOP languages provide access modifiers that specify how an attribute or method can be accessed. There are three main types of access modifiers:
• Public: Attribute/method can be accessed by any class or program without any restrictions.
• Private: Attribute/method can be accessed only within the container class.
• Protected: Attribute/method can be accessed within the container class and its sub-classes.
By default, all attributes and methods in Python are public. Instead of using special keywords to specify whether an attribute is public, private, or protected, Python uses a special naming convention to signal the intended access level. An attribute with a single underscore prefix (_) denotes a protected attribute, while a double underscore prefix (__) denotes a private attribute. As mentioned, the absence of a prefix denotes the default, public modifier. The reader should note that the interpreter does not strictly enforce these levels: the single underscore is purely a convention, while the double underscore triggers name mangling (the attribute is stored as _ClassName__attribute), so that direct access from outside the class fails.
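The three naming conventions can be compared side by side in a short sketch. The attribute names (firstName, _salary, __bonus) are chosen here for illustration only:

```python
class Employee:
    def __init__(self):
        self.firstName = "Maria"   # public
        self._salary = 15000       # protected (by convention only)
        self.__bonus = 500         # private (name-mangled)

emp1 = Employee()
print(emp1.firstName)              # public: accessible everywhere
print(emp1._salary)                # protected: accessible, but discouraged
try:
    print(emp1.__bonus)            # private: not found under this name
except AttributeError as err:
    print("AttributeError:", err)
print(emp1._Employee__bonus)       # the mangled name still reaches the value
```

The last line shows that the private prefix guards against accidental access rather than providing strict protection.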
3.3.2 Getters and Setters
Observation 3.14 – Getters and Setters: Used to implement encapsulation. Setters are used to store data into private instance attributes whereas getters are used to read that data.
When defining a class, it is good programming practice to control the access to instance attributes by means of two special types of methods commonly referred to as getters and setters. Many OOP languages use such methods to implement the principle of encapsulation. A getter is a method that reads (gets) the value of an attribute, while a setter writes (sets) it. Using getters and setters to access object attributes ensures that the data is protected (i.e., encapsulated). The benefits of using these special methods are the following:
• Ensuring validation when reading or writing attribute data.
• Setting different access levels for the class attributes.
• Preventing direct manipulation of the attribute data.
In the Python example below, the Employee class uses setFirstName(), a setter method, to store data in a private attribute of the object (denoted by the double underscore prefix), while getter method getFirstName() is used to read and print the employee's first name. As the attribute is private, it is accessible only through the methods defined within the class. Getter and setter methods should be used for all instance attributes defined in the class. In other words, for every instance attribute, it is recommended that the associated getter and setter methods are provided. The reader should also notice the use of the self parameter with all methods, as it provides the reference to the current object being used:
# Define the class
class Employee:
    # Define the getter method to read private attribute '__first'
    def getFirstName(self):
        return self.__first
    # Setter method writes to private attribute '__first'
    def setFirstName(self, value):
        self.__first = value
# Create object emp1
emp1 = Employee()
# Use the setter to store new data in the private attribute
emp1.setFirstName("George")
# Getter reads the data from the private attribute and prints it
print(emp1.getFirstName())
Output 3.3.2:
George
In this context, if the print(emp1.getFirstName()) command is replaced by print(emp1.__first) in an attempt to access the private instance attribute directly, an AttributeError will occur.
3.3.3 Validating Inputs before Setting
As discussed, getter and setter methods shield the data values of private instance attributes. In addition, they also provide data validation functionality. As an example, if the value of private instance attribute __firstName should not exceed 15 characters in length, and __salary should be a number between 0 and 20,000, the associated validation code can be added to the setter methods of the attributes. Similarly, if it is necessary to format the output in a particular way, the associated code could be added to the getter methods. The following script provides a class example demonstrating this concept:
Observation 3.15 – Validating Data: Use getters and setters to validate data stored in the private attributes and format data appropriately before it is used as output.
# Define the class
class Employee:
    # Define a setter for private attribute '__firstName'.
    # Store the value only if it does not exceed 15 characters
    def setFirstName(self, value):
        if len(value) <= 15:
            self.__firstName = value
    # Define a getter for private attribute '__firstName'.
    # Return the data with an appropriate message (as a tuple)
    def getFirstName(self):
        return "The first name is :", self.__firstName
    # Define a setter for private attribute '__salary'.
    # Check attribute value; store it if it is between 0 and 20000
    def setSalary(self, value):
        if (value > 0 and value < 20000):
            self.__salary = value
    # Define a getter for private attribute '__salary'.
    # Return the data with an appropriate message (as a tuple)
    def getSalary(self):
        return "The salary is ", self.__salary
# Create a new object and call its setters
# to validate and store values in its attributes
emp1 = Employee()
emp1.setFirstName("John")
emp1.setSalary(17000)
# Attribute getters return stored values and associated messages
print(emp1.getFirstName(), emp1.getSalary())
# Repeat the previous tasks with an invalid first name entry.
# Notice: no change takes place in the '__firstName' attribute
emp1.setFirstName("Check to see if more than 15 characters are stored")
emp1.setSalary(19000)
print(emp1.getFirstName(), emp1.getSalary())
# Repeat the previous tasks with invalid salary entry.
# Notice: there is no change taking place in the '__salary' attribute
emp1.setFirstName("George")
emp1.setSalary(21000)
print(emp1.getFirstName(), emp1.getSalary())
Output 3.3.3:
('The first name is :', 'John') ('The salary is ', 17000)
('The first name is :', 'John') ('The salary is ', 19000)
('The first name is :', 'George') ('The salary is ', 19000)
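In the script above, invalid values are silently ignored. As an alternative sketch (a design choice assumed here, not the approach used in this chapter), the setter can raise a ValueError so that the caller is informed about the rejected value:

```python
class Employee:
    def setSalary(self, value):
        # Reject invalid input explicitly instead of ignoring it
        if not 0 < value < 20000:
            raise ValueError("salary must be between 0 and 20000")
        self.__salary = value

    def getSalary(self):
        return self.__salary

emp1 = Employee()
emp1.setSalary(17000)
try:
    emp1.setSalary(21000)
except ValueError as err:
    print("Rejected:", err)
print(emp1.getSalary())    # the previously stored value is unchanged
```

Either way, the private attribute never holds a value outside the permitted range.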
3.3.4 Creating Read-Only Attributes
Getter and setter methods may also be used to create read-only or write-only attributes. For example, attribute age may be designated as read-only, since it should be calculated using the value of attribute dateOfBirth. In this case, age will require a getter but no setter method, thus allowing the user to read the age value but not to update it.
Observation 3.16 – Creating Read-Only Attributes: Use getters with no setters to create and output the values of read-only attributes, whose data are calculated using private attributes.
In the following example, class Employee defines instance attributes for employees' first and last names, and the corresponding getter and setter methods. The class also defines attributes for the employees' emails and full names, which as read-only attributes do not have setter methods. In this case, the values of these attributes are constructed when they are being read using the getter method:
# Define the class
class Employee:
    # The getter and setter methods for the first name
    def getFirstName(self):
        return self.__first
    def setFirstName(self, value):
        self.__first = value
    # The getter and setter methods for the last name
    def getLastName(self):
        return self.__last
    def setLastName(self, value):
        self.__last = value
    # Read-only attributes with only a getter method
    def getEmail(self):
        return self.__first + "." + self.__last + "@company.com"
    def getFullName(self):
        return self.__first + " " + self.__last
# Create a new 'Employee' object
emp1 = Employee()
# Setters store values in the private instance attributes
emp1.setFirstName("George")
emp1.setLastName("Davies")
# Print the read-only attributes
print(emp1.getFullName(), emp1.getEmail())
Output 3.3.4:
George Davies George.Davies@company.com
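The age example mentioned at the start of this section can be sketched in the same style. The method names and the optional today parameter are assumptions introduced here so that the result is reproducible; by default the current date is used:

```python
from datetime import date

class Employee:
    def setDateOfBirth(self, value):
        self.__dateOfBirth = value

    def getDateOfBirth(self):
        return self.__dateOfBirth

    # Read-only: computed from '__dateOfBirth'; no setter is provided
    def getAge(self, today=None):
        today = today or date.today()
        dob = self.__dateOfBirth
        # Subtract one year if the birthday has not occurred yet this year
        not_yet = (today.month, today.day) < (dob.month, dob.day)
        return today.year - dob.year - not_yet

emp1 = Employee()
emp1.setDateOfBirth(date(1990, 6, 15))
print(emp1.getAge(today=date(2024, 6, 14)))   # day before the birthday
print(emp1.getAge(today=date(2024, 6, 15)))   # on the birthday
```

Since the age is derived every time it is read, it can never become inconsistent with the stored date of birth.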
3.3.5 The property() Method
Observation 3.17 – Property Method: Use it to encapsulate the getter and setter methods in a single interface that facilitates access to a private attribute using simply the dot notation.
In the example presented below, methods getFirstName() and setFirstName() are used to read from, and write to, private attribute __first. In order to make this particular example more user-friendly, the getter and setter methods can be called automatically when accessing the attribute using the dot notation (i.e., <obj>.<property>). The property() method provides the necessary interface by encapsulating the getter and setter methods, which are invoked when reading from, or writing to, the property. The method syntax is the following:
property_name = property(gettermethod, settermethod)
After defining the property method, the attribute is accessed using the dot notation on the property
name (<obj>.<property>) instead of invoking the getter and setter methods directly:
# Define the class
class Employee:
    # Define the getter method
    def getFirstName(self):
        return self.__first
    # Define the setter method
    def setFirstName(self, value):
        self.__first = value
    """ Use the property method to encapsulate the getter and setter
    in a single method interface """
    firstName = property(getFirstName, setFirstName)
# Create the 'emp1' object
emp1 = Employee()
""" Use dot notation to invoke the setter and getter methods through
the property interface """
emp1.firstName = "George"
print(emp1.firstName)
Output 3.3.5:
George
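If only a getter is passed to property(), the resulting property is read-only. The following minimal sketch assumes a fullName property built from two private attributes; the constructor is added here purely to keep the example short:

```python
class Employee:
    def __init__(self, first, last):
        self.__first = first
        self.__last = last

    def getFullName(self):
        return self.__first + " " + self.__last

    # Only a getter is supplied, so assignment is not permitted
    fullName = property(getFullName)

emp1 = Employee("George", "Davies")
print(emp1.fullName)
try:
    emp1.fullName = "Someone Else"
except AttributeError:
    print("fullName is read-only")
```

This combines the read-only idea of the previous section with the dot notation offered by the property interface.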
3.3.6 The @property Decorator
Observation 3.18 – The @property Decorator: It provides the functionality of the property method through the decorator syntax.
Another way to define attributes in Python is to use the @property decorator, which is built on the property() method. In the example below, @property defines the firstName attribute by using two different methods with the property name. The firstName(self) method is decorated with the @property decorator, indicating that the method is a getter. Accordingly, the firstName(self, value) method is decorated with @firstName.setter, indicating that this is a setter. With this structure in place, the attribute can be accessed by using its property name with the dot notation, without explicitly calling the getter and setter methods:
# Define the class
class Employee:
    # Use the property decorator to define the getter method
    @property
    def firstName(self):
        return self.__first
    # Use the property decorator to define the setter method
    @firstName.setter
    def firstName(self, value):
        self.__first = value
# Create the 'emp1' object
emp1 = Employee()
# Access private attribute '__first' through property name 'firstName'
emp1.firstName = "George"
print(emp1.firstName)
Output 3.3.6:
George
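The decorator form can also carry the validation logic discussed in Section 3.3.3. The sketch below assumes a salary property with the same rule (a value between 0 and 20,000); invalid assignments are ignored, as in that section:

```python
class Employee:
    @property
    def salary(self):
        return self.__salary

    @salary.setter
    def salary(self, value):
        # Store the value only if it lies between 0 and 20000
        if 0 < value < 20000:
            self.__salary = value

emp1 = Employee()
emp1.salary = 17000
emp1.salary = 21000     # rejected by the setter; the old value is kept
print(emp1.salary)
```

The caller uses plain assignment, while the class retains full control over what is actually stored.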
3.4 INHERITANCE
Observation 3.19 – Inheritance: Allows the extension of the functionality of a parent/super/base class, by creating a child/sub/derived class that inherits its attributes and behavior.
Inheritance is one of the four main principles of OOP. It allows the programmer to extend the functionality of a class by creating a parent-child relationship between classes. In such a relationship, the child (also called sub or derived class) inherits from the parent (also called super or base class). The reader should note that these terms may be used interchangeably in this chapter, based on the context of each discussion. Inheritance is extremely useful, as it facilitates code reusability, thus minimizing code and making it easier to maintain. An important concept relating to child classes is that they may have their own new attributes and methods, and can optionally override the functionality of the respective parent class.
3.4.1 Inheritance in Python
The Python syntax for implementing the concept of inheritance is the following:
class Parent:
    Parent class definition
class Child(Parent):
    Child class definition
As a practical example of inheritance, the reader can consider two classes, a super class named
Employee and a sub class named SalesEmployee (Figure 3.4). Instead of creating the general
attributes of SalesEmployee (e.g., first name, last name, salary, or allowances) from scratch, they
can be inherited from Employee. Accordingly, the sub class can also inherit the setters and getters,
and generally all the functionality of the Employee class. Additional attributes that may be unique to
SalesEmployee (e.g., commission rate) can be also added to the inherited ones, as required.
FIGURE 3.4 Parent-child relationship between classes.
The implementation of this particular example of super class Employee and sub class SalesEmployee is presented in the Python script examples below. In the first script, Employee class is defined with private attributes __first, __last, __salary, and __allowances, and method getTotalSalary(). In the second, SalesEmployee class is created as an empty class, hence the use of the pass keyword. The attributes and the method are inherited from the Employee class. Note that the name of super class Employee is passed to SalesEmployee as an argument:
# Define class 'Employee' and its private attributes and method
class Employee():
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self.__salary = salary
        self.__allowances = allowances
    def getTotalSalary(self):
        return self.__salary + self.__allowances
# Create object 'emp1' and print the total salary of the current employee
emp1 = Employee("George", "White", 16000, 5200)
print(emp1.getTotalSalary())
Output 3.4.1.a:
21200
# Define sub class 'SalesEmployee' based on super class 'Employee'
class SalesEmployee(Employee):
    pass
""" Create a new object of the sub class that inherits
attributes and behavior from the super class """
semp1 = SalesEmployee("Alex", "Flora", 12000, 4000)
print(semp1.getTotalSalary()) # Method of the superclass is invoked
Output 3.4.1.b:
16000
When the semp1 object is instantiated, Python scans SalesEmployee for an initialization method
(i.e., __init__()). If this is not found, it scans and executes the initialization method of the super
class (i.e., Employee), with the parameters associated with the current object. Similarly, when getTotalSalary() is invoked for object semp1, the method is called from the super class, since it does
not exist in the sub class. The same order of resolution is
followed for all methods and attributes in the sub class.
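The order of resolution described above can be inspected directly through the __mro__ attribute of a class. The following minimal sketch uses a hypothetical getRole() method to keep the classes short:

```python
class Employee:
    def getRole(self):
        return "employee"

class SalesEmployee(Employee):
    pass

semp1 = SalesEmployee()
# The method resolution order lists the classes searched, in order
print([cls.__name__ for cls in SalesEmployee.__mro__])
print(semp1.getRole())                 # found in the super class
print(isinstance(semp1, Employee))     # a SalesEmployee is also an Employee
```

The sub class is searched first, followed by the super class and, finally, the built-in object class from which all Python classes derive.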
3.4.1.1 Customizing the Sub Class
Observation 3.20 – Customize Sub Classes: Add attributes and/or methods to sub classes to extend their behavior beyond that of the super class. Using the added behavior on objects of the super class will raise an error. Attributes of the super class that will be used in the sub class need to be declared as protected.
As mentioned, sub classes can be further customized by adding new attributes and methods. For instance, in the case of sub class SalesEmployee this can be done by adding attribute commissionPercent. The reader should note that attempting to use the added attribute for an object that belongs to the Employee class will raise an error. This is because there is no such attribute or method in the super class. It is also worth noting that in order to be able to use super class attributes salary and allowances in the sub class, they must be declared as protected instead of private. The following scripts demonstrate these concepts:
# Define class 'Employee'
class Employee():
    """ Define the constructor of the class with parameters.
    Define the attributes of the class """
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self._salary = salary
        self._allowances = allowances
    # Define a derived attribute
    def getTotalSalary(self):
        return self._salary + self._allowances
# Define the 'SalesEmployee' sub class
class SalesEmployee(Employee):
    # Use the property decorator to define the getter method
    @property
    def commissionPercent(self):
        return self.__comm
    # Use the property decorator to define the setter method
    @commissionPercent.setter
    def commissionPercent(self, value):
        self.__comm = value
# Create and use object 'emp1' based on super class 'Employee'
emp1 = Employee("Maria", "Rena", 15000, 5000)
print(emp1.getTotalSalary())
# Create and use object 'semp1' based on sub class 'SalesEmployee'
semp1 = SalesEmployee("Alex", "Flora", 16000, 6000)
# The attribute is set in the sub class
semp1.commissionPercent = 0.05
print(semp1.commissionPercent)
""" The next line generates an error since its
attribute only exists in the sub class """
print(emp1.commissionPercent)
# Print the attributes of objects 'emp1' and 'semp1'
# (reached only if the failing line above is removed)
print(semp1.__dict__)
print(emp1.__dict__)
Output 3.4.1.1:
20000
0.05
AttributeError Traceback (most recent call last)
<ipython-input-9-0e8e58d5eaf8> in <module>
     40 """ The next line generates an error since its
     41 attribute only exists in the sub class """
---> 42 print(emp1.commissionPercent)
     43
     44 # Print the attributes of objects 'emp1' and 'semp1'
AttributeError: 'Employee' object has no attribute 'commissionPercent'
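The requirement to declare shared attributes as protected follows from name mangling: a private name is stored with the name of the class in which it is written. The sketch below, with hypothetical readProtected() and readPrivate() methods, illustrates the difference:

```python
class Employee:
    def __init__(self, salary, allowances):
        self.__salary = salary          # stored as _Employee__salary
        self._allowances = allowances   # stored under its own name

class SalesEmployee(Employee):
    def readProtected(self):
        return self._allowances         # works in the sub class

    def readPrivate(self):
        return self.__salary            # looked up as _SalesEmployee__salary

semp1 = SalesEmployee(16000, 6000)
print(semp1.readProtected())
try:
    semp1.readPrivate()
except AttributeError as err:
    print("AttributeError:", err)
```

The protected attribute is reachable from the sub class, whereas the private one is not, since the sub class looks for a differently mangled name.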
3.4.2 Method Overriding
Method overriding is another important programming feature that is common in OOP languages. It allows a sub class to contain a method with a different implementation than the one inherited from the super class. In the context of the previous examples, the programmer may wish to compute the total salary of a sales employee by adding commissions to their salary and allowances. In this case, sub class method getTotalSalary() must be implemented differently to the original one inherited from Employee. As shown in the following example, super class method getTotalSalary() is called, through super(), in the implementation of sub class method getTotalSalary():
# Define class 'Employee'
class Employee():
    # Define the constructor and the attributes of the super class
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self._salary = salary
        self._allowances = allowances
    # Define 'getTotalSalary'
    def getTotalSalary(self):
        return self._salary + self._allowances
# Define sub class 'SalesEmployee'
class SalesEmployee(Employee):
    # Use the property decorator to define the getter method
    @property
    def commissionPercent(self):
        return self.__comm
    # Use the property decorator to define the setter method
    @commissionPercent.setter
    def commissionPercent(self, value):
        self.__comm = value
    # Sub class method overrides the super class method
    def getTotalSalary(self):
        return super().getTotalSalary() + (super().getTotalSalary()
                                           * self.__comm)
# Create and use object 'emp1' based on super class 'Employee'
emp1 = Employee("Maria", "Rena", 15000, 5000)
print(emp1.getTotalSalary())
# Create and use object 'semp1' based on sub class 'SalesEmployee'
semp1 = SalesEmployee("Alex", "Flora", 16000, 6000)
# Set the attribute in the sub class
semp1.commissionPercent = 0.05
# Invoke the overridden method from the sub class
print(semp1.getTotalSalary())
Output 3.4.2:
20000
23100.0
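A practical consequence of overriding is that the same call can be applied to a mixed collection of objects, with each object using the implementation of its own class. The following sketch is simplified for illustration: the fixed commission class attribute and the staff list are assumptions, not part of the example above:

```python
class Employee:
    def __init__(self, salary, allowances):
        self._salary = salary
        self._allowances = allowances

    def getTotalSalary(self):
        return self._salary + self._allowances

class SalesEmployee(Employee):
    commission = 1000    # hypothetical fixed commission

    # Overrides the inherited method and reuses it through super()
    def getTotalSalary(self):
        return super().getTotalSalary() + self.commission

staff = [Employee(15000, 5000), SalesEmployee(16000, 6000)]
totals = [person.getTotalSalary() for person in staff]
print(totals)
```

The loop does not need to know the class of each object; the appropriate version of getTotalSalary() is selected at runtime.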
3.4.2.1 Overriding the Constructor Method
Observation 3.21 – Constructor Overriding: Call the __init__() method of the super class to access the constructor and add attributes to extend it.
The concept of method overriding is also used to create customized constructors in the sub class. In this case, the super() function is used to invoke the __init__() method of the super class, as shown in the following script:
# Define class 'Employee'
class Employee():
    # Define the constructor of the super class and its attributes
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self._salary = salary # Protected attribute
        self.__allowances = allowances
    # Define the getter of the class
    def getTotalSalary(self):
        return self._salary + self.__allowances
# Define sub class 'SalesEmployee'
class SalesEmployee(Employee):
    """ Define the constructor of the sub class adding the 'comm'
    attribute. Call the 'init' method of the super class """
    def __init__(self, first, last, salary, allowances, comm):
        super().__init__(first, last, salary, allowances)
        self.__comm = comm
    # Access protected attribute '_salary' from the sub class
    def getTotalSalary(self):
        return super().getTotalSalary() + (self._salary *
                                           self.__comm)
# Create and use object 'emp1' based on the super class
emp1 = Employee("Maria", "Rena", 15000, 5000)
print(emp1.getTotalSalary())
# Create and use object 'semp1' based on the sub class
semp1 = SalesEmployee("Alex", "Flora", 16000, 6000, 0.05)
print(semp1.getTotalSalary()) # Method of the child class is invoked
Output 3.4.2.1:
20000
22800.0
3.4.3 Multiple Inheritance
Observation 3.22 – Multiple Inheritance: The concept of having a sub class inheriting from more than one super class.
Sub classes can inherit attributes and methods from multiple super classes, a concept known as multiple inheritance. In Python, this can be implemented using the following syntax:
class Parent1:
    pass
class Parent2:
    pass
class Child(Parent1, Parent2):
    pass
As an example of multiple inheritance, Figure 3.5 presents a structure consisting of two super
classes (Person and Employee) and one sub class (Manager) that inherits from both super
classes.
FIGURE 3.5 A representation of multiple inheritance between three classes.
The following Python scripts implement this structure. The reader should note that the constructor in the Manager class calls the respective constructors of both super classes during initialization. Methods getFullName and getContact are inherited from super class Person, while
getAnnualSalary and getDepartment are inherited from Employee:
# Define the first super class ('Person')
class Person():
    # Define class constructor and attributes
    def __init__(self, firstName, lastName, contact):
        self.__firstName = firstName
        self.__lastName = lastName
        self.__contact = contact
    # Getter for the first & last name of the first super class
    def getFullName(self):
        return "Employee name is: " + self.__firstName + " " \
               + self.__lastName
    # Define the getter for the contact of the first super class
    def getContact(self):
        return "Contact number is: " + self.__contact
# Define the second super class ('Employee')
class Employee():
    # The constructor & the attributes of the second super class
    def __init__(self, salary, dept):
        self.__salary = salary
        self.__dept = dept
    # Define the getter for the salary of the second super class
    def getAnnualSalary(self):
        return "The annual salary is: " + str(self.__salary * 12)
    # The getter for the department of the 2nd super class
    def getDepartment(self):
        return "The employee belongs to the department: " + \
               self.__dept
# Define subclass 'Manager' inheriting from both 'Person' and 'Employee'
class Manager(Person, Employee):
    def __init__(self, firstName, lastName, contact, salary, dept):
        Person.__init__(self, firstName, lastName, contact)
        Employee.__init__(self, salary, dept)
# Create and use a new instance of the 'Manager' class
mgr1 = Manager("Maria", "Rena", "0123456789", 14500, "Marketing")
# Call inherited behaviour from super class 'Person'
print(mgr1.getFullName())
print(mgr1.getContact())
# Call inherited behaviour from super class 'Employee'
print(mgr1.getAnnualSalary())
print(mgr1.getDepartment())
Output 3.4.3:
Employee name is: Maria Rena
Contact number is: 0123456789
The annual salary is: 174000
The employee belongs to the department: Marketing
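When both super classes define a method with the same name, Python selects the version to use by following the order in which the super classes are listed. This can be sketched with a hypothetical describe() method defined in both parents:

```python
class Person:
    def describe(self):
        return "Person"

class Employee:
    def describe(self):
        return "Employee"

# 'Person' is listed first, so it is searched before 'Employee'
class Manager(Person, Employee):
    pass

print([cls.__name__ for cls in Manager.__mro__])
print(Manager().describe())
```

Swapping the order of the super classes in the definition of Manager would cause the Employee version of the method to be selected instead.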
3.5 POLYMORPHISM – METHOD OVERLOADING
Observation 3.23 – Polymorphism/Method Overloading: The concept of using method overloading to implement two or more methods with the same name but different signatures.
Another powerful feature of OOP languages is the support of method overloading. This is a fundamental element of polymorphism, the option of defining and using two or more methods with the same name but different parameter lists or signatures. Overloading a method improves code readability and maintainability, as implementation is divided into multiple methods instead of being concentrated into a single, complex one.
While method overloading is a prominent feature in many OOP languages, such as Java and C++, it is not directly supported in Python. Python is a dynamically typed language and datatype binding occurs at runtime. This is known as late binding and it differs from the static binding used in languages like Java and C++, in which the overloaded method to be invoked is selected at compile time based on the arguments supplied. In Python, when multiple methods with the same name are defined, the last definition overrides all previous ones. As an example, consider method calculateTotalSalary() in the Employee class. The method computes the salary of the employee without the bonus. A second method that calculates the total salary plus the bonus can be implemented with the same name, in an attempt to overload calculateTotalSalary(). In this case, the first definition is discarded and any call that matches only its signature will raise an error, as shown in the following example:
# Define class 'Employee'
class Employee:
    # Define method 'calculateTotalSalary'
    def calculateTotalSalary(self):
        return(self.salary + self.allowances)
    # Define a method overloading 'calculateTotalSalary'
    def calculateTotalSalary(self, bonus):
        return(self.salary + self.allowances) + bonus
# Create and use the 'emp1' object
emp1 = Employee()
emp1.salary = 15000
emp1.allowances = 5000
print("Total salary is ", emp1.calculateTotalSalary(2000))
# Create and use the 'emp2' object
emp2 = Employee()
emp2.salary = 18000
emp2.allowances = 4000
# This method call will generate an error
print("Total salary is ", emp2.calculateTotalSalary())
Output 3.5:
Total salary is 22000
TypeError Traceback (most recent call last)
<ipython-input-8-517bl73547e9> in <module>
     22
     23 # This method call will generate an error
---> 24 print("Total salary is ", emp2.calculateTotalSalary())
TypeError: calculateTotalSalary() missing 1 required positional argument: 'bonus'
3.5.1 Method Overloading through Optional Parameters in Python
Observation 3.24 – Method Overloading in Python: In Python, use optional method parameters to emulate the method overloading feature available in other OOP languages.
Although Python does not directly support method overloading in the same form as other OOP languages, it offers an alternative approach to achieve the same functionality. Instead of resorting to the creation of multiple methods, it allows methods to take optional parameters with default values. When a method is invoked in the code, the programmer can choose whether to provide the parameter values or not. This, in turn, dictates which block of statements would be executed within the method. Commonly, the None value is used to assign a default null value to the parameter.
In the example below, method calculateTotalSalary() is defined with optional parameter bonus. The implementation subsequently returns different values, depending on whether a value has been passed for the optional parameter. If this is not the case, the default None value is used.
class Employee:
    def calculateTotalSalary(self, bonus = None):
        # None can be tested with both 'is' and '=='; 'is' is preferred
        if bonus is None:
            return(self.salary + self.allowances)
        else:
            return(self.salary + self.allowances) + bonus
emp1 = Employee()
emp1.salary = 15000
emp1.allowances = 5000
emp2 = Employee()
emp2.salary = 18000
emp2.allowances = 4000
print("Total salary is ", emp2.calculateTotalSalary(2000))
print("Total salary is ", emp1.calculateTotalSalary())
Output 3.5.1:
Total salary is 24000
Total salary is 20000
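A common variation, sketched here under the assumption that a missing bonus should simply count as zero, gives the optional parameter a numeric default instead of None. The constructor and the default value of 0 are illustrative choices and are not part of the example above.

```python
class Employee:
    def __init__(self, salary, allowances):
        self.salary = salary
        self.allowances = allowances

    # 'bonus' is optional; when omitted it defaults to 0 (an assumption of this sketch)
    def calculateTotalSalary(self, bonus=0):
        return self.salary + self.allowances + bonus


emp1 = Employee(15000, 5000)
print(emp1.calculateTotalSalary())       # no bonus supplied
print(emp1.calculateTotalSalary(2000))   # bonus supplied
```

None remains the better default when the method must distinguish between "no bonus was given" and "a bonus of zero was given".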
3.6 OVERLOADING OPERATORS
Observation 3.25 – Operator Overloading: Apply the + and * operators on operands of different primitive data types to yield different results.
Operator overloading refers to the process of changing the default behavior of an operator based on the operands being used. A classic case of operator overloading in Python is the modification of the behavior of the addition (+) and multiplication (*) operators based on the input type. For instance, when the addition operator is used on two numbers it performs regular numerical addition, but when it is used with strings it concatenates them. Similarly, when the multiplication operator is used on numbers it multiplies them, while when it is used on a string and an integer it repeats the string. The reader should note that this fundamental operator overloading functionality works on operands of primitive data types, like in the following example:
a = 1
b = 2
print(a + b) # Adds the two numbers
print(a * b) # Multiplies the two numbers
a = 'Python'
b = ' is fun'
print(a + b) # Concatenates the two strings
print(a + b * 3) # Concatenates and repeats the string
Output 3.6.a:
3
2
Python is fun
Python is fun is fun is fun
If the addition operator is used on user-defined objects it raises a TypeError, since it does not
support the instance type, as shown below:
# Define class 'Employee'
class Employee:
    salary = 0

# Create and use two objects of the 'Employee' class
emp1 = Employee()
emp1.salary = 15000
emp2 = Employee()
emp2.salary = 22000

# Attempting the following print will generate a TypeError
print(emp1 + emp2)
Output 3.6.b:
TypeError                                 Traceback (most recent call last)
<ipython-input-11-527139aab026> in <module>
     11
     12 # Attempting the following print will generate a TypeError
---> 13 print(emp1 + emp2)
TypeError: unsupported operand type(s) for +: 'Employee' and 'Employee'
Observation 3.26 – Magic or Dunder Methods: Special methods, named with a double underscore as both a prefix and a suffix, that are invoked when the corresponding basic operator is applied. They are used to overload operators for operands of user-defined object types.
This issue can be resolved by utilizing the built-in magic or dunder methods, which are invoked by means of the respective operators. For instance, in the case of the addition operator, the class defines the associated __add__() method, which is subsequently invoked whenever the + operator is applied to its objects, as shown in the following script:
# Define class 'Employee'
class Employee:
    # Overload the + operator to add the 'salary' of two objects
    def __add__(self, other):
        return self.salary + other.salary

# Create the two objects of the 'Employee' class
emp1 = Employee()
emp1.salary = 15000
emp2 = Employee()
emp2.salary = 22000

# Invoke the overloaded + operator, which calls the '__add__' method
print(emp1 + emp2)
Output 3.6.c:
37000
In order to implement operator overloading, the programmer has to define the appropriate magic
method according to the operator in the class definition.
Tables 3.1–3.4 provide a list of magic methods corresponding to the respective binary,
comparison, unary, and assignment operators. Changing the implementation of the magic method
associated with the respective operator can provide a different meaning to that particular operator.
For example, the plus (+) operator can be used with the Employee objects to add their salaries (i.e.,
emp1 + emp2). Similarly, the less than (<) operator can be used to compare which employee was
hired first, or which is older. Conceptually, the idea is to use operator overloading in order to define
and implement the functionality of operators in a way that is logical and appropriate in the context
of the overall program structure and requirements.
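The idea of comparing employees with the less than (<) operator can be sketched as follows. The hireYear attribute, the constructor, and the sample values are assumptions made for illustration. Returning NotImplemented for foreign operand types is a common convention that lets Python raise its usual TypeError.

```python
class Employee:
    def __init__(self, name, salary, hireYear):
        self.name = name
        self.salary = salary
        self.hireYear = hireYear

    # Overload +: add the salaries of two employees
    def __add__(self, other):
        if not isinstance(other, Employee):
            return NotImplemented   # let Python raise the usual TypeError
        return self.salary + other.salary

    # Overload <: an employee is "less than" another if hired earlier
    def __lt__(self, other):
        if not isinstance(other, Employee):
            return NotImplemented
        return self.hireYear < other.hireYear


emp1 = Employee("Maria", 15000, 2015)
emp2 = Employee("Alex", 22000, 2019)
print(emp1 + emp2)   # sum of the two salaries
print(emp1 < emp2)   # was emp1 hired before emp2?
```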
TABLE 3.1
List of Binary Operators and Their Corresponding Magic Method

Operator    Magic Method
+           __add__(self, other)
-           __sub__(self, other)
*           __mul__(self, other)
//          __floordiv__(self, other)
/           __truediv__(self, other)
%           __mod__(self, other)
**          __pow__(self, other)
<<          __lshift__(self, other)
>>          __rshift__(self, other)
&           __and__(self, other)
^           __xor__(self, other)
|           __or__(self, other)
TABLE 3.2
List of Comparison Operators and Their Corresponding Magic Method

Operator    Magic Method
<           __lt__(self, other)
>           __gt__(self, other)
<=          __le__(self, other)
>=          __ge__(self, other)
==          __eq__(self, other)
!=          __ne__(self, other)
TABLE 3.3
List of Unary Operators and Their Corresponding Magic Method

Operator    Magic Method
-           __neg__(self)
+           __pos__(self)
~           __invert__(self)
TABLE 3.4
List of Assignment Operators and Their Corresponding Magic Method

Operator    Magic Method
+=          __iadd__(self, other)
-=          __isub__(self, other)
*=          __imul__(self, other)
/=          __itruediv__(self, other)
//=         __ifloordiv__(self, other)
%=          __imod__(self, other)
**=         __ipow__(self, other)
<<=         __ilshift__(self, other)
>>=         __irshift__(self, other)
&=          __iand__(self, other)
^=          __ixor__(self, other)
|=          __ior__(self, other)
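A minimal sketch of a unary, a binary, and an assignment operator working together is given below. The Account class and its balance attribute are hypothetical names introduced only for this illustration. In Python 3, the / operator calls __truediv__(), and unary magic methods take no argument other than self.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance

    # Unary minus: takes only 'self' and returns a new object
    def __neg__(self):
        return Account(-self.balance)

    # True division (/) in Python 3
    def __truediv__(self, parts):
        return Account(self.balance / parts)

    # In-place addition (+=): update this object and return it
    def __iadd__(self, amount):
        self.balance += amount
        return self


acc = Account(100)
acc += 50            # calls __iadd__
half = acc / 2       # calls __truediv__
debt = -acc          # calls __neg__
print(acc.balance, half.balance, debt.balance)
```

Note that an in-place method such as __iadd__() should return the updated object, since the result of the call is assigned back to the variable on the left-hand side.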
3.6.1 Overloading Built-In Methods
Observation 3.27 – Overloading Built-In Methods: It is possible to overload built-in methods (e.g., print, len, bool) by extending the functionality of their respective magic methods.
While Python does not support overloading of custom methods in a class, it does so for built-in methods. This allows the programmer to change the default behavior of an existing method within the context of a class. For example, in the case of the print() method, the default behavior is to print a string if the input is text or an object reference if the argument is an object, as shown in the following example:
# Define class 'Employee'
class Employee:
    pass

# Create a new 'emp1' object based on the class
emp1 = Employee()
emp1.firstName = "George"
emp1.lastName = "Comma"

# Use the print method to show the object's reference
print(emp1)
Output 3.6.1.a:
<__main__.Employee object at 0x000002A2140033D0>
Nevertheless, the way an object is printed can be changed by overloading the appropriate magic method. Using the usual
Employee example, overloading __str__() allows the program to print the respective employee's details
(e.g., firstName, lastName) instead of the object reference, as in the following example:
# Define class 'Employee'
class Employee:
    # Define and extend the constructor of the class
    def __init__(self, first, last, salary):
        self.firstName = first
        self.lastName = last
        self.salary = salary

    # Overload print: extend the functionality of '__str__'
    def __str__(self):
        return "Employee name: " + self.firstName + " " + \
            self.lastName + " Salary: " + str(self.salary)

# Create and use the 'emp1' object based on the 'Employee' class
emp1 = Employee("George", "Comma", 15000)

# Use the overloaded print method
print(emp1)
Output 3.6.1.b:
Employee name: George Comma Salary: 15000
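The same approach applies to len() and bool(), which rely on the __len__() and __bool__() magic methods respectively. The sketch below uses a hypothetical Team class, introduced here only for illustration, that stores employee names in a list.

```python
class Team:
    def __init__(self, name):
        self.name = name
        self.members = []

    def add(self, employeeName):
        self.members.append(employeeName)

    # len(team) returns the number of members
    def __len__(self):
        return len(self.members)

    # bool(team) is True only when the team has at least one member
    def __bool__(self):
        return len(self.members) > 0

    # print(team) shows a readable summary instead of the object reference
    def __str__(self):
        return "Team " + self.name + " (" + str(len(self)) + " members)"


team = Team("HR")
print(bool(team))          # the team is still empty
team.add("George Comma")
team.add("Maria Rena")
print(len(team))
print(team)
```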
3.7 ABSTRACT CLASSES AND INTERFACES IN PYTHON
Observation 3.28 – Abstract Class: A class that cannot be instantiated, but serves as a template for sub classes. Abstract classes contain declarations of abstract methods (i.e., methods whose implementation must be defined in the sub classes) and, optionally, non-abstract methods.
An abstract class is a class that cannot be instantiated. It serves as a blueprint or template for creating sub classes, but it cannot be used to create objects. An abstract class contains declarations of abstract methods. Declarations of this type include the names and parameter lists of the methods, but no implementation. The latter must be defined in the corresponding sub class.
In order to create abstract classes and methods, the ABC class and the abstractmethod decorator must be imported from the abc module. The syntax for doing so is the following:
from abc import ABC, abstractmethod
ABC stands for Abstract Base Classes. Newly created abstract classes inherit from ABC and must include at least one abstract method, marked with the @abstractmethod decorator and given no implementation. The following script provides an example of an abstract class (i.e., Employee) with one abstract method (i.e., getTotalSalary()). Running this script raises an error, since abstract classes cannot be instantiated:
# Import ABC
from abc import ABC, abstractmethod

# Define abstract class 'Employee'
class Employee(ABC):
    # Define abstract method 'getTotalSalary', which has no implementation
    @abstractmethod
    def getTotalSalary(self):
        pass

# Abstract classes cannot be instantiated
emp1 = Employee()
Output 3.7.a:
TypeError                                 Traceback (most recent call last)
<ipython-input-16-47belb52dd97> in <module>
     11
     12 # Abstract classes cannot instantiate objects
---> 13 emp1 = Employee()
TypeError: can't instantiate abstract class Employee with abstract methods getTotalSalary
Once the abstract class is defined, it can be used as a super class for deriving sub classes. Sub classes of this type must implement the abstract method of the abstract class as a minimum requirement. In this context, as shown in the first of the following scripts, instantiating sub class fullTimeEmployee raises an error, since it does not implement the abstract method (i.e., getTotalSalary()) of its abstract super class (i.e., Employee). On the contrary, the second script presents the implementation of abstract method getTotalSalary() that resolves this issue:
# Import ABC
from abc import ABC, abstractmethod

# Define abstract class 'Employee'
class Employee(ABC):
    # Define abstract method 'getTotalSalary'
    @abstractmethod
    def getTotalSalary(self):
        pass

# Define class 'fullTimeEmployee' based on the abstract class
class fullTimeEmployee(Employee):
    # Define the constructor of the sub class and its attributes
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self.__salary = salary
        self.__allowances = allowances

# Error will be raised as the sub class does not implement
# the abstract method
ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
Output 3.7.b:
TypeError
Traceback (most recent call last)
<ipython-input-12-7e5c51df1210> in <module>
21
22 # Error will be raised as the sub class does not implement the abstract method
---> 23 ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
TypeError: Can't instantiate abstract class fullTimeEmployee with abstract methods getTotalSalary
# Import ABC
from abc import ABC, abstractmethod

# Define abstract class 'Employee'
class Employee(ABC):
    # Define abstract method 'getTotalSalary'
    @abstractmethod
    def getTotalSalary(self):
        pass

# Define class 'fullTimeEmployee' based on the abstract class
class fullTimeEmployee(Employee):
    # Define the constructor of the sub class and its attributes
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self.__salary = salary
        self.__allowances = allowances

    # Implement the abstract method of the abstract class
    def getTotalSalary(self):
        return self.__salary + self.__allowances

# Create and use a new 'fullTimeEmployee' object
ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
print(ftl.getTotalSalary())
Output 3.7.c:
21000
Abstract classes may include both abstract and non-abstract methods with implementations. Sub
classes that inherit from the abstract class also inherit the implemented methods. If required, the
latter can be overridden, but in all cases, implementations must include the abstract method.
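The combination of abstract and non-abstract methods can be sketched as follows. The PartTimeEmployee sub class, its hours and rate attributes, and the sample values are assumptions introduced for illustration.

```python
from abc import ABC, abstractmethod


class Employee(ABC):
    def __init__(self, first, last):
        self._first = first
        self._last = last

    # Abstract method: every sub class must implement it
    @abstractmethod
    def getTotalSalary(self):
        pass

    # Non-abstract method: inherited as is by the sub classes
    def getFullName(self):
        return self._first + " " + self._last


class PartTimeEmployee(Employee):
    def __init__(self, first, last, hours, rate):
        super().__init__(first, last)
        self._hours = hours
        self._rate = rate

    # Implementation of the abstract method
    def getTotalSalary(self):
        return self._hours * self._rate


pt1 = PartTimeEmployee("Alex", "Flora", 80, 25)
print(pt1.getFullName())      # inherited, non-abstract method
print(pt1.getTotalSalary())   # implemented abstract method
```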
3.7.1 Interfaces
Observation 3.29 – Interface: A class that cannot be instantiated but serves as a template for sub classes. Unlike abstract classes, interfaces cannot have non-abstract methods.
In OOP, an interface refers to a class that serves as a template for the creation of other classes. Its main purpose is to improve the organization and efficiency of the code by providing blueprints for prospective classes. As such, interfaces describe the behavior of the classes that inherit from them, similarly to abstract classes. However, contrary to the latter, they cannot contain non-abstract methods. Python does not support the explicit creation of interfaces. However, since it does support multiple inheritance, the programmer can mimic the interface functionality by utilizing abstract class inheritance, limited to the exclusive use of abstract methods.
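One way to mimic interfaces, sketched here with hypothetical class names (IPayable, IPrintable, Contractor), is to define abstract classes that contain abstract methods only and to let a concrete class inherit from several of them.

```python
from abc import ABC, abstractmethod


# Two interface-like classes: abstract methods only
class IPayable(ABC):
    @abstractmethod
    def getTotalSalary(self):
        pass


class IPrintable(ABC):
    @abstractmethod
    def describe(self):
        pass


# A concrete class "implements" both interfaces via multiple inheritance
class Contractor(IPayable, IPrintable):
    def __init__(self, name, fee):
        self._name = name
        self._fee = fee

    def getTotalSalary(self):
        return self._fee

    def describe(self):
        return self._name + " earns " + str(self._fee)


c1 = Contractor("Maria Rena", 12000)
print(c1.describe())
print(isinstance(c1, IPayable), isinstance(c1, IPrintable))
```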
3.8 MODULES AND PACKAGES IN PYTHON
Observation 3.30 – Module: A module provides a way of organizing code in Python. Modules can host classes, methods, attributes, or even simple variables that can be imported and reused in other classes. Modules are commonly used with abstract classes.
Modules and packages refer to structures used for organizing code in Python. Modules are files containing Python code structures (e.g., classes, methods, attributes, or simple variables) signified by the .py file extension. Instead of rewriting particular blocks of code, modules can be imported into other Python files or applications, thus allowing for a modular programming approach based on reusable code.
Abstract classes and interfaces are two of the programming structures commonly stored in modules, from where they can be imported on demand.
In the example provided in the following script, the entire definition of class Employee is stored
in a module named employee.py:
# 'Employee' module saved in 'employee.py' file
class Employee:
    # Define the constructor and private attributes of the class
    def __init__(self, first, last, salary):
        self.__firstName = first
        self.__lastName = last
        self.__salary = salary

    # Define the getter for annual salary
    def getAnnualSalary(self):
        return self.__salary * 12

    # Define the getter for fullName
    def getFullName(self):
        return self.__firstName + " " + self.__lastName
3.8.1 The import Statement
Python module files are imported using the import statement. The statement may include one or
more modules. The syntax is the following:
import module1[, module2, module3…]
Observation 3.31 – The import Statement: Used to import either specific methods and attributes or entire classes stored in modules.
Once a module is imported, its classes and methods can be referenced using its name as a prefix (i.e., module.classname). The following example imports the employee.py module, and accesses the Employee class and its methods from the main body of the program:
# Import the 'employee.py' file as a module
import employee
# Use the module to create and use a new object
emp1 = employee.Employee("Maria", "Rena", 15000)
print(emp1.getFullName())
print()
Output 3.8.1:
Maria Rena
3.8.2 The from…import Statement
A Python module may contain several classes, methods, attributes, or variables. The from…import
statement allows the programmer to selectively import specific components from a
module. The syntax is the following:
from module import name1[, name2, name3…]
Note that the names used in this example (e.g., name1, name2, name3) represent names of classes,
methods, or attributes.
To import all objects from a module the following syntax can be used:
from module import *
The reader should note that if a specific class is imported explicitly, it can be referenced without a
prefix, like in the next example:
# Import class 'Employee' from 'employee' module in 'employee.py'
from employee import Employee

# Use the imported class to create and use a new object
emp1 = Employee("Alex", "Flora", 18000)
print(emp1.getFullName())
print(emp1.getAnnualSalary())
Output 3.8.2:
Alex Flora
216000
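The difference between the two import forms can also be tried without creating any files. The sketch below uses the standard library module math in place of employee.py, purely so that it runs as is. The third form, which gives an imported name a local alias with the as keyword, is an addition not covered in the text above.

```python
# Form 1: import the whole module; names need the module prefix
import math
print(math.sqrt(16))

# Form 2: import selected names; no prefix needed
from math import sqrt, pi
print(sqrt(16))
print(pi)

# Form 3: give an imported name a local alias
from math import factorial as fact
print(fact(5))
```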
3.8.3 Packages
Observation 3.32 – Package: A mechanism used to store a number of different modules in the same folder for better code organization.
A package is a collection of modules grouped together in a common folder. The package folder must contain a file with the designated name __init__.py, which indicates that the folder is a package. The __init__.py file can be empty, but it must always be present in the package folder. Once the package structure is created, Python modules can be added as required. The example in Figure 3.6 illustrates the structure of a package named hr, containing the mandatory __init__.py file, and a module named employee.py.
Modules contained in packages can be imported to an application using the package name as a
prefix in the import statement, as shown in the following scripts:
# Import the employee module from the 'hr' package
import hr.employee
# Use ‘Employee’ class stored in the module to create & use an object
emp1 = hr.employee.Employee("Alex", "Flora", 16000)
print(emp1.getFullName())
print(emp1.getAnnualSalary())
Output 3.8.3.a:
Alex Flora
192000
# Import 'Employee' class in the employee module from 'hr' package
from hr.employee import Employee

# Use the 'Employee' class of the module to create and use an object
emp2 = Employee("Alex", "Flora", 15000)
print(emp2.getFullName())
print(emp2.getAnnualSalary())
FIGURE 3.6 Package hr contains the __init__.py file and the employee.py module.
Output 3.8.3.b:
Alex Flora
180000
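For readers who wish to experiment with the package structure without creating folders by hand, the following sketch builds a throwaway package in a temporary folder and imports a class from it. The package name hr_demo, the simplified Employee class, and the use of the tempfile, sys, and importlib modules are all choices made for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package 'hr_demo' in a temporary folder
root = tempfile.mkdtemp()
pkgDir = os.path.join(root, "hr_demo")
os.makedirs(pkgDir)

# An empty __init__.py marks the folder as a package
open(os.path.join(pkgDir, "__init__.py"), "w").close()

# A small module inside the package
with open(os.path.join(pkgDir, "employee.py"), "w") as f:
    f.write(
        "class Employee:\n"
        "    def __init__(self, first, last):\n"
        "        self.first = first\n"
        "        self.last = last\n"
        "    def getFullName(self):\n"
        "        return self.first + ' ' + self.last\n"
    )

# Make the temporary folder importable, then import the class
sys.path.insert(0, root)
importlib.invalidate_caches()
from hr_demo.employee import Employee

emp1 = Employee("Alex", "Flora")
print(emp1.getFullName())
```

In a real project the package folder would normally sit next to the main script, so that no changes to sys.path are needed.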
3.8.4 Using Modules to Store Abstract Classes
Modules may also be used to store abstract classes or interfaces. In the following example, abstract
class IEmployee is stored in module employee.py, which is contained in the package named hr:
# Use 'abc' module to create an abstract class: store it as a module
# ('employee.py') in the hr package
from abc import ABC, abstractmethod

# Define abstract class 'IEmployee' and its behavior
class IEmployee(ABC):
    @abstractmethod
    def getTotalSalary(self):
        pass

    @abstractmethod
    def getFullName(self):
        pass
The following script demonstrates how the programmer can import the IEmployee class to the
application, and use it to create a sub class (fullTimeEmployee):
# Import the 'IEmployee' class from the employee module ('hr' package)
from hr.employee import IEmployee

# Define a new sub class inheriting from the 'IEmployee' super class
class fullTimeEmployee(IEmployee):
    # The constructor, attributes & behavior of the sub class
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self.__salary = salary
        self.__allowances = allowances

    def getTotalSalary(self):
        return self.__salary + self.__allowances

    def getFullName(self):
        return self.__first + " " + self.__last

# Create and use a new object
ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
print(ftl.getFullName())
print(ftl.getTotalSalary())
Output 3.8.4:
Maria Rena
21000
3.9 EXCEPTION HANDLING
Observation 3.33 – Types of Errors: There are three types of errors that may be encountered:
1. Compile Time: This is due to incorrect syntax and will not allow the program to execute.
2. Logical: This error type will allow execution of the program but may produce incorrect output.
3. Runtime: Raised because of unexpected external issues, wrong input, or wrong expressions. This error type will cause the program to crash.
When writing programs in Python, or in any other programming language for that matter, the code may include errors. Depending on their nature and significance, these errors may lead to a number of issues, such as preventing the program from executing, generating incorrect output, or causing the program to crash. It is, thus, the responsibility of the programmer to provide error identification and handling solutions, whenever possible. Errors can be classified into three main categories:
• Compile Time Errors: They occur due to incorrect syntax, datatype use, or parameters in a
method call among others. Whenever the compiler
encounters a compile error in the program it will
stop execution. Compile time errors are the easiest
to handle and can be fixed easily by correcting the
problematic code line(s).
• Logical Errors: They occur due to incorrect program logic. A program containing logical errors
may run normally without crashing, but will generate incorrect output. Logical errors are handled by testing the application with various
different input values, and making corrections to the program logic as necessary.
• Runtime Errors: They occur during the execution of a program, due to external factors
not necessarily related to the code. For example, a user may provide an invalid input that
the application is not expecting, or the code is attempting to read a file that does not exist
in the system. In Python, these types of errors raise exceptions and cause the program to
crash and terminate abruptly. To prevent this, the programmer should catch these exceptions by adding appropriate error handling code to the program.
3.9.1 Handling Exceptions in Python
Observation 3.34 – Handling Exceptions: Use the try…except…[else]…[finally] syntax to identify possible errors that might be encountered during execution and handle them appropriately, avoiding abnormal termination of the program.
In Python, when a runtime error occurs, a built-in exception is raised and, unless it is handled, the program crashes. The exception provides information about the error. For example, running the following script will cause a ZeroDivisionError exception as it attempts to divide a value by 0. The exception provides information about the nature of the issue (i.e., division by zero).
a = 10
b = 0
print(a / b)
Output 3.9.1.a:
ZeroDivisionError
Traceback (most recent call last)
<ipython-input-2-dd04aeeae314> in <module>
1 a = 10
2 b = 0
----> 3 print(a / b)
ZeroDivisionError: division by zero
Exceptions can be handled using a try/except block of statements. As the name suggests, this
structure consists of two distinct blocks: try and except. The try block includes critical statements that are most likely to cause an exception. When the exception occurs within the try block,
the execution of the program jumps to the except block. This part contains code that handles the
exception appropriately. For example, it may display a related user message, close an open file, or
log the error to a file. If no exception is raised in the try block, the program skips the except
block and execution continues as normal.
Two optional blocks may also be added to the exception handling code, namely else and finally. The else block contains statements that are executed in case no exception occurs. The finally block contains code that must be executed irrespectively of whether an exception occurs or not, and is mainly used for releasing external resources, such as closing an open file.
Observation 3.35 – Raising Exceptions: Instead of using built-in exceptions, it is possible to define user-defined exceptions to address specific errors in the program execution.
The main Python syntax for catching exceptions is shown below:
try:
    critical statements
except [ExceptionClass [as err]]:
    exception handling statements
[else:
    statements to execute when no exception has occurred]
[finally:
    statements to execute whether an exception has occurred or not]
The ExceptionClass is optional, and refers to the type of exception being handled. If omitted,
all types of exceptions are handled by the except block.
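The four blocks can be seen working together in the following sketch. The safeDivide() function is a hypothetical helper written for this illustration. It shows which blocks run when the division succeeds and which run when it fails.

```python
def safeDivide(a, b):
    try:
        result = a / b                   # critical statement
    except ZeroDivisionError as err:
        print("Cannot divide:", err)     # runs only when this error occurs
        return None
    else:
        print("Division succeeded")      # runs only when no error occurred
        return result
    finally:
        print("Done")                    # always runs, even after a return


print(safeDivide(10, 2))
print(safeDivide(10, 0))
```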
The following example is an improved version of the code used in previous examples, since in
this occasion the program will not crash abruptly. Instead, it will terminate with a user-friendly
error message:
# Declare variables 'a' and 'b'
a, b = 10, 0

# Try to divide the variables and, if an exception is raised,
# execute the alternative statement in the 'except' block
try:
    print(a / b)
except:
    print("An error has occurred")
Output 3.9.1.b:
An error has occurred
3.9.1.1 Handling Specific Exceptions
Trying to catch all types of errors within a single try/except block is not considered good programming practice, as it does not allow the programmer to handle exceptions on a case-by-case
basis. Python provides various different built-in exception classes that are raised automatically,
according to the type of error being encountered. These specific exceptions can be utilized by referring to their designated names. Table 3.5 lists a number of common built-in exception classes in
Python.
The example presented below demonstrates how a specific error can be handled using the
ZeroDivisionError exception class:
# Declare variables 'a' and 'b'
a, b = 10, 0

# Attempt to print the result of the division of 'a' by 'b'
try:
    print(a / b)
# If a specific 'ZeroDivisionError' occurs print a relevant message
except ZeroDivisionError as err:
    print("An error has occurred")
    print(err)
Output 3.9.1.1:
An error has occurred
division by zero
TABLE 3.5
Common Exception Classes in Python

Exception Class       Description
ArithmeticError       Raised when arithmetic operations fail. Includes the following exception sub classes: OverflowError, ZeroDivisionError, FloatingPointError
OverflowError         The result of an arithmetic operation is out of range
ZeroDivisionError     Attempting to divide by zero
FloatingPointError    Floating-point operation failure
IndexError            An array index is invalid
AttributeError        A non-existing attribute is referenced for an instance
TypeError             An operator or method is applied to an inappropriate type of object
FileNotFoundError     A file is not found
ValueError            The parameter of a method has the right type but an inappropriate value

A try block may also be followed by multiple except blocks. This is useful when the programmer wants to handle various different types of errors. However, only one of these blocks will be executed when an exception occurs. When multiple except blocks are used, the code structure must start with the more specific exception classes and end with the more generic ones. In this case, the latter are used as an added measure of trying to handle unexpected errors that are not accounted for explicitly. The syntax of a multiple exceptions block is provided below:
try:
    # critical statements
    pass
except FileNotFoundError:
    # handle FileNotFound exception
    pass
except (IndexError, ArithmeticError):
    # except block with multiple exceptions
    # index out of range in an array and arithmetic error
    pass
except:
    # must be placed at end. Handles all other errors
    pass
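A runnable version of this structure is sketched below. The describeFailure() helper and the file name are hypothetical. The final handler uses except Exception rather than a bare except, which is a choice of this sketch: it still catches all ordinary errors but lets signals such as KeyboardInterrupt pass through.

```python
def describeFailure(action):
    try:
        action()                              # critical statement
    except FileNotFoundError:
        return "file not found"
    except (IndexError, ArithmeticError):
        return "index or arithmetic problem"
    except Exception:
        return "some other error"             # generic handler placed last
    return "no error"


print(describeFailure(lambda: open("no_such_file_12345.txt")))
print(describeFailure(lambda: [1, 2, 3][10]))
print(describeFailure(lambda: 1 / 0))         # ZeroDivisionError is an ArithmeticError
print(describeFailure(lambda: int("abc")))    # ValueError: no specific handler
print(describeFailure(lambda: None))
```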
3.9.2 Raising Exceptions
In Python, built-in exceptions are raised automatically when a corresponding runtime error occurs. However, the programmer can also raise exceptions explicitly. This is achieved by using the raise keyword followed by the exception name. When raising an exception explicitly, it is also possible to provide a string parameter that describes the reason for raising the exception. The next example demonstrates such a case, where if the user input (i.e., the user's age) is less than 18, a built-in exception (i.e., ValueError) is raised by the program:
# Accepts the user's age
age = int(input("Enter your age: "))

# If the input is an integer less than 18 raise an error
if age < 18:
    raise ValueError("Age cannot be below 18")
Output 3.9.2.a:
Enter your age: 17
ValueError
Traceback (most recent call last)
<ipython-input-6-de16dc8d8553> in <module>
4 # If the input is an integer less than 18 raise an error
5 if age < 18:
----> 6     raise ValueError("Age cannot be below 18")
ValueError: Age cannot be below 18
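The same check can be wrapped in a function and combined with a try/except block, so that the raised exception is handled instead of terminating the program. The checkAge() function below is a hypothetical helper, and the age is passed as an argument rather than read from the keyboard so that the sketch runs without user interaction.

```python
def checkAge(age):
    # Raise a built-in exception with a descriptive message
    if age < 18:
        raise ValueError("Age cannot be below 18")
    return age


try:
    checkAge(17)
except ValueError as err:
    print("Invalid input:", err)
```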
In the example below, built-in exception AttributeError is raised when the value of private
attribute __first is invalid.
# Define class 'Employee'
class Employee:
    # Define the getter method
    def getFirstName(self):
        return self.__first

    # Define the setter method
    def setFirstName(self, value):
        if len(value) < 15:
            self.__first = value
        else: # Raise error if the input exceeds 14 characters
            raise AttributeError("First name must be less than 15 "
                                 "characters")

# Attempt to create a new object and set the first name
try:
    emp1 = Employee()
    emp1.setFirstName("Maria Rena White") # Exception raised
# Handle the 'AttributeError' exception raised if the first name exceeds 14
# characters
except AttributeError as err:
    print(err)
except:
    print("An error has occurred")
Output 3.9.2.b:
First name must be less than 15 characters
Raising exceptions is also a convenient way of handling invalid values passed to an attribute setter
method. However, in this case, instead of raising built-in exceptions, it is preferable to create custom, in-class ones.
3.9.3 User-Defined Exceptions in Python
As mentioned, Python raises built-in exceptions whenever a runtime error occurs. However, for
custom errors, Python also allows the creation of custom exceptions that can be raised from within
the code. For example, instead of raising built-in exception AttributeError, the programmer
can create a user-defined exception by deriving a new class from the Exception base class, as
shown below:
class NewExceptionName(Exception):
    pass
In the following script, user-defined exception FirstNameException is created and subsequently raised in the setter method, when the length of the first name exceeds the limit of 14
characters:
# Define the new exception class based on the built-in exceptions
class FirstNameException(Exception):
    def __init__(self, message):
        super().__init__(message)

# Define class 'Employee'
class Employee:
    # Getter method
    def getFirstName(self):
        return self.__first

    # Setter method
    def setFirstName(self, value):
        if len(value) < 15:
            self.__first = value
        else:
            # Raise the user-defined exception 'FirstNameException' if
            # the first name exceeds 14 chars
            raise FirstNameException(
                "First name should be less than 15 characters")

# Create and use the new object handling possible user-defined
# exceptions
try:
    emp1 = Employee()
    emp1.setFirstName("Maria Rena White") # Exception raised
except FirstNameException as err:
    print(err)
except:
    print("An error has occurred")
Output 3.9.3:
First name should be less than 15 characters
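User-defined exceptions can also carry extra information and be organized in a small hierarchy, so that a handler may catch either a specific error or a whole family of related ones. The following is a minimal sketch of this idea. The names EmployeeError, SalaryError, and setSalary() are hypothetical and are introduced only for illustration.

```python
# Base class for all application-specific errors (hypothetical name)
class EmployeeError(Exception):
    pass


# A more specific error that also stores the offending value
class SalaryError(EmployeeError):
    def __init__(self, salary, message="Salary must be positive"):
        super().__init__(message)
        self.salary = salary


def setSalary(value):
    if value <= 0:
        raise SalaryError(value)
    return value


try:
    setSalary(-500)
except SalaryError as err:        # most specific handler first
    print(err, "- received:", err.salary)
except EmployeeError:             # any other error of the same family
    print("Another employee-related error has occurred")
```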
3.10 CASE STUDY
Sherwood real estate requires an application to manage properties. There are two types of properties: apartments and houses. Each property may be available for rent or sale.
Both types of properties are described using a reference number, address, built-up area, number
of bedrooms, number of bathrooms, number of parking slots, pool availability, and gym availability. A house requires extra attributes such as the number of floors, plot size and house type (villa
or townhouse). An apartment requires additional attributes such as floor and number of balconies.
Each type of property (house or apartment) may be available for rent or sale.
A rental property should include attributes such as deposit amount, yearly rent, furnished (yes or
no), and maids’ room (yes or no). A property available for sale has attributes such as sale price and
estimated annual service charge.
All properties include a fixed agent commission of 2%. Both types of sale properties have a fixed
tax of 4%.
All properties require a method to display the details of the property.
All properties should include a method to compute the agent commission. For rental properties,
agent commission is calculated by using the yearly rental amount, whereas for purchase properties
it is calculated using the sale price.
Both types of purchase properties should include a method to compute the tax amount. Tax
amount is computed based on the sale price.
Design and implement a Python application that creates the four types of properties (e.g.,
RentalApartment, RentalHouse, SaleApartment, SaleHouse) by using multiple
inheritance and abstract classes. Implement class attributes and instance attributes using encapsulation. All numeric attributes, such as price, should be validated for inputs with a suitable minimum
and maximum price.
Define the methods in the abstract class and implement them in the respective classes. Override the
print method to display the details of each property.
Test your application by creating new properties of each type and calling the respective methods.
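One possible starting skeleton for the hierarchy described above is sketched below. The class names Property, Rental, Sale, Apartment, and House, the method names, the sample data, and the validation limits are all assumptions made for this illustration; only the concrete class names RentalApartment and SaleHouse, the 2% commission, and the 4% tax come from the case study itself.

```python
from abc import ABC, abstractmethod


class Property(ABC):
    """Abstract base class holding what every property shares."""
    COMMISSION_RATE = 0.02  # fixed agent commission of 2%

    def __init__(self, ref_no, address):
        self._ref_no = ref_no
        self._address = address

    @abstractmethod
    def commission_base(self):
        """Return the amount on which the commission is computed."""

    def agent_commission(self):
        return self.commission_base() * self.COMMISSION_RATE

    def __str__(self):
        return f"{type(self).__name__} {self._ref_no} at {self._address}"


class Rental(Property):
    def set_yearly_rent(self, value):
        # Hypothetical limits; pick bounds that suit the application
        if not 1_000 <= value <= 1_000_000:
            raise ValueError("Yearly rent is out of range")
        self._yearly_rent = value

    def commission_base(self):
        return self._yearly_rent


class Sale(Property):
    TAX_RATE = 0.04  # fixed tax of 4% on sale properties

    def set_sale_price(self, value):
        # Hypothetical limits, as above
        if not 10_000 <= value <= 100_000_000:
            raise ValueError("Sale price is out of range")
        self._sale_price = value

    def commission_base(self):
        return self._sale_price

    def tax_amount(self):
        return self._sale_price * self.TAX_RATE


class Apartment(Property):
    pass  # floor and number of balconies would be added here


class House(Property):
    pass  # number of floors, plot size, and house type would be added here


# Multiple inheritance combines the listing type with the building type
class RentalApartment(Rental, Apartment):
    pass


class SaleHouse(Sale, House):
    pass


flat = RentalApartment("R-001", "12 Oak Street")
flat.set_yearly_rent(60_000)
print(flat, "commission:", round(flat.agent_commission(), 2))

villa = SaleHouse("S-001", "3 Elm Road")
villa.set_sale_price(1_500_000)
print(villa, "tax:", round(villa.tax_amount(), 2))
```

Keeping commission_base() abstract in Property lets the rental and sale branches decide which amount the shared agent_commission() method uses, so the 2% rule is written only once. SaleApartment and RentalHouse would follow the same pattern.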
3.11 EXERCISES
1. Using the diagram shown below, write Python code for the following:
a. Create a class named Student.
b. Create appropriate getters and setters using the @property decorator for the Student_Name and GPA attributes. The Student_ID and Email attributes are read only. Create only getter methods for these attributes.
c. Add a private class attribute named MAX_ID and set it to 0.
d. Add a default constructor method to the Student class. The default constructor should initialize the GPA attribute to 0 and Student_ID to MAX_ID + 1.
e. Add an overloaded constructor that takes Student_Name and GPA as arguments and initializes private data variables with the values provided. In addition, it should set the Student_ID to MAX_ID + 1 and the email attribute to first_name.last_name@university.edu.
f. Modify the setter method of the GPA attribute to check if the provided value is between 0 and 4 before storing it.
g. Add a destructor method to the Student class. The method should print the message “All student records destroyed”.
h. Instantiate two new objects called std1 and std2, using the default and the overloaded constructors, respectively.
i. Print the data values stored in each object’s attributes.
j. Delete objects std1 and std2.
4 Graphical User Interface Programming with Python
Ourania K. Xanthidou
Brunel University London
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Sujni Paul
Higher Colleges of Technology
CONTENTS
4.1 Introduction
    4.1.1 Python’s GUI Modules
    4.1.2 Python IDE (Anaconda) and Chapter Scope
4.2 Basic Widgets in Tkinter
    4.2.1 Empty Frame
    4.2.2 The Label Widget
    4.2.3 The Button Widget
    4.2.4 The Entry Widget
    4.2.5 Integrating the Basic Widgets
4.3 Enhancing the GUI Experience
    4.3.1 The Spinbox and Scale Widgets inside Individual Frames
    4.3.2 The Listbox and Combobox Widgets inside LabelFrames
    4.3.3 GUIs with CheckButtons, RadioButtons and SimpleMessages
4.4 Basic Automation and User Input Control
    4.4.1 Traffic Lights Version 1 – Basic Functionality
    4.4.2 Traffic Lights Version 2 – Creating a Basic Illusion
    4.4.3 Traffic Lights Version 3 – Creating a Primitive Automation
    4.4.4 Traffic Lights Version 4 – A Primitive Screen Saver with a Progress Bar
    4.4.5 Traffic Lights Version 5 – Suggesting a Primitive Screen Saver
4.5 Case Studies
4.6 Exercises
DOI: 10.1201/9781003139010-4
4.1 INTRODUCTION
In modern day software development, creating an application with an intuitive Windows style
Graphical User Interface (GUI) is a must in order to make it attractive for the user. There are four
essential concepts related to this, and the associated programming tools:
• Widgets: The different components used to create an application GUI. These are relatively simple, pre-defined objects available through Python libraries. In this chapter, the libraries and modules used include tkinter and PIL, which provide the visual attributes that give the objects their standard windowed look. The associated objects can be as simple as labels, texts, and buttons or as complex as frames and grids.
• Options: Characteristics or attributes of a widget/object that dictate the way the latter looks and behaves (e.g., the object color, text, position, or alignment). Value changes, usually integrated with interactions between the user and the GUI, control aspects like the visual appearance or format of the application and its behavior.
• Methods: Pre-defined or newly developed snippets of Python code that affect the widgets by changing the values of their properties/attributes. There is a wealth of methods in the various packages offered by Python, such as tkinter and PIL. They can be as simple or complex as the developer intends.
• Events: The interaction between the user of a GUI-based, Windows-style application and its widgets is expressed through events that trigger the execution of particular commands or blocks of code. Python offers numerous such events, some of them applicable to several different widgets. Examples are a click or double-click of the mouse, pressing the Enter key on the keyboard, hovering over a widget, or changing the text of a text widget.
Observation 4.1 – Widget: A graphical component used to create the
interface of the Python application.
This is provided as a pre-defined class
of the tkinter or PIL packages.
Observation 4.2 – Option: An attribute of the widget that controls its
look and behavior.
Observation 4.3 – Method: A specific structure of code that changes
the value of an option of a particular
widget. It can be either pre-defined or
newly developed.
Observation 4.4 – Event: An interaction between the user and an
object that causes a change in terms
of the object’s appearance and/or
value. Many types of interactions are
available.
Observation 4.5 – Event-Driven (or
Visual) Programming: The concept
of handling events through the use
of methods in order to change the
options of an object and, thus, its
look and behavior.
Event-driven (or visual) programming is the process during which one or more of the properties/
attributes of a widget/object changes state or value. This is done through the use of specific methods
and is triggered through interactions between the user and the widget/object, caught by the associated event.
The focus of this chapter is to introduce the concept of event-driven (or visual) programming by
presenting some of the most popular widgets and the associated methods and properties/attributes/
options, and the most commonly used events for the creation of a GUI experience.
4.1.1 Python’s GUI Modules
Python provides a rather complete set of widgets (presented as classes) to create objects for user-friendly applications, a comprehensive and developer-friendly set of methods available through these widgets, a rich set of attributes of these widgets, and an adequate number of well-defined programmable events that can be triggered through user interactions. There are two basic modules that define the components and functionality of these widgets, namely the tkinter and the PIL modules.
Observation 4.6 – Python GUI Modules: The most important and frequently used modules for GUI programming in Python are Tk/Tcl, tkinter.tix, and tkinter.ttk.
The tkinter module provides a number of classes, including the fundamental Tk class, as well as
numerous other classes associated with GUIs. It consists of the following:
• Tk/Tcl: A toolkit that includes widgets for GUI applications.
• tkinter.tix: An extension of tkinter including more advanced GUI widgets (e.g., spin boxes, trees). It has been deprecated since Python 3.6 and was removed in Python 3.13.
• tkinter.ttk: A collection of themed widgets. Some of them replace widgets of the original tkinter module, while others are only available through ttk (e.g., combo boxes, progress bars).
Although it is not possible to describe all the widgets, methods, properties, and events available
through all these modules in detail in this chapter, an effort is made to present the most commonly
used ones and provide examples of their application. This chapter gradually moves from simpler to
more sophisticated cases of increasing complexity.
4.1.2 Python IDE (Anaconda) and Chapter Scope
In line with the approach taken in previous chapters, the Jupyter Notebook (Anaconda) is the
platform of choice for the code developed in this chapter. Detailed download and installation
instructions are provided in the introductory Chapter 1.
It is worth noting that when writing programs in Python, or indeed in any other language, it is
useful to follow good programming practices. It is a good habit and a helpful strategy in the long
run to use pseudocode in the form of comments before lines or blocks of code that are written to
accomplish a specific and well-defined task. This allows the reader or the owner of the program to
understand the underlying algorithm, making the program more readable and user-friendly.
It is beyond the scope of this chapter to write “highly intelligent” Python programs that create complex and sophisticated GUI applications, as this would make the chapter content difficult to digest.
Instead, this chapter aims at presenting the tools and their suggested uses for the creation of common
tasks and applications, without trying to offer the most efficient or optimal solution for such tasks.
4.2 BASIC WIDGETS IN TKINTER
Arguably, when creating a GUI, there are four basic
widgets that intuitively come to mind. These are the
actual frame, and the label, the button, and the entry
widgets (the latter is commonly referred to as textbox
in other programming languages). In this section, these
particular widgets will be presented and utilized to create simple GUI applications.
Observation 4.7 – Basic Widgets:
The basic widgets of any GUI in
Python are the frame, and the label,
the button, and the entry widgets.
4.2.1 Empty Frame
The basic frame is the initial parent object that a Python GUI application requires in order to support the GUI interface and functionality. The following Python code creates a basic, empty frame
titled “Python Basic Window Frame”:
# Import the necessary library
import tkinter as tk

# Create the frame using the Tk class
winFrame = tk.Tk()
winFrame.title("Python Basic Window Frame")
winFrame.mainloop()
Output 4.2.1.a:
A few things are worth noting in this example:
• Every frame is an object of the Tk class (reached here through the tk module alias), created by the Tk() constructor. The object should be assigned to a variable so that it can be referenced later.
• It is common practice to give a title to every frame using the title() method.
• The mainloop() method runs the frame and puts tkinter in a wait state, which internally monitors user-generated events, such as keyboard and mouse activity.
By default, the basic frame is resizable and its size is
determined automatically. If there is a requirement
for specifically defining and controlling whether it
should be resizable, two methods can be used, namely:
resizable() and geometry(). If it is preferred to
have a non-resizable frame, one can just pass Boolean
value False to both parameters of the resizable()
method. Accordingly, passing True would result in a
resizable frame. The geometry() method is used to
pass the initial size of the frame as a string. It is also
possible to define the maximum and minimum sizes of
the window frame, as well as its background color. The
aforementioned methods and their application are demonstrated in the following example:
Observation 4.8 – The mainloop()
Method: Use the mainloop()
method to monitor and control any
type of interaction between the user
and the application.
Observation 4.9 – Frame Methods: Use
the title(), resizable(), geometry(), maxsize(), minsize(), config() methods to configure the basic
content, size, geometry, flexibility, and
look of the main window frame.
# Import the necessary library
import tkinter as tk
# Create the frame using the tk object
winFrame = tk.Tk()
# Provide a title for the frame
winFrame.title("Python Controlled Frame")
# The frame is resizable if the method parameters are set
# to True or non-zero; if set to False, it is not resizable
winFrame.resizable(True, True)
# The frame will have initial dimensions of 500 by 200
winFrame.geometry('500x200')
# The frame can be resized up to a maximum of 1500 by 600
winFrame.maxsize(1500, 600)
# The frame can be resized down to a minimum of 250 by 100
winFrame.minsize(250, 100)
# The background colour of the frame can be changed with
# the use of the configure method and the bg option
winFrame.configure(bg = 'dark grey')
winFrame.mainloop()
Output 4.2.1.b:
Once the basic frame is set, the actual GUI can be created by adding the desired widgets.
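The string passed to geometry() can also carry a screen position, in the form 'WIDTHxHEIGHT+X+Y'. The helper below is a small sketch that builds such a string so that a frame of a given size opens centered on the screen. The function names centered_geometry and demo are assumptions made for this illustration and are not part of tkinter.

```python
def centered_geometry(width, height, screen_width, screen_height):
    """Return a Tk geometry string that centers a width x height frame."""
    x = max((screen_width - width) // 2, 0)
    y = max((screen_height - height) // 2, 0)
    return f"{width}x{height}+{x}+{y}"


def demo():
    # tkinter is imported here so that centered_geometry() can also be
    # used on a machine without a display
    import tkinter as tk

    winFrame = tk.Tk()
    winFrame.title("Centered frame")
    # winfo_screenwidth() and winfo_screenheight() report the screen size
    winFrame.geometry(centered_geometry(
        500, 200,
        winFrame.winfo_screenwidth(), winFrame.winfo_screenheight()))
    winFrame.mainloop()


# Call demo() to open the centered frame
print(centered_geometry(500, 200, 1920, 1080))
```

On a 1920 by 1080 screen the helper returns '500x200+710+440', i.e., the same initial size as in the previous script plus the offsets of the top-left corner.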
4.2.2 The Label Widget
The label widget is a basic widget class from the tkinter module. It is used to display a message or image on screen. As it does not accept input from the keyboard its value cannot be changed directly during runtime, but this can be done indirectly through the code. The widget comes with several methods and the associated parameters and options that can be used to change its appearance and functionality. The following script is an example showcasing the use of some of the available options:
Observation 4.10 – Labels: Basic widgets used to display a message or an
image. They do not accept input and,
thus, their value cannot be changed
directly by the user. Label widgets
must be attached to a frame or window through the pack() or grid()
methods.
# Import the tkinter library
import tkinter as tk
# Define the parent frame
winFrame = tk.Tk()
winFrame.title("Labels in Python")
winFrame.resizable(True, True)
winFrame.geometry('300x100')
# Create a label object based on the tk.Label class
winLabel = tk.Label(winFrame, text = "Hello Python programmer")
# Associate the label object with the parent frame
winLabel.pack()
# Run the interface
winFrame.mainloop()
Output 4.2.2.a:
The script creates a window frame containing a basic label widget, used to display a text message. The label widget (winLabel) is an instance of the tk.Label class, created by calling the tk.Label() constructor. In this example, the call takes two arguments, namely the parent frame (winFrame) and the text option, which holds the message to display. The label widget is tied to the parent frame through the pack() method. Finally, the mainloop() method activates the
application.
An extension of this basic use of the label widget could involve the use of the grid() method,
in order to control its placement within the parent frame more efficiently:
# Import the tkinter library
import tkinter as tk
# Define the parent frame
winFrame = tk.Tk()
winFrame.title("Python Label using the Grid")
# Create a label and place it in the Grid
winLabel = tk.Label(winFrame, text = \
"Use the Grid method to \nplace the label in a static position")
# Specify the row and column the label
# is to be placed, regardless of the size of the parent frame
winLabel.grid(column = 0, row = 0)
winFrame.mainloop()
Output 4.2.2.b:
A couple of things are noteworthy in this case:
• For clarity purposes, if the statement is lengthy, it
can be broken by inserting the backslash special
character (“\”). This character informs Python
that the statement continues on the next line.
• Using the grid() method instead of pack()
ensures that the label widget will be placed in the
respective grid cell, in this case in the first row
(row = 0) and first column (column = 0), and
that its position will not be directly adjusted based
on the size of the frame or parent widget.
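As an alternative to the backslash, Python also continues a statement automatically inside parentheses, brackets, or braces, and adjacent string literals are joined into a single string. The short sketch below compares the two styles using the label text from the script above; the variable names message_a and message_b are chosen for this illustration only.

```python
# Explicit continuation: the backslash must be the very last character
# on the line
message_a = "Use the Grid method to \n" \
            "place the label in a static position"

# Implicit continuation: no backslash is needed inside parentheses
message_b = (
    "Use the Grid method to \n"
    "place the label in a static position"
)

# Both statements build exactly the same two-line string
print(message_a == message_b)
```

The implicit form is often preferred because a stray space after a backslash is a syntax error that is hard to spot.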
Observation 4.11 – The Backslash
Special Character (“\”): Use the
backslash special character (“\”) to
break a lengthy line.
Observation 4.12 – expand, foreground, background, font, anchor:
Use the expand, foreground, background, font, and anchor options to
improve the appearance of widgets.
It is possible to further enhance the appearance of a label
by changing its foreground and background colors, its
alignment, and its expandability, as shown in the following script. This example demonstrates the
behavior of the alignment of labels before and after resizing the window frame:
# Import the relevant library
import tkinter as tk

# The basic frame with the tk.Tk() constructor and provide a title
winFrame = tk.Tk()
winFrame.title('More options for label widgets')

# Create the 1st label and place it in the middle of the parent window
winLabel1 = tk.Label(winFrame, fg = 'green', font = "Arial 24",
    text = 'A green label of Arial 24, that does not expand')
winLabel1.pack(expand = 'N')

# The second label that expands vertically when the frame is resized
winLabel2 = tk.Label(winFrame, bg = 'red', fg = 'white',
    text = 'A label in red background that expands only vertically')
winLabel2.pack(expand = 1, fill = tk.Y)

# The third label that expands horizontally when the frame is resized
winLabel3 = tk.Label(winFrame, bg = 'blue', fg = 'yellow',
    text = 'A label in blue background that expands only horizontally')
winLabel3.pack(expand = 1, fill = tk.X)

# The fourth label 'anchored' (i.e., align always to the right/east)
winLabel4 = tk.Label(winFrame, anchor = 'e', bg = 'green',
    text = 'A right, i.e., east, aligned label')
winLabel4.pack(expand = 1, fill = tk.BOTH)

winFrame.mainloop()
Output 4.2.2.c:
A number of key observations can be made based on this example:
1. The expand option can be used to control whether a label widget will expand in line with
its parent widget. If the value is 0 or “N”, the label will not expand.
2. If the expand option is set to ‘Y’ or non-zero, the label widget can expand in line with its
parent widget. It can be also specified whether the expansion will be horizontal, vertical, or
both. In this case, one can use the fill option with the following arguments: X for horizontal
expansion only; Y for vertical expansion only, and BOTH for a simultaneous expansion in
both directions.
3. The fg and bg options can be used to define the color of the foreground and background
of the label widget, respectively.
4. The font option can be used to set up the font name and size of the text in the label widget.
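5. The anchor option controls where the text is positioned inside the area of the label (e.g., 'e' aligns it to the right/east edge), so the alignment is preserved when the label grows or shrinks with its parent widget.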
Ultimately, label widgets can provide additional functionality and can be further enhanced in terms
of their appearance. Indeed, they can be loaded with image objects with or without associated text,
and can function as buttons (covered in a later section of this chapter). If images are to be used, the
PIL module must be imported, as it provides the necessary methods to support such processes. The
following Python program uses image objects as buttons that change the text-related properties of
the main label:
# Import the relevant library
import tkinter as tk
# Import the necessary image processing classes from PIL
from PIL import Image, ImageTk

# Declare the methods to control the click events from each of the
# labels and change the settings of the main label
def changeBorders(a, b):
    winLabel5.config(relief = a, borderwidth = b)

def changeText(a):
    winLabel5.config(text = a)

def changeAlignment(a):
    winLabel5.config(anchor = a)

# Declare the method that will open the various images
def photos():
    global photo1, photo2, photo3, photo4, photo5, photo6
    # Image.LANCZOS is the filter formerly exposed as Image.ANTIALIAS,
    # an alias that was removed in Pillow 10
    image1 = Image.open('LabelsDynamicWithImageGoodMorning.gif')
    image1 = image1.resize((100, 50), Image.LANCZOS)
    photo1 = ImageTk.PhotoImage(image1)
    image2 = Image.open('LabelsDynamicWithImageGoodAfternoon.gif')
    image2 = image2.resize((100, 50), Image.LANCZOS)
    photo2 = ImageTk.PhotoImage(image2)
    image3 = Image.open('LabelsDynamicWithImageGoodEvening.gif')
    image3 = image3.resize((100, 50), Image.LANCZOS)
    photo3 = ImageTk.PhotoImage(image3)
    image4 = Image.open('LabelsDynamicWithImageAlignLeft.gif')
    image4 = image4.resize((100, 50), Image.LANCZOS)
    photo4 = ImageTk.PhotoImage(image4)
    image5 = Image.open('LabelsDynamicWithImageAlignRight.gif')
    image5 = image5.resize((100, 50), Image.LANCZOS)
    photo5 = ImageTk.PhotoImage(image5)
    image6 = Image.open('LabelsDynamicWithImageAlignCenter.gif')
    image6 = image6.resize((100, 50), Image.LANCZOS)
    photo6 = ImageTk.PhotoImage(image6)

# Declare the method that will create the first row of labels
# that will shape the main label
def firstRow():
    winLabel1a = tk.Label(winFrame, text = "Left click to \n change "
        "to raised label \nwith border width of 4", relief = "raised")
    winLabel1a.grid(column = 1, row = 0)
    winLabel1a.bind("<Button-1>", lambda event, a = "raised",
        b = 4: changeBorders(a, b))
    winLabel1b = tk.Label(winFrame, text = "Left click to \n change "
        "to sunken label \nwith border width of 6", relief = "raised")
    winLabel1b.grid(column = 2, row = 0)
    winLabel1b.bind("<Button-1>", lambda event, a = "sunken",
        b = 6: changeBorders(a, b))
    winLabel1c = tk.Label(winFrame, text = "Left click to \n change "
        "to flat label \nwith border width of 8", relief = "raised")
    winLabel1c.grid(column = 3, row = 0)
    winLabel1c.bind("<Button-1>", lambda event, a = "flat",
        b = 8: changeBorders(a, b))

# Declare the method that will create the second row of labels
# that will shape the main border
def secondRow():
    winLabel2a = tk.Label(winFrame, text = "Left click to \n change "
        "to ridge label \nwith border width of 10", relief = "raised")
    winLabel2a.grid(column = 1, row = 4)
    winLabel2a.bind("<Button-1>", lambda event, a = "ridge",
        b = 10: changeBorders(a, b))
    winLabel2b = tk.Label(winFrame, text = "Left click to \nchange to "
        "solid label \nwith border width of 12", relief = "raised")
    winLabel2b.grid(column = 2, row = 4)
    winLabel2b.bind("<Button-1>", lambda event, a = "solid",
        b = 12: changeBorders(a, b))
    winLabel2c = tk.Label(winFrame, text = "Left click to \n change to "
        "groove label \nwith border width of 14", relief = "raised")
    winLabel2c.grid(column = 3, row = 4)
    winLabel2c.bind("<Button-1>", lambda event, a = "groove",
        b = 14: changeBorders(a, b))

# Declare the method that will create the third row of labels
# that will change the text of the main label
def thirdRow():
    winLabel3a = tk.Label(winFrame,
        text = "Double left click to\n change to",
        image = photo1, compound = 'left', relief = "raised")
    winLabel3a.grid(column = 0, row = 1)
    winLabel3a.bind("<Double-Button-1>", lambda event,
        a = "Good morning": changeText(a))
    winLabel3b = tk.Label(winFrame, image = photo2, relief = "raised")
    winLabel3b.grid(column = 0, row = 2)
    winLabel3b.bind("<Double-Button-1>", lambda event,
        a = "Good afternoon": changeText(a))
    winLabel3c = tk.Label(winFrame, image = photo3, compound = "center",
        text = "Double click to\n change the text to", relief = "raised")
    winLabel3c.grid(column = 0, row = 3)
    winLabel3c.bind("<Double-Button-1>", lambda event,
        a = "Good evening": changeText(a))

# Declare the method that will create the fourth row of labels
# that will adjust the alignments of the text of the main label
def fourthRow():
    winLabel4a = tk.Label(winFrame, image = photo4,
        text = "Right click to \n left align the text\nof the label",
        compound = "center", relief = "raised")
    winLabel4a.grid(column = 4, row = 1)
    winLabel4a.bind("<Button-3>", lambda event,
        a = "w": changeAlignment(a))
    winLabel4b = tk.Label(winFrame, image = photo5, relief = "raised",
        text = "Right click to \nright align the text\nof the label")
    winLabel4b.grid(column = 4, row = 2)
    winLabel4b.bind("<Button-3>", lambda event,
        a = "e": changeAlignment(a))
    winLabel4c = tk.Label(winFrame, image = photo6, compound = "right",
        text = "Right click to \ncenter align the text\nof the label",
        relief = "raised")
    winLabel4c.grid(column = 4, row = 3)
    winLabel4c.bind("<Button-3>", lambda event,
        a = "center": changeAlignment(a))

# Create the basic frame with the tk.Tk() constructor and provide a title
winFrame = tk.Tk()
winFrame.title("Playing with Label options at runtime")

photos()
firstRow()
secondRow()
thirdRow()
fourthRow()

# Create the main label
winLabel5 = tk.Label(winFrame, text = "...", font = "Arial 18", width = 30)
winLabel5.grid(column = 2, row = 2)

winFrame.mainloop()
Output 4.2.2.d:
As mentioned, the PIL module provides the necessary tools to support image-related processes, in this case the Image and ImageTk modules.
The photos() method includes six sets of three lines/steps, and deals with the opening and reading of the images, as well as their preparation in order to be loaded to the respective labels. In the first step (i.e., the first line of each set), the open() function of the Image module is used to read the image file and create an image object. Next, the script uses the resize() method with the preferred dimensions for the image and the ANTIALIAS resampling filter (named LANCZOS in recent Pillow releases, from which the ANTIALIAS alias has been removed) in order to ensure that quality is maintained when downsizing an image to fit the label. This applies to all six cases. During the final step, a new image object that tkinter can display is created from the previously processed image. This is accomplished by using the PhotoImage class from the ImageTk module for each of the six cases. It is worth noting that this process applies to images with a gif file type. The reader should check the Python documentation to find the exact classes, methods, and options that should be used when working with other types of images, as well as the exact process that must be followed. Nevertheless, the latter should not differ significantly from the process presented above.
Observation 4.13 – resize(), ANTIALIAS: Use the resize() method to set the preferred dimensions of the image, and the ANTIALIAS option to ensure that the highest quality is maintained when resizing an image.
The next part of the script involves the use of four methods (firstRow(), secondRow(), thirdRow(), and fourthRow()) to create the twelve labels of the application (i.e., three labels for each row). For each label, three statements are used. The first statement creates the label widget and sets its text property to show the associated message, and the relief property to enhance the widget appearance to raised. The second statement places the label in the desired position within the grid of the current frame. The third statement calls the bind method in order to associate the particular widget with an event.
There are a number of events that can be associated with the various widgets. This example involves three basic events, namely: <Button-1>, which is triggered when the user left-clicks on the widget (a label in this case); <Button-3>, which is triggered when the user right-clicks; and <Double-Button-1>, which is triggered when the widget is double left-clicked.
Observation 4.14 – <Button-1>, <Button-3>, <Double-Button-1>: Use the <Button-1>, <Button-3>, and <Double-Button-1> events to catch when the widget is left-clicked, right-clicked or double-left-clicked.
Whenever an event is triggered, a method is usually called in order to execute a set of statements. tkinter passes only the event object to that method, so if the method is to receive further arguments, the call is wrapped in a lambda expression whose first parameter is the event and whose remaining parameters carry the desired values as defaults, e.g., lambda event, a = "raised", b = 4: changeBorders(a, b).
Observation 4.15 – lambda: Use the lambda event expression to define the arguments passed by an event to a method.
There are a number of options offered for the purpose of changing the appearance of the border of a label widget. These include values such as raised, sunken, flat, ridge, solid, and groove, which are set through the relief property. The borderwidth property, which takes an integer argument, changes the default border width of a label.
Observation 4.16 – relief, borderwidth: Use the relief and borderwidth properties to adjust the visual attributes of the label.
Finally, it is possible to have both a text and an image appearing in a label widget. In such cases, it is necessary to combine the two elements using the compound option. The option accepts different alignment values, namely left when the image is to be placed before the text, right when the image is to be placed after the text, and center when both objects are to be placed at the same position, one over the other.
Observation 4.17 – compound, left, right, center: Use the compound option to combine text and image objects in a label. Options include left, right, and center.
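The lambda pattern used with bind() can be tried without opening a window. In the sketch below, FakeWidget is a hypothetical stand-in that only stores and fires handlers (it is not a tkinter class), while changeBorders() merely records its arguments instead of configuring a label. The point it illustrates is that the default values a and b are fixed at the moment each lambda is created.

```python
class FakeWidget:
    """Hypothetical stand-in for a tkinter widget (illustration only)."""

    def __init__(self):
        self.handlers = {}

    def bind(self, sequence, handler):
        self.handlers[sequence] = handler

    def fire(self, sequence):
        # Like tkinter, call the handler with one argument: the event
        self.handlers[sequence](object())


log = []

def changeBorders(a, b):
    # Record the call instead of changing a real label
    log.append((a, b))


labels = {}
for relief, width in [("raised", 4), ("sunken", 6), ("flat", 8)]:
    labels[relief] = FakeWidget()
    # The defaults a and b are evaluated now, so every handler keeps its
    # own pair; a plain "lambda event: changeBorders(relief, width)"
    # would only see the last pair of the loop
    labels[relief].bind("<Button-1>",
        lambda event, a = relief, b = width: changeBorders(a, b))

labels["sunken"].fire("<Button-1>")
print(log)
```

Firing the handler of the second label records ('sunken', 6), even though the loop variables have since moved on to ('flat', 8).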
4.2.3 The Button Widget
Observation 4.18 – The Button Widget: Use the button widget to create objects that are responsive to various types of events (e.g., click, double-click, right-click), and the corresponding options or properties to modify its appearance.
As mentioned previously, the label widget is not meant to be used to trigger events initiated by the user interaction with the GUI. In such cases, the button widget can be used instead. This widget also belongs to the tkinter module, although it can be also found in the ttk module, where button objects can be created by instantiating the Button class. The following script demonstrates the possible output of five different user interactions through the use of a simple button widget. The script also provides user feedback depending on the type of interaction, by displaying relevant messages through a label widget:
# Import the relevant library
import tkinter as tk

# Define the method that controls the mouse click events
def changeText(a):
    winLabel.config(text = a)

# Create the basic frame with the tk.Tk() constructor and provide a title
winFrame = tk.Tk()
winFrame.title("A simple button and label application")

# Create the label
winLabel = tk.Label(winFrame, text = "...")
winLabel.grid(column = 1, row = 0)

# Create the button widget and bind it with the associated events
winButton = tk.Button(winFrame, text = "Left, right, or double left Click "
    "\nto change the text of the label", font = "Arial 16", fg = "red")
winButton.grid(column = 0, row = 0)
winButton.bind("<Button-1>", lambda event,
    a = "You left clicked on the button": changeText(a))
winButton.bind("<Button-3>", lambda event,
    a = "You right clicked on the button": changeText(a))
winButton.bind("<Double-Button-1>", lambda event,
    a = "You double left clicked on the button": changeText(a))
winButton.bind("<Enter>", lambda event,
    a = "You are hovering above the button": changeText(a))
winButton.bind("<Leave>", lambda event,
    a = "You left the button widget": changeText(a))

winFrame.mainloop()
Output 4.2.3:
As shown, the process of creating a button widget object and assigning values to its basic options
or properties (e.g., text, font, fg) is no different from the one used in the case of the label widget.
Likewise, binding the button widget to an event and calling a method (with or without arguments) follows the same syntax and logic as in the label widget case.
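For the common case of a plain left click, buttons also accept a command option, which is a different technique from the bind() calls used in the script above. The following is a minimal sketch under that assumption, using the ttk variant of the button; the names click_message() and build_command_demo() are hypothetical.

```python
def click_message(count):
    """Return the feedback text for the given number of clicks."""
    suffix = "" if count == 1 else "s"
    return f"Button clicked {count} time{suffix}"


def build_command_demo():
    """Create a ttk button that reports its clicks in a label (requires a display)."""
    # tkinter is imported here so click_message() can be used without a GUI
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Button with the command option")
    clicks = tk.IntVar(value=0)
    feedback = ttk.Label(root, text=click_message(0))

    def on_click():
        # Called without arguments each time the button is activated
        clicks.set(clicks.get() + 1)
        feedback.config(text=click_message(clicks.get()))

    ttk.Button(root, text="Click me", command=on_click).grid(column=0, row=0)
    feedback.grid(column=1, row=0)
    return root
```

Unlike a function bound with bind(), the function given to command receives no event object, so it suits simple actions. Events such as right-click or hovering still require bind().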
4.2.4 The Entry Widget
The entry widget is a basic widget from the ttk module (tkinter package), which allows input from
the keyboard as a single line. The widget offers several methods and options that control its
appearance and/or functionality. The widget must be placed in a parent widget, usually the current
frame, through the .pack() or .grid() methods.
Observation 4.19 – Entry/Text: Use the entry widget (ttk module, tkinter package) to allow the user
to enter text as a single line, or the text widget (tk module) to allow multiple lines. When using
the text widget, specify the number of text lines through the height = <number of lines> option.
The following script introduces the basic use of the entry widget, and its output:
# Import the necessary library
import tkinter as tk
from tkinter import ttk

# Create the frame using the tk object
winFrame = tk.Tk()
winFrame.title("Python GUI with text")

# Create a StringVar object to accept user input from the keyboard
textVar = tk.StringVar()
# Set the initial text for the StringVar
textVar.set('Enter text here')

# Create an entry widget and associate it to the StringVar object
winText = ttk.Entry(winFrame, textvariable = textVar, width = 40)
winText.grid(column = 1, row = 0)

winFrame.mainloop()
Output 4.2.4:
In line with common GUI development practice, the frame is created first and any child objects
(in this case the entry widget) are created and placed in it subsequently. Finally, the mainloop()
method is called to run the application and monitor its interactions. The width property specifies
the number of characters the widget can display. The reader should note that this is not necessarily
the total number of accepted characters, but rather the number of displayed characters. It should
also be noted that if multiple lines need to be entered, it is preferable to use the text widget
(tk module, tkinter library) and specify the number of lines through the height = <number
of lines> option.
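Because width only affects how many characters are displayed, limiting how many characters are accepted needs separate code. One possible approach, sketched below on the assumption that a limit of 10 characters is wanted, uses the validate and validatecommand options of the entry widget. The names MAX_CHARS, within_limit(), and build_limited_entry() are hypothetical and do not appear in the script above.

```python
MAX_CHARS = 10  # hypothetical limit, chosen only for illustration


def within_limit(proposed, limit=MAX_CHARS):
    """Return True if the proposed entry content does not exceed the limit."""
    return len(proposed) <= limit


def build_limited_entry():
    """Create an entry that shows 40 characters but accepts at most MAX_CHARS."""
    # tkinter is imported here so within_limit() can be used without a GUI
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Entry with a character limit")
    # register() wraps the Python function so that Tk can call it;
    # '%P' asks Tk to pass the text the entry would hold if the edit went ahead
    check = (root.register(within_limit), "%P")
    entry = ttk.Entry(root, width=40, validate="key", validatecommand=check)
    entry.grid(column=0, row=0)
    return root
```

With validate set to "key", the check runs on every keystroke, and an edit is rejected whenever within_limit() returns False. The displayed width (40) and the accepted length (10) are therefore controlled independently.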
The entry widget script also introduces a class that helps the programmer monitor the execution of
the application: StringVar, whose constructor is provided by the tkinter module (imported here as
tk). When associated with relevant widgets, such as the entry widget, its role is to create objects
that hold text input. Once such an object is created, its content can be set through the .set()
method. If no content is set, the object remains empty until the user provides input through the
associated widget. The entry widget and the StringVar object are associated via the textvariable
option.
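The link between a StringVar and its widget can also be observed while the user types. The sketch below is an illustrative assumption rather than part of the original script: it uses the trace_add() method of tkinter variables to update a label whenever the text changes, and the names describe() and build_live_counter() are hypothetical.

```python
def describe(text):
    """Return a short status line for the current entry content."""
    return f"{len(text)} character(s) entered"


def build_live_counter():
    """Create an entry whose length is reported live in a label (requires a display)."""
    # tkinter is imported here so describe() can be used without a GUI
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Watching a StringVar")
    text_var = tk.StringVar(value="Enter text here")
    status = ttk.Label(root, text=describe(text_var.get()))

    def on_write(*args):
        # Tk passes details of the traced variable; only its value is needed here
        status.config(text=describe(text_var.get()))

    # Call on_write() every time the variable is written to
    text_var.trace_add("write", on_write)
    ttk.Entry(root, textvariable=text_var, width=40).grid(column=0, row=0)
    status.grid(column=0, row=1)
    return root
```

The entry widget never calls on_write() directly; it writes to text_var, and the trace reacts to that write. This is the same association that textvariable sets up in the script above.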
4.2.5 Integrating the Basic Widgets
Having introduced the syntax and functionality of the basic Python widgets included in the tkinter,
PIL, and ttk modules/libraries, it would be useful to attempt to create an interface that integrates
all of them in one application. The following Python script displays a message to the user, accepts
a text input from the keyboard, and uses a number of buttons to change the various attributes of the
text, through the integration of label, entry, and button widgets:
# Import the necessary library
import tkinter as tk
from tkinter import ttk

# The tempText variable will store the contents of the entry widget
# (initialised to an empty string so that it can be read before the first hide)
tempText = ''
# The textVar object will associate the entry widget with the input
global textVar
# Define the winText widget
global winText

# ===================================================================
# Declare the methods that will run the application
def showHideLabelEntry(a):
    if (a == 's'):
        winText.grid()
    elif (a == 'h'):
        winText.grid_remove()

def showHideEntryContent(a):
    global tempText
    global textVar
    if (a == 's'):
        if (tempText != ''):
            textVar.set(tempText)
    if (a == 'h'):
        tempText = textVar.get()
        textVar.set('')

def enableLockDisableEntryWidget(a):
    if (a == 'e'):
        winText.config(state = 'normal')
    elif (a == 'l'):
        winText.config(state = 'disabled')

def boldContentsOfEntryWidget(a):
    if (a == 'b'):
        winText.config(font = 'Arial 14 bold')
    elif (a == 'n'):
        winText.config(font = 'Arial 14')

def passwordEntryWidget(a):
    if (a == 'p'):
        winText.config(show = '*')
    elif (a == 'n'):
        winText.config(show = '')

# ===================================================================
# Declare the method that will create the application GUI
def createGUI():
    createLabelEntry()
    showHideButton()
    showHideContent()
    enableDisable()
    boldOnOff()
    passwordOnOff()

# Create a label and an entry widget to prompt for input and
# associate it with a StringVar object
def createLabelEntry():
    global textVar
    global winText
    winLabel = tk.Label(winFrame, text = 'Enter text:', bg = 'yellow',
        font = 'Arial 14 bold', relief = 'ridge', fg = 'red', bd = 8)
    winLabel.grid(column = 0, row = 0)
    # A StringVar object to accept user input from the keyboard
    textVar = tk.StringVar()
    winText = ttk.Entry(winFrame, textvariable = textVar, width = 20)
    winText.grid(column = 1, row = 0)

# Create two button widgets to show/hide the label and entry widgets
def showHideButton():
    winButtonShow = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Show the\nentry widget', fg = 'red',
        borderwidth = 8, height = 3, width = 20)
    winButtonShow.grid(column = 0, row = 1)
    winButtonShow.bind('<Button-1>', lambda event,
        a = 's': showHideLabelEntry(a))
    winButtonHide = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Hide the\nentry widget',
        fg = 'red', borderwidth = 8, height = 3, width = 20)
    winButtonHide.grid(column = 1, row = 1)
    winButtonHide.bind('<Button-1>', lambda event,
        a = 'h': showHideLabelEntry(a))

# Two button widgets to show/hide the contents of the entry widget
def showHideContent():
    winButtonContentShow = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Show the contents\nof the entry widget',
        fg = 'blue', borderwidth = 8, height = 3, width = 20)
    winButtonContentShow.grid(column = 0, row = 2)
    winButtonContentShow.bind('<Button-1>', lambda event,
        a = 's': showHideEntryContent(a))
    winButtonContentHide = tk.Button(winFrame,
        text = 'Hide the contents\nof the entry widget',
        font = 'Arial 14 bold', fg = 'blue', borderwidth = 8,
        height = 3, width = 20)
    winButtonContentHide.grid(column = 1, row = 2)
    winButtonContentHide.bind('<Button-1>', lambda event,
        a = 'h': showHideEntryContent(a))

# Button widgets to enable/disable & lock/unlock the entry widget
def enableDisable():
    winButtonEnableEntryWidget = tk.Button(winFrame,
        text = 'Enable the\nentry widget', font = 'Arial 14 bold',
        fg = 'green', borderwidth = 8, height = 3, width = 20)
    winButtonEnableEntryWidget.grid(column = 0, row = 3)
    winButtonEnableEntryWidget.bind('<Button-1>', lambda event,
        a = 'e': enableLockDisableEntryWidget(a))
    winButtonDisableEntryWidget = tk.Button(winFrame,
        text = 'Lock the\nentry widget', font = 'Arial 14 bold',
        fg = 'green', borderwidth = 8, height = 3, width = 20)
    winButtonDisableEntryWidget.grid(column = 1, row = 3)
    winButtonDisableEntryWidget.bind('<Button-1>', lambda event,
        a = 'l': enableLockDisableEntryWidget(a))

# Create two button widgets to switch the "bold" property
# of the entry widget content on or off
def boldOnOff():
    winButtonBoldEntryWidget = tk.Button(winFrame,
        text = 'Bold contents of\nthe entry widget',
        font = 'Arial 14 bold',
        fg = 'brown', borderwidth = 8, height = 3, width = 20)
    winButtonBoldEntryWidget.grid(column = 0, row = 4)
    winButtonBoldEntryWidget.bind('<Button-1>', lambda event,
        a = 'b': boldContentsOfEntryWidget(a))
    winButtonNoBoldEntryWidget = tk.Button(winFrame,
        text = 'No bold contents of \nthe entry widget',
        font = 'Arial 14 bold', fg = 'brown', borderwidth = 8,
        height = 3, width = 20)
    winButtonNoBoldEntryWidget.grid(column = 1, row = 4)
    winButtonNoBoldEntryWidget.bind('<Button-1>', lambda event,
        a = 'n': boldContentsOfEntryWidget(a))

# Button widgets to convert the entry widget text to a password
def passwordOnOff():
    winButtonPasswordEntryWidget = tk.Button(winFrame,
        text = 'Show entry widget \ncontent as password', borderwidth = 8,
        font = 'Arial 14 bold', fg = 'grey', height = 3, width = 20)
    winButtonPasswordEntryWidget.grid(column = 0, row = 5)
    winButtonPasswordEntryWidget.bind('<Button-1>', lambda event,
        a = 'p': passwordEntryWidget(a))
    winButtonNormalEntryWidget = tk.Button(winFrame,
        font = 'Arial 14 bold',
        text = 'Show entry widget \ncontent as normal text',
        fg = 'grey', borderwidth = 8, height = 3, width = 20)
    winButtonNormalEntryWidget.grid(column = 1, row = 5)
    winButtonNormalEntryWidget.bind('<Button-1>', lambda event,
        a = 'n': passwordEntryWidget(a))

# ===================================================================
# Create the frame using the tk object and run the application
winFrame = tk.Tk()
winFrame.title("Wrap up the basic widgets")
createGUI()
winFrame.mainloop()
Output 4.2.5.a–4.2.5.f:
There are some noteworthy ideas presented in this script, relating to the need to hide, disable,
and lock the text of a widget, or make it appear as a password. For example, sometimes it is
required to hide, and subsequently unhide, a widget. This is often referred to as adjusting its
visibility. In Python this is achieved with the use of the grid() and grid_remove() methods. It
should be stated that when the widget is invisible it is not deleted, but merely removed from the
grid; its grid settings are remembered, so a later call to grid() without arguments restores it to
the same position. Method showHideLabelEntry() implements this functionality.
Observation 4.20 – grid(): Use the grid() method to position a widget on the grid; use the
grid_remove() method to remove it without deleting it.
Observation 4.21 – state, normal, disabled: Use the state option with the normal or disabled
flags to enable or disable (lock) the functionality of a widget.
In a similar fashion, the method showHideEntryContent() implements the functionality of hiding and
displaying the contents of the same entry widget using the set() and get() methods. The reader
should note that the content of the entry widget should be stored in a variable before it is
cleared, since careless use of the set() and get() methods may accidentally delete it. Likewise,
method enableLockDisableEntryWidget() implements the functionality of locking/disabling the entry
widget using the state option and its normal and disabled values.
Observation 4.22 – show: Use the show option to replace the text with a password-like text, based
on a preferred character/symbol.
Finally, if it is required to utilize text font properties, such as bold or italic, one can use the
font option as shown in the boldContentsOfEntryWidget() method. It is also possible to
make the content of the entry widget appear as a password. Method passwordEntryWidget()
uses option show to replace each character with a chosen placeholder character, in this case an
asterisk (“*”).
The remaining methods are responsible for creating the application GUI.
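The risk of losing the stored content can be made concrete with a small sketch. The code below is an illustrative assumption, not the book's implementation: ContentToggler is a hypothetical helper that keeps the saved copy from being overwritten when "hide" is requested twice in a row, and PlainVar is a hypothetical stand-in for tk.StringVar so that the example runs without a display.

```python
class PlainVar:
    """Stand-in for tk.StringVar: offers only get() and set()."""

    def __init__(self, value=""):
        self._value = value

    def get(self):
        return self._value

    def set(self, value):
        self._value = value


class ContentToggler:
    """Hide and restore the text held by any object with get() and set()."""

    def __init__(self, variable):
        self.variable = variable
        self._saved = ""

    def hide(self):
        current = self.variable.get()
        if current != "":
            # Save only non-empty text, so a repeated hide keeps the earlier copy
            self._saved = current
        self.variable.set("")

    def show(self):
        if self._saved != "":
            self.variable.set(self._saved)


entry_text = PlainVar("secret")
toggler = ContentToggler(entry_text)
toggler.hide()
toggler.hide()  # a second hide does not wipe the saved copy
toggler.show()
print(entry_text.get())  # prints: secret
```

Since tk.StringVar offers the same get() and set() methods, a ContentToggler built around the textVar object of the script could be called from the show/hide buttons in place of the global tempText variable.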
4.3 ENHANCING THE GUI EXPERIENCE
The widgets, methods, options, and events presented in the previous sections should provide a good
enough basis to create a GUI application for a basic system, as they cover all the fundamental
aspects of basic interaction. However, they do not address two major requirements in computer
programming: validation and efficiency. In the case of numbers, specific widgets like spinbox and
scale are frequently used for the purposes of validation and improvement of visual appearance. In
the case of text, for tasks requiring optimized and synchronized organization, widgets like listbox
and combobox can be used. Checkbuttons and radiobuttons are used frequently in cases where
improved selection options are required. Finally, in order to improve the organization of the GUI
and avoid accidental repositioning of the widgets at runtime, the various objects can be placed in
individual frames within the main frame of the application.
4.3.1 The Spinbox and Scale Widgets inside Individual Frames
One of the main challenges in programming is to identify and highlight the user’s mistakes when
entering numbers as part of their interaction with an application. It is often the case that the
numeric values entered are either outside the allowed range or alphanumeric sequences consisting
of both text and numbers. In order to validate that a number is entered correctly, two different
approaches are followed: (a) code is written to ensure the correct, acceptable form of the input
number, and (b) widgets like spinbox and scale are used to restrict the user’s options when selecting
numbers. The following Python script makes use of such widgets to implement a small application
in which the user may enter the speed limit, the current speed, and the fine per km/h over the
speed limit. Once these numbers are entered, the fine is calculated based on the following formula:
fine = (current speed − speed limit) × fine per km/h. To improve the organization of the GUI,
the script uses frame widgets, upon which the various other widgets are placed:
# Import the necessary modules
import tkinter as tk
from tkinter import ttk

# Declare and initialise the global variables and widgets
# and define the associated methods
currentSpeedValue, speedLimitValue, finePerKmValue = 0, 0, 0
global speedLimitSpinbox
global finePerKmScale
global currentSpeedScale
global fine

# ===========================================================
# Define the methods to run the control speed application

# Define the method to control the Current Speed Scale widget change
def onScale(val):
    global currentSpeedValue
    v = float(val)
    currentSpeedValue.set(v)
    calculateFine()

# Define the method to control the Speed Limit Spinbox widget change
def getSpeedLimit():
    global speedLimitValue
    v = float(speedLimitSpinbox.get())
    speedLimitValue.set(v)
    calculateFine()

# Define the method to control the Fine per Km Scale widget change
def getFinePerKm(val):
    global finePerKmValue
    v = int(float(val))
    finePerKmValue.set(v)
    calculateFine()

# Define the method to calculate the Fine given the 3 user parameters
def calculateFine():
    global currentSpeedValue, speedLimitValue, finePerKmValue
    global fine
    diff = float(currentSpeedValue.get())-float(speedLimitValue.get())
    finePerKm = float(finePerKmValue.get())
    if (diff <= 0):
        fine.config(text = 'No fine')
    else:
        fine.config(text = 'Fine in USD: '+ str(diff * finePerKm))
# ===========================================================

# Define the methods that will create the interface of the application
def createGUI():
    currentSpeedFrame()
    speedLimitFrame()
    finePerKmFrame()
    fineFrame()

# Create the frame to include the Current Speed widgets
def currentSpeedFrame():
    global currentSpeedValue
    CurrentSpeedFrame = tk.Frame(winFrame, bg = 'light grey', bd = 2,
        relief = 'sunken')
    CurrentSpeedFrame.pack()
    CurrentSpeedFrame.place(relx = 0.05, rely = 0.05)
    currentSpeed = tk.Label(CurrentSpeedFrame, text = 'Current speed:',
        width = 24)
    currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
        font = 'Arial 14 bold')
    currentSpeed.grid(column = 0, row = 0)
    # Create Scale widget; define variable to connect to scale widget
    currentSpeedValue = tk.DoubleVar()
    currentSpeedScale = tk.Scale(CurrentSpeedFrame, length = 200,
        from_ = 0, to = 360)
    currentSpeedScale.config(resolution = 0.5,
        activebackground = 'dark blue', orient = 'horizontal')
    currentSpeedScale.config(bg = 'light blue', fg = 'red',
        troughcolor = 'cyan', command = onScale)
    currentSpeedScale.grid(column = 1, row = 0)

    currentSpeedSelected = tk.Label(CurrentSpeedFrame, text = '...',
        textvariable = currentSpeedValue)
    currentSpeedSelected.grid(column = 2, row = 0)

# Create the frame to include the Speed Limit widgets
def speedLimitFrame():
    global speedLimitValue
    global speedLimitSpinbox
    SpeedLimitFrame = tk.Frame(winFrame, bg = 'light yellow', bd = 4,
        relief = 'sunken')
    SpeedLimitFrame.pack()
    SpeedLimitFrame.place(relx = 0.05, rely = 0.30)
    # Create the prompt label on the Speed Limit frame
    speedLimit = tk.Label(SpeedLimitFrame, text = 'Speed limit:', width = 24)
    speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2,
        font = 'Arial 14 bold')
    speedLimit.grid(column = 0, row = 0)
    # Create the Spinbox widget; define variable to connect to Spinbox
    speedLimitValue = tk.DoubleVar()
    speedLimitSpinbox = ttk.Spinbox(SpeedLimitFrame,
        from_ = 0, to = 360, command = getSpeedLimit)
    speedLimitSpinbox.grid(column = 1, row = 0)

    speedLimitSelected = tk.Label(SpeedLimitFrame, text = '...',
        textvariable = speedLimitValue)
    speedLimitSelected.grid(column = 2, row = 0)

# Create the frame to include the Fine per Km widgets
def finePerKmFrame():
    global finePerKmValue
    FinePerKmFrame = tk.Frame(winFrame, bg = 'light blue',
        bd = 4, relief = 'sunken')
    FinePerKmFrame.pack()
    FinePerKmFrame.place(relx = 0.05, rely = 0.55)
    # Create the prompt label on the Fine per Km frame
    finePerKm = tk.Label(FinePerKmFrame, text = 'Fine/Km overspeed (USD):',
        width = 24)
    finePerKm.config(bg = 'light blue', fg = 'brown', bd = 2,
        font = 'Arial 14 bold')
    finePerKm.grid(column = 0, row = 0)
    # Create Scale widget; define variable to connect to Scale widget
    finePerKmValue = tk.IntVar()
    finePerKmScale = ttk.Scale(FinePerKmFrame, orient = 'horizontal',
        length = 200, from_ = 0, to = 100, command = getFinePerKm)
    finePerKmScale.grid(column = 1, row = 0)

    finePerKmSelected = tk.Label(FinePerKmFrame, text = '...',
        textvariable = finePerKmValue)
    finePerKmSelected.grid(column = 2, row = 0)

# Create the frame to include the Fine for speeding
def fineFrame():
    global fine
    FineFrame = tk.Frame(winFrame, bg = 'yellow', bd = 4, relief = 'raised')
    FineFrame.pack()
    FineFrame.place(relx = 0.05, rely = 0.80)
    # Create the label that will display the fine on the Fine frame
    fine = tk.Label(FineFrame, text = 'Fine in USD:...', fg = 'blue')
    fine.grid(column = 0, row = 0)
# ===================================================================
# Create the main frame for the application and run it
winFrame = tk.Tk()
winFrame.title("Control speed")
winFrame.config(bg = 'light grey')
winFrame.resizable(False, False)
winFrame.geometry('500x170')
createGUI()
winFrame.mainloop()
Output 4.3.1:
Conceptually, the script may be divided into three parts. The first part involves the declaration
of the global variables and their initialization, so that they can be used at runtime when the user
interacts with the program (line 7). This is important since the methods implementing the
interaction will be using the same variables dynamically. At this stage, the main frame is also
initialized and formed (lines 139–145), although this is done at the end of the script rather than
in the initial phase. Eventually, a frame is created with a single label placed in it, with the
sole purpose of displaying the calculated fine for speeding (lines 128–137, method fineFrame()).
Observation 4.23 – frames, relx, rely: Use frames for improved control of the interface. Contain
the various widgets of the interface in the relevant frames. Use options relx and rely to place the
frames in specified positions, relative to the main window.
Observation 4.24 – scale: Use the scale widget to create a controlled mechanism that will accept
numerical user input. The tkinter widget has more visual options than the ttk alternative.
The second part includes the creation of the four different frames inside the main frame, and the
placement of the relevant widgets in each of them. These frames are created by means of a call to
the relevant methods, through the createGUI() method (lines 49–54).
In the first case (lines 56–80, method currentSpeedFrame()), the frame is placed inside the main
window frame in a particular position (relx and rely options of the place() method). Next, a label
and a scale widget are placed in the frame. The reader should note the use of the config() method
that defines the background (bg), foreground (fg), borderwidth (bd), and font name and size (font)
of the label. It should also be noted that the label is placed in column 0 and row 0 of the current
frame, and not of the main window frame.
Observation 4.25 – Options: Use the required options, such as activebackground, troughcolor, bg,
fg, to modify the visual attributes of the widget. Use the resolution option to specify the
increment and decrement steps. Use the orient option to specify its orientation (i.e., horizontal
or vertical). Use the from_ = and to = options to set the numerical boundaries of the widget.
In addition to the label, the scale widget is also placed in the frame. It is set to have a length
(length) of 200 pixels, and its values are restricted within a lower boundary of 0 and an upper
boundary of 360. The reader should also observe the use of the config() method that sets the
resolution option of the widget, allowing for user-defined increments (including decimals) of the
values, the activebackground option that sets the color of the widget when it is active, and the
orientation (orient) that can take one of two values: horizontal or vertical. For clarity, the
config() method is used a second time to set some more options for the widget, such as the
background (bg), the foreground (fg), and the troughcolor that sets the color of the trough.
Additionally, another label is placed in the frame in order to display the current value of the
scale widget, as an optional visual aid.
The second frame and the associated label introduce the spinbox widget (lines 82–103, method
speedLimitFrame()). This is also used to control user input when entering numeric values. It is
very similar to the scale widget, allowing for the setting of the lower and upper boundaries of the
accepted values, with two main differences: (a) it is visually different, and (b) the user may
directly enter a value in the textual part of the widget, and/or control it with the
increase/decrease arrows. As in the previous case, another label is added to the frame as an extra
visual aid.
Observation 4.26: Use the spinbox widget to create a controlled mechanism that will accept
numerical user input, while also allowing direct input.
The third frame introduces another scale widget (lines 105–126, method finePerKmFrame()). This
differs from the one used in the first frame in that (a) it is visually different and more
restricted in its visual attributes (i.e., it does not offer several of the tk widget options), and
(b) it belongs to the ttk class/library instead of tk. The reader should notice the distinctly
different visual results of the two scale widgets.
The third part defines the four methods used to control the interaction between the user and the
application (lines 16–46). The reader should note that three of the methods (i.e., onScale(val),
getSpeedLimit(), and getFinePerKm(val)) are directly associated with widgets currentSpeedScale,
speedLimitSpinbox, and finePerKmScale, respectively. This is done through the command option. More
specifically, when the user interacts with a particular widget, the resulting values are captured
and the respective methods are called for the calculation of the fine. In the case of the scale
widget, the value is passed with the call to the method. This is the case for both tk and ttk. The
spinbox passes no argument, which is why getSpeedLimit() reads the value through
speedLimitSpinbox.get(). The reader should observe (a) the use of the set and get methods applied
to the objects of the widgets in order to read and update the widget values, (b) the use of the
casting functions (i.e., float(), int(float())) to control the type of numerical values used in the
calculation, and (c) the declaration of the global variables that must be called and used in the
methods. At the end of each of these methods the calculateFine() method is called to perform the
associated calculation.
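Approach (a), checking the input in code, and the fine formula itself can both be expressed without any widgets. The following is a minimal sketch under stated assumptions: parse_number() and calculate_fine() are hypothetical helper names, and returning None for unacceptable input is a design choice made here, not something the script above does.

```python
def parse_number(text, low, high):
    """Return text as a float if it is a number within [low, high], else None."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        # Alphanumeric or empty input such as '12a' or '' ends up here
        return None
    return value if low <= value <= high else None


def calculate_fine(current_speed, speed_limit, fine_per_km):
    """Apply fine = (current speed - speed limit) x fine per km/h."""
    excess = current_speed - speed_limit
    return excess * fine_per_km if excess > 0 else 0.0


print(parse_number("120.5", 0, 360))     # prints: 120.5
print(parse_number("12a", 0, 360))       # prints: None
print(parse_number("400", 0, 360))       # prints: None
print(calculate_fine(130.0, 100.0, 5.0))  # prints: 150.0
```

A check of this kind is most relevant to the spinbox, since its text field accepts direct keyboard input and float() raises an error on text that is not a number.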
4.3.2 The Listbox and Combobox Widgets inside LabelFrames
Two of the most well-known widgets used in programming are the listbox and the combobox. These
widgets present the user with lines of text as a list, with the purpose of allowing them to make a
selection. This selection can also be used to synchronize the contents between multiple instances
of different widgets. The programmer can be creative as to the appearance of the widgets, as it is
possible to manipulate their visual attributes, even though the basic form cannot be modified. The
main difference between the two widgets is that the former provides an open list whereas the latter
is a collapsed list that opens upon the user’s click. Another widget that can help further enhance
the appearance of an application is the labelframe widget. This widget is similar to the frame
widget, but it allows for a label to be specified on the frame itself, thus removing the need to
create an extra label widget in the frame. Some of the visual attributes of this widget (including
those related to the label font) can be manipulated.
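A compact way to see the three widgets cooperate is to let a combobox selection drive the contents of a listbox inside a labelframe. The sketch below is an illustration under stated assumptions rather than the script of this section: SIZES, items_for(), and build_sync_demo() are hypothetical names, and the item texts are placeholders.

```python
SIZES = (3, 5, 8)  # hypothetical list sizes offered by the combobox


def items_for(size):
    """Return the lines of text to show for the selected size."""
    return [f"Item {n}" for n in range(1, size + 1)]


def build_sync_demo():
    """Create a combobox whose selection repopulates a listbox (requires a display)."""
    # tkinter is imported here so items_for() can be used without a GUI
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Combobox and listbox")
    # The labelframe carries its own caption, so no separate label is needed
    frame = tk.LabelFrame(root, text="Pick a list size")
    frame.grid(column=0, row=0, padx=8, pady=8)
    chooser = ttk.Combobox(frame, values=SIZES, state="readonly")
    chooser.grid(column=0, row=0)
    listing = tk.Listbox(frame, height=8)
    listing.grid(column=0, row=1)

    def refresh(event):
        # Clear the open list and refill it to match the combobox selection
        listing.delete(0, "end")
        for item in items_for(int(chooser.get())):
            listing.insert("end", item)

    # The combobox raises this virtual event when the user picks an entry
    chooser.bind("<<ComboboxSelected>>", refresh)
    return root
```

Calling build_sync_demo().mainloop() on a machine with a display shows the collapsed list (combobox) above the open list (listbox).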
In this section, two additional libraries are introduced: random and time. The former is
introduced in order to use method randint(), which generates random integers, and the latter in
order to use process_time(), which returns the CPU time used by the current process and can
therefore be read at the start and at the end of a particular process to measure its duration.
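The two calls can be tried on their own before they are combined with widgets. The following is a minimal sketch under stated assumptions: timed_sort() is a hypothetical helper, the built-in sorted() is used here for brevity and is not the sorting algorithm of the script that follows, and the optional seed only makes the random numbers repeatable.

```python
import random
import time


def timed_sort(size, low=-100, high=100, seed=None):
    """Generate `size` random integers, sort them, and time the sorting step."""
    rng = random.Random(seed)
    # randint() includes both boundaries, so values lie in [low, high]
    numbers = [rng.randint(low, high) for _ in range(size)]
    start = time.process_time()   # CPU time before the task
    ordered = sorted(numbers)
    elapsed = time.process_time() - start   # CPU time spent on the task
    return numbers, ordered, elapsed


numbers, ordered, elapsed = timed_sort(1000, seed=42)
print('The size of the list is', len(numbers))
print('The sum of the list is', sum(ordered))
print('The time passed to sort the list was', round(elapsed, 5))
```

Because process_time() counts CPU time rather than wall-clock time, time the program spends waiting (for example, while the GUI is idle) is not included in the measurement.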
Observation 4.27 – listbox, combobox: Use the listbox and combobox widgets to display lists of
lines of text, select one or more of these lines, and synchronize their contents as necessary.
Observation 4.28 – labelframe: As with the frame widget, one can use the labelframe widget without
the need to create an extra label for descriptions. The same options as with the frame and label
widgets apply.
Observation 4.29 – randint(): Use the randint() method of the random library to generate random
numbers within a specified range.
Observation 4.30 – process_time(): Use the process_time() method of the time library to mark a
particular moment in time and use it to count the time elapsed for a given process.
The following Python script allows the user to select how many randomly generated integers are
used to populate a listbox. Subsequently, it sorts this list into a second listbox before
displaying the size of the list, the sum of the numbers and their average, and the processing time
for completing the sorting process:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# Import the necessary modules
import tkinter as tk
from tkinter import ttk
from tkinter import *
import random
import time
# Initialise various lists used by the listboxes, comboboxes, & methods
unsortedL = []; sortedL = []; statisticsData = [];
sizes = [5, 20, 100, 1000, 10000, 20000]
global UnsortedList, SortedList
global startTime, endTime, ListSizeSelection, size
global UnsortedListScrollBar, SortedListScrollBar
global EntryFrame, UnsortedFrame, SortedFrame
# Populate the unsorted list with random numbers and
# the unsorted listbox
def populateUnsortedList():
global size
global UnsortedListScrollBar
global UnsortedList
global ListSizeSelection
# Read the number of elements as they are selected from the combobox
size = int(ListSizeSelection.get())
# randint() method of the random class generates random integers
for i in range (size):
n = random.randint(-100, 100)
# Enter the generated random integer to the relevant place in the
# unsorted list
unsortedL.insert(i, n)
# Populate the listbox with the elements of the unsorted list
for i in range (0, size):
UnsortedList.insert(i, unsortedL[i])
UnsortedListScrollBar.config(command = UnsortedList.yview)
# Use Bubble sort to sort the list & record the statistics for later use
def sortToSortedList():
global size, startTime, endTime
global SortedListScrollBar
global SortedList
# Load the unsorted list and listbox to the sorted list and listbox
for i in range (0, size):
sortedL.insert(i, unsortedL[i])
Graphical User Interface Programming
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
133
    # Start the timer
    startTime = time.process_time()
    # The Bubble sort algorithm
    for i in range (0, size-1):
        for j in range (0, size-1):
            if (sortedL[j] > sortedL[j+1]):
                temp = sortedL[j]
                sortedL[j] = sortedL[j+1]
                sortedL[j+1] = temp
    # End the timer
    endTime = time.process_time()
    # Load the sorted list to the relevant listbox
    for i in range (0, size):
        SortedList.insert(i, sortedL[i])
    SortedListScrollBar.config(command = SortedList.yview)

# Clear all lists, listboxes, & comboboxes, & the global size variable
def clearLists():
    global size
    sortedL.clear()
    unsortedL.clear()
    UnsortedList.delete('0', 'end')
    SortedList.delete('0', 'end')
    statisticsData.clear()
    StatisticsCombo.delete('0', 'end')

# Calculate and report the statistics from the sorting process
def statistics():
    global size, startTime, endTime
    statisticsData.clear()
    statisticsData.insert(1, 'The size of the lists is ' + str(size))
    statisticsData.insert(2,'The sum of the lists is '+str(sum(sortedL)))
    statisticsData.insert(3, 'The time passed to sort the list was ' \
        + str(round(endTime - startTime, 5)))
    statisticsData.insert(4, 'The average of the sorted list is: ' \
        + str(round(sum(sortedL) / size, 2)))
    StatisticsCombo['values'] = statisticsData

# ===================================================================
# Define the methods that will create the GUI of the application
def createGUI():
    unsortedFrame()
    entryFrame()
    entryButton()
    sortButton()
    sortedFrame()
    clearButton()
    statisticsButton()
    statisticsSelection()

# Create the labelframe & place the Unsorted Array Listbox widgets in it
def unsortedFrame():
    global unsortedList
    global UnsortedListScrollBar
    global UnsortedList
    global winFrame
    global UnsortedFrame
    UnsortedFrame = tk.LabelFrame (winFrame, text = 'Unsorted Array')
    UnsortedFrame.config(bg='light grey',fg='blue',bd=2, relief='sunken')
    # Create a scrollbar widget to attach to the UnsortedList
    UnsortedListScrollBar = Scrollbar (UnsortedFrame, orient = VERTICAL)
    UnsortedListScrollBar.pack(side = RIGHT, fill = Y)
    # Create the listbox in the Unsorted Array frame
    UnsortedList = tk.Listbox(UnsortedFrame, bg='cyan', width=13, bd=0,
        height = 12, yscrollcommand = UnsortedListScrollBar.set)
    UnsortedList.pack(side = LEFT, fill = BOTH)
    # Associate the scrollbar command with its parent widget,
    # i.e., the UnsortedList yview
    UnsortedListScrollBar.config(command = UnsortedList.yview)
    # Place the Unsorted frame and its parts into the interface
    UnsortedFrame.pack(); UnsortedFrame.place(relx = 0.02, rely = 0.05)

# Create the labelframe to include the Entry widget
def entryFrame():
    global unsortedList
    global UnsortedListScrollBar
    global ListSizeSelection
    global EntryFrame
    global winFrame
    EntryFrame = tk.LabelFrame(winFrame, text = 'Actions')
    EntryFrame.config(bg='light grey', fg='red', bd=2, relief = 'sunken')
    EntryFrame.pack(); EntryFrame.place(relx = 0.25, rely = 0.05)
    # Create the label in the Entry frame
    EntryLabel = tk.Label(EntryFrame,
        text='How many integers\nin the list', width = 16)
    EntryLabel.config(bg = 'light grey', fg='red', bd = 3,
        relief = 'flat', font = 'Arial 14 bold')
    EntryLabel.grid(column = 0, row = 0)
    # Create the combobox to select the number of elements in the lists
    ListSizeSelection = tk.IntVar()
    ListSizeCombo = ttk.Combobox(EntryFrame,
        textvariable=ListSizeSelection, width = 10)
    ListSizeCombo['values'] = sizes
    ListSizeCombo.current(0)
    ListSizeCombo.grid(column = 1, row = 0)

# Create button to insert new entries into the unsorted array & listbox
def entryButton():
    global EntryFrame
    EntryButton = tk.Button(EntryFrame, text = 'Populate\nUnsorted list',
        relief = 'raised', width = 16)
    EntryButton.bind('<Button-1>', lambda event: populateUnsortedList())
    EntryButton.grid(column = 0, row = 2)

# Create the button that will sort the numbers and display them
# in the sorted array and listbox
def sortButton():
    global EntryFrame
    SortButton=tk.Button(EntryFrame,text='Sort numbers\nwith BubbleSort',
        relief = 'raised', width = 16)
    SortButton.bind('<Button-1>', lambda event: sortToSortedList())
    SortButton.grid(column = 1, row = 2)

# Create the labelframe to include the Sorted Array Listbox widgets
def sortedFrame():
    global sortedList
    global SortedListScrollBar
    global SortedList
    global winFrame
    global SortedFrame
    SortedFrame = tk.LabelFrame(winFrame, text = 'Sorted Array')
    SortedFrame.config(bg='light grey', fg='blue', bd=2, relief='sunken')
    # Create a scrollbar widget to attach to the SortedList
    SortedListScrollBar = Scrollbar (SortedFrame)
    SortedListScrollBar.pack(side = RIGHT, fill = Y)
    # Create the listbox in the Sorted Array frame
    SortedList = tk.Listbox (SortedFrame, bg='cyan', width=13, height=12,
        yscrollcommand = SortedListScrollBar.set, bd = 0)
    SortedList.pack(side = LEFT, fill = BOTH)
    # Associate the scrollbar command with its parent widget,
    # i.e., the SortedList yview
    SortedListScrollBar.config(command = SortedList.yview)
    # Place the Sorted frame and its parts into the interface
    SortedFrame.pack(); SortedFrame.place(relx = 0.75, rely = 0.05)

# Create the button that will clear the two listboxes and the two lists
def clearButton():
    global EntryFrame
    ClearButton = tk.Button(EntryFrame, text = 'Clear lists',
        relief = 'raised', width = 16)
    ClearButton.bind('<Button-1>', lambda event: clearLists())
    ClearButton.grid(column = 0, row = 3)

# Create the button that will display the statistics for the sorting
def statisticsButton():
    global EntryFrame
    StatisticsButton = tk.Button(EntryFrame, text = 'Show statistics',
        relief = 'raised', width = 16)
    StatisticsButton.bind('<Button-1>', lambda event: statistics())
    StatisticsButton.grid(column = 1, row = 3)

# Create the option menu that will show the statistical results
# from the sorting process
def statisticsSelection():
    global EntryFrame
    global StatisticsCombo
    StatisticsSelection = tk.StringVar()
    statisticsData = ['The statistics will appear here']
    StatisticsSelection.set(statisticsData[0])
    StatisticsCombo = ttk.Combobox(EntryFrame, width = 30,
        textvariable = StatisticsSelection)
    StatisticsCombo['values'] = statisticsData
    StatisticsCombo.grid(column = 0, columnspan = 2, row = 4)

# ===================================================================
# Create the main frame for the application
winFrame = tk.Tk()
winFrame.title("Bubble Sort"); winFrame.config(bg = 'light grey')
winFrame.resizable(True, True); winFrame.geometry('650x300')
createGUI()
winFrame.mainloop()
Output 4.3.2:
Initially, the necessary libraries are imported (i.e., tkinter, time, and random, lines 2–6). Next, the various lists, variables, and listboxes are initialized (lines 9–14). Note that the lists are not declared as global: the methods only modify them in place (e.g., through insert() and clear()) and never assign to their names, so the module-level objects are reached by reference by default. A global declaration is needed only for names that a method assigns to, such as size, startTime, and endTime. Several names may share a single global statement regardless of their types; the script spreads the declarations over separate lines only for readability. After initialization, the main frame is created and configured in lines 227–229.
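The behaviour of the global statement described here can be checked with a few lines that are independent of tkinter. The following is a minimal sketch; the names numbers, size, label, populate(), and broken_resize() are hypothetical and are not part of the script above.

```python
numbers = []   # only mutated in place, so functions need no global statement
size = 0       # rebound inside populate(), so it must be declared global
label = ''     # a different type, declared in the same global statement

def populate(n):
    global size, label          # one statement, several names and types
    size = n                    # rebinds the module-level name
    label = str(n) + ' items'
    numbers.clear()             # in-place changes reach the shared list
    numbers.extend(range(n))

def broken_resize(n):
    size = n                    # no global statement: this name is local
    return size

populate(3)
broken_resize(10)               # leaves the module-level size untouched
print(numbers, size, label)     # prints: [0, 1, 2] 3 3 items
```

The call to broken_resize() shows why size, startTime, and endTime are declared global in the sorting script: without the declaration, the assignment would only create a local variable.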
The next step is to create the application interface. In this case, the interface consists of two distinct parts. The first includes two listboxes created and placed inside the associated labelframes (lines 103–124 and 170–191). The use of labelframes makes the creation of additional labels unnecessary. The visual properties of the listboxes can be configured through their options, which are almost identical to those of an entry widget. The listboxes can be populated at run time using the insert(index, value) method, and cleared at run time using the delete(index, index) method. Likewise, the properties/options of the labelframes are similar to those of regular frames and labels.

Observation 4.31 – insert(), delete(): Use the insert() and delete() methods to populate or clear a listbox.

The second part is to create the labelframe that hosts the comboboxes and the buttons required in the application. The purpose of the first combobox is to display the number of random integers in the unsorted list. The second one displays basic statistics related to the sorting process: the size of the lists, the sum and average of the integers, and the time required to sort the list. There are three notable observations related to the creation and use of the comboboxes (lines 143–149 and 211–223). Firstly, they must include a ['values'] list which will take its values from an associated list. The latter can be initially empty or populated. Secondly, their selection value (i.e., the textvariable option) must be associated with an object of the IntVar() class (or any similar alternative) that will store it, so that it can be read later through the get() method of that object. Thirdly, the currently selected value must be defined through the current(index) method.

Observation 4.32 – ['values']: Use the ['values'] property to populate a combobox with an initial list of values.

Observation 4.33 – textvariable: Use the textvariable option of the combobox to associate it with an IntVar() object that will store the selected value.

Observation 4.34 – current(): Use the current() method to define the currently selected value of the combobox.

The last step is to create the interaction between the user and the application. For this purpose, four buttons are created and bound with click events to trigger the respective methods. These populate, sort, and clear the relevant lists, and display the basic statistics. The populateUnsortedList() method uses the randint() method to generate random integers, and the insert() method to populate the unsorted list (lines 16–38). It is worth noting the declaration of global variable size, and the use of the get() method to read the value held by the ListSizeSelection object (line 25). The sortToSortedList() method (lines 40–67) declares global variables size, startTime and endTime, uses the process_time() method to mark the start and end of the sort process, and utilizes a common Bubble Sort algorithm to sort the list and populate the SortedList listbox. The clearLists() method uses methods clear() to clear the values of the lists and delete() to delete the values of the listboxes (lines 69–77). Finally, the statistics() method uses methods sum() and round() to produce the basic statistics that will be displayed (lines 79–89).

Observation 4.35 – get(): Use the get() method to read the value stored in an IntVar() object, since the object wraps the value rather than exposing it as a plain variable.

Observation 4.36 – clear(): Use the clear() method to clear the values of the lists.

Observation 4.37 – xview, yview, xscrollcommand, yscrollcommand: Use the scrollbar widget to attach a scrollbar to the associated widget (usually a listbox). Set the command option of the scrollbar to the xview or yview method of the widget to scroll it horizontally or vertically. Set the xscrollcommand or yscrollcommand option of the widget to the set method of the scrollbar to keep the scrollbar in step with the contents.
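The sorting and statistics steps can also be tried outside the GUI. The sketch below is not the script's own code: bubble_sort() is a hypothetical helper that adds two common refinements (a shrinking inner loop and an early exit when a pass makes no swaps), and the fixed seed and list size are assumptions chosen only to make runs repeatable.

```python
import random
import time

def bubble_sort(values):
    """Return a sorted copy of values using Bubble Sort."""
    items = list(values)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        # After pass i, the last i elements are already in their final place
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:      # a pass without swaps means the list is sorted
            break
    return items

random.seed(0)               # assumption: fixed seed for repeatable runs
unsorted = [random.randint(1, 1000) for _ in range(200)]

start = time.process_time()  # CPU time, as in the script above
result = bubble_sort(unsorted)
elapsed = time.process_time() - start

stats = {
    'size': len(result),
    'sum': sum(result),
    'average': round(sum(result) / len(result), 2),
    'seconds': round(elapsed, 5),
}
print(stats)
```

Returning a sorted copy keeps the unsorted list intact, which mirrors the way the script keeps unsortedL and sortedL as separate lists.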
The reader should observe the use of the scrollbar widget introduced in this script. The idea behind this particular widget, and its use, is quite straightforward. Firstly, the labelframe inside which the scrollbar operates is created. Next, the scrollbar is created and connected (packed) to the parent widget (i.e., in this case the associated labelframe), specifying its orientation and positioning. Lastly, the widget/object that will make use of the scrollbar is created and associated with the scrollbar through either yscrollcommand or xscrollcommand (depending on whether the scrollbar orientation is vertical or horizontal respectively), and the scrollbar is configured to scroll the contents of the attached widget (lines 38, 120–124, and 67, 187–191).
4.3.3 GUIs with CheckButtons, RadioButtons and SimpleMessages
In addition to listboxes and comboboxes, there are two more widgets that users of windows-based
applications are familiar with, namely checkbuttons and radiobuttons. These widgets allow the user
to make one or more selections from a set of different available options/actions. Their main difference is that while in the case of checkbuttons the user may select more than one option at any given
time, radiobuttons only allow a single selection from the set of available options. Finally, another
handy widget available in Python the reader should be familiar with is the message widget. In this
section the most basic form of this widget will be introduced and explained.
The following script implements an interface that includes two listboxes with associated, attached vertical scrollbars. The listboxes are populated with the names of various countries and their capital cities. It also includes two entry boxes for accepting new entries to the listboxes. Insertions are triggered using the associated button-click events. The selections in the two listboxes are synchronized when the user double-clicks on either listbox. The interface also includes four buttons that handle the interaction between the application and the user, allowing for the insertion and deletion of particular entries, the clearance of all entries from both containers, and exiting the application. Finally, two checkbuttons control whether the relevant containers are enabled or not, and two radiobuttons whether they are visible:
import tkinter as tk
from tkinter import *
from tkinter import ttk
from tkinter import messagebox

countries = ['E.U.', 'U.S.A.', 'Russia', 'China', 'India', 'Brazil']
# The list is named 'capital' to match its use in populate() and
# buttonsClicked()
capital = ['Brussels', 'Washington', 'Moscow', 'Beijing', 'New Delhi',
    'Brasilia']

global newCountry, newCapital
global CountriesFrame, CapitalFrame
global checkButton1, checkButton2
global radioButton
global CountriesList, CapitalList
global CountriesScrollBar, CapitalScrollBar

# Create the interface for the listboxes
def drawListBoxes():
    global CountriesList, CapitalList
    global CountriesFrame, CapitalFrame
    global CountriesScrollBar, CapitalScrollBar
    # Create CountriesFrame labelframe; place CountriesList widget in it
    CountriesFrame = tk.LabelFrame(winFrame, text = 'Countries')
    CountriesFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
        width = 15, relief = 'sunken')
    # Create a scrollbar widget to attach to the CountriesList
    CountriesScrollBar = Scrollbar(CountriesFrame, orient = VERTICAL)
    CountriesScrollBar.pack(side = RIGHT, fill = Y)
    # Create the listbox in the CountriesFrame; yscrollcommand takes
    # the scrollbar's set method, not the scrollbar itself
    CountriesList = tk.Listbox(CountriesFrame, bg = 'cyan', width = 15,
        height = 8, yscrollcommand = CountriesScrollBar.set)
    CountriesList.pack(side = LEFT, fill = BOTH)
    # Associate the scrollbar command with its parent widget,
    # (i.e., the CountriesList yview)
    CountriesScrollBar.config(command = CountriesList.yview)
    # Place the Countries frame and its parts on the interface
    CountriesFrame.pack(); CountriesFrame.place(relx = 0.03, rely = 0.05)
    CountriesList.bind('<Double-Button-1>',
        lambda event: alignList('countries'))
    # Create the CapitalFrame labelframe; place CapitalList widget on it
    CapitalFrame = tk.LabelFrame(winFrame, text = 'Countries Capital')
    CapitalFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
        width = 13, relief = 'sunken')
    # Create a scrollbar widget to attach to the CapitalFrame
    CapitalScrollBar = Scrollbar(CapitalFrame, orient = VERTICAL)
    CapitalScrollBar.pack(side = RIGHT, fill = Y)
    # Create the listbox in the CapitalFrame
    CapitalList = tk.Listbox(CapitalFrame, bg = 'cyan',
        yscrollcommand = CapitalScrollBar.set, width = 16, height = 8,
        bd = 0)
    CapitalList.pack(side = LEFT, fill = BOTH)
    # Associate the scrollbar command with its parent widget,
    # (i.e., the CapitalList yview)
    CapitalScrollBar.config(command = CapitalList.yview)
    CapitalFrame.pack(); CapitalFrame.place(relx = 0.70, rely = 0.05)
    CapitalList.bind('<Double-Button-1>',
        lambda event: alignList('capital'))

# Create the interface for the new entries
def drawNewEntries():
    global newCountry, newCapital
    # Create the labelframe and place the newCountry entry widget on it
    NewCountryFrame = tk.LabelFrame(winFrame, text = 'New Country')
    NewCountryFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
        width = 13, relief = 'sunken')
    NewCountryFrame.pack(); NewCountryFrame.place(relx= 0.03, rely = 0.75)
    newCountry = tk.StringVar(); newCountry.set('')
    NewCountryEntry = tk.Entry(NewCountryFrame, textvariable = newCountry,
        width = 15)
    NewCountryEntry.config(bg= 'dark grey', fg = 'red', relief = 'sunken')
    NewCountryEntry.grid(row = 0, column = 0)
    # Create the labelframe and place the newCapital entry widget on it
    NewCapitalFrame = tk.LabelFrame(winFrame, text = 'New Capital')
    NewCapitalFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
        width = 13, relief = 'sunken')
    NewCapitalFrame.pack(); NewCapitalFrame.place(relx= 0.70, rely = 0.75)
    newCapital = tk.StringVar(); newCapital.set('')
    NewCapitalEntry = tk.Entry(NewCapitalFrame, textvariable = newCapital,
        width = 15)
    NewCapitalEntry.config(bg= 'dark grey', fg = 'red', relief = 'sunken')
    NewCapitalEntry.grid(row = 0, column = 0)

# Create the interface for the action buttons
def drawButtons():
    # Create the frame that will host the buttons
    ButtonsFrame = tk.Frame(winFrame)
    ButtonsFrame.config(bg= 'light grey', bd=2, width=14, relief='sunken')
    ButtonsFrame.pack(); ButtonsFrame.place(relx = 0.30, rely = 0.07)
    newRecordButton = tk.Button(ButtonsFrame, text = 'Insert\nnew record',
        width = 11, height = 2)
    newRecordButton.grid(row = 0, column = 0)
    newRecordButton.bind('<Button-1>', lambda event,
        a = 'insertRecord': buttonsClicked(a))
    deleteRecordButton = tk.Button (ButtonsFrame,
        text = 'Delete\n record', width = 11, height = 2)
    deleteRecordButton.grid (row = 0, column = 1)
    deleteRecordButton.bind('<Button-1>', lambda event,
        a = 'deleteRecord': buttonsClicked(a))
    clearRecordsButton = tk.Button (ButtonsFrame,
        text = 'Clear\n records', width = 11, height = 2)
    clearRecordsButton.grid (row = 1, column = 0)
    clearRecordsButton.bind('<Button-1>', lambda event,
        a = 'clearAllRecords': buttonsClicked(a))
    exitButton = tk.Button(ButtonsFrame, text='Exit', width=11, height=2)
    exitButton.grid (row = 1, column = 1)
    # Both calls run only when the button is clicked: destroy() removes
    # the main window and exit() then ends the script
    exitButton.bind('<Button-1>',
        lambda event: (winFrame.destroy(), exit()))

# Create the interface for the checkbuttons
def drawCheckButtons():
    global checkButton1, checkButton2
    # Create the frame that will host the checkbuttons
    CheckButtonsFrame = tk.Frame(winFrame)
    CheckButtonsFrame.config(bg = 'light grey', bd = 2, relief = 'sunken')
    CheckButtonsFrame.pack();CheckButtonsFrame.place(relx=0.34, rely=0.43)
    checkButton1 = IntVar(value = 1)
    CountriesCheckButton = tk.Checkbutton (CheckButtonsFrame,
        variable = checkButton1, text = 'Countries \nenabled/disabled',
        bg = 'light blue', onvalue = 1, offvalue = 0, width = 15,
        height = 2, command = checkClicked).grid(row = 0, column = 0)
    checkButton2 = IntVar(value = 1)
    CapitalCheckButton = tk.Checkbutton (CheckButtonsFrame,
        variable = checkButton2, onvalue = 1, offvalue = 0,
        text = 'Capitals \nenabled/disabled', width = 15, height = 2,
        bg = 'light blue', command = checkClicked).grid (row=1, column=0)

# Create the interface for the radiobuttons
def drawRadioButtons():
    global radioButton
    # Create the frame that will host the radiobuttons
    RadioButtonsFrame = tk.Frame(winFrame)
    RadioButtonsFrame.config(bg = 'light grey', bd = 2, relief = 'sunken')
    RadioButtonsFrame.pack();RadioButtonsFrame.place(relx=0.31, rely=0.78)
    radioButton = IntVar()
    visibleRadioButton = tk.Radiobutton (RadioButtonsFrame,
        text = 'Containers \nvisible', width = 8, height = 2,
        bg = 'light green', variable = radioButton, value = 1,
        command = radioClicked).grid(row = 0, column = 0)
    invisibleRadioButton = tk.Radiobutton (RadioButtonsFrame,
        text = 'Containers \ninvisible', width = 8, height = 2,
        bg = 'light green', variable = radioButton, value = 2,
        command = radioClicked).grid(row = 0, column = 1)
    radioButton.set(1)

# Define method alignList that will identify the selected row
# in either listbox and select the corresponding row in the other
def alignList(a):
    global CountriesList, CapitalList
    global selectedIndex
    if (a == 'countries'):
        selectedIndex = int(CountriesList.curselection()[0])
        CapitalList.selection_set(selectedIndex)
    if (a == 'capital'):
        selectedIndex = int(CapitalList.curselection()[0])
        CountriesList.selection_set(selectedIndex)

# Define checkClicked method to control the state of the containers
def checkClicked():
    global checkButton1, checkButton2
    # Control the state of the containers as NORMAL or DISABLED
    # based on the state of the checkbuttons
    if (checkButton1.get() == 1):
        CountriesList.config(state = NORMAL)
    else:
        CountriesList.config(state = DISABLED)
    if (checkButton2.get() == 1):
        CapitalList.config(state = NORMAL)
    else:
        CapitalList.config(state = DISABLED)

# Define the radioClicked method that will display or hide the frames
# of the containers
def radioClicked():
    global CountriesFrame, CapitalFrame
    global radioButton
    # Use the destroy() method to destroy the frames of the containers.
    # The lists are not destroyed
    CountriesFrame.destroy()
    CapitalFrame.destroy()
    if (radioButton.get() == 1):
        drawListBoxes()
        populate()

# Populate the listboxes
def populate():
    global CountriesList, CapitalList
    global selectedIndex
    for i in range (int(len(countries))):
        CountriesList.insert(i, countries[i])
    for i in range (int(len(capital))):
        CapitalList.insert(i, capital[i])

# Define method buttonsClicked that will trigger the corresponding code
# when any of the buttons is clicked
def buttonsClicked(a):
    global CountriesList, CapitalList
    global newCountry, newCapital
    global selectedIndex
    if (a == "insertRecord"):
        # Compare the text held by the StringVar objects, not the objects
        if (newCountry.get() != '' and newCapital.get() != ''):
            countries.append(newCountry.get()); CountriesList.delete('0',
                'end')
            capital.append(newCapital.get());CapitalList.delete('0','end')
            # Call method populate() to re-populate the containers
            # with the renewed lists
            populate()
    if (a == 'deleteRecord'):
        # Use messagebox.askyesno() to pop a confirmation message
        # for deleting the elements
        deleteElementOrNot=messagebox.askyesno(title="Delete element",
            message="Are you ready to delete the elements?", icon='info')
        if (deleteElementOrNot == True):
            # Use the pop() method to remove selected elements from the lists
            countries.pop(selectedIndex); capital.pop(selectedIndex)
            CountriesList.delete('0', 'end'); CapitalList.delete('0', 'end')
            # Call method populate() to re-populate the containers
            # with the renewed lists
            populate()
    if (a == 'clearAllRecords'):
        # Use messagebox.askyesno() to pop a confirmation message
        # for clearing the lists
        clearListsOrNot=messagebox.askyesno(title="Clear all elements",
            message = "Are you ready to clear the lists?", icon = 'info')
        if (clearListsOrNot == True):
            countries.clear(); capital.clear()
            CountriesList.delete('0', 'end'); CapitalList.delete('0', 'end')
            # Call method populate() to re-populate the containers
            # with the renewed lists
            populate()

# Create the frame for the Countries program and configure its size
# and background color
winFrame = tk.Tk()
winFrame.title ('Countries')
winFrame.geometry("500x250")
winFrame.config (bg = 'light grey')
winFrame.resizable(False, False)

# Create the Graphical User Interface
drawListBoxes()
drawNewEntries()
drawButtons()
drawCheckButtons()
drawRadioButtons()

# Call populate() to populate the listboxes
populate()
winFrame.mainloop()
Output 4.3.3:
As in previous examples, the first part of the application deals with drawing the interface. In this particular case this task is assigned to methods drawListBoxes(), drawNewEntries(), drawButtons(), drawCheckButtons(), and drawRadioButtons(). Method drawListBoxes() (lines 16–55) creates the relevant frames and containers. The reader should note the call to method alignList() that causes the contents of the two containers to be aligned, and the use of the relx and rely options that position the respective frames in the appropriate places within the interface. The drawNewEntries() method (lines 56–80) creates the entry widgets that will accept the user’s input for new entries. Observe how the entry widgets are associated with the respective StringVar() objects that allow the use of the input through the appropriate set() and get() methods. Similarly, the drawButtons() method (lines 82–110) creates the frame and places the buttons that perform the basic actions of the application (i.e., insert a new entry, delete a selected entry, clear all contents of the containers, and exit the application). In the case of the Exit button in particular, one should note the use of the destroy() method that destroys the interface of the main window, and the exit() method that exits the application.

Observation 4.38 – destroy(), exit(): Use methods destroy() to destroy the interface (i.e., the widgets of the particular frame it applies to) and exit() to exit the application.

The drawCheckButtons() method (lines 112–131) creates the frame for the checkbutton widgets. Notice how each of the checkbuttons is associated (bound) with a separate IntVar() object to monitor its state (i.e., onvalue = 1 if it is checked or offvalue = 0 if it is unchecked). The reader should also notice that when the user checks/unchecks the checkbutton the checkClicked() method is triggered through the command option. This is in order to control the appearance of the respective container. Likewise, in the case of drawRadioButtons() (lines 133–153), two radiobuttons are placed in the relevant frame and trigger the radioClicked() method through the command option. This controls the appearance of the containers as a whole. It is important to note that in such cases where multiple radiobuttons are associated/bound with the same IntVar() object, only one can be selected.

Observation 4.39 – checkbutton, onvalue, offvalue: Use the checkbutton widget to offer selection options. Each option is represented by a separate widget. If an option is selected, the widget is given an onvalue, otherwise it is given an offvalue through the associated IntVar() object.

Observation 4.40 – radiobutton: Use the radiobutton widget to offer a number of mutually exclusive options. Each option is represented by a different widget. If an option is selected, the widgets are given a particular value through the associated IntVar() object.

Observation 4.41 – command, checkbutton, radiobutton: Use the command option to trigger a particular action when any of the checkbutton or radiobutton widgets are selected.

The second part of the application deals with the interactions that take place between the interface and the user and their results, through the use of methods alignList(), checkClicked(), radioClicked(), populate(), and buttonsClicked(). In the case of alignList() (lines 155–167), the curselection() method is applied to the relevant container (listbox) to identify the element of the container that was selected. Since the method returns a tuple, it is necessary to limit the result to the first element of the tuple (i.e., the [0] value). Once the element of the container is identified through its index, the selection_set() method is executed on the other container, which aligns the two listboxes based on the selection. Ultimately, this process synchronizes the two containers.

Observation 4.42 – curselection(): Use methods curselection() to identify the selected element from a listbox and selection_set() to select a particular indexed element.

Observation 4.43 – state, NORMAL, DISABLED: Use the state option to determine whether a particular listbox is enabled (NORMAL) or disabled (DISABLED).
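Because curselection() returns a tuple, and that tuple is empty when no row is selected, indexing it with [0] can fail. The following sketch shows one way to guard the lookup used in alignList(); first_selected() is a hypothetical helper, and the tuples passed to it only stand in for the values a listbox would return.

```python
def first_selected(selection):
    """Return the first selected index, or None when nothing is selected."""
    return int(selection[0]) if selection else None

# Tuples shaped like the results of a listbox's curselection() call
print(first_selected((2,)))      # one row selected        -> 2
print(first_selected((1, 4)))    # several rows selected   -> 1
print(first_selected(()))        # nothing selected        -> None
```

A caller can then skip the selection_set() step whenever the helper returns None, instead of letting an IndexError surface in the event handler.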
In the case of the checkClicked() method (lines 169–183) the reader should note the
following:
• The use of the state option and its two possible values (i.e., NORMAL and DISABLED),
which determine whether the associated widget will be enabled or not. More specifically,
NORMAL dictates that the user is allowed to click in the relevant container and select one
or more of its elements and DISABLED the opposite.
• The use of the get() method to access the value of objects checkButton1 and checkButton2. The reader is reminded that the values held by these IntVar() objects are read through get() rather than accessed directly. The checkButton1 and checkButton2 objects are declared as global so that the method refers to the original objects created in drawCheckButtons().
In the case of the radioClicked() method (lines 185–198), frames CountriesFrame and CapitalFrame are destroyed alongside their containers/listboxes (i.e., CountriesList and CapitalList) and are only recreated and repopulated if the user selects the appropriate visibleRadioButton from the interface (i.e., assigning a value of 1 to the radioButton object).

Finally, the buttonsClicked() method (lines 211–249) has three main tasks. Firstly, it inserts a new element in each of the listboxes when the user clicks the Insert button. In this case, the values of the newCountry and newCapital entry widgets are checked and, if not empty, appended to the relevant lists. Notice that it is preferable to append to the lists and not the listboxes, as the former host the actual values. The listboxes are repopulated only after this task is completed.

Observation 4.44 – append(), delete(), clear(): Use methods append() to insert a new element at the end of a list, pop() to remove the element at a given index from a list, and clear() to remove all the elements of a list. Use delete() to remove a range of entries from a listbox.
Secondly, the method has the task of deleting the selected elements from the listboxes when the user clicks the Delete button. In this case, as long as an element of the listboxes is selected, a simple messagebox pops up to confirm the user’s choice. Notice that the askyesno() method provides one of the simplest available forms of messages, and returns either True or False. The programmer can use these values to determine further actions. The reader should note that the messagebox module is part of the tkinter library. It is also noteworthy that the delete() method is used in the code to initially clear the listboxes of their contents, so that they can subsequently be re-populated from the updated lists. This particular method accepts as arguments the first and the last index in the range of elements that should be deleted from the listbox. Similarly, a third task is to completely clear the listboxes of their contents. For this purpose, the clear() method is applied to both lists (but not the listboxes), provided that confirmation is given by the user through another simple messagebox interaction.

Observation 4.45 – askyesno(): Use the appropriate messagebox module method (e.g., askyesno()) to confirm the user’s choice.

In all the cases discussed above, the populate() method (lines 200–209) is responsible for reading the lists and using their contents to populate the listboxes.
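The idea that the lists hold the actual values, while the listboxes are only refreshed from them, can be sketched without any widgets. In the sketch below, insert_record(), delete_record(), and clear_records() are hypothetical helpers that mirror the three tasks of buttonsClicked(); the shortened lists are sample data.

```python
countries = ['E.U.', 'U.S.A.', 'Russia']
capitals = ['Brussels', 'Washington', 'Moscow']

def insert_record(country, capital):
    """Append a pair to both lists only when both fields are non-empty."""
    if country != '' and capital != '':
        countries.append(country)
        capitals.append(capital)
        return True
    return False

def delete_record(index):
    """Remove the pair at index from both lists, keeping them aligned."""
    if 0 <= index < len(countries):
        countries.pop(index)
        capitals.pop(index)
        return True
    return False

def clear_records():
    """Empty both lists in place."""
    countries.clear()
    capitals.clear()

insert_record('China', 'Beijing')
insert_record('', 'Nowhere')      # rejected: the country field is empty
delete_record(1)                  # removes U.S.A. and Washington together
print(list(zip(countries, capitals)))
```

Since every change goes through the two lists together, a display routine such as populate() only needs to clear the listboxes and re-insert the current list contents.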
4.4 BASIC AUTOMATION AND USER INPUT CONTROL
A common characteristic of visual programming is the creation of the illusion that the application objects/widgets change shape, content, or status, either automatically or based on the user’s
input. If an object/widget is to be activated and put in operation automatically, the
programmer needs to associate it with a respective time-controlled event. The latter enables the
programmer to change the properties of the object/widget at run time, through the activation and
execution of appropriate blocks of code that are based on the time-controlled event.
In this section, the reader will have the opportunity to get some exposure to the creation of
applications that manipulate objects/widgets without the user’s input, or with interactions of a different type than direct written input or button-click events. Throughout the section, a basic Traffic Lights
application is gradually developed toward a primitive, but informative, automated user experience.
4.4.1 Traffic Lights Version 1 – Basic Functionality
The Traffic Lights sample project can start by creating a very basic application that uses three
images (loaded in labels) displaying a green, a yellow, and a red traffic light, respectively. The
three images can be programmed to appear and disappear based on the user’s selection. The following
Python script creates this interface and implements the related interactions:
 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import *
 4  # Import the necessary image processing classes
 5  from PIL import Image, ImageTk
 6
 7  global radioButton
 8  global image1, image2, image3
 9  global photo1, photo2, photo3
10  global winLabel1, winLabel2, winLabel3
11  global winFrame
12
13  # Create the main frame
14  winFrame = tk.Tk()
15  winFrame.title("Traffic Lights v1")
16
17  # Create the interface with the images and labels
18  def photos():
19      global radioButton
20      global image1, image2, image3
21      global photo1, photo2, photo3
22
23      image1 = Image.open("TrafficLightsGreen.gif")
24      image1 = image1.resize((50, 100), Image.LANCZOS)
25      photo1 = ImageTk.PhotoImage(image1)
26
27      winLabel1=tk.Label(winFrame,text='', image=photo1, compound='left')
28      winLabel1.grid(row = 0, column = 0)
29
30      image2 = Image.open("TrafficLightsYellow.gif")
31      image2 = image2.resize((50, 100), Image.LANCZOS)
32      photo2 = ImageTk.PhotoImage(image2)
33
34      winLabel2 = tk.Label(winFrame,text='',image=photo2,compound='left')
35      winLabel2.grid(row = 0, column = 1)
36
37      image3 = Image.open("TrafficLightsRed.gif")
38      image3 = image3.resize((50, 100), Image.LANCZOS)
39      photo3 = ImageTk.PhotoImage(image3)
40
41      winLabel3 = tk.Label(winFrame,text='',image=photo3,compound='left')
42      winLabel3.grid(row = 0, column = 2)
43
44      # Control active traffic lights based on the radio button selection
45      if (radioButton.get() == 1):
46          winLabel2.destroy()
47          winLabel3.destroy()
48
49      if (radioButton.get() == 2):
50          winLabel1.destroy()
51          winLabel3.destroy()
52
53      if (radioButton.get() == 3):
54          winLabel1.destroy()
55          winLabel2.destroy()
56
57  # Create the radio button interface
58  def drawRadioButtons():
59      global radioButton
60
61      visibleGreenRadioButton = tk.Radiobutton (winFrame, text = 'Green',
62          width=17, height=1, bg = 'light grey', variable = radioButton,
63          value = 1, command = photos).grid(row = 1, column = 0)
64
65      visibleYellowRadioButton = tk.Radiobutton(winFrame, text='Yellow',
66          width= 17, height= 1, bg= 'light grey', variable = radioButton,
67          value = 2, command = photos).grid(row = 1, column = 1)
68
69      visibleRedRadioButton = tk.Radiobutton (winFrame, text = 'Red',
70          width= 17, height= 1, bg= 'light grey', variable = radioButton,
71          value = 3, command = photos).grid(row = 1, column = 2)
72
73  radioButton = IntVar()
74  photos()
75  drawRadioButtons()
76
77  winFrame.mainloop()
Output 4.4.1:
The output demonstrates the two main parts of the application. In the first part, the photos() method loads the three images and controls their visibility within the interface (lines 17–55). The reader will notice that part of the method is the destruction of two of the images, in order to leave only one on display (lines 44–56). For this task, the reader might also consider using the grid_remove() method (covered in previous sections), which produces the same visual result without destroying the widgets.
The second part controls which of the three images will be displayed. Once the desired radio button has been clicked, the corresponding image stays on display and the other two are hidden (lines 57–71). It is worth noting that all three radio buttons are associated with the same variable. This is reflected in the fact that they cancel each other when selected, as the value of the common associated object is altered.
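The grid_remove() alternative mentioned above can be sketched as follows. The helper show_only() and the demo() function are hypothetical names, not part of the original script. The sketch assumes that the three labels are created and gridded only once, and are then shown or hidden on each radio button click. Coloured text labels stand in for the traffic light images, so no image files are needed.

```python
def show_only(labels, selected):
    """Show the label whose number equals `selected` and hide the others.

    `labels` maps the radio button values (1, 2, 3) to widgets that have
    already been placed with grid(). If `selected` matches no label
    (e.g., 0 before any click), every label stays visible.
    """
    for value, label in labels.items():
        if selected in labels and value != selected:
            label.grid_remove()   # hide, but keep the grid options
        else:
            label.grid()          # restore in the remembered cell


def demo():
    """Small GUI demo; requires a display, so it is not run on import."""
    import tkinter as tk

    root = tk.Tk()
    root.title("grid_remove() sketch")
    choice = tk.IntVar(value=0)
    labels = {}
    for value, colour in enumerate(("green", "yellow", "red"), start=1):
        # Coloured text labels stand in for the traffic light images
        labels[value] = tk.Label(root, text=colour, bg=colour, width=10)
        labels[value].grid(row=0, column=value - 1)
        tk.Radiobutton(root, text=colour.title(), variable=choice,
                       value=value,
                       command=lambda: show_only(labels, choice.get())
                       ).grid(row=1, column=value - 1)
    root.mainloop()
```

Calling demo() opens the window. Because the widgets are hidden rather than destroyed, the images do not have to be reloaded on every click.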
4.4.2 Traffic Lights Version 2 – Creating a Basic Illusion
Taking things one step further, the application is changed in such a way as to make only one image appear instead of three. The impression that there is only one image is of course illusory, as it is essentially caused by manipulating the visual properties of the associated widget and/or its position in the interface. In this case, the traffic images are stacked upon each other using the same grid coordinates, and, subsequently, two of them are removed from the interface.
This version is almost identical to the original one, with the exception of the positioning of the
widgets and the slightly modified title. The proposed modification only requires the replacement
of lines 15, 35, 42, 62, 66–67, and 70–71 with the ones provided below, which are only different in
terms of their grid coordinates and width:
15  winFrame.title ("Traffic Lights v2"); winFrame.geometry("200x180")
    [...]
35      winLabel2.grid(row = 0, column = 0)
    [...]
42      winLabel3.grid(row = 0, column = 0)
    [...]
62          width = 20, height = 1, bg = 'light grey', variable = radioButton,
    [...]
66          width = 20, height = 1, bg = 'light grey', variable = radioButton,
67          value = 2, command = photos).grid(row = 2, column = 0)
    [...]
70          width = 20, height = 1, bg = 'light grey', variable = radioButton,
71          value = 3, command = photos).grid(row = 3, column = 0)
Output 4.4.2
4.4.3 Traffic Lights Version 3 – Creating a Primitive Automation
In this version of the sample application, there is no need for the user to click on the respective radio buttons in order to cause the traffic light images to appear/disappear. The change happens automatically 3 seconds after one of the images is turned on (and the other two are turned off).
In order to enable timed functionality, in addition to the libraries used in the previous versions, the time library must be imported into the script.
This version differs from the previous ones in a number of ways:
• The radio buttons that were dealing with the interaction are removed, and a new manageLabels() function is introduced to control the automated process of traffic light changes.
• Every time there is a change of the displayed image, the time.sleep() function (time
library) is used to freeze the execution of the application for a given period of time (in this
case 3 seconds).
• Since there are no radio buttons, the application uses another object (trafficLight) to control which image is displayed. This is accomplished by setting its value through the set() method.
• The update() function is applied to the main frame in order to refresh the interface based
on the latest status update.
The complete script is provided below:
 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import *
 4  # Import the necessary image processing classes
 5  from PIL import Image, ImageTk
 6  # Import the time library for the timed delays
 7  import time
 8
 9  global image1, image2, image3
10  global photo1, photo2, photo3
11  global winLabel1, winLabel2
12  global winFrame
13  global trafficLight
14
15  # Open the traffic images and create the relevant pointers
16  def photos():
17      global image1, image2, image3
18      global photo1, photo2, photo3
19
20      image1 = Image.open("TrafficLightsGreen.gif")
21      image1 = image1.resize((50, 100), Image.LANCZOS)
22      photo1 = ImageTk.PhotoImage(image1)
23
24      image2 = Image.open("TrafficLightsYellow.gif")
25      image2 = image2.resize((50, 100), Image.LANCZOS)
26      photo2 = ImageTk.PhotoImage(image2)
27
28      image3 = Image.open("TrafficLightsRed.gif")
29      image3 = image3.resize((50, 100), Image.LANCZOS)
30      photo3 = ImageTk.PhotoImage(image3)
31
32
33  # Manage label visibility based on time.
34  def manageLabels():
35      global winLabel1, winLabel2
36      global photo1, photo2, photo3
37      global winFrame
38      global trafficLight
39
40      if (trafficLight.get() == 1):
41          winLabel1.config(image = photo1)
42          winLabel1.grid(row = 0, column = 0)
43          winLabel2.config(text = 'Green')
44          time.sleep(3)
45
46      if (trafficLight.get() == 2):
47          winLabel1.config(image = photo2)
48          winLabel1.grid(row = 0, column = 0)
49          winLabel2.config(text = 'Yellow')
50          time.sleep(3)
51
52      if (trafficLight.get() == 3):
53          winLabel1.config(image = photo3)
54          winLabel1.grid(row = 0, column = 0)
55          winLabel2.config(text = 'Red')
56          time.sleep(3)
57      winFrame.update()
58
59  # Create the main frame
60  winFrame = tk.Tk()
61  winFrame.title ("Traffic Lights v3"); winFrame.geometry("200x180")
62
63  photos()
64
65  winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
66  winLabel1.grid(row = 0, column = 0)
67  winLabel2=tk.Label(winFrame,text='...'); winLabel2.grid(row=1,column=0)
68
69  trafficLight = IntVar()
70  trafficLight.set(1)
71
72  while (True):
73      if (trafficLight.get() == 1):
74          trafficLight.set(2)
75      elif (trafficLight.get() == 2):
76          trafficLight.set(3)
77      elif (trafficLight.get() == 3):
78          trafficLight.set(1)
79      manageLabels()
80
81  winFrame.mainloop()
Output 4.4.3:
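The cycle implemented by the while loop can also be expressed as a small lookup table. The sketch below is an illustration under assumed names (NEXT_LIGHT, LIGHT_NAMES, and run_cycle() are not part of the original script). It collects the sequence of states instead of updating labels, and takes the delay as a parameter so that it can be tried without waiting.

```python
import time

# State numbers follow the script: 1 = green, 2 = yellow, 3 = red
NEXT_LIGHT = {1: 2, 2: 3, 3: 1}
LIGHT_NAMES = {1: "Green", 2: "Yellow", 3: "Red"}


def run_cycle(start, changes, delay=3.0):
    """Return the states visited after `changes` light changes.

    The function pauses for `delay` seconds after each change, which
    mirrors the time.sleep(3) calls in manageLabels().
    """
    state = start
    visited = []
    for _ in range(changes):
        state = NEXT_LIGHT[state]
        visited.append(LIGHT_NAMES[state])
        time.sleep(delay)
    return visited


# With delay=0 the sequence can be inspected immediately
print(run_cycle(start=1, changes=4, delay=0))
# → ['Yellow', 'Red', 'Green', 'Yellow']
```

Note that while time.sleep() is running the window cannot respond to events, which is why the script calls winFrame.update() explicitly after each pause.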
4.4.4 Traffic Lights Version 4 – A Primitive Screen Saver with a Progress Bar
Having introduced the concept of timed events and how they can be used to control the flow of
events in an application, it is rather straightforward to expand the same idea to the creation of an
illusory movement of particular objects inside a frame. A good example of this is the creation of a
primitive screen saver using the existing Traffic Lights application as a basis.
In addition to the existing widgets, an additional widget that can be used in this scenario is the progressbar widget. This will assist in making the screen saver a bit more informative, by providing clues about the elapsed and remaining time in any particular condition (i.e., green, yellow, and red traffic light). The widget belongs to the ttk library and can take several parameters that control its appearance and functionality, with the most important ones being length, orient, and mode. Length determines the size (i.e., length) of the progress bar, orient the orientation of the widget (i.e., VERTICAL or HORIZONTAL), and mode whether the displayed value is predetermined (“determinate”) or not (“indeterminate”). In the case of the former, the bar will appear moving toward one end of the widget until the specified value is reached, while in the case of the latter the bar will appear moving continuously from one end to the other and back.
Observation 4.46 – progressbar: Use the progressbar widget to display the progress of an event or task that takes a particular amount of time to complete. Progressbars can be horizontal or vertical, and can have a predetermined (determinate) or undetermined (indeterminate) value.
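The two modes can be compared side by side with a short sketch. The names progress_percent() and demo() are assumptions made for this illustration, as are the bar length and the timing values. The percentage formula is the one used for the determinate bar in the Traffic Lights script, and the sketch relies on the widget's after() scheduling method to advance the determinate bar without blocking the window.

```python
def progress_percent(step, total_steps):
    """Percentage shown after `step` (0-based) out of `total_steps` steps."""
    if total_steps < 2:
        return 100
    return int(step / (total_steps - 1) * 100)


def demo():
    """Show one determinate and one indeterminate bar (needs a display)."""
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Progressbar sketch")

    # Determinate: the value is set explicitly, here in ten steps
    determinate = ttk.Progressbar(root, length=200, orient=tk.HORIZONTAL,
                                  mode="determinate", maximum=100)
    determinate.pack(padx=10, pady=10)

    # Indeterminate: the bar bounces back and forth until stop() is called
    indeterminate = ttk.Progressbar(root, length=200, orient=tk.HORIZONTAL,
                                    mode="indeterminate")
    indeterminate.pack(padx=10, pady=10)
    indeterminate.start(20)       # move the bar every 20 milliseconds

    def advance(step=0, total_steps=10):
        determinate["value"] = progress_percent(step, total_steps)
        if step < total_steps - 1:
            root.after(300, advance, step + 1, total_steps)

    advance()
    root.mainloop()


print([progress_percent(i, 10) for i in range(10)])
```

Calling demo() displays both bars. The determinate bar fills in ten steps, while the indeterminate one keeps moving until the window is closed.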
The following script implements a related example, where the three traffic lights control the movement of a car image (embedded in a label widget). When the green light is on, the car moves at a particular speed, and when the yellow light is on it moves at half that speed. Similarly, when the red light is on, the car appears to stop and the progressbar appears to be loading to reflect the elapsed time in this particular condition (i.e., red light) and remaining time until the next condition is triggered (i.e., green light). The car image appears to be bouncing across the frame, moving in a different direction every time it reaches the edges of the parent frame. The movement of the car image is always diagonal, and follows four different directions. The program stops when the user interrupts (closes) the application. The associated Python script is provided below:
  1  # Import libraries
  2  import tkinter as tk
  3  from tkinter import ttk
  4  from tkinter import *
  5  # Import the necessary image processing classes
  6  from PIL import Image, ImageTk
  7
  8  # Import the time library for the timed delays
  9  import time
 10
 11  global trafficLight
 12  global image1, image2, image3
 13  global photo1, photo2, photo3
 14  global winLabel1, winLabel2, winLabel3
 15  global direction
 16  global posx, posy
 17  global winFrame
 18  global progressBar
 19
 20  # Open the traffic and car images and create the relevant pointers
 21  def photos():
 22      global image1, image2, image3, image4
 23      global photo1, photo2, photo3, photo4
 24
 25      image1 = Image.open("TrafficLightsGreen.gif")
 26      image1 = image1.resize((50, 100), Image.LANCZOS)
 27      photo1 = ImageTk.PhotoImage(image1)
 28
 29      image2 = Image.open("TrafficLightsYellow.gif")
 30      image2 = image2.resize((50, 100), Image.LANCZOS)
 31      photo2 = ImageTk.PhotoImage(image2)
 32      image3 = Image.open("TrafficLightsRed.gif")
 33      image3 = image3.resize((50, 100), Image.LANCZOS)
 34      photo3 = ImageTk.PhotoImage(image3)
 35
 36      image4 = Image.open("Car.gif")
 37      image4 = image4.resize((30, 15), Image.LANCZOS)
 38      photo4 = ImageTk.PhotoImage(image4)
 39
 40  # Manage label visibility based on time
 41  def manageLabels():
 42      global trafficLight
 43      global winLabel1, winLabel2
 44      global photo1, photo2, photo3
 45      global winFrame
 46
 47      if (trafficLight.get() == 1):
 48          winLabel1.config(image=photo1)
 49          winLabel2.config(text='Green'); a=1
 50      elif (trafficLight.get() == 2):
 51          winLabel1.config(image=photo2)
 52          winLabel2.config(text='Yellow'); a=2
 53      elif (trafficLight.get() == 3):
 54          winLabel1.config(image = photo3)
 55          winLabel2.config(text = 'Red'); a = 3
 56
 57      winLabel1.pack(); winLabel1.place(x = 1, y = 1)
 58      winLabel2.pack(); winLabel2.place(x = 1, y = 100)
 59      winFrame.update()
 60
 61      # Call method moveCar()to move the image within the interface
 62      moveCar(a)
 63
 64  # Control the direction of the movement
 65  def checkDirection():
 66      global direction
 67      global posx, posy
 68
 69      if (posx >= 400 and direction == 1):
 70          direction = 2
 71      elif (posx >= 400 and direction == 4):
 72          direction = 3
 73      elif (posx <= 0 and direction == 2):
 74          direction = 1
 75      elif (posx <= 0 and direction == 3):
 76          direction = 4
 77      elif (posy <= 0 and direction == 3):
 78          direction = 2
 79      elif (posy <= 0 and direction == 4):
 80          direction = 1
 81      elif (posy >= 200 and direction == 1):
 82          direction = 4
 83      elif (posy >= 200 and direction == 2):
 84          direction = 3
 85
 86  # Manage the movement of the car
 87  def moveCar(a):
 88      global direction
 89      global posx, posy
 90      global winLabel3
 91      global winFrame
 92      global progressBar
 93
 94      progressBar['value'] = 0
 95
 96      for i in range(10):
 97          # Call checkDirection() to control the movement direction
 98          checkDirection()
 99
100          if (a == 1):
101              move = 10
102          elif (a == 2):
103              move = 5
104          else:
105              move = 0
106
107          progressBar['value'] = int((i/(10 - 1)) * 100)
108
109          if (direction == 1):
110              posy += move; posx += move
111          elif (direction == 2):
112              posy += move; posx -= move
113          elif (direction == 3):
114              posy -= move; posx -= move
115          elif (direction == 4):
116              posy -= move; posx += move
117
118          winLabel3.pack(); winLabel3.place(x = posx, y = posy)
119          winFrame.update()
120          time.sleep(0.3)
121
122  # Create the main frame
123  winFrame = tk.Tk()
124  winFrame.title ("Traffic Lights v4"); winFrame.geometry("400x200")
125
126  photos()
127
128  winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
129  winLabel1.pack(); winLabel1.place(x = 1, y = 1)
130  winLabel2 = tk.Label(winFrame, text = '...')
131  winLabel2.pack(); winLabel2.place(x = 1, y = 100)
132  winLabel3 = tk.Label(winFrame, text='', image=photo4, compound='left')
133  winLabel3.pack(); winLabel3.place(x = 1, y = 1)
134
135  posx = 0; posy = 0
136  progressBar = ttk.Progressbar(winFrame, length=100, orient = VERTICAL,
137      mode = 'determinate')
138  progressBar.place(relx = 0.13, rely = 0.02)
139
140  trafficLight = IntVar()
141  trafficLight.set(3)
142  direction = 1
143
144  while (True):
145      winFrame.update_idletasks()
146
147      if (trafficLight.get() == 1):
148          trafficLight.set(2)
149      elif (trafficLight.get() == 2):
150          trafficLight.set(3)
151      elif (trafficLight.get() == 3):
152          trafficLight.set(1)
153      manageLabels()
154
155  winFrame.mainloop()
Output 4.4.4:
A number of new methods, options and computational ideas are introduced in this script. First, the reader will notice the use of the update_idletasks() method, which processes pending idle tasks (such as redrawing widgets and recalculating their geometry) every time the while loop is executed. This keeps the interface up to date while the program remains inside its own loop instead of the main event loop. Second, it is worth noting the use of absolute coordinates x and y to continuously position the relevant widgets on the interface, instead of the relative ones (relx and rely) used in previous examples. This is especially relevant in the case of the moving car in order to trace and handle its movement when reaching the edges of the interface.
In terms of the actual movement of the car, the computational idea is quite simple. For instance, when it reaches the east edge of the interface, (a) if it is moving southeast (i.e., direction = 1) it should bounce toward the southwest (i.e., direction = 2), and (b) if it is moving northeast (i.e., direction = 4) it should bounce toward the northwest (i.e., direction = 3). The checkDirection() method takes care of the rest of the movements of the car. Once the step and directions are set, the actual movement takes place in method moveCar(). The method recalculates the current placement coordinates of the car based on the actual coordinates, given both the intended direction and the state of the traffic light.
Observation 4.47 – update_idletasks(): Use the update_idletasks() method to process pending idle tasks (e.g., widget redraws and geometry updates), so that the interface is refreshed while the program is busy in its own loop.
Observation 4.48 – x, y coordinates: It is often preferable to use the x and y coordinates when placing a widget on an interface, in order to ensure its absolute placement in pixels instead of the relative positions (i.e., using relx and rely).
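The bouncing rule can also be written with two signed steps instead of four direction codes. The following sketch is an alternative formulation, not the checkDirection() method of the script: the names bounce_step(), dx, and dy are assumptions, and the 400 by 200 bounds are taken from the window geometry used above.

```python
def bounce_step(x, y, dx, dy, width=400, height=200):
    """Advance a point by (dx, dy), reversing a component at the edges.

    Returns the new position and the (possibly reversed) step.
    Positive dx moves east and positive dy moves south, matching the
    pixel coordinates used by place().
    """
    if (x >= width and dx > 0) or (x <= 0 and dx < 0):
        dx = -dx                  # bounce off the east or west edge
    if (y >= height and dy > 0) or (y <= 0 and dy < 0):
        dy = -dy                  # bounce off the south or north edge
    return x + dx, y + dy, dx, dy


# Start at the top-left corner, moving southeast in steps of 10 pixels
x, y, dx, dy = 0, 0, 10, 10
path = []
for _ in range(25):
    x, y, dx, dy = bounce_step(x, y, dx, dy)
    path.append((x, y))

print(path[19], path[20], path[21])
# → (200, 200) (210, 190) (220, 180)
```

In this formulation, southeast corresponds to positive dx and dy, and a bounce simply flips the sign of the component that hit an edge. A step of zero (the red light) leaves the car where it is.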
4.4.5 Traffic Lights Version 5 – Suggesting a Primitive Screen Saver
As a conclusion of this automation-related series of scripts based on the Traffic Lights sample application, it is useful to introduce the idea of using designated keyboard input commands to achieve a
certain level of control over the automated events. The following script introduces functionality that
allows the user to move the car dynamically at run time using the up, down, left, and right keys on
the keyboard, as well as the esc key to exit:
  1  # Import libraries
  2  import tkinter as tk
  3  from tkinter import *
  4  # Import the necessary image processing classes
  5  from PIL import Image, ImageTk
  6  # Import the time library
  7  import time
  8
  9  global trafficLight
 10  global posx, posy
 11  global image1, image2, image3
 12  global photo1, photo2, photo3
 13  global winLabel1, winLabel2
 14  global winFrame
 15
 16  # Open the traffic and car images and create the relevant pointers
 17  def photos():
 18      global image1, image2, image3, image4
 19      global photo1, photo2, photo3, photo4
 20
 21      image1 = Image.open("TrafficLightsGreen.gif")
 22      image1 = image1.resize((50, 100), Image.LANCZOS)
 23      photo1 = ImageTk.PhotoImage(image1)
 24
 25      image2 = Image.open("TrafficLightsYellow.gif")
 26      image2 = image2.resize((50, 100), Image.LANCZOS)
 27      photo2 = ImageTk.PhotoImage(image2)
 28
 29      image3 = Image.open("TrafficLightsRed.gif")
 30      image3 = image3.resize((50, 100), Image.LANCZOS)
 31      photo3 = ImageTk.PhotoImage(image3)
 32
 33      image4 = Image.open("Car.gif")
 34      image4 = image4.resize((30, 15), Image.LANCZOS)
 35      photo4 = ImageTk.PhotoImage(image4)
 36
 37  # Manage the movement based on the traffic light
 38  def keyPressed (event):
 39      global trafficLight
 40      global posx, posy
 41      global winFrame
 42      global winLabel2
 43
 44      # Set the moving step based on the traffic light
 45      if (trafficLight == 1):
 46          move = 10
 47      elif (trafficLight == 2):
 48          move = 5
 49      elif (trafficLight == 3):
 50          move = 0
 51
 52      print(event.keycode)
 53
 54      # Prepare the moving step (up, down, left, right, esc)
 55      # Mac codes: (8320768,8255233, 8124162, 8189699, 3473435)
 56      # The user pressed 'up'. Move the car accordingly
 57      if (event.keycode == 38):
 58          if (move == 10 and posy >= 20):
 59              posy -= 10
 60          elif (move == 5 and posy >=20):
 61              posy -= 5
 62      # The user pressed 'down'. Move the car accordingly
 63      elif (event.keycode == 40):
 64          if (move == 10 and posy <= 270):
 65              posy += 10
 66          elif (move == 5 and posy <= 270):
 67              posy += 5
 68      # The user pressed 'right'. Move the car accordingly
 69      elif (event.keycode == 39):
 70          if (move == 10 and posx <= 570):
 71              posx += 10
 72          elif (move == 5 and posx <= 570):
 73              posx += 5
 74      # The user pressed 'left'. Move the car accordingly
 75      elif (event.keycode == 37):
 76          if (move == 10 and posx >= 20):
 77              posx -= 10
 78          elif (move == 5 and posx >= 20):
 79              posx -= 5
 80      # The user pressed 'escape'. Close the program
 81      elif (event.keycode == 27):
 82          winFrame.destroy()
 83          exit()
 84
 85      winLabel2.pack(); winLabel2.place(x = posx, y = posy)
 86      winFrame.update()
 87
 88  def trafficLightsLoop():
 89      global trafficLight
 90      global winFrame
 91      global winLabel1
 92
 93      winFrame.update_idletasks()
 94
 95      if (trafficLight == 1):
 96          trafficLight = 2; winLabel1.config(image = photo2)
 97      elif (trafficLight == 2):
 98          trafficLight = 3; winLabel1.config(image = photo3)
 99      elif (trafficLight == 3):
100          trafficLight = 1; winLabel1.config(image = photo1)
101
102      winLabel1.pack(); winLabel1.place(x = 1, y = 1)
103      winFrame.update()
104      winFrame.after(3000, trafficLightsLoop)
105
106  # Create the main frame
107  winFrame = tk.Tk()
108  winFrame.title ("Traffic Lights v5"); winFrame.geometry("600x300")
109  winFrame.bind('<Key>', keyPressed)
110
111  photos()
112
113  winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
114  winLabel1.pack(); winLabel1.place(x = 1, y = 1)
115  winLabel2 = tk.Label(winFrame, text='', image=photo4, compound='left')
116  winLabel2.pack(); winLabel2.place(x = 1, y = 1)
117
118  trafficLight = 1; posx = 0; posy = 0
119
120  winFrame.after(3000, trafficLightsLoop)
121
122  winFrame.mainloop()
Output 4.4.5:
The script introduces some new ideas and techniques aiming to make the user experience more engaging, and to encourage further enhancements. Firstly, it must be noted that, in the main program, the main frame is bound to the keyPressed() method through the <Key> event. It must be stressed that the naming of the event is important and that any deviations (e.g., <key>) may not be translated correctly by Python. The use of the binding results in the user being able to press any of the up, down, left, and right directional keys in order to move the car in the relevant direction. This is achieved by checking the values of the event.keycode produced based on the user’s input. It is worth noting that these values may vary between different systems, so the code should include appropriate controls and solutions for such variations (lines 37–73).
Secondly, the reader should note the avoidance of a loop and its replacement by the after() method, which is applied to the main frame (winFrame). The reason for this decision is that the program relies on the main event loop to monitor the <Key> event, and a continuously running loop (especially one that pauses with time.sleep()) would block the event loop and prevent key presses from being processed in time. The after() method serves the purpose of creating a loop-like behavior without causing such a conflict, by scheduling the next call of trafficLightsLoop() after a given delay in milliseconds. Finally, the reader should note the use of the esc code in the keyPressed() method to exit the application in a controlled way.
Observation 4.49 – <Key>, event.keycode: Use the <Key> event to bind a particular frame or widget to a key press event. Once the key input is captured, use event.keycode to determine the appropriate action.
Observation 4.50 – after(): Use the after() method to call a method or execute a command after a predetermined number of milliseconds has elapsed. This can be used as an alternative to for or while loops.
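Since key codes differ between systems, one possible way to handle the variation is to test the symbolic key name in event.keysym (e.g., "Up", "Down", "Left", "Right", "Escape") instead of the numeric code. The sketch below illustrates that approach and is not the method used in the script: KEY_DIRECTIONS, STEP_FOR_LIGHT, and next_position() are hypothetical names, the limits 570 and 270 are borrowed from the conditions in keyPressed(), and the clamping at the edges is a simplification of the original boundary checks.

```python
# Direction of one step for each arrow key, as (x, y) multipliers
KEY_DIRECTIONS = {
    "Up": (0, -1),
    "Down": (0, 1),
    "Left": (-1, 0),
    "Right": (1, 0),
}

# Step size in pixels for each traffic light state (1 green, 2 yellow, 3 red)
STEP_FOR_LIGHT = {1: 10, 2: 5, 3: 0}


def next_position(keysym, light, x, y, max_x=570, max_y=270):
    """Return the new (x, y) after a key press, clamped to the frame."""
    dx, dy = KEY_DIRECTIONS.get(keysym, (0, 0))
    step = STEP_FOR_LIGHT.get(light, 0)
    new_x = min(max(x + dx * step, 0), max_x)
    new_y = min(max(y + dy * step, 0), max_y)
    return new_x, new_y


print(next_position("Right", 1, 0, 0))    # green light: 10 pixels east
print(next_position("Down", 2, 0, 0))     # yellow light: 5 pixels south
print(next_position("Up", 3, 50, 50))     # red light: the car stays put
```

Inside a key handler, the call would look like posx, posy = next_position(event.keysym, trafficLight, posx, posy), with a separate check for event.keysym == "Escape" to close the window.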
4.5 CASE STUDIES
Enhance the Countries application in order to include the following functionality:
• Add one more listbox to display more content for each country (e.g., size, population, etc.).
• Add a combobox to allow the user to select the font name of the contents of the listboxes.
• Add a combobox to allow the user to select the font size of the contents of the listboxes.
• Add a combobox to change the background color of the content in the listboxes.
4.6 EXERCISES
Enrich the Traffic Lights application by including one more car. The new car must be controlled
by another set of keys on the keyboard, using the same traffic lights as those on the original
application.
5 Application Development with Python
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Christos Manolas
The University of York
Ravensbourne University London
Hanêne Ben-Abdallah
University of Pennsylvania
CONTENTS
5.1 Introduction
5.2 Messages, Common Dialogs, and Splash Screens in Python
5.2.1 Simple Message Boxes
5.2.2 Message Boxes with Options
5.2.3 Message Boxes with User Input
5.2.4 Splash Screen/About Forms
5.2.5 Common Dialogs
5.3 Menus
5.3.1 Simple Menus with Shortcuts
5.3.2 Toolbar Menus with Tooltips
5.3.3 Popup Menus with Embedded Icons
5.4 Enhancing the GUI Experience
5.4.1 Notebooks and Tabbed Interfaces
5.4.2 Threaded Applications
5.4.3 Combining Multiple Concepts and Applications in a Multithreaded System
5.5 Wrap Up
5.6 Case Study
5.1 INTRODUCTION
Application development can be viewed as a process that is both scientific and creative. Scientific
because it follows the systematic process of the software development life-cycle. This covers all
development steps, from requirement analysis and implementation to deployment and maintenance.
Creative as it calls for the creativity of the developer to design a system that incorporates features
that make it suitable and efficient for the task at hand, while also being attractive to the end user.
The previous chapter introduced and discussed some of the key objects for the development of an
appealing user interface. In this chapter, the concept of application development is examined more
DOI: 10.1201/9781003139010-5
thoroughly, by introducing ideas and tools that call for the integration of multiple functions within
a single application. These include:
• Dialogs, Messages, and the Splash Screen: Simple and intuitive objects that most users
of Windows style applications are quite familiar with. Each of these objects serves a particular function and is part of the Python API (Application Programming Interface), thus,
requiring only minimal coding.
• Menus, Toolbar Menus, Popup Menus: Variations of the well-known menu object allowing the user to select different functions available in the application. Menus are usually
accompanied by extra functionality options like hot keys, shortcuts, and tooltips, in order
to enhance their attractiveness and efficiency.
• Tabs: Tabs provide an effective way to optimize the use of the real estate of the running
interface, allowing the inclusion of more than one application in the same space. This idea
is simple, but intuitive and effective. Tabs are commonly used to separate a single notebook
into various sections and load various independent applications.
• Threads: Threading involves the simultaneous execution of code relating to multiple
instances of the same process, class or application. Different threads can be executed simultaneously, either in parallel or in explicitly defined time slots. Each thread can have its own
widgets (if it is GUI based) and attributes. Threaded objects do not necessarily communicate
with each other, although this is possible and can be implemented when and if necessary.
The focus of this chapter is on discussing and illustrating key underlying concepts and mechanisms
associated with these tools and structures.
5.2 MESSAGES, COMMON DIALOGS, AND SPLASH SCREENS IN PYTHON
Messageboxes, common dialogs, and splash screens are some of the most understated, but useful
objects that can help in enhancing the functionality of an application without adding lengthy code to
it. They are user-friendly and multifunctional, and provide instant, tightly restricted and managed input from the user during the execution of an application. Several types of these components
are available with varied and diverse functions, such as the display of user messages, the creation of
menus of options/choices, the acceptance and verification of user input, the management of display
parameters and options (e.g., colors), and the management of files, file structures and directories.
Each of the above can be called and implemented with relatively simple Python code commands, as
described in the following sections.
5.2.1 Simple Message Boxes
The simple message box displays a message to the user and stays on display until the corresponding (OK) button is clicked, at which point the application resumes execution. As there is no input to be received, the user reaction to the message is irrelevant and the only possible choice is to click the OK button. The object has three distinct forms represented by methods showinfo(), showerror(), and showwarning(), which are embedded in the messagebox object (tkinter library). These methods do not change any fundamental aspects of the message box, but modify the icon that accompanies it according to the type of information provided to the user.
Observation 5.1 – Simple Message Box: Methods showinfo(), showerror(), and showwarning() (members of the messagebox object, tkinter library) are used to display a simple message box with a respective info, error, or warning icon.
The following Python script presents a basic example of the use of each of the three methods:
 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import messagebox
 4
 5  # Declare simpleMessage() function, invoked upon button click
 6  def simpleMessage(a):
 7      if (a == 1):
 8          messagebox.showinfo("Simple Info Message",
 9              "You clicked for the info message")
10      elif (a == 2):
11          messagebox.showerror("Simple Error Message",
12              "You clicked for the error message")
13      elif (a == 3):
14          messagebox.showwarning("Simple Warning Message",
15              "You clicked for the warning message")
16
17  # Create a non-resizable Windows frame using the tk object
18  winFrame = tk.Tk()
19  winFrame.title("Simple Messageboxes")
20  winFrame.resizable(False, False)
21  winFrame.geometry('290x180')
22  winFrame.configure(bg = 'dark grey')
23
24  # Create button that triggers an info message
25  winButton1 = tk.Button(winFrame, width = 25,
26      text = "Click to display \na simple info messagebox")
27  winButton1.pack(); winButton1.place(x = 50, y = 20)
28  winButton1.bind('<Button-1>', lambda event: simpleMessage(1))
29
30  # Create button that triggers an error message
31  winButton2 = tk.Button(winFrame, width = 25,
32      text = "Click to display \na simple error messagebox")
33  winButton2.pack(); winButton2.place(x = 50, y = 70)
34  winButton2.bind('<Button-1>', lambda event: simpleMessage(2))
35
36  # Create button that triggers a warning message
37  winButton3 = tk.Button(winFrame, width = 25,
38      text = "Click to display \na simple warning messagebox")
39  winButton3.pack(); winButton3.place(x = 50, y = 120)
40  winButton3.bind('<Button-1>', lambda event: simpleMessage(3))
41
42  winFrame.mainloop()
Output 5.2.1:
The reader should note that the first parameter passed to the message box is the title, whereas the
second is the content. The program output provided above illustrates the resulting messages for each
of the three simple types of message boxes.
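The if/elif chain in simpleMessage() can also be expressed as a lookup table. The following is a
minimal sketch and not part of the original script: the names MESSAGE_FUNCTIONS and show_message(),
as well as the optional box parameter, are hypothetical and introduced here only for illustration.
The tkinter import is deferred until a dialog is actually requested, so that the helper can be
tried with a stand-in object on a machine without a display.

```python
# Map a message kind to the name of the tkinter.messagebox function
# that displays it (the three functions discussed in this section).
MESSAGE_FUNCTIONS = {
    "info": "showinfo",
    "error": "showerror",
    "warning": "showwarning",
}


def show_message(kind, title, text, box=None):
    """Display a simple message box of the given kind.

    `box` is a hypothetical hook: any object exposing showinfo(),
    showerror(), and showwarning(). It defaults to tkinter.messagebox.
    """
    if kind not in MESSAGE_FUNCTIONS:
        raise ValueError("Unknown message kind: " + repr(kind))
    if box is None:
        # Imported here so that this module loads without a display
        from tkinter import messagebox as box
    # The first argument is the title, the second is the content
    return getattr(box, MESSAGE_FUNCTIONS[kind])(title, text)
```

With a main window in place, a button handler could then call, for instance,
show_message("warning", "Simple Warning Message", "You clicked for the warning message").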
5.2.2 Message Boxes with Options
Message boxes are commonly used to receive user confirmation for processes that take place at
run-time. In such cases, instead of merely displaying information, the object must prompt the user
to confirm their approval (or lack thereof) regarding the execution of particular processes. As in
the case of simple messages, several options are available for message boxes with options,
depending on the type of confirmation that is requested. However, there are two major differences
between the two types of messages. Firstly, in the case of messages with options, the user makes a
choice that may alter the execution order of the processes that follow, in contrast to the simple
message box. The type and format of the input depends on the type of the message (e.g., OK-Cancel,
Retry-Cancel, Yes-No). Secondly, the user’s choice has a tangible value that can be stored in a
variable and checked against other pre-defined values to determine the flow of execution. These
values are True or False (no quotes and case-sensitive) in the case of OK-Cancel, Retry-Cancel, and
Yes-No, and 'yes' or 'no' (lowercase strings, in quotation marks and case-sensitive) in the case of
a question message box.
Observation 5.2 – Message Box with Options: Methods askokcancel(), askretrycancel(), askyesno(),
and askquestion() (members of the messagebox object, tkinter library) are used to display a
message, while also requesting some sort of confirmation from the user. The responses can be True
or False for the first three and 'yes' or 'no' for the last one.
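Because the four methods return two different kinds of values, it can be convenient to reduce them
to a single Boolean before branching. The sketch below is one possible way of doing so; the helper
names is_affirmative() and confirm(), and the ask parameter, are assumptions made for this example
and do not belong to tkinter.

```python
def is_affirmative(response):
    """Reduce a message box result to a plain bool.

    askokcancel(), askretrycancel(), and askyesno() already return
    True/False; askquestion() returns the string 'yes' or 'no'.
    """
    if isinstance(response, bool):
        return response
    return response == "yes"


def confirm(title, message, ask=None):
    """Ask the user for confirmation and return True or False.

    `ask` is a hypothetical hook for any of the four functions above;
    it defaults to messagebox.askyesno.
    """
    if ask is None:
        # Imported here so that this module loads without a display
        from tkinter import messagebox
        ask = messagebox.askyesno
    return is_affirmative(ask(title=title, message=message))
```

For example, confirm("Delete", "Delete the file?", ask=messagebox.askquestion) would return True
only when the user clicks Yes, even though askquestion() itself returns a string.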
The following Python script provides a simple example that integrates all four different types of
messages with options. The script also makes use of the showinfo() and showerror() methods of the simple message box:
# Import libraries
import tkinter as tk
from tkinter import messagebox

# Declare optionMessage() function, invoked upon button click
def optionMessage(a):
    if (a == 1):
        response = messagebox.askokcancel(title = "ok-cancel Message",
            message = "Clicked the OK-Cancel message", icon = 'info')
        if (response == True):
            messagebox.showinfo("Info Message", "Clicked OK")
        elif (response == False):
            messagebox.showerror("Error Message", "Clicked Cancel")
    elif (a == 2):
        response = messagebox.askquestion(title = "question Message",
            message = "Clicked the question message", icon = 'info')
        if (response == 'yes'):
            messagebox.showinfo("Info Message", "Clicked Yes")
        elif (response == 'no'):
            messagebox.showerror("Error Message", "Clicked No")
    elif (a == 3):
        response = messagebox.askretrycancel(title = "retry-cancel Message",
            message = "Clicked the Retry-Cancel message", icon = 'info')
        if (response == True):
            messagebox.showinfo("Info Message", "Clicked Retry")
        elif (response == False):
            messagebox.showerror("Error Message", "Clicked Cancel")
    elif (a == 4):
        response = messagebox.askyesno(title = "yes-no Message",
            message = "Clicked the Yes-No message", icon = 'info')
        if (response == True):
            messagebox.showinfo("Info Message", "Clicked Yes")
        elif (response == False):
            messagebox.showerror("Error Message", "Clicked No")

# Create a non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Messageboxes with options")
winFrame.resizable(False, False)
winFrame.geometry('320x220')
winFrame.configure(bg = 'grey')

# Create button that triggers an OK-Cancel message
winButton1 = tk.Button(winFrame, width = 20,
    text = "Click to display \nan OK-Cancel messagebox")
winButton1.pack(); winButton1.place(x = 85, y = 20)
winButton1.bind('<Button-1>', lambda event: optionMessage(1))

# Create button that triggers a question message
winButton2 = tk.Button(winFrame, width = 20,
    text = "Click to display \na Question messagebox")
winButton2.pack(); winButton2.place(x = 85, y = 70)
winButton2.bind('<Button-1>', lambda event: optionMessage(2))

# Create button that triggers a Retry-Cancel message
winButton3 = tk.Button(winFrame, width = 20,
    text = "Click to display \na Retry-Cancel messagebox")
winButton3.pack(); winButton3.place(x = 85, y = 120)
winButton3.bind('<Button-1>', lambda event: optionMessage(3))

# Create button that triggers a Yes-No message
winButton4 = tk.Button(winFrame, width = 20,
    text = "Click to display \na Yes-No messagebox")
winButton4.pack(); winButton4.place(x = 85, y = 170)
winButton4.bind('<Button-1>', lambda event: optionMessage(4))

winFrame.mainloop()
Output 5.2.2:
5.2.3 Message Boxes with User Input
Occasionally, message boxes are used instead of regular entry or text widgets, to prompt user input
of various different data types (i.e., string, integer, float). This is a viable choice when the
interface is heavily loaded or when the use of widgets is not desirable. When message boxes are
used for this purpose, the following methods can be used: (a) askstring() for string input,
(b) askinteger() for integer numbers input, and (c) askfloat() for float numbers (real numbers)
input. These methods are members of the simpledialog module of the tkinter library. As they return
a value of a particular data type, it must be stored in a suitable variable declared for this
purpose. If the user cancels the dialog, None is returned instead.
Observation 5.3 – Message Box with User Input: Methods askstring(), askinteger(), and askfloat()
(members of the simpledialog object, tkinter library) are used to display a message requesting
input of a specific data type from the user.
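Since a cancelled dialog yields None rather than a value of the requested type, code that later
combines the answers should be prepared for missing entries. The following sketch illustrates one
way to do this. The function names format_student() and ask_student(), the "not provided" wording,
and the range limits are assumptions chosen for this example; minvalue and maxvalue are optional
keyword arguments accepted by askinteger() and askfloat().

```python
def format_student(name, birthyear, gpa):
    """Build the summary text, tolerating answers that were never given."""
    def shown(value):
        return "not provided" if value is None else str(value)

    return ("Student's name: " + shown(name)
            + "\nStudent's year of birth: " + shown(birthyear)
            + "\nStudent's GPA: " + shown(gpa))


def ask_student():
    """Prompt for the three values in turn (requires a display)."""
    # Imported here so that format_student() can be used without a display
    import tkinter as tk
    from tkinter import simpledialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window behind the dialogs
    name = simpledialog.askstring("Name", "What is your name?", parent=root)
    # The limits below are illustrative; out-of-range input is rejected
    birthyear = simpledialog.askinteger("Year of birth",
        "What is the year of your birth?", parent=root,
        minvalue=1900, maxvalue=2100)
    gpa = simpledialog.askfloat("GPA",
        "What is your GPA (out of 4 with one decimal)?", parent=root,
        minvalue=0.0, maxvalue=4.0)
    root.destroy()
    return name, birthyear, gpa
```

A call such as print(format_student(*ask_student())) would then display the three answers, with
"not provided" in place of any dialog that was cancelled.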
As shown in the following Python script, the title and the message of the message box must also be
specified:
# Import libraries
import tkinter as tk
from tkinter import simpledialog
from tkinter import messagebox

global name; global birthyear; global gpa

# Declare inputMessage() function, invoked upon button click
def inputMessage(a):
    global name; global birthyear; global gpa
    # Accept student name, year of birth, and GPA
    # and display it through a simple message box
    if (a == 1):
        name = simpledialog.askstring("Name", "What is your name?")
    elif (a == 2):
        birthyear = simpledialog.askinteger("Year of birth",
            "What is the year of your birth?")
    elif (a == 3):
        gpa = simpledialog.askfloat("GPA",
            "What is your GPA (out of 4 with one decimal)?")
    elif (a == 4):
        # str() also covers the None returned by a cancelled dialog
        message = "Student's name: " + str(name) + \
            "\nStudent's year of birth: " + str(birthyear) + \
            "\nStudent's GPA: " + str(gpa)
        messagebox.showinfo("Student's info", message)

# Create a non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Inputboxes")
winFrame.resizable(False, False)
winFrame.geometry('260x220')
winFrame.configure(bg = 'grey')

# Create buttons that will trigger the associated messages
winButton1 = tk.Button(winFrame,
    text = "Click to ask \nthe student's name", width = 20)
winButton1.pack(); winButton1.place(x = 30, y = 20)
winButton1.bind('<Button-1>', lambda event: inputMessage(1))
winButton2 = tk.Button(winFrame, width = 20,
    text = "Click to ask \nthe student's year of birth")
winButton2.pack(); winButton2.place(x = 30, y = 70)
winButton2.bind('<Button-1>', lambda event: inputMessage(2))
winButton3 = tk.Button(winFrame,
    text = "Click to ask \nthe student's GPA", width = 20)
winButton3.pack(); winButton3.place(x = 30, y = 120)
winButton3.bind('<Button-1>', lambda event: inputMessage(3))
winButton4 = tk.Button(winFrame,
    text = "Click to show \nthe student's info", width = 20)
winButton4.pack(); winButton4.place(x = 30, y = 170)
winButton4.bind('<Button-1>', lambda event: inputMessage(4))

name = ""; birthyear = 0; gpa = 0.0

winFrame.mainloop()
Output 5.2.3:
5.2.4 Splash Screen/About Forms
A frequently underestimated type of object is the so-called splash screen or about form. It is most
commonly used to provide information about application execution and processes, development details
and dates, copyrights, and contact details for the development team. The object does not follow a
formal design and, therefore, it is not offered as a template by most well-known programming
languages.
Observation 5.4 – Splash screen: A splash screen can be used in cases of excessive loading times of
a window/widget or when there is a need to display information related to the application.
Among its various uses, the splash screen/about form can be used to give time to the main
application to load its components. This is especially relevant if significant amounts of data need
to be loaded, such as sizable databases or graphics, and heavy objects in general. The following
script is a basic example of a splash screen with no apparent functionality. The form disappears
after 8 seconds to give its place to the main application window:
# Import libraries
import tkinter as tk
import time

global winSplash

# Create the Splash screen
def splash():
    global winSplash
    winSplash = tk.Tk()
    winSplash.title("Splash screen")
    winSplash.resizable(False, False)
    winSplash.geometry('250x100')
    winSplash.configure(bg = 'dark grey')
    winLabel1 = tk.Label(winSplash,
        text = "Display the Splash screen \nfor 8 seconds")
    winLabel1.grid(row = 0, column = 0)
    # Use the update function to display the splash screen
    # before the mainloop (main window) takes over
    winSplash.update()

# Call the splash screen for 8 seconds
splash()
time.sleep(8)

# Destroy the splash screen before the mainloop
winSplash.destroy()

# Create the main window
winFrame = tk.Tk()
winFrame.title("Main Window")
winFrame.resizable(False, False)
winFrame.geometry('250x100')
winFrame.configure(bg = 'grey')
winLabel2 = tk.Label(winFrame, text = "Entered the main window")
winLabel2.grid(row = 0, column = 0)

winFrame.mainloop()
Output 5.2.4:
The reader should note the use of the time.sleep() function after the splash() function is invoked.
This keeps the splash screen on display for 8 seconds before the main window (winFrame) is loaded.
It is also worth noting the use of the update() method on the winSplash object. This method ensures
that the widget is displayed, although it is not the main window and, thus, the mainloop() method
cannot be used with it.
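One side effect of time.sleep() is that the whole program is blocked while the splash screen is on
display. A commonly used alternative, which is a different technique from the one in the script
above, is to let the tkinter event loop schedule the transition through the after() method. The
sketch below assumes a hidden main window and a Toplevel window acting as the splash screen; the
names SPLASH_MS and run_with_splash() are hypothetical.

```python
SPLASH_MS = 8000  # how long the splash screen stays visible, in milliseconds


def run_with_splash(delay_ms=SPLASH_MS):
    """Show a splash window, then reveal the main window (needs a display)."""
    import tkinter as tk  # imported here so this module loads without a display

    main = tk.Tk()
    main.title("Main Window")
    main.geometry('250x100')
    tk.Label(main, text="Entered the main window").grid(row=0, column=0)
    main.withdraw()  # keep the main window hidden while the splash is shown

    splash = tk.Toplevel(main)
    splash.title("Splash screen")
    splash.geometry('250x100')
    tk.Label(splash, text="Loading...").grid(row=0, column=0)

    def reveal_main():
        # Remove the splash window and bring back the main window
        splash.destroy()
        main.deiconify()

    # Schedule the switch; the event loop keeps running in the meantime
    main.after(delay_ms, reveal_main)
    main.mainloop()
```

Calling run_with_splash(3000) would show the splash window for three seconds. Because a single
event loop serves both windows, any loading work can be scheduled with after() in the same way.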
5.2.5 Common Dialogs
It is frequently the case that the programmer needs to utilize the API (Application Programming
Interface) of the operating system in order to avoid writing code that is already provided as
prepackaged, essential functionality. Some of the most important GUI-related API elements can be
found under the broader category of dialogs. Different versions of dialogs exist, such as Color,
Open File, Save File, Directory, Font Dialog, and Print. These dialogs allow programmers to
circumvent extensive GUI programming by offering instant access to basic, repetitive functional
tasks. These are the common dialog objects that appear in various types of widely used GUI
applications (e.g., MS Office or Adobe Creative Suite).
With the exception of the color dialog (askcolor), which is included in the colorchooser module,
the file and directory dialogs used in this section are included in the filedialog module under the
associated keywords (e.g., filedialog.askopenfile(), filedialog.asksaveasfile(),
filedialog.askdirectory()). The syntax for invoking these API methods is simple and rather
intuitive, and it allows a two-way communication with the user in order to obtain their selection.
In the case of askcolor(), one should note that the result is a tuple of two values: an RGB (red,
green, blue) triple and the corresponding hexadecimal color string (e.g., '#ff0000'). The color
values selected can be stored in a variable for further use.
Observation 5.5 – API methods: The API methods offered by Python can be used to perform basic
repetitive tasks across many platforms and operating systems. These methods include askcolor() from
the colorchooser library and asksaveasfile(), askopenfile(), and askdirectory() from the filedialog
library.
The following Python script illustrates the use of four of these API methods:
# Import libraries
import tkinter as tk
from tkinter import filedialog
from tkinter import colorchooser

# Define openDialogs() function, invoked upon button click
def openDialogs(a):
    if (a == 1):
        # Assign user color selection to a set of variables
        (rgbSelected, colorSelected) = colorchooser.askcolor()
        # Use the color element from the variable set to change
        # the color of the form
        winFrame.config(background = colorSelected)
    elif (a == 2):
        filedialog.askopenfile(title = "Open File Dialog")
    elif (a == 3):
        filedialog.askdirectory(title = "Directory Dialog")
    elif (a == 4):
        filedialog.asksaveasfilename(title = "Save As Dialog")

# Create a non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Common Dialogs")
winFrame.resizable(False, False)
winFrame.geometry('280x220')
winFrame.configure(bg = 'grey')

# Create button that triggers the Color dialog
winButton1 = tk.Button(winFrame,
    text = "Click to open \nthe Color dialog", width = 20)
winButton1.pack(); winButton1.place(x = 60, y = 20)
winButton1.bind('<Button-1>', lambda event: openDialogs(1))

# Create button that triggers the Open File dialog
winButton2 = tk.Button(winFrame,
    text = "Click to open \nthe File Dialog", width = 20)
winButton2.pack(); winButton2.place(x = 60, y = 70)
winButton2.bind('<Button-1>', lambda event: openDialogs(2))

# Create button that triggers the Directory dialog
winButton3 = tk.Button(winFrame,
    text = "Click to open \nthe Directory Dialog", width = 20)
winButton3.pack(); winButton3.place(x = 60, y = 120)
winButton3.bind('<Button-1>', lambda event: openDialogs(3))

# Create button that triggers the Save As dialog
winButton4 = tk.Button(winFrame,
    text = "Click to open \nthe Save As Dialog", width = 20)
winButton4.pack(); winButton4.place(x = 60, y = 170)
winButton4.bind('<Button-1>', lambda event: openDialogs(4))

winFrame.mainloop()
Output 5.2.5:
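When the user closes the color dialog without choosing a color, askcolor() returns the pair
(None, None). The following sketch shows one way to fall back to a default background in that case.
The function names pick_background() and choose_background(), and the default value 'grey', are
assumptions made for this example.

```python
def pick_background(result, default="grey"):
    """Return the hex color from an askcolor() result, or a default.

    `result` is the two-element tuple returned by askcolor():
    an (r, g, b) triple and a '#rrggbb' string, or (None, None)
    when the dialog was cancelled.
    """
    rgb, hex_code = result
    return default if hex_code is None else hex_code


def choose_background(window, default="grey"):
    """Open the color dialog and recolor `window` (requires a display)."""
    # Imported here so that pick_background() can be used without a display
    from tkinter import colorchooser

    result = colorchooser.askcolor(title="Color dialog", parent=window)
    window.config(background=pick_background(result, default))
```

In the script of this section, the body of the first branch of openDialogs() could be replaced by a
call to choose_background(winFrame).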
5.3 MENUS
It is quite rare for a desktop or mobile application to offer singular functionality. Developers
usually create systems capable of performing numerous tasks and functions. Examples of this are the
scripts developed in the previous sections, where multiple, although quite simplistic, tasks were
performed using a series of corresponding buttons. In reality, in most cases, access to different
functions within an application is provided through menus. These can take different forms, such as
simple menus, single-layered menus, menus with nested sub-menus, toolbars, and pop-up menus.
These types of menus can be used in isolation, but are also frequently used in conjunction. This
section covers basic menu concepts, as well as a number of particular options that can be used to
further enhance menu functionality.
5.3.1 Simple Menus with Shortcuts
In all windows style applications, simple menus follow the same basic, but rather intuitive, style.
They include a top-level list of items, usually displayed just below the title of the application.
This top-level menu layer sits on top of sub-menus that are hidden in subsequent layers. Such basic
menus are created using the constructor of the Menu class from the tkinter library. The idea is
quite straightforward indeed. Firstly, the menu object is created using the Menu() constructor.
Additional menu objects can be also created and attached to the main menu object, as necessary.
Next, any required sub-menus can be added to the main menu. This can be accomplished with the
add_command() method for simple items or the add_checkbutton() and add_radiobutton() methods for
check button and radio button items, respectively. For nested menus, these steps can be repeated as
many times as necessary, although one should avoid going deeper than two levels of menus for
clarity reasons. Finally, the add_cascade() method is used to tie together the various menu pieces
and activate the menu system.
Observation 5.6 – Menu class: Use the constructor of the Menu class to create a menu object. The
main menu choices can be added using the constructor (Menu()), while simple menu items can be added
using the add_command() method and radio and check buttons using the add_checkbutton() and
add_radiobutton() methods, respectively. Use add_cascade() to put all pieces of the menu together
and display them on the menu bar.
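The steps described above can be condensed into a few lines. The following is a minimal sketch of
a one-level menu, written as a function so that it can be attached to any main window. The function
name build_menubar(), the "File" label, and the two callback parameters are hypothetical and serve
only to illustrate the sequence of calls.

```python
def build_menubar(root, on_open, on_exit):
    """Attach a one-level menu bar to `root` and return it (needs a display)."""
    import tkinter as tk  # imported here so this module loads without a display

    # Step 1: create the top-level menu object on the main window
    menubar = tk.Menu(root)
    # Step 2: create a menu that will drop down from the top-level menu
    file_menu = tk.Menu(menubar, tearoff=0)
    # Step 3: add simple items (and a separator) to the drop-down menu
    file_menu.add_command(label="Open", command=on_open)
    file_menu.add_separator()
    file_menu.add_command(label="Exit", command=on_exit)
    # Step 4: tie the drop-down menu to the top-level menu
    menubar.add_cascade(label="File", menu=file_menu)
    # Display the menu bar on the window
    root.config(menu=menubar)
    return menubar
```

Given a main window created with root = tk.Tk(), the call
build_menubar(root, on_open=lambda: print("open"), on_exit=root.destroy), followed by
root.mainloop(), would display a File menu with two items.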
In addition to creating the basic menu structure, developers often choose to extend its
functionality by means of menu shortcuts. These can take the form of either hot letters using the
underline option, or combinations of special keys (e.g., the control key) and letters through the
accelerator option. In both cases, it is essential to remember that while these options may appear
on the menu, they do not automatically trigger the relevant functionality. For this purpose, the
main window form should be bound to the relevant event in order to trigger the respective
functionality. This is achieved with the bind() method. The following application uses the
functionality of the previous section, but with the implementation of a two-level deep basic menu
instead of buttons:
 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import filedialog
 4  from tkinter import colorchooser
 5  from tkinter import messagebox
 6  from tkinter import Menu
 7
 8  # Define functions colorDialog, openDialog, saveAsDialog, quit, askyesno
 9  # and askokcancel, invoking the relevant dialogs or message boxes
10  def colorDialog():
11      # Assign user color selection to a set of variables
12      (rgbSelected, colorSelected) = colorchooser.askcolor()
13      # Change the form color; use the color element from the variable set
14      winFrame.config(background = colorSelected)
15
16  def openDialog():
17      filedialog.askopenfile(title = "Open File Dialog")
18
19  def saveAsDialog():
20      filedialog.asksaveasfilename(title = "Save As Dialog")
21
22  def quit():
23      winFrame.destroy()
24      exit()
25
26  def askyesno():
27      messagebox.askyesno("YesNo message",
28          "Click on Yes or No to continue")
29
30  def askokcancel():
31      messagebox.askokcancel("OKCancel message",
32          "Click on OK or Cancel to continue")
33
34  # Define keypressedEvent() function that will invoke
35  # the associated function based on key press
36  def keypressedEvent(event):
37      if (event.keycode == 67 or event.keycode == 99):
38          colorDialog()
39      if (event.keycode == 70 or event.keycode == 102):
40          openDialog()
41      if (event.keycode == 83 or event.keycode == 115):
42          saveAsDialog()
43
44  # Create non-resizable Windows frame using the tk object
45  winFrame = tk.Tk()
46  winFrame.title("Menus")
47  winFrame.resizable(False, False)
48  winFrame.geometry('260x220')
49
50  # Create the menu widget on the main window
51  menubar = tk.Menu(winFrame)
52
53  # Create the first series of sub-menus with dialogs
54  # and underline the shortcut letters
55  dialogs = tk.Menu(menubar, tearoff = 0)
56  dialogs.add_command(label = "Color dialog", command = colorDialog,
57      underline = 0)
58  dialogs.add_command(label = "Open File dialog", command = openDialog,
59      underline = 5)
60  dialogs.add_command(label = "Save As dialog", command = saveAsDialog,
61      underline = 0)
62  menubar.add_cascade(label = "Dialogs", menu = dialogs)
63
64  # Create the second series of sub-menus with messages
65  mssgs = tk.Menu(menubar, tearoff = 0)
66
67  # Create sub-menu inside the Yes/No, OK/Cancel message
68  mssgs1 = tk.Menu(mssgs, tearoff = 0)
69  mssgs1.add_command(label = "Yes/No Message", command = askyesno,
70      accelerator = 'Ctrl-Y')
71  mssgs1.add_command(label = "OK/Cancel Message", command = askokcancel,
72      accelerator = 'Ctrl-O')
73  mssgs.add_cascade(label = "Yes/No, OK/Cancel", menu = mssgs1)
74
75  mssgs.add_separator()
76  mssgs.add_command(label = "Exit", command = quit, accelerator = 'Ctrl-X')
77  menubar.add_cascade(label = "Messages", menu = mssgs)
78
79  # Create the third series of menus with check buttons and radio buttons
80  buttonmenus = tk.Menu(menubar, tearoff = 0)
81  buttonmenus.add_checkbutton(label = "Checkmenu1", onvalue=1, offvalue=0)
82  buttonmenus.add_checkbutton(label = "Checkmenu2", onvalue=1, offvalue=0)
83  buttonmenus.add_separator()
84  buttonmenus.add_radiobutton(label = "Radiomenu1")
85  buttonmenus.add_radiobutton(label = "Radiomenu2")
86  menubar.add_cascade(label = "Button menus", menu = buttonmenus)
87
88  # Bind the main window frame with the event/shortcut that will trigger
89  # the relevant function (lowercase keysyms: Ctrl + key without Shift)
90  winFrame.bind('<Key>', lambda event: keypressedEvent(event))
91  winFrame.bind('<Control-y>', lambda event: askyesno())
92  winFrame.bind('<Control-o>', lambda event: askokcancel())
93  winFrame.bind('<Control-x>', lambda event: quit())
94
95  winFrame.config(menu = menubar)
96  winFrame.mainloop()
Output 5.3.1:
In addition to the necessary library calls, the script is split into three main parts. In the first
part, the main window frame is created and configured (lines 44–48). Next, a menu object (menubar)
is created (lines 50–51) and two main menu items (dialogs and mssgs) are attached to it (lines 55,
65). Notice the tearoff option, which, when set to 0, prevents the menu from being detached from
the main menu bar. Once the main menu components are in place, the various sub-menu items are
created and associated with their parent menu item through the add_command() method (lines 55–62
and 69–73). The command option binds particular menu items with the relevant methods. The underline
option accepts the index of a character in the item's label text (starting at 0) and displays that
character as a hot key. This is not enough by itself to trigger the relevant method or command, so
a relevant event must be bound to the hot key character (lines 36–42 and 90). This is unlike the
case of the command option.
Observation 5.7 – add_separator(), underline, accelerator: Use the add_separator() method to add a
line separating the various items of a menu. Use the underline option to create hot keys, or the
accelerator option to create ctrl-, shift-, or alt-keys, and to associate them with the desired
functionality and events.
When sub-menus are required as part of a menu item, the same process can be utilized. The
only difference in this case would be that the referenced object should be the menu item instead of
the main menu item (line 68). If it is preferred to use combinations of special keys (i.e., Control,
Shift, or Alt) and characters, one can use the accelerator option instead of underline (lines
69–72, 76). As with underline, additional code should be written in order to trigger the function,
method, or command associated with the menu item.
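Since the accelerator text is only a label, the event sequence passed to bind() has to be written
separately, and the two can easily drift apart. The sketch below derives the sequence from the
label. The helper names accelerator_to_sequence() and add_shortcut_command(), and the MODIFIERS
table, are hypothetical. The helper also relies on the fact that Tk reports a letter key pressed
without Shift under its lowercase name, so that 'Ctrl-Y' corresponds to '<Control-y>'.

```python
# Translate the modifier names shown on menus to the names used by bind()
MODIFIERS = {"Ctrl": "Control", "Alt": "Alt", "Shift": "Shift"}


def accelerator_to_sequence(accelerator):
    """Translate a menu label such as 'Ctrl-Y' into a bind() sequence."""
    *modifiers, key = accelerator.split("-")
    parts = [MODIFIERS[m] for m in modifiers]
    # Without Shift, Tk reports the lowercase name of a letter key
    if "Shift" not in parts and len(key) == 1:
        key = key.lower()
    parts.append(key)
    return "<" + "-".join(parts) + ">"


def add_shortcut_command(window, menu, label, accelerator, callback):
    """Add a menu item and bind its accelerator on the window in one step."""
    # The accelerator option only displays the text next to the item
    menu.add_command(label=label, command=callback, accelerator=accelerator)
    # The binding is what actually triggers the callback on key press
    window.bind(accelerator_to_sequence(accelerator),
                lambda event: callback())
```

For instance, add_shortcut_command(winFrame, mssgs1, "Yes/No Message", 'Ctrl-Y', askyesno) would
both display the Ctrl-Y hint on the menu item and make the key combination work.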
In cases where check or radio buttons are required instead of simple menu items, one can use
methods add_checkbutton() and add_radiobutton(), respectively. These methods are used as
alternatives to the add_command() method (lines 81–82 and 84–85). When there is a need to separate
the various menu items in groups, one can use the add_separator() method (line 83). As mentioned,
the add_cascade() method ties together and activates the various items of the menu system.
In the second part of the script, the bindings between the menu item shortcuts (hot keys or
control characters) and the associated commands are established (lines 90–93 and 36–42).
The third part of the script involves the functions that perform the various functionality tasks
(lines 8–32). Should the reader experience difficulty in following this example, the main coding
concepts and commands used in the script are discussed in more detail in previous sections and/or
chapters.
It is important to note that there is a difference in terms of how a menu is displayed in Windows
(the menu bar is inside the running application window) and in Mac OS (the menu is displayed at
the main system menu bar, detached from the running application window).
5.3.2 Toolbar Menus with Tooltips
An alternative form of presenting menu options to the user is the toolbar menu. It could either
supplement the simple menu system or be used as a stand-alone component. The idea is rather
straightforward: creating a collection of buttons (on a frame) and attaching it to the main window
frame. The buttons are then bound to the respective commands.
Observation 5.8 – toolbar menu: Use a toolbar menu system in addition to (or instead of) simple
menus, to improve the GUI of a multi-functional application.
Buttons can display either images or text, or a combination of both. In order to improve clarity
and make the interface more user-friendly, button text is often replaced by appropriate tooltips.
The following Python script provides the same functionality as the one in the previous section,
but is using a toolbar instead of a menu. The implementation also embeds tooltips to the toolbar
buttons:
  1  # Import libraries
  2  import tkinter as tk
  3  from tkinter import filedialog
  4  from tkinter import colorchooser
  5  from tkinter import Menu
  6  from tkinter import *
  7
  8  # Import the necessary image processing classes from PIL
  9  from PIL import Image, ImageTk
 10
 11  global openFileToolTip, saveAsToolTip, colorsDialogToolTip, exitToolTip
 12  global photo1, photo2, photo3, photo4
 13  global openFileButton, saveAsButton, colorsButton, exitButton
 14  #--------------------------------------------------------------------
 15  # Open and resize images - load images to buttons
 16  def images():
 17      global photo1, photo2, photo3, photo4
 18
 19      image1 = Image.open("OpenFile.gif")
 20      image1 = image1.resize((24, 24), Image.LANCZOS)
 21      photo1 = ImageTk.PhotoImage(image1)
 22      image2 = Image.open("SaveAs.gif")
 23      image2 = image2.resize((24, 24), Image.LANCZOS)
 24      photo2 = ImageTk.PhotoImage(image2)
 25      image3 = Image.open("ColorsDialog.gif")
 26      image3 = image3.resize((24, 24), Image.LANCZOS)
 27      photo3 = ImageTk.PhotoImage(image3)
 28      image4 = Image.open("Exit.gif")
 29      image4 = image4.resize((24, 24), Image.LANCZOS)
 30      photo4 = ImageTk.PhotoImage(image4)
 31  #--------------------------------------------------------------------
 32  # Define the colorDialog, openDialog, saveAsDialog, and quit functions
 33  # that will invoke the relevant dialogs or quit the application
 34  def colorDialog():
 35      # Assign user color selection to a set of variables
 36      (rgbSelected, colorSelected) = colorchooser.askcolor()
 37      # Change the form color; use the color element from set variable
 38      winFrame.config(background = colorSelected)
 39
 40  def openDialog():
 41      filedialog.askopenfile(title = "Open File Dialog")
 42
 43  def saveAsDialog():
 44      filedialog.asksaveasfilename(title = "Save As Dialog")
 45
 46  def quit():
 47      winFrame.destroy()
 48      exit()
 49  #--------------------------------------------------------------------
 50  # showToolTips function displays relevant message when hovering over a
 51  # button; hideToolTips() function destroys/hides the tooltip
 52  def showToolTips(a):
 53      global openFileToolTip, saveAsToolTip
 54      global colorsDialogToolTip, exitToolTip
 55      if (a == 1):
 56          openFileToolTip = tk.Label(winFrame, relief = FLAT,
 57              text = "Open the Open File dialog", background = 'cyan')
 58          openFileToolTip.place(x = 25, y = 30)
 59      if (a == 2):
 60          saveAsToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
 61              text = "Open the Save As Dialog", background = 'cyan')
 62          saveAsToolTip.place(x = 50, y = 30)
 63      if (a == 3):
 64          colorsDialogToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
 65              text = "Open the Colors Dialog", background = 'cyan')
 66          colorsDialogToolTip.place(x = 75, y = 30)
 67      if (a == 4):
 68          exitToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
 69              text = "Click to exit the application", background = 'cyan')
 70          exitToolTip.place(x = 100, y = 30)
 71
 72  def hideToolTips(a):
 73      global openFileToolTip, saveAsToolTip
 74      global colorsDialogToolTip, exitToolTip
 75      if (a == 1):
 76          openFileToolTip.destroy()
 77      if (a == 2):
 78          saveAsToolTip.destroy()
 79      if (a == 3):
 80          colorsDialogToolTip.destroy()
 81      if (a == 4):
 82          exitToolTip.destroy()
 83  #--------------------------------------------------------------------
 84  # Define the bindButtons function to bind the buttons with the
 85  # various events
 86  def bindButtons():
 87      global openFileButton, saveAsButton, colorsButton, exitButton
 88
 89      openFileButton.bind('<Button-1>', lambda event: openDialog())
 90      openFileButton.bind('<Enter>', lambda event: showToolTips(1))
 91      openFileButton.bind('<Leave>', lambda event: hideToolTips(1))
 92      saveAsButton.bind('<Button-1>', lambda event: saveAsDialog())
 93      saveAsButton.bind('<Enter>', lambda event: showToolTips(2))
 94      saveAsButton.bind('<Leave>', lambda event: hideToolTips(2))
 95      colorsButton.bind('<Button-1>', lambda event: colorDialog())
 96      colorsButton.bind('<Enter>', lambda event: showToolTips(3))
 97      colorsButton.bind('<Leave>', lambda event: hideToolTips(3))
 98      exitButton.bind('<Button-1>', lambda event: quit())
 99      exitButton.bind('<Enter>', lambda event: showToolTips(4))
100      exitButton.bind('<Leave>', lambda event: hideToolTips(4))
101  #--------------------------------------------------------------------
102  # Create non-resizable Windows frame using the tk object
103  winFrame = tk.Tk()
104  winFrame.title("Menus")
105  winFrame.resizable(False, False)
106  winFrame.geometry('260x220')
107
108  # Invoke the images function
109  images()
110
111  # Create toolbar with images and bind to related click event
112  toolbar = tk.Frame(winFrame, bd = 1, relief = RAISED)
113  toolbar.pack(side=TOP, fill=X)
114  # Create the toolbar buttons and invoke the bindButton function to bind
115  # them with the relevant events
116  openFileButton = tk.Button(toolbar, image = photo1, relief = FLAT)
117  saveAsButton = tk.Button(toolbar, image = photo2, relief = FLAT)
118  colorsButton = tk.Button(toolbar, image = photo3, relief = FLAT)
119  exitButton = tk.Button(toolbar, image = photo4, relief = FLAT)
120  bindButtons()
121  openFileButton.pack(side=LEFT, padx=0, pady=0)
122  saveAsButton.pack(side=LEFT, padx=0, pady=0)
123  colorsButton.pack(side=LEFT, padx=0, pady=0)
124  exitButton.pack(side=LEFT, padx=0, pady=0)
125
126  winFrame.mainloop()
Output 5.3.2:
The script is similar to the previous versions in structure but with some notable differences. Firstly, a toolbar frame is created and populated with four buttons instead of creating a menu structure. Images are added to the buttons (lines 16–30) and activated through the associated pack() method calls (lines 121–124). Secondly, the buttons are associated with three events, namely Button-1, Enter, and Leave (lines 86–100, 120). Button-1 is triggered when the left mouse button is pressed, Enter when the mouse pointer hovers over the button, and Leave when the mouse pointer exits the boundaries of the button.
Observation 5.9 – Enter, Leave: Use the Enter and Leave events to trigger the desired actions when the mouse hovers over or moves away from an object.
Another key point in this script is the way tooltips are created and triggered. At the time of writing, Python did not provide an automatic method to create and trigger a tooltip. As such, developers wishing to use a tooltip should implement this functionality through coding. Nevertheless, the concept for doing so is rather simple: creating a label object that is displayed when the mouse hovers over the button. This can be accomplished by creating separate labels for each button or by creating a single label and changing its text and location coordinates depending on the mouse pointer position. As mentioned, once the mouse pointer exits the boundaries of the button, the label can be hidden (destroyed). This implementation of tooltip functionality is illustrated in methods showToolTips() and hideToolTips() (lines 52–82).
Observation 5.10 – tooltip: To add a tooltip to a particular object, associate a label with it and display or hide the label as the mouse hovers over or moves away from an object.
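The single-label alternative mentioned above can be sketched as follows. This is a minimal illustration rather than the script's own approach: the names attach_tooltip and demo are hypothetical, and the label is hidden with place_forget() instead of being destroyed so that it can be reused by every button.

```python
def attach_tooltip(widget, tip_label, text, x=0, y=30):
    """Show tip_label with `text` while the pointer is over `widget`."""

    def show(event=None):
        # Reuse the shared label: change its text, then place it
        tip_label.config(text=text)
        tip_label.place(x=x, y=y)

    def hide(event=None):
        # place_forget() hides the label but keeps it alive for reuse
        tip_label.place_forget()

    widget.bind("<Enter>", show)
    widget.bind("<Leave>", hide)
    return show, hide


def demo():
    # Imported here so that attach_tooltip itself needs no display
    import tkinter as tk

    root = tk.Tk()
    root.geometry("260x120")
    tip = tk.Label(root, background="cyan")  # the single shared label
    for column, name in enumerate(("Open", "Save As", "Exit")):
        button = tk.Button(root, text=name)
        button.grid(row=0, column=column)
        attach_tooltip(button, tip, "Click to " + name.lower(), x=10, y=40)
    root.mainloop()
```

Calling demo() on a machine with a display opens a small window with three buttons that share one tooltip label. Because the label is never destroyed, no global tooltip variables are needed.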
5.3.3 Popup Menus with Embedded Icons
A third way to create menus in Python is through pop-up menus. Pop-up menus are quite similar to simple menus, with the difference that they are not attached to any particular, pre-defined position, but are floating on top of the application window. The creation and configuration of pop-up menus follow the same structure as simple menus; however, they are triggered in a slightly different way (e.g., left or right click on a designated space within the application window). Pop-up menus, similarly to simple menus, can include items of various forms like text, images, combinations of both text and images, or shortcuts. They are often used in combination with menus of other types, like simple menus and toolbars, in order to improve application efficiency and make it more appealing to the user.
Observation 5.11 – pop-up: Use a pop-up menu to provide menu functionality without having to permanently display the menu within the application. Pop-up menus can be used as stand-alone menu options or in combination with simple menus and/or toolbars.
The following script implements the same functionality as the previous two examples, but uses pop-up menus instead of simple menus and/or toolbars. In this example, menu items include combinations of images and text:
# Import libraries
import tkinter as tk
from tkinter import filedialog
from tkinter import colorchooser
from tkinter import Menu
from tkinter import *
# Import the necessary image processing classes from PIL
from PIL import Image, ImageTk
global photo1, photo2, photo3, photo4
global popupmenu
# Open and resize images - load images to the menu items
def images():
    global photo1, photo2, photo3, photo4
    # Image.LANCZOS is used because Image.ANTIALIAS was removed in Pillow 10
    image1 = Image.open("OpenFile.gif")
    image1 = image1.resize((24, 24), Image.LANCZOS)
    photo1 = ImageTk.PhotoImage(image1)
    image2 = Image.open("SaveAs.gif")
    image2 = image2.resize((24, 24), Image.LANCZOS)
    photo2 = ImageTk.PhotoImage(image2)
    image3 = Image.open("ColorsDialog.gif")
    image3 = image3.resize((24, 24), Image.LANCZOS)
    photo3 = ImageTk.PhotoImage(image3)
    image4 = Image.open("Exit.gif")
    image4 = image4.resize((24, 24), Image.LANCZOS)
    photo4 = ImageTk.PhotoImage(image4)
# Define the colorDialog, openDialog, saveAsDialog, and quit functions
# to invoke the relevant dialogs or quit the application
def colorDialog():
    # Assign the user's selection of the color to a set of variables
    (rgbSelected, colorSelected) = colorchooser.askcolor()
    # Change the form color using the color part of the set of variables
    # (both values are None if the user cancels the dialog)
    if colorSelected:
        winFrame.config(background = colorSelected)
def openDialog():
    filedialog.askopenfile(title = "Open File Dialog")
def saveAsDialog():
    filedialog.asksaveasfilename(title = "Save As Dialog")
def quit():
    winFrame.destroy()
    exit()
def popupMenu(event):
    global popupmenu
    popupmenu.tk_popup(event.x_root, event.y_root)
#----------------------------------------------------------------------------
# Create non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Menus")
winFrame.resizable(False, False)
winFrame.geometry('260x220')
# Invoke the images function
images()
# Create the popup menu; each item pairs a label with its matching icon
popupmenu = tk.Menu(winFrame, tearoff = 0)
popupmenu.add_command(label = "Color dialog", image = photo3,
    compound = LEFT, command = colorDialog)
popupmenu.add_command(label = "Exit", image = photo4, compound = LEFT,
    command = quit)
popupmenu.add_separator()
popupmenu.add_command(label = "Open File dialog", image = photo1,
    compound = LEFT, command = openDialog)
popupmenu.add_command(label = "Save As dialog", image = photo2,
    compound = LEFT, command = saveAsDialog)
winFrame.bind('<Button-1>', lambda event: popupMenu(event))
winFrame.mainloop()
Output 5.3.3:
The reader should pay attention to two particular aspects of this script. Firstly, the add_cascade() method that was used in previous scripts to tie together the various menu items to the main menu system is missing. In this instance, the tk_popup() method is used instead. The method is called as a member of the popupmenu object (i.e., inside the popupMenu(event) function), and displays the pop-up menu at the current position of the mouse cursor (line 50). Secondly, it must be noted how the text and the picture are combined on the menu items (lines 63–71). Hot keys and other types of shortcuts can also be used, as described in previous sections.
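As a complement, the sketch below shows a variation of the tk_popup() call that is often seen in tkinter code: the menu is posted inside a try block and the pointer grab is released afterwards with grab_release(). The helper name make_popup_handler and the demo function are hypothetical, and binding to the right mouse button is an assumption made for illustration (the script above uses the left button).

```python
def make_popup_handler(menu):
    """Return an event handler that posts `menu` at the pointer position."""

    def handler(event):
        try:
            # x_root and y_root are screen coordinates, as tk_popup expects
            menu.tk_popup(event.x_root, event.y_root)
        finally:
            # Release the pointer grab once the menu has been posted
            menu.grab_release()

    return handler


def demo():
    # Imported here so that the handler itself needs no display
    import tkinter as tk

    root = tk.Tk()
    root.geometry("260x220")
    menu = tk.Menu(root, tearoff=0)
    menu.add_command(label="Say hello", command=lambda: print("hello"))
    menu.add_separator()
    menu.add_command(label="Exit", command=root.destroy)
    # <Button-3> is the right mouse button on Windows and most Linux
    # setups; on macOS it has commonly been reported as <Button-2>
    root.bind("<Button-3>", make_popup_handler(menu))
    root.mainloop()
```

Calling demo() on a machine with a display opens a window in which a right click posts the menu at the pointer.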
Observation 5.12 – tk_popup(), add_cascade(): Use the tk_popup(event.x_root, event.y_root) method to display the pop-up menu at the current mouse location. Note that the add_cascade() method should not be used on this occasion, in contrast to the creation of simple menus.
Observation 5.13: Use combinations of text, images, and hot keys to make the pop-up menu items more appealing and self-explanatory.
5.4 ENHANCING THE GUI EXPERIENCE
Three additional concepts can be utilized in order to further enhance the GUI experience. What
these concepts have in common is that they can be used to improve the efficiency of real estate and
memory usage of an application. Ultimately, good programming practice supports the creation of
separate, autonomous GUIs and their ability to be reused in various programs by simple calls from
the corresponding objects. This section examines these three concepts and provides some examples
of their application.
5.4.1 Notebooks and Tabbed Interfaces
As information systems grow larger in size, the management of real estate of the related applications (i.e., the creation of space that will host and display these applications) becomes increasingly important. The idea of using a menu system in its various different forms was introduced and explained in detail in previous sections. Menus offer a quite efficient way of addressing the management of real estate. An alternative way of doing so is through the use of tabbed interfaces. This approach is based on the creation of separate sub-sections inside a single window (i.e., tabs). Tabs are opened and run separately, but at the same time, they are parts of the same GUI structure. Tab-based implementations are commonly used in web browsers, where the various different web pages can be opened in separate tabs.
Observation 5.14 – Notebook(), Frame(): Use the Notebook() constructor (ttk module) to create the main object of a tabbed interface. Use the Frame() constructor (ttk module) to create each tab separately and to add them to the main object. Finally, pack() all the pieces together and load the applications in the respective tabs.
The following script combines two of the scripts covered in Chapter 4 (i.e., Buttons and Text and
Speed Control) in a single application, utilizing a tab-based implementation:
# Import libraries
import tkinter as tk
from tkinter import ttk
# Declare and initialise the global variables and widgets
# for use with the functions
currentSpeedValue, speedLimitValue, finePerKmValue = 0, 0, 0
global speedLimitSpinbox
global finePerKmScale
global currentSpeedScale
global fine
global tab1, tab2
global winLabel
global winButton
# ===========================================================
# Functions related to the tab2 application of Speed Control
# ===========================================================
# Define the functions that will create the application interface
def createGUITab2():
    currentSpeedFrame()
    speedLimitFrame()
    finePerKmFrame()
    fineFrame()
# Define function to control changes in the Current Speed Scale widget
def onScale(val):
    global currentSpeedValue
    currentSpeedValue.set(float(val))
    calculateFine()
# Define function to control changes in the Speed Limit Spinbox widget
def getSpeedLimit():
    global speedLimitValue
    speedLimitValue.set(float(speedLimitSpinbox.get()))
    calculateFine()
# Define function to control changes in the Fine per Km Scale widget
def getFinePerKm(val):
    global finePerKmValue
    finePerKmValue.set(int(float(val)))
    calculateFine()
# Define function to calculate Fine based on user input
def calculateFine():
    global currentSpeedValue, speedLimitValue, finePerKmValue
    global fine
    diff = float(currentSpeedValue.get()) - float(speedLimitValue.get())
    finePerKm = float(finePerKmValue.get())
    if (diff <= 0):
        fine.config(text = 'No fine')
    else:
        fine.config(text = 'Fine in USD: '+ str(diff * finePerKm))
# Add the Current Speed widgets to tab2
def currentSpeedFrame():
    global currentSpeedValue
    # Create the prompt label for the Current Speed tab
    currentSpeed = tk.Label(tab2, text = 'Current speed:', width = 24)
    currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
        font = 'Arial 14 bold')
    currentSpeed.grid(column = 0, row = 0)
    # Create Scale widget and define connection variable
    currentSpeedValue = tk.DoubleVar()
    currentSpeedScale = tk.Scale(tab2, length = 200, from_ = 0, to = 360)
    currentSpeedScale.config(resolution = 0.5,
        activebackground = 'dark blue', orient = 'horizontal')
    currentSpeedScale.config(bg = 'light blue', fg = 'red',
        troughcolor = 'cyan', command = onScale)
    currentSpeedScale.grid(column = 1, row = 0)
    currentSpeedSelected = tk.Label(tab2, text = '...',
        textvariable = currentSpeedValue)
    currentSpeedSelected.grid(column = 2, row = 0)
# Add the Speed Limit widgets to tab2
def speedLimitFrame():
    global speedLimitValue
    global speedLimitSpinbox
    # Create the prompt label for the Speed Limit tab
    speedLimit = tk.Label(tab2, text = 'Speed Limit:', width = 24)
    speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2,
        font = 'Arial 14 bold')
    speedLimit.grid(column = 0, row = 1)
    # Create the Spinbox widget and define variable to connect
    # to Spinbox widget
    speedLimitValue = tk.DoubleVar()
    speedLimitSpinbox = ttk.Spinbox(tab2, from_ = 0, to = 360,
        command = getSpeedLimit)
    speedLimitSpinbox.grid(column = 1, row = 1)
    speedLimitSelected = tk.Label(tab2, text = '...',
        textvariable = speedLimitValue)
    speedLimitSelected.grid(column = 2, row = 1)
# Add the Fine per Km widgets to tab2
def finePerKmFrame():
    global finePerKmValue
    # Create the prompt label for the Fine per Km tab
    finePerKm = tk.Label(tab2, text = 'Fine/Km overspeed (USD):', width = 24)
    finePerKm.config(bg = 'light blue', fg = 'brown', bd = 2,
        font = 'Arial 14 bold')
    finePerKm.grid(column = 0, row = 2)
    # Create Scale widget and define variable to connect to Scale widget
    finePerKmValue = tk.IntVar()
    finePerKmScale = ttk.Scale(tab2, orient = 'horizontal', length = 200,
        from_ = 0, to = 100, command = getFinePerKm)
    finePerKmScale.grid(column = 1, row = 2)
    finePerKmSelected = tk.Label(tab2, text = '...',
        textvariable = finePerKmValue)
    finePerKmSelected.grid(column = 2, row = 2)
# Add the Fine for speeding label to tab2
def fineFrame():
    global fine
    # Create the label that will display the fine on the Fine tab
    fine = tk.Label(tab2, text = 'Fine in USD:...', fg = 'blue')
    fine.grid(column = 0, row = 3)
# ===========================================================
# The functions related to the tab1 application (button and text)
# ===========================================================
# Define the function that will control the mouse click events
def changeText(a):
    global winLabel
    winLabel.config(text = a)
# Define the function that will create the GUI for the tab1
def createGUITab1():
    global winButton
    global winLabel
    winLabel = tk.Label(tab1, text = "...")
    winLabel.grid(column = 1, row = 0)
    # Create the button widget and bind it with the associated events
    winButton = tk.Button(tab1, text = "Left, right, or double left Click "
        "\nto change the text of the label", font = "Arial 16", fg = "red")
    winButton.grid(column = 0, row = 0)
    winButton.bind("<Button-1>", lambda event, \
        a = "You left clicked on the button": changeText(a))
    # <Button-3> is the right button on Windows and most Linux setups,
    # where <Button-2> is the middle button (macOS may report <Button-2>)
    winButton.bind("<Button-3>", lambda event, \
        a = "You right clicked on the button": changeText(a))
    winButton.bind("<Double-Button-1>", lambda event, \
        a = "You double left clicked on the button": changeText(a))
    winButton.bind("<Enter>", lambda event, \
        a = "You are hovering above the button": changeText(a))
    winButton.bind("<Leave>", lambda event, \
        a = "You left the button widget": changeText(a))
# ===========================================================
# Create resizable window frame using the tk object
winFrame = tk.Tk()
winFrame.title("Tabs")
winFrame.resizable(True, True)
winFrame.geometry('500x180')
# Create notebook with tab pages
tabbedInterface = ttk.Notebook(winFrame)
tab1 = ttk.Frame(tabbedInterface)
tabbedInterface.add(tab1, text = "Buttons and Text")
tab2 = ttk.Frame(tabbedInterface)
tabbedInterface.add(tab2, text = "Speed control")
tabbedInterface.pack()
# Invoke the 2 functions to create the different GUIs for the 2 tabs
createGUITab1()
createGUITab2()
winFrame.mainloop()
Output 5.4.1:
As shown in the output, the application implements an interface with two tabs, one hosting the Buttons and Text application and the other the Speed Control application. A few key points are worth raising in this example. Firstly, the tabs allow for a more efficient use of the real estate, since the two separate applications run simultaneously in a single window, but are displayed independently from each other. Secondly, the tabbed interface is created through the Notebook() constructor of the ttk module (line 158). The two tabs are created using the Frame() constructor of the ttk module (lines 159 and 161) and are associated with the main notebook object by being added to it (lines 160 and 162). All the components are packed together in line 163. Ultimately, the contents of the tabs are created by means of the relevant GUI calls in lines 166 and 167.
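The Notebook/Frame/add/pack sequence described above can be condensed into a small helper. The sketch below is only an illustration under stated assumptions: populate_notebook and demo are hypothetical names, and the frame constructor is passed in as a parameter so that the helper can be exercised without a display.

```python
def populate_notebook(notebook, frame_factory, tab_builders):
    """Create one tab per entry of tab_builders (title -> builder function).

    frame_factory is called with the notebook as parent (e.g. ttk.Frame);
    each builder receives its own frame and adds widgets to it.
    """
    frames = {}
    for title, build in tab_builders.items():
        frame = frame_factory(notebook)
        notebook.add(frame, text=title)   # register the frame as a tab
        build(frame)                      # load the tab's own GUI
        frames[title] = frame
    return frames


def demo():
    # Imported here so that populate_notebook itself needs no display
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    root.title("Tabs")
    notebook = ttk.Notebook(root)
    populate_notebook(notebook, ttk.Frame, {
        "First": lambda tab: tk.Label(tab, text="Tab one").pack(),
        "Second": lambda tab: tk.Button(tab, text="Tab two").pack(),
    })
    notebook.pack(expand=True, fill="both")
    root.mainloop()
```

Keeping each tab's widgets inside its own builder function mirrors the createGUITab1() and createGUITab2() calls of the script, while avoiding module-level tab variables.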
There are two main differences between the way the applications are used in this example and in
the original implementations presented in Chapter 4. The first is that, in both cases, the applications
are converted to a completely procedural format, making full use of methods for all the required
functionality and without any statements being added to the main body of the program. The second
is that the Speed Control application is somewhat simplified, as the control variables associated
with the Scale and Spinbox widgets and their respective labels are removed in order to avoid
possible referencing issues between the various methods.
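One possible way to reduce such referencing issues, offered here as an assumption rather than as the approach taken in the script, is to keep the arithmetic in plain functions that receive values as arguments and touch no widgets. The names compute_fine and fine_message are hypothetical.

```python
def compute_fine(current_speed, speed_limit, fine_per_km):
    """Return the fine in USD, or 0.0 when the speed is within the limit."""
    excess = float(current_speed) - float(speed_limit)
    if excess <= 0:
        return 0.0
    return excess * float(fine_per_km)


def fine_message(current_speed, speed_limit, fine_per_km):
    """Build the text that a label such as `fine` could display."""
    if float(current_speed) <= float(speed_limit):
        return "No fine"
    fine = compute_fine(current_speed, speed_limit, fine_per_km)
    return "Fine in USD: " + str(fine)


print(fine_message(120, 100, 5))     # Fine in USD: 100.0
print(fine_message("90.5", 100, 5))  # No fine
```

A callback would then only read the widget values and pass them on, for example fine.config(text = fine_message(a, b, c)), where a, b, and c stand for the three values read from the widgets.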
5.4.2 Threaded Applications
One of the most important concepts in programming, and arguably among the most effective tools when creating real-life applications, is that of threads and threading. The idea behind threads is rather straightforward: multiple parts or instances of an application can run as independent flows of execution within the same program. One way to conceptualize threads is to view them as different objects of the same class. Indeed, this is a rather accurate description, with the additional element that each thread is scheduled separately by the operating system. One of the main characteristics of threaded applications is that they are meant to run in parallel. In reality, even in the case of using multi-core computer systems, this is not entirely feasible, but this is a rather specialized computer architecture consideration that exceeds the scope of this book.
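Before moving to a GUI example, the basic mechanics can be illustrated without any widgets. In the minimal sketch below, the class name Counter is hypothetical; the work of each thread goes into its run() method, start() launches it, and join() waits for it to finish.

```python
import threading


class Counter(threading.Thread):
    """A thread that sums the integers 1..limit in its run() method."""

    def __init__(self, limit):
        super().__init__()      # let Thread set up its own state first
        self.limit = limit
        self.total = 0

    def run(self):
        # start() arranges for run() to execute in the new thread
        for value in range(1, self.limit + 1):
            self.total += value


workers = [Counter(10), Counter(100)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()               # wait until each run() has returned

print([worker.total for worker in workers])  # [55, 5050]
```

Each object keeps its own limit and total, which is the sense in which threads resemble different objects of the same class.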
Observation 5.15 – threads: Create different threads from objects of the same class. Threads are separate and independent flows of execution, and can run concurrently or sequentially. The threads of one program share the memory of its process, although each thread has its own call stack.
In the following example, the SpeedControl application from Chapter 4 is converted to a class, for the purpose of demonstrating the implementation of threads. The script creates two objects of the SpeedControl class, and runs them separately on two different threads:
# Import modules tk and ttk
import tkinter as tk
from tkinter import ttk
import threading
class SpeedControl(threading.Thread):
    # Create and run the main window frame for the application
    def __init__(self, winFrame):
        super(SpeedControl, self).__init__()
        self.winFrame = winFrame
        self.winFrame.title("Control speed")
        self.winFrame.config(bg = 'light grey')
        self.winFrame.resizable(False, False)
        self.winFrame.geometry('500x170')
        # Create the frame, label and scale widgets for currentSpeed
        self.currentSpeedFrame = tk.Frame(self.winFrame,
            bg = 'light grey', bd = 2, relief = 'sunken')
        self.currentSpeedFrame.pack()
        self.currentSpeedFrame.place(relx = 0.05, rely = 0.05)
        self.currentSpeed = tk.Label(self.currentSpeedFrame,
            text = 'Current speed:', width = 24)
        self.currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
            font = 'Arial 14 bold')
        self.currentSpeed.grid(column = 0, row = 0)
        self.currentSpeedScale = tk.Scale(self.currentSpeedFrame,
            length = 200, from_ = 0, to = 360)
        self.currentSpeedScale.config(resolution = 1,
            orient = 'horizontal', activebackground = 'dark blue')
        self.currentSpeedScale.config(bg = 'light blue', fg = 'red',
            troughcolor = 'cyan', command = self.onScale)
        self.currentSpeedScale.grid(column = 1, row = 0)
        self.currentSpeedSel = tk.Label(self.currentSpeedFrame,
            text = '...')
        self.currentSpeedSel.grid(column = 2, row = 0)
        # Create the frame, label, & spinbox widget for the speedLimit
        self.speedLimitFrame = tk.Frame(self.winFrame,
            bg = 'light yellow', bd = 4, relief = 'sunken')
        self.speedLimitFrame.pack()
        self.speedLimitFrame.place(relx = 0.05, rely = 0.30)
        self.speedLimit = tk.Label(self.speedLimitFrame,
            text = 'Speed limit:', width = 24)
        self.speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2,
            font = 'Arial 14 bold')
        self.speedLimit.grid(column = 0, row = 0)
        self.speedLimitSpinbox = ttk.Spinbox(self.speedLimitFrame,
            from_ = 0, to = 360, command = self.getSpeedLimit)
        self.speedLimitSpinbox.grid(column = 1, row = 0)
        self.speedLimitSel = tk.Label(self.speedLimitFrame, text = '...')
        self.speedLimitSel.grid(column = 2, row = 0)
        # Create the frame, label, and scale widget for finePerKm
        self.finePerKmFrame = tk.Frame(self.winFrame,
            bg = 'light grey', bd = 2, relief = 'sunken')
        self.finePerKmFrame.pack()
        self.finePerKmFrame.place(relx = 0.05, rely = 0.55)
        self.finePerKm = tk.Label(self.finePerKmFrame,
            text = 'Fine/Km overspeed (USD):', width = 24)
        self.finePerKm.config(bg = 'light blue', fg = 'red', bd = 2,
            font = 'Arial 14 bold')
        self.finePerKm.grid(column = 0, row = 0)
        self.finePerKmScale = tk.Scale(self.finePerKmFrame,
            length = 200, from_ = 0, to = 100)
        self.finePerKmScale.config(resolution = 1,
            activebackground = 'dark blue', orient = 'horizontal')
        self.finePerKmScale.config(bg = 'light cyan', fg = 'red',
            troughcolor = 'light blue', command = self.getFinePerKm)
        self.finePerKmScale.grid(column = 1, row = 0)
        self.finePerKmSel = tk.Label(self.finePerKmFrame, text = '...')
        self.finePerKmSel.grid(column = 2, row = 0)
        # Create the frame for the fine and the related label
        self.fineFrame = tk.Frame(self.winFrame, bg = 'yellow', bd = 4,
            relief = 'raised')
        self.fineFrame.pack()
        self.fineFrame.place(relx = 0.05, rely = 0.80)
        self.fine = tk.Label(self.fineFrame, text = 'Fine in USD:...',
            fg = 'blue')
        self.fine.grid(column = 0, row = 0)
    # Define function to control changes in Current Speed Scale widget
    def onScale(self, val):
        v = int(float(val))
        self.currentSpeedSel.config(text = v)
        self.calculateFine()
    # Define function to control changes in Speed Limit Spinbox widget
    def getSpeedLimit(self):
        v = self.speedLimitSpinbox.get()
        self.speedLimitSel.config(text = v)
        self.calculateFine()
    # Define function to control changes in Fine per Km Scale widget
    def getFinePerKm(self, val):
        v = int(float(val))
        self.finePerKmSel.config(text = v)
        self.calculateFine()
    # Define function to calculate the Fine based on user input
    def calculateFine(self):
        currentSpeed, speedLimit, finePerKm = 0, 0.0, 0
        # Ensure relevant objects are initiated & assigned with values
        if (self.currentSpeedScale.get() != ''
                and self.speedLimitSpinbox.get() != ''
                and self.finePerKmScale.get() != ''):
            currentSpeed = self.currentSpeedScale.get()
            speedLimit = float(self.speedLimitSpinbox.get())
            finePerKm = self.finePerKmScale.get()
        else:
            currentSpeed, finePerKm = 0, 0; speedLimit = 0.0
        # Calculate the fine and display it on the associated label
        diff = currentSpeed - speedLimit
        if (diff <= 0):
            self.fine.config(text = 'No fine')
        else:
            self.fine.config(text = 'Fine in USD: ' + str(diff * finePerKm))
# Create two different GUI frames
winFrame1 = tk.Tk()
winFrame2 = tk.Tk()
# Create two different threads - one for each GUI frame
speedControl1 = SpeedControl(winFrame1)
speedControl2 = SpeedControl(winFrame2)
# Start each thread/frame and run it separately
speedControl1.start()
winFrame1.mainloop()
speedControl2.start()
winFrame2.mainloop()
Output 5.4.2:
The output illustrates how this particular application runs the two different objects in separate threads. It must be noted that the threads are running concurrently. The term in parallel should be avoided in this context, as it is uncertain whether the threads are indeed executing at the very same instant. This is also something that can be affected by the operating system, the hardware and software settings, and the associated behaviors. Nevertheless, from the perspective of the user, this is of purely academic interest. As shown in the example above, the two threaded objects appear to run in parallel, but at the same time they function independently and use different inputs as if they were run sequentially.
The order of statements between lines 121 and 133 is also important. Firstly, the two GUI window frames are created as normal. If the first GUI frame was to be created directly followed by the first threaded object, and before the second GUI frame and threaded object, the user would only get access to the first window frame. The second window frame would only appear once the first one was closed and stopped. The reader should also notice that the threading module needs to be imported before the calls to the start() methods of the threaded objects (i.e., speedControl1 and speedControl2).
Observation 5.16 – Threads: Use the Thread class from the threading module to create threaded objects. Use the start() method to start the threads; the Thread class provides no stop() method, so a thread ends when its run() method returns. Always use the self parameter on all widgets and attributes to refer to the specific object they belong to.
Observation 5.17: Avoid using control variables (e.g., IntVar()) in threaded objects.
It must be noted that each threaded object is assigned to a separate window frame and has a dedicated mainloop() method to monitor its GUI and the associated events (lines 125–130 and 126–133). This assignment is taking place in lines 125–126, where the window frame for each threaded object is passed as a parameter, and used on the specific, independent GUI for the underlying object.
Observation 5.18: In cases of GUI-based threaded objects, use the mainloop() method for monitoring each object.
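Since threading.Thread itself does not define a stop() method, a thread is usually asked to finish in a cooperative way. The sketch below shows one common arrangement based on threading.Event; the class name Ticker and its request_stop() method are hypothetical and are not part of the script discussed here.

```python
import threading
import time


class Ticker(threading.Thread):
    """Counts ticks until it is asked to stop."""

    def __init__(self, interval=0.01):
        super().__init__()
        self.interval = interval
        self.ticks = 0
        self._stop_requested = threading.Event()

    def run(self):
        # wait() returns False on timeout and True once the event is set
        while not self._stop_requested.wait(self.interval):
            self.ticks += 1

    def request_stop(self):
        # Only sets a flag; run() notices it and returns on its own
        self._stop_requested.set()


ticker = Ticker()
ticker.start()
time.sleep(0.1)          # let the thread work for a moment
ticker.request_stop()
ticker.join()            # returns once run() has finished
print(ticker.is_alive())  # False
```

The thread is never interrupted from outside: it checks the flag at a point of its own choosing and then leaves run() normally.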
Another notable aspect of the script is the explicit definition of the __init__(self, winFrame) method, which first calls super(SpeedControl, self).__init__() and then loads the GUI widgets onto the window frames of each of the threaded objects. The reader should be reminded here that the __init__() method is called automatically by Python when an object is created, in order to initialize the necessary widgets and attributes in preparation of launching the object. The self parameter is necessary in order for the Python interpreter to distinguish which object is running and what widgets and attributes belong to it. This is the reason why each widget and attribute, and even simple variables that belong to the object, are preceded by the self parameter.
Another key point in this particular script is that, since the object that is being created is threaded, its class inherits from the Thread class of the threading module (line 6) and initializes that parent class through the super() call (line 10). The latter call is executed each time a new threaded object is created.
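The role of the call to the parent initializer can be seen in a small experiment. In the sketch below, both class names are hypothetical: the first class omits the call on purpose, and its start() method is expected to fail with a RuntimeError, while the second class starts normally.

```python
import threading


class Forgetful(threading.Thread):
    def __init__(self, tag):
        # The call to super().__init__() is deliberately omitted here
        self.tag = tag


class Careful(threading.Thread):
    def __init__(self, tag):
        super().__init__()   # initialize the Thread machinery first
        self.tag = tag
        self.ran = False

    def run(self):
        self.ran = True


try:
    Forgetful("a").start()
except RuntimeError as error:
    print("start() failed:", error)

worker = Careful("b")
worker.start()
worker.join()
print(worker.ran)  # True
```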
Finally, the reader should note that the control variables (e.g., IntVar()) are missing from this
version of the code. This was done on purpose, as their inclusion could cause unnecessary conflicts
between the threaded objects and the cross-method operations within any single threaded object,
without offering any particular benefits to the application. In general, it is advisable that control
variables on widgets are avoided, especially when implementing object-oriented and/or threaded
object applications.
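A related arrangement that is commonly used with tkinter, shown here as an assumption and not as something the script above does, keeps all widget updates in the thread that runs mainloop() and lets worker threads hand over their results through a queue.Queue. The function names worker and drain are hypothetical.

```python
import queue
import threading


def worker(task_id, results):
    """Runs in a background thread and never touches any widget."""
    results.put((task_id, task_id * task_id))


def drain(results):
    """Collect everything currently queued, without blocking."""
    items = []
    while True:
        try:
            items.append(results.get_nowait())
        except queue.Empty:
            return items


results = queue.Queue()
threads = [threading.Thread(target=worker, args=(n, results))
           for n in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sorted(drain(results)))  # [(0, 0), (1, 1), (2, 4)]
```

In a GUI program, drain() would typically be called from a function that the main thread reschedules with the after() method of the window, and that function would update the labels through config(text = ...).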
5.4.3 Combining Multiple Concepts and Applications in a Multithreaded System
Chapters 2–5 of this book provide a gradual progression from basic programming skills to more
advanced application development concepts. Although there are certainly many more concepts and
layers of depth to be explored when it comes to programming in Python, Chapters 2–5 should provide a solid basis for the aspiring programmer, as they cover the necessary building blocks required
to make functional and well-structured applications. As a conclusion to this conceptual sub-section
of this book, it was deemed necessary to provide an overview of how the concepts, mechanisms,
and practices presented so far can be integrated into a coherent, centralized solution. Ultimately,
this should provide an idea of how a multithreaded and multi-functional information system can
be built, resembling the scenarios and challenges one may face in real life. The example presented
below combines two of the applications developed earlier (Speed Control and Bubble Sort)
into a multithreaded system that can be launched and operated as a single, unified platform. In order
for this to be possible, two changes are required:
a. Each of the two individual applications (Speed Control and Bubble Sort) must be adjusted according to the object-oriented paradigm. This is done by separating and extracting the main code that is responsible for the GUI creation and all related methods, and saving the remaining code as separate text files in Jupyter. By doing so, the original applications cannot be run separately, as there is no actual object being created in the remaining code. Instead of creating the object within the main body of each application, this is done through a call from another application, which now functions as the main application.
b. The code that was extracted from the original applications must be imported to this newly
created application.
The code examples presented and discussed in the following pages provide a practical illustration
of these changes:
Chapter5SpeedControl.py
# Import modules tk and ttk
import tkinter as tk
from tkinter import ttk
import threading
class SpeedControl(threading.Thread):
    # Create and run the main window frame for the application
    def __init__(self, winFrame):
        super(SpeedControl, self).__init__()
        self.winFrame = winFrame
        self.winFrame.title("Control speed")
        self.winFrame.config(bg = 'light grey')
        self.winFrame.resizable(False, False)
        self.winFrame.geometry('500x170')
        # Create frame for currentSpeed & its label and scale widgets
        self.currentSpeedFrame = tk.Frame(self.winFrame,
            bg = 'light grey', bd = 2, relief = 'sunken')
        self.currentSpeedFrame.pack()
        self.currentSpeedFrame.place(relx = 0.05, rely = 0.05)
        self.currentSpeed = tk.Label(self.currentSpeedFrame,
            text = 'Current speed:', width = 24)
        self.currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
            font = 'Arial 14 bold')
        self.currentSpeed.grid(column = 0, row = 0)
        self.currentSpeedScale = tk.Scale(self.currentSpeedFrame,
            length = 200, from_ = 0, to = 360)
        self.currentSpeedScale.config(resolution = 1,
            activebackground = 'dark blue', orient = 'horizontal')
        self.currentSpeedScale.config(bg = 'light blue', fg = 'red',
            troughcolor = 'cyan', command = self.onScale)
        self.currentSpeedScale.grid(column = 1, row = 0)
        self.currentSpeedSel = tk.Label(self.currentSpeedFrame,
            text = '...')
        self.currentSpeedSel.grid(column = 2, row = 0)
        # Create frame for speedLimit & its label and spinbox widgets
        self.speedLimitFrame = tk.Frame(self.winFrame,
            bg = 'light yellow', bd = 4, relief = 'sunken')
        self.speedLimitFrame.pack()
        self.speedLimitFrame.place(relx = 0.05, rely = 0.30)
        self.speedLimit = tk.Label(self.speedLimitFrame,
            text = 'Speed limit:', width = 24)
        self.speedLimit.config(bg = 'light blue', fg = 'yellow',
            bd = 2, font = 'Arial 14 bold')
        self.speedLimit.grid(column = 0, row = 0)
        self.speedLimitSpinbox = ttk.Spinbox(self.speedLimitFrame,
            from_ = 0, to = 360, command = self.getSpeedLimit)
        self.speedLimitSpinbox.grid(column = 1, row = 0)
        self.speedLimitSel = tk.Label(self.speedLimitFrame, text = '...')
        self.speedLimitSel.grid(column = 2, row = 0)
# Create frame for finePerKm and its label and scale widgets
self.finePerKmFrame = tk.Frame(self.winFrame,
bg = 'light grey', bd = 2, relief = 'sunken')
self.finePerKmFrame.pack()
self.finePerKmFrame.place(relx = 0.05, rely = 0.55)
self.finePerKm = tk.Label(self.finePerKmFrame,
text = 'Fine/Km overspeed (USD):', width = 24)
self.finePerKm.config(bg = 'light blue', fg = 'red', bd = 2,
font = 'Arial 14 bold')
self.finePerKm.grid(column = 0, row = 0)
self.finePerKmScale = tk.Scale(self.finePerKmFrame,
length = 200, from_ = 0, to = 100)
self.finePerKmScale.config(resolution = 1,
activebackground = 'dark blue', orient = 'horizontal')
self.finePerKmScale.config(bg = 'light cyan', fg = 'red',
troughcolor = 'light blue', command = self.getFinePerKm)
192
Handbook of Computer Programming with Python
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
self.finePerKmScale.grid(column = 1, row = 0)
self.finePerKmSel=tk.Label(self.finePerKmFrame, text = '...')
self.finePerKmSel.grid(column = 2, row = 0)
# Create the frame for Fine and its label
self.fineFrame = tk.Frame(self.winFrame, bg = 'yellow', bd = 4,
relief = 'raised')
self.fineFrame.pack()
self.fineFrame.place(relx = 0.05, rely = 0.80)
self.fine = tk.Label(self.fineFrame, text = 'Fine in USD:...',
fg = 'blue')
self.fine.grid(column = 0, row = 0)
# Define function to control changes in CurrentSpeedScale widget
def onScale(self, val):
v = int(float(val))
self.currentSpeedSel.config(text = v)
self.calculateFine()
# Define function to control changes in SpeedLimitSpinbox widget
def getSpeedLimit(self):
v = self.speedLimitSpinbox.get()
self.speedLimitSel.config(text = v)
self.calculateFine()
# Define function to control changes in FineperKm Spinbox widget
def getFinePerKm(self, val):
v = int(float(val))
self.finePerKmSel.config(text = v)
self.calculateFine()
# Define the function to calculate the Fine based on user input
def calculateFine(self):
currentSpeed, speedLimit, finePerKm = 0, 0.0, 0
# Make sure the objects are initiated and assigned with values
if (self.currentSpeedScale.get()!= ''
and self.speedLimitSpinbox.get()!= ''
and self.finePerKmScale.get()!= ''):
currentSpeed = self.currentSpeedScale.get()
speedLimit = float(self.speedLimitSpinbox.get())
finePerKm = self.finePerKmScale.get()
else:
currentSpeed, finePerkKm = 0, 0; speedLimit = 0.0
# Calculate the fine and display it on the associated label
diff = currentSpeed - speedLimit
if (diff <= 0):
self.fine.config(text = 'No fine')
else:
self.fine.config(text='Fine in USD: '+str(diff*finePerKm))
Application Development
193
In the class presented above, the statements that create and run the GUI have already been separated and extracted, ready to be imported into the main application that will eventually create the
multithreaded objects. Apart from extracting these particular statements, the class implements the
SpeedControl application as discussed in the previous section. The class needs to be saved as a text
file with the .py extension.
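The arithmetic inside calculateFine() can also be examined without any GUI. The following is a minimal sketch under that assumption; the function names calculate_fine and fine_message are hypothetical helpers introduced for illustration and are not part of the SpeedControl class:

```python
def calculate_fine(current_speed, speed_limit, fine_per_km):
    """Return the fine in USD, or 0 when the speed limit is not exceeded."""
    diff = current_speed - speed_limit
    if diff <= 0:
        return 0
    return diff * fine_per_km


def fine_message(current_speed, speed_limit, fine_per_km):
    """Build the label text in the same way as calculateFine() does."""
    diff = current_speed - speed_limit
    if diff <= 0:
        return 'No fine'
    return 'Fine in USD: ' + str(diff * fine_per_km)


# The speed limit is a float, as it is after float(spinbox.get()) in the class
print(fine_message(120, 100.0, 5))   # Fine in USD: 100.0
print(fine_message(80, 100.0, 5))    # No fine
```

Keeping the calculation in a plain function like this makes it possible to check the rule (no fine at or below the limit, a per-kilometre charge above it) independently of the widgets that supply the values.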
Chapter5BubbleSort.py
# Import modules tkinter, random, time and threading
import tkinter as tk
from tkinter import ttk
from tkinter import *
import random
import time
import threading

class BubbleSort(threading.Thread):

    # Initialise the various lists used by the objects of the class
    unsortedL = []; sortedL = []; statisticsData = []
    sizes = [5, 20, 100, 250, 500, 750, 1000, 2000, 5000, 10000, 20000]

    # Create and run the main window frame for the application
    def __init__(self, winFrame):
        super(BubbleSort, self).__init__()
        self.winFrame = winFrame
        self.winFrame.title("Bubble Sort")
        self.winFrame.config(bg = 'light grey')
        self.winFrame.resizable(True, True)
        self.winFrame.geometry('650x300')
        self.listSize = 0
        # Timer values remain 0.0 until a sort has been performed
        self.startTime = 0.0; self.endTime = 0.0
        self.createGUI()

    # Define the functions that will create the application GUI
    def createGUI(self):
        self.unsortedFrame()
        self.entryFrame()
        self.entryButton()
        self.sortButton()
        self.sortedFrame()
        self.clearButton()
        self.statisticsButton()
        self.statisticsSelection()

    # Create labelframe; populate with Unsorted Array Listbox widgets
    def unsortedFrame(self):
        self.UnsortedFrame = tk.LabelFrame(self.winFrame,
            text = 'Unsorted Array')
        self.UnsortedFrame.config(bg = 'light grey', fg = 'blue',
            bd = 2, relief = 'sunken')
        # Create a scrollbar widget to attach to UnsortedList
        self.UnsortedListScrollBar = Scrollbar(self.UnsortedFrame,
            orient = VERTICAL)
        self.UnsortedListScrollBar.pack(side = RIGHT, fill = Y)
        # Create a listbox in the Unsorted Array frame
        self.UnsortedList = tk.Listbox(self.UnsortedFrame,
            yscrollcommand = self.UnsortedListScrollBar.set,
            bg = 'cyan', width = 13, height = 12, bd = 0)
        self.UnsortedList.pack(side = LEFT, fill = BOTH)
        # Associate the scrollbar command with its parent widget
        # (i.e., the UnsortedList yview)
        self.UnsortedListScrollBar.config(command =
            self.UnsortedList.yview)
        # Place the Unsorted frame & its components into the interface
        self.UnsortedFrame.pack()
        self.UnsortedFrame.place(relx = 0.02, rely = 0.05)

    # Create the labelframe that will contain the Entry widget
    def entryFrame(self):
        self.EntryFrame = tk.LabelFrame(self.winFrame, text = 'Actions')
        self.EntryFrame.config(bg = 'light grey', fg = 'red', bd = 2,
            relief = 'sunken')
        self.EntryFrame.pack()
        self.EntryFrame.place(relx = 0.25, rely = 0.05)
        # Create the label in the Entry frame
        self.EntryLabel = tk.Label(self.EntryFrame,
            text = 'How many integers\nin the list', width = 16)
        self.EntryLabel.config(bg = 'light grey', fg = 'red', bd = 3,
            relief = 'flat', font = 'Arial 14 bold')
        self.EntryLabel.grid(column = 0, row = 0)
        # Create combo box to select the number of elements in lists
        self.ListSizeCombo = ttk.Combobox(self.EntryFrame, width = 10)
        self.ListSizeCombo['values'] = self.sizes
        self.ListSizeCombo.current(0)
        self.ListSizeCombo.grid(column = 1, row = 0)

    # Create the button that will insert new entries into the unsorted
    # array and list box
    def entryButton(self):
        self.EntryButton = tk.Button(self.EntryFrame, relief = 'raised',
            text = 'Populate\nUnsorted list', width = 16)
        self.EntryButton.bind('<Button-1>',
            lambda event: self.populateUnsortedList())
        self.EntryButton.grid(column = 0, row = 2)

    # Populate the unsorted list with random numbers and populate
    # the unsorted list box
    def populateUnsortedList(self):
        self.listSize = int(self.ListSizeCombo.get())
        # Generate random integers with randint() from the random module
        for i in range(self.listSize):
            n = random.randint(-100, 100)
            # Enter the generated random integer to the relevant place
            # in the unsorted list
            self.unsortedL.insert(i, n)
        # Populate UnsortedList with the unsorted list elements
        for i in range(0, self.listSize):
            self.UnsortedList.insert(i, self.unsortedL[i])
        self.UnsortedListScrollBar.config(command =
            self.UnsortedList.yview)

    # Create the button that will sort the numbers and display them
    # in the sorted array and list box
    def sortButton(self):
        self.SortButton = tk.Button(self.EntryFrame, relief = 'raised',
            text = 'Sort numbers\nwith BubbleSort', width = 16)
        self.SortButton.bind('<Button-1>',
            lambda event: self.sortToSortedList())
        self.SortButton.grid(column = 1, row = 2)

    # Create the labelframe to include the Sorted Array Listbox widgets
    def sortedFrame(self):
        self.SortedFrame = tk.LabelFrame(self.winFrame,
            text = 'Sorted Array')
        self.SortedFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
            relief = 'sunken')
        # Create a scrollbar widget to attach to the SortedList
        self.SortedListScrollBar = Scrollbar(self.SortedFrame)
        self.SortedListScrollBar.pack(side = RIGHT, fill = Y)
        # Create the list box in the Sorted Array frame
        self.SortedList = tk.Listbox(self.SortedFrame,
            yscrollcommand = self.SortedListScrollBar.set,
            bg = 'cyan', width = 13, height = 12, bd = 0)
        self.SortedList.pack(side = LEFT, fill = BOTH)
        # Associate the scrollbar command with its parent widget
        # (i.e., the SortedList yview)
        self.SortedListScrollBar.config(command =
            self.SortedList.yview)
        # Place the Sorted frame and its parts into the interface
        self.SortedFrame.pack()
        self.SortedFrame.place(relx = 0.75, rely = 0.05)

    # Bubble Sort sorts the list & records information for later use
    def sortToSortedList(self):
        # Load unsorted list & list box to the sorted list & list box
        for i in range(0, self.listSize):
            self.sortedL.insert(i, self.unsortedL[i])
        # Start timer
        self.startTime = time.process_time()
        # The Bubble sort algorithm
        for i in range(self.listSize-1):
            for j in range(self.listSize-1):
                if (self.sortedL[j] > self.sortedL[j+1]):
                    temp = self.sortedL[j]
                    self.sortedL[j] = self.sortedL[j+1]
                    self.sortedL[j+1] = temp
        # End timer
        self.endTime = time.process_time()
        # Load the sorted list to the relevant list box
        for i in range(0, self.listSize):
            self.SortedList.insert(i, self.sortedL[i])
        self.SortedListScrollBar.config(command = self.SortedList.yview)

    # Create button that will clear the two list boxes & the two lists
    def clearButton(self):
        self.ClearButton = tk.Button(self.EntryFrame,
            text = 'Clear lists', relief = 'raised', width = 16)
        self.ClearButton.bind('<Button-1>',
            lambda event: self.clearLists())
        self.ClearButton.grid(column = 0, row = 3)

    # Clear all lists, list & combo boxes, & the related list size
    def clearLists(self):
        self.sortedL.clear()
        self.unsortedL.clear()
        self.UnsortedList.delete('0', 'end')
        self.SortedList.delete('0', 'end')
        self.statisticsData.clear()
        self.StatisticsCombo.delete('0', 'end')
        self.listSize = 0

    # Create the button that will display sorting information
    def statisticsButton(self):
        self.StatisticsButton = tk.Button(self.EntryFrame,
            text = 'Show statistics', relief = 'raised', width = 16)
        self.StatisticsButton.bind('<Button-1>',
            lambda event: self.statistics())
        self.StatisticsButton.grid(column = 1, row = 3)

    # Create the combo box that will show the statistical results
    # from the sorting process
    def statisticsSelection(self):
        self.StatisticsSelection = tk.StringVar()
        self.statisticsData = ['The statistics will appear here']
        self.StatisticsSelection.set(self.statisticsData[0])
        self.StatisticsCombo = ttk.Combobox(self.EntryFrame,
            textvariable = self.StatisticsSelection, width = 30)
        self.StatisticsCombo['values'] = self.statisticsData
        self.StatisticsCombo.grid(column = 0, columnspan = 2, row = 4)

    # Calculate and report the statistics from the sorting process
    def statistics(self):
        # Nothing to report (and no average to compute) for an empty list
        if self.listSize == 0:
            return
        self.statisticsData.clear()
        self.statisticsData.insert(1,
            'The size of the list is ' + str(self.listSize))
        self.statisticsData.insert(2,
            'The sum of the list is ' + str(sum(self.sortedL)))
        self.statisticsData.insert(3, 'The time passed to sort the ' +
            'list was ' + str(round(self.endTime - self.startTime, 5)))
        self.statisticsData.insert(4, 'The average of the sorted list '
            + 'is: ' + str(round(sum(self.sortedL) / self.listSize, 2)))
        self.StatisticsCombo['values'] = self.statisticsData
As with the SpeedControl class discussed previously, the class presented above is the modified
version of the Bubble Sort application. The object-oriented paradigm is adopted by separating and
extracting the statements that would create and run the GUI. The remaining code is saved as a .py
text file in Jupyter, in order to be accessible by the main application.
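The sorting and reporting logic of the class can be tried out separately from its widgets. The sketch below is a minimal illustration under that assumption: bubble_sort and sort_statistics are hypothetical function names, and the returned dictionary keys are chosen here for illustration only. The nested loops mirror the passes used in sortToSortedList(), and the timing uses time.process_time() as the class does:

```python
import random
import time


def bubble_sort(values):
    """Return a sorted copy of values using the same passes as the class."""
    result = list(values)
    n = len(result)
    for i in range(n - 1):
        for j in range(n - 1):
            if result[j] > result[j + 1]:
                # Swap neighbouring elements that are out of order
                result[j], result[j + 1] = result[j + 1], result[j]
    return result


def sort_statistics(values):
    """Sort values and collect the figures reported by statistics()."""
    start = time.process_time()
    ordered = bubble_sort(values)
    elapsed = time.process_time() - start
    size = len(ordered)
    stats = {
        'size': size,
        'sum': sum(ordered),
        'seconds': round(elapsed, 5),
        # Avoid dividing by zero when the list is empty
        'average': round(sum(ordered) / size, 2) if size else None,
    }
    return stats, ordered


if __name__ == '__main__':
    data = [random.randint(-100, 100) for _ in range(20)]
    stats, ordered = sort_statistics(data)
    print(ordered)
    print(stats)
```

Because the function works on a copy of its input, the unsorted data remains available for comparison, which is the role played by the separate unsortedL and sortedL lists in the class.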
The following class implements the main application that imports the two classes and runs them
as threaded objects. The classes are imported in lines 5–6, and the main GUI object is created
in lines 47, 49, and 51. The interface offers a single method: the display of a popup menu when
a left-click event takes place. The menu allows for the creation of two threaded objects based on
SpeedControl and Bubble Sort (line 30). The reader should note how the statements separated and
extracted from the imported classes were added to the main application in lines 32–37 and 39–44
respectively:
 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import Menu
 4  from tkinter import *
 5  import Chapter5SpeedControl
 6  import Chapter5BubbleSort
 7  import threading
 8
 9  class Application:
10
11      # Create main window frame for the application with the popup menu
12      def __init__(self, winFrame):
13          self.winFrame = winFrame
14          self.winFrame.title("Application with threads")
15          self.winFrame.config(bg = 'light grey')
16          self.winFrame.resizable(False, False)
17          self.winFrame.geometry('260x220')
18
19          self.popupmenu = tk.Menu(self.winFrame, tearoff = 0)
20          self.popupmenu.add_command(label = "Speed Control",
21              command = self.speedControlThread)
22          self.popupmenu.add_command(label = "Bubble Sort",
23              command = self.bubbleSortThread)
24          self.winFrame.bind('<Button-1>',
25              lambda event: self.popupMenu(event))
26          self.winFrame.config(menu = self.popupmenu)
27          self.winFrame.mainloop()
28
29      def popupMenu(self, event):
30          self.popupmenu.tk_popup(event.x_root, event.y_root)
31
32      def speedControlThread(self):
33          # Prepare the Speed Control GUI
34          speedControlFrame = tk.Tk()
35          speedControl1 = Chapter5SpeedControl.SpeedControl(speedControlFrame)
36          speedControl1.start()
37          speedControlFrame.mainloop()
38
39      def bubbleSortThread(self):
40          # Prepare the Bubble sort GUI
41          bubbleSortFrame = tk.Tk()
42          bubbleSort1 = Chapter5BubbleSort.BubbleSort(bubbleSortFrame)
43          bubbleSort1.start()
44          bubbleSortFrame.mainloop()
45
46  # Prepare the application GUI
47  winFrame = tk.Tk()
48
49  application = Application(winFrame)
50
51  winFrame.mainloop()
Output 5.4.3:
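For readers less familiar with the threading module, the following sketch shows the general pattern of a threading.Thread subclass in isolation: code placed in run() executes in the new thread once start() is called, and join() waits for it to finish. The class name SortWorker and its attributes are hypothetical and are not part of the applications in this chapter; the built-in sorted() merely stands in for any lengthy task:

```python
import threading


class SortWorker(threading.Thread):
    """Hypothetical worker that sorts a list in a separate thread."""

    def __init__(self, values):
        super().__init__()
        self.values = list(values)
        self.result = None

    def run(self):
        # Executed in the new thread after start() has been called
        self.result = sorted(self.values)


workers = [SortWorker([3, 1, 2]), SortWorker([9, 7, 8])]
for w in workers:
    w.start()       # Launch each thread
for w in workers:
    w.join()        # Wait until each thread has finished
print([w.result for w in workers])   # [[1, 2, 3], [7, 8, 9]]
```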
5.5 WRAP UP
Chapters 4 and 5 provided a step-by-step, systematic walkthrough of Graphical User Interface
(GUI) programming with Python, and an introduction to GUI objects like menus, tabs, and threads.
Key Python widgets were introduced alongside their most common uses and options. This was
done through a series of straightforward examples and applications that progressed gradually from
simpler to more challenging implementations. Although a detailed coverage of all the available
widgets is beyond the scope of this chapter, Table 5.1 lists the most frequently used widgets with their descriptions and
the modules/libraries they belong to as a quick reference. This information can also be used as a
reference for constructors when creating objects from the respective classes. Additional details on
tkinter and the listed widgets can be found in the official Python documentation.

TABLE 5.1
Frequently Used Widgets and the Module They Belong to
Widget Name | Brief Description | Module/Constructor
Windows frame | The main object of a windows-based application, acting as a container for all other widgets in order to create the Graphical-User-Interface. | tkinter, tk.Tk()
Label | Displays a short message to the user. Its content is not expected to change significantly in the program lifecycle and it is not meant to be used for interaction. Nevertheless, it is possible to write code that will enhance its functionality. | tkinter, tk.Label()
Button | Used to handle basic interaction between the user and the application. This is usually implemented through movement or click-based events. | tkinter, tk.Button()
Entry | A basic widget used to accept a single line of text from the keyboard. As with most other widgets, it can be modified in terms of functionality and appearance. | ttk, ttk.Entry()
Scale | A controlled mechanism for accepting numerical user input. Two different implementations of the widget are available, with the one found in tkinter offering more options than that in ttk. | tkinter/ttk, tk.Scale()/ttk.Scale()
Spinbox | A controlled mechanism for accepting numerical user input from the ttk library. | ttk, ttk.Spinbox()
Frame | Used for improved control of the GUI. It can contain various other widgets. | tkinter, tk.Frame()
Labelframe | Similar to the frame widget, but with the inclusion of a label. | tkinter, tk.LabelFrame()
Listbox | Used to display separate lines of text, allowing the user to make a selection. The contents of multiple listboxes can be synchronized. | tkinter, tk.Listbox()
Combobox | Similar to the list box, but instead of being permanently expanded it is in a collapsed state and only opens when clicked upon. The selected line of text is displayed on the top level (i.e., the displayed text box when the list is collapsed). | ttk, ttk.Combobox()
Scrollbar | Used to improve the appearance and use of associated multiline widgets (e.g., list boxes) when they are populated with a large number of entries. | tkinter, tk.Scrollbar()
Checkbutton | Used to offer selection options. It allows for the selection of multiple options at any given time. | tkinter, tk.Checkbutton()
Radiobutton | Used to offer selection options. Options are mutually exclusive. | tkinter, tk.Radiobutton()
Progressbar | Used to inform the user about the state of a particular running method. It can be determinate, in which case the widget presents the actual state of the method, or indeterminate, where the widget provides a scrolling message indicating that the method is still in progress. | ttk, ttk.Progressbar()
Text | Similar to the entry widget, but allowing multiple lines of text. | tkinter, tk.Text()
Canvas | A widget that provides a space to place graphics, text, or other objects. | tkinter, tk.Canvas()
Notebook | Provides the supporting object for tabbed frames. | ttk, ttk.Notebook()
In addition to the aforementioned widgets, a number of other objects are frequently used to
improve the GUI experience. Although many of these are not standalone objects, their use in conjunction with other objects is rather common. Table 5.2 lists some of these objects.
The above objects make use of a number of methods that contribute to the creation of the overall user experience. Table 5.3 lists some of the most important methods used in the various
scripts and applications developed in this chapter.
TABLE 5.2
Notable Objects and Their Modules
Object | Brief Description | Module
Image | Used to load and display an image. It supports different file types (e.g., gif, jpg, png). Various different methods are available, depending on the file type. | PIL
tk.StringVar(), tk.IntVar(), tk.DoubleVar(), etc. | Used to host text or numbers. | tkinter
askyesno(), askokcancel(), askretrycancel(), askquestion() | Used to display different types of pre-defined message boxes. | messagebox
showinfo(), showerror(), showwarning() | Used to display a simple message box with an info, error, or warning icon. | messagebox
askopenfile(), asksaveasfilename(), askdirectory(), askcolor() | Used to display the common windows-based dialogs, ranging from file dialogs to color chooser modules. | filedialog, colorchooser
menu, popup menu | Used to display regular windows-based or popup menus. | tkinter, tk.Menu()
Thread | Used to create threaded objects. | threading
TABLE 5.3
Frequently Used Methods and Their Respective Widgets (in Alphabetical Order with
Constructors First)
Method | Brief Description
.add_command(), .add_checkbutton(), .add_radiobutton(), .add_cascade(), .add_separator() | Adds the various components of a menu object.
.after() | Invokes a method after a set amount of time has elapsed.
.append() | Appends a new element to the end of a list.
.askyesno(), .askokcancel(), .askretrycancel(), .askquestion() | Offers a set of different types of pre-defined message boxes.
.askopenfile(), .asksaveasfilename(), .askdirectory(), .askcolor() | Offers a set of different types of pre-defined dialogs.
.bind() | Binds the widget with a user interaction event.
.clear() | Clears the values from a list.
.config() | Allows the configuration of the widget in terms of its characteristics (e.g., color, font properties).
.current() | Identifies the current selection from a combo box.
.curselection() | Identifies the current selection from a list box.
.delete() | Deletes values from a list box.
.destroy() | Destroys the current frame/interface.
.exit() | Exits the current frame/interface (or the entire application).
.geometry() | Accepts the initial dimensions of the frame in the form of a string (i.e., 'width x height' without the spaces, such as '500x170').
.grid() | Places the widget on the grid of the parent widget and at a specific column and row. It can span across multiple columns/rows.
.grid_remove() | Temporarily hides the widget from the grid of the parent without deleting or destroying it.
int(), float(), str() | Converts the specified values to integer, float, or string values respectively.
.insert() | Inserts values to a list box.
.mainloop() | Puts the frame in an idle state, and monitors possible interactions. The latter can take the form of defined events between the user and the GUI.
.maxsize(), .minsize() | Defines the maximum/minimum size of the associated frame.
.open() | Reads an image/picture based on its full path, assigned as an argument.
.pack() | Attaches the widget to the parent, positioning it against one of the parent's sides (top by default) relative to the widgets already packed.
.PhotoImage() | Creates a memory pointer to a processed image object, by means of the open() method.
.place() | Places the widget at specific coordinates on the parent frame, either on a relative or absolute basis.
.process_time() | Counts the time needed for a particular process to execute.
.randint() | Generates random integers in the specified range.
.resizable() | Specifies whether the object is resizable based on a Boolean value (True/False) that is provided as a parameter.
.resize() | Specifies the size of the image/picture. It is usually accompanied by the ANTIALIAS expression to ensure the quality of the image is maintained when downsizing.
round(), sum(), len() | Basic mathematical methods.
.selection_set() | Selects a particular indexed element in a list box.
.set(), .get() | Sets or gets the value of an object.
.showinfo(), .showerror(), .showwarning() | Offers different types of pre-defined message boxes.
.start(), .stop() | Starts or stops a threaded object.
.title() | Provides a title to the windows frame.
.update_idletasks() | Processes pending idle tasks, such as redrawing widgets, without handling new user events. It is typically used to refresh the display while a lengthy method is running.
For most methods listed in Table 5.3, there exist a number of options/parameters that may
also be used for the improvement of the GUI. These are applicable to a variety of widgets/objects.
Table 5.4 provides a list of some of the most important ones. The list is not exhaustive, but it is based
on cases described in detail in the various examples in this chapter.
TABLE 5.4
Frequently Used Properties and Their Descriptions
Properties/Expressions | Brief Description
activebackground, activeforeground | The background or foreground color when the cursor hovers over the widget.
anchor | Ensures that the particular element it applies to (i.e., text or image) is placed at a position within the parent widget that will remain unchanged.
borderwidth, bd | The width of the border around the widget (e.g., borderwidth = 12) as an integer.
command | The method called when the widget is clicked.
compound | Combines two objects in the same position (e.g., an image and a text) in a parent label widget. It can take different values (e.g., left, center, right) that specify the order of the two objects.
expand | Specifies whether the underlying widget is expandable (value is “Y” or non-zero) or not (value is “N” or zero) when the parent widget is resized.
fg (or foreground), bg (or background) | The color of the foreground/background (fg/bg) or the text a particular widget will display (see Table 4.6).
fill | Specifies whether the widget it applies to will expand horizontally (fill = tk.X), vertically (fill = tk.Y) or both (fill = tk.BOTH).
font | Sets/gets the font name and the size of the text to be displayed by the widget (e.g., font = 'Arial 24').
from_ =, to = | Sets the numerical boundaries of the widget.
height, width | The height or width of the widget in characters (for text widgets) or pixels (for image widgets).
highlightcolor | The color of the text of the widget when the widget is in focus.
image | Defines an image to be displayed on the widget instead of text.
justify | Determines how multiple lines of text will be justified in respect to each other. Values are LEFT, CENTER, or RIGHT.
lambda expression | Sets the parameters to be passed on to a method when an event is triggered.
onvalue, offvalue | The values assigned to a check button depending on whether it is selected or not.
orient | Specifies the orientation of the widget (horizontal or vertical).
padx, pady | Additional padding left/right (padx) or above/below (pady) in relation to the widget.
relief | Causes the widget to be displayed with a particular visual effect in terms of its border appearance (see Table 4.6 for available values).
resolution | The incremental or decremental step of the scale widget.
relx, rely | The position of the widget relative to the parent object.
show | Replaces the text of the current widget with the specified character(s).
side | Specifies the position of the content of the widget (Left, Center, or Right).
state | The state of responsiveness and/or accessibility of the widget. Values can be NORMAL, ACTIVE, DISABLED.
text | The textual content to be displayed.
textvariable | The textual content of the text-based widget.
troughcolor | The color of the trough of the scale widget.
value | The value assigned to a radio button, depending on the selection/state.
["values"] | Associates/populates a combo box with a particular list of values.
underline | If −1, no character of the button’s text will be underlined. If a non-negative index is provided, the character at that index will be underlined (0 refers to the first character).
wraplength | If non-zero, the text lines of the widget will be wrapped so that no line exceeds the specified length.
yscrollcommand, xscrollcommand | Used to activate the scrollbar.
yview, xview | Specifies the orientation of a scrollbar (yview for vertical or xview for horizontal).
TABLE 5.5
Frequently Used Events and Their Descriptions
Event | Brief Description
<Button-1>, <Button-2>, <Button-3> | Triggered when the left, middle, or right button of the mouse is clicked upon the widget.
<Double-Button-1>, <Double-Button-2>, <Double-Button-3> | Triggered when the left, middle, or right mouse button is double clicked upon the widget.
<Enter> | Triggered when the mouse is hovering across the widget.
<Key> | Triggered when any key on the keyboard is pressed. Use the event.keycode option to check the key that was pressed. Note that the values of the keyboard keys vary between operating systems.
<Leave> | Triggered when the mouse leaves the parent widget.
It should be evident from the examples provided in this chapter that one of the most important concepts in GUI programming is the user’s interaction with the widgets, as this is how events are used
to trigger specific tasks. Such interactions usually take the form of mouse clicks or keyboard events.
Table 5.5 lists some of the most important methods of interaction as a quick reference.
Finally, some common values of the options mentioned previously are provided in Table 5.6.
TABLE 5.6
Possible Values for the Various Different Options
Option | Values Available
Color related | It is possible to set the color of the widget, text, or object, either in the form of a hexadecimal string (e.g., “#000111”), or by using color names (e.g., “white”, “black”, “red”, “green”, “blue”, “cyan”, “yellow”, and “magenta”).
Font related | The font of a text can be set just after the text is specified, using the following sub-options:
• Family: The font family names as a string.
• Size: The font height in points (n) or pixels (−n).
• Weight: The attributes of the text (“bold” for bold, or “normal” for regular text).
• Slant: The attributes of the text (“italic” for italic, or “roman” for unslanted).
• Underline: The attributes of the text (1 for underlined or 0 for normal text).
• Overstrike: The attributes of the text (1 for overstruck or 0 for normal text).
Anchor related | The possible values for the anchor justification are: NW, N, NE, W, CENTER, E, SW, S, SE.
Relief styles | After specifying the text of a widget, the possible values for the relief option are: raised, sunken, flat, groove, ridge.
Bitmap styles | Possible bitmap styles include the following: error, gray75, gray50, gray25, gray12, hourglass, info, questhead, question, warning. These can be used in combination with, or instead of, text.
Cursor styles | Possible cursor styles include the following: arrow, circle, clock, cross, dotbox, exchange, fleur, heart, man, mouse, pirate, plus, shuttle, sizing, spider, spraycan, star, target, tcross, trek, watch. These can be used after the text is specified.
Pack options | There are four options in terms of placing a particular widget in respect to the parent widget through the pack() method. Use the side option with values: TOP (default), BOTTOM, LEFT, or RIGHT. There are four options to determine whether and how a particular widget should expand when the parent widget expands. Use the fill option with values: NONE (default), X (fill only horizontally), Y (fill only vertically), or BOTH (fill both horizontally and vertically).
Grid options | When placing widgets on the interface using the grid() method, the following options are available:
• column, row: The column and row the widget will be placed in. The leftmost column (0) and the first row are the defaults.
• columnspan, rowspan: The number of columns or rows a widget will span across. 1 is the default value.
• ipadx, ipady: The number of pixels to pad the widget (horizontally and vertically) within its borders.
• padx, pady: The number of pixels to pad the widget (horizontally and vertically) outside its borders.
• sticky: Determines how the widget will be aligned if its size is smaller than its cell in the grid. The default value is centered. Other possible values are N, E, S, W, NE, NW, SE, and SW.
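As a small illustration of the hexadecimal color strings mentioned in Table 5.6, the sketch below converts red, green, and blue components into the “#rrggbb” form. The helper name rgb_to_hex is hypothetical and is not part of tkinter; the resulting string is simply a value that could be passed to options such as bg or fg:

```python
def rgb_to_hex(red, green, blue):
    """Convert RGB components (0-255 each) to a '#rrggbb' color string."""
    for component in (red, green, blue):
        if not 0 <= component <= 255:
            raise ValueError('RGB components must be in the range 0-255')
    # Two lowercase hexadecimal digits per component
    return '#{:02x}{:02x}{:02x}'.format(red, green, blue)


print(rgb_to_hex(0, 1, 17))        # #000111
print(rgb_to_hex(173, 216, 230))   # #add8e6
```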
5.6 CASE STUDY
Complete the integration of the Basic Widgets Python script from Chapter 4 with a full menu
system in an object-oriented application, using all three types of menus (i.e., regular, toolbar,
popup), as described in this chapter. The menu system should include the following options: Color
dialog, Open File dialog, Separator, Basic Widgets, Save As, Open Directory, Separator, About,
and Exit.
6 Data Structures and Algorithms with Python
Thaeer Kobbaey
Higher Colleges of Technology
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Ghazala Bilquise
Higher Colleges of Technology
CONTENTS
6.1 Introduction
6.2 Lists, Tuples, Sets, Dictionaries
    6.2.1 List
    6.2.2 Tuple
    6.2.3 Sets
    6.2.4 Dictionary
6.3 Basic Sorting
    6.3.1 Bubble Sort
    6.3.2 Insertion Sort
    6.3.3 Selection Sort
    6.3.4 Shell Sort
    6.3.5 Shaker Sort
6.4 Recursion, Binary Search, and Efficient Sorting with Lists
    6.4.1 Recursion
    6.4.2 Binary Search
    6.4.3 Quicksort
    6.4.4 Merge Sort
6.5 Complex Data Structures
    6.5.1 Stack
    6.5.2 Infix, Postfix, Prefix
    6.5.3 Queue
    6.5.4 Circular Queue
6.6 Dynamic Data Structures
    6.6.1 Linked Lists
    6.6.2 Binary Trees
    6.6.3 Binary Search Tree
    6.6.4 Graphs
    6.6.5 Implementing Graphs and the Eulerian Path in Python
6.7 Wrap Up
6.8 Case Studies
6.9 Exercises
References
DOI: 10.1201/9781003139010-6
6.1 INTRODUCTION
Data is defined as a collection of facts. In raw form, data is difficult to process and, thus, in need
of further structuring in order to be useful. In computer science, a data structure refers to the
organization, storage, and management of data in a way that allows its efficient processing and
retrieval. In simple terms, a data structure represents the associated data on a computer in a
specific format, while preserving any underlying logical relationships, and it provides storage and
efficient access to the data based on a set of performance-enhancing rules.

Observation 6.1 – Data Structures: A way of representing, organizing, storing, and accessing data
based on a set of well-defined rules.

As an example, one can consider the real-life scenario of searching for a particular name in a
phone book. The search is made easy by organizing the names in the phone book and sorting
them in alphabetical order. In this rather primitive example, one is not required to go through the
phone book page by page to find the desired name. Other relevant examples include the history of
web pages visited through the web browser (implemented as a linked-list structure), the undo/redo
mechanism available in many applications (implemented as a stack structure), the queue structures
used by operating systems for scheduling the various CPU tasks, and the tree structure used in
many artificial intelligence-based games to track the player’s actions.
In a broader context, there are two different types of data structures:
• Basic data structures that are usually available in every modern programming language.
In Python, these include structures like the list, the dictionary, the tuple, and the set. Lists
and tuples allow the programmer to work with data that is ordered sequentially. Sets are
unordered collections of values with no duplicates.
• Complex data structures, like stacks, queues, and various types of trees, that are built on
basic data structures. In terms of the way these structures organize data, stacks and queues
are classified as linear (i.e., the data elements are ordered), whereas trees and graphs as
non-linear (i.e., the elements do not follow a particular order).
This chapter covers the following topics:
• Basic data structures (i.e., lists, tuples, sets, and dictionaries) and their operations.
• Basic Sorting Algorithms: bubble sort, insertion sort, selection sort, shell sort, shaker
sort.
• The concept of recursion and its application to binary search, and the merge sort and quick
sort algorithms.
• Complex data structures (i.e., stacks and queues).
• Dynamic data structures like singly and doubly linked lists, binary trees/binary search
trees, and graphs.
The focus is both on the computational thinking behind these topics, and on a detailed look
into the programming concepts used for their implementation. Nevertheless, it must be stated
that this chapter aims to provide a thorough introduction of the underlying ideas rather than to
cover the aforementioned data structures exhaustively. Fundamental and critically important
data structures and the associated algorithms, like the heap tree and the heap sort or hashing
structures and hashing tables, are not covered here. The reader can find more details on related
subjects in the seminal works of Dijkstra et al. (1976), Knuth (1997), and Stroustrup (2013), to
whom the modern computer science and information systems and technology community owes
much of its existence.
6.2 LISTS, TUPLES, SETS, DICTIONARIES
This section explores the four built-in data structures provided by Python, namely lists, tuples, sets
and dictionaries. These structures are also briefly discussed in Chapter 2, where they are referred
to as non-primitive data types. Their main use is to store a collection of values and provide tools for
its manipulation.
6.2.1 List
A list is a data structure that stores a collection of items in specified, and frequently successive,
memory locations. Each item in the list has a location number called an index. The index starts from
zero and follows a sequential order. This does not refer to the values of the stored data being ordered
in a particular way (e.g., alphabetically), but the index values. To access an item at a particular
location, the programmer can simply use the index number corresponding to this location. The concept
of the list is analogous to a to-do list that contains things that must be accomplished. In terms of
functionality, Python provides various operations, such as adding items to, and removing from, a
list. Since items in a list can be modified, it is considered to be mutable.

Observation 6.2 – List: A list is a data structure that stores a collection of items in specified,
usually successive, memory locations. It is indexed by a sequential index that always starts at
zero. The items do not have to be in a particular order. A list is a mutable object, meaning that
each item can be modified.

At a practical level, lists in Python are denoted by square brackets (i.e., []). The list can be
populated by adding items within the brackets, separated by commas. The following script creates a
list, and then prints both the list items and the number of items in the list. It also asks the user
to specify the index of an item to print (starting from zero), a range of items to print from the
start of the list to a user-specified index, and a range of items to print from a user-specified
index to the end of the list:

# Create the list
cars = ["BMW", "Toyota", "Honda", "Mercedes"]

# Print the list items
print("The list of the cars is the following: ", cars)

# Use the len() function to print the number of items in the list
print("The number of items in the list is: ", len(cars))

# Ask the user for the index number of an item for printing
singleIndex = int(input("Enter the index \
of the item to print (indexes start from 0): "))
print("Your selection for display is: ", cars[singleIndex])

# Ask the user for the starting index of the print range
startingIndex = int(input("Enter the starting index of the range \
of items to print (index starts from 0): "))
print("Your selected range of items to display is: ",
    cars[startingIndex:len(cars)])

# Ask the user for the ending index of the print range
endingIndex = int(input("Enter the ending index of the range of items \
to print (index starts from 0): "))
print("Your selected range of items to display is: ",
    cars[0:endingIndex])

# Use a negative index to start printing the list from the end
print("The last item in the list is: ", cars[-1])

Output 6.2.1.a:
The list of the cars is the following: ['BMW', 'Toyota', 'Honda', 'Mercedes']
The number of items in the list is: 4
Enter the index of the item to print (indexes start from 0): 0
Your selection for display is: BMW
Enter the starting index of the range of items to print (index starts from 0): 1
Your selected range of items to display is: ['Toyota', 'Honda', 'Mercedes']
Enter the ending index of the range of items to print (index starts from 0): 2
Your selected range of items to display is: ['BMW', 'Toyota']
The last item in the list is: Mercedes

In this script, the reader will notice that the syntax for calling a range of items is list[start:end],
with start denoting the position of the starting index (inclusive) and end the ending index (not
inclusive). It must be stressed that the start and end parameters are optional. For instance,
expression cars[0:endingIndex] could be replaced by cars[:endingIndex] and, similarly, expression
cars[startingIndex:len(cars)] could be replaced by cars[startingIndex:].
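
As a brief illustration of the slicing rules described above, the following sketch shows how omitted or negative bounds behave. It assumes a cars list with the same values as in the script and is meant as a minimal example only; the variable names are chosen here for illustration:

```python
# Illustrative list (same values as in the script above)
cars = ["BMW", "Toyota", "Honda", "Mercedes"]

# Omitting start means "from the beginning of the list"
first_two = cars[:2]        # items at indexes 0 and 1

# Omitting end means "up to and including the last item"
from_second = cars[1:]      # items at indexes 1, 2 and 3

# A negative end counts from the end of the list and is still exclusive
without_last = cars[1:-1]   # items at indexes 1 and 2

# Omitting both bounds produces a (shallow) copy of the whole list
whole = cars[:]

print(first_two)
print(from_second)
print(without_last)
print(whole)
```

Note that cars[1:-1] (or, equivalently, cars[1:len(cars)-1]) leaves out the last item, since the end index is not inclusive.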
The reader should also note that if the user tries to access a list item using an index that does not
exist, an IndexError exception will be raised, as illustrated in the example below:
Output 6.2.1.b:
The list of the cars is the following: ['BMW', 'Toyota', 'Honda', 'Mercedes']
The number of items in the list is: 4
Enter the index of the item to print (indexes start from 0): 4
IndexError                      Traceback (most recent call last)
<ipython-input-5-695ecl33b0e9> in <module>
     11 singleIndex = int(input("Enter the index \
     12 of the item to print (indexes start from 0): "))
---> 13 print("Your selection for display is: ", cars[singleIndex])
     14
     15 # Ask the user for the starting index of a range of items in the list to print
IndexError: list index out of range
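
One possible way of guarding against this exception is to wrap the access in a try/except block. The sketch below uses a hypothetical helper named get_item(), which is not part of the original script, and represents only one of several reasonable approaches:

```python
def get_item(items, index):
    """Return items[index], or None if the index is out of range.

    get_item() is a hypothetical helper used only for illustration.
    """
    try:
        return items[index]
    except IndexError:
        # Raised when index >= len(items) or index < -len(items)
        return None


cars = ["BMW", "Toyota", "Honda", "Mercedes"]
print(get_item(cars, 0))   # a valid index
print(get_item(cars, 4))   # an invalid index: the list has indexes 0 to 3
```

An alternative would be to compare the index against len(cars) before accessing the item; the try/except form is shown here because it also covers negative indexes.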
In addition to the basic functions discussed above, Python also provides a number of additional
functions that can be used to manipulate a list (Table 6.1):
TABLE 6.1
Most Important Functions for List Manipulation
Function             Description
append(item)         Adds an element at the end of the list
clear()              Removes all the elements from the list
copy()               Returns a copy of the list
count(item)          Returns the number of elements with the specified value
extend(list2)        Adds the elements of a second list (e.g., list2) to the end of the current list
index(item)          Returns the index of the first item with the specified value
insert(pos, item)    Adds an element at the specified position
pop()                Removes and returns the last element of the list
remove(item)         Removes the first item with the specified value
reverse()            Reverses the order of the list
sort()               Sorts the list in ascending order

The script below is a modified version of the previously created one, demonstrating the use of
append(), insert(), extend(), remove(), and pop() (Table 6.1). The script performs the
tasks of adding items at the end of a list (line 9), inserting an item in a particular position specified
by an index value (line 11), extending the list by adding items from a second list (lines 16–17),
removing a particular item from the list (line 22), and removing the last item of the list (line 26):

 1  # Create the list
 2  cars = ["BMW", "Toyota", "Honda", "Mercedes"]
 3
 4  # Print the list size and its items
 5  print("The list of the cars has ", len(cars),
 6      " items which are the following: ", cars)
 7
 8  # Append/add an item to the end of the list
 9  cars.append("Nissan")
10  # Insert an item to position 1 of the list
11  cars.insert(1,"Suzuki")
12  # Print the updated list
13  print("The updated list after the append and insert is: ", cars)
14
15  # Extend the list by adding the items of a second list
16  cars2 = ["Renault", "Audi"]
17  cars.extend(cars2)
18  print("The updated list after extending it with items from "
19      "a second list is: ", cars)
20
21  # Remove a specific item from the list
22  cars.remove("Toyota")
23  print(cars)
24
25  # Remove the last item from the list
26  cars.pop()
27  print(cars)

Output 6.2.1.c:
The list of the cars has 4 items which are the following: ['BMW',
'Toyota', 'Honda', 'Mercedes']
The updated list after the append and insert is: ['BMW', 'Suzuki',
'Toyota', 'Honda', 'Mercedes', 'Nissan']
The updated list after extending it with items from a second list is: ['BMW',
'Suzuki', 'Toyota', 'Honda', 'Mercedes', 'Nissan', 'Renault', 'Audi']
['BMW', 'Suzuki', 'Honda', 'Mercedes', 'Nissan', 'Renault', 'Audi']
['BMW', 'Suzuki', 'Honda', 'Mercedes', 'Nissan', 'Renault']
The following variation of the same script showcases the use of reverse(), sort(),
sort(reverse = True), and index() in order to reverse the items of the list (line 9), sort them
in ascending order (line 13), sort them in descending/reverse order (line 17), and find and return the
index of a particular item (line 21). Notice that reverse() and sort() modify the original list in
place, so their effect is permanent, whereas index() only returns a value and leaves the list unchanged:

 1  # Create the list
 2  cars = ["BMW", "Toyota", "Honda", "Mercedes"]
 3
 4  # Print the list size and its items
 5  print("The list of the cars has ", len(cars),
 6      " items which are the following: ", cars)
 7
 8  # Reverse the order of the items of the list and print them
 9  cars.reverse()
10  print(cars)
11
12  # Sort the items of the list and print them
13  cars.sort()
14  print(cars)
15
16  # Sort the items of the list in reverse order and print them
17  cars.sort(reverse = True)
18  print(cars)
19
20  # Find and return the index of a specific item in the list
21  print(cars.index("BMW"))

Output 6.2.1.d:
The list of the cars has 4 items which are the following:
['BMW', 'Toyota', 'Honda', 'Mercedes']
['Mercedes', 'Honda', 'Toyota', 'BMW']
['BMW', 'Honda', 'Mercedes', 'Toyota']
['Toyota', 'Mercedes', 'Honda', 'BMW']
3
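
Since reverse() and sort() change the list they are called on, a programmer who wishes to keep the original order may prefer the built-in sorted() and reversed() functions, which leave the list untouched. The following is a minimal sketch of this difference, assuming a four-item cars list; the variable names are illustrative:

```python
cars = ["BMW", "Toyota", "Honda", "Mercedes"]

# sorted() returns a new list and leaves the original unchanged
ordered = sorted(cars)

# reversed() returns an iterator, so list() is used to build a list from it
backwards = list(reversed(cars))
print(cars)       # still in the original order

# sort() rearranges the list in place and returns None
result = cars.sort()
print(result)
print(cars)       # now in ascending order
```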
Finally, with the use of in <list>, copy(), count(), and clear(), the programmer can examine in run-time whether a particular item belongs in a list (lines 8–11 and 13–16), copy the contents
of a list (line 23), count the occurrences of an item in the list (line 19), and clear the list (line 27):

 1  # Create the list
 2  cars = ["BMW", "Toyota", "Honda", "Mercedes", "Toyota"]
 3
 4  # Print the list items
 5  print("The list of the cars is the following: ", cars)
 6
 7  # Print a message depending on whether an item is included in the list
 8  if ("Toyota" in cars):
 9      print("Toyota is in the list")
10  else:
11      print("Toyota is not in the list")
12
13  if ("Nissan" in cars):
14      print("Nissan is in the list")
15  else:
16      print("Nissan is not in the list")
17
18  # The number of occurrences of an item in the list
19  occurrences = cars.count("Toyota")
20  print("Occurrences of the particular item in the list is: ",
21      occurrences)
22  # Copy the contents of a list into another
23  newCars = cars.copy()
24  print("The contents of the new list are: ", newCars)
25
26  # Clear the list
27  newCars.clear()
28  print("The newCars list of items is now empty: ", newCars)

Output 6.2.1.e:
The list of the cars is the following: ['BMW', 'Toyota', 'Honda',
'Mercedes', 'Toyota']
Toyota is in the list
Nissan is not in the list
Occurrences of the particular item in the list is: 2
The contents of the new list are: ['BMW', 'Toyota', 'Honda',
'Mercedes', 'Toyota']
The newCars list of items is now empty: []
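
It may be worth noting that copy() differs from a plain assignment, which only creates a second name for the same list. A short sketch of the difference follows; the variable names alias and duplicate are chosen here for illustration:

```python
cars = ["BMW", "Toyota", "Honda"]

alias = cars             # a second name for the same list object
duplicate = cars.copy()  # a new list holding the same items (shallow copy)

cars.append("Mercedes")

print(alias)      # reflects the change made through cars
print(duplicate)  # unaffected by the change
```

This is why clearing newCars in the script above leaves the original cars list intact.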
6.2.2 Tuple
Tuples are a special type of list, with items being organized in a particular order and accessed by
referencing index values. The difference between a normal list and a tuple is that the latter is
immutable, meaning that its items cannot be modified. As such, tuples do not offer some of the
extended functionality of a list described in the previous section. In terms of syntax, tuples are
created using parentheses instead of square brackets.

Observation 6.3 – Tuple: A special type of list that is immutable (i.e., its items cannot be
modified). Tuples are created using parentheses instead of square brackets.

The following script demonstrates the basics of tuple creation and usage:

# Create a tuple
cars = ("BMW", "Toyota", "Honda", "Mercedes")

# Display all items in the tuple
print("The items in the tuple are: ", cars)
# Display the first item in the tuple
print("The first item in the tuple is: ", cars[0])

# Raises a TypeError exception since the item in the tuple cannot be modified
cars[0] = "Tesla"

Output 6.2.2:
The items in the tuple are: ('BMW', 'Toyota', 'Honda', 'Mercedes')
The first item in the tuple is: BMW
TypeError                       Traceback (most recent call last)
<ipython-input-1-3c3eee3a45c8> in <module>
      8
      9 # Raises a TypeError exception since the item in the tuple cannot be modified
---> 10 cars[0] = "Tesla"
TypeError: 'tuple' object does not support item assignment
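
Although a tuple cannot be modified, it can still be unpacked, concatenated into a new tuple, or used as a dictionary key (provided that its items are themselves hashable). The following sketch illustrates these points, assuming the same cars tuple; the prices dictionary and its values are made up purely for illustration:

```python
cars = ("BMW", "Toyota", "Honda", "Mercedes")

# Unpacking: assign the items of a tuple to separate variables
first, second, *rest = cars

# Concatenation creates a new tuple; the original is left unchanged
more_cars = cars + ("Tesla",)

# Attempting to modify an item raises a TypeError that can be caught
try:
    cars[0] = "Tesla"
    modified = True
except TypeError:
    modified = False

# Tuples of hashable items can serve as dictionary keys (made-up data)
prices = {("BMW", 2020): 30000, ("Honda", 2019): 18000}

print(first, second, rest)
print(more_cars)
print(modified)
print(prices[("BMW", 2020)])
```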
6.2.3 Sets
A set is a collection of unordered and unique items. It is created by listing its items within curly
braces (i.e., {}) (Hoare, 1961). Since an empty pair of braces denotes an empty dictionary, an empty
set is created with set(). Duplicates are discarded when the set is created, so when the print()
function is used to display the contents of a set, each item appears only once and the contents are
not presented in a particular order. In fact, the order of the elements may change from one execution
of the code to the next.

Observation 6.4 – Set: A collection of unordered, unique items. Use the in operator to examine if an
item belongs to a set. Use the intersection() function to find the common items between two sets.
Use the difference() function to retrieve items from the first set that are not found in the second.
The union() function combines the items of two sets, removing any duplicates.

There are four particular operators/functions used on sets:
1. The in Operator: Examines whether an item is included in the set.
2. The intersection() Function: Identifies the common items between two sets.
3. The difference() Function: Retrieves items from a set that do not exist in another set.
4. The union() Function: Combines the items of two sets and returns a new one after
removing any duplicates.
The following script demonstrates the basic use of sets and their main operations:

# Create the set
cars = {"BMW", "Toyota", "Honda", "Mercedes", "Toyota"}

# Print the set
print("The cars set includes the following items: ", cars)

# Check whether a particular item exists in the set
if ("Honda" in cars):
    print("Honda is in the cars set")
else:
    print("Honda is not in the cars set")

# Create and print an additional set
german_cars = {"BMW", "Mercedes", "Audi", "Porsche"}
print("The german cars set includes the following items: ", german_cars)

# Find and print the intersection (i.e., common items of the two sets)
print("The intersection, i.e., the common items of the two sets, is: ",
    cars.intersection(german_cars))

# Find and print the difference of the two sets
print("The different items between the two sets are: ",
    cars.difference(german_cars))

# Find and print the union of the two sets
print("The union of the two sets is: ", cars.union(german_cars))

Output 6.2.3:
The cars set includes the following items: {'Honda', 'Mercedes',
'BMW', 'Toyota'}
Honda is in the cars set
The german cars set includes the following items: {'Mercedes',
'Porsche', 'BMW', 'Audi'}
The intersection, i.e., the common items of the two sets, is:
{'Mercedes', 'BMW'}
The different items between the two sets are: {'Honda', 'Toyota'}
The union of the two sets is: {'Audi', 'Porsche', 'Honda',
'Mercedes', 'BMW', 'Toyota'}
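
The same operations are also available as operators (&, -, |), and a set offers a convenient way of removing duplicates from a list. The sketch below is illustrative only; sorted() is applied before printing because the order in which a set is displayed is not guaranteed:

```python
cars = {"BMW", "Toyota", "Honda", "Mercedes"}
german_cars = {"BMW", "Mercedes", "Audi", "Porsche"}

common = cars & german_cars        # same as cars.intersection(german_cars)
only_in_cars = cars - german_cars  # same as cars.difference(german_cars)
combined = cars | german_cars      # same as cars.union(german_cars)

# Converting a list to a set discards the duplicates
unique = set(["Toyota", "BMW", "Toyota", "Honda", "BMW"])

# An empty set is created with set(); {} creates an empty dictionary
empty = set()

print(sorted(common))
print(sorted(only_in_cars))
print(sorted(combined))
print(sorted(unique))
print(type(empty).__name__, type({}).__name__)
```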
6.2.4 Dictionary
A dictionary is a collection of items that stores values in key-value pairs. The key is a unique
identifier and the value is the data associated with it. The dictionary is analogous to a phone book
that stores the contact name and telephone of a person. The contact name would be the key that is used
to look up the telephone number (i.e., the value). In a dictionary, keys must be unique and of an
immutable data type, such as strings or integers, while values can be of any type (e.g., strings,
integers, lists).

The Python syntax for creating a dictionary is the following:
dictionary = {key1: value1, key2: value2}

Observation 6.5 – Dictionary: A collection of items stored in a key-value pair format. The keys must
use immutable data types. The values can be of any type and are mutable. The syntax is the following:
dictionary = {key1: value1, key2: value2}

Table 6.2 lists the available dictionary functions.

TABLE 6.2
Functions of a Dictionary
Function             Description
clear()              Removes all the elements from the dictionary
copy()               Returns a copy of the dictionary
get(key)             Returns the value paired with the key, or None if the key does not exist
key in dictionary    Evaluates to True if the key is in the dictionary and to False otherwise
                     (the has_key(key) function of Python 2 is not available in Python 3)
items()              Returns a view of the (key, value) tuples
keys()               Returns a view of the keys
values()             Returns a view of the values
pop(key)             Removes an item given the key and returns the value
popitem()            Removes the most recently added item and returns it as a (key, value) tuple
update()             Adds or overwrites items from another dictionary

The following script presents an example involving a dictionary named employee that holds
an employee’s name, salary, and job title:

 1  # Create the dictionary
 2  employee = {"name": "Maria", "salary": 15000, "job": "Sales Manager"}
 3
 4  # Print the dictionary
 5  print("The employee dictionary is: ", employee)
 6  # Access a specific key and print the paired value
 7  print("The pair value for the <name> key is: ", employee["name"])
 8
 9  # Use the get() method to print a pair based on a given key
10  print("The value pair of the <name> key is: ", employee.get("name"))
11  # If the key does not exist the get() method will return
12  # None (empty)
13  print("The value pair of the <department> key is: ",
14      employee.get("department"))
15
16  # Add a new pair to the dictionary
17  employee["department"] = "Sales"
18  print("The value pair of the new <department> key is: ",
19      employee.get("department"))
20
21  # Modify the value of a given key
22  employee["salary"] = 20000
23  print("The new employee dictionary includes the following pairs: ",
24      employee)
25
26  # Use the update() method to modify the dictionary
27  employee.update({"name":"Alex","department":"Sales"})
28  print(employee)
29
30  # Pop/remove a pair based on a given key, assign the returned
31  # value to a new variable and print it
32  emp_job = employee.pop("job")
33  print("The original employee dictionary is: ", employee)
34  print("The value of the removed <job> key is: ", emp_job)

Output 6.2.4:
The employee dictionary is: {'name': 'Maria', 'salary': 15000, 'job':
'Sales Manager'}
The pair value for the <name> key is: Maria
The value pair of the <name> key is: Maria
The value pair of the <department> key is: None
The value pair of the new <department> key is: Sales
The new employee dictionary includes the following pairs: {'name': 'Maria',
'salary': 20000, 'job': 'Sales Manager', 'department': 'Sales'}
{'name': 'Alex', 'salary': 20000, 'job': 'Sales Manager', 'department':
'Sales'}
The original employee dictionary is: {'name': 'Alex', 'salary': 20000,
'department': 'Sales'}
The value of the removed <job> key is: Sales Manager
The reader should note that it is possible to access the value of a dictionary key either directly
(line 7) or through the get() function (line 10). If the value of a key that does not exist in
the dictionary is requested, get() returns None (lines 13 and 14), whereas direct access raises a
KeyError exception. It is also worth noting that it is possible to add new pairs, or overwrite
existing ones, through the update() function (line 27). Finally, line 32 demonstrates how to remove a
particular pair from a dictionary through the pop() function, which returns the value of the removed
pair so that it can be stored in a variable.
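
A few additional idioms are commonly used with dictionaries: supplying a default value to get(), testing for a key with the in operator, and looping over the pairs with items(). The following sketch assumes an employee dictionary similar to the one above; the "Unassigned" default and the "bonus" key are hypothetical values introduced for illustration:

```python
employee = {"name": "Maria", "salary": 15000, "job": "Sales Manager"}

# get() accepts an optional default that is returned for a missing key
department = employee.get("department", "Unassigned")

# The in operator checks whether a key exists
has_salary = "salary" in employee

# items() yields the (key, value) pairs in insertion order
pairs = [(key, value) for key, value in employee.items()]

# pop() also accepts a default, which avoids a KeyError for a missing key
bonus = employee.pop("bonus", 0)

print(department)
print(has_salary)
print(pairs)
print(bonus)
```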
6.3 BASIC SORTING
Sorting is a major task in computer science and information systems/technology, with as much as
30% of the total computer processing time of everyday business activity allegedly being devoted to
it. In a broader context, sorting is the computational process of arranging data in a particular order.
As different sorting algorithms can result in differences of minutes, hours, or even days, efficiency
is an important factor in terms of sorting time. Efficiency is measured by counting the number of
comparisons and exchanges/swaps required to sort a given list of data. A comparison takes place
when an element of the list is compared with another, whereas exchanges/swaps happen when two
elements of the list switch their positions.
6.3.1 Bubble Sort
The bubble sort is one of the most well-known sorting algorithms. It is also covered in Chapter 4 of
this book, under the topic of listboxes. The main idea of the algorithm is to have the element with
the highest (or lowest) value in a list moved to the last (or first) place during each iteration. At
each iteration, the program repeats this process, moving the next highest (lowest) number in the list
to the appropriate place. The number of main iterations corresponds to the number of elements in the
list. During each main iteration there are approximately as many comparisons (and potentially
exchanges/swaps) as the total number of elements in the list. Thus, the time complexity of the bubble
sort is O(n²).

Observation 6.6 – Bubble Sort: Use two nested for loops during the inner iterations to successively
move the highest/lowest value element to the end of the list until the entire list is sorted.

The detailed explanation of time complexities and the Big O/Theta/Omega notation is beyond the scope
of this book, but the reader can find related information in most of the essential computer science
sources and bibliography. For the purposes of this chapter, it should suffice to claim that the bubble
sort is not particularly efficient in terms of time. In order to examine the low efficiency of the
algorithm, the reader could assume that each comparison takes in the order of 1 microsecond to
complete (1 microsecond = 1.0e−6 seconds). This would translate to the following rough estimates:
• n = 10: (n−1)² = 81 comparisons → approximate time 3e−4 seconds.
• n = 100: (n−1)² ≈ 9.8e3 comparisons → approximate time 5e−3 seconds.
• n = 1,000: (n−1)² ≈ 9.98e5 comparisons → approximate time 0.4 seconds.
• n = 10,000: (n−1)² ≈ 9.998e7 comparisons → approximate time 46 seconds.
• n = 20,000: (n−1)² ≈ 4e8 comparisons → approximate time 188 seconds.
As these calculations are estimates, they are largely dependent on the system at hand, the type of
data of the list, and the conditions of the programming platform used. However, the crude assumptions and numbers used here could provide a rough idea of the increasing inefficiency of the bubble
sort in line with an increasing size of the list. Indeed, bubble sort works well as long as n is not
higher than approximately 10,000. After this point, it becomes heavy and its inefficiency starts to
show.
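
The comparison counts quoted above follow directly from two nested loops that each run n−1 times. A small sketch of this arithmetic is given below; the function name basic_comparisons() is introduced here purely for illustration:

```python
def basic_comparisons(n):
    """Comparisons made by a bubble sort whose outer and inner
    loops both run n - 1 times (hypothetical helper)."""
    return (n - 1) ** 2


for n in (10, 100, 1000, 10000, 20000):
    print(n, basic_comparisons(n))
```

Doubling the size of the list from 10,000 to 20,000 roughly quadruples the number of comparisons, which is the practical meaning of quadratic growth.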
It is possible to slightly improve the efficiency of the algorithm by avoiding unnecessary
comparisons. As an example, one could use the following eight-element list: 3, 5, 4, 2, 3, 1, 6, 7.
The inner loop will execute n−1 times (i.e., seven iterations) during each of the main iterations. The
inner iterations are then responsible for bringing each element to the corresponding place successively
(Table 6.3).
The reader should note that, firstly, it is not necessary that an exchange/swap of elements will
take place in every iteration of the inner loop and, secondly, at the end of the main outer iteration the
highest element is pushed to the end of the list. In this case, in the first main outer iteration, element
7 is pushed to the end of the list. The last line is the result of the first main outer iteration, after all
seven inner loops are completed. Subsequent iterations will repeat the same process, ensuring that
the next highest element moves to the appropriate position, until all elements have taken the correct
place in the list.
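
The passes described above can be reproduced with a short sketch that records the state of the list after each outer iteration. The function name bubble_passes() is an assumption made for this illustration, and the input list is copied so that it is not modified:

```python
def bubble_passes(values):
    """Return the state of the list after each outer pass of a
    basic bubble sort (hypothetical helper; input is not modified)."""
    data = list(values)
    snapshots = []
    for _ in range(len(data) - 1):
        for j in range(len(data) - 1):
            # Swap neighbouring elements that are out of order
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
        snapshots.append(list(data))
    return snapshots


passes = bubble_passes([3, 5, 4, 2, 3, 1, 6, 7])
for number, state in enumerate(passes, start=1):
    print("After pass", number, ":", state)
```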
TABLE 6.3
The Inner Loop inside the First Main Iteration
3   5   4   2   3   1   6   7
3   5   4   2   3   1   6   7
3   4   5   2   3   1   6   7
3   4   2   5   3   1   6   7
3   4   2   3   5   1   6   7
3   4   2   3   1   5   6   7
3   4   2   3   1   5   6   7
TABLE 6.4
The Results of the Outer Loops
After the 1st pass    3   4   2   3   1   5   6   7
After the 2nd pass    3   2   3   1   4   5   6   7
After the 3rd pass    2   3   1   3   4   5   6   7
After the 4th pass    2   1   3   3   4   5   6   7
After the 5th pass    1   2   3   3   4   5   6   7
After the 6th pass    Comparisons are made with no swaps
After the 7th pass    Comparisons are made with no swaps

Table 6.4 presents the results after each of the outer iterations/loops.
A Python implementation of a basic bubble sort and its output is provided below:

# Import the random module to generate random numbers
# and the time module to measure the processing time
import random
import time

comparisons = 0
numbers = []

# Enter the number of list elements
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    numbers.append(newNum)
print("The unsorted list is: ", numbers)

# Bubble sort the list and record the stats for later use
# Start the timer
startTime = time.process_time()

# The bubble sort algorithm
for i in range(size-1):
    for j in range(size-1):
        comparisons += 1
        if numbers[j] > numbers[j+1]:
            # Swap the two neighbouring elements
            numbers[j], numbers[j+1] = numbers[j+1], numbers[j]

# End the timer
endTime = time.process_time()

# Display the basic info for the bubble sort
print("The sorted list is: ", numbers)
print("The number of comparisons is: ", comparisons)
print("The elapsed time in seconds is: ", (endTime - startTime))

Output 6.3.1:
Enter the number of elements in the list:7
The unsorted list is: [33, -16, -57, -17, 95, 5, 15]
The sorted list is: [-57, -17, -16, 5, 15, 33, 95]
The number of comparisons is = 36
The elapsed time in seconds = 0.0
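The passes summarised in Table 6.4 can be reproduced with a short sketch. The function name bubble_passes, the shrinking inner range, and the early exit after a pass without swaps are illustrative assumptions that are not part of the script above; the sketch simply records a copy of the list after each outer pass.

```python
def bubble_passes(values):
    """Return a snapshot of the list after each outer pass of bubble sort."""
    items = list(values)  # work on a copy so the caller's list is untouched
    snapshots = []
    n = len(items)
    for p in range(n - 1):
        swapped = False
        # After p passes, the last p elements are already in place
        for j in range(n - 1 - p):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        snapshots.append(list(items))
        # A pass without swaps means the list is sorted (assumed optimisation)
        if not swapped:
            break
    return snapshots


for number, state in enumerate(bubble_passes([3, 5, 4, 2, 3, 1, 6, 7]), start=1):
    print("After pass", number, ":", state)
```

With the eight-element list of Table 6.3, the first five snapshots match the five rows of Table 6.4, and the sixth pass makes no swaps, so the sketch stops there.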
6.3.2 Insertion Sort
Insertion sort is another basic sorting algorithm, similar to bubble sort but somewhat improved.
The basic idea is that on the ith pass the algorithm inserts the ith element (i.e., L[i]) into its
appropriate place within the L[1], L[2], …, L[i-1] sequence, the elements of which have been
previously placed in sorted order. As a result, after the insertion, the elements occupying the L[1],
L[2], …, L[i] sequence are in sorted order. In simple terms, the algorithm sorts increasingly larger
subsets of the original list until the whole list is sorted.
Observation 6.7 – Insertion Sort:
Use a while loop nested inside a for loop to find the highest/lowest value element in the subset
of the list in each ith pass. The subset starts with the first two elements (index extends up to
i + 1) and is increased by 1 in each pass.
As an example, assume that the insertion sort is applied to the following seven-element list: 3,
5, 4, 2, 3, 1, 6, thus executing n−1 (i.e., 6) outer iterations/loops. The big difference between this
algorithm and bubble sort is that the main iterations do not all require the same number of inner
iterations; instead, the number of inner iterations increases from 1 up to n−1. During each main
iteration, the newly added element is moved to its correct location within the current subset of the
list. The following section describes in detail each of the main iterations.
The inner iteration of the first main iteration will put the two elements of the subset in order:
3 5
3 5
The two-iteration loop of the second main iteration will put the three elements of the subset in order:
3 5 4
3 4 5
3 4 5
The three-iteration loop of the third main iteration will put the four elements of the subset in order:
3 4 5 2
3 4 2 5
3 2 4 5
2 3 4 5
The four-iteration loop of the fourth main iteration will put the five elements of the subset in order:
2 3 4 5 3
2 3 4 3 5
2 3 3 4 5
2 3 3 4 5
2 3 3 4 5
The five-iteration loop of the fifth main iteration will put the six elements of the subset in order:
2 3 3 4 5 1
2 3 3 4 1 5
2 3 3 1 4 5
2 3 1 3 4 5
2 1 3 3 4 5
1 2 3 3 4 5
The six-iteration loop of the sixth main iteration will put the seven elements of the subset in order:
1 2 3 3 4 5 6
1 2 3 3 4 5 6
1 2 3 3 4 5 6
1 2 3 3 4 5 6
1 2 3 3 4 5 6
1 2 3 3 4 5 6
1 2 3 3 4 5 6
The algorithm relies on the introduction of a temporary element (e.g., temp) and a temporary
location (i.e., loc), which on the ith pass are assigned the values L[i] and i respectively (i.e., L[1]
and 1 on the first pass). The following script provides an implementation of the insertion sort
algorithm in Python:
import random
import time

list = []
comparisons = 0

# Enter the number of list elements
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

startTime = time.process_time()  # Start the timer

# The insertion sort algorithm
for i in range(1, size):
    # Remember the element to insert and its current location
    temp = list[i]
    loc = i
    # Shift larger elements of the sorted subset one place to the right
    while (loc > 0) and (list[loc - 1] > temp):
        # Only comparisons that lead to a shift are counted here
        comparisons += 1
        list[loc] = list[loc - 1]
        loc = loc - 1
    # Insert the element into the freed location
    list[loc] = temp

endTime = time.process_time()  # End the timer

# Display the basic info for the insertion sort
print("The sorted list is: ", list)
print("The number of comparisons is: ", comparisons)
print("The elapsed time in seconds is: ", (endTime - startTime))
Output 6.3.2:
Enter the number of elements in the list:7
The unsorted list is: [2, -8, 69, 20, -56, -32, -81]
The sorted list is: [-81, -56, -32, -8, 2, 20, 69]
The number of comparisons is = 16
The elapsed time in seconds = 0.0
There are a couple of characteristics that make insertion sort significantly more efficient compared
to bubble sort. First, since each subset of the list includes fewer elements than the entire list, it
performs fewer comparisons. Second, as each pass ensures that the subset is in order, fewer swaps
are required. However, on average, the algorithm falls under the same time efficiency bracket as
bubble sort (i.e., O(n^2)), and only shows improvement in the best case, where it becomes linear and
achieves a time complexity of O(n).
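The gap between the best and the worst case can be made concrete with a small sketch. The function insertion_sort_count is a hypothetical helper introduced only for this illustration; unlike the script above, it counts every key comparison, including the final one that ends each inner loop.

```python
def insertion_sort_count(values):
    """Sort a copy of values with insertion sort and count key comparisons."""
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        temp = items[i]
        loc = i
        while loc > 0:
            comparisons += 1              # one comparison of temp with a neighbour
            if items[loc - 1] <= temp:
                break                     # temp is already in its place
            items[loc] = items[loc - 1]   # shift the larger element to the right
            loc -= 1
        items[loc] = temp
    return items, comparisons


# Best case: already sorted input needs n - 1 comparisons
print(insertion_sort_count([1, 2, 3, 4, 5, 6, 7]))
# Worst case: reversed input needs n * (n - 1) / 2 comparisons
print(insertion_sort_count([7, 6, 5, 4, 3, 2, 1]))
```

For seven elements, the sorted input requires 6 comparisons, whereas the reversed input requires 21, which is in line with the O(n) and O(n^2) bounds mentioned above.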
An approximation of the time efficiency improvements of the insertion sort over the bubble sort
is provided in the list below (assume 1 comparison takes 1 nanosecond or 1.0e−9 seconds; Cs
stands for Comparisons):
• n = 10: ~40 Cs ((n−1)^2 = 81 in Bubble S.) → approx. 2.0e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~4.0e3 Cs ((n−1)^2 = 9.8e3 in Bubble S.) → approx. 2.5e−3 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~5.0e5 Cs ((n−1)^2 = 9.98e5 in Bubble S.) → approx. 0.16 seconds (3.7e−1 in Bubble S.)
• n = 10,000: ~4.0e7 Cs ((n−1)^2 = 9.998e7 in Bubble S.) → approx. 15 seconds (46 in Bubble S.)
• n = 20,000: ~9.8e7 Cs ((n−1)^2 = 4.0e8 in Bubble S.) → approx. 57 seconds (188 in Bubble S.)
6.3.3 Selection Sort
Selection sort, also considered one of the fundamental sorting algorithms, is similar to insertion
sort, but provides some improvements in terms of efficiency as it reduces the number of required
swaps. The basic idea is that, on the ith pass, the algorithm selects the element with the lowest (or
highest) value within a given range (i.e., A[i], …, A[n]), and swaps it with the element in the current
position (i.e., A[i]). Thus, after the ith pass, the i lowest elements will occupy A[1], A[2], …, A[i] in
sorted order.
Observation 6.8 – Selection Sort:
Use a for loop nested inside another for loop to find and replace the highest/lowest value element
with the original, ith item in the list. In each successive pass, the subset of the searchable list is
reduced by one.
The algorithm utilizes subsets of a list to sort it, moving from the whole list to end up with the smallest divisions of it. In a sense, it is almost the opposite of insertion sort. The algorithm requires one
additional variable in order to store the location (index) of the lowest value element within the list.
Using the list from the previous example (i.e., 3, 5, 4, 2, 3, 1, 6), during the 1st outer iteration of
the selection sort, the inner iterations will determine that the lowest value element is in index 5.
Therefore, the elements in list[0] and list[5] will be swapped, and the element in list[0] will not be
involved in any further processing from this point on:
list[0] = 3
list[1] = 5
list[2] = 4
list[3] = 2
list[4] = 3
list[5] = 1
list[6] = 6
By the end of the 1st outer iteration, the list has the following structure:
list[0] = 1
list[1] = 5
list[2] = 4
list[3] = 2
list[4] = 3
list[5] = 3
list[6] = 6
Given that the 2nd outer loop will move the index to the 2nd element of the list (i.e., i = 1), the 2nd inner
iterations will only deal with the subset of the original list, excluding the sorted part (i.e., list[0]).
This means that in the unsorted subset of the list, the element with the lowest value will be in index
3. Thus, the elements in list[1] and list[3] will be swapped, while the element in list[1] will not be
involved in any further processing:
list[0] = 1
list[1] = 5
list[2] = 4
list[3] = 2
list[4] = 3
list[5] = 3
list[6] = 6
By the end of the 2nd outer iteration the list will be the following:
list[0] = 1
list[1] = 2
list[2] = 4
list[3] = 5
list[4] = 3
list[5] = 3
list[6] = 6
Once again, the 3rd outer loop will move the index to the 3rd element of the list (i.e., i = 2) and the 3rd
inner iterations will only deal with the subset of the original list, excluding the sorted part (i.e.,
list[0], list[1]). As in the previous two iterations, this will result in the element with the lowest value
in the unsorted subset of the list being found in index 4, and thus the elements in list[2] and list[4]
will be swapped:
list[0] = 1
list[1] = 2
list[2] = 4
list[3] = 5
list[4] = 3
list[5] = 3
list[6] = 6
By the end of the 3rd outer iteration the list will be the following:
list[0] = 1
list[1] = 2
list[2] = 3
list[3] = 5
list[4] = 4
list[5] = 3
list[6] = 6
Repeating the outer loop for a 4th time will further move the index to the 4th element of the list and
the 4th inner iterations will deal with the remaining subset of the list. The inner loop will find the
lowest value element to be in index 5 of that subset, and the elements in list[3] and list[5] will be
swapped:
list[0] = 1
list[1] = 2
list[2] = 3
list[3] = 3
list[4] = 4
list[5] = 5
list[6] = 6
The algorithm will continue until there is no subset left unprocessed. By that time, the list will have
been sorted. The following script showcases an implementation of selection sort in Python and its
output:
# Import the random module to generate random numbers
import random
import time

comparisons = 0
list = []

# Enter the number of list elements
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

# Selection sorts the list & records the stats for later use
# Start the timer
startTime = time.process_time()

# The selection sort algorithm
for i in range(size):
    locOfMin = i
    # Find the smallest element in the
    # remaining subset of the list
    for j in range(i + 1, size):
        comparisons += 1
        if list[locOfMin] > list[j]:
            locOfMin = j
    # Swap the minimum element with
    # the first element of the subset
    list[i], list[locOfMin] = list[locOfMin], list[i]

# End the timer
endTime = time.process_time()

# Display the basic info for the selection sort
print("The sorted list is: ", list)
print("The number of comparisons is: ", comparisons)
print("The elapsed time in seconds: ", (endTime - startTime))
Output 6.3.3:
Enter the number of elements in the list:7
The unsorted list is: [32, 81, -76, -88, 62, -53, -17]
The sorted list is: [-88, -76, -53, -17, 32, 62, 81]
The number of comparisons is = 21
The elapsed time in seconds = 0.0
Selection sort is a bit heavier than insertion sort, but it becomes comparatively faster as the list
grows larger. Nevertheless, for lists containing between approximately 1,000 and 50,000 elements,
both algorithms perform similarly in terms of their efficiency. Their most important difference is
that the efficiency of selection sort is quite similar across the best, average, and worst cases, with a
time complexity of O(n^2), whereas insertion sort has a complexity that in the best case might even
reach O(n). In practice, both algorithms are suitable for relatively small lists.
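The claim that selection sort behaves the same across the best, average, and worst cases can be checked with a short sketch. The helper selection_sort_stats is a hypothetical name introduced for this illustration; in contrast to the script above, it also counts swaps and skips the swap when the minimum is already in place.

```python
def selection_sort_stats(values):
    """Sort a copy of values and return (sorted list, comparisons, swaps)."""
    items = list(values)
    comparisons = 0
    swaps = 0
    n = len(items)
    for i in range(n - 1):
        loc_of_min = i
        # Scan the unsorted subset for its smallest element
        for j in range(i + 1, n):
            comparisons += 1
            if items[j] < items[loc_of_min]:
                loc_of_min = j
        # Swap only when the minimum is not already in position i
        if loc_of_min != i:
            items[i], items[loc_of_min] = items[loc_of_min], items[i]
            swaps += 1
    return items, comparisons, swaps


for sample in ([1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1], [3, 5, 4, 2, 3, 1, 6]):
    print(sample, "->", selection_sort_stats(sample))
```

Each seven-element sample needs exactly n(n−1)/2 = 21 comparisons regardless of its initial order, while the number of swaps never exceeds n−1.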
The following list provides approximate comparative figures highlighting the performance
differences between the two algorithms (assume 1 comparison takes 1 nanosecond or 1.0e−9 seconds;
Cs stands for Comparisons):
• n = 10: 45 Cs (up to 40 in Insertion S.) → approx. 6.0e−4 seconds (2.0e−4 in Insertion S.)
• n = 100: 4.9e3 Cs (up to 4.0e3 in Insertion S.) → approx. 8.0e−3 seconds (2.5e−3 in Insertion S.)
• n = 1,000: 5.0e5 Cs (up to 5.0e5 in Insertion S.) → approx. 0.18 seconds (0.16 seconds in Insertion S.)
• n = 10,000: 5.0e7 Cs (4.0e7 Cs in Insertion S.) → approx. 17 seconds (15 seconds in Insertion S.)
• n = 20,000: 2.0e8 Cs (9.8e7 Cs in Insertion S.) → approx. 62 seconds (57 seconds in Insertion S.)
• n = 30,000: 4.5e8 Cs (2.2e8 Cs in Insertion S.) → approx. 142 seconds (125 seconds in Insertion S.)
6.3.4 Shell Sort
In order to improve the performance of sorting larger lists, the reader can use the shell sort (also
referred to as the diminishing-increment sort). The main problem with previously discussed
algorithms like insertion, selection and bubble sort, is their time performance of O(n^2), making them
extremely slow when sorting big lists. Shell sort, while being based on insertion sort, compares and
moves elements that are a given distance apart rather than only adjacent ones. Initially, elements
within a specifically defined distance in the list are sorted. The algorithm then starts working with
elements of decreasing distances until, in the final pass, adjacent elements are processed. The key
point in this algorithm is that every pass deals with a relatively small number of elements, or with
already sorted elements, and every pass ensures that an increasing part of the list is ordered. The
sequence of the distances can change, provided that the last distance is 1. It is mathematically
proven that the algorithm has a time complexity of O(n^1.2).
Observation 6.9 – Shell Sort:
An improved variation of the insertion sort, sorting subsets of a list based on the distance between
the various list elements. The process starts with a defined number that is reduced in each iteration
(usually by one).
As an example, let us consider the following list: 3, 5, 2, 4, 6, 1, 7, 9, 8. In the 1st pass, the list is
split into three subsets, each of which is processed using the insertion sort. In this particular case,
the three subsets have a distance of three between each element:
• 1st Pass/Subset 1: 3, 4, 7. Result after insertion sort: 3, 4, 7
• 1st Pass/Subset 2: 5, 6, 9. Result after insertion sort: 5, 6, 9
• 1st Pass/Subset 3: 2, 1, 8. Result after insertion sort: 1, 2, 8
After the end of the 1st pass the list will be in the following order: 3, 5, 1, 4, 6, 2, 7, 9, 8.
In the 2nd pass, the list is split into two subsets, with each one being processed again using the insertion sort. In this case, the two subsets have a distance of two between each element:
• 2nd Pass/Subset 1: 3, 1, 6, 7, 8. Result after insertion sort: 1, 3, 6, 7, 8
• 2nd Pass/Subset 2: 5, 4, 2, 9. Result after insertion sort: 2, 4, 5, 9
After the end of the 2nd pass, the complete list will be in the following order: 1, 2, 3, 4, 6, 5, 7, 9, 8.
Finally, in the 3rd pass, the list is dealt with as a whole, again using the insertion sort. Given that
the previous passes ensured that the list is close to being fully sorted, this pass does not require
multiple swaps but only the necessary comparisons. The following script implements the
aforementioned algorithm:
# Import the random module to generate random numbers
import random
import time

comparisons = 0
list = []

# Enter the number of elements for the list
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

# Start the timer
startTime = time.process_time()

# Use shell sort to sort the list and record the statistics for later use
# Start with a big distance and reduce it successively
distance = size // 2

# Insertion sorts each of the list subsets divided by distance
# (the last useful distance is 1, so the loop stops before 0)
while distance > 0:
    # The insertion sort algorithm
    for i in range(size):
        temp = list[i]
        loc = i
        while (loc >= distance) and (list[loc - distance] > temp):
            # Only comparisons that lead to a shift are counted here
            comparisons += 1
            list[loc] = list[loc - distance]
            loc = loc - distance
        list[loc] = temp
    distance -= 1

# End the timer
endTime = time.process_time()

# Display basic info for the shell sort
print("The sorted list is: ", list)
print("The number of comparisons is: ", comparisons)
print("The elapsed time in seconds is: ", (endTime - startTime))
Output 6.3.4:
Enter the number of elements in the list:10
The unsorted list is: [-47, 79, -79, 94, -79, -97, -7, -3, 49, 88]
The sorted list is: [-97, -79, -79, -47, -7, -3, 49, 79, 88, 94]
The number of comparisons is = 10
The elapsed time in seconds = 0.0
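The three passes of the worked example (distances 3, 2, and 1) can be traced with the following sketch. The function shell_sort_passes and its gaps parameter are assumptions made for this illustration; the script above derives its distances from the list size instead of taking them as an argument.

```python
def shell_sort_passes(values, gaps):
    """Run a gapped insertion sort for each gap and record the list after each pass."""
    items = list(values)
    snapshots = []
    for gap in gaps:
        # Insertion sort applied to elements that are `gap` positions apart
        for i in range(gap, len(items)):
            temp = items[i]
            loc = i
            while loc >= gap and items[loc - gap] > temp:
                items[loc] = items[loc - gap]
                loc -= gap
            items[loc] = temp
        snapshots.append(list(items))
    return snapshots


for state in shell_sort_passes([3, 5, 2, 4, 6, 1, 7, 9, 8], gaps=(3, 2, 1)):
    print(state)
```

The first two printed lists match the results given in the text for the 1st and 2nd pass, and the third is the fully sorted list. The sequence of gaps must end with 1, otherwise the final list is not guaranteed to be sorted.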
While the efficiency of the algorithm may not be instantly noticeable, it does make a difference
when examined more closely. The following list of approximate results showcases the performance
difference between insertion sort and shell sort (assume 1 comparison takes 1 nanosecond or
1.0e−9 seconds; Cs stands for Comparisons):
• n = 10: 8 Cs (up to 40 in Insertion S.) → approx. 3.8e−4 seconds (2.0e−4 in Insertion S.)
• n = 100: 4e2 Cs (up to 4.0e3 in Insertion S.) → approx. 3.8e−3 seconds (2.5e−3 in Insertion S.)
• n = 1,000: 1.5e4 Cs (up to 5.0e5 in Insertion S.) → approx. 0.27 seconds (0.16 seconds in Insertion S.)
• n = 10,000: 1.7e5 Cs (4.0e7 Cs in Insertion S.) → approx. 26 seconds (15 seconds in Insertion S.)
• n = 20,000: 3.4e5 Cs (9.8e7 Cs in Insertion S.) → approx. 99 seconds (57 seconds in Insertion S.)
• n = 30,000: 5e5 Cs (2.2e8 Cs in Insertion S.) → approx. 215 seconds (125 seconds in Insertion S.)
6.3.5 Shaker Sort
The shaker sort algorithm is based on the bubble sort, but instead of the list always being read in
the same direction, consecutive readings occur in opposite directions. This ensures that both the
highest and lowest value elements of the list move to the correct positions faster. The main
disadvantage of this algorithm is that, since it is based on bubble sort, its time complexity is bound
to O(n^2).
Observation 6.10 – Shaker Sort:
Use two separate for loops nested inside a while loop to read a list of elements in opposite
directions. This ensures that the elements will be positioned to the correct places in the list
faster than with bubble sort.
The following list provides approximate comparisons between the shaker and the bubble sort.
The examples support the argument that it is not worth using this algorithm unless the size of the list
falls within the approximate range of 1,000–50,000 elements. For lists with more elements than the
upper threshold of this range (50,000), using the shaker sort is impractical (as in previous examples,
1 comparison takes 1 nanosecond to complete and 1 nanosecond = 1.0e−9 seconds):
• n = 10: ~40 Cs ((n−1)^2 = 81 in Bubble S.) → approx. 7.7e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~4.2e3 Cs ((n−1)^2 = 9.8e3 in Bubble S.) → approx. 3.2e−3 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~3.9e5 Cs ((n−1)^2 = 9.98e5 in Bubble S.) → approx. 0.28 seconds (0.37 in Bubble S.)
• n = 10,000: ~3.8e7 Cs ((n−1)^2 = 9.998e7 in Bubble S.) → approx. 28 seconds (46 in Bubble S.)
• n = 20,000: ~1.5e8 Cs ((n−1)^2 = 4.0e8 in Bubble S.) → approx. 110 seconds (188 in Bubble S.)
In general, the time complexity of the algorithm for the average and worst cases is O(n^2), while
slight improvements can potentially lead to a running time complexity of O(n) at best.
As an example, let us consider the same list as the one used with bubble sort: 3, 5, 4, 2, 3, 1, 6, 7.
During the 1st outer loop, shaker sort will execute two inner iterations successively, with one
iteration processing the list to the right and one to the left. Each time an inner loop processes the
list to the right, the pointer at the end of the list is reduced by one. Similarly, each time it processes
the list to the left, the pointer at the start of the list is increased by one. Starting with the 1st outer
iteration, the inner loop presented in Table 6.5 (processing the list to the right) will take place.
Likewise, in the 1st outer iteration, the inner loop presented in Table 6.6 will process the list to
the left.
The reader should note that, at the end of each outer iteration, the highest value element of the
current sub-list is pushed to the end of the sub-list and the lowest is pushed to the start. Table 6.7
presents the results of each of the outer iterations. Note that the algorithm will stop at the end of the
first inner iteration of the 3rd outer pass, as there are no more swaps to be made:
TABLE 6.5
The First Inner Loop within the First Main Iteration, Reading the List to the Right
3 5 4 2 3 1 6 7
3 5 4 2 3 1 6 7
3 4 5 2 3 1 6 7
3 4 2 5 3 1 6 7
3 4 2 3 5 1 6 7
3 4 2 3 1 5 6 7
3 4 2 3 1 5 6 7
TABLE 6.6
The Second Inner Loop within the First Main Iteration, Reading the List to the Left
3 4 2 3 1 5 6 7
3 4 2 3 1 5 6 7
3 4 2 3 1 5 6 7
3 4 2 1 3 5 6 7
3 4 1 2 3 5 6 7
3 1 4 2 3 5 6 7
1 3 4 2 3 5 6 7
TABLE 6.7
The Results of the Outer Loops
After the 1st pass: 1 3 4 2 3 5 6 7
After the 2nd pass: 1 2 3 3 4 5 6 7
After the 1st inner of the 3rd outer pass: 1 2 3 3 4 5 6 7
The following script demonstrates an implementation of the shaker sort and its output:
# Import the random module to generate random numbers
import random
import time

comparisons = 0
list = []

# Enter the number of list elements
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

# Start the timer
startTime = time.process_time()

# The shaker sort algorithm
swapped = True
start = 0
end = size - 1

# Keep running the shaker sort while swaps are taking place
while swapped:
    # Set swap to false to start the new loop
    swapped = False
    # Loop from left to right using bubble sort
    for i in range(start, end):
        comparisons += 1
        if list[i] > list[i + 1]:
            temp = list[i]
            list[i] = list[i + 1]
            list[i + 1] = temp
            swapped = True
    # If there were no swaps, the list is sorted
    if not swapped:
        break
    # If at least one swap, then reset swap to false and continue
    else:
        swapped = False
    # Decrease the end of the list by 1, since largest element moved
    # to the right
    end -= 1
    # Loop from right to left using bubble sort
    for i in range(end, start, -1):
        comparisons += 1
        if list[i] < list[i - 1]:
            temp = list[i]
            list[i] = list[i - 1]
            list[i - 1] = temp
            swapped = True
    # Increase the start of the list by 1 since smallest element moved
    # to the left
    start += 1

# End the timer
endTime = time.process_time()

# Display the sorted list
print("The sorted list is: ", list)
print("The number of comparisons is: ", comparisons)
print("The elapsed time in seconds: ", (endTime - startTime))
Output 6.3.5:
Enter the number of elements in the list:15
The unsorted list is: [98, -23, -29, 17, -11, 2, 77, -20, -53, 66, -2, 33,
63, 33, 68]
The sorted list is: [-53, -29, -23, -20, -11, -2, 2, 17, 33, 33, 63, 66, 68,
77, 98]
The number of comparisons is = 77
The elapsed time in seconds = 0.0
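The rows of Table 6.7 can be reproduced with a compact sketch that stores a copy of the list at the end of every outer pass. The function name shaker_passes is an assumption introduced for this illustration, and tuple swaps are used in place of the temp variable of the script above.

```python
def shaker_passes(values):
    """Return the state of the list after each outer pass of shaker sort."""
    items = list(values)
    start, end = 0, len(items) - 1
    snapshots = []
    swapped = True
    while swapped:
        swapped = False
        # Left-to-right pass: the largest element moves to the end
        for i in range(start, end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            snapshots.append(list(items))  # sorted after the first inner pass
            break
        swapped = False
        end -= 1
        # Right-to-left pass: the smallest element moves to the start
        for i in range(end, start, -1):
            if items[i] < items[i - 1]:
                items[i], items[i - 1] = items[i - 1], items[i]
                swapped = True
        start += 1
        snapshots.append(list(items))
    return snapshots


for state in shaker_passes([3, 5, 4, 2, 3, 1, 6, 7]):
    print(state)
```

For the example list, three states are printed: the result of the 1st pass, the result of the 2nd pass, and the unchanged list after the first inner iteration of the 3rd pass, where the algorithm stops.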
6.4 RECURSION, BINARY SEARCH, AND EFFICIENT SORTING WITH LISTS
In a broader context, any attempt to find an algorithm that addresses the problem of sorting a list
efficiently is subject to certain restrictions. This is due to the fact that the algorithms presented so
far generally fall within the same time complexity of O(n^2), as a result of their inherent nested loop
structures. As shown in the previous sections, this is true even when improved and optimized
versions of the algorithms are used. In order to improve the efficiency of sorting algorithms further,
recursion must be adopted. This section presents and discusses the concept of recursion, and uses it
as a base to implement some common related algorithmic ideas like binary search and factorial.
Subsequently, two notable algorithms that address the problem of sorting large lists in an efficient
way are presented: merge sort and quicksort.
6.4.1 Recursion
By definition, a recursive function is one that calls itself. The basic idea is to break a large problem
into several smaller parts that are equivalent to the original. These are further broken down
successively into even smaller parts, until the problem is small enough for its solution to become evident.
This final point is called a terminal or base case. The condition that must be met in order to achieve
the terminal case is called the terminal condition. The associated step followed to break down the
problem into smaller parts is called the basic step.
Observation 6.11 – Recursion:
A recursive function is one that calls itself. It takes a large problem and breaks it into smaller ones
successively, following a step. The step is repeated until the smaller parts are so small that the
solution is evident. The final and smallest part is referred to as the terminal or base case.
In order to contextualize the idea of recursion, one needs to break down what happens on a
recursive function call:
• Firstly, the compiler/interpreter passes a parameter to the function.
• The called function and its parameter are pushed to the program stack (stacks are discussed
in Section 6.5.5), a separate place in memory where the local variables are stored until this
particular function call is completed.
• The compiler/interpreter records the return address, which will be used as a return to the
calling function when the current function call is complete.
• When the current function call is complete, the compiler/interpreter records the value to be
returned to the calling function (if applicable).
In terms of its results, recursion is similar to the iteration explained in Chapter 2, but differs in terms
of the constructs used. An iterative algorithm uses a looping construct whereas a recursive algorithm
uses a branching structure. In terms of both time and memory usage, recursive solutions are often
less efficient than their iterative counterparts. However, on many occasions they are the only solutions
available. Their main advantage is that by simplifying the solution to a single problem they often
result in shorter and more readable source code.
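The contrast between a looping construct and a branching structure can be illustrated with a minimal sketch that adds the integers from 1 to n in both styles. The function names sum_iterative and sum_recursive are hypothetical and serve only this comparison.

```python
def sum_iterative(n):
    """Add the integers 1..n with a looping construct."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_recursive(n):
    """Add the integers 1..n with a branching structure."""
    if n <= 0:                        # terminal (base) case
        return 0
    return n + sum_recursive(n - 1)   # step towards the base case


print(sum_iterative(10), sum_recursive(10))
```

Both functions return the same value; the recursive version, however, keeps one pending call on the program stack for every value of n until the base case is reached.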
The following script presents a basic recursive function that calls itself continuously and
indefinitely, printing a particular message:
def message():
    print("This is a recursive function")
    message()

message()
Output 6.4.1.a:
This is a recursive function
This is a recursive function
This is a recursive function
This is a recursive function
RecursionError Traceback (most recent call last)
<ipython-input-l-e0c7cc045453> in <module>
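The RecursionError shown above is raised because the interpreter limits the depth of the call stack. The sketch below, which assumes a standard CPython interpreter, reads that limit with sys.getrecursionlimit() and catches the error; the functions endless and hits_recursion_limit are hypothetical names used only for this illustration.

```python
import sys


def endless(depth=0):
    """A recursive function without a base case (for demonstration only)."""
    return endless(depth + 1)


def hits_recursion_limit():
    """Return True if calling endless() ends with a RecursionError."""
    try:
        endless()
    except RecursionError:
        return True
    return False


print("Recursion limit:", sys.getrecursionlimit())
print("RecursionError raised:", hits_recursion_limit())
```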
To prevent the function from falling into this infinite call loop, the number of repetitions must be
controlled. This can be achieved by incorporating the following two steps:
• A dividing step must be applied to a subset of the original values in each repetition.
• The terminal or base case must be defined and calculated (if applicable).
The following script is a modified version of the message() function presented above. It passes an
integer argument that dictates the number of times the function will call itself before the terminal
case:
# The recursive function
def message(times):
    print("Message called with times = ", times)
    # Define the dividing step through an if statement
    if (times > 0):
        print("\tThis is a recursive function.\n")
        message(times - 1)
    # The terminal or base case stops recursion & "roll back"
    print("Message returning with times = ", times, "\n")

# Start the recursion by calling the recursive function
message(3)
Output 6.4.1.b:
Message called with times = 3
This is a recursive function.
Message called with times = 2
This is a recursive function.
Message called with times = 1
This is a recursive function.
Message called with times = 0
Message returning with times = 0
Message returning with times = 1
Message returning with times = 2
Message returning with times = 3
The application of recursion can be also considered in the context of a purely mathematical function, that of the factorial. The complete definition of the factorial is f(n) = n * f(n−1) for n > 1, and
f(1) = 1 for n = 1. According to this definition, for f(4) the result would be calculated as follows:
f(4) = 4 * f(3) = 4 * 3 * f(2) = 4 * 3 * 2 *f(1) = 4 * 3 * 2 * 1 = 24.
Notice that in the case of f(1) there is no further breakdown of the function, as this is considered
the terminal or base case with a result of f(1) = 1. The following script implements the solution of
the factorial:
# The factorial function using recursion
def factorial(n):
    # The terminal or base case (n <= 1 also stops the recursion
    # if 0 is entered, instead of recursing indefinitely)
    if (n <= 1):
        return 1
    # The recursive step
    else:
        print(n, "* f(", n - 1, ")")
        return n * factorial(n - 1)

num = int(input("Enter the number to find its factorial: "))
print("The factorial for", num, "is ", factorial(num))
Output 6.4.1.c:
Enter the number to find its factorial: 1
The factorial for 1 is 1
Enter the number to find its factorial: 3
3 * f( 2 )
2 * f( 1 )
The factorial for 3 is 6
Enter the number to find its factorial: 7
7 * f( 6 )
6 * f( 5 )
5 * f( 4 )
4 * f( 3 )
3 * f( 2 )
2 * f( 1 )
The factorial for 7 is 5040
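As a quick check of the recursive definition, the sketch below compares a recursive factorial with math.factorial from the Python standard library. The name factorial_recursive and the explicit rejection of negative input are assumptions added for this illustration.

```python
import math


def factorial_recursive(n):
    """Compute n! following f(n) = n * f(n-1) with base case f(1) = 1."""
    if n < 0:
        raise ValueError("the factorial is not defined for negative numbers")
    if n <= 1:          # base case (also covers 0! = 1)
        return 1
    return n * factorial_recursive(n - 1)


for n in (1, 3, 7):
    print(n, factorial_recursive(n), math.factorial(n))
```

For the three inputs used in Output 6.4.1.c, both functions return 1, 6, and 5040 respectively.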
6.4.2 Binary Search
One of the most well-known applications of recursion is the binary search. The main idea behind
binary search can be illustrated by the task of finding whether a word exists in a dictionary. The
necessary precondition is that it is used on a sorted list, regardless of the algorithm used for the
sorting.
Observation 6.12 – Binary Search:
A recursive algorithm applied to sorted lists in order to find the location of a particular element.
The concept is rather simple:
• Initially, the algorithm checks whether the middle element of the list is the search value.
• If it is not and the middle element value is larger than the search value, the list is split
into two halves and the middle element of the first half is checked; otherwise, the middle
element of the second half is checked.
• The algorithm continues until the desired element is found, in which case the element and
its position in the list are reported. If the search element is not found, a relevant message
is generated.
A pseudocode outline of the binary search algorithm is provided below:
# The recursive function for binary search
binarySearch(word, startPage, endPage)
    # if the dictionary consists of one page (base case) search for it in
    # that page
    if startPage = endPage
        search the word in the startPage
    else
        # get to the middle of the dictionary
        middlePage = (endPage + startPage)/2
        # determine which half of the dictionary might contain
        # the chosen word
        # if the word is in the first half
        if the word is located before the middlePage
            # find the word in the first half of the dictionary
            binarySearch(word, startPage, middlePage)
        else
            # find the word in the second half of the dictionary
            binarySearch(word, middlePage+1, endPage)
In this particular algorithm, function binarySearch calls itself recursively. At each call, the problem
gets smaller as the size is halved. The base case is the startPage = endPage statement that dictates
that either the word is found or it does not exist in the dictionary.
The following script implements the algorithm:
# The list of numbers to search in
listOfNumbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The recursive function for binary search
def binarySearch(number, startPage, endPage):
    # If the list consists of one page (base case) search for it
    # in that page
    if (startPage == endPage):
        if (listOfNumbers[startPage] == number):
            print("The number was found in the list in "
                  "position: ", startPage)
        else:
            print("The number was not found in the list")
    else:
        # Split the list using the middle point as a reference
        middlePage = (endPage + startPage) // 2
        # Determine which half of the list might contain the number
        # If the number is in the first half
        if (number <= listOfNumbers[middlePage]):
            # Find the number in the first half of the list
            binarySearch(number, startPage, middlePage)
        else:
            # Find the number in the second half of the list
            binarySearch(number, middlePage + 1, endPage)

num = int(input("Enter the number to find in the list: "))

# Call the binarySearch function
binarySearch(num, 0, len(listOfNumbers) - 1)
Output 6.4.2:
Enter the number to find in the list: 7
The number was found in the list in position: 6
Enter the number to find in the list: 23
The number was not found in the list
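The script above prints its result and relies on a global list. As a minimal sketch of an alternative, the same halving idea can be written iteratively as a function that returns the position of the number, or -1 when it is absent. The function name binary_search and the -1 convention are assumptions made for this illustration and are not part of the original script.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        # Look at the middle element of the current search range
        middle = (low + high) // 2
        if sorted_values[middle] == target:
            return middle
        if sorted_values[middle] < target:
            # The target can only be in the second half
            low = middle + 1
        else:
            # The target can only be in the first half
            high = middle - 1
    return -1


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(binary_search(numbers, 7))
print(binary_search(numbers, 23))
```

Returning a value instead of printing makes the function reusable by other code, and the loop avoids the overhead of recursive calls.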
6.4.3 Quicksort
Quicksort is considered one of the more advanced sorting algorithms for lists (i.e., static objects),
with a better average performance than insertion, selection, and shell sort. It was published by
Hoare in 1961 (Hoare, 1961). Quicksort belongs to a well-known and highly regarded family of
algorithms adopting the divide and conquer strategy.
The algorithm sorts a list of n elements by picking a key value k in the list as a pivot point, around which the list elements are then rearranged. Finding or calculating the ideal pivot point is key, although not absolutely necessary. The pivot point should be either the median or close to the median key value, so that the numbers of preceding and succeeding elements in the list are balanced.
Observation 6.13 – Quicksort: Select an element in the list as the pivot k element and rearrange the rest so that lower value elements precede it and higher succeed it (or the opposite). Apply the same process to the two resulting sub-lists repeatedly, until there are no more lists to divide. By definition, at the end of this process the list will be sorted.
Once this pivot key (k) is decided, the elements of the list are rearranged so that those with lower values appear before it and those with higher values after it. Once this process is completed, k occupies its final position and the list is partitioned into two sub-lists: one containing the values lower than k and one containing the values higher than k. This process is applied recursively to the two sub-lists and all subsequent sub-lists created based on them until there are no lists to divide. Once this process is complete, the list is sorted by definition.
As an example, let us consider the following list: 37, 2, 6, 4, 89, 8, 10, 12, 68, 45. The first element
(i.e., list[0]: 37) is taken as the pivot element (k). The process will start with the rightmost element
of the list, moving in a decremental order from that point on (i.e., list[9]: 45, list[8]: 68, list[7]: 12).
Each element is compared with k until an element with a lower value is found. In this instance, the
process will stop at list[7]: 12 and this element will be swapped with k (Table 6.8).
TABLE 6.8
The First Round of Comparisons at the Right of the List and Towards the Pivot Element
37   2   6   4  89   8  10  12  68  45
37   2   6   4  89   8  10  12  68  45
37   2   6   4  89   8  10  12  68  45
12   2   6   4  89   8  10  37  68  45
TABLE 6.9
The First Round of Comparisons at the Left of the List and Towards the Pivot Element
12   2   6   4  89   8  10  37  68  45
12   2   6   4  89   8  10  37  68  45
12   2   6   4  89   8  10  37  68  45
12   2   6   4  89   8  10  37  68  45
12   2   6   4  37   8  10  89  68  45
TABLE 6.10
The First Round of Comparisons Resumes at the Right of the Pivot Element
12   2   6   4  37   8  10  89  68  45
12   2   6   4  10   8  37  89  68  45
TABLE 6.11
The First Round of Comparisons Resumes and Finishes at the Left
12   2   6   4  37   8  10  89  68  45
12   2   6   4  10   8  37  89  68  45
Next, k (37) will be compared with the elements on its left, beginning with the element after 12. The comparisons will continue in increasing index order until an element greater than 37 is found. This will happen for value 89, so 37 and 89 will be swapped (Table 6.9).
After the swap, the process will resume with the element to the left of the previously swapped element (89), that is, with the elements to the right of pivot element k. The first element that will be considered is 10, which is smaller than the pivot element, so the two elements will be swapped. The rearranged list is shown in Table 6.10.
Finally, the process will start again at the left of the sub-list with 37 as the pivot, and begin with
the element after 10. This time, the only remaining element to compare (8) is lower than 37 so no
swap will take place between the two elements. This first round of comparisons will end with the 1st
pivot element (37) placed in its final place in the list, leaving two unsorted sub-lists on its left and
right sides (Table 6.11).
This is the first partitioning of the list into the first two unsorted sub-lists. The exact same
comparison process will be next applied to both the left and right sub-lists recursively. When all
comparisons and partitions are complete there will be no further sub-lists left to sort and the entire
list will be sorted.
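The walkthrough above can be traced in code. The following is a minimal sketch, assuming the first element is always taken as the pivot and that the pivot itself is swapped back and forth as described in the example. The names partition and quick_sort are chosen for this illustration only, and the scheme differs slightly from the two-index version used in the script that follows.

```python
def partition(values, start, end):
    """Move the pivot values[start] to its final position; return that index."""
    pivot = values[start]
    position = start          # current index of the pivot
    low, high = start, end    # bounds of the region not yet compared
    while low < high:
        # Scan from the right for an element smaller than the pivot
        while high > position and values[high] >= pivot:
            high -= 1
        if high > position:
            values[position], values[high] = values[high], values[position]
            position = high
            low += 1
        # Scan from the left for an element larger than the pivot
        while low < position and values[low] <= pivot:
            low += 1
        if low < position:
            values[position], values[low] = values[low], values[position]
            position = low
            high -= 1
    return position


def quick_sort(values, start=0, end=None):
    """Sort values in place by partitioning recursively."""
    if end is None:
        end = len(values) - 1
    if start >= end:
        return
    p = partition(values, start, end)
    quick_sort(values, start, p - 1)
    quick_sort(values, p + 1, end)


example = [37, 2, 6, 4, 89, 8, 10, 12, 68, 45]
print(partition(example, 0, len(example) - 1), example)
quick_sort(example)
print(example)
```

The first call performs the three swaps of the example (with 12, 89, and 10) and leaves 37 at index 6, between the two unsorted sub-lists.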
The algorithm may seem rather complicated and its efficiency difficult to gauge. Nevertheless, on
average it is considerably more efficient than the sorting algorithms discussed previously. A script
implementing the quicksort algorithm is provided below:
# Import the random and time modules
# to generate random numbers and keep time
import random
import time

list = []

# Partition list[start..end] around the pivot list[start]
# and return the final position of the pivot
def quickSortReadings(list, start, end):
    global comparisons
    pivot = list[start]
    low = start + 1
    high = end
    while (True):
        # Compare elements from the right to find one
        # that is smaller than the pivot. Stop when one is found
        while (low <= high and list[high] >= pivot):
            high -= 1; comparisons += 1
        # Compare elements from the left to find one
        # that is larger than the pivot. Stop when one is found
        while (low <= high and list[low] <= pivot):
            low += 1; comparisons += 1
        # If an element larger or smaller than the pivot is found
        # swap elements to put things in order & continue the process
        if (low <= high):
            list[low], list[high] = list[high], list[low]
        # Stop and exit if the low index moved beyond the high index
        else:
            break
    # Place the pivot in its final position
    list[start], list[high] = list[high], list[start]
    return high

# The quicksort algorithm: partition, then sort each side recursively
def quickSortPartition(list, start, end):
    if start >= end:
        return
    p = quickSortReadings(list, start, end)
    quickSortPartition(list, start, p - 1)
    quickSortPartition(list, p + 1, end)

# Enter the number of list elements
size = int(input("Enter the number of list elements:"))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

comparisons = 0

# Start the timer
startTime = time.process_time()

quickSortPartition(list, 0, size - 1)

# End the timer
endTime = time.process_time()

# Display the sorted list
print("The sorted list is: ", list)
print("The number of comparisons is = ", comparisons)
print("The elapsed time in seconds = ", (endTime - startTime))
Output 6.4.3:
Enter the number of elements in the list:10
The unsorted list is: [-94, -1, -35, 13, -73, 18, 4, 29, 46, -62]
The sorted list is: [-94, -73, -62, -35, -1, 4, 13, 18, 29, 46]
The number of comparisons is = 26
The elapsed time in seconds = 0.0
The following estimates provide a rough comparison between quicksort and bubble sort, highlighting the fact that the former operates at a completely different efficiency level and is thus capable of processing much larger lists. The only possible restrictions in relation to its use have to do with the power of the computer system used and the available memory, as these are determining factors when running recursive calls on lists larger than 100,000 elements (a comparison takes 1 nanosecond to complete and 1 nanosecond = 1.0e−9 seconds):
• n = 10: ~30 Cs (n² = 81 in Bubble S.) → approx. 1.8e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~6.2e2 Cs (n² = 9.8e3 in Bubble S.) → approx. 4e−4 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~1e4 Cs (n² = 9.98e5 in Bubble S.) → approx. 9.7e−3 seconds (3.7e−1 in Bubble S.)
• n = 10,000: ~3e5 Cs (n² = 9.998e7 in Bubble S.) → approx. 0.1 seconds (46 in Bubble S.)
• n = 20,000: ~1e6 Cs (n² = 2.0e8 in Bubble S.) → approx. 0.3 seconds (188 in Bubble S.)
• n = 30,000: ~3e6 Cs (n² = 2.0e8 in Bubble S.) → approx. 0.6 seconds (Not practical in Bubble S.)
• n = 100,000: ~2e7 Cs (n² = 2.0e8 in Bubble S.) → approx. 5.6 seconds (Not practical in Bubble S.)
• n = 300,000: ~1.8e8 Cs (n² = 2.0e8 in Bubble S.) → approx. 48 seconds (Not practical in Bubble S.)
In terms of time complexity, while the worst case runs at O(n²), the average and best cases run at the
much more efficient level of O(n log n).
6.4.4 Merge Sort
Merge sort is another advanced algorithm for efficient sorting of large lists, falling into the same divide and conquer approach as quicksort. Merge sort is an excellent choice for sorting data that cannot be held in the computer memory all at once and is, thus, kept in secondary storage.
Observation 6.14 – Merge Sort: A divide and conquer algorithm for sorting static lists. The basic idea is to divide the list into two sub-lists repeatedly, until all sub-lists consist of a single element. The divided lists are then merged again following a particular sorting procedure.
The essential idea behind merge sort is to split lists into two halves continuously until all sub-lists consist of a single element and, subsequently, merge the sub-lists while also ordering their elements. Algorithmically, the process is rather straightforward, particularly for the split part. The process the programmer must follow for merging each given set of two sub-lists is summarized below:
• Check if the first sub-list is empty.
• If not, check if the second sub-list is empty.
• If not, compare the first available element in the first sub-list with the first available element
in the second sub-list.
• Whichever of the two elements has a lower value must be placed in the first available slot
of a new merged list.
• This process should be repeated for all remaining elements of the two sub-lists.
• If all the elements of one of the sub-lists have been used, place the remaining elements of
the other sub-list in the new merged list, in the order they appear in the sub-list.
• Recursively repeat this process until all the sub-lists are merged into one ordered merged list.
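The merging steps listed above can be sketched as a small function that builds a new merged list, together with a recursive function that performs the splitting. This is a minimal illustration, assuming list slicing for the split. The names merge and merge_sort are chosen for this sketch, which returns a new list rather than sorting in place.

```python
def merge(left, right):
    """Merge two already sorted lists into one new sorted list."""
    merged = []
    i = j = 0
    # Compare the first available element of each sub-list
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One sub-list is used up; copy what remains of the other, in order
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(values):
    """Split the list into halves recursively, then merge the sorted halves."""
    if len(values) <= 1:
        return values
    middle = len(values) // 2
    return merge(merge_sort(values[:middle]), merge_sort(values[middle:]))


print(merge([9, 13, 25], [5, 17, 32]))
print(merge_sort([25, 13, 9, 32, 17, 5, 33, 25, 43, 21]))
```

Using <= in the comparison keeps equal elements in their original relative order, which makes the sort stable.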
As an example, let us consider the following list: 25, 13, 9, 32, 17, 5, 33, 25, 43, 21. Firstly, the list is
split into the required set of sub-lists:
Next, the lists are merged on a bottom-up basis, as shown below:
The following script provides an implementation of the merge sort algorithm:
# Random and time modules generate random numbers & keep time
import random
import time

# Merge two sub-lists, list[first..middle] and list[middle+1..last]
def merge(first, middle, last):
    global list, comparisons
    size1 = middle - first + 1; size2 = last - middle
    # Create temporary lists
    leftList = []; rightList = []
    # Copy original list to temporary lists leftList & rightList
    for i in range(0, size1):
        leftList.append(list[first + i])
    for j in range(0, size2):
        rightList.append(list[middle + 1 + j])
    # Merge temp lists leftList & rightList into original list
    # until one of the sub-lists is empty
    i = 0; j = 0; k = first
    while (i < size1 and j < size2):
        if (leftList[i] <= rightList[j]):
            list[k] = leftList[i]; i += 1; comparisons += 1
        else:
            list[k] = rightList[j]; j += 1; comparisons += 1
        k += 1
    # If rightList is used up, copy the remaining leftList elements
    while (i < size1):
        list[k] = leftList[i]; i += 1; k += 1
    # If leftList is used up, copy the remaining rightList elements
    while (j < size2):
        list[k] = rightList[j]; j += 1; k += 1

# The merge sort algorithm
def mergesort(first, last):
    # The recursive step
    if (first <= last - 1):
        middle = (first + last) // 2
        mergesort(first, middle)
        mergesort(middle + 1, last)
        merge(first, middle, last)

list = []

# Enter the number of list elements
size = int(input("Enter the number of list elements: "))

# Use the randint() function to generate random integers
for i in range(size):
    newNum = random.randint(-100, 100)
    list.append(newNum)
print("The unsorted list is: ", list)

comparisons = 0

# Start the timer
startTime = time.process_time()

mergesort(0, size - 1)

# End the timer
endTime = time.process_time()

# Display the sorted list
print("The sorted list is: ", list)
print("The number of comparisons is = ", comparisons)
print("The elapsed time in seconds = ", (endTime - startTime))
Output 6.4.4:
Enter the number of elements in the list:15
The unsorted list is: [83, -3, 89, 64, -5, 65, 78, 17, 8, -3, 82, 89, -80, 23, 64]
The sorted list is: [-80, -5, -3, -3, 8, 17, 23, 64, 64, 65, 78, 82, 83, 89, 89]
The number of comparisons is = 42
The elapsed time in seconds = 0.0
The efficiency of the algorithm in sorting static lists is comparable to that of quicksort (a comparison takes 1 nanosecond to complete; 1 nanosecond = 1.0e−9 seconds):
• n = 10: ~20 Cs (30 in Quicksort) → approx. 2e−4 seconds (1.8e−4 in Quicksort)
• n = 100: ~5.4e2 Cs (6.2e2 Cs in Quicksort) → approx. 0.0012 seconds (1.2e−2 in Quicksort)
• n = 1,000: ~8.6e3 Cs (1e4 Cs in Quicksort) → approx. 0.015 seconds (9.7e−3 seconds in
Quicksort)
• n = 10,000: ~1.2e5 Cs (3e5 Cs in Quicksort) → approx. 0.15 seconds (0.1 seconds in
Quicksort)
• n = 30,000: ~4e5 Cs (3e6 in Quicksort) → approx. 0.44 seconds (0.6 seconds in Quicksort)
• n = 100,000: ~1.5e6 Cs (2e7 in Quicksort) → approx. 1.6 seconds (5.6 seconds in Quicksort)
• n = 300,000: ~5e6 Cs (1.8e8 in Quicksort) → approx. 5.5 seconds (48 seconds in Quicksort)
In general, merge sort is more efficient than quicksort as it runs in O(n log n) time in all
cases (i.e., best, average, and worst case). Most importantly, its advantage becomes more pronounced as the
size of the list grows larger (e.g., lists consisting of hundreds of thousands of elements or more),
depending on the power, memory, and settings of the system it runs on.
6.5 COMPLEX DATA STRUCTURES
In the previous sections, the focus was on the implementation of sorting by means of relatively
simple, static data structures, like lists. When it comes to more advanced, real-life applications,
more complex data structures may be required. This section addresses such data structures, which
can take both linear and non-linear forms (Figure 6.1).
In linear structures, such as stacks, queues, and linked lists, each element occupies a position
that is relative to that of the previous and succeeding elements within the structure. Consequently, the
structure is traversed (i.e., read) sequentially. In non-linear structures, such as trees and graphs, the
items are not arranged in a single sequence, since an item may be connected to several others; thus, a simple sequential traversal is not feasible.
Non-linear structures are more complex to implement, but they are also more powerful. As such,
they are used extensively in real-life applications.
6.5.1 Stack
A stack is an ordered list with two ends, the top and the
base. New items are always inserted at the top end in
an operation called push. Items are also removed from
the top end, in what is referred to as pop. In a stack, the
last item to push is always the first to pop, hence a stack
is also called a last in, first out (LIFO) list. Besides the
item at the top, other items in the stack are not directly
accessible. As an analogy, one can think of a stack as a
pile of plates stacked upon each other. Each new plate is
placed at the top of the pile. In order to be used, a plate
is also taken from the top of the pile.
FIGURE 6.1 Classification of data structures.
Observation 6.15 – Stack: An ordered, linear list structure with two ends: top and base. Items are pushed to and popped from the top, and the last item pushed in the stack is the first to be popped out (LIFO). The operations performed on the stack are the following: initialize, push, pop, isEmpty, top, and size.
From a more formal, technical perspective, the stack ADS (Abstract Data Structure) consists of
the following:
• An index pointing at the top item in the stack, with values ranging from 0 to its maximum
size −1.
• The body of the stack that stores the values (i.e., the actual data of the list).
• Initialize – init(s): A function that initializes the stack (i.e., creating an empty list).
• Empty – isEmpty(s): A function that checks whether the stack (s) is empty.
• Push – push(x, s): A function that pushes a new item (x) onto the stack (s).
• Pop – pop(x, s): A function that deletes the top item (x) from the stack (s).
• Top – top(s): A function that returns the item at the top of the stack.
• Size – size(s): A function that returns the total number of items in the stack.
The following Python class (filename: Chapter6Stack.py) defines the stack structure (stack ADS):
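class Stack:
    # Initialize the stack as an empty list
    def __init__(self):
        self.items = []

    # Push a new item onto the top of the stack
    def push(self, item):
        self.items.append(item)

    # Remove and return the item at the top of the stack
    def pop(self):
        return self.items.pop()

    # Check whether the stack is empty
    def isEmpty(self):
        return self.items == []

    # Return the item at the top of the stack without removing it
    def top(self):
        if (not self.isEmpty()):
            return self.items[-1]

    # Return the number of items in the stack
    def size(self):
        return len(self.items)

    # Return the contents of the stack, from base to top
    def show(self):
        return self.items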
Since the class in this form is rather generic, it can be used for a variety of stack-based applications.
The following script imports the stack class from Chapter6Stack.py in order to implement a simple
example of the functionality of the stack:
import Chapter6Stack

fruits = Chapter6Stack.Stack()

# Confirm that the stack is empty
if (fruits.isEmpty() == True):
    print("The stack is empty")

# Push elements to the stack
fruits.push('apple')
fruits.push('orange')
fruits.push('banana')

# Confirm that the stack is not empty and print its contents
if (fruits.isEmpty() != True):
    print("The stack is not empty: Its size is: ", fruits.size())
    print("The contents of the stack are: ", fruits.show())

# Return the top item of the stack
print("The top item of the stack is: ", fruits.top())

# Remove the top item of the stack, print the new top item and the stack
print("Remove the top item of the stack: ", fruits.pop())
print("The top item of the stack is now: ", fruits.top())
print("The contents of the stack now are: ", fruits.show())
Output 6.5.1.a:
The stack is empty
The stack is not empty: Its size is: 3
The contents of the stack are: ['apple', 'orange', 'banana']
The top item of the stack is: banana
Remove the top item of the stack: banana
The top item of the stack is now: orange
The contents of the stack now are: ['apple', 'orange']
Stacks are used extensively in computer programs. A rather common example is storing page visits
on a web browser. Every page that is visited is added to a stack and when the user clicks on the back
button the last page visited is retrieved from the stack. A similar use can be found in the undo function included in most computer applications. A stack is used to store all the tasks performed in the
application and when the user clicks on the respective button, the last action is retrieved from the
stack and its action is reversed. Stacks are also useful in evaluating expressions, backtracking, and
implementing recursive function calls.
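The undo mechanism described above can be sketched with a plain Python list used as a stack. The class UndoableText and its methods are hypothetical names introduced for this illustration; a real application would typically store richer descriptions of each action.

```python
class UndoableText:
    """A tiny text buffer that remembers earlier states on a stack."""

    def __init__(self):
        self.text = ""
        self._history = []      # stack of earlier states (top = most recent)

    def append(self, new_text):
        # Push the current state before changing it
        self._history.append(self.text)
        self.text += new_text

    def undo(self):
        # Pop the most recent state; do nothing if there is no history
        if self._history:
            self.text = self._history.pop()


document = UndoableText()
document.append("Hello")
document.append(", world")
print(document.text)
document.undo()
print(document.text)
```

Because the last state pushed is the first one popped, actions are reversed in the opposite order to that in which they were performed.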
As an example of a practical use of the stack, let us consider the common utility task of converting a decimal number into binary. The algorithm is quite simple: repeatedly divide the decimal
number by 2 until the result is 0, while pushing the remainder of the integer division to the stack. At
the end of the process, all the items are popped from the stack to get the binary representation of the
decimal number. Assuming that the integer to be converted is number 21, the above procedure will
result in binary number 10101 (Figure 6.2).
FIGURE 6.2 Decimal to binary number conversion.
The following script implements the conversion, utilizing the Stack ADS (Chapter6Stack.py) as
in the previous example:
import Chapter6Stack

# decimal object implements the conversion using the stack
decimal = Chapter6Stack.Stack()

# Accept an integer to convert to binary form
userInput = int(input("Enter the integer to convert to binary: "))

# Repeatedly divide by 2; keep pushing the remainder to the stack
while (userInput > 0):
    decimal.push(userInput % 2)
    userInput = userInput // 2

# Confirm that the stack is not empty and print its contents
if (decimal.isEmpty() != True):
    print("The stack is not empty: Its size is: ", decimal.size())
    print("The contents of the stack are: ", decimal.show())

# Return the number in binary form
print("The binary form of the number is: ", end = '')
for i in range(decimal.size()):
    print(decimal.pop(), end = '')
Output 6.5.1.b:
Enter the integer to convert to binary: 56
The stack is not empty: Its size is: 6
The contents of the stack are: [0, 0, 0, 1, 1, 1]
The binary form of the number is: 111000
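The same procedure can be wrapped in a function and checked against Python's built-in bin() function. The following is a minimal sketch that uses a plain list as the stack; the function name to_binary is an assumption made for this illustration. It also handles the input 0, for which the loop in the script above pushes nothing and therefore prints no digits.

```python
def to_binary(number):
    """Return the binary digits of a non-negative integer as a string."""
    if number == 0:
        return "0"
    remainders = []                 # list used as a stack
    while number > 0:
        remainders.append(number % 2)   # push the remainder
        number //= 2
    digits = ""
    while remainders:
        digits += str(remainders.pop())  # pop in reverse order
    return digits


print(to_binary(21))
print(to_binary(56))
print(to_binary(56) == bin(56)[2:])
```

The slice bin(56)[2:] removes the "0b" prefix that bin() places in front of the digits.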
6.5.2 Infix, Postfix, Prefix
Another application of a stack that is particularly important in computer science is the evaluation
of arithmetic expressions. In general, the reader should be aware of the fact that there are three
kinds of arithmetic notations, namely infix, prefix, and postfix. Infix is what humans are mostly
used to, as it involves a binary operator appearing between two operands and determining the type
of operation that will take place between them (e.g., 3 + 5). In a prefix notation, the same expression would be converted to + 3 5, where the operator precedes both operands. Likewise, the postfix
notation would take the form 3 5 +, with the operator succeeding the two operands.
Observation 6.16 – Infix, Postfix, Prefix: Three different kinds of notations used to evaluate arithmetic expressions by humans or computers.
It must be noted that the postfix notation is the one used by compilers when evaluating an arithmetic expression. As such, the conversion of an infix expression that humans would understand more easily to a postfix expression that can be evaluated by compilers is a rather important task in computer science. The implementation of such a conversion poses three main problems that must be addressed:
• In an infix expression, the operation precedence is forcing multiplication/division to apply
before the additions/subtractions, whereas in a postfix expression there is no operator
priority.
• When translating an infix to a postfix expression, only the placement of the operators is
different. An algorithm that translates from infix to postfix only needs to shift the operators to the right, and possibly reorder them.
• Postfix expressions do not take parentheses.
The following algorithm uses a stack to temporarily store the operators until they can be inserted to
the right position into the postfix expression:
• Initialize the stack.
• Scan the infix expression from left to right.
• While the scanned character is valid:
  • If the character is an operand, move it directly to the postfix expression.
  • If the character is an operator, compare it with the operator at the top of the stack.
    • While the operator at the top of the stack is of higher or equal priority than the
      character just encountered, and is not a left parenthesis character, pop the operator
      from the stack and move it to the postfix expression. Once all such operators are
      popped, push the current character/operator to the stack.
  • If the character is a left parenthesis, push the character onto the stack.
  • If the character is a right parenthesis, pop and move the operators off the stack to the
    postfix expression until a left parenthesis is reached. Pop the left parenthesis and ignore it.
  • If the operator at the top of the stack is of a lower priority than the character just
    encountered or if the stack is empty, push the character that was just encountered to
    the stack.
• After the entire infix expression has been scanned, pop any remaining operators from the
stack and move them to the postfix expression.
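The conversion steps above can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are separated by spaces, the multiplication sign is written as x (as in the examples of this section), the expression is well formed, and the names PRECEDENCE and infix_to_postfix are chosen for this sketch only.

```python
# Operator priorities: multiplication/division bind tighter than +/-
PRECEDENCE = {"+": 1, "-": 1, "x": 2, "/": 2}


def infix_to_postfix(expression):
    """Convert a space-separated infix expression to postfix notation."""
    stack = []      # holds operators and left parentheses
    output = []     # the postfix expression being built
    for token in expression.split():
        if token in PRECEDENCE:
            # Pop operators of higher or equal priority, stopping at "("
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            # Move operators to the output until the matching "("
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()     # discard the left parenthesis
        else:
            # Operands go directly to the postfix expression
            output.append(token)
    # Move any remaining operators to the postfix expression
    while stack:
        output.append(stack.pop())
    return " ".join(output)


print(infix_to_postfix("2 + 3 x 5 + 4"))
print(infix_to_postfix("2 x ( 7 + 3 x 4 ) + 6"))
```

The sketch performs no validation, so an expression with unbalanced parentheses would raise an error rather than produce a result.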
As an example, Figure 6.3 illustrates the use of a stack to convert infix expression 2 + 3 x 5 + 4 into
postfix.
• 2 + 3 = 5 → 2 3 + = 5
• 2 x 5 + 3 = 13 → 2 5 x 3 + = 13
• 2 + 5 x 3 = 17 → 2 5 3 x + = 17
• 2 x 3 + 5 x 4 = 26 → 2 3 x 5 4 x + = 26
• 2 + 3 x 5 + 4 = 21 → 2 3 5 x + 4 + = 21
FIGURE 6.3 Infix expression remaining to be evaluated.
Figures 6.4 and 6.5 demonstrate a more complex case of an infix to postfix expression conversion
that includes operators in parentheses: 2 x (7 + 3 x 4) + 6.
The evaluation of a postfix expression utilizes the steps described in the algorithm below:
• Scan the postfix expression from left to right.
• If an operand is encountered, push it to the stack.
• If an operator is encountered, apply it to the top two operands of the stack and replace the
two operands with the result of the operation.
• After scanning the entire postfix expression, the stack should have one item, which is the
value of the expression.
Figure 6.6 illustrates how expression 1 6 + 5 2 – x is evaluated using a stack.
FIGURE 6.4 Infix to postfix with parenthesis – Part A.
FIGURE 6.5 Infix to postfix with parenthesis – Part B.
FIGURE 6.6 Evaluating a postfix expression.
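The evaluation steps described above can be sketched with a list used as the stack. This is a minimal illustration assuming space-separated tokens, numeric operands, the ASCII characters + - x / as operators, and a well-formed expression. The names OPERATIONS and evaluate_postfix are chosen for this sketch only.

```python
import operator

# Map each operator symbol to the function that applies it
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "x": operator.mul,
    "/": operator.truediv,
}


def evaluate_postfix(expression):
    """Evaluate a space-separated postfix expression and return its value."""
    stack = []
    for token in expression.split():
        if token in OPERATIONS:
            # The top item is the right operand, the one below it the left
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATIONS[token](left, right))
        else:
            stack.append(float(token))
    # A well-formed expression leaves exactly one item: its value
    return stack.pop()


print(evaluate_postfix("1 6 + 5 2 - x"))
print(evaluate_postfix("2 3 5 x + 4 +"))
```

The order in which the two operands are popped matters for subtraction and division, which is why the first item popped is treated as the right operand.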
6.5.3 Queue
A queue is also a linear structure in which items are
added at one end through a process called enqueue, but
removed from the other end through what is referred
to as dequeue. The two ends are called rear and front.
Unlike the stack, in a queue the items that are added
first are also removed first, hence it is also described as
a first in, first out (FIFO) structure. A queue is analogous
to people waiting in line to purchase a ticket or pay a
bill. The person first in line is the first one to be served.
The following is a visual illustration of the queue
structure:
Observation 6.17 – Queue: An ordered, linear list structure with two ends: rear and front. Items are enqueued at one end and dequeued at the other. The first enqueued item is also the first to be dequeued (FIFO). The operations performed on the queue are the following: initialize, enqueue, dequeue, isEmpty, peek, and size.
Figure 6.7 below illustrates the execution of a simple queue:
FIGURE 6.7 Execution of a simple queue.
In computer science, queues are used extensively to schedule tasks, such as printing or managing
CPU processes. When multiple users submit print jobs, the printer queues all the jobs and prints
them on a first-come-first-served basis. Similarly, when multiple processes need to use the CPU,
the order of execution is scheduled and performed through a queue structure.
The queue ADS consists of the following:
• An index that points to the front item of the queue.
• An index that points to the rear item of the queue.
• The body of the queue that stores its values (i.e., the actual data in the list).
• Initialize – init(q): A function that initializes the queue (i.e., creates the empty list).
• Empty – isEmpty(q): A function that checks whether the queue is empty.
• Enqueue – enqueue(x, q): A function that adds an item to the rear end of the queue.
• Dequeue – dequeue(x, q): A function that returns the item at the front end of the queue
and removes it from the queue.
• Front – peek(q): A function that returns the item at the front of the queue.
• Size – size(q): A function that returns the number of items in the queue.
The Python class provided below (filename: Chapter6Queue.py) is an implementation of the queue
ADS:
class Queue:
    # Initialize the queue
    def __init__(self):
        self.items = []

    # Check whether the queue is empty
    def isEmpty(self):
        return self.items == []

    # Add an item to the rear of the queue (index 0 of the list)
    def enqueue(self, item):
        self.items.insert(0, item)

    # Remove and return the item at the front of the queue
    def dequeue(self):
        if not self.isEmpty():
            return self.items.pop()

    # Return the item at the front of the queue without removing it
    def peek(self):
        if not self.isEmpty():
            return self.items[-1]

    # Return the number of items in the queue
    def size(self):
        return len(self.items)

    # Return the contents of the queue, from rear to front
    def show(self):
        return self.items
The following script (filename: Chapter6QueueExample) imports and runs a simple queue ADS:
import Chapter6Queue

q = Chapter6Queue.Queue()
print(q.isEmpty())
q.enqueue('Task A')
print(q.show())
q.enqueue('Task B')
print(q.show())
q.enqueue('Task C')
print(q.show())
print(q.dequeue())  # removes Task A
print(q.show())
print(q.dequeue())  # removes Task B
print(q.show())     # q has only one task left
print(q.size())
Output 6.5.3:
True
['Task A']
['Task B', 'Task A']
['Task C', 'Task B', 'Task A']
Task A
['Task C', 'Task B']
Task B
['Task C']
1
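In the Queue class above, enqueue calls insert(0, item), which shifts every existing element of the list by one position each time an item is added. As a minimal sketch of an alternative, the same interface can be built on collections.deque from the standard library, which adds and removes items at either end in constant time. The class name DequeQueue is an assumption made for this illustration, and here the rear of the queue is the right end of the deque.

```python
from collections import deque


class DequeQueue:
    """A FIFO queue with the same methods as Queue, backed by a deque."""

    def __init__(self):
        self.items = deque()

    def isEmpty(self):
        return len(self.items) == 0

    def enqueue(self, item):
        self.items.append(item)         # add at the rear (right end)

    def dequeue(self):
        if not self.isEmpty():
            return self.items.popleft()  # remove from the front (left end)

    def peek(self):
        if not self.isEmpty():
            return self.items[0]

    def size(self):
        return len(self.items)


q = DequeQueue()
for task in ("Task A", "Task B", "Task C"):
    q.enqueue(task)
print(q.dequeue())
print(q.peek())
print(q.size())
```

The first-in, first-out behavior is unchanged; only the cost of each enqueue differs for long queues.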
6.5.4 Circular Queue
A circular queue is essentially the same as a regular queue, but with two major differences. First, the size of the circular queue does not change. This size restriction can be viewed as the main weakness of the circular queue. Second, its front and rear are continuously moving in a circular form based on the demand for enqueue and dequeue, provided that there is available empty space and that they do not clash with each other (i.e., the front cannot be in the same list index as the rear). This is an important observation, as it is possible that the front item is stored before the rear one on the circular queue. Because of these qualitative differences, a circular queue ADS needs to check whether the queue is full before enqueuing a new item in it.
Observation 6.18 – Circular Queue: A structure similar to a queue with the difference that its size does not change and the front and rear are movable. This is based on the demand for enqueue and dequeue in a circular form, allowing for the front item to be stored before the rear.
Figure 6.8 provides an illustration of the circular queue operation.
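The circular movement of the indices relies on the modulo operator: an index that moves past the last slot wraps around to slot 0. The following is a minimal sketch of this idea, assuming a fixed-size list, an index to the oldest item, and an item counter used to detect a full queue. The class RingBuffer and its attribute names are hypothetical and differ from the front/rear convention used elsewhere in this section.

```python
class RingBuffer:
    """A fixed-size circular queue that tracks the oldest item and a count."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0       # index of the oldest stored item
        self.count = 0      # number of items currently stored

    def is_full(self):
        return self.count == len(self.slots)

    def enqueue(self, item):
        if self.is_full():
            return False
        # The next free slot, wrapping around the end of the list
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = item
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        item = self.slots[self.head]
        self.slots[self.head] = None
        # Advance the head, wrapping around the end of the list
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item


ring = RingBuffer(3)
for value in (10, 20, 30):
    ring.enqueue(value)
print(ring.enqueue(40))     # the queue is full
print(ring.dequeue())
print(ring.enqueue(40))     # reuses the slot that was just freed
print(ring.slots)
```

After the last call the newest item occupies slot 0 while the oldest item is in slot 1, which shows how one end of a circular queue can be stored before the other.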
The following script (filename: Chapter6CircularQueue) imports and runs an implementation
of the queue ADS:
Data Structures and Algorithms
FIGURE 6.8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
251
Example of circular queue.
class CircularQueue():
# Initialize the circular queue to the preferred size
# with all its items empty and the front and rear starting at -1
def __init__(self, maxSize):
self.cqSize = maxSize
self.queue = [None] * self.cqSize
self.front = self.rear = -1
# Insert an item into the circular queue
def enqueue(self, data):
# Insert the first item to the queue, start the front and rear
if (self.front == -1):
self.front = self.rear = 0
self.queue[self.rear] = self.queue[self.front] = data
# Insert items to the queue
else:
# Only be concerned with the front item; use % and the size
# of the queue to move the front in a circular manner
self.front = (self.front + 1) % self.cqSize
self.queue[self.front] = data
print("Queue size: ", self.cqSize, "Queue front: ", self.front,
"Queue rear: ", self.rear)
# Delete an item from the circular queue
252
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
Handbook of Computer Programming with Python
    def dequeue(self):
        if (self.front == -1):
            print("The circular queue is empty\n")
        # If the front item is the same as the rear the queue has only
        # one item; empty the queue
        elif (self.front == self.rear):
            self.front = self.rear = -1
        else:
            # Only be concerned with the rear item; use % and the size
            # of the queue to move the rear in a circular form
            self.queue[self.rear] = None
            self.rear = (self.rear + 1) % self.cqSize
        print("Queue size: ", self.cqSize, "Queue front: ", self.front,
              "Queue rear: ", self.rear)

    # The printCQueue will display the contents of the circular queue
    def printCQueue(self):
        # If the rear value is -1 the circular queue is still empty
        if (self.rear == -1):
            print("No element in the circular queue")
        # If front index is larger than rear then queue is still valid
        elif (self.front >= self.rear):
            for i in range(self.rear, self.front + 1):
                print(self.queue[i], end = " ")
        # If front less than rear, queue has completed a circle; the
        # items are displayed in list-index order, not insertion order
        else:
            for i in range(self.front + 1):
                print(self.queue[i], end = " ")
            for i in range(self.rear, self.cqSize):
                print(self.queue[i], end = " ")
        print()

    # Check whether the circular queue is full
    def isFull(self):
        return ((self.front + 1) % self.cqSize == self.rear)

# Ask the user for the preferred size for the circular queue
maxSize = int(input("Enter the size of the circular queue:"))
cq = CircularQueue(maxSize)
# Keep working on the circular queue until input is not E or D
while (True):
    # Ask the user for the next move, enqueue or dequeue
    choice = input("(E)nqueue or (D)equeue or (Q)uit?")
    if (choice == "E"):
        if (cq.isFull() != True):
            newItem = int(input("Enter the next item of the circular queue:"))
            cq.enqueue(newItem)
        else:
            print("The queue is full. Cannot insert a new item")
    elif (choice == "D"):
        cq.dequeue()
    else:
        break
    print("The updated Queue is: ", end = " ")
    cq.printCQueue()
Output 6.5.4:
Enter the size of the circular queue:3
(E)nqueue or (D)equeue or (Q)uit?E
Enter the next item of the circular queue:10
Queue size: 3 Queue front: 0 Queue rear: 0
The updated Queue is: 10
(E)nqueue or (D)equeue or (Q)uit?E
Enter the next item of the circular queue:20
Queue size: 3 Queue front: 1 Queue rear: 0
The updated Queue is: 10 20
(E)nqueue or (D)equeue or (Q)uit?E
Enter the next item of the circular queue:30
Queue size: 3 Queue front: 2 Queue rear: 0
The updated Queue is: 10 20 30
(E)nqueue or (D)equeue or (Q)uit?E
The queue is full. Cannot insert a new item
The updated Queue is: 10 20 30
(E)nqueue or (D)equeue or (Q)uit?D
Queue size: 3 Queue front: 2 Queue rear: 1
The updated Queue is: 20 30
(E)nqueue or (D)equeue or (Q)uit?D
Queue size: 3 Queue front: 2 Queue rear: 2
The updated Queue is: 30
(E)nqueue or (D)equeue or (Q)uit?E
Enter the next item of the circular queue:40
Queue size: 3 Queue front: 0 Queue rear: 2
The updated Queue is: 40 30
(E)nqueue or (D)equeue or (Q)uit?
6.6 DYNAMIC DATA STRUCTURES
The data structures described in the previous sections are characterized as static, since they all use
inherently static list structures. To some extent, issues like restrictions associated with the requirement for large amounts of memory, generally weak performance due to the heavy nature of the
tasks, and a certain inflexibility, can be traced in all of these structures. The previously discussed
cases have demonstrated that the execution of even the most advanced algorithms tends to become
impractical as the size of the structures increases. In order to address this issue, there is a need for
more effective data structures that allocate the available computer memory only as and when necessary, and in the most efficient way possible. Structures that fall under this category are collectively
known as dynamic data structures. Some of the most important of these structures are introduced
and briefly discussed in the following sections.
6.6.1 Linked Lists
A linked list is a collection of nodes linked to each other through pointers. The structure is recursive by definition. Each node includes a data value and a pointer to the first node of a subsequent linked list, or to null if the latter is empty. In order to navigate a linked list, it is necessary to create a separate object, called head, that always points to the first node of the list. Subsequent nodes are accessed via the associated pointers, stored in each node. If the list is empty, the head will simply point to a null value. In a similar fashion, the link pointer of the last node is set to null to mark the end of the list. There is only one head, and it is always pointing to the first node of the linked list. Similarly, there is only one tail (i.e., the last node), pointing to null. All other nodes are called intermediate nodes and have both a predecessor and a successor. Traversing (i.e., moving through) intermediate nodes towards the tail starts at the first node of the list, pointed to by the head. For this purpose, it is best to create another object, usually called current, that is used to move between the intermediate nodes in the list.
Observation 6.19 – Linked List: A structure of connected nodes. Each node contains a data value and a pointer to the first node of the subsequent list. A head pointer is always pointing to the first node. The last node points to null. The rest of the nodes are defined as intermediate.
The strength of the linked list is that its data are stored dynamically, with new nodes created only
if and when necessary, and unwanted nodes deleted if they are not in use. Separately from the data,
the pointer of every newly created node is set to point to null. Nodes can store any data type, but all
nodes of a linked list need to store the same data type.
Figure 6.9 illustrates the structure of a linked list. Notice how the head points to the first node
and that the last node points to null:
FIGURE 6.9 Linked list.
The implementation of a linked list requires two classes. The first is the node class that contains a data value and a pointer to the next item. For any new node that is created, next will point to null. The second is the linked list itself, which contains the head pointer to the first item in the list and the current_node that is used to move through the list. Both the head and the current_node will initially point to null since there are no items in the list.
The linked list ADS (Abstract Data Structure) includes the following operations:
• Instantiating & initializing the list: This function is used to create the head and the current object that initially point to null (i.e., the empty list; Figure 6.10). The Python code for
this function is the following:
    def __init__(self):
        self.head = self.current_node = None
FIGURE 6.10 New linked list.
• Checking if the list is empty: This function checks whether the linked list is empty, in
which case no more nodes can be deleted and any newly inserted node must be the first in
the list. The Python code is the following:
    def isEmpty(self):
        # The list is empty when the head points to null
        return self.head is None
• Reading and printing the list: It is often useful to print the nodes of the list and provide
information about its size (i.e., the number of nodes it contains). In order to do this, it is
necessary to traverse (i.e., read through) the list starting at the first node. While the current_node value is not null, current node values are read/printed successively as the list is
traversed. Figure 6.11 illustrates this process diagrammatically. The related Python code
is presented below:
    def readList(self):
        count = 0
        current_node = self.head
        print("The current list is: ", end = " ")
        while (current_node):
            count += 1
            print(current_node.data, " ", end = "")
            current_node = current_node.next
        print("\nThe size of the linked list is: ", count)
• Inserting a new node in the list: A new node can be either inserted as a first element
when the list is empty or as the last element appended to the list. In the former case, a new
node is created (including the associated data) and its next element is set to point to null.
Finally, the head is set to point to the new node (Figure 6.12). In the case of appending a
new element to the list, after the new node is created, the list is traversed until the last node
is reached. Once this is done, the next element of the last node is set to point to the newly
created node (Figure 6.13). The related Python code is presented below:
    def append(self, data):
        # Create the newNode to append the linked list
        newNode = Node(data)
        # Case 1: List is empty
        if (self.head == None):
            self.head = newNode
            return
        # Case 2: If the list is not empty start the
        # current node at the head of the list
        current_node = self.head
        # Loop through the linked list until the current node
        # has next pointing to None
        while (current_node.next):
            current_node = current_node.next
        # Add new node to the end of the list
        current_node.next = newNode
• Deleting a node: This operation starts by checking if the linked list is empty. If not, it
searches for the data that must be deleted. If the data are not found, the list remains as is.
If the data are found, the node they belong to is deleted and the list is updated accordingly.
There are two cases to consider in relation to this process. The first case is that the node to
be deleted is the first one in the list. In this case, the process simply involves the allocation
of the head to the next node, and the assignment of the pointer that points to the deleted
node to null. The second case is that the node to be deleted is not the first one in the list. In
this case, it is necessary to also find the nodes before and after the deleted one, and keep references to them. With this information at hand, the next pointer of the node preceding the
deleted one is made to point to the node succeeding it. Finally, the pointers of the deleted
node are removed. Figure 6.14 illustrates this process diagrammatically.
FIGURE 6.11 Traversing the linked list.
FIGURE 6.12 Inserting the first node.
FIGURE 6.13 Appending a node to the list.
FIGURE 6.14 Deleting a node from a linked list.
The following Python script demonstrates the deletion process:
    def delete(self, data):
        if (self.isEmpty()):
            print("There is no node available to delete. "
                  "The linked list is empty.")
        else:
            current_node = self.head
            # Case 1: If the node to be deleted is the first node
            if (current_node and current_node.data == data):
                # Set the head of the list to the next item
                self.head = current_node.next
                # Set the current item's pointer to null
                current_node.next = None
                return
            # Case 2: Keep track of the previous node while searching
            # for the node to be deleted
            previous_node = None
            while (current_node and current_node.data != data):
                previous_node = current_node
                current_node = current_node.next
            # Check if the node was found
            if (current_node is None):
                return
            # Link the preceding node to the succeeding one
            previous_node.next = current_node.next
            current_node = None
• Destroying the list: Since building a linked list involves the dynamic allocation of memory in the form of pointers, it is advisable that before the underlying application stops,
any pointers and memory allocated during its lifecycle are freed and released back to the
system. The following Python code demonstrates a possible implementation of this task:
    def destroyList(self):
        temp = self.head
        if (temp is None):
            print("\n The linked list is deleted")
        while (temp):
            self.head = temp.next
            temp = None
            temp = self.head
        self.readList()
The reader can merge the above functions and commands as in the code example provided below
(the code is arranged into two classes, stored in file Chapter6LinkedList.py):
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

class LinkedList:
    def __init__(self):
        ...
    def append(self, data):
        ...
    def delete(self, data):
        ...
    def destroyList(self):
        ...
    def readList(self):
        ...
    def isEmpty(self):
        ...
The following script (filename: Chapter6LinkedListExample) uses the classes discussed above:
import Chapter6LinkedList

ll = Chapter6LinkedList.LinkedList()
while (True):
    print("[A]: Append a new node")
    print("[D]: Delete a particular node")
    print("[Q]: Clear all list and exit")
    print("[P]: Print the current list")
    choice = input("Enter your choice: ")
    if (choice == "A"):
        newNode = int(input("Enter the new node value to append the list: "))
        ll.append(newNode)
    elif (choice == "D"):
        deleteNode = int(input("Enter the node to delete: "))
        ll.delete(deleteNode)
    elif (choice == "P"):
        ll.readList()
    else:
        ll.destroyList()
        break
Output 6.6.1:
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: A
Enter the new node value to append the list: 5
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: A
Enter the new node value to append the list: 3
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: A
Enter the new node value to append the list: 7
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: P
The current list is: 5 3 7
The size of the linked list is: 3
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: D
Enter the node to delete: 3
[A]: Append a new node
[D]: Delete a particular node
[Q]: Clear all list and exit
[P]: Print the current list
Enter your choice: P
The current list is: 5 7
The size of the linked list is: 2
In addition to the operations discussed above, the effectiveness of the linked list could be also
improved by:
• Inserting a new node before/after an existing node based on its data.
• Searching for a node using key data, and retrieving the data and the positional index of
the node.
• Modifying the data of a particular node within the list.
• Sorting the linked list.
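As an illustration of the first three improvements listed above, the following is a minimal sketch written as stand-alone functions that receive the head of a list. It is not part of the Chapter6LinkedList file; the function names build, to_list, search, insert_after and modify are hypothetical, and the sketch assumes that the first node matching the key is the one of interest.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


def build(values):
    """Build a singly linked list from an iterable and return its head."""
    head = tail = None
    for value in values:
        node = Node(value)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head


def to_list(head):
    """Collect the node data into a Python list (for display and testing)."""
    out = []
    while head:
        out.append(head.data)
        head = head.next
    return out


def search(head, key):
    """Return (position, node) of the first node holding key, or (-1, None)."""
    position = 0
    current = head
    while current:
        if current.data == key:
            return position, current
        current = current.next
        position += 1
    return -1, None


def insert_after(head, key, data):
    """Insert a new node after the first node holding key; True on success."""
    _, node = search(head, key)
    if node is None:
        return False
    new_node = Node(data)
    new_node.next = node.next
    node.next = new_node
    return True


def modify(head, key, new_data):
    """Replace the data of the first node holding key; True on success."""
    _, node = search(head, key)
    if node is None:
        return False
    node.data = new_data
    return True


head = build([5, 3, 7])
insert_after(head, 3, 4)      # list becomes 5 3 4 7
modify(head, 7, 9)            # list becomes 5 3 4 9
print(to_list(head))          # [5, 3, 4, 9]
print(search(head, 4)[0])     # 2
```

Note that the position returned by search is only valid until the next insertion or deletion, since the list keeps no permanent indices.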
Some key points when implementing linked lists or related structures are summarized in the list
below:
• To access the nth node of a linked list, it is necessary to pass through the first n−1 nodes.
• If nodes are added at a particular position instead of just being appended, the insertion will
result in a node index change.
• Deletion of nodes will result in a node index change.
• Trying to store the node indices in a linked list is of no use, since they are constantly
changing (indeed, there are no actual indices in such a list).
• To append a node, one has to traverse the whole list and reach the last node.
• In addition to the head and current_node pointers, adding a tail pointer to the last node of
the list makes appending easier and more efficient.
• To delete the last node, one has to traverse the whole list and locate the last two nodes.
• If for any reason the head pointer is lost, the linked list cannot be read and retrieved.
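The point about the tail pointer can be illustrated with a short sketch. It assumes a hypothetical class named TailedLinkedList that keeps a reference to the last node, so that appending does not require traversing the list; it is a simplified variant and not the LinkedList class presented earlier.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class TailedLinkedList:
    """Singly linked list that also remembers its last node (sketch)."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.size = 0

    def append(self, data):
        node = Node(data)
        if self.head is None:
            # First node: it is both the head and the tail
            self.head = self.tail = node
        else:
            # Link the new node after the current tail; no traversal needed
            self.tail.next = node
            self.tail = node
        self.size += 1

    def to_list(self):
        out = []
        current = self.head
        while current:
            out.append(current.data)
            current = current.next
        return out


tl = TailedLinkedList()
for value in (5, 3, 7):
    tl.append(value)
print(tl.to_list(), tl.tail.data, tl.size)   # [5, 3, 7] 7 3
```

The trade-off is that every operation that may change the last node, such as deletion, must also keep the tail pointer up to date.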
A particular variation of the linked list is the circular linked list, in which the last node is linked to
the first. It is used when the node next to the last corresponds to the first one, such as in the cases
of the weekdays or the ring network topology. The advantage of the circular linked list is that it can
be traversed starting at any node and is able to reach the node it has started with again in a circular
manner. Figure 6.15 provides an illustration of a simple circular linked list.
FIGURE 6.15 A circular linked list.
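The weekdays case mentioned above can be sketched as follows. This is a minimal illustration under the assumption that the ring is built once and never modified; the function names build_circular and walk_from are hypothetical.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


def build_circular(values):
    """Build a circular linked list and return its first node."""
    head = tail = None
    for value in values:
        node = Node(value)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    if tail is not None:
        tail.next = head      # the last node is linked back to the first
    return head


def walk_from(start):
    """Visit every node once, starting at any node of the ring."""
    out = []
    if start is None:
        return out
    current = start
    while True:
        out.append(current.data)
        current = current.next
        if current is start:  # back at the starting node: stop
            break
    return out


days = build_circular(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
friday = days
while friday.data != "Fri":
    friday = friday.next
print(walk_from(friday))
# ['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu']
```

Since no node points to null, the traversal must stop when it meets the starting node again; testing for null, as in the singly linked list, would loop forever.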
6.6.2 Binary Trees
The previous section focused on the singly linked list, in which the pointer of each node points to the next node. The main problem with this type of linked list is that it does not offer direct access to the previous node. This can make the process of deleting nodes from the list rather complicated. Doubly linked lists can address this problem. As the name implies, the main difference between singly and doubly linked lists is that each node of the latter contains two pointers instead of one, with the additional pointer pointing to the previous node. Despite the obvious functional advantage of this additional pointer, it tends to make operations more complicated and causes additional overhead, as an extra pointer is added to every node. Figure 6.16 provides an illustration of the inner structure of a doubly linked list node and an example of the connections in a three-node doubly linked list.
Observation 6.20 – Doubly Linked List: A structure similar to a singly linked list, but containing two pointers pointing to both the next and previous nodes instead of just one (next).
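A minimal sketch of a doubly linked list is given below to make the role of the additional pointer concrete. The class names DNode and DoublyLinkedList, as well as the method names, are hypothetical. The sketch shows that a node can be unlinked directly through its prev and next pointers, without searching for its predecessor.

```python
class DNode:
    def __init__(self, data):
        self.data = data
        self.prev = None   # pointer to the previous node
        self.next = None   # pointer to the next node


class DoublyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, data):
        node = DNode(data)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        return node

    def remove(self, node):
        """Unlink a known node without traversing the list."""
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next      # the node was the first one
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev      # the node was the last one
        node.prev = node.next = None

    def forward(self):
        out, current = [], self.head
        while current:
            out.append(current.data)
            current = current.next
        return out

    def backward(self):
        out, current = [], self.tail
        while current:
            out.append(current.data)
            current = current.prev
        return out


dll = DoublyLinkedList()
first, second, third = dll.append(1), dll.append(2), dll.append(3)
dll.remove(second)
print(dll.forward(), dll.backward())   # [1, 3] [3, 1]
```

The extra bookkeeping in append and remove is the overhead referred to in the text: every change has to update two pointers per affected node instead of one.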
Among the most important structures built from nodes with two pointers is the binary tree (Figure 6.17), a rooted tree in which every node has at most two children (i.e., degree 2). Its recursive definition declares that a binary tree is either an external node (leaf) or an internal node (root/parent) with up to two sub-trees (a left subtree and a right subtree). In simple terms, the root has one or two child nodes but no parent, a leaf has a parent node but no children, and every node is an element that contains data. The number of levels in the tree is defined as its depth.
FIGURE 6.16 A node of a doubly linked list.
FIGURE 6.17 Binary trees.
FIGURE 6.18 Decision trees.
Example 1 in Figure 6.17 shows an unfinished binary
tree with degree 2 and a depth of three levels. The tree
has 76 as its root, 26 and 85 as children nodes, and 27,
24, and 18 as leaf nodes. Example 2 shows a completely
unbalanced binary tree and Example 3 a mixed case.
Binary trees are commonly used in decision tree
structures (Figure 6.18), although this may often go
unnoticed.
Observation 6.21 – Binary Tree: A
rooted tree in which every node is
either an external node (leaf) or an
internal node (root/parent), with up
to two sub-trees (a left subtree and a
right subtree).
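The recursive definition translates almost directly into code. The sketch below uses a hypothetical TreeNode class and computes the depth and the leaves of a small tree. The values are borrowed from Example 1 of Figure 6.17, but the exact placement of the leaves under their parents is an assumption, since it cannot be recovered from the text alone.

```python
class TreeNode:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def depth(node):
    """Number of levels in the tree rooted at node (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def leaves(node):
    """Data of all external nodes, from left to right."""
    if node is None:
        return []
    if node.left is None and node.right is None:
        return [node.data]
    return leaves(node.left) + leaves(node.right)


# Assumed arrangement: 76 is the root, 26 and 85 are its children
root = TreeNode(76,
                TreeNode(26, TreeNode(27), TreeNode(24)),
                TreeNode(85, TreeNode(18)))
print(depth(root))    # 3
print(leaves(root))   # [27, 24, 18]
```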
6.6.3 Binary Search Tree
A particular type of a binary tree is the binary search
tree. Its definition is the same as that of the regular
binary tree, but with the following additional properties:
• All elements rooted at the right child of a node
have higher values than that of the parent node.
• All elements rooted at the left child of a node have
lower values than that of the parent node.
Observation 6.22 – Binary Search
Tree: A structure based on a binary
tree with the difference that all elements rooted at the right child of a
node are greater and those rooted at
its left child lower than the value of
the parent node.
In the example provided in Figure 6.19 the reader would notice that every node on the left subtree
of the root has a lower value than 43, while every node on the right subtree has a higher value. The
reader should also notice that this is recursively applied to the internal nodes too (e.g., as in the case
of node with value 56). This could be potentially reversed by having the smaller values on the right
and the larger on the left subtrees respectively, but the logic of the binary tree structure remains
the same.
There are three systematic ways to visit all the nodes of a binary search tree: preorder, inorder,
and postorder. If the left subtree contains values that are lower than the root node, all three of these
will traverse the left subtree before the right subtree. Their only difference lies in when the root
node is visited and read (Table 6.12).
FIGURE 6.19 Binary search tree.
TABLE 6.12 Searching a Node in a Binary Search Tree
Traversal | Step 1 | Step 2 | Step 3 | Resulting list
Inorder | Traverse the left subtree. | Visit/read the root node. | Traverse the right subtree. | 20, 28, 31, 33, 40, 43, 47, 56, 59, 64, 89
Preorder | Visit/read the root node. | Traverse the left subtree. | Traverse the right subtree. | 43, 31, 20, 28, 40, 33, 64, 56, 47, 59, 89
Postorder | Traverse the left subtree. | Traverse the right subtree. | Visit/read the root node. | 28, 20, 33, 40, 31, 47, 59, 56, 89, 64, 43
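The three traversals of Table 6.12 can be reproduced with a few lines of code. The following sketch is self-contained and uses hypothetical names (BSTNode, insert, inorder, preorder, postorder). It builds the tree by inserting the values 43, 31, 64, 56, 20, 40, 59, 28, 33, 47, 89 in that order, with lower values placed on the left, and returns each traversal as a list instead of printing it.

```python
class BSTNode:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.data = key


def insert(root, key):
    """Insert key into the BST; lower values go left, others go right."""
    if root is None:
        return BSTNode(key)
    if root.data <= key:
        root.right = insert(root.right, key)
    else:
        root.left = insert(root.left, key)
    return root


def inorder(root):
    # Left subtree, then the root, then the right subtree
    if root is None:
        return []
    return inorder(root.left) + [root.data] + inorder(root.right)


def preorder(root):
    # The root, then the left subtree, then the right subtree
    if root is None:
        return []
    return [root.data] + preorder(root.left) + preorder(root.right)


def postorder(root):
    # Left subtree, then the right subtree, then the root
    if root is None:
        return []
    return postorder(root.left) + postorder(root.right) + [root.data]


tree = None
for key in [43, 31, 64, 56, 20, 40, 59, 28, 33, 47, 89]:
    tree = insert(tree, key)

print(inorder(tree))    # [20, 28, 31, 33, 40, 43, 47, 56, 59, 64, 89]
print(preorder(tree))   # [43, 31, 20, 28, 40, 33, 64, 56, 47, 59, 89]
print(postorder(tree))  # [28, 20, 33, 40, 31, 47, 59, 56, 89, 64, 43]
```

The inorder traversal of a BST always yields the stored values in ascending order, which is a convenient way to check that a tree has been built correctly.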
In its most basic form, the binary search tree ADS includes the following operations:
• Instantiating & initializing the Binary Search Tree (BST): This function is used to create each new node in the BST, allocating the necessary memory and initializing its pointers to both the left and right subtrees to null. Figure 6.20 provides a visual representation
of the new node and the following code excerpt illustrates its implementation:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.data = key
FIGURE 6.20 New node for the BST.
FIGURE 6.21 Traversing the BST inorder.
• Inorder traversal of the BST: The inorder function, one of the most well-known functions associated with dynamic data structures, happens to be also among the easiest ones.
The following Python code and Figure 6.21 illustrate its operation:
    def traverseInorderBST(root):
        # If the current node is not empty, traverse its left
        # subtree, print its data, and then traverse its
        # right subtree
        if (root):
            traverseInorderBST(root.left)
            print(root, root.data)
            traverseInorderBST(root.right)
• Inserting a new node to the BST: The goal of this function is to place the newly imported
data to the desired place in the BST. When the BST is empty, the new node simply initializes it. In all other cases, the function recursively checks whether the data value in the new
node is lower, equal to, or higher than the data in the current node, and keeps on moving to
the respective subtree accordingly until the current node is empty. At that point, it finally
assigns the new node. Figure 6.22 illustrates this process by inserting nodes from the following list to a BST: 43, 31, 64, 56, 20, 40, 59, 28, 33, 47, 89. The Python code for this
function is the following:
    def insert(root, key):
        # If there is no BST create its first node
        if (root is None):
            return BinarySearchTree(key)
        else:
            # If the current node's data is less than or equal
            # to the new key, move into the right subtree;
            # otherwise, move into the left subtree recursively
            if (root.data <= key):
                root.right = insert(root.right, key)
            else:
                root.left = insert(root.left, key)
        return root
FIGURE 6.22 Inserting nodes to the BST.
• Searching for a key value in the BST: This function searches the BST for a key value provided by the user. As with the previous functions, it recursively calls itself on either the left or right subtree in an effort to find a match for the key value. If the key value is not found once the relevant branch has been followed to its end, an empty BST (None) is returned. The calling function must check for this result before using it; otherwise, accessing the data of the empty BST raises an error and crashes the application. Figure 6.23 illustrates both a case where the key is being found and one where it is not. The following Python code provides an implementation of this function:
    def search(root, key):
        # If the current subtree is empty, the key is not in the BST;
        # return the empty BST
        if (root is None):
            return None
        # Recursively visit the left and right subtrees to find
        # the node that matches the key searched for
        if (root.data == key):
            return root
        if (root.data < key):
            return search(root.right, key)
        else:
            return search(root.left, key)
FIGURE 6.23 Data search in a BST.
• Deleting a node from the BST: Arguably, this is the most complex function in the BST ADS. If the current root is empty, which may be because the key was not found, there is nothing to be done and the current BST is returned as is. In any other case, the key is found in the current node, or in its left or right subtree. If the key is found in the current node and the right subtree is empty, the function replaces the current node with its left subtree. Accordingly, if the left subtree is empty, the current node is replaced with its right subtree. If neither subtree is empty, the function finds the minimum data value in the right subtree, copies it into the current node, and then deletes the node holding that minimum value from the right subtree. If the key is not found in the current node, the function is called recursively on the left or the right subtree, depending on whether the key value is lower or higher than the current node data. Figure 6.24 illustrates this process and the related Python script is provided below:
    def delete_Node(root, key):
        """ If the root is empty, return it; if not, if the key is
        larger than the current root, find it in the right subtree;
        otherwise, if it is smaller, find it in the left subtree.
        If the key is matched, delete the current root """
        if (root == None):
            return root
        elif (root.data > key):
            root.left = delete_Node(root.left, key)
        elif (root.data < key):
            root.right = delete_Node(root.right, key)
        # If the key is matched, then, if there is no right
        # subtree just replace the current node with the left
        # subtree; similarly, if there is no left subtree just
        # replace the current node with the right subtree
        elif (root.data == key):
            if (root.right == None):
                return root.left
            if (root.left == None):
                return root.right
            # If neither the left nor the right subtree is empty,
            # replace the data in the current node with the minimum
            # data in the right subtree and delete the node with
            # that minimum data from the right subtree
            temp = root.right
            mini_data = temp.data
            while (temp.left):
                temp = temp.left
                mini_data = temp.data
            root.data = mini_data
            root.right = delete_Node(root.right, root.data)
        return root
FIGURE 6.24 Deleting a node from a BST.
• Destroying the BST: As with most structures occupying computer memory space, it is
advisable that the BST is deleted (i.e., destroyed) when exiting the application. The following Python code excerpt provides a possible implementation of this task:
    def destroyBST(root):
        if (root):
            destroyBST(root.left)
            destroyBST(root.right)
            print("Node destroyed before exiting: ", root, root.data)
            root = None
Finally, it must be noted that the performance of the BST in terms of searching, inserting, or deleting depends on how balanced it is. In the case of well-balanced BSTs, the performance is O(log n), while in extremely unbalanced cases it can degrade to O(n).
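The effect of balance can be observed by comparing the height of two BSTs that hold the same 15 keys but were built in a different insertion order. The sketch below is illustrative only; the names BSTNode, insert, height and build are hypothetical, and the insertion rule (lower values on the left) follows the one used in this section.

```python
class BSTNode:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.data = key


def insert(root, key):
    if root is None:
        return BSTNode(key)
    if root.data <= key:
        root.right = insert(root.right, key)
    else:
        root.left = insert(root.left, key)
    return root


def height(root):
    """Number of levels that a search may have to visit in the worst case."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


# Keys inserted in ascending order: every node only has a right child
sorted_keys = list(range(1, 16))
# The same keys inserted level by level: a perfectly balanced tree
balanced_order = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]

print(height(build(sorted_keys)))      # 15
print(height(build(balanced_order)))   # 4
```

With 15 keys, the unbalanced tree behaves like a linked list (15 levels), whereas the balanced one needs only 4 levels, which corresponds to the difference between O(n) and O(log n).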
6.6.4 Graphs
A graph is a non-linear data structure consisting of
nodes, also called vertices, which may or may not be
connected to other nodes. The line or path connecting
two nodes is called an edge. If edges have particular flow
directions, the graph is said to be directed. Graphs with
no directional edges are referred to as undirected graphs
(Figure 6.25).
A directed graph consists of a set of vertices and a set of arcs. The vertices are also called nodes or points. An arc is an ordered pair of vertices (V, W); V is called the tail and W is called the head of the arc. The arc (V, W) is often expressed as V → W (Figure 6.26).
FIGURE 6.25 An undirected graph.
Observation 6.23 – Graph: A non-linear structure of nodes/vertices interconnected through edges. Edges may have a particular direction (directed graphs) or not (undirected graphs). Graphs can be presented as static adjacency matrices or as dynamic adjacency lists.
FIGURE 6.26 Arc (V, W).
A path in a directed graph can be described as a sequence of vertices V1, V2, …, Vn such that V1 → V2, V2 → V3, …, Vn−1 → Vn are arcs. In this case, the path starts at vertex V1, passes through vertices V2, V3, …, Vn−1, and ends at vertex Vn. The length of the path is the number of arcs on the path, in this particular case n−1. A path is simple if all vertices, except possibly the first and last, are distinct. A simple cycle is a simple path of a length of at least one that begins and ends at the same vertex. A labeled graph is one in which each arc and/or vertex can have an associated label that carries some kind of information (e.g., a name, cost, or other values associated with the arc/vertex).
There are two ways to represent a directed graph: as a static adjacency matrix or as a dynamic adjacency list. The prefix static refers to the use of a static structure (i.e., a list), whereas the prefix dynamic refers to the use of a dynamic structure in the form of a linked list. In the case of the former, assuming that V = {1, 2, …, N}, the adjacency matrix of G is an N×N matrix A of booleans, where A[i, j] is true if and only if there is an arc from vertex i to vertex j. An extension of this scheme is what is called a labeled adjacency matrix, where A[i, j] is the label of the arc going from vertex i to vertex j; if there is no arc from i to j, there is no label to store, so the entry must hold a placeholder that cannot be mistaken for a valid label. The main disadvantage of the adjacency matrix is that it requires storage in the region of O(N^2), regardless of how many arcs the graph actually has. In contrast, the adjacency list keeps, for every vertex i, a linked list of the vertices adjacent to i; the whole structure is dynamic and, therefore, can have its memory size increased or decreased on demand.
Figure 6.27 presents examples of an adjacency matrix and an adjacency list.
An undirected graph consists of a set of vertices and a set of edges. As in the case of the directed graph, the vertices are also called nodes or points. Its main difference from a directed graph is that edges are unordered pairs of vertices, implying that (V, W) = (W, V).
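The relationship between the two representations can be sketched in a few lines. The example below converts a small adjacency matrix into an adjacency list and checks whether the matrix describes an undirected graph, which is the case when it is symmetric. The function names and the sample matrices are hypothetical, and a Python dictionary of lists is assumed as a stand-in for the linked lists of the dynamic representation.

```python
def matrix_to_adjacency_list(matrix):
    """Map every vertex i to the list of vertices j with an arc i -> j."""
    return {i: [j for j, connected in enumerate(row) if connected]
            for i, row in enumerate(matrix)}


def is_undirected(matrix):
    """An undirected graph has (V, W) = (W, V), i.e., a symmetric matrix."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(n))


# A sample undirected graph with four vertices (0 to 3)
undirected = [[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]]
# A sample directed graph with a single arc 0 -> 1
directed = [[0, 1],
            [0, 0]]

print(matrix_to_adjacency_list(undirected))
# {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(is_undirected(undirected), is_undirected(directed))   # True False
```

The matrix always stores N×N entries, whereas the adjacency list stores one entry per arc, which is why the latter tends to be preferred for graphs with few connections.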
The applications of graphs, both directed and undirected, are numerous. Examples include, but
are not limited to, the airlines industry, the logistics and freight industries, or the various GPS and
navigation systems. In all these cases, the solution to most of their operational problems is a form of
the famous shortest path algorithm. The idea behind this algorithm is pretty simple.
FIGURE 6.27 Adjacency matrix vs. adjacency list.
• A directed graph G = (V, E) is drawn, in which each arc has a non-negative label and a
vertex is specified as the source.
• The cost of the shortest path from the source to every other vertex in V is calculated, where the length of a path is the sum of the labels of its arcs.
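The single-source shortest path problem stated in the two points above is classically solved with Dijkstra's algorithm. The following is a minimal sketch of that algorithm, not code from this chapter: it assumes a labeled adjacency list stored as a dictionary that maps each vertex to a list of (neighbor, cost) pairs, and it uses the heapq module of the Python standard library as a priority queue. The sample graph and the function name dijkstra are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Return the cost of the shortest path from source to every vertex."""
    dist = {vertex: float("inf") for vertex in graph}
    dist[source] = 0
    heap = [(0, source)]              # (cost so far, vertex)
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > dist[u]:
            continue                  # outdated entry; a cheaper path is known
        for v, label in graph[u]:
            new_cost = cost + label
            if new_cost < dist[v]:
                dist[v] = new_cost
                heapq.heappush(heap, (new_cost, v))
    return dist


# Sample directed graph with non-negative labels
graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 5)],
    "D": [],
}
print(dijkstra(graph, "A"))   # {'A': 0, 'B': 3, 'C': 1, 'D': 4}
```

The algorithm is greedy in the sense that it always settles the closest unsettled vertex next, which is only valid because all labels are non-negative.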
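Dijkstra’s famous greedy algorithm provides the solution to this problem. A different graph problem, used in the remainder of this section, is the Eulerian path: a route that uses every connection of the graph exactly once. It should not be confused with Dijkstra’s algorithm, since it does not minimize any cost. An Eulerian path that starts and ends at the same vertex can be found with the following steps: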
• Step 1: Determine if the solution is feasible, which is true only if every vertex is connected
to an even number of other vertices.
• Step 2: Start with the source vertex and move to the first next available vertex in the adjacency matrix (or adjacency list).
• Step 3: Print/store the identified vertex and delete the connection just used from the adjacency matrix (or adjacency list).
• Step 4: Repeat Steps 2 and 3 until there are no more connections to use.
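The four steps above can be sketched compactly with a stack, a formulation usually attributed to Hierholzer. This is an illustrative alternative, not the script developed in the next section. It assumes an undirected graph given as a symmetric adjacency matrix of 0s and 1s, with no self-loops and with all connections reachable from the source; the function names are hypothetical.

```python
def has_eulerian_circuit(matrix):
    """Step 1: every vertex must have an even number of connections.
    Assumes that the graph is connected and has no self-loops."""
    return all(sum(row) % 2 == 0 for row in matrix)


def eulerian_circuit(matrix, source=0):
    """Return a closed route that uses every connection exactly once,
    or None if Step 1 fails."""
    if not has_eulerian_circuit(matrix):
        return None
    remaining = [row[:] for row in matrix]   # work on a copy of the matrix
    n = len(remaining)
    stack = [source]
    circuit = []
    while stack:
        u = stack[-1]
        # Step 2: move to the first available neighbor of the current vertex
        nxt = next((v for v in range(n) if remaining[u][v]), None)
        if nxt is None:
            # No connections left here: store the vertex and step back
            circuit.append(stack.pop())
        else:
            # Step 3: delete the connection that was just used
            remaining[u][nxt] = remaining[nxt][u] = 0
            stack.append(nxt)
    # Step 4 ends when the stack is empty; reverse to get the visiting order
    return circuit[::-1]


# Five vertices, each connected to all the others (four connections each)
k5 = [[0, 1, 1, 1, 1],
      [1, 0, 1, 1, 1],
      [1, 1, 0, 1, 1],
      [1, 1, 1, 0, 1],
      [1, 1, 1, 1, 0]]
print(eulerian_circuit(k5))
```

Stepping back through the stack is what merges a path that closes early with the paths nested inside it, so the result is a single route rather than several separate ones.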
6.6.5 Implementing Graphs and the Eulerian Path in Python
Implementing an undirected graph implies the implementation of either an adjacency matrix or an adjacency list. Although the implementations may differ, the algorithm is basically the same in both cases: the Eulerian path algorithm is used to find and display a path that uses every connection between the vertices exactly once.
FIGURE 6.28 An undirected graph.
Based on the undirected graph provided in Figure 6.28, the following script offers three different scenarios, which can be selected by enabling/disabling the associated commented statements. The first scenario prompts the user to enter the number of vertices in the graph. Next, it accepts the connections in the form of an adjacency matrix as 0s or 1s (fillAdjacencyMatrix()), checks whether the Eulerian path algorithm can be applied to this particular matrix, and traverses the graph and displays the resulting path. Note that this process may result in one path being inside another. In this case, in the second round, the vertex that opens the path must also close it. The reader should also notice that, in order to merge two paths, the vertex that opens and closes the second path is the one that associates the two separate cases, in the form of a zoom-in path residing inside another. The second and third scenarios involve two different, pre-defined matrices that represent graphs and are addressed accordingly:
def fillAdjacencyMatrix(matrix, vertices):
    # Fill the adjacency matrix row by row with the 0s and 1s entered by the user
    for i in range(vertices):
        col = []
        for j in range(vertices):
            print("Enter 1 if there is a connection between ", i,
                  " and ", j, " or 0 if not: ", end = " ")
            connectionExists = int(input())
            col.append(connectionExists)
        matrix.append(col)
    return matrix

def displayAdjacencyMatrix(matrix, vertices):
    for i in range(vertices):
        print(matrix[i])

def checkEulerian(matrix, vertices):
    # Return the lowest-numbered vertex that still has unused connections,
    # or -1 if no connections remain
    newStartVertex = -1
    for i in range(vertices-1, -1, -1):
        sumPerCol = 0
        for j in range(vertices):
            sumPerCol = sumPerCol + matrix[i][j]
        if (sumPerCol != 0):
            newStartVertex = i
    return newStartVertex

# Ask the user for the number of graph vertices
# (it must match the size of the matrix when a pre-defined one is used)
numVertices = int(input("How many vertices in the graph? "))

# Scenario 1: an empty matrix to be filled by the user
#graph = []
# Scenario 2: a pre-defined matrix with 5 vertices
graph = [[0,1,1,1,1], [1,0,1,1,1], [1,1,0,1,1], [1,1,1,0,1], [1,1,1,1,0]]
# Scenario 3: a pre-defined matrix with 6 vertices
#graph = [[0,1,0,0,0,1], [1,0,1,0,1,1], [0,1,0,1,1,1], [0,0,1,0,1,0],
#         [0,1,1,1,0,1], [1,1,1,0,1,0]]

# Fill the adjacency matrix (Scenario 1 only)
# graph = fillAdjacencyMatrix(graph, numVertices)

# Display the adjacency matrix before running the Eulerian Path
displayAdjacencyMatrix(graph, numVertices)

# Check if the Eulerian Path algorithm can be applied in this case
startVertex = checkEulerian(graph, numVertices)
endVertex = vertex = startVertex
col = 0
if (startVertex == -1):
    print("Eulerian Path cannot be applied in this case")
else:
    # Print the vertex that opens the first round
    print("The first round: ", vertex, end = "")
    while (vertex < numVertices and col < numVertices):
        if (graph[vertex][col] == 0):
            col += 1
            # The current vertex has no connections left: look for a new round
            if (col == numVertices or vertex == numVertices):
                startVertex = checkEulerian(graph, numVertices)
                if (startVertex == -1):
                    print("\nPath closed")
                else:
                    endVertex = startVertex
                    vertex = startVertex; col = 0
                    print("\nZoom into", startVertex,
                          "for the round: ", startVertex, end = " ")
        elif (graph[vertex][col] == 1):
            # Use the connection, delete it in both directions, and move on
            print("->", col, end = "")
            graph[vertex][col] = graph[col][vertex] = 0
            vertex = col; col = 0
Output 6.6.5:
How many vertices in the graph? 5
[0, 1, 1, 1, 1]
[1, 0, 1, 1, 1]
[1, 1, 0, 1, 1]
[1, 1, 1, 0, 1]
[1, 1, 1, 1, 0]
The first round: 0-> 1-> 2-> 0-> 3-> 1-> 4-> 0
Zoom into 2 for the round: 2 -> 3-> 4-> 2
Path closed
6.7 WRAP UP
In this chapter, an effort was made to briefly explain some of the most important data structures
in programming and the algorithms that support them. The various scripts showcased how
Python can be utilized to implement them. There are, of course, several other data structures
available and, perhaps, more efficient algorithms to implement them, but these were beyond the scope
of this chapter.
6.8 CASE STUDIES
1. Create an application that implements the algorithms and tasks specified below. The application should use a GUI in the form of a tabbed notebook, using one tab for each
algorithm. The application requirements are the following:
a. Implement the following static sorting algorithms: bubble sort, insertion sort, shaker
sort, merge sort.
b. Ask the user to enter a regular arithmetic expression in the form of a phrase, with each
of the operands limited to single-digit integer numbers. Convert the infix expression
to postfix.
c. Ask the user to enter a sequence of integers, insert them into a binary search tree and
implement the BST ADS algorithm with both inorder and postorder traversals.
6.9 EXERCISES
1. Use a notebook GUI to implement the selection sort, the shell sort and the quicksort (one
on each tab).
2. Use a stack to implement the following tasks:
a. Reversing a string.
b. Calculating the sum of integers 1…N.
c. Calculating the sum of squares 1^2 + … + N^2.
d. Checking if a number or word is a palindrome.
e. Evaluating a postfix expression by using a stack.
3. Implement a deque structure with an example to test it. A deque is a linear structure of
items similar to a queue in the sense that it has two ends (i.e., front and rear). However, it
can enqueue and dequeue from both ends of the structure. Deque supports the following
operations:
a. add_front(item): Adds an item to the front of the deque.
b. add_rear(item): Adds an item to the rear of the deque.
c. remove_front(item): Removes an item from the front of the deque.
d. remove_rear(item): Removes an item from the rear of the deque.
e. isEmpty(): Returns a Boolean value indicating whether the deque is empty or not.
f. peek_front(): Returns the item at the front of the deque without removing it.
g. peek_rear(): Returns the item at the rear of the deque without removing it.
h. size(): Returns the number of items in the deque.
4. Using a graph do the following:
a. Ask the user to enter the number of vertices in the undirected graph.
b. Ask the user to enter the name of each of the vertices in the undirected graph.
c. Ask the user to enter the pair of vertices connected by each of the edges in the undirected
graph.
d. Determine whether the Eulerian path solution is feasible.
e. In case it is not, ask the user to add the missing connections.
f. Create the adjacency matrix for the graph and display it.
g. Create the adjacency list for the graph and display it.
h. Run the Eulerian path algorithm to traverse the graph, starting from a source entered by
the user.
i. Display the resulting path.
REFERENCES
Dijkstra, E. W. (1976). A Discipline of Programming. Englewood Cliffs: Prentice-Hall.
Hoare, C. A. R. (1961). Algorithm 64: Quicksort. Communications of the ACM, 4(7), 321.
Knuth, D. E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.
7 Database Programming with Python
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Christos Manolas
The University of York
Ravensbourne University London
Tareq Alhousary
University of Salford
Dhofar University
CONTENTS
7.1 Introduction
7.2 Scripting for Data Definition Language
7.2.1 Creating a New Database in MySQL
7.2.2 Connecting to a Database
7.2.3 Creating Tables
7.2.4 Altering Tables
7.2.5 Dropping Tables
7.2.6 The DESC Statement
7.3 Scripting for Data Manipulation Language
7.3.1 Inserting Records
7.3.2 Updating Records
7.3.3 Deleting Records
7.4 Querying a Database and Using a GUI
7.4.1 The SELECT Statement
7.4.2 The SELECT Statement with a Simple Condition
7.4.3 The SELECT Statement Using GUI
7.5 Case Study
7.6 Exercises
References
7.1 INTRODUCTION
Most IT professionals and scholars may agree on what makes computers special and useful: they
can perform operations at lightning speed and on large volumes of data. Stemming from these two
fundamental computational thinking elements are the notions of algorithms and programs as a
means to process and manipulate data. In the scope of computer science, information systems, and
information technology, the logical and physical organization of data falls under the broader context
of databases. A thorough analysis of the various concepts related to databases and their structural
DOI: 10.1201/9781003139010-7
design is outside the scope of this book. The reader can find relevant information on Elmasri & Navathe (2017).
The focus of this chapter is on the crossroads between computer programming with Python and a common
type of database structure: the relational database.
In relational databases, there are three main types of scripting techniques and/or languages that are used to
perform the various associated tasks, namely Data Definition Language (DDL), Data Manipulation
Language (DML), and Queries. DDL is used to create, display, modify, or delete the database and its structures
and tables, and it is associated with the database schema or metadata. DML is used to insert data into the various
tables, and modify or delete this data as required. It relates to the database instance or state. Queries are
used to display the data in various different ways. Most commercially available Database Management Systems
(DBMS) incorporate facilities and tools that utilize these three mechanisms.
Observation 7.1 – Types of Scripting in Relational Databases: There are three types of scripts addressing
relational databases: Data Definition Language (DDL), Data Manipulation Language (DML), and Queries.
Observation 7.2 – Database Schema, Database Instance: The structure of a database, including table metadata,
is also referred to as the database schema. The data stored on the tables at any given time are called the
database instance or state.
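As a rough illustration of the three script types, the following sketch labels an SQL statement by its leading keyword. The helper script_type() and the keyword grouping are assumptions made for this example only (SHOW and DESC are placed under DDL because the chapter uses them to display structures); they are not part of MySQL or of any library.

```python
# Leading keywords grouped by the three script types used in this chapter
SCRIPT_TYPES = {
    "DDL": {"CREATE", "ALTER", "DROP", "DESC", "SHOW"},
    "DML": {"INSERT", "UPDATE", "DELETE"},
    "Query": {"SELECT"},
}


def script_type(statement):
    """Return 'DDL', 'DML', 'Query', or 'Unknown' for an SQL statement."""
    words = statement.split()
    keyword = words[0].upper() if words else ""
    for kind, keywords in SCRIPT_TYPES.items():
        if keyword in keywords:
            return kind
    return "Unknown"


for sql in ("CREATE TABLE Student (Name CHAR(10))",
            "INSERT INTO Student VALUES ('Maria')",
            "SELECT * FROM Student"):
    print(script_type(sql), "<-", sql)
```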
The DBMS of choice for this chapter is MySQL (2021). This is part of a package that includes
both the DBMS and a local server solution called Apache (2021). The package supports both
Windows and Mac OS systems, and the two associated versions come under the name MAMP. The
packages are free for download from MAMP (2021) and Oracle (2021b) and the installation is pretty
intuitive and straightforward. While it is always beneficial for one to study and understand the tools
and technologies of any given system to a good extent, it must be noted that no prior knowledge or
practical experience with MAMP is needed in order to practice and execute the examples presented
in this chapter. While the examples make use of the MySQL DBMS and the Apache Server, this
is just a matter of simply logging in and activating them, and accessing the created databases. The
scripts provided in this chapter will do all the necessary work, while the results will appear in the
relevant MySQL database.
This chapter will cover the following topics:
• DDL (Data Definition Language): Creating a database and connecting to it. Modifying,
deleting, or displaying DB tables, structures, and attributes.
• DML (Data Manipulation Language): Inserting, modifying, and deleting records in a table.
• Queries: Displaying the records of one or more tables in various different ways.
• Using GUI programming, and in particular the Grid widget, to create presentable database applications with Python.
It should be noted that while expertise in databases is not essential, a good understanding of the
concepts and techniques introduced in Chapter 4: Graphical User Interface Programming with
Python and Chapter 5: Application Development with Python may be required. Ideally, the reader
should be comfortable with the major concepts introduced in all the previous introductory chapters,
as many of these concepts will be utilized or integrated in the examples presented here.
7.2 SCRIPTING FOR DATA DEFINITION LANGUAGE
As mentioned, MAMP will provide some of the tools that are necessary for the examples presented
in this chapter. The MAMP packages must be downloaded and installed, as required. Once installation is complete, the MAMP application must be launched. This will start the Apache local server
and the MySQL DBMS, both of which are required in order to run a client-server application.
Figures 7.1 and 7.2 illustrate the MAMP server and the MySQL DBMS interfaces, respectively:
FIGURE 7.1 MAMP server.
FIGURE 7.2 MySQL phpMyAdmin.
FIGURE 7.3 Installed libraries in environments tab.
Once these services are launched, the libraries related to MySQL connectivity and scripting must also be installed in the Anaconda environment. The libraries can be found under the
Environments tab in Anaconda Navigator. Installing the MySQL-related libraries, in addition to any libraries already installed for previous chapters of this book, ensures that the import statements related to MySQL will not raise errors. If some of the other libraries used here have not been
previously installed, the reader should refer to the scripts of the previous chapters and amend the
installation and the scripts presented here accordingly. Figures 7.3 and 7.4 illustrate the Environments
tab with lists of the installed libraries, as well as those that are not installed but are needed for running the examples.
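Whether the required libraries are available can also be checked from Python itself. The following is a minimal sketch using the standard importlib module; the helper name is_installed() is hypothetical, and the two module names are the ones imported by the scripts in this chapter.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package itself is missing
        return False


for name in ("tkinter", "mysql.connector"):
    status = "is available" if is_installed(name) else "is NOT installed"
    print(name, status)
```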
7.2.1 Creating a New Database in MySQL
A database can be formally defined as an organized collection of related data the processing of which can provide
a particular, explicit meaning. A database includes a number of tables, also called relations, hence the
relational prefix. Each table/relation consists of attributes, also referred to as fields or columns. Typically, one or
more of these attributes serve as unique record identifiers called primary keys and are often organized using
indices. These structural elements of the database are collectively referred to as the database metadata. As
mentioned, the creation and control of metadata can be handled using the DDL.
Observation 7.3 – Database: An organized collection of related data which are processed to provide
explicit meaning. A database includes a number of tables, each with its own attributes. Tables may be organized
using a unique primary key and make use of indices.
FIGURE 7.4 Not installed but necessary libraries.
It goes without saying that the database itself needs to be created prior to the creation of the
metadata. In MySQL, the creation of a new database is as simple as clicking on the New option on
the left panel of phpMyAdmin (Figure 7.2). When creating a new database, the user must specify a
name, the database format (usually GuiDB) and the default character set (usually utf8). In Python,
the creation process involves a number of steps:
• Obtaining the log-in credentials for the MySQL environment. These can be found in the
Welcome page in the Example area in MySQL.
• Using the config object (a dictionary) to set the credentials:
config = {'user': 'root', 'password': 'root', 'host': 'localhost'}.
• Writing the statements to connect to the database, setting the SQL statement, and executing the commands.
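In command-prompt form, the three steps above might look like the following sketch. The helper names build_create_db() and create_db() and the name check are assumptions made for this example; the credentials are the MAMP defaults quoted above. The mysql.connector import is placed inside create_db() so that the statement builder can be tried without a running MySQL server.

```python
import re

CONFIG = {'user': 'root', 'password': 'root', 'host': 'localhost'}


def build_create_db(db_name):
    """Return the CREATE DATABASE statement for a validated name."""
    # Names cannot be passed as query parameters, so the name is restricted
    # to letters, digits, and underscores before it is concatenated.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", db_name):
        raise ValueError("Invalid database name: " + repr(db_name))
    return "CREATE DATABASE " + db_name + " DEFAULT CHARACTER SET utf8"


def create_db(db_name, config=CONFIG):
    """Connect to MySQL and execute the CREATE DATABASE statement."""
    import mysql.connector   # needed only when a real connection is made
    link = mysql.connector.connect(**config)
    try:
        cursor = link.cursor()
        cursor.execute(build_create_db(db_name))
    finally:
        link.close()


print(build_create_db("newDB"))
```

Calling create_db("newDB") with the MAMP services running should make newDB appear in phpMyAdmin.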
Writing a Python script to create a database may be as simple as writing the basic statements in
a command-prompt mode or as sophisticated as offering a full GUI environment. The following
Python script is an example of the latter. Notice that, upon execution, the application should not
produce any output in the console, which simply means that no problems were encountered while connecting to
MySQL. Instead, the newly created database should appear in the list of available databases in phpMyAdmin. It must also be stressed that SQL statements are passed as strings, and that SQL keywords are not
case sensitive. As such, they can be written with capital or lower-case letters, or a combination of
both. In this chapter, capital letters are used for the keywords of the statements, in line
with the style adopted in the official MySQL documentation (Oracle, 2021a). This decision mainly serves
to distinguish the SQL keywords from the SQL database table and attribute names
and from the Python code, thus improving clarity and readability:
 1  import tkinter as tk
 2  from tkinter import ttk
 3  import mysql.connector
 4
 5  config = {'user': 'root', 'password': 'root', 'host': 'localhost'}
 6
 7  def createDB(dbName):
 8      GUIDB = 'GuiDB'   # database format label (not used in the SQL statement)
 9
10      connect = mysql.connector.connect(**config)
11      cursor = connect.cursor()
12      sqlString = "CREATE DATABASE {} DEFAULT CHARACTER SET utf8"
13      cursor.execute(sqlString.format(dbName.get()))
14
15  # Create the basic window frame and give it a title
16  winFrame = tk.Tk()
17  winFrame.title("Create a new database")
18
19  # Create the interface
20  winLabel = tk.Label(winFrame,
21      text = "Enter the name of the new database", bg = "grey")
22  winLabel.grid(column = 0, row = 0)
23
24  # Create the StringVar object that will accept user input from the
25  # keyboard, and initialize it
26  textVar = tk.StringVar()
27  textVar.set("Enter the name here")
28  winText = ttk.Entry(winFrame, textvariable = textVar, width = 30)
29  winText.grid(column = 0, row = 1)
30  winButton = tk.Button(winFrame, font = "Arial 16",
31      text = "Click to create the new DB\nin the localhost")
32  winButton.bind("<Button-1>", lambda event, a = textVar: createDB(a))
33  winButton.grid(column = 0, row = 2)
34  winFrame.mainloop()
Output 7.2.1:
The part of the script specifically relating to the database is in lines 3–13. In line 3, the mysql.connector
library that handles the connection with MySQL is imported. A standard connection
configuration is implemented in line 5. Once the GUI is built, a button click event calls the createDB() function, which assigns the most frequently used database format (GuiDB) to the relevant variable (line 8). Next, it connects to MySQL using the mysql.connector.connect(**config)
function (line 10), prepares the pending execution statement in the form of a string, sqlString (line 12),
and executes the statement (line 13).
7.2.2 Connecting to a Database
As in the previous example, once the database is created a connection must be established. Connecting to a
database involves the creation of a link to it inside the relevant DBMS (e.g., MySQL) through a server, such as
Internet Information Services (IIS) or Apache. Once the connection is established, the database must be opened
and a link must be created and attached to it. This usually requires some credentials, including login username,
password, the host address (i.e., the network address of the server that hosts the database), and the name of the
database itself. In the case of databases stored and used from within a local computer system and a local server
(e.g., MySQL through Apache), the host address is usually “localhost”.
Observation 7.4 – Connecting to a Database:
1. Import the mysql.connector library.
2. Use the cursor object and the mysql.connector.connect(**config) function to connect to the database.
3. Prepare the SQL statement.
4. Execute the SQL statement using the cursor.execute() function.
The following Python script connects to the newly created database. It sets the configuration dictionary
(config) that holds the credentials for the connection to the database (lines 2–3). Next, it links the execution
statement with the MySQL database through mysql.connector (line 5). Once the connection is successfully
established, a cursor object is created, which receives the results of all executed SQL statements (line 6).
Lastly, the database tables are displayed by executing the cursor.execute("SHOW TABLES") (line 7) and
cursor.fetchall() (line 8) commands.
Observation 7.5 – The SHOW TABLES Statement: Use the SHOW TABLES statement to locate tables in the
database. If successful, use the cursor.fetchall() function to load the results to the cursor object for later use.
Observation 7.6 – Exception Handling: It is highly advisable that the try…except exception handling
structure is used for each statement related to SQL scripts, as it is likely that the execution of such statements
will frequently cause errors that can lead to the abnormal termination (crash) of the application.
In this example the reader should note the use of the try…except statement (lines 4 and 9) to display the
appropriate messages in the cases of both successes and failures. This ensures that the execution of statements
that may return incorrect or unexpected values will not cause the application to crash. As an example, running
this script with newDB as the database name will display the tables as expected. However, if the database name
were to be changed to a non-existing one (e.g., newDB1), the exception handling code
in lines 9 and 10 would be executed, displaying an error message. It is worth mentioning that the
execution of the except segment of the script will be triggered for any reason that might cause
a failure in connecting to the database. Nevertheless, if the database is empty, an empty set of
tables will be displayed:
 1  import mysql.connector
 2  config = {'user': 'root', 'password': 'root',
 3            'host': 'localhost', 'database': 'newDB'}
 4  try:
 5      link = mysql.connector.connect(**config)
 6      cursor = link.cursor()
 7      cursor.execute("SHOW TABLES")
 8      print(cursor.fetchall())
 9  except mysql.connector.Error:
10      print("There is an error with the connection")
Output 7.2.2.a:
[('STUDENT',), ('Table1',)]
Output 7.2.2.a shows the results for a database including tables STUDENT and Table1.
Output 7.2.2.b:
There is an error with the connection
Output 7.2.2.b shows the results when the connection fails (e.g., when the database name is changed to a non-existing one, such as newDB1). In this case, the exception handling mechanism is activated and the corresponding error message is displayed. For an empty database, the SHOW TABLES statement
succeeds and cursor.fetchall() returns an empty list ([]), so no exception
is raised.
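The difference between a failed connection and an empty database can be made explicit in code. The sketch below is an illustration only: the function list_tables() and its optional connect parameter are hypothetical, the broad except clause is a simplification, and mysql.connector is imported only when a real connection is requested. The function returns None when the connection fails and a (possibly empty) list of table names otherwise.

```python
def list_tables(config, connect=None):
    """Return the table names, or None if the database cannot be reached."""
    if connect is None:
        import mysql.connector          # needed only for a real connection
        connect = mysql.connector.connect
    try:
        link = connect(**config)
    except Exception as err:            # broad on purpose in this sketch
        print("There is an error with the connection:", err)
        return None
    try:
        cursor = link.cursor()
        cursor.execute("SHOW TABLES")
        # Each row is a one-element tuple such as ('STUDENT',)
        return [row[0] for row in cursor.fetchall()]
    finally:
        link.close()


# Possible usage with the MAMP defaults (requires a running MySQL server):
# tables = list_tables({'user': 'root', 'password': 'root',
#                       'host': 'localhost', 'database': 'newDB'})
# print("Connection failed" if tables is None else tables)
```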
7.2.3 Creating Tables
The first action needed once a new database is created is the creation of its table(s). This is accomplished by the
execution of the CREATE TABLE statement in SQL. The CREATE TABLE statement is very similar or identical
across different DBMS. A detailed description of the small syntax variations between different DBMS is
beyond the scope of this chapter, but the basic structure remains the same.
Observation 7.7 – The CREATE TABLE Statement: Use the CREATE TABLE statement to create a table,
define its attributes, data types, and sizes, and set possible primary and foreign keys.
Observation 7.8 – Create Tables with No Primary or Foreign Key: Use the following statement to create a table
with no primary or foreign keys:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>),..., <attributeN> <DATA TYPE>(<size>))
Assuming the commonly used relational model, seven particular elements need to be specified when creating a table:
1. The table name (i.e., the name of each structure that will store data in its columns or fields, also
called attributes).
2. The number of attributes of the table.
3. The name of each attribute, preferably as a single, descriptive word.
4. The data type for each of the attributes (e.g., CHAR, INT, or DATE).
5. The length/size of the data for each attribute in bytes.
6. Whether any of the attributes is the primary key, or part of a combined primary key of the table.
7. Whether any of the attributes is a foreign key, referencing a corresponding attribute in
another table.
Provided that these seven elements are specified, there are three possible cases when creating a
table:
1. The table does not have a primary key and does not have any of its attributes referencing
the attributes of another table. In this case, the table is part of a single-table database or it
is a parent table for other tables to refer to.
2. The table has one or more of its attributes designated as a primary key, ensuring that each
of its records is unique.
3. There is more than one table in the database and the tables are related to each
other. This occurs when one or more of the attributes reference a corresponding column in
another table within the same database.
Python provides support for all three cases. Starting with the first case, one could create a table
with a number of attributes, but no primary or foreign keys. This can be done either statically
or dynamically. A static approach entails pre-defined statements and pre-determined results. A
dynamic approach allows the programmer to determine the table structure at run-time. The following script and output is an example of the latter:
 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root',
 5            'host': 'localhost', 'database': 'newDB'}
 6
 7  # The name of the table and its attributes
 8  tableName = input("Enter the name of the table to create: ")
 9  sqlString = "CREATE TABLE " + tableName + "("
10  numOfAt = int(input("Enter the number of attributes in the table: "))
11  atName = [""]*numOfAt
12  atType = [""]*numOfAt
13  atSize = [0]*numOfAt
14
15  # Define the table structure (i.e., attribute details)
16  for i in range(numOfAt):
17      atName[i] = input("Enter the attribute " + str(i) + ": ")
18      atType[i] = input("Enter 'char' for char type, 'int' for int type: ")
19      atSize[i] = int(input("Enter the size of the attribute: "))
20      sqlString += atName[i] + " " + atType[i] + \
21                   "(" + str(atSize[i]) + ")"
22      if (i < numOfAt-1):
23          sqlString += ","
24      else:
25          sqlString += ")"
26
27  # The SQL statement and exception handling mechanism
28  print("The SQL statement to run is: ", sqlString)
29
30  try:
31      link = mysql.connector.connect(**config)
32      cursor = link.cursor()
33      cursor.execute(sqlString)
34      sqlString = "DESC " + tableName
35      cursor.execute(sqlString)
36      attributes = cursor.fetchall()
37      # Desc/show the metadata of the new table
38      print("The metadata for the new table "+str(tableName)+" are: ")
39      for row in attributes:
40          print(row)
41  except mysql.connector.Error:
42      print("There is an error with the connection")
Output 7.2.3.a:
Enter the name of the table to create: Student
Enter the number of attributes in the table: 3
Enter the attribute 0: Name
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 10
Enter the attribute 1: Address
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 15
Enter the attribute 2: Year
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 4
The SQL statement to run is: CREATE TABLE Student(Name char(10),
Address char(15),Year int(4))
The metadata for the new table Student are:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
The script consists of three distinct parts. In the first part (lines 7–13), the user is prompted to enter a
name for the new table and the number of its attributes. The SQL string that is subsequently used for
the creation of the table is also constructed. In the second part (lines 15–25), the user is prompted
to enter the required details for each attribute (e.g., name, data type, size), and the SQL string is
updated accordingly. The third part involves code that connects to the database and executes the
SQL string. As mentioned, this is wrapped in an exception handling block in order to prevent a
possible uncontrolled termination of the program due to failures of database-related activities (lines
30–42). This is one of the most straightforward cases of creating tables using Python scripts. Indeed,
this implementation simply involves the incorporation and execution of SQL statements through the
Python script wrapper, similarly to what one would do with any other modern programming language.
Observation 7.9 – Primary Key: An attribute or a combination of attributes with values that uniquely
identify each particular record in the table.
In the output of this particular example, the user enters the rather trivial and common example of a
Student table with three basic attributes: Name, Address, and Year (of birth). After execution, the
reader should be able to verify that the table has been
created with the desired structure (i.e., with no primary or foreign keys) by checking database newDB in
MySQL.
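The same verification can be made from Python by interpreting the rows returned by the DESC statement. In MySQL, each row holds the values Field, Type, Null, Key, Default, and Extra, and a primary key attribute is marked with 'PRI' in the Key position. The helper names describe_rows() and key_attributes() below are hypothetical, and the rows are copied from Output 7.2.3.a.

```python
DESC_FIELDS = ("Field", "Type", "Null", "Key", "Default", "Extra")


def describe_rows(rows):
    """Turn the tuples returned by DESC into dictionaries keyed by column."""
    return [dict(zip(DESC_FIELDS, row)) for row in rows]


def key_attributes(rows):
    """Return the names of the attributes that are part of the primary key."""
    return [r["Field"] for r in describe_rows(rows) if r["Key"] == "PRI"]


student = [('Name', 'char(10)', 'YES', '', None, ''),
           ('Address', 'char(15)', 'YES', '', None, ''),
           ('Year', 'int(4)', 'YES', '', None, '')]

for attribute in describe_rows(student):
    print(attribute["Field"], attribute["Type"])
print("Primary key attributes:", key_attributes(student))
```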
The second case involves the addition of primary
keys to the table. As a reminder, a formal definition
of the primary key is that of an attribute of a table
the value of which identifies records uniquely. Simply
put, the primary key designation ensures that there
are no duplicate values for the related attribute(s). It
must be stressed again that two distinct possibilities
exist in relation to primary keys. The first is that it
consists of a single attribute. In this case the syntax is
the following:
CREATE TABLE <table name> (<attribute1>
<DATA TYPE>(<size>) PRIMARY KEY,...,
<attributeN> <DATA TYPE>(<size>))
The second is that the primary key consists of a combination of two or more attributes. In this case the syntax
is slightly different:
CREATE TABLE <table name> (<attribute1>
<DATA TYPE>(<size>),..., <attributeN>
<DATA TYPE>(<size>), PRIMARY KEY
(<attributeX>,... <attributeY>))
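Both syntax forms can be generated by a single helper, sketched below. The function build_create_table(), its parameters, and the Enrolment table are hypothetical and serve only to illustrate the two templates above; the names are concatenated without validation, so this is not suitable for untrusted input.

```python
def build_create_table(table, columns, primary_key=()):
    """Build a CREATE TABLE statement.

    columns:     list of (name, data_type, size) tuples
    primary_key: tuple with the names of the key attributes (may be empty)
    """
    parts = []
    for name, data_type, size in columns:
        part = "{} {}({})".format(name, data_type, size)
        if len(primary_key) == 1 and name == primary_key[0]:
            part += " PRIMARY KEY"        # single-attribute primary key
        parts.append(part)
    if len(primary_key) > 1:              # combined primary key
        parts.append("PRIMARY KEY (" + ", ".join(primary_key) + ")")
    return "CREATE TABLE " + table + " (" + ", ".join(parts) + ")"


single = build_create_table(
    "Customers",
    [("CustomerID", "INT", 3), ("CustLastName", "CHAR", 15)],
    primary_key=("CustomerID",))
combined = build_create_table(
    "Enrolment",
    [("StudentID", "INT", 5), ("CourseID", "CHAR", 6)],
    primary_key=("StudentID", "CourseID"))
print(single)
print(combined)
```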
Observation 7.10 – Foreign Key: An attribute that references the values of a corresponding attribute on another
table of the same database that is also the primary key for the referenced table.
Observation 7.11 – Create a Table with a Single Primary Key but No Foreign Key:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>) PRIMARY KEY, ..., <attributeN> <DATA TYPE>(<size>))
Observation 7.12 – Create a Table with Combined Primary Key but No Foreign Key:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>),..., <attributeN> <DATA TYPE>(<size>), PRIMARY KEY (<attributeX>,... <attributeY>))
The following script is another version of the one
presented previously, modified in order to address
the creation of a table with a single primary key
(lines 15–31):
 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root',
 5            'host': 'localhost', 'database': 'newDB'}
 6
 7  # The name of the table and its attributes
 8  tableName = input("Enter the name of the table to create: ")
 9  sqlString = "CREATE TABLE " + tableName + "("
10  numOfAt = int(input("Enter the number of attributes in the table: "))
11  atName = [""]*numOfAt
12  atType = [""]*numOfAt
13  atSize = [0]*numOfAt
14  key = 0   # becomes 1 once a primary key has been set
15
16  # Define the structure of the table (i.e., attribute details)
17  for i in range(numOfAt):
18      atName[i] = input("Enter the attribute " + str(i) + ": ")
19      atType[i] = input("Enter 'char' for char, 'int' for int type: ")
20      atSize[i] = int(input("Enter the size of the attribute: "))
21      sqlString += atName[i] + " " + atType[i] + \
22                   "(" + str(atSize[i]) + ")"
23      if (key == 0):
24          primaryKey = input("Is this a primary key (Y/N)? ")
25          if (primaryKey.upper() == "Y"):
26              sqlString += " PRIMARY KEY"
27              key = 1
28      if (i < numOfAt-1):
29          sqlString += ", "
30      else:
31          sqlString += ")"
32
33  # The SQL statement to run using exception handling
34  print("The SQL statement to run is: \n", sqlString)
35  try:
36      link = mysql.connector.connect(**config)
37      cursor = link.cursor()
38      cursor.execute(sqlString)
39      sqlString = "DESC " + tableName
40      cursor.execute(sqlString)
41      columns = cursor.fetchall()
42      print("The structure/metadata of the table ",str(tableName),"is:")
43      for row in columns:
44          print(row)
45  except mysql.connector.Error:
46      print("There is an error with the connection")
Output 7.2.3.b:
Enter the name of the table to create: Customers
Enter the number of attributes in the table: 3
Enter the attribute 0: CustomerID
Enter 'char' for char, 'int' for int type: int
Enter the size of the attribute: 3
Is this a primary key (Y/N)? Y
Enter the attribute 1: CustLastName
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 15
Enter the attribute 2: CustFirstName
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 10
The SQL statement to run is:
Create Table Customers(CustomerID int(3) Primary key, CustLastName
char(15), CustFirstName char(10))
There is an error with the connection
Output 7.2.3.c:
Enter the name of the table to create: Items
Enter the number of attributes in the table: 3
Enter the attribute 0: ItemID
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 6
Is this a primary key (Y/N)? Y
Enter the attribute 1: ItemDesc
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 25
Enter the attribute 2: ItemPrice
Enter 'char' for char, 'int' for int type: int
Enter the size of the attribute: 5
The SQL statement to run is:
Create Table Items(ItemID char(6) Primary key, ItemDesc char(25),
ItemPrice int(5))
The structure/metadata of the table Items is:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
The output demonstrates the creation of two of the three
tables (i.e., Customers and Items) from Table 7.1.
The third case involves two or more tables that are
connected to each other through a common attribute.
This common attribute is usually designated as a
primary key in one of the tables and as a foreign key
in the others, although this is not the only possible
arrangement. This practice is often termed referencing,
as the foreign key of the child table references the
primary key of the parent table. The syntax for the
creation of the table and the key designation is
the following:
Observation 7.13 – Create a Table
with One or More Foreign Keys:
CREATE TABLE <table name>
(<attribute1> <DATA
TYPE>(<size>), FOREIGN KEY
(<attribute name>) REFERENCES
<table name> (<attribute
name>),..., <attributeN>
<DATA TYPE>(<size>), FOREIGN
KEY (<attribute name>)
REFERENCES <table name>
(<attribute name>))
CREATE TABLE <table name> (
<attribute1> <DATA TYPE>(<size>), FOREIGN KEY (<attribute name>)
REFERENCES <table name> (<attribute name>), ...,
<attributeN> <DATA TYPE>(<size>), FOREIGN KEY (<attribute name>) REFERENCES
<table name> (<attribute name>))
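As a rough sketch of the syntax above, the key clauses can be assembled separately from the attribute definitions and appended at the end of the statement, a placement that MySQL also accepts. The helper name foreign_key_clause is an assumption made for this illustration; the table and attribute names follow Table 7.1.

```python
def foreign_key_clause(attribute, ref_table, ref_attribute):
    """Return the FOREIGN KEY ... REFERENCES ... clause for one attribute."""
    return (f"FOREIGN KEY ({attribute}) "
            f"REFERENCES {ref_table}({ref_attribute})")


# Attribute definitions and key clauses for the Orders table of Table 7.1
columns = ["OrderID INT(3) PRIMARY KEY", "CustID INT(3)", "ItemID CHAR(6)"]
keys = [foreign_key_clause("CustID", "Customers", "CustomerID"),
        foreign_key_clause("ItemID", "Items", "ItemID")]
sql = "CREATE TABLE Orders (" + ", ".join(columns + keys) + ")"
print(sql)
```

Note that the referencing attribute and the referenced attribute need compatible data types, which is why ItemID is declared as CHAR(6) in both tables.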
TABLE 7.1
Customers – Items – Orders

Customers                    Items                     Orders
Attribute      Type          Attribute  Type           Attribute      Type
CustomerID     INT(3) PK     ItemID     CHAR(6) PK     OrderID        INT(3) PK
CustLastName   CHAR(15)      ItemDesc   CHAR(25)       CustID         INT(3) FK
CustFirstName  CHAR(10)      ItemPrice  INT(5)         ItemID         CHAR(6) FK
                                                       OrderYear      INT(4)
                                                       OrderQuantity  INT(3)
The following Python script is another amendment to the previously developed script, allowing for
the specification of a foreign key attribute, and the corresponding tables and reference attributes.
It is beyond the scope of this chapter to discuss the numerous possibilities of such tasks in detail,
and to provide safety measures against the multitude of cases of incorrect entries that could cause
abnormal termination of the program. The goal of this example is to demonstrate how to use Python
to facilitate the creation of such relationships in their simplest form using database table Orders
from Table 7.1:
import mysql.connector

# The database config details
config = {'user': 'root', 'password': 'root',
          'host': 'localhost', 'database': 'newDB'}

# The name of the table and its attributes
tableName = input("Enter the name of the table to create: ")
sqlString = "CREATE TABLE " + tableName + "("
numOfAt = int(input("Enter the number of attributes in the table: "))
atName = [""]*numOfAt
atType = [""]*numOfAt
atSize = [0]*numOfAt
pkey = 0    # Becomes 1 once a primary key has been designated

# Define the structure of the table (i.e., attribute details)
for i in range(numOfAt):
    atName[i] = input("\nEnter the attribute " + str(i) + ": ")
    atType[i] = str(input("Enter 'CHAR' for char, 'INT' for int type: "))
    atSize[i] = int(input("Enter the size of the attribute: "))
    sqlString += atName[i] + " " + atType[i] + \
        "(" + str(atSize[i]) + ")"
    if (pkey == 0):
        primaryKey = input("Is this a primary key (Y/N)? ")
        if (primaryKey == "Y"):
            sqlString += " PRIMARY KEY"
            pkey = 1
    foreignKey = input("Is this a foreign key (Y/N)? ")
    if (foreignKey == "Y"):
        # List the available tables and select the one to reference
        availableTables = "SHOW TABLES"
        link = mysql.connector.connect(**config)
        cursor = link.cursor()
        cursor.execute(availableTables)
        tables = cursor.fetchall()
        print(tables)
        refTable = input("Select the table to reference: ")
        # List the attributes of the referenced table (same connection)
        availableAttributes = "DESC " + str(refTable)
        cursor.execute(availableAttributes)
        columns = cursor.fetchall()
        print(columns)
        refAt = input("Select the attribute to reference: ")
        link.close()
        sqlString += ", FOREIGN KEY (" + atName[i]
        sqlString += ") REFERENCES " + str(refTable) + "(" + \
            str(refAt) + ")"
    if (i < numOfAt-1):
        sqlString += ", "
    else:
        sqlString += ")"

# The SQL statement and the exception handling mechanism
print("\nThe SQL statement to run is: \n", sqlString)
try:
    link = mysql.connector.connect(**config)
    cursor = link.cursor()
    cursor.execute(sqlString)
    sqlString = "DESC " + tableName
    cursor.execute(sqlString)
    columns = cursor.fetchall()
    print("\nThe structure/metadata of the table ",
          str(tableName), "is:")
    for row in columns:
        print(row)
except mysql.connector.Error:
    print("There is an error with the connection")
Output 7.2.3.d:
Enter the name of the table to create: Orders
Enter the number of attributes in the table: 5
Enter the attribute 0: OrderID
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a primary key (Y/N)? Y
Is this a foreign key (Y/N)? N
Enter the attribute 1: CustID
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a foreign key (Y/N)? Y
[('customers',), ('items',), ('student',), ('table1',)]
Select the table to reference: Customers
[('CustomerID', 'int(3)', 'NO', 'PRI', None, ''), ('CustLastName',
'char(15)', 'YES', '', None, ''), ('CustFirstName', 'char(10)', 'YES', '',
None, '')]
Select the attribute to reference: CustomerID
Enter the attribute 2: ItemID
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 6
Is this a foreign key (Y/N)? Y
[('customers',), ('items',), ('student',), ('table1',)]
Select the table to reference: Items
[('ItemID', 'char(6)', 'NO', 'PRI', None, ''), ('ItemDesc', 'char(25)',
'YES', '', None, ''), ('ItemPrice', 'int(5)', 'YES', '', None, '')]
Select the attribute to reference: ItemID
Enter the attribute 3: OrderYear
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 4
Is this a foreign key (Y/N)? N
Enter the attribute 4: OrderQty
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a foreign key (Y/N)? N
The SQL statement to run is:
Create Table Orders(OrderID int(3) Primary key, CustID int(3), Foreign
Key (CustID) References Customers(CustomerID), ItemID char(6), Foreign
Key (ItemID) References Items(ItemID), OrderYear int(4), OrderQty int(3))
The structure/metadata of the table Orders is:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
Once the table is created and references to tables Customers and Items are established, the following Entity Relationship Diagram (ERD) should appear in MySQL Designer (Figure 7.5):
FIGURE 7.5
Entity relationship diagram for the customers-items-orders database.
7.2.4 Altering Tables
As discussed, the CREATE TABLE statement creates new tables and defines their attributes and characteristics. In other words, it is used to create and
specify the metadata of the table. This metadata is
not expected to change frequently; indeed, the better
the design of the database the lower the possibility of
metadata modification being required. Nevertheless,
when a change is necessary, the most drastic way to
make it is to destroy and re-create the entire table.
This is also the easiest solution provided that the
table contains no data. However, the more data the
table holds the less feasible this approach becomes,
as destroying the table also leads to permanent
data loss.
This is where the ALTER TABLE statement comes
into play. The statement has numerous variations, but
they all serve the purpose of altering the structure and
metadata of an existing table. The most important and
frequently used of these variations cover the following:
Observation 7.14 – The ALTER TABLE Statement:
ALTER TABLE <name> ADD <new attribute> <DATA TYPE>(<size>)
ALTER TABLE <name> DROP <attribute name>
ALTER TABLE <name> CHANGE <attribute name> <attribute new name> <attribute new DATA TYPE>(<new size>)
ALTER TABLE <name> ADD <new attribute> <DATA TYPE>(<size>) PRIMARY KEY
ALTER TABLE <name> DROP PRIMARY KEY
1. Adding/deleting/modifying an attribute in an existing table.
2. Adding/deleting a primary key constraint.
The first set of statements relates to the manipulation of simple attributes. For instance, if
a new attribute is to be added to an existing table, the ALTER TABLE syntax would be the
following:
ALTER TABLE <table name> ADD <new attribute> <DATA TYPE>(<size>)
Accordingly, to delete an existing attribute from a table the statement can be used with following
syntax:
ALTER TABLE <table name> DROP <attribute name>
Renaming an attribute and/or modifying its data type and size would take the following form:
ALTER TABLE <table name> CHANGE <attribute name> <attribute new name>
<attribute new DATA TYPE>(<new size>)
The second set of statements involves the addition of a new attribute that also serves as a (composite)
primary key or the deletion of the primary key function of an attribute. In the first case, the following syntax should be used:
ALTER TABLE <table name> ADD <new attribute> <DATA TYPE>(<size>) PRIMARY KEY
In the case of the latter, the syntax would be the following:
ALTER TABLE <table name> DROP PRIMARY KEY
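The five variations above differ only in the text that follows the table name, so they can be sketched as a single helper that returns the statement as a string. The function name build_alter and the task labels ("ADD", "DROP", "CHANGE", "ADD_PK", "DROP_PK") are assumptions introduced for this illustration only.

```python
def build_alter(table, task, name=None, data_type=None, size=None,
                new_name=None):
    """Return the ALTER TABLE statement for one of five tasks.

    task -- "ADD", "DROP", "CHANGE", "ADD_PK" or "DROP_PK" (assumed labels)
    """
    prefix = f"ALTER TABLE {table} "
    if task == "ADD":
        return prefix + f"ADD {name} {data_type}({size})"
    if task == "DROP":
        return prefix + f"DROP {name}"
    if task == "CHANGE":
        return prefix + f"CHANGE {name} {new_name} {data_type}({size})"
    if task == "ADD_PK":
        return prefix + f"ADD {name} {data_type}({size}) PRIMARY KEY"
    if task == "DROP_PK":
        return prefix + "DROP PRIMARY KEY"
    # Reject anything else instead of producing a malformed statement
    raise ValueError("Unknown task: " + task)


print(build_alter("Student", "ADD", name="MobileNumber",
                  data_type="CHAR", size=15))
print(build_alter("Student", "CHANGE", name="MobileNumber",
                  new_name="PhoneNumber", data_type="CHAR", size=20))
print(build_alter("Student", "DROP_PK"))
```

Raising an error for an unrecognized task avoids sending an incomplete statement to the server.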
The following Python script demonstrates the use of all the aforementioned cases in a single
application:
import mysql.connector

# The database config details
config = {'user': 'root', 'password': 'root',
          'host': 'localhost', 'database': 'newDB'}

# Show the available tables
availableTables = "SHOW TABLES"
link = mysql.connector.connect(**config)
cursor = link.cursor()
cursor.execute(availableTables)
tables = cursor.fetchall()
print(tables)

# Select the table to alter and show its attributes
selectedTable = input("Select the table to alter: ")
availableAttributes = "DESC " + str(selectedTable)
cursor.execute(availableAttributes)
columns = cursor.fetchall()
for row in columns:
    print(row)
link.close()

# Decide to add a column in the selected table, modify it, or drop it
alterType = input("(A)dd a new column\n(M)odify its size\n(D)rop one?"
                  "\n(APK)Add Primary Key\n(DPK)Drop Primary Key?"
                  "\nSelect preferred task: ")
if (alterType == "A"):
    atName = input("\nEnter the attribute name: ")
    atType = input("Enter 'char' for char type, 'int' for int type: ")
    atSize = int(input("Enter the size of the attribute: "))
if (alterType == "D"):
    atName = input("\nEnter the name of the attribute to drop: ")
if (alterType == "M"):
    atName = input("\nEnter the name of the attribute to change: ")
    atNewName = input("\nEnter the new name of the attribute: ")
    atNewType = input("Enter 'char' for char type, 'int' for int type: ")
    atNewSize = int(input("Enter the size of the attribute: "))
if (alterType == "APK"):
    atName = input("\nEnter the name of the attribute to "
                   "convert to Primary Key: ")
    atNewType = input("Enter 'char' for char type, 'int' for int type: ")
    atNewSize = int(input("Enter the size of the attribute: "))

# Prepare and execute the alter statement
if (alterType == "A"):
    sqlString = "ALTER TABLE " + str(selectedTable) + " ADD " + \
        atName + " " + str(atType) + "(" + str(atSize) + ")"
elif (alterType == "D"):
    sqlString = "ALTER TABLE " + str(selectedTable) + \
        " DROP COLUMN " + str(atName)
elif (alterType == "M"):
    sqlString = "ALTER TABLE " + str(selectedTable) + " CHANGE " + \
        atName + " " + atNewName + " " + atNewType + \
        "(" + str(atNewSize) + ");"
elif (alterType == "APK"):
    sqlString = "ALTER TABLE " + str(selectedTable) + " ADD " + atName + \
        " " + atNewType + "(" + str(atNewSize) + ") PRIMARY KEY"
elif (alterType == "DPK"):
    sqlString = "ALTER TABLE " + str(selectedTable) + " DROP PRIMARY KEY"
else:
    # Stop early if the menu choice is not one of the listed tasks
    raise SystemExit("Unknown task: " + alterType)
print(sqlString)

try:
    link = mysql.connector.connect(**config)
    cursor = link.cursor()
    cursor.execute(sqlString)
    print(cursor)
    sqlString = "DESC " + selectedTable
    cursor.execute(sqlString)
    columns = cursor.fetchall()
    print("\nThe structure/metadata of the table ",
          str(selectedTable), "is:")
    for row in columns:
        print(row)
except mysql.connector.Error:
    print("There is an error with the connection")
Output 7.2.4.a: Adding a new attribute
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: A
Enter the attribute name: MobileNumber
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 15
Alter table Student add MobileNumber char(15)
MySQLCursor: Alter table Student add MobileNumber cha..
The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('MobileNumber', 'char(15)', 'YES', '', None, '')
Output 7.2.4.b: Modifying an attribute
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('MobileNumber', 'char(15)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: M
Enter the name of the attribute to change: MobileNumber
Enter the new name of the attribute: PhoneNumber
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 20
Alter table Student change MobileNumber PhoneNumber char(20);
MySQLCursor: Alter table Student change MobileNumber ..
The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('PhoneNumber', 'char(20)', 'YES', '', None, '')
Output 7.2.4.c: Deleting/Dropping an attribute
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('PhoneNumber', 'char(20)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: D
Enter the name of the attribute to drop: PhoneNumber
Alter table Student drop column PhoneNumber
MySQLCursor: Alter table Student drop column PhoneNum..
The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
Output 7.2.4.d: Adding a primary key
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: APK
Enter the name of the attribute to
convert to Primary Key: StudentID
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 10
Alter table student add StudentID char(10) Primary key
MySQLCursor: Alter table student add StudentID char(1..
The structure/metadata of the table student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', 'PRI', None, '')
Output 7.2.4.e: Dropping a primary key
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', 'PRI', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: DPK
Alter table student Drop Primary Key
MySQLCursor: Alter table student Drop Primary Key
The structure/metadata of the table student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', '', None, '')
The script allows the user to select the table the metadata of which must be altered. The user is presented with a simple menu that can be used for choosing the type of the execution statement. Upon
execution the result is displayed on screen, but can be also verified in MySQL. As the concepts
related to the programming aspects of the script have been covered in previous sections, they are not
discussed here. The outputs showcase some testing cases based on the developed script.
7.2.5 Dropping Tables
The deletion of an entire table, and especially of one that
contains data, is not something that one should resort
to frequently. Nevertheless, there are occasions when this
may be necessary. Assuming that there are no referential
integrity relationships between the table in question and
any other tables, the deletion can be implemented with
the DROP TABLE statement and a simple reference to
the name of the table:
Observation 7.15 – The DROP TABLE
Statement: Destroys (deletes) a table
and all the data contained in it, as in
the example below.
DROP TABLE <table name>
DROP TABLE <table name>
The following Python script demonstrates this by displaying the available tables and
offering the user a mechanism for table selection and deletion:
import mysql.connector

# The database config details
config = {'user': 'root', 'password': 'root',
          'host': 'localhost', 'database': 'newDB'}

# Show the available tables
def showTables():
    availableTables = "SHOW TABLES"
    link = mysql.connector.connect(**config)
    cursor = link.cursor()
    cursor.execute(availableTables)
    tables = cursor.fetchall()
    print(tables)
    link.close()

# Show the available tables
showTables()

# Select the table to drop and show its attributes
selectedTable = input("Select the table to drop: ")
availableAttributes = "DESC " + str(selectedTable)
link = mysql.connector.connect(**config)
cursor = link.cursor()
cursor.execute(availableAttributes)
columns = cursor.fetchall()
for row in columns:
    print(row)
link.close()

# Confirm the decision to drop the table
dropConfirmation = input("Are you sure you want to drop "
                         "the table (Y/N)? ")
if (dropConfirmation == "Y"):
    sqlString = "DROP TABLE " + str(selectedTable)
    print(sqlString)
    try:
        link = mysql.connector.connect(**config)
        cursor = link.cursor()
        cursor.execute(sqlString)
        # Show the available tables
        showTables()
    except mysql.connector.Error:
        print("There is an error with the connection")
Output 7.2.5:
[('customers',), ('items',), ('orders',), ('student',), ('table1',), ('test',)]
Select the table to drop: test
('test1', 'char(10)', 'NO', 'PRI', None, '')
('test2', 'char(10)', 'YES', '', None, '')
Are you sure you want to drop the table (Y/N)? Y
Drop table test
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
The output shows how to use the DROP TABLE statement to delete/destroy a table and its data. Note
that before trying to drop a table (in this instance table Test), one has to ensure that the table has
been created and is in existence.
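The existence check mentioned above can be sketched as a small guard that compares the user's selection with the rows returned by SHOW TABLES before any DROP TABLE statement is prepared. The function name drop_statement is an assumption for this illustration, and the case-insensitive comparison is also an assumption that mirrors the outputs of this section (where Student and student refer to the same table); on servers with case-sensitive table names an exact comparison would be more appropriate.

```python
def drop_statement(selected, tables):
    """Return a DROP TABLE statement, or None if the table is not listed.

    tables -- rows as returned by fetchall() after SHOW TABLES,
              i.e., a list of one-element tuples
    """
    # Assumption: table names are compared without regard to case
    existing = {row[0].lower() for row in tables}
    if selected.lower() not in existing:
        return None
    return "DROP TABLE " + selected


tables = [('customers',), ('items',), ('orders',), ('test',)]
print(drop_statement("test", tables))
print(drop_statement("missing", tables))
```

MySQL also offers DROP TABLE IF EXISTS <table name>, which leaves the check to the server.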
7.2.6 The DESC Statement
In previous sections, there were instances where the
structure or metadata of a table had to be displayed. The
statement used in such cases was the following:
DESC <table name>
Observation 7.16 – The DESC
Statement: Returns the metadata of a
table as in the example below.
DESC <table name>
This statement returns a list of tuples with the attributes of the table and the associated details, such
as its name, size, and primary key designation. The reader can refer to the scripts provided in previous sections as practical examples of its functionality and use.
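As a brief sketch of how these tuples can be used, each row returned by DESC can be paired with the column headings that MySQL reports for the statement (Field, Type, Null, Key, Default, Extra). The helper names describe_rows and primary_key_names are assumptions for this illustration, and the sample rows are copied from the Customers output shown earlier rather than fetched from a live server.

```python
# Column headings of the result set returned by DESC in MySQL
DESC_FIELDS = ("Field", "Type", "Null", "Key", "Default", "Extra")


def describe_rows(rows):
    """Convert DESC result tuples into dictionaries keyed by heading."""
    return [dict(zip(DESC_FIELDS, row)) for row in rows]


def primary_key_names(rows):
    """Return the names of the attributes flagged as primary key."""
    return [r["Field"] for r in describe_rows(rows) if r["Key"] == "PRI"]


# Sample rows, as printed for the Customers table in previous outputs
rows = [('CustomerID', 'int(3)', 'NO', 'PRI', None, ''),
        ('CustLastName', 'char(15)', 'YES', '', None, ''),
        ('CustFirstName', 'char(10)', 'YES', '', None, '')]
for attribute in describe_rows(rows):
    print(attribute["Field"], attribute["Type"], attribute["Key"])
print(primary_key_names(rows))
```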
7.3 SCRIPTING FOR DATA MANIPULATION LANGUAGE
The previous sections introduced the various DDL statements used to create, alter, and drop the
metadata of the tables in a database. This is often called the database schema. As mentioned, it is
not expected nor desired that this schema changes frequently. Once the schema is finalized, one can
start working on its state or instance. A database instance contains all the data stored in the database at any particular moment in time. The statements used for working with the database instance
are usually referred to as the Data Manipulation Language (DML). As in DDL and the database
schema, DML statements are used to create or insert new records to a table, modify and amend data,
or delete existing records from a table. The following sections introduce the most basic and common
uses of these statements.
7.3.1 Inserting Records
The INSERT statement is used to insert a single record
(row) to a table. The general syntax of the statement is
the following:
INSERT INTO <table name>
VALUES (<attribute1 value>, ..., <attributeN value>)
If the user is allowed to insert data to a table in a different order than the one specified in the corresponding
table metadata or to enter data selectively to a subset of
the table attributes, the following syntax could be used:
INSERT INTO <table name>
(<attributeX name>, ..., <attributeZ name>)
VALUES (<attributeX value>, ..., <attributeZ value>)
Observation 7.17 – Insert Records:
INSERT INTO <table name> VALUES (<attribute1 value>, ..., <attributeN value>)
If the data order is different than that of the table attributes, or if some attributes are not supposed to receive data, the following syntax can be used:
INSERT INTO <table name> (<attributeX name>, ..., <attributeZ name>) VALUES (<attributeX value>, ..., <attributeZ value>)
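Both forms of the statement can be sketched with a helper that produces the statement text together with a tuple of values. Instead of quoting the values by hand, this sketch uses %s placeholders, which MySQL Connector/Python fills in when the statement and the values are passed together to cursor.execute(). The function name build_insert is an assumption made for illustration and is not part of the connector.

```python
def build_insert(table, values, columns=None):
    """Return (statement, parameters) for a parameterized INSERT.

    columns -- optional list of attribute names; when omitted, values
               must follow the attribute order of the table metadata
    """
    placeholders = ", ".join(["%s"] * len(values))
    column_list = ""
    if columns:
        if len(columns) != len(values):
            raise ValueError("columns and values differ in length")
        column_list = " (" + ", ".join(columns) + ")"
    sql = f"INSERT INTO {table}{column_list} VALUES ({placeholders})"
    return sql, tuple(values)


# All attributes, in the order of the table metadata
print(build_insert("student", ["Alex", "Westwood 7", 2002, "001"]))
# A subset of the attributes, in a user-defined order
print(build_insert("student", [2002, "Alex"], ["Year", "Name"]))
```

With an open connection, the pair would be used as cursor.execute(sql, params), followed by a commit.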
The following Python script demonstrates the use of the
INSERT statement in a case where the user first selects the table to which the statement applies:
import mysql.connector

# Provide the established database config
config = {'user': "root", 'password': "root",
          'host': "localhost", 'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
    print("DB tables are: " + str(tables))
except mysql.connector.Error:
    print("There was a problem showing tables")

tableName = input("Enter the table selected: ")
try:
    # Show the table metadata
    cursor.execute("DESC " + tableName)
    columns = cursor.fetchall()
    print("Selected table is: ", tableName)
    print("Its attributes are: ")
    for row in columns:
        print(row)
    # Show the current instance of the table
    cursor.execute("SELECT * FROM " + str(tableName))
    records = cursor.fetchall()
    print("The records in the table are: ")
    for row in records:
        print(row)
except mysql.connector.Error:
    print("There was a problem showing the table attributes")

# Prepare the insert statement
numColumns = len(columns)
attributes = [""]*numColumns
sqlString = "INSERT INTO " + tableName + " VALUES ("

# Invite user's input for each attribute
for i in range(numColumns):
    attributes[i] = input("Enter data for attribute " + str(i) + ": ")
    # Text (char) values are quoted, numeric (int) values are not
    if (columns[i][1][0] == "c"):
        sqlString += "\"" + attributes[i] + "\""
    elif (columns[i][1][0] == "i"):
        sqlString += attributes[i]
    if (i < numColumns-1):
        sqlString += ", "
sqlString += ")"

# Execute the prepared insert statement
print("SQL statement to execute is: ")
print(sqlString)
cursor.execute(sqlString)

# Commit the results to ensure they are permanently stored
connect.commit()

# Show the new instance of the table
print("The records in the " + str(tableName) + " table are: ")
sqlString = "SELECT * FROM " + tableName
cursor.execute(sqlString)
records = cursor.fetchall()
for row in records:
    print(row)
Output 7.3.1.a: Inserting a new record to Student
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: student
Selected table is: student
Its attributes are:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', '', None, '')
The records in the table are:
Enter data for attribute 0: Alex
Enter data for attribute 1: Westwood 7
Enter data for attribute 2: 2002
Enter data for attribute 3: 001
SQL statement to execute is:
Insert into student values ("Alex", "Westwood 7", 2002, "001")
The records in the student table are:
('Alex', 'Westwood 7', 2002, '001')
Upon execution, the script displays the tables in the current database and prompts the user to select
one of them. Once a selection is made, the user is provided with both the metadata and the instance
of the table. Next, the user is invited to enter values for each of the attributes of the table, one at a
time. In this case, the more generic, basic syntax is adopted, so the user must enter values for all
the attributes of the table in the order dictated when the table was created. After all values are collected, the related INSERT statement is prepared and executed, and its result is committed. Finally,
the script provides the new instance of the table.
The following observations are also noteworthy in relation to the script and its output. Firstly,
any text value that is inserted to a table must be enclosed in quotes, while numbers are not. The
script uses double quotes, which MySQL accepts by default; single quotes are the standard SQL
form. Dates also have a particular, unique format. Secondly, in this particular example, the user
attempts to insert a record to the Student table, which has no primary key attribute, and is
neither referencing nor being referenced by another table. As this is a rather straightforward case,
any issues that arise with the statement are likely related to technical connectivity issues
between the database, the server, and the connections in the script. Thirdly, it is important to
commit the results of the INSERT statement, as this ensures that the newly inserted data are
indeed stored in the table.
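The effect of committing can be sketched without a running MySQL server by using Python's built-in sqlite3 module as a stand-in. This is a substitution made only so that the example is self-contained: SQLite uses ? placeholders where MySQL Connector/Python uses %s, but the commit() and rollback() calls on the connection play the same role. The helper name count_rows is an assumption for this illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL connection
connect = sqlite3.connect(":memory:")
cursor = connect.cursor()
cursor.execute("CREATE TABLE student (Name CHAR(10), Year INT)")
connect.commit()


def count_rows():
    """Return the number of records currently in the student table."""
    cursor.execute("SELECT COUNT(*) FROM student")
    return cursor.fetchone()[0]


# Insert without committing, then roll back: the record is discarded
cursor.execute("INSERT INTO student VALUES (?, ?)", ("Alex", 2002))
connect.rollback()
after_rollback = count_rows()

# Insert and commit: a later rollback can no longer undo the insert
cursor.execute("INSERT INTO student VALUES (?, ?)", ("Alex", 2002))
connect.commit()
connect.rollback()
after_commit = count_rows()

print(after_rollback, after_commit)
```

The first count is 0 and the second is 1, which shows that only the committed record survives.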
One could use the Customers, Items, and Orders tables as a working example. Firstly, the
user would enter a new record to the Customers table (note that the table has an attribute that
serves as a primary key). The following output illustrates this with the following data: 001, “John”,
and “Good”:
Output 7.3.1.b: Inserting a new record to Customers
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 001
Enter data for attribute 1: John
Enter data for attribute 2: Good
SQL statement to execute is:
Insert into customers values (001, "John", "Good")
The records in the customers table are:
(1, 'John', 'Good')
Next, let us assume that the user attempts to enter a new record with the following data: 001,
“Maria”, and “Green”. The problem in this case is that the user is attempting to insert a new record
with the same value for the primary key (i.e., 001). This will raise an internal error, since MySQL
does not allow duplicate values for this attribute. The output shows the error that would be raised
in such a case:
Output 7.3.1.c: Attempting to insert a new record to Customers with duplicate primary key
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
Enter data for attribute 0: 001
Enter data for attribute 1: Maria
Enter data for attribute 2: Green
SQL statement to execute is:
Insert into customers values (001, "Maria", "Green")
~\anaconda3\lib\site-packages\mysql\connector\connection.py in _handle_result(self, packet)
    571             return self._handle_eof(packet)
    572         elif packet[4] == 255:
--> 573             raise errors.get_exception(packet)
    574
    575         # We have a text result set
IntegrityError: 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
Following up on the same example, let us assume that the user attempts to insert a record in the
Items table, as displayed on the output below:
Output 7.3.1.d: Inserting a record to Items
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 100
Enter data for attribute 1: Refrigerator
Enter data for attribute 2: 600
SQL statement to execute is:
Insert into items values ("100", "Refrigerator", 600)
The records in the items table are:
('100', 'Refrigerator', 600)
The user may also attempt to insert a record in the Orders table. Firstly, let us assume that the user
correctly inputs data that correspond to the other two tables (i.e., Customers and Items). The
following output illustrates a successful attempt:
Output 7.3.1.e: Inserting a record to Orders
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 1
Enter data for attribute 1: 1
Enter data for attribute 2: 100
Enter data for attribute 3: 2021
Enter data for attribute 4: 15
SQL statement to execute is:
Insert into orders values (1, 1, "100", 2021, 15)
The records in the orders table are:
(1, 1, '100', 2021, 15)
In contrast, if we assume that the user attempts to insert another record to Orders with no consideration towards the corresponding Customers table, an error will be raised:
Output 7.3.1.f: Violating a referential integrity constraint in an INSERT statement
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
(1, 1, '100', 2021, 15)
Enter data for attribute 0: 2
Enter data for attribute 1: 2
Enter data for attribute 2: 100
Enter data for attribute 3: 2021
Enter data for attribute 4: 10
SQL statement to execute is:
Insert into orders values (2, 2, "100", 2021, 10)
IntegrityError                            Traceback (most recent call last)
~\anaconda3\lib\site-packages\mysql\connector\connection.py in _handle_result(self, packet)
    571             return self._handle_eof(packet)
    572         elif packet[4] == 255:
--> 573             raise errors.get_exception(packet)
    574
    575         # We have a text result set
IntegrityError: 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ('newdb'.'orders', CONSTRAINT 'orders_ibfk_1' FOREIGN KEY ('CustID') REFERENCES 'customers' ('CustomerID'))
These examples provide a basic demonstration of various cases of data insertion to tables, and of
potential violations of important constraints like primary and foreign keys. Of course, this is not an
exhaustive collection of all possible cases, but it should provide some clarity in terms of working
with INSERT statements in Python. Ideally, exception handling should be employed to control as
many violation scenarios as possible.
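The exception handling recommended above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the chapter's script: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL so that it runs without a server, the tables are reduced versions of Customers and Orders, and the helper name try_insert is hypothetical. With mysql.connector, the same pattern would catch mysql.connector.IntegrityError and use %s instead of ? as the placeholder.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the newDB database
connect = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is enabled
connect.execute("PRAGMA foreign_keys = ON")
cursor = connect.cursor()
cursor.execute("CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, "
               "CustLastName TEXT, CustFirstName TEXT)")
cursor.execute("CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, "
               "CustID INTEGER REFERENCES customers(CustomerID), "
               "OrderQty INTEGER)")


def try_insert(sql, values):
    """Hypothetical helper: run one INSERT and report the outcome."""
    try:
        cursor.execute(sql, values)
        connect.commit()
        return "inserted"
    except sqlite3.IntegrityError as error:
        connect.rollback()  # Discard the failed statement
        return "rejected: " + str(error)


results = [
    try_insert("INSERT INTO customers VALUES (?, ?, ?)", (1, "John", "Good")),
    # Same primary key value: violates the primary key constraint
    try_insert("INSERT INTO customers VALUES (?, ?, ?)", (1, "Maria", "Green")),
    try_insert("INSERT INTO orders VALUES (?, ?, ?)", (1, 1, 15)),
    # Customer 2 does not exist: violates the foreign key constraint
    try_insert("INSERT INTO orders VALUES (?, ?, ?)", (2, 2, 10)),
]
for line in results:
    print(line)
```

Handling the violation inside the helper keeps the script running and lets it show a short message to the user in place of a full traceback.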
7.3.2 Updating Records
Observation 7.18 – The UPDATE Statement: UPDATE <table name> SET <attribute1> = <value1>,..., <attributeN> = <valueN> WHERE <condition that involves one or more attributes>
Contrary to data definition statements, where changing the metadata of a table after its creation is generally undesirable and quite rare, data manipulation frequently requires changing the data of particular records. This is accomplished with the UPDATE statement:
UPDATE <table name>
SET <attribute1> = <value1>,..., <attributeN> = <valueN>
WHERE <condition that involves one or more attributes>
The following Python script is based on the examples developed in the previous sections, and adopts
the same user prompts and table selection functions in order to showcase the use of the UPDATE
statement, using the Customers table:
import mysql.connector

# Provide the established database config
GUIDB = 'GuiDB'
config = {'user': "root", 'password': "root",
          'host': "localhost", 'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
    print("DB tables are: " + str(tables))
except mysql.connector.Error:
    print("There was a problem showing tables")

tableName = input("Enter the table selected: ")

try:
    # Show the table metadata
    cursor.execute("DESC " + tableName)
    columns = cursor.fetchall()
    print("Selected table is: ", tableName)
    print("Its attributes are: ")
    for row in columns:
        print(row)

    # Show the current instance of the table
    cursor.execute("SELECT * FROM " + str(tableName))
    records = cursor.fetchall()
    print("The records in the table are: ")
    for row in records:
        print(row)
except mysql.connector.Error:
    print("There was a problem showing the table attributes")

# Prepare the update statement
attributeSelected = input("Select the attribute to change its values: ")
newValue = input("Enter the new value")
oldValue = input("Enter the old value")
sqlString = "UPDATE " + tableName + " SET " + attributeSelected + \
    " = " + "\'" + newValue + "\'" + " WHERE " + attributeSelected + \
    " = " + "\'" + oldValue + "\'"

# Execute the prepared Update statement
print("SQL statement to execute is: ")
print(sqlString)
cursor.execute(sqlString)

# Commit the results to ensure they are permanently stored
connect.commit()

# Show the new instance of the table
print("The records in the " + str(tableName) + " table are: ")
sqlString = "SELECT * FROM " + tableName
cursor.execute(sqlString)
records = cursor.fetchall()
for row in records:
    print(row)
Output 7.3.2: Updating a record in Customers
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
Select the attribute to change its values: CustLastName
Enter the new valueJames
Enter the old valueJohn
SQL statement to execute is:
Update customers set CustLastName = 'James' where CustLastName = 'John'
In addition to the UPDATE statement and its execution, the reader should pay close attention to
the requirement to commit the results of the execution. The commit() function ensures that the
results are permanently stored in the table. It must be also noted that there are several variations
of the UPDATE statement, the detailed coverage of which is out of the scope of this chapter.
For more detailed information on this topic, the reader is advised to refer to the official MySQL
documentation.
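One variation worth sketching is the preparation of the statement itself. The script above concatenates the user's input directly into the SQL string, which fails for values that contain a quote character and leaves the statement open to SQL injection. The following is a minimal sketch of an alternative, assuming the %s placeholder style accepted by the execute() method of a mysql.connector cursor; the helper name build_update and its parameters are hypothetical, and the table name is assumed to have been checked against the output of SHOW TABLES in the same way.

```python
def build_update(table_name, attribute, new_value, old_value, known_columns):
    """Hypothetical helper: return an UPDATE statement with placeholders
    and the tuple of values to pass alongside it."""
    # Identifiers cannot be sent as parameters, so the attribute is checked
    # against the column names reported by DESC before it is used
    if attribute not in known_columns:
        raise ValueError("Unknown attribute: " + attribute)
    sql = ("UPDATE " + table_name + " SET " + attribute + " = %s"
           " WHERE " + attribute + " = %s")
    return sql, (new_value, old_value)


# Column names as they appear in the first element of each DESC row
customer_columns = ["CustomerID", "CustLastName", "CustFirstName"]
sql_string, values = build_update("customers", "CustLastName",
                                  "James", "John", customer_columns)
print(sql_string)
print(values)
# With an open connection, the pair would be executed as:
#     cursor.execute(sql_string, values)
#     connect.commit()
```

Because the values travel separately from the statement, the connector takes care of quoting them, and the statement text stays the same regardless of what the user types.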
7.3.3 Deleting Records
Observation 7.19 – The DELETE Statement: DELETE FROM <table name> WHERE <condition>
In DML, the deletion of one or more records from a table is handled through the DELETE statement. The general syntax of the statement is the following:
DELETE FROM <table name> WHERE <condition>
If the WHERE clause is omitted, all the records of the table are deleted. Nevertheless, the empty table
will be still in existence, as the table deletion is a task achieved only through the DROP statement. It
must be also noted that the <condition> part is quite flexible and can include various expressions
and parameters, such as one or more attributes of the same table, queries related to the same table, or
queries from different tables. Finally, it is important to remember that the DELETE statement cannot
be executed if the result is violating referential integrity constraints.
Using the same example as in previous sections, the following Python script demonstrates a
simple use of the DELETE statement:
import mysql.connector

# Provide the established database config
GUIDB = 'GuiDB'
config = {'user': "root", 'password': "root",
          'host': "localhost", 'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
    print("DB tables are: " + str(tables))
except mysql.connector.Error:
    print("There was a problem showing tables")

tableName = input("Enter the table selected: ")

try:
    # Show the table metadata
    cursor.execute("DESC " + tableName)
    columns = cursor.fetchall()
    print("Selected table is: ", tableName)
    print("Its attributes are: ")
    for row in columns:
        print(row)

    # Show the current instance of the table
    cursor.execute("SELECT * FROM " + str(tableName))
    records = cursor.fetchall()
    print("The records in the table are: ")
    for row in records:
        print(row)
except mysql.connector.Error:
    print("There was a problem showing the table attributes")

# Prepare the Delete statement
attributeSelected = input("Select the attribute based on "
                          "which to delete a record(s): ")
deleteValue = input("Enter the data to delete: ")
sqlString = "DELETE FROM " + tableName + " WHERE " + \
    attributeSelected + " = " + "\'" + deleteValue + "\'"

# Execute the prepared Delete statement
print("SQL statement to execute is: ")
print(sqlString)
cursor.execute(sqlString)

# Commit the results to ensure they are permanently stored
connect.commit()

# Show the new instance of the table
print("The records in the " + str(tableName) + " table are: ")
sqlString = "SELECT * FROM " + tableName
cursor.execute(sqlString)
records = cursor.fetchall()
for row in records:
    print(row)
Output 7.3.3: Deleting a record from Orders
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
(1, 1, '100', 2021, 15)
Select the attribute based on which to delete a record(s): ItemID
Enter the data to delete: 100
SQL statement to execute is:
Delete from orders where ItemID = '100'
The records in the orders table are:
In the example illustrated in the output, the user selects the only record that has a value of 100 for
attribute ItemID in the Orders table. The reader should note how DELETE is prepared based on
the user’s selections, and how the result is committed using the commit() function.
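Two properties of DELETE noted earlier in this section, namely that a deletion violating referential integrity is refused and that omitting the WHERE clause empties the table without removing it, can be checked with a minimal sketch. As an assumption made purely for convenience, it uses Python's built-in sqlite3 module with an in-memory database in place of MySQL, with reduced versions of the Items and Orders tables; the variable names are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the newDB database
connect = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is enabled
connect.execute("PRAGMA foreign_keys = ON")
cursor = connect.cursor()
cursor.execute("CREATE TABLE items (ItemID TEXT PRIMARY KEY, ItemDesc TEXT)")
cursor.execute("CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, "
               "ItemID TEXT REFERENCES items(ItemID), OrderQty INTEGER)")
cursor.execute("INSERT INTO items VALUES ('100', 'Refrigerator')")
cursor.execute("INSERT INTO orders VALUES (1, '100', 15)")
connect.commit()

# Deleting an item that an order still refers to is rejected
try:
    cursor.execute("DELETE FROM items WHERE ItemID = ?", ("100",))
    parent_delete = "deleted"
except sqlite3.IntegrityError as error:
    parent_delete = "rejected: " + str(error)
print(parent_delete)

# DELETE without WHERE removes every record but keeps the table itself
cursor.execute("DELETE FROM orders")
connect.commit()
cursor.execute("SELECT COUNT(*) FROM orders")
orders_left = cursor.fetchone()[0]
cursor.execute("SELECT name FROM sqlite_master "
               "WHERE type = 'table' AND name = 'orders'")
orders_table_exists = cursor.fetchone() is not None
print(orders_left, orders_table_exists)
```

In MySQL the corresponding check for the table's existence would rely on SHOW TABLES, as in the chapter's scripts, rather than on the sqlite_master catalogue used here.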
7.4 QUERYING A DATABASE AND USING A GUI
Querying and reporting data from database tables is arguably the most useful part of database management from the perspective of the user. Thus, it should come as no surprise that the remaining
SQL statements are specifically used for these purposes. The available clauses are numerous, and
the possibilities for nested queries and for conditional query execution render the potential combinations virtually limitless. As such, an exhaustive coverage of every possible case of querying and
reporting is not only outside the scope of this chapter, but also a rather futile attempt in general. The
focus of this section is to showcase some basic ways to execute querying and reporting tasks, and to
demonstrate how GUIs could be utilized for presentation purposes.
7.4.1 The SELECT Statement
Observation 7.20 – The SELECT Statement: SELECT <list of attributes from one or more tables> OR * FROM <list of tables> WHERE <conditions>
The SELECT statement is used to query and report data from tables. Its most basic and generic syntax does not involve any clauses that dictate additional functionality or selection criteria:
SELECT * FROM <table name>
Such a statement returns all the attributes of every record in the specified table, as the asterisk (*) character stands for all attributes and the absence of a WHERE clause means that no condition restricts the records. Selections based on more specific criteria can be built by adding the required clauses:
SELECT <list of attributes from one or more tables> OR *
FROM <list of tables>
WHERE <conditions>
The <conditions> part specifies the particular requirements that the data must meet in order to
be reported, ranging from no conditions to very complicated multi-attribute and multi-table ones.
Similarly, the <list of tables> part specifies the tables that must be included in the report.
The reader can refer to the rich and readily available collection of related textbooks and resources,
providing thorough descriptions of the numerous forms of the detailed syntax clauses and possible
refinements (Oracle, 2021a).
The following Python script builds on the previous examples to demonstrate querying and
reporting on data from a table (e.g., Customers, Items, Orders), as specified by the user:
import mysql.connector

# Provide the established database config
GUIDB = 'GuiDB'
config = {'user': "root", 'password': "root",
          'host': "localhost", 'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
    print("DB tables are: " + str(tables))
except mysql.connector.Error:
    print("There was a problem showing tables")

tableName = input("Enter the table selected: ")

try:
    # Show the table metadata
    cursor.execute("DESC " + tableName)
    columns = cursor.fetchall()
    print("===================")
    print("Selected table is: ", tableName)
    print("===================")
    print("Its attributes are:")
    for row in columns:
        print(row)

    # Show the current instance of the table
    cursor.execute("SELECT * FROM " + str(tableName))
    records = cursor.fetchall()
    print("==============================")
    print("The records in the table are: ")
    print("==============================")
    for row in records:
        print(row)
except mysql.connector.Error:
    print("There was a problem showing the table attributes")
Output 7.4.1: Reporting data from a table based on user selection
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
(2, 'Norman', 'Chris')
(3, 'Flora', 'Alex')
In the case presented here, the output reports all the records from the Customers table.
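The general form of the statement, with a list of attributes drawn from several tables and a multi-table condition, can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, loaded with the single customer, item, and order inserted in the earlier examples; the query text itself would be the same in MySQL, apart from the ? placeholder, which mysql.connector writes as %s.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the newDB database
connect = sqlite3.connect(":memory:")
cursor = connect.cursor()
cursor.execute("CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, "
               "CustLastName TEXT, CustFirstName TEXT)")
cursor.execute("CREATE TABLE items (ItemID TEXT PRIMARY KEY, "
               "ItemDesc TEXT, ItemPrice INTEGER)")
cursor.execute("CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, "
               "CustID INTEGER, ItemID TEXT, OrderYear INTEGER, "
               "OrderQty INTEGER)")
cursor.execute("INSERT INTO customers VALUES (1, 'John', 'Good')")
cursor.execute("INSERT INTO items VALUES ('100', 'Refrigerator', 600)")
cursor.execute("INSERT INTO orders VALUES (1, 1, '100', 2021, 15)")
connect.commit()

# Attributes from three tables, linked through the keys stored in orders
sqlString = ("SELECT CustLastName, ItemDesc, OrderQty "
             "FROM customers, items, orders "
             "WHERE customers.CustomerID = orders.CustID "
             "AND items.ItemID = orders.ItemID "
             "AND OrderYear = ?")
cursor.execute(sqlString, (2021,))
report = cursor.fetchall()
for row in report:
    print(row)
```

The two equality conditions in the WHERE clause pair each order with its customer and its item, so the report contains one line per order placed in the requested year.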
7.4.2 The SELECT Statement with a Simple Condition
The previous section demonstrated the use of simple SELECT statements to report on data of a
MySQL table. The complexity of the queries is limited only by the imagination and capabilities of
the programmer and the task at hand, since Python provides the facilities and support for highly
complex querying and reporting tasks. As a starting point for building more complex tasks, the following Python script invites the user to select a table from an example database and build a query
based on the selection. Next, it prompts the user for a particular attribute to base the condition
on, and for setting particular preferences for the condition depending on whether the attribute is
numerical or text-based:
import mysql.connector

# Provide the established database config
GUIDB = 'GuiDB'
config = {'user': "root", 'password': "root",
          'host': "localhost", 'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
    print("DB tables are: " + str(tables))
except mysql.connector.Error:
    print("There was a problem showing tables")

tableName = input("Enter the table selected: ")

# Show the table metadata
cursor.execute("DESC " + tableName)
columns = cursor.fetchall()
print("==================================================")
print("Selected table is: ", tableName)
print("==================================================")
print("Its attributes are:")
for row in columns:
    print(row)

# Select the attribute to build the condition
print("==================================================")
condAttribute = input("Enter the attribute to build the condition: ")
typeAttribute = input("Is it a numeric attribute or a text (Num/Text):")
# Default to no condition if the reply is neither Num nor Text
sqlStatementCondition = ""
if (typeAttribute == "Num"):
    minCond = int(input("Enter the min value for the attribute"))
    maxCond = int(input("Enter the max value for the attribute"))
    sqlStatementCondition = " WHERE "+str(condAttribute)+" >= "+ \
        str(minCond)+" AND "+str(condAttribute)+" <= "+str(maxCond)
elif (typeAttribute == "Text"):
    startingText = input("Enter the starting text of the value to "
                         "search for: ")
    sqlStatementCondition = " WHERE "+str(condAttribute)+" LIKE \'"+ \
        str(startingText) + "%\'"

# Show the current instance of the table
sqlStatement = "SELECT * FROM " + str(tableName) + sqlStatementCondition
print(sqlStatement)
cursor.execute(sqlStatement)
records = cursor.fetchall()
print("====================================")
print("The records in the table are: ")
print("====================================")
for row in records:
    print(row)
Output 7.4.2.a – Example 1: Conditionally reporting data based on user selection
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
Enter the attribute to build the condition: ItemPrice
Is it a numeric attribute or a text (Num/Text):Num
Enter the min value for the attribute300
Enter the max value for the attribute450
Select * from items where ItemPrice >= 300 and ItemPrice <= 450
The records in the table are:
('100', 'RF-100', 300)
('200', 'TV-LG100', 400)
('303', 'PC-3', 400)
Output 7.4.2.b – Example 2: Conditionally reporting data based on user selection
DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
Enter the attribute to build the condition: ItemDesc
Is it a numeric attribute or a text (Num/Text):Text
Enter the starting text of the value to search for: TV
Select * from items where ItemDesc like 'TV%'
The records in the table are:
('200', 'TV-LG100', 400)
('201', 'TV-Samsung 100', 550)
('202', 'TV-BenQ', 600)
In the output of Example 1 above, the user first selects the Items table. Next, a list of all the available attributes is presented to the user as a choice for the condition of the SELECT statement. The user selects ItemPrice and is prompted to state whether it is a numerical or text attribute. As the attribute is numerical, the script asks for the min and max values. In contrast, in the output of Example 2, the user selects an attribute that is text-based. Hence, the script offers a different set of prompts and statements, appropriate for the use of the SELECT statement with text-based conditions.
The reader should note that the SELECT statements in both cases are the same as those used in MySQL. The only challenge in this instance is that the programmer has to prepare the final SQL statement with the dynamic elements in place. As expected, if no dynamic elements are involved in the query (e.g., if the table and the condition are predefined), the preparation of the SELECT statement is less complicated.
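The preparation of the dynamic part of the statement can also be sketched as a small function that keeps the user's values out of the SQL text. This is an illustrative sketch, not the chapter's implementation: the helper name build_condition and its parameters are hypothetical, and it assumes the %s placeholder style accepted by the execute() method of a mysql.connector cursor.

```python
def build_condition(attribute, kind, known_columns,
                    low=None, high=None, prefix=""):
    """Hypothetical helper: return the WHERE clause (with %s placeholders)
    and the tuple of values that fill it."""
    # Only attributes reported by DESC may appear in the statement
    if attribute not in known_columns:
        raise ValueError("Unknown attribute: " + attribute)
    if kind == "Num":
        clause = " WHERE " + attribute + " >= %s AND " + attribute + " <= %s"
        return clause, (int(low), int(high))
    if kind == "Text":
        # The % wildcard travels inside the value, not inside the statement
        return " WHERE " + attribute + " LIKE %s", (prefix + "%",)
    # Any other reply means that no condition is applied
    return "", ()


item_columns = ["ItemID", "ItemDesc", "ItemPrice"]
num_clause, num_values = build_condition("ItemPrice", "Num", item_columns,
                                         low=300, high=450)
text_clause, text_values = build_condition("ItemDesc", "Text", item_columns,
                                           prefix="TV")
print("SELECT * FROM items" + num_clause, num_values)
print("SELECT * FROM items" + text_clause, text_values)
# With an open connection, each pair would be executed as:
#     cursor.execute("SELECT * FROM items" + num_clause, num_values)
```

Returning the clause and its values as a pair keeps the two cases of the script (numerical range and starting text) behind a single call, and further kinds of condition can be added as extra branches.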
7.4.3 The SELECT Statement Using GUI
Arguably, if one aims to develop a user-oriented application, it is necessary to wrap the application
with a user-friendly GUI. An extensive introduction to the most important GUI widgets (e.g., labels,
entry boxes, radio buttons, buttons) and their application is provided in earlier chapters of this
book. In the current context, it is assumed that the focus is on the creation of a grid-based layout that
will be used to host the results of the SQL queries. In such a case, a grid layout manager could be
used. The following Python script showcases the development and execution of a condition-based
MySQL SELECT query using a fully deployed GUI:
import mysql.connector
import tkinter as tk
from tkinter import ttk

global tableName, attributeName, radioButton, textVar
global minLabel, maxLabel, textualLabel; global textualEntry
global selectionsFrame, resultsFrame; global columnName, columnType
global minCondScale, maxCondScale; global tablesCombo, columnsCombo
global connect, cursor, config; global tables, columns
global minCond, maxCond; global minValue, maxValue, numCols

# Create the frame to select the table for the query and its attributes
def selectionGUI():
    global tables, columns; global tablesCombo, columnsCombo
    global tableName, radioButton, textVar
    global selectionsFrame, resultsFrame
    global minLabel, maxLabel, textualLabel
    global minCondScale, maxCondScale; global textualEntry

    # The frame for the query selections of the user
    selectionsFrame = tk.LabelFrame(winFrame, text = 'Query selections')
    selectionsFrame.config(bg = 'light grey', fg = 'red', bd = 2,
        relief = 'sunken')
    selectionsFrame.grid(column = 0, row = 0)

    # Create the combobox to hold the tables available in the db
    tablesLabel = tk.Label(selectionsFrame,
        text = "Tables available:", bg = "light grey")
    tablesLabel.grid(column = 0, row = 0)
    tablesCombo = ttk.Combobox(selectionsFrame,
        textvariable = tableName, width = 15)
    tablesCombo['values'] = tables; tablesCombo.current(0)
    tablesCombo.grid(column = 1, row = 0)

    # Button updates the attributes combo based on the table selection
    updateAttributesButton = tk.Button(selectionsFrame,
        text = 'Update Attributes', relief = 'raised', width = 15)
    updateAttributesButton.bind('<Button-1>',
        lambda event: updateAttributes())
    updateAttributesButton.grid(column = 2, row = 0)

    # Create the button to run the query
    runButton = tk.Button(selectionsFrame, text = 'Run Query',
        relief = 'raised', width = 15)
    runButton.bind('<Button-1>', lambda event: runQuery())
    runButton.grid(column = 3, row = 0)

    # Update the columns combo based on the table selection
    columnsLabel = tk.Label(selectionsFrame,
        text = "Select attribute:", bg = "light grey")
    columnsLabel.grid(column = 0, row = 1)
    columnsCombo = ttk.Combobox(selectionsFrame,
        textvariable = attributeName, width = 15)
    columnsCombo.grid(column = 1, row = 1)

    # Check whether selected attribute is numeric or text
    numericalAttribute = tk.Radiobutton(selectionsFrame,
        text = 'Numerical\nattribute', width = 10, height = 2,
        bg = 'light green', variable = radioButton, value = 1,
        command = radioClicked).grid(column = 2, row = 1)
    textAttribute = tk.Radiobutton(selectionsFrame,
        text = 'Text\nattribute', width = 10, height = 2,
        bg = 'light green', variable = radioButton, value = 2,
        command = radioClicked).grid(column = 3, row = 1)
    radioButton.set(1)

    # Create the GUI for the numerical conditional parameters
    minLabel = tk.Label(selectionsFrame, text = "Min value:",
        bg = "light grey")
    minLabel.grid(column = 0, row = 4); minLabel.grid_remove()
    minCond = tk.IntVar()
    minCondScale = tk.Scale(selectionsFrame, length = 200,
        from_ = 0, to = 10000)
    minCondScale.config(resolution = 10,
        activebackground = 'dark blue', orient = 'horizontal')
    minCondScale.config(bg = 'light blue', fg = 'red',
        troughcolor = 'cyan', command = onScaleMin)
    minCondScale.grid(column = 1, row = 4); minCondScale.grid_remove()

    maxLabel = tk.Label(selectionsFrame, text = "Max value:",
        bg = "light grey")
    maxLabel.grid(column = 2, row = 4); maxLabel.grid_remove()
    maxCond = tk.IntVar()
    maxCondScale = tk.Scale(selectionsFrame, length = 200,
        from_ = 0, to = 10000)
    maxCondScale.config(resolution = 10, activebackground = 'dark blue',
        orient = 'horizontal')
    maxCondScale.config(bg = 'light blue', fg = 'red',
        troughcolor = 'cyan', command = onScaleMax)
    maxCondScale.grid(column = 3, row = 4); maxCondScale.grid_remove()

    # Create the GUI for the textual parameters
    textualLabel = tk.Label(selectionsFrame,
        text = "Enter text to find:", bg = "light grey")
    textualLabel.grid(column = 0, row = 5); textualLabel.grid_remove()
    textVar = tk.StringVar()
    textualEntry = ttk.Entry(selectionsFrame,
        textvariable = textVar, width = 20)
    textualEntry.grid(column = 1, row = 5); textualEntry.grid_remove()

# Update the attributes combo box based on the table selection
def updateAttributes():
    global cursor; global tableName, textVar; global tables, columns
    global tablesCombo, columnsCombo; global numCols
    global columnName, columnType; global minCondScale, maxCondScale

    try:
        # Show the selected table metadata
        if (str(tableName.get()) != ""):
            sqlString = "DESC " + str(tableName.get())
            cursor.execute(sqlString)
            columns = cursor.fetchall()

            # Reformat the columns list to new useful ones
            numCols = len(columns)
            columnName = []; columnType = []
            for i in range(numCols):
                columnName.append(columns[i][0])
                columnType.append(columns[i][1])
                columns[i] = str(columns[i][0]) + " " + \
                    str(columns[i][1])
            columnsCombo['values'] = columns
            columnsCombo.current(0)
    except mysql.connector.Error:
        print("There was a problem showing the attributes")

# Run the query built from the selections of the user
def runQuery():
    global cursor; global tableName; global tables, columns
    global columnsCombo; global numCols, numRows
    global selectedAttribute; global columnName, columnType
    global minValue, maxValue; global resultsFrame

    # Empty the results list and the results frame
    records = []
    if (resultsFrame is not None):
        resultsFrame.destroy()

    # Prepare the query to run
    selectedIndex = columnsCombo.current()
    if (radioButton.get() == 1):
        sqlStatementCondition = " WHERE " + \
            str(columnName[selectedIndex]) + \
            " >= " + str(minValue) + " AND " + \
            str(columnName[selectedIndex]) + \
            " <= " + str(maxValue)
    elif (radioButton.get() == 2):
        startingText = str(textVar.get())
        sqlStatementCondition = " WHERE " + \
            str(columnName[selectedIndex]) + \
            " LIKE \'" + str(startingText) + "%\'"

    # The frame for the results of the query
    resultsFrame = tk.LabelFrame(winFrame, text = "Query data")
    resultsFrame.config(bg = 'light grey', fg = 'red', bd = 2,
        relief = 'sunken')
    resultsFrame.grid(column = 0, row = 1)

    # Show the current instance of the table
    sqlStatement = "SELECT * FROM " + str(tableName.get()) + \
        sqlStatementCondition
    cursor.execute(sqlStatement)
    records = cursor.fetchall()

    numRows = len(records)

    for i in range(numRows):
        for j in range(numCols):
            # Create the labels to display the columns of results
            newLabel = tk.Label(resultsFrame, width = 24)
            if (i%2 == 0):
                newLabel.config(text = records[i][j],
                    bg = "light grey", relief = "sunken")
            else:
                newLabel.config(text = records[i][j],
                    bg = "light cyan", relief = "sunken")
            newLabel.grid(column = j, row = i)

# Display/hide the relevant conditional parameters depending on
# the type of the attribute
def radioClicked():
    global minLabel, maxLabel; global minCondScale, maxCondScale
    global textualLabel, textualEntry

    if (radioButton.get() == 1):
        minLabel.grid(); minCondScale.grid(); maxLabel.grid()
        maxCondScale.grid(); textualLabel.grid_remove()
        textualEntry.grid_remove()
    if (radioButton.get() == 2):
        minLabel.grid_remove(); minCondScale.grid_remove()
        maxLabel.grid_remove()
        maxCondScale.grid_remove(); textualLabel.grid()
        textualEntry.grid()

# Define the method to control the min condition value
def onScaleMin(val):
    global minValue
    minValue = int(val)

# Define the method to control the max condition value
def onScaleMax(val):
    global maxValue
    maxValue = int(val)

#====================================================================
# Provide the established database config
GUIDB = 'GuiDB'
config = {'user': "root", 'password': "root", 'host': "localhost",
          'database': "newDB"}

# Connect to the newDB database
connect = mysql.connector.connect(**config)
cursor = connect.cursor()

# Basic window frame with the title through tk.Tk() constructor
winFrame = tk.Tk()
winFrame.config(bg = "grey")
winFrame.title("Queries through GUIs")

try:
    # Attempt to show the tables of the newDB database
    cursor.execute("SHOW TABLES")
    tables = cursor.fetchall()
except mysql.connector.Error:
    print("There was a problem with reporting the tables")

tableName = tk.StringVar()
attributeName = tk.StringVar()
radioButton = tk.IntVar()
resultsFrame = None
# Start both limits at 0, the initial position of the two scales
minValue = 0; maxValue = 0

updateAttributes()
selectionGUI()

winFrame.mainloop()
Output 7.4.3.a: Using the grid layout manager with a numerical condition query
Output 7.4.3.b: Using the grid layout manager with a text-based condition query
Conceptually, this script is divided into four parts. The first part provides the GUI elements through the selectionGUI() function. This covers the main body of the GUI but excludes the grid where the query data will be reported. When running the application, the user must perform the following actions:
1. Select a table from the connected database through the relevant combo box.
2. Update the combo box using the attributes of the selected table.
3. Select the attribute upon which the condition for the query will be based.
4. Identify whether the attribute is numerical (int) or text-based (char).
The second part (lines 165–180) provides the necessary functionality for the user to be able to
decide the type of the attribute, through the selection of the relevant radio button. This provides
the appropriate partial interface that will enable the creation of the condition. The reader should
note how the selection causes the partial interfaces to appear/disappear and be replaced by the most
appropriate option based on the selection. This can be further enhanced and automated to include
as many conditions as needed.
In the third part, function updateAttributes() (lines 93–116) is used to update the attributes combo box based on the selected table. Functions onScaleMin() and onScaleMax()
(lines 182–190) are also part of this process, as they allow the user to determine the limits of the
condition when a numerical attribute is selected.
Arguably, the most important part of the application is the runQuery() function (lines 118–
163). The function firstly prepares the query based on the user’s preferences, and subsequently runs
it based on the prepared condition. Upon execution, the data grid is displayed as required, with a
number of columns dictated by the results of the query. The grid is merely an arrangement of a
sequence of columns (i.e., per line of the grid layout manager) that is created on-the-spot and loaded
with the results of the previously executed query.
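The query preparation step described above can be sketched as follows. This is a minimal illustration only, not the runQuery() function of the listing: the function name build_query(), its parameters, and the Employee/Salary names are hypothetical. It assumes the %s placeholder style used by mysql.connector, so that the condition values are passed separately from the SQL text:

```python
def build_query(table, attribute, numeric, min_value=None, max_value=None, text=None):
    """Return (sql, params) for a SELECT statement with a single condition.

    Hypothetical helper: not part of the chapter's script.
    """
    # Table and column names cannot be passed as query parameters,
    # so they are validated before being placed in the SQL text.
    for name in (table, attribute):
        if not name.isidentifier():
            raise ValueError(f"Invalid identifier: {name!r}")
    if numeric:
        # Numerical attribute: condition is a range between two limits
        sql = f"SELECT * FROM `{table}` WHERE `{attribute}` BETWEEN %s AND %s"
        params = (min_value, max_value)
    else:
        # Text-based attribute: condition is an exact match
        sql = f"SELECT * FROM `{table}` WHERE `{attribute}` = %s"
        params = (text,)
    return sql, params

sql, params = build_query("Employee", "Salary", True, 1000, 5000)
print(sql, params)
```

With an open cursor, the pair would be passed on as cursor.execute(sql, params), leaving the quoting of the condition values to the connector.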
In relation to appearance and aesthetics, the reader should also note how the variation of the
background color of each new line creates a specific color theme for the grid. It must be stressed
that, in this particular application, the grid consists of labels and it is, thus, not possible to work
on it directly. If a different widget were to be used instead (e.g., entry boxes), the contents would be
editable and processing (e.g., updating the value of a particular attribute on selected table records)
could be applied to the data directly through the grid.
The simple application presented here is just a sample of the use and functionality of the SQL
and GUI features provided by Python. As mentioned, SQL provides numerous options and possibilities, and this is reflected in the virtually limitless potential when designing and implementing
database applications in Python or other compatible programming languages.
7.5 CASE STUDY
Create an application that provides the following functionality:
a. Prompt the user for their credentials and the name of the MySQL database to connect to.
Display a list of the tables that are available in the connected database in a status bar form
at the bottom of the application window (Hint: A label can be used for this purpose).
b. Allow the user to define a new table and set the number of its attributes. Based on user
selection, create the interface required for the specifications of the attributes in the new
table (i.e., attribute name, type and size, primary or foreign key designation). The interface
should be created on-the-spot.
The application must use a GUI interface and the MySQL facilities for the database element.
7.6 EXERCISES
Based on the Employee example, write Python scripts to perform the following tasks using
MySQL:
1. Create table DEPT to host departmental data for a company, with the following attributes:
a. Code → DeptNo, Number (2), not null, primary key.
b. Department name → Dname, 20 characters.
2. Create table EMP to host employee data, with the following attributes:
c. Code → Empno, Number (4), not null, primary key.
d. Name (Last and First) → Ename, 40 characters.
e. Job → Job, 10 characters.
f. Manager Code → Mgr, Number (4), internal foreign key to Emp → Empno.
g. Date Hired → Hiredate, date.
h. Monthly salary → Sal, Number (7, 2), between 100 and 10,000.
i. Department code → DeptNo, Number (2), foreign key to Dept → DeptNo.
3. Alter table DEPT to include the following attribute: Location → DLocation, 20
characters.
4. Alter table EMP to include the following attribute: Sales Commission → Comm, Number (7, 2),
no more than Sal.
5. Insert five records into DEPT.
6. Insert ten records into EMP, two for each department.
7. Delete the record of the department entered last.
REFERENCES
APACHE. (2021). APACHE Software Foundation. https://apache.org.
Elmasri, R., & Navathe, S. (2017). Fundamentals of Database Systems (Vol. 7). Pearson, Hoboken, NJ.
MAMP. (2021). Download MAMP & MAMP PRO. https://www.mamp.info/en/downloads.
MySQL. (2021). Oracle Corporation. https://www.mysql.com.
Oracle. (2021a). MySQL Documentation. https://dev.mysql.com/doc.
Oracle. (2021b). Oracle.com.
8 Data Analytics and Data Visualization with Python
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Han-­I Wang
The University of York
Christos Manolas
The University of York
Ravensbourne University London
CONTENTS
8.1 Introduction
8.2 Importing and Cleaning Data
8.2.1 Data Acquisition: Importing and Viewing Datasets
8.2.2 Data Cleaning: Delete Empty or NaN Values
8.2.3 Data Cleaning: Fill Empty or NaN Values
8.2.4 Data Cleaning: Rename Columns
8.2.5 Data Cleaning: Changing and Resetting the Index
8.3 Data Exploration
8.3.1 Data Exploration: Counting and Selecting Columns
8.3.2 Data Exploration: Limiting/Slicing Dataset Views
8.3.3 Data Exploration: Conditioning/Filtering
8.3.4 Data Exploration: Creating New Data
8.3.5 Data Exploration: Grouping and Sorting Data
8.4 Descriptive Statistics
8.4.1 Measures of Central Tendency
8.4.2 Measures of Spread
8.4.3 Skewness and Kurtosis
8.4.4 The describe() and count() Methods
8.5 Data Visualization
8.5.1 Continuous Data: Histograms
8.5.2 Continuous Data: Box and Whisker Plot
8.5.3 Continuous Data: Line Chart
8.5.4 Categorical Data: Bar Chart
8.5.5 Categorical Data: Pie Chart
8.5.6 Paired Data: Scatter Plot
8.6 Wrapping Up
8.7 Case Study
References
DOI: 10.1201/9781003139010-8
8.1 INTRODUCTION
Python is one of the most popular modern programming languages for data analytics, data visualization, and data science tasks in general. Indeed, much of its reputation as a programming language comes from its efficiency in such tasks and the wealth of related facilities and tools it provides. Its power in addressing data analytics problems comes from its numerous third-party libraries, including Pandas, NumPy, Matplotlib, SciPy, and Seaborn. These libraries provide functionality to read data from a variety of sources, clean data, and perform descriptive and inferential statistics operations. In addition, the libraries provide data visualization facilities, supporting the generation of all types of charts based on the data at hand. Finally, the platform is capable of performing the aforementioned tasks on large collections of data, a task commonly referred to as big data analytics.
Observation 8.1 – Data Analytics: Analysis of data from various sources to produce meaningful results that aid the process of decision-making.
Observation 8.2 – Data Visualization: The process of illustrating the results of data analytics through visual means.
Observation 8.3 – Big Data: Data obtained from a large variety of sources, at great velocity, in large volumes, and in a variety of formats.
A formal definition of the term data analytics may be difficult to come up with, as it is a relatively new and rather broad concept in the contemporary business and academic context. However, a possible description could be that the term refers to the efficient analysis of data from various sources to produce meaningful results that aid the process of decision-making. If this were to be extended in order to also capture big data analytics, the associated data would be expected to come from a large variety of sources, at great velocity (i.e., speed), in vast volumes, and in a wide variety of formats, as pointed out in relevant, contemporary literature. The term data visualization, another relatively new concept, refers to common mechanisms of illustrating the results of data analytics in the form of various charts, available as visual tools or through built-in methods in programming libraries.
A quick look into any book or resource related to data analytics would unveil that the process
is more or less the same, with any minor variations most likely having to do with the terminology
rather than the functionality and structure. The latter includes the following seven steps:
• Research Objectives/Research Question(s): The first part of the data analytics process is
frequently omitted, as it can be deemed as an obvious step. However, it is the most essential part of the process and requires effort to develop. To complicate things further, it is a
task of purely investigative nature, so limited support is available in terms of specific and
automated tools. It is basically a process seeking to establish the objectives and questions
the process is aiming to address for the task at hand at any given instance. It is beyond the
scope of this chapter to address these concepts in more detail. For more information, the
reader is encouraged to refer to literature related to research methods and methodologies.
• Data Acquisition: The process of reading data stored in a variety of formats and sources,
including spreadsheets, comma separated files, web pages, and databases. Once the data
is read, it is stored in a specific type of variable called data frame for further processing.
• Cleaning Data: While the collection of complete and error-free data during the acquisition process is highly desirable, this is seldom the case. Given that the data are entered
by users who are often not familiar with the data entry process, it is highly probable and
expected to encounter such problems. The process of data cleaning focuses on the removal
of these types of errors.
• Exploratory Analysis: This is a process that comes after data cleaning, with the aim of
identifying and summarizing the main characteristics of the data. It often involves the
application of descriptive statistics methods and analysis.
• Modeling and Validation: This process involves the deployment of advanced tools and
techniques, such as machine learning, for building models relating to the data. This task
covers broad and deep areas of study and expertise that are beyond the scope of this chapter.
• Visualizing Results: This task relates to the use of various facilities and programming
libraries to create charts that help in visualizing the data and assisting in the process of
decision-making.
• Reporting: Writing up the final reports relating to the data, including any conclusions
and recommendations.
It is apparent that the process involves various fields of expertise, including databases and data mining, artificial intelligence/machine learning, statistics, social science, and others. It is this interdisciplinary nature of the overall process that results in the widely used data science term.
Observation 8.4 – Data Science: An interdisciplinary field that involves databases and data mining, AI/machine learning, statistics, social sciences and other relevant means to analyze and interpret data.
As mentioned, the main Python packages and libraries used for data processing and visualization are Pandas, NumPy, Matplotlib, SciPy, and Seaborn. More specifically, the main characteristics of these libraries are the following:
• NumPy: A library optimized for working with single and multi-dimensional arrays. A tool
suitable for machine learning and statistical analysis tasks.
• Pandas: An easy-to-use, open-source library that is based on NumPy. It works particularly
well with one- and two-dimensional data (Series and DataFrame respectively). It is a good
choice for statistical analysis tasks.
• SciPy: Another library based on NumPy. It offers additional functionality compared
to NumPy, making it a solid choice for both machine learning and statistical analysis
tasks.
• Matplotlib: A low level plotting library suitable for creating basic graphs. While it provides a lot of freedom to the programmer, it may be rather demanding in terms of coding
requirements. One must be also aware of the fact that Matplotlib does not perform data
analysis itself. As such, any analysis needs to be completed prior to plotting.
• Python’s Statistics: A built-in Python library for descriptive statistics. It works rather well
when datasets are not too large (Statistics — Mathematical Statistics Functions, 2021).
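As a brief illustration of the last item in the list above, the following sketch applies three functions of the built-in statistics module to a small list of values. The list itself is an assumption made for the example (five of the final grades that appear later in this chapter), not a complete dataset:

```python
import statistics

# Illustrative sample only: five final grades
grades = [58.57, 65.90, 69.32, 72.02, 73.68]

mean_grade = statistics.mean(grades)      # arithmetic mean
median_grade = statistics.median(grades)  # middle value of the sorted data
stdev_grade = statistics.stdev(grades)    # sample standard deviation

print("Mean:", round(mean_grade, 2))
print("Median:", median_grade)
print("Standard deviation:", round(stdev_grade, 2))
```

No installation is needed for this module, which makes it convenient for quick checks before moving to Pandas for larger datasets.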
In this chapter, the reader will have the opportunity to acquire basic skills required for cleaning and describing data, and performing data visualization, while becoming familiar with some of
the most popular libraries associated with these tasks. This chapter is divided into four main
sections:
• Data Acquisition and Cleaning: Import, re-arrange, and clean data from various types
of sources.
• Data Exploration: Report data by selecting, sorting, filtering, grouping, and/or re-calculating
rows/columns, as necessary.
• Data Processing/Descriptive Statistics: Apply simple descriptive statistics on the data
frame.
• Data Visualization: Use the available methods from the various Python packages for data
visualization.
Excel files Grades.xlsx and Grades2.xlsx are used for the various examples presented throughout
this chapter.
8.2 IMPORTING AND CLEANING DATA
Before discussing the process of importing data for analysis, there are two key terms that need to be presented: arrays/lists and data frames. Unlike other common programming languages like C++ or Java, Python has no dedicated built-in array type in the core language. Instead, this functionality is provided by the list object, as discussed in Chapter 2. As a quick reminder, a list is an ordered sequence of values that share the same name and are distinguished only by their index. In data analysis the values usually have the same data type, although a Python list does not enforce this.
Observation 8.5 – Data Frame: Typically, a two-dimensional data structure with rows representing the data records. Records are divided into columns, and indices are used to speed up the searching process within the data frame.
A data frame is a data structure that resembles a relational database table, or an Excel spreadsheet consisting of rows and columns. The rows correspond to the actual records of the data frame
and are accessed by their index number. The columns correspond to the attributes/columns/fields
in a database table and are accessed by their names. By default, the index is displayed as the leftmost column of a data frame,
with values starting at zero.
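The structure described above can be illustrated with a small data frame built directly in memory. This is a minimal sketch: the variable names are assumptions, and the values are the first three records of two columns from the dataset used later in this section:

```python
import pandas as pd

# Build a small data frame from a dictionary: keys become column names
grades = pd.DataFrame({
    "Final Exam": [50.5, 49.0, 63.5],
    "Project": [55, 90, 80],
})

first_record = grades.loc[0]         # a row, accessed by its index number
project_column = grades["Project"]   # a column, accessed by its name

print(grades)
print(grades.index.tolist())     # default index starts at zero
print(grades.columns.tolist())   # column names
```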
8.2.1 Data Acquisition: Importing and Viewing Datasets
The Pandas library is required in order to create the object used to both read the data from the source and create the data frame to which data analysis will be applied. Various sources and formats of data are supported, including Excel and Comma Separated Values (CSV) files, tables, plain text, databases, or web-based sources. In all cases, the basic process of reading from the source remains the same. However, the method and the parameters used may vary slightly, depending on the source.
Observation 8.6 – The Pandas Library: The Pandas library provides support for the creation of objects that can be used for various data analytics tasks.
Observation 8.7 – Reading from data sets: Use the read_excel(), read_csv(), or read_html() methods to import (read) data from Excel, CSV, or html files into the data frame.
In the case of reading data from Excel files, the general syntax is the following:
<name of data frame> = <name of Pandas object>.read_excel("<Filename>", sheet_name = "<Sheet name>")
The following example demonstrates the process of reading data from a particular spreadsheet
(Grades 2020) within an Excel file (Grades.xlsx):
import pandas as pd
dataset = pd.read_excel("Grades.xlsx", sheet_name = "Grades 2020")
print(dataset)
Output 8.2.1.a:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
1         65.90        49.0    89.0    63.0          54.0       90
2         69.32        63.5    73.0    54.7          70.0       80
3         72.02        60.5    99.0    74.7          76.0       70
4         73.68        74.0    84.0    53.3          64.0       87
5         61.32        45.5    94.0    42.7          66.0       70
6         67.87        66.5    73.0    53.7          54.0       87
7         75.57        66.0    94.0    58.7          92.0       70
8         61.28        50.5    84.0    37.3          58.0       78
9          0.00         NaN     NaN     NaN           NaN       69
10        62.35        48.0    78.0    49.0          70.0       71
11        66.13        61.0    83.0    45.3          70.0       70
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
14         0.00         NaN     NaN     NaN           NaN       75
15        62.62        45.5    78.0    56.7          72.0       70
16         0.00         NaN     NaN     NaN           NaN        0
The above script reports 17 rows/records (indexed 0 to 16) across 6 columns. A few key things are noteworthy in
the script output. Firstly, the name of the read_excel() method is case sensitive. This is in
line with the general Python syntax rule for methods and statements used in data analytics tasks.
Secondly, as mentioned, it is highly unlikely to deal with perfect, clean data during data analysis.
More often than not, one has to deal with erroneous, corrupt, or missing data. The latter applies
to both designated NaN entries and empty cells. Fortunately, there are easy ways to tackle such
problems, some of which are described in the following sections. Finally, it is worth mentioning that in order to report a given dataset the print() method can be used. The method comes
in handy in several situations related to reporting data from datasets and it is further discussed later
in this chapter.
In the case of reading data from a flat CSV file, the general syntax is the following:
<name of data frame> = <name of Pandas object>.read_csv("<Filename.csv>",
delimiter = ',')
The following script reads and reports the data included in file Grades2.csv:
import pandas as pd
dataset = pd.read_csv('Grades2.csv', delimiter = ',')
dataset
Output 8.2.1.b:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         67.47        59.0      70    72.7            70       72
1         75.13        61.5      76    68.3            82       87
2         66.85        77.5      84    52.0            40       80
3         54.45        34.5      62    44.0            44       90
4         76.95        66.5      68    67.0            82       92
5         45.13        26.0      52    26.3            50       68
6         73.23        63.5      96    68.3            62       89
7         81.87        83.0      97    82.7            84       72
8         62.63        54.5      54    31.3            64       87
9         58.75        46.5      54    39.0            52       90
10        49.75        27.5      48    37.0            62       70
11        44.25        21.5      55    18.0            42       80
12        62.52        31.0      85    54.7            68       89
13        47.33        16.5      38    33.3            52       89
14        68.97        55.0      65    49.7            70       94
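The read_csv() call can also be tried without a file on disk. The sketch below is an illustration under stated assumptions: the CSV text is a made-up excerpt with three of the columns of Grades2.csv, and io.StringIO is used to make a string behave like an open file:

```python
import io
import pandas as pd

# Made-up excerpt: a header line followed by two records
csv_text = """Final Grade,Final Exam,Quiz 1
67.47,59.0,70
75.13,61.5,76
"""

# read_csv() accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text), delimiter=",")
print(dataset)
```

Note that the delimiter is a single comma with no trailing space; the header line supplies the column names, and the index is generated automatically.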
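In the case of reading data from a web page, the general syntax is the following. Note that read_html() returns a list of data frames, one for each table found on the page, rather than a single data frame:
<name of list of data frames> = <name of Pandas object>.read_html("<url>")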
8.2.2 Data Cleaning: Delete Empty or NaN Values
There are two main techniques to clean a dataset. One has to do with correcting erroneous data and the other with dealing with missing values. The cleaning process may include the partial or complete deletion of the related rows or the replacement of cells that contain missing data with specific calculated or predefined values.
Observation 8.8 – Drop NaN or Empty Values: Use the dropna() method to delete rows with NaN or empty values from a data frame. The method can be used with the how parameter, which takes the values "any" (the default) or "all".
In the case of the former, there are two possible scenarios. Rows may contain missing or designated NaN values, in some or all of their columns. If it is decided to delete all the rows that contain missing data, the following syntax should be used:
<name of new Data Frame> = <name of original Data Frame>.dropna()
The following script demonstrates the application of the dropna() method that deletes all rows
with cells that include NaN values:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dframe_no_missing_data = dataset.dropna()
dframe_no_missing_data
Using the dropna(how = "any") method form instead of the simple dropna() form will
produce the same result, since "any" is the default value of the how parameter: any row that contains at least one NaN or empty value is deleted.
The full syntax in this case is very similar to the previous one:
<name of new Data Frame> = <name of original Data Frame>.dropna(how = "any")
The following Python script provides an example of this method applied to the same data frame:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dframe_delete_rows_with_any_na_values = dataset.dropna(how = "any")
dframe_delete_rows_with_any_na_values
Output 8.2.2.a:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
1         65.90        49.0    89.0    63.0          54.0       90
2         69.32        63.5    73.0    54.7          70.0       80
3         72.02        60.5    99.0    74.7          76.0       70
4         73.68        74.0    84.0    53.3          64.0       87
5         61.32        45.5    94.0    42.7          66.0       70
6         67.87        66.5    73.0    53.7          54.0       87
7         75.57        66.0    94.0    58.7          92.0       70
8         61.28        50.5    84.0    37.3          58.0       78
10        62.35        48.0    78.0    49.0          70.0       71
11        66.13        61.0    83.0    45.3          70.0       70
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
15        62.62        45.5    78.0    56.7          72.0       70
The reader should note that 3 of the 17 original rows (those with index 9, 14, and 16) are deleted from the data frame as a result of
running the two versions of the script, irrespective of whether the dropna() or dropna(how =
"any") method form is used.
If it is decided to delete only the rows with all columns containing NaN or empty values, the following syntax of the dropna() method should be used:
<name of new Data Frame> = <name of original Data Frame>.dropna(how = "all")
The following script and its output demonstrate the use of the dropna() method, with parameters
that result in the deletion of rows consisting exclusively of cells with NaN values. Note that none of
the 17 original rows are deleted from the data frame as a result of the method call, since every row contains a value in at least the Final Grade and Project columns.
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dframe_delete_rows_with_all_na_values = dataset.dropna(how = "all")
dframe_delete_rows_with_all_na_values
Output 8.2.2.b:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
1         65.90        49.0    89.0    63.0          54.0       90
2         69.32        63.5    73.0    54.7          70.0       80
3         72.02        60.5    99.0    74.7          76.0       70
4         73.68        74.0    84.0    53.3          64.0       87
5         61.32        45.5    94.0    42.7          66.0       70
6         67.87        66.5    73.0    53.7          54.0       87
7         75.57        66.0    94.0    58.7          92.0       70
8         61.28        50.5    84.0    37.3          58.0       78
9          0.00         NaN     NaN     NaN           NaN       69
10        62.35        48.0    78.0    49.0          70.0       71
11        66.13        61.0    83.0    45.3          70.0       70
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
14         0.00         NaN     NaN     NaN           NaN       75
15        62.62        45.5    78.0    56.7          72.0       70
16         0.00         NaN     NaN     NaN           NaN        0
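The difference between the two values of the how parameter is easier to see on a very small data frame. The sketch below uses made-up data (not the Grades.xlsx file): one complete row, one partially missing row, and one entirely missing row:

```python
import pandas as pd

nan = float("nan")
frame = pd.DataFrame({
    "Final Exam": [50.5, nan, nan],
    "Project": [55, 69, nan],
})

rows_any = frame.dropna(how="any")  # drops rows 1 and 2 (at least one NaN)
rows_all = frame.dropna(how="all")  # drops row 2 only (every value is NaN)

print(rows_any)
print(rows_all)
```

In both cases the original data frame is left unchanged, and the surviving rows keep their original index values.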
8.2.3 Data Cleaning: Fill Empty or NaN Values
It is often the case that empty cells or cells with NaN values are filled with either predefined values or values calculated based on the rest of the data. In such cases, instead of the dropna() method (in any of its forms), one can use the fillna(<value>[, inplace = True]) method. Unlike dropna(), the fillna() method does not accept a how parameter. The general syntax takes one of the following two forms, the first returning a new data frame and the second modifying the original one:
<name of new Data Frame> = <name of original Data Frame>.fillna(<value>)
<name of original Data Frame>.fillna(<value>, inplace = True)
Observation 8.9 – Fill NaN or Empty Values: Use the fillna() method to define replacement values for any NaN or empty values encountered.
The value can be defined before running the script, based on existing dataset values and/or other
calculations (e.g., using the mean of the existing data in the same column). If set to True, the inplace parameter
applies the change directly to the data frame on which the method is called, in which case the method returns None rather than a new data frame. Explicitly passing False
is redundant, since it is the default value when inplace is not used.
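The option of using a calculated replacement value can be sketched as follows. The data are made up for the example; the mean() method skips NaN values by default, so the missing cell is replaced with the mean of the remaining values in the same column:

```python
import pandas as pd

frame = pd.DataFrame({"Quiz 1": [76.0, float("nan"), 84.0]})

# mean() ignores NaN values: (76.0 + 84.0) / 2
quiz_mean = frame["Quiz 1"].mean()

# A dictionary maps each column name to its own replacement value
filled = frame.fillna({"Quiz 1": quiz_mean})

print(filled)
```

Since inplace is not used here, the result is a new data frame and the original one still contains the NaN value.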
The following script and its output demonstrate the use of the fillna() method, while also
applying the inplace parameter to enable the permanent change of the data. The replacement value
used for empty or missing values in this example is zero. The reader should note that the
inplace parameter affects only the data frame held in memory as a result of the execution of the script, and not
the data source (i.e., the Excel file):
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dataset.fillna(0, inplace = True)
dataset
Output 8.2.3:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
1         65.90        49.0    89.0    63.0          54.0       90
2         69.32        63.5    73.0    54.7          70.0       80
3         72.02        60.5    99.0    74.7          76.0       70
4         73.68        74.0    84.0    53.3          64.0       87
5         61.32        45.5    94.0    42.7          66.0       70
6         67.87        66.5    73.0    53.7          54.0       87
7         75.57        66.0    94.0    58.7          92.0       70
8         61.28        50.5    84.0    37.3          58.0       78
9          0.00         0.0     0.0     0.0           0.0       69
10        62.35        48.0    78.0    49.0          70.0       71
11        66.13        61.0    83.0    45.3          70.0       70
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
14         0.00         0.0     0.0     0.0           0.0       75
15        62.62        45.5    78.0    56.7          72.0       70
16         0.00         0.0     0.0     0.0           0.0        0
8.2.4 Data Cleaning: Rename Columns
It is sometimes required to change the column headings in a dataset. This is especially true in the case of formal reports, where clarity and appearance are key. In such cases, the rename() method is used. The method allows the column headings of the data frame to be changed without affecting the original dataset at the source.
Observation 8.10 – rename(): Use the rename() method to change the column heading appearance. Use the dictionary notation to dictate the old and new (temporary) column names.
The general syntax is the following:
df.rename(columns = {"oldname": "newname", } [, inplace=True])
As in the previous case, if the inplace parameter is used, the column names will be changed for
the resulting dataset, but the source data will not be affected. The most crucial aspect of the syntax
is that the programmer can change any number of column names just by separating them using
commas:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dataset_new = dataset.rename(columns = {"Final Grade": "Total Grade",
    "Quiz 1": "Test 1", "Quiz 2": "Test 2", "Midterm Exam": "Midterm"})
dataset_new
Output 8.2.4:
    Total Grade  Final Exam  Test 1  Test 2  Midterm  Project
0         58.57        50.5    76.0    70.7     60.0       55
1         65.90        49.0    89.0    63.0     54.0       90
2         69.32        63.5    73.0    54.7     70.0       80
3         72.02        60.5    99.0    74.7     76.0       70
4         73.68        74.0    84.0    53.3     64.0       87
5         61.32        45.5    94.0    42.7     66.0       70
6         67.87        66.5    73.0    53.7     54.0       87
7         75.57        66.0    94.0    58.7     92.0       70
8         61.28        50.5    84.0    37.3     58.0       78
9          0.00         NaN     NaN     NaN      NaN       69
10        62.35        48.0    78.0    49.0     70.0       71
11        66.13        61.0    83.0    45.3     70.0       70
12        69.43        50.0    80.0    49.3     90.0       76
13        82.60        74.0    94.0    65.0     86.0       92
14         0.00         NaN     NaN     NaN      NaN       75
15        62.62        45.5    78.0    56.7     72.0       70
16         0.00         NaN     NaN     NaN      NaN        0
The reader should note the use of the dictionary notation to declare the pairs of column names (i.e., old
and new) when changing them. It must be also noted that, unless inplace = True is used, the result
of the rename() method must be assigned to a new dataset for the change to be retained and reported.
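The two ways of applying rename() can be compared on a small, made-up data frame. In this sketch, the first call returns a renamed copy and leaves the original untouched, while the second one changes the column names of the data frame it is called on:

```python
import pandas as pd

frame = pd.DataFrame({"Quiz 1": [76.0], "Quiz 2": [70.7]})

# Form 1: assign the result to a new data frame
renamed = frame.rename(columns={"Quiz 1": "Test 1", "Quiz 2": "Test 2"})

# Form 2: change an existing data frame directly (a copy is used here
# so that the original frame remains available for comparison)
frame_copy = frame.copy()
frame_copy.rename(columns={"Quiz 1": "Test 1"}, inplace=True)

print(list(frame.columns))
print(list(renamed.columns))
print(list(frame_copy.columns))
```

Columns that are not listed in the dictionary, such as Quiz 2 in the second call, keep their original names.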
8.2.5 Data Cleaning: Changing and Resetting the Index
The index of a dataset is important, as it can speed up the process of data searching. This is particularly relevant when searching for or sorting data on a column of the dataset different than the one the focus is on. In such a case, it is convenient to temporarily change the indexed column to perform the task at hand, and return to the original state by resetting the index to its original column once this is completed.
Observation 8.11 – set_index(), reset_index(): Use the set_index() and reset_index() methods to set the index of the dataset to another column and restore it back to the original one.
The general syntax for changing and resetting the index in a dataset is the following:
<name of dataset>.set_index("<column name>" [, inplace=True])
<name of dataset>.reset_index([inplace=True])
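A short sketch of both methods is given below. The data frame is made up for the example, and the Student column is a hypothetical addition that does not exist in Grades.xlsx. Setting it as the index allows records to be looked up by name, and resetting the index restores the default numbering and turns Student back into a regular column:

```python
import pandas as pd

frame = pd.DataFrame({
    "Student": ["Ann", "Bob", "Cy"],        # hypothetical column
    "Final Grade": [58.57, 65.90, 69.32],
})

# Use the Student column as the index and look up a record by name
indexed = frame.set_index("Student")
bob_grade = indexed.loc["Bob", "Final Grade"]

# Restore the default zero-based index
restored = indexed.reset_index()

print(indexed)
print(bob_grade)
print(restored)
```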
8.3 DATA EXPLORATION
Data exploration is an umbrella term, encompassing processes used to report data in various different ways. For example, it may refer to the process of row/column selection for inclusion in the report,
or to facilities used to sort and/or filter data based on certain, defined conditions. If necessary, it
offers options to group the data in one or more columns and the functionality to create new columns
based on calculations on existing ones. This section will explore some of the most important concepts and methods related to data exploration.
8.3.1 Data Exploration: Counting and Selecting Columns
Three of the basic methods and parameters used in order to view the data of a dataset are len(), columns, and shape. The len() method reports the number of records in the dataset. The general syntax is the following:
len(<name of dataset>)
Observation 8.12 – len(): Use the len() method and the columns and shape attributes of a dataset to report the number of its records, the names of its attributes, and the number of its records and columns, respectively.
The columns attribute can be used to get a list of the available columns in the dataset, with the following syntax:
<name of dataset>.columns
Finally, the shape attribute can be used to report the number of records and columns in a dataset:
<name of dataset>.shape
The following script uses all three of the above, while also including a basic statement to display all
the data in the dataset:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
    "Midterm Exam", "Project"]]
dataset
len(dataset)
dataset.columns
dataset.shape
Output 8.3.1.a: Basic exploration methods without print
(17, 6)
It should be noted that the script does not display all the requested output. When the statements are run together in an interactive environment (e.g., a single notebook cell), only the value of the last expression is echoed, which here is the result of shape: the number of records and columns. If it is necessary to display
all the requested information, the print() function should be used, as in the amended version of
the script below:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
# Keep only the columns of interest (the selection must be assigned to take effect)
dataset = dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
                   "Midterm Exam", "Project"]]
print(dataset)
print("The dataset has", len(dataset), "records")
print("The columns in the dataset are:", dataset.columns)
# shape is a (records, columns) tuple
print("The number of records is:", dataset.shape[0])
print("The number of columns is:", dataset.shape[1])
Output 8.3.1.b: Basic exploration methods using print
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
1         65.90        49.0    89.0    63.0          54.0       90
2         69.32        63.5    73.0    54.7          70.0       80
3         72.02        60.5    99.0    74.7          76.0       70
4         73.68        74.0    84.0    53.3          64.0       87
5         61.32        45.5    94.0    42.7          66.0       70
6         67.87        66.5    73.0    53.7          54.0       87
7         75.57        66.0    94.0    58.7          92.0       70
8         61.28        50.5    84.0    37.3          58.0       78
9          0.00         NaN     NaN     NaN           NaN       69
10        62.35        48.0    78.0    49.0          70.0       71
11        66.13        61.0    83.0    45.3          70.0       70
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
14         0.00         NaN     NaN     NaN           NaN       75
15        62.62        45.5    78.0    56.7          72.0       70
16         0.00         NaN     NaN     NaN           NaN        0
The dataset has 17 records
The columns in the dataset are: Index(['Final Grade', 'Final Exam',
'Quiz 1', 'Quiz 2', 'Midterm Exam', 'Project'], dtype='object')
The number of records is: 17
The number of columns is: 6
As shown above, it is possible to improve the appearance of the output by adding appropriate text through
the print() function. The presentation of the results could be further improved with the
use of more elaborate presentation techniques and tools, such as an appropriate GUI.
8.3.2 Data Exploration: Limiting/Slicing Dataset Views
It is often the case that it is impractical to display all
the data in a single report. This is especially true when
working with very large datasets. In such cases, it is
preferable to display just a sample of the dataset, by
limiting the number of records and/or columns. There
are a number of methods that can be used for this task.
The head(n) and tail(n) methods restrict the number of displayed records to the first n or the last n records of
the dataset, respectively. The general syntax is the following:
<name of dataset>.head(<number of rows from the top>)
<name of dataset>.tail(<number of rows from the bottom>)
Observation 8.13 – head(), tail(): Use the head(n) and tail(n) methods to restrict the number of displayed records from the top and bottom of the dataset. Use the loc[] or iloc[] indexers to restrict the report to the specified rows and columns using labels or integer positions, respectively.
Plain slicing and the loc[] and iloc[] indexers can be used to restrict the displayed results based on specific rows
and/or columns:
<name of dataset>[<start record number>: <end record number> [: <step>]]
<name of dataset>.loc[<start record label>: <end record label> [: <step>], "<start column name>": "<end column name>"]
<name of dataset>.iloc[<start record number>: <end record number>, <start column index>: <end column index>]
The practical application of these methods and attributes is demonstrated in the following script:
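import pandas as pd
dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
# First and last five records
print(dataset.head(5))
print(dataset.tail(5))
# Every fifth record (the slice stops at the end of the dataset)
print(dataset[0:37:5])
# Label-based selection: both endpoints are included
print(dataset.loc[0:5,"Final Grade": "Final Exam"])
# Position-based selection: the end position is excluded
print(dataset.iloc[0:5,0:3])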
Output 8.3.2:
   Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        58.57        50.5    76.0    70.7          60.0       55
1        65.90        49.0    89.0    63.0          54.0       90
2        69.32        63.5    73.0    54.7          70.0       80
3        72.02        60.5    99.0    74.7          76.0       70
4        73.68        74.0    84.0    53.3          64.0       87
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
12        69.43        50.0    80.0    49.3          90.0       76
13        82.60        74.0    94.0    65.0          86.0       92
14         0.00         NaN     NaN     NaN           NaN       75
15        62.62        45.5    78.0    56.7          72.0       70
16         0.00         NaN     NaN     NaN           NaN        0
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0         58.57        50.5    76.0    70.7          60.0       55
5         61.32        45.5    94.0    42.7          66.0       70
10        62.35        48.0    78.0    49.0          70.0       71
15        62.62        45.5    78.0    56.7          72.0       70
   Final Grade  Final Exam
0        58.57        50.5
1        65.90        49.0
2        69.32        63.5
3        72.02        60.5
4        73.68        74.0
5        61.32        45.5
   Final Grade  Final Exam  Quiz 1
0        58.57        50.5    76.0
1        65.90        49.0    89.0
2        69.32        63.5    73.0
3        72.02        60.5    99.0
4        73.68        74.0    84.0
In the output, the reader will notice that with the application of head(5) and tail(5), only the first five
and the last five records of the dataset are displayed (with all their columns). Next, records are displayed
in intervals of five, starting from record zero; the end value of 37 exceeds the number of records, so the slice simply stops at the end of the dataset. The next section displays six records of the dataset using only the first three columns (inclusive of the index of
the dataset), since loc[] selects by label and includes both endpoints of each slice. In a similar way, the last section shows the first five records using only the first four
columns (inclusive of the index of the dataset), but the columns are specified by their integer positions and not
their names, and iloc[] excludes the end position of each slice. If it is required to report on non-sequential columns, these columns must be listed
in square brackets ([]) and separated by commas.
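The difference between label-based and position-based selection can be sketched on a small in-memory dataset. The values below are the first four records of the Grades output shown earlier, restricted to four columns; the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Final Grade": [58.57, 65.90, 69.32, 72.02],
    "Final Exam": [50.5, 49.0, 63.5, 60.5],
    "Quiz 1": [76.0, 89.0, 73.0, 99.0],
    "Project": [55, 90, 80, 70],
})

# loc[] works with labels and includes the end label: rows 0, 1 and 2
by_label = df.loc[0:2, "Final Grade":"Final Exam"]

# iloc[] works with positions and excludes the end position: rows 0 and 1
by_position = df.iloc[0:2, 0:2]

# Non-sequential columns (or rows) are passed as lists
picked = df.loc[0:2, ["Final Grade", "Project"]]
picked_by_position = df.iloc[[0, 3], [0, 3]]

print(by_label.shape, by_position.shape)
print(picked.columns.tolist(), picked_by_position.index.tolist())
```

The same slice 0:2 therefore yields three records with loc[] but only two with iloc[].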
8.3.3 Data Exploration: Conditioning/Filtering
As expected, Pandas also offers a set of methods that
allow for the filtering of the displayed data through conditioning. For instance, the unique() method returns each distinct value of the specified column once, in order of first appearance:
<name of dataset>["<name of column>"].unique()
Observation 8.14 – unique(): Use the unique() method and the square bracket ([]) notation to report unique data in a dataset based on a specified column and to set the conditions for the reported records.
It is also possible to define a particular condition that limits the displayed results, as in the case
of an if statement. The condition can be simple (single) or compound. The general syntax is the
following:
<name of dataset>[<condition>]
<name of dataset>[(<condition>) &/| (<condition>)]
The following script uses the data from the Grades.xlsx file to identify unique grades for the project,
and report all final grades higher than 80%, as well as those above 0% and below 60%:
import pandas as pd
dataset = pd.read_excel('Grades.xlsx')
# Distinct values of the Project column
print("Unique grades for project:", dataset["Project"].unique())
# Simple condition
print("Final grades more than 80%:\n",
      dataset[dataset["Final Grade"] > 80])
# Compound condition: each part in parentheses, combined with &
print("Final grades 1% to 60%:\n", dataset[(dataset["Final Grade"] > 0)
      & (dataset["Final Grade"] < 60)])
Output 8.3.3:
Unique grades for project: [55 90 80 70 87 78 69 71 76 92 75 0]
Final grades more than 80%:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
13         82.6        74.0    94.0    65.0          86.0       92
Final grades 1% to 60%:
   Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        58.57        50.5    76.0    70.7          60.0       55
The reader should note that it is possible to limit the displayed columns if the loc[] indexer is
also used, although this is not shown in the current script and its output. It is also worth mentioning
that, in a compound condition like the second one in the example, the and and or
keywords cannot be applied to whole columns; the & and | operators must be used instead, with each individual condition enclosed in parentheses.
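A short sketch of these points follows, using a small in-memory dataset with a few values taken from the Grades output above (the variable names are illustrative assumptions). It shows a simple condition, compound conditions built with & and |, and the use of loc[] to filter rows and limit columns in one step.

```python
import pandas as pd

df = pd.DataFrame({
    "Final Grade": [58.57, 82.60, 0.00, 69.43],
    "Project": [55, 92, 75, 76],
})

# Simple condition
high = df[df["Final Grade"] > 80]

# Compound conditions: parentheses around each part, & for "and", | for "or"
failing = df[(df["Final Grade"] > 0) & (df["Final Grade"] < 60)]
extremes = df[(df["Final Grade"] > 80) | (df["Final Grade"] == 0)]

# loc[] accepts the condition for the rows and a list for the columns
high_projects = df.loc[df["Final Grade"] > 80, ["Project"]]

print(high.index.tolist(), failing.index.tolist(), extremes.index.tolist())
print(high_projects)
```

Replacing & with the and keyword in the second statement would raise an error, because Python cannot reduce a whole column of True/False values to a single truth value.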
8.3.4 Data Exploration: Creating New Data
As part of the data exploration process, it is sometimes necessary to create new data. This can take
four different forms:
• Merging two or more datasets into one.
• Creating a new column with data derived from other available data sources, in the same
or other datasets.
• Creating a new column with data calculated from other available data sources, in the same
or other datasets.
• Creating a new file of a certain file type (e.g., Excel, CSV).
The append() method has traditionally been used to merge two or more datasets. The basic syntax is the following:
<name of new dataset> = <name of first old dataset>.append(<name of second old dataset>)
The reader should note that append() was deprecated in Pandas 1.4 and removed in Pandas 2.0. In recent versions, the same result is obtained with the concat() function:
<name of new dataset> = pd.concat([<name of first old dataset>, <name of second old dataset>])
To create a new column with values calculated based on data of other columns one can use the following command:
<name of dataset>["<name of new column>"] = expression with other columns
Observation 8.15 – Create New Column: Use the following expression and syntax to create a new column based on the values of other columns from the same or other datasets:
<name of dataset>["<name of new column>"] = expression with other columns
If the newly created column is based on certain conditions applied to data from other columns the following commands could be used instead:
<name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)
or
<name of dataset>["<name of new column>"] = np.select(<condition set>, <set of values>)
Observation 8.16 – Create a New Column Using np.where() or np.select(): Use Numpy's np.where() or np.select() functions and the following syntax to create a new column based on a simple or complex condition. This can include other columns from the same or other datasets:
<name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)
<name of dataset>["<name of new column>"] = np.select(<condition set>, <set of values>)
Finally, to create a new dataset and store it in a file, one
of the following command structures could be used. The
examples provided here cover Excel and CSV files, but
the same logic also applies to other data file formats.
Excel files:
<name of new Excel file object> = pd.ExcelWriter("<name of new Excel file>")
<name of dataset>.to_excel(<name of new Excel file object>, sheet_name="<sheet name>")
<name of new Excel file object>.close()
Older versions of Pandas used the save() method of the ExcelWriter object for the final step; this method was removed in Pandas 2.0 in favour of close().
CSV files:
<name of dataset>.to_csv("<name of new CSV file>")
Observation 8.17 – Create a New Excel File: Use the following syntax to create a new Excel file from a given dataset:
<name of new Excel file object> = pd.ExcelWriter("<name of new Excel file>")
<name of dataset>.to_excel(<name of new Excel file object>, sheet_name="<sheet name>")
<name of new Excel file object>.close()
Using the Grades.xlsx dataset as an example, student grades are stored in a particular section of
a course and in a particular semester. If another dataset for the same course but a different section exists in another file (e.g., Grades2.csv), it may be
useful to merge the two and perform the necessary processes in the newly created dataset.
Observation 8.18 – Create a New CSV File: Use the following syntax to create a new CSV file from a given dataset:
<name of dataset>.to_csv("<name of new CSV file>")
The following script reads two different files (i.e., Excel and CSV), reports their data, appends the second dataset at the end of the first, defines the conditions for the letter grades, and creates a new column
with values calculated from the data of other columns. Finally, it saves the new dataset in both Excel and CSV formats:
import pandas as pd
import numpy as np
dataset1 = pd.read_excel("Grades.xlsx")
print("The data in Grades file are:"); print(dataset1.head(3))
dataset2 = pd.read_csv('Grades2.csv')
print("The data in Grades2 file are:"); print(dataset2.tail(3))
# Append dataset2 at the end of dataset1 (DataFrame.append() was removed
# in Pandas 2.0; pd.concat() performs the same task)
dataset = pd.concat([dataset1, dataset2])
print("The new merge dataset is:"); print(dataset.head(3))
print(dataset.tail(3))
# The conditions for the Letter Grades (boundaries leave no gaps)
conditions = [(dataset["Final Grade"] >= 90.0),
    (dataset["Final Grade"] >= 80.0) & (dataset["Final Grade"] < 90.0),
    (dataset["Final Grade"] >= 70.0) & (dataset["Final Grade"] < 80.0),
    (dataset["Final Grade"] >= 60.0) & (dataset["Final Grade"] < 70.0),
    (dataset["Final Grade"] < 60.0)
    ]
# The list of Grade Letters based on the conditions
gradeLetters = ["A", "B", "C", "D", "F"]
# Create a new Letter Grades column in the new dataset using numpy
# (records matching no condition, e.g., missing grades, get an empty string)
dataset["Letter Grade"] = np.select(conditions, gradeLetters, default="")
# Create a new Course Work column calculated from other columns
dataset["Course Work"] = (dataset["Quiz 1"]*0.1 + dataset["Quiz 2"]*0.1 +
    dataset["Midterm Exam"]*0.25 + dataset["Project"]*0.25)
print("A partial view of the new dataset:")
# Find the number of records in the dataset
rowNum = len(dataset)
# Select the columns to be displayed in the report
cols = [7, 1, 0, 6]
print(dataset.iloc[:rowNum:5, cols])
# Save the new dataset as an Excel file
newExcel = pd.ExcelWriter("NewGrades.xlsx")
dataset.to_excel(newExcel, sheet_name="New Data")
newExcel.close()
# Save the new dataset as a CSV file
dataset.to_csv("newGrades.csv")
Output 8.3.4:
The data in Grades file are:
   Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        58.57        50.5    76.0    70.7          60.0       55
1        65.90        49.0    89.0    63.0          54.0       90
2        69.32        63.5    73.0    54.7          70.0       80
The data in Grades2 file are:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
12        62.52        31.0      85    54.7            68       89
13        47.33        16.5      38    33.3            52       89
14        68.97        55.0      65    49.7            70       94
The new merge dataset is:
   Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        58.57        50.5    76.0    70.7          60.0       55
1        65.90        49.0    89.0    63.0          54.0       90
2        69.32        63.5    73.0    54.7          70.0       80
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
12        62.52        31.0    85.0    54.7          68.0       89
13        47.33        16.5    38.0    33.3          52.0       89
14        68.97        55.0    65.0    49.7          70.0       94
A partial view of the new dataset:
    Course Work  Final Exam  Final Grade Letter Grade
0         43.42        50.5        58.57            F
5         47.67        45.5        61.32            D
10        47.95        48.0        62.35            D
15        48.97        45.5        62.62            D
3         44.10        34.5        54.45            F
8         46.28        54.5        62.63            D
13        42.38        16.5        47.33            F
Some key observations can be made based on this script. Firstly, it is possible, and indeed common, for the programmer to require the merging of datasets from files of different file types. In
this instance, the script merges a dataset stored in an Excel file with one in a CSV file. Secondly,
although it is possible to use multiple lines of code to define the values of a new column based
on different conditions, a more efficient option is to define the list of conditions and the list of their paired values in advance, and subsequently pass both to the np.select() method
from the Numpy library (np.where() covers the simpler case of a single condition with two possible values). Thirdly, it is possible to create a new column based on simple or complex
expressions that include other columns. Fourthly, it may be more convenient to define the displayed
records and columns as variables and use them in a statement, rather than directly adding the associated constraints to the statement. Finally, the reader should note that the sequence of statements
used to create a new Excel file is different from that for a CSV file. Such differences also exist for
files of other formats.
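The difference between np.where() and np.select() can be illustrated with a short, self-contained sketch. The two small datasets and the "Result" column below are hypothetical and introduced only for this illustration; pd.concat() is used for the merge, as DataFrame.append() is not available in recent Pandas versions.

```python
import numpy as np
import pandas as pd

# Two hypothetical course sections with a single column each
section1 = pd.DataFrame({"Final Grade": [92.5, 58.57]})
section2 = pd.DataFrame({"Final Grade": [71.0, 85.0, 64.2]})

# Merge the two datasets and renumber the records from zero
merged = pd.concat([section1, section2], ignore_index=True)

# np.where(): one condition, two possible values
merged["Result"] = np.where(merged["Final Grade"] >= 60, "Pass", "Fail")

# np.select(): several conditions checked in order; the first match wins
conditions = [
    merged["Final Grade"] >= 90,
    merged["Final Grade"] >= 80,
    merged["Final Grade"] >= 70,
    merged["Final Grade"] >= 60,
]
letters = ["A", "B", "C", "D"]
merged["Letter Grade"] = np.select(conditions, letters, default="F")

print(merged)
```

Because np.select() stops at the first condition that holds, the thresholds can be written without upper bounds, and the default value covers every record that matches none of them.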
8.3.5 Data Exploration: Grouping and Sorting Data
Data grouping is one of the most important data processing tasks, and is usually carried out before other
tasks commence. This is commonly coupled with data
sorting, and the two tasks together constitute a key
building block for the production of professional reports.
Pandas provides facilities for both of these tasks.
In order to group data within a dataset, the groupby() method can be used. The general syntax
is the following:
<name of dataset>.groupby(["<name of column>" [, "<name of column>", …]]).<aggregate function>
Observation 8.19 – Grouping Data: Use the groupby() method to group a dataset based on one or more columns. The method must be used with either an aggregate method (e.g., mean()) or with the apply(lambda x: x[…]) statement for non-aggregate groupings.
It must be noted that groupby() on its own only defines the groups; to produce a report, an aggregation (e.g., mean) must be applied to the
grouped data, a concept covered in the following section. Alternatively, if the goal is to simply display the report grouped by a specific column, the apply() method can be used with the following
syntax:
<name of dataset>.groupby(["<name of column>" [, "<name of column>", …]]).apply(lambda x: x[<rows>])
The apply() method replaces the aggregation with the lambda x: x[…] expression, which is evaluated once for each group in order to
specify the records (and, through loc[] or iloc[], the columns) that should be displayed in the report.
The reader should also note that if more than one column is used for the grouping, the data will
first be grouped based on the first column listed. The records within
each of these groups are then grouped based on the second column, and so on.
For the purposes of data sorting, the sort_values() method is used. The general syntax is the
following:
<name of dataset>.sort_values(["<name of column>" [, "<name of column>", …]] [, ascending = False])
Observation 8.20 – Sorting Data: Use the sort_values() method to sort a dataset based on one or more specified columns.
As with data grouping, the reader should note that if more than one column is specified, the data are sorted based on the first column, and records with the
same value in that column are then sorted based on the second column. Sorting is in ascending order by default; the ascending = False parameter reverses this.
Finally, it is possible to combine the functionality of
groupby() and sort_values() by applying groupby() first and then calling
sort_values() on each group inside the lambda expression passed to apply().
The following script reads a CSV file and groups and reports its data based on the Letter Grade
column, displaying only columns Letter Grade and Final Grade. Next, it creates a second dataset
and sorts the values based on the Final Grade in descending order. Finally, it utilizes the apply()
method to group the data based on Letter Grade and sort them based on Final Grade within each group:
import pandas as pd
dataset = pd.read_csv('newGrades.csv')
# Find the number of records in the dataset
rows = len(dataset)
# Report the records grouped by Letter Grade
dataset1 = dataset[["Letter Grade", "Final Grade"]]
print(dataset1.groupby(["Letter Grade"]).apply(lambda x: x[0:rows]))
# Report the records sorted by Final Grade (descending)
dataset2 = dataset[["Letter Grade", "Final Grade"]]
print(dataset2.sort_values(["Final Grade"], ascending = False))
# Report the records firstly grouped by Letter Grade and
# then sorted by Final Grade (within groups)
dataset3 = dataset[["Letter Grade", "Final Grade"]]
print(dataset3.groupby(["Letter Grade"]).
      apply(lambda x: x.sort_values(["Final Grade"], ascending=False)))
Output 8.3.5.a–8.3.5.c:
Output 8.3.5.a: Records grouped by Letter Grade
                 Letter Grade  Final Grade
Letter Grade
B            13             B        82.60
             24             B        81.87
C            3              C        72.02
             4              C        73.68
             7              C        75.57
             18             C        75.13
             21             C        76.95
             23             C        73.23
D            1              D        65.90
             2              D        69.32
             5              D        61.32
             6              D        67.87
             8              D        61.28
             10             D        62.35
             11             D        66.13
             12             D        69.43
             15             D        62.62
             17             D        67.47
             19             D        66.85
             25             D        62.63
             29             D        62.52
             31             D        68.97
F            0              F        58.57
             9              F         0.00
             14             F         0.00
             16             F         0.00
             20             F        54.45
             22             F        45.13
             26             F        58.75
             27             F        49.75
             28             F        44.25
             30             F        47.33
Output 8.3.5.b: Records sorted by Final Grade
   Letter Grade  Final Grade
13            B        82.60
24            B        81.87
21            C        76.95
7             C        75.57
18            C        75.13
4             C        73.68
23            C        73.23
3             C        72.02
12            D        69.43
2             D        69.32
31            D        68.97
6             D        67.87
17            D        67.47
19            D        66.85
11            D        66.13
1             D        65.90
25            D        62.63
15            D        62.62
29            D        62.52
10            D        62.35
5             D        61.32
8             D        61.28
26            F        58.75
0             F        58.57
20            F        54.45
27            F        49.75
30            F        47.33
22            F        45.13
28            F        44.25
14            F         0.00
9             F         0.00
16            F         0.00
Output 8.3.5.c: Records grouped by Letter Grade and sorted by Final Grade
                 Letter Grade  Final Grade
Letter Grade
B            13             B        82.60
             24             B        81.87
C            21             C        76.95
             7              C        75.57
             18             C        75.13
             4              C        73.68
             23             C        73.23
             3              C        72.02
D            12             D        69.43
             2              D        69.32
             31             D        68.97
             6              D        67.87
             17             D        67.47
             19             D        66.85
             11             D        66.13
             1              D        65.90
             25             D        62.63
             15             D        62.62
             29             D        62.52
             10             D        62.35
             5              D        61.32
             8              D        61.28
F            26             F        58.75
             0              F        58.57
             20             F        54.45
             27             F        49.75
             30             F        47.33
             22             F        45.13
             28             F        44.25
             9              F         0.00
             14             F         0.00
             16             F         0.00
The output shows the results of the three reports in succession: first the results of groupby() based on Letter Grade, then the results of sort_values() based
on Final Grade in descending order, and finally the dataset grouped by Letter Grade and sorted by Final Grade within each group.
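Grouping with an aggregate method and sorting on more than one column can be sketched on a handful of records. The values below are taken from the B, C, and D groups of the grades data; the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Letter Grade": ["D", "B", "C", "D", "B", "C"],
    "Final Grade": [65.90, 82.60, 72.02, 69.32, 81.87, 75.57],
})

# groupby() followed by an aggregate method: one value per group
means = df.groupby("Letter Grade")["Final Grade"].mean()

# Sort by Letter Grade first (ascending); ties are then ordered
# by Final Grade (descending)
ordered = df.sort_values(["Letter Grade", "Final Grade"],
                         ascending=[True, False])

print(means)
print(ordered)
```

Passing a list to ascending sets the direction for each sorting column separately, which gives the same ordering as the grouped and sorted report without the use of apply().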
8.4 DESCRIPTIVE STATISTICS
Descriptive statistics are defined as the analysis of data
that describe, show, or summarize information in a
meaningful manner. They are simply a way of describing the data and they do not draw conclusions, make
predictions, or test hypotheses based on the data, all
of which form a specific branch of statistical analysis
referred to as inferential statistics (covered in Chapter 9).
This section provides introductions to basic concepts
relating to descriptive statistics and how Python is used
to carry out various descriptive analysis tasks.
Before performing any statistical task, it is useful to
distinguish and identify the type(s) of data that will be
analysed, as this largely dictates the most appropriate
descriptive statistics and data visualisation techniques
for the task at hand.
Observation 8.21 – Descriptive
Statistics: A branch of data analysis
that describes, displays, or summarizes information without drawing
conclusions, making predictions, or
testing hypotheses.
Observation 8.22 – Categorical and
Continuous Data: Categorical data
are data that can be divided into
groups or classes but with no numerical relationship. Continuous data are
numerical data that can be used for
counting or measurements.
In a broad context, data can be simply categorized into two types: categorical and continuous.
Categorical data are data that can be divided into groups or classes that do not have a numerical or
hierarchical relationship (e.g., gender). Continuous data are numerical, and can include counting
(i.e., integers) or measurements (i.e., any numerical values). The reader should become familiar with
these two terms, as they are used extensively throughout this section.
8.4.1 Measures of Central Tendency
There are two main ways to explore and describe continuous data: (a) measuring their central tendency and, (b)
measuring their spread. The following sections introduce and briefly discuss these two concepts.
The measures of central tendency show the central
or middle values of datasets. Hence, this is also frequently referred to as measures of central location.
There are three different measures that can be considered as the centre of a dataset, namely mean, median,
and mode.
The mean, also called the arithmetic mean, is a popular measure of central tendency. It is the average of the
data in a dataset, and is calculated as the sum of all the
data values divided by the number of cases in the dataset. The mean can fail to describe the central location
of the data if there are outliers present or if the data are
skewed.
The median is the middle point of a dataset that has
been sorted in either ascending or descending order. The
main difference between the mean and the median is
that the former is heavily affected by outliers or skewed
data, while the latter is affected only slightly or not at all.
Observation 8.23 – Measures of Central Tendency: Measures that describe the central or middle values of a dataset. The three different measures are the mean, the median, and the mode.
Observation 8.24 – (Arithmetic) Mean: The average of the data in a dataset, calculated as the sum of all the data values divided by the number of cases.
Observation 8.25 – Median: The middle point of a sorted dataset.
Observation 8.26 – Mode: The most frequently occurring value in the dataset. If more than one such value exists, the dataset is characterized as multimodal.
The following Python script reads the data frame
from the newGrades.csv file introduced in previous
script samples, and calculates the means, medians, and
modes of each of the columns:
import pandas as pd
# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format
dataset = pd.read_csv('newGrades.csv')
# Define the number of rows and the columns of interest in the data frame
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
        "Midterm Exam", "Project"]
# Calculate the mean of all columns and append them as a new row
# (DataFrame.append() was removed in Pandas 2.0; pd.concat() is used instead)
means = {col: dataset[col].mean() for col in cols}
dataset = pd.concat([dataset, pd.DataFrame([means])], ignore_index = True)
# Calculate the median of all columns and append them as a new row
# Note: the row of means appended above is included in this calculation
medians = {col: dataset[col].median() for col in cols}
dataset = pd.concat([dataset, pd.DataFrame([medians])], ignore_index = True)
# Find the mode in all columns and append them as a new row
# Note: the rows of means and medians are included in this calculation
modes = {}
for col in cols:
    mode = dataset[col].mode(dropna = True).values
    if (len(mode) > 1):
        mode = "Multimode"
    modes[col] = mode
dataset = pd.concat([dataset, pd.DataFrame([modes])], ignore_index = True)
# Report the dataset
dataset1 = dataset[cols]
print(dataset1.iloc[0:rows:1])
# Report the rows with the means, medians, modes
print("Means"); print(dataset1.iloc[rows:rows + 1])
print("Medians"); print(dataset1.iloc[rows + 1:rows + 2])
print("Modes"); print(dataset1.iloc[rows + 2:rows + 3])
Output 8.4.1:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        $58.57      $50.50  $76.00  $70.70        $60.00   $55.00
1        $65.90      $49.00  $89.00  $63.00        $54.00   $90.00
2        $69.32      $63.50  $73.00  $54.70        $70.00   $80.00
3        $72.02      $60.50  $99.00  $74.70        $76.00   $70.00
4        $73.68      $74.00  $84.00  $53.30        $64.00   $87.00
5        $61.32      $45.50  $94.00  $42.70        $66.00   $70.00
6        $67.87      $66.50  $73.00  $53.70        $54.00   $87.00
7        $75.57      $66.00  $94.00  $58.70        $92.00   $70.00
8        $61.28      $50.50  $84.00  $37.30        $58.00   $78.00
9         $0.00         NaN     NaN     NaN           NaN   $69.00
10       $62.35      $48.00  $78.00  $49.00        $70.00   $71.00
11       $66.13      $61.00  $83.00  $45.30        $70.00   $70.00
12       $69.43      $50.00  $80.00  $49.30        $90.00   $76.00
13       $82.60      $74.00  $94.00  $65.00        $86.00   $92.00
14        $0.00         NaN     NaN     NaN           NaN   $75.00
15       $62.62      $45.50  $78.00  $56.70        $72.00   $70.00
16        $0.00         NaN     NaN     NaN           NaN    $0.00
17       $67.47      $59.00  $70.00  $72.70        $70.00   $72.00
18       $75.13      $61.50  $76.00  $68.30        $82.00   $87.00
19       $66.85      $77.50  $84.00  $52.00        $40.00   $80.00
20       $54.45      $34.50  $62.00  $44.00        $44.00   $90.00
21       $76.95      $66.50  $68.00  $67.00        $82.00   $92.00
22       $45.13      $26.00  $52.00  $26.30        $50.00   $68.00
23       $73.23      $63.50  $96.00  $68.30        $62.00   $89.00
24       $81.87      $83.00  $97.00  $82.70        $84.00   $72.00
25       $62.63      $54.50  $54.00  $31.30        $64.00   $87.00
26       $58.75      $46.50  $54.00  $39.00        $52.00   $90.00
27       $49.75      $27.50  $48.00  $37.00        $62.00   $70.00
28       $44.25      $21.50  $55.00  $18.00        $42.00   $80.00
29       $62.52      $31.00  $85.00  $54.70        $68.00   $89.00
30       $47.33      $16.50  $38.00  $33.30        $52.00   $89.00
31       $68.97      $55.00  $65.00  $49.70        $70.00   $94.00
Means
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
32       $58.87      $52.71  $75.28  $52.36        $65.72   $76.84
Medians
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
33       $62.63      $53.60  $77.00  $52.83        $65.86   $78.00
Modes
    Final Grade  Final Exam     Quiz 1     Quiz 2  Midterm Exam  Project
34        [0.0]   Multimode  Multimode  Multimode        [70.0]   [70.0]
The script and its output demonstrate a few important points:
• Given that the various calculations occasionally produce floating point numbers with several decimal digits, it may be desirable to limit the latter to a more manageable scale (i.e.,
two digits). The float_format statement near the top of the script formats the output accordingly.
• The mean() statements calculate the mean of each of the columns of the dataset.
Next, these values are appended at the end of the dataset as a new row.
• In a similar fashion, the median() statements calculate the median of each of the columns of the dataset and append them as a new row at the end of the dataset. It should be
noted that calculating the median requires sorted data; the median() method
performs the sorting internally. Because the medians are calculated after the row of means has been appended, that row is included in the calculation; to obtain the medians of the original records only, all statistics should be calculated before any rows are appended.
• The mode() statements calculate the mode for each of the columns. Since it is
undesirable in this particular example to have more than one such value reported, the code
includes appropriate if statements to ensure that either a single mode is reported per column or
the column is reported as multimodal (i.e., it has more than one mode).
• Finally, the reader should note the use of the dropna = True parameter, which ensures that empty or NaN values are not considered in the mode calculation. The
.values attribute discards the information related to the resulting series and its
object type, leaving only the underlying array of values.
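The sensitivity of the mean to outliers, and the possibility of more than one mode, can be checked on a few made-up values. The series below are hypothetical and serve only as an illustration of the three measures.

```python
import pandas as pd

# Hypothetical grades, with and without a single outlier (0.0)
grades = pd.Series([60.0, 62.0, 62.0, 65.0, 70.0])
with_outlier = pd.Series([60.0, 62.0, 62.0, 65.0, 70.0, 0.0])

# The outlier pulls the mean down noticeably but leaves the median unchanged
print(grades.mean(), grades.median())
print(with_outlier.mean(), with_outlier.median())

# mode() returns a series, since more than one value may be the most frequent
single_mode = grades.mode()
bimodal = pd.Series([50, 50, 70, 70, 90]).mode()
print(single_mode.tolist(), bimodal.tolist())
```

Checking the length of the series returned by mode() is therefore a simple way to detect a multimodal column, which is the approach taken by the if statements in the script above.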
8.4.2 Measures of Spread
Another way to describe and summarize continuous
data is through measures of spread. Such measures
quantify the variability of data points; hence they are
also called measures of dispersion. Measures of spread
are frequently used in conjunction with measures of
central tendency to provide a clearer and more rounded
overview of the data at hand. The importance of measures of spread lies in the fact that they can describe how
well the mean represents the data. If the data spread is
large (i.e., if there are large differences between the data
points), the mean may not be as good a representation of
the data as the median or the mode.
The data range is the difference between the minimum and maximum data points in the dataset. It is calculated as range = max−min.
Quartiles describe the data spread by breaking the
data into four parts (i.e., quarters), using three quartiles.
The 1st quartile (Q1) is the 25th percentile of the sample,
dividing roughly the lowest 25% from the rest of the
data, while the 2nd quartile (Q2) is the 50th percentile
or the median, and the third (Q3) the 75th percentile.
Quartiles are a useful measure of spread, as they are
much less affected by outliers or skewed datasets than
other measures like variance or standard deviation.
Variance shows numerically how far the data points
are from the mean, based on their squared distances from it. Variance is useful as, unlike quartiles, it takes into account all data points in the dataset
and provides a better representation of the data spread.
The variance of dataset $x$ with $n$ data points is expressed
as $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$, where
$\bar{x}$ is the mean of $x$. In order to get a better understanding of why the sum has to be divided by $n - 1$
instead of $n$, the reader can refer to Bessel's correction.
Standard deviation also demonstrates how the data points spread out from the mean. It is the positive
square root of the variance. A small standard deviation indicates that the data are close to the mean,
while a large one shows that the data are widely spread around the mean. Standard deviation is often
the preferred way to present the data spread, and it is more convenient than variance, as it is
expressed in the same unit as the data points.
Observation 8.27 – Measures of Spread: Measures that quantify the variability of data points in a
dataset. If the spread is large, the measures of central tendency are not good representations of the data.
Observation 8.28 – min(), max(): Use the min() and max() methods to find the minimum and maximum
values in a dataset. Calculate their difference to find the range of these values.
Observation 8.29 – Quartiles: Use the quantile() method to report the relevant quantile of the data in a
dataset. For instance, quantile(0.1) reports the value below which the lowest 10% of the data values fall.
Observation 8.30 – variance(): Use the variance() function of the statistics package (or the var()
method in Pandas) to find the variance of a dataset and show the distance of the data points from the mean.
Observation 8.31 – Standard Deviation (SD): Standard deviation shows the distance of the data points
from the mean. The larger its value, the larger the spread of the data points from the mean. It is
frequently preferable to the measure of variance.
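The measures of spread described above can be sketched with the standard statistics package as follows.
The list of values is a made-up example, and method="inclusive" is chosen here on the assumption that
the reader wants quartiles that interpolate linearly between data points, as the Pandas quantile()
method does by default:

```python
import math
import statistics

# Hypothetical data values (not taken from newGrades.csv)
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: difference between the maximum and the minimum
data_range = max(data) - min(data)

# Quartiles: three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Standard deviation is the positive square root of the variance
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print("Range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("Variance:", variance, "Standard deviation:", std_dev)
```

Note that the 2nd quartile coincides with the median, and that other interpolation methods may return
slightly different quartile values for small datasets.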
The following script uses the Pandas and statistics packages to read the newGrades.csv file, find the
max and min values for each column in the dataset, find the 1st quartile (25th percentile), and
calculate the variance and the standard deviation, the latter using both the Pandas std() method and the
stdev() function from the statistics package. Finally, it appends all the related values to a copy of the
dataset and reports them:
import pandas as pd
import statistics
# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format
dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
"Midterm Exam", "Project"]
# Find the max values in each column
max1 = dataset["Final Grade"].max(); max2 = dataset["Final Exam"].max()
max3 = dataset["Quiz 1"].max(); max4 = dataset["Quiz 2"].max()
max5 = dataset["Midterm Exam"].max(); max6 = dataset["Project"].max()
# Find the min values in each column
min1 = dataset["Final Grade"].min(); min2 = dataset["Final Exam"].min()
min3 = dataset["Quiz 1"].min(); min4 = dataset["Quiz 2"].min()
min5 = dataset["Midterm Exam"].min(); min6 = dataset["Project"].min()
# Find the 1st quartile (25th percentile) in all columns
quartile25a = dataset["Final Grade"].quantile(0.25)
quartile25b = dataset["Final Exam"].quantile(0.25)
quartile25c = dataset["Quiz 1"].quantile(0.25)
quartile25d = dataset["Quiz 2"].quantile(0.25)
quartile25e = dataset["Midterm Exam"].quantile(0.25)
quartile25f = dataset["Project"].quantile(0.25)
# Calculate the variance in all columns
variance1 = statistics.variance(dataset["Final Grade"].dropna())
variance2 = statistics.variance(dataset["Final Exam"].dropna())
variance3 = statistics.variance(dataset["Quiz 1"].dropna())
variance4 = statistics.variance(dataset["Quiz 2"].dropna())
variance5 = statistics.variance(dataset["Midterm Exam"].dropna())
variance6 = statistics.variance(dataset["Project"].dropna())
# Calculate the standard deviation of all columns using std()
std1 = dataset["Final Grade"].std(); std2 = dataset["Final Exam"].std()
std3 = dataset["Quiz 1"].std(); std4 = dataset["Quiz 2"].std()
std5 = dataset["Midterm Exam"].std(); std6 = dataset["Project"].std()
# Calculate the standard deviation in all columns using stdev()
stdev1 = statistics.stdev(dataset["Final Grade"].dropna())
stdev2 = statistics.stdev(dataset["Final Exam"].dropna())
stdev3 = statistics.stdev(dataset["Quiz 1"].dropna())
stdev4 = statistics.stdev(dataset["Quiz 2"].dropna())
stdev5 = statistics.stdev(dataset["Midterm Exam"].dropna())
stdev6 = statistics.stdev(dataset["Project"].dropna())
# Report the dataset
dataset1 = dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
"Midterm Exam", "Project"]]
print(dataset1.iloc[0:rows:1])
# Append one row per statistic to the dataset (pd.concat() replaces
# DataFrame.append(), which was removed in Pandas 2.0)
maxs = {"Final Grade": max1, "Final Exam": max2, "Quiz 1": max3,
        "Quiz 2": max4, "Midterm Exam": max5, "Project": max6}
dataset1 = pd.concat([dataset1, pd.DataFrame([maxs])], ignore_index = True)
mins = {"Final Grade": min1, "Final Exam": min2, "Quiz 1": min3,
        "Quiz 2": min4, "Midterm Exam": min5, "Project": min6}
dataset1 = pd.concat([dataset1, pd.DataFrame([mins])], ignore_index = True)
quartiles = {"Final Grade": quartile25a, "Final Exam": quartile25b,
             "Quiz 1": quartile25c, "Quiz 2": quartile25d,
             "Midterm Exam": quartile25e, "Project": quartile25f}
dataset1 = pd.concat([dataset1, pd.DataFrame([quartiles])], ignore_index = True)
variances = {"Final Grade": variance1, "Final Exam": variance2,
             "Quiz 1": variance3, "Quiz 2": variance4,
             "Midterm Exam": variance5, "Project": variance6}
dataset1 = pd.concat([dataset1, pd.DataFrame([variances])], ignore_index = True)
stds = {"Final Grade": std1, "Final Exam": std2, "Quiz 1": std3,
        "Quiz 2": std4, "Midterm Exam": std5, "Project": std6}
dataset1 = pd.concat([dataset1, pd.DataFrame([stds])], ignore_index = True)
stdevs = {"Final Grade": stdev1, "Final Exam": stdev2, "Quiz 1": stdev3,
          "Quiz 2": stdev4, "Midterm Exam": stdev5, "Project": stdev6}
dataset1 = pd.concat([dataset1, pd.DataFrame([stdevs])], ignore_index = True)
# Report the appended rows (the first one follows the last data row)
print("Max"); print(dataset1.iloc[rows:rows + 1])
print("Min"); print(dataset1.iloc[rows + 1:rows + 2])
print("25% Quartile"); print(dataset1.iloc[rows + 2:rows + 3])
print("Variance"); print(dataset1.iloc[rows + 3:rows + 4])
print("Standard Deviation (using: std())")
print(dataset1.iloc[rows + 4:rows + 5])
print("Standard Deviation (using: stdev())")
print(dataset1.iloc[rows + 5:rows + 6])
Output 8.4.2:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        $58.57      $50.50  $76.00  $70.70        $60.00       55
1        $65.90      $49.00  $89.00  $63.00        $54.00       90
2        $69.32      $63.50  $73.00  $54.70        $70.00       80
3        $72.02      $60.50  $99.00  $74.70        $76.00       70
4        $73.68      $74.00  $84.00  $53.30        $64.00       87
5        $61.32      $45.50  $94.00  $42.70        $66.00       70
6        $67.87      $66.50  $73.00  $53.70        $54.00       87
7        $75.57      $66.00  $94.00  $58.70        $92.00       70
8        $61.28      $50.50  $84.00  $37.30        $58.00       78
9         $0.00         NaN     NaN     NaN           NaN       69
10       $62.35      $48.00  $78.00  $49.00        $70.00       71
11       $66.13      $61.00  $83.00  $45.30        $70.00       70
12       $69.43      $50.00  $80.00  $49.30        $90.00       76
13       $82.60      $74.00  $94.00  $65.00        $86.00       92
14        $0.00         NaN     NaN     NaN           NaN       75
15       $62.62      $45.50  $78.00  $56.70        $72.00       70
16        $0.00         NaN     NaN     NaN           NaN        0
17       $67.47      $59.00  $70.00  $72.70        $70.00       72
18       $75.13      $61.50  $76.00  $68.30        $82.00       87
19       $66.85      $77.50  $84.00  $52.00        $40.00       80
20       $54.45      $34.50  $62.00  $44.00        $44.00       90
21       $76.95      $66.50  $68.00  $67.00        $82.00       92
22       $45.13      $26.00  $52.00  $26.30        $50.00       68
23       $73.23      $63.50  $96.00  $68.30        $62.00       89
24       $81.87      $83.00  $97.00  $82.70        $84.00       72
25       $62.63      $54.50  $54.00  $31.30        $64.00       87
26       $58.75      $46.50  $54.00  $39.00        $52.00       90
27       $49.75      $27.50  $48.00  $37.00        $62.00       70
28       $44.25      $21.50  $55.00  $18.00        $42.00       80
29       $62.52      $31.00  $85.00  $54.70        $68.00       89
30       $47.33      $16.50  $38.00  $33.30        $52.00       89
31       $68.97      $55.00  $65.00  $49.70        $70.00       94
Max
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
32       $82.60      $83.00  $99.00  $82.70        $92.00   $94.00
Min
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
33        $0.00      $16.50  $38.00  $18.00        $40.00    $0.00
25% Quartile
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
34       $57.54      $45.50  $65.00  $42.70        $54.00   $70.00
Variance
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
35      $461.46     $289.85  $267.49  $242.37       $197.06  $291.88
Standard Deviation (using: std())
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
36       $21.48      $17.02  $16.36  $15.57        $14.04   $17.08
Standard Deviation (using: stdev())
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
37       $21.48      $17.02  $16.36  $15.57        $14.04   $17.08
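A more compact way to obtain the same kind of summary is sketched below, under the assumption that only
the statistics themselves (and not the appended rows) are needed. It uses the Pandas agg() and
quantile() methods on a small, made-up DataFrame; the column names mirror two columns of newGrades.csv,
but the values are hypothetical:

```python
import pandas as pd

# Hypothetical grades; None marks a missing value (stored as NaN)
grades = pd.DataFrame({
    "Quiz 1": [76.0, 89.0, 73.0, None, 84.0],
    "Quiz 2": [70.0, 63.0, 55.0, None, 52.0],
})

# One row per statistic, one column per grade component
summary = grades.agg(["max", "min", "var", "std"])
# quantile() skips missing values, like the other methods used here
summary.loc["25%"] = grades.quantile(0.25)
# Range = max - min
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]

print(summary)
```

This design keeps each statistic in a labelled row, so there is no need to track the positions of
appended rows.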
8.4.3 Skewness and Kurtosis
Skewness measures the asymmetry of the data and describes the amount by which the distribution differs
from a normal distribution. There are several mathematical definitions of skewness. A commonly used one
is Pearson’s skewness coefficient, which can be derived using the size of a dataset, the mean, and the
standard deviation of the data. Negative skewness values indicate a dominant tail on the left side, while
positive values correspond to a long tail on the right side. If the skewness is close to 0 (i.e.,
between −0.5 and 0.5), the data are considered to be symmetric (Figure 8.1). When the skewness is
between −1 and −0.5 or between 0.5 and 1, the data are considered to be moderately skewed. If skewness
is less than −1 or more than 1, the data are considered to be highly skewed.
Kurtosis shows whether the data are heavy-tailed or light-tailed compared to a normal distribution. In
other words, kurtosis identifies whether the data contain extreme values. A high kurtosis indicates a
heavy tail and more outliers in the data, while a low kurtosis shows a light tail and fewer outliers. An
alternative and effective way to show kurtosis and skewness is the histogram, as it visually
demonstrates the shape of the data distribution.
There are three main types of kurtosis: mesokurtic, leptokurtic, and platykurtic (Figure 8.2):
• Mesokurtic (Kurtosis = 3): Data are normally distributed.
• Leptokurtic (Kurtosis > 3): Data are heavy-tailed with a profusion of outliers.
• Platykurtic (Kurtosis < 3): Data are light-tailed and/or contain fewer extreme values than a
normal distribution.
Note that the Pandas kurtosis() method reports excess kurtosis (i.e., kurtosis minus 3), so a normal
distribution corresponds to a reported value close to 0 rather than 3.
Observation 8.32 – Skewness: Use the skew() method to calculate the skewness of a dataset. Based on
Pearson’s skewness coefficient, skewness between −0.5 and 0.5 is considered to be symmetric, while
values between −1 and −0.5 or 0.5 and 1 indicate that skewness is moderate and values less than −1 or
more than 1 that it is high.
Observation 8.33 – Kurtosis: Use the kurtosis() method to calculate the kurtosis of a dataset. Data can
be characterized as mesokurtic (normal distribution with value of 3), leptokurtic (heavy-tailed data
with a profusion of outliers and value higher than 3), or platykurtic (light-tailed data with fewer
extreme values than a normal distribution and value lower than 3).
FIGURE 8.1 Symmetric, positive, and negative skewness.
FIGURE 8.2 Main types of kurtosis.
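To make the thresholds above concrete, the following sketch computes skewness and kurtosis directly from
the central moments of a list of values and classifies the skewness. It is a simplified, moment-based
calculation offered as an assumption for illustration: the Pandas skew() and kurtosis() methods apply
additional small-sample corrections, so their values will not match this sketch exactly. The function
names and data are hypothetical:

```python
def central_moment(values, k):
    """Return the k-th central moment of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2 (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def describe_skewness(value):
    """Classify skewness using the thresholds given in the text."""
    if -0.5 <= value <= 0.5:
        return "symmetric"
    if -1 <= value <= 1:
        return "moderately skewed"
    return "highly skewed"

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

for data in (symmetric, right_tailed):
    s = skewness(data)
    k = kurtosis(data)
    # Print skewness, its label, kurtosis, and excess kurtosis (k - 3)
    print(data, round(s, 2), describe_skewness(s), round(k, 2), round(k - 3, 2))
```

The second list has a single large value on the right, which produces a positive skewness above 1.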
The following script reads the newGrades.csv file, calculates the skewness, kurtosis, and sum values
of all columns, and reports them alongside the rest of the dataset:
import pandas as pd
# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format
dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
"Midterm Exam", "Project"]
# Find the skewness (Pearson's coefficient) values for each column
skew1 = dataset["Final Grade"].skew()
skew2 = dataset["Final Exam"].skew()
skew3 = dataset["Quiz 1"].skew()
skew4 = dataset["Quiz 2"].skew()
skew5 = dataset["Midterm Exam"].skew()
skew6 = dataset["Project"].skew()
# Find the kurtosis values for each column
kurtosis1 = dataset["Final Grade"].kurtosis()
kurtosis2 = dataset["Final Exam"].kurtosis()
kurtosis3 = dataset["Quiz 1"].kurtosis()
kurtosis4 = dataset["Quiz 2"].kurtosis()
kurtosis5 = dataset["Midterm Exam"].kurtosis()
kurtosis6 = dataset["Project"].kurtosis()
# Find the sum of all values for each column
sum1 = dataset["Final Grade"].sum()
sum2 = dataset["Final Exam"].sum()
sum3 = dataset["Quiz 1"].sum()
sum4 = dataset["Quiz 2"].sum()
sum5 = dataset["Midterm Exam"].sum()
sum6 = dataset["Project"].sum()
# Report the dataset
dataset1 = dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
"Midterm Exam", "Project"]]
print(dataset1.iloc[0:rows:1])
# Append the skewness, kurtosis, and sum values to the dataset (pd.concat()
# replaces DataFrame.append(), which was removed in Pandas 2.0)
skewness = {"Final Grade": skew1, "Final Exam": skew2, "Quiz 1": skew3,
            "Quiz 2": skew4, "Midterm Exam": skew5, "Project": skew6}
dataset1 = pd.concat([dataset1, pd.DataFrame([skewness])], ignore_index = True)
kurtosis = {"Final Grade": kurtosis1, "Final Exam": kurtosis2,
            "Quiz 1": kurtosis3, "Quiz 2": kurtosis4,
            "Midterm Exam": kurtosis5, "Project": kurtosis6}
dataset1 = pd.concat([dataset1, pd.DataFrame([kurtosis])], ignore_index = True)
sums = {"Final Grade": sum1, "Final Exam": sum2, "Quiz 1": sum3,
        "Quiz 2": sum4, "Midterm Exam": sum5, "Project": sum6}
dataset1 = pd.concat([dataset1, pd.DataFrame([sums])], ignore_index = True)
# Report the appended rows (the first one follows the last data row)
print("Skewness"); print(dataset1.iloc[rows:rows + 1])
print("Kurtosis"); print(dataset1.iloc[rows + 1:rows + 2])
print("Sum values"); print(dataset1.iloc[rows + 2:rows + 3])
Output 8.4.3:
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
0        $58.57      $50.50  $76.00  $70.70        $60.00       55
1        $65.90      $49.00  $89.00  $63.00        $54.00       90
2        $69.32      $63.50  $73.00  $54.70        $70.00       80
3        $72.02      $60.50  $99.00  $74.70        $76.00       70
4        $73.68      $74.00  $84.00  $53.30        $64.00       87
5        $61.32      $45.50  $94.00  $42.70        $66.00       70
6        $67.87      $66.50  $73.00  $53.70        $54.00       87
7        $75.57      $66.00  $94.00  $58.70        $92.00       70
8        $61.28      $50.50  $84.00  $37.30        $58.00       78
9         $0.00         NaN     NaN     NaN           NaN       69
10       $62.35      $48.00  $78.00  $49.00        $70.00       71
11       $66.13      $61.00  $83.00  $45.30        $70.00       70
12       $69.43      $50.00  $80.00  $49.30        $90.00       76
13       $82.60      $74.00  $94.00  $65.00        $86.00       92
14        $0.00         NaN     NaN     NaN           NaN       75
15       $62.62      $45.50  $78.00  $56.70        $72.00       70
16        $0.00         NaN     NaN     NaN           NaN        0
17       $67.47      $59.00  $70.00  $72.70        $70.00       72
18       $75.13      $61.50  $76.00  $68.30        $82.00       87
19       $66.85      $77.50  $84.00  $52.00        $40.00       80
20       $54.45      $34.50  $62.00  $44.00        $44.00       90
21       $76.95      $66.50  $68.00  $67.00        $82.00       92
22       $45.13      $26.00  $52.00  $26.30        $50.00       68
23       $73.23      $63.50  $96.00  $68.30        $62.00       89
24       $81.87      $83.00  $97.00  $82.70        $84.00       72
25       $62.63      $54.50  $54.00  $31.30        $64.00       87
26       $58.75      $46.50  $54.00  $39.00        $52.00       90
27       $49.75      $27.50  $48.00  $37.00        $62.00       70
28       $44.25      $21.50  $55.00  $18.00        $42.00       80
29       $62.52      $31.00  $85.00  $54.70        $68.00       89
30       $47.33      $16.50  $38.00  $33.30        $52.00       89
31       $68.97      $55.00  $65.00  $49.70        $70.00       94
Skewness
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
32       $-1.96      $-0.43  $-0.51  $-0.18         $0.05   $-3.03
Kurtosis
    Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
33        $3.52      $-0.35  $-0.53  $-0.39        $-0.60   $13.01
Sum values
    Final Grade  Final Exam     Quiz 1     Quiz 2  Midterm Exam    Project
34    $1,883.94   $1,528.50  $2,183.00  $1,518.40     $1,906.00  $2,459.00
8.4.4 The describe() and count() Methods
Two more methods that are worth mentioning are describe() and count(). These methods come in rather
handy when describing categorical data, but can also be used with continuous data. The describe()
method provides a simple way to describe data, reporting the count, mean, standard deviation, min,
quartiles, and max without having to deal with each of them separately. The count() method reports the
number of non-missing entries; applied to the rows of a single category, it gives the number of
occurrences of that case of categorical data in the dataset (i.e., it denotes frequency of occurrence).
The frequency can also be calculated on a percentage basis in order to obtain a representation of the
part-to-whole relationship.
Observation 8.34 – describe(): Use the describe() method to automatically report a set of basic
descriptive statistics.
Observation 8.35 – count(): Use the count() method to report the frequency of occurrence of
categorical data.
The following script uses newGrades.csv to report basic descriptive statistics for Final Grade,
while also counting the As, Bs, Cs, Ds, and Fs in the report:
import pandas as pd
# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format
dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Letter Grade"]
# Report the basic descriptive statistics for Final Grade
print("Basic descriptive statistics on Final Grade")
print(dataset["Final Grade"].describe(), "\n")
# Create a new dataset with Letter Grade only
dataset1 = dataset[["Letter Grade"]]
# Find the number of occurrences of Letter Grades
countAll = dataset1.count()
print("Total students:", countAll.values)
# Count the students awarded each letter (0 if the letter never occurs)
dataset2 = dataset1[dataset1["Letter Grade"] == "A"]
if (not dataset2.empty):
    countA = dataset2.count().values
else:
    countA = 0
print("Students awarded an A:", countA)
dataset2 = dataset1[dataset1["Letter Grade"] == "B"]
if (not dataset2.empty):
    countB = dataset2.count().values
else:
    countB = 0
print("Students awarded an B:", countB)
dataset2 = dataset1[dataset1["Letter Grade"] == "C"]
if (not dataset2.empty):
    countC = dataset2.count().values
else:
    countC = 0
print("Students awarded an C:", countC)
dataset2 = dataset1[dataset1["Letter Grade"] == "D"]
if (not dataset2.empty):
    countD = dataset2.count().values
else:
    countD = 0
print("Students awarded an D:", countD)
dataset2 = dataset1[dataset1["Letter Grade"] == "F"]
if (not dataset2.empty):
    countF = dataset2.count().values
else:
    countF = 0
print("Students awarded an F:", countF)
Output 8.4.4:
Basic descriptive statistics on Final Grade
count   $32.00
mean    $58.87
std     $21.48
min      $0.00
25%     $57.54
50%     $64.27
75%     $70.08
max     $82.60
Name: Final Grade, dtype: float64

Total students: [32]
Students awarded an A: 0
Students awarded an B: [2]
Students awarded an C: [6]
Students awarded an D: [14]
Students awarded an F: [10]
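The per-letter counting and the percentage basis mentioned above can also be sketched with the Pandas
value_counts() method, which is used later in this chapter. The Series below is an assumption: it is
rebuilt from the counts shown in Output 8.4.4 rather than read from newGrades.csv, and the list of
possible letters is hypothetical:

```python
import pandas as pd

# Letter grades rebuilt from the counts in Output 8.4.4 (no A grades)
letters = pd.Series(["B"] * 2 + ["C"] * 6 + ["D"] * 14 + ["F"] * 10,
                    name="Letter Grade")

# Count each letter; reindex() adds the letters that never occur
counts = letters.value_counts().reindex(["A", "B", "C", "D", "F"],
                                        fill_value=0)
# Express each count as a percentage of all students
percentages = counts / counts.sum() * 100

print(counts)
print(percentages)
```

A single call replaces the five filter-and-count steps, at the cost of having to list the expected
letters explicitly so that a letter with no students is still reported as 0.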
8.5 DATA VISUALIZATION
We are all familiar with the expression a picture is worth a thousand words. Data visualization refers
to the use of graphical means to represent and summarize data. It can help the analyst identify and
conceptualize patterns, trends, and correlations present in the data that may be otherwise difficult to
spot. It is also an efficient way to convey insights or summaries to wider audiences and, thus, it is
widely used for data presentation (particularly when working with big data). Data visualization is also
an essential step before undertaking inferential statistics analysis (Chapter 9) and machine learning
(Chapter 10) tasks, as it provides an overview of some of the structures and techniques used in these
fields. In general, data visualization is useful for the following tasks:
• Recognizing the structure and patterns of the data.
• Detecting errors or outliers.
• Exploring relationships between variables.
• Discovering new trends.
• Suggesting appropriate inferential statistical analysis and machine learning methods.
• Identifying the need for data correction (e.g., transforming data to log-scale).
• Communicating data to wider audiences.
Observation 8.36 – Data Visualization: The use of visual means, such as various types of charts, to
represent and summarize data.
Python is a popular data visualization choice for data scientists, as it provides various packages and
libraries suitable for visualization tasks. Some popular plotting libraries are the following:
• Matplotlib: As mentioned in earlier sections, Matplotlib is a low-level plotting library, suitable
for creating basic graphs and providing a lot of options relating to this task to the programmer.
• Pandas: The plotting functionality of Pandas is built on Matplotlib and, in addition to plotting,
Pandas also provides extra analysis functionality.
• Seaborn: Seaborn is a high-level plotting library with a solid collection of usable, default
styles. It also allows for graph plotting with minimal coding, and it provides advanced
visuals, making it the tool of choice for many data scientists.
The above libraries and packages provide a wealth of available methods to produce any type of
visualization. In this section, only Pandas and Matplotlib are used, mainly for simplicity and
clarity.
8.5.1 Continuous Data: Histograms
A histogram is a type of graph that can depict the distribution of continuous numerical data by displaying the
data frequency using bars of different heights. Due to
the use of bars, prior to plotting histograms, one first
has to bin the range of data values. The term bin is used
to describe the process of dividing the entire range of
data values into a series of intervals. Subsequently, data
falling into each interval are counted and the resulting
frequencies are plotted in the form of bars. Bins are usually specified as consecutive, non-­overlapping intervals
and often have equal or comparable sizes, although this
is not a strict requirement (Freedman et al., 1998).
Observation 8.37 – Histograms: Use the plot.hist() method (Pandas library) to visualize continuous
data, dividing the entire range of values into a series of intervals referred to as bins. Parameters
such as subplots, layout, grid, xlabelsize, ylabelsize, xrot, yrot, figsize, and legend allow for the
detailed configuration of the histogram.
FIGURE 8.3 Types of histograms.
Histograms can be used when investigating and demonstrating the shape of the data distribution
(i.e., its center, spread, and skewness), as well as its various modes and the presence of outliers. They
also help the analyst determine visually whether two or more data distributions are different, as in
the example above (Figure 8.3).
At first, histograms may look like bar charts, but these two graph formats are notably different.
Histograms are used for summarizing and grouping continuous data into ranges, while bar charts are used
for displaying the frequency of categorical data. Another difference is that the proportion of the data
in a histogram is represented by the area of the bars, while in a bar chart it is represented by the
length of individual bars. Bar charts are discussed in more detail in later parts of this chapter.
To plot a histogram in Python, one can use the plot.hist() method from the Pandas library.
For basic plotting, no further arguments are needed. However, the method accepts additional arguments
in order to optionally control specific plotting details, such as the number of bins (the default value
is 10). It is also possible to have multiple histograms generated and illustrated in one single plot.
The subplots parameter allows the programmer to plot each feature in the dataset separately,
and the layout parameter specifies the number of rows and columns of subplots in a given diagram.
The grid parameter controls whether the histogram appears inside a grid (True) or not (False). The font
size of the x or y axis tick labels can be controlled by setting the xlabelsize or ylabelsize
parameters, respectively, and the tick labels can be rotated by a specified number of degrees by setting
the xrot or yrot parameters. These four parameters belong to the related hist() method of the
DataFrame, while plot.hist() offers the fontsize and rot parameters for the same purpose. The size of
the figure can be specified (in inches) using the figsize parameter.
The following script uses the newGrades.csv dataset used in previous examples to display six
histograms in one plot (i.e., two rows and three columns):
import pandas as pd
dataset = pd.read_csv('newGrades.csv')
dataset1 = dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
                    "Midterm Exam", "Project"]]
# Prepare six histograms as subplots arranged in 2 rows & 3 columns, with
# visible grid & legend, a figure of size 10x10 inches, & 10 bins
axes = dataset1.plot.hist(subplots = True, grid = True, legend = True,
                          layout = (2, 3), figsize = (10, 10), bins = 10)
Output 8.5.1:
8.5.2 Continuous Data: Box and Whisker Plot
A box and whisker plot, also called box plot, is a graphical representation of the spread of continuous
data, based on a five number summary: the minimum, the maximum, the sample median, the first quartile
(Q1), and the third quartile (Q3). As the name suggests, the plot contains two parts: a box and a set of
whiskers. The two ends of the whiskers show the minimum and the maximum values of the dataset
(excluding any outliers), while the top and the bottom of the box represent Q3 and Q1, respectively.
The horizontal line in the middle of the box denotes the median. A data point located outside the
whiskers is treated as an outlier; by convention, this is a value that lies more than one and a half
times the length of the box (i.e., the interquartile range, Q3 − Q1) above Q3 or below Q1. It is worth
noting that box plots work better with data that only contain a limited number of categories (Figure 8.4).
Observation 8.38 – Box and Whisker Plot: Use the boxplot() method (Pandas library) to draw a box and
whisker plot. Plot aspects like the grid, the figure size, and the labels can be configured using the
grid, figsize, and labels parameters, respectively.
FIGURE 8.4 Box and whisker plot.
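The five number summary and the outlier rule described above can be sketched with the standard
statistics package as follows. The function name and the data are hypothetical, and the exact quartile
values depend on the interpolation method, which is assumed here to be linear (method="inclusive"):

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker ends, and outliers of a dataset."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # length of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers end at the most extreme values that are not outliers
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical values with one unusually large data point
summary = box_plot_summary([2, 4, 4, 5, 7, 9, 10, 12, 40])
print(summary)
```

In this example the box spans from 4 to 10, so any value above 19 or below −5 is reported as an outlier
instead of extending the whiskers.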
Box plots can be used when:
• Working with numerical data.
• Presenting the spread of the data and the central value.
• Comparing data distribution across different categories.
• Identifying outliers.
Box plots can be created using the boxplot() method from the Pandas library. The x and y axis values
can be modified using the by and column parameters, respectively (Pandas, 2021a). For an improved
visual effect, one can alternatively use the sns.boxplot() method from the Seaborn library.
The following script draws a box and whisker plot for the newGrades.csv dataset:
import pandas as pd
dataset = pd.read_csv('newGrades.csv')
# The names of the columns on the x-axis
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
        "Midterm Exam", "Project"]
dataset1 = dataset[cols]
# Prepare a box and whisker diagram with all the 6 columns represented
# in a single plot of size 10x10 inches
dataset1.boxplot(grid = True, figsize = (10, 10), showcaps = True,
                 showbox = True, showfliers = True, labels = cols)
Output 8.5.2:
<AxesSubplot:>
8.5.3 Continuous Data: Line Chart
A line chart is a graphical method to represent trend data as a continuous line. It connects a series of
historical data points by line segments in order to depict the variations of the data continuously over
time. The x-axis corresponds to time or continuous progression, while the y-axis represents the
corresponding values.
Line charts can be used when:
• Working with numerical data (y-axis) that follow a continuous progression (x-axis).
• Emphasizing changes in values over time or as a continuous progression.
• Comparing different series of trends.
Observation 8.39 – Line Chart: Use the plot.line() method (Pandas library) to draw a line chart. There
are several parameters available for the detailed configuration of the chart.
To create a line chart, one can call the plot.line() method from the Pandas library. If multiple
lines are plotted in a single line chart, Pandas automatically creates a legend. This is a rather useful
feature when comparing data trends.
The following script uses the newGrades.csv dataset to draw a line chart plotting all six columns
of the dataset:
import pandas as pd
dataset = pd.read_csv('newGrades.csv')
# The names of the columns on the x-axis
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
"Midterm Exam", "Project"]
dataset1 = dataset[["Final Grade", "Final Exam", \
"Quiz 1", "Quiz 2", "Midterm Exam", "Project"]]
# Prepare a line chart with all the 6 columns represented
# in a single plot of size 7x7 inches
dataset1.plot.line(grid = True, figsize = (7, 7),
title = "Grades Line Chart")
Output 8.5.3:
<AxesSubplot:title={'center':'Grades Line Chart'}>
8.5.4 Categorical Data: Bar Chart
A bar chart is a graph that displays counts of categorical data or data associated with categorical
data in the form of vertical or horizontal rectangular bars. The x-axis (vertical bar chart) represents
the data by category, while the y-axis can take any value depending on the dataset used. Bar charts are
useful for describing categorical data that have fewer than approximately 30 categories, as anything
close to or above this rough threshold tends to make them rather unreadable. In such cases, a more
efficient grouping or re-grouping approach should be considered.
Observation 8.40 – Bar Chart: Use the plot.bar() method (Pandas library) to draw a bar chart. There are
several parameters available for the detailed configuration of the chart.
Bar charts can be used when:
• Working with categorical data.
• Investigating the frequency of the data.
To plot a bar chart for categorical data one can use the plot.bar() method (Pandas library).
The reader must note that before this method is called, the frequency for each category must be
counted using the value_counts() method. Methods plt.xlabel(), plt.ylabel(), and
plt.title() can be used to add appropriate descriptions to the bar chart.
The following script uses plot.bar() to draw and configure a vertical bar chart (default) based
on the Letter Grade column of newGrades2.xlsx (New Data sheet):
import pandas as pd
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
barChart = dataset["Letter Grade"].value_counts().plot.bar(grid = True,
legend = True, figsize = (7, 7), rot = 0)
barChart.set_title("Final Letter Grades")
barChart.set_ylabel("Frequencies")
barChart.set_xlabel("Letter Grades")
Output 8.5.4.a:
Text(0.5, 0, 'Letter Grades')
The reader should note the use of the grid, legend, figsize, and rot parameters to configure
the basic appearance of the chart (i.e., show the grid and the legend, define the size of the figure in
inches, and ensure the correct orientation of the x-axis labels, respectively). It must be also noted
how methods set_title(), set_ylabel(), and set_xlabel() are used to set the title of the
chart and define the headings for the y and x axes.
When horizontal bars are needed instead of vertical ones, the plot.barh() method should be
used instead of plot.bar(). The following script demonstrates this option, while its output
illustrates how slight parameter variations can help with the new horizontal orientation:
import pandas as pd
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
barChart = dataset["Final Exam Letter"].value_counts().plot.barh(
grid = True, legend = True, figsize = (7, 7), rot = 0)
barChart.set_title("Final Exam Letter Grades")
barChart.set_ylabel("Letter Grades")
barChart.set_xlabel("Frequencies")
Output 8.5.4.b:
Text(0.5, 0, 'Frequencies')
It is also possible to have two or more different bar charts within the same figure. This can take
three different forms. The first is to have a single figure with two separate charts, as in the script
below. The script uses the subplot() function of the matplotlib.pyplot package (imported as plt) to
create two different plots:
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
# Draw first subplot
plt.subplot(1, 2, 1)
plot1 = dataset["Letter Grade"].value_counts().plot.bar(grid = True,
figsize = (10, 7), legend = True, sharey = True, rot = 0)
plot1.set_title("Final Letter Grades")
plot1.set_ylabel("Frequencies")
plot1.set_xlabel("Letter Grades")
# Draw second subplot
plt.subplot(1, 2, 2)
plot2 = dataset["Final Exam Letter"].value_counts().plot.bar(grid=True,
figsize = (10, 7), legend = True, sharey = True, rot = 0)
plot2.set_title("Final Exam Letter Grades")
plot2.set_ylabel("Frequencies")
plot2.set_xlabel("Letter Grades")
Output 8.5.4.c:
Text(0.5, 0, 'Letter Grades')
The second form is to create a compound or nested bar chart, allowing two or more sets of data
associated with the same categorical data to be plotted in a single diagram. This is useful in situations requiring visual comparison. The following script is a variation of previously used examples,
demonstrating this form of bar chart:
import pandas as pd
import matplotlib.pyplot as plt
# Read the Excel dataset
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
# Count the frequencies of Letter Grade and Final Exam Letter
dataset1 = dataset["Letter Grade"].value_counts()
dataset2 = dataset["Final Exam Letter"].value_counts()
barChart = pd.DataFrame({"Final Letter Grade": dataset1,
"Final Exam Letter Grade": dataset2})
barChart.plot.bar(grid = True,
title = "Final Exam and Final Grade Letter Grades",
rot = 0, figsize = (8, 8), color = ["lightblue", "lightgrey"])
# Use the plt object to set the labels of the x and y axis
plt.xlabel("Letter Grades")
plt.ylabel("Frequencies")
Output 8.5.4.d:
Text(0, 0.5, 'Frequencies')
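One detail of the nested chart is worth illustrating: when the two frequency series are combined into a data frame, Pandas aligns them on their index (the grade letters), so a grade that occurs in only one column produces a missing value in the other. The sketch below uses made-up grades instead of newGrades2.xlsx, and the fillna(0) step is a suggestion that is not part of the original script.

```python
import pandas as pd

# Hypothetical grades; "F" appears only in the Final Exam Letter column
dataset = pd.DataFrame({
    "Letter Grade":      ["A", "B", "B", "C", "A", "C"],
    "Final Exam Letter": ["B", "B", "C", "C", "A", "F"],
})

counts1 = dataset["Letter Grade"].value_counts()
counts2 = dataset["Final Exam Letter"].value_counts()

# The two Series are aligned on the grade letters (union of both indexes)
combined = pd.DataFrame({"Final Letter Grade": counts1,
                         "Final Exam Letter Grade": counts2})

# Replace the missing count for "F" with 0 and order the rows alphabetically
combined = combined.fillna(0).astype(int).sort_index()
print(combined)
```

The resulting data frame can then be passed to plot.bar() exactly as in the script above, with one group of bars per grade letter.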
The third form is the stacked bar chart. In this case, the various components are stacked upon each
other to create a single, unified bar. The following script presents columns Letter Grade and Final
Exam Letter from the newGrades2.xlsx dataset (New Data sheet). The reader should note that, in
addition to the previously mentioned parameters of the regular plot.bar() method, the script also
uses the stacked = True parameter that is responsible for stacking the two datasets:
import pandas as pd
import matplotlib.pyplot as plt
# Read the Excel dataset
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
# Count the frequencies of the "Letter Grade" & the "Final Exam Letter"
dataset1 = dataset["Letter Grade"].value_counts()
dataset2 = dataset["Final Exam Letter"].value_counts()
barChart = pd.DataFrame({"Final Letter Grade": dataset1,
"Final Exam Letter Grade": dataset2})
barChart.plot.bar(stacked = True, grid = True,
title = "Final Exam and Final Grade Letter Grades",
rot = 0, figsize = (8, 8), color = ["lightblue", "lightgrey"])
# Use the plt object to set the labels of the x-axis and the y-axis
plt.xlabel("Letter Grades")
plt.ylabel("Frequencies")
Output 8.5.4.e:
Text(0, 0.5, 'Frequencies')
8.5.5 Categorical Data: Pie Chart
A pie chart is a circular graph that uses the size of pie slices to illustrate proportion. It displays a
part-to-whole relationship of categorical data. As in the case of the bar chart, the pie chart should
be avoided for data with a significant number of categories (i.e., slices), as this would compromise
readability. Ideally, data with five or fewer categories are preferable. If the pie chart is to be used for
data with more than five categories, re-categorising or aggregating the data should be considered.
Pie charts can be used when the presentation of the part-to-whole relationship of the data is more
important than the precise size of each category, and when it is required to visually compare the size
of categories in relation to the whole. However, unlike bar charts, they cannot explicitly demonstrate
absolute numbers or values for each category. To plot a pie chart, one can use the plot.pie() method
from the Pandas library (Pandas, 2021b), while its appearance can be further configured using the plt
object from the matplotlib.pyplot package.
Observation 8.41 – Pie Chart: Use the pie() method (Pandas library) to create a pie chart based on a
dataset. Use the plt object from matplotlib.pyplot to configure and improve the appearance of the
chart.
The following script reads the New Data dataset from newGrades2.xlsx and creates a pie chart
based on the frequencies of the Letter Grade column, drawn here with the pie() function of the plt object. It demonstrates the use of the labels, autopct,
shadow, and startangle parameters to define the labels, to format the slice values as percentages, to display shadows, and to set the angle at which the first slice starts. Finally, it uses the axis,
legend, and title functions to keep the pie circular (equal axis scaling), and to add titles to the chart and
the legend:
import pandas as pd
import matplotlib.pyplot as plt
# Read the Excel dataset
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
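# Count the frequencies of Letter Grade
dataset1 = dataset["Letter Grade"].value_counts()
# Take the labels from the counts, so that each label matches its own slice
# (unique() returns the grades in order of appearance, not in order of frequency)
labels1 = dataset1.index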
plt.pie(dataset1, labels = labels1, autopct = "%1.1f%%", shadow = True,
startangle = 90)
plt.axis("equal")
plt.legend(title = "Final Letter Grades")
plt.title("Final Letter Grades")
Output 8.5.5:
Text(0.5, 1.0, 'Final Letter Grades')
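For comparison, the pie chart can also be drawn with the plot.pie() method of Pandas mentioned earlier, in which case the slice labels are taken from the index of the counted values. This is a minimal sketch under stated assumptions: the grades are invented rather than read from newGrades2.xlsx, and a non-interactive Matplotlib backend is selected so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical letter grades standing in for the Letter Grade column
grades = pd.Series(["A", "B", "B", "C", "A", "B", "D", "C"], name="Letter Grade")

# value_counts() returns the frequencies, indexed by grade letter
counts = grades.value_counts().sort_index()

# plot.pie() labels each slice with the corresponding index value
ax = counts.plot.pie(autopct="%1.1f%%", startangle=90, figsize=(7, 7), legend=True)
ax.set_ylabel("")                 # hide the default y-axis label
ax.set_title("Final Letter Grades")
ax.axis("equal")                  # equal scaling keeps the pie circular
```

Because labels and values come from the same Series, the two cannot fall out of step with each other.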
8.5.6 Paired Data: Scatter Plot
A scatter plot is a visual representation of the relationship between two sets of data using dots or circles.
The dots/circles can report the values of individual data points, but also patterns of the data as a whole.
Relationships between variables can be described in the following ways: positive or negative, strong or
weak, linear or nonlinear (Figure 8.5).
Observation 8.42 – Scatter Plot: Use plot.scatter() (Pandas library) to create a scatter plot. Scatter
plots illustrate the relationship between two sets of data using dots or circles.
Scatter plots can be used when:
• Working with paired numerical data.
• Identifying whether the data are correlated.
• Investigating data patterns (e.g., cluster, data gap, outliers) (Figure 8.6).
To create a scatter plot, one can call the plot.scatter() method from the Pandas library, and use
the x and y arguments to define the paired data. The following script draws a scatter plot using
the Final Exam and Final Grade columns from newGrades2.xlsx:
import pandas as pd
# Read the Excel dataset
dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
dataFrame = pd.DataFrame(data = dataset, columns = ["Final Exam",
"Final Grade"])
dataFrame.plot.scatter(x = "Final Exam", y = "Final Grade",
title = "Scatter chart between final exams and final grades ",
figsize = (7, 7))
FIGURE 8.5 Types of scatter plots.
FIGURE 8.6 Investigating data patterns.
Output 8.5.6:
<AxesSubplot:title={'center':'Scatter chart between final exams
and final grades '}, xlabel='Final Exam', ylabel='Final Grade'>
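A scatter plot suggests whether two columns are correlated, and a correlation coefficient can put a number on that impression. The sketch below is illustrative only: the seven pairs of marks are invented, and the corr() method of a Pandas Series (which computes Pearson's coefficient by default) is used here as one possible way of quantifying the relationship.

```python
import pandas as pd

# Hypothetical paired data standing in for the Final Exam and Final Grade columns
dataFrame = pd.DataFrame({
    "Final Exam":  [55, 62, 70, 74, 81, 88, 93],
    "Final Grade": [58, 60, 72, 75, 79, 90, 95],
})

# Pearson correlation coefficient: close to +1 for a strong positive linear relationship,
# close to -1 for a strong negative one, and close to 0 when no linear relationship exists
r = dataFrame["Final Exam"].corr(dataFrame["Final Grade"])
print(round(r, 3))
```

With these made-up values the points lie close to a rising straight line, so the coefficient comes out close to +1.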
8.6 WRAPPING UP
This chapter covered some of the basic concepts and tasks used in data analysis. Given the
large number of options and analysis combinations that can be used to produce
thorough data analytics results, the chapter was not meant to provide an exhaustive analysis of all
of them, but an introduction to some of the main ones, highlighting the general approaches and perspectives. For instance, topics like heatmaps, word clouds, bubble charts, area charts, and geospatial charts were not covered, although they are popular and common data visualization tools. The
reader can find more detailed information on such topics in the extensive body of work that is
readily available in related publications or web sources. At the level of detail and abstraction used in
this chapter, Table 8.1 can be used as a quick guide for some of the methods covered, and their use
in the context of data analytics.
TABLE 8.1
Quick Guide of Methods and Their Functionality and Syntax

Data Acquisition

Functionality: Import the Pandas library.
Syntax: import pandas as <pandas object>
Example: import pandas as pd

Functionality: Create a data frame by reading data from a file or web page.
Syntax: <name of data frame> = <name of pandas object>.read_csv("<Filename.csv>", delimiter = ",")
Example: dataset = pd.read_csv("WPP2019_TotalPopulationBySex.csv", delimiter = ",")
Syntax: <name of data frame> = <name of pandas object>.read_excel("<Filename.xlsx>", sheet_name = "<Sheet name>")
Example: dataset = pd.read_excel("WPP2019_Total_Population.xlsx", sheet_name = "ESTIMATES")
Syntax: <name of list of data frames> = <name of pandas object>.read_html("<url>")
(read_html() returns a list of data frames, one for each table found in the page)

Data Cleaning

Functionality: Delete all rows containing missing data.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna()
Example: dframe_no_missing_data = dataset.dropna()

Functionality: Delete all rows containing any missing data.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna(how = "any")
Example: dframe_delete_rows_with_any_na_values = dataset.dropna(how = "any")

Functionality: Delete all rows with missing data in all columns.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna(how = "all")
Example: dframe_delete_rows_with_all_na_values = dataset.dropna(how = "all")

Functionality: Replace missing values with a predefined or calculated value.
Syntax: <name of new Data Frame> = <name of original Data Frame>.fillna(value)
Syntax: <name of Data Frame>.fillna(value, inplace = True)
Example: dataset.fillna(0, inplace = True)

Functionality: Replace the names of columns with new ones.
Syntax: <name of Data Frame>.rename(columns = {"oldname": "newname", ...} [, inplace = True])
Example: dataset_new = dataset.rename(columns = {"Final Grade": "Total Grade", "Quiz 1": "Test 1", "Quiz 2": "Test 2", "Midterm Exam": "Midterm"})

Functionality: Change the index of a dataset and reset it back to the original column.
Syntax: <name of dataset>.set_index("<column name>" [, inplace = True])
Syntax: <name of dataset>.reset_index([inplace = True])

Data Exploration

Functionality: Find the number of records in the dataset.
Syntax: len(<name of dataset>)
Example: len(dataset)

Functionality: Report the columns of the dataset.
Syntax: <name of dataset>.columns
Example: dataset.columns

Functionality: Report the number of records and columns in the dataset.
Syntax: <name of dataset>.shape
Example: dataset.shape

Functionality: Report the first n records of the dataset.
Syntax: <name of dataset>.head(n)
Example: dataset.head(5)

Functionality: Report the last n records of the dataset.
Syntax: <name of dataset>.tail(n)
Example: dataset.tail(5)

Functionality: Report a number of records and columns from the dataset, based on their name and/or index value.
Syntax: <name of dataset>[start row: end row: step]
Syntax: <name of dataset>.loc[start row: end row, "<name of starting column>": "<name of ending column>"]
Syntax: <name of dataset>.iloc[start row: end row, start column (index): end column (index)]
Examples:
print(dataset[0:37:5])
print(dataset.loc[0:5, "Final Grade": "Final Exam"])
print(dataset.iloc[0:5, 0:3])

Functionality: Report only the unique values from a selected column in the dataset.
Syntax: <name of dataset>["<name of column>"].unique()
Example: dataset["Project"].unique()

Functionality: Report data based on a simple or compound condition.
Syntax: <name of dataset>[<condition>]
Syntax: <name of dataset>[(<condition>) &/| (<condition>)]
Examples:
dataset[dataset["Final Grade"] > 80]
dataset[(dataset["Final Grade"] > 0) & (dataset["Final Grade"] < 60)]

Functionality: Merge two datasets into a new one.
Syntax: <name of new dataset> = <name of first old dataset>.append(<name of second old dataset>)
Example: dataset = dataset1.append(dataset2)
(append() was removed from data frames in Pandas 2.0; in newer versions use dataset = pd.concat([dataset1, dataset2]))

Functionality: Create a new column based on an expression using data from other columns.
Syntax: <name of dataset>["<name of new column>"] = expression with other columns
Example: dataset["Course Work"] = dataset["Quiz"] * 0.2 + dataset["Midterm Exam"] * 0.25 + dataset["Project"] * 0.25

Functionality: Create a new column based on a condition.
Syntax: <name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)
Example: dataset["Letter Grade"] = np.where(dataset["Final Grade"] > 89, "A", "Not A")
(the value "Not A" is a placeholder for the value assigned when the condition is False)

Functionality: Create a new column based on a set of conditions and paired values.
Syntax: <name of dataset>["<name of new column>"] = np.select(conditions, paired values)
Example: dataset["Letter Grade"] = np.select(conditions, gradeLetters)

Functionality: Group a dataset based on one or more columns, and apply any aggregate method necessary (e.g., sum(), mean()).
Syntax: <name of dataset>.groupby(["<name of column>" [, "<name of column>", ...]]).<aggregate function>()
Example: dataset1.groupby(["Letter Grade"]).mean()

Functionality: Group a dataset based on one or more columns. Use apply() to organize the records and columns in the dataset.
Syntax: <name of dataset>.groupby(["<name of column>" [, "<name of column>", ...]]).apply(lambda x: x[<rows>, <cols>])
Example: dataset1.groupby(["Letter Grade"]).apply(lambda x: x[0:rows])

Functionality: Sort the data in a dataset.
Syntax: <name of dataset>.sort_values(["<name of column>" [, "<name of column>", ...]] [, ascending = False])
Example: dataset3.groupby(["Letter Grade"]).apply(lambda x: x.sort_values(["Final Grade"], ascending = False))

Descriptive Statistics

Functionality: Use mean() to find the mean/average in a dataset.
Syntax: <name of dataset>["<name of column>"].mean()
Example: dataset["Final Grade"].mean()

Functionality: Use median() to find the median in a dataset.
Syntax: <name of dataset>["<name of column>"].median()
Example: dataset["Final Grade"].median()

Functionality: Use mode() to find the most frequent value in a dataset.
Syntax: <name of dataset>["<name of column>"].mode()
Example: dataset["Final Grade"].mode(dropna = True)

Functionality: Use .values to discard all output from the mode() report except its value.
Syntax: <name of dataset>["<name of column>"].mode().values
Example: dataset["Final Grade"].mode(dropna = True).values

Functionality: Use max() to find the max value in a dataset.
Syntax: <name of dataset>["<name of column>"].max()
Example: dataset["Final Grade"].max()

Functionality: Use min() to find the min value in a dataset.
Syntax: <name of dataset>["<name of column>"].min()
Example: dataset["Final Grade"].min()

Functionality: Use quantile(x) to find the xth quantile in a dataset.
Syntax: <name of dataset>["<name of column>"].quantile(<value between 0.0 and 1.0>)
Example: dataset["Final Grade"].quantile(0.25)

Functionality: Use variance() (Statistics package) to calculate data variance.
Syntax: statistics.variance(<name of dataset>["<name of column>"].dropna())
Example: statistics.variance(dataset["Final Grade"].dropna())

Functionality: Use std() or stdev() (Statistics package) to calculate standard deviation.
Syntax: <name of dataset>["<name of column>"].std()
Syntax: statistics.stdev(<name of dataset>["<name of column>"].dropna())
Examples:
dataset["Final Grade"].std()
statistics.stdev(dataset["Final Grade"].dropna())

Functionality: Use skew() to calculate data skewness.
Syntax: <name of dataset>["<name of column>"].skew()
Example: dataset["Final Grade"].skew()

Functionality: Use kurtosis() to calculate data kurtosis.
Syntax: <name of dataset>["<name of column>"].kurtosis()
Example: dataset["Final Grade"].kurtosis()

Functionality: Use count() to count the non-missing values in a column.
Syntax: <name of dataset>["<name of column>"].count()
Example: dataset["Final Grade"].count()

Functionality: Use describe() to automatically report a set of basic descriptive statistics.
Syntax: <name of dataset>["<name of column>"].describe()
Example: dataset["Final Grade"].describe()

Data Visualization

Functionality: Use the hist() function (Pandas library) to draw histograms.
Syntax: <name of chart> = <name of dataset>.plot.hist(subplots = True/False, grid = True/False, legend = True/False, layout = (<number of rows>, <number of columns>), figsize = (<size on x axis in inches>, <size on y axis in inches>), bins = <number of bins>)
Example: histChart = dataset1.plot.hist(subplots = True, grid = True, legend = True, layout = (2, 3), figsize = (10, 10), bins = 10)

Functionality: Use the boxplot() function (Pandas library) to draw box and whiskers plots.
Syntax: <name of dataset>.boxplot([grid = True/False] [, figsize = (<integer>, <integer>)] [, showcaps = True/False] [, showbox = True/False] [, showfliers = True/False] [, labels = <names of columns>])
Example: dataset1.boxplot(grid = True, figsize = (10, 10), showcaps = True, showbox = True, showfliers = True, labels = cols)

Functionality: Use the line() function (Pandas library) to draw a line chart.
Syntax: <name of dataset>.plot.line([grid = True/False] [, figsize = (<integer>, <integer>)] [, title = "<title>"])
Example: dataset1.plot.line(grid = True, figsize = (7, 7), title = "Grades Line Chart")

Functionality: Use the bar() function (Pandas library) to draw a bar chart. Use the subplot() function (matplotlib.pyplot) or the stacked = True parameter with appropriate code to create different types of bar charts.
Syntax: <name of dataset>.plot.bar()
Example: see relevant script in the text

Functionality: Use the pie() function (Pandas library) to draw a pie chart. Use the plt object of the matplotlib.pyplot package to configure and improve the appearance of the chart.
Syntax: <name of dataset>.plot.pie()
Example: see relevant script in the text

Functionality: Use the scatter() function (Pandas library) to draw a scatter plot based on two columns of a data frame.
Syntax: <dataFrame>.plot.scatter(x = "<column 1>", y = "<column 2>" [, title = "<title>", ...])
Example: dataFrame.plot.scatter(x = "Final Exam", y = "Final Grade", title = "Final exams and final grades", figsize = (7, 7))
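
Several entries of Table 8.1 are typically combined in a single workflow. The sketch below chains a cleaning step, a conditional column, and a grouped summary. It is an illustration under stated assumptions: the marks are invented, and the grade boundaries other than the "greater than 89" rule for an A are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks; one Final Exam value is missing
dataset = pd.DataFrame({
    "Final Exam":  [95, 72, np.nan, 58, 84],
    "Final Grade": [92, 75, 66, 55, 81],
})

# Data cleaning: drop every row that contains a missing value
clean = dataset.dropna().copy()

# Data exploration: derive a letter grade from a set of conditions and paired values
# (the boundaries 79, 69 and 59 are assumptions made for this sketch)
conditions = [clean["Final Grade"] > 89, clean["Final Grade"] > 79,
              clean["Final Grade"] > 69, clean["Final Grade"] > 59]
gradeLetters = ["A", "B", "C", "D"]
clean["Letter Grade"] = np.select(conditions, gradeLetters, default="F")

# Grouping and descriptive statistics: mean Final Grade per letter grade
summary = clean.groupby(["Letter Grade"])["Final Grade"].mean()
print(summary)
```

np.select() checks the conditions in order and uses the first one that holds, which is why the boundaries are listed from the highest grade downwards.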
8.7 CASE STUDY
Readmission is considered a quality measure of hospital performance and a driver of healthcare
costs. Studies have shown that patients with diabetes are more likely to have higher early readmissions (readmitted within 30 days of discharge), compared to those without diabetes (American
Diabetes Association, 2018; McEwen & Herman, 2018). To reduce early readmission, one solution
is to provide additional assistance to patients with high risk of readmission. For this purpose, the US
Department of Health would like to know how to identify the patients with high risk of readmission
using the collected clinical records of diabetes patients from 130 US hospitals between 1999 and
2008.
As an attempt to assist the US Department of Health in understanding the data, you are asked to
explore, analyse (descriptively), and visualize the data of readmission (readmitted) and the potential
risk factors, such as time in hospital (time_in_hospital) and hemoglobin A1c results (HA1Cresult),
using techniques covered in this chapter.
More specifically, your work should cover the following:
1. Data Acquisition: Import the related data file (i.e., Diabetes.csv).
2. Data Exploration: Report the number of records/samples and the number of columns/
variables in the dataset.
3. Descriptive Statistics: Use suitable techniques to summarize or describe the three variables we are interested in: readmitted, time_in_hospital, and HA1Cresult.
4. Data Visualisation: Use appropriate techniques to visualize the three variables and
the relationships between readmitted and time_in_hospital, and between readmitted and
HA1Cresult.
REFERENCES
American Diabetes Association. (2018). Economic costs of diabetes in the US in 2017. Diabetes Care, 41(5),
917–928. https://doi.org/10.2337/dci18-0007.
Freedman, D., Pisani, R., & Purves, R. (1998). Statistics (3rd ed.). New York: WW Norton & Company.
McEwen, L. N., & Herman, W. H. (2018). Health care utilization and costs of diabetes. Diabetes in America
(3rd ed.), 40-1–40-78. NIDDK.
Pandas. (2021a). pandas.DataFrame.boxplot. Version: 1.2.5. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html.
Pandas. (2021b). pandas.DataFrame.plot.pie. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html.
Statistics — Mathematical statistics functions. (2021). Python. https://docs.python.org/3/library/statistics.
html.
9
Statistical Analysis with Python
Han-I Wang
The University of York
Christos Manolas
The University of York
Ravensbourne University London
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Contents
9.1 Introduction
    9.1.1 What is Statistics?
    9.1.2 Why Use Python for Statistical Analysis?
    9.1.3 Overview of Available Libraries
9.2 Basic Statistics Concepts
    9.2.1 Population vs. Sample: From Description to Inferential Statistics
    9.2.2 Hypotheses and Statistical Significance
    9.2.3 Confidence Intervals
9.3 Key Considerations Prior to Conducting Statistical Analysis
    9.3.1 Level of Measures: Categorical and Numerical Variables
    9.3.2 Types of Variables: Dependent and Independent Variables
    9.3.3 Statistical Analysis Types and Hypothesis Tests
        9.3.3.1 Statistical Analysis for Summary Investigative Questions
        9.3.3.2 Statistical Analysis for Comparison Investigative Questions
        9.3.3.3 Statistical Analysis for Relationship Investigative Questions
    9.3.4 Choosing the Right Type of Statistical Analysis
9.4 Setting Up the Python Environment
    9.4.1 Installing Anaconda and Launching the Jupyter Notebook
    9.4.2 Installing and Running the Pandas Library
    9.4.3 Review of Basic Data Analytics
9.5 Statistical Analysis Tasks
    9.5.1 Descriptive Statistics
    9.5.2 Comparison: The Mann-Whitney U Test
    9.5.3 Comparison: The Wilcoxon Signed-Rank Test
    9.5.4 Comparison: The Kruskal-Wallis Test
    9.5.5 Comparison: Paired t-test
    9.5.6 Comparison: Independent or Student t-Test
    9.5.7 Comparison: ANOVA
    9.5.8 Comparison: Chi-Square
    9.5.9 Relationship: Pearson’s Correlation
    9.5.10 Relationship: The Chi-Square Test
    9.5.11 Relationship: Linear Regression
    9.5.12 Relationship: Logistic Regression
9.6 Wrap Up
9.7 Exercises
References
DOI: 10.1201/9781003139010-9
9.1 INTRODUCTION
When working with data, one of the main questions one seeks to answer is whether the observed
value fluctuations and differences are random or not. If not by chance, what are the key factors
that cause such changes, and what are their relationships with the data? Statistical analysis, and in
particular inferential statistics, is the key tool for answering these questions.
In this chapter, some commonly used statistical functions and the relationship between different
types of measurements and statistical tests are introduced, accompanied by demonstrations of how
to conduct relevant statistical analysis tasks in Python. The analysis functions follow a linear and
incremental order, and build on concepts introduced previously, in order to assist readers with little
or no prior experience in this area. For those familiar with the various concepts and functions discussed, this chapter can be used as a refresher or as a practical guide to implementing and executing
common statistical functions using the Python platform.
The reader should note that before embarking on any substantial task involving statistical analysis, it is important to consult statistics experts in order to determine the appropriate data collection
functions and measurement units, as well as the types of statistical tests required and the best
approaches for interpreting and reporting the results.
9.1.1 What is Statistics?
Statistics is a branch of applied mathematics involving the
tasks of data collection, manipulation, interpretation, and
prediction. Two broad categories can be identified in the
field of statistics: descriptive and inferential. Descriptive
statistics (covered in part in Chapter 8 on Data Analytics
and Data Visualization) focus on identifying and describing patterns in the data, by utilizing straightforward
functions like frequencies and mean calculations. In
descriptive statistics, there are no uncertainties or unknown
factors. The goal is to summarize large volumes of data,
making them easier to visualize and understand. On the
other hand, inferential statistics focus on putting forward
hypotheses (or inferences) about a wider population, based on
a sample taken from it. The hypotheses can then be generalized
and applied to the entire population. Hence, as the sample
does not contain the entirety of the population, analytical
tasks utilizing inferential statistics are bound to contain
an element of uncertainty.
The reader must note that the term statistics is commonly used to refer to inferential statistics, while the
term descriptive statistics is used when analytical tasks
are conducted solely for describing existing data. In line
with this convention, in this chapter the term statistics
will be most frequently used to refer to inferential statistics, unless stated otherwise.
Observation 9.1 – Statistics: A branch
of applied mathematics that involves
the tasks of data collection, manipulation, interpretation, and prediction.
Two broad categories can be identified:
descriptive and inferential statistics.
Observation 9.2 – Descriptive
Statistics: The focus is on identifying and describing patterns in the
data through frequencies and mean
calculations.
Observation 9.3 – Inferential
Statistics: The focus is on putting forward hypotheses (inferences) related
to a sample from a wider population.
If the hypotheses are proven correct,
they are generalized and applied to
the entire population.
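The difference between the two categories can be sketched in a few lines of code. The example below is a simplified illustration that uses only the Python standard library: the population is simulated (the mean of 26.0 and the spread of 4.0 are arbitrary assumptions), its mean is described exactly, and the same quantity is then estimated from a sample of 50 values, together with a standard error that expresses the uncertainty of the estimate.

```python
import random
import statistics

random.seed(42)  # fixed seed, so that the sketch is repeatable

# Simulated population of 10,000 scores (values are arbitrary assumptions)
population = [random.gauss(26.0, 4.0) for _ in range(10_000)]

# Descriptive statistics: the whole population is known, so its mean is exact
population_mean = statistics.mean(population)

# Inferential statistics: only a sample is observed, so the mean is an estimate
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

# The standard error quantifies how far the estimate may lie from the true mean
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(population_mean, sample_mean, standard_error)
```

The sample mean will generally differ slightly from the population mean, and the standard error shrinks as the sample grows, which is the element of uncertainty referred to above.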
9.1.2 Why Use Python for Statistical Analysis?
A large number of specialized statistical software tools are available, such as SAS, Stata, R,
and SPSS, and are widely used for both academic and commercial purposes. However, as each
of these software packages comes from a different developer, they use customized features and
specialized commands and syntax that cannot be directly translated and exchanged across different platforms. In contrast, Python is a general-purpose programming language with
extensive cross-platform capabilities. This characteristic gives Python an advantage when it
comes to complex statistical analysis tasks that mix statistics with other data science fields,
such as image analysis, text mining, or artificial intelligence and machine learning. In such
cases, the richness and flexibility of Python, provided by its ability to adapt its functionality by
means of appropriate modules, make it a better choice compared to other specialized statistical
software packages. Furthermore, the Python language is relatively easy to learn compared to those found
in the more specialized statistical software tools. Its syntax is reminiscent of the English language, making
it accessible to users from diverse backgrounds and programming expertise levels. Finally, Python is an
open-source and free-to-use language, unlike most of the specialized statistical packages that frequently
come at a considerable cost.
Observation 9.4: Python, as a general-purpose programming language, allows the user to integrate
statistics with other data science fields and tasks like image analysis, text mining, artificial intelligence,
or machine learning.
9.1.3 Overview of Available Libraries
A number of Python libraries, such as NumPy, SciPy, Scikit-learn, and Pandas, provide
functions and tools that allow the user to conduct specific statistical analysis tasks. As the
names suggest, NumPy and SciPy focus on numeric and scientific computations, as they support basic operations on multidimensional arrays.
Scikit-learn, in turn, is mostly used for machine learning and data mining, as it offers simple and
efficient tools for common data analysis tasks. The name Pandas is derived from the term panel data, and
the library is designed for data manipulation and analysis (McKinney & Team, 2020). For pure statistical
analysis purposes, the Pandas library is one of the most suitable options, as it provides high-performance
data analysis tools (Anaconda Inc., 2020).
Observation 9.5: The NumPy and SciPy libraries focus on numeric and scientific computations, Scikit-learn
is used for machine learning and data mining, and Pandas for data manipulation and analysis.
The reader will notice that the library of choice for a large part of the work covered in this chapter
is Pandas. This is due to three main reasons. Firstly, the library is highly suitable for the types of statistical analysis tasks covered in this chapter. Secondly, it supports different data formats like comma-separated
values (.csv), plain text, Microsoft Excel (.xls), and SQL, allowing the user to import, export,
and manipulate databases easily. Thirdly, it is built on top of the NumPy library, so the results can
be easily fed into functions of associated libraries like Matplotlib for plotting and Scikit-learn for
machine learning tasks (McIntire et al., 2019). This highlights another concept that is central to the
structure and rationale of this chapter: the selective use of different libraries and functions for different analytical tasks. For instance, functions from the SciPy library may be used for a specific analytical task alongside functions from the Matplotlib library for plotting the output data. This approach
aims at promoting the idea that, as long as the fundamental principles and logic of the various
analytical tasks remain the same, the reader should feel confident to explore different toolkits
and solutions.
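As a small illustration of this selective use of libraries, the sketch below prepares data with Pandas and passes the resulting columns to a SciPy function. It is not a worked example from the chapter: the two groups of scores are invented, and the independent t-test (stats.ttest_ind) is chosen only because it is one of the tests listed for later sections.

```python
import pandas as pd
from scipy import stats

# Hypothetical scores for two independent groups
scores = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "score": [82, 85, 88, 90, 79, 91, 60, 65, 58, 70, 62, 66],
})

# Pandas handles the data manipulation: select the scores of each group
group_a = scores.loc[scores["group"] == "A", "score"]
group_b = scores.loc[scores["group"] == "B", "score"]

# SciPy handles the statistical test: Pandas Series are accepted directly
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```

The same two Series could equally be handed to Matplotlib for plotting, since each library accepts the data structures produced by Pandas.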
9.2 BASIC STATISTICS CONCEPTS
Readers unfamiliar with the intricacies of statistical analysis who come across the notions of significant difference, p-value, or confidence intervals may wonder what exactly these terms mean,
and why they are so central in statistics. In this section, key statistics concepts, and the frequently
intimidating jargon that accompanies them, are discussed and contextualized using simple examples. The aim is to help the reader establish an understanding of the connections and differences between descriptive and inferential statistics, and of how and why scientists frequently make the
transition from the former to the latter.
9.2.1 Population vs. Sample: From Description to Inferential Statistics
Population can be defined as the whole set of individuals or subjects for which generalized
observations or assumptions are needed, whereas sample is the actual part of this population from
which data are actually collected. As such, the sample is bound to be a small part of the entire
population.
Observation 9.6 – Population, Sample: Population is the whole set of individuals or subjects for
which generalized observations or assumptions are needed. The sample is the part of the population
from which data are actually collected. The sample is always a small part of the entire population.
In an ideal scenario, individual information from the entirety of the population would be retrieved.
In this case, descriptive statistical functions could be utilized to describe the patterns observed in
the data. However, this
scenario is extremely rare. In most cases, budget and time constraints related to the data collection and
analysis tasks at hand impose significant limitations. This is especially true when the study population
is substantial, a rather common situation indeed. For example, if a national survey about the quality
of life of all patients with diabetes in the UK is to be carried out, researchers would have to interview
a population of approximately 4.7 million people (Diabetes UK, 2019). Arguably, it would be much
more efficient to survey a group of diabetes patients rather than the entire population. In such cases,
since researchers would only get access to the information of a sample, statistical functions that allow
one to make inferences about the population based on the sample are required. Measuring the national Body
Mass Index (BMI) scores can be used as an example to demonstrate the underlying rationale. Assume
that one wants to measure the BMI scores of all smokers in the UK. Since it is not feasible to get
information from the entire UK smoker population, a sample will be drawn, which will then be used
to draw conclusions. Ultimately, findings will be generalized to the entire UK smoker population using
inferential statistics.
In order to draw the sample, various sampling methods are available. These include, but are not limited
to, random, cluster, and stratified sampling. Depending on the research question behind the study and on
the characteristics of the study population, a particular sampling method may be preferable to others. A
detailed analysis of sampling methods, and of how to choose one and determine the required sample size, is
outside the scope of this chapter. However, a large number of related resources, like specialized
statistics books and online materials, are available for those interested in learning more about the topic.
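As a minimal sketch of two of the sampling methods named above, the following example draws a simple
random sample and a proportionate stratified sample with Pandas. The data frame, its column names
(patient_id, region), and the 10% sampling fraction are hypothetical and were chosen purely for
illustration; they do not come from a real study.

```python
import pandas as pd

# Hypothetical study population: 1,000 patients spread unevenly over three regions.
population = pd.DataFrame({
    "patient_id": range(1000),
    "region": ["North"] * 500 + ["Midlands"] * 300 + ["South"] * 200,
})

# Simple random sampling: every patient has the same chance of being selected.
random_sample = population.sample(frac=0.10, random_state=42)

# Stratified sampling: draw 10% within each region, so that the regional
# proportions of the population are preserved in the sample.
stratified_sample = population.groupby("region").sample(frac=0.10, random_state=42)

print(len(random_sample), len(stratified_sample))
print(stratified_sample["region"].value_counts())
```

The random sample may over- or under-represent a region by chance, whereas the stratified sample
reproduces the regional split of the population by construction.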
In terms of generalizing findings and observations from the sample to the entire population, one
may wonder how such a generalization can be possible and trustworthy. In its simplest form, this is
achieved by conforming to a strict set of minimum requirements, summarized below:
1. The sample must be representative of the population to which the results will be generalized.
Representative means that the sample should reflect specific characteristics of the population,
such as age, gender, or ethnic background, as closely as possible.
2. It must be suitable for answering the research question quantitatively.
3. It must allow for hypothesis testing, as implied by the research question.
4. The data analysis must match the type of the data being analyzed. In other words, one
needs to use the right statistical function for the data at hand.
Observation 9.7 – Sample Characteristics: A sample must be representative of the population,
suitable for answering the research question quantitatively, and allowing for hypothesis testing.
These concepts are further discussed in the following sections.
9.2.2 Hypotheses and Statistical Significance
Once a representative sample is drawn from the study population, hypotheses are formulated based on
the underlying research questions. These hypotheses are, subsequently, systematically tested in
order to measure the strength of the evidence and to draw conclusions about the entire population.
This is commonly known as hypothesis testing. Hypothesis testing is, therefore, the process of making
a claim about the study population and using the sample data to check whether the claim is valid. A
common and long-established convention within the scientific community is to start from the assumption
that the intervention or condition under investigation makes no difference, or has no effect, in the
population. This is a specific and standardized type of assumption that is essential in statistical
testing, and is commonly referred to as the null hypothesis (H0). For those unfamiliar with scientific
methodologies, the fact that the expectation is that the analysis will unveil no difference as opposed
to some difference may seem counter-intuitive. However, the reader should note that the idea behind
this is that the analyst seeks to reject the null hypothesis rather than confirm it. In other words,
if the sample data provide sufficient evidence against the null hypothesis (i.e., no difference), one
can conclude that a difference or effect exists within the population.
Observation 9.8 – Null Hypothesis: The hypothesis that the intervention or condition under
investigation associated with the research question will have no effect in the population.
To check the validity of the null hypothesis, one needs to conduct a detailed and strictly-defined
type of testing, commonly referred to as statistical significance testing. There are numerous
statistical significance tests to choose from, depending on the research questions and the data
at hand (see Section 9.3 for more details on test selection and on how to conduct such tests in
Python). A common attribute of all these tests is that they calculate a probability known as the
p-value, which describes how likely it is that results at least as extreme as those observed in the
sample would have occurred by random chance if the null hypothesis were true. Hence, if the p-value is
high, the observed sample data are consistent with the null hypothesis, which therefore cannot be
rejected. This is not proof that there is no difference in the population, only that the sample provides
insufficient evidence of one. If the p-value is low, it is a sign that the observed sample data are
inconsistent with the null hypothesis (H0), which is, therefore, rejected. In this case, one can
conclude that there is evidence of a difference in the population, and the difference is said to be
statistically significant, or a significant difference is said to have been detected.
Observation 9.9 – Hypothesis or Statistical Significance Testing: Tests that calculate the probability
(p-value) of obtaining results at least as extreme as those observed if the null hypothesis were true.
If the probability is high, the null hypothesis cannot be rejected; if it is low, the observed sample
data are inconsistent with the null hypothesis, which is therefore rejected.
As a working example of the above, the reader can assume a study of the effectiveness of a new
hypertension drug, by comparing the blood pressure levels of those using it with the levels of those
using conventional hypertension drugs. A hypothesis test can be carried out to detect whether the
new drug intervention has any effect or not. The null hypothesis will be based on the claim that there
is no difference in blood pressure levels between the users of the two different drugs. Hypothesis
testing will be conducted on the sample and a p-value will be generated. If the p-value is low and the
null hypothesis is rejected, there is evidence of a difference in terms of the effectiveness of the
two drugs in the general population.
TABLE 9.1
p-value and Significance
p-value       Significance
>0.1          Little or no evidence of a difference or relationship.
0.05–0.1      Weak evidence of a difference or relationship.
0.01–0.05     Evidence of a difference or relationship.
0.001–0.01    Strong evidence of a difference or relationship.
<0.001        Very strong evidence of a difference or relationship.
At this point, the reader may start wondering how low the p-­value should be in order to be considered low. The answer to this is that it depends on the significance level one chooses for the research
question. In other words, for each research question, one needs to determine how high or low the probability (i.e., the p-­value) must be in order to conclude whether the sample data is consistent with the
null hypothesis or not. Conventionally, differences are considered to be significant if the p-­value is less
than 0.05 (5%). Essentially, the p-­value can be regarded as an indicator of the strength of the evidence.
The reader can use the classification of p-­values as a rough guide for determining whether statistical
significance requirements are met for a specific analysis task (Table 9.1).
Using the same hypertension drug example, if the p-value of the hypothesis test is found to be 0.03, it
indicates that, if there were truly no difference between the two drugs, there would be only a 3% chance
of observing a treatment effect at least as large as the one found in the randomly sampled data. Since
the 3% chance is lower than the 5% statistical significance threshold, the null hypothesis can
be rejected, leading to the conclusion that a significant difference between the two drugs exists in terms
of the treatment effects within the general population. It is worth mentioning that the p-value here only
indicates a statistical relationship and not causation. For identifying causation, more sophisticated
inferential statistical analysis methods, such as regression, are needed (see Sections 9.5.11, 9.5.12).
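The decision rule described above can be sketched in a few lines of Python. The example below is only an
illustration under stated assumptions: the blood pressure values are simulated rather than taken from a
real trial, the group means, spread, and group sizes are arbitrary, and an independent two-sample t-test
from SciPy (scipy.stats.ttest_ind) is used as one possible significance test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated systolic blood pressure (mmHg); all figures are illustrative assumptions.
conventional_drug = rng.normal(loc=145, scale=10, size=50)
new_drug = rng.normal(loc=130, scale=10, size=50)

alpha = 0.05  # significance level, chosen before running the test

# Independent two-sample t-test; H0: the two group means are equal.
t_stat, p_value = stats.ttest_ind(new_drug, conventional_drug)

# Compare the p-value with the chosen significance level.
if p_value < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(f"t = {t_stat:.2f}, p = {p_value:.4f}: {decision}")
```

Note that the second outcome is worded as failing to reject the null hypothesis rather than as
confirming it, in line with the interpretation of high p-values.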
9.2.3 Confidence Intervals
Another key concept used frequently in statistics is that of confidence intervals. The term describes
a range of values within which the actual value of the quantity being estimated may fall, as opposed to
a single estimated value. More specifically, in inferential statistics, one of the primary goals is
to estimate population parameters. However, parameters like the population mean and standard
deviation are usually unknown, as they are very difficult, or even impossible, to measure accurately
across the entire population. Instead, estimates are made based on the samples. In order to avoid
selection bias when the sample is selected and to achieve an accurate and objective representation
of the population, methods like random sampling are commonly used. However, even when such methods
are used, uncertainty about the population estimates still exists to a certain degree, due to the
possibility of sampling errors. It must be noted that, despite the term used, sampling errors do not
refer to actual errors. They appear due to the inevitable variability occurring by chance, as random
samples are used rather than an entire population. Nevertheless, they are treated as errors for the
purposes of statistical testing, as they may lead to inaccurate conclusions.
Observation 9.10 – Confidence Intervals: A range of values within which the actual value of the tests
may fall. They act as mediators that take into account potential sampling errors and, therefore,
provide a higher level of confidence during the statistical analysis process.
Although sampling errors cannot be completely eliminated, confidence intervals act as
a mediator by taking these potential errors into account and providing a range of values the
actual population parameter value is likely to fall within. As an example of this, one can
assume that researchers want to know the average height of all secondary school students in
the UK. Since it is impossible to measure the height of every single student, a random sample
of 1,000 secondary school students could be used. If the analysis of the sample measurements
results in an average height of 165 cm, it is unlikely that the population mean will also have
this exact value, despite the fact that random sampling was used for sample selection. However,
if the estimate is reported together with a confidence interval of, say, 160 to 170 cm at a stated
confidence level (commonly 95%), researchers can be confident, to that level, that the true average
height of all UK secondary school students is captured within this range.
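To make the idea more concrete, the following sketch computes a 95% confidence interval for a sample
mean using the t-distribution. It is an illustration only: the heights are simulated, the population
mean of 165 cm and spread of 8 cm are assumptions made for the example, and the 95% level is the
conventional choice rather than something prescribed by the scenario above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated heights (cm) of a random sample of 1,000 students (illustrative values).
heights = rng.normal(loc=165, scale=8, size=1000)

n = heights.size
sample_mean = heights.mean()
std_error = heights.std(ddof=1) / np.sqrt(n)  # standard error of the mean

confidence = 0.95
# Two-sided critical value of the t-distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)

lower = sample_mean - t_crit * std_error
upper = sample_mean + t_crit * std_error

print(f"mean = {sample_mean:.1f} cm, 95% CI = ({lower:.1f}, {upper:.1f}) cm")
```

Because the standard error shrinks with the square root of the sample size, a sample of 1,000
students yields a much narrower interval than a small sample would.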
9.3 KEY CONSIDERATIONS PRIOR TO CONDUCTING STATISTICAL ANALYSIS
Before conducting statistical analysis in Python, key aspects of the data collection process, as well
as the tools and methods that will be used for the analysis of the collected data, must be considered.
At a basic level, such considerations include:
• the measurement scales and the types of variables that will be used for data collection,
• the hypothesis being tested, and
• the statistical tests that will be used for data analysis.
Observation 9.11 – Variable: A characteristic, factor, or quantity that can be measured. As the name
suggests, it varies between subjects and/or changes over time. It is directly related to the type of
statistical analysis adopted for a given task.
A variable is a characteristic, factor, or quantity that can be measured, and which may vary between
subjects or change over time (or both). For example, age is a variable that varies between individuals
and changes over time, while income also varies between individuals but may, or may not, change
over time. The reason the type of the variable is important is that it is directly related to the type of
statistical analysis adopted for a given task. This is true for both descriptive and inferential statistics.
Certain statistical analysis tests can be used only with certain types of data. For instance, if statistical methods suitable for categorical data are used with continuous data, the results are bound to
be inconsistent and inaccurate. Hence, knowing the type of data that will be collected in advance
enables one to choose the appropriate analysis method.
Variables are generally categorized according to the type of measurement they are used for and
the level of detail of this measurement. The following sections briefly introduce the different types
of variables, the associated types of statistical tests, and how to choose the right statistical test based
on the type of variable at hand.
9.3.1 Level of Measures: Categorical and Numerical Variables
Categorical variables, also known as qualitative variables, describe categories or factors of
objects, events, or individuals. An example is gender, which contains a finite number of categories
(e.g., female, male). Categorical variables can also take numerical values (e.g., 1 for female, 2 for
male). However, these values are only used for coding and indexing purposes and do not have any
mathematical meaning. There are two types of categorical variables: nominal and ordinal. A brief
description of each type is provided below.
Observation 9.12 – Categorical Variables (Nominal, Ordinal): Categorical (or qualitative) variables
describe categories or factors of objects, events, or characteristics of individuals with no
mathematical meaning. Nominal variables take discrete values that have no particular order, while
ordinal variables take discrete, ordered values.
• Nominal variables can have two or more discrete states, but there is no implied order
for these states. For example, gender (i.e., female, male) is a nominal variable. Marital
status (e.g., unmarried, married, divorced) and ethnic background (e.g., African, Asian,
Caucasian) are also examples of nominal variables. Similarly, in medical research,
patients that are either in treatment or not in treatment can also be described by a nominal variable.
• Ordinal variables can also have two or more discrete states, but contrary to nominal variables,
they can be ordered or ranked. For example, a satisfaction scale that lets respondents choose a value
between 1 (strongly disagree) and 5 (strongly agree) is an example of an ordinal variable. Age group
(e.g., 20–29, 30–39 and so on) and income band can also be expressed as ordinal variables.
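The distinction between nominal and ordinal variables can be expressed directly in Pandas through its
Categorical type. The short sketch below uses made-up responses; the variable names and category
labels are illustrative assumptions.

```python
import pandas as pd

# Nominal variable: the categories have no inherent order.
marital_status = pd.Categorical(
    ["married", "unmarried", "divorced", "married"],
    categories=["unmarried", "married", "divorced"],
    ordered=False,
)

# Ordinal variable: a 1-5 satisfaction scale whose categories are ranked.
satisfaction = pd.Categorical([5, 2, 4, 2], categories=[1, 2, 3, 4, 5], ordered=True)

# The integer codes are used for indexing only and carry no mathematical meaning.
print(marital_status.codes)

# Minimum and maximum are defined only because the categories are ordered.
print(satisfaction.min(), satisfaction.max())

# Frequencies are meaningful for both nominal and ordinal variables.
print(pd.Series(marital_status).value_counts())
```

Calling min() or max() on the unordered marital_status variable would raise an error, which reflects
the fact that ranking has no meaning for nominal data.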
Continuous variables, also known as quantitative variables, are variables that can increase or
decrease steadily, or by a quantifiable degree or amount. There are two types of continuous variables, namely interval and ratio. A brief description of each type is provided below.
• Interval variables can be measured and ordered, and the intervals between the different values are
equally spaced. For example, temperature measured in degrees (e.g., Celsius) is an interval variable,
as the difference between 40°C and 30°C, and between 30°C and 20°C, is an equidistant interval of
10°C. Another example of an interval variable is pH. (Age measured in years, months, or days, instead
of the ordinal age groups of the previous example, is also a continuous variable, although it has a
true zero and is therefore more accurately treated as a ratio variable.) A defining characteristic of
interval variables is that they do not have a true zero. For instance, there is no such thing as no
temperature, as a temperature of 0°C is still a measurable temperature. Hence, interval variable
values can be added or subtracted (but not multiplied or divided).
• Ratio variables are similar to interval variables, with one important difference: they do
have a true zero point. When a ratio variable equals zero, this means there is none of
this variable. Examples of ratio variables include height, weight, and length. Also, due to
the existence of a true zero point, the ratio between two measurements takes a new meaning. For
instance, an object weighing 10 kg is twice as heavy as an object weighing 5 kg.
However, a temperature of 30°C (interval variable) cannot be considered twice as hot as
15°C. One can only claim that the 30°C temperature is higher than 15°C.
Observation 9.13 – Continuous Variables (Interval, Ratio): Continuous (or quantitative) variables
take continuous numerical values describing measured objects, events, or characteristics of
individuals. They can take the form of intervals with no true zero values, or ratios where a true
zero value has a logical meaning.
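The point about ratios can be checked with a small calculation. The sketch below introduces the Kelvin
scale, which is not part of the discussion above, purely as an illustration of a temperature scale
that does have a true zero; the helper function name is hypothetical.

```python
ABSOLUTE_ZERO_C = -273.15  # 0 K expressed in degrees Celsius


def celsius_to_kelvin(celsius):
    """Convert an interval-scale Celsius value to the ratio-scale Kelvin value."""
    return celsius - ABSOLUTE_ZERO_C


# Ratio variable: weight has a true zero, so the ratio is meaningful.
weight_ratio = 10 / 5  # 10 kg is twice as heavy as 5 kg

# Interval variable: the naive ratio of two Celsius values is not meaningful...
naive_temperature_ratio = 30 / 15

# ...whereas the ratio computed on a scale with a true zero is.
kelvin_ratio = celsius_to_kelvin(30) / celsius_to_kelvin(15)

print(weight_ratio, naive_temperature_ratio, round(kelvin_ratio, 3))
```

On the Kelvin scale, 30°C is only about 5% hotter than 15°C, which shows why the apparent factor of
two obtained from the Celsius values should not be interpreted as a ratio.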
9.3.2 Types of Variables: Dependent and Independent Variables
Variables are typically classified as either independent or dependent. Independent variables, also
called predictor, explanatory, controlled, input, or exposure variables, have an influence on the
dependent variables, but are not affected by any other variables themselves, hence their name.
Accordingly, dependent variables, also known as observed, outcome, output, or response variables,
are variables that change based on changes in the associated independent variables. Ultimately, in a
scientific experiment, one seeks to change or control the independent variables in order to test the
effects of these changes on the dependent variables.
Observation 9.14 – Dependent and Independent Variables: Independent variables are changed/controlled
in an experiment that tests their effect on the dependent variables. Both independent and dependent
variables can be either categorical or continuous.
As an example, one can consider the following research question:
Does the length of treatment result in improved health outcomes?
In this case, the length of treatment is the independent variable, while health outcomes are the
dependent variables. Similarly, if one poses the question:
How does aspirin dosage affect the frequency of second heart attacks?
The aspirin dosage would be the independent variable, while the heart attack frequency would be
the dependent variable.
It is worth mentioning that any type of categorical or continuous variables can be either independent or dependent, based on the context. A summary of the various different types of variables is
provided in Figure 9.1 below.
9.3.3 Statistical Analysis Types and Hypothesis Tests
There are various statistical analysis types and hypothesis tests. In general, statistical analysis
can address three main types of investigative questions: summary, comparison, and relationship. A more
detailed list of common statistical analysis types, and the categories of problems they are used to
address, is presented in Table 9.2 below.
Observation 9.15 – Types of Statistical Analysis: There are three statistical analysis types: summary
analysis using descriptive statistics, and comparison and relationship analysis both using
inferential statistics.
9.3.3.1 Statistical Analysis for Summary Investigative Questions
Statistical analysis of this type is mainly used for summarizing and describing a single variable at a
given time. The most common statistical methods associated with this type of analysis are those calculating the mean and median for continuous variables and the frequency for categorical variables.
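A minimal sketch of such summary statistics with Pandas is given below. The small data frame, its
column names (bmi, smoker), and its values are hypothetical and serve only to illustrate the calls.

```python
import pandas as pd

# Hypothetical survey data; column names and values are for illustration only.
df = pd.DataFrame({
    "bmi": [22.5, 27.1, 31.4, 24.8, 27.1, 29.0],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
})

# Continuous variable: measures of central tendency.
bmi_mean = df["bmi"].mean()
bmi_median = df["bmi"].median()
bmi_mode = df["bmi"].mode().tolist()  # mode() can return more than one value

# Categorical variable: frequency of each category.
smoker_counts = df["smoker"].value_counts()

print(f"mean = {bmi_mean:.2f}, median = {bmi_median:.2f}, mode = {bmi_mode}")
print(smoker_counts)
```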
9.3.3.2 Statistical Analysis for Comparison Investigative Questions
This type of statistical analysis is related to the comparison of the means of a single variable between
two or more groups. For example, it can be used if one needs to know whether the Body Mass Index
(BMI) numbers of men and women are significantly different to each other, or whether a new drug
can reduce blood pressure (i.e., measuring blood pressure before and after treatment). In this type of
analysis, p-­value is used to determine whether the difference is statistically significant.
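FIGURE 9.1 Types of variables. [Diagram: a variable is either Continuous, subdivided into Interval
(30°C) and Ratio (height), or Categorical, subdivided into Ordinal (Likert scale) and Nominal
(age, gender); it can act as a Dependent (health outcome) or an Independent (treatment type & time)
variable.]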
TABLE 9.2
Common Types of Statistical Tests
Statistics    Investigative Question   Common Statistical Tests
Descriptive   Summary                  Continuous variable: Mean, Median, Mode
                                       Categorical variable: Frequency
Inferential   Comparison               Continuous variable:
                                         Nonparametric: Mann-Whitney U test; Wilcoxon Signed-Rank test;
                                           Kruskal-Wallis, Mood’s median test
                                         Parametric: Student’s t-test; Paired Student’s t-test;
                                           Analysis of Variance test (ANOVA)
                                       Categorical variable: Chi-Square test
Inferential   Relationship             Association strength without causal relationship:
                                         Pearson’s correlation coefficient; Chi-Squared test
                                       Association strength with causal relationship:
                                         Linear regression; Logistic regression
Overall, there are six common types of tests that can be used for comparative hypothesis cases.
The choice of the appropriate test for a particular task depends on a number of factors, such as the
sample size, the data characteristics, and the comparison groups. Tests of this type can be further
divided into two main categories: parametric and non-parametric (Table 9.3).
The main practical difference between parametric and non-parametric analysis is that the former tests
the group means, while the latter tests the group medians. When the sample size of each group is large
enough and the comparison data are continuous and normally distributed, parametric statistical
tests are preferable. Parametric tests have more statistical power than their non-parametric
counterparts, and can thus detect an existing, underlying effect more efficiently. However, in cases
where the sample size is small, or the comparison data are skewed or non-continuous (e.g., five-point
Likert scales) (De Winter & Dodou, 2010), non-parametric statistical methods are more appropriate.
Table 9.4 provides a simple indicative list of sample size thresholds for choosing whether
parametric or non-parametric tests should be used. The reader can find more on this topic in the
various available sources assisting users with statistical test selection, such as Minitab (2015).
TABLE 9.3
Common Types of Comparison Statistical Tests
Parametric Tests (Means)             Non-Parametric Tests (Median)
Independent Student t-test           Mann-Whitney U test
Dependent (Paired) Student t-test    Wilcoxon Signed-Rank test
Analysis of Variance Test (ANOVA)    Kruskal-Wallis, Mood’s median test
TABLE 9.4
Simple Guide for Choosing between Parametric and Non-Parametric Tests
Non-Parametric Tests                 Sample Size                                    Parametric Tests
Mann-Whitney U test                  N = 15 in each group                           Independent Student t-test
Wilcoxon Signed-Rank test            N = 30                                         Dependent (Paired) Student t-test
Kruskal-Wallis, Mood’s median test   Compare 2–9 groups, n = 15 in each group       Analysis of Variance test (ANOVA)
                                     Compare 10–12 groups, n = 20 in each group
Irrespective of the sample size, when one compares two different means or medians, statistical
analysis can be further divided into two types, depending on whether the mean or median comes
from independent groups or from repeated measurements within the same group. If it comes from
independent groups, independent t-tests should be used for parametric analysis and Mann-Whitney
U tests for non-parametric analysis. Examples of such cases are analyses based on measurements
of BMI for men and women, or the height of the UK and US populations. If the mean or median comes
from repeated measurements within the same group, dependent t-tests should be used for parametric
analysis and Wilcoxon Signed-Rank tests for non-parametric analysis. An example of this is the
measurement of blood pressure before and after using a new drug.
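The four two-group tests named above are all available in SciPy. The sketch below shows how each
could be called; the data are simulated, and the means, spreads, and sample sizes are arbitrary
assumptions made for illustration rather than values from a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Independent groups: simulated BMI for two unrelated groups (illustrative values).
bmi_men = rng.normal(loc=27, scale=4, size=40)
bmi_women = rng.normal(loc=26, scale=4, size=40)

t_ind = stats.ttest_ind(bmi_men, bmi_women)  # parametric
u_test = stats.mannwhitneyu(bmi_men, bmi_women, alternative="two-sided")  # non-parametric

# Repeated measurements: blood pressure of the same patients before and after treatment.
before = rng.normal(loc=150, scale=12, size=30)
after = before - rng.normal(loc=8, scale=5, size=30)

t_rel = stats.ttest_rel(before, after)  # parametric, paired
w_test = stats.wilcoxon(before, after)  # non-parametric, paired

results = {
    "Independent t-test": t_ind,
    "Mann-Whitney U": u_test,
    "Paired t-test": t_rel,
    "Wilcoxon Signed-Rank": w_test,
}
for name, result in results.items():
    print(f"{name}: statistic = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

The paired tests receive the two measurement series in matching order, since each value in the first
series is compared with the value for the same patient in the second.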
One can also compare three or more different means or medians. An example of this is the comparison of
height across different ethnic groups. In this case, Analysis of Variance (ANOVA) tests should be used
for parametric analysis and Kruskal-Wallis tests for non-parametric analysis (Table 9.3). In simple
terms, ANOVA can be viewed as an extension of the t-test that allows one to compare the means of more
than two groups.
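A comparison of three groups could be sketched as follows, using the one-way ANOVA and Kruskal-Wallis
functions provided by SciPy. The three groups and their simulated heights are hypothetical and are
used here only to show the calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated heights (cm) for three hypothetical groups; values are illustrative.
group_a = rng.normal(loc=160, scale=7, size=30)
group_b = rng.normal(loc=170, scale=7, size=30)
group_c = rng.normal(loc=180, scale=7, size=30)

# One-way ANOVA; H0: all group means are equal.
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Kruskal-Wallis H-test, the non-parametric counterpart listed in Table 9.3.
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kruskal:.4g}")
```

A significant result from either test indicates only that at least one group differs from the others;
it does not identify which one.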
9.3.3.3 Statistical Analysis for Relationship Investigative Questions
This type of statistical analysis is used to investigate the relationship between two or more variables. Depending on the type of variable and the purpose of the analysis, it can be further divided
into four sub-­categories, as outlined in Table 9.5.
In general terms, relationship statistical analysis is suitable for:
• hypothesis testing,
• measuring the association strength, and
• investigating causal relationships.
Hypothesis testing is an attempt to check whether two variables are associated with each other. For
example, one may wish to know whether an increase in daily sodium intake results in blood pressure
changes (Figure 9.2). If the test results in a p-value below 0.05, a significant relationship is
assumed to exist between sodium intake and blood pressure.
TABLE 9.5
Common Types of Relationship Statistical Tests
Type of Variable       Statistical Test                     Association Strength   Causal Relationship
Continuous Variable    Correlation (Linear Regression)      Correlation            Linear Regression
Categorical Variable   Chi-Square (Logistic Regression)     –                      Logistic Regression
TABLE 9.6
R value and Strength of Correlation
R value   Strength of Correlation
1.0       Perfect positive correlation
0.7       Strong positive correlation
0.5       Moderate positive correlation
0.3       Weak positive correlation
0         No correlation
−0.3      Weak negative correlation
−0.5      Moderate negative correlation
−0.7      Strong negative correlation
−1.0      Perfect negative correlation
Association strength is a measurement of how closely the two variables are correlated (Table 9.6).
This is usually expressed in terms of the R or R² value, ranging from −1.0 to 1.0 or 0 to 1.0
respectively. Positive numbers indicate a positive correlation (e.g., if one variable increases the
other increases too) and negative numbers an inverse correlation (e.g., if one variable increases the
other decreases). In this context, a value of 1.0 or −1.0 indicates a perfect correlation, and 0 no
correlation. A rule of thumb is that when R is higher than 0.7 or lower than −0.7 the two variables are
considered to be highly correlated. When R is between −0.3 and 0.3, the correlation between the two
variables is regarded as weak. In the example presented in Figure 9.2, R is 0.82. Thus, there is a
strong positive relationship between sodium intake and blood pressure. In other words, increasing the
daily sodium intake is highly correlated with high blood pressure.
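The R value and its accompanying p-value can be obtained with the pearsonr function of SciPy, as in
the sketch below. The sodium intake and blood pressure values are simulated for illustration and are
not the data behind Figure 9.2. The helper function describe_strength is hypothetical and simply
encodes the rule of thumb stated above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated daily sodium intake (g) and blood pressure (mmHg); illustrative only.
sodium_g = rng.uniform(1, 6, size=100)
blood_pressure = 114.5 + 3.5 * sodium_g + rng.normal(0, 2, size=100)

# Pearson's correlation coefficient (R) and the p-value of the test of no correlation.
r, p_value = stats.pearsonr(sodium_g, blood_pressure)


def describe_strength(r_value):
    """Label an R value using the rule-of-thumb thresholds of 0.3 and 0.7."""
    size = abs(r_value)
    if size > 0.7:
        return "strong"
    if size <= 0.3:
        return "weak"
    return "moderate"


direction = "positive" if r > 0 else "negative"
print(f"R = {r:.2f} ({describe_strength(r)} {direction} correlation), p = {p_value:.3g}")
```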
The investigation of causal relationships is an attempt to relate the two variables via the equation
of a line that stretches across a cloud of points. The equation is usually expressed as Y = a + bX, and
it can be used for prediction. In the example presented in Figure 9.2, the causal relationship results
show that blood pressure equals 114.5 + 3.5 * daily sodium intake. This indicates that if the daily
sodium intake of individuals is known it is possible to predict their approximate blood pressure. For
instance, when the daily sodium intake is 3 g the blood pressure would be 125 mmHg, and would go
up by 3.5 mmHg for every 1 g increase of the daily sodium intake. This example provides a rather
simplified, but informative description of the causal relationship concept.
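The prediction described above can be reproduced with a few lines of Python. The sketch below first
evaluates the equation Y = 114.5 + 3.5X directly and then, as an illustration, estimates the intercept
and slope from simulated data with the linregress function of SciPy. The simulated data and the
function name predict_blood_pressure are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def predict_blood_pressure(sodium_g, intercept=114.5, slope=3.5):
    """Evaluate Y = a + bX for a given daily sodium intake in grams."""
    return intercept + slope * sodium_g


print(predict_blood_pressure(3))  # 114.5 + 3.5 * 3 = 125.0 mmHg

# Estimating a and b from (simulated, illustrative) observations.
rng = np.random.default_rng(5)
sodium_g = rng.uniform(1, 6, size=200)
blood_pressure = predict_blood_pressure(sodium_g) + rng.normal(0, 2, size=200)

fit = stats.linregress(sodium_g, blood_pressure)
print(f"a = {fit.intercept:.1f}, b = {fit.slope:.2f}, "
      f"R = {fit.rvalue:.2f}, p = {fit.pvalue:.3g}")
```

The fitted values should be close to, but not exactly equal to, the intercept and slope used to
simulate the data, because of the random noise added to each observation.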
When the two variables are continuous, two common types of statistical analysis can be used
to test their relationship: correlation and linear regression (McDonald, 2014). In simple terms,
correlation analysis yields a p-value that is used to test the hypothesis, and can quantify the
direction and strength of the relationship between two continuous variables by summarizing the result
with an R value. However, correlation cannot infer a cause-and-effect relationship. On the other
hand, linear regression provides a p-value for hypothesis testing similarly to correlation, but can
also summarize the causal relationship with an equation that describes the relationship between
variables.
FIGURE 9.2 Relationships between daily salt intake and blood pressure.
TABLE 9.7
Cheat Sheet for Choosing the Right Statistical Test
No. of Variables   Question Type   Dependent Variable   Independent Variable   Statistical Test
1                  Summary         Continuous           –                      Mean, Mode
1                  Summary         Categorical          –                      Frequency
1                  Comparison      Continuous           2 groups               t-Test
1                  Comparison      Continuous           3+ groups              ANOVA
1                  Comparison      Categorical          2+ groups              Chi-Square
2                  Relationship    Continuous           1 continuous           Correlation
2                  Relationship    Categorical          1 categorical          Chi-Square
2+                 Relationship    Continuous           1+ variables           Linear Regression
2+                 Relationship    Categorical          1+ variables           Logistic Regression
When the variables are categorical (i.e., nominal or ordinal), their relationship can be tested
using two additional types of statistical analysis: chi-square test and logistic regression. The
chi-square test is used to test the association by providing a p-value. For example, if one is
interested in the relationship between gender and smoking status, the chi-square test can be used. If
the result is a p-value of 0.015, a statistically significant association between gender and smoking
status can be assumed. As in correlation, the chi-square test cannot infer a cause-and-effect
relationship. To do so, logistic regression is required. The latter works like linear regression in
the sense that it can summarize the causal relationship with an equation and use the equation for
prediction. The main difference between the two is that logistic regression is used for categorical
data, while linear regression is used for continuous data.
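A chi-square test of association between two categorical variables could be sketched as follows, using
the chi2_contingency function of SciPy on a contingency table of counts. The counts below are invented
for illustration and do not correspond to the p-value of 0.015 quoted above.

```python
import pandas as pd
from scipy import stats

# Hypothetical survey counts (rows: gender, columns: smoking status).
observed = pd.DataFrame(
    {"smoker": [30, 15], "non-smoker": [70, 85]},
    index=["male", "female"],
)

# H0: gender and smoking status are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(observed.to_numpy())

print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print(pd.DataFrame(expected, index=observed.index, columns=observed.columns))
```

The expected array holds the counts that would be anticipated if the two variables were independent;
the test statistic measures how far the observed counts deviate from them.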
The reader can find a list and a brief description of a number of common statistical analysis tests
discussed in this section in Table 9.7.
9.3.4 Choosing the Right Type of Statistical Analysis
Observation 9.16 – Selecting the Appropriate Test: The decision of what test to use is not an arbitrary one but depends on a number of factors, such as the types and number of variables at hand, the number of groups to be tested, the sample size, and the data distribution characteristics.
Selecting the right type of statistical analysis is one of the most important considerations when conducting analytical work. This decision is generally based on the type and number of variables, and it can be a challenging process for those with less experience in this field of study. Table 9.7 presents a cheat sheet that can be used to determine when to choose the statistical tests mentioned in Section 9.3.3, Table 9.2. The first column contains the number of variables under investigation and the second the type of the research question one is trying to answer.
The third and fourth columns contain the types of the independent and dependent variables, and the
fifth the recommended statistical test. A decision tree chart is also provided in Figure 9.3, with the
recommended statistical test at the end of each tree branch. By using these resources as a guide,
the reader should be able to find a suitable statistical test for the data type and research question at
hand. It must be noted that this is just a brief introduction to the topic of statistical test suitability
and selection. In addition to any decisions based on such guides, it is always helpful and advisable
to consult statisticians and analysis experts before embarking on any serious analytical task.
FIGURE 9.3 Choosing the right statistical analysis.
9.4 SETTING UP THE PYTHON ENVIRONMENT
General information related to the process of setting up, and operating in, the Python environment is provided in Chapter 1 of this book. Most of the essential requirements and basic programming concepts presented in the earlier chapters are transferable and, thus, apply to the work and ideas presented here. Nevertheless, if the reader opts to focus solely on this chapter, the sections below provide a quick guide on how to set up the essential platforms, namely Anaconda and Jupyter, as well as the libraries and modules required for the purposes of statistical analysis.
9.4.1 Installing Anaconda and Launching the Jupyter Notebook
The official Anaconda download page allows the user to download and install the latest version
of the Python platform (see Chapter 1) (Anaconda Inc., 2020). The code and examples provided in
this chapter were written and tested using Python 3.9. Once Anaconda is installed, the Anaconda
Navigator can be used to launch applications, and simple Python programs can be created and run
using the Spyder or Jupyter Notebook environments.
For the purposes of this chapter, Jupyter Notebook is the platform of choice. This is due to a number of reasons. Firstly, it offers an appropriate environment for the Pandas library, which is required
for tasks related to data exploration and modelling. Secondly, it allows for the execution of code in
cells rather than running the entire file, something that can save time when it comes to debugging.
Thirdly, it provides an easy way to visualize datasets and plots.
9.4.2 Installing and Running the Pandas Library
To install Pandas, the reader can type !pip install pandas in the command input cell. Since
Pandas is used frequently, it is common to import Pandas with a shorter name, namely pd. This is
done by using the import pandas as pd expression:
!pip install pandas
import pandas as pd
9.4.3 Review of Basic Data Analytics
With Pandas imported, the user can read data from local .csv files using the pd.read_csv() function and the full path directory of the file. For example, the following command can be used to read
data from a local file named purchase.csv:
df = pd.read_csv(r'C:\Python\Example\purchase.csv', index_col=0)
The same applies to reading data files of other types, like Excel spreadsheets, SQL, and JSON, using the appropriate functions (i.e., pd.read_excel(), pd.read_sql_query(), and pd.read_json()) (The Pandas Development Team, 2020). For the purpose of importing tables from HTML webpages, Pandas uses the pd.read_html() function (Sharma, 2019). The following example uses an HTML table from a cryptocurrency website to showcase this (WorldCoinIndex, 2021). Firstly, the requests library is imported. After assigning the website link to variable url, function requests.get() attempts to connect to the web server and stores the response in variable crypto_url. If a connection is established, the page content in crypto_url.text is passed to pd.read_html(), which returns a list of dataframes (one per HTML table found) that is stored in variable crypto_df. The first element of this list is the dataframe of interest. It contains columns with unnecessary data, which are discarded from the main dataset. Finally, the first five rows of the dataset are displayed:
import pandas as pd
import requests
from io import StringIO

# Define the url
url = 'https://www.worldcoinindex.com/'
# Request the url
crypto_url = requests.get(url)
# Read the HTML tables into a list of Pandas dataframes
# (StringIO wraps the text as a file-like object, which newer Pandas versions expect)
crypto_df = pd.read_html(StringIO(crypto_url.text))
# Acquire only the relevant data from the dataset
dataset = crypto_df[0]
# Limit the displayed columns
df = dataset.iloc[0:102, 2:5]
# Print the first five rows of the dataset
print(df.head(5))
Output 9.4.3:
              Name Ticker  Last price
0          Bitcoin    BTC    $ 33,839
1         Ethereum    ETH  $ 2,140.42
2    Axie Infinity    AXS     $ 40.82
3         Dogecoin   DOGE  $ 0.193697
4  Ethereumclassic    ETC     $ 47.64
A dataframe is a two-dimensional tabular data structure with labeled rows and columns. To view the dataframe, the user can simply call the name of the variable it is stored in. For instance, calling variable df from the pd.read_csv() example presented above will display the dataframe that is stored in it. The first and last rows of a dataframe can also be retrieved using commands df.head() and df.tail() respectively, each of which returns five rows by default. Passing a specific number as an argument retrieves the corresponding number of rows (e.g., df.head(10) returns the first ten rows).
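The behaviour of head() and tail() can be illustrated with a minimal sketch. The dataframe below is hypothetical (its column names and values are assumptions made only for this illustration) and is not part of the datasets used in this chapter:

```python
import pandas as pd

# Hypothetical data, used only to illustrate head() and tail()
df = pd.DataFrame({"id": range(1, 21),
                   "score": [x * 1.5 for x in range(1, 21)]})

first_five = df.head()    # default: first five rows
last_five = df.tail()     # default: last five rows
first_ten = df.head(10)   # explicit number of rows

print(first_five)
print(last_five)
print(len(first_ten))
```

Both methods return new dataframes, so the original dataframe is left unchanged.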
When it comes to saving the dataframe, various different file formats can be chosen. These
include, but are not limited to, the following:
1. Plain Text CSV: A commonly used, straightforward format.
2. Pickle: Python's native data storage format.
3. HDF5: A format designed to store large amounts of data.
4. Feather: A fast and lightweight binary file format that is also compatible with statistical analysis software R.
Depending on the requirements and nature of the task at hand, each format has its own advantages and disadvantages. The example below uses Pickle, as the process is rather straightforward: function to_pickle() is used to save the dataframe to file example.pkl and pd.read_pickle() to retrieve it:
df.to_pickle('example.pkl')
df1 = pd.read_pickle('example.pkl')
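As a rough comparison of two of the formats listed above, the following sketch saves a small dataframe as both CSV and Pickle and reads it back. The dataframe contents and file names are assumptions made for illustration, and a temporary directory is used so that no files are left behind:

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe used only for illustration
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    pkl_path = os.path.join(tmp_dir, "example.pkl")

    # CSV: human-readable text; the index is written as the first column
    df.to_csv(csv_path)
    df_csv = pd.read_csv(csv_path, index_col=0)

    # Pickle: binary format that stores the dataframe object as it is
    df.to_pickle(pkl_path)
    df_pkl = pd.read_pickle(pkl_path)

print(df_csv)
print(df_pkl.equals(df))
```

CSV files can be opened by almost any tool, whereas Pickle files are intended to be read back by Python and should only be loaded from trusted sources.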
9.5 STATISTICAL ANALYSIS TASKS
Once the Python environment is configured and the appropriate methods and tools are determined,
the reader can focus on the practical implementation of the various analytical tasks using Python.
This section provides coding examples for various statistical analysis concepts and tests as well as
information on the interpretation of the test results.
9.5.1 Descriptive Statistics
Descriptive statistics are typically used for summarizing data from a sample. Depending on the type of
measures used, a number of tools can be utilized for analysis and visualization (Table 9.8). If the type of
measure is a continuous variable, functions and methods like .describe(), plot(kind='hist'),
or plt.hist() can be used to generate summarized estimates or plot histograms (Koehrsen, 2018).
TABLE 9.8
Common Descriptive Statistical Tools for Different Types of Measures
Type of Measure        Summarized Values                          Plot
Continuous Variable    Mean, Median, Standard Deviation, Range    Histogram, Box Chart and similar
Categorical Variable   Frequency, Proportion, Percentage          Pie Chart, Bar Chart, Box Chart and similar
As an example, assume a survey is conducted in order to gather personal information (e.g., age, gender, and BMI) from adults (18+) in a particular geographic area, and this information should be used to describe the age distribution within the sample population. The examples below show how one can generate the associated summary statistics and plot graphs:
import pandas as pd

# Define the floating numbers format (two decimals, thousands separator)
pd.options.display.float_format = '{:,.2f}'.format
# Define the analysis dataset
dataset = pd.read_csv("Survey.csv", index_col = 0)
print("Descriptive Statistics for Age")
print(dataset[["age"]].describe())
# Draw the histogram of the 'age' column
dataset["age"].plot(kind = 'hist', title = 'Age');
Output 9.5.1.a:
Descriptive Statistics for Age
           age
count 2,849.00
mean     55.83
std      16.06
min      18.00
25%      44.00
50%      58.00
75%      67.00
max     101.00
The results indicate that the mean age of this group is 55.83 years. The age ranges from 18 to 101,
and the distribution is symmetrically centred around the mean.
For categorical variables one can use the .value_counts() method to generate the frequency of all values in a column, and the plot(kind='bar') function to plot the frequency using bars (Tavares, 2017). Using the same survey example, the gender distribution for the surveyed group can be calculated and plotted using the following commands:
import pandas as pd

# Define the analysis dataset
dataset = pd.read_csv("Survey.csv", index_col = 0)
print("Descriptive Statistics for Gender")
print(dataset[["gender"]].describe())
# Draw the bar graph for the gender column
dataset["gender"].value_counts().plot(kind = "bar",
    title = "Gender", rot = 0)
Output 9.5.1.b:
Descriptive Statistics for Gender
        gender
count     2849
unique       2
top     Female
freq      1660
<AxesSubplot:title={'center':'Gender'}>
The results show that there are 1,660 females and 1,189 males within the surveyed group, and the
related plot is generated.
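Table 9.8 also lists proportions and percentages as summaries for categorical variables. A minimal sketch of how these might be obtained with value_counts() is given below. The series is hypothetical and far smaller than the survey data, and the variable names are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical responses, used only for illustration
gender = pd.Series(["Female", "Male", "Female", "Female", "Male"], name="gender")

counts = gender.value_counts()                     # absolute frequencies
proportions = gender.value_counts(normalize=True)  # relative frequencies (sum to 1)
percentages = (proportions * 100).round(1)         # the same values as percentages

summary = pd.DataFrame({"count": counts,
                        "proportion": proportions,
                        "percent": percentages})
print(summary)
```

The normalize=True argument divides each frequency by the total number of non-missing values.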
As the topic of descriptive statistics is covered in detail in Chapter 8: Data Analytics and
Data Visualization, the information provided here is only meant to function as a quick reference.
Nevertheless, it is important to mention that descriptive statistics are frequently used as a way to
gauge the data and provide context to many of the inferential statistics tasks presented in the following sections.
9.5.2 Comparison: The Mann-­Whitney U Test
Observation 9.17 – The Mann-Whitney U Test: A non-parametric test for continuous variables. It tests whether the distributions of two independent samples are equal. It is appropriate when the sample size is small or the data are skewed. Use the mannwhitneyu() function from the SciPy library.
The Mann-Whitney U test is a type of non-parametric test for continuous variables. It is used to test whether the distributions of two independent samples are equal. This test is appropriate when the sample size is small, or the data are skewed.
As a practical example, one can consider a clinical trial comparing the treatment effects of a standard and a new therapy for patients with depression. Ten participants are randomly allocated to each of the two groups (i.e., standard therapy/new therapy). The primary outcome of the measurements is the depression scores, ranging from 1 (extremely depressed) to 100 (extremely euphoric):
Standard therapy   85  65  70  55  40  75  30  80  20  80
New therapy        75  40  60  40  50  65  35  70  25  40
The null hypothesis (H0) is that the depression scores of the two therapies are equal. Since the
sample size is small (<20), the Mann-­Whitney U Test is the appropriate choice for analysis. To run
the test, the user can use the mannwhitneyu() function from the SciPy library. Data arrays
data1 and data2 contain the depression scores of the standard and new therapies. The two sets
of results can be compared using the mannwhitneyu(data1, data2) function:
# Example of the Mann-Whitney U Test
from scipy.stats import mannwhitneyu

# Standard therapy
data1 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
# New therapy
data2 = [75, 40, 60, 40, 50, 65, 35, 70, 25, 40]
mannwhitneyu(data1, data2)
Output 9.5.2:
MannwhitneyuResult(statistic=34.0, pvalue=0.11941708700675263)
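The result object can also be unpacked and compared against a significance level in code. The sketch below is one possible way of doing so, under the assumption that a two-sided alternative is intended. The alternative is requested explicitly because the default value of this argument, and which group's U statistic is reported, have changed between SciPy versions, so the exact numbers printed may differ from the output shown above:

```python
from scipy.stats import mannwhitneyu

# Scores copied from the example above
standard = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
new = [75, 40, 60, 40, 50, 65, 35, 70, 25, 40]

alpha = 0.05  # significance level, chosen before running the test

# Request the two-sided alternative explicitly instead of relying on defaults
u_stat, p_value = mannwhitneyu(standard, new, alternative="two-sided")

reject_h0 = p_value < alpha
print(f"U = {u_stat}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

Fixing alpha before looking at the p-value keeps the decision rule independent of the observed result.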
The results provide two values: the U statistic (34.0) and the p-value (0.119). Since the latter is larger than the significance level of 0.05, there is insufficient evidence to conclude that the depression scores differ between the two therapies. Hence, the null hypothesis cannot be rejected, and the data do not show that the new therapy improves depression scores compared to the standard therapy.
9.5.3 Comparison: The Wilcoxon Signed-Rank Test
Observation 9.18 – The Wilcoxon Signed-Rank Test: A non-parametric test for continuous or ordinal variables. It tests whether the distributions of two paired samples are equal. It is appropriate when the sample size is small or the data are skewed. Use the wilcoxon() function from the SciPy library.
The Wilcoxon Signed-Rank Test is used to test whether the distributions of two paired samples are equal or not. It is a non-parametric test that can be used for both continuous and ordinal variables.
As an example, one can assume a test during which depression score measurements are taken
before and after a newly developed therapy for ten patients, and the goal is to find whether the
therapy makes a difference:
Patient           1   2   3   4   5   6   7   8   9   10
Before therapy    85  65  70  55  40  75  30  80  20  80
After therapy     75  40  50  40  50  65  35  20  25  40
The null hypothesis (H0) is that there is no difference in depression scores before and after the
therapy. Since the data are taken from pairs and the sample size is small, the Wilcoxon Signed-­
Rank Test is an appropriate choice. To run the test, the user can use the wilcoxon() function from
the SciPy library. Data arrays data1 and data2 contain the depression scores before and after
therapy. The two sets of results can be compared using the wilcoxon(data1, data2) function:
# Example of the Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon

# Before therapy
data1 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
# After therapy
data2 = [75, 40, 50, 40, 50, 65, 35, 20, 25, 40]
wilcoxon(data1, data2)
Output 9.5.3:
WilcoxonResult(statistic=7.0, pvalue=0.037109375)
The test provides a p-value of 0.037, which is below the significance level of 0.05. Hence, the null
hypothesis can be rejected with the conclusion that the new therapy has a significant effect on the
depression scores.
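To make the reported test statistic less opaque, the following sketch reproduces it by hand: the paired differences are ranked by absolute size, and the ranks are summed separately for positive and negative differences. It assumes the two-sided form of the test, in which the smaller of the two rank sums is reported, and the variable names are chosen for illustration only:

```python
import numpy as np
from scipy.stats import rankdata

# Scores copied from the example above
before = np.array([85, 65, 70, 55, 40, 75, 30, 80, 20, 80])
after = np.array([75, 40, 50, 40, 50, 65, 35, 20, 25, 40])

diff = before - after            # paired differences
ranks = rankdata(np.abs(diff))   # rank the absolute differences; ties share the average rank

w_plus = ranks[diff > 0].sum()   # rank sum of the positive differences
w_minus = ranks[diff < 0].sum()  # rank sum of the negative differences
statistic = min(w_plus, w_minus)

print(w_plus, w_minus, statistic)
```

The smaller rank sum equals 7.0, which matches the statistic reported in Output 9.5.3.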
9.5.4 Comparison: The Kruskal-­Wallis Test
Observation 9.19 – The Kruskal-Wallis Test: A non-parametric test for continuous or ordinal variables with small sample size and/or data not normally distributed but with a similar skewness. It tests whether the differences between two or more groups are by chance or not. Use the kruskal() function from the SciPy library.
The Kruskal-Wallis Test is used to test whether the distributions (medians) of two or more independent samples are equal or not. It is used for continuous or ordinal variables when the sample size is small and/or data are not normally distributed. The test indicates whether the differences between the test groups are likely to have occurred by chance or not. It is worth noting that the Kruskal-Wallis Test is used under the assumption that the observations in each group come from populations with the same shape of distribution. Hence, if different groups have different distribution shapes (e.g., one is right-skewed and another left-skewed), the Kruskal-Wallis Test may produce inaccurate results (Fagerland & Sandvik, 2009).
As an example of how to use the test in Python, one can assume a case of three available options
to alleviate depression: standard therapy, new therapy, and new therapy plus exercise. The purpose
of the test is to determine whether there is any difference in depression scores between the three
therapy options with the following depression scores:
New therapy + exercise   90  80  90  30  55  90  55  85  40  90
New therapy              85  65  70  55  40  75  30  80  20  80
Standard therapy         75  40  50  40  50  65  35  20  25  40
Since the sample size is small and the depression scores are ordinal, the Kruskal-Wallis Test is an appropriate choice. To run the test in Python, one can use the kruskal() function from the SciPy library. Data arrays data1, data2, and data3 contain the depression scores for new therapy plus exercise, new therapy, and standard therapy, respectively. The three sets of results can be compared using the kruskal(data1, data2, data3) expression:
# Example of the Kruskal-Wallis Test
from scipy.stats import kruskal

# New therapy and exercise
data1 = [90, 80, 90, 30, 55, 90, 55, 85, 40, 90]
# New therapy
data2 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
# Standard therapy
data3 = [75, 40, 50, 40, 50, 65, 35, 20, 25, 40]
kruskal(data1, data2, data3)
Output 9.5.4:
KruskalResult(statistic=7.275735789710176, pvalue=0.026308376435655575)
The results show that the p-­value is 0.026, which is less than the significance level of 0.05. Hence,
the null hypothesis (H0) (i.e., the depression scores of the three therapies are equal) can be rejected,
with the conclusion that a significant difference exists between the three treatment options.
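Since the test compares distributions (medians), it can help to look at the group medians next to the test result. The sketch below does this for the three groups of the example. The dictionary layout and variable names are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import kruskal

# Scores copied from the example above
groups = {
    "new therapy + exercise": [90, 80, 90, 30, 55, 90, 55, 85, 40, 90],
    "new therapy": [85, 65, 70, 55, 40, 75, 30, 80, 20, 80],
    "standard therapy": [75, 40, 50, 40, 50, 65, 35, 20, 25, 40],
}

# Median score per group, to give the test result some context
medians = {name: float(np.median(scores)) for name, scores in groups.items()}

h_stat, p_value = kruskal(*groups.values())

for name, median in medians.items():
    print(f"{name}: median = {median}")
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

A significant result only indicates that at least one group differs. It does not identify which one, so pairwise follow-up comparisons would still be needed.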
9.5.5 Comparison: Paired t-­test
Observation 9.20 – The Paired t-Test: A parametric test for normally distributed data with no significant outliers. Use the ttest_rel() function from the SciPy library.
The Paired t-Test, also referred to as the Dependent t-Test, is used to test whether repeated measurements (means) taken from the same sample are significantly different. Since the measurements come from the same sample, the terms paired samples, matched samples or repeated measures are also commonly used for this type of test. The test is used under the assumption that the measurements are normally distributed and do not contain significant outliers. If the measurements are skewed or contain significant outliers, the Wilcoxon Signed-Rank Test should be used instead.
As an example, one can assume the case of a new drug developed to assist patients by reducing
blood pressure. To investigate the effectiveness of the new drug, the blood pressure of 100 patients
is firstly measured prior to taking the drug and also 3 months later. Since the goal is to determine
whether the new drug is effective, the null hypothesis (H0) is that the average blood pressure will be
the same before and after taking the drug. Assuming a dataset stored in a file named Blood.csv, the
user can conduct the Paired t-­Test in Python using the ttest_rel() function from the SciPy library:
import pandas as pd
from scipy.stats import ttest_rel

# Define the format of floating numbers
pd.options.display.float_format = '{:,.2f}'.format
# Define the dataset
dataset = pd.read_csv("Blood.csv", index_col = 0)
print("Descriptive Statistics for Blood before and after")
print(dataset[["Before", "After"]].describe())
# Prepare and display the scatter plot for the dataset
dataFrame = pd.DataFrame(data = dataset, columns = ["Before", "After"])
dataFrame.plot.scatter(x = "Before", y = "After",
    title = "Scatter chart for Blood.csv", figsize = (7, 7))
# Calculate the Paired t-Test
ttest_rel(dataset[["Before"]], dataset[["After"]])
Output 9.5.5:
Descriptive Statistics for Blood before and after
       Before  After
count   80.00  80.00
mean   153.39 147.55
std     10.49  13.57
min    138.00 125.00
25%    144.75 136.00
50%    151.50 146.00
75%    159.25 157.00
max    185.00 184.00
Ttest_relResult(statistic=array([2.91731434]), pvalue=array([0.00459528]))
Columns Before and After of the dataset correspond to the blood pressure measurements before and after the drug therapy. The results show that the average blood pressure before taking the new drug was higher (153.39 mmHg) compared to the measurement taken after drug administration (147.55 mmHg). The test provides a p-value of 0.0046, which is lower than the significance level of 0.05. Hence, the null hypothesis can be rejected with the conclusion that a statistically significant difference in blood pressure occurs after using the new drug.
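One way to see what the Paired t-Test does is to note that it is equivalent to a one-sample t-test on the paired differences, tested against a mean of zero. The sketch below illustrates this with a handful of hypothetical readings. The values are invented for illustration and are not taken from Blood.csv:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical blood pressure readings (mmHg) for eight patients
before = np.array([150, 162, 148, 155, 171, 144, 158, 166])
after = np.array([145, 160, 150, 149, 165, 140, 151, 160])

# Paired t-test on the two sets of measurements
paired = ttest_rel(before, after)

# One-sample t-test on the differences against a mean of zero
one_sample = ttest_1samp(before - after, popmean=0.0)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

Both calls give the same statistic and p-value, which shows that only the within-patient differences matter for this test.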
9.5.6 Comparison: Independent or Student t-­Test
Observation 9.21 – The Student t-Test: A parametric test for normally distributed data with no significant outliers. Use the ttest_ind() function from the SciPy library.
The Independent t-Test, also known as the Student t-Test, is used to test whether the means of two independent samples are significantly different. To conduct Independent t-Tests in Python, the ttest_ind() function from the SciPy library can be used. The function accepts two arrays as parameters, corresponding to the sets of data under investigation. The reader can find more information on the official SciPy.org website (The SciPy Community, 2020).
Using the same survey example, one can assume a case where the user needs to know whether the mean ages of men and women within the sample are different. In this context, the null hypothesis (H0) is that the mean ages of the two groups are equal:
import pandas as pd
from scipy.stats import ttest_ind

# Define the format of floating numbers (two decimals, thousands separator)
pd.options.display.float_format = '{:,.2f}'.format
# Define the dataset
dataset = pd.read_csv("survey.csv", index_col = 0)
print("Descriptive Statistics for age grouped by gender")
print(dataset["age"].groupby(dataset["gender"]).describe())
# Calculate the Student t-Test
ttest_ind(dataset.age[dataset.gender == 'Male'],
    dataset.age[dataset.gender == 'Female'])
Output 9.5.6:
Descriptive Statistics for age grouped by gender
          count  mean   std   min   25%   50%   75%    max
gender
Female 1,660.00 55.27 16.42 18.00 43.00 57.00 67.00 101.00
Male   1,189.00 56.61 15.50 19.00 45.00 58.00 68.00  98.00
Ttest_indResult(statistic=2.1993669348926157, pvalue=0.02793196707542121)
The first output shows that the average age for men (56.61) is higher than that of women (55.27). The Independent t-Test is conducted in order to determine whether this difference is significant. The first value of the test result is the t score (2.199), which is the ratio of the difference between the two group means to the variability within the groups. As a general rule, the higher the absolute t score, the bigger the difference between groups, and vice versa. To determine whether the t score is high enough, one has to rely on the p-value output. In this example, the p-value is 0.0279, which is lower than the significance level of 0.05. Thus, the null hypothesis can be rejected with the conclusion that there is a statistically significant difference between the age of male and female individuals.
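By default, ttest_ind() assumes that the two groups have equal variances. When this assumption is doubtful, the equal_var=False argument requests Welch's version of the test instead. The sketch below compares the two on randomly generated, hypothetical samples. The group sizes, means, and spreads are assumptions made for illustration and do not come from the survey data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)

# Hypothetical samples with different spreads and sizes
group_a = rng.normal(loc=56.0, scale=15.0, size=120)
group_b = rng.normal(loc=55.0, scale=25.0, size=40)

# Student t-test: assumes equal variances in both groups
student = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

When group sizes and variances are similar the two versions give close results, while larger differences between them suggest that the equal-variance assumption deserves a closer look.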
9.5.7 Comparison: ANOVA
Observation 9.22 – The ANOVA Test: A parametric test for normally distributed, independent observations, with homogeneity of variances. Use the f_oneway() function from the SciPy library.
The ANOVA (i.e., Analysis of Variance) Test is used to compare the means of three or more samples. It assumes independence of observations, homogeneity of variances, and normally distributed observations within groups. In Python, the user can utilize the f_oneway() function from the SciPy library to calculate the F-Statistic, which, in turn, can be used to calculate the p-value. The function accepts parameters corresponding to the sample measures for each group under consideration.
Using the same survey data as an example, one can assume that the user needs to know whether
the Body Mass Index (BMI) values are different across non-­smokers, former smokers and current
smokers (smoking status). The null hypothesis (H0) is that there is no difference between the means
of the BMIs among people from the three different groups:
import pandas as pd
from scipy.stats import f_oneway

# Define the format of floating numbers
pd.options.display.float_format = '{:,.2f}'.format
# Define the dataset
dataset = pd.read_csv("survey.csv", index_col = 0)
print("Descriptive statistics for survey by smokestat")
print(dataset.bmi.groupby(dataset.smokestat).describe(), "\n")
# Calculate the one-way ANOVA Test
print("Results of ANOVA by smokestat values of Never, Former, Current")
print(f_oneway(dataset.bmi[dataset.smokestat == "Never"],
               dataset.bmi[dataset.smokestat == "Former"],
               dataset.bmi[dataset.smokestat == "Current"]))
Output 9.5.7:
Descriptive statistics for survey by smokestat
             count  mean  std   min   25%   50%   75%   max
smokestat
Current     363.00 28.20 6.84 17.50 23.20 27.20 31.25 62.60
Former      755.00 29.22 6.24 16.80 25.05 28.20 32.40 66.20
Never     1,731.00 28.14 6.48 16.10 23.50 27.10 31.30 75.20

Results of ANOVA by smokestat values of Never, Former, Current
F_onewayResult(statistic=7.548128785289014, pvalue=0.0005377158828502398)
The first output shows that the former smokers have the highest mean BMI (29.22), followed by current smokers (28.20), and non-smokers (28.14). The output of the ANOVA Test shows that the F-Statistic is 7.55 and the p-value is 0.0005, indicating an overall significant effect of smoking status on BMI. However, at this point it is uncertain exactly where the difference between groups lies. To clarify this, one needs to conduct post-hoc tests. For more detailed information regarding post-hoc tests in Python, the reader can refer to the official documentation in Scikit-posthocs (2020).
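As a rough illustration of the idea behind post-hoc testing, the sketch below runs pairwise t-tests between the groups and applies a Bonferroni correction to the resulting p-values. This is only one simple approach, named here as a substitute for the dedicated post-hoc procedures referred to above. The data are randomly generated and hypothetical, loosely modelled on the group means and sizes of the example:

```python
from itertools import combinations

import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(seed=1)

# Hypothetical BMI samples for three smoking-status groups
groups = {
    "Never": rng.normal(28.1, 6.5, size=200),
    "Former": rng.normal(29.2, 6.2, size=150),
    "Current": rng.normal(28.2, 6.8, size=100),
}

# Overall test across the three groups
f_stat, p_overall = f_oneway(*groups.values())

# Pairwise comparisons with Bonferroni correction
pairs = list(combinations(groups, 2))
adjusted = {}
for name_a, name_b in pairs:
    _, p_raw = ttest_ind(groups[name_a], groups[name_b])
    # Multiply by the number of comparisons and cap the result at 1
    adjusted[(name_a, name_b)] = min(p_raw * len(pairs), 1.0)

print(f"ANOVA: F = {f_stat:.3f}, p = {p_overall:.4f}")
for pair, p_adj in adjusted.items():
    print(pair, round(p_adj, 4))
```

The correction is conservative, which is the price paid for keeping the overall chance of a false positive across all comparisons near the chosen significance level.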
9.5.8 Comparison: Chi-­Square
Observation 9.23 – The Chi-Square Test: A non-parametric test for categorical variables. It tests whether data from a single sample follow a specified distribution. Use the chisquare() function from the SciPy library.
As shown, the t-Test is used to check whether means differ between two groups. The Chi-square Test, also known as the Chi-squared Goodness-of-fit Test, is the equivalent of the t-test for categorical variables. It tests whether categorical data from a single sample follow a specified distribution (i.e., external or historical distribution).
For example, based on the example of a smoker status survey, one can assume that the proportions of non-smokers, former smokers, and current smokers are 30%, 10%, and 60% respectively. The government launched a health promotion campaign in an attempt to increase the smoking cessation rate. To evaluate the impact of the programme, the same survey was conducted for a second time a year later. The survey was completed by 500 people, and the data obtained were the following:
                   Non-Smokers   Former Smokers   Current Smokers
Before programme   150           50               300
After programme    140           80               280
Since the goal is to determine the impact of the health promotion programme, the null hypothesis (H0) assumes that the distribution of smoking status is the same prior to, and after the implementation of the programme and, thus, the health promotion campaign has no impact. In such cases, the Chi-square Test is an appropriate choice. In Python, the test can be conducted using the chisquare() function from the SciPy library. The function accepts parameters corresponding to the observed frequencies in each categorical variable:
import numpy as np
from scipy.stats import chisquare

# Define the datasets
before = np.array([150, 50, 300])
print("The dataset before the program:")
print(before)
after = np.array([140, 80, 280])
print("The dataset after the program:")
print(after)
print("The Chi-square test results are the following:")
# chisquare() treats its first argument as the observed and its second as the expected frequencies
print(chisquare(before, after))
Output 9.5.8:
The dataset before the program:
[150 50 300]
The dataset after the program:
[140 80 280]
The Chi-square test results are the following:
Power_divergenceResult(statistic=13.392857142857142,
pvalue=0.0012353158761688927)
The first value of the output (13.39) is the Chi-­square value, followed by the p-­value (0.0012). Since
the p-­value is less than the significance level of 0.05, the null hypothesis is rejected, indicating that
there is a significant difference in terms of the smoking status before and after the programme.
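The arithmetic behind the Chi-square statistic can be sketched by hand as the sum of (observed − expected)² / expected over all categories. In the sketch below, the counts after the programme are treated as the observed frequencies and the counts before the programme as the expected ones. This assignment is an assumption about the intended direction of the comparison, and swapping the two roles gives a different statistic:

```python
import numpy as np
from scipy.stats import chisquare

expected = np.array([150, 50, 300])  # counts before the programme
observed = np.array([140, 80, 280])  # counts after the programme

# Chi-square statistic computed term by term
manual_stat = ((observed - expected) ** 2 / expected).sum()

# The same computation through SciPy, with the roles named explicitly
result = chisquare(f_obs=observed, f_exp=expected)

print(manual_stat)
print(result.statistic, result.pvalue)
```

Naming the f_obs and f_exp arguments explicitly makes the direction of the comparison visible in the code.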
9.5.9 Relationship: Pearson’s Correlation
Observation 9.24 – Pearson's Correlation: A test used to examine whether two normally distributed, continuous variables have a linear relationship. Use the pearsonr() function from the SciPy library.
Correlation is used to test whether two continuous variables have a linear relationship. The correlation coefficient summarizes the strength of this relationship.
As an example, the reader can assume that one needs to know whether age and BMI are correlated. The null hypothesis (H0) for this example is that age and BMI are not correlated. Assuming that both age and BMI are normally distributed and have the same variance, one can use function pearsonr() from the SciPy library to calculate the correlation coefficient and estimate the strength of the relationship. The function accepts two arrays as parameters corresponding to the sets of data:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Read the dataset
dataset = pd.read_csv("example.csv", index_col = 0)
print(pearsonr(dataset.age, dataset.bmi))
# Visualize the correlation with a scatter plot
print(plt.scatter(dataset.age, dataset.bmi, alpha = 0.5,
    edgecolors = "none", s = 20))
Output 9.5.9:
(0.0453741864067145, 0.014235768675028503)
<matplotlib.collections.PathCollection object
at 0x000002802BD93310>
The first value of the output is the correlation coefficient (0.045), followed by the p-value (0.014). Since the p-value is less than the significance level of 0.05, one can conclude that a statistically significant linear relationship exists between age and BMI. Another important observation is that the correlation is positive (i.e., if age increases, BMI tends to increase too), as the correlation coefficient is a positive number. However, the strength of the correlation is rather weak, as the correlation coefficient (0.045) is quite close to 0 (i.e., no correlation). This illustrates that, with a large sample, even a very weak correlation can be statistically significant.
The correlation can also be visualized as a scatter plot, using the scatter() function as shown
in the Output plot above.
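For readers who want to see what the coefficient measures, the sketch below computes Pearson's r directly from its definition (the sum of the products of the centred values, divided by the square root of the product of their sums of squares) and compares it with the value returned by pearsonr(). The data are randomly generated and hypothetical, and the variable names are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=2)

# Hypothetical data: y depends weakly on x, plus noise
x = rng.normal(50, 15, size=500)
y = 0.02 * x + rng.normal(28, 6, size=500)

# Centre both variables around their means
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson's r from its definition
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_scipy, p_value = pearsonr(x, y)

print(r_manual, r_scipy, p_value)
```

The coefficient is unit-free and always lies between −1 and 1, which is what allows values from different datasets to be compared.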
9.5.10 Relationship: The Chi-­Square Test
Observation 9.25 – Pearson's Chi-Square Test: A test used to examine whether two categorical variables are independent. Use the chi2_contingency() function from the SciPy library.
To test whether two categorical variables are independent, one may use the Chi-squared Test, also known as Chi-squared Test of Independence or Pearson's Chi-square Test.
To demonstrate the logic of the test, one can use the same survey data example and evaluate whether gender and smoking status are associated. The null hypothesis (H0) would be that there is no relationship between gender and smoking status. When none of the expected cell frequencies is less than 5, one can use the crosstab() function from the Pandas library to create a cross table and scipy.stats.chi2_contingency() to conduct the Chi-square Test on the contingency/cross table. Detailed documentation for this function can be found in the official SciPy.org website (The SciPy Community, 2020). The following Python script makes use of both the crosstab() and the chi2_contingency() functions to provide the frequencies of the smoking status across the two gender groups and test whether there is an indication of a relationship between them:
import pandas as pd
from scipy.stats import chi2_contingency

# Read the dataset
dataset = pd.read_csv("example.csv", index_col = 0)
# Build the contingency table of smoking status by gender
table = pd.crosstab(dataset.smokestat, dataset.gender)
print(table, "\n")
# Calculate the Chi-squared Test of Independence
print(chi2_contingency(table))
Output 9.5.10.a (test results abbreviated):
gender     Female  Male
smokestat
Current       210   162
Former        403   367
Never        1093   683

(19.453..., 5.96...e-05, 2, array([[ 217.49,  154.51],
       [ 450.18,  319.82],
       [1038.33,  737.67]]))
The first value of the output (19.453) is the Chi-­square value, followed by the p-­value (5.96e−05), the
degrees of freedom (2), and the expected frequencies as an array. Since the p-­value is less than 0.05, the
null hypothesis can be rejected, indicating that a relationship between smoking status and gender exists.
It is worth noting that if an expected frequency lower than 5 is present, the user should use the
Fisher’s Exact Test instead of the Chi-­square Test. Both tests assess for independence between
variables. The Chi-­square Test applies an approximation assuming the sample is large, while the
Fisher’s Exact Test runs an exact procedure suitable for small-­sized samples (Kim, 2017).
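The check on the expected frequencies can be sketched in a few lines. The example below is a minimal illustration that assumes the contingency table is typed in directly from the counts shown in Output 9.5.10.a (rows: Current, Former, Never; columns: Female, Male); the variable names observed and min_expected are illustrative and do not appear in the original script.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts taken from the cross table above
# (rows: Current, Former, Never; columns: Female, Male)
observed = np.array([[210, 162],
                     [403, 367],
                     [1093, 683]])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected frequencies under independence
chi2, p_value, dof, expected = chi2_contingency(observed)
min_expected = expected.min()

print("Chi-square: {:.3f}, p-value: {:.2e}, dof: {}".format(chi2, p_value, dof))

# Rule of thumb from the text: the approximation needs expected counts >= 5
if min_expected < 5:
    print("Expected frequency below 5: consider Fisher's Exact Test")
else:
    print("All expected frequencies are at least 5")
```

For a 2 x 2 table with small expected counts, SciPy provides scipy.stats.fisher_exact() as the exact alternative.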
To visualize the results of the test, one can also create a mosaic plot using the mosaic() function from the Statsmodels library. The function accepts the source as a parameter and defines the
names of the columns for the plot:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
# Read the dataset
dataset = pd.read_csv("example.csv", index_col = 0)
mosaic(dataset, ["smokestat", "gender"])
plt.show()
Output 9.5.10.b:
[Figure: mosaic plot of smoking status by gender]
9.5.11 Relationship: Linear Regression
Linear regression is used to examine the linear relationship between two (i.e., univariate linear regression) or more (i.e., multivariate linear regression) variables.

Observation 9.26 – Linear Regression: A test used to examine the linear relationship between two (i.e., univariate) or more (i.e., multivariate) variables. Use the OLS(y, X).fit() function from the Statsmodels library.

To contextualize this using the previous survey example, the reader can assume a case where one wants to test the relationship between body weight and BMI, where the BMI is normally distributed. Additionally, predictions regarding the BMI should be made based on weight information. Since BMI is a continuous variable, linear regression is appropriate for the analysis. In Python, linear regression can be performed using either the Statsmodels or the Scikit-learn libraries. For this example, the OLS(y, X).fit() function from the Statsmodels library was chosen, as the Scikit-learn library is generally associated more with tasks related to machine learning. The related Python script and its output are provided below:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the dataset
dataset = pd.read_csv("example2.csv", index_col = 0)
# Independent variable
X = dataset.weight
# Dependent variable
y = dataset.bmi
# Add an intercept (beta_0) to the model
X = sm.add_constant(X)
# Function sm.OLS(dependent variable, independent variable)
model = sm.OLS(y, X).fit()
# Predictions
predictions = model.predict(X)
# Print out the statistics
print(model.summary())
# Plot the statistics
print(sm.graphics.plot_ccpr(model, "weight"))
Output 9.5.11.a and 9.5.11.b:
                        OLS Regression Results
Dep. Variable:                 bmi    R-squared:                0.740
Model:                         OLS    Adj. R-squared:           0.739
Method:              Least Squares    F-statistic:              8085.
Date:             Sun, 25 Jul 2021    Prob (F-statistic):        0.00
Time:                     16:35:19    Log-Likelihood:         -7449.6
No. Observations:             2849    AIC:                  1.490e+04
Df Residuals:                 2847    BIC:                  1.492e+04
Df Model:                        1
Covariance Type:         nonrobust

             coef    std err         t     P>|t|    [0.025    0.975]
const      6.5712      0.251    26.188     0.000     6.079     7.063
weight     0.1218      0.001    89.918     0.000     0.119     0.124

Omnibus:              268.275    Durbin-Watson:            1.290
Prob(Omnibus):          0.000    Jarque-Bera (JB):       609.183
Skew:                   0.574    Prob(JB):             5.22e-133
Kurtosis:               4.953    Cond. No.                  750.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Figure(432x288)
In this example of linear regression, y is the dependent variable, i.e., the variable that must be predicted or estimated. X is the set of independent variables, which are the predictors of y. It must be noted that an intercept needs to be added to the independent variables using sm.add_constant(X) before running the regression.
The output provides several pieces of information. The first part contains information about the dependent variable, the number of observations, the model, and the method. OLS stands for Ordinary Least Squares, and the method Least Squares relates to the attempt to fit a regression line that minimizes the sum of the squared vertical distances from the data points to the regression line. Another important value presented in the first part is the R squared (R² = 0.740), which is the proportion of the variance in the dependent variable that the model explains (74.0%). The larger the R squared value, the better the model fit.
The second part of the output includes the intercept and the coefficients. The p-value for weight is reported as 0.000 (i.e., lower than 0.001), indicating that weight is a statistically significant predictor of BMI, with a weight increase of 1 pound leading to a respective increase in BMI of 0.1218. The linear regression equation can also be used in the following form:
BMI = (Intercept) + (Weight_coefficient) * weight
Once the output numbers are added, the equation takes the following form:
BMI = 6.5712 + 0.1218 * weight
Therefore, if the user knows a person’s weight (e.g., 125 pounds), their BMI can be calculated as 6.5712 + 0.1218 * 125 = 21.7962.
The user can also use the Matplotlib library to plot the results, as illustrated in the associated graph.
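The prediction step can be sketched as a small helper function. This is only an illustration: the constants are copied from the const and weight rows of the summary output above, and the names INTERCEPT, WEIGHT_COEFFICIENT, and predict_bmi are hypothetical and not part of the original script.

```python
# Coefficients as printed in the OLS summary above (const and weight rows)
INTERCEPT = 6.5712
WEIGHT_COEFFICIENT = 0.1218

def predict_bmi(weight_lb):
    """Return the BMI predicted by the fitted line for a weight in pounds."""
    return INTERCEPT + WEIGHT_COEFFICIENT * weight_lb

# Evaluate the fitted line for a few example weights
for weight in (125, 150, 175):
    print("weight = {} lb -> predicted BMI = {:.2f}".format(weight, predict_bmi(weight)))
```

When the fitted Statsmodels model is available, the same coefficients can be read from model.params instead of being typed in by hand.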
9.5.12 Relationship: Logistic Regression
Logistic regression is used to describe the relationship between a dependent, categorical variable and one or more independent variables. It models the logit-transformed probability as a linear function of the predictor variables.

Observation 9.27 – Logistic Regression: A test used to examine the relationship between a dependent, categorical variable and one or more independent variables. Use the Logit(y, X) function from the Statsmodels library.

For instance, using the same survey example, one can assume that the user wants to know the relationship between smoking status (i.e., 1 = current smoker, and 0 = non-smoker) and the potential predictors, such as age, gender, and marital status. In addition, the user may also want to predict the smoking status based on the predictor information. Since smoking status is a categorical variable, logistic regression is an appropriate analysis method. In Python, logistic regression can be conducted using the Logit(y, X) function from the Statsmodels library. Parameter y is the dependent variable, i.e., the variable that must be predicted or estimated. X is the set of independent variables, which are the predictors of y:
# Example of Logistic Regression
import pandas as pd
import statsmodels.api as sm
# Read data
df = pd.read_csv("Example2.csv", index_col = 0)
x = df[["age", "gender2", "marital_divorced",
"marital_single", "marital_widowed"]]
y = df.smokestat2
# Add an intercept (beta_0) to the model
X = sm.add_constant(x)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
# Print result.summary()
print(result.summary2())
Output 9.5.12:
Optimization terminated successfully.
         Current function value: 0.373830
         Iterations 6
                          Results: Logit
Model:                Logit               Pseudo R-squared:   0.020
Dependent Variable:   smokestat2          AIC:                2142.0822
Date:                 2021-07-27 13:21    BIC:                2177.8105
No. Observations:     2849                Log-Likelihood:     -1065.0
Df Model:             5                   LL-Null:            -1086.7
Df Residuals:         2843                LLR p-value:        3.1240e-08
Converged:            1.0000              Scale:              1.0000
No. Iterations:       6.0000

                     Coef.   Std.Err.         z    P>|z|    [0.025    0.975]
const              -1.7107     0.2307   -7.4156   0.0000   -2.1628   -1.2585
age                -0.0109     0.0040   -2.7133   0.0067   -0.0187   -0.0030
gender2             0.1805     0.1170    1.5418   0.1231   -0.0489    0.4098
marital_divorced    0.8406     0.1422    5.9097   0.0000    0.5618    1.1194
marital_single      0.4609     0.1584    2.9096   0.0036    0.1504    0.7715
marital_widowed     0.4764     0.2229    2.1372   0.0326    0.0395    0.9133
As in linear regression, the output contains two parts. The first part provides information about the dependent variable and the number of observations, while the second part provides the intercept and the coefficients. As shown, age and marital status are significant predictors of smoking status (p < 0.05), while gender is not (p = 0.1231). The odds of being a smoker for individuals who are divorced are 2.32 (i.e., exp(0.8406)) times the odds for those who are married. Similar trends are also observed for those who are single (1.5855 times) and widowed (1.6103 times). In terms of age, it is observed that for every 1-year increase in age there is a decrease of approximately 1% (i.e., 1 − exp(−0.0109)) in the odds of an individual being a smoker.
The output information can also be used to build the logistic regression equation as follows:

\[ P(\text{being a smoker}) = \frac{\exp(z)}{1 + \exp(z)}, \quad z = -1.7107 - 0.0109 \cdot \text{Age} + 0.1805 \cdot \text{gender2} + 0.8406 \cdot \text{Divorced} + 0.4609 \cdot \text{Single} + 0.4764 \cdot \text{Widowed} \]

As such, it can be predicted that a 40-year-old divorced male will have a 24.5% probability of being a smoker:

\[ \frac{\exp(-1.7107 - 0.0109 \cdot 40 + 0.1805 \cdot 1 + 0.8406 \cdot 1)}{1 + \exp(-1.7107 - 0.0109 \cdot 40 + 0.1805 \cdot 1 + 0.8406 \cdot 1)} = \frac{0.3244}{1 + 0.3244} = 0.2450 \]
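The odds ratios and the predicted probability can be reproduced with a short sketch. The coefficients below are copied from the Logit summary output; the dictionary and the function smoking_probability are illustrative names, and gender2 = 1 is assumed to denote male, as in the worked example above.

```python
import math

# Coefficients copied from the Logit summary above
coefficients = {
    "const": -1.7107,
    "age": -0.0109,
    "gender2": 0.1805,
    "marital_divorced": 0.8406,
    "marital_single": 0.4609,
    "marital_widowed": 0.4764,
}

def smoking_probability(**predictors):
    """Apply the logistic function to the linear predictor z."""
    z = coefficients["const"]
    for name, value in predictors.items():
        z += coefficients[name] * value
    return math.exp(z) / (1 + math.exp(z))

# Odds ratios: exp(coefficient) for each predictor
odds_ratios = {name: math.exp(b) for name, b in coefficients.items() if name != "const"}
print({name: round(ratio, 2) for name, ratio in odds_ratios.items()})

# 40-year-old divorced male (gender2 = 1 is assumed to denote male)
p = smoking_probability(age=40, gender2=1, marital_divorced=1)
print("Predicted probability: {:.3f}".format(p))
```

Predictors that are omitted from the call (e.g., marital_single) are treated as 0, which matches the dummy coding used in the equation.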
9.6 WRAP UP
This chapter focused on the introduction of basic concepts and terms related to statistical analysis
and on the practical demonstration of carrying out inferential statistical analysis tasks using Python.
It provided an overview of statistics and the available tools for conducting the analytical tasks. Basic
statistical concepts, such as population and sample, hypothesis, significance levels and confidence
intervals, were introduced. It also provided a practical guide for choosing the right type of statistical
test for different types of tasks. The purposes and definitions of common types of statistical analysis
methods were briefly discussed. Furthermore, it covered the necessary background for choosing a
statistical analysis approach, such as levels and types of variables and the corresponding statistical
and hypothesis tests, and demonstrated how to set up the Python environment and work with various libraries specifically designed for statistical analysis. Finally, it provided a practical guide for
the implementation and execution of common statistical analysis tasks in Python. Each statistical
analysis method was supported by working examples, the associated Python programming code,
and result interpretations.
A list of the common statistical analysis methods covered in this section, as well as the corresponding Python libraries and methods, is presented below:

Statistical Test                           Library        Code
Mann-Whitney U Test                        SciPy          mannwhitneyu(data1, data2)
Wilcoxon Signed-rank Test                  SciPy          wilcoxon(data1, data2)
Kruskal-Wallis Test                        SciPy          kruskal(data1, data2, data3, …)
Paired t-Test                              SciPy          ttest_rel(data1, data2)
Independent t-Test                         SciPy          ttest_ind(data1, data2)
Chi-Square Goodness of Fit                 SciPy          chisquare(data1, data2)
ANOVA                                      SciPy          f_oneway(data1, data2, data3, …)
Pearson’s Correlation                      SciPy          pearsonr(var1, var2)
Pearson’s Correlation (Scatter Plot)       Matplotlib     scatter(var1, var2)
Pearson’s Chi-Square Test                  SciPy          chi2_contingency(crosstable)
Pearson’s Chi-Square Test (Mosaic Plot)    Statsmodels    mosaic(Dataframe, ['var1', 'var2'])
Linear Regression                          Statsmodels    OLS(y, X).fit()
Logistic Regression                        Statsmodels    Logit(y, X)
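As a quick reference, the SciPy calls listed in the table can be exercised side by side. The sketch below uses small, hypothetical samples (not the survey data of this chapter) and relies on the fact that each of these functions returns a result that unpacks into a statistic and a p-value.

```python
from scipy import stats

# Small illustrative samples (hypothetical values, not from the survey data)
data1 = [4.2, 3.1, 4.6, 3.3, 2.5, 5.2]
data2 = [3.9, 3.6, 4.0, 2.1, 3.5, 4.1]
data3 = [5.0, 4.4, 5.3, 3.8, 4.7, 5.9]

tests = {
    "Mann-Whitney U": stats.mannwhitneyu(data1, data2),
    "Wilcoxon signed-rank": stats.wilcoxon(data1, data2),
    "Kruskal-Wallis": stats.kruskal(data1, data2, data3),
    "Paired t-test": stats.ttest_rel(data1, data2),
    "Independent t-test": stats.ttest_ind(data1, data2),
    "One-way ANOVA": stats.f_oneway(data1, data2, data3),
    "Pearson's correlation": stats.pearsonr(data1, data2),
}

p_values = {}
for name, result in tests.items():
    statistic, p_value = result  # each result unpacks as (statistic, p-value)
    p_values[name] = p_value
    print("{:<24} statistic = {:8.3f}, p = {:.3f}".format(name, statistic, p_value))
```

With samples this small the p-values are only illustrative; the point of the sketch is the calling convention, not the statistical conclusions.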
The basic inferential statistical tests covered in this chapter lay the foundation for other, more advanced
statistical analysis tasks, such as time to event and time series analysis. Ultimately, such methods and
results could be used as building blocks for even more complex system simulations, such as Markov models, discrete-­event, and agent-­based simulations. Although advanced statistical analysis and simulation
tasks like these were not covered in this chapter, the reader should be able to explore them by building on
the information and knowledge acquired. Relevant key textbooks and bibliography for the purposes of
further study and self-­learning can be found in the Reference List of this chapter.
9.7 EXERCISES
We conducted an experiment on the response of different plant species to the length of daylight over 3 months.
The data we collected are listed below:

Sample   Plant Species   Length of Daylight (Hours per Day)   Growth (cm)   Flowered or Not (1 = Yes, 0 = No)
1        A               6                                    4.2           0
2        B               7                                    3.1           1
3        A               6                                    4.6           1
4        A               5                                    3.3           0
5        B               6                                    2.5           0
6        A               8                                    5.2           1
7        B               9                                    3.9           1
8        B               5                                    2.1           0
9        A               7                                    3.5           1
10       B               8                                    3.4           1

1. The variable of Plant Species is:
A. Ordinal variable
B. Nominal variable
C. Interval variable
D. Ratio variable
Answer: B
2. The variable of Length of Daylight is:
A. Ordinal variable
B. Nominal variable
C. Interval variable
D. Ratio variable
Answer: D
3. The variable of Growth is:
A. Ordinal variable
B. Nominal variable
C. Continuous variable
D. Categorical variable
Answer: C
4. The variable of Flowered or not is:
A. Ordinal variable
B. Nominal variable
C. Interval variable
D. Ratio variable
Answer: B
5. If we want to know the correlation between Length of Daylight and Growth, which of the
following statistical methods should we use?
A. Chi-­square
B. Pearson’s Correlation
C. Logistic Regression
D. ANOVA
Answer: B
6. The estimated correlation coefficient is 0.45. What is the strength of the correlation?
A. Weak negative correlation
B. Strong positive correlation
C. Moderate positive correlation
D. Weak positive correlation
Answer: D
7. If we want to compare the growth difference of different plant species, which statistical
analysis should we use?
A. Linear Regression
B. Chi-­square Test
C. Student t-­Test
D. Mann-­Whitney U Test
Answer: D
8. We received more data from other research teams, making the total sample size 150. Next,
we would like to update our growth comparison results for different plant species. Which
Python code should we use?
A. mannwhitneyu(data1, data2)
B. chisquare(data1, data2)
C. ttest_ind(data1, data2)
D. wilcoxon(data1, data2)
Answer: C
9. Based on the total of 150 samples, we decided to investigate the relationship between
Growth and Length of Daylight. What would be our dependent variable?
A. Length of Daylight
B. Growth
C. Plant Species
D. Flowered or not
Answer: B
10. To explore the relationship mentioned in Question 9, which statistical analysis should be used?
A. Linear Regression
B. Logistic Regression
C. ANOVA
D. Chi-­square Test
Answer: A
11. Which Python code should be used to conduct the analysis used in Question 10?
A. ttest_rel(data1, data2)
B. f_oneway(data1, data2, data3)
C. OLS(y, X).fit()
D. Logit(y, X)
Answer: C
12. To explore the relationship between Flowered or not and Length of Daylight, which Python
code should be used?
A. ttest_rel(data1, data2)
B. f_oneway(data1, data2, data3)
C. OLS(y, X).fit()
D. Logit(y, X)
Answer: D
REFERENCES
Anaconda Inc. (2020). Anaconda Distribution Starter Guide. https://docs.anaconda.com/_downloads/9ee215ff15fde24bf01791d719084950/Anaconda-Starter-Guide.pdf.
De Winter, J. F. C., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-Whitney-Wilcoxon (Addendum added October 2012). Practical Assessment, Research, and Evaluation, 15(1), 11.
Diabetes UK. (2019). Number of People with Diabetes Reaches 4.7 Million. https://www.diabetes.org.uk/about_us/news/new-stats-people-living-with-diabetes.
Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon–Mann–Whitney test under scrutiny. Statistics in Medicine, 28(10), 1487–1497.
Kim, H.-Y. (2017). Statistical notes for clinical researchers: Chi-squared test and Fisher’s exact test. Restorative Dentistry & Endodontics, 42(2), 152–155.
Koehrsen, W. (2018). Histograms and Density Plots in Python. Towards Data Science. https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0.
McDonald, J. H. (2014). Correlation and linear regression. In Handbook of Biological Statistics (3rd ed.). Baltimore, MD: Sparky House Publishing. https://www.biostathandbook.com/HandbookBioStatThird.pdf.
McKinney, W., & Team, P. D. (2020). Pandas: Powerful Python data analysis toolkit, 1625. https://pandas.pydata.org/docs/pandas.pdf.
McIntire, G., Martin, B., & Washington, L. (2019). Python Pandas Tutorial: A Complete Introduction for Beginners. Learn Data Science: Tutorials, Books, Courses, and More. https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/.
Minitab. (2015). Choosing between a nonparametric test and a parametric test. State College: The Minitab Blog. https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test.
Pandas Development Team. (2020). pandas.read_excel. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html.
Scikit-posthocs. (2020). The Scikit Posthocs Test. https://scikit-posthocs.readthedocs.io/en/latest/.
SciPy Community. (2020). scipy.stats.ttest_ind. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
Sharma, A. (2019). Importing Data into Pandas. https://www.datacamp.com/community/tutorials/importing-data-into-pandas.
Tavares, E. (2017). Counting and Basic Frequency Plots. https://etav.github.io/python/count_basic_freq_plot.html.
WorldCoinIndex. (2021). WorldCoinIndex. https://www.worldcoinindex.com/.
10 Machine Learning with Python
Muath Alrammal
Higher Colleges of Technology
University Paris-­Est (UPEC)
Dimitrios Xanthidis and Munir Naveed
University College London
Higher Colleges of Technology
CONTENTS
10.1 Introduction
10.2 Types of Machine Learning Algorithms
10.3 Supervised Learning Algorithms: Linear Regression
10.4 Supervised Learning Algorithms: Logistic Regression
10.5 Supervised Learning Algorithms: Classification and Regression Tree (CART)
10.6 Supervised Learning Algorithms: Naïve Bayes Classifier
10.7 Unsupervised Learning Algorithms: K-means Clustering
10.8 Unsupervised Learning Algorithms: Apriori
10.9 Other Learning Algorithms
10.10 Wrap Up - Machine Learning Applications
10.11 Case Studies
10.12 Exercises
References
10.1 INTRODUCTION
At the present time, machine learning (ML) plays an essential role in many human activities. It is applied in different areas including online shopping, medicine, video surveillance, email spam and malware detection, online customer support, and search engine result refinement. It is a subfield of computer science and a subset of Artificial Intelligence (AI). The main focus of ML is on developing algorithms that can learn from data and make predictions based on this learning.

Observation 10.1 – Machine Learning: A subfield of computer science and Artificial Intelligence that focuses on developing algorithms that can learn from data and make predictions based on their learning.

An ML program is said to learn from experience (E) with respect to some tasks (T) and a performance measure (P) if its performance at those tasks, as measured by P, improves with that experience (Mitchell, 1997). ML behaves similarly to the growth of a child. As a child grows, its experience E in performing task T increases, which results in a higher performance measure (P).

Observation 10.2 – Machine Learning Process: A Machine Learning program learns from experience (E) given some tasks (T) and performance measure (P), if it improves from that experience (E).

In ML, a computer is trained using a given dataset in order to predict the properties of new data. For instance, one can train a system by feeding it with 10,000 images of dogs and 10,000 more images not containing dogs, indicating in each case whether a picture is a dog or not. Following this training, when the system is fed with a new image it should be able to predict whether it is the image of a dog or not.

DOI: 10.1201/9781003139010-10
Python has an arsenal of libraries that support the implementation of ML algorithms. Some of
these libraries are already discussed and used in previous chapters (e.g., Pandas, Matplotlib). Other
libraries especially useful for ML applications are the following:
• NumPy: It is an array-­processing library. It provides complex mathematical functions for
processing multi-­dimensional arrays and matrices. It is a powerful tool for handling random numbers, Fourier transforms, and linear algebra.
• SciPy: It is an open-source Python library used for scientific computing. It contains modules for optimization, image and signal processing, Fast Fourier transforms, linear algebra, and ordinary differential equations (ODEs). It is built on top of NumPy, as its underlying data structure is a multi-dimensional array.
• Scikit-Learn: It was built in 2010 on top of the NumPy and SciPy libraries. It contains several supervised and unsupervised ML algorithms. The library is also useful in data mining and data analysis. It handles clustering, regression, classification, model selection, and preprocessing.
• TensorFlow: This library was developed by Google and released in 2015. It is used for manipulating tensors (i.e., multi-dimensional arrays) and interoperates with NumPy arrays.
There is an abundance of implemented ML algorithms, applying to various domains. This chapter provides an introduction to some of the most important algorithms, as well as some of the most popular domain applications. It concludes with a relevant case study that explores some of the main aspects of ML.
10.2 TYPES OF MACHINE LEARNING ALGORITHMS
There are three main types of ML algorithms: supervised, unsupervised, and reinforcement. A simple way to understand the difference between supervised and unsupervised ML is by introducing the concept of using some type of help to teach a computer how to map particular inputs into the relevant outputs.

Observation 10.3 – Supervised Learning: Uses labeled data to train a computer how to map particular inputs into outputs. If the output is in a categorical form, the type is classification. If the output is in continuous numerical form, the type is regression. Combining multiple supervised learning models is referred to as ensembling.

In the case of supervised learning the supervisor uses what is referred to as labeled data to direct the computer into understanding how to map the input into output. As an example, assume the case of training a computer to distinguish between the images of a laptop and a desktop PC. The computer is provided with a set of images and a label or flag for each one specifying it is a laptop. The same process is repeated for the case of the desktop PC images. Although this is a simplified example, it provides a straightforward description of supervised learning.
In terms of the outputs associated with supervised learning, there are two broad types: classification and regression. Classification is related to categories, such as “sick” or “healthy” individuals, “dog” or “cat” pets, “laptop” or “desktop” PCs. Regression is related to outputs in the form of
continuous numerical values, such as predicting an individual’s height or weight, or the amount of
rainfall. An additional type of supervised learning is ensembling, which involves combining the
predictions of multiple ML models that may be too weak to stand on their own, in order to produce
a more accurate prediction for a new sample.
In general, a broad statement about supervised learning is that it uses labeled data to train a computer to map inputs (X) into outputs (Y) by solving equation Y = f(X) for f.
In the case of unsupervised learning there is no supervisor to train the computer in terms of mapping inputs into outputs, and no labeled training input data to model possible corresponding output variables. Essentially, the computer is left to predict the possible outputs on its own, given a set of previous inputs. There are three main types of unsupervised learning: association, clustering, and dimensionality reduction.

Observation 10.4 – Unsupervised Learning: There is no supervisor to train the computer to map input into output and there is no labeled data for such training. The computer is trained by itself through a trial-and-error process. Association is used to determine the probability of the co-occurrence of items in the collection. Clustering is used to group similar samples within the same cluster. Dimensionality reduction is used to reduce the number of variables of the dataset.

Association is used to discover the probability of the co-occurrence of items in a collection. It is used extensively in market basket analysis. For example, an association model might be used to determine that a purchase of bread has an 80% probability of being accompanied by a purchase of eggs. Clustering is used to group samples in a way that ensures that objects within the same cluster share more similarities with each other than with objects from other clusters. Dimensionality reduction is used to reduce the number of variables of a dataset, while ensuring that important information is still conveyed. Dimensionality reduction can be achieved by using feature extraction and feature selection functions. The latter essentially refers to the selection of a subset of the original variables. Feature extraction performs data transformations from a high-dimensional space to a low-dimensional space (e.g., PCA algorithm).
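The co-occurrence probability used in association analysis can be illustrated with a short sketch. The baskets below are hypothetical and the function name confidence is an illustrative choice (it borrows the usual association-rule term for the conditional probability); this is not the Apriori algorithm itself, only the counting idea behind it.

```python
# Hypothetical shopping baskets, used only to illustrate the idea
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"eggs", "milk"},
    {"bread", "eggs", "butter"},
]

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from co-occurrence counts."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)

# Share of the baskets containing bread that also contain eggs
bread_to_eggs = confidence("bread", "eggs", transactions)
print("P(eggs | bread) = {:.2f}".format(bread_to_eggs))
```

In this toy dataset, three of the four baskets that contain bread also contain eggs, so the estimated probability is 0.75.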
Finally, reinforcement learning is a type of ML that allows an agent to decide the best action
based on its current state, by learning behaviors that will maximize the associated rewards. It usually learns optimal actions through trial and error. For example, one can think of a video game in
which the player needs to move to certain places at certain times in order to earn points. If a reinforcement algorithm attempts to play this game instead of a human player, it would start by moving
randomly, but eventually would learn where and when it needs to move in order to maximize points
accumulation through the use of an appropriate trial and error process.
10.3 SUPERVISED LEARNING ALGORITHMS: LINEAR REGRESSION
The basic idea behind linear regression is the quantification of the relationship between a set of inputs and their corresponding outputs. This takes the form of a line (y = a + b·x), where b is the slope of the regression line (the coefficient of the line) and a is the y-axis intercept. The goal is to find the line with the smallest overall deviation from the data points, measured as the sum of the squares of the vertical distances of the data points from the line. Another important parameter in linear regression is R², which is the proportion of the variation in the output y that is explained by the input x. Like other statistical analysis tests, this test also results in a p value (statistical significance) that determines whether there is a statistically significant linear relationship between the input and output datasets.

Observation 10.5 – Linear Regression: Trains a system to predict the output of a particular input by quantifying the relationship y = a + b·x between a set of inputs and their corresponding outputs, where b is the slope of the line and a is the y-axis intercept. Use R² to measure the effect of the input on the possible output and p to measure the statistical significance of the test.

In Python, linear regression can be implemented using the linregress(X, y) function of the scipy.stats module. The function uses an input and an output dataset (i.e., X and y, respectively). The function output consists of five values: the slope of the linear regression, the intercept, the r value, the p value, and the standard error of the estimated slope. Based on this, the overall process can be summarized in five distinct steps:
• Step 1: Import/read the data for the linear regression.
• Step 2: Define the two datasets (X and y) used to create the model.
• Step 3: Use linregress() to calculate the slope, the intercept, the r, and the p values of the linear regression.
• Step 4 (Optional): Use the slope and the intercept to visualize the model.
• Step 5 (Optional): Test the model with new data.
There are numerous real-­life applications of linear regression ML algorithms. A notable example
is their use in medicine and pharmaceutical research, when trying to determine the optimal dosage
of a particular drug for a particular illness. Other examples include the use of such algorithms in
sales and marketing, when trying to find the correct volume of promotional material (and the associated costs) for a particular product in order to maximize revenue, and the association of a student’s
coursework grades with their final grade in an educational context. The following Python script
quantifies the relationship between the values of two columns of the grades2.csv dataset (Midterm
Exam and Final Grade). Next, once the slope and the intercept values are calculated and the regression model is prepared for further use, the training data are plotted alongside the regression line, and the model is tested with a new Midterm Exam grade:
import pandas as pd
import matplotlib.pyplot as plt
# Request to plot inline with the rest of the results
# This is particularly relevant in Jupyter Anaconda
# (IPython magic command; remove it when running as a plain script)
%matplotlib inline
from scipy import stats

# The function uses the calculated slope and intercept
# to predict the Final Grade, given the Midterm Exam grade input
def predictFinalGrade(X):
    return slope * X + intercept

# Read the dataset
dataset = pd.read_csv("grades2.csv")
dataset2 = dataset[["Final Grade", "Midterm Exam"]]
print("The input dataset is as follows:")
print(dataset2)

# Define the input and output datasets
X = dataset2["Midterm Exam"]; y = dataset2["Final Grade"]

# Use the linregress function from the stats library
# to calculate slope, intercept, r, p, and std_err
slope, intercept, r, p, std_err = stats.linregress(X, y)
print("The slope and intercept values are: {:.2f}, {:.2f}".format(slope, intercept))
print("The value of R-square is: {:.2f}".format(r**2))
print("The value of statistical significance, p is: {:.2f}".format(p))
mymodel = list(map(predictFinalGrade, X))

# Plot the model of the resulting linear regression
plt.scatter(X, y); plt.plot(X, mymodel); plt.show()

# Test the model with a new Midterm Exam grade
grades = int(input("Enter the new Midterm Exam grade:"))
grades = predictFinalGrade(grades)
print("The predicted Final Grade is: {:.2f}".format(grades))
Output 10.3:
The input dataset is as follows:
    Final Grade  Midterm Exam
0         67.47            70
1         75.13            82
2         66.85            40
3         54.45            44
4         76.95            82
5         45.13            50
6         73.23            62
7         81.87            84
8         62.63            64
9         58.75            52
10        49.75            62
11        44.25            42
12        62.52            68
13        47.33            52
14        68.97            70
The slope and intercept values are: 0.62, 23.96
The value of R-square is: 0.57
The value of statistical significance, p is: 0.00
Enter the new Midterm Exam grade:88
The predicted Final Grade is: 78.80
In terms of the information provided here, the dataset used to fit the regression model is printed first. The stats.linregress() function of the scipy.stats module is used to calculate the slope and the intercept values, as well as the correlation coefficient r (from which the R² value is obtained as r**2), the statistical significance value (p) and the standard error (std_err). Next, the user is prompted to enter
a new Midterm Exam grade, and the system predicts the Final Grade using the related function
predictFinalGrade().
The reader should also note that the output includes the R² value, which can be interpreted as the proportion of the variance in the Final Grade that is explained by its linear relationship with the Midterm Exam (approximately 57% in this case). Another noteworthy output is that of the p value (i.e., statistical significance), which in this particular case is less
than 0.05, suggesting that the linear relationship between the Midterm Exam and the Final Grade is statistically significant.
Another value calculated during linear regression, although not displayed in the output results, is
std_err. This value is the standard error of the estimated slope, i.e., a measure of how precisely the slope has been estimated. It should not be confused with the residuals, which are the vertical distances of the individual data points from the regression line. The script makes use of the format()
method with the {:.2f} specifier to limit the number of decimal places of the results to 2. Finally, the reader should note
the inclusion of directive %matplotlib inline, dictating that the plot must be displayed
inline with the rest of the output.
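To make the quantities returned by stats.linregress() more concrete, the following minimal sketch recomputes them in plain Python from the fifteen rows shown in Output 10.3. The variable names (midterm, final, s_xx, and so on) are chosen here for illustration and are not part of the original script. The formulas are the standard least-squares ones, so the results should agree with the library output up to rounding.

```python
import math

# Data copied from Output 10.3 (Midterm Exam -> Final Grade)
midterm = [70, 82, 40, 44, 82, 50, 62, 84, 64, 52, 62, 42, 68, 52, 70]
final = [67.47, 75.13, 66.85, 54.45, 76.95, 45.13, 73.23, 81.87,
         62.63, 58.75, 49.75, 44.25, 62.52, 47.33, 68.97]

n = len(midterm)
mean_x = sum(midterm) / n
mean_y = sum(final) / n

# Sums of squares and cross-products around the means
s_xx = sum((x - mean_x) ** 2 for x in midterm)
s_yy = sum((y - mean_y) ** 2 for y in final)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midterm, final))

# Least-squares slope and intercept
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Correlation coefficient and R-squared
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Standard error of the estimated slope (n - 2 degrees of freedom)
slope_std_err = math.sqrt((1 - r_squared) * s_yy / (n - 2) / s_xx)

print("slope = {:.2f}, intercept = {:.2f}".format(slope, intercept))
print("R-squared = {:.2f}".format(r_squared))
print("standard error of the slope = {:.3f}".format(slope_std_err))
```

Writing the calculation out this way shows that std_err describes the uncertainty of the slope itself, whereas R² summarizes how much of the spread in the Final Grade the fitted line accounts for.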
10.4 SUPERVISED LEARNING ALGORITHMS: LOGISTIC REGRESSION
As shown, linear regression predictions take the form of continuous values. In the case of logistic regression, predictions take the form of discrete values (i.e., binary), such as whether a student will pass or fail a course, or whether it will rain or not. Its name comes from the associated logistic function: $y = 1/(1 + e^{-x})$. The plot of this function is an S-shaped curve. In contrast to linear regression where the output is a value directly based on
the input, in logistic regression it is a probability ranging from 0 to 1. For example, if a value 1
represents a passing grade, an output of 0.85 means that a student is very likely to pass the course
at a probability of 85%.
Observation 10.6 – Logistic Regression: Train a system to predict the probability of an output as one of two possible values based on a given input. The function used for this purpose is the following: $y = 1/(1 + e^{-x})$.
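The behavior of the logistic function can be checked with a few lines of plain Python. This is only an illustrative sketch: the function name logistic() and the 0.5 decision threshold are assumptions made here, not names or settings taken from the script that follows.

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps any real x to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Values far below zero approach 0, values far above zero approach 1
for x in (-4, -2, 0, 2, 4):
    print("x = {:>2}: y = {:.3f}".format(x, logistic(x)))

# Turning a probability into a class label with a 0.5 threshold
probability = logistic(1.7)
label = 1 if probability >= 0.5 else 0
print("probability = {:.2f}, predicted label = {}".format(probability, label))
```

The curve is symmetric around x = 0, where the output is exactly 0.5, which is why 0.5 is a natural default boundary between the two classes.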
There are eight possible steps to follow when performing logistic regression, of which two are
optional:
• Step 1: Import/read the data for the logistic regression.
• Step 2: Split the input datasets into train and test sets.
• Step 3: Perform feature scaling for the data (standardize each feature to a mean of 0 and a standard deviation of 1).
• Step 4: Build the logistic classifier (with a preferred random_state = 0 for consistent results) and fit it to the training set.
• Step 5: Predict the results based on the classifier.
• Step 6: Find the accuracy of the regression model as a percentage.
• Step 7 (Optional): Visualize the results of the training set.
• Step 8 (Optional): Visualize the results of the test set.
The following Python script uses Midterm Exam and Project grades to create a logistic regression
model and visualize its results:
# Import train_test_split to train and test the input
from sklearn.model_selection import train_test_split
# Import StandardScaler to scale the data
from sklearn.preprocessing import StandardScaler
# Import the LogisticRegression to create the classifier object
from sklearn.linear_model import LogisticRegression
# Import the accuracy_score to calculate the accuracy of the model
from sklearn.metrics import accuracy_score
# Import numpy to prepare the plot parameters
import numpy as np
# Import pyplot to create the plot
import matplotlib.pyplot as plt
# Import ListedColormap to color the data points in the plot
from matplotlib.colors import ListedColormap
# Define that results are to be plotted inline
# This is particularly relevant in Jupyter Anaconda
%matplotlib inline

# Step 1: Define the input dataset. X must be a 2D list with
# as many rows as observations
X = [[60, 55], [54, 90], [70, 80], [76, 70], [64, 87], [66, 70],
     [54, 87], [92, 70], [58, 78], [70, 71], [70, 70], [90, 76],
     [86, 92], [72, 70], [70, 72], [82, 87], [40, 80], [44, 90],
     [82, 92], [50, 68]]
y = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Step 2: Split sets X and y into a training set and a test set
# Test size is 25% of the dataset, train size is 75%
# The new training and test lists will be in random order
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size = 0.25, random_state = 0)
print("Trained X set:", X_train); print("Test X set:", X_test)
print("Trained y set:", y_train); print("Test y set:", y_test)

# Step 3: Perform feature scaling for the data
# (standardize each feature to mean 0 and standard deviation 1)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
print("\nThe 2D set of trained X input:\n", X_train)
X_test = sc_X.transform(X_test)
print("\nThe 2D set of test X input:\n", X_test)

# Step 4: Build the logistic classifier
# Set random_state to 0 for consistent results
# Fit the classifier to the training set
model = LogisticRegression(solver = 'liblinear',
    random_state = 0).fit(X_train, y_train)
print("\n", model)

# Step 5: Predict the test results
y_pred = model.predict(X_test)
print("\nResults predicted by the model:", y_pred)
print("Results from the test:", y_test)
# Probability of a pass (class 1) for each scaled test observation
y_prob = model.predict_proba(X_test)[:, 1]

# Step 6: Calculate the accuracy of the model
# Use y_test (actual output) and y_pred (predicted output)
accuracy = accuracy_score(y_test, y_pred)
print("The accuracy of the model given the test data is: ",
    accuracy * 100, "%")
# Step 7: Visualize the training set results
# y_set is converted to a Numpy array so that y_set == j gives a Boolean mask
X_set, y_set = X_train, np.array(y_train)
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
             cmap = ListedColormap(('red', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
plt.title('Logistic Regression: Training set')
plt.xlabel("Midterm Exam")
plt.ylabel("Project")
plt.show()

# Step 8: Visualize the test results
X_set, y_set = X_test, np.array(y_test)
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
             cmap = ListedColormap(('red', 'blue')))
plt.xlim(X1.min(), X1.max()); plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
plt.title('Logistic Regression: Test set')
plt.xlabel("Midterm Exam"); plt.ylabel("Project")
plt.show()
Output 10.4:
Trained X set: [[44, 90], [54, 87], [72, 70], [64, 87],
[70, 80], [66, 70], [70, 72], [70, 71], [92, 70], [40,
80], [90, 76], [76, 70], [60, 55], [82, 87], [86, 92]]
Test X set: [[82, 92], [54, 90], [50, 68], [58, 78], [70, 70]]
Trained y set: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
1, 1]
Test y set: [1, 0, 0, 0, 1]
The 2D set of trained X input:
[[-1.69129319 1.30698109]
[-1.01657516 1.00224457]
[ 0.19791729 -0.72459574]
[-0.34185713 1.00224457]
[ 0.06297368 0.29119268]
[-0.20691353 -0.72459574]
[ 0.06297368 -0.52143805]
[ 0.06297368 -0.62301689]
[ 1.54735334 -0.72459574]
[-1.9611804 0.29119268]
[ 1.41240974 -0.11512269]
[ 0.4678045 -0.72459574]
[-0.61174434 -2.24827836]
[ 0.87263531 1.00224457]
[ 1.14252253 1.51013878]]
The 2D set of test X input:
[[ 0.87263531 1.51013878]
[-1.01657516 1.30698109]
[-1.28646237 -0.92775342]
[-0.74668795 0.088035 ]
[ 0.06297368 -0.72459574]]
LogisticRegression(random_state=0, solver='liblinear')
Results predicted by the model: [1 1 0 0 0]
Results from the test: [1, 0, 0, 0, 1]
The accuracy of the model given the test data is: 60.0 %
The above script and its output demonstrate the eight steps followed when using logistic regression.
In Step 1 (data read), it is important to remember that input dataset X must be a two-dimensional
array/list with one row per observation. In this particular case, each row holds
the grades of one student for Midterm Exam and Project. The y dataset includes values 0 or 1 for
each student, with 0 referring to a fail and 1 to a pass.
In Step 2, the script makes use of the train_test_split() function from the sklearn.model_selection module. The function takes the X and y datasets, shuffles them, and splits them
into train and test subsets at a rate of 75/25 (test_size = 0.25). The random_state = 0 argument makes the shuffle reproducible.
The results of the function are datasets X_train, X_test, y_train, and y_test. In Step 3,
the script uses the StandardScaler class from the sklearn.preprocessing module. Its fit_transform() method standardizes the training inputs X_train so that each feature has a mean of 0 and a standard deviation of 1, and its transform() method then applies the same scaling to X_test. The output data y is not scaled, and the scaled inputs are not confined to the range between 0 and 1, as the output shows. Standardization is not strictly required by logistic regression, but it puts both features on a comparable scale.
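The scaling performed in Step 3 can be reproduced by hand for a single column. The sketch below uses the Midterm Exam grades of the training set as printed in Output 10.4; the variable names are illustrative, and the calculation assumes the population standard deviation (division by n), which is what yields the values shown in the output.

```python
import math

# Midterm Exam column of the training set, copied from Output 10.4
midterm_train = [44, 54, 72, 64, 70, 66, 70, 70, 92, 40, 90, 76, 60, 82, 86]

n = len(midterm_train)
mean = sum(midterm_train) / n
# Population standard deviation (divide by n, not n - 1)
std = math.sqrt(sum((v - mean) ** 2 for v in midterm_train) / n)

# Standardize: subtract the mean, divide by the standard deviation
scaled = [(v - mean) / std for v in midterm_train]
print("first scaled training value: {:.4f}".format(scaled[0]))

# A test value is scaled with the mean and deviation of the TRAINING data
test_value = 82
scaled_test_value = (test_value - mean) / std
print("scaled test value: {:.4f}".format(scaled_test_value))
```

Reusing the training mean and deviation for the test data mirrors the use of fit_transform() on the training set and transform() on the test set: the test observations are treated as unseen data and must not influence the scaling parameters.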
In Step 4, the logistic regression classifier is created and fitted to the training data (X_train and y_train); the test data (X_test and y_test) are held back for evaluation. Next, in Step 5, the script uses the fitted model to
predict (.predict()) the outputs for X_test. In Step 6, the script uses function
accuracy_score() from the sklearn.metrics module to calculate the
accuracy of the model as a number between 0 and 1 (i.e., the fraction of test observations that were predicted correctly), which is then printed as a percentage. Finally, Steps 7 and 8 are
used to visualize the training and test set results, respectively. In both cases, function meshgrid() is
used to prepare the grid of points over which the predicted regions are drawn, and ListedColormap() to color the pass and fail regions.
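The accuracy figure reported in Output 10.4 can be verified by counting matches between the predicted and the actual test outputs. The following sketch uses the two lists exactly as printed there; the counts of true/false positives and negatives are added here for illustration and are not produced by the original script.

```python
# Values copied from Output 10.4
y_test = [1, 0, 0, 0, 1]   # actual outputs
y_pred = [1, 1, 0, 0, 0]   # outputs predicted by the model

pairs = list(zip(y_test, y_pred))

# Accuracy: fraction of predictions that match the actual output
correct = sum(1 for actual, predicted in pairs if actual == predicted)
accuracy = correct / len(y_test)
print("Accuracy: {:.1f} %".format(accuracy * 100))

# The four cells of a 2x2 confusion matrix (1 = pass, 0 = fail)
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
print("TP = {}, TN = {}, FP = {}, FN = {}".format(tp, tn, fp, fn))
```

With only five test observations, each misclassification changes the accuracy by 20 percentage points, so the figure should be read as a rough indication rather than a stable estimate.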
There are numerous different options and variations available for each of these steps, as well
as for displaying and plotting the resulting data. The reader can refer to the multitude of statistics
and/or machine learning textbooks and resources in order to delve deeper into the various concepts
related to the interpretation and use of the results of logistic regression in various contexts.
10.5 SUPERVISED LEARNING ALGORITHMS: CLASSIFICATION AND REGRESSION TREE (CART)
A decision tree consists of a root, nodes, and leaves (Figure 10.1). The starting point of the decision
tree is the root; each internal node branches out to connect to further nodes. The branching is determined by using a split
function, which divides the input data into two or more branches. Each leaf node is a possible output (outcome) of the tree.
FIGURE 10.1 Decision tree.
In order to decide the order of the features within the decision tree (and, consequently, its height), the decision tree algorithm uses a function to determine the information gain of each candidate split. There are two measures commonly used for this purpose, referred to as indices: entropy and the Gini index. Both measure the impurity of a
node in the tree and, based on their value, the algorithm decides how a node is split and whether it is kept. These values also
determine the position of a node in the tree. There are different types of decision trees, depending
on how the indices are calculated and what choices are being made in terms of splitting continuous
values. The most commonly used types of decision trees are ID3 (Quinlan, 1986), C4.5 (Salzberg,
1994) and CART (Mola, 1998).
Observation 10.7 – CART: The Classification and Regression Tree (CART) is a decision tree with a root, nodes and leaves and with outputs either in a form of a categorical or a continuous value. The branching is determined by using a split function that divides the input data into two or more branches.
Observation 10.8 – Input and Output Datasets: The Classification and Regression Tree (CART) requires a 2D list/array of values as its input dataset and a list/array with one value per input row as its output dataset. If the input and output datasets do not match, appropriate amendments are required.
CART (Classification and Regression Tree) is one of the most important and popular types of supervised learning algorithms. The output can be in a form of a categorical value (e.g., it will rain or not) or a continuous value (e.g., the final price of a car). A visual representation of a decision tree is shown in Figure 10.2. The tree starts with the Age feature, which is a numeric attribute in a bank dataset. The values of Age are split into three branches: 18–23, 24–34 and >35. The algorithm can split the continuous number values of the Age feature using a technique that also determines the order of features within the tree. Next, the Age feature (the root of the tree) is associated with three additional features (nodes): Job, Marital Status, and Housing.
FIGURE 10.2 Example of decision tree.
Observation 10.9 – StringIO, Graphviz: Used to depict the decision tree in a visual form.
The decision tree can be built using a training dataset. In the following example, the script makes use of a dataset of 40 bank account customer records, containing features age, job, marital status, and education. The system aims at predicting whether customers will make a deposit in the bank or not. In order to train the CART decision tree, these four features are used as input and the deposit feature as output. The possible outputs are Yes and No (depositing money or not). The script requires a number of associated libraries. Some of these
libraries may already be included in the system (e.g., Pandas and Numpy), while others like Pydotplus
and Graphviz must be installed explicitly. Given that the installation of any libraries depends on
the particular system in use, the reader is advised to check the available pip install statements for
specific system settings:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Import the basic libraries
import pandas as pd
import numpy as np
# Import the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import the confusion_matrix, the accuracy_score, and the
# classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Import train_test_split to split the data into train and test samples
from sklearn.model_selection import train_test_split
# Import LabelEncoder for the necessary integer (label) encoding
from sklearn.preprocessing import LabelEncoder
# Import the libraries to plot the graph
from sklearn.tree import export_graphviz
# StringIO (standard library) provides an in-memory text buffer for
# the Graphviz description of the tree
from io import StringIO
from IPython.display import Image
import pydotplus
# Plot results inline
# This is often particularly needed in Jupyter Anaconda
%matplotlib inline
With the libraries imported, the next part of the script is the first step of this particular implementation. Initially, the values for input list X (a 2D list) are defined. Each sub-list includes the age, job,
marital status, and education features of a bank customer. Next, output Y is
defined as a one-dimensional list, taking values of either yes or no. In line 82, input list X is converted
to a Numpy array to make it easier to select individual columns. Since the list mixes numbers and strings, Numpy stores every element as a string, which is why the ages appear in quotes in the output. In the following line (83), the 2D array is divided into four one-dimensional sub-arrays, one per feature. Finally, the data of each newly created input sub-array (X1–X4) and of output Y are printed:
31
32
#====================================================================
# Step 1: Define and print the input and output datasets
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
print("Step 1: Define and print the input and output datasets\n")
X = [[59, 'admin.', 'married', 'secondary'],
[56, 'admin.', 'married', 'secondary'],
[41, 'technician', 'married', 'secondary'],
[55, 'services', 'married', 'secondary'],
[54, 'admin.', 'married', 'tertiary'],
[42, 'management', 'single', 'tertiary'],
[56, 'management', 'married', 'tertiary'],
[60, 'retired', 'divorced', 'secondary'],
[37, 'technician', 'married', 'secondary'],
[28, 'services', 'single', 'secondary'],
[38, 'admin.', 'single', 'secondary'],
[30, 'blue-collar', 'married', 'secondary'],
[29, 'management', 'married', 'secondary'],
[46, 'blue-collar', 'single', 'tertiary'],
[31, 'technician', 'single', 'tertiary'],
[35, 'management', 'divorced', 'tertiary'],
[32, 'blue-collar', 'single', 'primary'],
[49, 'services', 'married', 'secondary'],
[41, 'admin.', 'married', 'secondary'],
[49, 'admin.', 'divorced', 'secondary'],
[49, 'retired', 'married', 'secondary'],
[32, 'technician', 'married', 'secondary'],
[30, 'self-employed', 'single', 'secondary'],
[55, 'services', 'divorced', 'tertiary'],
[32, 'blue-collar', 'married', 'secondary'],
[52, 'admin.', 'divorced', 'secondary'],
[38, 'unemployed', 'divorced', 'secondary'],
[60, 'retired', 'married', 'secondary'],
[60, 'retired', 'divorced', 'secondary'],
[30, 'admin.', 'married', 'tertiary'],
[44, 'unemployed', 'married', 'secondary'],
[32, 'blue-collar', 'married', 'secondary'],
[46, 'entrepreneur', 'married', 'tertiary'],
[34, 'management', 'married', 'secondary'],
[40, 'management', 'married', 'secondary'],
[34, 'housemaid', 'married', 'primary'],
[43, 'admin.', 'single', 'secondary'],
[52, 'technician', 'married', 'secondary'],
[35, 'blue-collar', 'married', 'secondary'],
[34, 'blue-collar', 'single', 'secondary']]
Y=['yes','yes','yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes','yes','yes',
'no','no','no','no','no','no','no','no','no','no',
'no','no','no','no','no','no','no','no','no','no' ]
# Convert the list into a numpy array for better index control
newX = np.array(X)
newX1,newX2,newX3,newX4=newX[:,0],newX[:, 1],newX[:, 2],newX[:, 3]
84
85
86
87
88
89
90
print("\nThe input of ages (X1) is :\n", newX1)
print("\nThe input of jobs (X2) is :\n", newX2)
print("\nThe input of marital status (X3) is :\n", newX3)
print("\nThe input of education (X4) is :\n", newX4)
print("\nThe output of deposits (Y) is :\n", Y)
Output 10.5: Step 1
Step 1: Define and print the input and output datasets
The input of ages (X1) is :
['59' '56' '41' '55' '54' '42' '56' '60' '37' '28' '38' '30' '29' '46'
'31' '35' '32' '49' '41' '49' '49' '32' '30' '55' '32' '52' '38' '60'
'60' '30' '44' '32' '46' '34' '40' '34' '43' '52' '35' '34']
The input of jobs (X2) is :
['admin.' 'admin.' 'technician' 'services' 'admin.' 'management'
'management' 'retired' 'technician' 'services' 'admin.' 'blue-collar'
'management' 'blue-collar' 'technician' 'management' 'blue-collar'
'services' 'admin.' 'admin.' 'retired' 'technician' 'self-employed'
'services' 'blue-collar' 'admin.' 'unemployed' 'retired' 'retired'
'admin.' 'unemployed' 'blue-collar' 'entrepreneur' 'management'
'management' 'housemaid' 'admin.' 'technician' 'blue-collar'
'blue-collar']
The input of marital status (X3) is :
['married' 'married' 'married' 'married' 'married' 'single' 'married'
'divorced' 'married' 'single' 'single' 'married' 'married' 'single'
'single' 'divorced' 'single' 'married' 'married' 'divorced' 'married'
'married' 'single' 'divorced' 'married' 'divorced' 'divorced' 'married'
'divorced' 'married' 'married' 'married' 'married' 'married' 'married'
'married' 'single' 'married' 'married' 'single']
The input of education (X4) is :
['secondary' 'secondary' 'secondary' 'secondary' 'tertiary' 'tertiary'
'tertiary' 'secondary' 'secondary' 'secondary' 'secondary' 'secondary'
'secondary' 'tertiary' 'tertiary' 'tertiary' 'primary' 'secondary'
'secondary' 'secondary' 'secondary' 'secondary' 'secondary' 'tertiary'
'secondary' 'secondary' 'secondary' 'secondary' 'secondary' 'tertiary'
'secondary' 'secondary' 'tertiary' 'secondary' 'secondary' 'primary'
'secondary' 'secondary' 'secondary' 'secondary']
The output of deposits (Y) is :
['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no',
'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']
Observation 10.10 – Integer Encoding: The process of converting a categorical value into the numerical form necessary for the CART algorithm. Use the LabelEncoder() class from the sklearn.preprocessing module.
In Step 2, the code addresses an important classification issue. Since models are mathematical in nature, the underlying calculations are based on numerical rather than textual data. Hence, it is necessary to encode the various categorical elements of the data into numerical (integer) values, a process referred to as integer encoding. Lines 97–102 include code for finding the unique elements in each of the categorical input sub-arrays X2–X4 (jobs, marital status, and education); the ages in X1 are already numeric in nature and are not encoded. Next, in lines 105–122, the LabelEncoder() class (sklearn.preprocessing module) is utilized to create the relevant objects, subsequently used by fit_transform() to produce the integer encoded sub-arrays for X2–X4. The integers follow the sorted order of the categories, as the output shows (e.g., divorced = 0, married = 1, single = 2). The same process is also applied in the case of
output dataset Y:
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
#====================================================================
# Step 2: Encode the categorical values of the input & output datasets
# Find and print the unique values of the categories/columns for job
# and marital status
print("\n\nStep 2: The inputs of jobs, marital status,",
"and education and the outputs are integer encoded")
jobs = np.unique(newX2)
print("\nThe various categories of jobs are:\n", jobs)
maritalStatus = np.unique(newX3)
print("\nThe various categories of marital status are:\n",
maritalStatus)
education = np.unique(newX4)
print("\nThe various categories of education are:\n", education)
# Integer Encode the categorical input and output values as fit()
# does not accept strings
label_encoderX2 = LabelEncoder()
integer_encodedX2 = label_encoderX2.fit_transform(newX2)
print("\nThe various categories of jobs are integer Encoded as",
"follows:\n", integer_encodedX2)
label_encoderX3 = LabelEncoder()
integer_encodedX3 = label_encoderX3.fit_transform(newX3)
print("\nThe various categories of marital status are ",
"integer Encoded as follows:\n",
integer_encodedX3)
label_encoderX4 = LabelEncoder()
integer_encodedX4 = label_encoderX4.fit_transform(newX4)
print("\nThe various categories of education are integer Encoded as",
"follows:\n", integer_encodedX4)
label_encoderY = LabelEncoder()
integer_encodedY = label_encoderY.fit_transform(Y)
print("\nThe various categories of output are integer Encoded as",
"follows:\n", integer_encodedY)
Output 10.5: Step 2
Step 2: The inputs of jobs, marital status, and education and the outputs are
integer encoded
The various categories of jobs are:
['admin.' 'blue-collar' 'entrepreneur' 'housemaid' 'management' 'retired'
'self-employed' 'services' 'technician' 'unemployed']
The various categories of marital status are:
['divorced' 'married' 'single']
The various categories of education are:
['primary' 'secondary' 'tertiary']
The various categories of jobs are integer Encoded as follows:
[0 0 8 7 0 4 4 5 8 7 0 1 4 1 8 4 1 7 0 0 5 8 6 7 1 0 9 5 5 0 9 1 2 4 4 3 0
8 1 1]
The various categories of marital status are integer Encoded as follows:
[1 1 1 1 1 2 1 0 1 2 2 1 1 2 2 0 2 1 1 0 1 1 2 0 1 0 0 1 0 1 1 1 1 1 1 1 2
1 1 2]
The various categories of education are integer Encoded as follows:
[1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 0 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 0 1
1 1 1]
The various categories of output are integer Encoded as follows:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0]
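The integer codes shown in Output 10.5 (Step 2) can be reproduced without any library by sorting the unique categories and using each category's position as its code. The sketch below is an illustration of that idea only; the short marital_status list and the helper names are assumptions introduced here, not part of the original script.

```python
# A short sample of categorical values (hypothetical excerpt)
marital_status = ['married', 'single', 'divorced', 'married', 'single']

# Sorted unique categories; the position in this list is the integer code
categories = sorted(set(marital_status))
code_of = {category: code for code, category in enumerate(categories)}

# Encode every value, then decode again to recover the original labels
encoded = [code_of[value] for value in marital_status]
decoded = [categories[code] for code in encoded]

print("Categories:", categories)
print("Encoded:   ", encoded)
print("Decoded:   ", decoded)
```

Keeping the list of categories makes the encoding reversible, which is useful when predicted integer outputs need to be turned back into readable labels such as yes and no.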
In Step 3, the code splits the datasets into train and test input and train and test output. Given
that the fit() function used in the next step needs a 2D numerical array to perform its calculations,
it is necessary to combine the previously divided input sub-arrays into a single 2D structure. The zip()
function takes the four input sub-arrays and pairs their elements up row by row, producing one tuple per customer. However, since
zip() returns an iterator rather than a list, the list() function is used to convert the
result into a list of tuples that can be passed on to the next function (lines 127–128).
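The effect of combining the columns with zip() and list() can be seen on the first three customer records. The values below are taken from the Step 1 and Step 2 outputs; the final conversion of the ages from strings to integers is an optional extra shown here as an assumption about how fully numeric rows could be obtained, and is not a step of the original script.

```python
# First three entries of each column (ages are strings, as in Output 10.5)
ages = ['59', '56', '41']
jobs = [0, 0, 8]
marital = [1, 1, 1]
education = [1, 1, 1]

# zip() pairs the columns element by element; list() materializes the rows
rows = list(zip(ages, jobs, marital, education))
print(rows)

# Optional: convert the age of every row to an integer
numeric_rows = [[int(age), job, status, level]
                for age, job, status, level in rows]
print(numeric_rows)
```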
Next, function train_test_split() (sklearn.model_selection module) is used with the
newly created list of rows, as well as the one-dimensional output array, in order to split (75/25) and
shuffle the datasets. This is defined explicitly by the test_size = 0.25 and the random_
state = 0 arguments (lines 129–130); the latter fixes the seed of the shuffle so that the same split is produced on every run. The test_size parameter refers to hold-out
validation, which splits the dataset into the train and test parts, in this case 75% and 25%.
The alternative to hold-out validation is the cross-validation technique, which repeats the training and validation process on different portions of the data. In the k-fold variant, the dataset is divided into k blocks (folds) of roughly equal size; in each
iteration, one fold is held out for validation and the remaining folds are used for training. The technique can also be applied to smaller datasets, but since the model is trained once per
iteration it can lead to heavier computation requirements and, therefore, more CPU cycles. The
main types of cross-validation are leave-p-out and k-fold. In the case of k-fold, the most commonly
used selection is the ten-fold (i.e., k = 10). An example of a cross-validation statement is the following (the cross_validate() function is imported from sklearn.model_selection, and the number of folds is passed through its cv parameter):
crossValidation = cross_validate(decisionTree, X_Train, y_Train, cv = 10)
In the current context, this statement would be placed in the code just after the definition of the
DecisionTreeClassifier().
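The way k-fold cross-validation partitions a dataset can be illustrated with a small generator written in plain Python. This is a simplified sketch under stated assumptions: the function name k_fold_indices() is hypothetical, the folds are taken in order without shuffling, and no attempt is made to balance the classes within each fold, which library implementations may do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Ten folds over the 30 training records of the example
folds = list(k_fold_indices(n_samples=30, k=10))
for fold_number, (train, validation) in enumerate(folds[:3], start=1):
    print("Fold", fold_number, "validates on", validation,
          "and trains on", len(train), "records")
```

Every record is used for validation exactly once and for training in the remaining iterations, which is why the model has to be trained k times.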
The last part of this step prints the train and test inputs and the train and test outputs:
123
124
125
126
127
128
129
130
131
132
133
134
135
#===================================================================
# Step 3: Define the point to split the dataset to 3/4
print("\nStep 3: Define the point to split the datasets to 3/4\n")
newEncodedInput = list(zip(newX1, integer_encodedX2, integer_encodedX3,
integer_encodedX4))
X_Train, X_Test, y_Train, y_Test = train_test_split(newEncodedInput,
integer_encodedY, test_size = 0.25, random_state = 0)
print("\nTrained X set:", X_Train)
print("\nTest X set:", X_Test)
print("\nTrained y set:", y_Train)
print("\nTest y set:", y_Test)
Output 10.5: Step 3
Step 3: Define the point to split the datasets to 3/4
Trained X set: [('60', 5, 1, 1), ('34', 3, 1, 0), ('52', 8, 1, 1), ('41',
8, 1, 1), ('34', 1, 2, 1), ('44', 9, 1, 1), ('40', 4, 1, 1), ('32', 1, 2,
0), ('43', 0, 2, 1), ('37', 8, 1, 1), ('46', 1, 2, 2), ('42', 4, 2, 2),
('49', 7, 1, 1), ('31', 8, 2, 2), ('34', 4, 1, 1), ('60', 5, 0, 1), ('46',
2, 1, 2), ('56', 0, 1, 1), ('38', 9, 0, 1), ('29', 4, 1, 1), ('32', 1, 1,
1), ('32', 1, 1, 1), ('56', 4, 1, 2), ('55', 7, 0, 2), ('32', 8, 1, 1),
('49', 0, 0, 1), ('28', 7, 2, 1), ('35', 1, 1, 1), ('55', 7, 1, 1), ('59',
0, 1, 1)]
Test X set: [('30', 6, 2, 1), ('49', 5, 1, 1), ('52', 0, 0, 1), ('54', 0,
1, 2), ('38', 0, 2, 1), ('35', 4, 0, 2), ('60', 5, 0, 1), ('30', 1, 1, 1),
('41', 0, 1, 1), ('30', 0, 1, 2)]
Trained y set: [0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 1 1]
Test y set: [0 0 0 1 1 1 0 1 1 0]
In Step 4, the defined train and test inputs and outputs are used to train and test the model (i.e., predict
the possible output). This is achieved through the DecisionTreeClassifier() class (sklearn.tree module), which creates the decisionTree object
used for the output prediction (lines 144–146). The reader should note the arguments passed to the classifier: the criterion (entropy or gini), splitter = "best", random_state = 100, max_depth = 100,
and min_samples_leaf. In the listing as printed, the version using the Gini index with min_samples_leaf = 6 is active, while the version using entropy with min_samples_leaf = 1 is commented out.
Observation 10.11 – DecisionTreeClassifier(): The class used to create the decision tree model.
In terms of the entropy mechanism, the mathematical equation used is: $E = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $p_i$ is the proportion of values that belong to class $i$. The idea is
to calculate the entropy of the mixed output values found in a node (i.e., in a subset of the train dataset). If the classes are evenly mixed, the entropy will be close to 1 (its maximum for two classes); if one class dominates, it will be close to 0. Ideally, the
preferred value is 0, which means that the node holds
largely homogeneous values. When visualizing the decision tree, the value of entropy suggests the impurity of the
values in the related tree or sub-tree. The alternative to
entropy is the Gini index mechanism, which is also used by
the classifier to organize the decision tree. Its mathematical
equation is: $\text{Gini Index} = 1 - \sum_{k} P(x = k)^2$. This also measures the impurity
among various partitions of the dataset, with 0 again indicating a pure partition. In the case of this example, both mechanisms are included: in the listing as printed, the Gini index version is active and the entropy version is deactivated as a comment. Switching the activation of one
over the other would showcase that the results are quite similar. For further information on either entropy
or the Gini index, the reader is advised to study textbooks specifically focused on ML.
Observation 10.12 – Entropy, Gini Index: The mathematical models used to define and organize the decision tree. They measure the level of impurity of the values in the dataset used for the tree.
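Both impurity measures are simple enough to compute by hand, which helps when reading the values displayed in a plotted tree. The following sketch implements the two formulas above in plain Python; the function names and the three small label lists are illustrative assumptions and do not come from the bank dataset.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion of each distinct label in the list."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the classes present."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini index = 1 - sum(p_k ** 2) over the classes present."""
    return 1 - sum(p ** 2 for p in class_probabilities(labels))

pure = ['yes'] * 8                    # one class only
mixed = ['yes'] * 4 + ['no'] * 4      # evenly mixed
skewed = ['yes'] * 7 + ['no'] * 1     # one class dominates

for name, labels in (("pure", pure), ("mixed", mixed), ("skewed", skewed)):
    print("{:<6} entropy = {:.3f}, gini = {:.3f}".format(
        name, entropy(labels), gini(labels)))
```

For two classes, entropy ranges from 0 to 1 and the Gini index from 0 to 0.5. Both are 0 for a pure node and reach their maximum for an even mix, which is why the two criteria often lead to similar trees.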
Observation 10.13 – Parameter max_depth: Used to define the maximum depth of the decision tree (unlimited if omitted).
Observation 10.14 – Parameter min_samples_leaf: Used to define the minimum number of training samples that each leaf of the decision tree must contain; larger values produce a smaller tree and, therefore, a smaller visualization.
There are two more parameters specified in DecisionTreeClassifier() that affect the size of the tree and, therefore, its visualization: max_depth and min_samples_leaf. The former determines the maximum
depth of the tree. If omitted, the tree will have no maximum depth but will grow as deep as necessary
according to the calculation and the dataset. The latter determines the minimum number of training samples that each leaf must contain. If its value
is 1, a leaf may hold a single sample, allowing the tree to grow to its fullest. Increasing
the value of min_samples_leaf results in a smaller tree (and a smaller visual depiction), since more samples are grouped in each leaf. As mentioned, the present sample code includes two alternative
versions of DecisionTreeClassifier() (lines
141–146): one using entropy and one the Gini index. The
former uses a min_samples_leaf value of 1, while
the latter a value of 6. Notice the difference in the size
of the visual depiction of the decision tree in each case,
and also how the algorithm bases its decisions on the
columns of the dataset that have the greatest influence on
the output:
#====================================================================
# Step 4: Create the classifier, then train and test it
# Create the classifier object using its main attributes: criterion
# can be entropy or gini, splitter can be best or random
print("\nStep 4: Define the point to split the datasets to 3/4")
#decisionTree = DecisionTreeClassifier(criterion = "entropy",
#    splitter = "best", random_state = 100, max_depth = 100,
#    min_samples_leaf = 1)
decisionTree = DecisionTreeClassifier(criterion = "gini",
    splitter = "best", random_state = 100, max_depth = 100,
    min_samples_leaf = 6)
# Train the classifier on the input (X_Train) & the output (y_Train)
arrayX_Train = np.array(X_Train)
arrayY_Train = np.array(y_Train)
print("\nThe input dataset to train is:\n", arrayX_Train)
print("\nThe output dataset to train is:\n", arrayY_Train)
decisionTree.fit(arrayX_Train, arrayY_Train)
arrayY_Test1 = np.array(y_Test)
# Note: the test labels are repeated four times to match the four
# input columns and are then passed to predict(). Predictions are
# normally made on the test features instead (assuming these are
# held in a variable such as X_Test)
arrayY_Test = list(zip(arrayY_Test1, arrayY_Test1, arrayY_Test1,
    arrayY_Test1))
print("\nThe output dataset to test is:\n", arrayY_Test)
y_Predict = decisionTree.predict(arrayY_Test)
print("\nThe predicted output is:\n", y_Predict)
Output 10.5: Step 4
Step 4: Define the point to split the datasets to 3/4
The input dataset to train is:
[['60' '51 '1' '1']
['34' '3' '1' '0']
['52' '8' '1' '1']
['41' '8' '1' '1']
['34' '1' '2' '1']
['44' '9' '1' '1']
['40' '4' '1' '1']
['32' '1' '2' '0']
['43' '0' '2' '1']
['37' '8' '1' '1']
['46' '1' '2' '2']
...
The output dataset to train is:
[0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 1 1]
The output dataset to test is:
[(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1),
(1, 1, 1, 1), (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0)]
The predicted output is:
[0 0 0 0 0 0 0 0 0 0]
In Step 5, the code inverts the output to the original column values, calculates the confusion matrix and the accuracy score, and provides the classification report. For the inversion of the output, the label encoders are used in the same way as for the integer-encoded arrays used in the model. Next, the confusion matrix is printed, followed by the accuracy score (50%). The reader should note that, in an ideal scenario, the value of the latter approaches the 100% mark. Finally, the classification report is displayed with all the relevant details. These tasks are coded in lines 164–174 of the full script. The output shows the results of Step 5.
From one training dataset, the CART algorithm can build several decision trees. Performance criteria determine which tree is preferable for the task at hand. Several performance metrics are used, the most common being accuracy, the confusion matrix, precision, recall and F-score. Accuracy represents the overall correctness of a tree. It is calculated as the number of correctly classified observations divided by the total number of observations, and is expressed as a percentage. For example, if 100 observations are tested and 70 of them are correctly classified, the accuracy of that tree will be 70%. A higher accuracy suggests a better performance for the decision tree.
The confusion matrix represents the overall behavior of the tree, based on the test or train datasets. It provides more insight into the performance of the tree on each class label. The size of the confusion matrix depends on the class labels, as it is always n × n, where n denotes the number of class labels. For instance, if there are three class labels in a dataset, the confusion matrix will be 3 × 3. In the case of the bank dataset, the confusion matrix will be 2 × 2, as it has only two class labels (Yes/No). The matrix also provides a breakdown of the number of observations of each label that are wrongly categorized by the tree. Such information is not provided by the accuracy score.
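To make the structure of the matrix concrete, the sketch below builds a 2 × 2 confusion matrix by hand, using the actual and predicted labels reported in Output 10.5 (Step 5). It is a plain-Python illustration rather than the scikit-learn call used in the script, and the variable names are chosen for this example only.

```python
# Actual and predicted labels as reported in Output 10.5 (Step 5)
actual = ['no', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']
predicted = ['no'] * 10

labels = ['no', 'yes']          # row/column order of the matrix
matrix = [[0] * len(labels) for _ in labels]

# Rows hold the actual label, columns hold the predicted label
for a, p in zip(actual, predicted):
    matrix[labels.index(a)][labels.index(p)] += 1

# Accuracy is the share of observations on the main diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

print("Confusion matrix:", matrix)
print("Accuracy:", accuracy * 100, "%")
```

The second row shows what the accuracy alone hides: every observation labelled yes was predicted as no.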
Precision is the measurement of the relevance-based accuracy for each label, i.e., the ratio of the number of observations correctly predicted as that label over the total number of observations predicted as that label. For example, assume a tree that has classified 60 customers out of 100 as Yes. However, only 40 out of the 60 classifications are correct. Thus, the precision will be 40/60 or 0.667.
Recall is the measure of relevance with respect to the observations that actually belong to a class label, i.e., the ratio of the number of observations correctly predicted as that label over the total number of observations that actually carry it. For example, assume a tree that predicts 60 responses of Yes in a dataset of 100. If 40 of these predictions are correct, while the dataset has 75 observed responses of Yes, the recall will be 40/75 or 0.533.
The F-score combines both the recall and the precision values into a single value. This value represents the performance in terms of relevance for each label. A higher F-score indicates that the classifier is performing better and is more fine-tuned than one with lower values.
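The worked figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the common F1 definition of the F-score (the harmonic mean of precision and recall); the text does not state which variant is meant, and the variable names are illustrative only.

```python
# Figures from the worked examples: 60 observations predicted as Yes,
# 40 of them correct, and 75 observations that are actually Yes
predicted_yes = 60
correct_yes = 40
actual_yes = 75

precision = correct_yes / predicted_yes      # 40/60
recall = correct_yes / actual_yes            # 40/75

# Assumption: F-score taken as F1, the harmonic mean of the two
f_score = 2 * precision * recall / (precision + recall)

print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F-score:", round(f_score, 3))
```

Because the harmonic mean is pulled towards the smaller of the two values, a classifier cannot reach a high F-score by excelling at precision alone or recall alone.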
#===================================================================
# Step 5: Invert the encoded values and calculate the confusion matrix,
# the accuracy score, and the classification report
print("\nStep 5: Invert the integer encoded results into "
    "their original text-based")
invertedY_Test = label_encoderY.inverse_transform(y_Test)
print("The inverted output test values are:", invertedY_Test)
invertedPredicted = label_encoderY.inverse_transform(y_Predict)
print("The inverted predicted values of the output are:",
    invertedPredicted)
confusionMatrix = confusion_matrix(invertedY_Test, invertedPredicted)
print("The confusion matrix for the particular case is:\n",
    confusionMatrix)
accuracyScore = accuracy_score(invertedY_Test, invertedPredicted)
print("\nThe accuracy of the model given the test data is: ",
    accuracyScore * 100, "%")
classificationReport = classification_report(y_Test, y_Predict)
print("\nThe classification report is as follows:\n",
    classificationReport)
Output 10.5: Step 5
Step 5: Invert the integer encoded results into their original text-based
The inverted output test values are: ['no' 'no' 'no' 'yes' 'yes' 'yes'
 'no' 'yes' 'yes' 'no']
The inverted predicted values of the output are: ['no' 'no' 'no' 'no'
 'no' 'no' 'no' 'no' 'no' 'no']
The confusion matrix for the particular case is:
 [[5 0]
 [5 0]]
The accuracy of the model given the test data is:  50.0 %
The classification report is as follows:
               precision    recall  f1-score   support
           0       0.50      1.00      0.67         5
           1       0.00      0.00      0.00         5
    accuracy                           0.50        10
   macro avg       0.25      0.50      0.33        10
weighted avg       0.25      0.50      0.33        10
Finally, Step 6 implements the statements used to visualize the decision tree, based on the parameters specified in the previous steps. The reader should note that the names of the features of the depicted decision tree, referred to as graphCols, must be defined before the tree is visualized, so that proper labels are attached to the respective tree classifications:
#====================================================================
# Step 6: Visualizing the CART Decision Tree
# Define the names of the labels/features to be depicted in the
# decision tree
graphCols = ['age', 'Jobs', 'marital','education']
# Define the type of I/O to be used for the visualization of the
# decision tree
dot_data = StringIO()
# Use the export_graphviz() to prepare the visualization of the
# decision tree
export_graphviz(decisionTree, out_file = dot_data, filled = True,
feature_names = graphCols, rounded = True)
# Use the pydotplus library to plot the decision tree
graph = pydotplus.graphviz.graph_from_dot_data(dot_data.getvalue())
# Save the graph of the decision tree as a .png file in the local
# folder
graph.write_png("test.png")
Image(graph.create_png())
Output 10.5.a: Depicting the Decision Tree using gini index and min_samples_leaf = 6
Output 10.5.b: Depicting the Decision Tree using entropy and min_samples_leaf = 1
10.6 SUPERVISED LEARNING ALGORITHMS: NAÏVE BAYES CLASSIFIER
Naïve Bayes is a probabilistic model, which can therefore generalize the classification problem using a set of probabilities. The main concept of this model is based on the well-known Bayes' theorem. The theorem can solve the problem of finding the probability of an event by using existing data for the conditions related to the event. For example, the probability of an event A occurring while event B is true is given by the equation below. This is also referred to as posterior probability.

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Observation 10.15 – Naïve Bayes Classifier: A supervised ML algorithm that is used to find the probability of an event given certain conditions. This probability is referred to as posterior probability. The known information is referred to as prior probability.
P(B|A) represents the known information about B occurring when A is true, and is called the likelihood. P(A) is the probability of A occurring without any condition. It is called the prior probability, as it is part of the existing knowledge. P(B) represents the probability of event B occurring and is called the evidence. Using the prior probability, the evidence and the likelihood, a Naïve Bayes model can determine the posterior probability of each class label for a set of features, and assign a label based on these probabilities. The label with the highest posterior probability is assigned to the current observation.
As an example, consider the following weather data covering the previous 7 days, as given in Table 10.1. Based on the weather conditions, the pilot instructors decide whether to run a training flight or not.
The theorem can be used to make a decision for the following weather conditions:
1. Appearance: Sunny
2. Temperature: Hot
3. Windy: False
To find the posterior probability for each label, first calculate the probabilities for label Yes:
P(Yes) = 3/7
P(Sunny | Yes) = 1/3
P(Hot | Yes) = 1/3
P(False | Yes) = 3/3
Leaving out the evidence, which is the same for every label and therefore does not affect the comparison, the posterior probability for label Yes would be the following:
P(Yes | (Sunny, Hot, False)) = P(Sunny | Yes) * P(Hot | Yes) * P(False | Yes) * P(Yes) =
= (1/3) * (1/3) * (3/3) * (3/7) = 0.048
TABLE 10.1
Weather Data for Previous 7 Days
Appearance    Temperature    Windy    Training Flight?
Sunny         Cold           False    Yes
Cloudy        Mild           False    Yes
Sunny         Cold           True     No
Rainy         Hot            False    Yes
Rainy         Cold           True     No
Cloudy        Hot            True     No
Cloudy        Cold           False    No
Similarly, the posterior probability for label No for the same observation would be the following:
P(No | (Sunny, Hot, False)) = P(Sunny | No) * P(Hot | No) * P(False | No) * P(No) =
= (1/4) * (1/4) * (1/4) * (4/7) = 0.009
In this case, the posterior probability of Yes is higher than that of No. Therefore, the training flight will run under the weather conditions Appearance: Sunny, Temperature: Hot and Windy: False.
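The hand calculation above can be checked with a short script. The sketch below encodes the rows of Table 10.1 and multiplies the prior of each label by the conditional probability of each observed feature value. As in the worked example, the evidence term is left out, so the scores are only meaningful relative to one another. The function and variable names are assumptions made for this illustration.

```python
# Rows of Table 10.1: (appearance, temperature, windy, training flight)
data = [
    ("Sunny", "Cold", False, "Yes"),
    ("Cloudy", "Mild", False, "Yes"),
    ("Sunny", "Cold", True, "No"),
    ("Rainy", "Hot", False, "Yes"),
    ("Rainy", "Cold", True, "No"),
    ("Cloudy", "Hot", True, "No"),
    ("Cloudy", "Cold", False, "No"),
]

def score(observation, label):
    """Prior of the label times the conditional probability of each feature."""
    rows = [row for row in data if row[-1] == label]
    result = len(rows) / len(data)                      # P(label)
    for index, value in enumerate(observation):
        matches = sum(1 for row in rows if row[index] == value)
        result *= matches / len(rows)                   # P(value | label)
    return result

observation = ("Sunny", "Hot", False)
scores = {label: score(observation, label) for label in ("Yes", "No")}
decision = max(scores, key=scores.get)

print(scores)
print("Training flight:", decision)
```

A feature value that never occurs with a label would drive its score to zero; practical implementations usually apply smoothing to avoid this, which the sketch omits for clarity.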
Naïve Bayes may have three different implementations, depending on the data. In the case of continuous data, the Gaussian distribution is more suitable, whereas in the case of nominal data the multinomial distribution could produce better results. In the latter case (i.e., multinomial distribution), the implementation can be expressed in the following seven steps, with the last two being optional:
• Step 1: Import/read the data.
• Step 2: Split the input data into train and test sets.
• Step 3: Build the multinomial Naïve Bayes classifier.
• Step 4: Predict the results based on the classifier.
• Step 5: Find the accuracy of the classification model as a percentage.
• Step 6 (Optional): Visualize the results of the trained set.
• Step 7 (Optional): Visualize the results of the test set.
The following script uses students’ Midterm Exam and Project grades to create the Naïve Bayes model and visualize the results:
# Import train_test_split to train and test the input
from sklearn.model_selection import train_test_split
# Import StandardScaler to scale the data
from sklearn.preprocessing import StandardScaler
# Import the Multinomial Naïve Bayes to create the classifier object
from sklearn.naive_bayes import MultinomialNB
# Import the accuracy_score to calculate the accuracy of the model
from sklearn.metrics import accuracy_score
# Import Numpy to prepare the plot parameters
import numpy as np
# Import Pyplot to create the plot
import matplotlib.pyplot as plt
# Import ListedColormap to color the data points in the plot
from matplotlib.colors import ListedColormap
# Plot inline
# This is particularly relevant in Jupyter Anaconda
%matplotlib inline
# Step 1: Define the input dataset. X must be a 2D list
# with as many rows as the observations
X = [[30, 75], [84, 89], [79, 84], [71, 74], [68, 71], [81, 70],
[61, 78], [89, 81], [58, 78], [70, 71], [70, 70], [90, 76],
[86, 92], [72, 70], [70, 72], [82, 87], [51, 78], [44, 71],
[82, 92], [50, 68]]
y = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
# Step 2: Split the set X and y into train and test sets
# Test size is 25% of the dataset, Train size is 75%
# The new train and test lists will be in random order
X_train, X_test, y_train, y_test = train_test_split(X, y, \
test_size = 0.25, random_state = 0)
print("Trained X set:", X_train); print("Test X set:", X_test)
print("Trained y set:", y_train); print("Test y set:", y_test)
# Step 3: Build the Naïve Bayes classifier
# Fit the classifier to the training set
model = MultinomialNB().fit(X_train, y_train)
print("\n", model)
# Step 4: Predict the test results
y_pred = model.predict(X_test)
print("\nResults predicted by the model:", y_pred)
print("Results from the test:", y_test)
# Probability of class 1 for every observation (displayed only when
# this is the last statement of a notebook cell)
model.predict_proba(X)[:,1]
# Step 5: Calculate the accuracy of the model
# Use the y_test (actual output) and the y_pred (predicted output)
accuracy = accuracy_score(y_test, y_pred)
print("The accuracy of the model given the test data is: ",
    accuracy * 100, "%")
# Step 6: Visualize the training set results
# Convert the lists to arrays so that the mask y_set == j selects
# the points of each class
X_set, y_set = np.array(X_train), np.array(y_train)
X1, X2 = np.meshgrid(
    np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
    X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
    cmap = ListedColormap(('red','blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
plt.title('Naive Bayes: Training set')
plt.xlabel("Midterm Exam")
plt.ylabel("Project")
plt.show()
# Step 7: Visualize the test results
X_set, y_set = np.array(X_test), np.array(y_test)
X1, X2 = np.meshgrid(
    np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
    X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
    cmap = ListedColormap(('red','blue')))
plt.xlim(X1.min(), X1.max()); plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
plt.title('Naive Bayes: Test set')
plt.xlabel("Midterm Exam"); plt.ylabel("Project")
plt.show()
Output 10.6:
Trained X set: [[44, 71], [61, 78], [72, 70], [68, 71], [79, 84], [81, 70],
[70, 72], [70, 71], [89, 81], [51, 78], [90, 76], [71, 74], [30, 75],
[82, 87], [86, 92]]
Test X set: [[82, 92], [84, 89], [50, 68], [58, 78], [70, 70]]
Trained y set: [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
Test y set: [1, 1, 0, 0, 1]
MultinomialNB()
Results predicted by the model: [1 1 0 0 1]
Results from the test: [1, 1, 0, 0, 1]
The accuracy of the model given the test data is:  100.0 %
In this case, the output suggests that Naïve Bayes can predict the final grade (Pass/Fail) for the students with 100% accuracy on the five test observations. For the same data, a different implementation of Naïve Bayes may produce results with large variations (e.g., in the case of the Gaussian Naïve Bayes function, the accuracy will be significantly lower). The reason for this is that each Naïve Bayes function assumes a different distribution for the data, so its performance depends on the nature of the data.
10.7 UNSUPERVISED LEARNING ALGORITHMS: K-MEANS CLUSTERING
The k-means clustering algorithm is an unsupervised ML approach used to solve clustering problems in ML or data science. Its aim is to group unlabeled datasets into different clusters, where k is equal to the chosen number of newly created clusters. Each cluster is associated with a centroid, a data point representing the center of a cluster. The algorithm seeks to minimize the sum of distances between the data points and the centroids of their corresponding clusters. Its applications may be relevant in different domains, such as customer segmentation, insurance fraud detection, and document classification, to name just a few. Figure 10.3 presents a case of two clusters (k = 2) being identified in the source dataset.
Observation 10.16 – K-means Clustering: An unsupervised ML algorithm that aims to group unlabeled datasets into a number (k) of different clusters, each associated with a centroid data point representing the center of the cluster.
K-means is, essentially, an iterative algorithm. First, a value is selected for k, which represents the number of clusters (e.g., k = 3 for 3 clusters). Next, the algorithm randomly assigns each data point to one of the clusters. Finally, it calculates the centroid of each cluster. Once this iteration is complete, a new one commences. At this stage, the algorithm reassigns each data point to the cluster with the closest centroid and then recalculates the centroids. It repeats these last two steps until no data point switches from one cluster to another, at which point it is complete.
FIGURE 10.3 k-means clusters and their centroids. (See Raghupathi, 2018.)
Implementing the k-means algorithm usually involves the following steps:
• Step 1: Select the number of clusters (k). One could also use the elbow method to determine the optimal number.
• Step 2: Select a random centroid for each cluster. Note that the centroids need not be points of the input dataset.
• Step 3: Measure the (Euclidean) distance between each data point and the centroids. Assign each data point to its closest centroid.
• Step 4: Calculate the variance and set a new centroid for each cluster (i.e., calculate the mean of all the points of each cluster and use it as the new centroid).
• Step 5: Repeat Steps 3 and 4 until the centroid positions do not change.
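The five steps above can be sketched in plain Python without any ML library. This is a minimal illustration under stated assumptions, not the scikit-learn implementation used in the chapter: the function name, the six sample points and the choice of picking the initial centroids among the data points are all made up for this example.

```python
import random

def kmeans(points, k, max_iter = 100, seed = 0):
    """Minimal k-means: returns the final centroids and clusters."""
    rng = random.Random(seed)
    # Step 2: pick k initial centroids (here: randomly chosen data points)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid, using the
        # squared Euclidean distance (same ordering as the distance itself)
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Step 4: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        # Step 5: stop once the centroid positions no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Step 1: choose k; the points below are hypothetical sample data
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k = 2)
print("Centroids:", centroids)
print("Clusters:", clusters)
```

On data with two well-separated groups such as this, the loop settles after a few iterations. On real data the result can depend on the initial centroids, which is one reason library implementations offer smarter initialization schemes.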
The implementation of this approach in Python is rather straightforward, making it accessible to novice programmers and/or data scientists with no programming background. The following script is an example of a k-means algorithm implementation, with the objective of grouping 100 customers based on their annual incomes and spending scores:
# Import Pandas
import pandas as pd
# Import Numpy for data manipulation
import numpy as np
# Import the KMeans library from the sklearn
from sklearn.cluster import KMeans
# Import the Pyplot to create the plot
import matplotlib.pyplot as plt
# Plot inline
# This is particularly relevant in Jupyter Anaconda
%matplotlib inline
# Import the operating system module
import os
# Import the Python data visualization library based on matplotlib
import seaborn as sns
sns.set(context = "notebook", palette = "Spectral", style = 'darkgrid',
font_scale = 1.5, color_codes = True)
# X is a list of 100 samples for customers, each representing the
# annual income and the spending score
X = [[15, 39], [15, 81], [16, 6], [16, 77], [17, 40], [17, 76],
[18, 6], [18, 94], [19, 3], [19, 72], [19, 14], [19, 99],
[20, 15], [20, 77], [20, 13], [20, 79], [21, 35], [21, 66],
[23, 29], [23, 98], [24, 35], [24, 73], [25, 5], [25, 73],
[28, 14], [28, 82], [28, 32], [28, 61], [29, 31], [29, 87],
[30, 4], [30, 73], [33, 4], [33, 92], [33, 14], [33, 81],
[34, 17], [34, 73], [37, 26], [37, 75], [38, 35], [38, 92],
[39, 36], [39, 61], [39, 28], [39, 65], [40, 55], [40, 47],
[40, 42], [40, 42], [42, 52], [42, 60], [43, 54], [43, 60],
[43, 45], [43, 41], [44, 50], [44, 46], [46, 51], [46, 46],
[46, 56], [46, 55], [47, 52], [47, 59], [48, 51], [48, 59],
[48, 50], [48, 48], [48, 59], [48, 47], [49, 55], [49, 42],
[50, 49], [50, 56], [54, 47], [54, 54], [54, 53], [54, 48],
[54, 52], [54, 42], [54, 51], [54, 55], [54, 41], [54, 44],
[54, 57], [54, 46], [57, 58], [57, 55], [58, 60], [58, 46],
[59, 55], [59, 41], [60, 49], [60, 40], [60, 42], [60, 52],
[60, 47], [60, 50], [61, 42], [61, 49]]
# Convert the list to an np.array for plotting the clusters
# of customers
X = np.array(X)
# Find the optimal number of clusters (elbow method)
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters = i, init = 'k-means++',
        random_state = 42)
    kmeans.fit(X)
    # The inertia_ attribute holds the WCSS of the fitted model:
    # WCSS is the sum of squared distances between each point
    # and the centroid of its cluster
    wcss.append(kmeans.inertia_)
# Plot the WCSS against the number of clusters
plt.figure(figsize = (10,5))
sns.lineplot(x = list(range(1, 15)), y = wcss, marker = 'o',
    color = 'red')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Output 10.7.a:
The output illustrates the identification of the optimal number of clusters for k-means, in this case 4. Next, this number is used to find, organize, and illustrate the respective clusters with their centroids, as in the following script:
# Fitting K-means to the dataset
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# Plot Annual Income (k$) against Spending Score
plt.figure(figsize = (15,7))
sns.scatterplot(x = X[y_kmeans == 0, 0], y = X[y_kmeans == 0, 1],
    color = 'yellow', label = 'Cluster 1', s = 50)
sns.scatterplot(x = X[y_kmeans == 1, 0], y = X[y_kmeans == 1, 1],
    color = 'blue', label = 'Cluster 2', s = 50)
sns.scatterplot(x = X[y_kmeans == 2, 0], y = X[y_kmeans == 2, 1],
    color = 'green', label = 'Cluster 3', s = 50)
sns.scatterplot(x = X[y_kmeans == 3, 0], y = X[y_kmeans == 3, 1],
    color = 'grey', label = 'Cluster 4', s = 50)
sns.scatterplot(x = kmeans.cluster_centers_[:, 0],
    y = kmeans.cluster_centers_[:, 1], color = 'red',
    label = 'Centroids', s = 300, marker = ',')
plt.grid(False)
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1–100)')
plt.legend()
plt.show()
Output 10.7.b: Finding and illustrating the clusters, their data points, and their centroids
The output identifies the four optimal clusters of the data points and their centroids.
10.8 UNSUPERVISED LEARNING ALGORITHMS: APRIORI
The apriori algorithm is based on rule mining and is mainly used for finding the association between different items in a dataset. However, the algorithm can also be used as a classifier. It explores the data space and keeps all items in a dynamic structure. The apriori algorithm prunes the list of itemsets to keep only those that meet certain criteria. One simple criterion is the use of a threshold value: the lists of frequent items and itemsets can be pruned using threshold values on support and confidence. For example, if the support of an item is less than the threshold value, the item is not added to the frequent items.
The association between items is determined based on two main measurements: support and confidence. Support calculates the likelihood of an item being in the data space and confidence measures the relationship or association of an item with another.
Observation 10.17 – Apriori: An unsupervised ML algorithm used to find the association between different items in a dataset. It is based on the measurements of confidence and support.
Observation 10.18 – Support: Calculates the likelihood of an item being in the data space and filters the reported items. Use parameter min_support = value (0.0–1.0).
For a given item (A) the support is calculated using the following equation (Equation 10.1):

$$\text{Support}(A) = \frac{\text{Number of observations containing } A}{\text{Total number of observations}} \tag{10.1}$$

The confidence is measured using the following equation (Equation 10.2) and represents the association between two items, say A and B:

$$\text{Confidence}(A \text{ to } B) = \frac{\text{Number of observations containing } A \text{ and } B}{\text{Number of observations containing } A} \tag{10.2}$$
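Equations 10.1 and 10.2 translate almost directly into code. The sketch below applies them to the four supermarket transactions of Table 10.2. It is a hand-rolled illustration, not the apyori library used later in the section, and the helper names support() and confidence() are assumptions made for this example.

```python
# The four transactions of Table 10.2, each stored as a set of items
transactions = [
    {"Apple", "Banana", "Biscuits"},
    {"Apple", "Banana", "Bread"},
    {"Bread", "Eggs", "Cereal"},
    {"Apple", "Bread", "Eggs"},
]

def support(items):
    """Equation 10.1: share of transactions that contain all the items."""
    items = set(items)
    containing = sum(1 for t in transactions if items <= t)
    return containing / len(transactions)

def confidence(a, b):
    """Equation 10.2: transactions with A and B over transactions with A."""
    return support(set(a) | set(b)) / support(a)

print("Support(Apple):", support({"Apple"}))
print("Confidence(Apple to Banana):", confidence({"Apple"}, {"Banana"}))
print("Confidence(Bread to Eggs):", confidence({"Bread"}, {"Eggs"}))
```

Dividing one support by another gives the same ratio as Equation 10.2, because the total number of observations cancels out.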
The min_lift parameter sets the minimum lift that a rule must have in order to be reported. Lift indicates the likelihood of an item being associated with another. A value of 1 indicates that the items are not associated. A lift value greater than 1 indicates that an item is likely to be associated with another item, while a value less than 1 means the opposite.
The min_length parameter defines the minimum number of items considered for the rules, and depends on the number of the available items. The association among the items can be determined up to a certain length: if the length of the association is 10, a maximum of ten items can be related to each other. Each one of these combinations is called an itemset. In a large dataset, the number of frequent items and itemsets could be rather substantial.
Observation 10.19 – Confidence: Calculates the level of confidence of the association with another item and filters the reported items. Use parameter min_confidence = value (0.0–1.0).
Observation 10.20 – min_lift: Defines the minimum lift required for a rule to be displayed. A value greater than 1 suggests an association, while a value less than 1 suggests lack of an association.
Observation 10.21 – min_length: Defines the minimum number of items to be considered for the rules, and depends on the number of available items.
The apriori algorithm can be further explained using the dataset provided in Table 10.2. The table lists the four most recent transactions made by customers in a supermarket.
Apriori will start by calculating the support for all items as shown in Table 10.3. Next, it will apply the threshold to trim the item list and build a frequent itemset. Assume that the threshold for the support is 50%. The trimmed list of frequent items is shown in Table 10.4. Similarly, the algorithm will calculate the confidence for finding an association between two items, and trim the list using the threshold on confidence. Eventually, two rules will be selected:
1. If a customer buys an Apple, there is a high chance that the customer will also buy a Banana.
2. If a customer buys Bread, there is a likelihood that the customer will also buy Eggs.
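The lift measure mentioned earlier can be computed for these rules as well. The sketch below assumes the usual definition, lift(A to B) = confidence(A to B) / support(B), which the text does not spell out. The transactions are those of Table 10.2 and the helper names are illustrative only.

```python
# The four transactions of Table 10.2
transactions = [
    {"Apple", "Banana", "Biscuits"},
    {"Apple", "Banana", "Bread"},
    {"Bread", "Eggs", "Cereal"},
    {"Apple", "Bread", "Eggs"},
]

def support(items):
    """Share of transactions that contain all the given items."""
    return sum(1 for t in transactions if set(items) <= t) / len(transactions)

def lift(a, b):
    """Assumed definition: confidence(A to B) divided by support(B)."""
    confidence = support(set(a) | set(b)) / support(a)
    return confidence / support(b)

rules = [("Apple", "Banana"), ("Bread", "Eggs"), ("Apple", "Bread")]
lifts = {(a, b): lift({a}, {b}) for a, b in rules}

for (a, b), value in lifts.items():
    print(a, "to", b, "lift:", round(value, 3))
```

Under this definition the two selected rules have a lift above 1, while Apple to Bread falls below 1, even though both items are individually frequent.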
TABLE 10.2
Transactions at a Supermarket
Transaction ID    Items Purchased
1                 Apple, Banana, Biscuits
2                 Apple, Banana, Bread
3                 Bread, Eggs, Cereal
4                 Apple, Bread, Eggs

TABLE 10.3
Support for All Items
Item        Support
Apple       0.75
Banana      0.5
Bread       0.75
Biscuits    0.25
Cereal      0.25
Eggs        0.5

TABLE 10.4
Frequent Itemset with 50% Support
Item        Support
Apple       0.75
Banana      0.5
Bread       0.75
Eggs        0.5
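The step from Table 10.3 to Table 10.4 is a simple filter on the support values. The sketch below reproduces it for single items with a 50% threshold. It illustrates the pruning idea only and is not how the apyori library is invoked; the variable names are assumptions made for this example.

```python
from collections import Counter

# The four transactions of Table 10.2
transactions = [
    ["Apple", "Banana", "Biscuits"],
    ["Apple", "Banana", "Bread"],
    ["Bread", "Eggs", "Cereal"],
    ["Apple", "Bread", "Eggs"],
]
min_support = 0.5   # the 50% threshold assumed in the text

# Count in how many transactions each item appears (Table 10.3)
counts = Counter(item for t in transactions for item in set(t))
supports = {item: n / len(transactions) for item, n in counts.items()}

# Keep only the items that reach the threshold (Table 10.4)
frequent = {item: s for item, s in supports.items() if s >= min_support}

print("Support for all items:", supports)
print("Frequent items:", frequent)
```

The same filter is then applied to pairs and larger itemsets built from the surviving items, which is what keeps the search space manageable on larger datasets.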
The apriori implementation in Python can be described using the following four steps (the last one being optional):
• Step 1: Import/read the data.
• Step 2: Build the apriori model.
• Step 3: Transform the rules into a dataframe.
• Step 4 (Optional): Create a table to display all the rules.
The following script uses the above data to create the apriori model:
# Import Pandas and Numpy
import pandas as pd
import numpy as np
# Import the apriori model
from apyori import apriori
# Import the accuracy_score to calculate the accuracy of the model
from sklearn.metrics import accuracy_score
# Step 1: Define the input dataset. X must be a 2D list with
# as many rows as the observations
X = [["Apple", "Banana", "Biscuits"], ["Apple", "Banana", "Bread"],
    ["Bread", "Eggs", "Cereal"], ["Apple", "Bread", "Eggs"]]
# Step 2: Build the apriori model
rules = apriori(X, min_length = 2, min_support = 0.1,
    min_confidence = 0.02, min_lift = 1)
# rules = apriori(X, min_length = 2, min_support = 0.5,
#    min_confidence = 0.5, min_lift = 1)
# Step 3: Transform the outputs into an appropriate pd.DataFrame format
results = list(rules)
results = pd.DataFrame(results)
print("The association rules for the particular dataset are:\n",
    results)
# Step 4: Create an output table from the ordered statistics
# Note: not all tables are of the same type
F1 = []; F2 = []; F3 = []; F4 = []
C3 = results.support
for i in range(results.shape[0]):
    single_list = results['ordered_statistics'][i][0]
    F1.append(list(single_list[0]))
    F2.append(list(single_list[1]))
    F3.append(single_list[2])
    F4.append(single_list[3])
# First column of the table
C1 = pd.DataFrame(F1)
# Second column of the table
C2 = pd.DataFrame(F2)
# Fourth column of the table
C4 = pd.DataFrame(F3, columns = ['Confidence'])
# Fifth column of the table
C5 = pd.DataFrame(F4, columns = ['Lift'])
# Concatenate all tables into one
table = pd.concat([C1, C2, C3, C4, C5], axis = 1)
print("\nImproved format of the association rules for the dataset:\n",
    table)
Output 10.8.a–10.8.c:
The association rules for the particular dataset are:
                         items  support  \
0                      (Apple)     0.75
1                     (Banana)     0.50
2                   (Biscuits)     0.25
3                      (Bread)     0.75
4                     (Cereal)     0.25
5                       (Eggs)     0.50
6              (Apple, Banana)     0.50
7            (Apple, Biscuits)     0.25
8               (Apple, Bread)     0.50
9                (Apple, Eggs)     0.25
10          (Banana, Biscuits)     0.25
11             (Banana, Bread)     0.25
12             (Bread, Cereal)     0.25
13               (Eggs, Bread)     0.50
14              (Eggs, Cereal)     0.25
15   (Apple, Banana, Biscuits)     0.25
16      (Apple, Banana, Bread)     0.25
17        (Apple, Eggs, Bread)     0.25
18       (Eggs, Bread, Cereal)     0.25

                                   ordered_statistics
0                          [((), (Apple), 0.75, 1.0)]
1                          [((), (Banana), 0.5, 1.0)]
2                       [((), (Biscuits), 0.25, 1.0)]
3                          [((), (Bread), 0.75, 1.0)]
4                         [((), (Cereal), 0.25, 1.0)]
5                            [((), (Eggs), 0.5, 1.0)]
6   [((), (Apple, Banana), 0.5, 1.0), ((Apple), (B...
7   [((), (Apple, Biscuits), 0.25, 1.0), ((Apple),...
8                    [((), (Apple, Bread), 0.5, 1.0)]
9                    [((), (Apple, Eggs), 0.25, 1.0)]
10  [((), (Banana, Biscuits), 0.25, 1.0), ((Banana...
11                 [((), (Banana, Bread), 0.25, 1.0)]
12  [((), (Bread, Cereal), 0.25, 1.0), ((Bread), (...
13  [((), (Eggs, Bread), 0.5, 1.0), ((Bread), (Egg...
14  [((), (Eggs, Cereal), 0.25, 1.0), ((Cereal), (...
15  [((), (Apple, Banana, Biscuits), 0.25, 1.0), (...
16  [((), (Apple, Banana, Bread), 0.25, 1.0), ((Ap...
17  [((), (Apple, Eggs, Bread), 0.25, 1.0), ((Brea...
18  [((), (Eggs, Bread, Cereal), 0.25, 1.0), ((Bre...

Improved format of the association rules for the dataset:
           0         1         2  support  Confidence  Lift
0      Apple      None      None     0.75        0.75   1.0
1     Banana      None      None     0.50        0.50   1.0
2   Biscuits      None      None     0.25        0.25   1.0
3      Bread      None      None     0.75        0.75   1.0
4     Cereal      None      None     0.25        0.25   1.0
5       Eggs      None      None     0.50        0.50   1.0
6      Apple    Banana      None     0.50        0.50   1.0
7      Apple  Biscuits      None     0.25        0.25   1.0
8      Apple     Bread      None     0.50        0.50   1.0
9      Apple      Eggs      None     0.25        0.25   1.0
10    Banana  Biscuits      None     0.25        0.25   1.0
11    Banana     Bread      None     0.25        0.25   1.0
12     Bread    Cereal      None     0.25        0.25   1.0
13      Eggs     Bread      None     0.50        0.50   1.0
14      Eggs    Cereal      None     0.25        0.25   1.0
15     Apple    Banana  Biscuits     0.25        0.25   1.0
16     Apple    Banana     Bread     0.25        0.25   1.0
17     Apple      Eggs     Bread     0.25        0.25   1.0
18      Eggs     Bread    Cereal     0.25        0.25   1.0
The results demonstrate the apriori model at work, and also highlight the dominant associations
between the items. Strong associations between Bread and Eggs, and between Apple and Banana, are evident.
Changing the parameter values to min_support = 0.5 and min_confidence = 0.5 will change the
reported Output 10.8.d as follows:
The association rules for the particular dataset are:
    items            support  ordered_statistics
0   (Apple)          0.75     [((), (Apple), 0.75, 1.0)]
1   (Banana)         0.50     [((), (Banana), 0.5, 1.0)]
2   (Bread)          0.75     [((), (Bread), 0.75, 1.0)]
3   (Eggs)           0.50     [((), (Eggs), 0.5, 1.0)]
4   (Apple, Banana)  0.50     [((), (Apple, Banana), 0.5, 1.0), ((Apple), (B...
5   (Apple, Bread)   0.50     [((), (Apple, Bread), 0.5, 1.0)]
6   (Bread, Eggs)    0.50     [((), (Bread, Eggs), 0.5, 1.0), ((Bread), (Egg...
Improved format of the association rules for the dataset:
    0       1       support  Confidence  Lift
0   Apple   None    0.75     0.75        1.0
1   Banana  None    0.50     0.50        1.0
2   Bread   None    0.75     0.75        1.0
3   Eggs    None    0.50     0.50        1.0
4   Apple   Banana  0.50     0.50        1.0
5   Apple   Bread   0.50     0.50        1.0
6   Bread   Eggs    0.50     0.50        1.0
Notice how raising the minimum support and the minimum confidence dramatically reduces the
number of reported rules and the size of the output.
The rules extracted by apriori identify the patterns of item sales for a supermarket. The model
can determine similar associations for a larger dataset and the report can be tweaked to display the
top ranking associations (e.g., Eggs and Bread or Apple and Banana).
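The three measures reported above can also be computed by hand, which may help when interpreting the output. The following is a minimal sketch in plain Python, not the library code used in this chapter. The four transactions are invented for illustration (they are not the chapter's dataset), and the helper names support, confidence and lift are assumptions of this sketch.

```python
# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"Apple", "Bread", "Eggs"},
    {"Apple", "Banana"},
    {"Bread", "Eggs", "Cereal"},
    {"Apple", "Banana", "Bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)


# Rule under test: customers who buy Eggs also buy Bread.
antecedent, consequent = {"Eggs"}, {"Bread"}
rule_support = support(antecedent | consequent)
rule_confidence = confidence(antecedent, consequent)
rule_lift = lift(antecedent, consequent)
print("support:", rule_support)
print("confidence:", rule_confidence)
print("lift:", rule_lift)

# A rule is reported only if it passes both thresholds.
min_support, min_confidence = 0.5, 0.5
keep = rule_support >= min_support and rule_confidence >= min_confidence
print("rule kept:", keep)
```

Raising min_support or min_confidence in the last lines makes the filter stricter, which is the effect observed in Output 10.8.d.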
10.9 OTHER LEARNING ALGORITHMS
A number of other ML algorithms are also frequently used in real-life applications. One of the most
popular is random forest (Andrade et al., 2019; Kwon et al., 2015; Naveed & Alrammal, 2017;
Naveed et al., 2020), a supervised ML algorithm. It can be used for both classification and regression. The main idea behind random forest is to create multiple ML decision tree models, with datasets created using what is referred to as a bootstrap sampling method. According to this method,
each sub-dataset is composed of random sub-samples of the original dataset. Each of the defined
training datasets is used to create a different model, using the same ML algorithm and making different predictions. The best prediction is used as the result of the process.
The random forest algorithm can be described using the following four steps:
• Step 1: Select random samples from a given dataset.
• Step 2: Create a decision tree for each sample and get a prediction result for each decision tree.
• Step 3: Perform a vote for each of the predicted results.
• Step 4: Select the prediction result with the highest number of votes as the final prediction.
Observation 10.22 – Random Forest: Create multiple ML decision trees from random subsets of the original dataset. Make predictions for each of the decision trees and vote for the best prediction.
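The four steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions rather than a full random forest. Each "tree" is reduced to a single-split stump on one feature, the data points are invented, and the function names (bootstrap_sample, train_stump, random_forest_predict) are hypothetical. A production implementation, such as the one in scikit-learn, grows full trees and also samples features.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Step 2: fit a one-split 'tree' on (x, label) pairs.

    Every observed x is tried as a threshold, with both label
    orientations, and the split with the fewest errors is kept.
    """
    best = None
    for threshold, _ in rows:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def random_forest_predict(rows, x, n_trees=25, seed=0):
    """Steps 3 and 4: collect one vote per tree and return the majority."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trees):
        stump = train_stump(bootstrap_sample(rows, rng))
        votes[stump(x)] += 1
    return votes.most_common(1)[0][0]


# Invented one-feature dataset: small values belong to class 0, large to class 1.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

low_prediction = random_forest_predict(data, 0.8)
high_prediction = random_forest_predict(data, 7.2)
print("Prediction for 0.8:", low_prediction)
print("Prediction for 7.2:", high_prediction)
```

Because every stump sees a different bootstrap sample, individual stumps may disagree. The vote is what makes the combined prediction more stable than any single one.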
Random forest is considered a highly accurate ML algorithm, and larger numbers of decision
trees generally lead to more robust results. Since it aggregates the predictions of all its trees,
it is less prone to overfitting, and less sensitive to outliers in the original dataset, than a single decision tree. Its
main shortcomings come from the fact that it consists of multiple decision trees. Hence, it is slow in
generating a final prediction, as it has to collect the predictions of all the trees and vote for the best one, and it
is not as straightforward to interpret as a single decision tree.
The K-Nearest Neighbors (k-NN) algorithm uses the entire dataset as a training set, rather than
splitting the dataset into a training and a test set. It assumes that similar data points are in close
proximity to each other. This proximity (or distance) can be calculated using a variety of methods,
such as the Euclidean distance or the Hamming distance (Sharma, 2020). When an outcome is
requested for a new data point, the k-NN algorithm calculates the distances between the new data point and every point in the
dataset, and selects the user-defined number (k) of data points that are most similar to the new data point. Next, it calculates
the mean of their outcomes in the case of regression, or the mode (i.e., the most frequent class) in the case of classification.
Observation 10.23 – k-NN: Use the whole dataset as a training set to calculate the distances between the various k data points in the dataset.
The algorithm of the k-NN model consists of the following six main steps:
• Step 1: Load the data.
• Step 2: Select the number (k) of neighbors.
• Step 3: For each new data point, calculate the distance between the new point and each of the current dataset points.
• Step 4: Add each distance, together with the index of the corresponding dataset point, to a collection.
• Step 5: Sort the collection of distances and indices by distance.
• Step 6: Pick the first k entries from the sorted collection, get their labels, and return the
mean or mode.
The main disadvantage of k-NN is that it becomes significantly slower as the dataset increases
in size.
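The six steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions and not an optimized implementation. The data points and labels are invented, the function name knn_predict is hypothetical, and the Euclidean distance is computed with math.dist, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k=3, regression=False):
    """Predict the outcome for `query` from its k nearest neighbours."""
    # Steps 3 and 4: distance from the query to every stored point,
    # kept together with the index of that point.
    distances = [(math.dist(point, query), index)
                 for index, point in enumerate(points)]
    # Step 5: sort by distance (ties fall back to the index).
    distances.sort()
    # Step 6: labels of the first k entries.
    nearest = [labels[index] for _, index in distances[:k]]
    if regression:
        return sum(nearest) / len(nearest)      # mean of the outcomes
    return Counter(nearest).most_common(1)[0][0]  # mode (most frequent class)


# Step 1: load the data (invented two-feature points).
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
classes = ["A", "A", "A", "B", "B", "B"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Step 2: choose k, then query the model.
print(knn_predict(points, classes, (1.5, 1.5), k=3))
print(knn_predict(points, classes, (6.5, 6.5), k=3))
print(knn_predict(points, values, (1.5, 1.5), k=3, regression=True))
```

Every prediction scans the whole dataset, which is why the running time grows with the number of stored points.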
10.10 WRAP UP - MACHINE LEARNING APPLICATIONS
Through the use of Machine Learning (ML) algorithms, Artificial Intelligence (AI) has penetrated
all forms of human activity. It is highly likely that the vast majority of humans have first-hand experience of this through one of its many real-life applications. Traffic alerts (maps) are one such example,
with several applications being used to suggest routes and help drivers deal with navigation
and traffic. Data are collected from other drivers currently using the same system or network
and/or from historical data of the various routes collected over time. Data collected when users are
using the application or network include their location, average speed, and the route on which they
are travelling. Figure 10.4 illustrates such an example in heavy congestion conditions (i.e., Sheikh
Mohammed bin Rashid Blvd – Downtown Dubai).
Another class of examples of ML algorithms is the various virtual personal assistants. Such
systems assist the users with various daily tasks and include advanced detection capabilities like
understanding the users' voice (e.g., asking "what is my schedule for today?" will trigger the
associated response). Common tasks implemented in contemporary virtual personal assistant
systems include speech recognition, speech-to-text conversion, natural language processing, and
text-to-speech conversion. The systems collect and refine the information based on previous interactions. They are integrated into a variety of platforms, including smart speakers, smartphones,
and mobile apps.
FIGURE 10.4 Traffic alert application.
Social media is another space where ML applications are heavily integrated and used. From personalizing news feeds to better ads targeting, social media platforms are utilizing machine learning
for both corporate and end-user benefits. The list below includes some examples one may be familiar with, perhaps without even realizing that these features are nothing but the practical application
of ML algorithms:
• People You May Know: ML works on a simple concept: understanding through experience. For example, social media platforms continuously monitor the friends one connects
with, the most often visited profiles, one's interests, work and personal status, or the groups
one belongs to. Based on continuous learning, a list of the social media users that one can
become friends with is suggested.
• Face Recognition: A user uploads a personal picture with a friend and the system instantly
recognizes the identity of that friend. Such systems may check the poses and projections in
the picture, identify unique features, and match them with people in the user’s friends or
contact lists. The entire process is based on ML and is commonly referred to as friend tagging. It is a rather complex process taking place at the backend, but it is rather transparent
on the user side, as it seems like a simple and unobtrusive feature at the front end.
• Similar Pins: ML is a core element in computer vision, a technique to extract useful
information from images and videos. An example of this can be seen in platforms which
use computer vision to identify the objects (or pins) in the images and recommend other
related pins accordingly.
House price prediction is yet another example of ML algorithms in action. By leveraging the data
collected from large numbers of houses in relation to their characteristics (e.g., square footage,
number of rooms, property type), the algorithm trains the ML model to predict the price of other
houses. The multiple popular online portals for searching houses or apartments (both for rental and
purchase) are examples of the use of such applications.
Product recommendation is an experience most people have without even noticing. As an example, one can think of using a web browser to check a product on a specific website. It is likely that
while engaging in other online activities, such as watching online videos, the same or similar products appear as an ad. In such cases, the various platforms use smart agents to track the user's search
history and recommend ads based on it.
Recommender systems are another application of ML algorithms. Such systems use collaborative filtering, a method based on gathering and analyzing user behavior information and predicting
what they like based on similarities with other users. Figure 10.5 provides an example of the use
of collaborative filtering in an E-­commerce web app. In this context one can assume a customer
(Customer 1) viewing product A and other customers viewing products A, B, C, and D. Due to the
similarity of interests of all the users in product A, the web app will propose products B, C and D
to Customer 1.
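The scenario described above can be sketched with a few lines of plain Python. This is a deliberately naive illustration of the collaborative filtering idea, assuming that any shared product counts as a similarity. The customer names, the viewing history and the function name recommend are all hypothetical.

```python
def recommend(views, customer):
    """Suggest products viewed by customers who share at least one
    viewed product with `customer`, excluding what was already seen."""
    seen = views[customer]
    suggestions = set()
    for other, items in views.items():
        if other != customer and seen & items:  # at least one shared product
            suggestions |= items - seen
    return sorted(suggestions)


# Invented viewing history, mirroring the scenario in the text.
views = {
    "Customer 1": {"A"},
    "Customer 2": {"A", "B"},
    "Customer 3": {"A", "C", "D"},
    "Customer 4": {"E"},  # shares nothing with Customer 1
}

suggested = recommend(views, "Customer 1")
print(suggested)
```

Real recommender systems usually weight the suggestions by how similar the other users are, rather than treating every overlap equally.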
Among the most important applications of ML is the monitoring of video cameras. In areas or
countries utilizing large numbers of traffic monitoring video cameras, monitoring by human officers can be impractical and challenging. The idea of training computers to accomplish this task comes
in handy in such cases. Similarly, video surveillance systems powered by AI/ML make it possible to
detect suspicious activity, sometimes even before it takes place. This is done by tracking unusual
behavior (e.g., when one stands motionless for a long time, stumbles, or lies down in a public location).
The system can generate alerts sent to human attendants, who can then take appropriate actions. As
activities are reported and verified, they help to improve the surveillance services even further.
In the context of information security, one should note the use of spam filtering. The term refers
to processes monitoring the user’s email traffic and executing appropriate preventive actions. It is
crucial for such systems to ascertain that spam filters are continuously updated; this is accomplished
through ML algorithms. While there are hundreds of thousands of malware and security threats
detected every single day, it is generally accepted that the associated code is 90% or more similar
to its predecessor. ML-­based security programs can identify such coding patterns and detect new
malware with slight coding variations rather easily. Similarly, ML provides great potential to secure
online monetary transactions against fraud. For instance, online payment platforms use a set
of tools that help compare millions of transactions taking place almost simultaneously and identify suspicious or fraudulent activity between buyers and sellers.
FIGURE 10.5 Product recommendations. (See Keshari, 2021.)
Finally, another common application of ML models can be found in the online customer support
services of many e-Business or e-Commerce platforms. Such platforms frequently offer the option
to chat with a customer support representative while navigating the website. While the transaction
may seem like a regular conversation, it is not with a real representative but with a chatbot. The latter extracts information from the website and presents it to the customers in a chat-like form. Every
time a new chat begins, the answer is improved based on the previously recorded answers.
The discussion on ML applications can continue further, with practical use examples like
weather prediction, distinction between animals/plants/objects, or customer segmentation, just to
name a few.
10.11 CASE STUDIES
Use dataset dataset.csv to write a Python script that predicts whether a patient will be readmitted or
not within 30 days. The application should do the following:
1. Read the dataset and create a data frame with the following categories: gender, race,
age, admission type id, discharge disposition id, admission source id, max glu serum,
A1Cresult, change, diabetesMed, readmitted (categorical), time in hospital, number of
lab procedures, number of procedures, number of medications, number of outpatients,
number of emergencies, number of inpatients, number of diagnoses (numerical).
2. Apply the following ML algorithms and calculate their accuracy: logistic regression,
k-­NN, SVM, Kernel SVM, Naïve Bayes, CART Decision Tree, Random Forest.
10.12 EXERCISES
1. Use the CART example in this chapter to change the criterion from entropy to Gini index
and the max depth to 10. How does this affect the accuracy of the model? What is the effect
of changing the max depth to 20?
2. Test both the BEST and RANDOM splitter features on the CART example from this chapter. Explain whether the performance of a decision tree depends on the splitter feature of
the classifier object.
3. Apply a smaller training dataset to the CART decision tree example to investigate whether
the performance will improve or decrease (Hint: Increase and decrease the ratio of the
size of the training dataset).
4. Find the precision, recall and F-score for a CART decision tree with entropy as criterion,
max depth of 4 and min samples leaf nodes of 20.
5. Use the bank dataset to train a decision tree classifier with ten-­fold cross validation and
generate the respective classification report.
REFERENCES
Andrade, E. de O., Viterbo, J., Vasconcelos, C. N., Guérin, J., & Bernardini, F. C. (2019). A model based on LSTM neural networks to identify five different types of malware. Procedia Computer Science, 159, 182–191.
Keshari, K. (2021). Top 10 Applications of Machine Learning: Machine Learning Applications in Daily Life. https://www.edureka.co/blog/machine-learning-applications/.
Kwon, B. J., Mondal, J., Jang, J., Bilge, L., & Dumitraş, T. (2015). The dropper effect: Insights into malware distribution with downloader graph analytics. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (1118–1129), Denver, Colorado.
Mitchell, T. M. (1997). Machine Learning (1st ed.). New York: McGraw-Hill.
Mola, F. (1998). Classification and Regression Trees Software and New Developments BT – Advances in Data Science and Classification (A. Rizzi, M. Vichi, & H.-H. Bock eds.; pp. 311–318). Berlin Heidelberg: Springer.
Naveed, M., & Alrammal, M. (2017). Reinforcement learning model for classification of Youtube movie. Journal of Engineering and Applied Science, 12(9), 1–7.
Naveed, M., Alrammal, M., & Bensefia, A. (2020). HGM: A Novel Monte-Carlo simulations based model for malware detection. IOP Conference Series: Materials Science and Engineering, 946(1), 12003. https://doi.org/10.1088/1757-899x/946/1/012003.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/BF00116251.
Raghupathi, K. (2018). 10 Interesting Use Cases for the K-Means Algorithm. DZone AI Zone. https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm.
Salzberg, S. L. (1994). C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning, 16(3), 235–240. https://doi.org/10.1007/BF00993309.
Sharma, P. (2020). 4 Types of Distance Metrics in Machine Learning. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/.
11 Introduction to Neural Networks and Deep Learning
Dimitrios Xanthidis
University College London
Higher Colleges of Technology
Muhammad Fahim
Higher Colleges of Technology
Han-I Wang
The University of York
CONTENTS
11.1 Introduction
11.2 Relevant Algebraic Math and Associated Python Methods for DL
11.2.1 The Dot Method
11.2.2 Matrix Operations with Python
11.2.3 Eigenvalues, Eigenvectors and Diagonals
11.2.4 Solving Sets of Equations with Python
11.2.5 Generating Random Numbers for Matrices with Python
11.2.6 Plotting with Matplotlib
11.2.7 Linear and Logistic Regression
11.3 Introduction to Neural Networks
11.3.1 Modelling a Simple ANN with a Perceptron
11.3.2 Sigmoid and Rectifier Linear Unit (ReLU) Methods
11.3.3 A Real-Life Example: Preparing the Dataset
11.3.4 Creating and Compiling the Model
11.3.5 Stochastic Gradient Descent and the Loss Method and Parameters
11.3.6 Fitting and Evaluating the Models, Plotting the Observed Losses
11.3.7 Model Overfit and Underfit
11.4 Wrap Up
11.5 Case Study
References
11.1 INTRODUCTION
Deep learning is in fact a new name for an approach to artificial intelligence called neural networks,
which has been going in and out of fashion for more than 70 years. Neural networks were first proposed
in 1944 by Warren McCullough and Walter Pitts, two University of Chicago researchers who moved to
MIT in 1952 as founding members of what’s sometimes called the first cognitive science department.
(Hardesty, 2017)
DOI: 10.1201/9781003139010-11
Human intelligence is an evolutionary, biologically controlled process. Humans learn based on their experiences. Similarly, machine or artificial intelligence is subject to comparable experiences in the form of data. In a broader context, the two forms of intelligence are similar in the sense that they are subject to a common approach: "based on what I have seen and observed I think this will happen next". Once this core idea is
transferred to mathematical constructs and the associated algorithms (self-evolving), machines are
observed to be capable of learning on their own, a process commonly referred to as machine learning (ML). ML is a branch of artificial intelligence (AI), an umbrella term used to describe approaches
and techniques that can make machines think and act in a more rational and human-like way.
Observation 11.1 – Deep Learning: A specialized form of Machine Learning. It uses many layers of algorithms to process the underlying data, which could be human speech, images, text, complex objects, etc.
Deep learning (DL) is a specific form of ML, and therefore another branch of AI (Figure 11.1).
At a basic level, DL is based on mimicking the human thinking process and developing relevant
abstractions and connections. It consists of the following elements:
1. Learning: Facilitating the functionality to artificially obtain and process new information.
2. Reasoning: Offering the functionality to process information in different, and potentially
overlooked, ways.
3. Understanding: Providing ways to showcase the results of the adopted model.
4. Validating: Offering the opportunity to validate the results of the model based on theory.
5. Discovering: Providing the mechanisms to identify new relationships within the data.
6. Extracting: Allowing the extraction of new meanings based on the predictors.
DL uses numerous layers of algorithms to process the underlying data, which could be spoken
words, images, text, or more complex objects. The data are normally passed through interconnected
layers of processing networks, as shown in Figure 11.2.
In ML, there are two types of variables: dependent and independent. One way to contextualize
these variables is to think of independent variables as the inputs of the ML process and dependent
as the outputs. For example, one can predict a person’s weight by knowing that person’s height.
Another notion the reader should be familiar with is that of data plotting. Essentially, plotting
is a way to visualize the data in an effort to identify underlying patterns and groupings. As data
can be scattered, when plotting them the goal is to find a line that represents the best fit for a given
dataset. A simple equation can define such a process: Y = F(X) + B where Y is the dependent variable
(predicted weight) and X the independent variable (an individual’s height).
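To make the idea of a best-fit line concrete, the sketch below fits a straight line to a handful of points using ordinary least squares. It assumes the simplest form of the equation above, Y = A·X + B, where the slope A is an assumption of this sketch (the text only writes F(X)). The height and weight values are invented for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented sample: X is the height in cm, Y the weight in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 74, 82]

slope, intercept = fit_line(heights_cm, weights_kg)
predicted_weight = slope * 175 + intercept
print("slope:", slope, "intercept:", intercept)
print("predicted weight for 175 cm:", predicted_weight)
```

The invented points lie exactly on a line, so the fit recovers it. With scattered real data, the same formulas return the line that minimizes the sum of squared vertical distances to the points.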
In ML, there are mainly two types of predictions:
1. Linear Regression: Linear regression is focused on predicting continuous values. This
topic is thoroughly discussed in Chapter 10: Machine Learning with Python. It is highly
recommended that the reader goes through the basic discussions on that chapter before
proceeding to the next sections of the present one, as they offer a useful foundation for
understanding many aspects of DL.
2. Logistic Regression: Logistic regression is focused on predicting values classified as 0 or
1, and is one of the cornerstones of DL.
FIGURE 11.1 Scope of data-based learning technologies.
FIGURE 11.2 DL processing and layering structure.
DL is applied in cases of learning based on unlabelled data with unknown features. Thus, feature
extraction (FE) is a vital aspect of DL. FE uses algorithms to construct the meaning of the features,
so the training and testing processes can be applied.
This chapter covers the following:
1. An introduction to the theory and mathematical constructs of DL fundamentals, supported
by the associated mathematical equations, and working examples and related Python
scripts.
2. An introductory discussion on Neural Networks (NN) and DL algorithms implementing
NN with working examples and scripts.
3. Examples of building a DL model using NN.
It should be noted that, since there are several mathematical concepts involved in the DL processes,
it is possible to face compatibility issues when working with more than one library. In such cases,
it is often quite useful to know whether a particular library is installed in the system and, if so, which
version. The following statements may come in handy:
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)
Output 11.1:
scipy: 1.4.1
numpy: 1.19.5
matplotlib: 3.2.2
pandas: 1.1.5
statsmodels: 0.10.2
sklearn: 1.0.1
In addition to the Pandas, Matplotlib, NumPy, and SciPy libraries already covered in previous chapters, there are a few more that are essential in DL scripts. Some of these must be installed prior to
their import and use in the script. However, given the variety of installations depending on the operating systems and configurations, it is deemed impractical to cover all those in the present chapter.
The reader is advised to consult the installation instructions available online. A list of these libraries,
with a brief description, follows:
1. TensorFlow: It is used for backpropagation and passes the data for training and prediction.
2. Theano: It helps with defining, optimizing and evaluating mathematical equations on
multi-dimensional arrays. It is very efficient when performing symbolic differentiation.
3. PyTorch: It helps with tensor computations on GPUs and with neural-network-based data
modeling.
4. Caffe: It helps with implementing DL frameworks using improved expressions and speed.
5. Apache MXNet: As a core component, it comes with a dynamic dependency scheduler that
provides parallelism for both symbolic and imperative operations.
11.2 RELEVANT ALGEBRAIC MATH AND ASSOCIATED PYTHON METHODS FOR DL
There are some essential mathematical concepts that must be explained and their Python implementations described before delving into the introduction of DL with Python. The most fundamental are
the dot() method, matrix operations, eigenvalues/eigenvectors and diagonals, solving sets of
equations, generating random numbers, and linear and logistic regression.
11.2.1 The Dot Method
A method often used in DL that is not covered in previous chapters is the dot method. It implements
the math equation that sums the products of two arrays:
\[
x \cdot y = x^{T} y = \sum_{n=1}^{N} x_n y_n
\]
Observation 11.2 – The Dot Method: Calculates the sum of the element-wise products of two vectors, provided in the form of arrays.
The dot method is important in the context of DL, as the main task of the latter is to accept multiple inputs
from various neurons and calculate their weighted sum. Since the inputs are always in the form of vectors
(i.e., pairs of values like a course grade and its weight), the dot method is an effective means for this
calculation. Figure 11.3 illustrates the functionality of the dot method:
FIGURE 11.3 The dot method in DL.
Consider the following Python script:
import numpy as np

# 1x2 and 1x3 arrays
x1, y1 = np.array([1, 2]), np.array([3, 4])
x2, y2 = np.array([1, 2, 3]), np.array([4, 5, 6])
print("The two arrays x1 and y1 are:\n", x1, y1)
print("The two arrays x2 and y2 are:\n", x2, y2)

# Product of 2 arrays calculated as xi*yi (for each pair of elements)
print("\nCreate a new list as products of the elements of the two \
arrays (x1 * y1):", x1 * y1)
print("\nCreate a new list as products of the elements of the two \
arrays (x2 * y2):", x2 * y2)

# Loop calculates the dot method of the 2 arrays (x1, y1 & x2, y2)
Dot = 0
for i in range(len(x1)):
    Dot += x1[i] * y1[i]
print("\nUsing a regular loop to calculate the dot value for \
the 1x2 arrays:", Dot)
Dot = 0
for i in range(len(x2)):
    Dot += x2[i] * y2[i]
print("Using a regular loop to calculate the dot value for \
the 1x3 arrays:", Dot)

# The zip method with parallel iterations calculates
# the dot for x1, y1 and x2, y2
Dot = 0
for g, h in zip(x1, y1):
    Dot += g * h
print("\nUsing the zip method for parallel iterations:", Dot)
Dot = 0
for g, h in zip(x2, y2):
    Dot += g * h
print("Using the zip method for parallel iterations:", Dot)

# The sum method calculates the dot for two arrays
print("\nThe sum of the products of the elements of the two arrays \
(np.sum(x1 * y1)):", np.sum(x1 * y1))
print("\nThe sum of the products of the elements of the two arrays \
(np.sum(x2 * y2)):", np.sum(x2 * y2))

# A different version of the sum method calculates the dot of 2 arrays
print("\nThe sum of the products of the elements of the two arrays \
((x1 * y1).sum()):", (x1 * y1).sum())
print("The sum of the products of the elements of the two arrays \
((x2 * y2).sum()):", (x2 * y2).sum())

# The dot method on two arrays
print("\nUse the dot method on the elements of the two arrays \
(np.dot(x1, y1)):", np.dot(x1, y1))
print("Use the dot method on the elements of the two arrays \
(np.dot(x2, y2)):", np.dot(x2, y2))

# A different version of the dot method on two arrays
print("\nAnother way to use the dot method on the elements \
of the two arrays (x1.dot(y1)):", x1.dot(y1))
print("Another way to use the dot method on the elements \
of the two arrays (x2.dot(y2)):", x2.dot(y2))

# Direct use of the dot notation on two arrays
print("\nAnother way to use the dot method (x1 @ y1):", x1 @ y1)
print("\nAnother way to use the dot method (x2 @ y2):", x2 @ y2)
Output 11.2.1:
The two arrays x1 and y1 are:
 [1 2] [3 4]
The two arrays x2 and y2 are:
 [1 2 3] [4 5 6]

Create a new list as products of the elements of the two arrays (x1 * y1): [3 8]

Create a new list as products of the elements of the two arrays (x2 * y2): [ 4 10 18]

Using a regular loop to calculate the dot value for the 1x2 arrays: 11
Using a regular loop to calculate the dot value for the 1x3 arrays: 32

Using the zip method for parallel iterations: 11
Using the zip method for parallel iterations: 32

The sum of the products of the elements of the two arrays (np.sum(x1 * y1)): 11

The sum of the products of the elements of the two arrays (np.sum(x2 * y2)): 32

The sum of the products of the elements of the two arrays ((x1 * y1).sum()): 11
The sum of the products of the elements of the two arrays ((x2 * y2).sum()): 32

Use the dot method on the elements of the two arrays (np.dot(x1, y1)): 11
Use the dot method on the elements of the two arrays (np.dot(x2, y2)): 32

Another way to use the dot method on the elements of the two arrays (x1.dot(y1)): 11
Another way to use the dot method on the elements of the two arrays (x2.dot(y2)): 32

Another way to use the dot method (x1 @ y1): 11

Another way to use the dot method (x2 @ y2): 32
This script calculates the sum of the products of the corresponding elements of two arrays (i.e., their dot product) in several ways and presents the results. For illustration purposes, it uses two pairs of arrays (i.e., 1 × 2 elements and 1 × 3 elements). The reader should notice the various forms that the dot method can take. The method is quite useful and comes in handy in the examples provided in the following sections.
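The equivalence of these forms can be sketched in a few lines. The arrays x and y below are assumptions, chosen so that their element-wise products are [4 10 18] as in the output above; they are not taken from the original script.

```python
import numpy as np

# Assumed sample arrays (1 x 3); their element-wise products are [4 10 18]
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Regular loop with zip for parallel iteration
loop_result = 0
for a, b in zip(x, y):
    loop_result += a * b

# Equivalent Numpy forms of the same calculation
sum_result = np.sum(x * y)     # multiply element-wise, then sum
dot_result = np.dot(x, y)      # dot function
operator_result = x @ y        # matrix multiplication operator

print(loop_result, sum_result, dot_result, operator_result)
```

All four variables hold the same value, which is why the forms can be used interchangeably for one-dimensional arrays.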
11.2.2 Matrix Operations with Python
Another algebraic concept that is quite useful in DL is that of matrix multiplication. Broadly speaking, this process requires that the size of the second dimension of the first matrix is the same as the size of the first dimension of the second matrix. In other words, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix has the size of the first dimension of the first matrix (or its number of rows) and the size of the second dimension of the second matrix (or its number of columns). Each element of the new matrix is calculated with the dot method, applied to a row of the first matrix and a column of the second.
As an example, one can assume the following two matrices:

$$\text{npArray} = \begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}, \qquad \text{newMatrix} = \begin{bmatrix} 3 & 4 & 5 \\ 1 & 2 & 3 \end{bmatrix}$$
The first array (npArray) has two columns, whereas the second (newMatrix) has two rows. Hence, it is possible to have a new matrix as the product of these two matrices. The resulting matrix will be calculated as follows:

$$\begin{bmatrix} (1 \cdot 3 + 2 \cdot 1) & (1 \cdot 4 + 2 \cdot 2) & (1 \cdot 5 + 2 \cdot 3) \\ (5 \cdot 3 + 6 \cdot 1) & (5 \cdot 4 + 6 \cdot 2) & (5 \cdot 5 + 6 \cdot 3) \end{bmatrix} = \begin{bmatrix} 5 & 8 & 11 \\ 21 & 32 & 43 \end{bmatrix}$$
Another mathematical Python method that often comes in handy when using matrices is exp() from the Numpy library. The method accepts an array of elements (an algebraic matrix) as an argument and creates a new matrix in which each element x of the original is replaced by e^x. Using the previous example of matrix npArray, the resulting matrix will be as follows:

$$\begin{bmatrix} e^{1} & e^{2} \\ e^{5} & e^{6} \end{bmatrix} = \begin{bmatrix} 2.71828183 & 7.3890561 \\ 148.4131591 & 403.42879349 \end{bmatrix}$$

Observation 11.3 – The exp() Method: Creates a new matrix whose elements are e raised to the power of the corresponding elements of the original matrix.

Observation 11.4 – Inverse Matrix: A matrix which, if multiplied by the original, gives the identity matrix.
Another concept often used in DL is that of the inverse matrix. If such a matrix is multiplied by the original, it will result in the identity matrix. If, in turn, the identity matrix is multiplied by the original matrix, it will not change it. This is similar to the integer 1, which does not change the value of any other integer it is multiplied by. The identity matrices for 2 × 2, 3 × 3, and 4 × 4 matrices can be expressed as follows:

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
This pattern can continue in a similar fashion for larger
square matrices. It is important to note that there are
two requirements for a matrix to have a corresponding
inverse: it must be a square matrix and its determinant
value must be non-zero.
The determinant is a special number, either integer or real, calculated from a square matrix. Its most important role is precisely to determine whether a matrix can have an inverse one, in which case the determinant is non-zero. If not, it will have a value of 0 or, because of rounding in the calculation, a value extremely close to 0. It must be noted that, in practice, even a number like 2.3e−23 is considered as 0 and, therefore, such a determinant would suggest that it is not feasible to have an inverse matrix.
Observation 11.5 – Identity Matrix: A square matrix that has all its main diagonal elements with a value of 1 and all other elements with a value of 0, which causes no change to the corresponding values when multiplied by the original matrix.

Observation 11.6 – Determinant: A special number, integer or real, calculated from the elements of a square matrix. It determines whether a matrix has an inverse (value is non-zero) or not (value is 0).

For a 2 × 2 matrix, the determinant is calculated by subtracting the product of the elements of the second diagonal from the product of the elements of the main diagonal. For example, in the case of matrix $\begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}$ the determinant is calculated as 1*6 − 5*2 = 6 − 10 = −4. However, in the case of $\begin{bmatrix} 1 & 3 & 2 \\ 5 & 4 & 8 \\ 7 & 6 & 9 \end{bmatrix}$ things are more complicated. In this case the determinant is calculated as 1*((4*9) − (6*8)) − 3*((5*9) − (7*8)) + 2*((5*6) − (7*4)) = 1*(36 − 48) − 3*(45 − 56) + 2*(30 − 28) = −12 − 3*(−11) + 2*2 = −12 + 33 + 4 = 25. The pattern for 3 × 3 or larger matrices is as follows:
• Multiply the first element of the first row by the determinant of the smaller matrix that remains when the row and column of that element are removed.
• Similarly, calculate the same product for each of the other elements of the first row of the matrix.
• Calculate the final determinant as first result − second result + third result − fourth result and so forth.
The reader should note that the determinant can be calculated only for square matrices.
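The pattern described above can be sketched as a recursive function. The name laplace_determinant is hypothetical and the function is meant only to illustrate the steps; for real work, np.linalg.det() is the appropriate choice, as it is far more efficient.

```python
import numpy as np

def laplace_determinant(matrix):
    """Determinant by expansion along the first row (illustration only)."""
    matrix = np.asarray(matrix, dtype=float)
    n_rows, n_cols = matrix.shape
    if n_rows != n_cols:
        raise ValueError("The determinant is defined only for square matrices")
    if n_rows == 1:
        return matrix[0, 0]
    if n_rows == 2:
        # Main diagonal product minus second diagonal product
        return matrix[0, 0] * matrix[1, 1] - matrix[0, 1] * matrix[1, 0]
    total = 0.0
    for col in range(n_cols):
        # Remove the first row and the current column
        smaller = np.delete(np.delete(matrix, 0, axis=0), col, axis=1)
        # Signs alternate: +, -, +, - and so forth
        sign = 1 if col % 2 == 0 else -1
        total += sign * matrix[0, col] * laplace_determinant(smaller)
    return total

det2 = laplace_determinant([[1, 2], [5, 6]])
det3 = laplace_determinant([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
print(det2, det3)
```

For the two matrices used in this section, the function reproduces the values −4 and 25 calculated by hand.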
The following script briefly demonstrates the above concepts:
import numpy as np

# Create a 2-dimensional array (2x2) using the array function (Numpy)
npArray = np.array([[1, 2], [5, 6]])

# Show the entire array and the 2nd element of the 1st dimension
# in 2 different ways
print("\nThe nparray's array's contents:\n", npArray)
print("The 2nd element of the 1st dimension of the array:",
      npArray[0][1])
print("The same result from a different syntax:", npArray[0, 1])
# npArray[:, 0] selects index 0 along the 2nd dimension (the first column)
print("\nThe elements of the 2nd dimension:", npArray[:, 0])
print("\nShow the result of the e^x for each element of the input "
      "array:\n", np.exp(npArray))

# Create a 2-dimensional array (2x3) using the array function (Numpy)
newMatrix = np.array([[3, 4, 5], [1, 2, 3]])
print("\nThe 2x3 matrix newMatrix is:\n", newMatrix)

# Multiply the arrays npArray and newMatrix applying the .dot method
print("\nThe product of npArray and newMatrix using the .dot method "
      "is:\n", npArray.dot(newMatrix))

# Create a 2-dimensional array (3x3) using the array function (Numpy)
newMatrix2 = np.array([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
print("\nThe 3x3 matrix newMatrix2 is:\n", newMatrix2)

# Determinant values for npArray & newMatrix2. The matrices are squares
print("\nThe determinant for the npArray is: ",
      np.linalg.det(npArray))
print("The determinant for the newMatrix is: ",
      np.linalg.det(newMatrix2))

# Calculate and display the inverse matrix for npArray and newMatrix2
inverseNpArray = np.linalg.inv(npArray)
print("\nThe inverse matrix for the npArray is:\n", inverseNpArray)
inverseNewMatrix2 = np.linalg.inv(newMatrix2)
print("\nThe inverse matrix for the newMatrix2 is:\n",
      inverseNewMatrix2)

# Multiplying original npArray & newMatrix2 matrices with their
# inverse produces the identity matrix
print("\nThe product of the npArray and its inverse matrix is:\n",
      inverseNpArray.dot(npArray))
print("\nThe product of the newMatrix2 and its inverse matrix is:\n",
      inverseNewMatrix2.dot(newMatrix2))
Output 11.2.2:
The nparray's array's contents:
[[1 2]
[5 6]]
The 2nd element of the 1st dimension of the array: 2
The same result from a different syntax: 2
The elements of the 2nd dimension: [1 5]
Show the result of the e^x for each element of the input array:
[[ 2.71828183 7.3890561 ]
[148.4131591 403.42879349]]
The 2x3 matrix newMatrix is:
[[3 4 5]
[1 2 3]]
The product of npArray and newMatrix using the .dot method is:
[[ 5 8 11]
[21 32 43]]
The 3x3 matrix newMatrix2 is:
[[1 3 2]
[5 4 8]
[7 6 9]]
The determinant for the npArray is: -3.999999999999999
The determinant for the newMatrix is: 25.000000000000007
The inverse matrix for the npArray is:
[[-1.5 0.5 ]
[ 1.25 -0.25]]
The inverse matrix for the newMatrix2 is:
[[-0.48 -0.6 0.64]
[ 0.44 -0.2 0.08]
[ 0.08 0.6 -0.44]]
The product of the npArray and its inverse matrix is:
[[ 1.00000000e+00 -2.22044605e-16]
[-5.55111512e-17 1.00000000e+00]]
The product of the newMatrix2 and its inverse matrix is:
[[ 1.00000000e+00 6.66133815e-16 9.99200722e-16]
[-2.08166817e-16 1.00000000e+00 -1.24900090e-16]
[ 7.21644966e-16 1.11022302e-16 1.00000000e+00]]
The results showcase the output of the calculations. Note that the determinants are calculated with floating-point arithmetic, which leads to the respective values (−3.999999999999999 and 25.000000000000007) not being exactly the whole numbers −4 and 25. In addition, the product of newMatrix2 and its inverse matrix is the identity matrix of 3 × 3, although some of its elements outside the main diagonal appear as non-zero values that are extremely close to 0 (e.g., 6.66133815e-16).
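Because of this rounding, comparisons of such results are usually made with a tolerance rather than with exact equality. The following is a minimal sketch using the Numpy functions allclose() and isclose(); the variable names is_identity and det_is_25 are illustrative.

```python
import numpy as np

newMatrix2 = np.array([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
inverseNewMatrix2 = np.linalg.inv(newMatrix2)
product = inverseNewMatrix2.dot(newMatrix2)

# Compare with the 3x3 identity matrix within a small tolerance
is_identity = np.allclose(product, np.eye(3))

# Compare the calculated determinant with the exact value
det_is_25 = np.isclose(np.linalg.det(newMatrix2), 25)

print(is_identity, det_is_25)
# Rounding the product makes the identity pattern easier to read
print(np.round(product, 10))
```

Both checks evaluate to True, confirming that the small deviations in the output are rounding effects and not errors in the calculation.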
11.2.3 Eigenvalues, Eigenvectors and Diagonals
Another concept related to matrix operations is that of eigenvalues and eigenvectors. An eigenvector of a matrix is a non-zero vector that does not change direction when it is multiplied by that matrix; it is only scaled, and the scaling factor is the corresponding eigenvalue. As an example, consider a square matrix A. Its eigenvector and eigenvalue will be the ones that make the following equation true: AV = λV, where A is the original matrix, V is the eigenvector and λ is the eigenvalue. It is beyond the scope of this chapter to cover algebraic mathematics in any sort of detail. The reader can find such information in the multitude of related books and resources. For the purposes of this chapter, it should suffice to mention that the concept of eigenvalues and eigenvectors is useful in several transformation processes, including but not limited to computer graphics, physics applications, and predictive modelling.

Observation 11.7 – Eigenvalue, Eigenvector: Mathematical concepts that identify the vectors that do not change direction when multiplied by a particular matrix, and the factor by which they are scaled (AV = λV).

Another notion that must be mentioned is that of a diagonal. It is often useful to find the diagonals above or below the main diagonal of a matrix. In the case of the former, a positive integer is passed as the second argument of the diag() method (Numpy library), whereas in the case of the latter a negative one.

The following script is a demonstration of how the concepts of eigenvalue, eigenvector, and diagonals are calculated and/or identified:
import numpy as np

# Create a 2x2 array using the array function (Numpy) and
# display its contents
npArray = np.array([[1, 2], [5, 6]])
print("\nThe nparray's array's contents:\n", npArray)

# Create a 3x3 array using the array function (Numpy) and
# display its contents
newMatrix = np.array([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
print("\nThe 3x3 matrix newMatrix2 is:\n", newMatrix)

# Display the diagonal for both arrays
print("The diagonal of the npArray is: ", np.diag(npArray))
print("The diagonal of the npArray above the main diagonal is: ",
      np.diag(npArray, 1))
print("The diagonal of the npArray below the main diagonal is: ",
      np.diag(npArray, -1))
print("The diagonal of the newMatrix is: ", np.diag(newMatrix))
print("The diagonal of the newMatrix above the main diagonal is: ",
      np.diag(newMatrix, 1))
print("The diagonal of the newMatrix below the main diagonal is: ",
      np.diag(newMatrix, -1))

# Calculate and display the Eigenvalue and Eigenvector for both arrays
eigenValueNpArray, eigenVectorNpArray = np.linalg.eig(npArray)
print("\nThe eigenvalues of the npArray are: \n", eigenValueNpArray)
print("\nThe eigenvectors of the npArray are: \n", eigenVectorNpArray)
eigenValueNewMatrix, eigenVectorNewMatrix = np.linalg.eig(newMatrix)
print("\nThe eigenvalues of the newMatrix are: \n",
      eigenValueNewMatrix)
print("\nThe eigenvectors of the newMatrix are: \n",
      eigenVectorNewMatrix)
Output 11.2.3:
The nparray's array's contents:
[[1 2]
[5 6]]
The 3x3 matrix newMatrix2 is:
[[1 3 2]
[5 4 8]
[7 6 9]]
The diagonal of the npArray is: [1 6]
The diagonal of the npArray above the main diagonal is: [2]
The diagonal of the npArray below the main diagonal is: [5]
The diagonal of the newMatrix is: [1 4 9]
The diagonal of the newMatrix above the main diagonal is: [3 8]
The diagonal of the newMatrix below the main diagonal is: [5 6]
The eigenvalues of the npArray are:
[-0.53112887 7.53112887]
The eigenvectors of the npArray are:
[[-0.79402877 -0.2928046 ]
[ 0.60788018 -0.9561723 ]]
The eigenvalues of the newMatrix are:
[15.86430285+0.j -0.93215143+0.84080839j -0.93215143-0.84080839j]
The eigenvectors of the newMatrix are:
[[ 0.22516436+0.j 0.76184671+0.j 0.76184671-0.j ]
[ 0.60816639+0.j -0.24748842+0.39196634j -0.24748842-0.39196634j]
[ 0.76120605+0.j -0.36476897-0.26766596j -0.36476897+0.26766596j]]
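The relationship AV = λV can be checked directly on the values returned by np.linalg.eig(). The sketch below does this for the 2 × 2 array; it relies on the fact that eig() returns the eigenvectors as the columns of its second result. The names checks, left, and right are illustrative.

```python
import numpy as np

npArray = np.array([[1, 2], [5, 6]])
eigenvalues, eigenvectors = np.linalg.eig(npArray)

checks = []
for k in range(len(eigenvalues)):
    v = eigenvectors[:, k]            # the k-th eigenvector is the k-th column
    left = npArray @ v                # A V
    right = eigenvalues[k] * v        # lambda V
    checks.append(bool(np.allclose(left, right)))

print(checks)
# The eigenvalues add up to the sum of the main diagonal (1 + 6)
print(eigenvalues.sum())
```

Reading the eigenvectors by row instead of by column is a common mistake, and this kind of check exposes it quickly.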
11.2.4 Solving Sets of Equations with Python
Python provides a convenient way to solve sets of equations by treating them as matrices. The idea
behind this is to take a set of equations, produce the relevant matrices (i.e., one with the variable
coefficients and one with the resulting values for each equation), and call the solve() method
(Numpy library). Consider the following example of a set of three equations:
$$\begin{aligned} 5x - 3y + 2z &= 10 \\ -4x - 3y - 9z &= 3 \\ 2x + 4y + 3z &= 6 \end{aligned}$$
Firstly, the following matrix of the variable coefficients is produced:

$$\begin{bmatrix} 5 & -3 & 2 \\ -4 & -3 & -9 \\ 2 & 4 & 3 \end{bmatrix}$$

Observation 11.8 – The solve() Method: A method that solves a set of equations using relevant, appropriately processed matrices.

This is followed by the matrix of the resulting values, i.e., the right-hand side of each equation:

$$\begin{bmatrix} 10 & 3 & 6 \end{bmatrix}$$
Finally, the solve() method is called, producing the respective solutions for x, y, and z:
import numpy as np

# Assume the following set of equations:
# 5x - 3y + 2z = 10
# -4x - 3y - 9z = 3
# 2x + 4y + 3z = 6
# Use solve() to solve the equations

# Create a 3x3 matrix of the coefficients and an array of the
# resulting values based on the equations
equations = np.array([[5, -3, 2], [-4, -3, -9], [2, 4, 3]])
results = np.array([10, 3, 6])
print("\nThe solution for x, y, and z is:\n",
      np.linalg.solve(equations, results))
Output 11.2.4:
The solution for x, y, and z is:
[ 3.90225564 1.46616541 -2.55639098]
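A quick way to gain confidence in such a result is to substitute the solution back into the equations. The following sketch does so and also compares solve() with the alternative of multiplying the inverse of the coefficient matrix by the resulting values. The names solution, reconstructed, and via_inverse are illustrative.

```python
import numpy as np

equations = np.array([[5, -3, 2], [-4, -3, -9], [2, 4, 3]])
results = np.array([10, 3, 6])
solution = np.linalg.solve(equations, results)

# Substituting x, y, z back should reproduce 10, 3, 6
reconstructed = equations @ solution

# The same solution through the inverse matrix
via_inverse = np.linalg.inv(equations) @ results

print(reconstructed)
print(np.allclose(solution, via_inverse))
```

Both approaches agree here, but solve() is generally preferred, as it avoids calculating the inverse explicitly.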
11.2.5 Generating Random Numbers for Matrices with Python
Sometimes it is useful to generate matrices with random numbers in order to evaluate models prior to using actual data. Through the Numpy library, Python provides several methods that offer such functionality.

Observation 11.9 – rand(), randn(), mean(), var(), std(): Some of the methods of the Numpy library. The first two belong to its Random package and generate random numbers, whereas the rest provide basic descriptive statistical calculations on matrices.

The following script can be divided into three distinct parts. In the first part, a 3 × 4 matrix is generated and filled with 0s. Next, another two matrices are generated and filled with 1s and 20s, respectively. Finally, a 4 × 4 identity matrix is generated. In the second part, the script uses the random(), randn(), and randint() methods to generate numbers for the matrices, either through the regular random numbers generator or from the Normal (Gaussian) Distribution that has a mean of 0. In the third part, the script demonstrates the use of basic statistics methods from Numpy, including mean(), var(), and std() to calculate the mean, the statistical variance, and the standard deviation of the data, respectively:
import numpy as np

# Generate 3x4 matrices of zeroes, ones, 20s, and a 4x4 identity matrix
print("Generate a 3x4 matrix of zeroes\n", np.zeros((3, 4)))
print("\nGenerate a 3x4 matrix of ones\n", np.ones((3, 4)))
print("\nGenerate a 3x4 matrix of 20s\n", 20 * np.ones((3, 4)))
print("\nGenerate an Identity matrix 4x4\n", np.eye(4))

# Generate a random number, a 3x4 matrix of random numbers,
# a 3x4 matrix of random numbers from the Normal (Gaussian)
# Distribution (i.e., mean = 0), and a 4x4 matrix of random
# integers from 5 up to (but not including) 15
print("\nGenerate a random number\n", np.random.random())
print("\nGenerate an array 3x4 with random numbers\n",
      np.random.random((3, 4)))
print("\nGenerate an array 3x4 with random numbers from the Normal "
      "Distribution\n", np.random.randn(3, 4))
print("\nGenerate an array 4x4 with random numbers between 5 and 15\n",
      np.random.randint(5, 15, size = (4, 4)))

# Generate an array of 10 items with random numbers from the
# Normal (Gaussian) Distribution and use it as a source for performing
# basic statistics
npArray = np.random.randn(10)
# Print the stored array, so that the statistics below refer to it
print("\nGenerate an array of 10 random numbers from the Normal "
      "Distribution\n", npArray)

# Print the mean of the new array
print("\nThe mean of the new array is: ", npArray.mean())

# Print the variance of the new array
print("The variance of the new array is: ", npArray.var())

# Print the standard deviation (i.e., the square root of the variance)
print("The stdDev of the new array is: ", npArray.std())
Output 11.2.5:
Generate a 3x4 matrix of zeroes
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Generate a 3x4 matrix of ones
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
Generate a 3x4 matrix of 20s
[[20. 20. 20. 20.]
[20. 20. 20. 20.]
[20. 20. 20. 20.]]
Generate an Identity matrix 4x4
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
Generate a random number
0.8435542056822151
Generate an array 3x4 with random numbers
[[0.35570211 0.27618855 0.0541145 0.58001638]
[0.20641101 0.48294052 0.92104823 0.61556587]
[0.19491554 0.5713989 0.63918665 0.81824177]]
Generate an array 3x4 with random numbers from the Normal Distribution
[[-0.24286997 -1.00451518 0.06104505 -1.85966171]
[-0.47202171 0.01079039 0.03526387 0.44499205]
[ 2.2395344 0.42076315 0.6505322 -0.6350833 ]]
Generate an array 4x4 with random numbers between 5 and 15
[[ 7 13 7 12]
[ 7 9 5 5]
[12 12 10 8]
[12 7 9 13]]
Generate an array of 10 random numbers from the Normal Distribution
[ 0.80516765 -0.34184534 -1.01860459 1.55026532 1.52091946 0.68490906
-0.07417641 1.35254549 0.21432432 0.29326124]
The mean of the new array is: 0.009347564051776013
The variance of the new array is: 0.6866073562925792
The stdDev of the new array is: 0.8286177383405325
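Since the numbers are random, the output above differs on every run. When repeatable results are needed, a seeded generator can be used. The sketch below uses Numpy's default_rng() generator, which is an alternative to the np.random functions of the script above and not part of it; the seed value 42 and the variable names are arbitrary assumptions.

```python
import numpy as np

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(seed=42)

uniform_matrix = rng.random((3, 4))                  # uniform in [0, 1)
normal_matrix = rng.standard_normal((3, 4))          # Normal, mean 0, std 1
integer_matrix = rng.integers(5, 15, size=(4, 4))    # 5 to 14, 15 is excluded

# With a larger sample the statistics approach the theoretical values
sample = rng.standard_normal(10_000)
print("Mean:", sample.mean())
print("Variance:", sample.var())
print("StdDev:", sample.std())
```

With only 10 values, as in the script above, the mean and the variance can be far from 0 and 1; a larger sample brings them much closer.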
11.2.6 Plotting with Matplotlib
As already discussed in Chapter 8 on Data Analytics and Data Visualization and Chapter 9 on Statistics, Python offers libraries that effectively and efficiently produce all types of charts that might be required for the analysis of the data at hand. These include Matplotlib and Scipy, which are widely used for Deep Learning as well. The following two scripts are a quick refresher of how to use these libraries to visualize/plot the results of the mathematical methods of the previous sections:
# Import the Numpy and Matplotlib libraries
import numpy as np
import matplotlib.pyplot as plt

# Plot inline alongside the rest of the results
# This is particularly relevant in Jupyter Anaconda
%matplotlib inline

# Plot a line as the sin of the values between 0 and 40
# with 4 different types of intervals
for i in range(1, 5):
    A = np.linspace(0, 40, 20*i)
    B = np.sin(A) + 0.2 * A
    plt.plot(A, B)
    plt.xlabel("Input"); plt.ylabel("Output")
    titleShow = "Basics of Charts. Number of samples: " + str(20*i)
    plt.title(titleShow); plt.show()
Output 11.2.6.a–11.2.6.d:
# Import the Numpy, Scipy and Matplotlib libraries
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Plot inline alongside the rest of the results
# This is particularly relevant in Jupyter Anaconda
%matplotlib inline

# Create 2000 evenly spaced data points between -10 and 10
x = np.linspace(-10, 10, 2000)

# loc is the mean and scale is the standard deviation
# Calculate the probability density function (Norm module/Scipy)
fx = norm.pdf(x, loc = 0, scale = 1)
# Plot the chart
plt.plot(x, fx); plt.show()

# Calculate the cumulative distribution function (Norm module/Scipy)
fx2 = norm.cdf(x, loc = 0, scale = 1)
# Plot the chart
plt.plot(x, fx2); plt.show()

# Calculate the log of the probability density function (Norm
# module/Scipy)
fx3 = norm.logpdf(x, loc = 0, scale = 1)
plt.plot(x, fx3); plt.show()

# Calculate the log of the cumulative distribution function
# (Norm module/Scipy)
fx4 = norm.logcdf(x, loc = 0, scale = 1)
plt.plot(x, fx4); plt.show()
Output 11.2.6.e–11.2.6.h:
11.2.7 Linear and Logistic Regression
Regression can involve either categorical or continuous variables. The input could be continuous,
categorical, or discrete. If y shows the outcome and x shows the input, the model can be written as
follows:
y = F(x), where F is the DL model that suggests the relationship between input and output.
In the case of Linear Regression this model reveals a linear relationship between input and output, with Regression coefficients (γ) for the various inputs (x) and the possibility of an error (φ) in the model calculations. Eventually, in the case of Linear Regression, the model can be written as follows:

$$y = F(x) = \gamma_0 + \gamma_1 x_1 + \cdots + \gamma_n x_n + \varphi$$
In the case of Logistic Regression (LR), the backbone of a DL Neural Network, the DL algorithm is
used to classify the possible outputs as accurately as possible. The categories are encoded as either
0 or 1 and a sigmoid method is used to output a number between 0 and 1. The output is interpreted
as a probability that the data is to be categorized as 1.
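Both models can be sketched with Numpy alone. The data below are hypothetical and noise-free (generated from y = 3 + 2x), the coefficients are estimated with the least-squares function np.linalg.lstsq(), and the Logistic Regression weights (logit_bias, logit_weight) are assumed values, not trained ones.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 3 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Linear Regression: a column of ones lets the model estimate gamma_0
design = np.column_stack([np.ones_like(x), x])
coefficients, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
gamma0, gamma1 = coefficients
print("gamma_0:", gamma0, "gamma_1:", gamma1)

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Logistic Regression: a linear combination passed through the sigmoid
logit_bias, logit_weight = -1.0, 2.0      # assumed, not trained
probabilities = sigmoid(logit_bias + logit_weight * np.array([0.5, 3.0]))
print("Probability of category 1:", probabilities)
```

The first probability is 0.5, because the linear part is exactly 0 for an input of 0.5; a threshold (commonly 0.5) then turns the probability into category 0 or 1.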
11.3 INTRODUCTION TO NEURAL NETWORKS
“Neural networks reflect the behavior of the human brain, allowing computer programs to recognize
patterns and solve common problems in the fields of AI, machine learning, and deep learning.”
(IBM Cloud Education, 2020)
The artificial neural networks (ANN) technique was inspired by the basic functioning of the human brain.
The main idea behind it is to interpret data through a series of multiple ML-based perceptrons (covered in detail in the next section), and label or cluster the input as required. Real world data such as
images, sounds, time series, or other complex data are translated into numbers using vectors. ANN
is quite helpful in classifying and clustering raw data even if they are unidentified and unlabelled.
This is because it groups data based on similarities it observes or learns in its deeper layers, thus,
transforming them into labelled training data, in a similar way the human brain does.
A deep neural network consists of one or more perceptrons in two or more layers (input and
output). The perceptrons of each different layer are fed by the previous layer, using the same input
but with different weights. The target of DL in ANN is to find correlations and map inputs to outputs. At a basic level, it extracts unknown features from the input data that can be fed to other algorithms, while also creating components of larger ML applications that may include classification,
regression and reinforcement learning. It approximates the unknown method (f(x) = y) for any input
x and output y. During learning, ANN finds the right method by evolving into a tuned transformation of x into y. In simple terms, this could represent methods like f(x) = 7x + 18 or f(x) = 8x−0.8.
ANN performs particularly well in clustering. It falls into the category of unsupervised learning, as it does not require labels to perform its tasks. It consists of the input layer, the hidden layer(s), the output layer, the adjustable weights for model training and learning for all layers, and the activation method.

Observation 11.10 – Neuron: The basic building block of a neural network, also called the linear unit. It learns by modifying the values of the weights of the inputs and adding up the sum of inputs × weights and the possible bias of the model.

The neuron is the basic building block of a neural network. It is also known as the linear unit of the neural network system (Figure 11.4).

In Figure 11.4, X is the input to the neuron and w is the weight. In its most basic form, the key for a neuron to be able to learn is the modification of value w. Y is the output and b the bias of the model. The bias is independent of the input and its value is provided with the model. The neuron sums up all the weighted input values and the bias to come up with the equation that describes its model, like a slope equation in linear algebra: Y = wX + b.
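The equation Y = wX + b translates almost directly into code. The function name neuron and the sample values below are assumptions used for illustration; the first call reuses the example method f(x) = 7x + 18 mentioned earlier in this section.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias: Y = wX + b."""
    return np.dot(weights, inputs) + bias

# One input: Y = 7 * 2 + 18
single_output = neuron(np.array([2.0]), np.array([7.0]), 18.0)

# Three inputs, each with its own weight
multi_output = neuron(np.array([1.0, 2.0, 3.0]),
                      np.array([0.5, -1.0, 2.0]), 0.1)

print(single_output, multi_output)
```

Learning, in this picture, amounts to finding values of the weights (and possibly the bias) that make the outputs match the training data.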
FIGURE 11.4
A typical neuron.
11.3.1 Modelling a Simple ANN with a Perceptron
Figure 11.5 illustrates the method of a single neuron in a single layer (i.e., a perceptron). Its fundamental functionality is to mimic the behavior of the human brain’s neuron. The idea is to take the inputs of the model (x1, x2,…, xn) and multiply each by its respective weight (w1, w2,…, wn), in order to produce the relevant k values (k1, k2,…, kn). Often, a constant bias value multiplied by its associated weight is also added to this sum. Next, the sum of the k values is calculated and passed to the selected activation method (in this case, the sigmoid). Finally, the result is frequently normalized using some type of method such as the unit step. A perceptron is also called a single-layer neural network because its output is decided based on the outcome of a single activation method associated with a single neuron.

Observation 11.11 – Perceptron: A single-layer neural network, as its output is decided by a single activation method associated with a single neuron.

Class FirstNeuralNetwork presented below implements a basic perceptron (i.e., single-layer ANN). The implementation includes the following steps:
1. Generate and initialize a new object (named ANN) based on the FirstNeuralNetwork class, to initiate the perceptron model (lines 46 and 5–10). Instead of reading the weights from a data file, these are randomly generated as an array of 3 × 1 values, ranging from −1 to 1. The calculation uses the following formula: (max − min) * randomset(lines × columns) + min. Hence, in this case, the formula will be (1 − (−1)) * np.random.random((3, 1)) + (−1) = 2 * np.random.random((3, 1)) − 1. The reader should keep in mind that by using the seed() method with a particular parameter, in this case 1, the random sequence of numbers will always be the same. If it is preferred to have a different sequence of numbers every time the script runs, the seeding line should not be included.
2. Instead of reading the training inputs and outputs from a dataset, these are given as arrays of values (lines 49–52). Since the dot method will be used on the inputs and weights to calculate their sum, it is necessary that the number of columns of the former matches the number of rows of the latter (in this case 3).
FIGURE 11.5
Perceptron.
3. Call the Training() method to train the model (line 56). For optimum training results,
it is necessary to define the number of required iterations. The number is rather subjective;
however, empirical experience suggests that a number of iterations between 10,000 and
15,000 is sufficient.
4. Use the dot() method to calculate the weighted sum of the inputs and their weights (lines
38–42).
5. Use the Sigmoid() method (lines 12–15) to calculate the output based on the result of the
dot() method in step 4 (lines 41–42).
6. An optional step would be to calculate the training process error as the difference between the training output (originally provided) and the calculated output. There are various ways to calculate this error, depending on the required level of accuracy. In this case, the error is calculated based on the last iteration of the training process (lines 28–36).
7. Another optional step would be to adjust the weights vector, based on the error calculated in the previous step (line 34).
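The scaling formula of step 1 and the effect of seeding can be tried out in isolation. In the sketch below, random_weights is a hypothetical helper written for this illustration; it is not part of the FirstNeuralNetwork class.

```python
import numpy as np

def random_weights(rows, cols, low=-1.0, high=1.0, seed=None):
    """Scale uniform samples from [0, 1) into [low, high)."""
    if seed is not None:
        np.random.seed(seed)
    # (max - min) * randomset(lines x columns) + min
    return (high - low) * np.random.random((rows, cols)) + low

first = random_weights(3, 1, seed=1)
second = random_weights(3, 1, seed=1)    # same seed, same sequence
unseeded = random_weights(3, 1)          # continues the random sequence

print(first)
print(np.array_equal(first, second))
```

With a seed of 1 the helper reproduces the randomly generated weights shown in Output 11.3.1, which is why that output is identical on every run of the script.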
1   import numpy as np
2
3   class FirstNeuralNetwork():
4
5       def __init__(self):
6           # Seed the random number generator for reproducible weights
7           np.random.seed(1)
8           # Generate the weights as a 3x1 matrix with values from -1 to 1
9           # (random values from 0 to 1, multiplied by 2, minus 1)
10          self.weights = 2 * np.random.random((3, 1)) - 1
11
12      def Sigmoid(self, x):
13          # Use the sigmoid method to calculate the output
14          sigmoid = 1 / (1 + np.exp(-x))
15          return sigmoid
16
17      def SigmoidDerivative(self, x):
18          derivative = x * (1 - x)
19          return derivative
20
21      def Training(self, trainingInputs, trainingOutputs,
22                   trainingIterations):
23          # Train the model for continuous adjustment of the weights
24          for iteration in range(trainingIterations):
25              # Train the data through the neuron
26              output = self.NeuronThinking(trainingInputs)
27
28              # Compute the error rate for back-propagation
29              theError = trainingOutputs - output
30
31              # Perform weight adjustments during the training phase
32              theAdjustments = np.dot(trainingInputs.T,
33                  theError * self.SigmoidDerivative(output))
34              self.weights += theAdjustments
35          print("\nThe calculated error vector of the training process "
36                "is: \n", theError)
37
38      def NeuronThinking(self, inputs):
39          # Pass the inputs through the neuron
40          inputs = inputs.astype(float)
41          output = self.Sigmoid(np.dot(inputs, self.weights))
42          return output
43
44  if __name__ == "__main__":
45      # Create an object based on the FirstNeuralNetwork neuron class
46      ANN = FirstNeuralNetwork()
47      print("Randomly Generated Weights:\n", ANN.weights)
48
49      # Train the data with 4 input values and 1 output
50      trainingInputs = np.array([[0,0,1], [1,1,1], [1,0,1], [0,1,1]])
51      print("\nThe training inputs:\n", trainingInputs)
52      trainingOutputs = np.array([[0],[1],[1],[0]])
53      print("\nThe training output:\n", trainingOutputs)
54
55      # Call the Training method to train the model
56      ANN.Training(trainingInputs, trainingOutputs, 15000)
57      print("\nThe adjusted weights vector is:\n", ANN.weights)
58
59      firstInput = str(input("\nProvide first input: "))
60      secondInput = str(input("Provide second input: "))
61      thirdInput = str(input("Provide third input: "))
62      print("The three inputs are: ", firstInput, secondInput,
63            thirdInput)
64      print("The new data is projected to be: ")
65      print(ANN.NeuronThinking(np.array([firstInput, secondInput,
66            thirdInput])))
Output 11.3.1: Test it with 1, 0, 0 and 0, 1, 0
Output common to both tests
Randomly Generated Weights:
[[-0.16595599]
[ 0.44064899]
[-0.99977125]]
The training inputs:
[[0 0 1]
[1 1 1]
[1 0 1]
[0 1 1]]
The training output:
[[0]
[1]
[1]
[0]]
The calculated error vector of the training process is:
[[-0.00786416]
[ 0.00641397]
[ 0.00522118]
[-0.00640343]]
The adjusted weights vector is:
[[10.08740896]
[-0.20695366]
[-4.83757835]]
Output test 1
Provide first input: 1
Provide second input: 0
Provide third input: 0
The three inputs are: 1 0 0
The new data is projected to be:
[0.9999584]
Output test 2
Provide first input: 0
Provide second input: 1
Provide third input: 0
The three inputs are: 0 1 0
The new data is projected to be:
[0.44844546]
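The two projections can be reproduced without retraining or keyboard input, by applying the sigmoid to the dot product of the test inputs and the adjusted weights. In the sketch below the weights are copied from the output above; the names trained_weights, test_inputs, and predictions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Adjusted weights vector, copied from Output 11.3.1
trained_weights = np.array([[10.08740896], [-0.20695366], [-4.83757835]])

# The two test cases: 1, 0, 0 and 0, 1, 0
test_inputs = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)

# One row of inputs per prediction
predictions = sigmoid(test_inputs @ trained_weights)
print(predictions)
```

The first weight dominates the result: an input with a first value of 1 is projected very close to 1, which is consistent with the training data, where the output always equals the first input.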
11.3.2 Sigmoid and Rectified Linear Unit (ReLU) Methods
Both sigmoid and rectified linear unit (ReLU) are activation methods used in DL. The sigmoid method is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

One of the drawbacks of the sigmoid method is that it slows down the DL process in case of big data inputs, as it takes time to make the necessary calculations. This is especially true when the input is a large number. For this reason, it is mostly used when its output is expected to fall in the range between 0 and 1, much like a probability output.

Observation 11.12 – The Sigmoid Method: It takes input values in a range and calculates the relevant output values given a specific formula. The output is always probabilistic ranging from 0 to 1. The method is slow with big data, and particularly with large numbers.

In most cases, the ReLU method is used instead. The concept of this method is simple: if the input value is higher than or equal to 0, it is returned as output unchanged; if it is lower, the method returns 0 as output. The method is particularly useful as it is rather fast, regardless of the input. The obvious problem with ReLU is that it ignores the negative input values, thus, not mapping them into the output.

Observation 11.13 – The Rectified Linear Unit (ReLU) Method: It takes input values in a range. For each input higher than or equal to 0 it results in the same value as the input. For each input value lower than 0, it results in 0. An important restriction with this method is that it ignores negative values.

The following script creates a sequence of input floats ranging from −10 to 10. Next, it calculates the outputs for each of the inputs using the sigmoid method and the outputs using ReLU. Finally, it plots the results of the inputs and outputs for both cases:
# Import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# linspace(start, end) creates a sequence of evenly spaced floats
# (50 values by default)
x = np.linspace(-10, 10)
print("The generated array of floats is: \n", x)

# Use the sigmoid function to calculate the output
sigmoid = 1/(1 + np.exp(-x))
print("\nThe calculated array of sigmoids is: \n", sigmoid)

# Create the Numpy array for the ReLU results & initialize with zeros
relu = np.zeros(len(x))

# Use the ReLU function to calculate the ReLU output based on the input
for i in range(len(x)):
    if x[i] >= 0:
        relu[i] = x[i]
    else:
        relu[i] = 0.0
print("\nThe resulting array of ReLU is: \n", relu)

# Plot the sigmoid outputs against the inputs
plt.plot(x, sigmoid)
plt.xlabel("x")
plt.ylabel("Sigmoid(X)")
plt.title("The sigmoid function for inputs -10 to 10")
plt.show()

# Plot the ReLU outputs against the inputs
plt.plot(x, relu)
plt.xlabel("x")
plt.ylabel("ReLU(X)")
plt.title("The ReLU function for inputs -10 to 10")
plt.show()
Output 11.3.2:
The generated array of floats is:
 [-10.          -9.59183673  -9.18367347  -8.7755102   -8.36734694
  -7.95918367  -7.55102041  -7.14285714  -6.73469388  -6.32653061
  -5.91836735  -5.51020408  -5.10204082  -4.69387755  -4.28571429
  -3.87755102  -3.46938776  -3.06122449  -2.65306122  -2.24489796
  -1.83673469  -1.42857143  -1.02040816  -0.6122449   -0.20408163
   0.20408163   0.6122449    1.02040816   1.42857143   1.83673469
   2.24489796   2.65306122   3.06122449   3.46938776   3.87755102
   4.28571429   4.69387755   5.10204082   5.51020408   5.91836735
   6.32653061   6.73469388   7.14285714   7.55102041   7.95918367
   8.36734694   8.7755102    9.18367347   9.59183673  10.        ]
The calculated array of sigmoids is:
 [4.53978687e-05 6.82792246e-05 1.02692018e-04 1.54446212e-04
  2.32277160e-04 3.49316192e-04 5.25297471e-04 7.89865942e-04
  1.18752721e-03 1.78503502e-03 2.68237328e-03 4.02898336e-03
  6.04752187e-03 9.06814944e-03 1.35769169e-02 2.02816018e-02
  3.01959054e-02 4.47353464e-02 6.58005831e-02 9.57904660e-02
  1.37437932e-01 1.93321370e-01 2.64947903e-01 3.51547277e-01
  4.49155938e-01 5.50844062e-01 6.48452723e-01 7.35052097e-01
  8.06678630e-01 8.62562068e-01 9.04209534e-01 9.34199417e-01
  9.55264654e-01 9.69804095e-01 9.79718398e-01 9.86423083e-01
  9.90931851e-01 9.93952478e-01 9.95971017e-01 9.97317627e-01
  9.98214965e-01 9.98812473e-01 9.99210134e-01 9.99474703e-01
  9.99650684e-01 9.99767723e-01 9.99845554e-01 9.99897308e-01
  9.99931721e-01 9.99954602e-01]
The resulting array of ReLU is:
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.20408163  0.6122449   1.02040816  1.42857143  1.83673469
   2.24489796  2.65306122  3.06122449  3.46938776  3.87755102  4.28571429
   4.69387755  5.10204082  5.51020408  5.91836735  6.32653061  6.73469388
   7.14285714  7.55102041  7.95918367  8.36734694  8.7755102   9.18367347
   9.59183673 10.        ]
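As a compact alternative to the element-by-element loop in the script of this section, both activation methods can be written as small functions that operate on a whole NumPy array at once. This is only a sketch: the function names sigmoid and relu are chosen here for illustration, and np.maximum is used in place of the explicit if/else test.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid applied element-wise to a NumPy array (or a scalar)."""
    return 1 / (1 + np.exp(-x))

def relu(x):
    """ReLU applied element-wise: negative inputs become 0."""
    return np.maximum(x, 0)

x = np.linspace(-10, 10)   # the same 50 inputs as in the script of this section
sigmoid_out = sigmoid(x)
relu_out = relu(x)

print("Sigmoid stays within (0, 1):", sigmoid_out.min(), sigmoid_out.max())
print("ReLU of [-2, 0, 3]:", relu(np.array([-2.0, 0.0, 3.0])))
```

The vectorized form avoids the Python loop altogether, which is usually the preferred style when working with NumPy arrays.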
11.3.3 A Real-Life Example: Preparing the Dataset
The basic tasks when creating a multi-layer NN are to create, compile and fit the model, to plot the associated observations and data if necessary, and to evaluate the model. Among the most important concepts in DL are the sequential model, the dense class, the activation class, and adding layers to the model. A detailed analysis of these topics is beyond the scope of this chapter and the reader is encouraged to consider related sources specializing in DL. Nevertheless, a relatively common real-life example is examined in order to showcase and introduce some of the basic associated notions. This is split into a number of distinct steps, presented in the following sections.
Observation 11.14 – The sample() Method: Use this Pandas method with the frac and random_state parameters, to define a sample from the original set to be used in the DL process.
The first step involves reading a dataset from a CSV file (diabetes.csv) and taking a random sample (i.e., 70%) of its rows to use as a training dataset (frac parameter). For the same input, the sample will also be the same, as a result of the random_state = 0 parameter. Next, the sampled rows are dropped from the original dataset by their index, so that the remaining 30% of the rows form the test dataset. Finally, the NN is optimized by scaling the dataset values to a range between 0 and 1, using the minimum and maximum values of the training dataset:
import pandas as pd
import numpy as np

# Step 1: Read the csv file
MyDataFrame = pd.read_csv('diabetes.csv')
MyDataSource = MyDataFrame.to_numpy()
# The first 8 columns are the features and the 9th column is the target
X = MyDataSource[:,0:8]
y = MyDataSource[:,8]

# Step 2: Use frac to split dataset to the train & test parts (70/30)
# Use random_state to return the same sample rows on every run
# Drop the training rows (by index) from the dataset to form the test set
# and print the first 4 rows of the training set
# Scale the dataset values to [0, 1] to optimize the NN
My_train = MyDataFrame.sample(frac = 0.7, random_state = 0)
My_test = MyDataFrame.drop(My_train.index)
print(My_train.head(4))
maxTo = My_train.max(axis = 0)
minTo = My_train.min(axis = 0)
My_train = (My_train - minTo) / (maxTo - minTo)
My_test = (My_test - minTo) / (maxTo - minTo)

# Split the features and the target
Xtrain = My_train.drop('Outcome', axis = 1)
Xtest = My_test.drop('Outcome', axis = 1)
Ytrain = My_train['Outcome']
Ytest = My_test['Outcome']
print("\nThe dataset contains", Xtrain.shape[0], "rows and",
    Xtrain.shape[1], "columns")
Output 11.3.3:
     Pregnancies  Glucose  •••  Age  Outcome
661            1      199  •••   22        1
122            2      107  •••   23        0
113            4       76  •••   25        0
14             5      166  •••   51        1
[4 rows x 9 columns]
The dataset contains 538 rows and 8 columns
The number 8 in the output is the number of features in the dataset and, therefore, the number of inputs to the NN.
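The behaviour of sample() with the frac and random_state parameters, and of the scaling step, is easier to see on a tiny dataset. The sketch below uses a hypothetical 10-row DataFrame instead of diabetes.csv; the column names FeatureA and FeatureB and all variable names are assumptions made only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV file: 10 rows, 2 features and a target
frame = pd.DataFrame({
    "FeatureA": range(10),
    "FeatureB": range(100, 200, 10),
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# 70% of the rows for training; random_state makes the choice repeatable
train = frame.sample(frac=0.7, random_state=0)
# Dropping the training rows (by index label) leaves the other 30% for testing
test = frame.drop(train.index)

# Sampling again with the same random_state selects exactly the same rows
train_again = frame.sample(frac=0.7, random_state=0)

# Min-max scaling based on the training set only
min_values = train.min(axis=0)
max_values = train.max(axis=0)
train_scaled = (train - min_values) / (max_values - min_values)
test_scaled = (test - min_values) / (max_values - min_values)

print(len(train), "training rows and", len(test), "test rows")
print("Same rows on repeated sampling:", train.equals(train_again))
```

Because the minimum and maximum are taken from the training rows only, the scaled training values fall exactly within 0 and 1, whereas scaled test values may fall slightly outside that range.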
11.3.4 Creating and Compiling the Model
The next step involves the creation of four different models as a way to examine different scenarios.
Firstly, the Keras and Layers libraries (TensorFlow package) are imported. These libraries are
necessary in order to create the DL model and define its details. Next, the four models are created.
SimpleModel consists of only the input and the output layers, with the former having just 12 neurons. MakeItWider doubles the number of neurons keeping the same basic layers. MakeItDeeper
keeps the number of neurons per layer the same as in the case of SimpleModel, but adds a third layer (a hidden layer)
between the input and the output. Finally, FinalModel defines a significant number of neurons per
layer (a rather common case) and adds two layers between the input and the output.
In all four cases, the newly created DL models are created following the sequential approach.
This simply means that each layer builds upon the input from the previous layer, thus connecting all
layers to each other. The minimum number of layers in any DL model is 2: the input and the output.
Any other layer is a hidden layer. There is no consensus as to what is the correct number of neurons
per layer, although there are some suggested mathematical formulae on how to determine this number. As a rough guide, the reader should note that a number between 500 and 1,000 neurons per layer
is commonly used. It must also be noted that the various layers in the NN do not have to consist of
the same number of neurons.
The activation parameter defines the activation method that is applied to the output of each neuron in a layer. In all four cases of this example, the ReLU method is selected for every layer except the output layer. The optional input_shape parameter defines the number of features in the NN model (i.e., in this case 8). This number corresponds to the columns of the dataset excluding the index (which is not used) and the output (i.e., the Outcome column).
Observation 11.15 – Sequential Approach: Each of the layers of the NN builds on the input from its previous layer, ensuring that all layers are connected to each other.
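The sequential idea can be traced with plain NumPy arithmetic. The sketch below is not how Keras is implemented internally; it only illustrates, under assumed random weights and a made-up batch of 4 rows, how 8 input features pass through a dense layer of 12 neurons with ReLU and then through a single output neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 4 rows with 8 features each (matching input_shape = [8])
batch = rng.random((4, 8))

# Layer 1: 12 neurons, each connected to all 8 inputs, followed by ReLU
weights_1 = rng.normal(size=(8, 12))
bias_1 = np.zeros(12)
hidden = np.maximum(batch @ weights_1 + bias_1, 0)

# Layer 2 (output): 1 neuron connected to all 12 hidden neurons, no activation
weights_2 = rng.normal(size=(12, 1))
bias_2 = np.zeros(1)
output = hidden @ weights_2 + bias_2

print("Hidden layer shape:", hidden.shape)   # (4, 12)
print("Output layer shape:", output.shape)   # (4, 1)
```

Each layer consumes what the previous layer produced, which is why the number of columns of one weight matrix must match the number of rows of the next.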
Once the models are created, they must be compiled. Compilation configures a model for training: it sets the loss method, the optimizer, and the metrics that the backend uses when the weights are trained and adjusted. The backend determines the best network representation for train/test and makes predictions on the specified hardware (i.e., either GPU or CPU).
It also supports distributed computing such as Hadoop/MapReduce. At the moment of writing,
Theano and TensorFlow are among the most commonly used backend libraries. In terms of the associated
methods/parameters used in all four cases of this example, the loss method of choice is mae, the
optimizer is adam, and the metric is accuracy. These methods/parameters are discussed in
more detail in the following section.
The additional part of the script is the following:
from tensorflow import keras
from tensorflow.keras import layers

# Step 3: Prepare the models for testing and compiling
# Prepare a simple model
SimpleModel = keras.Sequential([layers.Dense(12,
    activation = 'relu'), layers.Dense(1)])
SimpleModel.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Make the model wider by doubling the neurons of the layer
MakeItWider = keras.Sequential([layers.Dense(24, activation = 'relu'),
    layers.Dense(1)])
MakeItWider.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Make the model deeper by adding another layer
MakeItDeeper = keras.Sequential([layers.Dense(12, activation = 'relu'),
    layers.Dense(12, activation = 'relu'),
    layers.Dense(1)])
MakeItDeeper.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Prepare the final model with many neurons and two more layers
FinalModel = keras.Sequential([
    layers.Dense(600, activation = 'relu', input_shape = [8]),
    layers.Dense(600, activation = 'relu'),
    layers.Dense(600, activation = 'relu'),
    layers.Dense(1)])
FinalModel.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])
Notice that there is no output for the above script which serves as a preparation step.
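To get a feeling for what "wider" and "deeper" mean in practice, the number of trainable weights of each model can be estimated with simple arithmetic: a dense layer has one weight per connection to the previous layer plus one bias per neuron. The sketch below is plain Python and does not call Keras; the helper count_dense_parameters is hypothetical, and it assumes that all four models receive the 8 features of the dataset as input.

```python
def count_dense_parameters(n_inputs, layer_sizes):
    """Hypothetical helper: weights plus biases of a stack of dense layers."""
    total = 0
    previous = n_inputs
    for size in layer_sizes:
        # One weight per connection to the previous layer + one bias per neuron
        total += previous * size + size
        previous = size
    return total

n_features = 8   # assumption: all four models receive the 8 dataset features
models = {
    "SimpleModel": [12, 1],
    "MakeItWider": [24, 1],
    "MakeItDeeper": [12, 12, 1],
    "FinalModel": [600, 600, 600, 1],
}
parameter_counts = {name: count_dense_parameters(n_features, sizes)
                    for name, sizes in models.items()}
for name, count in parameter_counts.items():
    print(name, "has", count, "trainable parameters")
```

The jump from a few hundred parameters in the first three models to several hundred thousand in FinalModel explains why the latter takes noticeably longer to fit.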
11.3.5 Stochastic Gradient Descent and the Loss Method and Parameters
Stochastic gradient descent (SGD) is a family of algorithms aiming to optimize the weights for the best possible mapping of inputs to outputs. The selected algorithm is defined by the optimizer parameter/method, which at present is most often adam.
Observation 11.16 – Stochastic Gradient Descent (SGD): A family of algorithms aiming to optimize the weights for the best mapping of inputs to outputs.
The loss parameter/method deals with the measurement of the quality of the NN predictions. In simple terms, it measures the disparity between predicted values and desired values. Several loss method options are available, including mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE).
Observation 11.17 – Method loss Parameters: Select from a number of available mathematical methods to calculate the loss resulting from the process (e.g., mean square error, root mean square error, and mean absolute error).
MSE is amongst the most well-known methods. It calculates the average (mean) of the squared differences between the real observations and the predictions. The mathematical equation for this particular method is the following:
$$\mathrm{MSE} = \frac{1}{K} \sum_{k=1}^{K} (x_k - x_k')^2$$
RMSE is one of the most popular methods. It calculates the square root of the MSE, which expresses the error in the same units as the observations. Its mathematical equation is the following:
$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (x_k - x_k')^2}$$
Finally, MAE is calculated as the mean of the absolute errors between the real and the predicted observations as in the following formula (i.e., $x_k$ = true observations, $x_k'$ = predictions):
$$\mathrm{MAE} = \frac{1}{K} \sum_{k=1}^{K} \left| x_k - x_k' \right|$$
The following script showcases the use of all three loss measuring methods discussed above:
import numpy as np

# Define the actual and the predicted values as np arrays
actual = np.array([1.8, 2, 1.9])
print("The actual observations are: \n", actual)
predicted = np.array([2, 1.7, 1.7])
print("\nThe predicted observations are: \n", predicted)

# Array calculated on the differences between the 2 sets of values
difference = predicted - actual
print("\nThe differences in the observations are: \n", difference)

# Calculate the array based on the squares of the differences
squareOfDifferences = difference ** 2
print("\nThe squares of the differences of the observations: \n",
    squareOfDifferences)

# Calculate the mean square error for the observations
MSE = squareOfDifferences.mean()
print("\nThe Mean Square Error is calculated as: ", MSE)

# Calculate the root of the mean of the square of the differences
meanSquareDifferences = squareOfDifferences.mean()
RMSE = np.sqrt(meanSquareDifferences)
print("\nThe root mean of square of differences is: ", RMSE)

# Calculate the mean of the absolute error of the differences
absoluteDifferences = np.absolute(difference)
meanAbsoluteDifference = absoluteDifferences.mean()
print("\nThe mean of the absolute differences of the observations "
    "is: ", meanAbsoluteDifference)
Output 11.3.5:
The actual observations are:
 [1.8 2.  1.9]
The predicted observations are:
 [2.  1.7 1.7]
The differences in the observations are:
 [ 0.2 -0.3 -0.2]
The squares of the differences of the observations:
 [0.04 0.09 0.04]
The Mean Square Error is calculated as:  0.056666666666666664
The root mean of square of differences is:  0.23804761428476165
The mean of the absolute differences of the observations is:  0.2333333333333333
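To connect the loss methods of this section with the optimizer, the following sketch shows the basic idea of mini-batch stochastic gradient descent on the simplest possible model, a single weight w in y = w * x, using MSE as the loss. It is a hand-written illustration under assumed values (the data, the learning rate, and the batch size are all made up here) and is not the adam algorithm used in the example models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free data generated by y = 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0               # the single weight to be learned
learning_rate = 0.01
batch_size = 2

for epoch in range(200):
    order = rng.permutation(len(x))              # shuffle the rows every epoch
    for start in range(0, len(x), batch_size):
        rows = order[start:start + batch_size]
        error = w * x[rows] - y[rows]            # prediction minus actual value
        gradient = 2 * np.mean(error * x[rows])  # derivative of the batch MSE w.r.t. w
        w -= learning_rate * gradient            # step against the gradient

mse = np.mean((w * x - y) ** 2)
print("Learned weight:", w)
print("Final MSE:", mse)
```

Each update uses only a small batch of rows rather than the whole dataset, which is what the word "stochastic" refers to; the weight nevertheless settles at the value that minimizes the loss.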
11.3.6 Fitting and Evaluating the Models, Plotting the Observed Losses
The next step involves the fitting of the various models, as well as the plotting of the relevant observations. The reader can follow the implementation of this step in the following script, taking note of the following:
1. For practical reasons, the number of iterations during model training is set to 5 (as defined by the epochs parameter). It must be noted that this number is too small to train the models effectively, but it is sufficient for demonstration purposes. In reality, this number is expected to be at least three digits long (i.e., between 100 and 1,000).
2. The fitting process trains the models in batches of 300 rows of train data (as defined by the batch_size parameter); the weights are adjusted after each batch.
3. The observations from the four different models are plotted together using the plot method (Matplotlib.pyplot library).
Observation 11.18 – The epochs Parameter: Used to define the number of passes over the training set during the training/fitting step. Usually, the number is in the hundreds.
Observation 11.19 – The batch_size Parameter: Used to define the number of rows processed before each adjustment of the weights during the training/fitting step.
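The interplay between the epochs and batch_size parameters can be worked out with a few lines of arithmetic. The sketch below assumes the 538 training rows reported in Output 11.3.3; the variable names are chosen here for illustration.

```python
import math

training_rows = 538   # number of rows in Xtrain, as reported in Output 11.3.3
batch_size = 300
epochs = 5

# Number of batches (weight updates) needed to see every row once
steps_per_epoch = math.ceil(training_rows / batch_size)

# Size of each batch within one epoch; the last one holds the leftover rows
batch_sizes = [min(batch_size, training_rows - start)
               for start in range(0, training_rows, batch_size)]

total_updates = steps_per_epoch * epochs
print("Batches per epoch:", steps_per_epoch)
print("Rows per batch:", batch_sizes)
print("Weight updates over all epochs:", total_updates)
```

The result of two batches per epoch agrees with the 2/2 counter that appears at the start of each line of the training log.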
import matplotlib.pyplot as plt

# Step 4: Fit the models and plot the observations
# Fit the SimpleModel
print("\nThe observation epochs for the simple model: \n")
Observations1 = SimpleModel.fit(Xtrain, Ytrain, validation_data =
    (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the SimpleModel observation history
Observation1DataFrame = pd.DataFrame(Observations1.history)

# Fit the MakeItWider model
print("\nThe observation epochs for the wider model: \n")
Observations2 = MakeItWider.fit(Xtrain, Ytrain, validation_data =
    (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the MakeItWider observation history
Observation2DataFrame = pd.DataFrame(Observations2.history)

# Fit the MakeItDeeper model
print("\nThe observation epochs for the deeper model: \n")
Observations3 = MakeItDeeper.fit(Xtrain, Ytrain, validation_data =
    (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the MakeItDeeper observation history
Observation3DataFrame = pd.DataFrame(Observations3.history)

# Fit the FinalModel model
print("\nThe observation epochs for the final model: \n")
Observations4 = FinalModel.fit(Xtrain, Ytrain, validation_data =
    (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the FinalModel observation history
Observation4DataFrame = pd.DataFrame(Observations4.history)

# Plot the observations from the 4 models
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("History of observations of loss")
Observation1DataFrame['loss'].plot(label = "Simple model")
Observation2DataFrame['loss'].plot(label = "Make it wider")
Observation3DataFrame['loss'].plot(label = "Make it deeper")
Observation4DataFrame['loss'].plot(label = "Final model")
plt.legend()
plt.grid()
# Display the figure when the script is run outside an interactive session
plt.show()
Epoch 1/5
2/2 [==============================] - 3s 946ms/step - loss: 0.4762 - accuracy: 0.6450 - val_loss: 0.4424 - val_accuracy: 0.6652
Epoch 2/5
2/2 [==============================] - 0s 118ms/step - loss: 0.4627 - accuracy: 0.6450 - val_loss: 0.4290 - val_accuracy: 0.6652
Epoch 3/5
2/2 [==============================] - 0s 122ms/step - loss: 0.4491 - accuracy: 0.6450 - val_loss: 0.4161 - val_accuracy: 0.6652
Epoch 4/5
2/2 [==============================] - 0s 106ms/step - loss: 0.4359 - accuracy: 0.6450 - val_loss: 0.4037 - val_accuracy: 0.6652
Epoch 5/5
2/2 [==============================] - 0s 96ms/step - loss: 0.4231 - accuracy: 0.6450 - val_loss: 0.3924 - val_accuracy: 0.6652
] - 0s 45ms/step - loss: 0.3856 - accuracy: 0.6450 - val_loss: 
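The history recorded during fitting can be explored like any other DataFrame. The sketch below rebuilds a small history by hand from the loss and val_loss values of the five complete epochs in the training log, so that it runs without TensorFlow; with a real model the dictionary would come from the history attribute of the object returned by fit. The variable names are assumptions made for this illustration.

```python
import pandas as pd

# Values copied from the five complete epochs of the training log;
# with a real model this dictionary would come from the .history attribute
history = {
    "loss": [0.4762, 0.4627, 0.4491, 0.4359, 0.4231],
    "val_loss": [0.4424, 0.4290, 0.4161, 0.4037, 0.3924],
}
history_frame = pd.DataFrame(history)
history_frame.index = history_frame.index + 1   # label rows by epoch number
history_frame.index.name = "epoch"

# Change of each loss from one epoch to the next (negative = improvement)
changes = history_frame.diff().dropna()
best_epoch = history_frame["val_loss"].idxmin()

print(history_frame)
print("\nChange per epoch:\n", changes)
print("\nEpoch with the lowest validation loss:", best_epoch)
```

Both losses are still decreasing at the last epoch, which is consistent with the remark that 5 epochs are too few to train the models fully.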