Handbook of Computer Programming with Python

This handbook provides a hands-on experience based on the underlying topics, and assists students and faculty members in developing their algorithmic thought process and programs for given computational problems. It can also be used by professionals who possess the necessary theoretical and computational thinking background but are presently making their transition to Python.

Key Features:
• Discusses concepts such as basic programming principles, OOP principles, database programming, GUI programming, application development, data analytics and visualization, statistical analysis, virtual reality, data structures and algorithms, machine learning, and deep learning.
• Provides the code and the output for all the concepts discussed.
• Includes a case study at the end of each chapter.

This handbook will benefit students of computer science, information systems, and information technology, or anyone who is involved in computer programming (entry-to-intermediate level), data analytics, HCI-GUI, and related disciplines.

Edited by Dimitrios Xanthidis, Christos Manolas, Ourania K. Xanthidou, and Han-I Wang

First edition published 2023 by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742, and by CRC Press, 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN. CRC Press is an imprint of Taylor & Francis Group, LLC.

© 2023 selection and editorial matter, Dimitrios Xanthidis, Christos Manolas, Ourania K. Xanthidou, Han-I Wang; individual chapters, the contributors

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-0-367-68777-9 (hbk)
ISBN: 978-0-367-68778-6 (pbk)
ISBN: 978-1-003-13901-0 (ebk)
DOI: 10.1201/9781003139010
Typeset in Times by codeMantra
Access the Support Material: https://www.routledge.com/9780367687779

Contents
Editors
Contributors
Chapter 1 Introduction
Dimitrios Xanthidis, Christos Manolas, Ourania K. Xanthidou, and Han-I Wang
Chapter 2 Introduction to Programming with Python
Ameur Bensefia, Muath Alrammal, and Ourania K. Xanthidou
Chapter 3 Object-Oriented Programming in Python
Ghazala Bilquise, Thaeer Kobbaey, and Ourania K. Xanthidou
Chapter 4 Graphical User Interface Programming with Python
Ourania K. Xanthidou, Dimitrios Xanthidis, and Sujni Paul
Chapter 5 Application Development with Python
Dimitrios Xanthidis, Christos Manolas, and Hanêne Ben-Abdallah
Chapter 6 Data Structures and Algorithms with Python
Thaeer Kobbaey, Dimitrios Xanthidis, and Ghazala Bilquise
Chapter 7 Database Programming with Python
Dimitrios Xanthidis, Christos Manolas, and Tareq Alhousary
Chapter 8 Data Analytics and Data Visualization with Python
Dimitrios Xanthidis, Han-I Wang, and Christos Manolas
Chapter 9 Statistical Analysis with Python
Han-I Wang, Christos Manolas, and Dimitrios Xanthidis
Chapter 10 Machine Learning with Python
Muath Alrammal, Dimitrios Xanthidis, and Munir Naveed
Chapter 11 Introduction to Neural Networks and Deep Learning
Dimitrios Xanthidis, Muhammad Fahim, and Han-I Wang
Chapter 12 Virtual Reality Application Development with Python
Christos Manolas, Ourania K. Xanthidou, and Dimitrios Xanthidis
Appendix: Case Studies Solutions
Index

Editors

Dimitrios Xanthidis holds a PhD in Information Systems from University College London. For the past 25 years, he has been teaching computer science subjects with a focus on programming and software development, and data structures and databases in various tertiary education institutions. Currently, he is working in Higher Colleges of Technology in Dubai, U.A.E. Dimitrios’ research interests and work revolve around the topics of data science, machine learning/deep learning, virtual/augmented reality, and emerging technologies.

Christos Manolas holds a PhD in Stereoscopic 3D Media (University of York, UK), and degrees and qualifications in Postproduction (MA), Music Technology (MSc), Music Performance, Software Development, and Media Production. Christos’ career includes work as a software developer, musician, audio producer, and educator for over 20 years. His research interests include multimodal (audiovisual) perception, spatial audio, interactive and immersive media (VR/AR/XR), and generally the impact and role of digital technologies on media production.

Ourania K. Xanthidou is a PhD researcher at Brunel University, London. She holds an MSc in Computer Science from the University of Malaya, Kuala Lumpur, Malaysia. She has more than 15 years of involvement with the IT industry in the form of supporting IT departments of SMEs and more than 5 years of teaching experience in tertiary education. Ourania’s research interests are in the areas of eHealth, smart health, databases, web application development, and object-oriented programming with a focus on application development for VR/AR/XR.
Han-I Wang holds a PhD in Health Economics from the University of York, UK. Han-I has been working as a research fellow for over 10 years, starting at the Epidemiology & Cancer Statistics Group (ECSG) before joining the Mental Health and Addiction Research Group (MHARG) at the University of York, UK. Her area of expertise spans cost analysis, health outcome research, and decision modeling using complex patient-level data, and her main research interests are related to the exploration of different decision-modeling techniques and their application to predict healthcare expenditure, patients’ quality of life, and life expectancy.

Contributors

Tareq Alhousary
Business Information Systems, University of Salford, Manchester, United Kingdom, and Department of Management Information Systems, Dhofar University, College of Commerce and Business Administration, Salalah, Oman

Muath Alrammal
Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates, and LACL (Laboratoire d’Algorithmique, Complexité et Logique), University Paris-Est (UPEC), Créteil, France

Hanêne Ben-Abdallah
Computer and Information Science, University of Pennsylvania, Philadelphia, PA

Ameur Bensefia
Department of Genie Informatique, University of Rouen Normandy, Laboratoire d’Informatique de Traitement de l’Information et des Systèmes (LITIS), Rouen, France, and Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Ghazala Bilquise
Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Muhammad Fahim
Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Thaeer Kobbaey
Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Christos Manolas
Department of Theatre, Film, Television and Interactive Media, The University of York, York, United Kingdom, and Department of Media Works, Ravensbourne University London, London, United Kingdom

Munir Naveed
Department of Computer Science, University of Huddersfield, Huddersfield, United Kingdom, and Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Sujni Paul
Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Han-I Wang
Department of Health Sciences, The University of York, York, United Kingdom

Dimitrios Xanthidis
School of Library, Archives, and Information Sciences, University College London, London, United Kingdom, and Department of Computer and Information Sciences, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

Ourania K. Xanthidou
Department of Computer Science, Brunel University of London, Uxbridge, United Kingdom

1 Introduction

Dimitrios Xanthidis, University College London and Higher Colleges of Technology
Christos Manolas, The University of York and Ravensbourne University London
Ourania K. Xanthidou, Brunel University of London
Han-I Wang, The University of York

CONTENTS
1.1 Introduction
1.2 Audience
1.3 Getting Started with Jupyter Notebook
1.4 Creating Standalone, Executable Files
1.5 Structure of this Book
References

1.1 INTRODUCTION
Undoubtedly, at the time of writing, Python is among the most popular computer programming languages. Alongside other common languages like C# and Java, it belongs to the broader family of C/C++-based languages, from which it naturally borrows a large number of packages and modules. Although Python is younger than C and C++ themselves, it is widely adopted as the platform of choice by academic and corporate institutions and organizations on a global scale. As a C++-based language, Python follows the structured programming paradigm, and the associated programming principles of sequence, selection, and repetition, as well as the concepts of functions and arrays (as lists). A thorough presentation of such concepts is both beyond the scope of this book and possibly unnecessary, as this was the subject of the seminal works of computer science giants like Knuth, Stroustrup, and Aho (Aho et al., 1983; Knuth, 1997; Stroustrup, 2013).
Readers interested in an in-depth understanding of these concepts on a theoretical basis are encouraged to refer to such works that form the backbone of modern programming.
DOI: 10.1201/9781003139010-1
As an Object-Oriented Programming (OOP) platform, it provides all the facilities and tools to support the OOP paradigm. Unlike its counterparts (i.e., C++, C#, and Java), Python does not provide a streamlined, centralized IDE to support GUI programming, but it does offer a significant number of related modules that cover most, if not all, of the various GUI requirements one may encounter. It includes a number of modules that allow for the implementation of database programming, web development, and mobile development projects, as well as platforms, modules, and methods that can be used for machine and deep learning applications and even virtual and augmented reality project development. Nevertheless, one of the main reasons that made Python such a popular option among computer science professionals and academics is the wealth of modules and packages it offers for data science tasks, including a large variety of libraries and tools specifically designed for data analytics, data visualization, and statistical analysis tasks.
Arguably, there is an abundance of online resources and tutorials and printed books that address most of the aforementioned topics in great detail. On the technical side, such resources may seem too complicated for someone who is currently studying the subject or approaches it without prior programming knowledge and experience. In other cases, resources may be structured more like reference books that may focus on particular topics without covering the introductory parts of computing with Python that some readers may find useful.
This book aims at covering this gap by exploring how Python can be used to address various computational tasks of introductory to intermediate difficulty level, while also providing a basic theoretical introduction to the underlying concepts.

1.2 AUDIENCE
This book focuses on students of computer science, information systems, and information technology, or anyone who is involved in computer programming, data analytics, HCI-GUI, and related disciplines, at an entry-to-intermediate level. This book aims to provide a hands-on experience based on the underlying topics, and assist students and faculty members in developing their algorithmic thought process and programs for given computational problems. It can also be used by professionals who possess the necessary theoretical and computational thinking background but are presently making their transition to Python.
Considering the above, this book includes a wealth of examples and the associated Python code and output, presented in a context that also discusses the underlying concepts and their applications. It also provides key concepts in the form of quick access observations, so that the reader can skim through the various topics. Observations can be used as a reference and navigation tool, or as reminders for points for discussion and in-class presentation in the case of using this book as a teaching resource. Chapters are also accompanied by related exercises and case studies that can be used in this context, and their solutions are provided in the Appendix at the end of this book.

1.3 GETTING STARTED WITH JUPYTER NOTEBOOK
Ample information and support are available through online community channels and the official documentation and guides in terms of installing and running Python programming environments.
Nevertheless, this section provides a brief and straightforward guide on how to use Anaconda Navigator and Jupyter Notebook in order to interpret and execute Python code, as the majority of examples in this book have been implemented and tested using this particular configuration.
Once Anaconda Navigator is launched, a number of different editors and environments are presented in the home page (Figure 1.1). Launching the Jupyter Notebook (i.e., clicking the Launch button) initiates a web interface based on the file directory of the local machine (Figure 1.1). To create a new Python program, the user can select New from the top right corner and the Python 3 notebook menu option (Figure 1.2). This action will launch a new Python file under Jupyter with a default name. This can be changed by clicking on the file name.
[FIGURE 1.1 Anaconda IDE homepage.]
[FIGURE 1.2 Create a new Python file in Jupyter Notebook.]
The Jupyter editor is organized in cells. The user can add each line of code to a separate cell or add multiple lines to the same cell (Figure 1.3). The Run button in the main toolbar is used to execute the code in the selected cell. If the code is free from errors, the interpreter moves to the next cell; otherwise, an error message is displayed immediately after the cell where the error occurred (Figure 1.4).
[FIGURE 1.3 Jupyter’s editor.]
[FIGURE 1.4 Run a Python program on Jupyter.]

1.4 CREATING STANDALONE, EXECUTABLE FILES
With the exception of Chapter 12: Virtual Reality Application Development with Python, which discusses applications that demand specific and highly specialized development platforms, the Python scripts and examples presented in this book were implemented and tested natively in the Anaconda Jupyter environment. In this context, developing and testing software solutions is rather straightforward and intuitive.
However, when it comes to the actual deployment of applications in more realistic scenarios, things become slightly more complex. This is mainly due to the fact that the Python code one develops is usually dependent on a number of external libraries, packages, and files of various formats. These are automatically provided in the background when working within the Anaconda environment, but this is not necessarily the case when scripts are exported as standalone files. The required libraries and resources may be located in numerous different places within the file structures of the computer and/or network systems used during development. In the context of application deployment, references to such external files and objects are generally referred to as application dependencies. Dependencies form a crucial and essential part of the developed application, and the underlying files must be provided alongside the final deliverable program (e.g., a standalone, executable application), as their absence will prevent the program from running correctly on machines lacking the necessary libraries and file structures. Fortunately, the latter are automatically selected and packaged by special routines and processes during the deployment phase of the development cycle. This way, once the final deployment package is created, one can run the application on other computers, irrespective of whether these include the necessary files and libraries or not.
Many SDKs and programming environments provide built-in routines (i.e., wizards) for the generation of the deployment packages and standalone executable files. In the case of Anaconda Jupyter, although there is no automated, built-in wizard for such tasks, one can resort to a number of external helper applications. A detailed, step-by-step tutorial of this process is beyond the scope of this book.
However, some basic, introductory examples are provided below, in order to assist readers with minimal or no previous experience with command line environments in familiarizing themselves with such tasks.
At the moment of writing, two of the most widely used third-party applications for generating standalone executable files from Python scripts are PyInstaller, which runs on Windows, Mac OS, and Linux (PyInstaller Development Team, 2019), and Py2app, which targets Mac OS (Oussoren & Ippolito, 2010). Both applications can handle dependencies and linking, and the decision on which one should be used comes down to the operating system at hand and personal preference. In broad terms, the steps one needs to follow when creating standalone executable files are summarized below:
• Step 1: Irrespective of what program and procedure one chooses to generate the standalone application, the original script(s) must first be exported from Anaconda Jupyter as one or more Python .py file(s). This will be the file(s) used as input to the deployment application.
• Step 2: Another essential task is to ensure that the deployment application is installed on the system. This can be achieved in a number of ways that are detailed in the numerous associated online guides and tutorials (Apple Inc, 2021; Cortesi, 2021; Microsoft, 2021a, 2021b; Oussoren & Ippolito, 2010; PyInstaller Development Team, 2019). For the purposes of this example, one possibility is to install PyInstaller from a Command Prompt/PowerShell window (Microsoft, 2021a, b) using the following command:
    pip install pyinstaller
• Step 3a (Windows): Once PyInstaller is installed, and given that the associated files and the command line environment are set up appropriately, the generation of the standalone file could be as simple as the following command:
    pyinstaller yourprogram.py
Alternatively, the user can refer to the PyInstaller official documentation, in order to execute more specific and complex commands with appropriate parameters and flags, as necessary. For instance, using the same command with the --onefile flag would force the generated executable to be packaged in a single file rather than in a folder structure containing multiple files:
    pyinstaller --onefile yourprogram.py
• Step 3b (Mac OS): The same basic idea also applies when using Py2app (Oussoren & Ippolito, 2010), although the procedure and commands may be slightly different. For instance, when used on a Mac OS system, Py2app generates application bundles instead of an executable file. As an example, users of Mac OS systems can use the Terminal window (Apple Inc, 2021) to first install Py2app:
    pip install -U py2app
Py2app can then be used to create a setup file:
    py2applet --make-setup yourprogram.py
Finally, the setup file can be used to generate the standalone application bundle:
    python setup.py py2app
In both cases, the standalone application is usually placed in a specified directory structure according to the settings and parameters used.
In order to be able to successfully execute the example commands provided here, the reader may have to execute a number of other necessary commands and set up tasks and navigate to the correct directories using the command line environment. Detailed information on how to use both PyInstaller and Py2app can be found on the official documentation pages (Cortesi, 2021; Oussoren & Ippolito, 2010) and on the large variety of associated online resources. It must be noted that the third-party applications mentioned here are just two of the tools one may choose to use for creating standalone executable files based on Python scripts, and they are not the only way of dealing with such tasks. The development and deployment processes vary depending on the characteristics of the developed application, the chosen development platform, and the targeted operating system(s).
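The command-line steps above can also be assembled from within Python. The following is a minimal sketch, not part of the original text, that only builds and prints the commands (a "dry run") so that it can be tried safely in a notebook cell. The helper names pyinstaller_command, py2app_commands, and run are hypothetical and introduced here purely for illustration; invoking PyInstaller as a module through the current interpreter (python -m PyInstaller) is assumed to be equivalent to calling the pyinstaller command directly.

```python
import subprocess
import sys


def pyinstaller_command(script, onefile=False):
    """Build the argument list for packaging `script` with PyInstaller."""
    # Assumption: "python -m PyInstaller" runs the tool with this interpreter.
    command = [sys.executable, "-m", "PyInstaller"]
    if onefile:
        command.append("--onefile")  # package into a single executable file
    command.append(script)
    return command


def py2app_commands(script):
    """Build the two Py2app commands from Step 3b (Mac OS only)."""
    return [
        ["py2applet", "--make-setup", script],   # creates the setup file
        [sys.executable, "setup.py", "py2app"],  # builds the app bundle
    ]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# "yourprogram.py" is the placeholder file name used in the text.
run(pyinstaller_command("yourprogram.py", onefile=True))
for command in py2app_commands("yourprogram.py"):
    run(command)
```

Calling run(..., dry_run=False) would actually launch the tool, which requires that the corresponding package has been installed (Step 2) and that the working directory contains the exported script.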
As most chapters of this book utilize the Anaconda Jupyter environment, most of the examples and programming scripts can be developed and tested within the development platform (or even other platforms) without the need to generate standalone executable files. However, the information provided here can be used as a general guide for the deployment procedure and the necessary conversions, should the reader choose to create standalone versions of the various examples.

1.5 STRUCTURE OF THIS BOOK
This book is divided into three main parts, based on the knowledge field, character, and objective of the presented topics.
The first part (Chapters 2–5) covers classic computer programming topics like introduction to programming, Object-Oriented Programming, Graphical User Interface (GUI) programming, and application development. It is meant to assist readers with little or no prior programming experience to start learning computer programming using Python and the Anaconda Jupyter platform. The related concepts, techniques, and algorithms are discussed and explained with examples of the necessary code and the expected output.
The second part (Chapters 6–9) covers concepts related to data structures and organization, the algorithms used to manipulate these structures, database programming (SQL), data analysis and visualization, and the basics of statistical analysis. These concepts cover most of the topics, algorithms, and applications that make up what is collectively referred to as data science. The structure of this part of this book provides a potential entry point for readers with no prior knowledge in data science, as well as a reference point for those who would like to focus on the implementation of specific data science tasks using Python.
The third part (Chapters 10–12) covers machine and deep learning concepts, while also providing a brief introduction to using Python in contexts not traditionally linked with the language like virtual reality (VR) application development. This part introduces concepts that are potentially more advanced from a contextual perspective, but not necessarily more challenging when it comes to their implementation using Python. For instance, while a deeper understanding of the principles and algorithms behind machine and deep learning may be out of scope for many of the readers of this book, the development of applications using the various related modules and methods provided by Python may be something that is of interest. Similarly, while video game and VR/AR application development is certainly a topic that falls outside the scope of a Python textbook in the strict sense, a basic understanding of how such applications could be developed using the Python language may provide a useful insight to the most adventurous of the readers.
All the scripts and case studies presented in this book, as well as the related data and files necessary for their execution, are included as supplementary material in Appendix A.

REFERENCES
Aho, A.V., Hopcroft, J.E., & Ullman, J.D. (1983). Data Structures and Algorithms. USA: Addison-Wesley.
Apple Inc. (2021). Terminal User Guide. Support.Apple.Com. https://support.apple.com/en-gb/guide/terminal/welcome/mac/.
Cortesi, D. (2021). PyInstaller Documentation. PyInstaller 4.5. https://pyinstaller.readthedocs.io/_/downloads/en/stable/pdf/.
Knuth, D.E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Microsoft. (2021a). Installing Windows PowerShell. https://docs.microsoft.com/en-us/powershell/scripting/windows-powershell/install/installing-windows-powershell?view=powershell-7.1.
Microsoft. (2021b). Windows Command Line. https://www.microsoft.com/en-gb/p/windows-command-line/9nblggh4xtkq?activetab=pivot:overviewtab.
Oussoren, R., & Ippolito, B. (2010). py2app – Create Standalone Mac OS X Applications with Python. https://py2app.readthedocs.io/en/latest/.
PyInstaller Development Team. (2019). PyInstaller Quickstart. https://www.pyinstaller.org/.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.

2 Introduction to Programming with Python

Ameur Bensefia, University of Rouen Normandy and Higher Colleges of Technology
Muath Alrammal, Higher Colleges of Technology and University Paris-Est (UPEC)
Ourania K. Xanthidou, Brunel University of London
DOI: 10.1201/9781003139010-2

CONTENTS
2.1 Introduction
2.2 Algorithm vs. Program
2.2.1 Algorithm
2.2.2 Program
2.3 Lexical Structure
2.3.1 Case Sensitivity and Whitespace
2.3.2 Comments
2.3.3 Keywords
2.4 Punctuations and Variables
2.4.1 Punctuations
2.4.2 Variables
2.5 Data Types
2.5.1 Primitive Data Types
2.5.2 Non-Primitive Data Types
2.5.3 Examples of Variables and Data Types Using Python Code
2.6 Statements, Expressions, and Operators
2.6.1 Statements and Expressions
2.6.2 Operators
2.6.2.1 Arithmetic Operators
2.6.2.2 Comparison Operators
2.6.2.3 Logical Operators
2.6.2.4 Assignment Operators
2.6.2.5 Bitwise Operators
2.6.2.6 Operators Precedence
2.7 Sequence: Input and Output Statements
2.8 Selection Structure
2.8.1 The if Structure
2.8.2 The if…else Structure
2.8.3 The if…elif…else Structure
2.8.4 Switch Case Structures
2.8.5 Conditional Expressions
2.8.6 Nested if Statements
2.9 Iteration Statements
2.9.1 The while Loop
2.9.2 The for Loop
2.9.3 The Nested for Loop
2.9.4 The break and continue Statement
2.9.5 Using Loops with the Turtle Library
2.10 Functions
2.10.1 Function Definition
2.10.2 No Arguments, No Return
2.10.3 With Arguments, No Return
2.10.4 No Arguments, With Return
2.10.5 With Arguments, With Return
2.10.6 Function Parameter Passing
2.10.6.1 Call/Pass by Value
2.10.6.2 Call/Pass by Reference
2.11 Case Study
2.12 Exercises
2.12.1 Sequence and Selection
2.12.2 Iterations – while Loops
2.12.3 Iterations – for Loops
2.12.4 Methods
References
2.1 INTRODUCTION

It is hard to find a programming language that does not follow the norms of what a computer program should look like, as the underlying structures have been established for over 50 years. These norms, widely known as the basic programming principles, are broadly accepted by the academic, scientific, and professional communities, something also reflected in the approaches of legendary figures in the field (Dijkstra et al., 1976; Knuth, 1997; Stroustrup, 2013).

The three basic programming principles are sequence, selection, and repetition (or iteration). Sequence is the concept of executing the instructions of a computer program from top to bottom, one after the other. Selection refers to the concept of deciding among different paths of execution based on the evaluation of certain conditions. Repetition is the idea of repeating a particular block of instructions for as long as a condition evaluates to True (e.g., a nonzero value). In its most basic form, computer programming can be defined as the integration of these principles with variables that store and manipulate data, and with methods or functions that facilitate the fundamental idea of divide and conquer.

The aim of this chapter is not to propose innovative ways of changing the above logic and structures. Although it is unlikely that these concepts can be changed or redefined in a major way, they can be fine-tuned and put into the context of new and developing programming languages. From this perspective, this chapter can be viewed as an effort to present, in a comprehensive and structured way, how these fundamental principles of computer programming are applied to Python, one of the most popular and intuitive modern programming languages. To accomplish this, a number of related basic concepts are presented and discussed in detail in the various sections of this chapter:

1. Algorithms and Programs, Lexical Structures.
2. Variables & Data Types, Primitive and Non-primitive.
3. Statements, Expressions, Operators & Punctuations.
4. Sequence: Input, Basic Operations, and Output Statements.
5. Selection Structures: if, if…else, if…elif…else, Conditional Expressions.
6. Iteration Structures: for Loops, while Loops, Nested Loops.
7. Functions.

It should be noted that this chapter introduces the Turtle library, which is used to demonstrate some of the uses of iteration structures.

2.2 ALGORITHM VS. PROGRAM

The demand for developing a program always originates from a problem that must be addressed by means of computer-based automation. However, an essential intermediate step exists between the problem and the actual program, namely the algorithm.

2.2.1 Algorithm

The term algorithm derives from the name of the mathematician Mohamed Ibn Musa Al-Khwarizmi, who worked during the ninth century. It originally denoted a set of ordered and finite mathematical operations designed to solve a specific problem. Nowadays, the term is used in various fields and disciplines, most notably in Computer Science and Engineering, in which it is defined as a set of ordered operations executed by a machine (computer).

Observation 2.1 – Algorithm: A set of ordered operations that can be executed by a machine (computer system).

The first step in program development is the definition of the problem. At this point, a solution is formulated as a clear and unambiguous set of steps. This solution is the algorithm. The steps described in the algorithm are later translated into a program using a specific programming language (Figure 2.1). The benefit of starting off with the formulation of an algorithm rather than directly implementing the actual program is that it allows the programmer to focus on how to solve the problem logically, free from any constraints or considerations related to the specifics of any given programming language.
Indeed, algorithms are written in pseudo-code, a format that incorporates natural human language and follows particular formal rules. Ultimately, such approaches ensure a level of clarity and detail that reduces or eliminates ambiguity without having to deal with the technicalities of the implementation. The examples below provide two cases of algorithms demonstrating the clarity and simplicity that should characterize the solution to the problem at hand before this solution is translated into an actual program. Both algorithms are in the form of pseudo-code and are, thus, independent of any particular programming language used for the implementation of the solutions:

FIGURE 2.1 Phases of program development.

Algorithm 1: Calculate the Area of a Rectangle
    Start
    Read the length of the rectangle
    Read the width of the rectangle
    Assign width*length to Area
    Display Area
    End

Algorithm 2: Draw a Square of 50 Pixels Length
    Start
    Draw a line of 50 pixels length
    Turn the pen right by 90 degrees
    Draw a line of 50 pixels length
    Turn the pen right by 90 degrees
    Draw a line of 50 pixels length
    Turn the pen right by 90 degrees
    Draw a line of 50 pixels length
    Turn the pen right by 90 degrees
    End

2.2.2 Program

Once the algorithm is formed, the next step is to write the program in a specific programming language. Each programming language has its own rules and conventions. However, they all share a common core structure consisting of inputs, processing, and outputs. They are all implemented using some form of code, the format and structure of which can vary depending on the scope and purpose of each given language and program:

Observation 2.2 – Input, Processing, Output: The basic structure of all programs, irrespective of the programming language used. Input represents any statement written to collect data from an external source. Output represents any statement that sends the outcome of the processing to a display unit, file, or another program.

1. Input: Statements dedicated to collecting data from external input sources (e.g., input from the user through the keyboard and mouse), opening and reading files, or accepting input from other programs. In most instances, input is managed at the beginning of the program execution, but this may vary between different languages and programs.
2. Processing: Processing lies at the core of the program and represents the statements responsible for the manipulation of the information received at input. The length of this section can vary greatly, from a few simple statements to thousands of lines of code organized in numerous files and packages.
3. Output: Output statements are used to communicate the outcome of the processing outside the program. This can take many forms and includes, but is not limited to, sending visual information to a display unit, exporting to a file, or exporting to another program. In most cases, this is the last step of the sequence in a program.

2.3 LEXICAL STRUCTURE

Lexical structure refers to the basic conventions and restrictions in terms of the format and syntax of the text used in the programming environment, in this case Python. This is an important aspect of any programming language, as incorrect format or syntax may lead to syntax errors and to code that is difficult to read and debug.

2.3.1 Case Sensitivity and Whitespace

Python is a case-sensitive programming language, which means that it distinguishes between upper-case and lower-case letters in keywords and names. Thus, if and IF are considered to be different words, with the first being recognized as a Python keyword and the second treated as an ordinary name, such as that of a variable (see: Variables 2.4.2).
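The effect of case sensitivity can be checked directly in the interpreter. The following is a minimal sketch that relies on the keyword module of the standard library; the variable names and values (IF, salary, Salary) are assumptions chosen purely for illustration:

```python
import keyword

# 'if' is a reserved keyword, whereas 'IF' is an ordinary name
print(keyword.iskeyword('if'))
print(keyword.iskeyword('IF'))

# Because 'IF' is not reserved, it can be used as a variable name
IF = 10

# Names that differ only in case refer to two different variables
salary = 20000.0
Salary = 35000.0
print(IF, salary, Salary)
```

Although a name such as IF is accepted by the interpreter, names that differ from a keyword or from another variable only in case are easy to confuse and are probably best avoided.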
2.3.2 Comments

A program is a set of instructions written in a specific language that can be translated and processed by a computer. In real life scenarios, programs can become quite sizable, with hundreds or even thousands of lines of code required. This can make it quite difficult for the programmer to remember the meaning, functionality, and purpose of each line of code. As such, good programming practice involves the use of comments in the program itself. Comments function as useful and intuitive reminders and descriptions for the programmer or anyone who may have direct access to the source code of the program. A comment is expressed in a natural human language and is ignored by the interpreter during runtime.

Observation 2.3 – Comments: Natural language statements ignored by the interpreter, used to explain the purpose of the different parts of the code. Start a single line comment with #, or start and end a multiple line comment with """. Note that Python is case-sensitive.

Python allows the use of two main types of comments:

• Single Line Comment: Starts with the # symbol and continues until the end of the current line:

    # This statement displays the sentence Hello World
    print("Hello World")

• Multiple Lines Comment: Starts with the """ symbols and ends when the same symbol combination occurs again. Strictly speaking, this is a string literal that is not assigned to any variable, so it has no effect on the execution of the program:

    """ The statement below displays
    the sentence Hello World """
    print("Hello World")

2.3.3 Keywords

Python reserves a number of keywords that are used by the interpreter to trigger specific actions when the code is processed. As these keywords are reserved, the programmer is not allowed to use them as variable, function, method, or class names. A list of these keywords is provided in Table 2.1.

Observation 2.4 – Keywords: Reserved words that cannot be used as names for variables, functions, methods, or classes.
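The keywords reserved by a particular interpreter can also be listed from within Python. The following is a minimal sketch that uses the keyword module of the standard library. The exact list depends on the Python version in use (for instance, versions 3.7 and later also reserve async and await), so the output is not reproduced here. The statement passed to compile() is a made-up assignment used only to trigger the error:

```python
import keyword

# The reserved keywords of the interpreter running this script
print(keyword.kwlist)
print('Number of keywords:', len(keyword.kwlist))

# Using a keyword as a variable name is rejected as a syntax error
try:
    compile('class = 5', '<example>', 'exec')
except SyntaxError as error:
    print('SyntaxError:', error.msg)
```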
TABLE 2.1 Python Keywords
    and       as        assert    break     class
    continue  def       del       elif      else
    except    False     finally   for       from
    global    if        import    in        is
    lambda    None      nonlocal  not       or
    pass      raise     return    True      try
    while     with      yield

2.4 PUNCTUATIONS AND VARIABLES

Punctuations and variables are special types of symbols and text that dictate specific functionality. As such, when these symbols or text are encountered, the interpreter performs specific, pre-determined tasks instead of treating them as common text.

2.4.1 Punctuations

Python programs may contain punctuation characters that are combined with other symbols to denote specific functionality. These characters are divided into two main categories: separators and operators (Table 2.2).

2.4.2 Variables

A variable describes a memory location used by a program to store data. Indeed, from a hardware standpoint, it is expressed as a binary or hexadecimal number that represents the memory location and another number that represents the actual data stored in it. Since working directly with hexadecimal numbers is arguably impractical and counter-productive from a programming perspective, a variable is expressed as a combination of an identifier that replaces the actual memory location, a data type identifying the kind of data that can be stored in it, and a value that represents the actual data stored.

Observation 2.5 – Variable: Designated memory location used by the program to store values.

Each programming language has its own rules when it comes to naming variables. In Python, a variable name has to conform to the following rules:

• It should start with a letter of the Latin alphabet ('a', 'b', …, 'z', 'A', 'B', …, 'Z') or with the underscore character ("_").
• It may contain numbers, but it cannot start with one.
• It may contain the special character "_" in any position.
• It cannot contain any other character.
• It cannot be a Python keyword.
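The naming rules above can be checked programmatically. The sketch below is only an illustration: is_valid_name is a hypothetical helper written for this example, built on the str.isidentifier() method and the keyword module of the standard library. Note that isidentifier() follows the actual rules of the interpreter, which are slightly broader than the list above, since Python 3 also accepts letters from other alphabets.

```python
import keyword


def is_valid_name(name):
    """Return True if name can be used as a variable name (hypothetical helper)."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Candidate names chosen for illustration
for candidate in ['Salary', 'Child1', '_ID', '1Child', 'Email#address', 'if']:
    print(candidate, is_valid_name(candidate))
```

The first three candidates are reported as valid, while the remaining three are rejected because they start with a digit, contain a disallowed character, or are a reserved keyword, respectively.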
In line with the above, examples of allowed variable names include the following:

    Salary, Name, Child1, Email_address, firstName, _ID

Similarly, examples of invalid variable names include the following:

    class, 1Child, Email#address

Names of built-in functions, such as print, are not keywords and are technically accepted as variable names. However, they should be avoided, since such an assignment hides the built-in function for the rest of the program.

TABLE 2.2 Separators and Operators in Python
    Separators:  ( )  { }  [ ]  :  "  ,
    Operators:   +  -  *  /  //  %  **  <  <=  >  >=  ==  !=  &  |  =  +=  -=  *=  /=  //=  %=  **=  &=  |=  ^=  >>=  <<=

2.5 DATA TYPES

As stated previously, the purpose of a variable is to hold a value of a specified type. This value can be a number (e.g., decimal, real, octal, hexadecimal), text (i.e., a string of characters), a single character, or a Boolean value (i.e., one out of two possible values: True or False). More complex structures that consist of any of the aforementioned types may also be used. In general, Python supports two main categories of data types in this context: primitive and non-primitive (Figure 2.2).

Observation 2.6 – Data Types: The type of the value stored in a variable could be primitive (i.e., integer, string, float, Boolean) or non-primitive (i.e., a collection of primitive data types).

2.5.1 Primitive Data Types

There are four primitive data types (integer, float, string, and Boolean) that are used when the variable is to hold pure, simple values of data. They are presented below, with the numeric types grouped together:

• String or Text: In Python, strings are values of type str. No explicit declaration is required: a variable holds a string as soon as a string value is assigned to it. A string can hold any set of characters, including letters, numbers, or other symbols, enclosed in single or double quotation marks:
    • "This is a text."
    • "Do you accept the proposal (Yes/No)?"
• Numeric: Since there are different types of numbers, Python provides types and notations suitable for different numerical formats and representations:
    • int represents integer numbers (e.g., +24509129)
    • float represents real numbers (e.g., -123.0968)
    • complex represents complex numbers (e.g., +45-33.6j)
    • the prefix 0o denotes an integer written in octal notation (e.g., 0o7652001)
    • the prefix 0x denotes an integer written in hexadecimal notation (e.g., 0x34EF1C3)
• Boolean: A Boolean variable is used to represent only two possible values: True or False.

FIGURE 2.2 Python's data types. (See Jaiswal, 2017.)

2.5.2 Non-Primitive Data Types

Non-primitive data types are complex types consisting of two or more other data types. Such structures are convenient when one needs to manipulate collections of values of different types. A list of non-primitive types is provided below:

• Sequence: This type is suitable when different values have to be stored and grouped together. It can be further divided into the following categories:
    • List: This category represents a collection of any primitive data types where the elements of the list can be accessed through an index and can be modified (mutable).
    • Tuple: This category represents a collection of any primitive data types where the elements of the tuple can be accessed through an index but cannot be modified (immutable).
    • Set: This category represents a collection of distinct, unique objects. It is useful when a collection must hold strictly unique values, and is especially relevant when the dataset is large. The data is unordered and mutable.
    • Range: This category represents a series of numbers that, by default, starts at 0 and ends just before a specified number.
Examples:

    ["car", "bike", "truck"]      # This is a list of strings
    [200, 6423, -709, 1205]       # This is a list of integers
    ("car", "bike", "truck")      # This is a tuple of strings
    (20.1, +23, -1.9, 12.5)       # This is a tuple of numbers
    {'O', 'E', 'K', 'C', 'I'}     # This is a set of unique strings
    range(5)                      # This will generate the numbers 0 1 2 3 4
    range(3)                      # This will generate the numbers 0 1 2

• Dictionary or Mapping: In cases where it is necessary to associate pairs of data (commonly known as key and value), dictionary or mapping types can be used. These types are labeled as dict. The declaration begins with a curly bracket, followed by the set of pairs separated by commas, and ends with a closing curly bracket. Each pair is represented with the key and the value separated by a colon. To access any value, the key name should be provided between square brackets:

    {"name": "Steve", "age": 20}  # This is a mapping variable

More information on this topic can be found in Chapter 6.

2.5.3 Examples of Variables and Data Types Using Python Code

This section includes a number of practical examples that demonstrate typical uses and structures of variables and data types in Python. The first example is related to the string/text data type, one of the fundamental and most commonly used data types in computer programming. In this rather simple example, the reader can find a number of coding conventions and commands relating to this data type. For instance, the string value that is assigned to the firstName variable is enclosed in single quotes. This is also the case when a string is used directly as an argument of the print() function, which displays the information of its arguments on screen. It must also be noted that good programming practice dictates that variable names start with lower-case letters (e.g., firstName instead of FirstName).
This example also highlights that, in addition to simple arguments like strings in quotation marks, functions like print() may accept arguments of different types or formats, such as other variables or the results of calls to other functions and methods (e.g., .format(firstName)). The format() method takes one or more values as arguments and inserts them in the placeholder brackets {} of the string on which it is called (e.g., 'firstName is {}'.format(firstName)). Note the use of the type() function, which returns the data type of the value stored in the provided variable (i.e., firstName). In the Jupyter Notebook editor, if the output is text, it is provided immediately after the current code cell when the program is executed. Last but not least, the reader should note that comments are included before every distinct piece of code that performs a particular task. While this is not a strict coding requirement, it is an important aspect of good programming practice.

    # Declare a variable named firstName and assign it the value 'Steve'
    firstName = 'Steve'

    # Print the value of variable firstName
    print('firstName is {}'.format(firstName))

    # Print the data type of variable firstName
    print(type(firstName))

Output 2.5.3.a:

    firstName is Steve
    <class 'str'>

Variables of the integer data type hold whole numbers without a fractional part (e.g., numberOfStudents = 20):

    # Declare a variable named numberOfStudents and assign it the value 20
    numberOfStudents = 20

    # Print the value of variable numberOfStudents
    print('Number of students is {}'.format(numberOfStudents))

    # Print the data type of variable numberOfStudents
    print(type(numberOfStudents))

Output 2.5.3.b:

    Number of students is 20
    <class 'int'>

Variables of the float data type are floating-point numbers that require a decimal value.
Note that the inclusion of the decimal value is mandatory even if it is zero; without it, the value is treated as an integer:

    # Declare a variable named salary and assign it the value 20000.0
    salary = 20000.0

    # Print the value of variable salary
    print('Salary is {}'.format(salary))

    # Print the data type of variable salary
    print(type(salary))

Output 2.5.3.c:

    Salary is 20000.0
    <class 'float'>

Variables of the complex data type are in the form of an expression containing a real and an imaginary part, such as x+yj (e.g., complexNumber = +45-33.6j):

    # Declare variable complexNumber and assign it the value +45-33.6j
    complexNumber = +45-33.6J

    # Print the value of variable complexNumber
    print('complexNumber is {}'.format(complexNumber))

    # Print the data type of variable complexNumber
    print(type(complexNumber))

Output 2.5.3.d:

    complexNumber is (45-33.6j)
    <class 'complex'>

Integer values written in octal notation start with 0o (e.g., octalNumber = 0o7652001). When printed, the value is displayed in decimal form. In this particular example, the reader should also note the use of comments stretching across multiple lines. As mentioned, comments of this type start and end with three double quotation marks ("""):

    # Declare a variable named octalNumber and assign it the value 0o7652001
    octalNumber = 0o7652001

    # Print the value of variable octalNumber
    print('octalNumber is {}'.format(octalNumber))

    """Print the data type of variable octalNumber: notice that an octal
    value is simply an integer written in base 8; this is why a class int
    text appears in the result"""
    print(type(octalNumber))

Output 2.5.3.e:

    octalNumber is 2053121
    <class 'int'>

Boolean variables can only take two different values: True or False.
In the following code, variable married is True, but the only other possible value this variable could take would be False:

    # Declare a variable named married and assign it the value True
    married = True

    # Print the value of variable married
    print('married is {}'.format(married))

    # Print the data type of variable married
    print(type(married))

Output 2.5.3.f:

    married is True
    <class 'bool'>

Mapping variables are always enclosed in curly brackets (e.g., mappingVariable = {'name': 'Steve', 'age': 20}):

    # Declare a variable named mappingVariable and assign it
    # the value {'name': 'Steve', 'age': 20}
    mappingVariable = {'name': 'Steve', 'age': 20}

    # Print the value of variable mappingVariable
    print('mappingVariable is {}'.format(mappingVariable))

    # Print the data type of variable mappingVariable
    print(type(mappingVariable))

Output 2.5.3.g:

    mappingVariable is {'name': 'Steve', 'age': 20}
    <class 'dict'>

List variables are enclosed in square brackets (e.g., listVariable = [200, 6423, -709, 1205]):

    # Declare a variable named listVariable and assign
    # it the value [200, 6423, -709, 1205]
    listVariable = [200, 6423, -709, 1205]

    # Print the value of variable listVariable
    print('listVariable is {}'.format(listVariable))

    # Print the data type of variable listVariable
    print(type(listVariable))

Output 2.5.3.h:

    listVariable is [200, 6423, -709, 1205]
    <class 'list'>

Tuple variables are enclosed in parentheses (e.g., tupleVariable = ('car', 'bike', 'truck')):

    # Declare a variable named tupleVariable and assign
    # it the value ('car', 'bike', 'truck')
    tupleVariable = ('car', 'bike', 'truck')

    # Print the value of variable tupleVariable
    print('tupleVariable is {}'.format(tupleVariable))

    # Print the data type of variable tupleVariable
    print(type(tupleVariable))

Output 2.5.3.i:

    tupleVariable is ('car', 'bike', 'truck')
    <class 'tuple'>

Range variables hold integers ranging from 0 up
to a specified number (e.g., rangeVariable = range(5)). Note that the specified number is not inclusive, so rangeVariable in this example will hold values 0, 1, 2, 3, and 4:

    # Declare a variable named rangeVariable and assign it a
    # range of integers from 0 to 4 (i.e., 0 1 2 3 4)
    rangeVariable = range(5)

    # Print the value of variable rangeVariable
    print('rangeVariable is {}'.format(rangeVariable))

    # Print the data type of variable rangeVariable
    print(type(rangeVariable))

Output 2.5.3.j:

    rangeVariable is range(0, 5)
    <class 'range'>

Set variables hold sets of unique values of primitive data types. In the following code, command set('cookie') allocates unique values 'i', 'c', 'o', 'e', 'k' to variable setVariable. Since sets are unordered, the order in which the elements are displayed may differ between runs:

    # Declare a variable named setVariable and assign it
    # the set of unique letters in the word 'cookie'
    setVariable = set('cookie')

    # Print the value of variable setVariable
    print('setVariable is {}'.format(setVariable))

    # Print the data type of variable setVariable
    print(type(setVariable))

Output 2.5.3.k:

    setVariable is {'i', 'e', 'c', 'k', 'o'}
    <class 'set'>

2.6 STATEMENTS, EXPRESSIONS, AND OPERATORS

Statements and expressions refer to specific syntactical structures that provide instructions to the interpreter in order to execute specific tasks. They can be simple structures executing a simple task, like printing a message on screen, or more complicated ones that perform a number of tasks and generate multiple threads of information and results. Operators refer to special symbols that perform particular, pre-determined tasks, and can be used as building blocks for building logical statements and expressions. This section introduces basic concepts related to these fundamental programming elements.

Observation 2.7 – Statement: A line of code that can be executed by the Python interpreter.
2.6.1 Statements and Expressions

A statement is a unit/line of code (i.e., an instruction) that the Python interpreter can execute. So far, two kinds of statements have been presented in this chapter, assignment and print:

    # Assignment statement produces no output
    name = 'Steve'

    # Print function
    print('Name is:', name)

Output 2.6.1:

    Name is: Steve

A script usually contains a sequence of statements. When there is more than one statement, the results appear one at a time, as each statement is executed. An expression is a combination of values, variables, operators, and calls to functions resulting in a clear and unambiguous value upon execution.

Observation 2.8 – Expression: Any combination of values, variables, operators, and/or calls to functions that result in an unambiguous value.

2.6.2 Operators

Operators are tokens/symbols that represent computations, such as addition, multiplication, and division. The values an operator acts upon are called operands. Let us consider the simple expression x = 3*2. The reader should note the following:

• x is a variable.
• 3 and 2 are the operands.
• * is the multiplication operator.
• 3*2 is considered an expression since it results in a specific value.

Observation 2.9 – Operators/Operands: Operators are symbols representing computations like additions, multiplications, divisions. Operands are the values that the operators act upon.

TABLE 2.3 Python Arithmetic Operators
    Operator     Example   Name             Description
    + (unary)    +a        Unary positive   a (the value is left unchanged).
    + (binary)   a + b     Addition         Sum of a and b. The + operator adds two numbers. It can also be used to concatenate strings, in which case both operands must be strings; mixing a string and a number raises a TypeError.
    - (unary)    -a        Unary negation   It converts a positive value to its negative equivalent and vice versa.
    - (binary)   a - b     Subtraction      b subtracted from a.
    *            a * b     Multiplication   Product of a and b.
    /            a / b     Division         The division of a by b. The result is always of type float.
    %            a % b     Modulo           The remainder when a is divided by b.
    //           a // b    Floor division   The division of a by b, rounded down to the nearest integer (also called integer division).
    **           a ** b    Exponentiation   a raised to the power of b.

Python supports many operators for combining data into expressions. These can be divided into arithmetic, comparison, logical, assignment, and bitwise.

Observation 2.10 – Efficient Script Writing: Include expressions that display results inside the print function to avoid multiple instructions. Use a single statement to declare and assign values to multiple variables.

2.6.2.1 Arithmetic Operators

These operators are used with numeric values, namely integers, floating-point numbers, and complex numbers. Some of them, such as + and *, are also defined for strings. Table 2.3 lists the arithmetic operators supported by Python, and the example that follows presents a script that applies a number of these operators. It is worth noting that the arithmetic expressions are not separate statements in the script. Instead, they appear as arguments in the print() function. Both options are correct, although it is advisable to follow a syntax similar to the script in order to write shorter, and thus more concise, scripts.
    a = 5
    b = 4

    # Addition expression
    print('a+b=', a + b)

    # Subtraction expression
    print('a-b=', a - b)

    # Multiplication expression
    print('a*b=', a * b)

    # Division expression
    print('a/b=', a / b)

    # Exponent expression
    print('a raised to the power of b =', a ** b)

    # Unary negation expression
    print('a negated is =', -a)

    # Modulus expression
    print('The remainder of the integer division between a and b is:',
          a % b)

    # Floor division
    print('Floor division of a and b is:', a // b)

Output 2.6.2.a:

    a+b= 9
    a-b= 1
    a*b= 20
    a/b= 1.25
    a raised to the power of b = 625
    a negated is = -5
    The remainder of the integer division between a and b is: 1
    Floor division of a and b is: 1

2.6.2.2 Comparison Operators

These operators compare values for equality or inequality (i.e., the relation between the two operands, be it numbers, characters, or strings). They yield a Boolean value as a result. The comparison operators are typically used with some type of conditional statement (see: 2.8 Selection Structures) or within an iteration structure (see: 2.9 Iteration Structures), determining the branching or looping directions to follow. Table 2.4 lists the comparison operators supported by Python, and the code that follows provides some relevant example cases using a Python script.
TABLE 2.4 Python Comparison Operators
    Operator  Example  Name                      Description
    ==        a == b   Equal to                  True if the value of a is equal to that of b; False otherwise
    !=        a != b   Not equal to              True if a is not equal to b; False otherwise
    <         a < b    Less than                 True if a is less than b; False otherwise
    <=        a <= b   Less than or equal to     True if a is less than or equal to b; False otherwise
    >         a > b    Greater than              True if a is greater than b; False otherwise
    >=        a >= b   Greater than or equal to  True if a is greater than or equal to b; False otherwise

An interesting point about this particular script is that the variables are all declared and assigned values in a single statement, with both the names and the values separated by commas. The script also demonstrates the use of a mix of strings and comparison expressions as arguments of the print function, separated by commas. Note that strings are compared character by character in lexicographic order, which is why 'Dubai' is considered greater than 'Abu Dhabi':

    a, b, c, d, e = 5, 4, 5, 'Dubai', 'Abu Dhabi'

    # Test for equality and print directly the result of the expression
    print(a == b, 'and', a == c)

    # Test for inequality and print directly the result of the expression
    print(a != b, 'and', a != c)

    # Test for 'less than' and for 'less than or equal to' and
    # print directly the result of the expression
    print(a < b, 'and', a <= b)

    # Test for 'greater than' and for 'greater than or equal to' and
    # print directly the result of the expression
    print(a > b, 'and', a >= b)

    # Test for equality and 'greater than' between strings
    print(d == e, 'and', d > e)

Output 2.6.2.b:

    False and True
    True and False
    False and False
    True and True
    False and True

2.6.2.3 Logical Operators

As mentioned, comparison operators compare their operands and produce a Boolean output. This type of output is commonly used in branching and looping statements. Boolean (logical) operators are used to combine multiple comparison expressions into a more complex, single expression. Their operands are normally Boolean values, such as the results of comparisons.
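As a minimal sketch of how the results of two comparisons can be combined into a single expression, the example below checks whether a value falls within a given range. The variable names and the limits (score, lowest, highest) are assumptions made purely for illustration:

```python
# Hypothetical values chosen for illustration only
score = 75
lowest, highest = 0, 100

# Each comparison yields a Boolean value
aboveMinimum = score >= lowest
belowMaximum = score <= highest

# 'and' combines the two Boolean results into a single expression
inRange = aboveMinimum and belowMaximum
print('Score within range:', inRange)

# 'not' inverts the combined result
print('Score out of range:', not inRange)
```

Storing the intermediate results in separate variables is only one possible choice; the two comparisons could equally be written inside a single expression joined by and.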
Table 2.5 lists the logical operators supported by Python and the following script demonstrates some of their indicative applications:

TABLE 2.5 Python Logical Operators

    Operator  Example  Description
    not       not a    True if a is False; False if a is True
    or        a or b   True if either a or b is True; False otherwise
    and       a and b  True if both a and b are True; False otherwise

    # Apply the 'not' logical operator
    x = 5
    print(not (x < 10))
    print(not (x < 3))

    # Apply the 'or' logical operator
    x, y = 5, 7
    print((x > 3) or (y < 6))
    print((x < 3) or (y < 6))

    # Apply the 'and' logical operator
    x, y = 5, 7
    print((x > 3) and (y > 6))
    print((x < 3) and (y > 6))

    # Combine the 'not', 'and', and 'or' operators
    x, y = 5, 7
    print(not (x < 3) and (y > 6))
    print((x < 3) or (y > 6) and (x < 10))

Output 2.6.2.c:

    False
    True
    True
    False
    True
    False
    True
    True

2.6.2.4 Assignment Operators

These operators are essential, as they allow variables to be manipulated by saving or updating their values.

TABLE 2.6 Python Assignment Operators

    Operator  Example         Description
    =         c = a + b       Assigns the result of the expression on the right side of the assignment operator to the variable on the left side
    +=, -=    c += a, c -= b  Equivalent to c = c + a or c = c - b
    *=, /=    c *= a, c /= b  Equivalent to c = c * a or c = c / b
    //=       c //= a         Equivalent to c = c // a
    %=        c %= a          Equivalent to c = c % a
    **=       c **= a         Equivalent to c = c ** a
Table 2.6 and the code that follows summarize the use of the different assignment operators in Python:

    # Assign the result of the expression on the right side of
    # the assignment operator to the variable on the left side
    a, b = 12, 10
    c = a + b
    print('The value of c is:', c)

    # Use +=, -=, *=, /= in assignments
    a, c = 2, 12
    c += a
    print('The value of c is:', c)

    a, c = 2, 12
    c -= a
    print('The value of c is:', c)

    a, c = 2, 12
    c *= a
    print('The value of c is:', c)

    a, c = 2, 12
    c /= a
    print('The value of c is:', c)

    # Use the %= and **= in assignments
    a, c = 4, 10
    c %= a
    print('The value of c is:', c)

    a, c = 4, 10
    c **= a
    print('The value of c is:', c)

Output 2.6.2.d:

    The value of c is: 22
    The value of c is: 14
    The value of c is: 10
    The value of c is: 24
    The value of c is: 6.0
    The value of c is: 2
    The value of c is: 10000

2.6.2.5 Bitwise Operators

These are considered to be low-level operators. They treat operands as sequences of binary digits and operate on them bit by bit. Table 2.7 details the bitwise operators supported by Python and the example that follows demonstrates their application within a script. The reader should note that a value written in the binary system must be preceded by the prefix 0b (e.g., 0b1100). Likewise, when a value must be displayed in binary form, the format specification {:04b} can be used; it displays the binary value padded with leading zeros to at least four digits.

TABLE 2.7 Python Bitwise Operators

    Operator  Example         Name                         Description
    &, |      a & b, a | b    Bitwise AND, OR              Each bit position in the result is the logical AND (or OR) of the bits in the corresponding position of the operands. For AND: 1 if both are 1, otherwise 0. For OR: 1 if either is 1, otherwise 0.
    ~         ~a              Bitwise negation             Each bit position in the result is the logical negation of the bit in the corresponding position of the operand; 1 if 0, 0 if 1.
    ^         a ^ b           Bitwise XOR (exclusive OR)   Each bit position in the result is the logical XOR of the bits in the corresponding position of the operands; 1 if the bits in the operands are different, 0 if they are the same.
    >>, <<    a >> n, a << n  Shift right or left n places  Each bit is shifted right or left by n places.

    # Bitwise 'and' using binary literals
    a, b = 0b1100, 0b1010
    print('0b{:04b}'.format(a & b))

    # Bitwise 'and' using decimal values
    a, b = 12, 10                # 12 = 0b1100, 10 = 0b1010
    c = a & b                    # 8 = 0b1000
    print('Value of c is', c)

    # Bitwise 'or' using binary literals
    a, b = 0b1100, 0b1010
    print('0b{:04b}'.format(a | b))

    # Bitwise 'or' using decimal values
    a, b = 12, 10                # 12 = 0b1100, 10 = 0b1010
    c = a | b                    # 14 = 0b1110
    print('Value of c is', c)

    # Bitwise negation using a binary literal
    a = 0b1100
    b = ~a
    print('0b{:04b}'.format(b))

    # Bitwise negation using a decimal value
    a = 12                       # 12 = 0b1100
    b = ~a                       # -13, displayed in binary as -1101
    print('Value of b is', b)

    # Bitwise XOR (exclusive OR) using binary literals
    a, b = 0b1100, 0b1010
    print('0b{:04b}'.format(a ^ b))

    # Bitwise XOR (exclusive OR) using decimal values
    a, b = 12, 10                # 12 = 0b1100, 10 = 0b1010
    c = a ^ b                    # 6 = 0b0110
    print('Value of c is', c)

    # Shift right 'n' places using a binary literal
    a = 0b1100
    print('0b{:04b}'.format(a >> 2))

    # Shift right 'n' places using a decimal value
    a = 12
    b = a >> 2                   # 3 = 0b0011
    print('Value of b is', b)

    # Shift left 'n' places
    a = 0b1100
    print('0b{:04b}'.format(a << 2))

Output 2.6.2.e:

    0b1000
    Value of c is 8
    0b1110
    Value of c is 14
    0b-1101
    Value of b is -13
    0b0110
    Value of c is 6
    0b0011
    Value of b is 3
    0b110000

2.6.2.6 Operator Precedence

Python, like other programming languages, follows the standard algebraic rules when evaluating expressions. All operators are assigned a precedence:

Observation 2.11 – Order of Precedence: The order of precedence of operator execution determines the result of complex expressions.
Inconsistencies can lead to incorrect scripts.

• Operators with the highest precedence are applied first.
• Next, their results are used as operands by the operators with the next highest precedence.
• Operators with equal precedence are applied from left to right (with the exception of exponentiation, which is applied from right to left).
• This pattern continues until the full expression is calculated.

Table 2.8 lists the operator precedence for Python, from lowest to highest. The code following this provides some examples of their application. It is essential for the reader to keep in mind the order of precedence of the various operators, since ignoring it can lead to complex expressions being evaluated differently from what was intended:

TABLE 2.8 Python Operator Precedence

    Precedence  Operator                           Description
    Lowest      or                                 Boolean OR
                and                                Boolean AND
                not                                Boolean NOT
                ==, !=, <, <=, >, >=, is, is not   Comparisons, identity
                |                                  Bitwise OR
                ^                                  Bitwise XOR
                &                                  Bitwise AND
                <<, >>                             Bit shifts
                +, -                               Addition, subtraction
                *, /, //, %                        Multiplication, division, floor division, modulo
                +x, -x, ~x                         Unary positive, unary negation, bitwise negation
    Highest     **                                 Exponentiation

    # The order of execution is exponentiation first,
    # then multiplication: 2 ** 2 = 4, then 5 * 4 = 20
    a = 5 * 2 ** 2
    print('The value of a is:', a)

    # The order of execution is multiplication first,
    # then addition: 2 * 3 = 6, then 2 + 6 = 8
    a = 2 + 2 * 3
    print('The value of a is:', a)

    # Parentheses have the highest precedence,
    # then everything else: (2 + 2) = 4, then 4 * 3 = 12
    a = (2 + 2) * 3
    print('The value of a is:', a)

    # Addition and subtraction have the same precedence,
    # hence, they are evaluated from left to right.
    # This is also the case between arithmetic operators
    # with equal precedence: 2 + 2 = 4, then 4 - 3 = 1
    a = 2 + 2 - 3
    print('The value of a is:', a)

Output 2.6.2.f:

    The value of a is: 20
    The value of a is: 8
    The value of a is: 12
    The value of a is: 1

2.7 SEQUENCE: INPUT AND OUTPUT STATEMENTS

Similarly to most other contemporary programming languages, Python is organized around functions, i.e., reusable programming routines that can be attached to an object of a class or used as standalone pieces of code that perform specific tasks. Python has an extensive array of functions, both predefined ones that are built into the core of the language and ones that are part of the various classes used by it. An example of a Python function that has already appeared in several of the exercises presented in this chapter is the print() function. As the name suggests, this is a function used to display output on screen. To invoke it one simply has to call it with an argument (e.g., print(<argument>)).

Observation 2.12 – Input/Output: Use the print() function to display output on screen. Output is passed to the function as an argument. Use the input() function to receive input from the keyboard. Ensure that the value returned by input() is assigned to a variable; otherwise it is discarded and cannot be used later in the program.

Another frequently used Python function is input(), used to get input from the keyboard. This function prompts the user to provide input in the form of text. The function stops the program execution until the text input has been provided and resumes only when the user presses the designated key (i.e., Enter or Return).
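As a small sketch of how arguments are passed to print(), the lines below display the same text in two ways. The value stored in name is a hypothetical stand-in for text typed by a user:

```python
# Hypothetical value standing in for text typed by the user
name = 'Rania'

# Several arguments: print() separates them with a single space by default
print('The name you entered is', name)

# One argument: the same text assembled with the concatenation operator
message = 'The name you entered is ' + name
print(message)
```

Both calls display the same line; passing several arguments is usually more convenient, since print() inserts the separating space itself and also accepts non-string values.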
The following example demonstrates the use of both print() and input() in a single Python script:

    # Call the 'input' function to accept the user's input from
    # the keyboard and assign the provided data to a variable
    fullName = input('Insert your full name\n')

    # Print the contents of the variable fullName on screen
    print('The name you entered is', fullName)

Output 2.7.a:

    Insert your full name

Output 2.7.b:

    Insert your full name
    Rania
    The name you entered is Rania

It is important to point out the following in regard to this particular script:

• Any value received as input must be assigned to a suitable variable. If the value returned by input() is not assigned, it is discarded and cannot be used later in the program.
• The escape character \n forces the output that follows it to be displayed on the next line.
• The input() function treats all input as text, regardless of whether numeric values are provided. If the input is meant to be treated as a numerical value, further processing is required.

2.8 SELECTION STRUCTURE

One of the three basic principles of computer programming is selection: deciding which block of statements to execute next, based on the result of the evaluation of a certain condition. Such a condition, together with the statements to execute based on it, is referred to as a selection. There are three main types of selection statements: if, if…else, and if…elif…else.

2.8.1 The if Structure

Observation 2.13 – Condition: A True/False or zero/non-zero value expression used to determine the flow of program execution.

The if structure is used to determine whether a certain statement or block of statements will be executed or not, based on a simple or complex condition. If the condition is True (or non-zero), then the block of statements is executed; otherwise it is not executed and the program flow continues from the next statement outside the if structure.
This means that the evaluation of the condition must yield a Boolean or arithmetic (i.e., zero/non-zero) value. The syntax of the basic if statement is provided below:

    if (condition):
        Block of statements to execute if condition is True
    Statements to execute outside the if statement

Similarly, Figure 2.3 illustrates a simple if statement in the form of a flowchart. Many high-level programming languages, such as C++ or Java, use braces {} to mark a block of statements. Since Python does not have any designated markers for this purpose, it uses indentation to identify these blocks. Under this scheme, the block starts with the first indented line and ends at the first line that is indented less than the block. Consider the following script:

    # Simple 'if' statement
    a = int(input('Enter the first integer to continue: '))
    b = int(input('Enter the second integer to continue: '))
    if (a > b):
        print("The first integer is larger than the second")

FIGURE 2.3 Flowchart of the if statement.

Output 2.8.1:

    Enter the first integer to continue: 5
    Enter the second integer to continue: 3
    The first integer is larger than the second

In this example, the user is prompted to enter two integer values, which are assigned to two corresponding variables. Next, the values of the variables are compared. This is done with a simple if statement that displays a message on screen when its condition is True. Both the input() and print() functions are used in the script. The reader should note that, since the input() function treats every input as text, it is necessary to convert this value into a suitable primitive type before the required calculations or processing can take place. This is the idea behind casting. In this particular example, the input value is cast into an integer using the int() function. Also, the reader should note that it is possible to use one function call inside another, in this case the input() function call inside the int() cast call.
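The effect of casting can be sketched without keyboard input by using a string variable in place of the value that input() would return. The names below are assumptions for illustration only, and the try/except block is a construct not covered in this section; it is included merely to show what happens when the text is not a valid integer:

```python
# Stands in for the text returned by input('Enter an integer: ')
typedValue = '5'

# Without casting, the value is text: multiplying repeats the characters
doubledText = typedValue * 2           # '55'

# After casting, the value is an integer: multiplying is arithmetic
doubledNumber = int(typedValue) * 2    # 10

print(doubledText, 'versus', doubledNumber)

# Text that does not represent an integer cannot be cast with int()
try:
    int('five')
    castSucceeded = True
except ValueError:
    castSucceeded = False
print('Was the text "five" cast successfully?', castSucceeded)
```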
Observation 2.14 – if Statement: Used to determine whether a statement or block of statements will be executed or not, based on a simple or complex condition.

Observation 2.15 – Indentation: Use indentation to mark a block of statements.

Observation 2.16 – Casting: Convert input values to the appropriate primitive data type, as required for calculations or processing.

2.8.2 The if…else Structure

It is possible to write the if statement in a way that it executes one block of statements when the condition is True and another when it is not. This is the concept behind the if…else statement:

    if (condition):
        Block of statements to execute if condition is True
    else:
        Block of statements to execute if condition is False

Figure 2.4 illustrates an if…else structure as a flowchart and the following code provides an example of its application. This particular script prompts the user to enter two integers (note that input is treated as text by default), converts the input to actual integers, compares the two values, and displays one of the two outputs, depending on the result of the comparison. In this example, each block contains a single statement and exactly one of the two is executed, as the condition of the if statement will be either True or False. Note that the else block is also executed when the two values are equal. The programmer can add multiple instructions within each block of statements, and it is also possible to have another if statement nested inside a block. Such cases are discussed in later sections of this chapter.

FIGURE 2.4 Flowchart of the if…else statement.

    # The 'if…else…' statement
    a = int(input('Enter the first integer to continue: '))
    b = int(input('Enter the second integer to continue: '))
    if (a > b):
        print('First integer holds a value greater than the second')
    else:
        print('Second integer holds a value greater than the first')

Output 2.8.2:

    Enter the first integer to continue: 13
    Enter the second integer to continue: 20
    Second integer holds a value greater than the first

Observation 2.17 – Selection:
• Use the if statement for the execution of one block of statements if the condition is True.
• Use the if…else statement for the execution of either of two possible blocks of statements depending on a particular condition.
• Use the if…elif…else statement for the execution of multiple possible blocks of statements depending on a number of conditions.
• Use dictionary/mapping structures in place of the switch structure of C++, Java, etc.
• Use conditional expressions in place of the conditional operator used in C++, Java, etc.
• Use nested if structures in more complex cases.

2.8.3 The if…elif…else Structure

Python allows the execution of more than two blocks of statements in a single if structure. The conditions are evaluated in order, and the block associated with the first condition that is True is executed. The remaining blocks are ignored and the program execution continues at the first line after the if structure. If none of the conditions are True, then the block of the else statement is executed. The syntax of the if…elif…else structure is provided below, and its flowchart can be found in Figure 2.5:

    if (condition1):
        Block to execute if condition1 is True
    elif (condition2):
        Block to execute if condition2 is True
    …
    else:
        Block to execute if none of the conditions are True

FIGURE 2.5 Flowchart of the if…elif…else statement.

The following script demonstrates the application of an if…elif…else structure. The script prompts the user to enter an integer between 0 and 100.
Depending on the input value, a particular block of code is executed based on the conditions of the if…elif…else structure:

    # The 'if…elif…else…' statement
    a = int(input('Enter a grade between 0 and 100: '))
    if (a < 60):
        print('I am sorry but you failed the course.\n'
              'Please try harder next semester')
    elif (a < 70):
        print('Task completed! You passed the course')
    elif (a < 80):
        print('Well done! You did well in the course')
    elif (a < 90):
        print('Very good job. Keep up the good work')
    elif (a <= 100):
        print('Excellent performance. Congratulations.')
    else:
        print('I am sorry but an integer between 0 and 100 was expected')

Output 2.8.3:

    Enter a grade between 0 and 100: 92
    Excellent performance. Congratulations.

2.8.4 Switch Case Structures

A switch case structure is used as an alternative to long if structures that compare a variable against several values. Unlike many other programming languages, Python does not have a dedicated switch case statement (although Python 3.10 and later versions provide the match statement, which can serve a similar purpose). To get around the lack of such a statement, programmers may use an if…elif…else structure, as described in the previous section. Alternatively, dictionary/mapping can be used as shown in the script below:

    # Dictionary mapping used to check against a range of options
    numberToTextSwitcher = {
        1: 'One',
        2: 'Two',
        3: 'Three'
    }

    number = input('Insert 1, 2, or 3: ')
    intNumber = int(number)
    print('The string value of', intNumber, \
          'is', numberToTextSwitcher.get(intNumber))

Output 2.8.4:

    Insert 1, 2, or 3: 3
    The string value of 3 is Three

The reader should note some interesting points in relation to this script:

• The dictionary/mapping variable type, in this example numberToTextSwitcher, can be used to substitute the functionality of the missing switch statement.
• When a statement is long and difficult to include in a single line, the programmer can use the \ symbol to inform the Python interpreter that the statement continues in the next line. Inside parentheses, as in the print() call above, the \ symbol is optional.
• Apply the get() function of the dictionary/mapping variable with the key (i.e., the first part of the pair) to get access to the value (i.e., the second part of the pair). If the key is not present in the dictionary, get() returns None.

2.8.5 Conditional Expressions

Another construct that can be used in Python in place of the missing conditional operator of C++ or Java is what is often called the conditional expression. The syntax is the following:

    Statement 1 if condition else Statement 2

In this case, the first part of the expression that is evaluated is the condition following the if keyword. If this is True, the first statement is executed; otherwise, the second statement is executed. The following code provides an example of the application of the conditional expression:

    # Use of 'conditional expression' instead of the 'if…else' statement
    a = int(input('Enter the first integer (a): '))
    b = int(input('Enter the second integer (b): '))

    print('a is greater than b') if (a > b) else print('b is greater than a')

Output 2.8.5:

    Enter the first integer (a): 3
    Enter the second integer (b): 6
    b is greater than a

2.8.6 Nested if Statements

As already implied, it is possible to have an if structure nested inside another. In fact, such nesting can go as deep as the programmer wishes, although it is not advisable to go deeper than three levels, since the resulting structure becomes difficult to follow. A possible syntax for the nested if structure is presented below:

    if (condition 1):
        if (condition 2):
            Block 1 executes
        else:
            Block 2 executes
    else:
        Block 3 to execute if condition 1 is False

Block 1 will be executed if condition 2 is True. Condition 1 is not considered at this point, as it is True by default.
Note that if this was not the case, the program flow would never reach the nested if(<condition 2>) statement. Also, the first else statement is an alternative to the if(<condition 2>) part of the structure and not to the if(<condition 1>) part. The latter is taken care of by the second else statement. The code that follows is an example of a nested if, based on a simple variation of a previously used script:

    # A script with a basic nested 'if' structure
    inputGrade = int(input('Enter your grade between 0 and 100: '))

    if (inputGrade >= 80):
        if (inputGrade >= 90):
            print('Excellent performance')
        else:
            print('Very good. Keep up the good work')
    else:
        if (inputGrade >= 60):
            print('You did well')
        else:
            print('Sorry, you failed the course')

Output 2.8.6:

    Enter your grade between 0 and 100: 50
    Sorry, you failed the course

2.9 ITERATION STATEMENTS

Application developers and programmers always look to optimize their programs by using appropriate, efficient statements and by minimizing the lines of code, in order to create programs that are easy to maintain. A common way to reduce the lines of code is the concept of iteration. Indeed, iteration, alongside sequence (i.e., sequential execution of statements) and selection (see previous sections), constitute what is known in computer programming as the three basic principles of programming. The iteration concept applies to cases where a block of statements has to be repeated several times. There are three possible iteration alternatives offered in Python: the while loop, the for loop, and the nested loops.

Observation 2.18 – Loop: A block of statements that is executed repeatedly while a certain condition is True. There are three possible forms of loops: while loops, for loops, and nested loops.

2.9.1 The while Loop

The while loop is suitable for cases where the number of iterations is unknown and depends on certain conditions.
These conditions need to be specified explicitly, similarly to the various forms of selection statements. The block of statements inside the loop is repeated as long as the specified conditions are satisfied. Once the conditions become False, the Python interpreter exits the loop and proceeds with the rest of the program. The block of statements within the loop structure needs to be indented. The syntax of the basic while loop and its flowchart (Figure 2.6) are provided below:

Observation 2.19 – while Loop: Repeatedly executes a block of statements while a certain condition is True. If the condition is never True, the block is never executed. If the condition never changes to False, the block is executed indefinitely, causing an infinite loop.

    # while loop with one condition
    while (condition):
        Block of statements
        …

    # while loop with two conditions;
    # op is a logical operator (and, or)
    while (condition1) op (condition2):
        Block of statements
        …

If the condition is not met when the loop is first reached, the block of statements will not be executed at all. It is also possible that the values on which the condition depends are never updated inside the while loop, in which case the block will be executed indefinitely, resulting in an undesirable infinite loop. In order to avoid the latter, it is essential for these values to be updated inside the while loop.

The following script provides a basic example of the while loop. The program starts by prompting the user to decide whether the message should be displayed or not. This is done by entering either 'Y'/'y' or 'N'/'n'. Any other input is treated in the same way as 'N'/'n'. In this arrangement, the flow goes into the block that belongs to the while loop only when the user enters 'Y' or 'y'. Note that the same prompt for input is given to the user inside the loop. This is because it is necessary to change this value in order to determine the while condition.
As mentioned, if this value is not modified inside the loop (i.e., if the statement showMessage = input('Do you want to show the message again (Y/N)? ') is missing) the program execution would lead into an infinite loop. The program will continue to run as long as the user enters 'Y' or 'y':

FIGURE 2.6 Flowchart of the while loop.

    # Use of 'while' loop to show the message 'Hello world'
    # as long as the user enters 'Y' or 'y'
    showMessage = input('Do you want to show the message again (Y/N)? ')

    while (showMessage == 'Y' or showMessage == 'y'):
        print('Hello world')
        showMessage = input('Do you want to show the message again (Y/N)? ')

Output 2.9.1.a:

    Do you want to show the message again (Y/N)? Y
    Hello world
    Do you want to show the message again (Y/N)? Y
    Hello world
    Do you want to show the message again (Y/N)? N

Another example of a while loop can be seen in the script below, which introduces the use of the end = '' clause in the print() function. This clause suppresses the newline character that print() normally adds at the end of its output, so that the next output continues on the same line:

    # Use the 'while' loop to display all integers
    # between two values provided by the user
    numberToShow = int(input('Enter the starting integer: '))
    endInteger = int(input('Enter the ending integer: '))

    while (numberToShow <= endInteger):
        print(numberToShow, ' ', end = '')
        numberToShow += 1

Output 2.9.1.b:

    Enter the starting integer: 5
    Enter the ending integer: 10
    5  6  7  8  9  10

The next script is a classic example of summing all the integers between two values, which are entered by the user at runtime. The reader should note how the loop control variable (i.e., currentInteger) is being modified inside the block of statements.
Also, it should be noted how the two print() functions are used and connected through the end = '' clause, in order to display the results in a single line:

    # Use the 'while' loop to add all integers between two values
    # provided by the user
    currentInteger = int(input('Enter the starting integer:'))
    endingInteger = int(input('Enter the ending integer:'))

    sumOfValues = 0
    while (currentInteger <= endingInteger):
        print('currentInteger value is', currentInteger, end = '')
        sumOfValues += currentInteger
        currentInteger += 1
        print(' and sumOfValues currently is', sumOfValues)

Output 2.9.1.c:

    Enter the starting integer:1
    Enter the ending integer:5
    currentInteger value is 1 and sumOfValues currently is 1
    currentInteger value is 2 and sumOfValues currently is 3
    currentInteger value is 3 and sumOfValues currently is 6
    currentInteger value is 4 and sumOfValues currently is 10
    currentInteger value is 5 and sumOfValues currently is 15

In addition to the above, it is also possible to have an if structure of any type nested inside the while loop. The following code provides an example of a script that repeatedly accepts integers from the keyboard and, once the user enters 0, displays how many even and how many odd numbers were entered. What is noteworthy in this script is the use of an if…else structure inside the while loop:

    """ Use of the 'while' loop to count the number of even and odd
    numbers from an input stream provided by the user.
    Stop the loop and display the results when the user enters 0 """

    # Declare the counters for even and odd numbers
    countEven, countOdd = 0, 0

    # Declare a variable to temporarily store current input value
    userInput = int(input('Enter an integer, '
                          'or 0 to display the results and exit: '))

    # The 'while' loop that repeatedly executes the main block of code
    while (userInput != 0):
        if (userInput % 2 == 0):
            countEven += 1
        else:
            countOdd += 1

        # Repeatedly accept new input from the user until 0 is entered
        userInput = int(input('Enter an integer, '
                              'or 0 to display the results and exit: '))

    # Display the results of the program
    print('You entered', countEven, 'even and', countOdd, 'odd numbers')

Output 2.9.1.d:

    Enter an integer, or 0 to display the results and exit: 2
    Enter an integer, or 0 to display the results and exit: 3
    Enter an integer, or 0 to display the results and exit: 4
    Enter an integer, or 0 to display the results and exit: 5
    Enter an integer, or 0 to display the results and exit: 6
    Enter an integer, or 0 to display the results and exit: 0
    You entered 3 even and 2 odd numbers

Programmers can also use a logically modified version of the while loop in place of the do…until (or repeat…until) loop, another classic programming language loop structure that is not directly available in Python. When using the while loop to replace the do…until functionality, the programmer should make sure that the while condition is True during the first iteration, and that its value is repeatedly updated at the end of the block of statements inside the loop.

2.9.2 The for Loop

The for loop structure allows for the execution of a block of statements for a predefined number of iterations.
The loop controls the number of iterations using a counter (i.e., a variable introduced in the loop header), within a specific range defined by two numbers: start and end. The counter takes the values from start up to, but not including, end. The range can also be specified by just one end number, in which case the start will be considered to be 0 by default. Additionally, it is possible to include an incremental or decremental step inside the for header. Each repeated statement is placed within the block of statements, inside the for loop. The syntax for each of the three types of the for loop is provided below, while Figure 2.7 showcases the associated flowchart:

Observation 2.20 – for Loop: Repeatedly executes a block of statements for a predefined number of times. The end of the loop must be defined, the start can be omitted, and the step can be specified in the header.

    # Number of iterations is end - start
    for counter in range(start, end):
        Block of statements

    # Counter starts from 0; number of iterations is end
    for counter in range(end):
        Block of statements

    """ Number of iterations is (end - start) / step, rounded up;
    counter increases/decreases by step """
    for counter in range(start, end, step):
        Block of statements

FIGURE 2.7 Flowchart of the for loop.

The next script displays the list of names stored in a tuple. The block of statements inside the for loop is executed four times, with the i index starting at 0 and increasing up to 3 (inclusive):

    # Declare a variable as a 'tuple' of immutable string elements
    myFriends = ('John', 'Ali', 'Steven', 'Catherine')

    # Use a 'for' loop to read the elements in the 'tuple', first to last
    for i in range(0, 4):
        print('Happy New Year:', myFriends[i])
    print('Done.')

Output 2.9.2.a:

    Happy New Year: John
    Happy New Year: Ali
    Happy New Year: Steven
    Happy New Year: Catherine
    Done.

A similar example is provided in the following script, where instead of a tuple variable a list is used.
The user is prompted to enter four names into the empty list, which are subsequently displayed on screen:

    # Declare a 'list' variable that will accept names provided by the user
    nameList = []

    # Declare a 'dictionary' mapping numbers 1–4
    # to text values 'first', 'second', 'third', 'fourth', respectively
    numberToText = {
        1: 'first',
        2: 'second',
        3: 'third',
        4: 'fourth'
    }

    # Use 'for' loop to accept 4 names; store them in the list
    for i in range(0, 4):
        message = ('Enter the ' + str(numberToText.get(i + 1)) +
                   ' name to insert in the list: ')
        newName = input(message)
        nameList.insert(i, newName)

    # Use a 'for' loop to display the newly created name list
    for i in range(4):
        print(nameList[i])

    print('Done.')

Output 2.9.2.b:

    Enter the first name to insert in the list: Hellen
    Enter the second name to insert in the list: Steven
    Enter the third name to insert in the list: Ahmed
    Enter the fourth name to insert in the list: Catherine
    Hellen
    Steven
    Ahmed
    Catherine
    Done.

The reader should note the following:

• A list is declared using square brackets instead of the parentheses used for tuples. By leaving the square brackets empty, an empty list is created.
• Use a dictionary mapping to convert numeric values into the corresponding text (e.g., numberToText).
• Use the str() function to convert a value into a string.
• Use the concatenation operator (+) to combine strings.
• Use the insert() function of the list to populate it. The first argument is the index of the new element and the second is the actual value.
• If the start number is omitted in the for loop header, zero is assumed as a default value.

2.9.3 The Nested for Loop

As with if statements, it is possible to embed a for loop (i.e., inner loop) into another (i.e., outer loop) to create a nested for loop.
This is particularly convenient when dealing with non-primitive data types of two or more dimensions, or with more complex problems. The syntax is provided below, and the associated flowchart is presented in Figure 2.8:

    for counter1 in range(start1, end1):
        Block of statements 1
        ...
        for counter2 in range(start2, end2):
            Block of statements 2
            ...
            for counter3 in range(start3, end3):
                Block of statements 3
                ...

FIGURE 2.8 Flowchart of the nested for loop.

Observation 2.21 – Nested Loops: Use nested loops of any type to address complex situations like mathematical problems, drawing shapes, searching or sorting, or dealing with multi-dimensional non-primitive data types.

Nested loops are commonly used for the implementation of programs that deal with various types of non-primitive data types, such as lists, tuples, or sets. The following script provides an example of a nested for loop structure, in which a two-dimensional list variable (i.e., languages) is displayed on screen. This particular variable stores six different elements (i.e., names of programming languages) in two different dimensions (i.e., three elements on each dimension).
The reader should note how the counters of the nested loops are used as indices for the displayed items of the list:

    # Define a two-dimensional list with 3 programming languages
    # as its elements (per dimension)
    languages = [['Python', 'Java', 'C++'], ['PhP', 'HTML', 'Java Script']]

    # A nested 'for' loop prints the 2 different dimensions of the list
    for i in range(2):
        print(i, 'Set of programming languages:')
        for j in range(3):
            print('Happy new year:', languages[i][j])
    print('All languages displayed')

Output 2.9.3.a:

    0 Set of programming languages:
    Happy new year: Python
    Happy new year: Java
    Happy new year: C++
    1 Set of programming languages:
    Happy new year: PhP
    Happy new year: HTML
    Happy new year: Java Script
    All languages displayed

Another common use of nested loops relates to the implementation of various sorting or searching algorithms (see: Chapter 6). The following script provides another example of a nested for loop structure that implements a classic sorting algorithm referred to as the Bubble Sort. This script does the following:

• It declares two lists, one to accept the original list of integers and the other to store the sorted list.
• It runs a for loop that accepts a number of integers as input from the user and transfers them to the first list.
• It runs a second for loop that reads from the original list and transfers its elements into the second one (sorted list).
• It runs a nested for loop that utilizes the Bubble Sort algorithm.
• Finally, it runs two more for loops: one that displays the original list of integers and one that displays the sorted one.
It should be noted that the code presented in the script below is not an example of the most efficient or complete sorting algorithm, but a simplified implementation of it, as the main purpose is to help the reader gain a better understanding of the use of nested loops:

    originalList, sortedList = [], []

    # The first 'for' loop accepts a number
    # of integers and populates the 'originalList'
    sizeOfList = int(input('Total number of integers in the list? '))
    for i in range(sizeOfList):
        tempValue = int(input('Add an integer to the list: '))
        originalList.insert(i, tempValue)

    # The second 'for' loop copies the 'originalList' into the
    # 'sortedList' in preparation for sorting the latter
    for i in range(sizeOfList):
        sortedList.insert(i, originalList[i])

    # Use a nested 'for' loop to sort the 'sortedList' using the
    # Bubble Sort algorithm: the outer loop counts the passes and
    # the inner loop compares and swaps adjacent elements
    for i in range(sizeOfList - 1):
        for j in range(sizeOfList - 1 - i):
            if (sortedList[j] > sortedList[j + 1]):
                temp = sortedList[j]
                sortedList[j] = sortedList[j + 1]
                sortedList[j + 1] = temp

    # Use two 'for' loops to successively display the two lists
    print('The original list is: ', end = '')
    for i in range(sizeOfList):
        print(originalList[i], '', end = '')
    print('\nThe sorted list is: ', end = '')
    for i in range(sizeOfList):
        print(sortedList[i], '', end = '')

Output 2.9.3.b:

    Total number of integers in the list? 3
    Add an integer to the list: 2
    Add an integer to the list: 1
    Add an integer to the list: 4
    The original list is: 2 1 4
    The sorted list is: 1 2 4

2.9.4 The break and continue Statement

Observation 2.22 – break and continue: Use the break statement combined with a selection statement in a loop, to permanently interrupt loop execution. Use the continue statement combined with a selection statement in a loop to skip the current iteration.

Another common use of nested loops is related to the implementation of algorithms for the solution of mathematical problems. The following script presents an implementation of a program calculating the prime numbers. In this particular case, the user is prompted to enter the last integer of the prime numbers list the program should calculate. Next, a for loop nested inside a while loop determines whether each candidate integer is a prime number or not. The script introduces the break statement, which forces the interpreter to skip all the remaining statements and iterations, and exit the current loop. As shown in the script, break is generally combined with a selection statement:

    # Use a 'for' loop nested inside a 'while' loop to find prime numbers.
    # Variable 'endInteger' stores the last integer of the sequence
    endInteger = int(input('Enter the last integer '
                           'of the sequence of prime numbers: '))

    # Print the first prime integer, 2. This is subsequently followed
    # by the rest of the sequence on the same line
    print('2 ', end = '')

    # The 'counter' variable is used to evaluate
    # whether a number within the range is prime
    counter, flag = 3, True

    # 'while' loop controls the counter variable used for evaluation
    while (counter <= endInteger):
        # 'for': check current 'counter' value against the integers
        # from 2 up to itself to determine if it is a prime number
        for i in range(2, counter):
            if ((counter % i) == 0):
                flag = False
                break
        if flag:
            print(counter, '', end = '')
        flag = True
        counter += 1

Output 2.9.4.a:

    Enter the last integer of the sequence of prime numbers: 100
    2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97

The following example provides a more direct demonstration of how the break statement is used. The code instructs the interpreter to read from a non-primitive data type (a tuple), but breaks just after reading its first element:

    # Declare variable 'myFriends' and populate with a tuple of names
    myFriends = ('Ahmed', 'John', 'Emma', 'Hind')

    # Use a 'for' loop to read the elements of the tuple
    for i in range(4):
        # Use an 'if' statement to stop reading the tuple once
        # the second element (i.e., index 1) is reached
        if (i == 1):
            break
        print('Happy new year:', myFriends[i])
    print('Done')

Output 2.9.4.b:

    Happy new year: Ahmed
    Done

Another statement that is commonly used in loops, and particularly in nested loops, is the continue statement. It is used when there is a need to skip one or more particular iterations, and continue with the rest of the program. It is worth noting that this statement is frequently combined with selection statements. The main difference between the continue and the break statements is that the former stops the active iteration without completely interrupting the loop.
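The difference between the two statements can be seen side by side in a minimal sketch, which is not part of the original scripts. It runs the same loop twice over the same tuple, once with break and once with continue. The variable names greetedWithBreak and greetedWithContinue are illustrative assumptions, and the list method append() is used here simply to collect the names that each loop actually reaches:

```python
# Illustrative sketch: the same loop body, once with 'break' and once
# with 'continue'. Variable names are hypothetical.
myFriends = ('Ahmed', 'John', 'Emma', 'Rania')

# 'break': the loop ends for good as soon as 'John' is reached
greetedWithBreak = []
for name in myFriends:
    if (name == 'John'):
        break
    greetedWithBreak.append(name)

# 'continue': only the iteration for 'John' is skipped
greetedWithContinue = []
for name in myFriends:
    if (name == 'John'):
        continue
    greetedWithContinue.append(name)

print('With break:', greetedWithBreak)
print('With continue:', greetedWithContinue)
```

With break, only the names before 'John' are collected; with continue, every name except 'John' is collected.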
The following script demonstrates the use of the continue statement:

    # Declare variable 'myFriends' and populate with a tuple of names
    myFriends = ('Ahmed', 'John', 'Emma', 'Rania')

    # Use a 'for' loop to read the elements of the tuple
    for i in range(4):
        # Use an 'if' statement to skip the second element
        # (i.e., the element with index 1)
        if (i == 1):
            continue
        print('Happy new year:', myFriends[i])
    print('Done.')

Output 2.9.4.c:

    Happy new year: Ahmed
    Happy new year: Emma
    Happy new year: Rania
    Done.

2.9.5 Using Loops with the Turtle Library

In addition to a multitude of other uses, loops are also convenient when using code for drawing shapes. Among the most important programming tools for such tasks is the Turtle library. The following script provides an example of how to draw a basic shape of four squares (with sides of 100 pixels). The reader should note the use of the forward(length) function of the turtle module (imported as t), which draws a straight line of 100 pixels. Next, the script uses the left(degrees) function to turn the drawing pen 90 degrees left and repeat the 100-pixel drawing. At the end of the script it is necessary to call the mainloop() function, which starts the event loop and keeps the drawing window open until the user closes it. The output of this example shows the four squares drawn as a result of the for loop:

    # Import the 'turtle' library
    import turtle as t

    # Use a 'for' loop to draw 4 squares with sides of 100 pixels
    for i in range(4):
        t.forward(100)
        t.left(90)
        t.forward(100)
        t.left(90)
        t.forward(100)
        t.left(90)
        t.forward(100)

    # Use the mainloop() function of the 'turtle' module
    t.mainloop()

Output 2.9.5.a: [Figure: the four squares drawn by the script]

Nested loops can be also used with Turtle to draw more complex shapes.
The following script demonstrates this by building on the previous example and forcing the drawing process to be repeated three times with the use of a nested loop. After each repetition, the four-square shape is rotated by 30 degrees to the left:

    # Import the 'turtle' library
    import turtle as t

    # Nested 'for' to draw a complex of squares with sides of 100 pixels
    for i in range(3):
        for j in range(4):
            t.forward(100)
            t.left(90)
            t.forward(100)
            t.left(90)
            t.forward(100)
            t.left(90)
            t.forward(100)
        t.left(30)

    # Use the mainloop() function of the 'turtle' module
    t.mainloop()

Output 2.9.5.b: [Figure: the four-square shape drawn three times, rotated by 30 degrees each time]

The Turtle library comes with a rich set of functions that support a large variety of drawing tasks. Table 2.9 provides a sample based on this set, including some of the most important of its functions.

TABLE 2.9 Methods Available in the Turtle Class

    Method or Command     Required Parameters   Description
    forward               Length in pixels      Moves the Turtle pen forward by the specified amount
    backward              Length in pixels      Moves the Turtle pen backward by the specified amount
    right                 Angle in degrees      Turns the Turtle pen a number of degrees clockwise
    left                  Angle in degrees      Turns the Turtle pen a number of degrees counter-clockwise
    penup                 None                  Picks up the Turtle pen
    pendown               None                  Puts down the Turtle pen to start drawing
    pensize               Thickness of pen      Sets the thickness of the Turtle pen
    color, pencolor       Color name            Changes the color of the Turtle pen
    fillcolor             Color name            Changes the fill color for the drawing
    begin_fill, end_fill  None                  Defines the start and the end of the application of the fillcolor() method
    setposition           x, y coordinates      Sets the current position of the Turtle pen (equivalent to goto)
    goto                  x, y coordinates      Moves the Turtle pen to coordinate position x, y
    shape                 Shape name            Accepts values such as 'arrow', 'classic', 'turtle', or 'circle'
    speed                 Speed value           Dictates the speed of the Turtle pen, from slowest (1) to fast (10); a value of 0 disables the animation and draws as fast as possible
    circle                Radius, arc, steps    Draws a circle counter-clockwise with a pre-set radius. If arc is used, it will draw an arc from 0 up to a given number in degrees. If steps is used, it will draw the shape in pieces resembling a polygon

2.10 FUNCTIONS

A function is a block of statements that performs a specific task. It allows the programmer to reuse parts of their code, promoting the concept of modularity. The main idea behind this approach is to divide a large block of code into smaller, and thus more manageable, subblocks.

Observation 2.23 – Function: A defined structure of statements that can be called repeatedly. It has a unique name, and may take arguments and/or return values to the caller.

There are two types of functions in Python:

• Built-in: The programmer can use these functions in the program without defining them. Several functions of this type were used in the previous sections (e.g., print() and input()).
• User-defined: Python allows programmers to create their own functions. The following section focuses on this particular function type.

2.10.1 Function Definition

Observation 2.24 – Four Types of Functions:
1. No arguments, no return value.
2. With arguments, no return value.
3. No arguments, with return value.
4. With arguments, with return value.

The main rules for defining functions in Python are the following:

• The function block begins with the keyword def, followed by the function name and parentheses. Note that, as Python is case-sensitive, the programmer must use def instead of Def.
• Similar to variable names, function names can include letters or numbers, but no spaces or special characters (other than the underscore), and cannot begin with a number.
• Optional input parameters, called arguments, should be placed within the parentheses. It is also possible to define the parameters inside the parentheses.
• The block of statements within a function starts with a colon and is indented.
• A function that returns data must include the keyword return in its block of code.

The syntax for a function declaration is as follows:

    def functionName(var1, var2, … etc.):
        Statements

Depending on the presence or absence of arguments and of return values, functions can be classified under four possible types. These types are presented in detail in the following sections.

2.10.2 No Arguments, No Return

This is a type in which the function does not accept variables as arguments, and does not return any data. This is demonstrated in the following script that merely prints a predefined string on screen. The reader should note that there are no arguments inside the parentheses and no return statement inside the block of statements. The structure simply invokes the print() function displaying the desired message. Invoking such a function inside the main program is a rather simple and straightforward task:

    # Define function that neither accepts arguments nor returns values
    def printSomething():
        print('Hello world')

    # Call the function from the main program
    printSomething()

Output 2.10.2:

    Hello world

2.10.3 With Arguments, No Return

Another type of function is one in which the function accepts variables as arguments, but does not return any data. In the following script, the function is invoked by declaring its name while also including a number of values in the parentheses.
These values are passed to the main body of the function, and can be treated as normal variables:

    # Define a function that accepts arguments but does not return values
    def printMyName(fName, lName):
        print('Your name is:', fName, lName)

    # Prompt user to input their name
    firstName = input('Enter your first name: ')
    lastName = input('Enter your last name: ')

    # Call the function from the main program
    printMyName(firstName, lastName)

Output 2.10.3:

    Enter your first name: Alex
    Enter your last name: Fora
    Your name is: Alex Fora

2.10.4 No Arguments, With Return

The third type involves a function that does not accept arguments, but returns data. It is important to remember that since this type of function returns a value to the calling code, this value is typically assigned to a variable so that it can be used or processed later:

    # Define a function that does not accept arguments but returns values
    def returnFloatNumber():
        inputFloat = float(input('Enter a real number '
                                 'to return to the main program: '))
        return inputFloat

    # Call the function from the main program to display the input
    x = returnFloatNumber()
    print('You entered:', x)

Output 2.10.4:

    Enter a real number to return to the main program: 5.7
    You entered: 5.7

2.10.5 With Arguments, With Return

The fourth type involves a function that both accepts arguments and returns values back to the calling code. The following script demonstrates this.
In this case, the call of the function must include a list of arguments and assign the return value to a specific variable for later use:

    # Function accepts arguments & returns values to the caller
    def calculateSum(number1, number2):
        print('Calculate the sum of the two numbers.')
        return (number1 + number2)

    # Accept two real numbers from the user
    num1 = float(input('Enter the first number: '))
    num2 = float(input('Enter the second number: '))

    # Call the function to calculate the sum for the two numbers
    addNumbers = calculateSum(num1, num2)

    # Print the sum for the numbers
    print('The sum for the two numbers is:', addNumbers)

Output 2.10.5:

    Enter the first number: 3
    Enter the second number: 5
    Calculate the sum of the two numbers.
    The sum for the two numbers is: 8.0

2.10.6 Function Parameter Passing

There are two different ways to pass parameters to functions. Determining which of the two should be chosen depends on whether the value of the original variables should be changed within the function or not. These two ways for passing parameter values to a function are commonly referred to as call/pass by value and call/pass by reference.

Observation 2.25 – Passing Values to Argument:
1. By Value: Argument is a copy of the original variable, which remains unchanged.
2. By Reference: Changes apply directly to the original variable, thus, changing its value.

2.10.6.1 Call/Pass by Value

In this case, the value of the argument (parameter) is processed as a copy of the original variable. Hence, the original variable in the caller’s scope will be unchanged when program control returns to the caller. In Python, if immutable parameters (e.g., integers and strings) are passed to a function, the behavior matches call/pass by value. The example below illustrates such a case by introducing the id() function. It accepts an object as a parameter (i.e., id(object)) and returns the identity of this particular object. The return value of id() is an integer, which is unique and constant for this object during its lifetime (the actual number differs from one run to another). As shown in the sample output, the id of variable x before calling the checkParameterID function is 140715021772880. It should be noted that the id of x is not changed within the function as long as the value of x is not updated. However, once the value is updated to 20, its corresponding id is changed to 140715021773200. The most important thing to note is that the id of x in the main program does not change after calling the function, and its original value is maintained (140715021772880). That means that the assignment inside the function did not affect the original variable within the caller’s scope: the x of the function was bound to a new object (20), which has the same effect as applying the change on a copy of the variable:

    # Define function 'checkParameterID' that accepts a parameter (by value)
    def checkParameterID(x):
        print('The value of x inside checkParameterID',
              'before value change is', x, '\nand its id is', id(x))
        # Change the value of parameter 'x' within the scope of the function
        x = 20
        print('The value of x inside checkParameterID',
              'after value change is', x, '\nand its id is', id(x))

    # Declare variable 'x' in the main program and assign initial value
    x = 10
    print('The value of x before calling the function',
          'checkParameterID is', x, '\nand its id is', id(x))

    # Call function 'checkParameterID'
    checkParameterID(x)

    # Display info about 'x' in the main program after function call
    print('The value of x after calling the function checkParameterID',
          'is', x, '\nand its id is', id(x))

Output 2.10.6.a:

    The value of x before calling the function checkParameterID is 10
    and its id is 140715021772880
    The value of x inside checkParameterID before value change is 10
    and its id is 140715021772880
    The value of x inside checkParameterID after value change is 20
    and its id is 140715021773200
    The value of x after calling the function checkParameterID is 10
    and its id is 140715021772880

2.10.6.2 Call/Pass by Reference

In this case,
the function gets a reference to the argument (i.e., the original variable) rather than a copy of it. The value of the original variable in the caller’s scope will be modified if a change occurs within the function. In Python, if mutable parameters (e.g., a list) are passed to a function, the call/pass is by reference. As shown below, updateList appends a value of 5 to the list named y. The fact that the value of the original mutable variable x changes demonstrates the functionality of argument call/pass by reference:

    # Define function 'updateList' that changes values within the list
    def updateList(y):
        y.append(5)
        return y

    # Declare list 'x' with 4 elements and assign values
    x = [1, 2, 3, 4]
    print('The content of x before calling the function updateList is:', x)

    # Call function 'updateList'
    print('Call the function updateList')
    updateList(x)
    print('The content of x after calling the function updateList is:', x)

Output 2.10.6.b:

    The content of x before calling the function updateList is: [1, 2, 3, 4]
    Call the function updateList
    The content of x after calling the function updateList is: [1, 2, 3, 4, 5]

2.11 CASE STUDY

Write a Python application that displays the following menu and runs the associated functions based on the user’s input:

• Body mass index calculator.
• Check customer credit.
• Check a five-digit integer for palindrome.
• Convert an integer to the binary system.
• Initialize a list of integers and sort it.
• Exit.

Specifics on the components of the application:

• Body Mass Index Calculator: Read the user’s weight in kilos and height in meters, and calculate and display the user’s body mass index. The formula is: BMI = (weightKilos)/(heightMeters × heightMeters). If the BMI value is less than 18.5, display the message “Underweight: less than 18.5”. If it is between 18.5 and 24.9, display the message “Normal: between 18.5 and 24.9”. If it is between 25 and 29.9, display the message “Overweight: between 25 and 29.9”. Finally, if it is 30 or more, display the message “Obese: 30 or greater”.
• Check Department-Store Customer Balance: Determine if a department-store customer has exceeded the credit limit on a charge account. For each customer, the following facts are to be entered by the user:
  • Account number.
  • Balance at the beginning of the month.
  • Total of all items charged by the customer this month.
  • Total of all credits applied to the customer’s account this month.
  • Allowed credit limit.
  The program should accept input for each of the above from the user as integers, calculate the new balance (= beginning balance + charges − credits), display the new balance, and determine if the new balance exceeds the customer’s credit limit. For customers whose credit limit is exceeded, the program should display the message “Credit limit exceeded”.
• Check a Five-Digit Integer for Palindrome: A palindrome is a number or a text phrase that reads the same backward as forward (e.g., 12321, 55555). Write an application that reads a five-digit integer and determines whether or not it is a palindrome. If the number is not five digits long, display an error message indicating the issue to the user. When the user dismisses the error dialog, allow them to enter a new value.
• Convert Decimal to Binary: Accept an integer between 0 and 99 and print its binary equivalent. Use the modulus and division operations, as necessary.
• List Manipulation and Bubble Sort: Write a script that does the following:
  a. Initialize a list of integers of a maximum size, where the maximum size is entered by the user.
  b. Prompt the user to select between automatic or manual entry of integers to the list.
  c. Fill the list with values either automatically or manually, depending on the user’s selection.
  d. Sort the list using Bubble Sort.
  e. Display the list if it has fewer than 100 elements.
The above should be implemented using a single Python script. Avoid adding statements in the main body of the script unless necessary. Try to use functions to run the various tasks of the application. Have the application/menu run continuously until the user enters the value associated with exiting.

2.12 EXERCISES

2.12.1 Sequence and Selection

1. Write a script that displays numbers 1–4 on the same line and in one output, separated by one space.
2. Write a script that accepts three integers and calculates and displays their sum, average, product, lowest, and highest.
3. Write a script that accepts five integers and prints how many of them are odd and how many are even. (Hint: An even number leaves a remainder of zero when divided by 2. Use the modulus operator.)
4. Write a script that accepts five numbers and calculates and prints the number of negatives, positives, and zeros.
5. Write a script that accepts two integers and determines and prints whether the first is a multiple of the second.
6. Write a script that accepts one number consisting of five digits, separates the number into the individual digits, and prints each digit separated by three spaces from each other. (Hint: use both division and modulus operations to break down the number.)
7. Write a script that accepts the radius of a circle as an integer and prints the circle’s diameter, circumference, and area. (Hint: Use the constant value 3.14159 for π. Calculate the diameter as radius*2, the circumference as 2π*radius, and the area as π*radius^2.)
8. Write a script that accepts the first and the last name from the user as two separate inputs, concatenates them separated by one space character, and displays the result.
9. Write a script that accepts a character and displays its ASCII value. (Hint: use the ord() function.)
10. Write a script that accepts an ASCII value between 50 and 255 and displays its character. (Hint: use the chr() function.)
2.12.2 Iterations – while Loops

1. Drivers are concerned with the accumulated mileage of their automobiles. One particular driver has been monitoring trips by recording miles driven and petrol gallons used. Write a script that uses a while statement to accept the miles and petrol gallons used for each trip. The script should calculate and display the miles per gallon obtained for each trip, and the combined, total miles per gallon obtained up to date.
2. Write a script that accepts integers within the range of 1–30. For each number entry, the script should print a line containing the same number of adjacent asterisks (e.g., for number 7 it should display: “7: *******”). The script should run until the user enters a predefined exit value.
3. A company pays its employees partially based on commissions. The employees receive $200 per week, plus 9% of their gross sales for the week. Write a script that accepts the items sold for a week by a single employee and calculates and displays their earnings. There is no limit to the number of items that can be sold by an employee.
4. Write a script that uses a while statement to determine and print the largest number entered by the user. The user is allowed to enter numbers until a predefined exit value is entered.
5. Write a script that uses a while statement and the tab escape sequence (\t) to print a table with the following columns: a number, its multiple by 2, its multiple by 10, its square, and its cube.
6. Armstrong numbers are equal to the sum of their digits, each raised to the power of the total number of digits. Therefore, for a three-digit Armstrong number, the sum of the cubes of its digits should be equal to the number itself (e.g., 153 = 1 ^ 3 + 5 ^ 3 + 3 ^ 3 = 1 + 125 + 27 = 153). Based on the above, write a script that displays all three-digit Armstrong numbers between 130 and 140, as well as their breakdown.
7. The factorial of a non-negative integer is written as n! and is defined as n! = n*(n−1)*(n−2)*…*1 for values of n greater than or equal to 1, and as n! = 1 for n = 0. Write a script that accepts a non-negative integer and computes and prints its factorial.
8. Write a script that converts Celsius temperatures to Fahrenheit. The program should print a table displaying all the Celsius temperatures and their Fahrenheit equivalents. (Hint: the formula for the conversion is: F = (9/5) × C + 32.)
9. A company wants to send data (four-digit integers) over the Internet and has requested a script that will encrypt this data. The desired encryption function is the following: each digit should be replaced by a value calculated by adding 7 to it and getting the remainder after dividing the new value by 10. Next, the first digit should be swapped with the third and the second with the fourth. The program should print the resulting encrypted integer.
10. Write a script that reads an encrypted four-digit integer, decrypts it by reversing the encryption scheme of the previous exercise, and prints the result.

2.12.3 Iterations – for Loops

1. Write a script that uses a for statement to display the following patterns:
   (a) * ** *** **** ***** ****** ******* ********
   (b) ********** ********* ******** ******* ****** ***** **** ***
   (c) ********** ********* ******** ******* ****** ***** **** ***
   (d) * ** *** **** ***** ****** ******* ********
2. Write a script that prompts the user to enter a number of integer values and calculates their average. Use a for statement to receive and add up the sequence of integers, based on user input.
3. A mail order house sells five different products with the following codes and retail prices: 001 = $2.98, 002 = $4.50, 003 = $9.98, 004 = $4.49, and 005 = $6.87. Write a script that accepts the following two values from the user: product number and quantity sold. This process must be repeated as long as the user enters a valid code.
The script should use a mapping technique to determine the retail price for each product. Finally, the script should calculate and display the total value of all products sold.

2.12.4 Methods

1. Write a script that uses methods to do the following: (a) continuously accept integers into a two-dimensional list of integers until the user enters an exit value (e.g., 0), (b) find and display the min value for each row and/or column of the list and of the whole list, (c) find and display the max value for each row and/or column of the list and of the whole list, and (d) find and display the average value for each row and/or column of the list and of the whole list.
2. Write a script that uses methods to continuously accept the following details for a series of books: ISBN number, title, author, publication date, and publication company. The details of each book must be stored in five lists associated with the book information categories. The script should accept books until the user enters an ISBN number of 0. Before exiting, the script must print the details of the books.
3. Write a script that uses different methods to print a box, an oval, an arrow, and a diamond on screen. Use the Turtle library for this purpose.
4. Using the Olympic Games logo as a reference, write a Python script that uses the Turtle library and appropriate methods to draw the logo rings, matching the color order and position.
5. Using only the Turtle library methods fillcolor(), begin_fill(), end_fill(), color(), penup(), pendown(), and goto(), write a Python script that uses various methods to draw Figure Exercise 5.
6. Write a Python script that uses appropriate methods and the Turtle library to draw a regular polygon of N sides. The script should use a method to prompt the user to enter the number of sides (N). (Hint: a regular polygon of N sides is the combination of N equilateral triangles.) The figure drawn should look like Figure Exercise 6.

Figure Exercise 5.
Figure Exercise 6.
REFERENCES
Dijkstra, E. W. (1976). A Discipline of Programming. Englewood Cliffs, NJ: Prentice-Hall.
Jaiswal, S. (2017). Python Data Structures Tutorial. DataCamp. https://www.datacamp.com/community/tutorials/data-structures-python.
Knuth, D. E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.

3 Object-Oriented Programming in Python
Ghazala Bilquise and Thaeer Kobbaey
Higher Colleges of Technology
Ourania K. Xanthidou
Brunel University London
DOI: 10.1201/9781003139010-3

CONTENTS
3.1 Introduction
3.2 Classes and Objects in Python
    3.2.1 Instantiating Objects
    3.2.2 Object Data (Attributes)
        3.2.2.1 Instance Attributes
        3.2.2.2 Class Attributes
    3.2.3 Object Behavior (Methods)
        3.2.3.1 Instance Methods
        3.2.3.2 Constructor Methods
        3.2.3.3 Destructor Method
3.3 Encapsulation
    3.3.1 Access Modifiers in Python
    3.3.2 Getters and Setters
    3.3.3 Validating Inputs before Setting
    3.3.4 Creating Read-Only Attributes
    3.3.5 The property() Method
    3.3.6 The @property Decorator
3.4 Inheritance
    3.4.1 Inheritance in Python
        3.4.1.1 Customizing the Sub Class
    3.4.2 Method Overriding
        3.4.2.1 Overriding the Constructor Method
    3.4.3 Multiple Inheritance
3.5 Polymorphism – Method Overloading
    3.5.1 Method Overloading through Optional Parameters in Python
3.6 Overloading Operators
    3.6.1 Overloading Built-In Methods
3.7 Abstract Classes and Interfaces in Python
    3.7.1 Interfaces
3.8 Modules and Packages in Python
    3.8.1 The import Statement
    3.8.2 The from…import Statement
    3.8.3 Packages
    3.8.4 Using Modules to Store Abstract Classes
3.9 Exception Handling
    3.9.1 Handling Exceptions in Python
        3.9.1.1 Handling Specific Exceptions
    3.9.2 Raising Exceptions
    3.9.3 User-Defined Exceptions in Python
3.10 Case Study
3.11 Exercises
3.1 INTRODUCTION
The Object-Oriented Programming (OOP) paradigm is a powerful approach that involves problem solving by means of programming components called classes, and the associated programming objects contained in these classes. This approach aims at the creation of an environment that reflects method structures from the real world. Within the OOP paradigm, variables, and the associated data and methods (see: Chapter 2), are logically grouped into reusable objects belonging to a parent class. This enables a modular approach to programming. Among the most significant benefits of developing software using this paradigm is that the resulting code is easier to implement, interpret, and maintain.

OOP is developed around two fundamental pillars of programming, and four basic principles of how these could be used efficiently. The two pillars are the class and its objects. The four principles are the concepts of encapsulation, abstraction, inheritance, and polymorphism. Although it is true that various other programming techniques and approaches are also applied within the OOP paradigm, they all share the above core components and concepts.

A real-life analogy that demonstrates the class and object relationship is that of a cake recipe. The recipe provides information about the ingredients and the method of how to bake the cake. Using the recipe, several cakes may be baked. In this context, the recipe represents the class, and each cake that is baked using the recipe represents an object. Similarly, in software development, if it is required to store the data of numerous employees, a class that describes the general specifications of an employee is created. This class defines what types of data are required for employees (class properties) and what actions can be performed on the data (class methods). New employees are then created using the class. What is important to note is that the class does not hold any data.
It is simply a template that serves as the model for employee objects of the same kind, alongside any related actions that can be performed on the data. The relation between these two fundamental elements (i.e., class and objects) is illustrated in Figure 3.1.

FIGURE 3.1 Using class Employee to generate the objects Employee1 and Employee2.

In OOP terminology, the process of creating an object based on a specific class is known as instantiation. During instantiation, the created object takes on the properties described in the class. For example, an object named car1 may have properties like make, model, and color, while book1 may have ISBN, title, price and publication_year. Similarly, the methods of the object are the actions or tasks it can perform. Using the same object examples, a car may perform actions like startEngine(), stopEngine() and moveCar(), and a book updatePrice() and calculateDiscount().

In terms of communicating complex OOP structures and ideas, programmers use the Unified Modelling Language (UML), a tool that allows them to draw standardized diagrams that visualize the structure of programs independently of the programming language used for the implementation. The basic building block of UML is the class diagram, a graphical representation of a class as a rectangle with three sections, namely the class name, the class attributes, and the class methods. The basic structure of a class diagram is illustrated in Figure 3.2, and a related example is provided in Figure 3.3.

FIGURE 3.2 Syntax of a class diagram.

FIGURE 3.3 A simple class with its attributes and methods.

The top section of the class diagram contains the class name, which should adhere to the following naming conventions:
• It must be a noun.
• It must be written in singular form.
• It must start with an upper-case letter (upper camel case should be used for multiple words in the class name).
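The class/object relationship and the idea of instantiation described above can be sketched in Python as follows. This is only an illustrative sketch: the Car class, the sample values, and the engine_running attribute are assumptions made for the example, and the constructor syntax it relies on is introduced later in the chapter (Section 3.2.3.2).

```python
# Hypothetical class for illustration; the name is a singular noun in upper camel case
class Car:
    def __init__(self, make, model, color):
        # Properties (attributes) that every Car object stores
        self.make = make
        self.model = model
        self.color = color
        self.engine_running = False  # assumed extra attribute, not taken from the text

    def startEngine(self):
        # Action (method) that changes the state of this particular car only
        self.engine_running = True

    def stopEngine(self):
        self.engine_running = False


# Instantiation: two separate objects are created from the same class
car1 = Car("MakeA", "ModelX", "red")
car2 = Car("MakeB", "ModelY", "blue")

# Starting the engine of car1 does not affect car2
car1.startEngine()
print(car1.make, car1.color, car1.engine_running)
print(car2.make, car2.color, car2.engine_running)
```

The class itself holds no car data; each object created from it keeps its own values for make, model, and color.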
The middle section of the class diagram consists of the class attributes. These should be written using lower-case letters, with compound words separated by an underscore. Optionally, the data type of each attribute can be specified after its name, separated by a colon.

The last section of the class diagram contains the operations or methods of the class. Method names should be verbs and follow the lower camel case naming convention (i.e., the first word is in lower case and the first letters of all subsequent words are in upper case). Similar to attributes, the input and output parameters of the method can be specified. The input parameters are written within the parentheses following the method name. The output parameters are specified at the end of the method, separated by a colon.

Observation 3.1 – Camel Case: The practice of starting each word of a sentence with a capital letter.

Finally, access modifiers, represented with a plus or minus symbol, are used to specify the scope of access of an attribute or method. The plus symbol indicates that the attribute or method is public and can be accessed by any object of any class outside the current one, whereas the minus symbol indicates that the method or attribute is private and can only be accessed from within the current class or its objects.

This chapter covers basic concepts related to the usage of classes and objects, and the four main principles of OOP, namely:
• Encapsulation: The process of wrapping the attributes and methods of the objects of a class in one unit, and managing the access to these attributes and methods.
• Abstraction: The technique used to hide the implementation details of a class, by providing a more abstract view. This allows for the development of a simpler interface, by focusing on what the object does rather than how it does it.
• Inheritance: The mechanism used for the creation of a parent-child relationship between classes, where the child (or sub) class acquires the attributes and the methods of the parent (or super) class, thus eliminating redundant code and facilitating reusability and maintainability.
• Polymorphism: A feature of OOP languages that enables methods to perform different tasks based on the context of the variables used. This is achieved through designated processes like method overriding and overloading.

3.2 CLASSES AND OBJECTS IN PYTHON
Contextualizing the concepts of classes and objects and their relationship is frequently easier through the use of working examples. Consider the common case of developing a simple application that must store employees' data. Every employee is likely to have an employee ID, a first name, a last name, a basic salary, and allowances. The first step toward the implementation of such an application in OOP would be to define a class that holds the appropriate, general specification for all employees. This will be used as a blueprint to create a record for each employee in the application.

In Python, a class is created simply by using the class keyword followed by the name of the class. The name must follow the same naming rules that also apply to variables. However, for clarity purposes, it is recommended that the name of the class is capitalized using the CapWords notation (i.e., the first letter of each word in the class name should be capitalized).

Observation 3.2 – pass: The pass keyword is a line of code that does nothing. It is necessary when defining an empty class since it is required that every class has at least one line of code.

Observation 3.3 – class keyword: Create a class simply by using the class keyword followed by the name of the class. The class name must adhere to the naming conventions of Python for variables and should have the first letter in capital.
The example below creates an empty class with no attributes or methods, and thus no functionality:

    # Define a class with no functionality
    class Employee:
        pass

3.2.1 Instantiating Objects
To instantiate an object means to create a new object using a class as a template. An object is instantiated by assigning the class name (followed by parentheses) to a variable. In the script example provided below, emp1 and emp2 are instances of the Employee class. Note that, in the output of the script, each object reserves a different memory location, as the attributes of the two employees will be stored separately:

    # Define the class
    class Employee:
        pass

    # Create two instances/objects based on the class
    emp1 = Employee()
    emp2 = Employee()

    # Print the default representation (including the memory address)
    # of instances 'emp1' and 'emp2'
    print(emp1)
    print(emp2)

Output 3.2.1:
    <__main__.Employee object at 0x0000026242C487F0>
    <__main__.Employee object at 0x0000026242C483D0>

Observation 3.4 – Creating/Instantiating Objects: An object is created by using the name of the class it belongs to, followed by parentheses.

3.2.2 Object Data (Attributes)
Object data, also known as attributes, are stored in variables. There are two types of attributes in a class, namely instance and class attributes.

Observation 3.5 – Object Data (Attributes): Data that is associated with each instantiated object and is unique to that object. Use the dot notation syntax to access it (e.g., obj.attribute = value).

3.2.2.1 Instance Attributes
An instance attribute contains data associated with each instantiated object, and is therefore unique to that object. Instance attributes are created using the dot notation syntax (obj.attribute = value) and are only accessible by the object associated with them. In the example below, class Employee is used to instantiate objects emp1 and emp2. These objects will store the first and last names, the basic salary, and the allowance of two different employees.
The reader should note the use of the dot notation to assign values to the instance/object attributes, and how the print() function is used to show the first and last names of the two Employee instances/objects:

    # Define the class
    class Employee:
        pass

    # Create two instances/objects based on the class
    emp1 = Employee()
    emp2 = Employee()

    # Provide attributes and assign values to the instances
    emp1.firstName = "Maria"
    emp1.lastName = "Rena"
    emp1.basicSalary = 12000
    emp1.allowance = 5000

    emp2.firstName = "Alex"
    emp2.lastName = "Flora"
    emp2.basicSalary = 15000
    emp2.allowance = 5000

    # Print the objects and their attributes
    print(emp1.firstName, emp1.lastName)
    print(emp2.firstName, emp2.lastName)

Output 3.2.2.1:
    Maria Rena
    Alex Flora

3.2.2.2 Class Attributes
While instance attributes are specific to each individual object, class attributes belong to the class itself, and are thus shared among all instances of the class.

Observation 3.6 – Class Attribute: Data that belongs to the class and has its values shared among each object instantiated through the class. Define it the same way as a simple variable.

Observation 3.7: It is recommended to use lower-case letters when naming attributes. If an attribute name has more than one word, use lower case for the first word and capital first letters for the rest, all combined in one word.

In the following example, the class attribute bonusPercent is defined within the scope of the Employee class. Unlike instance attributes firstName and lastName, which take unique values for each of the two employees (i.e., emp1 and emp2), class attribute bonusPercent is common to both employees:
    class Employee:
        # Define the class attribute
        bonusPercent = 0.2

    # Define and create the 'emp1' instance
    emp1 = Employee()
    emp1.firstName = "Maria"
    emp1.lastName = "Rena"

    # Define and create the 'emp2' instance
    emp2 = Employee()
    emp2.firstName = "Alex"
    emp2.lastName = "Flora"

    # Print class attribute
    print(Employee.bonusPercent)

    # Each instance is associated with the same class attribute value
    print(emp1.firstName, emp1.lastName, emp1.bonusPercent)
    print(emp2.firstName, emp2.lastName, emp2.bonusPercent)

    # Accessing the class attribute by using the class name
    Employee.bonusPercent = 0.3
    print(Employee.bonusPercent)

    # Accessing the class attribute by using the instance name
    print(emp1.bonusPercent)
    print(emp2.bonusPercent)

    # Accessing the dictionary of the class and its objects
    print(emp1.__dict__)
    print(emp2.__dict__)
    print(Employee.__dict__)

Output 3.2.2.2:
    0.2
    Maria Rena 0.2
    Alex Flora 0.2
    0.3
    0.3
    0.3
    {'firstName': 'Maria', 'lastName': 'Rena'}
    {'firstName': 'Alex', 'lastName': 'Flora'}
    {'__module__': '__main__', 'bonusPercent': 0.3, '__dict__': <attribute '__dict__' of 'Employee' objects>, '__weakref__': <attribute '__weakref__' of 'Employee' objects>, '__doc__': None}

In terms of declaration and value assignments, a class attribute is treated as any other regular variable within the class, in contrast to instance attributes where the dot notation is used. It is accessed by using the name of the class to which it belongs followed by the attribute name:

    <className>.<attribute_name> = value

When a class attribute is accessed through the name of an instantiated object, Python firstly checks if that attribute is available in that particular object, and if not, whether it is available in the associated class or any super class the object inherits from (see Section: 3.4.1 Inheritance in Python).
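The lookup order described above (object first, then class) can be observed with a short experiment. The sketch below reuses the Employee class and the bonusPercent attribute from the text; the value 0.5 is an arbitrary value chosen for illustration.

```python
class Employee:
    # Class attribute, shared by all instances
    bonusPercent = 0.2


emp1 = Employee()
emp2 = Employee()

# Neither object has its own 'bonusPercent', so the lookup falls back to the class
print(emp1.bonusPercent, emp2.bonusPercent)

# Assigning through the object creates an instance attribute on emp1 only;
# the class attribute itself is not modified
emp1.bonusPercent = 0.5
print(emp1.bonusPercent)      # found on the object itself
print(emp2.bonusPercent)      # still resolved through the class
print(Employee.bonusPercent)  # class attribute is unchanged

# Removing the instance attribute makes the class attribute visible again
del emp1.bonusPercent
print(emp1.bonusPercent)
```

A practical consequence is that a value meant to be shared by all objects should be changed through the class name (Employee.bonusPercent = ...), since an assignment through an object only affects that object.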
There is a simple way to determine whether an attribute belongs to an object or to the class used to instantiate it. Every Python object contains a special attribute called __dict__ (i.e., dictionary), which includes references to all the attributes within this object. Using the previous example, if __dict__ is called for emp1 and emp2 it will not include the bonusPercent class attribute. On the contrary, this will be the case if it is called for the Employee class.

Observation 3.8: Call the __dict__ attribute on any object to find the attributes that belong to that particular object.

3.2.3 Object Behavior (Methods)
A method is a structured block of code that is associated with an object. It is defined in a class and contains code that performs specific tasks using data from either the class itself or the objects instantiated from the class. Methods must have a distinct name, and may or may not take parameters or return values. All instance methods in a class must include an essential first parameter, usually named self, that references the current object instance. It is important to note that self is not a reserved word. Any variable name may be used to reference the object, as long as it follows the Python variable naming rules.

3.2.3.1 Instance Methods
An instance method, just like an instance attribute, is specific to a particular object rather than the class used to instantiate it. It is, thus, invoked for each separate object, and uses the data of the object that invoked it. Instance methods are defined within a class and include the mandatory self parameter. However, passing the self parameter to the method is not required when calling the method.

Observation 3.9 – Instance Method: Defined as any other method but includes the self parameter as one of its arguments.
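The two points made above about self (it is not a reserved word, and it is not passed explicitly when a method is called) can be illustrated with the following minimal sketch. The method names setSalary and getSalary and the parameter name this are hypothetical and chosen only for the example.

```python
class Employee:
    # 'this' is used instead of 'self' only to show that the name is not reserved;
    # in practice, 'self' is the name used by convention
    def setSalary(this, amount):
        this.salary = amount

    def getSalary(self):
        return self.salary


emp1 = Employee()

# The object before the dot is passed automatically as the first parameter
emp1.setSalary(15000)
print(emp1.getSalary())

# Equivalent explicit form: call the method through the class and pass the object
print(Employee.getSalary(emp1))
```

Both print statements display the same salary, since emp1.getSalary() is simply a shorter way of writing Employee.getSalary(emp1).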
In the following Python example, instance method printDetails(self) is defined in the Employee class and called twice to print each of the two employees' data (i.e., firstName, lastName, and salary). It does not accept any arguments other than self, and it displays the required information utilizing the attributes of the particular object it is associated with. Instance method calculateBonus(self, bonusPercent) collects data from the attribute of the associated object, calculates the bonus for the employee, and returns the result, which is then printed. The reader should note that instance methods are defined like regular methods, with the exception that they are called using dot notation, which associates each call with a specific object:

    # Define the class
    class Employee:
        # Define the 'printDetails' method
        def printDetails(self):
            print("Employee Name", self.firstName, self.lastName,
                  "earns", self.salary)

        # Define the 'calculateBonus' method
        def calculateBonus(self, bonusPercent):
            return self.salary * bonusPercent

    # Create the two objects and print their attributes
    emp1 = Employee()
    emp1.firstName = "Maria"
    emp1.lastName = "Rena"
    emp1.salary = 15000
    emp1.printDetails()
    print("Bonus amount is", emp1.calculateBonus(0.2))

    emp2 = Employee()
    emp2.firstName = "Alex"
    emp2.lastName = "Flora"
    emp2.salary = 18000
    emp2.printDetails()
    print("Bonus amount is", emp2.calculateBonus(0.2))

Output 3.2.3.1.a:
    Employee Name Maria Rena earns 15000
    Bonus amount is 3000.0
    Employee Name Alex Flora earns 18000
    Bonus amount is 3600.0

From a structural and logical viewpoint, class attributes and instance methods can be used strategically to further improve the efficiency and clarity of the code. For instance, the class used in the previous examples can be further improved by introducing the following change.
Since bonusPercent is the same for both employees, its value can be stored in a class attribute and be shared among all the instances of the class. In this case, calling the instance method is simplified, as it is no longer necessary to pass any parameters as method arguments. Instead, instance or class attributes can be accessed directly, as shown in the example below:

    # Define the class
    class Employee:
        # Define a class attribute common for all objects
        bonusPercent = 0.2

        # Define an instance method that takes no arguments
        def calculateBonus(self):
            return self.salary * Employee.bonusPercent

    # Create two objects and an instance attribute
    emp1 = Employee()
    emp1.salary = 15000
    emp2 = Employee()
    emp2.salary = 18000

    # Print using the instance method and the class attribute
    print("Bonus amount is", emp1.calculateBonus(),
          "calculated at", Employee.bonusPercent)
    print("Bonus amount is", emp2.calculateBonus(),
          "calculated at", Employee.bonusPercent)

    # Change the value of the class attribute
    Employee.bonusPercent = 0.3

    # Print again using the instance method and the changed class attribute
    print("Bonus amount is", emp1.calculateBonus(),
          "calculated at", Employee.bonusPercent)
    print("Bonus amount is", emp2.calculateBonus(),
          "calculated at", Employee.bonusPercent)

Output 3.2.3.1.b:
    Bonus amount is 3000.0 calculated at 0.2
    Bonus amount is 3600.0 calculated at 0.2
    Bonus amount is 4500.0 calculated at 0.3
    Bonus amount is 5400.0 calculated at 0.3

3.2.3.2 Constructor Methods
A constructor is a special method used to initialize the data of an object. In Python, constructors are implemented using the __init__() method. This method is automatically invoked whenever a new instance of the class is created. If not explicitly defined, Python assumes a default constructor with no implementation details. It is important to note that a constructor does not return any value.

Observation 3.10 – Constructor Method: Defined either automatically or by using the __init__() method. It is invoked automatically when a new instance of a class is created. It can be used to initialize the data of the new object or to perform any other task necessary. It can take arguments with or without default values. It does not return any value.

The programmer can optionally define constructors other than the default one. A user-defined constructor is created by defining the __init__() method within the class. Like all methods in a class, it takes a self argument that references the current object. The syntax of the __init__() method is the following:

    def __init__(self [, arguments])

User-defined constructors can be one of three different types, depending on whether and how they take arguments. The first is the simple constructor, which takes no arguments. The following Python script presents such a case, where the constructor takes no arguments and prints a default text message. Notice that every time a new object is instantiated the message is displayed:

    # Define the class
    class Employee:
        # Simple constructor takes no arguments, prints message
        def __init__(self):
            print("Object created")

    # Every time a new object is created the constructor is called and
    # the message is displayed
    emp1 = Employee()
    emp3 = Employee()

Output 3.2.3.2.a:
    Object created
    Object created

A constructor that takes no arguments may also be used to initialize instance attributes with default values.
In the following example, when a new Employee object is created, instance attributes salary and allowances are set to a default value of 0:

    # Define the class
    class Employee:
        # Define a constructor that takes no arguments
        # but initializes the values of the instance attributes
        def __init__(self):
            self.salary = 0
            self.allowances = 0

    # Every time a new object is created the constructor is called
    # and the instance attributes are set to 0
    emp1 = Employee()
    emp1.salary = 15000

    # Print the instance attributes of the object.
    # The default allowances value is printed
    print(emp1.salary, emp1.allowances)

    # Change the value of the allowances attribute
    emp1.allowances = 3000

    # Print the instance attributes of the object after the value
    # of allowances is changed
    print(emp1.salary, emp1.allowances)

Output 3.2.3.2.b:
    15000 0
    15000 3000

The second constructor type accepts parameters as arguments. It is used when initialization of the attributes of the new object involves the assignment of specific values rather than the default ones.
To highlight this, in the following example, a list of the arguments used to initialize the attributes of the object is provided after the mandatory self parameter:

    # Define the class
    class Employee:
        # Define the constructor with four arguments
        def __init__(self, first, last, salary, allowances):
            # Initialize instance attributes: use values of arguments
            self.firstName = first
            self.lastName = last
            self.salary = salary
            self.allowances = allowances

    # Create a new object with specific instance attribute values
    emp1 = Employee("Maria", "Rena", 15000, 3000)

    # Print the object's attributes
    print(emp1.firstName, emp1.lastName, emp1.salary, emp1.allowances)

Output 3.2.3.2.c:
    Maria Rena 15000 3000

Python does not support method overloading: if several __init__() methods are defined in the same class, only the last definition is kept. Consequently, a class cannot have multiple constructors. Additionally, if a user-defined constructor is provided, it is no longer possible to use the default constructor in order to create a new object with no parameters. This limitation can be overcome by means of the third constructor type, which is used to accept arguments with default values. This allows the programmer to initialize the associated object with or without values.

This constructor type is illustrated in the following example. When emp1 is instantiated, the constructor is invoked without any parameter values. In contrast, in the case of emp2, it is invoked with predefined parameter values, which are assigned to the respective instance attributes.
Once both objects are instantiated, the instance attributes of both emp1 and emp2 are accessed and printed using regular dot notation:

    # Define the class
    class Employee:
        # Define a constructor that takes four arguments with
        # default empty values (None) if no values are passed
        def __init__(self, first=None, last=None, salary=None,
                     allowances=None):
            if (first is not None and last is not None
                    and salary is not None and allowances is not None):
                self.firstName = first
                self.lastName = last
                self.salary = salary
                self.allowances = allowances
                print("Object initialized with supplied values")
            else:
                self.salary = 0
                self.allowances = 0
                print("Object initialized with default values")

    # Create a new object invoking the constructor with no parameters
    emp1 = Employee()
    emp1.firstName = "Alex"
    emp1.lastName = "Flora"
    print(emp1.firstName, emp1.lastName, emp1.salary, emp1.allowances)

    # Create a new object invoking the constructor with parameters
    emp2 = Employee("Maria", "Rena", 15000, 5000)
    print(emp2.firstName, emp2.lastName, emp2.salary, emp2.allowances)

    # Change and reprint the value of instance attribute of 'emp2'
    emp2.salary = 20000
    print(emp2.firstName, emp2.lastName, emp2.salary, emp2.allowances)

Output 3.2.3.2.d:
    Object initialized with default values
    Alex Flora 0 0
    Object initialized with supplied values
    Maria Rena 15000 5000
    Maria Rena 20000 5000

3.2.3.3 Destructor Method
Destructors are special methods invoked at the end of the lifecycle of objects, when they must be deleted. In Python, destructors are implemented using the __del__() method, and are invoked when all references to an object have been deleted. The following Python script provides an example of two objects (i.e., emp1 and emp2) firstly being created and then destroyed:

Observation 3.11 – Destructor Method: Defined by using the __del__() method. It is invoked when an instance/object is not needed anymore and is about to be deleted. The method takes no arguments other than self, and returns no values.

    # Define the class
    class Employee:
        # Define the default constructor that only prints a message
        def __init__(self):
            print("Employee created")

        # Destructor is invoked when the object is destroyed; prints a message
        def __del__(self):
            print("Employee deleted")

    # Constructor automatically invoked to create 'emp1' and 'emp2'
    emp1 = Employee()
    emp2 = Employee()

    # Remove the only references to 'emp1' and 'emp2'.
    # The destructor method is called for each object
    del emp1
    del emp2

Output 3.2.3.3:
    Employee created
    Employee created
    Employee deleted
    Employee deleted

3.3 ENCAPSULATION
Encapsulation is one of the four basic principles of Object-Oriented Programming. It is based on the idea of wrapping up the attributes and methods in a class and controlling access to them from the objects/instances created. Rather than allowing unrestricted direct access, access modifiers are used to dictate and control how the instance attributes can be accessed.

Observation 3.12 – Encapsulation: Wrapping up the attributes and methods in a class and controlling access when instantiating new objects/instances.

3.3.1 Access Modifiers in Python
As mentioned, objects store data in attributes. Appropriate protective measures ensure that this data is accessed and modified in a controlled way. In general, OOP languages provide access modifiers that specify how an attribute or method can be accessed. There are three main types of access modifiers:

Observation 3.13 – Access Modifiers: Access modifiers control how the instance attributes can be accessed. Access modifiers can be public with no special notation needed, private denoted by double underscore (__), or protected denoted by single underscore (_).

• Public: Attribute/method can be accessed by any class or program without any restrictions.
• Private: Attribute/method can be accessed only within the container class.
• Protected: Attribute/method can be accessed within the container class and its sub-classes.

By default, all attributes and methods in Python are public. Instead of using special keywords to specify whether an attribute is public, private, or protected, Python uses a special naming convention to signal the intended access level. An attribute with a single underscore prefix (_) denotes a protected attribute, while a double underscore prefix (__) denotes a private attribute. As mentioned, the absence of a prefix denotes the default, public modifier. Note that the single underscore is a convention only and is not enforced by the interpreter, whereas the double underscore triggers name mangling, which makes the attribute harder, though not impossible, to reach from outside the class.

3.3.2 Getters and Setters

When defining a class, it is good programming practice to control the access to instance attributes by means of two special types of methods commonly referred to as getters and setters. Many OOP languages use such methods to implement the principle of encapsulation. A getter is a method that reads (gets) the value of an attribute, while a setter writes (sets) it. Using getters and setters to access object attributes ensures that the data is protected (i.e., encapsulated).

Observation 3.14 – Getters and Setters: Used to implement encapsulation. Setters are used to store data into private instance attributes whereas getters are used to read that data.

The benefits of using these special methods are the following:

• Ensuring validation when reading or writing attribute data.
• Setting different access levels for the class attributes.
• Preventing direct manipulation of the attribute data.

In the Python example below, the Employee class uses setFirstName(), a setter method, to store data in a private attribute of the object (denoted by the double underscore prefix), while getter method getFirstName() is used to read and print the employee's first name. As the attribute is private, it can be accessed only through the methods defined in the class. Getter and setter methods should be used for all instance attributes defined in the class.
In other words, for every instance attribute, it is recommended that the associated getter and setter methods are provided. The reader should also notice the use of the self parameter with all methods, as it provides the reference to the current object being used:

    # Define the class
    class Employee:

        # Define the getter method to read private attribute '__first'
        def getFirstName(self):
            return self.__first

        # Setter method writes to private attribute '__first'
        def setFirstName(self, value):
            self.__first = value

    # Create object emp1
    emp1 = Employee()

    # Use the setter to store new data in the private attribute
    emp1.setFirstName("George")

    # Getter reads the data from the private attribute and prints it
    print(emp1.getFirstName())

Output 3.3.2:

    George

In this context, if the print(emp1.getFirstName()) command is replaced by print(emp1.__first) in an attempt to access the private instance attribute directly, an error (AttributeError) will occur, since name mangling stores the attribute under a different name (_Employee__first).

3.3.3 Validating Inputs before Setting

As discussed, getter and setter methods shield the data values of private instance attributes. In addition, they also provide data validation functionality. As an example, if the value of private instance attribute __firstName should not exceed 15 characters in length, and __salary should be a number between 0 and 20,000, the associated validation code can be added to the setter methods of the attributes. Similarly, if it is necessary to format the output in a particular way, the associated code could be added to the getter methods.

Observation 3.15 – Validating Data: Use getters and setters to validate data stored in the private attributes and to format data appropriately before it is used as output.

The following script provides a class example demonstrating this concept:
    # Define the class
    class Employee:

        # Define a setter for private attribute '__firstName'.
        # Store the value only if its length does not exceed 15 characters
        def setFirstName(self, value):
            if len(value) <= 15:
                self.__firstName = value

        # Define a getter for private attribute '__firstName'.
        # Return the data with an appropriate message (as a tuple)
        def getFirstName(self):
            return "The first name is :", self.__firstName

        # Define a setter for private attribute '__salary'.
        # Check attribute value; store it if it is between 0 and 20000
        def setSalary(self, value):
            if 0 < value < 20000:
                self.__salary = value

        # Define a getter for private attribute '__salary'.
        # Return the data with an appropriate message (as a tuple)
        def getSalary(self):
            return "The salary is ", self.__salary

    # Create a new object and call its setters
    # to validate and store values in its attributes
    emp1 = Employee()
    emp1.setFirstName("John")
    emp1.setSalary(17000)

    # Attribute getters return stored values and associated messages
    print(emp1.getFirstName(), emp1.getSalary())

    # Repeat the previous tasks with an invalid first name entry.
    # Notice: no change takes place in the '__firstName' attribute
    emp1.setFirstName("Check to see if more than 15 characters are stored")
    emp1.setSalary(19000)
    print(emp1.getFirstName(), emp1.getSalary())

    # Repeat the previous tasks with an invalid salary entry.
    # Notice: no change takes place in the '__salary' attribute
    emp1.setFirstName("George")
    emp1.setSalary(21000)
    print(emp1.getFirstName(), emp1.getSalary())

Output 3.3.3:

    ('The first name is :', 'John') ('The salary is ', 17000)
    ('The first name is :', 'John') ('The salary is ', 19000)
    ('The first name is :', 'George') ('The salary is ', 19000)

3.3.4 Creating Read-Only Attributes

Getter and setter methods may also be used to create read-only or write-only attributes.
For example, attribute age may be designated as read only, since it should be calculated using the value of attribute dateOfBirth. In this case, age will require a getter but no setter method, thus allowing the user to read the age value but not to update it.

Observation 3.16 – Creating Read-Only Attributes: Use getters with no setters to create and output the values of read-only attributes, whose data are calculated using private attributes.

In the following example, class Employee defines instance attributes for employees' first and last names, and the corresponding getter and setter methods. The class also defines attributes for the employees' emails and full names, which as read-only attributes do not have setter methods. In this case, the values of these attributes are constructed when they are being read using the getter method:

    # Define the class
    class Employee:

        # The getter and setter methods for the first name
        def getFirstName(self):
            return self.__first

        def setFirstName(self, value):
            self.__first = value

        # The getter and setter methods for the last name
        def getLastName(self):
            return self.__last

        def setLastName(self, value):
            self.__last = value

        # Read-only attributes with only a getter method
        def getEmail(self):
            return self.__first + "." + self.__last + "@company.com"

        def getFullName(self):
            return self.__first + " " + self.__last

    # Create a new 'Employee' object
    emp1 = Employee()

    # Setters store values in the private instance attributes
    emp1.setFirstName("George")
    emp1.setLastName("Davies")

    # Print the read-only attributes
    print(emp1.getFullName(), emp1.getEmail())

Output 3.3.4:

    George Davies George.Davies@company.com

3.3.5 The property() Method

In the example presented below, methods getFirstName() and setFirstName() are used to read from, and write to, private attribute __first. In order to make this particular example more user-friendly, the getter and setter methods could be automatically called when accessing the attribute, using the dot notation (i.e., <obj>.<property>). The property() method provides the necessary interface by encapsulating the getter and setter methods, which are invoked when reading from, or writing to, the property.

Observation 3.17 – Property Method: Use it to encapsulate the getter and setter methods in a single interface that facilitates access to a private attribute using simply the dot notation.

The method syntax is the following:

    property_name = property(gettermethod, settermethod)

After defining the property method, the attribute is accessed using the dot notation on the property name (<obj>.<property>) instead of invoking the getter and setter methods directly:

    # Define the class
    class Employee:

        # Define the getter method
        def getFirstName(self):
            return self.__first

        # Define the setter method
        def setFirstName(self, value):
            self.__first = value

        """ Use the property method to encapsulate the getter
        and setter in a single method interface """
        firstName = property(getFirstName, setFirstName)

    # Create the 'emp1' object
    emp1 = Employee()

    """ Use dot notation to invoke the setter and getter methods
    through the property interface """
    emp1.firstName = "George"
    print(emp1.firstName)

Output 3.3.5:

    George

3.3.6 The @property Decorator

Another way to define attributes in Python is to use the @property decorator, which is built on the property() method.

Observation 3.18 – The @property Decorator: It provides the functionality of the property method through decorator syntax.

In the example below, @property defines the firstName attribute by using two different methods that share the property name. The firstName(self) method is decorated with the @property decorator, indicating that the method is a getter.
Accordingly, the firstName(self, value) method is decorated with @firstName.setter, indicating that this is a setter. With this structure in place, the attribute can be accessed by using its property name with the dot notation, without explicitly calling the getter and setter methods:

    # Define the class
    class Employee:

        # Use the property decorator to define the getter method
        @property
        def firstName(self):
            return self.__first

        # Use the property decorator to define the setter method
        @firstName.setter
        def firstName(self, value):
            self.__first = value

    # Create the 'emp1' object
    emp1 = Employee()

    # Access private attribute '__first' through property name 'firstName'
    emp1.firstName = "George"
    print(emp1.firstName)

Output 3.3.6:

    George

3.4 INHERITANCE

Inheritance is one of the four main principles of OOP. It allows the programmer to extend the functionality of a class by creating a parent-child relationship between classes. In such a relationship, the child (also called sub or derived class) inherits from the parent (also called super or base class). The reader should note that these terms may be used interchangeably in this chapter, based on the context of each discussion.

Observation 3.19 – Inheritance: Allows the extension of the functionality of a parent/super/base class, by creating a child/sub/derived class that inherits its attributes and behavior.

Inheritance is extremely useful, as it facilitates code reusability, thus minimizing code and making it easier to maintain. An important concept relating to child classes is that they may have their own new attributes and methods, and can optionally override the functionality of the respective parent class.
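As a minimal sketch of the parent-child relationship described above, the following script uses two hypothetical classes, Person and Student, which are assumptions made for illustration and are not part of the chapter's Employee example. The child class inherits one method unchanged, overrides another, and adds a method of its own, while the built-in functions isinstance() and issubclass() confirm the relationship:

```python
# Hypothetical parent (super/base) class, used only for illustration
class Person:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return self.name + " is a person"

    def greet(self):
        return "Hello, " + self.name


# Hypothetical child (sub/derived) class that inherits from 'Person'
class Student(Person):

    # Override the inherited 'describe' method
    def describe(self):
        return self.name + " is a student"

    # New method that exists only in the child class
    def study(self):
        return self.name + " is studying"


student1 = Student("Maria")
print(student1.greet())      # Inherited unchanged from 'Person'
print(student1.describe())   # Overridden in 'Student'
print(student1.study())      # Defined only in 'Student'

# A 'Student' object is also a 'Person', and 'Student' derives from 'Person'
print(isinstance(student1, Person), issubclass(Student, Person))
```

The constructor is not redefined in Student, so the one inherited from Person is used when student1 is created.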
3.4.1 Inheritance in Python

The Python syntax for implementing the concept of inheritance is the following:

    class Parent:
        Parent class definition

    class Child(Parent):
        Child class definition

As a practical example of inheritance, the reader can consider two classes, a super class named Employee and a sub class named SalesEmployee (Figure 3.4). Instead of creating the general attributes of SalesEmployee (e.g., first name, last name, salary, or allowances) from scratch, they can be inherited from Employee. Accordingly, the sub class can also inherit the setters and getters, and generally all the functionality of the Employee class. Additional attributes that may be unique to SalesEmployee (e.g., commission rate) can also be added to the inherited ones, as required.

FIGURE 3.4 Parent-child relationship between classes.

The implementation of this particular example of super class Employee and sub class SalesEmployee is presented in the Python script examples below. In the first script, Employee class is defined with private attributes __first, __last, __salary, and __allowances, and method getTotalSalary(). In the second, SalesEmployee class is created as an empty class, hence the use of the pass keyword. Private attributes and the method are inherited from the Employee class.
Note that the name of super class Employee is passed to SalesEmployee as an argument:

    # Define class 'Employee' and its private attributes and method
    class Employee:

        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self.__salary = salary
            self.__allowances = allowances

        def getTotalSalary(self):
            return self.__salary + self.__allowances

    # Create object 'emp1' and print the total salary of the current employee
    emp1 = Employee("George", "White", 16000, 5200)
    print(emp1.getTotalSalary())

Output 3.4.1.a:

    21200

    # Define sub class 'SalesEmployee' based on super class 'Employee'
    class SalesEmployee(Employee):
        pass

    """ Create a new object of the sub class that inherits
    attributes and behavior from the super class """
    semp1 = SalesEmployee("Alex", "Flora", 12000, 4000)
    print(semp1.getTotalSalary())  # Method of the super class is invoked

Output 3.4.1.b:

    16000

When the semp1 object is instantiated, Python scans SalesEmployee for an initialization method (i.e., __init__()). If this is not found, it scans and executes the initialization method of the super class (i.e., Employee), with the parameters associated with the current object. Similarly, when getTotalSalary() is invoked for object semp1, the method is called from the super class, since it does not exist in the sub class. The same order of resolution is followed for all methods and attributes in the sub class.

3.4.1.1 Customizing the Sub Class

Observation 3.20 – Customize Sub Classes: Add attributes and/or methods to sub classes to extend their behavior beyond that of the super class. Using the added behavior on objects of the super class will raise an error. Attributes of the super class that will be used in the sub class need to be declared as protected.

As mentioned, sub classes can be further customized by adding new attributes and methods. For instance, in the case of sub class SalesEmployee this can be done by adding attribute commissionPercent. The reader should note that attempting to use the added attribute for an object that belongs to the Employee class will raise an error. This is because there is no such attribute or method in the super class. It is also worth noting that in order to be able to use super class attributes salary and allowances from the sub class, they should be declared as protected (single underscore) instead of private, since name mangling ties private attributes to the class that defines them. The following script demonstrates these concepts:

    # Define class 'Employee'
    class Employee:

        """ Define the constructor of the class with parameters.
        Define the attributes of the class """
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self._salary = salary
            self._allowances = allowances

        # Define a derived attribute
        def getTotalSalary(self):
            return self._salary + self._allowances

    # Define the 'SalesEmployee' sub class
    class SalesEmployee(Employee):

        # Use the property decorator to define the getter method
        @property
        def commissionPercent(self):
            return self.__comm

        # Use the property decorator to define the setter method
        @commissionPercent.setter
        def commissionPercent(self, value):
            self.__comm = value

    # Create and use object 'emp1' based on super class 'Employee'
    emp1 = Employee("Maria", "Rena", 15000, 5000)
    print(emp1.getTotalSalary())

    # Create and use object 'semp1' based on sub class 'SalesEmployee'
    semp1 = SalesEmployee("Alex", "Flora", 16000, 6000)

    # The attribute is set in the sub class
    semp1.commissionPercent = 0.05
    print(semp1.commissionPercent)

    """ The next line generates an error since its
    attribute only exists in the sub class """
    print(emp1.commissionPercent)

    # Print the attributes of objects 'emp1' and 'semp1'
    print(semp1.__dict__)
    print(emp1.__dict__)
Output 3.4.1.1:

    20000
    0.05
    AttributeError                  Traceback (most recent call last)
    <ipython-input-9-0e8e58d5eaf8> in <module>
         40 """ The next line generates an error since its
         41 attribute only exists in the sub class """
    ---> 42 print(emp1.commissionPercent)
         43
         44 # Print the attributes of objects 'emp1' and 'semp1'

    AttributeError: 'Employee' object has no attribute 'commissionPercent'

3.4.2 Method Overriding

Method overriding is another important programming feature that is common in OOP languages. It allows a sub class to contain a method with a different implementation than the one inherited from the super class. In the context of the previous examples, the programmer may wish to compute the total salary of a sales employee by adding commissions to their salary and allowances. In this case, sub class method getTotalSalary() must be implemented differently to the original one inherited from Employee. As shown in the following example, super class method getTotalSalary() is called, by means of super(), from within the implementation of sub class method getTotalSalary():

    # Define class 'Employee'
    class Employee:

        # Define the constructor and the attributes of the super class
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self._salary = salary
            self._allowances = allowances

        # Define 'getTotalSalary'
        def getTotalSalary(self):
            return self._salary + self._allowances

    # Define sub class 'SalesEmployee'
    class SalesEmployee(Employee):

        # Use the property decorator to define the getter method
        @property
        def commissionPercent(self):
            return self.__comm

        # Use the property decorator to define the setter method
        @commissionPercent.setter
        def commissionPercent(self, value):
            self.__comm = value

        # Sub class method overrides the method of the parent class
        def getTotalSalary(self):
            return super().getTotalSalary() + \
                (super().getTotalSalary() * self.__comm)

    # Create and use object 'emp1' based on super class 'Employee'
    emp1 = Employee("Maria", "Rena", 15000, 5000)
    print(emp1.getTotalSalary())

    # Create and use object 'semp1' based on sub class 'SalesEmployee'
    semp1 = SalesEmployee("Alex", "Flora", 16000, 6000)

    # Set the attribute in the sub class
    semp1.commissionPercent = 0.05

    # Invoke the overridden method from the sub class
    print(semp1.getTotalSalary())

Output 3.4.2:

    20000
    23100.0

3.4.2.1 Overriding the Constructor Method

The concept of method overriding is also used to create customized constructors in the sub class. In this case, the super() function is used to invoke the __init__() method of the super class, as shown in the following script:

Observation 3.21 – Constructor Overriding: Call the __init__() method of the super class to access the constructor and add attributes to extend it.

    # Define class 'Employee'
    class Employee:

        # Define the constructor of the super class and its attributes
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self._salary = salary  # Protected attribute
            self.__allowances = allowances

        # Define the getter of the class
        def getTotalSalary(self):
            return self._salary + self.__allowances

    # Define sub class 'SalesEmployee'
    class SalesEmployee(Employee):

        """ Define the constructor of the sub class adding the 'comm'
        attribute. Call the 'init' method of the super class """
        def __init__(self, first, last, salary, allowances, comm):
            super().__init__(first, last, salary, allowances)
            self.__comm = comm

        # Access protected attribute '_salary' from the sub class
        def getTotalSalary(self):
            return super().getTotalSalary() + (self._salary * self.__comm)

    # Create and use object 'emp1' based on the super class
    emp1 = Employee("Maria", "Rena", 15000, 5000)
    print(emp1.getTotalSalary())

    # Create and use object 'semp1' based on the sub class
    semp1 = SalesEmployee("Alex", "Flora", 16000, 6000, 0.05)
    print(semp1.getTotalSalary())  # Method of the child class is invoked

Output 3.4.2.1:

    20000
    22800.0

3.4.3 Multiple Inheritance

Sub classes can inherit attributes and methods from multiple super classes, a concept known as multiple inheritance.

Observation 3.22 – Multiple Inheritance: The concept of having a sub class inheriting from more than one super class.

In Python, this can be implemented using the following syntax:

    class Parent1:
        pass

    class Parent2:
        pass

    class Child(Parent1, Parent2):
        pass

As an example of multiple inheritance, Figure 3.5 presents a structure consisting of two super classes (Person and Employee) and one sub class (Manager) that inherits from both super classes.

FIGURE 3.5 A representation of multiple inheritance between three classes.

The following Python script implements this structure. The reader should note that the constructor in the Manager class calls the respective constructors of both super classes during initialization.
Methods getFullName and getContact are inherited from super class Person, while getAnnualSalary and getDepartment are inherited from Employee:

    # Define the first super class ('Person')
    class Person:

        # Define class constructor and attributes
        def __init__(self, firstName, lastName, contact):
            self.__firstName = firstName
            self.__lastName = lastName
            self.__contact = contact

        # Getter for the first & last name of the first super class
        def getFullName(self):
            return "Employee name is: " + self.__firstName + " " \
                + self.__lastName

        # Define the getter for the contact of the first super class
        def getContact(self):
            return "Contact number is: " + self.__contact

    # Define the second super class ('Employee')
    class Employee:

        # The constructor & the attributes of the second super class
        def __init__(self, salary, dept):
            self.__salary = salary
            self.__dept = dept

        # Define the getter for the salary of the second super class
        def getAnnualSalary(self):
            return "The annual salary is: " + str(self.__salary * 12)

        # The getter for the department of the second super class
        def getDepartment(self):
            return "The employee belongs to the department: " + \
                self.__dept

    # Define sub class 'Manager' inheriting from both 'Person' and 'Employee'
    class Manager(Person, Employee):

        def __init__(self, firstName, lastName, contact, salary, dept):
            Person.__init__(self, firstName, lastName, contact)
            Employee.__init__(self, salary, dept)

    # Create and use a new instance of the 'Manager' class
    mgr1 = Manager("Maria", "Rena", "0123456789", 14500, "Marketing")

    # Call inherited behaviour from super class 'Person'
    print(mgr1.getFullName())
    print(mgr1.getContact())

    # Call inherited behaviour from super class 'Employee'
    print(mgr1.getAnnualSalary())
    print(mgr1.getDepartment())

Output 3.4.3:

    Employee name is: Maria Rena
    Contact number is: 0123456789
    The annual salary is: 174000
    The employee belongs to the department: Marketing

3.5 POLYMORPHISM – METHOD OVERLOADING

Another powerful feature of OOP languages is the support of method overloading. This is a fundamental element of polymorphism, the option of defining and using two or more methods with the same name but different parameter lists or signatures. Overloading a method improves code readability and maintainability, as implementation is divided into multiple methods instead of being concentrated into a single, complex one.

Observation 3.23 – Polymorphism/Method Overloading: The concept of using method overloading to implement two or more methods with the same name but different signatures.

While method overloading is a prominent feature in many OOP languages, such as Java and C++, it is not entirely supported in Python. Python is a dynamically typed language and datatype binding occurs at runtime. This is known as late binding and it differs from the static binding used in languages like Java and C++, in which the overloaded method to be invoked is resolved at compile time based on the arguments supplied. In Python, when multiple methods with the same name are defined, the last definition overrides all previous ones. As an example, consider method calculateTotalSalary() in the Employee class. The method computes the total salary of the employee without the bonus. A second method that calculates the total salary plus the bonus can be implemented with the same name, in an attempt to overload calculateTotalSalary().
In this case, the first definition is replaced by the second, and any call that omits the bonus argument will raise an error, as shown in the following example:

    # Define class 'Employee'
    class Employee:

        # Define method 'calculateTotalSalary'
        def calculateTotalSalary(self):
            return self.salary + self.allowances

        # Define a method overloading 'calculateTotalSalary'
        def calculateTotalSalary(self, bonus):
            return self.salary + self.allowances + bonus

    # Create and use the 'emp1' object
    emp1 = Employee()
    emp1.salary = 15000
    emp1.allowances = 5000
    print("Total salary is ", emp1.calculateTotalSalary(2000))

    # Create and use the 'emp2' object
    emp2 = Employee()
    emp2.salary = 18000
    emp2.allowances = 4000

    # This method call will generate an error
    print("Total salary is ", emp2.calculateTotalSalary())

Output 3.5:

    Total salary is  22000
    TypeError                       Traceback (most recent call last)
    <ipython-input-8-517b173547e9> in <module>
         22
         23 # This method call will generate an error
    ---> 24 print("Total salary is ", emp2.calculateTotalSalary())

    TypeError: calculateTotalSalary() missing 1 required positional argument: 'bonus'

3.5.1 Method Overloading through Optional Parameters in Python

Observation 3.24 – Method Overloading in Python: In Python, use optional method parameters to emulate the method overloading feature available in other OOP languages.

Although Python does not directly support method overloading in the same form as other OOP languages, it offers an alternative approach to achieve the same functionality. Instead of resorting to the creation of multiple methods, it allows methods to take optional parameters with default values. When a method is invoked in the code, the programmer can choose whether to provide the parameter values or not. This, in turn, dictates which block of statements would be executed within the method. Commonly, the None value is used to assign a default null value to the parameter.
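The approach described above can be sketched with a hypothetical Order class, which is an assumption made for illustration and not part of the chapter's Employee example. A single total() method with two optional parameters covers the call forms that separate overloaded methods would handle in other OOP languages:

```python
# Hypothetical class, used only to illustrate optional parameters
class Order:
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity

    # One method body handles every supported combination of arguments
    def total(self, discount=None, shipping=None):
        amount = self.price * self.quantity
        if discount is not None:
            amount = amount - discount
        if shipping is not None:
            amount = amount + shipping
        return amount


order1 = Order(20, 3)
print(order1.total())             # No optional values supplied: 60
print(order1.total(10))           # Discount only: 50
print(order1.total(10, 5))        # Discount and shipping: 55
print(order1.total(shipping=5))   # Shipping only, passed by keyword: 65
```

Passing an argument by keyword, as in the last call, makes it possible to skip an earlier optional parameter without supplying a placeholder value for it.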
In the example below, method calculateTotalSalary() is defined with optional parameter bonus. The implementation subsequently returns different values, depending on whether a value has been supplied for the optional parameter. If this is not the case, the default None value is used.

    class Employee:

        def calculateTotalSalary(self, bonus=None):
            # None can be tested with either 'is' or '=='; 'is' is preferred
            if bonus is None:
                return self.salary + self.allowances
            else:
                return self.salary + self.allowances + bonus

    emp1 = Employee()
    emp1.salary = 15000
    emp1.allowances = 5000

    emp2 = Employee()
    emp2.salary = 18000
    emp2.allowances = 4000

    print("Total salary is ", emp2.calculateTotalSalary(2000))
    print("Total salary is ", emp1.calculateTotalSalary())

Output 3.5.1:

    Total salary is  24000
    Total salary is  20000

3.6 OVERLOADING OPERATORS

Observation 3.25 – Operator Overloading: Apply the + and * operators on operands of different primitive data types to yield different results.

Operator overloading refers to the process of changing the default behavior of an operator based on the operands being used. A classic case of operator overloading in Python is the modification of the behavior of the addition (+) and multiplication (*) operators based on the input type. For instance, when the addition operator is used on two numbers it performs regular numerical addition, but when it is used with strings it concatenates them. Similarly, when the multiplication operator is used on numbers it multiplies them, while when it is used on a string and an integer it repeats the string.
The reader should note that this fundamental operator overloading functionality works on operands of primitive data types, like in the following example:

    a = 1
    b = 2
    print(a + b)      # Adds the two numbers
    print(a * b)      # Multiplies the two numbers

    a = 'Python'
    b = ' is fun'
    print(a + b)      # Concatenates the two strings
    print(a + b * 3)  # Repeats the second string, then concatenates

Output 3.6.a:

    3
    2
    Python is fun
    Python is fun is fun is fun

If the addition operator is used on user-defined objects it raises a TypeError, since it does not support the instance type, as shown below:

    # Define class 'Employee'
    class Employee:
        salary = 0

    # Create and use two objects of the 'Employee' class
    emp1 = Employee()
    emp1.salary = 15000
    emp2 = Employee()
    emp2.salary = 22000

    # Attempting the following print will generate a TypeError
    print(emp1 + emp2)

Output 3.6.b:

    TypeError                       Traceback (most recent call last)
    <ipython-input-11-527139aab026> in <module>
         11
         12 # Attempting the following print will generate a TypeError
    ---> 13 print(emp1 + emp2)

    TypeError: unsupported operand type(s) for +: 'Employee' and 'Employee'

Observation 3.26 – Magic or Dunder Methods: Special methods invoked when a basic operator is called, with a double underscore as a prefix and a suffix. They are used to overload operators with the object type of operands.

This issue can be bypassed by utilizing the built-in magic or dunder methods, which are invoked by means of the respective operators. For instance, in the case of the addition operator, the associated __add__() method is first defined in the class and is subsequently invoked whenever the + operator is applied to objects of that class, as shown in the following script:
    # Define class 'Employee'
    class Employee:

        # Overload the + operator to add the 'salary' of two objects
        def __add__(self, other):
            return self.salary + other.salary

    # Create the two objects of the 'Employee' class
    emp1 = Employee()
    emp1.salary = 15000
    emp2 = Employee()
    emp2.salary = 22000

    # Invoke the overloaded + operator by extending the '__add__' method
    print(emp1 + emp2)

Output 3.6.c:

    37000

In order to implement operator overloading, the programmer has to define the appropriate magic method according to the operator in the class definition. Tables 3.1–3.4 provide a list of magic methods corresponding to the respective binary, comparison, unary, and assignment operators.

Changing the implementation of the magic method associated with the respective operator can provide a different meaning to that particular operator. For example, the plus (+) operator can be used with the Employee objects to add their salaries (i.e., emp1 + emp2). Similarly, the less than (<) operator can be used to compare which employee was hired first, or which is older. Conceptually, the idea is to use operator overloading in order to define and implement the functionality of operators in a way that is logical and appropriate in the context of the overall program structure and requirements.
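The hire-date comparison mentioned above can be sketched as follows. The hireDate attribute and the constructor signature are assumptions introduced for this illustration and do not appear in the chapter's Employee class:

```python
from datetime import date


# Hypothetical variant of 'Employee' that stores a hire date
class Employee:
    def __init__(self, firstName, hireDate):
        self.firstName = firstName
        self.hireDate = hireDate

    # Overload the < operator: an employee is "less than" another
    # employee if they were hired earlier
    def __lt__(self, other):
        return self.hireDate < other.hireDate


emp1 = Employee("Maria", date(2019, 3, 1))
emp2 = Employee("Alex", date(2021, 9, 15))

print(emp1 < emp2)   # True, since 'emp1' was hired first
print(emp2 < emp1)   # False

# sorted() compares items with <, so it orders employees by hire date
for emp in sorted([emp2, emp1]):
    print(emp.firstName)
```

Because sorted() relies on the less than operator, defining __lt__() is sufficient for lists of such objects to be sorted without any additional code.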
TABLE 3.1 List of Binary Operators and Their Corresponding Magic Methods

    Operator    Magic Method
    +           __add__(self, other)
    -           __sub__(self, other)
    *           __mul__(self, other)
    //          __floordiv__(self, other)
    /           __truediv__(self, other)
    %           __mod__(self, other)
    **          __pow__(self, other)
    <<          __lshift__(self, other)
    >>          __rshift__(self, other)
    &           __and__(self, other)
    ^           __xor__(self, other)
    |           __or__(self, other)

TABLE 3.2 List of Comparison Operators and Their Corresponding Magic Methods

    Operator    Magic Method
    <           __lt__(self, other)
    >           __gt__(self, other)
    <=          __le__(self, other)
    >=          __ge__(self, other)
    ==          __eq__(self, other)
    !=          __ne__(self, other)

TABLE 3.3 List of Unary Operators and Their Corresponding Magic Methods

    Operator    Magic Method
    -           __neg__(self)
    +           __pos__(self)
    ~           __invert__(self)

TABLE 3.4 List of Assignment Operators and Their Corresponding Magic Methods

    Operator    Magic Method
    +=          __iadd__(self, other)
    -=          __isub__(self, other)
    *=          __imul__(self, other)
    /=          __itruediv__(self, other)
    //=         __ifloordiv__(self, other)
    %=          __imod__(self, other)
    **=         __ipow__(self, other)
    <<=         __ilshift__(self, other)
    >>=         __irshift__(self, other)
    &=          __iand__(self, other)
    ^=          __ixor__(self, other)
    |=          __ior__(self, other)

3.6.1 Overloading Built-In Methods

While Python does not support overloading of custom methods in a class, it does so for built-in methods. This allows the programmer to change the default behavior of an existing method within the context of a class.

Observation 3.27 – Overloading Built-In Methods: It is possible to overload built-in methods (e.g., print, len, bool) by extending the functionality of their respective magic methods.

For example, in the case of print(), the default behavior is to print a string if the argument is text, or an object reference if the argument is an object, as shown in the following example:

    # Define class 'Employee'
    class Employee:
        pass

    # Create a new 'emp1' object based on the class
    emp1 = Employee()
    emp1.firstName = "George"
    emp1.lastName = "Comma"

    # Use the print method to show the object's reference
    print(emp1)

Output 3.6.1.a:

    <__main__.Employee object at 0x000002A2140033D0>

Nevertheless, the way print() behaves when an object is passed as an argument can be overloaded. Using the usual Employee example, overloading the appropriate magic method, in this particular instance __str__(), allows the program to print the respective employee's details (e.g., firstName, lastName) instead of the object reference, as in the following example:

    # Define class 'Employee'
    class Employee:

        # Define and extend the constructor of the class
        def __init__(self, first, last, salary):
            self.firstName = first
            self.lastName = last
            self.salary = salary

        # Overload print: extend the functionality of '__str__'
        def __str__(self):
            return "Employee name: " + self.firstName + " " + \
                self.lastName + " Salary: " + str(self.salary)

    # Create and use the 'emp1' object based on the 'Employee' class
    emp1 = Employee("George", "Comma", 15000)

    # Use the overloaded print method
    print(emp1)

Output 3.6.1.b:

    Employee name: George Comma Salary: 15000

3.7 ABSTRACT CLASSES AND INTERFACES IN PYTHON

An abstract class is a class that cannot be instantiated. It serves as a blueprint or template for creating sub classes, but it cannot be used to create objects. An abstract class contains declarations of abstract methods. Declarations of this type include the names and parameter lists of the methods, but no implementation. The latter must be defined in the corresponding sub class.

Observation 3.28 – Abstract Class: A class that cannot be instantiated, but serves as a template for sub classes. Abstract classes contain declarations of abstract methods (i.e., methods whose implementation must be defined in the sub classes) and, optionally, non-abstract methods.

In order to create abstract classes and methods, the ABC class and the abstractmethod decorator must be imported from the abc module. The syntax for doing so is the following:

    from abc import ABC, abstractmethod

ABC stands for Abstract Base Classes. Newly created abstract classes inherit from ABC and must include at least one abstract method, marked with the @abstractmethod decorator and left without implementation. The following script provides an example of an abstract class (i.e., Employee) with one abstract method (i.e., getTotalSalary()). Running this script raises an error, since abstract classes cannot be used to instantiate objects:

    # Import ABC
    from abc import ABC, abstractmethod

    # Define abstract class 'Employee'
    class Employee(ABC):

        # Define abstract method 'getTotalSalary', which is left empty
        @abstractmethod
        def getTotalSalary(self):
            pass

    # Abstract classes cannot instantiate objects
    emp1 = Employee()

Output 3.7.a:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-16-47belb52dd97> in <module>
         11
         12 # Abstract classes cannot instantiate objects
    ---> 13 emp1 = Employee()

    TypeError: Can't instantiate abstract class Employee with abstract methods getTotalSalary

Once the abstract class is defined, it can be used as a super class for deriving sub classes. Sub classes of this type must implement the abstract method of the abstract class as a minimum requirement. In this context, as shown in the first of the following scripts, instantiating sub class fullTimeEmployee will raise an error, since it does not implement the abstract method (i.e., getTotalSalary()) of its abstract super class (i.e., Employee).
On the contrary, the second script presents the implementation of abstract method getTotalSalary() that resolves this issue:

    # Import ABC
    from abc import ABC, abstractmethod

    # Define abstract class 'Employee'
    class Employee(ABC):

        # Define abstract method 'getTotalSalary'
        @abstractmethod
        def getTotalSalary(self):
            pass

    # Define class 'fullTimeEmployee' based on the abstract class
    class fullTimeEmployee(Employee):

        # Define the constructor of the sub class and its attributes
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self.__salary = salary
            self.__allowances = allowances

    # Error will be raised as the sub class does not implement
    # the abstract method
    ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)

Output 3.7.b:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-12-7e5c51df1210> in <module>
         21
         22 # Error will be raised as the sub class does not implement the abstract method
    ---> 23 ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)

    TypeError: Can't instantiate abstract class fullTimeEmployee with abstract methods getTotalSalary

    # Import ABC
    from abc import ABC, abstractmethod

    # Define abstract class 'Employee'
    class Employee(ABC):

        # Define abstract method 'getTotalSalary'
        @abstractmethod
        def getTotalSalary(self):
            pass

    # Define class 'fullTimeEmployee' based on the abstract class
    class fullTimeEmployee(Employee):

        # Define the constructor of the sub class and its attributes
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self.__salary = salary
            self.__allowances = allowances

        # Implement the abstract method of the abstract class
        def getTotalSalary(self):
            return self.__salary + self.__allowances

    # Create and use a new 'fullTimeEmployee' object
    ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
    print(ftl.getTotalSalary())

Output 3.7.c:

    21000

Abstract classes may include both abstract methods and non-abstract methods with implementations. Sub classes that inherit from the abstract class also inherit the implemented methods. If required, the latter can be overridden but, in all cases, the sub classes must implement the abstract methods.

3.7.1 Interfaces

In OOP, an interface refers to a class that serves as a template for the creation of other classes. Its main purpose is to improve the organization and efficiency of the code by providing blueprints for prospective classes. As such, interfaces describe the behavior of inherited classes, similarly to abstract classes. However, contrary to the latter, they cannot contain non-abstract methods. Python does not support the explicit creation of interfaces. However, since it does support multiple inheritance, the programmer can mimic the interface functionality by utilizing abstract class inheritance, limited to the exclusive use of abstract methods.

Observation 3.29 – Interface: A class that cannot be instantiated but serves as a template for sub classes. Unlike abstract classes, interfaces cannot have non-abstract methods.

3.8 MODULES AND PACKAGES IN PYTHON

Modules and packages refer to structures used for organizing code in Python. Modules are files containing Python code structures (e.g., classes, methods, attributes, or simple variables) signified by the .py file extension. Instead of rewriting particular blocks of code, modules can be imported into other Python files or applications, thus allowing for a modular programming approach based on reusable code.

Observation 3.30 – Module: A module provides a way of organizing code in Python. Modules can host classes, methods, attributes, or even simple variables that can be imported and reused in other classes. Modules are commonly used with abstract classes.
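As a brief aside before looking at module files, the interface idea of Section 3.7.1 can be sketched as follows. This is a minimal illustration rather than a prescribed pattern: the class names IPayable, INamed, and FullTimeEmployee are hypothetical, and the two interface-style classes are simply abstract classes restricted to abstract methods, combined through multiple inheritance.

```python
from abc import ABC, abstractmethod

# Interface-style class (hypothetical name): abstract methods only
class IPayable(ABC):
    @abstractmethod
    def getTotalSalary(self):
        pass

# A second interface-style class (hypothetical name)
class INamed(ABC):
    @abstractmethod
    def getFullName(self):
        pass

# The sub class inherits from both "interfaces" and must implement
# every abstract method before it can be instantiated
class FullTimeEmployee(IPayable, INamed):
    def __init__(self, first, last, salary, allowances):
        self.__first = first
        self.__last = last
        self.__salary = salary
        self.__allowances = allowances

    def getTotalSalary(self):
        return self.__salary + self.__allowances

    def getFullName(self):
        return self.__first + " " + self.__last

ft1 = FullTimeEmployee("Maria", "Rena", 15000, 6000)
print(ft1.getFullName())     # Maria Rena
print(ft1.getTotalSalary())  # 21000
```

Classes of this kind are typical candidates for being kept in a module of their own, so that several applications can import and implement the same contract.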
Abstract classes and interfaces are two of the programming structures commonly stored in modules, from where they can be imported on demand. In the example provided in the following script, the entire definition of class Employee is stored in a module named employee.py:

    # 'Employee' module saved in 'employee.py' file
    class Employee:

        # Define the constructor and private attributes of the class
        def __init__(self, first, last, salary):
            self.__firstName = first
            self.__lastName = last
            self.__salary = salary

        # Define the getter for annual salary
        def getAnnualSalary(self):
            return self.__salary * 12

        # Define the getter for fullName
        def getFullName(self):
            return self.__firstName + " " + self.__lastName

3.8.1 The import Statement

Python module files are imported using the import statement. The statement may include one or more modules. The syntax is the following:

    import module1[, module2, module3…]

Observation 3.31 – The import Statement: Used to import either specific methods and attributes or entire classes stored in modules.

Once a module is imported, its classes and methods can be referenced using its name as a prefix (i.e., module.classname). The following example imports the employee.py module, and accesses the attributes and methods of its Employee class from the main body of the program:

    # Import the 'employee.py' file as a module
    import employee

    # Use the module to create and use a new object
    emp1 = employee.Employee("Maria", "Rena", 15000)
    print(emp1.getFullName())
    print()

Output 3.8.1:

    Maria Rena

3.8.2 The from…import Statement

A Python module may contain several classes, methods, attributes, or variables. The from…import statement allows the programmer to selectively import specific components from a module.
The syntax is the following:

    from module import name1[, name2, name3…]

Note that the names used in this example (e.g., name1, name2, name3) represent names of classes, methods, or attributes. To import all objects from a module the following syntax can be used:

    from module import *

The reader should note that if a specific class is imported explicitly, it can be referenced without a prefix, as in the next example:

    # Import class 'Employee' from 'employee' module in 'employee.py'
    from employee import Employee

    # Use the imported class to create and use a new object
    emp1 = Employee("Alex", "Flora", 18000)
    print(emp1.getFullName())
    print(emp1.getAnnualSalary())

Output 3.8.2:

    Alex Flora
    216000

3.8.3 Packages

Observation 3.32 – Package: A mechanism used to store a number of different modules in the same folder for better code organization.

A package is a collection of modules grouped together in a common folder. The package folder must contain a file with the designated name __init__.py, which indicates that the folder is a package. The __init__.py file can be empty, but it must always be present in the package folder. Once the package structure is created, Python modules can be added as required. The example in Figure 3.6 illustrates the structure of a package named hr, containing the mandatory __init__.py file, and a module named employee.py.
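The package layout described above can also be reproduced programmatically, which may help when experimenting. The following sketch is only an illustration under stated assumptions: it builds the hr folder inside a temporary directory (in a real project the folder and files would normally be created by hand next to the application), writes a shortened copy of the Employee class into employee.py, and adds the temporary directory to sys.path so that the package can be imported.

```python
import importlib
import os
import sys
import tempfile

# Create a temporary folder that will hold the 'hr' package
base_dir = tempfile.mkdtemp()
package_dir = os.path.join(base_dir, "hr")
os.mkdir(package_dir)

# An empty __init__.py marks the folder as a package
with open(os.path.join(package_dir, "__init__.py"), "w") as f:
    f.write("")

# A shortened version of the 'employee.py' module
module_source = '''
class Employee:
    def __init__(self, first, last, salary):
        self.__firstName = first
        self.__lastName = last
        self.__salary = salary

    def getAnnualSalary(self):
        return self.__salary * 12

    def getFullName(self):
        return self.__firstName + " " + self.__lastName
'''
with open(os.path.join(package_dir, "employee.py"), "w") as f:
    f.write(module_source)

# Show the resulting package structure
package_files = sorted(os.listdir(package_dir))
print(package_files)  # ['__init__.py', 'employee.py']

# Make the temporary folder visible to the import system
sys.path.insert(0, base_dir)
importlib.invalidate_caches()

# Import the class using the package name as a prefix
from hr.employee import Employee

emp = Employee("Alex", "Flora", 18000)
print(emp.getFullName())      # Alex Flora
print(emp.getAnnualSalary())  # 216000
```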
Modules contained in packages can be imported to an application using the package name as a prefix in the import statement, as shown in the following scripts:

    # Import the employee module from the 'hr' package
    import hr.employee

    # Use 'Employee' class stored in the module to create & use an object
    emp1 = hr.employee.Employee("Alex", "Flora", 16000)
    print(emp1.getFullName())
    print(emp1.getAnnualSalary())

Output 3.8.3.a:

    Alex Flora
    192000

    # Import 'Employee' class in the employee module from 'hr' package
    from hr.employee import Employee

    # Use the 'Employee' class of the module to create and use an object
    emp2 = Employee("Alex", "Flora", 15000)
    print(emp2.getFullName())
    print(emp2.getAnnualSalary())

Output 3.8.3.b:

    Alex Flora
    180000

FIGURE 3.6 Package hr contains the __init__.py file and the employee.py module.

3.8.4 Using Modules to Store Abstract Classes

Modules may also be used to store abstract classes or interfaces. In the following example, abstract class IEmployee is stored in module employee.py, which is contained in the package named hr:

    # Use 'abc' module to create an abstract class: store it as a module
    # ('employee.py') in the hr package
    from abc import ABC, abstractmethod

    # Define abstract class 'IEmployee' and its behavior
    class IEmployee(ABC):

        @abstractmethod
        def getTotalSalary(self):
            pass

        @abstractmethod
        def getFullName(self):
            pass

The following script demonstrates how the programmer can import the IEmployee class to the application, and use it to create a sub class (fullTimeEmployee):

    # Import the 'IEmployee' class from the employee module ('hr' package)
    from hr.employee import IEmployee

    # Define a new sub class inheriting from the 'IEmployee' super class
    class fullTimeEmployee(IEmployee):

        # The constructor, attributes & behavior of the sub class
        def __init__(self, first, last, salary, allowances):
            self.__first = first
            self.__last = last
            self.__salary = salary
            self.__allowances = allowances

        def getTotalSalary(self):
            return self.__salary + self.__allowances

        def getFullName(self):
            return self.__first + " " + self.__last

    # Create and use a new object
    ftl = fullTimeEmployee("Maria", "Rena", 15000, 6000)
    print(ftl.getFullName())
    print(ftl.getTotalSalary())

Output 3.8.4:

    Maria Rena
    21000

3.9 EXCEPTION HANDLING

When writing programs in Python, or in any other programming language for that matter, the code may include errors. Depending on their nature and significance, these errors may lead to a number of issues, such as preventing the program from executing, generating incorrect output, or causing the program to crash. It is, thus, the responsibility of the programmer to provide error identification and handling solutions, whenever possible. Errors can be classified into three main categories:

• Compile Time Errors: They occur due to incorrect syntax, datatype use, or parameters in a method call, among others. Whenever a compile time error is encountered, the program will not execute. Compile time errors are the easiest to handle and can be fixed by correcting the problematic code line(s).
• Logical Errors: They occur due to incorrect program logic. A program containing logical errors may run normally without crashing, but will generate incorrect output. Logical errors are handled by testing the application with various different input values, and making corrections to the program logic as necessary.
• Runtime Errors: They occur during the execution of a program, due to external factors not necessarily related to the code. For example, a user may provide an invalid input that the application is not expecting, or the code may attempt to read a file that does not exist in the system. In Python, these types of errors raise exceptions and, if not handled, cause the program to crash and terminate abruptly. To prevent this, the programmer should catch these exceptions by adding appropriate error handling code to the program.

Observation 3.33 – Types of Errors: There are three types of errors that may be encountered:
1. Compile Time: This is due to incorrect syntax and will not allow the program to execute.
2. Logical: This error type will allow execution of the program but may produce incorrect output.
3. Runtime: Raised because of unexpected external issues, wrong input, or wrong expressions. This error type will cause the program to crash.

3.9.1 Handling Exceptions in Python

In Python, when a runtime error occurs, a built-in exception is raised and, unless it is handled, the program crashes. The exception provides information about the error. For example, running the following script will cause a ZeroDivisionError exception as it attempts to divide a value by 0. The exception provides information about the nature of the issue (i.e., division by zero).

    a = 10
    b = 0
    print(a / b)

Output 3.9.1.a:

    ZeroDivisionError                         Traceback (most recent call last)
    <ipython-input-2-dd04aeeae314> in <module>
          1 a = 10
          2 b = 0
    ----> 3 print(a / b)

    ZeroDivisionError: division by zero

Observation 3.34 – Handling Exception: Use the try…except…[else]…[finally] syntax to identify possible errors that might be encountered during execution and handle them appropriately, avoiding abnormal termination of the program.

Exceptions can be handled using a try/except block of statements. As the name suggests, this structure consists of two distinct blocks: try and except. The try block includes critical statements that are most likely to cause an exception. When the exception occurs within the try block, the execution of the program jumps to the except block. This part contains code that handles the exception appropriately.
For example, it may display a related user message, close an open file, or log the error to a file. If no exception is raised in the try block, the program skips the except block and execution continues as normal.

Two optional blocks may also be added to the exception handling code, namely else and finally. The else block contains statements that are executed in case no exception occurs. The finally block contains code that must be executed irrespective of whether an exception occurs or not, and is mainly used for releasing external resources, such as closing an open file. The main Python syntax for catching exceptions is shown below:

    try:
        critical statements
    except [ExceptionClass as err]:
        exception handling statements
    [else:
        statements to execute when no exception has occurred]
    [finally:
        statements to execute whether an exception has occurred or not]

The ExceptionClass is optional, and refers to the type of exception being handled. If omitted, all types of exceptions are handled by the except block. The following example is an improved version of the code used in previous examples, since on this occasion the program will not crash abruptly. Instead, it will terminate with a user-friendly error message:

    # Declare variables 'a' and 'b'
    a, b = 10, 0

    # Try to divide the variables and, if an exception is raised,
    # execute the alternative statement in the 'except' block
    try:
        print(a / b)
    except:
        print("An error has occurred")

Output 3.9.1.b:

    An error has occurred

Observation 3.35 – Raising Exceptions: Instead of using built-in exceptions, it is possible to define user-defined exceptions to address specific errors in the program execution.

3.9.1.1 Handling Specific Exceptions

Trying to catch all types of errors within a single try/except block is not considered good programming practice, as it does not allow the programmer to handle exceptions on a case-by-case basis.
Python provides various built-in exception classes that are raised automatically, according to the type of error being encountered. These specific exceptions can be utilized by referring to their designated names. Table 3.5 lists a number of common built-in exception classes in Python. The example presented below demonstrates how a specific error can be handled using the ZeroDivisionError exception class:

    # Declare variables 'a' and 'b'
    a, b = 10, 0

    # Attempt to print the result of the division of 'a' by 'b'
    try:
        print(a / b)

    # If a specific 'ZeroDivisionError' occurs print a relevant message
    except ZeroDivisionError as err:
        print("An error has occurred")
        print(err)

Output 3.9.1.1:

    An error has occurred
    division by zero

TABLE 3.5 Common Exception Classes in Python

    Exception Class       Description
    ArithmeticError       Raised when arithmetic operations fail. Includes the following
                          exception sub classes: OverflowError, ZeroDivisionError,
                          FloatingPointError
    OverflowError         The result of an arithmetic operation is out of range
    ZeroDivisionError     Attempting to divide by zero
    FloatingPointError    Floating-point operation failure
    IndexError            An array (sequence) index is invalid
    AttributeError        A non-existing attribute is referenced for an instance
    TypeError             An operator or method is applied to an inappropriate type of object
    FileNotFoundError     A file is not found
    ValueError            The parameter of a method has the right type but an inappropriate value

A try block may also be followed by multiple except blocks. This is useful when the programmer wants to handle various different types of errors. However, only one of these blocks will be executed when an exception occurs. When multiple except blocks are used, the code structure must start with the more specific exception classes and end with the more generic ones. In this case, the latter are used as an added measure of trying to handle unexpected errors that are not accounted for explicitly.
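The ordering rule above, together with the optional else and finally blocks introduced in Section 3.9.1, can be sketched in a single function. The function name safe_divide and the log list are hypothetical and serve only to make the order of execution visible; this is a minimal sketch rather than a recommended utility.

```python
def safe_divide(values, index, divisor):
    """Divide values[index] by divisor and record which blocks ran."""
    log = []
    result = None
    try:
        result = values[index] / divisor
    except ZeroDivisionError as err:
        # Most specific arithmetic exception comes first
        log.append("zero division: " + str(err))
    except ArithmeticError as err:
        # More generic arithmetic failures are handled next
        log.append("arithmetic error: " + str(err))
    except (IndexError, TypeError) as err:
        # One except block may list several exception classes
        log.append("bad input: " + str(err))
    except Exception as err:
        # Generic handler placed last, for anything not accounted for
        log.append("unexpected error: " + str(err))
    else:
        # Runs only when no exception occurred
        log.append("success")
    finally:
        # Runs whether an exception occurred or not
        log.append("done")
    return result, log

print(safe_divide([10, 20], 0, 2))  # (5.0, ['success', 'done'])
print(safe_divide([10, 20], 0, 0))  # (None, ['zero division: division by zero', 'done'])
print(safe_divide([10, 20], 5, 2))  # (None, ['bad input: list index out of range', 'done'])
```

If the ArithmeticError block were placed before the ZeroDivisionError block, the latter would never be reached, since ZeroDivisionError is a sub class of ArithmeticError.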
The syntax of a multiple exceptions block is provided below:

    try:
        # critical statements
        pass
    except FileNotFoundError:
        # handle FileNotFound exception
        pass
    except (IndexError, ArithmeticError):
        # except block with multiple exceptions
        # index out of range in an array and arithmetic error
        pass
    except:
        # must be placed at end. Handles all other errors
        pass

3.9.2 Raising Exceptions

In Python, built-in exceptions are raised automatically when a corresponding runtime error occurs. However, Python also allows the programmer to raise exceptions explicitly. This is achieved by using the raise keyword followed by the exception name. When raising an exception in this way, it is also possible to provide a string parameter that describes the reason for raising it. The next example demonstrates such a case, where if the user input (i.e., user's age) is less than 18, a built-in exception (i.e., ValueError) is raised by the programmer:

    # Accepts the user's age
    age = int(input("Enter your age: "))

    # If the input is an integer less than 18 raise an error
    if age < 18:
        raise ValueError("Age cannot be below 18")

Output 3.9.2.a:

    Enter your age: 17
    ValueError                                Traceback (most recent call last)
    <ipython-input-6-de16dc8d8553> in <module>
          4 # If the input is an integer less than 18 raise an error
          5 if age < 18:
    ----> 6     raise ValueError("Age cannot be below 18")

    ValueError: Age cannot be below 18

In the example below, built-in exception AttributeError is raised when the value of private attribute __first is invalid.

    # Define class 'Employee'
    class Employee:

        # Define the getter method
        def getFirstName(self):
            return self.__first

        # Define the setter method
        def setFirstName(self, value):
            if len(value) < 15:
                self.__first = value
            else:
                # Raise error if the input exceeds 14 characters
                raise AttributeError(
                    "First name must be less than 15 characters")

    # Attempt to create a new object and set the first name
    try:
        emp1 = Employee()
        emp1.setFirstName("Maria Rena White")  # Exception raised
    # Handle the 'AttributeError' exception raised when the first name
    # exceeds 14 characters
    except AttributeError as err:
        print(err)
    except:
        print("An error has occurred")

Output 3.9.2.b:

    First name must be less than 15 characters

Raising exceptions is also a convenient way of handling invalid values passed to an attribute setter method. However, in this case, instead of raising built-in exceptions, it is preferable to create and raise custom ones.

3.9.3 User-Defined Exceptions in Python

As mentioned, Python raises built-in exceptions whenever a runtime error occurs. However, for custom errors, Python also allows the creation of custom exceptions that can be raised from within the code.
For example, instead of raising built-in exception AttributeError, the programmer can create a user-defined exception by deriving a new class from the Exception base class, as shown below:

    class NewExceptionName(Exception):
        pass

In the following script, user-defined exception FirstNameException is created and subsequently raised in the setter method, when the length of the first name exceeds the limit of 14 characters:

    # Define the new exception class based on the built-in 'Exception' class
    class FirstNameException(Exception):
        def __init__(self, message):
            super().__init__(message)

    # Define class 'Employee'
    class Employee:

        # Getter method
        def getFirstName(self):
            return self.__first

        # Setter method
        def setFirstName(self, value):
            if len(value) < 15:
                self.__first = value
            else:
                # Raise the user-defined exception 'FirstNameException' if
                # the first name exceeds 14 chars
                raise FirstNameException(
                    "First name should be less than 15 characters")

    # Create and use the new object handling possible user-defined
    # exceptions
    try:
        emp1 = Employee()
        emp1.setFirstName("Maria Rena White")  # Exception raised
    except FirstNameException as err:
        print(err)
    except:
        print("An error has occurred")

Output 3.9.3:

    First name should be less than 15 characters

3.10 CASE STUDY

Sherwood Real Estate requires an application to manage properties. There are two types of properties: apartments and houses. Each property may be available for rent or sale. Both types of properties are described using a reference number, address, built-up area, number of bedrooms, number of bathrooms, number of parking slots, pool availability, and gym availability. A house requires extra attributes such as the number of floors, plot size, and house type (villa or townhouse). An apartment requires additional attributes such as floor and number of balconies.
Each type of property (house or apartment) may be available for rent or sale. A rental property should include attributes such as deposit amount, yearly rent, furnished (yes or no), and maids’ room (yes or no). A property available for sale has attributes such as sale price and estimated annual service charge. 104 Handbook of Computer Programming with Python All properties include a fixed agent commission of 2%. Both types of sale properties have a fixed tax of 4%. All properties require a method to display the details of the property. All properties should include a method to compute the agent commission. For rental properties, agent commission is calculated by using the yearly rental amount, whereas for purchase properties it is calculated using the sale price. Both types of purchase properties should include a method to compute the tax amount. Tax amount is computed based on the sale price. Design and implement a Python application that creates the four types of properties (e.g., RentalApartment, RentalHouse, SaleApartment, SaleHouse) by using multiple inheritance and abstract classes. Implement class attributes and instance attributes using encapsulation. All numeric attributes, such as price, should be validated for inputs with a suitable minimum and maximum price. Define the methods in the abstract class and implement it in the respective classes. Override the print method to display each property details. Test your application by creating new properties of each type and calling the respective methods. 3.11 EXERCISES 1. Using the diagram shown below, write Python code for the following: a. Create a class named Student. b. Create appropriate getters and setters using the @property decorator for Student_ Name and GPA attributes. The Student_ID and Email attributes are read only. Create only getter methods for these attributes. c. Add a private class attribute named MAX_ID and set it to 0. Object-Oriented Programming 105 d. 
Add a default constructor method to the Student class. The default constructor should initialize the GPA attribute to 0 and Student_ID to MAX_ID + 1.
e. Add an overloaded constructor that takes Student_Name and GPA as arguments and initializes the private data variables with the values provided. In addition, it should set the Student_ID to MAX_ID + 1 and the email attribute to first_name.last_name@university.edu.
f. Modify the setter method of the GPA attribute to check if the provided value is between 0 and 4 before storing it.
g. Add a destructor method to the Student class. The method should print the message “All student records destroyed”.
h. Instantiate two new objects called std1 and std2, using the default and the overloaded constructors, respectively.
i. Print the data values stored in each object’s attributes.
j. Delete objects std1 and std2.

4 Graphical User Interface Programming with Python

Ourania K. Xanthidou
Brunel University London

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Sujni Paul
Higher Colleges of Technology

CONTENTS
4.1 Introduction
    4.1.1 Python’s GUI Modules
    4.1.2 Python IDE (Anaconda) and Chapter Scope
4.2 Basic Widgets in Tkinter
    4.2.1 Empty Frame
    4.2.2 The Label Widget
    4.2.3 The Button Widget
    4.2.4 The Entry Widget
    4.2.5 Integrating the Basic Widgets
4.3 Enhancing the GUI Experience
    4.3.1 The Spinbox and Scale Widgets inside Individual Frames
    4.3.2 The Listbox and Combobox Widgets inside LabelFrames
    4.3.3 GUIs with CheckButtons, RadioButtons and SimpleMessages
4.4 Basic Automation and User Input Control
    4.4.1 Traffic Lights Version 1 – Basic Functionality
    4.4.2 Traffic Lights Version 2 – Creating a Basic Illusion
    4.4.3 Traffic Lights Version 3 – Creating a Primitive Automation
    4.4.4 Traffic Lights Version 4 – A Primitive Screen Saver with a Progress Bar
    4.4.5 Traffic Lights Version 5 – Suggesting a Primitive Screen Saver
4.5 Case Studies
4.6 Exercises

DOI: 10.1201/9781003139010-4

4.1 INTRODUCTION

In modern software development, giving an application an intuitive, Windows-style Graphical User Interface (GUI) is essential for making it attractive to the user. There are four essential concepts related to this, and the associated programming tools:

• Widgets: The different components used to create an application GUI. These are relatively simple, pre-defined objects available through Python libraries. In this chapter, the libraries and modules used include tkinter and PIL, providing visual attributes that supply the necessary windows object aesthetic. The associated objects can be as simple as labels, texts, and buttons or as complex as frames and grids.
• Options: Characteristics or attributes of a widget/object that dictate the way the latter looks and behaves (e.g., the object color, text, position, or alignment). Value changes, usually integrated with interactions between the user and the GUI, control aspects like the visual appearance or format of the application and its behavior.
• Methods: Pre-defined or newly developed snippets of Python code, aiming to affect the widgets by changing the values of their properties/attributes. There is a wealth of methods in the various packages offered by Python, such as tkinter and PIL. They can be as simple or complex as the developer intends.
• Events: The interaction between the user of a GUI-based, Windows-style application and the various widgets of the application is expressed through the various available events that trigger the execution of particular commands or blocks of code. There are numerous such events offered by Python, some of them applicable to several different widgets. Examples are the click or double-click of a mouse, pressing the Enter key on the keyboard, hovering over a widget, or changing the text of a text widget.
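The four concepts listed above can be seen working together in a few lines. The following is a minimal sketch rather than one of the chapter's own listings: the names GREETINGS, next_greeting(), and build_demo() are hypothetical and chosen only for illustration. The widget is a label, its text, fg, and font are options, on_click() is the method, and the left mouse click is the event.

```python
# Hypothetical data and helper names, used only for this illustration
GREETINGS = ["Good morning", "Good afternoon", "Good evening"]


def next_greeting(current):
    """Return the greeting that follows `current`, wrapping around at the end."""
    if current not in GREETINGS:
        return GREETINGS[0]
    return GREETINGS[(GREETINGS.index(current) + 1) % len(GREETINGS)]


def build_demo():
    """Create a window holding one label that changes its text when clicked."""
    # tkinter is imported here so that next_greeting() can be used
    # (and tested) on a machine without a display
    import tkinter as tk

    window = tk.Tk()
    window.title("Widget, option, method, event")

    # Widget: a Label object; options: text, fg, and font
    label = tk.Label(window, text=GREETINGS[0], fg="blue", font="Arial 16")
    label.pack()

    # Method: reads the current value of the text option and replaces it
    def on_click(event):
        label.config(text=next_greeting(label.cget("text")))

    # Event: a left mouse click on the label triggers the method
    label.bind("<Button-1>", on_click)
    return window
```

Calling build_demo().mainloop() opens the window. Keeping the text-selection logic in a plain function separates what the application does from how the interface triggers it.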
Observation 4.1 – Widget: A graphical component used to create the interface of the Python application. This is provided as a pre-defined class of the tkinter or PIL packages.

Observation 4.2 – Option: An attribute of the widget that controls its look and behavior.

Observation 4.3 – Method: A specific structure of code that changes the value of an option of a particular widget. It can be either pre-defined or newly developed.

Observation 4.4 – Event: An interaction between the user and an object that causes a change in terms of the object’s appearance and/or value. Many types of interactions are available.

Observation 4.5 – Event-Driven (or Visual) Programming: The concept of handling events, through the use of methods in order to change the options of an object and, thus, their look and actions.

Event-driven (or visual) programming is the process during which one or more of the properties/attributes of a widget/object changes state or value. This is done through the use of specific methods and is triggered through interactions between the user and the widget/object, caught by the associated event.

The focus of this chapter is to introduce the concept of event-driven (or visual) programming by presenting some of the most popular widgets and the associated methods and properties/attributes/options, and the most commonly used events for the creation of a GUI experience.

4.1.1 Python’s GUI Modules

Python provides a rather complete set of widgets (presented as classes) to create objects for user-friendly GUI applications, a comprehensive and developer-friendly set of methods available through these widgets, a rich set of attributes of these widgets, and an adequate number of well-defined programmable events that can be triggered through user interactions.

Observation 4.6 – Python GUI Modules: The most important and frequently used modules for GUI programming in Python are Tk/Tcl, tkinter.tix, and tkinter.ttk.
There are two basic modules that define the components and functionality of these widgets, namely the tkinter and the PIL modules. The tkinter module provides a number of classes, including the fundamental Tk class, as well as numerous other classes associated with GUIs. It consists of the following:

• Tk/Tcl: A toolkit that includes widgets for GUI applications.
• tkinter.tix: An extension of tkinter including more advanced GUI widgets (e.g., spin boxes, trees).
• tkinter.ttk: A collection of widgets, some of which are part of the original tkinter module (e.g., combo boxes, progress bars).

Although it is not possible to describe all the widgets, methods, properties, and events available through all these modules in detail in this chapter, an effort is made to present the most commonly used ones and provide examples of their application. This chapter gradually moves from simpler to more sophisticated cases.

4.1.2 Python IDE (Anaconda) and Chapter Scope

In line with the approach taken in previous chapters, the Jupyter Notebook (Anaconda) is the platform of choice for the code developed in this chapter. Detailed download and installation instructions are provided in the introductory Chapter 1.

It is worth noting that when writing programs in Python, or indeed in any other language, it is useful to follow good programming practices. It is a good habit and a helpful strategy in the long run to use pseudocode in the form of comments before lines or blocks of code that are written to accomplish a specific and well-defined task. This allows the reader or the owner of the program to understand the underlying algorithm, making the program more readable and user-friendly.

It is beyond the scope of this chapter to write “highly intelligent” Python programs that create complex and sophisticated GUI applications, as this would make the chapter content difficult to digest. Instead, this chapter aims at presenting the tools and their suggested uses for the creation of common tasks and applications, without trying to offer the most efficient or optimal solution for such tasks.

4.2 BASIC WIDGETS IN TKINTER

Arguably, when creating a GUI, there are four basic widgets that intuitively come to mind. These are the actual frame, and the label, the button, and the entry widgets (the latter is commonly referred to as textbox in other programming languages). In this section, these particular widgets will be presented and utilized to create simple GUI applications.

Observation 4.7 – Basic Widgets: The basic widgets of any GUI in Python are the frame, and the label, the button, and the entry widgets.

4.2.1 Empty Frame

The basic frame is the initial parent object that a Python GUI application requires in order to support the interface and its functionality. The following Python code creates a basic, empty frame titled “Python Basic Window Frame”:

    # Import the necessary library
    import tkinter as tk

    # Create the frame using the Tk class
    winFrame = tk.Tk()
    winFrame.title("Python Basic Window Frame")

    winFrame.mainloop()

[Output 4.2.1.a: screenshot of the resulting empty window frame]

A few things are worth noting in this example:

• Every frame is an object of the Tk class of the tkinter module (imported here as tk), created by the Tk() constructor. The object must have a name.
• It is common practice to give a title to every frame using the title() method.
• The mainloop() method runs the frame and puts tkinter in a wait state, which internally monitors user-generated events, such as keyboard and mouse activity.

By default, the basic frame is resizable and its size is determined automatically. If there is a requirement for specifically defining and controlling whether it should be resizable, two methods can be used, namely: resizable() and geometry().
If it is preferred to have a non-resizable frame, one can just pass Boolean value False to both parameters of the resizable() method. Accordingly, passing True would result in a resizable frame. The geometry() method is used to pass the initial size of the frame as a string. It is also possible to define the maximum and minimum sizes of the window frame, as well as its background color. The aforementioned methods and their application are demonstrated in the following example:

    # Import the necessary library
    import tkinter as tk

    # Create the frame using the Tk class
    winFrame = tk.Tk()

    # Provide a title for the frame
    winFrame.title("Python Controlled Frame")

    # The frame is resizable if the method parameters are set
    # to True or non-zero; if set to False, it is not resizable
    winFrame.resizable(True, True)

    # The frame will have initial dimensions of 500 by 200
    winFrame.geometry('500x200')

    # The frame can be resized up to a maximum of 1500 by 600
    winFrame.maxsize(1500, 600)

    # The frame can be resized down to a minimum of 250 by 100
    winFrame.minsize(250, 100)

    # The background colour of the frame can be changed with
    # the use of the configure method and the bg option
    winFrame.configure(bg = 'dark grey')

    winFrame.mainloop()

[Output 4.2.1.b: screenshot of the resulting window frame]

Observation 4.8 – The mainloop() Method: Use the mainloop() method to monitor and control any type of interaction between the user and the application.

Observation 4.9 – Frame Methods: Use the title(), resizable(), geometry(), maxsize(), minsize(), config() methods to configure the basic content, size, geometry, flexibility, and look of the main window frame.

Once the basic frame is set, the actual GUI can be created by adding the desired widgets.

4.2.2 The Label Widget

The label widget is a basic widget class from the tkinter module. It is used to display a message or image on screen.
As it does not accept input from the keyboard, its value cannot be changed directly by the user at runtime, but it can be changed indirectly through the code. The widget comes with several methods and the associated parameters and options that can be used to change its appearance and functionality.

Observation 4.10 – Labels: Basic widgets used to display a message or an image. They do not accept input and, thus, their value cannot be changed directly by the user. Label widgets must be attached to a frame or window through the pack() or grid() methods.

The following script is an example showcasing the use of some of the available options:

    # Import the tkinter library
    import tkinter as tk

    # Define the parent frame
    winFrame = tk.Tk()
    winFrame.title("Labels in Python")
    winFrame.resizable(True, True)
    winFrame.geometry('300x100')

    # Create a label object based on the tk.Label class
    winLabel = tk.Label(winFrame, text = "Hello Python programmer")

    # Associate the label object with the parent frame
    winLabel.pack()

    # Run the interface
    winFrame.mainloop()

[Output 4.2.2.a: screenshot of the window with the text label]

The script creates a window frame containing a basic label widget, used to display a text message. The label widget (winLabel) is an instance of the tk.Label class, created by means of the tk.Label() constructor. In this example, the constructor is given two arguments, namely the parent frame (winFrame) and the text option that holds the message to display. The label widget is tied to the parent frame through the pack() method. Finally, the mainloop() method activates the application.
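The remark that a label's value can only be changed through the code can be illustrated with a label that updates itself on a timer. This is a minimal sketch under stated assumptions, not one of the chapter's listings: countdown_text() and run_countdown() are hypothetical names, and the one-second interval is arbitrary. It relies on the config() method to rewrite the text option and on the after() method, available on tkinter widgets, to schedule the next update.

```python
def countdown_text(seconds_left):
    """Return the message the label should display for the given time left."""
    if seconds_left <= 0:
        return "Time is up"
    return f"{seconds_left} s remaining"


def run_countdown(start=5):
    """Open a window whose label is rewritten by the code once per second."""
    # tkinter is imported here so that countdown_text() can be used
    # (and tested) on a machine without a display
    import tkinter as tk

    win = tk.Tk()
    win.title("Label updated from code")

    label = tk.Label(win, text=countdown_text(start))
    label.pack()

    def tick(seconds_left):
        # The user cannot type into the label; the code changes its text option
        label.config(text=countdown_text(seconds_left))
        if seconds_left > 0:
            # Ask tkinter to call tick() again in 1000 ms with the next value
            win.after(1000, tick, seconds_left - 1)

    tick(start)
    win.mainloop()
```

Running run_countdown() shows the label counting down from five and then displaying the final message.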
An extension of this basic use of the label widget could involve the use of the grid() method, in order to control its placement within the parent frame more precisely:

    # Import the tkinter library
    import tkinter as tk

    # Define the parent frame
    winFrame = tk.Tk()
    winFrame.title("Python Label using the Grid")

    # Create a label and place it in the Grid
    winLabel = tk.Label(winFrame, text = \
        "Use the Grid method to \nplace the label in a static position")
    # Specify the row and column the label
    # is to be placed, regardless of the size of the parent frame
    winLabel.grid(column = 0, row = 0)

    winFrame.mainloop()

[Output 4.2.2.b: screenshot of the label placed through the grid]

A couple of things are noteworthy in this case:

• For clarity purposes, a lengthy statement can be broken by inserting the backslash special character (“\”). This character informs Python that the statement continues on the next line. (Inside parentheses, as in this call, Python would continue the statement on the next line even without the backslash.)
• Using the grid() method instead of pack() ensures that the label widget will be placed in the respective grid cell, in this case in the first row (row = 0) and first column (column = 0), and that its position will not be directly adjusted based on the size of the frame or parent widget.

Observation 4.11 – The Backslash Special Character (“\”): Use the backslash special character (“\”) to break a lengthy line.

Observation 4.12 – expand, foreground, background, font, anchor: Use the expand, foreground, background, font, and anchor options to improve the appearance of widgets.

It is possible to further enhance the appearance of a label by changing its foreground and background colors, its alignment, and its expandability, as shown in the following script.
This example demonstrates the behavior of the alignment of labels before and after resizing the window frame:

    # Import the relevant library
    import tkinter as tk

    # The basic frame with the tk.Tk() constructor and provide a title
    winFrame = tk.Tk()
    winFrame.title('More options for label widgets')

    # Create the 1st label and place it in the middle of the parent window
    winLabel1 = tk.Label(winFrame, fg = 'green', font = "Arial 24",
        text = 'A green label of Arial 24, that does not expand')
    winLabel1.pack(expand = 'N')

    # The second label that expands vertically when the frame is resized
    winLabel2 = tk.Label(winFrame, bg = 'red', fg = 'white',
        text = 'A label in red background that expands only vertically')
    winLabel2.pack(expand = 1, fill = tk.Y)

    # The third label that expands horizontally when the frame is resized
    winLabel3 = tk.Label(winFrame, bg = 'blue', fg = 'yellow',
        text = 'A label in blue background that expands only horizontally')
    winLabel3.pack(expand = 1, fill = tk.X)

    # The fourth label 'anchored' (i.e., align always to the right/east)
    winLabel4 = tk.Label(winFrame, anchor = 'e', bg = 'green',
        text = 'A right, i.e., east, aligned label')
    winLabel4.pack(expand = 1, fill = tk.BOTH)

    winFrame.mainloop()

[Output 4.2.2.c: screenshots of the four labels before and after resizing the window frame]

A number of key observations can be made based on this example:

1. The expand option can be used to control whether a label widget will expand in line with its parent widget. If the value is 0 or “N”, the label will not expand.
2. If the expand option is set to “Y” or non-zero, the label widget can expand in line with its parent widget. It can be also specified whether the expansion will be horizontal, vertical, or both. In this case, one can use the fill option with the following arguments: X for horizontal expansion only, Y for vertical expansion only, and BOTH for a simultaneous expansion in both directions.
3. The fg and bg options can be used to define the color of the foreground and background of the label widget, respectively.
4. The font option can be used to set up the font name and size of the text in the label widget.
5. The anchor option controls where the text is positioned inside the area of the label (e.g., 'e' keeps it aligned to the right/east), so that the text keeps this alignment when the label is resized along with its parent widget.

Ultimately, label widgets can provide additional functionality and can be further enhanced in terms of their appearance. Indeed, they can be loaded with image objects with or without associated text, and can function as buttons (covered in a later section of this chapter). If images are to be used, the PIL module must be imported, as it provides the necessary methods to support such processes. The following Python program uses image objects as buttons that change the text-related properties of the main label:

    # Import the relevant library
    import tkinter as tk
    # Import the necessary image processing classes from PIL
    from PIL import Image, ImageTk

    global photo1, photo2, photo3, photo4, photo5, photo6

    # Declare the methods to control the click events from each of the
    # labels and change the settings of the main label
    def changeBorders(a, b):
        winLabel5.config(relief = a, borderwidth = b)

    def changeText(a):
        winLabel5.config(text = a)

    def changeAlignment(a):
        winLabel5.config(anchor = a)

    # Declare the method that will open the various images.
    # Image.LANCZOS is the current name of the resampling filter formerly
    # exposed as Image.ANTIALIAS (the latter was removed in Pillow 10)
    def photos():
        global photo1, photo2, photo3, photo4, photo5, photo6

        image1 = Image.open('LabelsDynamicWithImageGoodMorning.gif')
        image1 = image1.resize((100, 50), Image.LANCZOS)
        photo1 = ImageTk.PhotoImage(image1)
        image2 = Image.open('LabelsDynamicWithImageGoodAfternoon.gif')
        image2 = image2.resize((100, 50), Image.LANCZOS)
        photo2 = ImageTk.PhotoImage(image2)
        image3 = Image.open('LabelsDynamicWithImageGoodEvening.gif')
        image3 = image3.resize((100, 50), Image.LANCZOS)
        photo3 = ImageTk.PhotoImage(image3)
        image4 = Image.open('LabelsDynamicWithImageAlignLeft.gif')
        image4 = image4.resize((100, 50), Image.LANCZOS)
        photo4 = ImageTk.PhotoImage(image4)
        image5 = Image.open('LabelsDynamicWithImageAlignRight.gif')
        image5 = image5.resize((100, 50), Image.LANCZOS)
        photo5 = ImageTk.PhotoImage(image5)
        image6 = Image.open('LabelsDynamicWithImageAlignCenter.gif')
        image6 = image6.resize((100, 50), Image.LANCZOS)
        photo6 = ImageTk.PhotoImage(image6)

    # Declare the method that will create the first row of labels
    # that will shape the main label
    def firstRow():
        winLabel1a = tk.Label(winFrame, relief = "raised",
            text = "Left click to \n change to raised label \nwith border width of 4")
        winLabel1a.grid(column = 1, row = 0)
        winLabel1a.bind("<Button-1>", lambda event,
            a = "raised", b = 4: changeBorders(a, b))
        winLabel1b = tk.Label(winFrame, relief = "raised",
            text = "Left click to \n change to sunken label \nwith border width of 6")
        winLabel1b.grid(column = 2, row = 0)
        winLabel1b.bind("<Button-1>", lambda event,
            a = "sunken", b = 6: changeBorders(a, b))
        winLabel1c = tk.Label(winFrame, relief = "raised",
            text = "Left click to \n change to flat label \nwith border width of 8")
        winLabel1c.grid(column = 3, row = 0)
        winLabel1c.bind("<Button-1>", lambda event,
            a = "flat", b = 8: changeBorders(a, b))

    # Declare the method that will create the second row of labels
    # that will shape the main border
    def secondRow():
        winLabel2a = tk.Label(winFrame, relief = "raised",
            text = "Left click to \n change to ridge label \nwith border width of 10")
        winLabel2a.grid(column = 1, row = 4)
        winLabel2a.bind("<Button-1>", lambda event,
            a = "ridge", b = 10: changeBorders(a, b))
        winLabel2b = tk.Label(winFrame, relief = "raised",
            text = "Left click to \nchange to solid label \nwith border width of 12")
        winLabel2b.grid(column = 2, row = 4)
        winLabel2b.bind("<Button-1>", lambda event,
            a = "solid", b = 12: changeBorders(a, b))
        winLabel2c = tk.Label(winFrame, relief = "raised",
            text = "Left click to \n change to groove label \nwith border width of 14")
        winLabel2c.grid(column = 3, row = 4)
        winLabel2c.bind("<Button-1>", lambda event,
            a = "groove", b = 14: changeBorders(a, b))

    # Declare the method that will create the third row of labels
    # that will change the text of the main label
    def thirdRow():
        global photo1, photo2, photo3, photo4, photo5, photo6
        winLabel3a = tk.Label(winFrame, text = "Double left click to\n change to",
            image = photo1, compound = 'left', relief = "raised")
        winLabel3a.grid(column = 0, row = 1)
        winLabel3a.bind("<Double-Button-1>", lambda event,
            a = "Good morning": changeText(a))
        winLabel3b = tk.Label(winFrame, image = photo2, relief = "raised")
        winLabel3b.grid(column = 0, row = 2)
        winLabel3b.bind("<Double-Button-1>", lambda event,
            a = "Good afternoon": changeText(a))
        winLabel3c = tk.Label(winFrame, image = photo3, compound = "center",
            text = "Double click to\n change the text to", relief = "raised")
        winLabel3c.grid(column = 0, row = 3)
        winLabel3c.bind("<Double-Button-1>", lambda event,
            a = "Good evening": changeText(a))

    # Declare the method that will create the fourth row of labels
    # that will adjust the alignments of the text of the main label
    def fourthRow():
        winLabel4a = tk.Label(winFrame, image = photo4,
            text = "Right click to \n left align the text\nof the label",
            compound = "center", relief = "raised")
        winLabel4a.grid(column = 4, row = 1)
        winLabel4a.bind("<Button-3>", lambda event,
            a = "w": changeAlignment(a))
        winLabel4b = tk.Label(winFrame, image = photo5, relief = "raised",
            text = "Right click to \nright align the text\nof the label")
        winLabel4b.grid(column = 4, row = 2)
        winLabel4b.bind("<Button-3>", lambda event,
            a = "e": changeAlignment(a))
        winLabel4c = tk.Label(winFrame, image = photo6, compound = "right",
            text = "Right click to \ncenter align the text\nof the label",
            relief = "raised")
        winLabel4c.grid(column = 4, row = 3)
        winLabel4c.bind("<Button-3>", lambda event,
            a = "center": changeAlignment(a))

    # The basic frame with the tk.Tk() constructor and provide a title
    winFrame = tk.Tk()
    winFrame.title("Playing with Label options at runtime")

    photos()
    firstRow()
    secondRow()
    thirdRow()
    fourthRow()

    # Create the main label
    winLabel5 = tk.Label(winFrame, text = "...", font = "Arial 18", width = 30)
    winLabel5.grid(column = 2, row = 2)

    winFrame.mainloop()

[Output 4.2.2.d: screenshot of the main label surrounded by the twelve clickable labels]

As mentioned, the PIL module provides the necessary classes to support processes related with images, in this case Image and ImageTk. The photos() method includes six sets of three lines/steps, and deals with the opening and reading of the images, as well as their preparation in order to be loaded to the respective labels. In the first step (i.e., the first line of each set) the Image class and the open() method are used to read the images and create an image object. Next, the script uses the resize() method with the preferred dimensions for the image and a high-quality resampling filter, in order to ensure that quality is maintained when downsizing an image to fit the label. This filter was exposed as the ANTIALIAS option in older releases of the library; the constant was removed in Pillow 10, where Image.LANCZOS selects the same filter, and this is the name used in the listing above. This applies to all six cases. During the final step, a new image object is created based on the previously processed image. This is accomplished by using the PhotoImage method from the ImageTk class for each of the six cases.

It is worth noting that this process applies to images with a gif file type. The reader should check the Python documentation to find the exact classes, methods, and options that should be used when working with other types of images, as well as the exact process that must be followed. Nevertheless, the latter should not differ significantly from the process presented above.

Observation 4.13 – resize(), ANTIALIAS: Use the resize() method to set the preferred dimensions of the image, and the ANTIALIAS option (Image.LANCZOS in Pillow 10 and later) to ensure that the highest quality is maintained when resizing an image.

The next part of the script involves the use of four methods (firstRow(), secondRow(), thirdRow(), and fourthRow()) to create the twelve labels of the application (i.e., three labels for each row). For each label, three statements are used. The first statement creates the label widget and sets its text property to show the associated message, and its relief property to raised in order to enhance the widget appearance. The second statement places the label in the desired position within the grid of the current frame. The third statement calls the bind() method in order to associate the particular widget with an event.

Observation 4.14 – <Button-1>, <Button-3>, <Double-Button-1>: Use the <Button-1>, <Button-3>, and <Double-Button-1> events to catch when the parent widget is left-clicked, right-clicked or double-left-clicked.

There are a number of events that can be associated with the various widgets. This example involves three basic events, namely: <Button-1> that is triggered when the user left-clicks on the parent widget (label in this case), <Button-3> that is triggered when the user right-clicks, and <Double-Button-1> that is triggered when the parent widget is double left-clicked. Whenever an event is triggered, a method is usually called in order to execute a set of statements.
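The bindings in the script above all take the form lambda event, a = ..., b = ...: changeBorders(a, b). Why the values are written as default arguments can be sketched in plain Python, without opening a window. The snippet below is an illustration under stated assumptions rather than part of the chapter's listing: FakeEvent, change_borders(), STYLES, and fire() are hypothetical stand-ins, and the handlers are called directly instead of being triggered by tkinter.

```python
class FakeEvent:
    """Hypothetical stand-in for the event object that tkinter passes to a handler."""


def change_borders(relief, width):
    """Stand-in for changeBorders(): report the settings instead of configuring a label."""
    return (relief, width)


STYLES = [("raised", 4), ("sunken", 6), ("flat", 8)]

# Default arguments are evaluated when each lambda is created, so every
# handler keeps its own pair of values
bound_handlers = [lambda event, a=relief, b=width: change_borders(a, b)
                  for relief, width in STYLES]

# Without defaults, the names are looked up only when the handler runs,
# by which time the loop has finished and they hold the last pair
late_handlers = [lambda event: change_borders(relief, width)
                 for relief, width in STYLES]


def fire(handlers):
    """Call every handler the way a binding would: with a single event argument."""
    return [handler(FakeEvent()) for handler in handlers]
```

Here fire(bound_handlers) yields one distinct pair per handler, whereas fire(late_handlers) yields the last pair three times. When each binding is written out by hand with literal values, as in the chapter's script, both forms would behave the same; the default-argument form becomes important once handlers are created in a loop.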
If the method is to accept arguments from the calling statement, a lambda expression that receives the event must also be used, in order to define the arguments before they are passed to the method.

Observation 4.15 – lambda: Use the lambda event expression to define the arguments passed by an event to a method.

There are a number of options offered for the purpose of changing the appearance of the border of a label widget. These include options such as raised, sunken, flat, ridge, solid, and groove and have to be set through the relief property. The borderwidth property, used with an integer argument, changes the default border width of a label.

Observation 4.16 – relief, borderwidth: Use the relief and borderwidth properties to adjust the visual attributes of the label.

Finally, it is possible to have both a text and an image appearing in a label widget. In such cases, it is necessary to combine the two elements using the compound option. The option accepts different alignment values, namely left when the image is to be placed before the text, right when the image is to be placed after the text, and center when both objects are to be placed at the same position, one over the other.

Observation 4.17 – compound, left, right, center: Use the compound option to combine text and image objects in a label. Options include left, right, and center.

4.2.3 The Button Widget

As mentioned previously, the label widget is not meant to be used to trigger events initiated by the user interaction with the GUI. In such cases, the button widget can be used instead. This widget also belongs to the tkinter module, although it can be also found in the ttk module, where button objects can be created by instantiating the Button class.

Observation 4.18 – The Button Widget: Use the button widget to create objects that are responsive to various types of events (e.g., click, double-click, right-click), and the corresponding options or properties to modify its appearance.

The following script demonstrates the possible output of five different user interactions through the use of a simple button widget. The script also provides user feedback depending on the type of interaction, by displaying relevant messages through a label widget:

    # Import the relevant library
    import tkinter as tk

    # Define the method that controls the mouse click events
    def changeText(a):
        winLabel.config(text = a)

    # The basic frame with the tk.Tk() constructor and provide a title
    winFrame = tk.Tk()
    winFrame.title("A simple button and label application")

    # Create the label
    winLabel = tk.Label(winFrame, text = "...")
    winLabel.grid(column = 1, row = 0)

    # Create the button widget and bind it with the associated events
    winButton = tk.Button(winFrame, text = "Left, right, or double left Click "\
        "\nto change the text of the label", font = "Arial 16", fg = "red")
    winButton.grid(column = 0, row = 0)
    winButton.bind("<Button-1>", lambda event, \
        a = "You left clicked on the button": changeText(a))
    winButton.bind("<Button-3>", lambda event, \
        a = "You right clicked on the button": changeText(a))
    winButton.bind("<Double-Button-1>", lambda event, \
        a = "You double left clicked on the button": changeText(a))
    winButton.bind("<Enter>", lambda event, \
        a = "You are hovering above the button": changeText(a))
    winButton.bind("<Leave>", lambda event, \
        a = "You left the button widget": changeText(a))

    winFrame.mainloop()

[Output 4.2.3: screenshots of the button and the label messages for the different interactions]

As shown, the process of creating a button widget object and assigning values to its basic options or properties (e.g., text, font, fg) is no different from the one used in the case of the label widget. Accordingly, binding the button widget to an event and calling a method (with or without arguments) follows the same syntax and logic as in the label widget case.

4.2.4 The Entry Widget

The entry widget is a basic widget from the ttk module (tkinter package), which allows input from the keyboard as a single line. The widget offers several methods and options that allow the control of its appearance and/or functionality. The widget must be placed in a parent widget, usually the current frame, through the .pack() or .grid() methods.

Observation 4.19 – Entry/Text: Use the entry and/or text widgets from the ttk module (tkinter package) to allow the user to enter text as a single line or multiple lines respectively. When using the text widget, specify the number of text lines through the height = <number of lines> option.

The following script introduces the basic use of the entry widget, and its output:

    # Import the necessary library
    import tkinter as tk
    from tkinter import ttk

    # Create the frame using the Tk class
    winFrame = tk.Tk()
    winFrame.title("Python GUI with text")

    # Create a StringVar object to accept user input from the keyboard
    textVar = tk.StringVar()

    # Set the initial text for the StringVar
    textVar.set('Enter text here')

    # Create an entry widget and associate it to the StringVar object
    winText = ttk.Entry(winFrame, textvariable = textVar, width = 40)
    winText.grid(column = 1, row = 0)

    winFrame.mainloop()

[Output 4.2.4: screenshot of the entry widget showing the initial text]

In line with common GUI development practice, the frame is created first and any child objects (in this case the entry widget) are created and placed in it subsequently. Finally, the mainloop() method is called to run the application and monitor its interactions. The width property specifies the number of characters the widget can display.
The reader should note that this is not necessarily the total number of accepted characters, but rather the number of displayed characters. It must be also noted that if it is necessary to have multiple lines entered, it would be preferable to use the text widget (tk module, tkinter library) and specify the number of lines through the height = <number of lines> option.

The script also introduces a tool that helps the programmer monitor the execution of the application: the StringVar() constructor from the tkinter module. When associated with relevant widgets, such as the entry widget, its functionality is to create objects that accept text input. Once such an object is created, its content can be set through the .set() method. If no content is set, the object will remain empty until the user provides input through the associated widget. The entry widget and the StringVar object are associated via the textvariable option.

4.2.5 Integrating the Basic Widgets

Having introduced the syntax and functionality of the basic Python widgets included in the tkinter, PIL, and ttk modules/libraries, it would be useful to attempt to create an interface that integrates all of them in one application.
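Before looking at the integrated script, the way a StringVar object can be used to monitor an entry widget can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the chapter's listings: describe_input() and build_entry_monitor() are hypothetical names, and the comparison against the visible width is only approximate, since the width option is expressed in average-size characters. The sketch uses the get() method of the StringVar object and its trace_add() method, which registers a callback that runs whenever the variable is written.

```python
def describe_input(text, visible_width=20):
    """Summarise the entry content: its length and whether it exceeds the visible width."""
    note = " (longer than the visible width)" if len(text) > visible_width else ""
    return f"{len(text)} characters{note}"


def build_entry_monitor():
    """Create an entry widget whose content is reported live in a label."""
    # tkinter is imported here so that describe_input() can be used
    # (and tested) on a machine without a display
    import tkinter as tk
    from tkinter import ttk

    window = tk.Tk()
    window.title("Monitoring an entry with StringVar")

    # The StringVar object holds whatever the entry widget currently contains
    text_var = tk.StringVar()
    text_var.set("Enter text here")

    # width limits what is displayed, not how much text is accepted
    entry = ttk.Entry(window, textvariable=text_var, width=20)
    entry.grid(column=0, row=0)

    status = tk.Label(window, text=describe_input(text_var.get()))
    status.grid(column=0, row=1)

    # Called by tkinter every time the variable is written to
    def on_write(*_args):
        status.config(text=describe_input(text_var.get()))

    text_var.trace_add("write", on_write)
    return window
```

Calling build_entry_monitor().mainloop() opens the window; typing more than twenty characters shows that the entry keeps accepting text beyond what it displays.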
The following Python script displays a message to the user, accepts a text input from the keyboard, and uses a number of buttons to change the various attributes of the text, through the integration of label, entry, and button widgets:

# Import the necessary library
import tkinter as tk
from tkinter import ttk

# The tempText variable will store the contents of the entry widget
# (initialised to an empty string so that "show" can be pressed first)
tempText = ''
# The textVar object will associate the entry widget with the input
textVar = None
# The winText variable will refer to the entry widget
winText = None

# ===================================================================
# Declare the methods that will run the application
def showHideLabelEntry(a):
    if (a == 's'):
        winText.grid()
    elif (a == 'h'):
        winText.grid_remove()

def showHideEntryContent(a):
    global tempText
    global textVar
    if (a == 's'):
        if (tempText != ''):
            textVar.set(tempText)
    if (a == 'h'):
        tempText = textVar.get()
        textVar.set('')

def enableLockDisableEntryWidget(a):
    if (a == 'e'):
        winText.config(state = 'normal')
    elif (a == 'l'):
        winText.config(state = 'disabled')

def boldContentsOfEntryWidget(a):
    if (a == 'b'):
        winText.config(font = 'Arial 14 bold')
    elif (a == 'n'):
        winText.config(font = 'Arial 14')

def passwordEntryWidget(a):
    if (a == 'p'):
        winText.config(show = '*')
    elif (a == 'n'):
        winText.config(show = '')

# ===================================================================
# Declare the method that will create the application GUI
def createGUI():
    createLabelEntry()
    showHideButton()
    showHideContent()
    enableDisable()
    boldOnOff()
    passwordOnOff()

# Create a label and an entry widget to prompt for input and
# associate it with a StringVar object
def createLabelEntry():
    global textVar
    global winText
    winLabel = tk.Label(winFrame, text = 'Enter text:', bg = 'yellow',
        font = 'Arial 14 bold', relief = 'ridge', fg = 'red', bd = 8)
    winLabel.grid(column = 0, row = 0)
    # A StringVar object to accept user input from the keyboard
    textVar = tk.StringVar()
    winText = ttk.Entry(winFrame, textvariable = textVar, width = 20)
    winText.grid(column = 1, row = 0)

# Create two button widgets to show/hide the label and entry widgets
def showHideButton():
    winButtonShow = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Show the\nentry widget', fg = 'red',
        borderwidth = 8, height = 3, width = 20)
    winButtonShow.grid(column = 0, row = 1)
    winButtonShow.bind('<Button-1>',
        lambda event, a = 's': showHideLabelEntry(a))
    winButtonHide = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Hide the\nentry widget', fg = 'red',
        borderwidth = 8, height = 3, width = 20)
    winButtonHide.grid(column = 1, row = 1)
    winButtonHide.bind('<Button-1>',
        lambda event, a = 'h': showHideLabelEntry(a))

# Two button widgets to show/hide the contents of the entry widget
def showHideContent():
    winButtonContentShow = tk.Button(winFrame, font = 'Arial 14 bold',
        text = 'Show the contents\nof the entry widget', fg = 'blue',
        borderwidth = 8, height = 3, width = 20)
    winButtonContentShow.grid(column = 0, row = 2)
    winButtonContentShow.bind('<Button-1>',
        lambda event, a = 's': showHideEntryContent(a))
    winButtonContentHide = tk.Button(winFrame,
        text = 'Hide the contents\nof the entry widget',
        font = 'Arial 14 bold', fg = 'blue',
        borderwidth = 8, height = 3, width = 20)
    winButtonContentHide.grid(column = 1, row = 2)
    winButtonContentHide.bind('<Button-1>',
        lambda event, a = 'h': showHideEntryContent(a))

# Button widgets to enable/disable & lock/unlock the entry widget
def enableDisable():
    winButtonEnableEntryWidget = tk.Button(winFrame,
        text = 'Enable the\nentry widget',
        font = 'Arial 14 bold', fg = 'green',
        borderwidth = 8, height = 3, width = 20)
    winButtonEnableEntryWidget.grid(column = 0, row = 3)
    winButtonEnableEntryWidget.bind('<Button-1>',
        lambda event, a = 'e': enableLockDisableEntryWidget(a))
    winButtonDisableEntryWidget = tk.Button(winFrame,
        text = 'Lock the\nentry widget',
        font = 'Arial 14 bold', fg = 'green',
        borderwidth = 8, height = 3, width = 20)
    winButtonDisableEntryWidget.grid(column = 1, row = 3)
    winButtonDisableEntryWidget.bind('<Button-1>',
        lambda event, a = 'l': enableLockDisableEntryWidget(a))

# Create two button widgets to switch the "bold" property
# of the entry widget content on or off
def boldOnOff():
    winButtonBoldEntryWidget = tk.Button(winFrame,
        text = 'Bold contents of\nthe entry widget',
        font = 'Arial 14 bold', fg = 'brown',
        borderwidth = 8, height = 3, width = 20)
    winButtonBoldEntryWidget.grid(column = 0, row = 4)
    winButtonBoldEntryWidget.bind('<Button-1>',
        lambda event, a = 'b': boldContentsOfEntryWidget(a))
    winButtonNoBoldEntryWidget = tk.Button(winFrame,
        text = 'No bold contents of \nthe entry widget',
        font = 'Arial 14 bold', fg = 'brown',
        borderwidth = 8, height = 3, width = 20)
    winButtonNoBoldEntryWidget.grid(column = 1, row = 4)
    winButtonNoBoldEntryWidget.bind('<Button-1>',
        lambda event, a = 'n': boldContentsOfEntryWidget(a))

# Button widgets to convert the entry widget text to a password
def passwordOnOff():
    winButtonPasswordEntryWidget = tk.Button(winFrame,
        text = 'Show entry widget \ncontent as password',
        borderwidth = 8, font = 'Arial 14 bold', fg = 'grey',
        height = 3, width = 20)
    winButtonPasswordEntryWidget.grid(column = 0, row = 5)
    winButtonPasswordEntryWidget.bind('<Button-1>',
        lambda event, a = 'p': passwordEntryWidget(a))
    winButtonNormalEntryWidget = tk.Button(winFrame,
        font = 'Arial 14 bold',
        text = 'Show entry widget \ncontent as normal text',
        fg = 'grey', borderwidth = 8, height = 3, width = 20)
    winButtonNormalEntryWidget.grid(column = 1, row = 5)
    winButtonNormalEntryWidget.bind('<Button-1>',
        lambda event, a = 'n': passwordEntryWidget(a))

# ===================================================================
# Create the frame using the tk object and run the application
winFrame = tk.Tk()
winFrame.title("Wrap up the basic widgets")
createGUI()
winFrame.mainloop()

[Output 4.2.5.a–4.2.5.f: screenshots of the running application]

There are some noteworthy ideas presented in this script, relating to the need to hide, disable, and lock the text of a widget, or make it appear as a password. For example, sometimes it is required to hide, and subsequently unhide, a widget. This is often referred to as adjusting its visibility. In Python this is achieved with the use of the grid() and grid_remove() methods. It should be stated that when the widget is invisible it is not deleted, but merely removed from the grid. Method showHideLabelEntry() implements this functionality.

Observation 4.20 – grid(): Use the grid() method to position a widget on the grid; use the grid_remove() method to remove it without deleting it.

Observation 4.21 – state, normal, disabled: Use the state option with the normal or disabled flags to enable or disable (lock) the functionality of a widget.

Observation 4.22 – show: Use the show option to replace the text with a password-like text, based on a preferred character/symbol.

In a similar fashion, the method showHideEntryContent() implements the functionality of hiding and displaying the contents of the same entry widget using the set() and get() methods. The reader should note that the content of the entry widget should be stored in a variable, since tampering with the set() and get() methods may accidentally delete it. Likewise, method
enableLockDisableEntryWidget() implements the functionality of locking/disabling the entry widget using the state option and its normal and disabled values. Finally, if it is required to utilize text font properties, such as bold or italic, one can use the font option as shown in the boldContentsOfEntryWidget() method. It is also possible to make the content of the entry widget appear as a password. Method passwordEntryWidget() uses option show to replace each character with a chosen placeholder character, in this case an asterisk (“*”). The remaining methods are responsible for creating the application GUI.

4.3 ENHANCING THE GUI EXPERIENCE

The widgets, methods, options, and events presented in the previous sections should provide a good enough basis to create a GUI application for a basic system, as they cover all the fundamental aspects of basic interaction. However, they do not address two major requirements in computer programming: validation and efficiency. In the case of numbers, specific widgets like spinbox and scale are frequently used for the purposes of validation and improvement of visual appearance. In the case of text, for tasks requiring optimized and synchronized organization, widgets like listbox and combobox can be used. Checkbuttons and radiobuttons are used frequently in cases where improved selection options are required. Finally, in order to improve the organization of the GUI and avoid accidental repositioning of the widgets at runtime, the various objects can be placed in individual frames within the main frame of the application.

4.3.1 The Spinbox and Scale Widgets inside Individual Frames

One of the main challenges in programming is to identify and highlight the user’s mistakes when entering numbers as part of their interaction with an application. It is often the case that the numeric values entered are either outside the allowed range or alphanumeric sequences consisting of both text and numbers.
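Both failure modes described above (values outside the allowed range and mixed alphanumeric input) can be caught in code before a value is used. The following is a minimal sketch and not part of the scripts in this chapter; the function name parse_bounded_number and the 0 to 360 bounds in the usage lines are assumptions chosen purely for illustration.

```python
def parse_bounded_number(text, low, high):
    """Return text as a float if it is numeric and within [low, high], else None.

    Hypothetical helper for illustration; it is not part of tkinter.
    """
    try:
        value = float(text.strip())
    except ValueError:
        # Mixed alphanumeric input such as '12a', or an empty string
        return None
    # Reject anything outside the allowed range (this also rejects NaN)
    if not (low <= value <= high):
        return None
    return value


print(parse_bounded_number('120', 0, 360))
print(parse_bounded_number('12a', 0, 360))
print(parse_bounded_number('400', 0, 360))
```

A function of this kind could be called on the text read from an entry widget before any calculation takes place, with a None result used to display a message to the user.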
In order to validate that a number is entered correctly two different approaches are followed: (a) code is written to ensure the correct, acceptable form of the input number, and (b) widgets like spinbox and scale are used to restrict the user’s options when selecting numbers.

The following Python script makes use of such widgets to implement a small application in which the user may enter the speed limit, the current speed, and the fine per km/h over the speed limit. Once these numbers are entered, the fine is calculated based on the following formula:

fine = (current speed − speed limit) × fine per km/h

For improving the organization of the GUI, the script uses a frame widget, which the various other widgets are placed upon:

  1  # Import the necessary modules
  2  import tkinter as tk
  3  from tkinter import ttk
  4
  5  # Declare and initialise the global variables and widgets
  6  # and define the associated methods
  7  currentSpeedValue, speedLimitValue, finePerKmValue = 0, 0, 0
  8  global speedLimitSpinbox
  9  global finePerKmScale
 10  global currentSpeedScale
 11  global fine
 12
 13  # ===========================================================
 14  # Define the methods to run the control speed application
 15
 16  # Define the method to control the Current Speed Scale widget change
 17  def onScale(val):
 18      global currentSpeedValue
 19      v = float(val)
 20      currentSpeedValue.set(v)
 21      calculateFine()
 22
 23  # Define the method to control the Speed Limit Spinbox widget change
 24  def getSpeedLimit():
 25      global speedLimitValue
 26      v = float(speedLimitSpinbox.get())
 27      speedLimitValue.set(v)
 28      calculateFine()
 29
 30  # Define the method to control the Fine per Km Scale widget change
 31  def getFinePerKm(val):
 32      global finePerKmValue
 33      v = int(float(val))
 34      finePerKmValue.set(v)
 35      calculateFine()
 36
 37  # Define the method to calculate the Fine given the 3 user parameters
 38  def calculateFine():
 39      global currentSpeedValue, speedLimitValue, finePerKmValue
 40      global fine
 41      diff = float(currentSpeedValue.get())-float(speedLimitValue.get())
 42      finePerKm = float(finePerKmValue.get())
 43      if (diff <= 0):
 44          fine.config(text = 'No fine')
 45      else:
 46          fine.config(text = 'Fine in USD: '+ str(diff * finePerKm))
 47  # ===========================================================
 48  # Define the methods that will create the interface of the application
 49  def createGUI():
 50      currentSpeedFrame()
 51      speedLimitFrame()
 52      finePerKmFrame()
 53      fineFrame()
 54
 55
 56  # Create the frame to include the Current Speed widgets
 57  def currentSpeedFrame():
 58      global currentSpeedValue
 59      CurrentSpeedFrame = tk.Frame(winFrame, bg = 'light grey', bd = 2,
 60          relief = 'sunken')
 61      CurrentSpeedFrame.pack()
 62      CurrentSpeedFrame.place(relx = 0.05, rely = 0.05)
 63      currentSpeed = tk.Label(CurrentSpeedFrame, text = 'Current speed:',
 64          width = 24)
 65      currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
 66          font = 'Arial 14 bold')
 67      currentSpeed.grid(column = 0, row = 0)
 68      # Create Scale widget; define variable to connect to scale widget
 69      currentSpeedValue = tk.DoubleVar()
 70      currentSpeedScale = tk.Scale(CurrentSpeedFrame, length = 200,
 71          from_ = 0, to = 360)
 72      currentSpeedScale.config(resolution = 0.5,
 73          activebackground = 'dark blue', orient = 'horizontal')
 74      currentSpeedScale.config(bg = 'light blue', fg = 'red',
 75          troughcolor = 'cyan', command = onScale)
 76      currentSpeedScale.grid(column = 1, row = 0)
 77      currentSpeedSelected = tk.Label(CurrentSpeedFrame, text = '...',
 78          textvariable = currentSpeedValue)
 79      currentSpeedSelected.grid(column = 2, row = 0)
 80
 81
 82  # Create the frame to include the Speed Limit widgets
 83  def speedLimitFrame():
 84      global speedLimitValue
 85      global speedLimitSpinbox
 86      SpeedLimitFrame = tk.Frame(winFrame, bg = 'light yellow', bd = 4,
 87          relief = 'sunken')
 88      SpeedLimitFrame.pack()
 89      SpeedLimitFrame.place(relx = 0.05, rely = 0.30)
 90      # Create the prompt label on the Speed Limit frame
 91      speedLimit=tk.Label(SpeedLimitFrame, text='Speed limit:', width=24)
 92      speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2,
 93          font = 'Arial 14 bold')
 94      speedLimit.grid(column = 0, row = 0)
 95      # Create the Spinbox widget; define variable to connect to Spinbox
 96      speedLimitValue = tk.DoubleVar()
 97      speedLimitSpinbox = ttk.Spinbox(SpeedLimitFrame, from_ = 0, to = 360,
 98          command = getSpeedLimit)
 99      speedLimitSpinbox.grid(column = 1, row = 0)
100      speedLimitSelected = tk.Label(SpeedLimitFrame, text = '...',
101          textvariable = speedLimitValue)
102      speedLimitSelected.grid(column = 2, row = 0)
103
104
105  # Create the frame to include the Fine per Km widgets
106  def finePerKmFrame():
107      global finePerKmValue
108      FinePerKmFrame = tk.Frame(winFrame, bg = 'light blue', bd = 4,
109          relief = 'sunken')
110      FinePerKmFrame.pack()
111      FinePerKmFrame.place(relx = 0.05, rely = 0.55)
112      # Create the prompt label on the Fine per Km frame
113      finePerKm=tk.Label(FinePerKmFrame, text='Fine/Km overspeed (USD):',
114          width = 24)
115      finePerKm.config(bg = 'light blue', fg = 'brown', bd = 2,
116          font = 'Arial 14 bold')
117      finePerKm.grid(column = 0, row = 0)
118      # Create Scale widget; define variable to connect to Scale widget
119      finePerKmValue = tk.IntVar()
120      finePerKmScale = ttk.Scale(FinePerKmFrame, orient = 'horizontal',
121          length = 200, from_ = 0, to = 100, command = getFinePerKm)
122      finePerKmScale.grid(column = 1, row = 0)
123      finePerKmSelected = tk.Label(FinePerKmFrame, text = '...',
124          textvariable = finePerKmValue)
125      finePerKmSelected.grid(column = 2, row = 0)
126
127
128  # Create the frame to include the Fine for speeding
129  def fineFrame():
130      global fine
131      FineFrame = tk.Frame(winFrame, bg='yellow', bd=4, relief='raised')
132      FineFrame.pack()
133      FineFrame.place(relx = 0.05, rely = 0.80)
134      # Create the label that will display the fine on the Fine frame
135      fine = tk.Label(FineFrame, text = 'Fine in USD:...', fg = 'blue')
136      fine.grid(column = 0, row = 0)
137  # ===================================================================
138  # Create the main frame for the application and run it
139  winFrame = tk.Tk()
140  winFrame.title("Control speed")
141  winFrame.config(bg = 'light grey')
142  winFrame.resizable(False, False)
143  winFrame.geometry('500x170')
144  createGUI()
145  winFrame.mainloop()

[Output 4.3.1: screenshot of the running application]

Conceptually, the script may be divided into three parts. The first part involves the declaration of the global variables and their initialization, so that they can be used in runtime when the user interacts with the program (line 7). This is important since the methods implementing the interaction will be using the same variables dynamically. At this stage, the main frame is also initialized and formed (lines 139–145), although this is done outside the initial phase. Eventually, a frame is created with a single label placed in it, with the sole purpose of displaying the calculated fine for speeding (lines 128–137).

Observation 4.23 – frames, relx, rely: Use frames for improved control of the interface. Contain the various widgets of the interface in the relevant frames. Use options relx and rely to place the frames in specified positions, relative to the main window.

Observation 4.24 – scale: Use the scale widget to create a controlled mechanism that will accept numerical user input. The tkinter widget has more visual options than the ttk alternative.

Observation 4.25 – Options: Use the required options, such as activebackground, troughcolor, bg, fg, to modify the visual attributes of the widget. Use the resolution option to specify the increment and decrement steps. Use the orient option to specify its orientation (i.e., horizontal or vertical). Use the from_ = and to = options to set the numerical boundaries of the widget.

The second part includes the creation of the four different frames inside the main frame, and the placement of the relevant widgets in each of them. These frames are created by means of a call to the relevant methods, through the createGUI() method (lines 49–54).

In the first case, (lines 56–80), the frame is placed inside the main window frame in a particular position (relx and rely options). Next, a label and a scale widget are placed in the frame. The reader should note the use of the config() method that defines the background (bg), foreground (fg), borderwidth (bd), and font name and size (font) of the label. It must be also noted that the label is placed in column 0 and row 0 of the current frame, and not of the main window frame.

In addition to the label, the scale widget is also placed in the frame. It is set to have a length (length) of 200 pixels, and its values are restricted within a lower boundary of 0 and upper boundary of 360. The reader should also observe the use of the config() method that sets the resolution option of the widget, allowing for user-defined increments (including decimals) of the values, the activebackground option that sets the color of the widget when it is active, and the orientation (orient) that can take one of two values: horizontal or vertical. For clarity reasons, the config() method is used for a second time to set some more options for the widget, such as the background (bg), the foreground (fg), and the troughcolor that sets the color of the trough. Additionally, another label is placed in the frame in order to display the current value of the scale widget, as an optional visual aid.

The second frame and the associated label introduce the spinbox widget (lines 82–103). This is also used to control user input when entering numeric values.
It is very similar to the scale widget, allowing for the setting of the lower and upper boundaries of the accepted values, with two main differences: (a) it is visually different, and (b) the user may directly enter a value to the textual part of the widget, and/or control it with the increase/decrease arrows. As in the previous case, another label is added to the frame as an extra visual aid.

The third frame introduces another scale widget (lines 105–126). This is different to the one used in the first frame in that (a) it is visually different and restricted as to its visual attributes (i.e., it is not offering several of the tk widget options), and (b) it belongs to the ttk class/library instead of tk. The reader should notice the distinctly different visual results of the two scale widgets.

Observation 4.26: Use the spinbox widget to create a controlled mechanism that will accept numerical user input, while also allowing direct input.

The third part defines the four methods used to control the interaction between the user and the application (lines 16–46). The reader should note that three of the methods (i.e., onScale(val), getSpeedLimit(), and getFinePerKm(val)) are directly associated with widgets currentSpeedScale, speedLimitSpinbox, and finePerKmScale, respectively. This is done through the command option. More specifically, when the user interacts with a particular widget, the resulting values are captured and the respective methods are called for the calculation of the fine. In the case of the scale widget, the value is passed with the call to the method. This is the case for both tk and ttk.
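Since the value handed to a callback is converted with float() before use, it can help to keep the conversion and the arithmetic in one small function that does not depend on any widget. The sketch below is an illustration only: compute_fine is a hypothetical name, and returning None when no fine is due is an assumption that mirrors the 'No fine' label of the script.

```python
def compute_fine(current_speed, speed_limit, fine_per_km):
    """Apply fine = (current speed - speed limit) * fine per km/h.

    Arguments may be numbers or numeric strings, as received from a
    widget callback. Returns None when the speed is within the limit.
    """
    diff = float(current_speed) - float(speed_limit)
    if diff <= 0:
        return None
    return diff * float(fine_per_km)


print(compute_fine('120.5', 100, 4))
print(compute_fine(90, 100, 4))
```

Keeping the formula separate in this way means it can be checked with plain values, without starting the GUI.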
The reader should observe (a) the use of the set() and get() methods applied to the variable objects associated with the widgets in order to update and read the widget values, (b) the use of the casting functions (i.e., float(), int(float())) to control the type of numerical values used in the calculation, and (c) the declaration of the global variables that are used in the methods. At the end of each of these methods the calculateFine() method is called to perform the associated calculation.

4.3.2 The Listbox and Combobox Widgets inside LabelFrames

Two of the most well-known widgets used in programming are the listbox and the combobox. These widgets are used to present the user with lines of text as a list, with the purpose of allowing them to make a selection. This selection can be also used to synchronize the contents between multiple instances of different widgets. The programmer can be creative as to the appearance of the widgets, as it is possible to manipulate their visual attributes, despite the fact that the basic form cannot be modified. The main difference between the two widgets is that the former provides an open list whereas the latter is a collapsed list that opens upon the user’s click.

Another widget which can help further enhance the appearance of an application is the labelframe widget. This widget is similar to the frame widget, but it allows for a label to be specified on the frame itself, thus removing the need for the creation of an extra label widget in the frame. Some of the visual attributes of this widget (including those related to the label font) can be manipulated.

In this section, two additional libraries are introduced: random and time. The former is introduced in order to use method randint() that generates random numbers, and the latter in order to use process_time() that records the starting and/or ending time of a particular process.
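As a GUI-free sketch of how these two functions are typically combined, the following generates a list of random integers and measures the processor time spent sorting it. The helper name timed_sort, the list size, and the use of the built-in sorted() function (rather than a hand-written sorting algorithm) are assumptions made to keep the example short.

```python
import random
import time


def timed_sort(size, low=-100, high=100):
    """Return (sorted list, elapsed CPU seconds) for `size` random integers."""
    # randint() includes both end points of the range
    numbers = [random.randint(low, high) for _ in range(size)]
    start = time.process_time()   # mark the start of the process
    ordered = sorted(numbers)
    end = time.process_time()     # mark the end of the process
    return ordered, end - start


ordered, elapsed = timed_sort(1000)
print(len(ordered), round(elapsed, 5))
```

The elapsed value depends on the machine and may well be reported as 0.0 for small lists, since process_time() counts processor time rather than wall-clock time.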
The following Python script allows the user to select a number of randomly generated integers in order to populate a listbox. Subsequently, it sorts this list into a second listbox before displaying the size of the list, the sum of the numbers and their average, and the processing time for completing the sorting process:

Observation 4.27 – listbox, combobox: Use the listbox and combobox widgets to display lists of lines of text, select one or more of these lines and, synchronize their contents as necessary.

Observation 4.28 – labelframe: As with the frame widget, one can use the labelframe widget without the need to create an extra label for descriptions. The same options as with the frame and label widgets apply.

Observation 4.29 – randint(): Use the randint() method of the random library to generate random numbers within a specified range.

Observation 4.30 – process_time(): Use the process_time() method of the time library to mark a particular moment in time and use it to count the time elapsed for a given process.

  1  # Import the necessary modules
  2  import tkinter as tk
  3  from tkinter import ttk
  4  from tkinter import *
  5  import random
  6  import time
  7
  8  # Initialise various lists used by the listboxes, comboboxes, & methods
  9  unsortedL = []; sortedL = []; statisticsData = [];
 10  sizes = [5, 20, 100, 1000, 10000, 20000]
 11  global UnsortedList, SortedList
 12  global startTime, endTime, ListSizeSelection, size
 13  global UnsortedListScrollBar, SortedListScrollBar
 14  global EntryFrame, UnsortedFrame, SortedFrame
 15
 16  # Populate the unsorted list with random numbers and
 17  # the unsorted listbox
 18  def populateUnsortedList():
 19      global size
 20      global UnsortedListScrollBar
 21      global UnsortedList
 22      global ListSizeSelection
 23
 24      # Read the number of elements as they are selected from the combobox
 25      size = int(ListSizeSelection.get())
 26
 27      # randint() method of the random class generates random integers
 28      for i in range(size):
 29          n = random.randint(-100, 100)
 30          # Enter the generated random integer to the relevant place in the
 31          # unsorted list
 32          unsortedL.insert(i, n)
 33
 34      # Populate the listbox with the elements of the unsorted list
 35      for i in range(0, size):
 36          UnsortedList.insert(i, unsortedL[i])
 37
 38      UnsortedListScrollBar.config(command = UnsortedList.yview)
 39
 40  # Use Bubble sort to sort the list & record the statistics for later use
 41  def sortToSortedList():
 42      global size, startTime, endTime
 43      global SortedListScrollBar
 44      global SortedList
 45
 46      # Load the unsorted list and listbox to the sorted list and listbox
 47      for i in range(0, size):
 48          sortedL.insert(i, unsortedL[i])
 49
 50      # Start the timer
 51      startTime = time.process_time()
 52
 53      # The Bubble sort algorithm
 54      for i in range(0, size-1):
 55          for j in range(0, size-1):
 56              if (sortedL[j] > sortedL[j+1]):
 57                  temp = sortedL[j]
 58                  sortedL[j] = sortedL[j+1]
 59                  sortedL[j+1] = temp
 60      # End the timer
 61      endTime = time.process_time()
 62
 63      # Load the sorted list to the relevant listbox
 64      for i in range(0, size):
 65          SortedList.insert(i, sortedL[i])
 66
 67      SortedListScrollBar.config(command = SortedList.yview)
 68
 69  # Clear all lists, listboxes, & comboboxes, & the global size variable
 70  def clearLists():
 71      global size
 72      sortedL.clear()
 73      unsortedL.clear()
 74      UnsortedList.delete('0', 'end')
 75      SortedList.delete('0', 'end')
 76      statisticsData.clear()
 77      StatisticsCombo.delete('0', 'end')
 78
 79  # Calculate and report the statistics from the sorting process
 80  def statistics():
 81      global size, startTime, endTime
 82      statisticsData.clear()
 83      statisticsData.insert(1, 'The size of the lists is ' + str(size))
 84      statisticsData.insert(2,'The sum of the lists is '+str(sum(sortedL)))
 85      statisticsData.insert(3, 'The time passed to sort the list was ' \
 86          + str(round(endTime - startTime, 5)))
 87      statisticsData.insert(4, 'The average of the sorted list is: ' \
 88          + str(round(sum(sortedL) / size, 2)))
 89      StatisticsCombo['values'] = statisticsData
 90
 91  # ===================================================================
 92  # Define the methods that will create the GUI of the application
 93  def createGUI():
 94      unsortedFrame()
 95      entryFrame()
 96      entryButton()
 97      sortButton()
 98      sortedFrame()
 99      clearButton()
100      statisticsButton()
101      statisticsSelection()
102
103  # Create the labelframe & place the Unsorted Array Listbox widgets in it
104  def unsortedFrame():
105      global unsortedList
106      global UnsortedListScrollBar
107      global UnsortedList
108      global winFrame
109      global UnsortedFrame
110
111      UnsortedFrame = tk.LabelFrame(winFrame, text = 'Unsorted Array')
112      UnsortedFrame.config(bg='light grey',fg='blue',bd=2, relief='sunken')
113      # Create a scrollbar widget to attach to the UnsortedList
114      UnsortedListScrollBar = Scrollbar(UnsortedFrame, orient = VERTICAL)
115      UnsortedListScrollBar.pack(side = RIGHT, fill = Y)
116      # Create the listbox in the Unsorted Array frame
117      UnsortedList = tk.Listbox(UnsortedFrame, bg='cyan', width=13, bd=0,
118          height = 12, yscrollcommand = UnsortedListScrollBar.set)
119      UnsortedList.pack(side = LEFT, fill = BOTH)
120      # Associate the scrollbar command with its parent widget,
121      # i.e., the UnsortedList yview
122      UnsortedListScrollBar.config(command = UnsortedList.yview)
123      # Place the Unsorted frame and its parts into the interface
124      UnsortedFrame.pack(); UnsortedFrame.place(relx = 0.02, rely = 0.05)
125
126  # Create the labelframe to include the Entry widget
127  def entryFrame():
128      global unsortedList
129      global UnsortedListScrollBar
130      global ListSizeSelection
131      global EntryFrame
132      global winFrame
133
134      EntryFrame = tk.LabelFrame(winFrame, text = 'Actions')
135      EntryFrame.config(bg='light grey', fg='red', bd=2, relief = 'sunken')
136      EntryFrame.pack(); EntryFrame.place(relx = 0.25, rely = 0.05)
137      # Create the label in the Entry frame
138      EntryLabel = tk.Label(EntryFrame, text='How many integers\nin the list',
139          width = 16)
140      EntryLabel.config(bg = 'light grey', fg='red', bd = 3,
141          relief = 'flat', font = 'Arial 14 bold')
142      EntryLabel.grid(column = 0, row = 0)
143      # Create the combobox to select the number of elements in the lists
144      ListSizeSelection = tk.IntVar()
145      ListSizeCombo = ttk.Combobox(EntryFrame,
146          textvariable=ListSizeSelection, width = 10)
147      ListSizeCombo['values'] = sizes
148      ListSizeCombo.current(0)
149      ListSizeCombo.grid(column = 1, row = 0)
150
151  # Create button to insert new entries into the unsorted array & listbox
152  def entryButton():
153      global EntryFrame
154
155      EntryButton = tk.Button(EntryFrame, text = 'Populate\nUnsorted list',
156          relief = 'raised', width = 16)
157      EntryButton.bind('<Button-1>', lambda event: populateUnsortedList())
158      EntryButton.grid(column = 0, row = 2)
159
160  # Create the button that will sort the numbers and display them
161  # in the sorted array and listbox
162  def sortButton():
163      global EntryFrame
164
165      SortButton=tk.Button(EntryFrame,text='Sort numbers\nwith BubbleSort',
166          relief = 'raised', width = 16)
167      SortButton.bind('<Button-1>', lambda event: sortToSortedList())
168      SortButton.grid(column = 1, row = 2)
169
170  # Create the labelframe to include the Sorted Array Listbox widgets
171  def sortedFrame():
172      global sortedList
173      global SortedListScrollBar
174      global SortedList
175      global winFrame
176      global SortedFrame
177
178      SortedFrame = tk.LabelFrame(winFrame, text = 'Sorted Array')
179      SortedFrame.config(bg='light grey', fg='blue', bd=2, relief='sunken')
180      # Create a scrollbar widget to attach to the SortedList
181      SortedListScrollBar = Scrollbar(SortedFrame)
182      SortedListScrollBar.pack(side = RIGHT, fill = Y)
183      # Create the listbox in the Sorted Array frame
184      SortedList = tk.Listbox(SortedFrame, bg='cyan', width=13, height=12,
185          yscrollcommand = SortedListScrollBar.set, bd = 0)
186      SortedList.pack(side = LEFT, fill = BOTH)
187      # Associate the scrollbar command with its parent widget,
188      # i.e., the SortedList yview
189      SortedListScrollBar.config(command = SortedList.yview)
190      # Place the Sorted frame and its parts into the interface
191      SortedFrame.pack(); SortedFrame.place(relx = 0.75, rely = 0.05)
192
193  # Create the button that will clear the two listboxes and the two lists
194  def clearButton():
195      global EntryFrame
196
197      ClearButton = tk.Button(EntryFrame, text = 'Clear lists',
198          relief = 'raised', width = 16)
199      ClearButton.bind('<Button-1>', lambda event: clearLists())
200      ClearButton.grid(column = 0, row = 3)
201
202  # Create the button that will display the statistics for the sorting
203  def statisticsButton():
204      global EntryFrame
205
206      StatisticsButton = tk.Button(EntryFrame, text = 'Show statistics',
207          relief = 'raised', width = 16)
208      StatisticsButton.bind('<Button-1>', lambda event: statistics())
209      StatisticsButton.grid(column = 1, row = 3)
210
211  # Create the option menu that will show the statistical results
212  # from the sorting process
213  def statisticsSelection():
214      global EntryFrame
215      global StatisticsCombo
216
217      StatisticsSelection = tk.StringVar()
218      statisticsData = ['The statistics will appear here']
219      StatisticsSelection.set(statisticsData[0])
220      StatisticsCombo = ttk.Combobox(EntryFrame, width = 30,
221          textvariable = StatisticsSelection)
222      StatisticsCombo['values'] = statisticsData
223      StatisticsCombo.grid(column = 0, columnspan = 2, row = 4)
224
225  # ===================================================================
226  # Create the main frame for the application
227  winFrame = tk.Tk()
228  winFrame.title("Bubble Sort"); winFrame.config(bg = 'light grey')
229  winFrame.resizable(True, True); winFrame.geometry('650x300')
230
231  createGUI()
232
233  winFrame.mainloop()

[Output 4.3.2: screenshot of the running application]

Initially, the necessary
libraries are imported (i.e., tkinter, time, and random, lines 2–6). Next, the various lists, variables, and listboxes are initialized (lines 9–14). Note that the lists are not defined as global, since they are accessed by reference by all methods in the script by default. It must be also noted that different types of objects and/or variables must be declared as global in separate lines, since declaring them together may raise errors. After initialization, the main frame is created and configured in lines 227–229. The next step is to create the application interface. In this case, the interface consists of two distinct parts. The first includes two listboxes created and placed inside the associated labelframes (lines 103–124 and 170–191). The use of labelframes makes the creation of additional labels obsolete. The visual properties of the listboxes can be configured through their options, which are almost Graphical User Interface Programming 137 identical to those of an entry widget. The listboxes can be populated at run time using the insert(index, Observation 4.31 – insert(), value) method, and cleared at run time using the delete(): Use the insert() and delete(index, index) method. Likewise, the delete() methods to populate or properties/options of the labelframes are similar to those clear a listbox. of regular frames and labels. The second part is to create the labelframe that hosts Observation 4.32 – [“values”]: Use the comboboxes and the buttons required in the applithe [“values”] property to popucation. The purpose of the first combobox is to display late a combobox with an initial list of the number of random integers in the unsorted list. The values. second one displays basic statistics related to the sorting process, the size of the lists, the sum and average of the integers, and the time required to sort the list. 
There are three notable observations related to the creation and use of the comboboxes (lines 143–149 and 211–223). Firstly, they must include a ["values"] list which will take its values from an associated list. The latter can be initially empty or populated. Secondly, their selection value (i.e., the textvariable option) must be associated with an object of the IntVar() class (or any similar alternative) that will store it for further use; the value is then read through that object's get() method. Thirdly, the currently selected value must be defined through the current(index) method.

Observation 4.33 – textvariable: Use the textvariable option of the combobox to associate it with an IntVar() object that will store the selected value.

Observation 4.34 – current(): Use the current() method to define the currently selected value of the combobox.

Observation 4.35 – get(): It is necessary to use the get() method to read from the IntVar() object, as its value is not exposed as a plain attribute.

Observation 4.36 – clear(): Use the clear() method to clear the values of the lists.

The last step is to create the interaction between the user and the application. For this purpose, four buttons are created and bound with click events to trigger the respective methods. These methods populate, sort, and clear the relevant lists, and display the basic statistics. The populateUnsortedList() method uses the randint() method to generate random integers, and the insert() method to populate the unsorted list (lines 16–38). It is worth noting the declaration of global variable size, and the use of the get() method to read the value held by the ListSizeSelection object (line 25).
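Setting the widgets aside, the data handling behind the populate, sort, and statistics buttons can be sketched in a few lines. This is a minimal sketch rather than the script's own code: the function names populate, bubble_sort, and statistics are hypothetical stand-ins, the value range passed to randint() is an assumption, and the sort includes a common early-exit check.

```python
import random
import time


def populate(size, low=1, high=100):
    """Return `size` random integers (the value range is an assumption)."""
    return [random.randint(low, high) for _ in range(size)]


def bubble_sort(values):
    """Return a sorted copy of `values` using a plain Bubble Sort."""
    result = list(values)
    for end in range(len(result) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:        # no swaps in this pass: already sorted
            break
    return result


def statistics(values, elapsed):
    """Collect the figures that a statistics display could show."""
    total = sum(values)
    average = round(total / len(values), 2) if values else 0
    return {"size": len(values), "sum": total,
            "average": average, "seconds": round(elapsed, 4)}


if __name__ == "__main__":
    unsorted_list = populate(10)
    start_time = time.process_time()      # mark the start of the sort
    sorted_list = bubble_sort(unsorted_list)
    end_time = time.process_time()        # mark the end of the sort
    print(sorted_list)
    print(statistics(sorted_list, end_time - start_time))
```

Keeping these steps free of widget calls makes them easy to check on their own; in a GUI script the results would then be copied into the listboxes with insert() and into the combobox through its ["values"] property.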
The sortToSortedList() method (lines 40–67) declares global variables size, startTime and endTime, uses the process_time() method to mark the start and end of the sort process, and utilizes a common Bubble Sort algorithm to sort the list and populate the sortedList. The clearLists() method uses methods clear() to clear the values of the lists and delete() to delete the values of the listboxes (lines 69–77). Finally, the statistics() method uses methods sum() and round() to produce the basic statistics that will be displayed (lines 79–89).

Observation 4.37 – xview, yview, xscrollcommand, yscrollcommand: Use the scrollbar widget to attach a scrollbar to the associated widget (usually a listbox). Use xview or yview to control its orientation (i.e., horizontal or vertical). Use the xscrollcommand or the yscrollcommand to activate it.

The reader should observe the use of the scrollbar widget introduced in this script. The idea behind, and the use of, this particular widget is intuitive and quite straightforward. Firstly, the labelframe inside which the scrollbar operates is created. Next, the scrollbar is created and connected (packed) to the parent widget (i.e., in this case the associated labelframe), specifying its orientation and positioning. Lastly, the widget/object that will make use of the scrollbar is created and associated with the scrollbar through either yscrollcommand or xscrollcommand (depending on whether the scrollbar orientation is vertical or horizontal respectively), and configured to scroll the contents of the attached widget (lines 38, 120–124, and 67, 187–191).

4.3.3 GUIs with CheckButtons, RadioButtons and SimpleMessages

In addition to listboxes and comboboxes, there are two more widgets that users of windows-based applications are familiar with, namely checkbuttons and radiobuttons. These widgets allow the user to make one or more selections from a set of different available options/actions.
Their main difference is that while in the case of checkbuttons the user may select more than one option at any given time, radiobuttons only allow a single selection from the set of available options. Finally, another handy tkinter facility the reader should be familiar with is the message box (the messagebox module). In this section the most basic form of it will be introduced and explained.

The following script implements an interface that includes two listboxes with associated, attached vertical scrollbars. The listboxes are populated with the names of various countries and their capital cities. It also includes two entry boxes for accepting new entries to the listboxes. Insertions are triggered using the associated button-click events. The selections in the two listboxes are synchronized when the user double-clicks an entry in either of them. The interface also includes four buttons that handle the interaction between the application and the user, allowing for the insertion and deletion of particular entries, the clearance of all entries from both containers, and exiting the application.
Finally, two checkbuttons control whether the relevant containers are enabled or not, and two radiobuttons whether they are visible:

  1  import tkinter as tk
  2  from tkinter import *
  3  from tkinter import ttk
  4  from tkinter import messagebox
  5
  6  countries = ['E.U.', 'U.S.A.', 'Russia', 'China', 'India', 'Brazil']
  7  capital = ['Brussels', 'Washington', 'Moscow', 'Beijing', 'New Delhi',
  8      'Brasilia']
  9  global newCountry, newCapital
 10  global CountriesFrame, CapitalFrame
 11  global checkButton1, checkButton2
 12  global radioButton
 13  global CountriesList, CapitalList
 14  global CountriesScrollBar, CapitalScrollBar
 15
 16  # Create the interface for the listboxes
 17  def drawListBoxes():
 18      global CountriesList, CapitalList
 19      global CountriesFrame, CapitalFrame
 20      global CountriesScrollBar, CapitalScrollBar
 21
 22      # Create CountriesFrame labelframe; place CountriesList widget in it
 23      CountriesFrame = tk.LabelFrame(winFrame, text = 'Countries')
 24      CountriesFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
 25          width = 15, relief = 'sunken')
 26      # Create a scrollbar widget to attach to the CountriesList
 27      CountriesScrollBar = Scrollbar(CountriesFrame, orient = VERTICAL)
 28      CountriesScrollBar.pack(side = RIGHT, fill = Y)
 29      # Create the listbox in the CountriesFrame
 30      CountriesList = tk.Listbox(CountriesFrame, bg = 'cyan', width = 15,
 31          height = 8, yscrollcommand = CountriesScrollBar.set)
 32      CountriesList.pack(side = LEFT, fill = BOTH)
 33      # Associate the scrollbar command with its parent widget,
 34      # (i.e., the CountriesList yview)
 35      CountriesScrollBar.config(command = CountriesList.yview)
 36      # Place the Countries frame and its parts on the interface
 37      CountriesFrame.pack(); CountriesFrame.place(relx = 0.03, rely = 0.05)
 38      CountriesList.bind('<Double-Button-1>',
 39          lambda event: alignList('countries'))
 40
 41      # Create the CapitalFrame labelframe; place CapitalList widget on it
 42      CapitalFrame = tk.LabelFrame(winFrame, text = 'Countries Capital')
 43      CapitalFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
 44          width = 13, relief = 'sunken')
 45      # Create a scrollbar widget to attach to the CapitalFrame
 46      CapitalScrollBar = Scrollbar(CapitalFrame, orient = VERTICAL)
 47      CapitalScrollBar.pack(side = RIGHT, fill = Y)
 48      # Create the listbox in the CapitalFrame
 49      CapitalList = tk.Listbox(CapitalFrame, bg = 'cyan',
 50          yscrollcommand = CapitalScrollBar.set, width = 16, height = 8, bd = 0)
 51      CapitalList.pack(side = LEFT, fill = BOTH)
 52      # Associate the scrollbar command with the CapitalList yview
 53      CapitalScrollBar.config(command = CapitalList.yview)
 54      CapitalFrame.pack(); CapitalFrame.place(relx = 0.70, rely = 0.05)
 55      CapitalList.bind('<Double-Button-1>', lambda event: alignList('capital'))
 56  # Create the interface for the new entries
 57  def drawNewEntries():
 58      global newCountry, newCapital
 59
 60      # Create the labelframe and place the newCountry entry widget on it
 61      NewCountryFrame = tk.LabelFrame(winFrame, text = 'New Country')
 62      NewCountryFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
 63          width = 13, relief = 'sunken')
 64      NewCountryFrame.pack(); NewCountryFrame.place(relx= 0.03, rely = 0.75)
 65      newCountry = tk.StringVar(); newCountry.set('')
 66      NewCountryEntry = tk.Entry(NewCountryFrame, textvariable = newCountry,
 67          width = 15)
 68      NewCountryEntry.config(bg= 'dark grey', fg = 'red', relief = 'sunken')
 69      NewCountryEntry.grid(row = 0, column = 0)
 70
 71      # Create the labelframe and place the newCapital entry widget on it
 72      NewCapitalFrame = tk.LabelFrame(winFrame, text = 'New Capital')
 73      NewCapitalFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
 74          width = 13, relief = 'sunken')
 75      NewCapitalFrame.pack(); NewCapitalFrame.place(relx= 0.70, rely = 0.75)
 76      newCapital = tk.StringVar(); newCapital.set('')
 77      NewCapitalEntry = tk.Entry(NewCapitalFrame, textvariable = newCapital,
 78          width = 15)
 79      NewCapitalEntry.config(bg= 'dark grey', fg = 'red', relief = 'sunken')
 80      NewCapitalEntry.grid(row = 0, column = 0)
 81
 82  # Create the interface for the action buttons
 83  def drawButtons():
 84      # Create the frame that will host the buttons
 85      ButtonsFrame = tk.Frame(winFrame)
 86      ButtonsFrame.config(bg= 'light grey', bd=2, width=14, relief='sunken')
 87      ButtonsFrame.pack(); ButtonsFrame.place(relx = 0.30, rely = 0.07)
 88
 89      newRecordButton = tk.Button(ButtonsFrame, text = 'Insert\nnew record',
 90          width = 11, height = 2)
 91      newRecordButton.grid(row = 0, column = 0)
 92      newRecordButton.bind('<Button-1>', lambda event,
 93          a = 'insertRecord': buttonsClicked(a))
 94
 95      deleteRecordButton = tk.Button (ButtonsFrame, text = 'Delete\n record',
 96          width = 11, height = 2)
 97      deleteRecordButton.grid (row = 0, column = 1)
 98      deleteRecordButton.bind('<Button-1>', lambda event,
 99          a = 'deleteRecord': buttonsClicked(a))
100
101      clearRecordsButton = tk.Button (ButtonsFrame, text = 'Clear\n records',
102          width = 11, height = 2)
103      clearRecordsButton.grid (row = 1, column = 0)
104      clearRecordsButton.bind('<Button-1>', lambda event,
105          a = 'clearAllRecords': buttonsClicked(a))
106
107      exitButton = tk.Button(ButtonsFrame, text='Exit', width=11, height=2)
108      exitButton.grid (row = 1, column = 1)
109      exitButton.bind('<Button-1>',
110          lambda event: (winFrame.destroy(), exit()))
111
112  # Create the interface for the checkbuttons
113  def drawCheckButtons():
114      global checkButton1, checkButton2
115
116      # Create the frame that will host the checkbuttons
117      CheckButtonsFrame = tk.Frame(winFrame)
118      CheckButtonsFrame.config(bg = 'light grey', bd = 2, relief = 'sunken')
119      CheckButtonsFrame.pack();CheckButtonsFrame.place(relx=0.34, rely=0.43)
120
121      checkButton1 = IntVar(value = 1)
122      CountriesCheckButton = tk.Checkbutton (CheckButtonsFrame,
123          variable = checkButton1, text = 'Countries \nenabled/disabled',
124          bg = 'light blue', onvalue = 1, offvalue = 0, width = 15,
125          height = 2, command = checkClicked).grid(row = 0, column = 0)
126
127      checkButton2 = IntVar(value = 1)
128      CapitalCheckButton = tk.Checkbutton (CheckButtonsFrame,
129          variable = checkButton2, onvalue = 1, offvalue = 0,
130          text = 'Capitals \nenabled/disabled', width = 15, height = 2,
131          bg = 'light blue', command = checkClicked).grid (row=1, column=0)
132
133  # Create the interface for the radiobuttons
134  def drawRadioButtons():
135      global radioButton
136
137      # Create the frame that will host the radiobuttons
138      RadioButtonsFrame = tk.Frame(winFrame)
139      RadioButtonsFrame.config(bg = 'light grey', bd = 2, relief = 'sunken')
140      RadioButtonsFrame.pack();RadioButtonsFrame.place(relx=0.31, rely=0.78)
141
142      radioButton = IntVar()
143      visibleRadioButton = tk.Radiobutton (RadioButtonsFrame,
144          text = 'Containers \nvisible', width = 8, height = 2,
145          bg = 'light green', variable = radioButton, value = 1,
146          command = radioClicked).grid(row = 0, column = 0)
147
148      invisibleRadioButton = tk.Radiobutton (RadioButtonsFrame,
149          text = 'Containers \ninvisible', width = 8, height = 2,
150          bg = 'light green', variable = radioButton, value = 2,
151          command = radioClicked).grid(row = 0, column = 1)
152
153      radioButton.set(1)
154
155  # Define method alignList that will identify the selected row
156  # in any of the listboxes and align it with the corresponding row of the other
157  def alignList(a):
158      global CountriesList, CapitalList
159      global selectedIndex
160
161      if (a == 'countries'):
162          selectedIndex = int(CountriesList.curselection()[0])
163          CapitalList.selection_set(selectedIndex)
164
165      if (a == 'capital'):
166          selectedIndex = int(CapitalList.curselection()[0])
167          CountriesList.selection_set(selectedIndex)
168
169  # Define checkClicked method to control the state of the containers
170  def checkClicked():
171      global checkButton1, checkButton2
172
173      # Control the state of the containers as NORMAL or DISABLED
174      # based on the state of the checkbuttons
175      if (checkButton1.get() == 1):
176          CountriesList.config(state = NORMAL)
177      else:
178          CountriesList.config(state = DISABLED)
179
180      if (checkButton2.get() == 1):
181          CapitalList.config(state = NORMAL)
182      else:
183          CapitalList.config(state = DISABLED)
184
185  # Define the radioClicked method that will display or hide the frames
186  # of the containers
187  def radioClicked():
188      global CountriesFrame, CapitalFrame
189      global radioButton
190
191      # Use the destroy() method to destroy the frames of the containers.
192      # The lists are not destroyed
193      CountriesFrame.destroy()
194      CapitalFrame.destroy()
195
196      if (radioButton.get() == 1):
197          drawListBoxes()
198          populate()
199
200  # Populate the listboxes
201  def populate():
202      global CountriesList, CapitalList
203      global selectedIndex
204
205      for i in range (int(len(countries))):
206          CountriesList.insert(i, countries[i])
207
208      for i in range (int(len(capital))):
209          CapitalList.insert(i, capital[i])
210
211  # Define method buttonsClicked that will trigger the corresponding code
212  # when any of the buttons is clicked
213  def buttonsClicked(a):
214      global CountriesList, CapitalList
215      global newCountry, newCapital
216      global selectedIndex
217
218      if (a == "insertRecord"):
219          if (newCountry.get() != '' and newCapital.get() != ''):
220              countries.append(newCountry.get()); CountriesList.delete('0', 'end')
221              capital.append(newCapital.get()); CapitalList.delete('0', 'end')
222              # Call method populate() to re-populate the containers
223              # with the renewed lists
224              populate()
225
226      if (a == 'deleteRecord'):
227          # Use messagebox.askyesno() to pop a confirmation message
228          # for deleting the elements
229          deleteElementOrNot=messagebox.askyesno(title="Delete element",
230              message="Are you ready to delete the elements?", icon='info')
231          if (deleteElementOrNot == True):
232              # Use the pop() method to remove selected elements from the lists
233              countries.pop(selectedIndex); capital.pop(selectedIndex)
234              CountriesList.delete('0', 'end'); CapitalList.delete('0', 'end')
235              # Call method populate() to re-populate the containers
236              # with the renewed lists
237              populate()
238
239      if (a == 'clearAllRecords'):
240          # Use messagebox.askyesno() to pop a confirmation message
241          # for clearing the lists
242          clearListsOrNot=messagebox.askyesno(title="Clear all elements",
243              message = "Are you ready to clear the lists?", icon = 'info')
244          if (clearListsOrNot == True):
245              countries.clear(); capital.clear()
246              CountriesList.delete('0', 'end'); CapitalList.delete('0', 'end')
247              # Call method populate() to re-populate the containers
248              # with the renewed lists
249              populate()
250
251  # Create the frame for the Countries program and configure its size
252  # and background color
253  winFrame = tk.Tk()
254  winFrame.title ('Countries')
255  winFrame.geometry("500x250")
256  winFrame.config (bg = 'light grey')
257  winFrame.resizable(False, False)
258
259  # Create the Graphical User Interface
260  drawListBoxes()
261  drawNewEntries()
262  drawButtons()
263  drawCheckButtons()
264  drawRadioButtons()
265
266  # Call populate() to populate the listboxes
267  populate()
268
269  winFrame.mainloop()

Output 4.3.3:

Observation 4.38 – destroy(), exit(): Use methods destroy() to destroy the interface (i.e., the widgets of the particular frame it applies to) and exit() to exit the application.

As in previous examples, the first part of the application deals with drawing the interface. In this particular case this task is assigned to methods drawListBoxes(), drawNewEntries(), drawButtons(), drawCheckButtons(), and drawRadioButtons(). Method drawListBoxes() (lines 16–55) creates the relevant frames and containers.
The reader should note the call to method alignList() that causes the contents of the two containers to be aligned, and the use of the relx and rely options that position the respective frames in the appropriate places within the interface. The drawNewEntries() method (lines 56–80) creates the entry widgets that will accept the user's input for new entries. Observe how the entry widgets are associated with the respective StringVar() objects that allow the use of the input through the appropriate set() and get() methods. Similarly, the drawButtons() method (lines 82–110) creates the frame and places the buttons that perform the basic actions of the application (i.e., insert a new entry, delete a selected entry, clear all contents of the containers, and exit the application). In the case of the Exit button in particular, one should note the use of the destroy() method that destroys the interface of the main window, and the exit() method that exits the application.

Observation 4.39 – checkbutton, onvalue, offvalue: Use the checkbutton widget to offer selection options. Each option is represented by a separate widget. If an option is selected, the widget is given an onvalue, otherwise it is given an offvalue through the associated IntVar() object.

Observation 4.40 – radiobutton: Use the radiobutton widget to offer a number of mutually exclusive options. Each option is represented by a different widget. If an option is selected, the widgets are given a particular value through the associated IntVar() object.

The drawCheckButtons() method (lines 112–131) creates the frame for the checkbutton widgets. Notice how each of the checkbuttons is associated (bound) with a separate IntVar() object to monitor its state (i.e., onvalue = 1 if it is checked or offvalue = 0 if it is unchecked). The reader should also notice that when the user checks/unchecks the checkbutton the checkClicked() method is triggered through the command option. This controls whether the respective container is enabled. Likewise, in the case of drawRadioButtons() (lines 133–153), two radiobuttons are placed in the relevant frame and trigger the radioClicked() method through the command option. This controls the visibility of the containers as a whole. It is important to note that in such cases where multiple radiobuttons are associated/bound with the same IntVar() object, only one can be selected.

Observation 4.41 – command, checkbutton, radiobutton: Use the command option to trigger a particular action when any of the checkButton or radioButton widgets are selected.

The second part of the application deals with the interactions that take place between the interface and the user and their results, through the use of methods alignList(), checkClicked(), radioClicked(), populate(), and buttonsClicked(). In the case of alignList() (lines 155–167), the curselection() method is applied to the relevant container (listbox) to identify the element of the container that was selected. Since the method returns a tuple, it is necessary to limit the result to the first element of the tuple (i.e., the [0] value). Once the element of the container is identified through its index, the selection_set() method is executed on the other container, selecting the element with the same index. Ultimately, this process synchronizes the two containers.

Observation 4.42 – curselection(): Use methods curselection() to identify the selected element from a listbox and selection_set() to select a particular indexed element.

Observation 4.43 – state, NORMAL, DISABLED: Use the state option to determine whether a particular listbox is enabled (NORMAL) or disabled (DISABLED).

In the case of the checkClicked() method (lines 169–183) the reader should note the following:

• The use of the state option and its two possible values (i.e., NORMAL and DISABLED), which determine whether the associated widget will be enabled or not.
More specifically, NORMAL dictates that the user is allowed to click in the relevant container and select one or more of its elements and DISABLED the opposite.
• The use of the get() method to access the value of objects checkButton1 and checkButton2. The reader is reminded that the values of these objects are read through such methods, since they are not exposed as plain attributes. The checkButton1 and checkButton2 objects are declared as global so that the method refers to the original objects created in the main application.

In the case of the radioClicked() method (lines 185–198), frames CountriesFrame and CapitalFrame are destroyed alongside their containers/listboxes (i.e., CountriesList and CapitalList) and are only recreated and repopulated if the user selects the appropriate visibleRadioButton from the interface (i.e., assigning a value of 1 to the radioButton object).

Observation 4.44 – append(), delete(), clear(): Use methods append() to append to a list (i.e., insert a new element at the end of the list), delete() to remove elements from a listbox, and clear() to clear all the elements of a list.

Finally, the buttonsClicked() method (lines 211–249) has three main tasks. Firstly, it inserts a new element in each of the listboxes when the user clicks the Insert button. In this case, the values of the newCountry and newCapital entry widgets are checked and, if not empty, appended to the relevant lists. Notice that it is preferable to append to the lists and not the listboxes, as the former host the actual values. The listboxes are repopulated only after this task is completed.

Secondly, the method has the task of deleting the selected elements from the listboxes when the user clicks the Delete button. In this case, as long as an element of the listboxes is selected, a simple messagebox pops up to confirm the user's choice.
Notice that the askyesno() method provides one of the simplest available forms of messages, and results in either True or False. The programmer can use these values to determine further actions. The reader should note that the messagebox module is part of the tkinter library. It is also noteworthy that the delete() method is used in the code to initially clear the listboxes of their contents, so that they can subsequently be re-populated from the refreshed lists. This particular method accepts as arguments the first and the last index in the range of elements that should be deleted from the listbox.

Similarly, a third task is to completely clear the listboxes of their contents. For this purpose, the clear() method is applied to both lists (but not the listboxes), provided that confirmation is given by the user through another simple messagebox interaction.

Observation 4.45 – askyesno(): Use the appropriate messagebox module method (e.g., askyesno()) to confirm the user's choice.

In all the cases discussed above, the populate() method (lines 200–209) is responsible for reading the lists and using their contents to populate the listboxes.

4.4 BASIC AUTOMATION AND USER INPUT CONTROL

A common characteristic of visual programming is the creation of the illusion that the application objects/widgets change shape, content, or status, either automatically or based on the user's input. If an object/widget is to be activated and put in operation automatically, the programmer needs to associate it with a respective time-controlled event. The latter enables the programmer to change the properties of the object/widget at run time, through the activation and execution of appropriate blocks of code that are based on the time-controlled event.
In this section, the reader will have the opportunity to get some exposure to the creation of applications that manipulate objects/widgets without the user's input, or with interactions of a different type than direct written input or button-click events. Throughout the section, a basic Traffic Lights application is gradually developed toward a primitive, but informative, automated user experience.

4.4.1 Traffic Lights Version 1 – Basic Functionality

The Traffic Lights sample project can start by creating a very basic application that uses three images (loaded in labels) displaying a green, a yellow, and a red traffic light, respectively. The three images can be programmed to appear and disappear based on the user's selection. The following Python script creates this interface and implements the related interactions:

  1  # Import libraries
  2  import tkinter as tk
  3  from tkinter import *
  4  # Import the necessary image processing classes
  5  from PIL import Image, ImageTk
  6
  7  global radioButton
  8  global image1, image2, image3
  9  global photo1, photo2, photo3
 10  global winLabel1, winLabel2, winLabel3
 11  global winFrame
 12
 13  # Create the main frame
 14  winFrame = tk.Tk()
 15  winFrame.title("Traffic Lights v1")
 16
 17  # Create the interface with the images and labels
 18  def photos():
 19      global radioButton
 20      global image1, image2, image3
 21      global photo1, photo2, photo3
 22
 23      image1 = Image.open("TrafficLightsGreen.gif")
 24      image1 = image1.resize((50, 100), Image.LANCZOS)
 25      photo1 = ImageTk.PhotoImage(image1)
 26      winLabel1 = tk.Label(winFrame, text = '', image = photo1,
 27          compound = 'left')
 28      winLabel1.grid(row = 0, column = 0)
 29
 30      image2 = Image.open("TrafficLightsYellow.gif")
 31      image2 = image2.resize((50, 100), Image.LANCZOS)
 32      photo2 = ImageTk.PhotoImage(image2)
 33      winLabel2 = tk.Label(winFrame, text = '', image = photo2,
 34          compound = 'left')
 35      winLabel2.grid(row = 0, column = 1)
 36
 37      image3 = Image.open("TrafficLightsRed.gif")
 38      image3 = image3.resize((50, 100), Image.LANCZOS)
 39      photo3 = ImageTk.PhotoImage(image3)
 40      winLabel3 = tk.Label(winFrame, text = '', image = photo3,
 41          compound = 'left')
 42      winLabel3.grid(row = 0, column = 2)
 43
 44      # Control active traffic lights based on the radio button selection
 45      if (radioButton.get() == 1):
 46          winLabel2.destroy()
 47          winLabel3.destroy()
 48
 49      if (radioButton.get() == 2):
 50          winLabel1.destroy()
 51          winLabel3.destroy()
 52
 53      if (radioButton.get() == 3):
 54          winLabel1.destroy()
 55          winLabel2.destroy()
 56
 57  # Create the radio button interface
 58  def drawRadioButtons():
 59      global radioButton
 60
 61      visibleGreenRadioButton = tk.Radiobutton (winFrame, text = 'Green',
 62          width=17, height=1, bg = 'light grey', variable = radioButton,
 63          value = 1, command = photos).grid(row = 1, column = 0)
 64
 65      visibleYellowRadioButton = tk.Radiobutton(winFrame, text='Yellow',
 66          width= 17, height= 1, bg= 'light grey', variable = radioButton,
 67          value = 2, command = photos).grid(row = 1, column = 1)
 68
 69      visibleRedRadioButton = tk.Radiobutton (winFrame, text = 'Red',
 70          width= 17, height= 1, bg= 'light grey', variable = radioButton,
 71          value = 3, command = photos).grid(row = 1, column = 2)
 72
 73  radioButton = IntVar()
 74  photos()
 75  drawRadioButtons()
 76
 77  winFrame.mainloop()

Output 4.4.1:

The output demonstrates the two main parts of the application. In the first part, the photos() method loads the three images and controls their visibility within the interface (lines 17–55). The reader will notice that part of the method is the destruction of two of the images, in order to leave only one on display (lines 44–55). For this task, the reader might also consider using the grid_remove() method (covered in previous sections), which will have the same result. The second part controls which of the three images will be displayed.
Once the desired radiobutton has been clicked upon, the corresponding image stays on display and the other two are hidden (lines 57–71). It is worth noting that all three radio buttons are associated with the same variable. This is reflected in the fact that they cancel each other when selected, as the value of the common associated object is altered.

4.4.2 Traffic Lights Version 2 – Creating a Basic Illusion

Taking things one step further, the application is changed in such a way as to make only one image appear instead of three. The impression that there is only one image is of course illusory, as it is essentially caused by manipulating the visual properties of the associated widget and/or its position in the interface. In this case, the traffic images are stacked upon each other using the same grid coordinates, and, subsequently, two of them are removed from the interface. This version is almost identical to the original one, with the exception of the positioning of the widgets and the slightly modified title. The proposed modification only requires the replacement of lines 15, 35, 42, 62, 66–67, and 70–71 with the ones provided below, which differ only in their grid coordinates and width (and, for line 15, in the title and window geometry):

 15  winFrame.title ("Traffic Lights v2"); winFrame.geometry("200x180")
     [...]
 35      winLabel2.grid(row = 0, column = 0)
     [...]
 42      winLabel3.grid(row = 0, column = 0)
     [...]
 62          width = 20, height = 1, bg = 'light grey', variable = radioButton,
     [...]
 66          width = 20, height = 1, bg = 'light grey', variable = radioButton,
 67          value = 2, command = photos).grid(row = 2, column = 0)
     [...]
 70          width = 20, height = 1, bg = 'light grey', variable = radioButton,
 71          value = 3, command = photos).grid(row = 3, column = 0)

Output 4.4.2

4.4.3 Traffic Lights Version 3 – Creating a Primitive Automation

In this version of the sample application, there is no need for the user to click on the respective radio buttons in order to cause the traffic light images to appear/disappear. The change happens automatically 3 seconds after one of the images is turned on (and the other two turned off). In order to enable timed functionality, in addition to the libraries used in the previous versions, the time library must be imported to the script. This version differs from the previous ones in a number of ways:

• The radio buttons that were dealing with the interaction are removed, and a new manageLabels() function is introduced to control the automated process of traffic light changes.
• Every time there is a change of the displayed image, the time.sleep() function (time library) is used to freeze the execution of the application for a given period of time (in this case 3 seconds).
• Since there are no radiobuttons, the application uses another object (trafficLight), to control which image is displayed. This is accomplished by setting its value through the set() method.
• The update() function is applied to the main frame in order to refresh the interface based on the latest status update.
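The cycle in the bullets above can be sketched without the image files. The sketch below is not the chapter's script: text labels stand in for the images, the names LIGHTS, DELAY_MS, next_light, and run_demo are hypothetical, and the delay is scheduled with the after() method of tkinter widgets rather than time.sleep(), an assumption made here so that the event loop stays responsive between changes.

```python
# Light values follow the chapter's convention: 1 = green, 2 = yellow, 3 = red
LIGHTS = {1: ("Green", "green"), 2: ("Yellow", "yellow"), 3: ("Red", "red")}
DELAY_MS = 3000          # 3 seconds per light, as in the text


def next_light(current):
    """Return the light that follows `current` in the 1 -> 2 -> 3 -> 1 cycle."""
    return current % 3 + 1


def run_demo():
    """Open a small window that cycles through the three lights."""
    # tkinter is imported here so next_light() can be reused without a display
    import tkinter as tk

    win = tk.Tk()
    win.title("Traffic Lights sketch")
    label = tk.Label(win, width=20, height=5)
    label.grid(row=0, column=0)
    traffic_light = tk.IntVar(value=1)

    def show():
        # Display the current light, then advance and re-schedule
        text, colour = LIGHTS[traffic_light.get()]
        label.config(text=text, bg=colour)
        traffic_light.set(next_light(traffic_light.get()))
        win.after(DELAY_MS, show)

    show()
    win.mainloop()
```

Calling run_demo() from a session with a display opens the window. Because after() returns immediately and lets mainloop() keep running, the window can still be moved or closed during each 3-second interval, which is the main trade-off against the simpler time.sleep() approach.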
The complete script is provided below:

  1  # Import libraries
  2  import tkinter as tk
  3  from tkinter import *
  4  # Import the necessary image processing classes
  5  from PIL import Image, ImageTk
  6  # Import the time library
  7  import time
  8
  9  global image1, image2, image3
 10  global photo1, photo2, photo3
 11  global winLabel1, winLabel2
 12  global winFrame
 13  global trafficLight
 14
 15  # Open the traffic images and create the relevant pointers
 16  def photos():
 17      global image1, image2, image3
 18      global photo1, photo2, photo3
 19
 20      image1 = Image.open("TrafficLightsGreen.gif")
 21      image1 = image1.resize((50, 100), Image.LANCZOS)
 22      photo1 = ImageTk.PhotoImage(image1)
 23
 24      image2 = Image.open("TrafficLightsYellow.gif")
 25      image2 = image2.resize((50, 100), Image.LANCZOS)
 26      photo2 = ImageTk.PhotoImage(image2)
 27
 28      image3 = Image.open("TrafficLightsRed.gif")
 29      image3 = image3.resize((50, 100), Image.LANCZOS)
 30      photo3 = ImageTk.PhotoImage(image3)
 31
 32  # Manage label visibility based on time.
 33  def manageLabels():
 34      global winLabel1, winLabel2
 35      global photo1, photo2, photo3
 36      global winFrame
 37      global trafficLight
 38
 39      if (trafficLight.get() == 1):
 40          winLabel1.config(image = photo1)
 41          winLabel1.grid(row = 0, column = 0)
 42          winLabel2.config(text = 'Green')
 43          time.sleep(3)
 44
 45      if (trafficLight.get() == 2):
 46          winLabel1.config(image = photo2)
 47          winLabel1.grid(row = 0, column = 0)
 48          winLabel2.config(text = 'Yellow')
 49          time.sleep(3)
 50
 51      if (trafficLight.get() == 3):
 52          winLabel1.config(image = photo3)
 53          winLabel1.grid(row = 0, column = 0)
 54          winLabel2.config(text = 'Red')
 55          time.sleep(3)
 56
 57      winFrame.update()
 58
 59  # Create the main frame
 60  winFrame = tk.Tk()
 61  winFrame.title ("Traffic Lights v3"); winFrame.geometry("200x180")
 62
 63  photos()
 64
 65  winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
 66  winLabel1.grid(row = 0, column = 0)
 67  winLabel2=tk.Label(winFrame,text='...'); winLabel2.grid(row=1,column=0)
 68
 69  trafficLight = IntVar()
 70  trafficLight.set(1)
 71
 72  while (True):
 73      if (trafficLight.get() == 1):
 74          trafficLight.set(2)
 75      elif (trafficLight.get() == 2):
 76          trafficLight.set(3)
 77      elif (trafficLight.get() == 3):
 78          trafficLight.set(1)
 79      manageLabels()
 80
 81  winFrame.mainloop()

Output 4.4.3:

4.4.4 Traffic Lights Version 4 – A Primitive Screen Saver with a Progress Bar

Having introduced the concept of timed events and how they can be used to control the flow of events in an application, it is rather straightforward to expand the same idea to the creation of an illusory movement of particular objects inside a frame. A good example of this is the creation of a primitive screen saver using the existing Traffic Lights application as a basis. In addition to the existing widgets, an additional widget that can be used in this scenario is the progressbar widget.
This will assist in making the screen saver a bit more informative, by providing clues about the elapsed and remaining time in any particular condition (i.e., green, yellow, and red traffic light). The widget belongs to the ttk library and can take several parameters that control its appearance and functionality, with the most important ones being length, orient, and mode. Length determines the size (i.e., length) of the progress bar, orient determines the orientation of the widget (i.e., VERTICAL or HORIZONTAL), and mode determines whether the displayed value is predetermined ("determinate") or undetermined ("indeterminate"). In the case of the former, the bar will appear moving toward one end of the widget until the specified value is reached, while in the case of the latter the bar will appear moving continuously from one end to the other and back.

Observation 4.46 – progressbar: Use the progressbar widget to display the progress of an event or task that takes a particular amount of time to complete. Progressbars can be horizontal or vertical, and can have a predetermined (determinate) or undetermined (indeterminate) value.

The following script implements a related example, where the three traffic lights control the movement of a car image (embedded in a label widget). When the green light is on, the car moves at a particular speed, and when the yellow light is on, it moves at half that speed. Similarly, when the red light is on, the car appears to stop, and the progressbar appears to be loading to reflect the elapsed time in this particular condition (i.e., red light) and the remaining time until the next condition is triggered (i.e., green light). The car image appears to bounce across the frame, moving in a different direction every time it reaches the edges of the parent frame. The movement of the car image is always diagonal, and follows four different directions.
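The relation between the traffic light, the step of the car, and the value shown by the progress bar can be sketched without any widgets. This is a minimal illustration, assuming ten updates per traffic light state and a step of 10 pixels for the green light. The function names step_for_light() and progress_percent() are hypothetical.

```python
def step_for_light(light, full_step=10):
    """Return the step in pixels for a traffic light state.

    Assumed convention: 1 = green (full step), 2 = yellow (half step),
    3 = red (no movement).
    """
    if light == 1:
        return full_step
    if light == 2:
        return full_step // 2
    return 0

def progress_percent(i, steps):
    """Return the percentage (0-100) reached at update i out of `steps`."""
    return int(i / (steps - 1) * 100)

# Show the step for each state and the percentages of a ten-update cycle
for light in (1, 2, 3):
    print(light, step_for_light(light))
print([progress_percent(i, 10) for i in range(10)])
```

The resulting percentage could be assigned to the value option of a progress bar in determinate mode, so that the bar is empty at the first update and full at the last one.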
The program stops when the user interrupts (closes) the application. The associated Python script is provided below:

# Import libraries
import tkinter as tk
from tkinter import ttk
from tkinter import *
# Import the necessary image processing classes
from PIL import Image, ImageTk
# Import the time library for the timed delays
import time

global trafficLight
global image1, image2, image3
global photo1, photo2, photo3
global winLabel1, winLabel2, winLabel3
global direction
global posx, posy
global winFrame
global progressBar

# Open the traffic and car images and create the relevant pointers
def photos():
    global image1, image2, image3, image4
    global photo1, photo2, photo3, photo4

    # Image.LANCZOS is the resampling filter formerly named Image.ANTIALIAS
    image1 = Image.open("TrafficLightsGreen.gif")
    image1 = image1.resize((50, 100), Image.LANCZOS)
    photo1 = ImageTk.PhotoImage(image1)
    image2 = Image.open("TrafficLightsYellow.gif")
    image2 = image2.resize((50, 100), Image.LANCZOS)
    photo2 = ImageTk.PhotoImage(image2)
    image3 = Image.open("TrafficLightsRed.gif")
    image3 = image3.resize((50, 100), Image.LANCZOS)
    photo3 = ImageTk.PhotoImage(image3)
    image4 = Image.open("Car.gif")
    image4 = image4.resize((30, 15), Image.LANCZOS)
    photo4 = ImageTk.PhotoImage(image4)

# Manage label visibility based on time
def manageLabels():
    global trafficLight
    global winLabel1, winLabel2
    global photo1, photo2, photo3
    global winFrame

    if (trafficLight.get() == 1):
        winLabel1.config(image=photo1)
        winLabel2.config(text='Green'); a=1
    elif (trafficLight.get() == 2):
        winLabel1.config(image=photo2)
        winLabel2.config(text='Yellow'); a=2
    elif (trafficLight.get() == 3):
        winLabel1.config(image = photo3)
        winLabel2.config(text = 'Red'); a = 3
    winLabel1.pack(); winLabel1.place(x = 1, y = 1)
    winLabel2.pack(); winLabel2.place(x = 1, y = 100)
    winFrame.update()
    # Call moveCar() to move the image within the interface
    moveCar(a)

# Control the direction of the movement
def checkDirection():
    global direction
    global posx, posy

    if (posx >= 400 and direction == 1):
        direction = 2
    elif (posx >= 400 and direction == 4):
        direction = 3
    elif (posx <= 0 and direction == 2):
        direction = 1
    elif (posx <= 0 and direction == 3):
        direction = 4
    elif (posy <= 0 and direction == 3):
        direction = 2
    elif (posy <= 0 and direction == 4):
        direction = 1
    elif (posy >= 200 and direction == 1):
        direction = 4
    elif (posy >= 200 and direction == 2):
        direction = 3

# Manage the movement of the car
def moveCar(a):
    global direction
    global posx, posy
    global winLabel3
    global winFrame
    global progressBar

    progressBar['value'] = 0
    for i in range(10):
        # Call checkDirection() to control the movement direction
        checkDirection()
        if (a == 1):
            move = 10
        elif (a == 2):
            move = 5
        else:
            move = 0
        progressBar['value'] = int((i/(10 - 1)) * 100)
        if (direction == 1):
            posy += move; posx += move
        elif (direction == 2):
            posy += move; posx -= move
        elif (direction == 3):
            posy -= move; posx -= move
        elif (direction == 4):
            posy -= move; posx += move
        winLabel3.pack(); winLabel3.place(x = posx, y = posy)
        winFrame.update()
        time.sleep(0.3)

# Create the main frame
winFrame = tk.Tk()
winFrame.title ("Traffic Lights v4"); winFrame.geometry("400x200")

photos()

winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
winLabel1.pack(); winLabel1.place(x = 1, y = 1)
winLabel2 = tk.Label(winFrame, text = '...')
winLabel2.pack(); winLabel2.place(x = 1, y = 100)
winLabel3 = tk.Label(winFrame, text='', image=photo4, compound='left')
winLabel3.pack(); winLabel3.place(x = 1, y = 1)
posx = 0; posy = 0

progressBar = ttk.Progressbar(winFrame, length=100, orient = VERTICAL,
    mode = 'determinate')
progressBar.place(relx = 0.13, rely = 0.02)

trafficLight = IntVar()
trafficLight.set(3)
direction = 1

# This loop runs until the window is closed, so the call to
# mainloop() that follows it is not reached during normal operation
while (True):
    winFrame.update_idletasks()
    if (trafficLight.get() == 1):
        trafficLight.set(2)
    elif (trafficLight.get() == 2):
        trafficLight.set(3)
    elif (trafficLight.get() == 3):
        trafficLight.set(1)
    manageLabels()

winFrame.mainloop()

Output 4.4.4:

A number of new methods, options, and computational ideas are introduced in this script. First, the reader will notice the use of the update_idletasks() method at the start of the while loop. Every time the loop is executed, this method processes the pending idle tasks of the interface, such as redrawing widgets and recalculating their geometry, without handling other events such as user input. This keeps the display in step with the latest changes made by the script. Second, it is worth noting the use of absolute coordinates x and y to continuously position the relevant widgets on the interface, instead of the relative ones (relx and rely) used in previous examples. This is especially relevant in the case of the moving car, in order to trace and handle its movement when it reaches the edges of the interface.

Observation 4.47 – update_idletasks(): Use the update_idletasks() method to process pending idle tasks (e.g., widget redraws and geometry updates), so that the interface reflects the latest changes.

In terms of the actual movement of the car, the computational idea is quite simple. For instance, when it reaches the east edge of the interface, (a) if it is moving southeast (i.e., direction = 1) it should bounce toward the southwest (i.e., direction = 2), and (b) if it is moving northeast (i.e., direction = 4) it should bounce toward the northwest (i.e., direction = 3).
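The east-edge rule generalizes to the other three edges. The sketch below is one possible way to express this, assuming the direction codes used in this section (1 = southeast, 2 = southwest, 3 = northwest, 4 = northeast) and screen coordinates in which y grows downward. The name bounce() and the lookup tables are hypothetical. Mirroring the direction through tables is an alternative formulation, not the chain of conditions used in the script, and, unlike that chain, it handles both axes in one call when the car reaches a corner.

```python
# Direction codes: 1 = southeast, 2 = southwest, 3 = northwest, 4 = northeast
# Mirror on a vertical edge (east/west): swaps east and west
FLIP_HORIZONTAL = {1: 2, 2: 1, 3: 4, 4: 3}
# Mirror on a horizontal edge (north/south): swaps north and south
FLIP_VERTICAL = {1: 4, 4: 1, 2: 3, 3: 2}
MOVING_EAST = {1, 4}
MOVING_SOUTH = {1, 2}

def bounce(direction, posx, posy, width, height):
    """Return the direction to follow after checking the four edges."""
    # East or west edge reached while heading toward it
    if (posx >= width and direction in MOVING_EAST) or \
       (posx <= 0 and direction not in MOVING_EAST):
        direction = FLIP_HORIZONTAL[direction]
    # South or north edge reached while heading toward it
    if (posy >= height and direction in MOVING_SOUTH) or \
       (posy <= 0 and direction not in MOVING_SOUTH):
        direction = FLIP_VERTICAL[direction]
    return direction

# The two east-edge cases described in the text
print(bounce(1, 400, 100, 400, 200))
print(bounce(4, 400, 100, 400, 200))
```

Away from the edges the function returns the direction unchanged, so it could be called before every step of the movement.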
The checkDirection() function takes care of the rest of the movements of the car. Once the step and direction are set, the actual movement takes place in the moveCar() function. The function recalculates the placement coordinates of the car based on its current coordinates, given both the intended direction and the state of the traffic light.

Observation 4.48 – x, y coordinates: It is often preferable to use the x and y coordinates when placing a widget on an interface, in order to ensure its absolute placement in pixels instead of the relative positions (i.e., using relx and rely).

4.4.5 Traffic Lights Version 5 – Suggesting a Primitive Screen Saver

As a conclusion of this automation-related series of scripts based on the Traffic Lights sample application, it is useful to introduce the idea of using designated keyboard input commands to achieve a certain level of control over the automated events. The following script introduces functionality that allows the user to move the car dynamically at run time using the up, down, left, and right keys on the keyboard, as well as the esc key to exit:

# Import libraries
import tkinter as tk
from tkinter import *
# Import the necessary image processing classes
from PIL import Image, ImageTk
# Import the time library
import time

global trafficLight
global posx, posy
global image1, image2, image3
global photo1, photo2, photo3
global winLabel1, winLabel2
global winFrame

# Open the traffic and car images and create the relevant pointers
def photos():
    global image1, image2, image3, image4
    global photo1, photo2, photo3, photo4

    # Image.LANCZOS is the resampling filter formerly named Image.ANTIALIAS
    image1 = Image.open("TrafficLightsGreen.gif")
    image1 = image1.resize((50, 100), Image.LANCZOS)
    photo1 = ImageTk.PhotoImage(image1)
    image2 = Image.open("TrafficLightsYellow.gif")
    image2 = image2.resize((50, 100), Image.LANCZOS)
    photo2 = ImageTk.PhotoImage(image2)
    image3 = Image.open("TrafficLightsRed.gif")
    image3 = image3.resize((50, 100), Image.LANCZOS)
    photo3 = ImageTk.PhotoImage(image3)
    image4 = Image.open("Car.gif")
    image4 = image4.resize((30, 15), Image.LANCZOS)
    photo4 = ImageTk.PhotoImage(image4)

# Manage the movement based on the traffic light
def keyPressed (event):
    global trafficLight
    global posx, posy
    global winFrame
    global winLabel2

    # Set the moving step based on the traffic light
    if (trafficLight == 1):
        move = 10
    elif (trafficLight == 2):
        move = 5
    elif (trafficLight == 3):
        move = 0

    print(event.keycode)

    # Prepare the moving step (up, down, left, right, esc)
    # Mac codes: (8320768,8255233, 8124162, 8189699, 3473435)

    # The user pressed 'up'. Move the car accordingly
    if (event.keycode == 38):
        if (move == 10 and posy >= 20):
            posy -= 10
        elif (move == 5 and posy >=20):
            posy -= 5
    # The user pressed 'down'. Move the car accordingly
    elif (event.keycode == 40):
        if (move == 10 and posy <= 270):
            posy += 10
        elif (move == 5 and posy <= 270):
            posy += 5
    # The user pressed 'right'. Move the car accordingly
    elif (event.keycode == 39):
        if (move == 10 and posx <= 570):
            posx += 10
        elif (move == 5 and posx <= 570):
            posx += 5
    # The user pressed 'left'. Move the car accordingly
    elif (event.keycode == 37):
        if (move == 10 and posx >= 20):
            posx -= 10
        elif (move == 5 and posx >= 20):
            posx -= 5
    # The user pressed 'escape'. Close the program
    elif (event.keycode == 27):
        winFrame.destroy()
        exit()

    winLabel2.pack(); winLabel2.place(x = posx, y = posy)
    winFrame.update()

def trafficLightsLoop():
    global trafficLight
    global winFrame
    global winLabel1

    winFrame.update_idletasks()
    if (trafficLight == 1):
        trafficLight = 2; winLabel1.config(image = photo2)
    elif (trafficLight == 2):
        trafficLight = 3; winLabel1.config(image = photo3)
    elif (trafficLight == 3):
        trafficLight = 1; winLabel1.config(image = photo1)
    winLabel1.pack(); winLabel1.place(x = 1, y = 1)
    winFrame.update()
    # Schedule the next change of the traffic light in 3000 milliseconds
    winFrame.after(3000, trafficLightsLoop)

# Create the main frame
winFrame = tk.Tk()
winFrame.title ("Traffic Lights v5"); winFrame.geometry("600x300")
winFrame.bind('<Key>', keyPressed)

photos()

winLabel1 = tk.Label(winFrame, text='', image=photo1, compound='left')
winLabel1.pack(); winLabel1.place(x = 1, y = 1)
winLabel2 = tk.Label(winFrame, text='', image=photo4, compound='left')
winLabel2.pack(); winLabel2.place(x = 1, y = 1)

trafficLight = 1; posx = 0; posy = 0
winFrame.after(3000, trafficLightsLoop)
winFrame.mainloop()

Output 4.4.5:

The script introduces some new ideas and techniques aiming to make the user experience more engaging, and to encourage further enhancements. Firstly, it must be noted that, in the main program, the main frame is bound to the keyPressed() function through the <Key> event (the call to bind() on winFrame). It must be stressed that the naming of the event is important and that any deviations (e.g., <key>) may not be translated correctly by Python. As a result of the binding, the user can press any of the up, down, left, and right directional keys in order to move the car in the relevant direction. This is achieved by checking the value of event.keycode produced by the user's input. It is worth noting that these values may vary between different systems, so the code should include appropriate controls and solutions for such variations (see the comments at the start of the key checks in keyPressed()).

Observation 4.49 – <Key>, event.keycode: Use the <Key> event to bind a particular frame or widget to a key press event. Once the key input is captured, use event.keycode to determine the appropriate action.

Secondly, the reader should note the avoidance of a loop and its replacement by the after() method, which is applied to the main frame (winFrame). The reason for this decision is that the program relies on the main event loop to monitor the <Key> event, and a continuously running loop would block the main event loop and prevent the key presses from being processed. The after() method creates a loop-like behavior without causing such a conflict: it is called once in the main program and once more at the end of trafficLightsLoop(), so that the function reschedules itself. Finally, the reader should note the use of the esc code in the keyPressed() function to exit the application in a controlled way.

Observation 4.50 – after(): Use the after() method to call a method or execute a command after a predetermined delay, specified in milliseconds, has elapsed. This can be used as an alternative to for or while loops.

4.5 CASE STUDIES

Enhance the Countries application in order to include the following functionality:

• Add one more listbox to display more content for each country (e.g., size, population, etc.).
• Add a combobox to allow the user to select the font name of the contents of the listboxes.
• Add a combobox to allow the user to select the font size of the contents of the listboxes.
• Add a combobox to change the background color of the content in the listboxes.

4.6 EXERCISES

Enrich the Traffic Lights application by including one more car. The new car must be controlled by another set of keys on the keyboard, using the same traffic lights as those on the original application.
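As a possible starting point for this exercise, each set of keys can be described as a table that maps key names to movements. The sketch below assumes the key names reported by tkinter in event.keysym ('Up', 'Down', 'Left', and 'Right' for the arrow keys, and 'w', 'a', 's', and 'd' for the letter keys), which are generally less dependent on the platform than numeric key codes. The choice of the W/A/S/D keys for the second car and the names KEYS_CAR1, KEYS_CAR2, and new_position() are hypothetical.

```python
# Each table maps a key name to a (dx, dy) unit movement.
# y grows downward on the screen, so 'up' is a negative dy.
KEYS_CAR1 = {'Up': (0, -1), 'Down': (0, 1), 'Left': (-1, 0), 'Right': (1, 0)}
KEYS_CAR2 = {'w': (0, -1), 's': (0, 1), 'a': (-1, 0), 'd': (1, 0)}

def new_position(keysym, keys, posx, posy, step, width, height):
    """Return the position after a key press, kept inside the frame."""
    if keysym not in keys:
        return posx, posy          # The key does not belong to this car
    dx, dy = keys[keysym]
    # Clamp the new coordinates to the visible area
    posx = min(max(posx + dx * step, 0), width)
    posy = min(max(posy + dy * step, 0), height)
    return posx, posy

print(new_position('Right', KEYS_CAR1, 0, 0, 10, 600, 300))
print(new_position('w', KEYS_CAR2, 50, 50, 5, 600, 300))
print(new_position('w', KEYS_CAR1, 50, 50, 5, 600, 300))
```

Inside a function bound to the <Key> event, event.keysym could be passed as the first argument, with the step chosen according to the current state of the traffic light.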
5 Application Development with Python

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Christos Manolas
The University of York
Ravensbourne University London

Hanêne Ben-Abdallah
University of Pennsylvania

DOI: 10.1201/9781003139010-5

CONTENTS

5.1 Introduction
5.2 Messages, Common Dialogs, and Splash Screens in Python
    5.2.1 Simple Message Boxes
    5.2.2 Message Boxes with Options
    5.2.3 Message Boxes with User Input
    5.2.4 Splash Screen/About Forms
    5.2.5 Common Dialogs
5.3 Menus
    5.3.1 Simple Menus with Shortcuts
    5.3.2 Toolbar Menus with Tooltips
    5.3.3 Popup Menus with Embedded Icons
5.4 Enhancing the GUI Experience
    5.4.1 Notebooks and Tabbed Interfaces
    5.4.2 Threaded Applications
    5.4.3 Combining Multiple Concepts and Applications in a Multithreaded System
5.5 Wrap Up
5.6 Case Study

5.1 INTRODUCTION

Application development can be viewed as a process that is both scientific and creative. It is scientific because it follows the systematic process of the software development life-cycle, which covers all development steps, from requirement analysis and implementation to deployment and maintenance. It is creative because it calls for the creativity of the developer to design a system that incorporates features that make it suitable and efficient for the task at hand, while also being attractive to the end user.

The previous chapter introduced and discussed some of the key objects for the development of an appealing user interface. In this chapter, the concept of application development is examined more thoroughly, by introducing ideas and tools that call for the integration of multiple functions within a single application. These include:

• Dialogs, Messages, and the Splash Screen: Simple and intuitive objects that most users of Windows-style applications are quite familiar with. Each of these objects serves a particular function and is part of the Python API (Application Programming Interface), thus requiring only minimal coding.
• Menus, Toolbar Menus, Popup Menus: Variations of the well-known menu object allowing the user to select different functions available in the application. Menus are usually accompanied by extra functionality options like hot keys, shortcuts, and tooltips, in order to enhance their attractiveness and efficiency.
• Tabs: Tabs provide an effective way to optimize the use of the real estate of the running interface, allowing the inclusion of more than one application in the same space. This idea is simple, but intuitive and effective. Tabs are commonly used to separate a single notebook into various sections and load various independent applications.
• Threads: Threading involves the simultaneous execution of code relating to multiple instances of the same process, class or application. Different threads can be executed simultaneously, either in parallel or in explicitly defined time slots. Each thread can have its own widgets (if it is GUI based) and attributes. Threaded objects do not necessarily communicate with each other, although this is possible and can be implemented when and if necessary.

The focus of this chapter is on discussing and illustrating key underlying concepts and mechanisms associated with these tools and structures.

5.2 MESSAGES, COMMON DIALOGS, AND SPLASH SCREENS IN PYTHON

Messageboxes, common dialogs, and splash screens are some of the most understated, but useful objects that can help in enhancing the functionality of an application without adding lengthy code to it. They are user-friendly and multifunctional, and provide instant, and strictly restricted and managed input from the user during the execution of an application. Several types of these components are available with varied and diverse functions, such as the display of user messages, the creation of menus of options/choices, the acceptance and verification of user input, the management of display parameters and options (e.g., colors), and the management of files, file structures and directories. Each of the above can be called and implemented with relatively simple Python code commands, as described in the following sections.
5.2.1 Simple Message Boxes

The simple message box displays a message to the user and stays on display until the corresponding (OK) button is clicked, at which point the application resumes execution. As there is no input to be received, the user reaction to the message is irrelevant and the only possible choice is to click the OK button. The object has three distinct forms represented by methods showinfo(), showerror(), and showwarning(), which are provided by the messagebox module (tkinter library). These methods do not change any fundamental aspects of the message box, but modify the icon that accompanies it according to the type of information provided to the user. The following Python script presents a basic example of the use of each of the three methods:

Observation 5.1 – Simple Message Box: Methods showinfo(), showerror(), and showwarning() (members of the messagebox module, tkinter library) are used to display a simple message box with a respective info, error, or warning icon.

# Import libraries
import tkinter as tk
from tkinter import messagebox

# Declare simpleMessage() function, invoked upon button click
def simpleMessage(a):
    if (a == 1):
        messagebox.showinfo("Simple Info Message",
            "You clicked for the info message")
    elif (a == 2):
        messagebox.showerror("Simple Error Message",
            "You clicked for the error message")
    elif (a == 3):
        messagebox.showwarning("Simple Warning Message",
            "You clicked for the warning message")

# Create a non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Simple Messageboxes")
winFrame.resizable(False, False)
winFrame.geometry('290x180')
winFrame.configure(bg = 'dark grey')

# Create button that triggers an info message
winButton1 = tk.Button(winFrame, width = 25,
    text = "Click to display \na simple info messagebox")
winButton1.pack(); winButton1.place(x = 50, y = 20)
winButton1.bind('<Button-1>', lambda event: simpleMessage(1))

# Create button that triggers an error message
winButton2 = tk.Button(winFrame, width = 25,
    text = "Click to display \na simple error messagebox")
winButton2.pack(); winButton2.place(x = 50, y = 70)
winButton2.bind('<Button-1>', lambda event: simpleMessage(2))

# Create button that triggers a warning message
winButton3 = tk.Button(winFrame, width = 25,
    text = "Click to display \na simple warning messagebox")
winButton3.pack(); winButton3.place(x = 50, y = 120)
winButton3.bind('<Button-1>', lambda event: simpleMessage(3))

winFrame.mainloop()

Output 5.2.1:

The reader should note that the first parameter passed to the message box is the title, whereas the second is the content. The program output provided above illustrates the resulting messages for each of the three simple types of message boxes.

5.2.2 Message Boxes with Options

Observation 5.2 – Message Box with Options: Methods askokcancel(), askretrycancel(), askyesno(), and askquestion() (members of the messagebox module, tkinter library) are used to display a message, while also requesting some sort of confirmation from the user. The responses can be True or False for the first three and 'yes' or 'no' for the last one.

Message boxes are commonly used to receive user confirmation for processes that take place at run-time. In such cases, instead of merely displaying information, the object must prompt the user to confirm their approval (or lack of it) regarding the execution of particular processes. As in the case of simple messages, several options are available for message boxes with options, depending on the type of confirmation that is requested. However, there are two major differences between the two types of messages. Firstly, in the case of messages with options, the user makes a choice that may alter the execution order of the processes that follow, in contrast to the simple message box.
The type and format of the input depends on the type of the message (e.g., OK-Cancel, Retry-Cancel, Yes-No). Secondly, the user’s choice has a tangible value that can be stored in a variable and checked against other pre-defined values to determine the flow of execution. These values are True or False (no quotes and casesensitive) in the case of OK-Cancel, Retry-Cancel, and Yes-No, and ‘Yes’ or ‘No’ (in single quotation marks and case-sensitive) in the case of a question message box. The following Python script provides a simple example that integrates all four different types of messages with options. The script also makes use of the showinfo() and showerror() methods of the simple message box: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 # Import libraries import tkinter as tk from tkinter import messagebox # Declare optionMessage()function, invoked upon button click def optionMessage(a): if (a == 1): response = messagebox.askokcancel(title = "ok-cancel Message", message = "Clicked the OK-Cancel message", icon = 'info') if (response == True): messagebox.showinfo("Info Message", "Clicked OK") elif (response == False): messagebox.showerror("Error Message", "Clicked Cancel") elif (a == 2): response = messagebox.askquestion(title = "question Message", message = "Clicked the question message", icon = 'info') if (response == 'yes'): messagebox.showinfo("Info Message", "Clicked Yes") elif (response == 'no'): Application Development 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 165 messagebox.showerror("Error Message", "Clicked No") elif (a == 3): response=messagebox.askretrycancel(title="retry-cancel Message", message = "Clicked the Retry-Cancel message", icon = 'info') if (response == True): messagebox.showinfo("Info Message", "Clicked Retry") elif (response == False): messagebox.showerror("Error Message", "Clicked Cancel") elif (a == 4): response = 
messagebox.askyesno(title = "yes-no Message", message = "Clicked the Yes-No message", icon = 'info') if (response == True): messagebox.showinfo("Info Message", "Clicked Yes") elif (response == False): messagebox.showerror("Error Message", "Clicked No") # Create a non-resizable Windows frame using the tk object winFrame = tk.Tk() winFrame.title("Messageboxes with options") winFrame.resizable(False, False) winFrame.geometry('320x220') winFrame.configure(bg = 'grey') # Create button that triggers an OK-Cancel message winButton1 = tk.Button(winFrame, width = 20, text = "Click to display \na OK-Cancel messagebox") winButton1.pack(); winButton1.place(x = 85, y = 20) winButton1.bind('<Button-1>', lambda event: optionMessage(1)) # Create button that triggers a question message winButton2 = tk.Button(winFrame, width = 20, text = "Click to display \na Question messagebox") winButton2.pack(); winButton2.place(x = 85, y = 70) winButton2.bind('<Button-1>', lambda event: optionMessage(2)) # Create button that triggers a Retry-Cancel message winButton3 = tk.Button(winFrame, width = 20, text = "Click to display \na Retry-Cancel messagebox") winButton3.pack(); winButton3.place(x = 85, y = 120) winButton3.bind('<Button-1>', lambda event: optionMessage(3)) # Create button that triggers a Yes-No message winButton3 = tk.Button(winFrame, width = 20, text = "Click to display \na Yes-No messagebox") winButton3.pack(); winButton3.place(x = 85, y = 170) winButton3.bind('<Button-1>', lambda event: optionMessage(4)) winFrame.mainloop() 166 Handbook of Computer Programming with Python Output 5.2.2: 5.2.3 Message Boxes with User Input Occasionally, message boxes are used instead of regular entry or text widgets, to prompt user input of various dif- Observation 5.3 – Message Box with ferent data types (i.e., string, integer, float). 
This is a via- User Input: Methods askstring(), ble choice when the interface is heavily loaded or when askinteger(), and askfloat() the use of widgets is not desirable. When message boxes (members of the simpledialog are used for this purpose, the following methods can be object, tkinter library) are used to used: (a) askstring() for string input, (b) askinte- display a message requesting input of ger() for integer numbers input, and (c) askfloat() a specific data type from the user. for float numbers (real numbers) input. These methods are members of the simpledialog class of the tkinter library. As they return a particular data type value, it must be stored in a suitable variable declared for this purpose. As shown in the following Python script, the title and the message of the message box must be also specified: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # Import libraries import tkinter as tk from tkinter import simpledialog from tkinter import messagebox global name; global birthyear; global gpa # Declare optionMessage() function, invoked upon button click def inputMessage(a): global name; global birthyear; global gpa # Accept student name, year of birth, and GPA # and display it through a simple message box if (a == 1): name = simpledialog.askstring("Name", "What is your name?") elif (a == 2): birthyear = simpledialog.askinteger("Year of birth", Application Development 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 167 "What is the year of your birth?") elif (a == 3): gpa = simpledialog.askfloat("GPA", "What is your GPA (out of 4 with one decimal)?") elif (a == 4): message="Student's name: "+name+"\nStudent's year of birth: "+\ str(birthyear) + "\nStudent's GPA: " + str(gpa) messagebox.showinfo("Student's info", message) # Create a non-resizable Windows frame using the tk object winFrame = tk.Tk() winFrame.title("Inputboxes") winFrame.resizable(False, False) winFrame.geometry('260x220') 
winFrame.configure(bg = 'grey')

# Create buttons that will trigger the associated messages
winButton1 = tk.Button(winFrame,
    text = "Click to ask \nthe student's name", width = 20)
winButton1.pack(); winButton1.place(x = 30, y = 20)
winButton1.bind('<Button-1>', lambda event: inputMessage(1))
winButton2 = tk.Button(winFrame, width = 20,
    text = "Click to ask \nthe student's year of birth")
winButton2.pack(); winButton2.place(x = 30, y = 70)
winButton2.bind('<Button-1>', lambda event: inputMessage(2))
winButton3 = tk.Button(winFrame,
    text = "Click to ask \nthe student's GPA", width = 20)
winButton3.pack(); winButton3.place(x = 30, y = 120)
winButton3.bind('<Button-1>', lambda event: inputMessage(3))
winButton4 = tk.Button(winFrame,
    text = "Click to show \nthe student's info", width = 20)
winButton4.pack(); winButton4.place(x = 30, y = 170)
winButton4.bind('<Button-1>', lambda event: inputMessage(4))

name = ""; birthyear = 0; gpa = 0.0

winFrame.mainloop()

Output 5.2.3 (figure)

5.2.4 Splash Screen/About Forms

Observation 5.4 – Splash screen: A splash screen can be used in cases of excessive loading times of a window/widget or when there is a need to display information related to the application.

A frequently underestimated type of object is the so-called splash screen or about form. It is most commonly used to provide information about application execution and processes, development details and dates, copyrights, and contacting the development team. The object does not follow a formal design and, therefore, it is not offered as a template by most well-known programming languages. Among its various uses, the splash screen/about form can be used to give time to the main application to load its components. This is especially relevant if significant amounts of data need to be loaded, such as sizable databases or graphics, and heavy objects in general.
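As a complement to the idea of giving the main application time to load, here is a minimal sketch of a splash window that closes itself through the widget method after(), so the event loop keeps running while the splash is on screen. The function name run_with_splash and the 3-second default are illustrative assumptions, not part of the handbook's scripts; tkinter is imported inside the function only so that the sketch can be loaded on a machine without a display.

```python
def run_with_splash(delay_ms=3000):
    """Show a splash window for delay_ms milliseconds, then the main window.

    Hypothetical helper for illustration; only the tkinter calls are real API.
    """
    # Imported here so that loading this sketch does not require a display
    import tkinter as tk

    # Build the main window first, but keep it hidden while the splash shows
    root = tk.Tk()
    root.title("Main Window")
    root.geometry("250x100")
    tk.Label(root, text="Entered the main window").grid(row=0, column=0)
    root.withdraw()

    # The splash is a secondary (Toplevel) window of the same application
    splash = tk.Toplevel(root)
    splash.title("Splash screen")
    splash.geometry("250x100")
    tk.Label(splash, text="Loading...").grid(row=0, column=0)

    def show_main():
        # Remove the splash and reveal the main window
        splash.destroy()
        root.deiconify()

    # Schedule the switch without blocking the event loop
    root.after(delay_ms, show_main)
    root.mainloop()
```

Calling run_with_splash(8000) would keep the splash visible for 8 seconds. The design choice here is to use a single Tk root with a Toplevel splash, which avoids creating two separate Tk instances.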
The following script is a basic example of a splash screen with no functionality of its own. The form disappears after 8 seconds to give its place to the main application window:

# Import libraries
import tkinter as tk
import time

global winSplash

# Create the Splash screen
def splash():
    global winSplash
    winSplash = tk.Tk()
    winSplash.title("Splash screen")
    winSplash.resizable(False, False)
    winSplash.geometry('250x100')
    winSplash.configure (bg = 'dark grey')
    winLabel1 = tk.Label(winSplash,
        text = "Display the Splash screen \nfor 8 seconds")
    winLabel1.grid(row = 0, column = 0)
    # Use the update function to display the splash screen
    # before the mainloop (main window) takes over
    winSplash.update()

# Call the splash screen for 8 seconds
splash()
time.sleep(8)

# Destroy the splash screen before the mainloop
winSplash.destroy()

# Create the main window
winFrame = tk.Tk()
winFrame.title("Main Window")
winFrame.resizable(False, False)
winFrame.geometry('250x100')
winFrame.configure(bg = 'grey')
winLabel2 = tk.Label(winFrame, text = "Entered the main window")
winLabel2.grid(row = 0, column = 0)

winFrame.mainloop()

Output 5.2.4 (figure)

The reader should note the call to time.sleep() after splash() is invoked. This keeps the splash screen on display for 8 seconds before the main window (winFrame) is created. It is also worth noting the use of the update() method on the winSplash object. This method forces the splash window to be drawn immediately, since the window is not the main window and its mainloop() is never started.

5.2.5 Common Dialogs

It is frequently the case that the programmer needs to utilize the API (Application Programming Interface) of the operating system in order to avoid writing code that is already provided as prepackaged, essential functionality.
Some of the most important GUI-related API elements can be found under the broader category of dialogs. Different versions of dialogs exist, such as Color, Open File, Save File, Directory, Font Dialog, and Print. These dialogs allow programmers to circumvent extensive GUI programming by offering instant access to basic, repetitive functional tasks. These are the common dialog objects that appear in various types of widely used GUI applications (e.g., MS Office or Adobe Creative Suite).

Observation 5.5 – API methods: The API methods offered by Python can be used to perform basic repetitive tasks across many platforms and operating systems. These methods include askcolor() from the colorchooser library and asksaveasfile(), askopenfile(), and askdirectory() from the filedialog library.

The color dialog (askcolor) is included in the colorchooser library, while the file and directory dialogs are included in the filedialog library under the associated keywords (e.g., filedialog.askopenfile(), filedialog.asksaveasfile(), filedialog.askdirectory()). The syntax for invoking these API methods is simple and rather intuitive, and it allows a two-way communication with the user in order to obtain their selection. In the case of askcolor(), one should note that the result is a pair (tuple) of two values: an RGB (red, green, blue) triple and the same color as a hexadecimal string (e.g., '#ff0000'). Both values are None if the user cancels the dialog. The color values selected can be stored in a variable for further use.
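The pair returned by askcolor() can be handled as in the following minimal sketch. The helper name chosen_background and its fallback parameter are assumptions introduced for illustration; the sketch only relies on the documented behavior that askcolor() returns an RGB triple together with a hexadecimal string, or (None, None) when the dialog is cancelled.

```python
def chosen_background(result, fallback="grey"):
    """Return a color usable as a widget background from an askcolor() result.

    result is the (rgb, hex_string) pair returned by colorchooser.askcolor();
    both elements are None when the user cancels the dialog.
    """
    rgb, hex_color = result
    return fallback if hex_color is None else hex_color


def main():
    # Imported here so that the helper above can be used without a display
    import tkinter as tk
    from tkinter import colorchooser

    root = tk.Tk()
    root.geometry("280x120")

    def pick():
        # Keep the current background if the user cancels the dialog
        current = root.cget("background")
        root.config(background=chosen_background(colorchooser.askcolor(),
                                                 fallback=current))

    tk.Button(root, text="Pick a color", command=pick).pack(padx=40, pady=40)
    root.mainloop()
```

Running main() opens a small window with one button; separating the pure helper from the GUI code keeps the cancel case easy to check on its own.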
The following Python script illustrates the use of the four API methods mentioned above:

# Import libraries
import tkinter as tk
from tkinter import filedialog
from tkinter import colorchooser

# Define openDialogs() function, invoked upon button click
def openDialogs(a):
    if (a == 1):
        # Assign user color selection to a set of variables
        (rgbSelected, colorSelected) = colorchooser.askcolor()
        # Use the color element from the variable set to change
        # the color of the form
        winFrame.config(background = colorSelected)
    elif (a == 2):
        filedialog.askopenfile(title = "Open File Dialog")
    elif (a == 3):
        filedialog.askdirectory(title = "Directory Dialog")
    elif (a == 4):
        filedialog.asksaveasfilename(title = "Save As Dialog")

# Create a non-resizable Windows frame using the tk object
winFrame = tk.Tk()
winFrame.title("Common Dialogs")
winFrame.resizable(False, False)
winFrame.geometry('280x220')
winFrame.configure(bg = 'grey')

# Create button that triggers the Color dialog
winButton1 = tk.Button(winFrame, text = "Click to open \nthe Color dialog",
    width = 20)
winButton1.pack(); winButton1.place(x = 60, y = 20)
winButton1.bind('<Button-1>', lambda event: openDialogs(1))

# Create button that triggers the Open File dialog
winButton2 = tk.Button(winFrame, text = "Click to open \nthe File Dialog",
    width = 20)
winButton2.pack(); winButton2.place(x = 60, y = 70)
winButton2.bind('<Button-1>', lambda event: openDialogs(2))

# Create button that triggers the Directory dialog
winButton3=tk.Button(winFrame, text="Click to open \nthe Directory Dialog",
    width = 20)
winButton3.pack(); winButton3.place(x = 60, y = 120)
winButton3.bind('<Button-1>', lambda event: openDialogs(3))

# Create button that triggers the Save As dialog
winButton4=tk.Button(winFrame, text = "Click to open \nthe Save As Dialog",
    width = 20)
winButton4.pack(); winButton4.place(x = 60, y = 170)
winButton4.bind('<Button-1>', lambda event: openDialogs(4))

winFrame.mainloop()

Output 5.2.5 (figure)

5.3 MENUS

It is quite rare for a desktop or mobile application to offer singular functionality. Developers usually create systems capable of performing numerous tasks and functions. An example of this are the scripts developed in the previous sections, where multiple, although quite simplistic, tasks were performed using a series of corresponding buttons. In reality, in most cases, access to different functions within an application is provided through menus. These can take different forms, such as simple menus, single-layered menus, menus with nested sub-menus, toolbars, and pop-up menus. These types of menus can be used in isolation, but are also frequently used in conjunction. This section covers basic menu concepts, as well as a number of particular options that can be used to further enhance menu functionality.

5.3.1 Simple Menus with Shortcuts

Observation 5.6 – Menu class: Use the constructor of the Menu class to create a menu object. The main menu choices can be added using the constructor (Menu()), while simple menu items can be added using the add_command() method and check and radio buttons using the add_checkbutton() and add_radiobutton() methods, respectively. Use add_cascade() to put all pieces of the menu together and display them on the menu bar.

In all windows-style applications, simple menus follow the same basic, but rather intuitive, style. They include a top-level list of items, usually displayed just below the title of the application. This top-level menu layer sits on top of sub-menus that are hidden in subsequent layers. Such basic menus are created using the constructor of the Menu class from the tkinter library. The idea is quite straightforward. Firstly, the menu object is created using the Menu() constructor. Additional menu objects can also be created and attached to the main menu object, as necessary. Next, any required sub-menus can be added to the main menu. This can be accomplished with the add_command() method for simple items or the add_checkbutton() and add_radiobutton() methods for check button and radio button items, respectively. For nested menus, these steps can be repeated as many times as necessary, although one should avoid going deeper than two levels of menus for clarity reasons. Finally, the add_cascade() method is used to tie together the various menu pieces and activate the menu system.

In addition to creating the basic menu structure, developers often choose to extend its functionality by means of menu shortcuts. This can take the form of either hot letters using the underline option, or combinations of special keys (e.g., the control key) and letters through the accelerator option. In both cases, it is essential to remember that while these options may appear on the menu, they do not automatically trigger the relevant functionality. For this purpose, the main window form should be bound to the relevant event in order to trigger the respective functionality. This is achieved with the bind() method.
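The point that an accelerator is only a label until a matching event is bound can be sketched as follows. The helper names accelerator_to_sequence and add_item are hypothetical and introduced only for illustration; the tkinter calls themselves (Menu(), add_command(), add_cascade(), bind(), config()) are the ones discussed above. The sketch lowercases single-letter keys because Tk treats '<Control-o>' and '<Control-O>' as different event sequences, the latter requiring the Shift key (or Caps Lock).

```python
# Map the modifier names shown on a menu to the names Tk uses in bindings
MODIFIER_NAMES = {"Ctrl": "Control", "Alt": "Alt", "Shift": "Shift"}


def accelerator_to_sequence(accelerator):
    """Translate a label such as 'Ctrl-O' into a Tk sequence such as '<Control-o>'."""
    *modifiers, key = accelerator.split("-")
    parts = [MODIFIER_NAMES[m] for m in modifiers]
    # Without Shift, the key arrives as a lowercase keysym
    if len(key) == 1 and "Shift" not in modifiers:
        key = key.lower()
    return "<" + "-".join(parts + [key]) + ">"


def add_item(window, menu, label, callback, accelerator):
    """Add a menu item and bind its key combination to the same callback."""
    menu.add_command(label=label, command=callback, accelerator=accelerator)
    window.bind(accelerator_to_sequence(accelerator), lambda event: callback())


def main():
    # Imported here so that the helpers above can be used without a display
    import tkinter as tk
    from tkinter import messagebox

    root = tk.Tk()
    root.title("Menus")
    menubar = tk.Menu(root)
    messages = tk.Menu(menubar, tearoff=0)
    add_item(root, messages, "OK/Cancel Message",
             lambda: messagebox.askokcancel("OKCancel message", "Continue?"),
             "Ctrl-O")
    messages.add_separator()
    add_item(root, messages, "Exit", root.destroy, "Ctrl-X")
    menubar.add_cascade(label="Messages", menu=messages)
    root.config(menu=menubar)
    root.mainloop()
```

Deriving the binding from the accelerator text keeps the label shown on the menu and the key combination that actually fires from drifting apart.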
The following application uses the functionality of the previous section, but with the implementation of a two-level deep basic menu instead of buttons:

 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import filedialog
 4  from tkinter import colorchooser
 5  from tkinter import messagebox
 6  from tkinter import Menu
 7
 8  # Define functions colorDialog, openDialog, saveAsDialog, quit, askyesno
 9  # and askokcancel, invoking the relevant dialogs or message boxes
10  def colorDialog():
11      # Assign user color selection to a set of variables
12      (rgbSelected, colorSelected) = colorchooser.askcolor()
13      # Change the form color; use the color element from the variable set
14      winFrame.config(background = colorSelected)
15
16  def openDialog():
17      filedialog.askopenfile(title = "Open File Dialog")
18
19  def saveAsDialog():
20      filedialog.asksaveasfilename(title = "Save As Dialog")
21
22  def quit():
23      winFrame.destroy()
24      exit()
25
26  def askyesno():
27      messagebox.askyesno("YesNo message",
28          "Click on Yes or No to continue")
29
30  def askokcancel():
31      messagebox.askokcancel("OKCancel message",
32          "Click on OK or Cancel to continue")
33
34  # Define keypressedEvent() function that will invoke
35  # the associated function based on key press
36  def keypressedEvent(event):
37      if (event.keysym == 'c' or event.keysym == 'C'):
38          colorDialog()
39      if (event.keysym == 'f' or event.keysym == 'F'):
40          openDialog()
41      if (event.keysym == 's' or event.keysym == 'S'):
42          saveAsDialog()
43
44  # Create non-resizable Windows frame using the tk object
45  winFrame = tk.Tk()
46  winFrame.title("Menus")
47  winFrame.resizable(False, False)
48  winFrame.geometry('260x220')
49
50  # Create the menu widget on the main window
51  menubar = tk.Menu(winFrame)
52
53  # Create the first series of sub-menus with dialogs
54  # and underline the shortcut letters
55  dialogs = tk.Menu(menubar, tearoff = 0)
56  dialogs.add_command(label = "Color dialog", command = colorDialog,
57      underline = 0)
58  dialogs.add_command(label = "Open File dialog", command = openDialog,
59      underline = 5)
60  dialogs.add_command(label = "Save As dialog", command = saveAsDialog,
61      underline = 0)
62  menubar.add_cascade(label = "Dialogs", menu = dialogs)
63
64  # Create the second series of sub-menus with messages
65  mssgs = tk.Menu(menubar, tearoff = 0)
66
67  # Create sub-menu inside the Yes/No, OK/Cancel message
68  mssgs1 = tk.Menu(mssgs, tearoff = 0)
69  mssgs1.add_command(label = "Yes/No Message", command = askyesno,
70      accelerator = 'Ctrl-Y')
71  mssgs1.add_command(label = "OK/Cancel Message", command = askokcancel,
72      accelerator = 'Ctrl-O')
73  mssgs.add_cascade(label = "Yes/No, OK/Cancel", menu = mssgs1)
74  mssgs.add_separator()
75  mssgs.add_command(label= "Exit", command = quit,
76      accelerator = 'Ctrl-X')
77  menubar.add_cascade(label = "Messages", menu = mssgs)
78
79  # Create the third series of menus with check buttons and radio buttons
80  buttonmenus = tk.Menu(menubar, tearoff = 0)
81  buttonmenus.add_checkbutton(label = "Checkmenu1", onvalue=1, offvalue=0)
82  buttonmenus.add_checkbutton(label = "Checkmenu2", onvalue=1, offvalue=0)
83  buttonmenus.add_separator()
84  buttonmenus.add_radiobutton(label = "Radiomenu1")
85  buttonmenus.add_radiobutton(label = "Radiomenu2")
86  menubar.add_cascade(label = "Button menus", menu = buttonmenus)
87
88  # Bind the main window frame with the event/shortcut that will trigger
89  # the relevant function
90  winFrame.bind('<Key>', lambda event: keypressedEvent(event))
91  winFrame.bind('<Control-y>', lambda event: askyesno())
92  winFrame.bind('<Control-o>', lambda event: askokcancel())
93  winFrame.bind('<Control-x>', lambda event: quit())
94
95  winFrame.config(menu = menubar)
96  winFrame.mainloop()

Output 5.3.1 (figure)

Observation 5.7 – add_separator(), underline, accelerator: Use the add_separator() method to add a line separating the various items of a menu. Use the underline option to create hot keys, or the accelerator option to create ctrl-, shift-, or alt-keys, and to associate them with the desired functionality and events.

In addition to the necessary library calls, the script is split into three main parts. In the first part, the main window frame is created and configured (lines 44–48). Next, a menu object (menubar) is created (lines 50–51) and the main menu items (dialogs, mssgs, and buttonmenus) are attached to it (lines 55, 65, and 80). Notice the tearoff option, which is set to 0 to prevent the menu from being detached from the main menu bar. Once the main menu components are in place, the various sub-menu items are created and associated with their parent menu item through the add_command() method (lines 55–62 and 69–73). The command option binds particular menu items with the relevant functions. The underline option accepts the index of a character in the item's label (starting at 0) and displays that character as a hot key. Unlike the command option, this is not enough by itself to trigger the relevant function or command, so a relevant event must be bound to the hot key character (lines 36–42 and 90). When sub-menus are required as part of a menu item, the same process can be utilized. The only difference in this case is that the parent object should be the menu item instead of the main menu bar (line 68). If it is preferred to use combinations of special keys (i.e., Control, Shift, or Alt) and characters, one can use the accelerator option instead of underline (lines 69–72, 76). As with underline, additional code should be written in order to trigger the function, method, or command associated with the menu item (lines 91–93). In cases where check or radio buttons are required instead of simple menu items, one can use methods add_checkbutton() and add_radiobutton(), respectively.
These methods are used as alternatives to the add_command() method (lines 81–82 and 84–85). When there is a need to separate the various menu items into groups, one can use the add_separator() method (line 83). As mentioned, the add_cascade() method ties together and activates the various items of the menu system.

In the second part of the script, the bindings between the menu item shortcuts (hot keys or control characters) and the associated commands are established (lines 90–93 and 36–42). The third part of the script involves the functions that perform the various tasks (lines 8–32). Should the reader experience difficulties in following this example, the main coding concepts and commands used in the script are discussed in more detail in previous sections and/or chapters.

It is important to note that there is a difference in terms of how a menu is displayed in Windows (the menu bar is inside the running application window) and in macOS (the menu is displayed in the main system menu bar, detached from the running application window).

5.3.2 Toolbar Menus with Tooltips

Observation 5.8 – toolbar menu: Use a toolbar menu system in addition to (or instead of) simple menus, to improve the GUI of a multi-functional application.

An alternative form of presenting menu options to the user is the toolbar menu. It could either supplement the simple menu system or be used as a stand-alone component. The idea is rather straightforward: creating a collection of buttons (on a frame) and attaching it to the main window frame. The buttons are then bound to the respective commands. Buttons can display either images or text, or a combination of both. In order to improve clarity and make the interface more user-friendly, button text is often replaced by appropriate tooltips. The following Python script provides the same functionality as the one in the previous section, but uses a toolbar instead of a menu.
The implementation also attaches tooltips to the toolbar buttons:

 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import filedialog
 4  from tkinter import colorchooser
 5  from tkinter import Menu
 6  from tkinter import *
 7  # Import the necessary image processing classes from PIL
 8  from PIL import Image, ImageTk
 9
10  global openFileToolTip, saveAsToolTip, colorsDialogToolTip, exitToolTip
11  global photo1, photo2, photo3, photo4
12  global openFileButton, saveAsButton, colorsButton, exitButton
13
14  #--------------------------------------------------------------------
15  # Open and resize images - load images to buttons
16  def images():
17      global photo1, photo2, photo3, photo4
18
19      image1 = Image.open("OpenFile.gif")
20      image1 = image1.resize((24, 24), Image.LANCZOS)
21      photo1 = ImageTk.PhotoImage(image1)
22      image2 = Image.open("SaveAs.gif")
23      image2 = image2.resize((24, 24), Image.LANCZOS)
24      photo2 = ImageTk.PhotoImage(image2)
25      image3 = Image.open("ColorsDialog.gif")
26      image3 = image3.resize((24, 24), Image.LANCZOS)
27      photo3 = ImageTk.PhotoImage(image3)
28      image4 = Image.open("Exit.gif")
29      image4 = image4.resize((24, 24), Image.LANCZOS)
30      photo4 = ImageTk.PhotoImage(image4)
31  #--------------------------------------------------------------------
32  # Define the colorDialog, openDialog, saveAsDialog, and quit functions
33  # that will invoke the relevant dialogs or quit the application
34  def colorDialog():
35      # Assign user color selection to a set of variables
36      (rgbSelected, colorSelected) = colorchooser.askcolor()
37      # Change the form color; use the color element from set variable
38      winFrame.config(background = colorSelected)
39
40  def openDialog():
41      filedialog.askopenfile(title = "Open File Dialog")
42
43  def saveAsDialog():
44      filedialog.asksaveasfilename(title = "Save As Dialog")
45
46  def quit():
47      winFrame.destroy()
48      exit()
49  #--------------------------------------------------------------------
50  # showToolTips function displays relevant message when hovering over a
51  # button; hideToolTips() function destroys/hides the tooltip
52  def showToolTips(a):
53      global openFileToolTip, saveAsToolTip
54      global colorsDialogToolTip, exitToolTip
55      if (a == 1):
56          openFileToolTip = tk.Label(winFrame, relief = FLAT,
57              text = "Open the Open File dialog", background = 'cyan')
58          openFileToolTip.place(x = 25, y = 30)
59      if (a == 2):
60          saveAsToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
61              text = "Open the Save As Dialog", background = 'cyan')
62          saveAsToolTip.place(x = 50, y = 30)
63      if (a == 3):
64          colorsDialogToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
65              text = "Open the Colors Dialog", background = 'cyan')
66          colorsDialogToolTip.place(x = 75, y = 30)
67      if (a == 4):
68          exitToolTip = tk.Label(winFrame, bd = 2, relief = FLAT,
69              text = "Click to exit the application", background = 'cyan')
70          exitToolTip.place(x = 100, y = 30)
71
72  def hideToolTips(a):
73      global openFileToolTip, saveAsToolTip
74      global colorsDialogToolTip, exitToolTip
75      if (a == 1):
76          openFileToolTip.destroy()
77      if (a == 2):
78          saveAsToolTip.destroy()
79      if (a == 3):
80          colorsDialogToolTip.destroy()
81      if (a == 4):
82          exitToolTip.destroy()
83  #--------------------------------------------------------------------
84  # Define the bindButtons function to bind the buttons with the
85  # various events
86  def bindButtons():
87      global openFileButton, saveAsButton, colorsButton, exitButton
88
89      openFileButton.bind('<Button-1>', lambda event: openDialog())
90      openFileButton.bind('<Enter>', lambda event: showToolTips(1))
91      openFileButton.bind('<Leave>', lambda event: hideToolTips(1))
92      saveAsButton.bind('<Button-1>', lambda event: saveAsDialog())
93      saveAsButton.bind('<Enter>', lambda event: showToolTips(2))
94      saveAsButton.bind('<Leave>', lambda event: hideToolTips(2))
95      colorsButton.bind('<Button-1>', lambda event: colorDialog())
96      colorsButton.bind('<Enter>', lambda event: showToolTips(3))
97      colorsButton.bind('<Leave>', lambda event: hideToolTips(3))
98      exitButton.bind('<Button-1>', lambda event: quit())
99      exitButton.bind('<Enter>', lambda event: showToolTips(4))
100     exitButton.bind('<Leave>', lambda event: hideToolTips(4))
101 #--------------------------------------------------------------------
102 # Create non-resizable Windows frame using the tk object
103 winFrame = tk.Tk()
104 winFrame.title("Menus")
105 winFrame.resizable(False, False)
106 winFrame.geometry('260x220')
107 # Invoke the images function
108 images()
109
110 # Create toolbar with images and bind to related click event
111 toolbar = tk.Frame(winFrame, bd = 1, relief = RAISED)
112 toolbar.pack(side=TOP, fill=X)
113
114 # Create the toolbar buttons and invoke the bindButton function to bind
115 # them with the relevant events
116 openFileButton = tk.Button(toolbar, image = photo1, relief = FLAT)
117 saveAsButton = tk.Button(toolbar, image = photo2, relief = FLAT)
118 colorsButton = tk.Button(toolbar, image = photo3, relief = FLAT)
119 exitButton = tk.Button(toolbar, image = photo4, relief = FLAT)
120 bindButtons()
121 openFileButton.pack(side=LEFT, padx=0, pady=0)
122 saveAsButton.pack(side=LEFT, padx=0, pady=0)
123 colorsButton.pack(side=LEFT, padx=0, pady=0)
124 exitButton.pack(side=LEFT, padx=0, pady=0)
125
126 winFrame.mainloop()

Output 5.3.2 (figure)

Observation 5.9 – Enter, Leave: Use the Enter and Leave events to trigger the desired actions when the mouse hovers over or moves away from an object.

The script is similar to the previous versions in structure but with some notable differences. Firstly, a toolbar frame is created and populated with four buttons instead of creating a menu structure. Images are loaded (lines 16–30), attached to the buttons (lines 116–119), and the buttons are displayed through the associated pack() method calls (lines 121–124). Secondly,
the buttons are associated with three events, namely Button-1, Enter, and Leave (lines 86–100, 120). Button-1 is triggered when the left mouse button is pressed, Enter when the mouse pointer moves over the button, and Leave when the mouse pointer exits the boundaries of the button.

Observation 5.10 – tooltip: To add a tooltip to a particular object, associate a label with it and display or hide the label as the mouse hovers over or moves away from an object.

Another key point in this script is the way tooltips are created and triggered. At the time of writing, Python did not provide an automatic method to create and trigger a tooltip. As such, developers wishing to use a tooltip should implement this functionality through coding. Nevertheless, the concept for doing so is rather simple: creating a label object that is displayed when the mouse hovers over the button. This can be accomplished by creating separate labels for each button or by creating a single label and changing its text and location coordinates depending on the mouse pointer position. As mentioned, once the mouse pointer exits the boundaries of the button, the label can be hidden (destroyed). This implementation of tooltip functionality is illustrated in functions showToolTips() and hideToolTips() (lines 52–82).

5.3.3 Popup Menus with Embedded Icons

Observation 5.11 – pop-up: Use a pop-up menu to provide menu functionality without having to permanently display the menu within the application. Pop-up menus can be used as stand-alone menu options or in combination with simple menus and/or toolbars.

A third way to create menus in Python is through pop-up menus. Pop-up menus are quite similar to simple menus, with the difference that they are not attached to any particular, pre-defined position, but are floating on top of the application window. The creation and configuration of pop-up menus follow the same structure as simple menus; however, they are triggered in a slightly different way (e.g., left or right click on a designated space within the application window). Pop-up menus, similarly to simple menus, can include items of various forms like text, images, combinations of both text and images, or shortcuts. They are often used in combination with menus of other types, like simple menus and toolbars, in order to improve application efficiency and make it more appealing to the user.

The following script implements the same functionality as the previous two examples, but uses pop-up menus instead of simple menus and/or toolbars. In this example, menu items include combinations of images and text:

 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import filedialog
 4  from tkinter import colorchooser
 5  from tkinter import Menu
 6  from tkinter import *
 7  # Import the necessary image processing classes from PIL
 8  from PIL import Image, ImageTk
 9
10  global photo1, photo2, photo3, photo4
11  global popupmenu
12
13  # Open and resize images - load images to the menu items
14  def images():
15      global photo1, photo2, photo3, photo4
16
17      image1 = Image.open("OpenFile.gif")
18      image1 = image1.resize((24, 24), Image.LANCZOS)
19      photo1 = ImageTk.PhotoImage(image1)
20      image2 = Image.open("SaveAs.gif")
21      image2 = image2.resize((24, 24), Image.LANCZOS)
22      photo2 = ImageTk.PhotoImage(image2)
23      image3 = Image.open("ColorsDialog.gif")
24      image3 = image3.resize((24, 24), Image.LANCZOS)
25      photo3 = ImageTk.PhotoImage(image3)
26      image4 = Image.open("Exit.gif")
27      image4 = image4.resize((24, 24), Image.LANCZOS)
28      photo4 = ImageTk.PhotoImage(image4)
29
30  # Define the colorDialog, openDialog, saveAsDialog, and quit functions
31  # to invoke the relevant dialogs or quit the application
32  def colorDialog():
33      # Assign the user's selection of the color to a set of variables
34      (rgbSelected, colorSelected) = colorchooser.askcolor()
35      # Change the form color using the color part of the set of variables
36      winFrame.config(background = colorSelected)
37
38  def openDialog():
39      filedialog.askopenfile(title = "Open File Dialog")
40
41  def saveAsDialog():
42      filedialog.asksaveasfilename(title = "Save As Dialog")
43
44  def quit():
45      winFrame.destroy()
46      exit()
47
48  def popupMenu(event):
49      global popupmenu
50      popupmenu.tk_popup(event.x_root, event.y_root)
51
52  #--------------------------------------------------------------------
53  # Create non-resizable Windows frame using the tk object
54  winFrame = tk.Tk()
55  winFrame.title("Menus")
56  winFrame.resizable(False, False)
57  winFrame.geometry('260x220')
58
59  # Invoke the images function
60  images()
61
62  # Create the popup menu
63  popupmenu = tk.Menu(winFrame, tearoff = 0)
64  popupmenu.add_command(label = "Color dialog", image = photo3,
65      compound = LEFT, command = colorDialog)
66  popupmenu.add_command(label = "Exit", image = photo4, compound = LEFT,
67      command = quit)
68  popupmenu.add_separator()
69  popupmenu.add_command(label = "Open File dialog", image = photo1,
70      compound = LEFT, command = openDialog)
71  popupmenu.add_command(label = "Save As dialog", image = photo2,
72      compound = LEFT, command = saveAsDialog)
73
74  winFrame.bind('<Button-1>', lambda event: popupMenu(event))
75  winFrame.mainloop()

Output 5.3.3 (figure)

The reader should pay attention to two particular aspects of this script. Firstly, the add_cascade() method that was used in previous scripts to tie together the various menu items to the main menu system is missing. In this instance, the tk_popup() method is used instead. The method is called as a member of the popupmenu object (i.e., inside the popupMenu(event) function), and casts the pop-up menu at the current position of the mouse cursor (line 50).
Secondly, it must be noted how the text and the picture are combined on the menu items (lines 63–71). Hot keys and other types of shortcuts can also be used, as described in previous sections.

Observation 5.12 – tk_popup(), add_cascade(): Use the tk_popup(event.x_root, event.y_root) method to display the pop-up menu at the current mouse location. Note that the add_cascade() method should not be used on this occasion, in contrast to the creation of simple menus.

Observation 5.13: Use combinations of text, images, and hot keys to make the pop-up menu items more appealing and self-explanatory.

5.4 ENHANCING THE GUI EXPERIENCE

Three additional concepts can be utilized in order to further enhance the GUI experience. What these concepts have in common is that they can be used to improve the efficiency of real estate and memory usage of an application. Ultimately, good programming practice supports the creation of separate, autonomous GUIs and their ability to be reused in various programs by simple calls from the corresponding objects. This section examines these three concepts and provides some examples of their application.

5.4.1 Notebooks and Tabbed Interfaces

Observation 5.14 – Notebook(), Frame(): Use the Notebook() constructor (ttk module) to create the main object of a tabbed interface. Use the Frame() constructor (ttk module) to create each tab separately and to add them to the main object. Finally, pack() all the pieces together and load the applications in the respective tabs.

As information systems grow larger in size, the management of real estate of the related applications (i.e., the creation of space that will host and display these applications) becomes increasingly important. The idea of using a menu system in its various different forms was introduced and explained in detail in previous sections. Menus offer a quite efficient way of addressing the management of real estate. An alternative way of doing so is through the use of tabbed interfaces. This approach is based on the creation of separate sub-sections inside a single window (i.e., tabs). Tabs are opened and run separately, but at the same time, they are parts of the same GUI structure. Tab-based implementations are commonly used in web browsers, where the various different web pages can be opened in separate tabs.

The following script combines two of the scripts covered in Chapter 4 (i.e., Buttons and Text and Speed Control) in a single application, utilizing a tab-based implementation:

# Import libraries
import tkinter as tk
from tkinter import ttk

# Declare and initialise the global variables and widgets
# for use with the functions
currentSpeedValue, speedLimitValue, finePerKmValue = 0, 0, 0
global speedLimitSpinbox
global finePerKmScale
global currentSpeedScale
global fine
global tab1, tab2
global winLabel
global winButton

# ===========================================================
# Functions related to the tab2 application of Speed Control
# ===========================================================
# Define the functions that will create the application interface
def createGUITab2():
    currentSpeedFrame()
    speedLimitFrame()
    finePerKmFrame()
    fineFrame()

# Define function to control changes in the Current Speed Scale widget
def onScale(val):
    global currentSpeedValue
    currentSpeedValue.set(float(val))
    calculateFine()

# Define function to control changes in the Speed Limit Spinbox widget
def getSpeedLimit():
    global speedLimitValue
    speedLimitValue.set(float(speedLimitSpinbox.get()))
    calculateFine()

# Define function to control changes in the Fine per Km Spinbox widget
def getFinePerKm(val):
    global finePerKmValue
    finePerKmValue.set(int(float(val)))
calculateFine() # Define function to calculate Fine based on user input def calculateFine(): global currentSpeedValue, speedLimitValue, finePerKmValue global fine diff = float(currentSpeedValue.get()) – float(speedLimitValue.get()) finePerKm = float(finePerKmValue.get()) if (diff <= 0): fine.config(text = 'No fine') else: fine.config(text = 'Fine in USD: '+ str(diff * finePerKm)) # Add the Current Speed widgets to tab2 def currentSpeedFrame(): global currentSpeedValue # Create the prompt label for the Current Speed tab currentSpeed = tk.Label(tab2, text = 'Current speed:', width = 24) Application Development 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 183 currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2, font = 'Arial 14 bold') currentSpeed.grid(column = 0, row = 0) # Create Scale widget and define connection variable currentSpeedValue = tk.DoubleVar() currentSpeedScale=tk.Scale (tab2, length = 200, from_ = 0, to = 360) currentSpeedScale.config(resolution = 0.5, activebackground = 'dark blue', orient = 'horizontal') currentSpeedScale.config(bg = 'light blue', fg = 'red', troughcolor = 'cyan', command = onScale) currentSpeedScale.grid(column = 1, row = 0) currentSpeedSelected = tk.Label(tab2, text = '...', textvariable = currentSpeedValue) currentSpeedSelected.grid(column = 2, row = 0) # Add the Speed Limit widgets to tab2 def speedLimitFrame(): global speedLimitValue global speedLimitSpinbox # Create the prompt label for the Speed Limit tab speedLimit = tk.Label (tab2, text = 'Speed Limit:', width = 24) speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2, font = 'Arial 14 bold') speedLimit.grid(column = 0, row = 1) # Create the Spinbox widget and define variable to connect # to Spinbox widget speedLimitValue = tk.DoubleVar() speedLimitSpinbox = ttk.Spinbox(tab2, from_ = 0, to = 360, command = getSpeedLimit) 
speedLimitSpinbox.grid(column = 1, row = 1) speedLimitSelected = tk.Label(tab2, text = '...', textvariable = speedLimitValue) speedLimitSelected.grid(column = 2, row = 1) # Add the Fine per Km widgets to tab2 def finePerKmFrame(): global finePerKmValue # Create the prompt label for the Fine per Km tab finePerKm=tk.Label(tab2, text='Fine/Km overspeed (USD):', width=24) finePerKm.config(bg = 'light blue', fg = 'brown', bd = 2, font = 'Arial 14 bold') finePerKm.grid(column = 0, row = 2) # Create Scale widget and define variable to connect to Scale widget finePerKmValue = tk.IntVar() finePerKmScale=ttk.Scale(tab2, orient = 'horizontal', length = 200, from_ = 0, to = 100, command = getFinePerKm) finePerKmScale.grid(column = 1, row = 2) finePerKmSelected = tk.Label(tab2, text = '...', textvariable = finePerKmValue) finePerKmSelected.grid(column = 2, row = 2) 184 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 Handbook of Computer Programming with Python # Add the Fine for speeding label to tab2 def fineFrame(): global fine # Create the label that will display the fine on the Fine tab fine = tk.Label(tab2, text = 'Fine in USD:...', fg = 'blue') fine.grid(column = 0, row = 3) # =========================================================== # The functions related to the tab1 application (button and text) # =========================================================== # Define the function that will control the mouse click events def changeText(a): global winLabel winLabel.config(text = a) # Define the function that will create the GUI for the tab1 def createGUITab1(): global winButton global winLabel winLabel = tk.Label(tab1, text = "...") winLabel.grid(column = 1, row = 0) # Create the button widget and bind it with the associated events winButton=tk.Button(tab1, text="Left, right, or double left Click " "\nto change 
the text of the label", font="Arial 16", fg="red") winButton.grid(column = 0, row = 0) winButton.bind("<Button-1>", lambda event, \ a = "You left clicked on the button": changeText(a)) winButton.bind("<Button-2>", lambda event, \ a = "You right clicked on the button": changeText(a)) winButton.bind("<Double-Button-1>", lambda event, \ a = "You double left clicked on the button": changeText(a)) winButton.bind("<Enter>", lambda event, \ a = "You are hovering above the button": changeText(a)) winButton.bind("<Leave>", lambda event, \ a = "You left the button widget": changeText(a)) # =========================================================== # Create non-resizable Windows frame using the tk object winFrame = tk.Tk() winFrame.title("Tabs") winFrame.resizable(True, True) winFrame.geometry('500x180') # Create notebook with tab pages tabbedInterface = ttk.Notebook(winFrame) tab1 = ttk.Frame(tabbedInterface) tabbedInterface.add(tab1, text = "Buttons and Text") tab2 = ttk.Frame(tabbedInterface) tabbedInterface.add(tab2, text = "Speed control") tabbedInterface.pack() Application Development 164 165 166 167 168 169 185 # Invoke the 2 functions to create the different GUIs for the 2 tabs createGUITab1() createGUITab2() winFrame.mainloop() Output 5.4.1: As shown in the output, the application implements an interface with two tabs, one hosting the Buttons and Text application and the other the Speed Control application. In this example, it is worth to raise some key points. Firstly, the tabs allow for a more efficient use of the real estate, since the two separate applications run simultaneously in a single window, but are displayed independently from each other. Secondly, the creation of the tabbed interface is through the Notebook() constructor of the ttk module (line 158). The two tabs are created using the Frame() constructor of the ttk module (lines 159 and 161) and are associated with the main notebook object by being added to it (lines 160 and 162). 
All the components are packed together in line 163. Finally, the contents of the tabs are created by means of the relevant GUI function calls in lines 166 and 167.

There are two main differences between the way the applications are used in this example and in the original implementations presented in Chapter 4. The first is that, in both cases, the applications are converted to a completely procedural format, with all the required functionality placed in functions and no statements of the original applications left in the main body of the program. The second is that the Speed Control application is somewhat simplified, as the control variables associated with the Scale and Spinbox widgets and their respective labels are removed in order to avoid possible referencing issues between the various functions.

5.4.2 Threaded Applications

One of the most important concepts in programming, and arguably among the most effective tools when creating real-life applications, is that of threads and threading. The idea behind threads is rather straightforward: multiple flows of execution of an application can run independently of each other. One way to conceptualize threads is to view them as different objects of the same class. This is a reasonably accurate description, with the additional element that each thread is scheduled separately by the operating system, while sharing the memory of the process that created it. One of the main characteristics of threaded applications is that they are meant to run in parallel. In reality, even in the case of multi-core computer systems, this is not entirely feasible, but this is a rather specialized computer architecture consideration that exceeds the scope of this book.

In the following example, the SpeedControl application from Chapter 4 is converted to a class, for the purpose of demonstrating the implementation of threads.
Observation 5.15 – threads: Create different threads from objects of the same class. Threads are separate and independent flows of execution, and can run concurrently or sequentially. They run within the process that created them and share its memory space.

The script creates two objects of the SpeedControl class and runs them separately on two different threads:

  1  # Import modules tk, ttk and threading
  2  import tkinter as tk
  3  from tkinter import ttk
  4  import threading
  5
  6  class SpeedControl(threading.Thread):
  7
  8      # Create and run the main window frame for the application
  9      def __init__(self, winFrame):
 10          super(SpeedControl, self).__init__()
 11          self.winFrame = winFrame
 12          self.winFrame.title("Control speed")
 13          self.winFrame.config(bg = 'light grey')
 14          self.winFrame.resizable(False, False)
 15          self.winFrame.geometry('500x170')
 16
 17          # Create the frame, label and scale widgets for currentSpeed
 18          self.currentSpeedFrame = tk.Frame (self.winFrame,
 19              bg = 'light grey', bd = 2, relief = 'sunken')
 20          self.currentSpeedFrame.pack()
 21          self.currentSpeedFrame.place(relx = 0.05, rely = 0.05)
 22          self.currentSpeed = tk.Label(self.currentSpeedFrame,
 23              text = 'Current speed:', width = 24)
 24          self.currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
 25              font = 'Arial 14 bold')
 26          self.currentSpeed.grid(column = 0, row = 0)
 27          self.currentSpeedScale = tk.Scale (self.currentSpeedFrame,
 28              length = 200, from_ = 0, to = 360)
 29          self.currentSpeedScale.config(resolution = 1,
 30              orient = 'horizontal', activebackground = 'dark blue')
 31          self.currentSpeedScale.config(bg = 'light blue', fg = 'red',
 32              troughcolor = 'cyan', command = self.onScale)
 33          self.currentSpeedScale.grid(column = 1, row = 0)
 34          self.currentSpeedSel = tk.Label(self.currentSpeedFrame, text='...')
 35          self.currentSpeedSel.grid(column = 2, row = 0)
 36
 37          # Create the frame, label, & spinbox widget for the speedLimit
 38          self.speedLimitFrame = tk.Frame(self.winFrame,
 39              bg = 'light yellow', bd = 4, relief = 'sunken')
 40          self.speedLimitFrame.pack()
 41          self.speedLimitFrame.place(relx = 0.05, rely = 0.30)
 42          self.speedLimit = tk.Label (self.speedLimitFrame,
 43              text = 'Speed limit:', width = 24)
 44          self.speedLimit.config(bg= 'light blue', fg = 'yellow', bd = 2,
 45              font = 'Arial 14 bold')
 46          self.speedLimit.grid(column = 0, row = 0)
 47          self.speedLimitSpinbox = ttk.Spinbox(self.speedLimitFrame,
 48              from_ = 0, to = 360, command = self.getSpeedLimit)
 49          self.speedLimitSpinbox.grid(column = 1, row = 0)
 50          self.speedLimitSel=tk.Label(self.speedLimitFrame, text='...')
 51          self.speedLimitSel.grid(column = 2, row = 0)
 52
 53          # Create the frame, label, and scale widget for finePerKm
 54          self.finePerKmFrame = tk.Frame(self.winFrame,
 55              bg = 'light grey', bd = 2, relief = 'sunken')
 56          self.finePerKmFrame.pack()
 57          self.finePerKmFrame.place (relx = 0.05, rely = 0.55)
 58          self.finePerKm = tk.Label(self.finePerKmFrame,
 59              text = 'Fine/Km overspeed (USD):', width = 24)
 60          self.finePerKm.config(bg = 'light blue', fg = 'red', bd = 2,
 61              font = 'Arial 14 bold')
 62          self.finePerKm.grid(column = 0, row = 0)
 63          self.finePerKmScale = tk.Scale(self.finePerKmFrame,
 64              length = 200, from_ = 0, to = 100)
 65          self.finePerKmScale.config(resolution = 1,
 66              activebackground = 'dark blue', orient = 'horizontal')
 67          self.finePerKmScale.config(bg = 'light cyan', fg = 'red',
 68              troughcolor = 'light blue', command = self.getFinePerKm)
 69          self.finePerKmScale.grid(column = 1, row = 0)
 70          self.finePerKmSel = tk.Label(self.finePerKmFrame, text='...')
 71          self.finePerKmSel.grid(column = 2, row = 0)
 72          # Create the frame for the fine and the related label
 73          self.fineFrame = tk.Frame(self.winFrame, bg = 'yellow', bd = 4,
 74              relief = 'raised')
 75          self.fineFrame.pack()
 76          self.fineFrame.place(relx = 0.05, rely = 0.80)
 77          self.fine = tk.Label(self.fineFrame, text = 'Fine in USD:...',
 78              fg = 'blue')
 79          self.fine.grid(column = 0, row = 0)
 80
 81      # Define function to control changes in Current Speed Scale widget
 82      def onScale(self, val):
 83          v = int(float(val))
 84          self.currentSpeedSel.config(text = v)
 85          self.calculateFine()
 86
 87      # Define function to control changes in Speed Limit Spinbox widget
 88      def getSpeedLimit(self):
 89          v = self.speedLimitSpinbox.get()
 90          self.speedLimitSel.config(text = v)
 91          self.calculateFine()
 92
 93      # Define function to control changes in Fine per Km Scale widget
 94      def getFinePerKm(self, val):
 95          v = int(float(val))
 96          self.finePerKmSel.config(text = v)
 97          self.calculateFine()
 98
 99      # Define function to calculate the Fine based on user input
100      def calculateFine(self):
101          currentSpeed, speedLimit, finePerKm = 0, 0.0, 0
102
103          # Ensure relevant objects are initiated & assigned with values
104          if (self.currentSpeedScale.get() != '' and
105              self.speedLimitSpinbox.get() != '' and
106              self.finePerKmScale.get() != ''):
107              currentSpeed = self.currentSpeedScale.get()
108              speedLimit = float(self.speedLimitSpinbox.get())
109              finePerKm = self.finePerKmScale.get()
110          else:
111              currentSpeed, finePerKm = 0, 0; speedLimit = 0.0
112
113          # Calculate the fine and display it on the associated label
114          diff = currentSpeed - speedLimit
115
116          if (diff <= 0):
117              self.fine.config(text = 'No fine')
118          else:
119              self.fine.config(text='Fine in USD: '+str(diff*finePerKm))
120
121  # Create two different GUI frames
122  winFrame1 = tk.Tk()
123  winFrame2 = tk.Tk()
124  # Create two different threads - one for each GUI frame
125  speedControl1 = SpeedControl(winFrame1)
126  speedControl2 = SpeedControl(winFrame2)
127
128  # Start each thread/frame and run it separately
129  speedControl1.start()
130  winFrame1.mainloop()
131
132  speedControl2.start()
133  winFrame2.mainloop()

Output 5.4.2:

The output illustrates how this particular application runs the two different objects in separate threads. It must be noted that the threads are running simultaneously.
The term in parallel should be avoided in this context, as it is uncertain whether the threads are indeed running in parallel. This is also something that can be affected by the operating system, the hardware and software settings, and the associated behaviors. Nevertheless, from the perspective of the user, this is of purely academic interest. As shown in the example above, the two threaded objects do appear to run in parallel, but at the same time they function independently and use different inputs, as if they were run sequentially.

The order of statements between lines 121 and 133 is also important. Firstly, the two GUI window frames are created as normal. If the first GUI frame were created and directly followed by the first threaded object, before the second GUI frame and threaded object, the user would only get access to the first window frame. The second window frame would only appear once the first one was closed and stopped. The reader should also notice that the threading module needs to be imported before the calls to the start() methods of the threaded objects (i.e., speedControl1 and speedControl2).

Observation 5.16 – Threads: Use the Thread class from the threading module to create threaded objects. Use the start() method to start the threads; the Thread class provides no stop() method, so a thread ends when its run() method returns. Always use the self parameter on all widgets and attributes to refer to the specific object they belong to.

Observation 5.17: Avoid using control variables (e.g., IntVar()) in threaded objects.

Observation 5.18: In cases of GUI-based threaded objects, use the mainloop() method for monitoring each object.

It must be noted that each threaded object is assigned to a separate window frame and has a dedicated mainloop() method to monitor its GUI and the associated events (lines 125–130 and 126–133). This assignment takes place in lines 125–126, where the window frame for each threaded object is passed as a parameter and used as the specific, independent GUI for the underlying object.

Another notable aspect of the script is the explicit definition of the __init__(self, winFrame) method, which starts by calling super(SpeedControl, self).__init__() and then loads the GUI widgets onto the window frame of each of the threaded objects. The reader should be reminded here that the __init__() method is called automatically by Python when an object is created, in order to initialize the necessary widgets and attributes in preparation for launching the object. The self parameter is necessary in order for the Python interpreter to distinguish which object is running and what widgets and attributes belong to it. This is the reason why each widget and attribute, and even simple variables that belong to the object, are preceded by self.

Another key point in this particular script is that, since the object that is being created is threaded, it inherits from the Thread class of the threading module (line 6) and initializes that class (line 10). These two lines, which essentially create the threaded object, are executed each time a new threaded object is initiated.

Finally, the reader should note that the control variables (e.g., IntVar()) are missing from this version of the code. This was done on purpose, as their inclusion could cause unnecessary conflicts between the threaded objects and the cross-method operations within any single threaded object, without offering any particular benefits to the application. In general, it is advisable that control variables on widgets are avoided, especially when implementing object-oriented and/or threaded object applications.
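As a complement to the observations above, it may help to see a threaded object that is given explicit work to do. The following is only a minimal sketch, independent of the GUI scripts in this chapter: the FineWorker class, its parameters, and the run_workers() function are hypothetical names introduced for illustration. It relies on the standard behavior of the threading module, where start() executes the run() method of the object in a new thread and join() waits for that thread to finish. The results are handed back to the main thread through a queue.Queue object.

```python
import queue
import threading


class FineWorker(threading.Thread):
    """Hypothetical worker that computes one speeding fine in its own thread."""

    def __init__(self, current_speed, speed_limit, fine_per_km, results):
        super().__init__()
        self.current_speed = current_speed
        self.speed_limit = speed_limit
        self.fine_per_km = fine_per_km
        self.results = results  # queue.Queue shared with the main thread

    def run(self):
        # run() is the method that start() executes in the new thread
        diff = self.current_speed - self.speed_limit
        fine = diff * self.fine_per_km if diff > 0 else 0
        self.results.put(fine)


def run_workers(cases):
    """Start one worker per (speed, limit, rate) case and collect the fines."""
    results = queue.Queue()
    workers = [FineWorker(speed, limit, rate, results)
               for (speed, limit, rate) in cases]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()   # wait until every thread has finished
    fines = []
    while not results.empty():
        fines.append(results.get())
    # The order in which threads finish is not guaranteed, so sort the output
    return sorted(fines)
```

For instance, run_workers([(120, 100, 5), (80, 100, 5)]) starts two threads and collects one fine from each. Keeping the calculation in the worker and the collection of results in the main thread is one common arrangement, as it avoids having several threads update the same objects directly.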
5.4.3 Combining Multiple Concepts and Applications in a Multithreaded System

Chapters 2–5 of this book provide a gradual progression from basic programming skills to more advanced application development concepts. Although there are certainly many more concepts and layers of depth to be explored when it comes to programming in Python, Chapters 2–5 should provide a solid basis for the aspiring programmer, as they cover the building blocks required to make functional and well-structured applications. As a conclusion to this conceptual part of the book, it was deemed necessary to provide an overview of how the concepts, mechanisms, and practices presented so far can be integrated into a coherent, centralized solution. Ultimately, this should provide an idea of how a multithreaded and multi-functional information system can be built, resembling the scenarios and challenges one may face in real life.

The example presented below combines two of the applications developed earlier (Speed Control and Bubble Sort) into a multithreaded system that can be launched and operated as a single, unified platform. In order for this to be possible, two changes are required:

a. Each of the two individual applications (Speed Control and Bubble Sort) must be adjusted according to the object-oriented paradigm. This is done by separating and extracting the main code that is responsible for the GUI creation and all related methods, and saving the remaining code as separate text files in Jupyter. As a result, the original applications can no longer be run on their own, as no actual object is created in the remaining code. Instead of creating the object within the main body of each application, this is done through a call from another application, which now functions as the main application.

b. The code that was extracted from the original applications must be imported to this newly created application.
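The two changes listed above can be tried out on a much smaller scale. The sketch below is an illustration under stated assumptions rather than part of the application: the module name demo_fine_module, the FineCalculator class, and the load_and_use() function are hypothetical, and the class file is written to a temporary folder only so that the example is self-contained. In practice, the .py file would simply be saved next to the main application and imported by name.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a class file: it only defines a class
# and does not create any object of its own.
MODULE_SOURCE = '''
class FineCalculator:
    def __init__(self, fine_per_km):
        self.fine_per_km = fine_per_km

    def fine(self, current_speed, speed_limit):
        diff = current_speed - speed_limit
        return diff * self.fine_per_km if diff > 0 else 0
'''


def load_and_use():
    """Save the class as a .py file, import it, and create the object."""
    with tempfile.TemporaryDirectory() as folder:
        # Change (a): the class is saved on its own as a text file
        Path(folder, "demo_fine_module.py").write_text(MODULE_SOURCE)

        # Change (b): the main application imports the file and only
        # then creates and uses the object
        sys.path.insert(0, folder)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("demo_fine_module")
            calculator = module.FineCalculator(fine_per_km=5)
            return calculator.fine(current_speed=130, speed_limit=100)
        finally:
            # Undo the temporary changes made for this demonstration
            sys.path.remove(folder)
            sys.modules.pop("demo_fine_module", None)
```

The class file on its own produces no output when executed, since no object is created in it; the object only comes into existence once the importing code calls the constructor.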
The code examples presented and discussed in the following pages provide a practical illustration of these changes:

Chapter5SpeedControl.py

  1  # Import modules tk, ttk and threading
  2  import tkinter as tk
  3  from tkinter import ttk
  4  import threading
  5
  6  class SpeedControl(threading.Thread):
  7
  8      # Create and run the main window frame for the application
  9      def __init__(self, winFrame):
 10          super(SpeedControl, self).__init__()
 11          self.winFrame = winFrame
 12          self.winFrame.title("Control speed")
 13          self.winFrame.config(bg = 'light grey')
 14          self.winFrame.resizable(False, False)
 15          self.winFrame.geometry('500x170')
 16
 17          # Create frame for currentSpeed & its label and scale widgets
 18          self.currentSpeedFrame = tk.Frame(self.winFrame,
 19              bg = 'light grey', bd = 2, relief = 'sunken')
 20          self.currentSpeedFrame.pack()
 21          self.currentSpeedFrame.place(relx = 0.05, rely = 0.05)
 22          self.currentSpeed = tk.Label(self.currentSpeedFrame,
 23              text = 'Current speed:', width = 24)
 24          self.currentSpeed.config(bg = 'light blue', fg = 'red', bd = 2,
 25              font = 'Arial 14 bold')
 26          self.currentSpeed.grid(column = 0, row = 0)
 27          self.currentSpeedScale = tk.Scale(self.currentSpeedFrame,
 28              length = 200, from_ = 0, to = 360)
 29          self.currentSpeedScale.config(resolution = 1,
 30              activebackground = 'dark blue', orient = 'horizontal')
 31          self.currentSpeedScale.config(bg = 'light blue', fg = 'red',
 32              troughcolor = 'cyan', command = self.onScale)
 33          self.currentSpeedScale.grid(column = 1, row = 0)
 34          self.currentSpeedSel = tk.Label(self.currentSpeedFrame, text = '...')
 35          self.currentSpeedSel.grid(column = 2, row = 0)
 36
 37          # Create frame for speedLimit & its label and spinbox widgets
 38          self.speedLimitFrame = tk.Frame(self.winFrame,
 39              bg = 'light yellow', bd = 4, relief = 'sunken')
 40          self.speedLimitFrame.pack()
 41          self.speedLimitFrame.place(relx = 0.05, rely = 0.30)
 42          self.speedLimit = tk.Label(self.speedLimitFrame,
 43              text = 'Speed limit:', width = 24)
 44          self.speedLimit.config(bg = 'light blue', fg = 'yellow', bd = 2,
 45              font = 'Arial 14 bold')
 46          self.speedLimit.grid(column = 0, row = 0)
 47          self.speedLimitSpinbox = ttk.Spinbox(self.speedLimitFrame,
 48              from_ = 0, to = 360, command = self.getSpeedLimit)
 49          self.speedLimitSpinbox.grid(column = 1, row = 0)
 50          self.speedLimitSel=tk.Label(self.speedLimitFrame,text='...')
 51          self.speedLimitSel.grid(column = 2, row = 0)
 52
 53          # Create frame for finePerKm and its label and scale widgets
 54          self.finePerKmFrame = tk.Frame(self.winFrame,
 55              bg = 'light grey', bd = 2, relief = 'sunken')
 56          self.finePerKmFrame.pack()
 57          self.finePerKmFrame.place(relx = 0.05, rely = 0.55)
 58          self.finePerKm = tk.Label(self.finePerKmFrame,
 59              text = 'Fine/Km overspeed (USD):', width = 24)
 60          self.finePerKm.config(bg = 'light blue', fg = 'red', bd = 2,
 61              font = 'Arial 14 bold')
 62          self.finePerKm.grid(column = 0, row = 0)
 63          self.finePerKmScale = tk.Scale(self.finePerKmFrame,
 64              length = 200, from_ = 0, to = 100)
 65          self.finePerKmScale.config(resolution = 1,
 66              activebackground = 'dark blue', orient = 'horizontal')
 67          self.finePerKmScale.config(bg = 'light cyan', fg = 'red',
 68              troughcolor = 'light blue', command = self.getFinePerKm)
 69          self.finePerKmScale.grid(column = 1, row = 0)
 70          self.finePerKmSel=tk.Label(self.finePerKmFrame, text = '...')
 71          self.finePerKmSel.grid(column = 2, row = 0)
 72          # Create the frame for Fine and its label
 73          self.fineFrame = tk.Frame(self.winFrame, bg = 'yellow', bd = 4,
 74              relief = 'raised')
 75          self.fineFrame.pack()
 76          self.fineFrame.place(relx = 0.05, rely = 0.80)
 77          self.fine = tk.Label(self.fineFrame, text = 'Fine in USD:...',
 78              fg = 'blue')
 79          self.fine.grid(column = 0, row = 0)
 80
 81      # Define function to control changes in CurrentSpeedScale widget
 82      def onScale(self, val):
 83          v = int(float(val))
 84          self.currentSpeedSel.config(text = v)
 85          self.calculateFine()
 86
 87      # Define function to control changes in SpeedLimitSpinbox widget
 88      def getSpeedLimit(self):
 89          v = self.speedLimitSpinbox.get()
 90          self.speedLimitSel.config(text = v)
 91          self.calculateFine()
 92
 93      # Define function to control changes in FinePerKm Scale widget
 94      def getFinePerKm(self, val):
 95          v = int(float(val))
 96          self.finePerKmSel.config(text = v)
 97          self.calculateFine()
 98
 99      # Define the function to calculate the Fine based on user input
100      def calculateFine(self):
101          currentSpeed, speedLimit, finePerKm = 0, 0.0, 0
102
103          # Make sure the objects are initiated and assigned with values
104          if (self.currentSpeedScale.get() != '' and
105              self.speedLimitSpinbox.get() != '' and
106              self.finePerKmScale.get() != ''):
107              currentSpeed = self.currentSpeedScale.get()
108              speedLimit = float(self.speedLimitSpinbox.get())
109              finePerKm = self.finePerKmScale.get()
110          else:
111              currentSpeed, finePerKm = 0, 0; speedLimit = 0.0
112
113          # Calculate the fine and display it on the associated label
114          diff = currentSpeed - speedLimit
115
116          if (diff <= 0):
117              self.fine.config(text = 'No fine')
118          else:
119              self.fine.config(text='Fine in USD: '+str(diff*finePerKm))

In the class presented above, the statements that create and run the GUI have already been separated and extracted, so that the class is ready to be imported to the main application that will eventually create the multithreaded objects. Apart from the extraction of these particular statements, the class implements the SpeedControl application as discussed in the previous section. The class needs to be saved as a text file with the .py extension.
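One possible refinement, offered here as a sketch rather than as part of the class above, is to keep the fine calculation in a plain function that does not depend on any widget. The compute_fine() function and its parameter names are hypothetical. The function mirrors the logic of calculateFine(), including the case where the Spinbox still returns an empty string, and can therefore be checked without opening a window.

```python
def compute_fine(current_speed, raw_speed_limit, fine_per_km):
    """Return the text for the fine label (hypothetical helper).

    raw_speed_limit is the string read from the Spinbox widget, which
    is empty until the user selects or types a value.
    """
    # Mirror the guard of calculateFine(): no limit entered, no fine
    if raw_speed_limit == '':
        return 'No fine'

    diff = current_speed - float(raw_speed_limit)
    if diff <= 0:
        return 'No fine'
    return 'Fine in USD: ' + str(diff * fine_per_km)
```

Within the class, calculateFine() could then pass the values read from the widgets to such a function and display the returned text on the label, leaving the widget handling and the arithmetic in separate places.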
Chapter5BubbleSort.py

  1  # Import modules tk, ttk, random, time and threading
  2  import tkinter as tk
  3  from tkinter import ttk
  4  from tkinter import *
  5  import random
  6  import time
  7  import threading
  8
  9  class BubbleSort(threading.Thread):
 10
 11      # Initialise the various lists used by the objects of the class
 12      unsortedL = []; sortedL = []; statisticsData = []
 13      sizes = [5, 20, 100, 250, 500, 750, 1000, 2000, 5000, 10000, 20000]
 14
 15      # Create and run the main window frame for the application
 16      def __init__(self, winFrame):
 17          super(BubbleSort, self).__init__()
 18          self.winFrame = winFrame
 19          self.winFrame.title("Bubble Sort")
 20          self.winFrame.config(bg = 'light grey')
 21          self.winFrame.resizable(True, True)
 22          self.winFrame.geometry('650x300')
 23          self.listSize = 0
 24          self.createGUI()
 25
 26      # Define the functions that will create the application GUI
 27      def createGUI(self):
 28          self.unsortedFrame()
 29          self.entryFrame()
 30          self.entryButton()
 31          self.sortButton()
 32          self.sortedFrame()
 33          self.clearButton()
 34          self.statisticsButton()
 35          self.statisticsSelection()
 36
 37      # Create labelframe; populate with Unsorted Array Listbox widgets
 38      def unsortedFrame(self):
 39          self.UnsortedFrame=tk.LabelFrame(self.winFrame,
 40              text='Unsorted Array')
 41          self.UnsortedFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
 42              relief = 'sunken')
 43          # Create a scrollbar widget to attach to UnsortedList
 44          self.UnsortedListScrollBar = Scrollbar(self.UnsortedFrame,
 45              orient = VERTICAL)
 46          self.UnsortedListScrollBar.pack(side = RIGHT, fill = Y)
 47          # Create a listbox in the Unsorted Array frame
 48          self.UnsortedList = tk.Listbox(self.UnsortedFrame,
 49              yscrollcommand = self.UnsortedListScrollBar.set,
 50              bg = 'cyan', width = 13, height = 12, bd = 0)
 51          self.UnsortedList.pack(side = LEFT, fill = BOTH)
 52          # Associate the scrollbar command with its parent widget
 53          # (i.e., the UnsortedList yview)
 54          self.UnsortedListScrollBar.config(command = self.UnsortedList.yview)
 55          # Place the Unsorted frame & its components into the interface
 56          self.UnsortedFrame.pack()
 57          self.UnsortedFrame.place(relx = 0.02, rely = 0.05)
 58
 59      # Create the labelframe that will contain the Entry widget
 60      def entryFrame(self):
 61          self.EntryFrame = tk.LabelFrame(self.winFrame, text= 'Actions')
 62          self.EntryFrame.config(bg = 'light grey', fg = 'red', bd = 2,
 63              relief = 'sunken')
 64          self.EntryFrame.pack(); self.EntryFrame.place(relx=0.25, rely=0.05)
 65          # Create the label in the Entry frame
 66          self.EntryLabel = tk.Label(self.EntryFrame,
 67              text = 'How many integers\nin the list', width = 16)
 68          self.EntryLabel.config(bg = 'light grey', fg = 'red', bd = 3,
 69              relief = 'flat', font = 'Arial 14 bold')
 70          self.EntryLabel.grid(column = 0, row = 0)
 71          # Create combo box to select the number of elements in lists
 72          self.ListSizeCombo = ttk.Combobox(self.EntryFrame, width = 10)
 73          self.ListSizeCombo['values'] = self.sizes
 74          self.ListSizeCombo.current(0)
 75          self.ListSizeCombo.grid(column = 1, row = 0)
 76
 77      # Create the button that will insert new entries into the unsorted
 78      # array and list box
 79      def entryButton(self):
 80          self.EntryButton = tk.Button(self.EntryFrame, relief= 'raised',
 81              text = 'Populate\nUnsorted list', width = 16)
 82          self.EntryButton.bind('<Button-1>', lambda event:
 83              self.populateUnsortedList())
 84          self.EntryButton.grid(column = 0, row = 2)
 85
 86      # Populate the unsorted list with random numbers and populate
 87      # the unsorted list box
 88      def populateUnsortedList(self):
 89          self.listSize = int(self.ListSizeCombo.get())
 90          # Generate random integers with randint() from the random module
 91          for i in range (self.listSize):
 92              n = random.randint(-100, 100)
 93              # Enter the generated random integer to the relevant place
 94              # in the unsorted list
 95              self.unsortedL.insert(i, n)
 96          # Populate UnsortedList with the unsorted list elements
 97          for i in range (0, self.listSize):
 98              self.UnsortedList.insert(i, self.unsortedL[i])
 99          self.UnsortedListScrollBar.config(command= self.UnsortedList.yview)
100
101      # Create the button that will sort the numbers and display them
102      # in the sorted array and list box
103      def sortButton(self):
104          self.SortButton = tk.Button(self.EntryFrame, relief = 'raised',
105              text = 'Sort numbers\nwith BubbleSort', width = 16)
106          self.SortButton.bind('<Button-1>',lambda event: self.sortToSortedList())
107          self.SortButton.grid(column = 1, row = 2)
108
109      # Create the labelframe to include the Sorted Array Listbox widgets
110      def sortedFrame(self):
111          self.SortedFrame=tk.LabelFrame(self.winFrame, text='Sorted Array')
112          self.SortedFrame.config(bg = 'light grey', fg = 'blue', bd = 2,
113              relief = 'sunken')
114          # Create a scrollbar widget to attach to the SortedList
115          self.SortedListScrollBar = Scrollbar (self.SortedFrame)
116          self.SortedListScrollBar.pack(side = RIGHT, fill = Y)
117          # Create the list box in the Sorted Array frame
118          self.SortedList = tk.Listbox (self.SortedFrame,
119              yscrollcommand = self.SortedListScrollBar.set,
120              bg = 'cyan', width = 13, height = 12, bd = 0)
121          self.SortedList.pack(side = LEFT, fill = BOTH)
122          # Associate the scrollbar command with its parent widget
123          # (i.e., the SortedList yview)
124          self.SortedListScrollBar.config(command = self.SortedList.yview)
125          # Place the Sorted frame and its parts into the interface
126          self.SortedFrame.pack(); self.SortedFrame.place(relx = 0.75,
127              rely = 0.05)
128
129      # Bubble Sort sorts the list & records information for later use
130      def sortToSortedList(self):
131          # Load unsorted list & list box to the sorted list & list box
132          for i in range (0, self.listSize):
133              self.sortedL.insert(i, self.unsortedL[i])
134
135          # Start timer
136          self.startTime = time.process_time()
137
138          # The Bubble sort algorithm
139          for i in range (self.listSize-1):
140              for j in range (self.listSize-1):
141                  if (self.sortedL[j] > self.sortedL[j+1]):
142                      temp = self.sortedL[j]
143                      self.sortedL[j] = self.sortedL[j+1]
144                      self.sortedL[j+1] = temp
145          # End timer
146          self.endTime = time.process_time()
147          # Load the sorted list to the relevant list box
148          for i in range (0, self.listSize):
149              self.SortedList.insert(i, self.sortedL[i])
150          self.SortedListScrollBar.config(command=self.SortedList.yview)
151
152      # Create button that will clear the two list boxes & the two lists
153      def clearButton(self):
154          self.ClearButton = tk.Button(self.EntryFrame,
155              text = 'Clear lists', relief = 'raised', width = 16)
156          self.ClearButton.bind('<Button-1>', lambda event:
157              self.clearLists())
158          self.ClearButton.grid(column = 0, row = 3)
159
160      # Clear all lists, list & combo boxes, & related global variable
161      def clearLists(self):
162          self.sortedL.clear()
163          self.unsortedL.clear()
164          self.UnsortedList.delete('0', 'end')
165          self.SortedList.delete('0', 'end')
166          self.statisticsData.clear()
167          self.StatisticsCombo.delete('0', 'end')
168          self.listSize = 0
169
170      # Create the button that will display sorting information
171      def statisticsButton(self):
172          self.StatisticsButton = tk.Button(self.EntryFrame,
173              text = 'Show statistics', relief = 'raised', width = 16)
174          self.StatisticsButton.bind('<Button-1>', lambda event:
175              self.statistics())
176          self.StatisticsButton.grid(column = 1, row = 3)
177
178      # Create the option menu that will show the statistical results
179      # from the sorting process
180      def statisticsSelection(self):
181          self.StatisticsSelection = tk.StringVar()
182          self.statisticsData = ['The statistics will appear here']
183          self.StatisticsSelection.set(self.statisticsData[0])
184          self.StatisticsCombo = ttk.Combobox(self.EntryFrame,
185              textvariable = self.StatisticsSelection, width = 30)
186          self.StatisticsCombo['values'] = self.statisticsData
187          self.StatisticsCombo.grid(column = 0, columnspan = 2, row = 4)
188
189      # Calculate and report the statistics from the sorting process
190      def statistics(self):
191          self.statisticsData.clear()
192          self.statisticsData.insert(1, 'The size of the list is ' +
193              str(self.listSize))
194          self.statisticsData.insert(2, 'The sum of the list is ' +
195              str(sum(self.sortedL)))
196          self.statisticsData.insert(3, 'The time passed to sort the ' +
197              'list was ' + str(round(self.endTime - self.startTime, 5)))
198          self.statisticsData.insert(4, 'The average of the sorted list '
199              + 'is: ' + str(round(sum(self.sortedL) / self.listSize, 2)))
200          self.StatisticsCombo['values'] = self.statisticsData

As with the SpeedControl class discussed previously, the class presented above is the modified version of the Bubble Sort application. The object-oriented paradigm is adopted by separating and extracting the statements that would create and run the GUI. The remaining code is saved as a .py text file in Jupyter, in order to be accessible by the main application.

The following class implements the main application that imports the two classes and runs them as threaded objects. The classes are imported in lines 5–6, and the main GUI object is created in lines 47, 49, and 51. The interface offers a single function: the display of a pop-up menu when a left-click event takes place. The menu allows for the creation of two threaded objects based on SpeedControl and Bubble Sort (line 30).
The reader should note how the statements separated and extracted from the imported classes were added to the main application in lines 32–37 and 39–44 respectively:

 1  # Import libraries
 2  import tkinter as tk
 3  from tkinter import Menu
 4  from tkinter import *
 5  import Chapter5SpeedControl
 6  import Chapter5BubbleSort
 7  import threading
 8
 9  class Application:
10
11      # Create main window frame for the application with the popup menu
12      def __init__(self, winFrame):
13          self.winFrame = winFrame
14          self.winFrame.title("Application with threads")
15          self.winFrame.config(bg = 'light grey')
16          self.winFrame.resizable(False, False)
17          self.winFrame.geometry('260x220')
18
19          self.popupmenu = tk.Menu(self.winFrame, tearoff = 0)
20          self.popupmenu.add_command(label = "Speed Control",
21              command = self.speedControlThread)
22          self.popupmenu.add_command(label = "Bubble Sort",
23              command = self.bubbleSortThread)
24          self.winFrame.bind('<Button-1>',
25              lambda event: self.popupMenu(event))
26          self.winFrame.config(menu = self.popupmenu)
27          self.winFrame.mainloop()
28
29      def popupMenu(self, event):
30          self.popupmenu.tk_popup(event.x_root, event.y_root)
31
32      def speedControlThread(self):
33          # Prepare the Speed Control GUI
34          speedControlFrame = tk.Tk()
35          speedControl1 = Chapter5SpeedControl.SpeedControl(speedControlFrame)
36          speedControl1.start()
37          speedControlFrame.mainloop()
38
39      def bubbleSortThread(self):
40          # Prepare the Bubble sort GUI
41          bubbleSortFrame = tk.Tk()
42          bubbleSort1 = Chapter5BubbleSort.BubbleSort(bubbleSortFrame)
43          bubbleSort1.start()
44          bubbleSortFrame.mainloop()
45
46  # Prepare the application GUI
47  winFrame = tk.Tk()
48
49  application = Application(winFrame)
50
51  winFrame.mainloop()

Output 5.4.3:

5.5 WRAP UP

Chapters 4 and 5 provided a step-by-step, systematic walkthrough of Graphical User Interface (GUI) programming with Python, and an introduction to GUI objects like menus, tabs, and
threads. Key Python widgets were introduced alongside their most common uses and options. This was done through a series of straightforward examples and applications that progressed gradually from simpler to more challenging implementations. Although detailed coverage of all the available widgets is beyond the scope of this chapter, Table 5.1 provides widget lists with descriptions, and the modules/libraries they belong to, as a quick reference. This information can also be used as a reference for constructors when creating objects from the respective classes. Additional details on the listed widgets (including tkinter) can be found in the official Python documentation.

TABLE 5.1 Frequently Used Widgets and the Module They Belong to

| Widget Name | Brief Description | Module/Constructor |
|---|---|---|
| Windows frame | The main object of a windows-based application, acting as a container for all other widgets in order to create the Graphical User Interface. | tkinter, tk.Tk() |
| Label | Displays a short message to the user. Its content is not expected to change significantly in the program lifecycle and it is not meant to be used for interaction. Nevertheless, it is possible to write code that will enhance its functionality. | tkinter, tk.Label() |
| Button | Used to handle basic interaction between the user and the application. This is usually implemented through movement or click-based events. | tkinter, tk.Button() |
| Entry | A basic widget used to accept a single line of text from the keyboard. As with most other widgets, it can be modified in terms of functionality and appearance. | ttk, ttk.Entry() |
| Scale | A controlled mechanism for accepting numerical user input. Two different implementations of the widget are available, with the one found in tkinter offering more options than that in ttk. | tkinter/ttk, tk.Scale()/ttk.Scale() |
| Spinbox | A controlled mechanism for accepting numerical user input from the ttk library. | ttk, ttk.Spinbox() |
| Frame | Used for improved control of the GUI. It can contain various other widgets. | tkinter, tk.Frame() |
| Labelframe | Similar to the frame widget, but with the inclusion of a label. | tkinter, tk.LabelFrame() |
| Listbox | Used to display separate lines of text, allowing the user to make a selection. The contents of multiple listboxes can be synchronized. | tkinter, tk.Listbox() |
| Combobox | Similar to the list box, but instead of being permanently expanded it is in a collapsed state and only opens when clicked upon. The selected line of text is displayed on the top level (i.e., the displayed text box when the list is collapsed). | ttk, ttk.Combobox() |
| Scrollbar | Used to improve the appearance and use of associated multiline widgets (e.g., list boxes) when they are populated with a large number of entries. | tkinter, tk.Scrollbar() |
| Checkbutton | Used to offer selection options. It allows for the selection of multiple options at any given time. | tkinter, tk.Checkbutton() |
| Radiobutton | Used to offer selection options. Options are mutually exclusive. | tkinter, tk.Radiobutton() |
| Progressbar | Used to inform the user about the state of a particular running method. It can be determinate, in which case the widget presents the actual state of the method, or indeterminate, where the widget provides a scrolling message indicating that the method is still in progress. | ttk, ttk.Progressbar() |
| Text | Similar to the entry widget, but allowing multiple lines of text. | tkinter, tk.Text() |
| Canvas | A widget that provides a space to place graphics, text, or other objects. | tkinter, tk.Canvas() |
| Notebook | Provides the supporting object for tabbed frames. | ttk, ttk.Notebook() |

In addition to the aforementioned widgets, a number of other objects are frequently used to improve the GUI experience. Although many of these are not standalone objects, their use in conjunction with other objects is rather common.
Table 5.2 lists some of these objects:

TABLE 5.2 Notable Objects and Their Modules

| Object | Brief Description | Module |
|---|---|---|
| Image | Used to load and display an image. It supports different file types (e.g., gif, jpg, png). Various different methods are available, depending on the file type. | PIL |
| tk.StringVar(), tk.IntVar(), tk.DoubleVar(), etc. | Used to host text or numbers. | tkinter |
| askyesno(), askokcancel(), askretrycancel(), askquestion() | Used to display different types of pre-defined message boxes. | messagebox |
| showinfo(), showerror(), showwarning() | Used to display a simple message box with an info, error, or warning icon. | messagebox |
| askopenfile(), asksaveasfilename(), askdirectory(), askcolor() | Used to display the common windows-based dialogs, ranging from file dialogs to color chooser modules. | filedialog, colorchooser |
| menu, popup menu | Used to display regular windows-based or popup menus. | tkinter, tk.Menu() |
| Thread | Used to create threaded objects. | threading |

The above objects make use of a number of methods that contribute to the creation of the overall user experience. Table 5.3 lists some of the most important of the methods used in the various scripts and applications developed in this chapter:

TABLE 5.3 Frequently Used Methods and Their Respective Widgets (in Alphabetical Order with Constructors First)

| Method | Brief Description |
|---|---|
| .add_command(), .add_checkbutton(), .add_radiobutton(), .add_cascade(), .add_separator() | Adds the various components of a menu object. |
| .after() | Invokes a method after a set amount of time has elapsed. |
| .append() | Appends a new element to the end of a list. |
| .askyesno(), .askokcancel(), .askretrycancel(), .askquestion() | Offers a set of different types of pre-defined message boxes. |
| .askopenfile(), .asksaveasfilename(), .askdirectory(), .askcolor() | Offers a set of different types of pre-defined dialogs. |
| .bind() | Binds the widget with a user interaction event. |
| .clear() | Clears the values from a list. |
| .config() | Allows the configuration of the widget in terms of its characteristics (e.g., color, font properties). |
| .current() | Identifies the current selection from a combo box. |
| .curselection() | Identifies the current selection from a list box. |
| .delete() | Deletes values from a list box. |
| .destroy() | Destroys the current frame/interface. |
| .exit() | Exits the current frame/interface (or the entire application). |
| .geometry() | Accepts the initial dimensions of the frame in the form of a string (i.e., 'width x height'). |
| .grid() | Places the widget on the grid of the parent widget and at a specific column and row. It can span across multiple columns/rows. |
| .grid_remove() | Temporarily hides the widget from the grid of the parent without deleting or destroying it. |
| int(), float(), str() | Converts the specified values to integer, float, or string values respectively. |
| .insert() | Inserts values to a list box. |
| .mainloop() | Puts the frame in an idle state, and monitors possible interactions. The latter can take the form of defined events between the user and the GUI. |
| .maxsize(), .minsize() | Defines the maximum/minimum size of the associated frame. |
| .open() | Reads an image/picture based on its full path, assigned as an argument. |
| .pack() | Attaches the widget to the parent, allowing coordinates to be calculated either on a relative or absolute basis. |
| .PhotoImage() | Creates a memory pointer to a processed image object, by means of the open() method. |
| .place() | Places the widget at specific coordinates on the parent frame, either on a relative or absolute basis. |
| .process_time() | Counts the time needed for a particular process to execute. |
| .randint() | Generates random numbers in the specified range. |
| .resizable() | Specifies whether the object is resizable based on a Boolean value (True/False) that is provided as a parameter. |
| .resize() | Specifies the size of the image/picture. It is usually accompanied by the ANTIALIAS expression to ensure the quality of the image is maintained when downsizing. |
| round(), sum(), len() | Basic mathematical methods. |
| .selection_set() | Selects a particular indexed element in a list box. |
| .set(), .get() | Sets or gets the value of an object. |
| .showinfo(), .showerror(), .showwarning() | Offer different types of pre-defined message boxes. |
| .start(), .stop() | Starts or stops a threaded object. |
| .title() | Provides a title to the windows frame. |
| .update_idletasks() | Forces the processing of pending idle tasks (e.g., redrawing widgets and recalculating their geometry), so that the interface is refreshed during long-running operations. |

For most methods listed in Table 5.3, there exist a number of options/parameters that may also be used for the improvement of the GUI. These are applicable to a variety of widgets/objects. Table 5.4 provides a list of some of the most important ones. The list is not exhaustive, but it is based on cases described in detail in the various examples in this chapter.

TABLE 5.4 Frequently Used Properties and Their Descriptions

| Properties/Expressions | Brief Description |
|---|---|
| activebackground, activeforeground | The background or foreground color when the cursor hovers over the widget. |
| anchor | Ensures that the particular element it applies to (i.e., text or image) is placed on a position within the parent widget that will remain unchanged. |
| borderwidth, bd | The width of the border around the widget (e.g., borderwidth = 12) as an integer. |
| command | The method called when the widget is clicked. |
| compound | Combines two objects in the same position (e.g., an image and a text) in a parent label widget. It can take different values (e.g., left, center, right) that specify the order of the two objects. |
| expand | Specifies whether the underlying widget is expandable (value is "Y" or non-zero) or not (value is "N" or zero) when the parent widget is resized. |
| fg (or foreground), bg (or background) | The color of the foreground/background (fg/bg) or the text a particular widget will display (see Table 4.6). |
| fill | Specifies whether the widget it applies to will expand horizontally (fill = tk.X), vertically (fill = tk.Y) or both (fill = tk.BOTH). |
| font | Sets/gets the font name and the size of the text to be displayed by the widget (e.g., font = 'Arial 24'). |
| from_ =, to = | Sets the numerical boundaries of the widget. |
| height, width | The height or width of the widget in characters (for text widgets) or pixels (for image widgets). |
| highlightcolor | The color of the text of the widget when the widget is in focus. |
| image | Defines an image to be displayed on the widget instead of text. |
| justify | Determines how multiple lines of text will be justified in respect to each other. Values are LEFT, CENTER, or RIGHT. |
| lambda expression | Sets the parameters to be passed on to a method or function when an event is triggered. |
| onvalue, offvalue | The values assigned to a check button depending on whether it is selected or not. |
| orient | Specifies the orientation of the widget (horizontal or vertical). |
| padx, pady | Additional padding left/right (padx) or above/below (pady) in relation to the widget. |
| relief | Causes the widget to be displayed with a particular visual effect in terms of its border appearance (see Table 4.6 for available values). |
| resolution | The incremental or decremental step of the scale widget. |
| relx, rely | The position of the widget relative to the parent object. |
| show | Replaces the text of the current widget with the specified character(s). |
| side | Specifies the position of the content of the widget (Left, Center, or Right). |
| state | The state of responsiveness and/or accessibility of the widget. Values can be NORMAL, ACTIVE, DISABLED. |
| text | The textual content to be displayed. |
| textvariable | The textual content of the text-based widget. |
| troughcolor | The color of the trough of the scale widget. |
| value | The value assigned to a radio button, depending on the selection/state. |
| ["values"] | Associates/populates a combo box with a particular list of values. |
| underline | If -1, no character of the button's text will be underlined. If a non-negative index is provided, the character at that index will be underlined. |
| wraplength | If non-zero, the text lines of the widget will be wrapped to fit the length of the parent widget. |
| yscrollcommand, xscrollcommand | Used to activate the scrollbar. |
| yview, xview | Specifies the orientation of a scrollbar (yview for vertical or xview for horizontal). |

It should be evident from the examples provided in this chapter that one of the most important concepts in GUI programming is the user's interaction with the widgets, as this is how events are used to trigger specific tasks. Such interactions usually take the form of mouse clicks or keyboard events. Table 5.5 lists some of the most important methods of interaction as a quick reference.

TABLE 5.5 Frequently Used Events and Their Descriptions

| Event | Brief Description |
|---|---|
| <Button-1>, <Button-2>, <Button-3> | Triggered when the left, middle, or right button of the mouse is clicked upon the widget. |
| <Double-Button-1>, <Double-Button-2>, <Double-Button-3> | Triggered when the left, middle, or right mouse button is double clicked upon the widget. |
| <Enter> | Triggered when the mouse pointer enters the widget. |
| <Key> | Triggered when any key on the keyboard is pressed. Use the event.keycode option to check the key that was pressed. Note that the values of the keyboard keys vary between operating systems. |
| <Leave> | Triggered when the mouse pointer leaves the widget. |

Finally, some common values of the options mentioned previously are provided in Table 5.6 below.
TABLE 5.6 Possible Values for the Various Different Options

| Option | Values Available |
|---|---|
| Color related | It is possible to set the color of the widget, text, or object, either in the form of a hexadecimal string (e.g., "#000111"), or by using color names (e.g., "white", "black", "red", "green", "blue", "cyan", "yellow", and "magenta"). |
| Font related | The font of a text can be set just after the text is specified, using the following sub-options. Family: the font family name as a string; Size: the font height in points (n) or pixels (-n); Weight: "bold" for bold, or "normal" for regular text; Slant: "italic" for italic, or "roman" for unslanted; Underline: 1 for underlined or 0 for normal text; Overstrike: 1 for overstruck or 0 for normal text. |
| Anchor related | The possible values for the anchor justification are: NW, N, NE, W, CENTER, E, SW, S, SE. |
| Relief styles | After specifying the text of a widget, the possible values for the relief option are: raised, sunken, flat, groove, ridge. |
| Bitmap styles | Possible bitmap styles include the following: error, gray75, gray50, gray25, gray12, hourglass, info, questhead, question, warning. These can be used in combination with, or instead of, text. |
| Cursor styles | Possible cursor styles include the following: arrow, circle, clock, cross, dotbox, exchange, fleur, heart, man, mouse, pirate, plus, shuttle, sizing, spider, spraycan, star, target, tcross, trek, watch. These can be used after the text is specified. |
| Pack options | There are 4 options in terms of placing a particular widget in respect to the parent widget through the pack() method. Use the side option with values: TOP (default), BOTTOM, LEFT, or RIGHT. There are 4 options to determine whether and how a particular widget should expand when the parent widget expands. Use the fill option with values: NONE (default), X (fill only horizontally), Y (fill only vertically), or BOTH (fill both horizontally and vertically). |
| Grid options | When placing widgets on the interface using the grid() method, the following options are available. column, row: the column and row the widget will be placed in (the leftmost column (0) and the first empty row are the defaults); columnspan, rowspan: the number of columns or rows a widget will span across (1 is the default value); ipadx, ipady: the number of pixels to pad the widget (horizontally and vertically) within its borders; padx, pady: the number of pixels to pad the widget (horizontally and vertically) outside its borders; sticky: determines how the widget will be aligned if its size is smaller than its cell in the grid (the default is centered; other possible values are N, E, S, W, NE, NW, SE, and SW). |

5.6 CASE STUDY

Complete the integration of the Basic Widgets Python script from Chapter 4 with a full menu system in an object-oriented application, using all three types of menus (i.e., regular, toolbar, popup), as described in this chapter. The menu system should include the following options: Color dialog, Open File dialog, Separator, Basic Widgets, Save As, Open Directory, Separator, About, and Exit.
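As a possible starting point for the case study (a sketch only, not a reference solution), the list of options can be kept as plain data and turned into a menu by a small helper. The names MENU_LAYOUT and build_menu, and the handlers dictionary, are hypothetical and introduced here purely for illustration; only tk.Menu, add_command(), and add_separator() come from tkinter.

```python
# Menu layout for the case study, in the order requested.
# None marks a separator (an assumption of this sketch).
MENU_LAYOUT = [
    "Color dialog", "Open File dialog", None,
    "Basic Widgets", "Save As", "Open Directory", None,
    "About", "Exit",
]

def build_menu(parent, handlers):
    """Create a tk.Menu from MENU_LAYOUT.

    parent   -- the window or menu that will own the new menu
    handlers -- dict mapping a label to the function it should call;
                labels without a handler get a do-nothing placeholder
    """
    # tkinter is imported here so that the layout itself can be
    # inspected without opening a display
    import tkinter as tk

    menu = tk.Menu(parent, tearoff=0)
    for label in MENU_LAYOUT:
        if label is None:
            menu.add_separator()
        else:
            menu.add_command(label=label,
                             command=handlers.get(label, lambda: None))
    return menu
```

The same menu object could then be attached to a menu bar with add_cascade() for the regular menu, or shown with tk_popup() for the popup menu, as in the Application class earlier in this chapter.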
6 Data Structures and Algorithms with Python

Thaeer Kobbaey
Higher Colleges of Technology

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Ghazala Bilquise
Higher Colleges of Technology

CONTENTS
6.1 Introduction
6.2 Lists, Tuples, Sets, Dictionaries
    6.2.1 List
    6.2.2 Tuple
    6.2.3 Sets
    6.2.4 Dictionary
6.3 Basic Sorting
    6.3.1 Bubble Sort
    6.3.2 Insertion Sort
    6.3.3 Selection Sort
    6.3.4 Shell Sort
    6.3.5 Shaker Sort
6.4 Recursion, Binary Search, and Efficient Sorting with Lists
    6.4.1 Recursion
    6.4.2 Binary Search
    6.4.3 Quicksort
    6.4.4 Merge Sort
6.5 Complex Data Structures
    6.5.1 Stack
    6.5.2 Infix, Postfix, Prefix
    6.5.3 Queue
    6.5.4 Circular Queue
6.6 Dynamic Data Structures
    6.6.1 Linked Lists
    6.6.2 Binary Trees
    6.6.3 Binary Search Tree
    6.6.4 Graphs
    6.6.5 Implementing Graphs and the Eulerian Path in Python
6.7 Wrap Up
6.8 Case Studies
6.9 Exercises
References

DOI: 10.1201/9781003139010-6

6.1 INTRODUCTION

Data is defined as a collection of facts. In raw form, data is difficult to process and, thus, in need of further structuring in order to be useful. In computer science, a data structure refers to the organization, storage, and management of data in a way that allows its efficient processing and retrieval. In simple terms, a data structure represents the associated data on a computer in a specific format, while preserving any underlying logical relationships, and it provides storage and efficient access to the data based on a set of performance-enhancing rules.

Observation 6.1 – Data Structures: A way of representing, organizing, storing, and accessing data based on a set of well-defined rules.

As an example, one can consider the real-life scenario of searching for a particular name in a phone book. The search is made easy by organizing the names in the phone book and sorting them in alphabetical order. In this rather primitive example, one is not required to go through the phone book page by page to find the desired name.
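The phone book idea can be sketched in a few lines of Python. The example below is illustrative only: the names in phone_book and the helper functions linear_lookup and sorted_lookup are made up for this sketch, while bisect_left comes from the standard bisect module. It contrasts a page-by-page scan with a lookup that exploits alphabetical order.

```python
from bisect import bisect_left

def linear_lookup(names, target):
    """Scan the entries one by one.

    Returns (index, comparisons made); the index is -1 if absent.
    """
    for steps, name in enumerate(names, start=1):
        if name == target:
            return steps - 1, steps
    return -1, len(names)

def sorted_lookup(sorted_names, target):
    """Binary search on an alphabetically ordered list.

    Returns the index of target, or -1 if it is not present.
    """
    pos = bisect_left(sorted_names, target)
    if pos < len(sorted_names) and sorted_names[pos] == target:
        return pos
    return -1

# A tiny, made-up phone book, kept in alphabetical order
phone_book = sorted(["Maria", "Ahmed", "Zoe", "Chen", "Lena", "Omar"])

print(linear_lookup(phone_book, "Omar"))   # index and comparisons needed
print(sorted_lookup(phone_book, "Omar"))   # index found via binary search
print(sorted_lookup(phone_book, "Nina"))   # -1: the name is not listed
```

With only six entries the difference is negligible, but the scan grows in proportion to the size of the book, whereas the ordered lookup roughly halves the remaining entries at every step. Binary search is covered in detail later in this chapter.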
Other relevant examples include the history of web pages visited through the web browser (implemented as a linked-list structure), the undo/redo mechanism available in many applications (implemented as a stack structure), the queue structures used by operating systems for scheduling the various CPU tasks, and the tree structure used in many artificial intelligence-based games to track the player's actions.

In a broader context, there are two different types of data structures:

• Basic data structures that are usually available in every modern programming language. In Python, these include structures like the list, the dictionary, the tuple, and the set. Lists and tuples allow the programmer to work with data that is ordered sequentially. Sets are unordered collections of values with no duplicates.
• Complex data structures, like stacks, queues, and various types of trees, that are built on basic data structures. In terms of the way these structures organize data, stacks and queues are classified as linear (i.e., the data elements are ordered), whereas trees and graphs are classified as non-linear (i.e., the elements do not follow a particular order).

This chapter covers the following topics:

• Basic data structures (i.e., lists, tuples, sets, and dictionaries) and their operations.
• Basic sorting algorithms: bubble sort, insertion sort, selection sort, shell sort, shaker sort.
• The concept of recursion and its application to binary search, and the merge sort and quicksort algorithms.
• Complex data structures (i.e., stacks and queues).
• Dynamic data structures like singly and doubly linked lists, binary trees/binary search trees, and graphs.

The focus is both on the computational thinking behind these topics, and on a detailed look into the programming concepts used for their implementation. Nevertheless, it must be stated that this chapter aims to provide a thorough introduction of the underlying ideas rather than to cover the aforementioned data structures exhaustively.
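As a brief aside, the earlier remark that complex structures are built on basic ones can be illustrated with a minimal sketch. It assumes a plain list as the backing store for a stack and collections.deque for a queue; the variable names and the sample actions and tasks are made up for illustration, and fuller implementations are developed later in the chapter.

```python
from collections import deque

# Stack (last in, first out) built on a list: an undo history
undo_stack = []
for action in ("type A", "type B", "delete B"):
    undo_stack.append(action)        # push onto the top of the stack
last_action = undo_stack.pop()       # pop the most recent action

# Queue (first in, first out) built on a deque: tasks awaiting the CPU
task_queue = deque(["task 1", "task 2", "task 3"])
next_task = task_queue.popleft()     # serve the oldest task first

print(last_action, undo_stack)
print(next_task, list(task_queue))
```

A deque is used for the queue because removing the first element of a plain list requires shifting every remaining element, whereas popleft() on a deque does not.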
Fundamental and critically important data structures and their associated algorithms, such as the heap tree and heap sort, or hashing structures and hash tables, are not covered here. The reader can find more details on related subjects in the seminal works of Dijkstra et al. (1976), Knuth (1997), and Stroustrup (2013), to whom the modern computer science and information systems and technology community owes much of its existence.

6.2 LISTS, TUPLES, SETS, DICTIONARIES

This section explores the four built-in data structures provided by Python, namely lists, tuples, sets and dictionaries. These structures are also briefly discussed in Chapter 2, where they are referred to as non-primitive data types. Their main use is to store a collection of values and provide tools for its manipulation.

6.2.1 List

A list is a data structure that stores a collection of items in specified, and frequently successive, memory locations. Each item in the list has a location number called an index. The index starts from zero and follows a sequential order. The order refers to the index values, not to the stored values being arranged in a particular way (e.g., alphabetically). To access an item at a particular location, the programmer can simply use the index number corresponding to this location. The concept of the list is analogous to a to-do list that contains things that must be accomplished. In terms of functionality, Python provides various operations, such as adding items to, and removing items from, a list. Since items in a list can be modified, it is considered to be mutable.

Observation 6.2 – List: A list is a data structure that stores a collection of items in specified, usually successive, memory locations. It is indexed by a sequential index that always starts at zero. The items do not have to be in a particular order. A list is a mutable object, meaning that each item can be modified.

At a practical level, lists in Python are denoted by square brackets (i.e., []). The list can be populated by adding items within the brackets, separated by commas. The following script creates a list, and then prints both the list items and the number of items in the list. It also asks the user to specify the index of an item to print (starting from zero), a range of items to print from a user-specified index to the end of the list, and a range of items to print from the start of the list to a user-specified index:

    # Create the list
    cars = ["BMW", "Toyota", "Honda", "Mercedes"]

    # Print the list items
    print("The list of the cars is the following: ", cars)

    # Use the len() function to print the number of items in the list
    print("The number of items in the list is: ", len(cars))

    # Ask the user for the index number of an item for printing
    singleIndex = int(input("Enter the index \
of the item to print (indexes start from 0): "))
    print("Your selection for display is: ", cars[singleIndex])

    # Ask the user for the starting index of the print range
    startingIndex = int(input("Enter the starting index of the range \
of items to print (index starts from 0): "))
    print("Your selected range of items to display is: ",
          cars[startingIndex:len(cars)])

    # Ask the user for the ending index of the print range
    endingIndex = int(input("Enter the ending index of the range of items \
to print (index starts from 0): "))
    print("Your selected range of items to display is: ",
          cars[0:endingIndex])
    # Use a negative index to start printing the list from the end
    print("The last item in the list is: ", cars[-1])

Output 6.2.1.a:

    The list of the cars is the following: ['BMW', 'Toyota', 'Honda', 'Mercedes']
    The number of items in the list is: 4
    Enter the index of the item to print (indexes start from 0): 0
    Your selection for display is: BMW
    Enter the starting index of the range of items to print (index starts from 0): 1
    Your selected range of items to display is: ['Toyota', 'Honda', 'Mercedes']
    Enter the ending index of the range of items to print (index starts from 0): 2
    Your selected range of items to display is: ['BMW', 'Toyota']
    The last item in the list is: Mercedes

In this script, the reader will notice that the syntax for calling a range of items is list[start:end], with start denoting the position of the starting index (inclusive) and end the ending index (not inclusive). It must be stressed that the start and end parameters are optional. For instance, expression cars[0:endingIndex] could be replaced by cars[:endingIndex] and, similarly, expression cars[startingIndex:len(cars)] could be replaced by cars[startingIndex:]. The reader should also note that if the user tries to access a list item using an index that does not exist, an IndexError exception will be raised, as illustrated in the example below:

Output 6.2.1.b:

    The list of the cars is the following: ['BMW', 'Toyota', 'Honda', 'Mercedes']
    The number of items in the list is: 4
    Enter the index of the item to print (indexes start from 0): 4

    IndexError                                Traceback (most recent call last)
    <ipython-input-5-695ec133b0e9> in <module>
         11 singleIndex = int(input("Enter the index \
         12 of the item to print (indexes start from 0): "))
    ---> 13 print("Your selection for display is: ", cars[singleIndex])
         14
         15 # Ask the user for the starting index of a range of items in the list to print

    IndexError: list index out of range

In addition to the basic functions discussed above, Python also provides a number of additional functions that can be used to manipulate a list (Table 6.1):

TABLE 6.1 Most Important Functions for List Manipulation Functions append(item) clear() copy() count() extend(list2) index(item) insert(pos, item) pop() remove(item) reverse() sort() Description Adds an element at the end of the list Removes all the elements from the list Returns 
a copy of the list Returns the number of elements with the specified value Adds the elements of a second list (e.g., list2) to the end of the current list Returns the index of the first item with the specified value Adds an element at the specified position Removes and returns the last element of the list Removes the item with the specified value Reverses the order of the list Sorts the list in ascending order The script below is a modified version of the previously created one, demonstrating the use of append(), insert(), extend(), remove(), and pop() (Table 6.1). The script performs the tasks of adding items at the end of a list (line 9), inserting an item in a particular position specified by an index value (line 11), extending the list by adding items from a second list (lines 16–17), removing a particular item from the list (line 22), and removing the last item of the list (line 26): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 # Create the list cars = ["BMW", "Toyota", "Honda", "Mercedes"] # Print the list size and its items print("The list of the cars has ", len(cars), " items which are the following: ", cars) # Append/add an item to the end of the list cars.append("Nissan") # Insert an item to position 1 of the list cars.insert(1,"Suzuki") # Print the updated list print("The updated list after the append and insert is: ", cars) # Extend the list by adding the items of a second list cars2 = ["Renault", "Audi"] cars.extend(cars2) print("The updated list after extending it with items from " "a second list is: ", cars) # Remove a specific item from the list cars.remove("Toyota") print(cars) # Remove the last item from the list cars.pop() print(cars) 212 Handbook of Computer Programming with Python Output 6.2.1.c: The list of the cars has 4 items which are the following: ['BMW', 'Toyota', 'Honda', 'Mercedes'] The updated list after the append and insert is: ['BMW', 'Suzuki', 'Toyota', 'Honda', 'Mercedes', 'Nissan'] The updated list 
after extending it with items from a second list is: ['BMW', 'Suzuki', 'Toyota', 'Honda', 'Mercedes', 'Nissan', 'Renault', 'Audi'] ['BMW', 'Suzuki', 'Honda', 'Mercedes', 'Nissan', 'Renault', 'Audi'] ['BMW', 'Suzuki', 'Honda', 'Mercedes', 'Nissan', 'Renault'] The following variation of the same script showcases the use of reverse(), sort(), sort(reverse = True), and index() in order to reverse the items of the list (line 9), sort them in ascending order (line 13), sort them in descending/reverse order (line 17), and find and return the index of a particular item (line 21). Notice that none of the results of these functions have a permanent effect on the original list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # Create the list cars = ["BMW", "Toyota", "Honda", "Mercedes", "Toyota"] # Print the list size and its items print("The list of the cars has ", len(cars), " items which are the following: ", cars) # Print the items of the list in reverse order cars.reverse() print(cars) # Sort the items of the list and print them cars.sort() print(cars) # Sort the items of the list in reverse order and print them cars.sort(reverse = True) print(cars) # Find and return the index of a specific item in the list print(cars.index("BMW")) Output 6.2.1.d: The list of the cars has 4 items which are the following: ['BMW', 'Toyota', 'Honda', 'Mercedes'] ['Mercedes', 'Honda', 'Toyota', 'BMW'] ['BMW', 'Honda', 'Mercedes', 'Toyota'] ['Toyota', 'Mercedes', 'Honda', 'BMW'] 3 Data Structures and Algorithms 213 Finally, with the use of in <list>, copy(), count(), and clear(), the programmer can examine in run-time whether a particular item belongs in a list (lines 8–11 and 13–16), copy the contents of a list (line 23), count the occurrences of an item in the list (line 19), and clear the list (line 27): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 # Create the list cars = ["BMW", "Toyota", "Honda", "Mercedes", "Toyota"] # Print the list items 
print("The list of the cars is the following: ", cars) # Print True or False depending on whether an item is included in the list if ("Toyota" in cars): print("Toyota is in the list") else: print("Toyota is not in the list") if ("Nissan" in cars): print("Nissan is in the list") else: print("Nissan is not in the list") # The number of occurrences of an item in the list occurrences = cars.count("Toyota") print("Occurrences of the particular item in the list is: ", occurrences) # Copy the contents of a list into another newCars = cars.copy() print("The contents of the new list are: ", newCars) # Clear the list newCars.clear() print("The newCars list of items is now empty: ", newCars) Output 6.2.1.e: The list of the cars is the following: ['BMW', 'Toyota', 'Honda', 'Mercedes', 'Toyota'] Toyota is in the list Nissan is not in the list Occurences of the particular item in the list is: 2 The contents of the new list are: ['BMW', 'Toyota', 'Honda', 'Mercedes', 'Toyota'] The newCars list of items is now empty: [] 214 Handbook of Computer Programming with Python 6.2.2 Tuple Tuples are a special type of list, with items being orgaObservation 6.3 – Tuple: A special nized in a particular order and accessed by referencing type of list that is immutable (i.e., its index values. The difference between a normal list and a items cannot be modified). Tuples are tuple is that the latter is immutable, meaning that its created using parentheses instead of items cannot be modified. As such, tuples do not offer square brackets. some of the extended functionality of a list described in the previous section. In terms of syntax, tuples are created using parentheses instead of square brackets. 
The following script demonstrates the basics of tuple creation and usage:

    # Create a tuple
    cars = ("BMW", "Toyota", "Honda", "Mercedes")

    # Display all items in the tuple
    print("The items in the tuple are: ", cars)
    # Display the first item in the tuple
    print("The first item in the tuple is: ", cars[0])

    # Raises a TypeError exception since the item in the tuple cannot be modified
    cars[0] = "Tesla"

Output 6.2.2:

    The items in the tuple are: ('BMW', 'Toyota', 'Honda', 'Mercedes')
    The first item in the tuple is: BMW

    TypeError                       Traceback (most recent call last)
    <ipython-input-1-3c3eee3a45c8> in <module>
          8
          9 # Raises a TypeError exception since the item in the tuple cannot be modified
    ---> 10 cars[0] = "Tesla"

    TypeError: 'tuple' object does not support item assignment

6.2.3 Sets

A set is a collection of unordered and unique items. It is created using curly braces (i.e., {}) (Hoare, 1961). Note that an empty pair of braces creates an empty dictionary; an empty set is created with set(). When the print() function is used to display the contents of a set, the duplicates are removed from the output and its contents are not presented in a particular order. In fact, the order of the elements may differ from one execution of the code to the next.

Observation 6.4 – Set: A collection of unordered, unique items. Use the in operator to examine if an item belongs to a set. Use the intersection() function to find the common items between two sets. Use the difference() function to retrieve items from the first set that are not found in the second. The union() function combines the items of two sets, removing any duplicates.

There are four particular operators/functions used on sets:

1. The in Operator: Examines whether an item is included in the set.
2. The intersection() Function: Identifies the common items between two sets.
3. The difference() Function: Retrieves items from a set that do not exist in another set.
4. The union() Function: Combines the items of two sets and returns a new one after removing any duplicates.
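As a brief aside, the four operations listed above can be sketched in a compact form. The sketch below reuses the car names from this section; the variable names (has_honda, common, only_in_cars, combined) are illustrative assumptions, and the operator forms (&, -, |) are the built-in equivalents of intersection(), difference(), and union():

```python
# Sets reusing the car names from this section (illustrative only)
cars = {"BMW", "Toyota", "Honda", "Mercedes", "Toyota"}
german_cars = {"BMW", "Mercedes", "Audi", "Porsche"}

# Membership test with the in operator
has_honda = "Honda" in cars

# Each method has an operator equivalent that returns a new set
common = cars & german_cars         # same as cars.intersection(german_cars)
only_in_cars = cars - german_cars   # same as cars.difference(german_cars)
combined = cars | german_cars       # same as cars.union(german_cars)

# sorted() returns a list, which gives a repeatable order for printing
print(has_honda)
print(sorted(common))
print(sorted(only_in_cars))
print(sorted(combined))
```

Wrapping a set in sorted() is a simple way to obtain repeatable output, since the display order of a set is not guaranteed.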
The following script demonstrates the basic use of sets and their main operations:

    # Create the set
    cars = {"BMW", "Toyota", "Honda", "Mercedes", "Toyota"}

    # Print the set
    print("The cars set includes the following items: ", cars)

    # Check whether a particular item exists in the set
    if ("Honda" in cars):
        print("Honda is in the cars set")
    else:
        print("Honda is not in the cars set")

    # Create and print an additional set
    german_cars = {"BMW", "Mercedes", "Audi", "Porsche"}
    print("The german cars set includes the following items: ", german_cars)

    # Find and print the intersection (i.e., common items of the two sets)
    print("The intersection, i.e., the common items of the two sets, is: ",
        cars.intersection(german_cars))

    # Find and print the difference of the two sets
    print("The different items between the two sets are: ",
        cars.difference(german_cars))

    # Find and print the union of the two sets
    print("The union of the two sets is: ", cars.union(german_cars))

Output 6.2.3:

    The cars set includes the following items: {'Honda', 'Mercedes', 'BMW', 'Toyota'}
    Honda is in the cars set
    The german cars set includes the following items: {'Mercedes', 'Porsche', 'BMW', 'Audi'}
    The intersection, i.e., the common items of the two sets, is: {'Mercedes', 'BMW'}
    The different items between the two sets are: {'Honda', 'Toyota'}
    The union of the two sets is: {'Audi', 'Porsche', 'Honda', 'Mercedes', 'BMW', 'Toyota'}

6.2.4 Dictionary

A dictionary is a collection of items that stores values in key-value pairs. The key is a unique identifier and the value is the data associated with it. The dictionary is analogous to a phone book that stores the contact name and telephone of a person.
The contact name would be the key that is used to look up the telephone number (i.e., the value). In a dictionary, keys must be unique and of an immutable data type, such as strings or integers, while values can be of any type (e.g., strings, integers, lists). The Python syntax for creating a dictionary is the following:

    dictionary = {key1: value1, key2: value2}

Observation 6.5 – Dictionary: A collection of items stored in a key-value pair format. The keys must use immutable data types. The values can be of any type and are mutable. The syntax is the following: dictionary = {key1: value1, key2: value2}

Table 6.2 lists the available dictionary functions.

TABLE 6.2 Functions of a Dictionary

    Function            Description
    clear()             Removes all the elements from the dictionary
    copy()              Returns a copy of the dictionary
    get(key)            Gets an item by the key
    key in dictionary   Returns a Boolean value based on whether the key is in the dictionary or not (the has_key(key) function of Python 2 is not available in Python 3)
    items()             Returns a view of the (key, value) tuples
    keys()              Returns a view of the keys
    values()            Returns a view of the values
    pop(key)            Removes an item given the key and returns the value
    popitem()           Removes the most recently added item, and returns the key/value
    update()            Adds or overwrites items from another dictionary

The following script presents an example involving a dictionary named employee that holds an employee's name, salary, and job title:

     1  # Create the dictionary
     2  employee = {"name": "Maria", "salary": 15000, "job": "Sales Manager"}
     3
     4  # Print the dictionary
     5  print("The employee dictionary is: ", employee)
     6  # Access a specific key and print the paired value
     7  print("The pair value for the <name> key is: ", employee["name"])
     8
     9  # Use the get() method to print a pair based on a given key
    10  print("The value pair of the <name> key is: ", employee.get("name"))
    11  # If the key value does not exist the get() method will return
    12  # None (empty)
    13  print("The value pair of the <department> key is: ",
    14      employee.get("department"))
    15
    16  # Add a new pair to the dictionary
    17  employee["department"] = "Sales"
    18  print("The value pair of the new <department> key is: ",
    19      employee.get("department"))
    20
    21  # Modify the value of a given key
    22  employee["salary"] = "20000"
    23  print("The new employee dictionary includes the following pairs: ",
    24      employee)
    25
    26  # Use the update() method to modify the dictionary
    27  employee.update({"name":"Alex","department":"Sales"})
    28  print(employee)
    29
    30  # Pop/remove a pair based on a given key, assign its value to a new
    31  # variable and print it
    32  emp_job = employee.pop("job")
    33  print("The original employee dictionary is: ", employee)
    34  print("The value stored in emp_job is: ", emp_job)

Output 6.2.4:

    The employee dictionary is: {'name': 'Maria', 'salary': 15000, 'job': 'Sales Manager'}
    The pair value for the <name> key is: Maria
    The value pair of the <name> key is: Maria
    The value pair of the <department> key is: None
    The value pair of the new <department> key is: Sales
    The new employee dictionary includes the following pairs: {'name': 'Maria', 'salary': '20000', 'job': 'Sales Manager', 'department': 'Sales'}
    {'name': 'Alex', 'salary': '20000', 'job': 'Sales Manager', 'department': 'Sales'}
    The original employee dictionary is: {'name': 'Alex', 'salary': '20000', 'department': 'Sales'}
    The value stored in emp_job is: Sales Manager

The reader should note that it is possible to access the value of a dictionary key either directly (line 7) or through the get() function (line 10). If the value of a key that does not exist in the dictionary is requested, get() returns None (lines 13 and 14), whereas direct access raises a KeyError exception. It is also worth noting that it is possible to add new pairs, or overwrite existing ones, through the update() function (line 27). Finally, line 32 demonstrates how to remove a particular pair from a dictionary through the pop() function; pop() returns the value of the removed pair (here a string, not a new dictionary), which is stored in emp_job.

6.3 BASIC SORTING

Sorting is a major task in computer science and information systems/technology, with as much as 30% of the total computer processing time of everyday business activity allegedly being devoted to it. In a broader context, sorting is the computational process of arranging data in a particular order. As different sorting algorithms can result in differences of minutes, hours, or even days, efficiency is an important factor in terms of sorting time. Efficiency is measured by counting the number of comparisons and exchanges/swaps required to sort a given list of data. A comparison takes place when an element of the list is compared with another, whereas exchanges/swaps happen when two elements of the list switch their positions.

6.3.1 Bubble Sort

The bubble sort is one of the most well-known sorting algorithms. It is also covered in Chapter 4 of this book, under the topic of listboxes. The main idea of the algorithm is to have the element with the highest (or lowest) value in a list moved to the last (or first) place during each iteration.
At each iteration, the program repeats this process, moving the next highest (lowest) number in the list to the appropriate place. The number of main iterations corresponds to the number of elements in the list. During each main iteration there are as many comparisons (and potentially exchanges/swaps) as the total number of elements in the list. Thus, the time complexity of the bubble sort is O(n^2).

Observation 6.6 – Bubble Sort: Use two nested for loops during the inner iterations to successively move the highest/lowest value element to the end of the list until the entire list is sorted.

The detailed explanation of time complexities and the Big O/Theta/Omega notation is beyond the scope of this book, but the reader can find related information in most of the essential computer science sources and bibliography. For the purposes of this chapter, it should suffice to claim that the bubble sort is not particularly efficient in terms of time. In order to examine the low efficiency of the algorithm, the reader could assume that each comparison takes 1 nanosecond to complete (1 nanosecond = 1.0e−9 seconds). This would translate to the following rough estimates:

• n = 10: (n−1)^2 = 81 comparisons → approximate time 3e−4 seconds.
• n = 100: (n−1)^2 = 9.8e3 comparisons → approximate time 5e−3 seconds.
• n = 1,000: (n−1)^2 = 9.98e5 comparisons → approximate time 0.4 seconds.
• n = 10,000: (n−1)^2 = 9.998e7 comparisons → approximate time 46 seconds.
• n = 20,000: (n−1)^2 = 4.0e8 comparisons → approximate time 188 seconds.

As these calculations are estimates, they are largely dependent on the system at hand, the type of data of the list, and the conditions of the programming platform used. However, the crude assumptions and numbers used here could provide a rough idea of the increasing inefficiency of the bubble sort in line with an increasing size of the list. Indeed, bubble sort works well as long as n is not higher than approximately 10,000.
After this point, it becomes heavy and its inefficiency starts to show.

It is possible to slightly improve the efficiency of the algorithm by avoiding unnecessary comparisons. As an example, one could use the following eight-element list: 3, 5, 4, 2, 3, 1, 6, 7. The inner loop will execute n−1 times (i.e., seven iterations) during each of the main iterations. The inner iterations are then responsible for bringing each element to the corresponding place successively (Table 6.3). The reader should note that, firstly, it is not necessary that an exchange/swap of elements will take place in every iteration of the inner loop and, secondly, at the end of the main outer iteration the highest element is pushed to the end of the list. In this case, at the end of the first main outer iteration, element 7 occupies the end of the list. The last line is the result of the first main outer iteration, after all seven inner loops are completed. Subsequent iterations will repeat the same process, ensuring that the next highest element moves to the appropriate position, until all elements have taken the correct place in the list.

TABLE 6.3 The Inner Loop inside the First Main Iteration

    3  5  4  2  3  1  6  7
    3  5  4  2  3  1  6  7
    3  4  5  2  3  1  6  7
    3  4  2  5  3  1  6  7
    3  4  2  3  5  1  6  7
    3  4  2  3  1  5  6  7
    3  4  2  3  1  5  6  7

TABLE 6.4 The Results of the Outer Loops

    After the 1st pass    3  4  2  3  1  5  6  7
    After the 2nd pass    3  2  3  1  4  5  6  7
    After the 3rd pass    2  3  1  3  4  5  6  7
    After the 4th pass    2  1  3  3  4  5  6  7
    After the 5th pass    1  2  3  3  4  5  6  7
    After the 6th pass    Comparisons are made with no swaps
    After the 7th pass    Comparisons are made with no swaps

Table 6.4 presents the results after each of the outer iterations/loops.
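The idea of avoiding unnecessary comparisons, mentioned above, can be sketched as follows. This is a minimal illustration under two assumptions that are not part of the original script: the inner loop skips the tail of the list that is already in place, and the outer loop stops as soon as a full pass makes no swaps. The function name bubble_sort_early_exit is hypothetical.

```python
def bubble_sort_early_exit(values):
    """Return a sorted copy of values and the number of comparisons made."""
    items = list(values)          # work on a copy, leave the input untouched
    comparisons = 0
    n = len(items)
    for i in range(n - 1):
        swapped = False
        # The last i elements are already in their final positions
        for j in range(n - 1 - i):
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        # A pass without swaps means the list is already sorted
        if not swapped:
            break
    return items, comparisons


# The eight-element list used in Tables 6.3 and 6.4
result, count = bubble_sort_early_exit([3, 5, 4, 2, 3, 1, 6, 7])
print(result)
print("Comparisons:", count)
```

For this particular list the sketch stops after the sixth pass, the first one in which no swap occurs, so it performs 27 comparisons instead of the 49 made by seven full passes of seven comparisons each.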
A Python implementation of a basic bubble sort and its output is provided below:

    # Import the random module to generate random numbers
    import random
    import time

    comparisons = 0
    list = []

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        list.append(newNum)
    print("The unsorted list is: ", list)

    # Bubble sorts the list & records the stats for later use
    # Start the timer
    startTime = time.process_time()

    # The bubble sort algorithm
    for i in range(size - 1):
        for j in range(size - 1):
            comparisons += 1
            if (list[j] > list[j+1]):
                temp = list[j]
                list[j] = list[j+1]
                list[j+1] = temp

    # End the timer
    endTime = time.process_time()

    # Display the basic info for the bubble sort
    print("The sorted list is: ", list)
    print("The number of comparisons is: ", comparisons)
    print("The elapsed time in seconds is: ", (endTime - startTime))

Output 6.3.1:

    Enter the number of list elements: 7
    The unsorted list is: [33, -16, -57, -17, 95, 5, 15]
    The sorted list is: [-57, -17, -16, 5, 15, 33, 95]
    The number of comparisons is: 36
    The elapsed time in seconds is: 0.0

6.3.2 Insertion Sort

Insertion sort is another basic sorting algorithm, similar to bubble sort but somewhat improved. The basic idea is that on the ith pass the algorithm inserts the ith element (i.e., L[i]) into the appropriate place within the L[1], L[2], …, L[i-1] sequence, the elements of which have been previously placed in sorted order. As a result, after the insertion, the elements occupying the L[1], L[2], …, L[i] sequence are in sorted order. In simple terms, the algorithm sorts increasingly larger subsets of the original list until the whole list is sorted.

Observation 6.7 – Insertion Sort: Use a while loop nested inside a for loop to find the highest/lowest value element in the subset of the list in each ith pass. The subset starts with the first two elements (index extends up to i + 1) and is increased by 1 in each pass.

As an example, assume that the insertion sort is applied to the following seven-element list: 3, 5, 4, 2, 3, 1, 6, thus executing n−1 (i.e., 6) outer iterations/loops. The big difference between this algorithm and bubble sort is that the main iterations do not all require the same number of inner iterations, but an increasing number, starting from 1 and going up to n−1. During each main iteration, the highest element is moved to the last location of the current subset of the list. The following section describes in detail each of the main iterations.

The inner iteration of the first main iteration will put the two elements of the subset in order:

    3  5
    3  5

The two-iteration loop of the second main iteration will put the three elements of the subset in order:

    3  5  4
    3  4  5
    3  4  5

The three-iteration loop of the third main iteration will put the four elements of the subset in order:

    3  4  5  2
    3  4  2  5
    3  2  4  5
    2  3  4  5

The four-iteration loop of the fourth main iteration will put the five elements of the subset in order:

    2  3  4  5  3
    2  3  4  3  5
    2  3  3  4  5
    2  3  3  4  5
    2  3  3  4  5

The five-iteration loop of the fifth main iteration will put the six elements of the subset in order:

    2  3  3  4  5  1
    2  3  3  4  1  5
    2  3  3  1  4  5
    2  3  1  3  4  5
    2  1  3  3  4  5
    1  2  3  3  4  5

The six-iteration loop of the sixth main iteration will put the seven elements of the subset in order:

    1  2  3  3  4  5  6
    1  2  3  3  4  5  6
    1  2  3  3  4  5  6
    1  2  3  3  4  5  6
    1  2  3  3  4  5  6
    1  2  3  3  4  5  6
    1  2  3  3  4  5  6

The algorithm relies on the introduction of a temporary element (e.g., temp) and a temporary location (i.e., loc), which are initialized at the start of each pass with the element being inserted and its index (i.e., L[1] and 1 respectively for the first pass).
The following script provides an implementation of the insertion sort algorithm in Python:

    import random
    import time

    list = []
    comparisons = 0

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        list.append(newNum)
    print("The unsorted list is: ", list)

    startTime = time.process_time()  # Start the timer

    # The insertion sort algorithm
    for i in range(1, size):
        temp = list[i]
        loc = i
        # The counter records each comparison that leads to a shift
        while ((loc > 0) and (list[loc-1] > temp)):
            comparisons += 1
            list[loc] = list[loc-1]
            loc = loc - 1
        list[loc] = temp

    endTime = time.process_time()  # End the timer

    # Display the basic info for the insertion sort
    print("The sorted list is: ", list)
    print("The number of comparisons is: ", comparisons)
    print("The elapsed time in seconds is: ", (endTime - startTime))

Output 6.3.2:

    Enter the number of list elements: 7
    The unsorted list is: [2, -8, 69, 20, -56, -32, -81]
    The sorted list is: [-81, -56, -32, -8, 2, 20, 69]
    The number of comparisons is: 16
    The elapsed time in seconds is: 0.0

There are a couple of characteristics that make insertion sort significantly more efficient compared to bubble sort. First, since each subset of the list includes fewer elements than the entire list, it performs fewer comparisons. Second, as each pass ensures that the subset is in order, fewer swaps are required. However, on average, the algorithm falls under the same time efficiency bracket as bubble sort (i.e., O(n^2)), and only shows improvement in the best case, where it becomes linear and achieves a time complexity of O(n).
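The gap between the best and the worst case can be illustrated with a short sketch. It is an assumption-laden variant of the script above rather than the original code: the hypothetical function insertion_sort_comparisons counts every comparison between two elements, including the final one that ends each inner loop, and it is applied to an already sorted list (best case) and to a reversed list (worst case).

```python
def insertion_sort_comparisons(values):
    """Return a sorted copy of values and the number of element comparisons."""
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        temp = items[i]
        loc = i
        while loc > 0:
            comparisons += 1              # one comparison of temp with a neighbor
            if items[loc - 1] <= temp:
                break                     # correct position found
            items[loc] = items[loc - 1]   # shift the larger element to the right
            loc -= 1
        items[loc] = temp
    return items, comparisons


n = 10
best_sorted, best_count = insertion_sort_comparisons(range(n))
worst_sorted, worst_count = insertion_sort_comparisons(range(n - 1, -1, -1))
print("Best case (already sorted):", best_count)
print("Worst case (reversed):", worst_count)
```

With n = 10 the sorted input needs n − 1 = 9 comparisons, while the reversed input needs n(n − 1)/2 = 45, which matches the linear best case and the quadratic worst case described above.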
An approximation of the time efficiency improvements of the insertion sort over the bubble sort is provided in the list below (assume 1 comparison takes 1 nanosecond or 1.0e−9 seconds; Cs stands for Comparisons):

• n = 10: ~40 Cs ((n−1)^2 = 81 in Bubble S.) → approx. 2.0e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~4.0e3 Cs ((n−1)^2 = 9.8e3 in Bubble S.) → approx. 2.5e−3 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~5.0e5 Cs ((n−1)^2 = 9.98e5 in Bubble S.) → approx. 0.16 seconds (3.7e−1 in Bubble S.)
• n = 10,000: ~4.0e7 Cs ((n−1)^2 = 9.998e7 in Bubble S.) → approx. 15 seconds (46 in Bubble S.)
• n = 20,000: ~9.8e7 Cs ((n−1)^2 = 4.0e8 in Bubble S.) → approx. 57 seconds (188 in Bubble S.)

6.3.3 Selection Sort

Selection sort, also considered one of the fundamental sorting algorithms, is similar to insertion sort, but provides some improvements in terms of efficiency as it reduces the number of required swaps. The basic idea is that, on the ith pass, the algorithm selects the element with the lowest (or highest) value within a given range (i.e., A[i], …, A[n]), and swaps it with the element in the current position (i.e., A[i]). Thus, after the ith pass, the i lowest elements will occupy A[1], A[2], …, A[i] in sorted order.

Observation 6.8 – Selection Sort: Use a for loop nested inside another for loop to find and replace the highest/lowest value element with the original, ith item in the list. In each successive pass, the subset of the searchable list is reduced by one.

The algorithm utilizes subsets of a list to sort it, moving from the whole list to end up with the smallest divisions of it. In a sense, it is almost the opposite of insertion sort. The algorithm requires one additional variable in order to store the location (index) of the lowest value element within the list.
Using the list from the previous example (i.e., 3, 5, 4, 2, 3, 1, 6), during the 1st outer iteration of the selection sort, the inner iterations will determine that the lowest value element is in index 5. Therefore, the elements in list[0] and list[5] will be swapped, and the element in list[0] will not be involved in any further processing from this point on:

    list[0] = 3   list[1] = 5   list[2] = 4   list[3] = 2   list[4] = 3   list[5] = 1   list[6] = 6

By the end of the 1st outer iteration, the list has the following structure:

    list[0] = 1   list[1] = 5   list[2] = 4   list[3] = 2   list[4] = 3   list[5] = 3   list[6] = 6

Given that the 2nd outer loop will move the index to the 2nd element of the list (i.e., i = 1), the 2nd inner iterations will only deal with the subset of the original list, excluding the sorted part (i.e., list[0]). This means that in the unsorted subset of the list, the element with the lowest value will be in index 3. Thus, the elements in list[1] and list[3] will be swapped, while the element in list[1] will not be involved in any further processing:

    list[0] = 1   list[1] = 5   list[2] = 4   list[3] = 2   list[4] = 3   list[5] = 3   list[6] = 6

By the end of the 2nd outer iteration the list will be the following:

    list[0] = 1   list[1] = 2   list[2] = 4   list[3] = 5   list[4] = 3   list[5] = 3   list[6] = 6

Once again, the 3rd outer loop will move the index to the 3rd element of the list (i.e., i = 2) and the 3rd inner iterations will only deal with the subset of the original list, excluding the sorted part (i.e., list[0], list[1]).
As in the previous two iterations, this will result in the element with the lowest value in the unsorted subset of the list being found in index 4, and thus the elements in list[2] and list[4] will be swapped:

    list[0] = 1   list[1] = 2   list[2] = 4   list[3] = 5   list[4] = 3   list[5] = 3   list[6] = 6

By the end of the 3rd outer iteration the list will be the following:

    list[0] = 1   list[1] = 2   list[2] = 3   list[3] = 5   list[4] = 4   list[5] = 3   list[6] = 6

Repeating the outer loop for a 4th time will further move the index to the 4th element of the list and the 4th inner iterations will deal with the remaining subset of the list. The inner loop will find the lowest value element of that subset to be in index 5 of the list, and the elements in list[3] and list[5] will be swapped:

    list[0] = 1   list[1] = 2   list[2] = 3   list[3] = 3   list[4] = 4   list[5] = 5   list[6] = 6

The algorithm will continue until there is no subset left unprocessed. By that time, the list will have been sorted.
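The walkthrough above can be reproduced with a short tracing sketch. The function name selection_sort_trace and the idea of collecting a snapshot of the list after every outer iteration are assumptions made for illustration; the selection logic itself follows the description above.

```python
def selection_sort_trace(values):
    """Sort a copy of values and return the list state after each outer pass."""
    items = list(values)
    states = []
    for i in range(len(items) - 1):
        loc_of_min = i
        # Find the lowest value element in the unsorted part of the list
        for j in range(i + 1, len(items)):
            if items[loc_of_min] > items[j]:
                loc_of_min = j
        # Swap it with the first element of the unsorted part
        items[i], items[loc_of_min] = items[loc_of_min], items[i]
        states.append(list(items))
    return states


states = selection_sort_trace([3, 5, 4, 2, 3, 1, 6])
for pass_number, state in enumerate(states, start=1):
    print("After outer iteration", pass_number, ":", state)
```

The first four snapshots correspond to the four list states shown in the walkthrough; the remaining two passes leave the list unchanged because it is already sorted.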
The following script showcases an implementation of selection sort in Python and its output:

    # Import the random module to generate random numbers
    import random
    import time

    comparisons = 0
    list = []

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        list.append(newNum)
    print("The unsorted list is: ", list)

    # Selection sorts the list & records the stats for later use
    # Start the timer
    startTime = time.process_time()

    # The selection sort algorithm
    for i in range(size):
        locOfMin = i

        # Find the smallest element in the
        # remaining subset of the list
        for j in range(i+1, size):
            comparisons += 1
            if (list[locOfMin] > list[j]):
                locOfMin = j

        # Swap the minimum element with
        # the first element of the subset
        list[i], list[locOfMin] = list[locOfMin], list[i]

    # End the timer
    endTime = time.process_time()

    # Display the basic info for the selection sort
    print("The sorted list is: ", list)
    print("The number of comparisons is: ", comparisons)
    print("The elapsed time in seconds is: ", (endTime - startTime))

Output 6.3.3:

    Enter the number of list elements: 7
    The unsorted list is: [32, 81, -76, -88, 62, -53, -17]
    The sorted list is: [-88, -76, -53, -17, 32, 62, 81]
    The number of comparisons is: 21
    The elapsed time in seconds is: 0.0

Selection sort is a bit heavier than insertion sort, but the relative gap between the two narrows as the list grows larger. For lists containing between approximately 1,000 and 50,000 elements, both algorithms perform similarly in terms of their efficiency.
Their most important difference is that the efficiency of selection sort is quite similar across the best, average, and worst cases, with a time complexity of O(n^2), whereas insertion sort has a complexity that in the best case might even reach O(n). In practice, both algorithms are suitable for relatively small lists. The following list provides approximate comparative figures highlighting the performance differences between the two algorithms (assume 1 comparison takes 1 nanosecond or 1.0e−9 seconds; Cs stands for Comparisons):

• n = 10: 45 Cs (up to 40 in Insertion S.) → approx. 6.0e−4 seconds (2.0e−4 in Insertion S.)
• n = 100: 4.9e3 Cs (up to 4.0e3 in Insertion S.) → approx. 8.0e−3 seconds (2.5e−3 in Insertion S.)
• n = 1,000: 5.0e5 Cs (up to 5.0e5 in Insertion S.) → approx. 0.18 seconds (0.16 seconds in Insertion S.)
• n = 10,000: 5.0e7 Cs (4.0e7 Cs in Insertion S.) → approx. 17 seconds (15 seconds in Insertion S.)
• n = 20,000: 2.0e8 Cs (9.8e7 Cs in Insertion S.) → approx. 62 seconds (57 seconds in Insertion S.)
• n = 30,000: 4.5e8 Cs (2.2e8 Cs in Insertion S.) → approx. 142 seconds (125 seconds in Insertion S.)

6.3.4 Shell Sort

In order to improve the performance of sorting larger lists, the reader can use the shell sort (also referred to as the diminishing-increment sort). The main problem with previously discussed algorithms like insertion, selection, and bubble sort is their time performance of O(n^2), making them extremely slow when sorting big lists. Shell sort, while being based on insertion sort, works with gradually decreasing distances between the elements being compared. Initially, elements within a specifically defined distance in the list are sorted.

Observation 6.9 – Shell Sort: An improved variation of the insertion sort, sorting subsets of a list based on the distance between the various list elements. The process starts with a defined number that is reduced in each iteration (usually by one).
The algorithm then starts working with elements of decreasing distances until all subsequent elements have been processed. The key point in this algorithm is that every pass deals with a relatively small number of elements, or with already sorted elements, and every pass secures an increasing part of the list is ordered. The sequence of the distances can change, provided that the last distance must be 1. It is mathematically proven that the algorithm has a time complexity of O(n1,2). 226 Handbook of Computer Programming with Python As an example, let us consider the following list: 3, 5, 2, 4, 6, 1, 7, 9, 8. In the 1st pass, the list is split into three subsets, each of which is processed using the insertion sort. In this particular case, the three subsets have a distance of three between each element: • 1st Pass/Subset 1: 3, 4, 7. Result after insertion sort: 3, 4, 7 • 1st Pass/Subset 2: 5, 6, 9. Result after insertion sort: 5, 6, 9 • 1st Pass/Subset 3: 2, 1, 8. Result after insertion sort: 1, 2, 8 After the end of the 1st pass the list will be in the following order: 3, 5, 1, 4, 6, 2, 7, 9, 8. In the 2nd, the list is split into two subsets, with each one being processed again using the insertion sort. In this case, the two subsets have a distance of two between each element: • 2nd Pass/Subset 1: 3, 1, 6, 7, 8. Result after insertion sort: 1, 3, 6, 7, 8 • 2nd Pass/Subset 2: 5, 4, 2, 9. Result after insertion sort: 2, 4, 5, 9 After the end of the 2nd pass, the complete list will be in the following order: 1, 2, 3, 4, 6, 5, 7, 9, 8. Finally, in the 3rd pass, the list is dealt with as a whole, again using the insertion sort. Given that the previous passes ensured that the list is close to being fully sorted, this pass does require multiple swaps but only the necessary comparisons. 
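The three passes of the worked example can be reproduced with a short sketch. It is only an illustration under stated assumptions: the helper names `insertion_sort_with_distance` and `shell_sort_passes` are hypothetical, and the distance sequence (3, 2, 1) is fixed by hand to match the example rather than derived from the list size.

```python
def insertion_sort_with_distance(items, distance):
    """Insertion sort all subsets whose elements lie `distance` apart."""
    for i in range(distance, len(items)):
        temp = items[i]
        loc = i
        # Shift larger elements of the same subset to the right
        while loc >= distance and items[loc - distance] > temp:
            items[loc] = items[loc - distance]
            loc -= distance
        items[loc] = temp


def shell_sort_passes(items, distances=(3, 2, 1)):
    """Return the state of a copy of the list after each pass.

    The default distances are an assumption chosen to mirror the example.
    """
    data = list(items)
    snapshots = []
    for distance in distances:
        insertion_sort_with_distance(data, distance)
        snapshots.append(list(data))
    return snapshots


example = [3, 5, 2, 4, 6, 1, 7, 9, 8]
for number, state in enumerate(shell_sort_passes(example), start=1):
    print("After pass", number, ":", state)
```

Each printed state can be compared against the intermediate orders listed in the example; the last pass (distance 1) is an ordinary insertion sort over an almost sorted list.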
The following script implements the aforementioned algorithm:

    # Import the random and time modules to generate random numbers and keep time
    import random
    import time

    comparisons = 0
    numbers = []

    # Enter the number of elements for the list
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        numbers.append(newNum)
    print("The unsorted list is: ", numbers)

    # Start the timer
    startTime = time.process_time()

    # Use shell sort to sort the list and record the statistics for later use
    # Start with a big distance and reduce it successively
    distance = size // 2

    # Insertion sort each of the list subsets defined by the distance;
    # the last pass uses a distance of 1
    while distance >= 1:
        # The insertion sort algorithm
        for i in range(size):
            temp = numbers[i]
            loc = i
            while loc >= distance and numbers[loc - distance] > temp:
                # Only comparisons that lead to a shift are counted
                comparisons += 1
                numbers[loc] = numbers[loc - distance]
                loc = loc - distance
            numbers[loc] = temp
        distance -= 1

    # End the timer
    endTime = time.process_time()

    # Display basic info for the shell sort
    print("The sorted list is: ", numbers)
    print("The number of comparisons is: ", comparisons)
    print("The elapsed time in seconds is: ", (endTime - startTime))

Output 6.3.4:

    Enter the number of list elements: 10
    The unsorted list is: [-47, 79, -79, 94, -79, -97, -7, -3, 49, 88]
    The sorted list is: [-97, -79, -79, -47, -7, -3, 49, 79, 88, 94]
    The number of comparisons is: 10
    The elapsed time in seconds is: 0.0

The efficiency of the algorithm may not be instantly noticeable, but it becomes apparent in the number of comparisons. The following list of approximate results showcases the performance difference between insertion sort and shell sort. Note that, in these figures, the elapsed times of the script are longer than those of insertion sort despite the much smaller number of counted comparisons, since reducing the distance by one at a time results in a large number of passes over the list (assume 1 comparison takes 1 nanosecond or 1.0e−9 seconds; Cs stands for Comparisons):

• n = 10: 8 Cs (up to 40 in Insertion S.) → approx. 3.8e−4 seconds (2.0e−4 in Insertion S.)
• n = 100: 4e2 Cs (up to 4.0e3 in Insertion S.) → approx. 3.8e−3 seconds (2.5e−3 in Insertion S.)
• n = 1,000: 1.5e4 Cs (up to 5.0e5 in Insertion S.) → approx. 0.27 seconds (0.16 seconds in Insertion S.)
• n = 10,000: 1.7e5 Cs (4.0e7 Cs in Insertion S.) → approx. 26 seconds (15 seconds in Insertion S.)
• n = 20,000: 3.4e5 Cs (9.8e7 Cs in Insertion S.) → approx. 99 seconds (57 seconds in Insertion S.)
• n = 30,000: 5e5 Cs (2.2e8 Cs in Insertion S.) → approx. 215 seconds (125 seconds in Insertion S.)

6.3.5 Shaker Sort

The shaker sort algorithm is based on the bubble sort, but instead of the list always being read in the same direction, consecutive readings occur in opposite directions. This ensures that both the highest and lowest value elements of the list move to their correct positions faster. The main disadvantage of this algorithm is that, since it is based on bubble sort, its time complexity is bound to O(n²).

Observation 6.10 – Shaker Sort: Use two separate for loops nested inside a while loop to read a list of elements in opposite directions. This ensures that the elements will be positioned to the correct places in the list faster than with bubble sort.

The following list provides approximate comparisons between the shaker and the bubble sort. The examples support the argument that it is not worth using this algorithm unless the size of the list falls within the approximate range of 1,000–50,000 elements. For lists with more elements than the upper threshold of this range (50,000), using the shaker sort is impractical (as in previous examples, 1 comparison takes 1 nanosecond to complete and 1 nanosecond = 1.0e−9 seconds):

• n = 10: ~40 Cs (81 Cs in Bubble S.) → approx. 7.7e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~4.2e3 Cs (9.8e3 Cs in Bubble S.) → approx. 3.2e−3 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~3.9e5 Cs (9.98e5 Cs in Bubble S.) → approx. 0.28 seconds (0.37 in Bubble S.)
• n = 10,000: ~3.8e7 Cs (9.998e7 Cs in Bubble S.) → approx. 28 seconds (46 in Bubble S.)
• n = 20,000: ~1.5e8 Cs (2.0e8 Cs in Bubble S.) → approx. 110 seconds (188 in Bubble S.)

In general, the time complexity of the algorithm for the average and worst cases is O(n²), while in the best case (when a pass finds no swaps to make) it can reach O(n).

As an example, let us consider the same list as the one used with bubble sort: 3, 5, 4, 2, 3, 1, 6, 7. During the 1st outer loop, shaker sort will execute two inner iterations successively, with one iteration processing the list to the right and one to the left. Each time an inner loop processes the list to the right, the pointer at the end of the list is reduced by one. Similarly, each time it processes the list to the left, the pointer at the start of the list is increased by one. Starting with the 1st outer iteration, the inner loop presented in Table 6.5 (processing the list to the right) will take place. Likewise, in the 1st outer iteration, the inner loop presented in Table 6.6 will process the list to the left. The reader should note that, at the end of each outer iteration, the highest value element of the current sub-list is pushed to the end of the sub-list and the lowest is pushed to the start. Table 6.7 presents the results of each of the outer iterations.
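The outer iterations described above can be traced with a minimal sketch. The function name `shaker_sort_passes` is hypothetical, and the starting list is assumed to be the one shown in the first row of Table 6.5; the sketch records the list after every completed outer iteration (one reading to the right followed by one to the left).

```python
def shaker_sort_passes(items):
    """Shaker sort a copy of the list; return (sorted copy, states per outer pass)."""
    data = list(items)
    start, end = 0, len(data) - 1
    passes = []
    swapped = True
    while swapped:
        swapped = False
        # Reading to the right: the largest value moves to the end
        for i in range(start, end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            break  # no swaps in this reading, so the list is sorted
        end -= 1
        swapped = False
        # Reading to the left: the smallest value moves to the start
        for i in range(end, start, -1):
            if data[i] < data[i - 1]:
                data[i], data[i - 1] = data[i - 1], data[i]
                swapped = True
        start += 1
        passes.append(list(data))
    return data, passes


result, passes = shaker_sort_passes([3, 5, 4, 2, 3, 1, 6, 7])
for number, state in enumerate(passes, start=1):
    print("After outer pass", number, ":", state)
print("Sorted:", result)
```

Under these assumptions the sketch completes two outer passes and then stops during the first (rightward) reading of the third, because that reading performs no swaps.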
Note that the algorithm will stop at the end of the first inner iteration of the 3rd outer pass, as there are no more swaps to be made:

TABLE 6.5 The First Inner Loop within the First Main Iteration, Reading the List to the Right

    3  5  4  2  3  1  6  7
    3  5  4  2  3  1  6  7
    3  4  5  2  3  1  6  7
    3  4  2  5  3  1  6  7
    3  4  2  3  5  1  6  7
    3  4  2  3  1  5  6  7
    3  4  2  3  1  5  6  7

TABLE 6.6 The Second Inner Loop within the First Main Iteration, Reading the List to the Left

    3  4  2  3  1  5  6  7
    3  4  2  3  1  5  6  7
    3  4  2  3  1  5  6  7
    3  4  2  1  3  5  6  7
    3  4  1  2  3  5  6  7
    3  1  4  2  3  5  6  7
    1  3  4  2  3  5  6  7

TABLE 6.7 The Results of the Outer Loops

    After the 1st pass                            1  3  4  2  3  5  6  7
    After the 2nd pass                            1  2  3  3  4  5  6  7
    After the 1st inner of the 3rd outer pass     1  2  3  3  4  5  6  7

The following script demonstrates an implementation of the shaker sort and its output:

    # Import the random and time modules to generate random numbers and keep time
    import random
    import time

    comparisons = 0
    numbers = []

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        numbers.append(newNum)
    print("The unsorted list is: ", numbers)

    # Start the timer
    startTime = time.process_time()

    # The shaker sort algorithm
    swapped = True
    start = 0
    end = size - 1

    # Keep running the shaker sort while swaps are taking place
    while swapped:
        # Set swapped to False to start the new loop
        swapped = False

        # Loop from left to right using bubble sort
        for i in range(start, end):
            comparisons += 1
            if numbers[i] > numbers[i + 1]:
                numbers[i], numbers[i + 1] = numbers[i + 1], numbers[i]
                swapped = True

        # If there were no swaps, the list is sorted
        if not swapped:
            break
        # If at least one swap, reset swapped to False and continue
        swapped = False

        # Decrease the end of the list by 1, since the largest element
        # moved to the right
        end -= 1

        # Loop from right to left using bubble sort
        for i in range(end, start, -1):
            comparisons += 1
            if numbers[i] < numbers[i - 1]:
                numbers[i], numbers[i - 1] = numbers[i - 1], numbers[i]
                swapped = True

        # Increase the start of the list by 1, since the smallest element
        # moved to the left
        start += 1

    # End the timer
    endTime = time.process_time()

    # Display the sorted list
    print("The sorted list is: ", numbers)
    print("The number of comparisons is: ", comparisons)
    print("The elapsed time in seconds: ", (endTime - startTime))

Output 6.3.5:

    Enter the number of list elements: 15
    The unsorted list is: [98, -23, -29, 17, -11, 2, 77, -20, -53, 66, -2, 33, 63, 33, 68]
    The sorted list is: [-53, -29, -23, -20, -11, -2, 2, 17, 33, 33, 63, 66, 68, 77, 98]
    The number of comparisons is: 77
    The elapsed time in seconds: 0.0

6.4 RECURSION, BINARY SEARCH, AND EFFICIENT SORTING WITH LISTS

In a broader context, any attempt to find an algorithm that addresses the problem of sorting a list efficiently is subject to certain restrictions. This is due to the fact that the algorithms discussed so far generally fall within the same time complexity of O(n²), as a result of their inherent nested loop structures. As shown in the previous sections, this is true even when improved and optimized versions of the algorithms are used. In order to improve the efficiency of sorting algorithms further, recursion can be adopted. This section presents and discusses the concept of recursion, and uses it as a base to implement some common related algorithmic ideas like the factorial and binary search. Subsequently, two notable algorithms that address the problem of sorting large lists in an efficient way are presented: quicksort and merge sort.

6.4.1 Recursion

By definition, a recursive function is one that calls itself.
The basic idea is to break a large problem into several smaller parts that are equivalent to the original. These are further broken down successively into even smaller parts, until the problem is small enough for its solution to become evident. This final point is called a terminal or base case. The condition that must be met in order to reach the terminal case is called the terminal condition. The associated step followed to break down the problem into smaller parts is called the basic step.

Observation 6.11 – Recursion: A recursive function is one that calls itself. It takes a large problem and breaks it into smaller ones successively, following a step. The step is repeated until the smaller parts are so small that the solution is evident. The final and smallest part is referred to as the terminal or base case.

In order to contextualize the idea of recursion, one needs to break down what happens on a recursive function call:

• Firstly, the compiler/interpreter passes a parameter to the function.
• The called function and its parameter are pushed to the program stack (stacks are discussed in Section 6.5.1), a separate place in memory where the local variables are stored until this particular function call is completed.
• The compiler/interpreter records the return address, which will be used to return to the calling function when the current function call is complete.
• When the current function call is complete, the compiler/interpreter records the value to be returned to the calling function (if applicable).

In terms of its results, recursion is similar to the iteration explained in Chapter 2, but differs in terms of the constructs used. An iterative algorithm uses a looping construct whereas a recursive algorithm uses a branching structure. In terms of both time and memory usage, recursive solutions are often less efficient than their iterative counterparts. However, on many occasions they are the only practical solutions available.
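The contrast between a looping construct and a branching structure can be illustrated with a small sketch that adds the integers from 1 to n in both styles. The function names `sum_iterative` and `sum_recursive` are hypothetical and chosen only for this illustration.

```python
def sum_iterative(n):
    """Add the integers 1..n with a looping construct."""
    total = 0
    for value in range(1, n + 1):
        total += value
    return total


def sum_recursive(n):
    """Add the integers 1..n with a branching structure (recursion)."""
    # Terminal (base) case: nothing left to add
    if n <= 0:
        return 0
    # Basic step: reduce the problem to a smaller, equivalent one
    return n + sum_recursive(n - 1)


print(sum_iterative(100))
print(sum_recursive(100))
```

Both calls return the same result, but the recursive version places one entry on the program stack per call, which is why very deep recursions are limited by the interpreter.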
The main advantage of recursive solutions is that, by reducing the problem to a single repeated step, they often result in shorter and more readable source code. The following script presents a basic recursive function that calls itself continuously, printing a particular message until the interpreter stops it with a RecursionError:

    def message():
        print("This is a recursive function")
        message()

    message()

Output 6.4.1.a:

    This is a recursive function
    This is a recursive function
    This is a recursive function
    This is a recursive function
    RecursionError                Traceback (most recent call last)
    <ipython-input-l-e0c7cc045453> in <module>

To prevent the function from falling into this infinite call loop, the number of repetitions must be controlled. This can be achieved by incorporating the following two steps:

• A dividing step must be applied to a subset of the original values in each repetition.
• The terminal or base case must be defined and calculated (if applicable).

The following script is a modified version of the message() function presented above. It passes an integer argument that dictates the number of times the function will call itself before the terminal case:

    # The recursive function
    def message(times):
        print("Message called with times = ", times)
        # Define the dividing step through an if statement
        if times > 0:
            print("\tThis is a recursive function.\n")
            message(times - 1)
        # Once the base case (times == 0) is reached, recursion stops
        # and the calls "roll back" in reverse order
        print("Message returning with times = ", times, "\n")

    # Start the recursion by calling the recursive function
    message(3)

Output 6.4.1.b:

    Message called with times = 3
        This is a recursive function.

    Message called with times = 2
        This is a recursive function.

    Message called with times = 1
        This is a recursive function.

    Message called with times = 0
    Message returning with times = 0

    Message returning with times = 1

    Message returning with times = 2

    Message returning with times = 3

The application of recursion can also be considered in the context of a purely mathematical function, that of the factorial. The complete definition of the factorial is f(n) = n * f(n−1) for n > 1, and f(1) = 1 for n = 1. According to this definition, for f(4) the result would be calculated as follows: f(4) = 4 * f(3) = 4 * 3 * f(2) = 4 * 3 * 2 * f(1) = 4 * 3 * 2 * 1 = 24. Notice that in the case of f(1) there is no further breakdown of the function, as this is considered the terminal or base case with a result of f(1) = 1. The following script implements the solution of the factorial:

    # The factorial function using recursion
    def factorial(n):
        # The terminal or base case; using n <= 1 also stops the
        # recursion if 0 or a negative number is entered
        if n <= 1:
            return 1
        # The recursive step
        else:
            print(n, "* f(", n - 1, ")")
            return n * factorial(n - 1)

    num = int(input("Enter the number to find its factorial: "))
    print("The factorial for", num, "is ", factorial(num))

Output 6.4.1.c:

    Enter the number to find its factorial: 1
    The factorial for 1 is 1

    Enter the number to find its factorial: 3
    3 * f( 2 )
    2 * f( 1 )
    The factorial for 3 is 6

    Enter the number to find its factorial: 7
    7 * f( 6 )
    6 * f( 5 )
    5 * f( 4 )
    4 * f( 3 )
    3 * f( 2 )
    2 * f( 1 )
    The factorial for 7 is 5040

6.4.2 Binary Search

One of the most well-known applications of recursion is the binary search. The main idea behind binary search can be illustrated by the task of finding whether a word exists in a dictionary. The necessary precondition is to use it on a sorted list, regardless of the algorithm used for the sorting. The concept is rather simple:

Observation 6.12 – Binary Search: A recursive algorithm applied to sorted lists in order to find the location of a particular element.

• Initially, the algorithm checks whether the middle element of the list is the search value.
• If it is not, the list is split into two halves. If the middle element value is larger than the search value, the middle element of the first half is checked; otherwise, the middle element of the second half is checked.
• The algorithm continues until the desired element is found, in which case the element and its position in the list are reported. If the search element is not found, a relevant message is generated.

The algorithm can be expressed in pseudocode as follows:

    # The recursive function for binary search
    binarySearch(word, startPage, endPage)
        # if the dictionary consists of one page (base case)
        # search for it in that page
        if startPage = endPage
            search the word in the startPage
        else
            # get to the middle of the dictionary
            middlePage = (endPage + startPage)/2
            # determine which half of the dictionary might contain
            # the chosen word
            # if the word is in the first half
            if the word is located before the middlePage
                # find the word in the first half of the dictionary
                binarySearch(word, startPage, middlePage)
            else
                # find the word in the second half of the dictionary
                binarySearch(word, middlePage+1, endPage)

In this particular algorithm, function binarySearch calls itself recursively. At each call, the problem gets smaller as the size is halved. The base case is the startPage = endPage statement, at which point either the word is found or it does not exist in the dictionary.
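The pseudocode above can also be sketched as a function that returns a position instead of printing a message, which makes the result easier to reuse. This is only an illustrative variant: the name `binary_search`, the default arguments, and the convention of returning -1 when the value is absent are assumptions, not part of the original algorithm description.

```python
def binary_search(values, target, start=0, end=None):
    """Return the index of target in the sorted list values, or -1 if absent."""
    if end is None:
        end = len(values) - 1
    # An empty range cannot contain the target
    if start > end:
        return -1
    # Base case: a single element is left to check
    if start == end:
        return start if values[start] == target else -1
    # Split the range using the middle point as a reference
    middle = (start + end) // 2
    if target <= values[middle]:
        # The target can only be in the first half
        return binary_search(values, target, start, middle)
    # The target can only be in the second half
    return binary_search(values, target, middle + 1, end)


sorted_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(binary_search(sorted_numbers, 7))
print(binary_search(sorted_numbers, 23))
```

Returning an index keeps the search logic separate from any reporting, so the caller decides how to present a hit or a miss.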
The following script implements the algorithm:

    # The list of numbers to search in
    listOfNumbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # The recursive function for binary search
    def binarySearch(number, startPage, endPage):
        # If the list consists of one page (base case) search for it
        # in that page
        if startPage == endPage:
            if listOfNumbers[startPage] == number:
                print("The number was found in the list in position: ", startPage)
            else:
                print("The number was not found in the list")
        else:
            # Split the list using the middle point as a reference
            middlePage = (endPage + startPage) // 2
            # Determine which half of the list might contain the number
            # If the number is in the first half
            if number <= listOfNumbers[middlePage]:
                # Find the number in the first half of the list
                binarySearch(number, startPage, middlePage)
            else:
                # Find the number in the second half of the list
                binarySearch(number, middlePage + 1, endPage)

    num = int(input("Enter the number to find in the list: "))

    # Call the binarySearch function
    binarySearch(num, 0, len(listOfNumbers) - 1)

Output 6.4.2:

    Enter the number to find in the list: 7
    The number was found in the list in position: 6

    Enter the number to find in the list: 23
    The number was not found in the list

6.4.3 Quicksort

Quicksort is considered one of the more advanced sorting algorithms for lists (i.e., static objects), with a better average performance than insertion, selection, and shell sort. It was introduced by Hoare (Hoare, 1961). Quicksort belongs to a well-known and highly regarded family of algorithms adopting the divide and conquer strategy.

The algorithm sorts a list of n elements by picking a key value k in the list as a pivot point, around which the list elements are then rearranged. Choosing a good pivot point is important for performance, although not necessary for correctness. Ideally, the pivot should be the median key value or close to it, so that the numbers of preceding and succeeding elements in the list are balanced.

Observation 6.13 – Quicksort: Select an element in the list as the pivot element k and rearrange the rest so that lower value elements precede it and higher succeed it (or the opposite). Apply the same process to the two resulting sub-lists repeatedly, until there are no more lists to divide. By definition, at the end of this process the list will be sorted.

Once this pivot key (k) is decided, the elements of the list are rearranged so that those with lower values appear before it and those with higher values after it. Once this process is completed, k is in its final position and the list is partitioned into two sub-lists around it: one containing the values lower than k and one containing the values higher than k. This process is applied recursively to the two sub-lists and all subsequent sub-lists created based on them until there are no lists to divide. Once this process is complete, the list is sorted by definition.

As an example, let us consider the following list: 37, 2, 6, 4, 89, 8, 10, 12, 68, 45. The first element (i.e., list[0]: 37) is taken as the pivot element (k). The process will start with the rightmost element of the list, moving in a decremental order from that point on (i.e., list[9]: 45, list[8]: 68, list[7]: 12). Each element is compared with k until an element with a lower value is found. In this instance, the process will stop at list[7]: 12 and this element will be swapped with k (Table 6.8).
TABLE 6.8 The First Round of Comparisons at the Right of the List and Towards the Pivot Element

    37   2   6   4  89   8  10  12  68  45
    37   2   6   4  89   8  10  12  68  45
    37   2   6   4  89   8  10  12  68  45
    12   2   6   4  89   8  10  37  68  45

Next, k (37) will be compared with the elements on its left, beginning after 12. The comparisons will continue in an increasing order until an element greater than 37 is found. This will happen for value 89, so 37 and 89 will be swapped (Table 6.9).

TABLE 6.9 The First Round of Comparisons at the Left of the List and Towards the Pivot Element

    12   2   6   4  89   8  10  37  68  45
    12   2   6   4  89   8  10  37  68  45
    12   2   6   4  89   8  10  37  68  45
    12   2   6   4  89   8  10  37  68  45
    12   2   6   4  37   8  10  89  68  45

After the swap, the process will resume at the left of the previously swapped element (89) and at the right of pivot element k. The first element that will be considered is 10, which is smaller than the pivot element; thus, the two elements will be swapped. The rearranged list is shown in Table 6.10.

TABLE 6.10 The First Round of Comparisons Resumes at the Right of the Pivot Element

    12   2   6   4  37   8  10  89  68  45
    12   2   6   4  10   8  37  89  68  45

Finally, the process will start again at the left of the sub-list with 37 as the pivot, and begin with the element after 10. This time, the only remaining element to compare (8) is lower than 37, so no swap will take place between the two elements. This first round of comparisons will end with the 1st pivot element (37) placed in its final place in the list, leaving two unsorted sub-lists on its left and right sides (Table 6.11).

TABLE 6.11 The First Round of Comparisons Resumes and Finishes at the Left

    12   2   6   4  37   8  10  89  68  45
    12   2   6   4  10   8  37  89  68  45

This is the first partitioning of the list into the first two unsorted sub-lists. The exact same comparison process will next be applied to both the left and right sub-lists recursively.
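The first partitioning round walked through above can be traced with a short sketch. It follows the alternating right-to-left and left-to-right scans of the example, in which the pivot itself is swapped at every step; this is an assumption made for illustration and differs from a partition that keeps the pivot in place until the end. The name `partition_with_trace` is hypothetical.

```python
def partition_with_trace(items, start, end):
    """Partition items[start..end] around items[start], swapping the pivot itself.

    Returns the final pivot position and a copy of the list after each swap.
    """
    i, j = start, end          # the pivot starts at index i
    snapshots = []
    while i < j:
        # Scan from the right for an element smaller than the pivot (at i)
        while i < j and items[j] >= items[i]:
            j -= 1
        if i < j:
            items[i], items[j] = items[j], items[i]   # pivot moves to j
            snapshots.append(list(items))
            i += 1
        # Scan from the left for an element larger than the pivot (at j)
        while i < j and items[i] <= items[j]:
            i += 1
        if i < j:
            items[i], items[j] = items[j], items[i]   # pivot moves to i
            snapshots.append(list(items))
            j -= 1
    return i, snapshots


data = [37, 2, 6, 4, 89, 8, 10, 12, 68, 45]
position, steps = partition_with_trace(data, 0, len(data) - 1)
for state in steps:
    print(state)
print("Pivot 37 ends at index", position)
```

Each printed state corresponds to one swap of the walkthrough, and the final index marks the boundary between the two sub-lists that are sorted next.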
When all comparisons and partitions are complete there will be no further sub-lists left to sort and the entire list will be sorted. The algorithm may seem rather complicated and its efficiency difficult to gauge. Nevertheless, it is indeed much more efficient than all the previously discussed algorithms. A script implementing the quicksort algorithm is provided below:

    # Import the random and time modules
    # to generate random numbers and keep time
    import random
    import time

    comparisons = 0
    numbers = []

    # The partitioning step of the quicksort algorithm
    def quickSortReadings(numbers, start, end):
        global comparisons
        pivot = numbers[start]
        low = start + 1
        high = end
        while True:
            # Compare elements from the right to find one
            # that is smaller than the pivot. Stop when one is found
            while low <= high and numbers[high] >= pivot:
                high -= 1
                comparisons += 1
            # Compare elements from the left to find one
            # that is larger than the pivot. Stop when one is found
            while low <= high and numbers[low] <= pivot:
                low += 1
                comparisons += 1
            # If two elements on the wrong sides of the pivot were found,
            # swap them to put things in order and continue the process
            if low <= high:
                numbers[low], numbers[high] = numbers[high], numbers[low]
            # Stop and exit if the low index moved beyond the high index
            else:
                break
        # Place the pivot in its final position
        numbers[start], numbers[high] = numbers[high], numbers[start]
        return high

    # The recursive step: partition the list and sort the two sub-lists
    def quickSortPartition(numbers, start, end):
        if start >= end:
            return
        p = quickSortReadings(numbers, start, end)
        quickSortPartition(numbers, start, p - 1)
        quickSortPartition(numbers, p + 1, end)

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        numbers.append(newNum)
    print("The unsorted list is: ", numbers)

    # Start the timer
    startTime = time.process_time()

    quickSortPartition(numbers, 0, size - 1)

    # End the timer
    endTime = time.process_time()

    # Display the sorted list
    print("The sorted list is: ", numbers)
    print("The number of comparisons is = ", comparisons)
    print("The elapsed time in seconds = ", (endTime - startTime))

Output 6.4.3:

    Enter the number of list elements: 10
    The unsorted list is: [-94, -1, -35, 13, -73, 18, 4, 29, 46, -62]
    The sorted list is: [-94, -73, -62, -35, -1, 4, 13, 18, 29, 46]
    The number of comparisons is = 26
    The elapsed time in seconds = 0.0

The following estimates provide a rough comparison between quicksort and bubble sort, highlighting the fact that the former operates at a completely different efficiency level and is, thus, capable of processing much larger lists.
The only possible restrictions in relation to its use have to do with the power of the computer system used and the available memory, as these are determining factors when running recursive calls on lists larger than 100,000 elements (a comparison takes 1 nanosecond to complete and 1 nanosecond = 1.0e−9 seconds):

• n = 10: ~30 Cs (81 Cs in Bubble S.) → approx. 1.8e−4 seconds (3e−4 in Bubble S.)
• n = 100: ~6.2e2 Cs (9.8e3 Cs in Bubble S.) → approx. 4e−4 seconds (4.5e−3 in Bubble S.)
• n = 1,000: ~1e4 Cs (9.98e5 Cs in Bubble S.) → approx. 9.7e−3 seconds (3.7e−1 in Bubble S.)
• n = 10,000: ~3e5 Cs (9.998e7 Cs in Bubble S.) → approx. 0.1 seconds (46 in Bubble S.)
• n = 20,000: ~1e6 Cs (2.0e8 Cs in Bubble S.) → approx. 0.3 seconds (188 in Bubble S.)
• n = 30,000: ~3e6 Cs (2.0e8 Cs in Bubble S.) → approx. 0.6 seconds (Not practical in Bubble S.)
• n = 100,000: ~2e7 Cs (2.0e8 Cs in Bubble S.) → approx. 5.6 seconds (Not practical in Bubble S.)
• n = 300,000: ~1.8e8 Cs (2.0e8 Cs in Bubble S.) → approx. 48 seconds (Not practical in Bubble S.)

In terms of time complexity, while the worst case runs at O(n²), the average and best cases run at the much more efficient level of O(n log n).

6.4.4 Merge Sort

Merge sort is another advanced algorithm for efficient sorting of large lists, following the same divide and conquer approach as quicksort. Merge sort is an excellent choice for sorting data that cannot be kept in the computer memory all at once and are, thus, kept in secondary storage. The essential idea behind merge sort is to split lists into two halves continuously until all sub-lists consist of a single element and, subsequently, merge the sub-lists while also ordering their elements.

Observation 6.14 – Merge Sort: A divide and conquer algorithm for sorting static lists. The basic idea is to divide the list into two sub-lists repeatedly, until all sub-lists consist of a single element. The divided lists are then merged again following a particular sorting procedure.

Algorithmically, the process is rather straightforward, particularly for the split part. The process the programmer must follow for merging each given set of two sub-lists is summarized below:

• Check if the first sub-list is empty.
• If not, check if the second sub-list is empty.
• If not, compare the first available element in the first sub-list with the first available element in the second sub-list.
• Whichever of the two elements has a lower value must be placed in the first available slot of a new merged list.
• This process should be repeated for all remaining elements of the two sub-lists.
• If all the elements of one of the sub-lists have been used, place the remaining elements of the other sub-list in the new merged list, in the order they appear in the sub-list.
• Recursively repeat this process until all the sub-lists are merged into one ordered merged list.

As an example, let us consider the following list: 25, 13, 9, 32, 17, 5, 33, 25, 43, 21.
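The merging steps listed above can be sketched as a small stand-alone function. The name `merge_sorted` is hypothetical, and the two sample sub-lists are assumed to be the two halves of the example list after each half has already been sorted; the sketch covers only the merge of one pair of sub-lists, not the full recursive algorithm.

```python
def merge_sorted(left, right):
    """Merge two already sorted lists into a new sorted list."""
    merged = []
    i, j = 0, 0
    # While neither sub-list is exhausted, take the smaller front element
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One sub-list is used up: append what remains of the other, in order
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


left_half = [9, 13, 17, 25, 32]    # sorted first half of the example list
right_half = [5, 21, 25, 33, 43]   # sorted second half of the example list
print(merge_sorted(left_half, right_half))
```

Using `<=` in the comparison keeps equal values (such as the two 25s) in their original relative order, so the merge is stable.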
Firstly, the list is split into the required set of sub-lists:

[Figure: successive splitting of the list into sub-lists]

Next, the lists are merged on a bottom-up basis, as shown below:

[Figure: bottom-up merging of the sub-lists]

The following script provides an implementation of the merge sort algorithm:

    # Import the random and time modules
    # to generate random numbers and keep time
    import random
    import time

    comparisons = 0
    numbers = []

    # Merge two sub-lists, numbers[first..middle] and numbers[middle+1..last]
    def merge(first, middle, last):
        global comparisons

        # Copy the two sub-lists to the temporary lists leftList & rightList
        leftList = numbers[first:middle + 1]
        rightList = numbers[middle + 1:last + 1]
        size1 = len(leftList)
        size2 = len(rightList)

        # Merge the temporary lists back into the original list
        # until one of the sub-lists is empty
        i = 0; j = 0; k = first
        while i < size1 and j < size2:
            comparisons += 1
            if leftList[i] <= rightList[j]:
                numbers[k] = leftList[i]
                i += 1
            else:
                numbers[k] = rightList[j]
                j += 1
            k += 1

        # If rightList becomes empty, copy the remaining elements of leftList
        while i < size1:
            numbers[k] = leftList[i]
            i += 1; k += 1

        # If leftList becomes empty, copy the remaining elements of rightList
        while j < size2:
            numbers[k] = rightList[j]
            j += 1; k += 1

    # The merge sort algorithm
    def mergesort(first, last):
        # The recursive step
        if first < last:
            middle = (first + last) // 2
            mergesort(first, middle)
            mergesort(middle + 1, last)
            merge(first, middle, last)

    # Enter the number of list elements
    size = int(input("Enter the number of list elements: "))

    # Use the randint() function to generate random integers
    for i in range(size):
        newNum = random.randint(-100, 100)
        numbers.append(newNum)
    print("The unsorted list is: ", numbers)

    # Start the timer
    startTime = time.process_time()

    mergesort(0, size - 1)

    # End the timer
    endTime = time.process_time()

    # Display the sorted list
    print("The sorted list is: ", numbers)
    print("The number of comparisons is = ", comparisons)
    print("The elapsed time in seconds = ", (endTime - startTime))

Output 6.4.4:

    Enter the number of list elements: 15
    The unsorted list is: [83, -3, 89, 64, -5, 65, 78, 17, 8, -3, 82, 89, -80, 23, 64]
    The sorted list is: [-80, -5, -3, -3, 8, 17, 23, 64, 64, 65, 78, 82, 83, 89, 89]
    The number of comparisons is = 42
    The elapsed time in seconds = 0.0

The efficiency of the algorithm in sorting static lists is comparable to that of quicksort (a comparison takes 1 nanosecond to complete; 1 nanosecond = 1.0e−9 seconds):

• n = 10: ~20 Cs (30 in Quicksort) → approx. 2e−4 seconds (1.8e−4 in Quicksort)
• n = 100: ~5.4e2 Cs (6.2e2 Cs in Quicksort) → approx. 0.0012 seconds (1.2e−2 in Quicksort)
• n = 1,000: ~8.6e3 Cs (1e4 Cs in Quicksort) → approx. 0.015 seconds (9.7e−3 seconds in Quicksort)
• n = 10,000: ~1.2e5 Cs (3e5 Cs in Quicksort) → approx. 0.15 seconds (0.1 seconds in Quicksort)
• n = 30,000: ~4e5 Cs (3e6 in Quicksort) → approx. 0.44 seconds (0.6 seconds in Quicksort)
• n = 100,000: ~1.5e6 Cs (2e7 in Quicksort) → approx. 1.6 seconds (5.6 seconds in Quicksort)
• n = 300,000: ~5e6 Cs (1.8e8 in Quicksort) → approx. 5.5 seconds (48 seconds in Quicksort)

In general, merge sort is more predictable than quicksort, as it runs in O(n log n) time in all cases (i.e., best, average, and worst case).
Most importantly, its advantage becomes more pronounced as the size of the list grows larger (e.g., lists consisting of hundreds of thousands of elements or more), although the exact figures depend on the power, memory, and settings of the system it runs on.

6.5 COMPLEX DATA STRUCTURES

In the previous sections, the focus was on the implementation of sorting by means of relatively simple, static data structures, like lists. More advanced, real-life applications may require more complex data structures. This section addresses such data structures, which can take both linear and non-linear forms (Figure 6.1). In linear structures, such as stacks, queues, and linked lists, each element occupies a position that is relative to that of the previous and succeeding elements within the structure. Consequently, the structure is traversed (i.e., read) sequentially. In non-linear structures, such as trees and graphs, the items are not arranged in a single sequential order, so a simple sequential traversal is not feasible. Non-linear structures are more complex to implement, but they are also more powerful. As such, they are used extensively in real-life applications.

FIGURE 6.1 Classification of data structures.

6.5.1 Stack

A stack is an ordered list with two ends, the top and the base. New items are always inserted at the top end in an operation called push. Items are also removed from the top end, in what is referred to as pop. In a stack, the last item pushed is always the first one popped, hence a stack is also called a last in, first out (LIFO) list. Besides the item at the top, other items in the stack are not directly accessible. As an analogy, one can think of a stack as a pile of plates stacked upon each other. Each new plate is placed at the top of the pile. In order to be used, a plate is also taken from the top of the pile.

Observation 6.15 – Stack: An ordered, linear list structure with two ends: top and base. Items are pushed to and popped from the top, and the last item pushed in the stack is the first to be popped out (LIFO). The operations performed on the stack are the following: initialize, push, pop, isEmpty, top, and size.

From a more formal, technical perspective, the stack ADS (Abstract Data Structure) consists of the following:

• An index pointing at the top item in the stack, with values ranging from 0 to its maximum size −1.
• The body of the stack that stores the values (i.e., the actual data of the list).
• Initialize – init(s): A function that initializes the stack (i.e., creates an empty list).
• Empty – isEmpty(s): A function that checks whether the stack (s) is empty.
• Push – push(x, s): A function that pushes a new item (x) onto the stack (s).
• Pop – pop(x, s): A function that removes the top item (x) from the stack (s) and returns it.
• Top – top(s): A function that returns the item at the top of the stack without removing it.
• Size – size(s): A function that returns the total number of items in the stack.

The following Python class (filename: Chapter6Stack.py) defines the stack structure (stack ADS):

    class Stack:
        # Initialize the stack as an empty list
        def __init__(self):
            self.items = []

        # Push an item onto the top of the stack (the end of the list)
        def push(self, item):
            self.items.append(item)

        # Remove and return the top item (raises IndexError if empty)
        def pop(self):
            return self.items.pop()

        # Check whether the stack is empty
        def isEmpty(self):
            return self.items == []

        # Return the top item without removing it (None if empty)
        def top(self):
            if (not self.isEmpty()):
                return self.items[-1]

        # Return the number of items in the stack
        def size(self):
            return len(self.items)

        # Return the contents of the stack, from base to top
        def show(self):
            return self.items

Since the class in this form is rather generic, it can be used for a variety of stack-based applications.
The following script imports the stack class from Chapter6Stack.py in order to demonstrate the basic functionality of the stack:

    import Chapter6Stack

    fruits = Chapter6Stack.Stack()

    # Confirm that the stack is empty
    if (fruits.isEmpty()):
        print("The stack is empty")

    # Push elements to the stack
    fruits.push('apple')
    fruits.push('orange')
    fruits.push('banana')

    # Confirm that the stack is not empty and print its contents
    if (not fruits.isEmpty()):
        print("The stack is not empty: Its size is: ", fruits.size())
        print("The contents of the stack are: ", fruits.show())

    # Return the top item of the stack
    print("The top item of the stack is: ", fruits.top())

    # Remove the top item of the stack, print the new top item and the stack
    print("Remove the top item of the stack: ", fruits.pop())
    print("The top item of the stack is now: ", fruits.top())
    print("The contents of the stack now are: ", fruits.show())

Output 6.5.1.a:

    The stack is empty
    The stack is not empty: Its size is: 3
    The contents of the stack are: ['apple', 'orange', 'banana']
    The top item of the stack is: banana
    Remove the top item of the stack: banana
    The top item of the stack is now: orange
    The contents of the stack now are: ['apple', 'orange']

Stacks are used extensively in computer programs. A rather common example is storing page visits on a web browser. Every page that is visited is added to a stack, and when the user clicks on the back button the last page visited is retrieved from the stack. A similar use can be found in the undo function included in most computer applications. A stack is used to store all the tasks performed in the application, and when the user clicks on the respective button, the last action is retrieved from the stack and reversed. Stacks are also useful in evaluating expressions, backtracking, and implementing recursive function calls.
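The back-button behaviour described above can be sketched in a few lines. This is only an illustrative sketch: the `visit` and `back` helpers and the page names are hypothetical, and the small `Stack` class is a cut-down copy of the chapter's class, repeated so that the example runs on its own.

```python
# Minimal stack mirroring part of the interface of the chapter's Stack class
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.pop()

    def isEmpty(self):
        return self.items == []

# Hypothetical helper: open a new page and remember the page being left
def visit(history, current, url):
    if current is not None:
        history.push(current)
    return url

# Hypothetical helper: go back to the most recently left page, if any
def back(history, current):
    if history.isEmpty():
        return current          # nothing to go back to
    return history.pop()

history = Stack()
page = None
for url in ["home", "news", "sports"]:
    page = visit(history, page, url)

page = back(history, page)      # leaves "sports"
print("Current page after going back:", page)
```

Because the most recently left page is always on top of the stack, each call to `back` retraces the visits in reverse (LIFO) order.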
As an example of a practical use of the stack, let us consider the common utility task of converting a decimal number into binary. The algorithm is quite simple: repeatedly divide the decimal number by 2 until the result is 0, while pushing the remainder of each integer division to the stack. At the end of the process, all the items are popped from the stack to get the binary representation of the decimal number. Assuming that the integer to be converted is number 21, the above procedure will result in binary number 10101 (Figure 6.2).

FIGURE 6.2 Decimal to binary number conversion.

The following script implements the conversion, utilizing the Stack ADS (Chapter6Stack.py) as in the previous example:

    import Chapter6Stack

    # decimal object implements the conversion using the stack
    decimal = Chapter6Stack.Stack()

    # Accept an integer to convert to binary form
    userInput = int(input("Enter the integer to convert to binary: "))

    # Repeatedly divide by 2; keep pushing the remainder to the stack
    while (userInput > 0):
        decimal.push(userInput % 2)
        userInput = userInput//2

    # Confirm that the stack is not empty and print its contents
    if (not decimal.isEmpty()):
        print("The stack is not empty: Its size is: ", decimal.size())
        print("The contents of the stack are: ", decimal.show())

    # Return the number in binary form by popping every stored remainder
    print("The binary form of the number is: ", end = '')
    for i in range(decimal.size()):
        print(decimal.pop(), end = '')

Output 6.5.1.b:

    Enter the integer to convert to binary: 56
    The stack is not empty: Its size is: 6
    The contents of the stack are: [0, 0, 0, 1, 1, 1]
    The binary form of the number is: 111000

6.5.2 Infix, Postfix, Prefix

Another application of a stack that is particularly important in computer science is the evaluation of arithmetic expressions.
In general, the reader should be aware of the fact that there are three kinds of arithmetic notations, namely infix, prefix, and postfix. Infix is what humans are mostly used to, as it involves a binary operator appearing between two operands and determining the type of operation that will take place between them (e.g., 3 + 5). In prefix notation, the same expression would be converted to + 3 5, where the operator precedes both operands. Likewise, the postfix notation would take the form 3 5 +, with the operator succeeding the two operands.

Observation 6.16 – Infix, Postfix, Prefix: Three different kinds of notations used to evaluate arithmetic expressions by humans or computers.

It must be noted that the postfix notation is the one used by compilers when evaluating an arithmetic expression. As such, the conversion of an infix expression that humans would understand more easily to a postfix expression that can be evaluated by compilers is a rather important task in computer science. The implementation of such a conversion raises three main points that must be addressed:

• In an infix expression, operator precedence forces multiplications/divisions to apply before additions/subtractions, whereas in a postfix expression there is no operator priority.
• When translating an infix to a postfix expression, only the placement of the operators is different. An algorithm that translates from infix to postfix only needs to shift the operators to the right, and possibly reorder them.
• Postfix expressions do not take parentheses.

The following algorithm uses a stack to temporarily store the operators until they can be inserted at the right position in the postfix expression:

• Initialize the stack.
• Scan the infix expression from left to right.
• While the scanned character is valid:
    • If the character is an operand, move it directly to the postfix expression.
    • If the character is an operator, compare it with the operator at the top of the stack:
        • While the operator at the top of the stack is of higher or equal priority than the character just encountered, and is not a left parenthesis character, pop the operator from the stack and move it to the postfix expression. Once all such operators are popped, push the current character/operator to the stack.
        • If the operator at the top of the stack is of a lower priority than the character just encountered, or if the stack is empty, push the character that was just encountered to the stack.
    • If the character is a left parenthesis, push the character onto the stack.
    • If the character is a right parenthesis, pop the operators off the stack and move them to the postfix expression until a left parenthesis is reached. Pop the left parenthesis and ignore it.
• After the entire infix expression has been scanned, pop any remaining operators from the stack and move them to the postfix expression.

The following infix expressions and their postfix equivalents illustrate the result of the conversion:

• 2 + 3 = 5 → 2 3 + = 5
• 2 x 5 + 3 = 13 → 2 5 x 3 + = 13
• 2 + 5 x 3 = 17 → 2 5 3 x + = 17
• 2 x 3 + 5 x 4 = 26 → 2 3 x 5 4 x + = 26
• 2 + 3 x 5 + 4 = 21 → 2 3 5 x + 4 + = 21

As an example, Figure 6.3 illustrates the use of a stack to convert infix expression 2 + 3 x 5 + 4 into postfix.

FIGURE 6.3 Infix expression remaining to be evaluated.

Figures 6.4 and 6.5 demonstrate a more complex case of an infix to postfix expression conversion that includes operators in parentheses: 2 x (7 + 3 x 4) + 6.

The evaluation of a postfix expression utilizes the steps described in the algorithm below:

• Scan the postfix expression from left to right.
• If an operand is encountered, push it to the stack.
• If an operator is encountered, apply it to the top two operands of the stack and replace the two operands with the result of the operation.
• After scanning the entire postfix expression, the stack should have one item, which is the value of the expression.
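The two algorithms above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names `infix_to_postfix` and `evaluate_postfix` and the `PRIORITY` table are hypothetical, the input is assumed to be a well-formed expression already split into tokens, a plain Python list plays the role of the stack, and `x` is accepted as the multiplication sign to match the notation used in the text.

```python
# Operator priorities (assumed table); "x" is treated as multiplication
PRIORITY = {"+": 1, "-": 1, "*": 2, "x": 2, "/": 2}

def infix_to_postfix(tokens):
    """Convert a list of infix tokens to a list of postfix tokens."""
    stack = []      # temporarily holds operators and left parentheses
    postfix = []
    for tok in tokens:
        if tok in PRIORITY:
            # Pop operators of higher or equal priority, stopping at "("
            while (stack and stack[-1] != "("
                   and PRIORITY[stack[-1]] >= PRIORITY[tok]):
                postfix.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Move operators to the output until the matching "("
            while stack[-1] != "(":
                postfix.append(stack.pop())
            stack.pop()                  # discard the "(" itself
        else:
            postfix.append(tok)          # operands go straight to the output
    # Move any remaining operators to the output
    while stack:
        postfix.append(stack.pop())
    return postfix

def evaluate_postfix(tokens):
    """Evaluate a list of postfix tokens and return the result as a float."""
    stack = []
    for tok in tokens:
        if tok in PRIORITY:
            right = stack.pop()          # the top item is the right operand
            left = stack.pop()
            if tok == "+":
                stack.append(left + right)
            elif tok == "-":
                stack.append(left - right)
            elif tok in ("*", "x"):
                stack.append(left * right)
            else:
                stack.append(left / right)
        else:
            stack.append(float(tok))
    return stack.pop()

postfix = infix_to_postfix("2 x ( 7 + 3 x 4 ) + 6".split())
print(" ".join(postfix))
print(evaluate_postfix(postfix))
```

Note that the right operand is popped first during evaluation; this ordering matters for subtraction and division, where the operands cannot be swapped.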
Figure 6.6 illustrates how expression 1 6 + 5 2 – x is evaluated using a stack.

FIGURE 6.4 Infix to postfix with parenthesis – Part A.

FIGURE 6.5 Infix to postfix with parenthesis – Part B.

FIGURE 6.6 Evaluating a postfix expression.

6.5.3 Queue

A queue is also a linear structure, in which items are added at one end through a process called enqueue, but removed from the other end through what is referred to as dequeue. The two ends are called rear and front. Unlike the stack, in a queue the items that are added first are also removed first, hence it is also described as a first in, first out (FIFO) structure. A queue is analogous to people waiting in line to purchase a ticket or pay a bill. The person first in line is the first one to be served. The following is a visual illustration of the queue structure:

Observation 6.17 – Queue: An ordered, linear list structure with two ends: rear and front. Items are enqueued at one end and dequeued at the other. The first enqueued item is also the first to be dequeued (FIFO). The operations performed on the queue are the following: initialize, enqueue, dequeue, isEmpty, peek, and size.

Figure 6.7 below illustrates the execution of a simple queue:

FIGURE 6.7 Execution of a simple queue.

In computer science, queues are used extensively to schedule tasks, such as printing or managing CPU processes. When multiple users submit print jobs, the printer queues all the jobs and prints them on a first-come-first-served basis. Similarly, when multiple processes need to use the CPU, the order of execution is scheduled and performed through a queue structure.

The queue ADS consists of the following:

• An index that points to the front item of the queue.
• An index that points to the rear item of the queue.
• The body of the queue that stores its values (i.e., the actual data in the list).
• Initialize – init(q): A function that initializes the queue (i.e., creates the empty list).
• Empty – isEmpty(q): A function that checks whether the queue is empty.
• Enqueue – enqueue(x, q): A function that adds an item to the rear end of the queue.
• Dequeue – dequeue(x, q): A function that returns the item at the front end of the queue and removes it from the queue.
• Front – peek(q): A function that returns the item at the front of the queue without removing it.
• Size – size(q): A function that returns the number of items in the queue.

The Python class provided below (filename: Chapter6Queue.py) is an implementation of the queue ADS:

    class Queue:
        # Initialize the queue
        def __init__(self):
            self.items = []

        # Check whether the queue is empty
        def isEmpty(self):
            return self.items == []

        # Add an item to the rear of the queue (index 0 of the list)
        def enqueue(self, item):
            self.items.insert(0, item)

        # Remove and return the item at the front (the end of the list)
        def dequeue(self):
            if not self.isEmpty():
                return self.items.pop()

        # Return the item at the front of the queue without removing it
        def peek(self):
            if not self.isEmpty():
                return self.items[-1]

        # Return the number of items in the queue
        def size(self):
            return len(self.items)

        # Return the contents of the queue, from rear to front
        def show(self):
            return self.items

The following script (filename: Chapter6QueueExample) imports and runs a simple queue ADS:

    import Chapter6Queue

    q = Chapter6Queue.Queue()
    print(q.isEmpty())
    q.enqueue('Task A')
    print(q.show())
    q.enqueue('Task B')
    print(q.show())
    q.enqueue('Task C')
    print(q.show())
    print(q.dequeue())    # removes Task A
    print(q.show())
    print(q.dequeue())    # removes Task B
    print(q.show())       # q has only one task left
    print(q.size())

Output 6.5.3:

    True
    ['Task A']
    ['Task B', 'Task A']
    ['Task C', 'Task B', 'Task A']
    Task A
    ['Task C', 'Task B']
    Task B
    ['Task C']
    1

6.5.4 Circular Queue

A circular queue is essentially the same as a regular queue, but with two major differences. First, the size of the circular queue does not change. This size restriction can be viewed as the main weakness of the circular queue. Second, its front and rear are continuously moving in a circular form based on the demand for enqueue and dequeue, provided that there is available empty space and that they do not clash with each other (i.e., the front cannot advance onto the list index occupied by the rear). This is an important observation, as it is possible that the front item is stored before the rear one on the circular queue. Because of these qualitative differences, a circular queue ADS needs to check whether the queue is full before enqueuing a new item in it. Figure 6.8 provides an illustration of the circular queue operation.

Observation 6.18 – Circular Queue: A structure similar to a queue with the difference that its size does not change and the front and rear are movable. This is based on the demand for enqueue and dequeue in a circular form, allowing for the front item to be stored before the rear.

FIGURE 6.8 Example of circular queue.

The following script (filename: Chapter6CircularQueue) defines and runs an implementation of the circular queue ADS:
    class CircularQueue():
        # Initialize the circular queue to the preferred size
        # with all its items empty and the front and rear starting at -1.
        # Note: in this implementation new items are written at the front
        # index and removed at the rear index
        def __init__(self, maxSize):
            self.cqSize = maxSize
            self.queue = [None] * self.cqSize
            self.front = self.rear = -1

        # Insert an item into the circular queue
        def enqueue(self, data):
            # Insert the first item to the queue, start the front and rear
            if (self.front == -1):
                self.front = self.rear = 0
                self.queue[self.rear] = self.queue[self.front] = data
            # Insert items to the queue
            else:
                # Only be concerned with the front item; use % and the size
                # of the queue to move the front in a circular manner
                self.front = (self.front + 1) % self.cqSize
                self.queue[self.front] = data
            print("Queue size: ", self.cqSize, "Queue front: ", self.front,
                  "Queue rear: ", self.rear)

        # Delete an item from the circular queue
        def dequeue(self):
            if (self.front == -1):
                print("The circular queue is empty\n")
            # If the front item is the same as the rear the queue has only
            # one item; empty the queue
            elif (self.front == self.rear):
                self.front = self.rear = -1
            else:
                # Only be concerned with the rear item; use % and the size
                # of the queue to move the rear in a circular form
                self.queue[self.rear] = None
                self.rear = (self.rear + 1) % self.cqSize
            print("Queue size: ", self.cqSize, "Queue front: ", self.front,
                  "Queue rear: ", self.rear)

        # The printCQueue will display the contents of the circular queue
        def printCQueue(self):
            # If the rear (and hence the front) is -1 the queue is empty
            if (self.rear == -1):
                print("No element in the circular queue")
            # If the front index is not behind the rear, the occupied
            # slots run from rear to front without wrapping around
            elif (self.front >= self.rear):
                for i in range(self.rear, self.front + 1):
                    print(self.queue[i], end = " ")
            # If front is less than rear, the queue has completed a circle
            else:
                for i in range(self.front + 1):
                    print(self.queue[i], end = " ")
                for i in range(self.rear, self.cqSize):
                    print(self.queue[i], end = " ")
            print()

        # Check whether the circular queue is full
        def isFull(self):
            return (self.front + 1) % self.cqSize == self.rear

    # Ask the user for the preferred size for the circular queue
    maxSize = int(input("Enter the size of the circular queue:"))
    cq = CircularQueue(maxSize)

    # Keep working on the circular queue until input is not E or D
    while (True):
        # Ask the user for the next move, enqueue or dequeue
        choice = input("(E)nqueue or (D)equeue or (Q)uit?")
        if (choice == "E"):
            if (not cq.isFull()):
                newItem = int(input("Enter the next item of the circular queue:"))
                cq.enqueue(newItem)
            else:
                print("The queue is full. Cannot insert a new item")
        elif (choice == "D"):
            cq.dequeue()
        else:
            break
        print("The updated Queue is: ", end = " ")
        cq.printCQueue()

Output 6.5.4:

    Enter the size of the circular queue:3
    (E)nqueue or (D)equeue or (Q)uit?E
    Enter the next item of the circular queue:10
    Queue size: 3 Queue front: 0 Queue rear: 0
    The updated Queue is: 10
    (E)nqueue or (D)equeue or (Q)uit?E
    Enter the next item of the circular queue:20
    Queue size: 3 Queue front: 1 Queue rear: 0
    The updated Queue is: 10 20
    (E)nqueue or (D)equeue or (Q)uit?E
    Enter the next item of the circular queue:30
    Queue size: 3 Queue front: 2 Queue rear: 0
    The updated Queue is: 10 20 30
    (E)nqueue or (D)equeue or (Q)uit?E
    The queue is full. Cannot insert a new item
    The updated Queue is: 10 20 30
    (E)nqueue or (D)equeue or (Q)uit?D
    Queue size: 3 Queue front: 2 Queue rear: 1
    The updated Queue is: 20 30
    (E)nqueue or (D)equeue or (Q)uit?D
    Queue size: 3 Queue front: 2 Queue rear: 2
    The updated Queue is: 30
    (E)nqueue or (D)equeue or (Q)uit?E
    Enter the next item of the circular queue:40
    Queue size: 3 Queue front: 0 Queue rear: 2
    The updated Queue is: 40 30
    (E)nqueue or (D)equeue or (Q)uit?

6.6 DYNAMIC DATA STRUCTURES

The data structures described in the previous sections are characterized as static, since they all use inherently static list structures. To some extent, all of these structures suffer from restrictions associated with the requirement for large amounts of memory, generally weak performance due to the heavy nature of the tasks, and a certain inflexibility. The previously discussed cases have demonstrated that the execution of even the most advanced algorithms tends to become impractical as the size of the structures increases. In order to address this issue, there is a need for more effective data structures that allocate the available computer memory only as and when necessary, and in the most efficient way possible. Structures that fall under this category are collectively known as dynamic data structures. Some of the most important of these structures are introduced and briefly discussed in the following sections.

6.6.1 Linked Lists

A linked list is a collection of nodes linked to each other through pointers. The structure is recursive by definition. Each node includes a data value and a pointer to the first node of a subsequent linked list, or to null if the latter is empty. In order to navigate a linked list, it is necessary to create a separate object, called head, that always points to the first node of the list. Subsequent nodes are accessed via the associated pointers, stored in each node. If the list is empty, the head will simply point to a null value. In a similar fashion, the link pointer of the last node is set to null to mark the end of the list.

Observation 6.19 – Linked List: A structure of connected nodes. Each node contains a data value and a pointer to the first node of the subsequent list. A head pointer is always pointing to the first node. The last node points to null. The rest of the nodes are defined as intermediate.

There is only one head, and it is always pointing to the first node of the linked list. Similarly, there is only one tail (i.e., the last node), pointing to null. All other nodes are called intermediate nodes and have both a predecessor and a successor. Traversing (i.e., moving through) intermediate nodes towards the tail starts at the first node of the list, pointed to by the head. For this purpose, it is best to create another object, usually called current, that is used to move between the intermediate nodes in the list.

The strength of the linked list is that its data are stored dynamically, with new nodes created only if and when necessary, and unwanted nodes deleted if they are not in use. Separately from the data, the pointer of every newly created node is set to point to null. Nodes can store any data type, but all nodes of a linked list need to store the same data type. Figure 6.9 illustrates the structure of a linked list. Notice how the head points to the first node and that the last node points to null:

FIGURE 6.9 Linked list.

The implementation of a linked list requires two classes. The first is the node class, which contains a data value and a pointer to the next item. For any new node that is created, next will point to null. The second is the linked list itself, which contains the head pointer to the first item in the list and the current_node that is used to move through the list. Both the head and the current_node will initially point to null, since there are no items in the list.
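As a small illustration of the node structure just described, the sketch below links three nodes by hand and traverses them with a separate `current` reference. It is a minimal sketch only: the `Node` class follows the description above, while the sample values and the `values` list are hypothetical names introduced for the example.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None        # a newly created node points to null (None)

# Build the list 5 -> 3 -> 7 by hand; head always points to the first node
head = Node(5)
head.next = Node(3)
head.next.next = Node(7)

# Traverse with a separate reference so that head itself is never moved
values = []
current = head
while current is not None:
    values.append(current.data)
    current = current.next

print(values)
```

Moving `current` rather than `head` matters: if the head reference were advanced instead, the nodes already passed could no longer be reached.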
The linked list ADS (Abstract Data Structure) includes the following operations:

• Instantiating & initializing the list: This function is used to create the head and the current object that initially point to null (i.e., the empty list; Figure 6.10). The Python code for this function is the following:

    def __init__(self):
        self.head = self.current_node = None

FIGURE 6.10 New linked list.

• Checking if the list is empty: This function checks whether the linked list is empty, in which case no more nodes can be deleted and any newly inserted node must be the first in the list. The Python code is the following:

    def isEmpty(self):
        # The list is empty when the head points to null (None)
        return self.head is None

• Reading and printing the list: It is often useful to print the nodes of the list and provide information about its size (i.e., the number of nodes it contains). In order to do this, it is necessary to traverse (i.e., read through) the list starting at the first node. While the current_node value is not null, current node values are read/printed successively as the list is traversed. Figure 6.11 illustrates this process diagrammatically. The related Python code is presented below:

    def readList(self):
        count = 0
        current_node = self.head
        print("The current list is: ", end = " ")
        while (current_node):
            count += 1
            print(current_node.data, " ", end = "")
            current_node = current_node.next
        print("\nThe size of the linked list is: ", count)

• Inserting a new node in the list: A new node can be inserted either as the first element, when the list is empty, or as the last element appended to the list. In both cases, a new node is created (including the associated data) and its next element is set to point to null. In the former case, the head is then set to point to the new node (Figure 6.12). In the case of appending a new element to the list, after the new node is created, the list is traversed until the last node is reached. Once this is done, the next element of the last node is set to point to the newly created node (Figure 6.13). The related Python code is presented below:

    def append(self, data):
        # Create the newNode to append to the linked list
        newNode = Node(data)
        # Case 1: List is empty
        if (self.head == None):
            self.head = newNode
            return
        # Case 2: If the list is not empty start the
        # current node at the head of the list
        current_node = self.head
        # Loop through the linked list until the current node
        # has next pointing to None
        while (current_node.next):
            current_node = current_node.next
        # Add new node to the end of the list
        current_node.next = newNode

• Deleting a node: This operation starts by checking if the linked list is empty. If not, it searches for the data that must be deleted. If the data are not found, the list remains as is. If the data are found, the node they belong to is deleted and the list is updated accordingly. There are two cases to consider in relation to this process. The first case is that the node to be deleted is the first one in the list. In this case, the process simply involves moving the head to the next node and setting the next pointer of the deleted node to null. The second case is that the node to be deleted is not the first one in the list. In this case, it is necessary to also find the nodes before and after the deleted one, and keep references to them. With this information at hand, the next pointer of the node preceding the deleted one is made to point to the node succeeding it. Finally, the pointers of the deleted node are removed. Figure 6.14 illustrates this process diagrammatically.

FIGURE 6.11 Traversing the linked list.

FIGURE 6.12 Inserting the first node.

FIGURE 6.13 Appending a node to the list.

FIGURE 6.14 Deleting a node from a linked list.
The following Python script demonstrates the deletion process:

    def delete(self, data):
        if (self.isEmpty()):
            print("There is no node available to delete. "
                  "The linked list is empty.")
        else:
            current_node = self.head
            # Case 1: If the node to be deleted is the first node
            if (current_node and current_node.data == data):
                # Set the head of the list to the next item
                self.head = current_node.next
                # Set the deleted item's pointer to null
                current_node.next = None
                return
            # Case 2: Keep track of the previous node while searching
            # for the node to be deleted
            previous_node = None
            while (current_node and current_node.data != data):
                previous_node = current_node
                current_node = current_node.next
            # Check if the node was found; if not, leave the list as is
            if (current_node is None):
                return
            # Bypass the deleted node
            previous_node.next = current_node.next
            current_node = None

• Destroying the list: Since building a linked list involves the dynamic allocation of memory in the form of pointers, it is advisable that before the underlying application stops, any pointers and memory allocated during its lifecycle are freed and released back to the system. The following Python code demonstrates a possible implementation of this task:

    def destroyList(self):
        temp = self.head
        # Detach the nodes one by one, moving the head along the list
        while (temp):
            self.head = temp.next
            temp = None
            temp = self.head
        print("\n The linked list is deleted")
        # Confirm that the list is now empty
        self.readList()

The reader can merge the above functions and commands as in the code example provided below (the code is arranged into two classes, stored in file Chapter6LinkedList.py):

    class Node:
        def __init__(self, data):
            self.data = data
            self.next = None

    class LinkedList:
        def __init__(self):
            ...
        def append(self, data):
            ...
        def delete(self, data):
            ...
        def destroyList(self):
            ...
        def readList(self):
            ...
        def isEmpty(self):
            ...
The following script (filename: Chapter6LinkedListExample) uses the class, as discussed above:

    import Chapter6LinkedList

    ll = Chapter6LinkedList.LinkedList()

    while (True):
        print("[A]: Append a new node")
        print("[D]: Delete a particular node")
        print("[Q]: Clear all list and exit")
        print("[P]: Print the current list")
        choice = input("Enter your choice: ")

        if (choice == "A"):
            newNode = int(input("Enter the new node value to append to the list: "))
            ll.append(newNode)
        elif (choice == "D"):
            deleteNode = int(input("Enter the node to delete: "))
            ll.delete(deleteNode)
        elif (choice == "P"):
            ll.readList()
        else:
            # Q (or any other input) clears the list and exits
            ll.destroyList()
            break

Output 6.6.1:

    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: A
    Enter the new node value to append to the list: 5
    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: A
    Enter the new node value to append to the list: 3
    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: A
    Enter the new node value to append to the list: 7
    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: P
    The current list is: 5 3 7
    The size of the linked list is: 3
    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: D
    Enter the node to delete: 3
    [A]: Append a new node
    [D]: Delete a particular node
    [Q]: Clear all list and exit
    [P]: Print the current list
    Enter your choice: P
    The current list is: 5 7
    The size of the linked list is: 2

In addition to the operations discussed above, the effectiveness of the linked list could also be improved by:

• Inserting a new node before/after an existing node based on its data.
• Searching for a node using key data, and retrieving the data and the positional index of the node.
• Modifying the data of a particular node within the list.
• Sorting the linked list.

Some key points when implementing linked lists or related structures are summarized in the list below:

• To access the nth node of a linked list, it is necessary to pass through the first n−1 nodes.
• If nodes are added at a particular position instead of just being appended, the insertion will result in a node index change.
• Deletion of nodes will result in a node index change.
• Trying to store the node indices in a linked list is of no use, since they are constantly changing (indeed, there are no actual indices in such a list).
• To append a node, one has to traverse the whole list and reach the last node.
• In addition to the head and current_node pointers, adding a tail pointer to the last node of the list makes appending easier and more efficient.
• To delete the last node, one has to traverse the whole list and find the last two nodes.
• If for any reason the head pointer is lost, the linked list cannot be read and retrieved.

A particular variation of the linked list is the circular linked list, in which the last node is linked to the first. It is used when the node next to the last corresponds to the first one, such as in the cases of the weekdays or the ring network topology. The advantage of the circular linked list is that it can be traversed starting at any node, eventually returning to the node it started from in a circular manner. Figure 6.15 provides an illustration of a simple circular linked list.

FIGURE 6.15 A circular linked list.

6.6.2 Binary Trees

The previous section focused on the singly linked list, in which the pointer of each node points to the next node.
The main problem with this type of linked list is that it does not offer direct access to the previous node. This can make the process of deleting nodes from the list rather complicated. Doubly linked lists can address this problem. As the name implies, the main difference between singly and doubly linked lists is that the nodes of the latter contain two pointers instead of one, with the additional pointer pointing to the previous node. Despite the obvious functional advantage of this additional pointer, it tends to make operations more complicated and causes additional overhead, as an extra pointer is added to every node. Figure 6.16 provides an illustration of the inner structure of a doubly linked list node and an example of the connections in a three-node doubly linked list.

Observation 6.20 – Doubly Linked List: A structure similar to a singly linked list, but containing two pointers pointing to both the next and previous nodes instead of just one (next).

FIGURE 6.16 A node of a doubly linked list.

Among the most important structures whose nodes carry two pointers is the binary tree (Figure 6.17), a rooted tree in which every node has at most two children (i.e., degree 2). In a binary tree the two pointers refer to the left and right children of a node, rather than to its previous and next nodes. Its recursive definition declares that a binary tree is either an external node (leaf) or an internal node (root/parent) and up to two sub-trees (a left subtree and a right subtree). In simple terms, if a node is a root, it has one or two children nodes but no parent; if it is a leaf, it has a parent node but no children; and every node is an element that contains data. The number of levels in the tree is defined as its depth.

FIGURE 6.17 Binary trees.

Example 1 in Figure 6.17 shows an unfinished binary tree with degree 2 and a depth of three levels. The tree has 76 as its root, 26 and 85 as children nodes, and 27, 24, and 18 as leaf nodes. Example 2 shows a completely unbalanced binary tree and Example 3 a mixed case.

FIGURE 6.18 Decision trees.
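The notions of leaf, internal node, and depth described above can be illustrated with a minimal sketch. The class name TreeNode, the helper functions, and the values in the sample tree are all hypothetical and chosen only for illustration; they do not reproduce the trees of Figure 6.17.

```python
class TreeNode:
    """A binary tree node: data plus pointers to up to two children."""
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def is_leaf(node):
    # A leaf (external node) has no children
    return node.left is None and node.right is None

def depth(node):
    # Number of levels in the (sub)tree rooted at node; 0 for an empty tree
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

def leaves(node):
    # Collect the data of all leaf nodes, from left to right
    if node is None:
        return []
    if is_leaf(node):
        return [node.data]
    return leaves(node.left) + leaves(node.right)

# Hypothetical three-level tree (illustrative values only)
root = TreeNode(10,
                TreeNode(5, TreeNode(2), TreeNode(7)),
                TreeNode(15, None, TreeNode(20)))
print("Depth:", depth(root))
print("Leaves:", leaves(root))
```

In this sketch the right child of the root has only one child, which is still a valid binary tree since every node has at most two children.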
Binary trees are commonly used in decision tree structures (Figure 6.18), although this may often go unnoticed.

Observation 6.21 – Binary Tree: A rooted tree in which every node is either an external node (leaf) or an internal node (root/parent), with up to two sub-trees (a left subtree and a right subtree).

6.6.3 Binary Search Tree

A particular type of binary tree is the binary search tree. Its definition is the same as that of the regular binary tree, but with the following additional properties:

• All elements rooted at the right child of a node have higher values than that of the parent node.
• All elements rooted at the left child of a node have lower values than that of the parent node.

Observation 6.22 – Binary Search Tree: A structure based on a binary tree with the difference that all elements rooted at the right child of a node are greater and those rooted at its left child lower than the value of the parent node.

In the example provided in Figure 6.19, every node in the left subtree of the root has a lower value than 43, while every node in the right subtree has a higher value. The same rule applies recursively to the internal nodes (e.g., the node with value 56). The arrangement could be reversed by placing the smaller values in the right subtrees and the larger values in the left subtrees, but the logic of the binary tree structure remains the same.

There are three systematic ways to visit all the nodes of a binary search tree: preorder, inorder, and postorder. If the left subtree contains values that are lower than the root node, all three of these will traverse the left subtree before the right subtree. Their only difference lies in when the root node is visited and read (Table 6.12).

The implementation of a linked list requires two classes. The first is the node class, containing the data and a pointer to the next item.
For any new node that is created, next will point to null. The second is the linked list itself, and contains the head pointer (pointing to the first item in the list) and the current_node that is used to move through the list. Both the head and the current_node will initially point to null since there are no items in the list.

FIGURE 6.19 Binary search tree.

TABLE 6.12 Searching a Node in a Binary Search Tree

Inorder Traversal
  1. Traverse the left subtree.
  2. Visit/read the root node.
  3. Traverse the right subtree.
  Resulting list: 20, 28, 31, 33, 40, 43, 47, 56, 59, 64, 89

Preorder Traversal
  1. Visit/read the root node.
  2. Traverse the left subtree.
  3. Traverse the right subtree.
  Resulting list: 43, 31, 20, 28, 40, 33, 64, 56, 47, 59, 89

Postorder Traversal
  1. Traverse the left subtree.
  2. Traverse the right subtree.
  3. Visit/read the root node.
  Resulting list: 28, 20, 33, 40, 31, 47, 59, 56, 89, 64, 43

In its most basic form, the binary search tree ADS includes the following operations:

• Instantiating & initializing the Binary Search Tree (BST): This function is used to create each new node in the BST, allocating the necessary memory and initializing its pointers to both the left and right subtrees to null. Figure 6.20 provides a visual representation of the new node and the following code excerpt illustrates its implementation:

    class BinarySearchTree:
        def __init__(self, key):
            # A new node has no subtrees yet and stores the given key
            self.left = None
            self.right = None
            self.data = key

FIGURE 6.20 New node for the BST.

• Inorder traversal of the BST: The inorder function, one of the most well-known functions associated with dynamic data structures, is also among the easiest ones. The following Python code and Figure 6.21 illustrate its operation:

    def traverseInorderBST(root):
        # If the current node is not empty, traverse its left subtree,
        # then print its data, and then traverse its right subtree
        if (root):
            traverseInorderBST(root.left)
            print(root, root.data)
            traverseInorderBST(root.right)

FIGURE 6.21 Traversing the BST inorder.

• Inserting a new node to the BST: The goal of this function is to place the new data in the correct position in the BST. When the BST is empty, the new node simply initializes it. In all other cases, the function recursively checks whether the data value in the new node is lower than, equal to, or higher than the data in the current node, and keeps on moving to the respective subtree accordingly until the current node is empty. At that point, it finally assigns the new node. Figure 6.22 illustrates this process by inserting nodes from the following list to a BST: 43, 31, 64, 56, 20, 40, 59, 28, 33, 47, 89. The Python code for this function is the following:

    def insert(root, key):
        # If there is no BST create its first node
        if (root is None):
            return BinarySearchTree(key)
        else:
            # If the current node's data is less than or equal
            # to the new key, move into the right subtree;
            # otherwise, move into the left subtree recursively
            if (root.data <= key):
                root.right = insert(root.right, key)
            else:
                root.left = insert(root.left, key)
        return root

FIGURE 6.22 Inserting nodes to the BST.

• Searching for a key value in the BST: This function searches the BST for a key value provided by the user. As with the previous functions, it recursively calls itself on either the left or right subtree in an effort to find a match for the key value. If the search runs out of nodes without finding the key value, an empty BST (None) is returned. Unless the calling function checks for this case, any attempt to use the result as a node raises an error and crashes the application. Figure 6.23 illustrates both a case where the key is found and one where it is not.
The following Python code provides an implementation of this function:

    def search(root, key):
        # If the current subtree is empty, the key is not in the BST,
        # so return the empty BST
        if (root is None):
            return None
        # Recursively visit the left or right subtree to find
        # the node that matches the key searched for
        if (root.data == key):
            return root
        if (root.data < key):
            return search(root.right, key)
        else:
            return search(root.left, key)

FIGURE 6.23 Data search in a BST.

• Deleting a node from the BST: Arguably, this is the most complex function in the BST ADS. If the current root is empty, which may be because the key was not found, there is nothing to be done and the current BST is returned as is. In any other case, the key is either in the current node or in its left or right subtree. If the key is found in the current node and the right subtree is empty, the function replaces the current node with its left subtree. Accordingly, if the left subtree is empty, the current node is replaced with its right subtree. If neither subtree is empty, the function finds the minimum data in the right subtree, copies it into the current node, and then deletes the node holding that minimum from the right subtree. If the key is not found in the current node, the function is called recursively on the left or the right subtree, depending on whether the key value is lower or higher than the current node data.
Figure 6.24 illustrates this process and the related Python script is provided below:

    def delete_Node(root, key):
        # If the root is empty, return it; if not, if the key is larger
        # than the current root, find it in the right subtree; otherwise,
        # if it is smaller, find it in the left subtree
        if (root is None):
            return root
        elif (root.data > key):
            root.left = delete_Node(root.left, key)
        elif (root.data < key):
            root.right = delete_Node(root.right, key)
        # If the key is matched, then, if there is no right subtree just
        # replace the current node with the left subtree; similarly, if
        # there is no left subtree just replace the current node with
        # the right subtree
        elif (root.data == key):
            if (root.right is None):
                return root.left
            if (root.left is None):
                return root.right
            # If neither subtree is empty, replace the data in the
            # current node with the minimum data in the right subtree
            # and delete the node with that minimum data from the
            # right subtree
            temp = root.right
            mini_data = temp.data
            while (temp.left):
                temp = temp.left
                mini_data = temp.data
            root.data = mini_data
            root.right = delete_Node(root.right, root.data)
        return root

FIGURE 6.24 Deleting a node from a BST.

• Destroying the BST: As with most structures occupying computer memory space, it is advisable that the BST is deleted (i.e., destroyed) when exiting the application. The following Python code excerpt provides a possible implementation of this task:

    def destroyBST(root):
        if (root):
            destroyBST(root.left)
            destroyBST(root.right)
            print("Node destroyed before exiting: ", root, root.data)
            root = None

Finally, it must be noted that the performance of the BST in terms of searching, inserting, or deleting depends on how balanced it is. In the case of well-balanced BSTs, the performance is O(log n), while in extremely unbalanced cases it degrades to O(n).
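The effect of balance on performance can be made concrete with a short sketch. It reuses the BinarySearchTree node and the insert() function in the form given above, and adds three helper functions (build, inorder_list, and height) that are hypothetical names introduced here only for illustration. The same eleven keys are inserted twice: once in the order used for Figure 6.22 and once in ascending order.

```python
class BinarySearchTree:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.data = key

def insert(root, key):
    # Keys equal to or larger than the current data go to the right
    if root is None:
        return BinarySearchTree(key)
    if root.data <= key:
        root.right = insert(root.right, key)
    else:
        root.left = insert(root.left, key)
    return root

def build(keys):
    # Hypothetical helper: insert the keys one by one into an empty BST
    root = None
    for key in keys:
        root = insert(root, key)
    return root

def inorder_list(root):
    # Hypothetical helper: return the inorder traversal as a list
    if root is None:
        return []
    return inorder_list(root.left) + [root.data] + inorder_list(root.right)

def height(root):
    # Hypothetical helper: number of levels in the tree
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

keys = [43, 31, 64, 56, 20, 40, 59, 28, 33, 47, 89]
mixed = build(keys)            # insertion order of Figure 6.22
skewed = build(sorted(keys))   # ascending order: every node goes right

print("Inorder:", inorder_list(mixed))
print("Levels (mixed order): ", height(mixed))
print("Levels (sorted order):", height(skewed))
```

Both trees hold the same data and produce the same inorder list, but the tree built from sorted input has one level per key. A search in it may therefore have to visit every node, which is the O(n) case mentioned above.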
6.6.4 Graphs

A graph is a non-linear data structure consisting of nodes, also called vertices, which may or may not be connected to other nodes. The line or path connecting two nodes is called an edge. If edges have particular flow directions, the graph is said to be directed. Graphs with no directional edges are referred to as undirected graphs (Figure 6.25). A directed graph consists of a set of vertices and a set of arcs. The vertices are also called nodes or points.

FIGURE 6.25 An undirected graph.

Observation 6.23 – Graph: A non-linear structure of nodes/vertices interconnected through edges. Edges may have a particular direction (directed graphs) or not (undirected graphs). Graphs can be presented as static adjacency matrices or as dynamic adjacency lists.

An arc is an ordered pair of vertices (V, W); V is called the tail and W is called the head of the arc. An arc (V, W) is often expressed as V → W (Figure 6.26).

FIGURE 6.26 Arc (V, W).

A path in a directed graph can be described as a sequence of vertices V1, V2, …, Vn such that V1 → V2, V2 → V3, …, Vn−1 → Vn are arcs. In this case, the path starts at vertex V1, passes through vertices V2, V3, …, Vn−1, and ends at vertex Vn. The length of the path is the number of arcs on the path, in this particular case n−1. A path is simple if all vertices, except possibly the first and last, are distinct. A simple cycle is a simple path of a length of at least one that begins and ends at the same vertex. A labeled graph is one in which each arc and/or vertex can have an associated label that carries some kind of information (e.g., a name, cost, or other values associated with the arc/vertex).

There are two ways to represent a directed graph: as a static adjacency matrix or as a dynamic adjacency list.
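As a rough illustration of the two representations, the sketch below stores the same small directed graph both ways. The graph itself (four vertices and four arcs) is hypothetical, and a dictionary of Python lists stands in for the linked lists of a true adjacency list; this is an assumption made to keep the example short.

```python
# Hypothetical directed graph: vertices 0..3 and a handful of arcs
num_vertices = 4
arcs = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix: entry [v][w] is True if and only if there is an arc v -> w
matrix = [[False] * num_vertices for _ in range(num_vertices)]
for v, w in arcs:
    matrix[v][w] = True

# Adjacency list: for every vertex, the vertices its arcs point to
adj_list = {v: [] for v in range(num_vertices)}
for v, w in arcs:
    adj_list[v].append(w)

for row in matrix:
    print(row)
print(adj_list)
```

The matrix always holds num_vertices × num_vertices entries, whereas the adjacency list holds one entry per arc plus one (possibly empty) list per vertex.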
The prefix static refers to the use of a static structure (i.e., a list), whereas the prefix dynamic refers to the use of a dynamic structure in the form of a linked list. In the case of the former, for a directed graph G = (V, E) and assuming that V = {1, 2, …, N}, the adjacency matrix of G is an N×N matrix A of booleans, where A[i, j] is true if and only if there is an arc from vertex i to j. An extension of this scheme is what is called a labeled adjacency matrix, where A[i, j] is the label of the arc going from vertex i to vertex j; if there is no arc from i to j, there is no label to store, so the entry must hold a value that cannot be mistaken for a valid label. The main disadvantage of the adjacency matrix is that it requires storage in the order of O(n^2). In contrast, in the case of the adjacency list, which is essentially a list of pointers representing every vertex of the graph that is adjacent to vertex i, the whole structure is dynamic and, therefore, can have its memory size increased or decreased on demand. Figure 6.27 presents examples of an adjacency matrix and an adjacency list.

FIGURE 6.27 Adjacency matrix vs. adjacency list.

An undirected graph consists of a set of vertices and a set of edges. As in the case of the directed graph, the vertices are also called nodes or points. Its main difference from a directed graph is that edges are unordered, implying that (V, W) = (W, V).

The applications of graphs, both directed and undirected, are numerous. Examples include, but are not limited to, the airlines industry, the logistics and freight industries, or the various GPS and navigation systems. In all these cases, the solution to most of their operational problems is a form of the famous shortest path algorithm. The idea behind this algorithm is pretty simple.

• A directed graph G = (V, E) is drawn, in which each arc has a non-negative label and a vertex is specified as the source.
• The cost of the path that starts at the source, passes through every other vertex in V, and returns to the source is calculated (i.e., the length of the path).

The approach used in this chapter to solve this problem is the Eulerian path. Note that this is a different technique from Dijkstra's famous greedy algorithm: Dijkstra's algorithm computes the shortest paths from a source vertex over non-negative arc labels, whereas an Eulerian path uses every connection of the graph exactly once. The Eulerian path approach can be summarized in the following steps:

• Step 1: Determine if the solution is feasible, which is true only if every vertex is connected to an even number of other vertices.
• Step 2: Start with the source vertex and move to the first next available vertex in the adjacency matrix (or adjacency list).
• Step 3: Print/store the identified vertex and delete the connection just used from the adjacency matrix (or adjacency list).
• Step 4: Repeat Steps 2 and 3 until there are no more connections to use.

6.6.5 Implementing Graphs and the Eulerian Path in Python

Implementing an undirected graph implies the implementation of either an adjacency matrix or an adjacency list. Although the implementations may differ, the algorithm is basically the same in both cases: the Eulerian path approach is used to find and display a path through the connections between the vertices.

Based on the undirected graph provided in Figure 6.28, the following script offers three different scenarios (i.e., scenarios can be selected by enabling/disabling the associated commented statements). The first scenario prompts the user to enter the number of vertices in the graph. Next, it accepts the connections in the form of an adjacency matrix as 0s or 1s (fillAdjacencyMatrix()), checks whether the Eulerian path algorithm can be applied to this particular matrix, and traverses the graph and displays the resulting path. Note that this process may result in one path being inside another. In this case, in the second round, the vertex that opens the path must also close it. The reader should also notice that, in order to merge two paths, the vertex that opens and closes the second path is the one that associates the two separate cases, in the form of a zoom-in path residing inside another. The second and third scenarios involve two different, pre-defined matrices that represent graphs and are addressed accordingly:

FIGURE 6.28 An undirected graph.

    def fillAdjacencyMatrix(matrix, vertices):
        for i in range(vertices):
            col = []
            for j in range(vertices):
                print("Enter 1 if there is a connection between ", i,
                      " and ", j, " or 0 if not: ", end = " ")
                connectionExists = int(input())
                col.append(connectionExists)
            matrix.append(col)
        return matrix

    def displayAdjacencyMatrix(matrix, vertices):
        for i in range(vertices):
            print(matrix[i])

    def checkEulerian(matrix, vertices):
        # Return the lowest-numbered vertex that still has unused
        # connections, or -1 if no connections are left
        newStartVertex = -1
        for i in range(vertices-1, -1, -1):
            sumPerCol = 0
            for j in range(vertices):
                sumPerCol = sumPerCol + matrix[i][j]
            if (sumPerCol != 0):
                newStartVertex = i
        return newStartVertex

    # Ask the user for the number of graph vertices
    numVertices = int(input("Number of graph vertices: "))
    #graph = []
    graph = [[0,1,1,1,1], [1,0,1,1,1], [1,1,0,1,1], [1,1,1,0,1], [1,1,1,1,0]]
    #graph = [[0,1,0,0,0,1], [1,0,1,0,1,1], [0,1,0,1,1,1], [0,0,1,0,1,0], [0,1,1,1,0,1], [1,1,1,0,1,0]]

    # Fill the adjacency matrix
    # graph = fillAdjacencyMatrix(graph, numVertices)

    # Display the adjacency matrix before running the Eulerian Path
    displayAdjacencyMatrix(graph, numVertices)

    # Check if the Eulerian Path algorithm can be applied in this case
    startVertex = checkEulerian(graph, numVertices)
    endVertex = vertex = startVertex
    col = 0
    if (startVertex == -1):
        print("Eulerian Path cannot be applied in this case")
    else:
        # Print the vertex that opens the first round
        print("The first round: ", vertex, end = "")
        while (vertex < numVertices and col < numVertices):
            if (graph[vertex][col] == 0):
                col += 1
                if (col == numVertices or vertex == numVertices):
                    # The current round is closed; look for a vertex
                    # that still has unused connections
                    startVertex = checkEulerian(graph, numVertices)
                    if (startVertex == -1):
                        print("\nPath closed")
                    else:
                        endVertex = startVertex
                        vertex = startVertex
                        col = 0
                        print("\nZoom into", startVertex, "for the round: ",
                              startVertex, end = " ")
            elif (graph[vertex][col] == 1):
                # Follow the connection and remove it in both directions
                print("->", col, end = "")
                graph[vertex][col] = graph[col][vertex] = 0
                vertex = col
                col = 0

Output 6.6.5:

    Number of graph vertices: 5
    [0, 1, 1, 1, 1]
    [1, 0, 1, 1, 1]
    [1, 1, 0, 1, 1]
    [1, 1, 1, 0, 1]
    [1, 1, 1, 1, 0]
    The first round: 0-> 1-> 2-> 0-> 3-> 1-> 4-> 0
    Zoom into 2 for the round: 2 -> 3-> 4-> 2
    Path closed

6.7 WRAP UP

In this chapter an effort was made to briefly explain some of the most important data structures in programming and the algorithms that support them. The various scripts showcased how Python can be utilized to implement them. Naturally, there are several other data structures available and, perhaps, more efficient algorithms to implement them, but these were beyond the scope of this chapter.

6.8 CASE STUDIES

1. Create an application that implements the algorithms and tasks specified below. The application should use a GUI interface in the form of a tabbed notebook, using one tab for each algorithm. The application requirements are the following:
   a. Implement the following static sorting algorithms: bubble sort, insertion sort, shaker sort, merge sort.
   b. Ask the user to enter a regular arithmetic expression in the form of a phrase, with each of the operands limited to single-digit integer numbers. Convert the infix expression to postfix.
   c. Ask the user to enter a sequence of integers, insert them into a binary search tree and implement the BST ADS algorithm with both inorder and postorder traversals.

6.9 EXERCISES

1. Use a notebook GUI to implement the selection sort, the shell sort and the quicksort (one on each tab).
2. Use a stack to implement the following tasks:
   a. Reversing a string.
   b. Calculating the sum of integers 1…N.
   c. Calculating the sum of squares 1^2 + … + N^2.
   d. Checking if a number or word is a palindrome.
   e. Evaluating a postfix expression by using a stack.
3. Implement a deque structure with an example to test it. A deque is a linear structure of items similar to a queue in the sense that it has two ends (i.e., front and rear). However, it can enqueue and dequeue from both ends of the structure. Deque supports the following operations:
   a. add_front(item): Adds an item to the front of the deque.
   b. add_rear(item): Adds an item to the rear of the deque.
   c. remove_front(item): Removes an item from the front of the deque.
   d. remove_rear(item): Removes an item from the rear of the deque.
   e. isEmpty(): Returns a Boolean value indicating whether the deque is empty or not.
   f. peek_front(): Returns the item at the front of the deque without removing it.
   g. peek_rear(): Returns the item at the rear of the deque without removing it.
   h. size(): Returns the number of items in the deque.
4. Using a graph do the following:
   a. Ask the user to enter the number of vertices in the undirected graph.
   b. Ask the user to enter the name of each of the vertices in the undirected graph.
   c. Ask the user to enter the vertices connected by each of the edges in the undirected graph.
   d. Determine whether the Eulerian path solution is feasible.
   e. If it is not, ask the user to add the missing connections.
   f. Create the adjacency matrix for the graph and display it.
   g. Create the adjacency list for the graph and display it.
   h. Run Dijkstra's algorithm to find the shortest path, starting from a source entered by the user.
   i. Display the solution of the shortest path.

REFERENCES

Dijkstra, E. W. (1976). A Discipline of Programming. Englewood Cliffs: Prentice-Hall.
Hoare, C. A. R. (1961). Algorithm 64: Quicksort.
Communications of the ACM, 4(7), 321.
Knuth, D. E. (1997). The Art of Computer Programming (Vol. 3). Pearson Education.
Stroustrup, B. (2013). The C++ Programming Language. India: Pearson Education.

7 Database Programming with Python

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Christos Manolas
The University of York
Ravensbourne University London

Tareq Alhousary
University of Salford
Dhofar University

DOI: 10.1201/9781003139010-7

CONTENTS

7.1 Introduction
7.2 Scripting for Data Definition Language
    7.2.1 Creating a New Database in MySQL
    7.2.2 Connecting to a Database
    7.2.3 Creating Tables
    7.2.4 Altering Tables
    7.2.5 Dropping Tables
    7.2.6 The DESC Statement
7.3 Scripting for Data Manipulation Language
    7.3.1 Inserting Records
    7.3.2 Updating Records
    7.3.3 Deleting Records
7.4 Querying a Database and Using a GUI
    7.4.1 The SELECT Statement
    7.4.2 The SELECT Statement with a Simple Condition
    7.4.3 The SELECT Statement Using GUI
7.5 Case Study
7.6 Exercises
References

7.1 INTRODUCTION

Most IT professionals and scholars may agree on what makes computers special and useful: they can perform operations at lightning speed and on large volumes of data. Stemming from these two fundamental computational thinking elements are the notions of algorithms and programs as a means to process and manipulate data. In the scope of computer science, information systems, and information technology, the logical and physical organization of data falls under the broader context of databases. A thorough analysis of the various concepts related to databases and their structural design is outside the scope of this book. The reader can find relevant information in Elmasri & Navathe (2017).
The focus of this chapter is on the crossroads between computer programming with Python and a common type of database structure: the relational database. In relational databases, there are three main types of scripting techniques and/or languages that are used to perform the various associated tasks, namely Data Definition Language (DDL), Data Manipulation Language (DML), and Queries. DDL is used to create, display, modify, or delete the database and its structures and tables, and it is associated with the database schema or metadata. DML is used to insert data into the various tables, and modify or delete this data as required. It relates to the database instance or state. Queries are used to display the data in various different ways. Most commercially available Database Management Systems (DBMS) incorporate facilities and tools that utilize these three mechanisms.

Observation 7.1 – Types of Scripting in Relational Databases: There are three types of scripts addressing relational databases: Data Definition Language (DDL), Data Manipulation Language (DML), and Queries.

Observation 7.2 – Database Schema, Database Instance: The structure of a database, including table metadata, is also referred to as the database schema. The data stored on the tables at any given time are called the database instance or state.

The DBMS of choice for this chapter is MySQL (2021). This is part of a package that includes both the DBMS and a local server solution called Apache (2021). The package supports both Windows and Mac OS systems, and the two associated versions come under the name MAMP. The packages are free for download from MAMP (2021) and Oracle (2021b) and the installation is pretty intuitive and straightforward.
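To make the distinction between the three types of scripts more tangible, the following minimal sketch issues one statement of each kind. It deliberately uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, so that it runs without a server; this substitution, as well as the student table and its contents, are assumptions made only for this illustration and are not part of the MAMP-based examples of this chapter.

```python
import sqlite3

# In-memory database: nothing is written to disk
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# DDL: defines the structure (schema/metadata) of a hypothetical table
cursor.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

# DML: inserts and modifies the data (instance/state) of the table
cursor.execute("INSERT INTO student (id, name) VALUES (?, ?)", (1, "Alice"))
cursor.execute("INSERT INTO student (id, name) VALUES (?, ?)", (2, "Bob"))
cursor.execute("UPDATE student SET name = ? WHERE id = ?", ("Robert", 2))
connection.commit()

# Query: displays the stored data without changing it
rows = cursor.execute("SELECT id, name FROM student ORDER BY id").fetchall()
print(rows)

connection.close()
```

The SQL statements are passed to the database as plain strings, which is also how the MySQL scripts in the rest of the chapter operate.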
While it is always beneficial for one to study and understand the tools and technologies of any given system to a good extent, it must be noted that no prior knowledge or practical experience with MAMP is needed in order to practice and execute the examples presented in this chapter. While the examples make use of the MySQL DBMS and the Apache Server, this is just a matter of simply logging in and activating them, and accessing the created databases. The scripts provided in this chapter will do all the necessary work, while the results will appear in the relevant MySQL database.

This chapter will cover the following topics:

• DDL (Data Definition Language): Creating a database and connecting to it. Modifying, deleting, or displaying DB tables, structures, and attributes.
• DML (Data Manipulation Language): Inserting, modifying, and deleting records in a table.
• Queries: Displaying the records of one or more tables in various different ways.
• Using GUI programming, and in particular the Grid widget, to create presentable database applications with Python.

It should be noted that while expertise in databases is not essential, a good understanding of the concepts and techniques introduced in Chapter 4: Graphical User Interface Programming with Python and Chapter 5: Application Development with Python may be required. Ideally, the reader should be comfortable with the major concepts introduced in all the previous introductory chapters, as many of these concepts will be utilized or integrated in the examples presented here.

7.2 SCRIPTING FOR DATA DEFINITION LANGUAGE

As mentioned, MAMP will provide some of the tools that are necessary for the examples presented in this chapter. The MAMP packages must be downloaded and installed, as required. Once installation is complete, the MAMP application must be launched. This will start the Apache local server and the MySQL DBMS, both of which are required in order to run a client-server application.
Figures 7.1 and 7.2 illustrate the MAMP server and the MySQL DBMS interfaces, respectively.

FIGURE 7.1 MAMP server.

FIGURE 7.2 MySQL phpMyAdmin.

Once these services are launched, the libraries related to MySQL connectivity and scripting must be also installed in the Anaconda environment. The libraries can be found under the Environments tab in Anaconda Navigator. If the reader has already installed the necessary libraries in previous chapters of this book, installing the new libraries ensures that the import statements related to MySQL will not raise errors. If some of the libraries used here have not been previously installed, the reader should refer to the scripts of the previous chapters and amend the installation and scripts presented here accordingly. Figures 7.3 and 7.4 illustrate the Environments tab with lists of the installed libraries, as well as those that are not installed but needed for running the examples.

FIGURE 7.3 Installed libraries in environments tab.

7.2.1 Creating a New Database in MySQL

A database can be formally defined as an organized collection of related data the processing of which can provide a particular, explicit meaning. A database includes a number of tables, also called relations, hence the relational prefix. Each table/relation consists of attributes, also referred to as fields or columns. Typically, one or more of these attributes serve as unique record identifiers called primary keys and are often organized using indices. These structural elements of the database are collectively referred to as the database metadata.

Observation 7.3 – Database: An organized collection of related data which are processed to provide explicit meaning. A database includes a number of tables, each with its own attributes. Tables may be organized using a unique primary key and make use of indices.
As mentioned, the creation and control of metadata can be handled using the DDL.

FIGURE 7.4 Not installed but necessary libraries.

It goes without saying that the database itself needs to be created prior to the creation of the metadata. In MySQL, the creation of a new database is as simple as clicking on the New option on the left panel of phpMyAdmin (Figure 7.2). When creating a new database, the user must specify a name, the database format (usually GuiDB) and the default character set (usually utf8). In Python, the creation process involves a number of steps:

• Obtaining the log-in credentials for the MySQL environment. These can be found in the Welcome page in the Example area in MySQL.
• Using the config object (a dictionary) to set the credentials: config = {'user': 'root', 'password': 'root', 'host': 'localhost'}.
• Writing the statements to connect to the database, setting the SQL statement, and executing the commands.

Writing a Python script to create a database may be as simple as writing the basic statements in a command-prompt mode or as sophisticated as offering a full GUI environment. The following Python script is an example of the latter. Notice that, upon execution, the application should not produce an output, which simply means that no problems were encountered while connecting to MySQL. Instead of an output, the program should display the newly created database as an available database. It must also be stressed that SQL statements are simply treated as strings, and that their keywords are not case sensitive. As such, they can be written with capital or lower-case letters, or a combination of both. In this chapter, it was decided to use capital letters for the keywords of the statements, in line with the style adopted in the official MySQL documentation (Oracle, 2021a).
This decision had to do mainly with distinguishing the SQL keywords from the SQL database table and attribute names and from the Python code, thus improving clarity and readability:

 1  import tkinter as tk
 2  from tkinter import ttk
 3  import mysql.connector
 4
 5  config = {'user': 'root', 'password': 'root', 'host': 'localhost'}
 6
 7  def createDB(dbName):
 8      GUIDB = 'GuiDB'
 9
10      connect = mysql.connector.connect(**config)
11      cursor = connect.cursor()
12      sqlString = "CREATE DATABASE " + dbName.get() + " DEFAULT CHARACTER SET utf8"
13      cursor.execute(sqlString.format(GUIDB))
14
15  # Create the basic window frame and give it a title
16  winFrame = tk.Tk()
17  winFrame.title("Create a new database")
18
19  # Create the interface
20  winLabel = tk.Label(winFrame,
21      text = "Enter the name of the new database", bg = "grey")
22  winLabel.grid(column = 0, row = 0)
23
24  # Create the StringVar object that will accept user input from the
25  # keyboard, and initialize it
26  textVar = tk.StringVar()
27  textVar.set("Enter the name here")
28  winText = ttk.Entry(winFrame, textvariable = textVar, width = 30)
29  winText.grid(column = 0, row = 1)
30  winButton = tk.Button(winFrame, font = "Arial 16",
31      text = "Click to create the new DB\nin the localhost")
32  winButton.bind("<Button-1>", lambda event, a = textVar: createDB(a))
33  winButton.grid(column = 0, row = 2)
34  winFrame.mainloop()

Output 7.2.1: [screenshot of the application window]

The part of the script specifically relating to the database is in lines 3–13. In line 3, the mysql.connector module that handles the connection with MySQL is imported. A standard connection configuration is implemented in line 5. Once the GUI is built, a click button event calls the createDB() function that assigns the most frequently used database format (GuiDB) to the relevant variable (line 8).
Next, it connects to MySQL using the mysql.connector.connect(**config) function (line 10), prepares the pending execution statement in the form of a sqlString (line 12), and executes the statement (line 13).

7.2.2 Connecting to a Database

As in the previous example, once the database is created a connection must be established. Connecting to a database involves the creation of a link to it inside the relevant DBMS (e.g., MySQL) through a server, such as Internet Information Services (IIS) or Apache. Once the connection is established, the database must be opened and a link must be created and attached to it. This usually requires some credentials, including login username, password, the host address (i.e., the network address of the server that hosts the database), and the name of the database itself. In the case of databases stored and used from within a local computer system and a local server (e.g., MySQL through Apache), the host address is usually "localhost".

Observation 7.4 – Connecting to a Database:
1. Import the mysql.connector library.
2. Use the cursor object and the mysql.connector.connect(**config) function to connect to the database.
3. Prepare the SQL statement.
4. Execute the SQL statement using the cursor.execute() function.

Observation 7.5 – The SHOW TABLES Statement: Use the SHOW TABLES statement to locate tables in the database. If successful, use the cursor.fetchall() function to load the results to the cursor object for later use.

The following Python script connects to the newly created database. It sets the configuration dictionary (config) that holds the credentials for the connection to the database (lines 2–3). Next, it links the execution statement with the MySQL database through mysql.connector (line 5). Once the connection is successfully established, the results are loaded to the cursor object, which always receives the results of all executed SQL statements (line 6).
Lastly, the database tables are displayed by executing the cursor.execute("SHOW TABLES") (line 7) and cursor.fetchall() (line 8) commands.

Observation 7.6 – Exception Handling: It is highly advisable that the try…except exception handling structure is used for each statement related to SQL scripts, as it is likely that the execution of such statements will frequently cause errors that can lead to the abnormal termination (crash) of the application.

In this example the reader should note the use of the try…except statement (lines 4 and 9) to display the appropriate messages in the cases of both successes and failures. This ensures that statements that fail or return unexpected values will not cause the application to crash. As an example, running this script with newDB as the database name will display the tables as expected. However, if the database name were to be changed to a non-existing one (e.g., newDB1), the exception handling code in lines 9 and 10 would be executed, launching an error message. It is worth mentioning that the execution of the except segment of the script will be triggered for any reason that might cause a failure in connecting to the database. Nevertheless, if the database is empty, an empty set of tables will be displayed:

 1  import mysql.connector
 2  config = {'user': 'root', 'password': 'root', 'host': 'localhost',
 3            'database': 'newDB'}
 4  try:
 5      link = mysql.connector.connect(**config)
 6      cursor = link.cursor()
 7      cursor.execute("SHOW TABLES")
 8      print(cursor.fetchall())
 9  except mysql.connector.Error:
10      print("There is an error with the connection")

Output 7.2.2.a:
[('STUDENT',), ('Table1',)]

Output 7.2.2.a shows the results for a database that includes the tables Student and Table1.

Output 7.2.2.b:
There is an error with the connection

Output 7.2.2.b shows the result when the connection fails, for instance when the database name does not exist (e.g., newDB1). In this case, the exception handling mechanism is activated and the corresponding error message is displayed. An empty database does not trigger the exception: the SHOW TABLES statement succeeds and cursor.fetchall() returns an empty list.

7.2.3 Creating Tables

The first action needed once a new database is created is the creation of its table(s). This is accomplished by the execution of the CREATE TABLE statement in SQL. The CREATE TABLE statement is very similar or identical across different DBMS. A detailed description of the small syntax variations between them is beyond the scope of this chapter, but the basic structure remains the same. Assuming the commonly used relational model, seven particular elements need to be specified when creating a table:

1. The table name (i.e., the name of each structure that will store data in its columns or fields, also called attributes).
2. The number of attributes of the table.
3. The name of each attribute, preferably as a single, descriptive word.
4. The data type for each of the attributes (e.g., CHAR, INT, or DATE).
5. The length/size of the data for each attribute in bytes.
6. Whether any of the attributes is the primary key, or part of a combined primary key of the table.
7. Whether any of the attributes is a foreign key, referencing a corresponding attribute in another table.

Observation 7.7 – The CREATE TABLE Statement: Use the CREATE TABLE statement to create a table, define its attributes, data types, and sizes, and set possible primary and foreign keys.

Observation 7.8 – Create Tables with No Primary or Foreign Key: Use the following statement to create a table with no primary or foreign keys:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>),..., <attributeN> <DATA TYPE>(<size>))

Provided that these seven elements are specified, there are three possible cases when creating a table:

1. The table does not have a primary key and does not have any of its attributes referencing the attributes of another table.
In this case, the table is part of a single-table database or it is a parent table for other tables to refer to.
2. The table has one or more of its attributes designated as a primary key, ensuring that each of its records is unique.
3. There is more than one table in the database and the tables are somehow related to each other. This occurs when one or more of the attributes reference an identical column in another table within the same database.

Python provides support for all three cases. Starting with the first case, one could create a table with a number of attributes, but no primary or foreign keys. This can be done either statically or dynamically. A static approach entails pre-defined statements and pre-determined results. A dynamic approach allows the programmer to determine the table structure at run-time. The following script and output is an example of the latter:

 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root', 'host': 'localhost',
 5            'database': 'newDB'}
 6
 7  # The name of the table and its attributes
 8  tableName = input("Enter the name of the table to create: ")
 9  sqlString = "CREATE TABLE " + tableName + "("
10  numOfAt = int(input("Enter the number of attributes in the table: "))
11  atName = [""]*numOfAt
12  atType = [""]*numOfAt
13  atSize = [0]*numOfAt
14
15  # Define the table structure (i.e., attribute details)
16  for i in range(numOfAt):
17      atName[i] = input("Enter the attribute " + str(i) + ": ")
18      atType[i] = str(input("Enter 'char' for char, 'int' for int type: "))
19      atSize[i] = int(input("Enter the size of the attribute: "))
20      sqlString += atName[i] + " " + atType[i] + \
21          "(" + str(atSize[i]) + ")"
22      if (i < numOfAt-1):
23          sqlString += ","
24      else:
25          sqlString += ")"
26
27  # The SQL statement and exception handling mechanism
28  print("The SQL statement to run is: ", sqlString)
29
30  try:
31      link = mysql.connector.connect(**config)
32      cursor = link.cursor()
33      cursor.execute(sqlString)
34      sqlString = "DESC " + tableName
35      cursor.execute(sqlString)
36      attributes = cursor.fetchall()
37      # Desc/show the metadata of the new table
38      print("The metadata for the new table "+str(tableName)+" are: ")
39      for row in attributes:
40          print(row)
41  except mysql.connector.Error:
42      print("There is an error with the connection")

Output 7.2.3.a:
Enter the name of the table to create: Student
Enter the number of attributes in the table: 3
Enter the attribute 0: Name
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 10
Enter the attribute 1: Address
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 15
Enter the attribute 2: Year
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 4
The SQL statement to run is: Create Table Student(Name char(10), Address char(15),Year int(4))
The metadata for the new table Student are:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')

The script consists of three distinct parts. In the first part (lines 7–13), the user is prompted to enter a name for the new table and the number of its attributes. The SQL string that is subsequently used for the creation of the table is also constructed. In the second part (lines 15–25), the user is prompted to enter the required details for each attribute (e.g., name, data type, size), and the SQL string is updated accordingly. The third part involves code that connects to the database and executes the SQL string. As mentioned, this is wrapped in an exception handling block in order to prevent a possible uncontrolled termination of the program due to failures of database-related activities (lines 30–42). This is one of the most straightforward cases of creating tables using Python scripts.
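Because the script above concatenates keyboard input directly into the SQL string, a mistyped or malicious entry reaches the server unchanged. The following is a minimal sketch of how the string construction could be isolated in a function that checks its inputs first. The function name build_create_table, the identifier pattern, and the list of accepted data types are assumptions made for illustration only; they are not part of the chapter's scripts or of mysql.connector.

```python
import re

# Hypothetical rules: identifiers may contain letters, digits and
# underscores only, and only the two types offered by the script are allowed.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
ALLOWED_TYPES = {"CHAR", "INT"}

def build_create_table(table_name, attributes):
    """Return a CREATE TABLE statement for a list of (name, type, size)."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    if not IDENTIFIER.match(table_name):
        raise ValueError("invalid table name: " + repr(table_name))
    parts = []
    for name, data_type, size in attributes:
        if not IDENTIFIER.match(name):
            raise ValueError("invalid attribute name: " + repr(name))
        if data_type.upper() not in ALLOWED_TYPES:
            raise ValueError("unsupported data type: " + repr(data_type))
        if int(size) <= 0:
            raise ValueError("the size must be a positive integer")
        parts.append(name + " " + data_type.upper() + "(" + str(int(size)) + ")")
    return "CREATE TABLE " + table_name + " (" + ", ".join(parts) + ")"

statement = build_create_table(
    "Student", [("Name", "char", 10), ("Address", "char", 15), ("Year", "int", 4)])
print(statement)
```

The returned string can then be passed to cursor.execute() exactly as in the script above, so the database-related part of the program remains unchanged.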
Indeed, this implementation simply involves the incorporation and execution of SQL statements through the Python script wrapper, similarly to what one would do with any other modern programming language. In the output of this particular example, the user enters the rather trivial and common example of a Student table with three basic attributes: Name, Address, and Year (of birth). After execution, the reader should be able to verify that the table has been created with the desired structure (e.g., with no primary or foreign keys) by checking database newDB in MySQL.

Observation 7.9 – Primary Key: An attribute or a combination of attributes with values that uniquely identify each particular record in the table.

The second case involves the addition of primary keys to the table. As a reminder, a formal definition of the primary key is that of an attribute of a table the value of which identifies records uniquely. Simply put, the primary key designation ensures that there are no duplicate values for the related attribute(s). It must be stressed again that two distinct possibilities exist in relation to primary keys. The first is that it consists of a single attribute. In this case the syntax is the following:

CREATE TABLE <table name> (
    <attribute1> <DATA TYPE>(<size>) PRIMARY KEY,
    ...,
    <attributeN> <DATA TYPE>(<size>))

The second is that the primary key consists of a combination of two or more attributes. In this case the syntax is slightly different:

CREATE TABLE <table name> (
    <attribute1> <DATA TYPE>(<size>),
    ...,
    <attributeN> <DATA TYPE>(<size>),
    PRIMARY KEY (<attributeX>,... <attributeY>))

Observation 7.10 – Foreign Key: An attribute that references the values of a corresponding attribute on another table of the same database that is also the primary key for the referenced table.

Observation 7.11 – Create a Table with a Single Primary Key but No Foreign Key:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>) PRIMARY KEY, ..., <attributeN> <DATA TYPE>(<size>))

Observation 7.12 – Create a Table with Combined Primary Key but No Foreign Key:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>),..., <attributeN> <DATA TYPE>(<size>), PRIMARY KEY (<attributeX>,... <attributeY>))

The following script is another version of the one presented previously, modified in order to address the creation of a table with a single primary key (lines 15–31):

 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root', 'host': 'localhost',
 5            'database': 'newDB'}
 6
 7  # The name of the table and its attributes
 8  tableName = input("Enter the name of the table to create: ")
 9  sqlString = "CREATE TABLE " + tableName + "("
10  numOfAt = int(input("Enter the number of attributes in the table: "))
11  atName = [""]*numOfAt
12  atType = [""]*numOfAt
13  atSize = [0]*numOfAt
14  key = 0
15
16  # Define the structure of the table (i.e., attribute details)
17  for i in range(numOfAt):
18      atName[i] = input("Enter the attribute " + str(i) + ": ")
19      atType[i] = str(input("Enter 'CHAR' for char, 'INT' for int type: "))
20      atSize[i] = int(input("Enter the size of the attribute: "))
21      sqlString += atName[i] + " " + atType[i] + \
22          "(" + str(atSize[i]) + ")"
23      if (key == 0):
24          primaryKey = str(input("Is this a primary key (Y/N)? "))
25          if (primaryKey == "Y"):
26              sqlString += " PRIMARY KEY"
27              key = 1
28      if (i < numOfAt-1):
29          sqlString += ", "
30      else:
31          sqlString += ")"
32
33  # The SQL statement to run using exception handling
34  print("The SQL statement to run is: \n", sqlString)
35  try:
36      link = mysql.connector.connect(**config)
37      cursor = link.cursor()
38      cursor.execute(sqlString)
39      sqlString = "DESC " + tableName
40      cursor.execute(sqlString)
41      columns = cursor.fetchall()
42      print("The structure/metadata of the table ",str(tableName),"is:")
43      for row in columns:
44          print(row)
45  except mysql.connector.Error:
46      print("There is an error with the connection")

Output 7.2.3.b:
Enter the name of the table to create: Customers
Enter the number of attributes in the table: 3
Enter the attribute 0: CustomerID
Enter 'char' for char, 'int' for int type: int
Enter the size of the attribute: 3
Is this a primary key (Y/N)? Y
Enter the attribute 1: CustLastName
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 15
Enter the attribute 2: CustFirstName
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 10
The SQL statement to run is:
Create Table Customers(CustomerID int(3) Primary key, CustLastName char(15), CustFirstName char(10))
There is an error with the connection

Output 7.2.3.c:
Enter the name of the table to create: Items
Enter the number of attributes in the table: 3
Enter the attribute 0: ItemID
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 6
Is this a primary key (Y/N)? Y
Enter the attribute 1: ItemDesc
Enter 'char' for char, 'int' for int type: char
Enter the size of the attribute: 25
Enter the attribute 2: ItemPrice
Enter 'char' for char, 'int' for int type: int
Enter the size of the attribute: 5
The SQL statement to run is:
Create Table Items(ItemID char(6) Primary key, ItemDesc char(25), ItemPrice int(5))
The structure/metadata of the table Items is:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')

The output demonstrates the creation of two of the three tables (i.e., Customers and Items) from Table 7.1. The third case involves two or more tables that are related to each other through a common attribute. In this case, this common attribute is usually designated as a primary key in one of the tables and a foreign key in the others, although this is not the only possible arrangement. This practice is often termed referencing, as the foreign key of the child table references the primary key of the parent table. The syntax for the creation of the table and the key designation is the following:

CREATE TABLE <table name> (
    <attribute1> <DATA TYPE>(<size>),
    FOREIGN KEY (<attribute name>) REFERENCES <table name> (<attribute name>),
    ...,
    <attributeN> <DATA TYPE>(<size>),
    FOREIGN KEY (<attribute name>) REFERENCES <table name> (<attribute name>))

Observation 7.13 – Create a Table with One or More Foreign Keys:
CREATE TABLE <table name> (<attribute1> <DATA TYPE>(<size>), FOREIGN KEY (<attribute name>) REFERENCES <table name> (<attribute name>),..., <attributeN> <DATA TYPE>(<size>), FOREIGN KEY (<attribute name>) REFERENCES <table name> (<attribute name>))

TABLE 7.1 Customers – Items – Orders

Customers
    Attribute        Type
    CustomerID       INT(3) PK
    CustLastName     CHAR(15)
    CustFirstName    CHAR(10)

Items
    Attribute        Type
    ItemID           CHAR(6) PK
    ItemDesc         CHAR(25)
    ItemPrice        INT(5)

Orders
    Attribute        Type
    OrderID          INT(3) PK
    CustID           INT(3) FK
    ItemID           CHAR(6) FK
    OrderYear        INT(4)
    OrderQuantity    INT(3)

The following Python script is another amendment to the previously developed script, allowing for the specification of a foreign key attribute, and the corresponding tables and reference attributes. It is beyond the scope of this chapter to discuss the numerous possibilities of such tasks in detail, and to provide safety measures against the multitude of cases of incorrect entries that could cause abnormal termination of the program. The goal of this example is to demonstrate how to use Python to facilitate the creation of such relationships in their simplest form using database table Orders from Table 7.1:

 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root', 'host': 'localhost',
 5            'database': 'newDB'}
 6
 7  # The name of the table and its attributes
 8  tableName = input("Enter the name of the table to create: ")
 9  sqlString = "CREATE TABLE " + tableName + "("
10  numOfAt = int(input("Enter the number of attributes in the table: "))
11  atName = [""]*numOfAt
12  atType = [""]*numOfAt
13  atSize = [0]*numOfAt
14  pkey = 0
15
16  # Define the structure of the table (i.e., attribute details)
17  for i in range(numOfAt):
18      atName[i] = input("\nEnter the attribute " + str(i) + ": ")
19      atType[i] = str(input("Enter 'CHAR' for char, 'INT' for int type: "))
20      atSize[i] = int(input("Enter the size of the attribute: "))
21      sqlString += atName[i] + " " + atType[i] + \
22          "(" + str(atSize[i]) + ")"
23      if (pkey == 0):
24          primaryKey = input("Is this a primary key (Y/N)? ")
25          if (primaryKey == "Y"):
26              sqlString += " PRIMARY KEY"
27              pkey = 1
28      foreignKey = input("Is this a foreign key (Y/N)? ")
29      if (foreignKey == "Y"):
30          availableTables = "SHOW TABLES"
31          link = mysql.connector.connect(**config)
32          cursor = link.cursor()
33          cursor.execute(availableTables)
34          tables = cursor.fetchall()
35          print(tables)
36          refTable = input("Select the table to reference: ")
37          availableAttributes = "DESC " + str(refTable)
38          link = mysql.connector.connect(**config)
39          cursor = link.cursor()
40          cursor.execute(availableAttributes)
41          columns = cursor.fetchall()
42          print(columns)
43          refAt = input("Select the attribute to reference: ")
44          sqlString += ", FOREIGN KEY (" + atName[i]
45          sqlString += ") REFERENCES " + str(refTable) + "(" + \
46              str(refAt) + ")"
47      if (i < numOfAt-1):
48          sqlString += ", "
49      else:
50          sqlString += ")"
51
52  # The SQL statement and the exception handling mechanism
53  print("\nThe SQL statement to run is: \n", sqlString)
54  try:
55      link = mysql.connector.connect(**config)
56      cursor = link.cursor()
57      cursor.execute(sqlString)
58      sqlString = "DESC " + tableName
59      cursor.execute(sqlString)
60      columns = cursor.fetchall()
61      print("\nThe structure/metadata of the table ",
62          str(tableName), "is:")
63      for row in columns:
64          print(row)
65  except mysql.connector.Error:
66      print("There is an error with the connection")

Output 7.2.3.d:
Enter the name of the table to create: Orders
Enter the number of attributes in the table: 5

Enter the attribute 0: OrderID
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a primary key (Y/N)? Y
Is this a foreign key (Y/N)? n

Enter the attribute 1: CustID
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a foreign key (Y/N)? Y
[('customers',), ('items',), ('student',), ('table1',)]
Select the table to reference: Customers
[('CustomerID', 'int(3)', 'NO', 'PRI', None, ''), ('CustLastName', 'char(15)', 'YES', '', None, ''), ('CustFirstName', 'char(10)', 'YES', '', None, '')]
Select the attribute to reference: CustomerID

Enter the attribute 2: ItemID
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 6
Is this a foreign key (Y/N)? Y
[('customers',), ('items',), ('student',), ('table1',)]
Select the table to reference: Items
[('ItemID', 'char(6)', 'NO', 'PRI', None, ''), ('ItemDesc', 'char(25)', 'YES', '', None, ''), ('ItemPrice', 'int(5)', 'YES', '', None, '')]
Select the attribute to reference: ItemID

Enter the attribute 3: OrderYear
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 4
Is this a foreign key (Y/N)? N

Enter the attribute 4: OrderQty
Enter 'char' for char type, 'int' for int type: int
Enter the size of the attribute: 3
Is this a foreign key (Y/N)? N

The SQL statement to run is:
Create Table Orders(OrderID int(3) Primary key, CustID int(3), Foreign Key (CustID) References Customers(CustomerID), ItemID char(6), Foreign Key (ItemID) References Items(ItemID), OrderYear int(4), OrderQty int(3))

The structure/metadata of the table Orders is:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')

Once the table is created and references to tables Customers and Items are established, the following Entity Relationship Diagram (ERD) should appear in MySQL Designer (Figure 7.5):

FIGURE 7.5 Entity relationship diagram for the customers-items-orders database.

7.2.4 Altering Tables

As discussed, the CREATE TABLE statement creates new tables and defines their attributes and characteristics. In other words, it is used to create and specify the metadata of the table. This metadata is not expected to change frequently; indeed, the better the design of the database the lower the possibility of metadata modification being required. Nevertheless, when necessary, the most drastic way to do so is to destroy and re-create the entire table. This is also the easiest solution provided that the table contains no data. However, the feasibility of this approach is inversely related to the amount of existing data, as destroying the table would also lead to permanent data loss. This is where the ALTER TABLE statement comes into play. The statement has numerous variations, but they all serve the purpose of altering the structure and metadata of an existing table.
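The difference between the two approaches described above can be seen in the following minimal sketch. It assumes Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, so that it runs without a server; the table contents are hypothetical. The ADD COLUMN form used here is also accepted by MySQL.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL here.
link = sqlite3.connect(":memory:")
cursor = link.cursor()
cursor.execute("CREATE TABLE Student (Name CHAR(10), Year INT)")
cursor.execute("INSERT INTO Student VALUES ('Maria', 2001)")

# Option 1: alter the table in place; the existing record survives and
# receives an empty (NULL) value for the new attribute.
cursor.execute("ALTER TABLE Student ADD COLUMN MobileNumber CHAR(15)")
rows_after_alter = cursor.execute("SELECT * FROM Student").fetchall()

# Option 2: destroy and re-create the table; the record is lost.
cursor.execute("DROP TABLE Student")
cursor.execute("CREATE TABLE Student (Name CHAR(10), Year INT, "
               "MobileNumber CHAR(15))")
rows_after_recreate = cursor.execute("SELECT * FROM Student").fetchall()

print(rows_after_alter)     # [('Maria', 2001, None)]
print(rows_after_recreate)  # []
link.close()
```

Both options end with the same table structure, but only the first one keeps the data that was already stored.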
The most important and frequently used of these variations cover the following:

1. Adding/deleting/modifying an attribute in an existing table.
2. Adding/deleting a primary key constraint.

Observation 7.14 – The ALTER TABLE Statement:
ALTER TABLE <name> ADD <new attribute> <DATA TYPE>(<size>)
ALTER TABLE <name> DROP <attribute name>
ALTER TABLE <name> CHANGE <attribute name> <attribute new name> <attribute new DATA TYPE>(<new size>)
ALTER TABLE <name> ADD <new attribute> <DATA TYPE>(<size>) PRIMARY KEY
ALTER TABLE <name> DROP PRIMARY KEY

The first set of statements relates to the manipulation of simple attributes. For instance, if a new attribute is to be added to an existing table, the ALTER TABLE syntax would be the following:

ALTER TABLE <table name> ADD <new attribute> <DATA TYPE>(<size>)

Accordingly, to delete an existing attribute from a table the statement can be used with the following syntax:

ALTER TABLE <table name> DROP <attribute name>

Modifications of the data type and/or size of an attribute would take the following form:

ALTER TABLE <table name> CHANGE <attribute name> <attribute new name> <attribute new DATA TYPE>(<new size>)

The second set of statements involves the addition of a new attribute that also serves as a (composite) primary key or the deletion of the primary key function of an attribute.
In the first case, the following syntax should be used:

ALTER TABLE <table name> ADD <new attribute> <DATA TYPE>(<size>) PRIMARY KEY

In the case of the latter, the syntax would be the following:

ALTER TABLE <table name> DROP PRIMARY KEY

The following Python script demonstrates the use of all the aforementioned cases in a single application:

 1  import mysql.connector
 2
 3  # The database config details
 4  config = {'user': 'root', 'password': 'root', 'host': 'localhost',
 5            'database': 'newDB'}
 6
 7  # Show the available tables
 8  availableTables = "SHOW TABLES"
 9  link = mysql.connector.connect(**config)
10  cursor = link.cursor()
11  cursor.execute(availableTables)
12  tables = cursor.fetchall()
13  print(tables)
14
15  # Select the table to alter and show its attributes
16  selectedTable = input("Select the table to alter: ")
17  availableAttributes = "DESC " + str(selectedTable)
18  link = mysql.connector.connect(**config)
19  cursor = link.cursor()
20  cursor.execute(availableAttributes)
21  columns = cursor.fetchall()
22  for row in columns:
23      print(row)
24
25  # Decide to add a column in the selected table, modify it, or drop it
26  alterType = input("(A)dd a new column\n(M)odify its size\n(D)rop one?"
27      "\n(APK)Add Primary Key\n(DPK)Drop Primary Key?"
28      "\nSelect preferred task: ")
29  if (alterType == "A"):
30      atName = input("\nEnter the attribute name: ")
31      atType = input("Enter 'char' for char type, 'int' for int type: ")
32      atSize = int(input("Enter the size of the attribute: "))
33  if (alterType == "D"):
34      atName = input("\nEnter the name of the attribute to drop: ")
35  if (alterType == "M"):
36      atName = input("\nEnter the name of the attribute to change: ")
37      atNewName = input("\nEnter the new name of the attribute: ")
38      atNewType = input("Enter 'char' for char type, 'int' for int type: ")
39      atNewSize = int(input("Enter the size of the attribute: "))
40  if (alterType == "APK"):
41      atName = input("\nEnter the name of the attribute to "
42          "convert to Primary Key: ")
43      atNewType = input("Enter 'char' for char type, 'int' for int type: ")
44      atNewSize = int(input("Enter the size of the attribute: "))
45
46  # Prepare and execute the alter statement
47  if (alterType == "A"):
48      sqlString = "ALTER TABLE " + str(selectedTable) + " ADD " + \
49          atName + " " + str(atType) + "(" + str(atSize) + ")"
50  elif (alterType == "D"):
51      sqlString = "ALTER TABLE " + str(selectedTable) + \
52          " DROP COLUMN " + str(atName)
53  elif (alterType == "M"):
54      sqlString = "ALTER TABLE " + str(selectedTable) + " CHANGE " + \
55          atName + " " + atNewName + " " + atNewType + \
56          "(" + str(atNewSize) + ");"
57  elif (alterType == "APK"):
58      sqlString = "ALTER TABLE " + str(selectedTable) + " ADD " + atName + \
59          " " + atNewType + "(" + str(atNewSize) + ") PRIMARY KEY"
60  elif (alterType == "DPK"):
61      sqlString = "ALTER TABLE " + str(selectedTable) + " DROP PRIMARY KEY"
62  print(sqlString)
63
64  try:
65      link = mysql.connector.connect(**config)
66      cursor = link.cursor()
67      cursor.execute(sqlString)
68      print(cursor)
69      sqlString = "DESC " + selectedTable
70      cursor.execute(sqlString)
71      columns = cursor.fetchall()
72      print("\nThe structure/metadata of the table ",
73          str(selectedTable), "is:")
74      for row in columns:
75          print(row)
76  except mysql.connector.Error:
77      print("There is an error with the connection")

Output 7.2.4.a: Adding a new attribute
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: A

Enter the attribute name: MobileNumber
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 15
Alter table Student add MobileNumber char(15)
MySQLCursor: Alter table Student add MobileNumber cha..

The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('MobileNumber', 'char(15)', 'YES', '', None, '')

Output 7.2.4.b: Modifying an attribute

[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('MobileNumber', 'char(15)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: M

Enter the name of the attribute to change: MobileNumber

Enter the new name of the attribute: PhoneNumber
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 20
Alter table Student change MobileNumber PhoneNumber char(20);
MySQLCursor: Alter table Student change MobileNumber ..

The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('PhoneNumber', 'char(20)', 'YES', '', None, '')

Output 7.2.4.c: Deleting/Dropping an attribute

[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: Student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('PhoneNumber', 'char(20)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: D

Enter the name of the attribute to drop: PhoneNumber
Alter table Student drop column PhoneNumber
MySQLCursor: Alter table Student drop column PhoneNum..

The structure/metadata of the table Student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')

Output 7.2.4.d: Adding a primary key

[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: APK

Enter the name of the attribute to convert to Primary Key: StudentID
Enter 'char' for char type, 'int' for int type: char
Enter the size of the attribute: 10
Alter table student add StudentID char(10) Primary key
MySQLCursor: Alter table student add StudentID char(1..

The structure/metadata of the table student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', 'PRI', None, '')

Output 7.2.4.e: Dropping a primary key

[('customers',), ('items',), ('orders',), ('student',), ('table1',)]
Select the table to alter: student
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', 'PRI', None, '')
(A)dd a new column
(M)odify its size
(D)rop one?
(APK)Add Primary Key
(DPK)Drop Primary Key?
Select preferred task: DPK
Alter table student Drop Primary Key
MySQLCursor: Alter table student Drop Primary Key

The structure/metadata of the table student is:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', '', None, '')

The script allows the user to select the table whose metadata must be altered. The user is presented with a simple menu for choosing the type of statement to execute. Upon execution, the result is displayed on screen, and it can also be verified in MySQL. As the programming concepts used in the script have been covered in previous sections, they are not discussed here. The outputs showcase some test cases based on the developed script.

7.2.5 Dropping Tables

The deletion of an entire table, and especially of one that contains data, is not something that one should resort to frequently. Nevertheless, there are occasions when this may be necessary. Assuming that there are no referential integrity relationships between the table in question and any other tables, the deletion can be implemented with the DROP TABLE statement and a simple reference to the name of the table:

Observation 7.15 – The DROP TABLE Statement: Destroys (deletes) a table and all the data contained in it, as in the example below.
DROP TABLE <table name>

The following Python script demonstrates this by displaying the available tables and offering the user a mechanism for table selection and deletion:

    import mysql.connector

    # The database config details
    config = {'user': 'root', 'password': 'root',
              'host': 'localhost', 'database': 'newDB'}

    # Show the available tables
    def showTables():
        availableTables = "SHOW TABLES"
        link = mysql.connector.connect(**config)
        cursor = link.cursor()
        cursor.execute(availableTables)
        tables = cursor.fetchall()
        print(tables)

    # Show the available tables
    showTables()

    # Select the table to drop and show its attributes
    selectedTable = input("Select the table to drop: ")
    availableAttributes = "DESC " + str(selectedTable)
    link = mysql.connector.connect(**config)
    cursor = link.cursor()
    cursor.execute(availableAttributes)
    columns = cursor.fetchall()
    for row in columns:
        print(row)

    # Confirm the decision to drop the table
    dropConfirmation = input("Are you sure you want to drop "
                             "the table (Y/N)? ")
    if (dropConfirmation == "Y"):
        sqlString = "DROP TABLE " + str(selectedTable)
        print(sqlString)
        try:
            link = mysql.connector.connect(**config)
            cursor = link.cursor()
            cursor.execute(sqlString)
            # Show the available tables
            showTables()
        except mysql.connector.Error as error:
            # Report connection problems as well as rejected statements
            print("There is an error with the connection or the statement:",
                  error)

Output 7.2.5:

[('customers',), ('items',), ('orders',), ('student',), ('table1',), ('test',)]
Select the table to drop: test
('test1', 'char(10)', 'NO', 'PRI', None, '')
('test2', 'char(10)', 'YES', '', None, '')
Are you sure you want to drop the table (Y/N)? Y
Drop table test
[('customers',), ('items',), ('orders',), ('student',), ('table1',)]

The output shows how to use the DROP TABLE statement to delete/destroy a table and its data.
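Since the script places the user's input directly into the DROP TABLE text, one cautious variation is to accept only a name that SHOW TABLES actually returned. The following is a minimal sketch of that idea and not part of the original script. The helper name build_drop_statement is hypothetical, the tables argument is assumed to have the shape returned by cursor.fetchall() after SHOW TABLES, and the case-insensitive comparison is an assumption that matches the outputs in this chapter (where Student and student refer to the same table).

```python
def build_drop_statement(selected_table, tables):
    """Return a DROP TABLE statement only if the table is known.

    `tables` is assumed to look like the result of SHOW TABLES,
    e.g. [('customers',), ('items',)]. The helper name is hypothetical.
    """
    # Flatten the one-element tuples into a set of lower-case names
    known = {row[0].lower() for row in tables}
    if selected_table.lower() not in known:
        raise ValueError("Unknown table: " + selected_table)
    # Backticks quote an identifier in MySQL
    return "DROP TABLE `" + selected_table + "`"


tables = [('customers',), ('items',), ('orders',), ('student',),
          ('table1',), ('test',)]
print(build_drop_statement("test", tables))
```

With a live connection, the returned string would be passed to cursor.execute() only after the user's confirmation, exactly as in the script.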
Note that before trying to drop a table (in this instance, the table test), one has to ensure that the table has been created and exists.

7.2.6 The DESC Statement

In previous sections, there were instances where the structure or metadata of a table had to be displayed. The statement used in such cases was the following:

DESC <table name>

Observation 7.16 – The DESC Statement: Returns the metadata of a table as in the example below.

DESC <table name>

This statement returns a list of tuples, one per attribute of the table, with the associated details such as the attribute's name, data type and size, and primary key designation. The reader can refer to the scripts provided in previous sections as practical examples of its functionality and use.

7.3 SCRIPTING FOR DATA MANIPULATION LANGUAGE

The previous sections introduced the various DDL statements used to create, alter, and drop the metadata of the tables in a database. This metadata is often called the database schema. As mentioned, it is neither expected nor desired that this schema changes frequently. Once the schema is finalized, one can start working on its state or instance. A database instance contains all the data stored in the database at any particular moment in time. The statements used for working with the database instance are usually referred to as the Data Manipulation Language (DML). Just as DDL statements create, alter, and drop the schema, DML statements are used to insert new records into a table, modify existing data, or delete existing records from a table. The following sections introduce the most basic and common uses of these statements.

7.3.1 Inserting Records

The INSERT statement is used to insert a single record (row) into a table. The general syntax of the statement is the following:

INSERT INTO <table name> VALUES (<attribute1 value>...
<attributeN value>)

If the user is allowed to insert data into a table in a different order than the one specified in the corresponding table metadata, or to enter data selectively into a subset of the table attributes, the following syntax could be used:

INSERT INTO <table name> (<attributeX name>... <attributeZ name>) VALUES (<attributeX value>... <attributeZ value>)

Observation 7.17 – Insert Records:
INSERT INTO <table name> VALUES (<attribute1 value>... <attributeN value>)
If the data order is different than that of the table attributes, or if some attributes are not supposed to receive data, the following syntax can be used:
INSERT INTO <table name> (<attributeX name>... <attributeZ name>) VALUES (<attributeX value>... <attributeZ value>)

The following Python script demonstrates the use of the INSERT statement, in a case where the user first selects the table to which the statement applies:

    import mysql.connector

    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
        print("DB tables are: " + str(tables))
    except mysql.connector.Error:
        print("There was a problem showing tables")

    tableName = input("Enter the table selected: ")

    try:
        # Show the table metadata
        cursor.execute("DESC " + tableName)
        columns = cursor.fetchall()
        print("Selected table is: ", tableName)
        print("Its attributes are: ")
        for row in columns:
            print(row)

        # Show the current instance of the table
        cursor.execute("SELECT * FROM " + str(tableName))
        records = cursor.fetchall()
        print("The records in the table are: ")
        for row in records:
            print(row)
    except mysql.connector.Error:
        print("There was a problem showing the table attributes")

    # Prepare the insert statement
    numColumns = len(columns)
    attributes = [""]*numColumns
    sqlString = "INSERT INTO " + tableName + " VALUES ("
    # Invite user's input for each attribute
    for i in range(numColumns):
        attributes[i] = input("Enter data for attribute " + str(i) + ": ")
        # Quote the value if the attribute type starts with 'c' (char)
        if (columns[i][1][0] == "c"):
            sqlString += "\"" + attributes[i] + "\""
        # Leave the value unquoted if the type starts with 'i' (int)
        elif (columns[i][1][0] == "i"):
            sqlString += attributes[i]
        if (i < numColumns-1):
            sqlString += ", "
    sqlString += ")"

    # Execute the prepared insert statement
    print("SQL statement to execute is: ")
    print(sqlString)
    cursor.execute(sqlString)

    # Commit the results to ensure they are permanently stored
    connect.commit()

    # Show the new instance of the table
    print("The records in the " + str(tableName) + " table are: ")
    sqlString = "SELECT * FROM " + tableName
    cursor.execute(sqlString)
    records = cursor.fetchall()
    for row in records:
        print(row)

Output 7.3.1.a: Inserting a new record to Student

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: student
Selected table is: student
Its attributes are:
('Name', 'char(10)', 'YES', '', None, '')
('Address', 'char(15)', 'YES', '', None, '')
('Year', 'int(4)', 'YES', '', None, '')
('StudentID', 'char(10)', 'NO', '', None, '')
The records in the table are:
Enter data for attribute 0: Alex
Enter data for attribute 1: Westwood 7
Enter data for attribute 2: 2002
Enter data for attribute 3: 001
SQL statement to execute is:
Insert into student values ("Alex", "Westwood 7", 2002, "001")
The records in the student table are:
('Alex', 'Westwood 7', 2002, '001')

Upon execution, the script displays the tables in the current database and prompts the user to select one of them. Once a selection is made, the user is provided with both the metadata and the instance of the table.
Next, the user is invited to enter values for each of the attributes of the table, one at a time. In this case, the more generic, basic syntax is adopted, so the user must enter values for all the attributes of the table in the order dictated when the table was created. After all values are collected, the corresponding INSERT statement is prepared and executed, and its result is committed. Finally, the script displays the new instance of the table.

The following observations are also noteworthy in relation to the script and its output. Firstly, any text value inserted into a table must be enclosed in quotes, while numbers are not. The script uses double quotes, which MySQL accepts by default in addition to the standard single quotes. Dates also have their own particular format. Secondly, in this particular example, the user inserts a record into the Student table, which has no primary key attribute and neither references nor is referenced by another table. As this is a rather straightforward case, any issues that arise with the statement are most likely related to connectivity between the script and the database server. Thirdly, committing the results of the INSERT statement is what ensures that the newly inserted data are actually stored in the table.

One could use the Customers, Items, and Orders tables as a working example. Firstly, the user would enter a new record into the Customers table (note that the table has an attribute that serves as a primary key).
The following output illustrates such an insertion into Customers with the following data: 001, "John", and "Good":

Output 7.3.1.b: Inserting a new record to Customers

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 001
Enter data for attribute 1: John
Enter data for attribute 2: Good
SQL statement to execute is:
Insert into customers values (001, "John", "Good")
The records in the customers table are:
(1, 'John', 'Good')

Next, let us assume that the user attempts to enter a new record with the following data: 001, "Maria", and "Green". The problem in this case is that the new record has the same value for the primary key (i.e., 001) as an existing one. This will raise an IntegrityError, since MySQL does not allow duplicate values for this attribute.
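A script can intercept such an error instead of terminating with a traceback. The following is a minimal sketch under stated assumptions: the helper run_statement and the Fake* stand-in classes are hypothetical and exist only so that the example runs without a database server. With mysql.connector, the except clause would normally name mysql.connector.Error, whose instances carry the MySQL error number in the errno attribute. Error number 1062 is the one MySQL reports for a duplicate key value, and 1452 the one for a failed foreign key check.

```python
def run_statement(cursor, connection, sql):
    """Execute `sql`; return None on success or a short message on failure."""
    try:
        cursor.execute(sql)
        connection.commit()
        return None
    except Exception as err:  # with mysql.connector: mysql.connector.Error
        # Undo any partial work before reporting the problem
        connection.rollback()
        errno = getattr(err, "errno", None)
        if errno == 1062:
            return "Duplicate value for the primary key"
        if errno == 1452:
            return "Referenced record does not exist (foreign key)"
        return "Statement failed: " + str(err)


# Hypothetical stand-ins so the sketch runs without a MySQL server
class FakeError(Exception):
    def __init__(self, errno, msg):
        super().__init__(msg)
        self.errno = errno


class FakeCursor:
    def execute(self, sql):
        raise FakeError(1062, "Duplicate entry '1' for key 'PRIMARY'")


class FakeConnection:
    def commit(self):
        pass

    def rollback(self):
        pass


message = run_statement(FakeCursor(), FakeConnection(),
                        'INSERT INTO customers VALUES (001, "Maria", "Green")')
print(message)
```

Returning a message rather than re-raising keeps the interactive flow of the script intact; whether that is preferable to letting the error propagate depends on the application.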
The following output shows the error that is raised in such a case:

Output 7.3.1.c: Attempting to insert a new record to Customers with duplicate primary key

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
Enter data for attribute 0: 001
Enter data for attribute 1: Maria
Enter data for attribute 2: Green
SQL statement to execute is:
Insert into customers values (001, "Maria", "Green")

~\anaconda3\lib\site-packages\mysql\connector\connection.py in _handle_result(self, packet)
    571             return self._handle_eof(packet)
    572         elif packet[4] == 255:
--> 573             raise errors.get_exception(packet)
    574
    575         # We have a text result set

IntegrityError: 1062 (23000): Duplicate entry '1' for key 'PRIMARY'

Following up on the same example, let us assume that the user attempts to insert a record in the Items table, as displayed in the output below:

Output 7.3.1.d: Inserting a record to Items

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 100
Enter data for attribute 1: Refrigerator
Enter data for attribute 2: 600
SQL statement to execute is:
Insert into items values ("100", "Refrigerator", 600)
The records in the items table are:
('100', 'Refrigerator', 600)

The user may also attempt to insert a record in the Orders table. Firstly, let us assume that the user correctly inputs data that correspond to the other two tables (i.e., Customers and Items).
The following output illustrates a successful attempt:

Output 7.3.1.e: Inserting a record to Orders

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
Enter data for attribute 0: 1
Enter data for attribute 1: 1
Enter data for attribute 2: 100
Enter data for attribute 3: 2021
Enter data for attribute 4: 15
SQL statement to execute is:
Insert into orders values (1, 1, "100", 2021, 15)
The records in the orders table are:
(1, 1, '100', 2021, 15)

In contrast, if the user attempts to insert another record into Orders with a customer value that does not exist in the corresponding Customers table, an error will be raised:

Output 7.3.1.f: Violating a referential integrity constraint in an INSERT statement

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
(1, 1, '100', 2021, 15)
Enter data for attribute 0: 2
Enter data for attribute 1: 2
Enter data for attribute 2: 100
Enter data for attribute 3: 2021
Enter data for attribute 4: 10
SQL statement to execute is:
Insert into orders values (2, 2, "100", 2021, 10)

IntegrityError                            Traceback (most recent call last)
~\anaconda3\lib\site-packages\mysql\connector\connection.py in _handle_result(self, packet)
    571             return self._handle_eof(packet)
    572         elif packet[4] == 255:
--> 573             raise errors.get_exception(packet)
    574
    575         # We have a text result set

IntegrityError: 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`newdb`.`orders`, CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`CustID`) REFERENCES `customers` (`CustomerID`))

These examples provide a basic demonstration of various cases of data insertion into tables, and of potential violations of important constraints like primary and foreign keys. Of course, this is not an exhaustive collection of all possible cases, but it should provide some clarity in terms of working with INSERT statements in Python. Ideally, exception handling should be employed to control as many violation scenarios as possible.

7.3.2 Updating Records

Changing the metadata of a table after its creation is generally undesirable and quite rare. In contrast, when it comes to data manipulation, it is necessary to be able to change the data of particular records rather frequently. This is accomplished with the use of the UPDATE statement:

Observation 7.18 – The UPDATE Statement:
UPDATE <table name> SET <attribute1> = <value1>,..., <attributeN> = <valueN> WHERE <condition that involves one or more attributes>

The following Python script is based on the examples developed in the previous sections, and adopts the same user prompts and table selection functions in order to showcase the use of the UPDATE statement, using the Customers table:

    import mysql.connector

    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
        print("DB tables are: " + str(tables))
    except mysql.connector.Error:
        print("There was a problem showing tables")

    tableName = input("Enter the table selected: ")

    try:
        # Show the table metadata
        cursor.execute("DESC " + tableName)
        columns = cursor.fetchall()
        print("Selected table is: ", tableName)
        print("Its attributes are: ")
        for row in columns:
            print(row)

        # Show the current instance of the table
        cursor.execute("SELECT * FROM " + str(tableName))
        records = cursor.fetchall()
        print("The records in the table are: ")
        for row in records:
            print(row)
    except mysql.connector.Error:
        print("There was a problem showing the table attributes")

    # Prepare the update statement
    attributeSelected = input("Select the attribute to change its values: ")
    newValue = input("Enter the new value")
    oldValue = input("Enter the old value")
    sqlString = "UPDATE " + tableName + " SET " + attributeSelected + \
        " = " + "\'" + newValue + "\'" + " WHERE " + attributeSelected + \
        " = " + "\'" + oldValue + "\'"

    # Execute the prepared Update statement
    print("SQL statement to execute is: ")
    print(sqlString)
    cursor.execute(sqlString)

    # Commit the results to ensure they are permanently stored
    connect.commit()

    # Show the new instance of the table
    print("The records in the " + str(tableName) + " table are: ")
    sqlString = "SELECT * FROM " + tableName
    cursor.execute(sqlString)
    records = cursor.fetchall()
    for row in records:
        print(row)

Output 7.3.2: Updating a record in Customers

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
Select the attribute to change its values: CustLastName
Enter the new valueJames
Enter the old valueJohn
SQL statement to execute is:
Update customers set CustLastName = 'James' where CustLastName = 'John'

In addition to the UPDATE statement and its execution, the reader should pay close attention to the requirement to commit the results of the execution. The commit() function ensures that the results are permanently stored in the table. It must also be noted that there are several variations of the UPDATE statement, the detailed coverage of which is outside the scope of this chapter. For more detailed information on this topic, the reader is advised to refer to the official MySQL documentation.

7.3.3 Deleting Records

In DML, the deletion of one or more records from a table is handled through the DELETE statement. The general syntax of the statement is the following:

DELETE FROM <table name> WHERE <condition>

Observation 7.19 – The DELETE Statement:
DELETE FROM <table name> WHERE <condition>

If the WHERE clause is omitted, all the records of the table are deleted. Nevertheless, the empty table will still exist, as the deletion of the table itself is achieved only through the DROP statement. It must also be noted that the <condition> part is quite flexible and can include various expressions and parameters, such as one or more attributes of the same table, queries related to the same table, or queries from different tables. Finally, it is important to remember that the DELETE statement cannot be executed if the result would violate referential integrity constraints.
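When the <condition> part is assembled from user input, as in the scripts of this section, two hedged precautions may help: checking the attribute name against the metadata returned by DESC, and passing the value separately through a %s placeholder, which mysql.connector supports via the second argument of cursor.execute(). The sketch below illustrates this under those assumptions; the helper name build_delete is hypothetical and is not part of the original scripts.

```python
def build_delete(table_name, attribute, columns):
    """Return a DELETE statement with a %s placeholder for the value.

    `columns` is assumed to have the shape returned by DESC <table name>,
    i.e. one tuple per attribute with the attribute name first.
    """
    valid_names = [column[0] for column in columns]
    if attribute not in valid_names:
        raise ValueError("Unknown attribute: " + attribute)
    return "DELETE FROM " + table_name + " WHERE " + attribute + " = %s"


columns = [('OrderID', 'int(3)', 'NO', 'PRI', None, ''),
           ('CustID', 'int(3)', 'YES', 'MUL', None, ''),
           ('ItemID', 'char(6)', 'YES', 'MUL', None, ''),
           ('OrderYear', 'int(4)', 'YES', '', None, ''),
           ('OrderQty', 'int(3)', 'YES', '', None, '')]

sql = build_delete("orders", "ItemID", columns)
print(sql)
# With a live connection, the value travels separately from the text:
# cursor.execute(sql, ("100",))
# connect.commit()
```

A value typed by mistake at the attribute prompt (for example 100 instead of ItemID) is rejected by the name check rather than silently becoming part of the condition. The same placeholder approach applies to the values of INSERT and UPDATE statements.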
Using the same example as in previous sections, the following Python script demonstrates a simple use of the DELETE statement:

    import mysql.connector

    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
        print("DB tables are: " + str(tables))
    except mysql.connector.Error:
        print("There was a problem showing tables")

    tableName = input("Enter the table selected: ")

    try:
        # Show the table metadata
        cursor.execute("DESC " + tableName)
        columns = cursor.fetchall()
        print("Selected table is: ", tableName)
        print("Its attributes are: ")
        for row in columns:
            print(row)

        # Show the current instance of the table
        cursor.execute("SELECT * FROM " + str(tableName))
        records = cursor.fetchall()
        print("The records in the table are: ")
        for row in records:
            print(row)
    except mysql.connector.Error:
        print("There was a problem showing the table attributes")

    # Prepare the Delete statement
    attributeSelected = input("Select the attribute based on "
                              "which to delete a record(s): ")
    deleteValue = input("Enter the data to delete: ")
    sqlString = "DELETE FROM " + tableName + " WHERE " + \
        attributeSelected + " = " + "\'" + deleteValue + "\'"

    # Execute the prepared Delete statement
    print("SQL statement to execute is: ")
    print(sqlString)
    cursor.execute(sqlString)

    # Commit the results to ensure they are permanently stored
    connect.commit()

    # Show the new instance of the table
    print("The records in the " + str(tableName) + " table are: ")
    sqlString = "SELECT * FROM " + tableName
    cursor.execute(sqlString)
    records = cursor.fetchall()
    for row in records:
        print(row)

Output 7.3.3: Deleting a record from Orders

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: orders
Selected table is: orders
Its attributes are:
('OrderID', 'int(3)', 'NO', 'PRI', None, '')
('CustID', 'int(3)', 'YES', 'MUL', None, '')
('ItemID', 'char(6)', 'YES', 'MUL', None, '')
('OrderYear', 'int(4)', 'YES', '', None, '')
('OrderQty', 'int(3)', 'YES', '', None, '')
The records in the table are:
(1, 1, '100', 2021, 15)
Select the attribute based on which to delete a record(s): 100
Enter the data to delete: 100
SQL statement to execute is:
Delete from orders where 100 = '100'
The records in the orders table are:

In the example illustrated in the output, the intention is to delete the only record that has a value of 100 for attribute ItemID in the Orders table. Note that in the recorded run the value 100 is also entered at the attribute prompt, so the condition compares the constant 100 with '100' and holds for every record; entering ItemID at the first prompt produces the intended condition ItemID = '100'. The reader should note how the DELETE statement is prepared based on the user's selections, and how the result is committed using the commit() function.

7.4 QUERYING A DATABASE AND USING A GUI

Querying and reporting data from database tables is arguably the most useful part of database management from the perspective of the user. Thus, it should come as no surprise that the remaining SQL statements are specifically used for these purposes. The available clauses are numerous, and the possibilities for nested queries and for conditional query execution render the potential combinations virtually limitless. As such, an exhaustive coverage of every possible case of querying and reporting is not only outside the scope of this chapter, but also a rather futile attempt in general. The focus of this section is to showcase some basic ways to execute querying and reporting tasks, and to demonstrate how GUIs could be utilized for presentation purposes.

7.4.1 The SELECT Statement

The SELECT statement is used to query and report data from tables.
Its most basic and generic syntax does not involve any clauses that dictate additional functionality or selection criteria:

SELECT * FROM <table name>

Such a statement will return all the attributes of every record in the specified table, as the asterisk (*) character stands for all attributes and no condition restricts the records. Selections based on more specific criteria can be built by adding the required clauses:

Observation 7.20 – The SELECT Statement:
SELECT <list of attributes from one or more tables> OR * FROM <list of tables> WHERE <conditions>

The <conditions> part specifies the particular requirements that the data must meet in order to be reported, ranging from no conditions to very complicated multi-attribute and multi-table ones. Similarly, the <list of tables> part specifies the tables that must be included in the report. For thorough descriptions of the numerous forms of the detailed syntax clauses and possible refinements, the reader can refer to the rich and readily available collection of related textbooks and resources (Oracle, 2021a).
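The way the three parts of the general syntax fit together can be sketched with a small helper that assembles the statement text. This is an illustrative sketch only: the function name build_select and its parameters are hypothetical, and the condition is passed as ready-made SQL text for simplicity.

```python
def build_select(tables, attributes=None, conditions=None):
    """Assemble a SELECT statement from its three parts.

    tables     -- list of table names (the <list of tables> part)
    attributes -- list of attribute names, or None for the asterisk (*)
    conditions -- SQL text for the WHERE clause, or None for no condition
    """
    attribute_part = ", ".join(attributes) if attributes else "*"
    sql = "SELECT " + attribute_part + " FROM " + ", ".join(tables)
    if conditions:
        sql += " WHERE " + conditions
    return sql


basic = build_select(["customers"])
refined = build_select(["items"], ["ItemID", "ItemPrice"], "ItemPrice >= 300")
print(basic)
print(refined)
```

The first call yields the basic form with the asterisk and no WHERE clause, while the second restricts both the attributes and the records reported.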
The following Python script builds on the previous examples to demonstrate querying and reporting on data from a table (e.g., Customers, Items, Orders), as specified by the user:

    import mysql.connector

    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
        print("DB tables are: " + str(tables))
    except mysql.connector.Error:
        print("There was a problem showing tables")

    tableName = input("Enter the table selected: ")

    try:
        # Show the table metadata
        cursor.execute("DESC " + tableName)
        columns = cursor.fetchall()
        print("===================")
        print("Selected table is: ", tableName)
        print("===================")
        print("Its attributes are:")
        for row in columns:
            print(row)

        # Show the current instance of the table
        cursor.execute("SELECT * FROM " + str(tableName))
        records = cursor.fetchall()
        print("==============================")
        print("The records in the table are: ")
        print("==============================")
        for row in records:
            print(row)
    except mysql.connector.Error:
        print("There was a problem showing the table attributes")

Output 7.4.1: Reporting data from a table based on user selection

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: customers
Selected table is: customers
Its attributes are:
('CustomerID', 'int(3)', 'NO', 'PRI', None, '')
('CustLastName', 'char(15)', 'YES', '', None, '')
('CustFirstName', 'char(10)', 'YES', '', None, '')
The records in the table are:
(1, 'John', 'Good')
(2, 'Norman', 'Chris')
(3, 'Flora', 'Alex')

In the case presented here, the output reports all the records from the Customers table.
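The records above are printed as raw tuples. As a possible refinement, the attribute names returned by DESC can serve as column headings for a plain-text report. The following is a minimal sketch of that idea; the helper name format_records is hypothetical, and the sample data are copied from the output above rather than fetched from a live database.

```python
def format_records(columns, records):
    """Return report lines with attribute names as column headings.

    `columns` is assumed to have the shape returned by DESC <table name>
    and `records` the shape returned by SELECT * FROM <table name>.
    """
    headers = [column[0] for column in columns]
    table = [headers] + [[str(value) for value in record]
                         for record in records]
    # The width of each column is that of its longest cell
    widths = [max(len(row[i]) for row in table)
              for i in range(len(headers))]
    lines = []
    for row in table:
        cells = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        lines.append("  ".join(cells).rstrip())
    return lines


columns = [('CustomerID', 'int(3)', 'NO', 'PRI', None, ''),
           ('CustLastName', 'char(15)', 'YES', '', None, ''),
           ('CustFirstName', 'char(10)', 'YES', '', None, '')]
records = [(1, 'John', 'Good'), (2, 'Norman', 'Chris'), (3, 'Flora', 'Alex')]

lines = format_records(columns, records)
for line in lines:
    print(line)
```

In the script, the columns and records variables already hold data of this shape after the two fetchall() calls, so the helper could be applied to them directly.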
7.4.2 The SELECT Statement with a Simple Condition

The previous section demonstrated the use of simple SELECT statements to report on data of a MySQL table. The complexity of the queries is limited only by the imagination and capabilities of the programmer and the task at hand, since Python provides the facilities and support for highly complex querying and reporting tasks. As a starting point for building more complex tasks, the following Python script invites the user to select a table from an example database and build a query based on the selection. Next, it prompts the user for a particular attribute to base the condition on, and for particular preferences for the condition, depending on whether the attribute is numerical or text-based:

    import mysql.connector

    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
        print("DB tables are: " + str(tables))
    except mysql.connector.Error:
        print("There was a problem showing tables")

    tableName = input("Enter the table selected: ")

    # Show the table metadata
    cursor.execute("DESC " + tableName)
    columns = cursor.fetchall()
    print("==================================================")
    print("Selected table is: ", tableName)
    print("==================================================")
    print("Its attributes are:")
    for row in columns:
        print(row)

    # Select the attribute to build the condition
    print("==================================================")
    condAttribute = input("Enter the attribute to build the condition: ")
    typeAttribute = input("Is it a numeric attribute or a text (Num/Text):")
    # No condition is applied unless a recognized type is entered
    sqlStatementCondition = ""
    if (typeAttribute == "Num"):
        minCond = int(input("Enter the min value for the attribute"))
        maxCond = int(input("Enter the max value for the attribute"))
        sqlStatementCondition = " WHERE "+str(condAttribute)+" >= "+ \
            str(minCond)+" AND "+str(condAttribute)+" <= "+str(maxCond)
    if (typeAttribute == "Text"):
        startingText = input("Enter the starting text of the value to "
                             "search for: ")
        sqlStatementCondition = " WHERE "+str(condAttribute)+" LIKE \'"+ \
            str(startingText) + "%\'"

    # Show the current instance of the table
    sqlStatement = "SELECT * FROM " + str(tableName) + sqlStatementCondition
    print(sqlStatement)
    cursor.execute(sqlStatement)
    records = cursor.fetchall()
    print("====================================")
    print("The records in the table are: ")
    print("====================================")
    for row in records:
        print(row)

Output 7.4.2.a – Example 1: Conditionally reporting data based on user selection

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
Enter the attribute to build the condition: ItemPrice
Is it a numeric attribute or a text (Num/Text):Num
Enter the min value for the attribute300
Enter the max value for the attribute450
Select * from items where ItemPrice >= 300 and ItemPrice <= 450
The records in the table are:
('100', 'RF-100', 300)
('200', 'TV-LG100', 400)
('303', 'PC-3', 400)

Output 7.4.2.b – Example 2: Conditionally reporting data based on user selection

DB tables are: [('customers',), ('items',), ('orders',), ('student',)]
Enter the table selected: items
Selected table is: items
Its attributes are:
('ItemID', 'char(6)', 'NO', 'PRI', None, '')
('ItemDesc', 'char(25)', 'YES', '', None, '')
('ItemPrice', 'int(5)', 'YES', '', None, '')
    Enter the attribute to build the condition: ItemDesc
    Is it a numeric attribute or a text (Num/Text):Text
    Enter the starting text of the value to search for: TV
    Select * from items where ItemDesc like 'TV%'
    The records in the table are:
    ('200', 'TV-LG100', 400)
    ('201', 'TV-Samsung 100', 550)
    ('202', 'TV-BenQ', 600)

In the output of Example 1 above, the user first selects table Items. Next, a list of all the available attributes is presented to the user as a choice for the condition of the SELECT statement. The user selects ItemPrice and is prompted to state whether it is a numerical or text attribute. As the attribute is numerical, the script offers the option to enter the min and max values. In contrast, in the output of Example 2, the user selects an attribute that is text-based. Hence, the script offers a different set of prompts and statements, appropriate for the use of the SELECT statement with text-based conditions.

The reader should note that the SELECT statements in both cases are the same as those used in MySQL. The only challenge in this instance is that the programmer has to prepare the final SQL statement with the dynamic elements in place. As expected, if no dynamic elements are involved in the query (e.g., if the table and the condition are predefined), the preparation of the SELECT statement is less complicated.

7.4.3 The SELECT Statement Using GUI

Arguably, if one aims to develop a user-oriented application, it is necessary to wrap the application with a user-friendly GUI. An extensive introduction to the most important GUI widgets (e.g., labels, entry boxes, radio buttons, buttons) and their application is provided in earlier chapters of this book. In the current context, the focus is on the creation of a grid-based layout that will host the results of the SQL queries. In such a case, a grid layout manager could be used.
The following Python script showcases the development and execution of a condition-based MySQL SELECT query using a fully deployed GUI:

    import mysql.connector
    import tkinter as tk
    from tkinter import ttk

    global tableName, attributeName, radioButton, textVar
    global minLabel, maxLabel, textualLabel; global textualEntry
    global selectionsFrame, resultsFrame; global columnName, columnType
    global minCondScale, maxCondScale; global tablesCombo, columnsCombo
    global connect, cursor, config; global tables, columns
    global minCond, maxCond; global minValue, maxValue, numCols

    # Create the frame to select the table for the query and its attributes
    def selectionGUI():
        global tables, columns; global tablesCombo, columnsCombo
        global tableName, radioButton, textVar
        global selectionsFrame, resultsFrame
        global minLabel, maxLabel, textualLabel
        global minCondScale, maxCondScale; global textualEntry

        # The frame for the query selections of the user
        selectionsFrame = tk.LabelFrame(winFrame, text = 'Query selections')
        selectionsFrame.config(bg = 'light grey', fg = 'red', bd = 2,
            relief = 'sunken')
        selectionsFrame.grid(column = 0, row = 0)

        # Create the combobox to hold the tables available in the db
        tablesLabel = tk.Label(selectionsFrame,
            text = "Tables available:", bg = "light grey")
        tablesLabel.grid(column = 0, row = 0)
        tablesCombo = ttk.Combobox(selectionsFrame, textvariable = tableName,
            width = 15)
        tablesCombo['values'] = tables; tablesCombo.current(0)
        tablesCombo.grid(column = 1, row = 0)

        # Button updates the attributes combo based on the table selection
        updateAttributesButton = tk.Button(selectionsFrame,
            text = 'Update Attributes', relief = 'raised', width = 15)
        updateAttributesButton.bind('<Button-1>',
            lambda event: updateAttributes())
        updateAttributesButton.grid(column = 2, row = 0)

        # Create the button to run the query
        runButton = tk.Button(selectionsFrame, text = 'Run Query',
            relief = 'raised', width = 15)
        runButton.bind('<Button-1>', lambda event: runQuery())
        runButton.grid(column = 3, row = 0)

        # Update the columns combo based on the table selection
        columnsLabel = tk.Label(selectionsFrame,
            text = "Select attribute:", bg = "light grey")
        columnsLabel.grid(column = 0, row = 1)
        columnsCombo = ttk.Combobox(selectionsFrame,
            textvariable = attributeName, width = 15)
        columnsCombo.grid(column = 1, row = 1)

        # Check whether selected attribute is numeric or text
        numericalAttribute = tk.Radiobutton(selectionsFrame,
            text = 'Numerical\nattribute', width = 10, height = 2,
            bg = 'light green', variable = radioButton, value = 1,
            command = radioClicked).grid(column = 2, row = 1)
        textAttribute = tk.Radiobutton(selectionsFrame,
            text = 'Text\nattribute', width = 10, height = 2,
            bg = 'light green', variable = radioButton, value = 2,
            command = radioClicked).grid(column = 3, row = 1)
        radioButton.set(1)

        # Create the GUI for the numerical conditional parameters
        minLabel = tk.Label(selectionsFrame, text = "Min value:",
            bg = "light grey")
        minLabel.grid(column = 0, row = 4); minLabel.grid_remove()
        minCond = tk.IntVar()
        minCondScale = tk.Scale(selectionsFrame, length = 200, from_ = 0,
            to = 10000)
        minCondScale.config(resolution = 10,
            activebackground = 'dark blue', orient = 'horizontal')
        minCondScale.config(bg = 'light blue', fg = 'red',
            troughcolor = 'cyan', command = onScaleMin)
        minCondScale.grid(column = 1, row = 4); minCondScale.grid_remove()
        maxLabel = tk.Label(selectionsFrame, text = "Max value:",
            bg = "light grey")
        maxLabel.grid(column = 2, row = 4); maxLabel.grid_remove()
        maxCond = tk.IntVar()
        maxCondScale = tk.Scale(selectionsFrame, length = 200, from_ = 0,
            to = 10000)
        maxCondScale.config(resolution = 10,
            activebackground = 'dark blue', orient = 'horizontal')
        maxCondScale.config(bg = 'light blue', fg = 'red',
            troughcolor = 'cyan', command = onScaleMax)
        maxCondScale.grid(column = 3, row = 4); maxCondScale.grid_remove()

        # Create the GUI for the textual parameters
        textualLabel = tk.Label(selectionsFrame,
            text = "Enter text to find:", bg = "light grey")
        textualLabel.grid(column = 0, row = 5); textualLabel.grid_remove()
        textVar = tk.StringVar()
        textualEntry = ttk.Entry(selectionsFrame, textvariable = textVar,
            width = 20)
        textualEntry.grid(column = 1, row = 5); textualEntry.grid_remove()

    # Update the attributes combo based on the table selection
    def updateAttributes():
        global cursor; global tableName, textVar; global tables, columns
        global tablesCombo, columnsCombo; global numCols
        global columnName, columnType; global minCondScale, maxCondScale

        try:
            # Show the selected table metadata
            if (str(tableName.get()) != ""):
                sqlString = "DESC " + str(tableName.get())
                cursor.execute(sqlString)
                columns = cursor.fetchall()

                # Reformat the columns list to new useful ones
                numCols = len(columns)
                columnName = []; columnType = []
                for i in range(numCols):
                    columnName.append(columns[i][0])
                    columnType.append(columns[i][1])
                    columns[i] = str(columns[i][0]) + " " + \
                        str(columns[i][1])
                columnsCombo['values'] = columns
                columnsCombo.current(0)
        except Exception:
            print("There was a problem showing the attributes")

    # Prepare and run the query, and display the results in a grid
    def runQuery():
        global cursor; global tableName; global tables, columns
        global columnsCombo; global numCols, numRows
        global selectedAttribute; global columnName, columnType
        global minValue, maxValue; global resultsFrame

        # Empty the results list and the results frame
        records = []
        if (resultsFrame != None):
            resultsFrame.destroy()

        # Prepare the query to run
        selectedIndex = columnsCombo.current()
        if (radioButton.get() == 1):
            sqlStatementCondition = " WHERE " + \
                str(columnName[selectedIndex]) + \
                " >= " + str(minValue) + " AND " + \
                str(columnName[selectedIndex]) + \
                " <= " + str(maxValue)
        elif (radioButton.get() == 2):
            startingText = str(textVar.get())
            sqlStatementCondition = " WHERE " + \
                str(columnName[selectedIndex]) + \
                " LIKE '" + str(startingText) + "%'"

        # The frame for the results of the query
        resultsFrame = tk.LabelFrame(winFrame, text = "Query data")
        resultsFrame.config(bg = 'light grey', fg = 'red', bd = 2,
            relief = 'sunken')
        resultsFrame.grid(column = 0, row = 1)

        # Show the current instance of the table
        sqlStatement = "SELECT * FROM " + str(tableName.get()) + \
            sqlStatementCondition
        cursor.execute(sqlStatement)
        records = cursor.fetchall()

        numRows = len(records)

        for i in range(numRows):
            for j in range(numCols):
                # Create the labels to display the columns of results
                newLabel = tk.Label(resultsFrame, width = 24)
                if (i%2 == 0):
                    newLabel.config(text = records[i][j],
                        bg = "light grey", relief = "sunken")
                else:
                    newLabel.config(text = records[i][j],
                        bg = "light cyan", relief = "sunken")
                newLabel.grid(column = j, row = i)

    # Display/hide the relevant conditional parameters depending on
    # the type of the attribute
    def radioClicked():
        global minLabel, maxLabel; global minCondScale, maxCondScale
        global textualLabel, textualEntry

        if (radioButton.get() == 1):
            minLabel.grid(); minCondScale.grid(); maxLabel.grid()
            maxCondScale.grid(); textualLabel.grid_remove()
            textualEntry.grid_remove()
        if (radioButton.get() == 2):
            minLabel.grid_remove(); minCondScale.grid_remove()
            maxLabel.grid_remove()
            maxCondScale.grid_remove(); textualLabel.grid()
            textualEntry.grid()

    # Define the method to control the min condition value
    def onScaleMin(val):
        global minValue
        minValue = int(val)

    # Define the method to control the max condition value
    def onScaleMax(val):
        global maxValue
        maxValue = int(val)

    #====================================================================
    # Provide the established database config
    GUIDB = 'GuiDB'
    config = {'user': "root", 'password': "root",
              'host': "localhost", 'database': "newDB"}

    # Connect to the newDB database
    connect = mysql.connector.connect(**config)
    cursor = connect.cursor()

    # Basic window frame with the title through tk.Tk() constructor
    winFrame = tk.Tk()
    winFrame.config(bg = "grey")
    winFrame.title("Queries through GUIs")

    try:
        # Attempt to show the tables of the newDB database
        cursor.execute("SHOW TABLES")
        tables = cursor.fetchall()
    except mysql.connector.Error:
        print("There was a problem with reporting the tables")

    tableName = tk.StringVar()
    attributeName = tk.StringVar()
    radioButton = tk.IntVar()
    resultsFrame = None

    # Default limits of the condition until the scales are moved
    minValue = 0
    maxValue = 0

    updateAttributes()
    selectionGUI()

    winFrame.mainloop()

Output 7.4.3.a: Using the grid layout manager with a numerical condition query

Output 7.4.3.b: Using the grid layout manager with a text-based condition query

Conceptually, this script is divided into four parts. The first part provides the GUI element using the selectionGUI() function. This covers the main body of the GUI but excludes the grid where the query data will be reported. When running the application, the user must perform the following actions:

1. Select a table from the connected database through the relevant combo box.
2. Update the attributes combo box with the attributes of the selected table.
3. Select the attribute upon which the condition for the query will be based.
4. Identify whether the attribute is numerical (int) or text-based (char).

The second part, the radioClicked() function, provides the necessary functionality for the user to be able to decide the type of the attribute, through the selection of the relevant radio button. This provides the appropriate partial interface that will enable the creation of the condition. The reader should note how the selection causes the partial interfaces to appear/disappear and be replaced by the most appropriate option based on the selection. This can be further enhanced and automated to include as many conditions as needed.

In the third part, function updateAttributes() is used to update the attributes combo box based on the selected table. Functions onScaleMin() and onScaleMax() are also part of this process, as they allow the user to determine the limits of the condition when a numerical attribute is selected.

Arguably, the most important part of the application is the runQuery() function. The function first prepares the query based on the user's preferences, and subsequently runs it based on the prepared condition. Upon execution, the data grid is displayed as required, with a number of columns dictated by the results of the query. The grid is merely a sequence of labels, one per column for each row of the grid layout manager, that is created on-the-spot and loaded with the results of the previously executed query. In relation to appearance and aesthetics, the reader should also note how alternating the background color of each new row creates a specific color theme for the grid.

It must be stressed that, in this particular application, the grid consists of labels and it is, thus, not possible to edit its contents directly.
If a different widget were used instead (e.g., entry boxes), the contents would be editable and processing (e.g., updating the value of a particular attribute on selected table records) could be applied to the data directly through the grid.

The simple application presented here is just a sample of the use and functionality of the SQL and GUI features provided by Python. As mentioned, SQL provides numerous options and possibilities, and this is reflected in the virtually limitless potential when designing and implementing database applications in Python or other compatible programming languages.

7.5 CASE STUDY

Create an application that provides the following functionality:

a. Prompt the user for their credentials and the name of the MySQL database to connect to. Display a list of the tables that are available in the connected database in a status bar form at the bottom of the application window (Hint: A label can be used for this purpose).
b. Allow the user to define a new table and set the number of its attributes. Based on user selection, create the interface required for the specifications of the attributes in the new table (i.e., attribute name, type and size, primary or foreign key designation). The interface should be created on-the-spot.

The application must use a GUI and the MySQL facilities for the database element.

7.6 EXERCISES

Based on the Employee example, write Python scripts to perform the following tasks using MySQL:

1. Create table DEPT to host departmental data for a company, with the following attributes:
   a. Code → DeptNo, Number (2), not null, primary key.
   b. Department name → Dname, 20 characters.
2. Create table EMP to host employee data, with the following attributes:
   c. Code → Empno, Number (4), not null, primary key.
   d. Name (Last and First) → Ename, 40 characters.
   e. Job → Job, 10 characters.
   f. Manager Code → Mgr, Number (4), internal foreign key to Emp → Empno.
   g. Date Hired → Hiredate, date.
   h. Monthly salary → Sal, Number (7, 2), between 100 and 10,000.
   i. Department code → DeptNo, Number (2), foreign key to Dept → DeptNo.
3. Alter table DEPT to include the following attribute: Location → DLocation, 20 characters.
4. Alter table EMP to include the following attribute: Sales Commission → Comm, Number (7, 2), no more than Sal.
5. Insert five records into DEPT.
6. Insert ten records into EMP, two for each department.
7. Delete the record of the department entered last.

REFERENCES

APACHE. (2021). APACHE Software Foundation. https://apache.org.
Elmasri, R., & Navathe, S. (2017). Fundamentals of Database Systems (Vol. 7). Pearson, Hoboken, NJ.
MAMP. (2021). Download MAMP & MAMP PRO. https://www.mamp.info/en/downloads.
MySQL. (2021). Oracle Corporation. https://www.mysql.com.
Oracle. (2021a). MySQL Documentation. https://dev.mysql.com/doc.
Oracle. (2021b). Oracle.com.

8 Data Analytics and Data Visualization with Python

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Han-I Wang
The University of York

Christos Manolas
The University of York
Ravensbourne University London

CONTENTS

8.1 Introduction
8.2 Importing and Cleaning Data
    8.2.1 Data Acquisition: Importing and Viewing Datasets
    8.2.2 Data Cleaning: Delete Empty or NaN Values
    8.2.3 Data Cleaning: Fill Empty or NaN Values
    8.2.4 Data Cleaning: Rename Columns
    8.2.5 Data Cleaning: Changing and Resetting the Index
8.3 Data Exploration
    8.3.1 Data Exploration: Counting and Selecting Columns
    8.3.2 Data Exploration: Limiting/Slicing Dataset Views
    8.3.3 Data Exploration: Conditioning/Filtering
    8.3.4 Data Exploration: Creating New Data
    8.3.5 Data Exploration: Grouping and Sorting Data
8.4 Descriptive Statistics
    8.4.1 Measures of Central Tendency
    8.4.2 Measures of Spread
    8.4.3 Skewness and Kurtosis
    8.4.4 The describe() and count() Methods
8.5 Data Visualization
    8.5.1 Continuous Data: Histograms
    8.5.2 Continuous Data: Box and Whisker Plot
    8.5.3 Continuous Data: Line Chart
    8.5.4 Categorical Data: Bar Chart
    8.5.5 Categorical Data: Pie Chart
    8.5.6 Paired Data: Scatter Plot
8.6 Wrapping Up
8.7 Case Study
References

DOI: 10.1201/9781003139010-8

8.1 INTRODUCTION

Python is one of the most popular modern programming languages for data analytics, data visualization, and data science tasks in general. Indeed, its reputation as a programming language comes from its efficiency in such tasks and the wealth of related facilities and tools it provides. Its power in addressing data analytics problems comes from the numerous libraries available for it, including Pandas, NumPy, Matplotlib, SciPy, and Seaborn. These libraries provide functionality to read data from a variety of sources, clean data, and perform descriptive and inferential statistics operations. In addition, the libraries provide data visualization facilities, supporting the generation of all types of charts based on the data at hand. Finally, Python is capable of performing the aforementioned tasks on large collections of data, a task commonly referred to as big data analytics.

Observation 8.1 – Data Analytics: Analysis of data from various sources to produce meaningful results that aid the process of decision-making.

Observation 8.2 – Data Visualization: The process of illustrating the results of data analytics through visual means.

Observation 8.3 – Big Data: Data obtained from a large variety of sources, at great velocity, in large volumes, and in a variety of formats.

A formal definition of the term data analytics may be difficult to come up with, as it is a relatively new and rather broad concept in the contemporary business and academic context. However, a possible description could be that the term refers to the efficient analysis of data from various sources to produce meaningful results that aid the process of decision-making. If this were to be extended in order to also capture big data analytics, the associated data would be expected to come from a large variety of sources, at great velocity (i.e., speed), in vast volumes, and in a wide variety of formats, as pointed out in relevant contemporary literature. The term data visualization, another relatively new concept, refers to common mechanisms of illustrating the results of data analytics in the form of various charts, available as visual tools or through built-in methods in programming libraries.

A quick look into any book or resource related to data analytics would reveal that the process is more or less the same, with any minor variations most likely having to do with the terminology rather than the functionality and structure. The process includes the following seven steps:

• Research Objectives/Research Question(s): The first part of the data analytics process is frequently omitted, as it can be deemed an obvious step. However, it is the most essential part of the process and requires effort to develop. To complicate things further, it is a task of purely investigative nature, so limited support is available in terms of specific and automated tools.
It is essentially a process that seeks to establish the objectives and questions to be addressed for the task at hand. It is beyond the scope of this chapter to address these concepts in more detail. For more information, the reader is encouraged to refer to literature related to research methods and methodologies.
• Data Acquisition: The process of reading data stored in a variety of formats and sources, including spreadsheets, comma-separated files, web pages, and databases. Once the data is read, it is stored in a specific type of variable called data frame for further processing.
• Cleaning Data: While the collection of complete and error-free data during the acquisition process is highly desirable, this is seldom the case. Given that the data are entered by users who are often not familiar with the data entry process, it is highly probable and expected to encounter such problems. The process of data cleaning focuses on the removal of these types of errors.
• Exploratory Analysis: This is a process that comes after data cleaning, with the aim of identifying and summarizing the main characteristics of the data. It often involves the application of descriptive statistics methods and analysis.
• Modeling and Validation: This process involves the deployment of advanced tools and techniques, such as machine learning, for building models relating to the data. This task covers broad and deep areas of study and expertise that are beyond the scope of this chapter.
• Visualizing Results: This task relates to the use of various facilities and programming libraries to create charts that help in visualizing the data and assisting in the process of decision-making.
• Reporting: Writing-up of the final reports relating to the data, including any conclusions and recommendations.
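To make the middle steps of the list above more concrete, the following rough sketch walks through acquisition, cleaning, and a first exploratory summary on a tiny in-memory dataset. It deliberately uses only Python's standard library so that it runs without extra packages, whereas the chapter itself relies on Pandas for these tasks. The column names and values are invented for illustration:

```python
import csv
import io
import statistics

# Data acquisition: read comma-separated text (an in-memory sample
# standing in for a real file; the values are made up for illustration)
raw = io.StringIO("Name,Quiz,Project\nAnn,76,55\nBob,,90\nCara,73,80\n")
records = list(csv.DictReader(raw))

# Data cleaning: keep only the records that have no empty fields
clean = [row for row in records if all(value != "" for value in row.values())]

# Exploratory analysis: summarize one column of the cleaned data
quiz_scores = [float(row["Quiz"]) for row in clean]
quiz_mean = statistics.mean(quiz_scores)

print("Records read:", len(records))
print("Records kept:", len(clean))
print("Mean quiz score:", quiz_mean)
```

Each stage hands a plain Python list to the next one, which mirrors, in a very reduced form, how a data frame is passed from acquisition to cleaning and exploration.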
It is apparent that the process involves various fields of expertise, including databases and data mining, artificial intelligence/machine learning, statistics, social science, and others. It is this interdisciplinary nature of the overall process that results in the widely used data science term.

Observation 8.4 – Data Science: An interdisciplinary field that involves databases and data mining, AI/machine learning, statistics, social sciences and other relevant means to analyze and interpret data.

As mentioned, the main Python packages and libraries used for data processing and visualization are Pandas, NumPy, Matplotlib, SciPy, and Seaborn. More specifically, the main characteristics of these libraries are the following:

• NumPy: A library optimized for working with single and multi-dimensional arrays. A tool suitable for machine learning and statistical analysis tasks.
• Pandas: An easy-to-use, open-source library that is based on NumPy. It works particularly well with one and two-dimensional data (Series and DataFrame respectively). It is a good choice for statistical analysis tasks.
• SciPy: Another library based on NumPy. It offers additional functionality compared to NumPy, making it a solid choice for both machine learning and statistical analysis tasks.
• Matplotlib: A low-level plotting library suitable for creating basic graphs. While it provides a lot of freedom to the programmer, it may be rather demanding in terms of coding requirements. One must also be aware of the fact that Matplotlib cannot deal directly with analysis. As such, this needs to be addressed prior to plotting.
• Python's Statistics: A built-in Python library for descriptive statistics. It works rather well when datasets are not too large (Statistics — Mathematical Statistics Functions, 2021).
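As a small illustration of the last item in the list, the sketch below applies a few functions of the built-in statistics module to a short list of values. The values are invented for illustration and are not taken from the chapter's datasets:

```python
import statistics

# Sample values invented for illustration
grades = [50, 60, 70, 80, 90]

mean_grade = statistics.mean(grades)      # arithmetic mean
median_grade = statistics.median(grades)  # middle value of the sorted data
spread = statistics.pstdev(grades)        # population standard deviation

print(mean_grade, median_grade, round(spread, 2))
```

For a handful of values like these, no third-party package is needed. For larger, tabular datasets, the data frame methods of Pandas are usually the more convenient option.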
In this chapter, the reader will have the opportunity to acquire the basic skills required for cleaning and describing data, and for performing data visualization, while becoming familiar with some of the most popular libraries associated with these tasks. This chapter is divided into four main sections:

• Data Acquisition and Cleaning: Import, re-arrange, and clean data from various types of sources.
• Data Exploration: Report data by selecting, sorting, filtering, grouping, and/or re-calculating rows/columns, as necessary.
• Data Processing/Descriptive Statistics: Apply simple descriptive statistics on the data frame.
• Data Visualization: Use the available methods from the various Python packages for data visualization.

Excel files Grades.xlsx and Grades2.xlsx are used for the various examples presented throughout this chapter.

8.2 IMPORTING AND CLEANING DATA

Before discussing the process of importing data for analysis, there are two key terms that need to be presented: arrays/lists and data frames. Unlike other common programming languages like C++ or Java, the core Python language has no distinct built-in array type. Instead, this functionality is provided by the list object, as discussed in Chapter 2. As a quick reminder, a list is a sequence of items that share the same name and are distinguished only by their index. The items are typically, though not necessarily, of the same data type.

Observation 8.5 – Data Frame: Typically, a two-dimensional data structure with rows representing the data records. Records are divided into columns, and indices are used to speed up the searching process within the data frame.

A data frame is a data structure that resembles a relational database table, or an Excel spreadsheet consisting of rows and columns. The rows correspond to the actual records of the data frame and are accessed by their index number. The columns correspond to the attributes/columns/fields in a database table and are accessed by their names.
The index appears as the first column of a data frame and, by default, starts at zero.

8.2.1 Data Acquisition: Importing and Viewing Datasets

The Pandas library is required in order to create the object used to both read the data from the source and create the data frame to which data analysis will be applied. Various sources and formats of data are supported, including Excel and Comma Separated Values (CSV) files, tables, plain text, databases, or web-based sources. In all cases, the basic process of reading from the source remains the same. However, the method and the parameters used may vary slightly, depending on the source. In the case of reading data from Excel files, the general syntax is the following:

    <name of data frame> = <name of Pandas object>.read_excel("<Filename>", sheet_name = "<Sheet name>")

Observation 8.6 – The Pandas Library: The Pandas library provides support for the creation of objects that can be used for various data analytics tasks.

Observation 8.7 – Reading from data sets: Use the read_excel(), read_csv(), or read_html() methods to import (read) data from Excel, CSV, or html files into the data frame.
The following example demonstrates the process of reading data from a particular spreadsheet (Grades 2020) within an Excel file (Grades.xlsx):

    import pandas as pd
    # Read the "Grades 2020" sheet into a data frame and report it
    dataset = pd.read_excel("Grades.xlsx", sheet_name = "Grades 2020")
    print(dataset)

Output 8.2.1.a:

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    1         65.90        49.0    89.0    63.0          54.0       90
    2         69.32        63.5    73.0    54.7          70.0       80
    3         72.02        60.5    99.0    74.7          76.0       70
    4         73.68        74.0    84.0    53.3          64.0       87
    5         61.32        45.5    94.0    42.7          66.0       70
    6         67.87        66.5    73.0    53.7          54.0       87
    7         75.57        66.0    94.0    58.7          92.0       70
    8         61.28        50.5    84.0    37.3          58.0       78
    9          0.00         NaN     NaN     NaN           NaN       69
    10        62.35        48.0    78.0    49.0          70.0       71
    11        66.13        61.0    83.0    45.3          70.0       70
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    14         0.00         NaN     NaN     NaN           NaN       75
    15        62.62        45.5    78.0    56.7          72.0       70
    16         0.00         NaN     NaN     NaN           NaN        0

The above script reports 17 rows/records (indexed 0 to 16) across 6 columns. A few key things are noteworthy in the script output. Firstly, the name of the read_excel() method is case sensitive. This is in line with the general Python syntax rule for methods and statements used in data analytics tasks. Secondly, as mentioned, it is highly unlikely to deal with perfect, clean data during data analysis. More often than not, one has to deal with erroneous, corrupt, or missing data. Missing data include both designated NaN entries and empty cells. Fortunately, there are easy ways to tackle such problems, some of which are described in the following sections. Finally, it is worth mentioning that the print() function can be used to report a given dataset. The function comes in handy in several situations related to reporting data from datasets, and it is discussed further later in this chapter.
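Before any cleaning takes place, it is often useful to know how many values are actually missing. The following is a minimal sketch of one way to count them. It assumes a small, hypothetical in-memory CSV text (csv_text) in place of the chapter's files, so that it can be run without any supporting material; the variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a small grades file:
# one cell is left empty and one is explicitly marked as NaN
csv_text = """Final Grade,Final Exam,Project
58.57,50.5,55
0.00,,69
0.00,NaN,0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# isna() flags every missing cell; sum() counts the flags per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```

Both the empty cell and the cell marked as NaN are read as missing values, which is why they are treated identically by the cleaning methods discussed in this section.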
In the case of reading data from a flat CSV file, the general syntax is the following:

    <name of data frame> = <name of Pandas object>.read_csv("<Filename>.csv", delimiter = ',')

The following script reads and reports the data included in file Grades2.csv:

    import pandas as pd
    # Read the comma-separated file into a data frame and report it
    dataset = pd.read_csv('Grades2.csv', delimiter = ',')
    dataset

Output 8.2.1.b:

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         67.47        59.0      70    72.7            70       72
    1         75.13        61.5      76    68.3            82       87
    2         66.85        77.5      84    52.0            40       80
    3         54.45        34.5      62    44.0            44       90
    4         76.95        66.5      68    67.0            82       92
    5         45.13        26.0      52    26.3            50       68
    6         73.23        63.5      96    68.3            62       89
    7         81.87        83.0      97    82.7            84       72
    8         62.63        54.5      54    31.3            64       87
    9         58.75        46.5      54    39.0            52       90
    10        49.75        27.5      48    37.0            62       70
    11        44.25        21.5      55    18.0            42       80
    12        62.52        31.0      85    54.7            68       89
    13        47.33        16.5      38    33.3            52       89
    14        68.97        55.0      65    49.7            70       94

In the case of reading data from a web page, the general syntax is the following:

    <name of list of data frames> = <name of Pandas object>.read_html("<url>")

Note that read_html() returns a list of data frames, one for each table found in the page, so the required data frame must be selected from that list by its position.

8.2.2 Data Cleaning: Delete Empty or NaN Values

There are two main techniques to clean a dataset. One has to do with correcting erroneous data and the other with dealing with missing values. The cleaning process may include the partial or complete deletion of the related rows, or the replacement of cells that contain missing data with specific calculated or predefined values.

Observation 8.8 – Drop NaN or Empty Values: Use the dropna() method to delete rows with NaN or empty values from a data frame. The method can be used with the how parameter ("all" or "any" values); if the parameter is omitted, "any" is assumed.

In the case of deletion, there are two possible scenarios: a row may contain missing or designated NaN values in some or in all of its columns.
If it is decided to delete all the rows that contain missing data, the following syntax should be used:

    <name of new Data Frame> = <name of original Data Frame>.dropna()

The following script demonstrates the application of the dropna() method that deletes all rows with cells that include NaN values:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    # Keep only the rows that have no missing values
    dframe_no_missing_data = dataset.dropna()
    dframe_no_missing_data

Using the dropna(how = "any") form instead of the simple dropna() form produces the same result, that is, it deletes any row that contains at least one NaN or empty value. The full syntax in this case is very similar to the previous one:

    <name of new Data Frame> = <name of original Data Frame>.dropna(how = "any")

The following Python script provides an example of this method applied to the same data frame:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    # Drop every row that contains at least one missing value
    dframe_delete_rows_with_any_na_values = dataset.dropna(how = "any")
    dframe_delete_rows_with_any_na_values

Output 8.2.2.a:

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    1         65.90        49.0    89.0    63.0          54.0       90
    2         69.32        63.5    73.0    54.7          70.0       80
    3         72.02        60.5    99.0    74.7          76.0       70
    4         73.68        74.0    84.0    53.3          64.0       87
    5         61.32        45.5    94.0    42.7          66.0       70
    6         67.87        66.5    73.0    53.7          54.0       87
    7         75.57        66.0    94.0    58.7          92.0       70
    8         61.28        50.5    84.0    37.3          58.0       78
    10        62.35        48.0    78.0    49.0          70.0       71
    11        66.13        61.0    83.0    45.3          70.0       70
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    15        62.62        45.5    78.0    56.7          72.0       70

The reader should note that 3 of the 17 original rows (those with index 9, 14, and 16) are deleted from the data frame as a result of running the two versions of the script, irrespective of whether the dropna() or the dropna(how = "any") form is used.
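The equivalence of the two forms can be verified directly, and it is instructive to compare them with the how = "all" form that the text turns to next. The sketch below is illustrative only: it assumes a small, hypothetical data frame (sample) in which one row is partly empty and another is entirely empty, rather than the chapter's Grades.xlsx file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 is partly empty, row 2 is entirely empty
sample = pd.DataFrame({
    "Final Exam": [50.5, np.nan, np.nan, 74.0],
    "Quiz 1": [76.0, 89.0, np.nan, 84.0],
})

no_arg = sample.dropna()            # default behaviour
any_na = sample.dropna(how="any")   # drop rows with at least one NaN
all_na = sample.dropna(how="all")   # drop rows in which every cell is NaN

# The default form and how="any" give identical data frames
print(no_arg.equals(any_na))
print(len(sample) - len(any_na), "row(s) removed with how='any'")
print(len(sample) - len(all_na), "row(s) removed with how='all'")
```

In this sample, how = "any" removes both incomplete rows, whereas how = "all" removes only the row that has no values at all.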
If it is decided to delete only the rows in which all columns contain NaN or empty values, the following syntax of the dropna() method should be used:

    <name of new Data Frame> = <name of original Data Frame>.dropna(how = "all")

The following script and its output demonstrate the use of the dropna() method, with parameters that result in the deletion of rows consisting exclusively of cells with NaN values. Note that none of the 17 original rows are deleted from the data frame as a result of the method call, since rows 9, 14, and 16 still hold values in the Final Grade and Project columns.

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    # Drop only the rows in which every cell is missing
    dframe_delete_rows_with_all_na_values = dataset.dropna(how = "all")
    dframe_delete_rows_with_all_na_values

Output 8.2.2.b:

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    1         65.90        49.0    89.0    63.0          54.0       90
    2         69.32        63.5    73.0    54.7          70.0       80
    3         72.02        60.5    99.0    74.7          76.0       70
    4         73.68        74.0    84.0    53.3          64.0       87
    5         61.32        45.5    94.0    42.7          66.0       70
    6         67.87        66.5    73.0    53.7          54.0       87
    7         75.57        66.0    94.0    58.7          92.0       70
    8         61.28        50.5    84.0    37.3          58.0       78
    9          0.00         NaN     NaN     NaN           NaN       69
    10        62.35        48.0    78.0    49.0          70.0       71
    11        66.13        61.0    83.0    45.3          70.0       70
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    14         0.00         NaN     NaN     NaN           NaN       75
    15        62.62        45.5    78.0    56.7          72.0       70
    16         0.00         NaN     NaN     NaN           NaN        0

8.2.3 Data Cleaning: Fill Empty or NaN Values

It is often the case that empty cells or cells with NaN values are filled with either predefined values or values calculated based on the rest of the data. In such cases, instead of the dropna() method (in any of its forms), one can use the fillna(<value>[, inplace = True]) method.

Observation 8.9 – Fill NaN or Empty Values: Use the fillna() method to define replacement values for any NaN or empty values encountered.

The general syntax of the method is the following:

    <name of new Data Frame> = <name of original Data Frame>.fillna(<value>)
    <name of original Data Frame>.fillna(<value>, inplace = True)

The value can be defined before running the script, based on existing dataset values and/or other calculations (e.g., using the mean of the existing data in the same column). The inplace parameter, if set to True, applies the change directly to the dataset on which the method is called, in which case no assignment is needed. While the False value can also be used, this would not make much sense, since it is the default value when inplace is not used.

The following script and its output demonstrate the use of the fillna() method, while also applying the inplace parameter to enable the permanent change of the data. The replacement value used in this example for empty or missing values is zero. The reader should note that the inplace parameter affects only the dataset resulting from the execution of the script, and not the data source:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    # Replace every missing value with zero, directly in the dataset
    dataset.fillna(0, inplace = True)
    dataset

Output 8.2.3:

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    1         65.90        49.0    89.0    63.0          54.0       90
    2         69.32        63.5    73.0    54.7          70.0       80
    3         72.02        60.5    99.0    74.7          76.0       70
    4         73.68        74.0    84.0    53.3          64.0       87
    5         61.32        45.5    94.0    42.7          66.0       70
    6         67.87        66.5    73.0    53.7          54.0       87
    7         75.57        66.0    94.0    58.7          92.0       70
    8         61.28        50.5    84.0    37.3          58.0       78
    9          0.00         0.0     0.0     0.0           0.0       69
    10        62.35        48.0    78.0    49.0          70.0       71
    11        66.13        61.0    83.0    45.3          70.0       70
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    14         0.00         0.0     0.0     0.0           0.0       75
    15        62.62        45.5    78.0    56.7          72.0       70
    16         0.00         0.0     0.0     0.0           0.0        0

8.2.4 Data Cleaning: Rename Columns

It is sometimes required to change the column headings in a dataset. This is especially true in the case of formal reports, where clarity and appearance are key. In such cases, the rename() method is used.
The method allows for the temporary change of the column headings without affecting the original dataset at the source.

Observation 8.10 – rename(): Use the rename() method to change the column heading appearance. Use the dictionary notation to dictate the old and new (temporary) column names.

The general syntax is the following:

    df.rename(columns = {"oldname": "newname", …} [, inplace = True])

As in the previous case, if the inplace parameter is used, the column names will be changed for the resulting dataset, but the source data will not be affected. The most crucial aspect of the syntax is that the programmer can change any number of column names just by separating the pairs using commas:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    # Map each old column name to its new name
    dataset_new = dataset.rename(columns = {"Final Grade": "Total Grade",
                                            "Quiz 1": "Test 1", "Quiz 2": "Test 2",
                                            "Midterm Exam": "Midterm"})
    dataset_new

Output 8.2.4:

        Total Grade  Final Exam  Test 1  Test 2  Midterm  Project
    0         58.57        50.5    76.0    70.7     60.0       55
    1         65.90        49.0    89.0    63.0     54.0       90
    2         69.32        63.5    73.0    54.7     70.0       80
    3         72.02        60.5    99.0    74.7     76.0       70
    4         73.68        74.0    84.0    53.3     64.0       87
    5         61.32        45.5    94.0    42.7     66.0       70
    6         67.87        66.5    73.0    53.7     54.0       87
    7         75.57        66.0    94.0    58.7     92.0       70
    8         61.28        50.5    84.0    37.3     58.0       78
    9          0.00         NaN     NaN     NaN      NaN       69
    10        62.35        48.0    78.0    49.0     70.0       71
    11        66.13        61.0    83.0    45.3     70.0       70
    12        69.43        50.0    80.0    49.3     90.0       76
    13        82.60        74.0    94.0    65.0     86.0       92
    14         0.00         NaN     NaN     NaN      NaN       75
    15        62.62        45.5    78.0    56.7     72.0       70
    16         0.00         NaN     NaN     NaN      NaN        0

The reader should note the use of the dictionary notation (pairs of the form "old name": "new name" enclosed in curly brackets) to declare the column names when changing them. It must also be noted that, unless inplace = True is used, the result of the rename() method must be assigned to a new dataset before it is reported in order for the change to apply.
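The two cleaning steps of Sections 8.2.3 and 8.2.4 can also be chained. The following minimal sketch fills missing values with the mean of their own column, an option mentioned earlier but not demonstrated, and then renames the columns. It assumes a small, hypothetical data frame (sample) rather than the chapter's files, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing quiz mark
sample = pd.DataFrame({
    "Final Grade": [58.57, 0.00, 65.90],
    "Quiz 1": [76.0, np.nan, 89.0],
})

# mean() skips NaN by default, so each column mean uses existing values only
column_means = sample.mean()

# Passing the means to fillna() fills each column with its own mean
filled = sample.fillna(column_means)

# rename() returns a new data frame; the original keeps its column names
renamed = filled.rename(columns={"Final Grade": "Total Grade",
                                 "Quiz 1": "Test 1"})

print(renamed)
print(list(sample.columns))
```

Since neither call uses inplace = True, the original sample data frame still contains the missing value and the original column names after the script has run.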
8.2.5 Data Cleaning: Changing and Resetting the Index

The index of a dataset is important, as it can speed up the process of data searching. This is particularly relevant when searching for or sorting data on a column of the dataset different than the one the focus is on. In such a case, it is convenient to temporarily change the indexed column to perform the task at hand, and return to the original state by resetting the index once this is completed.

Observation 8.11 – set_index(), reset_index(): Use the set_index() and reset_index() methods to set the index of the dataset to another column and restore it back to the original one.

The general syntax for changing and resetting the index in a dataset is the following:

    <name of dataset>.set_index("<column name>" [, inplace = True])
    <name of dataset>.reset_index([inplace = True])

8.3 DATA EXPLORATION

Data exploration is an umbrella term, encompassing processes used to report data in various ways. For example, it may refer to the process of row/column selection for inclusion in the report, or to facilities used to sort and/or filter data based on certain, defined conditions. If necessary, it offers options to group the data in one or more columns and the functionality to create new columns based on calculations on existing ones. This section will explore some of the most important concepts and methods related to data exploration.

8.3.1 Data Exploration: Counting and Selecting Columns

Three of the basic facilities used in order to view the data of a dataset are len(), columns, and shape. The len() function reports the number of records in the dataset. The general syntax is the following:

    len(<name of dataset>)

Observation 8.12 – len(): Use the len() function and the columns and shape attributes of a dataset to report the number of its records, the names of its attributes, and the number of its records and columns, respectively.
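Since Section 8.2.5 provides only the syntax for changing and resetting the index, the sketch below tries both methods on a small data frame and, at the same time, applies the counting facilities summarized in Observation 8.12. The data frame and its Student column are hypothetical and serve purely as an illustration; they are not part of the chapter's files.

```python
import pandas as pd

sample = pd.DataFrame({
    "Student": ["s01", "s02", "s03"],   # hypothetical identifier column
    "Final Grade": [58.57, 65.90, 69.32],
})

# Use the Student column as the index, so records can be looked up by it
by_student = sample.set_index("Student")
print(by_student.loc["s02", "Final Grade"])

# reset_index() turns the index back into an ordinary column
restored = by_student.reset_index()
print(restored.equals(sample))

# Number of records, (records, columns) pair, and column names
print(len(sample), sample.shape, list(sample.columns))
```

While Student serves as the index it is no longer counted as a column, so by_student.shape reports one column instead of two.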
The columns attribute can be used to get a list of the available columns in the dataset, with the following syntax:

    <name of dataset>.columns

Finally, the shape attribute can be used to report the number of records and columns in a dataset:

    <name of dataset>.shape

The following script uses all three of the above, while also including a basic statement to display all the data in the dataset:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
             "Midterm Exam", "Project"]]
    dataset
    len(dataset)
    dataset.columns
    dataset.shape

Output 8.3.1.a: Basic exploration methods without print

    (17, 6)

It should be noted that the script fails to display all the requested output. When run in an interactive notebook, only the value of the last expression is displayed, in this case the result of the application of shape: the number of records and columns. If it is necessary to display all the requested information, the print() function should be used, as in the amended version of the script below:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
             "Midterm Exam", "Project"]]
    print(dataset)
    print("The dataset has", len(dataset), "records")
    print("The columns in the dataset are:", dataset.columns)
    print("The number of records is:", dataset.shape[0])
    print("The number of columns is:", dataset.shape[1])

Output 8.3.1.b: Basic exploration methods using print

        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    1         65.90        49.0    89.0    63.0          54.0       90
    2         69.32        63.5    73.0    54.7          70.0       80
    3         72.02        60.5    99.0    74.7          76.0       70
    4         73.68        74.0    84.0    53.3          64.0       87
    5         61.32        45.5    94.0    42.7          66.0       70
    6         67.87        66.5    73.0    53.7          54.0       87
    7         75.57        66.0    94.0    58.7          92.0       70
    8         61.28        50.5    84.0    37.3          58.0       78
    9          0.00         NaN     NaN     NaN           NaN       69
    10        62.35        48.0    78.0    49.0          70.0       71
    11        66.13        61.0    83.0    45.3          70.0       70
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    14         0.00         NaN     NaN     NaN           NaN       75
    15        62.62        45.5    78.0    56.7          72.0       70
    16         0.00         NaN     NaN     NaN           NaN        0
    The dataset has 17 records
    The columns in the dataset are: Index(['Final Grade', 'Final Exam', 'Quiz 1', 'Quiz 2',
           'Midterm Exam', 'Project'], dtype='object')
    The number of records is: 17
    The number of columns is: 6

As shown above, it is possible to improve the output appearance by adding appropriate text through the print() function. Obviously, the presentation of the results could be further improved with the use of more elaborate presentation techniques and tools, such as an appropriate GUI.

8.3.2 Data Exploration: Limiting/Slicing Dataset Views

It is often impractical to display all the data in a single report. This is especially true when working with very large datasets. In such cases, it is preferable to display just a sample of the dataset, by limiting the number of records and/or columns. There are a number of methods that can be used for this task. Methods head(n) and tail(n) restrict the number of the displayed records, either at the top or the bottom of the dataset.

Observation 8.13 – head(), tail(): Use the head(n) and tail(n) methods to restrict the number of displayed records from the top and bottom of the dataset. Use the loc[] or iloc[] attributes to restrict the report to the specified rows and columns using labels or indices.

The general syntax is the following:
    <name of dataset>.head(<number of rows from the top>)
    <name of dataset>.tail(<number of rows from the bottom>)

Plain slicing and the loc[] and iloc[] attributes can be used to restrict the displayed results based on specific rows and/or columns:

    <name of dataset>[<start record number>:<end record number>[:<step>]]
    <name of dataset>.loc[<start record label>:<end record label>[:<step>], "<start column name>":"<end column name>"]
    <name of dataset>.iloc[<start record number>:<end record number>, <start column index>:<end column index>]

The practical application of these methods and attributes is demonstrated in the following script:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx', sheet_name = "Grades 2020")
    print(dataset.head(5))
    print(dataset.tail(5))
    print(dataset[0:37:5])
    print(dataset.loc[0:5, "Final Grade":"Final Exam"])
    print(dataset.iloc[0:5, 0:3])

Output 8.3.2:

       Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0        58.57        50.5    76.0    70.7          60.0       55
    1        65.90        49.0    89.0    63.0          54.0       90
    2        69.32        63.5    73.0    54.7          70.0       80
    3        72.02        60.5    99.0    74.7          76.0       70
    4        73.68        74.0    84.0    53.3          64.0       87
        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    12        69.43        50.0    80.0    49.3          90.0       76
    13        82.60        74.0    94.0    65.0          86.0       92
    14         0.00         NaN     NaN     NaN           NaN       75
    15        62.62        45.5    78.0    56.7          72.0       70
    16         0.00         NaN     NaN     NaN           NaN        0
        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0         58.57        50.5    76.0    70.7          60.0       55
    5         61.32        45.5    94.0    42.7          66.0       70
    10        62.35        48.0    78.0    49.0          70.0       71
    15        62.62        45.5    78.0    56.7          72.0       70
       Final Grade  Final Exam
    0        58.57        50.5
    1        65.90        49.0
    2        69.32        63.5
    3        72.02        60.5
    4        73.68        74.0
    5        61.32        45.5
       Final Grade  Final Exam  Quiz 1
    0        58.57        50.5    76.0
    1        65.90        49.0    89.0
    2        69.32        63.5    73.0
    3        72.02        60.5    99.0
    4        73.68        74.0    84.0

In the output, the reader will notice that with the application of head(5) and tail(5), only the first five and the last five records of the dataset are displayed (with all their columns).
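The difference between label-based and position-based slicing is easy to overlook, so it may help to check it on a few records. The sketch below is illustrative only and assumes a small, hypothetical data frame (sample) built in memory from the first six records shown above, instead of reading Grades.xlsx.

```python
import pandas as pd

sample = pd.DataFrame({
    "Final Grade": [58.57, 65.90, 69.32, 72.02, 73.68, 61.32],
    "Final Exam": [50.5, 49.0, 63.5, 60.5, 74.0, 45.5],
    "Quiz 1": [76.0, 89.0, 73.0, 99.0, 84.0, 94.0],
})

# loc[] works with labels and includes both ends of each slice
by_label = sample.loc[0:2, "Final Grade":"Final Exam"]

# iloc[] works with positions and excludes the end of each slice
by_position = sample.iloc[0:2, 0:2]

print(by_label.shape)
print(by_position.shape)

# Non-adjacent columns are passed as a list inside square brackets
picked = sample.loc[0:2, ["Final Grade", "Quiz 1"]]
print(list(picked.columns))
```

With the same 0:2 row slice, loc[] returns three records (labels 0, 1, and 2), while iloc[] returns two (positions 0 and 1).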
In the third block of the output, records are displayed at intervals of five, starting from record zero; since the end value of the slice (37) exceeds the number of records, the report simply stops at the end of the dataset. The fourth block displays six records, as loc[] includes both the start and the end label, using only the first three columns (inclusive of the index of the dataset). In a similar way, the last block shows the first five records, as iloc[] excludes the end position, using only the first four columns (inclusive of the index of the dataset), but the columns are specified by their index and not their names. If it is required to report on non-sequential columns, these columns must be included in square brackets ([]) and separated by commas.

8.3.3 Data Exploration: Conditioning/Filtering

As expected, Pandas also offers a set of methods that allow for the filtering of the displayed data through conditioning. For instance, the unique() method displays only the first occurrence of recurring data values from the specified column:

    <name of dataset>["<name of column>"].unique()

Observation 8.14 – unique(): Use the unique() method and the square bracket ([]) list notation to report unique data in a dataset based on a specified column and to set the conditions for the reported records.

It is also possible to define a particular condition that limits the displayed results, as in the case of an if statement. The condition can be simple (single) or complex.
The general syntax is the following:

    <name of dataset>[<condition>]
    <name of dataset>[(<condition>) &/| (<condition>)]

The following script uses the data from the Grades.xlsx file to identify unique grades for the project, and to report all final grades that are higher than 80%, as well as those that are above 0% and below 60%:

    import pandas as pd
    dataset = pd.read_excel('Grades.xlsx')
    print("Unique grades for project:", dataset["Project"].unique())
    print("Final grades more than 80%:\n", dataset[dataset["Final Grade"] > 80])
    print("Final grades 1% to 60%:\n",
          dataset[(dataset["Final Grade"] > 0) & (dataset["Final Grade"] < 60)])

Output 8.3.3:

    Unique grades for project: [55 90 80 70 87 78 69 71 76 92 75 0]
    Final grades more than 80%:
         Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    13         82.6        74.0    94.0    65.0          86.0       92
    Final grades 1% to 60%:
        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0        58.57        50.5    76.0    70.7          60.0       55

The reader should note that it is possible to limit the displayed columns if the loc[] attribute is also used, although this is not shown in the current script and its output. It is also worth mentioning that, in a compound condition like the second one in the example, the & and | operators must be used instead of the and and or keywords, and each individual condition must be enclosed in parentheses.

8.3.4 Data Exploration: Creating New Data

As part of the data exploration process, it is sometimes necessary to create new data. This can take four different forms:

• Merging two or more datasets into one.
• Creating a new column with data derived from other available data sources, in the same or other datasets.
• Creating a new column with data calculated from other available data sources, in the same or other datasets.
• Creating a new file of a certain file type (e.g., Excel, CSV).

The append() method has traditionally been used to merge two or more datasets. It was deprecated in Pandas 1.4 and removed in Pandas 2.0, where the pd.concat() function takes its place. The basic syntax of the two forms is the following:

    <name of new dataset> = <name of first old dataset>.append(<name of second old dataset>)
    <name of new dataset> = pd.concat([<name of first old dataset>, <name of second old dataset>])

To create a new column with values calculated based on data of other columns one can use the following command:

    <name of dataset>["<name of new column>"] = expression with other columns

Observation 8.15 – Create New Column: Use the following expression and syntax to create a new column based on the values of other columns from the same or other datasets:
<name of dataset>["<name of new column>"] = expression with other columns

If the newly created column is based on certain conditions applied to data from other columns the following commands could be used instead:

    <name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)

or

    <name of dataset>["<name of new column>"] = np.select(<condition set>, <set of values>)

Observation 8.16 – Create a New Column Using np.where() or np.select(): Use Numpy's np.where() or np.select() functions and the following syntax to create a new column based on a simple or complex condition. This can include other columns from the same or other datasets:
<name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)
<name of dataset>["<name of new column>"] = np.select(<condition set>, <set of values>)

Finally, to create a new dataset and store it in a file, one of the following command structures could be used. The examples provided here cover Excel and CSV files, but the same logic also applies to other data file formats.
Excel files:

    <name of new Excel file object> = pd.ExcelWriter("<name of new Excel file>")
    <name of dataset>.to_excel(<name of new Excel file object>, sheet_name = "<sheet name>")
    <name of new Excel file object>.close()

In Pandas versions before 2.0, the save() method could be used in place of close() in the last statement.

CSV files:

    <name of dataset>.to_csv("<name of new CSV file>")

Observation 8.17 – Create a New Excel File: Use the following syntax to create a new Excel file from a given dataset:
<name of new Excel file object> = pd.ExcelWriter("<name of new Excel file>")
<name of dataset>.to_excel(<name of new Excel file object>, sheet_name = "<sheet name>")
<name of new Excel file object>.close()

Observation 8.18 – Create a New CSV File: Use the following syntax to create a new CSV file from a given dataset:
<name of dataset>.to_csv("<name of new CSV file>")

Using the Grades.xlsx dataset as an example, student grades are stored for a particular section of a course in a particular semester. If another dataset for the same course but a different section exists in another file (e.g., Grades2.csv), it may be useful to merge the two and perform the necessary processes in the newly created dataset. The following script reads two different files (i.e., Excel and CSV), reports their data, appends the second dataset at the end of the first, defines the conditions, and creates new columns with values calculated from the data of other columns. Finally, it saves the new dataset in both Excel and CSV formats:

    import pandas as pd
    import numpy as np

    dataset1 = pd.read_excel("Grades.xlsx")
    print("The data in Grades file are:"); print(dataset1.head(3))
    dataset2 = pd.read_csv('Grades2.csv')
    print("The data in Grades2 file are:"); print(dataset2.tail(3))

    # Merge the two datasets; pd.concat() replaces
    # dataset1.append(dataset2) in Pandas 2.0 and later
    dataset = pd.concat([dataset1, dataset2])
    print("The new merge dataset is:"); print(dataset.head(3))
    print(dataset.tail(3))

    # The conditions for the Letter Grades
    conditions = [(dataset["Final Grade"] > 90.0),
        (dataset["Final Grade"] > 80.0) & (dataset["Final Grade"] <= 89.9),
        (dataset["Final Grade"] > 70.0) & (dataset["Final Grade"] <= 79.9),
        (dataset["Final Grade"] > 60.0) & (dataset["Final Grade"] <= 69.9),
        (dataset["Final Grade"] < 59.9)
        ]

    # The list of Grade Letters based on the conditions
    gradeLetters = ["A", "B", "C", "D", "F"]

    # Create a new Letter Grades column in the new dataset using numpy
    dataset["Letter Grade"] = np.select(conditions, gradeLetters)

    # Create a new Course Work column as a weighted sum of other columns
    dataset["Course Work"] = dataset["Quiz 1"]*0.1 + dataset["Quiz 2"]*0.1 + \
        dataset["Midterm Exam"]*0.25 + dataset["Project"]*0.25

    print("A partial view of the new dataset:")
    # Find the number of records in the dataset
    rowNum = len(dataset)
    # Select the columns to be displayed in the report
    cols = [7, 1, 0, 6]
    print(dataset.iloc[:rowNum:5, cols])

    # Save the new dataset as an Excel file
    newExcel = pd.ExcelWriter("NewGrades.xlsx")
    dataset.to_excel(newExcel, sheet_name = "New Data")
    newExcel.close()    # save() in Pandas versions before 2.0

    # Save the new dataset as a CSV file
    dataset.to_csv("newGrades.csv")

Output 8.3.4:

    The data in Grades file are:
       Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0        58.57        50.5    76.0    70.7          60.0       55
    1        65.90        49.0    89.0    63.0          54.0       90
    2        69.32        63.5    73.0    54.7          70.0       80
    The data in Grades2 file are:
        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    12        62.52        31.0      85    54.7            68       89
    13        47.33        16.5      38    33.3            52       89
    14        68.97        55.0      65    49.7            70       94
    The new merge dataset is:
       Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    0        58.57        50.5    76.0    70.7          60.0       55
    1        65.90        49.0    89.0    63.0          54.0       90
    2        69.32        63.5    73.0    54.7          70.0       80
        Final Grade  Final Exam  Quiz 1  Quiz 2  Midterm Exam  Project
    12        62.52        31.0    85.0    54.7          68.0       89
    13        47.33        16.5    38.0    33.3          52.0       89
    14        68.97        55.0    65.0    49.7          70.0       94
    A partial view of the new dataset:
        Course Work  Final Exam  Final Grade Letter Grade
    0         43.42        50.5        58.57            F
    5         47.67        45.5        61.32            D
    10        47.95        48.0        62.35            D
    15        48.97        45.5        62.62            D
    3         44.10        34.5        54.45            F
    8         46.28        54.5        62.63            D
    13        42.38        16.5        47.33            F

Some key observations can be made based on this script. Firstly, it is possible, and indeed common, for the programmer to require the merging of datasets from files of different file types. In this instance, the script merges a dataset stored in an Excel file with one in a CSV file. Secondly, although it is possible to use multiple lines of code to define the values of a new column based on different conditions, a more efficient option is to define the list of conditions and the list of their paired values in advance, and subsequently pass both lists to the np.select() function from the Numpy library. Thirdly, it is possible to create a new column based on simple or complex expressions that include other columns. Fourthly, it may be more convenient to define the displayed records and columns as variables and use them in a statement, rather than directly adding the associated constraints to the statement. Finally, the reader should note that the sequence of statements used to create a new Excel file is different than that for a CSV file. Such differences also exist for files of other formats.

8.3.5 Data Exploration: Grouping and Sorting Data

Data grouping is one of the most important data processing tasks, and is usually carried out before other tasks commence.
This is commonly coupled with data sorting, and the two tasks together constitute a key building block for the production of professional reports. Unsurprisingly, Python provides facilities for both of these tasks. In order to group data within a dataset, the groupby() method can be used.

Observation 8.19 – Grouping Data: Use the groupby() method to group a dataset based on one or more columns. The method must be used with either an aggregate method (e.g., mean()) or with the apply(lambda x: x[…]) statement for non-aggregate groupings.

The general syntax is the following:

    <name of dataset>.groupby(["<name of column>" [, "<name of column>", …]]).<aggregate function>

It must be noted that the method requires the application of an aggregation (e.g., mean) to the grouped data, a concept covered in the following section. Alternatively, if the goal is to simply display the report grouped by a specific column, the apply() method can be used with the following syntax:

    <name of dataset>.groupby(["<name of column>" [, "<name of column>", …]]).apply(lambda x: x[<rows>])

The apply() method replaces the aggregation with the lambda x: x[…] expression in order to specify the records that should be displayed in the report. The reader should also note that if more than one column is used for the grouping, the data will be initially grouped based on the first selected column. After that point, data will be grouped within each separate group based on the second column.

For the purposes of data sorting, the sort_values() method is used. The general syntax is the following:

    <name of dataset>.sort_values(["<name of column>" [, "<name of column>", …]] [, ascending = False])

Observation 8.20 – Sorting Data: Use the sort_values() method to sort a dataset based on one or more specified columns.

As with data grouping, the reader should note that if more than one column is specified, the data are sorted based on the first column, and records with the same value in that column are then sorted based on the second column.
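As a brief illustration of the two syntax forms above, the following sketch applies an aggregate (mean()) to grouped data and sorts on two columns. It is a minimal example under stated assumptions: the data frame (sample) is hypothetical and built in memory, and the variable names are chosen only for this illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Letter Grade": ["D", "F", "D", "C", "F"],
    "Final Grade": [65.90, 58.57, 61.32, 72.02, 54.45],
    "Project": [90, 55, 70, 70, 90],
})

# Group by Letter Grade and aggregate each group with its mean Final Grade
mean_by_letter = sample.groupby("Letter Grade")["Final Grade"].mean()
print(mean_by_letter)

# Sort by Letter Grade first (ascending); records with the same letter
# are then sorted by Final Grade (descending)
ordered = sample.sort_values(["Letter Grade", "Final Grade"],
                             ascending=[True, False])
print(ordered)
```

Passing a list to the ascending parameter sets the sort direction separately for each of the listed columns.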
It is also possible to combine the functionality of groupby() and sort_values() by applying the former first and then calling the sort_values() method inside the lambda expression that is passed to apply().

The following script reads a CSV file and groups and reports its data based on the Letter Grade column, displaying only columns Letter Grade and Final Grade. Next, it creates a second dataset and sorts the values based on the Final Grade in descending order. Finally, it utilizes the apply() method to group the data based on Letter Grade and sort them based on Final Grade:

    import pandas as pd

    dataset = pd.read_csv('newGrades.csv')

    # Report the number of records in the dataset
    rows = len(dataset)

    # Report the records grouped by Letter Grade
    dataset1 = dataset[["Letter Grade", "Final Grade"]]
    print(dataset1.groupby(["Letter Grade"]).apply(lambda x: x[0:rows]))

    # Report the records sorted by Final Grade
    dataset2 = dataset[["Letter Grade", "Final Grade"]]
    print(dataset2.sort_values(["Final Grade"], ascending = False))

    # Report the records firstly grouped by Letter Grade and
    # then sorted by Final Grade (within groups)
    dataset3 = dataset[["Letter Grade", "Final Grade"]]
    print(dataset3.groupby(["Letter Grade"]).
          apply(lambda x: x.sort_values(["Final Grade"], ascending = False)))

Output 8.3.5.a (grouped by Letter Grade):

                     Letter Grade  Final Grade
    Letter Grade
    B            13             B        82.60
                 24             B        81.87
    C            3              C        72.02
                 4              C        73.68
                 7              C        75.57
                 18             C        75.13
                 21             C        76.95
                 23             C        73.23
    D            1              D        65.90
                 2              D        69.32
                 5              D        61.32
                 6              D        67.87
                 8              D        61.28
                 10             D        62.35
                 11             D        66.13
                 12             D        69.43
                 15             D        62.62
                 17             D        67.47
                 19             D        66.85
                 25             D        62.63
                 29             D        62.52
                 31             D        68.97
    F            0              F        58.57
                 9              F         0.00
                 14             F         0.00
                 16             F         0.00
                 20             F        54.45
                 22             F        45.13
                 26             F        58.75
                 27             F        49.75
                 28             F        44.25
                 30             F        47.33

Output 8.3.5.b (sorted by Final Grade):

        Letter Grade  Final Grade
    13             B        82.60
    24             B        81.87
    21             C        76.95
    7              C        75.57
    18             C        75.13
    4              C        73.68
    23             C        73.23
    3              C        72.02
    12             D        69.43
    2              D        69.32
    31             D        68.97
    6              D        67.87
    17             D        67.47
    19             D        66.85
    11             D        66.13
    1              D        65.90
    25             D        62.63
    15             D        62.62
    29             D        62.52
    10             D        62.35
    5              D        61.32
    8              D        61.28
    26             F        58.75
    0              F        58.57
    20             F        54.45
    27             F        49.75
    30             F        47.33
    22             F        45.13
    28             F        44.25
    14             F         0.00
    9              F         0.00
    16             F         0.00

Output 8.3.5.c (grouped by Letter Grade and sorted by Final Grade):

                     Letter Grade  Final Grade
    Letter Grade
    B            13             B        82.60
                 24             B        81.87
    C            21             C        76.95
                 7              C        75.57
                 18             C        75.13
                 4              C        73.68
                 23             C        73.23
                 3              C        72.02
    D            12             D        69.43
                 2              D        69.32
                 31             D        68.97
                 6              D        67.87
                 17             D        67.47
                 19             D        66.85
                 11             D        66.13
                 1              D        65.90
                 25             D        62.63
                 15             D        62.62
                 29             D        62.52
                 10             D        62.35
                 5              D        61.32
                 8              D        61.28
    F            26             F        58.75
                 0              F        58.57
                 20             F        54.45
                 27             F        49.75
                 30             F        47.33
                 22             F        45.13
                 28             F        44.25
                 9              F         0.00
                 14             F         0.00
                 16             F         0.00

The output shows the results of the reports for the three datasets in succession, as produced by the script: the results of groupby() based on Letter Grade, the results of sort_values() based on Final Grade, and the dataset grouped by Letter Grade and sorted by Final Grade.

8.4 DESCRIPTIVE STATISTICS

Descriptive statistics are defined as the analysis of data that describe, show, or summarize information in a meaningful manner.
They are simply a way of describing the data and they do not draw conclusions, make predictions, or test hypotheses based on the data, all of which form a specific branch of statistical analysis referred to as inferential statistics (covered in Chapter 9). This section provides introductions to basic concepts relating to descriptive statistics and how Python is used to carry out various descriptive analysis tasks.

Before performing any statistical task, it is useful to distinguish and identify the type(s) of data that will be analysed, as this largely dictates the most appropriate descriptive statistics and data visualisation techniques for the task at hand.

Observation 8.21 – Descriptive Statistics: A branch of data analysis that describes, displays, or summarizes information without drawing conclusions, making predictions, or testing hypotheses.

Observation 8.22 – Categorical and Continuous Data: Categorical data are data that can be divided into groups or classes but with no numerical relationship. Continuous data are numerical data that can be used for counting or measurements.

In a broad context, data can be simply categorized into two types: categorical and continuous. Categorical data are data that can be divided into groups or classes that do not have a numerical or hierarchical relationship (e.g., gender). Continuous data are numerical, and can include counting (i.e., integers) or measurements (i.e., any numerical values). The reader should become familiar with these two terms, as they are used extensively throughout this section.

8.4.1 Measures of Central Tendency

There are two main ways to explore and describe continuous data: (a) measuring their central tendency and (b) measuring their spread. The following sections introduce and briefly discuss these two concepts. The measures of central tendency show the central or middle values of datasets.
Hence, they are also frequently referred to as measures of central location. There are three different measures that can be considered as the centre of a dataset, namely mean, median, and mode.

The mean, also called the arithmetic mean, is a popular measure of central tendency. It is the average of the data in a dataset, and is calculated as the sum of all the data values divided by the number of cases in the dataset. The mean can fail to describe the central location of the data if there are outliers present or if the data are skewed.

The median is the middle point of a dataset that has been sorted in either ascending or descending order. The main difference between the mean and the median is that the former is heavily affected by outliers or skewed data, while the latter is affected only slightly or not at all.

Observation 8.23 – Measures of Central Tendency: Measures that describe the central or middle values of a dataset. The three different measures are the mean, the median, and the mode.

Observation 8.24 – (Arithmetic) Mean: The average of the data in a dataset, calculated as the sum of all the data values divided by the number of cases.

Observation 8.25 – Median: The middle point of a sorted dataset.

Observation 8.26 – Mode: The most frequently occurring value in the dataset. If more than one such value exists, the dataset is characterized as multimodal.

The following Python script reads the data frame from the newGrades.csv file introduced in previous script samples, and calculates the means, medians, and modes of each of the columns:

import pandas as pd

# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format

dataset = pd.read_csv('newGrades.csv')

# Define the number of rows and columns in the data frame
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
        "Midterm Exam", "Project"]

# Calculate the mean of all columns and append the dataset
# (DataFrame.append() was removed in pandas 2.0, so pd.concat() is used)
mean1 = dataset["Final Grade"].mean()
mean2 = dataset["Final Exam"].mean()
mean3 = dataset["Quiz 1"].mean()
mean4 = dataset["Quiz 2"].mean()
mean5 = dataset["Midterm Exam"].mean()
mean6 = dataset["Project"].mean()
means = {"Final Grade": mean1, "Final Exam": mean2, "Quiz 1": mean3,
         "Quiz 2": mean4, "Midterm Exam": mean5, "Project": mean6}
dataset = pd.concat([dataset, pd.DataFrame([means])], ignore_index = True)

# Calculate the median of all columns and append the dataset
median1 = dataset["Final Grade"].median()
median2 = dataset["Final Exam"].median()
median3 = dataset["Quiz 1"].median()
median4 = dataset["Quiz 2"].median()
median5 = dataset["Midterm Exam"].median()
median6 = dataset["Project"].median()
medians = {"Final Grade": median1, "Final Exam": median2,
           "Quiz 1": median3, "Quiz 2": median4,
           "Midterm Exam": median5, "Project": median6}
dataset = pd.concat([dataset, pd.DataFrame([medians])], ignore_index = True)

# Find the mode in all columns and append the dataset
mode1 = dataset["Final Grade"].mode(dropna = True).values
if len(mode1) > 1:
    mode1 = "Multimode"
mode2 = dataset["Final Exam"].mode(dropna = True).values
if len(mode2) > 1:
    mode2 = "Multimode"
mode3 = dataset["Quiz 1"].mode(dropna = True).values
if len(mode3) > 1:
    mode3 = "Multimode"
mode4 = dataset["Quiz 2"].mode(dropna = True).values
if len(mode4) > 1:
    mode4 = "Multimode"
mode5 = dataset["Midterm Exam"].mode(dropna = True).values
if len(mode5) > 1:
    mode5 = "Multimode"
mode6 = dataset["Project"].mode(dropna = True).values
if len(mode6) > 1:
    mode6 = "Multimode"
modes = {"Final Grade": mode1, "Final Exam": mode2, "Quiz 1": mode3,
         "Quiz 2": mode4, "Midterm Exam": mode5, "Project": mode6}
dataset = pd.concat([dataset, pd.DataFrame([modes])], ignore_index = True)

# Report the dataset
dataset1 = dataset[cols]
print(dataset1.iloc[0:rows:1])

# Report the rows with the means, medians, modes
print("Means"); print(dataset1.iloc[32:33])
print("Medians"); print(dataset1.iloc[33:34])
print("Modes"); print(dataset1.iloc[34:35])

Output 8.4.1:

    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
0        $58.57      $50.50   $76.00   $70.70        $60.00   $55.00
1        $65.90      $49.00   $89.00   $63.00        $54.00   $90.00
2        $69.32      $63.50   $73.00   $54.70        $70.00   $80.00
3        $72.02      $60.50   $99.00   $74.70        $76.00   $70.00
4        $73.68      $74.00   $84.00   $53.30        $64.00   $87.00
5        $61.32      $45.50   $94.00   $42.70        $66.00   $70.00
6        $67.87      $66.50   $73.00   $53.70        $54.00   $87.00
7        $75.57      $66.00   $94.00   $58.70        $92.00   $70.00
8        $61.28      $50.50   $84.00   $37.30        $58.00   $78.00
9         $0.00         NaN      NaN      NaN           NaN   $69.00
10       $62.35      $48.00   $78.00   $49.00        $70.00   $71.00
11       $66.13      $61.00   $83.00   $45.30        $70.00   $70.00
12       $69.43      $50.00   $80.00   $49.30        $90.00   $76.00
13       $82.60      $74.00   $94.00   $65.00        $86.00   $92.00
14        $0.00         NaN      NaN      NaN           NaN   $75.00
15       $62.62      $45.50   $78.00   $56.70        $72.00   $70.00
16        $0.00         NaN      NaN      NaN           NaN    $0.00
17       $67.47      $59.00   $70.00   $72.70        $70.00   $72.00
18       $75.13      $61.50   $76.00   $68.30        $82.00   $87.00
19       $66.85      $77.50   $84.00   $52.00        $40.00   $80.00
20       $54.45      $34.50   $62.00   $44.00        $44.00   $90.00
21       $76.95      $66.50   $68.00   $67.00        $82.00   $92.00
22       $45.13      $26.00   $52.00   $26.30        $50.00   $68.00
23       $73.23      $63.50   $96.00   $68.30        $62.00   $89.00
24       $81.87      $83.00   $97.00   $82.70        $84.00   $72.00
25       $62.63      $54.50   $54.00   $31.30        $64.00   $87.00
26       $58.75      $46.50   $54.00   $39.00        $52.00   $90.00
27       $49.75      $27.50   $48.00   $37.00        $62.00   $70.00
28       $44.25      $21.50   $55.00   $18.00        $42.00   $80.00
29       $62.52      $31.00   $85.00   $54.70        $68.00   $89.00
30       $47.33      $16.50   $38.00   $33.30        $52.00   $89.00
31       $68.97      $55.00   $65.00   $49.70        $70.00   $94.00
Means
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
32       $58.87      $52.71   $75.28   $52.36        $65.72   $76.84
Medians
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
33       $62.63      $53.60   $77.00   $52.83        $65.86   $78.00
Modes
    Final Grade  Final Exam     Quiz 1     Quiz 2  Midterm Exam  Project
34        [0.0]   Multimode  Multimode  Multimode        [70.0]   [70.0]

The script and its output demonstrate a few important points:

• Given that the various calculations occasionally produce floating point numbers with several decimal digits, it may be desirable to limit the latter to a more manageable scale (i.e., two digits). The float_format statement near the top of the script formats the output accordingly.
• The mean() statements calculate the mean of each of the columns of the dataset. Next, these values are appended at the end of the dataset as a new row.
• In a similar fashion, the median() statements calculate the median of each of the columns of the dataset and append them as a new row at the end of the dataset. Since the data must be sorted in order to make such a calculation, the median() method performs this task internally. It should be noted that, because the row of means is appended before median() is called, each median is computed over the original values plus the column mean. This is why the Final Grade median reported here (62.63) differs from the 50% value (64.27) reported by describe() in Section 8.4.4.
• The mode() statements calculate the mode for each of the columns. Since it is undesirable in this particular example to have more than one such value reported, the code includes appropriate if statements to ensure that the mode is a single value per column or to report that the column is multimodal (i.e., it includes more than one mode).
• Finally, the reader should note the use of the dropna = True parameter, which ensures that empty or NaN values are not considered in the mode calculation. The .values attribute also discards the information related to the resulting series and its object type, leaving only the pure value.
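The same three measures can also be obtained without appending summary rows to the data frame, which keeps each statistic independent of the others. The following is a minimal sketch under stated assumptions: the two columns of marks are hypothetical (they are not taken from newGrades.csv), and the names data, summary, and modes are introduced only for illustration.

```python
import pandas as pd

# Hypothetical marks (not taken from newGrades.csv); the zeros act as outliers
data = pd.DataFrame({
    "Final Grade": [0.0, 0.0, 62.5, 65.0, 68.0, 70.0, 75.0],
    "Project":     [0.0, 70.0, 70.0, 80.0, 87.0, 90.0, 92.0],
})

# Mean and median of every column, kept in a separate summary frame
# so that no summary row is ever mixed in with the raw records
summary = data.agg(["mean", "median"])

# mode() returns one row per modal value of each column
modes = data.mode(dropna=True)

print(summary.round(2))
print(modes)
```

In this sketch the zeros pull the mean of Final Grade well below its median, which illustrates why the median is the more robust measure when outliers are present.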
8.4.2 Measures of Spread

Another way to describe and summarize continuous data is through measures of spread. Such measures quantify the variability of data points; hence they are also called measures of dispersion. Measures of spread are frequently used in conjunction with measures of central tendency to provide a clearer and more rounded overview of the data at hand. The importance of measures of spread lies in the fact that they can describe how well the mean represents the data. If the data spread is large (i.e., if there are large differences between the data points), the mean may not be as good a representation of the data as the median or the mode.

The data range is the difference between the minimum and maximum data points in the dataset. It is calculated as range = max − min.

Quartiles describe the data spread by breaking the data into four parts (i.e., quarters), using three quartiles. The 1st quartile (Q1) is the 25th percentile of the sample, dividing roughly the lowest 25% from the rest of the data, while the 2nd quartile (Q2) is the 50th percentile or the median, and the 3rd quartile (Q3) is the 75th percentile. Quartiles are a useful measure of spread, as they are much less affected by outliers or skewed datasets than other measures like variance or standard deviation.

Variance shows numerically how far the data points are from the mean. Variance is useful as, unlike quartiles, it takes into account all data points in the dataset and provides a better representation of the data spread. The variance of dataset x with n data points is expressed as s² = Σᵢ (xᵢ − mean(x))² / (n − 1), where i = 1, 2, …, n and mean(x) is the mean of x. In order to get a better understanding of why the sum has to be divided by n − 1 instead of n, the reader can refer to Bessel's correction.

Standard deviation also demonstrates how the data points spread out from the mean. It is the positive square root of the variance.
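As a minimal worked sketch of the definitions above, the following code computes the range, the three quartiles, the variance, and the standard deviation of a small sample by hand and compares them with the library results. The sample values and all variable names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import statistics
import pandas as pd

# Hypothetical sample, chosen so the arithmetic is easy to follow
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(x)
mean = x.mean()                          # 40 / 8 = 5.0
squared_dev = ((x - mean) ** 2).sum()    # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32.0

# Sample variance: divide by n - 1 rather than n (Bessel's correction)
sample_var = squared_dev / (n - 1)
# Standard deviation: positive square root of the variance
sample_sd = sample_var ** 0.5

# Range and quartiles
data_range = x.max() - x.min()
q1, q2, q3 = x.quantile([0.25, 0.5, 0.75])

# The manual results agree with pandas and with the statistics package
print(sample_var, x.var(), statistics.variance(x.tolist()))
print(sample_sd, x.std(), statistics.stdev(x.tolist()))
print(data_range, q1, q2, q3)
```

Both x.var() and statistics.variance() divide by n − 1 by default, which is why the three variance values printed by the sketch coincide.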
A small standard deviation indicates that the data are close to the mean, while a large one shows a high outwards data spread from the mean. Standard deviation is often the preferred choice in order to present the data spread, and it is more convenient compared to variance, as it utilizes the same unit as the data points.

Observation 8.27 – Measures of Spread: Measures that quantify the variability of data points in a dataset. If the spread is large, the measures of central tendency (the mean in particular) may not be good representations of the data.

Observation 8.28 – min(), max(): Use the min() and max() methods to find the minimum and maximum values in a dataset. Calculate their difference to find the range of these values.

Observation 8.29 – Quartiles: Use the quantile() method to specify and report the relevant quartile of data in a dataset. For instance, quantile(0.1) will report the value below which the lowest 10% of the data values in the dataset fall.

Observation 8.30 – variance(): Use the variance() function of the statistics package to find the variance of a dataset and show the distance of the data points from the mean.

Observation 8.31 – Standard Deviation (SD): Standard deviation shows the distance of the data points from the mean. The larger its value, the larger the spread of the data points from the mean. It is frequently preferable to the measure of variance.

The following script uses the Pandas and statistics Python packages to read the newGrades.csv file, find the max and min values for each column in the dataset, find the 25% (1st) quartile, and calculate the variance and the standard deviation using both the Pandas std() method and the stdev() function from the statistics package.
Finally, it creates a new dataset with all the related values, and reports the dataset:

import pandas as pd
import statistics

# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format

dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
        "Midterm Exam", "Project"]

# Find the max values in each column
max1 = dataset["Final Grade"].max()
max2 = dataset["Final Exam"].max()
max3 = dataset["Quiz 1"].max()
max4 = dataset["Quiz 2"].max()
max5 = dataset["Midterm Exam"].max()
max6 = dataset["Project"].max()

# Find the min values in each column
min1 = dataset["Final Grade"].min()
min2 = dataset["Final Exam"].min()
min3 = dataset["Quiz 1"].min()
min4 = dataset["Quiz 2"].min()
min5 = dataset["Midterm Exam"].min()
min6 = dataset["Project"].min()

# Find the lower 25% quartile in all columns
quartile25a = dataset["Final Grade"].quantile(0.25)
quartile25b = dataset["Final Exam"].quantile(0.25)
quartile25c = dataset["Quiz 1"].quantile(0.25)
quartile25d = dataset["Quiz 2"].quantile(0.25)
quartile25e = dataset["Midterm Exam"].quantile(0.25)
quartile25f = dataset["Project"].quantile(0.25)

# Calculate the variance in all columns
variance1 = statistics.variance(dataset["Final Grade"].dropna())
variance2 = statistics.variance(dataset["Final Exam"].dropna())
variance3 = statistics.variance(dataset["Quiz 1"].dropna())
variance4 = statistics.variance(dataset["Quiz 2"].dropna())
variance5 = statistics.variance(dataset["Midterm Exam"].dropna())
variance6 = statistics.variance(dataset["Project"].dropna())

# Calculate the standard deviation of all columns using std()
std1 = dataset["Final Grade"].std()
std2 = dataset["Final Exam"].std()
std3 = dataset["Quiz 1"].std()
std4 = dataset["Quiz 2"].std()
std5 = dataset["Midterm Exam"].std()
std6 = dataset["Project"].std()

# Calculate the standard deviation in all columns using stdev()
stdev1 = statistics.stdev(dataset["Final Grade"].dropna())
stdev2 = statistics.stdev(dataset["Final Exam"].dropna())
stdev3 = statistics.stdev(dataset["Quiz 1"].dropna())
stdev4 = statistics.stdev(dataset["Quiz 2"].dropna())
stdev5 = statistics.stdev(dataset["Midterm Exam"].dropna())
stdev6 = statistics.stdev(dataset["Project"].dropna())

# Report the dataset
dataset1 = dataset[cols]
print(dataset1.iloc[0:rows:1])

# Append the dataset with the max, min, quartile, variance, and
# standard deviation values
# (DataFrame.append() was removed in pandas 2.0, so pd.concat() is used)
maxs = {"Final Grade": max1, "Final Exam": max2, "Quiz 1": max3,
        "Quiz 2": max4, "Midterm Exam": max5, "Project": max6}
dataset1 = pd.concat([dataset1, pd.DataFrame([maxs])], ignore_index = True)
mins = {"Final Grade": min1, "Final Exam": min2, "Quiz 1": min3,
        "Quiz 2": min4, "Midterm Exam": min5, "Project": min6}
dataset1 = pd.concat([dataset1, pd.DataFrame([mins])], ignore_index = True)
quartiles = {"Final Grade": quartile25a, "Final Exam": quartile25b,
             "Quiz 1": quartile25c, "Quiz 2": quartile25d,
             "Midterm Exam": quartile25e, "Project": quartile25f}
dataset1 = pd.concat([dataset1, pd.DataFrame([quartiles])], ignore_index = True)
variances = {"Final Grade": variance1, "Final Exam": variance2,
             "Quiz 1": variance3, "Quiz 2": variance4,
             "Midterm Exam": variance5, "Project": variance6}
dataset1 = pd.concat([dataset1, pd.DataFrame([variances])], ignore_index = True)
stds = {"Final Grade": std1, "Final Exam": std2, "Quiz 1": std3,
        "Quiz 2": std4, "Midterm Exam": std5, "Project": std6}
dataset1 = pd.concat([dataset1, pd.DataFrame([stds])], ignore_index = True)
stdevs = {"Final Grade": stdev1, "Final Exam": stdev2, "Quiz 1": stdev3,
          "Quiz 2": stdev4, "Midterm Exam": stdev5, "Project": stdev6}
dataset1 = pd.concat([dataset1, pd.DataFrame([stdevs])], ignore_index = True)

# Report the rows with the max, min, quartile, variance, and std values
print("Max"); print(dataset1.iloc[32:33])
print("Min"); print(dataset1.iloc[33:34])
print("25% Quartile"); print(dataset1.iloc[34:35])
print("Variance"); print(dataset1.iloc[35:36])
print("Standard Deviation (using: std())"); print(dataset1.iloc[36:37])
print("Standard Deviation (using: stdev())")
print(dataset1.iloc[37:38])

Output 8.4.2:

    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
0        $58.57      $50.50   $76.00   $70.70        $60.00       55
1        $65.90      $49.00   $89.00   $63.00        $54.00       90
2        $69.32      $63.50   $73.00   $54.70        $70.00       80
3        $72.02      $60.50   $99.00   $74.70        $76.00       70
4        $73.68      $74.00   $84.00   $53.30        $64.00       87
5        $61.32      $45.50   $94.00   $42.70        $66.00       70
6        $67.87      $66.50   $73.00   $53.70        $54.00       87
7        $75.57      $66.00   $94.00   $58.70        $92.00       70
8        $61.28      $50.50   $84.00   $37.30        $58.00       78
9         $0.00         NaN      NaN      NaN           NaN       69
10       $62.35      $48.00   $78.00   $49.00        $70.00       71
11       $66.13      $61.00   $83.00   $45.30        $70.00       70
12       $69.43      $50.00   $80.00   $49.30        $90.00       76
13       $82.60      $74.00   $94.00   $65.00        $86.00       92
14        $0.00         NaN      NaN      NaN           NaN       75
15       $62.62      $45.50   $78.00   $56.70        $72.00       70
16        $0.00         NaN      NaN      NaN           NaN        0
17       $67.47      $59.00   $70.00   $72.70        $70.00       72
18       $75.13      $61.50   $76.00   $68.30        $82.00       87
19       $66.85      $77.50   $84.00   $52.00        $40.00       80
20       $54.45      $34.50   $62.00   $44.00        $44.00       90
21       $76.95      $66.50   $68.00   $67.00        $82.00       92
22       $45.13      $26.00   $52.00   $26.30        $50.00       68
23       $73.23      $63.50   $96.00   $68.30        $62.00       89
24       $81.87      $83.00   $97.00   $82.70        $84.00       72
25       $62.63      $54.50   $54.00   $31.30        $64.00       87
26       $58.75      $46.50   $54.00   $39.00        $52.00       90
27       $49.75      $27.50   $48.00   $37.00        $62.00       70
28       $44.25      $21.50   $55.00   $18.00        $42.00       80
29       $62.52      $31.00   $85.00   $54.70        $68.00       89
30       $47.33      $16.50   $38.00   $33.30        $52.00       89
31       $68.97      $55.00   $65.00   $49.70        $70.00       94
Max
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
32       $82.60      $83.00   $99.00   $82.70        $92.00   $94.00
Min
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
33        $0.00      $16.50   $38.00   $18.00        $40.00    $0.00
25% Quartile
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
34       $57.54      $45.50   $65.00   $42.70        $54.00   $70.00
Variance
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
35      $461.46     $289.85  $267.49  $242.37       $197.06  $291.88
Standard Deviation (using: std())
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
36       $21.48      $17.02   $16.36   $15.57        $14.04   $17.08
Standard Deviation (using: stdev())
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
37       $21.48      $17.02   $16.36   $15.57        $14.04   $17.08

8.4.3 Skewness and Kurtosis

Skewness measures the asymmetry of the data and describes the amount by which the distribution differs from a normal distribution. There are several mathematical definitions of skewness. A commonly used one is Pearson's skewness coefficient, which can be derived using the size of a dataset, the mean, and the standard deviation of the data. Negative skewness values indicate a dominant tail on the left side, while positive values correspond to a long tail on the right side. If the skewness is close to 0 (i.e., between −0.5 and 0.5), the data are considered to be symmetric (Figure 8.1). When the skewness is between −1 and −0.5 or between 0.5 and 1, the data are considered to be moderately skewed. If skewness is less than −1 or more than 1, the data are considered to be highly skewed.

Kurtosis shows whether the data are heavy-tailed or light-tailed compared to a normal distribution. In other words, kurtosis identifies whether the data contain extreme values. A high kurtosis indicates a heavy tail and more outliers in the data, while a low kurtosis shows a light tail and fewer outliers. An alternative and effective way to show kurtosis and skewness is the histogram, as it visually demonstrates the shape of the data distribution. There are three main types of kurtosis: mesokurtic, leptokurtic, and platykurtic (Figure 8.2).

Observation 8.32 – Skewness: Use the skew() method to calculate the skewness of a dataset.
Based on Pearson's skewness coefficient, skewness between −0.5 and 0.5 is considered to be symmetric, while values between −1 and −0.5 or 0.5 and 1 indicate that skewness is moderate, and values less than −1 or more than 1 that it is high.

Observation 8.33 – Kurtosis: Use the kurtosis() method to calculate the kurtosis of a dataset. Data can be characterized as mesokurtic (normal distribution with value of 3), leptokurtic (data heavily-tailed with profusion of outliers and value higher than 3), or platykurtic (data light-tailed with less extreme values than normal distribution and value lower than 3).

• Mesokurtic (Kurtosis = 3): Data are normally distributed.
• Leptokurtic (Kurtosis > 3): Data are heavy-tailed with profusion of outliers.
• Platykurtic (Kurtosis < 3): Data are light-tailed and/or contain less extreme values than normal distribution.

Note that the Pandas kurtosis() method follows Fisher's definition and reports excess kurtosis (i.e., kurtosis minus 3), so a normal distribution yields a value close to 0 rather than 3. The kurtosis values in the output below should be read against this baseline.

FIGURE 8.1 Symmetric, positive, and negative skewness.

FIGURE 8.2 Main types of kurtosis.

The following script reads the newGrades.csv file, calculates the skewness, kurtosis, and sum values of all columns, and reports them alongside the rest of the dataset:

import pandas as pd

# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format

dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
        "Midterm Exam", "Project"]

# Find the skewness (Pearson's coefficient) values for each column
skew1 = dataset["Final Grade"].skew()
skew2 = dataset["Final Exam"].skew()
skew3 = dataset["Quiz 1"].skew()
skew4 = dataset["Quiz 2"].skew()
skew5 = dataset["Midterm Exam"].skew()
skew6 = dataset["Project"].skew()

# Find the kurtosis values for each column
kurtosis1 = dataset["Final Grade"].kurtosis()
kurtosis2 = dataset["Final Exam"].kurtosis()
kurtosis3 = dataset["Quiz 1"].kurtosis()
kurtosis4 = dataset["Quiz 2"].kurtosis()
kurtosis5 = dataset["Midterm Exam"].kurtosis()
kurtosis6 = dataset["Project"].kurtosis()

# Find the sum of all values for each column
sum1 = dataset["Final Grade"].sum()
sum2 = dataset["Final Exam"].sum()
sum3 = dataset["Quiz 1"].sum()
sum4 = dataset["Quiz 2"].sum()
sum5 = dataset["Midterm Exam"].sum()
sum6 = dataset["Project"].sum()

# Report the dataset
dataset1 = dataset[cols]
print(dataset1.iloc[0:rows:1])

# Append the dataset with the skewness, kurtosis, and sum values
# (DataFrame.append() was removed in pandas 2.0, so pd.concat() is used)
skewness = {"Final Grade": skew1, "Final Exam": skew2, "Quiz 1": skew3,
            "Quiz 2": skew4, "Midterm Exam": skew5, "Project": skew6}
dataset1 = pd.concat([dataset1, pd.DataFrame([skewness])], ignore_index = True)
kurtosis = {"Final Grade": kurtosis1, "Final Exam": kurtosis2,
            "Quiz 1": kurtosis3, "Quiz 2": kurtosis4,
            "Midterm Exam": kurtosis5, "Project": kurtosis6}
dataset1 = pd.concat([dataset1, pd.DataFrame([kurtosis])], ignore_index = True)
sums = {"Final Grade": sum1, "Final Exam": sum2, "Quiz 1": sum3,
        "Quiz 2": sum4, "Midterm Exam": sum5, "Project": sum6}
dataset1 = pd.concat([dataset1, pd.DataFrame([sums])], ignore_index = True)

# Report the rows with the skewness, kurtosis, and sums
print("Skewness"); print(dataset1.iloc[32:33])
print("Kurtosis"); print(dataset1.iloc[33:34])
print("Sum values"); print(dataset1.iloc[34:35])

Output 8.4.3:

    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
0        $58.57      $50.50   $76.00   $70.70        $60.00       55
1        $65.90      $49.00   $89.00   $63.00        $54.00       90
2        $69.32      $63.50   $73.00   $54.70        $70.00       80
3        $72.02      $60.50   $99.00   $74.70        $76.00       70
4        $73.68      $74.00   $84.00   $53.30        $64.00       87
5        $61.32      $45.50   $94.00   $42.70        $66.00       70
6        $67.87      $66.50   $73.00   $53.70        $54.00       87
7        $75.57      $66.00   $94.00   $58.70        $92.00       70
8        $61.28      $50.50   $84.00   $37.30        $58.00       78
9         $0.00         NaN      NaN      NaN           NaN       69
10       $62.35      $48.00   $78.00   $49.00        $70.00       71
11       $66.13      $61.00   $83.00   $45.30        $70.00       70
12       $69.43      $50.00   $80.00   $49.30        $90.00       76
13       $82.60      $74.00   $94.00   $65.00        $86.00       92
14        $0.00         NaN      NaN      NaN           NaN       75
15       $62.62      $45.50   $78.00   $56.70        $72.00       70
16        $0.00         NaN      NaN      NaN           NaN        0
17       $67.47      $59.00   $70.00   $72.70        $70.00       72
18       $75.13      $61.50   $76.00   $68.30        $82.00       87
19       $66.85      $77.50   $84.00   $52.00        $40.00       80
20       $54.45      $34.50   $62.00   $44.00        $44.00       90
21       $76.95      $66.50   $68.00   $67.00        $82.00       92
22       $45.13      $26.00   $52.00   $26.30        $50.00       68
23       $73.23      $63.50   $96.00   $68.30        $62.00       89
24       $81.87      $83.00   $97.00   $82.70        $84.00       72
25       $62.63      $54.50   $54.00   $31.30        $64.00       87
26       $58.75      $46.50   $54.00   $39.00        $52.00       90
27       $49.75      $27.50   $48.00   $37.00        $62.00       70
28       $44.25      $21.50   $55.00   $18.00        $42.00       80
29       $62.52      $31.00   $85.00   $54.70        $68.00       89
30       $47.33      $16.50   $38.00   $33.30        $52.00       89
31       $68.97      $55.00   $65.00   $49.70        $70.00       94
Skewness
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
32       $-1.96      $-0.43   $-0.51   $-0.18         $0.05   $-3.03
Kurtosis
    Final Grade  Final Exam   Quiz 1   Quiz 2  Midterm Exam  Project
33        $3.52      $-0.35   $-0.53   $-0.39        $-0.60   $13.01
Sum values
    Final Grade  Final Exam     Quiz 1     Quiz 2  Midterm Exam    Project
34    $1,883.94   $1,528.50  $2,183.00  $1,518.40     $1,906.00  $2,459.00

8.4.4 The describe() and count() Methods

Two more methods that are worth mentioning are describe() and count(). These methods come rather handy when describing categorical data, but can be also used with continuous data. The describe() method provides a simple way to describe data, reporting the count, mean, standard deviation, min, quartiles, and max without having to deal with each of them separately. The count() method reports the number of non-empty values; applied to the records of a single category, it reports the number of occurrences of that category in the dataset (i.e., it denotes frequency of occurrence). It can be also calculated on a percentage basis in order to obtain a representation of the part-to-whole relationship.

Observation 8.34 – describe(): Use the describe() method to automatically report a set of basic descriptive statistics.

Observation 8.35 – count(): Use the count() method to report the frequency of occurrence of categorical data.

The following script uses newGrades.csv to report basic descriptive statistics for Final Grade, while also counting the As, Bs, Cs, Ds, and Fs in the report:

import pandas as pd

# Define the format of float numbers
pd.options.display.float_format = '${:,.2f}'.format

dataset = pd.read_csv('newGrades.csv')
rows = len(dataset)
cols = ["Final Grade", "Letter Grade"]

# Report the basic descriptive statistics for Final Grade
print("Basic descriptive statistics on Final Grade")
print(dataset["Final Grade"].describe(), "\n")

# Create a new dataset with Letter Grade only
dataset1 = dataset[["Letter Grade"]]

# Find the number of occurrences of Letter Grades
countAll = dataset1.count()
print("Total students:", countAll.values)

dataset2 = dataset1[dataset1["Letter Grade"] == "A"]
if (not dataset2.empty):
    countA = dataset2.count().values
else:
    countA = 0
print("Students awarded an A:", countA)

dataset2 = dataset1[dataset1["Letter Grade"] == "B"]
if (not dataset2.empty):
    countB = dataset2.count().values
else:
    countB = 0
print("Students awarded a B:", countB)

dataset2 = dataset1[dataset1["Letter Grade"] == "C"]
if (not dataset2.empty):
    countC = dataset2.count().values
else:
    countC = 0
print("Students awarded a C:", countC)

dataset2 = dataset1[dataset1["Letter Grade"] == "D"]
if (not dataset2.empty):
    countD = dataset2.count().values
else:
    countD = 0
print("Students awarded a D:", countD)

dataset2 = dataset1[dataset1["Letter Grade"] == "F"]
if (not dataset2.empty):
    countF = dataset2.count().values
else:
    countF = 0
print("Students awarded an F:", countF)

Output 8.4.4:

Basic descriptive statistics on Final Grade
count   $32.00
mean    $58.87
std     $21.48
min      $0.00
25%     $57.54
50%     $64.27
75%     $70.08
max     $82.60
Name: Final Grade, dtype: float64

Total students: [32]
Students awarded an A: 0
Students awarded a B: [2]
Students awarded a C: [6]
Students awarded a D: [14]
Students awarded an F: [10]

8.5 DATA VISUALIZATION

We are all familiar with the expression a picture is worth a thousand words. Data visualisation refers to the use of graphical means to represent and summarize data. It can help the analyst identify and conceptualize patterns, trends, and correlations present in the data that may be otherwise difficult to spot. It is also an efficient way to convey insights or summaries to wider audiences and, thus, it is widely used for data presentation (particularly when working with big data). Data visualisation is also an essential step before undertaking inferential statistics analysis (Chapter 9) and machine learning (Chapter 10) tasks, as it provides an overview of some of the structures and techniques used in these fields.

Observation 8.36 – Data Visualization: The use of visual means, such as various types of charts, to represent and summarize data.

In general, data visualisation is useful for the following tasks:

• Recognizing the structure and patterns of the data.
• Detecting errors or outliers.
• Exploring relationships between variables.
• Discovering new trends.
• Suggesting appropriate inferential statistical analysis and machine learning methods.
• Identifying the need for data correction (e.g., transforming data to log-scale).
• Communicating data to wider audiences.

Python is a popular data visualization choice for data scientists, as it provides various packages and libraries suitable for visualization tasks. Some popular plotting libraries are the following:

• Matplotlib: As mentioned in earlier sections, Matplotlib is a low-level plotting library, suitable for creating basic graphs and providing a lot of options relating to this task to the programmer.
• Pandas: Pandas is based on Matplotlib and, in addition to plotting, it also provides extra analysis functionality.
• Seaborn: Seaborn is a high-level plotting library with a solid collection of usable, default styles. It also allows for graph plotting with minimal coding, and it provides advanced visuals, making it the tool of choice for many data scientists.

The above libraries and packages provide a wealth of available methods to produce any type of visualization. In this section, only Pandas and Matplotlib are used. This is mainly for simplicity and clarity reasons.

8.5.1 Continuous Data: Histograms

A histogram is a type of graph that can depict the distribution of continuous numerical data by displaying the data frequency using bars of different heights. Due to the use of bars, prior to plotting histograms, one first has to bin the range of data values. The term bin is used to describe the process of dividing the entire range of data values into a series of intervals. Subsequently, data falling into each interval are counted and the resulting frequencies are plotted in the form of bars. Bins are usually specified as consecutive, non-overlapping intervals and often have equal or comparable sizes, although this is not a strict requirement (Freedman et al., 1998).

Observation 8.37 – Histograms: Use the plot.hist() method (Pandas library) to visualize continuous data, dividing the entire range of values into a series of intervals referred to as bins. Parameters such as subplots, layout, grid, xlabelsize, ylabelsize, xrot, yrot, figsize, and legend allow for the detailed configuration of the histogram.

FIGURE 8.3 Types of histograms.

Histograms can be used when investigating and demonstrating the shape of the data distribution (i.e., its center, spread, and skewness), as well as its various modes and the presence of outliers.
They help the analysis by making it possible to determine visually whether two or more data distributions are different, as in the example above (Figure 8.3).

At first, histograms may look like bar charts, but these two graph formats are notably different. Histograms are used for summarising and grouping continuous data into ranges, while bar charts are used for displaying the frequency of categorical data. Another difference is that the proportion of the data in a histogram is represented by the area of the adjoining bars, while in a bar chart it is represented by the length of individual bars. Bar charts are discussed in more detail in later parts of this chapter.

To plot a histogram in Python, one can use the plot.hist() method from the Pandas library. For basic plotting, no further arguments are needed. However, the method accepts additional arguments in order to optionally control specific plotting details, such as the number of bins (the bins parameter; the default value is 10). It is also possible to have multiple histograms generated and illustrated in one single plot. The subplots parameter allows the programmer to plot each feature in the dataset separately, and the layout parameter specifies the number of rows and columns in which these subplots are arranged. A background grid can be shown or hidden by setting the grid parameter to True or False. The font size of the tick labels on the x or y axis can be controlled with the xlabelsize or ylabelsize parameters, and these labels can be rotated by a specified number of degrees with the xrot or yrot parameters; note that these four parameters belong to the closely related hist() method of a Pandas data frame (e.g., dataset.hist()) rather than to plot.hist(). The size of the figure can be specified (in inches) using the figsize parameter.
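As a minimal sketch of the binning step behind a histogram, the block below counts a handful of made-up grades into five equal-width bins using NumPy's histogram() function. The grades, the variable names, and the 50 to 100 range are assumptions chosen for illustration and are not taken from newGrades.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical grades, used only for illustration (not from newGrades.csv)
grades = pd.Series([52, 58, 61, 64, 67, 70, 71, 73, 78, 82, 88, 95])

# Divide the range 50-100 into 5 consecutive, non-overlapping, equal-width
# bins and count how many grades fall into each one
counts, edges = np.histogram(grades.to_numpy(), bins=5, range=(50, 100))

# Report each interval together with its frequency (the height of its bar)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Calling grades.plot.hist(bins = 5) would draw counts of this kind as bars, although with the bin edges computed over the data's own range (52 to 95 here) rather than 50 to 100.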
The following script uses the newGrades.csv dataset used in previous examples to display six histograms in one plot (i.e., two rows and three columns):

    import pandas as pd

    dataset = pd.read_csv('newGrades.csv')
    dataset1 = dataset[["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
                        "Midterm Exam", "Project"]]

    # Prepare one histogram per column (subplots), with visible grid & legend,
    # in 2 rows & 3 columns, in a figure of size 10x10 inches, & with 10 bins
    axes = dataset1.plot.hist(subplots = True, grid = True, legend = True,
                              layout = (2, 3), figsize = (10, 10), bins = 10)

Output 8.5.1:

8.5.2 Continuous Data: Box and Whisker Plot

A box and whisker plot, also called box plot, is a graphical representation of the spread of continuous data, based on a five number summary: the minimum, the maximum, the sample median, the first quartile (Q1), and the third quartile (Q3). As the name suggests, the plot contains two parts: a box and a set of whiskers. The two ends of the whiskers show the minimum and the maximum values of the dataset (excluding any outliers), while the top and the bottom of the box represent Q3 and Q1, respectively. The horizontal line in the middle of the box denotes the median. A data point located outside the whiskers is treated as an outlier; by default, this is a value lying more than one and a half times the length of the box (i.e., the interquartile range, Q3 - Q1) above Q3 or below Q1. It is worth noting that box plots work better with data that only contain a limited number of categories (Figure 8.4).

Observation 8.38 – Box and Whisker Plot: Use the boxplot() method (Pandas library) to draw a box and whisker plot. Plot aspects like the grid, the figure size, and the labels can be configured using the grid, figsize, and labels parameters, respectively.

FIGURE 8.4 Box and whisker plot.

Box plots can be used when:
• Working with numerical data.
• Presenting the spread of the data and the central value.
• Comparing data distribution across different categories.
• Identifying outliers.

Box plots can be created using the boxplot() method from the Pandas library. The x and y axis values can be modified using the by and column parameters, respectively (Pandas, 2021a). For an improved visual effect, one can alternatively use the boxplot() function from the Seaborn library (commonly called as sns.boxplot()). The following script draws a box and whisker plot for the newGrades.csv dataset:

    import pandas as pd

    dataset = pd.read_csv('newGrades.csv')

    # The names of the columns on the x-axis
    cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
            "Midterm Exam", "Project"]
    dataset1 = dataset[["Final Grade", "Final Exam",
                        "Quiz 1", "Quiz 2", "Midterm Exam", "Project"]]

    # Prepare a box and whisker diagram with all the 6 columns represented
    # in a single plot of size 10x10 inches
    dataset1.boxplot(grid = True, figsize = (10, 10), showcaps = True,
                     showbox = True, showfliers = True, labels = cols)

Output 8.5.2: <AxesSubplot:>

8.5.3 Continuous Data: Line Chart

A line chart is a graphical method to represent trend data as a continuous line. It connects a series of historical data points by line segments in order to depict the variations of the data continuously over time. The x-axis corresponds to time or continuous progression, while the y-axis represents the corresponding values.

Observation 8.39 – Line Chart: Use the plot.line() method (Pandas library) to draw a line chart. There are several parameters available for the detailed configuration of the chart.

Line charts can be used when:
• Working with numerical data (y-axis) that follow a continuous progression (x-axis).
• Emphasizing changes in values over time or as a continuous progression.
• Comparing different series of trends.

To create a line chart, one can call the plot.line() method from the Pandas library.
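A minimal sketch of such a call is given below, assuming a small made-up data frame of weekly quiz scores (the values, column contents, and axis labels are illustrative only). The first two lines select a non-interactive Matplotlib backend so that the sketch runs without a display; in a Jupyter notebook they can most likely be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import pandas as pd

# Hypothetical quiz scores for four consecutive weeks (illustrative values)
scores = pd.DataFrame(
    {"Quiz 1": [60, 65, 70, 80], "Quiz 2": [55, 68, 72, 85]},
    index=[1, 2, 3, 4],
)

# One line is drawn per column; the index provides the x-axis values
ax = scores.plot.line(grid=True, figsize=(7, 7),
                      title="Quiz scores (made-up data)")
ax.set_xlabel("Week")
ax.set_ylabel("Score")

# Collect the legend entries that Pandas created for the two lines
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```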
If multiple lines are plotted in a single line chart, Pandas automatically creates a legend. This is a rather useful feature when comparing data trends. The following script uses the newGrades.csv dataset to draw a line chart plotting all six columns of the dataset:

    import pandas as pd

    dataset = pd.read_csv('newGrades.csv')

    # The names of the columns on the x-axis
    cols = ["Final Grade", "Final Exam", "Quiz 1", "Quiz 2",
            "Midterm Exam", "Project"]
    dataset1 = dataset[["Final Grade", "Final Exam",
                        "Quiz 1", "Quiz 2", "Midterm Exam", "Project"]]

    # Prepare a line chart with all the 6 columns represented
    # in a single plot of size 7x7 inches
    dataset1.plot.line(grid = True, figsize = (7, 7),
                       title = "Grades Line Chart")

Output 8.5.3: <AxesSubplot:title={'center':'Grades Line Chart'}>

8.5.4 Categorical Data: Bar Chart

A bar chart is a graph that displays counts of categorical data, or values associated with categories, in the form of vertical or horizontal rectangular bars. In a vertical bar chart, the x-axis represents the categories, while the y-axis can take any value depending on the dataset used. Bar charts are useful for describing categorical data that have fewer than approximately 30 categories, as anything close to or above this rough threshold tends to make them rather unreadable. In such cases, a more efficient grouping or re-grouping approach should be considered.

Observation 8.40 – Bar Chart: Use the plot.bar() method (Pandas library) to draw a bar chart. There are several parameters available for the detailed configuration of the chart.

Bar charts can be used when:
• Working with categorical data.
• Investigating the frequency of the data.

To plot a bar chart for categorical data one can use the plot.bar() method (Pandas library).
The reader must note that before this method is called, the frequency for each category must be counted using the value_counts() method. Methods plt.xlabel(), plt.ylabel(), and plt.title() can be used to add appropriate descriptions to the bar chart. The following script uses plot.bar() to draw and configure a vertical bar chart (default) based on the Letter Grade column of newGrades2.xlsx (New Data sheet):

    import pandas as pd

    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
    barChart = dataset["Letter Grade"].value_counts().plot.bar(grid = True,
        legend = True, figsize = (7, 7), rot = 0)
    barChart.set_title("Final Letter Grades")
    barChart.set_ylabel("Frequencies")
    barChart.set_xlabel("Letter Grades")

Output 8.5.4.a: Text(0.5, 0, 'Letter Grades')

The reader should note the use of the grid, legend, figsize, and rot parameters to configure the basic appearance of the chart (i.e., show the grid and the legend, define the size of the figure in inches, and ensure the correct orientation of the x-axis labels, respectively). It must be also noted how methods set_title(), set_ylabel(), and set_xlabel() are used to set the title of the chart and define the headings for the x and y axes.

When horizontal bars are needed instead of vertical ones, the plot.barh() method should be used instead of plot.bar().
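Both plot.bar() and plot.barh() are called on the counted frequencies rather than on the raw column. The counting step can be sketched on its own as follows, assuming a short made-up series of letter grades (the values and variable names are illustrative and not taken from newGrades2.xlsx):

```python
import pandas as pd

# Hypothetical letter grades (illustrative values)
letters = pd.Series(["C", "D", "D", "B", "F", "D", "C", "F", "D"])

# value_counts() returns the frequency of each category,
# with the most frequent category first
frequencies = letters.value_counts()

# sort_index() reorders the same frequencies alphabetically by category
by_letter = frequencies.sort_index()

print(frequencies)
print(by_letter)
```

Either series can then be passed on to the plotting methods; for instance, by_letter.plot.bar(rot = 0) would draw the bars in alphabetical order of the grade letters instead of in order of frequency.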
The following script demonstrates the plot.barh() option, while its output illustrates how slight parameter variations can help with the new horizontal orientation:

    import pandas as pd

    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
    barChart = dataset["Final Exam Letter"].value_counts().plot.barh(
        grid = True, legend = True, figsize = (7, 7), rot = 0)
    barChart.set_title("Final Exam Letter Grades")
    barChart.set_ylabel("Letter Grades")
    barChart.set_xlabel("Frequencies")

Output 8.5.4.b: Text(0.5, 0, 'Frequencies')

It is also possible to have two or more different bar charts within the same figure. This can take three different forms. The first is to have a single figure with two separate charts, as in the script below. The script uses the subplot() function of the matplotlib.pyplot package (imported as plt) to create two different plots:

    import pandas as pd
    import matplotlib.pyplot as plt

    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")

    # Draw first subplot
    plt.subplot(1, 2, 1)
    plot1 = dataset["Letter Grade"].value_counts().plot.bar(grid = True,
        figsize = (10, 7), legend = True, sharey = True, rot = 0)
    plot1.set_title("Final Letter Grades")
    plot1.set_ylabel("Frequencies")
    plot1.set_xlabel("Letter Grades")

    # Draw second subplot
    plt.subplot(1, 2, 2)
    plot2 = dataset["Final Exam Letter"].value_counts().plot.bar(grid = True,
        figsize = (10, 7), legend = True, sharey = True, rot = 0)
    plot2.set_title("Final Exam Letter Grades")
    plot2.set_ylabel("Frequencies")
    plot2.set_xlabel("Letter Grades")

Output 8.5.4.c: Text(0.5, 0, 'Letter Grades')

The second form is to create a compound or nested bar chart, allowing two or more sets of data associated with the same categorical data to be plotted in a single diagram. This is useful in situations requiring visual comparison.
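A nested bar chart relies on the different sets of frequencies sharing the same categories. As a hedged sketch of how this works in Pandas, the block below combines two made-up sets of letter-grade counts into one data frame; the grades and variable names are assumptions for illustration. Categories missing from one set appear as missing values after the combination, so the sketch replaces them with zero.

```python
import pandas as pd

# Hypothetical letter grades for the course and for the final exam
final = pd.Series(["B", "C", "C", "D", "D", "D", "F"]).value_counts()
exam = pd.Series(["A", "C", "D", "D", "F", "F", "F"]).value_counts()

# Building a data frame from the two series aligns them on the grade letter;
# a letter that is absent from one series becomes a missing value (NaN)
combined = pd.DataFrame({"Final Letter Grade": final,
                         "Final Exam Letter Grade": exam})

# Replace the missing counts with 0 and order the rows alphabetically
combined = combined.fillna(0).astype(int).sort_index()
print(combined)
```

Calling combined.plot.bar(rot = 0) on such a data frame would then draw one pair of bars per grade letter.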
The following script is a variation of previously used examples, demonstrating this form of bar chart:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Read the Excel dataset
    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")

    # Count the frequencies of Letter Grade and Final Exam Letter
    dataset1 = dataset["Letter Grade"].value_counts()
    dataset2 = dataset["Final Exam Letter"].value_counts()
    barChart = pd.DataFrame({"Final Letter Grade": dataset1,
                             "Final Exam Letter Grade": dataset2})
    barChart.plot.bar(grid = True,
        title = "Final Exam and Final Grade Letter Grades", rot = 0,
        figsize = (8, 8), color = ["lightblue", "lightgrey"])

    # Use the plt object to set the labels of the x and y axis
    plt.xlabel("Letter Grades")
    plt.ylabel("Frequencies")

Output 8.5.4.d: Text(0, 0.5, 'Frequencies')

The third form is the stacked bar chart. In this case, the various components are stacked upon each other to create a single, unified bar. The following script presents columns Letter Grade and Final Exam Letter from the newGrades2.xlsx dataset (New Data sheet).
The reader should note that, in addition to the previously mentioned parameters of the regular plot.bar() method, the script also uses the stacked = True parameter that is responsible for stacking the two datasets:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Read the Excel dataset
    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")

    # Count the frequencies of the "Letter Grade" & the "Final Exam Letter"
    dataset1 = dataset["Letter Grade"].value_counts()
    dataset2 = dataset["Final Exam Letter"].value_counts()
    barChart = pd.DataFrame({"Final Letter Grade": dataset1,
                             "Final Exam Letter Grade": dataset2})
    barChart.plot.bar(stacked = True, grid = True,
        title = "Final Exam and Final Grade Letter Grades", rot = 0,
        figsize = (8, 8), color = ["lightblue", "lightgrey"])

    # Use the plt object to set the labels of the x-axis and the y-axis
    plt.xlabel("Letter Grades")
    plt.ylabel("Frequencies")

Output 8.5.4.e: Text(0, 0.5, 'Frequencies')

8.5.5 Categorical Data: Pie Chart

A pie chart is a circular graph that uses the size of pie slices to illustrate proportion. It displays a part-to-whole relationship of categorical data. Like in the case of the bar chart, the pie chart should be avoided for data with a significant number of categories (i.e., slices), as this would compromise readability. Ideally, data with five or fewer categories are preferable. If the pie chart is to be used for data with more than five categories, re-categorising or aggregating the data should be considered.

Observation 8.41 – Pie Chart: Use the pie() method (Pandas library) to create a pie chart based on a dataset. Use the plt object from matplotlib.pyplot to configure and improve the appearance of the chart.

Pie charts can be used when the presentation of the part-to-whole relationship of the data is more important than the precise size of each category, and when it is required to visually compare the size of categories in relation to the whole. However, unlike bar charts, they cannot explicitly demonstrate absolute numbers or values for each category. To plot a pie chart, one can use the plot.pie() method from the Pandas library (Pandas, 2021b), while its appearance can be further configured using the plt object from the matplotlib.pyplot package.

The following script reads the New Data sheet from newGrades2.xlsx and creates a pie chart based on the Letter Grade column, using the pie() function of the plt object. Next, it demonstrates the use of the labels, autopct, shadow, and startangle parameters to define and format the labels (in percentages), to display shadows, and to dictate the orientation and angle of the slices. Finally, it uses the axis, legend, and title functions to keep the pie circular (equal aspect ratio), and to add titles to the chart and the legend:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Read the Excel dataset
    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")

    # Count the frequencies of Letter Grade
    dataset1 = dataset["Letter Grade"].value_counts()

    # Take the labels from the counted frequencies, so that each label
    # is paired with its own slice
    labels1 = dataset1.index
    plt.pie(dataset1, labels = labels1, autopct = "%1.1f%%", shadow = True,
            startangle = 90)
    plt.axis("equal")
    plt.legend(title = "Final Letter Grades")
    plt.title("Final Letter Grades")

Output 8.5.5: Text(0.5, 1.0, 'Final Letter Grades')

8.5.6 Paired Data: Scatter Plot

A scatter plot is a visual representation of the relationship between two sets of data using dots or circles. The dots/circles can report the values of individual data points, but also patterns of the data as a whole. Relationships between variables can be described in the following ways: positive or negative, strong or weak, linear or nonlinear (Figure 8.5).
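The direction and strength of a linear relationship can also be expressed numerically. The sketch below is an illustration under stated assumptions rather than part of the chapter's scripts: it uses a small made-up data frame (the column names hours_studied, absences, and final_grade are hypothetical) and the corr() method of a Pandas series, which computes Pearson's correlation coefficient by default (covered in Chapter 9).

```python
import pandas as pd

# Hypothetical paired observations for six students (illustrative values)
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "absences": [9, 8, 6, 5, 3, 1],
    "final_grade": [52, 58, 63, 70, 74, 81],
})

# A coefficient close to +1 suggests a strong positive linear relationship
r_positive = data["hours_studied"].corr(data["final_grade"])

# A coefficient close to -1 suggests a strong negative linear relationship
r_negative = data["absences"].corr(data["final_grade"])

print(round(r_positive, 3), round(r_negative, 3))
```

In a scatter plot, the first pair of columns would appear as a cloud of points rising from left to right, and the second pair as a cloud falling from left to right.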
Scatter plots can be used when:
• Working with paired numerical data.
• Identifying whether the data are correlated.
• Investigating data patterns (e.g., cluster, data gap, outliers) (Figure 8.6).

Observation 8.42 – Scatter Plot: Use plot.scatter() (Pandas library) to create a scatter plot. Scatter plots illustrate the relationship between two sets of data using dots or circles.

To create a scatter plot, one can call the plot.scatter() method from the Pandas library, and use the x and y arguments to define the paired data. The following script draws a scatter plot using the Final Exam and Final Grade columns from newGrades2.xlsx:

    import pandas as pd

    # Read the Excel dataset
    dataset = pd.read_excel('newGrades2.xlsx', sheet_name = "New Data")
    dataFrame = pd.DataFrame(data = dataset,
                             columns = ["Final Exam", "Final Grade"])
    dataFrame.plot.scatter(x = "Final Exam", y = "Final Grade",
        title = "Scatter chart between final exams and final grades ",
        figsize = (7, 7))

FIGURE 8.5 Types of scatter plots.

FIGURE 8.6 Investigating data patterns.

Output 8.5.6: <AxesSubplot:title={'center':'Scatter chart between final exams and final grades '}, xlabel='Final Exam', ylabel='Final Grade'>

8.6 WRAPPING UP

This chapter covered some of the basic concepts and tasks used in data analysis. Considering the large number of possibilities and analysis combinations that may be utilized in order to provide thorough data analytics results, this chapter was not meant to provide an exhaustive analysis of all options, but introductions to some of the main ones that highlight the general approaches and perspectives. For instance, topics like heatmaps, word clouds, bubble charts, area charts, and geospatial plots were not covered, although they are rather popular and common data visualization tools.
The reader can find more detailed information on such topics in the rather extensive body of work that is readily available in related publications or web sources. At the level of detail and abstraction used in this chapter, Table 8.1 can be used as a quick guide for some of the methods covered, and their use in the context of data analytics.

TABLE 8.1 Quick Guide of Methods and Their Functionality and Syntax

Data Acquisition

Functionality: Import the Pandas library.
Syntax: import pandas as <pandas object>
Example: import pandas as pd

Functionality: Create a data frame through data read.
Syntax: <name of data frame> = <name of pandas object>.read_csv("<Filename.csv>", delimiter = ',')
Example: dataset = pd.read_csv('WPP2019_TotalPopulationBySex.csv', delimiter = ',')
Syntax: <name of data frame> = <name of pandas object>.read_excel("<Filename.xlsx>", sheet_name = "<Sheet name>")
Example: dataset = pd.read_excel('WPP2019_Total_Population.xlsx', sheet_name = "ESTIMATES")
Syntax: <name of data frame> = <name of pandas object>.read_html("<url>")

Data Cleaning

Functionality: Delete all rows containing missing data.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna()
Example: dframe_no_missing_data = dataset.dropna()

Functionality: Delete all rows containing any missing data.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna(how = "any")
Example: dframe_delete_rows_with_any_na_values = dataset.dropna(how = "any")

Functionality: Delete all rows with missing data in all columns.
Syntax: <name of new Data Frame> = <name of original Data Frame>.dropna(how = "all")
Example: dframe_delete_rows_with_all_na_values = dataset.dropna(how = "all")

Functionality: Replace missing values with a predefined or calculated value.
Syntax: <name of new Data Frame> = <name of original Data Frame>.fillna(value)
Syntax: <name of original Data Frame>.fillna(value, inplace = True)
Example: dataset.fillna(0, inplace = True)
Note: fillna() does not accept a how parameter; when inplace = True is used, the data frame is modified directly and nothing is returned.

Functionality: Change the names of columns with new ones.
Syntax: <name of Data Frame>.rename(columns = {"oldname": "newname", ...} [, inplace = True])
Example: dataset_new = dataset.rename(columns = {"Final Grade": "Total Grade", "Quiz 1": "Test 1", "Quiz 2": "Test 2", "Midterm Exam": "Midterm"})

Functionality: Change the index of a dataset and reset it back to the original column.
Syntax: <name of dataset>.set_index("<column name>" [, inplace = True])
Syntax: <name of dataset>.reset_index([inplace = True])

Data Exploration

Functionality: Find the number of records in the dataset.
Syntax: len(<name of dataset>)
Example: len(dataset)

Functionality: Report the columns of the dataset.
Syntax: <name of dataset>.columns
Example: dataset.columns

Functionality: Report the number of records and columns in the dataset.
Syntax: <name of dataset>.shape
Example: dataset.shape

Functionality: Report the first n records of the dataset.
Syntax: <name of dataset>.head(n)
Example: dataset.head(5)

Functionality: Report the last n records of the dataset.
Syntax: <name of dataset>.tail(n)
Example: dataset.tail(5)

Functionality: Report a number of records and columns from the dataset, based on their name and/or index value.
Syntax: <name of dataset>[start row: end row: step]
Syntax: <name of dataset>.loc[start row: end row, "<name of starting column>": "<name of ending column>"]
Syntax: <name of dataset>.iloc[start row: end row, start column (index): end column (index)]
Example: print(dataset[0:37:5])
Example: print(dataset.loc[0:5, "Final Grade": "Final Exam"])
Example: print(dataset.iloc[0:5, 0:3])

Functionality: Report only the unique values from a selected column in the dataset.
Syntax: <name of dataset>["<name of column>"].unique()
Example: dataset["Project"].unique()

Functionality: Report data based on simple or compound condition.
Syntax: <name of dataset>[<condition>]
Syntax: <name of dataset>[(<condition>) &/| (<condition>)]
Example: dataset[dataset["Final Grade"] > 80]
Example: dataset[(dataset["Final Grade"] > 0) & (dataset["Final Grade"] < 60)]

Functionality: Merge two datasets into a new one.
Syntax: <name of new dataset> = <name of first old dataset>.append(<name of second old dataset>)
Example: dataset = dataset1.append(dataset2)
Note: append() was removed from data frames in Pandas 2.0; in such versions, pd.concat([dataset1, dataset2]) can be used instead.

Functionality: Create a new column based on an expression using data from other columns.
Syntax: <name of dataset>["<name of new column>"] = expression with other columns
Example: dataset["Course Work"] = dataset["Quiz"]*0.2 + dataset["Midterm Exam"]*0.25 + dataset["Project"]*0.25

Functionality: Create a new column based on a condition.
Syntax: <name of dataset>["<name of new column>"] = np.where(condition, value if True, value if False)
Example: dataset["Letter Grade"] = np.where(dataset["Final Grade"] > 89, "A", <value if False>)

Functionality: Create a new column based on a set of conditions and paired values.
Syntax: <name of dataset>["<name of new column>"] = np.select(conditions, paired values)
Example: dataset["Letter Grade"] = np.select(conditions, gradeLetters)

Functionality: Group a dataset based on one or more columns, and apply any aggregate method necessary (e.g., sum(), mean()).
Syntax: <name of dataset>.groupby(["<name of column>" [, "<name of column>", ...]]).<aggregate function>
Example: dataset1.groupby(["Letter Grade"]).mean()

Functionality: Group a dataset based on one or more columns. Use apply() to organize the records and columns in the dataset.
Syntax: <name of dataset>.groupby(["<name of column>" [, "<name of column>", ...]]).apply(lambda x: x[<rows>, <cols>])
Example: dataset1.groupby(["Letter Grade"]).apply(lambda x: x[0:rows])

Functionality: Sort the data in a dataset.
Syntax: <name of dataset>.sort_values(["<name of column>" [, "<name of column>", ...]] [, ascending = False])
Example: dataset3.groupby(["Letter Grade"]).apply(lambda x: x.sort_values(["Final Grade"], ascending = False))

Descriptive Statistics

Functionality: Use mean() to find the mean/average in a dataset.
Syntax: <name of dataset>["<name of column>"].mean()
Example: dataset["Final Grade"].mean()

Functionality: Use median() to find the median in a dataset.
Syntax: <name of dataset>["<name of column>"].median()
Example: dataset["Final Grade"].median()

Functionality: Use mode() to find the most frequent value in a dataset.
Syntax: <name of dataset>["<name of column>"].mode()
Example: dataset["Final Grade"].mode(dropna = True)

Functionality: Use .values to discard all output from the mode() report except its value.
Syntax: <name of dataset>["<name of column>"].mode().values
Example: dataset["Final Grade"].mode(dropna = True).values

Functionality: Use max() to find the max value in a dataset.
Syntax: <name of dataset>["<name of column>"].max()
Example: dataset["Final Grade"].max()

Functionality: Use min() to find the min value in a dataset.
Syntax: <name of dataset>["<name of column>"].min()
Example: dataset["Final Grade"].min()

Functionality: Use quantile(x) to find the xth quantile in a dataset.
Syntax: <name of dataset>["<name of column>"].quantile(<value between 0.0 and 1.0>)
Example: dataset["Final Grade"].quantile(0.25)

Functionality: Use variance() (Statistics package) to calculate data variance.
Syntax: statistics.variance(<name of dataset>["<name of column>"].dropna())
Example: statistics.variance(dataset["Final Grade"].dropna())

Functionality: Use std() or stdev() (Statistics package) to calculate standard deviation.
Syntax: <name of dataset>["<name of column>"].std()
Syntax: statistics.stdev(<name of dataset>["<name of column>"].dropna())
Example: dataset["Final Grade"].std()
Example: statistics.stdev(dataset["Final Grade"].dropna())

Functionality: Use skew() to calculate data skewness.
Syntax: <name of dataset>["<name of column>"].skew()
Example: dataset["Final Grade"].skew()

Functionality: Use kurtosis() to calculate data kurtosis.
Syntax: <name of dataset>["<name of column>"].kurtosis()
Example: dataset["Final Grade"].kurtosis()

Functionality: Use count() to calculate the frequency of occurrence of a value.
Syntax: <name of dataset>["<name of column>"].count()
Example: dataset["Final Grade"].count()

Functionality: Use describe() to automatically report a set of basic descriptive statistics.
Syntax: <name of dataset>["<name of column>"].describe()
Example: dataset["Final Grade"].describe()

Data Visualization

Functionality: Use the hist() function (Pandas library) to draw histograms.
Syntax: <axes> = <name of dataset>.plot.hist(subplots = True/False, grid = True/False, legend = True/False, layout = (<number of rows>, <number of columns>), figsize = (<size on x axis in inches>, <size on y axis in inches>), bins = <number of bins>)
Example: axes = dataset1.plot.hist(subplots = True, grid = True, legend = True, layout = (2, 3), figsize = (10, 10), bins = 10)

Functionality: Use the boxplot() function (Pandas library) to draw box and whiskers plots.
Syntax: <name of dataset>.boxplot([grid = True/False] [, figsize = (<integer>, <integer>)] [, showcaps = True/False] [, showbox = True/False] [, showfliers = True/False] [, labels = <names of columns>])
Example: dataset1.boxplot(grid = True, figsize = (10, 10), showcaps = True, showbox = True, showfliers = True, labels = cols)

Functionality: Use the line() function (Pandas library) to draw a line chart.
Syntax: <name of dataset>.plot.line([grid = True/False] [, figsize = (<integer>, <integer>)] [, title = "<title>"])
Example: dataset1.plot.line(grid = True, figsize = (7, 7), title = "Grades Line Chart")

Functionality: Use the bar() function (Pandas library) to draw a bar chart. Use the subplot() function (matplotlib.pyplot) and the stacked parameter with appropriate code to create different types of bar charts.
Syntax: <name of dataset>.plot.bar()
Example: see relevant script in the text

Functionality: Use the pie() function (Pandas library) to draw a pie chart. Use the plt object of the matplotlib.pyplot package to configure and improve the appearance of the chart.
Syntax: <name of dataset>.plot.pie()
Example: see relevant script in the text

Functionality: Use the scatter() function (Pandas library) to draw a scatter plot based on two datasets.
Syntax: <dataFrame>.plot.scatter(x = "<column 1>", y = "<column 2>" [, title = "<title>", ...])
Example: dataFrame.plot.scatter(x = "Final Exam", y = "Final Grade", title = "Final exams and final grades ", figsize = (7, 7))

8.7 CASE STUDY

Readmission is considered a quality measure of hospital performance and a driver of healthcare costs. Studies have shown that patients with diabetes are more likely to have higher early readmissions (readmitted within 30 days of discharge), compared to those without diabetes (American Diabetes Association, 2018; McEwen & Herman, 2018). To reduce early readmission, one solution is to provide additional assistance to patients with high risk of readmission.
For this purpose, the US Department of Health would like to know how to identify the patients with high risk of readmission using the collected clinical records of diabetes patients from 130 US hospitals between 1999 and 2008.

As an attempt to assist the US Department of Health in understanding the data, you are asked to explore, analyse (descriptively), and visualize the data of readmission (readmitted) and the potential risk factors, such as time in hospital (time_in_hospital) and hemoglobin A1c results (HA1Cresult), using techniques covered in this chapter. More specifically, your work should cover the following:
1. Data Acquisition: Import the related data file (i.e., Diabetes.csv).
2. Data Exploration: Report the number of records/samples and the number of columns/variables in the dataset.
3. Descriptive Statistics: Use suitable techniques to summarize or describe the three variables we are interested in: readmitted, time_in_hospital, and HA1Cresult.
4. Data Visualisation: Use appropriate techniques to visualize the three variables and the relationships between readmitted and time_in_hospital, and between readmitted and HA1Cresult.

REFERENCES

American Diabetes Association. (2018). Economic costs of diabetes in the US in 2017. Diabetes Care, 41(5), 917–928. https://doi.org/10.2337/dci18-0007.
Freedman, D., Pisani, R., & Purves, R. (1998). Statistics (3rd ed.). New York: WW Norton & Company.
McEwen, L. N., & Herman, W. H. (2018). Health care utilization and costs of diabetes. Diabetes in America (3rd ed.), 40-1–40-78. NIDDK.
Pandas. (2021a). pandas.DataFrame.boxplot. Version: 1.2.5. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html.
Pandas. (2021b). pandas.DataFrame.plot.pie. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html.
Statistics — Mathematical statistics functions. (2021). Python. https://docs.python.org/3/library/statistics.html.
9 Statistical Analysis with Python

Han-I Wang
The University of York

Christos Manolas
The University of York
Ravensbourne University London

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Contents
9.1 Introduction
    9.1.1 What is Statistics?
    9.1.2 Why Use Python for Statistical Analysis?
    9.1.3 Overview of Available Libraries
9.2 Basic Statistics Concepts
    9.2.1 Population vs. Sample: From Description to Inferential Statistics
    9.2.2 Hypotheses and Statistical Significance
    9.2.3 Confidence Intervals
9.3 Key Considerations Prior to Conducting Statistical Analysis
    9.3.1 Level of Measures: Categorical and Numerical Variables
    9.3.2 Types of Variables: Dependent and Independent Variables
    9.3.3 Statistical Analysis Types and Hypothesis Tests
        9.3.3.1 Statistical Analysis for Summary Investigative Questions
        9.3.3.2 Statistical Analysis for Comparison Investigative Questions
        9.3.3.3 Statistical Analysis for Relationship Investigative Questions
    9.3.4 Choosing the Right Type of Statistical Analysis
9.4 Setting Up the Python Environment
    9.4.1 Installing Anaconda and Launching the Jupyter Notebook
    9.4.2 Installing and Running the Pandas Library
    9.4.3 Review of Basic Data Analytics
9.5 Statistical Analysis Tasks
    9.5.1 Descriptive Statistics
    9.5.2 Comparison: The Mann-Whitney U Test
    9.5.3 Comparison: The Wilcoxon Signed-Rank Test
    9.5.4 Comparison: The Kruskal-Wallis Test
    9.5.5 Comparison: Paired t-test
    9.5.6 Comparison: Independent or Student t-Test
    9.5.7 Comparison: ANOVA
    9.5.8 Comparison: Chi-Square
    9.5.9 Relationship: Pearson’s Correlation
    9.5.10 Relationship: The Chi-Square Test
    9.5.11 Relationship: Linear Regression
    9.5.12 Relationship: Logistic Regression
9.6 Wrap Up
9.7 Exercises
References

DOI: 10.1201/9781003139010-9

9.1 INTRODUCTION
When working with data, one of the main questions one seeks to answer is whether the observed value fluctuations and differences are random or not. If not by chance, what are the key factors that cause such changes, and what are their relationships with the data? Statistical analysis, and in particular inferential statistics, is the key tool for answering these questions.

In this chapter, some commonly used statistical functions and the relationship between different types of measurements and statistical tests are introduced, accompanied by demonstrations of how to conduct relevant statistical analysis tasks in Python. The analysis functions follow a linear and incremental order, and build on concepts introduced previously, in order to assist readers with little or no prior experience in this area. For those familiar with the various concepts and functions discussed, this chapter can be used as a refresher or as a practical guide to implementing and executing common statistical functions using the Python platform.
The reader should note that before embarking on any substantial task involving statistical analysis, it is important to consult statistics experts in order to determine the appropriate data collection methods and measurement units, as well as the types of statistical tests required and the best approaches for interpreting and reporting the results.

9.1.1 What is Statistics?
Statistics is a branch of applied mathematics involving the tasks of data collection, manipulation, interpretation, and prediction. Two broad categories can be identified in the field of statistics: descriptive and inferential. Descriptive statistics (covered in part in Chapter 8 on Data Analytics and Data Visualization) focus on identifying and describing patterns in the data, by utilizing straightforward functions like frequencies and mean calculations. In descriptive statistics, there are no uncertain or unknown factors. The goal is to summarize large volumes of data, making it easier to visualize and understand.

On the other hand, inferential statistics focus on putting forward hypotheses (or inferences) related to a sample taken from a wider population. The hypotheses can then be generalized and applied to the entire population. Hence, as the sample does not contain the entirety of the population, analytical tasks utilizing inferential statistics are bound to contain an element of uncertainty.

The reader must note that the term statistics is commonly used to refer to inferential statistics, while the term descriptive statistics is used when analytical tasks are conducted solely for describing existing data. In line with this convention, in this chapter the term statistics will be most frequently used to refer to inferential statistics, unless stated otherwise.

Observation 9.1 – Statistics: A branch of applied mathematics that involves the tasks of data collection, manipulation, interpretation, and prediction. Two broad categories can be identified: descriptive and inferential statistics.
Observation 9.2 – Descriptive Statistics: The focus is on identifying and describing patterns in the data through frequencies and mean calculations.

Observation 9.3 – Inferential Statistics: The focus is on putting forward hypotheses (inferences) related to a sample from a wider population. If the hypotheses are proven correct, they are generalized and applied to the entire population.

9.1.2 Why Use Python for Statistical Analysis?
A large number of specialized statistical software tools are available, such as SAS, Stata, R, and SPSS, and are widely used for both academic and commercial purposes. However, as each of these software packages comes from different developers, they use customized features and specialized commands and syntax that cannot be directly translated and exchanged across different platforms. In contrast, Python is a general-purpose programming language with extensive cross-platform capabilities. This characteristic gives Python an advantage when it comes to complex statistical analysis tasks that mix statistics with other data science fields, such as image analysis, text mining, or artificial intelligence and machine learning. In such cases, the richness and flexibility of Python, provided by its ability to adapt its functionality by means of appropriate modules, make it a better choice compared to other specialized statistical software packages.

Furthermore, the Python language is relatively easy to learn compared to those found in the more specialized statistical software tools. Its syntax is reminiscent of the English language, making it easy to learn and use, and thus accessible to users from diverse backgrounds and programming expertise levels. Finally, Python is an open-source and free-to-use language, unlike most of the specialized statistical packages that frequently come at a considerable cost.

Observation 9.4: Python, as a general-purpose programming language, allows the user to integrate statistics with other data science fields and tasks like image analysis, text mining, artificial intelligence, or machine learning.

9.1.3 Overview of Available Libraries
A number of Python libraries, such as NumPy, SciPy, Scikit-learn, and Pandas, provide functions and tools that allow the user to conduct specific statistical analysis tasks. As the names suggest, NumPy and SciPy focus on numeric and scientific computations, as they support basic operations on multidimensional arrays. Accordingly, Scikit-learn is mostly used for machine learning and data mining, as it offers simple and efficient tools for common data analysis tasks. The name Pandas is derived from the term panel data, and the library is designed for data manipulation and analysis (McKinney & Team, 2020). For pure statistical analysis purposes, the Pandas library is one of the most suitable options, as it provides high-performance data analysis tools (Anaconda Inc., 2020).

Observation 9.5: The NumPy and SciPy libraries focus on numeric and scientific computations, Scikit-learn is used for machine learning and data mining, and Pandas for data manipulation and analysis.

The reader will notice that the library of choice for a large part of the work covered in this chapter is Pandas. This is due to three main reasons. Firstly, the library is highly suitable for the types of statistical analysis tasks covered in this chapter. Secondly, it supports different data formats like comma-separated values (.csv), plain text, Microsoft Excel (.xls), and SQL, allowing the user to import, export, and manipulate databases easily. Thirdly, it is built on top of the NumPy library, so the results can be easily fed into functions of associated libraries like Matplotlib for plotting and Scikit-learn for machine learning tasks (McIntire et al., 2019).
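As a brief illustration of the second reason, the sketch below loads comma-separated data into a Pandas DataFrame and produces a few basic summaries. The column names and values are made up for this example, and an in-memory string stands in for a real .csv file; it is a minimal sketch rather than a prescribed workflow.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = io.StringIO(
    "patient_id,group,systolic_bp\n"
    "1,new_drug,128\n"
    "2,new_drug,131\n"
    "3,new_drug,125\n"
    "4,conventional,140\n"
    "5,conventional,138\n"
    "6,conventional,145\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(csv_text)

# Number of records and variables in the dataset.
n_rows, n_cols = df.shape

# Mean of a continuous variable and frequencies of a categorical one.
mean_bp = df["systolic_bp"].mean()
group_counts = df["group"].value_counts()

# Mean blood pressure within each group.
group_means = df.groupby("group")["systolic_bp"].mean()

print(f"{n_rows} records, {n_cols} variables")
print(f"Overall mean systolic BP: {mean_bp}")
print(group_means)
```

The resulting Series and DataFrame objects can be passed on to plotting or modelling functions from other libraries.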
This highlights another concept that is central to the structure and rationale of this chapter: the selective use of different libraries and functions for different analytical tasks. For instance, functions from the SciPy library may be used for a specific analytical task alongside functions from the Matplotlib library for plotting the output data. This approach aims at promoting the idea that, as long as the fundamental principles and logic for the various different analytical tasks remain the same, the reader should feel confident to explore different toolkits and solutions.

9.2 BASIC STATISTICS CONCEPTS
Readers unfamiliar with the intricacies of statistical analysis who come across the notions of significant difference, p-value, or confidence intervals may wonder what exactly these terms mean, and why they are so central in statistics. In this section, key statistics concepts, and the frequently intimidating jargon that accompanies them, are discussed and contextualized using simple examples. This aims at assisting the reader in establishing an understanding of the connections and differences between descriptive and inferential statistics, and how and why scientists frequently make the transition from the former to the latter.

9.2.1 Population vs. Sample: From Description to Inferential Statistics
Population can be defined as the whole set of individuals or subjects for which generalized observations or assumptions are needed, whereas sample is the part of this population from which data are actually collected. As such, the sample is bound to be a small part of the entire population. In an ideal scenario, individual information from the entirety of the population would be retrieved. In this case, descriptive statistical functions could be utilized to describe the patterns observed in the data. However, this scenario is extremely rare. In most cases, budget and time constraints related to the data collection and analysis tasks at hand impose significant limitations. This is especially true when the study population is substantial, a rather common situation indeed. For example, if a national survey about the quality of life of all patients with diabetes in the UK is to be carried out, researchers would have to interview a population of approximately 4.7 million people (Diabetes UK, 2019). Arguably, it would be much more efficient to survey a group of diabetes patients rather than the entire population. In such cases, since researchers would only get access to the information of a sample, statistical methods that allow one to make inferences about the population based on the sample are required.

Observation 9.6 – Population, Sample: Population is the whole set of individuals or subjects for which generalized observations or assumptions are needed. The sample is the part of the population from which data are actually collected. The sample is always a small part of the entire population.

Measuring the national Body Mass Index (BMI) scores can be used as an example to demonstrate the underlying rationale. Assume that one wants to measure the BMI scores of all smokers in the UK. Since it is not feasible to get information from the entire UK smoker population, a sample will be drawn, which will then be used to draw conclusions. Ultimately, findings will be generalized to the entire UK smoker population using inferential statistics.

In order to draw the sample, various different sampling methods are available. These include, but are not limited to, random, cluster, and stratified sampling. Depending on the research question behind the study and on the characteristics of the study population, a particular sampling method may be preferable to others. A detailed analysis of sampling methods and how to choose one is outside the scope of this chapter.
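To give a rough idea of how two of these approaches differ, the sketch below draws a simple random sample and a proportionally stratified sample from a made-up population using only the Python standard library. The population, the smoker flag, and the helper function names are hypothetical and serve only to illustrate the idea.

```python
import random

# Hypothetical population of 1,000 people, 30% of whom are smokers.
population = [{"id": i, "smoker": i % 10 < 3} for i in range(1000)]

# A dedicated, seeded generator keeps the example reproducible.
rng = random.Random(42)


def simple_random_sample(units, n):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)


def stratified_sample(units, key, n):
    """Sample each stratum in proportion to its share of the population.

    Rounding may make the total differ slightly from n when the
    proportions do not divide evenly.
    """
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, k))
    return sample


srs = simple_random_sample(population, 100)
strat = stratified_sample(population, "smoker", 100)

smokers_in_srs = sum(unit["smoker"] for unit in srs)
smokers_in_strat = sum(unit["smoker"] for unit in strat)

print(f"Smokers in simple random sample: {smokers_in_srs} of {len(srs)}")
print(f"Smokers in stratified sample:    {smokers_in_strat} of {len(strat)}")
```

The stratified sample reproduces the 30% share of smokers exactly, whereas the share in the simple random sample varies from draw to draw.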
A large number of related resources, such as specialized statistics books and online materials, are available for those interested in learning more about sampling.

In terms of generalizing findings and observations from the sample to the entire population, one may wonder how such a generalization can be possible and trustworthy. In its simplest form, this is achieved by conforming to a strict set of minimum requirements, summarized below:

1. The sample must be representative of the population to which the results will be generalized. Representative means that the sample should reflect specific characteristics of the population, such as age, gender, or ethnic background, as closely as possible.
2. It must be suitable for answering the research question quantitatively.
3. It must allow for hypothesis testing, as implied by the research question.
4. The data analysis must match the type of the data being analyzed. In other words, one needs to use the right statistical method for the data at hand.

Observation 9.7 – Sample Characteristics: A sample must be representative of the population, suitable for answering the research question quantitatively, and allowing for hypothesis testing.

These concepts are further discussed in the following sections.

9.2.2 Hypotheses and Statistical Significance
Once a representative sample is drawn from the study population, hypotheses are formulated based on the underlying research questions. These hypotheses are, subsequently, systematically tested in order to measure the strength of the evidence and to draw conclusions about the entire population. This is commonly known as hypothesis testing. Hypothesis testing is, therefore, the process of making a claim about the study population and using the sample data to check whether the claim is valid.
A common and long-established convention within the scientific community is that this claim is based on the assumption that the hypothesis will not be true, or in other words, that the analysis will show that the intervention or condition under investigation will have no difference or no effect in the context of the population. This is a specific and standardized type of assumption that is essential in statistical testing, and is commonly referred to as the null hypothesis (H0). For those unfamiliar with scientific methodologies, the fact that the expectation is that the analysis will unveil no difference as opposed to some difference may seem counter-intuitive. However, the reader should note that the idea behind this is that the analyst seeks to reject the null hypothesis rather than confirm it. In other words, the assumption is that if one can disprove the null hypothesis (i.e., no difference), a difference or effect must exist within the population.

Observation 9.8 – Null Hypothesis: The hypothesis that the intervention or condition under investigation associated with the research question will have no effect in the population.

To check the validity of the null hypothesis, one needs to conduct a detailed and strictly-defined type of testing, commonly referred to as statistical significance testing. There are numerous statistical significance tests to choose from, depending on the research questions and the data at hand (see Section 9.3 for more details on test selection and on how to conduct such tests in Python). A common attribute of all these tests is that they calculate a probability known as the p-value, which describes how likely it is that results at least as extreme as those observed in the sample would have occurred by random chance if the null hypothesis were true. Hence, if the p-value is high, the observed sample data are consistent with the null hypothesis, which therefore cannot be rejected, and there is no evidence of a difference in the population. If the p-value is low, it is a sign that the observed sample data are inconsistent with the null hypothesis (H0), which is, therefore, rejected. In this case, one can conclude that there must be a difference present in the population and the difference is statistically significant or that a significant difference has been detected.

Observation 9.9 – Hypothesis or Statistical Significance Testing: Tests that calculate the probability of obtaining the observed sample results if the null hypothesis were true. If the probability is high, the sample data are consistent with the null hypothesis, which is not rejected; if it is low, the observed sample data are inconsistent with the null hypothesis, which is therefore rejected.

TABLE 9.1 p-value and Significance
p-value       Significance
>0.1          Little or no evidence of a difference or relationship.
0.05–0.1      Weak evidence of a difference or relationship.
0.01–0.05     Evidence of a difference or relationship.
0.001–0.01    Strong evidence of a difference or relationship.
<0.001        Very strong evidence of a difference or relationship.

As a working example of the above, the reader can assume a study of the effectiveness of a new hypertension drug, by comparing the blood pressure levels of those using it with the levels of those using conventional hypertension drugs. A hypothesis test can be carried out to detect whether the new drug intervention has any effects on the sample or not. The null hypothesis will be based on the claim that there will be no difference of blood pressure levels between the users of the two different drugs in the sample. Hypothesis testing will be conducted and a p-value will be generated. If the p-value is low and the null hypothesis is rejected, there is evidence that there must be a difference in terms of the effectiveness of the two drugs in the general population.
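The logic of the drug example can be sketched with a simple permutation test, which estimates the p-value directly from its definition: how often a difference at least as large as the observed one arises when group labels are assigned at random. This is not one of the tests used later in the chapter; it is chosen here only because it needs nothing beyond the standard library. The blood pressure readings, function names, and the way boundary values of Table 9.1 are assigned are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Adding 1 to both counts avoids reporting a p-value of exactly zero.
    return (extreme + 1) / (n_permutations + 1)


def evidence_label(p):
    """Map a p-value to the wording of Table 9.1 (boundary handling assumed)."""
    if p > 0.1:
        return "little or no evidence"
    if p > 0.05:
        return "weak evidence"
    if p > 0.01:
        return "evidence"
    if p > 0.001:
        return "strong evidence"
    return "very strong evidence"


# Hypothetical systolic blood pressure readings (mmHg).
new_drug = [128, 125, 131, 122, 127, 130, 124, 126]
conventional = [138, 141, 135, 144, 139, 137, 142, 140]

p_value = permutation_p_value(new_drug, conventional)
reject_null = p_value < 0.05  # conventional 5% significance level

print(f"p-value: {p_value:.4f} ({evidence_label(p_value)})")
print(f"Reject the null hypothesis: {reject_null}")
```

Because the two made-up groups do not overlap at all, almost no random relabelling reproduces the observed difference, so the estimated p-value is very small and the null hypothesis is rejected.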
At this point, the reader may start wondering how low the p-value should be in order to be considered low. The answer to this is that it depends on the significance level one chooses for the research question. In other words, for each research question, one needs to determine how high or low the probability (i.e., the p-value) must be in order to conclude whether the sample data is consistent with the null hypothesis or not. Conventionally, differences are considered to be significant if the p-value is less than 0.05 (5%). Essentially, the p-value can be regarded as an indicator of the strength of the evidence. The reader can use the classification of p-values as a rough guide for determining whether statistical significance requirements are met for a specific analysis task (Table 9.1).

Using the same hypertension drug example, if the p-value of the hypothesis test is found to be 0.03, it indicates that, if the null hypothesis were true and the two drugs were equally effective, there would be only a 3% chance of observing a treatment effect at least as large as the one found in the sample. Since the 3% chance is lower than the 5% statistical significance threshold, the null hypothesis can be rejected, leading to the conclusion that a significant difference between the two drugs exists in terms of the treatment effects within the general population. It is worth mentioning that the p-value here only indicates a statistical relationship and not causation. For identifying causation, more sophisticated inferential statistical analysis methods, such as regression, are needed (see Sections 9.5.11, 9.5.12).

9.2.3 Confidence Intervals
Another key concept used frequently in statistics is that of confidence intervals. The term is used to describe the use of a range of values within which the actual population value may fall, instead of a single estimated value. More specifically, in inferential statistics, one of the primary goals is to estimate population parameters.
However, parameters such as the population mean and standard deviation are usually unknown, as they are very difficult, or even impossible, to measure accurately across the entire population. Instead, estimates are made based on the samples. In order to avoid selection bias when the sample is selected and to achieve an accurate and objective representation of the population, methods like random sampling are commonly used. However, even when such methods are used, uncertainty about the population estimates still exists to a certain degree, due to the possibility of sampling errors. It must be noted that, despite the term used, sampling errors do not refer to actual errors. They appear due to the inevitable variability occurring by chance, as random samples are used rather than an entire population. Nevertheless, they are treated as errors for the purposes of statistical testing, as they may lead to inaccurate conclusions. Although sampling errors cannot be completely eliminated, confidence intervals act as a mediator by taking these potential errors into account and providing a range of values the actual population parameter value is likely to fall within.

Observation 9.10 – Confidence Intervals: A range of values within which the actual population value may fall. They act as mediators that take into account potential sampling errors and, therefore, provide a higher level of confidence during the statistical analysis process.

As an example of this, one can assume that researchers want to know the average height of all secondary school students in the UK. Since it is impossible to measure the height of every single student, a random sample of 1,000 secondary school students could be used. If the analysis of the sample measurements results in an average height of 165 cm, it is unlikely that the population mean will also have this exact value, despite the fact that random sampling was used for sample selection.
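One common way of turning a sample mean into a confidence interval is the normal approximation, in which the interval is the sample mean plus or minus a multiple of the standard error. The sketch below applies it to simulated height measurements. The simulated data, the 95% confidence level, and all variable names are assumptions made for illustration and are not taken from the chapter.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Simulated heights (cm) for a hypothetical random sample of 1,000 students.
rng = random.Random(1)
heights = [rng.gauss(165, 8) for _ in range(1000)]

confidence = 0.95
sample_mean = mean(heights)

# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error = stdev(heights) / math.sqrt(len(heights))

# Critical value of the standard normal distribution (about 1.96 for 95%).
z = NormalDist().inv_cdf(0.5 + confidence / 2)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.1f} cm")
print(f"{confidence:.0%} confidence interval: {lower:.1f} to {upper:.1f} cm")
```

With a sample this large the interval is narrow; a smaller sample or more variable measurements would widen it, reflecting greater uncertainty about the population mean.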
If, on the other hand, the average height of the sample is reported together with a confidence interval, for instance between 160 and 170 cm, researchers can be confident that the true average height of all UK secondary school students is captured within this range.

9.3 KEY CONSIDERATIONS PRIOR TO CONDUCTING STATISTICAL ANALYSIS
Before conducting statistical analysis in Python, key aspects of the data collection process, as well as the tools and methods that will be used for the analysis of the collected data, must be considered. At a basic level, such considerations include:

• the measurement scales and the types of variables that will be used for data collection,
• the hypothesis being tested, and
• the statistical tests that will be used for data analysis.

Observation 9.11 – Variable: A characteristic, factor, or quantity that can be measured. As the name suggests, it varies between subjects and/or changes over time. It is directly related to the type of statistical analysis adopted for a given task.

A variable is a characteristic, factor, or quantity that can be measured, and which may vary between subjects or change over time (or both). For example, age is a variable that varies between individuals and changes over time, while income also varies between individuals but may, or may not, change over time. The reason the type of the variable is important is that it is directly related to the type of statistical analysis adopted for a given task. This is true for both descriptive and inferential statistics. Certain statistical analysis tests can be used only with certain types of data. For instance, if statistical methods suitable for categorical data are used with continuous data, the results are bound to be inconsistent and inaccurate. Hence, knowing the type of data that will be collected in advance enables one to choose the appropriate analysis method.
Variables are generally categorized according to the type of measurement they are used for and the level of detail of this measurement. The following sections briefly introduce the different types of variables, the associated types of statistical tests, and how to choose the right statistical test based on the type of variable at hand.

9.3.1 Level of Measures: Categorical and Numerical Variables
Categorical variables, also known as qualitative variables, describe categories or factors of objects, events, or individuals. An example is gender, which contains a finite number of categories (e.g., female, male). Categorical variables can also take numerical values (e.g., 1 for female, 2 for male). However, these values are only used for coding and indexing purposes and do not have any mathematical meaning. There are two types of categorical variables: nominal and ordinal. A brief description of each type is provided below.

Observation 9.12 – Categorical Variables (Nominal, Ordinal): Categorical (or qualitative) variables describe categories or factors of objects, events, or characteristics of individuals with no mathematical meaning. Nominal variables take discrete values that have no particular order, while ordinal variables take discrete, ordered values.

• Nominal variables can have two or more discrete states, but there is no implied order for these states. For example, gender (i.e., female, male) is a nominal variable. Marital status (i.e., unmarried, married, divorced) and ethnic background (e.g., African, Asian, Caucasian) are also examples of nominal variables. Similarly, in medical research, patients that are either in treatment or not in treatment can also be described by a nominal variable.
• Ordinal variables can also have two or more discrete states, but contrary to nominal variables, they can be ordered or ranked. For example, a satisfaction scale that lets respondents choose a value between 1 (strongly disagree) and 5 (strongly agree) is an example of an ordinal variable. Age group (e.g., 20–29, 30–39 and so on) and income can also be expressed as ordinal variables.

Continuous variables, also known as quantitative variables, are variables that can increase or decrease steadily, or by a quantifiable degree or amount. There are two types of continuous variables, namely interval and ratio. A brief description of each type is provided below.

Observation 9.13 – Continuous Variables (Interval, Ratio): Continuous (or quantitative) variables take continuous numerical values describing measured objects, events, or characteristics of individuals. They can take the form of intervals with no true zero values, or ratios where a true zero value has a logical meaning.

• Interval variables can be measured and ordered, and the intervals between the different values are equally spaced. For example, temperature measured in degrees (e.g., Celsius) is an interval variable, as the difference between 40°C and 30°C, and between 30°C and 20°C, is an equidistant interval of 10°C. Other examples of interval variables include age (when measured in years, months or days instead of the ordinal age groups of the previous example), or pH. Another characteristic of interval variables is that they do not have a true zero. For instance, there is no such thing as no temperature, as a temperature of 0°C is still a measurable temperature. Hence, interval variable values can be added or subtracted (but not multiplied or divided).
• Ratio variables are similar to interval variables, with one important difference: they do have a true zero point. When a ratio variable equals zero, this means there is none of this variable. Examples of ratio variables include height, weight, and length. Also, due to the existence of a true zero point, the ratio between two measurements takes a new meaning. For instance, an object weighing 10 kg is twice as heavy as an object weighing 5 kg. However, a temperature of 30°C (interval variable) cannot be considered twice as hot as 15°C. One can only claim that the 30°C temperature is higher than 15°C.

9.3.2 Types of Variables: Dependent and Independent Variables
Variables are typically classified as either independent or dependent. Independent variables, also called predictor, explanatory, controlled, input, or exposure variables, have an influence on the dependent variables, but are not affected by any other variables themselves, hence their name. Accordingly, dependent variables, also known as observed, outcome, output, or response variables, are variables that are changing based on changes in the associated independent variables. Ultimately, in a scientific experiment, one seeks to change or control the independent variables in order to test the effects of these changes on the dependent variables.

Observation 9.14 – Dependent and Independent Variables: Independent variables are changed/controlled in an experiment that tests their effect on the dependent variables. Both independent and dependent variables can be either categorical or continuous.

As an example, one can consider the following research question: Does the length of treatment result in improved health outcomes? In this case, the length of treatment is the independent variable, while health outcomes are the dependent variables. Similarly, if one poses the question: How does aspirin dosage affect the frequency of second heart attacks? The aspirin dosage would be the independent variable, while the heart attack frequency would be the dependent variable. It is worth mentioning that any type of categorical or continuous variables can be either independent or dependent, based on the context. A summary of the various different types of variables is provided in Figure 9.1 below.
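The distinction between nominal, ordinal, and continuous variables can also be made explicit in code. The sketch below uses the Pandas categorical type to mark one column as nominal and another as ordinal, next to two ordinary numeric columns. The column names and values are hypothetical, and this is only one possible way of encoding measurement levels.

```python
import pandas as pd

# Hypothetical survey data.
df = pd.DataFrame({
    "marital_status": ["married", "unmarried", "divorced", "married"],
    "satisfaction": ["agree", "neutral", "strongly agree", "disagree"],
    "temperature_c": [36.6, 37.2, 38.1, 36.9],  # interval: no true zero
    "weight_kg": [70.0, 82.5, 64.0, 91.0],      # ratio: true zero exists
})

# Nominal: categories without any implied order.
df["marital_status"] = pd.Categorical(df["marital_status"])

# Ordinal: categories with an explicit ranking.
likert = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=likert, ordered=True
)

# Frequencies are meaningful for any categorical variable.
status_counts = df["marital_status"].value_counts()

# Ranking operations are only meaningful for ordinal variables.
highest = df["satisfaction"].max()
above_neutral = int((df["satisfaction"] > "neutral").sum())

# Ratios are only meaningful for ratio variables.
weight_ratio = df["weight_kg"].max() / df["weight_kg"].min()

print(status_counts)
print(f"Highest satisfaction: {highest}; above neutral: {above_neutral}")
print(f"Heaviest to lightest weight ratio: {weight_ratio:.2f}")
```

Declaring the measurement level up front helps prevent accidental misuse, such as ranking marital status or averaging category codes.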
9.3.3 Statistical Analysis Types and Hypothesis Tests

There are various statistical analysis types and hypothesis tests. In general, statistical analysis can address three main types of investigative questions: summary, comparison, and relationship. A more detailed list of common statistical analysis types, and the categories of problems they are used to address, is presented in Table 9.2 below.

Observation 9.15 – Types of Statistical Analysis: There are three statistical analysis types: summary analysis using descriptive statistics, and comparison and relationship analysis both using inferential statistics.

9.3.3.1 Statistical Analysis for Summary Investigative Questions

Statistical analysis of this type is mainly used for summarizing and describing a single variable at a given time. The most common statistical methods associated with this type of analysis are those calculating the mean and median for continuous variables and the frequency for categorical variables.

9.3.3.2 Statistical Analysis for Comparison Investigative Questions

This type of statistical analysis is related to the comparison of the means of a single variable between two or more groups. For example, it can be used if one needs to know whether the Body Mass Index (BMI) numbers of men and women are significantly different from each other, or whether a new drug can reduce blood pressure (i.e., measuring blood pressure before and after treatment). In this type of analysis, the p-value is used to determine whether the difference is statistically significant.

[FIGURE 9.1 Types of variables. A variable is either continuous, subdivided into interval (e.g., 30°C) and ratio (e.g., height), or categorical, subdivided into ordinal (e.g., Likert scale) and nominal (e.g., age, gender). A variable can also be dependent (e.g., health outcome) or independent (e.g., treatment type and time).]

TABLE 9.2 Common Types of Statistical Tests

| Statistics | Investigative Question | Common Statistical Tests |
|---|---|---|
| Descriptive | Summary | Continuous variable: Mean, Median, Mode. Categorical variable: Frequency |
| Inferential | Comparison | Continuous variable, non-parametric: Mann-Whitney U test, Wilcoxon Signed-Rank test, Kruskal-Wallis, Mood's median test. Continuous variable, parametric: Student's t-test, Paired Student's t-test, Analysis of Variance test (ANOVA). Categorical variable: Chi-Square test |
| Inferential | Relationship | Association strength without causal relationship: Pearson's correlation coefficient, Chi-Squared test. Association strength with causal relationship: Linear regression, Logistic regression |

Overall, there are six common types of tests that can be used for comparative hypothesis cases. The choice of the appropriate test for a particular task depends on a number of factors, such as the sample size, the data characteristics, and the comparison groups. Tests of this type can be further divided into two main categories: parametric and non-parametric (Table 9.3). The main difference between parametric and non-parametric analysis is that the former tests the group means, while the latter tests the group medians.

When the sample size of each group is large enough and the comparison data are continuous and normally distributed, parametric statistical tests are preferable. Parametric tests have more statistical power than their non-parametric counterparts, and are thus more likely to detect an effect that actually exists. However, in cases where the sample size is small, or the comparison data are skewed or non-continuous (e.g., five-point Likert scales) (De Winter & Dodou, 2010), non-parametric statistical methods are more appropriate.
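The choice between the two families described above can be sketched in code. The example below is a rough illustration rather than a definitive rule: the function name suggest_test_family(), the minimum group size of 15, and the use of the Shapiro-Wilk normality test (shapiro() from SciPy) are all assumptions made for this sketch.

```python
# Illustrative sketch: suggesting a parametric or non-parametric comparison
# test for two groups. The function name and thresholds are assumptions.
from scipy.stats import shapiro


def suggest_test_family(sample_a, sample_b, min_size=15, alpha=0.05):
    """Return 'parametric' or 'non-parametric' for two groups of measurements."""
    for sample in (sample_a, sample_b):
        # Small groups: prefer the non-parametric route
        if len(sample) < min_size:
            return "non-parametric"
        # Shapiro-Wilk: a small p-value suggests the data are not normal
        _, p_value = shapiro(sample)
        if p_value < alpha:
            return "non-parametric"
    return "parametric"


# Two small, made-up groups of ten measurements each
small_a = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
small_b = [75, 40, 60, 40, 50, 65, 35, 70, 25, 40]
print(suggest_test_family(small_a, small_b))
```

A normality test is only one input to this decision; plots such as histograms, and knowledge of how the data were measured, remain just as important.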
Table 9.4 provides a simple indicative list of sample size thresholds for choosing whether parametric or non-parametric tests should be used. The reader can find more on this topic in the various available sources assisting users with statistical test selection, such as Minitab (2015).

TABLE 9.3 Common Types of Comparison Statistical Tests

| Parametric Tests (Means) | Non-Parametric Tests (Median) |
|---|---|
| Independent Student t-test | Mann-Whitney U test |
| Dependent (Paired) Student t-test | Wilcoxon Signed-Rank test |
| Analysis of Variance Test (ANOVA) | Kruskal-Wallis, Mood's median test |

TABLE 9.4 Simple Guide for Choosing between Parametric and Non-Parametric Tests

| Parametric Tests | Sample Size | Non-Parametric Tests |
|---|---|---|
| Independent Student t-test | N = 15 in each group | Mann-Whitney U test |
| Dependent (Paired) Student t-test | N = 30 | Wilcoxon Signed-Rank test |
| Analysis of Variance test (ANOVA) | Compare 2–9 groups, n = 15 in each group. Compare 10–12 groups, n = 20 in each group | Kruskal-Wallis, Mood's median test |

Irrespective of the sample size, when one compares two different means or medians, statistical analysis can be further divided into two types, depending on whether the mean or median comes from independent groups or from repeated measurements within the same group. If it comes from independent groups, independent t-tests should be used for parametric analysis and Mann-Whitney U tests for non-parametric analysis. Examples of such cases are analyses based on measurements of BMI for men and women, or the height of the UK and US populations. If the mean or median comes from repeated measurements within the same group, dependent t-tests should be used for parametric analysis and Wilcoxon Signed-Rank tests for non-parametric analysis. An example of this is the measurement of blood pressure before and after using a new drug. One can also compare three or more different means or medians.
An example of this is the comparison of height across different ethnic groups. In this case, Analysis of Variance (ANOVA) tests should be used for the parametric comparison of means, and the Kruskal-Wallis test for the non-parametric comparison (Table 9.3). In simple terms, ANOVA can be viewed as an extension of the t-test that allows one to compare the means of more than two groups.

9.3.3.3 Statistical Analysis for Relationship Investigative Questions

This type of statistical analysis is used to investigate the relationship between two or more variables. Depending on the type of variable and the purpose of the analysis, it can be further divided into four sub-categories, as outlined in Table 9.5. In general terms, relationship statistical analysis is suitable for:

• hypothesis testing,
• measuring the association strength, and
• investigating causal relationships.

Hypothesis testing is an attempt to check whether two variables are associated with each other. For example, one may wish to know whether an increase in daily sodium intake results in blood pressure changes (Figure 9.2). If the test results in a p-value below 0.05, a significant relationship is assumed to exist between salt intake and blood pressure.

Association strength is a measurement of how closely the two variables are correlated (Table 9.6). This is usually expressed in terms of the R or R² value, ranging from −1.0 to 1.0 or from 0 to 1.0, respectively.
Positive numbers indicate a positive correlation (e.g., if one variable increases the other increases too) and negative numbers an inverse correlation (e.g., if one variable increases the other decreases). In this context, a value of 1.0 or −1.0 indicates a perfect correlation, and 0 no correlation. A rule of thumb is that when R is higher than 0.7 or lower than −0.7 the two variables are considered to be highly correlated. When R is between −0.3 and 0.3, the correlation between the two variables is regarded as weak. In the example presented in Figure 9.2, R is 0.82. Thus, there is a strong positive relationship between sodium intake and blood pressure. In other words, a higher daily sodium intake is highly correlated with higher blood pressure.

TABLE 9.5 Common Types of Relationship Statistical Tests

| Statistical Test | Continuous Variable | Categorical Variable |
|---|---|---|
| Association Strength | Correlation (Linear Regression) | Chi-Square (Logistic Regression) |
| Causal Relationship | Linear Regression | Logistic Regression |

TABLE 9.6 R Value and Strength of Correlation

| R value | Strength of Correlation |
|---|---|
| 1.0 | Perfect positive correlation |
| 0.7 | Strong positive correlation |
| 0.5 | Moderate positive correlation |
| 0.3 | Weak positive correlation |
| 0 | No correlation |
| −0.3 | Weak negative correlation |
| −0.5 | Moderate negative correlation |
| −0.7 | Strong negative correlation |
| −1.0 | Perfect negative correlation |

The investigation of causal relationships is an attempt to relate the two variables via the equation of a line that stretches across a cloud of points. The equation is usually expressed as Y = a + bX, and it can be used for prediction. In the example presented in Figure 9.2, the causal relationship results show that blood pressure equals 114.5 + 3.5 × daily sodium intake. This indicates that if the daily sodium intake of individuals is known it is possible to predict their approximate blood pressure.
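The two ideas above, association strength and a fitted line used for prediction, can be sketched with SciPy's pearsonr() and linregress() functions. The data below are made up for illustration and are not the data behind Figure 9.2; the helper predict_pressure() is a hypothetical name that simply applies the equation quoted in the text.

```python
# Illustrative sketch with made-up data: association strength and a fitted line.
from scipy.stats import linregress, pearsonr

# Hypothetical daily sodium intake (g) and blood pressure (mmHg)
sodium = [1, 2, 3, 4, 5, 6, 7, 8]
pressure = [119.0, 120.0, 125.5, 130.5, 130.0, 136.5, 138.5, 142.0]

# Pearson's R and the p-value of the test for no linear association
r_value, p_value = pearsonr(sodium, pressure)

# Least-squares line Y = a + bX fitted to the hypothetical data
fit = linregress(sodium, pressure)

print(f"R = {r_value:.2f}, p-value = {p_value:.4f}")
print(f"Fitted line: Y = {fit.intercept:.1f} + {fit.slope:.2f} * X")


def predict_pressure(sodium_grams, intercept=114.5, slope=3.5):
    """Apply the line quoted in the text, Y = 114.5 + 3.5 * X (hypothetical helper)."""
    return intercept + slope * sodium_grams


print(predict_pressure(3))
```

The fitted coefficients describe only these made-up points, so they differ slightly from the 114.5 and 3.5 quoted in the text.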
For instance, when the daily sodium intake is 3 g the blood pressure would be 125 mmHg, and would go up by 3.5 mmHg for every 1 g increase of the daily sodium intake. This example provides a rather simplified, but informative description of the causal relationship concept.

FIGURE 9.2 Relationships between daily salt intake and blood pressure.

When the two variables are continuous, two common types of statistical analysis can be used to test their relationship: correlation and linear regression (McDonald, 2014). In simple terms, correlation analysis provides a p-value for testing the hypothesis, and quantifies the direction and strength of the relationship between two continuous variables by summarizing the result with an R value. However, correlation cannot infer a cause-and-effect relationship. On the other hand, linear regression provides a p-value for hypothesis testing similarly to correlation, but can also summarize the causal relationship with an equation that describes the relationship between variables.

TABLE 9.7 Cheat Sheet for Choosing the Right Statistical Test

| No. of Variables | Question Type | Dependent Variable | Independent Variable | Statistical Test |
|---|---|---|---|---|
| 1 | Summary | Continuous | – | Mean, Mode |
| 1 | Summary | Categorical | – | Frequency |
| 1 | Comparison | Continuous | 2 groups | t-Test |
| 1 | Comparison | Continuous | 3+ groups | ANOVA |
| 1 | Comparison | Categorical | 2+ groups | Chi-Square |
| 2 | Relationship | Continuous | 1 continuous | Correlation |
| 2 | Relationship | Categorical | 1 categorical | Chi-Square |
| 2+ | Relationship | Continuous | 1+ variables | Linear Regression |
| 2+ | Relationship | Categorical | 1+ variables | Logistic Regression |

When the variables are categorical (i.e., nominal and ordinal), their relationship can be tested using two additional types of statistical analysis: chi-square test and logistic regression. The chi-square test is used to test the association by providing a p-value.
For example, if one is interested in the relationship between gender and smoking status, the chi-square test can be used. If the result is a p-value of 0.015, a statistically significant association between gender and smoking status can be assumed. As in correlation, the chi-square test cannot infer a cause-and-effect relationship. To do so, logistic regression is required. The latter works like linear regression in the sense that it can summarize the causal relationship with an equation and use the equation for prediction. The main difference between the two is that logistic regression is used for categorical dependent variables, while linear regression is used for continuous ones. The reader can find a list and a brief description of a number of common statistical analysis tests discussed in this section in Table 9.7.

9.3.4 Choosing the Right Type of Statistical Analysis

Selecting the right type of statistical analysis is one of the most important considerations when conducting analytical work. This decision is generally based on the type and number of variables, and it can be a challenging process for those with less experience in this field of study.

Observation 9.16 – Selecting the Appropriate Test: The decision of what test to use is not an arbitrary one but depends on a number of factors, such as the types and number of variables at hand, the number of groups to be tested, the sample size, and the data distribution characteristics.

Table 9.7 presents a cheat sheet that can be used to determine when to choose the statistical tests mentioned in Section 9.3.3, Table 9.2. The first column contains the number of variables under investigation and the second the type of the research question one is trying to answer. The third and fourth columns contain the types of the dependent and independent variables, and the fifth the recommended statistical test. A decision tree chart is also provided in Figure 9.3, with the recommended statistical test at the end of each tree branch.
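As a rough illustration, the cheat sheet in Table 9.7 can be expressed as a simple lookup. The dictionary layout, the name CHEAT_SHEET, and the function recommend_test() are assumptions introduced for this sketch; only the table entries themselves come from the text.

```python
# Illustrative sketch: Table 9.7 expressed as a lookup table.
# Keys are (question type, dependent variable type, independent variable).
CHEAT_SHEET = {
    ("summary", "continuous", None): "Mean, Mode",
    ("summary", "categorical", None): "Frequency",
    ("comparison", "continuous", "2 groups"): "t-Test",
    ("comparison", "continuous", "3+ groups"): "ANOVA",
    ("comparison", "categorical", "2+ groups"): "Chi-Square",
    ("relationship", "continuous", "1 continuous"): "Correlation",
    ("relationship", "categorical", "1 categorical"): "Chi-Square",
    ("relationship", "continuous", "1+ variables"): "Linear Regression",
    ("relationship", "categorical", "1+ variables"): "Logistic Regression",
}


def recommend_test(question, dependent, independent=None):
    """Look up the test suggested by the cheat sheet (hypothetical helper)."""
    key = (question.lower(), dependent.lower(), independent)
    return CHEAT_SHEET.get(key, "No entry in the cheat sheet")


# Comparing a continuous outcome (e.g., BMI) across three or more groups
print(recommend_test("Comparison", "Continuous", "3+ groups"))
# Summarizing a single categorical variable (e.g., gender)
print(recommend_test("Summary", "Categorical"))
```

A lookup of this kind covers only the combinations listed in the table; the sample size and distribution checks discussed in Section 9.3.3.2 still have to be applied separately.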
By using these resources as a guide, the reader should be able to find a suitable statistical test for the data type and research question at hand. It must be noted that this is just a brief introduction to the topic of statistical test suitability and selection. In addition to any decisions based on such guides, it is always helpful and advisable to consult statisticians and analysis experts before embarking on any serious analytical task.

FIGURE 9.3 Choosing the right statistical analysis.

9.4 SETTING UP THE PYTHON ENVIRONMENT

General information related to the process of setting up, and operating in, the Python environment is provided in Chapter 1 of this book. Most of the essential requirements and basic programming concepts presented in that chapter are transferable and, thus, apply to the work and ideas presented here. Nevertheless, if the reader opts to focus solely on this chapter, the sections below provide a quick guide on how to set up the essential platforms, namely Anaconda and Jupyter, as well as the libraries and modules required for the purposes of statistical analysis.

9.4.1 Installing Anaconda and Launching the Jupyter Notebook

The official Anaconda download page allows the user to download and install the latest version of the Python platform (see Chapter 1) (Anaconda Inc., 2020). The code and examples provided in this chapter were written and tested using Python 3.9. Once Anaconda is installed, the Anaconda Navigator can be used to launch applications, and simple Python programs can be created and run using the Spyder or Jupyter Notebook environments. For the purposes of this chapter, Jupyter Notebook is the platform of choice. This is due to a number of reasons. Firstly, it offers an appropriate environment for the Pandas library, which is required for tasks related to data exploration and modelling.
Secondly, it allows for the execution of code in cells rather than running the entire file, something that can save time when it comes to debugging. Thirdly, it provides an easy way to visualize datasets and plots.

9.4.2 Installing and Running the Pandas Library

To install Pandas, the reader can type !pip install pandas in the command input cell. Since Pandas is used frequently, it is commonly imported under a shorter name, namely pd, using the import pandas as pd expression:

    !pip install pandas
    import pandas as pd

9.4.3 Review of Basic Data Analytics

With Pandas imported, the user can read data from local .csv files using the pd.read_csv() function and the full path of the file. For example, the following command can be used to read data from a local file named purchase.csv:

    df = pd.read_csv(r'C:\Python\Example\purchase.csv', index_col=0)

The r prefix marks the path as a raw string, so that the backslashes in the Windows path are not interpreted as escape characters. The same approach applies to reading data of other types, like Excel spreadsheets, SQL, and JSON, using the appropriate functions (i.e., pd.read_excel(), pd.read_sql_query(), and pd.read_json()) (The Pandas Development Team, 2020).

For the purpose of importing tables from HTML webpages, Pandas uses the pd.read_html() function (Sharma, 2019). The following example uses the HTML dataset from a cryptocurrency website to showcase this (WorldCoinIndex, 2021). Firstly, the requests library is imported. After passing the website link to variable url, function requests.get() attempts to connect to the web server and stores the response in variable crypto_url. If a connection is established, the page text held in crypto_url.text is handed to pd.read_html(), which returns a list of dataframes (one per HTML table found) that is stored in variable crypto_df. The first dataframe in this list contains columns with unnecessary data that are discarded from the main dataset.
Finally, the first five rows of the dataset are displayed:

    from io import StringIO

    import pandas as pd
    import requests

    # Define the url
    url = 'https://www.worldcoinindex.com/'
    # Request the url
    crypto_url = requests.get(url)
    # Read every HTML table on the page into a list of dataframes
    # (StringIO is used because recent Pandas versions deprecate
    # passing a literal HTML string to read_html)
    crypto_df = pd.read_html(StringIO(crypto_url.text))
    # Acquire only the relevant data from the dataset
    dataset = crypto_df[0]
    # Limit the displayed rows and columns
    df = dataset.iloc[0:102, 2:5]
    # Print the first five rows of the dataset
    print(df.head(5))

Output 9.4.3:

                  Name Ticker  Last price
    0          Bitcoin    BTC    $ 33,839
    1         Ethereum    ETH  $ 2,140.42
    2    Axie Infinity    AXS     $ 40.82
    3         Dogecoin   DOGE  $ 0.193697
    4  Ethereumclassic    ETC     $ 47.64

A dataframe is a two-dimensional tabular data structure with labeled rows and columns. To view the dataframe, the user can simply call the name of the variable it is stored in. For instance, calling variable df from the pd.read_csv() example presented above will display the entire dataframe that is stored in it. The first and last five rows of a dataframe can also be retrieved using df.head() and df.tail(), respectively. Passing a specific number as an argument retrieves the corresponding number of rows; for example, df.head(10) retrieves the first ten rows.

When it comes to saving the dataframe, various file formats can be chosen. These include, but are not limited to, the following:

1. Plain Text CSV: A commonly used, straightforward format.
2. Pickle: Python's native data storage format.
3. HDF5: A format designed to store large amounts of data.
4. Feather: A fast and lightweight binary file format that is also compatible with statistical analysis software R.

Depending on the requirements and nature of the task at hand, each format has its own advantages and disadvantages.
The example below uses Pickle, as the process is rather straightforward: method to_pickle() is used to save the dataframe to file example.pkl and function pd.read_pickle() to retrieve it:

    df.to_pickle('example.pkl')
    df1 = pd.read_pickle('example.pkl')

9.5 STATISTICAL ANALYSIS TASKS

Once the Python environment is configured and the appropriate methods and tools are determined, the reader can focus on the practical implementation of the various analytical tasks using Python. This section provides coding examples for various statistical analysis concepts and tests as well as information on the interpretation of the test results.

9.5.1 Descriptive Statistics

Descriptive statistics are typically used for summarizing data from a sample. Depending on the type of measures used, a number of tools can be utilized for analysis and visualization (Table 9.8). If the type of measure is a continuous variable, functions and methods like .describe(), plot(kind='hist'), or plt.hist() can be used to generate summarized estimates or plot histograms (Koehrsen, 2018).

TABLE 9.8 Common Descriptive Statistical Tools for Different Types of Measures

| Type of Measure | Summarized Values | Plot |
|---|---|---|
| Continuous Variable | Mean, Median, Standard Deviation, Range | Histogram, Box Chart and similar |
| Categorical Variable | Frequency, Proportion, Percentage | Pie Chart, Bar Chart, Box Chart and similar |

As an example, assume a survey is conducted in order to gather personal information (e.g., age, gender, or BMI) from adults (18+) in a particular geographic area, and this information should be used to describe the age distribution within the sample population.
The examples below show how one can generate the associated summary statistics and plot graphs:

    import pandas as pd

    # Define the floating numbers format (two decimals, thousands separator)
    pd.options.display.float_format = '{:,.2f}'.format

    # Define the analysis dataset
    dataset = pd.read_csv("Survey.csv", index_col = 0)
    print("Descriptive Statistics for Age")
    print(dataset[["age"]].describe())

    # Draw the histogram of the 'age' column
    dataset["age"].plot(kind = 'hist', title = 'Age');

Output 9.5.1.a:

    Descriptive Statistics for Age
               age
    count 2,849.00
    mean     55.83
    std      16.06
    min      18.00
    25%      44.00
    50%      58.00
    75%      67.00
    max     101.00

The results indicate that the mean age of this group is 55.83 years. The age ranges from 18 to 101, and the distribution is roughly centred around the mean, with the median (58) lying slightly above it.

For categorical variables one can use the .value_counts() method to generate the frequency of all values in a column, and the plot(kind='bar') function to plot the frequency using bars (Tavares, 2017). Using the same survey example, the gender distribution for the group of respondents can be calculated and plotted using the following commands:

    import pandas as pd

    # Define the analysis dataset
    dataset = pd.read_csv("Survey.csv", index_col = 0)
    print("Descriptive Statistics for Gender")
    print(dataset[["gender"]].describe())

    # Draw the bar graph for the gender column
    dataset["gender"].value_counts().plot(kind = "bar",
        title = "Gender", rot = 0)

Output 9.5.1.b:

    Descriptive Statistics for Gender
            gender
    count     2849
    unique       2
    top     Female
    freq      1660
    <AxesSubplot:title={'center':'Gender'}>

The results show that there are 1,660 females among the 2,849 respondents, which leaves 1,189 males, and the related plot is generated. As the topic of descriptive statistics is covered in detail in Chapter 8: Data Analytics and Data Visualization, the information provided here is only meant to function as a quick reference.
Nevertheless, it is important to mention that descriptive statistics are frequently used as a way to gauge the data and provide context to many of the inferential statistics tasks presented in the following sections.

9.5.2 Comparison: The Mann-Whitney U Test

The Mann-Whitney U test is a type of non-parametric test for continuous variables. It is used to test whether the distributions of two independent samples are equal. This test is appropriate when the sample size is small, or the data are skewed.

Observation 9.17 – The Mann-Whitney U Test: A non-parametric test for continuous variables. It tests whether the distributions of two independent samples are equal. It is appropriate when the sample size is small or the data are skewed. Use the mannwhitneyu() function from the SciPy library.

As a practical example, one can consider a clinical trial comparing the treatment effects of a standard and a new therapy for patients with depression. Ten participants are randomly allocated to each of the two groups (i.e., standard therapy/new therapy). The primary outcome of the measurements is the depression score, ranging from 1 (extremely depressed) to 100 (extremely euphoric):

| Standard therapy | New therapy |
|---|---|
| 85 | 75 |
| 65 | 40 |
| 70 | 60 |
| 55 | 40 |
| 40 | 50 |
| 75 | 65 |
| 30 | 35 |
| 80 | 70 |
| 20 | 25 |
| 80 | 40 |

The null hypothesis (H0) is that the depression scores of the two therapies are equal. Since the sample size is small (<20), the Mann-Whitney U Test is the appropriate choice for analysis. To run the test, the user can use the mannwhitneyu() function from the SciPy library. Data arrays data1 and data2 contain the depression scores of the standard and new therapies.
The two sets of results can be compared using the mannwhitneyu(data1, data2) function:

    # Example of the Mann-Whitney U Test
    from scipy.stats import mannwhitneyu

    # Standard therapy
    data1 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
    # New therapy
    data2 = [75, 40, 60, 40, 50, 65, 35, 70, 25, 40]
    mannwhitneyu(data1, data2)

Output 9.5.2:

    MannwhitneyuResult(statistic=34.0, pvalue=0.11941708700675263)

The results provide two values: the U statistic (34.0) and the p-value (0.119). Since the latter is larger than the significance level of 0.05, there is insufficient evidence to conclude that the depression scores of the two therapies are different. Hence, the null hypothesis cannot be rejected, and it cannot be concluded that the new therapy changes depression scores compared to the standard therapy. Note that the exact figures depend on the SciPy version: recent releases may report the statistic and the p-value differently (for example, a two-sided p-value by default), although the conclusion for this dataset remains the same.

9.5.3 Comparison: The Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is used to test whether the distributions of two paired samples are equal or not. It is a non-parametric test that can be used for both continuous and ordinal variables.

Observation 9.18 – The Wilcoxon Signed-Rank Test: A non-parametric test for continuous or ordinal variables. It tests whether the distributions of two paired samples are equal. It is appropriate when the sample size is small or the data are skewed. Use the wilcoxon() function from the SciPy library.

As an example, one can assume a test during which depression score measurements are taken before and after a newly developed therapy for ten patients, and the goal is to find whether the therapy makes a difference:

| Patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Before therapy | 85 | 65 | 70 | 55 | 40 | 75 | 30 | 80 | 20 | 80 |
| After therapy | 75 | 40 | 50 | 40 | 50 | 65 | 35 | 20 | 25 | 40 |

The null hypothesis (H0) is that there is no difference in depression scores before and after the therapy.
Since the data are taken from pairs and the sample size is small, the Wilcoxon Signed-Rank Test is an appropriate choice. To run the test, the user can use the wilcoxon() function from the SciPy library. Data arrays data1 and data2 contain the depression scores before and after therapy. The two sets of results can be compared using the wilcoxon(data1, data2) function:

    # Example of the Wilcoxon Signed-Rank Test
    from scipy.stats import wilcoxon

    # Before therapy
    data1 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
    # After therapy
    data2 = [75, 40, 50, 40, 50, 65, 35, 20, 25, 40]
    wilcoxon(data1, data2)

Output 9.5.3:

    WilcoxonResult(statistic=7.0, pvalue=0.037109375)

The test provides a p-value of 0.037, which is below the significance level of 0.05. Hence, the null hypothesis can be rejected with the conclusion that the new therapy has a significant effect on the depression scores.

9.5.4 Comparison: The Kruskal-Wallis Test

The Kruskal-Wallis Test is used to test whether the distributions (medians) of two or more independent samples are equal or not. It is used for continuous or ordinal variables when the sample size is small and/or data are not normally distributed. The test indicates whether the differences between the test groups are likely to have occurred by chance or not. It is worth noting that the Kruskal-Wallis Test is used under the assumption that the observations in each group come from populations with the same shape of distribution. Hence, if different groups have different distribution shapes (e.g., one is right-skewed and another left-skewed), the Kruskal-Wallis Test may produce inaccurate results (Fagerland & Sandvik, 2009).

Observation 9.19 – The Kruskal-Wallis Test: A non-parametric test for continuous or ordinal variables with small sample size and/or data not normally distributed but with a similar skewness. It tests whether the differences between two or more groups are by chance or not. Use the kruskal() function from the SciPy library.

As an example of how to use the test in Python, one can assume a case of three available options to alleviate depression: standard therapy, new therapy, and new therapy plus exercise. The purpose of the test is to determine whether there is any difference in depression scores between the three therapy options with the following depression scores:

| New therapy + exercise | New therapy | Standard therapy |
|---|---|---|
| 90 | 85 | 75 |
| 80 | 65 | 40 |
| 90 | 70 | 50 |
| 30 | 55 | 40 |
| 55 | 40 | 50 |
| 90 | 75 | 65 |
| 55 | 30 | 35 |
| 85 | 80 | 20 |
| 40 | 20 | 25 |
| 90 | 80 | 40 |

Since the sample size is small and the depression scores are ordinal, the Kruskal-Wallis Test is an appropriate choice. To run the test in Python, one can use the kruskal() function from the SciPy library. Data arrays data1, data2 and data3 contain the depression scores for new therapy and exercise, new therapy, and standard therapy, respectively. The three sets of results can be compared using the kruskal(data1, data2, data3) expression:

    # Example of the Kruskal-Wallis Test
    from scipy.stats import kruskal

    # New therapy and exercise
    data1 = [90, 80, 90, 30, 55, 90, 55, 85, 40, 90]
    # New therapy
    data2 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]
    # Standard therapy
    data3 = [75, 40, 50, 40, 50, 65, 35, 20, 25, 40]
    kruskal(data1, data2, data3)

Output 9.5.4:

    KruskalResult(statistic=7.275735789710176, pvalue=0.026308376435655575)

The results show that the p-value is 0.026, which is less than the significance level of 0.05. Hence, the null hypothesis (H0) (i.e., the depression scores of the three therapies are equal) can be rejected, with the conclusion that a significant difference exists between the three treatment options.
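To make the Kruskal-Wallis statistic less of a black box, the sketch below recomputes it from the pooled ranks, using the same three groups of scores as the example above. The function name kruskal_by_hand() is hypothetical; the calculation follows the standard formula for the H statistic with a correction for tied values, and the p-value is taken from a chi-square distribution with (number of groups − 1) degrees of freedom.

```python
# Illustrative sketch: the Kruskal-Wallis H statistic computed from ranks.
from collections import Counter

from scipy.stats import chi2, kruskal, rankdata


def kruskal_by_hand(*groups):
    """Return (H, p-value) for the given groups (hypothetical helper)."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    # Rank all observations together; tied values receive their average rank
    ranks = rankdata(pooled)

    # Sum of (rank sum squared / group size) over the groups
    start = 0
    rank_term = 0.0
    for group in groups:
        group_ranks = ranks[start:start + len(group)]
        rank_term += group_ranks.sum() ** 2 / len(group)
        start += len(group)

    h_stat = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Correction for ties: divide by 1 - sum(t^3 - t) / (N^3 - N)
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    h_stat /= 1 - tie_sum / (n_total ** 3 - n_total)

    # Chi-square approximation with (number of groups - 1) degrees of freedom
    p_value = chi2.sf(h_stat, len(groups) - 1)
    return h_stat, p_value


data1 = [90, 80, 90, 30, 55, 90, 55, 85, 40, 90]  # New therapy and exercise
data2 = [85, 65, 70, 55, 40, 75, 30, 80, 20, 80]  # New therapy
data3 = [75, 40, 50, 40, 50, 65, 35, 20, 25, 40]  # Standard therapy

h_manual, p_manual = kruskal_by_hand(data1, data2, data3)
h_scipy, p_scipy = kruskal(data1, data2, data3)
print(f"By hand: H = {h_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:   H = {h_scipy:.4f}, p = {p_scipy:.4f}")
```

Because the statistic depends only on ranks, rescaling the scores (for example, from a 1-100 scale to a 1-10 scale) would leave the result unchanged, which is why the test is suitable for ordinal data.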
9.5.5 Comparison: Paired t-Test

The Paired t-Test, also referred to as the Dependent t-Test, is used to test whether repeated measurements (means) taken from the same sample are significantly different. Since the measurements come from the same sample, the terms paired samples, matched samples or repeated measures are also commonly used for this type of test. The test is used under the assumption that the measurements are normally distributed and do not contain significant outliers. If the measurements are skewed or contain significant outliers, the Wilcoxon Signed-Rank Test should be used instead.

Observation 9.20 – The Paired t-Test: A parametric test for normally distributed data with no significant outliers. Use the ttest_rel() function from the SciPy library.

As an example, one can assume the case of a new drug developed to assist patients by reducing blood pressure. To investigate the effectiveness of the new drug, the blood pressure of 80 patients is measured first prior to taking the drug and then again 3 months later. Since the goal is to determine whether the new drug is effective, the null hypothesis (H0) is that the average blood pressure will be the same before and after taking the drug.
Assuming a dataset stored in a file named Blood.csv, the user can conduct the Paired t-Test in Python using the ttest_rel() function from the SciPy library:

    import pandas as pd
    from scipy.stats import ttest_rel

    # Define the format of floating numbers
    pd.options.display.float_format = '{:,.2f}'.format

    # Define the dataset
    dataset = pd.read_csv("Blood.csv", index_col = 0)
    print("Descriptive Statistics for Blood before and after")
    print(dataset[["Before", "After"]].describe())

    # Prepare and display the scatter plot for the dataset
    dataFrame = pd.DataFrame(data = dataset, columns = ["Before", "After"])
    dataFrame.plot.scatter(x = "Before", y = "After",
        title = "Scatter chart for Blood.csv", figsize = (7, 7))

    # Calculate the Paired t-Test
    ttest_rel(dataset[["Before"]], dataset[["After"]])

Output 9.5.5:

    Descriptive Statistics for Blood before and after
           Before  After
    count   80.00  80.00
    mean   153.39 147.55
    std     10.49  13.57
    min    138.00 125.00
    25%    144.75 136.00
    50%    151.50 146.00
    75%    159.25 157.00
    max    185.00 184.00
    Ttest_relResult(statistic=array([2.91731434]), pvalue=array([0.00459528]))

Columns Before and After of the dataset correspond to the blood pressure measurements before and after the drug therapy. The results show that the average blood pressure before taking the new drug was higher (153.39 mmHg) compared to the measurement taken after drug administration (147.55 mmHg). The test provides a p-value of 0.0046, which is lower than the significance level of 0.05. Hence, the null hypothesis can be rejected with the conclusion that a statistically significant difference in blood pressure occurs after using the new drug.

9.5.6 Comparison: Independent or Student t-Test

The Independent t-Test, also known as the Student t-Test, is used to test whether the means of two independent samples are significantly different. To conduct Independent t-Tests in Python, the ttest_ind() function from the SciPy library can be used. The function accepts two arrays as parameters, corresponding to the sets of data under investigation. The reader can find more information on the official SciPy.org website (The SciPy Community, 2020).

Observation 9.21 – The Student t-Test: A parametric test for normally distributed data with no significant outliers. Use the ttest_ind() function from the SciPy library.

Using the same survey example, one can assume a case where the user needs to know whether the ages of men and women within the sample are different. In this context, the null hypothesis (H0) is that the mean ages of the two groups are equal:

    import pandas as pd
    from scipy.stats import ttest_ind

    # Define the format of floating numbers
    pd.options.display.float_format = '{:,.2f}'.format

    # Define the dataset
    dataset = pd.read_csv("survey.csv", index_col = 0)
    print("Descriptive Statistics for age grouped by gender")
    print(dataset["age"].groupby(dataset["gender"]).describe())

    # Calculate the Student t-Test
    ttest_ind(dataset.age[dataset.gender == 'Male'],
        dataset.age[dataset.gender == 'Female'])

Output 9.5.6:

    Descriptive Statistics for age grouped by gender
              count  mean   std   min   25%   50%   75%    max
    gender
    Female 1,660.00 55.27 16.42 18.00 43.00 57.00 67.00 101.00
    Male   1,189.00 56.61 15.50 19.00 45.00 58.00 68.00  98.00
    Ttest_indResult(statistic=2.1993669348926157, pvalue=0.02793196707542121)

The first output shows that the average age for men (56.61) is higher than that of women (55.27). The Independent t-Test is conducted in order to determine whether this difference is significant. The first statistic value is the t score (2.199), which is the ratio of the difference between the two group means to the variability within the groups. As a general rule, the higher the t score, the bigger the difference would be between groups, and vice versa.
To determine whether the t score is high enough, one has to rely on the p-value output. In this example, the p-value is 0.0279, which is lower than the significance level of 0.05. Thus, the null hypothesis can be rejected with the conclusion that there is a statistically significant difference between the ages of male and female individuals.

9.5.7 Comparison: ANOVA

The ANOVA (i.e., Analysis of Variance) Test is used to compare the means of three or more samples. It assumes independence of observations, homogeneity of variances, and normally distributed observations within groups. In Python, the user can utilize the f_oneway() function from the SciPy library to calculate the F-Statistic, which, in turn, can be used to calculate the p-value. The function accepts parameters corresponding to the sample measures for each group under consideration.

Observation 9.22 – The ANOVA Test: A parametric test for normally distributed, independent observations, with homogeneity of variances. Use the f_oneway() function from the SciPy library.

Using the same survey data as an example, one can assume that the user needs to know whether the Body Mass Index (BMI) values are different across non-smokers, former smokers, and current smokers (smoking status).
The null hypothesis (H0) is that there is no difference between the means of the BMIs among people from the three different groups:

    import pandas as pd
    from scipy.stats import f_oneway

    # Define the format of floating numbers
    pd.options.display.float_format = '{:,.2f}'.format

    # Define the dataset
    dataset = pd.read_csv("survey.csv", index_col = 0)
    print("Descriptive statistics for survey by smokestat")
    print(dataset.bmi.groupby(dataset.smokestat).describe(), "\n")

    # Calculate the one-way ANOVA Test
    print("Results of ANOVA by smokestat values of Never, Former, Current")
    print(f_oneway(dataset.bmi[dataset.smokestat == "Never"],
        dataset.bmi[dataset.smokestat == "Former"],
        dataset.bmi[dataset.smokestat == "Current"]))

Output 9.5.7:

    Descriptive statistics for survey by smokestat
                 count  mean  std   min   25%   50%   75%   max
    smokestat
    Current     363.00 28.20 6.84 17.50 23.20 27.20 31.25 62.60
    Former      755.00 29.22 6.24 16.80 25.05 28.20 32.40 66.20
    Never     1,731.00 28.14 6.48 16.10 23.50 27.10 31.30 75.20

    Results of ANOVA by smokestat values of Never, Former, Current
    F_onewayResult(statistic=7.548128785289014, pvalue=0.0005377158828502398)

The first output shows that the former smokers have the highest mean BMI (29.22), followed by current smokers (28.20), and non-smokers (28.14). The output of the ANOVA Test shows that the F-Statistic is 7.55 and the p-value is 0.0005, indicating an overall significant effect of smoking status on BMI. However, at this point it is uncertain exactly where the difference between groups lies. To clarify this, one needs to conduct post-hoc tests. For more detailed information regarding post-hoc tests in Python, the reader can refer to the official documentation in Scikit-posthocs (2020).

9.5.8 Comparison: Chi-Square

As shown, the t-Test is used to check whether means differ between two groups. The Chi-square Test, also known as the Chi-squared Goodness-of-fit Test, is the equivalent of the t-Test for categorical variables. It tests whether categorical data from a single sample follow a specified distribution (i.e., external or historical distribution).

Observation 9.23 – The Chi-Square Test: A parametric test for categorical variables. It tests whether data from a single sample follow a specified distribution. Use the chisquare() function from the SciPy library.

For example, based on the example of a smoker status survey, one can assume that the proportions of non-smokers, former smokers, and current smokers are 30%, 10%, and 60% respectively. The government launched a health promotion campaign in an attempt to increase the smoking cessation rate. To evaluate the impact of the programme, the same survey was conducted for a second time a year later. The survey was completed by 500 people, and the data obtained were the following:

                        Non-Smokers   Former Smokers   Current Smokers
    Before programme        150             50               300
    After programme         140             80               280

Since the goal is to determine the impact of the health promotion programme, the null hypothesis (H0) assumes that the distribution of smoking status is the same before and after the implementation of the programme and, thus, that the health promotion campaign has no impact. In such cases, the Chi-square Test is an appropriate choice. In Python, the test can be conducted using the chisquare() function from the SciPy library.
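The statistic behind the test is the sum of (observed - expected)² / expected over the categories. The sketch below computes it by hand for the counts in the table above; the helper name chi_square_statistic is an assumption introduced for illustration. It also shows that the result depends on which set of counts is treated as "expected": SciPy's signature is chisquare(f_obs, f_exp), so the first argument is always read as the observed counts.

```python
import numpy as np
from scipy.stats import chisquare

def chi_square_statistic(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Counts from the table above: non-smokers, former smokers, current smokers
before = np.array([150, 50, 300])
after = np.array([140, 80, 280])

# "After" counts as observed, "before" counts as the expected distribution
stat_after_vs_before = chi_square_statistic(after, before)
# Roles swapped: "before" as observed, "after" as expected
stat_before_vs_after = chi_square_statistic(before, after)

# The same computation through SciPy, with the roles named explicitly
scipy_result = chisquare(f_obs=after, f_exp=before)

print(round(stat_after_vs_before, 2), round(stat_before_vs_after, 2))
print(scipy_result.statistic, scipy_result.pvalue)
```

Treating the pre-programme counts as the expected distribution gives a statistic of about 20.0, while swapping the roles gives about 13.39. Both lead to rejecting H0 at the 0.05 level for these data, but the argument order is worth stating explicitly when reporting the result.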
The function accepts parameters corresponding to the observed frequencies in each categorical variable:

    import numpy as np
    from scipy.stats import chisquare

    # Define the datasets
    before = np.array([150, 50, 300])
    print("The dataset before the program:")
    print(before)
    after = np.array([140, 80, 280])
    print("The dataset after the program:")
    print(after)

    print("The Chi-square test results are the following:")
    print(chisquare(before, after))

Output 9.5.8:

    The dataset before the program:
    [150  50 300]
    The dataset after the program:
    [140  80 280]
    The Chi-square test results are the following:
    Power_divergenceResult(statistic=13.392857142857142, pvalue=0.0012353158761688927)

The first value of the output (13.39) is the Chi-square value, followed by the p-value (0.0012). Since the p-value is less than the significance level of 0.05, the null hypothesis is rejected, indicating that there is a significant difference in terms of the smoking status before and after the programme.

9.5.9 Relationship: Pearson’s Correlation

Correlation is used to test whether two continuous variables have a linear relationship. The correlation coefficient summarizes the strength of this relationship.

Observation 9.24 – Pearson’s Correlation: A test used to examine whether two normally distributed, continuous variables have a linear relationship. Use the pearsonr() function from the SciPy library.

As an example, the reader can assume that one needs to know whether age and BMI are correlated. The null hypothesis (H0) for this example is that age and BMI are not correlated. Assuming that both age and BMI are normally distributed and have the same variance, one can use function pearsonr() from the SciPy library to calculate the correlation coefficient and estimate the strength of the relationship.
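As a sketch of what the coefficient measures, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The helper name pearson_r and the synthetic age and BMI values below are assumptions for illustration only; the result is compared with pearsonr() from SciPy.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = np.sum(x_centered * y_centered)
    denominator = np.sqrt(np.sum(x_centered ** 2) * np.sum(y_centered ** 2))
    return float(numerator / denominator)

# Hypothetical data: BMI depends only weakly on age, plus random noise
rng = np.random.default_rng(seed=1)
age = rng.normal(50, 15, size=300)
bmi = 27 + 0.02 * age + rng.normal(0, 6, size=300)

r_by_hand = pearson_r(age, bmi)
r_scipy, p_value = pearsonr(age, bmi)
print(r_by_hand, r_scipy, p_value)
```

The coefficient always lies between -1 and 1; the p-value returned alongside it addresses the separate question of whether a coefficient of that size could plausibly arise when the true correlation is zero.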
The function accepts two arrays as parameters corresponding to the sets of data:

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr

    # Read the dataset
    dataset = pd.read_csv("example.csv", index_col = 0)
    print(pearsonr(dataset.age, dataset.bmi))

    # Visualize the correlation with a scatter plot
    print(plt.scatter(dataset.age, dataset.bmi, alpha = 0.5,
        edgecolors = "none", s = 20))

Output 9.5.9:

    (0.0453741864067145, 0.014235768675028503)
    <matplotlib.collections.PathCollection object at 0x000002802BD93310>

The first value of the output is the correlation coefficient (0.045), followed by the p-value (0.014). Since the p-value is less than the significance level of 0.05, the null hypothesis can be rejected, suggesting that a relationship exists between age and BMI. Another important observation is that the correlation is positive (i.e., if age increases, BMI increases too), as the correlation coefficient is a positive number. However, the strength of the correlation is rather weak, as the correlation coefficient (0.045) is quite close to 0 (i.e., no correlation). The correlation can also be visualized as a scatter plot, using the scatter() function as shown in the script above.

9.5.10 Relationship: The Chi-Square Test

To test whether two categorical variables are independent, one may use the Chi-squared Test, also known as Chi-squared Test of Independence or Pearson’s Chi-square Test.

Observation 9.25 – Pearson’s Chi-Square Test: A test used to examine whether two categorical variables are independent. Use the chi2_contingency() function from the SciPy library.

To demonstrate the logic of the test, one can use the same survey data example and evaluate whether gender and smoking status are associated. The null hypothesis (H0) would be that there is no relationship between gender and smoking status.
When none of the expected frequencies is less than 5, one can use the crosstab() function from the Pandas library to create a cross table and scipy.stats.chi2_contingency() to conduct the Chi-square Test on the contingency/cross table. Detailed documentation for this function can be found in the official SciPy.org website (The SciPy Community, 2020). The following Python script makes use of both the crosstab() and the chi2_contingency() functions to provide the frequencies of the smoking status across the two gender groups and test whether there is an indication of a relationship between them:

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Read the dataset
    dataset = pd.read_csv("example.csv", index_col = 0)
    print(pd.crosstab(dataset.smokestat, dataset.gender), "\n")

    # Calculate the Chi-squared Test of Independence
    print(chi2_contingency(pd.crosstab(dataset.smokestat,
        dataset.gender)))

Output 9.5.10.a (test results rounded):

    gender     Female  Male
    smokestat
    Current       210   162
    Former        403   367
    Never        1093   683

    (19.453, 5.96e-05, 2, array([[ 217.49,  154.51],
           [ 450.18,  319.82],
           [1038.33,  737.67]]))

The first value of the output (19.453) is the Chi-square value, followed by the p-value (5.96e−05), the degrees of freedom (2), and the expected frequencies as an array. Since the p-value is less than 0.05, the null hypothesis can be rejected, indicating that a relationship between smoking status and gender exists.

It is worth noting that if an expected frequency lower than 5 is present, the user should use Fisher’s Exact Test instead of the Chi-square Test. Both tests assess independence between variables.
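The rule about expected frequencies can be sketched as follows. The counts in observed are those of the cross table above (rows: Current, Former, Never; columns: Female, Male). The small 2x2 table and the helper name independence_test are hypothetical, and the fallback is restricted to 2x2 tables as a simplifying assumption, since that is the case fisher_exact() handles in every SciPy version.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def independence_test(table):
    """Return the name of the test used and its p-value for a contingency table."""
    table = np.asarray(table)
    chi2, p_value, dof, expected = chi2_contingency(table)
    # Fall back to the exact test when any expected count is below 5 (2x2 only here)
    if expected.min() < 5 and table.shape == (2, 2):
        _, p_value = fisher_exact(table)
        return "fisher", p_value
    return "chi2", p_value

# Cross table of smoking status by gender, copied from the output above
observed = np.array([[210, 162],
                     [403, 367],
                     [1093, 683]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), dof, expected.min())

# Hypothetical small sample where the Chi-square approximation is doubtful
small_table = np.array([[3, 1],
                        [1, 3]])
print(independence_test(observed))
print(independence_test(small_table))
```

For the survey table the smallest expected count is far above 5, so the Chi-square approximation is reasonable; for the small table every expected count is 2, and the exact test is used instead.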
The Chi-square Test applies an approximation assuming the sample is large, while Fisher’s Exact Test runs an exact procedure suitable for small-sized samples (Kim, 2017).

To visualize the results of the test, one can also create a mosaic plot using the mosaic() function from the Statsmodels library. The function accepts the source data as a parameter, together with the names of the columns to be used for the plot:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic

    # Read the dataset
    dataset = pd.read_csv("example.csv", index_col = 0)
    mosaic(dataset, ["smokestat", "gender"])
    plt.show()

Output 9.5.10.b: [mosaic plot of smokestat by gender]

9.5.11 Relationship: Linear Regression

Linear regression is used to examine the linear relationship between two (i.e., univariate linear regression) or more (i.e., multivariate linear regression) variables.

Observation 9.26 – Linear Regression: A test used to examine the linear relationship between two (i.e., univariate) or more (i.e., multivariate) variables. Use the OLS(y, X).fit() function from the Statsmodels library.

To contextualize this using the previous survey example, the reader can assume a case where one wants to test the relationship between body weight and BMI, where the BMI is normally distributed. Additionally, predictions regarding the BMI should be made based on weight information. Since BMI is a continuous variable, linear regression is appropriate for the analysis. In Python, linear regression can be performed using either the Statsmodels or the Scikit-learn libraries. For this example, the test choice was function OLS(y, X).fit() from the Statsmodels library, as the Scikit-learn library is generally associated more with tasks related to machine learning.
The related Python script and its output are provided below:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # Read the dataset
    dataset = pd.read_csv("example2.csv", index_col = 0)

    # Independent variable
    X = dataset.weight
    # Dependent variable
    y = dataset.bmi
    # Add an intercept (beta_0) to the model
    X = sm.add_constant(X)
    # Function sm.OLS(dependent variable, independent variable)
    model = sm.OLS(y, X).fit()
    # Predictions
    predictions = model.predict(X)

    # Print out the statistics
    print(model.summary())
    # Plot the statistics
    print(sm.graphics.plot_ccpr(model, "weight"))

Output 9.5.11.a and 9.5.11.b:

                            OLS Regression Results
    Dep. Variable:                 bmi    R-squared:               0.740
    Model:                         OLS    Adj. R-squared:          0.739
    Method:              Least Squares    F-statistic:             8085.
    Date:             Sun, 25 Jul 2021    Prob (F-statistic):       0.00
    Time:                     16:35:19    Log-Likelihood:        -7449.6
    No. Observations:             2849    AIC:                 1.490e+04
    Df Residuals:                 2847    BIC:                 1.492e+04
    Df Model:                        1
    Covariance Type:         nonrobust

                 coef   std err         t     P>|t|    [0.025    0.975]
    const      6.5712     0.251    26.188     0.000     6.079     7.063
    weight     0.1218     0.001    89.918     0.000     0.119     0.124

    Omnibus:               268.275    Durbin-Watson:           1.290
    Prob(Omnibus):           0.000    Jarque-Bera (JB):      609.183
    Skew:                    0.574    Prob(JB):            5.22e-133
    Kurtosis:                4.953    Cond. No.                 750.

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    Figure(432x288)

In this example of linear regression, y is the dependent variable, i.e., the variable that must be predicted or estimated. Variable X holds the independent variables, which are the predictors of y. It must be noted that an intercept needs to be added to the independent variables using sm.add_constant(X) before running the regression.

The output provides several pieces of information. The first part contains information about the dependent variable, the number of observations, the model, and the method.
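For a single predictor, the intercept and slope reported in the coefficient table have a closed-form least-squares solution. The following is a minimal sketch of that computation, checked against NumPy's polyfit(); the helper name fit_line and the synthetic weight and BMI values are assumptions for illustration and are not the contents of example2.csv.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: co-variation of x and y divided by the variation of x
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # The fitted line always passes through the point of means
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Hypothetical data: BMI rises by roughly 0.12 per pound, plus random noise
rng = np.random.default_rng(seed=2)
weight = rng.uniform(100, 250, size=500)
bmi = 6.5 + 0.12 * weight + rng.normal(0, 3, size=500)

intercept, slope = fit_line(weight, bmi)
slope_np, intercept_np = np.polyfit(weight, bmi, deg=1)

# Use the fitted line to predict the BMI for a weight of 125 pounds
predicted_bmi = intercept + slope * 125
print(intercept, slope, predicted_bmi)
```

With more than one predictor there is no such two-line formula, which is one reason to rely on a library routine such as OLS(y, X).fit().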
OLS stands for Ordinary Least Squares, and method Least Squares refers to fitting the regression line that minimizes the sum of the squared vertical distances from the data points to the line. Another important value presented in the first part is the R squared (R² = 0.740), which is the proportion of the variance in the dependent variable that the model can explain (74.0%). The larger the R squared value, the better the model fit.

The second part of the output includes the intercept and the coefficients. The p-value for weight is reported as 0.000 (i.e., lower than 0.001), indicating that weight is a statistically significant predictor of BMI, with a weight increase of 1 pound associated with an increase in BMI of 0.1218. The linear regression equation can also be written in the following form:

$$\text{BMI} = \text{Intercept} + \text{Weight\_coefficient} \times \text{weight}$$

Once the output numbers are added, the equation takes the following form:

$$\text{BMI} = 6.5712 + 0.1218 \times \text{weight}$$

Therefore, if the user knows a person’s weight (e.g., 125 pounds), their BMI can be calculated as 6.5712 + 0.1218 * 125 = 21.7962. The user can also use the Matplotlib library to plot the results, as illustrated in the associated graph.

9.5.12 Relationship: Logistic Regression

Logistic regression is used to describe the relationship between a dependent, categorical variable and one or more independent variables. It models the logit-transformed probability as a linear function of the predictor variables.

Observation 9.27 – Logistic Regression: A test used to examine the relationship between a dependent, categorical variable and one or more independent variables. Use the Logit(y, X) function from the Statsmodels library.

For instance, using the same survey example, one can assume that the user wants to know the relationship between smoking status (i.e., 1 = current smoker, and 0 = non-smoker) and the potential predictors, such as age, gender, and marital status.
In addition, the user may also want to predict the smoking status based on the predictor information. Since smoking status is a categorical variable, logistic regression is an appropriate analysis method. In Python, logistic regression can be conducted using the Logit(y, X) function from the Statsmodels library. Parameter y is the dependent variable, i.e., the variable that must be predicted or estimated. Variable X holds the independent variables, which are the predictors of y:

    # Example of Logistic Regression
    import pandas as pd
    import statsmodels.api as sm

    # Read data
    df = pd.read_csv("Example2.csv", index_col = 0)

    x = df[["age", "gender2", "marital_divorced", "marital_single",
        "marital_widowed"]]
    y = df.smokestat2

    # Add an intercept (beta_0) to the model
    X = sm.add_constant(x)

    logit_model = sm.Logit(y, X)
    result = logit_model.fit()

    # Print the summary of the fitted model
    print(result.summary2())

Output 9.5.12:

    Optimization terminated successfully.
             Current function value: 0.373830
             Iterations 6
                          Results: Logit
    Model:               Logit             Pseudo R-squared:  0.020
    Dependent Variable:  smokestat2        AIC:               2142.0822
    Date:                2021-07-27 13:21  BIC:               2177.8105
    No. Observations:    2849              Log-Likelihood:    -1065.0
    Df Model:            5                 LL-Null:           -1086.7
    Df Residuals:        2843              LLR p-value:       3.1240e-08
    Converged:           1.0000            Scale:             1.0000
    No. Iterations:      6.0000

                        Coef.   Std.Err.        z    P>|z|   [0.025   0.975]
    const             -1.7107     0.2307  -7.4156   0.0000  -2.1628  -1.2585
    age               -0.0109     0.0040  -2.7133   0.0067  -0.0187  -0.0030
    gender2            0.1805     0.1170   1.5418   0.1231  -0.0489   0.4098
    marital_divorced   0.8406     0.1422   5.9097   0.0000   0.5618   1.1194
    marital_single     0.4609     0.1584   2.9096   0.0036   0.1504   0.7715
    marital_widowed    0.4764     0.2229   2.1372   0.0326   0.0395   0.9133

As in linear regression, the output contains two parts.
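Before going through the two parts, a short sketch of how the coefficient column is typically used may help. The coefficients are on the log-odds scale: exponentiating one gives an odds ratio, and passing the linear combination through the logistic function gives a predicted probability. The values below are copied from the output above; the helper names and the coding gender2 = 1 for a male respondent are assumptions made for illustration.

```python
import math

# Coefficients copied from the Logit output above (log-odds scale)
coefficients = {
    "const": -1.7107,
    "age": -0.0109,
    "gender2": 0.1805,
    "marital_divorced": 0.8406,
    "marital_single": 0.4609,
    "marital_widowed": 0.4764,
}

def odds_ratio(name):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coefficients[name])

def predict_probability(values):
    """Logistic function applied to the intercept plus the weighted predictors."""
    z = coefficients["const"] + sum(coefficients[key] * value
                                    for key, value in values.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical respondent: 40 years old, divorced, gender2 assumed to be 1 for male
person = {"age": 40, "gender2": 1, "marital_divorced": 1}
probability = predict_probability(person)

print(round(odds_ratio("marital_divorced"), 2))
print(round(probability, 3))
```

Predictors left out of the dictionary are treated as 0, which for the marital status indicators corresponds to the reference category.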
The first part provides information about the dependent variable and the number of observations, while the second part provides the intercept and the coefficients. As shown, age and marital status are significant predictors of smoking status (p < 0.05), while gender is not (p = 0.1231). The odds of being a smoker for individuals who are divorced are 2.32 (i.e., exp(0.8406)) times those of individuals who are married. Similar trends are also observed for those who are single (1.59 times) and widowed (1.61 times). In terms of age, it is observed that for every 1-year increase in age there is a decrease of approximately 1% (i.e., 1−exp(−0.0109)) in the odds of an individual being a smoker.

The output information can also be used in order to build the logistic regression equation as follows:

$$P(\text{smoker}) = \frac{\exp(z)}{1 + \exp(z)}, \qquad z = -1.7107 - 0.0109 \cdot \text{Age} + 0.1805 \cdot \text{gender2} + 0.8406 \cdot \text{Divorced} + 0.4609 \cdot \text{Single} + 0.4764 \cdot \text{Widowed}$$

As such, it can be predicted that a 40-year-old divorced male will have a 24.5% probability of being a smoker:

$$\frac{\exp(-1.7107 - 0.0109 \cdot 40 + 0.1805 \cdot 1 + 0.8406 \cdot 1)}{1 + \exp(-1.7107 - 0.0109 \cdot 40 + 0.1805 \cdot 1 + 0.8406 \cdot 1)} = \frac{0.3245}{1 + 0.3245} = 0.2450$$

9.6 WRAP UP

This chapter focused on the introduction of basic concepts and terms related to statistical analysis and on the practical demonstration of carrying out inferential statistical analysis tasks using Python. It provided an overview of statistics and the available tools for conducting the analytical tasks. Basic statistical concepts, such as population and sample, hypothesis, significance levels and confidence intervals, were introduced. It also provided a practical guide for choosing the right type of statistical test for different types of tasks. The purposes and definitions of common types of statistical analysis methods were briefly discussed.
Furthermore, it covered the necessary background for choosing a statistical analysis approach, such as levels and types of variables and the corresponding statistical and hypothesis tests, and demonstrated how to set up the Python environment and work with various libraries specifically designed for statistical analysis. Finally, it provided a practical guide for the implementation and execution of common statistical analysis tasks in Python. Each statistical analysis method was supported by working examples, the associated Python programming code, and result interpretations. A list of the common statistical analysis methods covered in this section, as well as the corresponding Python libraries and methods, is presented below:

    Statistical Test                          Library      Code
    Mann-Whitney U Test                       SciPy        mannwhitneyu(data1, data2)
    Wilcoxon Signed-rank Test                 SciPy        wilcoxon(data1, data2)
    Kruskal-Wallis Test                       SciPy        kruskal(data1, data2, data3, …)
    Paired t-Test                             SciPy        ttest_rel(data1, data2)
    Independent t-Test                        SciPy        ttest_ind(data1, data2)
    Chi-Square Goodness of Fit                SciPy        chisquare(data1, data2)
    ANOVA                                     SciPy        f_oneway(data1, data2, data3, …)
    Pearson’s Correlation                     SciPy        pearsonr(var1, var2)
    Pearson’s Correlation (Scatter Plot)      Matplotlib   scatter(var1, var2)
    Pearson’s Chi-Square Test                 SciPy        chi2_contingency(crosstab)
    Pearson’s Chi-Square Test (Mosaic Plot)   Statsmodels  mosaic(Dataframe, ['var1', 'var2'])
    Linear Regression                         Statsmodels  OLS(y, X).fit()
    Logistic Regression                       Statsmodels  Logit(y, X)

The basic inferential statistical tests covered in this chapter lay the foundation for other, more advanced statistical analysis tasks, such as time to event and time series analysis. Ultimately, such methods and results could be used as building blocks for even more complex system simulations, such as Markov models, discrete-event, and agent-based simulations.
Although advanced statistical analysis and simulation tasks like these were not covered in this chapter, the reader should be able to explore them by building on the information and knowledge acquired. Relevant key textbooks and bibliography for the purposes of further study and self-learning can be found in the Reference List of this chapter.

9.7 EXERCISES

We conducted an experiment on how different plant species respond to the length of daylight over 3 months. The data we collected are listed below:

    Sample   Plant Species   Length of Daylight (Hours per Day)   Growth (cm)   Flowered or Not (1 = Yes, 0 = No)
    1        A               6                                    4.2           0
    2        B               7                                    3.1           1
    3        A               6                                    4.6           1
    4        A               5                                    3.3           0
    5        B               6                                    2.5           0
    6        A               8                                    5.2           1
    7        B               9                                    3.9           1
    8        B               5                                    2.1           0
    9        A               7                                    3.5           1
    10       B               8                                    3.4           1

1. The variable of Plant Species is:
   A. Ordinal variable
   B. Nominal variable
   C. Interval variable
   D. Ratio variable
   Answer: B

2. The variable of Length of Daylight is:
   A. Ordinal variable
   B. Nominal variable
   C. Interval variable
   D. Ratio variable
   Answer: D

3. The variable of Growth is:
   A. Ordinal variable
   B. Nominal variable
   C. Continuous variable
   D. Categorical variable
   Answer: C

4. The variable of Flowered or not is:
   A. Ordinal variable
   B. Nominal variable
   C. Interval variable
   D. Ratio variable
   Answer: B

5. If we want to know the correlation between Length of Daylight and Growth, which of the following statistical methods should we use?
   A. Chi-square
   B. Pearson’s Correlation
   C. Logistic Regression
   D. ANOVA
   Answer: B

6. The estimated correlation coefficient is 0.45. What is the strength of the correlation?
   A. Weak negative correlation
   B. Strong positive correlation
   C. Moderate positive correlation
   D. Weak positive correlation
   Answer: D

7. If we want to compare the growth difference of different plant species, which statistical analysis should we use?
   A. Linear Regression
   B. Chi-square Test
   C. Student t-Test
   D. Mann-Whitney U Test
   Answer: D

8. We received more data from other research teams, making the total sample size 150. Next, we would like to update our growth comparison results for different plant species. Which Python code should we use?
   A. mannwhitneyu(data1, data2)
   B. chisquare(data1, data2)
   C. ttest_ind(data1, data2)
   D. wilcoxon(data1, data2)
   Answer: C

9. Based on the total of 150 samples, we decided to investigate the relationship between Growth and Length of Daylight. What would be our dependent variable?
   A. Length of Daylight
   B. Growth
   C. Plant Species
   D. Flowered or not
   Answer: B

10. To explore the relationship mentioned in Question 9, which statistical analysis should be used?
   A. Linear Regression
   B. Logistic Regression
   C. ANOVA
   D. Chi-square Test
   Answer: A

11. Which Python code should be used to conduct the analysis used in Question 10?
   A. ttest_rel(data1, data2)
   B. f_oneway(data1, data2, data3)
   C. OLS(y, X).fit()
   D. Logit(y, X)
   Answer: C

12. To explore the relationship between Flowered or not and Length of Daylight, which Python code should be used?
   A. ttest_rel(data1, data2)
   B. f_oneway(data1, data2, data3)
   C. OLS(y, X).fit()
   D. Logit(y, X)
   Answer: D

REFERENCES

Anaconda Inc. (2020). Anaconda Distribution Starter Guide. https://docs.anaconda.com/_downloads/9ee215ff15fde24bf01791d719084950/Anaconda-Starter-Guide.pdf.
De Winter, J. F. C., & Dodou, D. (2010). Five-point likert items: t test versus Mann-Whitney-Wilcoxon (Addendum added October 2012). Practical Assessment, Research, and Evaluation, 15(1), 11.
Diabetes UK. (2019). Number of People with Diabetes Reaches 4.7 Million. https://www.diabetes.org.uk/about_us/news/new-stats-people-living-with-diabetes.
Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon–Mann–Whitney test under scrutiny. Statistics in Medicine, 28(10), 1487–1497.
Kim, H.-Y. (2017). Statistical notes for clinical researchers: Chi-squared test and Fisher’s exact test. Restorative Dentistry & Endodontics, 42(2), 152–155.
Koehrsen, W. (2018). Histograms and Density Plots in Python. Towardsdatascience.com. https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0.
McDonald, J. H. (2014). Correlation and linear regression. In Handbook of Biological Statistics (3rd ed.). Baltimore, MD: Sparky House Publishing. https://www.biostathandbook.com/HandbookBioStatThird.pdf.
McKinney, W., & Team, P. D. (2020). Pandas-Powerful python data analysis toolkit. Pandas—Powerful Python Data Analysis Toolkit, 1625. https://pandas.pydata.org/docs/pandas.pdf.
McIntire, G., Martin, B., & Washington, L. (2019). Python Pandas Tutorial: A Complete Introduction for Beginners. Learn Data Science-Tutorials, Books, Courses, and More. https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/.
Minitab. (2015). Choosing between a nonparametric test and a parametric test. State College: The Minitab Blog. https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test.
Pandas Development Team. (2020). pandas.read_excel. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html.
Scikit-posthocs. (2020). The Scikit Posthocs Test. https://scikit-posthocs.readthedocs.io/en/latest/.
SciPy Community. (2020). scipy.stats.ttest_ind. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
Sharma, A. (2019). Importing Data into Pandas. https://www.datacamp.com/community/tutorials/importing-data-into-pandas#:~:targetText=To read an HTML file, to read the HTML document.
Tavares, E. (2017). Counting and Basic Frequency Plots. https://etav.github.io/python/count_basic_freq_plot.html.
WorldCoinIndex. (2021). WorldCoinIndex. https://www.worldcoinindex.com/.
10 Machine Learning with Python

Muath Alrammal
Higher Colleges of Technology
University Paris-Est (UPEC)

Dimitrios Xanthidis and Munir Naveed
University College London
Higher Colleges of Technology

CONTENTS
10.1 Introduction
10.2 Types of Machine Learning Algorithms
10.3 Supervised Learning Algorithms: Linear Regression
10.4 Supervised Learning Algorithms: Logistic Regression
10.5 Supervised Learning Algorithms: Classification and Regression Tree (CART)
10.6 Supervised Learning Algorithms: Naïve Bayes Classifier
10.7 Unsupervised Learning Algorithms: K-means Clustering
10.8 Unsupervised Learning Algorithms: Apriori
10.9 Other Learning Algorithms
10.10 Wrap Up - Machine Learning Applications
10.11 Case Studies
10.12 Exercises
References

10.1 INTRODUCTION

At the present time, machine learning (ML) plays an essential role in many human activities.
It is applied in Observation 10.1 – Machine different areas including online shopping, medicine, Learning: A subfield of computer scivideo surveillance, email spam and malware detection, ence and Artificial Intelligence that online customer support, and search engine result refine- focuses on developing algorithms that ment. It is a subfield of computer science and a subset can learn from data and make predicof Artificial Intelligence (AI). The main focus of ML is tions based on their learning. on developing algorithms that can learn from data and make predictions based on this learning. An ML program is one that learns from experience E Observation 10.2 – Machine given some tasks (T) and performance measure (P), if it Learning Process: A Machine improves from that experience (E) (Mitchell, 1997). ML Learning program learns from experibehaves similarly to the growth of a child. As a child ence (E) given some tasks (T) and pergrows, its experience E in performing task T increases, formance measure (P), if it improves from that experience (E). which results in a higher performance measure (P). In ML, a computer is trained using a given dataset in order to predict the properties of new data. For instance, one can train a system by feeding it with 10,000 images of dogs and 10,000 more images not containing dogs, indicating in each case DOI: 10.1201/9781003139010-10 409 410 Handbook of Computer Programming with Python whether a picture is a dog or not. following this training, when the system is fed with a new image it should be able to predict whether it is the image of a dog or not. Python has an arsenal of libraries that support the implementation of ML algorithms. Some of these libraries are already discussed and used in previous chapters (e.g., Pandas, Matplotlib). Other libraries especially useful for ML applications are the following: • NumPy: It is an array-­processing library. 
It provides complex mathematical functions for processing multi-dimensional arrays and matrices. It is a powerful tool for handling random numbers, Fourier transforms, and linear algebra.
• SciPy: It is an open-source Python library used for scientific computing. It contains modules for optimization, image processing, signal processing, Fast Fourier transforms, linear algebra, and ordinary differential equations (ODE). It is built on top of NumPy, as its underlying data structure is a multi-dimensional array.
• Scikit-Learn: Built in 2010 on top of the NumPy and SciPy libraries, it contains several supervised and unsupervised ML algorithms. The library is also useful in data mining and data analysis. It handles clustering, regression, classification, model selection, and preprocessing.
• TensorFlow: This library was developed by Google and released in 2015. It manipulates multi-dimensional arrays known as tensors and interoperates with NumPy arrays.

There is an abundance of implemented ML algorithms, applying to various domains. This chapter provides an introduction to some of the most important algorithms, as well as some of the most popular domain applications. The chapter concludes with a relevant case study that explores some of the main aspects of ML.

10.2 TYPES OF MACHINE LEARNING ALGORITHMS

There are three main types of ML algorithms: supervised, unsupervised, and reinforcement. A simple way to understand the difference between supervised and unsupervised ML is by introducing the concept of using some type of help to teach a computer how to map particular inputs into the relevant outputs.

In the case of supervised learning the supervisor uses what is referred to as labeled data to direct the computer into understanding how to map the input into output. As an example, assume the case of training a computer to distinguish between the images of a laptop and a desktop PC. The computer is provided with a set of images and a label or flag for each one specifying it is a laptop. The same process is repeated for the case of the desktop PC images. Although this is a simplified example, it provides a straightforward description of supervised learning.

Observation 10.3 – Supervised Learning: Use labeled data to train a computer how to map particular input into output. If the output is in a categorical form the type is classification. If the output is in continuous numerical form the type is regression. Combining multiple supervised learning models is referred to as ensembling.

In terms of the outputs associated with supervised learning, there are two broad types: classification and regression. Classification is related to categories, such as “sick” or “healthy” individuals, “dog” or “cat” pets, “laptop” or “desktop” PCs. Regression is related to outputs in the form of continuous numerical values, such as predicting an individual’s height or weight, or the amount of rainfall. An additional type of supervised learning is ensembling, which involves combining the predictions of multiple ML models that may be too weak to stand on their own, in order to produce a more accurate prediction for a new sample. In general, a broad statement about supervised learning is that it uses labeled data to train a computer to map inputs (X) into outputs (Y) by solving equation Y = f(X) for f.

In the case of unsupervised learning there is no supervisor to train the computer in terms of mapping inputs into outputs, and no labeled training input data to model possible corresponding output variables. Essentially, the computer is left to predict the possible outputs on its own, given a set of previous inputs. There are three main types of unsupervised learning: association, clustering, and dimensionality reduction.

Observation 10.4 – Unsupervised Learning: There is no supervisor to train the computer to map input into output and there is no labeled data for such training. The computer is trained by itself through a trial-and-error process. Association is used to determine the probability of the co-occurrence of items in the collection. Clustering is used to group samples within the same cluster. Dimensionality reduction is used to reduce the number of variables of the dataset.

Association is used to discover the probability of the co-occurrence of items in a collection. It is used extensively in market-basket analysis. For example, an association model might find that a purchase of bread has an 80% probability of being accompanied by a purchase of eggs. Clustering is used to group samples in a way that ensures that objects within the same cluster share more similarities with each other than with objects from other clusters. Dimensionality reduction is used to reduce the number of variables of a dataset, while ensuring that important information is still conveyed. Dimensionality reduction can be achieved by using feature extraction and feature selection functions. Feature selection essentially refers to the selection of a subset of the original variables. Feature extraction performs data transformations from a high-dimensional space to a low-dimensional space (e.g., PCA algorithm).

Finally, reinforcement learning is a type of ML that allows an agent to decide the best action based on its current state, by learning behaviors that will maximize the associated rewards. It usually learns optimal actions through trial and error. For example, one can think of a video game in which the player needs to move to certain places at certain times in order to earn points. If a reinforcement algorithm attempts to play this game instead of a human player, it would start by moving randomly, but eventually would learn where and when it needs to move in order to maximize points accumulation through the use of an appropriate trial and error process.
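As a minimal sketch of the association idea described in this section, the following script estimates how often eggs are bought together with bread from a handful of shopping baskets. The transactions list and the confidence() helper are hypothetical names introduced purely for illustration; a real market-basket analysis would normally rely on a dedicated algorithm and a far larger dataset.

```python
# Hypothetical shopping baskets, invented for this illustration only
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"eggs", "milk"},
    {"bread", "eggs", "butter"},
    {"bread", "eggs", "jam"},
]

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a co-occurrence ratio."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)

bread_to_eggs = confidence(transactions, "bread", "eggs")
print("P(eggs | bread) = {:.2f}".format(bread_to_eggs))
```

With these made-up baskets, four of the five baskets that contain bread also contain eggs, which corresponds to the 80% figure used in the bread and eggs example.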
10.3 SUPERVISED LEARNING ALGORITHMS: LINEAR REGRESSION

The basic idea behind linear regression is the quantification of the relationship between a set of inputs and their corresponding outputs. This takes the form of a line (y = a + b·x), where b is the slope of the regression line (the coefficient of the line) and a is the y-axis intercept. The goal is to find the line from which the data points deviate as little as possible. This deviation is measured as the sum of the squares of all the distances of the data points from the line, and the regression line is the one that minimizes this sum. Another important parameter in linear regression is that of R², which measures the proportion of the variation in the output y that is explained by the input x. As in other statistical tests, the analysis also results in a p value (statistical significance) that determines whether there is a statistically significant correlation between the input and output datasets.

Observation 10.5 – Linear Regression: Trains a system to predict the output of a particular input by quantifying the relationship y = a + b·x between a set of inputs and their corresponding outputs, where b is the slope of the line and a is the y-axis intercept. Use R² to measure the effect of the input on the possible output and p to measure the statistical significance of the test.

In Python, linear regression can be implemented using the linregress(X, y) function of the stats module of the SciPy library. The function uses an input and an output dataset (i.e., X and y, respectively). The function output consists of five values: the slope of the linear regression, the intercept, the r value (correlation coefficient), the p value, and the standard error of the estimated slope. Based on this, the overall process can be summarized in five distinct steps:

• Step 1: Import/read the data for the linear regression.
• Step 2: Define the two datasets (X and y) used to create the model.
• Step 3: Use linregress() to calculate the slope, the intercept, the r, and the p values of the linear regression.
• Step 4 (Optional): Use the slope and the intercept to visualize the model.
• Step 5 (Optional): Test the model with new data.

There are numerous real-life applications of linear regression ML algorithms. A notable example is their use in medicine and pharmaceutical research, when trying to determine the optimal dosage of a particular drug for a particular illness. Other examples include the use of such algorithms in sales and marketing, when trying to find the correct volume of promotional material (and the associated costs) for a particular product in order to maximize revenue, and the association of a student’s coursework grades with their final grade in an educational context.

The following Python script quantifies the relationship between the values of two columns of the grades2.csv dataset (Midterm Exam and Final Grade). Next, once the slope and the intercept values are calculated and the regression model is prepared for further use, the data points are visualized (plotted) alongside the regression line, and the model is tested with a new value entered by the user:

    import pandas as pd
    import matplotlib.pyplot as plt
    # Request to plot inline with the rest of the results
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline
    from scipy import stats

    # The function uses the calculated slope and intercept
    # to predict the Final Grade, given the Midterm Exam grade input
    def predictFinalGrade(X):
        return slope * X + intercept

    # Read the dataset
    dataset = pd.read_csv("grades2.csv")
    dataset2 = dataset[["Final Grade", "Midterm Exam"]]
    print("The input dataset is as follows:")
    print(dataset2)

    # Define the input and output datasets
    X = dataset2["Midterm Exam"]; y = dataset2["Final Grade"]

    # Use the linregress function from the stats module
    # to calculate slope, intercept, r, p, and std_err
    slope, intercept, r, p, std_err = stats.linregress(X, y)
    print("The slope and intercept values are: {:.2f}, {:.2f}".format(slope, intercept))
    print("The value of R-square is: {:.2f}".format(r**2))
    print("The value of statistical significance, p is: {:.2f}".format(p))
    mymodel = list(map(predictFinalGrade, X))

    # Plot the model of the resulting linear regression
    plt.scatter(X, y); plt.plot(X, mymodel); plt.show()

    # Test the model with a new Midterm Exam grade
    grades = int(input("Enter the new Midterm Exam grade:"))
    grades = predictFinalGrade(grades)
    print("The predicted Final Grade is: {:.2f}".format(grades))

Output 10.3:

    The input dataset is as follows:
        Final Grade  Midterm Exam
    0         67.47            70
    1         75.13            82
    2         66.85            40
    3         54.45            44
    4         76.95            82
    5         45.13            50
    6         73.23            62
    7         81.87            84
    8         62.63            64
    9         58.75            52
    10        49.75            62
    11        44.25            42
    12        62.52            68
    13        47.33            52
    14        68.97            70
    The slope and intercept values are: 0.62, 23.96
    The value of R-square is: 0.57
    The value of statistical significance, p is: 0.00
    Enter the new Midterm Exam grade:88
    The predicted Final Grade is: 78.80

In terms of the information provided here, the dataset is printed first with the input values used to train the system to quantify the regression model. The stats.linregress() function of the SciPy library is used to calculate the slope and the intercept values, as well as the r value (which the script squares to obtain R²), the statistical significance value (p) and the standard error (std_err). Next, the user is prompted to enter a new Midterm Exam grade, and the system predicts the Final Grade using the related function predictFinalGrade().

The reader should also note that the output includes the R² value, which indicates that about 57% of the variation in the Final Grade is explained by the Midterm Exam grade.
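The slope, intercept, and R² values reported in Output 10.3 can be cross-checked without SciPy. The sketch below applies the standard closed-form least-squares formulas to the fifteen grade pairs listed in Output 10.3. The variable names are illustrative only, and the formulas are the usual textbook ones, offered as a check on the results rather than as a description of how linregress() is implemented internally.

```python
# Midterm Exam (x) and Final Grade (y) pairs copied from Output 10.3
x = [70, 82, 40, 44, 82, 50, 62, 84, 64, 52, 62, 42, 68, 52, 70]
y = [67.47, 75.13, 66.85, 54.45, 76.95, 45.13, 73.23, 81.87,
     62.63, 58.75, 49.75, 44.25, 62.52, 47.33, 68.97]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = s_xy / s_xx                    # b in y = a + b*x
intercept = mean_y - slope * mean_x    # a in y = a + b*x
r_squared = s_xy ** 2 / (s_xx * s_yy)  # share of variance explained
prediction = slope * 88 + intercept    # same test input as Output 10.3

print("slope = {:.2f}, intercept = {:.2f}".format(slope, intercept))
print("R-square = {:.2f}".format(r_squared))
print("Prediction for 88: {:.2f}".format(prediction))
```

The printed values agree with those in Output 10.3 to two decimal places, which also makes explicit that R² is a ratio of explained to total variation rather than a probability.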
Another noteworthy output is that of the p value (i.e., statistical significance), which in this particular case is less than 0.05, suggesting that there is a statistically significant correlation between the Midterm Exam and the Final Grade. Another value calculated during linear regression, although not displayed in the output results, is std_err. This value is the standard error of the estimated slope, i.e., a measure of how much the slope would be expected to vary from one sample of data to another. It should not be confused with the residuals, which are the distances of the individual output values from the regression line. The script makes use of the format() method to limit the number of decimal places of the results to 2. Finally, the reader should note the inclusion of directive %matplotlib inline, dictating that the regression model must be plotted inline with the rest of the data.

10.4 SUPERVISED LEARNING ALGORITHMS: LOGISTIC REGRESSION

As shown, linear regression predictions take the form of continuous values. In the case of logistic regression, predictions take the form of discrete values (i.e., binary), such as whether a student will pass or fail a course, or whether it will rain or not. Its name comes from the associated logistic function: $y = 1/(1 + e^{-x})$. The plot of this function is an S-shaped curve. In contrast to linear regression where the output is a value directly based on the input, in logistic regression it is a probability ranging from 0 to 1. For example, if a value 1 represents a passing grade, an output of 0.85 means that a student is very likely to pass the course, with a probability of 85%.

Observation 10.6 – Logistic Regression: Train a system to predict the probability of an output as one of two possible values based on a given input. The function used for this purpose is the following: $y = 1/(1 + e^{-x})$.

There are eight possible steps to follow when performing logistic regression, of which two are optional:

• Step 1: Import/read the data for the logistic regression.
• Step 2: Split the input datasets into train and test sets.
• Step 3: Perform feature scaling for the data (standardization to zero mean and unit variance).
• Step 4: Build the logistic classifier (with a preferred random_state = 0 for consistent results) and fit the classifier to the training set.
• Step 5: Predict the results based on the classifier.
• Step 6: Find the accuracy of the model as a percentage.
• Step 7 (Optional): Visualize the results of the training set.
• Step 8 (Optional): Visualize the results of the test set.

The following Python script uses Midterm Exam and Project grades to create a logistic regression model and visualize its results:

    # Import train_test_split to split the input into train and test sets
    from sklearn.model_selection import train_test_split
    # Import StandardScaler to scale the data
    from sklearn.preprocessing import StandardScaler
    # Import LogisticRegression to create the classifier object
    from sklearn.linear_model import LogisticRegression
    # Import accuracy_score to calculate the accuracy of the model
    from sklearn.metrics import accuracy_score
    # Import numpy to prepare the plot parameters
    import numpy as np
    # Import pyplot to create the plot
    import matplotlib.pyplot as plt
    # Import ListedColormap to color the data points in the plot
    from matplotlib.colors import ListedColormap
    # Define that results are to be plotted inline
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline

    # Step 1: Define the input dataset. X must be a 2D list with
    # as many rows as observations
    X = [[60, 55], [54, 90], [70, 80], [76, 70], [64, 87], [66, 70],
         [54, 87], [92, 70], [58, 78], [70, 71], [70, 70], [90, 76],
         [86, 92], [72, 70], [70, 72], [82, 87], [40, 80], [44, 90],
         [82, 92], [50, 68]]
    y = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # Step 2: Split X and y into a train set and a test set
    # Test size is 25% of the dataset, train size is 75%
    # The new train and test lists will be in random order
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size = 0.25, random_state = 0)
    print("Trained X set:", X_train); print("Test X set:", X_test)
    print("Trained y set:", y_train); print("Test y set:", y_test)

    # Step 3: Perform feature scaling (zero mean, unit variance)
    # The scaler is fitted on the train set and reused on the test set
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    print("\nThe 2D set of trained X input:\n", X_train)
    X_test = sc_X.transform(X_test)
    print("\nThe 2D set of test X input:\n", X_test)

    # Step 4: Build the logistic classifier
    # Set random_state to 0 for consistent results
    # Fit the classifier to the train set
    model = LogisticRegression(solver = 'liblinear',
        random_state = 0).fit(X_train, y_train)
    print("\n", model)

    # Step 5: Predict the test results
    y_pred = model.predict(X_test)
    print("\nResults predicted by the model:", y_pred)
    print("Results from the test:", y_test)
    # Probability of a pass (class 1) for each scaled test observation
    y_prob = model.predict_proba(X_test)[:, 1]

    # Step 6: Calculate the accuracy of the model
    # Use y_test (actual output) and y_pred (predicted output)
    accuracy = accuracy_score(y_test, y_pred)
    print("The accuracy of the model given the test data is: ",
        accuracy * 100, "%")

    # Step 7: Visualize the training set results
    X_set, y_set = X_train, y_train
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
        np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
        X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
        cmap = ListedColormap(('red', 'blue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
    plt.title('Logistic Regression: Training set')
    plt.xlabel("Midterm Exam")
    plt.ylabel("Project")
    plt.show()

    # Step 8: Visualize the test results
    X_set, y_set = X_test, y_test
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
        np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
        X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
        cmap = ListedColormap(('red', 'blue')))
    plt.xlim(X1.min(), X1.max()); plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
    plt.title('Logistic Regression: Test set')
    plt.xlabel("Midterm Exam"); plt.ylabel("Project")
    plt.show()

Output 10.4:

    Trained X set: [[44, 90], [54, 87], [72, 70], [64, 87], [70, 80], [66, 70],
    [70, 72], [70, 71], [92, 70], [40, 80], [90, 76], [76, 70], [60, 55],
    [82, 87], [86, 92]]
    Test X set: [[82, 92], [54, 90], [50, 68], [58, 78], [70, 70]]
    Trained y set: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
    Test y set: [1, 0, 0, 0, 1]

    The 2D set of trained X input:
     [[-1.69129319  1.30698109]
     [-1.01657516  1.00224457]
     [ 0.19791729 -0.72459574]
     [-0.34185713  1.00224457]
     [ 0.06297368  0.29119268]
     [-0.20691353 -0.72459574]
     [ 0.06297368 -0.52143805]
     [ 0.06297368 -0.62301689]
     [ 1.54735334 -0.72459574]
     [-1.9611804   0.29119268]
     [ 1.41240974 -0.11512269]
     [ 0.4678045  -0.72459574]
     [-0.61174434 -2.24827836]
     [ 0.87263531  1.00224457]
     [ 1.14252253  1.51013878]]

    The 2D set of test X input:
     [[ 0.87263531  1.51013878]
     [-1.01657516  1.30698109]
     [-1.28646237 -0.92775342]
     [-0.74668795  0.088035  ]
     [ 0.06297368 -0.72459574]]

     LogisticRegression(random_state=0, solver='liblinear')

    Results predicted by the model: [1 1 0 0 0]
    Results from the test: [1, 0, 0, 0, 1]
    The accuracy of the model given the test data is:  60.0 %

The above script and its output demonstrate the eight steps followed when using logistic regression. In Step 1 (data read), it is important to remember that input dataset X must be a two-dimensional array/list with one row per observation. In this particular case, the set includes the grades of each student for Midterm Exam and Project. The y dataset includes values 0 or 1 for each student, with 0 referring to a fail and 1 to a pass. In Step 2, the script makes use of the train_test_split() function from the sklearn.model_selection module. The function takes the X and y datasets, splits them to train and test subsets at a rate of 75/25 (test_size = 0.25), and randomizes the splitting process. The results of the function are datasets X_train, X_test, y_train, and y_test. In Step 3, the script imports the StandardScaler class from the sklearn.preprocessing module and uses the StandardScaler() constructor and the fit_transform() function to standardize the input data X, so that each feature has a mean of 0 and a standard deviation of 1. As the negative values in Output 10.4 show, the scaled data are not confined to the range between 0 and 1. The test set is then scaled with transform(), which reuses the mean and standard deviation calculated from the train set. In Step 4, the actual logistic regression classifier is created and fitted to the training data (X_train and y_train). Next, the script uses the model to predict (.predict()) the results for the test data X_test (Step 5).
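To make the effect of the scaling in Step 3 concrete, the following sketch standardizes the Midterm Exam column of the train set listed in Output 10.4, using only the standard library. It assumes the usual definition of standardization (subtract the mean, then divide by the population standard deviation), which is what StandardScaler applies with its default settings; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Midterm Exam grades of the 15 train observations (Output 10.4)
midterm_train = [44, 54, 72, 64, 70, 66, 70, 70, 92, 40, 90, 76, 60, 82, 86]

mu = mean(midterm_train)        # mean of the train column
sigma = pstdev(midterm_train)   # population standard deviation

# z = (value - mean) / standard deviation
scaled = [(g - mu) / sigma for g in midterm_train]

print("First scaled value: {:.4f}".format(scaled[0]))
print("Smallest scaled value: {:.4f}".format(min(scaled)))
print("Std after scaling: {:.4f}".format(pstdev(scaled)))
```

The first scaled value matches the first entry of the scaled train set in Output 10.4, and the presence of negative values confirms that standardization centers the data around zero instead of squeezing it into a fixed interval.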
In Step 6, the script uses function accuracy_score() from the sklearn.metrics module to calculate the accuracy rate of the resulting regression model as a number between 0 and 1, which is then multiplied by 100 to express it as a percentage. Finally, Steps 7 and 8 are used to visualize the training and test set results, respectively. In both cases, function meshgrid() is used to prepare the data for plotting and ListedColormap() to color the pass and fail outputs.

There are numerous different options and variations available for each of these steps, as well as for displaying and plotting the resulting data. The reader can refer to the multitude of statistics and/or machine learning textbooks and resources in order to delve deeper into the various concepts related to the interpretation and use of the results of logistic regression in various contexts.

10.5 SUPERVISED LEARNING ALGORITHMS: CLASSIFICATION AND REGRESSION TREE (CART)

A decision tree consists of a root, nodes, and leaves (Figure 10.1). The starting point of the decision tree is the root; each internal node branches out to connect to other inputs, also in the form of nodes. The branching is determined by using a split function, which divides the input data into one or more branches. Each leaf node is a possible output (outcome) of the tree.

FIGURE 10.1 Decision tree.

In order to create the order (or height) of the decision tree and its features, the decision tree algorithm uses a function to determine the information gain. There are two functions serving this purpose, referred to as indices: entropy and the Gini index. Their function is to measure the impurity of a node in the tree and, based on their value, the node is kept or discarded. These values also determine the position of a node in the tree. There are different types of decision tree, depending on how the indices are calculated and what choices are made in terms of splitting continuous values.
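The two impurity indices mentioned above can be sketched in a few lines. The functions below follow the standard textbook definitions (Gini = 1 − Σ pᵢ² and entropy = −Σ pᵢ log₂ pᵢ, where pᵢ is the share of class i among the samples of a node). The function names and the sample labels are illustrative and are not taken from any particular library implementation.

```python
from collections import Counter
from math import log2

def class_shares(labels):
    """Return the proportion of each class among the labels of a node."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]

def gini(labels):
    """Gini index: 0 for a pure node, 0.5 for an even two-class mix."""
    return 1.0 - sum(p ** 2 for p in class_shares(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for an even two-class mix."""
    return sum(-p * log2(p) for p in class_shares(labels))

mixed = ["yes"] * 20 + ["no"] * 20   # evenly mixed node
pure = ["yes"] * 10                  # node holding a single class

print("Mixed node: gini = {:.2f}, entropy = {:.2f}".format(
    gini(mixed), entropy(mixed)))
print("Pure node: gini = {:.2f}, entropy = {:.2f}".format(
    gini(pure), entropy(pure)))
```

A split is attractive when the child nodes it produces are purer (lower index values) than the parent node; the drop in impurity is what the information gain quantifies.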
The most commonly used types of decision tree are ID3 (Quinlan, 1986), C4.5 (Salzberg, 1994) and CART (Mola, 1998).

CART (Classification and Regression Tree) is one of the most important and popular types of supervised learning algorithms. The output can be in the form of a categorical value (e.g., it will rain or not) or a continuous value (e.g., the final price of a car). A visual representation of a decision tree is shown in Figure 10.2. The tree starts with the Age feature, which is a numeric attribute in a bank dataset. The values of Age are split into three branches: 18–23, 24–34 and >35. The algorithm can split the continuous number values of the Age feature using a technique that also determines the order of features within the tree. Next, the Age feature (the root of the tree) is associated with three additional features (nodes): Job, Marital Status, and Housing.

Observation 10.7 – CART: The Classification and Regression Tree (CART) is a decision tree with a root, nodes and leaves and with outputs either in the form of a categorical or a continuous value. The branching is determined by using a split function that divides the input data into one or more branches.

Observation 10.8 – Input and Output Datasets: The Classification and Regression Tree (CART) requires a 2D list/array of values as its input and output datasets. If the input and output datasets do not match, appropriate amendments are required.

FIGURE 10.2 Example of decision tree.

The decision tree can be built using a training dataset. In the following example, the script makes use of a dataset of 40 bank account customer records, containing features age, job, marital status, and education. The system aims at predicting the possibility of customers making a deposit in the bank or not. In order to train the CART decision tree, these four features are used as input and the deposit feature as output. The possible outputs are Yes and No (depositing money or not). The script requires a number of associated libraries. Some of these libraries are already included in the system (e.g., Pandas and NumPy), while others like Pydotplus and Graphviz must be installed explicitly. Given that the installation of any libraries depends on the particular system in use, the reader is advised to check the available pip install statements for specific system settings:

Observation 10.9 – StringIO, Graphviz: Used to depict the decision tree in a visual form.

    1   # Import the basic libraries
    2   import pandas as pd
    3   import numpy as np
    4
    5   # Import the DecisionTreeClassifier
    6   from sklearn.tree import DecisionTreeClassifier
    7
    8   # Import the confusion_matrix, the accuracy_score, and the
    9   # classification report
    10  from sklearn.metrics import confusion_matrix
    11  from sklearn.metrics import accuracy_score
    12  from sklearn.metrics import classification_report
    13
    14  # Import train_test_split to split the data into train and test
    15  # samples
    16  from sklearn.model_selection import train_test_split
    17
    18  # Import the libraries for the necessary integer (label) encoding
    19  from sklearn.preprocessing import LabelEncoder
    20
    21  # Import the libraries to plot the graph
    22  from sklearn.tree import export_graphviz
    23  # Import StringIO (formerly available via sklearn.externals.six)
    24  from six import StringIO
    25  from IPython.display import Image
    26  import pydotplus
    27
    28  # Plot results inline
    29  # This is often particularly needed in Jupyter Anaconda
    30  %matplotlib inline

With the libraries imported, the next part of the script is the first step of this particular implementation. Initially, the list of values for input list X (2D array) is defined. Each sub-list includes the age, job, marital status, and education features of the bank customer. Next, output Y is defined as a unidimensional list, taking values of either Yes or No.
In line 82, input list X is converted to a NumPy array to facilitate a more efficient manipulation of the elements in the list. In the following line (83), the 2D array is divided into four unidimensional sub-arrays, each storing the respective elements. Since the list mixes numbers and text, NumPy stores every element as a string, which is why the ages appear in quotes in the output. Finally, the data of each newly created input sub-array (X1–X4) and of output Y are printed:

    31  #====================================================================
    32  # Step 1: Define and print the input and output datasets
    33  print("Step 1: Define and print the input and output datasets\n")
    34
    35  X = [[59, 'admin.', 'married', 'secondary'],
    36       [56, 'admin.', 'married', 'secondary'],
    37       [41, 'technician', 'married', 'secondary'],
    38       [55, 'services', 'married', 'secondary'],
    39       [54, 'admin.', 'married', 'tertiary'],
    40       [42, 'management', 'single', 'tertiary'],
    41       [56, 'management', 'married', 'tertiary'],
    42       [60, 'retired', 'divorced', 'secondary'],
    43       [37, 'technician', 'married', 'secondary'],
    44       [28, 'services', 'single', 'secondary'],
    45       [38, 'admin.', 'single', 'secondary'],
    46       [30, 'blue-collar', 'married', 'secondary'],
    47       [29, 'management', 'married', 'secondary'],
    48       [46, 'blue-collar', 'single', 'tertiary'],
    49       [31, 'technician', 'single', 'tertiary'],
    50       [35, 'management', 'divorced', 'tertiary'],
    51       [32, 'blue-collar', 'single', 'primary'],
    52       [49, 'services', 'married', 'secondary'],
    53       [41, 'admin.', 'married', 'secondary'],
    54       [49, 'admin.', 'divorced', 'secondary'],
    55       [49, 'retired', 'married', 'secondary'],
    56       [32, 'technician', 'married', 'secondary'],
    57       [30, 'self-employed', 'single', 'secondary'],
    58       [55, 'services', 'divorced', 'tertiary'],
    59       [32, 'blue-collar', 'married', 'secondary'],
    60       [52, 'admin.', 'divorced', 'secondary'],
    61       [38, 'unemployed', 'divorced', 'secondary'],
    62       [60, 'retired', 'married', 'secondary'],
    63       [60, 'retired', 'divorced', 'secondary'],
    64       [30, 'admin.', 'married', 'tertiary'],
    65       [44, 'unemployed', 'married', 'secondary'],
    66       [32, 'blue-collar', 'married', 'secondary'],
    67       [46, 'entrepreneur', 'married', 'tertiary'],
    68       [34, 'management', 'married', 'secondary'],
    69       [40, 'management', 'married', 'secondary'],
    70       [34, 'housemaid', 'married', 'primary'],
    71       [43, 'admin.', 'single', 'secondary'],
    72       [52, 'technician', 'married', 'secondary'],
    73       [35, 'blue-collar', 'married', 'secondary'],
    74       [34, 'blue-collar', 'single', 'secondary']]
    75  Y=['yes','yes','yes','yes','yes','yes','yes','yes','yes','yes',
    76     'yes','yes','yes','yes','yes','yes','yes','yes','yes','yes',
    77     'no','no','no','no','no','no','no','no','no','no',
    78     'no','no','no','no','no','no','no','no','no','no'
    79    ]
    80
    81  # Convert the list into a numpy array for better index control
    82  newX = np.array(X)
    83  newX1,newX2,newX3,newX4=newX[:,0],newX[:, 1],newX[:, 2],newX[:, 3]
    84
    85  print("\nThe input of ages (X1) is :\n", newX1)
    86  print("\nThe input of jobs (X2) is :\n", newX2)
    87  print("\nThe input of marital status (X3) is :\n", newX3)
    88  print("\nThe input of education (X4) is :\n", newX4)
    89
    90  print("\nThe output of deposits (Y) is :\n", Y)

Output 10.5: Step 1

    Step 1: Define and print the input and output datasets

    The input of ages (X1) is :
     ['59' '56' '41' '55' '54' '42' '56' '60' '37' '28' '38' '30' '29' '46'
     '31' '35' '32' '49' '41' '49' '49' '32' '30' '55' '32' '52' '38' '60'
     '60' '30' '44' '32' '46' '34' '40' '34' '43' '52' '35' '34']

    The input of jobs (X2) is :
     ['admin.' 'admin.' 'technician' 'services' 'admin.' 'management'
     'management' 'retired' 'technician' 'services' 'admin.' 'blue-collar'
     'management' 'blue-collar' 'technician' 'management' 'blue-collar'
     'services' 'admin.' 'admin.' 'retired' 'technician' 'self-employed'
     'services' 'blue-collar' 'admin.' 'unemployed' 'retired' 'retired'
     'admin.' 'unemployed' 'blue-collar' 'entrepreneur' 'management'
     'management' 'housemaid' 'admin.' 'technician' 'blue-collar'
     'blue-collar']

    The input of marital status (X3) is :
     ['married' 'married' 'married' 'married' 'married' 'single' 'married'
     'divorced' 'married' 'single' 'single' 'married' 'married' 'single'
     'single' 'divorced' 'single' 'married' 'married' 'divorced' 'married'
     'married' 'single' 'divorced' 'married' 'divorced' 'divorced' 'married'
     'divorced' 'married' 'married' 'married' 'married' 'married' 'married'
     'married' 'single' 'married' 'married' 'single']

    The input of education (X4) is :
     ['secondary' 'secondary' 'secondary' 'secondary' 'tertiary' 'tertiary'
     'tertiary' 'secondary' 'secondary' 'secondary' 'secondary' 'secondary'
     'secondary' 'tertiary' 'tertiary' 'tertiary' 'primary' 'secondary'
     'secondary' 'secondary' 'secondary' 'secondary' 'secondary' 'tertiary'
     'secondary' 'secondary' 'secondary' 'secondary' 'secondary' 'tertiary'
     'secondary' 'secondary' 'tertiary' 'secondary' 'secondary' 'primary'
     'secondary' 'secondary' 'secondary' 'secondary']

    The output of deposits (Y) is :
     ['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
     'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
     'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no',
     'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Observation 10.10 – Integer Encoding: The process of converting a categorical value into the numerical form necessary for the CART algorithm. Use the LabelEncoder() class from the sklearn.preprocessing module.

In Step 2, the code addresses an important classification issue. Since models are mathematical in nature, the underlying calculations are based on numerical rather than textual data. Hence, it is necessary to encode the various elements of the data into numerical (integer) values, a process referred to as integer encoding. Lines 97–102 include code for finding the unique elements in each of the categorical input sub-arrays X2–X4. Next, in lines 105–122, the
LabelEncoder() function (Sklearn.preprocessing library) is utilized to create the relevant objects, subsequently used by fit_transform() to produce the integer encoded sub-­arrays for X1–X4. The same process is also applied in the case of output dataset Y: 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 #==================================================================== # Step 2: Encode the categorical values of the input & output datasets # Find and print the unique values of the categories/columns for job # and marital status print("\n\nStep 2: The inputs of jobs, marital status,", "and education and the outputs are integer encoded") jobs = np.unique(newX2) print("\nThe various categories of jobs are:\n", jobs) maritalStatus = np.unique(newX3) print("\nThe various categories of marital status are:\n", maritalStatus) education = np.unique(newX4) print("\nThe various categories of education are:\n", education) # Integer Encode the categorical input and output values as fit() # does not accept strings label_encoderX2 = LabelEncoder() integer_encodedX2 = label_encoderX2.fit_transform(newX2) print("\nThe various categories of jobs are integer Encoded as", "follows:\n", integer_encodedX2) label_encoderX3 = LabelEncoder() integer_encodedX3 = label_encoderX3.fit_transform(newX3) print("\nThe various categories of marital status are ", "integer Encoded as follows:\n", integer_encodedX3) label_encoderX4 = LabelEncoder() integer_encodedX4 = label_encoderX4.fit_transform(newX4) print("\nThe various categories of education are integer Encoded as", "follows:\n", integer_encodedX4) label_encoderY = LabelEncoder() integer_encodedY = label_encoderY.fit_transform(Y) print("\nThe various categories of output are integer Encoded as", "follows:\n", integer_encodedY) 424 Handbook of Computer Programming with Python Output 10.5: Step 2 Step 2: The inputs of jobs, marital status, and education and the outputs are 
integer encoded The various categories of jobs are: ['admin.' 'blue-collar' 'entrepreneur' 'housemaid' 'management' 'retired' 'self-employed' 'services' 'technician' 'unemployed'] The various categories of marital status are: ['divorced' 'married' 'single'] The various categories of education are: ['primary' 'secondary' 'tertiary'] The various categories of jobs are integer Encoded as follows: [0 0 8 7 0 4 4 5 8 7 0 1 4 1 8 4 1 7 0 0 5 8 6 7 1 0 9 5 5 0 9 1 2 4 4 3 0 8 1 1] The various categories of marital status are integer Encoded as follows: [1 1 1 1 1 2 1 0 1 2 2 1 1 2 2 0 2 1 1 0 1 1 2 0 1 0 0 1 0 1 1 1 1 1 1 1 2 1 1 2] The various categories of education are integer Encoded as follows: [1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 0 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 0 1 1 1 1] The various categories of output are integer Encoded as follows: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] In Step 3, the code splits the datasets into train and test input and train and test output. Provided that the fit() function used in the next step needs a 2D numerical array to perform its calculations, it is necessary to combine the previously divided input sub-­arrays into a single 2D array. The zip() function takes the four input sub-­arrays and combines them in a single 2D array. However, since the result is still unusable for the relevant fitting calculations, the list() function is used to convert the 2D array to a suitable form (lines 127–128). Next, function train_test_split() (Sklearn.model_selection library) is used with the newly created 2D array, as well as the unidimensional output array, in order to split (75/25) and randomize the datasets. This is defined explicitly by the test_size = 0.25 and the random_ state = 0 arguments (lines 129–130). The test_size parameter is referring to the hold-­out validation that splits the dataset into the train and test parts, in this case 75% and 25%. 
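The idea behind the hold-out split can be sketched in plain Python. This is only an illustration of the concept (shuffle with a fixed seed, then cut at 75%), not the way train_test_split() is implemented; the function name hold_out_split and the stand-in columns below are hypothetical placeholders, not the bank dataset.

```python
import random

def hold_out_split(rows, labels, test_size=0.25, seed=0):
    """Shuffle the row indices with a fixed seed and hold out a test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = round(len(rows) * test_size)       # e.g. 40 rows -> 10 test rows
    n_train = len(rows) - n_test
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Hypothetical stand-in columns (placeholders for X1-X4 and Y)
ages = [str(20 + i) for i in range(40)]
jobs = [i % 10 for i in range(40)]
marital = [i % 3 for i in range(40)]
education = [(i * 7) % 3 for i in range(40)]

# zip() pairs the columns row by row; list() materializes the tuples
inputs = list(zip(ages, jobs, marital, education))
outputs = [1] * 20 + [0] * 20

X_train, X_test, y_train, y_test = hold_out_split(inputs, outputs)
print(len(X_train), len(X_test))   # 30 10
```

Fixing the seed plays the same role as the random_state argument: the same call always returns the same split, so results can be reproduced.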
The alternative to hold-out validation is the cross-validation technique, which selects data for training via sampling. In this approach, a block of data of fixed size is selected for training in each iteration. The technique could also be applied to smaller datasets, but the sample selection in each iteration of training can lead to heavy computation requirements and, therefore, more CPU cycles. The main types of cross-validation are leave-p-out and k-fold. In the case of k-fold, the most commonly used selection is the ten-fold (i.e., k = 10). An example of a cross-validation statement, using the cross_validate() function of the Sklearn.model_selection library and its cv parameter, is the following:

    crossValidation = cross_validate(decisionTree, X_Train, y_Train, cv = 10)

In the current context, this statement would be placed in the code just after the definition of the DecisionTreeClassifier(). The last part of this step prints the train and test inputs and the train and test outputs:

    123 #===================================================================
    124 # Step 3: Define the point to split the dataset to 3/4
    125 print("\nStep 3: Define the point to split the datasets to 3/4\n")
    126
    127 newEncodedInput = list(zip(newX1, integer_encodedX2,
    128     integer_encodedX3, integer_encodedX4))
    129 X_Train, X_Test, y_Train, y_Test = train_test_split(newEncodedInput,
    130     integer_encodedY, test_size = 0.25, random_state = 0)
    131
    132 print("\nTrained X set:", X_Train)
    133 print("\nTest X set:", X_Test)
    134 print("\nTrained y set:", y_Train)
    135 print("\nTest y set:", y_Test)

Output 10.5: Step 3

    Step 3: Define the point to split the datasets to 3/4
    Trained X set: [('60', 5, 1, 1), ('34', 3, 1, 0), ('52', 8, 1, 1),
    ('41', 8, 1, 1), ('34', 1, 2, 1), ('44', 9, 1, 1), ('40', 4, 1, 1),
    ('32', 1, 2, 0), ('43', 0, 2, 1), ('37', 8, 1, 1), ('46', 1, 2, 2),
    ('42', 4, 2, 2), ('49', 7, 1, 1), ('31', 8, 2, 2), ('34', 4, 1, 1),
    ('60', 5, 0, 1), ('46', 2, 1, 2), ('56', 0, 1, 1), ('38', 9, 0, 1),
    ('29', 4, 1, 1), ('32', 1, 1, 1), ('32', 1, 1, 1), ('56', 4, 1, 2),
    ('55', 7, 0, 2), ('32', 8, 1, 1), ('49', 0, 0, 1), ('28', 7, 2, 1),
    ('35', 1, 1, 1), ('55', 7, 1, 1), ('59', 0, 1, 1)]
    Test X set: [('30', 6, 2, 1), ('49', 5, 1, 1), ('52', 0, 0, 1),
    ('54', 0, 1, 2), ('38', 0, 2, 1), ('35', 4, 0, 2), ('60', 5, 0, 1),
    ('30', 1, 1, 1), ('41', 0, 1, 1), ('30', 0, 1, 2)]
    Trained y set: [0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 1 1]
    Test y set: [0 0 0 1 1 1 0 1 1 0]

In Step 4, the defined train and test inputs and outputs are used to train and test the model (i.e., predict the possible output). This is achieved through the DecisionTreeClassifier() function (Sklearn.tree library), which creates the decisionTree object model used for the output prediction (lines 144–146). The reader should note the arguments passed to the classifier: the criterion (i.e., the mathematical mechanism used, entropy or Gini index), random_state = 100, max_depth = 100, and min_samples_leaf (1 or 6 in the two versions included in the listing).

Observation 10.11 – DecisionTreeClassifier(): The class used to create the decision tree model.

In terms of the entropy mechanism, the mathematical equation used is:

$$E = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $p_i$ is the proportion of the values that belong to class $i$. The idea is to calculate the entropy of mixed values encountered in the columns of the train dataset. If the values are heavily mixed (i.e., the classes are close to equal in population), the entropy will be close to 1 (for two classes); otherwise it will be close to 0. Ideally, the preferred value is 0, which means that the dataset has largely homogeneous values. When visualizing the decision tree, the value of entropy suggests the impurity of the values in the related tree or sub-tree. The alternative to entropy is the Gini index mechanism, which is also used by the classifier to organize the decision tree. Its mathematical equation is:

$$\text{Gini Index} = 1 - \sum_{k} P(x = k)^2$$

Observation 10.12 – Entropy, Gini Index: The mathematical models used to define and organize the decision tree. They measure the level of impurity of the values in the dataset used for the tree.
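Both impurity measures can be computed by hand for a list of class labels. The following is a minimal sketch of the two equations above, not the classifier's internal code; the function names and the three sample label lists are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini_index(labels):
    """Gini = 1 - sum(P(x = k)^2) over the classes present in labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

evenly_mixed = ['yes'] * 20 + ['no'] * 20   # two classes, equal in population
skewed = ['yes'] * 30 + ['no'] * 10         # one class dominates
homogeneous = ['yes'] * 40                  # a single class only

print(round(entropy(evenly_mixed), 3), round(gini_index(evenly_mixed), 3))  # 1.0 0.5
print(round(entropy(skewed), 3), round(gini_index(skewed), 3))              # 0.811 0.375
```

For the homogeneous list both measures evaluate to zero, which is the ideal case of a leaf that contains a single class.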
The Gini index likewise measures the impurity among the various partitions of the dataset. In the case of this example, both mechanisms are included in the listing; as printed, the Gini index version is active and the entropy version is deactivated as a comment. Switching the activation of one over the other would showcase that the results are quite similar. For further information on either entropy or the Gini index, the reader is advised to study textbooks specifically focused on ML.

There are two more parameters specified in DecisionTreeClassifier() that affect the visualization of the tree: max_depth and min_samples_leaf. The former determines the maximum depth of the tree. If omitted, the tree will have no maximum depth but will grow as deep as necessary according to the calculation and the dataset. The latter determines the minimum number of samples required at a leaf of the tree. If its value is 1, every single sample may be displayed in the tree, making the visual tree grow in size to its fullest. Increasing the value of min_samples_leaf will result in a reduction of the size of the visual depiction of the tree by combining a number of samples in each leaf. As mentioned, the present sample code includes two alternative versions of DecisionTreeClassifier() (lines 141–146): one using entropy and one the Gini index. The former uses a min_samples_leaf value of 1, while the latter a value of 6. Notice the difference in the size of the visual depiction of the decision tree in each case, and also how the algorithm makes decisions based on the columns of the dataset that have the greatest influence on the resulting visual depiction of the decision tree:

Observation 10.13 – Parameter max_depth: Used to define the depth of the decision tree (unlimited if omitted).

Observation 10.14 – Parameter min_samples_leaf: Used to define the minimum number of samples that a leaf may have in order to be displayed in the visualization of the decision tree.

    136 #====================================================================
    137 # Step 4: Create the classifier & train & test the input & output
    138 # Create the classifier object using 4 attributes: criterion can be
    139 # entropy or gini, splitter can be best or random,
    140 print("\nStep 4: Define the point to split the datasets to 3/4")
    141 #decisionTree = DecisionTreeClassifier(criterion = "entropy",
    142 #    splitter = "best", random_state = 100, max_depth = 100,
    143 #    min_samples_leaf = 1)
    144 decisionTree = DecisionTreeClassifier (criterion = "gini",
    145     splitter = "best", random_state = 100, max_depth = 100,
    146     min_samples_leaf = 6)
    147
    148 # The classifier trains the input (X_Train) & the output (y_Train)
    149 arrayX_Train = np.array(X_Train)
    150 arrayY_Train = np.array(y_Train)
    151 print("\nThe input dataset to train is:\n", arrayX_Train)
    152 print("\nThe output dataset to train is:\n", arrayY_Train)
    153 decisionTree.fit(arrayX_Train, arrayY_Train)
    154
    155 arrayY_Test1 = np.array(y_Test)
    156 arrayY_Test = list(zip(arrayY_Test1, arrayY_Test1, arrayY_Test1,
    157     arrayY_Test1))
    158 print("\nThe output dataset to test is:\n", arrayY_Test)
    159 y_Predict = decisionTree.predict(arrayY_Test)
    160 print("\nThe predicted output is:\n", y_Predict)

Output 10.5: Step 4

    Step 4: Define the point to split the datasets to 3/4
    The input dataset to train is:
    [['60' '5' '1' '1']
     ['34' '3' '1' '0']
     ['52' '8' '1' '1']
     ['41' '8' '1' '1']
     ['34' '1' '2' '1']
     ['44' '9' '1' '1']
     ['40' '4' '1' '1']
     ['32' '1' '2' '0']
     ['43' '0' '2' '1']
     ['37' '8' '1' '1']
     ['46' '1' '2' '2']
     ...
    The output dataset to train is:
    [0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 1 1]
    The output dataset to test is:
    [(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1),
    (1, 1, 1, 1), (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0)]
    The predicted output is:
    [0 0 0 0 0 0 0 0 0 0]

In Step 5, the code inverts the output to the original column values, calculates the confusion matrix and the accuracy score, and provides the classification report. For the inversion of the output, the label encoders are used in the same way as in the case of the integer encoded arrays used in the model. Next, the confusion matrix is printed followed by the accuracy score (50%). The reader should note that, in an ideal scenario, the value of the latter approaches the 100% mark. It should also be noted that, as printed, Step 4 calls predict() on an array built by repeating y_Test four times rather than on the test inputs X_Test, so the figures shown here do not measure the performance of the model on the held-out inputs; the conventional call would be decisionTree.predict(np.array(X_Test)). Finally, the classification report is displayed with all the relevant details. These tasks are coded in lines 164–174. The output shows the results of Step 5.

From one training dataset, the CART algorithm can build several decision trees. The performance criteria determine which tree is preferable for the task at hand. Different metrics or performance measurement parameters are used, the most common being accuracy, confusion matrix, precision, recall, and F-score.

Accuracy represents the overall accuracy of a tree. It is calculated as the number of correctly classified observations divided by the total number of observations, and is represented as a percentage. For example, if there are 100 observations tested and 70 of them are correctly classified, the accuracy of that tree will be 70%. A higher accuracy suggests a better performance for the decision tree.

The confusion matrix represents the overall behavior of the tree, based on the test or train datasets. It provides more insight in terms of the performance of the tree on each class label. Therefore, the size of the confusion matrix depends on the class labels, as it is always n × n, where n denotes the number of the class labels. For instance, if there are three class labels in a dataset, the confusion matrix will be 3 × 3. In the case of the bank dataset, the confusion matrix will be 2 × 2, as it has only two class labels (Yes/No). The matrix will also provide a breakdown of the numbers of labels being wrongly categorized by the tree. Such information is not provided by the accuracy scores.

Precision is the measurement of the relevance-based accuracy for each label (i.e., the ratio of the number of observations correctly predicted as that label over the total number of observations predicted as that label). For example, assume a tree that has classified 60 customers out of 100 as Yes. However, only 40 out of the 60 classifications are correct. Thus, the precision will be 40/60 or 0.667.

Recall is the measure of relevance with respect to the overall classification performance for the class labels. For example, assume a tree that predicts 60 responses of Yes in a dataset of 100. If 40 of these predictions are correct, while the dataset has 75 observed responses of Yes, the recall will be 40/75 or 0.533.

F-score combines both the recall and the precision values into a single value. This value represents the performance in terms of relevance for each label. High F-score values indicate that the classifier is performing better and is more fine-tuned than one with lower values.
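The three per-label measures can be verified with a few lines of arithmetic. The sketch below is illustrative: the function name is hypothetical, the F-score is computed as the usual harmonic mean of precision and recall (a formula not spelled out in the text), and a score of 0 is returned when a denominator is zero, which is an assumed convention.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Per-label measures from counts of correct and incorrect predictions."""
    predicted = true_pos + false_pos      # observations predicted as the label
    observed = true_pos + false_neg       # observations that truly carry the label
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / observed if observed else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Example from the text: 60 predicted Yes, 40 of them correct, 75 observed Yes
p, r, f = precision_recall_f1(true_pos=40, false_pos=20, false_neg=35)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.667 0.533 0.593

# Label 'no' in a confusion matrix [[5 0] [5 0]]: 5 correct 'no' predictions,
# 5 'yes' observations wrongly predicted as 'no', and no missed 'no'
p_no, r_no, f_no = precision_recall_f1(true_pos=5, false_pos=5, false_neg=0)
print(round(p_no, 2), round(r_no, 2), round(f_no, 2))   # 0.5 1.0 0.67
```

A label that is never predicted (zero true and false positives) receives 0 for all three measures under this convention.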
    161 #===================================================================
    162 # Step 5: Invert the encoded values and calculate the confusion matrix,
    163 # the accuracy score, and the classification report
    164 print("\nStep 5: Invert the integer encoded results into "
    165     "their original text-based")
    166 invertedY_Test = label_encoderY.inverse_transform(y_Test)
    167 print ("The inverted output test values are:", invertedY_Test)
    168 invertedPredicted = label_encoderY.inverse_transform(y_Predict)
    169 print ("The inverted predicted values of the output are:",
    170     invertedPredicted)
    171 confusionMatrix = confusion_matrix(invertedY_Test, invertedPredicted)
    172 print("The confusion matrix for the particular case is:\n",
    173     confusionMatrix)
    174 accuracyScore = accuracy_score(invertedY_Test, invertedPredicted)
    175 print("\nThe accuracy of the model given the test data is: ",
    176     accuracyScore * 100, "%")
    177
    178 classificationReport = classification_report(y_Test, y_Predict)
    179 print("\nThe classification report is as follows:\n",
    180     classificationReport)

Output 10.5: Step 5

    Step 5: Invert the integer encoded results into their original text-based
    The inverted output test values are: ['no' 'no' 'no' 'yes' 'yes' 'yes' 'no' 'yes' 'yes' 'no']
    The inverted predicted values of the output are: ['no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no' 'no']
    The confusion matrix for the particular case is:
    [[5 0]
     [5 0]]
    The accuracy of the model given the test data is:  50.0 %
    The classification report is as follows:
                  precision    recall  f1-score   support
               0       0.50      1.00      0.67         5
               1       0.00      0.00      0.00         5
        accuracy                           0.50        10
       macro avg       0.25      0.50      0.33        10
    weighted avg       0.25      0.50      0.33        10

Finally, Step 6 implements the statements used to visualize the decision tree, based on the parameters specified in the previous steps. The reader should note that the names of the features of the depicted decision tree, referred to as graphCols, must be defined before the tree is visualized, so that proper labels are attached to the respective tree classifications:

    181 #====================================================================
    182 # Step 6: Visualizing the CART Decision Tree
    183 # Define the names of the labels/features to be depicted in the
    184 # decision tree
    185 graphCols = ['age', 'Jobs', 'marital','education']
    186
    187 # Define the type of I/O to be used for the visualization of the
    188 # decision tree
    189 dot_data = StringIO()
    190
    191 # Use the export_graphviz() to prepare the visualization of the
    192 # decision tree
    193 export_graphviz(decisionTree, out_file = dot_data, filled = True,
    194     feature_names = graphCols, rounded = True)
    195
    196 # Use the pydotplus library to plot the decision tree
    197 graph = pydotplus.graphviz.graph_from_dot_data(dot_data.getvalue())
    198
    199 # Save the graph of the decision tree as a .png file in the local
    200 # folder
    201 graph.write_png("test.png")
    202 Image(graph.create_png())

Output 10.5.a: Depicting the Decision Tree using gini index and min_samples_leaf = 6

Output 10.5.b: Depicting the Decision Tree using entropy and min_samples_leaf = 1

10.6 SUPERVISED LEARNING ALGORITHMS: NAÏVE BAYES CLASSIFIER

Naïve Bayes is a probabilistic model, which can therefore generalize the classification problem using a set of probabilities. The main concept of this model is based on the popular Bayes' theorem. The theorem can solve the problem of finding the probability of an event by using existing data for the conditions related to the event. For example, the probability of an event A occurring while event B is true is given by the equation below. This is also referred to as posterior probability.

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Observation 10.15 – Naïve Bayes Classifier: A supervised ML algorithm that is used to find the probability of an event given certain conditions. This probability is referred to as posterior probability. The known information is referred to as prior probability.

P(B|A) is the probability of B occurring when A is true. It is estimated from the existing data and is called the likelihood. P(A) is the probability of A occurring without any condition. As it is part of the existing knowledge, it is called the prior probability. P(B) represents the probability of event B occurring and is called the evidence. Using prior probability, evidence, and likelihood, a Naïve Bayes model can determine the posterior probabilities of each class label for a set of features, and assign a label based on these probabilities. The label with the highest posterior probability is assigned to the current observation.

As an example, consider the weather data covering the previous 7 days, as given in Table 10.1. Based on the weather conditions, the pilot instructors decide whether to run a training flight or not. The theorem can be used to make a decision for the following weather conditions:

1. Appearance: Sunny
2. Temperature: Hot
3. Windy: False

To find the posterior probability for each label, first calculate the probabilities for label Yes:

    P(Yes) = 3/7
    P(Sunny | Yes) = 1/3
    P(Hot | Yes) = 1/3
    P(False | Yes) = 3/3

The posterior probability for label Yes (omitting the evidence term, which is the same for both labels) would be the following:

    P(Yes | (Sunny, Hot, False)) = P(Sunny | Yes) * P(Hot | Yes) * P(False | Yes) * P(Yes)
                                 = (1/3) * (1/3) * (3/3) * (3/7) ≈ 0.048

TABLE 10.1 Weather Data for Previous 7 Days

    Appearance   Temperature   Windy   Training Flight?
    Sunny        Cold          False   Yes
    Cloudy       Mild          False   Yes
    Sunny        Cold          True    No
    Rainy        Hot           False   Yes
    Rainy        Cold          True    No
    Cloudy       Hot           True    No
    Cloudy       Cold          False   No

Similarly, the posterior probability for label No for the same observation would be the following:

    P(No | (Sunny, Hot, False)) = P(Sunny | No) * P(Hot | No) * P(False | No) * P(No)
                                = (1/4) * (1/4) * (1/4) * (4/7) ≈ 0.009

In this case, the posterior probability of Yes is higher than that of No. Therefore, the training flight will run under the weather conditions of Appearance: Sunny, Temperature: Hot, and Windy: False.

Naïve Bayes may have three different implementations, depending on the data. In the case of continuous data, the Gaussian distribution is more suitable, whereas in the case of nominal data the multinomial distribution could produce better results. In the latter case (i.e., multinomial distribution), the implementation can be expressed in the following seven steps, with the last two being optional:

• Step 1: Import/read the data.
• Step 2: Split the input data into train and test sets.
• Step 3: Build the multinomial Naïve Bayes classifier.
• Step 4: Predict the results based on the classifier.
• Step 5: Find the accuracy of the model as a percentage.
• Step 6 (Optional): Visualize the results of the trained set.
• Step 7 (Optional): Visualize the results of the test set.
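Before turning to the library implementation, the hand calculation for the weather example can be checked with a short script. This is a minimal sketch of the counting behind the two products, not the multinomial classifier itself; the rows follow Table 10.1 as read here, and the function name posterior_score is a hypothetical helper. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Rows of Table 10.1: (appearance, temperature, windy, training flight)
weather = [
    ("Sunny", "Cold", False, "Yes"), ("Cloudy", "Mild", False, "Yes"),
    ("Sunny", "Cold", True, "No"), ("Rainy", "Hot", False, "Yes"),
    ("Rainy", "Cold", True, "No"), ("Cloudy", "Hot", True, "No"),
    ("Cloudy", "Cold", False, "No"),
]

def posterior_score(label, observation):
    """P(label) times the product of P(feature value | label)."""
    rows = [row for row in weather if row[3] == label]
    score = Fraction(len(rows), len(weather))            # P(label)
    for column, value in enumerate(observation):
        matches = sum(1 for row in rows if row[column] == value)
        score *= Fraction(matches, len(rows))            # P(value | label)
    return score

observation = ("Sunny", "Hot", False)
scores = {label: posterior_score(label, observation) for label in ("Yes", "No")}
decision = max(scores, key=scores.get)

print(scores["Yes"], scores["No"], decision)   # 1/21 1/112 Yes
```

The two scores are not normalized, since the evidence term is left out; dividing each by their sum would turn them into probabilities without changing which label wins.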
The following script uses students' Midterm Exam and Project grades to create the Naïve Bayes model and visualize the results:

    # Import train_test_split to train and test the input
    from sklearn.model_selection import train_test_split
    # Import StandardScaler to scale the data
    from sklearn.preprocessing import StandardScaler
    # Import the Multinomial Naïve Bayes to create the classifier object
    from sklearn.naive_bayes import MultinomialNB
    # Import the accuracy_score to calculate the accuracy of the model
    from sklearn.metrics import accuracy_score
    # Import Numpy to prepare the plot parameters
    import numpy as np
    # Import Pyplot to create the plot
    import matplotlib.pyplot as plt
    # Import ListedColormap to color the data points in the plot
    from matplotlib.colors import ListedColormap
    # Plot inline
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline

    # Step 1: Define the input dataset. X must be a 2D list
    # with as many rows as the observations
    X = [[30, 75], [84, 89], [79, 84], [71, 74], [68, 71], [81, 70],
         [61, 78], [89, 81], [58, 78], [70, 71], [70, 70], [90, 76],
         [86, 92], [72, 70], [70, 72], [82, 87], [51, 78], [44, 71],
         [82, 92], [50, 68]]
    y = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # Step 2: Split the set X and y into train and test sets
    # Test size is 25% of the dataset, Train size is 75%
    # The new train and test lists will be in random order
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size = 0.25, random_state = 0)
    print("Trained X set:", X_train); print("Test X set:", X_test)
    print("Trained y set:", y_train); print("Test y set:", y_test)

    # Step 3: Build the Naïve Bayes classifier
    # Fit the trained set into the classifier
    model = MultinomialNB().fit(X_train, y_train)
    print("\n", model)

    # Step 4: Predict the test results
    y_pred = model.predict(X_test)
    print("\nResults predicted by the model:", y_pred)
    print("Results from the test:", y_test)
    model.predict_proba(X)[:,1]

    # Step 5: Form the confusion matrix to get the accuracy of the model
    # Use the y_test (actual output) and the y_pred (predicted output)
    accuracy = accuracy_score(y_test, y_pred)
    print("The accuracy of the model given the test data is: ",
        accuracy * 100, "%")

    # Step 6: Visualize the training set results
    # Convert the lists to arrays so that the boolean masks below work
    X_set, y_set = np.array(X_train), np.array(y_train)
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
        np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
        X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
        cmap = ListedColormap(('red','blue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
    plt.title('Naive Bayes: Training set')
    plt.xlabel("Midterm Exam")
    plt.ylabel("Project")
    plt.show()

    # Step 7: Visualize the test results
    X_set, y_set = np.array(X_test), np.array(y_test)
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
        stop = X_set[:, 0].max() + 1, step = 0.01),
        np.arange(start = X_set[:, 1].min() - 1,
        stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, model.predict(np.array([X1.ravel(),
        X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
        cmap = ListedColormap(('red','blue')))
    plt.xlim(X1.min(), X1.max()); plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
    plt.title('Naive Bayes: Test set')
    plt.xlabel("Midterm Exam"); plt.ylabel("Project")
    plt.show()

Output 10.6:

    Trained X set: [[44, 71], [61, 78], [72, 70], [68, 71], [79, 84], [81, 70],
    [70, 72], [70, 71], [89, 81], [51, 78], [90, 76], [71, 74], [30, 75],
    [82, 87], [86, 92]]
    Test X set: [[82, 92], [84, 89], [50, 68], [58, 78], [70, 70]]
    Trained y set: [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
    Test y set: [1, 1, 0, 0, 1]
    MultinomialNB()
    Results predicted by the model: [1 1 0 0 1]
    Results from the test: [1, 1, 0, 0, 1]
    The accuracy of the model given the test data is: 100.0 %

In this case, the output suggests that Naïve Bayes can predict the final grade (Pass/Fail) for the students in the test set with 100% accuracy. For the same data, a different implementation of Naïve Bayes may produce results with large variations (e.g., in the case of the Gaussian Naïve Bayes function, the accuracy will be significantly lower). The reason for this is that each Naïve Bayes function assumes a particular distribution of the features, so its performance depends on the nature of the data.

10.7 UNSUPERVISED LEARNING ALGORITHMS: K-MEANS CLUSTERING

The k-means clustering algorithm is an unsupervised ML approach used to solve clustering problems in ML or data science. Its aim is to group unlabeled datasets into different clusters, where k is equal to the chosen number of newly created clusters. Each cluster is associated with a centroid, a data point representing the center of a cluster. The algorithm seeks to minimize the sum of distances between the data points and the centroids of their corresponding clusters. Its applications may be relevant in different domains, such as customer segmentation, insurance fraud detection, and document classification, just to name a few.

Observation 10.16 – K-means Clustering: An unsupervised ML algorithm that aims to group unlabeled datasets into a number (k) of different clusters, each associated with a centroid data point representing the center of the cluster.

Figure 10.3 presents a case of two clusters (k = 2) being identified in the source dataset.

FIGURE 10.3 k-means clusters and their centroids. (See Raghupathi, 2018.)

K-means is, essentially, an iterative algorithm. First, it selects a value for k that represents the number of clusters (e.g., k = 3 for 3 clusters). Next, it randomly assigns each data point to any of the clusters. Finally, it calculates the cluster centroid for each of the clusters. Once the iteration is complete a new one commences. At this stage, the algorithm reassigns each point to the closest cluster centroid and then recalculates the centroid of each cluster. The algorithm repeats the last two steps until there is no switching of data points from one cluster to another, in which case it is completed. Implementing the k-means algorithm usually involves the following steps:

• Step 1: Select the number of clusters (k). One could also use the elbow method to determine the optimal number.
• Step 2: Select a random centroid for each cluster. Note that this may be a point that does not belong to the input dataset.
• Step 3: Measure the distance (Euclidean function) between each point and the centroids. Assign each data point to its closest centroid.
• Step 4: Calculate the variance and add a new centroid for each cluster (i.e., calculate the mean of all the points for each cluster and set the new centroid).
• Step 5: Repeat Steps 3 and 4 until the centroid positions do not change.

The implementation of this approach in Python is rather straightforward, making it accessible to novice programmers and/or data scientists with no programming background.
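The steps listed above can be sketched in plain Python before resorting to a library. This is a minimal illustration under stated assumptions: the initial centroids are simply k randomly chosen data points (not the k-means++ initialization), the function name k_means and the eight sample points are hypothetical, and an empty cluster keeps its previous centroid.

```python
import random
from math import dist

def k_means(points, k, seed=0, max_iterations=100):
    """Plain k-means: assign points to the nearest centroid, then re-center."""
    centroids = random.Random(seed).sample(points, k)   # Step 2: random start
    clusters = []
    for _ in range(max_iterations):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Step 4: the new centroid is the mean of the points in each cluster
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        # Step 5: stop once the centroid positions no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical sample: two well-separated groups of four points each
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))   # [(1.5, 1.5), (8.5, 8.5)]
```

On data that is less cleanly separated, the result can depend on the random starting centroids, which is one reason library implementations offer more careful initialization schemes.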
The following script is an example of a k-means algorithm implementation, with the objective to group 100 customers based on their annual incomes and spending scores:

    # Import Pandas
    import pandas as pd
    # Import Numpy for data manipulation
    import numpy as np
    # Import the KMeans class from sklearn
    from sklearn.cluster import KMeans
    # Import the Pyplot to create the plot
    import matplotlib.pyplot as plt
    # Plot inline
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline
    # Import the operating system module
    import os
    # Import the Python data visualization library based on matplotlib
    import seaborn as sns
    sns.set(context = "notebook", palette = "Spectral", style = 'darkgrid',
        font_scale = 1.5, color_codes = True)

    # X is a list of 100 samples for customers, each representing the
    # annual income and the spending score
    X = [[15, 39], [15, 81], [16, 6], [16, 77], [17, 40], [17, 76],
         [18, 6], [18, 94], [19, 3], [19, 72], [19, 14], [19, 99],
         [20, 15], [20, 77], [20, 13], [20, 79], [21, 35], [21, 66],
         [23, 29], [23, 98], [24, 35], [24, 73], [25, 5], [25, 73],
         [28, 14], [28, 82], [28, 32], [28, 61], [29, 31], [29, 87],
         [30, 4], [30, 73], [33, 4], [33, 92], [33, 14], [33, 81],
         [34, 17], [34, 73], [37, 26], [37, 75], [38, 35], [38, 92],
         [39, 36], [39, 61], [39, 28], [39, 65], [40, 55], [40, 47],
         [40, 42], [40, 42], [42, 52], [42, 60], [43, 54], [43, 60],
         [43, 45], [43, 41], [44, 50], [44, 46], [46, 51], [46, 46],
         [46, 56], [46, 55], [47, 52], [47, 59], [48, 51], [48, 59],
         [48, 50], [48, 48], [48, 59], [48, 47], [49, 55], [49, 42],
         [50, 49], [50, 56], [54, 47], [54, 54], [54, 53], [54, 48],
         [54, 52], [54, 42], [54, 51], [54, 55], [54, 41], [54, 44],
         [54, 57], [54, 46], [57, 58], [57, 55], [58, 60], [58, 46],
         [59, 55], [59, 41], [60, 49], [60, 40], [60, 42], [60, 52],
         [60, 47], [60, 50], [61, 42], [61, 49]]

    # Convert the list to an np.array for plotting the clusters
    # of customers
    X = np.array(X)

    # Find the optimal number of clusters (elbow method)
    from sklearn.cluster import KMeans
    wcss = []
    for i in range(1, 15):
        kmeans = KMeans(n_clusters = i, init = 'k-means++',
            random_state = 42)
        kmeans.fit(X)
        # The inertia_ attribute holds the WCSS for that model:
        # WCSS is the sum of squared distance between each point
        # and the centroid in a cluster
        wcss.append(kmeans.inertia_)

    # Plot the number of clusters against the WCSS
    plt.figure(figsize = (10,5))
    sns.lineplot(x = list(range(1, 15)), y = wcss, marker = 'o',
        color = 'red')
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

Output 10.7.a: The Elbow Method (WCSS against the number of clusters)

The output illustrates the identification of the optimal number of clusters that can represent the k-means, in this case 4. Next, this is used to find, organize, and illustrate the respective clusters with their centroid data, as in the following script:

    # Fitting K-means to the dataset
    kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
    y_kmeans = kmeans.fit_predict(X)

    # Plot the annual income (k$) against the spending score
    plt.figure(figsize = (15,7))
    sns.scatterplot(x = X[y_kmeans == 0, 0], y = X[y_kmeans == 0, 1],
        color = 'yellow', label = 'Cluster 1', s = 50)
    sns.scatterplot(x = X[y_kmeans == 1, 0], y = X[y_kmeans == 1, 1],
        color = 'blue', label = 'Cluster 2', s = 50)
    sns.scatterplot(x = X[y_kmeans == 2, 0], y = X[y_kmeans == 2, 1],
        color = 'green', label = 'Cluster 3', s = 50)
    sns.scatterplot(x = X[y_kmeans == 3, 0], y = X[y_kmeans == 3, 1],
        color = 'grey', label = 'Cluster 4', s = 50)
    sns.scatterplot(x = kmeans.cluster_centers_[:, 0],
        y = kmeans.cluster_centers_[:, 1], color = 'red',
        label = 'Centroids', s = 300, marker = ',')
    plt.grid(False)
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1–100)')
    plt.legend()
    plt.show()

Output 10.7.b: Finding and illustrating the clusters, their data points, and their centroids

The output identifies the four optimal clusters of the data points and their centroids.

10.8 UNSUPERVISED LEARNING ALGORITHMS: APRIORI

The apriori algorithm is based on rule mining and is mainly used for finding the association between different items in a dataset. However, the algorithm can also be used as a classifier. It explores the data space and keeps all items in a dynamic structure. The apriori algorithm prunes the list of itemsets to keep only those that meet certain criteria. One simple criterion is the use of a threshold value: the most frequent item and itemset lists can be pruned using the threshold values on support and confidence. For example, if the support of an item is less than the threshold value, the item is not added to the frequent items. The association between items is determined based on two main measurements: support and confidence.

Observation 10.17 – Apriori: An unsupervised ML algorithm used to find the association between different items in a dataset. It is based on the measurements of confidence and support.

Observation 10.18 – Support: Calculates the likelihood of an item being in the data space and filters the reported items. Use parameter min_support = value (0.0–1.0).

Support calculates the likelihood of an item being in the data space and confidence measures the relationship or association of an item with another.
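Both measurements reduce to counting transactions. The sketch below takes support as the share of transactions that contain an itemset, and the confidence of a rule A to B as the share of the transactions containing A that also contain B. It uses the four supermarket transactions of Table 10.2; the function names are hypothetical helpers and are not part of any apriori library.

```python
# The four supermarket transactions of Table 10.2
transactions = [
    {"Apple", "Banana", "Biscuits"},
    {"Apple", "Banana", "Bread"},
    {"Bread", "Eggs", "Cereal"},
    {"Apple", "Bread", "Eggs"},
]

def count_containing(items):
    """Number of transactions that contain every item in items."""
    return sum(1 for basket in transactions if set(items) <= basket)

def support(items):
    """Share of all transactions that contain the given items."""
    return count_containing(items) / len(transactions)

def confidence(antecedent, consequent):
    """Share of the transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both) / count_containing(antecedent)

all_items = sorted(set().union(*transactions))
frequent = [item for item in all_items if support([item]) >= 0.5]

print(support(["Apple"]), support(["Banana"]))   # 0.75 0.5
print(frequent)                                  # ['Apple', 'Banana', 'Bread', 'Eggs']
print(round(confidence(["Apple"], ["Banana"]), 3))
```

With a support threshold of 50%, Biscuits and Cereal are dropped, and the rule from Apple to Banana holds in two of the three transactions that contain Apple.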
For a given item (A) the support is calculated using the following equation (Equation 10.1):

$$\text{Support}(A) = \frac{\text{Number of observations containing } A}{\text{Total number of observations}} \tag{10.1}$$

The confidence is measured using the following equation (Equation 10.2) and represents the association between two items, say A and B:

$$\text{Confidence}(A \text{ to } B) = \frac{\text{Number of observations containing } A \text{ and } B}{\text{Number of observations containing } A} \tag{10.2}$$

The min_lift parameter indicates the likelihood of an item being associated with another. A value of 1 indicates that the items are not associated. A lift value greater than 1 indicates that an item is likely to be associated with another item, while a value less than 1 means the opposite.

The min_length parameter defines the minimum number of items considered for the rules, and depends on the number of the available items. The association among the items can be determined up to a certain length: if the length of the association is 10, a maximum of ten items can be related to each other. Each one of these combinations is called an itemset. In a large dataset, the number of frequent items and itemsets could be rather substantial.

Observation 10.19 – Confidence: Calculates the level of confidence of the association with another item and filters the reported items. Use parameter min_confidence = value (0.0–1.0).

Observation 10.20 – min_lift: Defines the minimum lift a rule must reach to be displayed. A value of 1 suggests no association, a value greater than 1 suggests an association, and a value less than 1 suggests the opposite.

Observation 10.21 – min_length: Defines the minimum number of items to be considered for the rules, and depends on the number of available items.

The apriori algorithm can be further explained using the dataset provided in Table 10.2. The table lists the four most recent transactions made by customers in a supermarket.

Apriori will start by calculating the support for all items, as shown in Table 10.3. Next, it will apply the threshold to trim the item list and build a frequent itemset. Assume that the threshold for the support is 50%. The trimmed list of frequent items is shown in Table 10.4. Similarly, the algorithm will calculate the confidence for finding an association between two items, and trim the list using the threshold on confidence. Eventually, two rules will be selected:

1. If a customer buys an Apple, there is a high chance the customer also buys a Banana.
2. If a customer buys Bread, there is a likelihood the customer will also buy Eggs.

TABLE 10.2 Transactions at a Supermarket

| Transaction ID | Items Purchased |
|---|---|
| 1 | Apple, Banana, Biscuits |
| 2 | Apple, Banana, Bread |
| 3 | Bread, Eggs, Cereal |
| 4 | Apple, Bread, Eggs |

TABLE 10.3 Support for All Items

| Item | Support |
|---|---|
| Apple | 0.75 |
| Banana | 0.5 |
| Bread | 0.75 |
| Biscuits | 0.25 |
| Cereal | 0.25 |
| Eggs | 0.5 |

TABLE 10.4 Frequent Itemset with 50% Support

| Item | Support |
|---|---|
| Apple | 0.75 |
| Banana | 0.5 |
| Bread | 0.75 |
| Eggs | 0.5 |

The apriori implementation in Python can be described using the following four steps (the last one being optional):

• Step 1: Import/read the data.
• Step 2: Build the apriori model.
• Step 3: Transform the rules into a dataframe.
• Step 4: Create a table to display all the rules.

The following script uses the above data to create the apriori model:

    # Import Pandas and NumPy
    import pandas as pd
    import numpy as np
    # Import the apriori model
    from apyori import apriori
    # Import accuracy_score to calculate the accuracy of the model
    from sklearn.metrics import accuracy_score

    # Step 1: Define the input dataset. X must be a 2D list with
    # as many rows as the observations
    X = [["Apple", "Banana", "Biscuits"], ["Apple", "Banana", "Bread"],
         ["Bread", "Eggs", "Cereal"], ["Apple", "Bread", "Eggs"]]

    # Step 2: Build the apriori model
    # Note: the output below still lists single-item sets, so min_length
    # does not appear to filter them in this run
    rules = apriori(X, min_length = 2, min_support = 0.1, \
                    min_confidence = 0.02, min_lift = 1)
    # rules = apriori(X, min_length = 2, min_support = 0.5,
    #                 min_confidence = 0.5, min_lift = 1)

    # Step 3: Transform the output into a pd.DataFrame
    results = list(rules)
    results = pd.DataFrame(results)
    print("The association rules for the particular dataset are:\n", results)

    # Step 4: Create an output table from the ordered statistics
    # Note: not all tables are of the same type
    F1 = []; F2 = []; F3 = []; F4 = []
    C3 = results.support
    for i in range(results.shape[0]):
        # Only the first ordered statistic of each itemset is read here
        single_list = results['ordered_statistics'][i][0]
        F1.append(list(single_list[0]))
        F2.append(list(single_list[1]))
        F3.append(single_list[2])
        F4.append(single_list[3])
    # First column of the table
    C1 = pd.DataFrame(F1)
    # Second column of the table
    C2 = pd.DataFrame(F2)
    # Fourth column of the table
    C4 = pd.DataFrame(F3, columns = ['Confidence'])
    # Fifth column of the table
    C5 = pd.DataFrame(F4, columns = ['Lift'])
    # Concatenate all tables into one
    table = pd.concat([C1, C2, C3, C4, C5], axis = 1)
    print("\nImproved format of the association rules for the dataset:\n", table)

Output 10.8.a–10.8.c:

    The association rules for the particular dataset are:
                             items  support  \
    0                      (Apple)     0.75
    1                     (Banana)     0.50
    2                   (Biscuits)     0.25
    3                      (Bread)     0.75
    4                     (Cereal)     0.25
    5                       (Eggs)     0.50
    6              (Apple, Banana)     0.50
    7            (Apple, Biscuits)     0.25
    8               (Apple, Bread)     0.50
    9                (Apple, Eggs)     0.25
    10          (Banana, Biscuits)     0.25
    11             (Banana, Bread)     0.25
    12             (Bread, Cereal)     0.25
    13               (Eggs, Bread)     0.50
    14              (Eggs, Cereal)     0.25
    15   (Apple, Banana, Biscuits)     0.25
    16      (Apple, Banana, Bread)     0.25
    17        (Apple, Eggs, Bread)     0.25
    18       (Eggs, Bread, Cereal)     0.25

                                       ordered_statistics
    0                          [((), (Apple), 0.75, 1.0)]
    1                          [((), (Banana), 0.5, 1.0)]
    2                       [((), (Biscuits), 0.25, 1.0)]
    3                          [((), (Bread), 0.75, 1.0)]
    4                         [((), (Cereal), 0.25, 1.0)]
    5                            [((), (Eggs), 0.5, 1.0)]
    6   [((), (Apple, Banana), 0.5, 1.0), ((Apple), (B...
    7   [((), (Apple, Biscuits), 0.25, 1.0), ((Apple),...
    8                    [((), (Apple, Bread), 0.5, 1.0)]
    9                    [((), (Apple, Eggs), 0.25, 1.0)]
    10  [((), (Banana, Biscuits), 0.25, 1.0), ((Banana...
    11                 [((), (Banana, Bread), 0.25, 1.0)]
    12  [((), (Bread, Cereal), 0.25, 1.0), ((Bread), (...
    13  [((), (Eggs, Bread), 0.5, 1.0), ((Bread), (Egg...
    14  [((), (Eggs, Cereal), 0.25, 1.0), ((Cereal), (...
    15  [((), (Apple, Banana, Biscuits), 0.25, 1.0), (...
    16  [((), (Apple, Banana, Bread), 0.25, 1.0), ((Ap...
    17  [((), (Apple, Eggs, Bread), 0.25, 1.0), ((Brea...
    18  [((), (Eggs, Bread, Cereal), 0.25, 1.0), ((Bre...

    Improved format of the association rules for the dataset:
               0         1         2  support  Confidence  Lift
    0      Apple      None      None     0.75        0.75   1.0
    1     Banana      None      None     0.50        0.50   1.0
    2   Biscuits      None      None     0.25        0.25   1.0
    3      Bread      None      None     0.75        0.75   1.0
    4     Cereal      None      None     0.25        0.25   1.0
    5       Eggs      None      None     0.50        0.50   1.0
    6      Apple    Banana      None     0.50        0.50   1.0
    7      Apple  Biscuits      None     0.25        0.25   1.0
    8      Apple     Bread      None     0.50        0.50   1.0
    9      Apple      Eggs      None     0.25        0.25   1.0
    10    Banana  Biscuits      None     0.25        0.25   1.0
    11    Banana     Bread      None     0.25        0.25   1.0
    12     Bread    Cereal      None     0.25        0.25   1.0
    13      Eggs     Bread      None     0.50        0.50   1.0
    14      Eggs    Cereal      None     0.25        0.25   1.0
    15     Apple    Banana  Biscuits     0.25        0.25   1.0
    16     Apple    Banana     Bread     0.25        0.25   1.0
    17     Apple      Eggs     Bread     0.25        0.25   1.0
    18      Eggs     Bread    Cereal     0.25        0.25   1.0

The results demonstrate the apriori model at work, and also highlight the dominant associations between the items. Strong associations between Bread and Eggs, and between Apple and Banana, are evident.
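To check the reported numbers by hand, the following minimal sketch recomputes support and confidence (Equations 10.1 and 10.2) for the four transactions of Table 10.2 in plain Python. The helper names `support`, `confidence`, and `lift` are illustrative and not part of apyori, and the lift formula used (confidence divided by the support of the consequent) is the standard definition rather than one given in the text. Note that Step 4 of the script reads only the first entry of `ordered_statistics`, whose antecedent is empty, which is why its Confidence column repeats the support and its Lift column is always 1.0; the per-rule values computed here differ.

```python
# Transactions from Table 10.2, stored as sets for subset tests
transactions = [
    {"Apple", "Banana", "Biscuits"},
    {"Apple", "Banana", "Bread"},
    {"Bread", "Eggs", "Cereal"},
    {"Apple", "Bread", "Eggs"},
]

def support(itemset):
    """Equation 10.1: share of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b):
    """Equation 10.2: transactions with A and B over transactions with A."""
    return support(set(a) | set(b)) / support(a)

def lift(a, b):
    """Standard definition (an assumption here): confidence(A to B) / support(B)."""
    return confidence(a, b) / support(b)

# The two rules selected in the text
for a, b in [(("Apple",), ("Banana",)), (("Bread",), ("Eggs",))]:
    print(a, "->", b,
          "support:", support(a + b),
          "confidence:", round(confidence(a, b), 3),
          "lift:", round(lift(a, b), 3))
```

Under these definitions both rules have a support of 0.5, a confidence of about 0.667, and a lift above 1, which is consistent with the associations highlighted in the text.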
Changing the parameter values to min_support = 0.5 and min_confidence = 0.5 will change the reported output (Output 10.8.d) as follows:

    The association rules for the particular dataset are:
                 items  support                                 ordered_statistics
    0          (Apple)     0.75                         [((), (Apple), 0.75, 1.0)]
    1         (Banana)     0.50                         [((), (Banana), 0.5, 1.0)]
    2          (Bread)     0.75                         [((), (Bread), 0.75, 1.0)]
    3           (Eggs)     0.50                           [((), (Eggs), 0.5, 1.0)]
    4  (Apple, Banana)     0.50  [((), (Apple, Banana), 0.5, 1.0), ((Apple), (B...
    5   (Apple, Bread)     0.50                   [((), (Apple, Bread), 0.5, 1.0)]
    6    (Bread, Eggs)     0.50  [((), (Bread, Eggs), 0.5, 1.0), ((Bread), (Egg...

    Improved format of the association rules for the dataset:
            0       1  support  Confidence  Lift
    0   Apple    None     0.75        0.75   1.0
    1  Banana    None     0.50        0.50   1.0
    2   Bread    None     0.75        0.75   1.0
    3    Eggs    None     0.50        0.50   1.0
    4   Apple  Banana     0.50        0.50   1.0
    5   Apple   Bread     0.50        0.50   1.0
    6   Bread    Eggs     0.50        0.50   1.0

Notice how raising the minimum confidence and the acceptable support dramatically reduces the reported rules and output. The rules extracted by apriori identify the patterns of item sales for a supermarket. The model can determine similar associations for a larger dataset, and the report can be tweaked to display the top-ranking associations (e.g., Eggs and Bread, or Apple and Banana).

10.9 OTHER LEARNING ALGORITHMS

A number of other ML algorithms are also frequently used in real-life applications. One of the most popular is random forest (Andrade et al., 2019; Kwon et al., 2015; Naveed & Alrammal, 2017; Naveed et al., 2020), a supervised ML algorithm. It can be used for both classification and regression. The main idea behind random forest is to create multiple ML decision tree models, with datasets created using what is referred to as a bootstrap sampling method. According to this method, each sub-dataset is composed of random sub-samples of the original dataset.
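The bootstrap sampling idea can be sketched in a few lines. This is a minimal illustration, assuming the usual reading of bootstrap sampling as drawing rows at random with replacement (the text only says "random sub-samples"); the function name `bootstrap_indices`, the dataset size, and the number of sub-datasets are hypothetical.

```python
import numpy as np

def bootstrap_indices(n_rows, rng):
    """Return n_rows row indices drawn uniformly at random, with replacement."""
    return rng.integers(0, n_rows, size=n_rows)

rng = np.random.default_rng(42)   # fixed seed, chosen only for repeatability
n_rows = 10                       # hypothetical dataset with ten observations
n_trees = 3                       # hypothetical number of sub-datasets (one per tree)

sub_datasets = [bootstrap_indices(n_rows, rng) for _ in range(n_trees)]
for tree, idx in enumerate(sub_datasets):
    # Rows never drawn for this sub-dataset
    left_out = sorted(set(range(n_rows)) - set(idx.tolist()))
    print(f"sub-dataset {tree}: rows {idx.tolist()}, not sampled: {left_out}")
```

Because rows are drawn with replacement, each sub-dataset typically repeats some rows and omits others, so every tree sees a slightly different view of the data.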
Each of the defined training datasets is used to create a different model, using the same ML algorithm and making different predictions. The best prediction is used as the result of the process. The random forest algorithm can be described using the following four steps:

• Step 1: Select random samples from a given dataset.
• Step 2: Create a decision tree for each sample and get a prediction result for each decision tree.
• Step 3: Perform a vote for each of the predicted results.
• Step 4: Select the prediction result with the highest number of votes as the final prediction.

Observation 10.22 – Random Forest: Create multiple ML decision trees from random sub-sets of the original dataset. Make predictions for each of the decision trees and vote for the best prediction.

Random forest is considered a highly accurate ML algorithm, with larger numbers of decision trees leading to increasingly robust results. Since it combines the predictions of all its trees, it is less prone to overfitting and less sensitive to outliers in the original dataset than a single decision tree. Its main shortcomings come from the fact that it consists of multiple decision trees. Hence, it is slow in generating a final prediction, as it has to collect all the sub-tree predictions and vote for the best one, and it is not as straightforward to interpret as a single decision tree.

The K-Nearest Neighbors (k-NN) algorithm uses the entire dataset as a training set, rather than splitting the dataset into a training and a test set. It assumes that similar data points are in close proximity to each other. This proximity (or distance) can be calculated using a variety of methods, such as the Euclidean distance or the Hamming distance (Sharma, 2020).
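The two distance measures mentioned above can be sketched as follows. This is a minimal plain-Python illustration using their standard definitions; the function names and sample points are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two numeric points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming_distance(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

print(euclidean_distance([1, 2], [4, 6]))   # sqrt(3**2 + 4**2) = 5.0
print(hamming_distance("10110", "10011"))   # differs in two positions: 2
```

The Euclidean distance suits numeric features, while the Hamming distance is typically used for categorical or binary features.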
When an outcome is requested for a new data point, the k-NN algorithm calculates the distances between the new data point and the points of the entire dataset, and selects the user-defined k data points that are most similar to the new data point. Next, it returns the mean of their outcomes (for regression) or the mode, i.e., the most frequent class (for classification).

Observation 10.23 – k-NN: Use the whole data set as a training set to calculate the distances between the various k data points in the dataset.

The k-NN algorithm follows six main steps:

• Step 1: Load the data.
• Step 2: Select the number (k) of neighbors.
• Step 3: For each new data point, calculate the distance between the new point and each point in the current dataset.
• Step 4: Add each distance, together with the index of the corresponding dataset point, to a collection.
• Step 5: Sort the collection of distances and indices by distance.
• Step 6: Pick the first k entries from the sorted collection, get their labels, and return the mean or mode.

The main disadvantage of k-NN is that it becomes significantly slower as the dataset increases in size.

10.10 WRAP UP - MACHINE LEARNING APPLICATIONS

Through the use of Machine Learning (ML) algorithms, Artificial Intelligence (AI) has penetrated all forms of human activity. It is highly likely that the vast majority of people have first-hand experience of this through one of its many real-life applications.

Traffic alerts (maps) are one such example, with several applications being used to suggest routes and help drivers deal with navigation and traffic. Data are collected from other drivers currently using the same system or network, and/or from historical data of the various routes collected over time. Data collected while users are using the application or network include their location, average speed, and the route on which they are travelling.
Figure 10.4 illustrates such an example under heavy congestion conditions (i.e., Sheikh Mohammed bin Rashid Blvd – Downtown Dubai).

FIGURE 10.4 Traffic alert application.

Another class of examples of ML algorithms are the various virtual personal assistants. Such systems assist users with various daily tasks and include advanced detection capabilities like understanding the user's voice (e.g., asking "what is my schedule for today?" will trigger the associated response). Common tasks implemented in contemporary virtual personal assistant systems include speech recognition, speech-to-text conversion, natural language processing, and text-to-speech conversion. The systems collect and refine the information based on previous interactions. They are integrated into a variety of platforms, including smart speakers, smartphones, and mobile apps.

Social media is another space where ML applications are heavily integrated and used. From personalizing news feeds to better ad targeting, social media platforms are utilizing machine learning for both corporate and end-user benefits. The list below includes some examples one may be familiar with, perhaps without even realizing that these features are nothing but the practical application of ML algorithms:

• People You May Know: ML works on a simple concept: understanding through experience. For example, social media platforms continuously monitor the friends one connects with, the most often visited profiles, one's interests, work and personal status, or the groups one belongs to. Based on continuous learning, a list of social media users that one can become friends with is suggested.
• Face Recognition: A user uploads a personal picture with a friend and the system instantly recognizes the identity of that friend. Such systems may check the poses and projections in the picture, identify unique features, and match them with people in the user's friends or contact lists. The entire process is based on ML and is commonly referred to as friend tagging. It is a rather complex process taking place at the backend, but it is transparent on the user side, where it appears as a simple and unobtrusive feature.
• Similar Pins: ML is a core element in computer vision, a technique to extract useful information from images and videos. An example of this can be seen in platforms which use computer vision to identify the objects (or pins) in the images and recommend other related pins accordingly.

House price prediction is yet another example of ML algorithms in action. By leveraging the data collected from large numbers of houses in relation to their characteristics (e.g., square footage, number of rooms, property type), the algorithm trains the ML model to predict the price of other houses. The many popular online portals for searching houses or apartments (both for rental and purchase) are examples of the use of such applications.

Product recommendation is an experience most people have without even noticing. As an example, one can think of using a web browser to check a product on a specific website. It is likely that while engaging in other online activities, such as watching online videos, the same or similar products appear as an ad. In such cases, the various platforms use smart agents to track the user's search history and recommend ads based on it.

Recommender systems are another application of ML algorithms. Such systems use collaborative filtering, a method based on gathering and analyzing user behavior information and predicting what users like based on similarities with other users. Figure 10.5 provides an example of the use of collaborative filtering in an e-commerce web app. In this context one can assume a customer (Customer 1) viewing product A and other customers viewing products A, B, C, and D.
Due to the similarity of interests of all the users in product A, the web app will propose products B, C, and D to Customer 1.

Among the most important applications of ML is the monitoring of video cameras. In areas or countries utilizing very large numbers of traffic monitoring video cameras, monitoring by human officers can be impractical and challenging. The idea of training computers to accomplish this task comes in handy in such cases. Similarly, video surveillance systems powered by AI/ML make it possible to detect suspicious activity, sometimes even before it takes place. This is done by tracking unusual behavior (e.g., when someone stands motionless for a long time, stumbles, or lies down in a public location). The system can generate alerts sent to human attendants, who can then take appropriate action. As activities are reported and verified, they help to improve the surveillance services even further.

In the context of information security, one should note the use of spam filtering. The term refers to processes that monitor the user's email traffic and execute appropriate preventive actions. It is crucial that spam filters are continuously updated; this is accomplished through ML algorithms. While there are hundreds of thousands of malware and security threats detected every single day, it is generally accepted that the associated code is 90% or more similar to its predecessor. ML-based security programs can identify such coding patterns and detect new malware with slight coding variations rather easily.

Similarly, ML provides great potential to secure online monetary transactions from online fraud. For instance, online payment platforms use a set of tools that help compare millions of transactions taking place almost simultaneously and identify suspicious or fraudulent activity between buyers and sellers.
FIGURE 10.5 Product recommendations. (See Keshari, 2021.)

Finally, another common application of ML models can be found in the online customer support services of many e-Business or e-Commerce platforms. Such platforms frequently offer the option to chat with a customer support representative while navigating the website. While the interaction may seem like a regular conversation, it is not with a real representative but with a chatbot. The latter extracts information from the website and presents it to the customers in a chat-like form. Every time a new chat begins, the answer is improved based on the previously recorded answers.

The discussion on ML applications can continue further, with practical use examples like weather prediction, distinction between animals/plants/objects, or customer segmentation, just to name a few.

10.11 CASE STUDIES

Use dataset dataset.csv to write a Python script that predicts whether a patient will be readmitted or not within 30 days. The application should do the following:

1. Read the dataset and create a data frame with the following categories: gender, race, age, admission type id, discharge disposition id, admission source id, max glu serum, A1Cresult, change, diabetesMed, readmitted (categorical), time in hospital, number of lab procedures, number of procedures, number of medications, number of outpatients, number of emergencies, number of inpatients, number of diagnoses (numerical).
2. Apply the following ML algorithms and calculate their accuracy: logistic regression, k-NN, SVM, Kernel SVM, Naïve Bayes, CART Decision Tree, Random Forest.

10.12 EXERCISES

1. Use the CART example in this chapter to change the criterion from entropy to Gini index and the max depth to 10. How does this affect the accuracy of the model? What is the effect of changing the max depth to 20?
2. Test both the BEST and RANDOM splitter features on the CART example from this chapter. Explain whether the performance of a decision tree depends on the splitter feature of the classifier object.
3. Apply a smaller training dataset to the CART decision tree example to investigate whether the performance will improve or decrease (Hint: Increase and decrease the ratio of the size of the training dataset).
4. Find the precision, recall, and F-score for a CART decision tree with entropy as criterion, max depth of 4, and min samples leaf nodes of 20.
5. Use the bank dataset to train a decision tree classifier with ten-fold cross validation and generate the respective classification report.

REFERENCES

Andrade, E. de O., Viterbo, J., Vasconcelos, C. N., Guérin, J., & Bernardini, F. C. (2019). A model based on LSTM neural networks to identify five different types of malware. Procedia Computer Science, 159, 182–191.
Keshari, K. (2021). Top 10 Applications of Machine Learning: Machine Learning Applications in Daily Life. https://www.edureka.co/blog/machine-learning-applications/.
Kwon, B. J., Mondal, J., Jang, J., Bilge, L., & Dumitraş, T. (2015). The dropper effect: Insights into malware distribution with downloader graph analytics. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (pp. 1118–1129), Denver, Colorado.
Mitchell, T. M. (1997). Machine Learning (1st ed.). New York: McGraw-Hill.
Mola, F. (1998). Classification and regression trees software and new developments. In A. Rizzi, M. Vichi, & H.-H. Bock (Eds.), Advances in Data Science and Classification (pp. 311–318). Berlin Heidelberg: Springer.
Naveed, M., & Alrammal, M. (2017). Reinforcement learning model for classification of Youtube movie. Journal of Engineering and Applied Science, 12(9), 1–7.
Naveed, M., Alrammal, M., & Bensefia, A. (2020). HGM: A Novel Monte-Carlo simulations based model for malware detection. IOP Conference Series: Materials Science and Engineering, 946(1), 012003. https://doi.org/10.1088/1757-899x/946/1/012003.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/BF00116251.
Raghupathi, K. (2018). 10 Interesting Use Cases for the K-Means Algorithm. DZone AI Zone. https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm.
Salzberg, S. L. (1994). C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning, 16(3), 235–240. https://doi.org/10.1007/BF00993309.
Sharma, P. (2020). 4 Types of Distance Metrics in Machine Learning. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/.

11 Introduction to Neural Networks and Deep Learning

Dimitrios Xanthidis
University College London
Higher Colleges of Technology

Muhammad Fahim
Higher Colleges of Technology

Han-I Wang
The University of York

CONTENTS

11.1 Introduction
11.2 Relevant Algebraic Math and Associated Python Methods for DL
    11.2.1 The Dot Method
    11.2.2 Matrix Operations with Python
    11.2.3 Eigenvalues, Eigenvectors and Diagonals
    11.2.4 Solving Sets of Equations with Python
    11.2.5 Generating Random Numbers for Matrices with Python
    11.2.6 Plotting with Matplotlib
    11.2.7 Linear and Logistic Regression
11.3 Introduction to Neural Networks
    11.3.1 Modelling a Simple ANN with a Perceptron
    11.3.2 Sigmoid and Rectifier Linear Unit (ReLU) Methods
    11.3.3 A Real-Life Example: Preparing the Dataset
    11.3.4 Creating and Compiling the Model
    11.3.5 Stochastic Gradient Descent and the Loss Method and Parameters
    11.3.6 Fitting and Evaluating the Models, Plotting the Observed Losses
    11.3.7 Model Overfit and Underfit
11.4 Wrap Up
11.5 Case Study
References

11.1 INTRODUCTION

Deep learning is in fact a new name for an approach to artificial intelligence called neural networks, which has been going in and out of fashion for more than 70 years. Neural networks were first proposed in 1944 by Warren McCullough and Walter Pitts, two University of Chicago researchers who moved to MIT in 1952 as founding members of what’s sometimes called the first cognitive science department.
(Hardesty, 2017)

DOI: 10.1201/9781003139010-11

Human intelligence is an evolutionary, biologically controlled process. Humans learn based on their experiences. Similarly, machine or artificial intelligence is subject to comparable experiences in the form of data. In a broader context, the two forms of intelligence are similar in the sense that they are subject to a common approach: "based on what I have seen and observed I think this will happen next". Once this core idea is transferred to mathematical constructs and the associated (self-evolving) algorithms, machines are observed to be capable of learning on their own, a process commonly referred to as machine learning (ML).

Observation 11.1 – Deep Learning: A specialized form of Machine Learning. It uses many layers of algorithms to process the underlying data, which could be human speech, images, text, complex objects, etc.

ML is a branch of artificial intelligence (AI), an umbrella term used to describe approaches and techniques that can make machines think and act in a more rational and human-like way. Deep learning (DL) is a specific form of ML, and therefore another branch of AI (Figure 11.1). At a basic level, DL is based on mimicking the human thinking process and developing relevant abstractions and connections. It consists of the following elements:

1. Learning: Facilitating the functionality to artificially obtain and process new information.
2. Reasoning: Offering the functionality to process information in different, and potentially overlooked, ways.
3. Understanding: Providing ways to showcase the results of the adopted model.
4. Validating: Offering the opportunity to validate the results of the model based on theory.
5. Discovering: Providing the mechanisms to identify new relationships within the data.
6. Extracting: Allowing the extraction of new meanings based on the predictors.
DL uses numerous layers of algorithms to process the underlying data, which could be spoken words, images, text, or more complex objects. The data are normally passed through interconnected layers of processing networks, as shown in Figure 11.2.

FIGURE 11.1 Scope of data-based learning technologies.

FIGURE 11.2 DL processing and layering structure.

In ML, there are two types of variables: dependent and independent. One way to contextualize these variables is to think of independent variables as the inputs of the ML process and dependent variables as the outputs. For example, one can predict a person's weight by knowing that person's height.

Another notion the reader should be familiar with is that of data plotting. Essentially, plotting is a way to visualize the data in an effort to identify underlying patterns and groupings. As data can be scattered, when plotting them the goal is to find a line that represents the best fit for a given dataset. A simple equation can define such a process:

Y = F(X) + B

where Y is the dependent variable (predicted weight) and X the independent variable (an individual's height). In ML, there are mainly two types of predictions:

1. Linear Regression: Linear regression is focused on predicting continuous values. This topic is thoroughly discussed in Chapter 10: Machine Learning with Python. It is highly recommended that the reader goes through the basic discussions in that chapter before proceeding to the next sections of the present one, as they offer a useful foundation for understanding many aspects of DL.
2. Logistic Regression: Logistic regression is focused on predicting values classified as 0 or 1, and is one of the cornerstones of DL.

DL is applied in cases of learning based on unlabelled data with unknown features. Thus, feature extraction (FE) is a vital aspect of DL. FE uses algorithms to construct the meaning of the features, so the training and testing processes can be applied.
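The best-fit line Y = F(X) + B and the two kinds of prediction described above can be illustrated with a short sketch. It assumes, for illustration only, that F(X) is a simple multiplication by a weight w; the height and weight values are invented, and the `sigmoid` helper is a hypothetical name for the standard logistic function that squashes a score into the range 0 to 1.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), invented for illustration
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 58.0, 66.0, 74.0, 82.0])

# Linear regression: fit Y = w * X + b, i.e. a degree-1 polynomial
w, b = np.polyfit(height, weight, deg=1)
predicted_weight = w * 175.0 + b   # continuous prediction for a 175 cm person
print("slope:", round(w, 3), "intercept:", round(b, 3))
print("predicted weight at 175 cm:", round(predicted_weight, 2))

def sigmoid(z):
    """Standard logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Logistic-style prediction: turn a score into a 0/1 class with a 0.5 cut-off
score = 2.0   # hypothetical linear score for one observation
label = int(sigmoid(score) >= 0.5)
print("probability:", round(float(sigmoid(score)), 3), "class:", label)
```

The first part returns a continuous value, whereas the second returns one of two classes, which mirrors the distinction between linear and logistic regression drawn in the text.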
This chapter covers the following:

1. An introduction to the theory and mathematical constructs of DL fundamentals, supported by the associated mathematical equations, working examples, and related Python scripts.
2. An introductory discussion on Neural Networks (NN) and DL algorithms implementing NN, with working examples and scripts.
3. Examples of building a DL model using NN.

It should be noted that, since there are several mathematical concepts involved in the DL processes, it is possible to face compatibility issues when working with more than one library. In such cases, it is often quite useful to know whether a particular library is installed in the system and, if so, which version. The following statements may come in handy:

    # scipy
    import scipy
    print('scipy: %s' % scipy.__version__)
    # numpy
    import numpy
    print('numpy: %s' % numpy.__version__)
    # matplotlib
    import matplotlib
    print('matplotlib: %s' % matplotlib.__version__)
    # pandas
    import pandas
    print('pandas: %s' % pandas.__version__)
    # statsmodels
    import statsmodels
    print('statsmodels: %s' % statsmodels.__version__)
    # scikit-learn
    import sklearn
    print('sklearn: %s' % sklearn.__version__)

Output 11.1:

    scipy: 1.4.1
    numpy: 1.19.5
    matplotlib: 3.2.2
    pandas: 1.1.5
    statsmodels: 0.10.2
    sklearn: 1.0.1

In addition to the Pandas, Matplotlib, Numpy, and Scipy libraries already covered in previous chapters, there are a few more that are essential in DL scripts. Some of these must be installed prior to their import and use in a script. However, given the variety of installations depending on the operating system and configuration, it is deemed impractical to cover all of them in the present chapter. The reader is advised to seek instructions in the many sites available online. A list of these libraries, with a brief description, follows:

1. TensorFlow: It is used for backpropagation, and passes the data through the model for training and prediction.
2. Theano: It helps with defining, optimizing, and evaluating mathematical expressions on multi-dimensional arrays. It is very efficient when performing symbolic differentiation.
3. PyTorch: It helps with tensor computations on the GPU and with data modeling based on Neural Networks.
4. Caffe: A DL framework designed with expressiveness and speed in mind.
5. Apache MXNet: As a core component, it comes with a dynamic dependency scheduler that provides parallelism for both symbolic and imperative operations.

11.2 RELEVANT ALGEBRAIC MATH AND ASSOCIATED PYTHON METHODS FOR DL

There are some essential mathematical concepts that must be explained, and their Python implementations described, before delving into the introduction of DL with Python. The most fundamental are the dot() method, matrix operations, eigenvalues/eigenvectors and diagonals, solving sets of equations, generating random numbers, and linear and logistic regression.

11.2.1 The Dot Method

A method often used in DL that is not covered in previous chapters is the dot method. It implements the mathematical equation that sums the products of the corresponding elements of two arrays:

$$
x \cdot y = x^{T} y = \sum_{n=1}^{N} x_n y_n
$$

Observation 11.2 – The Dot Method: Calculates the sum of the products of the corresponding elements of two vectors, provided in the form of arrays.

The dot method is important in the context of DL, as the core operation of the latter is to accept multiple inputs from various neurons and calculate their weighted sum. Since the inputs are always in the form of vectors (i.e., pairs of values like a course grade and its weight), the dot method is an effective means for this calculation.

FIGURE 11.3 The dot method in DL.
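The course grade example mentioned above can be sketched as follows. The grades and weights are hypothetical values chosen for illustration; the point is simply that a weighted sum of inputs is a single dot product.

```python
import numpy as np

# Hypothetical course grades (inputs) and their weights
grades = np.array([80.0, 90.0, 70.0])
weights = np.array([0.3, 0.3, 0.4])

# Weighted sum calculated element by element
manual = sum(g * w for g, w in zip(grades, weights))

# The same weighted sum calculated with the dot method
final_grade = np.dot(grades, weights)

print("Weighted sum (loop):", manual)
print("Weighted sum (dot): ", final_grade)
```

A neuron performs the same calculation on its inputs and weights, which is why the dot method appears throughout the scripts of this chapter.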
Figure 11.3 illustrates the functionality of the dot method. Consider the following Python script:

    import numpy as np

    # 1x2 and 1x3 arrays
    x1, y1 = np.array([1, 2]), np.array([3, 4])
    x2, y2 = np.array([1, 2, 3]), np.array([4, 5, 6])
    print("The two arrays x1 and y1 are:\n", x1, y1)
    print("The two arrays x2 and y2 are:\n", x2, y2)

    # Product of 2 arrays calculated as xi*yi (for each of the elements)
    print("\nCreate a new list as products of the elements of the two \
    arrays (x1 * y1):", x1 * y1)
    print("\nCreate a new list as products of the elements of the two \
    arrays (x2 * y2):", x2 * y2)

    # Loop calculates the dot method of the 2 arrays (x1, y1 & x2, y2)
    Dot = 0
    for i in range(len(x1)):
        Dot += x1[i] * y1[i]
    print("\nUsing a regular loop to calculate the dot value for \
    the 1x2 arrays:", Dot)
    Dot = 0
    for i in range(len(x2)):
        Dot += x2[i] * y2[i]
    print("Using a regular loop to calculate the dot value for \
    the 1x3 arrays:", Dot)

    # The zip method with parallel iterations calculates
    # the dot for x1, y1 and x2, y2
    Dot = 0
    for g, h in zip(x1, y1):
        Dot += g * h
    print("\nUsing the zip method for parallel iterations:", Dot)
    Dot = 0
    for g, h in zip(x2, y2):
        Dot += g * h
    print("Using the zip method for parallel iterations:", Dot)

    # The sum method calculates the dot for two arrays
    print("\nThe sum of the products of the elements of the two arrays \
    (np.sum(x1 * y1)):", np.sum(x1 * y1))
    print("\nThe sum of the products of the elements of the two arrays \
    (np.sum(x2 * y2)):", np.sum(x2 * y2))

    # A different version of the sum method calculates the dot of 2 arrays
    print("\nThe sum of the products of the elements of the two arrays \
    ((x1 * y1).sum()):", (x1 * y1).sum())
    print("The sum of the products of the elements of the two arrays \
    ((x2 * y2).sum()):", (x2 * y2).sum())

    # The dot method on two arrays
    print("\nUse the dot method on the elements of the two arrays \
    (np.dot(x1, y1)):", np.dot(x1, y1))
    print("Use the dot method on the elements of the two arrays \
    (np.dot(x2, y2)):", np.dot(x2, y2))

    # A different version of the dot method on two arrays
    print("\nAnother way to use the dot method on the elements \
    of the two arrays (x1.dot(y1)):", x1.dot(y1))
    print("Another way to use the dot method on the elements \
    of the two arrays (x2.dot(y2)):", x2.dot(y2))

    # The @ operator applied to two arrays
    print("\nAnother way to use the dot method (x1 @ y1):", x1 @ y1)
    print("\nAnother way to use the dot method (x2 @ y2):", x2 @ y2)

Output 11.2.1:

    The two arrays x1 and y1 are:
     [1 2] [3 4]
    The two arrays x2 and y2 are:
     [1 2 3] [4 5 6]

    Create a new list as products of the elements of the two arrays (x1 * y1): [3 8]

    Create a new list as products of the elements of the two arrays (x2 * y2): [ 4 10 18]

    Using a regular loop to calculate the dot value for the 1x2 arrays: 11
    Using a regular loop to calculate the dot value for the 1x3 arrays: 32

    Using the zip method for parallel iterations: 11
    Using the zip method for parallel iterations: 32

    The sum of the products of the elements of the two arrays (np.sum(x1 * y1)): 11

    The sum of the products of the elements of the two arrays (np.sum(x2 * y2)): 32

    The sum of the products of the elements of the two arrays ((x1 * y1).sum()): 11
    The sum of the products of the elements of the two arrays ((x2 * y2).sum()): 32

    Use the dot method on the elements of the two arrays (np.dot(x1, y1)): 11
    Use the dot method on the elements of the two arrays (np.dot(x2, y2)): 32

    Another way to use the dot method on the elements of the two arrays (x1.dot(y1)): 11
    Another way to use the dot method on the elements of the two arrays (x2.dot(y2)): 32

    Another way to use the dot method (x1 @ y1): 11

    Another way to use the dot method (x2 @ y2): 32

This script calculates the sum of the products of the elements of two arrays (based on their indices) in several different ways and presents the results. For illustration purposes, it uses two types of arrays (i.e., 1 × 2 elements and 1 × 3 elements). The reader should notice the various forms that the dot method can take. The method is quite useful and comes in handy in the examples provided in the following sections.

11.2.2 Matrix Operations with Python

Another algebraic concept that is quite useful in DL is that of matrix multiplication. Broadly speaking, this process requires that the size of the second dimension of the first matrix is the same as the size of the first dimension of the second matrix. In other words, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix has the size of the first dimension of the first matrix (its number of rows) and the size of the second dimension of the second matrix (its number of columns). The dot method is used for the calculation of each element of the new matrix. As an example, one can assume the following two matrices:

$$
\text{npArray} = \begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}, \qquad \text{newMatrix} = \begin{bmatrix} 3 & 4 & 5 \\ 1 & 2 & 3 \end{bmatrix}
$$

The first array (npArray) has two columns, whereas the second (newMatrix) has two rows. Hence, it is possible to have a new matrix as the product of these two matrices.
The resulting matrix will be calculated as follows:

$$
\begin{bmatrix} 1 \cdot 3 + 2 \cdot 1 & 1 \cdot 4 + 2 \cdot 2 & 1 \cdot 5 + 2 \cdot 3 \\ 5 \cdot 3 + 6 \cdot 1 & 5 \cdot 4 + 6 \cdot 2 & 5 \cdot 5 + 6 \cdot 3 \end{bmatrix} = \begin{bmatrix} 5 & 8 & 11 \\ 21 & 32 & 43 \end{bmatrix}
$$

Another mathematical Python method that often comes in handy when using matrices is exp() from the Numpy library. The method accepts an array of elements (an algebraic matrix) as an argument and creates a new matrix in which each element x_ij of the original is replaced by e^(x_ij). Using the previous example of matrix npArray, the resulting matrix will be as follows:

$$
\begin{bmatrix} e^{1} & e^{2} \\ e^{5} & e^{6} \end{bmatrix} = \begin{bmatrix} 2.71828183 & 7.3890561 \\ 148.4131591 & 403.42879349 \end{bmatrix}
$$

Observation 11.3 – The exp() Method: Creates a new matrix by raising e to the power of each element of the original matrix.

Another concept often used in DL is that of the inverse matrix. If such a matrix is multiplied by the original, the result is the identity matrix. If, in turn, the identity matrix is multiplied by the original matrix, it will not change it. This is similar to the integer 1, which does not change the value of any other integer it is multiplied by. The identity matrices for 2 × 2, 3 × 3, and 4 × 4 matrices can be expressed as follows:

$$
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

This pattern can continue in a similar fashion for larger square matrices.

Observation 11.4 – Inverse Matrix: A matrix which, if multiplied by the original, gives the identity matrix.

Observation 11.5 – Identity Matrix: A square matrix that has all the elements of its main diagonal equal to 1 and all other elements equal to 0. It causes no change to the values of a matrix it is multiplied by.

It is important to note that there are two requirements for a matrix to have a corresponding inverse: it must be a square matrix and its determinant must be non-zero. The determinant is a special number, either integer or real, calculated from a matrix. Its most important role is precisely to determine whether a matrix can have an inverse, in which case the determinant is non-zero. If not, it will have a value of 0 or extremely close to 0. It must be noted that even a number like 2.3e−23 is considered as 0 and, therefore, such a determinant would suggest that it is not feasible to have an inverse matrix.

Observation 11.6 – Determinant: A special number, integer or real, calculated from the elements of a square matrix. It determines whether a matrix has an inverse (value is non-zero) or not (value is 0).

For a 2 × 2 matrix, the determinant is calculated by subtracting the product of the elements of the secondary diagonal from the product of the elements of the main diagonal. For example, in the case of matrix

$$
\begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}
$$

the determinant is calculated as 1*6 − 5*2 = 6 − 10 = −4. However, in the case of

$$
\begin{bmatrix} 1 & 3 & 2 \\ 5 & 4 & 8 \\ 7 & 6 & 9 \end{bmatrix}
$$

things are more complicated. In this case the determinant is calculated as 1*((4*9) − (6*8)) − 3*((5*9) − (7*8)) + 2*((5*6) − (7*4)) = 1*(36 − 48) − 3*(45 − 56) + 2*(30 − 28) = −12 − 3*(−11) + 2*2 = −12 + 33 + 4 = 25. The pattern for 3 × 3 or larger matrices is as follows:

• Multiply the first element of the first row by the determinant of the smaller matrix formed by the elements that are not in the same row or column as that element.
• Similarly, calculate the same values for all the other elements of the first row of the matrix.
• Calculate the final determinant as first result − second result + third result − fourth result, and so forth.

The reader should note that the determinant can be calculated only for square matrices.
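The pattern described in the bullet points above can be sketched as a short recursive function. The function name det_by_cofactors is hypothetical and the code is meant only to mirror the manual calculation; in practice, the Numpy method np.linalg.det() is the more efficient choice.

```python
import numpy as np

def det_by_cofactors(matrix):
    """Determinant by expansion along the first row (illustrative only)."""
    m = np.asarray(matrix, dtype=float)
    rows, cols = m.shape
    if rows != cols:
        raise ValueError("The determinant is defined only for square matrices")
    if rows == 1:
        return m[0, 0]
    if rows == 2:
        # Main diagonal product minus secondary diagonal product
        return m[0, 0] * m[1, 1] - m[0, 1] * m[1, 0]
    total = 0.0
    for j in range(cols):
        # Smaller matrix without the first row and without column j
        minor = np.delete(np.delete(m, 0, axis=0), j, axis=1)
        # Signs alternate: +, -, +, - and so forth
        sign = 1 if j % 2 == 0 else -1
        total += sign * m[0, j] * det_by_cofactors(minor)
    return total

small = [[1, 2], [5, 6]]
large = [[1, 3, 2], [5, 4, 8], [7, 6, 9]]
print("2x2 determinant:", det_by_cofactors(small))
print("3x3 determinant:", det_by_cofactors(large))
print("Numpy result for the 3x3 matrix:", np.linalg.det(large))
```

For the two matrices used in the text, the function reproduces the values of −4 and 25 calculated by hand.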
The following script briefly demonstrates the above concepts:

    import numpy as np

    # Create a 2-dimensional array (2x2) using the array function (Numpy)
    npArray = np.array([[1, 2], [5, 6]])
    # Show the entire array and the 2nd element of the 1st dimension
    # in 2 different ways
    print("\nThe npArray array's contents:\n", npArray)
    print("The 2nd element of the 1st dimension of the array:", npArray[0][1])
    print("The same result from a different syntax:", npArray[0, 1])
    print("\nThe elements of the 1st column:", npArray[:, 0])
    print("\nShow the result of the e^x for each element of the input \
    array:\n", np.exp(npArray))

    # Create a 2-dimensional array (2x3) using the array function (Numpy)
    newMatrix = np.array([[3, 4, 5], [1, 2, 3]])
    print("\nThe 2x3 matrix newMatrix is:\n", newMatrix)

    # Multiply the arrays npArray and newMatrix applying the .dot method
    print("\nThe product of npArray and newMatrix using the .dot method \
    is:\n", npArray.dot(newMatrix))

    # Create a 2-dimensional array (3x3) using the array function (Numpy)
    newMatrix2 = np.array([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
    print("\nThe 3x3 matrix newMatrix2 is:\n", newMatrix2)

    # Determinant values for npArray & newMatrix2. The matrices are squares
    print("\nThe determinant for the npArray is: ", np.linalg.det(npArray))
    print("The determinant for the newMatrix2 is: ", np.linalg.det(newMatrix2))

    # Calculate and display the inverse matrix for npArray and newMatrix2
    inverseNpArray = np.linalg.inv(npArray)
    print("\nThe inverse matrix for the npArray is:\n", inverseNpArray)
    inverseNewMatrix2 = np.linalg.inv(newMatrix2)
    print("\nThe inverse matrix for the newMatrix2 is:\n", inverseNewMatrix2)

    # Multiplying original npArray & newMatrix2 matrices with their
    # inverse produces the identity matrix
    print("\nThe product of the npArray and its inverse matrix is:\n",
        inverseNpArray.dot(npArray))
    print("\nThe product of the newMatrix2 and its inverse matrix is:\n",
        inverseNewMatrix2.dot(newMatrix2))

Output 11.2.2:

    The npArray array's contents:
     [[1 2]
     [5 6]]
    The 2nd element of the 1st dimension of the array: 2
    The same result from a different syntax: 2

    The elements of the 1st column: [1 5]

    Show the result of the e^x for each element of the input array:
     [[  2.71828183   7.3890561 ]
     [148.4131591  403.42879349]]

    The 2x3 matrix newMatrix is:
     [[3 4 5]
     [1 2 3]]

    The product of npArray and newMatrix using the .dot method is:
     [[ 5  8 11]
     [21 32 43]]

    The 3x3 matrix newMatrix2 is:
     [[1 3 2]
     [5 4 8]
     [7 6 9]]

    The determinant for the npArray is:  -3.999999999999999
    The determinant for the newMatrix2 is:  25.000000000000007

    The inverse matrix for the npArray is:
     [[-1.5   0.5 ]
     [ 1.25 -0.25]]

    The inverse matrix for the newMatrix2 is:
     [[-0.48 -0.6   0.64]
     [ 0.44 -0.2   0.08]
     [ 0.08  0.6  -0.44]]

    The product of the npArray and its inverse matrix is:
     [[ 1.00000000e+00 -2.22044605e-16]
     [-5.55111512e-17  1.00000000e+00]]

    The product of the newMatrix2 and its inverse matrix is:
     [[ 1.00000000e+00  6.66133815e-16  9.99200722e-16]
     [-2.08166817e-16  1.00000000e+00 -1.24900090e-16]
     [ 7.21644966e-16  1.11022302e-16  1.00000000e+00]]

The results showcase the output of the calculations.
Note that the determinants are calculated with floating-point arithmetic, which leads to the respective values (−3.999999999999999 and 25.000000000000007) not being exactly the whole numbers −4 and 25. In addition, the product of newMatrix2 and its inverse matrix is the 3 × 3 identity matrix, although some of the elements that should be 0 appear as non-zero values that are extremely close to 0.

11.2.3 Eigenvalues, Eigenvectors and Diagonals

Another concept related to matrix operations is that of eigenvalues and eigenvectors. An eigenvector of a matrix is a vector that does not change direction when multiplied by that matrix; it is only scaled, and the scaling factor is the corresponding eigenvalue. As an example, consider a square matrix A. Its eigenvector and eigenvalue will be the ones that make the following equation true:

AV = λV

where A is the original matrix, V is the eigenvector, and λ is the eigenvalue. It is beyond the scope of this chapter to cover algebraic mathematics in any sort of detail. The reader can find such information in the multitude of related books and resources. For the purposes of this chapter, it should suffice to mention that the concept of eigenvalues and eigenvectors is useful in several transformation processes, including but not limited to computer graphics, physics applications, and predictive modelling.

Observation 11.7 – Eigenvalue, Eigenvector: Mathematical concepts that identify the vectors whose direction does not change when multiplied by a particular matrix, and the factors by which these vectors are scaled (AV = λV).

Another notion that must be mentioned is that of a diagonal. It is often useful to find the diagonals above or below the main diagonal of a matrix. In the case of the former, a positive integer is passed as the second argument of the diag() method, whereas in the case of the latter a negative one.

The following script is a demonstration of how the concepts of eigenvalue, eigenvector, and diagonals are calculated and/or identified:

    import numpy as np

    # Create a 2x2 array using the array function (Numpy) and
    # display its contents
    npArray = np.array([[1, 2], [5, 6]])
    print("\nThe npArray array's contents:\n", npArray)

    # Create a 3x3 array using the array function (Numpy) and
    # display its contents
    newMatrix = np.array([[1, 3, 2], [5, 4, 8], [7, 6, 9]])
    print("\nThe 3x3 matrix newMatrix is:\n", newMatrix)

    # Display the diagonals for both arrays
    print("The diagonal of the npArray is: ", np.diag(npArray))
    print("The diagonal of the npArray above the main diagonal is: ",
        np.diag(npArray, 1))
    print("The diagonal of the npArray below the main diagonal is: ",
        np.diag(npArray, -1))
    print("The diagonal of the newMatrix is: ", np.diag(newMatrix))
    print("The diagonal of the newMatrix above the main diagonal is: ",
        np.diag(newMatrix, 1))
    print("The diagonal of the newMatrix below the main diagonal is: ",
        np.diag(newMatrix, -1))

    # Calculate and display the eigenvalues and eigenvectors for both arrays
    eigenValueNpArray, eigenVectorNpArray = np.linalg.eig(npArray)
    print("\nThe eigenvalues of the npArray are: \n", eigenValueNpArray)
    print("\nThe eigenvectors of the npArray are: \n", eigenVectorNpArray)
    eigenValueNewMatrix, eigenVectorNewMatrix = np.linalg.eig(newMatrix)
    print("\nThe eigenvalues of the newMatrix are: \n",
        eigenValueNewMatrix)
    print("\nThe eigenvectors of the newMatrix are: \n",
        eigenVectorNewMatrix)

Output 11.2.3:

    The npArray array's contents:
     [[1 2]
     [5 6]]

    The 3x3 matrix newMatrix is:
     [[1 3 2]
     [5 4 8]
     [7 6 9]]
    The diagonal of the npArray is:  [1 6]
    The diagonal of the npArray above the main diagonal is:  [2]
    The diagonal of the npArray below the main diagonal is:  [5]
    The diagonal of the newMatrix is:  [1 4 9]
    The diagonal of the newMatrix above the main diagonal is:  [3 8]
    The diagonal of the newMatrix below the main diagonal is:  [5 6]

    The eigenvalues of the npArray are:
     [-0.53112887  7.53112887]

    The eigenvectors of the npArray are:
     [[-0.79402877 -0.2928046 ]
     [ 0.60788018 -0.9561723 ]]

    The eigenvalues of the newMatrix are:
     [15.86430285+0.j         -0.93215143+0.84080839j -0.93215143-0.84080839j]

    The eigenvectors of the newMatrix are:
     [[ 0.22516436+0.j          0.76184671+0.j          0.76184671-0.j        ]
     [ 0.60816639+0.j         -0.24748842+0.39196634j -0.24748842-0.39196634j]
     [ 0.76120605+0.j         -0.36476897-0.26766596j -0.36476897+0.26766596j]]

11.2.4 Solving Sets of Equations with Python

Python provides a convenient way to solve sets of linear equations by treating them as matrices. The idea behind this is to take a set of equations, produce the relevant matrices (i.e., one with the variable coefficients and one with the resulting values for each equation), and call the solve() method (Numpy library). Consider the following example of a set of three equations:

$$
\begin{aligned}
5x - 3y + 2z &= 10 \\
-4x - 3y - 9z &= 3 \\
2x + 4y + 3z &= 6
\end{aligned}
$$

Observation 11.8 – The solve() Method: A method that solves a set of equations using relevant, appropriately processed matrices.

Firstly, the following matrix of the variable coefficients is produced:

$$
\begin{bmatrix} 5 & -3 & 2 \\ -4 & -3 & -9 \\ 2 & 4 & 3 \end{bmatrix}
$$
This is followed by the matrix of the resulting values (i.e., the right-hand side of each equation): 10, 3, 6

Finally, the solve() method is called, producing the respective solutions for x, y, and z:

    import numpy as np

    # Assume the following set of equations:
    #  5x - 3y + 2z = 10
    # -4x - 3y - 9z = 3
    #  2x + 4y + 3z = 6
    # Use solve() to solve the equations

    # Create a 3x3 matrix with the coefficients and an array with the
    # resulting values of the equations, and display the solution
    equations = np.array([[5, -3, 2], [-4, -3, -9], [2, 4, 3]])
    results = np.array([10, 3, 6])
    print("\nThe solution for x, y, and z is:\n",
        np.linalg.solve(equations, results))

Output 11.2.4:

    The solution for x, y, and z is:
     [ 3.90225564  1.46616541 -2.55639098]

11.2.5 Generating Random Numbers for Matrices with Python

Sometimes it is useful to generate matrices with random numbers in order to evaluate models prior to using actual data. Through the Numpy library, Python provides several methods that offer such functionality.

The following script can be divided into three distinct parts. In the first part, a 3 × 4 matrix is generated and filled with 0s. Next, another two matrices are generated and filled with 1s and 20s, respectively. Finally, a 4 × 4 identity matrix is generated. In the second part, the script uses the random(), randn(), and randint() methods to generate numbers for the matrices, either through the regular random number generator or from the Normal (Gaussian) Distribution that has a mean of 0. In the third part, the script demonstrates the use of basic statistics methods from Numpy, including mean(), var(), and std(), to calculate the mean, the statistical variance, and the standard deviation of the data, respectively.

Observation 11.9 – rand(), randn(), mean(), var(), std(): Numpy methods used with matrices. The first two belong to the Random package and generate random numbers; the last three provide basic descriptive statistical calculations.

    import numpy as np

    # Generate 3x4 matrices of zeroes, ones, 20s, and a 4x4 identity matrix
    print("Generate a 3x4 matrix of zeroes\n", np.zeros((3, 4)))
    print("\nGenerate a 3x4 matrix of ones\n", np.ones((3, 4)))
    print("\nGenerate a 3x4 matrix of 20s\n", 20 * np.ones((3, 4)))
    print("\nGenerate an Identity matrix 4x4\n", np.eye(4))

    # Generate a random number, a 3x4 matrix of random numbers,
    # a 3x4 matrix of random numbers from the Normal (Gaussian)
    # Distribution (i.e., mean = 0), and a 4x4 matrix of random
    # integers from 5 (inclusive) to 15 (exclusive)
    print("\nGenerate a random number\n", np.random.random())
    print("\nGenerate an array 3x4 with random numbers\n",
        np.random.random((3, 4)))
    print("\nGenerate an array 3x4 with random numbers from the Normal \
    Distribution\n", np.random.randn(3, 4))
    print("\nGenerate an array 4x4 with random numbers between 5 and 15\n",
        np.random.randint(5, 15, size = (4, 4)))

    # Generate an array of 10 items with random numbers from the
    # Normal (Gaussian) Distribution and use it as a source for performing
    # basic statistics
    npArray = np.random.randn(10)
    # Note: the call below draws 10 new numbers; it does not display npArray,
    # on which the statistics that follow are calculated
    print("\nGenerate an array of 10 random numbers from the Normal \
    Distribution\n", np.random.randn(10))
    # Print the mean of the new array
    print("\nThe mean of the new array is: ", npArray.mean())
    # Print the variance of the new array
    print("The variance of the new array is: ", npArray.var())
    # Print the standard deviation (i.e., the square root of the variance)
    print("The stdDev of the new array is: ", npArray.std())

Output 11.2.5:

    Generate a 3x4 matrix of zeroes
     [[0. 0. 0. 0.]
     [0. 0. 0. 0.]
     [0. 0. 0. 0.]]

    Generate a 3x4 matrix of ones
     [[1. 1. 1. 1.]
     [1. 1. 1. 1.]
     [1. 1. 1. 1.]]

    Generate a 3x4 matrix of 20s
     [[20. 20. 20. 20.]
     [20. 20. 20. 20.]
     [20. 20. 20. 20.]]

    Generate an Identity matrix 4x4
     [[1. 0. 0. 0.]
     [0. 1. 0. 0.]
     [0. 0. 1. 0.]
     [0. 0. 0. 1.]]

    Generate a random number
     0.8435542056822151

    Generate an array 3x4 with random numbers
     [[0.35570211 0.27618855 0.0541145  0.58001638]
     [0.20641101 0.48294052 0.92104823 0.61556587]
     [0.19491554 0.5713989  0.63918665 0.81824177]]

    Generate an array 3x4 with random numbers from the Normal Distribution
     [[-0.24286997 -1.00451518  0.06104505 -1.85966171]
     [-0.47202171  0.01079039  0.03526387  0.44499205]
     [ 2.2395344   0.42076315  0.6505322  -0.6350833 ]]

    Generate an array 4x4 with random numbers between 5 and 15
     [[ 7 13  7 12]
     [ 7  9  5  5]
     [12 12 10  8]
     [12  7  9 13]]

    Generate an array of 10 random numbers from the Normal Distribution
     [ 0.80516765 -0.34184534 -1.01860459  1.55026532  1.52091946  0.68490906
     -0.07417641  1.35254549  0.21432432  0.29326124]

    The mean of the new array is:  0.009347564051776013
    The variance of the new array is:  0.6866073562925792
    The stdDev of the new array is:  0.8286177383405325

Since the numbers are randomly generated, the values will differ every time the script runs.

11.2.6 Plotting with Matplotlib

As already discussed in Chapter 8 on Data Analytics and Data Visualization and Chapter 9 on Statistics, Python offers libraries that effectively and efficiently address the types of charts that might be required for the analysis of the data at hand. These include Matplotlib and Scipy, which are widely used for Deep Learning as well.
The following two scripts are a quick refresher on how to use these libraries to visualize/plot the results of the mathematical methods of the previous sections:

    # Import the Numpy and Matplotlib libraries
    import numpy as np
    import matplotlib.pyplot as plt
    # Plot inline alongside the rest of the results
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline

    # Plot a line as the sin of the values between 0 and 40
    # with 4 different types of intervals
    for i in range(1, 5):
        A = np.linspace(0, 40, 20*i)
        B = np.sin(A) + 0.2 * A
        plt.plot(A, B)
        plt.xlabel("Input"); plt.ylabel("Output")
        titleShow = "Basics of Charts. Number of samples: " + str(20*i)
        plt.title(titleShow); plt.show()

Output 11.2.6.a–11.2.6.d: [Four line charts titled "Basics of Charts. Number of samples: 20/40/60/80", with "Input" on the horizontal axis and "Output" on the vertical axis.]

    # Import the Numpy, Scipy, and Matplotlib libraries
    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt
    # Plot inline alongside the rest of the results
    # This is particularly relevant in Jupyter Anaconda
    %matplotlib inline

    # Create 2000 evenly spaced data points between -10 and 10
    x = np.linspace(-10, 10, 2000)
    # loc is the mean and scale is the standard deviation
    # Calculate the probability density function (Norm module/Scipy)
    fx = norm.pdf(x, loc = 0, scale = 1)
    # Plot the chart
    plt.plot(x, fx); plt.show()
    # Calculate the cumulative distribution function (Norm module/Scipy)
    fx2 = norm.cdf(x, loc = 0, scale = 1)
    # Plot the chart
    plt.plot(x, fx2); plt.show()
    # Calculate the log of the probability density function (Norm
    # module/Scipy)
    fx3 = norm.logpdf(x, loc = 0, scale = 1)
    plt.plot(x, fx3); plt.show()
    # Calculate the log of the cumulative distribution function
    # (Norm module/Scipy)
    fx4 = norm.logcdf(x, loc = 0, scale = 1)
    plt.plot(x, fx4); plt.show()

Output 11.2.6.e–11.2.6.h: [Four line charts showing, in order, the probability density function, the cumulative distribution function, the log of the probability density function, and the log of the cumulative distribution function.]

11.2.7 Linear and Logistic Regression

Regression can involve either categorical or continuous variables. The input could be continuous, categorical, or discrete. If y denotes the outcome and x denotes the input, the model can be written as follows: y = F(x), where F is the DL model that describes the relationship between input and output. In the case of Linear Regression, this model reveals a directly proportional relationship between input and output, with some possible regression coefficients (γ) of the various inputs (x) and the possibility of an error (φ) in the model calculations. Eventually, in the case of Linear Regression, the model can be written as follows:

$$
y = F(x) = \gamma_0 + \gamma_1 x_1 + \dots + \gamma_n x_n + \varphi
$$

In the case of Logistic Regression (LR), the backbone of a DL Neural Network, the DL algorithm is used to classify the possible outputs as accurately as possible. The categories are encoded as either 0 or 1, and a sigmoid method is used to output a number between 0 and 1. The output is interpreted as the probability that the data is to be categorized as 1.

11.3 INTRODUCTION TO NEURAL NETWORKS

"Neural networks reflect the behavior of the human brain, allowing computer programs to recognize patterns and solve common problems in the fields of AI, machine learning, and deep learning." (IBM Cloud Education, 2020)

The artificial neural networks (ANN) technique was inspired by the basic functioning of the human brain. The main idea behind it is to interpret data through a series of multiple ML-based perceptrons (covered in detail in the next section), and to label or cluster the input as required. Real-world data such as images, sounds, time series, or other complex data are translated into numbers using vectors.

ANN is quite helpful in classifying and clustering raw data, even if they are unidentified and unlabelled. This is because it groups data based on similarities it observes or learns in its deeper layers, thus transforming them into labelled training data, in a similar way to the human brain.
A deep neural network consists of one or more perceptrons in two or more layers (input and output). The perceptrons of each layer are fed by the previous layer, using the same input but with different weights. The target of DL in ANN is to find correlations and map inputs to outputs. At a basic level, it extracts unknown features from the input data that can be fed to other algorithms, while also creating components of larger ML applications that may include classification, regression, and reinforcement learning. It approximates the unknown method (f(x) = y) for any input x and output y. During learning, ANN finds the right method by evolving into a tuned transformation of x into y. In simple terms, this could represent methods like f(x) = 7x + 18 or f(x) = 8x − 0.8.

ANN performs particularly well in clustering. In that role, it falls into the category of unsupervised learning, as it does not require labels to perform its tasks. It consists of the input layer, the hidden layer(s), the output layer, the adjustable weights for model training and learning for all layers, and the activation method.

Observation 11.10 – Neuron: The basic building block of a neural network, also called the linear unit. It learns by modifying the values of the weights of the inputs and adding up the sum of inputs × weights and the possible bias of the model.

The neuron is the basic building block of a neural network. It is also known as the linear unit of the neural network system (Figure 11.4).

FIGURE 11.4 A typical neuron.

In Figure 11.4, X is the input to the neuron and w is the weight. In its most basic form, the key for a neuron to be able to learn is the modification of the value of w. Y is the output and b is the bias of the model. The bias is independent of the input and its value is provided with the model. The neuron sums up all the weighted input values and the bias to come up with the equation that describes its model, like the equation of a straight line in algebra: Y = wX + b.
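The equation Y = wX + b of the linear unit can be sketched in a few lines. The function name linear_unit is hypothetical, and the second set of input, weight, and bias values is made up for illustration; the first call reuses the method f(x) = 7x + 18 mentioned earlier in this section.

```python
import numpy as np

def linear_unit(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias (Y = wX + b)."""
    return np.dot(inputs, weights) + bias

# One input: f(x) = 7x + 18 evaluated at x = 2
single = linear_unit(np.array([2.0]), np.array([7.0]), 18.0)
print("Single input:", single)

# Several inputs (hypothetical values), each with its own weight
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 2.0])
several = linear_unit(inputs, weights, 1.0)
print("Several inputs:", several)
```

Learning, in this simplified view, amounts to repeatedly adjusting the values held in weights (and possibly the bias) until the output is close to the expected one.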
11.3.1 Modelling a Simple ANN with a Perceptron

Figure 11.5 illustrates the method of a single neuron in a single layer (i.e., a perceptron). Its fundamental functionality is to mimic the behavior of a neuron in the human brain. The idea is to take the inputs of the model (x1, x2,…, xn) and multiply each by its respective weight (w1, w2,…, wn), in order to produce the relevant k values (k1, k2,…, kn). Often, a constant bias value multiplied by its associated weight is also added to this sum. Next, the sum of the k values is calculated and passed to the selected activation method (in this case, the sigmoid). Finally, the result is frequently normalized using a method such as the unit step. A perceptron is also called a single-layer neural network because its output is decided based on the outcome of a single activation method associated with a single neuron.

Observation 11.11 – Perceptron: A single-layer neural network, as its output is decided on a single activation method associated with a single neuron.

FIGURE 11.5 Perceptron.

Class FirstNeuralNetwork presented below implements a basic perceptron (i.e., single-layer ANN). The implementation includes the following steps:

1. Generate and initialize a new object (named ANN) based on the FirstNeuralNetwork class, to initiate the perceptron model (lines 46 and 5–10). Instead of reading the weights from a data file, these are randomly generated as an array of 3 × 1 values, ranging from −1 to 1. The calculation uses the following formula: (max − min) * randomset(rows × columns) + min. Hence, in this case, the formula will be (1 − (−1)) * np.random.random((3, 1)) + (−1) = 2 * np.random.random((3, 1)) − 1. The reader should keep in mind that by using the seed() method with a particular parameter, in this case 1, the random sequence of numbers will always be the same. If it is preferred to have a different sequence of numbers every time the script runs, the seeding line should not be included.
2. Instead of reading the training inputs and outputs from a dataset, these are given as arrays of values (lines 49–52). Since the dot method will be used on the inputs and weights to calculate their sum, it is necessary that the number of columns of the former matches the number of rows of the latter (in this case 3).
3. Call the Training() method to train the model (line 56). For optimum training results, it is necessary to define the number of required iterations. The number is rather subjective; however, empirical experience suggests that a number of iterations between 10,000 and 15,000 is sufficient.
4. Use the dot() method to calculate the weighted sum of the inputs and their weights (lines 38–42).
5. Use the Sigmoid() method (lines 12–15) to calculate the output based on the result of the dot() method in step 4 (lines 41–42).
6. An optional step would be to calculate the training process error as the difference between the training output (originally provided) and the calculated output. There are various ways to calculate this error, depending on the required level of accuracy. In this case, the error is calculated based on the last iteration of the training process (lines 28–36).
7. Another optional step would be to adjust the weights vector, based on the error calculated in the previous step (line 34).
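The initialization formula from step 1 can be checked in isolation before reading the full listing. The sketch below is only an illustration under stated assumptions: the helper random_weights and its parameters (low, high, seed) are hypothetical names and do not appear in the FirstNeuralNetwork class, which hard-codes the range −1 to 1 and the seed 1.

```python
import numpy as np

def random_weights(rows, columns, low=-1.0, high=1.0, seed=None):
    """Return a rows x columns array drawn uniformly from [low, high).

    Hypothetical helper: applies (max - min) * random + min from step 1.
    """
    if seed is not None:
        np.random.seed(seed)  # a fixed seed gives the same sequence on every run
    return (high - low) * np.random.random((rows, columns)) + low

# Same shape, range and seed as in the class constructor
weights = random_weights(3, 1, seed=1)
print(weights)
```

Calling the helper twice with the same seed returns identical weights, while omitting the seed gives a different array on each call, which is the behavior described for the seeding line.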
 1  import numpy as np
 2
 3  class FirstNeuralNetwork():
 4
 5      def __init__(self):
 6          # Seed the random generator so that every run gives the same weights
 7          np.random.seed(1)
 8          # Generate the weights as a 3x1 matrix with values from -1 to 1
 9          # (random values in [0, 1) multiplied by 2, minus 1: a mean of 0)
10          self.weights = 2 * np.random.random((3, 1)) - 1
11
12      def Sigmoid(self, x):
13          # Use the sigmoid method to calculate the output
14          sigmoid = 1 / (1 + np.exp(-x))
15          return sigmoid
16
17      def SigmoidDerivative(self, x):
18          derivative = x * (1 - x)  # x is expected to be a sigmoid output
19          return derivative
20
21      def Training(self, trainingInputs, trainingOutputs,
22                   trainingIterations):
23          # Train the model for continuous adjustment of the weights
24          for iteration in range(trainingIterations):
25              # Pass the training data through the neuron
26              output = self.NeuronThinking(trainingInputs)
27
28              # Compute the error for back-propagation
29              theError = trainingOutputs - output
30
31              # Perform weight adjustments during the training phase
32              theAdjustments = np.dot(trainingInputs.T, theError *
33                  self.SigmoidDerivative(output))
34              self.weights += theAdjustments
35          print("\nThe calculated error vector of the training process is: \n",
36              theError)
37
38      def NeuronThinking(self, inputs):
39          # Pass the inputs through the neuron
40          inputs = inputs.astype(float)
41          output = self.Sigmoid(np.dot(inputs, self.weights))
42          return output
43
44  if __name__ == "__main__":
45      # Create an object based on the FirstNeuralNetwork neuron class
46      ANN = FirstNeuralNetwork()
47      print("Randomly Generated Weights:\n", ANN.weights)
48
49      # Train the data with 4 input values and 1 output
50      trainingInputs = np.array([[0,0,1], [1,1,1], [1,0,1], [0,1,1]])
51      print("\nThe training inputs:\n", trainingInputs)
52      trainingOutputs = np.array([[0],[1],[1],[0]])
53      print("\nThe training output:\n", trainingOutputs)
54
55      # Call the Training method to train the model
56      ANN.Training(trainingInputs, trainingOutputs, 15000)
57      print("\nThe adjusted weights vector is:\n", ANN.weights)
58
59      firstInput = input("\nProvide first input: ")
60      secondInput = input("Provide second input: ")
61      thirdInput = input("Provide third input: ")
62      print("The three inputs are: ", firstInput, secondInput,
63          thirdInput)
64      print("The new data is projected to be: ")
65      print(ANN.NeuronThinking(np.array([firstInput, secondInput,
66          thirdInput])))

Output 11.3.1: Test it with 1, 0, 0 and 0, 1, 0

Output test 1

Randomly Generated Weights:
[[-0.16595599]
 [ 0.44064899]
 [-0.99977125]]

The training inputs:
[[0 0 1]
 [1 1 1]
 [1 0 1]
 [0 1 1]]

The training output:
[[0]
 [1]
 [1]
 [0]]

The calculated error vector of the training process is:
[[-0.00786416]
 [ 0.00641397]
 [ 0.00522118]
 [-0.00640343]]

The adjusted weights vector is:
[[10.08740896]
 [-0.20695366]
 [-4.83757835]]

Provide first input: 1
Provide second input: 0
Provide third input: 0
The three inputs are: 1 0 0
The new data is projected to be:
[0.9999584]

Output test 2

Randomly Generated Weights:
[[-0.16595599]
 [ 0.44064899]
 [-0.99977125]]

The training inputs:
[[0 0 1]
 [1 1 1]
 [1 0 1]
 [0 1 1]]

The training output:
[[0]
 [1]
 [1]
 [0]]

The calculated error vector of the training process is:
[[-0.00786416]
 [ 0.00641397]
 [ 0.00522118]
 [-0.00640343]]

The adjusted weights vector is:
[[10.08740896]
 [-0.20695366]
 [-4.83757835]]

Provide first input: 0
Provide second input: 1
Provide third input: 0
The three inputs are: 0 1 0
The new data is projected to be:
[0.44844546]

11.3.2 Sigmoid and Rectified Linear Unit (ReLU) Methods

Both sigmoid and rectified linear unit (ReLU) are activation methods used in DL. The sigmoid method is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

One of the drawbacks of the sigmoid method is that it slows down the DL process in case of big data inputs, as it takes time to make the necessary calculations. This is especially true when the input is a large number. For this reason, it is mostly used when its output is expected to fall in the range between 0 and 1, much like a probability output.

Observation 11.12 – The Sigmoid Method: It takes input values in a range and calculates the relevant output values given a specific formula. The output is always probabilistic, ranging from 0 to 1. The method is slow with big data, and particularly with large numbers.

In most cases, the ReLU method is used instead. The concept of this method is simple: if the input value is higher than or equal to 0, it is returned as output unchanged; if it is lower, the method returns 0 as output. The method is particularly useful as it is rather fast, regardless of the input. The obvious problem with ReLU is that it ignores the negative input values, thus not mapping them into the output.

Observation 11.13 – The Rectified Linear Unit (ReLU) Method: It takes input values in a range. For each input higher than or equal to 0, it results in the same value as the input. For each input value lower than 0, it results in 0. An important restriction with this method is that it ignores negative values.

The following script creates a sequence of input floats ranging from −10 to 10. Next, it calculates the outputs for each of the inputs using the sigmoid method and the outputs using ReLU.
Finally, it plots the results of the inputs and outputs for both cases:

# Import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# linspace(start, end) creates a sequence of evenly spaced floats
# (50 values by default)
x = np.linspace(-10, 10)
print("The generated array of floats is: \n", x)

# Use the sigmoid function to calculate the output
sigmoid = 1/(1 + np.exp(-x))
print("\nThe calculated array of sigmoids is: \n", sigmoid)

# Create the Numpy array for the ReLU results & initialize with zeros
relu = np.zeros(len(x))

# Use the ReLU function to calculate the ReLU output based on the input
for i in range(len(x)):
    if x[i] > 0:
        relu[i] = x[i]
    else:
        relu[i] = 0.0
print("\nThe resulting array of ReLU is: \n", relu)

# Plot the sigmoid outputs against the inputs
plt.plot(x, sigmoid)
plt.xlabel("x")
plt.ylabel("Sigmoid(X)")
plt.title("The sigmoid function for inputs -10 to 10")
plt.show()

# Plot the ReLU outputs against the inputs
plt.plot(x, relu)
plt.xlabel("x")
plt.ylabel("ReLU(X)")
plt.title("The ReLU function for inputs -10 to 10")
plt.show()

Output 11.3.2:

The generated array of floats is:
[-10.          -9.59183673  -9.18367347  -8.7755102   -8.36734694
  -7.95918367  -7.55102041  -7.14285714  -6.73469388  -6.32653061
  -5.91836735  -5.51020408  -5.10204082  -4.69387755  -4.28571429
  -3.87755102  -3.46938776  -3.06122449  -2.65306122  -2.24489796
  -1.83673469  -1.42857143  -1.02040816  -0.6122449   -0.20408163
   0.20408163   0.6122449    1.02040816   1.42857143   1.83673469
   2.24489796   2.65306122   3.06122449   3.46938776   3.87755102
   4.28571429   4.69387755   5.10204082   5.51020408   5.91836735
   6.32653061   6.73469388   7.14285714   7.55102041   7.95918367
   8.36734694   8.7755102    9.18367347   9.59183673  10.        ]

The calculated array of sigmoids is:
[4.53978687e-05 6.82792246e-05 1.02692018e-04 1.54446212e-04
 2.32277160e-04 3.49316192e-04 5.25297471e-04 7.89865942e-04
 1.18752721e-03 1.78503502e-03 2.68237328e-03 4.02898336e-03
 6.04752187e-03 9.06814944e-03 1.35769169e-02 2.02816018e-02
 3.01959054e-02 4.47353464e-02 6.58005831e-02 9.57904660e-02
 1.37437932e-01 1.93321370e-01 2.64947903e-01 3.51547277e-01
 4.49155938e-01 5.50844062e-01 6.48452723e-01 7.35052097e-01
 8.06678630e-01 8.62562068e-01 9.04209534e-01 9.34199417e-01
 9.55264654e-01 9.69804095e-01 9.79718398e-01 9.86423083e-01
 9.90931851e-01 9.93952478e-01 9.95971017e-01 9.97317627e-01
 9.98214965e-01 9.98812473e-01 9.99210134e-01 9.99474703e-01
 9.99650684e-01 9.99767723e-01 9.99845554e-01 9.99897308e-01
 9.99931721e-01 9.99954602e-01]

The resulting array of ReLU is:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.20408163  0.6122449   1.02040816  1.42857143  1.83673469
  2.24489796  2.65306122  3.06122449  3.46938776  3.87755102  4.28571429
  4.69387755  5.10204082  5.51020408  5.91836735  6.32653061  6.73469388
  7.14285714  7.55102041  7.95918367  8.36734694  8.7755102   9.18367347
  9.59183673 10.        ]

11.3.3 A Real-Life Example: Preparing the Dataset

Observation 11.14 – The sample() Method: Use this Pandas method with the frac and random_state parameters, to define a sample from the original set to be used in the DL process.

The basic tasks when creating a multi-layer NN are to create, compile and fit the model, to plot the associated observations and data if necessary, and to evaluate it. Among the most important concepts in DL are the sequential model, the dense class, the activation class, and adding layers to the model. A detailed analysis of these topics is beyond the scope of this chapter and the reader is encouraged to consider related sources specializing in DL.
Nevertheless, a relatively common real-life example is examined in order to showcase and introduce some of the basic associated notions. This is split into a number of distinct steps, presented in the following sections.

The first step involves reading a dataset from a CSV file (diabetes.csv) and taking a random sample (i.e., 70%) of its rows to use as a training dataset (frac parameter). For the same input, the sample will also be the same, as a result of the random_state = 0 parameter. Next, the rows of the training sample are dropped from the original dataset (using their index), so that the remaining rows form the test dataset. Finally, the NN is optimized by scaling the dataset values to a range between 0 and 1:

import pandas as pd
import numpy as np

# Step 1: Read the csv file
MyDataFrame = pd.read_csv('diabetes.csv')
MyDataSource = MyDataFrame.to_numpy()
X = MyDataSource[:,0:8]
y = MyDataSource[:,8]

# Step 2: Use frac to split dataset to the train & test parts (70/30)
# Use random_state to return the same sample rows on every run
# Drop the training rows (by index) to form the test part and print
# the first 4 rows of the training part
# Scale the dataset values to [0, 1] to optimize the NN
My_train = MyDataFrame.sample(frac = 0.7, random_state = 0)
My_test = MyDataFrame.drop(My_train.index)
print(My_train.head(4))
maxTo = My_train.max(axis = 0)
minTo = My_train.min(axis = 0)
My_train = (My_train - minTo) / (maxTo - minTo)
My_test = (My_test - minTo) / (maxTo - minTo)

# Split the features and the target
Xtrain = My_train.drop('Outcome', axis = 1)
Xtest = My_test.drop('Outcome', axis = 1)
Ytrain = My_train['Outcome']
Ytest = My_test['Outcome']
print("\nThe dataset contains", Xtrain.shape[0], "rows and",
    Xtrain.shape[1], "columns")

Output 11.3.3:

     Pregnancies  Glucose  ...  Age  Outcome
661            1      199  ...   22        1
122            2      107  ...   23        0
113            4       76  ...   25        0
14             5      166  ...   51        1

[4 rows x 9 columns]

The dataset contains 538 rows and 8 columns

Number 8 in the output indicates the number of inputs, i.e., the number of features in the dataset.

11.3.4 Creating and Compiling the Model

The next step involves the creation of four different models as a way to examine different scenarios. Firstly, the Keras and Layers libraries (TensorFlow package) are imported. These libraries are necessary in order to create the DL model and define its details. Next, the four models are created. SimpleModel consists of only the input and the output layers, with the former having just 12 neurons. MakeItWider doubles the number of neurons, keeping the same basic layers. MakeItDeeper keeps the number of neurons the same as in the case of SimpleModel, but adds a third (hidden) layer between the input and the output. Finally, FinalModel defines a significant number of neurons per layer (a rather common case) and adds two layers between the input and the output.

In all four cases, the DL models are created following the sequential approach. This simply means that each layer builds upon the input from the previous layer, thus connecting all layers to each other. The minimum number of layers in any DL model is 2: the input and the output. Any other layer is a hidden layer. There is no consensus as to what is the correct number of neurons per layer, although there are some suggested mathematical formulae on how to determine this number. As a rough guide, the reader should note that a number between 500 and 1,000 neurons per layer is commonly used. It must also be noted that the various layers in the NN do not have to consist of the same number of neurons.

Observation 11.15 – Sequential Approach: Each of the layers of the NN builds on the input from its previous layer, ensuring that all layers are connected to each other.

The activation parameter defines the activation method applied to the output of each layer. In all four cases of this example, the ReLU method is selected. The optional input_shape parameter defines the number of features in the NN model (i.e., in this case 8). This number corresponds to the columns of the dataset excluding the index (which is not used) and the output (i.e., the Outcome column).

Once the models are created, they must be compiled. Compilation prepares the model for training and for the adjustment of its weights, a task that is handled by the so-called backend processes. The backend determines the best network representation for training and testing, and for making predictions on the specified hardware (i.e., either GPU or CPU). It also supports distributed computing such as Hadoop/MapReduce. At the time of writing, Theano and TensorFlow are among the most commonly used libraries. In terms of the associated methods/parameters used in all four cases of this example, the loss method of choice is mae, the optimizer is adam, and the metric is accuracy. These methods/parameters are discussed in more detail in the following section.
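To make the effect of the number of features and of the layer sizes more concrete, the adjustable parameters of each of the four models can be counted by hand. This is a minimal sketch in plain Python rather than Keras: the helper names dense_parameters and count_parameters are hypothetical, and it assumes that every neuron in a dense layer has one weight per input plus one bias. The layer sizes are those described above, with 600 neurons per layer for FinalModel as in the script of this section.

```python
def dense_parameters(inputs, neurons):
    """Weights (inputs x neurons) plus one bias per neuron in a dense layer."""
    return inputs * neurons + neurons

def count_parameters(features, layer_sizes):
    """Total adjustable parameters in a sequential stack of dense layers."""
    total = 0
    previous = features
    for size in layer_sizes:
        total += dense_parameters(previous, size)
        previous = size  # the output of this layer feeds the next one
    return total

features = 8  # columns of the dataset without the Outcome column
simple = count_parameters(features, [12, 1])
wider = count_parameters(features, [24, 1])
deeper = count_parameters(features, [12, 12, 1])
final = count_parameters(features, [600, 600, 600, 1])

print("SimpleModel: ", simple)
print("MakeItWider: ", wider)
print("MakeItDeeper:", deeper)
print("FinalModel:  ", final)
```

The count grows modestly when the model is made wider or deeper, but very quickly once every layer has hundreds of neurons, which is one reason why the number of neurons per layer is a design decision rather than a fixed rule.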
The additional part of the script is the following:

from tensorflow import keras
from tensorflow.keras import layers

# Step 3: Prepare the models for testing and compiling
# Prepare a simple model
SimpleModel = keras.Sequential([layers.Dense(12, activation = 'relu'),
    layers.Dense(1)])
SimpleModel.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Make the model wider by doubling the neurons of the layer
MakeItWider = keras.Sequential([layers.Dense(24, activation = 'relu'),
    layers.Dense(1)])
MakeItWider.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Make the model deeper by adding another layer
MakeItDeeper = keras.Sequential([layers.Dense(12, activation = 'relu'),
    layers.Dense(12, activation = 'relu'), layers.Dense(1)])
MakeItDeeper.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

# Prepare the final model with many neurons and two more layers
FinalModel = keras.Sequential([
    layers.Dense(600, activation = 'relu', input_shape = [8]),
    layers.Dense(600, activation = 'relu'),
    layers.Dense(600, activation = 'relu'),
    layers.Dense(1)])
FinalModel.compile(loss = 'mae', optimizer = 'adam',
    metrics = ['accuracy'])

Notice that there is no output for the above script, which serves as a preparation step.

11.3.5 Stochastic Gradient Descent and the Loss Method and Parameters

Stochastic gradient descent (SGD) is a family of algorithms aiming to optimize the weights for the best possible mapping of inputs to outputs. The selected algorithm is defined by the optimizer parameter/method, which at present is most often adam. The loss parameter/method deals with the measurement of the quality of the NN predictions. In simple terms, it measures the disparity between predicted values and desired values.
Several loss method options are available, including mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE).

Observation 11.16 – Stochastic Gradient Descent (SGD): A family of algorithms aiming to optimize the weights for the best mapping of inputs to outputs.

Observation 11.17 – Method loss Parameters: Select from a number of available mathematical methods to calculate the loss resulting from the process (e.g., mean square error, root mean square error, and mean absolute error).

MSE is amongst the most well-known methods. It calculates the average (mean) of the squared differences between the real observations and the predictions. In the following formulae, x_k denotes the true observations, x'_k the predictions, and K the number of observations. The mathematical equation for this particular method is the following:

$$\mathrm{MSE} = \frac{1}{K}\sum_{k=1}^{K} (x_k - x'_k)^2$$

RMSE is one of the most popular and, possibly, most accurate methods. It calculates the square root of the MSE. Its mathematical equation is the following:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K}\sum_{k=1}^{K} (x_k - x'_k)^2}$$

Finally, MAE is calculated as the mean of the absolute errors between the real and the predicted observations, as in the following formula:

$$\mathrm{MAE} = \frac{1}{K}\sum_{k=1}^{K} \left| x_k - x'_k \right|$$

The following script showcases the use of all three loss measuring methods discussed above:

import numpy as np

# Define the actual and the predicted values as np arrays
actual = np.array([1.8, 2, 1.9])
print("The actual observations are: \n", actual)
predicted = np.array([2, 1.7, 1.7])
print("\nThe predicted observations are: \n", predicted)

# Array calculated on the differences between the 2 sets of values
difference = predicted - actual
print("\nThe differences in the observations are: \n", difference)

# Calculate the array based on the squares of the differences
squareOfDifferences = difference ** 2
print("\nThe squares of the differences of the observations: \n",
    squareOfDifferences)

# Calculate the mean square error for the observations
MSE = squareOfDifferences.mean()
print("\nThe Mean Square Error is calculated as: ", MSE)

# Calculate the root of the mean of the square of the differences
meanSquareDifferences = squareOfDifferences.mean()
RMSE = np.sqrt(meanSquareDifferences)
print("\nThe root mean of square of differences is: ", RMSE)

# Calculate the mean of the absolute error of the differences
absoluteDifferences = np.absolute(difference)
meanAbsoluteDifference = absoluteDifferences.mean()
print("\nThe mean of the absolute differences of the observations is: ",
    meanAbsoluteDifference)

Output 11.3.5:

The actual observations are:
[1.8 2.  1.9]

The predicted observations are:
[2.  1.7 1.7]

The differences in the observations are:
[ 0.2 -0.3 -0.2]

The squares of the differences of the observations:
[0.04 0.09 0.04]

The Mean Square Error is calculated as: 0.056666666666666664

The root mean of square of differences is: 0.23804761428476165

The mean of the absolute differences of the observations is: 0.2333333333333333

11.3.6 Fitting and Evaluating the Models, Plotting the Observed Losses

The next step involves the fitting of the various models, as well as the plotting of the relevant observations. The reader can follow the implementation of this step in the following script, taking note of the following:

1. For practical reasons, the number of iterations during model training is set to 5 (as defined by the epochs parameter). It must be noted that this number is too small to be truly effective, but it is sufficient for demonstration purposes. In reality, this number is expected to be at least three digits long (i.e., between 100 and 1,000).
2. The fitting process trains the models in batches of 300 rows of training data (as defined by the batch_size parameter).
3. The observations from the four different models are plotted together using the plot method (matplotlib.pyplot library).
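The interplay between the epochs and batch_size parameters can be worked out with simple arithmetic. The sketch below is an illustration only: the helper steps_per_epoch is a hypothetical name, and it assumes that the last, smaller batch of each epoch is still processed. The 538 rows are the size of the training sample reported in Output 11.3.3.

```python
import math

def steps_per_epoch(rows, batch_size):
    """Number of batches processed in one pass over the training set."""
    return math.ceil(rows / batch_size)

train_rows = 538   # rows in the training sample (Output 11.3.3)
batch_size = 300
epochs = 5

steps = steps_per_epoch(train_rows, batch_size)
total_updates = steps * epochs

# One batch of 300 rows followed by one batch with the remaining 238 rows
print("Batches per epoch:", steps)
print("Total batches over all epochs:", total_updates)
```

Under these assumptions each epoch consists of two batches, so a smaller batch_size would mean more weight adjustments per epoch at the cost of more processing steps.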
Observation 11.18 – The epochs Parameter: Used to define the number of iterations over the training set during the training/fitting step. Usually, the number is in the hundreds.

Observation 11.19 – The batch_size Parameter: Used to define the number of rows to be processed at a time during the training/fitting step.

import matplotlib.pyplot as plt

# Step 4: Fit the models and plot the observations
# Fit the SimpleModel
print("\nThe observation epochs for the simple model: \n")
Observations1 = SimpleModel.fit(Xtrain, Ytrain,
    validation_data = (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the SimpleModel observation history
Observation1DataFrame = pd.DataFrame(Observations1.history)

# Fit the MakeItWider model
print("\nThe observation epochs for the wider model: \n")
Observations2 = MakeItWider.fit(Xtrain, Ytrain,
    validation_data = (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the MakeItWider observation history
Observation2DataFrame = pd.DataFrame(Observations2.history)

# Fit the MakeItDeeper model
print("\nThe observation epochs for the deeper model: \n")
Observations3 = MakeItDeeper.fit(Xtrain, Ytrain,
    validation_data = (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the MakeItDeeper observation history
Observation3DataFrame = pd.DataFrame(Observations3.history)

# Fit the FinalModel model
print("\nThe observation epochs for the final model: \n")
Observations4 = FinalModel.fit(Xtrain, Ytrain,
    validation_data = (Xtest, Ytest), batch_size = 300, epochs = 5)
# Prepare the dataframe from the FinalModel observation history
Observation4DataFrame = pd.DataFrame(Observations4.history)

# Plot the observations from the 4 models
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("History of observations of loss")
Observation1DataFrame['loss'].plot(label = "Simple model")
Observation2DataFrame['loss'].plot(label = "Make it wider")
Observation3DataFrame['loss'].plot(label = "Make it deeper")
Observation4DataFrame['loss'].plot(label = "Final model")
plt.legend()
plt.grid()
plt.show()

Epoch 1/5
2/2 [==============================] - 3s 946ms/step - loss: 0.4762 - accuracy: 0.6450 - val_loss: 0.4424 - val_accuracy: 0.6652
Epoch 2/5
2/2 [==============================] - 0s 118ms/step - loss: 0.4627 - accuracy: 0.6450 - val_loss: 0.4290 - val_accuracy: 0.6652
Epoch 3/5
2/2 [==============================] - 0s 122ms/step - loss: 0.4491 - accuracy: 0.6450 - val_loss: 0.4161 - val_accuracy: 0.6652
Epoch 4/5
2/2 [==============================] - 0s 106ms/step - loss: 0.4359 - accuracy: 0.6450 - val_loss: 0.4037 - val_accuracy: 0.6652
Epoch 5/5
2/2 [==============================] - 0s 96ms/step - loss: 0.4231 - accuracy: 0.6450 - val_loss: 0.3924 - val_accuracy: 0.6652

2/2 [ Epoch 5/5 2/2 [ Epoch 4/5 2/2 [ Epoch 3/5 2/2 [ Epoch 2/5 2/2 [ Epoch 1/5 ] - 0s 45ms/step - loss: 0.3856 - accuracy: 0.6450 - val_loss: