Computer Science Extended Essay
Higher Level
Candidate number: khq293
Word Count: 3999

Title: Assessing the effectiveness of processing units used in machine learning.

Research question: To what extent does the usage of the Apple M1 chip Neural Engine, GPU, and CPU processing units alter the efficiency and effectiveness of running machine learning models and training neural networks?

Acknowledgements:
I would like to thank my EE supervisor, Ms. Natalia, for the meritorious, professional, and academic support she provided during the process of writing this research paper. I would like to thank Mr. Mateusz for his equally nerdy excitement, support, and engagement in the models' training process. I would like to thank my brother Maksymilian and my classmate Mikołaj for enabling me to use their computing devices to complete the experiments. I would also like to thank Mr. Bartłomiej (Cortland salon manager) for providing me with articles and insights regarding the Apple M1 Pro SoC and its neural processing unit.

Table of contents:
1. Introduction
2. Background information
2.1. Biological Neural Networks
2.2. Artificial Neural Networks
2.3. Mathematics as the backend of neural networks
2.4. CPUs
2.5. GPUs
2.6. Apple's SoC - 16-core Neural Engine
2.7. Software used in Machine Learning
2.8. Terminology
3. Methodology
3.1. Computing devices used in benchmarks
3.2. Inference benchmarking - OsiriX Lite - AI segmentation
3.3. Benchmarking - neural networks training
4. Experiment Results
5. Study Limitations and research opportunities
6. Conclusion
7. Bibliography
8. Appendix

1. Introduction

AI is evolving at an astounding rate: producing creative images, increasing our privacy through technology such as facial or iris recognition, and augmenting game frames. But what are the components performing all the logic operations, allowing us to enjoy the benefits of not having an "if-else" code structure? Which processing units output certain results quicker, or with greater accuracy? I have always wondered how it is possible to type the object or person I am looking for into the "images" app and have all the results from a large gallery of 5000 pictures appear within milliseconds. This research paper explores the variables and factors affecting the efficiency and effectiveness of multiple processing units in the training and utilization of neural network models. Experiments using artificial-intelligence-accelerated segmentation of computed tomography images were conducted on Apple MacBook computers from various Intel-, ARM-, and AMD-based generations, demonstrating the definitive superiority of the M1 Pro SoC in running pretrained models. To evaluate the processing units' ability to train the models, three different "PyTorch" machine learning projects, covering handwritten-digit recognition with a multilayer perceptron, convolutional image processing, and image classification, were used to obtain numerical, time-, efficiency-, and effectiveness-related outcomes. By combining the results of both experiments, I was able to reach a conclusion that not only identifies the best processing unit for each of those two functions, but also explains how specific computational components, such as the number of FPUs, cache memory, CUDA cores, or software processing schemes, and their metrics determine one unit's performance superiority over another.

2. Background Information
2.1. Biological Neural Networks

[Figure: artificial neuron vs. biological neuron (Frumusanu, 2021)]

To understand the artificial neural network, it is necessary to understand how the biological brain learns and processes information. In the figure above, we can see both biological and artificial neurons. Dendrites are inputs, nuclei are nodes, and axons are the output, while the cell body relates to all the hidden layers. However, unlike an artificial neural network, the brain is designed to be multi-tasking and capable of performing a variety of tasks. Additionally, it becomes confused when several things are learned in a short span of time, which makes new information difficult to classify and place in long-term memory; that is instead achieved through intense emotions or in a habitual manner.

2.2. Artificial Neural Networks

The concept of artificial neural networks (ANNs) is part of the broader umbrella term "AI", or artificial intelligence, which is a field of computer science researching and developing ways in which computer algorithms can automate tasks that would otherwise be performed by humans. AI is commonly subdivided into three areas: machine learning, deep learning, and artificial neural networks.

[Figure: AI, machine learning, deep learning, and ANNs as nested fields (own graphics)]

Machine learning (ML) is a type of AI that uses past data about a solution to a problem to create an algorithm-based model that solves the problem more accurately. Deep learning is a subset of machine learning that employs multiple artificial neural network architectures and layers to process past data and produce the most accurate results. It is commonly used when automation is difficult and more sophisticated algorithms are required. However, there is no set number of layers or algorithms that distinguishes regular machine learning from deep learning; the distinction is rather conventional.

ANNs, which belong to the lowest subcategory, resemble biological neural networks: all the nodes, or neurons, correspond to information in the form of inputs, weights, biases, and activation functions such as "sigmoid" and "tanh" that act as solution-modeling functions for the whole network. Inputs refer to data factors that together formulate a certain outcome; for example, certain placements of lit pixels on the screen result in a larger picture, which is an outcome. Weights describe the "importance" of each such placement, scaling the input value before it is passed to the activation function; the activation function corresponds to the chemical reactions in the brain that determine whether that placement is "important enough" for the configuration to be activated and the picture recognized as a certain object. The whole process of checking whether the function has identified the configuration of pixels as the correct corresponding object, together with the alteration of the prediction process, is called training.

From birth, the brain receives inputs as information from the outside world; however, unlike machines, where everything operates on the basis of 1s and 0s, it is impossible to program the brain to pick only the relevant information and ignore all other data. Training of a neural network is based on thousands of repetitions of that process, ultimately altering the weights, biases, and activation function in such a way that the network gets more accurate with every next repetition, or epoch, on that dataset. The more inputs, data, and layers, the longer it takes to train the neural network.
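To make the interplay of inputs, weights, bias, and activation function concrete, here is a minimal illustrative sketch of a single artificial neuron in Python. The input values, weights, and learning rate are made up for the example; this is a toy model, not code from any of the networks benchmarked later in this essay.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])    # inputs (e.g. pixel intensities)
w = np.array([0.4, -0.2, 0.7])   # weights: the "importance" of each input
b = 0.1                          # bias: shifts the activation threshold
target = 1.0                     # desired output for this configuration
lr = 0.5                         # learning rate

for epoch in range(3):
    y = sigmoid(np.dot(w, x) + b)   # forward pass: weighted sum + bias -> activation
    error = y - target              # distance of the prediction from the target
    grad = error * y * (1 - y)      # gradient of a squared-error loss through the sigmoid
    w -= lr * grad * x              # nudge the weights to reduce the error
    b -= lr * grad                  # nudge the bias as well
    print(f"epoch {epoch}: output={y:.4f}, error={error:+.4f}")

Every repetition, or epoch, moves the weights and bias so that the output gets closer to the target, which is exactly the repetition-based training described above.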
The stage after the training of the ANN is referred to as "inference": running the trained model on completely new data in order to obtain valid output.

2.3. Mathematics as the backend of neural networks

In every case, terminology such as biases, weights, activation functions, or models has a mathematical foundation. As the very first layer of the neural network is the input layer, it is important to note that every input variable acts as an input node, which is directly connected to the following layers. Every node (except the input nodes) has a weighted connection with the next layer, acting as an important variable in determining the output. The larger the value that the weight carries, the bigger the significance, or multiplier. Similarly, biases serve as the error-correction variable in the neural network equation, which takes the form (Neural Network, LearnOpenCV):

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$

Each node of the next layer sums all the nodes from the previous layer multiplied by the weights of their connections, and then adds the bias variable, which shifts the function depending on its value. The result then passes through the activation function $f$ that the developer decided is the best fit for the problem, producing the value of the next node. Activation functions divide into two major groups: linear and non-linear functions (graphs originally drawn on a Casio FG-CG50):

Linear function: $f(x) = x$

Non-linear functions:
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Tanh (hyperbolic tangent): $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$ (ReLU activation function, ResearchGate)

Such an activation function will not return an output to the next hidden layer unless the inputs, altered by the weights and biases, fall within the range of the activation function's values.

Matrix multiplications (dishashree26, 2022) are the part of the computation that makes running neural network training and models particularly demanding. Matrices are rectangular, multidimensional arrays that can store information such as numbers and characters. In programming, they multiply in such a way that the number of columns in the first matrix must equal the number of rows in the second matrix, and the output is a third matrix with the number of rows of the first matrix and the number of columns of the second.
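The shape rule described above can be verified in a few lines of Python; the matrix sizes here are arbitrary examples.

import numpy as np

A = np.random.rand(2, 3)   # 2 rows, 3 columns
B = np.random.rand(3, 4)   # 3 rows, 4 columns: A's column count equals B's row count
C = A @ B                  # valid product
print(C.shape)             # (2, 4): the rows of A and the columns of B

# Mismatched inner dimensions are rejected:
# np.random.rand(2, 3) @ np.random.rand(4, 5) would raise a ValueError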
2.4. CPUs (Syed, 2020)

The CPU (Central Processing Unit) is the computer's brain, and it follows the instruction cycle to perform large numbers of uniform, simple computations. The most common instruction cycle is fetch-decode-execute-store. For instance, when we turn on the computer, depending on the configuration, instructions for the CPU come from the read-only memory, ROM (used for the BIOS, the essential computer software) (Does BIOS need to be loaded into main memory to be executed by CPU?, Super User). They are later passed forward, either straight to the random access memory (RAM) or to the CPU's superfast L1 cache memory, to be later passed to RAM anyway. After the instructions are provided to primary memory (the cache and RAM), the Control Unit (CU) receives the instructions and stores them in the CIR (Current Instruction Register) and later the SCR (Sequence Control Register) to coordinate the order of performing calculations and logic operations in the Arithmetic Logic Unit (ALU). The instructions and data are passed along paths called address and data buses from the CU to the ALU, which performs the calculations and outputs the data back over the buses to store it in the immediate access store (IAS), in the form of the L1 CPU cache, and pass it forward to RAM before fetching another set of instructions.

[Figure: the fetch-decode-execute-store cycle (own graphics)]

Depending on the computer we need and use, there may be CPUs that use the Complex Instruction Set Computer (CISC) architecture, which is common for x86 Intel and AMD CPUs, or the Reduced Instruction Set Computer (RISC) architecture, which is more commonly used with ARM technology. ARM focuses on developing chipsets meant for one specific OS or task, such as the FSD (Full Self-Driving) chip made for Tesla: a reduced-instruction-set computer meant to run its ML model faster and perform multiple calculations in parallel. Two major factors describing the computing power of a CPU are its clock speed and cache memory size. Clock speed refers to how many of the earlier-mentioned cycles are executed per second, usually expressed in GHz (billions of cycles per second).

2.5. GPUs (Typical Nvidia GPU architecture, ResearchGate)

A graphics processing unit (GPU) is a different kind of processing unit, used for tasks requiring an immense number of operations that need to be performed and delivered very quickly. As its name suggests, its initial and major purpose is graphics processing, which needs to be continuous and performed in real time. In contrast to CPUs, where we can work with up to 8-16 cores in commercial-level and 64 cores in industrial-level CPUs, GPUs consist of hundreds of Compute Unified Device Architecture (CUDA) cores, also often referred to as "floating point" cores. GPUs are characterized by higher memory bandwidth than CPUs, which describes the maximum amount of data transferred in a given range of time. Their cores are less sophisticated and hence smaller, which enables them to be implemented in larger quantities and to process more calculations in parallel. GPUs are used for machine learning most often because of the hundreds of available cores that can either work together with their own level-1 cache memory or in groups of 32 cores, called warps, which enable the GPU to process even larger parallel computations and allocate even more memory to those tasks. Such an architecture lets the GPU perform the calculations machine learning relies on almost exclusively, such as matrix multiplication.

[Figure: GPU structure (Exploring the GPU architecture, VMware)]

An example of a GPU is shown above, where each of the two major units has its own GDDR5 RAM memory, which is controlled by the memory controller and joined with level-2 high-speed cache memory. The GPU grid at the highest level of the hierarchy is built of multiprocessor blocks (clusters), which consist of streaming multiprocessors characterized by "streams" of data that are processed with the aim of real-time computation. These multiprocessors in turn consist of processors with their own higher-speed (primary) cache, consisting (in this example) of 64 cores each.
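The practical effect of this parallelism can be illustrated with a short PyTorch sketch that times the same large matrix multiplication on the CPU and, if one is available, on an Nvidia GPU. The matrix size is arbitrary and the measured times will vary from machine to machine.

import time
import torch

a = torch.rand(4096, 4096)   # a large matrix multiplication is exactly the kind
b = torch.rand(4096, 4096)   # of parallel workload GPUs are built for

start = time.time()
c = a @ b                                 # runs on the CPU
print(f"CPU:  {time.time() - start:.3f} s")

if torch.cuda.is_available():             # Nvidia GPU with CUDA cores
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()              # GPU calls are asynchronous; wait before timing
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()              # wait for the multiplication to actually finish
    print(f"CUDA: {time.time() - start:.3f} s")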
2.6. Apple's SoC - 16-core Neural Engine (admin_mirabilis, 2021; Frumusanu, 2021)

A System on a Chip (SoC) is an integrated circuit that houses all of the computer's major processing units, including the GPU, CPU, RAM, and cache memory, and, in the case of the Apple M1 SoC chipsets, even a 16-core neural network accelerator engine (Hollance, no date; Apple M1 chip, 2021). The fact that it is all put together on a silicon circuit reduces the latency caused by the distance between the components placed on a standard motherboard architecture. With that, Apple has given us the option to choose how much memory we want to have in the SoC, which acts as an astoundingly fast memory integrated across the central, graphics, and neural engine units: it can be accessed quicker, is more power efficient, and supports more calculations. This particular SoC is built on the Advanced RISC Machines (ARM) architecture, which, combined with Apple's operating system (developed with the use of Apple's own higher-level "Swift" programming language alongside C and C++, the first-layer languages above assembly), results in less power-hungry computations.

2.7. Software used in Machine Learning

Keras is a high-level Python library, built upon the Theano and TensorFlow libraries, used for deep learning and neural networks. It is more user-friendly for programmers because it does not involve low-level, detailed, and more complex code structures.

[Figure: machine learning software stack (own graphics)]

It allows the creation of so-called Keras models, which may consist of multiple neural layers, activation functions, optimizers, and other functions of neural network models such as initialization and regularization schemes. Regularization is a regression-related technique that looks for relationships between dependent and independent variables and makes sure no overfitting occurs; overfitting refers to a trained ML model so specific to its dataset that it would struggle to perform well during the inference stage. Theano is a numerical library mainly used for matrix multiplications and neural-network-related calculations, while TensorFlow is a broadly used ML and DL library focusing on training and inference of models.
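As an illustration of that high-level style, here is a minimal Keras model sketch; the layer sizes, activation functions, and optimizer are arbitrary choices for the example, not those of the benchmark model used later.

from tensorflow import keras

# Three dense layers, their activation functions, and an optimizer are all
# declared at a high level, with no low-level code involved.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(64, activation="tanh"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()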
2.8. Terminology

The key terms most often used during the stages of development and production of software based on ANNs and ML overall are: epochs, batches, loss, validation, train, evaluation, and accuracy. Epoch is the word used for one iteration over the ANN's training dataset; the epoch count refers to the number of times the ANN was trained on that particular dataset. Epochs are often divided into batches (smaller collections of training data samples), since during training, memory must be allocated for storing the naturally occurring losses. Losses are penalties for bad predictions during the run of a batch and are later used to update the ANN model in the form of modified weights and biases. As a result, the smaller the batch size, the less memory needs to be allocated to store the losses, making training faster and more efficient. The final stage in the development of a finished ML model is evaluation, which refers to the methods used in assessing the accuracy and effectiveness of the model. The three most commonly used evaluation methods are accuracy, precision, and recall. Accuracy can be expressed by the formula

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Precision refers to the ratio of correct positive predictions within all the data samples predicted to have that same outcome, which can be expressed in the form

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Precision, as opposed to accuracy, measures how relevant the predictions are within the expected outcomes, i.e., the share of truly positive values among both falsely and truly positive evaluations. That method enables the algorithm and developer to exclude negatives falsely evaluated as positive from the range of positives. The third evaluation method is recall, a kind of reverse precision, which is expressed as

$\mathrm{Recall} = \frac{TP}{TP + FN}$

It helps identify the positives that are falsely evaluated as negatives. Train, in turn, refers to the accuracy on the training data rather than the whole model, and finally, validation is based on running the model on a test dataset derived from the training data.

3. Methodology

3.1. Computing devices used in benchmarks

Gaming Rig PC (own picture):
CPU: AMD Ryzen 7 3800X, 8 cores, 3.90 GHz, CISC
GPU: NVIDIA GeForce RTX 3060 Ti, 1.67 GHz, 8 GB GDDR6 (image of GeForce RTX 3060 Ti, TechPowerUp)
CUDA cores: 4864
Motherboard: ROG STRIX X570-E GAMING
Power supply: 650 W BeQuiet
RAM: 16 GB 3200 MHz
System: Windows 10 Pro 22H2

MacBook Air Retina 13-inch 2018 (Apple specifications):
CPU: 1.6 GHz dual-core Intel Core i5, CISC
GPU: Intel UHD Graphics 617, 1536 MB
RAM: 8 GB 2133 MHz LPDDR3
Power supply: Apple 30 W
System: macOS Ventura 13.0.1, later 13.2

MacBook Pro 14-inch 2021 (Apple specifications) - Apple M1 Pro SoC (M1 Pro motherboard, iFixit):
SoC CPU: 8-core, peak 3.2 GHz, RISC
SoC GPU: 14-core
SoC Neural Engine: 16-core
Unified memory: 16 GB, 256-bit LPDDR5 SDRAM, 200 GB/s bandwidth
Power supply: Apple 67 W
System: macOS Ventura 13.1

MacBook Pro 16-inch 2019 (Apple specifications):
CPU: 2.6 GHz (up to 4.5 GHz) 6-core Intel Core i7, 12 MB level-3 cache, CISC
GPU: AMD Radeon Pro 5300M, 4 GB GDDR6 + Intel UHD Graphics 630, 1536 MB
RAM: 16 GB 2667 MHz DDR4
Power supply: Apple 96 W
System: macOS Ventura 13.2

3.2. Inference benchmarking - OsiriX Lite - AI segmentation

The first stage of evaluating the performance, efficiency, and effectiveness of CPUs, GPUs, and SoCs, with an explicit division for neural engine acceleration, was performed by running a pretrained Keras neural network model. Following the lead of Takashi Shirakawa (a cardiovascular surgeon, master of mechanical engineering, and programmer), I used the "OsiriX Lite" software for viewing "digital imaging and communication in medicine" (DICOM) information object definition (IOD) medical images (with support for 64-bit and multithreaded computing), together with its A.I.Segmentation plugin, to measure the capabilities of Apple computers' processing units. The author of the plugin stated that it was "specially constructed for this performance test" and that the "AI core", also known as the model, "has been trained on more than 90,000 CT images for semantic segmentation of the aorta". Additionally, thanks to the explicit support of the new Apple silicon lineup by the "OsiriX Lite" computed tomography (CT) software, the model was converted to Core ML format (BENCHMARK.mlmodel) using Apple's coremltools for the macOS platform. The conversion was required to gain access to neural engine management through Apple's machine learning framework "Core ML", written in Swift. It is not possible to run such a benchmark on a Windows-based machine. However, this is not a limitation, as the three chosen laptops are equipped with Intel's i7-series processor, Intel's integrated graphics processor, and AMD's GPU, which ultimately increases the accuracy of the benchmark in terms of mobile processing units.
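Such a conversion might look like the following minimal sketch, assuming a trained Keras model saved under a hypothetical file name and a recent coremltools release; the actual benchmark model was converted by the plugin's author.

import coremltools as ct
from tensorflow import keras

# Load a trained Keras model (hypothetical path) and convert it to Core ML.
# In .mlmodel form, macOS can schedule inference on the CPU, the GPU, or the
# Apple Neural Engine through the Core ML framework.
keras_model = keras.models.load_model("benchmark_model.h5")
mlmodel = ct.convert(keras_model)
mlmodel.save("BENCHMARK.mlmodel")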
Each device (updated to macOS "Ventura") performed segmentation of abdomen and skull images in series of 100, 200, and 500 slices, where all processed slices add up to 1000, and efficiency is expressed in milliseconds per slice.

[Screenshot: "OsiriX Lite" running on the MacBook Pro 14-inch with M1 Pro (see Appendix C)]

[Figures: input (whole picture) and output (ROI area): abdomen and abdomen ROI (Softneta, DICOM Library)]

A.I.Segmentation plugin settings for "OsiriX Lite":
# Single pretrained AI core, or model
# Selection of the number of segmentation slices
# ROI (region of interest) drawing and rendering
# Antialias, or smoothing of the border
# Export of the outcome disabled, as there is no need for it for benchmarking purposes (the input and output of the computation can be seen above)
# Choice of the processing units used
# Apple M1 Pro as SoC

GitHub repository with the source code of the plugin: (Tkshirakawa, tkshirakawa/AISegmentation_v141: The public source code of A.I.Segmentation (AIS) version 1.4.1, a plugin of OsiriX for macOS that enables semantic segmentation of medical images using artificial intelligence (the Core ML framework) on macOS.)

3.3. Benchmarking - neural networks training

The second stage of evaluating the previously specified processing units in terms of machine learning and training efficiency was to perform three distinct trainings of three different neural networks, each more computationally intensive than the last. To conduct the training, an appropriate environment must be prepared and installed, beginning with downloading and installing the most recent Python version globally via the computer's terminal (macOS) or command line (Windows). Anaconda was installed to create a clean environment containing only the necessary packages, with the goal of increasing benchmark credibility and preventing extraneous packages from affecting the final results. The PyTorch deep learning framework was installed in the Anaconda environment so that the image recognition and number detection trainings could be run. Finally, the source code from Sebastian Raschka and Alexander Ziskin's GitHub repository (see Appendix A) was downloaded and placed in the top-level computer directory.

[Screenshot: lenet-mnist.py training on the M1 Pro CPU (terminal)]
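The benchmark scripts take the processing unit as a run-time argument; a minimal sketch in the same spirit (the flag name and the one-layer model are hypothetical, not the repository's exact code) could look like this:

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--device", default="cpu", choices=["cpu", "cuda", "mps"])
args = parser.parse_args()

if args.device == "cuda" and torch.cuda.is_available():
    device = torch.device("cuda")        # Nvidia GPU (CUDA cores)
elif args.device == "mps" and torch.backends.mps.is_available():
    device = torch.device("mps")         # Apple silicon GPU (Metal Performance Shaders)
else:
    device = torch.device("cpu")

print(torch.__version__, device)         # display library version and device used
model = torch.nn.Linear(784, 10).to(device)   # move the model to the chosen unit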
4. Experiment Results

4.1. Inference Benchmarking

After performing the AI segmentation of DICOM images on all the macOS devices, I inserted the data into an Excel file, summed the total computation and Core ML neural network times, divided them by the number of rendered slices, and finally rounded them to whole numbers for clarity. The findings are presented as follows:

[Chart: average milliseconds per slice per device and processing-unit category (raw data in Appendix C)]

In line with the theory of parallel computing, the "CPU" category took the most time per slice in every case (with a visible relationship between processing-unit speed and number of cores); the MacBook Air performed the worst, and the only computer containing a neural engine, or ANE, the 14-inch MacBook Pro with the M1 Pro, performed the best. The part of the AI segmentation accelerated by Core ML, used for the predictions regarding the ROI, took on average only 26.7% of the whole computation, while in the second category, "CPU+GPU", Core ML time took over half of the whole computation time on the same computer. What is surprising, but explainable, is that in the "CPU" category, Core ML and computation time were nearly equal, indicating that Core ML was actively running longer. This is likely because, as in the first, second, and third categories, where 22, 14, and 17 milliseconds respectively were spent on operations not related to the AI segmentation itself but rather on preparation to run the plugin, that fixed overhead is a far smaller share of the much longer CPU runs. The same situation can be seen in the cases of the "Mac Air" and the "16-inch Mac" in both categories. However, while speed and performance are one factor, efficiency per watt is another.

[Chart: efficiency per watt during inference]

Similarly to the first test, the M1 Pro SoC scored the best in terms of efficiency during inference.

4.2. Benchmarking - neural networks training (see Appendix B)

[Chart: training times per processing unit and benchmark script (raw data in Appendix B)]

Starting with the SoC chipset of the MacBook Pro 14-inch and the GeForce RTX 3060 Ti, results vary depending on the neural network being trained. In the case of mlp-mnist.py, the SoC was in total only 2 seconds behind Nvidia's powerful GPU. However, in the case of lenet-mnist.py, the RTX 3060 Ti managed to perform the training in a time equal to the M1 Pro chip running mlp-mnist; moreover, the gap in performance grew even larger, by 6 seconds, with the RTX performing the very same task on the less optimized Windows OS. On vgg16-cifar10.py, the hardest neural network to train, with more pixels and hence more input data, the M1 Pro was beaten by over three times.

Carrying on with the analysis: it was already confirmed in the "inference" benchmark that CPUs are not the greatest processing units for running the models, and we can conclude the same for performing the training. The AMD Ryzen 7 3800X (an overclockable CPU) did the best in all three trainings, with the M1 Pro in second place and the Intel Core i7 in third. Unfortunately, the Intel Core i5 from the MacBook Air could not complete the last training due to its exceptionally poor performance. It did not manage to finish the fourth batch after running for over 9.85 hours, which made it clear that it would not be able to finish the training in under 30 hours with 14 batches of 100 samples.

To figure out why one CPU performs better than another, I examined the relationship between CPU clock speed and training time.

[Chart: CPU clock speed vs. training time]

According to the diagram, the AMD Ryzen 7 had the best performance, followed by the M1 Pro and the Intel Core i7, with the Intel Core i5 in last place. An additional factor affecting the final efficiency of a CPU is its cache memory relative to the results; however, it is difficult to show a direct relationship between these CPUs and the M1 Pro's CPU, since the SoC has unified memory serving as RAM and cache for all the components of the chipset, while the other CPUs have external RAM and internal CPU caches, which would certainly affect the outcome.

The last crucial factors determining the processing units' effectiveness in training are the train, validation, and test accuracy (defined in the background information). The data have shown that despite the disproportionately faster training speed of the GPUs and the whole SoC, CPUs were actually more effective.

[Chart: train, validation, and test accuracy per processing unit (raw data in Appendix B)]

5. Study Limitations and research opportunities

All of the experiments and research that have been conducted could be expanded to include further evaluation of other processing units, such as Google's TPUs (tensor processing units) for machine learning, and of the logic architecture of processing units.
While it became clear that CPUs, GPUs, and the like vary in terms of performance, in my experiments I focused on the "PyTorch" library and "Anaconda" software, which limited me to a certain range of supported computing devices. For example, I could not run the AI segmentation on a Windows device, which would likely perform worse than the M1 Pro with its explicit neural-engine model accelerator for tasks such as voice recognition in the form of Apple's "Siri" or "FaceID". Furthermore, the "Conda" environment limited the supported GPUs to Nvidia's with CUDA cores only, which are not much different from FPUs (floating-point units). The other limitation of the experiment was access to data from Apple: it was not possible to get adequate data about the SoC's power consumption, which could bias the final result. The main research opportunity associated with the experiment is testing the M1 Pro against other, rarer ARM SoCs with RISC architecture and comparing them with new Nvidia GPUs using software that supports more computing devices.

6. Conclusion

This research has shown the significant power of the mobile M1 Pro SoC in comparison with one of the most powerful GPUs available today. Despite losing against both the desktop Ryzen 7 3800X and the desktop RTX 3060 Ti, which had access to six times the power the M1 Pro had, the performance packed into the small and thin size of a laptop turned out to be astonishing. When compared with the other Intel-based MacBooks, with CISC architecture and the ability to run both mobile and desktop versions of software, it is clear that mobile processing units are becoming increasingly efficient with the advancement of ARM-architecture-based processing units and software optimization. Combined over Nvidia's SLI link, Nvidia's GPU, with 4864 CUDA cores and the possibility of greater parallel computing, and AMD's 5300M, one of the two most powerful GPUs of the Intel macOS generations, could revolutionize the AI world, the former limited by the already too computationally intensive Windows OS and the latter by a lack of graphics memory on an otherwise great OS. To address the research question: the type of processing unit has a significant impact on efficiency, in terms of time and power consumption, as well as on effectiveness, in terms of increased model accuracy and validation. Experiments have shown that CPUs are more effective but less efficient, taking disproportionately more time to train and run the models. However, differences in effectiveness are likely due to varying internal logic architectures among those units, which cause them to come up with slightly different outcomes, for example in double-precision floating-point operations. When it comes to fast and effective neural network training, GPUs remain the best choice. Moreover, running pretrained models on processing units designed specifically for the model is far more efficient, whereas general-purpose computing devices such as a GPU or CPU will simply be inefficient.

7. Bibliography

[1] SoC M1 Pro (no date). Available at: https://www.anandtech.com/show/17019/apple-announced-m1-pro-m1-max-giant-new-socs-with-allout-performance (Accessed: February 14, 2023).
[2] GPU structure (no date). Available at: https://core.vmware.com/resource/exploring-gpu-architecture#section3 (Accessed: February 14, 2023).
[3] Neural Network (no date) LearnOpenCV. Available at: https://learnopencv.com/understanding-activation-functions-in-deep-learning/ (Accessed: February 14, 2023).
[4] Neurons vs Artificial Neurons (no date). Available at: https://towardsdatascience.com/the-concept-of-artificial-neurons-perceptrons-in-neural-networks-fab22249cbfc (Accessed: February 14, 2023).
[5] dishashree26 (2022) Activation functions: Fundamentals of deep learning, Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/ (Accessed: February 14, 2023).
[6] Syed, A. (2020) AI vs. ML vs. DL (vs. NN), Medium, A Coder's Guide to AI. Available at: https://medium.com/a-coders-guide-to-ai/ai-vs-ml-vs-dl-vs-nn-f6968db769d1 (Accessed: February 14, 2023).
[7] Typical Nvidia GPU architecture. The GPU is comprised of a set of ... (no date). Available at: https://www.researchgate.net/figure/Typical-NVIDIA-GPU-architecture-The-GPU-is-comprised-of-aset-of-Streaming_fig1_236666656 (Accessed: February 14, 2023).
[8] Does BIOS need to be loaded into main memory to be executed by CPU? (no date) Super User. Available at: https://superuser.com/questions/1407254/does-bios-need-to-be-loaded-into-main-memory-to-be-executed-by-cpu (Accessed: February 14, 2023).
[9] Apple Core ML (no date) Apple Developer Documentation. Available at: https://developer.apple.com/documentation/coreml (Accessed: February 14, 2023).
[10] admin_mirabilis (2021) Apple Neural Processor, Mirabilis Design. Available at: https://www.mirabilisdesign.com/apple-neural-processor/ (Accessed: February 14, 2023).
[11] Hollance (no date) hollance/neural-engine: Everything we actually know about the Apple Neural Engine (ANE), GitHub. Available at: https://github.com/hollance/neural-engine (Accessed: February 14, 2023).
[12] Apple M1 chip. Everything you wanted to know about it (2021) Logidots. Available at: https://logidots.com/insights/apple-m1-chip-everything-you-wanted-to-know-about-it/ (Accessed: February 14, 2023).
[13] Tkshirakawa (no date) tkshirakawa/AIS_Training_Codeset: Python code to train neural network models with your original dataset for semantic segmentation; this codeset also includes a converter to create macOS Core ML models from trained Keras models for A.I.Segmentation, GitHub. Available at: https://github.com/tkshirakawa/AIS_Training_Codeset (Accessed: February 14, 2023).
[14] Softneta (no date) DICOM Library - Anonymize, share, view DICOM files online. Available at: https://www.dicomlibrary.com/ (Accessed: February 14, 2023).
[15] Tkshirakawa (no date) tkshirakawa/AISegmentation_v141: The public source code of A.I.Segmentation (AIS) version 1.4.1, a plugin of OsiriX for macOS that enables semantic segmentation of medical images using artificial intelligence (the Core ML framework) on macOS, GitHub. Available at: https://github.com/tkshirakawa/AISegmentation_v141 (Accessed: February 14, 2023).
[16] M1 Pro motherboard (no date). Available at: https://www.ifixit.com/Guide/MacBook+Pro+14-Inch+2021+Chip+ID/145718 (Accessed: February 14, 2023).
[17] Apple specifications (no date) (UK). Available at: https://support.apple.com/kb/SP783?locale=en_GB (Accessed: February 14, 2023).
[18] Image of GeForce RTX 3060 Ti (no date) TechPowerUp. Available at: https://www.techpowerup.com/review/gigabyte-geforce-rtx-3060-ti-gaming-oc-pro/3.html (Accessed: February 14, 2023).
[19] Exploring the GPU architecture: VMware (no date) The Cloud Platform Tech Zone. Available at: https://core.vmware.com/resource/exploring-gpu-architecture#section3 (Accessed: February 14, 2023).
[20] Torch.utils.data (no date) torch.utils.data - PyTorch 1.13 documentation. Available at: https://pytorch.org/docs/stable/data.html (Accessed: February 14, 2023).
[21] Deep Network Designer (no date) VGG-16 convolutional neural network - MATLAB. Available at: https://www.mathworks.com/help/deeplearning/ref/vgg16.html (Accessed: February 14, 2023).
[22] 7.6. Convolutional Neural Networks (LeNet) (no date) Dive into Deep Learning 1.0.0-beta0 documentation. Available at: https://d2l.ai/chapter_convolutional-neural-networks/lenet.html (Accessed: February 14, 2023).
[23] Furkan Gulsen (no date) What is a Tensor? Medium. Available at: https://furkangulsen.medium.com/what-is-a-tensor-ce8e78835d08 (Accessed: February 14, 2023).
[24] Zheng, H. (no date) Model validation, Machine Learning, SpringerLink. Springer New York. Available at: https://link.springer.com/referenceworkentry/10.1007/978-1-4419-9863-7_233 (Accessed: February 14, 2023).
[25] ReLU activation function (no date) ResearchGate. Available at: https://www.researchgate.net/figure/ReLU-activation-function_fig7_333411007 (Accessed: February 14, 2023).

8. Appendix

A - Machine learning projects

Mlp-mnist.py is a machine learning program, based on linear layers and the ReLU activation function, for handwritten number recognition, creating 10 classes, one for each digit, from the "MNIST" database. The Modified National Institute of Standards and Technology (MNIST) dataset is made up of 70,000 handwritten 28x28-pixel images (60,000 for training and 10,000 for testing).

Crucial pieces of code included only*:
# The "argparse" module makes it easy to write user-friendly command-line interfaces (macOS: Terminal, Windows: Command line)
# Setting a random seed as an image transformation parameter
# Parsing the argument regarding the processing unit used
# Displaying the library version and the device used
# Setting the number of epochs
# Determining the batch size in a single epoch
# Transforming the images by resizing them
# Transforming the images by converting them to tensors, matrices of multilinear relationships between pictures (tensors relate to data structures where the data can be a scalar as a single number, a vector, a matrix with rows and columns, or a higher-order tensor whose rows and columns contain matrices) (What is a Tensor?)
# Defining the model structure with 784 inputs etc.
# Assigning the linear function to the first layer and every next hidden layer, and the ReLU activation function for use in the model
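Since only the comments of the original script are reproduced above, the following minimal PyTorch sketch approximates the described setup: a random seed, image transforms, a 784-input multilayer perceptron with ReLU, and a one-epoch training loop. The hidden-layer sizes and learning rate are my own assumptions, not the repository's exact values.

import torch
from torch import nn
from torchvision import datasets, transforms

torch.manual_seed(123)                       # random seed for reproducibility

train_data = datasets.MNIST(root="data", train=True, download=True,
                            transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_data, batch_size=100, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),                            # 1x28x28 image -> 784-dimensional vector
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),                       # one output per digit class
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:                # one epoch over the training set
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)    # the penalty for bad predictions
    loss.backward()                          # gradients for weights and biases
    optimizer.step()                         # update weights and biases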
LeNet-mnist.py is a project similar to mlp-mnist, except it uses the LeNet-5 network architecture, which consists of 2 convolutional layers connected with pooling layers and a dense block consisting of 3 layers, giving a total of 7 layers. Additionally, it takes the 28x28 pictures to create 784-dimensional vectors, which can be explained as arrays consisting of 784 pixels per picture, which have to be transformed and result in 10 final classes.

[Figure: LeNet-5 architecture (7.6. Convolutional Neural Networks (LeNet), d2l.ai)]

Crucial pieces of code included only*:
# Defining the type of network; in this case it is a convolutional neural network, which relates to the mathematical convolution of functions, resulting in a third function with certain traits of the initial two functions
# A classifier, as in mlp-mnist, is used to classify the images; here, linear and tanh activation and transformation functions are used
# Defining the model as LeNet-5, a 7-layer convolutional neural network inputting 28x28-pixel images in greyscale and dividing them into 10 classes
# Assigning the functions to one model function and later calling all the functions and setting logging (saving data of the processed batch) to intervals of 100 picture samples
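For reference, here is a compact PyTorch sketch of the classic LeNet-5 layout described above; it is an approximation of the architecture, not the repository's exact code.

import torch
from torch import nn

class LeNet5(nn.Module):
    # Two convolutional layers, each followed by pooling, then a dense block
    # of three layers: seven layers in total, with tanh activations.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(),  # 28x28 -> 28x28
            nn.AvgPool2d(2),                                       # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),            # -> 10x10
            nn.AvgPool2d(2),                                       # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),                            # 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.rand(1, 1, 28, 28)).shape)   # torch.Size([1, 10])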
Vgg16-cifar10.py (Deep Network Designer, MATLAB) is another convolutional neural network, made up of 16 layers and, in its original form, classifying up to 1000 different objects such as animals, devices, and office accessories; here it is trained on CIFAR-10's 10 classes. Input pictures of 224x224 pixels give vectors of 50176 values.

Crucial pieces of code included only*:
# "Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading." (Torch.utils.data, PyTorch documentation)
# CIFAR-10 data retrieval function
# Downloading VGG-16
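A minimal sketch of that pipeline is shown below; the batch size and worker count are arbitrary, and torchvision's stock VGG-16 stands in for the repository's version.

import torch
from torchvision import datasets, models, transforms

# CIFAR-10 retrieval, a DataLoader iterating in batches, and a randomly
# initialised VGG-16. Images are resized to the 224x224 input VGG-16 expects.
transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_data = datasets.CIFAR10(root="data", train=True, download=True,
                              transform=transform)
loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True,
                                     num_workers=2)

model = models.vgg16(num_classes=10)       # 16 weight layers, 10 CIFAR-10 classes
images, labels = next(iter(loader))
print(model(images).shape)                 # torch.Size([64, 10])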
B - Machine learning training benchmark

ROG GAMING - AMD Ryzen 7 CPU - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3429, 0.3083, 0.3685, 0.3488
Train: 91.51% | Validation: 93.50% | Test accuracy: 92.02%
Time/epoch without evaluation: 0.10 min | Total training time: 0.23 min | Total time: 0.27 min

ROG GAMING - NVIDIA GeForce RTX 3060 Ti CUDA - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3429, 0.3083, 0.3685, 0.3482
Train: 91.43% | Validation: 93.38% | Test accuracy: 92.15%
Time/epoch without evaluation: 0.11 min | Total training time: 0.24 min | Total time: 0.28 min

ROG GAMING - AMD Ryzen 7 CPU - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1437, 0.1009, 0.0734
Train: 97.32% | Validation: 97.77% | Test accuracy: 97.40%
Time/epoch without evaluation: 0.11 min | Total training time: 0.24 min | Total time: 0.28 min

ROG GAMING - NVIDIA GeForce RTX 3060 Ti CUDA - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1437, 0.1012, 0.0734
Train: 97.32% | Validation: 97.78% | Test accuracy: 97.40%
Time/epoch without evaluation: 0.14 min | Total training time: 0.27 min | Total time: 0.31 min

ROG GAMING - AMD Ryzen 7 CPU - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-1400/1406 (every 100): 2.6287, 2.1928, 1.9123, 2.0286, 2.0359, 1.9385, 1.9830, 1.8315, 2.0363, 1.8601, 1.7345, 1.8575, 1.8636, 2.0912, 1.8696
Train: 36.01% | Validation: 35.04% | Test accuracy: 36.22%
Time/epoch without evaluation: 235.61 min | Total training time: 313.68 min | Total time: 329.47 min

ROG GAMING - NVIDIA GeForce RTX 3060 Ti CUDA - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-1400/1406 (every 100): 2.8048, 2.2773, 2.3926, 2.2269, 2.1138, 2.1860, 2.1008, 2.1855, 2.0561, 2.1550, 2.1875, 2.0554, 2.1677, 1.9814, 2.2570
Train: 24.56% | Validation: 24.50% | Test accuracy: 25.22%
Time/epoch without evaluation: 8.01 min | Total training time: 11.05 min | Total time: 11.68 min

APPLE M1 series - MPS - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1437, 0.1010, 0.0734
Train: 97.33% | Validation: 97.75% | Test accuracy: 97.39%
Time/epoch without evaluation: 0.22 min | Total training time: 0.36 min | Total time: 0.41 min

APPLE M1 series - CPU - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1437, 0.1010, 0.0734
Train: 97.32% | Validation: 97.75% | Test accuracy: 97.40%
Time/epoch without evaluation: 2.29 min | Total training time: 3.49 min | Total time: 3.81 min

APPLE M1 series - MPS - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-1400/1406 (every 100): 2.7701, 2.3483, 2.2327, 2.2476, 2.3149, 2.2989, 2.2574, 2.1690, 2.1377, 1.0730, 1.9288, 1.1098, 2.2131, 2.1121, 2.1564
Train: 18.19% | Validation: 18.26% | Test accuracy: 18.52%
Time/epoch without evaluation: 30.40 min | Total training time: 38.32 min | Total time: 40.11 min

APPLE M1 series - CPU - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-1400/1406 (every 100): 2.6052, 2.4348, 1.9956, 1.8892, 2.1870, 1.9244, 2.0415, 2.0132, 2.0168, 2.0304, 1.7992, 1.8867, 1.7387, 1.6586, 1.8780
Train: 31.06% | Validation: 31.38% | Test accuracy: 31.61%
Time/epoch without evaluation: 418.48 min | Total training time: 550.32 min | Total time: 576.87 min

APPLE M1 series - MPS - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3429, 0.3103, 0.3708, 0.3499
Train: 91.63% | Validation: 93.48% | Test accuracy: 92.15%
Time/epoch without evaluation: 0.13 min | Total training time: 0.27 min | Total time: 0.32 min

APPLE M1 series - CPU - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3429, 0.3083, 0.3685, 0.3488
Train: 91.51% | Validation: 93.50% | Test accuracy: 92.02%
Time/epoch without evaluation: 0.13 min | Total training time: 0.27 min | Total time: 0.32 min

Apple MacBook Air (Intel UHD 617) - CPU - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3429, 0.3083, 0.3685, 0.3488
Train: 91.51% | Validation: 93.50% | Test accuracy: 92.02%
Time/epoch without evaluation: 0.26 min | Total training time: 0.56 min | Total time: 0.65 min

Apple MacBook Air (Intel UHD 617) - CPU - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1437, 0.1009, 0.0732
Train: 97.33% | Validation: 97.77% | Test accuracy: 97.39%
Time/epoch without evaluation: 0.58 min | Total training time: 0.98 min | Total time: 1.08 min

Note: 21;25 - 7.16 still not done - 9.51. Not possible to run on CUDA due to the lack of support for non-Nvidia GPUs. Neither possible to run on the whole system with that exact same software setup.
Apple MacBook Air (Intel UHD 617) - CPU - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-0400/1406 (every 100): 2.4554, 2.4443, 2.3538, 1.9436, 2.026
591 min had elapsed at the time the test was shut down.

MacBook Pro 16-inch 2019 - vgg16-cifar10.py
Epoch 001/001 losses at batches 0000-1400/1406 (every 100): 2.7701, 2.3483, 2.2327, 2.2476, 2.3149, 2.2989, 2.2574, 2.1690, 2.1377, 1.0730, 1.9288, 1.1098, 2.2131, 2.1121, 2.1564
Train: 36.88% | Validation: 38.10% | Test accuracy: 38.29%
Time/epoch without evaluation: 727 min | Total training time: 824.72 min | Total time: 843.63 min

MacBook Pro 16-inch 2019 - lenet-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3098, 0.2646, 0.1438, 0.1011, 0.0733
Train: 97.33% | Validation: 97.77% | Test accuracy: 97.39%
Time/epoch without evaluation: 0.30 min | Total training time: 0.52 min | Total time: 0.58 min

MacBook Pro 16-inch 2019 - mlp-mnist.py
Epoch 001/001 losses at batches 0000-0400/0421 (every 100): 2.3063, 0.3437, 0.3072, 0.3702, 0.3527
Train: 91.36% | Validation: 93.23% | Test accuracy: 91.89%
Time/epoch without evaluation: 0.16 min | Total training time: 0.39 min | Total time: 0.45 min

C - DICOM AI Segmentation

All values are times in milliseconds; each cell gives Core ML time / computation time. The SUM row covers all 1000 slices and gives the average time per slice in milliseconds.

Apple MacBook Pro 16-inch 2019 (without Neural Engine)
DICOM image | Slices | CPU+GPU+ANE | CPU+GPU | CPU
Abdomen | 100 | 3.016968 / 7.427609 | 2.177623 / 6.317432 | 41.707537 / 46.019530
Abdomen | 200 | 4.286666 / 9.709627 | 4.305234 / 9.714994 | 82.297845 / 87.810433
Skull front | 100 | 2.200989 / 6.386208 | 2.163848 / 6.472658 | 42.210295 / 46.583630
Skull benchmark | 500 | 10.762831 / 20.998852 | 10.804305 / 21.386295 | 203.332255 / 214.022462
Skull benchmark | 100 | 2.165628 / 6.621599 | 2.146345 / 6.221807 | 41.742562 / 46.255953
SUM | 1000 | 22.433082 / 51.143895 | 21.597355 / 50.113186 | 411.290494 / 440.692008

Apple MacBook Air (Intel UHD 617) (without Neural Engine)
DICOM image | Slices | CPU+GPU+ANE | CPU+GPU | CPU
Abdomen | 100 | 45.635162 / 56.491021 | 48.924698 / 57.097142 | 79.046955 / 88.070429
Abdomen | 200 | 110.270818 / 127.340924 | 121.010379 / 142.164006 | 334.180414 / 355.794979
Skull front | 100 | 49.684985 / 61.518044 | 58.319121 / 70.014875 | 166.957688 / 178.849785
Skull benchmark | 500 | 259.655509 / 289.046455 | 254.888490 / 285.395843 | 404.962819 / 442.389722
Skull benchmark | 100 | 51.275046 / 59.284041 | 49.904857 / 58.538970 | 82.092756 / 91.148546
SUM | 1000 | 516.52152 / 593.680485 | 533.047545 / 613.210836 | 1067.240632 / 1156.253461

Apple MacBook Pro 14-inch 2021 - M1 Pro
DICOM image | Slices | CPU+GPU+ANE | CPU+GPU | CPU
Abdomen | 100 | 0.770511 / 4.125617 | 1.994017 / 3.661932 | 16.190581 / 18.097454
Abdomen | 200 | 1.532248 / 5.807359 | 3.181870 / 5.719136 | 32.383714 / 35.460405
Skull front | 100 | 0.763828 / 3.776746 | 1.590075 / 3.089768 | 15.826232 / 17.507798
Skull benchmark | 100 | 0.767479 / 4.001889 | 1.615191 / 3.342085 | 15.463963 / 17.476744
Skull benchmark | 500 | 3.847994 / 12.429732 | 7.918357 / 14.038464 | 80.216171 / 87.976234
SUM | 1000 | 7.68206 / 30.141343 | 16.29951 / 29.851385 | 160.080661 / 176.518635