Apache Spark using Python: Complete Course
PRIYAM GHOSH DASTIDAR
IT Professional | Cloud Data Engineer

Apache Spark
• Apache Spark is a fast cluster computing technology designed for large-scale data processing
• Its main feature is in-memory cluster computing, which greatly increases the processing speed of an application

Apache Spark Topics
• Apache Spark's distributed execution
• Distributed data and partitions
• Understanding Spark application concepts
• Transformations, actions, and lazy evaluation

PySpark
Definition
• PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark's capabilities
Features
• In-memory computation
• Distributed processing using parallelize
• Fault tolerance
• Immutability
• Lazy evaluation
• Cache & persistence
• Built-in optimization when using DataFrames

PySpark Modules & Packages
• PySpark RDD (pyspark.RDD)
• PySpark DataFrame and SQL (pyspark.sql)
• PySpark Streaming (pyspark.streaming)
• PySpark MLlib (pyspark.ml, pyspark.mllib)
• PySpark GraphFrames (GraphFrames)
• PySpark Resource (pyspark.resource), new in PySpark 3.0

PySpark Installation: Databricks Community Edition
• Go to https://www.databricks.com/
• Fill in all the required details and click Next
• If you want to use Databricks with a cloud platform such as Azure, GCP, or AWS, choose accordingly; otherwise use the link below the cloud options to choose Databricks Community Edition
• A confirmation notification appears (shown as a screenshot on the original slide)
• Go to the email inbox you provided to start your trial; you will receive a verification email
• Click the link in that email
• Reset your password
• Take the guided tour of the Databricks cloud platform

PySpark SparkContext
• SparkContext is an entry point to PySpark functionality and is used to communicate with the cluster
• Create a SparkContext in PySpark
• Stop a PySpark SparkContext
(A minimal sketch of both steps follows; the original slide showed code as screenshots that did not survive.)
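The sketch below is not from the original deck: the master URL, app name, and sample data are illustrative choices, assuming a local run.

from pyspark import SparkContext

# Create a SparkContext; "local[*]" runs Spark locally on all cores.
sc = SparkContext(master="local[*]", appName="SparkContextDemo")

# Use it for a trivial RDD computation to confirm it works.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.sum())  # 15

# Stop the context to release resources; only one active
# SparkContext is allowed per process.
sc.stop()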
PySpark SparkSession
• Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames; prior to 2.0, SparkContext was the entry point
• How many SparkSessions can you create in a PySpark application? You can create as many SparkSessions as you want, using the SparkSession.builder API or SparkSession.newSession() (a sketch appears in the appendix after the closing slide)

PySpark RDD
RDD (Resilient Distributed Dataset)
• An RDD is a fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects
PySpark RDD Benefits
• In-memory processing
• Immutability
• Fault tolerance
• Lazy evaluation
• Partitioning

Create RDD
• sparkContext.parallelize()
• sparkContext.textFile()
(Sketches of both routes appear in the appendix.)

PySpark DataFrame
• DataFrames are distributed collections of data organized into rows and columns
• DataFrames are similar to traditional database tables: structured and concise
• You can think of a DataFrame as a relational table backed by better optimization techniques

PySpark DataFrame
Convert a PySpark RDD to a DataFrame
• The toDF() function of an RDD converts the RDD to a DataFrame
Two ways to convert
• Using the rdd.toDF() function
• Using the PySpark createDataFrame() function

StructType & StructField
• The PySpark StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns
• StructType defines the structure of the DataFrame
• StructField defines the metadata of a DataFrame column

PySpark DataFrame: Other Topics
• Select columns from a DataFrame
• Select single & multiple columns
• Select all columns from a list
• withColumn() usage
• Change a data type using PySpark withColumn()
• Update the value of an existing column
• Create a column from an existing one
• Add a new column using withColumn()
• Rename a column
• Drop a column from a PySpark DataFrame
• PySpark where()/filter() functions
• PySpark orderBy()
• DataFrame sorting using the sort() function
• DataFrame sorting using the orderBy() function
• Sort by ascending (ASC) / descending (DESC)
• PySpark groupBy()
• PySpark join types
• PySpark union() and unionAll()
• PySpark fillna() & fill()
(Hedged sketches covering most of these appear in the appendix.)

PySpark Read Write
• Read a CSV file into a DataFrame (see the appendix)

Spark Optimization
• Different optimization techniques

Thank You
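Appendix: Code Sketches
The sketches below are not from the original deck; they are minimal, hedged illustrations of the topics above, and every app name, file path, column name, and sample value is invented for the examples. First, creating a SparkSession and a second session that shares the same SparkContext:

from pyspark.sql import SparkSession

# builder.getOrCreate() returns the active session or creates one.
spark = SparkSession.builder \
    .appName("SparkSessionDemo") \
    .master("local[*]") \
    .getOrCreate()

# newSession() creates another SparkSession that shares the same
# SparkContext but keeps its own SQL configuration and temp views.
spark2 = spark.newSession()
print(spark.sparkContext is spark2.sparkContext)  # True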
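Both RDD creation routes from the Create RDD slide, as a sketch; the text file path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateRDDDemo").getOrCreate()
sc = spark.sparkContext

# Route 1: parallelize an in-memory Python collection.
rdd1 = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
print(rdd1.count(), rdd1.getNumPartitions())

# Route 2: read a text file, one record per line.
# "data/input.txt" is a hypothetical path.
rdd2 = sc.textFile("data/input.txt")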
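One sketch covering both RDD-to-DataFrame routes plus an explicit StructType/StructField schema; names and data are made up:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("RDDToDFDemo").getOrCreate()
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

# Route 1: rdd.toDF() with column names.
df1 = rdd.toDF(["name", "age"])

# Route 2: createDataFrame() with a programmatic schema.
schema = StructType([
    StructField("name", StringType(), True),   # nullable string column
    StructField("age", IntegerType(), True),   # nullable integer column
])
df2 = spark.createDataFrame(rdd, schema)

df2.printSchema()
df2.show()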
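A sketch of the select/withColumn family of operations from the Other Topics list, on an invented three-column DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("ColumnOpsDemo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", "HR", 3000), ("Bob", "IT", 4000)],
    ["name", "dept", "salary"],
)

# Select single and multiple columns; a Python list also works.
df.select("name").show()
df.select("name", "dept").show()
df.select([c for c in df.columns]).show()  # all columns from a list

# withColumn(): change type, update a value, derive, and add columns.
df = df.withColumn("salary", col("salary").cast("double"))  # change data type
df = df.withColumn("salary", col("salary") * 1.1)           # update existing value
df = df.withColumn("bonus", col("salary") * 0.2)            # from an existing column
df = df.withColumn("country", lit("IN"))                    # add a constant column

# Rename and drop columns; filter rows (where() is an alias of filter()).
df = df.withColumnRenamed("dept", "department").drop("bonus")
df.filter(col("department") == "IT").show()

# sort() and orderBy() are equivalent; desc() flips the order.
df.orderBy(col("salary").desc()).show()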
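A companion sketch for grouping, joins, union, and null handling, reusing the same invented shape of data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AggJoinDemo").getOrCreate()
emp = spark.createDataFrame(
    [("Alice", "HR", 3000), ("Bob", "IT", 4000), ("Carol", "IT", None)],
    ["name", "dept", "salary"],
)
dept = spark.createDataFrame([("HR", 10), ("IT", 20)], ["dept", "dept_id"])

# Grouping and aggregation.
emp.groupBy("dept").avg("salary").show()

# Joins: how= accepts "inner", "left", "right", "full", "left_semi", "left_anti", etc.
emp.join(dept, on="dept", how="inner").show()

# union() appends the rows of a same-schema DataFrame;
# unionAll() is a deprecated alias of union() since Spark 2.0.
emp.union(emp).show()

# fillna() (alias df.na.fill()) replaces nulls per column.
emp.fillna({"salary": 0}).show()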
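For the Read Write slide, a sketch of reading a CSV file into a DataFrame, with a write added as a round trip; both paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSVDemo").getOrCreate()

# header=True uses the first line as column names;
# inferSchema=True asks Spark to guess column types (an extra pass over the data).
df = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# Equivalent option-style form:
# df = spark.read.option("header", True).option("inferSchema", True).csv("data/people.csv")

df.printSchema()
df.show(5)

# Writing back out; mode("overwrite") replaces any existing output.
df.write.mode("overwrite").csv("data/people_out", header=True)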
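The Spark Optimization slide names no specific techniques, so as one representative example here is a sketch of caching and persistence, which the Features slide also mentions; treating this as one of the deck's intended techniques is an assumption:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
df = spark.range(1_000_000)  # a million-row demo DataFrame

# cache() keeps the data in memory once the first action computes it,
# so later actions reuse it instead of recomputing the lineage.
df.cache()
print(df.count())  # materializes the cache
print(df.count())  # served from the cache

# persist() lets you pick a storage level explicitly.
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)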