Apache Spark using Python - Complete Course

PRIYAM GHOSH DASTIDAR
IT Professional | Cloud Data Engineer
Apache Spark
• Apache Spark is a lightning-fast cluster computing technology, designed
for fast computation
• The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application
Apache Spark
 Topics
 Apache Spark’s Distributed Execution
 Distributed data and partitions
 Understanding Spark Application Concepts
 Transformations, Actions, and Lazy Evaluation
PySpark
Definition
• PySpark is a Spark library written in Python to run Python applications
using Apache Spark capabilities
Features
• In-memory computation
• Distributed processing using parallelize
• Fault-tolerant
• Immutable
• Lazy evaluation
• Cache & persistence
• In-built optimization when using DataFrames
PySpark Modules & Packages
• PySpark RDD (pyspark.RDD)
• PySpark DataFrame and SQL (pyspark.sql)
• PySpark Streaming (pyspark.streaming)
• PySpark MLlib (pyspark.ml, pyspark.mllib)
• PySpark GraphFrames (GraphFrames)
• PySpark Resource (pyspark.resource) – new in PySpark 3.0
PySpark Installation
Databricks Community Edition
• Go to https://www.databricks.com/
• Fill in all the required details and click Next
• If you want to use Databricks with a cloud platform such as Azure, GCP, or AWS, choose accordingly; otherwise, click the link below to choose the Databricks Community Edition
• You will receive a sign-up confirmation notification
• Go to the email inbox you provided to start your trial; you will receive a confirmation email
• Click on the provided link
• Reset your Password
• Take the guided tour of the Databricks cloud platform
PySpark SparkContext
• An entry point to the PySpark functionality that is used to communicate
with the cluster
• Create SparkContext in PySpark
• Stop PySpark SparkContext
PySpark SparkSession
• Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point.
• How many SparkSessions can you create in a PySpark application?
 You can create as many SparkSessions as you want, using either SparkSession.builder.getOrCreate() or newSession() on an existing session
PySpark RDD
RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset) is the fundamental building block of PySpark: a fault-tolerant, immutable distributed collection of objects
PySpark RDD Benefits
 In-Memory Processing
 Immutability
 Fault Tolerance
 Lazy Evaluation
 Partitioning
Create RDD
• sparkContext.parallelize()
• sparkContext.textFile()
PySpark DataFrame
• DataFrames are the distributed collections of data, organized into rows and
columns
• DataFrames are similar to traditional database tables, which are structured
and concise
• We can think of DataFrames as relational tables with better optimization techniques
Convert PySpark RDD to DataFrame
• The RDD's toDF() function converts an RDD to a DataFrame
Two ways to convert
 Using the rdd.toDF() function
 Using the PySpark createDataFrame() function
StructType & StructField
• PySpark StructType & StructField classes are used to programmatically
specify the schema to the DataFrame and create complex columns like
nested struct, array, and map columns
• StructType – defines the structure of the DataFrame
• StructField – defines the metadata of a DataFrame column
Other DataFrame Topics
 Select Columns From DataFrame
 Select Single & Multiple Columns
 Select All Columns From List
 withColumn() Usage
 Change DataType using PySpark withColumn()
 Update The Value of an Existing Column
 Create a Column from an Existing
 Add a New Column using withColumn()
 Rename Column Name
 Drop Column From PySpark DataFrame
 PySpark where() / filter() functions
 PySpark orderBy()
 DataFrame sorting using the sort() function
 DataFrame sorting using orderBy() function
 Sort Ascending (asc) / Descending (desc)
 PySpark Groupby
 PySpark Join Types
 PySpark Union and UnionAll
 PySpark fillna() & fill()
PySpark Read Write
 PySpark Read CSV file into DataFrame
Spark Optimization
 Different optimization techniques
Thank You