
COMP-4334 Week 6 Lectures

DataFrame Sources
DataFrame File Sources
• Spark contains some core sources that
automatically create/save DataFrames from/to
permanent storage
• CSV
• JSON
• Parquet
• A few others…
File Source Syntax
• File sources share the same basic syntax
• Each file type will have its own additional options
• .option("optionName", optionValue)
df = spark.read.format("format").load("filename")
df.write.format("format").save("filename")
• “mode” options:
• append
• overwrite
• errorIfExists - default
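• For example, a minimal read/write round trip (a sketch assuming an existing SparkSession named spark; "people.csv" and "peopleCopy" are hypothetical paths):
# Read a CSV file, then write it back out as CSV,
# replacing any existing output instead of raising errorIfExists
df = spark.read.format("csv").load("people.csv")
df.write.format("csv").mode("overwrite").save("peopleCopy")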
DataFrame Database Sources
• Spark also contains numerous 3rd party
extension data sources
• Cassandra
• HBase
• MongoDB
• Others…
Database Source Syntax
• Each database connector is slightly different as
they are 3rd party
• Most require a connector be added to the
system in one way or another
• Once installed, each connector provides slightly
different syntax, but the general format is:
spark.read.format("org.apache.spark.sql.cassandra") \
.options(table="emp",keyspace="test") \
.load()
Databases vs Spark
• Databases and Spark both process SQL
• Databases are designed for many users doing
low-latency queries
• Permanent storage of data
• Spark is designed for a single user doing large
throughput data manipulation
• Often load data from a database, process it in parallel
to produce new results, then store back into database
• Processing engine
DataFrame Sources
The End
CSV
CSV Files
• Recall the basic load/save syntax
df = spark.read.format("csv").load("filename.csv")
df.write.format("csv").save("filename.csv")
CSV Options
• Header:
.option("header", True)
• Default assumes no header (False)
• Any header would be read in as the first row
• Delimiter:
.option("delimiter", "\t")
• Default is a comma delimiter (",")
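• A minimal sketch combining these two options ("grades.tsv" is a hypothetical tab-delimited file with a header row):
# Treat the first line as column names and split on tabs instead of commas
df = spark.read.format("csv") \
    .option("header", True) \
    .option("delimiter", "\t") \
    .load("grades.tsv")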
CSV Options
• Infer Schema:
.option("inferSchema", True)
• Default is to not infer (False)
• Not inferring produces all Strings
• Inferring is not always correct
• Int/Double issues for example
• Explicit Schema:
.schema(schemaName)
• Lines that do not match come in as all nulls
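• A sketch of an explicit schema (the column names and file name are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Name and type each column up front instead of inferring
gradeSchema = StructType([
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True)
])
df = spark.read.format("csv").schema(gradeSchema).load("grades.csv")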
CSV Options
• Whitespace:
.option("ignoreLeadingWhiteSpace", True)
• There is also an ignoreTrailingWhiteSpace option
• Default is to not ignore it (False)
• Un-ignored whitespace often causes malformed lines for non-string columns
• Malformed:
.option("mode", "dropMalformed")
• Drops line if bad data found
• There is also a failFast option – error out immediately
• Default is permissive – a malformed line comes in as all nulls
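• A sketch combining these options ("grades.csv" is a hypothetical file):
# Trim leading whitespace so numeric columns parse cleanly, and drop
# any line whose fields do not match the inferred types
df = spark.read.format("csv") \
    .option("inferSchema", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("mode", "dropMalformed") \
    .load("grades.csv")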
CSV Lazy Evaluation
• As with everything in Spark, loading/processing
CSVs uses lazy evaluation
• Only read in what is needed via the lineage graph
• If we only show/process 2 of the columns, then it
will try to only read-in/process those columns
• Seen with malformed processing
• CSVs are row-based files and not well suited to reading in only certain columns
• Parquet is designed for this situation
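• For example, only the columns the query actually touches need to be processed (column names are hypothetical):
# Only "name" and "score" are referenced, so the lineage graph
# lets Spark skip work on the other columns where the format allows
df.select("name", "score").show()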
Multiple CSV Files
• Multiple CSV files can be placed into a single
directory
• Just like we did with text files
• Specify the directory name for the file rather than
an individual .csv file
• Spark will read in all .csv files in that directory in a distributed way
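• A sketch of a directory load ("gradesDir" is a hypothetical directory of .csv files):
# Pass the directory name; every .csv file inside becomes part of one DataFrame
df = spark.read.format("csv").option("header", True).load("gradesDir")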
CSV
The End
JSON
JSON Format
• JavaScript Object Notation
• Often presented as multi-line
• One object per file
{
  "RecordNumber": 2,
  "Zipcode": 81601,
  "City": "GLENWOOD SPRINGS",
  "State": "CO"
}
JSON Format
• While Spark can read/write multi-line JSON files,
it does not scale well
• One JSON object per line is much better
• Row-based format just like CSV
{"RecordNumber": 2,"Zipcode": 81601,City": "GLENWOOD SPRINGS","State": "CO"}
{"RecordNumber": 5,"Zipcode": 80424,"City": ”BRECKENRIDGE","State": "CO"}
…
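• A minimal read of a one-object-per-line file ("zipcodes.json" is a hypothetical file):
# Each line is parsed as one JSON object and becomes one row
df = spark.read.format("json").load("zipcodes.json")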
JSON Options
• Similar mode and schema options to CSV
• Multi-line
• One per line is best
• Compression
• Files can be compressed with different codecs
• AllowComments
• JavaScript style comments included in file
• AllowUnquotedFieldNames
• Allows field names without quotes ("") around them
• Others…
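• A sketch using a couple of these options (the file name is hypothetical):
# Read a multi-line JSON file that may contain JavaScript-style comments
df = spark.read.format("json") \
    .option("multiLine", True) \
    .option("allowComments", True) \
    .load("zipcodes.json")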
JSON
The End
Parquet
Parquet Format
• Columnar format
• Data is stored column by column
• Makes writing out/appending more time
consuming since each row needs to have data
written to many different locations
• Makes reading much faster if only certain
columns are needed
• Spark applications are often write-once, read-many
Parquet Read/Write
• Use same basic load/save syntax
df = spark.read.format("parquet").load("dirname")
df.write.format("parquet").save("dirname")
• But the filename is really a directory name since
the parquet format will store many different files
• As the end-user we just use the top-level dirname and
let Spark do the rest
Parquet Options
• Very few special options compared to CSV
• Most things are taken care of internally
• Options for:
• Compression
• Schema merging
• Columns can be added to the schema after creation, and merging reconciles the resulting files
• Partitioning
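• A sketch of the compression and schema-merging options ("gradesParquet" is a hypothetical directory):
# Write with an explicit codec, then read while merging any
# schema differences between the Parquet files in the directory
df.write.format("parquet").option("compression", "snappy").save("gradesParquet")
df2 = spark.read.format("parquet").option("mergeSchema", True).load("gradesParquet")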
Parquet PartitionBy
• The partitionBy option will divide up the data into
different folders on the filesystem
• The DataFrame in memory is not modified or
partitioned in any way
• Just what is written
• A key is specified, and each value of that key
gets its own sub-folder in the partition directory
• Need to be careful not to partitionBy columns with a
large number of different values
Parquet PartitionBy
df.write.format("parquet") \
    .mode("overwrite") \
    .partitionBy("letterGrade") \
    .save("classGrades")
• classGrades folder
  • letterGrade=A folder
  • letterGrade=B folder
  • letterGrade=C folder
  • letterGrade=D folder
  • letterGrade=F folder
Parquet PartitionBy
• Can speed up reading if filtering is done on
loaded file in the same way it is partitioned
• Recall Spark’s lazy evaluation
df = spark.read.format("parquet").load("classGrades")
df.select("name").filter(f.col("letterGrade") == "B").show()
Parquet
The End
Joins
Joins
• DataFrame joins are similar to joins from
database tables
• Inner joins
• Outer joins
• Semi joins
• Anti joins
• Each of the joins produces a new DataFrame
Inner Joins
• Joins the rows that exist in both DataFrames
aDF.join(bDF, aDF["id"] == bDF["id"], "inner")
• Inner joins are most common, so there are
syntax shortcuts
• Wanting an inner join
aDF.join(bDF, aDF["id"] == bDF["id"])
• Same matching key name in both
aDF.join(bDF, "id")
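• A small worked sketch (the contents of aDF and bDF are hypothetical):
# Two tiny DataFrames sharing an "id" column
aDF = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["id", "name"])
bDF = spark.createDataFrame([(1, 90), (3, 75)], ["id", "score"])
# Inner join keeps only id 1, the key present in both
aDF.join(bDF, "id").show()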
Outer Joins
• Left outer join keeps all rows in left DataFrame
and adds info from right DataFrame
• Nulls are placed when missing right data
aDF.join(bDF, aDF["id"] == bDF["id"], "right_outer")
• Right outer is just reversed
• Probably no great reason to even use it as one can
simply reverse the arguments
Semi and Anti Joins
• Left Semi keeps the values in the left DataFrame
that had matching values in the right
• Columns from the right DataFrame are not included in the result
aDF.join(bDF, aDF["id"] == bDF["id"], "left_semi")
• Left Anti keeps the values in the left DataFrame
that did not have matching values in the right
aDF.join(bDF, aDF["id"] == bDF["id"], "left_anti")
Joins
The End
Join Performance
Join Types
• To join two tables, the matching keys need to be
placed onto the same worker nodes
• There are different types of joins that can take
place
• Full Shuffle Joins
• Broadcast Joins
• Which happens depends on the DataFrame size
• And hints to the optimizer if given
Full Shuffle Joins
• Used when both DataFrames are large
• Large enough to not fit into memory on a single
worker node
• Standard Shuffles
• Slow since both DataFrames need to be shuffled
• If done repeatedly, pre-partitioning by the key and
persisting can reduce this to a single shuffle
• df.repartition("key").persist()
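• A sketch of pre-partitioning both sides before repeated joins (DataFrame names are hypothetical):
# Shuffle each DataFrame by the join key once and keep it in memory,
# so later joins on "id" reuse that partitioning instead of reshuffling
aDF = aDF.repartition("id").persist()
bDF = bDF.repartition("id").persist()
result = aDF.join(bDF, "id")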
Broadcast Joins
• Used when one DataFrame is small
• Small enough to fit into memory on a single worker
node
• Broadcast variable is used to pre-send small
DataFrame to every worker node
• Faster than Full Shuffle since, after initial
broadcast, all workers have all the information
they require for the join
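• A sketch of an explicit broadcast join (bigDF and smallDF are hypothetical):
import pyspark.sql.functions as f
# smallDF is shipped in full to every worker; bigDF stays distributed
result = bigDF.join(f.broadcast(smallDF), "id")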
Join Selection
• Spark will pick the join type for you
• See which it is using by looking at the explain plan
resultDF.explain()
• Set the DataFrame cutoff size (or turn it off)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
• Set hints to help optimizer
aDF.join(f.broadcast(bDF), "id")
Join Performance
The End
User Defined Functions
User Defined Functions
• SparkSQL provides many built-in functions
• But there are times when you may need to
define your own
• This is done through User Defined Functions
User defined functions
• Use the UDF function to define
• Provide a function that takes one parameter and returns one value
• Provide the return type
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
def doubleIt(x):
    return x * 2
doubleItUDF = udf(doubleIt, LongType())
User defined functions
• Use of UDF in a DataFrame
• Note that the UDF is called once for each row in DF
df.select('a', doubleItUDF('b'))
df.withColumn('doubleB', doubleItUDF('b'))
User defined functions
• To use the UDF in a SQL command
spark.udf.register("dbl", doubleItUDF)
spark.sql("SELECT a, dbl(b) FROM tbl")
User Defined Functions
The End