DataFrame Sources DataFrame File Sources • Spark contains some core sources that automatically create/save DataFrames from/to permanent storage • • • • CSV JSON Parquet A few others… File Source Syntax • File sources share the same basic syntax • Each file type will have its own additional options • .option(“optionName”, Optionparameter) df = spark.read.format(“format”).load(“filename”) df.write.format(“format”).save(“filename”) • “mode” options: • append • overwrite • errorIfExists - default DataFrame Database Sources • Spark also contains numerous 3rd party extension data sources • • • • Cassandra HBase MongoDB Others… Database Source Syntax • Each database connector is slightly different as they are 3rd party • Most require a connector be added to the system in one way or another • Once installed, each connector provides slightly different syntax, but the general format is: spark.read.format("org.apache.spark.sql.cassandra") \ .options(table="emp",keyspace="test") \ .load() Databases vs Spark • Databases and Spark both process SQL • Databases are designed for many users doing low-latency queries • Permanent storage of data • Spark is designed for a single user doing large throughput data manipulation • Often load data from a database, process it in parallel to produce new results, then store back into database • Processing engine DataFrame Sources The End CSV CSV Files • Recall the basic load/save syntax df = spark.read.format(“csv”).load(“filename.csv”) df.write.format(“csv”).save(“filename.csv”) CSV Options • Header: .option(“header”, True) • Default assumes no header (False) • Any header would be read in as the first row • Delimiter: .option(“delimiter”, “\\t”) • Default is a comma delimiter (“,”) CSV Options • Infer Schema: .option(“inferSchema”, True) • Default is to not infer (False) • Not inferring produces all Strings • Inferring is not always correct • Int/Double issues for example • Explicit Schema: .schema(schemaName) • Lines that do not match come in as all nulls CSV Options • Whitespace: .option(“ignoreLeadingWhiteSpace”, True) • There is also an ignoreTrailingWhiteSpace option • Default is to not ignore it (False) • Often causes malformed lines for non-strings • Malformed: .option(“mode”, “dropMalformed”) • Drops line if bad data found • There is also a fastFail option – error out • Default is permissive (False) – line is all nulls CSV Lazy Evaluation • As with everything in Spark, loading/processing CSVs uses lazy evaluation • Only read in what is needed via the lineage graph • If we only show/process 2 of the columns, then it will try to only read-in/process those columns • Seen with malformed processing • CSV are row-based files and not good for reading in only certain columns • Parquet is designed for this situation Multiple CSV Files • Multiple CSV files can be placed into a single directory • Just like we did with text files • Specify the directory name for the file rather than an individual .csv file • Spark will read in all .csv file in that directory in a distributed way CSV The End JSON JSON Format • JavaScript Object Notation • Often presented as multi-line • One object per file { "RecordNumber": 2, "Zipcode": 81601, "City": "GLENWOOD SPRINGS", "State": "CO" } JSON Format • While Spark can read/write multi-line JSON files, it does not scale well • One JSON object per line is much better • Row-based format just like CSV {"RecordNumber": 2,"Zipcode": 81601,City": "GLENWOOD SPRINGS","State": "CO"} {"RecordNumber": 5,"Zipcode": 80424,"City": ”BRECKENRIDGE","State": "CO"} … JSON Options • Similar mode and schema options to CSV • Multi-line • One per line is best • Compression • Files can be compressed with different codecs • AllowComments • JavaScript style comments included in file • AllowUnquotedFieldNames • Do strings have “ “ around them • Others… JSON The End Parquet Parquet Format • Columnar format • Data is stored column by column • Makes writing out/appending more time consuming since each row needs to have data written to many different locations • Makes reading much faster if only certain columns are needed • Spark applications are often write-once, read-many Parquet Read/Write • Use same basic load/save syntax df = spark.read.format(“parquet”).load(“dirname”) df.write.format(“parquet”).save(“dirname”) • But the filename is really a directory name since the parquet format will store many different files • As the end-user we just use the top-level dirname and let Spark do the rest Parquet Options • Very few special options compared to CSV • Most things are taken care of internally • Options for: • Compression • Schema merging • Can add to Schema after creation and this helps rectify • Partitioning Parquet PartitionBy • The partitionBy option will divide up the data into different folders on the filesystem • The DataFrame in memory is not modified or partitioned in any way • Just what is written • A key is specified, and each value of that key gets its own sub-folder in the partition directory • Need to be careful not to partitionBy columns with a large number of different values Parquet PartitionBy df.write.format(“parquet”) \ .mode(“overwrite”) \ .partitionBy(“letterGrade”) \ .save(“classGrades") • classGrades folder • • • • • letterGrade=A folder letterGrade=B folder letterGrade=C folder letterGrade=D folder letterGrade=F folder Parquet PartitionBy • Can speed up reading if filtering is done on loaded file in the same way it is partitioned • Recall Spark’s lazy evaluation df = spark.read.format(“parquet”).load(“classGrades”) df.select(“name”).filter(f.col(“letterGrade”) == “B”).show() Parquet The End Joins Joins • DataFrame joins are similar to joins from database tables • • • • Inner joins Outer joins Semi joins Anti joins • Each of the joins produces a new DataFrame Inner Joins • Joins the rows that exist in both DataFrames aDF.join(bDF, aDF["id"] == bDF["id"], "inner") • Inner joins are most common, so there are syntax shortcuts • Wanting an inner join aDF.join(bDF, aDF["id"] == bDF["id"]) • Same matching key in both aDF.join(bDF) Outer Joins • Left outer join keeps all rows in left DataFrame and adds info from right DataFrame • Nulls are placed when missing right data aDF.join(bDF, aDF["id"] == bDF["id"], "right_outer") • Right outer is just reversed • Probably no great reason to even use it as one can simply reverse the arguments Semi and Anti Joins • Left Semi keeps the values in the left DataFrame that had matching values in the right • Does not even use the values from the right DataFrame in the resulting DataFrame aDF.join(bDF, aDF["id"] == bDF["id"], "left_semi") • Left Anti keeps the values in the left DataFrame that did not have matching values in the right aDF.join(bDF, aDF["id"] == bDF["id"], "left_anti") Joins The End Join Performance Join Types • To join two tables, the matching keys need to be placed onto the same worker nodes • There are different types of joins that can take place • Full Shuffle Joins • Broadcast Joins • Which happens depends on the DataFrame size • And hints to the optimizer if given Full Shuffle Joins • Used when both DataFrames are large • Large enough to not fit into memory on a single worker node • Standard Shuffles • Slow since both DataFrames need to be shuffled • If done repeatedly, pre-partitioning by the key and persisting can reduce this to a single shuffle • df.repartition(“key”).persist() Broadcast Joins • Used when one DataFrame is small • Small enough to fit into memory on a single worker node • Broadcast variable is used to pre-send small DataFrame to every worker node • Faster than Full Shuffle since, after initial broadcast, all workers have all the information they require for the join Join Selection • Spark will pick the join type for you • See which it is using by looking at explain plan resultDF.explain() • Set the DataFrame cutoff size (or turn it off) spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) • Set hints to help optimizer aDF.join(f.broadcast(bDF)) Join Performance The End User Defined Functions User Defined Functions • SparkSQL provides many built-in functions • But there are times when you may need to define your own • This is done through User Defined Functions User defined functions • Use the UDF function to define • Provide function that takes one parameter and returns one parameter • Provide the return type doubleItUDF = udf(doubleIt, LongType()) def doubleIt(x): return x*2 User defined functions • Use of UDF in a DataFrame • Note that the UDF is called once for each row in DF df.select(‘a’, doubleItUDF(‘b’)) df.withColumn(‘doubleB’, doubleItUDF(‘b’)) User defined functions • To use the UDF in a SQL command spark.udf.register(“dbl”, doubleItUDF) spark.sql(“SELECT a, dbl(b) FROM tbl) User Defined Functions The End