Improving PySpark Performance: Spark performance beyond the JVM - PyDataSG

Published on: Tuesday, 6 December 2016

Speaker: Holden Karau (@holdenkarau)

Abstract: This talk covers a number of important topics for writing scalable Apache Spark programs, with a special focus on Python - from RDD re-use to considerations for working with key/value data, including why avoiding groupByKey is important, and more. The talk also includes Python-specific considerations, such as the differences between DataFrames/Datasets and traditional RDDs in Python, and UDF performance. We also explore some tricks for intermixing Python and JVM code in cases where the performance overhead of Python with Spark is too high.
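The groupByKey point from the abstract can be illustrated with a small sketch (plain Python rather than PySpark, since the shuffle mechanics are what matter here): groupByKey ships every (key, value) record across the network, while reduceByKey first combines values within each partition (a map-side combine), so at most one record per key leaves each partition. The partitions and data below are invented for illustration.

```python
from collections import defaultdict

def shuffle_cost_groupbykey(partitions):
    # groupByKey-style: every (key, value) pair crosses the shuffle as-is.
    return sum(len(part) for part in partitions)

def shuffle_cost_reducebykey(partitions):
    # reduceByKey-style: values are pre-aggregated per partition,
    # so at most one record per distinct key leaves each partition.
    total = 0
    for part in partitions:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value  # local map-side combine
        total += len(combined)
    return total

# Two hypothetical partitions of (word, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

print(shuffle_cost_groupbykey(partitions))   # 7 records shuffled
print(shuffle_cost_reducebykey(partitions))  # 4 records shuffled
```

In actual PySpark the equivalent choice is rdd.reduceByKey(operator.add) versus rdd.groupByKey().mapValues(sum): both yield the same totals, but the former moves far less data across the cluster.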

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden speaks internationally on Spark and holds office hours at coffee shops at home and abroad. Holden is a co-author of numerous books on Spark, including High Performance Spark (which she believes is the gift of the season for those with expense accounts) and Learning Spark. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software, she enjoys playing with fire, welding, scooters, poutine, and dancing.

Event Page:

Produced by Engineers.SG
