Speaker: Chetan Khatri
Apache Spark is one of the top big-data processing platforms and has driven the adoption of Scala in many industry and academic settings. Because the entire Apache Spark framework is written in Scala, it is a real pleasure to explore the beauty of the functional Scala DSL with Spark.
This talk intends to present:
Primary data structures (RDD, Dataset, DataFrame) and their usage in universal large-scale data processing with HBase (data lake) and Hive (analytical engine).
Case study: He will go through the importance of physical data split-up techniques such as coalesce, partition, and repartition, along with other important Spark internals, in scaling to terabytes of data / ~17 billion records.
The talk also covers a crucial and very interesting aspect of parallel and concurrent distributed data processing: tuning memory, cache, disk I/O, memory leaks, internal shuffle, the Spark executor, the Spark driver, etc.
Room: Training room 2-1
Date: Saturday, 24th March, 2018
Produced by Engineers.SG