Using PySpark and MLlib - PyDataSG

Published on: Tuesday, 6 December 2016

Speaker: Juliet Hougland (@j_houg)

Abstract: Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code and can leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit an ML model that predicts which customers of a telecommunications company are likely to stop using its service. It will cover Spark's DataFrames API for fast data manipulation, as well as ML Pipelines, which make model development and refinement easier.

Juliet Hougland answers complex business problems using statistics to tame multi-terabyte datasets. Cloudera’s customers have sought her out as a field-facing data scientist who advises on which tools to use, teaches how to use them, recommends the best approach for bringing together the right data to answer the business problem at hand, and builds production machine learning models. For many years Juliet has contributed to open source projects such as Apache Spark, Scalding, and Kiji. She is the Head of Data Science for Engineering at Cloudera.

Event Page:

Produced by Engineers.SG
