Engineers.SG

Autoencoder Forest for Anomaly Detection from IoT Time Series | SP Group

2019-07-25T11:46:09Z

Get the slides: https://www.datacouncil.ai/talks/time-based-autoencoder-ensemble-for-anomaly-detection-from-iot-time-series?hsLang=en

ABOUT THE TALK

In the energy/utility context, condition monitoring is one of the most important processes in the daily operation & maintenance of the equipment. With more and more IoT sensors being deployed on the equipment, there is an increasing demand for machine learning-based anomaly detection for condition monitoring. In this talk, I will discuss a method we designed for anomaly detection based on a collection of autoencoders learned from time-related information. This talk will cover the whole end-to-end flow of how this method is designed, and some energy-specific use cases will be used to demonstrate its performance.

ABOUT THE SPEAKER

Yiqun Hu is currently the Director, Data & AI at SP Digital and is responsible for driving the initiatives of data & AI for the whole SP Group. His team has built and manages the group's big data infrastructure and deployed production-ready AI solutions to transform the utility industry.

Before joining SP Group, Yiqun led data/AI teams in several industries, applying data science and machine learning to bring real impact to organizations including a global payment company (PayPal), an e-commerce company (eBay) and a leading financial institution in Asia (DBS).

Besides his experience in the industry, Yiqun also spent close to a decade in the academic R&D space as an AI researcher. He has published over 40 scientific papers in flagship international AI conferences/journals, e.g. TPAMI/TIP/TM, CVPR/ICCV/ECCV/ACMMM etc., as well as one book chapter. His publications have been cited over 1,700 times in other scientific publications.

ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.

FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai
Facebook: https://www.facebook.com/datacouncilai
Eventbrite: https://www.eventbrite.com/o/data-council-30357384520

7 Habits to Build Ethical AI | Teradata

2019-07-25T11:42:10Z

Get the slides: https://www.datacouncil.ai/talks/7-habits-to-build-ethical-ai?hsLang=en

ABOUT THE TALK

While AI is being applied to solving great problems of the world, it is subjected to questions regarding the morality of how it is constructed and used. Karthik Thirumalai addresses the 7 habits that are key to building ethical AI solutions which can be put to use for a better world. These habits cover Data Governance, Fairness, Privacy, Security, Accountability, Transparency and Education, all of which can help organizations successfully implement their AI strategy in a way that reflects fundamental human principles and moral values.

Learning Outcomes:

1. Why do we need Ethical AI?

2. What factors should one keep in mind while building AI systems?

3. How do I remove bias in AI systems?

4. What checklist should I follow to have a trustable AI system?

ABOUT THE SPEAKER

Karthik Bharadwaj Thirumalai is a Senior Data Scientist at Teradata. He is an analytics professional with expertise in Data Science & Artificial Intelligence coupled with strategic vision and experience in leading and building analytics teams. He is keen on solving business problems using data science and changing the world with AI.

Prior to that he was solving transportation problems with IBM Smart City Research in Singapore. He has a master’s degree from the National University of Singapore.

View from Apache Flink on Evolution & Outlooks for the Modern Stateful Stream Processor | Ververica

2019-07-25T11:16:05Z

Get the slides: https://www.datacouncil.ai/talks/a-view-from-apache-flink-on-evolution-and-outlooks-for-the-modern-stateful-stream-processor?hsLang=en

ABOUT THE TALK

Stream processing has evolved quickly in a short time: a few years ago, stream processing was mostly simple real-time aggregations with limited throughput and consistency. Today, many stream processing applications have complex logic, strict correctness guarantees, high performance, low latency, and maintain large state without databases. Stream processing has become much more sophisticated because the stream processors – the systems that run the application code, coordinate the distributed execution, route the data streams, and ensure correctness in the face of failures and crashes – have become much more technologically advanced. In this talk, we walk through some of the techniques and innovations behind Apache Flink, one of the most powerful open source stream processors.

In particular, we plan to discuss the evolution of stateful stream processing, Flink’s approach of fault-tolerance with distributed asynchronous and incremental snapshots, and how that approach looks today after multiple years of collaborative work with users running large scale stream processing deployments. Furthermore, we plan to discuss how stream processing is outgrowing its original space of real-time data processing and is becoming a technology that offers new approaches to data processing (including batch processing), real-time applications, and even distributed transactions.

ABOUT THE SPEAKER

Tzu-Li (Gordon) Tai is a Committer and PMC member of the Apache Flink project, and a Software Engineer at Ververica. His contributions to Flink span various components, including some of the most popular Flink streaming connectors (e.g. for Apache Kafka, AWS Kinesis, Elasticsearch, etc.), Flink's type serialization system, as well as several topics surrounding evolvability of stateful streaming applications. He is a frequent speaker at conferences such as Flink Forward and Strata Data, as well as many meetups related to Apache Flink or data engineering in general.

Delivering ML Models the Safe and Sane Way | Thoughtworks

2019-07-25T11:09:25Z

Get the slides: https://www.datacouncil.ai/talks/delivering-ml-models-the-safe-and-sane-way?hsLang=en

ABOUT THE TALK

Despite the hype around machine learning and AI, the lifecycle of ML models often ends in Kaggle competitions, hackathons and proofs of concept. Very few make it to production because individuals and teams inevitably encounter impediments in deployment, model management, and reproducibility, just to name a few. In this talk, we will share principles and practices on how we can overcome these challenges and enable teams to iteratively deliver ML solutions.

ABOUT THE SPEAKER

David Tan is a Software Engineer at Thoughtworks and a data science enthusiast.

Building Data Orchestration for Big Data Analytics in the Cloud | Alluxio Inc

2019-07-25T11:05:09Z

Get the slides: https://www.datacouncil.ai/talks/building-data-orchestration-for-big-data-analytics-in-the-cloud

ABOUT THE TALK

Cloud has been dramatically changing the landscape of data engineering as well as the behavior of data engineers. Specifically, data storage is migrating from the colocated model (e.g., HDFS) to a more cost-effective, more scalable but often fully disaggregated and remote data lake model (e.g. AWS S3). This has also created a strong need for data orchestration in the cloud like what Kubernetes does for container-based workloads, so that data can be presented in the right layout at the right location for data-consuming applications on the cloud.

Originally developed at UC Berkeley AMPLab as the research project "Tachyon", Alluxio (www.alluxio.io) implements the world’s first open-source data orchestration system in the cloud. Alluxio creates a unified access layer for data-driven applications in big data and ML, enabling Spark, Presto, TensorFlow and so on to transparently access different external storage systems while actively leveraging in-memory cache to accelerate data access.

In this talk, the speaker will present:

- New trends and challenges in the data ecosystem in the cloud era;

- Effective data engineering in the cloud world with data orchestration;

- Production use cases of using popular stacks like Presto/Alluxio/S3.

ABOUT THE SPEAKER

Bin Fan is the founding engineer of Alluxio, Inc. and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked for Google to build the next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University on the design and implementation of distributed systems and algorithms.

Building Data Products with Machine Learning at Zendesk | Zendesk

2019-07-25T08:28:11Z

Get the slides: https://www.datacouncil.ai/talks/building-data-products-with-machine-learning?hsLang=en

ABOUT THE TALK

At Zendesk we care deeply about customer experience and we believe we can build better experiences using Machine Learning. We’ve been on a journey over the past years, shipping Machine Learning powered products and growing a multidisciplinary team of excellent engineers, data scientists, designers and product managers. This talk will focus on the technical challenges we've faced, what we've learnt along the way and the successful approaches we’ve taken to make this a somewhat repeatable process.

ABOUT THE SPEAKER

Chris Hausler leads the Data Science team at Zendesk where he has spent the last few years building out the team and building customer facing AI products. Previously he's held the titles of data scientist, data engineer, researcher, student, consultant, programmer and before that student again. Through all these roles the single continuous theme has been a deep interest in data and the stories it can tell.

Revenue Maximization in the Shared Bike Business | Zoomcar

2019-07-25T08:21:11Z

Get the slides: https://www.datacouncil.ai/talks/bike-sharing-revenue-maximization?hsLang=en

ABOUT THE TALK

In 2017, Zoomcar launched India's first bike sharing service, PEDL, aiming to make shorter commutes convenient. This business comes with challenges such as maintaining high cycle availability at all times and managing IoT device malfunctions, vandalism, fleet re-balancing and many more. The talk will cover the methodology to overcome these burning challenges, keeping in mind both revenue maximization and customer experience.

ABOUT THE SPEAKER

Arpit Agarwal is the Director of Data Science at Zoomcar, India. He has 10 years of industry experience and has solved inventory optimization & customer retention problems across industries such as shared mobility and fashion retail.

He previously pioneered the NPS rollout and feedback analytics for leading fashion brands in India, work for which he received the “Best Use of Customer Insights to Enhance Customer Experience” award at India's Customer FEST 2017.

Arpit has authored articles on "Fleet Management" and "Analytics in IoT Cars" in various publications. He is also a regular contributor at data science and AI conferences across India. Apart from analytics, he has a deep interest in music & arts which inspires him to drive creativity at work.

Translating Source Code into Natural Language with AI | Quod AI

2019-07-25T08:12:55Z

Get the slides: https://www.datacouncil.ai/talks/translating-source-code-into-natural-language-with-ai?hsLang=en

ABOUT THE TALK

Software engineering collaboration is hard. Software engineers spend more than 70% of their time learning about their own team’s source code. Enormous size, constant change and intricate dependencies of the source code are the main factors at play. While one software engineer is not responsible for all lines of code for their company, a single line of their code can break the entire company’s app.

To help software engineers, we translate source code into natural language to make it easier to search, navigate and understand. At Quod AI, we are building an AI knowledge assistant which generates documentation (in Q&A format) from raw source code. In order to do that, we use neural network models, natural language processing algorithms and statistical models. We retrieve, store and analyze the source code and its history to get insights from the evolution of the code.

In this talk we will share some of the insights that we gained from analyzing more than 300 million lines of code.

ABOUT THE SPEAKER

Misha Filippov is chief scientist at Quod AI and a research fellow at University College London. At Quod AI he is applying natural language processing, deep learning and statistical models to explain source code in plain English.

Misha holds a PhD in physics from Nanyang Technological University. As a mathematical physicist he studied the dynamics of complex systems and has built mathematical and AI models for tsunami prediction, tropical atmosphere, the housing market, historical information and visual cortex of the human brain. His models have been used by NASA & the Monetary Authority of Singapore.

Data Modeling and Processing for a Travel Super App | Traveloka

2019-07-25T08:00:30Z

Get the slides: https://www.datacouncil.ai/talks/data-modeling-and-processing-for-a-travel-super-app?hsLang=en

ABOUT THE TALK

Traveloka is an app that provides a wide range of travel-related products and services, such as flights, hotels, apartments, theme parks, and even international roaming packages. Having a wide-ranging business makes data modeling particularly challenging: it is like building many data warehouses for different business flows in one place.

In order to address that, we developed a modeling method and framework that enables us to model the data across business units, and ensure data is uniform across the board so that data scientists can make sense out of it across all products and services.

We developed a data model schema with an inheritance and business glossary concept. The concept enforces uniformity of the data and consistent definitions across all our products. The schema enables data architects to model data schemas, data analysts to describe data definitions, data governance specialists to protect personal data, and data engineers to define cleansing rules, all in one place!

The framework is built on top of Python Apache Beam and currently runs on GCP DataFlow. Building on Apache Beam enables us to run the very same framework on our batch and streaming pipeline. The framework is inspired by JSON schema and BigQuery schema. We call it NeoDDL.
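
The talk description does not show what NeoDDL looks like, so the following is only a minimal sketch of the general idea described above: a shared business glossary holding definitions, personal-data flags and cleansing rules, plus schemas that inherit fields from a parent. Every name here (`GlossaryTerm`, `Schema`, `GLOSSARY`, `clean_record`, and the sample fields) is hypothetical and not taken from Traveloka's framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class GlossaryTerm:
    """One shared business definition (hypothetical structure)."""
    description: str                         # written by data analysts
    is_personal_data: bool = False           # set by data governance
    cleanse: Callable[[str], str] = str.strip  # rule defined by data engineers


@dataclass
class Schema:
    """A schema that lists glossary terms and may inherit from a parent."""
    name: str
    fields: List[str]
    parent: Optional["Schema"] = None

    def all_fields(self) -> List[str]:
        # Parent fields come first; child fields are appended without duplicates.
        inherited = self.parent.all_fields() if self.parent else []
        return inherited + [f for f in self.fields if f not in inherited]


# Hypothetical glossary: one definition per term, reused by every product.
GLOSSARY: Dict[str, GlossaryTerm] = {
    "booking_id": GlossaryTerm("Unique identifier of a booking"),
    "customer_email": GlossaryTerm(
        "Contact email", is_personal_data=True,
        cleanse=lambda v: v.strip().lower(),
    ),
    "hotel_name": GlossaryTerm("Name of the booked hotel"),
}

base_booking = Schema("booking", ["booking_id", "customer_email"])
hotel_booking = Schema("hotel_booking", ["hotel_name"], parent=base_booking)


def clean_record(schema: Schema, record: Dict[str, str]) -> Dict[str, str]:
    """Apply each field's cleansing rule, then mask personal data."""
    cleaned = {}
    for name in schema.all_fields():
        term = GLOSSARY[name]
        value = term.cleanse(record[name])
        cleaned[name] = "***" if term.is_personal_data else value
    return cleaned
```

Because `hotel_booking` inherits from `base_booking`, it picks up `booking_id` and `customer_email` with the same definitions and rules as any other product schema, which is one way the uniformity described above could be enforced.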

ABOUT THE SPEAKERS

Rendy is currently a Data System Architect at Traveloka. He built Traveloka's data pipeline from scratch and managed to handle a 10,000x growth of data. He also established a batch and realtime data platform which powers organization insights and serves data-intensive application use cases. He is currently focusing on solving data modelling and processing challenges that Traveloka faces as a Travel SuperApp with various businesses. Last but not least, he is a devoted dad to a cute daughter, and aspires to contribute to environmental informatics.

Joshua has been a Data Engineer at Traveloka since 2018, where he is developing a framework to create and manage end-to-end data warehouse pipelines. He holds a bachelor's degree in computer science from Nanyang Technological University.

Presto: Optimizing Performance of SQL-on-Anything | Starburst

2019-07-25T07:45:16Z

Get the slides: https://www.datacouncil.ai/talks/presto-optimizing-performance-of-sql-on-anything?hsLang=en

ABOUT THE TALK

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.

With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Google Cloud Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present the recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.

ABOUT THE SPEAKER

Kamil is a technology leader in the large scale data warehousing and analytics space. He is CTO of Starburst, the enterprise Presto company. Prior to co-founding Starburst, Kamil was the Chief Architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto. Previously, he was the co-founder and chief software architect of Hadapt, the first SQL-on-Hadoop company, acquired by Teradata in 2014.

Causal Inference: Making the Right Intervention | QuantumBlack

2019-07-25T07:40:54Z

Get the slides: https://www.datacouncil.ai/talks/causal-inference-making-the-right-intervention?hsLang=en

ABOUT THE TALK

Consider an organization seeking to improve its operations using its historical data. During this type of analysis the commonly known fact that “correlation does not imply causation” comes to life. It is crucial to distinguish between events that *cause* existing inefficiencies and those that merely correlate. Spending money to fix something that is not the root cause of the problem could be an expensive folly. Causal inference aims to determine which available controls drive specific outcomes. This is a distinctly more demanding condition than learning the correlation. Many machine learning approaches disregard causal inference, despite a wide range of approaches to causal inference having been proposed in the literature. This talk will discuss the importance of causal models, as well as some state-of-the-art methods for causal reasoning.
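
To make the distinction between correlation and causation concrete, here is a small simulation sketch. It is not taken from the talk. It assumes a made-up system in which a hidden factor `z` drives both an observed control `x` and an outcome `y`, while `x` itself has no effect on `y`. All names and coefficients are illustrative assumptions.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, rng):
    """Return (xs, ys) where hidden z drives y and x never affects y."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                # hidden common cause
        if intervene:
            x = rng.gauss(0, 1)            # we set x ourselves, ignoring z
        else:
            x = z + 0.5 * rng.gauss(0, 1)  # x is merely observed, driven by z
        y = z + 0.5 * rng.gauss(0, 1)      # y depends on z only
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
observed = correlation(*simulate(5000, intervene=False, rng=rng))
intervened = correlation(*simulate(5000, intervene=True, rng=rng))
print(f"observational correlation: {observed:.2f}")
print(f"correlation under intervention: {intervened:.2f}")
```

In the observational data `x` and `y` are strongly correlated, yet once `x` is set independently of `z` the correlation drops to roughly zero. In this toy system, spending money to change `x` would not move `y` at all.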

ABOUT THE SPEAKER

Paul Beaumont is a Senior Data Scientist at QuantumBlack, an advanced analytics consultancy based in Singapore. He works on statistical models for explanatory, predictive and prescriptive problems, and his role involves designing mathematical models to help clients understand pertinent questions about their data. Paul holds a PhD in Mathematics & Computer Science from Imperial College London, and leads QuantumBlack’s R&D efforts in Causal Inference.

Sparklens: Understanding the Scalability Limits of Spark Applications | Qubole

2019-07-25T07:33:54Z

Get the slides: https://www.datacouncil.ai/talks/sparklens-understanding-the-scalability-limits-of-spark-applications?hsLang=en

ABOUT THE TALK

One of the common requests we receive from customers at Qubole is to debug a slow Spark application. Usually this process is done with trial and error, which takes time and requires running clusters beyond normal usage (read wasted resources). Moreover, it doesn’t tell us where to look for further improvements. We at Qubole are looking into making this process more self-serve. Towards this goal we have built Sparklens (https://github.com/qubole/sparklens), an OSS tool based on Spark's event listener framework.

From a single run of the application, Sparklens provides insights about the scalability limits of a given Spark application. In this talk we will cover what Sparklens does and the theory behind it. We will talk about how the structure of a Spark application puts important constraints on its scalability, how we can find these structural constraints, and how to use them as a guide in solving performance and scalability problems of Spark applications.

This talk will help the audience with answering the following questions about their Spark applications: 1) Will their application run faster with more executors? 2) How will cluster utilization change as the number of executors changes? 3) What is the absolute minimum time this application will take even if we give it infinite executors? 4) What is the expected wall clock time for the application when we fix the most important structural limits of these applications?

Sparklens makes the ROI of additional executors extremely obvious for a given application and needs just a single run of the application to determine how the application will behave with different executor counts. Specifically, it will help managers take the correct side of the tradeoff between spending developer time optimizing applications vs. spending money on compute bills.
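
The questions above can be illustrated with a deliberately simplified model. This is a sketch only, not Sparklens's actual algorithm: it assumes stages run strictly one after another, that the driver time is fixed, and that per-task durations from a single run are available. The `Stage` class, both functions, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Hypothetical summary of one Spark stage taken from a single run."""
    task_seconds: List[float]  # duration of each task in the stage


def estimate_wall_clock(driver_seconds: float, stages: List[Stage],
                        total_cores: int) -> float:
    """Estimate run time on `total_cores` cores, stages run sequentially.

    A stage can finish no faster than its longest task, and no faster than
    its total work spread perfectly across the available cores.
    """
    total = driver_seconds
    for stage in stages:
        perfect_split = sum(stage.task_seconds) / total_cores
        longest_task = max(stage.task_seconds)
        total += max(perfect_split, longest_task)
    return total


def minimum_wall_clock(driver_seconds: float, stages: List[Stage]) -> float:
    """Lower bound with unlimited cores: driver time plus longest task per stage."""
    return driver_seconds + sum(max(s.task_seconds) for s in stages)


# Made-up application: one evenly split stage and one skewed stage.
stages = [Stage([10.0] * 8), Stage([5.0, 5.0, 40.0])]

for cores in (1, 4, 8, 64):
    print(cores, estimate_wall_clock(20.0, stages, cores))
print("minimum:", minimum_wall_clock(20.0, stages))
```

Under these assumptions the estimate falls from 150 seconds on 1 core to 80 seconds on 4 cores, then flattens at 70 seconds from 8 cores onward: the 20 seconds of driver time and the single 40-second task in the skewed stage set a floor that extra executors cannot lower.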

ABOUT THE SPEAKER

Ashish is a Big Data leader and practitioner with more than 15 years of industry experience. With extensive experience in the design and development of petabyte-scale Big Data applications, he is a seasoned technology architect with varied experience in customer-facing and technical leadership roles.

Ashish heads Qubole's Solutions Architecture team for International Markets and works with a number of enterprise customers in the EMEA, APAC and India regions. Prior to Qubole, Ashish worked at Microsoft as an engineer in the Windows team. Later, he worked for Claraview (Teradata), where he led their Big Data practice and helped scale some of their Fortune 500 clients in different industry verticals such as finance, healthcare, retail and multimedia.

Data Architecture 101 for Your Business

2019-07-25T07:29:11Z

Get the slides: https://www.datacouncil.ai/talks/data-architecture-101-for-your-business?hsLang=en

ABOUT THE TALK

Setting up your data architecture can be tricky and confusing without knowing what the future holds for your company’s growth. Some might have attempted to sell you off-the-shelf solutions, or you could have been overwhelmed by hearing about countless different technologies, concepts, and big data engines that are scalable without limit... Right? Or just go with Google Analytics since your marketing team is already keen on that? Do you have a hunch about what you should use?

I have worked on and built multiple data architectures for companies of different sizes, from only a few thousand to billions of active users, and used modern technologies such as Azure SQL Data Warehouse, Redshift, Presto, Hive, Spark, Airflow, Kinesis Data Firehose. From my experience at Facebook and Microsoft, I know how these tools can be used efficiently and what the industry's best practices are.

In this talk I will guide you through what solutions are available for all company sizes, and when is the right time to add or replace architecture elements for better scaling and/or better engineering. I will cover the caveats and deep technical tricks for getting the most out of these tools. Moreover, I will explain how to avoid building or setting up overcomplicated systems, and when you should hire data scientists or data engineers.

ABOUT THE SPEAKER

Bence Faludi is passionate about building data architectures and making reusable datasets that can help businesses to grow and understand their audiences. He likes simplicity, and elegant but flexible solutions for all problems.

He works as an independent consultant for various clients simultaneously to set up their Big Data platform. Nowadays, he is helping Wowcher to optimize their data processing.

Bence has worked at Facebook, Microsoft, Wunderlist as a Data Engineer; published a few open source libraries such as mETL, night-shift; and worked for a few months as a consultant for eyeo - the company behind Adblock Plus.

Cleaning up the AppDelegate - iOS Dev Scout

2019-07-25T05:21:20Z

Speaker: Kenneth Poon - Principal Software Engineer at SP Digital
Create pluggable services for your iOS Apps.

Event Page:
https://www.meetup.com/Singapore-iOS-Dev-Scout-Meetup/events/262896888/

Recorded by Vina Melody
Produced by Engineers.SG

Help us caption & translate this video!

https://amara.org/v/qSGU/

Scalability of your iOS project along with your team - iOS Dev Scout

2019-07-25T05:20:02Z

Speaker: Benoit Pasquier, Senior iOS Developer at Zalora
How do you make sure your iOS project keeps following best practices while your team is growing? Let’s see which tools to use and how defining standards can help prepare your iOS project to scale along with your team.

Event Page:
https://www.meetup.com/Singapore-iOS-Dev-Scout-Meetup/events/262896888/

Recorded by Vina Melody
Produced by Engineers.SG

Help us caption & translate this video!

https://amara.org/v/qSGV/

Workshop: Deploying model with TFLite in Android Application

2019-07-24T15:07:31Z

Speaker: Ying Ka Ho

In this workshop, Ka Ho explains how to build an Android application with CameraX and TFLite and deploy TensorFlow models into an Android application.

For the repository of the project used for the workshop: https://github.com/bigdatasg/TensorflowClassification

Event: Google I/O Recap 2019 Singapore AI - From Model to Device by BigDataX
Event Page: https://www.meetup.com/BigDataX/events/262196916/

Produced by Engineers.SG

Help us caption & translate this video!

https://amara.org/v/qR8k/

Workshop: Modelling with Tensorflow 2.0

2019-07-24T15:04:07Z

Speaker: Wei Yang

In this workshop, Wei Yang provides a walkthrough on how to make use of transfer learning to build an emotion classification model using TensorFlow 2.0.

For a full description of the workshop: https://github.com/bigdatasg/ai_from_data_to_device/blob/master/module02_modeling.md

Event: Google I/O Recap 2019 Singapore AI - From Model to Device by BigDataX

Event Page: https://www.meetup.com/BigDataX/events/262196916/

Produced by Engineers.SG

Help us caption & translate this video!

https://amara.org/v/qR8l/

Argo: Kubernetes Native Workflows and Pipelines | Canva

2019-07-24T07:26:03Z

Get the slides: https://www.datacouncil.ai/talks/argo-kubernetes-native-workflows-and-pipelines?hsLang=en

ABOUT THE TALK

Data orchestration and DAGs are something that most data teams need. There are many commercial and open source options available. Examples include Airflow, Luigi, Oozie and many others.

Airflow is very popular at the moment and rightly so; it is a very useful tool and is the backbone of very productive data teams. Argo is a relatively new challenger. It is a Kubernetes native workflow engine.

At Canva, we evaluated both Airflow and Argo and chose Argo as our primary data orchestration system. In this talk I’ll briefly compare Airflow and Argo, talk about the evaluation process we undertook and how we came to our decision. Finally, I’ll talk about our experience using it so far, the things that have been good and the things that have been not so good.

ABOUT THE SPEAKER

Greg is the Data Engineering Lead at Canva. He founded the team and quickly grew it to build out scalable systems that enable advanced analytics and machine learning features inside the Canva product.

Previously he was the Co-founder and CTO of AirHelp, a Y Combinator backed startup, where he built systems to process flight and booking data in real-time.

Greg is passionate about technology and is frequently involved with meetups, such as the Sydney Data Engineering meetup.