Distributed In-Database Machine Learning with MADlib by Frank McQuillan

Wednesday, November 16 at 3:40-4:35

Data science is moving with gusto to the enterprise.  The potential for business value in the form of better products and customer experiences is driving this growth, as are mounting competitive pressures.  Interest is understandably high in many industries in how to build and run the right predictive analytics models on the right data in order to realize this business value.

For many enterprises today, their valuable data resides in relational databases and SQL is one of their main workloads.  They have made significant investments in infrastructure, software, and employee training that they want to leverage.  So how can an enterprise add a data science component to its business without a major IT re-architecture?

Apache MADlib (incubating) is an innovative SQL-based open source library for scalable in-database analytics.  It provides parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.

MADlib runs on the following open source platforms:  PostgreSQL, Greenplum Database, and Apache HAWQ (incubating), a Hadoop-native SQL database.  The PostgreSQL architecture is a key enabler that permits the machine learning computations to be brought to the data where it resides in the database.  This makes for excellent scale-out performance on massively parallel processing (MPP) platforms.

In this talk, we will describe the impetus behind creating a SQL-based scale-out machine learning project, review the architecture, and show how machine learning algorithms are implemented in a distributed manner.  We will give some examples of customers using MADlib and will also demonstrate MADlib using a visual notebook, focusing on some of the recent functionality added by the Apache community.  

Finally, we will talk about the future direction of the project and invite PostgreSQL developers and data scientists to participate in the Apache MADlib project.

About the Speaker

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets.  Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology.  He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.

 

Monday, September 12, 2016 - 15:45