Bol.com: Analyzing and designing a scalable AI infrastructure

Bol.com, the largest e-commerce company in the Netherlands and part of the Fortune Global 500 company Ahold Delhaize, wanted to set up a scalable AI infrastructure for the future. They asked us to analyze and advise on the design of a scalable AI infrastructure based on the Google Cloud Platform. This was an extensive, complex and fulfilling project to work on.



Bol.com produces massive amounts of data on a daily basis, ranging from customer data to logistics data and many other data types. They already employ a sizeable number of data scientists to crunch the data they produce and make sense of it. Nonetheless, the ever-changing field of machine learning and artificial intelligence presents a steady stream of interesting developments and challenges that can be hard to stay up to date with.

One of the challenges they were facing was setting up a scalable AI infrastructure on top of which their data scientists could run their models and analyses. They were in urgent need of a platform that would make their data science pipelines scalable and efficient. They had trouble separating their data sources from the infrastructure used for training and serving their machine learning models. Furthermore, they were looking for sizeable efficiency gains to shorten the timespan from initial machine learning model development to bringing these models to production. On top of this, they also faced privacy challenges that needed consideration.

Bol.com asked us to help them with this endeavour. Since Bol.com makes heavy use of the Google Cloud Platform, we directed our focus to the Google Cloud Platform and especially the Google AI Platform. The ultimate goal of this project was to investigate how the Google AI Platform and Google Cloud infrastructure could be used to set up the aforementioned AI infrastructure.

The starting point of this project was a deep dive into the Cloud AI infrastructure Google offers, in order to get a detailed overview of which functionalities of the Google Cloud AI Platform could be useful for Bol.com. During the initial stage of the project we read a lot of technical documentation, accompanied by discussions with the Google engineers Bol.com works closely with. During this stage we also spoke extensively with Bol.com software engineers and data scientists to identify the main issues they were facing.



On the basis of these discussions and our own research, we identified the following components as particularly useful and valuable for Bol.com to get ahead with their machine learning and artificial intelligence ambitions:

1. Kubeflow

2. Private Jupyter notebook hub

3. AI platform training and serving

4. Automatic hyperparameter optimization

Kubeflow is a free and open-source machine learning platform designed to orchestrate complicated machine learning workflows on Kubernetes. Since Bol.com already makes heavy use of Kubernetes, Kubeflow proved to be the best platform for deploying scalable and efficient machine learning pipelines. With Kubeflow you can separate the different parts of your machine learning pipeline, so that your architecture becomes component-based, with each component having its own responsibilities. For example, you can separate the data transformation stage of your pipeline from the training stage of a machine learning model. In this way your machine learning pipeline becomes decoupled and more maintainable.
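To make the idea of decoupled components concrete, here is a minimal, hypothetical sketch in plain Python. In a real Kubeflow pipeline each function would be packaged as a containerized component via the kfp SDK; the function names and the toy data below are purely illustrative.

```python
def transform_data(raw):
    """Data-transformation component: normalizes raw values.

    In Kubeflow this would run as its own containerized step,
    independent of the training step that consumes its output.
    """
    peak = max(raw)
    return [x / peak for x in raw]


def train_model(features):
    """Training component: consumes only the transformed data.

    A stand-in for a real model fit; returns the feature mean.
    """
    return sum(features) / len(features)


# The "pipeline" merely wires components together, so either step
# can be replaced or scaled without touching the other.
features = transform_data([2, 4, 8])
model = train_model(features)
```

Because the training step never touches the raw data source, swapping in a new data source or a new model only changes one component, which is exactly the maintainability benefit described above.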

A private Jupyter notebook hub proved especially useful for Bol.com data scientists to rapidly experiment with new models without worrying about the required compute or struggling with machine learning library imports. By deploying a private Jupyter notebook hub, data scientists can rapidly experiment with model development, helping Bol.com bring down the timespan from initial model development to production.
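As an illustration of how little glue such a hub needs on Kubernetes, the following is a sketch of a JupyterHub configuration file (`jupyterhub_config.py`). The specific image name and resource limits are assumptions for illustration, not values from the Bol.com setup.

```python
# jupyterhub_config.py -- sketch of a private hub on Kubernetes.
# Spawn each user's notebook server as its own Kubernetes pod,
# so compute scales with the number of active data scientists.
c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

# A pre-built image with the ML libraries already installed,
# avoiding per-user import and installation struggles.
# (Image name is a placeholder.)
c.KubeSpawner.image = 'example-registry/datascience-notebook:latest'

# Per-user resource guarantees and limits (illustrative values).
c.KubeSpawner.cpu_limit = 4
c.KubeSpawner.mem_limit = '16G'
```

The key point is that the hub, not the individual data scientist, owns the compute and the library environment, which is what enables the rapid experimentation described above.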

AI Platform training and serving are both especially useful when you want a scalable, efficient and consistent training and serving architecture for your machine learning and artificial intelligence models, and both can be integrated into Kubeflow pipelines. These Google Cloud AI Platform components require your training and serving code to follow a consistent format, which encourages consistent code procedures. But above all, the greatest value of these services lies in not having to invest in and worry about a training and serving architecture yourself. This leaves more time for actual model development and less time spent creating and maintaining your own AI infrastructure.

When using Google's AI Platform training and serving services, you can also use Google's automatic hyperparameter optimization, which is based on advanced Bayesian optimization. Since hyperparameter optimization is very important for the accuracy and usefulness of your models, this was another reason to advise using the Google AI Platform training and serving services.
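To show what automatic hyperparameter optimization saves you from, here is a hypothetical sketch of the naive alternative: manually searching candidate values against a validation score. The objective function and candidate values are made up for illustration; AI Platform replaces this exhaustive loop with a Bayesian optimizer that proposes promising candidates based on earlier trials.

```python
def validation_score(learning_rate):
    """Stand-in validation objective.

    A toy function that peaks at learning_rate = 0.01; in practice
    each evaluation would be a full training run.
    """
    return -abs(learning_rate - 0.01)


# Naive manual search: try every candidate and keep the best.
# Each candidate costs one complete training run, which is why a
# Bayesian search that needs fewer trials is so valuable.
candidates = [0.0001, 0.001, 0.01, 0.1]
best = max(candidates, key=validation_score)
```

A Bayesian optimizer treats `validation_score` as an expensive black box and models it probabilistically, so it can reach a good hyperparameter with far fewer training runs than this grid of candidates.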

All of our research, findings and advice were discussed with the responsible Bol.com managers and formed the starting point for Bol.com to deploy the advised infrastructure and scale their AI efforts. Since infrastructure components like these are not deployed overnight, the advised resources are being implemented gradually over time.
