“This is f***ing crazy. Every single player is seeing a different deal!” It was 2017, and I had just deployed an AI-driven sales infrastructure for a gaming company when I saw messages like the one above from our most fanatical players. I remember being excited that our users had even noticed that we were customizing their experience. But my elation came crashing down when I realized that the initial model we deployed didn’t move revenue at all. In fact, it took many more months before we found the winning approach. Along this journey I learned a lot about building a very responsive and nimble AI infrastructure.
Fast forward half a decade: now that user-centric design is the cornerstone of Product-Led Growth (PLG), it is all the more important to deliver customized experiences to your users. However, as I learned the hard way many years ago, it is equally important to quantify the impact of your customizations and to have an infrastructure in place that makes it easy to iterate. Towards that end, this article is designed to help PLG startups plan out their roadmap by giving a sample architecture to strive towards. Of course, I would in no way claim that the described architecture is the best for everyone, or even the best AI architecture in general. The hope is that this can be a useful starting point for debating the merits of different approaches.
In order to avoid vendor-specific debates polluting the discussion, I’m avoiding naming any specific vendor or open source product. However, I am describing the specific requirements that each of the boxes below should satisfy and why. This should help you flesh out your specific infrastructure with your unique preferred solutions.
Stage 1: The Base App
Let’s start with the most simplistic App. Let’s assume that this has some kind of user interface and it is backed up by a database. Yes, this looks cartoonish, but it is surprising how many Apps can be described in such simple terms if we bundle all of the middleware-related optimizations in the “App” bubble below.
Now, the backend could be any kind of database really, but it has to support a high degree of concurrency since millions of users will be using the App simultaneously (we hope, right!). Also, the most common access pattern involves selecting and modifying a small number of rows at a time. Databases that are optimized for such a usage pattern traditionally come under the umbrella of OLTP (Online Transaction Processing) Databases. This includes what are fashionably called NoSQL or Document Databases.
Stage 2: Observability
A very natural next step on the base App is to provide some storage mechanism for the various log files that all of the application code and included libraries will spit out. These can be redirected to any log storage solution as long as it provides the ability to search for patterns and can alert your DevOps team in response to unusual errors.
In order to help your DevOps team extract more structured meaning from the log files, specific metrics will be added to track critical activities in the App. These metrics can be something as simple as
- the number of successful and failed logins,
- the real-time revenue (in case of gaming apps),
- or the number of users using some key functionality of the App.
These metrics are designed for DevOps to quickly identify any anomalies and not necessarily to slice and dice the App usage by user groups. As such, these metrics could include low-cardinality dimensions such as the country name (which can have a few hundred unique values) but not necessarily the user identifier (which could be in the millions, right?).
These metrics are typically sent to a time-series database that provides a simple interface to receive records such as:
time=t country=us device=iphone6 event=login count=1
The database should provide the ability to create dashboards on these metrics and also provide some form of alerting on anomalies. For example, if the number of logins from the UK suddenly drops then the DevOps team should be notified. Note that if the number of logins from a particular user were to drop then that is not something of concern to DevOps, but it would be more interesting to the Business Ops team which we will describe in the next stage.
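As a rough sketch of the idea, the snippet below uses an in-memory counter as a stand-in for a time-series database and raises an alert when logins from a country fall well below a recent average. The `MetricStore` class, the `login_drop_alert` function, and the `drop_ratio` threshold are all hypothetical names chosen for illustration; a real time-series database would provide its own ingestion and alerting interfaces.

```python
from collections import Counter


class MetricStore:
    """Hypothetical in-memory stand-in for a time-series database."""

    def __init__(self):
        self.counts = Counter()

    def record(self, minute, country, device, event, count=1):
        # Only low-cardinality dimensions form the key; there is no user id.
        self.counts[(minute, country, device, event)] += count

    def total(self, minute, country, event):
        # Sum across devices for one minute, country, and event type.
        return sum(
            c
            for (m, ctry, _device, ev), c in self.counts.items()
            if m == minute and ctry == country and ev == event
        )


def login_drop_alert(store, country, current_minute, baseline_minutes, drop_ratio=0.5):
    """Return True if current logins fall below drop_ratio times the baseline average."""
    baseline = [store.total(m, country, "login") for m in baseline_minutes]
    if not baseline:
        return False
    average = sum(baseline) / len(baseline)
    return store.total(current_minute, country, "login") < drop_ratio * average


store = MetricStore()
for minute in (0, 1, 2):
    store.record(minute, "uk", "iphone6", "login", count=100)
store.record(3, "uk", "iphone6", "login", count=20)

alert = login_drop_alert(store, "uk", current_minute=3, baseline_minutes=[0, 1, 2])
print("notify DevOps:", alert)
```

A fixed ratio against a short baseline is only one possible anomaly rule; seasonal traffic usually calls for something more careful.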
Stage 3: Events
The next stage in our hypothetical App’s evolution would be to add Events. Simply put, Events are highly structured objects that capture key activity in the App. They can be thought of as an extension of metrics to include dimensions of high cardinality. In fact, they could even be extracted from log files. The key feature of Events that distinguishes them from Metrics is that Events are used by Business Ops to analyze the health of the Product from a customer usage point-of-view while Metrics are used by DevOps to analyze the health of your Product from an infrastructure point-of-view.
For example, the metric that we recorded earlier could be extended to an event as follows:
time=t country=us device=iphone6 event=login userid=u first_login=t1 tier=free count=1
Notice that the presence of a high-cardinality dimension such as userid makes it difficult for time-series databases to handle these events efficiently. We also added a couple of extra dimensions, first_login and tier, to this event. We would typically add a number of such additional dimensions, as we will explain below.
These events are sent to a queue from which a stream processing service is used to create real-time dashboards:
The dashboards, for example, could show the number of logins by age and by user tier. This can help Business Ops folks identify churn trends in different user cohorts. Here we can see the importance of augmenting the events with additional dimensions. If we didn’t have the first login time or the user tier in the event then the dashboarding service would have to lookup the user details from the OLTP database which would in turn make it unrealistic to have a real-time dashboard.
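To make the enrichment point concrete, here is a minimal sketch in which events already carry first_login and tier, so the aggregation never touches the OLTP database. The function names, the age buckets, and the assumption that time and first_login are epoch seconds are all illustrative choices, not part of any particular stream processing product.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def enrich(raw_event, user_profile):
    """Copy slowly changing user attributes onto the event when it is emitted."""
    event = dict(raw_event)
    event["first_login"] = user_profile["first_login"]
    event["tier"] = user_profile["tier"]
    return event


def age_bucket(event):
    """Bucket the user's account age (in days) at the time of the event."""
    days = (event["time"] - event["first_login"]) // SECONDS_PER_DAY
    if days < 7:
        return "0-6d"
    if days < 30:
        return "7-29d"
    return "30d+"


def logins_by_cohort(events):
    """Count logins per (account age bucket, tier) using only fields on the event."""
    counts = Counter()
    for e in events:
        if e["event"] == "login":
            counts[(age_bucket(e), e["tier"])] += e.get("count", 1)
    return counts


profiles = {
    "u1": {"first_login": 0, "tier": "free"},
    "u2": {"first_login": 0, "tier": "paid"},
}
raw = [
    {"time": 3 * SECONDS_PER_DAY, "userid": "u1", "event": "login", "count": 1},
    {"time": 3 * SECONDS_PER_DAY + 60, "userid": "u1", "event": "login", "count": 1},
    {"time": 40 * SECONDS_PER_DAY, "userid": "u2", "event": "login", "count": 1},
]
events = [enrich(r, profiles[r["userid"]]) for r in raw]
cohorts = logins_by_cohort(events)
print(dict(cohorts))
```

In this sketch the profile lookup happens once, where the event is produced, rather than on every dashboard refresh.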
Stage 4: Data ****house
The final stage in the pre-AI architecture is to create a home for all the data that you are gathering. This is a place where you can archive all of your events and if necessary join them with the full contents of your OLTP database to perform very deep business analytics. This stage used to be called Data Warehouse and later Data Lake and more recently Data Lakehouse. Labels aside here are the key aspects of this stage:
- Cheap immutable object storage to archive your past events.
- A replica of your OLTP database to avoid analytics workloads from impacting your App’s performance.
- Some analytics database service that can join data from both of the above sources and allow you to query it in a high-level language like SQL.
An Analytics database is optimized for a very different access pattern than an OLTP database. While the latter works very well for concurrent access to individual rows, the former is designed to scan over very large partitions and perform aggregation or windowing operations. It is also quite normal for such databases to be synchronized every few hours or even once a day. Since the main use of this stage is to monitor long-term business trends, this delay is not significant.
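The kind of query this stage enables can be sketched with Python’s built-in sqlite3 module standing in for the analytics database. The table names, columns, and sample rows below are invented for illustration; in practice the events would come from the object store and the users table from the OLTP replica.

```python
import sqlite3

# In-memory SQLite is only a stand-in for a real analytics database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (userid TEXT PRIMARY KEY, tier TEXT, country TEXT);
    CREATE TABLE events (day TEXT, userid TEXT, event TEXT);
    """
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("u1", "free", "us"), ("u2", "paid", "uk"), ("u3", "free", "uk")],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("day1", "u1", "login"),
        ("day1", "u1", "login"),
        ("day1", "u2", "login"),
        ("day2", "u3", "login"),
        ("day2", "u2", "purchase"),
    ],
)

# Scan archived events, join with user attributes, and aggregate per day and tier.
rows = conn.execute(
    """
    SELECT e.day, u.tier, COUNT(DISTINCT e.userid) AS active_users
    FROM events AS e
    JOIN users AS u ON u.userid = e.userid
    WHERE e.event = 'login'
    GROUP BY e.day, u.tier
    ORDER BY e.day, u.tier
    """
).fetchall()

for row in rows:
    print(row)
```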
Stage 5: Model Training
Now that we have a baseline architecture, we are in a position to understand our user behavior. We can start building Machine Learning Models to try and predict various things about our user such as:
- Will a user log in to the App tomorrow?
- Will they spend money on the App?
- Will they use Feature X?
- Will they like to converse with another user Y?
These models can subsequently be used to customize the user experience, but first we need to be able to do a lot of exploratory analysis. Some kind of a Notebook service (for example Jupyter Notebooks) that can be connected to your Analytics Database and have access to a number of libraries for data munging, scientific computing, and model training can be used by Data Scientists in your team to come up with some concrete ideas. At this point the Data Scientists can directly save the model in a Model Storage or set up an asynchronous job to periodically pull data and train a model.
The key requirement of the Notebook service is that it should be easy for the Data Scientists to share their work with each other, perhaps even to version control their analysis. In addition, they should be able to directly share their analysis with non-technical folks.
An Asynchronous Jobs Framework can be very simple, such as a service that runs a single job periodically. Or it can be much more sophisticated with an ability to create a DAG (Directed Acyclic Graph) of Jobs that allows for very complex and large-scale distributed data processing. The Model Storage, on the other hand, can be as simple as a cheap immutable object store.
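A toy version of the DAG idea can be sketched with the standard library’s graphlib module (Python 3.9 or later). The job names and the `run_dag` helper are hypothetical, and everything runs sequentially in one process, which is far simpler than what a real jobs framework would do.

```python
from graphlib import TopologicalSorter


def run_dag(jobs, dependencies):
    """Run jobs in dependency order.

    jobs: job name -> callable that receives the results of earlier jobs.
    dependencies: job name -> set of job names that must finish first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = jobs[name](results)
    return order, results


# A hypothetical four-step training pipeline with toy computations.
jobs = {
    "extract": lambda r: [1, 2, 3, 4],
    "features": lambda r: [x * 2 for x in r["extract"]],
    "train": lambda r: sum(r["features"]) / len(r["features"]),
    "save": lambda r: {"model_mean": r["train"]},
}
dependencies = {
    "extract": set(),
    "features": {"extract"},
    "train": {"features"},
    "save": {"train"},
}

order, results = run_dag(jobs, dependencies)
print(order)
print(results["save"])
```

`TopologicalSorter` raises `CycleError` if the dependencies contain a cycle, which is the "acyclic" part of the DAG being enforced.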
Stage 6: Model Serving
Once we have some interesting Models of user behavior we can try to customize their experience. For example, if we think they would enjoy using Feature X, but they haven’t used it yet, we can suggest it to them. Or if we think they are about to churn, then we could give them some enticing offer that could persuade them to engage a bit more with the App. However, it is not very easy to deploy a model, and the choices made in this stage will determine whether you are able to actually deploy any AI in your App or whether the modeling work of your Data Scientists remains a curiosity!
The key component that is needed in this stage is an Online Feature Store. This component listens to all of the events in real time and creates relevant features for each user. Some of the features could be summary statistics such as the number of times that the user has used Feature X in the last ten minutes while other features could keep bounded sets of records such as the last 50 items that the user has browsed in the App. The exact set of features depends on the model that we wish to deploy and the features that have been used in that model by the Data Scientists.
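The two kinds of features mentioned above, a windowed count and a bounded set of recent records, could look roughly like the sketch below. The `OnlineFeatureStore` class, its method names, and the event field names are assumptions made for illustration; it keeps everything in process memory, which a production feature store would not.

```python
from collections import defaultdict, deque


class OnlineFeatureStore:
    """Hypothetical in-memory feature store fed by the event stream."""

    def __init__(self, window_seconds=600, max_items=50):
        self.window_seconds = window_seconds
        # (userid, feature name) -> timestamps of recent uses, oldest first.
        self.feature_use_times = defaultdict(deque)
        # userid -> the most recent items browsed, capped at max_items.
        self.recent_items = defaultdict(lambda: deque(maxlen=max_items))

    def on_event(self, event):
        uid = event["userid"]
        if event["event"] == "feature_used":
            self.feature_use_times[(uid, event["feature"])].append(event["time"])
        elif event["event"] == "item_browsed":
            self.recent_items[uid].append(event["item"])

    def feature_use_count(self, uid, feature, now):
        """Number of uses with a timestamp later than now - window_seconds."""
        times = self.feature_use_times[(uid, feature)]
        while times and times[0] <= now - self.window_seconds:
            times.popleft()  # evict timestamps that fell out of the window
        return len(times)

    def last_items(self, uid):
        return list(self.recent_items[uid])


fs = OnlineFeatureStore()
for t in (0, 500, 900):
    fs.on_event({"userid": "u1", "event": "feature_used", "feature": "X", "time": t})
for i in range(60):
    fs.on_event({"userid": "u1", "event": "item_browsed", "item": i, "time": 1000 + i})

print(fs.feature_use_count("u1", "X", now=1000))
print(len(fs.last_items("u1")))
```

The deque with `maxlen` silently drops the oldest item once the cap is reached, which is what keeps the record set bounded.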
A component known as the Model Serving provides the ability for the App to quickly make any real-time prediction that is needed to customize the user experience. This component queries the Online Feature Store and looks up the specific model from the Model Storage to compute the desired prediction.
Sometimes, for very common predictions, the Model Serving layer can preemptively compute them and store them in the main OLTP Database. For example, every time a user logs into the App we might want to compute a time-to-churn estimate and store it. Subsequently, every microservice within the App can directly look up this estimate and use it to customize the experience.
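The flow described in this stage might be sketched as follows. Everything here is a simplification under stated assumptions: plain dictionaries stand in for the Model Storage, the Online Feature Store, and the OLTP Database, and the "model" is a made-up linear formula with invented weights rather than anything a Data Scientist actually trained.

```python
# Hypothetical stand-ins for the Model Storage and the Online Feature Store.
MODEL_STORAGE = {
    "time_to_churn_v1": {
        "bias": 30.0,
        "weights": {"days_since_last_login": -2.0, "sessions_last_week": 1.5},
    }
}
FEATURE_STORE = {
    "u1": {"days_since_last_login": 5.0, "sessions_last_week": 4.0},
}


def predict(model_name, features):
    """Apply a toy linear model; missing features default to 0."""
    model = MODEL_STORAGE[model_name]
    score = model["bias"] + sum(
        weight * features.get(name, 0.0) for name, weight in model["weights"].items()
    )
    return max(score, 0.0)  # a time-to-churn estimate in days cannot be negative


def on_login(userid, oltp_table):
    """Precompute the estimate at login and store it for other services to read."""
    estimate = predict("time_to_churn_v1", FEATURE_STORE.get(userid, {}))
    oltp_table[userid] = {"time_to_churn_days": estimate}
    return estimate


oltp_users = {}  # stand-in for a table in the OLTP database
on_login("u1", oltp_users)
print(oltp_users["u1"])
```

Any other service can now read `oltp_users["u1"]` without calling the Model Serving layer again, at the cost of the estimate being only as fresh as the last login.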
Stage 7: Experimentation Platform
As I said in the beginning, it is not enough to deploy cool models. It is also important to measure the impact of these models and ensure that they are achieving a business objective. The key change in this stage is the introduction of an Experimentation Platform where Data Scientists can allocate populations of users for the test and control groups of their experiments.
Once a user is part of an experiment they may experience a new behavior in the App or they may continue to see the baseline behavior if they are in the control group. Typically, the new behavior is used in conjunction with a predictive model, for example we might want to measure if an enticing offer can reduce churn among at-risk users. However, even simple things like design changes in the App can be part of an experiment.
The existing events architecture has to be changed somewhat to account for this new stage. We need to tag every single event with the list of experiments that the user who triggered that event is a part of and whether they are in the control or test group for that experiment. The augmented events allow the real-time dashboarding service to measure the metrics for each experiment and estimate the lift that each experiment is causing in those metrics. Typically, the Data Scientist who creates an experiment will specify an objective metric that they are trying to improve, for example engagement with Feature X, as well as a guard-rail metric that can’t be allowed to worsen, for example churn.
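One common way to do this tagging is to hash the user id together with the experiment name, so that assignment is stable without storing anything per user. The sketch below assumes that approach; the function names, the experiment names, and the `experiments` field on the event are illustrative and not taken from any specific Experimentation Platform.

```python
import hashlib


def assign_group(userid, experiment, test_fraction=0.5):
    """Deterministically place a user in 'test' or 'control' for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{userid}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "control"


def tag_event(event, active_experiments):
    """Attach each active experiment and the user's group in it to the event.

    active_experiments: experiment name -> fraction of users in the test group.
    """
    tagged = dict(event)
    tagged["experiments"] = {
        name: assign_group(event["userid"], name, fraction)
        for name, fraction in active_experiments.items()
    }
    return tagged


active = {"churn_offer": 0.5, "feature_x_prompt": 0.1}
event = {"time": 0, "userid": "u1", "event": "login", "tier": "free", "count": 1}
print(tag_event(event, active))
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users do not always land in every test group together.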
Now, an experiment may fail to cause an improvement, and if a guard-rail metric worsens too much the experiment has to be rolled back. On the other hand, if a statistically significant improvement is observed in the objective metric, we may want to automatically deploy it to the entire population. In practice, most experiments fall somewhere in between. In other words, they don’t make things worse, but they don’t lead to a significant improvement. A good Experimentation Platform can also help identify how large a population of users is needed to get significant results, especially if the expected improvement is very small.
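To give a feel for why small improvements need large populations, here is a sketch using the textbook normal-approximation formulas for comparing two conversion rates. This is one standard approach chosen for illustration, not necessarily what any given Experimentation Platform uses, and the function names and default `alpha` and `power` values are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_control vs p_test (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_test) ** 2)


def two_proportion_p_value(success_c, n_c, success_t, n_t):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_c, p_t = success_c / n_c, success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_t - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A one-point lift needs far more users than a five-point lift.
print(sample_size_per_group(0.10, 0.11))
print(sample_size_per_group(0.10, 0.15))
print(two_proportion_p_value(100, 1000, 200, 1000))
```

The required sample size grows with the inverse square of the expected difference, which is why halving the expected lift roughly quadruples the population needed.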
In summary, it is not enough to build models. We should be prepared to objectively evaluate each model and pass over any approach that fails to deliver clear value to our customers. The key is to have an architecture that allows for quick innovation. In this article I have avoided going into some of the trickier aspects such as online models that learn continuously or tenant-specific models for multi-tenant Apps. There can also be many other aspects of your App which might require tweaks to the above stages. However, one can always evaluate an AI architecture by measuring three key things:
- How long does it take to build and evaluate a model on offline data?
- How long does it take to deploy a model to production?
- How long does it take to safely evaluate a model in production?
Epilogue: In case you were wondering what happened with the models that I deployed in the gaming company back in 2017: I never quite built a model that created sales deals that generated more revenue than the baseline, because the baseline was human operators who were very hard to beat! However, we soon realized that because these models could do as well as the humans, we didn’t need humans for sales any more, and those people could instead generate creative content in the game to increase engagement. This brings me to the final point: AI folks can build all manner of predictive models, but they can rarely predict what metric their work will actually impact.