Feature stores have gotten a lot of attention lately. In December 2020, Amazon Web Services released its SageMaker Feature Store. Last month, Splice Machine, a big data platform, launched its own feature store too. Datanami even went as far as to call 2021 the year of the feature store (quoting Tecton.ai's co-founder).
It turns out that managing features, in our experience, is one of the biggest bottlenecks in productizing your ML models. — Uber
Features (and labels) are the inputs for machine learning models. In a regression equation, labels are the dependent variable, features are independent variables. In a table, labels are the column we try to predict, features are the other columns (though we usually drop IDs).
What is a feature store? Well, it depends on who you ask. Some articles define it simply as "the central place to store curated features". Others say it helps you "build features once; plug them anywhere" or "deploy models 100x faster". There's a wide range of definitions and I think it's because what a feature store is depends on what you need.
To learn more about feature stores and the needs they address, I dug into tech/blogs from companies. Unsurprisingly, I found myself organizing feature store features (haha) into a hierarchy of needs (inspired by Maslow). We’ll go over an overview of the feature store hierarchy of needs before discussing how companies have met those needs at each level.
Maslow’s Hierarchy of Needs is a motivational theory in psychology. It suggests that there are five tiers of human needs and is often shown as hierarchical levels within a pyramid. The largest, most fundamental needs at the bottom must be satisfied before individuals become motivated to higher-level needs.
I think feature stores can be organized as a hierarchy of needs too. Some needs are more pressing than others (e.g., access to data, serving features in production). These essential needs have to be fulfilled before higher needs are considered.
Feature Store Hierarchy of Needs
At the base, we have access needs. This includes access to feature information and data, transparency, and lineage. Transparency lets users view the logic and code creating the features while lineage lets them trace upstream sources. Together, they make it easier to find, share, and reuse features, minimizing duplicate work.
Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks. — Airbnb
Next, we have serving needs. Here, the key need is to use the features in production, usually at high throughput and low latency (i.e., not SQL queries). Other needs include the ability to integrate with existing offline feature stores so features can be synced from offline to online feature stores (e.g., Redis). It can also include the ability to perform real-time feature transformations within the feature store itself.
Data scientists would typically implement features in individual silos, then hand over their code to a data engineering team to reimplement production-ready pipelines. This process of reimplementing pipelines can add months to a project and requires complex coordination between teams. — GoJek
Once lower needs are met, we can turn to integrity needs. Here, the prevalent need is minimizing train-serve skew. This means ensuring consistency between the features during training, and the features used as input when calling the model in production. Another common need is point-in-time correctness (aka time travel). This ensures that historical features and labels used in offline training and evaluation don’t have data leaks.
A lot of training is done in an offline way, and serving often happens, at least at Uber, in real-time environments. It is absolutely important to make sure that the data that you're using in real-time, at serving time, matches the data that is used at training. — Uber
Beyond that, we have convenience needs. The feature store needs to be easy and quick to use for data practitioners to adopt and realize the productivity gains. Such needs include having simple, intuitive APIs that can be used for training and serving, as well as interactivity to allow for faster development and debugging.
But remember, we are a platform, not an engineering team… How can we provide a tool for (users) so that we can give this power back to them, so that they can, without help from us, do all this work by themselves? — Uber
Finally, we have autopilot needs. These are needs to automate tedious work such as backfilling features, monitoring and alerts on feature distributions, etc. Some companies have implemented solutions to address these needs, but they are not widespread across the material I’ve researched.
Backfilling is a major bottleneck when you’re iterating on your training set ideas. Efficiently computing new training sets is a huge deal for speeding up your data scientist workflow. —Airbnb
Not all organizations will require all levels of needs though most teams can benefit from meeting the first two levels (access and serving) and part of the third (minimizing train-serve skew). The extent of each need may also vary. Organizations with fewer online ML use cases may have less demanding serving needs relative to DoorDash (i.e., millions of requests per second). Models and features that update intraday may have tricker needs for point-in-time correctness.
Now that we have an understanding of the various levels of needs, let’s take a look at how companies have built feature stores to address them.
What happens if access to feature information and data is low? Here’s a common experience across several companies:
Features representing the same business concepts are being redeveloped many times, when existing work from other teams could have been reused. — GoJek
To address this, GoJek built Feast, a feature store that acts as the interface between data engineers and scientists, and ML practitioners. Data engineers and scientists create features and contribute them to the feature store. Then, ML practitioners consume these ready-made features, saving time by not having to create their own features.
(Not sure how I feel about this productivity gain, given my preference for data scientists to be more end-to-end. Nonetheless, this spilt between data engineering and machine learning might have been an artifact of how GoJek’s data team was structured, with the data engineering team mostly situated in India while the ML team was mostly in Singapore. Thus, Feast served as an interface for both teams to collaborate on.)
Similarly, Uber built their Palette feature store to encourage the sharing—and reuse—of features in a single location. Various teams across Uber could contribute features to, and use features from, Palette. For Uber, this minimized duplicate work, made ML outcomes more consistent, and accelerates the ML process.
To make features easy to find and use, it’ll also need aspects of data discoverability (though it can also be argued that this is a convenience need). I didn’t come across much discussion on data discoverability in the context of feature stores, but I suspect it’ll be very similar to what I’ve previously written about open-sourced data discovery platforms.
At this level, the feature store mostly serves as a store, albeit a very useful one that contains crowd-sourced/curated features that ML practitioners can just plug-and-play. Nonetheless, it’s not much different from a data warehouse. What distinguishes feature stores (from data warehouses) is their ability to meet the second level of the hierarchy of needs—serving.
When training machine learning models, we usually train them offline with batch data. Then, when serving these models online, the same features need to be available in real-time. Here’s where teams stumble—how do we serve those features in production, at high throughput and low latency?
When shipping tabular-based models, we kept finding that many of the features we would input into a model while training it were not readily available in our production infrastructure. — Monzo Bank
For Monzo Bank, the features were available in their analytics stack, which was used for training models, but not in their production stack, which was used for serving. To solve this, they took a lean approach by automating the synchronization of features from their analytics store (BigQuery) to their production store (Cassandra).
Uber’s Palette adopts a similar dual-store approach. The offline store (Hive) saves feature snapshots and is mainly used by training jobs. The online store (Cassandra) serves the same features in real-time. Features not available in Cassandra are generated online via Flink before being saved to Cassandra. Data between both stores are synchronized. New features added in Hive are automatically copied to Cassandra; real-time features added to Cassandra are ETL-ed to Hive.
Creating batch and real-time features, and synchronizing between stores (source)
DoorDash took addressing serving needs to the extreme with their Gigascale feature store. Here are the demanding technical requirements they had:
To build this feature store, DoorDash benchmarked several key-value stores—Redis, Cassandra, CockroachDB, ScyllaDB, and YugabyteDB—before sticking with Redis. More details on their benchmarking process and optimizing Redis in their excellent write-up.
Another approach is to compute features in real-time. This is how Alibaba serves real-time recommendations, where their Alibaba Basic Feature Server computes statistical features on user interactions (e.g., clicks, likes, purchases) in real-time. These features are then used for candidate generation. More about real-time recsys in a previous post.
With serving needs met, we turn to integrity needs. There are two main pain points here:
To address the first problem, Netflix built distributed time-travel. It takes snapshots of their offline and online data. These snapshots cover various contexts such as member type, device, time of day, etc.
Creating snapshots from offline stores and online micro services (source)
However, creating snapshots for every context is expensive. Thus, they perform stratified sampling on attributes such as viewing patterns, device type, time spent on device, region, etc. These samples provide a good distribution of data to train and validate their models on. The sampling is done via Spark and snapshots are stored in S3.
A similar process snapshots online data from their hundreds of microservices. These microservices provide data such as viewing history, personalized viewing queues, and predicted ratings. Spark parallelizes the calls to Prana which fetches the data from various microservices. Similarly, the snapshots are stored in S3 via Parquet format.
To solve the second pain point of train-serve skew, GoJek adopted Apache Beam as their data processing pipeline. Data from batch and streaming sources (e.g., BigQuery, Kafka) are ingested via Beam into offline and online stores (e.g., BigQuery, Redis). A unified API (to access historical and online data) minimizes the need to rewrite feature pipelines for the serving environment and thus accidentally introducing train-serve skew.
GoJek's Feast ingesting features via Apache Beam (source)
Netflix addressed this via shared feature encoders. Though they have different pipelines for offline (Spark) and online feature generation, both pipelines use the same feature encoders (i.e., same classes, libraries, and data formats). This ensures that feature generation is consistent across training and serving environments.
Netflix uses the same encoders for offline and online feature generation (source)
We’ve also seen how Uber approached this by keeping their offline (Hive) and online (Cassandra) feature stores in sync. New features in either store are copied to the other store, ensuring consistency between data used for training and serving.
We also need monitoring to ensure the integrity of features:
To allow data scientists to visually inspect features for integrity, Airbnb's Zipline includes a UI that shows the distribution of features, correlations between features and labels, and performs clustering analysis (not sure what this clustering analysis is on though).
Similarly, Uber built Data Quality Monitor. It provides users with data quality scores daily, as well as alerts when anomalies are detected. It does this by:
Data quality scores over time, and when incidents occur (source)
I didn’t come across much content on convenience (or pain points) in the context of feature stores. Nonetheless, as practitioners, it’s important for our tools and platforms to be easy to use (e.g., PyTorch vs TensorFlow).
The best example I found comes from GoJek. They provide unified SDKs in Python, Java, and Go to simplify retrieving features from both offline and online stores. The API lets you use nearly identical code to
get_online_features(), easing the transition between offline to online feature engineering.
customer_features = ['credit_score', 'balance', 'total_purchases', 'last_active']historical_features_df = feast.get_historical_features(customer_ids, customer_features)model = ml.fit(historical_features_df) # pseudo codeonline_features = feast.get_online_features(customer_ids, customer_features)prediction = model.predict(online_features)
For time-travel, Netflix provides simple APIs so data scientists can create point-in-time correct features and labels. Here’s an example of fetching a snapshot of viewing history for the movie OUTATIME.
val snapshot = new SnapshotDataManager(sqlContext).withTimestamp(1445470140000L).withContextID(OUTATIME).getViewingHistory
Building on this snapshot, users can perform time-travel to create features—for training and validation—by providing the following:
At the top, we have autopilot needs. Not meeting these needs may mean your data scientists have to spend time on tedious, manual work—it’s not a blocker for using ML in production, though it can reduce development effort and operations cost. While some companies shared about addressing these needs, it’s not widespread in industry (yet).
Airbnb noticed that backfilling was a bottleneck when their data scientists iterated on ML experiments. Thus, Zipline includes a feature to backfill features automatically. A simple UI is provided where data scientists could define new features, specify the start and end dates, and how many processes to parallelize the backfill for. These features would then be added to their existing training feature set via an Airflow pipeline.
Airbnb`s UI for backfilling and the Airflow DAG it creates (source)
Other ways to meet autopilot needs include:
I hope it's clearer what a feature store is. If we're starting from ground zero, maybe all we need is access and serving, and a dash of integrity. If we're building our own at big tech co, we'll probably want to consider convenience and autopilot relatively early. Did I miss anything? Please reach out @eugeneyan!
Looking to get started with a feature store? Feast is a great option to consider. It meets the needs of access and serving, and its consistent API makes it easy to use similar code for training and serving. Best of all, it’s open-sourced (read: free). Let me know how it goes!
Read more guides?
© Eugene Yan 2023 • About • Suggest edits.