As a follow-up to a prior post: after 20 years of battle testing, are there shortcomings to Kimball's canonical approach to modeling the business?
In my opinion, there are three significant ones:
- When events or facts themselves are needed for segmentation. Let’s call this “metrics-based segmentation”.
- When generating features for machine learning use cases, which is closely related to the first point.
- When the sequencing of events matters.
Metrics-based Segmentation:
This use case involves taking facts such as recent 7-day usage or TTM revenue and generating conditional segments to analyze other metrics. In other words, you want to compute new attributes for an entity or dimension based on specific fact-based calculations.
The workaround for this issue using Kimball’s approach is straightforward: you can model these behavioral attributes as additional attributes on the entity or dimension tables. However, this breaks a norm of dimensional modeling, because metric information now lives in the dimension tables.
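To make the workaround concrete, here is a minimal sketch in plain Python. The fact rows, the 7-day window, the 60-minute threshold, and the "heavy"/"light" labels are all assumptions made up for illustration. In a warehouse, the derived segment would be stored as a column on the user dimension.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical fact rows: (user_id, event_date, usage_minutes, revenue)
usage_facts = [
    ("u1", date(2024, 1, 29), 30, 5.0),
    ("u1", date(2024, 1, 30), 45, 0.0),
    ("u2", date(2024, 1, 10), 60, 20.0),
    ("u2", date(2024, 1, 30), 5, 0.0),
    ("u3", date(2024, 1, 28), 120, 12.0),
]


def usage_segments(facts, as_of, window_days=7, heavy_threshold=60):
    """Derive a behavioral attribute per user from trailing-window usage."""
    window_start = as_of - timedelta(days=window_days)
    totals = {}
    for user_id, event_date, minutes, _revenue in facts:
        totals.setdefault(user_id, 0)  # every user gets a segment
        if window_start < event_date <= as_of:
            totals[user_id] += minutes
    return {
        user_id: "heavy" if minutes >= heavy_threshold else "light"
        for user_id, minutes in totals.items()
    }


def revenue_by_segment(facts, segments):
    """Analyze another metric (revenue) sliced by the derived segment."""
    result = defaultdict(float)
    for user_id, _event_date, _minutes, revenue in facts:
        result[segments[user_id]] += revenue
    return dict(result)


segments = usage_segments(usage_facts, as_of=date(2024, 1, 31))
revenue = revenue_by_segment(usage_facts, segments)
```

Note that the segment is only valid as of the date it was computed, which is part of why storing it on a dimension feels like a departure from the norm.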
Feature Generation:
The prior use case for metrics-related calculations on entities is similar to building features in machine learning, where the entity or dimension can have a wide array of behavioral attributes. Generating features is even more challenging when we need "point-in-time" values on historical data, that is, each feature computed only from the facts that were available as of a given date.
Accomplishing this in a flexible manner is difficult for any data model, let alone Kimball's approach.
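As a rough illustration of a point-in-time calculation, the sketch below computes a trailing 7-day usage feature for each (user, date) training row, using only events strictly before that date. The event rows, the window length, and the function name `trailing_usage` are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical usage events: (user_id, event_date, usage_minutes)
events = [
    ("u1", date(2024, 1, 1), 10),
    ("u1", date(2024, 1, 5), 20),
    ("u1", date(2024, 1, 9), 40),
    ("u2", date(2024, 1, 8), 15),
]

# Hypothetical training rows: one feature value is needed per (user_id, as_of_date)
label_rows = [
    ("u1", date(2024, 1, 6)),
    ("u1", date(2024, 1, 10)),
    ("u2", date(2024, 1, 8)),
]


def trailing_usage(events, user_id, as_of, window_days=7):
    """Sum usage in [as_of - window_days, as_of), excluding as_of itself."""
    start = as_of - timedelta(days=window_days)
    return sum(
        minutes
        for event_user, event_date, minutes in events
        if event_user == user_id and start <= event_date < as_of
    )


features = [
    (user_id, as_of, trailing_usage(events, user_id, as_of))
    for user_id, as_of in label_rows
]
```

The same user appears twice with different feature values, one per as-of date. A single attribute on a dimension row cannot express that without some form of history tracking.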
Sequence Analytics:
This use case is relevant in product analytics, where it may be important to follow a sequence of events by an entity. For example, we may want to condition on users completing step 1, and then analyze how many users completed step 2 in the same session, followed by step 3 within a day, and so on.
Kimball's approach is less efficient at traversing sequences of facts, since ordering events per entity typically means self-joins or window functions over fact tables. In my opinion, this is why product analytics tools tend to adopt an activity-first approach, although this has its own set of weaknesses, which is a topic for another post.
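The funnel described above can be sketched over a single time-ordered activity stream. This is a simplified illustration, not how any particular tool works: the event rows and step names are made up, only each user's first step 1 is considered, and "within a day" is assumed to mean within 24 hours of step 2.

```python
from datetime import datetime, timedelta

# Hypothetical activity stream: (user_id, session_id, timestamp, step)
events = [
    ("u1", "s1", datetime(2024, 1, 1, 9, 0), "step_1"),
    ("u1", "s1", datetime(2024, 1, 1, 9, 5), "step_2"),
    ("u1", "s2", datetime(2024, 1, 1, 20, 0), "step_3"),
    ("u2", "s3", datetime(2024, 1, 1, 10, 0), "step_1"),
    ("u2", "s4", datetime(2024, 1, 1, 11, 0), "step_2"),  # different session
    ("u3", "s5", datetime(2024, 1, 2, 8, 0), "step_1"),
    ("u3", "s5", datetime(2024, 1, 2, 8, 10), "step_2"),
    ("u3", "s6", datetime(2024, 1, 4, 8, 0), "step_3"),  # more than a day later
]


def funnel_counts(events, step3_window=timedelta(days=1)):
    """Count users reaching each step, conditioning on the previous step."""
    by_user = {}
    for user, session, ts, step in sorted(events, key=lambda e: e[2]):
        by_user.setdefault(user, []).append((session, ts, step))

    counts = {"step_1": 0, "step_2": 0, "step_3": 0}
    for user_events in by_user.values():
        # Simplification: anchor on the user's first step_1 only.
        first = next((e for e in user_events if e[2] == "step_1"), None)
        if first is None:
            continue
        counts["step_1"] += 1

        # step_2 must follow step_1 within the same session.
        second = next(
            (e for e in user_events
             if e[2] == "step_2" and e[0] == first[0] and e[1] > first[1]),
            None,
        )
        if second is None:
            continue
        counts["step_2"] += 1

        # step_3 must follow step_2 within the allowed window, any session.
        third = next(
            (e for e in user_events
             if e[2] == "step_3" and second[1] < e[1] <= second[1] + step3_window),
            None,
        )
        if third is not None:
            counts["step_3"] += 1
    return counts


counts = funnel_counts(events)
```

Because every event sits in one stream keyed by entity and time, each step is a scan forward from the previous one, which is the access pattern an activity-first model is built around.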
In summary, Kimball's ideas on dimensional modeling have stood the test of time and remain a reliable way to model business metrics and enable first-class segmentation.
And just as databases designed for specific types of workloads evolve over time to adapt and survive, I believe that Kimball's approach has also evolved to serve a wide range of use cases today, despite some known weaknesses.