Data Storage Model for Data Workbench
This document is intended to aid new administrators and analysts who have just finished Adobe Analytics Data Workbench training. After reading this article, it should become easier to visualize how this software works while troubleshooting issues or running in-depth analysis.
Data Storage Model for Data Workbench
Conventional architectures such as a relational database organize data into tables interconnected by keys. Conversely, the Data Workbench component in Adobe Analytics uses a fundamentally different process known as a “wheel” architecture. This article describes how the wheel architecture works compared to conventional ones, and identifies its strengths and weaknesses.
What is a wheel architecture?
Think of the visitor data to your website like bags at the airport circulating on a baggage-claim conveyor. In this example, your goal is to count bags of a similar color as they appear before you.
To query the circulating bags, you can identify the first bag and then count the subsequent bags that move in front of you.
Every time a matching bag passes, you identify it and tally up the different types. This allows you to count all bags and group them based on type.
When the first bag is in front of you again, you know it has completed the entire circuit. At this point you can stop tallying and report the results.
If other people also need to query results, they can walk up to the wheel and watch the bags for exactly one revolution, starting at the same or different revolutions of the wheel. This conveyor, or data wheel, is always spinning and a query can begin at any time and continue until a single revolution is complete.
Also, Data Workbench guarantees that the bags on the belt will be in a completely random order. (The airlines do a good job of this too!) This allows for a query of circular data to be seen as a random snapshot of the total data, enabling queries to compute approximate answers very quickly by extrapolating the result of the entire population.
With 50 pieces of data this might not work too well. But with millions or billions of pieces of data, a very accurate estimate of the final tally can be extracted almost instantly.
Running detailed queries
Let us expand this analogy further. For more detailed analytical queries, we place all items out of the bag (but still group them by traveler). The contents are made up of small bag, an even smaller organizer bag, and individual items.
Now that the query agent has immediate access to the contents, a more detailed query like "how many traveler's carry a blue bottle?" can be made.
Hierachy structure within a bag
Let's translate this baggage analogy back to visitor data for analytics: One bag for each traveler represents all data for one Visitor, the name tag as Visitor ID, small bags representing a Visit, organizers are Hits, and individual items are Events.
On the Data Workbench client, this structure can be described as a schema diagram. A description of the schema diagram is described here.
The operation to lay out data according to this architecture is called Log Processing. It has to take place whenever the hierarchical structure changes significantly. While this process takes time, you will end up with dataset optimized for very fast analytical queries.
Relational tables in baggage analogy
Before describing how the data wheel differs from tables, let's first consider how this baggage analogy would look for a conventional approach. In a relational database, data would be something akin to a lot of shelves with bags on them. This conventional architecture gives you a quick way to insert a new bag or pull out a specific one.
For more detailed analysis, data within each bag will need to be extracted and normalized into separate shelves like this.
In this setup, the query agent has to resolve external keys to link each of the items on all four shelves.
Overhead consideration for analytical query
When a query is targeted for one entity, then the relational database can complete the query very quickly. For example, a query of “how many events did Mr. Taro Adobe fire?” would require the query agent to look up one record on Visitor table, several on the Visit table, and more on the Hit and Event tables. It would also require a count of the number of events. These are expensive steps for the database to use these external keys, but doing it for one person is still practical.
However, an analytical query is more open-ended and requires an answer to the question: "How many users fire how many events per session?" This requires a query across many records with the query agent repeating the expensive steps for all users. Consequently, when you have millions or billions of visitors, this will quickly becomes impractical.
Using data wheels, each visitor’s data is organized into a single bag, so the query agent can evaluate each bag at once and move on to the next, allowing the wheel to finish a query in a single sweep. Hence, wheel architecture is better suited for analytical queries.
The OLAP Cube
To improve analytical query speed, typical Business Intelligence (BI) software relies on a multi-dimensional structure with pre-calculated summary data known as an OLAP cube. By querying against these summaries, the response time becomes significantly shorter for some applications but runs into problems with advanced queries.
For example, one company has customers in 50 different states, 1000 product lines, and 10 billion transactions per year. The data are stored using three-dimensional cubes based on hour, product, and state. This requires pre-calculations of 1.2 million cubes.
In this architecture, the analytical query of “how many product have we sold by hour in each state?” can be answered very quickly by looking at a summary of the cubes. This is a vast improvement to standard table queries with external references.
Increasing granularity and scale
However, a more detailed query of “how many product have we sold by minutes in each state?” will require pre-calculations multiplied by 60 for the OLAP cube. Adding new products, expanding the count across countries, or introducing new dimensions (based on size, gender, material, etc.) will cause the OLAP cube summary to multiply the stack at an exponential rate.
As the cube stack grows, pre-calculation and updating the data increases overhead time and forces the analyst to query against a dataset that is often days behind. This transactional overhead and the exponential growth of data across tables increases latency and severely limits analytical queries using OLAP cube architecture.
In contrast, equivalent granularity increase on the wheel will result in smaller incremental growth.
In addition, with no pre-calculation overhead, newly added data becomes available soon after arriving. This enables analysts to query in near real-time. I will discuss more on how real-time processing will work in the next article.
Query on actual data vs. summary of actual data
Even when an organization has massive capacity to update the summary in near real-time, one more factor should be considered: A query against an OLAP cube is a query against the summary of data, making it less granular and inherently “lossy.” Conversely, Query results on the wheel are calculated from processed data directly.
As described in this document, the operation of Adobe Analytics Data Workbench can be best described as visitor data placed on a wheel instead of interconnected as linear tables or shelves. It is better suited to run analytical queries to survey vast amounts of data.
Understanding the inner workings of this database model will help you take advantage of its benefits.