Life cycle of Analytical Dataset
This article is strictly a simplified illustration of the product. It is intended to help analysts visualize the operation when troubleshooting issues.
In our previous article, we discussed how log data from multiple channels form a single visitor's data. We now shift our focus to the whole dataset.
Log Process - Building dataset -
The construction of the dataset is called Log Process, which consists of two phases.
(1) Log Processing Phase
First, servers must decode the raw log data files, organize them as visitor data (contact cards), and store them in a dataset (card holder). This phase is also known as Fast Input.
During this phase, tens of thousands of large log files must be decoded line by line. As such, this phase takes a considerable amount of time.
(2) Transformation Phase
While the previous phase was all about decoding raw log data, this phase focuses on transforming the decoded data into a more useful form. This phase is also known as Fast Merge.
At this point, servers are working on a dataset, which, unlike flat log files, is smaller and organized for fast access. For this reason, this phase typically finishes much more quickly than the Log Processing Phase.
As it progresses, the finished portion gradually becomes available for queries.
Resource-intensive transformations, such as the CrossRows transformation, can lengthen this phase and increase disk consumption.
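As a rough illustration of the two phases, the following Python sketch decodes a few log lines into per-visitor cards and then derives extra fields from the decoded cards. The comma-separated log format, the field names, and both function names are assumptions made for this example; they are not Data Workbench's actual formats or APIs.

```python
from collections import defaultdict

def log_processing_phase(log_lines):
    """Phase 1 (hypothetical): decode raw lines and group them into visitor cards."""
    dataset = defaultdict(list)
    for line in log_lines:
        # Assumed line format: visitor_id,timestamp,page
        visitor_id, timestamp, page = line.strip().split(",")
        dataset[visitor_id].append({"time": int(timestamp), "page": page})
    return dict(dataset)

def transformation_phase(dataset):
    """Phase 2 (hypothetical): derive more useful fields from decoded cards.

    Written as a generator so that each finished card can be consumed
    while the remaining cards are still being transformed.
    """
    for visitor_id, events in dataset.items():
        events.sort(key=lambda e: e["time"])
        yield visitor_id, {
            "events": events,
            "page_views": len(events),
            "entry_page": events[0]["page"],
        }

raw_log = [
    "v1,100,/home",
    "v2,105,/pricing",
    "v1,90,/landing",
]
cards = log_processing_phase(raw_log)
transformed = dict(transformation_phase(cards))
```

The second function never touches the raw lines, which mirrors why the Transformation Phase tends to be the quicker of the two.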
Transformation task during Log Processing phase
Simpler transformation types can be executed during the Log Processing Phase without waiting for the Transformation Phase. The illustration below performs the same lookup transformation described in the previous article in a single sweep.
Some transformation types must wait for the Log Processing Phase to complete. For example, CrossRows transformations take other fields on a card as input, which may not be decoded yet. They are executed later, during the Transformation Phase.
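The scheduling rule described here can be sketched in Python as follows. The `Transformation` class, the `needs_whole_card` flag, and the transformation names are hypothetical; they only model the idea that a transformation reading other rows of a card has to wait until the card is fully decoded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transformation:
    name: str
    # True if the transformation reads other rows or fields of the visitor
    # card, which may not be decoded yet during log processing (assumption).
    needs_whole_card: bool

def split_by_phase(transformations):
    """Return (names runnable during log processing, names deferred)."""
    early = [t.name for t in transformations if not t.needs_whole_card]
    deferred = [t.name for t in transformations if t.needs_whole_card]
    return early, deferred

plan = [
    Transformation("lookup_product_name", needs_whole_card=False),
    Transformation("cross_rows_previous_page", needs_whole_card=True),
]
early, deferred = split_by_phase(plan)
```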
Real-Time Processing - Continuous update -
Even after the Log Process completes, new data are continuously added to keep the dataset up to date. This continuous increment is called Real-Time Processing mode, and the servers do this in the background while responding to queries.
When data are fed via the Sensor module, event data are processed within minutes or even less on an adequately sized cluster. Analysts can then run queries on events in near real time.
However, a spike in log data volume can overburden the cluster. For example, the number of visitors could multiply several times on a product release day. This causes pending data to pile up, widening the gap between the As-Of Time and the current time.
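To see how a spike widens that gap, consider this small Python simulation. The event counts and the fixed per-minute capacity are made-up numbers, and the lag estimate (backlog divided by capacity) is a simplification for illustration rather than how the product computes As-Of Time.

```python
def simulate_backlog(arrivals_per_minute, capacity_per_minute):
    """Track how many events are still pending at the end of each minute."""
    backlog = 0
    history = []
    for arrived in arrivals_per_minute:
        # Events beyond the cluster's capacity carry over to the next minute.
        backlog = max(0, backlog + arrived - capacity_per_minute)
        history.append(backlog)
    return history

def estimated_lag_minutes(backlog, capacity_per_minute):
    """Rough time needed to drain the backlog if no new data arrived."""
    return backlog / capacity_per_minute

# Two normal minutes, a three-minute spike, then normal traffic again.
traffic = [800, 900, 3000, 3000, 3000, 900]
history = simulate_backlog(traffic, capacity_per_minute=1000)
lag = estimated_lag_minutes(history[-1], capacity_per_minute=1000)
```

In this run the backlog stays at zero under normal traffic, grows by 2,000 events per minute during the spike, and shrinks by only 100 events per minute afterwards, so the delay persists long after the spike ends.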
- Catch up on a delay -
Once the As-Of Time delay reaches the threshold, the dataset goes back through the Log Processing and Transformation phases. This helps it catch up with the delay.
Log Processing phase (incremental), aka Fast Input: Because existing field data in the dataset can be reused, only pending data are decoded, so this phase finishes relatively quickly. During this phase, the dataset stops accepting queries and devotes all of its resources to log processing.
Transformation phase (full), aka Fast Merge: The addition of newly decoded data invalidates the existing transformed data; hence, the transformation phase has to be executed again in full. Partial data become available for queries as it progresses.
Once all transformations complete, the dataset goes back to Real-Time Processing mode.
How data are fed into the cluster varies case by case. Your organization may feed data using Sensor, a daily feed from Adobe Analytics Report (SiteCatalyst), log files from various custom applications, or a combination of them. The example above is the bare minimum needed to illustrate the mechanism. Please contact your Adobe consultant to devise the best plan for your specific use case.
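The catch-up cycle in this section can be pictured as a small state machine. The Python sketch below is only a mental model: the `Mode` names, the `next_mode` function, and the one-hour threshold in the usage lines are assumptions, not actual product settings.

```python
from datetime import datetime, timedelta
from enum import Enum

class Mode(Enum):
    REAL_TIME = "Real-Time Processing"
    FAST_INPUT = "Log Processing (Fast Input)"
    FAST_MERGE = "Transformation (Fast Merge)"

def next_mode(mode, as_of_time, now, threshold, phase_done=False):
    """Return the mode the dataset moves to next (hypothetical model)."""
    if mode is Mode.REAL_TIME:
        # Fall back to the catch-up phases once the delay reaches the threshold.
        delayed = now - as_of_time >= threshold
        return Mode.FAST_INPUT if delayed else Mode.REAL_TIME
    if mode is Mode.FAST_INPUT:
        return Mode.FAST_MERGE if phase_done else Mode.FAST_INPUT
    # Fast Merge: return to real-time mode once all transformations complete.
    return Mode.REAL_TIME if phase_done else Mode.FAST_MERGE

def accepts_queries(mode):
    """Queries stop only during Fast Input; Fast Merge serves partial data."""
    return mode is not Mode.FAST_INPUT

now = datetime(2024, 1, 1, 12, 0)
threshold = timedelta(hours=1)  # assumed value for illustration
mode = next_mode(Mode.REAL_TIME, now - timedelta(hours=2), now, threshold)
```

Here the As-Of Time is two hours behind, so `mode` becomes `Mode.FAST_INPUT` and queries pause until that phase is done.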
Reprocess - Rebuilding dataset -
Substantial architecture changes, recovery from unexpected damage, or periodic maintenance require another round of Log Processing and Transformation. Such reconstruction is called Reprocess.
For example, let us say the architect decides to incorporate call center logs. The architect will update the data architecture marked in yellow and initiate a reprocess.
Once the reprocess has finished, a more sophisticated query like this can be executed.
"Among in-store add-on purchase items, which products are more likely to result in support calls?"
Naturally, reprocessing the entire dataset takes time, so it has to be performed outside business hours.
Retransformation - Partial rebuild -
When an architect needs to change Transformation phase operations, repeating only the Transformation phase may be sufficient. This is called Retransformation, and it skips the lengthy Log Processing phase.
Retransformation does not update Log Processing phase operations retroactively, so any change to them requires a full reprocess.
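The choice between the two rebuild types comes down to which phase's operations were changed. A minimal Python sketch of that rule, using hypothetical phase labels and a hypothetical function name, might look like this:

```python
def required_rebuild(changed_phases):
    """Pick the lightest rebuild that covers the changed phases.

    changed_phases: a set containing "log_processing" and/or
    "transformation" (labels assumed for this example).
    """
    if "log_processing" in changed_phases:
        # Decoded data must be rebuilt, and transformations rerun after it.
        return "reprocess"
    if "transformation" in changed_phases:
        # Decoded data can be reused; only the Transformation phase repeats.
        return "retransformation"
    return "none"

choice = required_rebuild({"transformation"})
```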
- Dataset Maintenance -
By design, Data Workbench keeps processing log data indefinitely, and the dataset keeps growing until the next reprocess. To avoid data overflow, Adobe support recommends periodic reprocessing with an updated Start Time. The best practice to manage dataset size can be found here.
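One way to picture an updated Start Time is as a rolling window: at each periodic reprocess, the Start Time moves forward so that only a fixed span of recent data is kept. The Python sketch below assumes a retention period expressed in days; the function name and the 365-day figure are illustrative and not an Adobe recommendation.

```python
from datetime import date, timedelta

def updated_start_time(reprocess_date, retention_days):
    """Return the earliest date to keep when reprocessing (hypothetical rule)."""
    return reprocess_date - timedelta(days=retention_days)

# Example: a reprocess on 1 July 2024 that keeps the last 365 days of data.
start = updated_start_time(date(2024, 7, 1), retention_days=365)
```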