Burst Data Model Architecture

Burst represents a significant architectural investment in a custom, behavior-optimized approach to data modeling.

The most salient data concepts are:


  1. A Burst behavioral analysis is applied to a dataset that is best conceptualized as an enormous number of individual complex-object-trees, each called an Entity, partitioned into one or more Slices that are distributed across some or all of the compute nodes (workers) within the Burst Cell (cluster).
  2. Each of these complex-object-tree entities represents a single behavioral analysis target, e.g. a person, a mobile or PC device, a customer, a client, a manufactured assembly, or a medical patient.
  3. This object-tree must contain one or more time-ordered or causally ordered event-collections that represent the behavior of the entity. The very basis of behavioral analysis is derived from, and dependent on, the scanning of these sequences.
  4. The entity object-tree is an acyclic, rooted tree of objects representing a network of strongly modeled singular or plural relations between objects, i.e. a strongly-typed object-model.
  5. The Burst data model support subsystem is called Brio. Brio contains a complete type system, a language-driven schema with a parser/validator for authoring entity models, and subsystems for encoding and decoding data to and from a highly specialized binary format called Blobs.
  6. Behavioral analysis execution against this data involves a scan, or depth-first search (DFS), of the entire object-tree, including ordered iteration of the time/causally ordered sequences.
  7. The results of each of these scans of each of these entities are then merged together to provide a final result.

datasets


Behavioral analysis as Burst defines it is quite unique in that each dataset to be analyzed must be modeled as a set of individual entities, where each entity is a complex object-tree that contains causally or time ordered collections of events along with a rich set of unordered fields and collections of values and objects. There is no ordering between entities, no direct relationships between entities, and no access paths between entities that can be explored or asserted.

two phase analysis


The basic premise is that the analysis consists of two phases:

  1. a depth-first scan of each entity that explores the rich internal modeled world with very few limitations on types and relationships. This produces a well-defined set of results in its own schema.
  2. a merge of all scan results across all entities. The semantics of this merge across entity results can take multiple forms. Burst has a default one that we will describe later.
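The two phases above can be sketched in plain Java. This is a toy illustration with hypothetical names and a trivially simple result schema (a count per event type), not the real Burst API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-phase analysis: scan each entity independently,
// then merge the per-entity results into one final result.
public class TwoPhaseSketch {

    // A trivial stand-in "entity": an ordered list of event type ids.
    public record Entity(List<Integer> events) {}

    // Phase 1: scan of one entity, producing a result in its own schema --
    // here, a count per event type.
    public static Map<Integer, Long> scan(Entity entity) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int eventType : entity.events()) {
            counts.merge(eventType, 1L, Long::sum);
        }
        return counts;
    }

    // Phase 2: merge the per-entity results across all entities.
    public static Map<Integer, Long> merge(List<Map<Integer, Long>> perEntity) {
        Map<Integer, Long> merged = new HashMap<>();
        for (Map<Integer, Long> result : perEntity) {
            result.forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Entity> dataset = List.of(
            new Entity(List.of(1, 1, 2)),
            new Entity(List.of(2, 3))
        );
        Map<Integer, Long> result =
            merge(dataset.stream().map(TwoPhaseSketch::scan).toList());
        System.out.println(result); // {1=2, 2=2, 3=1}
    }
}
```

Note that because each entity is scanned independently, phase 1 is trivially parallelizable across slices and workers; only the merge requires coordination.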

It's important to consider carefully whether this restricted semantic is supportive of your application domain modeling needs. Remember that an entity is something out there, physical or otherwise, whose behavior you are trying to understand as it relates to the behavior of some or all of the other entities.

Brio type system


The Brio type system supports the authoring of typed object-trees that contain the structure and state associated with an entity in a dataset. The following semantic rules apply:

  1. The object-tree consists of one or more typed-structure instances
  2. each typed-structure contains one or more typed-relationship instances
  3. each typed-relationship is either scalar (1:1) or vector (0:N)
  4. each typed-relationship is a value-type or a reference-type
  5. the reference-type is a relationship to a typed-structure
  6. the value-type is a relationship to a primitive-value
  7. This means there are four relationship types:
    • value-scalar -- a 1:1 relationship with a primitive-value
    • value-vector -- a 0:N relationship to a collection of primitive-values
    • reference-scalar -- a 1:1 relationship with a structure
    • reference-vector -- a 0:N relationship to a collection of structures
  8. There is a specialized value-vector called a value-map that is a collection of value-value associations
  9. there is always a root-relationship which is a reference-scalar pointing to a typed-structure which is the root of the tree.
  10. there are no cycles (it's a tree!)
  11. there is a fixed set of primitive-value types:
    • boolean - single byte logical value
    • byte - single byte fixed value
    • short - two byte fixed value
    • integer - four byte fixed value
    • long - eight byte fixed value
    • double - eight byte floating value
    • string -- all strings in Brio are contained in Dictionaries. Each object-tree has a dictionary, so a string value in a Brio object-tree is actually an index lookup into the object-tree's dictionary.
  12. all relationships can have a null value, which means the reference or value is unknown.
  13. within a typed-structure, a single value-scalar relationship can be annotated with the keyword ordinal, which means any collection containing that typed-structure will be sorted/ordered using the natural ordering of that value-scalar.
  14. within a typed-structure, a single value-scalar relationship can be annotated with the keyword key, which means any collection containing that typed-structure will treat that value-scalar as a primary-key within the collection.
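The rules above can be illustrated with a hypothetical User/Session model in Java. The names are illustrative only (Brio entities are encoded binary structures, not plain Java objects), but each field below maps to one of the four relationship types:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustration of the four Brio relationship types plus the ordinal/key
// annotations, using hypothetical plain-Java stand-ins for typed-structures.
public class RelationshipSketch {

    public record Session(
        long id,                        // value-scalar annotated `key`: primary key in its collection
        long startTime,                 // value-scalar annotated `ordinal`: collections sort by it
        List<Long> eventIds,            // value-vector: 0:N primitive values
        Map<String, String> parameters  // value-map: a collection of value-value associations
    ) {}

    public record User(
        String userId,          // value-scalar (a string: in a Blob, really a dictionary index)
        Session lastSession,    // reference-scalar: 1:1 relationship to a typed-structure
        List<Session> sessions  // reference-vector: 0:N typed-structures
    ) {}

    // A collection of a typed-structure with an `ordinal` field is kept in the
    // natural order of that field -- here, Session.startTime.
    public static List<Session> ordered(List<Session> sessions) {
        List<Session> sorted = new ArrayList<>(sessions);
        sorted.sort(Comparator.comparingLong(Session::startTime));
        return List.copyOf(sorted);
    }
}
```

The root-relationship of rule 9 corresponds to the single `User` at the top of this tree; every other structure is reachable from it and only from it.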

schema specifications

Authoring of entity object-tree data models using the Brio type system is done using two schema language specification files: a high-level Motif language schema and a low-level Brio encoding schema. Both of these need to be resources on the classpath for the Cell nodes, as well as for data import systems that use the Burst standard libraries. They look very similar, as structurally they describe exactly the same object model; however, they are used by different layers of the architecture for slightly different purposes.

Motif schema

The Motif schema is a 'high level' data description used by the EQL language as well as by the SampleStore data import system. It is purely a data-model semantics tool and does not support the lower-level implementation annotations that describe specific data storage and access optimizations. Those lower-level primitives are part of the Brio schema language described below.

schema Unity {

  version : 1

  root) user : User

  structure Session {
    0) id : long key             // Unique ID for session, from SDK
    1) events : vector[Event]
    2) variants : vector[Variant]
    3) parameters : map[string, string]   // From SDK
    4) sessionType : byte          // From SDK
    5) applicationUserId : string      // From SDK
    6) pushTokenStatus : byte        // Enum: cfbe.PushTokenStatus
    7) limitAdTracking : boolean       // From SDK
    8) osVersionId : long          // BS(Firmware, id)
    9) startTime : long ordinal       // Epoch time
    10) timeZone : string           // See: java.time.ZoneId.getAvailableZoneIds

Brio schema

As mentioned above, the Brio schema language is structurally identical to the Motif schema language, but it supports implementation-specific annotations and variations that provide for extremely specific and nuanced storage and access mechanics. The example below does not look any different, but one could, for instance, specify collections other than a straight map or vector. We have also defined (but not implemented) various extension classes that use different storage for epoch longs, storing the earliest and latest known time values in a Blob and encoding each value only as the offset from those times. The complete set of extensions we have defined would, for instance, save roughly half the storage for an example data model and real-world datasets.

schema Unity {

  version : 1

  root) user : User

  structure Session {
    0) id : long key             // Unique ID for session, from SDK
    1) events : vector[Event]
    2) variants : vector[Variant]
    3) parameters : map[string, string]   // From SDK
    4) sessionType : byte          // From SDK
    5) applicationUserId : string      // From SDK
    6) pushTokenStatus : byte        // Enum: cfbe.PushTokenStatus
    7) limitAdTracking : boolean       // From SDK
    8) osVersionId : long          // BS(Firmware, id)
    9) startTime : long ordinal       // Epoch time
    10) timeZone : string           // See: java.time.ZoneId.getAvailableZoneIds
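The epoch-offset extension mentioned above can be illustrated with a small Java sketch: store the earliest known time once per Blob and encode each epoch long as a 4-byte offset from it, roughly halving the storage for time values. This sketches the idea only; it is not the actual Brio encoding:

```java
import java.nio.ByteBuffer;

// Sketch of offset-based epoch storage: one 8-byte base time followed by
// 4-byte offsets, instead of a full 8-byte long per value.
public class EpochOffsetSketch {

    // Encode: find the earliest time, write it once, then write each value
    // as an int offset. Assumes the time span fits in 32 bits.
    public static ByteBuffer encode(long[] epochTimes) {
        long base = Long.MAX_VALUE;
        for (long t : epochTimes) base = Math.min(base, t);
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES * epochTimes.length);
        buf.putLong(base);
        for (long t : epochTimes) buf.putInt((int) (t - base));
        buf.flip();
        return buf;
    }

    // Decode: read the base, then rebuild each full epoch long from its offset.
    public static long[] decode(ByteBuffer buf) {
        long base = buf.getLong();
        long[] out = new long[buf.remaining() / Integer.BYTES];
        for (int i = 0; i < out.length; i++) out[i] = base + buf.getInt();
        return out;
    }
}
```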

Brio Blobs


The ultimate destination of your data model, as instantiated by the Motif and Brio schemas you authored to capture the appropriate metadata semantics, along with the real-world candidate data in the form of an entity, is the highly optimized binary encoded format we call the Blob. This encoded form is 'pressed' (encoded) on the remote side of the import pipeline, passed over the network in compressed form, then packaged and bundled appropriately so that it can be distributed across Burst compute cell worker nodes, and ultimately written directly to the disk cache in uncompressed form and loaded in and out of memory from disk via low-level linux mmap primitives.

pressing

The process of taking source data, in whatever form the remote datasource natively stores it, and converting it to schema-conformant Brio Blobs, one entity at a time, is called 'pressing'. Burst has a support library for building pressing pipelines that helps ensure time-, space-, and resource-efficient import subsystems.
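A press step can be sketched as follows. The source record shape and the `PressedEntity` output type here are hypothetical; the real press library works against a Brio schema and emits binary Blobs. The point is the per-entity conversion, including sorting events into their ordinal order so the downstream scan can iterate the sequence directly:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of pressing: convert one natively-shaped source record into one
// schema-conformant entity, one entity at a time.
public class PressSketch {

    public record SourceEvent(long time, int type) {}
    public record SourceRecord(String userId, List<SourceEvent> events) {}

    // Stand-in for the pressed Blob: events already in time (ordinal) order.
    public record PressedEntity(String userId, List<SourceEvent> orderedEvents) {}

    public static PressedEntity press(SourceRecord source) {
        List<SourceEvent> ordered = new ArrayList<>(source.events());
        ordered.sort(Comparator.comparingLong(SourceEvent::time));
        return new PressedEntity(source.userId(), List.copyOf(ordered));
    }
}
```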

data organization/partitioning

Each Brio Blob, with its encoded Brio Entity, is managed in various quanta, each with a different purpose and slightly different formatting. However, in all three forms the contained Brio Blob/Entity is roughly the same.

parcels

Parcels are the network format data structure for Blobs. Parcels are the only place where Blob data is compressed. Each Parcel has a certain number of Blobs contained within it and the entire Parcel payload is compressed using snappy compression.
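A Parcel round-trip can be sketched as a compressed batch payload. The JDK has no built-in snappy codec, so `Deflater`/`Inflater` stand in for snappy here; the batch-then-compress structure is the point, not the codec:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of a Parcel payload: a batch of Blob bytes compressed as one unit
// for the network (Deflater stands in for the real snappy compression).
public class ParcelSketch {

    // Compress the concatenated Blob bytes into a Parcel payload.
    public static byte[] compress(byte[] payload) {
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a Parcel payload back into the original Blob bytes.
    public static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                out.write(chunk, 0, inflater.inflate(chunk));
            }
        } catch (DataFormatException e) {
            throw new RuntimeException("corrupt parcel payload", e);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Compressing the whole batch rather than each Blob individually lets the codec exploit redundancy across entities, which is why Parcels are the one place compression happens.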

Metadata and Data

When talking about Burst data, it is important to distinguish between metadata and data. Metadata M is a persistent description of data that could be imported into Burst, e.g. 6 months of unfiltered data regarding user X from remote datastore Y. Metadata is stored in the Burst Catalog, which is in turn stored in a SQL database. Metadata is divided into Domains and Views.

Data, however, is the most recent snapshot of the data described by metadata M, i.e. real data that has been imported into the Burst Cell and can be analyzed. Data is divided into Slices and Regions. The totality of the slices is called the Dataset.

Metadata

Domains

A Domain is the root object of the metadata describing datasets. It is the abstract form of the data, e.g. the data associated with a specific user, patient, assembly, or application.

Views

A View describes a specific set of instructions about how to import the data associated with a Domain, e.g. a time window, a specific remote datastore, a filter predicate that limits the data imported, etc.

Data is actual real data that can be analyzed. It is imported based on a specific unique Domain and a specific unique View within that Domain. When a Domain and View are specified as part of a Burst analysis, the Burst cell uses the information in the Domain and View to reach out to a remote datastore, tell it exactly what data it wants and in what form, and then load and distribute that data onto the Worker nodes of the Burst cell. This data can then be scanned over and over again until it is aged out of the system or updated with a more recent load. In Burst we call this real data a Dataset.
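The Domain/View shapes might be sketched as simple records. The field names and example values here are illustrative assumptions, not the actual Catalog schema (which lives in a SQL database):

```java
import java.time.Duration;

// Sketch of the Catalog metadata shapes: a Domain is the abstract dataset,
// a View is one set of import instructions for that Domain.
public class CatalogSketch {

    // Root metadata object describing a dataset in the abstract.
    public record Domain(long id, String name) {}

    // One recipe for importing a Domain's data: a time window, a target
    // remote datastore, and a filter predicate limiting what is imported.
    public record View(long id, long domainId, Duration timeWindow,
                       String datastore, String filterPredicate) {}
}
```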

Slices and Regions

The real physical world of datasets is divided into Slices and Regions. These are physical implementation details of a given Burst Cell and are not readily visible or accessible to the user of the Burst Cell.

Slices

Slices are the partitioning of a given dataset across a set of Burst Cell Worker Nodes. Any given Burst Cell's analysis of a dataset is subdivided into a cooperating set of parallel analysis operations spread across the Slices.

Regions

Regions are a Worker-Node-local subdivision of a Slice, used to subdivide the analysis of a Slice into a cooperating set of parallel analysis operations spread across all the cores/threads in a Worker Node. Regions are stored in contiguous disk and memory locations: each Region is mmap'd from a single contiguous disk location into a single contiguous memory location.
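The region mapping can be sketched with Java NIO's memory mapping. The workers use lower-level linux mmap primitives as described above; `RegionMapSketch` and the file layout here are assumptions for illustration:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of mapping a Region: one contiguous file mapped into one
// contiguous range of memory, read-only.
public class RegionMapSketch {

    // Map an entire region file into memory. The mapping remains valid
    // after the channel is closed.
    public static MappedByteBuffer mapRegion(Path regionFile) throws Exception {
        try (FileChannel channel = FileChannel.open(regionFile, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a region file written by the disk cache.
        Path tmp = Files.createTempFile("region", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3, 4});
        MappedByteBuffer region = mapRegion(tmp);
        System.out.println(region.limit()); // prints 4
    }
}
```

Because the mapping is contiguous, a per-core scan thread can walk its Region as a flat byte range with no deserialization step between disk and memory.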