Data Export Overview
Rover Data Export streams your workspace's fan data into an Amazon S3 bucket you own so you can ingest it into your warehouse or data lake. Once enabled and pointed at a destination, Rover writes gzipped NDJSON files into the bucket continuously, partitioned by dataset and time.
This section is a guide for building a consumer of those files. It describes what arrives in your bucket, how the files are laid out, and how to parse each record.
What's emitted
Four datasets are written:
| Dataset | What it contains |
|---|---|
| fan_profiles | One record per fan profile, carrying the full traits payload. |
| fan_identity_graph | Identifier-to-fan associations (email, phone, custom IDs). |
| fan_profile_merges | Profile-merge events (when two fans are unified into a single canonical profile). |
| fan_track_events | An allowlist of behavioral events: SDK-emitted app lifecycle and Experience telemetry, plus Rover message-lifecycle events (sent, opened, clicked). |
The three state datasets — fan_profiles, fan_identity_graph, and fan_profile_merges — are delivered in two phases:
- Snapshot: an initial point-in-time export of every existing row at the moment the destination is enabled (or whenever a snapshot rerun is requested).
- Change stream: every subsequent insert / update / delete from that point forward.
fan_track_events is change-stream only. Track events are append-only behavioral signals (SDK telemetry from customer mobile devices, plus Rover message-lifecycle events), with no historical state to replay. This dataset begins flowing as soon as the destination is enabled and continues as new events occur.
All datasets land at the same path layout in your bucket and are deduplicated by a deterministic record id, so an upsert-by-id load handles either kind without special-casing. See File Format for details.
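Because every record carries a deterministic id, loading snapshot and change-stream files through the same upsert path is enough. A minimal sketch, assuming records are plain JSON objects keyed by `id`; the `op: "delete"` marker used here for deletes is hypothetical (see the Datasets page for how deletes are actually represented):

```python
def apply_records(table: dict, records) -> dict:
    """Upsert records into a table keyed by the deterministic id.

    Duplicate ids from at-least-once delivery collapse naturally:
    applying the same record twice leaves the table unchanged.
    """
    for rec in records:
        if rec.get("op") == "delete":      # hypothetical delete marker
            table.pop(rec["id"], None)
        else:
            table[rec["id"]] = rec
    return table

# Snapshot first, then the change stream -- same code path for both.
snapshot = [{"id": "f1", "traits": {"tier": "gold"}}]
changes = [
    {"id": "f1", "traits": {"tier": "gold"}},    # retry duplicate, collapses
    {"id": "f2", "traits": {"tier": "silver"}},
]
table = apply_records({}, snapshot)
table = apply_records(table, changes)
```

The same idea maps directly onto a warehouse `MERGE`/upsert statement keyed on `id`.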
At a glance
- Format: gzipped NDJSON, one JSON object per line.
- Layout: Hive-style partitioned by dataset, UTC date, and UTC hour.
- Atomicity: every file is a whole-object write, never a partial upload. A retry after a transient failure may replace an existing object at the same key with an equivalent payload, but a visible object is always a complete, valid file.
- Idempotency: every record has a deterministic id field. Treat it as the primary key when loading; duplicates from retries collapse on upsert.
- Latency: typically minutes from upstream event to file arrival.
- Delivery: at-least-once.
- Authentication: short-lived credentials only. Rover assumes an IAM role you create in your AWS account, and never holds long-lived keys to your cloud.
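Parsing a file is straightforward: decompress, then decode one JSON object per line. A self-contained sketch that frames a tiny payload the same way the export files are framed:

```python
import gzip
import io
import json

def parse_ndjson_gz(blob: bytes) -> list:
    """Parse a gzipped NDJSON payload into a list of records."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as fh:
        # One JSON object per line; skip any trailing blank line.
        return [json.loads(line) for line in fh if line.strip()]

# Simulate a downloaded object: NDJSON lines, gzip-compressed.
raw = b'{"id": "a"}\n{"id": "b"}\n'
records = parse_ndjson_gz(gzip.compress(raw))
```

In a real consumer, `blob` would be the body of an object fetched from your S3 bucket.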
Getting started
The order of operations to begin receiving exports:
1. Provision an S3 bucket in your AWS account that Rover can write to. Note the bucket name and an optional prefix for Rover's objects.
2. Configure the destination in the Rover dashboard. The dashboard walks you through creating an IAM role Rover can assume into your account. See Setup & Authentication.
3. Enable the destination. Rover begins by writing the snapshot for each of the three state datasets (fan_profiles, fan_identity_graph, fan_profile_merges), followed immediately by the change stream. fan_track_events has no snapshot; it begins flowing as new events fire on-device. All datasets arrive at the same partitioned layout.
4. Point your loader at the bucket. Use the Hive-style partitions directly, or register the prefixes as an external table in your warehouse. See File Format for the exact layout and the Datasets page for record schemas.
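When pointing a loader at the bucket, you typically enumerate one partition prefix per dataset and UTC hour. A sketch of building such a prefix; the partition key names (`dataset=`, `date=`, `hour=`) are illustrative assumptions here, so confirm the exact layout on the File Format page:

```python
from datetime import datetime, timezone

def partition_prefix(base: str, dataset: str, ts: datetime) -> str:
    """Build a Hive-style partition prefix for one dataset and UTC hour.

    Partition key names are assumed, not authoritative -- check the
    File Format page for the layout Rover actually writes.
    """
    ts = ts.astimezone(timezone.utc)  # partitions are in UTC
    return f"{base}/dataset={dataset}/date={ts:%Y-%m-%d}/hour={ts:%H}/"

prefix = partition_prefix(
    "s3://my-bucket/rover",            # hypothetical bucket and prefix
    "fan_profiles",
    datetime(2024, 5, 1, 13, 30, tzinfo=timezone.utc),
)
```

A loader would list objects under each such prefix (e.g. with an S3 client's paginated list call) and feed every file through the NDJSON parser.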
Data behavior
Properties of the export stream as a whole, independent of how you load or query the data.
- Latency. Typical end-to-end latency from an upstream event to file arrival in your bucket is on the order of minutes.
- Late-arriving data. Records can arrive minutes after the events they describe; very rarely hours, during incident recovery. The data is eventually consistent within a short window. Files are not strictly ordered by the moment the underlying event occurred.
- Delivery semantics. At-least-once. Same-id duplicates can occur during retry edge cases. The deterministic id field makes deduplication trivial.
- Throughput. Files roll over at roughly 50,000 records, ~50 MB uncompressed, or every ~60 seconds, whichever comes first. Active workspaces produce multiple files per dataset per hour; quiet workspaces produce at most one small file per dataset per minute.
- Schema evolution. Additive changes (new optional fields, new event names added to the fan_track_events allowlist) do not bump the version field on records and may appear at any time. Breaking changes (renaming, removing, or retyping an existing field) bump version, and new version rollouts are announced ahead of time.
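These schema rules suggest a loader that tolerates unknown fields but fails loudly on a version it was not built for. A minimal sketch; the supported version number here is illustrative, not taken from the docs:

```python
SUPPORTED_VERSION = 1  # illustrative: bump when you adopt an announced new version

def check_version(rec: dict) -> dict:
    """Accept additive changes; refuse records from a newer breaking version.

    Unknown fields pass through untouched (additive changes may appear at
    any time); an unexpectedly high version raises instead of being
    silently mis-parsed.
    """
    version = rec.get("version", SUPPORTED_VERSION)
    if version > SUPPORTED_VERSION:
        raise ValueError(f"unsupported record version {version}")
    return rec

rec = check_version({"id": "f1", "version": 1, "new_optional_field": 42})
```

Routing too-new records to a dead-letter location, rather than raising, is a reasonable alternative for unattended pipelines.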