Data Export

File Format & Layout

This page describes how exported files appear in your bucket: the path layout, the file format, idempotency guarantees, and how to reason about ordering. Per-record schemas are on the Datasets page.


Bucket layout

Every record is written to a path of the form:

<your-prefix>/exports/<dataset>/date=YYYY-MM-DD/hour=HH/<dataset>-YYYYMMDDTHHmmssZ-part-<token>.ndjson.gz
Path segments:

  • <your-prefix>: the optional prefix you configured on the destination. Omitted (no leading segment) when blank.
  • exports/: literal segment. Distinguishes Rover-written objects from anything else you may put in the bucket.
  • <dataset>: one of fan_profiles, fan_identity_graph, fan_profile_merges, fan_track_events.
  • date=YYYY-MM-DD: UTC date the file was written (Hive-style partition).
  • hour=HH: UTC hour the file was written (Hive-style partition).
  • <dataset>-YYYYMMDDTHHmmssZ-part-<token>.ndjson.gz: unique filename per file. <token> is a long opaque slug derived from internal identifiers; treat it as an arbitrary string. Filenames are deterministic, so re-delivery of the same underlying file produces the same object key, and list-and-process loaders deduplicate naturally on key.

The Hive-style date= / hour= partitions are recognized by most warehouse loaders (BigQuery external tables, Athena, Snowflake external stages, Databricks, etc.) and let you prune scans to specific time windows without listing every object.
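
For example, a backfill loader that needs one hour of one dataset can list a single partition prefix instead of the whole bucket. A minimal sketch, assuming an S3 destination and boto3; the bucket name and prefix below are placeholders:

    import boto3  # assumes an S3 destination; any S3-compatible client works similarly

    s3 = boto3.client("s3")

    # Placeholder names: substitute your bucket and configured <your-prefix>.
    BUCKET = "my-rover-exports"
    PREFIX = "acme"  # "" if you left the prefix blank

    def list_hour(dataset, date, hour):
        """List object keys for one dataset in one UTC date/hour partition."""
        head = f"{PREFIX}/" if PREFIX else ""
        prefix = f"{head}exports/{dataset}/date={date}/hour={hour}/"
        keys = []
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    # list_hour("fan_profiles", "2024-06-01", "13")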


File format

  • Encoding: UTF-8.
  • Compression: gzip, a single gzip stream per file.
  • Format: NDJSON, one JSON object per line, separated by \n. A trailing newline on the final line is not guaranteed.
  • Atomicity: every file is a whole-object write; there are no partial uploads. A retry after a transient failure may replace an existing object at the same key with an equivalent payload, but a given object key, once visible, is always a complete, valid file.

Because every visible file is complete and named deterministically, you can list the bucket and process new objects with no risk of reading a partial file.
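
A minimal reader in Python that relies on those guarantees and tolerates a missing trailing newline (the filename in the usage comment is a placeholder):

    import gzip
    import json

    def read_export(path):
        """Yield one parsed record per line of a .ndjson.gz export file."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip a blank final line if present
                    yield json.loads(line)

    # for record in read_export("fan_profiles-20240601T130500Z-part-abc123.ndjson.gz"):
    #     print(record["id"], record["roverID"])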


Idempotency and ordering

Every record carries an id field that is deterministic within a delivery: replays of the same record during at-least-once retry edge cases always produce the same id. Treat id as the primary key when loading into a warehouse so duplicate files collapse on insert.
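
In code, dedupe within a run reduces to last-write-wins keyed on id; in a warehouse the equivalent is an upsert or MERGE on id. A sketch in Python:

    def dedupe_by_id(records):
        """Collapse at-least-once re-deliveries: keep one record per id.
        Replays of the same record carry identical payloads, so which
        copy survives is immaterial."""
        by_id = {}
        for record in records:
            by_id[record["id"]] = record
        return list(by_id.values())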

Snapshot reruns produce new ids

A snapshot rerun (see Setup) re-emits every existing row with new id values, not the originals. An upsert-by-id load will see them as new records, not duplicates of what is already loaded. Plan for this by deduplicating on (roverID, updatedAt) (or the dataset's event-time field) when processing a rerun, or by truncating and reloading the affected partitions.
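
A sketch of the rerun-safe dedupe in Python, keeping one record per (roverID, event-time) pair; pass the dataset's event-time field name (see the table below):

    def dedupe_rerun(records, event_time_field="updatedAt"):
        """Deduplicate snapshot-rerun records, whose ids are fresh.
        Keys on (roverID, event time) instead of id."""
        seen = {}
        for record in records:
            key = (record["roverID"], record[event_time_field])
            seen[key] = record  # records sharing this key are equivalent rows
        return list(seen.values())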

File ordering across partitions is not guaranteed. Use the event-time field inside each record to reason about chronological order, not the filename timestamp:

  • fan_profiles: updatedAt
  • fan_identity_graph: updatedAt
  • fan_profile_merges: occurredAt
  • fan_track_events: timestamp

The filename timestamp is when the file was written, not when the underlying event occurred.
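
If one loader handles several datasets, a small lookup keeps the event-time logic in one place. A sketch using the field names listed above:

    # Event-time field per dataset, from the table above.
    EVENT_TIME_FIELD = {
        "fan_profiles": "updatedAt",
        "fan_identity_graph": "updatedAt",
        "fan_profile_merges": "occurredAt",
        "fan_track_events": "timestamp",
    }

    def event_time(dataset, record):
        """Return the record's event time, not the filename timestamp."""
        return record[EVENT_TIME_FIELD[dataset]]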


Snapshot vs change stream

The three state datasets — fan_profiles, fan_identity_graph, and fan_profile_merges — are delivered in two phases, both of which land at the same exports/<dataset>/... prefix:

  • Snapshot: an initial point-in-time set of every existing row at the moment the destination was enabled (or the moment a snapshot rerun was requested). All records in a snapshot represent state, not changes.
  • Change stream: every subsequent insert / update / delete from that point forward.

A dataset's snapshot completes before its steady-state change stream begins flowing, so when you point a loader at a freshly enabled destination you'll see the snapshot land first, then change-stream files arrive on a rolling basis.

fan_track_events is change-stream only; it has no snapshot phase. Track events are append-only behavioral signals (SDK telemetry and Rover message-lifecycle events), so there is no historical state to replay; files for this dataset begin arriving as new events occur after the destination is enabled.

You generally do not need to distinguish snapshot records from change-stream records when loading: an upsert by id works for either within a single export run. The one dataset where the distinction shows up in the payload is fan_identity_graph, where change-stream records carry a meaningful operation field (create / update / delete); snapshot records are always emitted as operation: "create".
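
A sketch of routing fan_identity_graph records by operation. The store and its roverID key are purely illustrative stand-ins for your sink; the real edge schema is on the Datasets page:

    def apply_identity_graph_record(store, record):
        """Route a fan_identity_graph record by its operation field.
        Snapshot records always arrive as operation "create", so the
        same routine handles both phases."""
        op = record.get("operation", "create")
        if op == "delete":
            store.pop(record["roverID"], None)  # illustrative key only
        else:  # "create" and "update" both upsert
            store[record["roverID"]] = record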


Common envelope

Every record across every dataset carries these fields. Per-dataset sections on the Datasets page describe additional fields.

  • id (string): deterministic 32-character hex string. Stable across at-least-once delivery retries. Use as the primary key for dedupe within a single export run. (See the snapshot-rerun caveat in Idempotency and ordering.)
  • version (number): schema version. 1 today. Bumped on breaking changes; additive changes do not bump it.
  • roverID (string, UUID): canonical Rover identifier for the fan this record belongs to.
  • emittedAt (string, ISO 8601 UTC): when Rover wrote the record into the file. Useful for monitoring delivery latency. Not the time the underlying event happened; use the per-dataset event-time field for that.

Records also carry a per-dataset event-time field (updatedAt, occurredAt, or timestamp); see each dataset's schema on the Datasets page.
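
For reference, the envelope expressed as a Python TypedDict; per-dataset fields are omitted, and the type names here are our own, not part of the export:

    from typing import TypedDict

    class ExportEnvelope(TypedDict):
        """Fields common to every exported record."""
        id: str         # deterministic 32-char hex; dedupe key within a run
        version: int    # schema version; 1 today
        roverID: str    # UUID of the fan the record belongs to
        emittedAt: str  # ISO 8601 UTC write time, not the event time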
