Data Export
File Format & Layout
This page covers everything about how exported files appear in your bucket: the path layout, the file format itself, idempotency, and how to reason about ordering. Per-record schemas are on the Datasets page.
Bucket layout
Every file is written to a path of the form:
<your-prefix>/exports/<dataset>/date=YYYY-MM-DD/hour=HH/<dataset>-YYYYMMDDTHHmmssZ-part-<token>.ndjson.gz
| Segment | Notes |
|---|---|
| <your-prefix> | The optional prefix you configured on the destination. Omitted (no leading segment) when blank. |
| exports/ | Literal segment. Distinguishes Rover-written objects from anything else you may put in the bucket. |
| <dataset> | One of fan_profiles, fan_identity_graph, fan_profile_merges, fan_track_events. |
| date=YYYY-MM-DD | UTC date the file was written (Hive-style partition). |
| hour=HH | UTC hour the file was written (Hive-style partition). |
| <dataset>-YYYYMMDDTHHmmssZ-part-<token>.ndjson.gz | Unique filename per file. <token> is a long opaque slug derived from internal identifiers; treat it as an arbitrary string. Filenames are deterministic, so re-delivery of the same underlying file produces the same object key, and list-and-process loaders deduplicate naturally on key. |
The Hive-style date= / hour= partitions are recognized by most warehouse loaders (BigQuery external tables, Athena, Snowflake external stages, Databricks, etc.) and let you prune scans to specific time windows without listing every object.
File format
- Encoding: UTF-8.
- Compression: gzip, a single concatenated gzip stream per file.
- Format: NDJSON, one JSON object per line, \n-terminated. A trailing newline is not guaranteed.
- Atomicity: every file is a whole-object write; there are no partial uploads. A retry after a transient failure may replace an existing object at the same key with an equivalent payload, but a given object key, once visible, is always a complete, valid file.
Because every visible file is complete and named deterministically, you can list the bucket and process new objects with no risk of reading a partial file.
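Under those guarantees a minimal reader needs only gzip decompression plus line-by-line JSON parsing. A sketch in Python:

```python
import gzip
import json
from typing import Iterator

def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed record per NDJSON line of a .ndjson.gz export file.

    Stripping each line and skipping blanks keeps the loop correct whether
    or not the file ends with a trailing newline.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```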
Idempotency and ordering
Every record carries an id field that is deterministic within a delivery: replays of the same record during at-least-once retry edge cases always produce the same id. Treat id as the primary key when loading into a warehouse so duplicate files collapse on insert.
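The upsert can be as simple as keying storage on id; an in-memory dict stands in for the warehouse table in this sketch:

```python
def upsert_by_id(store: dict, records) -> None:
    """Load records into a store keyed on the envelope id field.

    Because ids are stable across at-least-once retries, replaying a
    duplicate file overwrites the same keys with equivalent payloads,
    which is a no-op in effect.
    """
    for rec in records:
        store[rec["id"]] = rec
```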
Snapshot reruns produce new ids
A snapshot rerun (see Setup) re-emits every existing row with new id values, not the originals. An upsert-by-id load will see them as new records, not duplicates of what is already loaded. Plan for this by deduplicating on (roverID, updatedAt) (or the dataset's event-time field) when processing a rerun, or by truncating and reloading the affected partitions.
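One way to collapse a rerun's re-emitted rows is to key on the fan and event time rather than id. A sketch, with updatedAt standing in for whichever event-time field the dataset uses:

```python
def dedupe_rerun(records, event_time_field: str = "updatedAt") -> list:
    """Keep the first record seen for each (roverID, event-time) pair.

    Rerun rows carry new ids but the same roverID and event time as the
    rows already loaded, so keying on that pair collapses them.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["roverID"], rec[event_time_field])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```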
File ordering across partitions is not guaranteed. Use the event-time field inside each record to reason about chronological order, not the filename timestamp:
| Dataset | Event-time field |
|---|---|
| fan_profiles | updatedAt |
| fan_identity_graph | updatedAt |
| fan_profile_merges | occurredAt |
| fan_track_events | timestamp |
The filename timestamp is when the file was written, not when the underlying event occurred.
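A loader can centralize the per-dataset field names from the table above and sort on them; the ISO 8601 UTC timestamps used throughout these exports order correctly as plain strings:

```python
# Per-dataset event-time fields, from the table above.
EVENT_TIME_FIELD = {
    "fan_profiles": "updatedAt",
    "fan_identity_graph": "updatedAt",
    "fan_profile_merges": "occurredAt",
    "fan_track_events": "timestamp",
}

def sort_by_event_time(dataset: str, records: list) -> list:
    """Order records chronologically by the dataset's event-time field,
    not by the filename timestamp."""
    return sorted(records, key=lambda r: r[EVENT_TIME_FIELD[dataset]])
```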
Snapshot vs change stream
The three state datasets — fan_profiles, fan_identity_graph, and fan_profile_merges — are delivered in two phases, both of which land at the same exports/<dataset>/... prefix:
- Snapshot: an initial point-in-time set of every existing row at the moment the destination was enabled (or the moment a snapshot rerun was requested). All records in a snapshot represent state, not changes.
- Change stream: every subsequent insert / update / delete from that point forward.
The snapshot for a dataset completes before its steady-state change stream begins flowing for that dataset, so when you point a loader at a freshly-enabled destination you'll see the snapshot land first, then change-stream files arrive on a rolling basis.
fan_track_events is change-stream only; it has no snapshot phase. Track events are append-only behavioral signals (SDK telemetry and Rover message-lifecycle events), so there is no historical state to replay; files for this dataset begin arriving as new events occur after the destination is enabled.
You generally do not need to distinguish snapshot records from change-stream records when loading: an upsert by id works for either within a single export run. The one dataset where the distinction shows up in the payload is fan_identity_graph, where change-stream records carry a meaningful operation field (create / update / delete); snapshot records are always emitted as operation: "create".
Common envelope
Every record across every dataset carries these fields. Per-dataset sections on the Datasets page describe additional fields.
| Field | Type | Notes |
|---|---|---|
| id | string | Deterministic 32-character hex string. Stable across at-least-once delivery retries. Use as primary key for dedupe within a single export run. (See the snapshot-rerun caveat in Idempotency and ordering.) |
| version | number | Schema version. 1 today. Bumped on breaking changes; additive changes do not bump it. |
| roverID | string (UUID) | Canonical Rover identifier for the fan this record belongs to. |
| emittedAt | string (ISO 8601 UTC) | When Rover wrote the record into the file. Useful for monitoring delivery latency. Not the time the underlying event happened; use the per-dataset event-time field for that. |
Records also carry a per-dataset event-time field (updatedAt, occurredAt, or timestamp); see each dataset's schema on the Datasets page.
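A loader might sanity-check the envelope before inserting. A hedged sketch: the field names come from the table above, but what to do with a failing record (skip, quarantine, alert) is your policy, not Rover's.

```python
def envelope_problems(rec: dict) -> list:
    """Return a list of envelope issues for one record (empty when OK)."""
    problems = []
    for field in ("id", "version", "roverID", "emittedAt"):
        if field not in rec:
            problems.append("missing " + field)
    # Schema version is 1 today; anything else signals a breaking change
    # per the table above, so flag it for review rather than loading blind.
    if rec.get("version", 1) != 1:
        problems.append("unexpected version")
    return problems
```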