File Structure and Naming Conventions
Userpilot exports data in a structured manner to your cloud storage, typically organized by date and possibly event type, to facilitate easier management and querying.
- Directory Structure: Data files are usually organized hierarchically. A common pattern is by date:
  - For example, data synced on May 10, 2025, might be found in a path like userpilot_events/userpilot_datasync_13214_NX-123454/2025-05-10/.
- File Naming Conventions: Files within these directories often include timestamps or unique identifiers to prevent overwrites and indicate the batch of data they contain.
- Data Granularity per File: Each file typically contains data for a specific time window (e.g., one hour or one day, depending on your sync frequency).
Data Format Details
Userpilot exports data in a well-defined format to ensure consistency and ease of parsing. The most common formats for raw event data are JSON (JavaScript Object Notation) or Parquet.
JSON
- Data is typically provided as one JSON object per line (NDJSON/JSON Lines).
- Each line represents a single event.
- Use Cases: Easy to read and parse by many systems, widely supported in data pipelines and analytics tools.
- Example:
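A hypothetical pair of event records in JSON Lines form (the field names here are illustrative, not Userpilot's actual schema):

```json
{"event_type": "page_view", "user_id": "u_12345", "company_id": "c_678", "timestamp": "2025-05-10T14:23:05Z", "metadata": {"url": "https://app.example.com/dashboard", "browser": "Chrome"}}
{"event_type": "track_feature", "user_id": "u_12345", "company_id": "c_678", "timestamp": "2025-05-10T14:24:11Z", "metadata": {"feature_id": "123"}}
```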
CSV
- Data is provided as plain text, with each line representing a record and fields separated by commas. The first line is a header row with column names.
- Use Cases: Easily imported into spreadsheets, relational databases, and many data analysis tools. Best for flat data structures where all records have the same set of fields.
- Example:
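A hypothetical CSV rendering of the same illustrative record, with the nested metadata column stringified as JSON:

```csv
event_type,user_id,company_id,timestamp,metadata
page_view,u_12345,c_678,2025-05-10T14:23:05Z,"{""url"": ""https://app.example.com/dashboard"", ""browser"": ""Chrome""}"
```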
Note: The metadata field may be stringified as JSON, omitted, or flattened depending on implementation.
Parquet
- Parquet is a columnar storage file format optimized for use with big data processing frameworks.
- Use Cases: Ideal for large-scale analytics, efficient storage, and fast querying in data lakes and warehouses. Supported by tools like Apache Spark, Pandas (Python), and most modern data platforms.
- Note: You will need tools or libraries that can read this format (e.g., Apache Spark, or Pandas in Python with pyarrow or fastparquet).
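A minimal sketch of loading an exported Parquet file with Pandas, assuming pyarrow (or fastparquet) is installed; the file path is a hypothetical example:

```python
import pandas as pd

# Hypothetical path to an exported file; requires pyarrow or fastparquet
# to be installed (e.g., pip install pyarrow).
df = pd.read_parquet("userpilot_events/2025-05-10/all_events.parquet")

print(df.dtypes)   # the schema comes from the file's column metadata
print(df.head())   # preview the first few event records
```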
Apache Avro
- Avro is a row-oriented data serialization framework that stores data in a compact binary format, with the schema included alongside the data.
- Use Cases: Common in Apache Kafka, Hadoop, and for long-term data archival where schema evolution is important. Supports adding/removing fields over time without breaking downstream consumers.
- Note: Data is stored in binary format. Use Avro libraries in your programming language of choice (Java, Python, etc.) to read and write Avro files. The schema is used to interpret the binary data.
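A minimal sketch of reading an Avro export with the fastavro library (pip install fastavro); the path is a hypothetical example:

```python
from fastavro import reader

# Hypothetical path to a daily export file. The schema embedded in the
# file is read automatically and used to decode the binary records.
with open("2025-05-10/all_events.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema shipped inside the file
    for record in avro_reader:        # each record is a plain Python dict
        print(record)
        break
```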
Folder Structure
Userpilot Data Sync organizes your exported data in a clear, hierarchical folder structure to make it easy to locate, process, and analyze your data. This structure is designed to support both granular event analysis and high-level reporting. Below is an overview of the folder structure you will find in your storage destination:
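An illustrative sketch of this layout, assembled from the folder and file patterns described below (the root prefix and exact names depend on your app and configuration):

```
userpilot_datasync_{AppID}/
├── all_companies/all_companies.avro
├── all_users/all_users.avro
└── {Date}/
    ├── all_events.avro
    ├── feature_tags/
    │   ├── matched_events/track_feature_{ID}.avro
    │   ├── feature_tags_breakdown.avro
    │   └── feature_tags_definitions.avro
    ├── labeled_events/        (same layout as feature_tags/)
    ├── tagged_pages/          (same layout, with page_view_{ID}.avro)
    ├── trackable_events/      (same layout)
    ├── interaction/           (same layout)
    ├── users/identify_user.avro
    └── companies/identify_company.avro
```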
Folder & File Descriptions
- all_companies/ and all_users/
  - Contain a snapshot of all identified companies and users, respectively, as of the latest sync. Each file (e.g., all_companies.avro, all_users.avro) includes the most recent state and all auto-captured and custom properties for each entity.
- {Date}/
  - Each date folder contains all data synced for that specific day. This allows for easy partitioning and historical analysis.
  - all_events.avro: All raw events captured on that date, across all event types.
- feature_tags/, labeled_events/, tagged_pages/, trackable_events/, interaction/:
  - Each of these folders contains:
    - matched_events/: Raw events for each identified/tagged/labeled event (e.g., track_feature_{ID}.avro, track_labeled_{ID}.avro, etc.).
    - Breakdown files (e.g., feature_tags_breakdown.avro): Aggregated counts and engagement metrics for that day.
    - Definitions files (e.g., feature_tags_definitions.avro): Metadata such as name, description, and category, allowing you to map event IDs to human-readable definitions.
- users/ and companies/: Contain user and company identification events for that day (identify_user.avro, identify_company.avro).
How to Use This Structure
- Use the all_companies and all_users files to get the latest state and properties for all entities in your app.
- Use the {Date} folders to analyze daily event activity, engagement, and breakdowns.
- The matched_events folders provide access to all raw events for each feature, label, or tag, enabling deep-dive analysis.
- The breakdown and definitions files make it easy to join raw event data with descriptive metadata for reporting and analytics.
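For example, a minimal sketch of the join described above, combining raw feature-tag events with their definitions using fastavro and pandas (all paths are hypothetical, and the join key is an assumption; check the definitions file for the actual column name):

```python
import pandas as pd
from fastavro import reader

def load_avro(path: str) -> pd.DataFrame:
    """Read an Avro export into a DataFrame."""
    with open(path, "rb") as f:
        return pd.DataFrame(list(reader(f)))

events = load_avro("2025-05-10/feature_tags/matched_events/track_feature_123.avro")
definitions = load_avro("2025-05-10/feature_tags/feature_tags_definitions.avro")

# Attach name/description/category metadata to each raw event.
report = events.merge(definitions, on="feature_id", how="left")
print(report.head())
```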
Planning for Data Ingestion: A clear understanding of the file structure and naming is vital for setting up automated ETL/ELT pipelines to load this data into your data warehouse or data lake. Your ingestion scripts will rely on these patterns to discover new files.
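A minimal sketch of date-partition file discovery for such a pipeline, assuming an S3 destination (the bucket name is hypothetical; the prefix reuses the example path from above):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-company-data"  # hypothetical bucket
prefix = "userpilot_events/userpilot_datasync_13214_NX-123454/2025-05-10/"  # one date partition

# Page through every object under the date prefix so no file is missed.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])  # hand these keys to your loader
```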
Retrospective Export Runs
Userpilot automatically performs retrospective export runs every Saturday to ensure your exported data reflects the most current configuration settings. Retrospective exports only run for configuration data that requires labeling, specifically tagged pages and labeled events. This process ensures that historical events are re-exported with the correct data when configuration changes occur.
How It Works
Each Saturday, the Userpilot system:
- Detects Configuration Changes: The system checks for “touched events”: events that have been affected by changes to configuration settings that require labeling, including:
  - Tagged Pages: Changes to Targeting Settings like Domains and Path matching that determine which page views match a tagged page
  - Labeled Events: Updates to Domains and Path matching, or to CSS selector and Text, that determine which events match a labeled event definition
- Identifies Affected Events: When configuration changes are detected, the system identifies all historical events that may have been impacted by these changes.
- Runs Retrospective Export: The system performs a retrospective export on all historical partitions from up to 365 days back, re-exporting data for the affected events so the raw data reflects your updated Targeting Settings for tagged pages or matching criteria for labeled events.
Affected Folders
Only the following folders are updated during retrospective export runs:
- tagged_pages/*: All files within the tagged_pages directory structure, including:
  - tagged_pages/matched_events/page_view_{ID}.avro
  - tagged_pages/tagged_pages_breakdown.avro
  - tagged_pages/tagged_pages_definitions.avro
- labeled_events/*: All files within the labeled_events directory structure, including:
  - labeled_events/matched_events/track_labeled_{ID}.avro
  - labeled_events/labeled_events_breakdown.avro
  - labeled_events/labeled_events_definitions.avro
All other folders and files (feature_tags/, trackable_events/, interaction/, users/, companies/, all_events.avro) are not affected by retrospective exports and remain unchanged.
Why This Matters
When you update configuration settings for tagged pages or labeled events, such as:
- Modifying Targeting Settings (Domains and Path matching) for tagged pages
- Changing Domains and Path matching, or CSS selector and Text, for labeled events

the retrospective export process ensures that:
- Historical data reflects your current configuration rules
- Events are correctly categorized and matched based on updated settings
- Your analytics and reporting remain accurate even after configuration changes
What to Expect
- Frequency: Retrospective exports run automatically every Saturday
- Scope: Covers all partitions up to 365 days in the past
- Impact: Only affected events (those matching changed configurations) are re-exported
- File Updates: Updated files will appear in your storage destination with the corrected data
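One way to spot these re-exports in an S3 destination is to compare each object's LastModified timestamp against its date partition; a sketch under those assumptions (bucket and prefix names are hypothetical, and the two-day threshold is arbitrary):

```python
import re
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-company-data", "userpilot_events/"  # hypothetical names

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        m = re.search(r"/(\d{4}-\d{2}-\d{2})/", obj["Key"])
        if not m:
            continue
        partition = datetime.strptime(m.group(1), "%Y-%m-%d").replace(tzinfo=timezone.utc)
        # Files modified well after their partition date were likely
        # rewritten by a Saturday retrospective export.
        if obj["LastModified"] - partition > timedelta(days=2):
            print("re-exported:", obj["Key"], obj["LastModified"])
```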
Data Consistency: If you notice updated files in your storage destination on Saturdays, this is likely due to a retrospective export run affecting the tagged_pages/* and labeled_events/* folders.

Recommended Approach: Setting up your data warehouse or database to read from S3 directly/remotely is preferable, as this keeps data consistency responsibility with Userpilot. When files are updated, your queries will automatically reflect the latest data without requiring manual intervention.

Alternative Approach: If you are copying data into your own storage and ingesting the mutable folders (tagged_pages/* and labeled_events/*), ensure you rewrite entire data parts on weekly updates to match Userpilot’s data sync behavior. This means replacing the entire affected partition or data segment rather than attempting incremental updates, as the retrospective export process overwrites files completely.
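A minimal sketch of that partition-replacement pattern for the alternative approach, assuming local copies of the sync (all paths and the layout are hypothetical examples): delete the whole mutable partition, then re-copy it from the synced source, rather than patching individual files.

```python
import shutil
from pathlib import Path

def replace_partition(local_root: str, synced_root: str, date: str, folder: str) -> None:
    """Replace a mutable partition wholesale after a retrospective export,
    mirroring the full-rewrite behavior described above."""
    local = Path(local_root) / date / folder
    synced = Path(synced_root) / date / folder
    shutil.rmtree(local, ignore_errors=True)  # drop the stale copy entirely
    shutil.copytree(synced, local)            # re-copy the rewritten files

# Only the mutable folders need this treatment each week.
for mutable in ("tagged_pages", "labeled_events"):
    replace_partition("/warehouse/userpilot", "/mnt/userpilot_sync", "2025-05-10", mutable)
```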