File Structure and Naming Conventions
Userpilot exports data in a structured manner to your cloud storage, typically organized by date and possibly event type, to facilitate easier management and querying.
- Directory Structure: Data files are usually organized hierarchically. A common pattern is by date:
  - For example, data synced on May 10, 2025, might be found in a path like userpilot_events/userpilot_datasync_13214_NX-123454/2025-05-10/.
- File Naming Conventions: Files within these directories often include timestamps or unique identifiers to prevent overwrites and indicate the batch of data they contain.
- Data Granularity per File: Each file typically contains data for a specific time window (e.g., one hour or one day, depending on your sync frequency).
Data Format Details
Userpilot exports data in a well-defined format to ensure consistency and ease of parsing. The most common formats for raw event data are JSON (JavaScript Object Notation) or Parquet.
JSON
- Data is typically provided as one JSON object per line (NDJSON/JSON Lines).
- Each line represents a single event.
- Use Cases: Easy to read and parse by many systems, widely supported in data pipelines and analytics tools.
- Example:
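A hypothetical pair of event records in JSON Lines form (the field names here are illustrative, not Userpilot's actual schema):

```json
{"event_type": "page_view", "user_id": "u_12345", "company_id": "c_678", "timestamp": "2025-05-10T14:23:05Z", "metadata": {"url": "https://app.example.com/dashboard", "browser": "Chrome"}}
{"event_type": "track_feature", "user_id": "u_12345", "company_id": "c_678", "timestamp": "2025-05-10T14:24:11Z", "metadata": {"feature_id": "123"}}
```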
CSV
- Data is provided as plain text, with each line representing a record and fields separated by commas. The first line is a header row with column names.
- Use Cases: Easily imported into spreadsheets, relational databases, and many data analysis tools. Best for flat data structures where all records have the same set of fields.
- Example:
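A hypothetical CSV rendering of the same illustrative record, with the nested metadata column stringified as JSON:

```csv
event_type,user_id,company_id,timestamp,metadata
page_view,u_12345,c_678,2025-05-10T14:23:05Z,"{""url"": ""https://app.example.com/dashboard"", ""browser"": ""Chrome""}"
```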
Note: The metadata field may be stringified as JSON, omitted, or flattened depending on implementation.
Parquet
- Parquet is a columnar storage file format optimized for use with big data processing frameworks.
- Use Cases: Ideal for large-scale analytics, efficient storage, and fast querying in data lakes and warehouses. Supported by tools like Apache Spark, Pandas (Python), and most modern data platforms.
- Note: You will need tools or libraries that can read this format (e.g., Apache Spark, or Pandas in Python with pyarrow or fastparquet).
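A minimal sketch of loading an exported Parquet file with Pandas, assuming pyarrow (or fastparquet) is installed; the file path is a hypothetical example:

```python
import pandas as pd

# Hypothetical path to an exported file; requires pyarrow or fastparquet
# to be installed (e.g., pip install pyarrow).
df = pd.read_parquet("userpilot_events/2025-05-10/all_events.parquet")

print(df.dtypes)   # the schema comes from the file's column metadata
print(df.head())   # preview the first few event records
```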
Apache Avro
- Avro is a row-oriented data serialization framework that stores data in a compact binary format, with the schema included alongside the data.
- Use Cases: Common in Apache Kafka, Hadoop, and for long-term data archival where schema evolution is important. Supports adding/removing fields over time without breaking downstream consumers.
- Note: Data is stored in binary format. Use Avro libraries in your programming language of choice (Java, Python, etc.) to read and write Avro files. The schema is used to interpret the binary data.
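A minimal sketch of reading an Avro export with the fastavro library (pip install fastavro); the path is a hypothetical example:

```python
from fastavro import reader

# Hypothetical path to a daily export file. The schema embedded in the
# file is read automatically and used to decode the binary records.
with open("2025-05-10/all_events.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema shipped inside the file
    for record in avro_reader:        # each record is a plain Python dict
        print(record)
        break
```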
Folder Structure
Userpilot Data Sync organizes your exported data in a clear, hierarchical folder structure to make it easy to locate, process, and analyze your data. This structure is designed to support both granular event analysis and high-level reporting. Below is an overview of the folder structure you will find in your storage destination:
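An illustrative sketch of this layout, assembled from the folder and file patterns described below (the root prefix and exact names depend on your app and configuration):

```
userpilot_datasync_{AppID}/
├── all_companies/all_companies.avro
├── all_users/all_users.avro
└── {Date}/
    ├── all_events.avro
    ├── feature_tags/
    │   ├── matched_events/track_feature_{ID}.avro
    │   ├── feature_tags_breakdown.avro
    │   └── feature_tags_definitions.avro
    ├── labeled_events/        (same layout as feature_tags/)
    ├── tagged_pages/          (same layout, with page_view_{ID}.avro)
    ├── trackable_events/      (same layout)
    ├── interaction/           (same layout)
    ├── users/identify_user.avro
    └── companies/identify_company.avro
```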
Folder & File Descriptions
- all_companies/ and all_users/
  - Contain a snapshot of all identified companies and users, respectively, as of the latest sync. Each file (e.g., all_companies.avro, all_users.avro) includes the most recent state and all auto-captured and custom properties for each entity.
- {Date}/
  - Each date folder contains all data synced for that specific day. This allows for easy partitioning and historical analysis.
  - all_events.avro: All raw events captured on that date, across all event types.
- feature_tags/, labeled_events/, tagged_pages/, trackable_events/, interaction/:
  - Each of these folders contains:
    - matched_events/: Raw events for each identified/tagged/labeled event (e.g., track_feature_{ID}.avro, track_labeled_{ID}.avro, etc.).
    - Breakdown files (e.g., feature_tags_breakdown.avro): Aggregated counts and engagement metrics for that day.
    - Definitions files (e.g., feature_tags_definitions.avro): Metadata such as name, description, and category, allowing you to map event IDs to human-readable definitions.
- users/ and companies/: Contain user and company identification events for that day (identify_user.avro, identify_company.avro).
How to Use This Structure
- Use the all_companies and all_users files to get the latest state and properties for all entities in your app.
- Use the {Date} folders to analyze daily event activity, engagement, and breakdowns.
- The matched_events folders provide access to all raw events for each feature, label, or tag, enabling deep-dive analysis.
- The breakdown and definitions files make it easy to join raw event data with descriptive metadata for reporting and analytics.
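For example, a minimal sketch of the join described above, combining raw feature-tag events with their definitions using fastavro and pandas (all paths are hypothetical, and the join key is an assumption; check the definitions file for the actual column name):

```python
import pandas as pd
from fastavro import reader

def load_avro(path: str) -> pd.DataFrame:
    """Read an Avro export into a DataFrame."""
    with open(path, "rb") as f:
        return pd.DataFrame(list(reader(f)))

events = load_avro("2025-05-10/feature_tags/matched_events/track_feature_123.avro")
definitions = load_avro("2025-05-10/feature_tags/feature_tags_definitions.avro")

# Attach name/description/category metadata to each raw event.
report = events.merge(definitions, on="feature_id", how="left")
print(report.head())
```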
Planning for Data Ingestion: A clear understanding of the file structure and naming is vital for setting up automated ETL/ELT pipelines to load this data into your data warehouse or data lake. Your ingestion scripts will rely on these patterns to discover new files.
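A minimal sketch of date-partition file discovery for such a pipeline, assuming an S3 destination (the bucket name is hypothetical; the prefix reuses the example path from above):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-company-data"  # hypothetical bucket
prefix = "userpilot_events/userpilot_datasync_13214_NX-123454/2025-05-10/"  # one date partition

# Page through every object under the date prefix so no file is missed.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])  # hand these keys to your loader
```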
Retrospective Export Runs
Userpilot automatically performs retrospective export runs every Saturday to ensure your exported data reflects the most current configuration settings. Retrospective exports only run for configuration data that requires labeling, specifically tagged pages and labeled events. This process ensures that historical events are re-exported with the correct data when configuration changes occur.
How It Works
Each Saturday, the Userpilot system:
- Detects Configuration Changes: The system checks for “touched events”: events that have been affected by changes to configuration settings that require labeling, including:
  - Tagged Pages: Changes to Targeting Settings like Domains and Path matching that determine which page views match a tagged page
  - Labeled Events: Updates to Domains and Path matching, or to CSS selector and Text, that determine which events match a labeled event definition
- Identifies Affected Events: When configuration changes are detected, the system identifies all historical events that may have been impacted by these changes.
- Runs Retrospective Export: The system performs a retrospective export on all historical partitions from up to 365 days back, re-exporting data for the affected events so the raw data reflects your updated Targeting Settings for tagged pages or matching criteria for labeled events.
Affected Folders
Only the following folders are updated during retrospective export runs:
- tagged_pages/*: All files within the tagged_pages directory structure, including:
  - tagged_pages/matched_events/page_view_{ID}.avro
  - tagged_pages/tagged_pages_breakdown.avro
  - tagged_pages/tagged_pages_definitions.avro
- labeled_events/*: All files within the labeled_events directory structure, including:
  - labeled_events/matched_events/track_labeled_{ID}.avro
  - labeled_events/labeled_events_breakdown.avro
  - labeled_events/labeled_events_definitions.avro
All other folders and files (feature_tags/, trackable_events/, interaction/, users/, companies/, all_events.avro) are not affected by retrospective exports and remain unchanged.
Why This Matters
When you update configuration settings for tagged pages or labeled events, such as:
- Modifying Targeting Settings (Domains and Path matching) for tagged pages
- Changing Domains and Path matching, or CSS selector and Text, for labeled events

the retrospective export process ensures that:
- Historical data reflects your current configuration rules
- Events are correctly categorized and matched based on updated settings
- Your analytics and reporting remain accurate even after configuration changes
What to Expect
- Frequency: Retrospective exports run automatically every Saturday
- Scope: Covers all partitions up to 365 days in the past
- Impact: Only affected events (those matching changed configurations) are re-exported
- File Updates: Updated files will appear in your storage destination with the corrected data
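One way to spot these re-exports in an S3 destination is to compare each object's LastModified timestamp against its date partition; a sketch under those assumptions (bucket and prefix names are hypothetical, and the two-day threshold is arbitrary):

```python
import re
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-company-data", "userpilot_events/"  # hypothetical names

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        m = re.search(r"/(\d{4}-\d{2}-\d{2})/", obj["Key"])
        if not m:
            continue
        partition = datetime.strptime(m.group(1), "%Y-%m-%d").replace(tzinfo=timezone.utc)
        # Files modified well after their partition date were likely
        # rewritten by a Saturday retrospective export.
        if obj["LastModified"] - partition > timedelta(days=2):
            print("re-exported:", obj["Key"], obj["LastModified"])
```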
Data Consistency: If you notice updated files in your storage destination on Saturdays, this is likely due to a retrospective export run affecting the tagged_pages/* and labeled_events/* folders.

Recommended Approach: Setting up your data warehouse or database to read from S3 directly/remotely is preferable, as this keeps data consistency responsibility with Userpilot. When files are updated, your queries will automatically reflect the latest data without requiring manual intervention.

Alternative Approach: If you are copying data into your own storage and ingesting the mutable folders (tagged_pages/* and labeled_events/*), ensure you rewrite entire data parts on weekly updates to match Userpilot’s data sync behavior. This means replacing the entire affected partition or data segment rather than attempting incremental updates, as the retrospective export process overwrites files completely.
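A minimal sketch of that partition-replacement pattern for the alternative approach, assuming local copies of the sync (all paths and the layout are hypothetical examples): delete the whole mutable partition, then re-copy it from the synced source, rather than patching individual files.

```python
import shutil
from pathlib import Path

def replace_partition(local_root: str, synced_root: str, date: str, folder: str) -> None:
    """Replace a mutable partition wholesale after a retrospective export,
    mirroring the full-rewrite behavior described above."""
    local = Path(local_root) / date / folder
    synced = Path(synced_root) / date / folder
    shutil.rmtree(local, ignore_errors=True)  # drop the stale copy entirely
    shutil.copytree(synced, local)            # re-copy the rewritten files

# Only the mutable folders need this treatment each week.
for mutable in ("tagged_pages", "labeled_events"):
    replace_partition("/warehouse/userpilot", "/mnt/userpilot_sync", "2025-05-10", mutable)
```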