Journeys Evaluation Dataset

Intent-grounded browsing dataset for evaluating Edge Journeys quality

Background

Edge Journeys is an AI-powered browser feature in Microsoft Edge's Copilot Mode that transforms browsing history into task-themed clusters, helping users resume and continue their work without starting over. The feature surfaces up to 3 Journey cards on the New Tab Page (NTP), each with a title, preview image, and suggested next-step action.

This dataset provides intent-grounded evaluation data — browsing sessions where we know exactly what the user was trying to do. This lets us objectively measure whether a generated Journey "got it right" instead of relying on heuristic metrics alone.

The dataset was collected using an LLM-driven browsing agent (claude-opus-4.6-1m) that role-plays as different user personas, making reactive browsing decisions based on actual page content — not scripted paths.
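The reactive decision step can be sketched as follows. This is an illustrative stub only: the real agent queries an LLM with the persona and full page content, while here the LLM is replaced by simple keyword overlap, and the persona/page/action shapes are assumptions, not the agent's actual interface.

```python
# Hypothetical sketch of the agent's reactive browsing loop.
# The LLM decision is stubbed with keyword overlap for illustration.

def choose_next_action(persona_goal: str, page_links: list[str]) -> str:
    """Pick the on-page link that best matches the persona's goal,
    mimicking a content-grounded (not scripted) browsing decision."""
    goal_words = set(persona_goal.lower().split())

    def overlap(link: str) -> int:
        return len(goal_words & set(link.lower().split()))

    best = max(page_links, key=overlap)
    # If nothing on the page serves the goal, the agent ends the session.
    return best if overlap(best) > 0 else "stop"

# Example: a persona shopping for headphones on a mixed page.
links = ["Weather today", "Best noise-cancelling headphones 2024", "Sign in"]
print(choose_next_action("compare noise-cancelling headphones", links))
```

The key property this mirrors is that the next click depends on the actual page content the agent sees, which is what makes the collected sessions reactive rather than pre-scripted.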

Layer 1 — Generation Quality (64 tasks)

Each task is a single-topic browsing session (~10 pages) with a defined browsing goal and ground-truth intent. This layer tests whether Journeys can generate a good card from relatively clean data.

  • POS (Positive, 50 tasks): A Journey should be generated. The user was actively researching a topic (e.g., comparing hotels, shopping for headphones, writing a paper). We evaluate card quality, CTA accuracy, and groundedness.
  • NEG (Negative, 14 tasks): A Journey should NOT be generated. Reasons include: task already completed (bought the product), trivial lookup (checked weather), noise-only session (background tabs), privacy-sensitive topic (surprise party, health), or expired event (past concert).

Layer 2 — Selection Quality (10 profiles)

Each profile is a multi-topic, multi-day browsing history (~100 pages) that combines 3-4 Layer 1 tasks with 60-70% background noise (email, news, social media). This simulates what a real user's browsing history looks like in production.
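The mixing described above can be sketched as follows. The 65% noise target and per-task page counts fall within the ranges stated; the interleaving scheme itself is an assumption for illustration.

```python
import random

def build_profile(task_pages: list[list[str]], noise_pages: list[str],
                  noise_frac: float = 0.65, seed: int = 0) -> list[str]:
    """Mix several tasks' pages with background noise so that roughly
    noise_frac of the resulting history is noise, then shuffle."""
    signal = [p for task in task_pages for p in task]
    # Solve n_noise / (n_noise + len(signal)) ≈ noise_frac for n_noise.
    n_noise = round(len(signal) * noise_frac / (1 - noise_frac))
    rng = random.Random(seed)
    history = signal + rng.choices(noise_pages, k=n_noise)
    rng.shuffle(history)
    return history

tasks = [[f"hotel-{i}" for i in range(10)],
         [f"headphones-{i}" for i in range(10)],
         [f"paper-{i}" for i in range(10)]]
noise = ["email inbox", "news homepage", "social feed"]
profile = build_profile(tasks, noise)
print(len(profile))  # 30 signal pages + 56 noise pages = 86
```

Three 10-page tasks plus ~65% noise lands near the ~100-page profile size described above.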

  • Tests signal extraction: Can Journeys find the 2-3 real topics buried in noise?
  • Tests suppression: Are completed tasks, sensitive topics, and expired events correctly excluded?
  • Tests dedup: If a user researched two similar topics, does Journeys merge or separate them correctly?
  • Tests ranking: Is the most important Journey shown first?
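A minimal check for the first two criteria, signal extraction and suppression, might look like the sketch below. Set-based matching on topic labels is an assumption; the actual evaluation may rely on LLM or human judgment of topic equivalence.

```python
def selection_scores(shown: set[str], expected: set[str],
                     suppressed: set[str]) -> dict[str, float]:
    """Precision/recall of surfaced Journeys against ground-truth topics,
    plus a suppression rate over topics that must NOT appear."""
    tp = len(shown & expected)
    return {
        "precision": tp / len(shown) if shown else 0.0,
        "recall": tp / len(expected) if expected else 0.0,
        # 1.0 means every must-suppress topic was correctly excluded.
        "suppression": 1 - len(shown & suppressed) / len(suppressed)
                       if suppressed else 1.0,
    }

scores = selection_scores(
    shown={"hotel trip", "headphones"},
    expected={"hotel trip", "headphones", "term paper"},
    suppressed={"surprise party"},
)
print(scores)  # precision 1.0, recall ~0.67, suppression 1.0
```

Dedup and ranking checks would layer on top of this, e.g. penalizing duplicate topics in `shown` and comparing the display order against a ground-truth priority.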