Datasets

Datasets

The nxthdr platform makes all collected data freely available to the research community. This section provides comprehensive documentation for accessing and querying our datasets, which include active measurement results, BGP routing data, and traffic flow information.

You can find the equivalent but more interactive documentation on our website.

Data Licensing

These datasets are made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://opendatacommons.org/licenses/pddl/1.0/.

Access & Authentication

To access the data, use basic HTTP authentication with the following public credentials:

Endpoint: https://nxthdr.dev/api/query/
Username: read
Password: read

These credentials are intentionally public and can be freely shared. Authentication exists solely to prevent automated scraping while allowing legitimate access.

You can query the datasets using:

  • Command line: curl with SQL queries (examples below)
  • ClickHouse client: Official CLI tool with native protocol
  • Programming languages: HTTP libraries in Python, R, JavaScript, etc.
  • Analytics platforms: Jupyter notebooks, R Studio, or custom applications

All queries must be compatible with ClickHouse SQL. See the examples below for each dataset to get started.

Available Datasets

Probing Dataset

The probing dataset is available in the saimiris.replies table. Active measurements using saimiris, whether scheduled via cron jobs or performed on demand by the users, are stored in a ClickHouse database. This data consists of traceroute-like and ping-like measurement results collected from multiple vantage points.

Each row corresponds to a measurement result, capturing the source and destination IP addresses of the sent packet, the reply, the hop count, and other relevant attributes. This dataset is ideal for network topology discovery, latency analysis, and path characterization.

Example Query

Find the top 10 destinations with the highest average RTT from the CDG agent in the last hour:

curl -X POST "https://nxthdr.dev/api/query/" \
  -u "read:read" \
  -H "Content-Type: text/plain" \
  -d "SELECT probe_dst_addr,
       count(*) as measurements,
       round(avg(rtt), 2) as avg_rtt_us
FROM saimiris.replies
WHERE time_received_ns >= now() - INTERVAL 1 HOUR
  AND agent_id = 'vltcdg01'
GROUP BY probe_dst_addr
ORDER BY avg_rtt_us DESC
LIMIT 10 FORMAT CSVWithNames"

Peering Dataset

The peering dataset is available in the bmp.updates table. Each router of as215011, including those inside IXPs, sends BMP messages to risotto, which records the updates in a ClickHouse database.

Each row corresponds to an update or a withdraw, capturing prefixes, AS paths, communities, and other attributes. This dataset is perfect for analyzing BGP routing dynamics, prefix announcements, and AS relationship studies.

Example Query

Example Query Find prefixes with the most BGP communities in the last 24 hours:

curl -X POST "https://nxthdr.dev/api/query/" \
  -u "read:read" \
  -H "Content-Type: text/plain" \
  -d "WITH concat(prefix_addr, '/', prefix_len) AS prefix
SELECT prefix,
       max(length(communities)) AS n_communities
FROM bmp.updates
WHERE time_received_ns >= now() - INTERVAL 1 DAY
GROUP BY prefix
ORDER BY n_communities DESC
LIMIT 5 FORMAT CSVWithNames"

Traffic Dataset

The traffic dataset is available in the flows.records table. Each router of as215011 sends sFlow messages to goflow2, which records the flow samples in a ClickHouse database.

Each row represents a sampled network flow, capturing traffic statistics between source and destination endpoints. This dataset is more useful for internal troubleshooting, but is public for transparency and possible research use cases.

Example Query

Find the top 10 destination IP addresses by average bandwidth in the last 24 hours:

curl -X POST "https://nxthdr.dev/api/query/" \
  -u "read:read" \
  -H "Content-Type: text/plain" \
  -d "SELECT
  dst_addr,
  sum(bytes * sampling_rate) / (24 * 3600) as avg_bytes_per_second,
  count(*) as flow_records,
  sum(bytes * sampling_rate) as total_estimated_bytes
FROM flows.records
WHERE time_flow_start_ns >= now() - INTERVAL 1 DAY
GROUP BY dst_addr
ORDER BY avg_bytes_per_second DESC
LIMIT 10 FORMAT CSVWithNames"