Compressed graph
A compact and highly efficient representation of the graph dataset, suited to scale-up analysis on high-end machines with large amounts of memory. The graph is compressed using the Boldi-Vigna representation and is designed to be loaded with the WebGraph framework, specifically through our swh-graph library.
13 datasets
Dataset size
14 TiB
Export date
2025-05-18
Derived dataset
Aggregated Contents
S3 URL
s3://softwareheritage/graph/2025-05-18/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2025-05-18/compressed/ 2025-05-18-compressed
# OR
swh datasets download-graph 2025-05-18
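The S3 URLs and local directory names in this listing follow a single pattern, so the aws command can be generated from a graph name. The sketch below is illustrative only: the helper names `s3_url`, `aws_command`, and `has_room` are hypothetical, the pattern is inferred from the entries on this page, and the dataset size has to be copied from the listing by hand.

```python
import shutil

TIB = 1024 ** 4  # bytes in one tebibyte


def s3_url(graph_name: str) -> str:
    """S3 prefix for a graph, following the pattern used in this listing."""
    return f"s3://softwareheritage/graph/{graph_name}/compressed/"


def aws_command(graph_name: str) -> list[str]:
    """Argument list for the anonymous recursive copy shown above."""
    return [
        "aws", "s3", "cp", "--recursive", "--no-sign-request",
        s3_url(graph_name), f"{graph_name}-compressed",
    ]


def has_room(size_tib: float, path: str = ".") -> bool:
    """True if the filesystem holding `path` has at least size_tib TiB free."""
    return shutil.disk_usage(path).free >= size_tib * TIB


if __name__ == "__main__":
    # Name and size taken from the first entry of this listing.
    name, size_tib = "2025-05-18", 14
    if has_room(size_tib):
        print(" ".join(aws_command(name)))
    else:
        print(f"Less than {size_tib} TiB free; not starting the download.")
```

The argument list can be passed to `subprocess.run` once the printed command looks right. Checking free space first is a design choice of this sketch, not something the download tools are stated to do.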
Comments
This compressed graph contains only the "history and hosting" layer (origins, snapshots, releases, revisions) and the root directory (or, rarely, content) of every revision or release; most directories and contents are excluded.
Dataset size
Unknown
Export date
2025-05-18
S3 URL
s3://softwareheritage/graph/2025-05-18-history-hosting/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2025-05-18-history-hosting/compressed/ 2025-05-18-history-hosting-compressed
# OR
swh datasets download-graph 2025-05-18-history-hosting
Dataset size
12 TiB
Export date
2024-12-06
Derived datasets
Aggregated Contents
Provenance Index (all releases and revisions)
Provenance Index (releases and head revisions)
Topology
S3 URL
s3://softwareheritage/graph/2024-12-06/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-12-06/compressed/ 2024-12-06-compressed
# OR
swh datasets download-graph 2024-12-06
Comments
This graph changed the minimal perfect hash function (MPH) from GOV/Cmph to PTHash; Rust code that hardcodes GOVMPH must replace it with DynMph or SwhidPthash. Reading this graph from Java is no longer supported.
Dataset size
11 TiB
Export date
2024-08-23
Teaser datasets
Popular 500 python compressed graph (15 GB)
Popular 500 python provenance (all releases and revisions) (46 GB)
S3 URL
s3://softwareheritage/graph/2024-08-23/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-08-23/compressed/ 2024-08-23-compressed
# OR
swh datasets download-graph 2024-08-23
Dataset size
11 TiB
Export date
2024-05-16
Derived dataset
Path counts
S3 URL
s3://softwareheritage/graph/2024-05-16/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-05-16/compressed/ 2024-05-16-compressed
# OR
swh datasets download-graph 2024-05-16
Dataset size
11 TiB
Export date
2023-09-06
Teaser dataset
Popular 1k compressed graph (42 GB)
S3 URL
s3://softwareheritage/graph/2023-09-06/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2023-09-06/compressed/ 2023-09-06-compressed
# OR
swh datasets download-graph 2023-09-06
Comments
Author and committer timestamps were shifted back 1 or 2 hours, based on the Europe/Paris timezone; see https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/issues/4788
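The "1 or 2 hours" matches the UTC offset of Europe/Paris: one hour in winter (CET) and two hours in summer (CEST). The sketch below computes that offset for a Unix timestamp using the EU daylight-saving rule in force since 1996. It only illustrates where the two values come from and is not a fix taken from the linked issue: the function names are hypothetical, and whether adding the offset back restores the original value is an assumption to verify against that issue.

```python
from datetime import datetime, timedelta, timezone


def last_sunday_1am_utc(year: int, month: int) -> datetime:
    """01:00 UTC on the last Sunday of a 31-day month (March or October)."""
    day = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday is 0 and Sunday is 6; step back to the nearest Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def paris_utc_offset(ts: int) -> timedelta:
    """UTC offset of Europe/Paris at Unix time ts (EU rule, 1996 onward)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    dst_start = last_sunday_1am_utc(t.year, 3)   # switch to CEST
    dst_end = last_sunday_1am_utc(t.year, 10)    # switch back to CET
    return timedelta(hours=2 if dst_start <= t < dst_end else 1)


if __name__ == "__main__":
    winter = int(datetime(2023, 1, 15, 12, tzinfo=timezone.utc).timestamp())
    summer = int(datetime(2023, 7, 15, 12, tzinfo=timezone.utc).timestamp())
    print(paris_utc_offset(winter), paris_utc_offset(summer))
```

Timestamps before 1996 fall outside the rule encoded here and would need a full timezone database such as the one behind Python's `zoneinfo` module.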
Dataset size
7.1 TiB
Export date
2022-12-07
Derived dataset
Path counts
S3 URL
s3://softwareheritage/graph/2022-12-07/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2022-12-07/compressed/ 2022-12-07-compressed
# OR
swh datasets download-graph 2022-12-07
Comments
This compressed graph contains only the "history and hosting" layer (origins, snapshots, releases, revisions) and the root directory (or, rarely, content) of every revision or release; most directories and contents are excluded.
Dataset size
1 TiB
Export date
2022-12-07
S3 URL
s3://softwareheritage/graph/2022-12-07-history/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2022-12-07-history/compressed/ 2022-12-07-history-compressed
# OR
swh datasets download-graph 2022-12-07-history
Dataset size
6.5 TiB
Export date
2022-04-25
Derived datasets
Popular Contents
Path counts
S3 URL
s3://softwareheritage/graph/2022-04-25/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2022-04-25/compressed/ 2022-04-25-compressed
# OR
swh datasets download-graph 2022-04-25
Dataset size
Unknown
Export date
2021-03-23
Teaser dataset
Popular 3k python compressed graph (15 GB)
S3 URL
s3://softwareheritage/graph/2021-03-23/compressed/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2021-03-23/compressed/ 2021-03-23-compressed
# OR
swh datasets download-graph 2021-03-23
Dataset size
Unknown
Export date
2020-12-15
Teaser datasets
GitLab 100k compressed graph
GitLab all compressed graph
S3 URL
s3://softwareheritage/graph/2020-12-15/compressed/
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/
Download
The HTTP links point to directories listing all available files.
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2020-12-15/compressed/ 2020-12-15-compressed
# OR
swh datasets download-graph 2020-12-15
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/
Comments
DEPRECATED: known issue with missing snapshot edges
Dataset size
Unknown
Export date
2020-05-20
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2020-05-20/compressed/
Download
The HTTP links point to directories listing all available files.
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2020-05-20/compressed/
deprecated
Comments
A full export of the graph dating from January 2019. The export was done in two phases, one named "2018-09-25" and the other "2019-01-28". Both refer to the same dataset, but the different formats have various inconsistencies between them.
Dataset size
Unknown
Export date
2018-09-25
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/compressed/
Download
The HTTP links point to directories listing all available files.
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/compressed/
deprecated
By accessing the datasets, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
To learn how to use the datasets, read the documentation.
If you use these datasets for research purposes, please cite the following paper:
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
The Software Heritage Graph Dataset: Public software development under one roof.
In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.
preprint, bibtex
Software Heritage — Copyright (C) 2025, The Software Heritage developers.
Licenses: GNU AGPLv3+ (code) / Creative Commons Attribution 4.0 International license (datasets).