Graph export in columnar tables
A set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
10 datasets
Dataset size
27 TiB
Export date
2025-05-18
S3 URL
s3://softwareheritage/graph/2025-05-18/orc/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2025-05-18/orc/ 2025-05-18-orc
# OR
swh datasets download-export 2025-05-18
Dataset size
23 TiB
Export date
2024-12-06
S3 URL
s3://softwareheritage/graph/2024-12-06/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-12-06/orc/ 2024-12-06-orc
# OR
swh datasets download-export 2024-12-06
Dataset size
19 TiB
Export date
2024-08-23
Teaser dataset
Popular 500 python columnar tables (36 GB)
S3 URL
s3://softwareheritage/graph/2024-08-23/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-08-23/orc/ 2024-08-23-orc
# OR
swh datasets download-export 2024-08-23
Dataset size
18 TiB
Export date
2024-05-16
S3 URL
s3://softwareheritage/graph/2024-05-16/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-05-16/orc/ 2024-05-16-orc
# OR
swh datasets download-export 2024-05-16
Dataset size
18 TiB
Export date
2023-09-06
Teaser dataset
Popular 1k columnar tables (280 GB)
S3 URL
s3://softwareheritage/graph/2023-09-06/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2023-09-06/orc/ 2023-09-06-orc
# OR
swh datasets download-export 2023-09-06
Dataset size
13 TiB
Export date
2022-12-07
S3 URL
s3://softwareheritage/graph/2022-12-07/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2022-12-07/orc/ 2022-12-07-orc
# OR
swh datasets download-export 2022-12-07
Dataset size
11 TiB
Export date
2022-04-25
S3 URL
s3://softwareheritage/graph/2022-04-25/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2022-04-25/orc/ 2022-04-25-orc
# OR
swh datasets download-export 2022-04-25
Dataset size
8.4 TiB
Export date
2021-03-23
Teaser dataset
Popular 3k python columnar tables (36 GB)
S3 URL
s3://softwareheritage/graph/2021-03-23/orc/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2021-03-23/orc/ 2021-03-23-orc
# OR
swh datasets download-export 2021-03-23
Comments
edges as graph.edges.{cnt,ori,rel,rev,snp}.csv.zst and graph.edges.dir.{00..21}.csv.zst
nodes as graph.nodes.csv.zst
deduplicated labels as graph.labels.csv.zst
statistics as graph.edges.count.txt, graph.edges.stats.txt, graph.labels.count.txt, graph.nodes.count.txt, and graph.nodes.stats.txt
Dataset size
8.4 TiB
Export date
2020-12-15
S3 URL
s3://softwareheritage/graph/2020-12-15/csv/
Download
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2020-12-15/csv/ 2020-12-15-csv
deprecated
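The comments for the 2020-12-15 CSV export describe its files by naming pattern. As a rough sketch, the pattern can be expanded into an explicit list of file names, for instance to check a finished download. The function name list_csv_export_files is hypothetical, and the sketch assumes only the names given in the comments above.

```shell
#!/bin/sh
# Sketch: print the file names expected in the 2020-12-15 CSV export,
# following the naming pattern given in the entry's comments.
# list_csv_export_files is a hypothetical helper name.
list_csv_export_files() {
    # Edge files with the suffixes cnt, ori, rel, rev and snp.
    for kind in cnt ori rel rev snp; do
        printf 'graph.edges.%s.csv.zst\n' "$kind"
    done
    # The dir edge files are numbered 00 to 21 (22 files).
    i=0
    while [ "$i" -le 21 ]; do
        printf 'graph.edges.dir.%02d.csv.zst\n' "$i"
        i=$((i + 1))
    done
    # Nodes and deduplicated labels.
    printf '%s\n' graph.nodes.csv.zst graph.labels.csv.zst
    # Statistics files.
    printf '%s\n' graph.edges.count.txt graph.edges.stats.txt \
        graph.labels.count.txt graph.nodes.count.txt graph.nodes.stats.txt
}

list_csv_export_files
```

Assuming the files sit directly inside the download directory (an assumption, not confirmed by the listing), each printed name could be tested with [ -e "2020-12-15-csv/$name" ] to report missing files.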
Comments
A full export of the graph dating from January 2019. The export was done in two phases, named "2018-09-25" and "2019-01-28". Both refer to the same dataset, but the different formats are inconsistent with one another in various ways.
Dataset size
1.2 TiB
Export date
2018-09-25
Teaser dataset
Popular 4k columnar tables (27 GB)
Popular 3k python columnar tables (5.3 GB)
S3 URL
s3://softwareheritage/graph/2018-09-25/parquet/
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/parquet/
Download
The HTTP links point to directories listing all available files.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2018-09-25/parquet/ 2018-09-25-parquet
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/parquet/
deprecated
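Every S3 download command in the listing above follows the same pattern, varying only in the export date and the format directory (orc, csv, or parquet). As a minimal sketch, the shell functions below rebuild that pattern for any listed export. The function names are hypothetical. The --dryrun flag is an addition here, so that the printed command previews the transfer rather than starting a multi-terabyte download.

```shell
#!/bin/sh
# Sketch: rebuild the S3 URL, local directory and aws command for one export.
# swh_export_url, swh_export_dir and swh_export_cmd are hypothetical names.

# Print the S3 URL for an export date and format (default: orc).
swh_export_url() {
    printf 's3://softwareheritage/graph/%s/%s/\n' "$1" "${2:-orc}"
}

# Print the local target directory name used in the listing (DATE-FORMAT).
swh_export_dir() {
    printf '%s-%s\n' "$1" "${2:-orc}"
}

# Print, without running it, the aws command for the given export.
# --dryrun makes aws show the planned transfers instead of copying.
swh_export_cmd() {
    printf 'aws s3 cp --recursive --no-sign-request --dryrun %s %s\n' \
        "$(swh_export_url "$1" "$2")" "$(swh_export_dir "$1" "$2")"
}

swh_export_cmd 2025-05-18 orc
```

Dropping --dryrun from the printed command gives the same form as the download commands in the listing. Only dates and formats that appear in the listing are known to exist.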
By accessing the datasets, you agree to the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
To learn how to use the datasets, read the documentation.
If you use these datasets for research purposes, please cite the following paper:
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
The Software Heritage Graph Dataset: Public software development under one roof.
In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.
preprint, bibtex
Software Heritage — Copyright (C) 2025, The Software Heritage developers.
Licenses: GNU AGPLv3+ (code) / Creative Commons Attribution 4.0 International license (datasets).