Datasets based on the 2020-12-15 export
Datasets generated from the graph export dated 2020-12-15.
4 datasets
Dataset size
Unknown
Export date
2020-12-15
Teaser datasets
GitLab 100k compressed graph
GitLab all compressed graph
S3 URL
s3://softwareheritage/graph/2020-12-15/compressed/
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/
Download
The HTTP links point to directories listing all available files.
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
# From Amazon S3, with awscli:
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2020-12-15/compressed/ 2020-12-15-compressed
# OR, with swh.datasets:
swh datasets download-graph 2020-12-15
# From the SWH Annex, over HTTP:
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/
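Because the size of this dataset is listed as unknown, it may help to check what a download would involve before starting it. The sketch below assumes awscli is installed and wraps two standard awscli invocations in helper functions. The function names swh_s3_size and swh_s3_preview are hypothetical and are not part of any Software Heritage tooling.

```shell
# S3 location of the compressed graph, as listed above.
S3_URL="s3://softwareheritage/graph/2020-12-15/compressed/"

# Hypothetical helper: list every object under a prefix and print a
# summary (object count and total size) without downloading anything.
swh_s3_size() {
  aws s3 ls --recursive --summarize --human-readable --no-sign-request "$1"
}

# Hypothetical helper: show what a recursive copy would transfer.
# --dryrun prints the planned operations without performing them.
swh_s3_preview() {
  aws s3 cp --recursive --dryrun --no-sign-request "$1" "$2"
}
```

Calling swh_s3_size "$S3_URL" should print the object listing followed by the totals, and swh_s3_preview "$S3_URL" 2020-12-15-compressed should list the planned copies. Both need network access, and neither writes any dataset files locally.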
Comments
A teaser dataset containing the 100k most popular GitLab.com repositories
Dataset size
Unknown
Export date
2020-12-15
Teaser of
Compressed graph [2020-12-15]
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-100k/compressed/
Download
The HTTP links point to directories listing all available files.
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-100k/compressed/
Comments
A teaser dataset containing the entirety of GitLab.com
Dataset size
Unknown
Export date
2020-12-15
Teaser of
Compressed graph [2020-12-15]
SWH Annex URL
https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-all/compressed/
Download
The HTTP links point to directories listing all available files.
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-all/compressed/
Comments
edges as graph.edges.{cnt,ori,rel,rev,snp}.csv.zst and graph.edges.dir.{00..21}.csv.zst
nodes as graph.nodes.csv.zst
deduplicated labels as graph.labels.csv.zst
statistics as graph.edges.count.txt, graph.edges.stats.txt, graph.labels.count.txt, graph.nodes.count.txt, and graph.nodes.stats.txt
Dataset size
8.4 TiB
Export date
2020-12-15
S3 URL
s3://softwareheritage/graph/2020-12-15/csv/
Download
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2020-12-15/csv/ 2020-12-15-csv
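The edge file names in the comments above are written in shell brace notation. As a minimal sketch, the helper below expands those patterns into explicit file names using only POSIX shell features, so it does not depend on bash brace expansion. The function name swh_csv_edge_files is hypothetical.

```shell
# Hypothetical helper: print the edge file names described above,
# one per line.
swh_csv_edge_files() {
  # graph.edges.{cnt,ori,rel,rev,snp}.csv.zst
  for kind in cnt ori rel rev snp; do
    printf 'graph.edges.%s.csv.zst\n' "$kind"
  done
  # graph.edges.dir.{00..21}.csv.zst (zero-padded, 00 through 21)
  i=0
  while [ "$i" -le 21 ]; do
    printf 'graph.edges.dir.%02d.csv.zst\n' "$i"
    i=$((i + 1))
  done
}
```

Given the 8.4 TiB total, such a list could be used to fetch files one at a time instead of the whole prefix, for example by appending each name to the S3 URL in a non-recursive aws s3 cp --no-sign-request call. This assumes the files sit directly under the csv/ prefix, which the listing does not state explicitly.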
deprecated
By accessing the datasets, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
To learn how to use the datasets, read the documentation.
If you use these datasets for research purposes, please cite the following paper:
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
The Software Heritage Graph Dataset: Public software development under one roof.
In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.
Software Heritage — Copyright (C) 2025, The Software Heritage developers.
Licenses: GNU AGPLv3+ (code) / Creative Commons Attribution 4.0 International license (datasets).