Software Heritage Datasets

Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, recording when and where each archived source code artifact was observed in the wild. Author and committer information is anonymized.
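To make "fully deduplicated Merkle DAG" concrete, here is a minimal toy sketch in Python. It is not the archive's actual data model or identifier scheme: the `Store` class, the `directory_id` payload format, and the sample file contents are all hypothetical, chosen only for illustration. The content hash follows the Git blob convention (SHA-1 over a `blob <size>` header, a NUL byte, and the bytes), used here as an assumption. The point is that every node is keyed by a hash of its own content and of its children's identifiers, so identical files and directories are stored once no matter how many projects contain them.

```python
import hashlib
from typing import Mapping


def content_id(data: bytes) -> str:
    # Git-style blob hash: SHA-1 over "blob <size>", a NUL byte, then the raw bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def directory_id(entries: Mapping[str, str]) -> str:
    # Simplified stand-in for a directory hash (hypothetical format):
    # hash the sorted (name, child id) pairs so entry order does not matter.
    payload = "\n".join(f"{name} {child}" for name, child in sorted(entries.items()))
    return hashlib.sha1(b"dir-sketch\x00" + payload.encode("utf-8")).hexdigest()


class Store:
    """Toy content-addressed store: one entry per distinct node."""

    def __init__(self) -> None:
        self.nodes = {}

    def add_content(self, data: bytes) -> str:
        node_id = content_id(data)
        self.nodes.setdefault(node_id, data)  # no-op if already stored
        return node_id

    def add_directory(self, entries: Mapping[str, str]) -> str:
        node_id = directory_id(entries)
        self.nodes.setdefault(node_id, dict(entries))
        return node_id


# Two hypothetical projects that share the same LICENSE file.
store = Store()
license_id = store.add_content(b"MIT License\n")
readme_a = store.add_content(b"project A\n")
readme_b = store.add_content(b"project B\n")
dir_a = store.add_directory({"LICENSE": license_id, "README": readme_a})
dir_b = store.add_directory({"LICENSE": license_id, "README": readme_b})

# Re-adding an identical directory yields the same identifier and no new node.
dir_a_again = store.add_directory({"README": readme_a, "LICENSE": license_id})

print(len(store.nodes))
```

Both directories point at the same `LICENSE` node, so the file is stored once. The same principle, applied upward through commits and repository states, is what keeps a graph of this kind compact.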

Full graph exports

A fully deduplicated Merkle DAG representation of the Software Heritage archive.

Derived datasets

TODO

Teaser datasets

If the full datasets are too big for your needs, we also provide teaser datasets, which have a smaller footprint and can help you get started.

Datasets by export date

Access datasets by date.

By accessing the datasets, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.

To learn how to use the datasets, read the documentation.

If you use any of the datasets indexed on this website for research purposes, please acknowledge Software Heritage as recommended in the publications page, that is:

- Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”; and
- cite at least one of the following papers:
  - Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017. (BibTeX)
  - Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the universal archive of source code. Commun. ACM 61(10): 29-31 (2018). (BibTeX)

Specific datasets might recommend additional citations to credit their creators.