Software Heritage Datasets
Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Datasets provide diverse representations and subsets of this archive, designed to support a wide range of research and analytical needs. These include the fully deduplicated Merkle DAG representation, derived datasets tailored for specific use cases, and lightweight teaser datasets for preliminary exploration. The data is collected from major development forges, FOSS distributions, and language-specific package managers, and includes crawling metadata along with anonymized author and committer information.
Graph exports
A fully deduplicated Merkle DAG representation of the Software Heritage archive.
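To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal sketch in Python. It is a toy model, not the archive's actual data model or identifier scheme: the `MerkleStore` class, its methods, and the use of SHA-256 are all assumptions made for illustration. Each node is identified by a hash of its own content plus the identifiers of its children, so identical files and identical directories collapse into a single node no matter how many projects contain them.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # node id -> (kind, payload); identical nodes share one entry
        self.nodes = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The id depends only on the node's kind and content.
        node_id = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, (kind, payload))
        return node_id

    def add_blob(self, data: bytes) -> str:
        """Store file content and return its id."""
        return self._put("blob", data)

    def add_dir(self, entries: dict) -> str:
        """Store a directory given a mapping of entry name -> child id."""
        # Sort entries so the same directory always yields the same id.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode())


store = MerkleStore()

# Two "projects" that ship the same license file and the same source file.
license_a = store.add_blob(b"MIT License\n")
license_b = store.add_blob(b"MIT License\n")
main_v0 = store.add_blob(b"int main(void) { return 0; }\n")
dir_a = store.add_dir({"LICENSE": license_a, "main.c": main_v0})
dir_b = store.add_dir({"LICENSE": license_b, "main.c": main_v0})

# A third project that changes one file: only the changed blob and the
# directory that contains it become new nodes.
main_v1 = store.add_blob(b"int main(void) { return 1; }\n")
dir_c = store.add_dir({"LICENSE": license_a, "main.c": main_v1})

print(license_a == license_b)  # identical content, same node
print(dir_a == dir_b)          # identical directories, same node
print(dir_a == dir_c)          # differing content, different node
print(len(store.nodes))        # number of distinct nodes stored
```

Because a directory's identifier is derived from its children's identifiers, any change to a file propagates upward to every node that contains it, while everything unchanged is shared.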
Derived datasets
Datasets derived from a graph export and/or compressed graph, with specific data, goals and intended usages.
Datasets by export date
Access datasets by date.
By accessing the datasets, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
To learn how to use the datasets, read the documentation.
If you use any of the datasets indexed on this website for research purposes, please acknowledge Software Heritage as recommended on the publications page, which means doing the following two things:
- Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”
- Cite at least one of the following papers:
  - Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017. (BibTeX)
  - Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the universal archive of source code. Communications of the ACM 61(10): 29-31 (2018). (BibTeX)
Specific datasets may recommend additional citations to credit their creators.