Provenance Index (all releases and revisions)
Precomputed tables listing every revision that contains a given content and every origin that contains a given revision, indexed for reasonably efficient backward queries. The current implementation is primarily a Parquet database, along with some external indexes for more efficient access. The swh-provenance Rust crate provides access to these indexes and a gRPC server to query the data remotely.
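To make the idea of a backward query concrete, here is a minimal in-memory sketch. It is not how swh-provenance or the Parquet tables are organised. All identifiers (`rev1`, `cntA`, the example.org URLs) and function names are made up for illustration.

```python
from collections import defaultdict

# Toy forward relations with made-up identifiers (not real SWHIDs).
revision_contents = {
    "rev1": {"cntA", "cntB"},
    "rev2": {"cntB"},
}
origin_revisions = {
    "https://example.org/repo1": {"rev1"},
    "https://example.org/repo2": {"rev1", "rev2"},
}


def invert(forward):
    """Turn a mapping parent -> children into child -> sorted list of parents."""
    backward = defaultdict(set)
    for parent, children in forward.items():
        for child in children:
            backward[child].add(parent)
    return {child: sorted(parents) for child, parents in backward.items()}


# Precompute the backward tables once, so lookups need no graph traversal.
content_revisions = invert(revision_contents)
revision_origins = invert(origin_revisions)


def origins_of_content(content):
    """All origins reachable backward from a content through its revisions."""
    found = set()
    for revision in content_revisions.get(content, []):
        found.update(revision_origins.get(revision, []))
    return sorted(found)
```

Calling `origins_of_content("cntA")` goes from the content to `rev1`, then to both example origins. The real dataset answers the same kind of question at archive scale.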
- Dataset size: Unknown
- Export date:
- Derived from: Compressed graph [2024-12-06]
- S3 URL: s3://softwareheritage/derived_datasets/2024-12-06/provenance/all/
- Deprecated: False
Referencing Software Heritage
If you use any of the datasets indexed on this website for research purposes, please acknowledge Software Heritage as recommended on the publications page, that is:
- Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”; and
- Cite at least one of the following papers:
  - Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017.
  - Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the universal archive of source code. Commun. ACM 61(10): 29-31 (2018).
Specific datasets might recommend additional citations, to credit their creators.
Download the dataset
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/derived_datasets/2024-12-06/provenance/all/ 2024-12-06-provenance_all
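The download can also be scripted. The sketch below builds the same `aws s3 cp` invocation from Python and, by default, adds the AWS CLI's `--dryrun` flag so the transfer is only listed, which is useful because the dataset size is not given. The helper names and the `--run` switch are hypothetical. The rule for deriving the local directory name is an assumption inferred from the single example above, not a documented convention.

```python
import shlex
import subprocess
import sys

S3_URL = "s3://softwareheritage/derived_datasets/2024-12-06/provenance/all/"


def default_destination(s3_url):
    """Derive a local directory name from the last three path components.

    Assumed layout: .../<date>/<dataset>/<subset>/ -> <date>-<dataset>_<subset>
    """
    date, dataset, subset = s3_url.rstrip("/").split("/")[-3:]
    return f"{date}-{dataset}_{subset}"


def build_download_command(s3_url, destination=None, dry_run=False):
    """Return the argument list for an anonymous, recursive `aws s3 cp`."""
    command = ["aws", "s3", "cp", "--recursive", "--no-sign-request"]
    if dry_run:
        # --dryrun makes the AWS CLI print the operations without performing them.
        command.append("--dryrun")
    command += [s3_url, destination or default_destination(s3_url)]
    return command


if __name__ == "__main__":
    # Hypothetical switch: pass --run to actually download instead of a dry run.
    really_run = "--run" in sys.argv
    command = build_download_command(S3_URL, dry_run=not really_run)
    print(shlex.join(command))
    subprocess.run(command, check=True)  # requires awscli to be installed
```

Run without arguments, the script prints the command and asks the AWS CLI for a dry run. Run with `--run`, it performs the same download as the command above.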