If you use this dataset for research purposes, please acknowledge Software Heritage as recommended on the publications page, by doing the following two things:
Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”
Cite the following papers:
Roberto Di Cosmo and Stefano Zacchiroli.
Software Heritage: why and how to preserve software source code.
In Shoichiro Hara, Shigeo Sugimoto, and Makoto Goto, editors, Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017, Kyoto, Japan, September 25-29, 2017.
URL: https://hdl.handle.net/11353/10.931064.
Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli.
The Software Heritage graph dataset: public software development under one roof.
In MSR 2019: The 16th International Conference on Mining Software Repositories, 138–142. IEEE, 2019.
doi:10.1109/MSR.2019.00030.
If you use this dataset for research purposes, please acknowledge Software Heritage as recommended on the publications page, by doing the following two things:
Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”
Cite the following papers:
Stefano Zacchiroli.
A large-scale dataset of (open source) license text variants.
In 19th IEEE/ACM International Conference on Mining Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022, 757–761. ACM, 2022.
URL: https://doi.org/10.1145/3524842.3528491, doi:10.1145/3524842.3528491.
The popular-3k-python teaser contains a subset of 2197 popular repositories tagged as being written in the Python language, from GitHub, GitLab.com, PyPI and Debian. The software origins were selected according to the following criteria:
the 580 most popular GitHub projects written in Python (by number of stars),
the 135 GitLab.com projects written in Python that have 2 stars or more,
the 827 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
the 655 most popular Debian packages with the debtag implemented-in::python (by "votes" according to the Debian Popularity Contest database).