Popular 1k compressed graph
A compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed using the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.
- Comments
- The popular-1k teaser contains a subset of 1120 popular repositories tagged as being written in one of the 10 most popular languages (JavaScript, Python, Java, TypeScript, C#, C++, PHP, Shell, C, Ruby), from GitHub, GitLab.com, Packagist, PyPI and Debian. The software origins for each language were selected using the following criteria:
- the 50 most popular GitLab.com projects written in that language that have 2 stars or more,
- for Python, the 50 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
- for PHP, the 50 most popular Packagist projects (by usage statistics, according to Packagist's API),
- the 50 most popular Debian packages with the relevant implemented-in:: debtag (by "installs" according to the Debian Popularity Contest database),
- the most popular GitHub projects written in Python (by number of stars), until the total number of origins for that language reaches 200,
- removing origins not archived by Software Heritage by 2023-09-06.
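The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script Software Heritage actually used: the function name `select_origins`, its parameters, and the toy origin names are all hypothetical. It assumes each input list is already ranked by popularity, that the GitLab.com list has already been filtered to projects with 2 stars or more, and that the steps apply in the order listed.

```python
from typing import Dict, List, Optional, Set

PER_SOURCE_LIMIT = 50       # "the 50 most popular ..." from each listed source
PER_LANGUAGE_TARGET = 200   # GitHub projects top the selection up to this total


def select_origins(
    ranked_by_source: Dict[str, List[str]],
    archived: Set[str],
    github_ranked: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: pick the origins for one language.

    ranked_by_source maps a source name (e.g. "gitlab", "pypi", "debian")
    to origins sorted from most to least popular. github_ranked is the
    star-ordered GitHub list used for the top-up step, which the listing
    above describes for Python projects.
    """
    selected: List[str] = []

    def add(origin: str) -> None:
        # Keep the first occurrence only, preserving selection order.
        if origin not in selected:
            selected.append(origin)

    # Take at most 50 origins from each non-GitHub source.
    for ranked in ranked_by_source.values():
        for origin in ranked[:PER_SOURCE_LIMIT]:
            add(origin)

    # Top up with GitHub projects until the per-language target is reached.
    for origin in github_ranked or []:
        if len(selected) >= PER_LANGUAGE_TARGET:
            break
        add(origin)

    # Last step: drop origins that were not archived by the cut-off date.
    return [origin for origin in selected if origin in archived]


if __name__ == "__main__":
    # Toy data only; these origin names are made up for the example.
    pypi = [f"pypi-project-{i}" for i in range(80)]
    github = [f"github-project-{i}" for i in range(500)]
    archived = set(pypi) | set(github)
    print(len(select_origins({"pypi": pypi}, archived, github)))
```

Because the archival filter runs last in this sketch, the final count for a language can end up below the target.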
- Dataset size
- 42 GB
- Export date
- Teaser of
- Compressed graph [2023-09-06]
- S3 URL
- s3://softwareheritage/graph/2023-09-06-popular-1k/compressed/
- Deprecated
- False
Referencing Software Heritage
If you use any of the datasets indexed on this website for research purposes, please acknowledge Software Heritage as recommended in the publications page, that is:
- Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”; and
- Cite at least one of the following papers:
- Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017. (BibTeX)
- Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the universal archive of source code. Commun. ACM 61(10): 29-31 (2018). (BibTeX)
Specific datasets might recommend additional citations, to credit their creators.
Specific citation instructions
If you use this dataset for research purposes, please cite the following paper: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019 (Preprint), (BibTeX).
Download the dataset
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
    aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2023-09-06-popular-1k/compressed/ 2023-09-06-popular-1k-compressed
    # OR
    swh datasets download-graph 2023-09-06-popular-1k
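If the download needs to be scripted, the awscli invocation can be assembled and run from Python. The following is a minimal sketch, assuming awscli is installed and on the PATH; the helper names `build_download_command` and `download` are hypothetical, and the only flags used are the ones that appear in the command above. The S3 path layout is assumed from this dataset's URL and may not hold for other exports.

```python
import shlex
import subprocess
from typing import List

# Bucket prefix taken from the S3 URL listed for this dataset.
BUCKET_PREFIX = "s3://softwareheritage/graph"


def build_download_command(dataset: str, dest: str) -> List[str]:
    """Hypothetical helper: build the `aws s3 cp` argument list."""
    source = f"{BUCKET_PREFIX}/{dataset}/compressed/"
    return ["aws", "s3", "cp", "--recursive", "--no-sign-request", source, dest]


def download(dataset: str, dest: str, dry_run: bool = True) -> List[str]:
    """Print the command; run it only when dry_run is False."""
    cmd = build_download_command(dataset, dest)
    print(shlex.join(cmd))
    if not dry_run:
        # Raises CalledProcessError if awscli exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run by default: the full dataset is listed as 42 GB.
    download("2023-09-06-popular-1k", "2023-09-06-popular-1k-compressed")
```

Passing the arguments as a list instead of a single shell string avoids quoting problems if the destination path contains spaces.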