Popular 1k compressed graph
A compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed in the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.
- Comments
- The popular-1k teaser contains a subset of 1120 popular repositories tagged as being written in one of the 10 most popular languages (JavaScript, Python, Java, TypeScript, C#, C++, PHP, Shell, C, Ruby), from GitHub, GitLab.com, Packagist, PyPI and Debian. The software origins for each language were selected according to the following criteria:
- the 50 most popular GitLab.com projects written in that language that have 2 stars or more,
- for Python, the 50 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
- for PHP, the 50 most popular Packagist projects (by usage statistics, according to Packagist's API),
- the 50 most popular Debian packages with the relevant implemented-in:: debtag (by "installs" according to the Debian Popularity Contest database),
- the most popular GitHub projects written in Python (by number of stars), until the total number of origins for that language reaches 200,
- excluding origins that had not been archived by Software Heritage by 2023-09-06.
- Dataset size
- 42 GB
- Export date
- Teaser of
- Compressed graph [2023-09-06]
- S3 URL
- s3://softwareheritage/graph/2023-09-06-popular-1k/compressed/
- Deprecated
- False
Download the dataset
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2023-09-06-popular-1k/compressed/ 2023-09-06-popular-1k-compressed
# OR
swh datasets download-graph 2023-09-06-popular-1k
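Since the dataset is 42 GB, it can help to inspect the bucket before starting the transfer. The following is a minimal sketch, assuming awscli is installed; the variable names (`DATASET`, `S3_URL`, `DEST`) and the function names are illustrative and not part of any Software Heritage tooling.

```shell
#!/bin/sh
# Sketch only: the variable and function names below are illustrative assumptions.
DATASET="2023-09-06-popular-1k"
S3_URL="s3://softwareheritage/graph/${DATASET}/compressed/"
DEST="${DATASET}-compressed"

# List every object under the prefix with human-readable sizes and a final total.
list_dataset() {
    aws s3 ls --no-sign-request --recursive --human-readable --summarize "$S3_URL"
}

# Show which files would be copied, without transferring any data.
preview_download() {
    aws s3 cp --dryrun --recursive --no-sign-request "$S3_URL" "$DEST"
}

# Copy only the files that are missing or differ locally, which is useful
# for picking up an interrupted download.
resume_download() {
    aws s3 sync --no-sign-request "$S3_URL" "$DEST"
}
```

After sourcing the snippet, `list_dataset` reports the total size, `preview_download` prints the planned copy operations, and `resume_download` can be re-run until the local directory is complete.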