Popular 1k columnar tables
A set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
- Comments
The popular-1k teaser contains a subset of 1120 popular repositories tagged as being written in one of the 10 most popular languages (JavaScript, Python, Java, TypeScript, C#, C++, PHP, Shell, C, Ruby), from GitHub, GitLab.com, Packagist, PyPI and Debian. The software origins for each language were selected using the following criteria:
- the 50 most popular GitLab.com projects written in that language that have 2 stars or more,
- for Python, the 50 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
- for PHP, the 50 most popular Packagist projects (by usage statistics, according to Packagist's API),
- the 50 most popular Debian packages with the relevant implemented-in:: debtag (by "installs" according to the Debian Popularity Contest database).
- most popular GitHub projects written in Python (by number of stars), until the total number of origins for that language reaches 200
- removing origins not archived by Software Heritage by 2023-09-06
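The merge step implied by these criteria can be sketched in shell. This is only an illustration, not the script used to build the teaser: the function name `select_origins`, the one-URL-per-line ranking files, and the deduplication of origins that appear in several rankings are all assumptions. The final filtering of origins not archived by Software Heritage is not covered.

```shell
#!/bin/sh
# select_origins CAP TOPUP_RANKED RANKED_FILE...
# Hypothetical inputs: each file holds one origin URL per line, most popular first.
select_origins() {
    cap=$1
    topup=$2
    shift 2
    {
        # Top 50 of every per-source ranking (GitLab.com, PyPI, Debian, ...)
        for ranked in "$@"; do
            head -n 50 "$ranked"
        done
        # Then the GitHub ranking, used only to fill up to the cap
        cat "$topup"
    } | awk '!seen[$0]++' | head -n "$cap"   # assumption: drop duplicates, keep first occurrence
}
```

For Python, a call could look like `select_origins 200 github-python.txt gitlab-python.txt pypi.txt debian-python.txt`, where all four file names are placeholders.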
- Dataset size
- 280 GB
- Export date
- Teaser of
- Graph export in columnar tables [2023-09-06]
- S3 URL
- s3://softwareheritage/graph/2023-09-06-popular-1k/orc/
- Deprecated
- False
Download the dataset
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2023-09-06-popular-1k/orc/ 2023-09-06-popular-1k-orc
# OR
swh datasets download-export 2023-09-06-popular-1k
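Since the dataset is listed at 280 GB, it may be worth checking free disk space before starting the download. The following is a minimal sketch: the helper `has_room` and the `TARGET_DIR` variable are hypothetical names, and the size is treated as 280 GiB to leave a small safety margin.

```shell
#!/bin/sh
# Sketch: check that the target filesystem has room before downloading.
NEEDED_GB=280          # dataset size listed above (treated as GiB here, an assumption)
TARGET_DIR=${1:-.}     # hypothetical: directory where the export will be stored

# has_room AVAILABLE_KB NEEDED_GB -> exit status 0 if enough space is available
has_room() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available"
avail_kb=$(df -Pk "$TARGET_DIR" | awk 'NR == 2 { print $4 }')

if has_room "$avail_kb" "$NEEDED_GB"; then
    echo "enough space: safe to run the download command"
else
    echo "less than ${NEEDED_GB} GB free in $TARGET_DIR" >&2
fi
```

The script only reports the result, so the download commands above still have to be run separately.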