Popular 4k columnar tables
A set of relational tables stored in a columnar format such as Apache ORC or Apache Parquet, which is particularly well suited to scale-out analyses on data lakes and in big data processing ecosystems such as Hadoop.
- Comments
This teaser dataset contains a subset of 4000 popular repositories from GitHub, GitLab.com, PyPI and Debian. The software origins were selected according to the following criteria:
- The 1000 most popular GitHub projects (by number of stars)
- The 1000 most popular GitLab.com projects (by number of stars)
- The 1000 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database)
- The 1000 most popular Debian packages (by "votes" according to the Debian Popularity Contest database)
- Dataset size
- 27 GB
- Export date
- Teaser of
- Graph export in columnar tables [2018-09-25]
- S3 URL
- s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/
- SWH Annex URL
- https://annex.softwareheritage.org/public/dataset/graph/2019-01-28-popular-4k/parquet/
- Deprecated
- False
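The selection described in the comments above amounts to a top-N cut per source, ranked by that source's popularity metric. The following is only a minimal sketch of that idea, not the procedure actually used to build the dataset: the function name `select_top`, the `(url, score)` pair layout, and the sample values are all hypothetical.

```python
import heapq


def select_top(origins, n=1000):
    """Return the n origins with the highest popularity score, best first.

    `origins` is an iterable of (url, score) pairs. The score stands in for
    stars, usage statistics, or popularity-contest votes, depending on the
    source (assumed layout, not taken from the dataset).
    """
    return heapq.nlargest(n, origins, key=lambda pair: pair[1])


# Hypothetical sample data, not taken from the dataset.
sample = [
    ("https://example.org/a", 120),
    ("https://example.org/b", 950),
    ("https://example.org/c", 40),
]
top_two = select_top(sample, n=2)
```

Running the same cut once per source (GitHub, GitLab.com, PyPI, Debian) with n=1000 would yield the 4000 origins mentioned above, assuming no origin appears in more than one source.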
Download the dataset
The HTTP links point to directories listing all available files. For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/ 2019-01-28-popular-4k-parquet
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2019-01-28-popular-4k/parquet/
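After either command finishes, it can be useful to check that the download is roughly the advertised 27 GB. The sketch below sums file sizes per top-level entry of the download directory using only the Python standard library. It assumes the files are grouped into one subdirectory per table, which is not confirmed by this page. The helper names `table_sizes` and `report` are hypothetical, and the default path is simply the target directory of the `aws s3 cp` command above.

```python
from collections import defaultdict
from pathlib import Path


def table_sizes(root):
    """Sum file sizes in bytes under each top-level entry of `root`."""
    root = Path(root)
    sizes = defaultdict(int)
    for path in root.rglob("*"):
        if path.is_file():
            # The first path component below root is treated as the table
            # name (assumed layout: one subdirectory per table).
            table = path.relative_to(root).parts[0]
            sizes[table] += path.stat().st_size
    return dict(sizes)


def report(root="2019-01-28-popular-4k-parquet"):
    """Print the size of each top-level entry and the overall total in GB."""
    sizes = table_sizes(root)
    for table, size in sorted(sizes.items()):
        print(f"{table:<30} {size / 1e9:8.2f} GB")
    print(f"{'total':<30} {sum(sizes.values()) / 1e9:8.2f} GB")
```

Calling `report()` from the directory where the `aws` command was run prints one line per entry plus a total. For a `wget` download, pass the path of the mirrored `parquet/` directory instead.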