Popular 500 Python columnar tables
A set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
- Comments
- This teaser contains a subset of the 443 repositories archived by Software Heritage as of 2024-08-23, among the 700 GitHub repositories tagged as being written in Python with the most stars.
- Dataset size
- 36 GB
- Export date
- Teaser of
- Graph export in columnar tables [2024-08-23]
- S3 URL
- s3://softwareheritage/graph/2024-08-23-popular-500-python/orc/
- Deprecated
- False
Referencing Software Heritage
If you use any of the datasets indexed on this website for research purposes, please acknowledge Software Heritage as recommended on the publications page, that is:
- Add a footnote on the title page of your paper, formatted as: “This work was made possible by Software Heritage, the universal source code archive: https://www.softwareheritage.org”; and
- cite at least one of the following papers:
- Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017. (BibTeX)
- Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the universal archive of source code. Commun. ACM 61(10): 29-31 (2018). (BibTeX)
Specific datasets might recommend additional citations, to credit their creators.
Download the dataset
For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2024-08-23-popular-500-python/orc/ 2024-08-23-popular-500-python-orc
# OR
swh datasets download-export 2024-08-23-popular-500-python
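If the download needs to be scripted, the awscli invocation above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_download_command` and `download` are hypothetical and not part of awscli or swh.datasets, and it assumes the `aws` executable is already installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# S3 URL and destination directory taken from the command shown on this page.
S3_URL = "s3://softwareheritage/graph/2024-08-23-popular-500-python/orc/"
DEFAULT_DEST = "2024-08-23-popular-500-python-orc"


def build_download_command(s3_url: str, dest: str) -> List[str]:
    """Return the argv list for an anonymous, recursive `aws s3 cp`.

    Hypothetical helper: it only assembles the same flags as the
    command line given on this page.
    """
    return ["aws", "s3", "cp", "--recursive", "--no-sign-request", s3_url, dest]


def download(s3_url: str = S3_URL, dest: str = DEFAULT_DEST) -> Path:
    """Run the download and return the local directory (hypothetical helper)."""
    if shutil.which("aws") is None:
        raise RuntimeError("awscli not found on PATH; install it first")
    # check=True raises CalledProcessError if the aws command fails.
    subprocess.run(build_download_command(s3_url, dest), check=True)
    return Path(dest)
```

Calling `download()` with no arguments would fetch the full teaser (36 GB according to the dataset size listed above) into `2024-08-23-popular-500-python-orc`, so make sure enough disk space is available first.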