Graph export in columnar tables
A set of relational tables stored in a columnar format (Apache Parquet for this export, as the download paths indicate), which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
- Comments
- A full export of the graph dating from January 2019. The export was done in two phases, labeled "2018-09-25" and "2019-01-28". Both refer to the same dataset, but the different export formats have various inconsistencies between them.
- Dataset size
- 1.2 TiB
- Export date
- Teaser dataset
- Popular 4k columnar tables (27 GB)
- Popular 3k python columnar tables (5.3 GB)
- S3 URL
- s3://softwareheritage/graph/2018-09-25/parquet/
- SWH Annex URL
- https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/parquet/
- Deprecated
- True
Download the dataset
The HTTP links point to directories listing all available files. For Amazon S3 links, you'll need to install either awscli or swh.datasets.
aws s3 cp --recursive --no-sign-request s3://softwareheritage/graph/2018-09-25/parquet/ 2018-09-25-parquet
wget --recursive --no-parent --reject "index.html*" https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/parquet/
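The two commands above can also be scripted. The following is a minimal sketch, not an official tool: the function name `download_command` and the constants `S3_URL` and `HTTP_URL` are hypothetical names introduced here for illustration. It only builds the argument lists for the same `aws` and `wget` invocations shown above, and assumes both tools are already installed if you choose to run them.

```python
import shlex
from typing import List

# URLs copied from the dataset description above.
S3_URL = "s3://softwareheritage/graph/2018-09-25/parquet/"
HTTP_URL = "https://annex.softwareheritage.org/public/dataset/graph/2018-09-25/parquet/"


def download_command(source: str, dest: str = "2018-09-25-parquet") -> List[str]:
    """Return the argv list for downloading the dataset from `source`.

    Hypothetical helper: it mirrors the two commands in the text and
    does not execute anything itself.
    """
    if source.startswith("s3://"):
        # Anonymous recursive copy from the public S3 bucket into `dest`.
        return ["aws", "s3", "cp", "--recursive", "--no-sign-request", source, dest]
    if source.startswith("https://"):
        # Recursive HTTP mirror that skips the generated directory index pages.
        # wget chooses its own output directory, so `dest` is unused here.
        return ["wget", "--recursive", "--no-parent", "--reject", "index.html*", source]
    raise ValueError(f"unsupported source URL: {source!r}")


if __name__ == "__main__":
    for url in (S3_URL, HTTP_URL):
        # Print a shell-quoted form of each command for inspection.
        print(shlex.join(download_command(url)))
        # To actually download, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems with the `index.html*` pattern. Note that the full dataset is 1.2 TiB, so check available disk space before running either command.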