PyPI-DLSC

Supplementary materials for "Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement"

This is the replication package for our TOSEM submission, "Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement". It can be used to replicate all three research questions in the paper using our preprocessed and manually labeled data.

Replication Package Setup

We use Anaconda for Python development. Please set up a new Conda environment by running the following commands in order:

conda create -n CompareDLSC python=3.8
conda activate CompareDLSC
pip install -r requirements.txt
pip install ipykernel

Folder Structure

coding guide: contains code books developed in this paper.
data:
- community types.xlsx: data and results for cluster shape labelling in RQ2.
- downloads.csv: monthly downloads data for all PyPI packages between Nov 4, 2021 and Dec 4, 2021.
- labeled_detached_packages.xlsx: data and results for disengagement reason labelling in RQ3.
- most_downloaded_packages.xlsx: data and results for package domain labelling in RQ1.
- pytorch_most_downloaded.csv: information for the 259 most downloaded packages in PyTorch SC.
- tensorflow_most_downloaded.csv: information for the 219 most downloaded packages in TensorFlow SC.
- TensorFlow folder:
  - package_version_info.json: contains the PyPI versions and SC versions for packages in TensorFlow SC.
  - TensorFlow_communities: contains the packages in each cluster. Clusters are sorted by the number of packages they contain and then alphabetically. Packages in each cluster are sorted by their Katz centrality within the cluster.
  - TensorFlow_community_info.csv: contains the cluster information for each package.
  - TensorFlow_edges.csv: each row contains a dependent and its dependency.
  - TensorFlow_nodes.csv: the information of all packages in TensorFlow SC.
  - versioned_sc.json: the SC for each version of TensorFlow.
- PyTorch folder: similar to TensorFlow folder.
figures folder:
- cluster_centralities.pdf: corresponds to Figure 5 in the paper.
- deprecation_trend.pdf: corresponds to Figure 6 in the paper.
- most_downloaded_package_domains.pdf: corresponds to Figure 2 in the paper.
- shape_distribution.pdf: corresponds to Figure 4 in the paper.
- TensorFlow folder: each file is a visualization of one cluster in the TensorFlow SC. Files are sorted by cluster size and then alphabetically.
- PyTorch folder: similar to TensorFlow folder.
mongodb_dump folder: contains the dump of distribution_metadata collection and versioned_dependencies collection.
SupplyChains: contains the dump of edges and nodes in TensorFlow SC and PyTorch SC.
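
As a rough illustration of how the edge lists can be used, the sketch below loads dependent/dependency pairs and collects every package that depends on a given root, directly or indirectly. It assumes a two-column CSV with a header row, in the order dependent, dependency; the actual column names and layout of TensorFlow_edges.csv may differ. The function names and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict, deque


def load_edges(lines):
    """Read (dependent, dependency) pairs from CSV lines, skipping the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # assumption: the first row is a header
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def transitive_dependents(edges, root):
    """Return every package that depends on `root`, directly or indirectly."""
    # Map each dependency to the set of packages that depend on it.
    dependents_of = defaultdict(set)
    for dependent, dependency in edges:
        dependents_of[dependency].add(dependent)

    # Breadth-first walk from the root along reversed dependency edges.
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dependent in dependents_of[current]:
            if dependent not in seen and dependent != root:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Made-up rows standing in for a real edge file.
SAMPLE = (
    "dependent,dependency\n"
    "pkg-a,tensorflow\n"
    "pkg-b,pkg-a\n"
    "pkg-c,numpy\n"
)
sample_edges = load_edges(io.StringIO(SAMPLE))
sample_chain = transitive_dependents(sample_edges, "tensorflow")
```

To try this on the real data, pass an open file handle for the edge file to load_edges instead of the in-memory sample, adjusting the column order if the file is laid out differently.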

Replication Results

Construct PyPI DL SCs

We retrieve the PyPI distribution metadata dump from Google BigQuery with the following query:
```
SELECT
  metadata_version, name, version, summary, author, author_email, maintainer, maintainer_email,
   license, keywords, classifiers, platform, home_page, download_url, requires_python, requires,
   provides, obsoletes, requires_dist, provides_dist, obsoletes_dist, requires_external, project_urls,
   upload_time, filename, size, python_version, packagetype, comment_text
FROM
  `bigquery-public-data.pypi.distribution_metadata`
WHERE
```
We save the results as a JSON file, distribution_metadata.json, and then import it into a MongoDB database.

We build an index on the name field to speed up the construction process.
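
One possible way to create this index is through pymongo, sketched below. The helper only assumes an object that exposes pymongo's Collection.create_index method. The connection URI and database name in the trailing comment are placeholders, not values taken from this repository; only the distribution_metadata collection name comes from the folder description above.

```python
def ensure_name_index(collection):
    """Create an ascending single-field index on `name` and return the index name.

    `collection` is expected to behave like a pymongo Collection.
    """
    return collection.create_index("name")


# Usage against a real server (placeholder URI and database name):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   ensure_name_index(client["<database>"]["distribution_metadata"])
```

Calling create_index on a field that is already indexed with the same options does not rebuild the index, so the helper can be run again safely before the parsing scripts.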

Then, we build the ecosystem-wise and version-sensitive database using the following commands:
```
python 1.parse_distribution_metadata.py
python 2.parse_package_versions.py
```
We provide the MongoDB database dump in the replication package; please refer to the mongodb_dump folder.

We construct PyPI DL SCs with this command:
```
python 3.construct_supply_chain.py
```
We also provide the SC database dump in the replication package; please refer to the mongodb_dump folder.

There are three Jupyter notebooks, RQ1_domains.ipynb, RQ2_clusters.ipynb, and RQ3_disengagement.ipynb, corresponding to the three RQs in our paper. The plots and numbers used in the paper are visible directly in the cell outputs. To replicate the results of a notebook, start a Python kernel and run all of its cells. If everything is working properly, the results should look identical or very similar to those in the paper.

4.deprecation_utils.py is used to construct the datasets for RQ3; run it with: python 4.deprecation_utils.py

utils.ipynb generates some of the statistics in the paper, including those in footnote 2 and the manual inspection of bloated dependencies in Section 4.
