A1 - Data Curation


About The Project

The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1, 2008 through August 30, 2021. We combine data about Wikipedia page traffic from two different Wikimedia REST API endpoints into a single dataset, perform some simple data processing steps, and then analyze the result by creating a simple visualization.

The purpose of the assignment is to demonstrate that we can follow best practices for open scientific research in designing and implementing the project, and make the project fully reproducible by others, from data collection to data analysis. The notebook in this repository demonstrates the practices followed.

The following is the structure of the directory and files present:

.
├── README.md
├── requirements.txt
├── data
│   ├── raw
│   │   ├── pagecounts_desktop-site_200712-201608.json
│   │   ├── pagecounts_mobile-site_200712-201608.json
│   │   ├── pageviews_desktop_201507-202110.json
│   │   ├── pageviews_mobile-app_201507-202110.json
│   │   └── pageviews_mobile-web_201507-202110.json
│   ├── processed
│   │   └── en-wikipedia_traffic_200712-202109.csv
│   └── visualizations
│       └── en-wikipedia_traffic_200712-202109.png
└── src
    └── A1_Data_Curation.ipynb

API Documentation

To collect the data, we used two different Wikimedia REST API endpoints (a minimal request sketch follows this list):

  1. Pageviews API
    • Given a project and a date range, this API returns a timeseries of pageview counts. You can filter by access method (desktop, mobile-app, mobile-web) and agent type, and choose between hourly, daily, and monthly granularity.
    • Data available from July 2015 through the present.
  2. Legacy Pagecounts API
    • Given a project and a date range, this API returns a timeseries of pagecounts. You can filter by access site (desktop-site or mobile-site) and choose between hourly, daily, and monthly granularity; no agent filter is available.
    • Data available from December 2007 through July 2016.
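The sketch below (Python, using the requests library) shows roughly how these two endpoints are queried. The endpoint templates are the documented Wikimedia REST API routes; the timestamp ranges, User-Agent string, and output filenames are illustrative rather than copied from the notebook.

import json
import requests

# Identify the client, as requested by Wikimedia API etiquette (illustrative value)
HEADERS = {"User-Agent": "https://github.com/sharma-apoorv/data-512-a1"}

# Pageviews API: data from July 2015 onward; agent=user filters out crawler/spider traffic
PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
             "en.wikipedia.org/{access}/user/monthly/2015070100/2021100100")

# Legacy Pagecounts API: data from December 2007 through July 2016; no agent filter exists
PAGECOUNTS = ("https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
              "en.wikipedia.org/{access}/monthly/2007120100/2016080100")

requests_to_make = {
    "pageviews_desktop": PAGEVIEWS.format(access="desktop"),
    "pageviews_mobile-app": PAGEVIEWS.format(access="mobile-app"),
    "pageviews_mobile-web": PAGEVIEWS.format(access="mobile-web"),
    "pagecounts_desktop-site": PAGECOUNTS.format(access="desktop-site"),
    "pagecounts_mobile-site": PAGECOUNTS.format(access="mobile-site"),
}

for name, url in requests_to_make.items():
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    # The files in data/raw additionally encode the covered date range in their names
    with open(f"data/raw/{name}.json", "w") as f:
        json.dump(response.json(), f)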

Dataset

Once collected, the data was cleaned and transformed into a more consumable format and stored as a CSV file (a minimal processing sketch follows the table). The following table describes the columns in the file:

Column Name Description
year The year in which the data was collected (YYYY)
month The month in which the data was collected (MM)
pagecount_all_views Number of page views from all clients acquired from the Legacy Pagecounts API
pagecount_desktop_views Number of page views from desktop clients acquired from the Legacy Pagecounts API
pagecount_mobile_views Number of page views from mobile clients acquired from the Legacy Pagecounts API
pageview_all_views Number of page views from all clients acquired from the Pageviews API
pageview_desktop_views Number of page views from desktop clients acquired from the Pageviews API
pageview_mobile_views Number of page views from mobile clients acquired from the Pageviews API
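The following is a minimal processing sketch, assuming the raw files follow the standard Wikimedia REST API response layout (an "items" list whose entries carry a "timestamp" plus a "views" or "count" field); the notebook may organize this step differently.

import json
import pandas as pd

def monthly_series(path, value_field, column):
    # "items", "timestamp", and "views"/"count" follow the Wikimedia REST API
    # response format; adjust if the raw files are structured differently.
    with open(path) as f:
        df = pd.DataFrame(json.load(f)["items"])
    df["year"] = df["timestamp"].str[:4].astype(int)
    df["month"] = df["timestamp"].str[4:6].astype(int)
    return df[["year", "month", value_field]].rename(columns={value_field: column})

pc_desktop = monthly_series("data/raw/pagecounts_desktop-site_200712-201608.json", "count", "pagecount_desktop_views")
pc_mobile  = monthly_series("data/raw/pagecounts_mobile-site_200712-201608.json", "count", "pagecount_mobile_views")
pv_desktop = monthly_series("data/raw/pageviews_desktop_201507-202110.json", "views", "pageview_desktop_views")
pv_app     = monthly_series("data/raw/pageviews_mobile-app_201507-202110.json", "views", "pageview_mobile_app_views")
pv_web     = monthly_series("data/raw/pageviews_mobile-web_201507-202110.json", "views", "pageview_mobile_web_views")

# Outer-merge so months covered by only one API are kept; fill uncovered months with 0
combined = pc_desktop
for frame in (pc_mobile, pv_desktop, pv_app, pv_web):
    combined = combined.merge(frame, on=["year", "month"], how="outer")
combined = combined.fillna(0).sort_values(["year", "month"])

# Mobile pageviews are the sum of the mobile-app and mobile-web series
combined["pageview_mobile_views"] = combined["pageview_mobile_app_views"] + combined["pageview_mobile_web_views"]
combined["pagecount_all_views"] = combined["pagecount_desktop_views"] + combined["pagecount_mobile_views"]
combined["pageview_all_views"] = combined["pageview_desktop_views"] + combined["pageview_mobile_views"]

columns = ["year", "month",
           "pagecount_all_views", "pagecount_desktop_views", "pagecount_mobile_views",
           "pageview_all_views", "pageview_desktop_views", "pageview_mobile_views"]
combined[columns].to_csv("data/processed/en-wikipedia_traffic_200712-202109.csv", index=False)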

Results

The following chart was generated after collecting and processing the data:

[Figure: Page Views on English Wikipedia (x 1,000,000); see data/visualizations/en-wikipedia_traffic_200712-202109.png]
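A minimal plotting sketch with matplotlib is shown below, assuming the processed CSV described above; the notebook's actual styling, labels, and axis choices may differ.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/processed/en-wikipedia_traffic_200712-202109.csv")
df["date"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

fig, ax = plt.subplots(figsize=(12, 6))
series_to_plot = [
    ("pagecount_desktop_views", "Pagecounts (desktop)"),
    ("pagecount_mobile_views", "Pagecounts (mobile)"),
    ("pagecount_all_views", "Pagecounts (all)"),
    ("pageview_desktop_views", "Pageviews (desktop)"),
    ("pageview_mobile_views", "Pageviews (mobile)"),
    ("pageview_all_views", "Pageviews (all)"),
]
for column, label in series_to_plot:
    subset = df[df[column] > 0]  # skip months the corresponding API does not cover
    ax.plot(subset["date"], subset[column] / 1e6, label=label)

ax.set_title("Page Views on English Wikipedia (x 1,000,000)")
ax.set_xlabel("Year")
ax.set_ylabel("Views (millions)")
ax.legend()
fig.savefig("data/visualizations/en-wikipedia_traffic_200712-202109.png")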

Important notes

  • As much as possible, we're interested in organic (user) traffic, as opposed to traffic from web crawlers or spiders. The Pageviews API (but not the Legacy Pagecounts API) allows filtering by agent=user, so we apply that filter to all Pageviews API requests.
  • There is roughly one year of overlapping traffic data between the two APIs. We collect and graph the data from both APIs, including the overlapping period.

Dependencies

Anaconda was used to maintain and install project dependencies. Additionally, a requirements.txt file has been provided to install all dependencies used in the notebook.

Usage

The repo can be cloned using the following command:

git clone https://github.com/sharma-apoorv/data-512-a1.git

The dependencies can be installed into an environment using the following command:

conda create --name <envname> --file requirements.txt

The environment is activated as follows:

conda activate <envname>

Lastly, the Jupyter Notebook server can be started using the following command:

jupyter-notebook

Executing the above command opens the Jupyter interface in the browser (or prints a URL to open). Navigate to src/A1_Data_Curation.ipynb and select "Cell > Run All" to execute the notebook.

License

Distributed under the MIT License. See LICENSE.md for more information.

Wikimedia Foundation REST API - License agreements

