A tool to search for networks of takedown requests with the Lumen API

hektorloto/Lumen-API-Search-Tool

A Search Tool for Lumen

  1. Introduction
  2. Modes
  3. Main variables
  4. Field Names
  5. Output
  6. Useful links
  7. Credits

Introduction

This repository contains a search tool designed to mine data from the Lumen database.

It was developed for the Latin American Center for Investigative Journalism, with the support of the International Center For Journalism and the Scripps Howard Foundation.

An authentication key is needed to access the Lumen API. To request one, you can write to [email protected]
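As a rough illustration of how the key is used, the sketch below only builds the url and headers for one search; it does not send anything. The endpoint, the `X-Authentication-Token` header name and the `term` parameter are assumptions taken from the public Lumen API documentation, not from this tool's code, so check them against the current docs. The function name `build_search_request` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the Lumen API (verify against the official docs).
LUMEN_SEARCH_URL = "https://lumendatabase.org/notices/search.json"


def build_search_request(term, api_key, search_param="term", require_all=True):
    """Return (url, headers) for one search, without sending it.

    search_param plays the role this README gives to type_search;
    the default value "term" is an assumption.
    """
    params = {search_param: term}
    if require_all:
        # Ask for notices matching every word of the term, to reduce partial matches.
        params[f"{search_param}-require-all"] = "true"
    url = f"{LUMEN_SEARCH_URL}?{urlencode(params)}"
    headers = {"X-Authentication-Token": api_key}
    return url, headers
```

The returned pair can be handed to any HTTP client, for example `requests.get(url, headers=headers)`.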

The tool iterates three types of search that, starting from a suspicious-looking DMCA takedown request, extract the network of connected sites likely administered by the same actor. Because the Lumen API doesn't have a specific parameter for certain fields (such as copyrighted_urls and infringing_urls), a search may return noisy data, especially if the search term includes generic words that may also be found in the description field. For this reason, it's better to keep an eye on the terminal during every search.

If you notice the search is looking for urls that are too generic or too far away from the starting point, stop the process and review the output.csv file to see when the search took the wrong path. The field searched_url should help find the problematic url, so that you can add it to the to_skip lists.
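One way to review output.csv is to count how many saved rows each searched url produced; a term with an unusually high count is a likely candidate for to_skip. This is only a sketch: it assumes output.csv has a header row with a searched_url column, and the helper name `rows_per_searched_url` is hypothetical.

```python
import csv
from collections import Counter


def rows_per_searched_url(csv_path):
    """Count how many saved notices each searched url produced."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Rows without the column are grouped under an empty key.
            counts[row.get("searched_url") or ""] += 1
    return counts
```

Calling `rows_per_searched_url("output.csv").most_common(10)` lists the ten search terms that brought in the most rows.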

To limit the amount of noise (and, as a side effect, the amount of data), the variable SAVE_ONLY_BAD_URLS should stay set to True.

1) Modes

The tool works in three different modes:

  1. MODE 1 : search by specific url or other key words
  2. MODE 2 : search by infringing_urls
  3. MODE 3 : search by copyrighted_urls from blogging platforms

MODE 1 is the starting point and the first search to run. Add the desired url(s) to the list. MODE 2 looks for notices that include the infringing_urls values already fetched. MODE 3 looks for notices that include the copyrighted_urls values already fetched, reduced to their source (e.g. from "fraudulentDMCA.blogspot.com/article-about-nasty-person", search for "fraudulentDMCA.blogspot.com").

Once you have started in MODE 1, you should be able to alternate between MODE 2 and MODE 3 until you get no more results and a cluster is identified.

To change the mode, set the GET_URL variable to 1, 2 or 3.
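The way the three modes pick their search terms can be sketched as follows. This is not the tool's actual code: the function `terms_for_mode` and its arguments are hypothetical, and it assumes the blogging-platform urls collected for MODE 3 have already been reduced to their source.

```python
GET_URL = 1  # 1, 2 or 3, as described above


def terms_for_mode(mode, init_url, infringing, bad_urls, read_urls, to_skip):
    """Return the terms to search next, leaving out searched or skipped ones."""
    if mode == 1:
        candidates = list(init_url)    # urls or keywords entered by hand
    elif mode == 2:
        candidates = sorted(infringing)  # infringing_urls fetched so far
    elif mode == 3:
        candidates = sorted(bad_urls)    # blogging-platform sources fetched so far
    else:
        raise ValueError("GET_URL must be 1, 2 or 3")
    return [t for t in candidates if t not in read_urls and t not in to_skip]
```

Alternating between MODE 2 and MODE 3 then amounts to calling this with mode 2 and mode 3 in turn until both return an empty list.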

2) Main variables

  • type_search : the Lumen API has no specific parameter for urls, so the searches rely on the parameter type_search. Note that this may return unwanted data if the searched url contains very generic words, as they may be found in other fields (such as description).

  • init_url : specific url, list of urls or other key words that are used in search mode 1.

  • to_skip : specific url, list of urls or other key words that should not be searched. Add new entries if you realize the script is fetching wrong data. Despite the term-require-all parameter, some urls with very generic names (e.g. news-breaking.news.blog) might produce partial matches with a lot of results and pollute the data.

  • bad_words : words used to identify blogging platforms. These words are used in MODE 2 to find copyrighted urls that are hosted on a blogging platform and to extract their source (for example, from "http://world-newshub.blogspot.co.il/2012/05/israeli-fugitive-caught.html" to "http://world-newshub.blogspot.co.il"). Right now, the words are: blogspot livejournal issuu weebly wordpress tumblr over-blog food.blog.

  • SAVE_ONLY_BAD_URLS : by default, SAVE_ONLY_BAD_URLS = True. This variable was introduced to limit the noise created by long sets of searches, where partial matches might fetch data that is of no interest. When it is True, the tool saves only those notices whose Infringing Url is a blogging platform, recognized by the presence of one of the terms in the bad_words list.
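The role of bad_words and SAVE_ONLY_BAD_URLS can be sketched like this. The word list is the one given above; the helper names `is_blog_url`, `source_of` and `should_save` are hypothetical, and the plain substring match is an assumption about how the real script recognizes a platform.

```python
from urllib.parse import urlsplit

bad_words = ["blogspot", "livejournal", "issuu", "weebly",
             "wordpress", "tumblr", "over-blog", "food.blog"]
SAVE_ONLY_BAD_URLS = True


def is_blog_url(url):
    """True if the url contains one of the blogging-platform words."""
    lowered = url.lower()
    return any(word in lowered for word in bad_words)


def source_of(url):
    """Reduce a post url to its source (scheme and host, or host only)."""
    parts = urlsplit(url)
    if parts.scheme and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    # No scheme given: keep everything before the first slash.
    return url.split("/", 1)[0]


def should_save(urls):
    """Decide whether a notice is saved, given the urls it is checked against."""
    if not SAVE_ONLY_BAD_URLS:
        return True
    return any(is_blog_url(u) for u in urls)
```

With these definitions, `source_of("http://world-newshub.blogspot.co.il/2012/05/israeli-fugitive-caught.html")` gives `"http://world-newshub.blogspot.co.il"`, matching the example in the bad_words entry.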

3) Field Names

For each notice, the tool extracts the following fields:

  • id : the unique ID number in the Lumen dataset.
  • type : the type of notice. This field should always be DMCA, but it is included as a control tool.
  • date_received : the date on which Google received the DMCA notice, expressed in yyyy-mm-dd.
  • sender_name : the name of the person or organization that presented the DMCA notice, as filled out by the author of the notice.
  • jursdictions : the jurisdiction for the DMCA notice, as filled out by the author of the notice.
  • copyrighted_urls : the url that allegedly holds ownership of the content published by the infringing_url. If more than one, values are separated by semicolons.
  • infringing_urls : the url of the content allegedly stolen from the copyrighted_url. If more than one, values are separated by semicolons.
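A minimal sketch of how one notice could be flattened into these fields is shown below. The shape of the notice (a "works" list whose items hold url entries, and a "jurisdictions" list) is an assumption about the Lumen API response, and `join_urls` and `notice_to_row` are hypothetical helpers, not functions from this tool. The searched_url column is the one mentioned in the Introduction.

```python
def join_urls(works, key):
    """Collect the urls stored under key in every work, separated by semicolons."""
    urls = []
    for work in works:
        for entry in work.get(key, []):
            url = entry.get("url") if isinstance(entry, dict) else entry
            if url and url not in urls:
                urls.append(url)
    return ";".join(urls)


def notice_to_row(notice, searched_url):
    """Map one notice (assumed shape) to a flat row for output.csv."""
    works = notice.get("works", [])
    return {
        "id": notice.get("id"),
        "type": notice.get("type"),
        # Keep only the yyyy-mm-dd part of a longer timestamp.
        "date_received": str(notice.get("date_received") or "")[:10],
        "sender_name": notice.get("sender_name"),
        "jursdictions": ";".join(notice.get("jurisdictions") or []),
        "copyrighted_urls": join_urls(works, "copyrighted_urls"),
        "infringing_urls": join_urls(works, "infringing_urls"),
        "searched_url": searched_url,
    }
```

A row built this way can be written with `csv.DictWriter`, using the dictionary keys as the header.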

4) Output

The script creates the following files in the directory:

  • output.csv : a .csv file containing the data fetched from the Lumen database. It includes eight columns corresponding to the fieldnames values.
  • read_urls.pkl : a python pickle file containing the urls or other words searched so far. Once stored there, a url or other word will not trigger a new search, regardless of whether it is included among the ones to be searched in MODE 2 or was automatically detected in MODE 2 or MODE 3.
  • infringing.pkl : a python pickle file containing information about the infringing_urls fields found so far.
  • bad_urls.pkl : a python pickle file containing information about the blogging platforms found in the copyrighted_urls field.
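The pickle files act as memory between runs. A sketch of that pattern follows; it assumes each file stores a plain collection of strings, which may differ from the real files, and the helpers `load_set`, `save_set` and `mark_as_read` are hypothetical.

```python
import os
import pickle


def load_set(path):
    """Load a pickled collection as a set, or an empty set if the file is missing."""
    if not os.path.exists(path):
        return set()
    with open(path, "rb") as handle:
        return set(pickle.load(handle))


def save_set(path, values):
    """Write the collection back to disk as a set."""
    with open(path, "wb") as handle:
        pickle.dump(set(values), handle)


def mark_as_read(path, term):
    """Record term as searched; return False if it had already been searched."""
    seen = load_set(path)
    if term in seen:
        return False
    seen.add(term)
    save_set(path, seen)
    return True
```

For example, `mark_as_read("read_urls.pkl", url)` returns True the first time a url is seen and False on later runs, which is the behaviour described for read_urls.pkl. Deleting the .pkl files resets that memory.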

5) Useful reads

Legal and academic

In the news

Links

Credits
