This repository provides tools that make advanced studies of social media data easier by managing the storage and augmentation of data collected around a particular focal event or query on social media. Currently, `focalevents` supports data collection from Twitter using the v2 API with academic credentials.
It is often difficult to organize data from multiple API queries. For example, we may collect tweets when a hashtag starts trending by using Twitter's filter stream. Later, we may make a separate query to the search endpoint to backfill our stream with what we missed before we started it, or update it with tweets that occurred since we stopped it. We may also want to get reply threads, quote tweets, or user timelines based on the tweets we collected. All of these queries are related to a common focal event—the hashtag—but they require several separate calls to the API. It is easy for these multiple queries to result in many disjoint files, making it difficult to organize, merge, update, backfill, and preprocess them quickly and reliably.
The `focalevents` codebase organizes social media focal event data using PostgreSQL, making it easy to query, backfill, update, sort, and augment the data. For example, Twitter conversations, quotes, and user timelines can each be collected with a single-line command, rather than multi-line scripts that need to read IDs, query the API, and output the data. This allows researchers to design more complex studies of social media data and spend more time on analysis, rather than on data storage and maintenance.
The repository's code can be downloaded directly from GitHub, or cloned using git:
git clone https://github.com/ryanjgallagher/focalevents
You can install any needed packages by navigating to the project directory and running:
pip install -r requirements.txt
You will also need to install PostgreSQL and create a database on the computer where you want to run this code. There are many online resources for installing PostgreSQL and configuring a database, so no utilities or instructions for doing so are provided here.
The configuration file `config.yaml` specifies important information for connecting to different APIs and storing the data. Some of these fields need to be set before starting:
- Under the `psql` field, you need to provide information for connecting to the database. At minimum, you need to specify the `database` name and `user` name. If you have altered any of the PostgreSQL defaults, you may also need to enter the `host`, `port`, or `password`. Otherwise, these can be left as `null`.
- Under `keys`, you need to provide API authorization tokens (an example layout is sketched below).
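As a rough illustration, a filled-in `config.yaml` might look something like the following. This is an assumed sketch: the exact field names and nesting are defined by the template shipped with the repository, and the key names shown under `keys` are placeholders.

```yaml
# Hypothetical sketch of config.yaml; consult the repository's own template
# for the authoritative field names and structure.
psql:
  database: focalevents       # name of the PostgreSQL database you created
  user: focalevents_user      # PostgreSQL user with access to that database
  host: null                  # leave as null to use the PostgreSQL defaults
  port: null
  password: null

keys:
  twitter:
    bearer_token: "YOUR-ACADEMIC-API-BEARER-TOKEN"   # placeholder token
```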
Once the database information and API tokens are set, go to the `focalevents` folder and run:
python config.py
This will create all of the necessary directories, schemas, and tables needed for reading and writing data.
All data is organized around an "event name." This name should be a unique signifier for the focal event around which you want to collect data.
Queries are separated from the code by using an event query configuration file, a YAML file named after the event. For example, say we want to search for tweets about Facebook's Oversight Board. We can name our event "facebook_oversight". The specific search queries to the Twitter API are then specified in the event configuration file `input/twitter/search/facebook_oversight.yaml`.
The format of the `.yaml` event configuration files depends on the platform and the type of query being done. You can find examples in this repository's input directory. The syntax for Twitter queries follows the API's operators (stream, search).
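For instance, a search configuration for the "facebook_oversight" event might look roughly like the sketch below. The field names here are illustrative assumptions rather than the repository's exact schema; the query string itself just uses standard Twitter v2 search operators. Check the examples in the input directory for the actual format.

```yaml
# input/twitter/search/facebook_oversight.yaml
# Illustrative sketch: field names are assumptions, not the definitive schema.
query: '"oversight board" (facebook OR fb)'     # Twitter v2 search operators
start_time: "2021-05-01T00:00:00Z"              # assumed ISO 8601 timestamps
end_time: "2021-06-01T00:00:00Z"
```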
Once the focal event's query configuration file is set, you are ready to run it! All queries are run using Python at the command line with the `-m` flag. For example, if the event's name is `facebook_oversight`, then we can run a basic Twitter search by going to the `focalevents` directory and entering:
python -m twitter.search facebook_oversight
For details on collecting Twitter data, see here.
First and foremost, the code here is designed to help the repository's author manage their own data and create replicable pipelines. They are sharing it in the hope that it may help others who have similar workflows and are interested in organizing their Twitter data according to focal events using PostgreSQL. However, most requests for enhancements or additions to the code will likely be declined if the author does not anticipate using them in their own research. It is highly unlikely the code will ever be adapted to work with databases other than PostgreSQL. Further, general problems with database setup or conflicts with pre-existing database structures are beyond the scope of this project and will not be addressed.