P5

^^^^^ An Analysis of Newly Emerged Rumors on Twitter ^^^^^

This is a ten-stage project devised to first build a dataset of newly emerged rumors from Twitter and then perform a set of analyses (micro + macro) on it. The stages (S4 to S13, plus the auxiliary step S6-1) each come with a distinct Python script, briefly described below:

S4 : This code crawls the tweets related to the rumor we want to hunt. Before running it, a Twitter API app must be created so that the API keys can be set in the code. The queries (keywords related to the rumor) must also be set in the code so that only tweets containing them are captured. The output, the collection of all tweets related to the rumor, is written to a .jsonl file.
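
A minimal sketch of this crawling step, assuming the Tweepy v4 client and placeholder credentials and keywords (the actual script may use a different client or query set):

```python
import json
import tweepy

API_KEY, API_SECRET = "YOUR_API_KEY", "YOUR_API_SECRET"          # placeholders
ACCESS_TOKEN, ACCESS_SECRET = "YOUR_TOKEN", "YOUR_TOKEN_SECRET"  # placeholders
QUERY = '"rumor keyword one" OR "rumor keyword two"'             # hypothetical queries

auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

with open("rumor_tweets.jsonl", "w", encoding="utf-8") as out:
    # Page through the search endpoint and dump each matching tweet as one JSON line.
    for tweet in tweepy.Cursor(api.search_tweets, q=QUERY, tweet_mode="extended").items(1000):
        out.write(json.dumps(tweet._json) + "\n")
```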

S5 : This code takes the S4 output as its input, separates the data into meaningful pieces of information, and organizes them in an Excel file. The file assigns one distinct row to each tweet and saves its information (including tweet ID, user ID, date and time, etc.) in the columns.
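
A minimal sketch of this flattening step, assuming pandas with an openpyxl backend; the column names are illustrative, not necessarily the script's own:

```python
import json
import pandas as pd

rows = []
with open("rumor_tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        t = json.loads(line)
        rows.append({
            "tweet_id": t["id_str"],
            "user_id": t["user"]["id_str"],
            "created_at": t["created_at"],
            "text": t.get("full_text", t.get("text", "")),
            "is_retweet": "retweeted_status" in t,
        })

# One row per tweet, one column per extracted field.
pd.DataFrame(rows).to_excel("rumor_tweets.xlsx", index=False)
```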

S6 : This code takes the S5 output as its input and picks a set of tweets to be used for human annotation. The selection strategy prioritizes tweets that occur more frequently in the dataset, since annotating them by hand leaves fewer tweets for the machine and raises the overall accuracy of the dataset. However, the specified size of the selected pack and the desire to cover all kinds of tweets (including the less frequent ones) lead the code to choose the most appropriate mix. The code also produces a second output: the collection of all tweets in the dataset with repetitive versions removed. This collection can be used when human annotators want to annotate the whole dataset on their own, without the machine, as described in S6-1.
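
A minimal sketch of the frequency-based selection idea, assuming pandas; the annotation budget and column names are illustrative assumptions, not the script's exact parameters:

```python
import pandas as pd

df = pd.read_excel("rumor_tweets.xlsx")
BUDGET = 500  # hypothetical size of the pack sent to human annotators

# Count how often each distinct text occurs (retweets and duplicates inflate this),
# then hand the most frequent distinct texts to the annotators first.
counts = df["text"].value_counts()
selected_texts = counts.head(BUDGET).index

df[df["text"].isin(selected_texts)].drop_duplicates(subset="text") \
    .to_excel("tweets_for_annotation.xlsx", index=False)

# Second output: every distinct tweet text, for fully manual annotation (S6-1).
df.drop_duplicates(subset="text").to_excel("all_distinct_tweets.xlsx", index=False)
```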

S6-1 : The S6 output that contains all tweets of the dataset with repetitive versions removed is, after being annotated by the annotators, given to this code as input. The code then assigns the annotations to the repetitive versions of the tweets, and the final dataset is produced as the output.
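
A minimal sketch of this propagation step, assuming pandas and hypothetical file and column names: labels given to distinct texts are copied to their duplicates.

```python
import pandas as pd

full = pd.read_excel("rumor_tweets.xlsx")                        # all tweets (with duplicates)
annotated = pd.read_excel("all_distinct_tweets_annotated.xlsx")  # one label per distinct text

label_by_text = annotated.set_index("text")["label"]  # labels: r / a / q / n
full["label"] = full["text"].map(label_by_text)       # copy labels onto duplicate texts
full.to_excel("final_dataset.xlsx", index=False)
```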

S7 : The S6 output that contains only the selected collection of tweets is, after being annotated by the annotators, given to this code as input (two input Excel files, one per annotator). The code checks whether the entries (r: rumor, a: anti-rumor, q: related to the rumor, n: not related to the rumor) are given correctly, and if so it computes the Kappa coefficient to examine the agreement between the annotators. The output is the Excel file ready to be used for extracting the learning features.
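
A minimal sketch of the agreement check, assuming pandas and scikit-learn's Cohen's kappa; file and column names are illustrative assumptions:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

VALID_LABELS = {"r", "a", "q", "n"}

ann1 = pd.read_excel("annotator1.xlsx")["label"].str.strip().str.lower()
ann2 = pd.read_excel("annotator2.xlsx")["label"].str.strip().str.lower()

# Reject any entry outside the allowed label set before scoring agreement.
bad = ~(ann1.isin(VALID_LABELS) & ann2.isin(VALID_LABELS))
if bad.any():
    raise ValueError(f"Invalid labels at rows: {list(bad[bad].index)}")

kappa = cohen_kappa_score(ann1, ann2)
print(f"Cohen's kappa between annotators: {kappa:.3f}")
```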

S8 : This code takes the S7 output as its input and extracts all the unigrams and bigrams from all the tweets in the collection. It then produces an Excel file in which, for each tweet (in a distinct row), the presence or absence of each feature (in a distinct column) is marked with 1 or 0. The output is the Excel file ready to be used for machine learning.
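
A minimal sketch of this feature extraction, assuming pandas and scikit-learn's CountVectorizer; file and column names are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_excel("annotated_selected_tweets.xlsx")

# binary=True marks presence/absence (1/0) of each unigram and bigram.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(df["text"])

features = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
features.insert(0, "label", df["label"])
features.to_excel("learning_features.xlsx", index=False)
```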

S9 : This code takes the S8 output as its input and, using the random forest algorithm, trains the machine to annotate the rest of the tweets. By running this code, all the tweets are annotated, and the accuracy of the dataset is finally calculated by the code. The output of this code is the final annotated Excel file of the dataset.
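
A minimal sketch of the learning step, assuming scikit-learn's random forest; the cross-validation split and file names are illustrative assumptions, not the script's exact procedure:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_excel("learning_features.xlsx")
X, y = data.drop(columns=["label"]), data["label"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Cross-validated accuracy on the human-annotated pack estimates how reliably
# the model will label the remaining tweets.
print("estimated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# Hypothetical file with the same feature columns, built for the not-yet-annotated tweets.
unlabeled = pd.read_excel("unlabeled_features.xlsx")
unlabeled["label"] = clf.predict(unlabeled[X.columns])
unlabeled.to_excel("machine_annotated_tweets.xlsx", index=False)
```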

S10 : This code gets the dataset Excel file (the output of S9) and performs a microscopic analysis, particularly on the user types. First, it divides users into six categories: rumor tweeter, anti-rumor tweeter, rumor retweeter, anti-rumor retweeter, rumor-related tweeter, and rumor-related retweeter. The last two categories refer to the tweeters and retweeters of content that does not support the rumor but is related to it, for example tweets that question the rumor. It then calculates user dynamics such as the ratio of followings to followers, the number of days of membership, the number of published tweets per day, and the compound sentiment score of the description sentence in the users' profiles, both for individual users and for the user categories.
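
A minimal sketch of the user-dynamics computation, assuming pandas and NLTK's VADER analyzer for the compound score (requires nltk.download("vader_lexicon")); column names are illustrative assumptions:

```python
from datetime import datetime, timezone

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

users = pd.read_excel("final_dataset.xlsx")  # hypothetical: one row per user with profile fields
sia = SentimentIntensityAnalyzer()
now = datetime.now(timezone.utc)

users["following_to_follower"] = users["friends_count"] / users["followers_count"].clip(lower=1)
users["membership_days"] = (now - pd.to_datetime(users["account_created_at"], utc=True)).dt.days
users["tweets_per_day"] = users["statuses_count"] / users["membership_days"].clip(lower=1)
users["description_compound"] = users["description"].fillna("").map(
    lambda d: sia.polarity_scores(d)["compound"])

# Per-category averages of the same dynamics.
print(users.groupby("user_category")[
    ["following_to_follower", "membership_days", "tweets_per_day", "description_compound"]
].mean())
```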

S11 : This code takes the S10 output as its input, chooses a portion of the users from each of the six categories, and then, by crawling their last 200 tweets/retweets, calculates user dynamics such as: the percentage of tweets (relative to all tweets/retweets among the last 200), the average number of likes per tweet, the average number of likes per tweet per follower, the average number of retweets per tweet, and the average number of retweets per tweet per follower.
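
A minimal sketch of the per-user metrics, assuming a Tweepy `api` object configured as in the S4 sketch; the 200-tweet window matches the description above, while the function name and return keys are hypothetical:

```python
def user_dynamics(api, user_id):
    timeline = api.user_timeline(user_id=user_id, count=200, tweet_mode="extended")
    if not timeline:
        return None
    # Original tweets lack the retweeted_status attribute; the rest are retweets.
    originals = [t for t in timeline if not hasattr(t, "retweeted_status")]
    followers = max(timeline[0].user.followers_count, 1)
    likes = sum(t.favorite_count for t in originals)
    retweets = sum(t.retweet_count for t in originals)
    n = max(len(originals), 1)
    return {
        "pct_tweets": 100.0 * len(originals) / len(timeline),
        "avg_like_per_tweet": likes / n,
        "avg_like_per_tweet_per_follower": likes / n / followers,
        "avg_retweet_per_tweet": retweets / n,
        "avg_retweet_per_tweet_per_follower": retweets / n / followers,
    }
```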

S12 : This code takes the S11 output as its input and calculates the eta-squared correlation between the user categories and the values of the user dynamics, such as the percentage of tweets (relative to all tweets/retweets among the last 200), the average number of likes per tweet, the average number of likes per tweet per follower, the average number of retweets per tweet, and the average number of retweets per tweet per follower.
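
A minimal sketch of the eta-squared computation (between-group sum of squares divided by total sum of squares), assuming pandas; file and column names are illustrative assumptions:

```python
import pandas as pd

def eta_squared(df, category_col, value_col):
    grand_mean = df[value_col].mean()
    ss_total = ((df[value_col] - grand_mean) ** 2).sum()
    ss_between = df.groupby(category_col)[value_col].apply(
        lambda g: len(g) * (g.mean() - grand_mean) ** 2).sum()
    return ss_between / ss_total

metrics = pd.read_excel("user_dynamics.xlsx")
for col in ["pct_tweets", "avg_like_per_tweet", "avg_retweet_per_tweet"]:
    print(col, eta_squared(metrics, "user_category", col))
```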

S13 : This code gets the outputs of S9 and S10 as its inputs and calculates the number of users who belong to each combination of the six categories. The code also performs a macroscopic analysis over the dataset, plotting the number of tweets/retweets in each hour.
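
A minimal sketch of the macroscopic hourly plot, assuming pandas and matplotlib; file and column names are illustrative assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("final_dataset.xlsx")
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Number of tweets/retweets in each hour of the collection window.
hourly = df.set_index("created_at").resample("1H").size()
hourly.plot(kind="line", xlabel="Hour", ylabel="Tweets / retweets", title="Hourly volume")
plt.tight_layout()
plt.savefig("hourly_volume.png")
```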
