by Toxic Players
- Elena Calmis
- Liliana Gavrilas
- Adrian Plesescu (project leader)
- Social Media platforms wanting to monitor and filter offensive content.
- Government entities fighting against possible online threats.
- Antivirus-like programs which act as a firewall that blocks children from accessing certain toxic/offensive/hateful content.
- Basically all platforms managing a comment section, including online stores, blogs, etc.
(a censoring notice is forwarded to the offender, preferably with an attached explanation)
Eddie Yang and Margaret Roberts from the University of California, San Diego published on 22 Jan 2021 the article "Censorship of Online Encyclopedias", which discusses the differences between training an NLP model on two online encyclopedias: Wikipedia and its Chinese equivalent, Baidu Baike. The researchers found proof of what they initially believed, namely that censorship influences the classification of words based on their political meaning: for example, 'democracy' is more related to 'chaos' than it is to 'stability'. Finally, a news headline analyzer built on top of this had a bias towards siding with propaganda ideas and against non-patriots.
- Brainstormed ideas for gathering corpus content:
- Sentiment Analysis for Romanian -> gets positive (non-offensive) comments;
- Offensive/Obscene/Vulgar lexicon -> gets a list of highly offensive words and expressions from an online dictionary (mainly slang) using regex;
- Facebook comment extractor -> automates the extraction of Facebook comments from a specific post, needs further manual labeling;
- Twitch live comments extractor -> a crawler (v0.1) that, combined with a timed HTML-saving Chrome extension, produces a list of comments.
- Built a Trello page to easily monitor progress.
- We updated our existing shallow corpus with thousands of YouTube comments gathered dynamically by a Python script and a few hundred from Facebook.
- Running last week's sentiment analyzer, we found that the results were largely inconclusive.
- We manually annotated them using 3 categories = Offensive/ Non-offensive/ Neutral (which we can also count), but at this point we have doubts about the accuracy of these annotations. That is because we discovered that many comments are offensive only when looking at the context, or can be better classified as threats or hate. We surely have to step back and rethink this issue.
- Having the Romanian Dictionary's (dex) database already downloaded from last week, we queried it based on representative tags/abbreviations for our specific problem (prst., obs., vulg., eufem., depr.) and filtered the results manually. This was done to gather a lexicon of bad words and expressions to use as a first wall against the use of obscene words, no matter the context. We will keep improving it (460+ entries).
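A minimal sketch of the tag-filtering step described above. The input file name and its "word <TAB> definition" line format are assumptions about how we export the dex database; only the usage-tag abbreviations come from the actual list we used.

```python
import re

# Usage tags (prst., obs., vulg., eufem., depr.) mark obscene/vulgar/pejorative entries.
TAG_PATTERN = re.compile(r"\b(prst\.|obs\.|vulg\.|eufem\.?|depr\.)", re.IGNORECASE)

lexicon = set()
with open("dex_definitions.tsv", encoding="utf-8") as f:  # hypothetical export file
    for line in f:
        if "\t" not in line:
            continue
        word, definition = line.rstrip("\n").split("\t", 1)
        if TAG_PATTERN.search(definition):
            lexicon.add(word.lower())

print(len(lexicon), "candidate entries for manual filtering")
```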
- We currently have to figure out how our program can deal with common abbreviations, misspelled words, emoji combos and popular spellings (more or less like the ones we found in the comments so far).
- As for the next sprint, we can think more about the difficulties we faced and also work on tokenization, stemming, lemmatization and POS tagging. I think this will give us a clearer picture of how we need to adapt our corpus for real work and a bit of insight into the ML algorithm / neural network we will use in the weeks to come.
- We have also worked on the frontend side (one main page with a textbox + a second page for the result). That said, we leave an open door for future improvements like advanced options / filtering and black-box interpretations.
- This week's attention was on resources that can help us preprocess data. We managed to write a script that normalizes random texts, removes stop words and punctuation marks, and lemmatizes them.
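A rough sketch of such a preprocessing script, assuming NLTK's Romanian stop-word list is available; it uses the Snowball stemmer as a stand-in for the lemmatization step, since our actual lemmas come from TEPROLIN (see below).

```python
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("romanian"))
stemmer = SnowballStemmer("romanian")  # stand-in for lemmatization

def preprocess(text):
    # Normalize: lowercase and strip punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"\w+", text, flags=re.UNICODE)
    # Remove stop words, then reduce each remaining token to its stem.
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Acesta este un EXEMPLU, cu semne de punctuație!"))
```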
- Using the 2 links we received last week, the TEPROLIN documentation and the UAIC Natural Language Processing Group's Resources and Tools, we enriched our understanding of preprocessing and our NLP vocabulary (anaphora, chunking, etc.).
- Through TEPROLIN's API, we fed a sentence and received its POS tagging, followed by lemmatization.
- We compared different methods of lemmatization, including the Snowball stemmer and the lemmatizer TEPROLIN provides, concluding that the latter delivers better performance. We will use TEPROLIN further in the project and hope its standardized format will help us in the integration process.
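For reference, a hedged sketch of the API call from the previous bullets. The URL, endpoint path, parameter names and operation identifiers are assumptions based on our reading of the TEPROLIN documentation and should be checked against the actual deployment.

```python
import requests

TEPROLIN_URL = "http://127.0.0.1:5000/process"  # placeholder endpoint

def teprolin_annotate(text):
    # Ask TEPROLIN for POS tagging and lemmatization of a Romanian sentence.
    resp = requests.post(
        TEPROLIN_URL,
        data={"text": text, "exec": "pos-tagging,lemmatization"},
    )
    resp.raise_for_status()
    return resp.json()

print(teprolin_annotate("Acesta este un exemplu de propoziție."))
```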
- We researched language models and how we can build one using an existing corpus for Romanian. We tried using RoWordNet without success, though we discovered that its synsets could be valuable later on. Not being able to easily import a Romanian corpus through NLTK (because there is none available in nltk.corpus) and not being able to download parts of CoRoLa, we are still trying to find a solution.
- Everyone filled out the SWOT part of the project description file, but sections such as the goal and the qualitative benchmarks still need to be discussed further.
- Research about the input has led us to this article, which will serve as a guideline.
- We centralized the corpus comment files and counted the annotated texts: ~3000
- This week we had a second collaboration/exchange with another team: we shared our TEPROLIN connection and received a large corpus (~12,000 non-offensive & ~6,000 offensive comments). This came at a good time for us, given that we had already planned to implement preprocessing operations on our dataset. After we took a look at this new corpus, it became obvious that we had a "positive bias" and had based our classification of offensive comments on other criteria.
- We compiled the received dataset into a .csv file and began adding our own comments to lower the imbalance between the 2 categories. This meant converting our -1/0/1 system to a 0/1 one.
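A small sketch of this merge and label conversion, assuming pandas. The file names, column names and the exact correspondence between the old -1/0/1 labels and our annotation categories are assumptions to be adjusted to our annotation guide.

```python
import pandas as pd

ours = pd.read_csv("our_comments.csv")         # columns: text, label in {-1, 0, 1}
received = pd.read_csv("received_corpus.csv")  # columns: text, label in {0, 1}

# Assumed mapping: 1 = offensive stays 1; the other two old categories collapse to 0.
ours["label"] = ours["label"].map({-1: 0, 0: 0, 1: 1})

corpus = pd.concat([received, ours], ignore_index=True).drop_duplicates(subset="text")
corpus.to_csv("corpus.csv", index=False)
```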
- We used our stop word removal and tokenization in conjunction with TEPROLIN's POS tagging (which also gives us the lemma).
- We researched how POS tagging can influence and create a feature in the Naive Bayes algorithm. This led us to implement our own algorithm, but also to test pre-built ones such as those provided by scikit-learn.
- We thought of filters for detecting misspellings (internal) and looked for a spellchecker module for Romanian (an external one, but with no results).
- We researched BERT but have to build a larger knowledge base in order to fully understand it.
- We found new resources to use in the next 2 weeks: BERT, Voting Process and NLP-Cube, udpipe, paperswithcode.com(from the 30.mar course)
- Started working on a shared Google Colab notebook, and managed to test a Naive Bayes algorithm and an SVM classifier using scikit-learn. We reached an impressive detection rate of over 90% on our original, smaller corpus.
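A minimal sketch of the kind of scikit-learn setup we tested in the notebook; the corpus file name, column names and vectorizer settings are assumptions, not the exact notebook configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# corpus.csv with "text" and "label" columns (0 = non-offensive, 1 = offensive)
df = pd.read_csv("corpus.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

for name, clf in [("NaiveBayes", MultinomialNB()), ("SVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipe.predict(X_test)))
```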
- Integration is the name of the game now as we rush to finish the basic functionalities. We bottlenecked at first because we do not have well-defined in-code modules, so we want to decouple some of our files and better organize the project.
- We downloaded TEPROLIN on a virtual machine to avoid overloading the main server with thousands of requests, given our corpus size. We used the Docker container offered. We still have to test the POS tags and lemmas we get from this.
- The corpus keeps growing with offensive entries, and to handle the spelling mistakes we continue to see, we adopted a strategy that starts with developing modules for finding whether a text is in Romanian and whether a word is correctly written in Romanian.
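A hedged sketch of what these two checks could look like, using the langdetect library as one possible option for language identification; the Romanian word-list file is a placeholder we would export from the dex database.

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic

def is_romanian_text(text):
    """Rough check that a whole comment is written in Romanian."""
    try:
        return detect(text) == "ro"
    except Exception:  # very short or ambiguous texts can raise a detection error
        return False

# Word-level check against a Romanian word list (placeholder file).
with open("romanian_words.txt", encoding="utf-8") as f:
    RO_WORDS = {w.strip().lower() for w in f if w.strip()}

def is_correct_romanian_word(word):
    return word.lower() in RO_WORDS
```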
- In order to highlight the offensive part of a longer phrase, we decided to go for a segmented input approach. So we first need to split a comment into text segments and feed them to the algorithm consecutively, thus calculating an offensiveness score for each segment. Then we will have a general idea of where the expression/bad word is found, so we can provide visual, supervised feedback to the user.
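A short sketch of this segmented scoring, assuming a trained classifier with predict_proba and the fitted vectorizer from the notebook; the window size and variable names are illustrative assumptions.

```python
import re

def segment_scores(comment, model, vectorizer, window=5):
    # Split the comment into consecutive word windows (window size is an assumption)
    # and score each one with the trained classifier.
    tokens = re.findall(r"\w+", comment, flags=re.UNICODE)
    segments = [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    X = vectorizer.transform(segments)
    probs = model.predict_proba(X)[:, 1]  # probability of the "offensive" class (label 1)
    return list(zip(segments, probs))
```

The per-segment probabilities can then be thresholded to decide which part of the comment to highlight in the interface.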
- We are in the process of integrating a module that detects text zones/ROIs in an image and censors any offensive text. We already had some code along the lines of OCR text detection and we figured it would come in useful in this project too. Its main utility is to capture offensive material in images with clear text, preferably black on white (memes, ad campaigns and screenshots). We are using easyocr to find the ROIs and Tesseract to extract the text. We might go with easyocr alone after some more testing and some driver installs, because it is the slower of the two. All other image processing is done with OpenCV.
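A minimal sketch of the easyocr + OpenCV censoring step; the is_offensive placeholder stands in for our trained classifier, and the file names are illustrative.

```python
import cv2
import easyocr

def is_offensive(text):
    # Placeholder: in the real pipeline this calls the trained classifier.
    return True

reader = easyocr.Reader(["ro"], gpu=False)   # Romanian model; GPU depends on drivers
img = cv2.imread("meme.png")
for bbox, text, conf in reader.readtext(img):  # bbox = four corner points
    if is_offensive(text):
        xs = [int(p[0]) for p in bbox]
        ys = [int(p[1]) for p in bbox]
        x1, x2, y1, y2 = min(xs), max(xs), min(ys), max(ys)
        # Blur the detected text region instead of drawing over it.
        img[y1:y2, x1:x2] = cv2.GaussianBlur(img[y1:y2, x1:x2], (51, 51), 0)
cv2.imwrite("censored.png", img)
```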
- These 2 weeks marked a slowdown in our usual routine, so we made little progress regarding the integration of extra features. That being said, we've made progress with the current algorithm and corpus.
- We added more than 3000 new offensive entries to the existing corpus, achieving a balance between offensive and non-offensive comments.
- We added 2 new algorithms to the code and tested again, this time with precision and recall. We also managed to save the results, in order to review them each time we add a new feature or test a different encoding system.
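A sketch of how such an evaluation-and-logging helper could look, using scikit-learn metrics; the report file name and JSON layout are assumptions, not our exact report format.

```python
import json
from sklearn.metrics import precision_score, recall_score, classification_report

def evaluate_and_log(name, y_test, y_pred, report_path="test_report.json"):
    """Compute precision/recall for one model run and append it to a running report."""
    entry = {
        "model": name,
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "report": classification_report(y_test, y_pred, output_dict=True),
    }
    try:
        with open(report_path) as f:
            results = json.load(f)
    except FileNotFoundError:
        results = []
    results.append(entry)
    with open(report_path, "w") as f:
        json.dump(results, f, indent=2, default=float)
```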
- We also tested TEPROLIN's lemmatization, but our results didn't seem to improve.
- Being already late in the semester, we plan to finalize the project within 1-2 weeks after Easter. We've thought of many features and models we could implement each week, but we felt the lack of time/previous knowledge to do proper research and implement things from scratch. We would really like to use BERT and Deep Learning to be on par with what's being discussed at the course.
- After the Easter break we managed to add 4 new algorithms, which we ran on the whole corpus.
- We gathered all outputs and compiled them into a test report in order to review them later and compare future variations of the model.
- We kept in mind the idea of treating misspelled/English words differently than others, so we implemented some simple rules that recognize and correct/filter these inputs.
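A hypothetical illustration of what such rules might look like; the replacement table and the repetition cap are illustrative assumptions, not our exact rule set.

```python
import re

# Map a few common English / chat tokens to Romanian equivalents before scoring.
REPLACEMENTS = {
    "ok": "bine",
    "please": "te rog",
}

def normalize_token(token):
    token = token.lower()
    # Cap runs of the same letter at two, e.g. "uraaaat" -> "uraat".
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)
    return REPLACEMENTS.get(token, token)

def normalize_comment(comment):
    return " ".join(normalize_token(t) for t in re.findall(r"\w+", comment))
```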
- We designed a minimal interface built in our Colab notebook, not only as a temporary alternative to the full-featured web one, but also as a practical test machine for whoever wants to work directly alongside the code.
- Based on the 3 ways we processed the input (1. no POS tagging + no lemmatization; 2. only lemmatization; 3. POS tagging + lemmatization), Liliana placed the results into a spreadsheet graph as follows:
- As promised, Elena implemented the voting mechanism, which works as follows: there are 2 types - hard and soft - each one having 'all' and 'best3' variations. Hard votes with 0s and 1s. On the other hand, soft votes by assigning probabilities. We discovered that soft-best3 works best, but is in fact the slowest; no surprise there.
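A sketch of the same idea expressed with scikit-learn's VotingClassifier; the particular estimators and the cross-validation-based 'best3' selection are assumptions standing in for the models actually used in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Candidate models (placeholders for the ones trained in the notebook).
estimators = [
    ("nb", MultinomialNB()),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100)),
]

# "all" variation: every model votes.
hard_all = VotingClassifier(estimators, voting="hard")  # majority of 0/1 predictions
soft_all = VotingClassifier(estimators, voting="soft")  # averages predicted probabilities

def best3(candidates, X, y):
    # "best3" variation: keep only the three models with the best cross-validation score.
    scored = sorted(
        candidates,
        key=lambda e: cross_val_score(e[1], X, y, cv=5).mean(),
        reverse=True,
    )
    return VotingClassifier(scored[:3], voting="soft")
```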
- Liliana further updated the graphs, now adding the new results in a new format. She added everything to a pdf:
- We are halfway done with the proper web interface, which we are keen on doing because it will show how well integrated our modules are.
- We finished the interface, meaning we integrated our backend prediction code with our other modules (OCR, trained models) through a Flask API backend and a Vue 3 frontend. We had not worked with Vue previously, and it was a learning curve. Our code might not be pretty :), but it is functional. We built components using at-design and made a logo.
- The interface has 3 main functionalities: it can test typed text input, CSV input and images. For text, the results mirror the Colab interface; for an uploaded image, if the whole text on it is considered offensive, it gets blurred; for an uploaded CSV file, both the whole text and each comment are tested individually.
- Finally, we encountered difficulties with CORS (we tested with Postman and mockapi) and still have not managed to find a solution. Also, small design flaws like not filtering uploads make us feel unsure about how the interface will behave online (on Heroku).
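One common approach we could still try for the CORS issue is the flask-cors extension on the Flask side; the route name, payload shape and allowed origins below are placeholders, not our actual API.

```python
from flask import Flask, jsonify, request
from flask_cors import CORS  # pip install flask-cors

app = Flask(__name__)
# Allow the Vue dev server / Heroku frontend to call the API routes.
CORS(app, resources={r"/api/*": {"origins": "*"}})

@app.route("/api/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True) or {}
    text = data.get("text", "")
    score = 0.0  # placeholder: call the trained model here
    return jsonify({"text": text, "offensive_score": score})

if __name__ == "__main__":
    app.run(port=5000)
```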