deepspam2

DeepSpam milter v2 development

Project history:

deepspam1 tf2-keras model ported to pytorch
new model using SentencePiece and embedding from pretrained XLM
new metrics for model eval, optimized for spam filtering
dataset found to be wrong, requires a full review and cleanup :(
old mailer3 unable to handle the dataset properly (UTF-8, bad HTML, etc.)
mailer3 ported to python3 -> mailer4 initial version (same ui/keys)
wide unicode display issues -> wcwidth-based widechars module added
old html2txt used in deepspam1 found to be wrong -> new html parser developed
python's mime email parser found to be slow and sometimes broken -> implemented my own
mailer4: integrated deepspam model evaluation, see Screenshot-mailer4-diffmode.png
mailer4: added search, selection, deduplication, tagging features - tested on real data
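
The history mentions eval metrics optimized for spam filtering but does not say which ones. As a rough illustration only, the sketch below assumes the usual asymmetry in spam filtering: a false positive (legitimate mail flagged as spam) costs more than a false negative. The function name `spam_metrics`, the `threshold` default and the `fp_weight` cost ratio are all hypothetical and are not taken from torch_eval.py or any other file in this repo.

```python
def spam_metrics(scores, labels, threshold=0.5, fp_weight=10.0):
    """Summarize classifier errors with false positives weighted more heavily.

    scores: spam probabilities in [0, 1]; labels: 1 for spam, 0 for ham.
    fp_weight is a hypothetical cost ratio, not a value from the project.
    """
    fp = fn = ham = spam = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1:
            spam += 1
            if not flagged:
                fn += 1  # spam that slipped through
        else:
            ham += 1
            if flagged:
                fp += 1  # legitimate mail flagged as spam
    # Cost if every message were misclassified; used to normalize to [0, 1].
    worst_case = fp_weight * ham + spam
    return {
        "fp_rate": fp / ham if ham else 0.0,
        "fn_rate": fn / spam if spam else 0.0,
        "weighted_cost": (fp_weight * fp + fn) / worst_case if worst_case else 0.0,
    }
```

Comparing `weighted_cost` across several thresholds is one way to pick an operating point that keeps the false positive rate low.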

torch_emb.py: direct port of deepspam1's model training code to pytorch
torch_spm3.py: new model trainer, uses the python class from model/ dir
maildedup3.py: email deduplication and parsing, from mbox to txt
mailer4.py: Python3 version of my old email reader, used primarily for spam-dataset preparation & model eval
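
A minimal sketch of the deduplication idea behind maildedup3.py, assuming duplicates are detected by hashing a whitespace- and case-normalized plain-text body. This is not the script's actual method. For brevity it uses Python's standard `email` parser, which the project itself replaced with eml2str.py, and the helper names `body_text`, `dedup_key` and `dedup` are hypothetical.

```python
import hashlib
import re
from email import message_from_string


def body_text(msg):
    """Return the concatenated text/plain parts of a parsed message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def dedup_key(msg):
    """Hash the body after collapsing whitespace and lowercasing."""
    text = re.sub(r"\s+", " ", body_text(msg)).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dedup(raw_messages):
    """Keep the first message seen for each normalized body."""
    seen = set()
    unique = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        key = dedup_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Hashing only the body means the same spam sent from different senders or with different subjects collapses into one entry, which suits dataset cleanup. Near-duplicates with small text changes would need a fuzzier key.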

eml2str.py: my mime email parser and html2txt converter functions
hdrdecode.py: my email header parser functions, from the pymavis/spamwall project
ttykeymap.py: python version of my old getch2.c, console i/o functions and text UI
widechars.py: wide unicode utilities, based on https://github.com/jquast/wcwidth
striprtf.py: rtf to txt converter, from https://github.com/joshy/striprtf
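
To illustrate the wide unicode display problem that widechars.py addresses, here is a simplified sketch using only the standard `unicodedata` module. It is an approximation, not the wcwidth algorithm or the code in widechars.py: it treats East Asian wide and fullwidth characters as two columns and combining marks as zero, and ignores control characters and other zero-width cases. The function names are hypothetical.

```python
import unicodedata


def char_width(ch):
    """Approximate terminal column width of a single character."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn over the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two columns
    return 1


def display_width(text):
    """Approximate number of terminal columns needed to show text."""
    return sum(char_width(ch) for ch in text)


def pad_to_width(text, width):
    """Right-pad text with spaces so it fills `width` terminal columns."""
    return text + " " * max(0, width - display_width(text))
```

A text UI that pads with `len()` instead of a column-aware width misaligns any row that contains CJK text, which is the kind of issue the history describes.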