Processed ChEMBL30 chemreps, generated with:
python ProcessLibrary.py \
-i chembl_30_chemreps.csv \
-o chembl_30_chemreps_proc.smi \
-s SMILES -id name -o_sep ' ' --chunk_size 10000 --max_len 80
That is:
- Drop SMILES longer than 80 characters; desalt, neutralise, and canonicalise
- Deduplicate by SMILES
- Keep only SMILES whose tokens each occur more than 1000 times across the corpus
- Split with sklearn's train_test_split (0.95 train, 0.025 valid, 0.025 test)
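
The last two steps can be sketched roughly as follows. This is a minimal illustration, not the code that produced the files: the tokenisation regex is an assumption (the tokeniser actually used is not stated here), the helper names `tokenize`, `filter_by_token_frequency`, and `split_three_way` are hypothetical, and the split uses a plain seeded shuffle from the standard library as a stand-in for sklearn's train_test_split.

```python
import random
import re
from collections import Counter

# Assumed SMILES tokenisation: bracket atoms, two-letter halogens, ring-closure
# labels, then single-character atoms, bonds, and branches. Characters that
# match none of these alternatives are silently skipped.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[()=#\-+\\/:~@?>*$.]|\d"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def filter_by_token_frequency(smiles_list, min_count=1000):
    """Keep SMILES whose tokens all occur more than min_count times in the corpus."""
    tokenized = [tokenize(s) for s in smiles_list]
    # Count every token occurrence across the whole corpus.
    counts = Counter(tok for toks in tokenized for tok in toks)
    return [
        s
        for s, toks in zip(smiles_list, tokenized)
        if all(counts[t] > min_count for t in toks)
    ]


def split_three_way(items, valid_frac=0.025, test_frac=0.025, seed=0):
    """Shuffle with a fixed seed and cut into train / valid / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]  # the remaining ~0.95
    return train, valid, test
```

Calling `filter_by_token_frequency` on the deduplicated SMILES and passing the result to `split_three_way` would give the three subsets. With sklearn itself, the same proportions are usually obtained by calling train_test_split twice: once to hold out 5%, then once more to halve the held-out part.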