Subreddit Generator

Train a neural network optimized for generating Reddit subreddit posts based off of any number of subreddits! Subreddit Generator automatically downloads as many Reddit submissions as you want from as many subreddits as you want. Blend wildly different subreddits for hilarity!

Subreddit Generator is based off of textgenrnn, and trains the network using context labels for better submission synthesis.

Usage

After installing the dependencies, open config.yml and at the top, add the project_id of a Google Compute Engine project owned by an account with access to BigQuery (free). Below that, specify the list of subreddit(s) you wish to retrieve. You can then change the time horizon of Reddit data to check using start_month and end_month (between December 2015 and December 2017), and change the number of top submissions retieved from each subreddit during that timeframe. You can also configure the num_epochs and whether to use the pretrained model or train a new_model. Then simply run:

python3 subreddit_generator.py

On the first time running the script, the console will ask you to authenticate with Google; do so.

The script will automatically save the weights (+ config and vocab for if new_model) for the trained model, which can then be loaded into textgenrnn and used anywhere.

The included askreddit_weights.hdf5 file + relevant config info was trained on the top 50,000 /r/AskReddit submissions in 2017. You can load it and generate text from it simply with:

from textgenrnn import textgenrnn
textgen=textgenrnn(weights_path="askreddit_weights.hdf5",
                   config_path="askreddit_config.json",
                   vocab_path="askreddit_vocab.json")
textgen.generate_samples()

You can view examples of the output at various temperatures in the /example_output folder.

Warning

Google BigQuery gives 1 TB of data processing for free, and it will only charge for the data processed. For the default time range of 2017, BigQuery will consume 8.68 GB worth of data (regardless of how many subreddits and how many submissions you retrieve), which gives you plenty of leeway. Adjusting the time frame consumes data proportionately.

Requirements

textgenrnn
tensorflow (either CPU or GPU flavors)
pandas
pandas-gbq

Maintainer

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Credits

Jason Baumgartner for collecting the Reddit data, and Felipe Hoffa for putting it in BigQuery.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
example_output		example_output
LICENSE		LICENSE
README.md		README.md
askreddit_config.json		askreddit_config.json
askreddit_vocab.json		askreddit_vocab.json
askreddit_weights.hdf5		askreddit_weights.hdf5
config.yml		config.yml
query.py		query.py
requirements.txt		requirements.txt
subreddit_generator.py		subreddit_generator.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Subreddit Generator

Usage

Warning

Requirements

Maintainer

License

Credits

About

Releases

Packages

Languages

License

minimaxir/subreddit-generator

Folders and files

Latest commit

History

Repository files navigation

Subreddit Generator

Usage

Warning

Requirements

Maintainer

License

Credits

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages