Skip to content

Corpus of scraped Speculative Fiction stories from 28 old and new webzines

Notifications You must be signed in to change notification settings

martysteer/webspecfi-corpora

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

webspecfi-corpora

This data repository contains scraped stories from some old and new web zines.

These are or were all public pages, either from the live web or the internet archive. All rights remain with the original authors and/or the magazines in which their imaginations were published.

The corpora should all be the same structure and can be combined into a larger corpus by concatenating all the corpus.jsonl files together, or merging the CSV files.

This public Nomic Atlas map has all the data uploaded, so you can explore it a bit there: https://atlas.nomic.ai/data/martysteer/webspecfi-corpora/map

nomic-atlas-word-embeds

Ad here is the current list of magazines and number of stories from each:

webspecfi-corpora git:(main) $ wc -l **/*.jsonl | sort -r 
   13976 total
    2990 Bewildering Stories/corpus.jsonl
    2597 Aphelion/corpus.jsonl
     964 ClarkesWorld/corpus.jsonl
     824 Beneath Ceasless Skies/corpus.jsonl
     807 Tor.com/corpus.jsonl
     606 Apex Magazine/corpus.jsonl
     587 Uncanny Magazine/corpus.jsonl
     519 Dargonzine/corpus.jsonl
     500 Weird Fiction Review/corpus.jsonl
     468 Abyss & Apex/corpus.jsonl
     445 Fireside Fiction/corpus.jsonl
     381 Metaphorosis/corpus.jsonl
     361 The Future Fire/corpus.jsonl
     255 MythAxis Magazine/corpus.jsonl
     235 Vestal Review/corpus.jsonl
     213 Cosmic Roots and Eldritch Shores/corpus.jsonl
     193 Bourbon Penn/corpus.jsonl
     160 Fabulist Magazine/corpus.jsonl
     154 Ibn Qirtaiba/corpus.jsonl
     134 Chaos Theory Tales Askew/corpus.jsonl
     121 Quanta/corpus.jsonl
     116 Three-lobed Burning Eye/corpus.jsonl
      94 Would That It Were/corpus.jsonl
      65 Atomjack/corpus.jsonl
      63 Betwixt/corpus.jsonl
      44 Scifi Dimensions/corpus.jsonl
      41 Darker Matter/corpus.jsonl
      39 Capricious SF/corpus.jsonl

And the published date distributions, with some validation errors:

webspecfi-corpora git:(main) ✗ cat **/corpus.jsonl | jq -r '.date'| cut -d- -f1 | sort | uniq -c | sort -r
1135 2023
 932 2022
 854 2020
 811 2017
 756 2018
 738 2016
 725 2015
 708 2011
 677 2021
 675 2019
 648 2013
 626 2014
 606 2012
 458 2010
 339 2009
 309 1999
 304 null
 293 2008
 291 2006
 233 2007
 230 2005
 195 2004
 186 2000
 157 2003
 154 2002
 149 1998
 142 2001
 141 1997
  30 1996
  28 1990
  28 1986
  22 1995
  21 1989
  21 1988
  19 1987
  13 1994
  12 Summer 2022
  11 1993
  10 1992
  10 1991
   9 Spring 2022
   9 November 2007
   9 February 2008
   8 May 2008
   7 December 2001
   7 August 2008
   6 January 1996
   5 November 2006
   5 May 2007
   5 May 2006
   5 January 2018
   4 Winter 2012
   4 Summer 2012
   4 Spring 2012
   4 September 2018
   4 September 2006
   4 September 1997
   4 November 1998
   4 May 2017
   4 May 2000
   4 March 2007
   4 July 2016
   4 July 2006
   4 January 1999
   4 February 2019
   4 December 2017
   4 December 2015
   4 August 1999
   4 August 1997
   4 April 2016
   4 April 2000
   3 September 2015
   3 September 2000
   3 September 1999
   3 October 2007
   3 October 1999
   3 October 1997
   3 November 1999
   3 November 1997
   3 May 1999
   3 June 1999
   3 June 1998
   3 June 1997
   3 July 2000
   3 July 1999
   3 July 1997
   3 January 2000
   3 February 2000
   3 February 1999
   3 December 1999
   3 December 1998
   3 August 1996
   2 September 1996
   2 October 1998
   2 November 1996
   2 November 1995
   2 May 2003
   2 May 1997
   2 May 1996
   2 March 2000
   2 March 1999
   2 March 1998
   2 March 1997
   2 June 2000
   2 June 1995
   2 July 1996
   2 January 2017
   2 January 2007
   2 January 1998
   2 February 1998
   2 February 1997
   2 December 2000
   2 December 1997
   2 August 2000
   2 August 1998
   2 April 1999
   2 April 1997
   1 September 1994
   1 October 2016
   1 May 1998
   1 March 1996
   1 March 1995
   1 March 1994
   1 June 1994
   1 July 2001
   1 December 1994
   1 December 1993
   1 April 1996

The data was scraped using wget and trafilatura, then the automatically detected metadata was manually cleanup a little bit, but overall it's not great metadata I'm afraid to say. Any improvements are welcome. Just make a pull reqeust.

Notes of the commands used to produce each of these are in the various README.md files, but I got a little lazy repeating myself over and over so they are incomplete. (hah!) Each magazine has a plain text list of the issue and story URL's that were used for scraping, so the entire corpus could easily be scraped again, if you have another usecase or improved pipeline (e.g. Jina AI's Reader API). The CSV files have the same metadata fields as the JSONL files, but without the full text.

This collection was created mid-2023 so some of the live magazine sites will have published new literature since then.

There are some amazing stories in here. Please use them fairly and transparently, and cite your sources. Thx.


Marty Steer, Freelance Computational Humanist :-P~ 2024

About

Corpus of scraped Speculative Fiction stories from 28 old and new webzines

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published