Mirroring infrastructure (acl-org#1124)
* Remove build_hugo, closes acl-org#1089

unused since the introduction of the Makefile ages ago

* Makefile: documentation, remove unneeded dependencies

hugo already depends on bibtex, mods, and endnote, so site does
not need to depend on them as well.

* Mirroring infrastructure

This commit adds
 - a script to download all ACL files not in the git repo
 - a configurable websites directory, including the ability
   to host in any subdir (/anthology/, ... also top-level)
 - reworked anthology-files directory, symlinked into the
   anthology web directory (and automatically adapted in the
   .htaccess file)
 - renamed constants in anthology/data.py, including facility
   to set them via environment variables
 - additional Makefile documentation
 - checks in the Makefile and fewer dependencies on phony tasks

create_mirror.py reads a list of anthology XML files, verifies the
checksums of files that have already been downloaded, downloads the
missing ones, and checks each download's checksum so that only correct
files end up in the download dir.  It can be parallelized by calling it
several times with different sets of XML files.
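
The verify-before-install flow described above could look roughly like this in shell. This is only a sketch, not the actual `create_mirror.py`: the function names `file_checksum`, `is_valid`, and `fetch_verified` are hypothetical, POSIX `cksum` stands in for whatever checksum the real script uses, and `curl` is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helpers; cksum is a stand-in for the real checksum.

# Prints the POSIX CRC of a file.
file_checksum() {
    cksum < "$1" | cut -d ' ' -f 1
}

# Succeeds if the file exists and matches the expected checksum.
is_valid() {
    [ -f "$1" ] && [ "$(file_checksum "$1")" = "$2" ]
}

# fetch_verified URL DEST EXPECTED_CHECKSUM
fetch_verified() {
    url=$1; dest=$2; expected=$3
    # Skip files that were downloaded correctly on an earlier run.
    if is_valid "$dest" "$expected"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Download to a temporary file so a broken transfer never lands in DEST.
    if curl -fsSL -o "$tmp" "$url" && is_valid "$tmp" "$expected"; then
        mkdir -p "$(dirname "$dest")" && mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Because each call only touches the files it is given, several such calls can run side by side on disjoint sets of files, which is the same idea as running `create_mirror.py` several times with different XML files.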

The website is now built under a "website" directory; otherwise
top-level builds would not be separate from other generated data.  The
Makefile creates a symlink inside the anthology directory to the path
where the anthology-files will be on the server.  Apache needs to
follow symlinks for this to work.
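
As an illustration of the symlink step, a minimal sketch under stated assumptions: `link_anthology_files` is a hypothetical helper, and it uses `ln -sfn` so that repeated runs replace the link, whereas the Makefile in this commit uses a plain `ln -s`.

```shell
#!/bin/sh
# link_anthology_files SITE_DIR FILES_DIR
# Creates SITE_DIR/anthology-files as a symlink to FILES_DIR.
link_anthology_files() {
    site_dir=$1; files_dir=$2
    # -n replaces an existing link instead of descending into it,
    # -f overwrites it, so rebuilding the site does not fail on the link.
    ln -sfn "$files_dir" "$site_dir/anthology-files"
}
```

On the server side, following the link typically requires something like `Options +FollowSymLinks` in the Apache configuration for the site directory; the exact setup depends on your installation.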

The environment variable ANTHOLOGY_PREFIX defines the host and
directory under which the site is supposed to be hosted.

The constants in anthology/data.py have been renamed (long-standing
TODO) and the canonical URL template has been separated from the
host and prefix used for hosting a copy.  It is therefore now possible
to host a mirror of only the HTML or HTML plus files.

* Workflow adjustments

No longer use GitHub secrets for mirroring. The GitHub "publish" workflow now uses a Make target to sync to the main server at aclweb.org.

Added a workflow that uses the new mirroring infrastructure to automatically create and (untested) remove branch previews at `https://aclanthology.org/previews/{branchname}`.

Co-authored-by: Matt Post <[email protected]>
akoehn and mjpost authored Apr 5, 2021
1 parent 837c132 commit 25c2ede
Showing 19 changed files with 516 additions and 116 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/preview.yml
@@ -0,0 +1,42 @@
name: preview

on:
  push:
    branches:
    - '*'
    - '!master'

jobs:
  preview:
    runs-on: ubuntu-20.04
    steps:
    - name: install hugo
      run: wget https://github.com/gohugoio/hugo/releases/download/v0.58.3/hugo_extended_0.58.3_Linux-64bit.deb && sudo dpkg -i hugo_extended*.deb
    - name: update
      run: sudo apt-get update
    - name: install other deps
      run: sudo apt-get install -y jing bibutils openssh-client rsync libyaml-dev libpython3.8-dev
    - name: dump secret key
      env:
        SSH_KEY: ${{ secrets.PUBLISH_SSH_KEY }}
      run: |
        mkdir -p $HOME/.ssh/
        echo "$SSH_KEY" > $HOME/.ssh/id_rsa
        chmod 600 $HOME/.ssh/id_rsa
    - uses: actions/checkout@v1
    - name: extract branch name
      shell: bash
      run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
      id: extract_branch
    - name: build
      shell: bash
      env:
        ANTHOLOGY_PREFIX: https://aclanthology.org/previews/${{ steps.extract_branch.outputs.branch }}
      run: |
        echo "Running make ANTHOLOGY_PREFIX=${ANTHOLOGY_PREFIX} NOBIB=true check site"
        make ANTHOLOGY_PREFIX=${ANTHOLOGY_PREFIX} NOBIB=true check site
    - name: preview
      env:
        ANTHOLOGY_PREFIX: https://aclanthology.org/previews/${{ steps.extract_branch.outputs.branch }}
      run: |
        make ANTHOLOGY_PREFIX=${ANTHOLOGY_PREFIX} NOBIB=true preview
9 changes: 5 additions & 4 deletions .github/workflows/publish.yml
@@ -25,11 +25,12 @@ jobs:
- uses: actions/checkout@v1
- name: build
env:
ANTHOLOGYHOST: ${{ secrets.PUBLISH_ANTHOLOGYHOST }}
ANTHOLOGY_PREFIX: https://www.aclweb.org/anthology
run: |
make ANTHOLOGYHOST=$ANTHOLOGYHOST check site
make ANTHOLOGY_PREFIX=${ANTHOLOGY_PREFIX} check site
- name: publish
env:
PUBLISH_TARGET: ${{ secrets.PUBLISH_TARGET }}
PUBLISH_TARGET: 50.87.169.12:anthology-static
ANTHOLOGY_PREFIX: https://www.aclweb.org/anthology
run: |
rsync -aze "ssh -o StrictHostKeyChecking=accept-new" --delete build/anthology/ $PUBLISH_TARGET
make ANTHOLOGY_PREFIX=${ANTHOLOGY_PREFIX} upload
34 changes: 34 additions & 0 deletions .github/workflows/remote-preview.yml
@@ -0,0 +1,34 @@
name: remove-preview

on:
  delete:
    branches:
    - '*'
    - '!master'

jobs:
  remove-preview:
    runs-on: ubuntu-20.04
    steps:
    - name: update
      run: sudo apt-get update
    - name: install other deps
      run: sudo apt-get install -y openssh-client rsync
    - name: dump secret key
      env:
        SSH_KEY: ${{ secrets.PUBLISH_SSH_KEY }}
      run: |
        mkdir -p $HOME/.ssh/
        echo "$SSH_KEY" > $HOME/.ssh/id_rsa
        chmod 600 $HOME/.ssh/id_rsa
    - uses: actions/checkout@v1
    - name: extract branch name
      shell: bash
      run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
      id: extract_branch
    - name: remove-preview
      env:
        BRANCH: ${{ steps.extract_branch.outputs.branch }}
      run: |
        echo "Would delete branch ${BRANCH}"
        echo ssh -o StrictHostKeyChecking=accept-new rm -rf /var/www/aclanthology.org/previews/${BRANCH}
101 changes: 78 additions & 23 deletions Makefile
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
#
# Copyright 2019 Arne Köhn <[email protected]>
# Copyright 2019-2021 Arne Köhn <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,23 +15,60 @@
# limitations under the License.

# Instructions:
# - if you edit the a command running python, make sure to
# run . $(VENV) && python3 -- this sets up the virtual environment.
# - all targets running python somewhere should have venv as a dependency.
# - if you edit a command running python, make sure to
# write . $(VENV) && python3 -- this sets up the virtual environment.
# if you just write "python3 foo.py" without the ". $(VENV) && " before,
# the libraries will not be loaded during run time.
# - all targets running python somewhere should have venv/bin/activate as a dependency.
# this makes sure that all required packages are installed.
# - Disable bibtex etc. targets by setting NOBIB=true (for debugging etc.)
# (e.g., make -j4 NOBIB=true)

SHELL = /bin/sh
ANTHOLOGYHOST := "https://www.aclweb.org"
ANTHOLOGYDIR := anthology

# If you want to host the anthology on your own, set ANTHOLOGY_PREFIX
# in your call to make to your prefix, e.g.
#
# ANTHOLOGY_PREFIX="https://example.com" make
#
# (There is no need to change the value here.) PLEASE NOTE that the prefix
# cannot contain any '#' character, or a Perl regex below will fail.
# The following line ensures that it is exported as an environment variable
# for all sub-processes

export ANTHOLOGY_PREFIX ?= https://www.aclweb.org/anthology

SLASHATEND:=$(shell echo ${ANTHOLOGY_PREFIX} | grep -q '/$$'; echo $$?)

ifeq (${SLASHATEND},0)
$(error ANTHOLOGY_PREFIX is not allowed to have a slash at the end.)
endif

# hugo wants to know the host and base dir on its own, so
# we sed the prefix into those parts.
ANTHOLOGYHOST := $(shell echo "${ANTHOLOGY_PREFIX}" | sed 's|\(https*://[^/]*\).*|\1|')
ANTHOLOGYDIR := $(shell echo "${ANTHOLOGY_PREFIX}" | sed 's|https*://[^/]*/\(.*\)|\1|')

# the regexp above only matches if we actually have a subdirectory.
# make the dir empty if only a tld was provided as the prefix.
ifeq ($(ANTHOLOGY_PREFIX),$(ANTHOLOGYDIR))
ANTHOLOGYDIR :=
endif

# We create a symlink from $ANTHOLOGYDIR/anthology-files to this dir
# to always have the same internal link to PDFs etc.
# This is the directory where you have to put all the papers and attachments.
ANTHOLOGYFILES ?= /var/www/html/anthology-files

HUGO_ENV ?= production

sourcefiles=$(shell find data -type f '(' -name "*.yaml" -o -name "*.xml" ')')
xmlstaged=$(shell git diff --staged --name-only --diff-filter=d data/xml/*.xml)
pysources=$(shell git ls-files | egrep "\.pyi?$$")
pystaged=$(shell git diff --staged --name-only --diff-filter=d | egrep "\.pyi?$$")

# these are shown in the generated html so everyone knows when the data
# was generated.
timestamp=$(shell date -u +"%d %B %Y at %H:%M %Z")
githash=$(shell git rev-parse HEAD)
githashshort=$(shell git rev-parse --short HEAD)
@@ -68,7 +105,7 @@ HAS_BIB2XML=$(shell which bib2xml > /dev/null && echo true || echo false)
VENV := "venv/bin/activate"

.PHONY: site
site: bibtex mods endnote hugo sitemap
site: build/.hugo build/.sitemap


# Split the file sitemap into Google-ingestible chunks.
@@ -77,10 +114,10 @@ site: bibtex mods endnote hugo sitemap
sitemap: build/.sitemap

build/.sitemap: venv/bin/activate build/.hugo
. $(VENV) && python3 bin/split_sitemap.py build/anthology/sitemap.xml
@rm -f build/anthology/sitemap_*.xml.gz
@gzip -9n build/anthology/sitemap_*.xml
@bin/create_sitemapindex.sh `ls build/anthology/ | grep 'sitemap_.*xml.gz'` > build/anthology/sitemapindex.xml
. $(VENV) && python3 bin/split_sitemap.py build/website/$(ANTHOLOGYDIR)/sitemap.xml
@rm -f build/website/$(ANTHOLOGYDIR)/sitemap_*.xml.gz
@gzip -9n build/website/$(ANTHOLOGYDIR)/sitemap_*.xml
@bin/create_sitemapindex.sh `ls build/website/$(ANTHOLOGYDIR)/ | grep 'sitemap_.*xml.gz'` > build/website/$(ANTHOLOGYDIR)/sitemapindex.xml
@touch build/.sitemap

.PHONY: venv
@@ -115,13 +152,14 @@ static: build/.static

build/.static: build/.basedirs $(shell find hugo -type f)
@echo "INFO Creating and populating build directory..."
@echo "INFO Split ${ANTHOLOGY_PREFIX} into HOST=${ANTHOLOGYHOST} DIR=${ANTHOLOGYDIR}"
@cp -r hugo/* build
@echo >> build/config.toml
@echo "[params]" >> build/config.toml
@echo " githash = \"${githash}\"" >> build/config.toml
@echo " githashshort = \"${githashshort}\"" >> build/config.toml
@echo " timestamp = \"${timestamp}\"" >> build/config.toml
@perl -pi -e "s/ANTHOLOGYDIR/$(ANTHOLOGYDIR)/g" build/index.html
@perl -pi -e "s#ANTHOLOGYDIR#$(ANTHOLOGYDIR)#g" build/website/index.html
@touch build/.static

.PHONY: yaml
@@ -202,16 +240,27 @@ build/.hugo: build/.static build/.pages build/.bibtex build/.mods build/.endnote
@echo "INFO Running Hugo... this may take a while."
@cd build && \
hugo -b $(ANTHOLOGYHOST)/$(ANTHOLOGYDIR) \
-d $(ANTHOLOGYDIR) \
-d website/$(ANTHOLOGYDIR) \
-e $(HUGO_ENV) \
--cleanDestinationDir \
--minify
@cd build/website/$(ANTHOLOGYDIR) \
&& perl -i -pe 's|ANTHOLOGYDIR|$(ANTHOLOGYDIR)|g' .htaccess
@cd build/website/$(ANTHOLOGYDIR) && ln -s $(ANTHOLOGYFILES) anthology-files
@touch build/.hugo

.PHONY: mirror
mirror: venv/bin/activate
. $(VENV) && bin/create_mirror.py data/xml/*xml

.PHONY: mirror-no-attachments
mirror-no-attachments: venv/bin/activate
. $(VENV) && bin/create_mirror.py --only-papers data/xml/*xml

.PHONY: test
test: hugo
diff -u build/anthology/P19-1007.bib test/data/P19-1007.bib
diff -u build/anthology/P19-1007.xml test/data/P19-1007.xml
diff -u build/website/$(ANTHOLOGYDIR)/P19-1007.bib test/data/P19-1007.bib
diff -u build/website/$(ANTHOLOGYDIR)/P19-1007.xml test/data/P19-1007.xml

.PHONY: clean
clean:
@@ -235,14 +284,14 @@ check_staged_xml:
fi

.PHONY: check_commit
check_commit: check_staged_xml venv
check_commit: check_staged_xml venv/bin/activate
@. $(VENV) && pre-commit run
@if [ ! -z "$(pystaged)" ]; then \
. $(VENV) && black --check $(pystaged) ;\
fi

.PHONY: autofix
autofix: check_staged_xml venv
autofix: check_staged_xml venv/bin/activate
@. $(VENV) && \
EXIT_STATUS=0 ;\
pre-commit run || EXIT_STATUS=$$? ;\
@@ -255,7 +304,7 @@ autofix: check_staged_xml venv
.PHONY: serve
serve:
@echo "INFO Starting a server at http://localhost:8000/"
@cd build && python3 -m http.server 8000
@cd build/website && python3 -m http.server 8000

# this target does not use ANTHOLOGYDIR because the official website
# only works if ANTHOLOGYDIR == anthology.
@@ -265,8 +314,14 @@ upload:
echo "WARNING: Can't upload because ANTHOLOGYDIR was set to '$(ANTHOLOGYDIR)' instead of 'anthology'"; \
exit 1; \
fi
@echo "INFO Running rsync..."
# main site
@rsync -azve ssh --delete build/anthology/ [email protected]:anthology-static
# aclanthology.org
# @rsync -azve ssh --delete build/anthology/ [email protected]:/var/www/html
@echo "INFO Running rsync for main site and mirror..."
# main site
@rsync -aze "ssh -o StrictHostKeyChecking=accept-new" --delete build/website/anthology/ [email protected]:anthology-static
# mirror
@rsync -aze "ssh -o StrictHostKeyChecking=accept-new" --delete build/website/anthology/ [email protected]:/var/www/aclanthology.org

# Push a preview to the mirror
.PHONY: preview
preview:
@echo "INFO Running rsync for the '${ANTHOLOGYDIR}' branch preview..."
@rsync -avze "ssh -o StrictHostKeyChecking=accept-new" --delete build/website/${ANTHOLOGYDIR}/ [email protected]:/var/www/aclanthology.org/${ANTHOLOGYDIR}
47 changes: 47 additions & 0 deletions README.md
@@ -65,6 +65,53 @@ The anthology can be viewed locally by running `hugo server` in the
`hugo/` directory. Note that it rebuilds the site and therefore takes
about a minute to start.


## Hosting a mirror of the ACL Anthology

A word of caution first: creating a mirror is slow and puts load on the
ACL Anthology infrastructure, because the initial setup downloads every
single file of the anthology from the official webserver. This can
take up to 8 hours no matter how fast *your* connection is, so please
don't play around with this just for fun.

If you want to host a mirror, you have to set two environment variables:
- `ANTHOLOGY_PREFIX`: the HTTP prefix under which your mirror will be
  reachable, e.g. `https://example.com/my-awesome-mirror` or
  `http://aclanthology.lst.uni-saarland.de`
  (note that there is no slash at the end!)
- `ANTHOLOGYFILES`: the directory under which papers, attachments etc.
  will reside on your webserver. This directory needs to be readable
  by your webserver but should not be a subdirectory
  of the anthology mirror directory.
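
To check a prefix before starting a long build, something like the following can help. It is a sketch that mirrors the `sed` expressions and the trailing-slash and `#` restrictions from the Makefile in this commit; the function names `check_prefix`, `prefix_host`, and `prefix_dir` are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the Makefile's handling of ANTHOLOGY_PREFIX.

# Succeeds only if the prefix has no trailing slash and no '#'.
check_prefix() {
    case "$1" in
        */)    echo "prefix must not end with a slash" >&2; return 1 ;;
        *'#'*) echo "prefix must not contain '#'" >&2; return 1 ;;
    esac
    return 0
}

# Prints scheme and host, e.g. https://example.com
prefix_host() {
    printf '%s\n' "$1" | sed 's|\(https*://[^/]*\).*|\1|'
}

# Prints the subdirectory, or an empty line for a top-level mirror.
prefix_dir() {
    dir=$(printf '%s\n' "$1" | sed 's|https*://[^/]*/\(.*\)|\1|')
    # sed leaves the input unchanged when there is no subdirectory.
    if [ "$dir" = "$1" ]; then dir=""; fi
    printf '%s\n' "$dir"
}
```

For `https://example.com/my-awesome-mirror` the host part is `https://example.com` and the directory part is `my-awesome-mirror`; for a prefix without a subdirectory the directory part is empty.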

With these variables set, run `make` to create the pages and
`make mirror` to download all additional files into the
`build/anthology-files` directory. If you have already created a mirror
before, only the missing files are downloaded.

If you want to mirror the papers but not all attachments, you can run
`make mirror-no-attachments` instead.

You then rsync the `build/website/` directory to your webserver or, if
you serve the mirror in a subdirectory `FOO`, rsync
`build/website/FOO` instead. The `build/anthology-files` directory needs to
be rsync-ed to the `ANTHOLOGYFILES` directory of your webserver.

As you probably want to keep the mirror up to date, you can modify the
shell script `bin/acl-mirror-cronjob.sh` to your needs.

You will need this software on the server:
- rsync
- git
- python3
- hugo > 0.58
- python3-venv

If you want the build process to be fast, install `cython3` and
`libyaml-dev` (see above).

Note that generating the anthology takes quite a bit of RAM, so make
sure it is available on your machine.

## Contributing

If you'd like to contribute to the ACL Anthology, please take a look at:
55 changes: 55 additions & 0 deletions bin/acl-mirror-cronjob.sh
@@ -0,0 +1,55 @@
#! /bin/bash
# -*- coding: utf-8 -*-
#
# Copyright 2021 Arne Köhn <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e
set -u

# modify these two variables to your needs.
# This is the URL under which your mirror will be accessible.
# Note: There is no slash at the end.
export ANTHOLOGY_PREFIX="https://example.com/aclmirror"

# The directory under which the HTML files will reside
export ANTHOLOGY_HTML_ROOT="/var/www/aclmirror"

# this is the directory under which the additional files
# will be stored. This directory will be symlinked
# into the ANTHOLOGY_HTML_ROOT and needs to be accessible
# by the webserver (depending on your configuration, it
# might not need to be under the www document root).
export ANTHOLOGYFILES="/var/www/html/anthology-files"

# This is the directory where the anthology git will be cloned
# to and the website will be built.
export GITDIR="/home/anthology/anthology-git-dir"

# initialize if necessary
if [[ ! -e $GITDIR ]]; then
mkdir -p $GITDIR
fi
cd $GITDIR
if [[ ! -e .git ]]; then
git clone https://github.com/acl-org/acl-anthology .
fi

ANTHOLOGYDIR=$(echo "${ANTHOLOGY_PREFIX}" | sed 's|https*://[^/]*/\(.*\)|\1|')

if git pull -q; then
make -j4
make mirror-no-attachments
rsync -av --delete build/website/$ANTHOLOGYDIR $ANTHOLOGY_HTML_ROOT
fi