Silly script to periodically check web pages for changes.
No bullshit gem from hell.
- The script is run
- For every URL to check, it checks whether we've waited long enough since the last check
- It pulls the interesting HTML from that page
- If the content is different, it alerts you with the new content
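The steps above can be pictured with a small sketch. This is not webwatchr's actual code: the due? and check helpers and the state Hash are made up for illustration, and the block stands in for "fetch the page and extract the interesting part". In this sketch the very first run counts as a change.

```ruby
require "digest"

# Hypothetical helpers, only meant to illustrate the loop described above.

# True when at least `every` seconds have passed since the last check.
def due?(last_checked, every, now)
  last_checked.nil? || (now - last_checked) >= every
end

# Runs one check. `state` is a Hash holding :last_checked and :digest.
# Returns the new content when it changed, nil otherwise.
def check(state, every: 3600, now: Time.now)
  return nil unless due?(state[:last_checked], every, now)

  content = yield
  digest = Digest::SHA256.hexdigest(content)
  changed = digest != state[:digest]
  state[:last_checked] = now
  state[:digest] = digest
  changed ? content : nil
end

state = {}
t0 = Time.at(0)
first = check(state, now: t0) { "<table>v1</table>" }        # first run: content is new
early = check(state, now: t0 + 60) { "<table>v2</table>" }   # too soon: skipped
later = check(state, now: t0 + 3600) { "<table>v2</table>" } # due, and content changed
```

Storing a digest rather than the whole page keeps the state small, at the cost of not being able to show a diff later.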
# Installation
apt-get install ruby ruby-nokogiri
# if you want fancier Diffs, for DiffString objects, apt install ruby-diffy
apt install ruby-mechanize
git clone https://github.com/conchyliculture/webwatchr/
cd webwatchr
cp config.json.template config.json
# Take a breath here, it's going to be alright
rvm implode
gem uninstall --all
sudo apt-get remove -y --purge rubygems-integration rubygems rake bundler
sudo find / -name ".rvm" -exec rm -rf "{}" \;
Then edit config.json to your needs, and enable some sites for checking by symlinking from sites-available into sites-enabled.
Run the cron often:
*/5 * * * * cd /home/poil/my_fav_scripts/webwatchr; ruby webwatchr.rb
For the following websites, only the "interesting" information is extracted from the page, instead of using the whole HTML page.
- Bandcamp merch pages
- Dealabs
- Package tracking (DHL, Colissimo, i-parcel, Royalmail, PostNL, UPS, USPS)
- galaxus/digitec daily deals
- Noquarterprod
- Qwertee
- Trello
- Twitter (via Nitter instances)
Some of these have been such a pain in the ass to scrape that I resorted to using their (usually terrible) APIs (e.g. USPS).
Just make a file sites-enabled/mysites.rb and append new pages to the end as new instances of the Site::SimpleString class.
By default it will check each page every hour.
#!/usr/bin/ruby
require_relative "../lib/site.rb"

Site::SimpleString.new(
  url: "https://www.google.com",
  test: __FILE__ == $0 # This is so you can run ruby mysites.rb to check your code
).update

Site::SimpleString.new(
  url: "https://www.google.es",
  test: __FILE__ == $0 # This is so you can run ruby mysites.rb to check your code
).update
## Extract part of the DOM first
Basically, just make a new sites-enabled/mysite.rb using one of the two examples below, then override the get_content() method.
Use @parsed_content, which is a Nokogiri parsed HTML document. Check out sites-available/dhl.rb, which is a simple example.
Also override the to_html() method if you want to change how the new content will be formatted.
In the following example, the HTML code of the first <table> element matching table.result-summary is used as the content to check for updates: every time it changes, you get alerted.
#!/usr/bin/ruby
# encoding: utf-8
require_relative "../lib/site.rb"

# We subclass Site::SimpleString, as the result of get_content() will be a String
class Mysite < Site::SimpleString
  def get_content()
    # @parsed_content is the result of Nokogiri.parse(html of https://www.mydomistoobig.pt)
    return @parsed_content.css("table.result-summary")[0].to_s
  end
end

Mysite.new(
  url: "https://www.mydomistoobig.pt",
  every: 10*60, # Check every 10 minutes
  test: __FILE__ == $0
).update
Move that into sites-enabled, and you're good to go.
In the following example, every run of the code fetches an array of things that I call "articles". Only new articles that have never been seen before will be sent.
#!/usr/bin/ruby
# encoding: utf-8
require_relative "../lib/site.rb"

class Mysite < Site::Articles
  # This time, get_content calls add_article() with a Hash for each article
  def get_content()
    # Parses the DOM and registers every article it finds
    #
    # If DOM is:
    # <div class="article">
    #   <a href="http://lol/article/1.html">Lol 1</a>
    # </div>
    # <div class="article">
    #   <a href="http://lol/article/2.html">Lol 2</a>
    # </div>
    #
    # the articles added are:
    # [{'id' => 'http://lol/article/1.html', 'url' => 'http://lol/article/1.html', 'title' => 'Lol 1'},
    #  {'id' => 'http://lol/article/2.html', 'url' => 'http://lol/article/2.html', 'title' => 'Lol 2'}]
    #
    # If for example the previous run only added the following
    # [{'id' => 'http://lol/article/1.html', 'url' => 'http://lol/article/1.html', 'title' => 'Lol 1'}]
    # A mail will be sent containing just HTML for the second article
    @parsed_content.css("div.article").each do |article|
      anchor = article.at_css("a") # first <a> element inside this article
      link = anchor["href"]
      title = anchor.text
      add_article({
        "id" => link, # This needs to be unique, per Article
        # Magic keys for a nice html ul/li message
        "url" => link,
        "title" => title
      })
    end
    # This time we don't return anything
  end
end

Mysite.new(
  url: "https://www.mydomistoobig.pt",
  every: 10*60, # Check every 10 minutes
  test: __FILE__ == $0
).update()
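To make the "only new articles are sent" idea concrete, here is a minimal sketch of that bookkeeping. It is an illustration only, not how Site::Articles is actually implemented: the SeenArticles class and its new_articles method are made-up names, and the state is kept in memory instead of being saved between runs.

```ruby
require "set"

# Hypothetical stand-in for the "have I seen this article already?" logic.
class SeenArticles
  def initialize(seen_ids = [])
    @seen = Set.new(seen_ids)
  end

  # Returns only the articles whose "id" has not been seen before,
  # and remembers them for the next run.
  def new_articles(articles)
    # Set#add? returns nil when the id was already present
    articles.select { |article| @seen.add?(article["id"]) }
  end
end

tracker = SeenArticles.new
run1 = tracker.new_articles([
  { "id" => "http://lol/article/1.html", "title" => "Lol 1" }
])
run2 = tracker.new_articles([
  { "id" => "http://lol/article/1.html", "title" => "Lol 1" },
  { "id" => "http://lol/article/2.html", "title" => "Lol 2" }
])
puts run2.map { |article| article["title"] }.inspect
```

This is why the "id" key needs to be unique per article: two different articles sharing an id would be treated as the same one, and the second would never be sent.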
Just do ruby sites-available/mysite.rb. It will run, and display what it would alert you with, without updating the state.
If everything looks right, cd sites-enabled; ln -s ../sites-available/mysite.rb .
## I need to do more complex stuff!
If you need to do weird things like authentication, session handling, form posting and whatnot, I've been playing around with Mechanize and Curb, which are kind of nice, and also have proper Debian packages.
## I need to do even more complex stuff!
If you need javascript... well... lol. I'll probably have to use Selenium one day but the later the better.
To run a site update at a specific time/day with a crontab, instead of every specified amount of time, you can force the update of a website using the -s flag:
ruby webwatchr.rb -s lol.rb
Make sure lol.rb is in the sites-available directory.
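For the curious, picking a single site from a -s flag could look roughly like the sketch below. This is a guess at the general shape, not webwatchr's real option handling: select_sites is a made-up helper, and the list of available files is passed in by hand instead of being read from the sites-available directory.

```ruby
require "optparse"

# Hypothetical helper: returns the site files that should be run.
# With -s FILE only that file is returned, otherwise every available one.
def select_sites(argv, available)
  forced = nil
  parser = OptionParser.new do |opts|
    opts.on("-s FILE", "Force the update of a single site") { |file| forced = file }
  end
  parser.parse(argv)

  return available if forced.nil?
  raise ArgumentError, "#{forced} is not in sites-available" unless available.include?(forced)

  [forced]
end

puts select_sites(["-s", "lol.rb"], ["dhl.rb", "lol.rb"]).inspect
```

Refusing an unknown file name up front gives a clearer error than failing later while trying to load it.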
## POST?
If you need to actually fetch your URL using a POST HTTP request, add post_data as an argument when instantiating your new class:
postnl_id = "RSAAAAAAAAAAAAA"
PostNL.new(
  url: "http://www.postnl.post/details/",
  post_data: {"barcodes" => postnl_id},
  every: 30*60,
  test: __FILE__ == $0
).update
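If you wonder what such a POST looks like on the wire, here is a sketch using Ruby's standard Net::HTTP. It only builds the request and never sends it. The build_post helper is made up for this example, and webwatchr itself may well build its requests differently.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a form-encoded POST request.
def build_post(url, post_data)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri)
  # Encodes the Hash as key=value pairs and sets the matching Content-Type
  request.set_form_data(post_data)
  request
end

request = build_post("http://www.postnl.post/details/", { "barcodes" => "RSAAAAAAAAAAAAA" })
puts request.method
puts request.body
```

To actually send it, you would pass the request to Net::HTTP.start(host, port) { |http| http.request(request) }.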
Run the tests with ruby tests/test.rb
You can use the @logger Logger object in your mysite.rb.
Set the log file in config.json under the "log" key
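As an illustration of logging from a site, here is a self-contained sketch using Ruby's standard Logger. The LoggingSite class is a made-up stand-in: in a real mysite.rb the @logger object is already there for you, and it writes to the file set under the "log" key instead of a StringIO.

```ruby
require "logger"
require "stringio"

# Hypothetical site class, only here to show calls on a Logger object.
class LoggingSite
  def initialize(logger)
    @logger = logger
  end

  def get_content
    @logger.debug("starting extraction")
    content = "<table>42</table>"
    @logger.info("extracted #{content.size} characters")
    content
  end
end

log_output = StringIO.new # stands in for the log file from config.json
site = LoggingSite.new(Logger.new(log_output))
site.get_content
puts log_output.string
```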
Email is the main method of alerting, but you can also set webwatchr to talk to you on Telegram through a bot.
First make a bot and grab a token following the Telegram procedure.
You also need to know the chat_id for its discussion with you. The code in there can help you.
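One way to find the chat_id, sketched here as an assumption rather than the method the project recommends: send any message to your bot, fetch https://api.telegram.org/bot<token>/getUpdates (the Telegram Bot API getUpdates method), and read the chat id out of the JSON answer. The chat_ids helper and the sample answer below are made up for illustration.

```ruby
require "json"

# Hypothetical helper: extracts the distinct chat ids from the JSON body
# of a getUpdates answer.
def chat_ids(get_updates_body)
  updates = JSON.parse(get_updates_body).fetch("result", [])
  updates.map { |update| update.dig("message", "chat", "id") }.compact.uniq
end

# Made-up sample answer, trimmed to the fields used here.
sample = <<~BODY
  {"ok": true, "result": [
    {"update_id": 1,
     "message": {"message_id": 1,
                 "chat": {"id": 123456, "type": "private"},
                 "text": "hi"}}
  ]}
BODY

puts chat_ids(sample).inspect
```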
Install some dependencies from the one and only repo you should kind of trust:
apt install ruby-virtus ruby-inflecto ruby-faraday
Then grab the code for the Telegram bot client. Run this from inside the webwatchr directory:
CURDIR="$(pwd)"
cd /tmp
git clone https://github.com/atipugin/telegram-bot-ruby
mv telegram-bot-ruby/lib/telegram "$CURDIR/lib/"
cd -
Then edit the config.json file accordingly.