- 2/28: Changed `GraphScraper` to `GraphSearcher` in lab 6
In this project you will practice inheritance, graph search, and web scraping. You'll hand in a module called `scrape.py`.
Make sure to run the tests before handing in. During development, we recommend having a `debug.ipynb` notebook to make calls to your module. Run `python3 tester.py` often and work on fixing any issues.
You'll turn in one file (a Python module), `scrape.py`. It should have a comment like the following:

```python
# project: p3
# submitter: ????
# partner: none
# hours: ????
```
`scrape.py` will have the following:

- `GraphSearcher` (a class)
- `MatrixSearcher` (a class)
- `FileSearcher` (a class)
- `WebSearcher` (a class)
- `reveal_secrets` (a function)
For this portion of the project, you may collaborate with your group members in any way (even looking at working code). You may also seek help from 320 staff (mentors, TAs, instructor). You may not seek or receive help from other 320 students (outside your group) or anybody outside the course.
Complete the lab exercises if you haven't already done so: https://github.com/cs320-wisc/s22/blob/main/labs/lab6.md
Add a `bfs_search` method to `GraphSearcher`. It should behave the same as `dfs_search`, but use the BFS algorithm instead of DFS. The difference will be evident at the end if someone looks at the `.order` attribute.
Note that without changing `MatrixSearcher`, it now supports both DFS and BFS search, since it inherits from `GraphSearcher`.
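Here is a minimal sketch of what the method might look like; it assumes (as in the lab 6 code) that `GraphSearcher` tracks visit order in an `order` attribute and that subclasses implement `go(node)` to return a node's children. Adjust the attribute names to match your own lab code:

```python
from collections import deque

# inside GraphSearcher:
def bfs_search(self, start_node):
    self.visited = set()
    self.order = []
    todo = deque([start_node])     # FIFO queue is what makes this BFS, not DFS
    self.visited.add(start_node)
    while todo:
        node = todo.popleft()
        self.order.append(node)
        for child in self.go(node):
            if child not in self.visited:
                self.visited.add(child)
                todo.append(child)
```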
Add another class, `FileSearcher`, which also inherits from `GraphSearcher`. It should have three methods (besides those inherited): `__init__`, `go`, and `message`.
The nodes of this graph are files in the `file_nodes` directory. For example, `1.txt` contains this:

```
M
2.txt,4.txt
```

This means the value for node `1.txt` is "M", and the children of `1.txt` are `2.txt` and `4.txt`.
All the files will have two lines like this, with a value on the first line, and a comma-separated list of children on the second line.
The `go` method should read a node file and return a list of children. For example:

```python
import scrape
f = scrape.FileSearcher()
print(f.go("1.txt"))
```

Expected result: `['2.txt', '4.txt']`. You could test this by pasting the sample code into a `debug.ipynb` notebook in your `p3` directory.
`go` will also somehow record the values (1st lines) of the nodes that are visited, in order. The `message` method should return all the values concatenated together. Take a look at `bfs_test` in `tester.py` for an example of how this should work.
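One possible sketch under those requirements follows; it assumes `file_nodes` sits in your working directory and that `GraphSearcher` is defined earlier in `scrape.py`. The `values` attribute name is an illustration, not a requirement:

```python
import os

class FileSearcher(GraphSearcher):
    def __init__(self):
        super().__init__()
        self.values = []  # 1st lines of visited nodes, in visit order

    def go(self, node):
        with open(os.path.join("file_nodes", node)) as f:
            self.values.append(f.readline().strip())   # line 1: the value
            children = f.readline().strip()            # line 2: children list
        return children.split(",")

    def message(self):
        return "".join(self.values)
```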
In general, reading test cases is a great way to see how your classes are supposed to work. Specifications, like this document you're reading now, are sometimes ambiguous (sorry!), but test cases tell you what is expected from your code with complete precision.
It's often useful to copy/paste code snippets from `tester.py` to your `debug.ipynb` when you're debugging an issue, too.
Don't start this part until we've learned about Selenium in class and how to run it in "headless" mode.
For this part of the project, you'll need to install Chromium and a few packages on your VM:

```
pip3 install selenium==4.1.2 Flask lxml html5lib
sudo apt -y install chromium-browser
```
When it's all done, run both of the following, and verify that both commands print the same version, and that it is 98+ (like "98.X.X.X", though it may be a bigger number if there are browser updates before P3 is complete):

```
chromium-browser --version
chromium.chromedriver --version
```
Note: launching many web browsers via code can quickly eat up all the memory on your VM. You can run the `htop` command to see how much memory you have (hit "q" to quit when done). If you're low on memory (you might notice your VM being sluggish), you can run `pkill -f -9 chromium` to shut down all browser instances hanging around in the background.
You'll be scraping a website implemented as a web application built using the Flask framework (you don't need to know Flask for this project, though you'll learn it soon and get a chance to build your own website in the next project). In an SSH session, run the following to launch it:

```
python3 application.py
```
Then, open `http://<YOUR-VM-IP>:5000` in your web browser. Do not use the IP address that is printed to the console in the SSH session (it won't work). It should look like this:
If you click "TRAVEL HISTORY", you'll enter a graph of pages, each with a table fragment. Your job is to search the graph (using the search methods you wrote earlier), collect all the table fragments, and concatenate them into one big DataFrame.
Write a `WebSearcher` class that inherits from `GraphSearcher`. The constructor should take a Chrome webdriver object as a parameter, so that it is possible to create a `WebSearcher` object with `ws = WebSearcher(some_driver)`.
For example, one could run the following:

```python
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

options = Options()
options.headless = True
service = Service(executable_path="chromium.chromedriver")
driver = webdriver.Chrome(options=options, service=service)
ws = WebSearcher(driver)
```
The `go` method of `WebSearcher` should treat the node as a URL. It should use the webdriver to visit that page and return the URLs of other pages to which the visited page has hyperlinks. See `web_test` in the tester for examples of how it should behave.
The `go` method should also use the following to read any table fragments on a visited page and store them somewhere (for example, in an attribute): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
`WebSearcher` should have a `table()` method that concatenates all the fragments in the order they were visited and returns one big DataFrame. Use `ignore_index=True` when concatenating.
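Here is a minimal sketch of how these pieces might fit together; the `fragments` attribute name is an illustration, and it assumes every hyperlink is a plain `<a>` tag and every visited page has at least one table (otherwise `read_html` raises `ValueError`):

```python
import pandas as pd
from selenium.webdriver.common.by import By

class WebSearcher(GraphSearcher):
    def __init__(self, driver):
        super().__init__()
        self.driver = driver
        self.fragments = []   # DataFrames, in page-visit order

    def go(self, url):
        self.driver.get(url)
        # collect any table fragments on this page
        self.fragments.extend(pd.read_html(self.driver.page_source))
        # return the URLs this page links to
        links = self.driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in links]

    def table(self):
        return pd.concat(self.fragments, ignore_index=True)
```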
Here is a code snippet you can use as you write your methods to help test whether they're working:

```python
import os
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import scrape

# kill previous chrome instance if still around (to conserve memory)
os.system("pkill -f -9 chromium")

options = Options()
options.headless = True
service = Service(executable_path="chromium.chromedriver")
driver = webdriver.Chrome(options=options, service=service)

# TODO: use IP address of your VM
start_url = "http://YOUR_VM_IP:5000/Node_1.html"

s = scrape.WebSearcher(driver)
print(s.go(start_url))

s = scrape.WebSearcher(driver)
s.bfs_search(start_url)
print(s.table())

driver.close()
```
Expected output:

```
['http://YOUR_VM_IP:5000/Node_2.html', 'http://YOUR_VM_IP:5000/Node_4.html']
    clue   latitude   longitude                          description
0      1  43.089034  -89.416128              Picnic Point in Madison
1      7  38.105507  126.910613               Silver Beach in Hawaii
2      1  65.044901  -16.712836  Shore of a Volcanic Lake in Iceland
3      3  48.860945    2.335773                  The Louvre in Paris
4      8  51.180315   -1.829659                 Stonehenge in the UK
5      5  37.434183 -122.321990      Redwood forest in San Francisco
6      2  27.987586   86.925002                 Mt. Everest in Nepal
7      4  34.134117 -118.321495                 Hollywood Sign in LA
8      5  38.655100   90.061800                 Cahokia Mounds in IL
9      9  40.748400   73.985700          Empire State Building in NY
10     4  29.975300   31.137600        Great Sphinx of Giza in Egypt
11     1  47.557600   10.749800     Neuschwanstein Castle in Germany
12     5  38.624700   90.184800        The Gateway Arch in St. Louis
13     3  30.328500   35.444400                      Petra in Jordan
14     2  41.480800   82.683400                    Cedar Point in OH
15     6  43.070010  -89.409450          Quick Trip on Monroe Street
```
You have to do the remainder of this project on your own. Do not discuss it with anybody except 320 staff (mentors, TAs, instructor).
Write a function (remember that functions aren't inside any class) in `scrape.py` like the following:

```python
def reveal_secrets(driver, url, travellog):
    ....
```
The function should do the following:

1. generate a password from the "clue" column of the `travellog` DataFrame (see the sketch after this list). For example, if `travellog` is the big DataFrame built after doing BFS (as shown earlier), the password will start with "17138..."
2. visit `url` with the `driver`
3. automate typing the password in the box and clicking "GO"
4. wait until the page is loaded (perhaps with `time.sleep`)
5. click the "View Location" button and wait until the result finishes loading
6. save the image that appears to a file named `Current_Location.jpg` (use the `requests` module to do the download, once you get the URL from Selenium)
7. return the current location that appears on the page (should be "BASCOM HALL")
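For step 1, a minimal sketch (assuming the column is named "clue", as in the expected output above) might be:

```python
# concatenate the clue digits, in row order, into one password string
password = "".join(str(clue) for clue in travellog["clue"])
```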
Hints for step 6: JPEG files are a binary format (they don't contain text for a human to read). You'll need to do some searching online to learn how to (a) download binary data and (b) write it to a file. Remember to cite any code you copy/paste. Here are some example Google searches you might start with to find out how to do these things:
- "how to write bytes to a file in python"
- "how to fetch a binary file with python requests"