Need help with the uploaded assignment. It involves the use of splay trees, BFS, and string parsing techniques.
Please have a look at the YouTube links (Part 1, Part 2) in the assignment to understand the detailed requirements. I need help with the entire assignment and do have starter files.
This also involves writing test functions to ensure the code in the main file is working correctly.
Attachments
Part 1: https://www.youtube.com/watch?v=LAWEkvhhS-Y
Part 2: https://www.youtube.com/watch?v=09k5KNVLNpk
wikiracer/py_wikiracer/internet.py
from urllib.request import urlopen
from pathlib import Path
import base64

FILE_CACHE_DIR = "wiki_cache"


class Internet:
    DISALLOWED = [":", "#", "/", "?"]

    """
    This class represents the Internet. Feel free to look,
    but do NOT edit this file. Also, do NOT access the
    Internet in any way except through this class,
    or you will see unexpected results.

    get_page(page) will return the HTML code from a given page.
    page should be formatted as "/wiki/Something", which will
    give the HTML for the page https://en.wikipedia.org/wiki/Something.

    get_random() is a function you can use if you desire. It will return the HTML
    of a random page on Wikipedia.

    Usage of Internet:
    internet = Internet()
    html = internet.get_page("/wiki/Computer_science")
    print(html)
    """

    def __init__(self):
        self.requests = []

    def get_page(self, page):
        if page[:6] != "/wiki/":
            raise ValueError(f"Links must start with /wiki/. {page} is not valid.")
        if any(i in page[6:] for i in Internet.DISALLOWED):
            raise ValueError(f"Link cannot contain disallowed character. {page} is not valid.")
        self.requests.append(page)
        return Internet.__get_page_internal(page)

    # You may find this useful in your wikiracer implementation.
    def get_random(self):
        return urlopen(f"https://en.wikipedia.org/wiki/Special:Random").read().decode('utf-8')

    @staticmethod
    def __get_page_internal(page):
        # First see if we have it in the local cache, to reduce the number of spam requests to Wikipedia
        file_cache_dir_path = Path(FILE_CACHE_DIR)
        if not file_cache_dir_path.is_dir():
            file_cache_dir_path.mkdir()

        # Convert page to a filesystem safe name
        safe_name = base64.urlsafe_b64encode(page.encode("utf-8")).decode("utf-8")
        local_path = file_cache_dir_path / safe_name
        if local_path.is_file():
            return local_path.read_text()

        html = urlopen(f"https://en.wikipedia.org{page}").read().decode('utf-8')
        # write to file cache
        local_path.write_text(html)
        return html
wikiracer/py_wikiracer/wikiracer.py
from py_wikiracer.internet import Internet
from typing import List


class Parser:

    @staticmethod
    def get_links_in_page(html: str) -> List[str]:
        """
        In this method, we should parse a page's HTML and return a list of links in the page.
        Be sure not to return any link with a DISALLOWED character.
        All links should be of the form "/wiki/<page name>", as to not follow external links
        """
        links = []
        disallowed = Internet.DISALLOWED

        # YOUR CODE HERE
        # You can look into using regex, or just use Python's find methods to find the <a> tags or any other identifiable features
        # A good starting place is to print out `html` and look for patterns before/after the links that you can string.find().
        # Make sure your list doesn't have duplicates. Return the list in the same order as they appear in the HTML.
        # This function will be stress tested so make it efficient!

        return links


# In these methods, we are given a source page and a goal page, and we should return
# the shortest path between the two pages. Be careful! Wikipedia is very large.
# These are all very similar algorithms, so it is advisable to make a global helper function that does all of the work, and have
# each of these call the helper with a different data type (stack, queue, priority queue, etc.)

class BFSProblem:
    def __init__(self):
        self.internet = Internet()

    # Example in/outputs:
    # bfs(source = "/wiki/Computer_science", goal = "/wiki/Computer_science") == ["/wiki/Computer_science"]
    # bfs(source = "/wiki/Computer_science", goal = "/wiki/Computation") == ["/wiki/Computer_science", "/wiki/Computation"]
    # Find more in the test case file.
    # Do not try to make fancy optimizations here. The autograder depends on you following standard BFS and will check all of the pages you download.
    # Links should be inserted into the queue as they are located in the page, and should be obtained using Parser's get_links_in_page.
    # Be very careful not to add things to the "visited" set of pages too early. You must wait for them to come out of the queue first. See if you can figure out why.
    # This applies for bfs, dfs, and dijkstra's.
    # Download a page with self.internet.get_page().
    def bfs(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path  # if no path exists, return None


class DFSProblem:
    def __init__(self):
        self.internet = Internet()

    # Links should be inserted into a stack as they are located in the page. Do not add things to the visited list until they are taken out of the stack.
    def dfs(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path  # if no path exists, return None


class DijkstrasProblem:
    def __init__(self):
        self.internet = Internet()

    # Links should be inserted into the heap as they are located in the page.
    # By default, the cost of going to a link is the length of a particular destination link's name. For instance,
    # if we consider /wiki/a -> /wiki/ab, then the default cost function will have a value of 8.
    # This cost function is overridable and your implementation will be tested on different cost functions. Use costFn(node1, node2)
    # to get the cost of a particular edge.
    # You should return the path from source to goal that minimizes the total cost. Assume cost > 0 for all edges.
    def dijkstras(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia", costFn = lambda x, y: len(y)):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path  # if no path exists, return None


class WikiracerProblem:
    def __init__(self):
        self.internet = Internet()

    # Time for you to have fun! Using what you know, try to efficiently find the shortest path between two wikipedia pages.
    # Your only goal here is to minimize the total amount of pages downloaded from the Internet, as that is the dominating time-consuming action.
    # Your answer doesn't have to be perfect by any means, but we want to see some creative ideas.
    # One possible starting place is to get the links in `goal`, and then search for any of those from the source page, hoping that those pages lead back to goal.
    # Note: a BFS implementation with no optimizations will not get credit, and it will suck.
    # You may find Internet.get_random() useful, or you may not.
    def wikiracer(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path  # if no path exists, return None


# KARMA
class FindInPageProblem:
    def __init__(self):
        self.internet = Internet()

    # This Karma problem is a little different. In this, we give you a source page, and then ask you to make up some heuristics that will allow you to efficiently
    # find a page containing all of the words in `query`. Again, optimize for the fewest number of internet downloads, not for the shortest path.
    def find_in_page(self, source = "/wiki/Calvin_Li", query = ["ham", "cheese"]):
        raise NotImplementedError("Karma method find_in_page")

        path = [source]
        # find a path to a page that contains ALL of the words in query in any place within the page
        # path[-1] should be the page that fulfills the query.
        # YOUR CODE HERE
        return path  # if no path exists, return None
wikiracer/py_wikiracer/__init__.py
wikiracer/tests/test_search.py
import random
import sys
from typing import Callable, Iterator
from itertools import chain
from collections import defaultdict
from types import ModuleType
from importlib import reload
from urllib.request import urlopen
import pytest
from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser, BFSProblem, DFSProblem, DijkstrasProblem, WikiracerProblem
REQ_LIMIT = 75 # per test, normally
def test_parser():
    internet = Internet()
    html = internet.get_page("/wiki/Henry_Krumrey")
    assert Parser.get_links_in_page(html) == ['/wiki/Wisconsin_State_Senate', '/wiki/Wisconsin_Senate,_District_20', '/wiki/Wisconsin_State_Assembly', '/wiki/Plymouth,_Sheboygan_County,_Wisconsin', '/wiki/Republican_Party_(United_States)', '/wiki/Sheboygan_County,_Wisconsin', '/wiki/United_States_presidential_election_in_Wisconsin,_1900', '/wiki/Crystal_Lake,_Illinois', '/wiki/Henry_Krumrey', '/wiki/Main_Page']


def test_trivial():
    """
    All pages contain a link to themselves, which any search algorithm should recognize.
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()
    assert bfs.bfs(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert dfs.dfs(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert dij.dijkstras(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert bfs.internet.requests == ["/wiki/ASDF"]
    assert dfs.internet.requests == ["/wiki/ASDF"]
    assert dij.internet.requests == ["/wiki/ASDF"]


def test_trivial_2():
    """
    Searches going to page 1 distance away.
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()
    assert bfs.bfs(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert dfs.dfs(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert dij.dijkstras(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert bfs.internet.requests == ["/wiki/Reese_Witherspoon"]
    assert dfs.internet.requests == ["/wiki/Reese_Witherspoon"]
    assert dij.internet.requests == ["/wiki/Reese_Witherspoon"]


def test_bfs_basic():
    """
    BFS depth 2 search
    """
    bfs = BFSProblem()
    assert bfs.bfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == ['/wiki/Calvin_Li', '/wiki/Chinese_language', '/wiki/Wikipedia']
    assert bfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Chinese_name', '/wiki/Chinese_surname', '/wiki/Li_(surname_%E6%9D%8E)', '/wiki/Wuhan', '/wiki/Hubei', '/wiki/Central_Academy_of_Drama', '/wiki/All_Men_Are_Brothers_(TV_series)', '/wiki/Chinese_language']


def test_dfs_basic():
    """
    DFS depth 2 search
    """
    dfs = DFSProblem()
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikipedia']
    assert dfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Main_Page']
def test_dijkstras_basic():
    """
    Dijkstra's depth 2 search
    """
    dij = DijkstrasProblem()
    # This costFn is to make sure there are never any ties coming out of the heap, since the default costFn produces ties and we don't define a tiebreaking mechanism for priorities
    assert dij.dijkstras(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia", costFn = lambda y, x: len(x) * 1000 + x.count("a") * 100 + x.count("u") + x.count("h") * 5 - x.count("F")) == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikipedia']
    assert dij.internet.requests == ['/wiki/Calvin_Li', '/wiki/Hubei', '/wiki/Wuxia', '/wiki/Wuhan', '/wiki/Pinyin', '/wiki/Firefox', '/wiki/Tencent', '/wiki/Wu_Yong', '/wiki/Cao_Cao', '/wiki/John_Woo', '/wiki/Kelly_Lin', '/wiki/Sina_Corp', '/wiki/Huo_Siyan', '/wiki/Shawn_Yue', '/wiki/Main_Page']
class CustomInternet():
    def __init__(self):
        self.requests = []

    def get_page(self, page):
        self.requests.append(page)
        return f'<a href="{page}"></a>'


def test_none_on_fail():
    """
    Program should return None on failure
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()

    # Testing hack: override the internet to inject our own HTML
    bfs.internet = CustomInternet()
    dfs.internet = CustomInternet()
    dij.internet = CustomInternet()

    assert bfs.bfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert dij.dijkstras(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert bfs.internet.requests == ["/wiki/Calvin_Li"]
    assert dfs.internet.requests == ["/wiki/Calvin_Li"]
    assert dij.internet.requests == ["/wiki/Calvin_Li"]


def test_dfs_complex():
    """
    A complex DFS example to test your searching algorithm.
    """
    dfs = DFSProblem()
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Quebecor") == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikimedia_Foundation', '/wiki/VIAF_(identifier)', '/wiki/Virtual_International_Authority_File', '/wiki/Interested_Parties_Information', '/wiki/Law', '/wiki/Human_science', '/wiki/Truth', '/wiki/Verstehen', '/wiki/Phronesis', '/wiki/Knowledge', '/wiki/Max_Weber', '/wiki/Trove_(identifier)', '/wiki/Trove', '/wiki/The_Sydney_Morning_Herald', '/wiki/OzTAM', '/wiki/Canwest', '/wiki/Pembroke_Daily_Observer', '/wiki/Postmedia_News', '/wiki/Postmedia_Network', '/wiki/Dose_(magazine)', '/wiki/Northern_News', '/wiki/Jam!', '/wiki/Quebecor']
    assert dfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikimedia_Foundation', '/wiki/VIAF_(identifier)', '/wiki/Virtual_International_Authority_File', '/wiki/Interested_Parties_Information', '/wiki/Law', '/wiki/Human_science', '/wiki/Truth', '/wiki/Verstehen', '/wiki/Phronesis', '/wiki/Knowledge', '/wiki/Max_Weber', '/wiki/Trove_(identifier)', '/wiki/Trove', '/wiki/The_Sydney_Morning_Herald', '/wiki/OzTAM', '/wiki/Canwest', '/wiki/Pembroke_Daily_Observer', '/wiki/Postmedia_News', '/wiki/Postmedia_Network', '/wiki/Dose_(magazine)', '/wiki/Northern_News', '/wiki/Jam!']


def test_wikiracer_1():
    """
    Tests wikiracer speed on one input.
    A great implementation can do this in less than 8 internet requests.
    A good implementation can do this in less than 15 internet requests.
    A mediocre implementation can do this in less than 30 internet requests.

    To make your own test cases like this, I recommend finding a starting page,
    clicking on a few links, and then seeing if your program can get from your
    start to your end in only a few downloads.
    """
    limit = 15
    racer = WikiracerProblem()
    racer.wikiracer(source="/wiki/Computer_science", goal="/wiki/Richard_Soley")
    assert len(racer.internet.requests) <= limit


def test_wikiracer_2():
    """
    Tests wikiracer speed on one input.
    A great implementation can do this in less than 25 internet requests.
    A good implementation can do this in less than 80 internet requests.
    A mediocre implementation can do this in less than 300 internet requests.
    """
    limit = 80
    racer = WikiracerProblem()
    racer.wikiracer(source="/wiki/Waakirchen", goal="/wiki/A")
    assert len(racer.internet.requests) <= limit
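
# Hypothetical extra test (not part of the starter file): following the advice in
# test_wikiracer_1's docstring, pick a start page that is only a few clicks away from a goal
# you found by browsing, and bound the number of downloads your racer is allowed to make.
def test_wikiracer_custom():
    racer = WikiracerProblem()
    racer.wikiracer(source="/wiki/Computer_science", goal="/wiki/Alan_Turing")
    assert len(racer.internet.requests) <= 30  # loose bound; tighten it as your heuristics improve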
wikiracer/tests/__init__.py
wikiracer/pyproject.toml
[tool.poetry]
name = "py_wikiracer"
version = "0.1.0"
description = "UTCS DSC 395T: Wikiracer"
authors = ["Matthew Giordano <mgiordano@utexas.edu>"]
# prevent this package from being published by mistake
exclude = ["**/*"]
[tool.poetry.dependencies]
python = "^3.6.9"
pytest = "^6"
pytest-depends = "^1.0.1"
mypy = "^0.910"
wikiracer/handout.pdf
DSC 395T Data Structures and Algorithms — Spring 2022 Programming Assignment #6
Wikiracer Due May 9, 2022
In this assignment, you will implement different types of graph searches that will run on Wikipedia. Specifically, Wikipedia pages are the graph nodes and links to other pages are the edges.
1 Wikipedia, Wikiracer, and Graph Algorithms
Wikipedia is a giant online encyclopedia that anyone can edit (with approval), available at wikipedia.org. Among many other interesting properties, its size makes it particularly hard to digest. As such, a wiki racing game has been created that uses Wikipedia’s large graph as the basis for games. The goal is to get from one specified Wikipedia page (the start node) to another (the goal node) as fast as possible (or perhaps in as few clicks as possible). You can find variations of this game online by searching “Wikiracer." We recommend that you play the game a couple of times first before beginning this assignment. Here is one implementation that we recommend playing: https://dlab.epfl.ch/wikispeedia/play/.
Before beginning this assignment, you should be familiar with the different types of graph searches that we discussed in lecture, specifically, breadth-first search (BFS), depth-first search (DFS), and Dijkstra’s algorithm.
2 Your Assignment
2.1 Parser
This assignment has a few parts. First, we ask you to build a Wikipedia page parser. In other words, once we’ve downloaded the HTML markup for a Wikipedia page, we need to read the HTML code and find all of the page’s neighbors (links to other Wikipedia pages). We provide the code to download a page’s HTML code, but we ask you to parse it inside the get_links_in_page function inside of py_wikiracer/wikiracer.py.
HTML code is organized into different cascading tags, and it is the building block of all websites. For instance, consider the following HTML markup:
<html>
<head>
<script href="..."></script>
</head>
<body>
<p>This is a paragraph!</p>
<p>I can put any text I want, including <a href="/wiki/Science">a link</a>.</p>
</body>
</html>
When Google Chrome or Firefox encounter this HTML segment, they will interpret the HTML commands and create a webpage with two paragraphs. The second one will have a link over the text “a link" that points to “/wiki/Science".
The first thing to know about HTML is that tags are written inside of < and >, and a closing tag is denoted via a </ and >. Every opening tag must have a closing tag (save for a few self-closing tags), and the content of the tag is contained between the opening and closing tags. Notice above that <html> is closed by </html> at the bottom of the document, and likewise for <head>, <body>, <p>, and <a>.
Here are some different types of tags:
• <html> tags mark the start and end of the website. There can be only one html section.
• <head> marks the beginning of the metadata section of a website. The head section normally contains tags that load page materials (fonts, scripts).
• <script> tags contain JavaScript, or reference an external JavaScript file that should be loaded and run.
• <body> marks the main content of a website. This is where most visible text will be located.
• <p> tags are used to denote different paragraphs.
• <a> tags represent hyperlinks and will be a main focus of this project. The href parameter contains the destination of the link, and the text between the <a> tags contains the visible text. When a link does not cross domains (i.e. the part between the https:// and the .com/.org/...), it can start with a / to move between pages in the same domain easily. For example, when the source page is also on the https://wikipedia.org domain, the link https://wikipedia.org/wiki/Computer can be accessed either by the parameter href="https://wikipedia.org/wiki/Computer" or the parameter href="/wiki/Computer". Since we are starting on Wikipedia and ending on Wikipedia, we only want to follow links in the form href="/wiki/*", since that will ensure that we stay on the https://wikipedia.org domain.
Your first task is to implement the get_links_in_page function inside of py_wikiracer/wikiracer.py. This function accepts the HTML code for a single webpage and should find all of the valid links inside of that page. For instance, running the function on the example HTML code above should return ["/wiki/Science"]. Be sure to return links in the order that they appear, and be sure to filter out duplicate entries.
Additionally, there are a few characters that will invalidate a link in our game. These characters are the colon, pound sign, forward slash, and question mark. You should not hardcode these four characters into being disallowed, but rather use the DISALLOWED global variable provided inside py_wikiracer/wikiracer.py. If a link is found with one of these characters, you can pretend it doesn’t exist in the HTML. Also, if a link is found that doesn’t start with /wiki/, then you can pretend it doesn’t exist, since it could be an external link that takes us outside of Wikipedia.
If you want to play around with HTML, you can save the example script above in an .html file, and then open it with a web browser. You can (and should) look around real websites’ HTML by using the Inspect Element feature available on most web browsers. In Google Chrome, F12 opens the Inspect Element menu, where you can click on a specific piece of text within a website to look at the underlying HTML code. Before starting the project, we encourage you to look at different websites’ HTML structure and see how the links look on Wikipedia. You can also save a file with Ctrl-S to download its HTML markup and examine it with a text editor.
To accomplish your task, you should make good use of Python’s built-in string functions to search for identifiable landmarks near links. You can use regex to find links as well (though be wary of speed).
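To make the string-scanning idea concrete, here is one possible sketch of get_links_in_page built only on str.find. It mirrors the DISALLOWED list from the starter code and is not the required implementation, just an illustration of the ordering, filtering, and de-duplication rules described above.

from typing import List

DISALLOWED = [":", "#", "/", "?"]  # mirrors Internet.DISALLOWED from the starter code

def get_links_in_page(html: str) -> List[str]:
    """Sketch: collect unique /wiki/ links in order of first appearance."""
    links = []
    seen = set()
    pos = html.find('href="/wiki/')
    while pos != -1:
        start = pos + len('href="')            # index of the leading '/'
        end = html.find('"', start)            # closing quote of the href value
        if end == -1:
            break                              # malformed tag; stop scanning
        link = html[start:end]
        suffix = link[len("/wiki/"):]          # the page name after /wiki/
        # Keep the link only if the page name is non-empty, contains no
        # disallowed character, and has not been seen before.
        if suffix and not any(c in suffix for c in DISALLOWED) and link not in seen:
            seen.add(link)
            links.append(link)
        pos = html.find('href="/wiki/', end)
    return links

# The example markup above yields ["/wiki/Science"]:
print(get_links_in_page('<p>I can put any text I want, including <a href="/wiki/Science">a link</a>.</p>'))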
2.2 Shortest Path Routines
The second part of the assignment is to implement three different search routines to create a naive wikiracer. We use Wikipedia as our graph, with websites representing vertices and links representing edges, and we want to efficiently traverse Wikipedia to find the shortest path between two vertices. These three routines (BFS, DFS, and Dijkstra’s) share a lot of ideas, so be sure to reflect this in your code reuse. They should be written inside of py_wikiracer/wikiracer.py in their respective class. Each of these routines accepts as input a start page and goal page. They should use py_wikiracer/internet.py’s Internet class to download a page, and then Parser’s get_links_in_page to find a node’s neighbors.
For BFS and DFS, you can think of Wikipedia as a directed, unweighted graph. In BFS, your goal is to spread out evenly across the graph. DFS is similar to BFS, except that you are almost always clicking the last link inside of each webpage, unless you have been there already. This makes it pretty impractical for actually searching for something in Wikipedia, but a useful exercise nonetheless. (See https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy for an interesting result of this behavior.)
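As a rough illustration of how these two searches share structure (and assuming a finished Parser.get_links_in_page), a BFS over this graph can be written around a queue of partial paths; swapping the queue for a stack gives the DFS behavior described above. This is only a sketch: the autograder also checks exactly which pages you download, so details such as when the goal is recognized should be verified against the provided tests.

from collections import deque
from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser

def bfs_sketch(source, goal):
    """Sketch of BFS over Wikipedia links; returns a path of pages or None."""
    internet = Internet()
    queue = deque([[source]])         # queue of partial paths, each ending at a page to expand
    visited = set()
    while queue:
        path = queue.popleft()        # DFS variant: pop() from the right instead
        page = path[-1]
        if page in visited:           # mark pages visited only once they leave the queue
            continue
        visited.add(page)
        for link in Parser.get_links_in_page(internet.get_page(page)):
            if link == goal:
                return path + [link]  # goal found as an outgoing link; the goal page is never downloaded
            queue.append(path + [link])
    return None                       # no path exists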
Dijkstra’s algorithm finds the lowest cost path between two nodes in a weighted graph. In our case, Wikipedia links have no weight, so we pass in a weighting function to Dijkstra’s algorithm that creates a weight given a linked start and end node. For example, if /wiki/Computer has a link to /wiki/Hippopotamus, then the weight of that edge would be costFn("/wiki/Computer", "/wiki/Hippopotamus"). By default, costFn simply takes the length of the second argument as the weight, but your code should work with any cost function.
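For the weighted version, the usual tool is a binary heap keyed on accumulated path cost. The fragment below is only a sketch of how costFn plugs into that priority queue, under the same Internet/Parser assumptions as the sketch above; as before, check the exact download behavior against the provided tests.

import heapq
from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser

def dijkstras_sketch(source, goal, costFn=lambda x, y: len(y)):
    """Sketch of Dijkstra's algorithm over Wikipedia links; returns the cheapest path or None."""
    internet = Internet()
    heap = [(0, [source])]            # (accumulated cost, path ending at the page to settle)
    visited = set()
    while heap:
        cost, path = heapq.heappop(heap)
        page = path[-1]
        # A page popped from the heap is settled, so the goal is returned here and never
        # downloaded. The len(path) > 1 guard keeps a source == goal search from returning
        # before the source page has even been expanded.
        if page == goal and len(path) > 1:
            return path
        if page in visited:
            continue
        visited.add(page)
        for link in Parser.get_links_in_page(internet.get_page(page)):
            heapq.heappush(heap, (cost + costFn(page, link), path + [link]))
    return None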
2.3 Wikiracer
The third part of the assignment is to combine ideas from BFS, DFS, and Dijkstra’s algorithm into a well-crafted Wikiracer. You should optimize your Wikiracer to find a path between two Wikipedia pages in as few downloads as possible, regardless of the length of the path between the two pages. This strategy works because in a competitive Wikiracer game, you are trying to find any path between two pages as quickly as possible, and jumping between pages dominates the overall runtime due to the slow nature of downloads. That being said, you should still be able to analyze a page in well under a second, but you’ll need to be intelligent about picking the next link to follow. Some example strategies are:
• Look for links that are substrings or superstrings of the goal URL (be careful with pages like /wiki/A matching everything, though).
• Load the "goal" page, and perform a shallow BFS from the goal page. During your search from the source, when you find one of these links in a page that is near the goal, prioritize it, hoping that it bidirectionally links back to the goal.
• Use Internet.get_random to filter out common links between pages, or for other uses.
Hint: One approach is to build an implementation that will probably look a lot like Dijkstra’s, but with a much more complex cost function, so that the next link you look at has the highest probability of having a link to the goal. (This is similar in spirit to the A* search algorithm with an inadmissible heuristic, which is beyond the scope of this course but you may find interesting.)
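One way to read the hint is as a priority function rather than a true edge cost. The helper below is hypothetical (goal_links stands for the set returned by get_links_in_page on the downloaded goal page); lower scores mean "expand this link sooner", and a racer could plug it into the Dijkstra-style loop sketched earlier in place of the accumulated edge cost.

def link_priority(link, goal, goal_links):
    """Hypothetical scoring helper: links with smaller scores are expanded sooner."""
    name = link[len("/wiki/"):].lower()
    goal_name = goal[len("/wiki/"):].lower()
    if link == goal:
        return float("-inf")   # the goal itself always comes out of the queue first
    score = 100.0              # neutral baseline
    if link in goal_links:
        score -= 50.0          # the goal links here, so this page may well link back
    if len(name) > 3 and (name in goal_name or goal_name in name):
        score -= 25.0          # substring/superstring of the goal URL; the length guard
                               # avoids short titles like /wiki/A matching everything
    return score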
3 Tips
3.1 Using the Internet
To access the internet, we provide the Internet class. This class contains an internet.get_page method, which returns the HTML associated with a link provided in the form "/wiki/Something". Internally, the Internet class caches HTML files to disk so that you can test without having to download Wikipedia pages from the internet every time. Be sure to test your implementation’s runtime from an empty cache, and ensure that it finishes in a reasonable amount of time. At the moment, the autograder will timeout at around 100 seconds per function call, and it will not have the files cached.
Each search problem (BFS, DFS, Dijkstra’s, and Wikiracer) is contained within its own class and has its own local self.internet. Be sure to use this local variable when accessing webpages.
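For example, fetching a page and handing the HTML to the parser is all it takes to enumerate a node’s neighbors (assuming get_links_in_page is implemented). The snippet below builds its own Internet so it can run standalone; inside the graded classes you would use self.internet instead.

from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser

internet = Internet()                                   # in the graded code, use self.internet
html = internet.get_page("/wiki/Computer_science")      # downloaded once, then served from the disk cache
neighbors = Parser.get_links_in_page(html)              # the page's outgoing /wiki/ edges
print(neighbors)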
3.2 Karma
It may be interesting to try and go from a start page to a page that contains certain text, which may be useful in a search engine, where you have a homepage (e.g. wikipedia.org) and want to quickly identify pages via a query. Implement such a function in find_in_page, which takes a starting URL and a list of words. You should try and quickly return a path to a page that contains all of the words in the query, and/or explain how you would do this in karma.txt. It may be easier to start with the assumption that len(query) == 1.
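As a starting point only, a minimal sketch of find_in_page could do a plain breadth-first walk and test the query words against the raw HTML; this ignores the download-count heuristics the karma problem is really asking for and is meant only to show the shape of the search.

from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser

def find_in_page_sketch(source, query):
    """Sketch: breadth-first walk that stops at the first page containing every query word."""
    internet = Internet()
    frontier = [[source]]
    visited = set()
    while frontier:
        path = frontier.pop(0)
        page = path[-1]
        if page in visited:
            continue
        visited.add(page)
        html = internet.get_page(page)
        if all(word.lower() in html.lower() for word in query):
            return path                          # path[-1] contains all of the query words
        for link in Parser.get_links_in_page(html):
            frontier.append(path + [link])
    return None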
4 What To Turn In
Use the Canvas website to submit a single ZIP file that contains your modified py_wikiracer folder. Specific instructions can be found at this course’s EdX site under the "Submission Instructions" page. Canvas is available at http://canvas.utexas.edu/.
Important Details.
• This assignment is due at 11:59pm on the due date. Late assignments will be penalized 10% per day.
• We have already identified many instances of plagiarism on prior assignments. Do not be tempted to look up existing implementations of a Wikiracer, as that is a violation of our course policies.
Acknowledgments. This assignment was created by Calvin Lin and Matthew Giordano. Special thanks to Ali Malik for inspiration.