Wanted Help With Python Assignment Involving Data Structures

Posted Under: Python

DESCRIPTION
I need help with the uploaded assignment, which involves splay trees, BFS, and string parsing techniques. Please have a look at the YouTube links (Part 1, Part 2) in the assignment to understand the detailed requirements. I need help with the entire assignment and I do have starter files. It also involves writing test functions to make sure the code in the main file works correctly.
Attachments
Part 1: https://www.youtube.com/watch?v=LAWEkvhhS-Y
Part 2: https://www.youtube.com/watch?v=09k5KNVLNpk
wikiracer/py_wikiracer/internet.py

from urllib.request import urlopen
from pathlib import Path
import base64

FILE_CACHE_DIR = "wiki_cache"

class Internet:
    DISALLOWED = [":", "#", "/", "?"]
    """
    This class represents the Internet.
    Feel free to look, but do NOT edit this file.
    Also, do NOT access the Internet in any way except through this class, or you will see unexpected results.

    get_page(page) will return the HTML code from a given page. page should be formatted as "/wiki/Something",
    which will give the HTML for the page https://en.wikipedia.org/wiki/Something.

    get_random() is a function you can use if you desire. It will return the HTML of a random page on Wikipedia.

    Usage of Internet:
    internet = Internet()
    html = internet.get_page("/wiki/Computer_science")
    print(html)
    """

    def __init__(self):
        self.requests = []

    def get_page(self, page):
        if page[:6] != "/wiki/":
            raise ValueError(f"Links must start with /wiki/. {page} is not valid.")
        if any(i in page[6:] for i in Internet.DISALLOWED):
            raise ValueError(f"Link cannot contain disallowed character. {page} is not valid.")
        self.requests.append(page)
        return Internet.__get_page_internal(page)

    # You may find this useful in your wikiracer implementation.
    def get_random(self):
        return urlopen(f"https://en.wikipedia.org/wiki/Special:Random").read().decode('utf-8')

    @staticmethod
    def __get_page_internal(page):
        # First see if we have it in the local cache, to reduce the number of spam requests to Wikipedia
        file_cache_dir_path = Path(FILE_CACHE_DIR)
        if not file_cache_dir_path.is_dir():
            file_cache_dir_path.mkdir()

        # Convert page to a filesystem safe name
        safe_name = base64.urlsafe_b64encode(page.encode("utf-8")).decode("utf-8")
        local_path = file_cache_dir_path / safe_name
        if local_path.is_file():
            return local_path.read_text()

        html = urlopen(f"https://en.wikipedia.org{page}").read().decode('utf-8')

        # write to file cache
        local_path.write_text(html)
        return html
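For orientation, the handout's intended flow is that each search class downloads a page through its local self.internet and then hands the HTML to Parser.get_links_in_page to find that node's neighbors. A minimal sketch of that flow, assuming the parser has been implemented (the expand helper name is my own, not part of the starter code):

from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser

def expand(internet, page):
    # Download one page and return its outgoing /wiki/ links in order.
    html = internet.get_page(page)        # e.g. page = "/wiki/Computer_science"
    return Parser.get_links_in_page(html)

internet = Internet()
neighbors = expand(internet, "/wiki/Computer_science")
print(internet.requests, len(neighbors))  # every download is recorded in internet.requests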
wikiracer/py_wikiracer/wikiracer.py

from py_wikiracer.internet import Internet
from typing import List

class Parser:

    @staticmethod
    def get_links_in_page(html: str) -> List[str]:
        """
        In this method, we should parse a page's HTML and return a list of links in the page.
        Be sure not to return any link with a DISALLOWED character.
        All links should be of the form "/wiki/<page name>", as to not follow external links
        """
        links = []
        disallowed = Internet.DISALLOWED

        # YOUR CODE HERE
        # You can look into using regex, or just use Python's find methods to find the <a> tags or any other identifiable features
        # A good starting place is to print out `html` and look for patterns before/after the links that you can string.find().
        # Make sure your list doesn't have duplicates. Return the list in the same order as they appear in the HTML.
        # This function will be stress tested so make it efficient!

        return links

# In these methods, we are given a source page and a goal page, and we should return
# the shortest path between the two pages. Be careful! Wikipedia is very large.
# These are all very similar algorithms, so it is advisable to make a global helper function that does all of the work, and have
# each of these call the helper with a different data type (stack, queue, priority queue, etc.)

class BFSProblem:
    def __init__(self):
        self.internet = Internet()

    # Example in/outputs:
    # bfs(source = "/wiki/Computer_science", goal = "/wiki/Computer_science") == ["/wiki/Computer_science"]
    # bfs(source = "/wiki/Computer_science", goal = "/wiki/Computation") == ["/wiki/Computer_science", "/wiki/Computation"]
    # Find more in the test case file.
    # Do not try to make fancy optimizations here. The autograder depends on you following standard BFS and will check all of the pages you download.
    # Links should be inserted into the queue as they are located in the page, and should be obtained using Parser's get_links_in_page.
    # Be very careful not to add things to the "visited" set of pages too early. You must wait for them to come out of the queue first. See if you can figure out why.
    # This applies for bfs, dfs, and dijkstra's.
    # Download a page with self.internet.get_page().
    def bfs(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path # if no path exists, return None

class DFSProblem:
    def __init__(self):
        self.internet = Internet()

    # Links should be inserted into a stack as they are located in the page. Do not add things to the visited list until they are taken out of the stack.
    def dfs(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path # if no path exists, return None

class DijkstrasProblem:
    def __init__(self):
        self.internet = Internet()

    # Links should be inserted into the heap as they are located in the page.
    # By default, the cost of going to a link is the length of a particular destination link's name. For instance,
    # if we consider /wiki/a -> /wiki/ab, then the default cost function will have a value of 8.
    # This cost function is overridable and your implementation will be tested on different cost functions. Use costFn(node1, node2)
    # to get the cost of a particular edge.
    # You should return the path from source to goal that minimizes the total cost. Assume cost > 0 for all edges.
    def dijkstras(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia", costFn = lambda x, y: len(y)):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path # if no path exists, return None
class WikiracerProblem:
    def __init__(self):
        self.internet = Internet()

    # Time for you to have fun! Using what you know, try to efficiently find the shortest path between two wikipedia pages.
    # Your only goal here is to minimize the total amount of pages downloaded from the Internet, as that is the dominating time-consuming action.
    # Your answer doesn't have to be perfect by any means, but we want to see some creative ideas.
    # One possible starting place is to get the links in `goal`, and then search for any of those from the source page, hoping that those pages lead back to goal.
    # Note: a BFS implementation with no optimizations will not get credit, and it will suck.
    # You may find Internet.get_random() useful, or you may not.
    def wikiracer(self, source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia"):
        path = [source]
        # YOUR CODE HERE
        # ...
        path.append(goal)
        return path # if no path exists, return None

# KARMA
class FindInPageProblem:
    def __init__(self):
        self.internet = Internet()

    # This Karma problem is a little different. In this, we give you a source page, and then ask you to make up some heuristics that will allow you to efficiently
    # find a page containing all of the words in `query`. Again, optimize for the fewest number of internet downloads, not for the shortest path.
    def find_in_page(self, source = "/wiki/Calvin_Li", query = ["ham", "cheese"]):
        raise NotImplementedError("Karma method find_in_page")

        path = [source]
        # find a path to a page that contains ALL of the words in query in any place within the page
        # path[-1] should be the page that fulfills the query.
        # YOUR CODE HERE
        return path # if no path exists, return None

wikiracer/py_wikiracer/__init__.py

wikiracer/tests/test_search.py

import random
import sys
from typing import Callable, Iterator
from itertools import chain
from collections import defaultdict
from types import ModuleType
from importlib import reload
from urllib.request import urlopen

import pytest

from py_wikiracer.internet import Internet
from py_wikiracer.wikiracer import Parser, BFSProblem, DFSProblem, DijkstrasProblem, WikiracerProblem

REQ_LIMIT = 75 # per test, normally

def test_parser():
    internet = Internet()
    html = internet.get_page("/wiki/Henry_Krumrey")
    assert Parser.get_links_in_page(html) == ['/wiki/Wisconsin_State_Senate', '/wiki/Wisconsin_Senate,_District_20', '/wiki/Wisconsin_State_Assembly', '/wiki/Plymouth,_Sheboygan_County,_Wisconsin', '/wiki/Republican_Party_(United_States)', '/wiki/Sheboygan_County,_Wisconsin', '/wiki/United_States_presidential_election_in_Wisconsin,_1900', '/wiki/Crystal_Lake,_Illinois', '/wiki/Henry_Krumrey', '/wiki/Main_Page']

def test_trivial():
    """
    All pages contain a link to themselves, which any search algorithm should recognize.
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()
    assert bfs.bfs(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert dfs.dfs(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert dij.dijkstras(source = "/wiki/ASDF", goal = "/wiki/ASDF") == ["/wiki/ASDF", "/wiki/ASDF"]
    assert bfs.internet.requests == ["/wiki/ASDF"]
    assert dfs.internet.requests == ["/wiki/ASDF"]
    assert dij.internet.requests == ["/wiki/ASDF"]

def test_trivial_2():
    """
    Searches going to page 1 distance away.
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()
    assert bfs.bfs(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert dfs.dfs(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert dij.dijkstras(source = "/wiki/Reese_Witherspoon", goal = "/wiki/Academy_Awards") == ["/wiki/Reese_Witherspoon", "/wiki/Academy_Awards"]
    assert bfs.internet.requests == ["/wiki/Reese_Witherspoon"]
    assert dfs.internet.requests == ["/wiki/Reese_Witherspoon"]
    assert dij.internet.requests == ["/wiki/Reese_Witherspoon"]

def test_bfs_basic():
    """
    BFS depth 2 search
    """
    bfs = BFSProblem()
    assert bfs.bfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == ['/wiki/Calvin_Li', '/wiki/Chinese_language', '/wiki/Wikipedia']
    assert bfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Chinese_name', '/wiki/Chinese_surname', '/wiki/Li_(surname_%E6%9D%8E)', '/wiki/Wuhan', '/wiki/Hubei', '/wiki/Central_Academy_of_Drama', '/wiki/All_Men_Are_Brothers_(TV_series)', '/wiki/Chinese_language']

def test_dfs_basic():
    """
    DFS depth 2 search
    """
    dfs = DFSProblem()
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikipedia']
    assert dfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Main_Page']

def test_dijkstras_basic():
    """
    Dijkstra's depth 2 search
    """
    dij = DijkstrasProblem()
    # This costFn is to make sure there are never any ties coming out of the heap, since the default costFn produces ties and we don't define a tiebreaking mechanism for priorities
    assert dij.dijkstras(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia", costFn = lambda y, x: len(x) * 1000 + x.count("a") * 100 + x.count("u") + x.count("h") * 5 - x.count("F")) == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikipedia']
    assert dij.internet.requests == ['/wiki/Calvin_Li', '/wiki/Hubei', '/wiki/Wuxia', '/wiki/Wuhan', '/wiki/Pinyin', '/wiki/Firefox', '/wiki/Tencent', '/wiki/Wu_Yong', '/wiki/Cao_Cao', '/wiki/John_Woo', '/wiki/Kelly_Lin', '/wiki/Sina_Corp', '/wiki/Huo_Siyan', '/wiki/Shawn_Yue', '/wiki/Main_Page']

class CustomInternet():
    def __init__(self):
        self.requests = []
    def get_page(self, page):
        self.requests.append(page)
        return f'<a href="{page}"></a>'

def test_none_on_fail():
    """
    Program should return None on failure
    """
    bfs = BFSProblem()
    dfs = DFSProblem()
    dij = DijkstrasProblem()

    # Testing hack: override the internet to inject our own HTML
    bfs.internet = CustomInternet()
    dfs.internet = CustomInternet()
    dij.internet = CustomInternet()

    assert bfs.bfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert dij.dijkstras(source = "/wiki/Calvin_Li", goal = "/wiki/Wikipedia") == None
    assert bfs.internet.requests == ["/wiki/Calvin_Li"]
    assert dfs.internet.requests == ["/wiki/Calvin_Li"]
    assert dij.internet.requests == ["/wiki/Calvin_Li"]

def test_dfs_complex():
    """
    A complex DFS example to test your searching algorithm.
    """
    dfs = DFSProblem()
    assert dfs.dfs(source = "/wiki/Calvin_Li", goal = "/wiki/Quebecor") == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikimedia_Foundation', '/wiki/VIAF_(identifier)', '/wiki/Virtual_International_Authority_File', '/wiki/Interested_Parties_Information', '/wiki/Law', '/wiki/Human_science', '/wiki/Truth', '/wiki/Verstehen', '/wiki/Phronesis', '/wiki/Knowledge', '/wiki/Max_Weber', '/wiki/Trove_(identifier)', '/wiki/Trove', '/wiki/The_Sydney_Morning_Herald', '/wiki/OzTAM', '/wiki/Canwest', '/wiki/Pembroke_Daily_Observer', '/wiki/Postmedia_News', '/wiki/Postmedia_Network', '/wiki/Dose_(magazine)', '/wiki/Northern_News', '/wiki/Jam!', '/wiki/Quebecor']
    assert dfs.internet.requests == ['/wiki/Calvin_Li', '/wiki/Main_Page', '/wiki/Wikimedia_Foundation', '/wiki/VIAF_(identifier)', '/wiki/Virtual_International_Authority_File', '/wiki/Interested_Parties_Information', '/wiki/Law', '/wiki/Human_science', '/wiki/Truth', '/wiki/Verstehen', '/wiki/Phronesis', '/wiki/Knowledge', '/wiki/Max_Weber', '/wiki/Trove_(identifier)', '/wiki/Trove', '/wiki/The_Sydney_Morning_Herald', '/wiki/OzTAM', '/wiki/Canwest', '/wiki/Pembroke_Daily_Observer', '/wiki/Postmedia_News', '/wiki/Postmedia_Network', '/wiki/Dose_(magazine)', '/wiki/Northern_News', '/wiki/Jam!']

def test_wikiracer_1():
    """
    Tests wikiracer speed on one input. A great implementation can do this in less than 8 internet requests.
    A good implementation can do this in less than 15 internet requests.
    A mediocre implementation can do this in less than 30 internet requests.

    To make your own test cases like this, I recommend finding a starting page, clicking on a few links, and then
    seeing if your program can get from your start to your end in only a few downloads.
    """
    limit = 15
    racer = WikiracerProblem()
    racer.wikiracer(source="/wiki/Computer_science", goal="/wiki/Richard_Soley")
    assert len(racer.internet.requests) <= limit

def test_wikiracer_2():
    """
    Tests wikiracer speed on one input. A great implementation can do this in less than 25 internet requests.
    A good implementation can do this in less than 80 internet requests.
    A mediocre implementation can do this in less than 300 internet requests.
    """
    limit = 80
    racer = WikiracerProblem()
    racer.wikiracer(source="/wiki/Waakirchen", goal="/wiki/A")
    assert len(racer.internet.requests) <= limit

wikiracer/tests/__init__.py

wikiracer/pyproject.toml

[tool.poetry]
name = "py_wikiracer"
version = "0.1.0"
description = "UTCS DSC 395T: Wikiracer"
authors = ["Matthew Giordano <mgiordano@utexas.edu>"]
# prevent this package from being published by mistake
exclude = ["**/*"]

[tool.poetry.dependencies]
python = "^3.6.9"
pytest = "^6"
pytest-depends = "^1.0.1"
mypy = "^0.910"
wikiracer/handout.pdf

DSC 395T Data Structures and Algorithms — Spring 2022
Programming Assignment #6: Wikiracer
Due May 9, 2022

In this assignment, you will implement different types of graph searches that will run on Wikipedia. Specifically, Wikipedia pages are the graph nodes and links to other pages are the edges.

1 Wikipedia, Wikiracer, and Graph Algorithms

Wikipedia is a giant online encyclopedia that anyone can edit (with approval), available at wikipedia.org. Among many other interesting properties, its size makes it particularly hard to digest. As such, a wiki racing game has been created that uses Wikipedia's large graph as the basis for games. The goal is to get from one specified Wikipedia page (the start node) to another (the goal node) as fast as possible (or perhaps in as few clicks as possible). You can find variations of this game online by searching "Wikiracer." We recommend that you play the game a couple of times first before beginning this assignment. Here is one implementation that we recommend playing: https://dlab.epfl.ch/wikispeedia/play/.

Before beginning this assignment, you should be familiar with the different types of graph searches that we discussed in lecture, specifically, breadth-first search (bfs), depth-first search (dfs), and Dijkstra's algorithm.

2 Your Assignment

2.1 Parser

This assignment has a few parts. First, we ask you to build a Wikipedia page parser. In other words, once we've downloaded the HTML markup for a Wikipedia page, we need to read the HTML code and find all of the page's neighbors (links to other Wikipedia pages). We provide the code to download a page's HTML code, but we ask you to parse it inside the get_links_in_page function inside of py_wikiracer/wikiracer.py.

HTML code is organized into different cascading tags, and it is the building block of all websites. For instance, consider the following HTML markup:

<html>
  <head>
    <script href="..."></script>
  </head>
  <body>
    <p>This is a paragraph!</p>
    <p>I can put any text I want, including <a href="/wiki/Science">a link</a>.</p>
  </body>
</html>

When Google Chrome or Firefox encounter this HTML segment, they will interpret the HTML commands and create a webpage with two paragraphs. The second one will have a link over the text "a link" that points to "/wiki/Science".

The first thing to know about HTML is that tags are written inside of < and >, and a closing tag is denoted via a </ and >. Every opening tag must have a closing tag (save for a few self-closing tags), and the content of the tag is contained between the opening and closing tags. Notice above that <html> is closed by </html> at the bottom of the document, and likewise for <head>, <body>, <p>, and <a>. Here are some different types of tags:

• <html> tags mark the start and end of the website. There can be only one html section.
• <head> marks the beginning of the metadata section of a website. The head section normally contains tags that load page materials (fonts, scripts).
• <script> tags contain JavaScript, or reference an external JavaScript file that should be loaded and run.
• <body> marks the main content of a website. This is where most visible text will be located.
• <p> tags are used to denote different paragraphs.
• <a> tags represent hyperlinks and will be a main focus of this project. The href parameter contains the destination of the link, and the text between the <a> tags contains the visible text.

When a link does not cross domains (i.e. the part between the https:// and the .com/.org/...), it can start with a / to move between pages in the same domain easily. For example, when the source page is also on the https://wikipedia.org domain, the link https://wikipedia.org/wiki/Computer can be accessed either by the parameter href="https://wikipedia.org/wiki/Computer" or the parameter href="/wiki/Computer". Since we are starting on Wikipedia and ending on Wikipedia, we only want to follow links in the form href="/wiki/*", since that will ensure that we stay on the https://wikipedia.org domain.

Your first task is to implement the get_links_in_page function inside of py_wikiracer/wikiracer.py. This function accepts the HTML code for a single webpage and should find all of the valid links inside of that page. For instance, running the function with the example HTML code above should return ["/wiki/Science"]. Be sure to return links in the order that they appear, and be sure to filter out duplicate entries. Additionally, there are a few characters that will invalidate a link in our game. These characters are the colon, pound sign, forward slash, and question mark. You should not hardcode these four characters into being disallowed, but rather use the DISALLOWED global variable provided inside py_wikiracer/wikiracer.py. If a link is found with one of these characters, you can pretend it doesn't exist in the HTML. Also, if a link is found that doesn't start with /wiki/, then you can pretend it doesn't exist, since it could be an external link that takes us outside of Wikipedia.

If you want to play around with HTML, you can save the example script above in an .html file, and then open it with a web browser. You can (and should) look around real websites' HTML by using the Inspect Element feature available on most web browsers. In Google Chrome, F12 opens the Inspect Element menu, where you can click on a specific piece of text within a website to look at the underlying HTML code. Before starting the project, we encourage you to look at different websites' HTML structure, and see how the links look on Wikipedia. You can also save a page with Ctrl-S to download its HTML markup and examine it with a text editor.

To accomplish your task, you should make good use of Python's built-in string functions to search for identifiable landmarks near links. You can use Regex to find links as well (though be wary of speed).
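The handout leaves the parsing strategy open. Purely as a rough sketch of the str.find-based approach described above (the function name is hypothetical and the code has not been validated against the autograder; it only applies the filtering rules stated in this section):

from py_wikiracer.internet import Internet

def get_links_in_page_sketch(html):
    # Scan for href="..." attributes; keep links that start with /wiki/ and whose
    # remainder has no DISALLOWED character, preserving first-seen order, no duplicates.
    links, seen, pos = [], set(), 0
    while True:
        start = html.find('href="', pos)
        if start == -1:
            break
        start += len('href="')
        end = html.find('"', start)
        if end == -1:
            break
        link = html[start:end]
        pos = end + 1
        if not link.startswith("/wiki/"):
            continue  # external or non-article link
        if any(c in link[len("/wiki/"):] for c in Internet.DISALLOWED):
            continue  # contains ":", "#", "/", or "?"
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links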
2.2 Shortest Path Routines

The second part of the assignment is to implement 3 different search routines to create a naive wikiracer. We use Wikipedia as our graph, with websites representing vertices and links representing edges, and we want to efficiently traverse Wikipedia to find the shortest path between two vertices. Each of these three (bfs, dfs, and Dijkstra's) shares a lot of ideas, so be sure to reflect this in your code reuse. They should be written inside of py_wikiracer/wikiracer.py in their respective class.

Each of these routines accepts as input a start page and a goal page. They should use py_wikiracer/internet.py's Internet class to download a page, and then Parser's get_links_in_page to find a node's neighbors. For BFS and DFS, you can think of Wikipedia as a directed, unweighted graph. In BFS, your goal is to spread out evenly across the graph. DFS is similar to BFS, except that you are almost always clicking the last link inside of each webpage, unless you have been there already. This makes it pretty impractical for actually searching for something in Wikipedia, but a useful exercise nonetheless. (See https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy for an interesting result of this behavior.)

Dijkstra's algorithm finds the lowest cost path between two nodes in a weighted graph. In our case, Wikipedia links have no weight, so we pass in a weighting function to Dijkstra's algorithm that creates a weight given a linked start and end node. For example, if /wiki/Computer has a link to /wiki/Hippopotamus, then the weight of that edge would be costFn("/wiki/Computer", "/wiki/Hippopotamus"). By default, costFn simply takes the length of the second argument as the weight, but your code should work with any cost function.
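The starter comments suggest writing one shared traversal helper and varying only the frontier data structure (queue for BFS, stack for DFS, priority queue for Dijkstra's). Below is a rough sketch of the Dijkstra's-shaped variant with a pluggable costFn, assuming a working parser; it is illustrative only and has not been checked against the exact download orders the provided tests expect. Swapping the heap for a collections.deque (popleft) or a plain list (pop) gives the BFS and DFS shapes.

import heapq

from py_wikiracer.wikiracer import Parser

def dijkstras_sketch(internet, source, goal, costFn=lambda x, y: len(y)):
    # Frontier of (total cost so far, path); a page is only marked visited once it
    # comes off the heap, as the starter comments advise.
    frontier = [(0, [source])]
    visited = set()
    while frontier:
        cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal and len(path) > 1:
            return path  # cheapest path; the tests expect source == goal to still take one hop
        if node in visited:
            continue
        visited.add(node)
        for link in Parser.get_links_in_page(internet.get_page(node)):
            if link not in visited:
                heapq.heappush(frontier, (cost + costFn(node, link), path + [link]))
    return None  # no path exists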
2.3 Wikiracer

The third part of the assignment is to combine ideas from BFS, DFS, and Dijkstra's algorithm into a well-crafted Wikiracer. You should optimize your Wikiracer to find a path between two Wikipedia pages in as few downloads as possible, regardless of the length of the path between the two pages. This strategy works because in a competitive Wikiracer game, you are trying to find any path between two pages as quickly as possible, and jumping between pages dominates the overall runtime due to the slow nature of downloads. That being said, you should still be able to analyze a page in well under a second, but you'll need to be intelligent about picking the next link to follow. Some example strategies are:

• Look for links that are substrings or superstrings of the goal URL (be careful with pages like /wiki/A matching everything, though).
• Load the "goal" page, and perform a shallow BFS from the goal page. During your search from the source, when you find one of these links in a page that is near the goal, prioritize it, hoping that it bidirectionally links back to the goal.
• Use Internet.get_random to filter out common links between pages, or for other uses.

Hint: One approach is to build an implementation that will probably look a lot like Dijkstra's, but with a much more complex cost function, so that the next link you look at has the highest probability of having a link to the goal. (This is similar in spirit to the A* search algorithm with an inadmissible heuristic, which is beyond the scope of this course but you may find interesting.)
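One way to read the hint above: keep the Dijkstra's-style loop but replace costFn with a heuristic priority so that promising links come off the heap first. The scoring below is purely illustrative; the weights, the goal_links idea (the set of links found on the goal page, bought with one extra download), and the function name are my own assumptions rather than part of the assignment:

def wikiracer_priority_sketch(link, goal, goal_links):
    # Lower scores are explored first when used as the heap priority.
    if link == goal:
        return -1000  # the goal itself: take it immediately
    score = 100
    if link in goal_links:
        score -= 50  # the goal page links here, so it may well link back
    goal_title = goal[len("/wiki/"):].lower()
    link_title = link[len("/wiki/"):].lower()
    if len(link_title) > 3 and (link_title in goal_title or goal_title in link_title):
        score -= 25  # substring/superstring of the goal title
    return score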
3 Tips

3.1 Using the Internet

To access the internet, we provide the Internet class. This class contains an internet.get_page method, which returns the HTML associated with a link provided in the form "/wiki/Something". Internally, the Internet class caches HTML files to disk so that you can test without having to download Wikipedia pages from the internet every time. Be sure to test your implementation's runtime from an empty cache, and ensure that it finishes in a reasonable amount of time. At the moment, the autograder will time out at around 100 seconds per function call, and it will not have the files cached.

Each search problem (BFS, DFS, Dijkstra's, and Wikiracer) is contained within its own class and has its own local self.internet. Be sure to use this local variable when accessing webpages.

3.2 Karma

It may be interesting to try and go from a start page to a page that contains certain text, which may be useful in a search engine, where you have a homepage (e.g. wikipedia.org) and want to quickly identify pages via a query. Implement such a function in find_in_page, which takes a starting URL and a list of words. You should try to quickly return a path to a page that contains all of the words in the query, and/or explain how you would do this in karma.txt. It may be easier to start with the assumption that len(query) == 1.

4 What To Turn In

Use the Canvas website to submit a single ZIP file that contains your modified py_wikiracer folder. Specific instructions can be found at this course's EdX site under the "Submission Instructions" page. Canvas is available at http://canvas.utexas.edu/.

Important Details.
• This assignment is due at 11:59pm on the due date. Late assignments will be penalized 10% per day.
• We have already identified many instances of plagiarism on prior assignments. Do not be tempted to look up existing implementations of a Wikiracer, as that is a violation of our course policies.

Acknowledgments. This assignment was created by Calvin Lin and Matthew Giordano. Special thanks to Ali Malik for inspiration.
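A final note on the question's request for extra test functions: below is a hedged example of a test that could be added to wikiracer/tests/test_search.py. The HTML snippet and the expected output are my own, derived only from the link rules in section 2.1 of the handout, so treat it as illustrative rather than authoritative.

from py_wikiracer.wikiracer import Parser  # already imported at the top of test_search.py

def test_parser_filters_links():
    # Hand-written HTML: only the first two hrefs are valid /wiki/ links.
    # Duplicates, links with disallowed characters, and external links are dropped.
    html = (
        '<a href="/wiki/Computer">ok</a>'
        '<a href="/wiki/Hippopotamus">ok</a>'
        '<a href="/wiki/Computer">duplicate</a>'
        '<a href="/wiki/Talk:Computer">disallowed colon</a>'
        '<a href="https://example.com/">external link</a>'
    )
    assert Parser.get_links_in_page(html) == ["/wiki/Computer", "/wiki/Hippopotamus"]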
Explanations and Answers 0

No answers posted
