Java Website Crawler Assignment Project


Question: Java Website Crawler Assignment Project


Overall, the project is to obtain data from a set of webpages on the website assigned to you. Each student will have a different website. If you do not want the site assigned to you, provide three website URLs that are not on the list emailed to you and are not Amazon, Craigslist, eBay, or AbeBooks. I will then choose one of the three alternatives, provided that it supports searching, returns results, and has a results page that can be read in the source HTML.

Requirement1. Online (Search) function. The user should be able to search for and obtain data online from your approved website. This includes the ability to download at least one picture or icon found on the website. We will discuss the code for accessing webpages and downloading images in upcoming lectures. Decide, within the context of your URL, what data your user can search for or obtain from the website, and to what extent. GUIs should be provided for this purpose. OPTIONS: If your program implements a search function for data provided by the website, and the site has an advanced search feature, implementing the advanced search completely via your own GUI would be an acceptable innovation.
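A minimal sketch of the online part, assuming the site accepts a plain GET query string and serves images at ordinary URLs. The class name `SiteFetcher`, the example base address, and the parameter name `q` are hypothetical; the real ones depend on the assigned website.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: builds a search URL and downloads one resource.
final class SiteFetcher {

    // Appends a single URL-encoded query parameter to a base search address.
    static String buildSearchUrl(String base, String param, String term)
            throws UnsupportedEncodingException {
        return base + "?" + param + "=" + URLEncoder.encode(term, "UTF-8");
    }

    // Copies the bytes behind a URL (a results page or an image) into a
    // local file and returns the number of bytes written.
    static long download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The same `download` method can save a results page to a temporary file or an image to a local folder, since both are just byte streams behind a URL.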

Requirement2. Storage of Data. All data obtained from or for the user should be stored in a storage structure. The program must provide hash-based storage for this purpose. When data is entered into your storage, the user name and a timestamp should be recorded with each item of data obtained in Requirement#1. The user should have the option to delete data that they requested and that was previously stored. In addition, the user should be able to modify the data provided for the user profile (Requirement#8).
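One possible shape for the hash-based storage, sketched with `java.util.HashMap`. The class names `StoredItem` and `HashStore`, the choice of a string key, and the rule that only the owner may delete are assumptions made for illustration, not part of the assignment text.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical record of one stored item; field names are illustrative.
final class StoredItem {
    final String key;        // unique id, for example the item's URL
    final String data;       // text extracted from the page
    final String userName;   // user who requested the data
    final Instant timestamp; // moment the item was stored

    StoredItem(String key, String data, String userName, Instant timestamp) {
        this.key = key;
        this.data = data;
        this.userName = userName;
        this.timestamp = timestamp;
    }
}

final class HashStore {
    private final Map<String, StoredItem> items = new HashMap<>();

    // Stores the data together with the user name and the current time.
    void put(String key, String data, String userName) {
        items.put(key, new StoredItem(key, data, userName, Instant.now()));
    }

    StoredItem get(String key) {
        return items.get(key);
    }

    // Deletes the item only if it exists and belongs to the requesting user.
    boolean delete(String key, String userName) {
        StoredItem item = items.get(key);
        if (item == null || !item.userName.equals(userName)) {
            return false;
        }
        items.remove(key);
        return true;
    }

    int size() {
        return items.size();
    }
}
```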

Requirement3. Offline queries. (By Requirement#4 next, the data must be maintained between executions of the program, so if you shut down the program and start it up again, the data is not lost.) The user should be able to conduct offline queries of your data storage via a GUI. The extent of these queries depends on what you have stored; i.e., the queries will target the specific information you have stored about each item and/or the transactional history in your storage. Since a timestamp is associated with each item entered into your storage, the user can ask for data obtained within a certain time range. The user should only be able to retrieve data that they requested and that was entered into the data storage on their behalf. The admin can retrieve data from specific users or from all users. The user should be allowed to print the results of the queries to a local file (presumably through a GUI). OPTIONS: A nifty option would be to allow the user to have a report of the obtained data emailed to them. This would count as an innovation.
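A time-range query with a per-user filter could look roughly like this. `QueryEntry` and `OfflineQuery` are hypothetical names, the half-open range (start included, end excluded) is a design choice, and passing `null` as the user is an assumed way to express the admin's "all users" case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stored entry: who requested it, when, and what was stored.
final class QueryEntry {
    final String user;
    final Instant time;
    final String data;

    QueryEntry(String user, Instant time, String data) {
        this.user = user;
        this.time = time;
        this.data = data;
    }
}

final class OfflineQuery {

    // Returns entries with from <= time < to that belong to the given user.
    // A null user matches every user (assumed admin case).
    static List<QueryEntry> inRange(Iterable<QueryEntry> all, String user,
                                    Instant from, Instant to) {
        List<QueryEntry> out = new ArrayList<>();
        for (QueryEntry e : all) {
            boolean userOk = user == null || e.user.equals(user);
            boolean timeOk = !e.time.isBefore(from) && e.time.isBefore(to);
            if (userOk && timeOk) {
                out.add(e);
            }
        }
        return out;
    }

    // Writes one tab-separated line per result to a local file.
    static void writeReport(List<QueryEntry> results, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (QueryEntry e : results) {
            lines.add(e.time + "\t" + e.user + "\t" + e.data);
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```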

Requirement4. Persistence of Data. Persistence means that the data is preserved even if the system crashes. Only the privileged user/admin should be able, from a GUI, to reconstruct the storage from backup data (possibly the Transaction Log from Requirement#7, assuming the log holds enough information to carry out this task). There are a number of ways to implement persistence in Java. You can use any method that allows reconstruction of the data system.
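Java's built-in object serialization is one of the available persistence methods. The sketch below, with the hypothetical class `SnapshotStore`, saves a whole `HashMap` to a file and reads it back; it assumes the keys and values are serializable (plain strings here) and is only one option next to replaying the transaction log.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical snapshot helper based on standard Java serialization.
final class SnapshotStore {

    // Writes the whole map to the given file, replacing earlier content.
    static void save(HashMap<String, String> data, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(data);
        }
    }

    // Reads a map previously written by save().
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

A snapshot alone only recovers the state at the last save, so a crash between saves would still lose data unless the log is replayed on top of it.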

Requirement5. Processing individual pages. This requirement dovetails with Requirement#1. The data obtained from the website will be embedded within the webpages provided by the website. These pages are in HTML. You are to store the HTML in a temporary file and then process it. You must write the code yourself and cannot use a third-party HTML API or parser. We will discuss an approach using regex in upcoming lectures. You will be processing two different types of webpages from the site. The first is the "initial results" page (first level). This is the page that the site's server sends in response to your query. It contains a summary of the individual results that match your query, and embedded in it are links to the actual individual results pages. The latter are the second type (second level) of webpage you will be processing. From them you will extract pertinent data (text) to be stored in your data storage, and images that will be stored locally, with the names and locations of the image files stored in the database. Which data should be stored depends, of course, on which website you are assigned. OPTIONS: The server typically provides a number of initial results pages in response to your query. Have your system go through all of them (if requested, but be careful when testing live, as the site may log you out), some of them (the user chooses how many), or up to a predefined limit (the system sets the limit). Then process the individual (second level) results pages, as described above.
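The regex approach from the lectures is not reproduced here; the following is only a rough sketch of one way to pull link and image addresses out of HTML with `java.util.regex`. The class name `HtmlScraper` and both patterns are assumptions, they handle quoted attribute values only, and real pages will likely need site-specific patterns. The HTML string would come from the temporary file, for example via `Files.readAllBytes`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based extractor; no third-party parser involved.
final class HtmlScraper {

    // Matches href="..." or href='...' inside an <a ...> tag.
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Matches src="..." or src='...' inside an <img ...> tag.
    private static final Pattern IMAGE = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Addresses of second-level pages linked from a results page.
    static List<String> links(String html) {
        return collect(LINK, html);
    }

    // Addresses of images referenced by the page.
    static List<String> images(String html) {
        return collect(IMAGE, html);
    }

    // Collects capture group 2 (the attribute value) of every match.
    private static List<String> collect(Pattern pattern, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }
}
```

Relative addresses such as `/item/1` still have to be resolved against the site's base URL before they can be fetched.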

Requirement6. GUIs. Your system should include appropriate GUIs to enhance the user experience. The lines of code for these GUIs will not count towards Requirement#10. You may use JavaFX to enhance the user's experience. Development with GUIs will not be considered an innovation.

Requirement7. Transaction Log. Every transaction that interacts with the data storage should be written to a text file (the log). Each entry contains the name of the transaction (e.g. INSERT, DELETE, MODIFY) with its parameters (data values), the user who requested it, and the date/timestamp for security purposes. If a transaction doesn't change the data storage, it need not be logged for persistence, but you still need to record it for security logging purposes. The admin is the only one who can directly interact with this log.
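An append-only text log can be sketched as below. The line layout (timestamp, user, transaction name, parameters, separated by `|`) and the class name `TransactionLog` are assumptions; the sketch also assumes that values contain neither `|` nor `,`, so a real log would need escaping.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;
import java.util.Collections;

// Hypothetical append-only transaction log, one line per transaction.
final class TransactionLog {

    // Builds one log line: timestamp|user|name|param1,param2,...
    static String format(Instant when, String user, String name, String... params) {
        return when + "|" + user + "|" + name + "|" + String.join(",", params);
    }

    // Appends the line to the log file, creating the file if needed.
    static void append(Path log, String line) throws IOException {
        Files.write(log, Collections.singletonList(line), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

If every INSERT, DELETE, and MODIFY line carries its full parameters, replaying the file from the top is one way to rebuild the storage.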

Requirement8. User Registration. Because security is such an important aspect of modern software engineering, you are to implement a user account system, with a special Administrator account that has access to the transaction log and can rebuild the system from the persisted data. A Guest account will allow anyone to search but not to store data. A User account can, in addition, store data and interact with the data obtained for that user in the storage structure.
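The three account types can be modeled as roles with fixed permission sets. The enum names and the list of actions below are hypothetical and only mirror the capabilities named in the requirements; password handling and registration are left out.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical list of things an account may be allowed to do.
enum Action { SEARCH, STORE, QUERY_OWN, QUERY_ALL, READ_LOG, REBUILD }

// Guest: search only. User: also store and query own data. Admin: everything.
enum Role {
    GUEST(EnumSet.of(Action.SEARCH)),
    USER(EnumSet.of(Action.SEARCH, Action.STORE, Action.QUERY_OWN)),
    ADMIN(EnumSet.allOf(Action.class));

    private final Set<Action> allowed;

    Role(Set<Action> allowed) {
        this.allowed = allowed;
    }

    // True if an account with this role may perform the action.
    boolean can(Action action) {
        return allowed.contains(action);
    }
}
```

A GUI handler would check, for instance, `role.can(Action.READ_LOG)` before opening the transaction log view.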

CHOOSE TWO NEW FEATURES TO IMPLEMENT FOR THE ABOVE SYSTEM ON YOUR OWN.

Requirement1. Innovation#1: A nifty option would be to allow the user to have a report of the obtained data emailed to them. This would count as an innovation.

Requirement2. Innovation#2: Offer a SQL database, in addition to the hash table, as an option for saving the data.


