Implement a Complete Search Engine
Goal: Implement a complete search engine.

Milestones Overview:
- Milestone #1: Produce an initial index for the corpus and a basic retrieval component.
- Milestone #2: Complete search system.

PROJECT: SEARCH ENGINE

Corpus: all ICS web pages. We will provide you with the crawled data as a zip file (webpages_raw.zip). This contains the downloaded content of the ICS web pages that were crawled in a previous quarter. You are expected to build your search engine index from this data.

Main challenges: full HTML parsing, file/DB handling, and handling user input (via a command line, a desktop GUI application, or a web interface).

COMPONENT 1 - INDEX: Create an inverted index for the entire corpus given to you.
You can either use a database to store your index (MongoDB, Redis, and memcached are some examples) or store the index in a file; you are free to choose an approach here. The index should store more than just a simple list of documents in which each token occurs. At the very least, your index should store the TF-IDF of every term/document pair.

Sample index (note: this is a simplistic example provided for your understanding; please do not take it as the expected index format, as a good inverted index will store more information than this):

Index structure: token – docId1, tf-idf1 ; docId2, tf-idf2
Example: informatics – doc_1, 5 ; doc_2, 10 ; doc_3, 7

You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. For example, words in bold or in headings (h1, h2, h3) could be treated as more important than other words. Such metadata is useful and could be added to your inverted index.
Optional (1 point for each metadata item, up to 2 points max): Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more metadata to your index if you think it will help. For this, instead of storing a simple TF-IDF count for every page, you can store more information related to the page (e.g., the positions of words in the page). To hold this information, you need to design your index so that it can store and retrieve all of this metadata efficiently. Your index lookup during search should not be horribly slow, so pay attention to the structure of your index.
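As a concrete illustration of the index layout described above, here is a minimal Python sketch of one way to hold postings with TF-IDF and optional metadata. The Posting class, its field names, and the defaultdict layout are illustrative choices, not a prescribed format.

```python
# A minimal sketch of one possible in-memory index layout; the Posting
# class and its field names are illustrative, not a required format.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Posting:
    doc_id: str
    tf_idf: float = 0.0
    positions: list = field(default_factory=list)  # optional metadata
    tag_weight: float = 1.0  # e.g. boosted for words in h1/h2/h3 or <b>

# token -> list of postings
index: dict = defaultdict(list)
index["informatics"].append(Posting("doc_1", tf_idf=5.0, positions=[12, 97]))
```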
COMPONENT 2 – SEARCH AND RETRIEVE: Your program should prompt the user for a query. This doesn’t need to be a web interface; it can be a console prompt. At query time, your program will look up your index, perform some calculations (see Ranking below), and return the ranked list of pages that are relevant to the query.

COMPONENT 3 - RANKING: At the very least, your ranking formula should include TF-IDF scoring, but you should feel free to add components to this formula if you think they improve the retrieval. Optional (1 point for each parameter, up to 2 points max): Extra credit will be given if your ranking formula includes parameters other than TF-IDF, drawn from the techniques discussed in class.
Paper for the Above Instructions
This project aims to develop a comprehensive search engine capable of indexing and retrieving web page data efficiently, leveraging the provided dataset. The process involves three interconnected components: inverted index creation, search and retrieval, and ranking of results. Each stage presents distinct challenges and opportunities for optimization, demanding a careful balance of data structure design, algorithmic efficiency, and user interaction.
Creating the Index
The initial step involves parsing the HTML content of crawled web pages to construct an inverted index that maps terms to document identifiers. This inverted index forms the backbone for fast retrieval, enabling the system to quickly identify documents that contain specific search terms. Critical considerations include parsing broken HTML, avoiding unnecessary dependencies, and optimizing storage for quick lookups.
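Because the crawled pages may contain malformed markup, a tolerant parser is helpful. The sketch below uses the third-party beautifulsoup4 package with Python's built-in html.parser, which copes with unclosed and mismatched tags; the extract_text helper is an illustrative name, not part of the assignment.

```python
# A sketch of tolerant HTML parsing using BeautifulSoup
# (pip install beautifulsoup4); "html.parser" tolerates broken markup.
from bs4 import BeautifulSoup

def extract_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove script and style content, which should not be indexed.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ")
```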
The choice of storage strategy for the inverted index significantly influences the system’s performance. Options include document-oriented NoSQL databases like MongoDB, in-memory systems like Redis, or even flat files. The index should go beyond mere term-document mappings and include metadata such as TF-IDF scores, word positions, and importance weights derived from HTML structure (e.g., headings, bold text). Including such metadata improves retrieval relevance.
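As a sketch of the flat-file option, and assuming the Posting layout shown earlier, the index can be serialized to JSON; a database-backed variant would instead write one record per token. The file name here is illustrative.

```python
# Persist the in-memory index to a flat JSON file; assumes the Posting
# layout sketched earlier. A MongoDB variant would insert one document
# per token instead of writing a file.
import json

def save_index(index: dict, path: str = "index.json") -> None:
    serializable = {
        token: [[p.doc_id, p.tf_idf, p.positions] for p in postings]
        for token, postings in index.items()
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serializable, f)
```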
Implementing Search and Retrieval
Once the index is built, implementing a user interface for queries is essential. For simplicity and flexibility, a command-line interface is recommended, which prompts users for search input and presents relevant results. The search function should locate documents containing the query terms by consulting the inverted index, then rank the documents based on a scoring mechanism.
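A minimal console loop might look like the sketch below; search is a placeholder for a function that consults the inverted index and returns (url, score) pairs sorted best first.

```python
# A minimal console front end; search() is a placeholder for a function
# that looks up the index and returns (url, score) pairs, best first.
def run_console(search) -> None:
    while True:
        query = input("search> ").strip()
        if not query:  # empty input exits the loop
            break
        for rank, (url, score) in enumerate(search(query)[:10], start=1):
            print(f"{rank:2d}. {score:8.4f}  {url}")
```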
Designing the Ranking Mechanism
The core ranking approach utilizes TF-IDF scores, which quantify term importance both within individual documents and across the corpus. This relevance metric forms a solid foundation; however, additional factors such as page importance (PageRank), term proximity, and the position of terms in the document (e.g., in headings) can further refine rankings. Composite scoring formulas that combine these signals can significantly enhance the quality of search results.
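One common log-scaled formulation is tf-idf(t, d) = (1 + log10(tf)) * log10(N/df), where tf is the term's frequency in the document, df is its document frequency, and N is the corpus size. The sketch below implements this variant and shows one hypothetical way to fold extra signals into a composite score; alpha, tag_weight, and pagerank are illustrative parameters, not values the assignment specifies.

```python
# Log-scaled TF-IDF plus a hypothetical composite ranking formula.
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def composite_score(tfidf: float, tag_weight: float = 1.0,
                    pagerank: float = 0.0, alpha: float = 0.85) -> float:
    # Blend tag-weighted tf-idf with a link-based importance term.
    return alpha * tag_weight * tfidf + (1 - alpha) * pagerank
```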
Evaluation and Optimization
Evaluating the system involves testing with predefined queries, analyzing retrieved URLs, and verifying the plausibility of results. Key metrics include the number of documents retrieved per query, the relevance of their ordering, and the system’s responsiveness. Performance profiling can highlight bottlenecks, prompting attention to more efficient data storage, optimized query processing, and indexing strategies.
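A lightweight harness that replays a fixed query set and reports result counts and latencies, as sketched below, is often enough for this kind of check; the query list and the search placeholder are illustrative.

```python
# Replay a fixed query set and report result counts and response times;
# search() is the same placeholder used in the console sketch above.
import time

TEST_QUERIES = ["machine learning", "graduate courses", "computer science"]

def evaluate(search) -> None:
    for q in TEST_QUERIES:
        start = time.perf_counter()
        results = search(q)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{q!r}: {len(results)} results in {elapsed_ms:.1f} ms")
```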
Challenges and Future Directions
Handling large-scale data, such as nearly 37,500 URLs, demands effective indexing strategies to avoid slowdowns. The HTML’s potentially broken structure requires robust parsers capable of handling malformed content. Extending the system with features like snippet generation, link-based importance measures, and user feedback mechanisms can further improve relevance and usability.
Conclusion
Building a complete search engine is a multidisciplinary task involving data parsing, information retrieval, and relevance ranking. Thoughtful design choices—like metadata enrichment, storage optimization, and advanced ranking techniques—are crucial for creating a system that is both efficient and effective. This development not only provides a practical tool for web navigation but also offers insights into the complexities of search engine architecture and the importance of scalable design considerations.