Problem 1: Automatically Collect 10,000 Unique Documents From memphis.edu
Problem 1: Automatically collect 10,000 unique documents from memphis.edu. The documents must be proper after conversion to text (more than 50 valid tokens once saved as text). Collect only .html, .txt, and .pdf web files and then convert them to text, making sure that no presentation markup such as HTML tags is kept. You may use third-party tools to convert the original files to text. Your output should be a set of 10,000 text files (not html, txt, or pdf documents) of at least 50 textual tokens each. You must write your own code to collect the documents; do not use an existing or third-party crawler. Store the original URL of each proper file, as you will need it later when displaying results to the user.
Problem 2: Preprocess all the files using assignment #4, a Python program that preprocesses a collection of documents using the recommendations given in the Text Operations lecture. The input to the program is a directory containing the 10,000 unique text files collected in Problem 1; documents must be converted to text before they are used. Remove the following during preprocessing:
- digits
- punctuation
- stop words (use the generic list available at ...ir-websearch/papers/english.stopwords.txt)
- URLs and other HTML-like strings
- uppercase letters (convert to lowercase)
- morphological variations
Build an inverted index of index terms over the preprocessed files, using the raw term frequency (tf) in each document without normalizing it. Save the generated index, including the document frequency (df) of each term, in a file so that it can be retrieved later. Save all preprocessed documents in a single directory.
Paper for the Above Instruction
The task involves designing a comprehensive system to automatically collect, process, and index a large corpus of web documents from memphis.edu, ensuring that the data is suitable for subsequent information retrieval and analysis tasks. The project is divided into two main parts: document collection and preprocessing. Each phase requires meticulous implementation to meet specified criteria, including quality, format, and data integrity.
Part 1: Automated Collection of Web Documents
The first step in the project focuses on the extraction of 10,000 unique web documents from memphis.edu. This process entails developing a custom web crawler that can systematically navigate the site, identify relevant web files, and convert them into plaintext documents. Crucially, the crawler must be built from scratch—using no existing third-party crawling libraries—to fulfill the assignment's requirement for originality and technical depth.
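As a rough illustration, such a crawl loop might look like the sketch below, which relies only on Python's standard library (urllib and html.parser) so that no third-party crawler is involved. The seed URL, the LinkExtractor helper, and the crawl function are assumptions made for this example rather than part of the assignment.

```python
# Minimal breadth-first crawl sketch restricted to memphis.edu (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed="https://www.memphis.edu/", limit=10000):
    queue, seen, collected = deque([seed]), {seed}, []
    while queue and len(collected) < limit:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                content_type = resp.headers.get_content_type()
                body = resp.read()
        except Exception:
            continue  # skip unreachable or malformed pages
        if content_type in ("text/html", "text/plain", "application/pdf"):
            collected.append((url, content_type, body))
        if content_type == "text/html":
            parser = LinkExtractor()
            parser.feed(body.decode("utf-8", errors="ignore"))
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc.endswith("memphis.edu") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
    return collected
```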
The crawler should target specific file types (HTML, TXT, and PDF) and derive plain text from each. During extraction, presentation elements such as HTML tags must be stripped out so that only meaningful textual content is retained. Additionally, the crawler must verify that each extracted text contains at least 50 valid tokens, ensuring the documents are substantial enough for analysis; files that do not meet this threshold should be discarded.
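One way to handle the tag stripping and the 50-token check is sketched below. The HTML-to-text step uses only the standard library, while the pdf_to_text helper is a placeholder that assumes an external converter such as pdftotext is installed.

```python
# Sketch of converting a fetched page to plain text and enforcing the 50-token minimum.
import re
import subprocess
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text while ignoring script/style blocks and all tags."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)

def pdf_to_text(pdf_bytes):
    # Placeholder: delegate to an external converter such as pdftotext (assumed installed).
    proc = subprocess.run(["pdftotext", "-", "-"], input=pdf_bytes, capture_output=True)
    return proc.stdout.decode("utf-8", errors="ignore")

def is_proper(text, minimum=50):
    """Keep a document only if it has at least `minimum` word-like tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return len(tokens) >= minimum
```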
Each valid document must be saved as a text file, with a strict output count of 10,000 such files. Simultaneously, the system should record the original URL of each document, storing this metadata to facilitate future referencing and display. This ensures traceability and supports subsequent tasks such as corpus analysis or user query relevance assessment.
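A minimal sketch of this bookkeeping follows, assuming documents are numbered sequentially and the URL mapping is kept in a tab-separated file named url_map.tsv (both assumptions made only for illustration).

```python
# Sketch: write each accepted document to its own text file and log its source URL.
import csv
from pathlib import Path

def save_corpus(documents, out_dir="corpus"):
    """documents is an iterable of (url, text) pairs that already passed validation."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    with open(out / "url_map.tsv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for doc_id, (url, text) in enumerate(documents, start=1):
            filename = f"doc{doc_id:05d}.txt"
            (out / filename).write_text(text, encoding="utf-8")
            writer.writerow([filename, url])  # keeps traceability for later result display
```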
Part 2: Preprocessing and Indexing of Collected Documents
Following collection, the second phase involves processing the gathered documents through a detailed preprocessing pipeline. This pipeline is based on the guidelines of the prior assignment (#4), which applies the techniques recommended in the "Text Operations" lecture.
A dedicated Python program should be developed that takes as input a directory containing the 10,000 text documents. Preprocessing steps include the following (a sketch of the pipeline appears after the list):
- Removing digits and punctuation
- Eliminating stop words, utilizing the provided stopword list located at ...ir-websearch/papers/english.stopwords.txt
- Stripping URLs and HTML-like strings
- Converting all text to lowercase
- Removing morphological variations (e.g., stemming or lemmatization)
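The sketch below illustrates one way these steps could fit together; it assumes the stop-word list has already been downloaded locally as english.stopwords.txt and uses NLTK's PorterStemmer as one possible way to reduce morphological variation.

```python
# Sketch of the per-document preprocessing pipeline (illustrative only).
import re
from pathlib import Path
from nltk.stem import PorterStemmer  # assumed choice for stemming

stemmer = PorterStemmer()
stopwords = set(Path("english.stopwords.txt").read_text().split())

def preprocess(text):
    text = text.lower()                                   # case folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML-like strings
    text = re.sub(r"\d+", " ", text)                      # remove digits
    text = re.sub(r"[^\w\s]", " ", text)                  # remove punctuation
    tokens = [t for t in text.split() if t not in stopwords]
    return [stemmer.stem(t) for t in tokens]              # reduce morphological variants

def preprocess_directory(in_dir="corpus", out_dir="preprocessed"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in Path(in_dir).glob("*.txt"):
        tokens = preprocess(path.read_text(encoding="utf-8", errors="ignore"))
        (out / path.name).write_text(" ".join(tokens), encoding="utf-8")
```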
The output of this process should be cleaned, normalized text files stored in a single directory. Additionally, an inverted index should be constructed, capturing the raw term frequency (tf) within each document—without normalization. The index must also include document frequency (df) counts for each term, enabling later retrieval and analysis. It is essential to save this index in a file structure that permits future access for tasks such as search queries or statistical analysis.
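For illustration, the inverted index with raw term frequencies and document frequencies could be built and persisted roughly as follows; storing it as JSON is an assumption here, and any format that allows later retrieval would serve equally well.

```python
# Sketch: build an inverted index with raw tf per document and df per term, then save it.
import json
from collections import Counter, defaultdict
from pathlib import Path

def build_index(preprocessed_dir="preprocessed"):
    index = defaultdict(dict)  # term -> {doc_name: raw term frequency}
    for path in Path(preprocessed_dir).glob("*.txt"):
        counts = Counter(path.read_text(encoding="utf-8").split())
        for term, tf in counts.items():
            index[term][path.name] = tf  # raw tf, deliberately left unnormalized
    # document frequency is simply the number of postings for each term
    return {term: {"df": len(postings), "postings": postings}
            for term, postings in index.items()}

def save_index(index, filename="inverted_index.json"):
    Path(filename).write_text(json.dumps(index), encoding="utf-8")

def load_index(filename="inverted_index.json"):
    return json.loads(Path(filename).read_text(encoding="utf-8"))
```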
Proper documentation of the collected data and the preprocessing results, including the indexing, will support subsequent information retrieval experiments and demonstrate the effectiveness of the developed system.