Problem 1: Automatically Collect 10,000 Unique Documents From memphis.edu
Problem 1: Automatically collect 10,000 unique documents from memphis.edu. The documents must be proper after conversion to text (more than 50 valid tokens once saved as text). Collect only .html, .txt, and .pdf web files and then convert them to text, making sure that no presentation markup such as HTML tags is kept. You may use third-party tools to convert the original files to text. Your output should be a set of 10,000 text files (not html, txt, or pdf documents) of at least 50 textual tokens each. You must write your own code to collect the documents; do not use an existing or third-party crawler. Store the original URL of each proper file, as you will need it later when displaying results to the user.
Problem 2: Preprocess all the files using assignment #4, a Python program that preprocesses a collection of documents using the recommendations given in the Text Operations lecture. The input to the program is a directory containing the 10,000 unique text files collected in Problem 1; documents must be converted to text before they are used. Remove the following during preprocessing:
- digits
- punctuation
- stop words (use the generic list available at ...ir-websearch/papers/english.stopwords.txt)
- URLs and other HTML-like strings
- uppercase letters (convert to lowercase)
- morphological variations
Build an inverted index of index terms over the preprocessed files, using the raw term frequency (tf) in each document without normalizing it. Save the generated index, including the document frequency (df) of each term, in a file so that it can be retrieved later. Save all preprocessed documents in a single directory.
Paper for the Above Instruction
The task involves designing a comprehensive system to automatically collect, process, and index a large corpus of web documents from memphis.edu, ensuring that the data is suitable for subsequent information retrieval and analysis tasks. The project is divided into two main parts: document collection and preprocessing. Each phase requires meticulous implementation to meet specified criteria, including quality, format, and data integrity.
Part 1: Automated Collection of Web Documents
The first step in the project focuses on the extraction of 10,000 unique web documents from memphis.edu. This process entails developing a custom web crawler that can systematically navigate the site, identify relevant web files, and convert them into plaintext documents. Crucially, the crawler must be built from scratch—using no existing third-party crawling libraries—to fulfill the assignment's requirement for originality and technical depth.
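As a rough illustration, such a crawl loop might look like the sketch below, which relies only on Python's standard library (urllib and html.parser) so that no third-party crawler is involved. The seed URL, the LinkExtractor helper, and the crawl function are assumptions made for this example rather than part of the assignment.

```python
# Minimal breadth-first crawl sketch restricted to memphis.edu (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed="https://www.memphis.edu/", limit=10000):
    queue, seen, collected = deque([seed]), {seed}, []
    while queue and len(collected) < limit:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                content_type = resp.headers.get_content_type()
                body = resp.read()
        except Exception:
            continue  # skip unreachable or malformed pages
        if content_type in ("text/html", "text/plain", "application/pdf"):
            collected.append((url, content_type, body))
        if content_type == "text/html":
            parser = LinkExtractor()
            parser.feed(body.decode("utf-8", errors="ignore"))
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc.endswith("memphis.edu") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
    return collected
```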
The crawler should target specific file types (HTML, TXT, and PDF) and derive plain text from each. During extraction, presentation elements such as HTML tags must be stripped out so that only meaningful textual content is retained. Additionally, the crawler must verify that each extracted text contains at least 50 valid tokens, ensuring the documents are substantial enough for analysis; files that do not meet this threshold should be discarded.
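One way to handle the tag stripping and the 50-token check is sketched below. The HTML-to-text step uses only the standard library, while the pdf_to_text helper is a placeholder that assumes an external converter such as pdftotext is installed.

```python
# Sketch of converting a fetched page to plain text and enforcing the 50-token minimum.
import re
import subprocess
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text while ignoring script/style blocks and all tags."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)

def pdf_to_text(pdf_bytes):
    # Placeholder: delegate to an external converter such as pdftotext (assumed installed).
    proc = subprocess.run(["pdftotext", "-", "-"], input=pdf_bytes, capture_output=True)
    return proc.stdout.decode("utf-8", errors="ignore")

def is_proper(text, minimum=50):
    """Keep a document only if it has at least `minimum` word-like tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return len(tokens) >= minimum
```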
Each valid document must be saved as a text file, with a strict output count of 10,000 such files. Simultaneously, the system should record the original URL of each document, storing this metadata to facilitate future referencing and display. This ensures traceability and supports subsequent tasks such as corpus analysis or user query relevance assessment.
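A minimal sketch of this bookkeeping follows, assuming documents are numbered sequentially and the URL mapping is kept in a tab-separated file named url_map.tsv (both assumptions made only for illustration).

```python
# Sketch: write each accepted document to its own text file and log its source URL.
import csv
from pathlib import Path

def save_corpus(documents, out_dir="corpus"):
    """documents is an iterable of (url, text) pairs that already passed validation."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    with open(out / "url_map.tsv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for doc_id, (url, text) in enumerate(documents, start=1):
            filename = f"doc{doc_id:05d}.txt"
            (out / filename).write_text(text, encoding="utf-8")
            writer.writerow([filename, url])  # keeps traceability for later result display
```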
Part 2: Preprocessing and Indexing of Collected Documents
Following collection, the second phase involves processing the gathered documents through a detailed preprocessing pipeline. This pipeline is based on the guidelines of the prior assignment (#4), which applies the techniques recommended in the "Text Operations" lecture.
A dedicated Python program should be developed that takes as input a directory containing the 10,000 text documents. Preprocessing steps include the following (a sketch of the pipeline appears after the list):
- Removing digits and punctuation
- Eliminating stop words, utilizing the provided stopword list located at ...ir-websearch/papers/english.stopwords.txt
- Stripping URLs and HTML-like strings
- Converting all text to lowercase
- Removing morphological variations (e.g., stemming or lemmatization)
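The sketch below illustrates one way these steps could fit together; it assumes the stop-word list has already been downloaded locally as english.stopwords.txt and uses NLTK's PorterStemmer as one possible way to reduce morphological variation.

```python
# Sketch of the per-document preprocessing pipeline (illustrative only).
import re
from pathlib import Path
from nltk.stem import PorterStemmer  # assumed choice for stemming

stemmer = PorterStemmer()
stopwords = set(Path("english.stopwords.txt").read_text().split())

def preprocess(text):
    text = text.lower()                                   # case folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML-like strings
    text = re.sub(r"\d+", " ", text)                      # remove digits
    text = re.sub(r"[^\w\s]", " ", text)                  # remove punctuation
    tokens = [t for t in text.split() if t not in stopwords]
    return [stemmer.stem(t) for t in tokens]              # reduce morphological variants

def preprocess_directory(in_dir="corpus", out_dir="preprocessed"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in Path(in_dir).glob("*.txt"):
        tokens = preprocess(path.read_text(encoding="utf-8", errors="ignore"))
        (out / path.name).write_text(" ".join(tokens), encoding="utf-8")
```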
The output of this process should be cleaned, normalized text files stored in a single directory. Additionally, an inverted index should be constructed, capturing the raw term frequency (tf) within each document—without normalization. The index must also include document frequency (df) counts for each term, enabling later retrieval and analysis. It is essential to save this index in a file structure that permits future access for tasks such as search queries or statistical analysis.
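For illustration, the inverted index with raw term frequencies and document frequencies could be built and persisted roughly as follows; storing it as JSON is an assumption here, and any format that allows later retrieval would serve equally well.

```python
# Sketch: build an inverted index with raw tf per document and df per term, then save it.
import json
from collections import Counter, defaultdict
from pathlib import Path

def build_index(preprocessed_dir="preprocessed"):
    index = defaultdict(dict)  # term -> {doc_name: raw term frequency}
    for path in Path(preprocessed_dir).glob("*.txt"):
        counts = Counter(path.read_text(encoding="utf-8").split())
        for term, tf in counts.items():
            index[term][path.name] = tf  # raw tf, deliberately left unnormalized
    # document frequency is simply the number of postings for each term
    return {term: {"df": len(postings), "postings": postings}
            for term, postings in index.items()}

def save_index(index, filename="inverted_index.json"):
    Path(filename).write_text(json.dumps(index), encoding="utf-8")

def load_index(filename="inverted_index.json"):
    return json.loads(Path(filename).read_text(encoding="utf-8"))
```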
Proper documentation of the collected data and the preprocessing results, including the indexing, will support subsequent information retrieval experiments and demonstrate the effectiveness of the developed system.