Overview Of The Assignment In This Assignment You Are Going

Question

In this assignment, you will implement three streaming algorithms: Bloom filtering, Flajolet-Martin, and reservoir sampling.

The first task applies Bloom filtering to a static (off-line) Yelp business dataset to estimate whether a given city has been seen before. The second task simulates a data stream from Yelp data and implements the Flajolet-Martin algorithm with Spark Streaming to estimate the number of unique cities over time. The third task analyzes a live Twitter stream to identify popular tags using fixed-size reservoir sampling, maintaining a sample of 100 tweets for trend analysis.

You must use Python and Spark for all three implementations; a bonus is offered for correct Scala versions. The environment includes Python 3.6, Scala 2.11, and Spark 2.3.2. All code must be written independently: copying from web sources or peers is treated as plagiarism and will be penalized. You will submit six scripts (task1.py, task2.py, task3.py, task1.scala, task2.scala, task3.scala) and a jar file if Scala code is provided. Data files for each task are to be downloaded from Vocareum, and Twitter API credentials must be set up for task 3.

Dr. Jack HW Helper · Accepted Answer

Handling and analyzing large-scale streaming data efficiently is a core requirement of modern data systems. This assignment covers three fundamental algorithms used in streaming analytics (Bloom filtering, Flajolet-Martin, and reservoir sampling) across scenarios that range from off-line batch processing to real-time stream analysis.

Bloom Filtering for Business Data

Bloom filters provide space-efficient probabilistic membership testing, which makes them suitable for checking whether a city has appeared before in a large dataset. A Bloom filter can report a false positive but never a false negative. For this task, a Bloom filter with a fixed size (200 bits) is employed. Choosing fixed hash functions, such as linear hash functions with predetermined coefficients, is crucial to ensure consistent results across runs. The dataset of Yelp businesses is split into two files: one is used to build the filter from known cities, and the other serves as input for evaluation. The goal is to calculate the false positive rate (FPR) at each time interval and store the timestamp and FPR in a CSV file. This process demonstrates how Bloom filters can be applied to off-line data and how estimation accuracy evolves over time.

Flajolet-Martin Algorithm for Unique Counts in Streaming Data

The Flajolet-Martin algorithm estimates the number of distinct elements in a stream using probabilistic counting. Each of several hash functions maps an element to a bit pattern, and the position of the least significant '1' bit (equivalently, the number of trailing zeros) serves as the basis of the estimate: if R is the largest number of trailing zeros observed, the distinct count is estimated as roughly 2^R. Hash functions are chosen to be as random and independent as possible, and the algorithm is applied within a sliding window (30 seconds, with 5-second batches). The results include the actual counts of unique cities and the estimates derived from the algorithm, recorded with timestamps. This example shows how probabilistic algorithms reduce memory usage and computational cost in streaming contexts.

Reservoir Sampling for Twitter Tag Analysis

Analyzing live social media data requires sampling methods th
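The three techniques described above can be sketched in plain Python, without Spark, to make the mechanics concrete. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the hash coefficients in `HASH_PARAMS`, the modulus `PRIME`, the hash range `m`, the use of three hash functions, and the choice to average the Flajolet-Martin estimates are all hypothetical. Only the 200-bit filter size and the reservoir size of 100 come from the text.

```python
import random

NUM_BITS = 200  # Bloom filter size stated in the text

# Hypothetical (a, b) coefficients for linear hashes h(x) = ((a*x + b) % PRIME) % m.
HASH_PARAMS = [(17, 43), (31, 101), (73, 7)]
PRIME = 1000003  # assumed prime modulus, not taken from the assignment


def to_int(text):
    """Deterministically map a string (e.g. a city name) to an integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def linear_hashes(text, m):
    """Return one hash value in [0, m) per coefficient pair."""
    x = to_int(text)
    return [((a * x + b) % PRIME) % m for a, b in HASH_PARAMS]


class BloomFilter:
    def __init__(self, num_bits=NUM_BITS):
        self.num_bits = num_bits
        self.bits = [0] * num_bits

    def add(self, item):
        for pos in linear_hashes(item, self.num_bits):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in linear_hashes(item, self.num_bits))


def trailing_zeros(n):
    """Number of trailing zero bits; 0 is treated as having none (a convention)."""
    if n == 0:
        return 0
    return (n & -n).bit_length() - 1


def flajolet_martin(items, m=2 ** 16):
    """Estimate the distinct count as the mean of 2^R over the hash functions.

    Averaging is one simple way to combine estimates; it is an assumption here.
    """
    max_r = [0] * len(HASH_PARAMS)
    for item in items:
        for i, h in enumerate(linear_hashes(item, m)):
            max_r[i] = max(max_r[i], trailing_zeros(h))
    estimates = [2 ** r for r in max_r]
    return sum(estimates) / len(estimates)


def reservoir_sample(stream, k=100, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)           # fill the reservoir first
        elif rng.random() < k / n:           # keep the n-th item with probability k/n
            reservoir[rng.randrange(k)] = item  # evict a uniformly chosen slot
    return reservoir
```

In a Spark Streaming job, `flajolet_martin` would be applied to the city names collected in each window and `reservoir_sample` to the incoming tweets; how that wiring is done depends on the assignment's starter code, which is not shown here.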
