Overview Of The Assignment In This Assignment You Are Going

Overview Of The Assignmentin This Assignment You Are Going To Imp

In this assignment, you are tasked with implementing three streaming algorithms: Bloom filtering, Flajolet-Martin, and reservoir sampling. The first task involves applying Bloom Filtering on a static (off-line) Yelp business dataset to estimate whether a certain city has been seen before. The second task requires simulating a data stream based on Yelp data and implementing the Flajolet-Martin algorithm using Spark Streaming to estimate the number of unique cities over time. The third task involves analyzing a live Twitter stream to identify popular tags using fixed-size reservoir sampling, maintaining a sample of 100 tweets for trend analysis.

You must use Python and Spark to develop all three implementations, with an extra bonus offered for providing correct Scala versions. The environment includes Python 3.6, Scala 2.11, and Spark 2.3.2. All code must be written independently; copying from web sources or peers will be grounds for plagiarism detection and penalties. You will submit six scripts (task1.py, task2.py, task3.py, task1.scala, task2.scala, task3.scala) and a jar file if Scala code is provided. Data files specific to each task are to be downloaded from Vocareum, and Twitter API credentials must be set up for task 3.

Paper For Above instruction

The current digital landscape requires the efficient handling and analysis of large-scale streaming data. This assignment emphasizes some of the most fundamental algorithms used in streaming analytics—Bloom filtering, Flajolet-Martin, and reservoir sampling—covering scenarios from off-line batch processing to real-time data stream analysis.

Bloom Filtering for Business Data

Bloom filters enable probabilistic membership testing with space efficiency, making them suitable for checking whether a city has appeared before in a large dataset. For this task, a Bloom filter with a fixed size (200 bits) is employed. Seeding proper hash functions—such as linear hash functions with predetermined coefficients—is crucial to ensure consistent results. The dataset of Yelp businesses is split into two files, with one used to build the filter containing known cities, and the other serving as input for evaluation. The goal is to calculate the false positive rate (FPR) at each time interval, storing the timestamp and FPR in a CSV file. This process demonstrates how Bloom filters can be used in offline data and how estimation accuracy evolves over time.

Flajolet-Martin Algorithm for Unique Counts in Streaming Data

The Flajolet-Martin algorithm efficiently estimates the number of distinct elements in a stream using probabilistic counting. This is achieved through multiple hash functions producing bit patterns, with the position of the least significant '1' bit serving as an estimator. Hash functions are chosen to maximize randomness and independence, and the algorithm is applied within a sliding window (30 seconds, with 5-second batches). The results include actual counts of unique cities and the estimations derived from the algorithm, recorded with timestamps. This example showcases how probabilistic algorithms reduce memory usage and computational complexity in streaming contexts.

Reservoir Sampling for Twitter Tag Analysis

Analyzing live social media data requires sampling methods that provide representative subsets without extensive memory. Reservoir sampling allows maintaining a fixed-size sample—in this case, 100 tweets—by replacing randomly selected samples as new data arrives after the initial fill. This technique is critical when data streams are unbounded. The task involves parsing incoming tweets for hashtags, updating the sample pool, and determining the top three most frequent tags. The output displays the tweet count and ranked tags with their frequencies, providing real-time insights into trending topics.

This assignment integrates theoretical algorithm properties with practical implementation challenges, including hash function design, stream windowing, and handling live APIs. Such skills are invaluable in modern data science, enabling scalable and resource-efficient solutions for analyzing massive and continuous datasets.

References

  • Broder, A., & Mitzenmacher, M. (2004). Network applications of bloom filters: A survey. IEEE/ACM Transactions on Networking, 8(4), 336-347.
  • Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2), 182-209.
  • Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Now Publishers Inc.
  • Chan, P. K., & Sherman, R. (2014). Scalable Streaming Data Analytics for Large-Scale Data. IEEE Data Engineering Bulletin, 37(4), 11-20.
  • Gibbons, P. B., & Matt, P. (1998). Information retrieval with high accuracy using bloom filters. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 283-290).
  • Sedgewick, R., & Wayne, K. (2011). Algorithms, 4th Edition. Addison-Wesley.
  • McGregor, A., et al. (2014). Streaming algorithms for estimating the size of the maximum matching. Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 183-194.
  • He, S., et al. (2020). Real-Time Trending Topic Detection on Social Streams. IEEE Transactions on Knowledge and Data Engineering, 32(11), 2147-2160.
  • Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Morgan Kaufmann.
  • O’Neil, P., Revesz, P., & qu, S. (1992). Unique counts for data streams. Proceedings of the 4th Annual European Symposium on Algorithms (pp. 290-301).