Lab 1: OpenMP in C++ Implementation (Due 08/21/2020 11:55 PM)

Implement a parallel Prewitt filter edge-detection algorithm using OpenMP in C++, which takes a grayscale image as input and outputs an image with edge outlines. The implementation involves reading the input and output image filenames from command-line arguments, reading the image into a 2D array, generating the Prewitt masks, applying the filter in parallel with both static and dynamic scheduling, and processing multiple test cases with different thread counts and chunk sizes. Additionally, implement a parallel quicksort algorithm using OpenMP tasks to sort a 10,000-element array initialized with random numbers. The code must be compiled with appropriate flags and may use environment variables for thread affinity. Timing should be measured with omp_get_wtime(), and a performance analysis must be reported for the different configurations.

Paper for the Above Instruction

The goal of this project is to develop an efficient parallel implementation of the Prewitt edge detection algorithm using OpenMP in C++, alongside a parallel quicksort. This encompasses understanding image processing, parallel programming paradigms, and performance evaluation techniques. This paper discusses the implementation details, algorithmic considerations, parallelization strategies, performance analysis, and the significance of these techniques in high-performance computing applications.

Introduction

Edge detection is a fundamental aspect of image processing and computer vision, facilitating feature extraction by emphasizing regions where image intensity varies sharply. Among various algorithms, the Prewitt operator is widely used for its simplicity and efficiency. It applies convolution masks to detect horizontal and vertical edges, yielding a gradient magnitude that signifies edge presence. With the proliferation of multi-core processors, parallelizing such algorithms can significantly accelerate processing times, especially for large images or real-time applications.

Prewitt Filter Algorithm and Implementation

The Prewitt filter uses two 3x3 masks to estimate the gradient of pixel intensity in the x and y directions. Each mask is applied to a pixel's 3x3 neighborhood (boundary pixels are excluded for simplicity): neighboring intensities are multiplied by the corresponding mask values and summed, yielding the directional gradients Gx and Gy. The gradient magnitude is then computed with the Euclidean formula sqrt(Gx^2 + Gy^2) and clamped to the 0-255 range of a grayscale image.
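
A minimal sketch of the masks and the per-pixel computation follows; the vector-of-vectors image representation and the helper name prewittAt are illustrative choices, not mandated by the assignment.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Prewitt masks: maskX responds to intensity change in x (vertical edges),
    // maskY to intensity change in y (horizontal edges).
    const int maskX[3][3] = { { -1, 0, 1 },
                              { -1, 0, 1 },
                              { -1, 0, 1 } };
    const int maskY[3][3] = { { -1, -1, -1 },
                              {  0,  0,  0 },
                              {  1,  1,  1 } };

    // Gradient magnitude at interior pixel (row, col), clamped to [0, 255].
    int prewittAt(const std::vector<std::vector<int>>& img, int row, int col) {
        int gx = 0, gy = 0;
        for (int i = -1; i <= 1; ++i) {
            for (int j = -1; j <= 1; ++j) {
                gx += img[row + i][col + j] * maskX[i + 1][j + 1];
                gy += img[row + i][col + j] * maskY[i + 1][j + 1];
            }
        }
        int mag = static_cast<int>(std::sqrt(double(gx) * gx + double(gy) * gy));
        return std::min(255, std::max(0, mag));
    }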

In the serial implementation, the process involves nested loops over each pixel, with boundary pixels assigned zero or skipped outright to avoid out-of-bounds accesses. This sequential approach, while straightforward, becomes computationally intensive for large images, motivating the use of parallel techniques.
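
Under the same assumptions, the serial pass reduces to two nested loops over the interior pixels; the output image is assumed pre-sized and zero-initialized, so the one-pixel border stays at zero.

    // Serial pass over the whole image.
    // Reuses prewittAt() from the sketch above.
    void prewittSerial(const std::vector<std::vector<int>>& in,
                       std::vector<std::vector<int>>& out) {
        int rows = static_cast<int>(in.size());
        int cols = static_cast<int>(in[0].size());
        for (int r = 1; r < rows - 1; ++r)
            for (int c = 1; c < cols - 1; ++c)
                out[r][c] = prewittAt(in, r, c);
    }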

Parallelization Using OpenMP

OpenMP provides compiler directives (pragmas) that parallelize loops with minimal code changes. For the Prewitt filter, the natural choice is to parallelize the outer loop over image rows, exploiting the independence of output pixels. Two scheduling strategies are highlighted:

  • Static Scheduling: The iteration space is divided into fixed-size chunks that are assigned to threads in round-robin order when the loop is entered, providing predictable load balancing. OpenMP's schedule(static, chunk_size) clause controls the chunk size and can improve cache locality.
  • Dynamic Scheduling: Chunks are dynamically assigned to threads during execution, which can better handle irregular workloads or non-uniform processing times. OpenMP's schedule(dynamic, chunk_size) enables this mode, potentially reducing idle time among threads.

For both strategies, per-thread information, such as thread IDs and the rows at which each thread begins work, is collected within each thread for reporting and analysis. This helps in understanding thread workload distribution and identifying performance bottlenecks.
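
A sketch of the row-parallel pass follows; swapping static for dynamic in the schedule clause switches strategies, and the per-row printf stands in for whatever thread-workload reporting the assignment requires.

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Row-parallel pass; prewittAt() is the per-pixel helper sketched earlier.
    // schedule(static, chunk) can be changed to schedule(dynamic, chunk);
    // both accept a runtime chunk size.
    void prewittParallel(const std::vector<std::vector<int>>& in,
                         std::vector<std::vector<int>>& out, int chunk) {
        int rows = static_cast<int>(in.size());
        int cols = static_cast<int>(in[0].size());
        #pragma omp parallel for schedule(static, chunk)
        for (int r = 1; r < rows - 1; ++r) {
            // Illustrative workload report: which thread handles which row.
            // Verbose; disable for timing runs.
            std::printf("thread %d: row %d\n", omp_get_thread_num(), r);
            for (int c = 1; c < cols - 1; ++c)
                out[r][c] = prewittAt(in, r, c);
        }
    }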

Performance Evaluation

Timing the parallel implementation involves wrapping the filtering code with omp_get_wtime() calls to record execution durations. Variations in thread count (2, 4, 8) and chunk sizes (24, 65) are explored to analyze their effects on performance. Commonly, increasing the number of threads reduces computation time up to the point where overheads outweigh benefits. The best configuration balances thread count with chunk size to optimize throughput, especially on systems with at least 4 cores.
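
A possible timing harness, assuming the prewittParallel sketch above, could look like this; "in" and "out" stand for the pre-loaded input image and the pre-sized output image.

    // Thread counts and chunk sizes mirror the configurations named above.
    void benchmark(const std::vector<std::vector<int>>& in,
                   std::vector<std::vector<int>>& out) {
        const int threadCounts[] = {2, 4, 8};
        const int chunkSizes[]   = {24, 65};
        for (int threads : threadCounts) {
            for (int chunk : chunkSizes) {
                omp_set_num_threads(threads);
                double t0 = omp_get_wtime();
                prewittParallel(in, out, chunk);
                std::printf("%d threads, chunk %d: %.4f s\n",
                            threads, chunk, omp_get_wtime() - t0);
            }
        }
    }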

Evaluation metrics include execution time, speedup (serial time divided by parallel time), and efficiency (speedup divided by thread count), providing insight into scalability and parallel overheads. Hardware characteristics, such as core count and cache hierarchy, influence these results. Optimizations like thread affinity further enhance performance by minimizing context switching and cache misses.

Parallel Quicksort Using OpenMP Tasks

To demonstrate recursive parallelism, the quicksort algorithm is implemented using OpenMP tasks. The array of 10,000 integers is initialized with random values and sorted through recursive partitioning:

  • A pivot is selected, and the array is partitioned into elements less than or equal to and greater than the pivot.
  • Two tasks are spawned to sort the subarrays concurrently, with synchronization via #pragma omp taskwait to coordinate completion.
  • This approach leverages task parallelism and can substantially reduce sorting time compared with serial execution.

The code employs dynamic memory allocation for the array and the necessary OpenMP data-sharing clauses, such as shared for the array and firstprivate for the partition bounds. Proper task management keeps the threads busy and avoids data races; a minimal sketch appears below.
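
In this self-contained sketch, the Lomuto partition, the last-element pivot, and the granularity cutoff in the if() clauses are illustrative decisions, not requirements of the assignment text.

    #include <omp.h>
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>

    // Lomuto partition around the last element as pivot.
    static int partition(int* a, int lo, int hi) {
        int pivot = a[hi], i = lo - 1;
        for (int j = lo; j < hi; ++j)
            if (a[j] <= pivot) std::swap(a[++i], a[j]);
        std::swap(a[i + 1], a[hi]);
        return i + 1;
    }

    static void quicksortTask(int* a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        // The array pointer is shared; the bounds are firstprivate copies.
        // The if() cutoff (an extra granularity control, not part of the
        // assignment text) runs small subarrays inline to limit task overhead.
        #pragma omp task shared(a) firstprivate(lo, p) if (p - 1 - lo > 500)
        quicksortTask(a, lo, p - 1);
        #pragma omp task shared(a) firstprivate(p, hi) if (hi - p - 1 > 500)
        quicksortTask(a, p + 1, hi);
        #pragma omp taskwait   // both halves must finish before returning
    }

    int main() {
        const int N = 10000;
        int* a = new int[N];                    // dynamic allocation, per the text
        for (int i = 0; i < N; ++i) a[i] = std::rand();
        #pragma omp parallel                    // create the thread team...
        #pragma omp single                      // ...but seed the recursion once
        quicksortTask(a, 0, N - 1);
        std::printf("sorted: %s\n", std::is_sorted(a, a + N) ? "yes" : "no");
        delete[] a;
        return 0;
    }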

Compilation and System Considerations

Compilation requires specific flags to enable OpenMP and C++11 features: -std=c++11 -fopenmp. It is recommended to compile with -Wall to surface potential issues and to use sanitizers such as -fsanitize=thread to detect data races. Thread affinity is configured via environment variables (for example, OMP_PROC_BIND and OMP_PLACES), binding threads to specific CPU cores, which aids performance consistency and reproducibility.
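
As a sketch, assuming a GCC or Clang toolchain (the file names are placeholders), the build line, the affinity variables, and a runtime check of the resulting configuration might look like this:

    // Build:
    //   g++ -std=c++11 -fopenmp -Wall prewitt.cpp -o prewitt
    // Affinity via environment variables, set before running:
    //   OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores ./prewitt in.pgm out.pgm
    #include <omp.h>
    #include <cstdio>

    // Runtime sanity check of the effective OpenMP configuration.
    int main() {
        std::printf("max threads: %d\n", omp_get_max_threads());
        std::printf("proc bind:   %d\n", static_cast<int>(omp_get_proc_bind()));
        return 0;
    }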

Conclusion

This project underscores the importance of parallel programming techniques in accelerating computationally intensive tasks such as image processing and sorting. By leveraging OpenMP directives, static and dynamic scheduling strategies, and task parallelism, developers can optimize performance on multi-core systems. Proper performance analysis and system tuning, such as thread affinity, further enhance efficiency, making such techniques invaluable in high-performance computing domains.
