Review Questions: What Are the Major Challenges to Large Corporations in Managing Information?
Review Questions 1.1 and 1.2 explore the major challenges faced by large corporations in managing information to support decision-making, limitations of conventional information systems, and various types of information processes needed by organizations. They also examine the differences between data, information, and knowledge, and delve into why the IT industry has developed Data Mining (DM) and Data Warehousing (DW) technologies beyond traditional RDBMS solutions. Further questions address criteria for choosing business information solutions, purposes and categorization of DM, and the empirical cycle model of scientific research, including its relation to machine learning (ML) and DM.
Questions 2.1 and 3.1-3.3 analyze the properties and structures of DW and OLAP systems, including the multi-dimensional space model, the data cube lattice, cuboids, and typical schemas. They compare OLAP and OLTP operations, explore the conceptual difference between the logical multi-dimensional space of OLAP and the physical space of 3D graphics, and discuss visualization methods for OLAP results, exemplified by the wealth vs. health analysis case.
Section 4 focuses on Data Mining's role in deriving new information, contrasting DM and DW technologies, and illustrating association rule mining—its principles, algorithms like Apriori, and their application in CRM. It discusses the purpose of association patterns, the empirical cycle in scientific research, and classification versus association mining, emphasizing supervised learning approaches like decision trees (ID3) and Naive Bayes classifiers, including their strategies, biases, and application to text classification.
Further questions compare clustering with classification, discussing cluster quality criteria, data structures, dissimilarity measures, and clustering methods such as k-means and hierarchical clustering, including handling outliers and evaluating cluster results. They also highlight web mining's challenges, major data types, and tasks across web data categories.
Paper for the Above Questions
The challenges of managing large volumes of corporate data are multifaceted, impacting decision-making, operational efficiency, and strategic planning. Effective management demands addressing issues of data volume, variety, veracity, and velocity, often referred to as the "4Vs" of big data. These challenges include storing vast amounts of data securely, ensuring data quality, integrating heterogeneous data sources, and enabling timely access for decision makers. Conventional information systems such as traditional Database Management Systems (DBMS) encounter limitations in handling the scale and complexity of big data, including slower query responses, limited scalability, and difficulty performing complex analytical queries.
To support complex decision-making processes, organizations require various types of information processing: data processing (organizing and storing data), information processing (analyzing data to produce meaningful insights), and knowledge processing (deriving actionable knowledge). For instance, a university president might be interested in high-level insights about enrollment trends, resource utilization, or research output, whereas a store manager’s focus might be daily sales performance or inventory levels. On the other hand, the typical user of online banking systems deals with information queries related to account balances, transaction histories, and fund transfers, primarily involving data and information processing.
Distinguishing among data, information, and knowledge is key to understanding information processing. Data comprises raw, unprocessed facts; for example, individual sales transactions. Information results from analyzing data, such as total sales for a given period. Knowledge emerges from synthesizing information to make strategic decisions, like identifying sales trends or customer preferences. Each of these plays a crucial role in business queries, with data serving as the foundation, information providing context, and knowledge enabling decision-making.
The development of Data Mining (DM) and Data Warehousing (DW) technologies stems from the limitations of traditional RDBMS/SQL systems, which are primarily designed for transaction processing rather than analytical tasks. DW models enable efficient storage and retrieval of large, historical datasets, facilitating complex queries that support strategic analysis. Unlike normalized RDB schemas, DW employs denormalized schemas, such as star and snowflake structures, to optimize query performance. Multi-dimensional modeling in DW, represented via data cube lattices, allows users to analyze data from multiple perspectives, such as time, geography, or product categories.
Data warehouses differ significantly from conventional databases in their focus on read-heavy operations, flexibility in schema design, and support for OLAP functionalities. The multi-dimensional space model, described by data cubes, supports fast, aggregate-level analysis that is key to managerial decision-making. Cuboids, the group-by aggregations that form the lattice of a data cube, enable quick retrieval of summarized data; the total number of cuboids depends on the number of dimensions and the hierarchy levels defined on each. In DW architecture, common schemas include star, snowflake, and constellation schemas, each with a different degree of normalization and complexity to suit evolving analytical needs.
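As a concrete illustration of how the cuboid count grows, the short sketch below computes the size of the lattice for a hypothetical cube with time, location, and product dimensions; the hierarchy depths are assumed for illustration and do not come from a specific case discussed here.

```python
from math import prod

# Hypothetical cube: number of hierarchy levels per dimension (assumed values).
# time: day < month < quarter < year  -> 4 levels
# location: city < country            -> 2 levels
# product: item < category            -> 2 levels
hierarchy_levels = {"time": 4, "location": 2, "product": 2}

# Each dimension contributes (levels + 1) choices: one per hierarchy level,
# plus "all" (the dimension aggregated away entirely).
total_cuboids = prod(levels + 1 for levels in hierarchy_levels.values())
print(total_cuboids)  # 5 * 3 * 3 = 45 cuboids in the lattice
```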
OLAP systems operate differently from OLTP systems by emphasizing read-intensive, aggregate queries rather than transaction processing. Conceptually, OLAP involves navigating a logic-based multi-dimensional space, contrasting with the physical space used in 3D graphics. OLAP queries often involve slicing, dicing, roll-up, and drill-down operations, visualized through data cubes or dashboards. For example, in wealth versus health analysis, OLAP facilitates examining combined metrics across dimensions such as time, location, or demographic groups, supporting strategic insights.
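To make the roll-up and slice operations concrete, the following pandas sketch emulates them over a small, invented sales table; the column names and figures are assumptions chosen purely for illustration.

```python
import pandas as pd

# Invented fact data: one row per (year, country, product) with a sales measure.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023],
    "country": ["US", "DE", "US", "DE"],
    "product": ["bread", "milk", "bread", "milk"],
    "amount":  [120, 80, 150, 95],
})

# Roll-up: aggregate away the product dimension, summarizing by year and country.
rollup = sales.groupby(["year", "country"], as_index=False)["amount"].sum()

# Slice: fix one dimension (year == 2023) and examine the remaining ones.
slice_2023 = sales[sales["year"] == 2023]

print(rollup)
print(slice_2023)
```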
Data Mining (DM) advances knowledge discovery from large datasets by identifying hidden patterns and relationships. Unlike DW, which focuses on data storage and retrieval, DM applies algorithms for association rule mining, classification, and clustering to uncover actionable insights. Association rule mining, exemplified by the Apriori algorithm, finds frequent itemsets and derives rules such as "customers who buy bread and butter also buy milk," which are useful in customer relationship management (CRM). Measures such as support, confidence, and lift help evaluate the significance of a rule.
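A minimal sketch, assuming a handful of invented transactions, shows how support, confidence, and lift would be computed for the bread-and-butter rule mentioned above; the values are illustrative only.

```python
# Invented market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"bread", "butter"}
consequent = {"milk"}

supp_rule = support(antecedent | consequent)   # P(A and C)
confidence = supp_rule / support(antecedent)   # P(C | A)
lift = confidence / support(consequent)        # confidence relative to baseline P(C)

print(f"support={supp_rule:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```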
The empirical cycle model of scientific research, comprising observation, hypothesis formulation, experimentation, and conclusion, aligns with DM practice. Knowledge discovered through DM provides provisional insights, subject to validation via statistical testing and corroboration, reflecting the dynamic nature of data. Classification, a supervised learning task implemented with decision-tree learners such as ID3, and association rule mining both reveal relationships and patterns within data. These methods exemplify the scientific approach of forming hypotheses, testing them, and refining knowledge.
For data warehousing and DM purposes, data quality plays a vital role; data must be accurate, complete, consistent, timely, and relevant. Poor quality data can lead to incorrect insights, misinforming strategic decisions. Data preprocessing tasks such as cleaning, transformation, and integration prepare data for analytical tasks. These steps are crucial for ensuring valid findings and reliable model performance.
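A minimal preprocessing sketch in pandas, using an invented customer table, illustrates typical cleaning and transformation steps (deduplication, harmonizing inconsistent codes, imputing missing values, normalizing a measure); the column names and imputation choices are assumptions, not prescriptions.

```python
import pandas as pd

# Invented raw customer data with typical quality problems.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 29, 41],
    "country":     ["us", "DE", "DE", "de", "US"],
    "spend":       [120.0, 80.0, 80.0, None, 300.0],
})

clean = (
    raw.drop_duplicates()                                   # remove duplicate records
       .assign(country=lambda d: d["country"].str.upper())  # harmonize inconsistent coding
       .assign(age=lambda d: d["age"].fillna(d["age"].median()),        # impute missing age
               spend=lambda d: d["spend"].fillna(d["spend"].median()))  # impute missing spend
)

# Min-max normalization of the spend measure, a common transformation step.
clean["spend_norm"] = (clean["spend"] - clean["spend"].min()) / (
    clean["spend"].max() - clean["spend"].min()
)
print(clean)
```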
The evolution of DW and OLAP technologies was driven by the need for complex, multi-dimensional data analysis in decision support. Unlike traditional relational databases, DW emphasizes denormalization, aggregation, and pre-computation, enabling high-performance retrieval of summarized data. The multi-dimensional space model, represented through data cubes, facilitates intuitive, visual analysis of data across multiple dimensions, supporting managerial decision-making.
Three common DW schemas, star, snowflake, and constellation, offer different balances between normalization and query performance. The star schema features a central fact table linked to denormalized dimension tables, optimized for simplicity and speed. Snowflake schemas normalize the dimension tables, reducing redundancy at the cost of additional joins and more complex queries. Constellation schemas support multiple fact tables sharing dimension tables, enabling more complex analyses.
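The following sketch emulates a star-schema query in pandas over invented fact and dimension tables: the fact table is joined to its dimensions and the measure is aggregated by descriptive attributes, which is the typical access pattern the star schema is designed to serve.

```python
import pandas as pd

# Minimal star schema: one fact table plus two dimension tables (invented data).
fact_sales = pd.DataFrame({
    "date_key":    [1, 1, 2],
    "product_key": [10, 11, 10],
    "amount":      [100.0, 40.0, 70.0],
})
dim_date = pd.DataFrame({"date_key": [1, 2], "year": [2023, 2024], "quarter": ["Q4", "Q1"]})
dim_product = pd.DataFrame({"product_key": [10, 11], "category": ["bakery", "dairy"]})

# Typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by descriptive attributes.
result = (
    fact_sales.merge(dim_date, on="date_key")
              .merge(dim_product, on="product_key")
              .groupby(["year", "category"], as_index=False)["amount"].sum()
)
print(result)
```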
OLAP and OLTP systems serve distinct purposes: OLTP architectures are optimized for transactional processing, ensuring data consistency and speed for operations like order entry, whereas OLAP focuses on complex queries and data analysis. OLAP supports "slice," "dice," "roll-up," and "drill-down" operations, essential for strategic analysis, while OLTP emphasizes efficiency and concurrency.
The logical multi-dimensional space utilized in OLAP contrasts with the physical space used in 3D graphics. OLAP queries are defined using multidimensional expressions, often optimized for performance via pre-aggregated data stored in data cubes. Visualizations, such as heatmaps or dashboards, help decision-makers interpret multidimensional results effectively.
Data mining aims to identify patterns, correlations, and rules in large datasets, supporting both predictive and descriptive analytics. Association rule mining via the Apriori algorithm explores the search space efficiently by exploiting the anti-monotone property: every subset of a frequent itemset must itself be frequent, so candidates containing an infrequent subset can be pruned early, improving computational efficiency. Frequent pattern analysis discovers recurrent item combinations, enabling targeted marketing and cross-selling strategies.
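The sketch below is a minimal, level-wise Apriori-style search over a few invented transactions, assuming a support threshold of 0.4; it illustrates candidate generation and subset-based pruning rather than a production implementation.

```python
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
min_support = 0.4  # assumed threshold

def is_frequent(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions) >= min_support

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset({i}) for i in items if is_frequent({i})}]

# Level-wise search: join frequent (k-1)-itemsets, prune candidates that contain
# an infrequent subset (anti-monotone property), then count support.
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if is_frequent(c)})
    k += 1

for level, sets in enumerate(frequent, start=1):
    if sets:
        print(level, [sorted(s) for s in sets])
```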
In applied DM, the overall process cycle involves problem understanding, data preparation, model building, evaluation, and deployment. Classification methods such as decision trees recursively partition the data on attribute values, using heuristics such as information gain to produce rules for predicting class labels. Naive Bayes classifiers, grounded in probability theory, assume attribute independence and estimate prior and conditional probabilities for classification, and they work particularly well on high-dimensional text data.
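As an illustration of the information-gain heuristic, the sketch below computes the entropy of an invented weather-style dataset before and after splitting on a single attribute; ID3 would choose the attribute with the highest gain at each node.

```python
from collections import Counter
from math import log2

# Invented training records: (attribute value, class label).
# The single attribute here is "outlook"; the label is whether to play.
records = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
    ("overcast", "yes"), ("sunny", "yes"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [label for _, label in records]
base = entropy(labels)

# Expected entropy after splitting on the attribute, weighted by branch size.
branches = {}
for value, label in records:
    branches.setdefault(value, []).append(label)
split_entropy = sum(len(b) / len(records) * entropy(b) for b in branches.values())

gain = base - split_entropy
print(f"entropy={base:.3f}, information gain of split={gain:.3f}")
```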
Text classification methods—including topic-oriented and sentiment analysis—employ different strategies and algorithms, ranging from rule-based systems to artificial neural networks (ANNs). ANNs, inspired by biological neural systems, can model complex nonlinear relationships but require extensive training data and computational resources.
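A minimal text-classification sketch, assuming scikit-learn is available, combines a bag-of-words vectorizer with a multinomial Naive Bayes classifier on a few invented sentences; real applications require far larger corpora and proper evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: label 1 = positive sentiment, 0 = negative.
texts = [
    "great product, works perfectly",
    "excellent service and fast delivery",
    "terrible quality, broke after a day",
    "awful experience, would not recommend",
]
labels = [1, 1, 0, 0]

# Bag-of-words features feed a multinomial Naive Bayes classifier, which assumes
# word occurrences are conditionally independent given the class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["fast delivery and great quality"]))  # likely [1] on this tiny corpus
```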
Clustering differs from classification in that it is unsupervised, grouping similar objects by their attributes without predefined class labels. The choice of dissimilarity (distance) measure and the types of attributes involved strongly influence clustering quality. K-means and hierarchical clustering are prominent methods, each with strengths and limitations, such as sensitivity to outliers or computational complexity. Proper data preprocessing, attribute selection, and validation are vital for producing meaningful clusters that can support business insights such as customer segmentation or market discovery.
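A compact sketch of the k-means (Lloyd's) iteration on invented two-dimensional points shows the alternation between assignment and centroid-update steps; the initialization scheme and the choice of k are simplified for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-D points forming two loose groups.
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

def kmeans(X, k, iters=20):
    # Initialize centroids by sampling k distinct points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(points, k=2)
print(centroids)  # should land near (0, 0) and (5, 5)
```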
Handling outliers—unusual data points—requires robust techniques to prevent distortion of clusters, including outlier detection algorithms or data transformation. The hierarchical clustering approach produces dendrograms, revealing nested groupings. Constraints and distance metrics significantly impact clustering results, with different methods suited for various data types and business applications.
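The following sketch, assuming SciPy is available, runs average-linkage hierarchical clustering on a small invented dataset containing one obvious outlier and cuts the resulting merge tree at a distance threshold; the outlier ends up isolated in its own cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 1-D measurements with one obvious outlier (the value 40).
data = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [40.0]])

# Agglomerative clustering with average linkage; the merge history (the dendrogram)
# is encoded in Z, and cutting it at a distance threshold yields flat clusters.
Z = linkage(data, method="average")
clusters = fcluster(Z, t=3.0, criterion="distance")
print(clusters)  # the outlier is assigned to a cluster of its own
```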
Web mining poses unique challenges due to data heterogeneity, web-scale size, dynamic content, and unstructured data formats. Major categories—web content mining, web structure mining, and web usage mining—each target different data aspects, such as textual content, hyperlink structures, and user interaction logs. Tasks include information retrieval optimization, link analysis, and personalization, which contribute to improved search engines, recommendations, and web-based decision support.
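As one illustrative instance of the link-analysis task, the sketch below runs a PageRank-style power iteration over a tiny invented hyperlink graph; it is a simplified model of the idea, not a description of any particular search engine.

```python
import numpy as np

# Invented hyperlink graph: adjacency[i, j] = 1 if page i links to page j.
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

damping = 0.85
n = adjacency.shape[0]

# Row-normalize to obtain the transition matrix of the "random surfer".
transition = adjacency / adjacency.sum(axis=1, keepdims=True)

# Power iteration: repeatedly apply the damped transition until scores settle.
rank = np.full(n, 1.0 / n)
for _ in range(50):
    rank = (1 - damping) / n + damping * transition.T @ rank

print(rank / rank.sum())  # page 2, linked to by every other page, ranks highest
```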