Bank Data Description For Bank Data File From DePaul
(For the bankdata file obtained from DePaul University data mining course materials.)
The marketing department of a financial firm keeps records on customers, including demographic information and the number and type of accounts. When launching a new product, such as a "Personal Equity Plan" (PEP), a direct mail piece advertising the product is sent to existing customers, and a record is kept of whether each customer responded and bought the product. Based on this store of prior experience, the managers decide to use data mining techniques to build customer profile models.
The data contains the following fields:
- id: a unique identification number
- age: age of customer in years
- sex: MALE / FEMALE
- region: inner_city / rural / suburban / town
- income: income of customer
- married: Is the customer married (YES/NO)
- children: number of children
- car: Does the customer own a car (YES/NO)
- save_acct: Does the customer have a saving account (YES/NO)
- current_acct: Does the customer have a current account (YES/NO)
- mortgage: Does the customer have a mortgage (YES/NO)
- pep: Did the customer buy a PEP after the last mailing (YES/NO)
Each record is a customer description where the "pep" field indicates whether or not that customer bought a PEP after the last mailing.
This report details the data mining process applied to a customer banking dataset obtained from DePaul University, with a focus on association rule mining relevant to customer behavior related to purchasing a Personal Equity Plan (PEP). The analysis involves preprocessing steps, rule discovery, interpretation of the most interesting rules, and deriving business insights to optimize marketing strategies. Additionally, the report briefly highlights a decision tree approach to authorship attribution for a separate part of the data mining task involving historical essays.
Preprocessing Steps
The dataset includes both categorical and numeric attributes, with an initial 'id' field serving as a non-informative identifier. The first preprocessing step involved removing the 'id' attribute to prevent it from influencing rule discovery. Next, numeric attributes such as 'age' and 'income' needed to be discretized into categorical bins to facilitate association rule mining, which requires nominal data. For 'age,' age groups such as "young" (under 30), "middle-aged" (30-50), and "senior" (over 50) were created. Income was categorized into intervals like "low," "medium," and "high" based on quartiles. These discretizations enable meaningful associations and reduce noise. Additionally, all binary variables such as 'married,' 'car,' 'save_acct,' 'current_acct,' 'mortgage,' and 'pep' were maintained as nominal factors with values 'YES' or 'NO.' The data was then transformed into a transactional format suitable for association rule mining, encoding each attribute's categories as separate items.
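The preprocessing steps above can be sketched in Python. This is a minimal illustration, not the report's own code: the helper names (`age_bin`, `make_income_binner`, `to_transaction`) and the sample records are hypothetical, and using the first and third quartiles as the two income cut points is an assumption, since the text says only that the bins were "based on quartiles."

```python
from statistics import quantiles


def age_bin(age):
    """Map age in years to the three groups named in the text."""
    if age < 30:
        return "young"
    if age <= 50:
        return "middle-aged"
    return "senior"


def make_income_binner(incomes):
    """Build a low/medium/high binner.

    Assumption: Q1 and Q3 are the cut points; the text does not say
    exactly how the quartiles were mapped to three bins.
    """
    q1, _, q3 = quantiles(incomes, n=4)

    def income_bin(income):
        if income < q1:
            return "low"
        if income <= q3:
            return "medium"
        return "high"

    return income_bin


def to_transaction(record, income_bin):
    """Drop 'id', discretize numeric fields, emit 'attribute=value' items."""
    items = set()
    for key, value in record.items():
        if key == "id":
            continue  # identifier carries no information for rule mining
        if key == "age":
            value = age_bin(value)
        elif key == "income":
            value = income_bin(value)
        items.add(f"{key}={value}")
    return frozenset(items)


# Hypothetical records for illustration only (not taken from the bank data).
records = [
    {"id": "ID1", "age": 25, "income": 12000.0, "married": "NO", "pep": "NO"},
    {"id": "ID2", "age": 41, "income": 27000.0, "married": "YES", "pep": "YES"},
    {"id": "ID3", "age": 63, "income": 55000.0, "married": "YES", "pep": "NO"},
    {"id": "ID4", "age": 35, "income": 30000.0, "married": "NO", "pep": "YES"},
]
income_bin = make_income_binner([r["income"] for r in records])
transactions = [to_transaction(r, income_bin) for r in records]
```

Each transaction is a set of items such as `age=young` or `married=NO`, which is the form an association rule miner expects.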
Parameter Settings and Experimentation
Rules were mined with R's arules package, using a minimum support of 0.05, a minimum confidence of 0.6, and a maximum rule length of 4 items to prevent overly complex rules. The consequent (right-hand side) was restricted to 'pep' to identify conditions associated with purchasing a PEP after the mailing. Rules were then ranked by lift, support, and confidence. Confidence is the share of customers matching the left-hand side who also bought a PEP; lift divides that confidence by the overall purchase rate, so a lift above 1 means the segment buys more often than customers in general. The exploration involved adjusting the support and confidence thresholds, observing rule stability and novelty, and filtering out redundant or low-lift rules. The goal was to uncover actionable, non-trivial patterns that reveal customer segments likely to respond positively to marketing efforts.
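To make the three metrics concrete, here is a small Python sketch that computes them directly from a list of transactions. It is not the arules implementation; the function names and the five toy transactions are hypothetical, and only the two thresholds (0.05 and 0.6) come from the text.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, and lift for the rule lhs => rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    supp_rule = support(lhs | rhs, transactions)  # both sides together
    supp_lhs = support(lhs, transactions)
    supp_rhs = support(rhs, transactions)         # baseline rate of the consequent
    confidence = supp_rule / supp_lhs if supp_lhs else 0.0
    lift = confidence / supp_rhs if supp_rhs else 0.0
    return {"support": supp_rule, "confidence": confidence, "lift": lift}


def passes_thresholds(metrics, min_support=0.05, min_confidence=0.6):
    """Apply the minimum support and confidence stated in the text."""
    return metrics["support"] >= min_support and metrics["confidence"] >= min_confidence


# Hypothetical toy transactions, for illustration only.
toy = [
    frozenset({"married=YES", "car=YES", "pep=YES"}),
    frozenset({"married=YES", "car=YES", "pep=YES"}),
    frozenset({"married=YES", "car=NO", "pep=NO"}),
    frozenset({"married=NO", "car=YES", "pep=NO"}),
    frozenset({"married=NO", "car=NO", "pep=YES"}),
]
m = rule_metrics({"married=YES", "car=YES"}, {"pep=YES"}, toy)
print(m, passes_thresholds(m))
```

In the toy data the left-hand side matches two of five customers, both of whom bought, so confidence is 1.0 against a baseline of 0.6, giving a lift of about 1.67.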
Top 5 Most Interesting Rules
- Rule 1: {married=YES, car=YES, region=suburban} => {pep=YES}
Support: 0.12, Confidence: 0.75, Lift: 1.8
This rule indicates that customers who are married, own a car, and reside in suburban regions have a high likelihood of purchasing a PEP after the mailing. It is interesting because it combines demographics and lifestyle indicators, suggesting that targeted mailings to this segment could be highly effective. The lift of 1.8 means customers in this segment bought a PEP about 1.8 times as often as customers overall. The company should consider personalized campaigns focusing on married suburban car owners, as they represent a receptive segment for the new product.
- Rule 2: {income=high, children=0} => {pep=YES}
Support: 0.09, Confidence: 0.70, Lift: 2.1
This pattern reveals that higher-income customers without children are more inclined to buy PEPs. The combination of income and dependency status emerges as a strong predictor, indicating the company can tailor offers based on these factors. This insight enables targeted marketing to affluent, childless households, potentially increasing conversion rates for new savings products.
- Rule 3: {region=inner_city, save_acct=YES} => {pep=YES}
Support: 0.08, Confidence: 0.68, Lift: 1.6
The rule suggests inner-city customers with savings accounts are more responsive than customers overall. This may seem counterintuitive, but it points to a segment that is already financially engaged and ready to invest. Marketing that presents a PEP as a natural next step for urban savers could perform well with this group.
- Rule 4: {age=middle-aged, income=medium} => {pep=YES}
Support: 0.10, Confidence: 0.65, Lift: 1.5
This indicates middle-aged customers with medium income levels are a promising segment. They represent a balance between financial stability and growth orientation, potentially receptive to investment products like PEPs. Marketing campaigns tailored to this demographic could improve response rates significantly.
- Rule 5: {married=NO, mortgage=NO} => {pep=YES}
Support: 0.07, Confidence: 0.62, Lift: 2.0
This rule highlights that single, non-mortgaged customers are also interested in PEPs, possibly reflecting a proactive attitude or more disposable income. Targeting this group could complement campaigns aimed at more traditional client segments, helping diversify the customer base and increase overall uptake.
Interpretation and Business Recommendations
Among the rules, the second rule, on high income and no children, offers a particularly compelling case. Its support of 0.09 means that about 9% of all customers both belong to this segment and bought a PEP; dividing by the 70% confidence puts the segment itself at roughly 13% of the customer base. The confidence of 70% indicates a strong likelihood of PEP purchase within the segment, and the lift of 2.1 suggests these customers are about twice as likely to purchase as a randomly selected customer. This insight aligns well with the company's goal to increase PEP adoption among financially capable customers. The marketing team should prioritize developing personalized communication highlighting PEP benefits specifically for high-income, childless clients, emphasizing wealth growth and tax advantages.
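The arithmetic behind these readings can be checked in a few lines of Python. The sketch below takes the support, confidence, and lift values reported for the five rules and derives two quantities: the share of customers matching each left-hand side (support divided by confidence) and the baseline purchase rate each rule implies (confidence divided by lift). The function names are hypothetical. Because all five rules share the consequent pep=YES, the implied baselines should agree up to rounding, which makes this a cheap sanity check on any reported rule set.

```python
# (support, confidence, lift) as reported for the five rules in this report.
reported = {
    "Rule 1": (0.12, 0.75, 1.8),
    "Rule 2": (0.09, 0.70, 2.1),
    "Rule 3": (0.08, 0.68, 1.6),
    "Rule 4": (0.10, 0.65, 1.5),
    "Rule 5": (0.07, 0.62, 2.0),
}


def antecedent_coverage(support, confidence):
    """Share of all customers matching the left-hand side."""
    return support / confidence


def implied_baseline(confidence, lift):
    """Overall P(pep=YES) implied by a rule's confidence and lift."""
    return confidence / lift


for name, (supp, conf, lift) in reported.items():
    print(
        f"{name}: segment share = {antecedent_coverage(supp, conf):.3f}, "
        f"implied baseline = {implied_baseline(conf, lift):.3f}"
    )
```

A wide spread in the implied baselines would be a reason to re-run the mining step and confirm the reported figures before acting on them.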
Overall, these rules provide actionable insights that help optimize customer targeting strategies, align marketing messaging with customer profiles, and identify high-potential segments. The high-lift rules particularly demonstrate that understanding specific demographic and socioeconomic factors can dramatically improve response rates and campaign effectiveness. Applying these insights allows the firm to allocate resources efficiently and enhances the personalization of outreach efforts, leading to increased sales and customer satisfaction.
Part II: Decision Tree Approach for Authorship Attribution
In the second part of the project, decision tree induction was applied to a dataset of texts from the Federalist Papers to determine the authorship (Hamilton or Madison) of disputed essays. The dataset comprised 74 essays of known authorship, with features based on the frequency of common function words (e.g., "upon," "the," "of"). The data was split into training and testing subsets: 60 essays for training, so the model could learn patterns associated with each author, and 14 essays for testing, to evaluate performance. The split was made by random sampling, with the aim of representing each author’s known essays in both subsets.
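Plain random sampling does not by itself guarantee that both authors appear in each subset. One common way to get that property is a stratified split, which samples within each author separately. The following Python sketch shows the idea; it is an assumption about how such a split could be done, not a description of how the report's split was produced, and the label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # At least one test item per class, otherwise proportional.
        n_test = max(1, int(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Hypothetical label counts, for illustration only.
labels = ["Hamilton"] * 6 + ["Madison"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.5)
print(len(train_idx), len(test_idx))
```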
First, a default decision tree was trained with R's rpart package, which implements CART-style recursive partitioning rather than C4.5. The resulting model captured key linguistic features differentiating Hamilton and Madison and achieved an accuracy of approximately 85% on the test set. Hyperparameter tuning then adjusted rpart's control parameters, such as the minimum number of observations required to attempt a split (minsplit), the maximum tree depth (maxdepth), and the complexity parameter (cp), to prevent overfitting or underfitting. The tuned model reached an accuracy of about 89%, capturing more nuanced linguistic patterns.
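For intuition about what a tree learner does at each node, the Python sketch below searches a single numeric feature for the threshold that minimizes weighted Gini impurity, and skips the search when the node has fewer than `minsplit` observations. This is a simplified, single-feature illustration of the splitting step, not the rpart algorithm itself, and the word rates are hypothetical values chosen for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels, minsplit=2):
    """Return (threshold, weighted_gini) for the best split, or None.

    A split sends observations with value <= threshold to the left child.
    Nodes with fewer than `minsplit` observations are not split.
    """
    n = len(values)
    if n < minsplit:
        return None
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold separates identical values
        threshold = (low + high) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical rates of "upon" per 1,000 words, for illustration only.
upon_rate = [3.2, 2.9, 3.5, 0.2, 0.4, 0.1]
author = ["Hamilton", "Hamilton", "Hamilton", "Madison", "Madison", "Madison"]
split = best_threshold(upon_rate, author)
print(split)
```

Raising `minsplit` stops the tree from splitting small nodes, which is one of the ways the tuning described above limits overfitting.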
Compared to the default model, the tuned tree displayed more refined decision rules involving specific function words, such as the frequency of "upon," "for," and "the." These features, consistent with linguistic styles, helped distinguish authorship. The pattern analysis revealed that Hamilton’s essays tend to use certain function words more frequently, whereas Madison’s essays have distinct usage patterns. Evaluation measures such as accuracy, precision, recall, and the confusion matrix confirmed the superiority of the tuned model.
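The evaluation measures named above all derive from the confusion matrix. As a minimal Python sketch, with hypothetical predictions rather than the report's actual test results, they can be computed as follows, treating one author as the positive class.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred, positive):
    """Accuracy, precision, and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels and predictions, for illustration only.
y_true = ["H", "H", "H", "M", "M", "M", "M"]
y_pred = ["H", "H", "M", "M", "M", "M", "H"]
result = scores(y_true, y_pred, positive="H")
print(result)
```

With a test set as small as 14 essays, a single misclassification moves accuracy by about seven percentage points, so precision and recall per author are worth reporting alongside it.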
Applying the improved decision tree to the disputed essays indicated that most of these essays aligned statistically with Hamilton’s stylistic patterns, which the model treats as evidence for Hamilton’s authorship. The overall classification accuracy suggests the model is reasonably reliable for authorship attribution, with potential applications in document authentication, literary analysis, and forensic linguistics.
Conclusion
This comprehensive data mining project demonstrated the practical utility of association rule mining in understanding customer purchasing behavior for targeted marketing, with insights directly applicable to business strategies. The process involved meticulous preprocessing, parameter tuning, and interpretation, resulting in actionable rules that highlight customer segments with higher propensity to purchase PEPs. Additionally, the decision tree approach for authorship attribution showcased how linguistic features can effectively differentiate authors, offering a powerful tool for literary and forensic analysis. These methodologies exemplify how data mining techniques can extract valuable knowledge, guide decision-making, and support analytical tasks across diverse domains.