Research Based on Stack Overflow 2019 Developer Survey Data in R

The assignment is to conduct research based on the information below, using R. After analyzing the data in R, document the research and findings in a research paper in APA 7 format. Ask questions, if needed.

Topic: Stack Overflow hosts an annual survey for developers. The study for 2019 includes almost 90,000 respondents (Stack Overflow, n.d.a).

Problem: Surveys usually instruct participants to answer to the best of their ability. That expectation of honest answers carries an implicit expectation of consistent responses. Inconsistency can arise in several ways. One example is that two respondents may interpret the same question differently. Another is a multiple-choice question for which more than one of the choices, or none of them, fits the respondent. In the study by Stack Overflow (n.d.b), respondents answered the employment question and employment-related questions inconsistently.

Modeling the survey results can present new insight into these inconsistencies.

Question: Using a neural network and a random forest model on the Stack Overflow (n.d.b) data, do the survey responses about employment, developer status, and coding as a hobby, along with the answers to an open-source sharing question, provide sufficient information to predict how the participant responded to the question about their student status?

Data:

  • The data and data dictionaries are online.
  • Note: The raw data in your program must be in the original form. Do not modify the data outside of the programming. Use the data dictionary to understand the data.
  • You can read Stack Overflow’s (n.d.a) report on the survey.
  • The data and data dictionary are downloaded together. When you visit this site, ensure you select the 2019 survey: Stack Overflow. (n.d.b). Stack Overflow annual developer survey [dataset and code book]. Retrieved May 24, 2020, from
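
The requirement to keep the raw data untouched and do all modification in code can be sketched as follows. This is a minimal illustration, not a prescribed solution: the file name `survey_results_public.csv`, the helper `load_survey`, and the column names in `keep_cols` are assumptions that should be checked against the downloaded archive and the data dictionary. In particular, `OpenSourcer` is only assumed to be the open-source variable the notes below refer to.

```r
# Column names are assumptions; verify each one against the data dictionary.
keep_cols <- c("MainBranch", "Hobbyist", "OpenSourcer",
               "Employment", "Country", "Student")

# Hypothetical helper: read the untouched raw CSV and keep only the
# columns needed for the analysis. Empty strings are treated as missing.
load_survey <- function(path, cols = keep_cols) {
  raw <- read.csv(path, stringsAsFactors = FALSE, na.strings = c("NA", ""))
  missing_cols <- setdiff(cols, names(raw))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  raw[, cols, drop = FALSE]
}

# Usage (the path is an assumption about where the file was unzipped):
# survey <- load_survey("survey_results_public.csv")
```

Stopping on a missing column, rather than silently returning fewer columns, makes a wrong variable name visible immediately.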

Requirements for this data analysis project:

  • Develop at least one additional well-developed research question.
  • When conducting data analysis, limit your research to the country of Brazil.
  • Develop two classification algorithms: a neural network and a random forest classifier.
  • Attempt to create a classification model with an accuracy that exceeds both 0.8 and the no-information rate when predicting the testing dataset. Tune the model(s) if they do not meet this threshold.
  • Compare the two models’ accuracy.
  • Explore the insights you can gain from this model and provide your interpretations when documenting your research.
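
The accuracy requirement above compares test-set accuracy with two benchmarks: the fixed 0.8 target and the no-information rate, which is the accuracy obtained by always predicting the most common class. A minimal base-R sketch of that check follows. The helper name `evaluate_accuracy` is hypothetical and the labels are invented for illustration.

```r
# Hypothetical helper: compare test-set accuracy against the 0.8 target
# and the no-information rate (share of the most common observed class).
evaluate_accuracy <- function(actual, predicted, target = 0.8) {
  stopifnot(length(actual) == length(predicted))
  accuracy <- mean(as.character(actual) == as.character(predicted))
  nir <- max(table(actual)) / length(actual)
  list(accuracy = accuracy,
       nir = nir,
       meets_target = accuracy > target && accuracy > nir)
}

# Toy labels, invented for illustration only.
actual    <- c("No", "No", "No", "No", "Yes, full-time", "Yes, part-time")
predicted <- c("No", "No", "No", "No", "Yes, full-time", "No")

res <- evaluate_accuracy(actual, predicted)
print(res)
```

Here five of six predictions are correct and four of six true labels are "No", so the accuracy (about 0.83) is above both the target and the no-information rate (about 0.67).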

Additional notes:

  • The main predictor variable is “MainBranch”, which denotes developer status.
  • Understand the difference between “OpenSourcer” and “OpenSource” variables; ensure the correct variable is used.
  • There are four predictor variables and one outcome variable with three classes.
  • Evaluate the frequency of responses and, when necessary, omit response categories with fewer than 20 observations, such as Retired (only 6 observations), to avoid bias in training and testing.
  • The final report should be written in APA 7 style, 3 to 5 pages long, with a minimum of 800 words, including a cover and references page.
  • Include citations for all sources, including data and data dictionaries.
  • Modify the topic or problem statement if desired, but do not alter the analysis methodology.
  • Complete this assignment independently, avoiding unauthorized versions or submissions.

Sample Paper for the Above Instructions

Note: The sample below demonstrates how to approach the research, analysis, and reporting based on the given data and instructions. It is a hypothetical example designed to illustrate the structure and content expected in the final paper.

Introduction

The Stack Overflow annual developer survey provides comprehensive insights into the programming community worldwide. The 2019 survey included nearly 90,000 respondents, offering valuable data for understanding developer behavior, attitudes, and demographics. This study aims to construct predictive models to determine participants' student status based on various survey responses, using machine learning methods such as neural networks and random forests.

Research Questions

Primarily, the research investigates whether survey responses related to employment, developer status, and open-source contributions can accurately predict a participant’s student status. An additional exploration assesses the influence of response inconsistencies and the presence of unbalanced classes on model performance.

Data Description and Preparation

The dataset from the Stack Overflow 2019 survey includes multiple variables, such as employment status, developer status, coding as a hobby, and contributions to open-source projects. The data dictionary clarifies variable coding schemes, enabling precise data cleaning and feature engineering.

To reduce the effect of sparse categories, response categories with fewer than 20 observations (e.g., Retired) are omitted. Data are filtered to respondents from Brazil. Categorical variables are encoded as factors, and missing values are handled per the data dictionary guidelines.
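
The preparation steps described above could look roughly like the sketch below. It is one possible reading, not the method used in the sample paper: `prepare_survey` is a hypothetical helper, dropping incomplete rows is an assumed way of handling missing values, and the toy data frame is invented.

```r
# Hypothetical helper: keep one country, drop incomplete rows (an assumed
# missing-value policy), and remove rows whose category in any column has
# fewer than `min_n` observations.
prepare_survey <- function(df, country = "Brazil", min_n = 20) {
  df <- df[!is.na(df$Country) & df$Country == country, , drop = FALSE]
  df$Country <- NULL
  df <- df[complete.cases(df), , drop = FALSE]
  for (col in names(df)) {
    counts <- table(df[[col]])
    rare <- names(counts)[counts < min_n]
    df <- df[!(df[[col]] %in% rare), , drop = FALSE]
  }
  # Convert to factors after filtering so dropped categories leave no levels.
  df[] <- lapply(df, factor)
  df
}

# Invented toy data: 45 respondents from Brazil, 5 of them "Retired".
toy <- data.frame(
  Country    = c(rep("Brazil", 45), rep("India", 5)),
  Employment = c(rep("Employed full-time", 40), rep("Retired", 5),
                 rep("Employed full-time", 5)),
  Student    = c(rep(c("No", "Yes, full-time"), 20), rep("No", 10)),
  stringsAsFactors = FALSE
)

clean <- prepare_survey(toy)
print(table(clean$Employment))
```

Columns are filtered one after another, so removing rows for a later column can push a category in an earlier column below the cutoff. Re-checking the frequency tables after the call is a sensible precaution.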

Model Development and Tuning

Neural Network Model

The neural network is implemented using the 'nnet' package in R, with hyperparameters (hidden-layer size and weight decay) tuned via cross-validation. Model complexity is kept small to prevent overfitting, and after tuning the model's test-set accuracy is checked against the 0.8 threshold and the no-information rate.
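
A minimal sketch of cross-validated tuning with 'nnet' follows. The data are synthetic stand-ins with invented category labels, and the grid values, the 70/30 split, and the five folds are illustrative assumptions rather than settings from the sample paper.

```r
library(nnet)  # recommended package that ships with standard R installations

set.seed(42)
# Synthetic stand-in for the prepared Brazil subset; all values are invented.
n <- 300
toy <- data.frame(
  MainBranch = factor(sample(c("Developer", "Student", "Hobbyist"), n, replace = TRUE)),
  Hobbyist   = factor(sample(c("Yes", "No"), n, replace = TRUE))
)
# Make the three-class outcome depend on the predictors so it is learnable.
toy$Student <- factor(ifelse(toy$MainBranch != "Student", "No",
                      ifelse(toy$Hobbyist == "Yes", "Yes, full-time", "Yes, part-time")))

train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Small grid of hidden-layer sizes and weight-decay values, scored by
# 5-fold cross-validated accuracy on the training set.
grid  <- expand.grid(size = c(2, 4), decay = c(0, 0.1))
folds <- sample(rep(1:5, length.out = nrow(train)))
grid$cv_acc <- apply(grid, 1, function(g) {
  mean(sapply(1:5, function(k) {
    fit <- nnet(Student ~ ., data = train[folds != k, ],
                size = g[["size"]], decay = g[["decay"]],
                maxit = 200, trace = FALSE)
    pred <- predict(fit, train[folds == k, ], type = "class")
    mean(pred == train$Student[folds == k])
  }))
})

# Refit on the full training set with the best setting, then score the test set.
best     <- grid[which.max(grid$cv_acc), ]
final_nn <- nnet(Student ~ ., data = train, size = best$size,
                 decay = best$decay, maxit = 200, trace = FALSE)
nn_pred  <- predict(final_nn, test, type = "class")
nn_acc   <- mean(nn_pred == test$Student)
print(grid)
print(nn_acc)
```

The test set is used only once, after the grid search, so the reported accuracy is not inflated by the tuning itself.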

Random Forest Model

The 'randomForest' package is employed for the random forest (RF) classifier. Parameter tuning involves adjusting the number of trees and the number of variables tried at each split. Model performance metrics, including accuracy and sensitivity, are compared against the neural network results.
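
The tuning described above can be sketched as below, assuming the 'randomForest' package from CRAN is installed (the code skips itself otherwise). The synthetic data, the grid values, and the use of out-of-bag error as the selection criterion are illustrative assumptions.

```r
# randomForest is a CRAN package and must be installed separately.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  # Synthetic stand-in for the prepared Brazil subset; all values are invented.
  n <- 300
  toy <- data.frame(
    MainBranch = factor(sample(c("Developer", "Student", "Hobbyist"), n, replace = TRUE)),
    Hobbyist   = factor(sample(c("Yes", "No"), n, replace = TRUE))
  )
  toy$Student <- factor(ifelse(toy$MainBranch != "Student", "No",
                        ifelse(toy$Hobbyist == "Yes", "Yes, full-time", "Yes, part-time")))

  train_idx <- sample(n, size = round(0.7 * n))
  train <- toy[train_idx, ]
  test  <- toy[-train_idx, ]

  # Compare settings by out-of-bag (OOB) error after the last tree.
  grid <- expand.grid(ntree = c(200, 500), mtry = c(1, 2))
  grid$oob_err <- mapply(function(nt, m) {
    fit <- randomForest::randomForest(Student ~ ., data = train,
                                      ntree = nt, mtry = m)
    fit$err.rate[nt, "OOB"]
  }, grid$ntree, grid$mtry)

  # Refit with the best setting and score the held-out test set.
  best   <- grid[which.min(grid$oob_err), ]
  rf_fit <- randomForest::randomForest(Student ~ ., data = train,
                                       ntree = best$ntree, mtry = best$mtry,
                                       importance = TRUE)
  rf_pred <- predict(rf_fit, newdata = test)
  rf_acc  <- mean(rf_pred == test$Student)

  print(grid)
  print(rf_acc)
  print(randomForest::importance(rf_fit))  # per-variable importance scores
}
```

Out-of-bag error gives a validation estimate without a separate cross-validation loop, which keeps the sketch short. Explicit cross-validation would make the comparison with the neural network more like-for-like.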

Results

In this hypothetical example, the RF reaches 82% accuracy on the test set and the neural network 80%, so only the RF clearly exceeds the 0.8 target. The RF model outperformed the neural network slightly, aligning with expectations given the dataset characteristics. Sensitivity analysis revealed the importance of certain predictor variables, notably the developer status and open-source contribution responses.

Insights and Interpretations

The analysis suggests that survey responses concerning professional engagement and open-source activity hold significant predictive power regarding student status. However, response inconsistencies, especially in categories with low frequencies, challenge the models' robustness. Omitting sparse categories improved model accuracy, but accuracy can still be misleadingly high when one class dominates the distribution, which is why metrics like sensitivity and specificity matter when interpreting model performance.
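
The point about misleading accuracy can be made concrete with per-class sensitivity and specificity computed from a confusion matrix. The base-R sketch below uses a hypothetical helper, `class_metrics`, and invented labels in which a model predicts the majority class for every respondent.

```r
# Hypothetical helper: one-vs-rest sensitivity and specificity per class.
class_metrics <- function(actual, predicted) {
  lev <- sort(unique(c(as.character(actual), as.character(predicted))))
  actual    <- factor(actual, levels = lev)
  predicted <- factor(predicted, levels = lev)
  cm <- table(Predicted = predicted, Actual = actual)
  t(sapply(lev, function(cl) {
    tp <- cm[cl, cl]
    fn <- sum(cm[, cl]) - tp   # actual cl, predicted something else
    fp <- sum(cm[cl, ]) - tp   # predicted cl, actually something else
    tn <- sum(cm) - tp - fn - fp
    c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
  }))
}

# Invented labels: the "model" always predicts the majority class.
actual    <- c(rep("No", 8), "Yes, full-time", "Yes, part-time")
predicted <- rep("No", 10)

accuracy <- mean(actual == predicted)
metrics  <- class_metrics(actual, predicted)
print(accuracy)
print(metrics)
```

Accuracy here is 0.8, yet sensitivity is 0 for both student classes and specificity is 0 for "No": the model has learned nothing beyond the class distribution, which accuracy alone would not reveal.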

Conclusion

This study demonstrates the utility of machine learning models in detecting response inconsistencies and predicting demographic variables from survey data. Future research could explore additional features or alternative modeling approaches and further analyze how a high no-information rate affects model reliability.

References

  • Stack Overflow. (n.d.a). Developer survey results: 2019. Retrieved from [URL]
  • Stack Overflow. (n.d.b). Stack Overflow annual developer survey [dataset and code book]. Retrieved from [URL]
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
  • Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
  • Chollet, F. (2017). Deep learning with Python. Manning Publications.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.
  • Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer.
  • Lantz, B. (2013). Machine learning with R. Packt Publishing.
  • Hall, P., & Wilson, S. (2000). Two guidelines for bootstrap prediction errors in regression modeling. The American Statistician, 54(3), 183-188.