Zachary-Raup

Data Science Portfolio


Project maintained by ZRaup Hosted on GitHub Pages — Theme by mattgraham

Zachary Raup



Zachary’s Resume (pdf)
Data Science Courses Completed

About Me

Welcome to my data science portfolio! I’m Zachary Raup, a dedicated and curious data scientist with a strong foundation in physics and a passion for uncovering insights through data. I graduated Summa Cum Laude from Kutztown University with a bachelor’s degree in Physics, where I specialized in data analysis and modeling—particularly in the fields of exoplanets and binary star systems. My research experience sharpened my analytical thinking and deepened my ability to work with complex, real-world datasets.

To build upon this foundation, I pursued professional development through DataCamp, earning certifications as a Data Scientist Associate, Data Analyst Associate, and receiving credentials in Python and SQL. I’ve also completed coursework in machine learning, data preprocessing, and visualization, equipping me with both the theoretical knowledge and hands-on skills needed to drive data-driven solutions.

In my work, I use Python for data exploration, feature engineering, statistical modeling, and machine learning, with libraries such as pandas, scikit-learn, matplotlib, and numpy. I’m also proficient in SQL, where I manage, query, and analyze large datasets efficiently to support decision-making. My visualization skills extend to tools like Tableau and Power BI, where I create compelling dashboards and storytelling visuals to communicate insights clearly and effectively.

I thrive in collaborative environments, enjoy solving challenging problems, and am committed to continuous learning in the fast-evolving world of data science. Whether I’m optimizing a machine learning model, developing a visualization, or diving into raw data, my goal is always to make a meaningful impact through data.

 


Certifications


 

Project 1

Discovering Similar Songs Using Machine Learning | Unsupervised Learning with Spotify Data

Project Overview

This project explores the use of unsupervised learning and dimensionality reduction techniques to analyze and visualize the similarity between songs based on audio characteristics available from Spotify. By applying preprocessing, feature engineering, and advanced techniques like Non-negative Matrix Factorization (NMF) and t-distributed Stochastic Neighbor Embedding (t-SNE), I created an interactive, interpretable map of songs, with a focus on comparing all tracks to “Blinding Lights” by The Weeknd.

Skills Applied: Unsupervised Machine Learning, Python (scikit-learn), Cosine Similarity, NMF, t-SNE and more
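The similarity ranking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the feature matrix is random synthetic data standing in for the Spotify audio features, and the function name `top_similar`, the number of NMF components, and the scaling choice are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler, normalize


def top_similar(features, ref_index, n_components=5, top_k=10, seed=0):
    """Rank rows of `features` by cosine similarity to the row at `ref_index`."""
    # NMF requires non-negative input, so scale every feature into [0, 1].
    scaled = MinMaxScaler().fit_transform(features)
    # Factorize into a small number of non-negative latent components.
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500, random_state=seed)
    latent = nmf.fit_transform(scaled)
    # After L2-normalizing each row, a dot product equals cosine similarity.
    unit = normalize(latent)
    similarity = unit @ unit[ref_index]
    # Sort from most to least similar and drop the reference track itself.
    order = np.argsort(-similarity)
    order = order[order != ref_index][:top_k]
    return order, similarity


# Hypothetical stand-in for the audio-feature table: 200 tracks x 8 features.
rng = np.random.default_rng(0)
audio = rng.random((200, 8))
top_idx, sims = top_similar(audio, ref_index=0)
```

With real data, `ref_index` would be the row holding the reference track, and `top_idx` would index back into the track names to produce a recommendation list.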

Image 1: t-SNE Projection of Spotify Tracks Based on Audio Features + Top 10 Song Recommendations

This visualization displays a two-dimensional t-SNE projection of over 6,000 songs from Spotify, where each point represents a song and is colored by its cosine similarity to “Blinding Lights” by The Weeknd. Audio features were normalized and reduced using Non-negative Matrix Factorization (NMF) before applying t-SNE. Shapes indicate song relevance: circles for general tracks, a diamond for the reference song, and squares for the top 10 most similar tracks. The color gradient highlights how closely a song matches the reference based on key musical characteristics.
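A two-dimensional projection of this kind might be produced along these lines. The sketch assumes the NMF components are already available as a matrix (random data is used here as a placeholder), and the perplexity and initialization settings are illustrative guesses rather than the values used in the project.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(latent, perplexity=30.0, seed=0):
    """Embed the rows of `latent` into two dimensions with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(latent)


# Placeholder for the NMF components of 150 tracks (5 components each).
rng = np.random.default_rng(1)
latent = rng.random((150, 5))
xy = project_2d(latent)  # one (x, y) pair per track
```

The resulting coordinates could then be passed to a matplotlib scatter plot, with the cosine similarity to the reference song supplied as the color value and different markers for the reference and top 10 tracks.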
The accompanying top 10 list surfaces tracks that share the blend of synth-pop, electronic beats, and emotional tone found in “Blinding Lights”. From genre-bending hits by K/DA and BLANCO to chart-toppers by Post Malone and Sia, the recommendations show that the model can identify sonic similarity beyond surface-level genre labels. They also illustrate how audio features can support personalized music discovery.

🎵 Top 10 Similar Songs to: Blinding Lights - The Weeknd

 

Project 2

Walmart Sales Prediction | Regression Modeling

Project Overview

Accurate weekly sales predictions are essential for retail businesses to manage inventory, forecast demand, and optimize profitability. This project explores the use of machine learning techniques to predict weekly sales for Walmart stores based on historical data spanning 2010 to 2012. Various regression models, including Random Forest, Boosted Trees, and Ridge Regression, were applied and compared to identify the most reliable approach for capturing complex data relationships and improving predictive accuracy.

Skills Applied: Machine Learning, Python (scikit-learn), Regression Modeling, Data Cleaning, Feature Engineering and more

Image 2: Average Weekly Sales by Store and Regression Model Performance

The first chart shows the average weekly sales for each store: Store 4 and Store 20 consistently outperform the others in sales volume, while stores such as Store 33 report the lowest averages. The second chart ranks the regression models by root mean squared error (RMSE). Random Forest Regression has the lowest RMSE (107,130.99) and the highest R² score (0.9636), indicating strong predictive accuracy. Decision Tree and Boosted Tree models also perform well, whereas the linear and neural network models lag behind, which suggests that tree-based models, and ensembles of them in particular, suit this task.
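A model comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions: the data is synthetic rather than the Walmart dataset, `GradientBoostingRegressor` stands in for "Boosted Trees", and the helper name `compare_regressors` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_regressors(X, y, seed=0):
    """Fit several regressors and return (name, RMSE, R^2) sorted by RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "Ridge": Ridge(alpha=1.0),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "Boosted Trees": GradientBoostingRegressor(random_state=seed),
    }
    rows = []
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        rows.append((name, rmse, float(r2_score(y_test, pred))))
    # Lowest RMSE first, matching the ranking used in the chart.
    return sorted(rows, key=lambda row: row[1])


# Synthetic non-linear target standing in for weekly sales.
rng = np.random.default_rng(0)
X = rng.random((400, 4))
y = 1000 * np.sin(6 * X[:, 0]) + 500 * X[:, 1] * X[:, 2] + rng.normal(0, 50, 400)
results = compare_regressors(X, y)
```

Evaluating every model on the same held-out split keeps the RMSE and R² values directly comparable.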

 

Project 3

Predicting Diabetes Using Machine Learning | Comparison of Classification Models

Project Overview

This project explores the effectiveness of five machine learning models—Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM)—in predicting diabetes status using a cleaned patient dataset. By employing cross-validation and assessing key metrics such as accuracy, precision, recall, and F1 score, the analysis highlights the importance of selecting a model that balances these metrics for reliable healthcare applications. A model with high accuracy and recall is crucial for effectively identifying diabetic patients, thereby minimizing the risks associated with missed diagnoses.

Skills Applied: Machine Learning, Supervised Learning, Python (scikit-learn), Cross-Validation, Hyperparameter Tuning and more
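Scoring a classifier on several metrics at once, as the overview describes, might look like the following. The dataset here is generated with `make_classification` as a placeholder for the patient data, and the helper name `mean_cv_metrics` and the class imbalance are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

METRICS = ["accuracy", "precision", "recall", "f1"]


def mean_cv_metrics(model, X, y, folds=5):
    """Average each metric over the cross-validation folds."""
    result = cross_validate(model, X, y, cv=folds, scoring=METRICS)
    return {m: float(result[f"test_{m}"].mean()) for m in METRICS}


# Placeholder data with a minority positive class, loosely mimicking a diagnosis task.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, weights=[0.65, 0.35], random_state=0
)
summary = mean_cv_metrics(LogisticRegression(max_iter=1000), X, y)
```

Reporting recall next to accuracy makes it easier to spot a model that looks accurate overall but misses many positive cases.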

Image 3: Classification Model Comparison

This boxplot illustrates the cross-validation accuracy of five classification models—Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). Each box represents the distribution of accuracy scores obtained through 5-fold cross-validation, highlighting the performance stability and variability of each model. The results emphasize the importance of model selection in achieving high accuracy for diabetes classification, crucial for effective healthcare decision-making.
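The per-fold accuracy scores behind a boxplot like this could be collected as in the sketch below. It uses synthetic data and default or lightly chosen hyperparameters, all of which are assumptions; the scaling step in front of the distance- and margin-based models is a common practice, not something the project description confirms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def cv_accuracy(X, y, folds=5, seed=0):
    """Return a dict mapping model name to its array of per-fold accuracies."""
    models = {
        "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": make_pipeline(StandardScaler(), SVC()),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        for name, model in models.items()
    }


# Placeholder data standing in for the cleaned patient dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
scores = cv_accuracy(X, y)
```

Passing `list(scores.values())` to matplotlib's `boxplot`, labeled with the model names, would give one box per model.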

 

Project 4

Utilizing MCMC to Explore the Parameter Space of an Exoplanet Transit

Project Overview

This research project models the transit of an exoplanet across its host star using the Python package ‘batman’. The objective was to predict the change in stellar brightness during a transit and to validate the model against photometry from the CR Chambliss Astronomical Observatory (CRCAO). A physics-based transit model was fit to the observations using a log-likelihood function. The Markov Chain Monte Carlo (MCMC) algorithm, run with ‘emcee’, was used to explore the uncertainties of parameters such as planet radius and transit timing. Visualizations created with matplotlib included light curves, histograms of the parameter distributions, and a corner plot illustrating parameter correlations. I presented the findings at the 241st AAS meeting as a contribution to the study of exoplanet transits.

Skills Applied: Python (pandas, matplotlib, numpy, emcee, & batman), Jupyter Notebook, and Excel
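The idea of pairing a log-likelihood with an MCMC sampler can be illustrated with a deliberately simplified, NumPy-only sketch. It is not the project's method: a Gaussian-shaped dip replaces the ‘batman’ transit model, a plain random-walk Metropolis sampler replaces the ensemble sampler in ‘emcee’, and the data, parameter values, and function names are all invented for the example.

```python
import numpy as np


def toy_transit(t, t0, depth, width=0.04):
    """Toy light curve: a Gaussian-shaped dip (not a physical transit shape)."""
    return 1.0 - depth * np.exp(-0.5 * ((t - t0) / width) ** 2)


def log_likelihood(theta, t, flux, sigma):
    """Gaussian log-likelihood (up to a constant) for theta = (t0, depth)."""
    t0, depth = theta
    # Flat prior bounds: reject parameters outside the allowed region.
    if not (t.min() < t0 < t.max() and 0.0 < depth < 1.0):
        return -np.inf
    resid = flux - toy_transit(t, t0, depth)
    return -0.5 * np.sum((resid / sigma) ** 2)


def metropolis(log_prob, start, step, n_steps, rng):
    """Random-walk Metropolis sampler; returns an (n_steps, n_params) chain."""
    current = np.asarray(start, dtype=float)
    current_lp = log_prob(current)
    chain = np.empty((n_steps, current.size))
    for i in range(n_steps):
        proposal = current + step * rng.normal(size=current.size)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, exp(proposal_lp - current_lp)).
        if np.log(rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current
    return chain


# Simulated photometry with known "true" parameters.
rng = np.random.default_rng(42)
t = np.linspace(-0.3, 0.3, 200)
sigma = 0.001
true_t0, true_depth = 0.02, 0.01
flux = toy_transit(t, true_t0, true_depth) + rng.normal(0.0, sigma, t.size)

chain = metropolis(
    lambda theta: log_likelihood(theta, t, flux, sigma),
    start=[0.0, 0.008],
    step=np.array([0.001, 0.0002]),
    n_steps=5000,
    rng=rng,
)
posterior = chain[1000:]  # discard burn-in samples
t0_mean, depth_mean = posterior.mean(axis=0)
t0_err, depth_err = posterior.std(axis=0)
```

Histograms of the columns of `posterior` give the parameter distributions, and their spread serves as an uncertainty estimate for each parameter.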

Image 4: TOI-4153 modeled lightcurve

Light curve of TOI-4153 from CRCAO data taken in blue (B) and infrared (I) filters. The model is built with the Python transit-modeling package ‘batman’. Its parameters were determined with the Markov Chain Monte Carlo algorithm, combined with known parameters taken from the ExoFOP database.

 

Project 5

Insights into Dog Behavior: Analyzing Dognition Data with MySQL

Project Overview

The goal of this project is to use MySQL queries to analyze trends and relationships in the Dognition database. Developed as a core component of the ‘Managing Big Data with MySQL’ course from Duke University, the project applies data cleaning, sorting, and more advanced analytical techniques in SQL. By exploring a large dataset, it aims to uncover patterns in canine behavior and preferences that could inform further research on dog-human interactions.

Skills Applied: MySQL, Writing Queries, Data Cleaning, and Big Data

Image 5: Top States by Number of Dognition Users
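A ranking like the one in Image 5 could come from an aggregate query along these lines. The sketch runs the SQL through Python's built-in `sqlite3` module in place of MySQL so that it is self-contained. The table name `users`, the columns `user_guid` and `state`, and the sample rows are assumptions for illustration and may not match the real Dognition schema.

```python
import sqlite3

# In-memory database with a tiny, made-up users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_guid TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("u1", "CA"), ("u2", "CA"), ("u2", "CA"),  # u2 appears twice (duplicate row)
        ("u3", "NY"), ("u4", "TX"), ("u5", "NY"),
        ("u6", "CA"), ("u7", None),                # u7 has no state recorded
    ],
)

# Count each user once per state, skip missing states, and keep the top five.
TOP_STATES_SQL = """
SELECT state, COUNT(DISTINCT user_guid) AS num_users
FROM users
WHERE state IS NOT NULL AND state <> ''
GROUP BY state
ORDER BY num_users DESC, state
LIMIT 5
"""
top_states = conn.execute(TOP_STATES_SQL).fetchall()
# top_states == [("CA", 3), ("NY", 2), ("TX", 1)]
```

`COUNT(DISTINCT ...)` matters when the table contains duplicate rows, since a plain `COUNT(*)` would count the same user more than once.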

 

Project 6

Interactive Animation of Museum Visitor Paths and Hourly Room Traffic in Tableau

Project Overview

This project was completed as part of the ‘Data Visualization in Tableau’ course on DataCamp, where I turned raw museum data into an interactive animation. The resulting tool highlights hourly patterns and trends in museum visitor behavior. The project demonstrates my proficiency with Tableau and the practical application of data visualization skills to a real-world scenario.

Skills Applied: Tableau, Data Visualization

Image 6: Common Museum Visitor Paths

 


