Data Science Portfolio
Zachary’s Resume (pdf)
Data Science Courses Completed
Welcome to my data science portfolio! I’m Zachary Raup, a data scientist with a strong foundation in physics and a passion for uncovering insights from complex datasets. I graduated Summa Cum Laude from Kutztown University with a B.S. in Physics, where I focused on data modeling in astrophysical systems—particularly exoplanets and binary stars. This research experience trained me to approach problems analytically, work with real-world uncertainty, and extract meaning from noisy data.
To strengthen my data science skillset, I earned certifications from DataCamp in Data Science, Data Analysis, Python, and SQL, and completed coursework in machine learning, data preprocessing, and visualization. I apply these skills using Python (pandas, scikit-learn, matplotlib, numpy) and SQL, with additional proficiency in Tableau and Power BI for data storytelling.
My work focuses on building interpretable, performance-driven models to support real-world decision-making. I enjoy collaborating across disciplines, turning messy data into actionable insight, and constantly learning new techniques to grow as a scientist and developer.
Thanks for visiting—feel free to explore my projects!
Developed a deep learning pipeline to classify chest X-rays as Normal or Pneumonia using an ensemble of pretrained CNNs (ResNet18, DenseNet121, EfficientNet-B0). Achieved a 91.2% test accuracy and an F1-score of 0.9332, with all models demonstrating high pneumonia recall, minimizing false negatives.
Key components:
This project showcases how deep learning and explainable AI can support radiologists by improving diagnostic accuracy and transparency in medical imaging.
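As a rough illustration of how predictions from several CNNs can be combined, the sketch below uses soft voting: it averages each model's pneumonia probability and thresholds the mean. The write-up does not say how the ensemble was actually combined, so the voting rule, the 0.5 threshold, the function names, and the probabilities are all assumptions made for this example.

```python
def soft_vote(prob_per_model, threshold=0.5):
    """Average per-model pneumonia probabilities, then threshold the mean.

    prob_per_model: one list of probabilities per model, aligned by image.
    """
    n_models = len(prob_per_model)
    mean_probs = [sum(col) / n_models for col in zip(*prob_per_model)]
    preds = [1 if p >= threshold else 0 for p in mean_probs]
    return mean_probs, preds


def recall_and_f1(y_true, y_pred):
    """Recall and F1 for the positive class (pneumonia = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Made-up probabilities standing in for three models' outputs on four images.
model_probs = [
    [0.90, 0.20, 0.60, 0.40],
    [0.80, 0.10, 0.40, 0.70],
    [0.70, 0.30, 0.80, 0.30],
]
labels = [1, 0, 1, 1]

mean_probs, preds = soft_vote(model_probs)
recall, f1 = recall_and_f1(labels, preds)
```

Lowering the threshold trades precision for recall, which is one common way to reduce false negatives in a screening setting.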
Accurate weekly sales predictions are crucial for large-scale retailers like Walmart to optimize inventory management, labor allocation, and supply chain planning. This project uses historical sales data (2010–2012) to build a machine learning pipeline that predicts weekly sales using a blend of store-level, temporal, and economic features. After thorough EDA, feature engineering, and model tuning, five regression algorithms were evaluated, with XGBoost and LightGBM demonstrating top performance.
Key goals included:
Best Model: XGBoost with RMSE ≈ $61.4K and R² ≈ 0.988 on the test set
Notable Insight: Holiday weeks and store-specific trends were the strongest predictors of weekly sales variability
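For readers unfamiliar with the two metrics reported above, here is a minimal, dependency-free sketch of how RMSE and R² are computed from actual and predicted values. The numbers are toy values chosen for easy arithmetic, not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy weekly sales figures (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

RMSE is expressed in dollars here, so it is easiest to interpret relative to typical weekly sales, while R² measures the share of variance the model explains.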
This project applies unsupervised machine learning techniques to uncover patterns in Spotify audio data and recommend musically similar songs. Non-negative Matrix Factorization (NMF) was used for dimensionality reduction and t-distributed Stochastic Neighbor Embedding (t-SNE) for visualization, mapping the feature space of over 6,000 tracks into an interpretable 2D projection. Cosine similarity was then used to identify the songs most similar to “Blinding Lights” by The Weeknd. The final result is a visual and analytical exploration of musical relationships based on audio characteristics.
Top 10 Similar Songs to: Blinding Lights - The Weeknd
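The similarity ranking can be sketched as follows, assuming each track has already been reduced to a short feature vector (for example, NMF components). The track names, vectors, and the helper `most_similar` are hypothetical and only illustrate the cosine-similarity step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, features, k=10):
    """Return the k tracks closest to `query` as (name, score), best first."""
    scores = [
        (cosine_similarity(features[query], vec), name)
        for name, vec in features.items()
        if name != query
    ]
    scores.sort(reverse=True)
    return [(name, score) for score, name in scores[:k]]


# Hypothetical reduced feature vectors (not real Spotify data).
tracks = {
    "query_track": [0.9, 0.1, 0.3],
    "track_a": [0.8, 0.2, 0.3],
    "track_b": [0.1, 0.9, 0.2],
    "track_c": [0.9, 0.1, 0.4],
}

top = most_similar("query_track", tracks, k=2)
```

Cosine similarity compares the direction of two vectors rather than their magnitude, so tracks with the same mix of components score highly even if their overall feature values differ in scale.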
Developed a machine learning pipeline to detect machine failures in a manufacturing environment using sensor data. Feature engineering and class rebalancing were applied to improve signal extraction and model fairness. Compared Logistic Regression, Random Forest, and XGBoost, with XGBoost achieving 98.6% accuracy and a perfect AUC on the test set.
Key components:
Engineered features: Power, Temp_Delta, Speed_Torque_Ratio, and Wear_per_Torque
This project demonstrates how domain knowledge, feature construction, and ensemble methods can work together to create a reliable and explainable predictive maintenance solution.
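To make the idea of domain-driven feature construction concrete, the sketch below derives four features named Power, Temp_Delta, Speed_Torque_Ratio, and Wear_per_Torque from a single sensor reading. The project does not spell out its formulas, so every definition here is an assumption based on the feature names, and the input keys (`torque_nm`, `speed_rpm`, and so on) are hypothetical.

```python
import math


def engineer_features(row):
    """Derive four illustrative features from one sensor reading.

    All formulas are assumed from the feature names, not taken from the project.
    """
    torque = row["torque_nm"]
    speed_rpm = row["speed_rpm"]
    if torque <= 0:
        raise ValueError("torque must be positive to form the ratio features")
    return {
        # Mechanical power in watts: torque times angular speed in rad/s.
        "Power": torque * speed_rpm * 2 * math.pi / 60,
        # How much hotter the process runs than the surrounding air.
        "Temp_Delta": row["process_temp_k"] - row["air_temp_k"],
        # Rotational speed relative to load.
        "Speed_Torque_Ratio": speed_rpm / torque,
        # Accumulated tool wear relative to load.
        "Wear_per_Torque": row["tool_wear_min"] / torque,
    }


# A made-up reading with hypothetical field names.
reading = {
    "torque_nm": 40.0,
    "speed_rpm": 1500.0,
    "air_temp_k": 298.0,
    "process_temp_k": 308.5,
    "tool_wear_min": 100.0,
}

features = engineer_features(reading)
```

Ratios and differences like these give tree-based models direct access to physically meaningful quantities that would otherwise have to be approximated through many splits.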
This research project models exoplanet transits using the Python package ‘batman’. The objective was to accurately predict the change in stellar brightness during a transit and to validate the model against photometry from the CR Chambliss Astronomical Observatory (CRCAO). A physics-based transit model was fit to the observations using a log-likelihood function, and Markov Chain Monte Carlo (MCMC) sampling with ‘emcee’ was used to explore the uncertainties in parameters such as planet radius and transit timing. Visualizations created with matplotlib included light curves and histograms of the parameter distributions. The findings, presented at the 241st AAS meeting, contribute to the understanding of exoplanet transit dynamics and of planetary systems beyond our solar system.
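The log-likelihood and MCMC steps can be illustrated with a deliberately simplified, dependency-free sketch. It is not the project's code: a box-shaped dip stands in for the limb-darkened ‘batman’ model, a plain Metropolis sampler stands in for the ensemble sampler in ‘emcee’, and the data are synthetic. Only one parameter, the transit depth, is sampled, and all names and values are assumptions.

```python
import math
import random


def box_transit(t, depth, t_start=0.4, t_end=0.6):
    """Relative flux: 1 outside the transit window, 1 - depth inside it."""
    return 1.0 - depth if t_start <= t <= t_end else 1.0


def log_likelihood(depth, times, flux, sigma):
    """Gaussian log-likelihood, dropping terms that do not depend on depth."""
    resid_sq = sum((f - box_transit(t, depth)) ** 2 for t, f in zip(times, flux))
    return -0.5 * resid_sq / sigma**2


def metropolis(times, flux, sigma, n_steps=5000, step=0.0005, seed=0):
    """Random-walk Metropolis sampler for depth with a flat prior on [0, 1]."""
    rng = random.Random(seed)
    depth = 0.005  # arbitrary starting guess
    logp = log_likelihood(depth, times, flux, sigma)
    chain = []
    for _ in range(n_steps):
        proposal = depth + rng.gauss(0.0, step)
        if 0.0 <= proposal <= 1.0:  # proposals outside the prior are rejected
            logp_new = log_likelihood(proposal, times, flux, sigma)
            accept_prob = math.exp(min(0.0, logp_new - logp))
            if rng.random() < accept_prob:
                depth, logp = proposal, logp_new
        chain.append(depth)
    return chain


# Synthetic light curve: a 1% deep transit with Gaussian noise.
data_rng = random.Random(42)
true_depth, sigma = 0.01, 0.001
times = [i / 99 for i in range(100)]
flux = [box_transit(t, true_depth) + data_rng.gauss(0.0, sigma) for t in times]

chain = metropolis(times, flux, sigma)
posterior = chain[1000:]  # discard burn-in
depth_mean = sum(posterior) / len(posterior)
depth_std = math.sqrt(sum((d - depth_mean) ** 2 for d in posterior) / len(posterior))
```

The spread of the retained samples (`depth_std`) plays the role of the parameter uncertainty, and a histogram of `posterior` would correspond to the parameter-distribution plots described above.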
This project uses MySQL queries to analyze trends and relationships in the Dognition database. Developed as part of the ‘Managing Big Data with MySQL’ course from Duke University, it applies data cleaning, sorting, and more advanced analytical techniques in SQL to a large dataset. The aim is to uncover insights into canine behavior patterns and preferences that can inform further research and practical applications in understanding and improving dog-human interactions.
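A typical cleaning-plus-aggregation query of the kind described above might look like the sketch below. It runs through Python's built-in sqlite3 module as a stand-in for MySQL so that it is self-contained. The table `tests_completed`, its columns, and its rows are hypothetical and do not reflect the actual Dognition schema.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tests_completed (dog_id TEXT, breed_group TEXT, test_name TEXT)"
)
rows = [
    ("d1", "Herding", "test_1"),
    ("d1", "Herding", "test_2"),
    ("d2", "Herding", "test_1"),
    ("d3", "Sporting", "test_1"),
    ("d4", None, "test_1"),  # missing breed group, filtered out below
]
conn.executemany("INSERT INTO tests_completed VALUES (?, ?, ?)", rows)

# Count distinct dogs and completed tests per breed group, skipping NULLs.
query = """
SELECT breed_group,
       COUNT(DISTINCT dog_id) AS num_dogs,
       COUNT(*) AS num_tests
FROM tests_completed
WHERE breed_group IS NOT NULL
GROUP BY breed_group
ORDER BY num_tests DESC
"""
results = conn.execute(query).fetchall()
conn.close()
```

The same SELECT statement should also be valid in MySQL, since it relies only on standard filtering, grouping, and ordering clauses.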
Zachary’s Portfolio
Project 1: Chest X-Ray Pneumonia Detection with Deep Learning
Project 2: Forecasting Retail Sales with Machine Learning | Regression Modeling
Project 3: Discovering Similar Songs using Machine Learning and Spotify
Project 4: Predictive Maintenance in Manufacturing
Project 5: Utilizing MCMC in Python to Explore the Parameter Space of an Exoplanet Transit
Project 6: Insights into Dog Behavior: Analyzing Dognition Data with MySQL