Zachary-Raup

Data Science Portfolio


Project maintained by ZRaup Hosted on GitHub Pages — Theme by mattgraham

Insights into Dog Behavior: Analyzing Dognition Data with MySQL

By: Zachary Raup

Goal of Project:

The primary objective of this project is to utilize MySQL queries to conduct a comprehensive analysis of trends and correlations within the vast Dognition dataset. As part of the ‘Managing Big Data with MySQL’ course at Duke University, the project highlights improving skills in data cleaning, organizing, and applying advanced analytical techniques using SQL. By concentrating on these areas, the queries seek to reveal actionable insights from the dataset, deepening the understanding of dogs’ behavioral patterns and preferences based on the collected data. This approach not only enables the practical use of database management techniques but also develops expertise in managing and interpreting large-scale datasets efficiently.


About the Data Set: Dognition

Dognition is a company that offers interactive tests designed for users to engage with their dogs. These tests are intended to create a personalized profile of the dog’s personality based on collected data and information about the dog’s background. The company uses this data to provide insights into various aspects of a dog’s behavior and preferences.

The Dognition dataset encompasses a comprehensive range of information across six tables, featuring over 30 columns and exceeding 100,000 rows of data. This dataset includes details about users, such as their demographics and geographical information. It also captures specific attributes related to dogs, such as breed, age, and size. Additionally, the dataset covers information about the tests conducted and the corresponding results, offering a rich source of information for analyzing behavioral patterns and preferences in dogs.

By leveraging this dataset, Dognition aims to enhance knowledge of canine behavior, helping dog owners discover activities and interests that their dogs may excel in or enjoy. This data-driven approach not only supports personalized recommendations but also contributes to broader comprehension into pet behavior and psychology.

* data set was not cleaned beforehand but was changed to protect personal information

Link to Dognition Website


Writing Queries to Analyze the Data:

Cleaning and Analyzing Data: Number of Tests Completed per Day of the Week

This query calculates the count of tests completed by distinct users with their dogs on each weekday across multiple years within the Dognition dataset. The dataset is filtered to include only users located in the contiguous United States (excluding Hawaii and Alaska) and excludes entries marked with “1” in their exclude columns to eliminate non-user testing inputs. The results are sorted first by year in ascending order and then by the total number of tests completed in descending order. The query utilizes a join function between the ‘dogs’ and ‘users’ tables to combine data from the complete_tests table (c) with a refined subset of dog and user data (dogs_cleaned). This ensures that the analysis focuses exclusively on tests associated with dogs owned by users meeting the specified geographical and exclusion criteria.

%%sql
SELECT DAYOFWEEK(c.created_at) AS dayasnum, YEAR(c.created_at) AS year,
COUNT(c.created_at) AS numtests,
(CASE
    WHEN DAYOFWEEK(c.created_at)=1 THEN "Su"
    WHEN DAYOFWEEK(c.created_at)=2 THEN "Mo"
    WHEN DAYOFWEEK(c.created_at)=3 THEN "Tu"
    WHEN DAYOFWEEK(c.created_at)=4 THEN "We"
    WHEN DAYOFWEEK(c.created_at)=5 THEN "Th"
    WHEN DAYOFWEEK(c.created_at)=6 THEN "Fr"
    WHEN DAYOFWEEK(c.created_at)=7 THEN "Sa"
END) AS daylabel
FROM complete_tests c JOIN
    (SELECT DISTINCT dog_guid
    FROM dogs d JOIN users u
        ON d.user_guid=u.user_guid
    WHERE ((u.exclude IS NULL OR u.exclude=0)
        AND u.country="US"
        AND (u.state!="HI" AND u.state!="AK")
        AND (d.exclude IS NULL OR d.exclude=0))) AS dogs_cleaned
    ON c.dog_guid=dogs_cleaned.dog_guid
GROUP BY year,daylabel
ORDER BY year ASC, numtests DESC
LIMIT 5;
 * mysql://studentuser:***@localhost/dognitiondb
5 rows affected.
dayasnum year numtests daylabel
1 2013 5860 Su
7 2013 4674 Sa
2 2013 3695 Mo
4 2013 3496 We
3 2013 3449 Tu

Actionable Insights:

Based on the SQL query results that analyze test completion trends within the Dognition dataset, a few notable findings emerge that can be used to optimize test participation strategies and grow user engagement:

Peak Test Completion Days:
Sundays and Saturdays have the highest test completion rates, with 5,860 and 4,674 tests completed respectively in the year 2013. This indicates that users are more likely to engage in test activities during weekends.

Recommendation: Increase marketing efforts and user engagement initiatives on weekends to draw in higher user activity. Consider launching weekend-specific challenges or promotions to boost participation.
Among weekdays, Mondays show relatively higher engagement compared to the other weekdays in 2013.

Recommendation: Schedule mid-week reminders or activities to maintain user engagement throughout the week. Introduce mid-week incentives or thematic content to sustain interest and participation.

By capitalizing on these observations, Dognition can optimize user engagement strategies to enhance user experience and drive higher participation rates in their testing activities. The regular analysis of test completion trends allows for targeted marketing efforts and timely reminders, ultimately fostering a more active and engaged community of users.


Cleaning and Analyzing Data: Number of Distinct Users by State

This query calculates the number of distinct users who have completed tests with their dogs, segmented by state within the Dognition dataset. The dataset is filtered to include only users located in the contiguous United States (excluding Hawaii and Alaska) and excludes entries marked with “1” in their exclude columns to remove non-user inputs. The results are sorted by the total number of distinct users in descending order. The query joins the ‘dogs’ and ‘users’ tables to combine data from the complete_tests table (c) with a refined subset of dog and user data (dogs_cleaned). This ensures the analysis focuses exclusively on tests associated with dogs owned by users who meet the specified geographical and exclusion criteria.

%%sql
SELECT dogs_cleaned.state AS state, COUNT(DISTINCT dogs_cleaned.user_guid) AS
numusers
FROM complete_tests c JOIN
    (SELECT DISTINCT dog_guid, u.user_guid, u.state
    FROM dogs d JOIN users u
        ON d.user_guid=u.user_guid
    WHERE ((u.exclude IS NULL OR u.exclude=0)
        AND u.country="US"
        AND (d.exclude IS NULL OR d.exclude=0))) AS dogs_cleaned
    ON c.dog_guid=dogs_cleaned.dog_guid
GROUP BY state
ORDER BY numusers DESC
LIMIT 5;
 * mysql://studentuser:***@localhost/dognitiondb
5 rows affected.
state numusers
CA 1363
NY 628
TX 536
FL 502
NC 467

Actionable Insights:

The SQL query identifies the top 5 states with the highest number of Dognition users. This analysis uncovers several key insights that can be used to enhance user engagement and improve test participation strategies:

Peak User States:
California (CA) has the highest number of distinct users, with 1,363 unique users engaging in test activities.New York (NY) follows with 628 users, Texas (TX) with 536 users, Florida (FL) with 502 users, and North Carolina (NC) with 467 users. The high participation rates in states like California, New York, and Texas suggest that these regions have a significant user base interested in Dognition activities.

Recommendation: Focus marketing and engagement efforts on these top states to further capitalize on the existing user base. Consider organizing state-specific events or promotions to increase engagement in these high-participation areas. Tailor content and communication strategies to the preferences and interests of users in these states. Implement localized advertising and partnerships with regional organizations to strengthen community ties and drive further participation.
Underrepresented Regions:
The query results provide a clear picture of regional engagement, allowing for targeted geographical analysis. This can be useful for identifying potential markets for expansion and understanding regional behavior patterns.

Recommendation: Use this awareness  to identify underrepresented regions that may benefit from increased outreach efforts. Develop strategies to engage users in states with lower participation rates by addressing potential barriers and promoting the benefits of the Dognition program.

Utilizing these findings from the top 5 states with the highest number of Dognition users, Dognition can enhance user engagement strategies to improve user experience and drive higher participation rates in these key regions. Regular analysis of participation trends in these states allows for targeted marketing efforts, timely reminders, and region-specific initiatives, ultimately fostering a more active and engaged community of users in these high-participation areas.


Cleaning and Analyzing Data: Average Test Complettion Time per Breed Type

This query calculates the average completion time and standard deviation for each dog breed across all tests using data from the exam_answers table in the Dognition dataset. Negative durations are excluded from the analysis. The results are categorized by breed type and are derived through a join between the dogs and exam_answers tables to aggregate the necessary data.

%%sql
SELECT d.breed_type AS breed_type,
AVG(TIMESTAMPDIFF(minute,e.start_time,e.end_time)) AS AvgDuration,
STDDEV(TIMESTAMPDIFF(minute,e.start_time,e.end_time)) AS StdDevDuration
FROM dogs d JOIN exam_answers e
    ON d.dog_guid=e.dog_guid
WHERE TIMESTAMPDIFF(minute,e.start_time,e.end_time)>0
GROUP BY breed_type;
 * mysql://studentuser:***@localhost/dognitiondb
4 rows affected.
breed_type AvgDuration StdDevDuration
Cross Breed 11810.3230 59113.45580229881
Mixed Breed/ Other/ I Don't Know 9145.1575 48748.626840777506
Popular Hybrid 7734.0763 45577.65824281632
Pure Breed 12311.2558 60997.35425304078

Actionable Insights:

The SQL query analyzes the average duration and standard deviation of exam completion times for different breed types within the Dognition dataset and reveals important findings that can inform user engagement and test optimization.

Longest Test Completion Time:
Pure breed dogs have the longest average exam duration of 12,311 minutes and the highest variability in completion times. Cross breed dogs exhibit the second highest average exam duration and significant variability.

Recommendation: Investigate potential factors contributing to longer and more variable completion times for pure breed and cross breed dogs. Consider offering tailored guidance or support to owners of these breed types to streamline the testing process. Providing specific tips or resources that could help these users complete exams more efficiently.
High Variability Between Completion Times:
The query results show high variability, with each breed type exhibiting a high standard deviation in completion times.

Recommendation: Further investigate the causes of high variability in test completion times. Determine whether this measure accurately reflects the pace at which different dog breeds complete the tests. Look for outliers within the data to understand the underlying reasons for this variability which can inform the development of more effective engagement and support strategies.

Dognition can refine its test optimization strategies to enhance the testing experience across different breed types. Regular analysis of exam duration trends allows for targeted support and resources, helping to reduce variability and improve the overall efficiency of the testing process. This approach will ultimately lead to higher user satisfaction and increased participation rates in Dognition’s testing activities.



Zachary’s Portfolio
Project 1: Utilizing MCMC in Python to Explore the Parameter Space of an Exoplanet Transit
Project 3: Interactive Animation of Museum Visitor Paths and Hourly Room Traffic in Tableau