Project Overview
A comprehensive database-driven machine learning project that predicts whether a quarterback (QB) will throw at least one touchdown (TD) in a given NFL game using player statistics, game context, and historical performance data.
Project Objective: Predict whether an NFL quarterback will throw a touchdown pass in a game using past performance and game details. This binary classification model uses historical QB game logs, player career stats, and basic bio data stored in a SQLite database, demonstrating feature engineering, model evaluation, and explainability techniques.
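As a concrete illustration of the prediction target, the binary label can be derived straight from the per-game passing stats in the database. A minimal sketch follows; the table name qb_stats, the column td_passes, and the database path are assumptions for illustration, not the project's exact schema:

import sqlite3
import pandas as pd

# Load per-game QB passing stats from the SQLite database (hypothetical path/table).
conn = sqlite3.connect("data/nfl_qb.db")
games = pd.read_sql("SELECT * FROM qb_stats", conn)

# Target: 1 if the QB threw at least one touchdown pass in the game, else 0.
games["threw_td"] = (games["td_passes"] >= 1).astype(int)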
Key Features
🗄️ Database-Driven
All data is stored in SQLite under a relational schema for easy management and validation
⚡ Real-time Predictions
Make predictions from current player data in the database, with confidence scores
✅ Data Validation
Comprehensive data quality checks and validation framework
🚀 Automated Workflow
One-command setup and deployment with modular architecture
📊 Historical Tracking
View prediction history and accuracy tracking in the database
🌐 Modern Web App
Beautiful Streamlit interface with multiple pages and real-time updates
System Architecture
Built with a modular, database-first approach using SQLite for data persistence and Streamlit for the user interface.
Project Structure
📁 data/
├── 📁 raw/ # Original CSV files
└── 📁 processed/ # Cleaned & engineered datasets
📁 src/
├── 🗄️ database.py # Database management
├── 📥 data_loader.py # Load CSV data into database
├── ✅ data_validator.py # Data quality validation
├── 🔄 preprocess.py # Database-driven preprocessing
├── 🎯 train_model.py # Model training
└── 📊 explain_shap.py # Model explainability
📁 app/
└── 🌐 app.py # Streamlit web application
Database Schema
The project uses a relational SQLite database with comprehensive data management:
Core Tables
- basic_stats: Player demographics and physical info
- game_logs: Game-by-game performance records
- qb_stats: QB-specific game statistics
- career_stats: Season-level career statistics
- qb_career_passing: QB career passing stats
- predictions: Model prediction history
Key Relationships
- Players linked by player_id
- Game logs linked to QB stats by game_log_id
- Career stats linked to QB passing stats by career_id
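A minimal sqlite3 sketch of how two of these tables and their player_id / game_log_id links could be declared; the column lists are illustrative rather than the project's exact schema:

import sqlite3

conn = sqlite3.connect("data/nfl_qb.db")  # hypothetical database path
conn.executescript("""
CREATE TABLE IF NOT EXISTS basic_stats (
    player_id  TEXT PRIMARY KEY,
    name       TEXT,
    position   TEXT,
    height_in  INTEGER,
    weight_lb  INTEGER
);
CREATE TABLE IF NOT EXISTS game_logs (
    game_log_id INTEGER PRIMARY KEY AUTOINCREMENT,
    player_id   TEXT REFERENCES basic_stats(player_id),
    season      INTEGER,
    week        INTEGER,
    opponent    TEXT
);
CREATE TABLE IF NOT EXISTS qb_stats (
    game_log_id   INTEGER REFERENCES game_logs(game_log_id),
    pass_attempts INTEGER,
    pass_yards    INTEGER,
    td_passes     INTEGER,
    interceptions INTEGER
);
""")
conn.commit()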
Model Performance
Data Sources
- Total Records: 100,000+ game logs
- Quarterbacks: 500+ players
- Years Covered: 2000-2024
- Files Used: Game_Logs_Quarterback.csv, Career_Stats_Passing.csv, Basic_Stats.csv
Web Application Features
The Streamlit app is organized into several pages; a rough sketch of the prediction flow follows the page descriptions below.
🎯 Make Prediction
- Select quarterback from database
- View recent performance stats
- Make real-time predictions
- Save predictions to database
🗄️ Player Database
- Browse all available data
- View database statistics
- Explore sample records
📊 Prediction History
- Track all predictions made
- View prediction accuracy
- Analyze confidence scores
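A rough sketch of what the Make Prediction flow could look like end to end: pull a quarterback's recent averages from the database, score them with the trained model, and record the result in the predictions table. The query, column names, model file, and player_id value are illustrative assumptions:

import sqlite3
import pandas as pd
import xgboost as xgb

conn = sqlite3.connect("data/nfl_qb.db")          # hypothetical path
model = xgb.XGBClassifier()
model.load_model("models/qb_td_model.json")       # hypothetical saved model

# Recent per-game averages for the selected QB (illustrative feature set).
features = pd.read_sql(
    """SELECT AVG(pass_attempts) AS avg_attempts,
              AVG(pass_yards)    AS avg_yards,
              AVG(td_passes)     AS avg_tds
       FROM qb_stats JOIN game_logs USING (game_log_id)
       WHERE player_id = ?""",
    conn, params=("qb_001",),
)

# Confidence score = predicted probability of throwing at least one TD.
proba = float(model.predict_proba(features)[0, 1])

# Save the prediction so the history page can track it later.
conn.execute(
    "INSERT INTO predictions (player_id, td_probability) VALUES (?, ?)",
    ("qb_001", proba),
)
conn.commit()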
Technical Implementation
Python · SQLite · XGBoost · Streamlit · Pandas · NumPy · Scikit-learn · SHAP
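A minimal, illustrative sketch of how this stack ties together for training, evaluation, and explainability; the processed dataset path, feature columns, and hyperparameters are assumptions:

import pandas as pd
import shap
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Engineered per-game features with the binary target (hypothetical file).
df = pd.read_csv("data/processed/qb_features.csv")
X, y = df.drop(columns=["threw_td"]), df["threw_td"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# SHAP shows which features push a game toward "throws a TD".
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X_test), X_test)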
Data Validation Framework
- Data completeness: Check for missing values
- Data consistency: Verify relationships between tables
- Value ranges: Ensure statistics are reasonable
- Duplicate detection: Find and handle duplicates
- Date validation: Check for valid game dates
- QB-specific checks: Validate quarterback data quality
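A small sketch of what two of these checks might look like against the SQLite database; table and column names are illustrative:

import sqlite3

conn = sqlite3.connect("data/nfl_qb.db")  # hypothetical path

# Completeness: game-log rows missing a player reference.
missing = conn.execute(
    "SELECT COUNT(*) FROM game_logs WHERE player_id IS NULL"
).fetchone()[0]

# Consistency: qb_stats rows whose game_log_id has no matching game log.
orphans = conn.execute(
    """SELECT COUNT(*) FROM qb_stats q
       LEFT JOIN game_logs g ON q.game_log_id = g.game_log_id
       WHERE g.game_log_id IS NULL"""
).fetchone()[0]

print(f"game_logs missing player_id: {missing}, orphaned qb_stats rows: {orphans}")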
Advanced Features
Automated Workflow
One-command setup with modular execution:
python main.py            # Complete workflow
python main.py --setup    # Database setup
python main.py --validate # Data validation
python main.py --app      # Launch web app
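A rough sketch of how such an entry point could be wired with argparse; the placeholder functions stand in for the scripts under src/ and are not the project's actual function names:

import argparse
import subprocess

def setup_database():
    print("creating the SQLite schema and loading CSVs...")   # stands in for database.py / data_loader.py

def run_validation():
    print("running data quality checks...")                   # stands in for data_validator.py

def run_full_workflow():
    setup_database()
    run_validation()
    print("preprocessing, training, and explaining the model...")

def main():
    parser = argparse.ArgumentParser(description="QB touchdown prediction workflow")
    parser.add_argument("--setup", action="store_true", help="database setup only")
    parser.add_argument("--validate", action="store_true", help="data validation only")
    parser.add_argument("--app", action="store_true", help="launch the Streamlit web app")
    args = parser.parse_args()

    if args.app:
        subprocess.run(["streamlit", "run", "app/app.py"])
    elif args.setup:
        setup_database()
    elif args.validate:
        run_validation()
    else:
        run_full_workflow()

if __name__ == "__main__":
    main()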
Database Management
- Connection pooling and transaction management
- Automated data loading from CSV files
- Prediction history tracking
- Data quality monitoring
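A small sketch of the loading and transaction-handling ideas using sqlite3's built-in context manager; the file paths and the predictions insert are illustrative:

import sqlite3
import pandas as pd
from contextlib import closing

with closing(sqlite3.connect("data/nfl_qb.db")) as conn:   # hypothetical path
    # Automated loading: write a raw CSV straight into its table.
    pd.read_csv("data/raw/Basic_Stats.csv").to_sql(
        "basic_stats", conn, if_exists="replace", index=False
    )
    # Transaction management: the connection context manager commits on
    # success and rolls back if the block raises.
    with conn:
        conn.execute(
            "INSERT INTO predictions (player_id, td_probability) VALUES (?, ?)",
            ("qb_001", 0.73),
        )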
What I'd Improve
- Real-time Data Integration: Connect to live NFL API for current game data
- Model Ensemble: Combine XGBoost with neural networks for better performance
- Advanced Analytics: Add trend analysis and player comparison features
- Performance Optimization: Implement caching and query optimization
- Cloud Deployment: Deploy to AWS/GCP with containerization
- API Development: Create REST API for external integrations
Business Impact
This project demonstrates my ability to:
- Design and implement comprehensive database architectures
- Build production-ready machine learning pipelines
- Create user-friendly interfaces for complex data systems
- Implement robust data validation and quality assurance
- Develop modular, maintainable code with automated workflows
- Handle large-scale data processing and real-time predictions