PyPIStats.org is a Flask-based web application that provides analytics and visualization for Python package download statistics from PyPI (the Python Package Index). It queries BigQuery public datasets, aggregates the data, and presents it through both a web interface and a JSON API.
- Framework: Flask (Python web framework)
- Database: PostgreSQL for storing aggregated statistics
- Task Queue: Celery with Redis for background processing
- Data Source: Google BigQuery (bigquery-public-data.pypi.file_downloads)
- Visualization: Plotly.js for interactive charts
- Authentication: GitHub OAuth for user features
- Deployment: Docker/Kubernetes support included
- ETL Process: Daily scheduled task (1am UTC) via Celery
- BigQuery Integration: Queries PyPI public dataset for download statistics
- Data Categories:
  - Overall downloads (with/without mirrors)
  - Python major versions (2, 3)
  - Python minor versions (2.7, 3.6, etc.)
  - Operating systems (Windows, Linux, Darwin, other)
- Mirror Filtering: Excludes downloads from known mirrors (bandersnatch, z3c.pypimirror, Artifactory, devpi)
- Data Retention: 180 days of historical data
- Aggregation: Builds statistics across all packages for total ecosystem metrics
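
The mirror filtering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the row shape `(installer_name, downloads)`, the function names, and the case-insensitive comparison are all assumptions made for the example. Only the mirror names come from the list above.

```python
# Installer names treated as mirrors, taken from the list above.
# Lower-casing before comparison is an assumption of this sketch.
MIRROR_INSTALLERS = {"bandersnatch", "z3c.pypimirror", "artifactory", "devpi"}


def is_mirror(installer_name):
    """Return True if the installer name matches a known mirror."""
    if installer_name is None:
        return False
    return installer_name.strip().lower() in MIRROR_INSTALLERS


def split_downloads(rows):
    """Sum downloads into 'with_mirrors' and 'without_mirrors' totals.

    Each row is a hypothetical (installer_name, downloads) pair.
    """
    totals = {"with_mirrors": 0, "without_mirrors": 0}
    for installer_name, downloads in rows:
        totals["with_mirrors"] += downloads
        if not is_mirror(installer_name):
            totals["without_mirrors"] += downloads
    return totals
```

Keeping both totals in one pass mirrors the "with/without mirrors" split in the overall download category.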
- OverallDownloadCount: Total downloads per package per day
- PythonMajorDownloadCount: Downloads by Python major version
- PythonMinorDownloadCount: Downloads by Python minor version
- SystemDownloadCount: Downloads by operating system
- RecentDownloadCount: Cached daily/weekly/monthly totals
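
As a rough sketch of how cached daily/weekly/monthly totals could be derived from per-day counts: the window lengths (1, 7, and 30 days), the label names, and the dictionary input are assumptions for illustration, not the project's schema.

```python
from datetime import date, timedelta

# Window lengths in days; 7 and 30 are assumed values for "week" and "month".
WINDOWS = {"last_day": 1, "last_week": 7, "last_month": 30}


def recent_totals(daily_counts, latest_day):
    """Compute day/week/month totals ending at latest_day (inclusive).

    daily_counts maps a datetime.date to that day's download count.
    """
    totals = {}
    for label, days in WINDOWS.items():
        start = latest_day - timedelta(days=days - 1)
        totals[label] = sum(
            count
            for day, count in daily_counts.items()
            if start <= day <= latest_day
        )
    return totals
```

Precomputing these totals once per ETL run means the recent endpoint can answer from a small cached table instead of summing daily rows on every request.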
- `/` - Home page with package search
- `/packages/<package>` - Package statistics dashboard with interactive charts
- `/search/<package>` - Package search results
- `/top` - Top 20 packages by download count (day/week/month)
- `/about` - About page
- `/faqs` - Frequently asked questions
- Interactive time-series charts with date range selectors
- Download proportion visualizations
- PyPI metadata integration (dependencies, description)
- Customizable lookback periods (up to 180 days)
- `/api/packages/<package>/recent` - Recent download counts
  - Query params: `period` (day/week/month)
- `/api/packages/<package>/overall` - Overall download time series
  - Query params: `mirrors` (true/false)
- `/api/packages/<package>/python_major` - Downloads by Python major version
  - Query params: `version` (2/3)
- `/api/packages/<package>/python_minor` - Downloads by Python minor version
  - Query params: `version` (2.7/3.6/etc)
- `/api/packages/<package>/system` - Downloads by operating system
  - Query params: `os` (Windows/Linux/Darwin)
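
A small helper for building request URLs against the endpoints listed above might look like this. The base URL `https://pypistats.org` and the helper name `build_api_url` are assumptions for illustration. The endpoint names and query parameters are the ones listed above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://pypistats.org"  # assumed host for this sketch
ENDPOINTS = {"recent", "overall", "python_major", "python_minor", "system"}


def build_api_url(package, endpoint, **params):
    """Build a JSON API URL for a package endpoint with optional query params."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    path = f"/api/packages/{quote(package, safe='')}/{endpoint}"
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"
```

The resulting string can be passed to any HTTP client, for example `build_api_url("requests", "recent", period="week")`.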
- GitHub OAuth integration
- Personal dashboard for package maintainers
- Track multiple packages in one view
- Daily ETL Process:
  - Celery scheduled task triggers at 1am UTC
  - Queries BigQuery for previous day's download data
  - Aggregates data by multiple dimensions
  - Stores results in PostgreSQL
  - Updates recent stats cache
  - Purges data older than 180 days
  - Runs VACUUM ANALYZE for database optimization
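
The date arithmetic behind the ETL steps above can be sketched as follows. The function name `etl_dates` is hypothetical, and measuring the 180-day cutoff from the run date (rather than from the day being loaded) is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # retention window stated in this document


def etl_dates(now_utc):
    """Return (target_day, purge_before) for an ETL run.

    now_utc must be a timezone-aware datetime.
    target_day is the previous UTC day, whose downloads get loaded.
    purge_before is the cutoff: rows dated earlier than this are deleted.
    """
    today = now_utc.astimezone(timezone.utc).date()
    target_day = today - timedelta(days=1)
    purge_before = today - timedelta(days=RETENTION_DAYS)
    return target_day, purge_before
```

For a run at 1am UTC, `etl_dates(datetime.now(timezone.utc))` yields yesterday's date as the load target along with the retention cutoff.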
- Request Handling:
  - User requests package stats via web or API
  - Flask queries PostgreSQL for aggregated data
  - Data formatted for Plotly.js visualization or JSON response
  - Results cached where appropriate
- Database: `POSTGRESQL_*` (username, password, host, port, dbname)
- Google Cloud: `GOOGLE_*` (project_id, private_key, etc.)
- GitHub OAuth: `GITHUB_CLIENT_ID`, `GITHUB_CLIENT_SECRET`
- Celery: `CELERY_BROKER_URL` (Redis connection)
- Flask: `PYPISTATS_SECRET` (session secret key)
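
One way the `POSTGRESQL_*` variables could be combined into a connection URI is sketched below. The exact variable names (`POSTGRESQL_USERNAME`, `POSTGRESQL_PASSWORD`, `POSTGRESQL_HOST`, `POSTGRESQL_PORT`, `POSTGRESQL_DBNAME`) are guesses based on the suffixes listed above, and the default port and the `postgres_uri` helper are assumptions for illustration.

```python
import os
from urllib.parse import quote_plus


def postgres_uri(env=None):
    """Assemble a PostgreSQL URI from POSTGRESQL_* style variables.

    env defaults to os.environ; pass a dict to test without touching
    the real environment. Variable names are assumed, not confirmed.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' do not break the URI.
    user = quote_plus(env["POSTGRESQL_USERNAME"])
    password = quote_plus(env["POSTGRESQL_PASSWORD"])
    host = env["POSTGRESQL_HOST"]
    port = env.get("POSTGRESQL_PORT", "5432")
    dbname = env["POSTGRESQL_DBNAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"
```

A missing required variable raises `KeyError`, which surfaces configuration mistakes at startup rather than at the first query.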
- `LocalConfig`: Local development
- `DevConfig`: Development environment
- `ProdConfig`: Production environment
- `TestConfig`: Testing environment
make pypistats  # Launch complete dev environment with docker-compose

pypistats/
├── application.py # Flask app factory
├── config.py # Configuration management
├── database.py # Database utilities
├── extensions.py # Flask extensions
├── models/ # SQLAlchemy models
├── tasks/ # Celery background tasks
├── views/ # Flask blueprints/routes
├── templates/ # Jinja2 HTML templates
├── static/ # CSS and static files
└── plots/ # Plotly chart configurations
- Python 3.7+
- Flask & extensions (SQLAlchemy, Migrate, Login, WTF, Limiter, HTTPAuth)
- Google Cloud BigQuery client
- Celery & Redis
- PostgreSQL (psycopg2)
- Requests
- Gunicorn (production server)
- Automatically converts dots and underscores to hyphens
- Handles PyPI's package naming conventions
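
The normalization described above matches the rule PEP 503 defines for package names. A minimal sketch, assuming the application follows that rule exactly (the function name is hypothetical):

```python
import re


def normalize_package_name(name):
    """Normalize a package name in the style of PEP 503.

    Runs of '-', '_' and '.' collapse to a single hyphen and the
    result is lower-cased.
    """
    return re.sub(r"[-_.]+", "-", name).lower()
```

With this rule, `Flask_SQLAlchemy`, `flask.sqlalchemy`, and `flask-sqlalchemy` all resolve to the same package page.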
- Aggregates statistics across all PyPI packages
- Provides ecosystem-wide metrics
- Flask-Limiter integration for API protection
- Configurable visibility timeout for long-running queries
- Plotly.js charts with zoom, pan, and hover details
- Range selector buttons (30d, 60d, 90d, 120d, all)
- Toggle between absolute and percentage views
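
The percentage view can be thought of as a per-day normalization of the absolute series. The sketch below assumes each category (for example a Python major version) has a list of daily counts of equal length. The input shape, the function name, and the rounding to two decimals are illustrative choices, not the project's implementation.

```python
def to_percentages(series):
    """Convert absolute daily counts to per-day percentage shares.

    series maps a category name to a list of daily counts; all lists
    are assumed to have the same length. Days with zero total
    downloads yield 0.0 for every category.
    """
    categories = list(series)
    n_days = len(next(iter(series.values()), []))
    result = {category: [] for category in categories}
    for i in range(n_days):
        day_total = sum(series[category][i] for category in categories)
        for category in categories:
            share = 100.0 * series[category][i] / day_total if day_total else 0.0
            result[category].append(round(share, 2))
    return result
```

Guarding against a zero daily total avoids a division error for packages with no downloads on a given day.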
- Dockerfile and docker-compose.yml provided
- docker-entrypoint.sh for container initialization
- Complete K8s manifests in the `kubernetes/` directory
- Includes web, tasks, redis, and flower deployments
- Deployment script included
- Celery monitoring dashboard
- Real-time task status and history
- `/health` - Basic health check
- `/status` - Application status
- Data is updated daily with previous day's statistics
- All timestamps are in UTC
- Package statistics exclude known mirror downloads by default
- Maximum lookback period is 180 days to manage database size
- Uses BigQuery's public PyPI dataset (bigquery-public-data.pypi.file_downloads)