Repository Recommendation System
Overview
The AI-powered repository recommendation system analyzes your project and suggests relevant repositories from your GitHub starred list. It uses semantic understanding, multi-factor scoring, and intelligent categorization to provide meaningful recommendations.
Features
Multi-Factor Scoring System
The recommendation engine evaluates repositories using five key factors:
- Semantic Similarity (35% weight): Uses sentence transformers to understand the conceptual similarity between your project and repositories
- Technology Stack Matching (25% weight): Compares programming languages, frameworks, and dependencies
- Topic Overlap (20% weight): Evaluates shared topics and keywords
- Popularity (10% weight): Considers star count and community engagement
- Recency (10% weight): Favors actively maintained projects
Each repository receives a composite score from 0 to 100, along with human-readable reasoning for each recommendation.
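Conceptually, the composite score is a weighted sum of the normalized factor scores, scaled to 0-100. A minimal sketch of that arithmetic (the weights mirror the configuration shown under Configuration below; the actual implementation in repo_recommender.py may differ in detail):

WEIGHTS = {"semantic": 0.35, "tech_stack": 0.25, "topic": 0.20,
           "popularity": 0.10, "recency": 0.10}

def composite_score(factors: dict) -> float:
    # Each factor score is assumed to be normalized to the 0-1 range.
    weighted = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return round(weighted * 100, 1)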
Intelligent Categorization
Recommendations are automatically categorized into:
- 📦 Direct Dependencies: Libraries and packages that can be directly added to your project
- 🔧 Tools & Utilities: Command-line tools and automation utilities
- 📚 Reference Implementations: Similar projects and code examples
- 🎓 Learning Resources: Tutorials, guides, and educational materials
Project Context Extraction
The system automatically analyzes your project by:
- Parsing README files for project description and technologies
- Reading package.json (Node.js) for dependencies and metadata
- Reading requirements.txt (Python) for dependencies
- Detecting programming languages from file extensions
- Extracting frameworks and technologies mentioned
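A minimal sketch of this kind of extraction (the helper name and return shape are illustrative, not the actual repo_recommender.py API):

import json
from pathlib import Path

def extract_dependencies(project_path: str) -> dict:
    root = Path(project_path)
    deps = {}
    pkg = root / "package.json"
    if pkg.exists():
        # Node.js: dependency names live under the "dependencies" key.
        deps["node"] = sorted(json.loads(pkg.read_text()).get("dependencies", {}))
    req = root / "requirements.txt"
    if req.exists():
        # Python: keep package names, dropping comments and version pins.
        deps["python"] = [line.split("==")[0].strip()
                          for line in req.read_text().splitlines()
                          if line.strip() and not line.startswith("#")]
    return deps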
Installation
Install Dependencies
# Install all dependencies
pip install -r scripts/requirements.txt
# Or install individually
pip install sentence-transformers scikit-learn numpy torch
Note: The first run will download the ML model (~80MB). This is a one-time download.
Verify Installation
python3 -c "from sentence_transformers import SentenceTransformer; print('OK')"
Usage
Basic Usage
# Analyze current project and get recommendations
python3 scripts/scan_starred_repos.py --recommend
Analyze Specific Project
# Analyze a different project
python3 scripts/scan_starred_repos.py --recommend --project-path /path/to/project
Customize Output
# Save recommendations to file
python3 scripts/scan_starred_repos.py --recommend --recommend-output recommendations.md
# Use text format instead of markdown
python3 scripts/scan_starred_repos.py --recommend --recommend-format text
# Get more recommendations per category
python3 scripts/scan_starred_repos.py --recommend --recommend-top-n 20
# Lower the minimum score threshold
python3 scripts/scan_starred_repos.py --recommend --recommend-min-score 20.0
Complete Workflow
# 1. First, scan and save your starred repositories
python3 scripts/scan_starred_repos.py --enhance-description --output repos/starred-repos.json
# 2. Then generate recommendations for your project
python3 scripts/scan_starred_repos.py --recommend --recommend-output recommendations.md
How It Works
1. Project Analysis
The system extracts context from your project:
Project: AI Tools collection for prompt engineering and repository analysis
Languages: python, javascript
Frameworks: flask, react
2. Repository Scoring
Each starred repository is scored across multiple dimensions:
Repository: username/awesome-project
- Semantic Similarity: 0.85 (high conceptual match)
- Tech Stack Match: 0.70 (Python + Flask)
- Topic Overlap: 0.60 (shared: ai, tools)
- Popularity: 0.75 (5,000 stars)
- Recency: 0.90 (updated last week)
Composite Score: 76.5/100
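The semantic factor is the part that requires the ML model. A minimal sketch of how such a similarity can be computed with sentence-transformers (the texts are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "AI tools collection for prompt engineering and repository analysis",
    "A powerful Python library for data processing and analysis",
])
# Cosine similarity is in [-1, 1]; related texts typically land well above 0.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.2f}")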
3. Categorization
Repositories are categorized based on:
- Keywords in descriptions and topics
- Repository characteristics (library vs tool)
- Score patterns (high tech match → dependency)
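A minimal keyword-based sketch of this step (the keyword lists are illustrative; the real rules live in the CATEGORIES configuration shown under Configuration below):

def categorize(repo: dict) -> str:
    # Search the description and topics for category keywords.
    text = " ".join([repo.get("description") or "", *repo.get("topics", [])]).lower()
    rules = {
        "direct_dependency": ["library", "package", "module"],
        "tool": ["cli", "tool", "automation", "utility"],
        "learning": ["tutorial", "guide", "course", "awesome"],
    }
    for category, keywords in rules.items():
        if any(kw in text for kw in keywords):
            return category
    return "reference"  # fallback: similar projects and code examples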
4. Report Generation
Results are presented with:
- Ranked recommendations per category
- Detailed scoring breakdown
- Human-readable reasoning
- Relevant metadata (stars, language, topics)
Output Example
# Repository Recommendations
## 📦 Direct Dependencies
### username/awesome-library
**Score: 85.2/100**
A powerful Python library for data processing and analysis.
**Scoring Breakdown:**
- Semantic Similarity: 0.87
- Tech Stack Match: 0.90
- Topic Overlap: 0.70
- Popularity: 0.80
- Recency: 0.85
**Why this recommendation:**
- High semantic similarity (0.87)
- Tech stack match: language: python
- Shared topics: data, analysis
- Highly popular (12,500 stars)
- Recently updated (< 1 month)
Language: Python | ⭐ 12,500 | Topics: data, analysis, python, library
Configuration
Scoring Weights
You can modify scoring weights in repo_recommender.py:
WEIGHTS = {
    "semantic": 0.35,    # Adjust importance of semantic similarity
    "tech_stack": 0.25,  # Adjust tech stack matching weight
    "topic": 0.20,       # Adjust topic overlap weight
    "popularity": 0.10,  # Adjust popularity factor
    "recency": 0.10,     # Adjust recency importance
}
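If you change these values, keep the weights summing to 1.0 so that composite scores stay on the 0-100 scale.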
Category Thresholds
Adjust categorization rules:
CATEGORIES = {
    "direct_dependency": {
        "keywords": ["library", "package", "module"],
        "threshold": 0.7,  # Minimum score for this category
    },
    # ...
}
ML Model Selection
Choose a different sentence transformer model:
# Faster but less accurate
recommender = RepositoryRecommender(model_name="all-MiniLM-L6-v2")
# More accurate but slower
recommender = RepositoryRecommender(model_name="all-mpnet-base-v2")
Performance
- Model Loading: 2-5 seconds (first time only)
- Embedding Generation: ~0.5 seconds per 100 repositories
- Total Analysis: < 2 minutes for 500+ repositories
Troubleshooting
Model Download Issues
If the model fails to download:
# Manually download model
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
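Once downloaded, the model is cached locally (under the Hugging Face cache directory by default; most sentence-transformers versions also honor the SENTENCE_TRANSFORMERS_HOME environment variable), so later runs work offline.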
Memory Issues
For large repository lists (1000+), consider tightening the filters:
# Keep only a few high-scoring results
python3 scripts/scan_starred_repos.py --recommend --recommend-top-n 5 --recommend-min-score 50
No Recommendations
If you get no recommendations:
- Lower the minimum score: --recommend-min-score 20
- Ensure your project has a README with a description
- Check that starred-repos.json exists and has data
Extensibility
Adding New Scoring Factors
def _calculate_custom_score(self, repo, context, reasoning):
    # Return a value in the 0-1 range, and give the new factor a
    # corresponding entry in WEIGHTS so it contributes to the composite.
    score = 0.0
    # ... your custom scoring logic ...
    # Append a human-readable explanation so it appears in the report, e.g.:
    # reasoning.append("Matches preferred license")
    return score
Custom Context Extraction
class ProjectContext:
    def _extract_custom_source(self):
        # Add support for new project file types
        # e.g., Cargo.toml, go.mod, etc.
        pass
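For example, a Cargo.toml extractor added to ProjectContext might look like this (a sketch assuming Python 3.11+ for the standard-library tomllib; the method name is illustrative):

import tomllib  # standard library in Python 3.11+

def _extract_cargo_toml(self, path):
    # Parse Rust dependency names from Cargo.toml.
    with open(path, "rb") as f:
        data = tomllib.load(f)
    return sorted(data.get("dependencies", {}))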
Hybrid ML Approach
The architecture supports adding API-based models:
class RepositoryRecommender:
    def _get_embedding_api(self, text):
        # Call OpenAI, Cohere, or another embedding API;
        # fall back to the local model if the call fails.
        try:
            return self._call_embedding_api(text)  # hypothetical provider wrapper
        except Exception:
            return self.model.encode(text)
Future Enhancements
- Support for more project file types (Cargo.toml, go.mod, etc.)
- Hybrid local + cloud ML models
- Dependency graph analysis
- GitHub API integration for real-time data
- Interactive CLI with filtering
- Web UI for visualization
- Export to various formats (JSON, CSV, HTML)
- Integration with project management tools
Contributing
Contributions welcome! Areas of interest:
- Additional scoring factors
- Better categorization logic
- Support for more programming languages
- Performance optimizations
- UI/UX improvements
License
Same as parent project.