IntelliDocs-ngx - Quick Reference Guide
🎯 One-Page Overview
What is IntelliDocs-ngx?
A document management system that scans, organizes, and searches your documents using AI and OCR.
Tech Stack
- Backend: Django 5.2 + Python 3.10+
- Frontend: Angular 20 + TypeScript
- Database: PostgreSQL/MySQL
- Queue: Celery + Redis
- OCR: Tesseract + Tika
📁 Project Structure
IntelliDocs-ngx/
├── src/ # Backend (Python/Django)
│ ├── documents/ # Core document management
│ │ ├── consumer.py # Document ingestion
│ │ ├── classifier.py # ML classification
│ │ ├── index.py # Search indexing
│ │ ├── matching.py # Auto-classification rules
│ │ ├── models.py # Database models
│ │ ├── views.py # REST API endpoints
│ │ └── tasks.py # Background tasks
│ ├── paperless/ # Core framework
│ │ ├── settings.py # Configuration
│ │ ├── celery.py # Task queue
│ │ └── urls.py # URL routing
│ ├── paperless_mail/ # Email integration
│ ├── paperless_tesseract/ # Tesseract OCR
│ ├── paperless_text/ # Text extraction
│ └── paperless_tika/ # Tika parsing
│
├── src-ui/ # Frontend (Angular)
│ ├── src/
│ │ ├── app/
│ │ │ ├── components/ # UI components
│ │ │ ├── services/ # API services
│ │ │ └── models/ # TypeScript models
│ │ └── assets/ # Static files
│
├── docs/ # User documentation
├── docker/ # Docker configurations
└── scripts/ # Utility scripts
🔑 Key Concepts
Document Lifecycle
1. Upload → 2. OCR → 3. Classify → 4. Index → 5. Archive
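The five stages above can be pictured as a chain of functions that each enrich a document record. The sketch below is illustrative only: the `Doc` dataclass and every stage function are hypothetical stand-ins, not IntelliDocs-ngx code, and the OCR and classification steps are stubbed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical in-memory record; not the real Document model."""
    filename: str
    content: str = ""
    tags: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def upload(doc: Doc) -> Doc:
    doc.history.append("upload")
    return doc


def ocr(doc: Doc) -> Doc:
    # Stub: a real OCR step would read text out of the file.
    doc.content = f"text extracted from {doc.filename}"
    doc.history.append("ocr")
    return doc


def classify(doc: Doc) -> Doc:
    # Stub: a keyword rule instead of a trained classifier.
    if "invoice" in doc.filename.lower():
        doc.tags.append("invoice")
    doc.history.append("classify")
    return doc


def index(doc: Doc) -> Doc:
    doc.history.append("index")
    return doc


def archive(doc: Doc) -> Doc:
    doc.history.append("archive")
    return doc


# Stage order mirrors the lifecycle line above.
STAGES: List[Callable[[Doc], Doc]] = [upload, ocr, classify, index, archive]


def run_pipeline(doc: Doc) -> Doc:
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

Calling `run_pipeline(Doc("Invoice-2023.pdf"))` walks the record through all five stages in order and leaves the visited stage names in `history`.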
Components
- Consumer: Processes incoming documents
- Classifier: Auto-assigns tags/types using ML
- Index: Makes documents searchable
- Workflow: Automates document actions
- API: Exposes functionality to frontend
📊 Module Map
| Module | Purpose | Key Files |
|---|---|---|
| documents | Core DMS | consumer.py, classifier.py, models.py, views.py |
| paperless | Framework | settings.py, celery.py, auth.py |
| paperless_mail | Email import | mail.py, oauth.py |
| paperless_tesseract | OCR engine | parsers.py |
| paperless_text | Text extraction | parsers.py |
| paperless_tika | Format parsing | parsers.py |
🔧 Common Tasks
Add New Document
from documents.consumer import Consumer
consumer = Consumer()
doc_id = consumer.try_consume_file(
    path="/path/to/document.pdf",
    override_correspondent_id=5,
    override_tag_ids=[1, 3, 7]
)
Search Documents
from documents.index import DocumentIndex
index = DocumentIndex()
results = index.search("invoice 2023")
Train Classifier
from documents.classifier import DocumentClassifier
classifier = DocumentClassifier()
classifier.train()
Create Workflow
from documents.models import Workflow, WorkflowAction
workflow = Workflow.objects.create(
    name="Auto-file invoices",
    enabled=True
)
action = WorkflowAction.objects.create(
    workflow=workflow,
    type="set_document_type",
    value=2  # Invoice type ID
)
🌐 API Endpoints
Documents
GET /api/documents/ # List documents
GET /api/documents/{id}/ # Get document
POST /api/documents/ # Upload document
PATCH /api/documents/{id}/ # Update document
DELETE /api/documents/{id}/ # Delete document
GET /api/documents/{id}/download/ # Download file
GET /api/documents/{id}/preview/ # Get preview
POST /api/documents/bulk_edit/ # Bulk operations
Search
GET /api/search/?query=invoice # Full-text search
Metadata
GET /api/correspondents/ # List correspondents
GET /api/document_types/ # List types
GET /api/tags/ # List tags
GET /api/storage_paths/ # List storage paths
Workflows
GET /api/workflows/ # List workflows
POST /api/workflows/ # Create workflow
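A minimal sketch of calling these endpoints from a script, using only the standard library. The helper name `build_api_request`, the base URL, and the `Authorization: Token ...` header scheme are assumptions for illustration; confirm the authentication method your instance actually uses.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_api_request(
    base_url: str,
    path: str,
    token: str,
    params: Optional[Mapping[str, str]] = None,
    method: str = "GET",
) -> Request:
    """Build (but do not send) a request for one of the endpoints listed above."""
    url = urljoin(base_url, path)
    if params:
        # urlencode escapes spaces and special characters in the query string.
        url += "?" + urlencode(params)
    return Request(
        url,
        method=method,
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )


# Full-text search, matching GET /api/search/?query=...
search_request = build_api_request(
    "http://localhost:8000", "/api/search/", "my-token", {"query": "invoice 2023"}
)
```

Passing the result to `urllib.request.urlopen` sends it to a running instance.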
🎨 Frontend Components
Main Components
- DocumentListComponent - Document grid view
- DocumentDetailComponent - Single document view
- DocumentEditComponent - Edit document metadata
- SearchComponent - Search interface
- SettingsComponent - Configuration UI
Key Services
- DocumentService - API calls for documents
- SearchService - Search functionality
- PermissionsService - Access control
- SettingsService - User settings
🗄️ Database Models
Core Models
Document
├── title: CharField
├── content: TextField
├── correspondent: ForeignKey → Correspondent
├── document_type: ForeignKey → DocumentType
├── tags: ManyToManyField → Tag
├── storage_path: ForeignKey → StoragePath
├── created: DateTimeField
├── modified: DateTimeField
├── owner: ForeignKey → User
└── custom_fields: ManyToManyField → CustomFieldInstance
Correspondent
├── name: CharField
├── match: CharField
└── matching_algorithm: IntegerField
DocumentType
├── name: CharField
└── match: CharField
Tag
├── name: CharField
├── color: CharField
└── is_inbox_tag: BooleanField
Workflow
├── name: CharField
├── enabled: BooleanField
├── triggers: ManyToManyField → WorkflowTrigger
└── actions: ManyToManyField → WorkflowAction
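The `match` and `matching_algorithm` fields on Correspondent suggest rule-based matching against document text. The sketch below shows how such a pair of fields could be evaluated. The `MatchingAlgorithm` codes and the `matches` helper are hypothetical; the real integer values and logic live in models.py and matching.py.

```python
import re
from enum import IntEnum


class MatchingAlgorithm(IntEnum):
    # Hypothetical codes for illustration; check models.py for the real values.
    ANY_WORD = 1
    ALL_WORDS = 2
    REGEX = 3


def matches(match: str, algorithm: MatchingAlgorithm, content: str) -> bool:
    """Return True if the rule (match, algorithm) applies to the document text."""
    if not match.strip():
        return False
    if algorithm == MatchingAlgorithm.REGEX:
        return re.search(match, content, flags=re.IGNORECASE) is not None
    # Whole-word, case-insensitive test for each word in the rule.
    found = [
        re.search(rf"\b{re.escape(word)}\b", content, flags=re.IGNORECASE) is not None
        for word in match.split()
    ]
    if algorithm == MatchingAlgorithm.ANY_WORD:
        return any(found)
    if algorithm == MatchingAlgorithm.ALL_WORDS:
        return all(found)
    raise ValueError(f"unsupported algorithm: {algorithm}")
```

For example, the rule `"invoice bill"` with `ANY_WORD` applies to a document containing "Your Invoice for March", while the same rule with `ALL_WORDS` does not.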
⚡ Performance Tips
Backend
from documents.models import Document

# ✅ Good: Use select_related for ForeignKey
documents = Document.objects.select_related(
    'correspondent', 'document_type'
).all()
# ✅ Good: Use prefetch_related for ManyToMany
documents = Document.objects.prefetch_related(
    'tags', 'custom_fields'
).all()
# ❌ Bad: N+1 queries
for doc in Document.objects.all():
    print(doc.correspondent.name)  # Extra query each time!
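To see why the N+1 pattern hurts, the sketch below counts SQL statements with the standard-library `sqlite3` module. It stands in for what the ORM does under the hood; the two-table schema and the `CountingDB` wrapper are made up for this illustration and are not the project's real tables.

```python
import sqlite3


class CountingDB:
    """Tiny wrapper that counts every statement it runs (illustrative only)."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(
            """
            CREATE TABLE correspondent (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE document (
                id INTEGER PRIMARY KEY,
                title TEXT,
                correspondent_id INTEGER REFERENCES correspondent(id)
            );
            INSERT INTO correspondent VALUES (1, 'ACME'), (2, 'Globex');
            INSERT INTO document VALUES (1, 'a', 1), (2, 'b', 2), (3, 'c', 1);
            """
        )
        self.count = 0

    def run(self, sql: str, params: tuple = ()) -> list:
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def names_n_plus_one(db: CountingDB) -> list:
    # One query for the documents, then one more per document.
    names = []
    for _doc_id, corr_id in db.run(
        "SELECT id, correspondent_id FROM document ORDER BY id"
    ):
        rows = db.run("SELECT name FROM correspondent WHERE id = ?", (corr_id,))
        names.append(rows[0][0])
    return names


def names_joined(db: CountingDB) -> list:
    # A single JOIN, which is roughly what select_related produces.
    rows = db.run(
        "SELECT c.name FROM document d "
        "JOIN correspondent c ON c.id = d.correspondent_id ORDER BY d.id"
    )
    return [row[0] for row in rows]
```

With three documents, the first function issues four statements and the second issues one; the gap grows linearly with the number of rows.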
Caching
from django.core.cache import cache

# Cache expensive operations
def get_document_stats():
    stats = cache.get('document_stats')
    if stats is None:
        # calculate_stats() is a placeholder for your own expensive query
        stats = calculate_stats()
        cache.set('document_stats', stats, 3600)  # keep for one hour
    return stats
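The get-then-set pattern above is often called cache-aside. The sketch below reproduces it without Django so the expiry behaviour can be tried in isolation. `TTLCache` and `get_or_compute` are hypothetical helpers written for this example, not part of Django or IntelliDocs-ngx.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Minimal in-process cache with per-key expiry (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return default
        return value

    def set(self, key: str, value: Any, timeout: float) -> None:
        self._store[key] = (value, self._clock() + timeout)


def get_or_compute(
    cache: TTLCache, key: str, compute: Callable[[], Any], timeout: float
) -> Any:
    """Return the cached value, computing and storing it on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value
```

In a Django project the same pattern is available as `cache.get_or_set(key, default, timeout)`, so a hand-rolled helper is rarely needed there.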
Database Indexes
# Add indexes in migrations (goes in the migration's operations list)
from django.db import migrations, models

migrations.AddIndex(
    model_name='document',
    index=models.Index(
        fields=['correspondent', 'created'],
        name='doc_corr_created_idx'
    )
)
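Whether an index is actually used can be checked from the query plan. The sketch below does this with the standard-library `sqlite3` module on a made-up `document` table; on PostgreSQL or MySQL the equivalent check is that database's own `EXPLAIN`, and the output format differs.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return SQLite's plan for a query as one string (detail is column 3)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table for the demonstration.
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, title TEXT, correspondent_id INTEGER, created TEXT)"
)

QUERY = "SELECT title FROM document WHERE correspondent_id = 1 ORDER BY created"

plan_before = query_plan(conn, QUERY)

# Same column pair and index name as the migration above.
conn.execute(
    "CREATE INDEX doc_corr_created_idx ON document (correspondent_id, created)"
)

plan_after = query_plan(conn, QUERY)
```

Before the index exists, the plan is a full table scan; afterwards the plan names `doc_corr_created_idx`, which shows the filter is served from the index.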
🔒 Security Checklist
- Validate all user inputs
- Use parameterized queries (Django ORM does this)
- Check permissions on all endpoints
- Implement rate limiting
- Add security headers
- Enable HTTPS
- Use strong password hashing
- Implement CSRF protection
- Sanitize file uploads
- Regular dependency updates
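As one concrete instance of the "Sanitize file uploads" item, the sketch below normalizes an uploaded filename and enforces an extension allowlist. The function name and the allowlist are hypothetical; a real check would typically also inspect the file contents rather than trust the name alone.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical allowlist for illustration; adjust to the formats you accept.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".txt"}


def sanitize_upload_name(raw_name: str) -> str:
    """Strip directories and unsafe characters, then check the extension."""
    # PureWindowsPath treats both "/" and "\" as separators, so this drops
    # any directory part regardless of the client's platform.
    name = PureWindowsPath(raw_name).name
    # Replace anything outside a conservative character set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Leading dots would create hidden files.
    name = name.lstrip(".")
    if not name:
        raise ValueError("empty filename")
    extension = PurePosixPath(name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {extension or '(none)'}")
    return name
```

A path-traversal attempt such as `../../etc/passwd.pdf` is reduced to `passwd.pdf`, and a name like `shell.php` is rejected with `ValueError`.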
🐛 Debugging Tips
Backend
# Add logging
import logging
logger = logging.getLogger(__name__)
def my_function():
    logger.debug("Debug information")
    logger.info("Important event")
    logger.error("Something went wrong")
# Django shell
python manage.py shell
>>> from documents.models import Document
>>> Document.objects.count()
# Run tests
python manage.py test documents
Frontend
// Console logging
console.log('Debug:', someVariable);
console.error('Error:', error);
// Angular DevTools
// Install Chrome extension for debugging
// Check network requests
// Use browser DevTools Network tab
Celery Tasks
# View running tasks
celery -A paperless inspect active
# View scheduled tasks
celery -A paperless inspect scheduled
# Purge queue
celery -A paperless purge
📦 Common Commands
Development
# Start development server
python manage.py runserver
# Start Celery worker
celery -A paperless worker -l INFO
# Run migrations
python manage.py migrate
# Create superuser
python manage.py createsuperuser
# Start frontend dev server
cd src-ui && ng serve
Testing
# Run backend tests
python manage.py test
# Run frontend tests
cd src-ui && npm test
# Run specific test
python manage.py test documents.tests.test_consumer
Production
# Collect static files
python manage.py collectstatic
# Check deployment
python manage.py check --deploy
# Start with Gunicorn
gunicorn paperless.wsgi:application
🔍 Troubleshooting
Document not consuming
- Check file permissions
- Check Celery is running
- Check logs: docker logs paperless-worker
- Verify OCR languages installed
Search not working
- Rebuild index: python manage.py document_index reindex
- Check Whoosh index permissions
- Verify search settings
Classification not accurate
- Train classifier: python manage.py document_classifier train
- Need 50+ documents per category
- Check matching rules
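The "50+ documents per category" guideline is easy to check before retraining. The sketch below counts labels and reports the categories that fall short. The function name is hypothetical, and the labels would come from your own query (for example, the document type name of each document).

```python
from collections import Counter
from typing import Dict, Iterable

MIN_DOCS_PER_CATEGORY = 50  # threshold taken from the guideline above


def undertrained_categories(
    labels: Iterable[str], minimum: int = MIN_DOCS_PER_CATEGORY
) -> Dict[str, int]:
    """Map each category with fewer than `minimum` documents to its count."""
    counts = Counter(labels)
    return {
        category: count
        for category, count in sorted(counts.items())
        if count < minimum
    }
```

An empty result means every category that appears in the data meets the threshold; categories with zero documents never appear in the input and have to be checked separately.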
Frontend not loading
- Check CORS settings
- Verify API_URL configuration
- Check browser console for errors
- Clear browser cache
📈 Monitoring
Key Metrics to Track
- Document processing rate (docs/minute)
- API response time (ms)
- Search query time (ms)
- Celery queue length
- Database query count
- Storage usage (GB)
- Error rate (%)
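Two of these metrics, response-time percentiles and error rate, can be computed from plain request logs. The sketch below is one way to do it; the helper names are hypothetical, it uses the nearest-rank percentile definition, and it counts only 5xx responses as errors, which is an assumption you may want to change.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of responses with a 5xx status, in percent."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * errors / len(status_codes)
```

Tracking a high percentile such as p95 alongside the average makes slow outliers visible that an average alone would hide.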
Health Checks
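# Add to views.py
from django.http import JsonResponse

# check_database(), check_celery(), check_redis() and check_storage()
# are placeholders for your own probe functions.
def health_check(request):
    checks = {
        'database': check_database(),
        'celery': check_celery(),
        'redis': check_redis(),
        'storage': check_storage(),
    }
    return JsonResponse(checks)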
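A health endpoint is most useful when one failing probe cannot crash the whole view and when the overall status is machine-readable. The sketch below shows one way to aggregate probes; `run_health_checks` and the result layout are hypothetical, and the probes are passed in as plain callables so the logic can be tested without Django.

```python
from typing import Any, Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, Any]:
    """Run each probe, isolating failures, and summarize the outcome."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a broken probe must not take the view down
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    return {
        "status": "ok" if healthy else "degraded",
        # 503 lets load balancers and uptime monitors react without parsing JSON.
        "http_status": 200 if healthy else 503,
        "checks": results,
    }
```

In a Django view the summary could be returned with `JsonResponse(summary, status=summary["http_status"])`.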
🎓 Learning Resources
Python/Django
- Django Docs: https://docs.djangoproject.com/
- Celery Docs: https://docs.celeryproject.org/
- Django REST Framework: https://www.django-rest-framework.org/
Frontend
- Angular Docs: https://angular.io/docs
- TypeScript: https://www.typescriptlang.org/docs/
- RxJS: https://rxjs.dev/
Machine Learning
- scikit-learn: https://scikit-learn.org/
- Transformers: https://huggingface.co/docs/transformers/
OCR
- Tesseract: https://github.com/tesseract-ocr/tesseract
- Apache Tika: https://tika.apache.org/
🚀 Quick Improvements
5-Minute Fixes
- Add database index: roughly 3x faster queries on the indexed columns (estimate)
- Enable gzip compression: roughly 50% faster transfers for text responses (estimate)
- Add security headers: Better security score
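How much gzip helps depends on the payload. The sketch below measures the saving on a JSON body with the standard-library `gzip` module; the sample payload is invented, so the ratio it yields says nothing about real IntelliDocs-ngx responses.

```python
import gzip
import json


def compression_savings(payload: bytes) -> float:
    """Fraction of bytes saved by gzip, e.g. 0.8 means 80% smaller."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload)
    return 1 - len(compressed) / len(payload)


# Invented sample resembling a list response from a JSON API.
sample = json.dumps(
    [{"id": i, "title": f"Invoice {i}", "tags": [1, 3, 7]} for i in range(200)]
).encode("utf-8")

savings = compression_savings(sample)
```

In Django, response compression is usually switched on by adding `django.middleware.gzip.GZipMiddleware` to `MIDDLEWARE`, or by enabling it in the reverse proxy.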
1-Hour Improvements
- Implement Redis caching: roughly 2x faster API responses (estimate)
- Add lazy loading: roughly 50% faster page load (estimate)
- Optimize images: Smaller bundle size
1-Day Projects
- Frontend code splitting: Better performance
- Add API rate limiting: DoS protection
- Implement proper logging: Better debugging
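For the rate-limiting item, a token bucket is one common approach. The sketch below is a single-process illustration with a hypothetical `TokenBucket` class. A real deployment would keep the counters in shared storage such as Redis, or use the throttling classes that Django REST Framework provides.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

One bucket per client (for example, per API token or IP address) is the usual arrangement; a request for which `allow()` returns False would be answered with HTTP 429.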
1-Week Projects
- Database optimization: 5-10x faster queries (estimate)
- Improve classification: about 20% higher accuracy (estimate)
- Add mobile-responsive layout: Better mobile UX
💡 Best Practices
Code Style
# ✅ Good
def process_document(document_id: int) -> Document:
    """Process a document and return the result.

    Args:
        document_id: ID of document to process

    Returns:
        Processed document instance
    """
    document = Document.objects.get(id=document_id)
    # ... processing logic
    return document

# ❌ Bad
def proc(d):
    x = Document.objects.get(id=d)
    return x
Error Handling
# ✅ Good
try:
    document = Document.objects.get(id=doc_id)
except Document.DoesNotExist:
    logger.error(f"Document {doc_id} not found")
    raise Http404("Document not found")
except Exception:
    logger.exception("Unexpected error")  # logs the traceback
    raise

# ❌ Bad
try:
    document = Document.objects.get(id=doc_id)
except:
    pass  # Silent failure!
Testing
# ✅ Good: Test important functionality
from django.test import TestCase

from documents.consumer import Consumer
from documents.models import Document

class DocumentConsumerTest(TestCase):
    def test_consume_pdf(self):
        consumer = Consumer()
        doc_id = consumer.try_consume_file('/path/to/test.pdf')
        document = Document.objects.get(id=doc_id)
        self.assertIsNotNone(document.content)
        self.assertEqual(document.title, 'test')
📞 Getting Help
Documentation Files
- DOCS_README.md - Start here
- EXECUTIVE_SUMMARY.md - High-level overview
- DOCUMENTATION_ANALYSIS.md - Detailed analysis
- TECHNICAL_FUNCTIONS_GUIDE.md - Function reference
- IMPROVEMENT_ROADMAP.md - Implementation guide
- QUICK_REFERENCE.md - This file!
When Stuck
- Check this quick reference
- Review function documentation
- Look at test files for examples
- Check Django/Angular docs
- Review original Paperless-ngx docs
✅ Pre-deployment Checklist
- All tests passing
- Code coverage > 80%
- Security scan completed
- Performance tests passed
- Documentation updated
- Backup strategy in place
- Monitoring configured
- Error tracking set up
- SSL/HTTPS enabled
- Environment variables configured
- Database optimized
- Static files collected
- Migrations applied
- Health check endpoint working
Last Updated: November 9, 2025
Version: 1.0
IntelliDocs-ngx v2.19.5