mirror of https://github.com/paperless-ngx/paperless-ngx.git (synced 2025-12-08 15:55:31 +01:00)
Enable parallel OCR processing
At the moment, every page in a PDF is processed one at a time with tesseract. Because each page can be processed independently of every other page, the work parallelizes well on multi-core machines. This PR introduces a multiprocessing pool that processes multiple pages simultaneously. The number of pool workers can be set with the environment variable `PAPERLESS_OCR_THREADS`. If the variable is not set, the pool defaults to the number of cores/hyperthreads Python detects for your system.
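The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ocr_page` and `ocr_document` are hypothetical names, and the per-page tesseract call is replaced by a stand-in that only returns a string.

```python
from multiprocessing import Pool


def ocr_page(page_number):
    # Hypothetical stand-in for the real per-page tesseract call.
    return f"text of page {page_number}"


def ocr_document(page_numbers, workers=None):
    # workers=None lets the pool choose its own default size,
    # which is based on the CPU count Python detects.
    with Pool(processes=workers) as pool:
        # map() returns results in input order, so the page
        # sequence is preserved even if pages finish out of order.
        return pool.map(ocr_page, page_numbers)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # on platforms that start workers by re-importing the module.
    print(ocr_document(range(1, 4), workers=2))
```

Because the pool is made of processes rather than threads, the function handed to `map()` has to be picklable, which is why `ocr_page` is defined at module level in this sketch.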
parent 6b0a537bff
commit f5beda9c56
2 changed files with 18 additions and 5 deletions
@@ -144,6 +144,9 @@ MEDIA_URL = "/media/"
 # documents. It should be a 3-letter language code consistent with ISO 639.
 OCR_LANGUAGE = "eng"
 
+# The amount of threads to use for OCR
+OCR_THREADS = os.environ.get("PAPERLESS_OCR_THREADS")
+
 # If this is true, any failed attempts to OCR a PDF will result in the PDF being
 # indexed anyway, with whatever we could get. If it's False, the file will
 # simply be left in the CONSUMPTION_DIR.
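Note that `os.environ.get` returns either a string or `None`, while a multiprocessing pool expects an integer worker count (or `None` for its default). The code that consumes `OCR_THREADS` is not shown in this hunk, so the following is only a sketch of one way the value could be normalized. `resolve_ocr_workers` and `OCR_WORKERS` are hypothetical names, not part of the commit.

```python
import os


def resolve_ocr_workers(raw):
    # Unset or empty: return None so the pool falls back to its
    # own default, which is based on the detected CPU count.
    if raw is None or raw.strip() == "":
        return None
    workers = int(raw)  # raises ValueError for non-numeric input
    if workers < 1:
        raise ValueError("PAPERLESS_OCR_THREADS must be a positive integer")
    return workers


# Assumed usage: normalize the raw setting once, then pass it to Pool(processes=...).
OCR_WORKERS = resolve_ocr_workers(os.environ.get("PAPERLESS_OCR_THREADS"))
```

Validating at this point means a typo in the environment variable fails at startup instead of in the middle of consuming a document.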