Metadata-Version: 2.1
Name: tesserocr
Version: 2.5.0
Summary: A simple, Pillow-friendly, Python wrapper around tesseract-ocr API using Cython
Home-page: https://github.com/sirfz/tesserocr
Author: Fayez Zouheiry
Author-email: iamfayez@gmail.com
License: MIT
Description: =========
        tesserocr
        =========
        
        A simple, |Pillow|_-friendly,
        wrapper around the ``tesseract-ocr`` API for Optical Character Recognition
        (OCR).
        
        .. image:: https://travis-ci.org/sirfz/tesserocr.svg?branch=master
            :target: https://travis-ci.org/sirfz/tesserocr
            :alt: TravisCI build status
        
        .. image:: https://img.shields.io/pypi/v/tesserocr.svg?maxAge=2592000
            :target: https://pypi.python.org/pypi/tesserocr
            :alt: Latest version on PyPi
        
        .. image:: https://img.shields.io/pypi/pyversions/tesserocr.svg?maxAge=2592000
            :alt: Supported python versions
        
        **tesserocr** integrates directly with Tesseract's C++ API using Cython
        which allows for a simple Pythonic and easy-to-read source code. It
        enables real concurrent execution when used with Python's ``threading``
        module by releasing the GIL while processing an image in tesseract.
        
        **tesserocr** is designed to be |Pillow|_-friendly but can also be used
        with image files instead.
        
        .. |Pillow| replace:: ``Pillow``
        .. _Pillow: http://python-pillow.github.io/
        
        Requirements
        ============
        
        Requires libtesseract (>=3.04) and libleptonica (>=1.71).
        
        On Debian/Ubuntu:
        
        ::
        
            $ apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
        
        You may need to `manually compile tesseract`_ for a more recent version. Note that you may need
        to update your ``LD_LIBRARY_PATH`` environment variable to point to the right library versions in
        case you have multiple tesseract/leptonica installations.
        
        |Cython|_ (>=0.23) is required for building and optionally |Pillow|_ to support ``PIL.Image`` objects.
        
        .. _manually compile tesseract: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
        .. |Cython| replace:: ``Cython``
        .. _Cython: http://cython.org/
        
        Installation
        ============
        Linux and BSD/MacOS
        -------------------
        ::
        
            $ pip install tesserocr
        
        The setup script attempts to detect the include/library dirs (via |pkg-config|_ if available) but you
        can override them with your own parameters, e.g.:
        
        ::
        
            $ CPPFLAGS=-I/usr/local/include pip install tesserocr
        
        or
        
        ::
        
            $ python setup.py build_ext -I/usr/local/include
        
        Tested on Linux and BSD/MacOS
        
        .. |pkg-config| replace:: **pkg-config**
        .. _pkg-config: https://pkgconfig.freedesktop.org/
        
        Windows
        -------
        
        The proposed downloads consist of stand-alone packages containing all the Windows libraries needed for execution. This means that no additional installation of tesseract is required on your system.
        
        The recommended method of installation is via Conda as described below.
        
        Conda
        `````
        
        You can use the `conda-forge <https://anaconda.org/conda-forge/tesserocr>`_ channel to install from Conda:
        
        ::
        
            > conda install -c conda-forge tesserocr
        
        pip
        ```
        
        Download the wheel file corresponding to your Windows platform and Python installation from `simonflueckiger/tesserocr-windows_build/releases <https://github.com/simonflueckiger/tesserocr-windows_build/releases>`_ and install them via:
        
        ::
        
            > pip install <package_name>.whl
        
        Usage
        =====
        
        Initialize and re-use the tesseract API instance to score multiple
        images:
        
        .. code:: python
        
            from tesserocr import PyTessBaseAPI
        
            images = ['sample.jpg', 'sample2.jpg', 'sample3.jpg']
        
            with PyTessBaseAPI() as api:
                for img in images:
                    api.SetImageFile(img)
                    print(api.GetUTF8Text())
                    print(api.AllWordConfidences())
            # api is automatically finalized when used in a with-statement (context manager).
            # otherwise api.End() should be explicitly called when it's no longer needed.
        
        ``PyTessBaseAPI`` exposes several tesseract API methods. Make sure you
        read their docstrings for more info.
        
        Basic example using available helper functions:
        
        .. code:: python
        
            import tesserocr
            from PIL import Image
        
            print(tesserocr.tesseract_version())  # print tesseract-ocr version
            print(tesserocr.get_languages())  # prints tessdata path and list of available languages
        
            image = Image.open('sample.jpg')
            print(tesserocr.image_to_text(image))  # print ocr text from image
            # or
            print(tesserocr.file_to_text('sample.jpg'))
        
        ``image_to_text`` and ``file_to_text`` can be used with ``threading`` to
        concurrently process multiple images which is highly efficient.
        
        Advanced API Examples
        ---------------------
        
        GetComponentImages example:
        ```````````````````````````
        
        .. code:: python
        
            from PIL import Image
            from tesserocr import PyTessBaseAPI, RIL
        
            image = Image.open('/usr/src/tesseract/testing/phototest.tif')
            with PyTessBaseAPI() as api:
                api.SetImage(image)
                boxes = api.GetComponentImages(RIL.TEXTLINE, True)
                print('Found {} textline image components.'.format(len(boxes)))
                for i, (im, box, _, _) in enumerate(boxes):
                    # im is a PIL image object
                    # box is a dict with x, y, w and h keys
                    api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
                    ocrResult = api.GetUTF8Text()
                    conf = api.MeanTextConf()
                    print(u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                          "confidence: {1}, text: {2}".format(i, conf, ocrResult, **box))
        
        Orientation and script detection (OSD):
        ```````````````````````````````````````
        
        .. code:: python
        
            from PIL import Image
            from tesserocr import PyTessBaseAPI, PSM
        
            with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
                image = Image.open("/usr/src/tesseract/testing/eurotext.tif")
                api.SetImage(image)
                api.Recognize()
        
                it = api.AnalyseLayout()
                orientation, direction, order, deskew_angle = it.Orientation()
                print("Orientation: {:d}".format(orientation))
                print("WritingDirection: {:d}".format(direction))
                print("TextlineOrder: {:d}".format(order))
                print("Deskew angle: {:.4f}".format(deskew_angle))
        
        or more simply with ``OSD_ONLY`` page segmentation mode:
        
        .. code:: python
        
            from tesserocr import PyTessBaseAPI, PSM
        
            with PyTessBaseAPI(psm=PSM.OSD_ONLY) as api:
                api.SetImageFile("/usr/src/tesseract/testing/eurotext.tif")
        
                os = api.DetectOS()
                print("Orientation: {orientation}\nOrientation confidence: {oconfidence}\n"
                      "Script: {script}\nScript confidence: {sconfidence}".format(**os))
        
        more human-readable info with tesseract 4+ (demonstrates LSTM engine usage):
        
        .. code:: python
        
            from tesserocr import PyTessBaseAPI, PSM, OEM
        
            with PyTessBaseAPI(psm=PSM.OSD_ONLY, oem=OEM.LSTM_ONLY) as api:
                api.SetImageFile("/usr/src/tesseract/testing/eurotext.tif")
        
                os = api.DetectOrientationScript()
                print("Orientation: {orient_deg}\nOrientation confidence: {orient_conf}\n"
                      "Script: {script_name}\nScript confidence: {script_conf}".format(**os))
        
        Iterator over the classifier choices for a single symbol:
        `````````````````````````````````````````````````````````
        
        .. code:: python
        
            from __future__ import print_function
        
            from tesserocr import PyTessBaseAPI, RIL, iterate_level
        
            with PyTessBaseAPI() as api:
                api.SetImageFile('/usr/src/tesseract/testing/phototest.tif')
                api.SetVariable("save_blob_choices", "T")
                api.SetRectangle(37, 228, 548, 31)
                api.Recognize()
        
                ri = api.GetIterator()
                level = RIL.SYMBOL
                for r in iterate_level(ri, level):
                    symbol = r.GetUTF8Text(level)  # r == ri
                    conf = r.Confidence(level)
                    if symbol:
                        print(u'symbol {}, conf: {}'.format(symbol, conf), end='')
                    indent = False
                    ci = r.GetChoiceIterator()
                    for c in ci:
                        if indent:
                            print('\t\t ', end='')
                        print('\t- ', end='')
                        choice = c.GetUTF8Text()  # c == ci
                        print(u'{} conf: {}'.format(choice, c.Confidence()))
                        indent = True
                    print('---------------------------------------------')
        
Keywords: Tesseract,tesseract-ocr,OCR,optical character recognition,PIL,Pillow,Cython
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Cython
Description-Content-Type: text/x-rst
