Introduction

The MorphoCut library can be used to process thousands of images almost as if you were processing a single image. It was created out of the need to process large collections of images, but it can handle other data types as well.

MorphoCut is data-type agnostic, modular, and easily parallelizable.

Writing a MorphoCut program

First, a Pipeline is defined that contains all operations that should be carried out on the objects of the stream. These operations are then applied to a whole stream of images.
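
A minimal sketch of this define-then-run pattern (Pipeline, Unpack and Call also appear in the full example below):

from morphocut import Call, Pipeline
from morphocut.stream import Unpack

# Definition: the operations are only recorded here,
# nothing is executed yet.
with Pipeline() as p:
    value = Unpack([1, 2, 3])
    Call(print, value)

# Execution: a stream of objects is created and every
# operation is applied to each of them (printing 1, 2, 3).
p.run()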

MorphoCut allows concise definitions of heavily nested image processing pipelines:

import os.path

from morphocut import Call, Pipeline
from morphocut.contrib.ecotaxa import EcotaxaWriter
from morphocut.contrib.zooprocess import CalculateZooProcessFeatures
from morphocut.file import Glob
from morphocut.image import FindRegions, ImageReader
from morphocut.parallel import ParallelPipeline
from morphocut.str import Format
from morphocut.stream import Enumerate, Unpack

# First, a Pipeline is defined that contains all operations
# that should be carried out on the objects of the stream.
with Pipeline() as p:
    # Corresponds to `for base_path in ["/path/a", "/path/b", "/path/c"]:`
    base_path = Unpack(["/path/a", "/path/b", "/path/c"])

    # Number the objects in the stream
    running_number = Enumerate()

    # Call calls regular Python functions.
    # Here, a subpath is appended to base_path.
    pattern = Call(os.path.join, base_path, "subpath/to/input/files/*.jpg")

    # Corresponds to `for path in glob(pattern):`
    path = Glob(pattern)

    # Remove path and extension from the filename
    source_basename = Call(lambda x: os.path.splitext(os.path.basename(x))[0], path)

    with ParallelPipeline():
        # The following operations are distributed among multiple
        # worker processes to speed up the calculations.

        # Read the image
        image = ImageReader(path)

        # Do some thresholding
        mask = image < 128

        # Find regions in the image
        region = FindRegions(mask, image)

        # Extract just the object
        roi_image = region.intensity_image

        # An object is identified by its label
        roi_label = region.label

        # Calculate a filename for the ROI image:
        # "RUNNING_NUMBER-SOURCE_BASENAME-ROI_LABEL"
        roi_name = Format(
            "{:d}-{}-{:d}.jpg", running_number, source_basename, roi_label
        )

        meta = CalculateZooProcessFeatures(region, prefix="object_")
        # End of parallel execution

    # Store results
    EcotaxaWriter("archive.zip", (roi_name, roi_image), meta)

# After the Pipeline was defined, it can be executed.
# A stream is created and transformed by the operations
# defined in the Pipeline.
p.run()

While the pipeline is being created, everything is just a placeholder. In this step, the actions that should be performed are only recorded, not yet applied. The Nodes, therefore, don’t return real values, but identifiers (Variables) for the values that will later flow through the stream.
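
The following sketch makes this visible: inside the with block, a Node returns a Variable, not actual data.

from morphocut import Pipeline
from morphocut.stream import Unpack

with Pipeline() as p:
    value = Unpack([1, 2, 3])

# value is a placeholder Variable, not one of the integers;
# the real values only exist while the stream is being processed.
print(type(value))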

Concepts

An operation in the Pipeline is called a “Node”. It usually returns one or more Variables.
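
Variables can also be combined using ordinary Python operators, which again yields a Variable; this is what the thresholding step (mask = image < 128) in the example above does. A minimal sketch:

from morphocut import Pipeline
from morphocut.stream import Unpack

with Pipeline() as p:
    x = Unpack([1, 50, 200])
    # The comparison returns a new Variable that is evaluated
    # for each object when the pipeline runs.
    below = x < 128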

These are the Nodes used in this example:

Unpack (Stream)

Unpack values from a collection into the stream.

Enumerate

Enumerate objects in the Stream.

Call

Call a function with the supplied parameters.

Glob (Stream)

Find files matching pathname.

ParallelPipeline

Parallel processing of the stream in multiple processes.

ImageReader

Read and open the image from a given path.

FindRegions (Stream)

Find regions in a mask and calculate properties.

Format

Format strings using str.format().

CalculateZooProcessFeatures

Calculate descriptive features similar to ZooProcess using skimage.measure.regionprops().

EcotaxaWriter

Create an archive of images and metadata that is importable to EcoTaxa.

Note

Nodes that change the stream are marked with “(Stream)”.

Unpack, Glob, and FindRegions all introduce new objects into the stream. Traditionally, this would be written using nested for-loops, as in the sketch below.
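
A rough loop-based equivalent of the example above could look like this (read_image and find_regions are hypothetical placeholders; parallelization, feature calculation and the EcoTaxa export are omitted):

import os.path
from glob import glob

for running_number, base_path in enumerate(["/path/a", "/path/b", "/path/c"]):
    pattern = os.path.join(base_path, "subpath/to/input/files/*.jpg")
    for path in glob(pattern):
        source_basename = os.path.splitext(os.path.basename(path))[0]
        image = read_image(path)  # hypothetical: load the image as an array
        mask = image < 128
        for region in find_regions(mask, image):  # hypothetical helper
            roi_image = region.intensity_image
            roi_name = "{:d}-{}-{:d}.jpg".format(
                running_number, source_basename, region.label
            )
            # ... calculate features and write roi_image to the archive ...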

MorphoCut, on the other hand, applies a sequence of processing steps (Nodes), which allows for easy parallelization and nicely decouples the individual steps of the pipeline.