Quickstart

Minimal pipeline: download a CSV and filter it

If you installed from source (for example from GitHub), you can test your installation by invoking pytest. The minimal pipeline below consists of three steps:

  1. Describe an input dataset with a UFile (here: a CSV hosted over HTTP).

  2. Configure a node run with a URunDict.

  3. Execute a node (here: FilterTabularToCSV) to produce a new output file.

The same principles apply to larger pipelines.

import pandas as pd

import urgap

# 1) Define a remote input file
#
# UFile is Urgap's data abstraction: it stores “where” the data lives (URI),
# *plus* metadata (uftype, tags, hashes, …).
#
# When a node needs the local path (or when you access `ufile.path` yourself),
# Urgap downloads the data into a local scratch area automatically.
csv = urgap.UFile(
    uri=(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master"
        f"?uftype={urgap.uftypes.any.CSV}#iris.csv"
    )
)

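# 2) Choose where the results are written
#
# This directory becomes the `storage_base_uri` of the run configured below.
out_dir = "/some/local/path_where_you_want_the_results"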

# 3) Configure and run a node
#
# A URunDict holds *execution parameters*.
# - `parameters` are node-specific CLI parameters.
# - `unode_parameters.storage_base_uri` is the common output base location.
#
# FilterTabularToCSV:1.0.0 uses a pandas query string.
urun_dict = urgap.URunDict(
    {
        "parameters": {
            "FilterTabularToCSV:1.0.0": {
                "-q": "`species` == 'setosa'",
            },
        },
        "unode_parameters": {
            "storage_base_uri": f"file://{out_dir}",
        },
    }
)

filter_node = urgap.init_unode("FilterTabularToCSV:1.0.0")
filtered = filter_node.run(ufiles=urgap.UFileList([csv]), urun_dict=urun_dict)

# 4) Inspect the result
#
# `filtered` is a UFileList. The output file is again represented as a UFile.
print(filtered)
print("Output path:", filtered[0].path)

df = pd.read_csv(filtered[0].path)
print(df.head())
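
The two URIs used in the listing above can be taken apart with the Python standard library alone. The following sketch only shows how generic URL parsing splits them. How Urgap itself interprets the query and the fragment is not confirmed here, and the uftype value `any.csv` is a made-up placeholder for whatever `urgap.uftypes.any.CSV` evaluates to.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder only: the real value of urgap.uftypes.any.CSV is not shown in this guide.
PLACEHOLDER_UFTYPE = "any.csv"

input_uri = (
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master"
    f"?uftype={PLACEHOLDER_UFTYPE}#iris.csv"
)

# Generic URL parsing: base location, query (type hint) and fragment (file name)
in_parts = urlsplit(input_uri)
base_location = f"{in_parts.scheme}://{in_parts.netloc}{in_parts.path}"
query = parse_qs(in_parts.query)
print("base location:", base_location)
print("query:        ", query)
print("fragment:     ", in_parts.fragment)

# The output base: an absolute POSIX path behind "file://" gives an empty
# host part and the directory as the URI path.
out_dir = "/some/local/path_where_you_want_the_results"
out_parts = urlsplit(f"file://{out_dir}")
print("scheme:       ", out_parts.scheme)
print("path:         ", out_parts.path)
```

Note that `out_dir` has to be an absolute path for the `file://` prefix to produce a URI whose path component equals the directory. With a relative path the first path segment would be parsed as a host name.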

Note

What exactly happened?

  • The input UFile points at an HTTP URL. When the node runs, it resolves UFile.path, which triggers a download into a local scratch file.

  • The node writes its result under storage_base_uri. In this example that’s a local file://... URI, but the same pattern works for other backends.

  • The -q parameter is interpreted by the node executable. In FilterTabularToCSV:1.0.0 it’s a pandas query string.
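
To get a feeling for the query string before running the node, the same expression can be tried on a small DataFrame with pandas directly. This is a minimal sketch: the rows below are invented stand-ins for iris.csv, and it assumes that the node applies the string in the same way as `DataFrame.query`, which is not verified here.

```python
import pandas as pd

# Tiny stand-in for iris.csv; the values are made up for illustration.
df = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "setosa", "virginica"],
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
    }
)

# Backticks quote a column name. They are required only for names that are
# not valid Python identifiers (e.g. names with spaces), but are always allowed.
setosa = df.query("`species` == 'setosa'")

# The equivalent boolean-mask selection, for comparison
same = df[df["species"] == "setosa"]

print(setosa)
```

Both selections keep only the rows whose species column equals 'setosa' and preserve the original row index.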

Executing example scripts

Urgap comes with a set of example scripts, ranging from single-node execution to full pipelines. Have a look in the example_scripts folder to get started with more advanced workflows. Most example scripts come with a click interface; for help, run python scriptXXX.py --help