Quickstart
Minimal pipeline: download a CSV and filter it
If you installed Urgap from source or from GitHub, you can test your installation by invoking pytest.
Describe an input dataset with a UFile (here: a CSV hosted over HTTP).
Configure a node run with a URunDict.
Execute a node (here: FilterTabularToCSV) to produce a new output file.
The same principles are used for larger pipelines.
import pandas as pd
import urgap
# 1) Define a remote input file
#
# UFile is Urgap's data abstraction: it stores “where” the data lives (URI),
# *plus* metadata (uftype, tags, hashes, …).
#
# When a node needs the local path (or when you access `ufile.path` yourself),
# Urgap downloads the data into a local scratch area automatically.
csv = urgap.UFile(
    uri=(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master"
        f"?uftype={urgap.uftypes.any.CSV}#iris.csv"
    )
)
# 2) Choose an output location
#
# Node results are written below this directory.
out_dir = "/some/local/path_where_you_want_the_results"
# 3) Configure and run a node
#
# A URunDict holds *execution parameters*.
# - `parameters` are node-specific CLI parameters.
# - `unode_parameters.storage_base_uri` is the common output base location.
#
# FilterTabularToCSV:1.0.0 uses a pandas query string.
urun_dict = urgap.URunDict(
    {
        "parameters": {
            "FilterTabularToCSV:1.0.0": {
                "-q": "`species` == 'setosa'",
            },
        },
        "unode_parameters": {
            "storage_base_uri": f"file://{out_dir}",
        },
    }
)
filter_node = urgap.init_unode("FilterTabularToCSV:1.0.0")
filtered = filter_node.run(ufiles=urgap.UFileList([csv]), urun_dict=urun_dict)
# 4) Inspect the result
#
# `filtered` is a UFileList. The output file is again represented as a UFile.
print(filtered)
print("Output path:", filtered[0].path)
df = pd.read_csv(filtered[0].path)
print(df.head())
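The example above builds `storage_base_uri` by prefixing an absolute path with `file://`. As a minimal sketch, the Python standard library can do the same conversion and also copes with relative paths and characters that need percent-encoding. The helper name `to_file_uri` is hypothetical and not part of Urgap, and whether Urgap expects percent-encoded URIs is an assumption, so it is worth checking with your own paths.

```python
from pathlib import Path, PurePosixPath


def to_file_uri(path):
    """Return an absolute file:// URI for a local path.

    Hypothetical helper for illustration; not part of Urgap.
    """
    # resolve() makes the path absolute; as_uri() requires an absolute path
    # and percent-encodes characters such as spaces.
    return Path(path).expanduser().resolve().as_uri()


# For a plain absolute POSIX path, the result matches the f-string
# used in the quickstart.
out_dir = "/some/local/path_where_you_want_the_results"
assert PurePosixPath(out_dir).as_uri() == f"file://{out_dir}"

# A relative path is anchored at the current working directory first.
print(to_file_uri("results"))
```

The f-string form is fine for simple absolute paths; a helper like this mainly helps when the output directory is given relative to the working directory.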
Note
What exactly happened?
The input UFile points at an HTTP URL. When the node runs, it resolves UFile.path, which triggers a download into a local scratch file.
The node writes its result under storage_base_uri. In this example that’s a local file://... URI, but the same pattern works for other backends.
The -q parameter is interpreted by the node executable. In FilterTabularToCSV:1.0.0 it’s a pandas query string.
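To get a feel for the `-q` string before running the node, the same expression can be tried directly with `pandas.DataFrame.query`. This is a sketch on a small hand-made table: that the node evaluates the string exactly like `DataFrame.query` is an assumption based on the description above, and the rows below are made up for illustration.

```python
import pandas as pd

# A tiny stand-in for the iris table (values are illustrative only).
df = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "setosa", "virginica"],
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
    }
)

# Backticks quote a column name inside a query string. They are optional for
# a simple name like `species`, but required for names that contain spaces.
query = "`species` == 'setosa'"
filtered = df.query(query)
print(filtered)  # keeps the rows at index 0 and 2

# Conditions can be combined with `and` / `or`.
long_setosa = df.query("`species` == 'setosa' and sepal_length > 5.0")
print(long_setosa)  # keeps only the row at index 0
```

Testing the expression this way on a few rows is usually quicker than rerunning the node when a query does not select what you expect.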
Executing example scripts
Urgap comes with a set of example scripts that range from single-node execution to fully fledged pipelines. Have a look in the example_scripts folder to get started with more advanced workflows. Most example scripts come with a click interface; for help, simply run python scriptXXX.py --help