An Introduction to Using PyBen

This is a small tutorial that is meant to help users get to using PyBen: the Python interface for the binary-ensemble Rust package.

BEN (short for Binary-Ensemble) is a compression algorithm designed for efficient storage and access of ensembles of districting plans, and was designed to work primarily as a companion to the GerrySuite collection of packages (GerryChain, GerryTools, FRCW) and to also be compatible with other ensemble generators (e.g. ForestRecom, Sequential Monte Carlo [SMC]).

When working with an ensemble of plans, there is generally an underlying dual graph, :math:G, on which there is an ordering of nodes :math:(n_1, n_2, \ldots, n_\ell). If we then wish to partition the graph into districts, then the only thing that we need to do is assign each node in the graph a district number. This is what we call the assignment vector for the districting plan. Then to encode an ensemble of districting plans in a JSONL file (short for JSON Lines and it really just means a file with a dictionary on every line), we may format each of the lines in the following way:

{"assignment": <assignment_vector>, "sample": <sample_number_indexed_from_1>}

However, if the graph has a lot of nodes in it and we want to collect millions of samples (as we tend to want to do), then this JSONL format can make for MASSIVE (tens or hundreds of Gb) files. So this is why we have BEN and XBEN (e[X]treme BEN): to make the storage and processing of these millions of plans possible without needing to buy an extra hard drive for every project that you would like to work with.

Setup for the Tutorial

For this tutorial, you will need access to a few files. We are going to go ahead and download them here and then place them in a folder called “example_data”.

from urllib.request import urlopen
from pathlib import Path
import shutil

if Path("example_data").exists():
    shutil.rmtree("example_data")

Path("example_data").mkdir()
def open_and_save(base_url, file_name):
    url = f"{base_url}/{file_name}"
    out_path = f"./example_data/{file_name}"

    chunk = 1024 * 64
    with urlopen(url, timeout=120) as resp, open(out_path, "wb") as f:
        while True:
            buf = resp.read(chunk)
            if not buf:
                break
            f.write(buf)

url_base = "https://raw.githubusercontent.com/peterrrock2/binary-ensemble/main/example"
for file_name in [
    "CO_small.json",
    "small_example.jsonl",
]:
    out_path = f"./example_data/{file_name}"
    if not Path(out_path).exists():
        print(f"Downloading {file_name}...")
        open_and_save(url_base, file_name)
    else:
        print(f"{file_name} already exists, skipping download.")


url_base = "https://github.com/peterrrock2/binary-ensemble/raw/refs/heads/main/example/"
for file_name in [
    "100k_CO_chain.jsonl.xben",
]:
    out_path = f"./example_data/{file_name}"
    if not Path(out_path).exists():
        print(f"Downloading {file_name}...")
        open_and_save(url_base, file_name)
    else:
        print(f"{file_name} already exists, skipping download.")


url_base = "https://raw.githubusercontent.com/mggg/GerryChain/refs/heads/main/docs/_static"
for file_name in [
    "gerrymandria.json",
]:
    out_path = f"./example_data/{file_name}"
    if not Path(out_path).exists():
        print(f"Downloading {file_name}...")
        open_and_save(url_base, file_name)
    else:
        print(f"{file_name} already exists, skipping download.")
Downloading CO_small.json...
Downloading small_example.jsonl...
Downloading 100k_CO_chain.jsonl.xben...
Downloading gerrymandria.json...

Converting between file types

PyBen comes equiped with some utility functions for users who wish to convert between different file types.

from pyben import (
    compress_jsonl_to_ben, compress_jsonl_to_xben, compress_ben_to_xben, decompress_ben_to_jsonl, decompress_xben_to_jsonl, decompress_xben_to_ben
)

BEN compression

The most basic (and quickest) type of compression available is the BEN compression format. You may convert between a standard JSONL file to a BEN file using the following function:

compress_jsonl_to_ben(
    in_file="example_data/small_example.jsonl", 
    out_file="example_data/small_example_jsonl_to_ben.jsonl.ben",
)

As a small note, the above function (and all the conversion functions) has a default behavior of not overwriting output.

try:
    compress_jsonl_to_ben(
        in_file="example_data/small_example.jsonl", 
        out_file="example_data/small_example_jsonl_to_ben.jsonl.ben",
    )
except OSError as e:
    print(f"Found Error: {e}")
Found Error: Output file example_data/small_example_jsonl_to_ben.jsonl.ben already exists (use overwrite=True to replace).

In addition, there is a variant parameter with two options: “standard” and “mkv_chain”. The “mkv_chain” variation is a special version of BEN that is optimized for ensembles generated using an MCMC method with a non-zero rejection probability (so the generated maps may repeat a few times to target an appropriate probability distribution like in Reversible ReCom).

For ensembles without repetition, the output size of the “mkv_chain” variant is very slightly larger than the “standard” variant, but for MCMC chains, the savings can be significant, so “mkv_chain” is set as the default variant.

XBEN Compression

XBEN (short for e[X]treme BEN) is a much more powerful version of our compression. In fact, with some coercing of the data, it is not uncommon to get 1000x compression compared to base JSONL files. However, all of these savings come at a cost: time and compute power. In general, while XBEN is relatively quick to decompress, it can take up to a few hours to compress a large sample. So this format is great for when the user wants to store data long-term, but is less good in an actively changing project.

compress_jsonl_to_xben(
    in_file="example_data/small_example.jsonl", 
    out_file="example_data/small_example_jsonl_to_xben.jsonl.xben",
    overwrite=True,
    variant="mkv_chain",
    n_threads=1,
    compression_level=9,
)

compress_ben_to_xben(
    in_file="example_data/small_example_jsonl_to_ben.jsonl.ben", 
    out_file="example_data/small_example_jsonl_to_ben_to_xben.jsonl.xben",
    overwrite=True,
    n_threads=1,
    compression_level=9,
)

There are now a few new parameters added to the XBEN compression functions: n_threads and compression_level.

  • n_threads: In the interest of actually finishing the compression at a reasonable pace, XBEN has been parallelized to allow the user to take advantage of modern CPUs with higher thread counts. So increasing the number of threads in the parameter will decrease the compression time.

  • compression_level: There are 10 possible compression levels 0 (fastest) - 9 (slowest) (these follow the XZ compression levels). The higher the compression level, the better the compression ratio and the higher the demands on the CPU when compressing the object.

By default, all XBEN compression functions will use all available threads on the machine and will use the highest compression level (9). The XBEN format is only really needed for very large ensemble analysis, and machines running such analysis tend to have the compute power to accommodate these defaults.

Decompression

Insofar as file decompression goes, what you see is what you get. All of the functions have the exact same signature, and should be pretty self-explanatory.

decompress_ben_to_jsonl(
    in_file="example_data/small_example_jsonl_to_ben.jsonl.ben", 
    out_file="example_data/small_example_jsonl_to_ben_to_jsonl.jsonl",
    overwrite=True,
)   

decompress_xben_to_jsonl(
    in_file="example_data/small_example_jsonl_to_xben.jsonl.xben",
    out_file="example_data/small_example_jsonl_to_xben_to_jsonl.jsonl",
    overwrite=True,
)

decompress_xben_to_ben(
    in_file="example_data/small_example_jsonl_to_xben.jsonl.xben",
    out_file="example_data/small_example_jsonl_to_xben_to_ben.jsonl.ben",
    overwrite=True,
)

PyBen and GerryChain

As mentioned before, PyBen was originally designed to work with ensembles generated by programs like GerryChain, and so we will give a small tutorial here.

Note: in the current version of GerryChain (0.3.2), there are some small peculiarities in the way that the Assignment class works that require some care.

Encoding

Working with the PyBen encoder should feel a lot like working with any Python object that handles writing to files. In particular, we will use the context manager pattern to make sure that the file is appropriately opened and closed as we write assignment vectors to it.

from gerrychain import Partition, Graph, MarkovChain, updaters, accept
from gerrychain.proposals import recom
from gerrychain.constraints import contiguous
from functools import partial


graph = Graph.from_json("./example_data/gerrymandria.json")

my_updaters = { "population": updaters.Tally("TOTPOP"), }

initial_partition = Partition(
    graph,
    assignment="district",
    updaters=my_updaters
)

ideal_population = sum(initial_partition["population"].values()) / len(initial_partition)

proposal = partial(
    recom,
    pop_col="TOTPOP",
    pop_target=ideal_population,
    epsilon=0.01,
    node_repeats=2
)

recom_chain = MarkovChain(
    proposal=proposal,
    constraints=[contiguous],
    accept=accept.always_accept,
    initial_state=initial_partition,
    total_steps=10_000
)

Okay, now it is time to write the output into a BEN format. The most important thing that we need to keep track of here is the order of the Assignment returned by GerryChain. In general GerryChain makes no guarantees about the ordering of the nodes in the output assignment, and to write to a BEN file, we MUST make sure that the ordering of the values in the assignment vector lines up with the order of the nodes in the graph.

from pyben import PyBenEncoder

graph_node_order = list(graph.nodes)

with PyBenEncoder("example_data/gerrychain_10000.jsonl.ben", overwrite=True) as encoder:
    for partition in recom_chain.with_progress_bar():
        assignment_series = partition.assignment.to_series()
        # Assignment vectors must be lists of integers
        ordered_assignment = assignment_series.loc[graph_node_order].astype(int).tolist() 
        encoder.write(ordered_assignment)

Decoding

Decoding with PyBen should also feel fairly simple: just iterate over the file and pull out the assignment vector that you would like to work with.

from pyben import PyBenDecoder
import pandas as pd


graph_node_order_series = pd.Index(graph.nodes)

for i, assignment in enumerate(PyBenDecoder("example_data/gerrychain_10000.jsonl.ben")):
    assignment = pd.Series(assignment, index=graph_node_order_series)
    partition = Partition(
        graph,
        assignment=assignment,
        updaters=my_updaters
    )
    if i % 1000 == 0:
        print(f"Sample: {i+1}, Cut Edge Count: {len(partition['cut_edges'])}")
Sample: 1, Cut Edge Count: 56
Sample: 1001, Cut Edge Count: 42
Sample: 2001, Cut Edge Count: 38
Sample: 3001, Cut Edge Count: 43
Sample: 4001, Cut Edge Count: 40
Sample: 5001, Cut Edge Count: 39
Sample: 6001, Cut Edge Count: 38
Sample: 7001, Cut Edge Count: 44
Sample: 8001, Cut Edge Count: 39
Sample: 9001, Cut Edge Count: 38

Subsampling

Often times, when working with ensembles of plans, it is desirable to subsample from the ensemble for the sake of winnowing, and the PyBenDecoder has native support for this.

We’ll work with the “100k_CO_chain.json.xben” file which contains 100k districting plans on Colorado Census blocks (there are ~140k census blocks in Colorado).

# Warning, this BEN file will be ~2Gb
decompress_xben_to_ben(
    in_file="example_data/100k_CO_chain.jsonl.xben",
    out_file="example_data/100k_CO_chain.jsonl.ben",
    overwrite=True,
)
for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.ben").subsample_indices([1, 23978, 100000]):
    print(assignment[:10])
[8, 8, 5, 4, 5, 5, 5, 5, 3, 3]
[1, 1, 6, 5, 6, 6, 6, 6, 2, 2]
[1, 1, 3, 8, 8, 8, 3, 3, 4, 4]
for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.ben").subsample_range(1000,1005):
    print(assignment[:10])
[5, 5, 6, 1, 2, 1, 6, 6, 4, 4]
[5, 5, 1, 1, 2, 1, 1, 1, 4, 4]
[7, 7, 1, 1, 2, 1, 1, 1, 4, 4]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.ben").subsample_every(10000):
    print(assignment[:10])
[8, 8, 5, 4, 5, 5, 5, 5, 3, 3]
[7, 7, 3, 3, 3, 3, 3, 3, 2, 2]
[5, 5, 1, 3, 4, 4, 1, 1, 3, 3]
[5, 5, 3, 8, 3, 3, 3, 6, 1, 1]
[8, 8, 4, 4, 3, 4, 4, 7, 1, 1]
[5, 1, 7, 3, 3, 7, 7, 3, 6, 6]
[4, 4, 1, 3, 1, 1, 1, 1, 5, 5]
[6, 6, 7, 2, 1, 2, 7, 7, 7, 5]
[4, 4, 8, 1, 8, 8, 8, 3, 7, 7]
[1, 1, 5, 5, 5, 5, 5, 5, 6, 6]
[1, 1, 3, 8, 8, 8, 3, 3, 4, 4]

Of course, you can also do subsampling from XBEN, but the extra compression induces a startup cost for accessing anything in the file.

for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_indices([1, 23978, 100000]):
    print(assignment[:10])
/tmp/ipykernel_1525384/229284435.py:1: UserWarning: XBEN may take a second to start decoding.
  for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_indices([1, 23978, 100000]):
[8, 8, 5, 4, 5, 5, 5, 5, 3, 3]
[1, 1, 6, 5, 6, 6, 6, 6, 2, 2]
[1, 1, 3, 8, 8, 8, 3, 3, 4, 4]
for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_range(1000,1005):
    print(assignment[:10])
/tmp/ipykernel_1525384/1010090289.py:1: UserWarning: XBEN may take a second to start decoding.
  for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_range(1000,1005):
[5, 5, 6, 1, 2, 1, 6, 6, 4, 4]
[5, 5, 1, 1, 2, 1, 1, 1, 4, 4]
[7, 7, 1, 1, 2, 1, 1, 1, 4, 4]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
[7, 7, 4, 4, 2, 4, 4, 4, 1, 1]
for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_every(10000):
    print(assignment[:10])
/tmp/ipykernel_1525384/49125867.py:1: UserWarning: XBEN may take a second to start decoding.
  for assignment in PyBenDecoder("example_data/100k_CO_chain.jsonl.xben", mode="xben").subsample_every(10000):
[8, 8, 5, 4, 5, 5, 5, 5, 3, 3]
[7, 7, 3, 3, 3, 3, 3, 3, 2, 2]
[5, 5, 1, 3, 4, 4, 1, 1, 3, 3]
[5, 5, 3, 8, 3, 3, 3, 6, 1, 1]
[8, 8, 4, 4, 3, 4, 4, 7, 1, 1]
[5, 1, 7, 3, 3, 7, 7, 3, 6, 6]
[4, 4, 1, 3, 1, 1, 1, 1, 5, 5]
[6, 6, 7, 2, 1, 2, 7, 7, 7, 5]
[4, 4, 8, 1, 8, 8, 8, 3, 7, 7]
[1, 1, 5, 5, 5, 5, 5, 5, 6, 6]
[1, 1, 3, 8, 8, 8, 3, 3, 4, 4]