{ "cells": [ { "cell_type": "markdown", "id": "c7a7778b", "metadata": {}, "source": [ "# An Introduction to Using PyBen\n", "\n", "This is a small tutorial meant to help users get started with PyBen, the Python interface\n", "for the [binary-ensemble](https://crates.io/crates/binary-ensemble) Rust package.\n", "\n", "BEN (short for Binary-Ensemble) is a compression algorithm designed for efficient storage and\n", "access of ensembles of districting plans. It was designed primarily as a companion to\n", "the GerrySuite collection of packages (GerryChain, GerryTools, FRCW), while also being compatible\n", "with other ensemble generators (e.g. ForestRecom, Sequential Monte Carlo \\[SMC\\]).\n", "\n", "When working with an ensemble of plans, there is generally an underlying dual graph, $G$,\n", "on which there is an ordering of nodes $(n_1, n_2, \\ldots, n_\\ell)$. To\n", "partition the graph into districts, the only thing that we need to do is assign each\n", "node in the graph a district number. The resulting list of district numbers, taken in node order, is what\n", "we call the ***assignment vector*** for the districting plan. To encode an ensemble of districting\n", "plans in a JSONL file (short for JSON Lines, which just means a file with a JSON dictionary on every line),\n", "we may format each of the lines in the following way:\n", "\n", "```\n", "{\"assignment\": , \"sample\": }\n", "```\n", "\n", "However, if the graph has a lot of nodes and we want to collect millions of samples (as we\n", "tend to do), then this JSONL format can produce massive files (tens or hundreds of GB).\n", "This is why we have BEN and XBEN (e\\[X\\]treme BEN): to make the storage and processing of these\n", "millions of plans possible without needing to buy an extra hard drive for every project that you\n", "would like to work with." 
] }, { "cell_type": "markdown", "id": "3fea1b6a", "metadata": {}, "source": [ "## Setup for the Tutorial\n", "\n", "For this tutorial, you will need access to a few files. We will download\n", "them here and place them in a folder called \"example_data\"." ] }, { "cell_type": "code", "execution_count": 1, "id": "1c33cbfb", "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen\n", "from pathlib import Path\n", "import shutil\n", "\n", "# Start from a clean \"example_data\" folder every time the tutorial is run.\n", "if Path(\"example_data\").exists():\n", "    shutil.rmtree(\"example_data\")\n", "\n", "Path(\"example_data\").mkdir()" ] }, { "cell_type": "code", "execution_count": 2, "id": "f8a56fa7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading CO_small.json...\n", "Downloading small_example.jsonl...\n", "Downloading 100k_CO_chain.jsonl.xben...\n", "Downloading gerrymandria.json...\n" ] } ], "source": [ "def open_and_save(base_url, file_name):\n", "    url = f\"{base_url}/{file_name}\"\n", "    out_path = f\"./example_data/{file_name}\"\n", "\n", "    # Stream the response to disk in 64 KiB chunks so large files never sit fully in memory.\n", "    chunk = 1024 * 64\n", "    with urlopen(url, timeout=120) as resp, open(out_path, \"wb\") as f:\n", "        while True:\n", "            buf = resp.read(chunk)\n", "            if not buf:\n", "                break\n", "            f.write(buf)\n", "\n", "url_base = \"https://raw.githubusercontent.com/peterrrock2/binary-ensemble/main/example\"\n", "for file_name in [\n", "    \"CO_small.json\",\n", "    \"small_example.jsonl\",\n", "]:\n", "    out_path = f\"./example_data/{file_name}\"\n", "    if not Path(out_path).exists():\n", "        print(f\"Downloading {file_name}...\")\n", "        open_and_save(url_base, file_name)\n", "    else:\n", "        print(f\"{file_name} already exists, skipping download.\")\n", "\n", "\n", "url_base = \"https://github.com/peterrrock2/binary-ensemble/raw/refs/heads/main/example/\"\n", "for file_name in [\n", "    \"100k_CO_chain.jsonl.xben\",\n", "]:\n", "    out_path = f\"./example_data/{file_name}\"\n", "    if not Path(out_path).exists():\n", "        print(f\"Downloading 
{file_name}...\")\n", "        open_and_save(url_base, file_name)\n", "    else:\n", "        print(f\"{file_name} already exists, skipping download.\")\n", "\n", "\n", "url_base = \"https://raw.githubusercontent.com/mggg/GerryChain/refs/heads/main/docs/_static\"\n", "for file_name in [\n", "    \"gerrymandria.json\",\n", "]:\n", "    out_path = f\"./example_data/{file_name}\"\n", "    if not Path(out_path).exists():\n", "        print(f\"Downloading {file_name}...\")\n", "        open_and_save(url_base, file_name)\n", "    else:\n", "        print(f\"{file_name} already exists, skipping download.\")" ] }, { "cell_type": "markdown", "id": "197dd92d", "metadata": {}, "source": [ "## Converting between file types\n", "\n", "PyBen comes equipped with some utility functions for users who wish to convert between different\n", "file types." ] }, { "cell_type": "code", "execution_count": 3, "id": "9296ca41", "metadata": {}, "outputs": [], "source": [ "from pyben import (\n", "    compress_jsonl_to_ben,\n", "    compress_jsonl_to_xben,\n", "    compress_ben_to_xben,\n", "    decompress_ben_to_jsonl,\n", "    decompress_xben_to_jsonl,\n", "    decompress_xben_to_ben,\n", ")" ] }, { "cell_type": "markdown", "id": "84f7c7f6", "metadata": {}, "source": [ "### BEN compression\n", "\n", "The most basic (and quickest) type of compression available is the BEN compression format. You\n", "may convert a standard JSONL file to a BEN file using the following function:\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "1e1e32b0", "metadata": {}, "outputs": [], "source": [ "compress_jsonl_to_ben(\n", "    in_file=\"example_data/small_example.jsonl\",\n", "    out_file=\"example_data/small_example_jsonl_to_ben.jsonl.ben\",\n", ")" ] }, { "cell_type": "markdown", "id": "60f4ff71", "metadata": {}, "source": [ "As a small note, the above function (like all of the conversion functions) has a default behavior of \n", "not overwriting output. 
" ] }, { "cell_type": "code", "execution_count": 5, "id": "2f1ce280", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found Error: Output file example_data/small_example_jsonl_to_ben.jsonl.ben already exists (use overwrite=True to replace).\n" ] } ], "source": [ "try:\n", " compress_jsonl_to_ben(\n", " in_file=\"example_data/small_example.jsonl\", \n", " out_file=\"example_data/small_example_jsonl_to_ben.jsonl.ben\",\n", " )\n", "except OSError as e:\n", " print(f\"Found Error: {e}\")" ] }, { "cell_type": "markdown", "id": "5d107d57", "metadata": {}, "source": [ "In addition, there is a `variant`\n", "parameter with two options: \"standard\" and \"mkv_chain\". The \"mkv_chain\" variation is a special \n", "version of BEN that is optimized for ensembles generated using an MCMC method with a non-zero \n", "rejection probability (so the generated maps may repeat a few times to target an appropriate \n", "probability distribution like in [Reversible ReCom](https://mggg.org/rrc)).\n", "\n", "For ensembles without repetition, the output size of the \"mkv_chain\" variant is very slightly larger\n", "than the \"standard\" variant, but for MCMC chains, the savings can be significant, so \"mkv_chain\"\n", "is set as the default variant." ] }, { "cell_type": "markdown", "id": "8333a252", "metadata": {}, "source": [ "### XBEN Compression\n", "\n", "XBEN (short for e\\[X\\]treme BEN) is a much more powerful version of our compression. In fact, with\n", "some coercing of the data, it is not uncommon to get 1000x compression compared to base JSONL files.\n", "However, all of these savings come at a cost: time and compute power. In general, while XBEN is \n", "relatively quick to decompress, it can take up to a few hours to compress a large sample. So this\n", "format is great for when the user wants to store data long-term, but is less good in an actively \n", "changing project. 
" ] }, { "cell_type": "code", "execution_count": 6, "id": "81b1f724", "metadata": {}, "outputs": [], "source": [ "compress_jsonl_to_xben(\n", " in_file=\"example_data/small_example.jsonl\", \n", " out_file=\"example_data/small_example_jsonl_to_xben.jsonl.xben\",\n", " overwrite=True,\n", " variant=\"mkv_chain\",\n", " n_threads=1,\n", " compression_level=9,\n", ")\n", "\n", "compress_ben_to_xben(\n", " in_file=\"example_data/small_example_jsonl_to_ben.jsonl.ben\", \n", " out_file=\"example_data/small_example_jsonl_to_ben_to_xben.jsonl.xben\",\n", " overwrite=True,\n", " n_threads=1,\n", " compression_level=9,\n", ")" ] }, { "cell_type": "markdown", "id": "cbfe2361", "metadata": {}, "source": [ "There are now a few new parameters added to the XBEN compression functions: `n_threads` and \n", "`compression_level`. \n", "\n", "- `n_threads`: In the interest of actually finishing the compression at a reasonable \n", "pace, XBEN has been parallelized to allow the user to take advantage of modern CPUs with \n", "higher thread counts. So increasing the number of threads in the parameter will decrease the \n", "compression time. \n", "\n", "- `compression_level`: There are 10 possible compression levels 0 (fastest) - 9 (slowest) (these\n", "follow the XZ compression levels). The higher the compression level, the better the compression \n", "ratio and the higher the demands on the CPU when compressing the object. \n", "\n", "By default, all XBEN compression functions will use all available threads on the machine and will\n", "use the highest compression level (9). The XBEN format is only really needed for very large ensemble \n", "analysis, and machines running such analysis tend to have the compute power to accommodate these\n", "defaults." ] }, { "cell_type": "markdown", "id": "09bb043b", "metadata": {}, "source": [ "### Decompression\n", "\n", "Insofar as file decompression goes, what you see is what you get. 
All of the functions have the \n", "exact same signature, and should be pretty self-explanatory." ] }, { "cell_type": "code", "execution_count": 7, "id": "a4e512b3", "metadata": {}, "outputs": [], "source": [ "decompress_ben_to_jsonl(\n", " in_file=\"example_data/small_example_jsonl_to_ben.jsonl.ben\", \n", " out_file=\"example_data/small_example_jsonl_to_ben_to_jsonl.jsonl\",\n", " overwrite=True,\n", ") \n", "\n", "decompress_xben_to_jsonl(\n", " in_file=\"example_data/small_example_jsonl_to_xben.jsonl.xben\",\n", " out_file=\"example_data/small_example_jsonl_to_xben_to_jsonl.jsonl\",\n", " overwrite=True,\n", ")\n", "\n", "decompress_xben_to_ben(\n", " in_file=\"example_data/small_example_jsonl_to_xben.jsonl.xben\",\n", " out_file=\"example_data/small_example_jsonl_to_xben_to_ben.jsonl.ben\",\n", " overwrite=True,\n", ")" ] }, { "cell_type": "markdown", "id": "157bc601", "metadata": {}, "source": [ "## PyBen and GerryChain\n", "\n", "As mentioned before, PyBen was originally designed to work with ensembles generated by programs\n", "like [GerryChain](https://gerrychain.readthedocs.io), and so we will give a small tutorial here.\n", "\n", "> **Note:** in the current version of GerryChain (0.3.2), there are some small peculiarities in\n", "> the way that the `Assignment` class works that require some care." ] }, { "cell_type": "markdown", "id": "ed52aff8", "metadata": {}, "source": [ "### Encoding\n", "\n", "Working with the PyBen encoder should feel a lot like working with any Python object that handles\n", "writing to files. In particular, we will use the context manager pattern to make sure that the\n", "file is appropriately opened and closed as we write assignment vectors to it." 
] }, { "cell_type": "code", "execution_count": 8, "id": "eb43be57", "metadata": {}, "outputs": [], "source": [ "from gerrychain import Partition, Graph, MarkovChain, updaters, accept\n", "from gerrychain.proposals import recom\n", "from gerrychain.constraints import contiguous\n", "from functools import partial\n", "\n", "\n", "graph = Graph.from_json(\"./example_data/gerrymandria.json\")\n", "\n", "my_updaters = { \"population\": updaters.Tally(\"TOTPOP\"), }\n", "\n", "initial_partition = Partition(\n", " graph,\n", " assignment=\"district\",\n", " updaters=my_updaters\n", ")\n", "\n", "ideal_population = sum(initial_partition[\"population\"].values()) / len(initial_partition)\n", "\n", "proposal = partial(\n", " recom,\n", " pop_col=\"TOTPOP\",\n", " pop_target=ideal_population,\n", " epsilon=0.01,\n", " node_repeats=2\n", ")\n", "\n", "recom_chain = MarkovChain(\n", " proposal=proposal,\n", " constraints=[contiguous],\n", " accept=accept.always_accept,\n", " initial_state=initial_partition,\n", " total_steps=10_000\n", ")" ] }, { "cell_type": "markdown", "id": "a5ef02f8", "metadata": {}, "source": [ "Okay, now it is time to write the output into a BEN format. The most important thing that\n", "we need to keep track of here is the order of the `Assignment` returned by GerryChain. In general\n", "GerryChain makes no guarantees about the ordering of the nodes in the output assignment, and to\n", "write to a BEN file, we MUST make sure that the ordering of the values in the assignment vector\n", "lines up with the order of the nodes in the graph." ] }, { "cell_type": "code", "execution_count": 9, "id": "dec15cda", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6e2bbae3f017444e9dfa43e6a72d7352", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/10000 [00:00