Minimal Builder Example#

This is the minimal working version of a BloArk Builder. This script is intended to build the initial warehouses out from the original data sources, such as Wikipedia edit histories.

Python script#

Putting this script in the same directory as the bash script is recommended. This script will be executed by the bash script. For example, we name this script as blocks_0_builder.py.

import logging
import bloark


if __name__ == '__main__':
    # Create a builder instance with 8 processes and INFO-level logging.
    builder = bloark.Builder(output_dir='./output', num_proc=8, log_level=logging.INFO)
    
    # Preload all files from the input directory (original data sources).
    # This command should be instant because it only loads paths rather than files themselves.
    builder.preload('./input')
    
    # For testing purposes, we only build the first 10 files.
    # This way of modification is possible, but not recommended in production.
    builder.files = builder.files[:10]
    
    # Start building the warehouses (this command will take a long time).
    builder.build()

Bash script#

Note

Check cluster requirements for more details about cluster environment setup.

This is an example bash script that will be used by sbatch to submit the job to a cluster. Put the following script in the same directory as the Python script. For example, we name this script as blocks_0_builder.sh.

#!/bin/bash

#SBATCH --job-name=blocks_0
#SBATCH --partition=longq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=14-00:00
#SBATCH --mem-per-cpu=6000
#SBATCH --output=log_%j.out
#SBATCH --error=log_%j.error

python blocks_0_builder.py

After activating the correct conda environment (having correct terminal prefix like (an_environment_with_bloark) if you are using conda), you can simply submit the job by executing the following command in the same directory as the bash script:

sbatch blocks_0_builder.sh