Minimal Modifier Example#
This is the minimal working version of a BloArk Modifier. This script is intended to modify the warehouses built by the Builder. For example, you can use this script to modify the warehouses into a format that is more suitable for your further analysis.
Python script#
Putting this script in the same directory as the bash script is recommended. This script will be executed by the bash script. For example, we name this script as blocks_1_modifier.py
.
import logging
import bloark
# Define a modifier profile.
class PTFModifier(bloark.ModifierProfile):
count: int = 0
def __init__(self):
self.count = 0
def block(self, content: dict, metadata: dict):
self.count += 1
logging.debug(f'Modifier: test printout! {self.count}')
return content, metadata
if __name__ == '__main__':
# Create a modifier instance with 8 processes (CPUs) and INFO-level logging.
modifier = bloark.Modifier(output_dir='./tests/output', num_proc=2, log_level=logging.INFO)
# Preload all files from the input directory (original warehouses).
modifier.preload('./tests/sample_data/sample_warehouses')
# Add the modifier profile to the modifier instance.
modifier.add_profile(PTFModifier())
# Start modifying the warehouses (this command will take a long time).
modifier.start()
Bash script#
Note
Check cluster requirements for more details about cluster environment setup.
This is an example bash script that will be used by sbatch
to submit the job to a cluster. Put the following script in the same directory as the Python script. For example, we name this script as blocks_1_modifier.sh
.
#!/bin/bash
#SBATCH --job-name=blocks_1
#SBATCH --partition=longq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=14-00:00
#SBATCH --mem-per-cpu=6000
#SBATCH --output=log_%j.out
#SBATCH --error=log_%j.error
python blocks_1_modifier.py
After activating the correct conda environment (having correct terminal prefix like (an_environment_with_bloark)
if you are using conda), you can simply submit the job by executing the following command in the same directory as the bash script:
sbatch blocks_1_modifier.sh