How does it work?#
BloArk is a complicated system that has some different mind models within its design. It is a good idea to understand the architecture of BloArk before using it.
Process#
This architecture typically works with any revision-based data. In Wikipedia Edit History scenario, the following steps are followed:
Building: The Wikipedia edit history is first divided into several blocks that obey the following rules:
Each block reflects one revision from an article.
All revisions of an article will be stored in the same warehouse.
Each block could be independently processed or analyzed.
Modifying: The blocks can then be modified for different purposes, with the following benefits:
Easy: Defining modifiers is as easy as defining a function that just tells how each block should be edited.
Parallelization: The blocks can be processed in parallel, which significantly reduces the processing time.
Memory-friendly: The blocks are, and should be, small enough to be loaded into memory, which makes the processing more efficient.
Composition: The blocks can be composed to form a larger block, which can then be processed or analyzed as a whole.
Reusing: The blocks can be reused for further modifications, which saves the time and effort of building the blocks from scratch.
Sharing: As long as the blocks are stored in the same format and be read by BloArk, they can easily be shared and reused by other users in other machines.
Design considerations#
When BloArk is designed, the following scenarios (including but not limited) are considered:
The single processable unit of a dataset is NOT too large to be loaded into memory.
A long-running device, such as Slurm job. It means that typical Jupyter Notebook is runnable but not suitable for this scenario in following reasons:
The Jupyter Notebook is not designed for long-running tasks. When you exit the browser (or close the browser tab), the Jupyter Notebook running session will be terminated.
Logs on Jupyter Notebook are not persistent and are not user-friendly. The scrolling experience along a long log is very bad on Jupyter Notebook (and Jupyter Lab, including all Jupyter-based software).