How does it work?
BloArk is a complex system whose design combines several different mental models. It is worth understanding the architecture of BloArk before using it.
Process
This architecture generally works with any revision-based data. In the Wikipedia edit history scenario, the steps are as follows:
- Building: The Wikipedia edit history is first divided into blocks that obey the following rules:
  - Each block reflects one revision of an article.
  - All revisions of an article are stored in the same warehouse.
  - Each block can be processed or analyzed independently.
- Modifying: The blocks can then be modified for different purposes, with the following benefits:
  - Easy: Defining a modifier is as easy as defining a function that describes how each block should be edited.
  - Parallelization: The blocks can be processed in parallel, which significantly reduces processing time.
  - Memory-friendly: The blocks are, and should be, small enough to be loaded into memory, which makes processing more efficient.
  - Composition: The blocks can be composed to form a larger block, which can then be processed or analyzed as a whole.
- Reusing: The blocks can be reused for further modifications, which saves the time and effort of building the blocks from scratch. 
- Sharing: As long as the blocks are stored in the same format and can be read by BloArk, they can easily be shared and reused by other users on other machines.
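The steps above can be illustrated with a small, self-contained Python sketch. It does not use BloArk's actual API: the block layout (a plain dict with `article_id`, `revision`, and `text`), the helper names (`build_warehouses`, `add_word_count`, `modify_all`, `dump_jsonl`, `load_jsonl`), and the choice of JSON Lines as the shared format are all assumptions made for illustration. A thread pool stands in for whatever worker pool a real run would use.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def build_warehouses(history, articles_per_warehouse=2):
    """Building: split {article_id: [revision_text, ...]} into warehouses.

    Each block is one revision, and all revisions of an article land in
    the same warehouse. The dict layout is hypothetical.
    """
    warehouses = []
    article_ids = sorted(history)
    for start in range(0, len(article_ids), articles_per_warehouse):
        warehouse = []
        for article_id in article_ids[start:start + articles_per_warehouse]:
            for index, text in enumerate(history[article_id]):
                warehouse.append(
                    {"article_id": article_id, "revision": index, "text": text}
                )
        warehouses.append(warehouse)
    return warehouses


def add_word_count(block):
    """Modifying: a modifier is a function from one block to an edited block."""
    return {**block, "word_count": len(block["text"].split())}


def modify_all(warehouses, modifier, max_workers=4):
    """Apply the modifier to every block, one warehouse per worker task."""
    def apply(warehouse):
        return [modifier(block) for block in warehouse]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the warehouses.
        return list(pool.map(apply, warehouses))


def dump_jsonl(blocks):
    """Sharing: serialize blocks, one JSON object per line (assumed format)."""
    return "\n".join(json.dumps(block, sort_keys=True) for block in blocks)


def load_jsonl(text):
    """Reusing: read previously stored blocks back for further modification."""
    return [json.loads(line) for line in text.splitlines() if line]
```

Because a modifier returns a new block instead of editing its input, the output of one run can be stored and fed to another modifier later without rebuilding the blocks from the raw edit history.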
Design considerations
BloArk was designed with the following scenarios in mind (among others):
- The single processable unit of a dataset is NOT too large to be loaded into memory. 
- A long-running environment, such as a Slurm job. A typical Jupyter Notebook can run BloArk but is not well suited to this scenario, for the following reasons:
  - Jupyter Notebook is not designed for long-running tasks. When you exit the browser (or close the browser tab), the running Jupyter Notebook session will be terminated.
  - Logs in Jupyter Notebook are not persistent and are not user-friendly. Scrolling through a long log is a poor experience in Jupyter Notebook (and in Jupyter Lab and all other Jupyter-based software).
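As a rough illustration of the second scenario, a long-running script can write its progress to a log file that survives the session and can be searched or tailed afterwards. The sketch below uses only Python's standard `logging` module; it is not BloArk's own logging setup, and the function names, logger name, and log path are hypothetical.

```python
import logging


def make_file_logger(log_path, name="long_job"):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def process_units(units, logger):
    """Process small units one at a time, logging progress after each."""
    total = 0
    for index, unit in enumerate(units):
        total += len(unit)
        logger.info("processed unit %d (%d items)", index, len(unit))
    return total
```

Run as a plain script inside a batch job, the log file stays on disk after the job ends, which is the persistence a notebook's output cells do not provide.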
 
