API references#

Note

APIs marked with @unstable in the code are also marked as Unstable on this page.

class bloark.Builder(output_dir: str, num_proc: int = 1, log_level: int = 20, max_size: int = 16, compress: bool = True)#

Builder is a class for building the warehouse from the original data source.

output_dir#

The output directory.

Type:

str

num_proc#

The number of processes to use.

Type:

int

log_level#

The built-in logging level.

Type:

int

compress#

Whether to compress the output files.

Type:

bool

files#

A list of files to be read.

Type:

list

build()#

Build the blocks after applying the modifiers.

preload(path: str)#

Preload the files to be processed. Files are not actually loaded into memory until the build() method is called.

Parameters:

path (str) – The path of a file or a directory.

Raises:
  • ValueError – If the path is empty.

  • FileNotFoundError – If the path does not exist.
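The build flow documented above can be sketched as follows. This is a minimal illustration, not BloArk's own code: the helper name build_warehouse and the example paths are hypothetical, and the Builder class is passed in as a parameter so that the sketch relies only on the constructor arguments and methods listed on this page.

```python
import logging


def build_warehouse(builder_cls, source_path, output_dir, num_proc=1):
    """Preload a data source and build a warehouse from it.

    `builder_cls` is expected to behave like bloark.Builder as documented:
    a constructor accepting output_dir, num_proc, log_level and compress,
    plus preload(path) and build() methods.
    """
    builder = builder_cls(
        output_dir=output_dir,
        num_proc=num_proc,
        log_level=logging.INFO,  # logging.INFO == 20, the documented default
        compress=True,
    )
    try:
        # preload() only registers the files; nothing is read yet.
        builder.preload(source_path)
    except (ValueError, FileNotFoundError) as exc:
        # Documented failure modes: an empty path or a path that does not exist.
        logging.getLogger(__name__).error("Cannot preload %r: %s", source_path, exc)
        return False
    builder.build()
    return True
```

With BloArk installed, a call might look like build_warehouse(bloark.Builder, "./input", "./output"), where both paths are placeholders.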

class bloark.Modifier(output_dir: str, num_proc: int = 1, log_level: int = 20)#

Modifier is the class to define how to modify the JSON content of a block (or a segment) from the warehouse.

num_proc#

The number of processes to use.

Type:

int

log_level#

The log level.

Type:

int

files#

A list of files to be read.

Type:

list

modifiers#

A list of modifiers to be applied.

Type:

list

add_profile(profile: ModifierProfile)#

Add a modifier profile, which is applied to each block.

Parameters:

profile (ModifierProfile) – The modifier profile to be added.

build()#

Deprecated since version 2.1.2: This function name is opaque. Please use start() instead. This API will be removed after v2.4.

preload(path: str)#

Preload the files to be processed. Files are not actually loaded into memory until start() (or the deprecated build()) is called.

Parameters:

path (str) – The path of a file or a directory.

Raises:
  • ValueError – If the path is empty.

  • FileNotFoundError – If the path does not exist.

start()#

Start applying modifiers over blocks and segments. See the documentation for more details on the architecture.
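A typical Modifier run, using only the methods documented above, might look like the sketch below. The helper name run_modifier is hypothetical, and the Modifier class is injected as a parameter so that the example depends on nothing beyond the documented interface.

```python
def run_modifier(modifier_cls, profiles, warehouse_path, output_dir, num_proc=1):
    """Apply a sequence of modifier profiles to an existing warehouse.

    `modifier_cls` is expected to behave like bloark.Modifier as documented:
    preload(path), add_profile(profile) and start().
    """
    modifier = modifier_cls(output_dir=output_dir, num_proc=num_proc)
    # Register the warehouse files; they are not read until start() is called.
    modifier.preload(warehouse_path)
    for profile in profiles:
        modifier.add_profile(profile)
    # start() is used here because build() is documented as deprecated.
    modifier.start()
    return modifier
```

Each element of profiles would be an instance of a ModifierProfile subclass.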

class bloark.ModifierProfile#

The core class to define how to modify the JSON content.

abstract block(content: dict, metadata: dict) → Tuple[dict | None, dict | None]#

Modify the JSON content of a single block.

Parameters:
  • content (dict) – The JSON content to be modified.

  • metadata (dict) – The metadata of the JSON content. Within a segment, this is updated with the value returned by the previous call.

  • logger (logging.Logger) – The logger that should be used if you want to print out something. Check log standards for more details.

Returns:

The JSON content and metadata (whether modified or not). Return None as the first value if the content should be removed. Return None as the second value if the entire segment should be removed.

Return type:

Tuple[Optional[dict], Optional[dict]]
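The return contract above can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the class DropEmptyText is a stand-in that would subclass bloark.ModifierProfile in real use, the "text" and "kept" keys are invented, and apply_to_segment is a toy driver showing one way the documented return values could be interpreted, not BloArk's actual scheduling. The block() method follows the two-argument signature shown above.

```python
from typing import List, Optional, Tuple


class DropEmptyText:
    """Stand-in for a ModifierProfile subclass (hypothetical example)."""

    def block(self, content: dict, metadata: dict) -> Tuple[Optional[dict], Optional[dict]]:
        # "text" is an assumed key, used only for illustration.
        text = content.get("text", "")
        if not text.strip():
            # None in the first slot: remove this content, keep the segment.
            return None, metadata
        # Carry state forward through the metadata ("kept" is an assumed key).
        new_metadata = {**metadata, "kept": metadata.get("kept", 0) + 1}
        return {**content, "text": text.strip()}, new_metadata


def apply_to_segment(profile, contents: List[dict], metadata: dict):
    """Toy driver: feed each content to block(), threading metadata through."""
    kept = []
    for content in contents:
        new_content, metadata = profile.block(content, metadata)
        if metadata is None:
            # None in the second slot: the entire segment is removed.
            return None, None
        if new_content is not None:
            kept.append(new_content)
    return kept, metadata
```

Returning copies rather than mutating content and metadata in place keeps each call free of side effects, which is a design choice of this sketch rather than a documented requirement.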

class bloark.Reader(output_dir: str, num_proc: int = 1, log_level: int = 20)#

Reader is a class for reading the data from the warehouse (rather than from the original data source).

output_dir#

The output directory.

Type:

str

num_proc#

The number of processes to use.

Type:

int

log_level#

The built-in logging level.

Type:

int

files#

A list of files to be read.

Type:

list

decompress()#

Decompress the preloaded files.

glimpse() → Tuple[dict, dict] | Tuple[None, None]#

Take a glimpse of the preloaded data. It could still be large if one object contains a lot of information (e.g. many revisions, long article).

Returns:

A tuple of two dictionaries: (page, revision). If there is no file loaded, it returns (None, None).

Return type:

Tuple[dict, dict]

Notes

This function does not use any parallelization technique.
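Because glimpse() may return (None, None), callers should check the result before using it. The sketch below shows one way to do so. The helper name peek and its return shape are hypothetical, and the Reader class is passed in so that only the documented preload() and glimpse() methods are relied upon.

```python
def peek(reader_cls, warehouse_path, output_dir):
    """Return the top-level keys of one (page, revision) pair, or None.

    `reader_cls` is expected to behave like bloark.Reader as documented.
    """
    reader = reader_cls(output_dir=output_dir)
    reader.preload(warehouse_path)
    page, revision = reader.glimpse()
    if page is None:
        # Documented case: no file loaded, so glimpse() returned (None, None).
        return None
    # Listing only the keys avoids printing a potentially large object.
    return {"page_keys": sorted(page), "revision_keys": sorted(revision)}
```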

preload(path: str)#

Preload the files to be processed. Files are not actually loaded into memory until another method is called.

Parameters:

path (str) – The path of a file or a directory.

Raises:
  • ValueError – If the path is empty.

  • FileNotFoundError – If the path does not exist.

class bloark.Warehouse(output_dir: str, prefix: str = 'warehouse_', suffix: str = '', max_size: int = 12, compress: bool = True)#

Warehouse is a module that manages the creation, assignment, and equal distribution of BloArk blocks (and segments).

assign_warehouse() → str#

Request to assign a warehouse to a block.

Returns:

assigned_warehouse – The name of the assigned warehouse.

Return type:

str

Notes

This function is intended to be called in the main process (no parallelism).

create_warehouse()#

Create a new warehouse when needed. This function does not return the created warehouse; use assign_warehouse() to obtain one.

Notes

This function is intended to be called in the main process (no parallelism).

finalize_warehouse(warehouse: str)#

Finalize a warehouse. This function should be called when the warehouse is full and no more blocks will be assigned to it.

Parameters:

warehouse (str) – The name of the warehouse to be finalized.

Notes

This function is intended to be called in the main process (no parallelism).

release_warehouse(warehouse: str) → str | None#

Release the assignment of a warehouse.

Parameters:

warehouse (str) – The name of the warehouse to be released.

Returns:

warehouse_file_should_compress – The name of the warehouse file that should be compressed. If None, no compression is needed.

Return type:

str or None

Notes

This function is intended to be called in the main process (no parallelism).
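The assign/release cycle described above could be driven from the main process roughly as follows. This is a sketch under stated assumptions: the helper write_block and its two callbacks are hypothetical, and this page does not say how BloArk itself sequences these calls. Only the documented assign_warehouse() and release_warehouse() methods are used.

```python
def write_block(warehouse, write_fn, compress_fn):
    """Assign a warehouse, write to it, then release it (main process only).

    `warehouse` is expected to behave like bloark.Warehouse as documented.
    `write_fn(name)` and `compress_fn(name)` are caller-supplied callbacks.
    """
    name = warehouse.assign_warehouse()
    try:
        write_fn(name)
    finally:
        # Always release, even if writing failed, so the warehouse
        # does not stay assigned.
        to_compress = warehouse.release_warehouse(name)
    if to_compress is not None:
        # A non-None return value names a warehouse file that should be compressed.
        compress_fn(to_compress)
    return name
```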