How to use metadata?

Both the output of the builder and the modifier module work with something called metadata. It is a dictionary that holds information about a segment, such as the article ID, the article title, the number of revisions (if you count them), or the list of categories the article belongs to. The metadata is stored in a dedicated metadata file that has the same name as its associated warehouse.

Auto-generated by the Builder module

For your convenience, the builder module generates the metadata automatically. You can modify the metadata dictionary further with the Modifier module, but this section describes only the fields you get from the builder module.

  • id: The ID of the article.

  • title: The title of the article.

  • categories: A list of categories that the article belongs to (extracted from the last revision).

  • source_revision: The ID of the revision that the categories are extracted from.

  • byte_start: The starting byte offset of this article in the warehouse.

  • byte_end: The ending byte offset of this article in the warehouse. This offset follows the same convention as the end index of a Python slice: the byte at this offset is not part of the article.
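To make the half-open convention of byte_start and byte_end concrete, here is a minimal sketch. It is not the library's own reader: the helper name read_article_bytes, the in-memory stream, and the sample values are all made up for illustration, and it assumes the offsets index directly into the byte stream you are reading (whether that means the raw or the decompressed warehouse is not stated here).

```python
import io


def read_article_bytes(warehouse, metadata):
    """Return the raw bytes of one article from an open binary stream.

    Hypothetical helper: it only relies on the byte_start and byte_end
    fields described above.
    """
    start = metadata["byte_start"]
    end = metadata["byte_end"]  # exclusive, like the end of a Python slice
    warehouse.seek(start)
    return warehouse.read(end - start)


# Made-up warehouse content: two records stored back to back.
first = b'{"revision": 1}\n'
second = b'{"revision": 2}\n'
warehouse = io.BytesIO(first + second)

# Made-up metadata for the second record, using the builder's field names.
metadata = {
    "id": "12",
    "title": "Example",
    "categories": ["Category:Examples"],
    "source_revision": "100",
    "byte_start": len(first),
    "byte_end": len(first) + len(second),
}

article_bytes = read_article_bytes(warehouse, metadata)
```

Because byte_end is exclusive, the length of the article is simply byte_end - byte_start, and the byte_end of one article can equal the byte_start of the next without any overlap.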

Use with the Modifier module

In the modifier module, metadata is shared across all revisions of one article: every revision that belongs to the same article receives the same dictionary. This means that if you change the metadata dictionary while processing an early revision, the change is visible to all later revisions during the modification process. At the end, the dictionary is written to the new metadata file that accompanies the newly produced warehouse.
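The sharing behavior can be illustrated with plain Python. The sketch below does not use the Modifier module's real interface: the function track_revisions, the shape of the revision dictionaries, and the keys first_seen_revision and revision_count are all hypothetical. It only shows what happens when one dictionary object is passed along with every revision of an article.

```python
def track_revisions(revision, metadata):
    """Hypothetical per-revision callback that updates shared metadata."""
    # Set once, on the first revision; later revisions see the stored value.
    if "first_seen_revision" not in metadata:
        metadata["first_seen_revision"] = revision["revision_id"]
    # Accumulates across revisions because the same dict is reused.
    metadata["revision_count"] = metadata.get("revision_count", 0) + 1
    return revision


# Made-up article metadata and revisions for illustration.
metadata = {"id": "12", "title": "Example"}
revisions = [
    {"revision_id": "100"},
    {"revision_id": "101"},
    {"revision_id": "102"},
]

# The same dictionary object accompanies every revision of the article.
for revision in revisions:
    track_revisions(revision, metadata)
```

After the loop, metadata holds the ID of the first revision and a count of three, because each call mutated the same object. If you need per-revision state instead, keep it outside the shared dictionary or copy the dictionary before changing it.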