BAYESIAN LEARNING OF 2D DOCUMENT LAYOUT MODELS FOR PRESERVATION METADATA EXTRACTION

Song Mao, 2004-01-01

Abstract

Digital preservation addresses the storage, maintenance, accessibility, and technical integrity of digital materials over the long term. Preservation metadata is the information required to perform these tasks. Given the volume of these journals and the high labor cost of manual metadata entry, automated metadata extraction is necessary. Document layout analysis is the process of partitioning document images into hierarchically structured and labeled homogeneous physical regions. Descriptive metadata such as bibliographic information can then be extracted from these segmented and labeled regions using OCR. While numerous algorithms have been proposed for document layout analysis, most of them require manually specified rules or models. In this paper, we first define the hierarchical 2D layout model of document pages as a set of attributed hidden semi-Markov models (HSMMs). Each attributed HSMM represents the projection profile of the character bounding boxes in a physical region on either the
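To make the notion of a projection profile of character bounding boxes concrete, the following is a minimal Python sketch. It is an illustration only, not the paper's method: the box format `(x0, y0, x1, y1)` with half-open pixel ranges, the function names `projection_profile` and `runs`, and the choice to count boxes per pixel position are all assumptions. The run-length grouping is included because a semi-Markov model explicitly models how long each state lasts, so alternating text and gap runs with their durations are one plausible kind of observation such a model might segment.

```python
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1), half-open pixel ranges.
Box = Tuple[int, int, int, int]


def projection_profile(boxes: List[Box], length: int, axis: str) -> List[int]:
    """Count how many character boxes cover each position along one axis.

    axis="x" projects onto the horizontal axis, axis="y" onto the vertical.
    """
    profile = [0] * length
    for x0, y0, x1, y1 in boxes:
        lo, hi = (x0, x1) if axis == "x" else (y0, y1)
        # Clip to the region so out-of-range boxes do not raise errors.
        for i in range(max(lo, 0), min(hi, length)):
            profile[i] += 1
    return profile


def runs(profile: List[int]) -> List[Tuple[bool, int]]:
    """Group the profile into alternating (has_text, duration) runs."""
    out: List[Tuple[bool, int]] = []
    for value in profile:
        has_text = value > 0
        if out and out[-1][0] == has_text:
            out[-1] = (has_text, out[-1][1] + 1)
        else:
            out.append((has_text, 1))
    return out


# Two characters on one text line, a gap, then a wider box on a second line.
boxes = [(0, 0, 3, 2), (4, 0, 6, 2), (0, 5, 6, 7)]
vertical = projection_profile(boxes, length=8, axis="y")
print(vertical)        # [2, 2, 0, 0, 0, 1, 1, 0]
print(runs(vertical))  # [(True, 2), (False, 3), (True, 2), (False, 1)]
```

In this toy region the vertical profile separates two text lines by a three-pixel gap; the horizontal profile of the same boxes can be obtained with `axis="x"`.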


Keywords

Metadata, Computer science, Metadata repository, Bounding overwatch, Information retrieval, Set (abstract data type), Artificial intelligence, Data mining