API Reference

This section documents the Python API for using epub2text as a library.

Core Classes

EPUBParser

Data Models

Chapter

Metadata

Text Cleaning

TextCleaner

clean_text

Text Formatting

format_paragraphs

format_sentences

split_long_lines

Usage Examples

Basic Usage

Parse an EPUB file, read its metadata, list its chapters, and extract the full text:

from epub2text import EPUBParser

parser = EPUBParser("book.epub")

# Get metadata
metadata = parser.get_metadata()
print(f"Title: {metadata.title}")
print(f"Authors: {', '.join(metadata.authors)}")
print(f"Language: {metadata.language}")

# Get all chapters
chapters = parser.get_chapters()
for chapter in chapters:
    print(f"{chapter.title}: {chapter.char_count:,} characters")

# Extract all text
full_text = parser.extract_chapters()

Chapter Selection

Extract specific chapters:

from epub2text import EPUBParser

parser = EPUBParser("book.epub")
chapters = parser.get_chapters()

# Extract the first 3 chapters (slicing avoids an IndexError on shorter books)
chapter_ids = [chapter.id for chapter in chapters[:3]]
text = parser.extract_chapters(chapter_ids)
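
Selecting by position can be brittle when books differ in how much front matter they carry. The following is a minimal sketch of selecting by title or length instead. It relies only on the id, title, and char_count attributes used in the examples on this page; select_chapter_ids and the stand-in chapter objects are hypothetical helpers for illustration, not part of epub2text.

```python
from types import SimpleNamespace


def select_chapter_ids(chapters, keyword=None, min_chars=0):
    """Return ids of chapters matching an optional title keyword and minimum length.

    Hypothetical helper: it only assumes each chapter has id, title, and
    char_count attributes.
    """
    selected = []
    for chapter in chapters:
        title = chapter.title or ""
        # Case-insensitive substring match on the chapter title.
        if keyword is not None and keyword.lower() not in title.lower():
            continue
        # Skip very short entries such as title pages or dedications.
        if chapter.char_count < min_chars:
            continue
        selected.append(chapter.id)
    return selected


# Stand-in objects (an assumption for this demo); real chapters would come
# from parser.get_chapters().
demo_chapters = [
    SimpleNamespace(id="ch1", title="Preface", char_count=900),
    SimpleNamespace(id="ch2", title="Chapter 1: Arrival", char_count=12000),
    SimpleNamespace(id="ch3", title="Chapter 2: Departure", char_count=15000),
]

print(select_chapter_ids(demo_chapters, keyword="chapter"))
```

The resulting list can then be passed to parser.extract_chapters(...) in the same way as chapter_ids above.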

Custom Text Cleaning

Apply custom cleaning options:

from epub2text import EPUBParser, TextCleaner

parser = EPUBParser("book.epub")
text = parser.extract_chapters()

# Custom cleaning
cleaner = TextCleaner(
    remove_bracketed_numbers=True,
    remove_page_numbers=True,
    normalize_whitespace=True,
    replace_single_newlines=True,
)
cleaned_text = cleaner.clean(text)
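
To give a feel for what options like these typically mean, here is a rough standard-library approximation of three of them. It is an assumption-laden sketch, not TextCleaner's actual implementation: the regular expressions and the sketch_clean name are invented for illustration, and the library's real rules may differ.

```python
import re


def sketch_clean(text):
    """Approximate three cleaning steps with plain regular expressions.

    Hypothetical illustration only; epub2text's TextCleaner may behave
    differently.
    """
    # Roughly "remove_bracketed_numbers": drop footnote-style markers like [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Roughly "replace_single_newlines": join hard-wrapped lines with a space,
    # while leaving blank-line paragraph breaks untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Roughly "normalize_whitespace": collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "The sea was calm[3] that\nmorning.\n\nNobody   spoke."
print(sketch_clean(sample))
```

The paragraph break (the blank line) survives, while the footnote marker, the hard line wrap, and the extra spaces are removed.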

Sentence Formatting

Format text with one sentence per line:

from epub2text import EPUBParser
from epub2text.formatters import format_sentences

parser = EPUBParser("book.epub")
text = parser.extract_chapters()

# One sentence per line
formatted = format_sentences(text, separator="  ")

Line Splitting

Split long lines at clause boundaries:

from epub2text import EPUBParser
from epub2text.formatters import split_long_lines

parser = EPUBParser("book.epub")
text = parser.extract_chapters()

# Split lines exceeding 80 characters
split_text = split_long_lines(text, max_length=80)
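
The idea of splitting at clause boundaries can be illustrated with a small standalone sketch. It assumes that a clause ends at a comma, semicolon, or colon followed by whitespace, and greedily packs clauses into pieces. The sketch_split_line function is hypothetical and is not how split_long_lines is implemented.

```python
import re

# Assumption for this sketch: a clause boundary is whitespace that follows
# a comma, semicolon, or colon.
_CLAUSE_BREAK = re.compile(r"(?<=[,;:])\s+")


def sketch_split_line(line, max_length=80):
    """Greedily pack clauses into pieces of at most max_length characters.

    A single clause longer than max_length is kept whole, so a piece can
    still exceed the limit in that case.
    """
    if len(line) <= max_length:
        return [line]
    pieces, current = [], ""
    for clause in _CLAUSE_BREAK.split(line):
        candidate = f"{current} {clause}" if current else clause
        if current and len(candidate) > max_length:
            # Adding this clause would overflow: close the current piece.
            pieces.append(current)
            current = clause
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


demo_line = (
    "When the rain stopped, the travellers packed their bags; "
    "then they walked north, toward the hills."
)
for piece in sketch_split_line(demo_line, max_length=40):
    print(piece)
```

Greedy packing keeps each piece as long as possible without breaking inside a clause, which tends to read more naturally than wrapping at a fixed column.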

Full Metadata Access

Access all Dublin Core metadata fields:

from epub2text import EPUBParser

parser = EPUBParser("book.epub")
metadata = parser.get_metadata()

print(f"Title: {metadata.title}")
print(f"Authors: {metadata.authors}")
print(f"Contributors: {metadata.contributors}")
print(f"Publisher: {metadata.publisher}")
print(f"Publication Year: {metadata.publication_year}")
print(f"Identifier: {metadata.identifier}")
print(f"Language: {metadata.language}")
print(f"Rights: {metadata.rights}")
print(f"Coverage: {metadata.coverage}")
print(f"Description: {metadata.description}")
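
Many EPUB files populate only some of these fields. The sketch below prints just the fields that are set. It assumes that unset fields are None or empty, which this page does not confirm; metadata_summary and the stand-in object are hypothetical helpers, not part of epub2text.

```python
from types import SimpleNamespace

# Field names taken from the example above.
FIELDS = [
    "title", "authors", "contributors", "publisher", "publication_year",
    "identifier", "language", "rights", "coverage", "description",
]


def metadata_summary(metadata):
    """Build a "Label: value" line for every field that has a value.

    Assumption: missing fields are None, an empty string, or an empty list.
    """
    lines = []
    for name in FIELDS:
        value = getattr(metadata, name, None)
        if value is None or value == "" or value == []:
            continue
        if isinstance(value, (list, tuple)):
            # Join multi-valued fields such as authors into one string.
            value = ", ".join(str(item) for item in value)
        label = name.replace("_", " ").title()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


# Stand-in object (an assumption for this demo); a real one would come from
# parser.get_metadata().
demo_metadata = SimpleNamespace(
    title="Example Book",
    authors=["A. Writer", "B. Editor"],
    contributors=[],
    publisher=None,
    publication_year=2001,
    language="en",
)
print(metadata_summary(demo_metadata))
```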

Module Index

epub2text

Main package exports:

  • EPUBParser - Main parser class

  • Chapter - Chapter data model

  • Metadata - Metadata data model

  • TextCleaner - Text cleaning class

  • clean_text - Convenience function for text cleaning

epub2text.formatters

Text formatting utilities:

  • format_paragraphs - Format text with paragraph separators

  • format_sentences - Format text with one sentence per line

  • split_long_lines - Split long lines at clause boundaries

  • split_into_paragraphs - Split text into a list of paragraphs

  • collapse_paragraph - Collapse a paragraph to a single line

epub2text.cleaner

Text cleaning utilities:

  • TextCleaner - Configurable text cleaner class

  • clean_text - Convenience function

  • calculate_text_length - Calculate text length excluding markers