epub2text Documentation

A niche CLI tool to extract text from EPUB files with smart cleaning capabilities.

Features

  • Smart Navigation Parsing: Supports both EPUB3 (NAV HTML) and EPUB2 (NCX) navigation formats

  • Selective Extraction: Extract specific chapters by range or interactive selection

  • Flexible Output Formatting:

      - One paragraph per line with customizable separators
      - One sentence per line using spaCy NLP
      - Automatic line splitting at clause boundaries for long lines

  • Smart Text Cleaning:

      - Remove bracketed footnotes ([1], [42])
      - Remove page numbers (standalone, at line ends, with dashes)
      - Normalize whitespace and paragraph breaks
      - Preserve ordered lists with proper numbering

  • Rich Interactive UI: Beautiful terminal output with tables and tree views

  • Pipe-Friendly: Works as both a CLI tool and a Python library

  • Nested Chapter Support: Handles hierarchical chapter structures

  • Full Dublin Core Metadata: Extract all EPUB metadata fields
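
The text-cleaning rules listed above can be illustrated with a few regular expressions. The following is a minimal sketch using only the Python standard library. It is not epub2text's actual implementation, and the names `clean_text`, `FOOTNOTE_RE`, and `PAGE_NUMBER_RE` are hypothetical. It covers only bracketed footnotes, standalone page numbers (optionally wrapped in dashes), and whitespace normalization.

```python
import re

# Hypothetical patterns for illustration; epub2text's real rules may differ.
FOOTNOTE_RE = re.compile(r"\[\d+\]")                    # matches [1], [42]
PAGE_NUMBER_RE = re.compile(r"^\s*-?\s*\d+\s*-?\s*$")   # matches "42" or "- 42 -"


def clean_text(text: str) -> str:
    """Remove footnote markers and standalone page numbers, then tidy whitespace."""
    cleaned_lines = []
    for line in text.splitlines():
        if PAGE_NUMBER_RE.match(line):
            continue  # drop lines that contain nothing but a page number
        line = FOOTNOTE_RE.sub("", line)            # strip bracketed footnotes
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse runs of spaces/tabs
        cleaned_lines.append(line)
    joined = "\n".join(cleaned_lines)
    # Collapse three or more newlines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", joined).strip()
```

A line such as `1. First item` does not match the page-number pattern, so ordered lists survive. A purely numeric line that is real content (for example a chapter titled `1984`) would be dropped by this simple sketch, which is one reason a production cleaner needs more context than a single regular expression.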

Quick Start

Install epub2text:

pip install epub2text

Extract text from an EPUB file:

epub2text extract book.epub

List chapters:

epub2text list book.epub
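
For background on what chapter listing involves, the sketch below shows how an EPUB's package document (OPF) declares its navigation: EPUB3 marks a manifest item with the `nav` property, while EPUB2 points the spine's `toc` attribute at an NCX manifest item. This uses only the standard library and is an illustration of the file format, not epub2text's own code. The function name `find_navigation` and the choice to prefer NAV over NCX when both exist are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_navigation(opf_xml) -> Optional[tuple]:
    """Return (kind, href) for the navigation file declared in an OPF document.

    kind is "nav" for an EPUB3 navigation document or "ncx" for an EPUB2 NCX.
    Returns None when neither is declared.
    """
    package = ET.fromstring(opf_xml)
    items = package.findall("opf:manifest/opf:item", OPF_NS)

    # EPUB3: a manifest item whose space-separated properties include "nav".
    for item in items:
        if "nav" in item.get("properties", "").split():
            return ("nav", item.get("href"))

    # EPUB2: the spine's toc attribute holds the manifest id of the NCX file.
    spine = package.find("opf:spine", OPF_NS)
    toc_id = spine.get("toc") if spine is not None else None
    if toc_id is not None:
        for item in items:
            if item.get("id") == toc_id:
                return ("ncx", item.get("href"))

    return None
```

The returned `href` is relative to the location of the OPF file inside the archive.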

Show metadata:

epub2text info book.epub
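
To show where this metadata lives, here is a minimal standard-library sketch that reads the Dublin Core fields directly from an EPUB archive. It follows the EPUB container layout: `META-INF/container.xml` names the package document, and the package document holds `dc:` elements. The function name `read_dublin_core` is hypothetical, and this is not a description of how epub2text itself is implemented.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dublin_core(epub_file) -> dict:
    """Return a mapping of Dublin Core field name -> list of values.

    epub_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml points at the package (OPF) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml does not declare a rootfile")
        package = ET.fromstring(archive.read(rootfile.attrib["full-path"]))

    metadata = {}
    prefix = "{" + DC_NS + "}"
    for element in package.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]      # e.g. "title", "creator"
            value = (element.text or "").strip()
            if value:
                # Fields such as creator may repeat, so collect lists.
                metadata.setdefault(name, []).append(value)
    return metadata
```

Values are collected into lists because a book can legitimately carry several creators, subjects, or identifiers.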
