epub2text Documentation
A niche CLI tool to extract text from EPUB files with smart cleaning capabilities.
Features
Smart Navigation Parsing: Supports both EPUB3 (NAV HTML) and EPUB2 (NCX) navigation formats
Selective Extraction: Extract specific chapters by range or interactive selection
Flexible Output Formatting:
- One paragraph per line with customizable separators
- One sentence per line using spaCy NLP
- Automatic line splitting at clause boundaries for long lines
Smart Text Cleaning:
- Remove bracketed footnotes ([1], [42])
- Remove page numbers (standalone, at line ends, with dashes)
- Normalize whitespace and paragraph breaks
- Preserve ordered lists with proper numbering
Rich Interactive UI: Beautiful terminal output with tables and tree views
Pipe-Friendly: Works as both a CLI tool and a Python library
Nested Chapter Support: Handles hierarchical chapter structures
Full Dublin Core Metadata: Extract all EPUB metadata fields
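To make the text cleaning feature concrete, here is a minimal sketch of what removing bracketed footnotes, dropping page-number lines, and normalizing whitespace could look like. It is an illustration only: `clean_text` and the regular expressions are hypothetical and are not taken from epub2text, whose actual rules may differ (for example, this sketch handles only standalone page numbers, not those at line ends).

```python
import re

# Hypothetical patterns for illustration; not epub2text's own implementation.
FOOTNOTE_RE = re.compile(r"\[\d+\]")                         # [1], [42]
PAGE_NUMBER_LINE_RE = re.compile(r"^\s*-?\s*\d+\s*-?\s*$")   # "23" or "- 17 -" alone on a line
SPACES_RE = re.compile(r"[ \t]+")                            # runs of spaces or tabs
BLANK_LINES_RE = re.compile(r"\n{3,}")                       # two or more blank lines in a row


def clean_text(text: str) -> str:
    """Apply a simplified version of the cleaning steps listed above."""
    # Remove bracketed numeric footnote markers.
    text = FOOTNOTE_RE.sub("", text)

    kept = []
    for line in text.splitlines():
        # Drop lines that contain only a page number, optionally wrapped in dashes.
        if PAGE_NUMBER_LINE_RE.match(line):
            continue
        # Collapse whitespace runs inside the line and trim its ends.
        kept.append(SPACES_RE.sub(" ", line).strip())

    # Reduce runs of blank lines to a single paragraph break.
    return BLANK_LINES_RE.sub("\n\n", "\n".join(kept)).strip()
```

A numbered list item such as "1. Keep this item" survives because the page-number pattern only matches lines made up of digits and optional dashes.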
Quick Start
Install epub2text:
pip install epub2text
Extract text from an EPUB file:
epub2text extract book.epub
List chapters:
epub2text list book.epub
Show metadata:
epub2text info book.epub
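The same commands can be driven from a Python script. The sketch below shells out to the three subcommands shown above; it assumes that `epub2text` is on the PATH and writes its result to standard output, which this page does not state explicitly. `build_command` and `run_epub2text` are hypothetical helper names, not part of the epub2text library.

```python
import shutil
import subprocess

# Only the subcommands shown in the Quick Start above.
KNOWN_SUBCOMMANDS = ("extract", "list", "info")


def build_command(subcommand: str, epub_path: str) -> list:
    """Build the argument list for one of the Quick Start commands."""
    if subcommand not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return ["epub2text", subcommand, epub_path]


def run_epub2text(subcommand: str, epub_path: str) -> str:
    """Run the CLI and return whatever it prints to standard output."""
    if shutil.which("epub2text") is None:
        raise FileNotFoundError("epub2text not found; install it with: pip install epub2text")
    result = subprocess.run(
        build_command(subcommand, epub_path),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout

# Example call (requires epub2text and a real file):
# text = run_epub2text("extract", "book.epub")
```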