Usage Guide

epub2text provides three main commands: list, extract, and info.

List Chapters

Display all chapters in an EPUB file:

# Table format (default)
epub2text list book.epub

# Tree format (shows hierarchy)
epub2text list book.epub --format tree

The table format shows chapter numbers, titles, character counts, and nesting levels. The tree format displays the hierarchical structure of nested chapters.

Extract Text

Basic Extraction

Extract all chapters to stdout:

epub2text extract book.epub

Extract to a file:

epub2text extract book.epub -o output.txt

Chapter Selection

Extract specific chapters by number:

# Single chapter
epub2text extract book.epub -c 1

# Multiple chapters
epub2text extract book.epub -c 1,3,5

# Chapter range
epub2text extract book.epub -c 1-5

# Complex range
epub2text extract book.epub -c 1-5,7,9-12 -o selected.txt

Interactive chapter selection:

epub2text extract book.epub --interactive

Output Formatting

Paragraph Mode (-p, --paragraphs): One line per paragraph:

epub2text extract book.epub --paragraphs

Sentence Mode (-s, --sentences): One line per sentence (requires spaCy):

epub2text extract book.epub --sentences

Max Line Length (-m, --max-length): Split long lines at clause boundaries:

epub2text extract book.epub --max-length 80

Paragraph Separators:

By default, paragraphs are separated by two spaces at the start of each new paragraph. You can customize this behavior:

# Use empty lines between paragraphs
epub2text extract book.epub --empty-lines

# Custom separator (e.g., tab)
epub2text extract book.epub --separator "\\t"

# No separator
epub2text extract book.epub --separator ""

Text Cleaning Options

By default, epub2text applies smart text cleaning. You can disable or customize it:

# Disable all cleaning (raw output)
epub2text extract book.epub --raw

# Keep bracketed footnotes like [1]
epub2text extract book.epub --keep-footnotes

# Keep page numbers
epub2text extract book.epub --keep-page-numbers

# Hide chapter markers
epub2text extract book.epub --no-markers

Output Control

Control which lines are output:

# Skip first 10 lines
epub2text extract book.epub --offset 10

# Limit to 100 lines
epub2text extract book.epub --limit 100

# Add line numbers
epub2text extract book.epub --line-numbers

Language Model

For sentence-level formatting, you can specify a different spaCy language model:

# Use German language model
epub2text extract book.epub --sentences --language-model de_core_news_sm

Show Metadata

Display EPUB metadata and statistics:

# Panel format (default)
epub2text info book.epub

# Table format
epub2text info book.epub --format table

# JSON format (for scripting)
epub2text info book.epub --format json

The info command displays:

  • Title

  • Authors

  • Contributors

  • Publisher

  • Publication Year

  • Identifier (ISBN, UUID, etc.)

  • Language

  • Rights (copyright)

  • Coverage

  • Description

  • Chapter count

  • Total character count

Chapter Markers

Extracted text includes chapter markers in the format:

<<CHAPTER: Chapter Title>>

Chapter text content here...

<<CHAPTER: Next Chapter>>

More content...

Use --no-markers to hide these markers.

Examples

Extract a book for text-to-speech processing:

# One sentence per line, suitable for TTS
epub2text extract book.epub --sentences -o book.txt

Create a clean plain text version:

# Paragraphs with empty lines, no markers
epub2text extract book.epub --paragraphs --empty-lines --no-markers -o book.txt

Extract specific chapters with line length limit:

# Chapters 1-5 with max 100 chars per line
epub2text extract book.epub -c 1-5 --max-length 100 -o excerpt.txt

Get metadata as JSON for scripting:

epub2text info book.epub --format json | jq '.title'