Handlers Module
Overview
File format handlers for different document types.
- CSV Handler - CSV file format handler
- DOC Handler - DOC file format handler
- DOCX Handler - DOCX file format handler
- HTML Handler - HTML file format handler
- JSON Handler - JSON file format handler
- MD Handler - MD file format handler
- PDF Handler - PDF file format handler
- RTF Handler - RTF file format handler
- TXT Handler - TXT file format handler
- XML Handler - XML file format handler
- ZIP Handler - ZIP file format handler
File type-specific handlers package.
Modules:
| Name | Description |
|---|---|
| csv | CSV file handler for text extraction. |
| doc | DOC file handler for text extraction. |
| docx | DOCX file handler for comprehensive text extraction. |
| html | HTML file handler for text extraction. |
| json | JSON file handler for text extraction. |
| md | Markdown (.md) file handler for text extraction. |
| pdf | PDF file handler for text extraction. |
| rtf | RTF file handler for text extraction. |
| txt | TXT file handler for text extraction. |
| xml | XML file handler for text extraction. |
| zip | ZIP file handler for text extraction. |
Modules
csv
CSV file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| CSVHandler | Handler for extracting text from CSV files. |
Classes
CSVHandler
Bases: FileTypeHandler
Handler for extracting text from CSV files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
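The CSVHandler methods are undocumented here, so the following is only a minimal sketch of what CSV text extraction can look like with the standard library. The function name `extract_csv_text` and the `encoding` config key are hypothetical; the `(file_path, config)` signature is borrowed from the DOCXHandler documentation further down, on the assumption that handlers share it.

```python
import csv
from pathlib import Path
from typing import Optional


def extract_csv_text(file_path: Path, config: Optional[dict] = None) -> str:
    """Hypothetical stand-in for CSVHandler.extract: one output line per CSV row."""
    # "encoding" is an assumed config key, not a documented option.
    encoding = (config or {}).get("encoding", "utf-8")
    with open(file_path, newline="", encoding=encoding) as fh:
        rows = csv.reader(fh)
        # Strip each cell and join the cells of a row with ", ".
        return "\n".join(", ".join(cell.strip() for cell in row) for row in rows)
```

The real handler may format rows differently; this only illustrates the read-then-flatten shape of the task.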
Source code in textxtract/handlers/csv.py
Functions
extract
Source code in textxtract/handlers/csv.py
doc
DOC file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| DOCHandler | Handler for extracting text from DOC files with fallback options. |
Classes
DOCHandler
Bases: FileTypeHandler
Handler for extracting text from DOC files with fallback options.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
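The page does not say which fallbacks DOCHandler uses. As a generic illustration of the "fallback options" idea, the sketch below tries a list of extractor callables in order and returns the first result. `extract_with_fallbacks` and the local `ExtractionError` class are hypothetical stand-ins, not the library's API.

```python
from pathlib import Path
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Local stand-in for the library's ExtractionError (assumed, not imported)."""


def extract_with_fallbacks(
    file_path: Path, extractors: Sequence[Callable[[Path], str]]
) -> str:
    """Return the text from the first extractor that succeeds."""
    errors = []
    for extractor in extractors:
        try:
            return extractor(file_path)
        except Exception as exc:
            # Remember why this backend failed, then try the next one.
            name = getattr(extractor, "__name__", repr(extractor))
            errors.append(f"{name}: {exc}")
    raise ExtractionError("all extractors failed: " + "; ".join(errors))
```

Collecting every failure message before raising makes it easier to see which backend is missing or broken.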
Source code in textxtract/handlers/doc.py
Functions
extract
Source code in textxtract/handlers/doc.py
docx
DOCX file handler for comprehensive text extraction.
This handler extracts text from:
- Document paragraphs
- Tables and cells
- Headers and footers
- Text boxes and shapes
- Footnotes and endnotes (if available)
Classes:
| Name | Description |
|---|---|
| DOCXHandler | Enhanced handler for comprehensive text extraction from DOCX files. |
Classes
DOCXHandler
Bases: FileTypeHandler
Enhanced handler for comprehensive text extraction from DOCX files.
This handler provides complete text extraction from Microsoft Word documents,
including all document elements such as paragraphs, tables, headers, footers,
text boxes, and footnotes. It's designed to handle complex document layouts
commonly found in resumes, reports, and structured documents.
Features:
- Extracts text from document body paragraphs
- Processes table content with cell-by-cell extraction
- Captures header and footer text from all sections
- Attempts to extract text from embedded text boxes and shapes
- Handles footnotes and endnotes when available
- Deduplicates repeated content
- Cleans and normalizes extracted text
Example:
>>> from pathlib import Path
>>> from textxtract.handlers.docx import DOCXHandler
>>> handler = DOCXHandler()
>>> text = handler.extract(Path("document.docx"))
>>> print(text)
Document title
Paragraph content... Table data | Column 2...
>>> # Async extraction (must run inside a coroutine)
>>> text = await handler.extract_async(Path("document.docx"))
Methods:
| Name | Description |
|---|---|
| extract | Extract text from a DOCX file with comprehensive content capture. |
| extract_async | Asynchronously extract text from a DOCX file. |
Source code in textxtract/handlers/docx.py
Functions
extract
Extract text from a DOCX file with comprehensive content capture.
Performs thorough text extraction from all available document elements including body text, tables, headers, footers, and embedded content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the DOCX file to extract text from. | required |
| config | Optional[dict] | Configuration options for extraction. Currently not used but reserved for future enhancements. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Extracted and cleaned text from the document with proper formatting. Returns empty string if no text is found. |
Raises:
| Type | Description |
|---|---|
| ExtractionError | If the file cannot be read or processed, or if the python-docx library is not available. |
Note
- Text is deduplicated to avoid repeated content from overlapping elements
- Table content is formatted with pipe separators between columns
- Special content (footnotes, text boxes) is labeled with descriptive tags
- Sentence breaks are automatically inserted for better readability
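Two of the behaviours noted above, pipe-separated table rows and deduplication, can be sketched in plain Python. The helper names `format_table_row` and `dedupe_preserving_order` are hypothetical, and the whitespace normalisation is an assumption about how duplicates are detected, not a description of the handler's actual code.

```python
from typing import Iterable, List


def format_table_row(cells: Iterable[str]) -> str:
    """Join the non-empty cells of one table row with pipe separators."""
    stripped = (cell.strip() for cell in cells)
    return " | ".join(cell for cell in stripped if cell)


def dedupe_preserving_order(chunks: Iterable[str]) -> List[str]:
    """Drop empty and repeated text chunks while keeping first-seen order."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Collapse runs of whitespace so near-identical chunks compare equal.
        key = " ".join(chunk.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Keeping first-seen order matters here because the extracted text should still read in document order after duplicates are removed.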
Source code in textxtract/handlers/docx.py
extract_async
async
Asynchronously extract text from a DOCX file.
Provides non-blocking text extraction by running the synchronous extraction method in a separate thread.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the DOCX file to extract text from. | required |
| config | Optional[dict] | Configuration options for extraction. Currently not used but reserved for future enhancements. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Extracted and cleaned text from the document with proper formatting. Returns empty string if no text is found. |
Raises:
| Type | Description |
|---|---|
| ExtractionError | If the file cannot be read or processed, or if the python-docx library is not available. |
Note
This method uses asyncio.to_thread() to run the synchronous extraction in a thread pool, making it suitable for async/await usage patterns.
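The pattern described in the note can be sketched as follows. `SketchHandler` is a hypothetical class whose synchronous `extract` simply reads a text file; only the `asyncio.to_thread()` delegation reflects what the note states. `asyncio.to_thread()` requires Python 3.9 or later.

```python
import asyncio
from pathlib import Path
from typing import Optional


class SketchHandler:
    """Hypothetical handler used only to illustrate the sync-to-async wrapper."""

    def extract(self, file_path: Path, config: Optional[dict] = None) -> str:
        # Blocking work: a placeholder for real document parsing.
        return file_path.read_text(encoding="utf-8")

    async def extract_async(
        self, file_path: Path, config: Optional[dict] = None
    ) -> str:
        # Run the blocking extract() in a worker thread so the event loop
        # stays free to service other tasks while the file is processed.
        return await asyncio.to_thread(self.extract, file_path, config)
```

From synchronous code the coroutine can be driven with `asyncio.run(handler.extract_async(path))`; inside an existing coroutine, `await` it directly.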
Source code in textxtract/handlers/docx.py
html
HTML file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| HTMLHandler | Handler for extracting text from HTML files. |
Classes
HTMLHandler
Bases: FileTypeHandler
Handler for extracting text from HTML files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/html.py
Functions
extract
Source code in textxtract/handlers/html.py
json
JSON file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| JSONHandler | Handler for extracting text from JSON files. |
Classes
JSONHandler
Bases: FileTypeHandler
Handler for extracting text from JSON files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/json.py
Functions
extract
Source code in textxtract/handlers/json.py
md
Markdown (.md) file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| MDHandler | Handler for extracting text from Markdown files. |
Classes
MDHandler
Bases: FileTypeHandler
Handler for extracting text from Markdown files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/md.py
Functions
extract
Source code in textxtract/handlers/md.py
pdf
PDF file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| PDFHandler | Handler for extracting text from PDF files with improved error handling. |
Classes
PDFHandler
Bases: FileTypeHandler
Handler for extracting text from PDF files with improved error handling.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/pdf.py
Functions
extract
Source code in textxtract/handlers/pdf.py
rtf
RTF file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| RTFHandler | Handler for extracting text from RTF files. |
Classes
RTFHandler
Bases: FileTypeHandler
Handler for extracting text from RTF files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/rtf.py
Functions
extract
Source code in textxtract/handlers/rtf.py
txt
TXT file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| TXTHandler | Handler for extracting text from TXT files. |
Classes
TXTHandler
Bases: FileTypeHandler
Handler for extracting text from TXT files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/txt.py
Functions
extract
Source code in textxtract/handlers/txt.py
xml
XML file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| XMLHandler | Handler for extracting text from XML files. |
Classes
XMLHandler
Bases: FileTypeHandler
Handler for extracting text from XML files.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Source code in textxtract/handlers/xml.py
Functions
extract
Source code in textxtract/handlers/xml.py
zip
ZIP file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| ZIPHandler | Handler for extracting text from ZIP archives with security checks. |
Attributes:
| Name | Type | Description |
|---|---|---|
| logger | | |
Attributes
Classes
ZIPHandler
Bases: FileTypeHandler
Handler for extracting text from ZIP archives with security checks.
Methods:
| Name | Description |
|---|---|
| extract | |
| extract_async | |
Attributes:
| Name | Type | Description |
|---|---|---|
| MAX_EXTRACT_SIZE | | |
| MAX_FILES | | |
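The values of MAX_EXTRACT_SIZE and MAX_FILES are not documented on this page, and neither are the specific checks ZIPHandler performs. The sketch below shows one common way such limits are enforced with the standard `zipfile` module. The numeric limits, the `check_zip_is_safe` function, and the use of `ValueError` are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Assumed limits for illustration only; the real class attributes may differ.
MAX_EXTRACT_SIZE = 100 * 1024 * 1024  # total uncompressed bytes allowed
MAX_FILES = 1000  # maximum number of archive members


def check_zip_is_safe(
    zip_path: Path,
    max_size: int = MAX_EXTRACT_SIZE,
    max_files: int = MAX_FILES,
) -> None:
    """Raise ValueError if the archive looks like a zip bomb or escapes its root."""
    with zipfile.ZipFile(zip_path) as archive:
        infos = archive.infolist()
        if len(infos) > max_files:
            raise ValueError(f"too many files: {len(infos)} > {max_files}")
        total = 0
        for info in infos:
            name = info.filename
            # Reject absolute paths and parent-directory traversal.
            if name.startswith("/") or ".." in PurePosixPath(name).parts:
                raise ValueError(f"unsafe member path: {name}")
            # file_size is the declared uncompressed size of the member.
            total += info.file_size
            if total > max_size:
                raise ValueError(f"archive expands beyond {max_size} bytes")
```

Checking the declared sizes before reading any member keeps the validation cheap. Because declared sizes can be forged, a stricter implementation would also count bytes while reading.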
Source code in textxtract/handlers/zip.py
Attributes
Functions
extract