Text Extractor Package
Text Extractor package - Professional text extraction from multiple file formats.
Modules:
| Name | Description |
|---|---|
| `aio` | Asynchronous extraction logic package. |
| `core` | Core components for textxtract package. |
| `exceptions` | |
| `handlers` | File type-specific handlers package. |
| `sync` | Synchronous extraction logic package. |
Classes:
| Name | Description |
|---|---|
| `AsyncTextExtractor` | Asynchronous text extractor with support for file paths and bytes. |
| `ExtractorConfig` | Enhanced configuration options for text extraction with validation. |
| `SyncTextExtractor` | Synchronous text extractor with support for file paths and bytes. |
Attributes
Classes
AsyncTextExtractor
Bases: TextExtractor
Asynchronous text extractor with support for file paths and bytes.
Provides asynchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Uses thread pool for I/O-bound operations.
Methods:
| Name | Description |
|---|---|
| `__aenter__` | Async context manager entry. |
| `__aexit__` | Async context manager exit with cleanup. |
| `__enter__` | Context manager entry. |
| `__exit__` | Context manager exit with cleanup. |
| `__init__` | |
| `extract` | Extract text asynchronously from file path or bytes using thread pool. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `config` | | |
Source code in textxtract/aio/extractor.py
Attributes
Functions
__aenter__
async
__aexit__
async
__enter__
__exit__
__init__
Source code in textxtract/aio/extractor.py
extract
async
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Union[Path, str, bytes]` | File path (Path/str) or file bytes. | required |
| `filename` | `Optional[str]` | Required if source is bytes, optional for file paths. | `None` |
| `config` | `Optional[dict]` | Optional configuration overrides. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If filename is missing when source is bytes. |
| `FileTypeNotSupportedError` | If the file extension is not supported. |
| `ExtractionError` | If extraction fails. |
| `InvalidFileError` | If the file is invalid or corrupted. |
Source code in textxtract/aio/extractor.py
ExtractorConfig
Enhanced configuration options for text extraction with validation.
Methods:
| Name | Description |
|---|---|
| `__init__` | |
| `__repr__` | |
| `from_file` | Load configuration from a file (JSON, YAML, or TOML). |
| `get_handler` | Retrieve a handler for a given file extension. |
| `get_handler_config` | Get configuration specific to a handler. |
| `register_handler` | Register a custom file type handler. |
| `to_dict` | Convert configuration to dictionary. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `custom_handlers` | | |
| `encoding` | | |
| `extra_config` | | |
| `logging_format` | | |
| `logging_level` | | |
| `max_file_size` | | |
| `max_memory_usage` | | |
| `timeout` | | |
Source code in textxtract/core/config.py
Attributes
logging_format
instance-attribute
Functions
__init__
__init__(encoding='utf-8', logging_level='INFO', logging_format=None, timeout=None, max_file_size=None, max_memory_usage=None, custom_handlers=None, **kwargs)
Source code in textxtract/core/config.py
__repr__
from_file
classmethod
Load configuration from a file (JSON, YAML, or TOML).
Source code in textxtract/core/config.py
get_handler
get_handler_config
Get configuration specific to a handler.
Source code in textxtract/core/config.py
register_handler
Register a custom file type handler.
to_dict
Convert configuration to dictionary.
Source code in textxtract/core/config.py
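The constructor signature above accepts named options plus arbitrary `**kwargs`, and the class exposes `extra_config` and `to_dict`. The following is a minimal sketch of that shape, not the package's implementation: `SketchConfig` is a hypothetical stand-in, the validation rules are assumptions, routing unknown keyword arguments into `extra_config` is an assumption, and only JSON loading is shown.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class SketchConfig:
    """Hypothetical stand-in that mirrors the documented constructor shape."""

    def __init__(
        self,
        encoding: str = "utf-8",
        logging_level: str = "INFO",
        timeout: Optional[float] = None,
        max_file_size: Optional[int] = None,
        **kwargs: Any,
    ) -> None:
        # Assumed validation rules; the real checks may differ.
        if timeout is not None and timeout <= 0:
            raise ValueError("timeout must be a positive number")
        if max_file_size is not None and max_file_size <= 0:
            raise ValueError("max_file_size must be a positive integer")
        self.encoding = encoding
        self.logging_level = logging_level
        self.timeout = timeout
        self.max_file_size = max_file_size
        # Assumption: unrecognised keyword arguments are kept, not rejected.
        self.extra_config: Dict[str, Any] = dict(kwargs)

    @classmethod
    def from_file(cls, path: Path) -> "SketchConfig":
        # JSON only in this sketch; YAML and TOML would need their own loaders.
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        return cls(**data)

    def to_dict(self) -> Dict[str, Any]:
        # Flatten named options and extras into one dictionary.
        return {
            "encoding": self.encoding,
            "logging_level": self.logging_level,
            "timeout": self.timeout,
            "max_file_size": self.max_file_size,
            **self.extra_config,
        }
```

Validating in the constructor means a bad value fails at configuration time rather than in the middle of an extraction.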
SyncTextExtractor
Bases: TextExtractor
Synchronous text extractor with support for file paths and bytes.
Provides synchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Supports context manager protocol for proper cleanup.
Methods:
| Name | Description |
|---|---|
| `__enter__` | Context manager entry. |
| `__exit__` | Context manager exit. |
| `__init__` | |
| `extract` | Extract text synchronously from file path or bytes. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `config` | | |
Source code in textxtract/sync/extractor.py
Attributes
Functions
__enter__
__exit__
__init__
extract
Extract text synchronously from file path or bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Union[Path, str, bytes]` | File path (Path/str) or file bytes. | required |
| `filename` | `Optional[str]` | Required if source is bytes, optional for file paths. | `None` |
| `config` | `Optional[dict]` | Optional configuration overrides. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If filename is missing when source is bytes. |
| `FileTypeNotSupportedError` | If the file extension is not supported. |
| `ExtractionError` | If extraction fails. |
| `InvalidFileError` | If the file is invalid or corrupted. |
Source code in textxtract/sync/extractor.py
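The `filename` rule above (required for bytes, optional for paths) can be illustrated with a small sketch. `resolve_suffix` is a hypothetical helper, not a function from the package, and letting an explicit `filename` override the path's own name is an assumption.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_suffix(source: Union[Path, str, bytes], filename: Optional[str] = None) -> str:
    """Hypothetical helper: pick the extension used to choose a handler."""
    if isinstance(source, bytes):
        # Raw bytes carry no name, so the caller must supply one.
        if not filename:
            raise ValueError("filename is required when source is bytes")
        return Path(filename).suffix.lower()
    # Assumption: for paths, an explicit filename overrides the path's name.
    return Path(filename or source).suffix.lower()
```

Lower-casing the suffix lets `report.PDF` and `report.pdf` resolve to the same handler key.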
Modules
aio
Asynchronous extraction logic package.
Modules:
| Name | Description |
|---|---|
| `extractor` | Asynchronous text extraction logic with support for file paths and bytes. |
Classes:
| Name | Description |
|---|---|
| `AsyncTextExtractor` | Asynchronous text extractor with support for file paths and bytes. |
Modules
extractor
Asynchronous text extraction logic with support for file paths and bytes.
Classes:
| Name | Description |
|---|---|
| `AsyncTextExtractor` | Asynchronous text extractor with support for file paths and bytes. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `logger` | | |
Attributes
Classes
AsyncTextExtractor
Bases: TextExtractor
Asynchronous text extractor with support for file paths and bytes.
Provides asynchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Uses thread pool for I/O-bound operations.
Methods:
| Name | Description |
|---|---|
| `__aenter__` | Async context manager entry. |
| `__aexit__` | Async context manager exit with cleanup. |
| `__enter__` | Context manager entry. |
| `__exit__` | Context manager exit with cleanup. |
| `__init__` | |
| `extract` | Extract text asynchronously from file path or bytes using thread pool. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `config` | | |
Source code in textxtract/aio/extractor.py
__aenter__
async
__aexit__
async
__enter__
__exit__
__init__
Source code in textxtract/aio/extractor.py
extract
async
Extract text asynchronously from file path or bytes using thread pool.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Union[Path, str, bytes]` | File path (Path/str) or file bytes. | required |
| `filename` | `Optional[str]` | Required if source is bytes, optional for file paths. | `None` |
| `config` | `Optional[dict]` | Optional configuration overrides. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If filename is missing when source is bytes. |
| `FileTypeNotSupportedError` | If the file extension is not supported. |
| `ExtractionError` | If extraction fails. |
| `InvalidFileError` | If the file is invalid or corrupted. |
Source code in textxtract/aio/extractor.py
Functions
core
Core components for textxtract package.
Modules:
| Name | Description |
|---|---|
| `base` | Abstract base classes for text extraction. |
| `config` | Configuration and customization for textxtract package. |
| `exceptions` | Custom exceptions for textxtract package. |
| `logging_config` | Logging configuration for textxtract package. |
| `registry` | Handler registry for centralized handler management. |
| `utils` | Utility functions for textxtract package. |
Modules
base
Abstract base classes for text extraction.
Classes:
| Name | Description |
|---|---|
| `FileTypeHandler` | Abstract base class for file type-specific handlers. |
| `TextExtractor` | Abstract base class for text extractors. |
Classes
FileTypeHandler
Bases: ABC
Abstract base class for file type-specific handlers.
Methods:
| Name | Description |
|---|---|
| `extract` | Extract text synchronously from a file. |
| `extract_async` | Extract text asynchronously from a file. |
Source code in textxtract/core/base.py
extract
abstractmethod
extract_async
abstractmethod
async
TextExtractor
Bases: ABC
Abstract base class for text extractors.
Methods:
| Name | Description |
|---|---|
| `extract` | Extract text synchronously from file path or bytes. |
Source code in textxtract/core/base.py
extract
abstractmethod
Extract text synchronously from file path or bytes.
config
Configuration and customization for textxtract package.
Classes:
| Name | Description |
|---|---|
| `ExtractorConfig` | Enhanced configuration options for text extraction with validation. |
Classes
ExtractorConfig
Enhanced configuration options for text extraction with validation.
Methods:
| Name | Description |
|---|---|
| `__init__` | |
| `__repr__` | |
| `from_file` | Load configuration from a file (JSON, YAML, or TOML). |
| `get_handler` | Retrieve a handler for a given file extension. |
| `get_handler_config` | Get configuration specific to a handler. |
| `register_handler` | Register a custom file type handler. |
| `to_dict` | Convert configuration to dictionary. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `custom_handlers` | | |
| `encoding` | | |
| `extra_config` | | |
| `logging_format` | | |
| `logging_level` | | |
| `max_file_size` | | |
| `max_memory_usage` | | |
| `timeout` | | |
Source code in textxtract/core/config.py
logging_format
instance-attribute
__init__
__init__(encoding='utf-8', logging_level='INFO', logging_format=None, timeout=None, max_file_size=None, max_memory_usage=None, custom_handlers=None, **kwargs)
Source code in textxtract/core/config.py
__repr__
from_file
classmethod
Load configuration from a file (JSON, YAML, or TOML).
Source code in textxtract/core/config.py
get_handler
get_handler_config
Get configuration specific to a handler.
Source code in textxtract/core/config.py
register_handler
Register a custom file type handler.
to_dict
Convert configuration to dictionary.
Source code in textxtract/core/config.py
exceptions
Custom exceptions for textxtract package.
Classes:
| Name | Description |
|---|---|
| `ExtractionError` | Raised when a general extraction error occurs. |
| `ExtractionTimeoutError` | Raised when extraction exceeds the allowed timeout. |
| `FileTypeNotSupportedError` | Raised when the file type is not supported. |
| `InvalidFileError` | Raised when the file is invalid or unsupported. |
Classes
ExtractionError
ExtractionTimeoutError
Bases: ExtractionError
Raised when extraction exceeds the allowed timeout.
FileTypeNotSupportedError
Bases: ExtractionError
Raised when the file type is not supported.
InvalidFileError
Bases: ExtractionError
Raised when the file is invalid or unsupported.
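Because the three specific exceptions all derive from `ExtractionError`, a caller can catch specific failures first and fall back to the base class. The sketch below redefines stand-in classes with the same names so it runs without the package installed. That `ExtractionError` derives directly from `Exception` is an assumption, and `classify` is a hypothetical helper.

```python
class ExtractionError(Exception):
    """Stand-in for the package's base extraction error (base class assumed)."""


class ExtractionTimeoutError(ExtractionError):
    """Stand-in: extraction exceeded the allowed timeout."""


class FileTypeNotSupportedError(ExtractionError):
    """Stand-in: the file type is not supported."""


class InvalidFileError(ExtractionError):
    """Stand-in: the file is invalid or unsupported."""


def classify(exc: Exception) -> str:
    """Hypothetical helper: map an extraction failure to a short label."""
    try:
        raise exc
    # Subclasses must be listed before the base class, or the base
    # clause would swallow every extraction failure.
    except FileTypeNotSupportedError:
        return "unsupported"
    except InvalidFileError:
        return "invalid"
    except ExtractionTimeoutError:
        return "timeout"
    except ExtractionError:
        return "failed"
```

Exceptions outside this hierarchy, such as the `ValueError` raised for a missing filename, are not handled by `classify` and propagate to the caller.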
logging_config
Logging configuration for textxtract package.
Functions:
| Name | Description |
|---|---|
| `setup_logging` | Configure logging for the package. |
Functions
setup_logging
Configure logging for the package.
registry
Handler registry for centralized handler management.
Classes:
| Name | Description |
|---|---|
| `HandlerRegistry` | Central registry for file type handlers with caching and lazy loading. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `logger` | | |
| `registry` | | |
Attributes
Classes
HandlerRegistry
Central registry for file type handlers with caching and lazy loading.
Methods:
| Name | Description |
|---|---|
| `__init__` | |
| `__new__` | |
| `get_handler` | Get handler instance for file extension with caching. |
| `get_supported_extensions` | Get list of all supported file extensions. |
| `is_supported` | Check if a file extension is supported. |
| `register_handler` | Register a custom handler for a file extension. |
Source code in textxtract/core/registry.py
__init__
__new__
get_handler
cached
Get handler instance for file extension with caching.
Source code in textxtract/core/registry.py
get_supported_extensions
is_supported
register_handler
Register a custom handler for a file extension.
Source code in textxtract/core/registry.py
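The registry is described as central, cached, and lazily loaded, and it defines `__new__`. One common way to get that behaviour is a singleton that stores handler factories and builds each handler on first request. The sketch below illustrates that pattern only: `SketchRegistry` is hypothetical, its internals are assumptions, and it raises `KeyError` where the real registry may raise a package-specific exception.

```python
from typing import Callable, List


class SketchRegistry:
    """Hypothetical singleton registry; illustrates the pattern only."""

    _instance = None

    def __new__(cls):
        # Always hand back the same instance so registrations are shared.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._factories = {}
            cls._instance._cache = {}
        return cls._instance

    def register_handler(self, ext: str, factory: Callable[[], object]) -> None:
        key = ext.lower()
        self._factories[key] = factory
        self._cache.pop(key, None)  # drop any stale cached instance

    def is_supported(self, ext: str) -> bool:
        return ext.lower() in self._factories

    def get_supported_extensions(self) -> List[str]:
        return sorted(self._factories)

    def get_handler(self, ext: str) -> object:
        key = ext.lower()
        if key not in self._factories:
            raise KeyError(f"No handler registered for {ext!r}")
        if key not in self._cache:
            # Lazy: build the handler on first request, then reuse it.
            self._cache[key] = self._factories[key]()
        return self._cache[key]
```

Storing factories rather than instances keeps registration cheap, since a handler with a heavy dependency is only constructed when a file of that type is actually requested.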
utils
Utility functions for textxtract package.
Classes:
| Name | Description |
|---|---|
| `FileInfo` | File information data class. |
Functions:
| Name | Description |
|---|---|
| `create_temp_file` | Create a temporary file from bytes and return its path with security validation. |
| `get_file_info` | Get file information for logging and debugging. |
| `safe_unlink` | Safely delete a file if it exists, optionally logging errors. |
| `validate_file_extension` | Check if the file has an allowed extension. |
| `validate_file_size` | Validate file size doesn't exceed limits. |
| `validate_filename` | Validate filename for security issues. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `DEFAULT_MAX_FILE_SIZE` | | |
| `DEFAULT_MAX_TEMP_FILES` | | |
Attributes
Classes
FileInfo
dataclass
Functions
create_temp_file
Create a temporary file from bytes and return its path with security validation.
Source code in textxtract/core/utils.py
get_file_info
Get file information for logging and debugging.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Union[Path, str, bytes]` | File path or file bytes. | required |
| `filename` | `Optional[str]` | Required if source is bytes, optional for file paths. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `FileInfo` | `FileInfo` | Data class with file information. |
Source code in textxtract/core/utils.py
safe_unlink
Safely delete a file if it exists, optionally logging errors.
Source code in textxtract/core/utils.py
validate_file_extension
validate_file_size
Validate file size doesn't exceed limits.
Source code in textxtract/core/utils.py
validate_filename
Validate filename for security issues.
Source code in textxtract/core/utils.py
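The page does not list which security issues `validate_filename` checks. As a rough illustration of the kind of checks such a function typically performs, the sketch below rejects empty names, null bytes, over-long names, and any directory component. `validate_filename_sketch`, its rules, and the 255-character limit are all assumptions, not the package's behaviour.

```python
from pathlib import PurePosixPath, PureWindowsPath


def validate_filename_sketch(filename: str, max_length: int = 255) -> str:
    """Hypothetical validator; the rules here are illustrative assumptions."""
    if not filename or "\x00" in filename:
        raise ValueError("filename is empty or contains a null byte")
    if len(filename) > max_length:
        raise ValueError("filename is too long")
    if filename in {".", ".."}:
        raise ValueError("filename must name a file, not a directory reference")
    # A bare file name is its own final component under both POSIX and
    # Windows rules; anything else carries a directory or drive part.
    if PurePosixPath(filename).name != filename:
        raise ValueError("filename must not contain directory components")
    if PureWindowsPath(filename).name != filename:
        raise ValueError("filename must not contain directory components")
    return filename
```

Checking against both path flavours matters because a name supplied alongside uploaded bytes may use either separator regardless of the server's platform.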
exceptions
Classes:
| Name | Description |
|---|---|
| `ExtractionError` | Raised when a general extraction error occurs. |
| `ExtractionTimeoutError` | Raised when extraction exceeds the allowed timeout. |
| `FileTypeNotSupportedError` | Raised when the file type is not supported. |
| `InvalidFileError` | Raised when the file is invalid or unsupported. |
Attributes
__all__
module-attribute
__all__ = ['ExtractionError', 'InvalidFileError', 'FileTypeNotSupportedError', 'ExtractionTimeoutError']
Classes
ExtractionError
ExtractionTimeoutError
Bases: ExtractionError
Raised when extraction exceeds the allowed timeout.
FileTypeNotSupportedError
Bases: ExtractionError
Raised when the file type is not supported.
InvalidFileError
Bases: ExtractionError
Raised when the file is invalid or unsupported.
handlers
File type-specific handlers package.
Modules:
| Name | Description |
|---|---|
| `csv` | CSV file handler for text extraction. |
| `doc` | DOC file handler for text extraction. |
| `docx` | DOCX file handler for comprehensive text extraction. |
| `html` | HTML file handler for text extraction. |
| `json` | JSON file handler for text extraction. |
| `md` | Markdown (.md) file handler for text extraction. |
| `pdf` | PDF file handler for text extraction. |
| `rtf` | RTF file handler for text extraction. |
| `txt` | TXT file handler for text extraction. |
| `xml` | XML file handler for text extraction. |
| `zip` | ZIP file handler for text extraction. |
Modules
csv
CSV file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `CSVHandler` | Handler for extracting text from CSV files. |
Classes
CSVHandler
Bases: FileTypeHandler
Handler for extracting text from CSV files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/csv.py
extract
Source code in textxtract/handlers/csv.py
doc
DOC file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `DOCHandler` | Handler for extracting text from DOC files with fallback options. |
Classes
DOCHandler
Bases: FileTypeHandler
Handler for extracting text from DOC files with fallback options.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/doc.py
extract
Source code in textxtract/handlers/doc.py
docx
DOCX file handler for comprehensive text extraction.
This handler extracts text from:
- Document paragraphs
- Tables and cells
- Headers and footers
- Text boxes and shapes
- Footnotes and endnotes (if available)
Classes:
| Name | Description |
|---|---|
| `DOCXHandler` | Enhanced handler for comprehensive text extraction from DOCX files. |
Classes
DOCXHandler
Bases: FileTypeHandler
Enhanced handler for comprehensive text extraction from DOCX files.
This handler provides complete text extraction from Microsoft Word documents,
including all document elements such as paragraphs, tables, headers, footers,
text boxes, and footnotes. It's designed to handle complex document layouts
commonly found in resumes, reports, and structured documents.
Features:
- Extracts text from document body paragraphs
- Processes table content with cell-by-cell extraction
- Captures header and footer text from all sections
- Attempts to extract text from embedded text boxes and shapes
- Handles footnotes and endnotes when available
- Deduplicates repeated content
- Cleans and normalizes extracted text
Example:
>>> from pathlib import Path
>>> handler = DOCXHandler()
>>> text = handler.extract(Path("document.docx"))
>>> print(text)
Document title
Paragraph content... Table data | Column 2...
>>> # Async extraction (must run inside a coroutine)
>>> text = await handler.extract_async(Path("document.docx"))
Methods:
| Name | Description |
|---|---|
| `extract` | Extract text from a DOCX file with comprehensive content capture. |
| `extract_async` | Asynchronously extract text from a DOCX file. |
Source code in textxtract/handlers/docx.py
extract
Extract text from a DOCX file with comprehensive content capture.
Performs thorough text extraction from all available document elements including body text, tables, headers, footers, and embedded content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to the DOCX file to extract text from. | required |
| `config` | `Optional[dict]` | Configuration options for extraction. Currently not used but reserved for future enhancements. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted and cleaned text from the document with proper formatting. Returns empty string if no text is found. |
Raises:
| Type | Description |
|---|---|
| `ExtractionError` | If the file cannot be read or processed, or if the python-docx library is not available. |
Note
- Text is deduplicated to avoid repeated content from overlapping elements
- Table content is formatted with pipe separators between columns
- Special content (footnotes, text boxes) is labeled with descriptive tags
- Sentence breaks are automatically inserted for better readability
Source code in textxtract/handlers/docx.py
extract_async
async
Asynchronously extract text from a DOCX file.
Provides non-blocking text extraction by running the synchronous extraction method in a separate thread.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to the DOCX file to extract text from. | required |
| `config` | `Optional[dict]` | Configuration options for extraction. Currently not used but reserved for future enhancements. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted and cleaned text from the document with proper formatting. Returns empty string if no text is found. |
Raises:
| Type | Description |
|---|---|
| `ExtractionError` | If the file cannot be read or processed, or if the python-docx library is not available. |
Note
This method uses asyncio.to_thread() to run the synchronous extraction in a thread pool, making it suitable for async/await usage patterns.
Source code in textxtract/handlers/docx.py
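The note above says `extract_async` runs the synchronous extraction through `asyncio.to_thread()`. The sketch below shows that delegation pattern using a hypothetical `PlainTextHandler` that simply reads a text file. The class, its `encoding` config key, and the `demo` helper are assumptions for illustration and are not part of the package.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Optional


class PlainTextHandler:
    """Hypothetical stand-in for a handler; not part of textxtract."""

    def extract(self, file_path: Path, config: Optional[dict] = None) -> str:
        # Blocking read; a real handler would parse the document format here.
        encoding = (config or {}).get("encoding", "utf-8")
        return file_path.read_text(encoding=encoding)

    async def extract_async(self, file_path: Path, config: Optional[dict] = None) -> str:
        # Run the blocking method in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.extract, file_path, config)


def demo() -> str:
    """Write a small file, then extract it through the async path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "note.txt"
        path.write_text("hello from a worker thread", encoding="utf-8")
        return asyncio.run(PlainTextHandler().extract_async(path))
```

Delegating to the synchronous method keeps one code path for the actual parsing, at the cost of occupying a worker thread for the duration of each extraction. `asyncio.to_thread()` requires Python 3.9 or later.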
html
HTML file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `HTMLHandler` | Handler for extracting text from HTML files. |
Classes
HTMLHandler
Bases: FileTypeHandler
Handler for extracting text from HTML files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/html.py
extract
Source code in textxtract/handlers/html.py
json
JSON file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `JSONHandler` | Handler for extracting text from JSON files. |
Classes
JSONHandler
Bases: FileTypeHandler
Handler for extracting text from JSON files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/json.py
extract
Source code in textxtract/handlers/json.py
md
Markdown (.md) file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `MDHandler` | Handler for extracting text from Markdown files. |
Classes
MDHandler
Bases: FileTypeHandler
Handler for extracting text from Markdown files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/md.py
extract
Source code in textxtract/handlers/md.py
pdf
PDF file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `PDFHandler` | Handler for extracting text from PDF files with improved error handling. |
Classes
PDFHandler
Bases: FileTypeHandler
Handler for extracting text from PDF files with improved error handling.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/pdf.py
extract
Source code in textxtract/handlers/pdf.py
rtf
RTF file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `RTFHandler` | Handler for extracting text from RTF files. |
Classes
RTFHandler
Bases: FileTypeHandler
Handler for extracting text from RTF files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/rtf.py
extract
Source code in textxtract/handlers/rtf.py
txt
TXT file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `TXTHandler` | Handler for extracting text from TXT files. |
Classes
TXTHandler
Bases: FileTypeHandler
Handler for extracting text from TXT files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/txt.py
extract
Source code in textxtract/handlers/txt.py
xml
XML file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `XMLHandler` | Handler for extracting text from XML files. |
Classes
XMLHandler
Bases: FileTypeHandler
Handler for extracting text from XML files.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Source code in textxtract/handlers/xml.py
extract
Source code in textxtract/handlers/xml.py
zip
ZIP file handler for text extraction.
Classes:
| Name | Description |
|---|---|
| `ZIPHandler` | Handler for extracting text from ZIP archives with security checks. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `logger` | | |
Attributes
Classes
ZIPHandler
Bases: FileTypeHandler
Handler for extracting text from ZIP archives with security checks.
Methods:
| Name | Description |
|---|---|
| `extract` | |
| `extract_async` | |
Attributes:
| Name | Type | Description |
|---|---|---|
| `MAX_EXTRACT_SIZE` | | |
| `MAX_FILES` | | |
Source code in textxtract/handlers/zip.py
extract
Source code in textxtract/handlers/zip.py
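The handler exposes `MAX_FILES` and `MAX_EXTRACT_SIZE` and is described as applying security checks, but this page does not show their values or the checks themselves. The sketch below shows one plausible set of pre-extraction checks using the standard `zipfile` module. The limit values, the `check_archive` and `make_zip` helpers, and the use of `ValueError` are all assumptions.

```python
import io
import zipfile
from typing import Dict, List

MAX_FILES = 100                 # assumed limit, not the package's actual value
MAX_EXTRACT_SIZE = 1024 * 1024  # assumed limit in bytes


def check_archive(data: bytes) -> List[str]:
    """Hypothetical pre-flight check; returns the member files if safe."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if len(infos) > MAX_FILES:
            raise ValueError("too many files in archive")
        # Declared uncompressed sizes guard against oversized expansion.
        if sum(info.file_size for info in infos) > MAX_EXTRACT_SIZE:
            raise ValueError("archive expands beyond the size limit")
        for info in infos:
            parts = info.filename.replace("\\", "/").split("/")
            # Absolute paths and ".." components could escape the target dir.
            if info.filename.startswith(("/", "\\")) or ".." in parts:
                raise ValueError(f"unsafe member path: {info.filename!r}")
        return [info.filename for info in infos if not info.is_dir()]


def make_zip(members: Dict[str, str]) -> bytes:
    """Build a small in-memory archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()
```

The size check trusts the sizes declared in the archive headers, so a stricter implementation would also count bytes while reading each member.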
sync
Synchronous extraction logic package.
Modules:
| Name | Description |
|---|---|
| `extractor` | Synchronous text extraction logic with support for file paths and bytes. |
Modules
extractor
Synchronous text extraction logic with support for file paths and bytes.
Classes:
| Name | Description |
|---|---|
| `SyncTextExtractor` | Synchronous text extractor with support for file paths and bytes. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `logger` | | |
Attributes
Classes
SyncTextExtractor
Bases: TextExtractor
Synchronous text extractor with support for file paths and bytes.
Provides synchronous text extraction from various file types. Logs debug and info level messages for tracing and diagnostics. Supports context manager protocol for proper cleanup.
Methods:
| Name | Description |
|---|---|
| `__enter__` | Context manager entry. |
| `__exit__` | Context manager exit. |
| `__init__` | |
| `extract` | Extract text synchronously from file path or bytes. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `config` | | |
Source code in textxtract/sync/extractor.py
__enter__
__exit__
__init__
extract
Extract text synchronously from file path or bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Union[Path, str, bytes]` | File path (Path/str) or file bytes. | required |
| `filename` | `Optional[str]` | Required if source is bytes, optional for file paths. | `None` |
| `config` | `Optional[dict]` | Optional configuration overrides. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If filename is missing when source is bytes. |
| `FileTypeNotSupportedError` | If the file extension is not supported. |
| `ExtractionError` | If extraction fails. |
| `InvalidFileError` | If the file is invalid or corrupted. |