OpenXML (Office) format (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx, .xltm, .pptx, .pptm, .ppsx, .potx)
The OpenXML format handles Microsoft Office Open XML documents: Word
(.docx), Excel (.xlsx), and PowerPoint (.pptx). All three share the
ZIP-and-XML package defined by ECMA-376 / ISO/IEC 29500, so one filter
serves them and auto-detects the variant from the package contents.
Extraction is selective. By default the reader pulls the main body text plus the parts that are normally user-visible: document properties, headers and footers, footnotes and endnotes, and hyperlink text in Word; shared strings in Excel; and speaker notes in PowerPoint. Hidden, structural, or rarely translated content (comments, hidden text, hidden slides, slide masters, sheet names, charts, and SmartArt diagrams) is left out unless you opt in. A set of style and colour filters lets you include or exclude runs by paragraph/character style or highlight colour, and exclude them by font colour. This mirrors Okapi's OpenXML filter, including its default of accepting tracked changes before extraction.
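As a rough illustration of the style and colour filters, the sketch below excludes two styles, one highlight colour, and one font colour. The parameter names are taken from the parameter table; the style names `Code` and `DoNotTranslate` are hypothetical, and the colour values are the ones used as examples in the parameter descriptions.

```yaml
# Hypothetical style names: replace them with styles defined in your document.
excludeStyles:
  - Code
  - DoNotTranslate
# Highlight colours are given by name.
excludeHighlightColors:
  - yellow
# Font colours are hex RGB; quoted so the value is always read as a string.
excludeColors:
  - "FF0000"
```

Quoting the hex value is a defensive choice: an all-digit colour such as 000000 would otherwise be parsed by YAML as a number rather than a string.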
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| aggressiveCleanup | boolean | true | Strip revision IDs, proofing errors, and other noise before merging runs. |
| automaticallyAcceptRevisions | boolean | true | Automatically accept tracked changes before extraction. When true (default, matching Okapi), inserted runs are kept and deleted runs are dropped; rows marked with `<w:trPr><w:del/>` (ECMA-376 §17.13.5.13) are removed entirely; rows marked with `<w:trPr><w:ins/>` (§17.13.5.16) are kept. |
| codeFinderRules | array | | Regex patterns that match inline codes within translatable text. |
| complexFieldDefinitionsToExtract | array | | Field instruction prefixes to extract (e.g., "HYPERLINK", "REF"). |
| excludeColors | array | | Font colors to exclude (hex RGB, e.g., "FF0000" for red). |
| excludedColumns | array | | Column letters to exclude (e.g., "A", "C", "AA"). |
| excludedSheets | array | | Sheet names to exclude from extraction. |
| excludeHighlightColors | array | | Highlight colors to exclude (e.g., "yellow", "red"). |
| excludeStyles | array | | Paragraph/character style names to exclude. |
| extractRunFontsInfo | boolean | false | Emit font metadata as annotations on blocks. |
| fontMappings | object | | Font name to script group mapping (e.g., "MS Gothic": "ja"). |
| ignoreSoftHyphenTag | boolean | false | Ignore soft hyphen tags in the document. |
| includedSlides | array | | If non-empty, only extract these slide numbers (1-based). |
| includeHighlightColors | array | | If non-empty, only extract text with these highlight colors. |
| includeStyles | array | | If non-empty, only extract text with these styles. |
| lineSeparatorReplacement | string | | Replacement string for line separator characters. |
| replaceLineSeparator | boolean | false | Replace Unicode line separator (U+2028) in output. |
| replaceNoBreakHyphenTag | boolean | false | Replace no-break hyphen tags with the non-breaking hyphen character. |
| tabAsCharacter | boolean | false | Treat tab elements as tab characters instead of placeholder spans. |
| translateCharts | boolean | false | Extract strings from embedded charts. |
| translateComments | boolean | false | Extract comment text from Word documents. |
| translateDiagrams | boolean | false | Extract text from SmartArt diagrams. |
| translateDocProperties | boolean | true | Extract title, subject, keywords from document properties. |
| translateFootnotes | boolean | true | Extract footnotes and endnotes from Word documents. |
| translateHeadersFooters | boolean | true | Extract text from headers and footers in Word documents. |
| translateHiddenSlides | boolean | false | Extract content from hidden slides in PowerPoint. |
| translateHiddenText | boolean | false | Extract text with the vanish (hidden) property in Word documents. |
| translateHyperlinks | boolean | true | Extract hyperlink text for translation. |
| translateSharedStrings | boolean | true | Extract shared strings from Excel workbooks. |
| translateSheetNames | boolean | false | Extract sheet names in Excel workbooks. |
| translateSlideMasters | boolean | false | Extract text from slide masters in PowerPoint. |
| translateSlideNotes | boolean | true | Extract speaker notes in PowerPoint presentations. |
| useCodeFinder | boolean | false | Enable pattern-based detection of inline codes (placeholders, tags, etc.). |
Configure these parameters interactively and copy the YAML on the Format Reference.
Examples
Translate everything in a Word document
Include comments and hidden text, which are off by default.

    translateHiddenText: true
    translateComments: true
Limit an Excel workbook to selected sheets
Skip data and lookup sheets during extraction.

    excludedSheets:
      - Data
      - Lookups
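Sheet exclusion can presumably be combined with the other Excel options in the parameter table. The sketch below also skips two columns and opts in to sheet-name extraction. The source does not say whether `excludedColumns` applies per sheet or workbook-wide, so this assumes it applies to every extracted sheet; the sheet name is hypothetical.

```yaml
# Hypothetical sheet name; column letters follow the examples in the parameter table.
excludedSheets:
  - Data
# Assumption: these columns are skipped on every sheet that is extracted.
excludedColumns:
  - A
  - C
# Off by default; turn on to translate the sheet tab names as well.
translateSheetNames: true
```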
Extract only specific PowerPoint slides
Translate slides 1, 2, and 5 and skip the rest.

    includedSlides:
      - 1
      - 2
      - 5
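The code finder options from the parameter table can be sketched the same way. The example below turns on `useCodeFinder` and supplies three illustrative patterns. The regex dialect the filter uses is not stated here, so the patterns stick to widely supported syntax and should be treated as assumptions to verify against your content.

```yaml
useCodeFinder: true
# Single quotes keep the backslashes literal in YAML.
codeFinderRules:
  - '\{\d+\}'      # numbered placeholders such as {0} or {12}
  - '%[sd]'        # printf-style %s and %d
  - '</?[a-z]+>'   # simple inline tags such as <b> and </b>
```

Text matching a rule is treated as an inline code within the translatable text, which is what keeps placeholders from being handed to translators as ordinary words.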
Processing notes
The document variant (Word / Excel / PowerPoint) is auto-detected from the package contents.
The writer is faithful by default — source run properties are preserved inline rather than rewritten into synthesised styles.
Limitations
One filter covers Word, Excel, and PowerPoint; some options apply only to the relevant variant (for example, the slide options affect PowerPoint only).
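Because one filter serves all three variants, a single configuration could in principle be shared across a mixed batch of files. The sketch below groups options by the variant named in their descriptions. It assumes that options for a different variant are ignored without error, which the text above implies but does not state outright.

```yaml
# Word only (per the parameter descriptions)
translateComments: true
translateHiddenText: true

# Excel only
translateSheetNames: true

# PowerPoint only: assumed to have no effect on Word or Excel files
translateHiddenSlides: true
translateSlideMasters: true
```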