OpenXML (Office) format (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx, .xltm, .pptx, .pptm, .ppsx, .potx)

The OpenXML format handles Microsoft Office Open XML documents (ECMA-376 / ISO/IEC 29500): Word (.docx), Excel (.xlsx), and PowerPoint (.pptx). All three share a ZIP-and-XML package, so one filter serves them and auto-detects the variant from the package contents.

Extraction is selective. By default the reader pulls the main body text plus the parts that are normally user-visible: document properties, headers and footers, footnotes and endnotes, and hyperlink text in Word; shared strings in Excel; and speaker notes in PowerPoint. Hidden, structural, or rarely translated content (comments, hidden text, sheet names, slide masters, hidden slides, charts, and SmartArt diagrams) is left out unless you opt in. A set of style and colour filters lets you include or exclude runs by paragraph or character style or by highlight colour, and exclude runs by font colour. This mirrors Okapi's OpenXML filter, including its default of accepting tracked changes before extraction.

ID: openxml

Source: Built-in

Extensions: .docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx, .xltm, .pptx, .pptm, .ppsx, .potx

MIME Types: application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Capabilities: Read + Write

Parameters

Parameter	Type	Default	Description
`aggressiveCleanup`	boolean	true	Strip revision IDs, proofing errors, and other noise before merging runs.
`automaticallyAcceptRevisions`	boolean	true	Automatically accept tracked changes before extraction. When true (default, matching Okapi), inserted runs are kept and deleted runs are dropped; rows marked with <w:trPr><w:del/> (ECMA-376 §17.13.5.13) are removed entirely; rows marked with <w:trPr><w:ins/> (§17.13.5.16) are kept.
`codeFinderRules`	array		Regex patterns that match inline codes within translatable text.
`complexFieldDefinitionsToExtract`	array		Field instruction prefixes to extract (e.g., "HYPERLINK", "REF").
`excludeColors`	array		Font colors to exclude (hex RGB, e.g., "FF0000" for red).
`excludedColumns`	array		Column letters to exclude (e.g., "A", "C", "AA").
`excludedSheets`	array		Sheet names to exclude from extraction.
`excludeHighlightColors`	array		Highlight colors to exclude (e.g., "yellow", "red").
`excludeStyles`	array		Paragraph/character style names to exclude.
`extractRunFontsInfo`	boolean	false	Emit font metadata as annotations on blocks.
`fontMappings`	object		Font name to script group mapping (e.g., "MS Gothic": "ja").
`ignoreSoftHyphenTag`	boolean	false	Ignore soft hyphen tags in the document.
`includedSlides`	array		If non-empty, only extract these slide numbers (1-based).
`includeHighlightColors`	array		If non-empty, only extract text with these highlight colors.
`includeStyles`	array		If non-empty, only extract text with these styles.
`lineSeparatorReplacement`	string		Replacement string for line separator characters.
`replaceLineSeparator`	boolean	false	Replace Unicode line separator (U+2028) in output.
`replaceNoBreakHyphenTag`	boolean	false	Replace no-break hyphen tags with the non-breaking hyphen character.
`tabAsCharacter`	boolean	false	Treat tab elements as tab characters instead of placeholder spans.
`translateCharts`	boolean	false	Extract strings from embedded charts.
`translateComments`	boolean	false	Extract comment text from Word documents.
`translateDiagrams`	boolean	false	Extract text from SmartArt diagrams.
`translateDocProperties`	boolean	true	Extract title, subject, keywords from document properties.
`translateFootnotes`	boolean	true	Extract footnotes and endnotes from Word documents.
`translateHeadersFooters`	boolean	true	Extract text from headers and footers in Word documents.
`translateHiddenSlides`	boolean	false	Extract content from hidden slides in PowerPoint.
`translateHiddenText`	boolean	false	Extract text with the vanish (hidden) property in Word documents.
`translateHyperlinks`	boolean	true	Extract hyperlink text for translation.
`translateSharedStrings`	boolean	true	Extract shared strings from Excel workbooks.
`translateSheetNames`	boolean	false	Extract sheet names in Excel workbooks.
`translateSlideMasters`	boolean	false	Extract text from slide masters in PowerPoint.
`translateSlideNotes`	boolean	true	Extract speaker notes in PowerPoint presentations.
`useCodeFinder`	boolean	false	Enable pattern-based detection of inline codes (placeholders, tags, etc.).

Configure these parameters interactively and copy the YAML on the Format Reference.

Examples

Translate everything in a Word document

Include comments and hidden text, which are off by default.

translateHiddenText: true
translateComments: true
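
Filter Word text by style and highlight colour

The style and colour filters from the parameter table can be combined in the same way. This is a minimal sketch rather than a recommended setup: the style names `Code` and `Internal Note` are hypothetical and must match style names in your own document, and how the include and exclude lists interact when both are set is not specified here.

```yaml
# Hypothetical style names; replace them with styles from your document.
excludeStyles:
  - Code
  - Internal Note
# Highlight colour names follow the examples in the parameter table.
excludeHighlightColors:
  - yellow
# Font colours are hex RGB, as in the parameter table's example.
excludeColors:
  - "FF0000"
```

Quoting the hex value keeps YAML from reading an all-digit colour such as 000000 as a number.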

Limit an Excel workbook to selected sheets

Skip data and lookup sheets during extraction.

excludedSheets:
  - Data
  - Lookups
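
Skip columns and translate sheet names in Excel

A sketch of two further Excel options from the parameter table, shown together for illustration only. Whether `excludedColumns` applies to every sheet or can be scoped to a single sheet is not stated here, so treat the workbook-wide reading as an assumption.

```yaml
# Column letters as in the parameter table's examples.
# Assumption: the exclusion applies to all sheets in the workbook.
excludedColumns:
  - A
  - AA
# Sheet names are not extracted by default.
translateSheetNames: true
```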

Extract only specific PowerPoint slides

Translate slides 1, 2, and 5 and skip the rest.

includedSlides:
  - 1
  - 2
  - 5
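
Protect placeholders with the code finder

A sketch of enabling the code finder so that placeholders are treated as inline codes instead of translatable text. The pattern is illustrative only, and both the rule format (one regular expression per list item) and the regex dialect are assumptions; check them against your installation before relying on them.

```yaml
useCodeFinder: true
codeFinderRules:
  # Hypothetical rule: numbered placeholders such as {0} or {12}.
  - '\{\d+\}'
```

Single quotes keep the backslashes literal in YAML, so the pattern reaches the filter unchanged.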

Processing notes

The document variant (Word / Excel / PowerPoint) is auto-detected from the package contents.
The writer is faithful by default — source run properties are preserved inline rather than rewritten into synthesised styles.

Limitations

One filter covers Word, Excel, and PowerPoint; some options apply only to the relevant variant (for example, the slide options affect PowerPoint only).
