OpenXML (Office) format (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx, .xltm, .pptx, .pptm, .ppsx, .potx)

The OpenXML format handles Microsoft Office Open XML documents: Word (.docx), Excel (.xlsx), and PowerPoint (.pptx), along with their macro-enabled and template variants. All three share the ZIP-and-XML package defined by ECMA-376 / ISO/IEC 29500, so one filter serves them and auto-detects the variant from the package contents.

Extraction is selective. By default the reader pulls the main body text plus the parts that are normally user-visible: document properties, headers and footers, footnotes and endnotes, and hyperlink text in Word; shared strings in Excel; and speaker notes in PowerPoint. Hidden, structural, or rarely translated content (comments, hidden text, hidden slides, slide masters, charts, SmartArt diagrams, sheet names) is left out unless you opt in. A set of style and colour filters lets you include or exclude runs by paragraph/character style, font colour, or highlight colour. This mirrors Okapi's OpenXML filter, including its default of accepting tracked changes before extraction.

ID: openxml
Source: Built-in
Extensions: .docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx, .xltm, .pptx, .pptm, .ppsx, .potx
MIME Types: application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Capabilities: Read + Write

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| aggressiveCleanup | boolean | true | Strip revision IDs, proofing errors, and other noise before merging runs. |
| automaticallyAcceptRevisions | boolean | true | Automatically accept tracked changes before extraction. When true (default, matching Okapi), inserted runs are kept and deleted runs are dropped; rows marked with <w:trPr><w:del/> (ECMA-376 §17.13.5.13) are removed entirely; rows marked with <w:trPr><w:ins/> (§17.13.5.16) are kept. |
| codeFinderRules | array | | Regex patterns that match inline codes within translatable text. |
| complexFieldDefinitionsToExtract | array | | Field instruction prefixes to extract (e.g., "HYPERLINK", "REF"). |
| excludeColors | array | | Font colors to exclude (hex RGB, e.g., "FF0000" for red). |
| excludedColumns | array | | Column letters to exclude (e.g., "A", "C", "AA"). |
| excludedSheets | array | | Sheet names to exclude from extraction. |
| excludeHighlightColors | array | | Highlight colors to exclude (e.g., "yellow", "red"). |
| excludeStyles | array | | Paragraph/character style names to exclude. |
| extractRunFontsInfo | boolean | false | Emit font metadata as annotations on blocks. |
| fontMappings | object | | Font name to script group mapping (e.g., "MS Gothic": "ja"). |
| ignoreSoftHyphenTag | boolean | false | Ignore soft hyphen tags in the document. |
| includedSlides | array | | If non-empty, only extract these slide numbers (1-based). |
| includeHighlightColors | array | | If non-empty, only extract text with these highlight colors. |
| includeStyles | array | | If non-empty, only extract text with these styles. |
| lineSeparatorReplacement | string | | Replacement string for line separator characters. |
| replaceLineSeparator | boolean | false | Replace Unicode line separator (U+2028) in output. |
| replaceNoBreakHyphenTag | boolean | false | Replace no-break hyphen tags with the non-breaking hyphen character. |
| tabAsCharacter | boolean | false | Treat tab elements as tab characters instead of placeholder spans. |
| translateCharts | boolean | false | Extract strings from embedded charts. |
| translateComments | boolean | false | Extract comment text from Word documents. |
| translateDiagrams | boolean | false | Extract text from SmartArt diagrams. |
| translateDocProperties | boolean | true | Extract title, subject, keywords from document properties. |
| translateFootnotes | boolean | true | Extract footnotes and endnotes from Word documents. |
| translateHeadersFooters | boolean | true | Extract text from headers and footers in Word documents. |
| translateHiddenSlides | boolean | false | Extract content from hidden slides in PowerPoint. |
| translateHiddenText | boolean | false | Extract text with the vanish (hidden) property in Word documents. |
| translateHyperlinks | boolean | true | Extract hyperlink text for translation. |
| translateSharedStrings | boolean | true | Extract shared strings from Excel workbooks. |
| translateSheetNames | boolean | false | Extract sheet names in Excel workbooks. |
| translateSlideMasters | boolean | false | Extract text from slide masters in PowerPoint. |
| translateSlideNotes | boolean | true | Extract speaker notes in PowerPoint presentations. |
| useCodeFinder | boolean | false | Enable pattern-based detection of inline codes (placeholders, tags, etc.). |

You can configure these parameters interactively and copy the resulting YAML from the Format Reference.
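
Some parameters appear to work as a pair. The fragment below is a minimal sketch of switching on the code finder and supplying rules for it. It assumes that each entry in codeFinderRules is a plain regular-expression string; the parameter list does not state the regex dialect or the exact entry format, so treat both patterns as illustrative.

```yaml
# Assumption: each rule is a bare regex string.
# Single quotes keep the backslashes literal in YAML.
useCodeFinder: true
codeFinderRules:
  - '\{\d+\}'   # numbered placeholders such as {0}
  - '%[sd]'     # printf-style %s and %d
```

Single-quoted scalars are used so that YAML does not try to interpret the backslash escapes itself.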

Examples

Translate everything in a Word document

Include comments and hidden text, which are off by default.

translateHiddenText: true
translateComments: true
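
Skip runs by style or colour in a Word document

As a sketch of the style and colour filters described above, the fragment below drops runs by style name, highlight colour, and font colour. The style names Code and DoNotTranslate are hypothetical; substitute the names used in your own document.

```yaml
# Style names are placeholders, not built-in Word styles.
excludeStyles:
  - Code
  - DoNotTranslate
excludeHighlightColors:
  - yellow
excludeColors:
  - "FF0000"   # hex RGB for red, quoted so it is always read as a string
```

Quoting the hex values is a precaution: an all-digit colour such as 000000 would otherwise be parsed by YAML as a number.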

Limit an Excel workbook to selected sheets

Skip data and lookup sheets during extraction.

excludedSheets:
  - Data
  - Lookups
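
Exclude columns and translate sheet names

A further sketch for workbooks, assuming the Excel parameters can be combined freely. It skips columns A and C and opts in to sheet-name extraction, which is off by default. The parameter list does not say whether excluded columns apply to every sheet, so check the result on a sample file.

```yaml
excludedColumns:
  - A
  - C
translateSheetNames: true
```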

Extract only specific PowerPoint slides

Translate slides 1, 2, and 5 and skip the rest.

includedSlides:
  - 1
  - 2
  - 5
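
Include hidden slides and slide masters

A similar sketch for presentations: it opts in to hidden slides and slide masters, both off by default, and turns off speaker notes, which are on by default. Whether this combination suits a given project is a judgment call; the fragment only shows how the switches are set.

```yaml
translateHiddenSlides: true
translateSlideMasters: true
translateSlideNotes: false
```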

Processing notes

  • The document variant (Word / Excel / PowerPoint) is auto-detected from the package contents.

  • The writer is faithful by default — source run properties are preserved inline rather than rewritten into synthesised styles.

Limitations

  • One filter covers Word, Excel, and PowerPoint; some options apply only to the relevant variant (for example, the slide options affect PowerPoint only).
