Most people use word processing programs such as Microsoft Office Word, WPS, or macOS Pages to work with Word documents in their daily work. Are there other ways to process Word documents besides these word processors? The answer is yes.

After reading this article, you should know the following:

  • The file formats supported by Microsoft Office Word and the characteristics of Docx documents;
  • How to convert Word documents into HTML documents;
  • How to process ZIP documents in the browser;
  • How to convert Word documents into Markdown documents;
  • How to dynamically generate Word documents in the front end.

Are you ready for the Word document journey? Let’s go!

1. Introduction to Microsoft Office Word

Microsoft Office Word is a word processor application from Microsoft Corporation. It was originally written by Richard Brodie in 1983 for IBM PCs running DOS. Subsequent versions ran on the Apple Macintosh (1985), SCO UNIX, and Microsoft Windows (1989), and Word became part of Microsoft Office.

Word gives users tools for creating professional, polished documents while saving time. It has long been the most popular word processor.

1.1 File formats supported by Word

The following table lists several common file formats supported by Word, in alphabetical order by extension.


For all the formats Word supports, refer to Microsoft’s online office-file-format-reference documentation. Documents with the .docx extension are the focus of this article.

1.2 Docx document

As the saying goes, “know yourself and know your enemy, and you will win every battle”. Before “going to war”, let’s get a brief understanding of the Docx document. Older versions of Word use the .doc file name suffix; Word 2007 and later use .docx. Docx files are compressed documents, so they are smaller, can handle more complex content, and are faster to access.

A Docx document is actually a compressed file in ZIP format. ZIP is a file format for data compression and archiving. It was created by Phil Katz, who published the format in 1989, and its most common compression method is Deflate. ZIP files usually use the .zip suffix, and the MIME type is application/zip.

Here Po has prepared an “abao.docx” file containing Po’s avatar and some text. Make a copy of it, rename the copy to “abao.zip”, and then decompress it with any ZIP compression/decompression tool.


Looking at the decompressed directory, we find that a Word document is composed of a series of XML files and multimedia files. Po’s avatar in “abao.docx” ends up in the “word/media” directory. Let’s look at the structure of the abao folder:

-rw-rw-r--@  1 fer  staff  1641  7 11 01:25 [Content_Types].xml
drwxr-xr-x@  3 fer  staff    96  7 11 09:41 _rels
drwxr-xr-x@  4 fer  staff   128  7 11 09:41 docProps
drwxr-xr-x@ 13 fer  staff   416  7 11 09:42 word

The abao directory contains a “[Content_Types].xml” file and three subdirectories: “_rels”, “docProps”, and “word”.

  • [Content_Types].xml: This file defines the content type of each XML file in the package.
  • _rels: This directory usually contains a file with the .rels suffix, which stores the relationships between the parts of the package. There can be more than one _rels directory, because the relationships are hierarchical.
  • docProps: The XML files in this directory store the properties of the Docx file.
  • word: This directory contains the content, fonts, styles, themes, and other information of the Word document.
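
Because a Docx file is a ZIP archive, its first four bytes should be the ZIP local file header signature, the bytes 0x50 0x4B 0x03 0x04 (“PK” followed by 3 and 4). The following is a minimal sketch of how you might check this before handing a file to a parser. The function name looksLikeZip is made up for illustration, and a passing check only shows that the file is a ZIP archive, not that it is a valid Word document.

```javascript
// Signature that starts every non-empty ZIP archive: "PK", 0x03, 0x04.
const ZIP_LOCAL_FILE_HEADER = [0x50, 0x4b, 0x03, 0x04];

// Returns true when the buffer starts with the ZIP local file header signature.
function looksLikeZip(arrayBuffer) {
  const bytes = new Uint8Array(arrayBuffer);
  if (bytes.length < ZIP_LOCAL_FILE_HEADER.length) {
    return false;
  }
  return ZIP_LOCAL_FILE_HEADER.every((expected, i) => bytes[i] === expected);
}
```

In the browser, the ArrayBuffer could come from the FileReader API or from file.arrayBuffer().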

Now that we have covered the file formats supported by Word and the Docx document, let’s get to the main topic: how to work with Word documents in the front end.

2. Convert Word documents into HTML documents

In daily work, we sometimes want to import an existing Word document into a rich text editor for further editing. To do that, we first need to convert the Word document into an HTML document. There are two solutions: server-side conversion and front-end conversion.

  • Server-side conversion: Java developers can use POI, an open source project from Apache. It was designed to process file formats based on the Office Open XML standard (OOXML) and Microsoft’s OLE 2 Compound Document format (OLE2), and it supports both read and write operations.
  • Front-end conversion: To parse a Word document in the front end, we first need to decompress it and then parse the decompressed XML files. That sounds like a lot of work to implement, but thankfully Mammoth.js already does exactly that.

Before introducing how to use Mammoth.js to convert a Word document into HTML, let’s get a taste of the final result.


2.1 Introduction to Mammoth.js

Mammoth.js is designed to convert .docx documents (such as those created by Microsoft Word) into HTML. Mammoth’s goal is to generate simple, clean HTML by using the semantic information in the document and ignoring other details. For example, Mammoth converts any paragraph with the Heading 1 style into an h1 element, rather than trying to copy the exact styling (font, text size, color, etc.) of the heading.

Because there is a large mismatch between the structure used by .docx and the structure of HTML, the conversion is unlikely to be perfect for more complex documents. Mammoth works best if you only use styles to mark up your document semantically.

Currently Mammoth supports the following key features:

  • Headings
  • Lists and tables
  • Images
  • Bold, italics, underlines, strikethrough, superscript and subscript
  • Links and line breaks
  • Footnotes and endnotes

It also supports custom mapping rules. For example, you can convert WarningHeading to h1.warning by providing the appropriate style mapping. In addition, the content of the text box is treated as a separate paragraph that appears after the paragraph containing the text box.
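
As a rough sketch of the WarningHeading example, a style mapping can be passed through the styleMap option. It assumes the document has a paragraph style named WarningHeading, and the variable name warningHeadingOptions is only illustrative.

```javascript
// Map paragraphs that use the "WarningHeading" style to <h1 class="warning">.
// ":fresh" asks Mammoth to start a new element for each matching paragraph.
const warningHeadingOptions = {
  styleMap: ["p[style-name='WarningHeading'] => h1.warning:fresh"],
};

// Usage, assuming mammoth is loaded and arrayBuffer holds the .docx contents:
// mammoth.convertToHtml({ arrayBuffer }, warningHeadingOptions);
```

Each entry in styleMap is one rule, so more mappings can be added to the same array.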

Mammoth.js provides many methods; here are three commonly used APIs:

  • mammoth.convertToHtml(input, options): Converts the source document to HTML.
  • mammoth.convertToMarkdown(input, options): Converts the source document to Markdown. This method is similar to convertToHtml, except that the value property of the result object is Markdown instead of HTML.
  • mammoth.extractRawText(input): Extracts the raw text of the document. This ignores all formatting in the document. Each paragraph is followed by two newlines.
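
Since extractRawText ends every paragraph with two newlines, the raw text can be split back into paragraphs with a plain string split. The helper below is a hypothetical sketch, not part of Mammoth.js; note that it drops empty paragraphs.

```javascript
// Split the text produced by extractRawText into an array of paragraphs.
// Empty strings (from empty paragraphs or the trailing separator) are dropped.
function splitRawTextIntoParagraphs(rawText) {
  return rawText.split("\n\n").filter((paragraph) => paragraph.length > 0);
}

// Usage, assuming mammoth is loaded and arrayBuffer holds the .docx contents:
// mammoth.extractRawText({ arrayBuffer })
//   .then((result) => splitRawTextIntoParagraphs(result.value));
```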

With the features and APIs of Mammoth.js covered, it’s time to put it into practice.

2.2 Mammoth.js in practice

The input parameter of the convertToHtml method has the form {arrayBuffer: ArrayBuffer}, where arrayBuffer is the contents of the .docx file. In the front end, you can read a file with the FileReader API, whose readAsArrayBuffer method reads the contents of a given Blob or File. When reading completes, the result property holds the file’s data as an ArrayBuffer. Here we define a readFileInputEventAsArrayBuffer method:

export function readFileInputEventAsArrayBuffer(event, callback) {
  // Take the first file selected in the file input
  const file = event.target.files[0];

  const reader = new FileReader();

  // When reading finishes, reader.result holds the file contents as an ArrayBuffer
  reader.onload = function() {
    callback(reader.result);
  };

  reader.readAsArrayBuffer(file);
}

This method is used to convert the input File object to an ArrayBuffer object. After obtaining the Word document’s corresponding ArrayBuffer object, you can call the convertToHtml method to convert the Word document’s contents to HTML.

mammoth.convertToHtml({ arrayBuffer })
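
The promise returned by convertToHtml resolves to a result object whose value property holds the generated HTML and whose messages property lists any warnings or errors raised during conversion. A small sketch of handling that result might look like this; summarizeConversion is a hypothetical helper name, not part of Mammoth.js.

```javascript
// Split a Mammoth.js conversion result into the HTML and its warning texts.
function summarizeConversion(result) {
  const warnings = result.messages
    .filter((message) => message.type === "warning")
    .map((message) => message.message);
  return { html: result.value, warnings };
}

// Usage, assuming mammoth is loaded and arrayBuffer holds the .docx contents:
// mammoth.convertToHtml({ arrayBuffer }).then(summarizeConversion).then(console.log);
```

Checking the warnings is a cheap way to notice styles that Mammoth.js did not recognize.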

At this point, if your document does not contain special image types such as WMF or EMF, but only common types such as JPG or PNG, the images in your Word document are displayed correctly. Is that it? Is it really that easy? This is just the beginning. When you inspect the generated HTML with your browser’s developer tools, you’ll see that the images are embedded as Base64. If there are not many images and each one is small, this is acceptable.
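
To decide whether inline Base64 images are acceptable for a given document, it can help to measure them. The sketch below scans an HTML string for img elements whose src is a Base64 Data URL and reports each image’s MIME type and Base64 length. The function name and the regular expression are illustrative assumptions, and a regular expression is only a rough substitute for a real HTML parser.

```javascript
// Find <img> elements whose src is a Base64 Data URL.
function listEmbeddedImages(html) {
  const pattern = /<img\b[^>]*\bsrc="data:([^;",]+);base64,([^"]*)"/g;
  return Array.from(html.matchAll(pattern), (match) => ({
    mimeType: match[1],
    base64Length: match[2].length,
  }));
}
```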

When there are many images or the images are large, a better solution is to upload them to a file resource server. Mammoth.js lets you customize image handling through the convertImage option. The usage example is as follows:

let options = {
  convertImage: mammoth.images.imgElement(function(image) {
    // Read the image contents as a Base64 string
    return image.read("base64").then(function(imageBuffer) {
      return {
        src: "data:" + image.contentType + ";base64," + imageBuffer
      };
    });
  })
};

The example above Base64-encodes each image in the Word document and turns it into a Data URL so the image can be displayed. Obviously this does not meet our requirement, so we make the following adjustment:

const mammothOptions = {
  convertImage: mammoth.images.imgElement(function(image) {
    return image.read("base64").then(async (imageBuffer) => {
      // Upload the image and use the returned URL instead of a Data URL
      const result = await uploadBase64Image(imageBuffer, image.contentType);
      return {
        src: result.data.path // URL of the uploaded image
      };
    });
  })
};

The uploadBase64Image method, which uploads the image, is implemented as follows:

import axios from "axios";

async function uploadBase64Image(base64Image, mime) {
  const formData = new FormData();
  formData.append("file", base64ToBlob(base64Image, mime));

  return await axios({
    method: "post",
    url: "http://localhost:3000/uploadfile", // Local image upload API address
    data: formData,
    // Request headers belong at the top level of the axios config
    headers: { "Content-Type": "multipart/form-data" }
  });
}

To reduce the upload size, we convert the Base64 image into a Blob object and then submit it through a FormData object. The base64ToBlob method is defined as follows:

function base64ToBlob(base64, mimeType) {
  // Decode the Base64 string into a binary string
  let bytes = window.atob(base64);
  let ab = new ArrayBuffer(bytes.length);
  let ia = new Uint8Array(ab);
  // Copy each character code into the byte array
  for (let i = 0; i < bytes.length; i++) {
    ia[i] = bytes.charCodeAt(i);
  }
  return new Blob([ia], { type: mimeType });
}
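
Base64 represents every 3 bytes of binary data as 4 characters, so the encoded string is about one third larger than the bytes it encodes, which is why uploading a Blob is smaller than uploading the Base64 text. Here is a small sketch of that arithmetic, assuming standard padded Base64 with no whitespace; base64ByteLength is an illustrative name.

```javascript
// Number of bytes a padded Base64 string decodes to.
function base64ByteLength(base64) {
  // Each trailing "=" stands for one missing byte in the final 3-byte group
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  return (base64.length / 4) * 3 - padding;
}
```

For example, the 8-character string "aGVsbG8=" decodes to the 5 bytes of "hello".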

At this point, we have implemented the basic feature: converting a Word document into HTML and automatically uploading its images to the file resource server. Rather than explaining how Mammoth.js parses the XML files inside a Word document, we’ll take a brief look at JSZip, the library Mammoth.js uses internally to read the ZIP archive.

2.3 Introduction to JSZip

JSZip is a JavaScript library for creating, reading, and editing .zip files, with a lovely and simple API. The library’s compatibility is as follows:

  • Opera: supported, tested with the latest version
  • Firefox: supported, tested with 3.0 / 3.6 / latest version
  • Safari: supported, tested with the latest version
  • Chrome: supported, tested with the latest version
  • Internet Explorer: supported, tested with IE 6 / 7 / 8 / 9 / 10
  • Node.js: supported, tested with Node.js 0.10 / latest version

2.3.1 JSZip installation

To use JSZip, you can install it in one of the following ways:

  • npm: npm install jszip
  • Bower: bower install Stuk/jszip
  • Component: component install Stuk/jszip
  • Manual: download the JSZip package and include the dist/jszip.js or dist/jszip.min.js file

2.3.2 JSZip Example

let zip = new JSZip();
zip.file("Hello.txt", "Hello Semlinker\n");

// imgData is assumed to hold the Base64-encoded contents of the image
let img = zip.folder("images");
img.file("smile.gif", imgData, {base64: true});

zip.generateAsync({type: "blob"}).then(function(content) {
  // see FileSaver.js
  saveAs(content, "example.zip");
});

This example is from the JSZip official website. After it runs successfully, the example.zip file is downloaded and saved automatically. The decompressed archive contains a Hello.txt file and an images folder holding smile.gif.
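
Going the other way, JSZip can also read an existing archive, which is the first step any front-end Docx parser has to take. The sketch below loads a .docx file and returns its main XML part as a string. It assumes the main part is stored at word/document.xml, which is where Word normally writes it; readDocumentXml is a hypothetical helper name, and the JSZip library is passed in as a parameter so the sketch makes no assumption about how it is loaded.

```javascript
// Read the main XML part of a .docx file with JSZip.
async function readDocumentXml(JSZip, arrayBuffer) {
  // Parse the ZIP archive held in the buffer
  const zip = await JSZip.loadAsync(arrayBuffer);

  // zip.file(name) returns null when the entry does not exist
  const entry = zip.file("word/document.xml");
  if (entry === null) {
    throw new Error("word/document.xml not found; is this a .docx file?");
  }

  // Decompress the entry and decode it as text
  return entry.async("string");
}
```

Combined with the FileReader helper from section 2.2, this could be called as readDocumentXml(JSZip, arrayBuffer).then(console.log).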


3. Convert Word documents into Markdown documents

Markdown is a lightweight markup language created by John Gruber. It lets people write documents in a plain text format that is easy to read and write, and then convert them into valid XHTML (or HTML) documents. The language borrows many of the plain text markup conventions already used in email.

Because Markdown is lightweight, easy to read and write, and has support for images, charts, and mathematical expressions, it is now widely used by many websites to write help documents or to post messages on forums.

Now that we know what Markdown is, let’s look at how to convert a Word document into a Markdown document. Again there are two options:

  • The first: use the mammoth.convertToMarkdown(input, options) method provided by Mammoth.js;
  • The second: take the HTML generated by mammoth.convertToHtml(input, options) and run it through an HTML to Markdown converter.

Let’s take a look at the second option. Here we use Turndown, an open source HTML to Markdown converter on GitHub that is written in JavaScript and is very simple to use.

First, install it in one of two ways:

  • npm: npm install turndown
  • Script: <script src="https://unpkg.com/turndown/dist/turndown.js"></script>

Once installed, create a TurndownService instance by calling its constructor, and then call the instance’s turndown() method to perform the conversion:

let turndownService = new TurndownService();
let markdown = turndownService.turndown(
  document.getElementById('content')
)

For the “abao.docx” document used earlier, the resulting Markdown document from the transformation is as follows:

Full-Stack Road to Immortality, focused on the full stack and dedicated to sharing practical content on TypeScript, Web API, Node.js, Deno, and more.
![](https://cdn.xxx.com/rich_159444942843202)

Note that the TurndownService constructor supports many configuration options that are not covered here. If you are interested, read the official Turndown documentation or try the Turndown online demo.

Markmap is an open source library on GitHub that visualizes Markdown documents as mind maps. You can try it at https://markmap.js.org/repl/.

Finally, let’s look at how Word documents are dynamically generated on the front end.

4. Dynamically generate Word documents in the front end

To dynamically generate Word documents in the front end, we can use a mature third-party open source library such as docx or html-docx-js.

Here we take docx as an example to show how to generate Word documents in the .docx format in the front end. The docx library provides an elegant declarative API that lets you easily generate .docx files with JS/TS, and it supports both Node.js and browsers.

The docx library provides many classes for creating the corresponding elements in Word. Here are a few common ones:

  • Document: used to create a new Word document;
  • Paragraph: used to create a new paragraph;
  • TextRun: used to create text, with support for bold, italic, and underline styles;
  • Table: used to create a table, with support for setting the contents of each row and each cell.

Here we use the docx library to dynamically generate the “abao.docx” file, as shown below:


      
<html>
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title></title>
  </head>
  <body>
    <h1>Po - Example of dynamically generating a Word document</h1>

    <button type="button" onclick="generate()">Click to generate Docx document</button>
    <script src="https://unpkg.com/[email protected]/build/index.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/FileSaver.js/1.3.8/FileSaver.js"></script>
    <script>
      async function generate() {
        const doc = new docx.Document();

        // Download the avatar image as an ArrayBuffer
        const imageBuffer = await fetch(
          "https://avatars3.githubusercontent.com/u/4220799"
        ).then((response) => response.arrayBuffer());

        // Register the image with the document at 230 x 230
        const image = docx.Media.addImage(doc, imageBuffer, 230, 230);

        doc.addSection({
          properties: {},
          children: [
            new docx.Paragraph({
              children: [
                new docx.TextRun({
                  text: "Full-Stack Road to Immortality, ",
                  bold: true,
                }),
                new docx.TextRun({
                  text:
                    "focused on the full stack and dedicated to sharing practical content on TypeScript, Web API, Node.js, Deno, and more.",
                }),
              ],
            }),
            new docx.Paragraph(image),
          ],
        });

        // Pack the document into a Blob and trigger the download
        docx.Packer.toBlob(doc).then((blob) => {
          console.log(blob);
          saveAs(blob, "abao.docx");
          console.log("Document generated successfully");
        });
      }
    </script>
  </body>
</html>

In the example above, the generate() callback is called when the user clicks the button to generate the Docx document. Inside the callback, we first create a new Document object, then use the Fetch API to download Po’s avatar from GitHub. Once the image data has been retrieved, we call the docx.Media.addImage() method to add the image.

We then call the doc.addSection() method to add a Section block, which acts as a container for paragraphs. In our example, the Section contains two paragraphs, one holding the text and one holding the image. Finally, we convert the Document object into a Blob object and download it locally with the saveAs() method.

5. Reference resources

  • MDN – FileReader
  • Baidu Encyclopedia – Microsoft Office Word
  • office-file-format-reference
  • Github – mammoth.js
