Why PDF to Word is a long-standing problem

Converting PDF to Word is an extremely common request, yet one that almost everyone tries to steer clear of. To see why such a common need is so hard to meet, it helps to look at where the need comes from.

PDF is a document format introduced by Adobe and standardized as ISO 32000. It is so widely used because a PDF fixes the exact coordinates of every character and draws every shape at given coordinates, so a document transmitted or printed as PDF keeps a consistent layout. As a result, many tools can read, display, and print PDF files, but editing them (changing formatting, text, or styles) is very hard. That is where the long-standing demand for PDF to Word comes from. Because the two formats' encoding standards and layout mechanisms are completely different, conversion is very complicated: typically either the formatting or the content ends up scrambled, and the result rarely meets the customer's expectation of a native Word document.

The difficulty lies in mapping PDF’s position-based format onto Word’s content-based format. A PDF document has no concept of a paragraph or a table, so a PDF to Word converter has to turn “text surrounded by horizontal and vertical lines” into a Word table, “text with a line beneath it” into underlined text, and so on.
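To make the mapping problem concrete, here is a minimal sketch of just the first step: recovering lines of text from positioned fragments. The input shape ({ text, x, y }), the function name groupIntoLines, and the tolerance value are all assumptions made for illustration, not the API of any particular PDF library. It also assumes y grows upward, as in PDF's default coordinate space.

```javascript
// Hypothetical input: text fragments with page coordinates and nothing else.
// A converter has to infer lines (and later paragraphs, tables) from these.
function groupIntoLines(items, tolerance = 2) {
  // Top of the page first (larger y), then left to right
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    // Fragments whose baselines are within `tolerance` share a line
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  // Re-sort each line by x, since slightly different y values may have
  // put fragments of the same line in the wrong horizontal order
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((i) => i.text)
      .join(" ")
  );
}

const fragments = [
  { text: "World", x: 60, y: 700 },
  { text: "Hello", x: 10, y: 701 },
  { text: "Second line", x: 10, y: 680 },
];
const recoveredLines = groupIntoLines(fragments);
console.log(recoveredLines);
```

Even this simple step depends on a guessed tolerance; detecting tables, underlines, and paragraphs requires many more heuristics of the same kind, which is where conversions tend to go wrong.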

Two tools mean two sets of rules. Conversion between two tools has always been a compatibility problem. Unless both come from the same vendor, which can reserve common standards and interfaces and so achieve good compatibility, the rules have to be matched one by one. Adobe and Microsoft are both huge technology companies, and both products are powerful and broad in coverage, so matching every rule perfectly is painful.

Many report users think of a report as a written document, and a written document naturally brings Word to mind, so they very much hope that the content shown on the page can be archived as a Word file and edited there.

ActiveReportsJS is a pure front-end reporting tool with no dependency on a back end, so a Word file would have to be generated from the displayed HTML. The R&D team looked into this and found that the whole process would be very complicated and difficult; as they put it, “It’s not a sprint problem.” pdf.js has strong support from Mozilla behind it, while Word documents are generated by Microsoft’s Office development components.

However, in actual contact with customers, many users ask about exactly this, including how to use the reporting tool to design Word-style documents such as approval forms, personnel résumé forms, test reports, and other common reports. Users were happy with the results; their only complaint was that the output was available only as PDF. It is a habit, a core need, and a pain point.

This made me, Grape, rather anxious. I refused to believe that, with front-end tooling as rich as it is today, no such tool existed.

I started searching: I opened Google, racked my brain for vocabulary, entered the keywords I needed, and found the following results.

At first glance, the first result seems to fit perfectly, and using Node.js as a server is not unacceptable, as long as there is a solution.

Use the Cloudmersive convert-api-client to convert any file format

Cloudmersive.medium.com/how-to-conv…

It looks very promising.

Simple code:

But take a closer look at the code and, sure enough, every gift from fate already has its price marked on the back:

I figured that if it worked, I would pay for it. After all, we make paid commercial software for a living ourselves, so we should have some respect for copyright.

I clicked login, signed in with a Google account, and then referenced the Cloudmersive convert-api-client package in the project.

The JS library provides dozens of APIs and classes for converting between file formats. Besides PDF to Word it covers many other conversions, and it is very easy to use.

Evaluation of the conversion result:

It can read local PDF files. Conversion results:

  1. About 90% of the formatting and styling is preserved, which meets the requirement
  2. Images are imported directly
  3. Background colors are not preserved
  4. Tables are not imported as Word tables, only as text
  5. Header and footer content is not imported into Word's header and footer, only as text
  6. Some content is missing

  • Product pricing

Conversion is only one of the Cloudmersive API functions (others include security checks), and the product is billed by month and by number of concurrent requests. You can look up the details yourself. Their website also offers several easy-to-use file conversion tools that return the conversion result directly, without logging in:

cloudmersive.com/tools

Can a PDF stream be converted directly into a Word document?

Searching showed that converting a PDF object stream directly into a Word file with JS is very difficult. Having verified that a PDF exported by ARJS can be opened in Word, it occurred to me that some middleware might be able to turn the PDF stream directly into DOC or DOCX. After searching and trying, however, the only outcome was a file named document.docx.pdf: the .pdf extension was simply appended.

This attempt failed.

After talking to a tech guru, I learned that although PDF and Word files are both binary streams by nature, their internal declarations and other attributes are specific to each format, so one cannot be turned into the other directly. In short, a file stream can only be saved as what it is. And since PDF and Word are each backed by a major technology company, direct conversion requires professional tools, so this road is blocked.
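One way to see why changing the file name cannot work: applications identify a format by the bytes at the start of the file, not by its extension. The sketch below is illustrative only and sniffFormat is a hypothetical name. It checks three well-known signatures: a PDF starts with "%PDF-", a DOCX is a ZIP container starting with "PK\x03\x04", and a legacy DOC is an OLE compound file starting with D0 CF 11 E0.

```javascript
// Hypothetical helper: guess the container format from the leading bytes.
// `bytes` can be a Uint8Array, a Node.js Buffer, or a plain array of numbers.
function sniffFormat(bytes) {
  const startsWith = (signature) =>
    signature.every((value, index) => bytes[index] === value);

  if (startsWith([0x25, 0x50, 0x44, 0x46, 0x2d])) return "pdf"; // "%PDF-"
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "zip"; // DOCX is a ZIP package
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0])) return "ole"; // legacy .doc container
  return "unknown";
}

// A PDF stream stays a PDF stream, whatever the file is called
const pdfBytes = Array.from("%PDF-1.7", (c) => c.charCodeAt(0));
console.log(sniffFormat(pdfBytes));
```

A "zip" result only says the file is a ZIP package; confirming that it is really a DOCX would mean inspecting the entries inside it.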

A roundabout route: will HTML to Word get the job done?

HTML can be converted to almost anything: HTML to PDF, HTML to images, HTML to Excel, and so on. ActiveReportsJS can export a report to an HTML file with exactly the same layout, so here is the idea: wouldn't it be more convenient to go to Word directly from HTML? A Google search turns up far more material on this than on PDF to Word, and the code looks very simple:

Jscodemine.grapecity.com/share/Itym7…

Just 3 steps:

1. Export the report to HTML

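// ARJS and HTMLExport are assumed to be the ActiveReportsJS modules in scope
var pageReport = new ARJS.PageReport();
pageReport.load('./BandedReport.rdlx-json')
    .then(function() {
        // Render the loaded report
        return pageReport.run();
    })
    .then(function(pageDocument) {
        // Export the rendered document to HTML
        return HTMLExport.exportDocument(pageDocument);
    });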

2. Process the HTML to add Office markup
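The code for this step is not reproduced here. A minimal sketch of one common approach, assuming the exported HTML is available as a string: wrap it in a document that declares the Office and Word XML namespaces, then encode the whole thing as a data URI that can serve as a link target. The helper name wrapHtmlForWord and the title "Report" are made up for illustration and are not part of ActiveReportsJS.

```javascript
// Hypothetical helper: wrap an HTML fragment so that Word opens it as a document.
function wrapHtmlForWord(bodyHtml) {
  var header =
    "<html xmlns:o='urn:schemas-microsoft-com:office:office' " +
    "xmlns:w='urn:schemas-microsoft-com:office:word' " +
    "xmlns='http://www.w3.org/TR/REC-html40'>" +
    "<head><meta charset='utf-8'><title>Report</title></head><body>";
  var footer = "</body></html>";
  // encodeURIComponent keeps the markup intact inside the data URI
  return (
    "data:application/vnd.ms-word;charset=utf-8," +
    encodeURIComponent(header + bodyHtml + footer)
  );
}

// Placeholder markup standing in for the HTML exported in step 1
var sourceHTML = wrapHtmlForWord("<h1>Banded report</h1><p>Hello</p>");
```

The resulting file is still HTML with a .doc name, which Word agrees to open. That would be consistent with structure such as tables surviving the trip better than styling does.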

3. Create an <a> element and download the DOC file

// sourceHTML holds the processed HTML from step 2 as a URL
var fileDownload = document.createElement("a");
document.body.appendChild(fileDownload);
fileDownload.href = sourceHTML;
fileDownload.download = 'document.doc';
fileDownload.click(); // trigger the download
document.body.removeChild(fileDownload);

And look at the result: nice!

Evaluation of the conversion result:

  1. Styles are lost, including font color, background color, and shapes
  2. Images are lost
  3. Tables are imported directly as Word tables
  4. Icons are retained

To summarize

The two transformation results are summarized as follows:

After these attempts, HTML to Word can be regarded as a workaround. Report-style documents are generally text-based and simple in style, and most of them are saved to Word for further editing anyway, so converting HTML to Word is a fast and concise method. I am still looking for a way to preserve styles when converting HTML to Word, and will follow up with a second article if there is progress.

