Building from scratch a tool that automatically extracts a web page's HTML and converts it to a Markdown file in one click (Vue source version)

In recent years, many tech blogs and tech communities have sprung up, and more and more developers are starting their own blogs. We can sync our posts to different tech platforms, but as the number of platforms grows, syncing takes more and more time. Is there a tool for publishing to different platforms quickly? Or a tool that can convert HTML directly into a “language” that tech platforms recognize?

As we all know, the most popular blogging “language” among programmers is Markdown, and most tech communities now support Markdown syntax, so with Markdown we can quickly sync posts to different tech platforms.

Some might ask: why not just write the blog in Markdown in the first place? That works, but the downside is that we have to keep a local Markdown file, and if the post contains images we also have to maintain an image directory, which makes posting to each tech community a hassle. So we developed a tool that automatically crawls HTML content and converts it to Markdown in one click, so that we can blog “without restraint”.

What you will learn

Tips for using turndown
The vue + nuxt project development workflow
How to apply a Node.js crawler

The GitHub address is attached at the end of the article. If you are interested, you are welcome to help build it, and to learn and explore together.

Demo

The client

Approach

Let's lay out the steps:

Enter a link address
Get the html string returned by the server
Convert the html string into an md string
Show the result in the editor preview
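
The four steps above can be sketched as one small function. This is only a sketch under assumptions: `linkToMarkdown`, `fetchArticle`, `htmlToMarkdown` and `showPreview` are hypothetical names, passed in as parameters so that the flow does not depend on any particular HTTP client, converter or editor.

```javascript
// Sketch of the client flow. All function names here are hypothetical;
// the real project wires these steps to its own API, turndown and editor.
async function linkToMarkdown (link, { fetchArticle, htmlToMarkdown, showPreview }) {
  // 1. Take the link address entered by the user
  const url = link.trim()
  if (!url) throw new Error('Please enter a link address')

  // 2. Ask the server for the article's title and html string
  const { title, html } = await fetchArticle(url)

  // 3. Convert the html string into an md string
  const markdown = htmlToMarkdown(html)

  // 4. Show the result in the editor preview
  showPreview({ title, markdown })
  return { title, markdown }
}
```

In the real page, `htmlToMarkdown` would presumably be something like `html => turndownService.turndown(html)`.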

Why choose `turndown`

The most important step on the client is converting HTML to MD, and for this we use turndown. The reasons are as follows:

Talk is cheap, show me the code. Code blocks are a key feature of technical articles; an article without code has no soul. I compared several html2md plugins, and turndown had the best code block rendering and compatibility.
turndown also supports custom rules, so you can flexibly define your own tag handling and matching rules.
turndown also supports third-party plugins: turndown-plugin-gfm adds GFM (GitHub Flavored Markdown, a superset of MD) syntax such as table and strikethrough.

Implementation

  // Import turndown and the third-party plugin
  import TurndownService from 'turndown'
  import { gfm, tables, strikethrough } from 'turndown-plugin-gfm'

  const turndownService = new TurndownService({ codeBlockStyle: 'fenced' })
  // Use the full gfm plugin
  turndownService.use(gfm)

  // Or use only the tables and strikethrough plugins
  turndownService.use([tables, strikethrough])

  /**
   * Custom rule (rule names must be unique).
   * Here we treat the 'pre' tag as a code block and add a newline
   * before and after the code to prevent rendering problems.
   */
  turndownService.addRule('pre2Code', {
    filter: ['pre'],
    replacement (content) {
      return '```\n' + content + '\n```'
    }
  })
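
As a variation on the `pre` rule above, the sketch below keeps the language hint of a code block and reads the node's raw text (`node.textContent`) instead of the converted `content`, so the code itself is never processed as Markdown. The rule name `fencedPre`, the helper `getLanguage` and the `language-xxx` / `lang-xxx` class convention are assumptions for illustration; only the `filter` / `replacement(content, node)` rule shape comes from turndown.

```javascript
// Hypothetical helper: read a language hint such as "language-js" or "lang-js"
// from the <pre> element or its first child (often a <code> element).
function getLanguage (node) {
  const candidates = [node, node.firstChild].filter(
    (el) => el && typeof el.getAttribute === 'function'
  )
  for (const el of candidates) {
    const match = /(?:language|lang)-([\w-]+)/.exec(el.getAttribute('class') || '')
    if (match) return match[1]
  }
  return ''
}

// A turndown rule object: `filter` selects the tags, `replacement` returns Markdown.
const fencedPreRule = {
  filter: ['pre'],
  replacement (content, node) {
    // Use the raw text of the node and drop trailing newlines
    const code = node.textContent.replace(/\n+$/, '')
    return '\n\n```' + getLanguage(node) + '\n' + code + '\n```\n\n'
  }
}

// Usage (assuming the turndownService instance from above):
// turndownService.addRule('fencedPre', fencedPreRule)
```

Because rule names must be unique, this rule would replace `pre2Code` rather than sit alongside it.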

Additional functionality

The tool also fetches the title of the linked article automatically, so there is no need to copy it from the original by hand.

The server

The server is written in Node.js. Writing the server with the same framework as the front end makes for a great experience.

Approach

Let's lay out the steps:

Get the link address passed from the front end
Fetch the html string with a request
Get the content dom for each platform, based on its domain name
Convert the relative paths of images and links to absolute paths
Add a reprint source statement at the bottom of the html
Get the article's title
Return the title and html to the front end

Implementation

Get the link address passed from the front end

The front end passes the link as a GET parameter, so we can read it directly from the query:
```
const qUrl = req.query.url
```
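
Because this value comes straight from the user, it may help to validate it before requesting anything. The sketch below is an assumption and not part of the original project: `parseTargetUrl` is a hypothetical helper that accepts only absolute http(s) links, using the built-in `URL` class.

```javascript
// Hypothetical helper: return a normalized http(s) URL string, or null if invalid.
function parseTargetUrl (rawUrl) {
  if (typeof rawUrl !== 'string' || rawUrl.trim() === '') return null
  try {
    const parsed = new URL(rawUrl.trim())
    // Only allow web pages; reject file:, javascript:, ftp: and so on
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null
    return parsed.href
  } catch (err) {
    // new URL() throws a TypeError for strings it cannot parse
    return null
  }
}

// Possible usage inside the route handler:
// const qUrl = parseTargetUrl(req.query.url)
// if (!qUrl) { res.status(400).send('Url Error'); return }
```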

Fetch the `html` string with a request

Here we use the request library:

 const request = require('request')

 request({
   url: qUrl
 }, (error, response, body) => {
   if (error) {
     res.status(404).send('Url Error')
     return
   }
   // The body here is the html string of the article
   console.log(body)
 })
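
The `request` package has since been deprecated, and its callback only reports network errors, not HTTP error statuses such as 404 or 500. As an alternative, here is a minimal sketch using the `fetch` API built into Node.js 18 and later. `fetchHtml` and its injectable `fetchImpl` parameter are hypothetical names, not part of the original project.

```javascript
// Hypothetical helper: fetch a page's HTML with the fetch API built into Node.js 18+.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function fetchHtml (url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { redirect: 'follow' })
  // Treat HTTP error statuses (4xx, 5xx) as failures too
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  // Resolves to the html string of the article
  return response.text()
}

// Possible usage inside the route handler:
// fetchHtml(qUrl)
//   .then((body) => console.log(body))
//   .catch(() => res.status(404).send('Url Error'))
```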

Get the `dom` for each platform based on its domain name

There are many tech platforms, and each one uses different content tags, class names or IDs, so each needs its own handling.

First, we use jsdom to simulate DOM manipulation and wrap it in a helper method:

 const { JSDOM } = require('jsdom')

 /**
  * Get the exact content of the article
  * @param {string} html html string
  * @param {string} selector css selector
  * @return {Element|null} htmlContent
  */
 const getDom = (html, selector) => {
   const dom = new JSDOM(html)
   const htmlContent = dom.window.document.querySelector(selector)
   return htmlContent
 }

Then we handle each platform with its own CSS selector:

 // getBySelector is not defined in this article; it is assumed to be a small wrapper such as selector => getDom(content, selector)

 // For Juejin, the content block has the class .markdown-body. It contains a style tag and some extra "copy code" text, which we remove with native DOM operations
 if (qUrl.includes('juejin.cn')) {
   const htmlContent = getBySelector('.markdown-body')
   const extraDom = htmlContent.querySelector('style')
   const extraDomArr = htmlContent.querySelectorAll('.copy-code-btn')
   extraDom && extraDom.remove()
   extraDomArr.length > 0 && extraDomArr.forEach((v) => { v.remove() })
   return htmlContent
 }

 // For OSChina, the content block has the class .article-detail, and it contains extra .ad-wrap content
 if (qUrl.includes('oschina.net')) {
   const htmlContent = getBySelector('.article-detail')
   const extraDom = htmlContent.querySelector('.ad-wrap')
   extraDom && extraDom.remove()
   return htmlContent
 }

 // Finally, fall back to generic tags: prefer the article tag, then the body tag
 const htmlArticle = getBySelector('article')
 if (htmlArticle) { return htmlArticle }

 const htmlBody = getBySelector('body')
 if (htmlBody) { return htmlBody }
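
As more platforms are added, the chain of `if` blocks could be replaced by a rule table. The following is only a sketch of that idea, not the project's code: `platformRules`, `findRule` and `extractContent` are hypothetical names. It differs from the code above in three ways: it matches on the parsed host name rather than `includes`, removes every matching extra element, and falls back to the generic tags when a platform selector finds nothing.

```javascript
// Hypothetical rule table: one entry per platform, matched by host name.
// `content` is the CSS selector of the article body, `remove` lists selectors to strip.
const platformRules = [
  { host: 'juejin.cn', content: '.markdown-body', remove: ['style', '.copy-code-btn'] },
  { host: 'oschina.net', content: '.article-detail', remove: ['.ad-wrap'] }
]

// Generic fallbacks, tried in order when no platform matches
const fallbackSelectors = ['article', 'body']

function findRule (pageUrl) {
  const { hostname } = new URL(pageUrl)
  // Match the domain itself or any subdomain, e.g. my.oschina.net
  return platformRules.find(
    (rule) => hostname === rule.host || hostname.endsWith('.' + rule.host)
  ) || null
}

// Works on any object with querySelector/querySelectorAll, such as a jsdom document
function extractContent (doc, pageUrl) {
  const rule = findRule(pageUrl)
  if (rule) {
    const el = doc.querySelector(rule.content)
    if (el) {
      rule.remove.forEach((sel) => el.querySelectorAll(sel).forEach((n) => n.remove()))
      return el
    }
  }
  for (const selector of fallbackSelectors) {
    const el = doc.querySelector(selector)
    if (el) return el
  }
  return null
}
```

Matching on the host name avoids false positives such as a link to another site that merely contains `juejin.cn` in its query string.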

Convert the relative paths of images and links to absolute paths so the sources can be located later

 // Use the native URL API to get the origin of the link
 const qOrigin = new URL(qUrl).origin || ''

 // Get the absolute path of an image or link. The URL constructor resolves 'path + origin' into an absolute path; see the URL API documentation if you are not familiar with it
 const getAbsoluteUrl = p => new URL(p, qOrigin).href

 // Convert the relative paths of images and links. Platforms use different attribute names for lazy-loaded images, so each needs special handling
 const changeRelativeUrl = (dom) => {
   if (!dom) { return 'content error ~' }
   const copyDom = dom
   // Get all the images
   const imgs = copyDom.querySelectorAll('img')
   // Get all links
   const links = copyDom.querySelectorAll('a')
   // Replace all paths and return a new DOM
   imgs.length > 0 && imgs.forEach((v) => {
     /**
      * Handle lazy-loaded image paths
      * Jianshu: data-original-src
      * Juejin: data-src
      * segmentfault: data-src
      */
     const src = v.src || v.getAttribute('data-src') || v.getAttribute('data-original-src') || ''
     v.src = getAbsoluteUrl(src)
   })
   links.length > 0 && links.forEach((v) => {
     const href = v.href || qUrl
     v.href = getAbsoluteUrl(href)
   })
   return copyDom
 }

 // Apply changeRelativeUrl in the getBody method, which gets the article content for each platform
 const getBody = (content) => {
   ...
   return changeRelativeUrl(htmlContent)
 }
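
For readers unfamiliar with the two-argument `URL` constructor, the sketch below shows how it resolves different kinds of paths. The addresses are made-up examples. Note that resolving against the origin, as the code above does, and resolving against the full page address give different results for document-relative paths such as `img/a.png`.

```javascript
// Resolve a possibly relative path against a base URL with the built-in URL class
const toAbsolute = (path, base) => new URL(path, base).href

const pageUrl = 'https://example.com/blog/post/index.html' // hypothetical article address
const origin = new URL(pageUrl).origin // 'https://example.com'

// Root-relative and absolute paths resolve the same way with either base
toAbsolute('/img/a.png', origin) // 'https://example.com/img/a.png'
toAbsolute('/img/a.png', pageUrl) // 'https://example.com/img/a.png'
toAbsolute('https://cdn.example.org/a.png', origin) // 'https://cdn.example.org/a.png'

// Document-relative paths differ: the origin has no directory to resolve against
toAbsolute('img/a.png', origin) // 'https://example.com/img/a.png'
toAbsolute('img/a.png', pageUrl) // 'https://example.com/blog/post/img/a.png'

// Protocol-relative URLs inherit the scheme of the base
toAbsolute('//cdn.example.org/a.png', origin) // 'https://cdn.example.org/a.png'
```

If a platform uses document-relative image paths, passing the page address as the base may therefore be the safer choice.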

Add a reprint source statement at the bottom to prevent infringement

This needs little explanation; it is very simple.

 // Add the source statement at the bottom
 const addOriginText = (dom) => {
   const html = dom.innerHTML
   const resHtml = html + `<br/><div>Reprinted from: <a href="${qUrl}" target="_blank">${qUrl}</a>. If there is any infringement, please contact us to have it removed.</div>`
   return resHtml
 }

 // Apply addOriginText in the getBody method, which gets the article content for each platform
 const getBody = (content) => {
   ...
   return addOriginText(changeRelativeUrl(htmlContent))
 }
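
The link address is interpolated directly into HTML here, so characters such as `"` or `<` in the address could break the markup. A possible safeguard, shown only as a sketch, is to escape the address first. `escapeHtml` and `buildOriginText` are hypothetical helpers, and the "Source:" wording is a placeholder.

```javascript
// Hypothetical helper: escape characters that are special in HTML text and attributes
const escapeHtml = (text) => String(text)
  .replace(/&/g, '&amp;')
  .replace(/</g, '&lt;')
  .replace(/>/g, '&gt;')
  .replace(/"/g, '&quot;')
  .replace(/'/g, '&#39;')

// Build the source statement from the link address, escaping it for both
// the href attribute and the visible link text
const buildOriginText = (url) => {
  const safeUrl = escapeHtml(url)
  return `<br/><div>Source: <a href="${safeUrl}" target="_blank">${safeUrl}</a></div>`
}
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.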

Get the article `title`

 // Get the title of the article
 const getTitle = (content) => {
   const title = getDom(content, 'title')
   if (title) { return title.textContent }
   return 'Failed to get title ~'
 }

Return the `title` and `html` to the front end

 request({
   url: qUrl,
   headers: {}
 }, (error, response, body) => {
   if (error) {
     res.status(404).send('Url Error')
     return
   }
   // Set the JSON response type
   res.type('text/json')
   const json = {
     code: 1,
     title: getTitle(body),
     html: getBody(body)
   }
   res.status(200).send(json)
 })

Practical applications

This open source tool can be used in a wide range of scenarios. We can convert almost any web link to MD content and sync it to our blog or content management platform, but we need to respect copyright and be law-abiding “netizens”.

Supported environments

Modern browsers and IE11.

IE / Edge	Firefox	Chrome	Safari	Opera
IE11, Edge	last 2 versions	last 2 versions	last 2 versions	last 2 versions

Contributing

We welcome your contributions. You can help us build it in the following ways 😃

Report bugs through an Issue.
Submit a Pull Request so we can improve it together.
GitHub address: portal
