
Author: ArmorHTK

Open source data: www.openkg.cn/dataset/202…

The open-source dataset contains information on the top 100 universities in China. This article expands it by crawling semi-structured data for all Chinese higher-education institutions, forming a knowledge graph of universities and colleges.

Introduction

Constructing a knowledge graph is a complex piece of systems engineering; there is no single construction method and no fixed paradigm.

On the algorithm side, entity extraction, relation extraction, knowledge fusion, and embedding algorithms are not considered here.

On the data side, unstructured data is not considered; only semi-structured data scraped from Baidu Baike is used.

This demo therefore assumes reasonably good data quality, takes the data as it would be exported from Neo4j or another graph database, visualizes it on the web, and builds a few basic functions or downstream tasks on top of it.

Some modifications and adaptations were made with reference to an open-source project from OpenKG.
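
This demo itself skips the graph-database step entirely, but for context, a minimal sketch of loading one triple into Neo4j with py2neo might look like the following (the bolt URI, credentials, labels, and relationship type are placeholders of mine, not part of this project):

import requests  # not needed here; py2neo handles the connection
from py2neo import Graph, Node, Relationship

# Hypothetical sketch only -- this demo renders JSON directly instead.
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder connection

school = Node("School", name="Tsinghua University")
authority = Node("Authority", name="Ministry of Education")
graph.merge(school, "School", "name")          # upsert keyed on the name property
graph.merge(authority, "Authority", "name")
graph.create(Relationship(school, "ADMINISTERED_BY", authority))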

1. Obtain and clean data

  • The list of national regular institutions of higher education is publicly released government information; the document can be downloaded from the official site.
  • As of June 30, 2020, there were 3,005 institutions of higher education in China, including 2,740 regular institutions of higher education (1,258 undergraduate institutions and 1,482 vocational and technical colleges).

1.1 Cleaning the list data of universities

  • Before cleaning, open the Excel file and delete the first two rows manually, then clean the rest with Python
import numpy as np
import pandas as pd

# open the Excel file with pandas
df = pd.read_excel("National List of Higher Education Institutions.xls")
# fill every NaN in the remarks column with "public"
df["Note"].fillna("public", inplace=True)
# drop rows that still contain NaN (blank lines)
df.dropna(axis=0, inplace=True)
# convert the identification code to int64
df["School Identification Code"] = df["School Identification Code"].astype(np.int64)
# save to CSV
df.to_csv("School_List_2020.csv", index=False)
# take a quick look at the result
print(df.shape)
df.head()

🚀 OK! You now have a clean CSV file. Next, the crawler.
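
Before moving on, a quick sanity check on the cleaned file can't hurt (a small sketch of mine, not in the original; column names follow the cleaning code above, and the expected row count assumes this is the regular-HEI list):

import pandas as pd

df = pd.read_csv("School_List_2020.csv")
assert df["School Identification Code"].is_unique   # no duplicated schools
assert not df["Name of School"].isna().any()        # every row has a name
print(df.shape[0], "schools loaded")                # roughly 2,740 for the regular-HEI list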

1.2 Obtaining JSON data by crawler

A Baidu Baike entry URL is simply https://baike.baidu.com/item/ followed by the university's name.

Read the school names from the CSV file and concatenate them with the base URL to build the list of URLs:

df = pd.read_csv("School_List_2020.csv")
urls = []
for i in df["Name of School"]:
    url = "https://baike.baidu.com/item/" + str(i)
    urls.append(url)

Grab the following six infobox fields: ['Chinese name', 'English name', '简称' (abbreviation), 'Founding Date', 'type', 'Authority']

The complete crawler code:

import requests
import json
import time
from tqdm import tqdm
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# timing helper
def run_time(start_time):
    current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    print(f"Current time: {current_time}")
    print("Elapsed: %.3f sec" % (time.time() - start_time))

# fetch a URL and return the response
def url_open(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    r = requests.get(url, headers=headers)
    return r

# read the school list from the cleaned CSV
def school_list(filename):
    schools = []
    df = pd.read_csv(filename)
    for i in df["Name of School"]:
        schools.append(i)
    return schools

if __name__ == "__main__":
    school = school_list("School_List_2020.csv")
    # print(school)
    result_data = []
    start_time = time.time()
    for index in tqdm(school):
        url = 'https://baike.baidu.com/item/' + index
        print(url)
        data = url_open(url)
        soup = BeautifulSoup(data.content, 'html.parser', from_encoding='utf-8')
        name_data = []
        value_data = []
        # infobox field names (NOTE: the labels below are translated from the
        # Chinese Baidu Baike infobox field names)
        name_node = soup.find_all('dt', class_='basicInfo-item name')
        for i in range(len(name_node)):
            name_data.append(name_node[i].get_text().replace('\xa0', ''))
        # infobox field values
        value_node = soup.find_all('dd', class_='basicInfo-item value')
        for i in range(len(value_node)):
            value_data.append(value_node[i].get_text().replace('\n', ''))
        # default values when a field is missing from the page
        result = {'Chinese name': 'No information', 'English name': 'No information',
                  '简称': 'No information', 'Founding Date': 'No information',
                  'type': 'comprehensive', 'Authority': 'No information'}
        for i in range(len(name_data)):
            if name_data[i] == 'Chinese name':
                result['Chinese name'] = value_data[i]
            if name_data[i] in ['English name', 'Foreign name']:
                result['English name'] = value_data[i]
            if name_data[i] == '简称':
                result['简称'] = value_data[i]
            if name_data[i] == 'Founding Date':
                result['Founding Date'] = value_data[i]
            if name_data[i] == 'type':
                result['type'] = value_data[i]
            if name_data[i] == 'Authority':
                result['Authority'] = value_data[i]
        result_data.append({'Chinese name': result['Chinese name'], 'English name': result['English name'],
                            '简称': result['简称'], 'Founding Date': result['Founding Date'],
                            'type': result['type'], 'Authority': result['Authority']})
    fw = open('all.json', 'w', encoding='utf-8')
    fw.write(json.dumps(result_data, ensure_ascii=False))
    fw.close()
    print('Complete!')
    run_time(start_time)

  • Expect the crawl to take about 15 minutes (a politeness-delay variant of the fetch helper is sketched below)
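
The script above fires requests back to back. If you want to be gentler on Baidu's servers and more robust against timeouts, a variant of url_open with a delay and simple retries might look like this (my own addition, not in the original crawler):

import time
import requests

def url_open(url, retries=3, delay=1.0):
    """Fetch a page with a polite delay and simple retry logic."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            time.sleep(delay)                    # be polite between requests
            return r
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))    # back off and retry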

1.3 Cleaning JSON Data

  • Cleaning the JSON data is split into two steps: feature extraction and node-link construction.
  • Feature extraction: extract each node attribute into its own TXT file, one record per line, which makes it easy to build node-link-node triples.
import json

with open('./spider/all.json', 'r', encoding='utf-8') as fr:
    str_data = fr.read()
    full_data = json.loads(str_data)  # JSON decode
    fw1 = open('./dataprocess/Name.txt', 'w', encoding='utf-8')     # Chinese name list
    fw2 = open('./dataprocess/English.txt', 'w', encoding='utf-8')  # English name list
    fw3 = open('./dataprocess/Abbr.txt', 'w', encoding='utf-8')     # abbreviation list
    fw4 = open('./dataprocess/Time.txt', 'w', encoding='utf-8')     # founding date list
    fw5 = open('./dataprocess/Type.txt', 'w', encoding='utf-8')     # type list
    fw6 = open('./dataprocess/Admin.txt', 'w', encoding='utf-8')    # competent authority list

    for i in range(len(full_data)):
        # naive traversal over every key/value pair
        for key, value in full_data[i].items():
            if key == 'Chinese name':
                fw1.write("{'Chinese name': '" + value + "'}\n")
            if key == 'English name':
                fw2.write("{'English name': '" + value + "'}\n")
            if key == '简称':
                fw3.write("{'简称': '" + value + "'}\n")
            if key == 'Founding Date':
                # fw4.write("{'Founding Date': '" + value[0:4] + "'}\n")
                fw4.write("{'Founding Date': '" + value + "'}\n")
            if key == 'type':
                fw5.write("{'type': '" + value + "'}\n")
            if key == 'Authority':
                fw6.write("{'Authority': '" + value + "'}\n")

fw1.close()
fw2.close()
fw3.close()
fw4.close()
fw5.close()
fw6.close()
  • Node-link construction
import json
import csv

nodes = []
links = []

name_list = []
english_list = []
abbr_list = []
time_list = []
type_list = []
admin_list = []

# central node
nodes.append({'id': 'university', 'class': 'university', 'group': 0, 'size': 22})

# type nodes, linked to the central node in both directions
fr = open('./dataprocess/Type.txt', 'r', encoding='utf-8')
for line in fr.readlines():
    tmp = line.strip('\n')
    for key, value in eval(tmp).items():
        if value not in type_list:
            type_list.append(value)
            nodes.append({'id': value, 'class': 'type', 'group': 5, 'size': 18})
            links.append({'source': 'university', 'target': value, 'value': 3})
            links.append({'source': value, 'target': 'university', 'value': 3})
fr.close()

# english name nodes
fr = open('./dataprocess/English.txt', 'r', encoding='utf-8')
for line in fr.readlines():
    tmp = line.strip('\n')
    for key, value in eval(tmp).items():
        if value not in english_list:
            english_list.append(value)
            nodes.append({'id': value, 'class': 'english', 'group': 2, 'size': 15})
fr.close()

# abbreviation nodes
fr = open('./dataprocess/Abbr.txt', 'r', encoding='utf-8')
for line in fr.readlines():
    tmp = line.strip('\n')
    for key, value in eval(tmp).items():
        if value not in abbr_list:
            abbr_list.append(value)
            nodes.append({'id': value, 'class': 'abbr', 'group': 3, 'size': 15})
fr.close()

# founding date nodes
fr = open('./dataprocess/Time.txt', 'r', encoding='utf-8')
for line in fr.readlines():
    tmp = line.strip('\n')
    for key, value in eval(tmp).items():
        if value not in time_list:
            time_list.append(value)
            nodes.append({'id': value, 'class': 'time', 'group': 4, 'size': 11})
fr.close()

# competent authority nodes
fr = open('./dataprocess/Admin.txt', 'r', encoding='utf-8')
for line in fr.readlines():
    tmp = line.strip('\n')
    for key, value in eval(tmp).items():
        if value not in admin_list:
            admin_list.append(value)
            nodes.append({'id': value, 'class': 'admin', 'group': 6, 'size': 11})
fr.close()

with open('./spider/all.json', 'r', encoding='utf-8') as fr:
    str_data = fr.read()
    full_data = json.loads(str_data)
    for i in range(len(full_data)):
        # school name node, linked to its type
        nodes.append({'id': full_data[i]['Chinese name'], 'class': 'names', 'group': 1, 'size': 20})
        links.append({'source': full_data[i]['type'], 'target': full_data[i]['Chinese name'], 'value': 3})
        links.append({'source': full_data[i]['Chinese name'], 'target': full_data[i]['type'], 'value': 3})
        # english name links
        links.append({'source': full_data[i]['Chinese name'], 'target': full_data[i]['English name'], 'value': 3})
        links.append({'source': full_data[i]['English name'], 'target': full_data[i]['Chinese name'], 'value': 3})
        # abbreviation links
        links.append({'source': full_data[i]['Chinese name'], 'target': full_data[i]['简称'], 'value': 3})
        links.append({'source': full_data[i]['简称'], 'target': full_data[i]['Chinese name'], 'value': 3})
        # founding date links (hung off the abbreviation node)
        links.append({'source': full_data[i]['简称'], 'target': full_data[i]['Founding Date'], 'value': 3})
        links.append({'source': full_data[i]['Founding Date'], 'target': full_data[i]['简称'], 'value': 3})
        # authority links (hung off the abbreviation node)
        links.append({'source': full_data[i]['简称'], 'target': full_data[i]['Authority'], 'value': 3})
        links.append({'source': full_data[i]['Authority'], 'target': full_data[i]['简称'], 'value': 3})

fw = open('./nodes.json', 'w', encoding='utf-8')
fw.write(json.dumps({'nodes': nodes, 'links': links}, ensure_ascii=False))
fw.close()
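
One note on the eval(tmp) calls above: since each TXT line is just a dict literal, Python's ast.literal_eval parses it without executing arbitrary code, which is a safer drop-in if you adapt this script. A minimal sketch:

import ast

with open('./dataprocess/Type.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        record = ast.literal_eval(line.strip())  # safely parse one dict literal per line
        for key, value in record.items():
            print(key, value)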

After this series of processing steps we have nodes.json, which contains the node and link information. Next, the graph is visualized with D3.js.
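
For reference, this is the shape the construction script gives nodes.json, and it is exactly what the D3 code in the next section expects. A quick load-and-inspect sketch:

import json

with open('nodes.json', encoding='utf-8') as f:
    data = json.load(f)

# nodes carry {'id', 'class', 'group', 'size'}; links carry {'source', 'target', 'value'}
print(len(data['nodes']), 'nodes,', len(data['links']), 'links')
print(data['nodes'][0])  # {'id': 'university', 'class': 'university', 'group': 0, 'size': 22}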

2. Generate the graph

The graph can be rendered with either ECharts or D3.js. ECharts is easy to use but not flexible enough for deep customization, while D3.js is flexible but harder to get started with. Following the reference material, the whole graph is rendered with D3.js and then modified and polished.

  • Force-directed graph reference: blog.csdn.net/tengxing007…

Results (full page source below):

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Visualization of the Atlas of Chinese Universities in 2020</title>
    <meta name="description" content=""/>
    <meta name="keywords" content=""/>
    <meta name="author" content=""/>
    <link rel="shortcut icon" href="">
    <script src="http://cdn.bootcss.com/jquery/2.1.4/jquery.min.js"></script>
    <link href="http://cdn.bootcss.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet">
    <script src="http://cdn.bootcss.com/bootstrap/3.3.4/js/bootstrap.min.js"></script>
    <script src="https://cdn.staticfile.org/echarts/4.3.0/echarts.min.js"></script>
</head>

<style>
    body {
        background-color: #333333;
        padding: 30px 40px;
        text-align: center;
        font-family: OpenSans-Light, PingFang SC, Hiragino Sans GB, Microsoft Yahei, Microsoft Jhenghei, sans-serif;
    }

    .links line {
        stroke: rgb(240, 240, 240);
        stroke-opacity: 0.8;
    }

    .links line.inactive {
        /* display: none !important; */
        stroke-opacity: 0;
    }

    .nodes circle {
        stroke: #fff;
        stroke-width: 1.5px;
    }

    .nodes circle:hover {
        cursor: pointer;
    }

    .nodes circle.inactive {
        display: none !important;
    }

    .texts text {
        display: none;
    }

    .texts text:hover {
        cursor: pointer;
    }

    .texts text.inactive {
        display: none !important;
    }

    #indicator {
        position: absolute;
        left: 45px;
        bottom: 50px;
        text-align: left;
        color: #f2f2f2;
        font-size: 20px;
    }

    #indicator > div {
        margin-bottom: 4px;
    }

    #indicator span {
        display: inline-block;
        width: 30px;
        height: 14px;
        position: relative;
        top: 2px;
        margin-right: 8px;
    }

    #mode {
        position: absolute;
        top: 60px;
        left: 45px;
    }

    #mode span {
        display: inline-block;
        border: 1px solid #fff;
        color: #fff;
        padding: 6px 10px;
        border-radius: 4px;
        font-size: 14px;
        transition: color, background-color .3s;
        -o-transition: color, background-color .3s;
        -ms-transition: color, background-color .3s;
        -moz-transition: color, background-color .3s;
        -webkit-transition: color, background-color .3s;
    }

    #mode span.active, #mode span:hover {
        background-color: #fff;
        color: #333;
        cursor: pointer;
    }

    #info {
        position: absolute;
        bottom: 40px;
        right: 30px;
        text-align: right;
        width: 270px;
    }

    #info p {
        color: #fff;
        font-size: 12px;
        margin-bottom: 5px;
        margin-top: 0px;
    }

    #info p span {
        color: #888;
        margin-right: 10px;
    }

    #search input {
        position: absolute;
        top: 100px;
        left: 45px;
        color: #000;
        border: none;
        outline: none;
        box-shadow: none;
        width: 160px;
        background-color: #FFF;
    }

    #svg2 g.row:hover {
        stroke-width: 1px;
        stroke: #fff;
    }
</style>

<body>
    <h1 style="color: #fff; font-size: 32px; text-align: left; margin-left:40px;">Knowledge Map of Chinese Universities in 2020</h1>
    <div style="text-align: center; position: relative;">
        <svg width="1600" height="1200" style="margin-left: 0px; margin-bottom: 0px;" id="svg1"></svg>
        <div id="indicator"></div>
        <div id="mode">
            <span class="active" style="border-top-right-radius: 0; border-bottom-right-radius: 0; ">graphics</span>
            <span style="border-top-left-radius: 0; border-bottom-left-radius: 0; position: relative; left: -5px;">The text</span>
        </div>
        <div id="search">
            <input type="text" class="form-control">
        </div>
    </div>
    <div style="text-align: center; position: relative;"></div>
    <div id="info">
        <h4></h4>
    </div>
    <! -- <div id="main" style="width: 600px; height:400px;" ></div> -->


</body>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script>
    $(document).ready(function () {
        var svg = d3.select("#svg1"), width = svg.attr('width'), height = svg.attr('height');
        var names = ['university', 'Chinese name', 'English name', '简称', 'Founding Date', 'type', 'Authority'];
        var colors = ['#bd0404', '#b7d28d', '#b8f1ed', '#ca635f', '#5153ee', '#836FFF', '#f0b631'];
 
        // legend
        for (var i = 0; i < names.length; i++) {
            $('#indicator').append("<div><span style='background-color: " + colors[i] + "'></span>" + names[i] + "</div>");
        }
        // Define the effect of mouse drag
        var simulation = d3.forceSimulation()
            // Velocity decay factor, equivalent to friction, 0 is frictionless, 1 is frozen
            .velocityDecay(0.6)
            // alpha decay, in [0, 1], controls how quickly the simulation cools down;
            // with the default it stops after about 300 ticks, while 0 keeps it running forever
            .alphaDecay(0)
            // link force: keeps linked nodes at a preferred distance, keyed by node id
            .force("link", d3.forceLink().id(function (d) {
                return d.id;
            }))
            // many-body repulsion between nodes
            .force("charge", d3.forceManyBody())
            // centering force
            .force("center", d3.forceCenter(width / 2, height / 2));

        // Map Settings
        var graph;

        d3.json("nodes.json".function (error, data) {
            if (error) throw error;
            graph = data;
            console.log(graph);

            var link = svg.append("g").attr("class", "links").selectAll("line").data(graph.links).enter().append("line").attr("stroke-width", function (d) {
                return 1;
            });

            var node = svg.append("g").attr("class", "nodes").selectAll("circle").data(graph.nodes).enter().append('circle').attr('r', function (d) {
                return d.size;
            }).attr("fill", function (d) {
                return colors[d.group];
            }).attr("stroke", "none").attr("name", function (d) {
                return d.id;
            }).call(d3.drag().on("start", dragstarted).on("drag", dragged).on("end", dragended));

            var text = svg.append("g").attr("class", "texts").selectAll("text").data(graph.nodes).enter().append('text').attr("font-size", function (d) {
                return d.size;
            }).attr("fill", function (d) {
                return colors[d.group];
            }).attr("name", function (d) {
                return d.id;
            }).text(function (d) {
                return d.id;
            }).attr("text-anchor", 'middle').call(d3.drag().on("start", dragstarted).on("drag", dragged).on("end", dragended));

            var data = svg.append("g").attr("class", "datas").selectAll("text").data(graph.nodes).enter();

            // native tooltip showing the node id on hover
            node.append("title").text(function (d) {
                return d.id;
            });

            simulation
                .nodes(graph.nodes)
                .on("tick", ticked);
            simulation.force("link")
                .links(graph.links);

            // ticked(): the force layout is in constant motion, so the positions of
            // nodes and lines must be refreshed on every simulation tick
            function ticked() {
                link
                    .attr("x1".function (d) {
                        return d.source.x;
                    })
                    .attr("y1".function (d) {
                        return d.source.y;
                    })
                    .attr("x2".function (d) {
                        return d.target.x;
                    })
                    .attr("y2".function (d) {
                        return d.target.y;
                    });
                node
                    .attr("cx".function (d) {
                        return d.x;
                    })
                    .attr("cy".function (d) {
                        return d.y;
                    });
                text.attr('transform', function (d) {
                    return 'translate(' + d.x + ', ' + (d.y + d.size / 2) + ')';
                });
            }
        });

        // drag interaction state
        var dragging = false;

        // drag start: reheat the simulation and pin the node at its current position
        function dragstarted(d) {
            if (!d3.event.active) simulation.alphaTarget(0.6).restart();
            d.fx = d.x;
            d.fy = d.y;
            dragging = true;
        }

        // while dragging, move the pinned node with the pointer
        function dragged(d) {
            d.fx = d3.event.x;
            d.fy = d3.event.y;
        }
        // drag end: release the node; alphaTarget(0) lets the simulation cool down again
        function dragended(d) {
            if (!d3.event.active) simulation.alphaTarget(0);
            d.fx = null;
            d.fy = null;
            dragging = false;
        }

        // graphics / text mode buttons
        $('#mode span').click(function (event) {
            $('#mode span').removeClass('active');
            $(this).addClass('active');
            if ($(this).text() == 'graphics') {
                $('.texts text').hide();
                $('.nodes circle').show();
            } else {
                $('.texts text').show();
                $('.nodes circle').show();
            }
        });

        // hovering a circle shows its details and fades out unrelated nodes and links
        $('#svg1').on('mouseenter', '.nodes circle', function (event) {
            if (!dragging) {
                var name = $(this).attr('name');

                $('#info h4').css('color', $(this).attr('fill')).text(name);
                $('#info p').remove();
                console.log(info[name]);
                for (var key in info[name]) {
                    if (typeof (info[name][key]) == 'object') {
                        continue;
                    }
                    if (key == 'url' || key == 'title' || key == 'name' || key == 'edited' || key == 'created' || key == 'homeworld') {
                        continue;
                    }
                    $('#info').append('<p><span>' + key + '</span>' + info[name][key] + '</p>');
                }

                d3.select("#svg1 .nodes").selectAll('circle').attr('class', function (d) {
                    if (d.id == name) {
                        return '';
                    }
                    // keep direct neighbours visible
                    for (var i = 0; i < graph.links.length; i++) {
                        if (graph.links[i]['source'].id == name && graph.links[i]['target'].id == d.id) {
                            return '';
                        }
                        if (graph.links[i]['target'].id == name && graph.links[i]['source'].id == d.id) {
                            return '';
                        }
                    }
                    return 'inactive';
                });

                d3.select("#svg1 .links").selectAll('line').attr('class', function (d) {
                    if (d.source.id == name || d.target.id == name) {
                        return '';
                    } else {
                        return 'inactive';
                    }
                });
            }
        });

        $('#svg1').on('mouseleave', '.nodes circle', function (event) {
            if (!dragging) {
                d3.select('#svg1 .nodes').selectAll('circle').attr('class', '');
                d3.select('#svg1 .links').selectAll('line').attr('class', '');
            }
        });

        // the same hover behaviour for the text labels
        $('#svg1').on('mouseenter', '.texts text', function (event) {
            if (!dragging) {
                var name = $(this).attr('name');
                $('#info h4').css('color', $(this).attr('fill')).text(name);
                $('#info p').remove();
                for (var key in info[name]) {
                    if (typeof (info[name][key]) == 'object') {
                        continue;
                    }
                    if (key == 'url' || key == 'title' || key == 'name' || key == 'edited' || key == 'created' || key == 'homeworld') {
                        continue;
                    }
                    $('#info').append('<p><span>' + key + '</span>' + info[name][key] + '</p>');
                }
                d3.select('#svg1 .texts').selectAll('text').attr('class', function (d) {
                    if (d.id == name) {
                        return '';
                    }
                    for (var i = 0; i < graph.links.length; i++) {
                        if (graph.links[i]['source'].id == name && graph.links[i]['target'].id == d.id) {
                            return '';
                        }
                        if (graph.links[i]['target'].id == name && graph.links[i]['source'].id == d.id) {
                            return '';
                        }
                    }
                    return 'inactive';
                });
                d3.select("#svg1 .links").selectAll('line').attr('class', function (d) {
                    if (d.source.id == name || d.target.id == name) {
                        return '';
                    } else {
                        return 'inactive';
                    }
                });
            }
        });

        $('#svg1').on('mouseleave', '.texts text', function (event) {
            if (!dragging) {
                d3.select('#svg1 .texts').selectAll('text').attr('class', '');
                d3.select('#svg1 .links').selectAll('line').attr('class', '');
            }
        });

        // search box: fade out everything whose id does not contain the query
        $('#search input').keyup(function (event) {
            if ($(this).val() == '') {
                d3.select('#svg1 .texts').selectAll('text').attr('class', '');
                d3.select('#svg1 .nodes').selectAll('circle').attr('class', '');
                d3.select('#svg1 .links').selectAll('line').attr('class', '');
            } else {
                var name = $(this).val();
                d3.select('#svg1 .nodes').selectAll('circle').attr('class', function (d) {
                    if (d.id.toLowerCase().indexOf(name.toLowerCase()) >= 0) {
                        return '';
                    } else {
                        return 'inactive';
                    }
                });
                d3.select('#svg1 .texts').selectAll('text').attr('class', function (d) {
                    if (d.id.toLowerCase().indexOf(name.toLowerCase()) >= 0) {
                        return '';
                    } else {
                        return 'inactive';
                    }
                });
                d3.select("#svg1 .links").selectAll('line').attr('class', function (d) {
                    return 'inactive';
                });
            }
        });

        var info;

        // per-school details for the #info panel come straight from the crawled all.json
        d3.json("all.json", function (error, data) {
            info = data;
        });


    });
</script>

</html>

3. Local deployment

The structure of the current directory is shown below:

│  all.json
│  creatnodelink.ipynb
│  getFeature.ipynb
│  index.html
│  nodes.json
│  National List of Higher Education Institutions.xls
│
└─ dataprocess
       Abbr.txt
       Admin.txt
       English.txt
       Name.txt
       Time.txt
       Type.txt
  • Go to the directory containing index.html, open a terminal (cmd) there, and quickly spin up a local server. Serving over HTTP is needed because d3.json cannot load files via the local file:// protocol.
  • One line of Python 3 takes care of the server
python -m http.server 8000

Open http://localhost:8000/ to see your own knowledge graph.

This project is only a simple visualization that turns semi-structured data into a knowledge graph, and the techniques are a little rough. I am just getting started with data mining and natural language processing, so please share your advice ~