Preface

This article walks you through quickly building a celebrity relationship graph with Neo4j. Because of delays, it happens to be coming out on April 1 of yet another year, so several examples in the article have been switched to “Brother” Leslie Cheung. As the saying goes, even a clever housewife cannot cook without rice, so the first step was to crawl all 9,141 entries listed under “More stars” on the “Stars” page of ylq.com, a general entertainment portal.

Next, pick out the profiles whose personal home page contains “star relationship” data, then crawl and parse the fields the relationship graph needs. Taking “Leslie Cheung – Personal homepage” as an example, few stars are listed as directly related to him, so the data quality is clearly not especially high. Since this is only for practice, I will not dwell on it.

Once you have the data, save it as CSV and load it into Neo4j, and you can query the relationships of “Leslie Cheung”.

It is also easy to expand the graph outward for anyone who wants a closer look at the relationships radiating from Leslie Cheung.

Origins

Think it looks cool and want to learn it right away? No rush. The Neo4j part is very simple, so let me first talk about how this came about.

I have used Gephi several times before to draw relationship graphs, in a fairly conventional way: a Weibo repost graph and a graph of who follows whom among Zhihu Big Vs (see: “Gephi draws the Weibo repost graph: taking ‘@ wife and Children in Heaven’ as an example” and “374 Zhihu Big Vs with 100,000+ followers (I): Mutual follows”). The analysis of my own diaries was more inventive and eye-catching, and it remains something of a favorite even though the technical details are rather sketchy (see: “2017, Those who Appear in Diaries: Simple text mining”). In retrospect, all of these data formats can be carried over to Neo4j seamlessly. Interested readers can get the data from the Weibo repost graph article and try it for themselves.

In fact, what led me to Neo4j was the visualization of characters, relationships and events in Dream of Red Mansions that I recommended half a year ago. At the time I dug into the JSON data behind that project with great interest and hand-wrote a slightly convoluted function to extract the chains of characters linked by “adultery”. Now it looks as though Neo4j can solve this in a single line of code. (See: “Recommending an amazing visualization of Dream of Red Mansions” and “Reading Dream of Red Mansions with the left hand, writing bugs with the right, at leisure”)

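import pandas as pd

# edges_df (columns: from, to, label) and person_df (columns: id, label)
# are assumed to have been loaded from the project's JSON data beforehand.

def word2id(word):
    # Source and target IDs of every edge whose label equals `word`
    df = edges_df[edges_df.label == word]
    from_id = df['from'].values.tolist()
    to_id = df['to'].values.tolist()
    return from_id, to_id

def id2label(ids):
    # Look up the display name for each ID, preserving order
    tables = []
    for ID in ids:
        tables.append(person_df[person_df['id'] == ID])
    labels = pd.concat(tables)['label'].values.tolist()
    return labels

def get_relation(from_id, to_id):
    # Relies on the module-level `word` defined below
    for from_label, to_label in zip(id2label(from_id), id2label(to_id)):
        print(from_label, '-->{}-->'.format(word), to_label)

word = "Adultery"
from_id, to_id = word2id(word)
get_relation(from_id, to_id)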
# Output:
# Jia Qiang -->Adultery--> Ling Guan
# Jia Zhen -->Adultery--> Qin Keqing
# Jia Lian -->Adultery--> Duo Guniang
# Xue Pan -->Adultery--> Bao Chan
# Search -->Adultery--> Jia Rong
# Qin Keqing -->Adultery--> Jia Qiang
# Siqi -->Adultery--> Pan You'an
# Bao Chan -->Adultery--> Xue Pan
# saw -->Adultery--> Jia Zhen
# Bao Er's wife -->Adultery--> Jia Lian
# Zhineng'er -->Adultery--> Qin Zhong
# Wan'er -->Adultery--> Mingyan

Neo4j installation

Neo4j is a graph database. Unlike better-known relational databases such as MySQL, Neo4j stores data as nodes and the relationships between them, which makes building and querying relationship data efficient and convenient.
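To make the node-and-relationship idea concrete, here is a minimal Python sketch. It only illustrates the data model and says nothing about how Neo4j stores data internally; the sample people, the `nodes` and `relationships` structures and the `neighbours` helper are all made up for illustration.

```python
# Nodes: id -> (label, properties). Sample data is hypothetical.
nodes = {
    1: ("People", {"name": "Alex", "age": 20}),
    2: ("People", {"name": "Tom", "age": 22}),
}

# Relationships are directed: (start node id, relationship type, end node id)
relationships = [
    (1, "Friends", 2),
]


def neighbours(node_id, rel_type):
    """Names of the nodes reached from `node_id` over outgoing `rel_type` edges."""
    return [
        nodes[end][1]["name"]
        for start, kind, end in relationships
        if start == node_id and kind == rel_type
    ]


print(neighbours(1, "Friends"))
```

Because a relationship is stored as a first-class record pointing at two nodes, following it is a direct lookup rather than a join between tables.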

I wanted to skip this part, but I ran into a few minor problems along the way, so I will cover it briefly.

  • Install the Java JDK. I will skip this step because I already did it when installing Gephi.

  • Download the latest Community edition from the official Neo4j website and unzip it to E:\neo4j-file\neo4j-community-3.5.3\.

  • Start Neo4j: press Windows+R, type cmd to open a command-line window, switch to the home directory with cd E:\neo4j-file\neo4j-community-3.5.3, and run neo4j.bat console as administrator. An error message is displayed.

  • Under “My Computer” – “Properties” – “Advanced system settings” – “Environment Variables”, add the system variable NEO4J_HOME=E:\neo4j-file\neo4j-community-3.5.3, then append %NEO4J_HOME%\bin to Path, separated by a semicolon (;).

  • Another error then appears: Import-Module: Failed to load the specified module “\Neo4j-Management.psd1”. To fix it, edit E:\neo4j-file\neo4j-community-3.5.3\bin\neo4j.ps1 and change Import-Module "$PSScriptRoot\Neo4j-Management.psd1" to Import-Module "E:\neo4j-file\neo4j-community-3.5.3\bin\Neo4j-Management.psd1".

  • After you save the file, the red error message no longer appears. Run neo4j install-service to install Neo4j as a system service, then run neo4j start to start it.

  • Open http://localhost:7474 in a browser to reach the Neo4j interface. The initial user name and password are both neo4j. Once you have changed the password as prompted, the preparations are complete.

First experience with Neo4j

After installation, all you need to do from then on is open a command-line window, go to the E:\neo4j-file\neo4j-community-3.5.3\bin folder, run neo4j start, and open http://localhost:7474. Log in with the user name neo4j and either the initial password neo4j or your new password.

cd /d E:
cd E:\neo4j-file\neo4j-community-3.5.3\bin
neo4j start

You can then create nodes and relationships using the Cypher query language (CQL; CQL is to Neo4j what SQL is to an Oracle database). For a quick introduction to Neo4j CQL, see the w3cschool tutorial.

The following are some introductory statements that will suffice for implementing the star relationship graph.

// Create People nodes with attributes (name, age)
create(p:People{name:"Alex", age:20});

create(p:People{name:"Tom", age:22});

// Match the People nodes and return their name and age attributes
match (p:People) return p.name, p.age

// Match all People nodes whose age is 20
match (p:People{age:20}) RETURN p

// Create a one-way Friends relationship from Alex to Tom
// (as written, this also creates two new nodes rather than reusing the ones above)
create(:People{name:"Alex", age:20})-[r:Friends]->(:People{name:"Tom", age:22})

// Match relationships of type RELATION and view 25 of them
match p=()-[r:RELATION]->() return p LIMIT 25

// Match all nodes and view 25 of them
match (n) return n LIMIT 25;

// Delete all nodes and node-related relationships
match (n) detach delete n
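The same kind of statement can also be sent from Python instead of the browser. The sketch below assumes the official `neo4j` Python driver is installed and that the server listens on the default Bolt address `bolt://localhost:7687`; the helper names, the placeholder password and the use of `$name`/`$age` parameters are my own choices rather than anything from the original workflow, and whether a given driver release works with Neo4j 3.5.3 is worth checking first.

```python
def create_person_query(name, age):
    """Build a parameterized CREATE statement plus its parameter dict."""
    query = "CREATE (p:People {name: $name, age: $age}) RETURN p.name AS name"
    return query, {"name": name, "age": age}


def run_query(query, params, uri="bolt://localhost:7687",
              user="neo4j", password="your-password"):
    """Run one Cypher statement and return the result rows as dicts.

    The connection details are placeholders (assumptions); adjust them
    to match your own installation.
    """
    # Imported here so the helper above can be used without the driver installed
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(query, params)]
    finally:
        driver.close()


query, params = create_person_query("Alex", 20)
print(query, params)
# With a running server: rows = run_query(query, params)
```

Passing values as parameters rather than formatting them into the statement avoids quoting problems with names that contain quotes or spaces.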

Crawling the data

I will not explain the crawler in much detail: it simply pages through the list until it has all 9,141 star names and personal home page links. The complete code is at: DesertsX/gulius-projects

The crawler also extracts star picture links and other information that are not used this time and can be ignored. Adding portraits to the relationship graph would look better, but I do not yet know how to do that.

import time
import random
import requests
from lxml import etree
import pandas as pd
from fake_useragent import UserAgent

rows = []
total_pages = 153
url = 'http://www.ylq.com/star/list-all-all-all-all-all-all-all-{}.html'
for page in range(1, total_pages + 1):
    ua = UserAgent()
    # Random User-Agent per request, same headers as in the later steps
    headers = {"User-Agent": ua.random,
               'Host': 'www.ylq.com'}
    r = requests.get(url=url.format(page), headers=headers)
    r.encoding = r.apparent_encoding
    dom = etree.HTML(r.text)

    # Profile links look like 'http://www.ylq.com/neidi/xingyufei/'
    star_urls = dom.xpath('//div[@class="fContent"]/ul/li/a/@href')
    star_ids = [star_url.split('/')[-2] for star_url in star_urls]
    star_names = dom.xpath('//div[@class="fContent"]/ul/li/a/h2/text()')
    star_images = dom.xpath('//div[@class="fContent"]/ul/li/a/img/@src')

    print(page, len(star_urls), len(star_ids), len(star_images), len(star_names))

    for i in range(len(star_ids)):
        # The running number assumes 60 stars per list page
        rows.append({'num': int((page-1)*60+i+1), 'name': star_names[i],
                     'star_id': star_ids[i], 'star_url': star_urls[i],
                     'image': star_images[i]})
    # if page % 5 == 0:
    #     time.sleep(random.randint(0, 2))

# Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
ylq_all_star_ids = pd.DataFrame(rows, columns=['num', 'name', 'star_id', 'star_url', 'image'])
print("Crawl finished!")

Check the data. No problem.

Because only some celebrity profiles contain “star relationship” data, the next step picks out the 1,263 links that do. Note that this step is slow; feel free to optimize and speed it up yourself. I may improve it later when I have time.
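One possible way to speed this step up is to check several pages at once with a thread pool, since most of the time is spent waiting on the network. The sketch below is not the author's code: `filter_urls`, `has_relations`, the injected `fetch` callable and the worker count are all assumptions chosen so the idea can be tried without touching the site.

```python
from concurrent.futures import ThreadPoolExecutor


def has_relations(html):
    """A profile counts as having relationship data if 'starRelation' appears in it."""
    return 'starRelation' in html


def filter_urls(urls, fetch, max_workers=8):
    """Return the URLs whose page contains relationship data, in input order.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    urls = list(urls)

    def check(url):
        try:
            return has_relations(fetch(url))
        except Exception:
            # Treat a failed request as "no relationship data"
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(check, urls))  # map() keeps the input order
    return [url for url, ok in zip(urls, flags) if ok]


# Tiny offline demonstration with made-up pages
fake_pages = {
    'a': '<div class="hd starRelation"></div>',
    'b': '<div>no relations here</div>',
}
print(filter_urls(['a', 'b'], fake_pages.__getitem__))
```

For the real crawl, `fetch` could wrap the same `requests.get(url, headers=headers, timeout=5)` call used in this article and return `r.text`. Keeping the worker count modest is kinder to the site.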

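# Assumed: take every profile link from the DataFrame built in the previous step
star_urls = ylq_all_star_ids['star_url'].tolist()

star_has_relations = []
for num, url in enumerate(star_urls):
    ua = UserAgent()
    headers = {"User-Agent": ua.random,
               'Host': 'www.ylq.com'}
    try:
        r = requests.get(url=url, headers=headers, timeout=5)
        r.encoding = r.apparent_encoding

        # Profiles with relationship data contain a "starRelation" block
        if 'starRelation' in r.text:
            star_has_relations.append(url)
            print(num, "Bingo!", end=' ')
        if num % 100 == 0:
            print(num, end=' ')
    except Exception:
        # On a failed request, report progress so far and carry on
        print(num, star_has_relations)
    # if (num+index)%50==0:
    #     time.sleep(random.randint(0, 2))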

With that done, crawling the relationship data itself is straightforward. You can of course merge steps of the crawler to taste, for example filtering the links that have relationship data and crawling that data in a single pass.

datas = []
for num, subject_url in enumerate(star_has_relations):
    ua = UserAgent()
    headers = {"User-Agent": ua.random,
               'Host': 'www.ylq.com'}
    try:
        r = requests.get(url=subject_url, headers=headers, timeout=5)
        r.encoding = r.apparent_encoding
        dom = etree.HTML(r.text)
        # The star whose profile this is
        subject = dom.xpath('//div/div/div/h1/text()')[0]
        # Parallel lists: one entry per related star
        relations = dom.xpath('//div[@class="hd starRelation"]/ul/li/a/span/em/text()')
        objects = dom.xpath('//div[@class="hd starRelation"]/ul/li/a/p/text()')
        object_urls = dom.xpath('//div[@class="hd starRelation"]/ul/li/a/@href')
        object_images = dom.xpath('//div[@class="hd starRelation"]/ul/li/a/img/@src')
        for i in range(len(relations)):
            relation_data = {'num': int(num+1), 'subject': subject, 'relation': relations[i],
                             'object': objects[i], 'subject_url': subject_url,
                             'object_url': object_urls[i], 'obeject_image': object_images[i]}
            datas.append(relation_data)
        print(num, subject, end=' ')
    except Exception:
        print(num, datas)
    # if num % 20 == 0:
    #     time.sleep(random.randint(0, 2))
    #     print(num, 'sleep a moment')

# Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
ylq_all_star_relations = pd.DataFrame(datas, columns=['num', 'subject', 'relation', 'object', 'subject_url', 'object_url', 'obeject_image'])

Each row of the star relationship data holds the fields collected above: num, subject, relation, object, subject_url, object_url and obeject_image. I also handled a few further cases afterwards, but it seems all of that could be dropped, so I will not go into it here. The complete code is at: DesertsX/gulius-projects

Building the star relationship graph

If you are not interested in crawlers and just want to know how to import existing CSV data and build a relationship graph with Neo4j, you can start reading from here. After all, the data is provided for free this time.

Put the ylq_star_nodes.csv and ylq_star_relations.csv files in the E:\neo4j-file\neo4j-community-3.5.3\import directory, then run the following two statements one after the other to create the graph. Yes, it takes about a second, though with a lot of data it may take a little longer.

LOAD CSV WITH HEADERS FROM 'file:///ylq_star_nodes.csv' AS data CREATE (:star{starname:data.name, starid:data.id});

LOAD CSV WITH HEADERS FROM "file:///ylq_star_relations.csv" AS relations
MATCH (entity1:star{starname:relations.subject}) , (entity2:star{starname:relations.object})
CREATE (entity1)-[:rel{relation: relations.relation}]->(entity2)
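If you are producing the two CSV files yourself, something along these lines should yield headers that the LOAD CSV statements can read. The column names name, id, subject, relation and object are taken from those statements; the function name, the sequential numeric id and the idea of deriving the node list from every subject and object are assumptions of this sketch, not necessarily how the published files were built.

```python
import csv


def write_neo4j_csvs(relations, nodes_path, relations_path):
    """Write a nodes CSV (name,id) and a relations CSV (subject,relation,object).

    `relations` is a list of dicts with at least the keys
    'subject', 'relation' and 'object'. Returns the number of nodes written.
    """
    # Collect each distinct star name once, in order of first appearance,
    # so that every subject and object can be matched as a node later.
    names = []
    seen = set()
    for row in relations:
        for name in (row["subject"], row["object"]):
            if name not in seen:
                seen.add(name)
                names.append(name)

    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "id"])
        for i, name in enumerate(names, start=1):
            writer.writerow([name, i])  # sequential id is an assumption

    with open(relations_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["subject", "relation", "object"],
            extrasaction="ignore",  # drop URL and image columns
        )
        writer.writeheader()
        writer.writerows(relations)

    return len(names)
```

Note that the second LOAD CSV statement only creates a relationship when both the subject and the object already exist as star nodes, which is why the sketch builds the node list from both columns.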

You can then query the information separately.

// Check out all of someone's relationships
return (:star{starname:"Leslie Cheung"})-->();

// Check friends of someone's friends (up to 5 levels of relationships)
match p=(n:star{starname:"Leslie Cheung"})-[*..5]->() return p limit 50;

// Query for a specific relationship
match p=()-[:rel{relation:"Old love"}]->() return p LIMIT 25;

// Use the shortestPath function to query the shortest path between Leslie Cheung and Cheung Wai-kin
match p=shortestpath((:star{starname:"Leslie Cheung"})-[*..5]->(:star{starname:"Cheung Wai-kin"})) return p;
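To get a feel for what the shortest-path query with a five-hop limit is asking for, here is a small breadth-first search in plain Python. It is only a conceptual sketch over a list of directed (subject, object) pairs; the function name, the `max_hops` parameter and the placeholder node names are assumptions, and it makes no claim about how Neo4j implements the search.

```python
from collections import deque


def shortest_path(edges, start, goal, max_hops=5):
    """Shortest directed path from start to goal using at most max_hops edges.

    Returns the path as a list of node names, or None if there is none.
    """
    # Adjacency list: node -> nodes reachable over one outgoing edge
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # hop limit reached, do not expand further
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Placeholder names, not real relationship data
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "C")]
print(shortest_path(edges, "A", "C"))
```

Breadth-first search explores all one-hop paths before any two-hop path, so the first path that reaches the goal is a shortest one in number of hops.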

There are plenty of other interesting statements to learn and try on your own, and you can play with other data sets that match your interests.