A friend in a group chat asked how to quickly set up a search engine. After some searching, I found this project.
The code is on GitHub:
- Git:github.com/asciimoo/se…
Conveniently, the project already provides an official Docker image, so you can basically pull it down and start using it right away. Run:

```shell
cid=$(sudo docker ps -a | grep searx | awk '{print $1}')
echo searx cid is $cid
if [ "$cid" != "" ]; then
    sudo docker stop $cid
    sudo docker rm $cid
fi
sudo docker run -d --name searx -e IMAGE_PROXY=True -e BASE_URL=http://yourdomain.com -p 7777:8888 wonderfall/searx
```
Then it is ready to use: check that the Docker container is running normally, and the service should work.
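Once the container is up, you can query the instance's HTTP API. A small sketch that only builds the request URL (it assumes the port mapping `7777` from the command above, and that the instance has the JSON output format enabled; `searx_query_url` is a made-up helper, not part of searx):

```python
from urllib.parse import urlencode

def searx_query_url(base="http://localhost:7777", q="python", page=1):
    # searx accepts q, pageno and format parameters on its /search endpoint;
    # format=json must be allowed in the instance settings
    params = urlencode({"q": q, "pageno": page, "format": "json"})
    return "%s/search?%s" % (base, params)

print(searx_query_url(q="hello world"))
# http://localhost:7777/search?q=hello+world&pageno=1&format=json
```

Fetching that URL with `curl` or `requests` returns the aggregated results as JSON.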
Thinking

See how convenient that is? Now let's look at how the source code achieves this.

Opening up the code, the essence is simply to aggregate the results of the outgoing requests; the data source could just as well be a database or a file. Let's look at the core code:
```python
from urllib import urlencode
from json import loads
from collections import Iterable

search_url = None
url_query = None
content_query = None
title_query = None
suggestion_query = ''
results_query = ''

# parameters for engines with paging support
#
# whether the engine supports paging (overridden by the engine configuration)
paging = False
# number of results on each page
# (only needed if the site requires not a page number, but an offset)
page_size = 1
# number of the first page (usually 0 or 1)
first_page_num = 1


def iterate(iterable):
    if type(iterable) == dict:
        it = iterable.iteritems()
    else:
        it = enumerate(iterable)
    for index, value in it:
        yield str(index), value


def is_iterable(obj):
    if type(obj) == str:
        return False
    if type(obj) == unicode:
        return False
    return isinstance(obj, Iterable)


def parse(query):
    q = []
    for part in query.split('/'):
        if part == '':
            continue
        else:
            q.append(part)
    return q


def do_query(data, q):
    ret = []
    if not q:
        return ret

    qkey = q[0]

    for key, value in iterate(data):
        if len(q) == 1:
            if key == qkey:
                ret.append(value)
            elif is_iterable(value):
                ret.extend(do_query(value, q))
        else:
            if not is_iterable(value):
                continue
            if key == qkey:
                ret.extend(do_query(value, q[1:]))
            else:
                ret.extend(do_query(value, q))
    return ret


def query(data, query_string):
    q = parse(query_string)
    return do_query(data, q)


def request(query, params):
    query = urlencode({'q': query})[2:]

    fp = {'query': query}
    if paging and search_url.find('{pageno}') >= 0:
        fp['pageno'] = (params['pageno'] - 1) * page_size + first_page_num

    params['url'] = search_url.format(**fp)
    params['query'] = query

    return params


def response(resp):
    results = []
    json = loads(resp.text)
    if results_query:
        for result in query(json, results_query)[0]:
            url = query(result, url_query)[0]
            title = query(result, title_query)[0]
            content = query(result, content_query)[0]
            results.append({'url': url, 'title': title, 'content': content})
    else:
        for url, title, content in zip(
            query(json, url_query),
            query(json, title_query),
            query(json, content_query)
        ):
            results.append({'url': url, 'title': title, 'content': content})

    if not suggestion_query:
        return results
    for suggestion in query(json, suggestion_query):
        results.append({'suggestion': suggestion})
    return results
```
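To see what the `query`/`do_query` path matching actually does, here is a small Python 3 port of that logic (the engine above is Python 2), run against a fabricated JSON payload:

```python
from collections.abc import Iterable
from json import loads

def iterate(iterable):
    # dicts yield their items; lists yield (stringified index, value)
    it = iterable.items() if isinstance(iterable, dict) else enumerate(iterable)
    for index, value in it:
        yield str(index), value

def is_iterable(obj):
    # strings are iterable but should be treated as leaf values
    return not isinstance(obj, (str, bytes)) and isinstance(obj, Iterable)

def parse(query_string):
    # 'data/items/link' -> ['data', 'items', 'link'], skipping empty parts
    return [part for part in query_string.split('/') if part]

def do_query(data, q):
    ret = []
    if not q:
        return ret
    qkey = q[0]
    for key, value in iterate(data):
        if len(q) == 1:
            if key == qkey:
                ret.append(value)
            elif is_iterable(value):
                ret.extend(do_query(value, q))
        else:
            if not is_iterable(value):
                continue
            if key == qkey:
                ret.extend(do_query(value, q[1:]))
            else:
                ret.extend(do_query(value, q))
    return ret

def query(data, query_string):
    return do_query(data, parse(query_string))

# A made-up response payload, just to show the path syntax:
payload = loads('{"data": {"items": [{"link": "http://a", "name": "A"},'
                ' {"link": "http://b", "name": "B"}]}}')
print(query(payload, 'data/items/link'))  # ['http://a', 'http://b']
```

So an engine only needs to declare paths like `url_query = 'data/items/link'` and the generic code collects every matching value from the JSON response.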
Results
On every response we can easily customize the returned data, whether it comes from the network, a database, or a file. Taking this a step further: if we can hook the response result, we can return data we crawled ourselves. For sites like 1024, you could build a little engine for your own "hobbies". I won't paste that code here; feel free to play with it yourself. Combined with jieba word segmentation, you can take it even further.
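As a sketch of that idea: a homegrown engine can skip parsing `resp.text` entirely and serve rows from a local store instead. Everything below (`LOCAL_ROWS`, `local_engine_response`) is a hypothetical name, not searx API:

```python
# Rows that would normally come from your own crawler, DB, or file
LOCAL_ROWS = [
    {'url': 'http://example.com/1', 'title': 'First crawled page',
     'content': 'snippet one'},
    {'url': 'http://example.com/2', 'title': 'Second crawled page',
     'content': 'snippet two'},
]

def local_engine_response(keyword):
    # Naive substring filter standing in for a real full-text query;
    # this is where jieba could segment `keyword` into terms first.
    return [row for row in LOCAL_ROWS
            if keyword in row['title'] or keyword in row['content']]

print(local_engine_response('crawled'))
```

The returned dicts keep the same `url`/`title`/`content` shape the engine's `response()` produces, so they plug straight into the rest of the result pipeline.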