There are already so many torrent search engines in the world, so why bother building a new one?
Most torrent search engines out there run on very old technology, front end and back end alike. Old technology is classic and useful, but as an early adopter I decided to build a simple torrent search engine with the most up-to-date stack I could.
What technology is used?
Front end: among Vue, Angular, and React, I chose Vue, simply out of an inexplicable fondness for it. Sometimes that is just how it goes. Nuxt.js also got its 2.0 update in September, so I chose Vue without hesitation.
Back end: I weighed Koa, Gin, and Spring Boot for a long time. Because I had not written Java in a long while, I finally chose Spring Boot + JDK 11; writing Java with the feel of writing JavaScript turned out to be quite pleasant. Gin or Koa might be faster, but that gain would not mean much for my experimental site.
Full-text search: Elasticsearch was chosen over RediSearch, which can eat up too much memory as things get more complex.
How to make it?
Below I share the general process. Where complex principles are involved, please Google them; I do not think I could explain them simply.
About naming:
From the dozen or so domain names I had on hand, I picked:
btzhai.top
There are several websites with the same name in China, but this is not a problem.
About servers:
After many twists and turns, I bought a server in the US. Configuration: E5-1620 | 24 GB | 1 TB | 200 Mbps bandwidth, with live human support 24 hours a day. Since the site sits behind Cloudflare, no hardware DDoS protection is needed. It costs 1,200 RMB a month.
I tried a lot of servers along the way, and my impression is that the business of servers that need no ICP filing is really a mixed bag.
About crawlers:
Around the beginning of August, I finally got around to building the BT search engine.
The first thing I had to solve was the data source. In a DHT network, every node is both a server and a client. When you download a file through the DHT network, the download is announced to the network, and other nodes receive the unique identifier of the downloaded file, the InfoHash (sometimes jokingly called a "secret code"), along with its metadata: file name, size, creation time, and the files it contains. Using this mechanism, a DHT crawler can collect the popular downloads in the DHT network.
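To make the InfoHash concrete: in BitTorrent (v1) it is the SHA-1 digest of the bencoded `info` dictionary of a torrent, usually written as 40 hex characters. Below is a minimal sketch of that computation. The class name, method name, and sample bytes are hypothetical and only stand in for a real torrent; this is not code taken from any of the crawlers discussed here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a hex InfoHash from the raw bencoded "info" dictionary.
final class InfoHashSketch {

    /** Returns the SHA-1 digest of the given bytes as 40 lowercase hex characters. */
    static String sha1Hex(byte[] bencodedInfo) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(bencodedInfo);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // Made-up single-file info dictionary: {"length": 1024, "name": "demo.iso"}
        byte[] sample = "d6:lengthi1024e4:name8:demo.isoe".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha1Hex(sample));
    }
}
```

Because the hash depends only on the `info` dictionary, the same torrent seen from different nodes always yields the same InfoHash, which is what makes it usable as a document ID.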
Relying on a DHT crawler alone, the initial collection rate is theoretically about 400,000 a day, so 30 days would yield more than ten million. But nodes in the DHT network are not always downloading new files. In reality, unpopular torrents go untouched for years, while hot ones are downloaded by hundreds of thousands of people every day. So as the collection grows, more and more InfoHashes will be repeats, which slowly raises the so-called popularity of existing torrents without growing the collection. Yet with fewer than 10 million torrents, the site would not look good.
Where to get 10 million torrents became my main research problem at that time. First I took a few DHT crawlers from GitHub that looked good and modified them to store data directly into Elasticsearch and to automatically add 1 to the popularity when an InfoHash repeats.
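The "add 1 when an InfoHash repeats" rule can be sketched as follows. This is an in-memory stand-in, assuming a plain map instead of Elasticsearch; the class and method names are hypothetical, and the real crawlers write to the index rather than to a `ConcurrentHashMap`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the popularity counter kept per InfoHash.
final class PopularitySketch {
    private final Map<String, Integer> popularityByInfoHash = new ConcurrentHashMap<>();

    /**
     * Records one sighting of an InfoHash.
     * A new InfoHash starts at popularity 1; a repeated one is incremented by 1.
     * Returns true if the InfoHash had not been seen before.
     */
    boolean record(String infoHash) {
        int count = popularityByInfoHash.merge(infoHash, 1, Integer::sum);
        return count == 1;
    }

    /** Current popularity, or 0 if the InfoHash has never been seen. */
    int popularity(String infoHash) {
        return popularityByInfoHash.getOrDefault(infoHash, 0);
    }

    /** Number of distinct InfoHashes collected so far. */
    int size() {
        return popularityByInfoHash.size();
    }

    public static void main(String[] args) {
        PopularitySketch index = new PopularitySketch();
        index.record("aaaa");
        index.record("aaaa");
        index.record("bbbb");
        System.out.println(index.popularity("aaaa") + " sightings, " + index.size() + " torrents");
    }
}
```

The sketch also shows the effect described above: repeats raise `popularity` while `size()` stays flat.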
The Elasticsearch mapping is as follows. smartcn is used for Chinese word segmentation; ik would also work. The list of files in a torrent is stored in files; the nested type was dropped for it because nested query performance was poor:
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "smartcn",
      "search_analyzer": "smartcn"
    },
    "length": {
      "type": "long"
    },
    "popularity": {
      "type": "integer"
    },
    "create_time": {
      "type": "date",
      "format": "epoch_millis"
    },
    "files": {
      "properties": {
        "length": {
          "type": "long"
        },
        "path": {
          "type": "text",
          "analyzer": "smartcn",
          "search_analyzer": "smartcn"
        }
      }
    }
  }
}
The DHT crawlers then ran on the server around the clock. I also tried open source crawlers written in many different languages to compare performance, and even tried to find people to buy torrent data from. Here are some of the crawlers I actually used:
https://github.com/shiyanhui/dht
https://github.com/fanpei91/p2pspider
https://github.com/fanpei91/simDHT
https://github.com/keenwon/antcolony
https://github.com/wenguonideshou/zsky
However, testing showed that each of these DHT crawlers has problems of one kind or another: some collect only the InfoHash but not the metadata, some are not fast enough, and some use more and more resources the longer they run.
In the end I settled on the best option:
github.com/neoql/btlet
The only problem is that it crashes and exits after running for a while (about 10 hours), which may be related to the collection speed. A few days before I wrote this article, the author said the problem had been fixed, but I have not had time to follow up. This is arguably the fastest DHT crawler I have tried. Interested readers can try it out and send PRs.
Once the crawler was running stably, I finally found a solution to the collection-size problem: the database dump released after SkyTorrent shut down, and OpenBay. With these 40 million InfoHashes plus Bthub, tens of thousands of new metadata records are guaranteed every day.
One thing to note about Bthub: if the API request frequency is too high, the IP gets blocked. I asked about this by email, and after repeated testing I set the API request interval to 1s.
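A fixed request interval like this can be enforced with a small throttle. The sketch below is an assumption about how one might do it, not the author's code: the class name and methods are hypothetical, and it does not call Bthub or reflect any documented Bthub limit beyond the 1s interval mentioned above.

```java
// Hypothetical throttle that spaces requests at least intervalMillis apart.
final class FixedIntervalThrottle {
    private final long intervalMillis;
    private long nextAllowedAt = 0L; // earliest time (ms) the next request may start

    FixedIntervalThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Reserves the next request slot.
     * Returns how many milliseconds the caller must wait from nowMillis before sending.
     */
    synchronized long reserve(long nowMillis) {
        long startAt = Math.max(nowMillis, nextAllowedAt);
        nextAllowedAt = startAt + intervalMillis;
        return startAt - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot begins. */
    void acquire() throws InterruptedException {
        long waitMillis = reserve(System.currentTimeMillis());
        if (waitMillis > 0) {
            Thread.sleep(waitMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FixedIntervalThrottle throttle = new FixedIntervalThrottle(1000);
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            // The actual API request would go here.
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

Separating `reserve` from `acquire` keeps the timing logic testable without sleeping, and the `synchronized` slot reservation keeps the spacing intact even if several threads share one throttle.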
About the front end:
I tend to sketch a simple front end first and then write the back end. Once the front end is clear, I can write the corresponding interfaces quickly. For now, the following features are enough for a BT search engine:
- You can search for keywords
- The home page displays the top 10 previously searched keywords
- Some files can be recommended randomly
- You can sort by relevance, size, creation time, and popularity
To speed things up, the home page data is read from a cache on the back end, and the cache is refreshed automatically every day using @Scheduled. The cached data includes InfoHashes, randomly recommended file names, the top 10 search terms, etc.
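The daily cache refresh can be sketched without Spring as follows. Note that this swaps Spring's @Scheduled for a plain `ScheduledExecutorService` so the example stays self-contained; the class name, the `loader` supplier, and the choice to cache only the keyword list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical cache for home page data; "loader" stands in for the slow back-end query.
final class HomePageCacheSketch {
    private final Supplier<List<String>> loader;
    private final AtomicReference<List<String>> topKeywords =
            new AtomicReference<>(Collections.<String>emptyList());

    HomePageCacheSketch(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    /** Reloads the data and swaps in an immutable snapshot. */
    void refresh() {
        List<String> snapshot = Collections.unmodifiableList(new ArrayList<>(loader.get()));
        topKeywords.set(snapshot);
    }

    /** Served to the front end; never triggers a query itself. */
    List<String> topKeywords() {
        return topKeywords.get();
    }

    /** Refreshes now and then every 24 hours. The caller must shut the scheduler down. */
    ScheduledExecutorService startDailyRefresh() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep serving the previous snapshot; an uncaught exception would cancel the schedule.
            }
        }, 0, 24, TimeUnit.HOURS);
        return scheduler;
    }

    public static void main(String[] args) {
        List<String> keywords = new ArrayList<>();
        keywords.add("ubuntu");
        HomePageCacheSketch cache = new HomePageCacheSketch(() -> keywords);
        cache.refresh();
        System.out.println(cache.topKeywords());
    }
}
```

Readers only ever see a complete snapshot, so a slow refresh never blocks the home page.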
Submitting a search leads to the results page, which shows only the Elasticsearch results after highlight processing rather than the full original results, 10 results per page.
The full original result is shown on the final detail page.
Another important front-end concern is SEO, which is why I use Nuxt.js. After the front end was complete, I added a meta description, Google Analytics, and Baidu to it.
Adding a sitemap took some time: because the pages are dynamic, the sitemap has to be generated dynamically using Nuxt-sitemap.
In addition, I used media queries plus vh and vw units for mobile adaptation. I cannot claim 100%, but it should fit at least 90% of devices.
About the back end:
I ran into a problem with Spring Data when implementing the core search API. Written as JSON, the core search looks roughly like this:
{
  "from": 0,
  "size": 10,
  "sort": [
    { "_score": "desc" },
    { "length": "desc" },
    { "popularity": "desc" },
    { "create_time": "desc" }
  ],
  "query": {
    "multi_match": {
      "query": "Here are the search terms.",
      "fields": ["name", "files.path"]
    }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "name": { "number_of_fragments": 1, "no_match_size": 150 },
      "files.path": { "number_of_fragments": 3, "no_match_size": 150 }
    }
  }
}
The result returned by highlight cannot be mapped onto the entity automatically, because this part of the data is not in the source, so Spring Data cannot retrieve it through getSourceAsMap. Here you need to use NativeSearchQueryBuilder and map the results manually. If there is a better way, please feel free to comment. The Java code is as follows:
// Imports this snippet relies on (in addition to java.util.*):
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.SearchResultMapper;
import org.springframework.data.elasticsearch.core.aggregation.AggregatedPage;
import org.springframework.data.elasticsearch.core.aggregation.impl.AggregatedPageImpl;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;

// Build the query: multi_match on name and files.path, with highlighting on both fields
var searchQuery = new NativeSearchQueryBuilder()
        .withIndices("torrent_info").withTypes("common")
        .withQuery(QueryBuilders.multiMatchQuery(param.getKeyword(), "name", "files.path"))
        .withHighlightFields(
                new HighlightBuilder.Field("name").preTags("<strong>").postTags("</strong>").noMatchSize(150).numOfFragments(1),
                new HighlightBuilder.Field("files.path").preTags("<strong>").postTags("</strong>").noMatchSize(150).numOfFragments(3))
        .withPageable(PageRequest.of(param.getPageNum(), param.getPageSize(), sort))
        .build();
var torrentInfoPage = elasticsearchTemplate.queryForPage(searchQuery, TorrentInfoDo.class, new SearchResultMapper() {
    @SuppressWarnings("unchecked")
    @Override
    public <T> AggregatedPage<T> mapResults(SearchResponse searchResponse, Class<T> aClass, Pageable pageable) {
        if (searchResponse.getHits().getHits().length <= 0) {
            return null;
        }
        var chunk = new ArrayList<>();
        for (var searchHit : searchResponse.getHits()) {
            // Set the info section from _source
            var torrentInfo = new TorrentInfoDo();
            torrentInfo.setId(searchHit.getId());
            torrentInfo.setName((String) searchHit.getSourceAsMap().get("name"));
            torrentInfo.setLength(Long.parseLong(searchHit.getSourceAsMap().get("length").toString()));
            torrentInfo.setCreate_time(Long.parseLong(searchHit.getSourceAsMap().get("create_time").toString()));
            torrentInfo.setPopularity((Integer) searchHit.getSourceAsMap().get("popularity"));
            // ArrayList<Map> -> Map -> FileList -> List<FileList>
            var resList = ((ArrayList<Map>) searchHit.getSourceAsMap().get("files"));
            var fileList = new ArrayList<FileList>();
            for (var map : resList) {
                FileList file = new FileList();
                file.setPath((String) map.get("path"));
                file.setLength(Long.parseLong(map.get("length").toString()));
                fileList.add(file);
            }
            torrentInfo.setFiles(fileList);
            // Set the highlight section
            // Torrent name highlight (usually only one fragment)
            var nameHighlight = searchHit.getHighlightFields().get("name").getFragments()[0].toString();
            // Path highlight list
            var pathHighlight = getFileListFromHighLightFields(searchHit.getHighlightFields().get("files.path").fragments(), fileList);
            torrentInfo.setNameHighLight(nameHighlight);
            torrentInfo.setPathHighlight(pathHighlight);
            chunk.add(torrentInfo);
        }
        if (chunk.size() > 0) {
            // Without the total, the page result is not returned correctly
            return new AggregatedPageImpl<>((List<T>) chunk, pageable, searchResponse.getHits().getTotalHits());
        }
        return null;
    }
});
About Elasticsearch:
Torrent search does not need to be very real-time, and a single server does not need replicas, so the index settings are:
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "refresh_interval": "90s"
  }
}
The JVM is configured with 8 GB of heap memory and G1GC:
## IMPORTANT: JVM heap size
-Xms8g
-Xmx8g
## GC configuration
-XX:+UseG1GC
-XX:MaxGCPauseMillis=50
How’s it working?
Because the searches are fairly complex, the average search takes about 1s, and a search that hits millions of documents takes more than 2s.
Here are Cloudflare's statistics: