URL deduplication comes up frequently in both daily work and interviews. Well-known Internet companies such as Alibaba, NetEase Cloud, Youku, and Zuoyebang have all asked similar interview questions, and closely related problems, such as IP black/white-list checks, appear often in everyday work. So in this article we will take apart the URL deduplication problem.
URL deduplication ideas
Regardless of the business scenario or the data volume, any of the following schemes can be used to detect duplicate URLs:
- Use a Java Set, which only stores distinct elements, and judge whether a URL is a duplicate from the result of adding it (a successful add means the URL is new);
- Use a Redis Set and likewise judge duplicates from the result of the add operation;
- Store the URLs in a database and check for duplicates with a SQL query;
- Put a unique index on the database's URL column and judge duplicates from whether the insert succeeds;
- Use Guava's Bloom filter to deduplicate URLs;
- Use a Redis Bloom filter to deduplicate URLs.
The concrete implementations of these schemes are described below.
URL deduplication implementation schemes
1. Deduplicate with a Java Set
A Set only stores distinct elements; adding a value that is already present fails. We can therefore judge whether a URL is a duplicate from the return value of add(). The code is as follows:
import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn", "www.baidu.com", "www.apigo.cn"
    };

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean result = set.add(url);
            if (!result) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            }
        }
    }
}
The execution result of the program is:
The URL already exists: www.apigo.cn
As the result shows, URL deduplication can be implemented with a Set.
2. Deduplicate with a Redis Set
A Redis Set works the same way as the Java Set: it relies on the uniqueness of set members. We first use the Redis command-line client redis-cli to check for duplicate URLs: when SADD returns 1, the add succeeded and the URL is new; when it returns 0, the add failed and the URL already exists.
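The redis-cli interaction can be sketched as follows (the key name `urlrepeat` matches the Java example below; the integer replies are SADD's standard return values, 1 for a newly added member and 0 for an existing one):

```
127.0.0.1:6379> SADD urlrepeat www.apigo.cn
(integer) 1
127.0.0.1:6379> SADD urlrepeat www.baidu.com
(integer) 1
127.0.0.1:6379> SADD urlrepeat www.apigo.cn
(integer) 0
```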
Implemented in code, Redis Set deduplication looks like this:
// URLs to deduplicate
public static final String[] URLS = {
        "www.apigo.cn", "www.baidu.com", "www.apigo.cn"
};

@Autowired
private RedisTemplate redisTemplate;

@RequestMapping("/url")
public void urlRepeat() {
    for (int i = 0; i < URLS.length; i++) {
        String url = URLS[i];
        // SADD returns the number of members actually added (0 for a duplicate)
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result != null && result == 0) {
            // Duplicate URL
            System.out.println("URL already exists: " + url);
        }
    }
}
The execution result of the above program is as follows:
The URL already exists: www.apigo.cn
The code above uses the RedisTemplate from Spring Data. To use the RedisTemplate object in a Spring Boot project, we first need to add the spring-boot-starter-data-redis dependency:
<!-- Add the RedisTemplate (Spring Data Redis) dependency -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
Then configure the Redis connection information in the project's application.properties:
spring.redis.port=6379
# Redis server password
spring.redis.password=123456
After these two steps, we can use the RedisTemplate object to operate Redis in our Spring Boot project.
3. Database deduplication
We can also use a database to detect duplicate URLs. First we design a table to store the URLs; the corresponding SQL is as follows:
/*==============================================================*/
/* Table: urlinfo                                               */
/*==============================================================*/
create table urlinfo
(
    id    int not null auto_increment,
    url   varchar(1000),
    ctime date,
    del   boolean,
    primary key (id)
);

/*==============================================================*/
/* Index: Index_url                                             */
/*==============================================================*/
create index Index_url on urlinfo
(
    url
);
The id column is an auto-increment primary key, and the url column is indexed to speed up queries.
We first insert two test rows into the table, then check for a given URL with a count query along the lines of `select count(1) from urlinfo where url='www.apigo.cn';`. If the result is greater than 0, the URL already exists in the table; otherwise it does not.
4. Unique index deduplication
We can also use a database unique index to prevent URL duplication; the idea is much the same as the Set approach.
First we create a unique index on the url field, then insert the URL data. If the insert succeeds, the URL is new; if it fails, the URL is a duplicate.
The SQL implementation for creating a unique index is as follows:
create unique index Index_url on urlinfo
(
url
);
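With the unique index in place, duplicate detection happens at the database level: the first insert succeeds and a second identical insert is rejected. A sketch, assuming MySQL (the exact error text varies by version; also note that a unique index over a full varchar(1000) column can exceed MySQL's index key-length limit, in which case a shorter column would be needed):

```sql
insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false);
-- Query OK, 1 row affected

insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false);
-- ERROR 1062 (23000): Duplicate entry 'www.apigo.cn' for key 'Index_url'
```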
5. Deduplicate with Guava's Bloom filter
The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a long binary vector plus a series of random mapping (hash) functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general-purpose algorithms; its drawbacks are a certain false-positive rate and the difficulty of deleting elements.
The core of a Bloom filter is a large bit array and several hash functions. Suppose the bit array has length m and there are k hash functions.

The working process is as follows. Suppose the set contains three elements {x, y, z} and k = 3. First the bit array is initialized, with every bit set to 0. Each element of the set is then fed through the three hash functions in turn; each hash value corresponds to a position in the bit array, and that position is set to 1.

To check whether an element w is in the set, w is hashed by the same three functions to three positions in the array. If any of the three positions is not 1, the element is definitely not in the set. Conversely, if all three positions are 1, the element only may be in the set; this cannot be asserted with certainty, which is where the false-positive rate comes from. For example, suppose an element maps to positions 4, 5, and 6, and all three bits are 1, but those bits were in fact set by different elements. The element is then not in the set even though all of its bits read 1.
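To make the bit-array-plus-k-hash-functions idea concrete, here is a minimal, illustrative sketch in Java. The class name, the sizing, and the double-hashing scheme are all choices made up for this example, not part of any library; use a tested implementation such as Guava's for real workloads.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array of length m and k hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // length of the bit array
    private final int k; // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th position from two base hashes (double hashing; an
    // illustrative scheme, not a standard one).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, m);
    }

    // Mark the element's k positions in the bit array as 1.
    public void put(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // If any of the k positions is 0, the element is definitely absent;
    // if all are 1, it is only *probably* present (false positives possible).
    public boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this sizing, inserted elements are always reported as present (a Bloom filter has no false negatives), while absent elements are only occasionally misreported, matching the behavior described above.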
We can operate a Bloom filter through the Guava library provided by Google. Add the Guava dependency to pom.xml:
<!-- Add the Guava dependency -->
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>
The URL deduplication code:
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn", "www.baidu.com", "www.apigo.cn"
    };

    public static void main(String[] args) {
        // Create a Bloom filter
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10,    // expected number of insertions
                0.01); // expected false-positive rate
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            if (filter.mightContain(url)) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                // Store the URL in the Bloom filter
                filter.put(url);
            }
        }
    }
}
The execution result of the above program is as follows:
The URL already exists: www.apigo.cn
6. Deduplicate with a Redis Bloom filter
Besides Guava's Bloom filter, we can also use Redis's Bloom filter to implement URL deduplication. Before using it, we must ensure that the Redis server version is 4.0 or higher (only from this version are modules, and therefore the Bloom filter, supported), and that the RedisBloom module is loaded so the Bloom filter commands work normally.
Taking Docker as an example, we first download the RedisBloom module, then load it when starting the Redis service.
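One common way to get this setup with Docker is the redislabs/rebloom image, which bundles Redis with the RedisBloom module preloaded (the image name and tag here are one possible choice, not the only way):

```shell
docker run -d --name redis-bloom -p 6379:6379 redislabs/rebloom:latest
```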
Once the Bloom filter module is enabled, we first use the Redis command-line client redis-cli to perform URL duplicate checks. Redis offers only a handful of Bloom filter commands, mainly the following:
- bf.add: add an element;
- bf.exists: check whether an element exists;
- bf.madd: add multiple elements;
- bf.mexists: check whether multiple elements exist;
- bf.reserve: create a Bloom filter with a specified accuracy (error rate and capacity).
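As a quick sketch of the first two commands in redis-cli (the key name `urls` is illustrative): bf.add returns 1 when the element was newly added and 0 when it may already exist, and bf.exists returns 1 or 0 accordingly:

```
127.0.0.1:6379> bf.add urls www.apigo.cn
(integer) 1
127.0.0.1:6379> bf.add urls www.apigo.cn
(integer) 0
127.0.0.1:6379> bf.exists urls www.apigo.cn
(integer) 1
```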
Next we use code to demonstrate the use of Redis Bloom filter:
import redis.clients.jedis.Jedis;
import utils.JedisUtils;
import java.util.Arrays;

public class BloomExample {
    // Bloom filter key
    private static final String _KEY = "URLREPEAT_KEY";
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn", "www.baidu.com", "www.apigo.cn"
    };

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }

    /**
     * Add an element.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        return result.equals(1L);
    }

    /**
     * Check whether an element exists.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        return result.equals(1L);
    }
}
The execution result of the above program is as follows:
The URL already exists: www.apigo.cn
Conclusion
This article introduced six URL deduplication schemes. Among them, the Redis Set, the Redis Bloom filter, the database query, and the unique index are all suitable for distributed systems. For a distributed system with massive amounts of data, the Redis Bloom filter is the recommended choice; for massive data on a single machine, Guava's Bloom filter is recommended.