This is the 7th day of my participation in Gwen Challenge

Original: Telami’s Blog, welcome to share, reprint please reserve source.

I am working overtime in the company today. This is the background.

The reason for working overtime was that I went online and solved the problem of duplicate data in the online database. I found the bug of the program and solved it well. The problem was that I corrected the duplicate data in the online database.

There are 6 tables with duplicate data in the online library, two of which are large, one with 960,000 + and the other with 300,000 +. Since we have dealt with the same problem before, we directly used the Python deduplication script last time. The script is very simple, which is to connect to the database, find the duplicate data, and delete it circulatively.

Emmmm, but the efficiency is really too low, one second, about 20,000 + duplicate data, estimated time is about 8 hours…

There is something wrong with blindly relying on what our predecessors have done instead of thinking for ourselves. There’s something wrong with thinking about how you could have done it before and how you can’t do it now!

I found that I was not quite in the right state recently, and lost the desire to explore and seek knowledge. Today, IT was a wake-up call and I felt quite lost.

Without further ado, the following steps are detailed.

CREATE TABLE `animal` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL.`age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`))ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;


INSERT INTO `pilipa_dds`.`student` (`id`.`name`.`age`) VALUES ('1'.'cat'.'12');
INSERT INTO `pilipa_dds`.`student` (`id`.`name`.`age`) VALUES ('2'.'dog'.'13');
INSERT INTO `pilipa_dds`.`student` (`id`.`name`.`age`) VALUES ('3'.'camel'.'25');
INSERT INTO `pilipa_dds`.`student` (`id`.`name`.`age`) VALUES ('4'.'cat'.'32');
INSERT INTO `pilipa_dds`.`student` (`id`.`name`.`age`) VALUES ('5'.'dog'.The '42');
Copy the code

Goal: We want to remove data with the same name.

Let’s take a look at what’s duplicated

select name.count(1) from student GROUP BY name having count(1) > 1;

name	count(1)
cat		2
dog		2
Copy the code

The data whose names are cat and dog are duplicated, and there are two duplicate data for each.

Select * From table_name Where table_name = 1 and table_name = 1 Select * From table_name Where table_name = 1 and table_name = 1 and table_name = 1

Delete all duplicate data. None is left

An error will be reported if you delete it directly

DELETE FROM student WHERE NAME IN ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1)

1093 - You can't specify target table 'student'For update in FROM clause, Time: 0.016000sCopy the code

A deadlock occurs when a table is updated and a table is queried, and a table is queried and a table is updated. Mysql does not support updating the same table

Solution: Query the columns to be updated as a third-party table and filter the updates.

DELETE FROM student WHERE NAME IN (select t.name from ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1)t)
Copy the code

Delete Delete only one duplicate data in the table

Before deleting, we can check what kind of duplicate data we want to delete

SELECT * FROM student WHERE id NOT IN ( SELECT t.id FROM ( SELECT MIN(id) AS id FROM student GROUP BY `name` ) t )
Copy the code

First, group the data with the smallest ID by name, and find the data with the smallest ID. These data are the spark that we want to leave. Then query the data with the least ID, which is the duplicate data we want to delete.

The system starts to delete all but one duplicate data

It’s very simple. Replace select with delete

delete FROM student WHERE id NOT IN ( SELECT t.id FROM ( SELECT MIN(id) AS id FROM student GROUP BY `name` ) t )
Copy the code

900,000 + tables are super fast to execute.

All done 👏 👏 👏 👏 ~