
Foreword

This article analyzes the award data from the last three years of the “Chinese College Students Computer Design Competition” (the 2021 results have not yet been announced), mining some of the deeper patterns in the competition, mainly covering the following points:

  • The proportion distribution of each award in each year
  • Top10 schools with the most awards each year
  • Which schools have repeatedly entered the Top10 for the most awards
  • How many times each school entered over the three years
  • The school level distribution of participating schools
  • The relationship between team size and awards
  • Hot words in the titles of winning works

Each part of the article involves a fair amount of code; for the sake of the reading experience, not all of it is displayed. If you need the code and related files, you can send me a private message.


Data reading and description

The data set is the official data provided by the competition. The 2018 and 2019 data are .xlsx files, while the 2020 data is a .pdf file. First read the 2018 and 2019 data and observe the data set information.

import pandas as pd

df_2018 = pd.read_excel('Final 2018 official results.xlsx', sheet_name='123 award')
df_2019 = pd.read_excel('2019 Award list published.xlsx')

df_2019.info()

As can be seen from the output above, the data set formats of 2018 and 2019 differ considerably and will need to be unified before merging.

Since the 2020 data is a .pdf file, we define a separate function to read it. Some reading details are marked in the code as comments.

import pdfplumber

def read_pdf_2020(read_path):
    pdf_2020 = pdfplumber.open(read_path)
    result_df = pd.DataFrame()
    for page in pdf_2020.pages:
        table = page.extract_table()
        # The first row of each page's table is treated as the header
        df_detail = pd.DataFrame(table[1:], columns=table[0])
        # Merge data sets per page
        result_df = pd.concat([df_detail, result_df], ignore_index=True)
    pdf_2020.close()
    # Delete columns whose values are all NaN
    result_df.dropna(axis=1, how='all', inplace=True)
    # Reset column names
    result_df.columns = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']
    return result_df

df_2020 = read_pdf_2020('2020 Chinese College Students Computer Design Competition Winners list.pdf')

Observing the 2020 data: unlike the previous two years, no column has missing values; however, the 2020 award information does not contain a work category column, so that column must be deleted from the two earlier years during processing. We can therefore use the 2020 dataset's format as a template, convert the two earlier years into the same format, and then merge.


Data preprocessing

Data set formatting for each year

Following the 2020 format, some columns of the 2018 and 2019 datasets are merged, the column names are unified, redundant columns are deleted, and a “year” column is added; a minimal sketch of this step is shown below.
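The full reformatting code is not shown; below is a minimal sketch of the idea. The rename maps are hypothetical placeholders, since the actual 2018/2019 column names differ per year and are not reproduced here.

# Target template: the 2020 column layout
template_cols = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']

def unify_format(df, year, rename_map):
    # Rename to the 2020 template, drop redundant columns, add the year
    out = df.rename(columns=rename_map)[template_cols].copy()
    out['year'] = year
    return out

# The rename maps below are hypothetical placeholders
clean_df_2018 = unify_format(df_2018, 2018, {'Prize': 'award', 'School': 'Participating Schools'})
clean_df_2019 = unify_format(df_2019, 2019, {'Prize': 'award', 'School': 'Participating Schools'})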

Below are the processed data for 2018 and 2019.

For the 2020 dataset, note that the data is read page by page. If the last row of a page is too long and wraps across the page break, the wrapped remainder appears as a fragment row at the top of the next page with most fields missing, as shown below.

In this case, we first find the fragment rows (their award field is blank), append their content to the previous row, and then drop them.

# Data set processing for 2020
clean_df_2020 = df_2020.copy()

# Some fields are too long and were split across a page break, appearing on two separate pages. Append the fragment data back to the previous row.
clean_df_2020.loc[609, 'Participating Schools'] += 'Medical University'
clean_df_2020.loc[1330, 'Title of Work'] += 'danxia'
clean_df_2020.loc[2121, 'Title of Work'] += 'now'
clean_df_2020.loc[2997, 'Title of Work'] += 'Cloud Platform'

del_index = clean_df_2020.loc[clean_df_2020['award'] == ''].index
clean_df_2020.drop(del_index, inplace=True)
clean_df_2020.reset_index(drop=True, inplace=True)
clean_df_2020['year'] = [2020 for _ in range(len(clean_df_2020))]


Data consolidation

Now combine three years of data. The merged data set is as follows.
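The merging code is a one-liner; a minimal sketch, assuming the per-year frames were unified as in the earlier sketch:

# Stack the three unified data sets into a single frame
all_df = pd.concat([clean_df_2018, clean_df_2019, clean_df_2020], ignore_index=True)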

Data cleaning

Now we need to do some processing on the combined data set for better analysis and visualization. Since we will later use some basic information about colleges and universities in China, such as the school level (985, 211, etc.), we import college_info.csv. This data was crawled by the blogger on June 15, so some vocational colleges may not be included; those schools are labeled “No data at present”. We also remove the line breaks “\n” from the school names and work titles, add a column recording the number of authors of each work, and a column recording the number of advisors.

college_info = pd.read_csv('college_info.csv')
college_name = college_info['Name of School'].values.tolist()
college_level = []
for college in all_df['Participating Schools']:
    if college not in college_name:
        college_level.append('No data at present')
    else:
        college_level.append(college_info['School level'][college_name.index(college)])
all_df['School level'] = college_level
all_df['Participating Schools'] = all_df['Participating Schools'].str.replace('\n|\r', '', regex=True)
all_df['Title of Work'] = all_df['Title of Work'].str.replace('\n|\r', '', regex=True)

# Delete rows whose author field is empty
all_df.dropna(subset=['the writer'], axis=0, inplace=True)
# Add a column recording the number of authors of each work
all_df['Number of entries'] = all_df['the writer'].apply(lambda x: len(x.split(', ')))
count_list = []
for index, row in all_df.iterrows():
    try:
        count_list.append(len(row['Advisor'].split(', ')))
    except AttributeError:
        # Advisor field is NaN, i.e. no advisor recorded
        count_list.append(0)
all_df['Number of instructors'] = count_list

all_df.to_csv('all_df.csv', index=False)
all_df

The processed data set is as follows.


Data analysis and visualization

Distribution of awards in each year

The proportions of the first, second and third prizes over the three years are counted and a stacked bar chart is drawn; a sketch of the statistics step follows.
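The statistics code is omitted in the original; below is a minimal sketch of one way to build list1, list2 and list3, assuming the award column holds the literal prize labels used in the chart (an assumption). The {"value", "percent"} dict format matches what the JsCode label formatter below expects.

list1, list2, list3 = [], [], []
for year in [2018, 2019, 2020]:
    counts = all_df.loc[all_df['year'] == year, 'award'].value_counts()
    total = counts.sum()
    # One (value, percent) pair per prize level per year
    for lst, prize in zip([list1, list2, list3], ['Third prize', 'Second prize', 'First Prize']):
        lst.append({"value": int(counts.get(prize, 0)), "percent": counts.get(prize, 0) / total})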

"Statistics omitted"

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis([2018, 2019, 2020])
c.add_yaxis("Third prize", list1, stack="stack1", category_gap="70%")
c.add_yaxis("Second prize", list2, stack="stack1", category_gap="70%")
c.add_yaxis("First Prize", list3, stack="stack1", category_gap="70%")
c.set_series_opts(label_opts=opts.LabelOpts(
                    position="right",
                    formatter=JsCode(
                        "function(x){return Number(x.data.percent * 100).toFixed() + '%'; }"
                    ),
                )
            )
c.render("./images/ Stacked bar chart of awards distribution by year.html")
c.render_notebook()

As the figure shows, over time the proportions of the first and second prizes decreased while the proportion of the third prize increased, reaching 68% in 2020.


Top10 schools with the most awards each year

Count the number of prizes won by the ten most-awarded schools in each year and draw a chart; a sketch of the counting step follows.
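The counting and plotting code is not shown; a minimal sketch of the counting step with value_counts (the chart itself is omitted):

# For each year, the ten schools with the most winning entries
top10_by_year = {}
for year in [2018, 2019, 2020]:
    school_counts = all_df.loc[all_df['year'] == year, 'Participating Schools'].value_counts()
    top10_by_year[year] = school_counts.head(10)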

As you can see from the graph above, many universities appear in the Top10 more than once, partly because of the importance those schools attach to the competition.

Here’s a closer look at which schools repeatedly made the Top10, using the Venn diagram sketched below.
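A minimal sketch of the Venn diagram, assuming the matplotlib_venn package is installed and reusing top10_by_year from the sketch above:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# One set of Top10 school names per year, drawn as a three-way Venn diagram
year_sets = [set(top10_by_year[y].index) for y in [2018, 2019, 2020]]
venn3(year_sets, set_labels=('2018', '2019', '2020'))
plt.savefig('./images/top10_venn.png')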

Shenyang Normal University, Shenyang Institute of Technology and Liaoning University of Technology made the Top10 in all three years, and a number of other schools, notably in the northeast, made the Top10 twice.


Statistics on school participation counts

Now count how many of the three years each school entered, and tally how many schools fall into each participation count.

from collections import Counter
all_school = []
for year in [2018, 2019, 2020]:
    school_set = set(all_df.loc[all_df['year'] == year, 'Participating Schools'].values.tolist())
    all_school += list(school_set)
value_count = Counter(all_school)
count_list = ['participated ' + str(n) + ' time(s)' for n in value_count.values()]
counter = Counter(count_list)

from pyecharts.charts import Pie

c = Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add(""[list(z) for z in zip(counter.keys(), counter.values())])
c.set_global_opts(title_opts=opts.TitleOpts(title="Pie- Basic Example"))
c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
c.render("./images/ Pie chart of attendance by schools. HTML")
c.render_notebook()

Over these three years, about half of the schools participated all three times, and about a quarter each participated once or twice. Schools that enter tend to keep entering the next edition, which suggests the competition is attractive to them.


School level distribution by year

The levels of the participating schools in each year are counted to observe the level distribution; a sketch of the counting step follows.
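The counting and plotting code is not shown; a minimal sketch of the per-year counts with pd.crosstab:

# Entries per school level per year; to count distinct schools instead,
# drop duplicate (year, school) pairs first
level_by_year = pd.crosstab(all_df['School level'], all_df['year'])
print(level_by_year)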

In the past three years, most participants came from general undergraduate schools, followed by 211 schools, and the number of schools at every level increased year by year, with general undergraduate growing the most. It can be seen that, with the publicity of the competition and the popularization of computers, more and more people are paying attention to computer competitions. (As for the “no data” column, part of the reason is that the school's information was not crawled; it may also be caused by contestants making mistakes when filling in the school name.)


Number of participants and distribution of awards

Using the number of authors and the number of instructors, count how often each combination appears within each award level and draw the chart below; a sketch of the counting step follows.
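The counting code is not shown; a minimal sketch with groupby, counting how often each (authors, instructors) combination appears within each award level:

combo_counts = (all_df.groupby(['award', 'Number of entries', 'Number of instructors'])
                      .size()
                      .reset_index(name='count'))
print(combo_counts.sort_values('count', ascending=False).head())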

Within each award level, teams with three authors and two instructors won the most prizes, followed by three authors and one instructor; other lineups won noticeably fewer. It does not seem that a bigger team means better odds.


Hot words in the titles of winning works

Start by defining a function that loads the local stop word list.

def load_stopwords(read_path):
    """Read each line of the file and save it to a list.
    :param read_path: the path of the file to read
    :return: a list holding each line of the file
    """
    result = []
    with open(read_path, "r", encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip('\n')  # Remove line breaks for each element in the list
            result.append(line)
    return result

# load Chinese stop words
stopwords = load_stopwords('wordcloud_stopwords.txt')

Count the words after removing the stop words in the titles of all works and save them in the list.

import jieba

# Add a custom dictionary
jieba.load_userdict("Custom dictionary.txt")

token_list = []
# Word segmentation for the title content and save the word segmentation results in the list
for name in all_df['Title of Work']:
    tokens = jieba.lcut(name, cut_all=False)
    token_list += [token for token in tokens if token not in stopwords]
len(token_list)

The frequency of each word in the list was counted, and the top 100 words were taken as the hot words to draw the word cloud map.

from pyecharts.charts import WordCloud
from collections import Counter

token_count_list = Counter(token_list).most_common(100)
new_token_list = []
for token, count in token_count_list:
    new_token_list.append((token, str(count)))

c = WordCloud()   
c.add(series_name="Hot word", data_pair=new_token_list, word_size_range=[20, 200])
c.set_global_opts(
    title_opts=opts.TitleOpts(
        title="Winning words", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
    ),
    tooltip_opts=opts.TooltipOpts(is_show=True),
)
c.render("./images/ Award-winning buzzword.html")
c.render_notebook()

Observing the figure above, we can clearly see the current hot topics in the computer field, such as big data, artificial intelligence, algorithms, visualization, management systems, robots, and so on. These directions have long been hot in the computer industry and can also serve as paths for our own future development.


Conclusion

  • In recent years, the proportions of the first and second prizes of the computer design competition have fallen while the third prize's share has grown, which raises the difficulty, and the value, of winning a first or second prize.
  • Universities in the northeast pay particular attention to the competition: in terms of entries and winners, Shenyang Normal University, Shenyang Institute of Technology and Liaoning University of Technology have repeatedly ranked in the Top10.
  • The vast majority of schools participate in consecutive years; schools that entered all three of the last three years account for about half of all winning schools.
  • In the past three years, most participants came from general undergraduate schools, followed by 211 schools, and the number at every level increased year by year.
  • Teams with 3 authors and 2 instructors account for the largest share of winners, followed by 3 authors and 1 instructor. (The maximum number of authors is 5 and the maximum number of advisors is 2.)
  • The hot words in the work are big data, artificial intelligence, algorithm, visualization, management system, robot, etc.

That’s all for this article. If you found it helpful, ❤ give it a like before you go! ❤

I will continue to share data analysis articles like this one; if you are interested, you can follow me so you don't miss them~

For beginners of Python, or those who want to get started with Python, you can search for [A new vision of Python] on WeChat. Sometimes a simple question can stall you for a long time, while a hint from someone else can make everything suddenly clear. I sincerely hope we can make progress together.