Preface


  • I wanted a hands-on machine learning project, and I also had several final papers to write, so I combined the two.
  • To predict and analyze the innovation intensity of patents, I took an innovation intensity index validated by domestic and foreign studies as the target and predicted it from each patent's own attributes, such as patent age, inventor team size, knowledge heterogeneity and cooperation heterogeneity.
  • If you find any mistakes, corrections are welcome. Thank you.

1. Data sources and project description


This paper uses the USPTO patent database maintained by the United States Patent and Trademark Office as its data source and takes molecular biology and microbiology patents (U.S. classification 435) as the research sample. The sample contains 34,154 invention patents together with their descriptive information: patent number, title, abstract, grant date, U.S. classification number (CCL), inventors, inventors' cities, and references (patent references only).

The target is an innovation intensity index validated by domestic and foreign studies. It is predicted from five attributes of each patent: patent type, inventor team size, number of references, knowledge heterogeneity and cooperation heterogeneity (a review of the related literature suggests that these indicators are correlated with innovation). A machine learning regression method is used for the prediction. From the original data set, the following fields are therefore extracted for each patent: patent type (CCL), inventor team size (INVT_NUM), number of patent citations (REF_NUM), knowledge heterogeneity (DIF_REF), cooperation heterogeneity (DIF_CIT) and innovation intensity value (CD), as shown in the figure below.

2. Prepare and preliminarily analyze data


import pandas as pd
patents = pd.read_excel('435classdatav5.xlsx')

# Summary statistics of the data set
patents.describe()

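# Drop the serial-number column, which carries no information
patents = patents.drop("NO", axis=1)
import matplotlib.pyplot as plt
%matplotlib inline
# The original also set plt.rcParams['font.sans-serif']; the font name did not survive in the source
plt.rcParams['axes.unicode_minus'] = False
# Histogram of every numeric column
patents.hist(bins=50, figsize=(20, 15))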

Intuitively, there is little correlation between innovation intensity CD and other indicators

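# Correlation of every numeric column with CD
# (on pandas 2.0 or later, pass numeric_only=True so the text column CCL is skipped)
corr_matrix = patents.corr()
corr_matrix["CD"].sort_values(ascending=False)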

3. Data preprocessing

Separate training and test sets

Because the classification number in the data set is a discrete variable, the data is split 8:2 by stratified sampling on the classification number, so that each class appears in the training set and the test set in roughly the same proportion as in the full data set.

# Stratified sampling on the classification number
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(patents, patents["CCL"]):
    strat_train_set = patents.loc[train_index]
    strat_test_set = patents.loc[test_index]
print(strat_train_set.shape, '\n', strat_test_set.shape)

(27323, 6)
 (6831, 6)
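As a rough check of what stratification buys, the sketch below compares class proportions in the full data and in the test split. It runs on a hypothetical toy frame (the column values and the class labels A, B, C are made up for illustration), not on the patent data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 1,000 rows with an imbalanced category column.
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    "CCL": rng.choice(["A", "B", "C"], size=1000, p=[0.6, 0.3, 0.1]),
    "CD": rng.normal(size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["CCL"]))
train_set, test_set = toy.iloc[train_index], toy.iloc[test_index]

# Share of each class in the full data and in the test split.
overall = toy["CCL"].value_counts(normalize=True)
in_test = test_set["CCL"].value_counts(normalize=True)
max_gap = (overall - in_test).abs().max()
print("largest difference in class share:", max_gap)
```

With a plain random split the gap can be noticeably larger for rare classes, which is the reason for stratifying on CCL here.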

First, separate the features from the target: CD is the label and the remaining columns are the features.

# Features: drop the label column (drop returns a copy, so strat_train_set is unchanged)
patenting = strat_train_set.drop("CD", axis=1)
# Labels: a copy of the CD column
patenting_labels = strat_train_set["CD"].copy()

Data cleaning

There is some data processing to be done before formal model training:

  1. Missing value handling
  2. Outlier handling
  3. Discrete variable processing

For missing values there are three options: drop the samples that contain missing values (dropna), drop the entire feature that contains missing values (drop), or fill the missing values with some value such as 0, the mean or the median (fillna).
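The three options can be sketched with pandas as follows. The toy frame is hypothetical; it only borrows the column names from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two columns.
toy = pd.DataFrame({
    "INVT_NUM": [2, 5, np.nan, 3],
    "REF_NUM": [10, np.nan, 7, 4],
    "CD": [0.1, -0.2, 0.3, 0.0],
})

# Option 1: drop the rows where INVT_NUM is missing.
dropped_rows = toy.dropna(subset=["INVT_NUM"])

# Option 2: drop the whole REF_NUM feature.
dropped_col = toy.drop("REF_NUM", axis=1)

# Option 3: fill the missing INVT_NUM values with the column median.
median = toy["INVT_NUM"].median()
filled = toy.fillna({"INVT_NUM": median})

print(dropped_rows.shape, dropped_col.shape, median)
```

If the median is used, it should be computed on the training set only and reused on the test set, which is what SimpleImputer does inside a pipeline.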

Processing text features

Text attributes cannot be passed directly to a model. In this project the classification number is a text attribute, and it can be converted with one-hot encoding.
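A minimal sketch of one-hot encoding with scikit-learn is shown below. The classification strings are hypothetical placeholders, not values taken from the data set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical classification numbers, for illustration only.
toy = pd.DataFrame({"CCL": ["435/6", "435/7", "435/6", "435/69"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["CCL"]])  # sparse matrix by default
dense = encoded.toarray()                      # dense copy, for display only

print(encoder.categories_[0])  # categories, in the column order of the output
print(dense)                   # one row per sample, a single 1 per row
```

Each distinct class becomes its own 0/1 column, so the model does not read any ordering into the class labels.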

A custom Transformer

The transformers provided by scikit-learn may not cover every real-world situation, so sometimes you need to write a custom transformer. When defining the class, add two base classes: BaseEstimator and TransformerMixin (which provides the fit_transform() method). The data transformation steps mentioned above can then be chained and encapsulated with the Pipeline class.
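This project does not end up defining its own transformer, so the following is only a sketch of the pattern. Log1pTransformer is a hypothetical class invented for illustration; it applies log(1 + x) to count-like columns and is then chained with StandardScaler in a Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Mixin listed first, BaseEstimator last: this order works across scikit-learn versions.
class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: applies log(1 + x) to every column."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the scikit-learn contract.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Toy count data (made up).
counts = np.array([[1, 0], [3, 10], [7, 100]])

pipeline = Pipeline([
    ("log", Log1pTransformer()),
    ("scale", StandardScaler()),
])
result = pipeline.fit_transform(counts)
print(result)
```

Because the class defines fit and transform, TransformerMixin supplies fit_transform, and the object can be used anywhere a built-in transformer can.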

try:
    from sklearn.impute import SimpleImputer  # scikit-learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    # fill missing values with the median
    ('imputer', SimpleImputer(strategy="median")),
    # standardize the numeric features
    ('std_scaler', StandardScaler()),
])
try:
    from sklearn.preprocessing import OrdinalEncoder  # just to raise an ImportError if scikit-learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder  # scikit-learn < 0.20
from sklearn.compose import ColumnTransformer  # scikit-learn 0.20+
# patenting_num is not defined in the original; it is assumed to be the numeric columns only
patenting_num = patenting.drop("CCL", axis=1)
num_attribs = list(patenting_num)
cat_attribs = ['CCL']
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    # convert the text feature to one-hot encoding
    ("cat", OneHotEncoder(), cat_attribs),
])
patenting_prepared = full_pipeline.fit_transform(patenting)

4. Model training

Decision tree regression

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(patenting_prepared, patenting_labels)
from sklearn.metrics import mean_squared_error
import numpy as np
# RMSE of the decision tree on the training set
patenting_predictions = tree_reg.predict(patenting_prepared)
tree_mse = mean_squared_error(patenting_labels, patenting_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

0.1222348604334857

# Evaluate the decision tree with 10-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, patenting_prepared, patenting_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)

Scores: [0.23523046 0.23163885 0.23027035 0.23314598 0.25142486 0.23322703 0.24014534 0.24452541 0.23922923 0.23047444]
Mean: 0.23693119435252902
Standard deviation: 0.006544897283615564
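The cross-validation code negates the scores before taking the square root because scikit-learn scorers follow a "greater is better" convention, so "neg_mean_squared_error" returns the MSE with its sign flipped. A small sketch on synthetic data (made up here, unrelated to the patent data) makes this visible:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)

print(scores)               # every entry is zero or negative
rmse_scores = np.sqrt(-scores)  # flip the sign back, then take the root
print(rmse_scores.mean())
```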

Random forest regression

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=30, max_features=8, random_state=42)
forest_reg.fit(patenting_prepared, patenting_labels)
# Evaluate the random forest with 10-fold cross-validation
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, patenting_prepared, patenting_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [0.20408526 0.22027575 0.20893746 0.21425075 0.20945561 0.19779428 0.22585026 0.20252916 0.21406977 0.21369368]
Mean: 0.21109419959839934
Standard deviation: 0.007964464578674136

# Evaluate the decision tree on the test set
final_model = tree_reg
X_test = strat_test_set.drop("CD", axis=1)
y_test = strat_test_set["CD"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

0.23717729756390787

Random forest model parameter adjustment

from sklearn.model_selection import GridSearchCV
param_grid = [
    # try 8 (2×4) combinations of hyperparameters
    {'n_estimators': [10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [10, 30], 'max_features': [2, 4, 6]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, for a total of (8+6)*5=70 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(patenting_prepared, patenting_labels)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [10, 30]}, {'bootstrap': [False], 'max_features': [2, 4, 6], 'n_estimators': [10, 30]}], return_train_score=True, scoring='neg_mean_squared_error')

grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}
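The size of this search can be checked without training anything. The sketch below rebuilds the grid printed in the GridSearchCV output above and counts the candidates with ParameterGrid; the variable names n_candidates and n_fits are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV output above.
param_grid = [
    {"n_estimators": [10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [10, 30], "max_features": [2, 4, 6]},
]

n_candidates = len(ParameterGrid(param_grid))  # 2*4 + 1*2*3
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)  # 14 70
```

With the default refit=True, GridSearchCV performs one more fit at the end, retraining the best candidate on the full training set.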

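# Evaluate the tuned random forest on the test set.
# forest_reg was re-created unfitted for the grid search, and GridSearchCV fits clones of it,
# so the fitted model has to be taken from best_estimator_ (refit on the full training set).
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("CD", axis=1)
y_test = strat_test_set["CD"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse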

0.2041034942299966

Artificial neural network

from sklearn.neural_network import MLPRegressor
ann_reg = MLPRegressor(random_state=42, max_iter=100)
ann_reg.fit(patenting_prepared, patenting_labels)
ann_reg_predict = ann_reg.predict(patenting_prepared)
# Evaluate the artificial neural network with 10-fold cross-validation
from sklearn.model_selection import cross_val_score
ann_reg_scores = cross_val_score(ann_reg, patenting_prepared, patenting_labels,
                                 scoring="neg_mean_squared_error", cv=10)
ann_reg_rmse_scores = np.sqrt(-ann_reg_scores)
display_scores(ann_reg_rmse_scores)

Scores: [0.19838524 0.21552342 0.20064441 0.20629559 0.20471131 0.19499855 0.21604244 0.19780061 0.20117999 0.20474261]
Mean: 0.20403241846550388
Standard deviation: 0.006740346518995299

# Evaluate the artificial neural network on the test set
final_model = ann_reg
X_test = strat_test_set.drop("CD", axis=1)
y_test = strat_test_set["CD"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

0.20358969321799883

5. Summary


Comparison of the RMSE of the different models:

Model                       Training set   Validation set   Test set
Decision tree regression    0.1222         0.2369           0.2372
Random forest regression    0.1338         0.2111           0.2041
Artificial neural network   0.1678         0.2040           0.2036

The root mean square errors on the training, validation and test sets show that the three algorithms perform similarly on unseen data. Each algorithm performs best on the training set. For random forest and the neural network the test set comes next and the cross-validation set is worst, while for the decision tree the validation and test errors are almost identical (0.2369 and 0.2372).

The root mean square error on the test set is about 0.2, indicating that patent description information and backward-reference features can predict patent innovation intensity to a certain extent.
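For reference, the error measure computed throughout with np.sqrt(mean_squared_error(...)) is the root mean square error. The symbols below are notation introduced here, not taken from the original: n is the number of patents in the set being evaluated, y_i is the observed CD value of patent i, and \hat{y}_i is the model's prediction.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```

The RMSE is expressed in the same units as CD itself, so a value of about 0.2 is a typical size of the prediction error in CD units.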

Decision tree regression performs best on the training set but worst on the validation set and the test set, which indicates that it is probably overfitting. The artificial neural network has the highest error of the three on the training set, but the lowest on the validation set and the test set.
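The gap between training error and cross-validation error is the usual symptom of overfitting, and limiting the tree depth is one common way to reduce it. The sketch below illustrates this on synthetic data; the data, the helper train_and_cv_rmse and the choice max_depth=3 are all assumptions made for illustration, and this was not tried on the patent data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only.
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(500, 4))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=500)


def train_and_cv_rmse(model):
    """Return (training RMSE, mean 10-fold cross-validation RMSE)."""
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    return train_rmse, np.sqrt(-scores).mean()


# An unrestricted tree memorizes the training data, noise included.
deep_train, deep_cv = train_and_cv_rmse(DecisionTreeRegressor(random_state=42))
# A depth-limited tree cannot memorize every point.
shallow_train, shallow_cv = train_and_cv_rmse(
    DecisionTreeRegressor(max_depth=3, random_state=42))

print("unrestricted: train", deep_train, "cv", deep_cv)
print("max_depth=3:  train", shallow_train, "cv", shallow_cv)
```

Whether a depth limit actually helps on the patent data would have to be checked, for example by adding max_depth to a grid search.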

In terms of root mean square error on the test set, the artificial neural network is the best of the three models (0.2036), followed closely by random forest regression (0.2041), with decision tree regression last (0.2372).

However, only five attributes are used in this project, because the attributes that can be extracted from patents are limited. If more predictive variables can be found, the prediction accuracy may improve.