Original link: tecdat.cn/?p=8450

Original source: Tuoduan Data Tribe official account

 

Introduction

Software development roles often call for experience with NoSQL databases such as MongoDB. This tutorial walks through collecting data from an API, storing it in a MongoDB database, and doing some analysis on that data.

 

What API will we use?

We’ll use the GameSpot API. GameSpot is one of the largest video game review sites on the web, and it exposes its data through an API.

Setup

Before we begin, make sure you have a GameSpot API key. You should also make sure that MongoDB and its Python library are installed; installation instructions are available in the MongoDB documentation.

Simply run the following command to install the PyMongo library:

$ pip install pymongo

 

 

Creating a MongoDB database

Now we can start our project by creating a MongoDB database. First, we have to take care of the imports. We’ll import MongoClient from PyMongo, along with requests and pandas:

from pymongo import MongoClient
import requests
import pandas as pd

To create a database with MongoDB, we first need to connect to the client and then use the client to create the required database:

# 'localhost' is a placeholder; use your own MongoDB host
client = MongoClient('localhost', 27017)
db_name = 'gamespot_reviews'

# connect to the database
db = client[db_name]

MongoDB can store multiple collections of data in a single database, so we also need to define the name of the collection we want to use:

# open the specific collection
reviews = db.reviews

Our database and collection have been created and we are ready to start inserting data into it.

Use the API

We need to make a request to a base URL that contains our API key. GameSpot’s API exposes several resources we can pull data from. For example, there is a resource that lists data about games, such as release date and consoles.

However, we’re interested in the game reviews resource. We’ll create a header to pass to the requests function, along with the base URL:

headers = {
    "User-Agent": "[YOUR IDENTIFIER] API Access"
}

review_base = "http://www.gamespot.com/api/reviews/?api_key=[YOUR API KEY HERE]&format=json"

We only want a few fields returned: id, title, score, deck, body, good, and bad:

review_fields = "id,title,score,deck,body,good,bad"

GameSpot returns at most 100 results per request. So, to get a decent number of reviews to analyze, we need to create a range of offsets and step through them, retrieving 100 results at a time.

You can choose any number. I chose to fetch all of their reviews, which top out at 14,900:

# offsets 0, 100, 200, ... up to (but not including) 14900
pages_list = list(range(0, 14900, 100))
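As a rough sketch of how those offsets map to request URLs, the snippet below builds one URL per page of 100 results using only the standard library. The helper name `build_review_url` and the placeholder key are illustrative assumptions, not part of the original code; the `field_list`, `sort`, and `offset` parameters are the ones used in this tutorial.

```python
from urllib.parse import urlencode

API_BASE = "http://www.gamespot.com/api/reviews/"  # endpoint used in this tutorial


def build_review_url(api_key, fields, offset, sort="score:desc"):
    """Return a reviews URL for one page of results (hypothetical helper)."""
    params = {
        "api_key": api_key,
        "format": "json",
        "field_list": fields,
        "sort": sort,
        "offset": offset,
    }
    # safe=",:" keeps the commas in field_list and the colon in sort readable
    return API_BASE + "?" + urlencode(params, safe=",:")


# One offset per page of 100 results: 0, 100, 200, ...
offsets = list(range(0, 14900, 100))
urls = [build_review_url("YOUR_API_KEY", "id,title,score", o) for o in offsets]

print(len(urls))
print(urls[1])
```

Building the query string from a dict avoids hand-concatenating `&` and `=` separators, which is an easy place to introduce typos.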

We will create a function that concatenates the base URL, the list of fields we want returned, a sorting scheme (ascending or descending), and the offset for the query.

We’ll loop through the list of offsets, and for each one we’ll build a new URL and request the next 100 results:

def get_games(url_base, num_pages, fields, collection):

    field_list = "&field_list=" + fields + "&sort=score:desc" + "&offset="

    for page in num_pages:
        url = url_base + field_list + str(page)
  ...
            print("Data Inserted")

Recall that MongoDB stores data as JSON-like documents. We therefore convert the response data to JSON using the json() method.

Once the data has been converted, we take the “results” property from the response, since that is the part that actually contains the data we’re interested in. We then iterate over the 100 results and insert each one into our collection using PyMongo’s insert_one() command. You could also put them all in a list and use insert_many() instead.
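A minimal sketch of that insertion step, assuming the decoded response is a dict that keeps its records under a `results` key. `InMemoryCollection` and `store_results` are hypothetical names, and the stand-in class exists only so the example runs without a MongoDB server; with PyMongo you would pass the real `reviews` collection, whose `insert_one()` and `insert_many()` methods are called the same way.

```python
class InMemoryCollection:
    """Hypothetical stand-in exposing the two PyMongo methods used here."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)

    def insert_many(self, docs):
        self.docs.extend(docs)


def store_results(payload, collection, bulk=False):
    """Insert every record found under payload["results"]; return the count."""
    records = payload.get("results", [])
    if bulk and records:
        collection.insert_many(records)    # one call for the whole page
    else:
        for record in records:
            collection.insert_one(record)  # one call per record
    return len(records)


# With a live API the payload would come from something like:
#   payload = requests.get(url, headers=headers).json()
sample_payload = {"results": [{"id": 1, "score": "10.0"}, {"id": 2, "score": "9.9"}]}
demo = InMemoryCollection()
inserted = store_results(sample_payload, demo)
print(inserted, demo.docs)
```

The `bulk and records` guard matters with a real collection, because PyMongo's `insert_many()` rejects an empty list.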

Now let’s call this function and let it collect data:

get_games(review_base, pages_list, review_fields, reviews)

We can view the database and its contents directly using the Compass program:

 

We can see that the data has been inserted correctly.

 

We can also run some queries against the database and print the results. To do this, we’ll create an empty list to store our entries and use the .find() command on the reviews collection.

When using PyMongo’s find function, the query also needs to be formatted as JSON. The arguments given to find consist of a field and a value.

By default, MongoDB always returns the _id field (its own unique ID field, not the id we pulled from GameSpot), but we can tell it to suppress this by giving it a value of 0. Any field we do want returned, in this case score, should be given a value of 1:

scores = []

...

print(scores[:900])

This is what was extracted and printed successfully:

[{'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, ...
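To make the projection argument concrete, here is a small pure-Python illustration of what `{"_id": 0, "score": 1}` asks for. The `project` helper only mimics the effect of a simple inclusion projection on plain dicts; it is not PyMongo code, and the sample documents are made up.

```python
def project(doc, projection):
    """Mimic a simple inclusion projection on one dict (illustration only)."""
    wanted = [field for field, flag in projection.items()
              if flag and field != "_id"]
    out = {}
    # _id comes back by default unless it is explicitly set to 0
    if projection.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for field in wanted:
        if field in doc:
            out[field] = doc[field]
    return out


projection = {"_id": 0, "score": 1}
sample_docs = [
    {"_id": "a1", "id": 101, "title": "Example A", "score": "10.0"},
    {"_id": "b2", "id": 102, "title": "Example B", "score": "9.9"},
]

# With PyMongo the equivalent query would be: reviews.find({}, projection)
scores_demo = [project(doc, projection) for doc in sample_docs]
print(scores_demo)
```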

We can also use pandas to easily convert the query results into a DataFrame:

scores_data = pd.DataFrame(scores, index=None)
print(scores_data.head(20))

This is what is returned:

   score
0   10.0
1   10.0
2   10.0
3   10.0
4   10.0
5   10.0
6   10.0
7   10.0
8   10.0
9   10.0
10  10.0
11  10.0
12  10.0
13  10.0
14  10.0
15  10.0
16  10.0
17   9.9
18   9.9
19   9.9

Before we start analyzing the data, let’s take a moment to look at how two collections could be joined together. As mentioned earlier, GameSpot has several resources to pull data from, and we might want values from a second collection, such as a “games” collection.

MongoDB is a NoSQL database, so unlike SQL databases it is not designed to handle relations between collections and join their fields together. However, there is an aggregation stage that approximates a database join: $lookup.

$lookup takes the collection to join from, the local field and the foreign field to match on, and finally a name under which the matching foreign documents will appear in the query results. If you have another collection called games and want to join the two in a query, you could do this:

pipeline = [{
    '$lookup': {
...
    }
},]

for doc in (games.aggregate(pipeline)):
    print(doc)
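As an illustration of what such a stage can look like and what shape it returns, the sketch below pairs a `$lookup` stage with a pure-Python imitation of its left outer join. The keys `from`, `localField`, `foreignField`, and `as` are the stage's standard options; the field names (`id`, `game_id`, `game_reviews`) and the sample documents are hypothetical and are not taken from the GameSpot schema or from the elided code above.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_name):
    """Mimic $lookup's left outer join on lists of dicts (illustration only)."""
    joined = []
    for local in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == local.get(local_field)]
        joined.append({**local, as_name: matches})
    return joined


# Shape of the stage as it would be passed to games.aggregate(...).
# The field names below are hypothetical.
pipeline = [{
    "$lookup": {
        "from": "reviews",          # collection to join with
        "localField": "id",         # field on the games documents
        "foreignField": "game_id",  # field on the reviews documents
        "as": "game_reviews",       # name of the output array field
    }
}]

games_demo = [{"id": 1, "name": "Game A"}, {"id": 2, "name": "Game B"}]
reviews_demo = [{"game_id": 1, "score": "9.0"}, {"game_id": 1, "score": "8.5"}]

stage = pipeline[0]["$lookup"]
joined_demo = lookup(games_demo, reviews_demo,
                     stage["localField"], stage["foreignField"], stage["as"])
print(joined_demo)
```

Every local document is kept, and a document with no matches simply gets an empty array, which is what makes this a left outer join.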

Analyze the data

Now we can analyze and visualize some of the data in the newly created database. Let’s make sure we have all the imports we need for the analysis.


 

from pymongo import MongoClient
import pymongo
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import string
import en_core_web_sm
import seaborn as sns

Suppose we want to do some analysis of the words in GameSpot’s game reviews.

We can start by using the find() function, as before, to collect the top 40 (or any number of) reviews from the database, but this time we’ll specify that we want to sort by the score field in descending order:

# 'localhost' is a placeholder; use your own MongoDB host
client = MongoClient('localhost', 27017)
db = client[db_name]
...

We will convert this response into a pandas DataFrame and then into a string. We then extract all the values inside the <p> HTML tags that contain the review text and process them using BeautifulSoup:

reviews_data = pd.DataFrame(review_bodies, index=None)

def extract_comments(input):
...

review_entries = extract_comments(str(review_bodies))
print(review_entries[:500])

Look at the printed output to check that the review text has been collected:

[<p>For anyone who hasn't actually seen the game on a TV right in front of them, the screenshots look too good to be true. In fact, when you see NFL 2K for the first time right in front of you...]
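For readers who want to see the paragraph extraction idea in isolation, here is a small sketch that collects the text inside `<p>` elements. Note that it swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; the class name `ParagraphCollector` and the sample HTML are assumptions made for illustration, and this is not the elided `extract_comments` function.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text of nested tags such as <i> still arrives here
        if self.in_paragraph:
            self.paragraphs[-1] += data


sample_html = "<p>Great <i>game</i>.</p><div>skip</div><p>Second paragraph.</p>"
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
print(collector.paragraphs)
```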

Now that we have the review text, we can analyze it in several different ways:

  • We can create a word cloud
  • We can count all the words and sort them by how often they occur
  • We can perform named entity recognition

However, before we can analyze the data at all, we have to preprocess it.

To preprocess the data, we’ll create a function to filter the entries. The text is still full of tags and non-standard characters, which we want to strip out to get the raw review text. We’ll use regular expressions to replace the non-standard characters with spaces.

We’ll also use the stop words from NLTK (very common words that add little meaning to our text) and remove them from the text: we create a list to hold the words and append a word only if it is not in our list of stop words.
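The filtering described above might look roughly like the sketch below. It is not the tutorial's elided `filter_entries` function: the name `clean_words`, the regular expressions, and the tiny `DEMO_STOP_WORDS` set (standing in for NLTK's `stopwords.words('english')`) are all assumptions chosen so the example runs with the standard library alone.

```python
import re

# Tiny stand-in list; the tutorial uses stopwords.words('english') from NLTK
DEMO_STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in"}


def clean_words(text, stop_words):
    """Strip tags and non-letters, lowercase, and drop stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)  # non-letters -> spaces
    kept = []
    for word in letters_only.lower().split():
        if word not in stop_words:  # keep only words outside the stop list
            kept.append(word)
    return kept


cleaned = clean_words("<p>The story is great, and the world is HUGE!</p>",
                      DEMO_STOP_WORDS)
print(cleaned)
```

Keeping the stop words in a set makes each membership check cheap, which helps once the corpus grows to thousands of words.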

The word cloud

Let’s take a subset of our review words to use as the corpus for the visualization. If the corpus is too large, it can cause problems when generating the word cloud.

For example, I filtered out the first 5,000 words:

stop_words = set(stopwords.words('english'))

def filter_entries(entries, stopwords):

...

    for word in split_entries:
...
    return entries_words

review_words = filter_entries(review_entries, stop_words)
review_words = review_words[5000:]

Now we can make a word cloud very easily using the pre-made WordCloud library.

The word cloud does give us some information about which words are common in top-rated reviews:

 

In fact, we do have some information about the concepts discussed in game reviews: gameplay, story, characters, world, action, location, etc.

 

We can find the most common words by splitting the text into a list of words and adding each word to a dictionary along with its count, incrementing the count every time we see the same word.

Then we just need to use Counter and its most_common() function:

def get_word_counts(words_list):
    word_count = {}

    for word in words_list:
...

    return word_count
...
print(review_list)

Here’s a count of some of the most commonly used words:

[('game', 1231), ('one', 405), ('also', 308), ('time', 293), ('games', 289), ('like', 285), ('get', 278), ('even', 271), ('well', 224), ('much', 212), ('new', 200), ('play', 199), ('level', 195), ('different', 195), ('players', 193)...
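The counting approach can be tried on a toy word list. This is a minimal sketch with made-up words, showing the manual dictionary tally next to `Counter.most_common()`; it is not the elided body of `get_word_counts`.

```python
from collections import Counter

words = ["game", "story", "game", "world", "story", "game"]

# Manual tally: start each new word at 1, otherwise increment its count
word_count = {}
for word in words:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1

# Counter accepts the tally dict; most_common(n) returns the top n pairs
top_two = Counter(word_count).most_common(2)
print(top_two)
```

Passing the raw list straight to `Counter(words)` would give the same counts without the manual loop.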

Named entity recognition

We can also do named entity recognition using en_core_web_sm, a language model that ships with spaCy. The spaCy documentation lists the various concepts and linguistic features it can detect.

We need to get the list of detected named entities and concepts from the document (our list of words):

# load the spaCy language model imported above
nlp = en_core_web_sm.load()
doc = nlp(str(review_words))
...

We can print out the entities we found, along with a count of the entities.

# Example of named entities and their categories
print([(X.text, X.label_) for X in doc.ents])

# All categories and their counts
print(Counter(labels))

# Most common named entities
print(Counter(items).most_common(20))

Here is what is printed:

[('Nintendo', 'ORG'), ('NES', 'ORG'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('15', 'CARDINAL'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('Super', 'WORK_OF_ART') ...]

Counter({'PERSON': 1227, 'CARDINAL': 496, 'ORG': 478, 'WORK_OF_ART': 204, 'ORDINAL': 200, 'NORP': 110, 'PRODUCT': 88, 'GPE': 63, 'TIME': 12, 'DATE': 12, 'LOC': 12, 'QUANTITY': 4 ...})

[('first', 147), ('two', 110), ('Metal', 85), ('Solid', 82), ('GTAIII', 78), ('Warcraft', 72), ('2', 59), ('Mario', 56), ('four', 54), ('three', 42), ('NBA', 41) ...]

We can then define a function that counts the entities in a given category and use it to pull out the ones we want.

We’ll get lists of named entities (people), organizations, and GPEs (locations):

def word_counter(doc, ent_name, col_name):
    ent_list = []
    for ent in doc.ents:
...
review_gpe = word_counter(doc, 'GPE', 'GPEs')
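To show the per-category counting idea without loading a spaCy model, the sketch below works on a plain list of `(text, label)` pairs, which is the shape produced by `[(ent.text, ent.label_) for ent in doc.ents]`. The helper `count_entities` and the sample entities are hypothetical; this is not the elided body of `word_counter`.

```python
from collections import Counter


def count_entities(entities, label, top_n=10):
    """Return the top_n most frequent entity texts carrying the given label."""
    texts = [text for text, ent_label in entities if ent_label == label]
    return Counter(texts).most_common(top_n)


# Stand-in for [(ent.text, ent.label_) for ent in doc.ents]
sample_entities = [
    ("Nintendo", "ORG"), ("Mario", "PERSON"), ("Nintendo", "ORG"),
    ("Miami", "GPE"), ("Mario", "PERSON"), ("Mario", "PERSON"),
]

print(count_entities(sample_entities, "PERSON"))
print(count_entities(sample_entities, "ORG"))
```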

Now all we need to do is plot the count with a function:

 
plot_categories("Named Entities", review_persons, 30)
plot_categories("Organizations", review_org, 30)
plot_categories("GPEs", review_gpe, 30)

Let’s take a look at the resulting graph.

 

As expected for named entities, most of the results returned are the names of video game characters.

 

The organizations plot shows, fittingly, game developers and publishers such as PlayStation and Nintendo.

 

Above is the plot for GPEs, or geographic locations. It looks like “Hollywood” and “Miami” come up often in game reviews.

Plotting numerical values

Finally, we can try plotting numerical values from the database. Let’s get the score values from the reviews collection, count them, and then plot the counts:

scores = []
...
plt.xticks(rotation=-90)
plt.show()

The chart above shows how many times each review score was given, from 0 to 9.9.
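The counting step behind such a chart can be sketched with the standard library. The sample documents below are made up, and they assume scores are stored as strings, as in the printed query results earlier; the elided plotting code may organize this differently.

```python
from collections import Counter

# Stand-in for the documents returned by reviews.find({}, {"_id": 0, "score": 1})
score_docs = [{"score": "9.0"}, {"score": "10.0"}, {"score": "9.0"}, {"score": "8.5"}]

counts = Counter(doc["score"] for doc in score_docs)

# Sort by numeric value so "10.0" does not land before "9.0"
ordered = sorted(counts.items(), key=lambda item: float(item[0]))
labels = [score for score, _ in ordered]
values = [count for _, count in ordered]

print(labels)
print(values)
# These two lists could then be handed to, for example, plt.bar(labels, values)
```

Sorting on the float value rather than the string avoids the lexicographic ordering that would otherwise put "10.0" ahead of "8.5".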

Conclusion

Collecting, storing, retrieving, and analyzing data are in-demand skills in today’s world, and MongoDB is one of the most commonly used NoSQL database platforms.

Knowing how to use a NoSQL database and how to interpret the data in it will enable you to perform many common data analysis tasks.