Introduction to the Chrome history analyzer

Want to see what you’ve been doing for the last year? Were you surfing the Internet or working hard? Want to write an annual summary but don’t have the data? Now you can get it.

This is a Chrome history analyzer that lets you understand your browsing history. Of course, it only works with Chrome or Chrome-based browsers.

On this page you will be able to view the top ten domains and URLs you visited and your busiest days, along with the related charts.

Part of the screenshot (image omitted).

Code walkthrough

1. Directory structure

First, let’s take a look at the overall directory structure

├─ app_callback.py       Web server callbacks
├─ app_configuration.py  Web server configuration
├─ app_layout.py         Web front-end page layout
├─ app_plot.py           Web chart drawing
├─ app.py                Web server startup
├─ assets                Static resources
│  ├─ (image files such as icons and logos)
│  └─ static             Web front-end help page
│     ├─ help.html
│     └─ help.md
├─ history_data.py       Parses the Chrome History file
└─ requirement.txt       Libraries the program depends on
  • app_callback.py: the program is written in Python and deployed with Dash, a lightweight web framework. app_callback.py mainly holds the callbacks, which can be understood as the back-end logic.
  • app_configuration.py: as the name implies, the configuration of the web server.
  • app_layout.py: the web front-end page layout, including the HTML and CSS elements.
  • app_plot.py: mainly implements the chart figures for the web front end.
  • app.py: starts the web server.
  • assets: static resource directory, used to store the static resources we need.
  • history_data.py: connects to the SQLite database and parses the Chrome History file.
  • requirement.txt: the libraries required to run this program.

2. Parsing the history file

The file that parses the history data is history_data.py. Let’s analyze it.

import sqlite3

# Query the SQLite database
def query_sqlite_db(history_db, query):

    # Note that History is a file without a suffix, not a directory
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()

    # In the database, the field visits.url equals the field urls.id,
    # so urls and visits can be joined to get the specified data
    select_statement = query

    # Execute the database query
    cursor.execute(select_statement)

    # Fetch the data (tuple format)
    results = cursor.fetchall()

    # Close the cursor and the connection
    cursor.close()
    conn.close()

    return results

The code flow of this function is as follows:

  1. Connect to the SQLite database, execute the query statement, return the query results, and finally close the database connection. (A small usage sketch follows.)
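
For illustration, here is a minimal usage sketch of query_sqlite_db(); the path is a placeholder, and Chrome may keep the History file locked while it is running, so it is safer to copy the file out first.

# Minimal usage sketch (placeholder path): count the rows in the urls table
rows = query_sqlite_db('/path/to/History', 'SELECT COUNT(*) FROM urls;')
print(rows)  # e.g. [(12345,)]

The next function, get_history_data(), builds on this: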
# Get the sorted history data
def get_history_data(history_file_path):

    try:
        # Get the database content (each row is a tuple)
        select_statement = "SELECT urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration FROM urls, visits WHERE urls.id = visits.url;"
        result = query_sqlite_db(history_file_path, select_statement)

        # Sort the results by the first element
        # (sort and sorted compare the first element first, then the second, and so on)
        result_sort = sorted(result, key=lambda x: (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8]))

        # Return the sorted data
        return result_sort
    except:
        # print('Error reading the history file!')
        return 'error'

The code flow of this function is as follows:

  1. Set the database query statement select_statement and call query_sqlite_db() to get the parsed history data. The returned rows are then sorted element by element. At this point the sorted, parsed history data has been retrieved successfully. (A quick usage sketch follows.)
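
A quick way to sanity-check the result, assuming the History file has already been copied to a local path (the path below is a placeholder):

history_data = get_history_data('/path/to/History')
if history_data != 'error':
    # Each row is (id, url, title, last_visit_time, visit_count,
    #              visit_time, from_visit, transition, visit_duration)
    print(len(history_data), 'visit records')
    print(history_data[0])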

3. Configure the Web server

The files related to the basic configuration of the Web server are the app_configuration.py and app.py files. You can set port numbers, access permissions, and static resource directories for the Web server.
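
The article does not reproduce those two files, so the sketch below only shows what a typical setup might look like; the host, port and debug values are assumptions (the demo later in this article runs on port 8090).

import dash

# app_configuration.py (sketch): create the Dash app and point it at the assets directory
app = dash.Dash(__name__, assets_folder='assets')
server = app.server

# app.py (sketch): start the web server
if __name__ == '__main__':
    app.run_server(host='0.0.0.0', port=8090, debug=False)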

4. Deploy front-end pages

The files associated with front-end deployment are app_layout.py, app_plot.py, and the assets directory.

The front-end layout consists of the following elements:

  • Upload history file component
  • Page visit count ranking component
  • Page total dwell time ranking component
  • Daily page views scatter chart component
  • Scatter chart of visit counts at different hours of a given day
  • Top 10 most visited URLs component
  • Search keyword ranking component
  • Search engine usage component

In app_layout.py, most of these components are configured just like ordinary HTML and CSS, so we’ll only use the page view ranking component as an example.

# Ranking of page visits
html.Div(
    style={'margin-bottom': '150px'},
    children=[
        html.Div(
            style={'border-top-style': 'solid', 'border-bottom-style': 'solid'},
            className='row',
            children=[
                html.Span(
                    children='Page view ranking,',
                    style={'font-weight': 'bold', 'color': 'red'}
                ),

                html.Span(
                    children='Display number:',
                ),
                dcc.Input(
                    id='input_website_count_rank',
                    type='text',
                    value=10,
                    style={'margin-top': '10px', 'margin-bottom': '10px'}
                ),
            ]
        ),

        html.Div(
            style={'position': 'relative', 'margin': '0 auto', 'width': '100%', 'padding-bottom': '50%'},
            children=[
                dcc.Loading(
                    children=[
                        dcc.Graph(
                            id='graph_website_count_rank',
                            style={'position': 'absolute', 'width': '100%', 'height': '100%', 'top': '0', 'left': '0', 'bottom': '0', 'right': '0'},
                            config={'displayModeBar': False},
                        ),
                    ],
                    type='dot',
                    style={'position': 'absolute', 'top': '50%', 'left': '50%', 'transform': 'translate(-50%,-50%)'}
                ),
            ],
        )
    ]
)

As you can see, it’s written in Python, but anyone with front-end experience can easily add or remove elements from it, so we won’t go into the details of how to use HTML and CSS.

app_plot.py is mainly concerned with drawing the charts. It uses the Plotly library, which draws interactive chart components for the web. Here is the bar chart ranking pages by visit count, as an example of how Plotly is used.

# Draw a bar chart ranking pages by visit count
def plot_bar_website_count_rank(value, history_data):

    # Visit-count dictionary
    dict_data = {}

    # Walk through the history records
    for data in history_data:
        url = data[1]
        # Simplify the URL
        key = url_simplification(url)

        if (key in dict_data.keys()):
            dict_data[key] += 1
        else:
            # First occurrence counts as one visit
            dict_data[key] = 1

    # Keep the k entries with the highest counts
    k = convert_to_number(value)
    top_10_dict = get_top_k_from_dict(dict_data, k)

    figure = go.Figure(
        data=[
            go.Bar(
                x=[i for i in top_10_dict.keys()],
                y=[i for i in top_10_dict.values()],
                name='bar',
                marker=go.bar.Marker(
                    color='rgb(55, 83, 109)'
                )
            )
        ],
        layout=go.Layout(
            showlegend=False,
            margin=go.layout.Margin(l=40, r=0, t=40, b=30),
            paper_bgcolor='rgba(0,0,0,0)',
            plot_bgcolor='rgba(0,0,0,0)',
            xaxis=dict(title='website'),
            yaxis=dict(title='count')
        )
    )

    return figure

The code flow of this function is as follows:

  1. First, iterate over the parsed history data history_data, take the url of each record, and call url_simplification(url) to simplify it. Each simplified URL is then counted in a dictionary.
  2. Call get_top_k_from_dict(dict_data, k) to keep the k entries of the dictionary dict_data with the highest counts.
  3. Next, draw the bar chart. go.Bar() draws the bars, where x and y are the categories and their corresponding values, both in list format. xaxis and yaxis set the titles of the corresponding axes.
  4. Return a figure object so it can be sent to the front end. (A sketch of the helper functions used here follows this list.)
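
The helpers url_simplification(), convert_to_number() and get_top_k_from_dict() are not shown in the article; a minimal sketch of what they might look like:

from urllib.parse import urlparse

def url_simplification(url):
    # Keep only the domain part, e.g. 'https://github.com/xxx/yyy' -> 'github.com'
    return urlparse(url).netloc

def convert_to_number(value):
    # The dcc.Input is of type 'text', so turn its value into an int; fall back to 10
    try:
        return int(value)
    except (TypeError, ValueError):
        return 10

def get_top_k_from_dict(dict_data, k):
    # Return a dict with the k entries that have the largest counts
    top_k = sorted(dict_data.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top_k)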

The assets directory contains image and CSS data, which are used for the front-end layout.

5. Background deployment

The file associated with background deployment is the app_callback.py file. This file updates the front-end page layout using callbacks.

First, let’s look at the callback function for ranking the frequency of page visits:

# Page visit count ranking
@app.callback(
    dash.dependencies.Output('graph_website_count_rank', 'figure'),
    [
        dash.dependencies.Input('input_website_count_rank', 'value'),
        dash.dependencies.Input('store_memory_history_data', 'data')
    ]
)
def update(value, store_memory_history_data):

    # Only update when the history data has been uploaded and stored
    if store_memory_history_data:
        history_data = store_memory_history_data['history_data']
        figure = plot_bar_website_count_rank(value, history_data)
        return figure
    else:
        # Cancel updating the page data
        raise dash.exceptions.PreventUpdate("cancel the callback")

The code flow of this function is as follows:

  1. First decide what the inputs are (the data that trigger the callback) and what the output is (the data the callback returns). dash.dependencies.Input refers to data that triggers the callback; dash.dependencies.Input('input_website_count_rank', 'value') means the callback fires whenever the value of the component whose id is input_website_count_rank changes. The result returned by update(value, store_memory_history_data) is then written to the figure property of the component whose id is graph_website_count_rank; in plain English, it changes that component’s figure.
  2. As for def update(value, store_memory_history_data): it first checks whether the input store_memory_history_data is not empty, then reads the history data history_data and calls plot_bar_website_count_rank() from the app_plot.py file described above, which returns a figure object that is sent back to the front end. At that point the page visit count ranking chart appears in the front-end layout. (A sketch of the store_memory_history_data component follows this list.)
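
The store_memory_history_data input refers to a dcc.Store component that keeps the parsed data in the browser between callbacks. Its declaration is not shown in the article, but in app_layout.py it would look roughly like this:

# Sketch: keeps the parsed history data in the browser's memory between callbacks
dcc.Store(id='store_memory_history_data', storage_type='memory')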

Finally, one more thing needs to be said: the file upload process. Here is the code first:

# Upload file callback
@app.callback(
    dash.dependencies.Output('store_memory_history_data', 'data'),
    [
        dash.dependencies.Input('dcc_upload_file', 'contents')
    ]
)
def update(contents):

    if contents is not None:

        # The uploaded data arrives base64 encoded
        content_type, content_string = contents.split(',')

        # Base64-decode the file uploaded by the client
        decoded = base64.b64decode(content_string)

        # Add a suffix to the uploaded file name to prevent files from overwriting each other
        # The following method makes sure the file name is unique
        suffix = [str(random.randint(0, 100)) for i in range(10)]
        suffix = "".join(suffix)
        suffix = suffix + str(int(time.time()))

        # Final file name
        file_name = 'History_' + suffix
        # print(file_name)

        # Create a directory for storing the files
        if (not (exists('data'))):
            makedirs('data')

        # Path of the file to be written
        path = 'data' + '/' + file_name

        # Write it to a local disk file
        with open(file=path, mode='wb+') as f:
            f.write(decoded)

        # Use SQLite to read the local disk file
        # Get the history data
        history_data = get_history_data(path)

        # Get the search keyword data
        search_word = get_search_word(path)

        # Check whether the data was read correctly
        if (history_data != 'error'):
            # Parsed successfully
            date_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
            print('New client data received, data correct, time: {}'.format(date_time))
            store_data = {'history_data': history_data, 'search_word': search_word}
            return store_data
        else:
            # Parsing failed
            date_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
            print('New client data received, data error, time: {}'.format(date_time))
            return None

    return None

The code flow of this function is as follows:

  1. First, check that the contents uploaded by the user are not empty, then base64-decode the uploaded file. A random suffix is added to the file name so that uploads cannot overwrite each other, and the file is written to local disk.

  2. After writing, use SQLite to read the local disk file. If the read succeeds, the parsed data is returned; otherwise None is returned. (The shape of the uploaded contents string is sketched below.)
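
For reference, the contents string that dcc.Upload hands to the callback is a data URL of the form 'data:<mime type>;base64,<payload>', which is why the code splits it on the comma. A tiny illustration with a made-up payload:

import base64

contents = 'data:application/octet-stream;base64,' + base64.b64encode(b'fake history bytes').decode()
content_type, content_string = contents.split(',')
print(content_type)                      # data:application/octet-stream;base64
print(base64.b64decode(content_string))  # b'fake history bytes'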

Now comes the core of our data extraction, which is extracting the data we want from the Chrome history file. Since the Chrome history file is an SQLite database, we need to use database syntax to extract what we want.

# Get the sorted history data
def get_history_data(history_file_path):

    try:
        # Get the database content (each row is a tuple)
        select_statement = "SELECT urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration FROM urls, visits WHERE urls.id = visits.url;"
        result = query_sqlite_db(history_file_path, select_statement)

        # Sort the results by the first element
        # (sort and sorted compare the first element first, then the second, and so on)
        result_sort = sorted(result, key=lambda x: (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8]))

        # Return the sorted data
        return result_sort
    except:
        # print('Error reading the history file!')
        return 'error'

The database query statement select_statement works as follows:

  1. FROM the table urls, SELECT the fields urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, which are the ID of the URL, the URL address, the title of the URL, the last visit time of the URL, and the number of visits to the URL.
  2. Then, FROM the table visits, SELECT the fields visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration, which are the visit time, which visit this visit came from, the transition type of the visit, and the duration of the visit.
  3. Join the results of step 1 and step 2 into one table and keep only the rows WHERE urls.id = visits.url. In urls, id is the ID of the URL; in visits, url is also the ID of the URL, so a row is kept only when the two are equal, otherwise it is discarded. (An equivalent explicit-JOIN form of the query is shown after this list.)
  4. Use the built-in sorted function, which sorts by x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8] in turn, i.e. by urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration.
  5. Return the sorted data.
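
The same query can also be written with an explicit JOIN, which some readers may find easier to follow; this is an equivalent form, not the statement the project actually uses:

select_statement = (
    "SELECT urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, "
    "visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration "
    "FROM urls JOIN visits ON urls.id = visits.url;"
)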

Here we list what each field represents:

Field name              Meaning
urls.id                 The ID of the URL
urls.url                The URL address
urls.title              The title of the URL
urls.last_visit_time    The last visit time of the URL
urls.visit_count        The number of visits to the URL
visits.visit_time       The time of the visit
visits.from_visit       Which visit this visit came from
visits.transition       The transition type of the visit
visits.visit_duration   The duration of the visit
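
Note that Chrome stores urls.last_visit_time and visits.visit_time as microseconds counted from 1601-01-01 UTC (the WebKit epoch), not as Unix timestamps. A small conversion sketch:

from datetime import datetime, timedelta

def chrome_time_to_datetime(chrome_timestamp):
    # Chrome timestamps are microseconds since 1601-01-01 UTC
    return datetime(1601, 1, 1) + timedelta(microseconds=chrome_timestamp)

# e.g. chrome_time_to_datetime(13300000000000000) falls in mid-2022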

6. How to obtain the Chrome history file

  1. First, open your browser and enter chrome://version/. The Profile Path shown there (the personal data path) is the directory where the history file is stored.
  2. Go to that directory, for example /Users/xxx/Library/Application Support/Google/Chrome/Default, and find the file named History. This file, which has no extension, is the history file. (A small sketch for copying it out follows.)
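
Chrome may keep the History database locked while the browser is running, so it is usually easiest to copy the file somewhere else first; a small sketch (the source path is the macOS example above, adjust it for your system):

import shutil
from pathlib import Path

# Copy the History file to the current directory so it can be parsed or uploaded safely
src = Path.home() / 'Library/Application Support/Google/Chrome/Default/History'
shutil.copy(src, 'History_copy')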

How to run

Online demo: http://39.106.118.77:8090 (the server is modest, please don’t stress-test it)

Running this program is very simple, just need to follow the following command to run:

# Change into the project directory
cd <directory name>
# Uninstall the dependent libraries first
pip uninstall -y -r requirement.txt
# Reinstall the dependent libraries
pip install -r requirement.txt
# Start the program
python app.py

# After it starts successfully, open http://localhost:8090 in your browser