Analysis background

Purpose of the analysis

  • Analyze how riding duration relates to the other data in Mobike orders, as a reference and basis for operational optimization. The original data set contains no order amounts, and Mobike bills by riding duration, so riding duration is the most important factor affecting order amount; the analysis therefore focuses on riding duration.
  • The analysis focuses on how the time of the ride (two dimensions: weekdays/weekends and peak/off-peak hours), the riding location, and user value affect riding duration.

Data set profile

  • The original data set, provided by Udacity, is randomly sampled from one million Mobike user records in the Shanghai urban area in August 2016. It contains 102,361 order records, each with a starting point, destination, rental time, return time, user ID, vehicle ID, transaction number, and route track.
  • After cleaning and information extraction, 22 new variables were added to the new data set. The ones mainly used in the analysis are: ttl_min (riding duration in minutes), distance (straight-line distance between the start and end points of the ride, in km), daytype (weekdays/weekends), hourtype (rush hours/non-rush hours), ring_stage (inside inner ring/inside middle ring/inside outer ring/outside outer ring), and rate (high-value user/middle-value user/low-value user).
  • While working with the new data set, a small number of records with abnormal riding speed, distance, and duration were removed, leaving 102,338 order records.
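The straight-line distance mentioned above can be estimated from start and end coordinates. The source does not say which formula was used, so the following is only a minimal sketch assuming the haversine (great-circle) formula; the function name `haversine_km`, the Earth radius constant, and the sample coordinates are illustrative assumptions, not taken from the data set.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only (two points roughly in central Shanghai)
start = (31.2304, 121.4737)
end = (31.2397, 121.4998)
print(round(haversine_km(*start, *end), 3))
```

A straight-line figure like this underestimates the distance actually ridden, which is one reason the analysis later prefers riding duration.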

Analysis conclusion

Summary of user behavior

Data exploration shows that the time of the ride (weekdays/weekends, peak/off-peak hours), the riding area, and user value all affect riding duration. Holding these four variables fixed in turn shows that:

  • For a given riding location or user value, the time of the ride has a clear, consistent effect on average riding duration: it is higher during peak hours and on weekends than during off-peak hours and on weekdays.
  • In general, riding duration is longer the higher the user value and the farther the riding area is from the city center, but the effect of user value is much smaller than that of riding area.

Beyond the effect of the categorical variables on riding duration, the analysis also found:

  • Across riding locations and user value types, the data points for weekdays and for peak hours are distributed very similarly, which suggests that bike use on weekdays and during peak hours behaves alike.
  • More high-value users are found inside the inner ring.

Summary of optimization suggestions

Based on the user behavior conclusions above, the following optimizations are suggested:

  • Since the time of the ride (weekdays/weekends, peak/off-peak hours) strongly affects riding duration, riding packages and marketing campaigns can be targeted at different time segments to raise order frequency and order amount (for example, badge rewards for riding at different times, limited-time free rides, and peak-related riding promotions).
  • Users far from the urban center show high spending per ride (long riding duration) but low spending frequency (few high-value users). Riding packages tied to riding location can be launched to raise how often users in these areas ride and how much they rely on Mobike.
  • Since user behavior on weekdays and during peak hours is similar, this can be taken into account when designing operational campaigns.

Other notes

  • Because the original data set carries little information, the scope of the analysis is limited. With a richer and broader data set, further analyses could include, but are not limited to:

  • Critical-path conversion rates, based on users' click data in the app, to judge whether bike supply and demand in a region are balanced and so optimize bike numbers and dispatch efficiency

  • Monthly and quarterly cycles in bike use, as a reference and basis for designing and optimizing marketing programs

  • The effect of promotions such as riding vouchers, riding packages, and top-up cashback on riding behavior, as a reference and basis for fine-grained user operations or marketing plan design

The analysis process

The analysis is mainly concerned with how four categorical variables affect riding duration. First, the distributions of riding duration and riding distance are introduced. Next, violin plots show that the two variables change in highly similar ways across the categorical variables, so only the key metric, riding duration, needs to be examined against the four categorical variables. Finally, point plots show how each remaining variable affects riding duration when the other categorical variables are held fixed.

# Import all required libraries and display charts inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# Suppress warnings in the output
import warnings
warnings.simplefilter("ignore")

# Import the cleaned and collated data set
df_e = pd.read_csv('mobike_master.csv')

# Convert the corresponding columns to ordered categorical variables
order_dict = {'ring_stage': ['inside inner ring', 'inside middle ring', 'inside outer ring', 'outside outer ring'],
              'rate': ['high-value user', 'middle-value user', 'low-value user'],
              'daytype': ['weekdays', 'weekends'],
              'hourtype': ['rush hours', 'non-rush hours']}
for var in order_dict:
    order = pd.api.types.CategoricalDtype(ordered = True, categories = order_dict[var])
    df_e[var] = df_e[var].astype(order)

# Data cleaning: remove a small amount of data that are obviously abnormal in cycling speed, cycling duration and cycling distance
df_e['speed'] = df_e['distance'] / (df_e['ttl_min'] / 60)
# Use ~ (not unary minus) to negate the boolean mask
df_e = df_e[~(((df_e['speed'] < 12) | (df_e['speed'] > 20)) & ((df_e['ttl_min'] > 720) | (df_e['distance'] > 50)))]
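To make the cleaning rule concrete, here is a small sketch on made-up records. It shows that a row is dropped only when it is both outside the 12 to 20 km/h speed band and extreme in duration or distance. The toy values and the helper names `odd_speed` and `extreme` are assumptions for illustration only.

```python
import pandas as pd

# Toy records (made-up values): distance in km, duration in minutes
toy = pd.DataFrame({
    'distance': [1.5, 60.0, 2.0, 1.0],
    'ttl_min':  [6.0, 200.0, 800.0, 12.0],
})
toy['speed'] = toy['distance'] / (toy['ttl_min'] / 60)  # km/h

odd_speed = (toy['speed'] < 12) | (toy['speed'] > 20)
extreme = (toy['ttl_min'] > 720) | (toy['distance'] > 50)

# Keep everything except rows that are both odd in speed and extreme in size
kept = toy[~(odd_speed & extreme)]
print(kept)
```

Only the third row (2 km in 800 minutes) is removed. The slow 5 km/h ride stays because it is not extreme, and the 60 km ride stays because its speed falls inside the band.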

Cycling duration distribution

Riding duration spans a wide range, from a minimum of 1 minute to a maximum of 666 minutes, with a long tail: most rides are short. Plotted on a log-scaled x-axis, riding duration shows a right-skewed distribution with its peak between 7 and 10 minutes.

bins = 10 ** np.arange(0, np.log10(df_e.ttl_min.max()) + 0.15, 0.15)
plt.hist(data = df_e, x = 'ttl_min', bins = bins);
plt.xscale('log')
xticks = (1, 2, 5, 10, 20, 50, 100, 200, 500)
plt.xticks(xticks, xticks);
plt.xlabel('Riding Duration (min)');
plt.title('Distribution of Riding Duration');
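One way to check numerically where the peak falls is to count rides with `np.histogram` using the same kind of log-spaced edges and look up the fullest bin. This is only a sketch on synthetic lognormal durations, not on the Mobike data, so the bin it reports says nothing about the real distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed durations in minutes (not the Mobike data)
durations = rng.lognormal(mean=np.log(9), sigma=0.8, size=10_000)
durations = np.clip(durations, 1, None)

# Same construction as in the plot: edges evenly spaced in log10 units
step = 0.15
edges = 10 ** np.arange(0, np.log10(durations.max()) + step, step)

counts, _ = np.histogram(durations, bins=edges)
k = counts.argmax()
print(f"fullest bin: {edges[k]:.1f} to {edges[k + 1]:.1f} min")
```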

Distribution of cycling distance

Riding distance also spans a wide range, from a minimum of 0.146 km to a maximum of 32.497 km, with a long tail: most rides are short and a few are long. Plotted on a log-scaled x-axis, riding distance shows a right-skewed distribution with its peak between 0.7 and 1.3 km.

bins = 10 ** np.arange(np.log10(df_e.distance.min()), np.log10(df_e.distance.max()) + 0.08, 0.08)
plt.hist(data = df_e, x = 'distance', bins = bins);
plt.xscale('log')
xticks = (0.1, 0.2, 0.5, 1, 2, 5, 10, 20)
plt.xticks(xticks, xticks);
plt.xlabel('Riding Distance (km)');
plt.title('Distribution of Riding Distance');

Relationship between cycling duration and distance and other variables

  • By time of ride: median riding duration and distance are higher on weekends and during peak periods than on weekdays and off-peak periods (except that median riding distance on weekends is slightly lower than on weekdays).
  • By riding area: outside the inner ring, median riding duration and distance rise with distance from the city center, possibly because users' starting points and destinations lie farther apart the closer they are to the suburbs.
  • By user value: median riding duration and distance fall as user value falls.
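The medians compared in the bullets above can also be tabulated directly with a `groupby`. The sketch below uses a few made-up rows; only the column names `daytype`, `ttl_min`, and `distance` follow the data set, and the numbers are chosen just to mimic the weekend pattern described.

```python
import pandas as pd

# Made-up rows; column names follow the data set described above
toy = pd.DataFrame({
    'daytype':  ['weekdays', 'weekdays', 'weekdays', 'weekends', 'weekends', 'weekends'],
    'ttl_min':  [8, 10, 12, 9, 14, 20],
    'distance': [1.0, 1.2, 1.6, 0.9, 1.1, 1.8],
})

# Median duration and distance per day type
medians = toy.groupby('daytype')[['ttl_min', 'distance']].median()
print(medians)
```

The same call with `'hourtype'`, `'ring_stage'`, or `'rate'` as the grouping column would give the other comparisons.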

The comparison shows that riding duration and riding distance are distributed, and vary across categories, in almost the same way. Riding distance is therefore dropped from the rest of the analysis, for two reasons. First, riding duration is real data from the original data set, whereas distance is a straight-line estimate from the start and end points of the ride. Second, Mobike bills by riding duration. With such similar data characteristics, riding duration, which has the higher data quality and value, is chosen as the metric for the subsequent analysis.

# Riding duration and riding distance both have very long tails, so log-transform both first to see the data characteristics more clearly
df_e['log_ttl_min'] = np.log10(df_e['ttl_min'])
df_e['log_distance'] = np.log10(df_e['distance'])
cat_vars = ['daytype', 'hourtype', 'ring_stage', 'rate']
fig, ax = plt.subplots(ncols = 4, nrows = 2, figsize = [20, 10])
color = sb.color_palette()[0]
for i in range(len(cat_vars)):
    var = cat_vars[i]
    # Draw the first row
    sb.violinplot(data = df_e, x = var, y = 'log_ttl_min', ax = ax[0, i], color = color);
    ttl_min_ticks = [1, 2, 5, 10, 20, 50, 100, 200, 500]
    ax[0, i].set_yticks(np.log10(ttl_min_ticks));
    ax[0, i].set_yticklabels(ttl_min_ticks);
    ax[0, i].set_ylabel('Riding Duration (min)');
    if i == 2:
        xlabels = ax[0, i].get_xticklabels()
        ax[0, i].set_xticklabels(xlabels, rotation = 10);
    # Draw the second row
    sb.violinplot(data = df_e, x = var, y = 'log_distance', ax = ax[1, i], color = color);
    distance_ticks = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20]
    ax[1, i].set_yticks(np.log10(distance_ticks));
    ax[1, i].set_yticklabels(distance_ticks);
    ax[1, i].set_ylabel('Riding Distance (km)');
    if i == 2:
        xlabels = ax[1, i].get_xticklabels()
        ax[1, i].set_xticklabels(xlabels, rotation = 10);
plt.suptitle('riding duration and distance by other features', fontsize = 'xx-large');

For a given time of ride, how riding duration varies with riding area and user value

  • In general, apart from rides inside the inner ring, average riding duration rises with the distance of the starting point from the city center
  • High-value users have the highest average riding duration, except on weekends and during off-peak hours outside the outer ring
  • Overall, average riding duration is higher during peak hours and on weekends than during off-peak hours and on weekdays
  • The first column of the figure shows roughly that on weekdays and during peak hours, average riding duration varies with user value and riding area in very similar ways. This may be because commuters make up most users both on weekdays and during peak hours, and commuters' travel behavior is similar
# Create a custom function to plot the pointplot under control variables
def ppltgrid(row_dict):
    firstplot = list(row_dict.keys())[0]    # Key of the first plot; its Y-axis range is reused by the later plots
    for var in row_dict:
        a, b, c = (int(s) for s in var.split(','))    # Key format is 'rows,cols,index'
        plt.subplot(a, b, c)
        flagid, flag, hue, x = row_dict[var]['flagid'], row_dict[var]['flag'], row_dict[var]['hue'], row_dict[var]['x']
        ax = sb.pointplot(data = df_e[df_e[flagid] == flag], x = x, y = 'log_ttl_min', hue = hue,
                          palette = 'Blues_r', linestyles = ' ', dodge = 0.1);
        ax.set_title("{}'s riding duration across {} and {}".format(flag, x, hue), fontsize = 'small');
        ylocs = np.arange(1, 1.25, 0.025)
        ylabels = np.round(np.power(10, ylocs), 2)
        ax.set_yticks(ylocs);
        ax.set_yticklabels(ylabels);
        ax.set_yticklabels([], minor = True);    # Hide the default minor tick labels
        if c % b == 1:    # Label the Y-axis only on the first plot of each row, so the labels do not obscure the other plots
            ax.set_ylabel('Mean Riding Duration (min)');
        else:
            ax.set_ylabel(' ');
        if x == 'ring_stage' or x == 'rate':    # The ring_stage and rate category names are long, so use a smaller font
            xlabels = ax.get_xticklabels()
            ax.set_xticklabels(xlabels, fontsize = 'small');
        if var == firstplot:
            ylim = ax.get_ylim()    # Get the Y-axis range of the first plot
        else:
            plt.ylim(ylim);    # Give every later plot the same Y-axis range as the first one
plt.figure(figsize = [15, 10])
row_dict = {'2,2,1': {'flagid': 'daytype', 'flag': 'weekdays', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,2': {'flagid': 'daytype', 'flag': 'weekends', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,3': {'flagid': 'hourtype', 'flag': 'rush hours', 'hue': 'rate', 'x': 'ring_stage'},
            '2,2,4': {'flagid': 'hourtype', 'flag': 'non-rush hours', 'hue': 'rate', 'x': 'ring_stage'}}
ppltgrid(row_dict)
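A note on reading these point plots: `ppltgrid` averages `log_ttl_min` and then relabels the y-axis with `10 ** y`, so the values read off the axis correspond to a geometric mean of riding duration rather than the arithmetic mean. The following sketch, with made-up durations, illustrates the difference.

```python
import numpy as np

durations = np.array([5.0, 10.0, 20.0, 160.0])  # made-up minutes

arithmetic_mean = durations.mean()
# Mean taken in log10 space, then mapped back to minutes
back_transformed = 10 ** np.log10(durations).mean()
geometric_mean = durations.prod() ** (1 / len(durations))

print(arithmetic_mean, back_transformed, geometric_mean)
```

The back-transformed value is pulled much less by the single long ride, which suits long-tailed data but means the plotted "mean" is lower than a plain average of minutes.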

For a given riding area or user value, how riding duration varies with the time of the ride

  • Fixing the riding-location variable and the user-value variable in turn and observing how the time of the ride affects average riding duration shows that the relative positions of the data points are highly similar across panels. The time of the ride thus has a clear, consistent effect on average riding duration: for the same riding location or user value, average riding duration is higher during peak hours and on weekends than during off-peak hours and on weekdays.
  • Comparing the first and second rows of plots: as user value goes from high to low, the vertical change of the data points in the second row (by user value) is far smaller than the vertical change in the first row (by riding location). This indicates that user value has only a small effect on average riding duration, much smaller than that of riding location.
plt.figure(figsize = [20, 10])
row_dict = {'2,4,1': {'flagid': 'ring_stage', 'flag': 'inside inner ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,2': {'flagid': 'ring_stage', 'flag': 'inside middle ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,3': {'flagid': 'ring_stage', 'flag': 'inside outer ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,4': {'flagid': 'ring_stage', 'flag': 'outside outer ring', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,5': {'flagid': 'rate', 'flag': 'high-value user', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,6': {'flagid': 'rate', 'flag': 'middle-value user', 'hue': 'daytype', 'x': 'hourtype'},
            '2,4,7': {'flagid': 'rate', 'flag': 'low-value user', 'hue': 'daytype', 'x': 'hourtype'}}
ppltgrid(row_dict)
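The visual comparison of vertical spread can be backed by a number: for each factor, take the range of the group means of `log_ttl_min`. This is a rough sketch on made-up rows, and the helper `mean_range` is a hypothetical name; it is not how the conclusions above were derived.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column and category names follow the data set
toy = pd.DataFrame({
    'ring_stage': ['inside inner ring', 'inside inner ring', 'outside outer ring', 'outside outer ring'],
    'rate':       ['high-value user', 'low-value user', 'high-value user', 'low-value user'],
    'ttl_min':    [10.0, 9.0, 16.0, 15.0],
})
toy['log_ttl_min'] = np.log10(toy['ttl_min'])


def mean_range(df, factor):
    """Range (max minus min) of the group means of log_ttl_min for one factor."""
    means = df.groupby(factor)['log_ttl_min'].mean()
    return means.max() - means.min()


spread = {f: mean_range(toy, f) for f in ['ring_stage', 'rate']}
print(spread)
```

In this toy data the spread across `ring_stage` is several times the spread across `rate`, the same kind of pattern the plots are described as showing.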

The code has been uploaded to GitHub. More content is available on my personal blog.

References

  • Longitude and latitude of Shanghai's geographic center, the International Hotel (AutoNavi Open Platform): Lbs.amap.com/console/sho…
  • Shanghai inner, middle, and outer rings: Zhidao.baidu.com/question/36…
  • Mobike charges in August 2016: www.33lc.com/article/764…