DolphinDB Database is a new generation of high-performance, distributed time-series databases with rich data analysis and distributed computing capabilities. DolphinDB is used to analyze user behavior data from Taobao apps for further analysis of business issues.
Data source: User Behavior Data from Taobao for Recommendation
In this tutorial, we package DolphinDB and the data sets we use into Docker. DolphinDB DFS in docker ://user_behavior It contains a table, User, which holds the behavior records of nearly one million Taobao APP users between November 25, 2017 and December 3, 2017. We use the combined partition method. The first layer is divided into 180 partitions according to date, one partition per day, and the second layer is divided into hash partitions according to userID. The structure of the user table is as follows:
The meanings of various user behavior types are as follows:
- Pv: Browse the product details page
- Buy: Buy goods
- Cart: To add items to a shopping cart
- Fav: Collect goods
1. Download the Docker deployment package
DolphinDB and the data used have been packaged into docker containers. Make sure the Docker environment is deployed before use. For the Docker installation tutorial, see docs.docker.com/install/. From www.dolphindb.cn/downloads/b… Download the deployment package and execute the following code in the directory where the deployment package is located.
Decompress the deployment package:
gunzip bigdata.tar.gz
Copy the code
Import a container snapshot as an image:
cat bigdata.tar | docker import - my/bigdata:v1
Copy the code
Get ID of mirror my/ bigData :v1:
docker images
Copy the code
Start the container (replace the images ID as needed) :
docker run -dt -p 8888:8848 --name test <image id> /bin/bash ./dolphindb/start.sh
Copy the code
Enter the local IP address 8888 (for example, localhost:8888) in the address bar of the browser. Then go to DolphinDB Notebook. The following code is executed in DolphinDB Notebook.
The DolphinDB license is valid until 1 September 2019. If the license expires, download the DolphinDB community version from the DolphinDB website. Replace with the community edition license bigdata. Tar/dolphindb dolphindb. Lic.
2. User behavior analysis
View data volume:
login("admin","123456")
user=loadTable("dfs://user_behavior","user")
select count(*) from user
Copy the code
98914533
There are 98,914,533 records in the user table.
Analyze the behavior of users in the whole process from browsing to purchasing:
PV=exec count(*) from user where behavior="pv"
Copy the code
88596903
UV=count(exec distinct userID from user)
Copy the code
987984
During the nine days, page views of Taobao APP were 88,596,903 and unique visitors were 987,984.
The exec used above is a DolphinDB feature that is similar to Select. The difference between the two is that the SELECT statement always returns a table, exec returns a vector when selecting a column, a scalar when used with an aggregate function, and a matrix when used with Pivoy BY to facilitate subsequent calculations of the data.
Count the number of users who view a page only once:
onceUserNum=count(select count(behavior) from user group by userID having count(behavior)=1)
Copy the code
92
jumpRate=onceUserNum\UV*100
Copy the code
0.009312
Only 92 users left the APP after browsing only one page, accounting for 0.0093% of the total users, which is almost negligible, indicating that Taobao is attractive enough to keep users in the APP.
Count the number of user actions:
behaviors=select count(*) as num from user group by behavior
Copy the code
Calculate the conversion rate from browse to intended purchase:
Adding goods to the shopping cart and collecting goods can be considered as users’ intention to purchase. Count the number of user behaviors with intention to purchase:
fav_cart=exec sum(num) from behaviors where behavior="fav" or behavior="cart"
Copy the code
8318654
intentRate=fav_cart\PV*100
Copy the code
9.389328
The conversion rate from browse to purchase is only 9.38%.
buy=(exec num from behaviors where behavior="buy")[0]
Copy the code
1998976
buyRate=buy\PV*100
Copy the code
2.256259
intent_buy=buy\fav_cart*100
Copy the code
24.030041
The conversion rate from browsing to final purchase is only 2.25%, and from intended purchase to final purchase is 24.03%, indicating that most users will collect or add their favorite products to their shopping cart, but not necessarily buy them immediately.
Statistics on unique visitors for various user behaviors:
userNums=select count(userID) as num from (select count(*) from user group by behavior,userID) group by behavior
Copy the code
pay_user_rate=(exec num from userNums where behavior="buy")[0]\UV*100
Copy the code
67.852313
During the nine days, the number of paying users using Taobao APP accounted for 67.8 percent, indicating that most users will shop on Taobao APP.
Count the number of users per day for various user behaviors:
DailyUserNums =select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, The sum (iif (behaviors = = "cart", 1, 0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user group by date(behaveTime) as dateCopy the code
The number of visits to Taobao APP increased significantly on Friday, Saturday and Sunday (2017.11.25, 2017.11.26, 2017.12.02, 2017.12.03).
The iIF is a DolphinDB conditional operator. Its syntax is IIF (cond, trueResult, falseResult). Cond is usually a Boolean expression and trueResult is returned if cond is satisfied. Returns falseResult.
Statistics are collected on the number of user behaviors at different time points each day. We offer the following two methods:
The first method is to count the data separately for each time period and combine the results. For example, count the number of user behaviors at different time periods on Wednesday, November 29, 2017.
Re1 =select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t00:00:00: 2017.11.29 t05:59:59 re2=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t06:00:00: 2017.11.29 t08:59:59 re3=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t09:00:00: 2017.11.29t11:59:59 re4=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t00:00:00: 2017.11.29 t13:59:59 re5=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t15:00:00: 2017.11.29t17:59:59 re6=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t15:00:00: 2017.11.29t21:59:59 re7=select behaveTime as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29t11:00:00: 2017.11.29 T23:59:59 re = unionAll ([re1, re2 re3, re4, re5, re6, re7], false)Copy the code
This approach is simpler, but requires a lot of repetitive code. You can also wrap repetitive code into functions.
def calculateBehavior(startTime,endTime){ return select first(behaveTime) as time, Sum (iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, The sum (iif (behaviors = = "cart", 1, 0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from user where behaveTime between startTime: endTime}Copy the code
In this case, you only need to specify the start time of the time range.
Another way to do this is through DolphinDB’s Map-Reduce framework. For example, collect the user behavior on Wednesday, November 29, 2017.
Def caculate(t){return first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, Sum (iif(behavior=="cart",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, Sum (iif(behavior=="buy",1,0)) as payment from t} ds1 = repartitionDS(<select * from user>, 'behaveTime') RANGE, 2017.11.29 T00:00:00 2017.11.29 T06:00:00 0 2017.11.29 T09:00:00 2017.11.29 T12:00:00 2017.11.29 T14:00:00 2017.11.29T18:00:00 2017.11.29T22:00:00 2017.11.29T23:59:59) WedBehavior = Mr (ds1, caculate, unionAll{, false})Copy the code
We used the repartitionDS function to repartition the user table by time range (without changing the original partitioning method of the User table), and generated multiple data sources, and then computed the data sources in parallel using the Mr Function. DolphinDB would apply the caculate function to the various data sources and combine the results.
On weekdays, taobao APP usage is highest in the early morning (0 to 6 PM), followed by the afternoon (14 to 16 PM).
Statistics of user behavior on Saturday (November 25, 2017) and Sunday (November 26, 2017) :
ds2 = repartitionDS(<select * from user>, `behaveTime, RANGE, 2017.11.25 T00:00:00 2017.11.25 T06:00:00 0 2017.11.25 T09:00:00 2017.11.25 T12:00:00 2017.11.25 T14:00:00 SatBehavior = Mr (ds2, caculate, unionAll{, false}) SatBehavior = Mr (ds2, caculate, unionAll{, false})Copy the code
ds3 = repartitionDS(<select * from user>, `behaveTime, RANGE, 2017.11.26 T00:00:00 2017.11.26 T06:00:00 0 2017.11.26 T09:00:00 2017.11.26 T12:00:00 2017.11.26 T14:00:00 2017.11.26T18:00:00 2017.11.26T22:00:00 2017.11.26T23:59:59) SunBehavior = Mr (ds3, caculate, unionAll{, false})Copy the code
The usage rate of Taobao APP was higher on Saturday and Sunday than on weekdays. Similarly, the peak of Taobao APP use on Saturday and Sunday is in the wee hours of the morning (0 to 6 PM).
3. Commodity analysis
allItems=select distinct(itemID) from user
Copy the code
4142583
During the nine days, 4,142,583 items were involved.
Count the purchase times of each commodity:
itemBuyTimes=select count(userID) as times from user where behavior="buy" group by itemID order by times desc
Copy the code
Top 20 items in sales:
salesTop=select top 20 * from itemBuyTimes order by times desc
Copy the code
The item with ID 3122135 had the highest sales, with 1,408 purchases.
Count the quantity of goods in each purchase time:
buyTimesItemNum=select count(itemID) as itemNums from itemBuyTimes group by times order by itemNums desc
Copy the code
The results showed that the vast majority (370,747 items) were purchased only once during the nine days, accounting for 8.94% of all items. The more times you buy, the fewer items are involved.
Count the number of user actions for all products:
AllItemsInfo =select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, Sum (iif(behavior=="cart",1,0) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by itemIDCopy the code
Top 20 items with page views:
pvTop=select top 20 itemID,pageView from allItemsInfo order by pageView desc
Copy the code
The ID of the most viewed product is 812879, with 29,720 browsing times, but the sales volume is only 135, not among the top 20.
Statistics of the number of user behaviors of top 20 products:
select * from ej(salesTop,allItemsInfo,`itemID) order by times desc
Copy the code
The product 3122135 with the highest sales volume has 1777 page views, not among the top 20 page views. The conversion rate from browsing to purchase is as high as 79.2%. This product may be just needed, and users decide to buy it without too much browsing.
Extended exercises:
(1) Calculate the purchase rate of Taobao APP every hour on November 25, 2017 (purchase rate = times of purchase/total number of behaviors *100%)
(2) Find out who buys the most and which items he buys the most
(3) Calculate the purchase times of the commodity with ID 3122135 in each time period
(4) Count The Times of each behavior in each category
(5) Calculate the highest selling items in each category
This tutorial is for learning purposes only.