Original link:tecdat.cn/?p=9800
Original source:Tuo End number according to the tribe public number
introduce
This article does not suggest that R is better or faster at data analysis than Python; I personally use both languages every day. This article merely provides an opportunity to compare the two languages.
The data in this article is updated daily, and my file version is larger at 4.63GB.
The CSV file contains 311 complaints from the city of New York. It is the most popular data set in New York City’s Open Data portal.
Data workflow
install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
Copy the code
library(plotly)
Copy the code
You need to create an account to connect to the Plotly API. Alternatively, you can just use the default GGploT2 graph.
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key
Copy the code
Analysis in R using Dplyr
Assume sqlite3 is installed (and therefore accessible from a terminal).
$ sqlite3 data.db # Create your database
$.databases # Show databases to make sure it works
$.mode csv
$.import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
$.quit
Copy the code
Load the data into memory.
library(readr)
# data.table, selecting a subset of columns
time_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv',
select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'),
showProgress = T))
Copy the code
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))
Copy the code
user.self | sys.self | elapsed | user.child | sys.child | |
---|---|---|---|---|---|
time_data.table | 63.588 | 1.952 | 65.633 | 0 | 0 |
time_data.table_full | 205.571 | 3.124 | 208.880 | 0 | 0 |
time_readr | 277.720 | 5.018 | 283.029 | 0 | 0 |
I’ll read the data using data.table. This fread function greatly improves the read speed.
About dplyr
By default, the dplyr query only extracts the first 10 rows from the database.
library(dplyr) ## Will be used for pandas replacement
# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
Copy the code
The two best options for data processing (besides R) are:
- The data table
- dplyr
Preview the data
# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))
head_(data)
Copy the code
Agency | CreatedDate | ClosedDate | ComplaintType | Descriptor | City |
---|---|---|---|---|---|
NYPD | 04/11/2015 02:13:04 AM | Noise – Street/Sidewalk | Loud Music/Party | BROOKLYN | |
DFTA | 04/11/2015 02:12:05 AM | Senior Center Complaint | N/A | ELMHURST | |
NYPD | 04/11/2015 02:11:46 AM | Noise – Commercial | Loud Music/Party | JAMAICA | |
NYPD | 04/11/2015 02:11:02 AM | Noise – Street/Sidewalk | Loud Talking | BROOKLYN | |
NYPD | 04/11/2015 02:10:45 AM | Noise – Street/Sidewalk | Loud Music/Party | NEW YORK |
Choose a few columns
ComplaintType | Descriptor | Agency |
---|---|---|
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Senior Center Complaint | N/A | DFTA |
Noise – Commercial | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
ComplaintType | Descriptor | Agency |
---|---|---|
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Senior Center Complaint | N/A | DFTA |
Noise – Commercial | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Noise – Commercial | Loud Music/Party | NYPD |
HPD Literature Request | The ABCs of Housing – Spanish | HPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Street Condition | Plate Condition – Noisy | DOT |
Use WHERE to filter rows
ComplaintType | Descriptor | Agency |
---|---|---|
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Noise – Commercial | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Filter multiple values IN a column using WHERE and IN
ComplaintType | Descriptor | Agency |
---|---|---|
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Noise – Commercial | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Noise – Street/Sidewalk | Loud Music/Party | NYPD |
Noise – Street/Sidewalk | Loud Talking | NYPD |
Look for unique values in a DISTINCT column
## City
## 1 BROOKLYN
## 2 ELMHURST
## 3 JAMAICA
## 4 NEW YORK
## 5
## 6 BAYSIDE
Copy the code
Use COUNT (*) and GROUP BY to query the value COUNT
# dt[, .(No.Complaints = .N), Agency] #setkey(dt, No.Complaints) # setkey index's the data q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n()) head_(q)Copy the code
Agency | No.Complaints |
---|---|
The 3-1-1 | 22499 |
ACS | 3 |
AJC | 7 |
ART | 3 |
CAU | 8 |
Use ORDER and – to sort results
How many cities are in the database?
# dt[, unique(City)]
q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)
Copy the code
## Number.of.Cities
## 1 1818
Copy the code
Let’s map the 10 most watched cities
City | No.Complaints |
---|---|
BROOKLYN | 2671085 |
NEW YORK | 1692514 |
BRONX | 1624292 |
766378 | |
STATEN ISLAND | 437395 |
JAMAICA | 147133 |
FLUSHING | 117669 |
ASTORIA | 90570 |
Jamaica | 67083 |
RIDGEWOOD | 66411 |
- with
UPPER
Convert CITY format.
CITY | No.Complaints |
---|---|
BROOKLYN | 2671085 |
NEW YORK | 1692514 |
BRONX | 1624292 |
766378 | |
STATEN ISLAND | 437395 |
JAMAICA | 147133 |
FLUSHING | 117669 |
ASTORIA | 90570 |
JAMAICA | 67083 |
RIDGEWOOD | 66411 |
Complaint Type (by city)
# Plot result
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) +
geom_bar(stat = 'identity') +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plt
Copy the code
Part 2 time series operation
The data provided does not fit SQLite’s standard date format.
Create a new column in the SQL database, then re-insert the data using the formatted date statement to create a new table and insert the formatted date into the original column name.
Filter SQLite lines with a timestamp string: YYYY-MM-DD hh: MM: ss
# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00',
# .(ComplaintType, CreatedDate, City)]
q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>%
select(ComplaintType, CreatedDate, City)
head_(q)
Copy the code
ComplaintType | CreatedDate | City |
---|---|---|
Noise – Street/Sidewalk | The 2014-11-12 11:59:56 | BRONX |
Taxi Complaint | The 2014-11-12 11:59:40 | BROOKLYN |
Noise – Commercial | The 2014-11-12 11:58:53 | BROOKLYN |
Noise – Commercial | The 2014-11-12 11:58:26 | NEW YORK |
Noise – Street/Sidewalk | The 2014-11-12 11:58:14 | NEW YORK |
Use strftime to pull units of hours from the timestamp
# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
select(ComplaintType, CreatedDate, City, hour)
head_(q)
Copy the code
ComplaintType | CreatedDate | City | hour |
---|---|---|---|
Noise – Street/Sidewalk | The 2015-11-04 02:13:04 | BROOKLYN | 02 |
Senior Center Complaint | The 2015-11-04 02:12:05 | ELMHURST | 02 |
Noise – Commercial | The 2015-11-04 02:11:46 | JAMAICA | 02 |
Noise – Street/Sidewalk | The 2015-11-04 02:11:02 | BROOKLYN | 02 |
Noise – Street/Sidewalk | The 2015-11-04 02:10:45 | NEW YORK | 02 |
Summary time series
First, create a new column with a timestamp rounded to the previous 15-minute interval
# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
select(CreatedDate, interval)
head_(q, 10)
Copy the code
CreatedDate | interval |
---|---|
The 2015-11-04 02:13:04 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:12:05 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:11:46 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:11:02 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:10:45 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:09:07 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:05:47 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:03:43 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:03:29 | The 2015-11-04 02:00:00 |
The 2015-11-04 02:02:17 | The 2015-11-04 02:00:00 |
Plot the results for 2003