Pandas is one of the most popular data analysis libraries in Python, and someone like me who works with Java/Scala naturally wants a similar solution on the JVM. A search of the Web turned up a rich ecosystem: Spark for big data, Weka for GUI-driven data mining, Smile for machine learning, Joinery for DataFrame-style transformations, and more. In the end Tablesaw was chosen. As the old saying goes, discussing a tool without its application scenario is meaningless, so let me start with the scenario.
Application scenarios
Data processing pipeline
My modules use Spark to process the massive user-behavior data generated by the services. Redundant records are filtered out or summarized and exported to different PostgreSQL tables that store the normalized base data. For example, LoginTable and ShoppingTable record users' login and purchase behavior respectively. Different services are then set up in the secondary-processing module, such as S1 and S2: S1 accesses the database directly, while S2 accesses both the database and S1's statistical results, and also relies on local CSV files.
Classification of computing requirements
- Queries and calculations. Whole-table queries, column statistics, transformations, etc.
- Filtering and sorting. Conditional queries, custom multi-key sorting, and so on.
- Joins and aggregation. For example, LoginTable records the terminal type (Android/iOS) while ShoppingTable records the types and quantities of items each user purchases; to compare the purchasing behavior of users on different terminals, the two tables are joined and aggregated by terminal type.
- Model validation. The analysis model used to support decisions keeps changing and has to be adjusted repeatedly before a final result is reached, so fast iteration is the key.
Characteristics of the summarized data
Dimension | Characteristics | Notes |
---|---|---|
Data volume | Small enough for a single machine | The raw data has already been summarized by Spark |
Format | Database, local CSV, JSON | JSON mostly comes from RESTful services |
Computation | CRUD, conditional queries, table joins, aggregation, statistics | The more built-in operators, the better the extensibility |
Interaction | Output to specified formats, visualization, interactivity, etc. | Web rendering, terminal, [Jupyter Notebook](https://jupyter.org/) |
Integration | Lightweight, a library rather than a service | Easy to embed in pipelines and interact with other logic |
Why Tablesaw
It's simple: it fits my needs perfectly. Tablesaw's feature list:
- Data import: RDBMS, Excel, CSV, JSON, HTML or fixed-width files. Besides local files, it also supports remote access over HTTP, S3, etc.
- Data export: CSV, JSON, HTML or fixed-width files.
- Table operations: similar to a Pandas DataFrame; create/read/update/delete, joins, sorting (a small import/export sketch follows this list).
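As a minimal round-trip sketch of the import and export listed above (a hypothetical local file input.csv is assumed; remote sources and JSON need the corresponding extra modules):
import tech.tablesaw.api.Table

// Read a local CSV file, inferring the schema automatically
val t = Table.read().csv("input.csv")
// Write the table back out to another CSV file
t.write().csv("output.csv")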
These are the basic functions. What I find more interesting are the following points:
- Visualization based on the Plotly framework. This sidesteps the Java UI toolkits and integrates better with the Web. It supports 2D and 3D views, and the chart types are rich: line, scatter, box plot, candlestick, heat map, pie chart, etc. More importantly:
  - Interactive charts, especially suitable for comparing multiple data sets and rotating 3D views.
  - Charts are exported as JavaScript strings, making them easy to render as HTML from a web service.
- Integration with Smile. Tablesaw tables can be exported in a format that Smile recognizes, making it easy to feed its powerful machine learning library.
All right, with that said, let's get straight to the practical part.
Basic usage
Installation
Tablesaw consists of multiple libraries; tablesaw-core and tablesaw-jsplot are the recommended pair. The former is the base library and the latter renders charts. Others, such as tablesaw-html, tablesaw-json, and tablesaw-beakerx, can be added as needed for additional formats and integrations. Adding a dependency takes only a line or two of configuration, which keeps things lightweight and flexible. tablesaw-core is used below as the example of how to pull in the JAR.
Maven dependency configuration:
<dependency>
<groupId>tech.tablesaw</groupId>
<artifactId>tablesaw-core</artifactId>
<version>0.37.3</version>
</dependency>
SBT dependency configuration:
libraryDependencies += "tech.tablesaw" % "tablesaw-core" % "0.37.3"
Creating tables
There are two basic approaches:
- Read from a data source and create the table directly
- Create an empty table programmatically and add columns or rows
The CSV file to be processed is defined as follows: the first column is the date, the second the name, the third the working hours (in hours), and the fourth the remuneration (in yuan). Three typical examples then illustrate the different ways to import it.
1. Import CSV directly
import tech.tablesaw.api.Table

// Read the CSV file input.csv and infer the schema automatically
val tbl = Table.read().csv("input.csv")
// Print the contents of the table
println(tbl.printAll())
// Check the schema
println(tbl.columnArray().mkString("\n"))
The output table is as follows:
date | name | working hours | remuneration |
---|---|---|---|
2019-01-08 | tom | 8 | 1000 |
2019-01-09 | jerry | 7 | 500 |
2019-01-10 | Zhang San | 8 | 999 |
2019-01-10 | jerry | 8 | 550 |
2019-01-10 | tom | 8 | 1000 |
2019-01-11 | Zhang San | 6 | 800 |
2019-01-11 | Li Si | 12 | 1500 |
2019-01-11 | Cathy | 8 | 900 |
2019-01-11 | tom | 6.5 | 800 |
As you can see, the type inference is quite accurate, and Chinese text is handled well. The output schema is:
Date column: date
String column: name
Double column: working hours
Integer column: remuneration
Tablesaw supports the following data types: SHORT, INTEGER, LONG, FLOAT, BOOLEAN, STRING, DOUBLE, LOCAL_DATE, LOCAL_TIME, LOCAL_DATE_TIME, INSTANT, TEXT, SKIP. Most of them map directly onto ordinary table column types, but two deserve special mention:
- INSTANT. Timestamps, accurate to nanoseconds, introduced in Java 8.
- SKIP. Marks a column to be ignored and not read in (see the sketch below).
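As a minimal sketch, SKIP can simply be placed in the column-type array passed to the CSV reader (the CsvReadOptions usage is elaborated in the next section; demo.csv and the column layout follow the example above):
import tech.tablesaw.api.{ColumnType, Table}
import tech.tablesaw.io.csv.CsvReadOptions

// Ignore the date column entirely while reading demo.csv
val typesWithSkip: Array[ColumnType] =
  Array(ColumnType.SKIP, ColumnType.STRING, ColumnType.DOUBLE, ColumnType.DOUBLE)
val noDateTbl = Table.read().usingOptions(
  CsvReadOptions.builder("demo.csv").columnTypes(typesWithSkip).build())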
2. Import from CSV with a specified schema
Sometimes the automatic inference is not accurate enough, for example a value you expect to be LONG is inferred as INTEGER; or data appended after the read changes the type, for example the remuneration column is read as integers but is later extended with floating-point values. In such cases you need to specify the CSV schema yourself, using the CsvReadOptions that Tablesaw provides. For example, to declare remuneration as a double up front:
import tech.tablesaw.api.{ColumnType, Table}
import tech.tablesaw.io.csv.CsvReadOptions

// Specify the data types of the CSV columns in order
val colTypes: Array[ColumnType] =
  Array(ColumnType.LOCAL_DATE, ColumnType.STRING, ColumnType.DOUBLE, ColumnType.DOUBLE)
val csvReadOptions = CsvReadOptions.builder("demo.csv").columnTypes(colTypes).build()
val tbl = Table.read().usingOptions(csvReadOptions)
// Check the schema
println(tbl.columnArray().mkString("\n"))
The output schema is:
Date column: date
String column: name
Double column: working hours
Double column: remuneration
3. Build the schema and fill in data programmatically
This approach is suitable for a variety of scenarios and can import data from different data sources at run time.
The basic process is:
- Create an empty table with a name.
- Build the schema: add columns with the specified names and data types, in order, to the table.
- Append data to the table row by row; the elements of each row are appended to their corresponding columns.
Modifying the previous example a little, suppose the data comes from the network and is deserialized into the following in-memory structure:
import java.time.LocalDate

// The in-memory structure the incoming data is converted into, defined as a case class
case class RowData(date: LocalDate, name: String, workTime: Double, salary: Double)
Create a function to add the collected data set to the table:
// @param tblName  the name of the table to create
// @param colNames the column names of the table, in order
// @param colTypes the data type of each column, in the same order
// @param rows     the collected rows to append to the table
def createTable(tblName: String, colNames: Seq[String], colTypes: Seq[ColumnType], rows: Seq[RowData]): Table = {
  // Create an empty table with the given name
  val tbl = Table.create(tblName)
  // Build the schema: add the columns in order
  val colCnt = math.min(colTypes.length, colNames.length)
  val cols = (0 until colCnt).map { i =>
    colTypes(i).create(colNames(i))
  }
  tbl.addColumns(cols: _*)
  // Append the data row by row
  rows.foreach { row =>
    tbl.dateColumn(0).append(row.date)
    tbl.stringColumn(1).append(row.name)
    tbl.doubleColumn(2).append(row.workTime)
    tbl.doubleColumn(3).append(row.salary)
  }
  tbl
}
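To close the loop, here is a hypothetical call to the createTable helper above (the sample rows are invented purely for illustration):
// Invented sample rows for demonstration
val rows = Seq(
  RowData(LocalDate.parse("2019-01-08"), "tom", 8.0, 1000.0),
  RowData(LocalDate.parse("2019-01-09"), "jerry", 7.0, 500.0)
)
val demoTbl = createTable(
  "demo",
  Seq("date", "name", "working hours", "remuneration"),
  Seq(ColumnType.LOCAL_DATE, ColumnType.STRING, ColumnType.DOUBLE, ColumnType.DOUBLE),
  rows
)
println(demoTbl.printAll())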
The above illustrates the complete flow of adding data: create the table, add columns, and append elements to the columns. All table creation and transformation can be built from these three basic operations.
Column processing
Column operations are the foundation of table processing. Having covered column data types, naming, and element appending, let's continue with the other basics.
1. Traversal and transformation
For example, print the names of all records in the demo table in order:
// Get the name column, indexed by column name
val nameCol = tbl.stringColumn("name")
// Traverse by row index
(0 until nameCol.size()).foreach { i =>
  println(nameCol.get(i))
}
// Or use the traversal interface provided by the column directly
nameCol.forEach(name => println(name))
Besides traversal, another common operation is transforming one column into another: either the values change while the type stays the same, or the type changes as well. Taking working hours as an example, we treat working at least 8 hours as full attendance:
// Get the working-hours column by its index
val workTimeCol = tbl.doubleColumn(2)
// Transformation 1: map, where the output column type is the same as the input column type
val fullTimeDoubleCol = workTimeCol.map { time =>
  // The column type is Double, so the result must also be a Double, otherwise compilation fails
  if (time >= 8) 1.0 else 0.0
}
// Transformation 2: mapInto, where input and output columns may have different data types,
// but the target column of the same size must be created in advance
val fullTimeCol = BooleanColumn.create("full-time",
  workTimeCol.size()) // A Boolean column recording the full-attendance flag
val mapFunc: Double2BooleanFunction =
  (workTime: Double) => workTime >= 8.0 // The mapping function, via SAM conversion
workTimeCol.mapInto(mapFunc, fullTimeCol) // Apply the transformation
tbl.addColumns(fullTimeCol) // Add the new column to the table
The output is:
date | name | working hours | remuneration | full-time |
---|---|---|---|---|
2019-01-08 | tom | 8 | 1000 | true |
2019-01-09 | jerry | 7 | 500 | false |
2019-01-10 | Zhang San | 8 | 999 | true |
2019-01-10 | jerry | 8 | 550 | true |
2019-01-10 | tom | 8 | 1000 | true |
2019-01-11 | Zhang San | 6 | 800 | false |
2019-01-11 | Li Si | 12 | 1500 | true |
2019-01-11 | Cathy | 8 | 900 | true |
2019-01-11 | tom | 6.5 | 800 | false |
2. Column operations
Tablesaw provides a rich set of operations on columns, with specialized interfaces for each data type. It is worth consulting the API documentation before writing code. A few major categories:
- Cross-column arithmetic. For example, every element in a column is combined with the same value, or the elements of two columns are combined pairwise. For instance, each person's hourly wage:
// Divide the remuneration column by the working-hours column to get the hourly wage
tbl.doubleColumn(3).divide(tbl.doubleColumn(2))
- Single-column statistics. Mean, standard deviation, top-N values, bottom-N values, window functions, etc. For example:
// Standard deviation of the remuneration column
tbl.doubleColumn(3).standardDeviation()
- Sorting. Numeric, time, and string columns support ascending, descending, and custom sorting by default (see the sketch below).
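A minimal table-level sorting sketch, assuming the column names of the demo table (the "-" prefix marking descending order follows Tablesaw's sortOn convention):
// Sort the whole table by remuneration in descending order
val byPay = tbl.sortDescendingOn("remuneration")
// Multi-key sort: name ascending, then remuneration descending
val byNameThenPay = tbl.sortOn("name", "-remuneration")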
3. Filtering
Tablesaw defines column filtering criteria as Selection objects, which can be combined with and, or, and not. Each column type provides condition-generating methods prefixed with "is". For example, select the records between 2019-01-09 and 2019-01-10 where working hours = 8 and remuneration < 1000:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Date filter: between 2019-01-09 and 2019-01-10, inclusive
val datePattern = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val dateSel = tbl.dateColumn(0)
  .isBetweenIncluding(LocalDate.parse("2019-01-09", datePattern),
    LocalDate.parse("2019-01-10", datePattern))
// Working-hours filter: equal to 8
val workTimeSel = tbl.doubleColumn(2).isEqualTo(8.0)
// Remuneration filter: less than 1000
val salarySel = tbl.doubleColumn(3).isLessThan(1000)
// Filter the table with all the conditions combined
tbl.where(dateSel.and(workTimeSel).and(salarySel))
The output results are as expected:
date | name | working hours | remuneration | full-time |
---|---|---|---|---|
2019-01-10 | Zhang San | 8 | 999 | true |
2019-01-10 | jerry | 8 | 550 | true |
Table processing
Beyond the basic operations, which the official documentation covers, three table operations deserve special mention: joins, grouped aggregation, and table splitting.
Joins
A join combines two tables on a common column: the common column acts as the key, and the remaining columns of rows with matching keys are stitched together. The join modes differ in how the keys are combined:
- Inner. Keeps only the keys present in both tables; all other rows are filtered out.
- Outer. Takes the union of keys and fills missing values with nulls. It comes in three flavors:
  - LeftOuter. The result's keys are exactly those of the left table; rows with no match in the right table get null values.
  - RightOuter. The result's keys are exactly those of the right table; rows with no match in the left table get null values.
  - FullOuter. The result's keys are the union of both tables.
For example, add a new table tbl2 that records where each person works:
name | place |
---|---|
Zhang San | headquarters |
Li Si | Store 1 |
Cathy | Store 2 |
Join it with the demo table in inner mode:
val tbl3 = tbl.joinOn("name").inner(tbl2)
The contents of tbl3 are:
date | name | working hours | remuneration | full-time | place |
---|---|---|---|---|---|
2019-01-10 | Zhang San | 8 | 999 | true | headquarters |
2019-01-11 | Zhang San | 6 | 800 | false | headquarters |
2019-01-11 | Li Si | 12 | 1500 | true | Store 1 |
2019-01-11 | Cathy | 8 | 900 | true | Store 2 |
You can see that tom and jerry are both filtered out by the inner join on name.
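For comparison, a full outer join would keep tom and jerry as well and leave their place values missing; a minimal sketch assuming the same tbl and tbl2:
// Full outer join: the name keys are the union of both tables; unmatched places stay empty
val tbl4 = tbl.joinOn("name").fullOuter(tbl2)
println(tbl4.printAll())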
Grouped aggregation
Similar to GROUP BY in SQL. The interface is tbl.summarize(col1, col2, col3, aggFunc1, aggFunc2, ...).by(groupCol1, groupCol2, ...). The by arguments are the column names to group by; col1, col2, col3 in summarize are the columns to aggregate after grouping; and aggFunc1, aggFunc2 are the aggregate functions applied to all of the aggregated columns. For example, to calculate each person's total remuneration:
// sum comes from tech.tablesaw.aggregate.AggregateFunctions; total remuneration per person
tbl.summarize("remuneration", sum).by("name")
name | Sum [remuneration] |
---|---|
tom | 2800 |
jerry | 1050 |
Zhang San | 1799 |
Li Si | 1500 |
Cathy | 900 |
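The same interface scales to several columns and several aggregate functions at once; a minimal sketch (mean and max come from AggregateFunctions, and the column names assume the demo table):
import tech.tablesaw.aggregate.AggregateFunctions.{mean, max}

// Average and maximum of working hours and remuneration, per person
val stats = tbl.summarize("working hours", "remuneration", mean, max).by("name")
println(stats.printAll())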
Table splitting
Unlike grouped aggregation, sometimes you do not want to collapse each group into a single value but rather keep each group as its own table for more complex processing. That is what table splitting is for. The interface is simple: tbl.splitOn(col...), where the arguments are the column names that define the sub-tables. For example:
// Split the joined table by name and place, and collect the resulting sub-tables into a List
tbl3.splitOn("name", "place").asTableList()
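After splitting, each sub-table can be processed independently; a minimal sketch that walks the returned Java list:
// Print the name and row count of each sub-table
val subTables = tbl3.splitOn("name", "place").asTableList()
(0 until subTables.size()).foreach { i =>
  val sub = subTables.get(i)
  println(s"${sub.name()}: ${sub.rowCount()} rows")
}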
Visualization
Tablesaw can export charts as interactive HTML. During debugging it can also open a chart directly in the browser, and it provides tailored wrappers for different chart types. As a simple example, plot each person's remuneration over time:
// Group tbl by the name column, with date as the time axis, to plot each person's remuneration curve
// The chart title is set to "Salary curve"
val fig = TimeSeriesPlot.create("Salary curve", tbl, "date", "remuneration", "name")
Plot.show(fig)
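Since charts are ultimately exported as JavaScript strings, here is a hedged sketch of how the figure above might be embedded in a web page (asJavascript and the div name "salaryChart" are assumptions; check the plotting API docs for the exact call):
// Assumption: Figure.asJavascript(divName) returns the JS that renders the chart into <div id="salaryChart">
val chartJs = fig.asJavascript("salaryChart")
// A web service can inject chartJs into an HTML template containing that div
println(chartJs)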
Summary
This post has briefly introduced Tablesaw's features and usage. Based on my own experience, what I like most:
- A unified and clear API
- Interactive charts that are easy to generate and easy to hook up to the Web
In addition, Tablesaw is under active development and maintenance, and we can expect more interesting features in the future.