Pandas is one of the most popular data-analysis libraries in Python, but someone like me who works in Java/Scala wants a comparable solution on the JVM. A search of the web turns up a rich ecosystem: Spark for big data, Weka for GUI-driven data mining, Smile for machine learning, Joinery for data-frame transformations, and more. In the end, Tablesaw won out. The reason goes back to the old saying that discussing a tool apart from its usage scenario is meaningless, so let's start with the scenario.

Application scenarios

Data processing pipeline

The author's team uses Spark to process the massive user-behavior data generated by the business modules. Redundant records are filtered out or reduced to simple summaries and exported to different PostgreSQL tables that store the normalized base data. For example, LoginTable and ShoppingTable record user login and purchase behavior respectively. Different services are then set up in a secondary processing module, such as S1 and S2: S1 accesses the database directly, while S2 accesses both the database and S1's statistical results, and also relies on local CSV files.

Classification of computing requirements

  • Queries and computation. Whole-table queries, per-column statistics, transformations, and so on.
  • Filtering and sorting. Conditional queries, custom multi-key sorting, and so on.
  • Aggregations and joins. For example, LoginTable records the terminal type (Android/iOS), while ShoppingTable records the types and quantities of items each user purchases. To compare the purchasing behavior of users on different terminals, the two tables are joined and aggregated by terminal type.
  • Model validation. The analysis model behind a decision keeps changing and must be adjusted repeatedly before a final result emerges; the key is to iterate quickly.

Summary of characteristics

| Dimension | Characteristics | Notes |
| --- | --- | --- |
| Data volume | Small; a single machine can handle it | The raw data has already been summarized by Spark |
| Format | Database, local CSV, JSON | JSON mostly comes from RESTful services |
| Computation | Create/read/update/delete, conditional queries, table joins, aggregation, statistics | The more built-in operators, the better the extensibility |
| Interaction | Output to specified formats, visualization, interactivity, etc. | Web rendering, terminal, [Jupyter Notebook](https://jupyter.org/) |
| Integration | Lightweight, a library rather than a service | Easy to embed in pipelines and interact with other logic |

Why Tablesaw

The answer is simple: it fits the author's scenario perfectly. Tablesaw's feature list:

  • Data import: RDBMS, Excel, CSV, JSON, HTML, or fixed-width files. Besides local access, remote access over HTTP, S3, etc. is also supported.
  • Data export: CSV, JSON, HTML, or fixed-width files (see the round-trip sketch after this list).
  • Table operations: similar to a Pandas DataFrame: create, read, update, delete, join, sort.
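
To get a feel for the API, here is a minimal import/export round trip (a sketch; the file names are made up):

import tech.tablesaw.api.Table

// Read a local CSV file (remote sources such as HTTP URLs are also supported)
val t = Table.read().csv("input.csv")

// Write the table back out as CSV
t.write().csv("output.csv")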

These are the basic functions; the author found several further points even more appealing in practice.

All right, with that said, let's get straight to the practical content.

Basic usage

Installation

Tablesaw consists of multiple libraries. tablesaw-core and tablesaw-jsplot are recommended: the former is the base library, and the latter renders charts. Other modules, such as tablesaw-html, tablesaw-json, and tablesaw-beakerx, support additional data formats and integrations and can be pulled in as needed. In fact, a couple of lines of configuration are all it takes: lightweight and flexible. tablesaw-core is used below to show how to install the JAR.

Maven dependency configuration:

<dependency>
  <groupId>tech.tablesaw</groupId>
  <artifactId>tablesaw-core</artifactId>
  <version>0.37.3</version>
</dependency>

SBT dependency configuration:

libraryDependencies += "tech.tablesaw" % "tablesaw-core" % "0.37.3"

Table creation

There are two basic ways:

  • Read from a data source and create the table directly
  • Create an empty table programmatically, then add columns and rows

The CSV file to be processed is defined as follows: the first column is the date, the second the name, the third the working hours (in hours), and the fourth the remuneration (in yuan). Three typical examples follow, illustrating the different import approaches.

1. Import CSV directly

import tech.tablesaw.api.Table

// Read the CSV file input.csv, automatically inferring the schema
val tbl = Table.read().csv("input.csv")

// Print the contents of the table
println(tbl.printAll())

// Check the schema
println(tbl.columnArray().mkString("\n"))

The output table is as follows:

date        name       working hours  remuneration
2019-01-08  tom        8              1000
2019-01-09  jerry      7              500
2019-01-10  Zhang San  8              999
2019-01-10  jerry      8              550
2019-01-10  tom        8              1000
2019-01-11  Zhang San  6              800
2019-01-11  Li Si      12             1500
2019-01-11  Cathy      8              900
2019-01-11  tom        6.5            800

The inference turns out to be quite accurate, with good support for Chinese. The output schema is:

Date column: date
String column: name
Double column: working hours
Integer column: remuneration

Tablesaw supports the following data types: SHORT, INTEGER, LONG, FLOAT, BOOLEAN, STRING, DOUBLE, LOCAL_DATE, LOCAL_TIME, LOCAL_DATE_TIME, INSTANT, TEXT, SKIP. Most column types map directly to ordinary table types, but two deserve emphasis:

  • INSTANT. A timestamp, accurate to nanoseconds, introduced in Java 8.
  • SKIP. Marks a column to be ignored and not read in.

2. Importing from CSV with an explicit schema

Sometimes the automatic inference is not quite right: for example, LONG is expected but INTEGER is detected; or data appended after the initial read changes type, such as remuneration being read as an integer but later growing with floating-point values. In such cases you need to set the CSV schema explicitly, using the CsvReadOptions provided by Tablesaw. For example, to declare remuneration as a double up front:

import tech.tablesaw.api.{ColumnType, Table}
import tech.tablesaw.io.csv.CsvReadOptions

// Specify the data types of the CSV columns in order
val colTypes: Array[ColumnType] =
  Array(ColumnType.LOCAL_DATE, ColumnType.STRING, ColumnType.DOUBLE, ColumnType.DOUBLE)
val csvReadOptions = CsvReadOptions.builder("demo.csv").columnTypes(colTypes).build()
val tbl = Table.read().usingOptions(csvReadOptions)

// Check the schema
println(tbl.columnArray().mkString("\n"))

The output schema is:

Date column: date
String column: name
Double column: working hours
Double column: remuneration
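
The ColumnType.SKIP type mentioned earlier plugs into the same mechanism. A minimal sketch, assuming the same four-column CSV and that the name column should be dropped on read:

// SKIP in the schema means the name column is never materialized
val typesWithSkip: Array[ColumnType] =
  Array(ColumnType.LOCAL_DATE, ColumnType.SKIP, ColumnType.DOUBLE, ColumnType.DOUBLE)
val skipOptions = CsvReadOptions.builder("demo.csv").columnTypes(typesWithSkip).build()
val tblNoName = Table.read().usingOptions(skipOptions)  // only three columns are read in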

3. Setting the schema programmatically and filling in data

This approach is suitable for a variety of scenarios and can import data from different data sources at run time.

The basic process is:

  • Create an empty table with a name.

  • Schema: add columns with the specified names and data types, in order, to the table.

  • Append data to the table row by row; the elements of each row are appended to their respective columns.

To tweak the previous example a little, suppose the data comes from the network and the structure serialized into local memory is:

import java.time.LocalDate

// Define the in-memory structure for data-source conversion as a case class
case class RowData(date: LocalDate, name: String, workTime: Double, salary: Double)

Create a function to add the collected data set to the table:

// @param tblName  the table name
// @param colNames the list of column names in the table
// @param colTypes the list of data types, one per column
// @param rows     the row data to append
def createTable(tblName: String, colNames: Seq[String], colTypes: Seq[ColumnType], rows: Seq[RowData]): Table = {
  // Create an empty table with the given name
  val tbl = Table.create(tblName)

  // Create the schema: add columns in order
  val colCnt = math.min(colTypes.length, colNames.length)
  val cols = (0 until colCnt).map { i =>
    colTypes(i).create(colNames(i))
  }
  tbl.addColumns(cols: _*)

  // Append the data
  rows.foreach { row =>
    tbl.dateColumn(0).append(row.date)
    tbl.stringColumn(1).append(row.name)
    tbl.doubleColumn(2).append(row.workTime)
    tbl.doubleColumn(3).append(row.salary)
  }

  tbl
}
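
A minimal usage sketch of createTable (the sample rows are made up to mirror the earlier CSV):

import java.time.LocalDate
import tech.tablesaw.api.ColumnType

val rows = Seq(
  RowData(LocalDate.of(2019, 1, 8), "tom", 8.0, 1000.0),
  RowData(LocalDate.of(2019, 1, 9), "jerry", 7.0, 500.0)
)
val demoTbl = createTable(
  "demo",
  Seq("date", "name", "working hours", "remuneration"),
  Seq(ColumnType.LOCAL_DATE, ColumnType.STRING, ColumnType.DOUBLE, ColumnType.DOUBLE),
  rows
)
println(demoTbl.printAll())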

The above illustrates the complete flow of adding data: create the table, add the columns, and append elements to the columns. All creation and transformation can be built from these three basic operations.

Column processing

Column operations are the foundation of table processing. Having covered column data types, naming, and element appending, let's move on to the rest of the basics.

1. Traversal and transformation

For example, print the names of all records in the demo table in order:

// Get the name column, indexed by column name
val nameCol = tbl.stringColumn("name")

// Traverse by row number
(0 until nameCol.size()).foreach { i =>
  println(nameCol.get(i))
}

// Or use the traversal interface provided by the column directly
nameCol.forEach(println(_))

Besides traversal, the other common pattern is transforming one column into another, either keeping the type and changing only the values, or changing the type as well. Taking working hours as an example, treat no less than 8 hours as full attendance:

import it.unimi.dsi.fastutil.doubles.Double2BooleanFunction
import tech.tablesaw.api.BooleanColumn

// Get the working-hours column by column index
val workTimeCol = tbl.doubleColumn(2)

// Transformation 1: map, where the output column type matches the input column type
val fullTimeFlagCol = workTimeCol.map { time =>
  // time is a Double, so the result must also be a Double, otherwise compilation fails
  if (time >= 8) 1.0 else 0.0
}

// Transformation 2: mapInto, where input and output types may differ,
// but a target column of the same size must be created in advance
val fullTimeCol = BooleanColumn.create("full-time",
                                       workTimeCol.size()) // A Boolean column recording the attendance flag
val mapFunc: Double2BooleanFunction =
  (workTime: Double) => workTime >= 8.0                    // The mapping function, via SAM conversion
workTimeCol.mapInto(mapFunc, fullTimeCol)                  // Transform
tbl.addColumns(fullTimeCol)                                // Add the column to the table

The output is:

date        name       working hours  remuneration  full-time
2019-01-08  tom        8              1000          true
2019-01-09  jerry      7              500           false
2019-01-10  Zhang San  8              999           true
2019-01-10  jerry      8              550           true
2019-01-10  tom        8              1000          true
2019-01-11  Zhang San  6              800           false
2019-01-11  Li Si      12             1500          true
2019-01-11  Cathy      8              900           true
2019-01-11  tom        6.5            800           false

2. Column operations

Tablesaw provides a rich set of operators for columns, with specialized interfaces for each data type. Consult the API documentation first and write code second. The main categories:

  • Cross-column arithmetic. All elements of a column are combined with the same scalar, or the elements of two columns are combined pairwise. For example, each person's hourly wage:

// Divide the remuneration column by the working-hours column to get the hourly wage
tbl.doubleColumn(3).divide(tbl.doubleColumn(2))

  • Single-column statistics. Mean, standard deviation, largest N values, smallest N values, window functions, etc.

// The standard deviation of the remuneration column
tbl.doubleColumn(3).standardDeviation()

  • Sorting. Numeric, time, and string types support ascending, descending, and custom sorting by default (see the sketch after this list).
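
As promised, a sorting sketch using the table-level helpers, assuming the columns are literally named remuneration and date as in the examples (the "-" prefix marks descending order):

// Sort by remuneration descending, breaking ties by date ascending
val sorted = tbl.sortOn("-remuneration", "date")

// Equivalent single-key helper
val topEarnersFirst = tbl.sortDescendingOn("remuneration")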

3. Filtering

Tablesaw expresses column filter criteria as Selections, which can be combined with and, or, and not. Each column type provides interfaces, prefixed with "is", that generate such criteria directly. For example, select records between 2019-01-09 and 2019-01-10 where the working hours = 8 and the remuneration < 1000:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Set the date filter criterion
val datePattern = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val dateSel = tbl.dateColumn(0)
                 .isBetweenIncluding(LocalDate.parse("2019-01-09", datePattern),
                                     LocalDate.parse("2019-01-10", datePattern))
// Set the working-hours filter criterion
val workTimeSel = tbl.doubleColumn(2).isEqualTo(8.0)
// Set the remuneration filter criterion
val salarySel = tbl.doubleColumn(3).isLessThan(1000)
// Filter the table on all criteria combined
tbl.where(dateSel.and(workTimeSel).and(salarySel))

The output results are as expected:

date        name       working hours  remuneration  full-time
2019-01-10  Zhang San  8              999           true
2019-01-10  jerry      8              550           true
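
Criteria can also be combined with or. A small sketch on the same table:

// Rows where the name is tom OR the working hours are under 8
val sel = tbl.stringColumn("name").isEqualTo("tom")
             .or(tbl.doubleColumn(2).isLessThan(8.0))
println(tbl.where(sel).printAll())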

Table processing

Besides the basic operations, which the official site documents, three table operations deserve special mention: joins, grouped aggregation, and table splitting.

Joins

Two tables are joined on a common column name. The basic idea is to concatenate the other columns of both tables, keyed on the common column. The variants differ in how rows are combined:

  • Inner. Keeps the intersection of the common-column values; everything else is filtered out.
  • Outer. Takes the union of the common-column values, filling missing data with null defaults. It comes in three flavors:
    • LeftOuter. The result's common-column data exactly matches the left table; right-table rows without a match are dropped, and missing values are set to null.
    • RightOuter. The result's common-column data exactly matches the right table; left-table rows without a match are dropped, and missing values are set to null.
    • FullOuter. The result's common-column data is the union of both tables.

For example, add a new table tbl2 that records where each person works:

name       place
Zhang San  headquarters
Li Si      store 1
Cathy      store 2

Join it with the demo table in inner mode:

val tbl3 = tbl.joinOn("name").inner(tbl2)

The contents of tbl3 are:

date        name       working hours  remuneration  full-time  place
2019-01-10  Zhang San  8              999           true       headquarters
2019-01-11  Zhang San  6              800           false      headquarters
2019-01-11  Li Si      12             1500          true       store 1
2019-01-11  Cathy      8              900           true       store 2

You can see that tom and jerry have both been filtered out by the inner join on name (see the sketch below for how to keep them).
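
If tom and jerry should instead be kept with an empty place, a left outer join does that. A sketch on the same tables:

// Keep every row of the left table; unmatched place values become missing
val tblLeft = tbl.joinOn("name").leftOuter(tbl2)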

Grouped aggregation

Similar to GROUP BY in SQL. The interface is tbl.summarize(col1, col2, ..., aggFunc1, aggFunc2, ...).by(groupCol1, groupCol2, ...). The arguments to by are the grouping column names; col1, col2, ... are the names of the columns to aggregate after grouping; and aggFunc1, aggFunc2, ... are the aggregate functions applied to every aggregated column. Here's an example of calculating each person's total remuneration:

import tech.tablesaw.aggregate.AggregateFunctions.sum

tbl.summarize("remuneration", sum).by("name")
name       Sum [remuneration]
tom        2800
jerry      1050
Zhang San  1799
Li Si      1500
Cathy      900
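
Several aggregate functions can be applied in a single call. A sketch computing both the total and the average remuneration per person:

import tech.tablesaw.aggregate.AggregateFunctions.{mean, sum}

tbl.summarize("remuneration", sum, mean).by("name")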

Table splitting

Unlike grouped aggregation, sometimes after grouping by columns you don't want to collapse each group into a single value, but rather keep it for more complex processing. That's where table splitting comes in. The interface is simple: tbl.splitOn(col, ...), whose arguments are the column names to split on. For example:

// Split the table by name and place, saving the generated subtables into a list
tbl3.splitOn("name", "place").asTableList()
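
A sketch of consuming the subtables from Scala; asTableList() returns a java.util.List, hence the converter import (scala.jdk.CollectionConverters assumes Scala 2.13):

import scala.jdk.CollectionConverters._

tbl3.splitOn("place").asTableList().asScala.foreach { sub =>
  println(s"${sub.name()}: ${sub.rowCount()} rows")
}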

Visualization

Tablesaw can export tables as interactive HTML. Charts can also be opened and inspected directly in a browser while debugging, and specialized wrappers are provided for different chart types. As a simple example, plot each person's remuneration over time:

import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.TimeSeriesPlot

// Group tbl by the name column, with date as the time axis, to plot each person's remuneration curve,
// and set the chart title to "Salary curve"
val fig = TimeSeriesPlot.create("Salary curve", tbl, "date", "remuneration", "name")
Plot.show(fig)

Summary

This article has briefly introduced the features and usage of Tablesaw. Based on my own experience, what I like most is:

  • Unified and clear API interface
  • Interactive charts are easy to generate and can be connected to the Web

In addition, Tablesaw's development and maintenance are in full swing, and we can expect more interesting features in the future.