Flink’s biggest highlight is real-time processing, and it treats batch processing as a special case of stream processing: both batch and streaming data can be processed by a single engine, and Flink will continue to invest in unifying batch and streaming. I introduced the DataStream API in the Flink DataStream API programming guide; in this article I cover the DataSet API for batch computing. From this article you can learn:
- DataSet transformations
- Use of Sources and Sinks
- Broadcast variables: basic concept and a usage demo
- Distributed cache: basic concept and a usage demo
- Demo cases for each DataSet API transformation
WordCount example
Before introducing the DataSet API, let’s take a look at a simple Word Count example to get an intuitive feel for the DataSet API programming model. The code is as follows:
public class WordCount {
    public static void main(String[] args) throws Exception {
        // The execution environment for batch processing
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // The data source
        DataSource<String> stringDataSource = env.fromElements("hello Flink What is Apache Flink");
        // Transformation
        AggregateOperator<Tuple2<String, Integer>> wordCnt = stringDataSource
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        String[] split = value.split(" ");
                        for (String word : split) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .groupBy(0)
                .sum(1);
        // Output
        wordCnt.print();
    }
}
As you can see from the example above, the basic programming model is:
- Get the ExecutionEnvironment for batch processing
- Load a data source
- Apply transformation operations
- Output the data
The data source, transformation operation, and data output are explained below.
Data Source
The DataSet API supports reading batch data from multiple data sources into the Flink system and converting it into a DataSet. There are three main types: file-based, collection-based, and generic data sources. You can also implement the InputFormat/RichInputFormat interfaces to access other kinds of data sources and data formats, such as CsvInputFormat and TextInputFormat. The supported data sources can be seen from the methods provided by the ExecutionEnvironment class.
File-based data source
readTextFile(path) / TextInputFormat
- explain
Reads a text file at the given path and converts its contents into a DataSet<String>, one element per line.
- use
// Read local files
DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
// Read an HDFS file
DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
readTextFileWithValue(path)/ TextValueInputFormat
- explain
Reads a text file and converts its contents into a DataSet<StringValue>. Unlike readTextFile(String), the element type is StringValue, a mutable string type. Using StringValue to store text data reduces the number of String objects created and thus the garbage-collection pressure.
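The idea behind a mutable string type can be sketched in plain Java (no Flink dependency; `MutableString` below is a hypothetical stand-in for Flink's StringValue, not the actual class):

```java
// A minimal sketch of the idea behind StringValue: one mutable holder is
// reused for every record instead of allocating a new String per record.
// MutableString is a hypothetical stand-in, not Flink's actual class.
public class MutableStringDemo {
    static final class MutableString {
        private char[] chars = new char[0];
        private int len = 0;

        // Overwrite the buffer in place, growing it only when needed
        void setValue(String s) {
            if (s.length() > chars.length) {
                chars = new char[s.length()];
            }
            s.getChars(0, s.length(), chars, 0);
            len = s.length();
        }

        @Override
        public String toString() {
            return new String(chars, 0, len);
        }
    }

    public static String lastLine(String[] lines) {
        MutableString holder = new MutableString(); // single reused object
        for (String line : lines) {
            holder.setValue(line); // no new object per record
        }
        return holder.toString();
    }

    public static void main(String[] args) {
        System.out.println(lastLine(new String[]{"hello", "flink"})); // flink
    }
}
```

One buffer is recycled across all records, which is what lets StringValue-based reading cut down object churn.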
- use
// Read local files
DataSet<StringValue> localLines = env.readTextFileWithValue("file:///some/local/file");
// Read an HDFS file
DataSet<StringValue> hdfsLines = env.readTextFileWithValue("hdfs://host:port/file/path");
readCsvFile(path)/ CsvInputFormat
- explain
Creates a CSV reader for comma-separated (or other-delimiter) files. The result can be converted directly into Tuple or POJO types. Row delimiters, field delimiters, included fields, and so on can be specified on the reader.
- use
// Read a CSV file with five fields, taking only the first and fourth fields
DataSet<Tuple2<String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
        .includeFields("10010")
        .types(String.class, Double.class);
// Read a CSV file with three fields and convert it to a POJO type
DataSet<Person> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
        .pojoType(Person.class, "name", "age", "zipcode");
readFileOfPrimitives(path, Class) / PrimitiveInputFormat
- explain
Reads a file of primitive types (e.g. String, Integer) and returns a DataSet of the corresponding primitive type.
- use
DataSet<String> data = env.readFileOfPrimitives("file:///some/local/file", String.class);
Collection-based data sources
fromCollection(Collection)
- explain
Creates a DataSet from a Java collection; all elements must be of the same type.
- use
DataSet<String> data= env.fromCollection(arrayList);
fromElements(T …)
- explain
Creates a DataSet from a given sequence of elements; all elements must be of the same type.
- use
DataSet<String> stringDataSource = env.fromElements("hello Flink What is Apache Flink");
generateSequence(from, to)
- explain
Generates a DataSet containing the numeric sequence in the interval [from, to]. Because the data is generated in parallel, the final order is not guaranteed.
- use
DataSet<Long> longDataSource = env.generateSequence(1, 20);
General-type data source
The DataSet API provides the generic InputFormat interface for accessing data sources of different types and formats. InputFormat implementations fall into two groups: file-based ones, used through the readFile() method of the DataSet API, and ones for other storage systems such as RDBMS or NoSQL databases, used through the createInput() method.
readFile(inputFormat, path) / FileInputFormat
- explain
Uses a custom file-based input format to read files of a specific format and convert them into a DataSet.
- use
env.readFile(new MyInputFormat(), "file:///some/local/file");
createInput(inputFormat) / InputFormat
- explain
Uses a custom generic data source and converts the data read into a DataSet. The following example uses Flink's built-in JDBCInputFormat to create an input format that reads a table from MySQL and converts the result into a DataSet<Row>.
- use
DataSet<Row> dbData =
env.createInput(
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost/mydb")
.setQuery("select name, age from stu")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO))
.finish()
);
Data Sink
Flink's data output in the DataSet API falls into three types. The first is file-based, via the write() methods of DataSet, which output the data to a file system. The second is based on common storage systems, via the output() method of DataSet, e.g. JDBCOutputFormat for writing data to a relational database. The last is client-side output, which collects the DataSet from the different nodes to the client and outputs it there, e.g. the print() method of DataSet.
Standard data output methods
// Text data
DataSet<String> textData = // [...]
// Write data to a local file
textData.writeAsText("file:///my/result/on/localFS");
// Write data to an HDFS file
textData.writeAsText("hdfs://nnHost:nnPort/my/result/on/localFS");
// Write data to a local file, overwriting it if it exists
textData.writeAsText("file:///my/result/on/localFS", WriteMode.OVERWRITE);
// Write the data to a local CSV file, with "|" as the field delimiter
DataSet<Tuple3<String, Integer, Double>> values = // [...]
values.writeAsCsv("file:///path/to/the/result/file", "\n", "|");
// Use a custom TextFormatter object
values.writeAsFormattedText("file:///path/to/the/result/file", new TextFormatter<Tuple3<String, Integer, Double>>() {
    public String format(Tuple3<String, Integer, Double> value) {
        return value.f1 + "-" + value.f0;
    }
});
Use custom output types
DataSet<Tuple3<String, Integer, Double>> myResult = [...]
// Write data of type tuple to relational database
myResult.output(
// Create and configure OutputFormat
JDBCOutputFormat.buildJDBCOutputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost/mydb")
.setQuery("insert into persons (name, age, height) values (?, ?, ?)")
.finish()
);
The DataSet conversion
A transformation converts one DataSet into another, and transformations can be chained. Flink provides a rich set of transformations. They are used as follows:
Map
Takes in one element and produces one element.
DataSource<String> source = env.fromElements("I", "like", "flink");
source.map(new MapFunction<String, String>() {
@Override
// Uppercase the data
public String map(String value) throws Exception {
return value.toUpperCase();
}
}).print();
FlatMap
Takes in one element and produces zero, one, or more elements.
stringDataSource
.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
String[] split = value.split(" ");
for (String word : split) {
out.collect(Tuple2.of(word, 1));
}
}
})
.groupBy(0)
.sum(1);
MapPartition
Similar to Map, except that MapPartition processes the data one partition at a time: the function receives each partition's data as an Iterable. The number of elements in each partition depends on the parallelism, and the function may emit any number of results.
source.mapPartition(new MapPartitionFunction<String, Long>() {
@Override
public void mapPartition(Iterable<String> values, Collector<Long> out) throws Exception {
long c = 0;
for (String value : values) {
c++;
}
// Emit the number of elements in this partition
out.collect(c);
}
}).print();
Filter
Filters the data: elements for which the function returns true are kept, and those returning false are dropped.
DataSource<Long> source = env.fromElements(1L, 2L, 3L, 4L, 5L);
source.filter(new FilterFunction<Long>() {
@Override
public boolean filter(Long value) throws Exception {
return value % 2 == 0;
}
}).print();
Project
Can only be used on Tuple datasets; a projection selects a subset of the tuple's fields.
DataSource<Tuple3<Long, Integer, String>> source = env.fromElements(
        Tuple3.of(1L, 20, "tom"),
        Tuple3.of(2L, 25, "jack"),
        Tuple3.of(3L, 22, "bob"));
// Take the first and third fields
source.project(0, 2).print();
Reduce
Combines the elements of a dataset into a single element by merging pairs; it can be applied to the entire dataset or to a grouped dataset.
DataSource<Tuple2<String, Integer>> source = env.fromElements(
        Tuple2.of("Flink", 1),
        Tuple2.of("Flink", 1),
        Tuple2.of("Hadoop", 1),
        Tuple2.of("Spark", 1),
        Tuple2.of("Flink", 1));
source
.groupBy(0)
.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
return Tuple2.of(value1.f0, value1.f1 + value2.f1);
}
}).print();
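The pairwise-merge semantics of a grouped reduce can be sketched in plain Java, without Flink (the word-count style data above is reused; names here are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupedReduceSketch {
    // Mimics groupBy(0).reduce(...): for each key, fold the counts pairwise.
    public static Map<String, Integer> reduceByKey(String[][] records) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String[] rec : records) {
            String key = rec[0];
            int value = Integer.parseInt(rec[1]);
            // merge(..., Integer::sum) is the pairwise merge a ReduceFunction performs
            result.merge(key, value, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] data = {
                {"Flink", "1"}, {"Flink", "1"}, {"Hadoop", "1"},
                {"Spark", "1"}, {"Flink", "1"}};
        System.out.println(reduceByKey(data)); // {Flink=3, Hadoop=1, Spark=1}
    }
}
```

Each call to the merge function sees the running result for the key and one more element, exactly like the two-argument ReduceFunction above.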
ReduceGroup
Combines the elements of a dataset into a single element; it can be applied to the entire dataset or to a grouped dataset. The reduce function receives the grouped elements as an Iterable.
DataSource<Tuple2<String, Long>> source = env.fromElements(
        Tuple2.of("Flink", 1L),
        Tuple2.of("Flink", 1L),
        Tuple2.of("Hadoop", 1L),
        Tuple2.of("Spark", 1L),
        Tuple2.of("Flink", 1L));
source
.groupBy(0)
.reduceGroup(new GroupReduceFunction<Tuple2<String,Long>, Tuple2<String,Long>>() {
@Override
public void reduce(Iterable<Tuple2<String, Long>> values, Collector<Tuple2<String, Long>> out) throws Exception {
Long sum = 0L;
String word = "";
for(Tuple2<String, Long> value:values){
sum += value.f1;
word = value.f0;
}
out.collect(Tuple2.of(word,sum));
}
}).print();
Aggregate
Aggregate functions combine a set of element values into a single value and can be applied to the entire dataset or to a grouped dataset. Aggregations can only be used on Tuple datasets, and only the Sum, Min, and Max functions are supported.
DataSource<Tuple2<String, Long>> source = env.fromElements(
        Tuple2.of("Flink", 1L),
        Tuple2.of("Flink", 1L),
        Tuple2.of("Hadoop", 1L),
        Tuple2.of("Spark", 1L),
        Tuple2.of("Flink", 1L));
source
.groupBy(0)
.aggregate(SUM, 1) // Sum the second field
.print();
Distinct
Removes duplicate elements from a DataSet.
DataSource<Tuple1<String>> source = env.fromElements(Tuple1.of("Flink"), Tuple1.of("Flink"), Tuple1.of("hadoop"));
source.distinct(0).print(); // Deduplicate by the first field of the tuple
// Result:
(Flink)
(hadoop)
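The deduplicate-by-field behavior can be sketched in plain Java. Note that Flink's distinct(0) keeps an arbitrary record per distinct key; this illustrative sketch keeps the first one seen:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctSketch {
    // Keep one record per distinct value of the chosen field (here: field 0).
    // Flink keeps an arbitrary record per key; this sketch keeps the first.
    public static List<String[]> distinctByFirstField(List<String[]> records) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] rec : records) {
            seen.putIfAbsent(rec[0], rec);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
                new String[]{"Flink"}, new String[]{"Flink"}, new String[]{"hadoop"});
        System.out.println(distinctByFirstField(data).size()); // 2
    }
}
```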
Join
By default, join produces a DataSet of Tuple2 that pairs the matching elements. The join key can be specified by key expression, key selector function, field position (for tuples), or case-class field. Two tuple datasets can be joined by field position: the field of the left dataset is specified with the where() method and the field of the right dataset with the equalTo() method. For example:
DataSource<Tuple2<Integer, String>> source1 = env.fromElements(
        Tuple2.of(1, "jack"),
        Tuple2.of(2, "tom"),
        Tuple2.of(3, "Bob"));
DataSource<Tuple2<String, Integer>> source2 = env.fromElements(
        Tuple2.of("order1", 1),
        Tuple2.of("order2", 2),
        Tuple2.of("order3", 3));
source1.join(source2).where(0).equalTo(1).print();
A custom JoinFunction can be supplied during the join. Its inputs are a pair of elements, one from the left dataset and one from the right, and it returns a single computed result. For example:
// User ID, purchased item name, purchased item quantity
DataSource<Tuple3<Integer, String, Integer>> source1 = env.fromElements(
        Tuple3.of(1, "item1", 2),
        Tuple3.of(2, "item2", 3),
        Tuple3.of(3, "item3", 4));
// Item name and unit price
DataSource<Tuple2<String, Integer>> source2 = env.fromElements(
        Tuple2.of("item1", 10),
        Tuple2.of("item2", 20),
        Tuple2.of("item3", 15));
source1.join(source2)
.where(1)
.equalTo(0)
.with(new JoinFunction<Tuple3<Integer,String,Integer>, Tuple2<String,Integer>, Tuple3<Integer,String,Double>>() {
// Total amount the user spent on each item
@Override
public Tuple3<Integer, String, Double> join(Tuple3<Integer, String, Integer> first, Tuple2<String, Integer> second) throws Exception {
return Tuple3.of(first.f0,first.f1,first.f2 * second.f1.doubleValue());
}
}).print();
To better guide Flink's runtime in processing the datasets, a Size Hint can be attached when joining DataSets, and Flink adjusts its execution strategy according to the hint. For example, joinWithTiny or joinWithHuge indicate the size of the second dataset. Example:
DataSet<Tuple2<Integer, String>> input1 = // [...]
DataSet<Tuple2<Integer, String>> input2 = // [...]
DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>>
        result1 =
        // Hint that the second dataset is a small dataset
        input1.joinWithTiny(input2)
                .where(0)
                .equalTo(0);
DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>>
        result2 =
        // Hint that the second dataset is a large dataset
        input1.joinWithHuge(input2)
                .where(0)
                .equalTo(0);
Flink's runtime can execute joins in several ways, and each way is best in different circumstances. The system tries to pick a reasonable strategy automatically, but letting the user choose one manually makes join execution more flexible and efficient.
DataSet<SomeType> input1 = // [...]
DataSet<AnotherType> input2 = // [...]
// Broadcast the first input and build a hash table from it, which is probed by the second input; suitable when the first dataset is very small
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
                .where("id").equalTo("key");
// Broadcast the second input and build a hash table from it, which is probed by the first input; suitable when the second dataset is very small
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.BROADCAST_HASH_SECOND)
                .where("id").equalTo("key");
// Repartition both datasets and build a hash table from the first; suitable when the first dataset is smaller than the second but both are relatively large
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.REPARTITION_HASH_FIRST)
                .where("id").equalTo("key");
// Repartition both datasets and build a hash table from the second; suitable when the second dataset is smaller than the first but both are relatively large
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.REPARTITION_HASH_SECOND)
                .where("id").equalTo("key");
// Repartition both datasets and sort each partition; suitable when the datasets are already sorted
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.REPARTITION_SORT_MERGE)
                .where("id").equalTo("key");
// Equivalent to not specifying a hint; the system decides by itself
DataSet<Tuple2<SomeType, AnotherType>> result =
        input1.join(input2, JoinHint.OPTIMIZER_CHOOSES)
                .where("id").equalTo("key");
OuterJoin
OuterJoin associates two datasets and comes in left, right, and full variants, corresponding to the leftOuterJoin, rightOuterJoin, and fullOuterJoin methods of the DataSet API. Note that outer joins are currently only available in the Java and Scala DataSet APIs.
The usage is almost the same as join:
// left outer join
source1.leftOuterJoin(source2).where(1).equalTo(0);
// right outer join
source1.rightOuterJoin(source2).where(1).equalTo(0);
In addition, outer joins also accept join-algorithm hints, so an appropriate optimization strategy can be chosen according to the distribution of the left and right datasets to improve processing efficiency. The hints in the following code have the same meaning as for join above.
DataSet<SomeType> input1 = // [...]
DataSet<AnotherType> input2 = // [...]
DataSet<Tuple2<SomeType, AnotherType>> result1 =
        input1.leftOuterJoin(input2, JoinHint.REPARTITION_SORT_MERGE)
                .where("id").equalTo("key");
DataSet<Tuple2<SomeType, AnotherType>> result2 =
        input1.rightOuterJoin(input2, JoinHint.BROADCAST_HASH_FIRST)
                .where("id").equalTo("key");
The algorithm hints supported by outer joins differ from join: each type of outer join supports only a subset of the hints, as follows:
- LeftOuterJoin support:
- OPTIMIZER_CHOOSES
- BROADCAST_HASH_SECOND
- REPARTITION_HASH_SECOND
- REPARTITION_SORT_MERGE
- RightOuterJoin support:
- OPTIMIZER_CHOOSES
- BROADCAST_HASH_FIRST
- REPARTITION_HASH_FIRST
- REPARTITION_SORT_MERGE
- FullOuterJoin support:
- OPTIMIZER_CHOOSES
- REPARTITION_SORT_MERGE
CoGroup
CoGroup joins two datasets on their groups. Each dataset is first grouped by key, and the groups are then passed to a user-defined CoGroupFunction, which combines the records of both datasets that share the same key. Records with the same key end up in the same group; if a key has records in only one dataset, the CoGroupFunction is called with that group and an empty group for the other side.
// User ID, purchased item name, purchased item quantity
DataSource<Tuple3<Integer, String, Integer>> source1 = env.fromElements(
        Tuple3.of(1, "item1", 2),
        Tuple3.of(2, "item2", 3),
        Tuple3.of(3, "item2", 4));
// Item name and unit price
DataSource<Tuple2<String, Integer>> source2 = env.fromElements(
        Tuple2.of("item1", 10),
        Tuple2.of("item2", 20),
        Tuple2.of("item3", 15));
source1.coGroup(source2)
.where(1)
.equalTo(0)
.with(new CoGroupFunction<Tuple3<Integer,String,Integer>, Tuple2<String,Integer>, Tuple2<String,Double>>() {
// Each Iterable stores grouped data, i.e. data with the same key grouped together
@Override
public void coGroup(Iterable<Tuple3<Integer, String, Integer>> first, Iterable<Tuple2<String, Integer>> second, Collector<Tuple2<String, Double>> out) throws Exception {
// Store the purchase quantity of each item
int sum = 0;
for(Tuple3<Integer, String, Integer> val1:first){
sum += val1.f2;
}
// Quantity of each item * unit price
for(Tuple2<String, Integer> val2:second){
out.collect(Tuple2.of(val2.f0,sum * val2.f1.doubleValue()));
}
}
}).print();
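The grouping-then-pairing behavior of coGroup, including the empty group for keys present on only one side, can be sketched in plain Java (no Flink dependency; the item/price data mirrors the example above, and all names are illustrative):

```java
import java.util.*;

public class CoGroupSketch {
    // Left side: (item name, quantity) records; right side: item name -> unit price.
    // For every key in either input, pair the left group with the right group;
    // a key missing on one side yields an empty group there, as in coGroup.
    public static Map<String, Double> totalPerItem(
            List<String[]> purchases, Map<String, Integer> prices) {
        Map<String, Integer> qtyByItem = new TreeMap<>();
        for (String[] p : purchases) {
            qtyByItem.merge(p[0], Integer.parseInt(p[1]), Integer::sum);
        }
        Set<String> allKeys = new TreeSet<>(qtyByItem.keySet());
        allKeys.addAll(prices.keySet());
        Map<String, Double> out = new TreeMap<>();
        for (String key : allKeys) {
            int qty = qtyByItem.getOrDefault(key, 0); // empty left group -> quantity 0
            Integer price = prices.get(key);          // empty right group -> nothing to emit
            if (price != null) {
                out.put(key, qty * price.doubleValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> purchases = Arrays.asList(
                new String[]{"item1", "2"}, new String[]{"item2", "3"}, new String[]{"item2", "4"});
        Map<String, Integer> prices = new HashMap<>();
        prices.put("item1", 10);
        prices.put("item2", 20);
        prices.put("item3", 15);
        System.out.println(totalPerItem(purchases, prices)); // {item1=20.0, item2=140.0, item3=0.0}
    }
}
```

Like the Flink example, "item3" has no purchases, so its left group is empty and its total is 0.0.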
Cross
Combines two datasets by building the Cartesian product of their elements: the number of result rows equals the number of rows in the first dataset multiplied by the number of rows in the second. A CrossFunction can be applied to turn each pair into a result of the target format; if no CrossFunction is given, the result is a DataSet of Tuple2. Cross is a compute-intensive operation, and it is recommended to use size hints such as crossWithTiny() and crossWithHuge().
// id, x coordinate, y coordinate
DataSet<Tuple3<Integer, Integer, Integer>> coords1 = env.fromElements(
        Tuple3.of(1, 20, 18),
        Tuple3.of(2, 15, 20),
        Tuple3.of(3, 25, 10));
DataSet<Tuple3<Integer, Integer, Integer>> coords2 = env.fromElements(
        Tuple3.of(1, 20, 18),
        Tuple3.of(2, 15, 20),
        Tuple3.of(3, 25, 10));
// Find the Euclidean distance between any two points
coords1.cross(coords2)
.with(new CrossFunction<Tuple3<Integer, Integer, Integer>, Tuple3<Integer, Integer, Integer>, Tuple3<Integer, Integer, Double>>() {
@Override
public Tuple3<Integer, Integer, Double> cross(Tuple3<Integer, Integer, Integer> val1, Tuple3<Integer, Integer, Integer> val2) throws Exception {
// Calculate Euclidean distance
double dist = sqrt(pow(val1.f1 - val2.f1, 2) + pow(val1.f2 - val2.f2, 2));
// Return the Euclidean distance between two points
return Tuple3.of(val1.f0,val2.f0,dist);
}
}).print();
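The Cartesian-product behavior of cross, together with the Euclidean-distance calculation above, can be sketched in plain Java without Flink (the data layout is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CrossSketch {
    // cross pairs every element of the first input with every element of the
    // second, so the result size is |left| * |right|.
    public static List<double[]> pairwiseDistances(int[][] left, int[][] right) {
        List<double[]> out = new ArrayList<>();
        for (int[] a : left) {
            for (int[] b : right) {
                // a[0]/b[0] are ids; a[1], a[2] and b[1], b[2] are x, y coordinates
                double dist = Math.sqrt(Math.pow(a[1] - b[1], 2) + Math.pow(a[2] - b[2], 2));
                out.add(new double[]{a[0], b[0], dist});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] coords = {{1, 20, 18}, {2, 15, 20}, {3, 25, 10}};
        List<double[]> result = pairwiseDistances(coords, coords);
        System.out.println(result.size()); // 9, i.e. 3 * 3 pairs
    }
}
```

The quadratic growth of the result is exactly why cross is compute-intensive and why the size hints matter.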
Union
Merges two DataSets; the element types of the two datasets must be the same. Multiple datasets can be unioned in a chain.
DataSet<Tuple2<String, Integer>> vals1 = env.fromElements(
        Tuple2.of("jack", 20),
        Tuple2.of("Tom", 21));
DataSet<Tuple2<String, Integer>> vals2 = env.fromElements(
        Tuple2.of("Robin", 25),
        Tuple2.of("Bob", 30));
DataSet<Tuple2<String, Integer>> vals3 = env.fromElements(
        Tuple2.of("Jasper", 24),
        Tuple2.of("jarry", 21));
DataSet<Tuple2<String, Integer>> unioned = vals1
.union(vals2)
.union(vals3);
unioned.print();
Rebalance
Evenly rebalances all data in the dataset so that each partition holds the same amount of data, which mitigates data skew. Note that only map-like operators (such as map and flatMap) can follow rebalance.
DataSet<String> in = // [...]
// Rebalance the DataSet, then apply a map operator.
DataSet<Tuple2<String, String>> out = in.rebalance()
        .map(new Mapper());
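Conceptually, rebalance redistributes records round-robin so partition sizes even out. A plain-Java sketch of that redistribution (no Flink dependency; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class RebalanceSketch {
    // Distribute records round-robin across numPartitions, as rebalance
    // conceptually does, so partition sizes differ by at most one element.
    public static List<List<String>> roundRobin(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            partitions.get(i % numPartitions).add(records.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> skewed = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            skewed.add("record-" + i);
        }
        List<List<String>> parts = roundRobin(skewed, 5);
        System.out.println(parts.get(0).size()); // 2 records per partition
    }
}
```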
Hash-Partition
Partitions the data by hash of the given key; records with the same key end up in the same partition. The key can be specified by field position, field name, or a key selector function.
DataSet<Tuple2<String, Integer>> in = // [...]
// Hash partition based on the first value, then use MapPartition conversion.
DataSet<Tuple2<String, String>> out = in.partitionByHash(0)
.mapPartition(new PartitionMapper());
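The key property of hash partitioning, namely that equal keys always land in the same partition, can be sketched in plain Java (illustrative only, not Flink's internal partitioner):

```java
public class HashPartitionSketch {
    // Assign a record to a partition by the hash of its key; records with
    // equal keys always map to the same partition index.
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("Flink", 4);
        int p2 = partitionFor("Flink", 4);
        System.out.println(p1 == p2); // true: same key, same partition
    }
}
```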
Range-Partition
Range-partitions the data on the given key; records with the same key end up in the same partition. The key can be specified by field position, field name, or a key selector function.
DataSet<Tuple2<String, Integer>> in = // [...]
// Range partition based on the first value, then MapPartition conversion operation.
DataSet<Tuple2<String, String>> out = in.partitionByRange(0)
.mapPartition(new PartitionMapper());
Custom Partitioning
In addition to the partitioning above, custom partitioning functions are supported.
DataSet<Tuple2<String, Integer>> in = // [...]
DataSet<Integer> result = in.partitionCustom(partitioner, key)
.mapPartition(new PartitionMapper());
Sort Partition
Locally sorts all partitions of the DataSet on the specified fields. The sort order is specified with Order.ASCENDING and Order.DESCENDING. Multiple fields can be used for sorting, as follows:
DataSet<Tuple2<String, Integer>> in = // [...]
// Sort ascending on the second field, then descending on the first field.
DataSet<Tuple2<String, String>> out = in.sortPartition(1, Order.ASCENDING)
.sortPartition(0, Order.DESCENDING)
.mapPartition(new PartitionMapper());
First-n
Returns the first n (arbitrary) elements of a dataset. It can be applied to an ungrouped dataset, a grouped dataset, or a grouped-and-sorted dataset.
DataSet<Tuple2<String, Integer>> in = // [...]
// Returns any 5 elements of the dataset
DataSet<Tuple2<String, Integer>> out1 = in.first(5);
// Returns any two elements within each group
DataSet<Tuple2<String, Integer>> out2 = in.groupBy(0)
.first(2);
// Return the first three elements within each group
// The grouped dataset is sorted in ascending order by the second field
DataSet<Tuple2<String, Integer>> out3 = in.groupBy(0)
.sortGroup(1, Order.ASCENDING)
.first(3);
MinBy / MaxBy
Returns, for the specified field or field combination, the record with the minimum or maximum value; if multiple records tie on that value, an arbitrary one of them is returned.
DataSet<Tuple2<String, Integer>> source = env.fromElements(
        Tuple2.of("jack", 20),
        Tuple2.of("Tom", 21),
        Tuple2.of("Robin", 25),
        Tuple2.of("Bob", 30));
// Compare the second element to find the tuple whose second element is the smallest
// Use minBy on the entire DataSet
ReduceOperator<Tuple2<String, Integer>> tuple2Reduce = source.minBy(1);
tuple2Reduce.print(); // prints (jack,20)
// You can also use minBy on the grouped DataSet
source.groupBy(0) // Group by the first field
.minBy(1) // Find the tuple in each group whose value is the minimum of the second element
.print();
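The minBy semantics can be sketched in plain Java (no Flink dependency; the data mirrors the example above, and tie-breaking by first occurrence is one possible behavior, since Flink returns an arbitrary record on ties):

```java
import java.util.Arrays;
import java.util.List;

public class MinBySketch {
    // Return the record whose chosen numeric field (index 1) is smallest;
    // Flink breaks ties arbitrarily, this sketch keeps the first minimum seen.
    public static Object[] minBySecondField(List<Object[]> records) {
        Object[] best = null;
        for (Object[] rec : records) {
            if (best == null || (int) rec[1] < (int) best[1]) {
                best = rec;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Object[]> data = Arrays.asList(
                new Object[]{"jack", 20}, new Object[]{"Tom", 21},
                new Object[]{"Robin", 25}, new Object[]{"Bob", 30});
        System.out.println(Arrays.toString(minBySecondField(data))); // [jack, 20]
    }
}
```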
Broadcast variables
The basic concept
Broadcast variables are a data-sharing mechanism commonly used in distributed computing frameworks. Their main purpose is to ship a small dataset over the network once and maintain a read-only cached copy on each machine, so that every compute instance can read the broadcast dataset directly from local memory. This avoids repeatedly fetching the small dataset from remote nodes and thus improves overall job performance.
A broadcast variable can be understood as a shared variable: the broadcast DataSet can be read by tasks on different nodes, and only one copy of the broadcast data is stored per node. Without broadcast variables, the dataset would be copied into every task on each node, wasting memory.
The basic steps for using broadcast variables are as follows:
// Step 1: create the dataset to broadcast
DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
DataSet<String> data = env.fromElements("a", "b");
data.map(new RichMapFunction<String, String>() {
    @Override
    public void open(Configuration parameters) throws Exception {
        // Step 3: access the broadcast dataset as a Collection
        Collection<Integer> broadcastSet = getRuntimeContext().getBroadcastVariable("broadcastSetName");
    }

    @Override
    public String map(String value) throws Exception { ... }
}).withBroadcastSet(toBroadcast, "broadcastSetName"); // Step 2: broadcast the dataset
As the code above shows, the DataSet API supports reading broadcast variables through the RuntimeContext of the RichFunction interface.
We first override the open() method of RichFunction, obtain the application's RuntimeContext via getRuntimeContext(), and then fetch the broadcast variable by name with getBroadcastVariable(). Flink materializes the broadcast dataset into a local Collection via an internal collect operation. Note that the element type of the Collection must match the type of the defined dataset, otherwise type-conversion problems will occur.
Notes:
- Because the contents of broadcast variables are held in memory on every node, the broadcast dataset should not be too large.
- Once initialized, a broadcast variable cannot be modified, which guarantees that all nodes see the same data.
- If multiple operators need the same dataset, the broadcast variable must be registered separately on each of those operators.
- Broadcast variables can only be used in batch processing.
Use the Demo
public class BroadcastExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        ArrayList<Tuple2<Integer, String>> rawBroadCastData = new ArrayList<>();
        rawBroadCastData.add(new Tuple2<>(1, "jack"));
        rawBroadCastData.add(new Tuple2<>(2, "tom"));
        rawBroadCastData.add(new Tuple2<>(3, "Bob"));
        // broadcast data: [userId, userName]
        DataSource<Tuple2<Integer, String>> userInfoBroadCastData = env.fromCollection(rawBroadCastData);
        ArrayList<Tuple2<Integer, Double>> rawUserAmount = new ArrayList<>();
        rawUserAmount.add(new Tuple2<>(1, 1000.00));
        rawUserAmount.add(new Tuple2<>(2, 500.20));
        rawUserAmount.add(new Tuple2<>(3, 800.50));
        // data to process: [userId, purchase amount]
        DataSet<Tuple2<Integer, Double>> userAmount = env.fromCollection(rawUserAmount);
        // Convert the broadcast DataSet elements to HashMaps
        DataSet<HashMap<Integer, String>> userInfoBroadCast = userInfoBroadCastData.map(new MapFunction<Tuple2<Integer, String>, HashMap<Integer, String>>() {
            @Override
            public HashMap<Integer, String> map(Tuple2<Integer, String> value) throws Exception {
                HashMap<Integer, String> userInfo = new HashMap<>();
                userInfo.put(value.f0, value.f1);
                return userInfo;
            }
        });
        DataSet<String> result = userAmount.map(new RichMapFunction<Tuple2<Integer, Double>, String>() {
            // Holds the list of maps returned by the broadcast variable
            List<HashMap<Integer, String>> broadCastList = new ArrayList<>();
            // Holds the merged broadcast data
            HashMap<Integer, String> allMap = new HashMap<>();

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // Fetch the broadcast data; a List is returned
                this.broadCastList = getRuntimeContext().getBroadcastVariable("userInfo");
                for (HashMap<Integer, String> value : broadCastList) {
                    allMap.putAll(value);
                }
            }

            @Override
            public String map(Tuple2<Integer, Double> value) throws Exception {
                String userName = allMap.get(value.f0);
                return "User ID: " + value.f0 + " | " + "Username: " + userName + " | " + "Purchase Amount: " + value.f1;
            }
        }).withBroadcastSet(userInfoBroadCast, "userInfo");
        result.print();
    }
}
Distributed cache
The basic concept
Flink provides a distributed cache, similar to Hadoop's, so that files can be accessed locally by the parallel instances of user functions. A file or directory (on a local or remote file system such as HDFS) is registered on the ExecutionEnvironment under an alias. When the program is executed, Flink automatically copies the registered files or directories to the local file system of every TaskManager node. A user function can then look the file up by its alias and access it on the TaskManager's local file system.
How to use the distributed cache:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Register an HDFS file
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile");
// Register a local executable file
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true);
// Access the cached data
getRuntimeContext().getDistributedCache().getFile("hdfsFile");
Retrieving a cached file is similar to reading a broadcast variable: implement the RichFunction interface, obtain the RuntimeContext object through it, and then fetch the corresponding local cached file through the interface provided by RuntimeContext.
Use the Demo
public class DistributeCacheExample {
    public static void main(String[] args) throws Exception {
        // Get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        /**
         * Register a local file; its contents are:
         * 1,"jack"
         * 2,"tom"
         * 3,"Bob"
         */
        env.registerCachedFile("file:///E://userinfo.txt", "localFileUserInfo", true);
        ArrayList<Tuple2<Integer, Double>> rawUserAmount = new ArrayList<>();
        rawUserAmount.add(new Tuple2<>(1, 1000.00));
        rawUserAmount.add(new Tuple2<>(2, 500.20));
        rawUserAmount.add(new Tuple2<>(3, 800.50));
        // data to process: [userId, purchase amount]
        DataSet<Tuple2<Integer, Double>> userAmount = env.fromCollection(rawUserAmount);
        DataSet<String> result = userAmount.map(new RichMapFunction<Tuple2<Integer, Double>, String>() {
            // Holds the cached data
            HashMap<String, String> allMap = new HashMap<String, String>();

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // Get the distributed-cache file
                File userInfoFile = getRuntimeContext().getDistributedCache().getFile("localFileUserInfo");
                List<String> userInfo = FileUtils.readLines(userInfoFile);
                for (String value : userInfo) {
                    String[] split = value.split(",");
                    allMap.put(split[0], split[1]);
                }
            }

            @Override
            public String map(Tuple2<Integer, Double> value) throws Exception {
                String userName = allMap.get(String.valueOf(value.f0));
                return "User ID: " + value.f0 + " | " + "Username: " + userName + " | " + "Purchase Amount: " + value.f1;
            }
        });
        result.print();
    }
}
Summary
This article explained the basic use of the Flink DataSet API. It first introduced a WordCount example, then covered the DataSet API's data sources and sinks and their basic usage. Each transformation was then explained in detail with a concrete example. Finally, the concepts of broadcast variables and the distributed cache were explained, together with a complete demo of each of these two advanced features.
Reply “information” to the public account “Big Data Technology and Data Warehouse” to receive a big data resource package.