Basic introduction and application scenarios
The Tunnel is the offline batch data channel provided by MaxCompute. It is designed for uploading and downloading large volumes of offline data, and works best when each batch of data is at least 64MB. For small-batch streaming data scenarios, use the DataHub real-time data channel instead for better performance and experience.
SDK upload best practices
import java.io.IOException;
import java.util.Date;
import com.aliyun.odps.Column;
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.TableSchema;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TunnelException;
import com.aliyun.odps.tunnel.TableTunnel.UploadSession;
public class UploadSample {
    private static String accessId = "<your access id>";
    private static String accessKey = "<your access Key>";
    private static String odpsUrl = "http://service.odps.aliyun.com/api";
    private static String project = "<your project>";
    private static String table = "<your table name>";
    private static String partition = "<your partition spec>";

    public static void main(String[] args) {
        Account account = new AliyunAccount(accessId, accessKey);
        Odps odps = new Odps(account);
        odps.setEndpoint(odpsUrl);
        odps.setDefaultProject(project);
        TableTunnel tunnel = new TableTunnel(odps);
        try {
            // Determine the partition to write to.
            PartitionSpec partitionSpec = new PartitionSpec(partition);
            // Create a session on the server for this table and partition. The session
            // is valid for 24 hours, during which up to 20000 blocks of data can be
            // uploaded. Creating a session is a second-level (heavy) operation, so it is
            // strongly recommended to reuse the same session for data going into the
            // same partition.
            UploadSession uploadSession = tunnel.createUploadSession(project, table, partitionSpec);
            System.out.println("Session Status is : " + uploadSession.getStatus().toString());
            TableSchema schema = uploadSession.getSchema();
            // If closing the RecordWriter succeeds, the block is uploaded successfully;
            // if it fails, the block can be uploaded again with the same blockId.
            // A session allows at most 20000 blockIds (0-19999); if you need more,
            // commit this session and create a new one, and so on.
            // Writing too little data to a single block produces many small files and
            // seriously hurts computing performance. It is strongly recommended to write
            // more than 64MB of data to a block each time (up to 100GB of data can be
            // written to a single block).
            // The approximate number of records per block can be estimated from the
            // average record size. You may use a fixed number of blocks per session
            // (for example, 100) according to your service requirements, but it is
            // recommended to use as many blocks per session as possible, because
            // creating a session is a heavy operation.
            int maxBlockID = 20000;
            for (int blockId = 0; blockId < maxBlockID; blockId++) {
                // Prepare at least 64MB of data before writing. For example:
                try {
                    // Create a writer on this block. After the writer is created, if a
                    // block receives less than 4KB of data for two consecutive minutes,
                    // the connection times out, so it is recommended to prepare the data
                    // in memory before creating the writer.
                    RecordWriter recordWriter = uploadSession.openRecordWriter(blockId);
                    // Convert all data to the Tunnel Record format and write it.
                    int recordNumber = 1000000;
                    for (int index = 0; index < recordNumber; index++) {
                        Record record = uploadSession.newRecord();
                        for (int i = 0; i < schema.getColumns().size(); i++) {
                            Column column = schema.getColumn(i);
                            switch (column.getType()) {
                                case BIGINT:
                                    record.setBigint(i, 1L);
                                    break;
                                case BOOLEAN:
                                    record.setBoolean(i, true);
                                    break;
                                case DATETIME:
                                    record.setDatetime(i, new Date());
                                    break;
                                case DOUBLE:
                                    record.setDouble(i, 0.0);
                                    break;
                                case STRING:
                                    record.setString(i, "sample");
                                    break;
                                default:
                                    throw new RuntimeException("Unknown column type: " + column.getType());
                            }
                        }
                        // Write the record to the server. A network transmission happens
                        // for every 4KB of data written; if no transmission happens for
                        // 120s, the server closes the connection and the writer becomes
                        // unavailable.
                        recordWriter.write(record);
                    }
                    // The data stays in an ODPS temporary directory and is not visible
                    // until the whole session is committed.
                    recordWriter.close();
                } catch (TunnelException e) {
                    // It is recommended to retry a certain number of times.
                    System.out.println("write failed:" + e.getMessage());
                } catch (IOException e) {
                    // It is recommended to retry a certain number of times.
                    System.out.println("write failed:" + e.getMessage());
                }
            }
            // Commit all blocks. uploadSession.getBlockList() lists the blocks to
            // commit. The data is formally written to the ODPS partition only after the
            // commit succeeds; retry up to 10 times on failure.
            for (int retry = 0; retry < 10; ++retry) {
                try {
                    // Second-level operation: formally commit the data.
                    uploadSession.commit(uploadSession.getBlockList());
                    break;
                } catch (TunnelException e) {
                    System.out.println("uploadSession commit failed:" + e.getMessage());
                } catch (IOException e) {
                    System.out.println("uploadSession commit failed:" + e.getMessage());
                }
            }
            System.out.println("upload success!");
        } catch (TunnelException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Constructor example:
PartitionSpec(String spec): constructs a PartitionSpec object from a string.
Parameters:
spec: the partition definition string, such as pt='1',ds='2'.
For example: private static String partition = "pt='XXX',ds='XXX'";
Q&A
What is the MaxCompute Tunnel?
The Tunnel is the data channel of MaxCompute. You can upload data to or download data from MaxCompute through the Tunnel. Currently, the Tunnel can only upload and download table data (views are not supported).
Can BlockId be repeated?
BlockIds within the same UploadSession must not be duplicated. In a given UploadSession, once you open a RecordWriter with a blockId, write a batch of data, call close, and then commit successfully, you cannot reuse that blockId to open another RecordWriter and write more data. BlockIds range from 0 to 19999, so a session holds at most 20000 blocks.
Is there a limit to the Block size?
The maximum size of a block is 100GB. It is strongly recommended that each block contain more than 64MB of data; blocks smaller than 64MB produce small files. The newer tunnel-sdk-bufferedWriter interface makes uploads simpler and avoids small-file problems.
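As a rough illustration (not part of the SDK), a pre-upload check like the following can estimate how many of the planned blocks would fall below the 64MB small-file threshold; the byte counts used here are hypothetical:

```java
import java.util.List;

public class BlockSizeCheck {
    static final long SMALL_FILE_THRESHOLD = 64L * 1024 * 1024; // 64MB

    // Count how many planned blocks are below the 64MB small-file threshold.
    static long countSmallBlocks(List<Long> blockSizesInBytes) {
        return blockSizesInBytes.stream()
                .filter(size -> size < SMALL_FILE_THRESHOLD)
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical planned block sizes: 128MB, 10MB, 70MB.
        List<Long> sizes = List.of(128L * 1024 * 1024, 10L * 1024 * 1024, 70L * 1024 * 1024);
        System.out.println(countSmallBlocks(sizes)); // only the 10MB block is a small file
    }
}
```

If such a check reports many small blocks, merge the data into fewer, larger blocks before uploading.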
Is the Session shareable and does it have a life cycle?
Each session has a 24-hour lifetime on the server and can be used within 24 hours of creation. It can also be shared across processes or threads, as long as the same blockId is never reused. A typical flow: create the session -> estimate the data volume -> allocate block ranges (for example, blocks 0-99 for thread 1 and 100-199 for thread 2) -> prepare the data -> upload the data -> commit all blocks that were written successfully.
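The block-allocation step above can be sketched as follows. This is a minimal illustration, not SDK API: it splits the blockId space into contiguous, disjoint ranges so that workers sharing one session never reuse a blockId.

```java
public class BlockAllocator {
    // Split [0, totalBlocks) into contiguous, disjoint ranges, one per worker.
    // Returns {start, endExclusive} pairs; each worker uses only its own range.
    static int[][] allocate(int totalBlocks, int workers) {
        int[][] ranges = new int[workers][2];
        int base = totalBlocks / workers;
        int remainder = totalBlocks % workers;
        int start = 0;
        for (int w = 0; w < workers; w++) {
            int size = base + (w < remainder ? 1 : 0);
            ranges[w][0] = start;
            ranges[w][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // For example, 200 blocks shared by 2 threads:
        // thread 0 gets [0, 100), thread 1 gets [100, 200).
        for (int[] r : allocate(200, 2)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Each thread then opens RecordWriters only with blockIds from its own range, which satisfies the no-reuse rule.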
Does it consume resources if a Session is not used after it is created?
When a session is created, two file directories are generated on the server. If many sessions are created but left unused, the number of temporary directories keeps growing and can burden the system. Avoid this behavior and share sessions whenever possible.
How to handle a Write/Read timeout or IOException?
When data is uploaded, every 8KB of data written by the Writer triggers a network action. If no network action occurs within 120 seconds, the server automatically closes the connection and the Writer becomes unavailable.
You are advised to use the tunnel-sdk-bufferedWriter interface to upload data. It hides the blockId details from the user, buffers data internally, and automatically retries on failure.
The Reader has a similar mechanism during downloads: if there is no network activity for a long time, the connection is closed. It is recommended to read continuously, without interleaving calls to other system interfaces in the middle of the read process.
What is the SDK for MaxCompute Tunnel?
MaxCompute Tunnel currently provides a Java version of the SDK.
Does the MaxCompute Tunnel support multiple clients to upload a table at the same time?
Yes, it does.
Does the MaxCompute Tunnel support batch upload or streaming upload?
The MaxCompute Tunnel is designed for batch upload, not streaming upload. For streaming, use the DataHub high-speed streaming data channel, which supports millisecond-latency writes.
Does the MaxCompute Tunnel need to have partitions before uploading data?
Yes, tunnels do not automatically create partitions.
What is the relationship between Dship and the MaxCompute Tunnel?
Dship is a tool that uses the MaxCompute Tunnel to upload and download files.
Is the Tunnel upload data appended or overwritten?
Append mode.
What about the Tunnel routing function?
The routing function means the Tunnel SDK can automatically obtain the Tunnel endpoint from the MaxCompute endpoint. Therefore, the SDK works properly with only the MaxCompute endpoint configured.
What block size is appropriate when uploading data through the MaxCompute Tunnel?
There is no absolute optimal answer. Network conditions, latency requirements, how the data will be used, and cluster small-file pressure all need to be weighed. Generally, for continuous uploads of large volumes of data, 64MB-256MB per block is suitable; for a once-a-day batch upload, a larger block of about 1GB is fine.
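For example, a back-of-the-envelope calculation (with hypothetical numbers) of how many blocks an upload needs at a chosen target block size:

```java
public class BlockCountEstimate {
    // Ceiling division: the number of blocks needed to hold totalBytes
    // at a given target block size.
    static long blocksNeeded(long totalBytes, long targetBlockBytes) {
        return (totalBytes + targetBlockBytes - 1) / targetBlockBytes;
    }

    public static void main(String[] args) {
        long total = 10L * 1024 * 1024 * 1024;  // 10GB to upload (hypothetical)
        long target = 256L * 1024 * 1024;       // 256MB per block
        System.out.println(blocksNeeded(total, target)); // 10GB / 256MB = 40 blocks
    }
}
```

Since 40 is far below the 20000-block limit, a single session comfortably covers such an upload.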
Downloads through the MaxCompute Tunnel always time out
Generally, the endpoint is incorrect. Check the endpoint configuration. You can use Telnet to check the network connectivity.
MaxCompute Tunnel download fails with: You have NO privilege 'odps:Select' on {acs:odps:*:projects/XXX/tables/XXX}. Project 'XXX' is protected
The data protection function is enabled in this project. Exporting data from this project to another project requires the owner of the project to perform the operation.
ErrorCode=FlowExceeded, ErrorMessage=Your flow quota is exceeded.
The Tunnel throttles concurrent requests. By default, the concurrency quota for uploads and downloads is 2000, and every request in flight occupies one quota unit until it finishes. If this error occurs, the recommended solutions are: 1. Sleep and retry. 2. Increase the project's Tunnel concurrency quota; contact the administrator to evaluate the traffic pressure. 3. Report to the project owner to investigate who is occupying large amounts of the concurrency quota.
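The "sleep and retry" advice can be sketched as a simple exponential backoff loop. FlowExceededException and the call interface below are hypothetical stand-ins for illustration, not actual SDK types:

```java
public class BackoffRetry {
    // Hypothetical stand-in for the FlowExceeded error from the Tunnel service.
    static class FlowExceededException extends Exception {}

    interface TunnelCall { void run() throws FlowExceededException; }

    // Retry with exponential backoff: sleep 100ms, then 200ms, 400ms, ...
    // between attempts, giving the flow quota time to free up.
    static boolean runWithBackoff(TunnelCall call, int maxAttempts) throws InterruptedException {
        long sleepMillis = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.run();
                return true;
            } catch (FlowExceededException e) {
                if (attempt == maxAttempts) return false;
                Thread.sleep(sleepMillis);
                sleepMillis *= 2;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated call that fails twice with FlowExceeded, then succeeds.
        int[] failuresLeft = {2};
        boolean ok = runWithBackoff(() -> {
            if (failuresLeft[0]-- > 0) throw new FlowExceededException();
        }, 5);
        System.out.println(ok ? "succeeded" : "gave up");
    }
}
```

In real code the call body would be the Tunnel request, and the sleep base would typically be on the order of seconds rather than milliseconds.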