1. Introduction

Before a big data job can be submitted to a cluster, the project usually has to be packaged as a JAR. This article uses Maven as the example. The common packaging methods are:

  • Run mvn package directly, without any plug-ins;
  • Use maven-assembly-plugin;
  • Use maven-shade-plugin;
  • Use maven-jar-plugin together with maven-dependency-plugin.

The following are detailed explanations.

2. mvn package

With no plug-ins configured in the POM, the project is packaged directly with mvn package. This works for projects that have no external dependencies. If the project uses third-party JARs, however, the JAR produced by mvn package does not contain them, and the job fails at run time because it cannot find the third-party dependencies. The approach is therefore limited, since real projects tend to be complex and usually rely on third-party JARs.

Developers of big data frameworks anticipated this, so almost all frameworks let you specify third-party dependencies with an option such as --jars (as in spark-submit) when submitting a job. The drawback is that every JAR in the production environment must be kept consistent with the development environment, which carries a maintenance cost.

For these reasons, the simplest option is an all-in-one package: all dependencies are bundled into a single JAR file, which minimizes dependence on the environment. Maven provides two plug-ins for this, maven-assembly-plugin and maven-shade-plugin.

3. Maven-assembly-plugin

The Assembly plug-in supports packaging all dependencies and files for a project into the same output file. Currently, the following file types are supported:

  • zip
  • tar
  • tar.gz (or tgz)
  • tar.bz2 (or tbz2)
  • tar.snappy
  • tar.xz (or txz)
  • jar
  • dir
  • war

3.1 Basic Usage

Add the plug-in to pom.xml, point it at the packaging descriptor assembly.xml (the name is customizable), and specify the main entry class of the job:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptors>
                    <descriptor>src/main/resources/assembly.xml</descriptor>
                </descriptors>
                <archive>
                    <manifest>
                        <mainClass>com.heibaiying.wordcount.ClusterWordCountApp</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>

The assembly.xml file reads as follows:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">

    <id>jar-with-dependencies</id>

    <!-- Output format of the package -->
    <formats>
        <format>jar</format>
    </formats>

    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>true</unpack>
            <scope>runtime</scope>
            <!-- Exclude storm-core, which is already available in the Storm environment -->
            <excludes>
                <exclude>org.apache.storm:storm-core</exclude>
            </excludes>
        </dependencySet>
    </dependencySets>
</assembly>

3.2 Packaging Commands

To package using maven-assembly-plugin, run the following command:

# mvn assembly:assembly

Packaging generates two JARs. The one whose name ends in jar-with-dependencies contains the third-party dependencies; the suffix is set by the <id> tag in assembly.xml.
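
A note on plug-in versions: the assembly:assembly goal was deprecated and, in the 3.x line of maven-assembly-plugin, removed in favor of assembly:single. If the command above is rejected by a newer plug-in version, a common alternative is to bind the single goal to the package phase. The following is a minimal sketch under that assumption; the execution id make-assembly is an arbitrary label, not something taken from the project above:

```xml
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptors>
            <descriptor>src/main/resources/assembly.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <!-- Arbitrary label chosen for this sketch -->
            <id>make-assembly</id>
            <!-- Run the assembly as part of the normal package phase -->
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```

With this binding, running mvn package builds the jar-with-dependencies artifact as part of the normal build.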

4. Maven-shade-plugin

Maven-shade-plugin is more powerful than maven-assembly-plugin. A project typically depends on many JARs, which in turn depend on other JARs, so the project can end up depending on different versions of the same JAR, and several JARs may contain resource files with the same name. In that case the Shade plug-in tries to merge the resource files rather than overwrite them as Assembly does.

Maven-shade-plugin usually meets most packaging requirements, with the simplest configuration and the widest applicability, so it is the recommended approach.
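
In the Shade plug-in, merging of same-named resources is carried out by resource transformers. As a minimal sketch, assuming several dependencies each ship a file named reference.conf (a hypothetical example, not taken from the project in this article), AppendingTransformer concatenates the copies instead of keeping only one of them:

```xml
<transformers>
    <!-- Concatenate every reference.conf found in the dependencies.
         The file name is an assumption made for illustration. -->
    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>reference.conf</resource>
    </transformer>
</transformers>
```

The fragment belongs inside the <configuration> of the shade goal, alongside any other transformers.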

4.1 Basic Configuration

The configuration example for maven-shade-plugin packaging is as follows:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.sf</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.dsa</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                    <exclude>META-INF/*.rsa</exclude>
                    <exclude>META-INF/*.EC</exclude>
                    <exclude>META-INF/*.ec</exclude>
                    <exclude>META-INF/MSFTSIG.SF</exclude>
                    <exclude>META-INF/MSFTSIG.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <artifactSet>
            <excludes>
                <exclude>org.apache.storm:storm-core</exclude>
            </excludes>
        </artifactSet>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer
                       implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    <transformer
                       implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

The configuration above comes from the Storm GitHub repository. It excludes some files because certain JARs are signed with jarsigner (for integrity verification), and the signature consists of two files in the META-INF directory:

  • a signature file, with a .SF extension;
  • a signature block file, with a .DSA, .RSA, or .EC extension.

If some packages are referenced more than once, these files can lead to an Invalid signature file digest for Manifest main attributes exception, so the configuration excludes them.

4.2 Packaging Commands

When using maven-shade-plugin for packaging, the packaging command is the same as normal packaging:

# mvn package

Packaging generates two JARs. Submit the one whose name does not start with original to the server cluster.
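
The ManifestResourceTransformer in the configuration above is left empty. If the shaded JAR should also record its entry class in the manifest, the transformer accepts a mainClass element. The sketch below reuses the class name from section 3.1 purely as an illustration; whether a Main-Class entry is needed depends on how the job is submitted:

```xml
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
    <!-- Written to the manifest as Main-Class; the class name is reused
         from the assembly example and is only an illustration -->
    <mainClass>com.heibaiying.wordcount.ClusterWordCountApp</mainClass>
</transformer>
```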

5. Other packaging requirements

1. Use jars in non-Maven repositories

The two approaches above are generally enough for most scenarios. However, if you want to include JARs that are not managed by Maven in the final package, for example JARs from outside any Maven repository that you placed under resources/lib, you can use maven-jar-plugin together with maven-dependency-plugin to bring them into the final JAR.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <!-- Prefix for the classpath entries: the lib directory -->
                        <classpathPrefix>lib/</classpathPrefix>
                        <!-- Main entry class of the application -->
                        <mainClass>com.heibaiying.BigDataApp</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
                <execution>
                    <id>copy</id>
                    <phase>compile</phase>
                    <goals>
                        <!-- Copy the dependency JARs -->
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <!-- Directory the JARs are copied to -->
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2. Exclude the existing jars in the cluster

In order to avoid conflicts, the official documentation usually recommends that you exclude JAR packages already provided in the cluster, as follows:

Spark official documentation, Submitting Applications:

When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.

Storm official documentation, Running Topologies on a Production Cluster:

Then run mvn assembly:assembly to get an appropriately packaged jar. Make sure you exclude the Storm jars since the cluster already has Storm on the classpath.

There are two main ways to exclude JAR packages according to the above instructions:

  • Add the <scope>provided</scope> tag to the dependencies that need to be excluded. The JAR is then left out of the package, but this is not recommended because the JAR also becomes unavailable when you run the job locally.
  • Recommended: use <exclude> directly in the maven-assembly-plugin or maven-shade-plugin configuration to exclude the JAR.
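
For comparison, the first option can be sketched as follows. The coordinates are those of storm-core used earlier in this article; the ${storm.version} property is an assumption and would have to be defined in the <properties> section of the POM:

```xml
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <!-- Hypothetical property; define it under <properties> -->
    <version>${storm.version}</version>
    <!-- Available at compile time, left out of the packaged JAR -->
    <scope>provided</scope>
</dependency>
```

The second option corresponds to the <excludes> entries already shown in assembly.xml and in the <artifactSet> of the maven-shade-plugin configuration.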

3. Package Scala files

By default, Maven does not put Scala files into the final JAR. You need to add maven-scala-plugin as follows:

<plugin>
    <groupId>org.scala-tools</groupId>
    <artifactId>maven-scala-plugin</artifactId>
    <version>2.15.1</version>
    <executions>
        <execution>
            <id>scala-compile</id>
            <goals>
                <goal>compile</goal>
            </goals>
            <configuration>
                <includes>
                    <include>**/*.scala</include>
                </includes>
            </configuration>
        </execution>
        <execution>
            <id>scala-test-compile</id>
            <goals>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>

References

For details on the configuration of Maven’s various plug-ins, see the official documentation:

  • maven-assembly-plugin: maven.apache.org/plugins/mav…
  • maven-shade-plugin: maven.apache.org/plugins/mav…
  • maven-jar-plugin: maven.apache.org/plugins/mav…
  • maven-dependency-plugin: maven.apache.org/components/…

More information on configuring maven-shade-plugin can also be found in this blog post: maven-shade-plugin Getting Started Guide