  • Contents
    • Project environment
    • How to download
    • A small problem
    • Solution and process
    • Code sample
    • Problem feedback

Project environment

  • For environment deployment, see the deployment procedure
  • Environment details:
    • Spark version: Apache Spark 2.2.0 (pre-built for Hadoop 2.7)
    • MongoDB version: 3.4.9 (with OpenSSL)
    • JDK version: 1.8
    • Python version: 3.4.4
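
To confirm the environment matches the versions above, a quick sanity check can be run from Python (a minimal sketch; it assumes pyspark is already importable):

    import sys
    from pyspark.sql import SparkSession

    # Start a throwaway session just to read the version numbers
    spark = SparkSession.builder.appName("versionCheck").getOrCreate()
    print(sys.version)    # expect 3.4.x
    print(spark.version)  # expect 2.2.0
    spark.stop()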

How to download

  • 1. Official MongoDB-Spark Connector (see the usage sketch after this list)

    # The first way: via spark-shell
    spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
    # The second way: via pyspark
    pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
  • 2. Third-party Connector (Stratio spark-mongodb)
    • A somewhat problematic project: project link
    • The project sponsor's official website cannot be opened, so the matching version cannot be found
    • Published versions: 0.12.x
    • The command is as follows:

    spark-shell --packages com.stratio.datasource:spark-mongodb_2.11:0.13.0
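
For reference, this is how a collection would be read with each connector once pyspark has been launched with the corresponding --packages command above. A hedged sketch: the host, database, and collection names are placeholders, and the option names follow each connector's documented API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("connectorDemo").getOrCreate()

    # 1. Official connector: configured through a MongoDB uri
    df_official = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", "mongodb://127.0.0.1/<db name>.<collection name>") \
        .load()

    # 2. Stratio connector: configured through host/database/collection options
    df_stratio = spark.read.format("com.stratio.datasource.mongodb") \
        .option("host", "127.0.0.1:27017") \
        .option("database", "<db name>") \
        .option("collection", "<collection name>") \
        .load()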

A small problem:

  • Problem: the company network is terrible and cannot reach the Internet, so the packages above cannot be downloaded at the office.

Solution and process:

  • Solution: use TeamViewer to connect to a computer at home and download the official MongoDB connector jars there with spark-shell.
    • Alternatively, download the jar packages from the link below and skip straight to step 4
    • Jar package download link, password: QKKP
  • 1. After the dependencies are resolved (this happens automatically during the download; no manual compilation is needed), two jar packages are generated
    • The default path is C:/Users/<user name>/.ivy2/
  • 2. If the download went correctly, there are two folders under .ivy2: cache and jars
  • 3. Open the jars folder and you will see the two jars:
    • org.mongodb.spark_mongo-spark-connector_2.11-2.2.0.jar
    • org.mongodb_mongo-java-driver-3.4.2.jar
  • 4. Copy the two jars to the jars folder in the Spark root directory (or hand them to Spark directly, as in the sketch below).
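
If copying into the Spark root directory is not convenient, the downloaded jars can also be passed to Spark at session-creation time through the standard spark.jars setting. A sketch, assuming the default .ivy2 paths from step 1:

    from pyspark.sql import SparkSession

    # Comma-separated list of local jars to put on the driver/executor classpath
    ivy_jars = ",".join([
        "C:/Users/<user name>/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.0.jar",
        "C:/Users/<user name>/.ivy2/jars/org.mongodb_mongo-java-driver-3.4.2.jar",
    ])

    spark = SparkSession.builder \
        .appName("jarTest") \
        .config("spark.jars", ivy_jars) \
        .getOrCreate()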

Code sample:

    # -*- coding: UTF-8 -*-
""" Created on Oct 24, 2017 @Author: Leo """

import os
from pyspark.sql import SparkSession

os.environ['SPARK_HOME'] = "Your Spark root directory"
os.environ['HADOOP_HOME'] = "Your Hadoop root directory"


class PySparkMongoDB:
    def __init__(self):
        # This is the uri configuration
        # mongodb://<MongoDB address>:<port>/
        # The default port is 27017
        self.uri_conf = "mongodb://127.0.0.1/<db name>.<collection name>"
        
        # Connect MongoDB(maintain connections via SparkSession)
        self.my_spark = SparkSession \
            .builder \
            .appName("myApp") \
            .config("spark.mongodb.input.uri", self.uri_conf) \
            .config("spark.mongodb.output.uri", self.uri_conf) \
            .getOrCreate()

    def read_mgo(self):
        df = self.my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
        df.show()


if __name__ == '__main__':
    mgo = PySparkMongoDB()
    mgo.read_mgo()
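
The class above only reads. Writing back uses the same DefaultSource together with the spark.mongodb.output.uri already set in the session config. A sketch of a companion method (write_mgo and the sample rows are illustrative additions, not part of the original class):

    # Inside class PySparkMongoDB (illustrative companion method)
    def write_mgo(self):
        # Build a tiny DataFrame and append it to the collection
        # named in spark.mongodb.output.uri
        df = self.my_spark.createDataFrame([("Alice", 1), ("Bob", 2)],
                                           ["name", "value"])
        df.write.format("com.mongodb.spark.sql.DefaultSource") \
            .mode("append") \
            .save()

Calling mgo.write_mgo() before mgo.read_mgo() gives an immediate way to verify that the round trip works.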