  • Contents
    • Project environment
    • How to download
    • A small problem
    • Solution and process
    • Code sample
    • Problem feedback

Project environment

  • For environment deployment, see the deployment procedure
  • Environment details:
    • Spark version: Apache Spark 2.2.0 (pre-built for Hadoop 2.7)
    • MongoDB version: 3.4.9 (with OpenSSL)
    • JDK version: 1.8
    • Python version: 3.4.4
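
To confirm the environment matches the versions above, a quick sanity check can be run from Python (a minimal sketch; it assumes pyspark is already importable):

    import sys
    from pyspark.sql import SparkSession

    # Start a throwaway session just to read the version numbers
    spark = SparkSession.builder.appName("versionCheck").getOrCreate()
    print(sys.version)    # expect 3.4.x
    print(spark.version)  # expect 2.2.0
    spark.stop()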

How to download

  • 1. Official MongoDB-Spark Connector (see the usage sketch after this list)

    # The first way: via spark-shell
    spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
    # The second way: via pyspark
    pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
  • 2. Third-party Connector (Stratio spark-mongodb)
    • A somewhat problematic project: project link
    • The project sponsor's official website cannot be opened, so the matching version cannot be found
    • Published versions: 0.12.x
    • The command is as follows:

    spark-shell --packages com.stratio.datasource:spark-mongodb_2.11:0.13.0
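
For reference, this is how a collection would be read with each connector once pyspark has been launched with the corresponding --packages command above. A hedged sketch: the host, database, and collection names are placeholders, and the option names follow each connector's documented API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("connectorDemo").getOrCreate()

    # 1. Official connector: configured through a MongoDB uri
    df_official = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", "mongodb://127.0.0.1/<db name>.<collection name>") \
        .load()

    # 2. Stratio connector: configured through host/database/collection options
    df_stratio = spark.read.format("com.stratio.datasource.mongodb") \
        .option("host", "127.0.0.1:27017") \
        .option("database", "<db name>") \
        .option("collection", "<collection name>") \
        .load()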

A small problem:

  • Problem: the company network is terrible and cannot reach the Internet, so the packages above cannot be downloaded at the office.

Solution and process:

  • Solution: use TeamViewer to connect to a computer at home and download the official MongoDB connector jars there with spark-shell.
    • Alternatively, download the jar packages from the link below and skip straight to step 4
    • Jar package download link, password: QKKP
  • 1. After the dependencies are resolved (this happens automatically during the download; no manual compilation is needed), two jar packages are generated
    • The default path is C:/Users/<user name>/.ivy2/
  • 2. If the download went correctly, there are two folders under .ivy2: cache and jars
  • 3. Open the jars folder and you will see the two jars:
    • org.mongodb.spark_mongo-spark-connector_2.11-2.2.0.jar
    • org.mongodb_mongo-java-driver-3.4.2.jar
  • 4. Copy the two jars to the jars folder in the Spark root directory (or hand them to Spark directly, as in the sketch below).
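
If copying into the Spark root directory is not convenient, the downloaded jars can also be passed to Spark at session-creation time through the standard spark.jars setting. A sketch, assuming the default .ivy2 paths from step 1:

    from pyspark.sql import SparkSession

    # Comma-separated list of local jars to put on the driver/executor classpath
    ivy_jars = ",".join([
        "C:/Users/<user name>/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.0.jar",
        "C:/Users/<user name>/.ivy2/jars/org.mongodb_mongo-java-driver-3.4.2.jar",
    ])

    spark = SparkSession.builder \
        .appName("jarTest") \
        .config("spark.jars", ivy_jars) \
        .getOrCreate()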

Code sample:

    # -*- coding: UTF-8 -*-
""" Created on Oct 24, 2017 @Author: Leo """

import os
from pyspark.sql import SparkSession

os.environ['SPARK_HOME'] = "Your Spark root directory"
os.environ['HADOOP_HOME'] = "Your Hadoop root directory"


class PySparkMongoDB:
    def __init__(self):
        # This is the uri configuration
        # mongodb://<MongoDB address>:<port>/
        # The default port is 27017
        self.uri_conf = "mongodb://127.0.0.1/<db name>.<collection name>"
        
        # Connect MongoDB(maintain connections via SparkSession)
        self.my_spark = SparkSession \
            .builder \
            .appName("myApp") \
            .config("spark.mongodb.input.uri", self.uri_conf) \
            .config("spark.mongodb.output.uri", self.uri_conf) \
            .getOrCreate()

    def read_mgo(self):
        df = self.my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
        df.show()


if __name__ == '__main__':
    mgo = PySparkMongoDB()
    mgo.read_mgo()
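
The class above only reads. Writing back uses the same DefaultSource together with the spark.mongodb.output.uri already set in the session config. A sketch of a companion method (write_mgo and the sample rows are illustrative additions, not part of the original class):

    # Inside class PySparkMongoDB (illustrative companion method)
    def write_mgo(self):
        # Build a tiny DataFrame and append it to the collection
        # named in spark.mongodb.output.uri
        df = self.my_spark.createDataFrame([("Alice", 1), ("Bob", 2)],
                                           ["name", "value"])
        df.write.format("com.mongodb.spark.sql.DefaultSource") \
            .mode("append") \
            .save()

Calling mgo.write_mgo() before mgo.read_mgo() gives an immediate way to verify that the round trip works.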