Feedback
- Author: Leo
- WeChat: Leo-sunhailin
- E-mail: [email protected]
- Contents
- Project environment
- Download methods
- A small problem
- Solution and process
- Code sample
- Feedback
Project environment
- For environment deployment, see the deployment procedure
- Environment details:
- Spark version: Apache Spark 2.2.0 with Hadoop 2.7
- MongoDB version: 3.4.9 (with OpenSSL)
- JDK version: 1.8
- Python version: 3.4.4
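To sanity-check the environment before going further, the versions can be confirmed from Python; a quick sketch (assuming the pyspark package is already importable):

```python
# Sketch: confirm the interpreter and PySpark versions match the list above.
import sys

import pyspark

print(sys.version)          # expect a 3.4.x interpreter
print(pyspark.__version__)  # expect 2.2.0
```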
Download methods
- 1. Official MongoDB-Spark Connector
```
# Method 1
spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0

# Method 2
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
```
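The same coordinates can also be set from inside a Python script through Spark's `spark.jars.packages` option; a minimal sketch (the app name is arbitrary, and the option only takes effect if it is set before the underlying SparkContext starts):

```python
# Sketch: resolve the official connector at session start via spark.jars.packages.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("mongoConnectorDemo") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .getOrCreate()
```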
- 2. Third-party Connector
- The Stratio spark-mongodb project, which has a few pitfalls (project link)
- The project sponsor's official website cannot be opened, and the corresponding version cannot be found there
- 0.12.x
- The command is as follows:
```
spark-shell --packages com.stratio.datasource:spark-mongodb_2.11:0.13.0
```
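For reference, reading through the Stratio connector looks roughly like this; the `host`/`database`/`collection` option names follow that project's documented Python API, so treat this as a hedged sketch rather than a tested recipe:

```python
# Sketch: read a collection through the Stratio spark-mongodb datasource.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stratioDemo").getOrCreate()

df = spark.read.format("com.stratio.datasource.mongodb") \
    .options(host="127.0.0.1:27017",
             database="<db name>",
             collection="<collection name>") \
    .load()
df.show()
```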
A small problem:
- Problem: the company's network is terrible and cannot reach the Internet, so the --packages downloads above fail.
Solution and process:
- Solution: use TeamViewer to remote into a home computer and run the spark-shell command there to fetch the official MongoDB connector jars.
- Alternatively, download the jar packages from the link below and skip straight to step 4.
- Jar package download link (password: QKKP)
- 1. When the --packages command runs, Maven/Ivy resolves and downloads the artifacts automatically (no manual compilation is needed), producing two jar packages
- The default path is C:/Users/<user name>/.ivy2/
- 2. If the download succeeded, the .ivy2 directory contains two folders: cache and jars
- 3. Open the jars folder and you will see two jars:
- org.mongodb.spark_mongo-spark-connector_2.11-2.2.0.jar
- org.mongodb_mongo-java-driver-3.4.2.jar
- 4. Copy the two jars into the jars folder under the Spark root directory (a small helper sketch follows this list).
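Steps 2-4 can be automated with a few lines of Python; this is a convenience sketch that assumes the default Windows paths above and that SPARK_HOME is set:

```python
# Sketch: verify the downloaded jars and copy them into Spark's jars folder.
import os
import shutil

ivy_jars = os.path.expanduser("~/.ivy2/jars")                # step 1 output
spark_jars = os.path.join(os.environ["SPARK_HOME"], "jars")  # step 4 target

for name in os.listdir(ivy_jars):
    if "mongo" in name:  # the connector and driver jars listed in step 3
        print("copying", name)
        shutil.copy(os.path.join(ivy_jars, name), spark_jars)
```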
Code sample:
```python
# -*- coding: UTF-8 -*-
"""
Created on Oct 24, 2017

@Author: Leo
"""
import os

from pyspark.sql import SparkSession

os.environ['SPARK_HOME'] = "Your Spark root directory"
os.environ['HADOOP_HOME'] = "Your Hadoop root directory"


class PySparkMongoDB:
    def __init__(self):
        # URI configuration
        # Format: mongodb://<MongoDB address>:<port>/<db name>.<collection name>
        # The default port is 27017
        self.uri_conf = "mongodb://127.0.0.1/<db name>.<collection name>"

        # Connect to MongoDB (the connection is held by the SparkSession)
        self.my_spark = SparkSession \
            .builder \
            .appName("myApp") \
            .config("spark.mongodb.input.uri", self.uri_conf) \
            .config("spark.mongodb.output.uri", self.uri_conf) \
            .getOrCreate()

    def read_mgo(self):
        # Load the collection configured in spark.mongodb.input.uri as a DataFrame
        df = self.my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
        df.show()


if __name__ == '__main__':
    mgo = PySparkMongoDB()
    mgo.read_mgo()
```
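Because the session also sets `spark.mongodb.output.uri`, writing a DataFrame back is symmetric. A minimal sketch against the sample above (the data and the append mode are illustrative only):

```python
# Sketch: write a DataFrame to the collection configured in spark.mongodb.output.uri.
people = mgo.my_spark.createDataFrame(
    [("Leo", 25), ("Neo", 30)], ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()
```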