This article is posted on Nuggets (Juejin) by WX shin-Devops.
The configuration process
- Install pyspark
- Configure mysql-connector.jar
- Create a connection
- Read the data
Install PySpark
Create a new project locally and execute pip install pyspark==3.0 to install PySpark.
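To confirm the install worked, a quick sanity check (an addition here, not a step from the original guide) is to print the installed version:

# Sanity check: the printed version should start with 3.0
import pyspark
print(pyspark.__version__)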
mysql-connector configuration
Download
Go to https://dev.mysql.com/downloads/connector/j/ and download the Platform Independent package for the appropriate version:
Connector/J version | JDBC version | MySQL Server version | JRE Required | JDK Required for Compilation | Status
---|---|---|---|---|---
5.1 | 3.0, 4.0, 4.1, 4.2 | 5.6, 5.7, 8.0 | JRE 5 or higher | JDK 5.0 and JDK 8.0 or higher | General availability
8.0 | 4.2 | 5.6, 5.7, 8.0 | JRE 8 or higher | JDK 8.0 or higher | General availability (recommended)
See the MySQL documentation for the full version compatibility table.
For example, decompressing mysql-connector-java-8.0.19.tar.gz yields mysql-connector-java-8.0.19.jar.
Move the jar to the SPARK_HOME path
If Spark was installed some other way, run echo $SPARK_HOME on the local machine to see the installation path. If PySpark was installed directly with pip install pyspark==3.0, however, $SPARK_HOME is empty, so the step "copy mysql-connector.jar into the $SPARK_HOME/jars folder" found in other configuration guides on the web cannot be carried out, and connecting fails with:
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
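This error means the JVM cannot find the MySQL JDBC driver class on Spark's classpath. As an aside (a sketch, not the fix this article uses), Spark can also be handed the jar directly through the spark.jars property when the context is created; the jar path below is a placeholder:

# Alternative sketch: point Spark at the driver jar explicitly instead of copying it.
# The jar path is a placeholder; adjust it to where the connector was unpacked.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set('spark.jars', '/path/to/mysql-connector-java-8.0.19.jar')
sc = SparkContext(master='local', appName='sql', conf=conf)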
The fix used in this article is to find $SPARK_HOME with the _find_spark_home method in the PySpark source:
>>> from pyspark import find_spark_home
>>> print(find_spark_home._find_spark_home())
/home/ityoung/test-spark/venv/lib/python3.6/site-packages/pyspark
Then set $SPARK_HOME to that path and copy mysql-connector.jar into $SPARK_HOME/jars:
export SPARK_HOME=/home/ityoung/test-spark/venv/lib/python3.6/site-packages/pyspark
mv mysql-connector-java-8.0.19.jar $SPARK_HOME/jars
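The same two steps can also be scripted from Python; this is a minimal sketch, assuming the jar sits in the current working directory and the site-packages path is writable:

# Sketch: locate pyspark's bundled SPARK_HOME and copy the driver jar into its jars dir.
# Assumes mysql-connector-java-8.0.19.jar is in the current working directory.
import os
import shutil
from pyspark.find_spark_home import _find_spark_home

spark_home = _find_spark_home()
shutil.copy('mysql-connector-java-8.0.19.jar', os.path.join(spark_home, 'jars'))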
Spark code example
Reference: zhuanlan.zhihu.com/p/136777424 (please indicate the source when reproducing)
main.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == '__main__':
    # Spark initialization
    sc = SparkContext(master='local', appName='sql')
    spark = SQLContext(sc)
    # MySQL connection configuration (need to change)
    prop = {'user': 'xxx', 'password': 'xxx', 'driver': 'com.mysql.cj.jdbc.Driver'}
    # Database address (need to change)
    url = 'jdbc:mysql://host:port/database'
    # Read the table
    data = spark.read.jdbc(url=url, table='tb_test', properties=prop)
    # Print the data type
    print(type(data))
    # Display the data
    data.show()
    # Stop Spark (SQLContext has no stop(); stop the underlying SparkContext)
    sc.stop()
Modify the configuration in the code and run it to see the data output:
python main.py
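Once the table reads successfully, the DataFrame can also be queried with SQL. The following short sketch (an addition beyond the original example; it reuses spark and data from main.py and assumes the table loaded) registers the DataFrame as a temporary view first:

# Follow-up sketch: register the DataFrame as a temp view and query it with SQL
data.createOrReplaceTempView('tb_test')
spark.sql('SELECT COUNT(*) AS cnt FROM tb_test').show()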