Why does PySpark fail to import a third-party Python library inside an RDD?

problem description

Hi, I was using the jieba tokenizer while running PySpark on our company's production cluster. The import succeeds, but when I call the segmentation function inside an RDD operation it reports that there is no module named jieba. None of these problems occur on my local virtual machine.

environment background and methods already tried

Tried switching to root and reinstalling jieba.
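One quick diagnostic worth running (a sketch added here, not from the original post) is to check which Python binary the executors actually use. If the root install went into a different interpreter than the one PYSPARK_PYTHON points at, the workers will never see the package:

def worker_python(_):
    # runs on an executor, so this reports the workers' interpreter
    import sys
    return sys.executable

# prints the distinct Python executables used across the executors;
# jieba must be installed into exactly these interpreters
print(sc.parallelize(range(8), 4).map(worker_python).distinct().collect())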

related code

import jieba
[x for x in jieba.cut("this is a test text")]
Building prefix dict from the default dictionary.
Loading model from cache /tmp/jieba.cache
Loading model cost 0.448 seconds.
Prefix dict has been built succesfully.
[..., u"\u6587\u672c"]
# The above is an ordinary call to jieba, and it segments the text successfully.
cut = name.map(lambda x: [y for y in jieba.cut(x)])
cut.count()
# The code above runs without error on the local virtual machine, but it fails when run online from the bastion host.
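As an aside, a pattern that often makes dependencies like this easier to reason about is importing the library inside the function that Spark ships to the executors, so the import is resolved on the worker rather than captured in the driver's closure. A minimal sketch (segment_partition is a name introduced here; it still requires jieba to be installed on every worker, as the answer below explains):

def segment_partition(rows):
    # the import runs on the executor, once per partition,
    # using the worker's own Python environment
    import jieba
    for row in rows:
        yield [token for token in jieba.cut(row)]

cut = name.mapPartitions(segment_partition)
cut.count()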

what result do you expect? What is the actual error message?

18/07/13 10:16:17 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 1.0 (TID 16, hadoop13, executor 17): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/cloudpickle.py", line 664, in subimport
    __import__(name)
ImportError: ('No module named jieba', <function subimport at 0x27a9488>, ('jieba',))

)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

jieba must be installed on every machine in the Spark cluster, not just on the driver. The driver only pickles a reference to the module; each executor re-runs the import in its own Python process, so the ImportError comes from workers whose local Python has no jieba.
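If installing it on every node (for example with pip) is not an option, a common workaround is to ship the package with the job. The sketch below assumes jieba's package directory has been zipped into a hypothetical jieba.zip; note that jieba loads its dictionary as package data, and whether that works from inside a zip depends on the jieba version, so a per-node pip install remains the more reliable route:

from pyspark import SparkContext

sc = SparkContext(appName="jieba-demo")
# ship the zipped package to every executor and put it on their PYTHONPATH;
# the path is a placeholder, and spark-submit --py-files jieba.zip works too
sc.addPyFile("/path/to/jieba.zip")

def segment(text):
    import jieba  # now resolvable on the executors from the shipped zip
    return [token for token in jieba.cut(text)]

name = sc.parallelize([u"this is a test text"])
print(name.map(segment).count())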
