Apache Spark

The Spark integration adds support for PySpark, the Python API for Apache Spark.

This integration is experimental and in an alpha state. The integration API may experience breaking changes in future minor versions.

The Spark driver integration is supported for Spark 2 and above.

To configure the SDK, initialize it with the integration before you create a SparkContext or SparkSession.

import sentry_sdk
from pyspark.sql import SparkSession
from sentry_sdk.integrations.spark import SparkIntegration

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        enable_tracing=True,
        integrations=[
            SparkIntegration(),
        ],
    )

    spark = SparkSession\
        .builder\
        .appName("ExampleApp")\
        .getOrCreate()
    ...

The Spark worker integration is supported for Spark versions 2.4.x and 3.1.x.

Create a file called sentry_daemon.py with the following content:

sentry_daemon.py
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == '__main__':
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        enable_tracing=True,
        integrations=[
            SparkWorkerIntegration(),
        ],
    )

    original_daemon.manager()
    ...

In your spark-submit command, add the following configuration options so that the Spark cluster can use the Sentry integration.

Command Line Options	Parameter	Usage
--py-files	sentry_daemon.py	Sends the `sentry_daemon.py` file to your Spark clusters
--conf	spark.python.use.daemon=true	Configures Spark to use a daemon to execute its Python workers
--conf	spark.python.daemon.module=sentry_daemon	Configures Spark to use the Sentry custom daemon

./bin/spark-submit \
    --py-files sentry_daemon.py \
    --conf spark.python.use.daemon=true \
    --conf spark.python.daemon.module=sentry_daemon \
    example-spark-job.py
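
If the job is launched from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch, not part of the SDK: the helper name build_spark_submit_args is hypothetical, and it assumes the daemon file keeps the sentry_daemon.py name used above.

```python
import os
import shlex


def build_spark_submit_args(job_file, daemon_file="sentry_daemon.py",
                            spark_submit="./bin/spark-submit"):
    """Return the spark-submit argument list for a job that uses the Sentry daemon.

    Hypothetical helper for illustration; it only mirrors the options in the
    table above.
    """
    # Spark loads the daemon by module name, so drop the directory and ".py".
    daemon_module = os.path.splitext(os.path.basename(daemon_file))[0]
    return [
        spark_submit,
        "--py-files", daemon_file,
        "--conf", "spark.python.use.daemon=true",
        "--conf", f"spark.python.daemon.module={daemon_module}",
        job_file,
    ]


if __name__ == "__main__":
    # Print a shell-ready command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_spark_submit_args("example-spark-job.py")))
```

Deriving the module name from the file name keeps the --py-files and spark.python.daemon.module values from drifting apart.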

You must have the Sentry Python SDK installed on all your clusters to use the Spark integration. The easiest way to do this is to run an initialization script on all your clusters:

easy_install pip
pip install --upgrade sentry-sdk
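
To confirm that the installation succeeded on a given node, a short check can be run with the same Python interpreter that Spark uses. This is a sketch under that assumption; the helper name is_module_available is hypothetical and uses only the standard library.

```python
import importlib.util
import sys


def is_module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # sentry_sdk is the import name of the sentry-sdk package.
    if not is_module_available("sentry_sdk"):
        sys.exit("sentry_sdk is not installed for this interpreter")
    print("sentry_sdk is installed")
```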

To access certain tags (app_name, application_id), the worker integration requires the driver integration to also be active.
The worker integration only works on UNIX-based systems due to the daemon process using signals for child management.
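
A launcher script that may run on several platforms can apply the UNIX restriction by adding the daemon properties only where they are supported. The sketch below is illustrative: the function name sentry_worker_conf is hypothetical, and it treats os.name == "posix" as the test for a UNIX-based system.

```python
import os


def sentry_worker_conf(os_name=None):
    """Return the Spark properties for the Sentry daemon, or {} off UNIX."""
    if os_name is None:
        os_name = os.name
    # os.name is "posix" on Linux, macOS, and other UNIX-like systems.
    if os_name != "posix":
        return {}
    return {
        "spark.python.use.daemon": "true",
        "spark.python.daemon.module": "sentry_daemon",
    }


if __name__ == "__main__":
    # Each entry corresponds to one "--conf key=value" option for spark-submit.
    for key, value in sentry_worker_conf().items():
        print(f"--conf {key}={value}")
```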

This integration can be set up for Google Cloud Dataproc. We recommend Cloud Dataproc image version 1.4 (Spark 2.4) or 2.0 (Spark 3.1), because the worker integration requires one of those Spark versions.

Set up an Initialization action to install the sentry-sdk on your Dataproc cluster.
Add the driver integration to the main Python file that you submit in the job submit screen.
Add sentry_daemon.py under Additional python files in the job submit screen. You must first upload the daemon file to a bucket to access it.
Add the configuration properties listed above, spark.python.use.daemon=true and spark.python.daemon.module=sentry_daemon, in the job submit screen.
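
The steps above use the job submit screen. For scripted submissions, a roughly equivalent command can be built for the gcloud CLI, which this page does not otherwise cover. The sketch below assumes the gcloud dataproc jobs submit pyspark command with its --cluster, --region, --py-files, and --properties flags; the helper name and the bucket, cluster, and region values are hypothetical placeholders.

```python
import shlex


def build_dataproc_submit_args(main_file, daemon_uri, cluster, region):
    """Return a gcloud argument list that submits a PySpark job with the Sentry daemon.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    properties = {
        "spark.python.use.daemon": "true",
        "spark.python.daemon.module": "sentry_daemon",
    }
    # --properties takes a comma-separated list of key=value pairs.
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--py-files={daemon_uri}",
        f"--properties={joined}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own bucket, cluster, and region.
    args = build_dataproc_submit_args(
        "gs://example-bucket/example-spark-job.py",
        "gs://example-bucket/sentry_daemon.py",
        cluster="example-cluster",
        region="us-central1",
    )
    print(shlex.join(args))
```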


Package:: pypi:sentry-sdk
Version:: 2.0.0
Repository:: https://github.com/getsentry/sentry-python
API Documentation:: https://getsentry.github.io/sentry-python/
