Introduction
Deequ is an open-source data quality library developed by AWS for validating large datasets efficiently using Spark. It enables data engineers to define and enforce quality constraints on datasets, ensuring reliability in ETL pipelines. Installing Deequ locally is useful for testing, debugging, and developing custom quality checks before deploying them in production.
This guide covers running Deequ in Docker, installing it with pip (Python, via PyDeequ) and Maven or Gradle (Java), and calling it from Node.js, for which no npm package exists.
Installing Deequ with Docker
The easiest way to use Deequ without manual dependency management is through Docker. Since Deequ requires Apache Spark, running it in a container simplifies setup.
Steps:
Pull an image that bundles Spark and PySpark
docker pull jupyter/pyspark-notebook
Start the container and open a shell in it
docker run -it --rm jupyter/pyspark-notebook bash
Inside the container, start a PySpark shell with Deequ on the classpath (the --packages flag resolves the JAR from Maven Central on first launch)
pyspark --packages com.amazon.deequ:deequ:2.0.3-spark-3.3
This method ensures a clean, isolated environment without modifying your local machine.
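The same coordinate can also be set from a Python script instead of on the command line. The following is a minimal sketch, assuming PySpark is importable inside the container; deequ_coordinate and build_session are illustrative helper names, not part of Deequ.

```python
def deequ_coordinate(deequ_version: str = "2.0.3", spark_line: str = "3.3") -> str:
    """Compose the Maven coordinate passed to --packages.

    Recent Deequ artifacts are published per Spark release line, which is
    why the version string carries a "-spark-<line>" suffix.
    """
    return f"com.amazon.deequ:deequ:{deequ_version}-spark-{spark_line}"


def build_session(coordinate: str):
    """Create a SparkSession that resolves the Deequ JAR at startup.

    Setting spark.jars.packages has the same effect as passing --packages
    to the pyspark launcher. PySpark is imported lazily so the helper above
    can be used on a machine without Spark.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("deequ-local")  # illustrative application name
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


if __name__ == "__main__":
    print(deequ_coordinate())  # com.amazon.deequ:deequ:2.0.3-spark-3.3
```

Pick the Spark suffix that matches the Spark version in the image; a mismatch between the two is a common cause of class-loading errors.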
Installing Deequ for Python (pip)
Python users can interact with Deequ using PyDeequ, a wrapper around the Scala library.
Steps:
Ensure Java is installed (Spark must also be available, for example through pip install pyspark)
sudo apt install openjdk-11-jdk # For Ubuntu
brew install openjdk@11 # For macOS
Install PyDeequ with pip
pip install pydeequ
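Verify installation (recent PyDeequ releases expect the SPARK_VERSION environment variable, e.g. 3.3, to be set before the import)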
import pydeequ
print(pydeequ.__version__)
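If the import fails because Spark is not configured yet, the installed versions can still be read from the package metadata. This is a small sketch using only the standard library (Python 3.8+); installed_version is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report both packages, since PyDeequ is of no use without PySpark.
    for name in ("pyspark", "pydeequ"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Because this reads metadata rather than importing the module, it does not start a JVM and does not depend on any environment variables.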
Installing Deequ for Java (Maven/Gradle)
Deequ is written in Scala and runs on the JVM, so Java projects can add it as a regular Maven or Gradle dependency alongside Spark.
Maven Installation
Add the dependency in pom.xml
<dependency>
<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<version>2.0.3-spark-3.3</version>
</dependency>
Update dependencies:
mvn clean install
Gradle Installation
Add the dependency in build.gradle:
dependencies {
implementation 'com.amazon.deequ:deequ:2.0.3-spark-3.3'
}
Sync the project:
gradle build
Installing Deequ for Node.js (npm)
Deequ is a JVM-based tool, so it does not have a native Node.js package. However, you can use it with JavaScript via a backend service that exposes Deequ’s functionality over an API.
Alternative: Using Deequ in a REST API
Deploy a backend API that wraps Deequ (e.g., Spring Boot on the JVM, or Flask or FastAPI on top of PyDeequ).
Use Node.js to call the API
fetch('http://localhost:5000/deequ-check', { method: 'POST' })
.then(response => response.json())
.then(data => console.log(data));
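For local experiments, the backend side of that call can be sketched with Python's standard library alone. In the example below, the /deequ-check route and port 5000 mirror the fetch call above, while run_deequ_check and the JSON response shape are hypothetical placeholders for a real Deequ or PyDeequ verification.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_deequ_check() -> dict:
    """Placeholder for the real Deequ/PyDeequ verification call.

    The response shape is hypothetical; a real service would return the
    check results produced by Deequ.
    """
    return {"status": "Success", "failed_constraints": []}


class DeequHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/deequ-check":
            self.send_error(404, "Unknown endpoint")
            return
        # Drain the request body (ignored in this sketch).
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        body = json.dumps(run_deequ_check()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "127.0.0.1", port: int = 5000) -> HTTPServer:
    """Build the HTTP server; pass port=0 to let the OS pick a free port."""
    return HTTPServer((host, port), DeequHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Depending on the Node.js version, localhost may resolve to an IPv6 address first; if the fetch call cannot connect, calling http://127.0.0.1:5000 directly may help.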
Managing and Verifying Installation
After installation, verify that Deequ works as expected:
- For PyDeequ: Run a basic constraint check in Spark.
- For Java: Confirm that the build resolves the dependency and that classes under com.amazon.deequ can be imported.
- For Docker: Check that the container runs without errors.
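A basic PyDeequ constraint check might look like the sketch below. It assumes Java, PySpark, and PyDeequ are installed and that Spark 3.3 is the target; the sample data, check name, and run_basic_check function are illustrative.

```python
import os

# Assumption: Spark 3.3, matching the Deequ coordinate used in this guide.
# Recent PyDeequ releases read this variable when the package is imported.
os.environ.setdefault("SPARK_VERSION", "3.3")


def run_basic_check():
    """Run a small verification and return (spark, results DataFrame)."""
    # Imported lazily so this module loads even where Spark is missing.
    import pydeequ
    from pyspark.sql import Row, SparkSession
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )

    # Illustrative data: three rows, one missing name.
    df = spark.createDataFrame([
        Row(id=1, name="a"),
        Row(id=2, name="b"),
        Row(id=3, name=None),
    ])

    check = (
        Check(spark, CheckLevel.Warning, "basic install check")
        .hasSize(lambda n: n >= 3)  # at least three rows
        .isComplete("id")           # no nulls in id
        .isUnique("id")             # no duplicate ids
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return spark, VerificationResult.checkResultsAsDataFrame(spark, result)


if __name__ == "__main__":
    spark, results = run_basic_check()
    results.show(truncate=False)
    # Stop the py4j callback server PyDeequ uses, so the process can exit.
    spark.sparkContext._gateway.shutdown_callback_server()
    spark.stop()
```

If the results table lists each constraint with a status, the JVM, Spark, and the Deequ JAR are all wired up correctly.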
By installing Deequ locally, you can efficiently develop and test data quality constraints before deploying them in production environments.