How to Install Deequ Locally

Introduction

Deequ is an open-source data quality library developed by AWS for validating large datasets efficiently using Spark. It enables data engineers to define and enforce quality constraints on datasets, ensuring reliability in ETL pipelines. Installing Deequ locally is useful for testing, debugging, and developing custom quality checks before deploying them in production.

This guide covers installing Deequ with Docker, pip (Python, via PyDeequ), and Maven or Gradle (Java). Deequ has no npm package, so the Node.js section describes calling it through a backend API instead.

Installing Deequ with Docker

The easiest way to use Deequ without manual dependency management is through Docker. Since Deequ requires Apache Spark, running it in a container simplifies setup.

Steps:

Pull an image that bundles Spark and PySpark

docker pull jupyter/pyspark-notebook

Run the container and open a shell

docker run -it --rm jupyter/pyspark-notebook bash

Inside the container, start a PySpark shell and load Deequ from Maven Central

pyspark --packages com.amazon.deequ:deequ:2.0.3-spark-3.3

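The --packages flag downloads the Deequ JAR and its dependencies when the shell starts. The -spark-3.3 suffix has to match the Spark version inside the image, so check it with pyspark --version and adjust the coordinate if needed. This method gives you a clean, isolated environment without modifying your local machine.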
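Because the artifact suffix has to line up with the Spark version in the container, it can help to build the coordinate programmatically rather than type it by hand. The Python sketch below is one way to do that. The helper name deequ_coordinate is hypothetical, and the default Deequ version 2.0.3 is simply the one used in this guide; the function only formats a string, so confirm that the resulting artifact is actually published before relying on it.

```python
import re


def deequ_coordinate(spark_version: str, deequ_version: str = "2.0.3") -> str:
    """Build a Maven coordinate for Deequ from a full Spark version string.

    Hypothetical helper: it only formats the string and does not check
    that the artifact exists on Maven Central.
    """
    # Keep only the major.minor part, e.g. "3.3.2" -> ("3", "3").
    match = re.match(r"^(\d+)\.(\d+)", spark_version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {spark_version!r}")
    major, minor = match.groups()
    return f"com.amazon.deequ:deequ:{deequ_version}-spark-{major}.{minor}"


print(deequ_coordinate("3.3.2"))  # com.amazon.deequ:deequ:2.0.3-spark-3.3
```

Inside a running PySpark session, spark.version returns the version string to pass in.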

Installing Deequ for Python (pip)

Python users can interact with Deequ using PyDeequ, a wrapper around the Scala library.

Steps:

Ensure Java is installed (the commands below install only the JDK; Spark itself can come from PySpark, e.g. pip install pyspark)

sudo apt install openjdk-11-jdk # For Ubuntu 
brew install openjdk@11 # For macOS
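Before going further, it may be worth confirming that the prerequisites are visible from Python. The sketch below uses only the standard library. The function name check_prerequisites is hypothetical, and the SPARK_VERSION check rests on the assumption that you are using a recent PyDeequ release, which reads that environment variable to pick a matching Deequ JAR.

```python
import importlib.util
import os
import shutil


def check_prerequisites() -> dict:
    """Report which PyDeequ prerequisites look present on this machine."""
    return {
        # A java executable on PATH is needed to start the Spark JVM.
        "java_on_path": shutil.which("java") is not None,
        # PySpark installed via pip ships its own Spark distribution.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
        # Assumption: recent PyDeequ releases expect this variable to be set.
        "spark_version_set": "SPARK_VERSION" in os.environ,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Any entry reported as missing is a likely cause of import or startup errors later on.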

Install PyDeequ with pip

pip install pydeequ

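Verify installation (recent PyDeequ releases read the SPARK_VERSION environment variable at import time, so set it first if the import fails)

import os
os.environ.setdefault("SPARK_VERSION", "3.3")  # match your Spark version
import pydeequ
print(pydeequ.__version__)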

Installing Deequ for Java (Maven/Gradle)

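Deequ is written in Scala and runs on the JVM, so Java projects can add it as a regular dependency. Spark must also be available at runtime, in a version that matches the suffix of the Deequ artifact (Spark 3.3 for the version used here).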

Maven Installation

Add the dependency in pom.xml:

<dependency>
    <groupId>com.amazon.deequ</groupId>
    <artifactId>deequ</artifactId>
    <version>2.0.3-spark-3.3</version>
</dependency>

Update dependencies:

mvn clean install

Gradle Installation

Add the dependency in build.gradle:

dependencies {
    implementation 'com.amazon.deequ:deequ:2.0.3-spark-3.3'
}

Sync the project:

gradle build

Installing Deequ for Node.js (npm)

Deequ is a JVM-based tool, so it does not have a native Node.js package. However, you can use it with JavaScript via a backend service that exposes Deequ’s functionality over an API.

Alternative: Using Deequ in a REST API

Deploy a backend API that wraps Deequ (e.g., Spring Boot on the JVM, or Flask or FastAPI in Python with PyDeequ).

Use Node.js to call the API

fetch('http://localhost:5000/deequ-check', { method: 'POST' })
  .then(response => response.json())
  .then(data => console.log(data));
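For local testing, the backend side of this setup can be stubbed out with Python's standard library. The sketch below is not a real Deequ integration: run_quality_check is a hypothetical placeholder where a Deequ or PyDeequ verification would go, and the response fields are invented for illustration. Only the /deequ-check path and port 5000 come from the fetch call above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_quality_check() -> dict:
    """Placeholder for the real Deequ or PyDeequ call (hypothetical)."""
    return {"status": "Success", "failed_constraints": []}


class DeequCheckHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/deequ-check":
            self.send_error(404, "Unknown endpoint")
            return
        # Drain any request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = json.dumps(run_quality_check()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging for this sketch.
        pass


def make_server(port: int = 5000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer(("localhost", port), DeequCheckHandler)
```

Running make_server().serve_forever() starts the stub on port 5000, after which the Node.js snippet above should log the JSON returned by run_quality_check.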

Managing and Verifying Installation

After installation, verify that Deequ works as expected:

  • For PyDeequ: Run a basic constraint check in Spark.
  • For Java: Ensure the package is recognized in the project.
  • For Docker: Check that the container runs without errors.
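A basic PyDeequ constraint check could look like the sketch below, which follows the pattern shown in the PyDeequ README. It assumes Java, PySpark, and PyDeequ are installed and that Spark 3.3 is in use; the sample data, the check name, and the helper names failed_constraints and run_basic_check are made up for illustration. The Spark imports sit inside the function so that the small result-filtering helper stays usable on its own.

```python
import os

# Assumption: recent PyDeequ releases read SPARK_VERSION at import time;
# 3.3 matches the Deequ artifact used in this guide.
os.environ.setdefault("SPARK_VERSION", "3.3")


def failed_constraints(rows):
    """Return the rows whose constraint did not succeed.

    Each row is a mapping with a "constraint_status" key, as in the
    DataFrame produced by VerificationResult.checkResultsAsDataFrame.
    """
    return [row for row in rows if row["constraint_status"] != "Success"]


def run_basic_check():
    """Run a small PyDeequ verification on an in-memory DataFrame."""
    # Imported lazily so the helper above works without Spark installed.
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite
    from pyspark.sql import Row, SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    df = spark.createDataFrame([
        Row(id=1, name="alice"),
        Row(id=2, name="bob"),
        Row(id=3, name=None),
    ])
    check = (
        Check(spark, CheckLevel.Warning, "Install smoke test")
        .hasSize(lambda n: n >= 3)
        .isUnique("id")
        .isComplete("name")  # expected to fail: one name is missing
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
    rows = [row.asDict() for row in result_df.collect()]
    # The PyDeequ README recommends this to avoid lingering processes.
    spark.sparkContext._gateway.shutdown_callback_server()
    spark.stop()
    return failed_constraints(rows)
```

Calling run_basic_check() in a working environment is expected to report the completeness constraint on name as the only failure, which confirms that PyDeequ can reach the Deequ JAR and run a verification end to end.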

By installing Deequ locally, you can efficiently develop and test data quality constraints before deploying them in production environments.
