Command Line Interface

Pointblank provides a powerful command-line interface (CLI) that allows you to perform data validation directly from the terminal without writing Python code. This is especially useful for:

quick data quality checks in CI/CD pipelines
exploratory data analysis
integration with shell scripts and automation workflows
non-Python users who want to leverage Pointblank’s validation capabilities

Installation

The CLI is automatically available when you install Pointblank:

pip install pointblank

Terminal Demos

Check out our CLI terminal demos to see the CLI in action with comprehensive command examples and real output.

Basic Usage

The main CLI command is pb validate, which performs single-step validations:

pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS]

Data Sources

You can validate various types of data sources:

CSV files

pb validate data.csv --check rows-distinct

Parquet files (including glob patterns and directories)

pb validate data.parquet --check col-vals-not-null --column age
pb validate "data/*.parquet" --check rows-distinct
pb validate data/ --check rows-complete  # directory of parquet files

GitHub URLs (direct links to CSV or Parquet files)

pb validate "https://github.com/user/repo/blob/main/data.csv" --check rows-distinct
pb validate "https://raw.githubusercontent.com/user/repo/main/data.parquet" --check col-exists --column id

Database tables (connection strings)

pb validate "duckdb:///path/to/db.ddb::table_name" --check rows-complete

Built-in datasets

pb validate small_table --check col-exists --column a

Enhanced Data Source Support

The CLI leverages Pointblank’s centralized data processing pipeline, providing comprehensive support for various data sources:

GitHub Integration

Validate data directly from GitHub repositories without downloading files:

# Standard GitHub URLs (automatically converted to raw URLs)
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb validate "https://github.com/user/repo/blob/main/sales.csv" --check rows-distinct

# Raw GitHub URLs (used directly)
pb scan "https://raw.githubusercontent.com/user/repo/main/data.parquet"

Advanced File Patterns

Support for complex file patterns and directory structures:

# Glob patterns for multiple files
pb validate "data/*.parquet" --check col-vals-not-null --column id
pb preview "sales_data_*.csv"

# Entire directories of Parquet files
pb scan data/partitioned_dataset/
pb missing warehouse/daily_reports/

# Partitioned datasets (automatically detects partition columns)
pb validate partitioned_sales/ --check rows-distinct

Database Connections

Enhanced support for database connection strings:

# DuckDB databases with table specification
pb validate "duckdb:///warehouse/analytics.ddb::customer_metrics" --check col-exists --column customer_id

# Preview database tables
pb preview "duckdb:///data/sales.ddb::transactions"

Automatic Data Type Detection

The CLI automatically detects and handles:

CSV files: single files or glob patterns
Parquet files: files, patterns, directories, and partitioned datasets
GitHub URLs: both standard and raw URLs for CSV/Parquet files
database connections: connection strings with table specifications
built-in datasets: Pointblank’s included sample datasets

This unified approach means you can use the same CLI commands regardless of where your data is stored.

Available Validation Checks

Data Completeness

rows-distinct: Check if all rows are unique (no duplicates)
rows-complete: Check if all rows are complete (no missing values)
col-vals-not-null: Check if a column has no Null/None/NA values

Column Existence

col-exists: Verify that a specific column exists in the dataset

Numeric Value Checks Within Columns

col-vals-gt: column values greater than a fixed value
col-vals-ge: column values greater than or equal to a fixed value
col-vals-lt: column values less than a fixed value
col-vals-le: column values less than or equal to a fixed value

Value Set Checks Within Columns

col-vals-in-set: column values must be in an defined set

Examples

Basic Validation

# Check for duplicate rows
pb validate data.csv --check rows-distinct

# Check if all values in 'age' column are not null
pb validate data.csv --check col-vals-not-null --column age

# Check if all values in 'score' are greater than 50
pb validate data.csv --check col-vals-gt --column score --value 50

Range Validations

# Check if all ages are less than 100
pb validate data.csv --check col-vals-lt --column age --value 100

# Check if all prices are less than or equal to 1000
pb validate data.csv --check col-vals-le --column price --value 1000

# Check if all ratings are between 1 and 5 (using two commands)
pb validate data.csv --check col-vals-ge --column rating --value 1
pb validate data.csv --check col-vals-le --column rating --value 5

Set Membership

# Check if status values are in allowed set
pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"

Debugging Failed Validations

Use --show-extract to see which rows are causing validation failures:

# Show failing rows when validation fails
pb validate data.csv --check col-vals-lt --column age --value 65 --show-extract

# Limit the number of failing rows shown
pb validate data.csv --check col-vals-not-null --column email --show-extract --limit 5

Exit Codes for Automation

Use --exit-code to make the command exit with a non-zero code when validation fails:

# For use in CI/CD pipelines
pb validate data.csv --check rows-distinct --exit-code

Output Examples

Successful Validation

✓ Loaded data source: data.csv
✓ Col Vals Lt validation completed

Validation Result: Column Values Less Than

  Property            Value
 ──────────────────────────────────────────────
  Data Source         data.csv
  Check Type          col-vals-lt
  Column              age
  Threshold           < 100.0
  Total Rows Tested   1,250
  Passing Rows        1,250
  Failing Rows        0
  Result              ✓ PASSED

╭─────────────────────────────────────────────────╮
│ ✓ Validation PASSED: All values in column       │
│   'age' are < 100.0 in data.csv                 │
╰─────────────────────────────────────────────────╯

Failed Validation

✓ Loaded data source: data.csv
✓ Col Vals Lt validation completed

Validation Result: Column Values Less Than

  Property            Value
 ──────────────────────────────────────────────
  Data Source         data.csv
  Check Type          col-vals-lt
  Column              age
  Threshold           < 65.0
  Total Rows Tested   1,250
  Passing Rows        1,180
  Failing Rows        70
  Result              ✗ FAILED

╭─────────────────────────────────────────────────╮
│ ✗ Validation FAILED: 70 values >= 65.0 found    │
│   in column 'age' in data.csv                   │
│ 💡 Tip: Use --show-extract to see the failing   │
│ rows                                            │
╰─────────────────────────────────────────────────╯

Integration with CI/CD

The CLI is useful when integrating data validation into your CI/CD pipelines:

# Example GitHub Actions workflow
name: Data Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install pointblank
      - name: Validate data quality
        run: |
          pb validate data/sales.csv --check rows-distinct --exit-code
          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code

Getting Help

Use the --help flag to see all available options:

pb validate --help

Terminal Demonstrations

Check out our step-by-step terminal demos that show real command execution with actual output.