Brute Force Plotter

[Work in progress] Tool to visualize data quickly with no brain usage for plot creation

Installation

Using UV (Recommended)

UV is a fast Python package installer and resolver. First, install UV:

$ pip install uv

Then install the project using:

$ git clone https://github.com/eyadsibai/brute_force_plotter.git
$ cd brute_force_plotter
$ uv sync

This will create a virtual environment (.venv) and install all dependencies with locked versions for reproducibility.

Useful UV Commands:

  • uv sync - Install dependencies and sync the environment
  • uv add <package> - Add a new dependency
  • uv remove <package> - Remove a dependency
  • uv lock - Update the lockfile
  • uv run <command> - Run a command in the virtual environment

Usage

As a Python Library (NEW!)

You can now use brute-force-plotter directly in your Python scripts:

import pandas as pd
import brute_force_plotter as bfp

# Load your data
data = pd.read_csv('data.csv')

# Define data types (c=category, n=numeric, t=time series, g=geocoordinate, i=ignore)
# Option 1: Automatic type inference (NEW!)
output_path, dtypes = bfp.plot(data)
print(f"Inferred types: {dtypes}")

# Option 2: Manual type definition
dtypes = {
    'column1': 'n',  # numeric
    'column2': 'c',  # category
    'column3': 't',  # time series (datetime)
    'column4': 'i'   # ignore
}

# Create and save plots (always returns tuple)
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots')

# Or show plots interactively
bfp.plot(data, dtypes, show=True)

# Example with geocoordinates
geo_data = pd.read_csv('cities.csv')
geo_dtypes = {
    'latitude': 'g',   # geocoordinate
    'longitude': 'g',  # geocoordinate
    'city_type': 'c',  # category
    'population': 'n'  # numeric
}
bfp.plot(geo_data, geo_dtypes, output_path='./maps')

# Generate minimal set of plots (reduces redundant visualizations)
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots', minimal=True)

# Option 3: Manually infer types first, then edit if needed
dtypes = bfp.infer_dtypes(data)
# Edit dtypes as needed...
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots')

See example/library_usage_example.py for more examples.

As a Command-Line Tool

Example

Tested on Python 3 only (Python 3.10+ required).

Using UV:

$ git clone https://github.com/eyadsibai/brute_force_plotter.git
$ cd brute_force_plotter
$ uv sync

# With automatic type inference (NEW!)
$ uv run python -m src example/titanic.csv example/output --infer-dtypes --save-dtypes example/auto_dtypes.json

# With manual type definition
$ uv run python -m src example/titanic.csv example/titanic_dtypes.json example/output

# Or use the brute-force-plotter command:
$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output

Command Line Options

  • --skip-existing: Skip generating plots that already exist (default: True)
  • --theme: Choose plot style theme (darkgrid, whitegrid, dark, white, ticks) (default: darkgrid)
  • --n-workers: Number of parallel workers for plot generation (default: 4)
  • --export-stats: Export statistical summary to CSV files
  • --minimal: Generate minimal set of plots (reduces redundant visualizations)
  • --infer-dtypes: Automatically infer data types from the data (NEW!)
  • --save-dtypes PATH: Save inferred or used dtypes to a JSON file (NEW!)
  • --max-rows: Maximum number of rows before sampling is applied (default: 100,000)
  • --sample-size: Number of rows to sample for large datasets (default: 50,000)
  • --no-sample: Disable sampling for large datasets (may cause memory issues)

Using UV:

$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output --theme whitegrid --n-workers 8 --export-stats

# Generate minimal set of plots (fewer redundant visualizations)
$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output --minimal

# Combine automatic type inference with other options
$ uv run brute-force-plotter example/titanic.csv example/output --infer-dtypes --save-dtypes example/auto_dtypes.json --theme whitegrid --n-workers 8 --export-stats
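
If you saved inferred dtypes with --save-dtypes, the resulting JSON can be reused from Python later. A minimal sketch, assuming the example paths used above:

import json

import pandas as pd
import brute_force_plotter as bfp

# Load the dtypes JSON produced by --save-dtypes
with open("example/auto_dtypes.json") as f:
    dtypes = json.load(f)

data = pd.read_csv("example/titanic.csv")
bfp.plot(data, dtypes, output_path="example/output")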

Large Dataset Handling

For datasets exceeding 100,000 rows, brute-force-plotter automatically samples the data to improve performance and reduce memory usage. This ensures plots are generated quickly even with millions of rows.

Default Behavior:

  • Datasets with ≤ 100,000 rows: No sampling, all data is used
  • Datasets with > 100,000 rows: Automatically samples 50,000 rows for visualization
  • Statistical exports (--export-stats) always use the full dataset for accuracy
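
The sampling behavior described above can be pictured as a few lines of pandas. This is only a sketch, not the library's actual code, and the helper name maybe_sample is illustrative:

import pandas as pd

def maybe_sample(df: pd.DataFrame, max_rows: int = 100_000,
                 sample_size: int = 50_000, no_sample: bool = False) -> pd.DataFrame:
    """Return the frame unchanged if it is small enough, otherwise a reproducible sample."""
    if no_sample or len(df) <= max_rows:
        return df
    # Fixed seed (42) so repeated runs produce the same plots
    return df.sample(n=sample_size, random_state=42)

# Plots are generated from the (possibly sampled) frame, while --export-stats
# statistics are still computed on the full dataset.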

Customization:

# Increase sampling threshold to 200,000 rows
$ python3 -m src data.csv dtypes.json output --max-rows 200000

# Use a larger sample size (75,000 rows)
$ python3 -m src data.csv dtypes.json output --sample-size 75000

# Disable sampling entirely (use with caution for very large datasets)
$ python3 -m src data.csv dtypes.json output --no-sample

Time Series Example

The tool now supports time series data! Here's how to visualize time series:

# Generate example time series data
$ python3 example/timeseries_example.py

# Plot the time series data
$ python3 -m src example/timeseries_data.csv example/timeseries_dtypes.json example/timeseries_output

The time series example generates plots for:

  • Single time series line plots
  • Numeric values over time (e.g., sales over time)
  • Multiple time series overlays
  • Grouped time series by category (e.g., sales by region over time)

Time Series dtypes example:

{
  "date": "t",
  "temperature": "n",
  "sales": "n",
  "region": "c",
  "id": "i"
}

Here date is the time series column, temperature and sales are numeric values plotted over time, region groups the series by category, and id is ignored.
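
To try this without running the example script, you can build a matching DataFrame in memory and pass the same dtypes; a small sketch (the column values are made up for illustration):

import numpy as np
import pandas as pd
import brute_force_plotter as bfp

rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=90, freq="D")
data = pd.DataFrame({
    "date": np.tile(dates, 2),                    # time series column
    "temperature": rng.normal(20, 5, size=180),   # numeric, plotted over time
    "sales": rng.poisson(100, size=180),          # numeric, plotted over time
    "region": ["north"] * 90 + ["south"] * 90,    # category, groups the series
    "id": range(180),                             # ignored
})

dtypes = {"date": "t", "temperature": "n", "sales": "n", "region": "c", "id": "i"}
bfp.plot(data, dtypes, output_path="./timeseries_plots")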

Library Usage:

import pandas as pd
import brute_force_plotter as bfp

# Load a large dataset
data = pd.read_csv('large_data.csv')  # e.g., 500,000 rows

dtypes = {'col1': 'n', 'col2': 'c'}

# Automatic sampling (default: max_rows=100000, sample_size=50000)
bfp.plot(data, dtypes, output_path='./plots')

# Custom sampling parameters
bfp.plot(data, dtypes, output_path='./plots', max_rows=200000, sample_size=75000)

# Disable sampling
bfp.plot(data, dtypes, output_path='./plots', no_sample=True)

Note: Sampling uses a fixed random seed (42) for reproducibility, ensuring consistent results across multiple runs.

Arguments

  • The first argument is the input file (a CSV file with the data), e.g. example/titanic.csv

  • The second argument is a JSON file with the data type of each column:

    • c for category
    • n for numeric
    • t for time series (datetime)
    • g for geocoordinate (latitude/longitude) - NEW!
    • i for ignore

    Example: example/titanic_dtypes.json

  • To get a starting point for this file, you can dump your DataFrame's pandas dtypes and then map them to the codes above:

    json.dump({k: v.name for k, v in df.dtypes.to_dict().items()}, open('dtypes.json', 'w'))

For the Titanic example, the dtypes file looks like:

{
  "Survived": "c",
  "Pclass": "c",
  "Sex": "c",
  "Age": "n",
  "SibSp": "n",
  "Parch": "n",
  "Fare": "n",
  "Embarked": "c",
  "PassengerId": "i",
  "Ticket": "i",
  "Cabin": "i",
  "Name": "i"
}
  • The third argument is the output directory

Geocoordinate Example

For data with latitude and longitude columns:

{
  "city": "i",
  "latitude": "g",
  "longitude": "g",
  "population": "n",
  "category": "c"
}

See example/cities_geo.csv and example/cities_geo_dtypes.json for a complete example.


Minimal Mode

The --minimal flag reduces the number of plots generated by removing redundant visualizations while keeping the most informative ones:

What's reduced in minimal mode:

  • Correlation matrices: Only Spearman correlation (removes Pearson correlation)
    • Spearman is more robust to outliers and works for both linear and monotonic relationships
  • Category vs Category: Only heatmap (removes bar plot)
    • Heatmap shows the same information more compactly
  • Category vs Numeric: Only box plot and violin plot (removes bar plot and strip plot)
    • Box and violin plots are the most informative for showing distributions

What's kept in minimal mode:

  • All single-variable distributions (histograms, violin plots, bar plots)
  • All numeric vs numeric scatter plots
  • Missing values heatmap

Example reduction: For the Titanic dataset, minimal mode generates 38 plots instead of 45 (15.6% reduction).

Use --minimal when you want to:

  • Reduce clutter in your output directory
  • Focus on the most informative visualizations
  • Speed up plot generation for large datasets

Features

The tool automatically generates:

Distribution Plots:

  • Histogram with KDE for numeric variables
  • Violin plots for numeric variables
  • Bar plots for categorical variables
  • Correlation matrices (Pearson and Spearman, or just Spearman in minimal mode)
  • Line plots for time series variables
  • Missing values heatmap

2D Interaction Plots:

  • Scatter plots for numeric vs numeric
  • Heatmaps for categorical vs categorical (and bar plots in full mode)
  • Bar/Box/Violin/Strip plots for categorical vs numeric (Box/Violin only in minimal mode)
  • Line plots for time series vs numeric (values over time)
  • Multiple time series overlays for time series vs time series

3D Interaction Plots:

  • Grouped time series plots (time series + category + numeric)
    • Shows how numeric values change over time, grouped by categorical values

Map Visualizations (NEW!):

  • Interactive maps for geocoordinate data (latitude/longitude)
  • Color-coded markers based on categorical variables
  • Automatic detection of lat/lon column pairs
  • Support for common naming patterns (lat, lon, latitude, longitude, x_coord, y_coord); see the sketch after this list
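
A rough sketch of how this kind of name-based detection could work; the pattern list and helper are illustrative, not the library's actual implementation:

import re

import pandas as pd

LAT_PATTERN = re.compile(r"^(lat|latitude|y_coord)$", re.IGNORECASE)
LON_PATTERN = re.compile(r"^(lon|longitude|x_coord)$", re.IGNORECASE)

def find_latlon_pair(df: pd.DataFrame):
    """Return (lat_column, lon_column) if both can be found by name, else None."""
    lat = next((c for c in df.columns if LAT_PATTERN.match(str(c))), None)
    lon = next((c for c in df.columns if LON_PATTERN.match(str(c))), None)
    return (lat, lon) if lat and lon else None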

Statistical Summaries (with --export-stats; see the sketch after this list):

  • Numeric statistics (mean, std, min, max, quartiles)
  • Category value counts
  • Missing values analysis
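
The exports correspond to standard pandas summaries; a rough sketch of roughly equivalent calls (the output file names are illustrative):

import pandas as pd

df = pd.read_csv("example/titanic.csv")

# Numeric statistics: mean, std, min, max, quartiles
df.describe().to_csv("numeric_stats.csv")

# Value counts for a categorical column
df["Embarked"].value_counts().to_csv("embarked_value_counts.csv")

# Missing values per column
df.isna().sum().to_csv("missing_values.csv")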

Example Plots

Age Distribution (Histogram with Kernel Density Estimation, Violin Plot)

Heatmap for Sex and Pclass

Pclass vs Survived

Survived vs Age

Age vs Fare

Testing

The project includes a comprehensive test suite with 81+ tests covering unit tests, integration tests, and edge cases.

Running Tests

# Run all tests
$ pytest

# Run with coverage report
$ pytest --cov=src --cov-report=html

# Run specific test categories
$ pytest -m unit          # Unit tests only
$ pytest -m integration   # Integration tests only
$ pytest -m edge_case     # Edge case tests only

# Run tests in parallel (faster)
$ pytest -n auto

# Run with verbose output
$ pytest -v

Test Coverage

The test suite achieves ~96% code coverage and includes:

  • Unit tests: Core plotting functions, utilities, statistical exports, large dataset handling
  • Integration tests: CLI interface, library interface, end-to-end workflows
  • Edge case tests: Empty data, missing values, many categories, Unicode support

Writing Tests

When contributing, please:

  1. Add tests for new features in the appropriate test file
  2. Ensure tests pass locally before submitting PR
  3. Aim for >90% code coverage for new code
  4. Use the fixtures in conftest.py for test data (a sketch follows below)
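
A sketch of what a contributed test might look like; sample_df is a hypothetical fixture name, so check conftest.py for the fixtures that actually exist:

import pytest

import brute_force_plotter as bfp

@pytest.mark.unit
def test_plot_returns_path_and_dtypes(tmp_path, sample_df):
    # sample_df: hypothetical DataFrame fixture; tmp_path is pytest's built-in temp directory
    output_path, dtypes_used = bfp.plot(sample_df, output_path=str(tmp_path))
    assert output_path is not None
    assert isinstance(dtypes_used, dict)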

Development

Setting Up for Development

When developing for this project, it's important to set up code quality tools to ensure consistency:

1. Install Development Dependencies

Using UV:

$ uv sync  # Installs all dependencies including dev tools

2. Install Pre-commit Hooks (REQUIRED)

This project uses pre-commit hooks to automatically enforce code quality standards on every commit:

$ pre-commit install

After installation, the hooks will run automatically on git commit and check:

  • ✅ Ruff linting (with auto-fix)
  • ✅ Ruff formatting
  • ✅ Trailing whitespace removal
  • ✅ End-of-file fixes
  • ✅ YAML/JSON/TOML validation
  • ✅ Large file detection

3. Manual Code Quality Checks

You can also run these checks manually:

# Lint code (check for issues)
$ ruff check .

# Lint and auto-fix issues
$ ruff check --fix .

# Format code
$ ruff format .

# Run all pre-commit hooks on all files
$ pre-commit run --all-files

4. Running Tests

Always run tests before submitting changes:

$ pytest

Why Pre-commit Hooks?

Pre-commit hooks ensure that:

  • All code follows consistent style guidelines
  • Linting issues are caught before they reach CI
  • Code quality is maintained automatically
  • Review cycles are faster (no style nitpicks)

Note: If you try to commit code that doesn't pass the checks, the commit will be blocked. Fix the issues reported and commit again.

Recent Updates (2025)

✅ Updated all dependencies to latest stable versions
✅ Added correlation matrix plots (Pearson and Spearman)
✅ Added missing values visualization
✅ Added statistical summary export
✅ Added configurable plot themes
✅ Added parallel processing controls
✅ Added skip-existing-plots option
✅ Improved logging and progress indicators
✅ Code cleanup and better error handling
✅ Interactive map visualization for geocoordinate data (NEW!)
✅ Time series support with line plots, grouped plots, and multi-series overlays
✅ Automatic data type inference - no need to manually specify data types!
✅ Comprehensive test suite with ~96% coverage (81+ tests)
✅ Large dataset fallback with automatic sampling

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines on:

  • Setting up your development environment
  • Using code quality tools (Ruff, pre-commit)
  • Submitting pull requests
  • Coding standards and best practices

Code Organization

The project follows a modular architecture for better maintainability and reduced merge conflicts:

src/
├── core/               # Core functionality
│   ├── config.py      # Global configuration
│   ├── data_types.py  # Type inference
│   └── utils.py       # Utilities
├── plotting/          # Visualization modules
│   ├── base.py        # Common plotting functions
│   ├── single_variable.py
│   ├── two_variable.py
│   ├── three_variable.py
│   ├── summary.py
│   ├── timeseries.py
│   └── maps.py
├── stats/             # Statistical exports
│   └── export.py
├── cli/               # Command-line interface
│   ├── commands.py
│   └── orchestration.py
├── library.py         # Python API
└── brute_force_plotter.py  # Compatibility layer

This structure enables parallel development and makes it easier to locate and modify specific functionality.

Contributors

Code Contributors

Special Thanks

The following haven't provided code directly, but have provided guidance and advice:

License

This project is licensed under the MIT License - see the LICENSE file for details.
