[Work in progress] A tool to visualize data quickly, without having to think about which plots to create
Using UV (Recommended)
UV is a fast Python package installer and resolver. First, install UV:
$ pip install uv

Then install the project using:
$ git clone https://github.com/eyadsibai/brute_force_plotter.git
$ cd brute_force_plotter
$ uv sync

This will create a virtual environment (.venv) and install all dependencies with locked versions for reproducibility.
Useful UV Commands:
- uv sync - Install dependencies and sync the environment
- uv add <package> - Add a new dependency
- uv remove <package> - Remove a dependency
- uv lock - Update the lockfile
- uv run <command> - Run a command in the virtual environment
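To confirm the environment is set up, you can run a quick import check inside it (a minimal sanity check; it simply prints where the package was installed):

$ uv run python -c "import brute_force_plotter; print(brute_force_plotter.__file__)"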
As a Python Library (NEW!)
You can now use brute-force-plotter directly in your Python scripts:
import pandas as pd
import brute_force_plotter as bfp
# Load your data
data = pd.read_csv('data.csv')
# Define data types (c=category, n=numeric, t=time series, g=geocoordinate, i=ignore)
# Option 1: Automatic type inference (NEW!)
output_path, dtypes = bfp.plot(data)
print(f"Inferred types: {dtypes}")
# Option 2: Manual type definition
dtypes = {
'column1': 'n', # numeric
'column2': 'c', # category
'column3': 't', # time series (datetime)
'column4': 'i' # ignore
}
# Create and save plots (always returns tuple)
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots')
# Or show plots interactively
bfp.plot(data, dtypes, show=True)
# Example with geocoordinates
geo_data = pd.read_csv('cities.csv')
geo_dtypes = {
'latitude': 'g', # geocoordinate
'longitude': 'g', # geocoordinate
'city_type': 'c', # category
'population': 'n' # numeric
}
bfp.plot(geo_data, geo_dtypes, output_path='./maps')
# The (output_path, dtypes) tuple is returned even when showing interactively
output_path, dtypes_used = bfp.plot(data, dtypes, show=True)
# Generate minimal set of plots (reduces redundant visualizations)
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots', minimal=True)
# Option 3: Manually infer types first, then edit if needed
dtypes = bfp.infer_dtypes(data)
# Edit dtypes as needed...
output_path, dtypes_used = bfp.plot(data, dtypes, output_path='./plots')

See example/library_usage_example.py for more examples.
As a Command-Line Tool
It was tested on Python 3 only (Python 3.10+ required).
Using UV:
$ git clone https://github.com/eyadsibai/brute_force_plotter.git
$ cd brute_force_plotter
$ uv sync
# With automatic type inference (NEW!)
$ uv run python -m src example/titanic.csv example/output --infer-dtypes --save-dtypes example/auto_dtypes.json
# With manual type definition
$ uv run python -m src example/titanic.csv example/titanic_dtypes.json example/output
# Or use the brute-force-plotter command:
$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output

Available options:
- --skip-existing: Skip generating plots that already exist (default: True)
- --theme: Choose plot style theme (darkgrid, whitegrid, dark, white, ticks) (default: darkgrid)
- --n-workers: Number of parallel workers for plot generation (default: 4)
- --export-stats: Export statistical summary to CSV files
- --minimal: Generate a minimal set of plots (reduces redundant visualizations)
- --infer-dtypes: Automatically infer data types from the data (NEW!)
- --save-dtypes PATH: Save inferred or used dtypes to a JSON file (NEW!)
- --max-rows: Maximum number of rows before sampling is applied (default: 100,000)
- --sample-size: Number of rows to sample for large datasets (default: 50,000)
- --no-sample: Disable sampling for large datasets (may cause memory issues)
Using UV:
$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output --theme whitegrid --n-workers 8 --export-stats
# Generate minimal set of plots (fewer redundant visualizations)
$ uv run brute-force-plotter example/titanic.csv example/titanic_dtypes.json example/output --minimal
$ uv run brute-force-plotter example/titanic.csv example/output --infer-dtypes --save-dtypes example/auto_dtypes.json --theme whitegrid --n-workers 8 --export-stats

For datasets exceeding 100,000 rows, brute-force-plotter automatically samples the data to improve performance and reduce memory usage. This ensures plots are generated quickly even with millions of rows.
Default Behavior:
- Datasets with ≤ 100,000 rows: No sampling, all data is used
- Datasets with > 100,000 rows: Automatically samples 50,000 rows for visualization
- Statistical exports (--export-stats) always use the full dataset for accuracy
Customization:
# Increase sampling threshold to 200,000 rows
$ python3 -m src data.csv dtypes.json output --max-rows 200000
# Use a larger sample size (75,000 rows)
$ python3 -m src data.csv dtypes.json output --sample-size 75000
# Disable sampling entirely (use with caution for very large datasets)
$ python3 -m src data.csv dtypes.json output --no-sample

The tool now supports time series data! Here's how to visualize time series:
# Generate example time series data
$ python3 example/timeseries_example.py
# Plot the time series data
$ python3 -m src example/timeseries_data.csv example/timeseries_dtypes.json example/timeseries_output

The time series example generates plots for:
- Single time series line plots
- Numeric values over time (e.g., sales over time)
- Multiple time series overlays
- Grouped time series by category (e.g., sales by region over time)
Time Series dtypes example:
{
  "date": "t",
  "temperature": "n",
  "sales": "n",
  "region": "c",
  "id": "i"
}

Here "date" is the time series column, "temperature" and "sales" are numeric values plotted over time, "region" groups the time series by category, and "id" is ignored.

Library Usage:
import pandas as pd
import brute_force_plotter as bfp
# Load a large dataset
data = pd.read_csv('large_data.csv') # e.g., 500,000 rows
dtypes = {'col1': 'n', 'col2': 'c'}
# Automatic sampling (default: max_rows=100000, sample_size=50000)
bfp.plot(data, dtypes, output_path='./plots')
# Custom sampling parameters
bfp.plot(data, dtypes, output_path='./plots', max_rows=200000, sample_size=75000)
# Disable sampling
bfp.plot(data, dtypes, output_path='./plots', no_sample=True)

Note: Sampling uses a fixed random seed (42) for reproducibility, ensuring consistent results across multiple runs.
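If you want to reproduce or inspect the sampled subset yourself, the equivalent pandas call is a plain random sample (a sketch assuming simple row sampling with the fixed seed described above; not the tool's internal code):

import pandas as pd

data = pd.read_csv('large_data.csv')  # e.g. 500,000 rows

# Mirror the default behavior: sample 50,000 rows when above the 100,000-row threshold
sampled = data.sample(n=50_000, random_state=42) if len(data) > 100_000 else data
print(len(sampled))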
With a DataFrame df already loaded, a starting dtypes file can be generated with pandas and then edited by hand:

import json
json.dump({k: v.name for k, v in df.dtypes.to_dict().items()}, open('dtypes.json', 'w'))

Command-line arguments:

- The first argument is the input file (a CSV file with the data), e.g. example/titanic.csv
- The second argument is a JSON file with the data type of each column: c for category, n for numeric, t for time series (datetime), g for geocoordinate (latitude/longitude), i for ignore. Example (example/titanic_dtypes.json):
{
"Survived": "c",
"Pclass": "c",
"Sex": "c",
"Age": "n",
"SibSp": "n",
"Parch": "n",
"Fare": "n",
"Embarked": "c",
"PassengerId": "i",
"Ticket": "i",
"Cabin": "i",
"Name": "i"
}

- The third argument is the output directory
For data with latitude and longitude columns:
{
"city": "i",
"latitude": "g",
"longitude": "g",
"population": "n",
"category": "c"
}

See example/cities_geo.csv and example/cities_geo_dtypes.json for a complete example.
- c stands for category, n for numeric, t for time series (datetime), g for geocoordinate, i for ignore
The --minimal flag reduces the number of plots generated by removing redundant visualizations while keeping the most informative ones:
What's reduced in minimal mode:
- Correlation matrices: Only Spearman correlation (removes Pearson correlation)
- Spearman is more robust to outliers and works for both linear and monotonic relationships
- Category vs Category: Only heatmap (removes bar plot)
- Heatmap shows the same information more compactly
- Category vs Numeric: Only box plot and violin plot (removes bar plot and strip plot)
- Box and violin plots are the most informative for showing distributions
What's kept in minimal mode:
- All single-variable distributions (histograms, violin plots, bar plots)
- All numeric vs numeric scatter plots
- Missing values heatmap
Example reduction: For the Titanic dataset, minimal mode generates 38 plots instead of 45 (15.6% reduction).
Use --minimal when you want to:
- Reduce clutter in your output directory
- Focus on the most informative visualizations
- Speed up plot generation for large datasets
The tool automatically generates:
Distribution Plots:
- Histogram with KDE for numeric variables
- Violin plots for numeric variables
- Bar plots for categorical variables
- Correlation matrices (Pearson and Spearman, or just Spearman in minimal mode)
- Line plots for time series variables
- Missing values heatmap
2D Interaction Plots:
- Scatter plots for numeric vs numeric
- Heatmaps for categorical vs categorical (and bar plots in full mode)
- Bar/Box/Violin/Strip plots for categorical vs numeric (Box/Violin only in minimal mode)
- Line plots for time series vs numeric (values over time)
- Multiple time series overlays for time series vs time series
3D Interaction Plots:
- Grouped time series plots (time series + category + numeric)
- Shows how numeric values change over time, grouped by categorical values
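For example, a dtypes mapping that combines a time series column, a category, and a numeric value produces these grouped plots (a sketch using the library API shown above; the file and column names are illustrative):

import pandas as pd
import brute_force_plotter as bfp

sales = pd.read_csv('regional_sales.csv')  # illustrative dataset

# 't' + 'c' + 'n' together yield grouped time series plots:
# numeric values over time, one line per category
dtypes = {'date': 't', 'region': 'c', 'sales': 'n'}
output_path, dtypes_used = bfp.plot(sales, dtypes, output_path='./plots')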
Map Visualizations (NEW!):
- Interactive maps for geocoordinate data (latitude/longitude)
- Color-coded markers based on categorical variables
- Automatic detection of lat/lon column pairs
- Support for common naming patterns (lat, lon, latitude, longitude, x_coord, y_coord)
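A simplified illustration of this kind of name-based matching (illustrative only, not the tool's actual implementation):

# Illustrative only: pair latitude/longitude columns by common naming patterns
LAT_NAMES = {'lat', 'latitude', 'y_coord'}
LON_NAMES = {'lon', 'longitude', 'x_coord'}

def find_lat_lon_pair(columns):
    """Return (lat_col, lon_col) if a recognizable pair exists, otherwise None."""
    lat = next((c for c in columns if c.lower() in LAT_NAMES), None)
    lon = next((c for c in columns if c.lower() in LON_NAMES), None)
    return (lat, lon) if lat and lon else None

print(find_lat_lon_pair(['city', 'latitude', 'longitude']))  # ('latitude', 'longitude')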
Statistical Summaries (with --export-stats):
- Numeric statistics (mean, std, min, max, quartiles)
- Category value counts
- Missing values analysis
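These summaries correspond to standard pandas aggregations; roughly equivalent computations look like this (a sketch, not the exact export code):

import pandas as pd

data = pd.read_csv('example/titanic.csv')

numeric_stats = data.describe()                            # mean, std, min, max, quartiles
embarked_counts = data['Embarked'].value_counts()          # per-category counts
missing = data.isna().sum().sort_values(ascending=False)   # missing values per column

numeric_stats.to_csv('numeric_stats.csv')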
The project includes a comprehensive test suite with 81+ tests covering unit tests, integration tests, and edge cases.
Running Tests
# Run all tests
$ pytest
# Run with coverage report
$ pytest --cov=src --cov-report=html
# Run specific test categories
$ pytest -m unit # Unit tests only
$ pytest -m integration # Integration tests only
$ pytest -m edge_case # Edge case tests only
# Run tests in parallel (faster)
$ pytest -n auto
# Run with verbose output
$ pytest -v

Test Coverage
The test suite achieves ~96% code coverage and includes:
- Unit tests: Core plotting functions, utilities, statistical exports, large dataset handling
- Integration tests: CLI interface, library interface, end-to-end workflows
- Edge case tests: Empty data, missing values, many categories, Unicode support
Writing Tests
When contributing, please:
- Add tests for new features in the appropriate test file
- Ensure tests pass locally before submitting PR
- Aim for >90% code coverage for new code
- Use the fixtures in conftest.py for test data
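A new test might look like the following (a sketch: it builds a small frame inline instead of using a conftest.py fixture, and assumes infer_dtypes maps these columns to 'n' and 'c'):

import pandas as pd
import pytest

import brute_force_plotter as bfp


@pytest.mark.unit
def test_infer_dtypes_marks_numeric_and_category_columns():
    # Tiny inline DataFrame for illustration; real tests should prefer shared fixtures
    df = pd.DataFrame({"age": [22.0, 38.0, 26.0], "sex": ["m", "f", "f"]})
    dtypes = bfp.infer_dtypes(df)
    assert dtypes["age"] == "n"
    assert dtypes["sex"] == "c"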
When developing for this project, it's important to set up code quality tools to ensure consistency:
1. Install Development Dependencies
Using UV:
$ uv sync  # Installs all dependencies including dev tools

2. Install Pre-commit Hooks (REQUIRED)
This project uses pre-commit hooks to automatically enforce code quality standards on every commit:
$ pre-commit install

After installation, the hooks will run automatically on git commit and check:
- ✅ Ruff linting (with auto-fix)
- ✅ Ruff formatting
- ✅ Trailing whitespace removal
- ✅ End-of-file fixes
- ✅ YAML/JSON/TOML validation
- ✅ Large file detection
3. Manual Code Quality Checks
You can also run these checks manually:
# Lint code (check for issues)
$ ruff check .
# Lint and auto-fix issues
$ ruff check --fix .
# Format code
$ ruff format .
# Run all pre-commit hooks on all files
$ pre-commit run --all-files

4. Running Tests
Always run tests before submitting changes:
$ pytest

Pre-commit hooks ensure that:
- All code follows consistent style guidelines
- Linting issues are caught before they reach CI
- Code quality is maintained automatically
- Review cycles are faster (no style nitpicks)
Note: If you try to commit code that doesn't pass the checks, the commit will be blocked. Fix the issues reported and commit again.
✅ Updated all dependencies to latest stable versions
✅ Added correlation matrix plots (Pearson and Spearman)
✅ Added missing values visualization
✅ Added statistical summary export
✅ Added configurable plot themes
✅ Added parallel processing controls
✅ Added skip-existing-plots option
✅ Improved logging and progress indicators
✅ Code cleanup and better error handling
✅ Comprehensive test suite with 96% coverage (81+ tests)
✅ Interactive map visualization for geocoordinate data (NEW!)
✅ Time series support with line plots, grouped plots, and multi-series overlays
✅ Automatic data type inference - no need to manually specify data types!
✅ Large dataset fallback with automatic sampling
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines on:
- Setting up your development environment
- Using code quality tools (Ruff, pre-commit)
- Submitting pull requests
- Coding standards and best practices
The project follows a modular architecture for better maintainability and reduced merge conflicts:
src/
├── core/ # Core functionality
│ ├── config.py # Global configuration
│ ├── data_types.py # Type inference
│ └── utils.py # Utilities
├── plotting/ # Visualization modules
│ ├── base.py # Common plotting functions
│ ├── single_variable.py
│ ├── two_variable.py
│ ├── three_variable.py
│ ├── summary.py
│ ├── timeseries.py
│ └── maps.py
├── stats/ # Statistical exports
│ └── export.py
├── cli/ # Command-line interface
│ ├── commands.py
│ └── orchestration.py
├── library.py # Python API
└── brute_force_plotter.py # Compatibility layer
This structure enables parallel development and makes it easier to locate and modify specific functionality.
- Eyad Sibai / @eyadsibai
The following haven't provided code directly, but have provided guidance and advice:
- Andreas Meisingseth / @AndreasMeisingseth
- Tom Baylis / @tbaylis
This project is licensed under the MIT License - see the LICENSE file for details.




