CGAT-core Documentation

Welcome to the CGAT-core documentation! CGAT-core is a Python framework for building and executing computational pipelines, with support for cluster environments and cloud integration.

Key Features

  • Pipeline Management: Build and execute complex computational pipelines
  • Cluster Support: Seamless integration with various cluster environments (SLURM, SGE, PBS)
  • Cloud Integration: Native support for AWS S3 and other cloud services
  • Resource Management: Intelligent handling of compute resources and job distribution
  • Container Support: Execute pipeline tasks in containers for reproducibility

Getting Started

Installation Guide

Tutorial

Examples

Core Components

Pipeline Development

Writing Workflows

Run Parameters

Pipeline Modules

Execution Environments

Cluster Configuration

Container Support

Cloud Integration

Advanced Features

Parameter Management

Execution Control

Database Integration

Project Information

How to Contribute

Citations

License

FAQ

Additional Resources

API Documentation

GitHub Repository

Issue Tracker

Need Help?

If you need help or have questions:

  1. Check our FAQ
  2. Search existing GitHub Issues
  3. Create a new issue if your problem isn't already addressed

Overview

CGAT-core has been continuously developed over the past decade as a workflow management system for next-generation sequencing (NGS) data. By combining CGAT-core with CGAT-apps, users can create diverse computational workflows. For a practical demonstration, refer to cgat-showcase, which features a simple RNA-seq pipeline.

For advanced usage examples, explore the cgat-flow repository, which contains production-ready pipelines for automating NGS data analysis. Note that it is under active development and may require additional software dependencies.

Citation

If you use CGAT-core, please cite our publication in F1000 Research:

Cribbs AP, Luna-Valero S, George C et al. CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 1; peer review: 1 approved, 1 approved with reservations].
F1000Research 2019, 8:377
https://doi.org/10.12688/f1000research.18674.1

Support

Example Workflows

cgat-showcase

A simple example of workflow development using CGAT-core. Visit the GitHub page or view the documentation.

cgat-flow

This repository demonstrates CGAT-core's flexibility through fully tested production pipelines. For details on usage and installation, see the GitHub page.

Single-Cell RNA-seq

  • Cribbs Lab: Uses CGAT-core for pseudoalignment pipelines in single-cell Drop-seq methods.
  • Sansom Lab: Develops single-cell sequencing analysis workflows using the CGAT-core workflow engine (TenX workflows).

Pipeline Modules Overview

CGAT-core provides a comprehensive set of modules to facilitate the creation and management of data processing pipelines. These modules offer various functionalities, from pipeline control and execution to database management and file handling.

Available Modules

  1. Control: Manages the overall pipeline execution flow.
  2. Database: Handles database operations and uploads.
  3. Files: Provides utilities for file management and temporary file handling.
  4. Cluster: Manages job submission and execution on compute clusters.
  5. Execution: Handles task execution and logging.
  6. Utils: Offers various utility functions for pipeline operations.
  7. Parameters: Manages pipeline parameters and configuration.

Integration with Ruffus

CGAT-core builds upon the Ruffus pipeline library, extending its functionality and providing additional features. It includes the following Ruffus decorators:

  • @transform
  • @merge
  • @split
  • @originate
  • @follows

It also includes suffix, which in Ruffus is a filter indicator passed as an argument to decorators such as @transform, not a decorator itself.

These decorators define pipeline tasks and the dependencies between them.

S3 Integration

CGAT-core also provides S3-aware decorators and functions for seamless integration with AWS S3:

  • @s3_transform
  • @s3_merge
  • @s3_split
  • @s3_originate
  • @s3_follows

For more information on working with S3, see the S3 Integration section.

Together, these modules and decorators provide the building blocks for scalable data processing pipelines in CGAT-core.