1. Bioinformatics Summary
This CWL workflow implements a data cleaning pipeline for Single-Cell RNA-Seq (scRNA-seq) data generated by 10x Genomics Chromium.
* Primary Computational Step: Extraction. The workflow takes a compressed bundle containing Raw, Filtered, and Secondary Analysis report folders and extracts all contained .tar files into a temporary directory named counts using a Bash script mapped to a Docker container (scidap/scidap:v0.0.4).
* Core Analysis Tool: SoupX (called via an internal sub-workflow, soupx.cwl).
* SoupX is an R package that uses a "Pure Soup" model to estimate and correct for ambient RNA (contamination).
* It estimates the frequency of each gene's mRNA molecules in the soup, using the empty droplets.
* It estimates what fraction of each cell's counts comes from contamination and adjusts the count matrix to remove this background noise.
* It outputs corrected count matrices in MEX (Market Exchange) and HDF5 formats.
* Diagnostic Outputs: It generates plots visualizing raw vs. corrected expression levels and diagnostic ratios to validate the accuracy of the contamination removal.
* Final Packaging: The corrected matrices are compressed back into a tarball for transport.
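The extraction and packaging steps can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the workflow's actual Bash script. The function names extract_archives and pack_folder, and the assumption that each input is a separate tar archive, are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archives(archives, dest="counts"):
    """Extract each tar archive into its own subfolder of dest."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        archive = Path(archive)
        # Drop the .tar / .tar.gz suffix so raw.tar.gz unpacks into counts/raw
        name = archive.name.split(".tar")[0]
        target = dest_dir / name
        target.mkdir(exist_ok=True)
        with tarfile.open(archive) as tar:  # compression is auto-detected
            tar.extractall(target)
        extracted.append(target)
    return extracted


def pack_folder(folder, output):
    """Compress a folder of corrected matrices into a gzipped tarball."""
    folder = Path(folder)
    with tarfile.open(output, "w:gz") as tar:
        tar.add(folder, arcname=folder.name)
    return Path(output)
```

A call such as extract_archives(["raw.tar.gz", "filtered.tar.gz"]) would leave counts/raw and counts/filtered on disk, and pack_folder would be applied to whatever folder the correction step writes.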
2. Biological Explanation
* The Biological Problem: Single-cell capture methods (like 10x Chromium) often result in the co-loading of mRNA from dead or lysed cells into the droplets of live cells. This is known as ambient RNA or cell-free mRNA "soup."
* Impact: This background noise can lead to false positives, where scientists incorrectly conclude that a gene is highly expressed in a cell type when its transcripts actually leaked from the soup into that cell's droplet.
* Sample Type: Tissues dissociated into single-cell suspensions (e.g., tumor biopsies, blood, brain tissue).
* Biological Question: How do we recover the *ground truth* expression profile of the live cells, removing the technical artifact of background contamination?
* Context with BioSkill: This workflow is the crucial first step before applying the Clustering bioSkill. If you analyze raw (uncleaned) data, clustering algorithms often group empty droplets with real cells, or identify spurious clusters based on high ambient expression.
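The idea of recovering a cell's own expression profile can be made concrete with a toy model. The sketch below assumes that a droplet's observed counts are the cell's own counts plus a share rho of ambient counts, estimates the soup's gene profile from empty droplets, and subtracts the expected contamination. This is a deliberate simplification for illustration, not SoupX's actual estimation procedure, and every name and number in it is hypothetical.

```python
def ambient_profile(empty_droplets):
    """Fraction of soup counts per gene, pooled over empty droplets."""
    n_genes = len(empty_droplets[0])
    totals = [sum(d[g] for d in empty_droplets) for g in range(n_genes)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def subtract_ambient(cell, profile, rho):
    """Remove the expected soup counts from one cell, clipping at zero."""
    soup_umis = rho * sum(cell)  # counts attributed to the soup
    return [max(c - p * soup_umis, 0.0) for c, p in zip(cell, profile)]


# Hypothetical data: three genes (A, B, C), counts per droplet.
empty_droplets = [[8, 2, 0], [6, 4, 0]]
profile = ambient_profile(empty_droplets)  # soup is mostly gene A

cell = [7, 3, 90]  # 100 UMIs; genes A and B sit at soup level
corrected = subtract_ambient(cell, profile, rho=0.1)
```

In this example the few counts of genes A and B in the cell are fully explained by the soup and drop to roughly zero, while gene C, which is absent from the soup, is left untouched. That is the false-positive scenario described under Impact.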
3. Criticism of Workflow
* Missing soupx.cwl Definition: The workflow references a sub-workflow named soupx.cwl, but that definition is not included in the prompt. Without it, the estimate_contamination step would fail as the tool logic is undefined.
* Inefficient Data Extraction: The extract_count_matrices_to_folder step explicitly unpacks three different folders (raw, filtered, secondary_analysis_report). In a production environment, the secondary_analysis_report folder is usually just CSVs and JSONs (metadata), which the R tool SoupX does not need. Unpacking unnecessary data wastes disk space and I/O time.
* Docker Version Hardcoding: The workflow relies on a specific Docker tag, scidap/scidap:v0.0.4. If the container registry does not have this version, the pipeline fails when it tries to pull the image. One mitigation is to develop against a wider tag (e.g., latest or v0.0) and pin a specific version *after* a build step, although a floating tag gives up reproducibility until that pin is in place.
* Lack of SoupX Parameter Tuning: The workflow passes in parameters like genelist_file, fdr, and expression_threshold, but without the definition of soupx.cwl, it is unclear how these parameters are actually used (e.g., are they used for filtering, or just passed to the script?).
* No Downstream Integration: The workflow is purely a preprocessing tool and stops once the matrices are cleaned. It does not validate whether the cleaning was successful (e.g., by checking that the subtraction algorithm introduced no negative counts).
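A post-correction sanity check of the kind the last point asks for could look like the sketch below. It assumes the raw and corrected matrices have already been loaded as nested lists (genes by cells) of equal shape. The function name validate_correction and the fields it returns are hypothetical and are not part of the workflow or of SoupX.

```python
def validate_correction(raw, corrected, tol=1e-9):
    """Compare a corrected count matrix (genes x cells) with the raw one."""
    if len(raw) != len(corrected) or any(
        len(r) != len(c) for r, c in zip(raw, corrected)
    ):
        raise ValueError("raw and corrected matrices differ in shape")

    negatives = 0  # entries pushed below zero by the subtraction
    increased = 0  # entries that ended up larger than in the raw data
    raw_total = 0.0
    corrected_total = 0.0
    for raw_row, corr_row in zip(raw, corrected):
        for r, c in zip(raw_row, corr_row):
            negatives += int(c < -tol)
            increased += int(c > r + tol)
            raw_total += r
            corrected_total += c

    # Share of all counts removed, a rough proxy for the contamination level
    removed = 1.0 - corrected_total / raw_total if raw_total else 0.0
    return {
        "negative_entries": negatives,
        "increased_entries": increased,
        "fraction_removed": removed,
        "ok": negatives == 0 and increased == 0,
    }
```

A workflow step wrapping such a check could fail the run when ok is false, or flag samples whose fraction_removed looks implausibly high for manual review.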
4. Guidance for Non-Technical Person
The "Washing Machine" Analogy:
Imagine you are trying to identify specific ingredients in a batch of soup made in a washing machine (using a 10x chip).
* The Problem: Sometimes, cells break apart in this machine. The washing machine leaks their "soup" (RNA) into other cells' compartments. This creates noise: imagine finding a fish bone in your glass of orange juice because the fish tank leaked.
* What this Workflow does: It acts like a high-tech filter and detector. It figures out how much "soup" is in the washing machine (the leak) and then mathematically subtracts that noise from each individual cell's drink.
* Why you should care:
* Accuracy: It ensures you aren't analyzing junk.
* Speed: It removes the wasted cells before the computer tries to cluster them.
* Decision Making: It allows biologists to confidently say, "Gene X is definitely active in this cell type," without worrying that it was just leaked from dead cells.
* Next Steps (via the related BioSkill): Once this cleaning is done, the clustering step (described in the bioSkill) can be safely run to identify what these cell types actually are (e.g., identifying immune cells, cancer cells, or stem cells in the soup).