Trovomics CSV File Guidelines

When uploading metadata to Trovomics, it is essential that your CSV files follow strict formatting rules. This enables accurate linking of your sample data, smooth downstream analyses, and minimal rework due to file validation errors. The Trovomics CSV validation function checks for:

  1. Presence of Required Columns

  2. Illegal Characters in Column Names

  3. Illegal Characters in Data Cells

Any violation causes validation to fail and returns details of the error. Review these guidelines carefully to avoid upload errors and ensure efficient data processing.

1. Required Columns

Your CSV file must include the following mandatory columns, which are used to link metadata to your actual omics data files:

  • SampleName

    • A unique identifier for each sample in your experiment.

    • Each replicate should have its own SampleName. Rows share a SampleName only when they list multiple files for the same sample (for example, the R1 and R2 files of paired-end data).

  • Filename

    • The name of the raw data file (FASTQ) corresponding to each sample.

    • The Filename should be unique across the entire analysis.

    • This should match the file’s exact name. Do not include the path.

These two columns are case-sensitive and must appear exactly as shown (SampleName and Filename). Files missing either required column will be rejected.
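
The required-column rule can be checked locally before upload. The following is a minimal sketch using Python's standard csv module; the function name missing_required_columns is hypothetical and is not part of Trovomics, and the actual validator may behave differently.

```python
import csv
import io

# Required header names, exactly as spelled in this guide (case-sensitive).
REQUIRED_COLUMNS = ("SampleName", "Filename")

def missing_required_columns(csv_text):
    """Return the required columns that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]

good = "SampleName,Filename,Condition\nSample1,Sample1_S1_L001_R1_001.fastq.gz,Control\n"
bad = "samplename,Filename\nSample1,Sample1_S1_L001_R1_001.fastq.gz\n"

print(missing_required_columns(good))  # []
print(missing_required_columns(bad))   # ['SampleName']
```

The second file is reported as missing SampleName because the comparison is case-sensitive, so samplename does not count.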

2. General Guidelines

All metadata values should be categorical. If your dataset contains a column with numerical data, convert it into appropriate categories before uploading the CSV file.

Example:

Numerical values

Variable: Age

Values: 21, 25, 30, 37, 28, 32, 22, 33, 29, 34.

Convert these values into categories appropriate for the goals of your analysis (e.g. younger_than_30, at_least_30). Category labels must start with a letter, as described in Section 4.2.

Categorical values

Variable: Age

Values: younger_than_30, younger_than_30, at_least_30, at_least_30, younger_than_30, at_least_30, younger_than_30, at_least_30, younger_than_30, at_least_30
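
The conversion above can be scripted rather than done by hand. This is a minimal sketch, assuming a single cutoff of 30; the function name age_category and the label wording are illustrative choices, picked so that every label starts with a letter as Section 4.2 requires.

```python
def age_category(age, cutoff=30):
    """Map a numeric age to a category label that starts with a letter."""
    if age < cutoff:
        return f"younger_than_{cutoff}"
    return f"at_least_{cutoff}"

ages = [21, 25, 30, 37, 28, 32, 22, 33, 29, 34]
categories = [age_category(age) for age in ages]
print(categories)
```

Choose the cutoff, or more than two bins, according to the goals of your analysis.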

3. Column Name Guidelines

In addition to the required columns, you may include any number of optional columns (e.g., Condition, Genotype, Tissue, Treatment). These additional columns capture the experimental metadata that Trovomics uses for downstream analyses. All columns, including required and optional, must adhere to the following naming rules:

3.1 Valid Characters

Column names may contain only letters, numbers and underscores.

Column names must start with a letter.

3.2 Example Disallowed Characters

Column names must not contain any of the following:

  • Punctuation/Special Characters:

    , ; : . / \ | ? < > [ ] { } ( ) + -

  • Whitespace:

    Spaces are disallowed in column names (e.g., “Group Name” is invalid). If needed, use CamelCase or underscores (e.g., “GroupName” or “Group_Name”).

  • Leading Characters:

    Column names should start with a letter, not a number or an underscore (_).
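
The naming rules in Sections 3.1 and 3.2 reduce to a single pattern: a letter followed by any number of letters, digits or underscores. A minimal sketch of a local check follows; it assumes "letters" means ASCII A-Z and a-z, and is_valid_column_name is a hypothetical helper, not a Trovomics function.

```python
import re

# A letter first, then only letters, digits or underscores (ASCII assumed).
COLUMN_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_column_name(name):
    """Return True if the whole name matches the naming rules."""
    return COLUMN_NAME_PATTERN.fullmatch(name) is not None

names = ["Group_Name", "GroupName", "Group Name", "_Condition", "Drug+", "Replicate01"]
invalid = [name for name in names if not is_valid_column_name(name)]
print(invalid)  # ['Group Name', '_Condition', 'Drug+']
```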

4. Data Cell Guidelines

Each row in the CSV file represents a single sample (or replicate). You can include as many optional metadata columns as you need; for example, you might have columns for Condition, Dose, Genotype, Treatment, Batch, Tissue, Cell_Type, etc.

4.1 Required Columns

  • SampleName: Must be present and non-empty for every row.

  • Filename: Must be present and non-empty for every row, exactly matching the physical data file name (including the file extension, but not the path).

    Missing values in either SampleName or Filename will invalidate the CSV.

4.2 Valid Characters

  • Data cell values may contain only letters, numbers and underscores.

  • Data cell values must start with a letter.

4.3 Data Columns

  • Formatting & Type

    • Avoid using any of the disallowed characters listed in Section 3.2.

  • Missing or Null Data

    • If you wish to explicitly label missing data, consider using the value no_data instead of NA, NaN, or NULL, which are reserved and may interfere with the pipeline.

4.4 Examples of Valid vs. Invalid Column Names and Data Cell Values

  • Valid: SampleName, Filename, Condition, Replicate01, Treatment_2, A673, ETV6, H3K4me3

  • Invalid: Sample Name, Drug+, Group?, _Condition, Group-Name, 5e10, pi, Inf, NA
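
The examples above can be reproduced with a short local check for metadata cell values. This is a sketch under stated assumptions: the reserved set is assembled only from the values this guide names (NA, NaN, NULL, Inf, pi) and may be incomplete, matching is assumed to be case-sensitive, and is_valid_cell is a hypothetical helper rather than the Trovomics validator.

```python
import re

# A letter first, then only letters, digits or underscores (ASCII assumed).
CELL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Values this guide lists as reserved or invalid; the real validator may reject more.
RESERVED_VALUES = {"NA", "NaN", "NULL", "Inf", "pi"}

def is_valid_cell(value):
    """Return True if the value is not reserved and matches the character rules."""
    return value not in RESERVED_VALUES and CELL_PATTERN.fullmatch(value) is not None

valid_examples = ["Replicate01", "Treatment_2", "A673", "ETV6", "H3K4me3", "no_data"]
invalid_examples = ["Drug+", "Group?", "Group-Name", "5e10", "pi", "Inf", "NA"]

print(all(is_valid_cell(v) for v in valid_examples))    # True
print(any(is_valid_cell(v) for v in invalid_examples))  # False
```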

5. Practical Tips for Omics Datasets

  1. Consistent Naming: Align SampleName with your laboratory records. For example, if a sample is known as Patient1_TissueA_Rep1, use that exact identifier in the CSV.

  2. Exact Filename Matching: Ensure the Filename column matches the actual data files in your system (e.g., Patient1_TissueA_Rep1_S1_L001_R1_001.fastq.gz). Any mismatch leads to processing errors.

  3. Descriptive Metadata Columns: Add columns like Condition, Genotype, Tissue, Treatment for robust downstream analysis. Avoid disallowed characters or spaces in these column names.

  4. Valid Characters Only: Column names and data cell values may contain only letters, numbers and underscores, and must start with a letter.

  5. Local Validation: It is good practice to inspect your CSV before uploading. This can save time and prevent re-uploads.
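
As one way to carry out the local inspection in tip 5, the sketch below checks that SampleName and Filename are filled in on every row and that no Filename is repeated. The function name check_required_cells and the wording of its messages are hypothetical; Trovomics reports its own error details.

```python
import csv
import io
from collections import Counter

def check_required_cells(csv_text):
    """Return a list of problems found in the SampleName and Filename columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column in ("SampleName", "Filename"):
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: empty {column}")
    # Filename should be unique across the entire analysis.
    counts = Counter(row.get("Filename") for row in rows if row.get("Filename"))
    for filename, count in counts.items():
        if count > 1:
            problems.append(f"duplicate Filename: {filename}")
    return problems

sample = (
    "SampleName,Filename,Condition\n"
    "Sample1,Sample1_S1_L001_R1_001.fastq.gz,Control\n"
    "Sample1,Sample1_S1_L001_R2_001.fastq.gz,Control\n"
    "Sample2,Sample1_S1_L001_R1_001.fastq.gz,Treatment\n"
    ",Sample3_S3_L001_R1_001.fastq.gz,Treatment\n"
)
problems = check_required_cells(sample)
for problem in problems:
    print(problem)
```

For this sample the check reports an empty SampleName on line 5 and a duplicated Filename.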

6. Example CSV Layout

  • SampleName: Unique for each row

  • Filename: Matches each FASTQ file exactly

  • Condition, Tissue, Treatment, etc.: Additional metadata columns

  • No spaces or disallowed characters in headers or required cell values

single-end data:

SampleName,Filename,Condition,Tissue

Sample1,Sample1_S1_L001_R1_001.fastq.gz,Control,Liver

Sample2,Sample2_S2_L001_R1_001.fastq.gz,Control,Liver

Sample3,Sample3_S3_L001_R1_001.fastq.gz,Control,Liver

Sample4,Sample4_S4_L001_R1_001.fastq.gz,Treatment,Liver

Sample5,Sample5_S5_L001_R1_001.fastq.gz,Treatment,Liver

Sample6,Sample6_S6_L001_R1_001.fastq.gz,Treatment,Liver

Sample7,Sample7_S7_L001_R1_001.fastq.gz,Control,Heart

Sample8,Sample8_S8_L001_R1_001.fastq.gz,Control,Heart

Sample9,Sample9_S9_L001_R1_001.fastq.gz,Control,Heart

Sample10,Sample10_S10_L001_R1_001.fastq.gz,Treatment,Heart

Sample11,Sample11_S11_L001_R1_001.fastq.gz,Treatment,Heart

Sample12,Sample12_S12_L001_R1_001.fastq.gz,Treatment,Heart

paired-end data:

SampleName,Filename,Condition,Tissue

Sample1,Sample1_S1_L001_R1_001.fastq.gz,Control,Liver

Sample1,Sample1_S1_L001_R2_001.fastq.gz,Control,Liver

Sample2,Sample2_S2_L001_R1_001.fastq.gz,Control,Liver

Sample2,Sample2_S2_L001_R2_001.fastq.gz,Control,Liver

Sample3,Sample3_S3_L001_R1_001.fastq.gz,Control,Liver

Sample3,Sample3_S3_L001_R2_001.fastq.gz,Control,Liver

Sample4,Sample4_S4_L001_R1_001.fastq.gz,Treatment,Liver

Sample4,Sample4_S4_L001_R2_001.fastq.gz,Treatment,Liver

Sample5,Sample5_S5_L001_R1_001.fastq.gz,Treatment,Liver

Sample5,Sample5_S5_L001_R2_001.fastq.gz,Treatment,Liver

Sample6,Sample6_S6_L001_R1_001.fastq.gz,Treatment,Liver

Sample6,Sample6_S6_L001_R2_001.fastq.gz,Treatment,Liver

Sample7,Sample7_S7_L001_R1_001.fastq.gz,Control,Heart

Sample7,Sample7_S7_L001_R2_001.fastq.gz,Control,Heart

Sample8,Sample8_S8_L001_R1_001.fastq.gz,Control,Heart

Sample8,Sample8_S8_L001_R2_001.fastq.gz,Control,Heart

Sample9,Sample9_S9_L001_R1_001.fastq.gz,Control,Heart

Sample9,Sample9_S9_L001_R2_001.fastq.gz,Control,Heart

Sample10,Sample10_S10_L001_R1_001.fastq.gz,Treatment,Heart

Sample10,Sample10_S10_L001_R2_001.fastq.gz,Treatment,Heart

Sample11,Sample11_S11_L001_R1_001.fastq.gz,Treatment,Heart

Sample11,Sample11_S11_L001_R2_001.fastq.gz,Treatment,Heart

Sample12,Sample12_S12_L001_R1_001.fastq.gz,Treatment,Heart

Sample12,Sample12_S12_L001_R2_001.fastq.gz,Treatment,Heart
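
For paired-end layouts like the one above, it can help to confirm that every SampleName lists exactly two files before uploading. The following is a minimal sketch; files_per_sample is a hypothetical helper, and the rule of two files per sample is an assumption based on the R1/R2 example rather than a documented Trovomics check.

```python
import csv
import io
from collections import defaultdict

def files_per_sample(csv_text):
    """Group Filename values by SampleName, preserving row order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["SampleName"]].append(row["Filename"])
    return dict(groups)

paired = (
    "SampleName,Filename,Condition,Tissue\n"
    "Sample1,Sample1_S1_L001_R1_001.fastq.gz,Control,Liver\n"
    "Sample1,Sample1_S1_L001_R2_001.fastq.gz,Control,Liver\n"
    "Sample2,Sample2_S2_L001_R1_001.fastq.gz,Control,Liver\n"
)
groups = files_per_sample(paired)
# Samples that do not have exactly two files (assumed: one R1 and one R2).
unpaired = [name for name, files in groups.items() if len(files) != 2]
print(unpaired)  # ['Sample2']
```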

7. Final Notes

Adhering to these guidelines ensures that your omics metadata can be processed reliably by Trovomics and the underlying R scripts. Properly formatted CSV files avoid errors, streamline your analysis workflow, and reduce troubleshooting steps.

If you have any questions or encounter persistent validation failures, please contact user support at support@trovomics.com.

Thank you for helping us maintain a robust and efficient environment for your omics research. We look forward to supporting your scientific discoveries!
