Skip to contents

Intro

This vignette will deal with how to properly and in a reproducible way work your phenotypic and genotypic data in a format compatible with R/qtl, a widely used package in linkage mapping.

The main objective of this article is to demonstrate the use of the format_qtl_input function. This function formats the genotype, map, and phenotype files for QTL analysis in R/qtl. It allows the user to specify whether the genotype data should be converted from dosage to ABH format or used as is.

So we need 3 critical files:

  • genotype file
  • map
  • phenotype file

Let’s get those and it should be really easy then to use the function to put them together in a format that R/qtl can understand.

The main advantage of using this function is that it automates the integration of map, phenotype, and genotype data. If you need to redo the phenotypic analysis later, having a script that allows you to reproducibly update your files will make the process much easier.

Data

Phenotype

We continue with the data of our previous script.

# Phenotype
url1 <- "https://raw.githubusercontent.com/vegaalfaro/geneticMapRFiles/refs/heads/main/phenotype/phenotypes-binary.csv"

# Download file
if (!file.exists("local_copy_phenotypes-binary.csv")) {
  download.file(url1, destfile = "local_copy_phenotypes-binary.csv")}

# Read
pheno <- read.csv("local_copy_phenotypes-binary.csv")

# Most important variables
vars <- names(pheno)[c(9, 10, 16, 22, 23, 24, 26)]
response_vars <- vars #make a copy

knitr::kable(head(pheno[, 1:7]), caption = "Preview of the first 7 columns of the phenotype dataset")
Preview of the first 7 columns of the phenotype dataset
ID Row UID Gen Geno F1Num Scale
1001 1204 1021-MAP-2024 F2 1592-ConicalxRound-F2 1592 955
1002 1204 1022-MAP-2024 F2 1592-ConicalxRound-F2 1592 955
1003 1204 1023-MAP-2024 F2 1592-ConicalxRound-F2 1592 945
1004 1204 1024-MAP-2024 F2 1592-ConicalxRound-F2 1592 947
1005 1204 1025-MAP-2024 F2 1592-ConicalxRound-F2 1592 955
1006 1204 1026-MAP-2024 F2 1592-ConicalxRound-F2 1592 955

In addition, we will need some meta-data available from the same data repository geneticMapRFiles. The file ID contains the column VCF_name which will match the ID in the genotype matrix. The phenotype and the genotype files share the same individuals but the names are slightly different. This ID file, 1001-F2-Alpha-A1, 1002-F2-Alpha-B1, 1003-F2-Alpha-C1). They share parto of the name but we need to make them match.

url2 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/IDs/ID.csv"

# Download file
if (!file.exists("local_copy_ID.csv")) {
  download.file(url2, destfile = "local_copy_ID.csv")}

# Read
ID_Pedigrees <- read.csv("local_copy_ID.csv")

# Make as character
ID_Pedigrees$ID <- as.character(ID_Pedigrees$ID)

head(ID_Pedigrees)
#>     ID Sample_Name Population Type         VCF_name
#> 1 1001     1001-F2       1592   F2 1001-F2-Alpha-A1
#> 2 1002     1002-F2       1592   F2 1002-F2-Alpha-B1
#> 3 1003     1003-F2       1592   F2 1003-F2-Alpha-C1
#> 4 1004     1004-F2       1592   F2 1004-F2-Alpha-D1
#> 5 1005     1005-F2       1592   F2 1005-F2-Alpha-E1
#> 6 1006     1006-F2       1592   F2 1006-F2-Alpha-F1
knitr::kable(head(ID_Pedigrees), caption = "Preview of the ID Meta-data")
Preview of the ID Meta-data
ID Sample_Name Population Type VCF_name
1001 1001-F2 1592 F2 1001-F2-Alpha-A1
1002 1002-F2 1592 F2 1002-F2-Alpha-B1
1003 1003-F2 1592 F2 1003-F2-Alpha-C1
1004 1004-F2 1592 F2 1004-F2-Alpha-D1
1005 1005-F2 1592 F2 1005-F2-Alpha-E1
1006 1006-F2 1592 F2 1006-F2-Alpha-F1

Genotype data

We’ll load the ordered genotype matrix that we worked on in the previous scripts and is avaialbe in the the GitHub repository geneticMapRFiles.

url3 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/R_data/genotype_matrices_hmm_pop2-2025-03-12.RData"

download.file(url3, destfile = "local_copy_geno.RData", mode = "wb")

load("local_copy_geno.RData")

Maps

These code loads the maps that we worked on and saved in the previous script.

url4 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/R_data/genetic_maps_pop2-2025-06-06.RData"

download.file(url4, destfile = "local_copy_maps.RData", mode = "wb")

load("local_copy_maps.RData")

Prepare data for R/QTL

Phenotype

This section filters the dataset to include only F2 individuals, excluding parents and F1s. It also matches phenotype IDs to their corresponding VCF IDs for integration with genotypic data, and selects only the relevant columns required for downstream analysis in R/qtl.

# We subset like we did in the previous script.
Pop2 <- pheno[pheno$Geno %in% c("1629-CylindricalxFlat-F2", 
                                "Cylindra", 
                                "F1-1629", 
                                "Mono"), ]

# Keep only F2 Individuals
Pop2_progeny <- Pop2[Pop2$Gen == "F2",]
Pop2_progeny$ID <- as.character(Pop2_progeny$ID) # Declare ID as character

# Add VCF ID (ID that matches the geno file)
Pop2_progeny <- Pop2_progeny %>%
  left_join(ID_Pedigrees %>% select(ID, VCF_name), by = "ID")

# Ensure "VCF_name" is included in the selection
keep <- c(vars, "VCF_name")

# Select only the desired columns and rename VCF_name to ID
Pop2_progeny <- Pop2_progeny %>%
  ungroup() %>%  # Remove any existing grouping (Annoying dplyr thing)
  dplyr::select(all_of(keep)) %>%
  rename(ID = VCF_name)

knitr::kable(head(Pop2_progeny), caption = "Preview of the columns of the phenotypic data")
Preview of the columns of the phenotypic data
biomass max_width width_50 length length_width_ratio shoulder_top tip_angle_top ID
7046.486 89.0997 79.57 111.6102 1.25 128.2565 41.445822 2001-F2-Beta-A10
5883.917 72.3297 72.12 105.1587 1.45 108.9967 49.626571 2002-F2-Beta-B10
8650.555 88.8501 60.36 149.8485 1.69 124.2265 57.371546 2003-F2-Beta-C10
8809.267 86.2728 85.23 135.5865 1.57 197.8657 44.441035 2004-F2-Beta-D10
10023.938 110.4552 110.35 119.0250 1.08 341.3724 62.982283 2005-F2-Beta-E10
10861.375 92.2389 86.41 170.6049 1.85 256.2808 5.314546 2006-F2-Beta-F10

Please notice that the phenotype file Pop2_progeny of the which we got a glimpse, has the ID at the end. The ID column is not really needed for R/qtl, in fact it detects it as another phenotype trait. So we have the phenotype ready. Let’s move on to the map file.

Map

We will use the map that uses the reference genome order. The function format_qtl_input requires the columns marker, chrom and position. So we’ll change our file slightly to accomodate the requirements.

# Map
map2 <- MAP_Ref_Genome_Order %>%
  dplyr::mutate(marker = paste0(chrom, "_", position_Mb)) %>%
  dplyr::select(marker, chrom, position = position_cM)

Format data with format_qtl_input

The code below uses the format_qtl_input to convert the phenotype, genetic map and genotype matrix in a format that can be used to load the data in R/qtl.

# Example usage
 result <- format_qtl_input(geno_genome_order, map2, Pop2_progeny, numeric = TRUE)
#> Warning in format_qtl_input(geno_genome_order, map2, Pop2_progeny, numeric =
#> TRUE): Warning: Some IDs in pheno do not match column names in geno.

# Save results as csv 
 write.csv(result, file = "../data/rqtl.csv", row.names = FALSE)

Let’s look at the data.

knitr::kable(head(result[,1:15]), caption = "Preview of csv file ready for R/qtl")
Preview of csv file ready for R/qtl
biomass max_width width_50 length length_width_ratio shoulder_top tip_angle_top ID CHR1_98645 CHR1_235900 CHR1_312228 CHR1_473415 CHR1_635793 CHR1_672859 CHR1_672878
CHR1 CHR1 CHR1 CHR1 CHR1 CHR1 CHR1
0.00 0.53 1.05 1.59 2.68 3.20 3.20
7046.4863 89.0997 79.57 111.6102 1.25 128.2565 41.44582176 2001-F2-Beta-A10 A A A A A A A
5883.9174 72.3297 72.12 105.1587 1.45 108.9967 49.62657071 2002-F2-Beta-B10 A A A A A A A
8650.5549 88.8501 60.36 149.8485 1.69 124.2265 57.37154613 2003-F2-Beta-C10 A A A A A A A
8809.2671 86.2728 85.23 135.5865 1.57 197.8657 44.44103452 2004-F2-Beta-D10 A A A A A A A

If you look through the full rqtl.csv file, you’ll notice a few individuals have phenotypic data but no genotypic data. This is expected—some didn’t meet the quality criteria for genotyping, were duplicates, or were recorded but not actually sequenced. No problem: format_input_qtl handled this by assigning NAs, which rqtl will automatically ignore.