Intro
This vignette will deal with how to properly and in a reproducible way work your phenotypic and genotypic data in a format compatible with R/qtl, a widely used package in linkage mapping.
The main objective of this article is to demonstrate
the use of the format_qtl_input
function. This function
formats the genotype, map, and phenotype files for QTL analysis in
R/qtl. It allows the user to specify whether the genotype data should be
converted from dosage to ABH format or used as is.
So we need 3 critical files:
- genotype file
- map
- phenotype file
Let’s get those and it should be really easy then to use the function to put them together in a format that R/qtl can understand.
The main advantage of using this function is that it automates the integration of map, phenotype, and genotype data. If you need to redo the phenotypic analysis later, having a script that allows you to reproducibly update your files will make the process much easier.
Data
Phenotype
We continue with the data of our previous script.
# Phenotype
url1 <- "https://raw.githubusercontent.com/vegaalfaro/geneticMapRFiles/refs/heads/main/phenotype/phenotypes-binary.csv"
# Download file
if (!file.exists("local_copy_phenotypes-binary.csv")) {
download.file(url1, destfile = "local_copy_phenotypes-binary.csv")}
# Read
pheno <- read.csv("local_copy_phenotypes-binary.csv")
# Most important variables
vars <- names(pheno)[c(9, 10, 16, 22, 23, 24, 26)]
response_vars <- vars #make a copy
knitr::kable(head(pheno[, 1:7]), caption = "Preview of the first 7 columns of the phenotype dataset")
ID | Row | UID | Gen | Geno | F1Num | Scale |
---|---|---|---|---|---|---|
1001 | 1204 | 1021-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 955 |
1002 | 1204 | 1022-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 955 |
1003 | 1204 | 1023-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 945 |
1004 | 1204 | 1024-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 947 |
1005 | 1204 | 1025-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 955 |
1006 | 1204 | 1026-MAP-2024 | F2 | 1592-ConicalxRound-F2 | 1592 | 955 |
In addition, we will need some meta-data available from the same data
repository geneticMapRFiles
. The file ID contains the
column VCF_name
which will match the ID in the genotype
matrix. The phenotype and the genotype files share the same individuals
but the names are slightly different. This ID file, 1001-F2-Alpha-A1,
1002-F2-Alpha-B1, 1003-F2-Alpha-C1). They share parto of the name but we
need to make them match.
url2 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/IDs/ID.csv"
# Download file
if (!file.exists("local_copy_ID.csv")) {
download.file(url2, destfile = "local_copy_ID.csv")}
# Read
ID_Pedigrees <- read.csv("local_copy_ID.csv")
# Make as character
ID_Pedigrees$ID <- as.character(ID_Pedigrees$ID)
head(ID_Pedigrees)
#> ID Sample_Name Population Type VCF_name
#> 1 1001 1001-F2 1592 F2 1001-F2-Alpha-A1
#> 2 1002 1002-F2 1592 F2 1002-F2-Alpha-B1
#> 3 1003 1003-F2 1592 F2 1003-F2-Alpha-C1
#> 4 1004 1004-F2 1592 F2 1004-F2-Alpha-D1
#> 5 1005 1005-F2 1592 F2 1005-F2-Alpha-E1
#> 6 1006 1006-F2 1592 F2 1006-F2-Alpha-F1
knitr::kable(head(ID_Pedigrees), caption = "Preview of the ID Meta-data")
ID | Sample_Name | Population | Type | VCF_name |
---|---|---|---|---|
1001 | 1001-F2 | 1592 | F2 | 1001-F2-Alpha-A1 |
1002 | 1002-F2 | 1592 | F2 | 1002-F2-Alpha-B1 |
1003 | 1003-F2 | 1592 | F2 | 1003-F2-Alpha-C1 |
1004 | 1004-F2 | 1592 | F2 | 1004-F2-Alpha-D1 |
1005 | 1005-F2 | 1592 | F2 | 1005-F2-Alpha-E1 |
1006 | 1006-F2 | 1592 | F2 | 1006-F2-Alpha-F1 |
Genotype data
We’ll load the ordered genotype matrix that we worked on in the previous scripts and is avaialbe in the the GitHub repository geneticMapRFiles.
url3 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/R_data/genotype_matrices_hmm_pop2-2025-03-12.RData"
download.file(url3, destfile = "local_copy_geno.RData", mode = "wb")
load("local_copy_geno.RData")
Maps
These code loads the maps that we worked on and saved in the previous script.
url4 <- "https://github.com/vegaalfaro/geneticMapRFiles/raw/main/R_data/genetic_maps_pop2-2025-06-06.RData"
download.file(url4, destfile = "local_copy_maps.RData", mode = "wb")
load("local_copy_maps.RData")
Prepare data for R/QTL
Phenotype
This section filters the dataset to include only F2 individuals, excluding parents and F1s. It also matches phenotype IDs to their corresponding VCF IDs for integration with genotypic data, and selects only the relevant columns required for downstream analysis in R/qtl.
# We subset like we did in the previous script.
Pop2 <- pheno[pheno$Geno %in% c("1629-CylindricalxFlat-F2",
"Cylindra",
"F1-1629",
"Mono"), ]
# Keep only F2 Individuals
Pop2_progeny <- Pop2[Pop2$Gen == "F2",]
Pop2_progeny$ID <- as.character(Pop2_progeny$ID) # Declare ID as character
# Add VCF ID (ID that matches the geno file)
Pop2_progeny <- Pop2_progeny %>%
left_join(ID_Pedigrees %>% select(ID, VCF_name), by = "ID")
# Ensure "VCF_name" is included in the selection
keep <- c(vars, "VCF_name")
# Select only the desired columns and rename VCF_name to ID
Pop2_progeny <- Pop2_progeny %>%
ungroup() %>% # Remove any existing grouping (Annoying dplyr thing)
dplyr::select(all_of(keep)) %>%
rename(ID = VCF_name)
knitr::kable(head(Pop2_progeny), caption = "Preview of the columns of the phenotypic data")
biomass | max_width | width_50 | length | length_width_ratio | shoulder_top | tip_angle_top | ID |
---|---|---|---|---|---|---|---|
7046.486 | 89.0997 | 79.57 | 111.6102 | 1.25 | 128.2565 | 41.445822 | 2001-F2-Beta-A10 |
5883.917 | 72.3297 | 72.12 | 105.1587 | 1.45 | 108.9967 | 49.626571 | 2002-F2-Beta-B10 |
8650.555 | 88.8501 | 60.36 | 149.8485 | 1.69 | 124.2265 | 57.371546 | 2003-F2-Beta-C10 |
8809.267 | 86.2728 | 85.23 | 135.5865 | 1.57 | 197.8657 | 44.441035 | 2004-F2-Beta-D10 |
10023.938 | 110.4552 | 110.35 | 119.0250 | 1.08 | 341.3724 | 62.982283 | 2005-F2-Beta-E10 |
10861.375 | 92.2389 | 86.41 | 170.6049 | 1.85 | 256.2808 | 5.314546 | 2006-F2-Beta-F10 |
Please notice that the phenotype file Pop2_progeny
of
the which we got a glimpse, has the ID at the end. The ID column is not
really needed for R/qtl, in fact it detects it as another phenotype
trait. So we have the phenotype ready. Let’s move on to the map
file.
Format data with format_qtl_input
The code below uses the format_qtl_input
to convert the
phenotype, genetic map and genotype matrix in a format that can be used
to load the data in R/qtl.
# Example usage
result <- format_qtl_input(geno_genome_order, map2, Pop2_progeny, numeric = TRUE)
#> Warning in format_qtl_input(geno_genome_order, map2, Pop2_progeny, numeric =
#> TRUE): Warning: Some IDs in pheno do not match column names in geno.
# Save results as csv
write.csv(result, file = "../data/rqtl.csv", row.names = FALSE)
Let’s look at the data.
biomass | max_width | width_50 | length | length_width_ratio | shoulder_top | tip_angle_top | ID | CHR1_98645 | CHR1_235900 | CHR1_312228 | CHR1_473415 | CHR1_635793 | CHR1_672859 | CHR1_672878 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CHR1 | CHR1 | CHR1 | CHR1 | CHR1 | CHR1 | CHR1 | ||||||||
0.00 | 0.53 | 1.05 | 1.59 | 2.68 | 3.20 | 3.20 | ||||||||
7046.4863 | 89.0997 | 79.57 | 111.6102 | 1.25 | 128.2565 | 41.44582176 | 2001-F2-Beta-A10 | A | A | A | A | A | A | A |
5883.9174 | 72.3297 | 72.12 | 105.1587 | 1.45 | 108.9967 | 49.62657071 | 2002-F2-Beta-B10 | A | A | A | A | A | A | A |
8650.5549 | 88.8501 | 60.36 | 149.8485 | 1.69 | 124.2265 | 57.37154613 | 2003-F2-Beta-C10 | A | A | A | A | A | A | A |
8809.2671 | 86.2728 | 85.23 | 135.5865 | 1.57 | 197.8657 | 44.44103452 | 2004-F2-Beta-D10 | A | A | A | A | A | A | A |
If you look through the full rqtl.csv
file, you’ll
notice a few individuals have phenotypic data but no genotypic data.
This is expected—some didn’t meet the quality criteria for genotyping,
were duplicates, or were recorded but not actually sequenced. No
problem: format_input_qtl
handled this by assigning
NAs
, which rqtl will automatically ignore.