Binning SNP Markers

Introduction

This vignette demonstrates how to bin SNP markers based on linkage disequilibrium (LD) and detect genetic duplicates in a linkage map F₂ table beet population using the geneticMapR package and its dependencies. Specially useful is MapRtools.

Setup

In case you need to install packages, here’s how to do it! The package includes helper functions to check and install from CRAN or GitHub. Feel free to use them as shown below if you need to install packages needed for following along this article.

# Ensure devtools is available for GitHub installs
load_or_install_cran("devtools")

# Install/load GitHub packages
load_or_install_github("MapRtools", "jendelman/MapRtools")
load_or_install_github("geneticMapR", "vegaalfaro/geneticMapR")

# Install/load CRAN packages
load_or_install_cran("ggdendro")
load_or_install_cran("stringr")
load_or_install_cran("ggplot2")

Load Libraries

library(geneticMapR)
library(MapRtools)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(ggdendro)
library(stringr)

Load Data

We begin by loading a preprocessed genotype dataset from our data repository geneticMapRFiles

geno_matrices_url <- "https://raw.githubusercontent.com/vegaalfaro/geneticMapRFiles/main/R_data/filtered_geno_matrices_1629.RData"

# Download file
if (!file.exists("local_copy.filtered_geno_matrices_1629.RData")) {
  download.file(geno_matrices_url, 
                destfile = "local_copy.filtered_geno_matrices_1629.RData")
}

load("local_copy.filtered_geno_matrices_1629.RData")

Identify Genetic Duplicates

It’s often useful to include known duplicates in your dataset to assess GBS quality. Generally it’s good to have a system that allows you to figure out if you inadvertently included a genetic duplicate. In this data, I included a duplicate of the F1 individual. Occasionally, there is human error and some genetic duplicates are included in the experiment inadvertently.

For convenience in the code below, we search for and assign a name to the F1 and parental individuals.

# Assign names
P1 <- "P2550-Cylindra-P1-Theta-A9"
P2 <- "P2493-Mono-P2-Theta-B9"
F1s <- c("7001-F1-Beta-H9", "7002-F1-Gamma-F11")

rename_geno_matrix() standardizes the column names of a genotype matrix by renaming parental genotypes (P1, P2), F1 individuals (F1.1, F1.2, …), and F₂ individuals (F2.1, F2.2, …). This shortens the names as they are quite large in the real data example and they get in the way of visualization. We’ll change the names temporarily for visualization purposes.

# First make copy of geno
geno <- het_phased_geno_1629_filt
# Rename
geno <- rename_geno_matrix(geno, P1, P2, F1s)

We will use a dendrogram to determine genetic duplicates. The closer two samples are to the lowest branches of the dedrogram, the more related they are more likely to be duplicates. In this case, we can see the two identical F1s (on the far right hand side) are on the second-lowest branches. In addition, F24 and F25 appear to be even more more closely related and are likely duplicates included in the experiment because of human error. In most projects I have worked on, surprises like this are not uncommon.

# Estimate distance matrix using  genotype matrix
distance_matrix <- dist(t(geno))

# Perform hierarchical clustering
hclust_result <- hclust(distance_matrix, method = "single")

# Convert hclust object to dendrogram data for better visualization
dendro_data <- ggdendro::dendro_data(hclust_result)

# Plot dendrogram using ggplot2
dendro <- ggplot() +
  geom_segment(data = dendro_data$segments, 
               aes(x = x, y = y, xend = xend, yend = yend), 
               color = "blue") +
  geom_text(data = dendro_data$labels, 
            aes(x = x, y = y, label = label), 
            hjust = 1, angle = 90, size = 2) +
  labs(title = "Hierarchical Clustering Dendrogram",
       x = "Samples", y = "Distance") +
  theme_test() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

dendro

dendro plot

Filter Parents and F1 from the Dataset

To avoid bias during LD binning, we remove the parents and F₁ individuals. We’ll keep the two F_2s which seem to be genetic duplicates for now.

hom_phased_geno_1629 <- drop_parents(hom_phased_geno_1629, P1, P2, F1 = F1s)
het_phased_geno_1629_filt <- drop_parents(het_phased_geno_1629_filt, P1, P2, F1 = F1s)

Bin Homozygous Markers

We begin by cleaning the homozygous matrix and removing non-informative markers.

# Identify monomorphic markers
mono_markers <- apply(hom_phased_geno_1629, 1, function(x) length(unique(na.omit(x))) == 1)
# Identify zero variance markers
hom_phased_geno_1629 <- hom_phased_geno_1629[apply(hom_phased_geno_1629, 1, var) > 0, ]
# Filter them out from matrix
hom_phased_geno_1629 <- hom_phased_geno_1629[!mono_markers, ]
# If genotype matrix contains missing genotypes, remove them
hom_phased_geno_1629 <- hom_phased_geno_1629[complete.cases(hom_phased_geno_1629), ]

# bin markers using MapRtools::LDbin
LDbin_hom <- LDbin(hom_phased_geno_1629, r2.thresh = 0.99)
# Extract binned genotype matrix
geno.hom.bin <- LDbin_hom$geno

# Check dimensions of binned genotype
dim(geno.hom.bin)
#> [1] 998 100

Visualize Binned Homozygous Markers

map_hom <- extract_map(geno.hom.bin)
geneticMapR::plot_cover(map = map_hom, customize = TRUE) + 
  ggtitle("Binned Homozygous Markers")

coverageplot

Bin Heterozygous Markers

We now repeat the binning process for heterozygous markers.

LDbin_het <- LDbin(het_phased_geno_1629_filt, r2.thresh = 0.99)
geno.het.bin <- LDbin_het$geno
dim(geno.het.bin)
#> [1] 2480  100

Visualize Binned Heterozygous Markers

map_het <- extract_map(geno.het.bin)
geneticMapR::plot_cover(map = map_het, customize = TRUE) + 
  ggtitle("Binned Heterozygous Markers")

coverageplot

Mrker Filtering Tradeoff

This real-data example shows once more how filtering for homozygous vs. heterozygous markers affects genome coverage. The balance between marker type and map resolution.

Save Results

save(geno.het.bin, 
     geno.hom.bin,  
     dendro,
     file = "processed_data/R_data/binned_geno_1629.RData")

Conclusion

This vignette illustrates how to identify duplicate individuals in your dataset and perform LD-based marker binning using tools available in the geneticMapR workflow.

Session Information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1          ggdendro_0.2.0         ggplot2_3.5.2         
#> [4] tibble_3.3.0           tidyr_1.3.1            dplyr_1.1.4           
#> [7] MapRtools_0.36         geneticMapR_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.52            bslib_0.9.0         
#>  [4] htmlwidgets_1.6.4    rstatix_0.7.2        lattice_0.22-7      
#>  [7] vctrs_0.6.5          tools_4.5.1          generics_0.1.4      
#> [10] parallel_4.5.1       ca_0.71.1            pkgconfig_2.0.3     
#> [13] Matrix_1.7-3         RColorBrewer_1.1-3   desc_1.4.3          
#> [16] distributional_0.5.0 lifecycle_1.0.4      scam_1.2-19         
#> [19] splines2_0.5.4       compiler_4.5.1       farver_2.1.2        
#> [22] CVXR_1.0-15          textshaping_1.0.1    codetools_0.2-20    
#> [25] carData_3.0-5        seriation_1.5.7      htmltools_0.5.8.1   
#> [28] sass_0.4.10          yaml_2.3.10          gmp_0.7-5           
#> [31] Formula_1.2-5        pillar_1.11.0        pkgdown_2.1.3       
#> [34] car_3.1-3            ggpubr_0.6.1         jquerylib_0.1.4     
#> [37] MASS_7.3-65          cachem_1.1.0         iterators_1.0.14    
#> [40] foreach_1.5.2        TSP_1.2-5            abind_1.4-8         
#> [43] nlme_3.1-168         tidyselect_1.2.1     digest_0.6.37       
#> [46] stringi_1.8.7        purrr_1.1.0          labeling_0.4.3      
#> [49] splines_4.5.1        fastmap_1.2.0        grid_4.5.1          
#> [52] cli_3.6.5            magrittr_2.0.3       broom_1.0.8         
#> [55] withr_3.0.2          Rmpfr_1.1-1          scales_1.4.0        
#> [58] backports_1.5.0      bit64_4.6.0-1        registry_0.5-1      
#> [61] rmarkdown_2.29       bit_4.6.0            HMM_1.0.2           
#> [64] ggsignif_0.6.4       ragg_1.4.0           evaluate_1.0.4      
#> [67] knitr_1.50           ggdist_3.3.3         mgcv_1.9-3          
#> [70] rlang_1.1.6          Rcpp_1.1.0           glue_1.8.0          
#> [73] jsonlite_2.0.0       R6_2.6.1             systemfonts_1.2.3   
#> [76] fs_1.6.6

Andrey Vega Alfaro

2025-05-05