Does a large effective population size result in faster decay of linkage disequilibrium (LD)?

I am thinking about an invasive species that was introduced into North America just a few (<20) generations ago. Using microsatellite genotypes (105 loci), I have identified almost no significant linkage disequilibrium across the genome. I'm tempted to speculate that rapid population expansion has caused LD to decay this quickly, but I'm not sure whether this is a reasonable train of thought. Can anybody think of why this might or might not be the case?

It is certainly possible: rapid population growth does reduce LD. From Slatkin (1994):

In a rapidly growing population, however, there will be little chance of finding significant nonrandom associations even between completely linked loci if the growth has been sufficiently rapid.

Or Nature Reviews, 2002

Przeworski… showed that population growth tends to decrease the extent of LD, especially for longer periods of growth. By contrast, population subdivision tends to increase the extent of LD, especially when a sample contains individuals from several strongly differentiated subpopulations.

See also box 1:

Rapid population growth decreases LD by reducing genetic drift.
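
To make the link between population size and LD concrete, here is a minimal Python sketch. It is not taken from either cited paper. It assumes Sved's (1971) approximation for the expected r² at drift-recombination equilibrium, E[r²] ≈ 1/(1 + 4·Ne·c), and the textbook decay of the disequilibrium coefficient by recombination alone, D_t = D_0·(1 − c)^t. The function names are illustrative.

```python
def expected_r2(ne, c):
    """Approximate equilibrium E[r^2] for effective size ne and recombination fraction c.

    Assumption: Sved's approximation, ignoring mutation and sampling error.
    """
    return 1.0 / (1.0 + 4.0 * ne * c)


def decayed_d(d0, c, generations):
    """Decay of D by recombination alone in a large random-mating population."""
    return d0 * (1.0 - c) ** generations


if __name__ == "__main__":
    # Larger Ne means less drift and therefore a lower expected r^2 at the same distance.
    for ne in (50, 500, 5000):
        print(f"Ne={ne}: expected r^2 at c=0.01 is {expected_r2(ne, 0.01):.4f}")
    # Unlinked loci (c = 0.5): D halves every generation.
    print(f"D after 20 generations, unlinked loci: {decayed_d(0.25, 0.5, 20):.2e}")
```

Under these assumptions, associations between unlinked loci halve each generation whatever the population size, so for markers on different chromosomes the second function is the relevant one; population size matters mainly for how much new LD drift keeps generating.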

Estimates of linkage disequilibrium and effective population size in rainbow trout

The use of molecular genetic technologies for broodstock management and selective breeding of aquaculture species is becoming increasingly common with the continued development of genome tools and reagents. Several laboratories have produced genetic maps for rainbow trout to aid in the identification of loci affecting phenotypes of interest. These maps have resulted in the identification of many quantitative/qualitative trait loci affecting phenotypic variation in traits associated with albinism, disease resistance, temperature tolerance, sex determination, embryonic development rate, spawning date, condition factor and growth. Unfortunately, the elucidation of the precise allelic variation and/or genes underlying phenotypic diversity has yet to be achieved in this species, which has low marker densities and lacks a whole-genome reference sequence. Experimental designs which integrate segregation analyses with linkage disequilibrium (LD) approaches facilitate the discovery of genes affecting important traits. To date the extent of LD has been characterized for humans and several agriculturally important livestock species but not for rainbow trout.


We observed that the level of LD between syntenic loci decayed rapidly at distances greater than 2 cM which is similar to observations of LD in other agriculturally important species including cattle, sheep, pigs and chickens. However, in some cases significant LD was also observed up to 50 cM. Our estimate of effective population size based on genome wide estimates of LD for the NCCCWA broodstock population was 145, indicating that this population will respond well to high selection intensity. However, the range of effective population size based on individual chromosomes was 75.51 - 203.35, possibly indicating that suites of genes on each chromosome are disproportionately under selection pressures.
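
As an illustration of how an effective population size can be derived from LD, the sketch below inverts Sved's approximation, Ne ≈ (1/r² − 1)/(4c). This is an assumption made for illustration and not necessarily the estimator used in the study above. The optional sample-size correction (subtracting 1/n, with n the number of sampled haplotypes) and the rule of thumb that LD at recombination fraction c reflects Ne about 1/(2c) generations ago are also assumptions, and the function names are hypothetical.

```python
def ne_from_r2(r2, c, n_haplotypes=None):
    """Estimate Ne from mean r^2 at recombination fraction c (in Morgans).

    Assumption: Sved's approximation E[r^2] = 1 / (1 + 4 * Ne * c).
    If n_haplotypes is given, r^2 is first reduced by 1/n to remove the
    part expected from finite sampling alone (an assumed correction).
    """
    if n_haplotypes is not None:
        r2 = r2 - 1.0 / n_haplotypes
    if r2 <= 0 or c <= 0:
        raise ValueError("r2 (after correction) and c must be positive")
    return (1.0 / r2 - 1.0) / (4.0 * c)


def generations_ago(c):
    """Approximate age, in generations, of the Ne reflected by LD at distance c."""
    return 1.0 / (2.0 * c)


if __name__ == "__main__":
    # Hypothetical inputs: mean r^2 of 0.2 between markers 1 cM apart.
    print(ne_from_r2(0.2, 0.01))
    # Markers 4 cM apart inform on Ne roughly 12.5 generations back.
    print(generations_ago(0.04))
```

Because the estimate depends on 1/r², small errors in low r² values translate into large changes in Ne, which is one reason per-chromosome estimates can vary widely.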


Our results indicate that large numbers of markers, more than are currently available for this species, will be required to enable the use of genome-wide integrated mapping approaches aimed at identifying genes of interest in rainbow trout.

Estimation of Linkage Disequilibrium and Effective Population Size in Three Italian Autochthonous Beef Breeds

The authors have revised the manuscript according to most of the comments. From what I remember, the major concerns were the PCA analysis and the Discussion section resembling a long literature review. I am not convinced by the answer to the first and most important comment. The authors provide figures of 4 GRMs and mention that this is very analogous to conducting a PCA on a GRM. Well, it is not. They extracted a set of parameters from each GRM and analysed them together in a PCA to represent a PCA on a multi-breed GRM. As I mentioned before, it is incorrect and misleading. The authors are encouraged to perform a PCA on a combined GRM of the 4 breeds. Please refer to this article as an example: "Genome-wide linkage disequilibrium and genetic diversity in five populations of Australian domestic sheep".

Furthermore, heatmaps in Figure 1 should all follow the same colour scale (

AU: We thank the reviewers for their review. Please find below our responses to the points raised. All our responses are preceded by "AU". Changes in the manuscript are done in track changes and highlighted in yellow. We hope to find the new version of the manuscript suitable for publication in the "Animals" journal.

Looking forward to hearing from you.

AU: We are convinced that PCA can be conducted on any set of variables and that, in this case too, a PCA would be useful to give a general picture of these breeds. However, the PCA has been removed from the manuscript.

The heatmap colors have been modified as requested.

I think the authors made a strong effort to consider my comments.

They have obviously not understood my remark about PLINK.

They use two different spellings. This should be harmonized.

The official version seems to be "PLINK".

AU: The authors apologize for not understanding what the reviewer was pointing out. In the abstract, "plink" has been changed to "PLINK", as it is called in Materials and Methods.

The authors addressed all issues proposed by all reviewers and thus I recommend to accept the revised manuscript for publication.

AU: The authors thank the reviewer for the review.

The authors have done a major revision, and the manuscript has improved considerably. There are minor issues with the manuscript that require correction.

L42: "PON faces nowadays" to "PON faces"
L58: "usually is" to "it is"
L79: "the gene mapping" to "gene mapping"
L80: "three beef cattle" to "three Italian beef cattle"
L91: "and balanced by sex" to "with equal number of male and female samples" ?
L135: "Default options" to "The default options"
L136: "MAF was set to > 0.01". You did not set MAF to > 0.01. You considered alleles with MAF > 0.01. This sentence is repeated from L98. Please delete this sentence.
L137: "13 generations vs. 4000 Kbp" needs clarification.
L144: "by [39]" to "by Ohta and Kimura [39]"
L146: "and the lowest allele frequency used to 0.01". It is repeated a few times before.
L185-186: "0.104, 0.105, 0.106 ± 0.08 Mbp in CAL, MUP and PON, LIM respectively". 3 values for 4 breeds! Please be more cautious with mistakes after a few rounds of revision.
L205: Delete "and was intermediate for CAL and MUP".
L207: "fluctuations were present" to "more fluctuations were present"
L207-209: Delete these lines. You don't need to repeat the whole figure or the table completely in the text.
L222: "More precisely" compared to what? Please delete it.
L222: "PON and LIM presented the two opposite extremes" Are you sure that both of them are extremes?! You may say, one showing the highest, and the other showing the lowest.
L241: "Similarly" to "Similar"
L243: "while for PON Ne" to "while for PON, Ne"
L245: "historical trend of LIM resulted extremely different than the local breeds" to "historical trend of LIM was very different than those for the local breeds"
Discussion: In a previous comment, I asked the authors to shorten the Discussion. This time, I skipped reading it. Whether 53 references were needed for this study, I don't know. The authors should consider that they are writing for the readers to read not to skip.

AU: We thank the reviewer for the constructive review. Please find below our responses to the points raised. All our responses are preceded by "AU". Changes in the manuscript are done in track changes. We hope to find the new version of the manuscript suitable for publication in the "Animals" journal.

Looking forward to hearing from you.

The authors have done a major revision, and the manuscript has improved considerably. There are minor issues with the manuscript that require correction.

L42: "PON faces nowadays" to "PON faces"

AU: Changed as suggested (L43)

L58: "usually is" to "it is"

AU: Changed as suggested (L60)

L79: "the gene mapping" to "gene mapping"

AU: Changed as suggested (L82)

L80: "three beef cattle" to "three Italian beef cattle"

AU: Changed as suggested (L83)

L91: "and balanced by sex" to "with equal number of male and female samples" ?

AU: the number of males and females has been added to the manuscript (L94)

L135: "Default options" to "The default options"

AU: Changed as suggested (L144)

L136: "MAF was set to > 0.01". You did not set MAF to > 0.01. You considered alleles with MAF > 0.01. This sentence is repeated from L98. Please delete this sentence.

AU: Changed as suggested (L144-145)

L137: "13 generations vs. 4000 Kbp" needs clarification.

AU: These parameters are reported by the software and the cited authors

L144: "by [39]" to "by Ohta and Kimura [39]"

AU: Changed as suggested (L153)


L146: "and the lowest allele frequency used to 0.01". It is repeated a few times before.

AU: Changed as suggested (L155)

L185-186: "0.104, 0.105, 0.106 ± 0.08 Mbp in CAL, MUP and PON, LIM respectively". 3 values for 4 breeds! Please be more cautious with mistakes after a few rounds of revision.

AU: Values have been corrected. However, CAL and MUP presented the same value (0.105), see Table S2

L205: Delete "and was intermediate for CAL and MUP".

AU: Changed as suggested (L232)

L207: "fluctuations were present" to "more fluctuations were present"

AU: Changed as suggested (L234)

L207-209: Delete these lines. You don't need to repeat the whole figure or the table completely in the text.

AU: The sentence has been reduced (L234)

L222: "More precisely" compared to what? Please delete it.

AU: Changed as suggested (L252)

L222: "PON and LIM presented the two opposite extremes" Are you sure that both of them are extremes?! You may say, one showing the highest, and the other showing the lowest.

AU: The meaning of the sentence is that the two breeds are at the opposite extremes in the graph. The authors prefer to leave the sentence as it is in order not to repeat the expressions "high" and "low".

L241: "Similarly" to "Similar"

AU: Changed as suggested (L274)

L243: "while for PON Ne" to "while for PON, Ne"

AU: Changed as suggested (L278)

L245: "historical trend of LIM resulted extremely different than the local breeds" to "historical trend of LIM was very different than those for the local breeds"

AU: Changed as suggested (L280)

Discussion: In a previous comment, I asked the authors to shorten the Discussion. This time, I skipped reading it. Whether 53 references were needed for this study, I don't know. The authors should consider that they are writing for the readers to read not to skip.

AU: The authors thank the reviewer. We improved the Discussion, but we prefer to keep the included references so that interested readers can examine specific topics in depth.

The authors have duly addressed my point of criticism, so I recommend publication of the manuscript.

AU: We thank the reviewer for the constructive review.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

The manuscript is about the study of LD and effective population size in 3 local Italian cattle breeds and comparisons with Limousin breed. This is an interesting study. However, there are key points that need to be addressed.

1. The PCA analysis is incorrect. A PCA analysis based on descriptive statistics of the GRM does not have any information. What you need to do is to perform PCA on the GRM itself.
2. The English grammar and the scientific writing should definitely be improved. The manuscript is full of mistakes, and unfortunately, it is not written carefully.
3. There are long sentences in the manuscript, some sentences are not necessary. Keep it short and get to the point.
4. In the results section, authors tend to repeat the tables and figures in the text.
5. The discussion section looks like a literature review report rather than a discussion section. Some of the literature review can be moved to Introduction. Please discuss your findings and report relevant findings from the literature. The reader is interested to know about your findings, not reading a lengthy literature review.

Please see below for further comments.

Use capital letters in the title (e.g., beef -> Beef)
L18: "control" to "control group"
L20: "genotypes" to "genotype"
Change missingness to call rate.
"r2" and "Ne" should be italic everywhere.
"contemporary" to "current" everywhere
L24: "was found" to "was"
L24: "Calvana e" to "Calvana and"
L28: "across generations" to "across generations for local breeds"
L30,60: "demographical" are you sure that this is the right word?!
L37: "basin" .
L39: "(Calvana, CAL Mucca Pisana, MUP and Pontremolese, PON)"to "(Calvana (CAL), Mucca Pisana (MUP) and Pontremolese (PON))"
L42: "Pontremolese faces nowadays the" to "PON faces nowadays a"
L43: "of only few" to "a limited number of"
L43: "the breed" which breed?
L45: "comes" to "originates"
L51: "the region" to "this region"
L54: "the size of an ideal population" to "a population size". There is no ideal population or ideal population size!
L55: "change as the real population under study" to "changes as a real population"
L60: "the general" to "generally"
L66: "of alleles" to "between alleles"
L66: The sentence needs a reference.
L88: "analyzed" to "genotyped"
L92: "Limousin" to "Limousin, City". Mention the city where ANACLI is located.
L94: Use capital letter only for the first letter. The same for other headings.
L99 and elsewhere: "<" to " < " and ">" to " > "
L104: "relationships" to "identical by state relationships"
L104: "status" which status?
Eq1,L107,L108: bold Z
L107: italic "pi"
L108: "Matrix Z" to "Z"
L109: "matrix X which" to "the matrix that"
L111: "For each GRM" to "For the GRM of each breed"
L112 and elsewhere: "off diagonal" to "off-diagonal" and "diagonal" to "diagonal values"
L112: "values of the diagonal" to "of diagonal values"
L123: "values of the off diagonal" to "off-diagonal values"
L124: "absolute and squared root" of what?
L116: Please refer to my previous comment on the PCA.
L120: "better measure of LD because" to "better measure of LD than D', because"
L121: "change" to "changes"
L121: delete "as D'"
L121: "The r2 values range" to "The r2"
L122: delete this line
L123: Change it to "was calculated as:"
Eq2: Change freq X to freq(X) and * to x.
L125-126: Change them to "where freq(A), freq(a), freq(B) and freq(b) are the allele frequencies and freq(AB), freq(ab), freq(Ab) and freq(aB) are the genotype frequencies. The LD extent was"
L127: "in PLINK" to "using PLINK"
L128: More description is needed here.
L129: "Also, the" to "The"
L130: "Mbp: 0 - 0.25 0.25 - 0.5 0.5 - 0.75 0.75 - 1 Mbp" to "Mbp (0 - 0.25, 0.25 - 0.5 0.5 - 0.75 and 0.75 - 1 Mbp)"
L131: "had been" to "were" and "has been" to "was also"
L132: delete "also"
L132 and elsewhere "kb" to "Kbp"
L136: "recent effective population size" to "current Ne". After using an abbreviation, do not use the full term!
L137: "relationship" to "relationships"
L140: "4000kb" to "4000 Kbp"
L141: "50kb" to "50 Kbp"
L142: "4000 kb" to "4000 Kbp"
Eq3 and elsewhere: "NT(t)" to "Ne(t)"
L144: "estimated t generation ago" to "estimated for t generation ago". "t" should be italic.
L145: "Ct" to "ct" and "rate t" to "rate at t"
L147 and elsewhere: "cNe" to "Ne0"
L148: "v. 2" to "v.2" and "set on random option" to "set to random"
L154: "individuals for each breed" to "individuals from each breed"
L157-158: Change to "Table 1. Number of autosomal SNPs and individuals before (pre-) and after (post-) quality control (QC) per breed."
Table 1: Delete "Local" and "Commercial" rows.
L162: "The lower" to "The"
L164: "Calvana." to "CAL"
L165: "relatedness" to "relatedness average"
L173-176: Change to "<1 in all breeds. The average of the diagonal values was 0.99 for CAL, MUP and LIM, and 0.97 for PON. The highest diagonal values were 3.25, 1.75, 1.54 and 1.22 for PON, MUP, CAL and LIM. The minimum diagonal values ranged from 0.67 (MUP) to 0.78 (LIM)." I think the sentences are unnecessarily long.
L177: "in all breeds" to "for all breeds"
L178-188: Delete these lines and replace them with the new PCA results.
L194-195: When you mention a range, mention it from small to large not the other way around.
L198: "namely" ?
L200: "The highest average" to "The average of"
L200-201: Why

0.14? Either 2 or 3 decimals.
L201: "The mean" to "The mean and SD of"
L202: "Table 2" not "Table 1". The authors have not written the manuscript carefully.
L202: "= 0.21" to "0.21"
L203: "r2 = 0.19" to "0.19" and "r2 = 0.17" to "0.17"
L202-208: Re-write this paragraph. Difficult to read.
L210: Change it to "Table 2. The average and standard deviation (SD) of linkage disequilibrium (r2) for Bos Taurus"
Table 2: "CAL 1 " to "CAL". Add headers "Breed 1 " and "Autosome"
L218: "values" which values?
L218: "the majority of which was <0.2" to "within breed"
L218: "Medians" to "Median"
L219-223: Re-write these lines. Long and unclear. Write it short, clear and concise please.
L228: "decay behavior" to "decay"
L230-232: Change it to "Mbp and up to 1 Mbp (Figure 4). Different breeds showed different patterns of LD decay."
Figure 4: "LD (r2)" to "Linkage disequilibrium (r2)"
L234: "LD decay plot" to "Linkage disequilibrium in different distances of the genome"
Delete the first sentence in lines 236-237. It is unnecessary.
L238: "maintained" to "remained". Please check the English thoroughly in the manuscript.
L238: "it was at" to "it was remained at"
L239: ">0.12 Mbp" to "> 0.12 Mbp (Figure 4)"
L240: "was" to "is"
L241-246: Long and difficult to read. Please re-write it.
L243: "higher value". You need to mention which value. Mean or SD?
Table 3: "Cal 1 " to "CAL". Add a header "Breed 1 "
L250 "1 kb" to "1 Kb"
L252: "1 Kbp" to "1kb"
L253: "more uniform and rapid" to "faster"
L256: "breed, was" to "breed was"
L258-260: Delete "None of the local breeds had an estimated Ne higher than 100 in the 13th generation. In general, CAL had a slightly higher Ne through generations, compared to MUP and PON. More precisely," Very wordy and repeating what is shown in the figure.
L262: "varied between 204 and 45 (in the most distant and recent generation, respectively)." to "decreased from 204 to 45 (80th to 13th generations ago)"
L263-264: "with a maximum of 920 (80 generations ago) and a minimum of 310 on generation 13th." to "decreasing from 920 to 310 (80th to 13th generations ago)"
Figure 5: "Effective Population Size" to "Effective population size"
L268-270: Change to "Regarding the current Ne (Ne0), CAL, MUP, PON and LIM showed 41.7, 18.7, 17.0 and 327.9, respectively."
L272: "effective population size parameters" to "Ne"
L274: "a beef breed" to "a commercial beef breed"
L279: delete "more precisely, "
L280-286: Update these lines with the new PCA results.
L292: Change to "Average LD was different between local breeds and LIM. LIM is one"
L299: "in an equilibrium" to "to an equilibrium"
L299: "observed in PON, but also in the other two local breeds" to "observed in the local breeds"
L301: "belonging to" to "of"
L302: "had the characteristic curve of populations" .
L304: "confirmed the slower LD decay found in this study" It is not clear what you mean. Slower than what? Re-write it.
L304: "The authors analyzed two local breeds" to "Mastrangelo et al. [44] analyzed two local breeds Cinisara and Modicana"
L305: "an island of Italy, and for these breeds the" to "an island in Italy. The"
L23306: "Modicana, similar to PON (0.17) but higher than CAL and MUP (0.14)." to "Modicana." I deleted the rest, because the values are not similar.
Because of the huge amount of edits, I stop typing here. Please revise the manuscript thoroughly.
There are also mistakes in the references. See L455 and L460 for example.

Please see the attachment

Estimation of LD and Effective Population Size in three Italian autochthonous beef breeds

Many local beef breeds are endangered; carrying out research into them is commendable and deserves support.

The authors were investigating the genomic architecture of three local Italian breeds and were especially interested in the effective population sizes and how they change with time.

The analysis seems relatively sound, but the presentation of the results is not very appealing.

How exactly were the r squared values adjusted (formula 3)? Which value was chosen for alpha? And why? Alpha is rather a parameter than a constant and should be explained briefly.

Figure 1: the colors are difficult to compare since the scales are different (maximum values from

1.6 (A). Should be consistent.

Table 2: this is tiring, should be omitted or go to the Supplemental Material

Figure 2: an ordination plot (e.g. MDS) showing each and every animal, with colors according to breed, would be more conclusive.

Table 3: given that the changes of r squared are happening almost exclusively in the first interval, this table and the corresponding analysis does not make sense to me.

Figure 3: this is again quite meaningless. I don't see the point of showing each and every chromosome when they are more or less the same within each breed. Think about a different presentation.

Supplemental tables: what is the interest of S1? Why does the chromosome length differ between breeds in S2? S3 needs a caption. S2 needs a caption with more details explaining the variables.

Discussion: Line 280: "A PCA … not only differentiated the local breeds from the commercial." This is not justified at all. You could state this if you had found that the three local breeds form a group with LIM being apart. This is obviously not the case.

Explaining differences between and similarities of effective population curves with relatedness between the populations does not make sense. You could have two identical curves from two species that are completely distinct and vice versa.

The text is full of carelessness and imprecision.

Line 28: "sample size" instead of effective population size

Line 145: Capital C instead of small c

Line 236/7: "Figure 3" instead of Figure 4

Table 3: horizontal line below column headers too short

Line 454/5: The journal is called "J Anim Breed Genet."

Please see the attachment

The manuscript by Fabbri and colleagues reported a study on linkage disequilibrium (LD) among three Italian local cattle breeds (Calvana, Mucca Pisana, Pontremolese) using the Limousin breed as control. The results showed that the Calvana and Mucca Pisana breeds had a moderate level of LD (0.14) and Pontremolese had the highest level of LD (0.17), whereas Limousin presented the lowest level of LD (0.07). The obtained results would provide scientific evidence for conservation of three local cattle breeds with very low effective population sizes (Ne). In general, the manuscript is well-organized and well-written. Minor changes need to be done before final acceptance.


Nile tilapia (Oreochromis niloticus) is one of the most important farmed fish species worldwide (FAO, 2018). Breeding programs established since the 1990s have played a key role in improving commercially important traits and expanding Nile tilapia farming. The Genetically Improved Farmed Tilapia (GIFT) is the most widespread tilapia breeding strain (Lim and Webster, 2006), which has been introduced to several countries in Asia, Africa and Latin America (Gupta and Acosta, 2004). The genetic base of GIFT was established from eight African and Asian populations, and after six generations of selection, the genetic gains ranged from 10 to 15% per generation for growth-related traits (Eknath et al., 1993), providing evidence that selective breeding using phenotype and pedigree information can achieve high and constant genetic gains (Gjedrem and Rye, 2018).

The recent development of dense SNP panels for Nile tilapia (Joshi et al., 2018; Yáñez et al., 2019) will provide new opportunities for uncovering the genetic basis of important commercial traits, especially those that are difficult or expensive to measure in selection candidates. As has been demonstrated for different traits in salmonid species, the incorporation of genomic evaluations in breeding programs is expected to increase the accuracy of breeding values, compared to pedigree-based methods (Tsai et al., 2016; Bangera et al., 2017; Correa et al., 2017; Sae-Lim et al., 2017; Yoshida et al., 2017; Barria et al., 2018b; Vallejo et al., 2018; Yoshida et al., 2019a).

Genomic studies exploit the linkage disequilibrium (LD) between SNPs and quantitative trait loci (QTL) or causative mutations. Thus, knowing the extent and decay of LD within a population is important to determine the number of markers that are required for successful association mapping and genomic prediction (de Roos et al., 2008; Khatkar et al., 2008; Porto-Neto et al., 2014; Brito et al., 2015). When low LD levels are present within a population, a higher marker density is required to capture the genetic variation across the genome (Khatkar et al., 2008). In addition, LD patterns provide relevant information about past demographic events, including response to both natural and artificial selection (Slatkin, 2008). Therefore, LD estimates throughout the genome reflect the population history and provide insight into the breeding system and patterns of geographic subdivision, which can be explored to study the degree of diversity in different populations.

To date, the most widely used measures of LD between two loci are Pearson’s squared correlation coefficient (r²) and Lewontin’s D’ (commonly named D’). Values lower than 1 for D’ indicate loci separation due to recombination, while D’ = 1 indicates complete LD between loci, i.e. no recombination. However, this parameter is highly influenced by allele frequency and sample size. Thus, high D’ estimations are possible even when loci are in linkage equilibrium (Ardlie et al., 2002). Therefore, LD measured as r² between two loci is suggested as the most suitable measurement for SNP data (Pritchard and Przeworski, 2001).
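
The two measures can be compared directly with a small sketch. It assumes two biallelic loci and known haplotype frequencies, and uses the standard definitions D = p(AB) − p(A)p(B), r² = D²/[p(A)(1 − p(A))p(B)(1 − p(B))] and D’ = |D|/D_max. The function name and the example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, r^2, |D'|) for two biallelic loci.

    p_ab: frequency of the haplotype carrying alleles A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom
    # D_max depends on the sign of D and on the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, r2, d_prime


if __name__ == "__main__":
    # Hypothetical case: a rare allele A (10%) found only on B haplotypes.
    d, r2, d_prime = ld_measures(p_ab=0.1, p_a=0.1, p_b=0.5)
    print(f"D={d:.3f}, r^2={r2:.3f}, |D'|={d_prime:.3f}")
```

In this example |D’| reaches its maximum while r² stays low, which illustrates why D’ is harder to interpret when allele frequencies differ strongly between loci.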

LD patterns have been widely studied in different livestock species, such as sheep (Prieur et al., 2017), goats (Mdladla et al., 2016), pigs (Ai et al., 2013), beef (Espigolan et al., 2013; Porto-Neto et al., 2014) and dairy cattle (Bohmanova et al., 2010). In aquaculture, recent studies have aimed at characterizing the extent and decay of LD in farmed species, such as Pacific white shrimp (Jones et al., 2017), Pacific oyster (Zhong et al., 2017), rainbow trout (Rexroad and Vallejo, 2009; Vallejo et al., 2018), coho salmon (Barria et al., 2018a) and Atlantic salmon (Hayes et al., 2006; Gutierrez et al., 2015; Kijas et al., 2016; Barria et al., 2018c). However, to date there is scarce information about population genomic structure and LD in farmed Nile tilapia assessed by the use of dense SNP panels. The assessment of LD patterns in Nile tilapia is still limited to a few studies that used either a small number of markers (14 microsatellites) (Sukmanomon et al., 2012) or a small number of individuals (4 to 23 samples) (Hong Xia et al., 2015). Recently, the construction of a dense linkage map for Nile tilapia suggested a sigmoid recombination profile in most linkage groups (LG), showing higher recombination rates in the middle and lower recombination at the ends of the LGs (Joshi et al., 2018). These patterns are consistent with the high LD levels found at the ends of almost all chromosomes in a hybrid Nile tilapia population (Conte et al., 2019). The objectives of the present study were to i) estimate the population structure and genetic differentiation, ii) assess the genome-wide levels of LD and iii) determine the effective population size among three Nile tilapia breeding populations established in Latin America.


Scheme for Data Collection

In order to survey sequence variation and LD most efficiently, we resequenced a segment of ∼1 kb at each end of an �-kb segment in all individuals from three population samples. Each of these two-segment units will be referred to as a “locus pair.” Ten such locus pairs, selected from different chromosomes or different arms of the same chromosome, were surveyed (table 1). In an attempt to characterize “typical” LD levels in the human genome, genomic regions were chosen according to a fixed set of criteria. The first one was that crossing-over rates were close to the genomewide average, as determined by comparing the physical and genetic maps. The average crossing-over rate for the selected regions was 1.29 cM/Mb (table 1). Because percent G+C content is related to sequence divergence and mutation rate (Wolfe et al. 1989), as well as crossing-over rate (Fullerton et al. 2001), the second criterion was that G+C content was 35%�%. Furthermore, in an attempt to reduce the probability that the observed patterns of LD were affected by natural selection, we chose regions that do not contain or flank known coding regions. The ten locus pairs were resequenced in all individuals of samples drawn from three large populations from the major ethnic groups: Hausa of Cameroon (Sub-Saharan Africa), Italians (Europe), and Han Chinese (Asia). Unlike many other studies of LD, the present study is based on resequencing every individual in each sample. Thus, LD and levels of polymorphism can be assessed and contrasted for the same genomic regions and population samples, allowing more-precise inferences about population and genetic factors that affect the decay of LD.

Descriptive Summary of Sequence Variation and LD

The average divergence between human and chimpanzee sequences at the 10 locus pairs is 1.19%. We tested for heterogeneity of sequence divergence across the surveyed genomic regions by using the average number of sequence differences between all human and the chimpanzee sequences and averaging that over all regions. The expected numbers were then calculated for each region, with its length taken into account. The difference between observed and expected numbers was evaluated by a global χ² test that rejected the hypothesis of homogeneous divergence rates. Region 3 made the greatest contribution to the global χ², showing significantly higher interspecies divergence than the other regions. Once region 3 was removed, the remaining nine regions showed no significant global χ². This suggests that the mutation rate is higher in region 3, even though its percent G+C and CpG content are not correspondingly higher. Heterogeneity of polymorphism levels was assessed in the same way. No significant heterogeneity in polymorphism level across regions was found. (This test of heterogeneity of polymorphism levels assumes no linkage between sites. However, because linkage between sites increases the variance of the numbers of polymorphic sites, our conclusion of no heterogeneity would be the same if linkage were taken into account.)
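
The logic of the heterogeneity test described above can be sketched as a simple goodness-of-fit calculation. This is a simplified reading of the procedure and not the authors' code: it assumes that expected counts are the overall count distributed in proportion to region length, and the counts and lengths in the example are hypothetical.

```python
def heterogeneity_chi2(observed, lengths):
    """Global chi-square statistic for homogeneity of rates across regions.

    observed: number of differences (or polymorphic sites) seen in each region
    lengths: sequence length of each region, in the same order
    Returns (chi2, degrees_of_freedom).
    """
    if len(observed) != len(lengths) or len(observed) < 2:
        raise ValueError("need matching lists with at least two regions")
    total_observed = sum(observed)
    total_length = sum(lengths)
    chi2 = 0.0
    for obs, length in zip(observed, lengths):
        # Expected count under a single shared rate, scaled by region length.
        expected = total_observed * length / total_length
        chi2 += (obs - expected) ** 2 / expected
    return chi2, len(observed) - 1


if __name__ == "__main__":
    # Hypothetical data: three regions of similar length, one with more differences.
    chi2, df = heterogeneity_chi2([25, 28, 55], [2400, 2500, 2450])
    print(f"chi2={chi2:.2f} on {df} degrees of freedom")
```

Dropping the region with the largest contribution and recomputing, as the text describes for region 3, is then a matter of calling the function again on the remaining regions.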

As shown in table 2 , nucleotide diversity over all loci is 0.11% in the Hausa sample. This is �% and 50% higher than in the Italian and Chinese samples, respectively. The number of segregating sites in the African sample is �% greater than either the Italian or the Chinese sample. Given these differences between populations, it is important to interpret analysis of pooled samples with caution.

Table 2

Summary Statistics of Sequence Variation

                         Hausa                                  Italians                               Chinese
Region  L(a)             S(b)  π(c) (%)  TD(d)  D(e) (%)        S(b)  π(c) (%)  TD(d)  D(e) (%)        S(b)  π(c) (%)  TD(d)  D(e) (%)
1       2,423 (2,049)    12    .08       −1.27  1.05            6     .08       −.16   1.08            3     .03       −.37   1.04
4       2,560 (2,431)    10    .12       .74    1.41            7     .11       1.72   1.44            8     .08       .03    1.45
6       2,920 (2,902)    16    .10       −.93   1.23            8     .06       −.53   1.21            9     .04       −1.45  1.20
Ne                       11,555                                 10,504                                 7,353

Using the diploid phase-unknown sequence data, we calculated the maximum-likelihood estimates of the LD summary statistics r² and |D′| for all pairs of polymorphic sites in the 10 locus pairs; this was done for each population sample separately and for the pooled sample (Hill 1974). (This estimation procedure relies on the assumption of Hardy-Weinberg equilibrium. Tests of Hardy-Weinberg equilibrium did not show significant departures after Bonferroni correction.) Because estimates of LD for low-frequency alleles in small samples are not very informative, only alleles with frequencies in the range 0.1–0.9 were included in this analysis. As shown in figure 1, in the Italians, mean r² for sites separated by <1 kb is 0.53, whereas, for sites separated by 8–10 kb, the average is 0.23. The Chinese result is similar, with an average r² of 0.38 for sites separated by <1 kb and an average r² of 0.28 for sites separated by 8–10 kb. In the Hausa, sites separated by <1 kb have an average r² of 0.21, considerably less than in Italians and Chinese, and for sites 8–10 kb apart, r² has dropped to an average of 0.11. Likewise, |D′| declines with distance more rapidly in the Hausa than in the other two population samples. The values of |D′| and r² are sensitive to the allele frequencies and sample sizes, and comparison of results between studies should take this into account. This issue is considered in more detail in the Discussion section.
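
As a point of reference for the two statistics, the sketch below computes D, |D′| and r² from haplotype frequencies at two biallelic sites. It assumes phase-known frequencies, whereas the estimates reported here come from a maximum-likelihood procedure on phase-unknown data; the function name and the input values are illustrative only and are not taken from the study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic sites.

    p_ab : frequency of the haplotype carrying allele A at site 1 and B at site 2
    p_a  : frequency of allele A at site 1
    p_b  : frequency of allele B at site 2
    All inputs are hypothetical, phase-known frequencies.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # The largest |D| attainable depends on the sign of D and the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Matched allele frequencies, all four haplotypes present.
print(ld_stats(p_ab=0.4, p_a=0.5, p_b=0.5))
# Unequal allele frequencies, one haplotype absent.
print(ld_stats(p_ab=0.2, p_a=0.5, p_b=0.2))
```

The second call illustrates why the two measures can disagree: |D′| reaches its maximum whenever one of the four haplotypes is absent, while r² also depends on how closely the allele frequencies at the two sites match.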

Testing the Equilibrium Model

In the next section, the parameters of the equilibrium model will be estimated. Before proceeding with estimation, we tested the data for compatibility with this model. Because the polymorphism assay is based on a random sample in which all individuals are fully sequenced and because sequence from a chimpanzee outgroup was obtained, a variety of tests of the equilibrium model are available.

The HKA test is used to determine whether the levels of intraspecific polymorphism and interspecific divergence at a set of loci are consistent with the equilibrium model (Hudson et al. 1987 ). A multilocus version of the original HKA test was applied to all 10 regions in each population sample. No significant departures from the equilibrium model were detected ( table 3 ).

Table 3

Results of Multilocus HKA and Tajima’s D Tests

Population    P(a)    Mean D    % Larger(b)    Variance D    % Larger(b)

Tajima’s D statistic, which summarizes information about the spectrum of allele frequency, was calculated for each region in each population sample (Tajima 1989). These values, as well as their averages and variances, are shown in tables 2 and 3. We tested whether the observed average and variance of Tajima’s D across loci were consistent with the equilibrium model by estimating the critical values of these distributions from Monte Carlo simulations (software kindly provided by J. Hey). (The mutation parameters used in the simulations were estimated in the HKA test, using both the polymorphism and divergence data.) In these simulations, the regions were assumed to be unlinked and to have no recombination occurring within them. As shown in table 3, the Italian sample has a positive average Tajima’s D that is significantly different from the equilibrium expectations. Less than 1% of the simulated samples had an average value of Tajima’s D that was as large as or larger than the observed average value. The Chinese sample shows a marginally significant variance of Tajima’s D. If realistic levels of recombination were incorporated in the simulations, the P value for this observed variance would be smaller (as would the P value of the observed average Tajima’s D in Italians). Although the African sample shows a negative overall Tajima’s D, the observation is far from statistically significant. The departures of the Italian and Chinese samples from the equilibrium model suggest that estimates of parameters based on this model should be interpreted with caution.
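
For readers less familiar with the statistic, the following is a minimal sketch of the standard Tajima (1989) formula. It takes the number of sequences, the number of segregating sites, and the mean number of pairwise differences for a region (not per site). The function name and the example inputs are illustrative; the sample sizes used in this study are not given in this section, so no value from table 2 is reproduced here.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D for n sequences with S segregating sites.

    mean_pairwise_diff is the average number of differences between
    two sequences over the whole region (pi multiplied by region length).
    """
    if seg_sites == 0:
        return 0.0  # D is undefined without variation; 0.0 is returned by convention here
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    numerator = mean_pairwise_diff - seg_sites / a1
    return numerator / math.sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))


# Illustrative inputs: 10 sequences, 16 segregating sites.
print(tajimas_d(10, 16, 3.888889))
```

A negative value indicates an excess of low-frequency variants relative to the equilibrium expectation, and a positive value an excess of intermediate-frequency variants, which is the direction reported above for the Italian sample.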

Estimating the Neutral Mutation Rate and the Effective Population Size

On the basis of the observed divergence (D) between human and chimpanzee sequences, and assuming a divergence time (t) of 5 million years, we can estimate the substitution rate for these regions as μy = D/2t = 0.0119/[2 × (5 × 10⁶)] = 1.19 × 10⁻⁹ per year. Under the equilibrium model, this substitution rate is an estimate of the average neutral mutation rate per site at these loci. Note that no correction for multiple hits has been applied.

Under the equilibrium model, the expected nucleotide diversity (π) is 4Neμ, where μ here is the neutral mutation rate per generation. This suggests estimating the effective population size (Ne) by π/4μ. The estimates of Ne shown in table 2 were obtained in this way, using the overall nucleotide diversity for each population sample and μ = 20μy = 2.38 × 10⁻⁸, where we have assumed a generation time of 20 years. Similarly, estimates of effective population size can be obtained using the number of polymorphic sites. Because of the observed departures from the expectations of the equilibrium model, different estimates would be obtained for some samples using nucleotide diversity and number of polymorphic sites. As shown in table 2, the estimate of effective population size for the African sample is larger than that for the non-African ones, in line with previous studies (Przeworski et al. 2000). We emphasize, however, that the data show significant departures from the simple equilibrium model in the non-African populations; thus, the meaning of these estimated values is unclear.
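
The arithmetic behind these estimates can be retraced with a short sketch. The divergence, divergence time, generation time and the Hausa nucleotide diversity (0.11%) are the values quoted in the text; the variable and function names are illustrative.

```python
# Values quoted in the text.
divergence = 0.0119        # human-chimpanzee divergence per site (1.19%)
divergence_time = 5e6      # assumed divergence time in years
generation_time = 20       # assumed generation time in years

# Substitution rate per site per year: D / (2t), since both lineages accumulate changes.
mu_per_year = divergence / (2 * divergence_time)

# Neutral mutation rate per site per generation.
mu_per_generation = generation_time * mu_per_year


def ne_from_diversity(pi, mu):
    """Equilibrium estimate of Ne from nucleotide diversity: pi = 4 * Ne * mu."""
    return pi / (4 * mu)


ne_hausa = ne_from_diversity(0.0011, mu_per_generation)
print(mu_per_year, mu_per_generation, round(ne_hausa))
```

With π = 0.11% this gives the Hausa value listed in table 2. Because the estimate is linear in π and inversely proportional to μ, any error in the assumed divergence time or generation time carries over proportionally.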

Estimating the Population Crossing-Over Parameter

Under a simple two-locus Wright-Fisher equilibrium model, the level of LD depends on the composite parameter ρ = 4Ne·rbp, where Ne is the effective population size and rbp is the crossing-over rate per generation between adjacent nucleotide positions, and on the rate and tract length of gene conversion. The ratio of gene conversion to crossing-over rate is denoted by f (see Material and Methods section). We have used a pairwise composite likelihood method to estimate ρ and f for fixed values of mean conversion-tract length (L). By this method, the CIs for estimates of ρ and f based on a single locus pair are very large and make interpretation of individual estimates difficult (simulation results not shown). However, when the data from all 10 locus pairs are combined, good estimates can be obtained.

Although relatively little is known about gene conversion in mammals, studies in yeast and fruit flies suggest that the conversion-tract length is 300–2,000 bp (Hilliker et al. 1994; Paques and Haber 1999) and that f is ∼2–4 (Fogel et al. 1983; Foss et al. 1993; Hilliker et al. 1991). We restricted our attention to models with L = 500 bp or 1,000 bp. We focused initially on the African sample, because it did not show departures from the equilibrium model assumed in the estimation procedure. For L = 500 bp, the maximum composite-likelihood estimates of ρ and f in the African sample are 0.00089 and 7.3, respectively. Assuming the crossing-over rate per generation is 1.29 cM/Mb, the effective population size estimate for the African sample is ∼17,000, that is, 0.00089/(4 × 1.29 × 10⁻⁸). This is roughly consistent with, but somewhat larger than, estimates of effective population size based on polymorphism levels described above.

Figure 2 shows a 95% confidence region for ρ and f based on the African data. From the figure, it is clear that small values of f imply large values of ρ. Also, f < 0.8 is incompatible with the data. In addition, for f < 1, the plot in figure 2 suggests that ρ is likely to be > 0.002, which in turn implies an implausibly large effective population size.

Pairwise composite likelihood surface for the African sample. The heavy contour line indicates an approximate 95% confidence region based on simulations (see Material and Methods section). The other contour lines are at arbitrary intervals to depict the shape of the surface. The dot indicates the maximum at f = 7.3 and ρ = 0.00089.

If we assume that Ne = 12,000, as estimated from the levels of polymorphism in the African sample, and that rbp = 1.29 cM/Mb, as direct estimates of crossing-over rates suggest, then ρ = 6.2 × 10⁻⁴. Fixing this value of ρ and assuming L = 500 bp, the maximum composite-likelihood estimate of f is 11 (approximate 95% confidence region = 4.5�). For smaller values of L, even larger values of f are estimated. The point f = 11, ρ = 6.2 × 10⁻⁴ is within the highest contour interval shown in figure 2 and thus is well within the 95% confidence region for f and ρ.
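
The conversions between ρ and Ne used in this section follow directly from ρ = 4Ne·rbp. The sketch below retraces them, taking 1.29 cM/Mb as 1.29 × 10⁻⁸ crossovers per bp per generation, as in the text; the function names are illustrative.

```python
# 1.29 cM/Mb = 0.0129 crossovers per 1e6 bp = 1.29e-8 per bp per generation.
R_PER_BP = 1.29e-8


def ne_from_rho(rho, r_per_bp=R_PER_BP):
    """Effective population size implied by a per-bp estimate of rho = 4 * Ne * r."""
    return rho / (4 * r_per_bp)


def rho_from_ne(ne, r_per_bp=R_PER_BP):
    """Per-bp population crossing-over parameter implied by a given Ne."""
    return 4 * ne * r_per_bp


# Joint estimate for the African sample (L = 500 bp): rho = 0.00089.
print(ne_from_rho(0.00089))
# Ne = 12,000 taken from polymorphism levels instead.
print(rho_from_ne(12000))
```

The same conversion applied to the ρ estimates in table 4 yields the Ne values listed there, so the contrast between populations in that table is entirely a contrast in estimated ρ.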

It is well known that admixture may increase LD levels, even at unlinked sites. This raises the possibility that unrecognized admixture might affect the estimates of the population cross-over and gene conversion parameters. To investigate this issue, we estimated |D′| and r² for pairs of unlinked sites for the Hausa and the pooled samples. The Hausa sample was chosen because it is used for estimating the gene conversion parameters, and the pooled sample was examined because it is artificially admixed. The significance of the observed mean |D′| and r² values was evaluated by a test in which diploid genotypes for each entire locus pair were randomly permuted across individuals. In the Hausa sample, the observed mean |D′| and r² were not significantly different from random expectations: the observed mean |D′| = 0.52 and r² = 0.064, and the means of the corresponding quantities from permutations are 0.55 and 0.071 (top and bottom 2.5th percentiles: 0.51–0.60 and 0.060–0.085, respectively). Conversely, the results for the pooled sample are consistent with some level of admixture: the observed mean |D′| = 0.37 and r² = 0.032, and the means of the corresponding quantities from permutations are 0.30 and 0.022 (top and bottom 2.5th percentiles: 0.27–0.34 and 0.018–0.026, respectively). Thus, estimates of gene-conversion parameters that we obtain from the Hausa sample are unlikely to be inflated as a result of unrecognized admixture.
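
The permutation logic can be sketched as follows. This is a simplified stand-in rather than the procedure used in the study: it uses the squared correlation between genotype counts (0, 1 or 2 copies of an allele) at two sites instead of maximum-likelihood |D′| and r², and it shuffles the genotypes at a single site rather than whole locus pairs. Function names, genotype vectors and the number of permutations are hypothetical.

```python
import random


def genotype_r2(g1, g2):
    """Squared Pearson correlation between genotype counts at two sites."""
    n = len(g1)
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # a monomorphic site carries no LD information
    return cov * cov / (v1 * v2)


def permutation_null(g1, g2, n_perm=1000, seed=0):
    """Mean and central 95% interval of r^2 when individuals are shuffled at site 2."""
    rng = random.Random(seed)
    shuffled = list(g2)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any association between the two sites
        null.append(genotype_r2(g1, shuffled))
    null.sort()
    lower = null[int(0.025 * n_perm)]
    upper = null[int(0.975 * n_perm) - 1]
    return sum(null) / n_perm, lower, upper


# Hypothetical genotypes for 60 individuals at two sites.
site1 = [0, 1, 2] * 20
site2 = [0, 1, 2] * 20
print(genotype_r2(site1, site2), permutation_null(site1, site2))
```

An observed value above the upper bound of the permutation interval, as for the pooled sample above, points to association between unlinked sites that random pairing of individuals cannot explain.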

Estimates of ρ for the Italian and Chinese samples are shown in table 4. Because the equilibrium model is not compatible with the data from these populations, the estimates of ρ may not accurately estimate 4Ne·rbp, but they may nonetheless provide useful indices of the rate of decay of LD with distance. From these estimates, it appears that LD decays at a rate roughly four times slower in the two non-African populations than in the African population. In agreement with the above results indicating a departure from the equilibrium model, the effective population sizes that these decay rates imply are not compatible with those estimated on the basis of polymorphism levels in these populations (see tables 2 and 4).
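
To see how a larger ρ translates into faster decay, the sketch below uses the classical approximation E(r²) ≈ 1/(1 + 4Ne·c), where c is the recombination fraction between two sites. This approximation is swapped in for illustration only: it ignores gene conversion, mutation and sample size, and it is not the composite-likelihood model used in the study. The ρ values are the table 4 estimates for L = 500 bp and f = 4; function names are illustrative.

```python
def expected_r2(rho_per_bp, distance_bp):
    """Approximate equilibrium E(r^2) ~ 1 / (1 + 4*Ne*c), with 4*Ne*c = rho_per_bp * distance."""
    return 1.0 / (1.0 + rho_per_bp * distance_bp)


def half_decay_distance(rho_per_bp):
    """Distance (bp) at which the approximate E(r^2) falls to 0.5."""
    return 1.0 / rho_per_bp


# Table 4 estimates (x 10^-4 per bp), L = 500 bp, f = 4.
rho = {"Hausa": 13e-4, "Italians": 2.9e-4, "Chinese": 3.4e-4}

for population, value in rho.items():
    print(population, round(half_decay_distance(value)), "bp")
```

Under this approximation the half-decay distance is inversely proportional to ρ, so the roughly fourfold difference in ρ between the Hausa and the non-African samples maps onto a roughly fourfold difference in the distance over which LD persists.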

Table 4

Estimates of the Population Crossing-Over Rate and 95% CIs (× 10⁻⁴)

                Hausa                        Italians                     Chinese
L (bp)   f      Estimate (95% CI)   Ne(a)    Estimate (95% CI)   Ne(a)    Estimate (95% CI)   Ne(a)
500      4      13 (7.1�)           25,194   2.9 (1.2–5.7)       5,620    3.4 (1.5–6.8)       6,589
1,000    4      11 (5.7�)           21,318   2.2 (1.0–4.4)       4,264    2.6 (1.0–6.0)       5,039
500      8      8.4 (4.7�)          16,279   1.9 (.9–3.7)        3,682    2.3 (.9–4.9)        4,457
1,000    8      6.0 (3.4�)          11,628   1.4 (.6–2.8)        2,713    1.6 (.7–3.5)        3,101

The estimated population crossing-over and gene conversion rates can be used to calculate the expected values of the descriptive statistics of LD, r² and |D′|. Figure 1 shows the observed decay of LD with distance and the decay expected on the basis of a crossing-over model with and without gene conversion for our data. In agreement with the expectation that gene conversion affects mainly the decay of LD over short distances, the crossing-over/gene-conversion model shows a sharp decline within 1 kb. As shown in table 4, when gene conversion is included in the model, the estimate of the population crossing-over parameter for any given sample decreases. As a consequence, over longer distances, the expected LD is greater if gene conversion is taken into account than in a model including only the effect of crossing-over (as can be seen by comparing the dashed and solid lines in fig. 1). Because |D′| is sensitive to both allele frequencies and sample size, the results in figure 1 cannot be readily compared to those obtained from other studies. To facilitate comparisons, we calculated the distance at which the expected |D′| reaches half its maximum value on the basis of our estimates of the population crossing-over parameter and for different sample sizes and ranges of allele frequencies (table 5). For any given set of population parameters, the distance at which |D′| = 0.5 differs as much as fourfold for the sample sizes considered in table 5 and even more for the allele frequencies of 0.1–0.9 versus 0.3–0.7. These results underscore the difficulty of comparing LD levels across studies. On the basis of the estimates of the population crossing-over parameter and gene-conversion rate for the non-African samples, the expected |D′| for allele frequencies 0.1–0.9 equals one half the |D′| at 55� kb in samples of 90 chromosomes. On the basis of the corresponding estimates for the African sample, the expected |D′| for allele frequencies 0.1–0.9 halves at 11� kb in samples of 90 chromosomes.

Table 5

Distance (in Kilobases) at which Expected |D′| = .5

                                    Hausa            Italians         Chinese
Sample Size(a)   Allele Frequency   f = 4   f = 8    f = 4   f = 8    f = 4   f = 8

Note.— Distance was calculated using the population-specific 4Nerbp estimates in table 2 based on L = 500 bp.


We find that various forms of local and global effective size exhibit quite divergent behaviours in populations under migration, and the general relationship between the different forms of Ne is similar under the island and the linear stepping stone migration models.

3.1. Island model

The change of local and global effective sizes during approach to migration–drift equilibrium for the island model with a migration rate of one individual per generation is shown in Figure 1. The identical size of local populations and the symmetrical migration scheme imply that all local realized Ne are identical for each particular type of effective size, and that some types of Ne behave in a similar way. All the 10 NeIRx are the same, for example, and they coincide with NeIMeta, which represents a weighted harmonic average of the local NeIRx. At equilibrium they all converge on the eigenvalue effective size, NeE = 605, and they are very close to this value after about t = 150 generations. The realized additive genetic variance effective size of a local population (NeAVRx) is also very similar, but not identical, to the NeIRx.

Global (Meta) and realized local (Rx) effective population sizes over 500 generations in a metapopulation following an island model pattern of migration. There are ten (10) ideal subpopulations of constant effective size Nex = Ncx = 50, and in every generation each subpopulation receives on average one (1) immigrant drawn at random from an infinitely large migrant pool to which the other subpopulations have contributed equally (m' = 0.02; m = 0.022). NeI relates to the rate of inbreeding, NeAV to the rate at which additive genetic variation is lost, NeV to the amount of allele frequency change, and NeLD reflects the degree of linkage disequilibrium resulting from a balance between genetic drift and recombination. The eigenvalue effective size is NeE = 605, reflecting the equilibrium state when inbreeding increases at the same constant rate globally as well as locally, resulting in NeE = NeIMeta = NeIRx. Initial inbreeding and kinship are zero (0) within and between all subpopulations. Note that expected genetic change is the same for all subpopulations under an island model.

The most important observation refers to the different behaviors of the local realized effective sizes NeIRx and NeAVRx on one hand, i.e., those relating to the 50/500 rule in conservation, and those of NeVRx and NeLDRx on the other, i.e., those that are typically targeted when estimating effective size from genetic marker data (Figure 1). Clearly, applying either of the temporal or LD methods, which estimate NeVRx and NeLDRx, respectively, will tell us very little about rates of inbreeding (NeIRx) or potentials for maintaining genetic variation (NeAVRx) in local populations that are part of a metapopulation system. The trajectories of NeVRx and NeLDRx change only marginally during the first few generations, such that NeVRx decreases slightly and NeLDRx increases. Then they reach equilibrium and stay indefinitely just under/over their original values of Nex = 50; i.e., at t = 500 we have NeVRx = 49.0 and NeLDRx = 51.9.

With respect to the global population, the dynamics of the variance and additive genetic variance effective sizes (NeVMeta and NeAVMeta) are very similar, but not identical. They both start out at Ne = 500 (the sum of the local Nex) and converge, at marginally different rates, on NeE = 605. Before equilibrium has been approached, NeAVMeta is a poor indicator of the rate of decay of additive genetic variation in the local populations, which is quantified by NeAVRx.

Increasing migration to ten individuals per generation (m = 0.22; m' = 0.20) reveals a pattern that is qualitatively very similar to that for m = 0.022 (Figure 2 vs. Figure 1). The major difference is that the higher migration rate results in a faster approach to equilibrium (note the different x-axis scales of Figure 2 vs. Figure 1). Further, the trajectories for NeVRx and NeLDRx level out at values that are more distant from the starting point (Nex = 50) than at the lower migration rate. NeVRx = 44.5 in generation t = 50 (compared to NeVRx = 49.0 in Figure 1). For NeLDRx, the expected local equilibrium value has increased from NeLDRx = 51.9 (at m = 0.022; Figure 1) to NeLDRx = 77.2 (at m = 0.22; Figure 2). In contrast to the simulations with m = 0.022 (Figure 1), simulated values with m = 0.22 are a bit high, in the range 81�, rather than close to the expected value of 77.2 (Supporting Information Appendix S1). Overall, however, the lack of coupling persists between the quantities relating to the 50/500 rule on one hand, and those estimated in most empirical studies on the other.
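
The two migration rates quoted for each scenario appear to be related as m = m'·s/(s − 1), where m' is the fraction of a subpopulation made up of immigrants from other subpopulations and s is the number of subpopulations. This relation is an assumption inferred from the quoted values (it reproduces them for s = 10 and a local size of 50) rather than something stated in this section; the function name is illustrative.

```python
def migration_rates(immigrants_per_generation, local_size, n_subpops):
    """Return (m_prime, m) for an island model.

    m_prime: expected fraction of a subpopulation that immigrated from elsewhere.
    m: assumed rate of contribution to the common migrant pool, taken here as
       m_prime * s / (s - 1) because a share 1/s of the pool comes from the
       focal subpopulation itself (an assumption, see lead-in).
    """
    m_prime = immigrants_per_generation / local_size
    m = m_prime * n_subpops / (n_subpops - 1)
    return m_prime, m


print(migration_rates(1, 50, 10))    # one immigrant per generation
print(migration_rates(10, 50, 10))   # ten immigrants per generation
print(migration_rates(0.1, 50, 10))  # one immigrant per 10 generations
```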

As in Figure 1, except that the immigration rate is ten (10) individuals per generation (m' = 0.20; m = 0.22) and the process is only followed over 50 generations. The eigenvalue effective size is NeE = 510.

3.2. Island model equilibrium conditions

Figure 3 depicts the equilibrium values at different migration rates (m) for the local forms of NeIRx, NeVRx, and NeLDRx in an island model metapopulation with the same basic demography as previously (s = 10, Nex = Ncx = 50). Thus, comparing the curves in Figure 3 with those in Figure 2, for example, the equilibrium values for m = 0.22 are NeIRx = 510, NeVRx = 44.5, and NeLDRx = 77.2. When m is small, say, m < 0.10, the expected local equilibrium values of NeVRx and NeLDRx are close to those in isolation when all local Ne are the same (Nex = 50). An unbiased estimator targeting NeVRx or NeLDRx, such as methods based on the temporal or the LD approaches, is thus expected to provide empirical estimates close to the local Ne under isolation. In contrast, such estimates are poor indicators of equilibrium NeIRx at low migration rates. In fact, local NeVRx at equilibrium is never even close to local NeIRx for any value of m, and local NeLDRx is only close at very high migration rates when the entire metapopulation is panmictic or nearly so.

Equilibrium values for local inbreeding (NeIRx), variance (NeVRx), and linkage disequilibrium (NeLDRx) effective size at different positive migration rates (m > 0). The values refer to an island model metapopulation with 10 ideal subpopulations of size Nex = Ncx = 50 at migration–drift equilibrium. Note that the equilibrium condition implies that the curve for local NeIRx coincides with that for the eigenvalue effective size (NeE), which reflects the global inbreeding effective size (NeIMeta) at equilibrium.

The time required for reaching migration–drift equilibrium (Figure 3) can be very long at low migration rates. Thus, for m' = 0.002 (one immigrant per 10 generations), for example, it takes about 800 generations for NeIRx to approach its approximate equilibrium value of NeIRx = NeIMeta = NeE = 1,590 in the present metapopulation (s = 10, Nex = 50), whereas NeVRx and NeLDRx will remain close to their starting value of Nex = 50 during the entire process. Further, the high values of NeIRx at low migration rates should not be misinterpreted as suggesting complete or near isolation as an adequate strategy for genetic management of subdivided populations. The reason is that local inbreeding easily accumulates to unsatisfactorily high levels when migration is low. In the present example with m' = 0.002, for instance, the NeIRx = 500 criterion will be met in generation ≈ 275. At this time, however, local inbreeding has increased to f > 0.75, a value that would most likely be considered unacceptably high in the context of genetic conservation (see Laikre et al., 2016 and below).

3.3. Linear stepping stone model

We finally consider an ideal linear stepping stone model with the same basic demographic characteristics as the ones above, i.e., with s = 10 ideal subpopulations sized Nex = Ncx = 50, which are now arranged in a line and numbered from left to right (Figure 4). Migration only occurs between neighboring subpopulations, and in every generation each subpopulation receives on average one half (0.5) immigrant from each neighbor. Thus, there is an average of one immigrant per generation into subpopulations 2–9 (as in the island model of Figure 1), whereas those at the ends (1 and 10) only get 0.5 immigrants. Due to this migration pattern the approach to equilibrium is much slower than for an island model with similar migration rates (Figure 4). The eigenvalue effective size is NeE = 959, and all the local effective sizes expected to approach NeE are still far from this value after 500 generations, particularly those for the “end” populations (1 and 10).

As for the island models, the realized local variance effective sizes in Figure 4 remain just under their initial value of Nex = 50, and in generation t = 500 we have NeVR1 = 49.3 and NeVR5 = 49.0. The simulated values for the realized local NeLD for subpopulations 1 and 5 vary in the range NeLDR1,5 = 42�. Clearly, the tendency of realized local NeV and NeLD to follow trajectories that are strikingly different from those of the realized local NeI and NeAV persists also under the linear stepping stone model, which represents an extreme relative to the island model with respect to connectivity (Allendorf et al., 2013; Kimura & Weiss, 1964).


Southern forests dominated by pines contain one third of the entire forest carbon in the contiguous U.S. [1]. Among the southern pines, loblolly pine is the most common, productive and valuable commercial timber species due to its rapid growth and vast territory, comprising 80 % of the planted forestland and over one half of the standing volume in the southern U.S. The native range of loblolly pine extends south from New Jersey to central Florida, and west to central Texas, occupying 55 million acres of forest land [2, 3]. Since forests capture and store carbon dioxide through photosynthesis, the widely planted loblolly pine in the southern U.S. provides great value in offsetting atmospheric carbon dioxide and mitigating climate changes caused by greenhouse gas emissions [4, 5].

Genomic tools and resources that focus on the dissection of complex traits are revolutionizing traditional loblolly pine breeding and assist with the breeding and deployment of genotypes better adapted to climate change and able to sequester greater amount of carbon. Two key prerequisites for development and application of genomics-assisted breeding are the characterization of the genetic variation and the collection of genome-wide molecular markers. A high level of genetic polymorphism is expected in loblolly pine due to its life traits, typical for conifer species, such as longevity, wide geographic distribution, large effective population size and high outcrossing rate. This was confirmed in early studies with isozymes [6, 7], DNA-based markers [8–10], and especially more recently with SNP [11–13] markers. About 4000 SNP markers have been genotyped in the previous association genetics studies [11, 13, 14], but many more markers are needed for genomic selection [15–18].

In the previous loblolly pine association mapping studies, an Illumina Infinium high-throughput SNP genotyping array developed for multiplex genotyping of 7216 SNP markers was used to dissect genetic control of diverse phenotypic traits [11, 13, 14, 19–21]. These SNPs were derived originally from amplicon sequencing data based on a relatively small, but range-wide sample of 18 loblolly pine megagametophytes and using PCR primers that were designed using unigene contig sequences assembled from expressed sequence tag (EST) sequences. Finally, about 4000 SNPs from this 7 K SNP array were polymorphic or could be genotyped in follow-up studies [11, 13, 14, 19–21].

Given adequate geographic distribution sampling, the genetic structure underlying loblolly pine populations could also be elucidated using SNPs. For instance, Eckert et al. [19] analyzed SNP and simple sequence repeat (SSR) markers among 907 rangewide loblolly pine trees and found that the population structure reflected mainly the Mississippi River discontinuity.

Efficiency of marker-assisted breeding and genomic selection depends largely on genome wide linkage disequilibrium (LD). Brown et al. [12] found substantial historic recombination between SNPs in the sampled alleles sequenced in 19 genes and demonstrated that LD significantly declined within 2 Kb in loblolly pine. A genome wide study by Chhatre et al. [11] confirmed rapid LD decay in loblolly pine. These studies suggested that a very large number of markers would be required to link phenotypes to genotypes in association mapping studies and in genomic selection of this species. Therefore, for a species such as loblolly pine with a large genome and rapid LD decay, even thousands of markers cannot meet the requirement of identifying all important functional genomic regions. Fortunately, genotyping by sequencing (GBS), which enables simultaneous marker discovery and genotyping, has facilitated the generation of large numbers of molecular markers [22]. Nevertheless, the large size and complex structure of the loblolly pine genome pose challenges for the whole genome resequencing. The loblolly pine genome assembly v. 1.01 spans 23.2 Gbp and contains 14.4 million scaffolds [23]. Tentatively, 50,172 putative genes with an average length of 2.7 Kbp have been annotated in the current loblolly pine genome assembly [24]. Moreover, various highly repetitive DNA elements compose up to 82 % of the loblolly pine genome, among which retrotransposons dominate and comprise 62 % of the genome [23, 24]. Therefore, reduction of genome complexity is highly desired for application of GBS to loblolly pine.

In our study, we used the entire exome region for target enrichment to limit GBS to mostly coding regions, which represent only 40–60 Mbp of sequence space or less than 0.2 % of the entire loblolly pine genome. In the previous studies, technologies for solution-based enrichment of target regions of interest have been developed for loblolly pine [25–27]. Capture size has been significantly expanded due to the improvement in probe design and capture efficiency, making it possible to capture up to 200 Mbp of target sequence with a single design (NimbleGen SeqCap EZ Developer Enrichment Kit). These developments made it possible for us to target and enrich the entire loblolly pine exome, thus greatly enlarging the available number of molecular polymorphisms in loblolly pine.

In this study, we describe the probe design and efficiency of the loblolly pine exome capture using the NimbleGen SeqCap EZ method in a population sample containing 375 clonally-propagated trees from an association mapping population generated for the Allele Discovery of Economic Pine Traits II (ADEPT 2) project [14]. Counties of origin are known for 362 out of 375 maternal trees (Fig. 1). SNPs were identified by aligning the exome capture sequences to loblolly pine genome assembly v. 1.01 [28]. The inferred SNP genotypes were then applied to study LD decay and population structure.

The counties of origin of the maternal trees colored by states. This map shows the sampling sites of the 362 out of 375 maternal parents of the ADEPT2 population used in this study


Analytical approximations

As discussed in the Appendix, if we ignore effects from sampling individuals, the expected value of r² has two components, which represent the contributions to r² from drift and mixture, respectively. In a closed population at equilibrium with constant N, r will vary randomly in the range [−1, 1] (or less, depending on allele frequencies), so that E(r) = 0 and there is no mixture LD. In that case, only the drift term is relevant, and E(r²) ≈ 1/(3N) on the basis of Weir and Hill (1980) and Hill (1981). We use this standard-model expectation as a point of reference for evaluating the effects of migration on r² and N̂e.

Migration changes both the drift and mixture terms in Equation 1, in contrasting ways. First, migration expands the total number of parents that contribute to a local population, and this reduces the drift term. We quantify this effect by calculating how the effective pool of parents (EPP) changes as a function of m, n, and N: EPP = N/[(1 − m)² + m²/(n − 1)] (Equation A1). The expected magnitude of reduction in drift LD due to migration is calculated as Δr²drift = 1/(3 EPP) − 1/(3N). At the same time, migration brings together in the local population individuals that are progeny of parents with (potentially very) different suites of allele frequencies. This creates mixture disequilibrium, which will tend to increase overall LD. We quantify this effect by the term Δr²mix (Equation A10). Two primary factors determine the magnitude of mixture LD (Equation A6): population differentiation (all else being equal, genetically divergent populations create more mixture LD) and mixture fraction (LD is highest with equal mixture fractions). In an equilibrium model, these two factors act in opposing ways, as higher migration rates reduce levels of genetic divergence. As a result, under equilibrium conditions mixture LD is expected to be largest at relatively low levels of migration (Figure A1).
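
The effective pool of parents and the resulting change in drift LD can be evaluated directly from the two formulas above. The sketch below does so for a metapopulation of n = 10 subpopulations of N = 100; the function names and the particular migration rates looped over are illustrative choices.

```python
def effective_pool_of_parents(N, m, n):
    """EPP = N / [(1 - m)^2 + m^2 / (n - 1)] for local size N, migration rate m, n subpopulations."""
    return N / ((1 - m) ** 2 + m ** 2 / (n - 1))


def delta_r2_drift(N, m, n):
    """Expected change in drift LD due to migration: 1/(3*EPP) - 1/(3*N)."""
    epp = effective_pool_of_parents(N, m, n)
    return 1.0 / (3.0 * epp) - 1.0 / (3.0 * N)


for m in (0.0, 0.01, 0.05, 0.1, 0.25, 0.9):
    print(m, effective_pool_of_parents(100, m, 10), delta_r2_drift(100, m, 10))
```

Two limits are worth noting. With m = 0 the pool is just the local population (EPP = N), and with m = (n − 1)/n, which corresponds to random mating across the whole metapopulation, EPP equals nN. Between these limits EPP grows slowly at first, so the reduction in drift LD stays small until m is fairly large.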

Table 1 summarizes results of applying the formulas developed in the Appendix to the two general metapopulation scenarios. Some general patterns can be noted. First, in all cases the expected contribution to overall r² from population mixture (Δr²mix) is at least an order of magnitude smaller than the expected reduction in drift LD from recruiting additional parents (Δr²drift). This occurs because, under the equilibrium model assumed, the population mixture never involves large fractions of genetically divergent individuals: as population divergence increases (and with it the opportunity for creating large mixture LD), migration rate also drops sharply. As a consequence, we expect that in all cases the reductions in LD due to equilibrium migration will outweigh any additional mixture LD. Second, the EPP rises only slowly with low levels of migration, so substantial upward biases in local N̂e are not expected until migration rates are fairly high in genetic terms (m > 5–10%). Third, the two metapopulation scenarios are expected to produce generally similar results (indexed by the ratio N̂e/N) for low and moderate migration, but for m > 0.1 upward bias is expected to rise faster for n = 10, N = 100. This is expected because with high migration rates, N̂e for both scenarios should converge on the overall metapopulation Ne ∼ 1000, which is a larger multiple of local Ne for the scenario with N = 100.

Empirical results from simulations

Equilibrium migration:

The main simulation results for equilibrium migration are plotted in Figures 1 and 2. Although our analyses here focus on bias (for an evaluation of precision of the LD method, see Waples and Do 2010), we have plotted empirical confidence intervals (C.I.’s) in Figure 1, and some general patterns are worth noting: (1) C.I.’s are tighter for the [10, 100] scenario because the variance of N̂e increases with true Ne (Hill 1981); (2) C.I.’s are wider for mN < 1 because those scenarios have low genetic diversity in local populations and fewer allelic comparisons for calculating $r^2$; and (3) C.I.’s are tighter for moderate migration (mN = 1–10), because this level of migration is sufficient to maintain high levels of allelic diversity but not so high that N̂e becomes substantially biased upward.

Figure 1. Bias in estimates of local Ne (indicated by the ratio N̂e/N) as a function of amount of migration among subpopulations. Migration is scaled by migration rate (m) (A) or number of migrants per generation (mN) (B). Local subpopulation size (N) was 100 or 500 ideal individuals. Values shown are based on harmonic mean N̂e calculated using data for 20 loci assayed in S = 100 individuals. Vertical lines in B show the central 90% of the empirical distribution of N̂e.

Figure 2. Comparison of observed N̂e from simulations (same data that are plotted in Figure 1) with expected values based on theoretical considerations (from Table 1).

The simulation results generally agreed with the analytical predictions. For both metapopulation scenarios, the shape of the relationship between N̂e/N and m was similar to that predicted. Little bias to local N̂e was found for either scenario for low or moderate m, while m ≥ 0.1 produced more substantial upward bias. As expected, this latter effect was stronger for N = 100 than N = 500. As also expected, for N = 500 we found no evidence for downward bias in N̂e that could be attributed to population mixture (see below for discussion of results for N = 100). It appears that migration rate (m) is a more reliable indicator than the effective number of migrants (mNe) of the likely consequences of migration on N̂e (compare Figure 1A and 1B).

Two important deviations from the predicted patterns are also evident. First, although theoretical derivations in the Appendix capture the general pattern of the relationship between N̂e and m, empirical results showed more upward bias than predicted under high migration rates (Figure 2). The second deviation is that for the scenario with N = 100, n = 10, we observed a downward bias in N̂e at low migration rates (harmonic mean N̂e = 92.9 for m = 0.01 and 80.2 for m = 0.001). With N = 100, m = 0.01 means that a local population on average receives one immigrant per generation from the metapopulation as a whole, and the rate is one immigrant every 10 generations for m = 0.001. Since migration was stochastic, some generations can by chance receive an unusually large number of immigrants. Similarly, if one or a few migrants are unusually successful at reproducing, their offspring can contribute substantial admixture LD to the population for several generations before the associations decay through recombination. Furthermore, because the harmonic mean is strongly affected by occasional low values, and because of the nonlinear effects of m on mixture LD, we expect that the observed reduction in N̂e for low migration rates was due to a few low values rather than a general across-the-board reduction in N̂e. This is supported by results shown in Figure 3, which compares the distribution of N̂e for m = 0.001 with that under complete isolation. The distributions are generally similar, except that the scenario with rare migration produced four estimates with N̂e < 40 compared to none for m = 0. If those four values are omitted, harmonic mean N̂e becomes 98.0, nearly identical to the value (N̂e = 98.3) for the scenario with no migration. In the rare-migration scenario, the frequency of relatively high estimates was also reduced slightly (Figure 3), which could be due to a small amount of residual disequilibrium from migrants in previous generations.
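The sensitivity of the harmonic mean to a handful of low estimates can be checked with a small numerical sketch. The list of estimates below is made up purely for illustration (it is not the simulation output); only the expected-immigrant arithmetic, m × N, comes from the text.

```python
from statistics import harmonic_mean, mean


def expected_immigrants_per_generation(m, N):
    """Expected number of immigrants per generation into one subpopulation."""
    return m * N


# Hypothetical set of 100 replicate estimates: most sit at 100,
# but four replicates are strongly depressed (value 20).
typical = [100.0] * 96
depressed = [20.0] * 4
estimates = typical + depressed

if __name__ == "__main__":
    print("immigrants/generation at m=0.01 :", expected_immigrants_per_generation(0.01, 100))
    print("immigrants/generation at m=0.001:", expected_immigrants_per_generation(0.001, 100))
    print("arithmetic mean, all replicates :", round(mean(estimates), 1))
    print("harmonic mean, all replicates   :", round(harmonic_mean(estimates), 1))
    print("harmonic mean, outliers removed :", round(harmonic_mean(typical), 1))
```

In this toy case four low values out of 100 pull the harmonic mean down much further than the arithmetic mean, which is the same qualitative behavior described for the rare-migration scenario.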

Figure 3. Distribution of N̂e estimates for scenarios with true Ne = 100 in each local subpopulation and either metapopulations of n = 10 subpopulations connected by rare migration events (m = 0.001, solid bars) or completely isolated subpopulations (open bars). In both cases, each sample of S = 100 individuals was taken from a single subpopulation, and 20 loci were used for the estimate. The bin with the asterisk includes all estimates >300.

To explore this issue further, we examined results for one of the metapopulations that produced one very low estimate (N̂e = 13.8 for population 10). We used Rannala and Mountain’s (1997) method as implemented in GeneClass2 (Piry et al. 2004) to search for first-generation migrants in the entire metapopulation (N = 1000). Three migrants were identified at the P < 0.001 level (one each in populations 1, 5, and 9) and were detected with high certainty because the low migration rate produced very strong divergence (FST = 0.48) and essentially nonoverlapping sets of alleles in different populations. Surprisingly, no first-generation migrants were detected in population 10. However, when simulations were used to generate a “likely” range of multilocus genotypes that would be produced by each population (Paetkau et al. 2004), seven individuals from population 10 were estimated to have multilocus genotypes with a <1/1000 probability of being produced by a population with allele frequencies observed in population 10. Inspection of these seven individuals showed that in most cases they carried one allele that was rare and one that was common in population 10, the pattern that would be expected for F1 or backcross progeny of first-generation immigrants. We concluded, therefore, that the low N̂e for population 10 could be traced to one or a few immigrants in a recent generation that produced a number of descendants.

Why did first-generation migrants in population 10 produce low estimates of Ne while those in populations 1, 5, and 9 did not? (N̂e = 88.0, 84.7, and 60.3, respectively, for the latter three populations: lower than average but well within the range expected.) The primary reason appears to be an interaction with the criterion used for screening out rare alleles. We used PCRIT = 0.02, which excludes alleles at frequency <0.02. Figure 4 shows how N̂e for each of the 10 populations in the metapopulation varied as a function of PCRIT. For 6 of the populations (Figure 4, black lines), N̂e showed little variation for PCRIT in the range [0.01–0.05]. The three populations with identified first-generation migrants (Figure 4, blue lines) all had “typical” N̂e values for PCRIT = 0.02–0.05 but sharply reduced values for PCRIT = 0.01 (N̂e ≤ 22). “Foreign” alleles that occur in only a single first-generation migrant cannot exceed frequency 0.01 in a sample of S = 100 individuals, so effects of lone migrants are screened out when PCRIT > 0.01 is used. The red line in Figure 4 is for population 10, which shows a different pattern: high estimates (N̂e ∼ 150–170) for PCRIT ≥ 0.03 and very low estimates (N̂e = 11–14) for PCRIT = 0.02 or 0.01. When the seven individuals with highly unlikely genotypes were excluded from population 10, estimated effective size jumped dramatically to a value (N̂e = 179 using the PCRIT = 0.02 criterion) comparable to the estimates found when rare (presumably mostly recent immigrant) alleles were screened out.

Figure 4. Changes in N̂e as a function of the criterion for excluding rare alleles (PCRIT). Each line shows data for a sample of S = 100 from one of the 10 subpopulations in a single metapopulation connected by rare migration (m = 0.001, as shown in Figure 3). The three dashed blue lines are the populations in which exactly one first-generation immigrant was detected (N̂e depressed only for PCRIT = 0.01). The red line is a population that appears to include a number of descendants of recent immigrants.

Results discussed so far used relatively large sample sizes (S = 100 individuals). Figure 5 shows that the biases discussed above are magnified with smaller samples: for low migration (m ≤ 0.01), N̂e is a smaller fraction of N as S decreases, and for high migration (m ≥ 0.1) N̂e rises more sharply compared to N for smaller S. It is worth noting that with S = 50, alleles carried in a homozygous state by a single immigrant will not be screened out at PCRIT = 0.02, and with S = 25 the same criterion would include any allele that occurs in even a single copy in the sampled individuals. Waples and Do (2010) found that inclusion of singleton alleles was associated with upwardly biased estimates of Ne and suggested adjusting PCRIT according to sample size to exclude alleles found in only a single copy. Application of this rule would reduce some of the biases seen in Figure 5.
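The interaction between sample size and the rare-allele criterion follows from simple counting: a diploid sample of S individuals holds 2S gene copies, so an allele present in k copies has frequency k/(2S). The sketch below works through the cases mentioned above. The function names are illustrative, and the retention rule (keep an allele when its frequency is not below PCRIT) is taken from the description "excludes alleles at frequency <0.02"; it is not code from LDNe.

```python
from fractions import Fraction


def allele_frequency(copies, S):
    """Frequency of an allele seen in `copies` gene copies among S diploids."""
    return Fraction(copies, 2 * S)


def is_retained(copies, S, pcrit):
    """True if the allele survives screening, i.e. its frequency is not < pcrit."""
    # Fraction(str(...)) avoids binary floating-point surprises at the boundary.
    return allele_frequency(copies, S) >= Fraction(str(pcrit))


def singleton_frequency(S):
    """Frequency of a single-copy allele; PCRIT must exceed this to drop singletons."""
    return Fraction(1, 2 * S)


if __name__ == "__main__":
    for S in (100, 50, 25):
        print(f"S={S:3d}  singleton freq={float(singleton_frequency(S)):.3f}  "
              f"single copy kept at 0.02: {is_retained(1, S, 0.02)}  "
              f"homozygous immigrant kept at 0.02: {is_retained(2, S, 0.02)}")
```

Exact rational arithmetic is used here because the cases of interest sit right on the threshold (for example 2/100 versus 0.02), where floating-point comparison could be misleading.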

Figure 5. The ratio N̂e/N as a function of the migration rate (m) among subpopulations. Local subpopulation size (N) was 100 ideal individuals. Values shown are based on harmonic mean N̂e calculated using data for 20 loci assayed in S = 25–100 individuals.

Nonequilibrium migration:

Pulse migration at 10 times the equilibrium rate led to substantial biases in N̂e, with the direction of bias depending on whether immigrants were genetically divergent (Figure 6). When background (equilibrium) migration was low enough to lead to strong genetic differences between populations, 10× pulse migration depressed N̂e to a fraction of the local Ne. Conversely, when genetic differentiation was low due to high background migration, a sudden influx of large numbers of immigrants inflated the estimate of local Ne, reflecting the reality that parents from throughout the metapopulation contributed offspring to the sample. Pulse migration at twice the equilibrium rate had parallel but much more modest effects (Figure 6).

Figure 6. Effects of nonequilibrium (pulse) migration on estimates of local Ne for simulated “island model” metapopulations with n = 10 and true local Ne = 100. After simulations reached migration–drift equilibrium, a single generation of pulse migration occurred at a level 2 or 10 times the equilibrium rate m, after which samples of S = 50 individuals were taken for genetic analysis. Values shown are harmonic mean N̂e across 100 replicate subpopulations.

Joint estimates of m and Ne:

With equilibrium migration at m = 0.05 in an n = 10, Ne = 100 metapopulation and sample sizes of S = 50, N̂e from Estim was downwardly biased (harmonic mean = 68) and had a multimodal distribution, with 25% of the estimates below 50, 13% between 125 and 225, and 26% infinite (Figure 7). In contrast, LDNe estimates had a unimodal distribution with a moderate upward bias (harmonic mean = 121, range 62–790, 73% of estimates between 50 and 150). Simulations using the same parameters but allowing up to 40 alleles per locus and running for 2000 generations before collecting data produced nearly identical Estim results: harmonic mean = 72, 24% of estimates below 50, and 28% infinite. LDNe performed better with the 40-allele data sets, whose greater number of allelic comparisons provided enhanced precision: harmonic mean = 116, and 100% of estimates fell in the range [50–300] (data not shown). When the subpopulations were completely isolated (m = 0), the Estim estimates of Ne were strongly upwardly biased and sensitive to assumed mutation rate: harmonic mean = 149 assuming u = 5 × 10^−4 (the value used in the simulations) and harmonic mean = 360 assuming u = 10^−6 (default value in Estim) (data not shown).

Figure 7. Distribution of N̂e for simulated data using LDNe and Estim (Vitalis and Couvet 2001). An island model of equilibrium migration was simulated, with n = 10, local Ne = 100, m = 0.05, S = 50, and 20 loci. The Estim estimates assumed that the mutation rate was 5 × 10^−4, the value used in the simulations. The last bin on the right includes all estimates >400. The arrows indicate harmonic mean N̂e for the two methods.

Estim also provides estimates of migration rate, which are not sensitive to assumed mutation rate. Mean estimated m was 0.01 for the isolation scenario and 0.11 for the m = 0.05 scenario. These mean values omitted replicates for which m could not be estimated because N̂e was infinite (this excluded 51% of the replicates for true m = 0 and 26% of the replicates for true m = 0.05) (data not shown).


All analyses were performed using genotypes generated in previous work. Therefore, for this study, no animal ethics approval was requested because no new animals were sampled.

Animals used in this study (Table 1) were part of a large experimental Australian population[7] that includes the three main cattle types: Bos taurus breeds (Angus, Hereford, Limousin and Shorthorn), Bos indicus (Brahman) and composite cattle (Tropical Composite, Santa Gertrudis and Belmont Red). To confirm our findings, genotyping data from each cattle type (Angus, Brahman and Santa Gertrudis) were sourced from the Bovine HapMap consortium[3].

All animals were genotyped using the BovineHD SNP chip (Illumina, San Diego), which includes 777 962 markers. Quality control and imputation of missing data in the Australian sample followed the pipeline described by Bolormaa et al.[8]. Briefly, stringent filters were applied to each SNP (call rate, duplicated map position, extreme departure from Hardy-Weinberg equilibrium), resulting in 729 068 informative SNPs. Missing genotypes were imputed within each breed type using 30 iterations of the BEAGLE software[9]. Genotypes for the same set of SNPs were extracted from the Bovine HapMap dataset[10] but missing genotypes were not imputed. LD between each pair of SNPs was measured as r², which is less susceptible to bias due to differences in allelic frequency[4]. LD and within-breed genetic diversity (heterozygosity and proportion of polymorphic SNPs) were calculated using PLINK v1.07[11]. For the X chromosome, two scenarios were explored: one including all markers, and the second including only fairly polymorphic markers with a minor allele frequency (MAF) greater than 0.1 in all breeds.
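For readers unfamiliar with these summary statistics, the following is a minimal sketch of how pairwise r², observed heterozygosity and MAF can be computed from unphased genotypes coded as 0/1/2 copies of one allele. It uses the squared Pearson correlation of allele counts, a common approximation for unphased data, and is an assumption-laden illustration rather than a reproduction of PLINK's implementation; all function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two SNPs' allele counts (0/1/2)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return sxy ** 2 / (sxx * syy)


def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying exactly one copy of the counted allele."""
    return sum(1 for g in genotypes if g == 1) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among the 2 * len(genotypes) gene copies."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1.0 - p)


if __name__ == "__main__":
    # Toy genotypes for two SNPs in six animals (made-up values).
    snp_a = [0, 1, 2, 1, 0, 2]
    snp_b = [0, 1, 2, 2, 0, 1]
    print("r^2 :", round(r_squared(snp_a, snp_b), 3))
    print("Het :", observed_heterozygosity(snp_a))
    print("MAF :", minor_allele_frequency(snp_a))
```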

Author information


INRA, UMR 1332 de Biologie du Fruit et Pathologie, F-33140, Villenave d’Ornon, France

José Antonio Campoy, Emilie Lerigoleur-Balsemin, Hélène Christmann, Rémi Beauvieux, José Quero-García, Elisabeth Dirlewanger & Teresa Barreneche

University Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, F-33140, Villenave d’Ornon, France

José Antonio Campoy, Emilie Lerigoleur-Balsemin, Hélène Christmann, Rémi Beauvieux, José Quero-García, Elisabeth Dirlewanger & Teresa Barreneche

Current address: CNRS, UMR 5602 GEODE, Géographie de l’environnement, F-31058, Toulouse, France

INRA, UAR 0415 SDAR, Services Déconcentrés d’Appui à la Recherche, F 33140, Villenave d’Ornon, France

Current address: INRA, ISVV, UMR Ecophysiologie et Génomique Fonctionnelle de la Vigne, F 33140, Villenave d’Ornon, France


