Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gene with Expansion & Contraction #50

Open
Johnsonzcode opened this issue Jan 11, 2025 · 6 comments
Open

gene with Expansion & Contraction #50

Johnsonzcode opened this issue Jan 11, 2025 · 6 comments
Labels
question Further information is requested

Comments

@Johnsonzcode
Copy link

Hi @sinamajidian
I am finishing this pipeline with 8 species and got the result from the HTML
image

so how do I fetch the genes with Expansion & Contraction from the result files?
I am a newbie in this field, thank you for your time!

@sinamajidian
Copy link
Collaborator

Hi! please take a look at this tutorial notebook and let me know if you have any questions.

@Johnsonzcode
Copy link
Author

The tutorial you sent me seems like report.html, but it didn't show me how to retrieve the genes belonging to gains/duplications/losses for a specific species.
I can find the code like this

phylostratigraphy = os.path.join(output_folder, "phylostratigraphy.html")
treeprofile = ham_analysis.create_tree_profile(outfile=phylostratigraphy)

from IPython.display import IFrame
IFrame(os.path.basename(phylostratigraphy), width=800, height=600)

but it only shows the result of the genes' number of gains/duplications/losses.

I would like to retrieve the genes belonging to gains/duplications/losses for a specific species.

Thank you !

@Johnsonzcode
Copy link
Author

After exploring the pyHam, I found following code would help me, but I can't still get duplicated genes and lost genes

from pyham import Ham

# 加载物种树和 OrthoXML 文件
species_tree_path = "species_tree_checked.nwk"
orthoxml_path = "FastOMA_HOGs.orthoxml"

# 初始化 pyHAM 分析对象
ham_analysis = Ham(species_tree_path, orthoxml_path)

# 获取目标物种的基因组对象
target_species = "Anas_platyrhynchos"
target_genome = ham_analysis.get_extant_genome_by_name(target_species)

# 获取祖先基因组对象
ancestral_genome = ham_analysis.get_ancestral_genome_by_name("Anas_platyrhynchos/Anser_cygnoides")

# 比较基因组并获取进化事件的基因名称
comparison = ham_analysis.compare_genomes_vertically(target_genome, ancestral_genome)

# 获取基因扩增事件的基因名称
gained_genes = comparison.get_gained()
gained_gene_names = [gene.prot_id for gene in gained_genes]

for gene_id in gained_gene_names:
    print(gene_id)

But when I tried the same way about duplicated genes and lost genes, It didn't work.

>>> duplicated_gene_names = [gene.prot_id for gene in duplicated_genes]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'HOG' object has no attribute 'prot_id'

Why did the same type of result, in the same way didn't work?

@sinamajidian sinamajidian added the question Further information is requested label Jan 13, 2025
@sinamajidian
Copy link
Collaborator

Sorry I should have mentioned the specific location in the notebook.
There are few points here to clarify here. Firstly these are definitions of reported values in the pyham phylostratigraphy:

- lost: numbers of HOGs that have been lost in between this node and its parent.
- gain: numbers of HOGs that have "emerged" at this node.
- retained: numbers of HOGs that have stay retained (in term of copy numbers) in between this node and its parent.
- - dupl: numbers of HOGs that have arose by duplication in between this node and its parent. 

we can think of a HOG as an ancestral gene. When there are 10 HOGs in an ancestral (internal) node, it means the ancestral genome has 10 ancestral genes.

Consider comparing two ancestral nodes:

ancestral_genome_10 = ham.get_ancestral_genome_by_name("internal_NODE_10")
ancestral_genome_11 = ham.get_ancestral_genome_by_name("internal_NODE_11")
vertical_ = ham.compare_genomes_vertically(ancestral_genome_10, ancestral_genome_11) # The order doesn't matter!
gained= vertical_.get_gained()
lost= vertical_.get_lost()
duplicated= vertical_.get_duplicated()
print(list(gained)[0], list(dup)[0], list(duplicated)[0])

<HOG(HOG:0000616_11)> <HOG(HOG:0014967.1c.5b_59)> <HOG(HOG:0000037.1d.9b.2a_59)>

So, all the elements are HOGs. Note that a hog has descendant genes that can be accessed with [descendant_gene.prot_id for descendant_gene in hogi.get_all_descendant_genes()].

Now let's compare an extant genome with ancestral genome (similar to your code).

ancestral_genome_10= ham.get_ancestral_genome_by_name("internal_node10")
extant_genome = ham.get_extant_genome_by_name("extant_human")
comparison = ham.compare_genomes_vertically(ancestral_genome_10,extant_genome)
gained= comparison.get_gained()
lost= comparison.get_lost()
duplicated= comparison.get_duplicated()
print(list(gained)[0], list(dup)[0], list(duplicated)[0])

Gene(1000000022)  <HOG(HOG:0014967.1c.5b_59)>  <HOG(HOG:0014967.1c.5b_59)>

The gained ones correspond to the child node, which is extant species, so they are reported as extant genes.

So, it depends whether both elements of comparison are ancestral genome or only one of them. When one is extant, you can access to its extant genes directly. When both are ancestral genomes, the ancestral genes of each are hog at that level.

Pro tip: number of duplication events is different than number of duplicated genes. A duplication event can results in one or more genes in the pyham framework.comparison.get_duplicated() is a dictionary {<HOG(HOG:0000037.1d.9b.2a_59)>: [<HOG()>, <HOG()>], <HOG(HOG:0000064.4e_59)>: [<HOG()>, <HOG()>], showing such relationship. The number of keys is the number of duplication events. The sum of length of all values (python dic) is the total number of duplicated genes (=number of genes resulting from duplication, reported in phylostratigraphy).

It looks a bit complicated. Please let me know which part is not clear.

@Johnsonzcode
Copy link
Author

OK, I see.
When a gene is not extant in the species, I can access it with [descendant_gene.prot_id for descendant_gene in hogi.get_all_descendant_genes()]. vertical_.get_xx() could only obtain the events number rather than the gene number.

Thank you for your detailed explanation!

@Johnsonzcode
Copy link
Author

By the way, what's the difference between this methodology and CAFE5 when talking about Expansion & Contraction?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants