Muscorian is a
text analysis pipeline for gathering and analyzing PubMed
abstracts to extract assertions about entity-entity interactions using natural language
processing (NLP) and statistical co-occurrence of entity terms. Entity
terms may refer to protein names or gene names; hence, entity-entity
interactions can be gene-gene, gene-protein, or protein-protein
interactions. Muscorian also includes the construction and analysis of
protein-protein interaction maps. These interaction maps are fundamental
for interactomic analysis. However, most of the components used in
Muscorian needed individual analysis and benchmarking, and the
Biological Corpus Collection
(BioCorpus) provides the test data and benchmarking tools for this purpose. This work was initiated in the Department of
Zoology, The University of Melbourne, Australia, under the financial
sponsorship of the Cooperative Research Centre for Innovative Dairy Products
(Dairy CRC), Australia, as part of my Doctor of Philosophy thesis.
Annotated Publications from this Project
Ling, MHT, Lefevre, C, Nicholas, KR, Lin, F. 2007.
Re-construction of Protein-Protein Interaction Pathways by Mining
Subject-Verb-Objects Intermediates. In J.C. Rajapakse, B. Schmidt, and G. Volkert (Eds.),
Proceedings of the Second IAPR Workshop on Pattern Recognition in
Bioinformatics (PRIB 2007). Lecture Notes in Bioinformatics 4774 (pp.
286-299). Springer-Verlag. [Abstract] [PDF]
Prior to this project, it was generally assumed in the NLP field that
biomedical text is domain-specific and that generic-domain tools require some
degree of adaptation before they are of use on it. Muscorian refuted this
assumption by demonstrating that an un-adapted generic text processor can
perform comparably to adapted tools. At the same time, the
un-adapted text processor forms a generalized layer that transforms
unstructured text into a structured table of subject-verb-object triples, on
which question-specific tools can be built. This study also
demonstrated the flexibility of this generalization-specialization
paradigm by using the same generalized layer for two specialized
questions.
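The generalization-specialization idea can be sketched in a few lines of Python. This is only an illustration, not Muscorian's actual code: the `SVO` type, the rows in `SVO_TABLE`, the entity names, and the two example questions (binding and activation) are all hypothetical placeholders.

```python
from typing import NamedTuple


class SVO(NamedTuple):
    """One row of the (hypothetical) subject-verb-object table."""
    subject: str
    verb: str
    obj: str


# Hypothetical output of the generalized layer; the names are placeholders.
SVO_TABLE = [
    SVO("proteinA", "bind", "proteinB"),
    SVO("proteinC", "activate", "proteinD"),
    SVO("patients", "receive", "treatment"),
    SVO("proteinB", "activate", "proteinC"),
]

# Hypothetical dictionary of recognized entity names.
ENTITIES = {"proteinA", "proteinB", "proteinC", "proteinD"}


def extract(table, verbs, entities):
    """Specialized layer: keep rows with a verb of interest whose
    subject and object are both recognized entities."""
    return [
        (row.subject, row.obj)
        for row in table
        if row.verb in verbs and row.subject in entities and row.obj in entities
    ]


# Two specialized questions asked of the same generalized table.
binding = extract(SVO_TABLE, {"bind"}, ENTITIES)
activation = extract(SVO_TABLE, {"activate"}, ENTITIES)
```

Under these assumptions, only the set of verbs changes between the two questions; the table itself is produced once and reused.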
Ling, MHT, Lefevre, C, Nicholas, KR.
2008. A Case Study where Parts-of-Speech Tagging Error Does Not
Adversely Affect Extraction of Protein-Protein Interactions from Text.
The Python Papers 3 (1): 65-80. [Abstract] [PDF]
This manuscript investigates why an un-adapted
text processor can perform comparably to adapted tools. It was found
that although an un-adapted text processor's parts-of-speech (POS)
tagging accuracy is lower than that of specialized tools, the lower accuracy
has minimal effect on the transformation to subject-verb-object structures
because of complementary POS tag use in shallow parsing (breaking down
sentences into phrases). This supports our previous findings.
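One way to read "complementary POS tag use" is that a shallow parser treats several tags as interchangeable when forming a phrase, so a tagging error that stays within the same group does not change the resulting phrases. The toy chunker below illustrates that reading only; the tag groups, the `chunk` function, and the example sentence are assumptions and do not reproduce the parser used in Muscorian.

```python
# Hypothetical tag groups: tags in the same group are treated alike by the chunker.
NOUN_PHRASE_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
VERB_PHRASE_TAGS = {"VB", "VBD", "VBZ", "VBP", "MD"}


def chunk(tagged_words):
    """Group consecutive (word, tag) pairs into (label, [words]) chunks."""
    chunks = []
    for word, tag in tagged_words:
        if tag in NOUN_PHRASE_TAGS:
            label = "NP"
        elif tag in VERB_PHRASE_TAGS:
            label = "VP"
        else:
            label = "O"
        if chunks and chunks[-1][0] == label:
            # Same phrase type as the previous word: extend the current chunk.
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks


# The same sentence with "correct" tags and with two tagging errors
# (NNP -> NN and NNP -> JJ) that stay inside the noun-phrase group.
correct = [("proteinA", "NNP"), ("phosphorylates", "VBZ"), ("proteinB", "NNP")]
mistagged = [("proteinA", "NN"), ("phosphorylates", "VBZ"), ("proteinB", "JJ")]

same_chunks = chunk(correct) == chunk(mistagged)
```

In this sketch both taggings yield the same noun phrase, verb phrase, noun phrase sequence, so the subject-verb-object structure built on top of the chunks would be unaffected.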
Ling, MHT, Lefevre, C, Nicholas, KR.
2008. Filtering Microarray Correlations by Statistical Literature Analysis Yields
Potential Hypotheses for Lactation Research. The Python Papers 3(3): 4. [Abstract] [PDF]
Besides NLP, statistical linguistics, which depends on the appearance of words or
names in text, has been used to extract potential protein-protein interactions,
as in the case of PubGene and CoPub Mapper. For PubGene, it was
found that the presence of 2 protein names in 1 abstract out of 10 million
(1-PubGene)
suggests a 60% likelihood of interaction, which increases to 72% when the names
appear together 5 times or more (5-PubGene). This manuscript analyzed
the PubGene methods using the Poisson distribution and found that 1-PubGene
is generally more stringent than 99% confidence under the Poisson
distribution, which explains 1-PubGene's good
performance. This study demonstrated that NLP-extracted interactions
were almost a proper subset of statistically extracted interactions, suggesting
that NLP can be used to annotate statistical extractions. This study
also found that a majority of co-expressed genes from microarray
analysis, including 7 pairs of perfectly co-expressed genes, were
not mentioned in text, suggesting that these potential interactions
had not been studied experimentally. Hence, we suggest that text
mining may be used to construct a "state of current knowledge"
suitable for identifying potential hypotheses for further experimental
research.
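To see why a single co-occurrence can already exceed 99% confidence, consider a minimal Poisson sketch. It assumes that the two names occur independently, so the expected number of co-occurring abstracts is the product of the two abstract counts divided by the corpus size. This null model, the function names, and the example counts are illustrative assumptions and may differ from the formulation used in the manuscript.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - cdf_below_k


def cooccurrence_p_value(n_a, n_b, n_total, observed):
    """P-value of seeing at least `observed` co-occurrences by chance,
    assuming names A and B are spread independently over the abstracts."""
    expected = n_a * n_b / n_total  # Poisson mean under independence
    return poisson_tail(observed, expected)


# Hypothetical counts: name A in 100 abstracts, name B in 200 abstracts,
# out of 10 million abstracts, observed together in 1 abstract.
p_value = cooccurrence_p_value(100, 200, 10_000_000, 1)
significant_at_99 = p_value < 0.01
```

With these made-up counts the expected number of chance co-occurrences is 0.002, so even one shared abstract is far below the 1% threshold.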
Kuo, CJ, Ling, MHT, Lin, KT, Hsu, CN.
2009. BIOADI: A Machine Learning Approach to Identify Abbreviations and
Definitions in Biological Literature. BMC Bioinformatics 10(Suppl 15):S7. [Full Text] [PDF]
This manuscript deals with a limitation identified in my doctoral thesis:
the need for real-time identification of gene/protein names and their abbreviations in text, instead of
the dictionary approach used in my thesis. We identified about 1.7 million unique long form/
abbreviation pairs in the entire PubMed with 95.86% precision and 89.9% recall at an average
computational speed of 10.2 seconds per thousand abstracts. BIOADI is also
a standalone tool that can be incorporated into an analysis pipeline. This study also contributed
an annotated corpus to the community for tool evaluation purposes.
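For readers unfamiliar with the two measures, precision and recall over abbreviation/long form pairs can be computed as sketched below. The `precision_recall` helper and the small gold and predicted sets are made up for illustration; they are not taken from BIOADI or its corpus.

```python
def precision_recall(predicted, gold):
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard annotations (abbreviation, long form).
gold_pairs = {
    ("PCR", "polymerase chain reaction"),
    ("NLP", "natural language processing"),
    ("POS", "parts-of-speech"),
    ("PPI", "protein-protein interaction"),
}

# Hypothetical tool output: two correct pairs and one wrong pair.
predicted_pairs = {
    ("PCR", "polymerase chain reaction"),
    ("NLP", "natural language processing"),
    ("POS", "positive"),
}

precision, recall = precision_recall(predicted_pairs, gold_pairs)
```

In this toy case two of the three predictions are correct and two of the four gold pairs are found, giving a precision of 2/3 and a recall of 1/2.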
Ling, MHT, Lefevre, C, Nicholas, KR. 2010.
Biomedical Literature Analysis: Current State and Challenges. To
appear in Columbus, Frank
(ed). Text Mining: Software, Applications and Implications. Nova Science
Publishers, Inc.
This manuscript reviews the central (information
retrieval, information extraction and text mining) and allied (corpus
collection, databases and system evaluation methods) computational
domains to present the current state of biomedical literature
analysis for protein-protein and protein-gene interactions and the
challenges ahead. Firstly, biomedical text mining is highly dependent
on PubMed (MedLine) as a text repository, but neither its implementation
details nor its performance in terms of precision and recall are known.
Secondly, extraction of interactions depends on the recognition of
entity (protein and gene) names in text, and whether different names
refer to the same protein remains an open problem. Thirdly, extraction
of interactions by co-occurrence and by NLP has been shown to be
complementary, suggesting that future systems can be improved in this
direction. Fourthly, evidence suggests that generic NLP engines may be
able to process text for interaction extraction due to complementary
POS tag use in the shallow parsing process, but more extensive evaluations
are needed. Fifthly, there is a shortage of suitable corpora for system
evaluation, which makes comparison difficult (due to different
corpora or databases used in evaluation) and prompts the collection of a
common set of corpora for communal use. Lastly, biomedical literature
analysis tools must demonstrate real-world applications without a steep
learning curve before the slow adoption of these tools by biologists
(the intended users) can be reversed.
Ling, MHT. 2009.
Understanding Mouse Lactogenesis by Transcriptomics and Literature Analysis.
Doctor of Philosophy. Department of Zoology, The University of
Melbourne, Australia. [Full Text]
This thesis was
advised by Professor Kevin R. Nicholas (currently at Deakin University,
Australia) and co-advised by Associate Professors Christophe Lefevre
(currently at Deakin University, Australia) and Feng Lin (currently at
Nanyang Technological University, Singapore). This thesis refuted the
previous assumption that a generic computational linguistics processor is
unable to process biomedical text due to domain-specificity, and
attributed the processor's success to complementary parts-of-speech tag use in the shallow
parsing (breaking down sentences into phrases) process. This thesis
confirmed that the subject-verb-object structure is a suitable intermediate
for extracting protein-protein interactions from text and demonstrated
the flexibility of this technique in information extraction. It also
demonstrated that information extraction by computational linguistics
can supplement information extraction by statistical co-occurrence.
Using computational and statistical information extraction, a filter
representing the current state of biological knowledge was built to be
used with microarray analysis for identifying potential novel hypotheses
for further research. This thesis also examined the relevance of mouse
hormone-treated mammary tissue culture in studying mouse lactogenesis by
comparing the transcriptomes of cultured tissues with those of
in vivo
mammary tissues across the lactation
cycle using Affymetrix microarrays. It concluded that the tissue culture
is useful for studying primary hormonal responses but is unlikely to
be useful for studying sustained responses, and that the tissue culture is a
useful tool to “re-construct” the set of hormonal stimuli required to
stimulate mouse mammary tissues into lactogenesis.