Maurice Ling's Professional Portfolio - Research Portfolio:

   R & D Summary

    Muscorian - Mining Biomedical Literature for Protein-Protein Interactions


Muscorian is a text analysis pipeline that gathers and analyzes PubMed abstracts to extract assertions about entity-entity interactions using natural language processing (NLP) and statistical co-occurrence of entity terms. Entity terms may refer to protein names or gene names; hence, entity-entity interactions can be gene-gene, gene-protein, or protein-protein interactions. Muscorian also includes the construction and analysis of protein-protein interaction maps, which are fundamental for interactomic analysis. However, most of the components used in Muscorian needed individual analysis and benchmarking; the Biological Corpus Collection (BioCorpus) provides the test data and benchmarking tools. This work was initiated in the Department of Zoology, The University of Melbourne, Australia, under the financial sponsorship of the Cooperative Research Centre for Innovative Dairy Products (Dairy CRC), Australia, as part of my Doctor of Philosophy thesis.

Annotated Publications from this Project

Ling, MHT, Lefevre, C, Nicholas, KR, Lin, F. 2007. Re-construction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. In J.C. Rajapakse, B. Schmidt, and G. Volkert (Eds.), Proceedings of the Second IAPR Workshop on Pattern Recognition in Bioinformatics (PRIB 2007). Lecture Notes in Bioinformatics 4774. (pp. 286-299) Springer-Verlag. [Abstract] [PDF]

Prior to this project, it was generally assumed in the NLP field that biomedical text is domain-specific and that generic-domain tools require a certain degree of adaptation before they are of use. Muscorian refuted this assumption by demonstrating that an un-adapted generic text processor can perform comparably to adapted tools. At the same time, the un-adapted text processor forms a generalized layer that transforms unstructured text into a structured table of subject-verb-object triplets, on which question-specific tools can be built. This study also demonstrates the flexibility of this generalization-specialization paradigm by using the same generalized layer for 2 specialized questions.
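
The generalization-specialization idea can be illustrated with a minimal Python sketch. This is not Muscorian's code: the table rows, the entity dictionary, the placeholder names (protA, protB, ...), the two example verbs, and the function query are all hypothetical. It only shows how one shared subject-verb-object table could serve more than one question-specific filter.

```python
from typing import NamedTuple


class SVO(NamedTuple):
    subject: str
    verb: str
    obj: str


# Hypothetical rows standing in for the output of the generalized layer.
SVO_TABLE = [
    SVO("protA", "activate", "protB"),
    SVO("protB", "bind", "protC"),
    SVO("protC", "activate", "protA"),
    SVO("mice", "receive", "treatment"),  # not an entity-entity assertion
]

# Hypothetical dictionary of recognized gene/protein names.
ENTITIES = {"protA", "protB", "protC"}


def query(table, verb, entities):
    """Return (subject, object) pairs joined by `verb` where both are known entities."""
    return [
        (row.subject, row.obj)
        for row in table
        if row.verb == verb and row.subject in entities and row.obj in entities
    ]


# Two specialized questions answered from the same generalized table.
binding = query(SVO_TABLE, "bind", ENTITIES)
activation = query(SVO_TABLE, "activate", ENTITIES)
```

Only the filter changes between the two questions; the table itself is built once and reused.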

Ling, MHT, Lefevre, C, Nicholas, KR. 2008. A Case Study where Parts-of-Speech Tagging Error Does Not Adversely Affect Extraction of Protein-Protein Interactions from Text. The Python Papers 3 (1): 65-80 [Abstract] [PDF]

This manuscript investigates why an un-adapted text processor can perform comparably to adapted tools. It was found that although the parts-of-speech (POS) tagging accuracy of an un-adapted text processor is lower than that of specialized tools, this has minimal effect on the transformation to subject-verb-object structures because of complementary POS tag use in shallow parsing (breaking down sentences into phrases), thus supporting our previous findings.
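
A toy Python sketch may help to show what complementary POS tag use means. It is an assumption-laden illustration, not the chunker evaluated in the paper: the tag names follow the Penn Treebank convention, while the functions chunk and svo, the grouping rule, and the placeholder protein names are hypothetical. Because the chunker treats every noun tag alike, tagging a protein name as a common noun (NN) instead of a proper noun (NNP) yields the same phrases and the same triplet.

```python
# Penn Treebank noun and verb tags; each set is treated as interchangeable by the chunker.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def chunk(tagged):
    """Group consecutive (word, tag) tokens into NP, VP, or O (other) chunks."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NOUN_TAGS else "VP" if tag in VERB_TAGS else "O"
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)  # extend the current chunk
        else:
            chunks.append((label, [word]))  # start a new chunk
    return [(label, " ".join(words)) for label, words in chunks]


def svo(tagged):
    """Return the first NP-VP-NP sequence as a (subject, verb, object) triplet."""
    phrases = [c for c in chunk(tagged) if c[0] in ("NP", "VP")]
    for i in range(len(phrases) - 2):
        (a, subj), (b, verb), (c, obj) = phrases[i:i + 3]
        if (a, b, c) == ("NP", "VP", "NP"):
            return (subj, verb, obj)
    return None


# Same sentence, tagged as a specialized tagger might and as a generic tagger might.
specialized = [("protA", "NNP"), ("activates", "VBZ"), ("protB", "NNP")]
generic = [("protA", "NN"), ("activates", "VBZ"), ("protB", "NN")]
```

Both taggings give the triplet ("protA", "activates", "protB"), even though the generic tagging would count as two tagging errors in a token-level accuracy measure.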

Ling, MHT, Lefevre, C, Nicholas, KR. 2008. Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research. The Python Papers 3(3): 4. [Abstract] [PDF]

Besides NLP, statistical linguistics, which depends on the appearance of words or names in text, has been used to extract potential protein-protein interactions, as in the cases of PubGene and CoPub Mapper. In the case of PubGene, it was found that the presence of 2 protein names in 1 abstract out of 10 million (1-PubGene) suggests a 60% likelihood of interaction, which increases to 72% when the names appear together 5 times or more (5-PubGene). This manuscript analyzed the PubGene methods using the Poisson distribution and found that 1-PubGene is generally more stringent than 99% confidence on the Poisson distribution, thus explaining 1-PubGene's expectedly good performance. This study demonstrated that NLP-extracted interactions were almost a proper subset of statistically extracted interactions, suggesting that NLP can be used to annotate statistical extractions. This study also found that a majority of co-expressed genes from microarray analysis, including 7 pairs of perfectly co-expressed genes, were not mentioned in text, suggesting that these potential interactions had not been studied experimentally. Hence, we suggest that text mining may be used to construct a "state of current knowledge" suitable for identifying potential hypotheses for further experimental research.
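
The reasoning behind the Poisson comparison can be sketched in Python. This is an illustration under stated assumptions and not necessarily the manuscript's exact formulation: it assumes the two names occur independently, so the expected number of co-occurring abstracts is n_a * n_b / n_total, and the per-name abstract counts of 100 and 1000 are hypothetical. Only the corpus size of 10 million abstracts comes from the text above.

```python
import math


def expected_cooccurrence(n_a, n_b, n_total):
    """Expected number of abstracts mentioning both names if they occur independently."""
    return n_a * n_b / n_total


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def min_count(lam, confidence=0.99):
    """Smallest co-occurrence count k whose chance probability falls below 1 - confidence."""
    k = 1
    while poisson_tail(k, lam) >= 1.0 - confidence:
        k += 1
    return k


N_ABSTRACTS = 10_000_000  # corpus size quoted in the text

# Hypothetical rare names: each appears in 100 abstracts.
lam_rare = expected_cooccurrence(100, 100, N_ABSTRACTS)
p_rare = poisson_tail(1, lam_rare)

# Hypothetical commoner names: each appears in 1000 abstracts.
lam_common = expected_cooccurrence(1000, 1000, N_ABSTRACTS)
```

For the rare names, a single shared abstract already has a chance probability below 0.01, so the 1-PubGene criterion is stricter than the 99% threshold. For the commoner names, min_count(lam_common) is 2, which is one way to read "generally" in the finding above.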

Kuo, CJ, Ling, MHT, Lin, KT, Hsu, CN. 2009. BIOADI: A Machine Learning Approach to Identify Abbreviations and Definitions in Biological Literature. BMC Bioinformatics 10(Suppl 15):S7. [Full Text] [PDF]

This manuscript deals with a limitation identified in my doctoral thesis: gene/protein names and their abbreviations were identified using a dictionary rather than in real time from the text. We identified about 1.7 million unique long form/abbreviation pairs in the entire PubMed with 95.86% precision and 89.9% recall at an average computational speed of 10.2 seconds per thousand abstracts. BIOADI is also a standalone tool that can be incorporated into an analysis pipeline. This study also contributed an annotated corpus to the community for tool evaluation purposes.
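
For readers less familiar with the evaluation measures, the following Python sketch shows how precision and recall are computed from counts and how they combine into the balanced F-score. The function names and the example counts are hypothetical. The F-score computed from the two quoted figures is plain arithmetic on those figures, not a value taken from the manuscript.

```python
def precision(true_pos, false_pos):
    """Fraction of extracted pairs that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of correct pairs that were extracted."""
    return true_pos / (true_pos + false_neg)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 90 correct pairs found, 10 wrong pairs found, 30 correct pairs missed.
example_p = precision(90, 10)
example_r = recall(90, 30)

# Arithmetic on the figures quoted above (95.86% precision, 89.9% recall).
quoted_f1 = f1(0.9586, 0.899)
```

With the quoted figures, quoted_f1 comes to roughly 0.928.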

Ling, MHT, Lefevre, Christophe, Nicholas, Kevin R. 2010. Biomedical Literature Analysis: Current State and Challenges. To appear in Columbus, Frank (ed). Text Mining: Software, Applications and Implications. Nova Science Publishers, Inc.

This manuscript reviews the central (information retrieval, information extraction and text mining) and allied (corpus collection, databases and system evaluation methods) computational domains to present the current state of biomedical literature analysis for protein-protein and protein-gene interactions, and the challenges ahead. Firstly, biomedical text mining is highly dependent on PubMed (MedLine) as a text repository, but neither its implementation details nor its performance in terms of precision and recall are known. Secondly, extraction of interactions depends on the recognition of entity (protein and gene) names in text, and whether different names refer to the same protein remains an open problem. Thirdly, extraction of interactions by co-occurrence and by NLP has been shown to be complementary, suggesting that future systems can be improved by combining the two. Fourthly, evidence suggests that generic NLP engines may be able to process text for interaction extraction due to complementary POS tag use in the shallow parsing process, but more extensive evaluations are needed. Fifthly, there is a shortage of suitable corpora for system evaluation, which makes comparison difficult (because different corpora or databases are used in evaluation) and calls for the collection of a common set of corpora for communal use. Lastly, biomedical literature analysis tools must demonstrate real-world applications without a steep learning curve before the slow adoption of these tools by biologists (the intended users) can be reversed.

Ling, MHT. 2009. Understanding Mouse Lactogenesis by Transcriptomics and Literature Analysis. Doctor of Philosophy. Department of Zoology, The University of Melbourne, Australia. [Full Text]

This thesis was advised by Professor Kevin R. Nicholas (currently at Deakin University, Australia) and co-advised by Associate Professors Christophe Lefevre (currently at Deakin University, Australia) and Feng Lin (currently at Nanyang Technological University, Singapore). This thesis refuted the previous assumption that a generic computational linguistics processor is unable to process biomedical text due to domain-specificity, and attributed this result to complementary parts-of-speech tag use in the shallow parsing (breaking down sentences into phrases) process. This thesis confirmed that the subject-verb-object structure is a suitable intermediate for extracting protein-protein interactions from text and demonstrated the flexibility of this technique in information extraction. This thesis demonstrated that information extraction by computational linguistics can supplement information extraction by statistical co-occurrence. Using computational and statistical information extraction, a filter representing the current state of biological knowledge was built to be used with microarray analysis for identifying potential novel hypotheses for further research. This thesis also examined the relevance of hormone-treated mouse mammary tissue culture in studying mouse lactogenesis by comparing the transcriptomes of cultured tissues with those of in vivo mammary tissues across the lactation cycle using Affymetrix microarrays. It concluded that the tissue culture is useful for studying primary hormonal responses but is unlikely to be useful for studying sustained responses, and that the tissue culture is a useful tool to “re-construct” the set of hormonal stimuli required to stimulate mouse mammary tissues into lactogenesis.