An Online Mendelian Inheritance in Man Mining Tool.
MimMiner was created by Marc A. van Driel, Jorn Bruggeman, Gert Vriend, Han Brunner, and Jack Leunissen.
Creation of feature vectors:
Text-based information can be retrieved on a keyword basis or, for example, through natural language parsing. Keyword-based techniques treat text as a collection of documents and terms. In this analysis each OMIM record is a document (see figure S1, below). All words in the OMIM records were considered as candidate terms, but we used only those found in the anatomy (A) and disease (C) sections of the Medical Subject Headings vocabulary (MeSH). MeSH provides a standardized way to retrieve information that uses the same concepts but different terminology. The MeSH vocabulary was not designed as a phenotype dictionary, but it includes both anatomy and disease terms, whereas most vocabularies contain only anatomy terms. Furthermore, we chose MeSH because of its size and internal hierarchical structure, which make it a dictionary rich enough to match the OMIM texts.
Each MeSH entry is a collection of terms with synonyms and plurals, called a concept. A concept is uniquely identified by a descriptor. For example, the concept Neuron also contains the synonym Nerve Cell and the plurals Neurons and Nerve Cells, and is identified by the descriptor D009474. The MeSH concepts, rather than the single keywords usually used in keyword-based methods (as in keyword vectors), served as features characterizing OMIM records: every entry in a feature vector represents a MeSH concept.
Each OMIM record was screened for concepts by matching the words in the record with MeSH terms. The number of times the terms of a given concept are found in an OMIM record reflects the concept's relevance to the phenotype. Non-specific concepts like syndrome or disease were excluded. The resulting list of descriptor frequencies per OMIM record constitutes the initial feature vector.
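The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the MimMiner implementation: the dictionary CONCEPTS, the set EXCLUDED, and the function initial_feature_vector are hypothetical names, and only the descriptor D009474 (Neuron) comes from the text; the other descriptor IDs are placeholders.

```python
import re
from collections import Counter

# Hypothetical miniature concept dictionary: descriptor -> terms
# (synonyms and plurals). Only D009474 is taken from the text;
# the X-prefixed IDs are placeholders for illustration.
CONCEPTS = {
    "D009474": ["neuron", "neurons", "nerve cell", "nerve cells"],
    "X000001": ["skin"],
    "X000002": ["syndrome", "syndromes"],
}

# Non-specific concepts that are left out of the feature vector.
EXCLUDED = {"X000002"}


def initial_feature_vector(text, concepts=CONCEPTS, excluded=EXCLUDED):
    """Count, per descriptor, how often any of its terms occurs in text."""
    text = text.lower()
    counts = Counter()
    for descriptor, terms in concepts.items():
        if descriptor in excluded:
            continue
        for term in terms:
            # Word boundaries keep "neuron" from matching inside "neurons".
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[descriptor] += len(re.findall(pattern, text))
    # Keep only the concepts that were actually observed.
    return {d: n for d, n in counts.items() if n > 0}


record = ("Loss of neurons in the cortex; each nerve cell is affected. "
          "Skin is normal in this syndrome.")
vector = initial_feature_vector(record)
```

Here the synonym and the plural both add to the count of the same descriptor, which is the point of using concepts rather than single keywords as features.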
Figure S1: Analysis overview.
The OMIM database was parsed using MeSH. Three correction measures were applied (expansion, inverse document frequency and local weight correction) before the phenomap was computed. All disease phenotypes with a causative gene listed in UniProt were extracted from this phenomap (1653 phenotypes). Phenotype similarity was compared to protein sequence similarity and to relations of the causative genes in PFAM, HPRD, and GO.
For each record we extracted the text (TX) and clinical synopsis (CS, if present) fields from the OMIM database (omim.txt). Each record was screened (case-insensitive) for MeSH concepts (see creation of feature vectors). Some concepts cannot be described by a single word, e.g. cleft palate. In such cases only the longest, most specific term was counted: cleft palate is counted, but the single words cleft and palate are not. A term was not allowed to span two sentences separated by a full stop. Within a sentence, a term was not allowed to span a comma, colon, or semicolon. We also included matches in which the words of a term were swapped, e.g. retardation mental and mental retardation. In theory, the number of concepts per feature vector (one vector per OMIM record) can vary between 1 and 5436 MeSH entries.
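The matching rules above (longest term first, no match across punctuation, swapped word order accepted) can be illustrated with the following sketch. It is written under stated assumptions: the term table TERMS and its concept labels are hypothetical, any permutation of a term's words is treated as a swapped match, and splitting on every full stop is a simplification that would also split abbreviations and decimals.

```python
import re
from collections import Counter

# Hypothetical term -> concept table, for illustration only.
TERMS = {
    "cleft palate": "C_CLEFT_PALATE",
    "palate": "C_PALATE",
    "mental retardation": "C_MENTAL_RETARDATION",
}


def build_index(terms):
    """Index terms by their sorted words so that word order is ignored."""
    return {tuple(sorted(t.lower().split())): c for t, c in terms.items()}


def match_concepts(text, terms=TERMS):
    index = build_index(terms)
    max_len = max(len(key) for key in index)
    counts = Counter()
    # A term may not span a full stop, comma, colon, or semicolon.
    for segment in re.split(r"[.,;:]", text.lower()):
        words = re.findall(r"[a-z0-9'-]+", segment)
        i = 0
        while i < len(words):
            # Try the longest candidate first, so "cleft palate" wins
            # over the single word "palate".
            for n in range(min(max_len, len(words) - i), 0, -1):
                key = tuple(sorted(words[i:i + n]))
                if key in index:
                    counts[index[key]] += 1
                    i += n  # consume the matched words
                    break
            else:
                i += 1  # no term starts at this word
    return dict(counts)


found = match_concepts(
    "Cleft palate and retardation mental were noted. A high palate, cleft lip."
)
```

Because matched words are consumed, the word palate inside cleft palate does not add a second count for the shorter term.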
Results feature vectors:
3778 of the possible 5436 MeSH terms were found in the OMIM records. The observed concepts are stored in feature vectors; one feature vector per OMIM record. The number of concepts per record varies from 1 to 242 and the average number of concepts per vector is 16.4. The use of hypernyms (eq. 1) increases the average number of concepts per vector to 45.0 (min: 1; max: 477). This broadens the phenotype description and, more importantly, the number of common concepts between pairs of vectors increases from 0.85 to 5.88, allowing for a larger number of meaningful comparisons.
Normalization of the feature vectors by the inverse document frequency (eq. 2) and the correction for the record length (eq. 3) do not change the number of concepts per vector, but this weighting does change the distances between feature vectors as determined with equation 4. The normalization step of equation 3 scales all concept frequencies to values between 0 and 1. These normalization/weighting steps are non-linear. For example, a specific concept like Hair Follicle becomes 1.7 times more important relative to the less specific concept Skin.
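To make the effect of such weighting concrete, the sketch below applies a generic inverse document frequency weight and a per-record scaling, then compares two vectors. The exact forms of equations 2 to 4 are not reproduced on this page, so everything here is an assumption: log2(N/n) as the global weight, division by the largest count in the record as the local scaling, and cosine similarity as the comparison are common textbook choices and not necessarily those used in MimMiner. The function names and toy vectors are hypothetical.

```python
import math


def idf_weights(vectors):
    """Assumed global weight: log2(number of records / records with concept)."""
    n_docs = len(vectors)
    doc_freq = {}
    for vec in vectors:
        for concept in vec:
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    return {c: math.log2(n_docs / df) for c, df in doc_freq.items()}


def weight_vector(vec, idf):
    """Assumed local scaling: counts divided by the record's largest count."""
    max_count = max(vec.values())
    return {c: (n / max_count) * idf[c] for c, n in vec.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy feature vectors (concept -> count), invented for illustration.
vectors = [
    {"hair_follicle": 2, "skin": 1},
    {"skin": 3},
    {"skin": 1, "nail": 1},
    {"nail": 2},
]
idf = idf_weights(vectors)
weighted = [weight_vector(v, idf) for v in vectors]
similarity = cosine(weighted[0], weighted[1])
```

In this toy corpus the rare concept hair_follicle receives a larger weight than the common concept skin, which mirrors the Hair Follicle versus Skin example in the text, while the number of concepts per vector stays the same.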
The MimMiner similarity matrix is available for download (gzip, 75MB). If you use the data, please cite us.
We performed a Smith-Waterman analysis for all known UniProt sequences of the 1653 OMIM records with a gene. BLOSUM90 was used as the substitution matrix; the gap open penalty was -3*IS, where IS is the average matrix identity score, and the gap extension penalty was -3*IS/10. Given the normal raw score of the Smith-Waterman analysis, an e-value was computed: