Computer Sciences R&D

Ensemble Text Analysis

Much important information concerns relationships: Alice and Bob are connected via a car pool, Bob is connected to Cathy via working at Google, and Cathy is connected to bird flu via having traveled to Hong Kong. Such relationships are conveniently represented by graphs. The DHS ADVISE program has funded Sandia to support the development of such relationship graphs.
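As a rough illustration of the graph representation described above, the following Python sketch stores the three example relationships as labeled, undirected edges and searches for a chain of connections between two entities. The names `build_graph` and `find_path` are hypothetical helpers invented for this example; they are not part of any Sandia or ADVISE software.

```python
from collections import defaultdict, deque

# Each relationship is an undirected edge labeled with how the two
# entities are connected (examples taken from the text above).
relationships = [
    ("Alice", "Bob", "car pool"),
    ("Bob", "Cathy", "works at Google"),
    ("Cathy", "bird flu", "traveled to Hong Kong"),
]


def build_graph(edges):
    """Return an adjacency map: node -> {neighbor: edge label}."""
    graph = defaultdict(dict)
    for a, b, label in edges:
        graph[a][b] = label
        graph[b][a] = label
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest node path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, {}):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(relationships)
path = find_path(graph, "Alice", "bird flu")
print(path)
```

Here the search links Alice to bird flu through Bob and Cathy, which is the kind of indirect connection a relationship graph is meant to expose.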

To populate a graph with relationships, however, it is often necessary to deduce the relationships from plain source text, such as news stories or passenger manifests. The code that does so must first understand that "Bob" and "Alice" are people, but that "Google" is a company (and therefore a workplace) and not a person. The process of assigning these sorts of labels to words is called "Named Entity Recognition", or NER.

NER is a supervised machine learning task. As such, its accuracy and robustness can be improved by a machine learning meta-method called "ensembles". The core ensemble idea is to generate a large number of different but related NER models, and to combine them to generate output labelings that are more robust and accurate.
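The page does not say how the individual NER models are combined. One common scheme is a per-word majority vote, and the minimal Python sketch below assumes that scheme. The function `majority_vote`, the label names (`PERSON`, `ORG`, `O` for "not an entity"), and the three model outputs are all made up for illustration.

```python
from collections import Counter


def majority_vote(labelings):
    """Combine several label sequences (one per model) word by word.

    Every sequence must label the same words in the same order.
    Ties go to the label that appears first among the tied models.
    """
    if not labelings:
        raise ValueError("need at least one labeling")
    length = len(labelings[0])
    if any(len(labels) != length for labels in labelings):
        raise ValueError("all labelings must have the same length")
    combined = []
    for word_labels in zip(*labelings):
        # most_common(1) returns [(label, count)] for the top label
        combined.append(Counter(word_labels).most_common(1)[0][0])
    return combined


words = ["Bob", "works", "at", "Google"]
model_outputs = [
    ["PERSON", "O", "O", "ORG"],     # hypothetical model 1
    ["PERSON", "O", "O", "PERSON"],  # hypothetical model 2 mislabels "Google"
    ["O", "O", "O", "ORG"],          # hypothetical model 3 misses "Bob"
]
ensemble_labels = majority_vote(model_outputs)
print(list(zip(words, ensemble_labels)))
```

In this toy case each of two models makes a different mistake, and the vote recovers the correct label for every word, which is the intuition behind the robustness claim.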

Sandia has applied its ensemble methods to one NER problem, labeling Medline abstracts, as illustrated in the figure. The blue line shows the best previously published performance (as measured by average F1, which is the harmonic mean of the precision and recall of the word labels) on a standard test data set. The red line shows the ensemble performance increasing past the previous baseline, and eventually stabilizing, as the size of the ensemble grows.
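To make the F1 measure concrete, the short Python sketch below computes precision, recall, and their harmonic mean for a single label over a handful of words. The gold and predicted labels are invented for the example, and the helper names are hypothetical; the exact averaging used for the published results is not specified on this page.

```python
def precision_recall(gold, predicted, label):
    """Word-level precision and recall for one label."""
    true_positives = sum(
        1 for g, p in zip(gold, predicted) if g == label and p == label
    )
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / gold_count if gold_count else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: the predictor finds one of the two PERSON words.
gold = ["PERSON", "O", "O", "ORG", "PERSON"]
predicted = ["PERSON", "O", "O", "ORG", "O"]

precision, recall = precision_recall(gold, predicted, "PERSON")
f1 = f1_score(precision, recall)
print(precision, recall, f1)
```

With a precision of 1.0 and a recall of 0.5, F1 comes out to about 0.67, lower than the arithmetic mean of 0.75, because the harmonic mean penalizes an imbalance between the two.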