For applications like producing a map overlay indicating the locations of recent events, it is useful to be able to associate a longitude/latitude position with a news article describing a located event, such as a fire, accident or riot. For example, the following (invented) passage describes a fire:
A fire broke out in downtown Chicago last night. The Main Street storefront, a branch of Borders, was entirely destroyed despite the prompt arrival of five fire engines from the nearby Frederick Street Station. Main Street East will be closed to traffic for at least three days.
A named entity recogniser (such as that provided with the open-source GATE toolkit) might extract the following named entities from this text: Chicago, Main Street, Borders, Frederick Street Station and Main Street East.
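As a rough illustration of this step, the following sketch uses spaCy as a convenient Python stand-in for GATE (which is Java-based); it assumes the en_core_web_sm model has been downloaded, and the exact entities and labels returned will depend on the model:

import spacy

nlp = spacy.load("en_core_web_sm")

text = ("A fire broke out in downtown Chicago last night. The Main Street "
        "storefront, a branch of Borders, was entirely destroyed despite "
        "the prompt arrival of five fire engines from the nearby Frederick "
        "Street Station. Main Street East will be closed to traffic for at "
        "least three days.")

# Print each entity mention with its predicted type,
# e.g. "Chicago GPE", "Borders ORG".
for ent in nlp(text).ents:
    print(ent.text, ent.label_)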
The correct long/lat for the topic of the text would be retrieved from Google's Geocoding API by entering, e.g., Borders Main Street East, Chicago, IL, USA. However, inferring the correct combination of named entities with which to construct the `topical' address, and performing any remaining disambiguation (e.g. Illinois, United States), is not trivial.
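A minimal sketch of such a lookup, using the documented Geocoding API endpoint ("YOUR_API_KEY" is a placeholder, and error handling is omitted):

import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(address, api_key):
    """Return (lat, lng) for an address string, or None if no match."""
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": api_key})
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

print(geocode("Borders Main Street East, Chicago, IL, USA", "YOUR_API_KEY"))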
Google's Places API provides autocompletion of partial addresses, so one way of approaching the problem is to construct the most likely topical address by searching for, e.g., Main Street, Chicago and then using the returned places strings, in addition to the named entities found in the text, to refine the geocoding search. This process might iterate until a single, fully specified address yielding a long/lat is found.
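The refinement step might look something like the following sketch; the endpoint is the documented Places autocompletion service, but the scoring heuristic (preferring predictions that corroborate other entities from the text) is purely illustrative:

import requests

PLACES_URL = "https://maps.googleapis.com/maps/api/place/autocomplete/json"

def refine(partial_address, entities, api_key):
    """One refinement step: autocomplete the partial address and keep the
    prediction best corroborated by other named entities from the text."""
    resp = requests.get(PLACES_URL,
                        params={"input": partial_address, "key": api_key})
    predictions = [p["description"] for p in resp.json().get("predictions", [])]
    if not predictions:
        return None
    # Illustrative heuristic: count how many entities each prediction mentions.
    return max(predictions,
               key=lambda d: sum(e.lower() in d.lower() for e in entities))

entities = ["Borders", "Frederick Street Station", "Main Street East"]
print(refine("Main Street, Chicago", entities, "YOUR_API_KEY"))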
The Cytora dataset provides the correct topical long/lat for each article, along with the named entities relevant to its retrieval, for a range of texts for which the task is easy (a fully specified, contiguous address), medium (non-contiguous addresses), hard (non-contiguous address fragments with confounding information) or impossible (not enough information in the article).
A baseline approach might be developed by using GATE or similar to find named entities and then using simple rule-based heuristics to look up and integrate information from the Places API. After evaluating this baseline on a test subset of the Cytora dataset, various approaches might be tried to improve on it. For example, a parser such as RASP could be run to identify relations amongst non-contiguous address components (and even to replace or complement supervised named entity recognition), and Imitation Learning or Learning to Search (see tutorial) could be used to learn a policy for deciding the next best action given a partial address and features of the text and/or places strings.
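To make the Learning-to-Search framing concrete, the following skeleton shows one possible state/action decomposition; the action set, features and hand-written policy are all hypothetical design choices, with the hand-written policy standing in for the learned one against which the baseline would be compared:

from dataclasses import dataclass, field

# Hypothetical action set for building up a topical address step by step.
ACTIONS = ["APPEND_ENTITY", "QUERY_AUTOCOMPLETE", "GEOCODE", "STOP"]

@dataclass
class State:
    partial_address: str
    unused_entities: list = field(default_factory=list)
    candidates: list = field(default_factory=list)  # places strings seen so far

def features(state):
    """Cheap state features; a learned policy would use far richer ones."""
    return {"n_unused": len(state.unused_entities),
            "n_candidates": len(state.candidates),
            "addr_parts": len(state.partial_address.split(","))}

def rule_policy(state):
    """Hand-written stand-in for the policy Imitation Learning would learn."""
    if state.unused_entities:
        return "APPEND_ENTITY"
    if len(state.candidates) != 1:
        return "QUERY_AUTOCOMPLETE"
    return "GEOCODE"

state = State("Main Street, Chicago", unused_entities=["Borders"])
print(rule_policy(state))  # -> APPEND_ENTITY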
The project would best be undertaken in Java or Python, using XML or JSON as the interface between the various components, and using toolkits such as GATE and RASP, as well as a toolkit for Imitation Learning. If you are interested in the project, please contact me and I will make a subset of the (copyrighted) Cytora corpus available so that you can get a better idea of the task.
Triage of contracts, automatically identifying clauses that may require further negotiation, would cut costs and ensure that companies focus legal effort on the subset of clauses, in the subset of contracts, that represent a risk to them.
Given a preprocessed contract, in which individual clauses have been identified and the text recovered in a format convenient for the application of NLP and ML tools, the task is to classify clauses into types such as `indemnification' or `termination' and then, within these types, to identify clauses which are high risk: for example, a disproportionate or one-sided requirement to indemnify the other party, or a heavy penalty for termination of the contract. It is necessary to classify each clause into one of a number of types first, and then to further classify each clause within its type, because the linguistic patterns corresponding to potential risk vary considerably between clause types. Furthermore, linguistic features such as negation are critical to the identification of (non)risk (e.g. the Parties will not indemnify...).
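The two-stage structure might be organised as in the sketch below, which uses scikit-learn's LinearSVC as a stand-in for SVMlight; all names and the toy training data are illustrative only. Note the bigram features, which let negation cues such as `will not' survive the bag-of-words step:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def make_clf():
    # Unigrams and bigrams, so that negation ("will not") is visible
    # to the classifier rather than lost in a pure bag of words.
    return make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

# Stage 1: clause type; stage 2: a separate risk classifier per type.
type_clf = make_clf()
risk_clf = {"indemnification": make_clf(), "termination": make_clf()}

# Toy training data, purely for illustration.
X = ["Party A shall indemnify Party B against all losses",
     "The Parties will not indemnify each other",
     "Either party may terminate on 30 days notice",
     "Termination incurs a penalty of 50% of the contract value"]
y_type = ["indemnification", "indemnification", "termination", "termination"]
y_risk = {"indemnification": ["risk", "no-risk"],
          "termination": ["no-risk", "risk"]}

type_clf.fit(X, y_type)
for t, clf in risk_clf.items():
    clf.fit([x for x, yt in zip(X, y_type) if yt == t], y_risk[t])

def predict_risk(clause):
    ctype = type_clf.predict([clause])[0]
    return ctype, risk_clf[ctype].predict([clause])[0]

print(predict_risk("The supplier shall indemnify the customer for any claim"))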
Initially, a baseline system could be developed that uses ranked Boolean keyword search over clauses, with additional operators such as NEAR, to identify the different clause types (e.g. Part* NEAR indemnif*), followed by binary classification of the returned clauses of each type with respect to risk using SVM classifiers. Both steps could be evaluated in terms of Precision/Recall and F-measure, as the dataset is fully annotated with clause type and risk. A number of possible extensions of this baseline could be undertaken that use features or information beyond bag-of-words to improve performance. For instance, a parser such as RASP or spaCy might be used to provide dependency-structured input, which could be fed into an SVM deploying tree kernels to identify risk clauses, or a sequential (recurrent LSTM / convolutional) neural model might be trained to learn the classification function without explicit feature engineering.
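A NEAR operator over clauses can be approximated in a few lines, as in this sketch (Lucene's span queries would provide the same behaviour at index time; the window size and token-prefix patterns here are arbitrary choices):

import re

def near(clause, pat_a, pat_b, window=5):
    """True if a token matching pat_a occurs within `window` tokens
    of a token matching pat_b."""
    tokens = clause.lower().split()
    hits_a = [i for i, t in enumerate(tokens) if re.match(pat_a, t)]
    hits_b = [i for i, t in enumerate(tokens) if re.match(pat_b, t)]
    return any(abs(i - j) <= window for i in hits_a for j in hits_b)

# Part* NEAR indemnif* as token-prefix patterns. Note that the keyword
# match fires even though this clause is negated, which is why the
# second, classifier-based stage is needed.
clause = "The Parties agree that neither Party shall indemnify the other."
print(near(clause, r"part\w*", r"indemnif\w*"))  # -> True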
The project would best be undertaken in Java or Python, using JSON as the interface between the various components, and using toolkits such as NLTK, RASP and/or spaCy to perform the text processing, Lucene to index and search the dataset, and machine learning toolkits such as SVMlight or Theano to construct classifiers. Please contact me if you are interested in this project and I will send you a sample of the dataset.