Applications and project ideas
Here are some applications to inspire your own NLP projects:
- Guessing passwords from social network profiles (http://www.sciencemag.org/news/2017/09/artificial-intelligence-just-made-guessing-your-password-whole-lot-easier).
- Chatbot lawyer overturns 160,000 parking tickets in London and New York (https://www.theguardian.com/technology/2016/jun/28/chatbot-ai-lawyer-donotpay-parking-tickets-london-new-york).
- craigboman/gutenberg (https://github.com/craigboman/gutenberg): a librarian working with Project Gutenberg data for NLP and machine learning purposes.
- Longitudinal Detection of Dementia Through Lexical and Syntactic Changes in Writing (ftp://ftp.cs.toronto.edu/dist/gh/Le-MSc-2010.pdf): a master's thesis by Xuan Le on psychological diagnosis with NLP.
- Time Series Matching: a Multi-filter Approach by Zhihua Wang (https://www.cs.nyu.edu/web/Research/Theses/wang_zhihua.pdf). Songs, audio clips, and other time series can be discretized and searched with dynamic programming algorithms analogous to Levenshtein distance.
- NELL, Never Ending Language Learning (http://rtw.ml.cmu.edu/rtw/publications)—CMU’s constantly evolving knowledge base that learns by scraping natural language text.
- How the NSA identified Satoshi Nakamoto (https://medium.com/cryptomuse/how-the-nsa-caught-satoshi-nakamoto-868affcef595): an article reporting that Wired Magazine and the NSA identified Satoshi Nakamoto using stylometry, an NLP technique.
- Stylometry (https://en.wikipedia.org/wiki/Stylometry) and Authorship Attribution for Social Media Forensics (http://www.parkjonghyuk.net/lecture/2017-2nd-lecture/forensic/s8.pdf)—Style/pattern matching and clustering of natural language text (also music and artwork) for authorship and attribution.
- Online dictionaries like Your Dictionary (http://examples.yourdictionary.com/) can be scraped for grammatically correct sentences with POS labels, which can be used to train your own Parsey McParseface (https://ai.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html) syntax tree parser and POS tagger.
- Identifying ‘Fake News’ with NLP (https://nycdatascience.com/blog/student-works/identifying-fake-news-nlp/) by Julia Goldstein and Mike Ghoul at NYC Data Science Academy.
- simpleNumericalFactChecker (https://github.com/uclmr/simpleNumericalFactChecker) by Andreas Vlachos (https://github.com/andreasvlachos) and information extraction (see chapter 11) could be used to rank publishers, authors, and reporters for truthfulness. It might be combined with Julia Goldstein’s “fake news” predictor.
- The artificial-adversary (https://github.com/airbnb/artificial-adversary) package by Jack Dai, an intern at Airbnb, obfuscates natural language text (turning phrases like ‘you are great’ into ‘ur gr8’). You could train a machine learning classifier to detect and translate English into obfuscated English or L33T (https://sites.google.com/site/inhainternetlanguage/different-internet-languages/l33t). You could also train a stemmer (an autoencoder with the obfuscator generating character features) to decipher obfuscated words so your NLP pipeline can handle obfuscated text without retraining. Thank you, Aleck.
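
As a starting point for the time series matching idea above, here is a minimal sketch of Levenshtein distance written as a dynamic program over arbitrary sequences, so it works on discretized numeric series as well as on strings. It is not taken from Zhihua Wang's thesis. The `discretize` helper and its `step` parameter are hypothetical names introduced only for this illustration, and simple rounding is an assumed discretization scheme.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


def discretize(series, step=1.0):
    """Hypothetical helper: map each value to its nearest multiple of step."""
    return [round(value / step) for value in series]


# Two made-up pitch tracks; the second is missing one note.
melody = discretize([60.1, 62.0, 63.9, 65.2])
query = discretize([60.0, 61.9, 65.1])
distance = edit_distance(melody, query)
```

Here `melody` becomes `[60, 62, 64, 65]` and `query` becomes `[60, 62, 65]`, so `distance` is 1 (one deleted note). The same function gives 3 for the strings "kitten" and "sitting". A real audio search would likely need a smarter discretization and a windowed search over long recordings rather than whole-sequence comparison.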