TwitterLinguistics : An Open Research Project

Data Scientists are Will-Doers

Not Can-Doers

My presentation at PyCon on “The Linguistics of Twitter” produced exactly the response I was looking for: smart, motivated people contacted me and asked how they could help.

Quick Recap

The slides from the presentation are available on Slideshare and the video will be available shortly.

In a nutshell, processing data from Twitter presents unique challenges due to:

  • The 140-character limit
  • API lookup rate and range limits
  • Dialectal English usage

Linguistic Challenges

Twitter users shorten words to fit within the character limit, and word choice has been shown to vary by region.

This presents an opportunity to explore the use of regional American English dialects on Twitter in an effort to build out systems which could facilitate communication.

TwitterLinguistics Project

An oversimplified outline of the project is:

  • Collect geo-tagged tweets from Twitter
    • For each of the different dialect regions in America
    • Lots of tweets, think in the millions at least for each region
    • Post the data in a publicly accessible location
  • Process the regions with the Natural Language Toolkit
    • While this is starting out as a Python project, contributions in other programming languages are welcome
    • Build out regional corpora
  • English only for now
    • Including other languages would complicate an already very complicated challenge
  • Sit back and admire all the hard work we did ;)
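
To make the outline a little more concrete, here is a minimal sketch of the "sort tweets into regions, then build regional word counts" step. It is not the project's actual pipeline. The bounding boxes, the record shape (`text`, `lon`, `lat`), and the names `REGIONS`, `region_for`, and `build_regional_counts` are all assumptions made up for illustration, and the boxes are nowhere near real dialect boundaries. It uses only the standard library so it runs anywhere; in the real project a tweet-aware tokenizer from NLTK would likely replace the simple regular expression.

```python
import re
from collections import Counter, defaultdict

# Hypothetical bounding boxes: (min_lon, min_lat, max_lon, max_lat).
# Rough illustrations only, not real dialect boundaries.
REGIONS = {
    "northeast": (-80.0, 39.0, -66.0, 48.0),
    "south": (-100.0, 25.0, -75.0, 39.0),
}

# Keeps words, contractions, hashtags and @mentions as single tokens.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?")


def region_for(lon, lat):
    """Return the name of the first region containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in REGIONS.items():
        # Half-open intervals so a point on a shared edge lands in one box only.
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None


def build_regional_counts(tweets):
    """Count lowercase tokens per region; tweets outside every box are skipped."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        region = region_for(tweet["lon"], tweet["lat"])
        if region is None:
            continue
        counts[region].update(TOKEN_RE.findall(tweet["text"].lower()))
    return counts


# Made-up sample records standing in for collected geo-tagged tweets.
SAMPLE_TWEETS = [
    {"text": "Y'all coming 2nite?", "lon": -84.4, "lat": 33.7},
    {"text": "That pizza was wicked good", "lon": -71.1, "lat": 42.4},
    {"text": "no geo match here", "lon": -122.3, "lat": 47.6},
]

counts = build_regional_counts(SAMPLE_TWEETS)
for region, counter in counts.items():
    print(region, counter.most_common(3))
```

With millions of tweets per region, the same per-region counters become the raw material for comparing which shortened or regional word forms show up where.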

More Information


We definitely need help, and lots of it.

Whether you are a casual programmer, someone interested in exploring Natural Language Processing, or just an awesome individual, there is probably a way for you to contribute.

If you are just learning NLP in Python, I strongly recommend both the NLTK book from O’Reilly and ‘Python Text Processing with NLTK 2.0 Cookbook.’

If you have no programming knowledge but want to be actively involved in the project, the two books listed above are a good place to start.

There are also opportunities for publicity, data sharing and data validation which require less time on your end.

Email me at to get on the mailing list while it's hot.

