Data Scientists are Will-Doers
Not Can-Doers
My presentation at PyCon on “The Linguistics of Twitter” produced exactly the response I was looking for: smart, motivated people made contact with me and asked how they could help.
Quick Recap
The slides from the presentation are available on Slideshare and the video will be available shortly.
In a nutshell, processing data from Twitter presents unique challenges due to:
- 140-character limit
- API lookup range limiting
- Dialectal English usage
Linguistic Challenges
Twitter users shorten words to fit within the character limit, and word choice has been shown to vary by region.
This presents an opportunity to explore the use of regional American English dialects on Twitter in an effort to build out systems which could facilitate communication.
TwitterLinguistics Project
An oversimplified outline of the project is:
- Collect geo-tagged tweets from Twitter
- For the different dialectal regions in America
- Lots of tweets, think in the millions at least for each region
- Post the data in a publicly accessible location
- Process the regions with the Natural Language Toolkit
- While this is starting out as a Python project, contributions in other programming languages are welcome
- Build out regional corpora
- English only for now
- Including other languages would complicate an already very complicated challenge
- Sit back and admire all the hard work we did
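To make the outline above a little more concrete, here is a minimal sketch of the "collect, group by region, build corpora" steps. It is only an illustration under several assumptions: the region names and bounding boxes are made up and are not real dialect boundaries, the sample tweets are invented, and a simple regular expression stands in for the NLTK tokenizers the project would actually use.

```python
import re
from collections import Counter, defaultdict

# Hypothetical regions as (min_lat, max_lat, min_lon, max_lon).
# These boxes are illustrative only, not real dialect boundaries.
REGIONS = {
    "north": (40.0, 49.0, -125.0, -67.0),
    "south": (25.0, 40.0, -125.0, -67.0),
}

# Very rough tokenizer: words, contractions, @mentions and #hashtags.
# An NLTK tokenizer could replace this in the real project.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?")


def tokenize(text):
    """Lowercase a tweet and split it into rough tokens."""
    return TOKEN_RE.findall(text.lower())


def region_for(lat, lon):
    """Return the name of the first region containing the point, or None."""
    for name, (min_lat, max_lat, min_lon, max_lon) in REGIONS.items():
        if min_lat <= lat < max_lat and min_lon <= lon < max_lon:
            return name
    return None


def build_regional_counts(tweets):
    """Count tokens per region from (text, lat, lon) tuples."""
    counts = defaultdict(Counter)
    for text, lat, lon in tweets:
        region = region_for(lat, lon)
        if region is not None:  # skip tweets outside every region
            counts[region].update(tokenize(text))
    return counts


# Invented sample data standing in for collected geo-tagged tweets.
SAMPLE_TWEETS = [
    ("Y'all coming 2nite?", 33.7, -84.4),
    ("you guys coming tonight?", 42.4, -71.1),
    ("bonjour", 48.9, 2.35),  # outside both boxes, so it is ignored
]

regional_counts = build_regional_counts(SAMPLE_TWEETS)
for region, counter in sorted(regional_counts.items()):
    print(region, counter.most_common(3))
```

With millions of tweets per region, comparing these per-region counts is one simple way to start spotting regionally distinctive words and shortenings.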
More Information
Interested?
We definitely need help, and lots of it.
Whether you are a casual programmer, someone with an interest in exploring Natural Language Processing, or just an awesome individual, there is probably a way for you to contribute.
If you are just learning NLP in Python, I strongly recommend both the NLTK book from O’Reilly and ‘Python Text Processing with NLTK 2.0 Cookbook.’
If you have no programming knowledge but want to be actively involved in the project, both of the books listed above are a good place to start.
There are also opportunities for publicity, data sharing and data validation which require less time on your end.
Email me at mdh@michaeldhealy.com to get on the mailing list while it’s hot.



