Researchers like new datasets. Many of us build tools and techniques that work nicely with existing data but may perform poorly with “out-of-sample” datasets. The ability to generate new and interesting big datasets, especially ones that draw a crowd of researchers and are of public interest, grew directly out of 10 years of eRulemaking research.
Then along come Twitter and Facebook. People tell me, “Hey Stu, there is useful information mixed in with very large quantities of not-so-useful information.” Where have I heard this? Right, this reminds me of the mass email campaigns that may do more harm than good when they drown out the legitimate voice of the informed citizen.
Twitter itself doesn’t provide a search engine capable of generating accurate, complete, targeted tweet datasets for research or analysis. So we wired up DiscoverText for harvesting this data. People asked for it, we built it. Now Twitter tells us not to share large collections. In their view, this is proprietary data. In my view, this is prime historical data in the public domain (Twitter-History a.k.a. ‘Twistory’) that yearns to be free.