Following the blog post about oddities in the Facebook importing that Stu was experiencing while visiting the DMI summer school, I put on my detective hat and went looking for possible reasons why things that should be available from the Facebook Graph API were not being collected by DiscoverText. This proved to be a somewhat complicated task since a lot of the Facebook realm is less than fully documented. As software developers, we share that pain point.
So, perhaps it is permissions – My first thought was that maybe it was a permissions issue, with variation between different Facebook accounts making a huge difference in what gets imported. I have two different DiscoverText accounts attached to two separate Facebook accounts just for testing, so I set both of them to import a set of test archives. Not knowing exactly where to start, I did a public post import of the search terms “Casey Anthony Trial” and then grabbed the public page for the “Cooking Channel”, a friend’s personal wall (to ensure that the test account couldn’t import it), and Mark Zuckerberg’s public page.
The Notifications page in DiscoverText contains information that can be a little cryptic. When your import first finishes, you will get a Notification that starts with the tag [Ingestion] or [Scheduled Ingestion]. This line gives you a report about what was or wasn’t added to your archive of the things DT tried to import.
Looking at the three sets of items I tried to bring into DiscoverText recently, you will see that the number of items written is sometimes less than the items found. This occurs in Facebook imports because DiscoverText only imports text and not all Facebook posts contain text. Sometimes people just post pictures, video, or links without any text to explain it. These posts show up in DT as empty items, so they are skipped. For Scheduled Ingestions (ones that have been put on a repeating schedule for DT to automatically gather more from a source), you also have the option of only gathering new items which makes the user’s job easier by not writing duplicate entries.
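To make the gap between "found" and "written" concrete, here is a minimal sketch of the skip-empty rule. It is an illustration only, not DiscoverText's actual code: the function name `keep_text_posts` is hypothetical, and it assumes each post is a dictionary whose text, if any, lives in a `message` field.

```python
def keep_text_posts(posts):
    """Return only the posts that carry non-empty text.

    Assumption: each post is a dict with an optional "message" key.
    Posts with no message, or only whitespace, are skipped.
    """
    written = []
    for post in posts:
        text = (post.get("message") or "").strip()
        if text:
            written.append(post)
    return written


# Hypothetical sample data: one text post, one photo-only post,
# and one post whose message is only whitespace.
found = [
    {"id": "1", "message": "Watching the trial coverage tonight."},
    {"id": "2", "picture": "photo.jpg"},
    {"id": "3", "message": "   "},
]
written = keep_text_posts(found)
print(f"{len(written)} items written out of {len(found)} items found")
```

With this sample, only the first post survives, which is the same kind of mismatch the Notification line reports.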
Except for the expected failure when the account that has no “friends” tried to scrape my friend’s personal wall, I didn’t see any of the problems Stu reported when I ran my initial tests. However, an undocumented aspect of the DiscoverText architecture came to light. After running several tests, I noticed that the 30-day trial account was importing far fewer items than my main Enterprise account in specific situations. For the Casey Anthony Trial search, which was searching across all public posts, both accounts had very similar returns. When I imported the Cooking Channel page (Archive: Graph Test Page above), the trial account imported a tenth of the items that my Enterprise account did.
Further investigation revealed that a trial account will only retrieve either a set number of items or a limited window of recent posts and comments, whichever is larger. I don’t know the exact limits and I haven’t heard back from the programmer, but I think it is around 500 items or about three months back in history.
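The "whichever is larger" rule can be sketched as follows. This is a guess at the logic, not the real implementation: the function name `apply_trial_limit` is hypothetical, and the defaults of 500 items and 90 days are just the unconfirmed estimates mentioned above.

```python
from datetime import datetime, timedelta


def apply_trial_limit(posts, now, max_items=500, window_days=90):
    """Keep the larger of two candidate sets of posts.

    posts: list of (created, text) tuples, where created is a datetime.
    max_items and window_days are assumed values, not confirmed limits.
    """
    # Newest first, so the item cap keeps the most recent posts.
    ordered = sorted(posts, key=lambda p: p[0], reverse=True)

    by_count = ordered[:max_items]
    cutoff = now - timedelta(days=window_days)
    by_window = [p for p in ordered if p[0] >= cutoff]

    # "Whichever is larger": return the set with more items.
    return by_count if len(by_count) >= len(by_window) else by_window


# Hypothetical data: ten posts, one per day going back from "now".
now = datetime(2011, 7, 15)
posts = [(now - timedelta(days=i), f"post {i}") for i in range(10)]

kept_by_window = apply_trial_limit(posts, now, max_items=3, window_days=5)
kept_by_count = apply_trial_limit(posts, now, max_items=8, window_days=2)
```

A busy page hits the item cap long before the time window matters, which would explain why the Cooking Channel import differed so much between accounts while a narrow search did not.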
However, this still isn’t one of the problems Stu and his class were seeing. I asked what they were searching on and he answered “Aruba”. If you are going to search on a topic, might as well be a tropical island. To be more specific, they were trying to import information from Facebook Groups having to do with Aruba. For anyone who hasn’t played with FB enough, Facebook has several types of pages, including People, Pages, and Groups.
“People” is your everyday user’s personal page. “Pages” are usually devoted to businesses or public figures and act as forums for the creators to put information out there as well as often allowing feedback from the public. Groups allow users to create discussion areas for people with common interests to gather and talk, share, and be social about that interest.
I found that Facebook changed its Groups format and has been prepping to archive all the old groups. This may be part of the problem, since FB makes little distinction in its Graph API between currently active Groups and ones that have been closed for archiving. This means there are a bunch of possible groups in the import stream that will return nothing of value, which makes it harder to get good data quickly. Of the top three Groups that come back after searching for Aruba, the first two are being archived and return fewer than five posts between them. The third group, on the other hand, finally got us somewhere.
The group Aruba! has almost 2,000 members and is an Open Group according to Facebook. It also returned a notification of “1 item was written out of 1139 items found”. This was what Stu was running into. Without getting into the messy business of Access Tokens, let’s just say that even though this group is “Open” and can be viewed by anyone, it isn’t completely open: joining the group requires permission.
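If you want to spot this pattern across many imports, a small script can pull the two counts out of a Notification line and flag suspicious ratios. This is a rough sketch: the regular expression is based only on the one message quoted above, the plural "were" form and the `[Ingestion]` prefix on the same line are assumptions, and the 5% threshold is arbitrary. A low ratio can also just mean most posts had no text.

```python
import re

# Pattern inferred from a single observed message; other wordings
# (for example the plural "were") are assumed, not confirmed.
PATTERN = re.compile(
    r"(\d+) items? (?:was|were) written out of (\d+) items? found"
)


def parse_ingestion(line):
    """Return (written, found) as ints, or None if the line doesn't match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_blocked(written, found, threshold=0.05):
    """Flag imports where almost nothing found was actually written."""
    return found > 0 and written / found < threshold


line = "[Ingestion] 1 item was written out of 1139 items found"
counts = parse_ingestion(line)
if counts is not None:
    written, found = counts
    print(written, found, looks_blocked(written, found))
```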
It seems that the Graph API permissions we are using to get all of the other pages and groups don’t work for this particular kind of group. An update to DiscoverText in the near future should clear things up.