Text-mining vs./and/or Linked Data?

May 17, 2011 in Uncategorized

Reading Jonathan Rochkind’s musings on using Wikipedia as an authority file (something I’m all for), I was struck by his comment that

I think wikipedia-miner, by applying statistical analysis text-mining ‘best guess’ type techniques, provides more relationships than dbpedia alone does. I know that wikipedia-miner’s XML interface is more comprehensible and easily usable by me than dbpedia’s (sorry linked data folks).

XML-over-REST vs. SPARQL debates aside, I think there is an interesting issue here regarding the kind of relationships that statistical text-mining produces vs. the kind typically found in Linked Data. Linked Data favors “factoids” like date-and-place-of-birth, while statistical text-mining produces (at least in this case) distributions interpretable as “relationship strength”. The wikipedia-miner results aren’t “facts” in any normal sense, but as Rochkind suggests they may be more useful. Now sure, you could represent the wikipedia-miner results as Linked Data, but what I’m trying to get at here isn’t a question of data models or syntax. It’s about how and when we choose to treat the patterns in our data as facts, and when we are content to treat them as patterns. Thoughts?
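To make the distinction concrete, here is a minimal Python sketch. Everything in it is illustrative: the `Fact` and `WeightedRelation` types, the `relatedTo` predicate, and the strength values are invented for this example and are not wikipedia-miner output or DBpedia vocabulary. It shows that turning a pattern into a "fact" comes down to picking a threshold, and that the resulting set of facts changes with that choice.

```python
from typing import List, NamedTuple


class Fact(NamedTuple):
    """A Linked Data style statement: it is either asserted or it is not."""
    subject: str
    predicate: str
    obj: str


class WeightedRelation(NamedTuple):
    """A text-mining style result: a relationship with a strength attached."""
    source: str
    target: str
    strength: float  # hypothetical score between 0.0 and 1.0


def promote_to_facts(relations: List[WeightedRelation],
                     threshold: float,
                     predicate: str = "relatedTo") -> List[Fact]:
    """Treat every relation at or above the threshold as a plain fact.

    The strength itself is discarded, which is exactly the information
    lost when a pattern is flattened into a fact.
    """
    return [Fact(r.source, predicate, r.target)
            for r in relations
            if r.strength >= threshold]


# Made-up strengths, purely for illustration.
relations = [
    WeightedRelation("Text mining", "Data mining", 0.8),
    WeightedRelation("Text mining", "Linked data", 0.4),
    WeightedRelation("Text mining", "Archive", 0.1),
]

facts_strict = promote_to_facts(relations, threshold=0.5)
facts_loose = promote_to_facts(relations, threshold=0.3)

print(len(facts_strict), "fact(s) at threshold 0.5")
print(len(facts_loose), "fact(s) at threshold 0.3")
```

The same underlying data yields different "facts" depending on where the line is drawn, which is the choice the question above is about.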

5 responses to Text-mining vs./and/or Linked Data?

  1. i guess the problem with “treating patterns as facts” is that they sometimes feel sort of random (after all, they entirely depend on the method you used to extract them), so many people tend to think that they are “less valuable” or “less reliable” than “actual facts”. but there also is the problem of low data quality even in many of the explicit facts (simple errors in wikipedia’s infoboxes), so just representing “facts” does not necessarily mean that they are true. i guess the problem you are referring to here is basically the same that i mentioned in my recent “From AI to BI” post (http://dret.typepad.com/dretblog/2011/05/from-ai-to-bi.html), which also suggested that the richness and value of linked data mostly lies in the fact that it can be a massive amount of linked data, ready to be mined and massaged BI-style, instead of directly exposing or formalizing facts and knowledge AI-style, and being queried with SPARQL. my bet is still on the BI side, and i think the sooner the linked data community sees and works with the immense value of raw data connected by links, without obsessing too much about “real semantics” being represented, the more success it will have.

  2. I hadn’t made the connection to your AI vs. BI post, but you’re right, it’s basically the same difference in orientation. I do see the value in some basic “facts” like dates and places and roles within organizations. But beyond that are a range of less well-defined relationships that don’t seem to lend themselves to expression as Linked Data.

  3. Coming at it from an archival perspective, rather than a linked data perspective, I can see a separate issue.

    Archival finding aids are intended to be exactly that – aids to *finding* data, not to interpreting it. Archivists document their holdings with some high-level, fairly unambiguous facts (I’m sure that’s debatable) in order to get researchers to the materials they need to do their interpretations. This kind of “treating patterns as facts” seems most problematic in that context because it puts archivists in the position of making interpretations – even if they’re computer-generated interpretations – and expecting that researchers will be working from those interpretations to form their own. I’d be more comfortable with giving the researchers access to the tools and data to perform analytics themselves rather than present them in a form that implies that the results of analytics are fact.

    On the other hand, as you suggest, terminology is important. “Patterns” are a bit less definite than “facts.”

    I can also see how this would be different in other fields, which is the other interesting angle. Would this be thought of differently in libraries, or museums?

  4. I am a warehouse worker in a large city.

    I found this convo through a Tweet.

    I think the title of the essay is misleading.

    The objective of the writer’s essay is to point to a distinction between patterns of relationship strength and analysis of facts.

    My reply is insignificant. It reminds me of the Sonic Youth song “Pattern Recognition”.

    The previous reply expanded the thread w/ a motif of its own: the archivist’s perspective on pattern/fact aids in her own way.

    I am a warehouse worker and I think patterns are equal to facts when you are analysing material. If you add analysis/archivization you are going to be talking about the flow of material. I don’t mean rhyme flow here. I mean what goes where, what size, etc. and what is it in?

  5. Not sure if the comment above is concrete poetry or incredibly sophisticated algorithmic spam.

    Misty, I’m skeptical that archivists can produce interpretation-free, factual finding aids. Part of the problem here is that there is no bright line separating fact and interpretation, so we can’t just divide labor between the fact-creators and interpretation-creators. I’d argue that all we have are different degrees of interpretation, with greater or lesser requirements for supporting evidence.
