Question: What is a Persistent Object Identifier and Why Should I Care?
Answer: A persistent object identifier (POID) is a unique, long-lasting name or pointer to a resource. It will never be reassigned to any other resource and it will not change regardless of where the resource is located. In other words, the POID is invariant over time. If the resource moves or its ownership changes, the POID remains actionable. Without persistence, the external value of an object identifier is diminished.
This issue is very familiar in the world of digital cartography, where companies might want to augment or correct the base data they receive from suppliers. For example, suppose the data supplied to you has an erroneous attribute in the form of an incorrect street name. To fix this, you would link an additional attribute to the object ID in the database and write some code to snag the exceptions when working in this area. The kludge would work until the next revision of the database, when new object IDs were assigned. At that point, you either developed complex matching software (never foolproof) or started all over again, appending attributes to the new identifier of the old data. One would have thought that this problem had been put to bed by now, but in today’s Geoweb, it has once again reared its ugly head.
Why so? Well, we now have mash-ups, User Generated Content and a host of companies creating data designed to tell us more about location and locations. The interests of these data creators are unique and each may be listening to a different drummer. In a very simple sense, everyone is adding stuff to other stuff to create a new awareness about our surroundings. While many of these companies may be very good at adding new data, they seem not to be particularly proficient at trapping the basic business listing data that is associated with their unique data.
Hmmm. Well, how does this all work? What ensures that all of the new data intended to convey information about a specific location are harmonized so they all point to the same instance?
GPS and its derivatives are helping make sure that we can repeatedly and reliably identify a point on the surface of the earth in order to attach attributes to it. But how do we come up with some authoritative scheme that would identify a specific business so that we can aggregate all (or perhaps some) of the data that is being generated about it?
Let’s look a little closer at this issue. For example, there are a number of companies that provide business listing data (e.g. Acxiom, Amacai, Dun & Bradstreet, infoUSA and Navteq). These companies are involved in compiling and distributing basic business information (name, address, phone number, etc.) to companies interested in using these data for local search, navigation and other spatial endeavors.
Today’s Internet has mutated into a social web and various efforts (some corporate, some not) now gather user experiences about businesses as a method of providing a more complete (richer) evaluation of businesses (ratings, evaluations, reviews, pricing, etc.) than is available from the business listings aggregators. Finally, the Web has also enabled business owners to contribute data about the operational aspects of their businesses (detailed location information, services, menus, credit cards accepted, opening hours, parking, etc.) that is captured by online local search providers but is not captured by most business listings providers.
From a hypothetical perspective, this richness of data is great! Well it would be great if all of these data actually pointed to the same thing.
So now, we return to persistent identifiers. In order to make the connection between data that refer to the same feature, you need to be able to link your data to a unique feature ID. If the identifier is not persistent, then each time the ID changes, the augmented data linked to specific locations are no longer actionable (since the location can no longer be identified).
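A minimal sketch in Python of why persistence matters here. The object IDs, field names and the "releases" below are invented for illustration; they are not any supplier's actual data or schema.

```python
# Our own correction, attached to the supplier's object ID (hypothetical).
corrections = {"A100": {"street": "12 Main St"}}


def apply_corrections(release, corrections):
    """Overlay our corrections on a release; report corrections whose ID is gone."""
    patched = {oid: dict(rec) for oid, rec in release.items()}
    orphaned = []
    for oid, fix in corrections.items():
        if oid in patched:
            patched[oid].update(fix)
        else:
            orphaned.append(oid)  # nothing left to attach the fix to
    return patched, orphaned


# Case 1: the new revision keeps the same (persistent) object ID.
release_persistent = {"A100": {"name": "Main St Diner", "street": "12 Mian St"}}
# Case 2: the new revision assigns a fresh ID to the same business.
release_reassigned = {"B734": {"name": "Main St Diner", "street": "12 Mian St"}}

patched_ok, orphaned_ok = apply_corrections(release_persistent, corrections)
patched_bad, orphaned_bad = apply_corrections(release_reassigned, corrections)
```

In the first case the street name is fixed and nothing is orphaned. In the second, the correction for "A100" has no record to land on, and the misspelled street survives under the new ID.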
The question, “Does this data refer to that business?” is one of the nagging problems in the “geoweb” that impacts both users of the data and the data fusers. How many times have you read a restaurant review and thought “That’s not the name of that restaurant – and that’s not on their menu anyway”? Or, how many times have you seen duplicate listings for businesses that differ by name form or some aspect of the address identifier? An example provided to me by infoUSA compared a third-party address:
BP – Bucky’s
101 S 40th Street
with the infoUSA address:
107 S. 40th Street
Omaha, NE 68111-1448
(I blanked out the telephone number and ID)
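To see why a pair like this slips past simple matching, here is a small Python sketch using only the two street lines quoted above. The `normalize` helper is a hypothetical illustration, not infoUSA's matching logic, and the similarity measure (the standard library's `difflib.SequenceMatcher`) is just one of many possible choices.

```python
import re
from difflib import SequenceMatcher


def normalize(street):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", street.lower())
    return " ".join(cleaned.split())


third_party = "101 S 40th Street"
infousa = "107 S. 40th Street"

a, b = normalize(third_party), normalize(infousa)

# Exact comparison: normalization removes the "S." vs "S" difference,
# but the house numbers still disagree.
exact_match = a == b

# Fuzzy comparison: a score between 0 and 1; how high is "high enough"
# is a judgment call, not something the score itself can settle.
similarity = SequenceMatcher(None, a, b).ratio()
```

The exact comparison fails, while the fuzzy score comes out high. That is the trap: two genuinely different businesses a few doors apart on the same street would score just as high, so a threshold alone cannot say whether these are one listing or two.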
I am sure that some of you do not think that this is a big problem. However, I can personally assure you that it is a big and time-consuming problem. It is common, when trying to match business listings between one company’s data and specialty data from another provider, for the match rate to be less than forty percent. However, if you would like to get an idea of the approximate size of the problem, go to http://maps.google.com. The page will show a map of the United States. Your job is to enter the term “restaurants”, with no other qualifier. The reply will be a list of restaurants that apparently includes 1,411,326,571 establishments.
Okay, okay, so this is not a valid example, but even if we throw out 1 billion, we are left with more restaurants than there are people in the United States.
Yes, I know that there is a great deal of confounding in this example, but there must be some sort of problem here since Google begins to list all of those restaurants on the left edge of the page. (And “No” I did not check each listing, but I understand that the number of restaurants in the Google local search database is around 82 million, far beyond the number of restaurants in the U.S.) The reason for many of these “false” restaurant entries is that Google likely agglomerates business listings databases from several companies, which generates duplicates due to mismatches on some aspect of the listing. Then, Google allows users to edit address/name information, which, in some cases, creates additional erroneous listings. Finally, Google allows business owners to edit their listings, which likely generates additional duplicate listings due to differences between the owner’s information and the information in the business listing database.
What to Do?
Oh, back to those persistent IDs and how they can be used to your advantage. infoUSA has recently announced a string of associations with companies such as BooRah (restaurant reviews), Urban Mapping, Maponics, Merchant Circle, Gas Price Watch.com and others who own data that helps make infoUSA’s data more informative. Since these deals caught my eye and I had been thinking about this type of data fusion headache, I called Pankaj Mathur, a senior account executive with infoUSA, and asked him what was up with all of the news.
Pankaj indicated that while infoUSA had traditionally stayed away from collecting volatile data (think gas prices), many of its customers had strategies that called for integrating data from a variety of sources in order to help meet their goals for market growth. In order to accommodate these needs, infoUSA had decided to share, with select partners, the persistent IDs that it has created for objects in its business listings database.
BooRah, for instance, provides ratings and reviews for approximately 150,000 of the restaurant listings in infoUSA’s database. According to the infoUSA press release on the topic, “BooRah’s content will be linked to infoUSA’s individual record ID numbers, which can then be added into infoUSA’s base record data including company name, address, phone number and category (SIC Code).”
In a related extension, infoUSA press releases indicated that the Company has worked with Urban Mapping and Maponics (providers of neighborhood boundary data) by pretagging its listing addresses with the neighborhood in which they fall, based on the data provided by these vendors. Doing so will allow users of infoUSA data who also provide data from Maponics or Urban Mapping to let their customers search for businesses by neighborhood in addition to ZIP Code.
The beauty of infoUSA’s move is that once the matching is done, the association is done forever, since it is linked to the persistent ID of the business listing (object). By working with select partners and sharing its persistent ID, infoUSA will extend the usability of its own data and could be on the way to becoming a standard.
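The "match once, link forever" idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the ID formats, the field names and the idea of storing the result as a simple lookup table are mine, not a description of how infoUSA or its partners actually exchange data.

```python
# Hypothetical crosswalk produced once by the (expensive) matching step:
# partner review ID -> persistent listing ID.
crosswalk = {"rev-001": "P-5501", "rev-002": "P-7310"}

listings = {
    "P-5501": {"name": "Example Cafe", "phone": "555-0100"},
    "P-7310": {"name": "Sample Grill", "phone": "555-0101"},
}
reviews = {"rev-001": {"rating": 4}, "rev-002": {"rating": 2}}


def enrich(listings, reviews, crosswalk):
    """Attach partner ratings to listings through the stored crosswalk."""
    enriched = {pid: dict(rec) for pid, rec in listings.items()}
    for review_id, review in reviews.items():
        pid = crosswalk.get(review_id)
        if pid in enriched:
            enriched[pid]["rating"] = review["rating"]
    return enriched


# A later release changes the name and phone number of one listing
# but keeps its persistent ID, so no re-matching is needed.
listings_v2 = {**listings, "P-5501": {"name": "Example Cafe and Bar", "phone": "555-0199"}}
enriched_v2 = enrich(listings_v2, reviews, crosswalk)
```

Because the join runs on the ID rather than on the name or address, the rating still lands on the right business after the listing's descriptive fields have changed.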
Of course, there is always a fly in the ointment. In this case, it has nothing to do with infoUSA’s use of technology. Unfortunately, its founder (Vin Gupta), who was forced out as Chief Executive in 2008, has now suggested that the company should consider selling itself. Since he remains the stockholder with the most shares, this might be interesting. Here is a link to the press release on Gupta’s Gauntlet.
How about Nokia as a buyer? Oops, Navteq is already in that business? You know, the listings that don’t have telephone numbers?