GeoSearch going forward

A quick look back at what we have tried in the past year, mostly to explain how we got to where we are:

1st quarter 2008: GeoSearch module started

  • Basically consisted of a pagination engine that listed X features at a time in a KML document
  • Sitemap linked only to KML document describing GeoServer namespace
  • Did not work because G did not crawl KML links. Atom links did not help either
  • Problematic because GE would crash if it got one of these paginated documents
  • If the features had been indexed, the source document link would have pointed to one of these GE-crashing documents.

2nd quarter 2008: Regionating code starts

  • First iteration of regionating code. Could only do datasets with a few thousand features
  • Added GWC caching
  • Second iteration of the regionating implementation begins. It supported a few hundred thousand features but demanded more data preparation; the main advantage was that you no longer had to precalculate the hierarchy.
  • Started feeding the hierarchies to the crawler, the idea being that the hierarchy would be built as needed.
  • Worked on making the hierarchies two-way, so that if someone started on a tile in the middle, GE would also be able to zoom out. This turned out not to be possible.
  • Still unable to get G to crawl our data

3rd quarter 2008

  • Our KML documents get hit, but only the top ones in the hierarchy, and very rarely
  • G points out that our KML is invalid and that we should validate against the KML 2.2 schema
  • Work on preparing data
  • Still unable to get G to crawl our data

4th quarter 2008

  • G makes it clear that we need complete sitemaps linking to every single tile, since the crawler cannot follow KML links.
  • This is bad, because our ability to gradually build the hierarchy is now worthless. Everything must be precalculated.
  • G crawler is actually down for several days, leading to some confusion on whether we are doing the right thing
  • It becomes clear that the KML crawler is completely separate from the one that has intermittently been hitting our site
  • Sitemap implemented in GWC. It is heavy on I/O and seeds the entire hierarchy when the sitemap is requested
  • Sitemap backported to GeoServer; problematic because there is no good seeding mechanism, so the bot builds the hierarchy gradually and unreliably
  • Weekly phone calls with G start
  • Some KML improvements merged
  • Nice layer description page added to geosearch module. Alleviates some of the problems.
  • More problems with the data that was loaded

Learned in the process

  • Regionating and GeoSearch are completely separate issues. I personally lost sight of this because they came in one package, and until the layers got really big it seemed convenient to reuse the concepts. They only meet when the user tries to open a KML document in Google Earth, and for layers that span multiple files that is something G will have to solve; anything we do is just a hack.
  • Regionating is tough and requires you to prepare your data, probably more than most people will be willing to do
  • Geospatial information without text attributes will get crawled but pruned during indexing, so it doesn't make sense to worry about the 1 million+ feature layers
  • The name tag (title.ftl) is important and the best attribute(s) must be selected manually
  • There is a 2-3 week cycle between changing our output and getting feedback. This is hard to rush, and G is making changes on their end too.

Going forward (just my thoughts)

  • Linking: We primarily rely on links inside the description to guide users from the search results, but G also lets users click through to the source document, and that link is often more prominent. This is problematic. We could solve it with a redirect for non-bot clients, but that kind of cloaking is really a big no-no. Still looking for a good solution; Google Earth support for Atom tags appears to be the best one, but it is outside our control.
  • KML output: G reduces features to a centroid before storing them. No polygons or lines, and formatting is stripped away as well. This means we can save a lot of space and bandwidth, and perhaps generate a weighted centroid. (The first sketch after this list shows what such stripped-down output, including the Atom link mentioned above, could look like.)
  • Regionating: We have used the hierarchies because they are prioritized, pre-cached, easy to generate sitemaps from, and good if a client follows the link. But regionating is more complicated than what we need for pure geosearch. We need to decide which route we want to take, especially if we want a custom output format. The paginated approach does have some advantages; it may also remove the requirement to have reliable fids.
  • Persistence: G prefers persistent documents, or at least something that looks like it. Mainly, we should not move features around between documents. Not sure whether this requires fids, but we should keep track of which documents have been modified, so that the bot does not have to download gigabytes of data to pick up one new placemark. Sitemaps support last-modified timestamps, and these are important (see the second sketch after this list). Caching makes all of this easier in most cases.
  • Datasets: There is currently no point in exposing data where each placemark does not have a reasonable description that can be found through text searches. This also means that many of the layers with 1 million+ features are not very important.
  • Documents: In the future, a concept of documents will change the point above. The data will still be interesting because of the description of the document. Think of light poles in NYC as an example: every single feature is pretty boring, but the complete set can be interesting. The main problem with documents is that there is a 3 MB limit on files, and currently there are no plans to support multi-file documents. So this will bring us back to something similar to regionating again.
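
To make the KML output and linking points concrete, here is a minimal sketch in Java (using the JTS geometry API that GeoServer already depends on) of what a stripped-down geosearch placemark could look like: geometry collapsed to a centroid, no styling, and a document-level atom:link pointing back at our layer description page. The class, the layerUrl parameter, and the escaping-free string building are hypothetical simplifications for illustration, not an existing GeoServer API.

    import com.vividsolutions.jts.geom.Geometry;
    import com.vividsolutions.jts.geom.Point;

    public class GeoSearchKmlSketch {

        // Hypothetical helper: reduce one feature to a bare-bones placemark.
        // JTS computes the plain centroid; a weighted centroid would need a
        // custom pass over the coordinates.
        static String placemark(String name, String description, Geometry geom) {
            Point c = geom.getCentroid(); // polygons and lines collapse to one point
            return "  <Placemark>\n"
                 + "    <name>" + name + "</name>\n"
                 + "    <description>" + description + "</description>\n"
                 + "    <Point><coordinates>" + c.getX() + "," + c.getY()
                 + "</coordinates></Point>\n"
                 + "  </Placemark>\n";
        }

        // Wrap the placemarks in a KML 2.2 document carrying an atom:link
        // back to the layer page (KML 2.2 allows atom:link on Document).
        static String document(String layerUrl, String placemarks) {
            return "<kml xmlns=\"http://www.opengis.net/kml/2.2\"\n"
                 + "     xmlns:atom=\"http://www.w3.org/2005/Atom\">\n"
                 + "<Document>\n"
                 + "  <atom:link href=\"" + layerUrl + "\"/>\n"
                 + placemarks
                 + "</Document>\n"
                 + "</kml>\n";
        }
    }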
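
Similarly, a hedged sketch of the persistence point: a sitemap where every page URL carries a <lastmod> timestamp, so the bot can skip documents that have not changed. The pageUrls map (page URL to last-modified date) is assumed to come from whatever cache or modification tracking we end up with.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Map;

    public class SitemapSketch {

        // The sitemap protocol accepts W3C datetime values; a plain date is enough here.
        static final SimpleDateFormat LASTMOD = new SimpleDateFormat("yyyy-MM-dd");

        // pageUrls is hypothetical: each geosearch page URL mapped to the time
        // its contents last changed (e.g. taken from the GWC cache entry).
        static String sitemap(Map<String, Date> pageUrls) {
            StringBuilder sb = new StringBuilder(
                  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
            for (Map.Entry<String, Date> e : pageUrls.entrySet()) {
                sb.append("  <url>\n")
                  .append("    <loc>").append(e.getKey()).append("</loc>\n")
                  .append("    <lastmod>").append(LASTMOD.format(e.getValue()))
                  .append("</lastmod>\n")
                  .append("  </url>\n");
            }
            return sb.append("</urlset>\n").toString();
        }
    }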

Proposal

  • Let's stop mixing regionating and geosearch. Rename the H2 database directory from "geosearch" to "kml_hierarchy".
  • Let the user choose the <name> tag through the control panel instead of title.ftl. Keep title.ftl for those who need something fancier (e.g. combining several fields).
  • Create a custom output format and keep it in the geosearch module. Something easy and fast that does not reuse all the GML code; there's really no reason to, since we'll just need the Freemarker template and a centroid.
  • Let's not worry too much about documents at this point. We can add the layer abstract to the first page of features, so it will work for small layers (everything in one document); for bigger ones, G does not have a solution anyway.
  • Let's discuss whether the pagination is reliable without fids. If it is, we should use it instead of regionating. The sitemap can then be "guessed" from the number of features in the datastore, so we don't have to precalculate everything (see the first sketch after this list). For SQL databases, we may be able to reuse the KML regionating attribute to get an order of importance?
  • I'm divided on whether we should cache geosearch pages. It makes for easy timestamps, and it will definitely lighten the load, since the bot comes back pretty often when you have thousands of pages. There is a delay in the indexing process anyway, so feeding live data to G does not make sense... so I am leaning towards yes. The downside is that there needs to be a way to communicate the total number of pages, but this could be done gradually by requesting X more pages every time the sitemap is refreshed and stopping when there are no more. That means two backend requests for every sitemap request: check the last known page, plus one further (see the second sketch below).
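
A sketch of the "guessed" sitemap idea, assuming a fixed page size and a startIndex-style URL scheme (both made up here): one cheap count query against the datastore replaces precalculating the whole hierarchy.

    import java.util.ArrayList;
    import java.util.List;

    public class GuessedSitemapSketch {

        static final int PAGE_SIZE = 100; // assumed: X features per KML document

        // featureCount would come from a cheap count query against the
        // datastore; the URL pattern below is hypothetical.
        static List<String> pageUrls(String baseUrl, String layer, int featureCount) {
            int pages = (featureCount + PAGE_SIZE - 1) / PAGE_SIZE; // ceiling division
            List<String> urls = new ArrayList<String>();
            for (int page = 0; page < pages; page++) {
                urls.add(baseUrl + "/geosearch/" + layer + ".kml?startIndex="
                        + (page * PAGE_SIZE));
            }
            return urls;
        }
    }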
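
And one reading of the gradual page-count discovery for cached pages: on each sitemap refresh, confirm that the last page we already advertise still has content, then probe one page further, so the sitemap grows until a probe comes back empty. PageProbe is a stand-in for whatever backend check (cache lookup or lightweight query) we end up using; the at-most-two probes per refresh match the "check last page and one further" idea above.

    public class PageDiscoverySketch {

        // Hypothetical backend check: does page n of this layer contain features?
        interface PageProbe {
            boolean pageExists(String layer, int page);
        }

        // Called once per sitemap refresh. knownPages is the page count we
        // advertised last time; the return value is what we advertise now.
        // At most two probes per refresh: the last known page, and one beyond it.
        static int refreshPageCount(PageProbe probe, String layer, int knownPages) {
            if (knownPages > 0 && !probe.pageExists(layer, knownPages - 1)) {
                return knownPages - 1; // layer shrank: back off one page
            }
            if (probe.pageExists(layer, knownPages)) {
                return knownPages + 1; // one more page exists: grow the sitemap
            }
            return knownPages; // no further pages yet
        }
    }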