Matching spatial data from different sources
Introduction
As the capabilities of geographic information systems grow, so does the demand for digital spatial data. To meet this demand, public and private organisations capture spatial data in large quantities. Data capture depends on the application and is carried out at different scales and with different data models. The result is multiple representations of the same topographic objects of the landscape. This leads to new problems of data handling as well as new user requirements. Acquiring and revising the same spatial data several times is clearly uneconomical. The amount of data needed for national and international projects is so large that it can only be captured with high investment. Integration methods are needed to increase the application potential, to improve the reusability of the data and to minimize the costs of updating.
Multiple data capture
The main problem in integrating two data sets is the different geometrical representation of spatial objects. Even multiple acquisition of data in the same data model leads to different data sets because of different discretization of coordinates and different interpretation of the landscape. Additional problems arise from different views of the world and different data quality characteristics.
Figure 1 illustrates these problems using the example of road network data in Germany. One of the main sources of road network data in Germany is the European standard GDF (Geographic Data File). GDF data are captured for most areas of Western Europe and are available for the whole of Germany. They are acquired by two different consortia, the European Digital Road Map Association (EDRA) and European Geographic Technologies (EGT), which compete with each other. Both acquire GDF data independently for the whole of Germany in the same data model. This leads to two different data sets, not only because of different discretization of the coordinates but also because of different data sources and different data capture instructions. Figure 1a) shows this situation, with geometrical as well as topological differences between the data sets.
Figure 1: Multiple data capture
Further differences appear when GDF data are compared with data of the German topographic cartographic spatial database (ATKIS). Presently ATKIS contains 60 different feature types for the whole area of Germany at the scale 1:25,000. (Besides this scale there are further levels of data aggregation at the scales 1:200,000 and 1:1,000,000; GDF and ATKIS correspond most closely at the level 1:25,000.) The feature types captured in both ATKIS and GDF form only a subset of each, because the two serve different applications. The aim of ATKIS is to provide users with a basic set of spatial topographic objects, whereas GDF was developed specifically for vehicle navigation. Figure 1b) shows how differently an intersection is captured in ATKIS and in GDF.
A further example of a data model in which road network data are acquired for the whole area of Germany is shown in figure 1c). The automatic cadastral system (ALK) is the digital record of the real estate cadastral maps in Germany. It captures the landscape at scales between 1:500 and 1:2,000. The data acquisition as a whole will not be completed within this century, but comprehensive data sets are already available. Because of the large scale, most features of the landscape are captured as area features. This also applies to roads. Although all geometrical information contained in GDF data is also available in ALK data, the two data sets have no geometrical elements in common.
Integration is a matching problem
In the following, the integration of two data sets is defined as a matching problem: primitives of the two data sets are to be matched to each other. A primitive can be a geometrical element or an object structure. Once the primitives of two data sets have been matched, an integration can be performed, for example by combining and supplementing feature classes or attributes, or by improving the geometry.
Which primitives should be matched depends on the similarity of the data sets. If, for example, the data to be matched were captured in the same data model, corresponding features can easily be found because of similar object structures and attributes. Afterwards the corresponding geometry elements can be matched to each other. This is a top-down approach.
In this work a matching strategy for GDF and ATKIS data is developed. Because the object structures of GDF and ATKIS differ, the matching cannot be performed object by object. Instead, the geometrical and topological representation of the data has to be used (bottom-up approach). Because the data common to ATKIS and GDF consists mainly of objects from the road environment, the road middle axes (centre lines) were chosen as matching primitives.
Strategy
The automatic matching approach is subdivided into five steps. After a preprocessing step, a list of potential matching pairs is computed. This list is ambiguous and typically contains a large number of matching pairs. To reduce the running time of the following steps, unlikely matchings have to be identified and eliminated. The result is a smaller but still ambiguous list of potential matching partners. These matchings are then evaluated with a support function in order to compute a unique combination of matching pairs, which represents the solution of the matching problem.
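The steps above can be illustrated with a minimal Python sketch. It is not the procedure developed in this work: the segment representation, the midpoint distance and orientation criteria, the `support` formula, the threshold values and the greedy selection of a unique combination are all assumptions chosen for illustration, and preprocessing (for example bringing both data sets into one coordinate system) is taken as already done.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # a piece of a road middle axis, given by its end points


def midpoint(seg: Segment) -> Point:
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def angle_difference(a: Segment, b: Segment) -> float:
    """Smallest angle between two undirected segments, in [0, pi/2]."""
    def orientation(seg: Segment) -> float:
        (x1, y1), (x2, y2) = seg
        return math.atan2(y2 - y1, x2 - x1) % math.pi
    d = abs(orientation(a) - orientation(b))
    return min(d, math.pi - d)


def candidate_pairs(set_a: List[Segment], set_b: List[Segment],
                    max_dist: float) -> List[Tuple[int, int]]:
    """Potential matching pairs: all pairs whose midpoints lie within max_dist."""
    return [(i, j)
            for i, a in enumerate(set_a)
            for j, b in enumerate(set_b)
            if math.dist(midpoint(a), midpoint(b)) <= max_dist]


def support(a: Segment, b: Segment, max_dist: float, max_angle: float) -> float:
    """Hypothetical support value; higher means a more plausible pair."""
    d = math.dist(midpoint(a), midpoint(b)) / max_dist
    ang = angle_difference(a, b) / max_angle
    return 1.0 - 0.5 * (d + ang)


def match(set_a: List[Segment], set_b: List[Segment],
          max_dist: float = 20.0,
          max_angle: float = math.radians(30)) -> Dict[int, int]:
    # Compute the ambiguous list of potential matching pairs.
    pairs = candidate_pairs(set_a, set_b, max_dist)
    # Eliminate unlikely matchings (here: strongly differing orientation).
    pairs = [(i, j) for i, j in pairs
             if angle_difference(set_a[i], set_b[j]) <= max_angle]
    # Evaluate the remaining pairs with the support function.
    pairs.sort(key=lambda p: support(set_a[p[0]], set_b[p[1]], max_dist, max_angle),
               reverse=True)
    # Select a unique combination greedily (an assumption, not the source's method).
    result: Dict[int, int] = {}
    used_b = set()
    for i, j in pairs:
        if i not in result and j not in used_b:
            result[i] = j
            used_b.add(j)
    return result


atkis = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 0.0), (0.0, 10.0))]
gdf = [((0.0, 1.0), (10.0, 1.0)), ((1.0, 0.0), (1.0, 10.0))]
print(match(atkis, gdf))  # {0: 0, 1: 1}
```

The pruning step matters for cost: every pair that survives it has to be scored and considered in the final selection, so a cheap geometric test applied early keeps the later steps small.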
Results
The approach was tested on four test areas, each 2 km × 2 km in size, with different street densities. The ATKIS and GDF data were kindly made available by the surveying institute of the state of Baden-Württemberg and the company Bosch/Teleatlas. The results of the automatic approach were compared with manually produced matchings. Table 1 shows the results for the different test areas. On average 96.26 % of the elements were matched by the automatic procedure in the same way as manually. Test area 1 contains significantly fewer successful matchings than the other test areas. The reason is that the ATKIS and GDF data in this test area are not equally up to date. Many intersection areas had been substantially changed by traffic calming measures. The GDF data were acquired after the changes, whereas the ATKIS data were acquired before them. In some local areas this led to completely different data sets and increased the number of wrong matchings.
| Test area | 1 | 2 | 3 | 4 | all |
|---|---|---|---|---|---|
| number of matchings | 208 | 349 | 332 | 469 | 1358 |
| number of ATKIS elements | 363 | 530 | 530 | 640 | 2063 |
| number of successful matchings | 328 | 523 | 515 | 620 | 1986 |
| successful matchings in percent | 90.35 % | 98.67 % | 97.16 % | 96.87 % | 96.26 % |

Table 1: Results of the automatic matching procedure
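The percentage row of Table 1 follows from dividing the number of successful matchings by the number of ATKIS elements. The short Python check below recomputes it from the counts in the table. Cutting the result off after two decimals instead of rounding is an assumption inferred from the published figures, and the function name is hypothetical.

```python
# Counts taken from Table 1 (test areas 1 to 4).
atkis_elements = [363, 530, 530, 640]
successful = [328, 523, 515, 620]


def percent_truncated(part: int, whole: int) -> float:
    """Percentage of part in whole, cut off (not rounded) after two decimals.

    Truncation is an assumption inferred from the published figures.
    """
    return (10000 * part) // whole / 100


per_area = [percent_truncated(s, n) for s, n in zip(successful, atkis_elements)]
overall = percent_truncated(sum(successful), sum(atkis_elements))

print(per_area)  # [90.35, 98.67, 97.16, 96.87]
print(overall)   # 96.26
```

Note that the overall value is pooled over all 2063 elements, so it is weighted by the size of each test area and is not the plain mean of the four percentages.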
After the automatic matching, manual postprocessing can be carried out. To minimize the cost of postprocessing, quality measures are needed that automatically calculate how certain a single matching pair, or the whole matching between two data sets, is. The attributes and relations of the elements can be used to calculate the certainty of a matching pair. Depending on the application, maximum values can be defined which must not be exceeded. For example, all matching pairs could be marked as uncertain if the distance between the matched elements exceeds a specific value. This approach is useful if the matching result is used to improve the geometrical quality of the data sets. An application from the field of vehicle navigation would instead mark as uncertain all matching pairs that do not have exactly the same topological relations.
Further quality measures can be derived from information theory. A local as well as a global measure can be defined for an automatic verification of the results. Table 2 shows the results of the global quality measure. This measure describes the amount of information lost through the matching. The higher the loss of information, the higher the number of wrong matchings. With this measure the quality of the matching can be verified automatically.
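The two application-dependent criteria described above can be sketched as a simple filter. The `MatchingPair` record, its field names and the 5 m threshold are hypothetical and only serve to illustrate the idea; the source does not specify concrete values or data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchingPair:
    atkis_id: str
    gdf_id: str
    distance_m: float    # distance between the matched elements (assumed in metres)
    same_topology: bool  # True if both elements have identical topological relations


def flag_uncertain(pairs: List[MatchingPair],
                   max_distance_m: Optional[float] = None,
                   require_same_topology: bool = False) -> List[MatchingPair]:
    """Return the pairs that violate the criteria chosen for the application."""
    uncertain = []
    for p in pairs:
        too_far = max_distance_m is not None and p.distance_m > max_distance_m
        topology_differs = require_same_topology and not p.same_topology
        if too_far or topology_differs:
            uncertain.append(p)
    return uncertain


pairs = [
    MatchingPair("A1", "G1", 1.2, True),
    MatchingPair("A2", "G2", 7.5, True),
    MatchingPair("A3", "G3", 0.8, False),
]

# Geometry improvement: only the distance matters.
print([p.atkis_id for p in flag_uncertain(pairs, max_distance_m=5.0)])  # ['A2']
# Vehicle navigation: identical topological relations are required.
print([p.atkis_id for p in flag_uncertain(pairs, require_same_topology=True)])  # ['A3']
```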
| test area | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| percent successful matchings | 90.35 % | 98.67 % | 97.16 % | 96.87 % |
| global quality measure [bit] | 10.57 | 9.36 | 9.71 | 9.76 |

Table 2: Global quality measure
A local quality measure was used to evaluate the individual matching pairs. A measure from information theory was defined to subdivide all matching pairs into three classes. Table 3 shows the results of this analysis. The three classes contain approximately the same number of matching pairs, with the exception of the class red, which contains somewhat fewer matchings.
| quality class | red | yellow | green |
|---|---|---|---|
| number of matchings | 367 | 494 | 497 |
| number of wrong matchings | 20 | 9 | 2 |

Table 3: Local quality measure
Matchings in the class red have strongly different attributes and/or topological relations. These matchings are classified as uncertain. The class yellow contains all matchings classified as conditionally certain; only nine of them are wrong. The class green contains all matchings that are very certain. The wrong matchings that remain in this class arise in areas where no unique matching is possible because several combinations of matchings have the same quality. Uncertain matchings can be highlighted on screen to give an operator the possibility of revising the results interactively.
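How well the three classes separate wrong from correct matchings can be read from the counts in Table 3. The following Python lines compute the share of wrong matchings per class; the function and variable names are hypothetical, and the numbers are taken directly from the table.

```python
from typing import Dict, Tuple

# (number of matchings, number of wrong matchings) per class, from Table 3.
table_3: Dict[str, Tuple[int, int]] = {
    "red": (367, 20),
    "yellow": (494, 9),
    "green": (497, 2),
}


def wrong_share_percent(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Share of wrong matchings within each class, in percent (two decimals)."""
    return {name: round(100 * wrong / total, 2)
            for name, (total, wrong) in counts.items()}


print(wrong_share_percent(table_3))
# {'red': 5.45, 'yellow': 1.82, 'green': 0.4}

total_wrong = sum(wrong for _, wrong in table_3.values())
print(total_wrong)  # 31
```

About two thirds of the 31 wrong matchings (20 of them) fall into the class red, so an operator who reviews only the highlighted uncertain pairs already sees most of the errors.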
Summary
So far only a few studies deal with the matching and integration of spatial data. Especially in interdisciplinary projects it is often very important to use data from different sources. The approach presented in this paper is important for the realisation of an interoperable GIS (IOGIS). The matching procedure makes different kinds of applications possible (see figure 2). The geometry of two data sets can be compared automatically in order to identify wrongly captured elements or updates in the data sets and to improve the geometrical quality. A further application is the transfer of attributes or object classes that are captured in one data set but not in the other. For example, street names are captured in GDF but not in ATKIS. This application has a high potential for automation. The full integration of data sets from different sources is currently being researched in the field of multiple representation. The automatic matching of two data sets could play an important role in overcoming problems of this kind.
Figure 2: Integration of ATKIS and GDF
Publications
- Walter, V. and Fritsch, D. [1996]: Integration von Straßenverkehrsdaten aus unterschiedlichen Datenmodellen, Nachrichten aus dem Karten- und Vermessungswesen, Reihe I (115), 179-192.
- Walter, V. [1997]: Zuordnung von raumbezogenen Daten - am Beispiel der Datenmodelle ATKIS und GDF, Dissertation, Deutsche Geodätische Kommission (DGK) Reihe C, Nummer 480.
- Walter, V. [1999]: Matching Spatial Data Sets: a Statistical Approach, International Journal of Geographical Information Science, Vol. 13, No. 5, 445-473.
Financed by:
This research work was carried out under a doctoral fellowship sponsored by Siemens Nixdorf Information Systems, Munich, Germany, which is gratefully acknowledged.
For further information, contact Dr.-Ing. Volker Walter.

