Geodata

Overview


                                                                                     

Spatial data require more processing than tabular data. This page describes which data problems to look for and how to fix them. For specific information about CODs, see the Common Operational Dataset Processing page, COD-PS Standards and Process, and COD-AB Standards and Process.

Spatial Data Quality Checks


The six themes outlined below should be considered before using and disseminating geographic data. If the data do not meet the criteria defined by these themes, and/or the data cannot be cleaned to meet these criteria, the sources for these data should be reviewed. If there is no other option to correct the problems, these issues should be documented in the metadata.

  1. They have a known source: data should not be used if the source is unknown because there is no guarantee of the verification of the data or the appropriate permission to use the data. 
  2. They are complete in geographic scope: data need to span the entire country(s)/region(s) of interest. See the example of an incomplete dataset in Figure 3. In this case, more research may be done to see whether data are available that span the entire country. 
  3. They have complete and accurate attribute information: if data do not have information about each geographic feature, there is an increased risk for the data to be used incorrectly. See more specific details on how much attribute information is needed under the Data Cleaning topic.
  4. They have a known projection: unknown or incorrect coordinate reference systems (including datum and projection) can prevent the data from being overlaid properly with other sources of geographic information and can lead to incorrect spatial analysis. If the data’s coordinate reference system is unknown, refer to the source of the data to see if the original coordinate reference system can be determined. 
  5. They are up-to-date and relevant to the current situation: the information associated with the data must be up-to-date OR useful to the situation for analysis. See example below of administrative boundaries not reflecting the current situation. If updated data are not available, out-of-date data are better than none, but the problem should be documented in the metadata record.
  6. They have correct topology: the spatial properties of the data must be accurate for the data to be used correctly. See the example of topological errors in a polygon file in Figure 3. Topology is checked differently for polygon, arc and point files. 

Point Files: all points are generally in the correct location. Two examples of files that do not pass the topology check are 1) a file where a typing error was made in the latitude and/or longitude field(s) so that the point is not in the correct location, or 2) a file where the location for a populated place is obviously incorrect (e.g. located in the ocean or in the wrong administrative unit). 
Polygon and Arc Files: no gaps and/or overlaps between the lines that make up the arcs or polygons are present in the data.
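
Part of the point-file check can be automated. The sketch below is a minimal illustration in plain Python, assuming points are held as (name, latitude, longitude) tuples in decimal degrees. The function name flag_suspect_points, the sample records and the bounding box values are hypothetical and not part of any existing tool.

```python
def flag_suspect_points(points, bbox):
    """Return (name, reason) pairs for points that fail basic location checks.

    points: iterable of (name, lat, lon) tuples in decimal degrees.
    bbox:   (min_lat, min_lon, max_lat, max_lon) of the area of interest.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    problems = []
    for name, lat, lon in points:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            # Not a possible latitude/longitude pair at all
            problems.append((name, "coordinates outside valid range"))
        elif not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            # Valid coordinates, but not where the data should be
            problems.append((name, "outside area of interest"))
    return problems


# Hypothetical sample: Town B has latitude and longitude swapped,
# Town C has a typing error in the latitude.
sample = [
    ("Town A", 12.5, 34.0),
    ("Town B", 34.1, 12.6),
    ("Town C", 125.0, 34.2),
]
suspects = flag_suspect_points(sample, bbox=(10.0, 30.0, 15.0, 40.0))
print(suspects)
```

A bounding box is only a coarse filter: a point in the ocean or in the wrong administrative unit can still fall inside the box, so flagged and unflagged points alike may need a visual check against reference layers.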

Data Cleaning


The following are the most common types of processing that need to be done. For more information about spatial data in the COD material, see the resource section. 

Defining the Coordinate System

The coordinate reference system (CRS), often referred to as the “projection” of a given dataset, is simply a defined set of values that describe how to interpret the X and Y coordinates stored in a dataset. For example, if we have a shapefile containing only one point and its coordinates are X=34.2 and Y=45.7, we have no idea if these are degrees of latitude and longitude, meters in UTM, or state plane feet. It is the coordinate reference system that specifies degrees, meters, or feet as well as other necessary parameters such as the origin or various projection parameters. 

The CRS has two primary parts:  

  • Projection (or possibly unprojected in the case of latitude-longitude coordinates) which defines how the spherical coordinates are projected to planar coordinates.
  • Datum which defines the mathematical model of the Earth’s shape that is used in the CRS. Commonly this is WGS84; however, in some regions, other systems may be more commonly used. 

If the CRS definition is missing from the dataset, many GIS packages will assume that the data are in geographic coordinates (lat/long) based on the WGS84 datum. If this assumption is wrong, the data will not align with other datasets having correct CRS definitions. 
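
Before accepting the geographic assumption for a dataset with no CRS definition, it can help to test whether the stored values could be degrees at all. The following is a rough sketch in plain Python; the function name could_be_geographic is hypothetical, and the second sample coordinate is an invented projected value in meters.

```python
def could_be_geographic(coords):
    """Return True if every (x, y) pair fits within longitude/latitude ranges."""
    return all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in coords)


print(could_be_geographic([(34.2, 45.7)]))            # True
print(could_be_geographic([(500000.0, 4649776.0)]))   # False
```

A False result shows that the geographic assumption is wrong. A True result only shows that the assumption is possible, so the source of the data should still be consulted.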

Defining vs. Reprojecting

There are two operations that are used to change the CRS:

  • Defining the CRS (or projection in ArcGIS terminology) does not change the coordinates in the data; it simply changes how the coordinate values are interpreted. 
  • Reprojecting the data actually changes the coordinate values based on a mathematical transformation from the old CRS to the new CRS.

 In the data cleaning process, there is usually no need to reproject the data, only to make sure that the CRS definition is correct.
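
The difference between the two operations can be illustrated with a small sketch in plain Python. It assumes a dataset is represented as a dictionary holding a CRS label and a list of (longitude, latitude) pairs, and it uses the spherical Web Mercator formula (EPSG:3857) only as an example of a reprojection. The function names and the dictionary layout are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by spherical Web Mercator


def define_crs(dataset, crs_name):
    """Defining: change the CRS label only; the coordinate values stay the same."""
    return {"crs": crs_name, "coords": list(dataset["coords"])}


def reproject_wgs84_to_web_mercator(dataset):
    """Reprojecting: compute new coordinate values from the old ones."""
    if dataset["crs"] != "EPSG:4326":
        raise ValueError("the source CRS must be defined as EPSG:4326 first")
    projected = []
    for lon, lat in dataset["coords"]:
        x = EARTH_RADIUS_M * math.radians(lon)
        y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
        projected.append((x, y))
    return {"crs": "EPSG:3857", "coords": projected}


undefined = {"crs": "undefined", "coords": [(34.2, 45.7)]}
defined = define_crs(undefined, "EPSG:4326")          # same numbers, new label
projected = reproject_wgs84_to_web_mercator(defined)  # new numbers, in meters

print(defined)
print(projected)
```

The sketch also shows why the definition must be correct before any reprojection: the transformation is computed from the stated source CRS, so a wrong label produces wrong output coordinates.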

To view/edit the CRS, follow the guidance below:

  • To view the coordinate reference system of a dataset
    • In ArcCatalog or ArcMap, right click on the dataset and choose properties
    • Choose the XY Coordinate System tab
    • The coordinate system is displayed 
      • Note: if the coordinate system is GCS_Assumed_Geographic or something similar, it means that the CRS is undefined and ArcGIS is assuming that the coordinates are geographic (latitude-longitude) based on the WGS84 datum. If this assumption is accurate, the coordinate system should be redefined to GCS_WGS_1984 as described below in order to eliminate any ambiguity.
  • To change the coordinate reference system of a dataset
    • Ensure the dataset is not open in any ArcMap project 
    • In ArcCatalog, right click on the dataset and choose properties
    • Choose the XY Coordinate System tab
    • Click Select
    • Locate the name of the CRS that the data set is actually in (remember, we are not reprojecting here, only defining)
      • For example, to define a data set as latitude-longitude based on the WGS84 datum, choose Geographic Coordinate Systems>World>WGS 1984.prj
  • To define the same projection on many datasets at the same time
    • Ensure that none of the datasets are open in any ArcMap project 
    • Open ArcToolbox
    • Choose Samples>Data Management>Projections>Batch Define Coordinate System and complete the wizard
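
Outside ArcGIS, a quick way to find shapefiles whose CRS has never been defined is to look for a missing .prj sidecar file, which is where a shapefile's CRS definition is stored. The sketch below uses only the Python standard library; the function name shapefiles_missing_prj and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def shapefiles_missing_prj(folder):
    """List .shp files under folder (recursively) that have no .prj sidecar file."""
    return sorted(
        shp.name
        for shp in Path(folder).rglob("*.shp")
        if not shp.with_suffix(".prj").exists()
    )


# Demo with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("roads.shp", "roads.prj", "rivers.shp"):
        (Path(tmp) / name).touch()
    missing = shapefiles_missing_prj(tmp)

print(missing)  # ['rivers.shp']
```

The check only shows that a definition exists, not that it is correct, and on case-sensitive file systems it will not match upper-case extensions such as .PRJ.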

Verifying Attributes and Features

Essential vs. Non-Essential Information

A dataset may have too much or not enough information in the attribute table. Too little information may cause the data to be unusable or misrepresented. Too much information may cause confusion to users of the data and also lead to misrepresentation. To guide in what is too much and too little information, attributes are broken into essential, marginal and external according to the following definitions: 

Essential

Essential information is information that is pertinent to the use of the geometry and that provides linkages between the geometry and other essential datasets such as demographic or situational assessments. At a minimum, all datasets should include a unique ID and a standard name for each feature. 

Marginal

Marginal information is information that aids in the use of the geometry and that does not change very often. Data may be valid and useful without marginal attributes, but this information enhances their use. For example, a polyline layer for rivers is usable with only the feature name (e.g. river name) and classification (e.g. river type); however, marginal information such as drainage basin size may enhance its use. 

External

External information is information that can be stored in a separate table and linked to the data using a unique ID. This information may be time sensitive (e.g. demographic data or a specific theme such as % food insecure population, security risk, etc). The information should be stored in tables that share a unique ID with the geometry. Joins and relates can be used to link these data when needed. Keeping these data external to the geo-dataset avoids the inclusion of obsolete data in the geographic dataset.
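
The idea of keeping time-sensitive values in an external table and joining them on a shared unique ID can be sketched in plain Python as follows. The field names (adm1_code, adm1_name, population), the codes and the figures are all invented for illustration.

```python
# Essential attributes stored with the geometry (geometry omitted here)
features = [
    {"adm1_code": "XX01", "adm1_name": "North"},
    {"adm1_code": "XX02", "adm1_name": "South"},
]

# External, time-sensitive table keyed by the same unique ID
population = {"XX01": 120000, "XX02": 85000}


def join_external(features, table, key, new_field):
    """Return copies of the features with a value looked up from table by key."""
    joined = []
    for feature in features:
        row = dict(feature)                        # leave the original untouched
        row[new_field] = table.get(feature[key])   # None when there is no match
        joined.append(row)
    return joined


joined = join_external(features, population, key="adm1_code", new_field="population")
print(joined)
```

Because the join produces copies, the stored features keep only their essential attributes, and the external table can be replaced when newer figures arrive.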


Attribute Names

Geodatabases allow for long attribute names, which is useful but may become problematic when exporting to formats that do not support long field names (e.g. shapefiles, which limit field names to 10 characters). Ideally, attribute names should be limited to 10 characters and should not begin with numbers or contain spaces. The alias function in ArcCatalog can be used to assign a longer and more descriptive name to the attribute that will be visible in ArcGIS applications. 
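
The naming guidance above can be checked with a short script. The sketch below uses a regular expression that is slightly stricter than the text: it also assumes that names start with a letter and contain only letters, digits and underscores. The function name and the sample field names are hypothetical.

```python
import re

# A letter followed by up to 9 letters, digits or underscores (10 characters max)
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,9}")


def invalid_field_names(names):
    """Return the names that break the 10-character, no-space, no-leading-digit rule."""
    return [name for name in names if not FIELD_NAME_PATTERN.fullmatch(name)]


bad_names = invalid_field_names(
    ["adm1_code", "2020_pop", "feature type", "population_total"]
)
print(bad_names)  # ['2020_pop', 'feature type', 'population_total']
```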

Features

A geodatabase should include the minimum number of feature classes possible per feature dataset. Features of like geometry (e.g. points or polygons or lines) and content (e.g. populated places or land use or road) should be one feature class with the classification of each feature differentiated using an attribute field. 

Example: Populated place gazetteer

In many cases, there is a separate shapefile or feature class associated with the different types of populated places. Consider the following: 

Shapefile/Feature Class 1: National Capital 
Shapefile/Feature Class 2: Administrative Capitals 
Shapefile/Feature Class 3: Cities with population greater than 100,000 
Shapefile/Feature Class 4: Cities with a population between 50,000 and 99,999 
Shapefile/Feature Class 5: Cities with a population less than 50,000 

Ideally, these separate layers are combined into one feature class for all populated places including small towns, large cities, administrative capitals, national capitals, etc. These files can easily be merged by creating an empty feature class with the following schema (essential attributes): 

Feature class schema (essential attributes)


In this case, a controlled vocabulary for the feature type defines the type of populated place (e.g. national capital, administrative capital, etc.). OCHA does not have a defined schema for any particular dataset but follows standards used by partners and the providers of the data. 
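
The merge described in this example can be sketched in plain Python. The sketch assumes each source layer is a list of feature records and that the layer's type is written into a single attribute; the field names pcode, name and feat_type and the sample records are hypothetical and do not represent an OCHA schema.

```python
def merge_layers(layers):
    """Combine several layers into one list, recording each feature's type.

    layers: dict mapping a feature-type label to a list of feature dicts.
    """
    merged = []
    for feature_type, features in layers.items():
        for feature in features:
            row = dict(feature)
            row["feat_type"] = feature_type  # value from the controlled vocabulary
            merged.append(row)
    return merged


layers = {
    "national capital": [{"pcode": "P001", "name": "Capital City"}],
    "administrative capital": [{"pcode": "P002", "name": "Province Town"}],
    "city": [{"pcode": "P003", "name": "Market Town"}],
}
populated_places = merge_layers(layers)

# The unique ID must remain unique after merging
ids = [place["pcode"] for place in populated_places]
ids_are_unique = len(ids) == len(set(ids))

print(populated_places)
print(ids_are_unique)
```

Checking ID uniqueness after the merge matters because the separate source layers may have been numbered independently.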

Verifying Geometry

Repair Topological Errors

In order to ensure data are used correctly and accurately in geospatial analysis and cartographic representation, topological errors should be fixed and features developed prior to use.  Instructions on how to identify and repair common types of topological errors can be found in How to Check and Repair Topology using ArcGIS.

Create polyline from a polygon

It is recommended that all data repositories include polygons AND polylines of administrative and international boundaries. Boundaries should be represented with the polyline file and background landmass and labels should be represented with polygons. Preparing data repositories with both of these files in advance allows for quick map creation and proper cartographic representation. Ideally, the line versions of the administrative boundaries should have the coastlines removed. 



The 4 maps above demonstrate the importance of appropriate data preparation of international and administrative boundaries in cartographic representation. The figure at top left displays the map polygons, and each consecutive map shows the polyline layers added in order from bottom (first administrative boundary) to top (coastline).

The polyline layer should always be created from the same source and data that will be used for the polygon layer. A simple method for creating a line file from a polygon file in ArcGIS is outlined below.

How to create polylines from polygons in ArcGIS

  • Create the line file from the polygon in ArcCatalog
    • Open ArcCatalog
    • Right-click on the polygon layer and select Export > to coverage
    • Maximize the coverage to see the arc, label, polygon, region and tic. Right-click on the arc and select Export > to Shapefile (single) or geodatabase 
  • Remove the outer border in ArcMap
    • Open ArcMap
    • Add the arc shapefile created from the coverage
    • Go to Editor, and select > Start Editing
    • Go to Selection in the main panel, and select Select by Attributes
    • Set query “RIGHTPOLY” = 1

Query to select outer line.

  • Select Apply and OK
  • You will see the outline highlighted. Now select Delete on your keyboard, and the outline is deleted

At any administrative level, boundaries that coincide with boundaries at a higher level in the hierarchy should be removed.  The outer boundary is removed so that it does not conflict with the boundary of the administrative unit at a higher level. For example, first administrative level boundaries will be encompassed by international boundaries, and international boundaries will be encompassed by coastlines.
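
The selection logic behind the RIGHTPOLY query can be illustrated outside ArcGIS. The sketch below assumes, as the query above implies, that each arc carries LEFTPOLY and RIGHTPOLY attributes and that the value 1 identifies the area outside the dataset. The arc records are invented for illustration.

```python
OUTSIDE_ID = 1  # assumed ID of the area outside the dataset, as in the query above

# Hypothetical arc attribute table
arcs = [
    {"arc_id": 1, "LEFTPOLY": 2, "RIGHTPOLY": 1},  # outer border
    {"arc_id": 2, "LEFTPOLY": 2, "RIGHTPOLY": 3},  # shared internal boundary
    {"arc_id": 3, "LEFTPOLY": 3, "RIGHTPOLY": 1},  # outer border
]

# Keep only arcs that do not touch the outside area on their right-hand side
internal_arcs = [arc for arc in arcs if arc["RIGHTPOLY"] != OUTSIDE_ID]

print([arc["arc_id"] for arc in internal_arcs])  # [2]
```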


Formats


Consider the way in which spatial data are shared. The format may affect who can use the data (e.g. people without GIS software can still use tabular data). For details on how to change spatial data formats, see: Steps For Data Format Conversion
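
As a simple illustration of sharing point data in a tabular form, the sketch below writes point records to CSV text using only the Python standard library. The function name points_to_csv, the column names and the sample record are hypothetical, and the sketch is not a substitute for the conversion steps referenced above.

```python
import csv
import io


def points_to_csv(points):
    """Write (name, latitude, longitude) records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "latitude", "longitude"])
    writer.writerows(points)
    return buffer.getvalue()


csv_text = points_to_csv([("Town A", 12.5, 34.0)])
print(csv_text)
```

When coordinates are shared this way, the CRS should be stated alongside the file, since a CSV has no place to store it.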

Outputs/Resources