Paper - Tyler Conlon

Tyler Conlon 2012
Mapping identical city names
Overview
A large amount of information can be gathered from a name. Names of people and places often hold an
iconic connection in our minds. City names, however, are often indicative of trends of exploration,
expansion and power. The goal of this visualization is to show the most prevalent trends in identical city
names around the world.
In some areas of the world a city name is repeated many times in a small area. For example, the
most prevalent city name is Krajan, Indonesia. Officially, this city name exists in 55 different locations in
the country. It does not appear anywhere outside Indonesia, however. This is something of a local
outlier, and possibly an error in the categorization of cities themselves, but it is nonetheless the highest
scorer on the list.
In other cases a name may be used globally in many countries. The city name Victoria exists in 22
different places in the world, spanning 15 different countries: Argentina, Canada, Chile,
Colombia, the United Kingdom, Grenada, Honduras, Malta, Mexico, Malaysia, the Philippines, Romania,
Seychelles (an island chain north of Madagascar), El Salvador and the United States. This includes 4
continents. Somewhat curiously, the Australian state of Victoria is not counted because it does not have
a city of the same name. Clearly the name is a very popular choice for cities.
Programming
The rendered output for this project was generated using Processing. Processing is a Java-based
graphics programming environment. The parsing and processing of the data was done using the PHP
scripting language from the command line. PHP was chosen mostly for its ease of use and simple
implementation of associative arrays. For this project I also needed a reliable method for resolving the
origin language of over 100,000 city names. I chose to use the Google Translate API. The API is
extremely easy to set up and access using cURL in PHP.
Implementation
All the data comes from 2 CSV files. One has all the cities in the world with a population of more
than 1000. The other is a table of ISO country codes to country names. First, the entire country code
list and city text file were loaded into memory. The number of instances of each name was counted in an
array. Then the duplicate city names were looped through. At this point the threshold for duplicates
can be changed to show only city names with 2 instances, for example. In addition, another array holds
the number of different countries where a name is found. This array would show the 15 different
countries for Victoria, but 22 overall instances, since some instances fall within the same country.
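As an illustration of the counting step, here is a minimal sketch written in Python rather than the PHP used for the project. The sample rows, the row layout and the count_names function are assumptions made for this example and are not taken from the original script.

```python
from collections import Counter, defaultdict

# Illustrative sample rows: (city name, ISO country code, latitude, longitude).
# The real input is the cities CSV file; these values are approximate stand-ins.
cities = [
    ("Victoria", "CA", 48.43, -123.37),
    ("Victoria", "MT", 36.04, 14.24),
    ("Victoria", "US", 28.81, -97.00),
    ("Victoria", "US", 38.86, -99.15),
    ("Krajan", "ID", -7.50, 110.80),
]

def count_names(rows, threshold=2):
    """Return {name: (instances, distinct countries)} for names seen at least `threshold` times."""
    instances = Counter(name for name, _, _, _ in rows)
    countries = defaultdict(set)
    for name, country, _, _ in rows:
        countries[name].add(country)
    return {
        name: (n, len(countries[name]))
        for name, n in instances.items()
        if n >= threshold
    }

duplicates = count_names(cities)
print(duplicates)  # {'Victoria': (4, 3)}
```

In this sample Victoria has 4 instances but only 3 countries, the same kind of gap as the 22 instances and 15 countries in the full data.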
At this point all the information regarding which cities we are interested in has been found. The
next task is to create a list of lines that need to be drawn. The latitude and longitude were available from
the cities CSV file. The script then moves through the cities of interest and connects each instance of a city
to another instance of that city. If directed, the script will ignore lines (or links, as I refer to them) within
a country. That is, it does not draw a link between Newport, Oregon and Newport, Rhode Island, but rather
only links these two instances to Newport in the United Kingdom. Due to the simple memory structures
being used, it became difficult to determine which lines had already been drawn. More specifically, if a
link was made between Berg, Germany and Berg, Austria, the program should not create another link between
Austria and Germany (simply a reverse of the original link). A simple solution to this is to sum the
latitudes and longitudes of the two link points and use these sums and the city name as the key for the $links
associative array. Since there cannot be two identical keys in the array, any two links with the same
name and summed coordinates collapse into one, which eliminates the reversed duplicate.
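The summed-coordinate key can be sketched as follows, again in Python rather than PHP. The build_links function, the row layout and the sample coordinates are assumptions for illustration, and the sketch assumes every pair of same-named cities is a candidate link. It deliberately visits each pair in both directions, as a naive loop would, to show that the key removes the reversed copy.

```python
from itertools import permutations

# Illustrative rows: (city name, ISO country code, latitude, longitude).
newports = [
    ("Newport", "US", 44.64, -124.05),  # Oregon
    ("Newport", "US", 41.49, -71.31),   # Rhode Island
    ("Newport", "GB", 51.59, -3.00),    # United Kingdom
]

def build_links(rows, skip_same_country=True):
    """Link same-named cities, storing each pair only once."""
    by_name = {}
    for name, country, lat, lon in rows:
        by_name.setdefault(name, []).append((country, lat, lon))

    links = {}
    for name, places in by_name.items():
        # permutations yields both A->B and B->A for every pair.
        for (c1, lat1, lon1), (c2, lat2, lon2) in permutations(places, 2):
            if skip_same_country and c1 == c2:
                continue
            # Addition is order-independent, so both directions share one key.
            key = (name, round(lat1 + lat2, 4), round(lon1 + lon2, 4))
            links[key] = ((lat1, lon1), (lat2, lon2))
    return links

print(len(build_links(newports)))                           # 2
print(len(build_links(newports, skip_same_country=False)))  # 3
```

One caveat of this kind of key is that two different pairs of the same name could in principle sum to the same values and be merged by mistake, although that should be rare with real coordinates.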
Several attributes were considered for expression by a link's color, among
them distance, number of connections, and population. Eventually it was decided that line color
would best represent the language of origin of the city name. Clearly there are some inconsistencies present in
this aspect of the visualization. Inconsistencies exist in the spelling of the names and in the character set used.
It is important to note that the goal here is to visualize trends, not individual points or links. The
assumption is that the majority of city names will resolve accurately enough to show this trend. Since
the Google Translate API is a pay-per-character service, city names were resolved for all cities in the list
and stored in a file for later access. This requires access to the API only once per city.
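The resolve-once-and-store idea might look like the sketch below. It is written in Python, not the project's PHP, and it does not call the Google Translate API: the detect argument is a placeholder for whatever function performs the paid lookup, and fake_detect, the JSON file format and the sample detections are all assumptions made for this example.

```python
import json
import os
import tempfile

def detect_all(names, detect, cache_path):
    """Detect each name's language once, reusing results saved in cache_path."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    for name in names:
        if name not in cache:
            cache[name] = detect(name)  # the only place the paid service would be called
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False)
    return cache

# Stand-in for the real detection call; it records every lookup it receives.
calls = []
def fake_detect(name):
    calls.append(name)
    return {"Victoria": "en", "Berg": "de"}.get(name)  # None means "no result"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "languages.json")
    first = detect_all(["Victoria", "Berg", "Krajan"], fake_detect, path)
    second = detect_all(["Victoria", "Berg", "Krajan"], fake_detect, path)

print(len(calls))  # 3
```

The second run is answered entirely from the file, so the stand-in is called three times in total rather than six.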
The API resolved most city names. Some returned no result; these were skipped. The API
returns a confidence level for each detection, but it seems to be inaccurate for single words. For example,
a simple test of the word "bonjour" resulted in a detection of "fr", but with a confidence level of only
around 50%. Because of this, confidence was not used. The API had some problems with Spanish names as
well. I suspect this may have been a character encoding issue, but names with San, Santa, Los, Las, La etc.
returned no result. This was a significant issue because of the rate at which these names occur. The top 6
most frequently occurring city names (omitting Krajan) are San Miguel, San Antonio, Santa Cruz, San
Francisco, San Isidro and San Vicente. All of these were affected by this issue. These names occur so
frequently that skipping them would have affected the overall chart trend. An additional filter had to be added to
search for these cases and give these city names the proper language. The API also incorrectly reported
a number of cities in China as English. Presumably this is because the city names were written in Latin
characters, which may throw off the Google Translate API. These links are visible over China, but are
restricted to within China itself, so they have little effect on the trend of the chart.
RGB values were assigned to each link in the PHP script. These values were based on the most frequently
occurring languages. Finally, a small program was written in Processing to read a simple CSV file
containing coordinates and colors for each line. The output was rendered at a very high resolution with
highly transparent lines to reduce clutter.
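The Spanish-name filter and the color assignment could be combined roughly as in this Python sketch. The prefix list follows the names mentioned above, but the function names, the language codes and the exact RGB triples are assumptions for illustration and do not come from the original PHP script.

```python
# Prefixes for which the detection returned no result (from the text above).
SPANISH_PREFIXES = ("San ", "Santa ", "Los ", "Las ", "La ")

# Colors follow the chart legend; the exact RGB triples are assumed.
COLORS = {
    "es": (255, 0, 0),    # red - Spanish
    "fr": (0, 255, 0),    # green - French
    "en": (0, 0, 255),    # blue - English
    "id": (128, 0, 128),  # purple - Indonesian
    "de": (255, 165, 0),  # orange - German
}
OTHER = (255, 255, 255)   # white - other

def resolve_language(name, detected):
    """Override the detection for Spanish-looking names, else keep it."""
    if name.startswith(SPANISH_PREFIXES):
        return "es"
    return detected  # may be None when the API returned no result

def link_color(name, detected):
    """Return an RGB triple for a link, or None if the name should be skipped."""
    language = resolve_language(name, detected)
    if language is None:
        return None
    return COLORS.get(language, OTHER)

print(link_color("San Miguel", None))  # (255, 0, 0)
print(link_color("Berg", "de"))        # (255, 165, 0)
```

A prefix test like this is only a rough filter: it would also label a French name such as La Rochelle as Spanish.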
Trends
The most obvious trend visible is the English name link between the UK and America, specifically
the area around New England. Perhaps the more interesting trend is the number of links between the
Philippines and South America due to Spanish exploration in both regions. Curiously, there is not a very
visible link between these two regions and the originating country of Spain. Also visible are the links
between France and Quebec.
red - Spanish
green - French
blue - English
purple - Indonesian
orange - German
white - other
admin@tyconpowered.com