Tyler Conlon 2012 Mapping identical city names Overview A large amount of information can be gathered from a name. Names of people, places often hold an iconic connection in our minds. City names however are often indicative of trends of exploration, expansion and power. The goal for this visualization is to see the most prevalent trends in identical city names around the world. In some areas of the world city names are repeated many times it a small area. For example the most prevalent city name is Krajan, Indonesia. Officially, this city name exists in 55 different locations in the country. It does not appear anywhere outside Indonesia however. This is somewhat of a local outliner, and possibly an error in the categorization of cities themselves, but none the less the highest scorer on the list. In other cases a name may be used globally in many countries. The city Victoria exists in 22 different places in the world spanning 15 different countries including Argentina, Canada, Chile Columbia, The United Kingdom, Grenada, Honduras, Malta, Mexico, Malaysia, Philippines, Romania Seychelles (an island chain north of Madagascar), El Salvador and The United States. This includes 4 continents. Somewhat curiously the Australian state of Victoria is not counted because it does not have a subsequent city. Clearly the name is very popular choice for cities. Programming The rendered output for this project was generated using Processing. Processing is a C++ based graphics programming environment. The parsing and processing of the data was done using the PHP scripting language from the command line. PHP was chosen mostly for its ease of use, and simple implementation of associative arrays. For this project I also needed a reliable method for resolving the origin language of over 100,000 city names. I chose to use the Google Translate API. The API is extremely easy to setup and access using CURL in PHP. Implementation All the data comes from 2 CSV files. One has all the cities in the world with a population more than 1000. The other is a table of ISO country codes to country names. Firstly, the entire country code list and city text file were loaded into memory. The number of duplicate names were counted in an array. Then the duplicate city names were looped through. At this point the threshold for duplicates can be changed to show only city names with 2 instances for example. In addition another array holds the number of different countries where a name is found. This array would show the 15 different countries for Victoria, but 22 overall instances, since some instances fall within the same county. At this point all the information regarding which cities we are interested in has been found. The next task is to create a list of lines that need to be drawn. The latitude and longitude was available from the cities CSV file. The script then moves through cities of interest and connects each instance of a city to another instance of that city. If directed the script will ignore lines (or links as I refer to them) within a country. That is, do not draw link between Newport Oregon and Newport Rhode Island; but rather only link these two instances to Newport in the United Kingdom. Due to the simple memory structures being used it became difficult to determine which lines had already been draw. More specifically if a link was made between Berg Germany and Austria the program should not create another link between Austria and Germany (simply a reverse of the original link). A simple solution to this is to sum the latitude and longitude of the two link points and use this and the city name as the key for the $links associative array. Since there cannot be two identical key values for the array, any two cities with the same coordinates will be eliminated. There were several attributes which were considered to be expressed by a link's color, among them distance, number of connections, and population. Eventually it was decided that line color best would represent city name language origin. Clearly there are some inconsistencies which are present in this aspect of the visualization. Inconsistencies exist in the spelling of the name, and character set used. It is important to note that the goal here is to visualize trends not individual points or links. The assumption is that the majority of city names will resolve accurately enough to show this trend. Since the Google Translate API is a pay per character service city names were resolved for all cities in the list and stored in a file for later access. This requires access to the API only once per city. The API resolved most city names. Some returned no result, these were skipped. The API returns a confidence level of the detection, but it seems to be inaccurate for single words. For example a simple test of the word "bonjour" resulted in a detection of "fr", but only had a confidence level of around 50%. Due to this, confidence was not used. The API had some problems with Spanish names as well. I suspect this may have been a character code issue, but names with San, Santa, Los, Las, La etc. returned no result. This was largely an issue because of the rate which these names occur. The top 6 (omitting Krajan) most frequently occurring city names are San Miguel, San Antonio, Santa Cruz, San Francisco, San Isidro and San Vicente. All of these were affected by this issue. These names occur so frequently it would have affect the overall chart trend. An additional filter needed to be added to search for these cases and give these city names the proper language. The API also incorrectly reported a number of cities in China as English. Presumably this is because the city names were written in English characters and this may throw off the Google translate API. These links are visible over China, but are restricted to within China itself so have little effect on the trend of the chart. RGB values were assigned to each link in the PHP script. These values were based on the most frequently occurring languages. Finally a small program was written in processing to read a simple CSV file containing coordinates and colors for each line. The output was rendered at a very high resolution with highly transparent lines to reduce clutter. Trends The most obvious trend visible is the English name link between the UK and America, specifically the area around New England. Perhaps the more interesting trend is the number of links between the Philippines and south America due to Spanish exploration in both regions. Curiously there is not a very visible link between these two regions and the originating country of Spain. Also visible is the links between France and Quebec. red - Spanish green - French blue -English purple - Indonesian orange - German white - other Tyler Conlon 2012 admin@tyconpowered.com