>> Andres Monroy: Welcome, everyone. My name is Andres Monroy. I'm a researcher here in
the MSR, and I have the pleasure to introduce Cesar Hidalgo. Cesar is a professor at the MIT
Media Lab. And actually I was a student at the Media Lab, but I left right when he started, so I
feel like when I left the Media Lab, the Media Lab became a lot more interesting, not only
because of Cesar, but there's a lot of [inaudible] doing really interesting work there as well.
So Cesar is the head of the Macro Connections Groups at the Media Lab. He's also the ABC
Career professor of media arts and sciences. And he actually just published -- he has one book
he published last year, right? And then one, too, this year.
>> Cesar Hidalgo: Yeah, we have an MIT Press book 2014, and now like the new book, Why
Information Grows: The Evolution of Order, From Atoms to Economies, coming out on June
2nd.
>> Andres Monroy: Awesome. All right. Welcome Cesar Hidalgo.
>> Cesar Hidalgo: Thank you. Thank you, Andres. So it's a pleasure for me to be in this
environment because, first of all, I've been a user of Microsoft products from a very young age.
But also, you know, it's kind of like a technical environment which I think probably I can get
good feedback, ideas, about some of the projects that we're doing that are not necessarily
publications but actually platforms that have now been adopted and used by large numbers of
people.
So what I'm going to do today is I'm going to show you a number of different sites that fall into a
category that I like to call data visualization engines. What these sites try to solve is a little bit of
like the opposite of the problem that many big companies in the tech sector have been solving.
Many companies like EMC or SAP or Oracle have solved the problem of like, hey, you're a
client, you have shit loads of data, we'll help you catch all of that data, make sure that it's properly
indexed and stored and everything, so then you can possibly retrieve it and, you know,
operate on that data.
But on the other hand you need -- well, the other side of the coin would be to have some
technologies that would allow you to visualize that data and to basically draw insight from it.
And unfortunately all of the large companies that have solved the problem of storing the data and
catching the data have not solved very well the problem of giving the data back to the people,
you know, in a way that makes it easy to understand.
And I've been solving this problem not because I'm very interested in databases or database
visualization, per se, I am interested now, but originally I was very interested in understanding
the world. But I've always been a very macro guy. You know? And if you're a macro guy, you
basically want to look at the world at the large type of structure, you know, at a large scale.
And in that context I had to develop these visualization engines to help inform the research that I
was doing and eventually found that this -- hey, Shahar, how are you doing? -- that these
visualization engines, you know, were something that also helped distribute research in a
different way than papers.
So what I want to do today is I want to show you seven different projects. The first one is
Observatory of Economic Complexity. The second one is going to be DataViva, you know,
which makes available data for the entire formal sector economy of Brazil, more than 50 million
people, a billion visualizations. Pantheon, which looks at data on cultural production, which we
work on with Shahar. The Global Language Network which actually was Shahar's thesis at MIT.
For some reason this one, you know, like it got [inaudible]. Immersion, which is a redesign of
the e-mail interface, Place Pulse, and Street Score.
Okay? And what I want to do is I want to show you that these data visualization engines are not
just a way of transforming bits into colors and shapes but actually are tools that we can use to get
stories out of data and draw insight so we end up learning about the world.
So the first example that I want to show you is the one of the observatory of economic
complexity. This is a tool that makes available data for international trade for basically all
countries in the world and for the last 50 years. It's the number one destination nowadays if you
search for international trade data. So if you search what are the exports of Argentina in Google,
the first link is this site.
And what this site does is allows you to display all of that data but the Trojan horse aspect of the
observatory of economic complexity is that by displaying international trade data and by
focusing on the types of products that countries export, not just on values, you are starting to
introduce the idea that actually what matters for economic development is the mix of products
that you make.
So to illustrate that, let's look at a few examples. The first example here shows the products
exported by Chile. And as you see, Chile, you know, is a country that for its size -- is around 18
million people -- exports quite a bit. That's $78 billion. You know, that's a good chunk of
change.
But you see the products that Chile exports are mostly refined copper, copper ore, raw copper,
grapes, a little bit of wine. You know, wine, you know, even if you export a little doesn't kind of
like make an economy, you know? It's very hard to export, you know, too many billions of wine.
And now let's look at the products that South Korea exports. Now, South Korea obviously is a
much larger country, much larger economy, exports $562 billion, but also, you know, the
difference is not just in the amount of the export but also what they export. They export
integrated circuit, cars, LCDs, refined petroleum, broadcasting equipment and so forth.
Now, the question is are these differences in the mix of products a country makes consequential
or are they just cosmetic? Should we care about those things?
So to introduce this very simply, let's just first look at the bilateral trade between Chile and South
Korea to make a contrast about the monetary values and about the mix of products that they're
exchanging.
So if you look here at the top, you see now that Chile exports $4.6 billion to South Korea.
Okay? So from about $76 billion that it exports, 4.6 goes to South Korea. And these are mostly
atoms -- refined copper, raw copper, copper ore, sulfate chemical woodpulp, pig meat, you
know, grapes. You know, products of relatively low sophistication.
Now, Chile imports $2.5 billion from Korea. So Chile has a positive trade surplus with Korea.
They export 4.6 billion; they import 2.5 billion. But it has kind of like a negative imagination
balance because what Chile's exporting is atoms. What Chile is importing from Korea is the way
in which those atoms are arranged.
And that's kind of like information. Like in cars, delivery trucks, you know, you have actually
that you are involving the knowledge, know-how, and imagination of Korean workers.
So the question is what -- you know, Chile is exporting atoms and is making a shit load of
money; Korea is exporting the way that the atoms are arranged. It appears that that might matter
for something, but can we get some quantitative evidence that these things are consequential.
And actually we can. There's like three things that are very important that we can predict very
accurately by knowing the mix of products that countries make. The first one is that we can
anticipate what products they're going to make in the future. So we can predict the future paths
of [inaudible].
The second one is we can predict which countries are going to grow and which countries are not
going to grow. So we can predict actually economic growth, long-term economic growth.
Finally we can predict the levels of inequality. Because at the end of the day, the mix of
products that countries make demands certain types of institutions. So let's say imagine like you
are a country that has tobacco plantations. Well, that's a type of product that you can be readily
productive at with a really shitty institutional environment. Okay?
But if you're a software industry, and imagine like in the software industry you'd be treating workers as
workers were being treated in the tobacco plantation 300 years ago, probably you guys wouldn't
be very productive. You know? Because you demand a different type of institution given the
type of work that you produce.
And in that context, the mix of products that a country makes also helps moderate even the
distribution of that income. So let's look at the first of these predictions. Yes.
>>: So a question. So it is true that you can make these inferences based on the data you show,
and I think that's very valuable, but I guess my question is what is the bar? Because I just did
this search and I can find articles in The Economist that talk about the connection to these two
countries, or there's a four-page report by a Ph.D. candidate that, you know, talks about these
things at length. So it's possible the problem identify a pattern that sort of summarizes your
inference as well. Question is, well, how to -- what and how to compare the results of this to
something else.
>> Cesar Hidalgo: Yeah. So like just to give an idea, like, first of all, what I'm showing you
here is the database visualization technology. On the other hand, we have a number of papers
that we have published in Science, in PNAS, a book that came out with MIT Press that document all of the
statistics about this.
And the way that you compare this is that in some way when you are looking at the economic
growth of countries, what you need to make sure, you know, is that you are able to control for
other possible confounding factors. And for that there are many statistical methods that you can
use to try to discount for those other factors.
So, for instance, if I'm looking at the economic growth of countries and I want to try to show you
that the mix of products that countries make is what matters for economic growth, I have to be
able to show you that, well, that is robust to controlling for a country's initial level of income, for its
level of education, for its institutional environment and so forth.
And we have shown in our research that basically the mix of products that countries make is a
much stronger predictor than traditional [inaudible] controls. Actually it incorporates information
of all of those controls and provides additional information in the predictions. Yeah.
>>: [inaudible] statistics, but guess I wonder shouldn't you be looking to the qualitative insight
you can glean from looking at the data presented like so as opposed to, you know, using insight
[inaudible] foreign affairs or economics.
>> Cesar Hidalgo: Why is it an or? I don't understand why it's an or. So, for example, one of the
things that this site is used for is that journalists sometimes, you know, would say they want to
discuss the economic ties between India and China. And what they would do, they would like go
and sometimes go to Google and search, hey, what does China export to India and, you know --
and so forth. And then they would bump into a site, and then they would embed this
visualization as part of the narrative.
So it's not really that you either have narratives or you have charts. Like here you
have a site that makes around 30 million charts, and you can use those 30 million charts to
construct narratives as well.
And what I'm doing here is actually using a few of these charts to tell you some little narratives.
One of these narratives is like, look, you know, Chile and Korea, even though the trade balance
goes in one direction, you know, the balance of imagination goes in an opposite direction.
I could certainly write that as like a 300-word op-ed in which I would embed these stories. You
know? Like the other things also you can embed in narratives, and that's why we have written
papers about it. Because in the papers we include the narratives and the charts. But I don't think
it's an or. Yeah.
So like just to recapitulate, what I want to try to show you guys very briefly is that there's three
things that we can predict based on the mix of products that countries make. What are the
products they're going to make in the future, how much they're going to grow, and how unequal
they're going to be.
So let's look at the first of those predictions. So to look at the first of the predictions, I'm going
to a visualization that is a little bit more complex. You know? This is the product space. Okay.
And the product space in this case is also showing the products that Chile exports. Now, in this
case for the year 1979, but now the nodes that are painted show the products that Chile exports a
lot of. The ones that are not painted, the ones they export little of.
So Chile 1979 exports a lot of preserved fish but very little fresh fish, very little frozen fish fillet.
This have comparative advantage in the light gray products. Now, Chile also exports, for
instance, miscellaneous fruit but doesn't export vegetables or fruit vegetable juices.
But we know which products are connected and what the opportunities for the Chilean economy
are because we know what's the probability that two products are co-exported. Okay? So now,
for instance, if I ask, well, is Chile going to export vegetables in the future, and I see one, two,
three, four products that are connected to vegetables that Chile does not export, and one, two, three,
four, you know, five products that are connected to vegetables are products that Chile does
export.
So in some sense vegetables is a product for which Chile has kind of like a high density, there's a
lot of activity happening around it, which would predict that maybe it should be a product that Chile
would export in the future.
Now, if I look, for example, at parts of metalworking machine tools, this is a product that is also
connected to many products none of which Chile's able to export successfully, so I will predict
that that's not going to be a future export. This -- I'm sure that is very simple for you guys, but
this is just like any standard recommender system, more or less is kind of doing this thing. So
this is like in some way making explicit the graph that underlies any recommender system.
Okay?
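A minimal sketch of the density idea described above, under stated assumptions. The talk only says that links come from "the probability that two products are co-exported"; taking the smaller of the two conditional probabilities as the link weight is an assumption here, and the five products and four countries are made-up toy data, not trade data. The function names `proximity` and `density` are hypothetical.

```python
import numpy as np

def proximity(M):
    """M is a countries x products 0/1 matrix (1 = the country exports a lot of it).
    Returns a products x products matrix of co-export probabilities.
    Assumption: link weight = min(P(i|j), P(j|i)) = co-exports / max(ubiquity)."""
    M = np.asarray(M, dtype=float)
    co = M.T @ M                                   # countries exporting both products
    ubiquity = M.sum(axis=0)                       # countries exporting each product
    denom = np.maximum(ubiquity[:, None], ubiquity[None, :])
    phi = np.divide(co, denom, out=np.zeros_like(co), where=denom > 0)
    np.fill_diagonal(phi, 0.0)                     # a product is not its own neighbor
    return phi

def density(exports, phi):
    """Share of a product's link weight that points at products already exported."""
    exports = np.asarray(exports, dtype=float)
    total = phi.sum(axis=1)
    return np.divide(phi @ exports, total, out=np.zeros_like(total), where=total > 0)

# Toy data (illustrative only).
products = ["fruit", "preserved fish", "vegetables", "machine tool parts", "cars"]
M = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
phi = proximity(M)
toy_exports = [1, 1, 0, 0, 0]                      # exports fruit and preserved fish only
d = density(toy_exports, phi)
for name, value in zip(products, d):
    print(f"{name}: {value:.2f}")
```

In this toy graph, vegetables sit next to products the country already exports and get a high density, while machine tool parts get a density of zero, which mirrors the prediction made in the talk.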
So let's look if the prediction pans out. So the prediction was that Chile was going to diversify in
the fishing cluster, in the processed food clusters. And if I go from 1979 to year 1996, you
know, I'm going to find that actually, you know, like Chile has diversified, so now they do export
vegetables, they do export the fruit and vegetable juices. They do now export a little fresh fish
and so forth. Okay? Yes.
>>: How do you control for things like political stability and just, you know --
>> Cesar Hidalgo: So like the institutional barriers that are -- like there are variables out there
that we use, and I can show you in the book, but they are -- they are very self-service so they're
kind of shitty [inaudible].
So you have like the World Governance indicators, you know, that are available since 1996, and
they have, for example, control of corruption, political instability, and variables like
that that are determined by the people at the World Bank that developed the World Governance
Indicators.
What we find, though, is that those variables have very little predictive power and the reason
why they have very little predictive power is that at the end of the day like institutions of a
country are also reflected in the mix of products that a country is able to make.
So in some sense like the component that matters for an economy is probably already reflected in
this vector that has 700 different features, you know, each one of which is whether this country
exports or not that product.
>>: So if you were to look at Syria or Iraq 1979 versus today --
>> Cesar Hidalgo: Oh, yeah, yeah. Like but these countries -- like that's like we can start to
[inaudible] political economy, but let's look at like -- let's say let's look at Iraq, and let's go before
the U.S. kind of like made a mess out of it, so 1985 -- oh, shit. I'm not online. This was also
cached in the browser. Do you guys have an MSFTP open? Let's see.
>>: [inaudible].
>> Cesar Hidalgo: Oh, yeah, give us your little rights away. Yeah. Okay. There. Sure. Okay.
Thank you.
So let's look at, you know, Iraq. And, for example, if you look at Iraq in 1985, it's a very boring
product space. Just have like oil there. Oil is a [inaudible]. It's very large. If you look at it as a
treemap, you kind of like see what share represents. But this is something that is a [inaudible] of
countries that are concentrated in natural resource exports, which is the problem of political
capture, economic capture that is well described by [inaudible] in his book about a million.
And the idea is the following, is that let's say I have a country that has [inaudible] diverse
economy. Controlling that country politically or economically is kind of hard because there's a
lot of different actors that are playing in that economy. But in a country like this, if I become the
president and petroleum happens to be a state industry, I have absolute control. I have political
control on the one hand and I have economic control on the other hand.
So that's why countries that find petroleum, well, the second thing that they find is a bad
president because they are countries that are very easy to capture. While a country that has a
very diverse economy like Japan, Germany, or the U.S., even if you controlled the entire tech
sector, there's a lot of other sectors that still -- that have a weight and make that political process
much harder to fall in the hands of a single individual. But that's a good question.
So okay. So the first idea was that given the mix of products a country makes, you can predict
what they're going to do next. Hopefully that was something that I was able to communicate.
The second idea, you know, is that actually if you know the mix of products that a country
makes, you can predict their future level of economic growth. And this is a little bit more
complex. I'm going to do a very superficial description because I don't want to go into details of
the math. But basically what you do is you take the matrix that connects countries to products, you
project it over the space of countries, and then you take an eigenvector of that matrix.
And that is a measure that tells you how complex the economy of that country is because in some
sense the eigenvector of that matrix is telling you how diverse are the countries that make what
you make and how ubiquitous are the products that are made by those countries.
So if you're a country that makes a lot of products that few countries can make, you're kind of a
sophisticated economy. If you're a country that makes few products that everyone makes, then
your economy is kind of simple. Okay?
It's the same as a multiple choice test. If you're a kid that only answers the questions that
everybody got right and answers very few questions, probably you know very little. If you're a
kid that answered all of the questions, even those that nobody else was able to answer, you're
probably a kid that has a lot of capacities.
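A minimal sketch of the eigenvector step described above, under stated assumptions. The talk says only "an eigenvector" of the country-to-country projection; using the eigenvector of the second-largest eigenvalue, normalizing by diversity and ubiquity, and fixing the sign so the index rises with diversity are assumptions here. The function name `economic_complexity` and the four-country matrix are hypothetical toy inputs.

```python
import numpy as np

def economic_complexity(M):
    """M is a countries x products 0/1 matrix with no empty rows or columns.
    Returns one standardized complexity score per country."""
    M = np.asarray(M, dtype=float)
    diversity = M.sum(axis=1)                      # products each country makes
    ubiquity = M.sum(axis=0)                       # countries making each product
    # Project the country-product matrix onto the space of countries.
    M_tilde = (M / diversity[:, None]) @ (M / ubiquity[None, :]).T
    eigvals, eigvecs = np.linalg.eig(M_tilde)
    order = np.argsort(-eigvals.real)
    # The leading eigenvector is constant (rows of M_tilde sum to 1),
    # so the sketch takes the next one. This choice is an assumption.
    v = eigvecs[:, order[1]].real
    if np.corrcoef(v, diversity)[0, 1] < 0:        # eigenvectors have arbitrary sign
        v = -v
    return (v - v.mean()) / v.std()

# Toy data (illustrative only): country 0 makes everything, country 3 makes
# only the product that everybody makes.
M = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
eci = economic_complexity(M)
print(np.round(eci, 3))
```

In this toy case the country that makes many products, including ones few others make, ranks highest, and the country that makes only the ubiquitous product ranks lowest, which is the multiple choice test intuition from the talk.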
So you run this test over countries, and you get this measure that is the economic complexity
index, which you can interpret as the ability of countries to generate income. And on the Y axis
you have actually the amount of income that they actually generate. So, for example, here you
have a country like China, and China is a country that you see that is kind of like below the
cloud. Below the cloud of points, not the cloud that is someone else's computer.
Here you have that China is a country that has an economy that is too complex, given the level of
income, because basically the complexity of the Chinese economy is comparable to that of Hong
Kong, Netherlands, or Norway; meaning, that is a country that is doomed to grow. And develop
complexity, you know, the income should be higher.
Now, if you look at a country like, for example, Qatar or Kuwait, given what they're able to
make, their incomes are too high, and that's kind of clear. Why? Well, because they're selling a
lot of atoms, you know, basically they're digging money off the ground, but they're not actually
generating much economic value.
You have Australia, Saudi Arabia, and Greece, for instance. If you go back in time, Greece, for
example, was one of the countries that was most out of whack before the crisis in terms of this
measure.
And what we find is that statistically is that here you have Norway and here you're going to have
Greece -- is that over time the countries that are below the line tend to grow faster than ones that
are above the line. And that's a very strong predictor. Actually would predict like 50 percent of
future economic growth like 50 years in advance in the best cases -- sorry, 20 years in advance.
Okay?
So it's a long-term economic growth because the capacity of countries to make products is sort of
like a fundamental that helps us explain the capacity that they have to generate income. Okay?
Now, the last thing that I wanted to show you is that eventually we're also working now on a new
paper that is making a connection between the complexity of a country's economy and the level
of income inequality. But to illustrate that, I'm going to change into a new platform, you know,
which is a much more ambitious platform, which is called DataViva.
And DataViva is a platform that makes one billion visualizations. This data contains -- this
platform contains data for the entire formal sector economy of Brazil. This is 50 million people
for the last 12 years. It includes data on exports on a monthly basis. It includes data on
employment, on salary, on industries and location.
It's the most ambitious data visualization Web site for public data to date. And we launched it last
week in this new version and we're actually working now on all of the marketing to get it out with a
number of different venues.
But what I'm going to show you here is a little bit of what DataViva does, and then I'm going to
tell you how we can start looking at this idea that maybe the mix of products that a country
makes also might affect the way in which they distribute income.
So let's look at a profile that is generated by DataViva. So if you were to search Sao Paulo, what
you would get is like a relatively long profile in which you can learn about trade, wage and
employment, economic opportunities and so forth.
So, for example, here you see that Sao Paulo's trade balance was kind of like roughly neutral.
They were importing as much as they were exporting until 2008. And after the crisis obviously
both exports and imports went down, but then imports rebounded and exports did not. And now
they have been carrying this large gap for a period of like five to six years.
Then you can ask yourself the question, well, what are things that actually Sao Paulo exports?
And you see that in Sao Paulo most of the industries that are located there are still industries that
are managing the exports of soybean, raw sugars, coffee, corn. Not a very sophisticated
economy. What are the destinations that Sao Paulo exports to? China, India. Or what are the
products that they import?
And the nice thing about this platform is that also you can ask like follow-up questions. So, for
instance, when you look at destinations, say, wow, the exports go to China, to the United States.
What do they export to China? Okay?
And I can ask what are the exports of Sao Paulo to China, and I would get visually like soybeans,
sulfate chemical woodpulp, raw sugar. Okay? You know, then --
>>: [inaudible].
>> Cesar Hidalgo: Yeah.
>>: How do they have some room for soybean?
>> Cesar Hidalgo: Because remember that this is data based on the industry. So you -- to be a
soybean executive, you don't have to be sitting in a field. You have to be sitting in a building
downtown Sao Paulo trading soybeans, you know, on the Internet, on the phone, and stuff like
that.
So in that context, Sao Paulo exports soybeans because the people that are moving the soybeans
around, you know, yeah, are sitting there and getting salaries there and sending their kids to
school there and, you know -- okay.
Then, you know, you can look at industries. Then like other things that are kind of interesting,
then you can look at wages data, you can look at like, for example, like the distribution of
income in Sao Paulo.
You can look at like the occupations that exist in the city and how much money each one of
them -- so how much does it cost to hire an information analyst in Sao Paulo, on average 5,000
reals, 5,000 -- 5.3 thousand reals.
What are the common majors that people study? Could look at the universities that people
would have to go to. What are the industries that would employ information analysts in Sao
Paulo? You guys were like looking to open a research lab in Sao Paulo, I heard like recently like
when we were preparing for the conversation.
Now here you see these are the industries that hire them. You know, for example, when it comes
to IT consultancy in that case actually they make a little bit more than average, they make
6.2 thousand reals a month. So you then get an idea of like the average salaries that are
commanded by each occupation in each industry.
But in this context what I wanted to do is to highlight a little story that would connect the
inequality of countries with the industries that are present in them. So here we have now the
exports for the entirety of Brazil. Brazil also gets a profile, in that way we have profiles for the
entire country, for each state and for each of the 5,000 municipalities, which would be the
equivalent of a city here in the U.S.
And if you do entire Brazil, you see Brazil exports $225 billion, but they export a lot of iron ore
and crude petroleum, refined petroleum. These are extractive industries. And they also export
actually quite a bit of different types of machinery and transportation, like aircraft and cars. Like
these little Embraer jets. They're Brazilian. You know, the engines are not Brazilian, but the jet
is, you know, Brazilian made.
So the question is, well, you know, like in terms of exports, you can think that like maybe the
resources actually contribute a lot to the Brazilian economy. But let's see how those things look
like in terms of employment.
So here now you have the occupations in Brazil that are employed in the extractive industries.
And in total, let me -- let me do [inaudible] because I think here there was one portion that we
don't get to yesterday. There's like a little -- yeah.
So let's look now, you know, at the people that are employed, you know, in the extractive
industries, you know, in Brazil and how many of those and what type of jobs they do. And what
we're going to find, you know, I don't know -- I don't think this is my site. This is your Internet,
guys. You know?
We're going to start with the processing industries. So if you look at manufacturing, you're going
to find that manufacturing even though it doesn't export that much, it represents 7.9 million jobs.
The entire [inaudible] Brazil is 50 million jobs.
So actually this is a big fraction. This is like 20 percent of all employment goes into this
processing industry. Now, if you look at extractive industries -- let me load it again. Maybe I'm
not online anymore. If Google doesn't work. Oh, no, yeah. Did Amazon go down? This is in
U.S. Okay. Sorry about that, guys.
So sorry about the back. But like what you would find is that the mineral extraction industry,
they employ around 300,000 people in total. So while on the one hand, you know, they represent
around 20 percent of the total exports of Brazil, on the other hand they employ as few as 300,000
people.
The manufacturing sector, you know, and the processing industries, they represent a relatively
comparable fraction of total exports. They're not bigger or much bigger than the process in --
than the process is done in the extractive industries, but they employ 8 million people. So you
can see that obviously they are a much more inclusive type of economic activity in which you
know you involve a large number of individuals in the process.
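The employment comparison above can be worked through with the figures quoted in the talk: 7.9 million jobs in the processing industries, around 300,000 in mineral extraction, and 50 million jobs in total. This is a sketch of the arithmetic only; the dictionary keys are labels chosen here, and the inputs are the rounded numbers as spoken.

```python
# Figures as quoted in the talk (rounded, spoken numbers).
TOTAL_JOBS = 50_000_000
jobs = {
    "processing industries": 7_900_000,
    "mineral extraction": 300_000,
}

# Share of all formal-sector jobs held by each group.
shares = {name: count / TOTAL_JOBS for name, count in jobs.items()}
ratio = jobs["processing industries"] / jobs["mineral extraction"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} of jobs")
print(f"processing employs about {ratio:.0f} times as many people")
```

With these inputs the processing industries come to roughly 16 percent of jobs, in the ballpark of the "like 20 percent" mentioned earlier, while mineral extraction comes to well under 1 percent.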
So what I want to show next, you know, is another set of projects that look not only now at the
economy of countries but look at their cultural production. As you can imagine, I spend a lot of
my life looking at the mix of products that countries make or at the economic activities that are
present in each part of Brazil and how these evolve over time.
And at some point what happened is that that shit got old and I got bored and I got tired. So what
did we start to do next? We started to look at cultural production because basically as I was
presenting this type of research around the world, one of the questions that I was always getting
was, well, you're looking at products, how about services, you know? How about like the hotel
industry, how about like the restaurant industry?
And I was saying, well, you know, first of all, like in the real world we kind of like started to
include that, because this includes industry data and employment data, and that will include
industries. Second, you know, there's no reason to believe that that's a very interesting question
because all of the mechanisms that we describe in terms of products probably hold in the case of
industries, like you're going to have related varieties and you're going to move from industries
that you are good at to industries that are similar.
So one day I realize that a question that nobody had asked me was like, well, what about Elvis
Presley? And when it dawned on me, I said like, well, that must be a good question because in
some sense, well, the U.S. does export let's say soybeans and aircraft engines, but they also
export culture. Like, for instance, like Miles Davis or Elvis Presley or [inaudible] Armstrong.
Where do we count that? What are the patterns that are defined actually by cultural production
rather than industrial production.
So we created another project called Pantheon in which actually we started to accumulate data on
cultural production for the last, you know, 6,000 years.
So just to give an idea, like if Chile looks like this in the context of the [inaudible] complexity,
a country that exports refined copper and copper ore, wine, and grapes, in the case of Pantheon,
Chile looks like a country that has exported politicians like Pinochet, Allende, O'Higgins, soccer
players like Zamorano and Salas, or writers like Pablo Neruda and Gabriela Mistral.
Now, how do we know which people goes into Pantheon? I want to just give you like the short
answer to that question. For the long answer, you can go to the method section. But what we do
is first, you know, we look at all people that have presence in at least 25 different languages in
the Wikipedia.
Why do we do that? Well, first because if we were to use only the English Wikipedia, we're
going to be very biased towards the English language. So in some sense we look at people that
have presence in a large number of languages, because that gives us a bit of an idea of which
persons have global fame rather than local fame.
So, for instance, American football players don't make it into Pantheon because they tend to be
locally famous. They have long pages in English Wikipedia; they have basically pages in few
other languages. Someone like Isaac Newton, on the other hand, will have a page in almost 200
Wikipedia language editions. Okay?
So that's the first thing. So we look at people that have presence in many different languages in
the Wikipedia. Why 25? Because with 25 languages in the Wikipedia, we got a set of 11,334
people that we had to then curate manually. And that's kind of like was at the limit of what we
could do with that manual curation.
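The selection rule described above can be sketched as a simple threshold filter. This is an illustration under assumptions, not the project's actual pipeline: the `Biography` record, the `globally_famous` function, and the sample entries with their edition counts are all hypothetical placeholders. Only the threshold of 25 language editions comes from the talk.

```python
from dataclasses import dataclass

MIN_LANGUAGE_EDITIONS = 25  # threshold quoted in the talk

@dataclass
class Biography:
    name: str
    language_editions: int  # Wikipedia language editions that have a page

def globally_famous(biographies, min_editions=MIN_LANGUAGE_EDITIONS):
    """Keep people with a page in at least `min_editions` language editions."""
    return [b for b in biographies if b.language_editions >= min_editions]

# Placeholder records (not real counts).
sample = [
    Biography("Person A", 190),   # pages in many editions: global fame
    Biography("Person B", 4),     # long page in one language, few others: local fame
    Biography("Person C", 25),    # exactly at the threshold, so kept
]
selected = globally_famous(sample)
print([b.name for b in selected])
```

Raising or lowering `min_editions` trades coverage against the amount of manual curation needed afterward, which is the reason given in the talk for stopping at 25.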
Why manually? Because in some sense, even the ontology that you're seeing here to describe the
cultural production of Chile did not exist and we had to create it by using uncontrolled
vocabularies and a lot of manual labor.
So what you can do with Pantheon, well, is on the one hand, sure, you can look at differences
between countries, so, for example, if you look at the pattern of cultural production of Chile, you
know, that includes 26 people; or the U.S., which includes 2,000 people, you see great
differences.
For instance, there's a lot of actors, singers, and musicians in the U.S., also a lot of scientists that
were not present in the case of Chile or even, you know, a much larger diversity of sports. You
know, because in Chile it was mostly soccer and tennis.
But what is interesting about Pantheon is not so much that you can look at differences between
countries, it's that you can look at the way in which cultural production evolves over time. So
what I want to do here, so I'm going to start -- yep.
>>: One question. How do you deal with people who are born in one place but become
[inaudible].
>> Cesar Hidalgo: Like where you're born, that's where we put you in because that's the only
thing that we can like really encode with a hundred percent certainty. So the Greek people hate
us because there's a lot of famous Greeks that were born in present-day Turkey.
But all the geocoding APIs are expressed in present-day boundaries. So there's no geocoding API
that I can use to know if like let's say a priest born in the year 1200 in what is now Spain was
a Visigoth or an Ostrogoth. You know? It's going to tell me Spain.
So we have to kind of like deal with those type of constraints.
But what is interesting is actually now when you look at the entire world and you see things
changing over time. And in that context, you know, here is like basically our 6,000 years of
history. And what I want to do is I want to start looking at these 6,000 years of history but
concentrating on different technological eras.
And what we're going to see is that there's a few things that happen when you change
technological eras. The first one is that the composition of cultural production changes
dramatically. Okay. So who we remember changes. And the other thing is that the number
of people that we remember also changes.
So let's start, you know, by looking at the world up to year 1400. And this is all of the people
that we remember up to year 1400. This is basically before the printing press. This is the era of
writing but prior to a printing press.
And you see that most of the cultural production of the world, most of the people that we do
remember involves politicians and religious figures. Okay. What also is kind of curious here is
that the arts, for instance, are quite conspicuous. You have nine painters that fit into this period
from which most of them were born in the late 1380s or 1390s. You know? Because basically
these are people that were still in some way a little bit famous by the time that printing was
invented, like Donatello, Van Eyck [inaudible].
So what we're going to find is that when we change, you know, now the time window to a time
window in which we look at the period after printing but prior to film and radio, this matrix
of cultural production for the world is going to change completely.
And why this is going to change? Well, like some of the theories of why this should change, you
know, would involve the work of Marshall McLuhan, which I'm sure that some people might be
familiar with. Marshall McLuhan said the medium is the message. Now, what he meant by that
is what changes society is not what people say but the technology that they use to say those
things.
So in some sense like what people say on the radio is gone with the wind, but the invention of
the radio was a transformative technology that changed the type of discussions that were
happening, the type of people that were involved in those discussions and so forth.
The other person that argues this forcefully is Elizabeth Eisenstein. She wrote a book called
The Printing Press As an Agent of Change. And in that book Eisenstein [inaudible] the printing
press did not only change the number of books being produced but it changed what was in those
books. It changed, you know, like basically like -- it changes a lot of things.
First of all, it creates the information that is more permanent than the one that was passed on
before. So with that it develops the idea of spelling that didn't exist. Then eventually also, you
know, it starts reviving a lot of the classics. Because I don't know if you guys know, but like in
the year like 1200 not too many people in Europe knew about Aristotle or Socrates. That
information was basically more or less preserved in the Arab empire and then it was reimported
back in Europe and it was disseminated once again with the printing press.
And then eventually, you know, that involved the incorporation of new people into publications
because now printers were for-profit people that wanted to find books, you know, to print that
other people wanted to buy.
And these books could be like the dialogues of Galileo that became very popular when the
church made them illegal or other books of scientists of that time.
So like what Eisenstein argues is that with the invention of the printing press, there is the shift
towards the arts and the sciences. Do we see that here? So this is the matrix of cultural
production for people born between the year 1400 and 1900. And you see that it is very different.
Now religious figures are just a mere 3 percent. Okay? So quite minor.
And you have a lot of physicists, biologists, mathematicians, chemists, astronomers, you know,
physicians, economists that are born in that period. You also have a lot of painters, composers
and artists. So it's a very different set of people and types of people that were remembering from
that era than from the era prior to printing.
Now, the nice thing is that this is not the last time, you know, that technology has changed. The
next change came at the beginning of the 20th century with the introduction of film and radio.
And with the introduction of film and radio, once again we get a new matrix of cultural
production.
So what happens now is that the matrix rearranges itself and now the arts continue to increase
but they're quite different. They're not painters and composers anymore. Now they are
performers. So you have actors, musicians, singers, and film directors.
So people don't talk about that movie that was written by that guy; they talk about that movie in
which Brad Pitt was in. Now, why is that? Well, because the medium before was captured in
the words. It was text. It was books. So you worried about the author. The medium now is
capturing the faces. So you worry about the actor.
And the actor is actually something that tells us that in this case [inaudible] must go from the
medium to the fame of the individual. Because actors existed all along. They were not invented
with film. The Greeks had actors. Shakespeare had actors. But no one remembers the actors
from the time of Shakespeare. Nobody remembers the actors from the time of the Greeks. Or there
are very few that people would remember.
While, you know, when the silver screen comes along, you know, basically the performers --
actors, musicians, singers and also some creative people like film directors -- are the ones that get
enhanced.
Now, the second half of the 20th century has the introduction of a new technology, which is
television. And with the introduction of television, the matrix changes again. What we have is
the rise of the famous sportsman. So TV is perfect to stay at home, drinking a beer and watching
a game. You know?
You don't do that like if the Super Bowl you have to go to like a movie theater to watch it,
probably people would -- you know, it would not be that popular, you know? Like in some way
like TV is an intimate thing that you're like you're watching sport in your underwear. And in that
context, you know, you have like the fame of the famous sports player that gets enhanced with
it.
Now you might ask, you know, what happens with the Internet. You have to notice that I'm
looking at people that are globally famous based on date of birth. So the number of people that
are globally famous based on date of birth that were born after the Internet, you know, still were
born like after 1996, so it would be too early to make any conclusions based on data.
What I can do later if you guys want is to be speculative about how do I think the Internet is
changing this matrix. But I cannot give you any evidence of my speculation.
The other thing, though, that this helps explain is -- which I find fascinating, I was not expecting
in the beginning -- is that in a -- as I show you, the matrix of cultural production changes with
technologies because the composition of who becomes famous changes. But also the fraction
of people that become famous also changes.
So here what I have in the Y axis -- and I apologize for this chart because it's a PDF for a
printout. We don't have it in the visualization engine yet -- what I'm looking at here is the births of
globally famous people divided by the population of the world at that time. Okay?
So it's kind of like the -- have you guys played Civilization? Yeah? So it's kind of like the
birthrate of these like great architects in Civilization, stuff like that. Okay? And this number,
you know, this is the year 500 BC, is basically constant all the way to year 1450. You see some
things going up and down. That's Gaussian noise.
Actually, we measured it, and there's sort of -- like there are a lot of small deviations. Once in a
while you get a big one. They also tend to coincide with like whole numbers. So I think it's
just historians putting a lot of people in the year like, you know, 200 or shit like that.
But it's actually Gaussian noise. Then you get this -- you know, we use that change point
analysis technique, which is a statistical technique, to determine when the mean of a time series
changes. And actually, you know, the change point analysis technique identifies the time of the
printing press as a time in which the rate of producing globally famous people doubles.
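
The idea of detecting when the mean of a time series changes can be sketched as follows. This is a simple least-squares search for a single change point, used here as a stand-in; it is not necessarily the technique used in the Pantheon analysis, and the `rates` values are invented toy numbers, not the real births-per-capita series.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sse(xs):
    """Sum of squared deviations from the segment's own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def single_change_point(series):
    """Index k (1 <= k < len) that best splits `series` into two constant-mean segments."""
    best_k, best_cost = None, float("inf")
    for k in range(1, len(series)):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Toy series: a noisy flat rate that roughly doubles part-way through.
rates = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 2.0, 2.1, 1.9, 2.05]
k = single_change_point(rates)
print(k, mean(rates[:k]), mean(rates[k:]))
```

On a real series one would also want a significance test for the split and a way to handle more than one change; both are left out of this sketch.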
>>: Or the rate of remembering them.
>> Cesar Hidalgo: What?
>>: Or the rate of remembering them.
>> Cesar Hidalgo: Yeah, yeah, yeah. Yeah. This is obviously the people that we remember.
Yeah. Sadly. It's not if they were famous at that time, it's if we know them now. Yeah.
>>: Your first library [inaudible] before that.
>> Cesar Hidalgo: Oh, yeah, yeah. No, these are, sorry, public library. Yeah.
>>: What?
>> Cesar Hidalgo: These are kind of like modern public libraries, like off an institution. Yeah.
Yeah. And then eventually then with the introduction of, you know, like new communication
and broadcasting technologies, this rate starts to explode. Obviously we don't know if it's
because these people are recent or because they're memorable. Okay? Because a lot of the
people that are now in our dataset, you know, are going to be forgotten, but some of them are not
going to be.
The other thing is that the [inaudible] of fame, no matter which window we look at, is a
power law with more or less the same exponent around 4 and 5. So it means that like the number
of -- the ratio of people with a certain level of fame remains constant despite changes in
communication technologies. And that is something that I find to be quite interesting.
Now, hopefully with this Pantheon study I was able to show you that by looking at this
dataset of cultural production we can learn about how broadcasting technologies have changed
who we remember, at which rate we remember people, and what the distribution of fame looks like.
But what we can do also is to look at how other factors, the connectivity of languages, affect the
number of globally famous people that each language will be able to produce. And this is a
paper that was created by Shahar. He's the first author. This is Ronen, et al. Shahar is there.
Say hi, Shahar.
>>: Hi.
>> Cesar Hidalgo: And it appeared in PNAS last December, sort of a collaboration with Steven
Pinker from Harvard, and basically what we did here is -- a lot of people
when they were looking at importance of languages, they were looking at intensive measures.
How many people speak the language. How rich are those people. How big is the area. How
big is their military power.
But in reality thinking of language in the context of the power of the people that speak it is not
the best way to think about language because the whole point of language is that you use it to
communicate. It's not an intrinsic property. It is kind of a medium. I can communicate with
you guys because we share this language of English. Even though you can hear from my accent
it's not my native tongue.
So in some way when you think about the language, the language can do two things. One thing
is it can help people that speak the language communicate, and the other is it can help transmit
information indirectly between groups that do not speak that language.
So, for instance, Shahar can learn a very nice joke, you know, in Israel, in Hebrew, then we meet
together, you know, he tells it to me, and then I go back to Chile and I tell that joke in Spanish,
and in some sense English was not the final destination of that joke, but it was kind of like this
intermediate language through which that information went through.
So we decided to go and try to map these networks. And, of course, you know, like finding data
on which languages [inaudible] spoken was kind of difficult, but we were able to get our hands
on three datasets that you should not interpret as datasets that are reflective of the entire world
but that are datasets that are representative of very specific elites.
Which are important because at the end of the day, you know, most of the information that is
produced and generated in the world is produced by elites. And here by elite I don't mean kind
of like the king and the countess, but kind of like everybody, for instance, that already has read a
few books in their life. In a global context, they're an elite.
And in this context what we have is that this is the network of languages that would come up if
you look at 2.2 million book translations, you know, from a dataset from UNESCO. And you
see that here English is kind of like this big global hub which connects to a large number of
languages. Then you have Russian here. Anybody speaks Russian here? Okay.
So here, for example, you see Russian, and Russian is kind of like a strong local hub that is
connected to a lot of languages that are not very connected to anything else.
You know, for example, like [inaudible] or Georgian [inaudible] in part, you know, is because
there was an explicit policy during the Soviet time to translate books to and from Russian, and there
were many countries that had a political affiliation with the regime of the time, and therefore
they received much of the information from Russian sources.
>>: So the size of the circle represents how many --
>> Cesar Hidalgo: How many people. Exactly. So, for instance, here you have that English and
Chinese have the same number of speakers if you count native and nonnative speakers. If you
count people that speak English as a second language, there's 1.5 billion people in the world that
speak English.
>>: And the thickness of the link --
>> Cesar Hidalgo: The number of books translated from one to the other. So, for example,
between English and French, there have been many books that have been translated between
English and French.
>>: Books translated from one to the other or the number of books that exist in both languages?
>> Cesar Hidalgo: No, translated.
>>: [inaudible] to say Harry Potter book 2 exists in both French and Spanish. Is there an edge
from French to Spanish because of Harry Potter book 2 --
>> Cesar Hidalgo: No.
>>: Okay.
>> Cesar Hidalgo: No, no, no. So let's say Harry Potter book 2, you know, was written in
English, then it gets translated to Spanish. That's a link in that direction. Now, let's say that then
there is a translation that goes from Spanish to French. Then in that case it would count as a
Spanish-to-French link.
So even, for example, let's say Mark Twain's Tom Sawyer was translated from English to
Spanish and there's some translation that goes from Spanish to Catalan. In that case that counts as a
Spanish-Catalan link because it's an expression of the co-usage of those languages.
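
The counting rule in this exchange can be sketched as below, assuming each translation record names the language it was translated from and the language it was translated into. The record layout, the function name, and the sample records are hypothetical; this is not the UNESCO dataset's actual schema.

```python
from collections import Counter

def translation_links(records):
    """Count directed (source_language, target_language) links, one per translation."""
    return Counter((src, dst) for src, dst in records)

# Made-up records for illustration.
records = [
    ("English", "Spanish"),  # a book translated from its English original
    ("Spanish", "French"),   # a French edition translated from the Spanish one
    ("Spanish", "Catalan"),
    ("English", "French"),
    ("English", "French"),
]

links = translation_links(records)
print(links[("Spanish", "French")], links[("French", "Spanish")])
```

A book that merely exists in both French and Spanish adds nothing to the French-Spanish pair here; only an actual translation from one into the other does, and only in that direction.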
>>: You think [inaudible] arrow, it should be --
>> Cesar Hidalgo: Yeah. So in the paper we have the arrows. In the Web site we're calling it
the arrows [inaudible].
>>: Sure.
>> Cesar Hidalgo: Like this one we hacked it in a week. You know? This is not kind of like a
big data [inaudible]. It's kind of like a mini site. Because the difference, for example, this site,
you know, gets hundreds of thousands of people every month. This one gets like 400,000 people
every month. So it's kind of like a resource and the traffic increases over time.
This one got like 300,000 people in like three days. And then by now there's very little traffic
because it's just kind of like one idea that you're putting out. You're not putting out a resource.
You know? So in that -- that's why it's something that you hack relatively quickly because it has a
different type of lifecycle.
>>: [inaudible] speakers but there's not so much literary activity in terms of translation, can you
identify?
>> Cesar Hidalgo: So, for example, like Chinese is definitely like the language that for its size
tends to be very peripheral. But Hindi actually tends to be quite peripheral, too, because
obviously in India there is also a lot of use of English, and through English, India connects itself
to the world.
Spanish for me was actually a little bit surprisingly disconnected in the books. But depending on
the type of media you look at, you see something different. So, for instance, in Twitter Spanish
actually tends to be like I think the second most important [inaudible] language. Arabic is the
language that tends to be kind of peripheral in most of them and also is spoken by many people.
In Wikipedia, German, you know, ends up being the second most important language in the
Wikipedia. And German is a language that always ranks high because we're using Eigenvectors
[inaudible] languages not because it's connected to too many languages, but it tends to be
connected to languages that are always influential.
So German tends to be connected to all of like, you know, languages of Western Europe and
Eastern Europe, and in that context, you know, it has kind of like some good neighbors when
you're doing Eigenvectors and trying to get a measure.
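
The "good neighbors" point can be illustrated with a small power-iteration sketch of eigenvector centrality. The graph below is an abstract toy network with made-up node names, not the language network from the paper, and the helper names are hypothetical.

```python
def build_graph(edges):
    """Build a symmetric weighted adjacency dict from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def eigenvector_centrality(adj, iterations=200):
    """Approximate the leading eigenvector of the adjacency matrix by power iteration."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Adding the node's own score iterates on (A + I), which has the same
        # leading eigenvector as A but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(w * score[m] for m, w in adj[n].items()) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        score = {n: v / norm for n, v in new.items()}
    return score

# "x" and "y" both have two neighbors, but only "x" neighbors the hub.
adj = build_graph([
    ("hub", "a", 1.0), ("hub", "b", 1.0), ("hub", "c", 1.0), ("a", "b", 1.0),
    ("hub", "x", 1.0), ("x", "y", 1.0), ("y", "z", 1.0),
])
score = eigenvector_centrality(adj)
print(sorted(score, key=score.get, reverse=True))
```

Nodes `x` and `y` have the same number of links, yet `x` scores higher because one of its neighbors is the hub, which is the sense in which a node can rank high without being connected to many others.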
But what we find which is hard which is kind of cool is that when you look at this network, you
can ask yourself the question, now, okay, now I have measures of the connectivity of languages.
These are not intensive measures of how many people speak the language or how rich they are.
Are those measures better at explaining the number of famous people produced by that language
than those intensive measures?
And the answer, you know, is that in the case of Wikipedia and the book translation network,
yes, they actually are much better at explaining the number of famous people produced by a
language. In fact, if you look at the PNAS paper, in the best case it will explain almost 90
percent of the variance in number of famous people produced by a language only by looking at
the Eigenvector centrality of that language.
So obviously we don't have a causal story in this case because we cannot differentiate between
the hypothesis that people are learning a language because the language is producing good
content or that the languages that are connected are better at diffusing the content.
I think that probably the strongest part of the effect is in the latter, that once you become famous in
English, it's very easy to become globally famous, whereas when you become famous let's say in
[inaudible] you are probably famous in [inaudible]. Okay? And it's very hard for your fame to
become global simply because that information is trapped. You know? It would kind of be like if you do
like a great piece of code for [inaudible], it's not going to go very far.
So with that in mind, I want to show you like a few other projects. You know? Yeah.
>>: Real quick. You've showed us three different images based on three different datasets.
Naively they would seem to have very different interpretations. They seem to have very different
interconnectivities, different sets of nodes. Which one do I believe?
>> Cesar Hidalgo: Like that's the wrong question, and I'll tell you why. Because you're
interpreting all of these images to be representative of something that is beyond the dataset. But
the way that you should always interpret these datasets is that, well, this is Wikipedia. What is
the Wikipedia dataset representative of? Wikipedia. What is the Twitter dataset representative
of? Twitter. Not of the world. So for Wikipedia you should believe the Wikipedia. For
Twitter --
>>: So you convinced [inaudible] of the National Academy of Sciences that Wikipedia is
intrinsically interesting enough that it was worth writing an article only about translations
between Wikipedias?
>> Cesar Hidalgo: In Wikipedia and Twitter --
>>: I mean, I assume that they -- that somewhere in your article you suggested there was some
conclusion to be drawn beyond these three particular measures of social media.
>> Cesar Hidalgo: No, no, no. Like we actually like were very [inaudible] that they should be
[inaudible] because like you say that Twitter is irrelevant, they have -- governments have come
down because of things that happen on Twitter. Like books. Books is kind of like most of our
history has been based on the fact that actually we write things down. You know?
So in some sense, sure, this is not representative of all our communication, but is representative
of three elites that are very globally influential. This is not people that are not globally
influential. Like book translations, Wikipedia, and Twitter are outlets that are worth studying.
>>: I found that people are trying to push us into saying, like come up with one definitive
conclusion, like put a single number on everything. And this is not really possible. What we
found, though, is that different media have different way. And like an interesting conclusion that
we make in the paper, at least allude to, is the fact that we see a different variance on Twitter
than what we see in book translations.
It means that potentially the languages -- the languages that we see on Twitter are more
associated with developing countries, so there's more -- it's a more democratic medium than book
translation, so possibly this map 20 or 30 or 50 years from now will be very different because the
medium involved -- the media involved will be very different and also the cultures involved will
be very different as well.
So the world is also changing. It's always changing. Unfortunately we cannot make -- we don't
have the long data to make like the long-term conclusions about how the global language
network is changing over time. So we're not trying to proclaim something we cannot do. But still
these are different -- each one of those is true to its own --
>>: [inaudible] say you gave me a whole bunch of post hoc just-so stories which sounded really
nice. Behold here's the [inaudible] cluster in books world it's because of we can [inaudible] this
happened. What did you find in here that actually surprised you? What did you learn from this
that you could not have learned had you not drawn this diagram?
>> Cesar Hidalgo: Like for me I would have never expected that like the connectivity of a
language in the network of translation explains 90 percent of the variance of the famous people
that are being produced by that language. You know? You don't see like those 90 percent
R-squareds in social science research very commonly.
I would have been happy with like a 30 percent and three stars next to it to be honest. So in that
context like the strength of the effect for me was very important. Also, you know, like what you
highlight here is exactly something that points to the misconception that you had, which is
people tend to think of data as something that is a reflection of a world that is exogenous to it
while in reality what you have here is a representation of languages and the languages cannot be
separated from the medium in which they were expressed.
So I'm pretty sure if I were to look at a different set of media, I would get a different network.
And in that context I would be learning about the expressions that exist in that medium.
So those are a couple of things. Now what you have is that you have a very interesting
hypothesis because now you have the hypothesis, well, in the context of global fame and
attention. Is this network actually something that matters to a point in which we maybe should
try to do something about it? Does this network help explain other things that we never tried to
explain in this network?
For instance, you know, one of the things that we are looking at is like well, you know, if you do
a gravity model of trade, you know, does this network help explain trade between two countries
after controlling for distance and the size of the economies. And if the answer is yes, what
would that tell you? Well, it would give validity to the [inaudible] theory of the economy in
which social interactions are the ones that precondition or preestablish the network that is going
to make a possible economic activity because these people have to speak to each other.
>>: So for Twitter, does it [inaudible] Twitter, what does it mean?
>> Cesar Hidalgo: Ah. So we look at all the usage that we were able to get our hands on, which
is like a billion tweets. And then detect the language. So then if you tweeted let's say in English
and then you tweeted in Chinese --
>>: The same person doing both.
>> Cesar Hidalgo: Exactly. Yeah. So the same person.
>>: I see.
>> Cesar Hidalgo: So I know, for example, like I contribute to this link because I tweet in
English and I tweet in Spanish.
>>: I see.
>> Cesar Hidalgo: You know? But I don't tweet in other languages. You know? Exactly. So
this is the same person tweeting in both. In Wikipedia, it's that the same editor has to edit articles in both
languages.
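
The co-usage rule for Twitter and Wikipedia can be sketched as below, assuming language detection has already produced a list of languages per user (or per editor). The user IDs, language codes, and function name are hypothetical, and any minimum-activity thresholds a real analysis would apply are left out.

```python
from collections import Counter
from itertools import combinations

def cousage_links(user_languages):
    """For each unordered language pair, count the users who posted in both."""
    links = Counter()
    for langs in user_languages.values():
        # Deduplicate so a user counts once per pair, however many posts they made.
        for pair in combinations(sorted(set(langs)), 2):
            links[pair] += 1
    return links

# Made-up users for illustration.
user_languages = {
    "user_a": ["en", "es", "en"],  # posts in English and Spanish
    "user_b": ["en", "zh"],
    "user_c": ["es"],              # monolingual: contributes no link
    "user_d": ["es", "en"],
}

links = cousage_links(user_languages)
print(links[("en", "es")], links[("en", "zh")])
```

Unlike the book translation network, these links are undirected: a user who writes in both languages connects the pair without implying a direction.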
>>: So I think as a scientist this is very satisfying, making use of our [inaudible]. There's a lot
of elemental surprise and there's a lot of developmental sort of explanation of the world,
exploration as well, but as an engineer my question is [inaudible] how is this actionable.
>> Cesar Hidalgo: Oh, yeah, yeah, like that will have to go back to like the other projects that
are not about languages and their influence. But, for example, like this project like DataViva,
which makes available a billion visualizations and it's actually in the context of a world in which
most governments around the world have passed laws that mandate opening up their data and
they have IT teams that [inaudible].
So in that context what I'm trying to do is to develop a set of technologies in which we can
deliver the data for them, you know, in the context of these large data visualization engines that
make that information, you know, globally available.
So in that context, for example, this data visualization is actionable. Another thing that we're
doing is with EMC they're looking at, hey, shit, we sell to a lot of people our fucking back end,
you know, now we're saying they can hook up any front end that they want. But it's kind of a
pain. Can we like ship it together with our own front end.
So, for example, we send them all the data of DataViva that is [inaudible] back end. Then we're
going to go there, visit them for a week, create like a front end, you know, for them to show them
that it is very quick and easy to do, and we can do that because all of the libraries that we use to
create this we have created ourselves too.
So this is -- we don't just create the [inaudible] but we create the libraries. And in that context,
you know, that is kind of like something actionable for like EMC or for Oracle that would want
to distribute that front end to have something better than phpMyAdmin if you're looking at your
data.
>>: [inaudible] I appreciate that. But then in going back to the previous one, I was hoping for
an answer like, well, I don't know, we should encourage more translation [inaudible] or we
should make, I don't know, the Skype Translator technology [inaudible] or we should make
access to Twitter more easy for some population. Something like that. There is some
discrepancy that you observe and you can perhaps --
>> Cesar Hidalgo: So in that context, for example, once I remember -- like I think Shahar was
also in that conversation, there was these people at the middle [inaudible] hundred companies.
So we get people from all different type of sectors. And one of these groups of people was
people that were -- they build mostly like the router and Internet infrastructure for like Southeast
Asia.
And what they're always asking themselves the question is like, hey, where is traffic going to be
in five years from now? Because traffic changes. And building infrastructure is not that easy.
Takes a lot of time and effort.
So what they were thinking is like, well, can we use this network to try to say, hey, these are two
languages that are actually like co-spoken, so as the level of incomes is going to go up, the
number of people that are going to be integrated into the communication technologies and want
to communicate across, you know, is going to increase and therefore we should expect a larger
increase, you know, let's say in bandwidth between Thailand and Indonesia -- let's say between
Thailand and Malaysia, just to give you like a made-up example.
So in that context you could try to look at those things as well. For me kind of like what is more
interesting is actually that scientific experiment aspect in which you are learning about what are
the things that determine how information grows and diffuses in our society and in our economy.
And in that context this was like what I would say until now heretofore not a very popular factor
that people were using to describe that, and it happens to be one that is actually quite strong in
the diffusion at least of cultural information, and therefore should need to be considered as a
main explanation, not as a cite collection.
>>: Predictive context, which I appreciate a lot, actually, I really like the examples you gave
about builds and capacities, network bandwidths and cables and things like that. Are they happy
with the data that's inferred from Twitter or Wikipedia, or are they asking the same question as
has been asked, which is, well, I don't care so much about Twitter, I care about the world.
>> Cesar Hidalgo: Yeah, so in that context, you know, like I would say when you're working in
a practical context, people are very understanding of like the constraints of implementation. So
obviously if it would be possible to get better data and if that better data made a real difference,
like companies there would like, I don't know, start surveying people, you know, and see which
languages they speak or try to find a different way.
In that context, I would say as a first approach at least to see if let's say this has legs, if we would
like look at it ten years ago would have made the right predictions, it's something that people
would be very much willing to give it a shot. Yes.
>>: So this is all fascinating. You've shown almost like an evolution where the medium has
slowly shifted our perception more and more and more towards a performer. And I'm wondering
if you almost foresee the medium -- the performer now playing the role and selling the next
medium basically in a way.
Because the onus being more put on -- I mean, looking at language being a much stronger
predictive factor and looking at the evolution towards more of a performance-based production
basically, I'm wondering if almost our next evolution is more of a cultural one that is more
enhanced by the medium -- sorry, the performer than the medium.
>> Cesar Hidalgo: So I can see like how the medium and the performer interact in a different
way now because I think we're going through like a new age of invention in which there's a lot of
people that have become famous because of helping create a new medium.
You can think of like Bill Gates or Steve Jobs or Mark Zuckerberg, all these people that have
global fame, people that have contributed to the creation of like a new medium, whether it is a
personal computer or social media or Twitter or whatever those things.
>>: There's almost like a bias associated with that where the performer kind of reinforces the use
of the medium or the growing of the medium and how much you're willing to forgive the
medium if you want in a way and stick with it and allow it to evolve. I don't know if --
>> Cesar Hidalgo: You [inaudible].
>>: Different translations.
>>: You had famous people by date of birth.
>> Cesar Hidalgo: Yes.
>>: Does it change much if you do it by date of death, especially in the more recent centuries?
Seems like a more sensible thing to --
>> Cesar Hidalgo: So if you do date of death, what you find is that there's actually a much
stronger bias in favor of like big developed countries as cultural centers. Because what happened
is that people with talent, you know, are basically born everywhere but they don't die
everywhere. You know? Like everybody that -- I think it would be very hard to find, for
instance, someone in the Wikipedia that has a presence in a number of languages and that was --
had the same place of birth and place of death. You know?
>>: [inaudible].
>> Cesar Hidalgo: Yeah. And the data that like we could look into -- although there is a lot of
people that are alive in our dataset still, from Justin Bieber to George Bush.
>>: You wait long enough.
>> Cesar Hidalgo: Yeah. So what I wanted to show you is like a few other examples. I don't
know if you guys want to look into that, some that look at cities. I don't know if any of you are interested
in maps. Or some that look at e-mail. You know?
So the first one is this example that involves e-mail, which I call e-mail is the revenge of the
Internet. And this is my e-mail. So this is all of the e-mails that I received let's say between
noon and like 2 p.m. when I guess I lost my Internet connection. And they pile up.
And e-mail has a horrible design interface because basically what e-mail is designed to do is
make you act urgently. So what is e-mail designed around? It's designed around
messages, not people. Each one of these is a message, not a person, or a message thread. And
it's designed around time.
What's the most important thing in e-mail? What's on top. So if e-mail was the newspaper, the
headline would be the latest e-mail that came to your inbox. Now, in reality, obviously there are
some contexts in which we want to push that sense of urgency, but we need technology to help us
push a sense of reflection.
So what we did is we did a design experiment in which we decided, let's turn e-mail a hundred
percent around. Let's flip it on its toes. And how would e-mail look if we flip it on its toes? If we
flip it on its toes, it would do two things. The first thing is we wouldn't center the e-mail on
messages, we would center it on people. Why? Because people think about people. I think of
Andres, I don't think of like -- I don't remember the title of the e-mail that we're exchanging, but
I know it was you. So that's what I'm going to search and I know who you connect me to and so
forth.
Now, we center it on people. And also let's not just put like a narrow window of time, let's look
at the network that you're weaving over the long race, not over the short race.
And that's Immersion. And anybody can try it. You guys can log in with your Gmail if you
have Gmail. And basically what it does is this now is showing me the network that I have built
over the last 10.6 years. Okay? I can like include more people if I want, you know --
>>: That's your personal e-mail?
>> Cesar Hidalgo: Yeah, this is my personal e-mail. And people are connected if they have been
cc'd in the same message.
>>: So the connection means that you send e-mail to somebody else.
>> Cesar Hidalgo: Yeah. So, for example, this is my mom and this is my sister. And my mom,
my sister, and this is -- you know, since this is ten years, my girlfriend in the year 2004 are
connected there. Because there were e-mails that involved the three of them.
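[Editor's sketch] The linking rule described here (two people are connected when they appear on the same message) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not Immersion's actual code: the function name `build_network`, the example addresses, and the choice to count messages per contact as the node size are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_network(messages, owner):
    """Build an ego-centered e-mail network.

    messages: iterable of sets, each holding every address on one message
    owner:    the inbox owner's address, left out of the graph (assumption)
    Returns (node_sizes, edge_weights).
    """
    node_sizes = Counter()    # number of messages each contact appears on
    edge_weights = Counter()  # number of messages each pair of contacts shares
    for people in messages:
        contacts = sorted(set(people) - {owner})
        node_sizes.update(contacts)
        # Every pair of contacts on the same message gets (or strengthens) a link.
        edge_weights.update(combinations(contacts, 2))
    return node_sizes, edge_weights


# Made-up messages, for illustration only.
messages = [
    {"me@example.com", "mom@example.com", "sister@example.com"},
    {"me@example.com", "mom@example.com", "sister@example.com", "friend@example.com"},
    {"me@example.com", "advisor@example.com"},
]
sizes, edges = build_network(messages, "me@example.com")
```

In this toy inbox the mother and sister end up linked because two messages involved both of them, while the advisor stays an isolated node.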
And these were my friends in Chile, you know, and then here, then I start moving here and I go
through like this is my advisor, the Ph.D. [inaudible] and then I go from there, and then this is
when I went to Harvard, and then this is when I go to MIT, and this is the people at MIT and the
red is kind of like my group. You know?
Now, obviously, you know, this network changes a lot over time. So what I usually do, this is
like what is active on my inbox like in the past month. So these are the balls that I'm having like
juggling or that I'm dropping. So like I have Maggie, my admin, you know, she's the one that
keeps my universe from colliding. Here I have, for instance, this is people in my company and
the clients that we have. Then here, you know, Kevin is one of my students. And here is like
also having these collaborators from Colgate, Kevin is creating like a tool that allows you to
create a DataViva in like 50 seconds. Okay? So and we're doing it with --
>>: That's the size indicates how many e-mails that you send to --
>> Cesar Hidalgo: Exactly. And here you have -- you know, like you have other groups. Nicky
works on the project that I want to show you next and so forth. So it allows you to reflect and to
see how these social interactions have been evolving and so forth.
The original idea was to transform this into a full-fledged inbox, like a visual inbox. So then when
I would like send e-mails, receive e-mails, I would know that the e-mail is coming from here or
from there. Now an e-mail from here is more important than an e-mail from there. So it's that
now you're prioritizing based on the position that they're occupying on your social network, not
on the time that they decided to press send.
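[Editor's sketch] The visual inbox was never built, so the following is only one possible reading of "prioritizing based on position in your social network." It assumes that position can be approximated by how many shared messages link a sender to the rest of the network; the function name `prioritize`, the tuple layout, and the sample data are hypothetical.

```python
def prioritize(inbox, edge_weights):
    """Order messages by how embedded the sender is, not by arrival time.

    inbox:        list of (timestamp, sender, subject) tuples
    edge_weights: dict mapping (person_a, person_b) -> shared message count
    """
    # A sender's strength is the total weight of the links they take part in.
    strength = {}
    for (a, b), weight in edge_weights.items():
        strength[a] = strength.get(a, 0) + weight
        strength[b] = strength.get(b, 0) + weight
    # Strongest senders first; newer messages break ties.
    return sorted(inbox, key=lambda m: (-strength.get(m[1], 0), -m[0]))


# Made-up data, for illustration only.
edge_weights = {("mom", "sister"): 2, ("kevin", "maggie"): 5}
inbox = [(3, "stranger", "hi"), (1, "maggie", "schedule"), (2, "mom", "dinner")]
ranked = prioritize(inbox, edge_weights)
```

A ranking like this pushes unconnected senders to the bottom, which is exactly the weakness raised in the question that follows.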
Which I thought it would be a better way to try to weave your own network. But the students
that did this, one now is working at Google and the other one is doing consulting by himself. So
basically this product is just parked on a server at the Media Lab. You know? But it's a nice
project.
>>: Some of the people who are in the periphery could be very important and you might want to
respond quickly, right?
>> Cesar Hidalgo: Yeah. They're like the new people. Or the mistresses. Yeah. No, it's true.
Like sometimes you show this to people and they're like who's that one, who's that one. And
then they blush. Because it's a person that you don't connect with anyone else, the mistress.
That's why they show up in the border.
But what I find is that if you have people in the border, if you don't connect them to your
network, it's very hard to keep that link. So the ability for you to preserve things with other
people also depends on your ability to embed the people in your social network. Because your
friends provide a service of keeping their friends connected to you also.
>>: [inaudible] tiny little dot [inaudible] but an e-mail from that person --
>> Cesar Hidalgo: Could be very important. Exactly.
>>: [inaudible].
>> Cesar Hidalgo: No, no, like there you could have like some sort of like importance algorithm
that is exogenous to the social network but maybe has information of the social network of
everyone that is using e-mail.
So in some sense I'm pretty sure that if you were to mine, you know, like e-mail data, you should
discover that [inaudible] is kind of like an important guy in the network. You know? And you
could use that. So it would go beyond your own inbox, the data you wanted to use.
>>: But also there's some -- has something to do with whether the e-mail is -- you know,
[inaudible].
>> Cesar Hidalgo: Oh, yeah, yeah.
>>: For about two years out.
>> Cesar Hidalgo: But that's something like, for example, already is done very well with the
priority inbox of Google and everything.
>>: I see.
>> Cesar Hidalgo: Yeah. So like mailing lists, they tend to be like lower priority. And those are
easy to [inaudible] because mailing lists are places that everybody like basically sends e-mail to
but they never go in the other direction.
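[Editor's sketch] The one-directional pattern described here can be turned into a simple heuristic: flag addresses that receive mail from several distinct senders but never send any. This is a sketch, not how Gmail's priority inbox works; the function name `likely_mailing_lists`, the `min_senders` threshold, and the addresses are assumptions made for illustration.

```python
from collections import defaultdict


def likely_mailing_lists(messages, min_senders=3):
    """Flag addresses that many people write to but that never write back.

    messages:    iterable of (sender, recipients) pairs
    min_senders: hypothetical cutoff for "everybody sends e-mail to it"
    """
    senders = set()
    received_from = defaultdict(set)
    for sender, recipients in messages:
        senders.add(sender)
        for recipient in recipients:
            received_from[recipient].add(sender)
    return sorted(
        address
        for address, sources in received_from.items()
        if address not in senders and len(sources) >= min_senders
    )


# Made-up traffic, for illustration only.
messages = [
    ("a@example.com", ["list@example.org"]),
    ("b@example.com", ["list@example.org", "a@example.com"]),
    ("c@example.com", ["list@example.org", "a@example.com"]),
    ("d@example.com", ["a@example.com"]),
]
flagged = likely_mailing_lists(messages)
```

Here `a@example.com` also receives mail from three people but is not flagged, because mail goes in the other direction too.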
And in that context I think we're out of time. But what I wanted to show you is just kind of like
this side project in which we have been looking at urban perception, and we have created like
computer vision and machine learning algorithms to create very high resolution maps, you know,
of urban perception.
So we have collected around 1.2 million preferences in this Place Pulse Web site that allows us
to determine which place looks safer, livelier, more depressing, et cetera.
And then we take those preferences and we use them to train machine learning algorithms that then we
can use to generate, you know, maps that are very, very, very high resolution. So this map of
New York has 300,000 points, but it's based on 2,000 images from New York because you
wouldn't be able to crowd source 300,000 points for multiple comparisons. I never was able to
get that traffic.
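[Editor's sketch] The talk does not say how the pairwise preferences become per-image scores, so the following uses a plain win ratio as a stand-in. The function name `win_ratios` and the image ids are hypothetical. Scores like these, computed for the crowdsourced images, would be the training targets for a model that then scores images nobody voted on.

```python
from collections import Counter


def win_ratios(comparisons):
    """Turn pairwise choices into one score per image.

    comparisons: iterable of (winner_id, loser_id) pairs, one per vote,
                 e.g. "which of these two places looks safer?"
    Returns a dict mapping image id -> fraction of its comparisons it won.
    """
    wins = Counter()
    shown = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        shown[winner] += 1
        shown[loser] += 1
    return {image: wins[image] / shown[image] for image in shown}


# Made-up votes, for illustration only.
votes = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
scores = win_ratios(votes)
```

A win ratio ignores how strong each opponent was; a ranking model that accounts for that would be a natural refinement, but which one the project used is not stated here.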
And what we're doing now is that now that we have the technology to create these maps that tell
us which places look good or bad, we are looking at which places are changing positively and
changing negatively. And that is something that is very interesting because actually like we're
starting to look at the dynamics of the city.
So these are places, for example, these are the before-after pictures, 2007, 2014. And these are
places that have been highly improved in Williamsburg. And we're starting to see, well, now
that we can detect the places that change, you know, how much of our, you know, like exclusion
of minorities is happening in those places because we have a measure of gentrification that we
can use to actually look which are the places that had new construction.
We can see what predicts whether this place or that place is going to be the next one to get
improved, is it the proximity to other things, is it the demographic component that they have.
Can we actually now that we have a measure of urban change start predicting urban change and
informing what are the things that are causing the urban environment to change, is it private
investment or is it public investment? Is it that people basically got allowed building permits and
they just put their own money and they build shit themselves, or is it that the government decided
to like clean up all the streets, made them nice, put in parks, and then eventually the buildings came
later.
These are questions that are hard to answer right now because you don't have good measures of
urban change, but we hope that with this technology you're going to start creating maps of urban
change that can be refreshed with a relatively high frequency because this is just computer vision
and machine learning that you can keep on cranking and turning and improving.
And in that context, we can start asking questions of what causes urban change and what are
the effects of a changing urban environment.
And with that in mind, I would like to finish up. I'll just put a plug for my book, it's called Why
Information Grows. It has zero to do with anything that I talked about today. It's actually what I like
doing. It's about evolution of physical order from atoms to economies. So I start the book by
describing, you know, [inaudible] statistical physics, then I go into [inaudible] statistical physics.
And from there, you know, I contrast that with the information theory of Shannon. I explain why
there are some important differences in there and why eventually information is related primarily
to physical order. And then I describe the mechanisms that explain the origin of physical order
in the universe, and I show how those mechanisms are reembodied in society and economy to
ultimately conclude that the growth of economies is nothing other than an epiphenomenon of the
growth of information in the world. Thank you.
[applause].
>> Andres Monroy: We have some time for some questions.
>> Cesar Hidalgo: We had questions during the talk, too, but if anybody has anything else.
>>: So you don't talk about exactly how [inaudible].
>> Cesar Hidalgo: What did I use to explain?
>>: Just a standard technique for [inaudible] like they don't look cluttered, everybody can see --
>> Cesar Hidalgo: Yeah, yeah. So like there's two things. So, for example, this site is built
custom on D3, which is a JavaScript visualization library. But what we've done, you know, is
that for all of these sites, for example, like DataViva, when you're making a billion charts, you don't
want to use D3 to create each one of those charts. You know, you're going to go crazy.
So we created a library ourselves that is called D3plus. You know? And the D3plus library,
what it does is that like it provides you kind of like cookie-cutter, well-designed D3
visualizations that you can incorporate with like one line of code.
So let's say you want to create like kind of like this nice pie chart that has mouse-over and makes the
sizes of the font proportionate to the size of the [inaudible]. You just have to write that code and that's
it. You know?
So that allows us to then scale and create these more ambitious online projects that create lots of
visualizations because we have kind of like that level building block that we can put; we have to
figure out what the query is, what I'm going to show, connect those two, and then we can create
these like large, you know, informative profiles, like the ones that we have here for industries or
occupations or for -- yeah. Yeah. So that's what we use. Okay. Yeah. Okay?
>> Andres Monroy: Thank you very much.
>> Cesar Hidalgo: Thank you.