Yandex.Ru

EVA’99-Moscow
E. Kolmanovskaia
Yandex.Ru – search and research engine
Elena Kolmanovskaia
Yandex project manager
Phone: (095) 785-25-25
E-mail: klm@comptek.ru
Internet: www.yandex.ru
www.comptek.ru
Yandex.Ru – Russian Web Search Engine
Yandex.Ru is a unique product for indexing Russian-language resources (sites) on the Web,
similar to AltaVista, Excite, etc. Its search area is the "Russian" Internet: the 'su' and 'ru'
domains, the domains of the former USSR (e.g. 'ua', 'kz') and Web sites in other domains that
contain Russian text of any kind. The Russian Web now consists of about 35 thousand servers
holding more than 60 GB of text. The approximate number of online users is 1.5 million. The
Russian Web has two main languages, Russian and English. It is growing quickly: a year ago there
were fewer than 5 thousand servers.
Yandex.Ru includes a Web spider, an HTML parser and an indexing module. CompTek developed all
the algorithms except the Porter algorithm for English morphology, and wrote all the software.
Yandex was first announced as a full-text morphological retrieval product line on 18 October
1996. The name stands for "yet another indexer" in English transcription, or "language indexer"
in Russian ("Ya" is the last letter of the Russian alphabet and the first letter of the Russian
word for "language" [yazyk, yazykovyi]). We usually write it with the first letter in Cyrillic and
"ndex" in Latin, Яndex, to underline the local meaning (and pride) of the product. Yandex.Ru was
opened for public access on 23 September 1997.
An additional problem for Russian Web search (unknown for English sites) is the coexistence of
different Russian charsets. The most widespread are Windows-1251 and UNIX KOI8-R, then
ISO-8859-5, Alt-866 and Macintosh. Some sites are clever enough to present the same information
in the requested charset, some are not. For example, an AltaVista search for Russian words returns
two different result sets for Windows and KOI. A Russian Web search engine must therefore
understand all charsets, recognize whether they represent the same information (site) or not, and
show results to the user in the corresponding charset. Yandex.Ru can do this and more: it is able
to determine the uniqueness of documents not only across charsets but also across mirrors.
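The charset problem can be illustrated with a minimal Python sketch. It is not Yandex's actual method: it assumes the charset of every copy is already known (for example from an HTTP header), and the function names are hypothetical. Each copy is decoded to Unicode and fingerprinted, so that copies of one page served in different charsets collapse to a single fingerprint.

```python
import hashlib

# Python codec names for the five charsets mentioned above.
CHARSETS = ["cp1251", "koi8_r", "iso8859_5", "cp866", "mac_cyrillic"]


def fingerprint(raw: bytes, charset: str) -> str:
    """Decode a page from its (known) charset and hash the normalized text."""
    text = raw.decode(charset)
    normalized = " ".join(text.split()).lower()  # collapse whitespace, fold case
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_unique(copies: list[tuple[bytes, str]]) -> int:
    """Count distinct documents among (raw bytes, charset) copies."""
    return len({fingerprint(raw, charset) for raw, charset in copies})


# The same page, as five servers might deliver it in five charsets.
page = "Язык и поиск в русском Интернете"
copies = [(page.encode(cs), cs) for cs in CHARSETS]
distinct_byte_strings = len({raw for raw, _ in copies})
unique_documents = count_unique(copies)
```

The byte strings differ from charset to charset, while `unique_documents` counts them as one document. Detecting an unknown charset and recognizing mirrors are separate problems that this sketch does not attempt.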
Yandex.Ru statistics as of today (September 1999):
- 41 635 indexed Web-servers
- 10 949 302 indexed pages
- more than 99.12 GB of indexed information (the index database is less than 25 GB)
- more than 25'000 unique IP every day
- more than 150'000 unique IP every week day
Yandex kernel
All products with the Yandex prefix share the same Yandex kernel. The products differ only in
the application (external interface) built on top of it.
The Yandex-kernel features are:
- Russian morphology module (90,000-word vocabulary, correct treatment of unknown and new words,
based on one of the world's best linguistic schools, Melchuk/Apresian; morphological analysis and
synthesis; learning vocabulary) + English morphology
- indexing module (the index is 35% of the text size, i.e. very small, which is very important for
huge texts; stores each word's full address, including the token number; highlighting of found
words; indexing speed of 2 MB/min on a PC; very fast retrieval)
- parsing tools (SGML-like text mark-up language, external text mark-up language)
- sophisticated query language (Boolean operators, distances between words and paragraphs, text
zones)
- high-quality sorting algorithm for query results (very important for huge texts and heavy queries)
- natural-language queries, search for similar documents
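Two of the kernel features listed above, storing a token number for every word and querying by distance between words, can be sketched in Python as follows. This is a toy illustration under simple assumptions (whitespace tokenization, no morphology); the class and method names are hypothetical and do not describe the real Yandex index format.

```python
from collections import defaultdict


class PositionalIndex:
    """Toy inverted index that stores (document id, token number) per word."""

    def __init__(self) -> None:
        # word -> {document id -> [token numbers]}
        self.postings: dict[str, dict[int, list[int]]] = defaultdict(dict)

    def add(self, doc_id: int, text: str) -> None:
        for position, token in enumerate(text.lower().split()):
            self.postings[token].setdefault(doc_id, []).append(position)

    def search_and(self, *words: str) -> set[int]:
        """Boolean AND: documents that contain every word."""
        doc_sets = [set(self.postings.get(w.lower(), {})) for w in words]
        return set.intersection(*doc_sets) if doc_sets else set()

    def search_near(self, first: str, second: str, max_distance: int) -> set[int]:
        """Documents where the two words occur within max_distance tokens."""
        a = self.postings.get(first.lower(), {})
        b = self.postings.get(second.lower(), {})
        hits = set()
        for doc_id in a.keys() & b.keys():
            if any(abs(i - j) <= max_distance for i in a[doc_id] for j in b[doc_id]):
                hits.add(doc_id)
        return hits


index = PositionalIndex()
index.add(1, "new russian search engine")
index.add(2, "russian museums open a new web site")
both = index.search_and("new", "russian")         # documents 1 and 2
adjacent = index.search_near("new", "russian", 1)  # only document 1
```

Because every posting keeps the token number, the same structure can also locate found words for highlighting.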
All Russian and English words are normalized both at indexing and at search. Not only words are
indexed but also numbers and marks (mixtures of letters and digits). Natural-language queries
simplify the use of the search engine: the simplest way to ask Yandex is to write in the query
field exactly what you need.
Yandex product line
Yandex.Site - a tool for indexing and searching a user's own Web site
No matter how well you have organized your Web server, real life is usually more complicated
than the scheme. As your site grows, visitors need to go deeper and deeper, and they may get tired
of clicking again and again. Yandex.Site can deliver information from any level in a couple of
clicks: the visitor only has to ask and look at the result.
Yandex.Site is easy to configure for your server: the administrator specifies which directories
must be indexed and which must not, and which file types must be excluded. An additional feature
is catalog search: any directories can be logically united into one catalog item, and Yandex.Site
can search independently inside one or several items. The same directory can be included in
different items. It is also possible to create different catalog sets.
A user's site can be reindexed as often as necessary. Usually once a day (at night) is enough,
but a news site can be reindexed every hour or even more often. Indexing does not stop searching;
the two processes are transparent to each other. The only exception is the couple of seconds in
which a new index base becomes available. During these seconds search queries wait in line, which
is usually unnoticeable.
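The reindexing behaviour described here, where a new index is built aside and queries wait only while it is put in place, can be sketched in Python as follows. This is an assumption about one possible design, not a description of Yandex.Site internals; all names are hypothetical.

```python
import threading


class SearchService:
    """Serves queries from the current index; a rebuilt index is swapped in."""

    def __init__(self, index: dict[str, set[str]]) -> None:
        self._index = index
        # A plain lock keeps the sketch short. It also serializes searches,
        # which a real system would avoid (e.g. with a reader-writer lock).
        self._lock = threading.Lock()

    def search(self, word: str) -> set[str]:
        with self._lock:
            return set(self._index.get(word.lower(), ()))

    def publish(self, new_index: dict[str, set[str]]) -> None:
        with self._lock:  # the short window in which queries queue up
            self._index = new_index


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Build a word -> page-URL index without touching the live one."""
    index: dict[str, set[str]] = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


service = SearchService(build_index({"/a": "old news"}))
before = service.search("news")
# The rebuild can take as long as it needs; only publish() holds the lock.
service.publish(build_index({"/a": "old news", "/b": "fresh news"}))
after = service.search("news")
```

The slow step (`build_index`) runs entirely outside the lock, so the time during which queries must wait is limited to a single reference assignment.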
A special version of Yandex.Site was created for ISPs: it supports several virtual servers on
the same computer (or local network). From the ISP's point of view it is a single Yandex.Site
program; from each host owner's point of view there is an independent Yandex.Site for every host.
Yandex.CD - a tool for searching static texts
Yandex.CD is quite similar to Yandex.Site. The main difference is that Yandex.CD does not need a
Web server: the search part can be installed on any computer with 32-bit Windows and an Internet
browser (IE or Netscape 3.0 or higher). The idea is that the texts do not change, so they need to
be indexed only once. The index database is enclosed with the texts. This product is used for CD
editions.
Yandex.Lib - a full-featured Yandex library
Yandex.Lib is both a stand-alone module and a library, ready to be built into different
third-party retrieval systems. It includes three groups of functions: indexing, search and
highlighting. Yandex.Lib can work with several databases simultaneously.
Yandex.Dict - the Russian morphology module only (without indexing)
Yandex.Dict is also a library to be built into third-party products (usually ones with pre-indexed
texts). As an example of Yandex.Dict we show an extension to Digital's AltaVista search engine.
Just imagine: the simple query "new Russians" ("новый русский") in all its Russian forms looks
like this:
(((новый | нов | новейший) ~ русский) | ((нового | новейшего) ~ русского) | ((новому | новейшему) ~
русскому) | ((новым | новейшим) ~ русским) | ((новом | новейшем) ~ русском) | ((новые | новы |
новейшие) ~ русские) | ((новых | новейших) ~ русских) | ((новыми | новейшими) ~ русскими))
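How such an expanded query might be assembled can be sketched in Python. The word forms below are copied by hand from the query above; in the real product they would come from the morphology module, whose interface is not described here. The function name is hypothetical.

```python
def expand_phrase(paired_forms: list[tuple[list[str], str]]) -> str:
    """Join agreeing adjective/noun forms into one Boolean query string."""
    groups = []
    for adjective_forms, noun_form in paired_forms:
        adjectives = " | ".join(adjective_forms)
        groups.append(f"(({adjectives}) ~ {noun_form})")
    return "(" + " | ".join(groups) + ")"


# Forms copied from the query above; a morphology module would generate them.
NEW_RUSSIAN = [
    (["новый", "нов", "новейший"], "русский"),
    (["нового", "новейшего"], "русского"),
    (["новому", "новейшему"], "русскому"),
    (["новым", "новейшим"], "русским"),
    (["новом", "новейшем"], "русском"),
    (["новые", "новы", "новейшие"], "русские"),
    (["новых", "новейших"], "русских"),
    (["новыми", "новейшими"], "русскими"),
]

query = expand_phrase(NEW_RUSSIAN)
```

The adjective forms are paired only with the noun form that agrees with them, which keeps the query much shorter than a full cross product of all forms.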
Yandex.Ru – Russian Web ReSearch Engine
A search engine makes it possible to research the Russian Internet, both its content and its users.
What kind of information can you find on the Russian Web? According to the sites in the Yandex.Ru
base (data as of early 1999):
- Business and marketing (including advertising and public relations): about 35%
- Self-expression (home pages): 13.5%
- Internet life (program downloads, Internet projects, on-line libraries etc.): 11.8%
- Science, medicine (schools, universities etc.): 10.2%
- Culture (theatres, museums): 9.5%
- Mass media (newspapers, magazines, radio, TV): 6.7%
- Adult (sex): 2%
- Services (mail, trading, software delivery): 1.3%
- Officials: 1%
Who lives in the Russian Internet? The pioneers are, as usual, hi-tech companies, followed by
consulting and advertising firms, which quickly realized that an Internet presence is much less
expensive than a mass-media one. Travel agencies and hotels, real estate agencies, and car and
device vendors have already learned to use the Internet as a powerful weapon for finding more
clients. Internet users now make up about three percent of the Russian population, but they are
its most active and educated (at least technically) part, and mainly middle class. Off-line
research by Gallup and Comcon confirms this impression.
We also use Yandex.Ru to study queries. For example, we found that a week before the August
crisis the words "bank" and "currency rate" grew sharply in queries and overtook the usual top 5,
such as "Moscow", "sex", "porno", "Russia" and "referat". We have now begun to study queries
systematically. We invented the NINI index (Internet Users' Interest Inconstancy). The index
consists of its value, the 5 words whose frequency in queries grew most during the last week
compared with the previous one, and the 5 words whose frequency fell most. These ten words
represent the change in users' interests. The study area can be restricted to any word set, for
example politics (we also publish a polit-NINI) or trade marks.
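The word lists of such an index can be sketched in Python. The text does not say how growth is measured or how the index value itself is computed, so this sketch makes its own assumption: a word's growth is the change in its share of all query words between two weeks. The function name and the sample data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def interest_shift(
    previous: list[str], current: list[str], top: int = 5
) -> tuple[list[str], list[str]]:
    """Return the words whose share of queries grew and fell the most."""
    prev_counts, cur_counts = Counter(previous), Counter(current)
    prev_total, cur_total = max(len(previous), 1), max(len(current), 1)
    # Exact fractions avoid floating-point noise when comparing changes.
    change = {
        word: Fraction(cur_counts[word], cur_total) - Fraction(prev_counts[word], prev_total)
        for word in prev_counts.keys() | cur_counts.keys()
    }
    ranked = sorted(change, key=lambda w: (change[w], w))  # biggest fall first
    risers = [w for w in reversed(ranked) if change[w] > 0][:top]
    fallers = [w for w in ranked if change[w] < 0][:top]
    return risers, fallers


# Invented sample data: one query word per entry.
last_week = ["moscow"] * 5 + ["sex"] * 3 + ["russia"] * 2
this_week = ["bank"] * 5 + ["moscow"] * 4 + ["sex"] * 1
risers, fallers = interest_shift(last_week, this_week)
```

Restricting the study area to a word set, as with polit-NINI, would amount to filtering both lists to that set before calling the function.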
Yandex.Ru is an Internet product in general use, so many Internet users visit it. It is therefore
not only a place for advertising but also a place to survey users. We asked people which
information source they trust. The answers were:
- Internet: 35.99%
- TV: 16.99%
- Newspapers and magazines: 10.34%
- Rumours: 1.50%
- Trust no one: 35.18%
We can also find, for any word, its most frequent neighbours in queries. For example, the
word “art” (“искусство”) is usually asked together with the following words:
- battle
- museum, figurative
- applied
- contemporary
- decorative
- history, love
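A neighbour analysis of this kind can be sketched in Python by counting which words share a query with the word of interest. The sketch assumes whitespace tokenization and no morphological normalization; the function name and the sample queries are hypothetical.

```python
from collections import Counter


def frequent_neighbours(queries: list[str], word: str, top: int = 5) -> list[str]:
    """Words that most often appear in the same query as `word`."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        if word in tokens:
            # Count each neighbour once per query.
            counts.update(t for t in set(tokens) if t != word)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ordered][:top]


# Invented sample queries.
sample_queries = [
    "history of art",
    "art museum",
    "contemporary art museum",
    "moscow weather",
]
neighbours = frequent_neighbours(sample_queries, "art", top=3)
```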
Culture in Internet
I was asked to say a few words about «Culture» resources in the Russian Internet. To investigate
this I examined thematic catalogues. At @Rus (formerly «Au», www.atrus.ru/rus/) the section
«Culture and art» contains 2917 resources; the Rambler counter (counter.rambler.ru/top100/) lists
2576; List.Ru (www.list.ru) lists 3744. This is the same share (9-10%) that I gave above from
Yandex.Ru data.
What are the main culture resources? The content most characteristic of the Internet is texts
(Moshkov's text collection has already existed for 5 years), images (photos, pictures) and music
(mp3 format). All these resources are collected mainly by private individuals. Next come
organisations (theatres, museums, libraries, artists' unions), then publications and information
resources (posters, encyclopedias).
Here is the list of the best resources from the three catalogues.
@Rus, composite criterion: elite league + popularity
- Библиотека Максима Мошкова [Maxim Moshkov's Library] http://lib.ru/
- Литература [Literature] http://www.litera.ru/
- Центр современного искусства Сороса [Soros Center for Contemporary Art] www.sccamoscow.ru/
- Государственный академический Большой театр России [State Academic Bolshoi Theatre of Russia] http://www.bolshoi.ru/
- Союз архитекторов России [Union of Architects of Russia] http://www.uar.ru/
- Gumilevica: гипотезы, теории, мировоззрение [Gumilevica: hypotheses, theories, worldview] http://kulichki.rambler.ru/~gumilev
- Госфильмофонд [Gosfilmofond, the State Film Fund] http://www.aha.ru/~filmfond
- Государственный Эрмитаж [State Hermitage] http://www.hermitage.ru/
- Государственная Третьяковская галерея [State Tretyakov Gallery] http://www.tretyakov.ru/
- Государственный музей изобразительных искусств им. А. С. Пушкина [Pushkin State Museum of Fine Arts] http://www.museum.ru/gmii
- Музей Рериха в Нью-Йорке [Roerich Museum in New York] http://www.roerich.org/ru/home_ru.html
- Культура - информационное агентство [Culture - news agency] http://www.guelman.ru/culture
- Кирилл и Мефодий - досуг [Cyril and Methodius - leisure] http://www.km.ru/
Here are the positions of «culture» sites in the Rambler counter. They reflect user demand to some
extent, but also the «folk» character of the catalogue.
Position  Resource (category)
58   Music phone www.cdru.com (Music)
65   Referat.Ru - server for students and schoolchildren (Education)
74   Библиотека Максима Мошкова [Maxim Moshkov's Library] (lib.ru) (Literature)
80   Full Albums in MP3 (Music)
95   MP3 European & American Charts. Full Albums MP3. (Music)
100  Cyber Archive of Mp3z, Gamez, Appz (Music)
102  Music! Guitar! Blues! Request system! CLICK! (Music)
Here are the List.Ru data, sorted by the Yandex citation index.
The Citation Index (CI) is a common measure of the significance of a scientific work or person. Its
value is the number of references to the work (or name) made by other scientists in their works.
As applied to the WWW, Citation Yandex (CY) is a measure of the popularity of a Web page or Web
site among the creators of other Web resources, i.e. among “writers”. This is the main difference
between CY and counters such as Rambler Top100, Top List and Count.ru, which measure popularity
among “readers”.
CY is the number of Internet resources that contain links to the given resource, as measured by
Yandex data.
CY    Resource
1194  Библиотека Мошкова [Moshkov's Library] www.lib.ru
1077  Все музеи России [All Museums of Russia] www.museum.ru
653   Music.Ru www.music.ru
647   Музыкальная Шкатулка [Music Box] www.cdru.com
520   http://www.mtv.com www.mtv.com
456   Гос.Эрмитаж [State Hermitage] www.hermitage.ru
413   Современное искусство в сети [Contemporary Art on the Net] www.guelman.ru
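The CY definition, the number of resources that link to a given resource, can be sketched in Python. The sketch assumes that a "resource" is identified by its host name and that self-links are ignored; both are simplifications, not documented Yandex rules. The function names and the `a.example` / `b.example` hosts are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Host name of a URL, lower-cased, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def citation_counts(links: list[tuple[str, str]]) -> dict[str, int]:
    """Map each cited host to the number of distinct other hosts linking to it."""
    citing: dict[str, set[str]] = defaultdict(set)
    for source_url, target_url in links:
        source, target = host_of(source_url), host_of(target_url)
        if source and target and source != target:  # ignore self-links
            citing[target].add(source)
    return {target: len(sources) for target, sources in citing.items()}


# Invented (source page, linked page) pairs.
links = [
    ("http://a.example/page1", "http://www.lib.ru/"),
    ("http://a.example/page2", "http://lib.ru/index.html"),
    ("http://b.example/", "http://www.lib.ru/"),
    ("http://www.lib.ru/x", "http://lib.ru/y"),
    ("http://b.example/", "http://www.museum.ru/"),
]
counts = citation_counts(links)
```

Counting distinct linking hosts rather than individual links means that a site linking to a resource from many pages contributes only once.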
Elena S. Kolmanovskaia
Elena Kolmanovskaia, Yandex project manager, graduated from the Moscow Institute of Oil and
Gas. She received an MS degree in Applied Mathematics in 1987 and has conducted considerable
research in the area of data analysis and structure simulation at the All-Russian Scientific Oil
Geological Research Institute. For two years she worked in the USA as chief programmer (East Coast
Sheet Metal Corporation). Since 1996 she has led the Yandex team and project (full-text retrieval
systems with Russian morphology). She is also the author of "Tales of Russian Internet",
published on the Russian Internet search engine Yandex.Ru.