Languages and Information Technologies
Prof. Milena Stanković
University of Niš, Faculty of Electronic Engineering
ESP 2015, 22.05.2015.

Outline
Internet as a global data resource
• Searching
• Text mining
Web 2.0 (Blogs, Wikis, RSS)
New information technologies in learning and teaching languages
• Online courses
• Social networking
• Learning through gaming
• Mobile applications
• Augmented reality
• Semantics in learning applications
Conclusions

Internet as a global network and data repository
• More than 190 countries are linked into exchanges of data, news and opinions.
• The estimated number of Internet users worldwide is 3,000,608,300, which is nearly 40 percent of the world's population.
• The total number of websites with a unique hostname online has exceeded 1 billion.
(Internet Live Stats, December 30, 2014), http://www.internetlivestats.com

Web search-indexing
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

Searching
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

Unstructured Data on the Internet
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Techniques such as data mining, text mining, natural language processing, web mining, text analytics and multimedia data mining provide different methods to find patterns in, or otherwise interpret, this information.

Text Mining?
• Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.
• Text mining is different from web search.
In search, the user is typically looking for something that is already known and has been written by someone else.
• In text mining, the goal is to discover unknown information, something that is not directly visible.

Text mining workflow

Word level representation of the documents
• The most common representation of text, used by many techniques.
• Word frequencies in texts follow a power-law distribution:
  • a small number of very frequent words
  • a large number of low-frequency words.
• Relations among word surface forms and their senses:
  • Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  • Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  • Synonymy: different forms, same meaning (e.g. singer, vocalist)
  • Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal).

Stop-words
Stop-words are words that, from a non-linguistic point of view, do not carry information:
• they have a mainly functional role
• we usually remove them to help the methods perform better.
Stop-words are language dependent.
• English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ...
• Dutch: de, en, van, ik, te, dat, die, in, een, hij, het, niet, zijn, is, was, op, aan, met, als, voor, had, er, maar, om, hem, dan, zou, of, wat, mijn, men, dit, zo, ...

Stemming
• Stemming is the process of transforming a word into its stem (normalized form).
• Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. consign, consigned, consigning, consignment, …).

Some rules in Porter stemmer
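As a rough illustration of the word-level steps above (counting word frequencies, removing stop-words, stemming), here is a minimal Python sketch. It rests on several assumptions: the rules from the original slide are not reproduced in this text, so the sketch applies only step 1a of the published Porter algorithm (plural suffixes) rather than the full stemmer; `STOP_WORDS` is just the short English excerpt listed above, not a complete list; and all function names are hypothetical, chosen for this example.

```python
import re
from collections import Counter

# Illustrative subset only: the English stop-words quoted on the slide.
# A real stop-word list would be much longer.
STOP_WORDS = {"a", "about", "above", "across", "after", "again",
              "against", "all", "almost", "alone", "along", "already"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


def word_frequencies(text):
    """Count stems after tokenizing and removing stop-words."""
    tokens = remove_stop_words(tokenize(text))
    return Counter(porter_step_1a(t) for t in tokens)


if __name__ == "__main__":
    print(word_frequencies("After all, cats love cats"))
```

In this sketch "cats" and "cat" would be counted under the same stem, which is the compaction that stemming is meant to provide; a production system would more likely rely on an existing stemmer implementation than on a hand-written subset of rules.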
Taxonomies/thesaurus level
• The main function of a thesaurus is to connect different surface word forms with the same meaning into one sense (synonyms)
• additionally, we often use the hypernym relation to relate general-to-specific word senses
By using synonyms and the hypernym relation we compact the representation of documents.
WordNet is the most commonly used general thesaurus; versions exist for many other languages (e.g. EuroWordNet), http://www.illc.uva.nl/EuroWordNet/

WordNet relations

Phrases level
Google n-gram corpus

Part-of-speech examples
Stanford Part-Of-Speech Tagger (POS tagger), http://nlp.stanford.edu/software/tagger.shtml

Vector Space Model
The most common way to deal with documents is first to transform them into sparse numeric vectors and then to process them with linear algebra operations
• by doing this, we forget everything about the linguistic structure within the text
• this is sometimes called the “structural curse”, even though forgetting about the structure does not harm the efficiency of solving many relevant problems
This representation is also referred to as “Bag-Of-Words” or “Vector-Space-Model”.
Typical tasks on the vector-space-model are classification, clustering, visualization, etc.

Vector space document representation – example

Example document and its vector representation
Term frequency–inverse document frequency (tf–idf)
tfidf(t, D) = tf(t, D) * idf(t, D)
idf(t, D) = log(N / Dt)
where N is the total number of documents and Dt is the number of documents that contain the term t.

Supervised Learning-Classification
Given: a set of documents labeled with content categories.
The goal: build a model which automatically assigns the right content categories to new, unlabeled documents.

Unsupervised Learning-Clustering
• Clustering is the process of finding natural groups in the data in an unsupervised way (no class labels are pre-assigned to documents)
• The key element is the similarity measure.
• Cosine similarity is the most widely used.
• The most popular clustering methods are:
  • K-Means clustering (flat, hierarchical)
  • Agglomerative hierarchical clustering
  • EM (Gaussian Mixture)

Web 2.0
• Users are active: they are no longer limited to consuming information. Instead, they are producers of information.
• Sites utilize tools that make it easy to publish on the web.
• Social networking
• Collective intelligence
• Multimedia publishing has exploded

Useful Web 2.0 Tools
• Weblogs
• Wikis
• Forums
• Really Simple Syndication (RSS)
• Aggregators
• Social Bookmarking and Networking
• Online Photo Galleries
• Audio/video-casting

Moodle - LMS
Moodle is a free and open-source learning management system distributed under the GNU General Public License. Developed on pedagogical principles, Moodle is used for blended learning, distance education, flipped classroom, and other eLearning projects in schools, universities, workplaces, and other sectors.

Language Courses on the Internet
• Duolingo: at the moment, it offers Spanish, English (for Spanish speakers), French, German, Portuguese and Italian; more languages are in beta and on the way soon.
• The Omniglot intro to languages gives a first overview of many languages, and follows it up with links to courses and other tools for each language.
• BBC’s languages has a mini-introduction to almost 40 different languages.
• About.com has articles, courses, and word lists for English as a second language, French, German, Italian, Japanese, Mandarin, and Spanish.
• Internet polyglot has courses and help with memorizing words for many languages.

Duolingo

Essential benefits of learning a foreign language through online courses
• Multimedia
• Repetition
• Autonomy
• Accessibility
• New Learning Methods
Skype in the classroom

About this Skype lesson
“I teach English in a primary school Hungary. We are looking for a nice class or group for regular meetings. Children in the partner group should be the teachers and they may teach my students English. We may plan any language games to play, or any basic topic to talk or any grammar to practice. I think practicing the language with the same age group may motivate my students to learn the language better. We were so glad to find a partner class or group”.

Skype in the Primary school Čegar in Niš
Skype in the classroom.

Vocabulary learning
• Memrise is one of the most versatile sites providing pre-made mnemonics for vocabulary in a wide range of languages. The collection is always expanding, since the system is open to people adding their own public vocabulary lists and suggestions.

Native content in the language
• Tunein lets you listen to live-streamed radio from all over the world.

Language learning forums
Fluent in 3 months forum: the forum on this site is one of the most active language learning forums online, with 20,000 members.
How to learn any language forum: if you are a foreign language enthusiast, a polyglot or just want to learn a new language on your own, you will find here:
• How to choose a new language to learn
• A detailed, hands-on guide to teaching yourself a foreign language
• Reviews of books about language learning
• The questions about language learning that people ask most frequently.

Get it pronounced/corrected by a native speaker
• Forvo is useful when you come across a new word and would like to hear how it is pronounced by a native speaker. It has a huge database covering many languages that you can search for an instant answer.
• Rhinospike is better for hearing how an entire sentence, or even a couple of short paragraphs, is pronounced by a native speaker.
• Lang 8 is a site where you can write text in a particular language and, pretty soon, have native speakers look over it and give you feedback.

Multilingual dictionaries
• Wordreference is one of the sites for looking up the meaning of words in French, Spanish, Italian and Portuguese.
• Bab.la is another dictionary, covering 24 languages.
• Google Translate: a completely free automatic translator, although it will mess things up a lot.
• Proz term search, the Interactive Terminology for Europe and Mymemory: specialized dictionaries for finding technical terminology that is less likely to appear in general dictionaries.

Social networking
In the past few years, a series of language learning social networks have appeared, and they make learning more fun, efficient and interactive than usual. Through these language education social networks, the student can now study a language in an enjoyable environment by meeting and interacting with native speakers from around the world.
Live Mocha

Interactive games

Mobile applications
• Babel mobile for Android
• Duolingo
• Rosetta Course

Augmented reality
Word Lens
Word Lens is an augmented reality application that recognizes printed words using the device camera and optical character recognition, and instantly translates these words into the desired language. It does not require a connection to the Internet.

Adaptation by usage of semantic rules
https://elearning4109.wikispaces.com/The+Semantic+Web+%26+Ontologies+in+e-learning

DSi framework

Conclusions
Instead of a conclusion, I would like to say that this conference will be a nice opportunity to exchange experiences and ideas about the possibilities of using contemporary information technologies in learning and teaching languages.
Also, we will have an opportunity to discuss ideas for some new common projects based on the usage of information technologies and languages.