CLIA Project Technical Progress Summary

Presentation of the CLIA Project
On the occasion of FIRE at Kolkata
by Pushpak Bhattacharyya, IIT Bombay,
on behalf of the CLIA Consortium
12 Dec 2008
Motivation
CLIA is a real need


Great language diversity in India
Low comfort level with English


less than 5% of the total population of
about 700 million can use English effectively
Need for critical information in large
quantity and high quality, especially in
agriculture, health, tourism, education

and sectors
CLIA project started in 2006; domains: tourism and health
12 Dec 08
FIRE– Kolkata - CLIA Project
3
Geographically speaking
[Map of India with the Punjabi, Bengali, Marathi, Telugu and Tamil regions labelled]
World rank in terms of number of speakers:
Hindi-Urdu: 5th
Bengali: 7th
Marathi: 14th
…
CLIA: basic information
Defining Diagram
CLIA Consortium Members
Name of Institute: Assigned Language(s)
IIT Bombay (Consortium Leader): Marathi, Hindi
IIT-Kharagpur (consortium co-leader): Bengali
IIIT Hyderabad: Telugu, Hindi
Anna University-KBC: Tamil
Anna University-College of Engg: Tamil
ISI Kol: Bengali
Jadavpur University Kolkata: Bengali
CDAC-Pune: Marathi, Hindi, Tamil
CDAC-Noida: Punjabi
Utkal University: --
Principal Investigators
Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudev Verma
AU-KBC: Prof. Sobha L.
AU-CEG: Prof. Ranjani Parthasarthy
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhyay
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty
Some prominent research members
Name of Institute: Names
IITB: Manoj, Vishal, Vishaal, Ashish
IIT-Kgp: Nimesh, Dr. Rajendra
IIITH: Bhupal, Praneet
AU-KBC: Pattavi, Vijay, Vijay
AU-CEG: Kaviha, Subha Lalitha
ISI Kol: Prasenjt, Deepashri, Ayan
JU Kol: Asif, Pinaki
CDAC-P: Swati, Abhishek
CDAC-N: Gaur Mohan, Ankur
Utkal University: Balbant Rai
Prior expertise brought to the project (horizontal, i.e., language independent)
Name of Institute: Areas of prior expertise/experience
IITB: NLP (LR, WSD, MT), Semantic Search
IIT-Kgp: Search and Ranking, Shallow Parsing
IIITH: Commercial level search engine building, query processing
AU-KBC: NER, Information Extraction, Summarization, Anaphora
AU-CEG: Morphology, Interlingua
ISI Kol: IR Evaluation, large scale IR system building (SMART)
JU Kol: Example based MT, Summarization, NER
CDAC-P: Converters, File format processors, MT
CDAC-N: Parallel corpora, Query processing
Utkal University: Machine Translation, Lexical Resources
Prior expertise brought to the project (vertical, i.e., language specific)
Name of Institute: Areas of prior expertise/experience
IITB: Hindi Marathi wordnet building, Hindi Marathi shallow parsing
IIT-Kgp: Bengali shallow parsing including MA
IIITH: Telugu-Eng CLIR, Telugu query processing
AU-KBC: Tamil NER, Tamil IE, Tamil Morph
AU-CEG: Tamil Morph, Eng-Tamil MT
ISI Kol: Bengali statistical stemming, large scale corpora for Bengali
JU Kol: Bengali NER, EBMT involving Bengali
CDAC-P: Various Indian language converters
CDAC-N: Aligned parallel corpora for Indian languages
Utkal University: --
Horizontal tasks of CLIA and the
organizations responsible

Input Query processing


Crawling, Indexing


IIT KGP, IIITH, IITB
User Interface


IIT KGP, IIITH, IITB
Searching, Ranking


IIIT Hyderabad
CDAC Noida
File format processing

CDAC Pune
Horizontal tasks of CLIA and the
organizations responsible (contd)

Document Processing (index time NER, IE)


Document Processing (Post Retrieval: Snippet,
Summary)


IIT KGP, Utkal, CDACP
Evaluation, Relevance Judgement


Jadavpur University
Distributed Search


AU KBC
ISI Kolkata
UNL based semantic search (for Tamil)

AU CEG
Languages and the organizations responsible
Language: Organization(s) ((c) = coordinator)
Bengali: IIT KGP (c), JU, ISI
Hindi: IIITH (c), IITB, CDAC Noida
Marathi: IITB (c), CDAC Pune
Punjabi: CDAC Noida
Tamil: AUKBC (c), AUCEG
Telugu: IIITH
CLIA Important Dates







Project Start Date: 29th Aug 06 (effectively Jan
2007)
First meeting of the Project Review and
Steering Group (PRSG): 2nd March 2007
Second PRSG: 30th Aug 2007
Third PRSG: 08th March 2008
Fourth PRSG: 15th July 2008
Alpha version released: 15th July, 2008
Beta version to be released (along with the 5th
PRSG): January, 2009
Related consortium: E-IL MT
project


English to Indian Language MT
Indian Languages: Hindi, Marathi,
Bengali, Urdu, Oriya, Telugu, Tamil


Approaches: Statistical MT, Example
Based MT
Members: CDAC Pune (c), IIT Bombay,
JU, UU, IIITH, IIITA
Related consortium: IL-IL MT
project


Indian Language to Indian Language MT
Indian Languages: Hindi, Marathi,
Bengali, Punjabi, Tamil, Telugu, Kannada


Approach: Transfer Based
Members: IIITH (c), CDAC Pune, IIT
Bombay, JU, University of Hyderabad,
AU KBC
All three projects are time bound
and result oriented



2 years time frame (extension granted
for 1 year)
Strict deliverables
For each project the budget outlay is
about Rs 80 million (USD 2 million)
CLIA: Top level technological
information
Process Flow
CLIA: achievements in 2 years
(Jan 2007 to Dec 2008)
Tools and resources
(Copyrightable code and data)
Steps towards overall evaluation
Yet to be completed
Large relevance judgment base under construction
Precision, Recall, MAP, F-score etc.
50 queries per language (6 languages)
About 5000 documents per language (6 languages)
Crawled and indexed document base of English: approx 600,000 pages
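The set-based measures named on this slide (precision, recall, F-score) can be illustrated for a single query with a minimal Python sketch. The function name and the document ids are hypothetical and are not taken from the CLIA code base; the balanced F-score (harmonic mean of precision and recall) is assumed.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-score for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical document ids, for illustration only.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```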
Copyright for CLIA (code)
Code: Details
Input Processing:
Soft Keyboard (Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi languages) (CDAC-P)
Algorithm for transliteration of Devanagari words to English using Segment Based Transliteration (IIITH, IITB)
Implementation of Multilingual Sense Dictionary along with API for accessing MSD during lexical substitution (IITB)
Implementation of automatic multi-word extraction algorithm for populating the multi-word field of the index (IITB)
Bengali:
Bengali stemmer (IITKGP)
Bengali Hindi transliteration (IITKGP)
Marathi:
Implementation of Language Analyzers (Morphological Analyzer) for Marathi (IITB)
Copyright for CLIA (code) contd.
Code: Details
Punjabi:
Punjabi Spell Normalizer (CDAC-N)
Punjabi Stemmer (CDAC-N)
Font transcoders (Unicode - Proprietary fonts) - map files etc. (CDAC-N)
Tamil:
Stemmer for Tamil (AUKBC)
Named Entity Recognition engine (AUKBC)
Information Extraction (AUKBC)
Font transcoders (Tamil Proprietary fonts) (AUKBC)
IE template Translation (AUKBC)
Copyright for CLIA (code) contd.
Code: Details
Telugu:
Language Analyzer for Telugu (IIITH)
Query Translation for Telugu and Hindi (IIITH)
Query Transliteration for all languages (IIITH)
Transcoder (IIITH)
Indexing:
CML converter (IITKGP)
Focused Crawler (IIITH)
Language Identifier (IIITH)
File Format Processors (CDACP)
Copyright for CLIA (code) contd.
Code: Details
Ranking:
Ranker implementation (IITKGP)
Output Processing:
Snippet Generation (JU)
Summary Generation (JU)
Snippet Translation (JU)
UNL:
Sentence constituent UNL enconverter (AUCEG)
UNL indexer (AUCEG)
UNL Template based Information extractor (AUCEG)
UNL Template based Summarizer (AUCEG)
UNL based Search and ranking (ranking module under development) (AUCEG)
Copyright for CLIA (data)
Data: Details
Input Processing
Bengali:
Synset dictionary entries for Bengali (shared with JU and CDAC Pune)
English to Bengali Transliteration of NE list (shared with JU and IIT KGP)
NE annotated corpora (IITKGP)
NE list transliterated (IITKGP)
Telugu:
Telugu to English Dictionary (IIITH)
Telugu to English Transliteration list (IIITH)
NE annotated corpora for Telugu and Hindi (IIITH)
Telugu corpus developed for IE module (IIITH)
Copyright for CLIA (data) contd.
Data: Details
Input Processing
Tamil:
English - Tamil Parallel Named Entity List (AUKBC)
Tamil - English Dictionary (AUKBC)
Synset dictionary entries for Tamil (AUKBC)
Tamil Named Entity annotated corpus (AUKBC)
English Named Entity annotated corpus (AUKBC)
Named Entity Tagset (AUKBC)
Copyright for CLIA (data) contd.
Data: Details
Punjabi:
Punjabi translations (for parallel corpora) (CDAC-N)
English - Hindi - Punjabi parallel named entity list (CDAC-N)
Punjabi Named Entity Tagged Corpus (under development) (CDAC-N)
Database for Punjabi stemmer (prior development) (CDAC-N)
Marathi:
English to Marathi Transliteration of NE list (IITB and CDAC Pune)
Marathi-English parallel corpora in tourism domain used for training the snippet translation SMT system (IITB)
List of Multi-Word Expressions in Marathi and Hindi (IITB)
English-Marathi Parallel list of Named-entities used for IE Template translation (shared with C-DAC Pune)
Hindi:
Hindi to English Dictionary (IIITH)
Hindi to English transliteration list (IIITH)
Hindi MW list (IITB)
Copyright for CLIA (data) contd.
Data: Details
Evaluation of the IR system:
Set of test topics (general domain, tourism domain) (ISIK)
Relevance judgments for the above pair (ISIK)
UNL:
UW list - Tourism domain (AUCEG)
Conclusion





Large scale national level activity
Large number of tools and resources
developed under the consortium
Alpha release done in July, 2008
Beta release to take place in Jan, 2009
Look forward to more detailed
interactions and suggestions from the
international audience
Introducing people…
Principal Investigators
Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudev Verma
AU-KBC: Prof. Sobha Nair
AU-CEG: Prof. Ranjani Parthasarthy
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhyay
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty
Overview






Technical Status of the Project
Technical Documentation
Shared resources
Testing methodology
Software Documentation
Alpha and Beta versions
Technical Summary
Work Flow
Input Query in IL
Input Query Processing
Search
Document Processing
Output Generation
Evaluation
Project Status
Input Query in IL
Input Query Processing
Search
Document Processing
Output Generation
Evaluation
Status - Input Processing

Stemmer




All Language stemmers developed
Integrated with Nutch through plug-ins
Monolingual retrievals are working
MWE


Guidelines are under discussion (IITB)
Marathi ~ 2000 MWE Bangla ~ 600 MWE
Tamil ~ 600 MWE
Punjabi ~ 4000 MWE
Status – Input Processing: NER
Columns: Language; NE-tagged corpus size; Accuracy; NE list details
Hindi (IIITH): 50K words; 68%; 31,177 entries
English: 50K (AUKBC); 88.5% (Precision), 73.7% (Recall), F-Score 80.44%; 7,500 entries (AUKBC); Gazetteer list size (IITKgp): Health 39,819 entries, Tourism 90,848 entries, General 4,79,427 entries
Punjabi (CDACN): Not started; NA; Person 10,004 | City 500 | Company 500 | Hospital 20,603
Marathi (IITB): 50K; 61.43% (F-score); Total 4763 | Time 361 | Numerical 706 | Names 3666
Bengali (IITKgp): 125K (all domains); ~ 75-78%; Bangla: 90,000 names (all domains); Gazetteer list is being transliterated to Bangla
Tamil (AUKBC): 94K; 88.5% (Precision), 73.7% (Recall), F-Score 80.44%; NE 23,000 entries; Dictionary of personal names 70,000 (tagged corpus + dictionary used for NER)
Telugu (IIITH): 60K; 74%; 38,000 entries
Status - Input Processing

WSD (IITB)



2nd version WSD
Interface for Sense-marking of corpus developed
by IITB
Dictionary



IITB working on E-Hin linkage
All LVs working on IL-IL linking and E-IL linking
~10,000 synsets generated from Tourism corpora
Status: Dictionary
Eng-Hin Linkage
~ 2500 synsets linked (IITB)
IL-IL Dictionary Status (as on 30 Sept 07)
Language: #Synsets linked
Bengali: 2005
Marathi: 4298 (all cross-linked)
Punjabi: 559
Tamil: 1890
Telugu: 461
(without cross-linking)
Sample Input screen

Input Screen
Sample Input screen

Advanced search option
Status – Search
Size of Indexed corpus
Language: No of pages; No of URLs
English: 10,000; 115
Hindi: 21,000; 25
Bangla: 3,000; 25
Tamil: 20,000; 25
Punjabi: 17,000; 25
Marathi: 3,300; 42
Status – Search

cML-Text Converter (IIT-Kgp)




First version of the engine is ready
Software extracts the fields and body, but
does not identify paragraphs and blocks in
this version
Has been tested for Bengali
Ready to be integrated with Nutch
Status – Document Processing




Basic IE Engine and eleven IE Templates
are ready (AUKBC)
Has been tested with sample documents
(EILMT corpus)
First template “How to reach the place”
is getting translated to Tamil, Telugu
For other languages, the inflectional
markers are being provided
Sample Output Screen
Output screen if Input language is Hindi
Sample Output screen
Output screen if Input language is Hindi, and English tab is selected
Sample Output screen
Output screen of translation of Snippet (English to Bengali)
Sample Output Screen
Advanced output screen with Hindi Summary
Sample Output Screen
Sample screen with Information Extraction
Status – Output Generation

Snippet Generation (JU)



Working for monolingual retrieval
Integrated with Nutch
Has been tested for Bengali
Status - Evaluation

Corpora




Tourism and Health Corpora being collected for all
languages
News corpora also being collected.
Period of news corpora ranges from 2002 to 2007
For News corpora, ISI Kol having dialogues with
TOI and Hindustan Times for permission for the
use of their multilingual corpora
Details of Corpora (crawled)

Assumption in SRS:

Each language corpus has at
least 50,000 documents from
General / News + all
available documents in
Tourism and Health
Evaluation : Topics

Topics (ISI Kol)





A set of 95 topics is ready for evaluation
30 topics for training, 50 topics for testing and 15 topics as stand-by
Each topic = Title + Narration + Description
Translation of these 95 topics has been completed by all six language verticals
Sample Topic



<title> Euro Inflation</title>
<desc> Find documents about rises in prices after the
introduction of the Euro</desc>
<narr> Any document is relevant that provides
information on the rise of prices in any country that
introduced the common European currency.</narr>
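A topic in this form can be read into its three fields with a few lines of Python. The following is only a minimal sketch: the function name is hypothetical, and it assumes each topic carries exactly the three tags shown in the sample (title, desc, narr), each occurring once.

```python
import re

# The sample topic from the slide, reproduced as a string.
TOPIC = """<title> Euro Inflation</title>
<desc> Find documents about rises in prices after the
introduction of the Euro</desc>
<narr> Any document is relevant that provides
information on the rise of prices in any country that
introduced the common European currency.</narr>"""


def parse_topic(text):
    """Return a dict with the title, desc and narr fields of one topic."""
    fields = {}
    for tag in ("title", "desc", "narr"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        # Collapse line breaks and repeated spaces inside the field.
        fields[tag] = " ".join(match.group(1).split()) if match else ""
    return fields


topic = parse_topic(TOPIC)
print(topic["title"])
```

The title field would typically serve as the short query, with desc and narr guiding the human judges.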
Evaluation Methodology
Benchmark data creation
[Diagram: the corpus and the queries are run through IR engines 1, 2, ..., n; the results form a pool, which human judges assess to produce the relevance judgements]
Evaluation Methodology

Benchmark data creation
Sample documents (corpus)
Sample Queries / Topics (95)
Relevance judgement
No of relevance judged Bangla documents ~ 4,500
Independently judged against 23 topics by each of two judges
Pooling
Pooling strategies adopted by TREC
Lists of the top ~100 documents are taken
Pool = union of these
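The pooling step can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: the function name, the engine names and the document ids are hypothetical, and each engine is assumed to return a ranked list of document ids for one topic.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from every engine's ranked list."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical ranked lists from three engines for a single topic.
runs = {
    "engine_1": ["d3", "d1", "d7"],
    "engine_2": ["d1", "d9", "d4"],
    "engine_n": ["d7", "d2", "d3"],
}
pool = build_pool(runs, depth=2)  # a small depth keeps the example readable
print(sorted(pool))
```

Only the documents in the pool go to the human judges; documents outside it are conventionally treated as not relevant.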
Evaluation methodology

Evaluation engine
30 Topics/Queries
Corpus > 50,000 docs
Retrieval Engine
Top 100 Docs
Relevance Judgments
Evaluation Engine
Metrics
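The final stage of this flow, turning ranked results and relevance judgments into a metric, can be sketched for Mean Average Precision (MAP). The names below are hypothetical and the data is invented for illustration; the cutoff of 100 mirrors the "Top 100 Docs" box, and the sketch makes no claim about how the CLIA evaluation engine is implemented.

```python
def average_precision(ranked_docs, relevant, cutoff=100):
    """Average precision of one ranked list against the judged-relevant set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(run, qrels, cutoff=100):
    """Mean of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [average_precision(run.get(t, []), rel, cutoff) for t, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical run (topic -> ranked documents) and judgments (topic -> relevant documents).
run = {"t1": ["d1", "d5", "d2"], "t2": ["d9", "d3"]}
qrels = {"t1": {"d1", "d2"}, "t2": {"d3"}}
score = mean_average_precision(run, qrels)
print(f"MAP = {score:.4f}")
```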
UNL




Monolingual retrieval is working for Tamil
documents
6500 words in UNL Dictionary
Words + MWE indexed
Documents indexed




No. of documents processed in Tourism - 564
No of Concept-Relation-Concept indexed - 11,754
No of Concept-Relation indexed - 11,754
No of Concepts indexed - 17,650
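One way to picture the three index levels listed above (Concept-Relation-Concept, Concept-Relation, Concept) is as three inverted indexes built from UNL triples. The sketch below is an assumption-laden illustration, not the AUCEG indexer: the function name, the simplified concept labels and the sample triples are hypothetical, and real UNL universal words carry more detail than the plain words used here.

```python
from collections import defaultdict


def build_unl_indexes(doc_triples):
    """Build three inverted indexes from (concept, relation, concept) triples."""
    crc_index = defaultdict(set)  # (concept, relation, concept) -> documents
    cr_index = defaultdict(set)   # (concept, relation) -> documents
    c_index = defaultdict(set)    # concept -> documents
    for doc_id, triples in doc_triples.items():
        for source, relation, target in triples:
            crc_index[(source, relation, target)].add(doc_id)
            cr_index[(source, relation)].add(doc_id)
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
    return crc_index, cr_index, c_index


# Hypothetical triples; "plc" (place) and "agt" (agent) are used as relation labels.
docs = {
    "doc1": [("visit", "plc", "temple"), ("visit", "agt", "tourist")],
    "doc2": [("stay", "plc", "hotel"), ("visit", "plc", "beach")],
}
crc, cr, c = build_unl_indexes(docs)
print(sorted(cr[("visit", "plc")]))
```

A query can then be matched at the most specific level first (full triple) and fall back to the looser levels when no document matches.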
Testing Methodology

Testing methodology




Black box testing based on SRS and design documents
Unit testing by each sub-system
Test cases (format) and test reports
Integration testing



Top down / Bottom-up based on dependencies
Stubs and drivers
Sub-system wise testing (module-wise)







Input processing
Search and Retrieval
Document processing
Output Generation
Evaluation
UNL
System Testing

Performance testing
Integration








Use of controlled corpora for Integration
Use of EILMT English and Hindi parallel corpus
ISI generates the queries for corpus
Translation of queries by all LVs
English and Hindi synsets identified for building
multilingual dictionary by each LV
Each language vertical will be tested for their
respective cross-lingual retrieval
Information Extraction and output generation will be
done on the same corpora
Integration of each LV into Nutch at IITKgp
Test and Integration (contd.)



Bug tracking system (Bugzilla) to be installed
Currently planned for installation at IITB on
the same server as CVS
Bugzilla




Web-based general-purpose bug tracker tool
Tracks not only software bugs but also all other
user-submitted tracking tickets
Eases communication between team members
Can be integrated with CVS and WIKI
Bugzilla

Requirements





A compatible database management system –
MySQL, PostgreSQL
A suitable release of Perl 5
A compatible web server
A suitable mail transfer agent, or any SMTP server
Bugzilla Demo

https://landfill.bugzilla.org/bugzilla-tip/index.cgi
Bugzilla - Design

Bugs can be
submitted by
anybody, and
will be
assigned to a
particular
developer
Deployment diagram
Deployment Diagram
for Nutch-based
Search Subsystem
Quoted from Mike Cafarella, Doug Cutting, Building Nutch: Open Source Search, Queue, v.2 n.2, April 2004
The real life scenario would have four more such index servers, one for every
Indian language and (maybe) more search servers to ensure greater number of
searches per unit time
Hosting of Alpha and Beta versions

Alpha Version






~10,000 documents in each language
Low complexity system
Hence simple hardware configuration sufficient
Does not include Summary generation and Output translation
Planned for Dec 2008
Beta Version



~10,00,000 documents in each language
Hardware configuration being worked out - based on disk
space requirements, throughput of system, response times,
simultaneous users etc.
Following details are being worked out:




Connectivity
Where to host
Support for hosting
Planned for July 2008
Elitex08: Demo of Alpha Version

Plan to demonstrate the following:







Cross-lingual information retrieval for all languages
Information Extraction and translation of at least
one template to Tamil / Telugu
Snippet Generation (monolingual)
Hardware integration – IITKgp
Publicity management / Poster design - JU
Funds: Participation fees to be shared
Demonstrate the same at IJCNLP08 exhibition
(in Hyderabad - Jan 2008)
Gantt chart (as on Aug 30)
Software documentation








SRS (Based on IEEE)
Design document v2.0 (based on RUP)
User Requirements Document (Ver 5.0)
Java docs
Test cases template
File naming conventions
Testing and integration guidelines
Code review guidelines
Skip templates
Software documentation : SRS

SRS





Introduction
Overall description
External interface requirements
System features (module-wise)
Advanced Search system for Tamil using UNL
Software documentation: DD

Design document (v 2.0)

Has been simplified to suit project needs


Introduction
System Architecture



System Design



Solution Architecture (brief description of systems,
subsystems)
Software Architecture ( block diagrams)
Logical Design (Class Diagrams )
Component Design (Component Diagrams )
Appendix - other details
Software documentation:URD

URD











Introduction
Objective
Scope of the project
Product perspective
Capabilities of the Product
User Characteristics
Assumptions and dependencies
Operational environment
Input / Output scenarios
Definitions, acronyms and abbreviations
References
Software documentation:Test

Test case template: for all tests
Columns: Test case | Test data | Expected result | Actual result | Remarks
Software documentation:File naming

File naming convention captures the following:






Subject & domain of document
Content Type (ppt / doc / rpt / Tr / etc)
Name of Institute (IITB / ISI / IIITH etc.)
Date of creation of doc (dd-mon-yy)
Version no.
Format

<Subject>_<Content_type>_<Institute>_<date>_<ver.no>.<file ext>

E.g. PRSG_Pres_IITB_08dec07_v1.ppt
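The file naming convention can be checked mechanically. The sketch below is a hypothetical validator, not a project tool: the regular expression is inferred from the single example PRSG_Pres_IITB_08dec07_v1.ppt, so it assumes the date is written without separators (ddmonyy), that fields contain no underscores, and that month names are in English.

```python
import re
from datetime import datetime

# Pattern inferred from the one example on the slide (an assumption).
FILENAME_RE = re.compile(
    r"^(?P<subject>[^_]+)_(?P<content_type>[^_]+)_(?P<institute>[^_]+)_"
    r"(?P<date>\d{2}[A-Za-z]{3}\d{2})_v(?P<version>\d+(?:\.\d+)*)\.(?P<ext>\w+)$"
)


def parse_filename(name):
    """Split a file name into its convention fields, or return None if it does not fit."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # "08dec07" -> date(2007, 12, 8); relies on English month abbreviations.
    parts["date"] = datetime.strptime(parts["date"], "%d%b%y").date()
    return parts


info = parse_filename("PRSG_Pres_IITB_08dec07_v1.ppt")
print(info)
```

Returning None for non-conforming names lets a repository check list offending files instead of failing on the first one.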
Shareable Resources and Tools
Shared Resources across projects
From ILILMT to CLIA:
Morph Analyzer
POS Tagger
Chunker
Dictionary Standardization
IL-IL Synsets
From EILMT to CLIA:
Synsets E-IL
From CLIA to other projects:
NER engine
NE list
MWE
Collaborative tools used - CLIA
Tool: Purpose
Googlegroups: Group e-mailing
Wiki: Project documents, member contact details, minutes of meetings, presentations, timelines, progress reports, fund details etc.
CVS: Source code
Google docs: Sharing and editing of documents
Webex audioconferencing: Weekly teleconferences
CLIA Wiki site


http://www.cfilt.iitb.ac.in/~consortia/dokuwiki
CLIA Wiki contents








Project Team Contact details
Project documentation (SRS, Design doc, URD..)
Meeting minutes and presentations
Project fund details
Progress reports and timelines
Project resources
Corpus
Collaborative platform for audio conferences
Wiki – Upload notification
Thank You