[clean-list] Sanskrit Transliteration - Parsing into Abstract Syntax Trees

Wed Aug 27 21:23:54 MEST 2008

Ok, it's time to spill the beans. My goal in Clean parsing has to do with
Sanskrit.

Sanskrit is usually written in the Devanagari script
<http://en.wikipedia.org/wiki/Devanagari>. Before that script could be typed
and displayed easily on computers, people approximated it with ASCII
characters <http://en.wikipedia.org/wiki/Devanagari_transliteration>.

Over time, several such transliteration schemes evolved.

My goals are:

1 - bidirectional transliteration between the various ASCII schemes:

// Given Harvard-Kyoto, produce Velthuis encoding:
translit Harvard Velthuis "ajJAna"  // output will be aj~naana

2 - unidirectional transliteration from any ASCII scheme to Unicode:

// Given Harvard-Kyoto, produce Unicode encoding:
// an expansion on http://www.iit.edu/~laksvij/language/sanskrit.html
translit Harvard Unicode "ajJAna" // output will be अज्ञान 

I'm thinking of using the Velthuis encoding
<http://en.wikipedia.org/wiki/Devanagari_transliteration#Velthuis>
as the "Abstract Syntax Tree" for the whole project: whatever ASCII scheme I
am given, I first convert it to Velthuis and then convert the Velthuis to the
specified target.
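
A minimal sketch of this pivot idea, written in Python rather than Clean just
to make the shape concrete. Everything here is an assumption for
illustration: the table is deliberately partial, the scheme names are plain
strings, and rewrite, TO_PIVOT and FROM_PIVOT are hypothetical names, not
part of any existing library.

```python
# Partial Harvard-Kyoto -> Velthuis table (assumption: illustration only,
# not the full alphabet). Letters identical in both schemes are omitted and
# simply pass through unchanged.
HK_TO_VELTHUIS = {
    "A": "aa", "I": "ii", "U": "uu",
    "R": ".r", "M": ".m", "H": ".h",
    "G": '"n', "J": "~n",
    "T": ".t", "D": ".d", "N": ".n",
    "z": '"s', "S": ".s",
}


def invert(table):
    """Swap keys and values to get the table for the opposite direction."""
    return {value: key for key, value in table.items()}


# Each scheme only needs one table into the pivot; the way back is derived.
# Velthuis is the pivot itself, so its table is empty.
TO_PIVOT = {"harvard": HK_TO_VELTHUIS, "velthuis": {}}
FROM_PIVOT = {name: invert(table) for name, table in TO_PIVOT.items()}


def rewrite(text, table):
    """Greedy longest-match replacement; unknown characters pass through."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


def translit(source, target, text):
    """Convert source -> Velthuis (pivot) -> target."""
    pivot = rewrite(text, TO_PIVOT[source])
    return rewrite(pivot, FROM_PIVOT[target])


print(translit("harvard", "velthuis", "ajJAna"))    # aj~naana
print(translit("velthuis", "harvard", "aj~naana"))  # ajJAna
```

With a pivot, adding a new scheme means adding one table rather than one
table per pair of schemes. The sketch keeps the pivot as a flat string; a
list of tokens would be closer to a real syntax tree and would avoid
re-scanning the Velthuis text.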

I still have a few more days of banging my head against the MetarParser, but
I wanted to at least let people know where I'm heading with all these
questions.

Errata:
====

A major hitch in converting ASCII to Unicode is that all of the ASCII
schemes are purely linear: you read them the way you would read English,
left to right.

Devanagari, however, is non-linear in at least two places:
* short "i" is written before the consonants it is pronounced after. In other
words, "agni" is written in Devanaagarii with the "i" between the "a" and the
"g": "aign", even though it is pronounced "agni".
* "r" goes to the far right of the consonants it _precedes_. In other
words, "rgo" is written in Devanaagarii with the "r" after the "go".
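
One thing that may soften this hitch: Unicode stores Devanagari in logical
(spoken) order and leaves the visual reordering of short "i" and of "r" to
the font and rendering engine. So the converter itself can stay linear. Below
is a rough Python sketch of that, under loud assumptions: the tables cover
only the handful of Harvard-Kyoto letters needed for the examples, and
hk_to_devanagari is a hypothetical name.

```python
import unicodedata

VIRAMA = "\u094d"  # suppresses a consonant's inherent "a"

# Partial Harvard-Kyoto tables (assumption: only what the examples need).
CONSONANTS = {
    "g": "\u0917", "j": "\u091c", "J": "\u091e",
    "n": "\u0928", "r": "\u0930",
}
INDEPENDENT_VOWELS = {
    "a": "\u0905", "A": "\u0906", "i": "\u0907", "o": "\u0913",
}
# Dependent vowel signs; "a" is inherent in the consonant and adds nothing.
VOWEL_SIGNS = {"a": "", "A": "\u093e", "i": "\u093f", "o": "\u094b"}


def hk_to_devanagari(text):
    """Emit code points strictly left to right, in spoken order."""
    out = []
    after_consonant = False
    for ch in text:
        if ch in CONSONANTS:
            if after_consonant:
                # Two consonants in a row: join them into a cluster.
                out.append(VIRAMA)
            out.append(CONSONANTS[ch])
            after_consonant = True
        elif ch in VOWEL_SIGNS:
            if after_consonant:
                out.append(VOWEL_SIGNS[ch])
            else:
                out.append(INDEPENDENT_VOWELS[ch])
            after_consonant = False
        else:
            raise ValueError("letter not in the partial tables: %r" % ch)
    if after_consonant:
        # Word ends in a bare consonant: mark the missing vowel.
        out.append(VIRAMA)
    return "".join(out)


# The vowel sign for "i" is the last code point stored for "agni", even
# though it is drawn to the left of the "gn" cluster on screen.
for ch in hk_to_devanagari("agni"):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```

The same holds for "rgo": the "r" is stored first, followed by a virama, the
"g" and the vowel sign, and the renderer decides where to draw it.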

There is already a good converter from Harvard-Kyoto to Devanaagarii
<http://www.iit.edu/~laksvij/language/sanskrit.html> so I may just focus on
bidirectional ASCII transliteration and then, when I need Unicode, simply use
that online tool.

It would be nice to have all resources available in a Clean program though.
