Monday, October 17, 2011

Need documents of various media types and languages?

I wrote this Java document provider library for the Google Translator Toolkit Client. It is quite handy when developing an application that works with documents that vary in media type or language. TDD is the best approach for writing these kinds of apps, and this library lets you set up your test suites with documents very easily.
It downloads text data and builds combinations of the following document types and languages:

html - "html"
pdf - "pdf"
odt - "vnd.oasis.opendocument.text"
docx - "vnd.openxmlformats-officedocument.wordprocessingml.document"
doc - "msword"
xlsx - "vnd.openxmlformats-officedocument.spreadsheetml.sheet"
xls - "vnd.ms-excel"
ppt - "vnd.ms-powerpoint"

(bg, es, cs, da, de, et, el, en, fr, it, lv, lt, hu, mt, nl, pl, pt, ro, sk, sl, fi, sv)
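
To give a feel for how many documents that covers, here is a minimal sketch that enumerates the type and language pairs listed above. The class name `DocumentMatrix`, the `combinations()` helper, and the `ext:lang` label format are made up for illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the library: it only enumerates the
// type/language pairs listed in this post.
class DocumentMatrix {

    // File extension -> media subtype, copied from the list above.
    static final Map<String, String> SUBTYPES = new LinkedHashMap<>();
    static {
        SUBTYPES.put("html", "html");
        SUBTYPES.put("pdf", "pdf");
        SUBTYPES.put("odt", "vnd.oasis.opendocument.text");
        SUBTYPES.put("docx", "vnd.openxmlformats-officedocument.wordprocessingml.document");
        SUBTYPES.put("doc", "msword");
        SUBTYPES.put("xlsx", "vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        SUBTYPES.put("xls", "vnd.ms-excel");
        SUBTYPES.put("ppt", "vnd.ms-powerpoint");
    }

    // Language codes, copied from the list above.
    static final List<String> LANGS = Arrays.asList(
            "bg", "es", "cs", "da", "de", "et", "el", "en", "fr", "it", "lv",
            "lt", "hu", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "fi", "sv");

    // Builds one "extension:language" label per pair, extensions first.
    static List<String> combinations() {
        List<String> result = new ArrayList<>();
        for (String ext : SUBTYPES.keySet()) {
            for (String lang : LANGS) {
                result.add(ext + ":" + lang);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 8 types x 22 languages
        System.out.println(combinations().size() + " combinations");
    }
}
```

With 8 types and 22 languages this yields 176 pairs, which is why having them generated for you is nicer than collecting sample files by hand.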

You just need to call one of DocumentProvider's API methods:

DocumentProvider.getDocByTypeAndLang(type, lang);

to get one or more objects representing a document, with these fields:

long id;
long size;
long checksum;
String type;
String sample;
File sampleFile;
String url;
String state;
File file;
MediaType mediaType;
int wordCount;
String content;
List<String> words;
List<String> sampleWords;
int sampleWordCount;
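
Here is a rough sketch of how a test might consume such an object. Everything in it is a stand-in: `Doc` models only a few of the fields above, `StubDocumentProvider` returns canned text instead of downloading anything, and the `String` parameter types and single-object return value are assumptions, since the call above does not show them. Deriving `words` by splitting on whitespace is also just a guess for the sake of the example.

```java
import java.util.Arrays;
import java.util.List;

// Entry point of the sketch; shows the call and a couple of field reads.
class DocumentProviderExample {
    public static void main(String[] args) {
        Doc doc = StubDocumentProvider.getDocByTypeAndLang("pdf", "en");
        System.out.println(doc.type + " document with " + doc.wordCount + " words");
    }
}

// Hypothetical stand-in for the returned document object. Only a few of
// the fields listed above are modelled; the real class may differ.
class Doc {
    final String type;
    final String content;
    final List<String> words;
    final int wordCount;

    Doc(String type, String content) {
        this.type = type;
        this.content = content;
        // Assumption: words are whitespace-separated tokens of the content.
        this.words = Arrays.asList(content.trim().split("\\s+"));
        this.wordCount = words.size();
    }
}

// Hypothetical stub that borrows the method name from the call shown above.
// Parameter and return types are assumed, not taken from the library.
class StubDocumentProvider {
    static Doc getDocByTypeAndLang(String type, String lang) {
        // The real library downloads text data; this stub returns canned text.
        return new Doc(type, "sample text in " + lang);
    }
}
```

In a real test suite you would swap the stub for the library's `DocumentProvider` and assert on whichever fields your application cares about, such as `wordCount` or `mediaType`.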
