The Soyogo Treebank – a parsed corpus of child Japanese

Front Page

The Soyogo Treebank is a corpus of child Japanese with hand-worked tree analysis. Highlights include:

Further results derived from the analysis, notably dependency graphs, can be viewed through the search interface.

About the data sourced from CHILDES

Data is sourced from two of the corpora available from CHILDES (http://childes.talkbank.org; MacWhinney, 2000):

and

These corpora contain samples transcribed in Latin script (Miyata, Muraki, and Morikawa, 2004) using the WAKACHI2002 v8.0 format proposed by Miyata (2018) and provided with morphological tags in JMOR08 format (Miyata and Naka, 2014).

About the data analysis

The Soyogo Treebank adds layers of syntactic information to the morphological analysis data from CHILDES following The Kusunoki Treebank (Kainoki, 2022). The result is a corpus with syntactic trees over morphological information, amounting to a full morpho-syntactic analysis of the child language data.

Search Interface

The Soyogo Treebank is associated with a powerful user interface that enables search using virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is being updated to reflect improvements in analysis.

Attribution

Presentations of research results using The Soyogo Treebank should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

Butler, Alastair, Susanne Miyata and Yumiko Kinjo (2022) “The Soyogo Treebank – a parsed corpus of child Japanese” https://soyogo.github.io (accessed 9 January 2022).

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.
