The Soyogo Treebank is a corpus of child Japanese with hand-worked tree analysis. Highlights include:
Further results — notably, dependency graphs — derived from the analysis can be seen with the search interface.
Data is sourced from two of the corpora available from CHILDES (http://childes.talkbank.org; MacWhinney, 2000):
and
These corpora contain samples transcribed in Latin script (Miyata, Muraki, and Morikawa, 2004) using the WAKACHI2002 v8.0 format proposed by Miyata (2018) and provided with morphological tags in JMOR08 format (Miyata and Naka, 2014).
The Soyogo Treebank adds layers of syntactic information to the morphological analysis data from CHILDES following The Kusunoki Treebank (Kainoki, 2022). The result is a corpus with syntactic trees over morphological information, amounting to a full morpho-syntactic analysis of the child language data.
The Soyogo Treebank is associated with a powerful user interface that enables search using virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is being updated to reflect improvements in analysis.
Presentations of research results using The Soyogo Treebank should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):
Butler, Alastair, Susanne Miyata and Yumiko Kinjo (2022) “The Soyogo Treebank – a parsed corpus of child Japanese” https://soyogo.github.io (accessed 9 January 2022).
This work is licensed under a Creative Commons Attribution 4.0 International License.