How to replace a word with its most representative mention using the Stanford CoreNLP coreference module

I am trying to find a way to rewrite sentences by "resolving" their coreferences (replacing words with the mentions they refer to) using the Stanford CoreNLP coreference module.

The idea is to rewrite a sentence like the following:

John drove to Judy's house. He made her dinner.

into

John drove to Judy's house. John made Judy dinner.

Here's the code I've been fooling around with:

    private void doTest(String text) {
        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        Map<Integer, CorefChain> corefs = doc.get(CorefChainAnnotation.class);
        List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);

        List<String> resolved = new ArrayList<String>();

        for (CoreMap sentence : sentences) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);

            for (CoreLabel token : tokens) {
                Integer corefClustId = token.get(CorefCoreAnnotations.CorefClusterIdAnnotation.class);
                System.out.println(token.word() + " --> corefClusterID = " + corefClustId);

                CorefChain chain = corefs.get(corefClustId);
                System.out.println("matched chain = " + chain);

                if (chain == null) {
                    resolved.add(token.word());
                } else {
                    int sentINdx = chain.getRepresentativeMention().sentNum - 1;
                    CoreMap corefSentence = sentences.get(sentINdx);
                    List<CoreLabel> corefSentenceTokens = corefSentence.get(TokensAnnotation.class);

                    String newwords = "";
                    CorefMention reprMent = chain.getRepresentativeMention();
                    System.out.println(reprMent);
                    for (int i = reprMent.startIndex; i < reprMent.endIndex; i++) {
                        CoreLabel matchedLabel = corefSentenceTokens.get(i - 1); //resolved.add(tokens.get(i).word());
                        resolved.add(matchedLabel.word());
                        newwords += matchedLabel.word() + " ";
                    }

                    System.out.println("converting " + token.word() + " to " + newwords);
                }

                System.out.println();
                System.out.println();
                System.out.println("-----------------------------------------------------------------");
            }
        }

        String resolvedStr = "";
        System.out.println();
        for (String str : resolved) {
            resolvedStr += str + " ";
        }
        System.out.println(resolvedStr);
    }

The best output I have achieved so far is

John drove to Judy 's 's Judy 's house . John made Judy 's her dinner .

which isn't very close...

I'm pretty sure there's a simpler way to do what I'm trying to achieve.

Ideally, I'd like to rebuild the sentence as a list of CoreLabels so that I can keep the other data attached to them.

Any help is appreciated.

Answer 1

The challenge is to make sure that the token is not part of its own representative mention. For example, the token "Judy" has "Judy 's" as its representative mention, so if you replace it inside the phrase "Judy 's", you end up with a double "'s".

You can check whether the token is part of its representative mention by comparing indices. Replace the token only if its index is smaller than the startIndex of the representative mention or larger than its endIndex. Otherwise, just keep the token.

The relevant part of your code will then look like this:

            if (token.index() < reprMent.startIndex || token.index() > reprMent.endIndex) {
                for (int i = reprMent.startIndex; i < reprMent.endIndex; i++) {
                    CoreLabel matchedLabel = corefSentenceTokens.get(i - 1);
                    resolved.add(matchedLabel.word());
                    newwords += matchedLabel.word() + " ";
                }
            } else {
                resolved.add(token.word());
            }
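
One thing the index comparison alone does not cover: startIndex and endIndex count tokens within the representative mention's own sentence, so a token in a different sentence can land inside that range by coincidence (for example, a pronoun at index 1 of the second sentence when the mention starts at index 1 of the first). Below is a minimal sketch of a check that also compares sentence numbers. `MentionSpan` and `contains` are hypothetical names, not CoreNLP API, and the sketch assumes 1-based sentence and token indices with an exclusive end index, which is what the `i < reprMent.endIndex` loop and the `i - 1` lookup imply.

```java
/** Hypothetical stand-in for the three fields of a representative mention used in this thread. */
final class MentionSpan {
    final int sentNum;     // 1-based sentence number (assumed)
    final int startIndex;  // 1-based index of the first token of the mention
    final int endIndex;    // 1-based index just past the last token (exclusive, assumed)

    MentionSpan(int sentNum, int startIndex, int endIndex) {
        this.sentNum = sentNum;
        this.startIndex = startIndex;
        this.endIndex = endIndex;
    }

    /** True only when the token is in the same sentence and inside [startIndex, endIndex). */
    boolean contains(int tokenSentNum, int tokenIndex) {
        return tokenSentNum == sentNum
                && tokenIndex >= startIndex
                && tokenIndex < endIndex;
    }
}
```

In the loop, a token would be replaced when `contains(...)` returns false and kept otherwise. The token's own sentence number would have to come from the enclosing sentence loop, for instance a counter, which is another assumption of this sketch.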

In addition, to speed things up, you can replace your first if-condition with:

    if (chain == null || chain.getMentionsInTextualOrder().size() == 1)

After all, if the coreference chain has a length of 1, there is no point in looking for a representative mention.
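
On the question's wish to keep CoreLabels instead of plain strings: the loops in this thread only copy a range of tokens out of a sentence, so the same logic can work on a list of token objects. A minimal generic sketch follows. `SpanCopy` and `appendSpan` are hypothetical names, not part of CoreNLP, and the indices are assumed to be 1-based with an exclusive end, as in the loops above.

```java
import java.util.List;

final class SpanCopy {
    private SpanCopy() {}

    /**
     * Appends the tokens at 1-based positions [startIndex, endIndex) of source
     * to target, keeping the token objects themselves rather than their text.
     */
    static <T> void appendSpan(List<T> target, List<? extends T> source,
                               int startIndex, int endIndex) {
        // subList takes 0-based bounds with an exclusive end, hence the -1 on both.
        target.addAll(source.subList(startIndex - 1, endIndex - 1));
    }
}
```

With `T` as CoreLabel, `resolved` would become a `List<CoreLabel>`. The copied elements are the same objects as in the source sentence, not clones, so an annotation changed on one is visible through the other.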

Answer 2

    private void doTest(String text) {
        // dcoref depends on all of the preceding annotators in this list.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        Map<Integer, CorefChain> corefs = doc.get(CorefChainAnnotation.class);
        List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);

        List<String> resolved = new ArrayList<String>();

        for (CoreMap sentence : sentences) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);

            for (CoreLabel token : tokens) {
                Integer corefClustId = token.get(CorefCoreAnnotations.CorefClusterIdAnnotation.class);
                System.out.println(token.word() + " --> corefClusterID = " + corefClustId);

                CorefChain chain = corefs.get(corefClustId);
                System.out.println("matched chain = " + chain);

                if (chain == null) {
                    resolved.add(token.word());
                    System.out.println("Adding the same word " + token.word());
                } else {
                    int sentINdx = chain.getRepresentativeMention().sentNum - 1;
                    System.out.println("sentINdx :" + sentINdx);
                    CoreMap corefSentence = sentences.get(sentINdx);
                    List<CoreLabel> corefSentenceTokens = corefSentence.get(TokensAnnotation.class);
                    String newwords = "";
                    CorefMention reprMent = chain.getRepresentativeMention();
                    System.out.println("reprMent :" + reprMent);
                    System.out.println("Token index " + token.index());
                    System.out.println("Start index " + reprMent.startIndex);
                    System.out.println("End Index " + reprMent.endIndex);

                    // Keep the token only when its index lies strictly between startIndex
                    // and endIndex; otherwise substitute the representative mention.
                    if (token.index() <= reprMent.startIndex || token.index() >= reprMent.endIndex) {
                        for (int i = reprMent.startIndex; i < reprMent.endIndex; i++) {
                            CoreLabel matchedLabel = corefSentenceTokens.get(i - 1);
                            // Strip the possessive marker so it is not duplicated.
                            resolved.add(matchedLabel.word().replace("'s", ""));
                            System.out.println("matchedLabel : " + matchedLabel.word());
                            newwords += matchedLabel.word() + " ";
                        }
                    } else {
                        resolved.add(token.word());
                        System.out.println("token.word() : " + token.word());
                    }

                    System.out.println("converting " + token.word() + " to " + newwords);
                }

                System.out.println();
                System.out.println();
                System.out.println("-----------------------------------------------------------------");
            }
        }

        String resolvedStr = "";
        System.out.println();
        for (String str : resolved) {
            resolvedStr += str + " ";
        }
        System.out.println(resolvedStr);
    }

Great answer.

John drove to Judy's house. He made her dinner. -----> John drove to Judy's house. John made Judy dinner.

Tom is a smart boy. He knows a lot. -----> Tom is a smart Tom. Tom knows a lot.
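
The code in this thread builds the final string by appending a space after every token, so punctuation and clitics such as "'s" come out detached from the preceding word. A minimal sketch of a join step that attaches them is shown below. `TokenJoiner` and `join` are hypothetical names, and the attachment rule (single-character closing punctuation, or any token starting with an apostrophe) is a deliberate simplification, not CoreNLP's own detokenization.

```java
import java.util.List;

final class TokenJoiner {
    private TokenJoiner() {}

    /** Joins tokens with single spaces, attaching punctuation and clitics to the previous token. */
    static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            // Skip empty entries, e.g. what remains of "'s" after replace("'s", "").
            if (tok == null || tok.isEmpty()) {
                continue;
            }
            boolean closingPunct = tok.length() == 1 && ".,;:!?".indexOf(tok.charAt(0)) >= 0;
            boolean clitic = tok.startsWith("'");
            if (sb.length() > 0 && !closingPunct && !clitic) {
                sb.append(' ');
            }
            sb.append(tok);
        }
        return sb.toString();
    }
}
```

Calling `TokenJoiner.join(resolved)` in place of the final concatenation loop would be one way to use it. An opening single quote would be attached to the previous word under this rule, so real text may need a fuller rule set.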

Answer 3

    Integer corefClustId = token.get(CorefCoreAnnotations.CorefClusterIdAnnotation.class);

For me, the line above returns null, and I don't understand why. Can anyone help me, please?