The basic idea of this project is to show a question-answer pair from an FAQ (frequently asked questions) page when it contains the search terms. The pair appears in a special layout on the search engine results page, with the question and the answer shown separately. The project was an attempt to give an explicit answer to the user's question as expressed in the query.
The program had two parts: marking up the positions of pairs in a document, and selecting a fragment from the page. The markup step runs while the search base is being prepared and works by analyzing the document's DOM tree. The analysis is mostly heuristic, but a machine-learned SVM classifier was used to assess question quality. Implementation was hard because FAQ pages on the internet are so inconsistent: they contain mistakes and are written in many different styles. The algorithm had to account for all of this diversity in markup variants and errors. As a trade-off, markup recall is low: not all pairs in a document get marked.
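The markup step can be illustrated with a minimal sketch. It is not the project's actual algorithm: it assumes a single hypothetical heuristic (a heading or `dt` element whose text ends with "?" is a question, and the text up to the next heading is its answer), and the `endswith("?")` check merely stands in for the SVM question-quality classifier. The names `FaqPairExtractor`, `QUESTION_TAGS` and `extract_pairs` are invented for this example.

```python
from html.parser import HTMLParser

# Assumption: only these tags are considered question candidates.
QUESTION_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dt"}


class FaqPairExtractor(HTMLParser):
    """Collects (question, answer) pairs with one simple heuristic."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # marked (question, answer) pairs
        self._depth = 0        # nesting depth inside candidate tags
        self._candidate = []   # text of the candidate being read
        self._question = None  # question waiting for its answer
        self._answer = []      # answer text collected so far

    def handle_starttag(self, tag, attrs):
        if tag in QUESTION_TAGS:
            if self._depth == 0:
                self._candidate = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in QUESTION_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._candidate).split())
                # Any heading closes the previous pair.
                self.flush_pair()
                # Stand-in for the question-quality classifier.
                self._question = text if text.endswith("?") else None

    def handle_data(self, data):
        if self._depth > 0:
            self._candidate.append(data)
        elif self._question is not None:
            self._answer.append(data)

    def flush_pair(self):
        if self._question is not None:
            answer = " ".join(" ".join(self._answer).split())
            if answer:  # questions without an answer are not marked
                self.pairs.append((self._question, answer))
        self._question = None
        self._answer = []


def extract_pairs(html):
    parser = FaqPairExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_pair()  # emit the last open pair, if any
    return parser.pairs
```

A sketch like this skips any page that styles its questions differently (bold text, tables, plain paragraphs), which is one way the low recall mentioned above can arise.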
Fragment selection also had two parts: selecting a pair, and selecting a fragment from that pair. Pair selection used a combination of features such as BM25, the proximity of query terms, and the sum of the IDF of query terms. When selecting the fragment, we had to exclude greetings from the beginning of the text and trim the text at punctuation, excluding stop words.
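The two selection steps can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the linear combination and its weights, the BM25 parameters `k1` and `b`, the shortest-window proximity measure, the `GREETING` pattern, the `STOP_WORDS` list, and the reading of "trimming on punctuation" as cutting at the last punctuation mark that fits and then dropping trailing stop words. The original system may have combined and tuned these features differently.

```python
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def idf(term, docs):
    # BM25-style IDF; the "1 +" keeps the value non-negative.
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

def bm25(query, doc, docs, k1=1.2, b=0.75):
    avg_len = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        if tf:
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf(term, docs) * tf * (k1 + 1) / norm
    return score

def matched_idf_sum(query, doc, docs):
    return sum(idf(t, docs) for t in set(query) if t in doc)

def proximity(query, doc):
    # Matched terms divided by the shortest window that contains them all.
    wanted = {t for t in query if t in doc}
    if len(wanted) < 2:
        return 0.0
    best = len(doc)
    for start, tok in enumerate(doc):
        if tok not in wanted:
            continue
        seen = set()
        for end in range(start, len(doc)):
            if doc[end] in wanted:
                seen.add(doc[end])
                if len(seen) == len(wanted):
                    best = min(best, end - start + 1)
                    break
    return len(wanted) / best

def rank_pairs(query_text, pairs, weights=(1.0, 0.5, 0.5)):
    # Hypothetical weights for BM25, proximity and IDF sum.
    query = tokenize(query_text)
    docs = [tokenize(q + " " + a) for q, a in pairs]
    w_bm25, w_prox, w_idf = weights
    scored = [(w_bm25 * bm25(query, doc, docs)
               + w_prox * proximity(query, doc)
               + w_idf * matched_idf_sum(query, doc, docs), pair)
              for pair, doc in zip(pairs, docs)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

GREETING = re.compile(r"(hello|hi|dear|good day)\b", re.IGNORECASE)  # assumed
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on"}  # assumed

def select_fragment(answer, max_chars=160):
    # Drop a leading sentence that opens with a greeting.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    if len(sentences) > 1 and GREETING.match(sentences[0]):
        sentences = sentences[1:]
    text = " ".join(sentences)
    if len(text) <= max_chars:
        return text
    # Cut at the last punctuation mark that fits, else at the last space.
    cut = text[:max_chars]
    marks = [m.end() for m in re.finditer(r"[.!?;,:]", cut)]
    cut = cut[:marks[-1]] if marks else cut.rsplit(" ", 1)[0]
    # Do not let the fragment end on a stop word.
    words = cut.rstrip(" ,;:").split()
    while words and words[-1].lower().strip(".,;:!?") in STOP_WORDS:
        words.pop()
    return " ".join(words)
```

In this sketch `rank_pairs` returns `(score, pair)` tuples sorted best first, and `select_fragment` would then be applied to the answer of the top pair.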