News search is one of the most important Internet user activities. For a commercial news search engine, it is critical to provide users with the most relevant and fresh ranking results. Furthermore, related news articles need to be grouped so that users can browse search results in terms of news stories rather than individual news articles. This chapter describes a few algorithms for news search engines, including ranking algorithms and clustering algorithms. For the ranking problem, the main challenge is achieving an appropriate balance between topical relevance and freshness. For the clustering problem, the main challenge is grouping related news articles into clusters in a scalable manner. We begin by introducing a few news search ranking approaches, including a learning-to-rank approach (Section 2.1) and a joint learning approach from clickthroughs (Section 2.2). We then describe a scalable clustering approach to group news search results (Section 2.3).
Keywords
News search
freshness
relevance
clustering
temporal features
The main challenge for ranking in news search is striking an appropriate balance between two factors: relevance and freshness. Here, relevance includes both topical relevance and news source authority.
A widely adopted approach in practice is to use a simple formula to combine relevance and freshness. For example, the final ranking score for a news article can be computed as
$$\mathrm{score}_{\mathrm{final}}(q, d) = \mathrm{score}_{\mathrm{relevance}}(q, d) \cdot e^{-\beta \cdot \mathrm{age}_d} \qquad (2.1)$$
where $\mathrm{score}_{\mathrm{relevance}}(q, d)$ is the value representing the relevance between query $q$ and news article $d$, $\mathrm{age}_d$ is the news article age, and $e^{-\beta \cdot \mathrm{age}_d}$ is a time decay term: the older a news article is, the larger the penalty it receives in its final ranking score. The parameter $\beta$ controls the relative importance of freshness in the final ranking result. In the information retrieval literature, document usually refers to a candidate item in a ranking task. In this chapter, we use the terms document and news article interchangeably because the application here is ranking news articles in a search.
The advantage of such a heuristic combination of relevance and freshness is its efficiency in practice: only the value of the parameter $\beta$ needs to be tuned, using some ranking examples. Furthermore, an appropriate value of $\beta$ often leads to good ranking results for many queries, which also makes this approach effective.
The drawback of this approach is that it leaves little room for further improving ranking performance, because such a heuristic rule is too naive to handle more complicated ranking cases. For example, in (2.1), time decay is represented by the term $e^{-\beta \cdot \mathrm{age}_d}$, which depends only on the document age. In fact, an appropriate time decay factor should also depend on the nature of the query, since different queries have different time sensitivities. If a query is related to breaking news, such as an earthquake that has just happened and is receiving extensive media coverage of casualties and rescue efforts, then freshness should be very important, because even a document published only one hour ago could be outdated. On the other hand, if a query is for an event that happened weeks ago, then relevance is more important in ranking because the user would like to find the most relevant and comprehensive reports in the search results.
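The heuristic combination discussed above can be illustrated with a short Python sketch. It assumes the time decay term takes the exponential form $e^{-\beta \cdot \mathrm{age}}$; the function names, the two sample articles, their scores, and the values of `beta` are hypothetical and chosen only to show how the single parameter shifts the balance between relevance and freshness.

```python
import math


def combined_score(relevance: float, age_hours: float, beta: float) -> float:
    """Relevance score multiplied by an (assumed) exponential time-decay term."""
    return relevance * math.exp(-beta * age_hours)


def rank(articles, beta):
    """Return article names ordered by descending combined score."""
    ordered = sorted(
        articles,
        key=lambda a: combined_score(a[1], a[2], beta),
        reverse=True,
    )
    return [name for name, _relevance, _age in ordered]


# Hypothetical articles: (name, relevance score, age in hours).
articles = [
    ("in_depth_report", 0.9, 48.0),
    ("fresh_update", 0.6, 1.0),
]

# A small beta lets relevance dominate; a larger beta favors fresh articles.
print(rank(articles, beta=0.001))
print(rank(articles, beta=0.05))
```

Because `beta` is a single global value, the same decay rate applies to every query, which is exactly the limitation described above: a breaking-news query and a query about an event from weeks ago would need different settings.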
Many prior works have exploited the temporal dimension in searches. For example, Baeza-Yates et al. [22] studied the relation among Web dynamics, structure, and page quality and demonstrated that PageRank is biased against new pages. In T-Rank Light and T-Rank algorithms [25], both activity (i.e., update rates) and freshness (i.e., timestamps of most recent updates) of pages and links are taken into account in link analysis. Cho et al. [66] proposed a page quality ranking function in order to alleviate the problem of popularity-based ranking, and they used the derivatives of PageRank to forecast future PageRank values for new pages. Nunes [269] proposed to improve Web information retrieval in the temporal dimension by combining the temporal features extracted from both individual documents and the whole Web. Pandey et al. [276] studied the tradeoff between new page exploration and high-quality page exploitation, which is based on a ranking method to randomly promote some new pages so that they can accumulate links quickly.
The temporal dimension is also considered in other information retrieval applications. Del Corso et al. [94] proposed a ranking framework to model news article generation, topic clustering, and story evolution over time; this ranking algorithm takes publication time and linkage time into consideration as well as news source authority. Li et al. [221] proposed the TS-Rank algorithm, which considers page freshness in the stationary probability distribution of Markov chains, since the dynamics of Web pages are also important for ranking. This method proves effective in the application of publication search. Pasca [277] used temporal expressions to improve question-answering results for time-related questions. Answers are obtained by aggregating matching pieces of information and the temporal expressions they contain. Furthermore, Arikan et al. [20] incorporated temporal expressions into a language model and demonstrated experimental improvement in retrieval effectiveness.
Recency query classification plays an important role in recency ranking. Diaz [98] determined the newsworthiness of a query by predicting the probability of a user clicking on the news display of a query. König et al. [204] estimated the clickthrough rate for dedicated news search results with a supervised model designed to adapt quickly to emerging news events.
Learning-to-rank algorithms have shown significant and consistent success in various applications [226,184,406,54]. Such machine-learned ranking algorithms learn a ranking mechanism by optimizing particular loss functions based on editorial annotations. An important assumption in those learning methods is that document relevance for a given query is generally stationary over time, so that, as long as the coverage of the labeled data is broad enough, the learned ranking functions would generalize well to future unseen data. Such an assumption is often true in Web searches, but it is less likely to hold in news searches because of the dynamic nature of news events and the lack of timely annotations.
A typical procedure is as follows:
• Collect query-URL pairs.
• Ask editors to label the query-URL pairs with relevance grades.
• Apply a learning-to-rank algorithm to train the ranking model.
Traditionally, in learning-to-rank, editors label query-URL pairs with relevance grades, which usually take four or five values, such as perfect, excellent, good, fair, and bad. Editorial labels are used for both ranking model training and ranking model evaluation. For training, these relevance grades are directly mapped to numeric values as learning targets.
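As a minimal sketch of the grade-to-target mapping just described, the Python snippet below turns editorial judgments into (features, target) training pairs. The numeric values assigned to each grade, the feature names, and the sample judgments are all hypothetical; the text does not specify the actual mapping.

```python
# Hypothetical mapping from editorial grades to numeric learning targets.
GRADE_TO_TARGET = {
    "perfect": 4.0,
    "excellent": 3.0,
    "good": 2.0,
    "fair": 1.0,
    "bad": 0.0,
}


def build_training_set(judgments):
    """Convert (query, url, features, grade) tuples into (features, target) pairs."""
    return [
        (features, GRADE_TO_TARGET[grade])
        for _query, _url, features, grade in judgments
    ]


# Hypothetical labeled query-URL pairs with made-up feature values.
judgments = [
    ("earthquake", "http://example.com/a", {"text_match": 0.8, "age_hours": 1.0}, "excellent"),
    ("earthquake", "http://example.com/b", {"text_match": 0.4, "age_hours": 30.0}, "fair"),
]

training_set = build_training_set(judgments)
print(training_set)
```

The resulting pairs can then be passed to whichever learning-to-rank algorithm is used to train the ranking model.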
For evaluation, we desire a metric that supports graded judgments and penalizes errors near the beginning of the ranked list. In this work, we use DCG [175],
$$\mathrm{DCG}_n = \sum_{i=1}^{n} \frac{G_i}{\log_2(i+1)} \qquad (2.2)$$
where $i$ is the position in the document list, and $G_i$ is a function of the relevance grade of the document at position $i$. Because the range of DCG values is not consistent across queries, we adopt the NDCG as our primary ranking metric,
$$\mathrm{NDCG}_n = Z_n \sum_{i=1}^{n} \frac{G_i}{\log_2(i+1)} \qquad (2.3)$$
where $Z_n$ is a normalization factor, which is used to make the NDCG of the ideal list be 1. We can use $\mathrm{NDCG}_n$ at selected cutoff positions $n$ to evaluate the ranking results.
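The two metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact evaluation code behind the chapter: it uses the standard position discount $1/\log_2(i+1)$, takes the gain to be $2^{\text{grade}} - 1$ (a common choice that the text does not confirm), and computes the normalization factor from the ideal reordering of the same list. The grade values in the example are hypothetical.

```python
import math


def dcg(gains, n):
    """Discounted cumulative gain over the top-n positions (positions start at 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:n], start=1))


def ndcg(gains, n):
    """DCG divided by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg(sorted(gains, reverse=True), n)
    return dcg(gains, n) / ideal if ideal > 0 else 0.0


def gain(grade):
    """Assumed gain function: 2^grade - 1."""
    return 2 ** grade - 1


# Hypothetical numeric grades of the documents in ranked order.
grades = [3, 2, 3, 0, 1]
gains = [gain(g) for g in grades]

print(dcg(gains, 5))
print(ndcg(gains, 5))
```

Dividing by the ideal DCG plays the role of the normalization factor: a perfectly ordered list scores 1, which makes values comparable across queries with different numbers of relevant documents.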
We extend the learning-to-rank approach to news search with two main modifications, both motivated by the dynamic nature of news search: (1) training sample collection and (2) the editorial labeling guideline.
For news search, training samples have to be collected in near real time, through the following steps:
1. Sample the latest queries from the news search query log.
2. Immediately get the candidate URLs for the sampled queries.
3. Immediately ask editors to judge the query-URL pairs with relevance and freshness grades.
We can see that all the steps need to be accomplished in a short period. Therefore, the training sample collection has to be well planned in advance; otherwise, any delay during this procedure would affect the reliability of the collected data. If queries are sampled from an outdated query log or if all of the selected candidate URLs are outdated, they cannot represent the real data distribution. If editors do not label query-URL pairs on time, it will be difficult for them to provide accurate judgments, because editors’ judgments rely on their good understanding of the related news events, which becomes more difficult as time elapses.
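One way to guard against the delays described above is to record timestamps during collection and discard samples that were labeled too late. The Python sketch below is purely illustrative: the `Sample` fields, the six-hour threshold, and the example timestamps are assumptions, not part of the procedure described in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical limit on how long after the query a judgment is still trusted.
MAX_LABEL_DELAY = timedelta(hours=6)


@dataclass
class Sample:
    query: str
    url: str
    query_time: datetime   # when the query was sampled from the log
    labeled_at: datetime   # when the editor submitted the judgment


def is_timely(sample: Sample, max_delay: timedelta = MAX_LABEL_DELAY) -> bool:
    """True if the judgment was made within max_delay of the query time."""
    return sample.labeled_at - sample.query_time <= max_delay


def keep_timely(samples):
    """Filter out samples whose labels arrived after the allowed delay."""
    return [s for s in samples if is_timely(s)]


t0 = datetime(2024, 1, 1, 9, 0)
samples = [
    Sample("earthquake", "http://example.com/a", t0, t0 + timedelta(hours=2)),
    Sample("earthquake", "http://example.com/b", t0, t0 + timedelta(days=3)),
]

print([s.url for s in keep_timely(samples)])
```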
In a news search, editors should provide query-URL grades on both traditional relevance and freshness. Although document age is usually available in news searches, it is impossible to determine a...