An n-gram collection extracted from the Portuguese Web

Description

The n-gram collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After extraction, all n-grams containing a token longer than 32 characters were discarded, as were n-grams with a frequency below 5. The collection is available as a set of UTF-8 encoded files containing the n-grams and their frequencies (2010-11-10).
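The extraction steps described above can be illustrated with a minimal Python sketch. It is not the original pipeline: the tokenizing pattern `TOKEN_RE`, the function names, and the choice to count every order from 1 to 5 are assumptions made for illustration, since the actual regular expressions are not given here. Only the two filtering thresholds (32 characters, frequency 5) come from the description.

```python
import re
from collections import Counter

# Assumption: a simple word-character pattern stands in for the
# unpublished set of tokenizing regular expressions.
TOKEN_RE = re.compile(r"\w+")

MAX_ORDER = 5       # n-grams up to the fifth order (from the description)
MAX_TOKEN_LEN = 32  # n-grams with longer tokens are discarded
MIN_FREQ = 5        # n-grams with lower frequencies are discarded


def tokenize(text):
    """Split text into word tokens (hypothetical tokenizer)."""
    return TOKEN_RE.findall(text)


def count_ngrams(documents, max_order=MAX_ORDER):
    """Count word n-grams of order 1..max_order over all documents."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_ngrams(counts, max_token_len=MAX_TOKEN_LEN, min_freq=MIN_FREQ):
    """Keep n-grams that meet the frequency and token-length limits."""
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= min_freq
        and all(len(token) <= max_token_len for token in ngram)
    }


if __name__ == "__main__":
    docs = ["o gato preto"] * 5 + ["o cão"]
    kept = filter_ngrams(count_ngrams(docs))
    for ngram, freq in sorted(kept.items()):
        print(" ".join(ngram), freq)
```

Counting is kept separate from filtering so that the thresholds can be changed without re-reading the documents.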

This collection was built by David Batista, who won 2nd place in the Arquivo.pt award 2021 with the work Politiquices.pt.

Related publication: https://www.davidsbatista.net/assets/documents/publications/WPT05_fala2010.pdf
Also published at Harvard Dataverse

Producer

Latest update

29 August 2023

License

Creative Commons CCZero

Metadata quality
100.0/100


Information

Tags

ID

64ee072ff1b5a534ce7a4ed3

Temporality

Creation

29 August 2023

Frequency

Punctual

Temporal coverage

1996/01/01 to 2022/12/02

Latest update

29 August 2023

Spatial coverage

Territorial coverage

Portugal

Territorial coverage granularity

Country
