A n-grams collection extracted from the Portuguese Web

Name: A n-grams collection extracted from the Portuguese Web
Creator: Arquivo.pt - pesquise páginas do passado
License: http://www.opendefinition.org/licenses/cc-zero
Keywords: n-grams-portuguese

Description

The n-grams collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After the extraction, all n-grams containing tokens longer than 32 characters were discarded. N-grams with frequencies below 5 were discarded as well. The n-grams collection is available as a set of UTF-8 encoded files containing the n-grams and their frequencies (2010-11-10).
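The steps described above can be illustrated with a minimal Python sketch. It is not the original extraction pipeline: the tokenizer pattern `TOKEN_RE` and the function name `extract_ngrams` are assumptions made for illustration, since the regular expressions actually used are not published here. Only the three thresholds (order up to 5, token length of at most 32 characters, frequency of at least 5) come from the description.

```python
import re
from collections import Counter

# Assumption: a simple Unicode word pattern stands in for the
# (unpublished) set of regular expressions used by the authors.
TOKEN_RE = re.compile(r"\w+")

MAX_ORDER = 5        # word n-grams up to the fifth order
MAX_TOKEN_LEN = 32   # n-grams with longer tokens are discarded
MIN_FREQ = 5         # n-grams with frequency below 5 are discarded


def extract_ngrams(texts, max_order=MAX_ORDER,
                   max_token_len=MAX_TOKEN_LEN, min_freq=MIN_FREQ):
    """Count word n-grams in `texts`, then apply both filters."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text)
        for n in range(1, max_order + 1):
            # Slide a window of n tokens over the token sequence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Filtering happens after extraction, as in the description.
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= min_freq and all(len(t) <= max_token_len for t in ngram)
    }
```

Calling `extract_ngrams(documents)` on an iterable of strings returns a dictionary that maps each surviving n-gram (a tuple of tokens) to its frequency.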

This collection was built by David Batista, winner of 2nd place in the Arquivo.pt award 2021 with the work Politiquices.pt.

Producer

Arquivo.pt - pesquise páginas do passado

Latest update

29 August 2023

License

Creative Commons CCZero

Metadata quality

100.0/100

1 Main file

a-n-grams-collection-extractet-from-the-portuguese-web-dataverse-files.zip

Updated on 29 August 2023

zip (1.5GB)

URL: https://dados.gov.pt/s/resources/a-n-grams-collection-extracted-from-the-portuguese-web/20230829-160321/a-n-grams-collection-extractet-from-the-portuguese-web-dataverse-files.zip
Permalink: https://dados.gov.pt/es/datasets/r/4d2385f1-e094-4ec4-b988-31c0dc9dc383
sha1: 1f9cf8e2e44d180d1c72e4612de055edc6fa32e8
MIME Type: application/zip

Created on: 29 August 2023
Modified on: 29 August 2023

Size: 1.5GB
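
Because the archive is large, it can be worth checking a downloaded copy against the sha1 value listed above. The following is a minimal sketch using only the Python standard library; the function names `sha1_of_file` and `matches_published_sha1` are hypothetical, and the local file path is whatever location the zip was saved to.

```python
import hashlib

# sha1 value published on this page for the zip file.
EXPECTED_SHA1 = "1f9cf8e2e44d180d1c72e4612de055edc6fa32e8"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read in chunks so the 1.5GB archive is never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha1(path, expected=EXPECTED_SHA1):
    """True if the file at `path` has the expected SHA-1 digest."""
    return sha1_of_file(path) == expected.lower()
```

A result of `False` from `matches_published_sha1` would suggest an incomplete or corrupted download, or that the file on the portal has since been replaced.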

Information

ID

64ee072ff1b5a534ce7a4ed3

Temporality

Creation

29 August 2023

Frequency

Punctual

Temporal coverage

1996/01/01 to 2022/12/02

Latest update

29 August 2023

Spatial coverage

Territorial coverage

Portugal

Territorial coverage granularity

Country
