| Title | Chunkwise Text-File Processing for 'dplyr' |
|---|---|
| Description | Data stored in a text file can be processed chunkwise using 'dplyr' commands. These are recorded and executed per data chunk, so large files can be processed with limited memory, using the 'LaF' package. |
| Authors | Edwin de Jonge [aut, cre] |
| Maintainer | Edwin de Jonge <[email protected]> |
| License | GPL-2 |
| Version | 0.6.0 |
| Built | 2024-11-16 04:04:21 UTC |
| Source | https://github.com/edwindj/chunked |
R is a great tool, but processing large text files with data is cumbersome. chunked helps you to process large text files with dplyr while loading only a part of the data in memory. It builds on the excellent R package LaF. Processing commands are written in dplyr syntax, and chunked (using LaF) takes care that the file is processed chunk by chunk, using far less memory than loading it whole. chunked is useful for selecting columns, mutating columns and filtering rows, and can be used in data pre-processing.
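A minimal sketch of the basic pattern, with hypothetical file names: verbs are recorded on the chunkwise object and only executed, chunk by chunk, once the result is written or collected.

```r
library(chunked)
library(dplyr)

# Verbs are recorded, not executed yet...
flights <- read_chunkwise("flights.csv", chunk_size = 10000L) %>%
  select(year, month, dep_delay) %>%
  filter(dep_delay > 0)

# ...execution happens here, one chunk at a time.
write_chunkwise(flights, "delayed_flights.csv")
```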
The following dplyr verbs are implemented:

- filter
- select
- rename
- mutate
- transmute
- do
- left_join
- inner_join
- anti_join
- semi_join
- tbl_vars
- collect
The following operators are not implemented: because data in chunked is processed chunkwise, whole-table operations such as these are not available (the sketch after this list illustrates why).

- full_join
- right_join
- group_by
- arrange
- tail
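A minimal sketch of the limitation, with a hypothetical file name: each chunk is processed independently, so verbs that must see all rows at once have no chunkwise method.

```r
library(chunked)
library(dplyr)

# Row-wise verbs work: filter() only needs the rows of the current chunk.
positive <- read_chunkwise("large.csv") %>%
  filter(x > 0)

# Whole-table verbs do not: arrange() would have to sort across chunk
# boundaries. No chunkwise method exists, so the line below would fail.
# sorted <- read_chunkwise("large.csv") %>% arrange(x)
```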
insert_chunkwise_into can be used to insert chunks of data into a database. Typically chunked is used for preprocessing data before adding it to a database.
```r
insert_chunkwise_into(x, dest, table, temporary = FALSE, analyze = FALSE)
```
| Argument | Description |
|---|---|
| x | tbl_chunk object |
| dest | database destination, e.g. src_sqlite() |
| table | name of the table |
| temporary | Should the table be removed when the database connection is closed? |
| analyze | Should the table be analyzed after import? |

Returns a tbl object pointing to the table in the database dest.
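A minimal sketch, assuming the RSQLite package is installed; the database and table names are illustrative:

```r
library(chunked)
library(dplyr)

# Write demo data to a csv file.
in_file <- file.path(tempdir(), "women.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

# Preprocess the file chunkwise and insert the result into a SQLite table.
db <- src_sqlite(file.path(tempdir(), "women.db"), create = TRUE)
women_db <- read_chunkwise(in_file) %>%
  mutate(ratio = weight / height) %>%
  insert_chunkwise_into(db, "women_ratio")

head(women_db)  # a tbl pointing to the new database table
```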
Read chunkwise from a data source
```r
read_chunkwise(src, chunk_size = 10000L, ...)

## S3 method for class 'character'
read_chunkwise(
  src,
  chunk_size = 10000L,
  format = c("csv", "csv2", "table"),
  stringsAsFactors = FALSE,
  ...
)

## S3 method for class 'laf'
read_chunkwise(src, chunk_size = 10000L, ...)

## S3 method for class 'tbl_sql'
read_chunkwise(src, chunk_size = 10000L, ...)
```
| Argument | Description |
|---|---|
| src | source to read from |
| chunk_size | size of the chunks |
| ... | parameters used by specific classes |
| format | used for specifying the type of text file |
| stringsAsFactors | Should character columns be converted to factors? |

Returns an object of type tbl_chunk.
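For illustration, a hedged sketch of the character and laf methods; the file name and column types are hypothetical:

```r
library(chunked)

# character method: 'format' selects the text flavour
tbl <- read_chunkwise("data.csv", chunk_size = 5000L, format = "csv")

# laf method: reuse an existing LaF connection
laf <- LaF::laf_open_csv("data.csv", column_types = c("double", "double"))
tbl <- read_chunkwise(laf, chunk_size = 5000L)
```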
read_csv_chunkwise will open a connection to a text file. Subsequent dplyr verbs and commands are recorded until collect or write_csv_chunkwise is called, at which point the recorded commands are executed chunk by chunk.
```r
read_csv_chunkwise(
  file,
  chunk_size = 10000L,
  header = TRUE,
  sep = ",",
  dec = ".",
  stringsAsFactors = FALSE,
  ...
)

read_csv2_chunkwise(
  file,
  chunk_size = 10000L,
  header = TRUE,
  sep = ";",
  dec = ",",
  ...
)

read_table_chunkwise(
  file,
  chunk_size = 10000L,
  header = TRUE,
  sep = " ",
  dec = ".",
  ...
)

read_laf_chunkwise(laf, chunk_size = 10000L)
```
| Argument | Description |
|---|---|
| file | path of text file |
| chunk_size | size of the chunks to be read |
| header | Does the csv file have a header with column names? |
| sep | field separator to be used |
| dec | decimal separator to be used |
| stringsAsFactors | Should character columns be converted to factors? |
| ... | not used |
| laf | laf object created using LaF |
read_csv_chunkwise is best combined with write_csv_chunkwise or insert_chunkwise_into (see the example below).
```r
# create csv file for demo purpose
in_file <- file.path(tempdir(), "in.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

women_chunked <-
  read_chunkwise(in_file) %>%              # open chunkwise connection
  mutate(ratio = weight / height) %>%
  filter(ratio > 2) %>%
  select(height, ratio) %>%
  inner_join(data.frame(height = 63:66))   # you can join with data.frames!

# no processing is done until the result is written (or collected)
out_file <- file.path(tempdir(), "processed.csv")
women_chunked %>%
  write_chunkwise(file = out_file)

head(women_chunked)  # works (without processing all data...)

iris_file <- file.path(tempdir(), "iris.csv")
write.csv(iris, iris_file, row.names = FALSE, quote = FALSE)

iris_chunked <-
  read_chunkwise(iris_file, chunk_size = 49) %>%  # 49 for demo purpose
  group_by(Species) %>%
  summarise(sepal_length = mean(Sepal.Length), n = n())  # note that mean is per chunk
```
Generic function to write chunk by chunk
```r
write_chunkwise(x, dest, ...)

## S3 method for class 'chunkwise'
write_chunkwise(
  x,
  dest,
  table,
  file = dest,
  format = c("csv", "csv2", "table"),
  ...
)
```
| Argument | Description |
|---|---|
| x | chunked input, e.g. created with read_chunkwise |
| dest | where the data should be written: a character file name or a database source |
| ... | parameters that will be passed to the specific implementations |
| table | table to write to; only used when dest is a database source |
| file | file to write to |
| format | specifies the text format written to disk; only used when dest is a file |
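No example accompanies this generic; here is a hedged sketch, under the assumption that a character dest is treated as a file and a database source as a table destination (as the arguments above suggest):

```r
library(chunked)
library(dplyr)

in_file <- file.path(tempdir(), "women.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

women_chunked <- read_chunkwise(in_file) %>%
  mutate(ratio = weight / height)

# dest as a file name: writes a text file in the chosen format
write_chunkwise(women_chunked, file.path(tempdir(), "women_out.csv"))

# dest as a database source: 'table' names the target table
# (assumes RSQLite is installed)
db <- src_sqlite(file.path(tempdir(), "women.db"), create = TRUE)
write_chunkwise(women_chunked, db, table = "women_ratio")
```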
Writes data to a csv file chunk by chunk. This function must be used in conjunction with read_csv_chunkwise: chunks of data will be read, processed and written when this function is called. For writing to a database use insert_chunkwise_into.
```r
write_csv_chunkwise(
  x,
  file = "",
  sep = ",",
  dec = ".",
  col.names = TRUE,
  row.names = FALSE,
  ...
)

write_csv2_chunkwise(
  x,
  file = "",
  sep = ";",
  dec = ",",
  col.names = TRUE,
  row.names = FALSE,
  ...
)

write_table_chunkwise(
  x,
  file = "",
  sep = "\t",
  dec = ".",
  col.names = TRUE,
  row.names = TRUE,
  ...
)
```
| Argument | Description |
|---|---|
| x | chunkwise object pointing to a text file |
| file | file to write to |
| sep | field separator |
| dec | decimal separator |
| col.names | should column names be written? |
| row.names | should row names be written? |
| ... | passed through to the underlying write function |
Returns a chunkwise object; when writing to a file it refers to the newly created file, otherwise to x.